.NET Forum / Languages / C# / March 2008
Bizarre benchmark result -- C# hundreds of times slower than Java?
|
|
Thread rating:  |
Michael A. Covington - 11 Mar 2008 15:21 GMT While asking some Java enthusiasts what they think about C#, I came across this:
http://www.manageability.org/blog/archive/20030520%23p_the_problem_with_cameron
Reportedly, the (essentially) same program in C# is much, much slower than in Java.
This is a program that is heavy on regexes (which I'm not an expert on) and am wondering if the C# version makes an elementary blunder. Do any experts want to have a look? See also comp.lang.java.
(Query: Is he compiling the regex once in Java, but every time through the loop in C#?)
Jon Skeet [C# MVP] - 11 Mar 2008 15:33 GMT > While asking some Java enthusiasts what they think about C#, I came across > this: [quoted text clipped - 10 lines] > (Query: Is he compiling the regex once in Java, but every time through the > loop in C#?) Looks like he's compiling it every time in both cases - which is stupid on both platforms.
One thing to note is that this test almost certainly says *nothing* about the performance of the languages. Instead, it is a benchmark of the regular expression implementations.
Oh, and it was created in 2003. Worth trying again, preferably having fixed the code to not be braindead.
 Signature Jon Skeet - <skeet@pobox.com> http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet World class .NET training in the UK: http://iterativetraining.co.uk
Michael A. Covington - 11 Mar 2008 17:13 GMT Thanks!
>> While asking some Java enthusiasts what they think about C#, I came >> across [quoted text clipped - 25 lines] > Oh, and it was created in 2003. Worth trying again, preferably having > fixed the code to not be braindead. Jon Skeet [C# MVP] - 11 Mar 2008 17:32 GMT > Thanks! Having looked again at the code, you can't actually compile a single regular expression and reuse it - they're creating a new regular expression on each iteration.
However, this shows:
1) You shouldn't compile the regex unless you're going to reuse it 2) You shouldn't use regexes for cases where simple string matches work just as well
It really is a completely bogus test. The claim that "You could build a spam filter, a RSS categorization engine or even a rule based message broker" is rubbish - because in those cases there would be a relatively low number of regular expressions, tested against a high number of messages. In those cases you *would* want to compile the regex, and you could easily get completely different performance out.
To make the C# version run comparably with the Java version (still slower, admittedly - a factor of 2 on my box) just take out RegexOptions.Compile.
 Signature Jon Skeet - <skeet@pobox.com> http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet World class .NET training in the UK: http://iterativetraining.co.uk
Lasse Vågsæther Karlsen - 11 Mar 2008 17:51 GMT >> While asking some Java enthusiasts what they think about C#, I came across >> this: [quoted text clipped - 20 lines] > Oh, and it was created in 2003. Worth trying again, preferably having > fixed the code to not be braindead. On my laptop the code never finishes as posted (at least not in the time I allowed it to run), but if I remove the RegexOptions.Compiled, it completes in about 6.5 seconds. I have no idea how the java program would fare on my machine so I won't vouch for its performance compared to the "equivalent" java program.
Let's see what the documentation says about .Compiled:
"Specifies that the regular expression is compiled to an assembly. This yields faster execution but increases startup time. This value should not be assigned to the Options property when calling the CompileToAssembly method."
Ok, so lets check what Pattern.compile does in java:
"Compiles the given regular expression into a pattern."
Hm, not saying much exactly but from docs and all the sources I can find, Pattern.compile(...) is the same as new Regex(...) in .NET.
So in .NET, it is compiled into a binary assembly running the same kind of code as though you built the state machine from scratch, and in java it builds the data structures.
Well, duh, this will take more time. There is no point in compiling the regular expression into an assembly if its to be executed once, like the code posted does.
This is the main problem with most benchmarks that try to prove one language better than another, they invariably fall short on some kind of user problem.
In any case, as Jon pointed out, this isn't a benchmark that tells you anything useful. If anything, it tells you that of the two programs, the java one completed, and the C# one didn't and ran a lot slower to boot. It doesn't say anything about what can be accomplished if you use the platform to its fullest, or at least avoid mistakes like this.
And besides, measuring the performance of C# would be a whole other task. What the program *attempts* to do is measure one part of the BCL against a similar one in Java.
And a rather bad attempt at that.
 Signature Lasse Vågsæther Karlsen mailto:lasse@vkarlsen.no http://presentationmode.blogspot.com/ PGP KeyID: 0xBCDEA2E3
Ethan Strauss - 11 Mar 2008 21:04 GMT That code is very interesting. I have written for similar code for a real DNA sequence manipulation application and I found that Regex's have a large amount of variability in the time it takes them to run. Most were taking 1-3 ticks and occassionally one would take 3 seconds! I never did track down exactly what was going on, but I switched from using a Regex to using a series of string.IndexOf() statements. and that consistently runs much faster. Has anyone else seen this sort of variability in Regex time to run?
Ethan Ethan Strauss Ph.D. Bioinformatics Scientist Promega Corporation 2800 Woods Hollow Rd. Madison, WI 53711 608-274-4330 800-356-9526 ethan.strauss@promega.com
> While asking some Java enthusiasts what they think about C#, I came across > this: [quoted text clipped - 10 lines] > (Query: Is he compiling the regex once in Java, but every time through the > loop in C#?) Jeroen Mostert - 11 Mar 2008 21:36 GMT > That code is very interesting. I have written for similar code for a real DNA > sequence manipulation application and I found that Regex's have a large [quoted text clipped - 4 lines] > faster. > Has anyone else seen this sort of variability in Regex time to run? Plenty, usually caused by the fact that regexes aren't regular expressions (the theoretical constructs, which always match in linear time). See, e.g., http://www.codinghorror.com/blog/archives/000488.html and http://www.regular-expressions.info/catastrophic.html.
The bottom line is that advanced regexes need optimization just like code does. People often assume that anything they can throw in a regex will magically run fast, but alas that's not the case.
 Signature J.
Ethan Strauss - 13 Mar 2008 16:41 GMT Thanks for the links! Very helpful and interesting. Ethan
> Plenty, usually caused by the fact that regexes aren't regular expressions > (the theoretical constructs, which always match in linear time). See, e.g., > http://www.codinghorror.com/blog/archives/000488.html and > http://www.regular-expressions.info/catastrophic.html. Alun Harford - 12 Mar 2008 00:40 GMT > That code is very interesting. I have written for similar code for a real DNA > sequence manipulation application and I found that Regex's have a large [quoted text clipped - 4 lines] > faster. > Has anyone else seen this sort of variability in Regex time to run? Yes. C# Regular Expressions can potentially take exponential time to run (IIRC they're NP-hard).
And I hope you realise that using strings to represent DNA sequences except for input/output makes baby Jesus cry. Your programs would be faster and more maintainable using your own type (probably backed by a byte[] and some unsafe code). The best way to represent it normally depends on what you're doing and whether you need to consider SNPs, but Unicode strings are a bad idea!
Alun Harford
Ethan Strauss - 13 Mar 2008 16:49 GMT I don't want to be the one making Jesus cry, but I am not sure that my code is what is doing it. I see that representing DNA as strings is not going to make the cpu as happy as it could be, but it is not obvious to me how to represent DNA (and RNA and protein, so I can't use a byte array anymore) as a numeric array and still get relatively programmer friendly functionality. I started yesterday (code below...) and stopped pretty rapidly because I don't see a way to recreate IndexOf or Regex type functionality without a lot of work! If you have anything more complete I would be interested! Thanks, Ethan
using System; using System.Collections.Generic; using System.Text;
namespace TestSequence { public struct DNA { private DNABase[] _Sequence; public DNA(string sequence) { List<DNABase> ThisSequence = new List<DNABase>(); foreach (char thisBase in sequence.ToUpper().ToCharArray()) { DNABase NextBase; switch (thisBase) { case "G": { NextBase = DNABase.G; break; } case "A": { NextBase = DNABase.A; break; } case "T": { NextBase = DNABase.T; break; } case "C": { NextBase = DNABase.C; break; } default: { continue; } } ThisSequence.Add(NextBase); } _Sequence = ThisSequence.ToArray(); } } public enum DNABase : byte { N = 0, G = 1, A = 2, T = 3, C = 4 } }
> And I hope you realise that using strings to represent DNA sequences > except for input/output makes baby Jesus cry. Your programs would be [quoted text clipped - 4 lines] > > Alun Harford Jon Skeet [C# MVP] - 13 Mar 2008 17:36 GMT > I don't want to be the one making Jesus cry, but I am not sure that my code > is what is doing it. I see that representing DNA as strings is not going to > make the cpu as happy as it could be, but it is not obvious to me how to > represent DNA (and RNA and protein, so I can't use a byte array anymore) as a > numeric array and still get relatively programmer friendly functionality. I'd create a type (probably *not* a value type, btw) representing a DNA sequence, which *internally* used an efficient format, but which could also present its contents as strings.
Basically it looks like you've started the right way, and you just need to write a byte equivalent of IndexOf. That shouldn't be too hard to do - and Array.IndexOf may sort it out for you. What kind of regex capabilities do you need?
 Signature Jon Skeet - <skeet@pobox.com> http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet World class .NET training in the UK: http://iterativetraining.co.uk
Ethan Strauss - 13 Mar 2008 17:56 GMT Thanks, I will think about it some more. In general, I am much more concerned with usability and correctness than effeciency, but I certainly won't turn down effeciency gains if they are just sitting there. If I end up just storing DNA as an int array and then, for most uses, convert it back to a string, will that actually gain anything?
The biggest Regex need is shown below. I do use some other Regex's, but this is the one I don't think I could really replace. I guess I can always convert to a string and then create the Regex from that is need be. Ethan
/// <summary> /// Creates a regular expression of all sequences which could code for a specific amino acid Sequence /// </summary> /// <param name="residues">A string containing the amino acid Sequence</param> /// <returns>a string which represents the Regular expression</returns> public static string BackTranslateToRegex(string residues) { if (residues == null) { throw new ArgumentNullException("residues", "you need a Sequence from which to make a regex."); } Hashtable BackTranslations = new Hashtable(); BackTranslations.Add('*', "T(GA|A[GA])"); BackTranslations.Add('A', "GC[GATC]"); BackTranslations.Add('C', "TG[TC]"); BackTranslations.Add('D', "GA[TC]"); BackTranslations.Add('E', "GA[GA]"); BackTranslations.Add('F', "TT[TC]"); BackTranslations.Add('G', "GG[GATC]"); BackTranslations.Add('H', "CA[TC]"); BackTranslations.Add('I', "AT[ATC]"); BackTranslations.Add('K', "AA[GA]"); BackTranslations.Add('L', "(TT[GA]|CT[GATC])"); BackTranslations.Add('M', "ATG"); BackTranslations.Add('N', "AA[TC]"); BackTranslations.Add('P', "CC[GATC]"); BackTranslations.Add('Q', "CA[GA]"); BackTranslations.Add('R', "(AG[GA]|CG[GATC])"); BackTranslations.Add('S', "(AG[TC]|TC[GATC])"); BackTranslations.Add('T', "AC[GATC]"); BackTranslations.Add('V', "GT[GATC]"); BackTranslations.Add('W', "TGG"); BackTranslations.Add('Y', "TA[TC]"); StringBuilder Expression = new StringBuilder(); char[] Residues = residues.ToCharArray(); foreach (char Residue in Residues) { Expression.Append(BackTranslations[Residue]); } return Expression.ToString(); }
> > I don't want to be the one making Jesus cry, but I am not sure that my code > > is what is doing it. I see that representing DNA as strings is not going to [quoted text clipped - 10 lines] > - and Array.IndexOf may sort it out for you. What kind of regex > capabilities do you need? Jon Skeet [C# MVP] - 13 Mar 2008 18:08 GMT > I will think about it some more. In general, I am much more concerned with > usability and correctness than effeciency, but I certainly won't turn down > effeciency gains if they are just sitting there. If I end up just storing DNA > as an int array and then, for most uses, convert it back to a string, will > that actually gain anything? Well, I'd say that the biggest gain is in terms of concepts. Your sequence should logically be a sequence of individual strongly typed elements, not arbitrary characters in a string.
> The biggest Regex need is shown below. I do use some other Regex's, but this > is the one I don't think I could really replace. I guess I can always convert > to a string and then create the Regex from that is need be. Yes, if necessary. Are those A-Y sequences standardised amino acid sequence identifiers? You could potentially create another type for those, of course... but that would end up being quite a bit more work, I suspect.
 Signature Jon Skeet - <skeet@pobox.com> http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet World class .NET training in the UK: http://iterativetraining.co.uk
Shalin Shah - 13 Mar 2008 04:54 GMT > This is a program that is heavy on regexes (which I'm not an expert on) and > am wondering if the C# version makes an elementary blunder. Do any experts > want to have a look? See also comp.lang.java. > > (Query: Is he compiling the regex once in Java, but every time through the > loop in C#?) I think he is compiling the regular expression each time in the loop. A good benchmark would be compiling it once and matching it in the loop. Maybe C# uses a DFA-NFA hybrid (which might explain the large memory usage, as the author of the article claims) which has the potential of matching a regular expression several times faster than a backtracking implementation, but it would compile the regular expression slower than backtracking.
Another good benchmark would include the use of backreferences, which forces a regex implementation to use backtracking.
FYI, egrep uses a DFA-NFA hybrid.
Jesse Houwing - 14 Mar 2008 21:04 GMT Hello Michael,
> While asking some Java enthusiasts what they think about C#, I came > across this: [quoted text clipped - 11 lines] > (Query: Is he compiling the regex once in Java, but every time through > the loop in C#?) Replacing Regex regexpr = new Regex(matchthis, RegexOptions.Compiled);
with Regex regexpr = new Regex(matchthis, RegexOptions.None);
made it fly.
-- Jesse Houwing jesse.houwing at sogeti.n
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|