.NET Forum / Languages / Managed C++ / March 2006
C++/CLI the fastest compiler? Yes, at least for me. :-)
Don Kim - 12 Mar 2006 02:36 GMT Ok, so I posted a rant earlier about the lack of marketing for C++/CLI, and it forked over into another rant about which was the faster compiler. Some said C# was just as fast as C++/CLI, whereas others said C++/CLI was more optimized.
Anyway, I wrote up some very simple test code, and at least on my computer C++/CLI came out the fastest. Here's the sample code, and just for good measure I wrote one in Java, and it was the slowest! ;-) Also, I did no optimizing compiler switches and compiled the C++/CLI with /clr:safe only, to compile to pure verifiable .NET.
//C++/CLI code
using namespace System;

int main()
{
    long start = Environment::TickCount;
    for (int i = 0; i < 10000000; ++i) {}
    long end = Environment::TickCount;
    Console::WriteLine(end - start);
}

//C# code
using System;

public class ForLoopTest
{
    public static void Main(string[] args)
    {
        long start = Environment.TickCount;
        for (int i =0;i < 10000000; ++i) {}
        long end = Environment.TickCount;
        Console.WriteLine((end-start));
    }
}

//Java code
public class Performance
{
    public static void main(String args[])
    {
        long start = System.currentTimeMillis();
        for (int i=0; i < 10000000; ++i) {}
        long end = System.currentTimeMillis();
        System.out.println(end-start);
    }
}
Results:
C++/CLI -> 15-18 msec
C# -> 31-48 msec
Java -> 65-72 msec
I know, I know, these kinds of tests are not always foolproof, and results can vary from computer to computer, but at least on my system, C++/CLI had the fastest results.
Maybe C++/CLI is the most optimized compiler?
-Don Kim
Carl Daniel [VC++ MVP] - 12 Mar 2006 05:42 GMT > C++/CLI -> 15-18 msec > C# -> 31-48 msec [quoted text clipped - 5 lines] > > Maybe C++/CLI is the most optimized compiler? After increasing the length of the loops by a factor of 100, I see about a 2X speed advantage for C++/CLI as well. Looking at the IL produced by the two compilers for the respective main functions:
C++:
.method assembly static int32 main() cil managed
{
  // Code size 40 (0x28)
  .maxstack 2
  .locals (int32 V_0, int32 V_1, int32 V_2)
  IL_0000: call int32 [mscorlib]System.Environment::get_TickCount()
  IL_0005: stloc.2
  IL_0006: ldc.i4.0
  IL_0007: stloc.0
  IL_0008: br.s IL_000e
  // start of loop
  IL_000a: ldloc.0
  IL_000b: ldc.i4.1
  IL_000c: add
  IL_000d: stloc.0
  IL_000e: ldloc.0
  IL_000f: ldc.i4 0x3b9aca00
  IL_0014: bge.s IL_0018
  IL_0016: br.s IL_000a
  // end of loop
  IL_0018: call int32 [mscorlib]System.Environment::get_TickCount()
  IL_001d: stloc.1
  IL_001e: ldloc.1
  IL_001f: ldloc.2
  IL_0020: sub
  IL_0021: call void [mscorlib]System.Console::WriteLine(int32)
  IL_0026: ldc.i4.0
  IL_0027: ret
} // end of method 'Global Functions'::main
C#:
.method public hidebysig static void Main(string[] args) cil managed
{
  .entrypoint
  // Code size 47 (0x2f)
  .maxstack 2
  .locals init (int64 V_0, int32 V_1, int64 V_2, bool V_3)
  IL_0000: nop
  IL_0001: call int32 [mscorlib]System.Environment::get_TickCount()
  IL_0006: conv.i8
  IL_0007: stloc.0
  IL_0008: ldc.i4.0
  IL_0009: stloc.1
  IL_000a: br.s IL_0012
  // start of loop
  IL_000c: nop
  IL_000d: nop
  IL_000e: ldloc.1
  IL_000f: ldc.i4.1
  IL_0010: add
  IL_0011: stloc.1
  IL_0012: ldloc.1
  IL_0013: ldc.i4 0x3b9aca00
  IL_0018: clt
  IL_001a: stloc.3
  IL_001b: ldloc.3
  IL_001c: brtrue.s IL_000c
  // end of loop
  IL_001e: call int32 [mscorlib]System.Environment::get_TickCount()
  IL_0023: conv.i8
  IL_0024: stloc.2
  IL_0025: ldloc.2
  IL_0026: ldloc.0
  IL_0027: sub
  IL_0028: call void [mscorlib]System.Console::WriteLine(int64)
  IL_002d: nop
  IL_002e: ret
} // end of method ForLoopTest::Main
The C++ compiler did generate more optimized IL. It's surprising to me that the JIT didn't do a better job of optimizing the C#-produced code.
Note that the C# code converted the time to a 64 bit value (C#'s long is 64 bits, while C++'s long is 32 bits), but that occurred outside the loop so it should have next to no impact on the overall speed of the code.
-cd
Jochen Kalmbach [MVP] - 12 Mar 2006 07:51 GMT Hi Carl!
> The C++ compiler did generate more optimized IL. It's surprising to me that > the JIT didn't do a better job of optimizing the C#-produced code. Wasn't there a statement that the JIT for .NET 2.0 is not doing optimizations (only simple optimizations) ? I just remember a blog-entry from someone at blogs.msdn.com... but couldn't find it anymore...
Greetings Jochen
My blog about Win32 and .NET http://blog.kalmbachnet.de/
Jochen Kalmbach [MVP] - 12 Mar 2006 08:11 GMT >> The C++ compiler did generate more optimized IL. It's surprising to >> me that the JIT didn't do a better job of optimizing the C#-produced [quoted text clipped - 4 lines] > I just remember a blog-entry from someone at blogs.msdn.com... but > couldn't find it anymore... Currently I could only find the confirmation of the "missing" optimization for the CF. But I thought the same was true for the "desktop"-framework...
http://blogs.msdn.com/stevenpr/archive/2005/12/12/502978.aspx
<quote> Because the CLR can throw away native code under memory pressure or when an application moves to the background, it is quite possible that the same IL code may need to be jit compiled again when the application continues running. This fact leads to our second major jit compiler design decision: the time it takes to compile IL code often takes precedence over the quality of the resulting native code. As with all good compilers, the Compact Framework jit compiler does some basic optimizations, but because of the need to regenerate code quickly in order for applications to remain responsive, more extensive optimizations generally take a back seat to sheer compilation speed. </quote>
Greetings Jochen
My blog about Win32 and .NET http://blog.kalmbachnet.de/
Andre Kaufmann - 12 Mar 2006 09:00 GMT > Currently I could only find the confirmation of the "missing" > optimization for the CF. But I thought the same was true for the [quoted text clipped - 14 lines] > optimizations generally take a back seat to sheer compilation speed. > </quote> That may be true. But I wonder why there cannot be both? A fast IL compiler and one that is slow, but optimizes much better. E.g. "ngen" could have a command line switch to generate more optimized code.
Andre
Shawn B. - 12 Mar 2006 12:39 GMT > Wasn't there a statement that the JIT for .NET 2.0 is not doing > optimizations (only simple optimizations) ? > I just remember a blog-entry from someone at blogs.msdn.com... but > couldn't find it anymore... No, but there is a recent thread in this group where some MVP's insist that the C++ compiler doesn't do optimized IL code and produces roughly what the C# compiler does, despite the fact that your test, some VC++ devs, publications, and my own internal software production has proved that the C++/CLI compiler is the best optimized for IL of the MS stack. That said, the same MVP insists that some MS employees have stated that the C++/CLI compiler leaves all the optimization to the JIT rather than front-end optimizing.
Thanks, Shawn
Jochen Kalmbach [MVP] - 12 Mar 2006 13:56 GMT Hi Shawn!
>>Wasn't there a statement that the JIT for .NET 2.0 is not doing >>optimizations (only simple optimizations) ? [quoted text clipped - 9 lines] > compiler leaves all the optimization to the JIT rather than front-end > optimizing. Really? I thought the C++/CLI compiler does not care what code it is generating. It always tries to optimize the "pseudo-code".
Nevertheless... I found neither documentation that the JIT compiler does optimization nor documentation that it does not...
Greetings Jochen
My blog about Win32 and .NET http://blog.kalmbachnet.de/
Andre Kaufmann - 12 Mar 2006 14:27 GMT >> Wasn't there a statement that the JIT for .NET 2.0 is not doing >> optimizations (only simple optimizations) ? [quoted text clipped - 3 lines] > No, but there is a recent thread in this group where some MVP's insist that > the C++ compiler doesn't do optimized IL code and produces roughly what the You mean the one where W.D. [MVP] gives a sample showing that the C++/CLI compiler doesn't do global optimization on IL code ?
It does. IMHO the example is wrong. If I interpret the given example correctly it's based on a call to an external DLL. So the C++/CLI compiler must do an optimization over DLL boundaries ?! Since the DLL is loaded dynamically, how should the C++/CLI compiler do any optimization ?
Why should the C++/CLI compiler not optimize the code ? I don't know how the C++/CLI compiler is implemented, but I assume that the code generation of native or CLI code is done by optimizing the generated intermediate code, before native or managed code is generated. So that (nearly) the same optimizer is used for "native code compiled to IL code" and "native x86 code". If my assumption is true it would be plain nonsense to revert this optimization, already done.
> C# compiler does, despite the fact that your test, some VC++ devs, > publications, and my own internal software production has proved that the > C++/CLI compiler is the best optimized for IL of the MS stack. That said, > the same MVP insists that some MS employees have stated that the C++/CLI > compiler leaves all the optimization to the JIT rather than front-end > optimizing. If he gives a valid link to the statements, I will believe it. Which doesn't mean that the statements are true.
> Thanks, > Shawn Andre
Carl Daniel [VC++ MVP] - 12 Mar 2006 16:35 GMT > Why should the C++/CLI compiler not optimize the code ? I don't know > how the C++/CLI compiler is implemented, but I assume that the code [quoted text clipped - 3 lines] > code" and "native x86 code". If my assumption is true it would be > plain nonsense to revert this optimization, already done. That is indeed the case. There's a single front-end for both native and managed code. That front end produces CIL ('C' Intermediate Language) which is then fed to the back end. The back-end consists of target independent parts (e.g. CIL optimizations) and target dependent parts (e.g. code generation).
-cd
Eugene Gershnik - 12 Mar 2006 09:10 GMT > I did no optimizing compiler switches [...]
Then the test is meaningless. If you don't ask the compiler to optimize why should it spend any effort on making your code fast?
[I don't have any stake in C++/CLI, C# or Java -- they can all die as far as I am concerned -- my objection as an outsider is only about how you tested.]
Eugene http://www.gershnik.com
Don Kim - 12 Mar 2006 13:18 GMT > Then the test is meaningless. If you don't ask the compiler to optimize why > should it spend any effort on making your code fast? That was the whole point. If I were to use optimizing options, there would invariably be arguments that either I did not use the correct ones, not in the proper order, that certain compiler switches are not equivalent, etc., etc. Therefore, I compiled as is w/out any options to see how each compiler would compile on its own. I also made the test as simple as possible so as to time how each compiler internally optimizes a straight iteration of a common for loop.
In this case, it seems C++/CLI is the fastest in the managed Windows environment.
-Don Kim
Eugene Gershnik - 12 Mar 2006 20:25 GMT >> Then the test is meaningless. If you don't ask the compiler to >> optimize why should it spend any effort on making your code fast? [Rearranging your post a little]
> That was the whole point. [...]
> In this case, it seems C++/CLI is the fastest in the managed Windows > environment. Let me see. Take a world record holder for a 100m dash and take me. Put us both before a 100m range and ask us to get to the end at whatever pace we want. He walks. I run. I get there before him. Your conclusion seems to be that I am a faster runner.
> If I were to use optimizing options, there > would invariably be arguments that either I did not use the correct > ones, not in the proper order, that certain compiler switches are not > equivalent, etc., etc. Yes, measuring compiler performance is hard. If you want to get meaningful results you will need to study each one's options in detail, determine what people usually set in their optimized builds, create a meaningful test set etc. etc. If you don't do all this, announcing to the world that X compiler is faster is a waste of electrons.
> I also made the > test as simple as possible so as to time how each compiler internally > optimizes a straight iteration of a common for loop. You didn't ask them to optimize the loop.
Eugene http://www.gershnik.com
Willy Denoyette [MVP] - 12 Mar 2006 17:40 GMT | Ok, so I posted a rant earlier about the lack of marketing for C++/CLI, | and it forked over into another rant about which was the faster [quoted text clipped - 57 lines] | | -Don Kim Such a micro benchmark has little value; an empty loop will be hoisted in optimized builds (you aren't going to do this in real code, are you?). More important is the way you measure execution time here: it is wrong. The reason for this is that Environment.TickCount is updated with the real time clock tick. That is every 10 msec or 15.6 msec or higher, depending on the CPU type (Intel, AMD, variants...). For instance an AMD 64 ticks at an interval of 15.5 msec, most Intel based systems have an interval of 10 msec, and most SMP systems tick at 20 msec or higher.
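The effect of such a coarse tick can be sketched in a few lines of portable C++ (not C++/CLI). This is only a model under stated assumptions: the 15.6 msec interval is taken from the post above, the function names are made up for illustration, and microseconds are used purely to keep the arithmetic in integers.

```cpp
#include <cstdint>

// Hypothetical model of a counter that, like Environment.TickCount,
// only advances once per timer interrupt (every interval_us microseconds).
// Returns the value such a counter would show at real time t_us.
constexpr std::int64_t coarse_clock_us(std::int64_t t_us, std::int64_t interval_us)
{
    return (t_us / interval_us) * interval_us;
}

// Elapsed time as seen through the coarse counter.
constexpr std::int64_t measured_us(std::int64_t start_us, std::int64_t end_us,
                                   std::int64_t interval_us)
{
    return coarse_clock_us(end_us, interval_us) - coarse_clock_us(start_us, interval_us);
}
```

In this model a 5 msec piece of work is reported as 0 when it falls inside one interval (`measured_us(1000, 6000, 15600)`) and as a full 15.6 msec when it happens to straddle a tick (`measured_us(14000, 19000, 15600)`), so short runs can only ever show multiples of the interval.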
To get accurate results you need to use the high performance counters or the Stopwatch class in V2. Here is the adapted code:
// C# code
// csc /o- bcs.cs
using System;
using System.Diagnostics;

public class ForLoopTest
{
    public static void Main(string[] args)
    {
        long nanosecPerTick = (1000L*1000L*1000L) / Stopwatch.Frequency;

        Stopwatch sw = new Stopwatch();
        sw.Start();
        for (int i =0;i < 10000000; ++i) {}
        sw.Stop();
        long ticks = sw.Elapsed.Ticks;
        Console.WriteLine("{0} nanoseconds", ticks * nanosecPerTick);
    }
}

// C++/CLI code
// cl /CLR:safe /Od bcc.cpp
#using <System.dll>
using namespace System;
using namespace System::Diagnostics;

int main()
{
    Int64 nanosecPerTick = (1000L * 1000L * 1000L) / System::Diagnostics::Stopwatch::Frequency;

    Stopwatch^ sw = gcnew Stopwatch;
    sw->Start();
    for (int i = 0; i < 10000000; ++i) {}
    sw->Stop();
    Int64 ticks = sw->Elapsed.Ticks;
    Console::WriteLine("{0} nanoseconds", ticks* nanosecPerTick);
}
On my system using above code and the command line arguments as specified in the source (both non optimized) show following results:
C# 37714104 nanoseconds
C++/CLI 37389069 nanoseconds
That means both are equally fast, but again this means nothing; such micro benchmarks have no value.
Note that an optimized C++ build will hoist the empty loop (removes it completely from IL). This kind of hoisting is not done by the C# compiler, and there is a reason for it. That doesn't mean there is no loop optimization, it's just done at the JIT level!!.
Willy.
Carl Daniel [VC++ MVP] - 12 Mar 2006 18:47 GMT > wrong. The reason for this is that Environment.TickCount is updated > with the real time clock tick. That is every 10 msec or 15.6 msec or [quoted text clipped - 13 lines] > C++/CLI > 37389069 nanoseconds Interesting. I took Don's sample and increased the loop count by a factor of 100 and consistently got execution times of about 530ms for the C++ code and 1200ms for the C# code.
Granted, the resolution of GetTickCount is poor - but that's a large enough difference to be significant.
Your results are actually much closer to what I expected - nearly identical performance, but I can't see why replacing GetTickCount with StopWatch would have any effect other than to increase the resolution of the time measurement.
But... here's what I found with your examples: First, I changed both to calculate nanosecPerTick as a double instead of a long - on a system with a tick rate higher than 1GHz, your calculation results in 0 all the time.
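A minimal sketch of that truncation, in portable C++ rather than C++/CLI. The function names are illustrative only; the two frequencies are the values reported elsewhere in this thread (3052420000 and 3579545 ticks per second).

```cpp
#include <cstdint>

constexpr std::int64_t kNanosPerSecond = 1000LL * 1000LL * 1000LL;

// Integer version, as in the posted samples: the division truncates
// toward zero, so any frequency above 1 GHz yields 0.
constexpr std::int64_t nanos_per_tick_int(std::int64_t frequency_hz)
{
    return kNanosPerSecond / frequency_hz;
}

// Floating-point version: keeps the fractional nanoseconds per tick.
constexpr double nanos_per_tick_double(std::int64_t frequency_hz)
{
    return static_cast<double>(kNanosPerSecond) / static_cast<double>(frequency_hz);
}
```

With a frequency of 3052420000 the integer version gives 0 while the double version gives roughly 0.3276 ns/tick. With 3579545 the integer version gives 279, which also drops the fractional part of about 279.4 ns/tick, so the double form is the safer choice at either frequency.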
With that change, I get a time of 15.8us for the C++ code and 42.3us for the C# code - about the same difference I saw with GetTickCount.
It seems that there's something significantly different about your machine as compared to mine & Don's when it comes to the performance of this code - and that is very interesting!
What's your machine hardware? I'm running on a 3Ghz P4 with 1GB of RAM under XP SP2. I'm suspicious of your times (and mine as well) as I doubt my machine is 2000 times faster than yours.
-cd
Willy Denoyette [MVP] - 12 Mar 2006 20:07 GMT | > wrong. The reason for this is that Environment.TickCount is updated | > with the real time clock tick. That is every 10 msec or 15.6 msec or [quoted text clipped - 17 lines] | 100 and consistently got execution times of about 530ms for the C++ code and | 1200ms for the C# code. Are you sure you compiled non optimized (/Od and /o-)? As I said, the C++ compiler will hoist the loop when optimization is on (O1, O2 or whatever).
| Granted, the resolution of GetTickCount is poor - but that's a large enough | difference to be significant. It's not the resolution, it's the interval which is the culprit.
| Your results are actually much closer to what I expected - nearly identical | performance, but I can't see why replacing GetTickCount with StopWatch would | have any effect other than to increase the resolution of the time | measurement. Stopwatch uses the QueryPerformanceCounter and QueryPerformanceFrequency high resolution counters of the OS.
| But... here's what I found with your examples: First, I changed both to | calculate nanosecPerTick as a double instead of a long - on a system with a | tick rate higher than 1GHz, your calculation results in 0 all the time. That's very surprising, QueryPerformanceFrequency (StopWatch.Frequency) should not be that high, notice that this Frequency is not the CPU clock frequency, it's the output of a CPU clock divider, its frequency is much lower, on my System it's 3579545MHz (try with: Console::WriteLine(System::Diagnostics::Stopwatch::Frequency);) If on your system it's much higher than 1GHz, you might have an issue with your system.
| With that change, I get a time of 15.8us for the C++ code and 42.3us for | the C# code - about the same difference I saw with GetTickCount. Hmmm, 15.8 µsec for 10000000 loops in which you execute 6 instructions [1] per loop, that would mean 60000000 instructions in 15.8 µsec or 0.000263 nanosecs/instruction, or ~4.000.000.000.000 instructions/sec - not possible really, looks like the loop is hoisted or your clock is broken ;-).
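The plausibility check above is plain arithmetic, and can be sketched in portable C++ as follows. The helper name is made up for illustration; the inputs (10000000 iterations, 6 instructions per iteration, 15.8 µsec) are the figures quoted in the post.

```cpp
// Instruction throughput implied by a timing result: total instructions
// executed divided by the elapsed wall-clock time in seconds.
constexpr double implied_instructions_per_second(double iterations,
                                                 double instructions_per_iteration,
                                                 double elapsed_seconds)
{
    return iterations * instructions_per_iteration / elapsed_seconds;
}
```

Evaluating `implied_instructions_per_second(10000000.0, 6.0, 15.8e-6)` gives about 3.8e12 instructions per second, which is roughly a thousand times more than a 3 GHz core could retire even at one instruction per clock. A result like that points at the measurement rather than at the code being measured.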
| It seems that there's something significantly different about your machine | as compared to mine & Don's when it comes to the performance of this code - | and that is very interesting! Looks like you have to investigate the Frequency value returned first, and inspect your code.
| What's your machine hardware? I'm running on a 3Ghz P4 with 1GB of RAM | under XP SP2. I'm suspicious of your times (and mine as well) as I doubt my | machine is 2000 times faster than yours. I have it running on an AMD64 Athlon 3500+, 2GB, XP SP2, with CPU clock throttling disabled. Increasing the loop count by a factor 100 gives me:
3737032857 nanoseconds
or 3.7 seconds, i.e. 3737032857/1000000000 = 3.737032857 nsec/loop, or ~0.62 nsec per instruction (avg.)
| -cd
[1]
00d100d2 83c201 add edx,0x1
00d100d5 81fa80969800 cmp edx,0x989680
00d100db 0f9cc0 setl al
00d100de 0fb6c0 movzx eax,al
00d100e1 85c0 test eax,eax
00d100e3 75ed jnz 00d100d2
notes:
- 0x989680 = 10.000.000 decimal
- this is native code, generated by the JIT in a non optimized build.
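Since the listings in this thread come from builds with different loop counts, it can help to check which count a given hex immediate encodes. A small sketch in portable C++; the constant and function names are illustrative only.

```cpp
#include <cstdint>

// Loop limits as they appear in the listings in this thread.
constexpr std::int32_t kLimitInDisassembly = 0x989680;  // cmp edx,0x989680
constexpr std::int32_t kLimitInIL = 0x3b9aca00;         // ldc.i4 0x3b9aca00

// True when the hex immediate encodes the given decimal loop count.
constexpr bool encodes(std::int32_t immediate, std::int32_t loop_count)
{
    return immediate == loop_count;
}
```

Here `encodes(kLimitInDisassembly, 10000000)` and `encodes(kLimitInIL, 1000000000)` both hold: the disassembly is from the original 10-million-iteration sample, while the IL posted earlier is from the version with the loop count raised by a factor of 100.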
Willy.
Carl Daniel [VC++ MVP] - 12 Mar 2006 21:51 GMT > "Carl Daniel [VC++ MVP]" <cpdaniel_remove_this_and_nospam@mvps.org.nospam> > Are you sure you compiled non optimized (/Od and /o-)? As I said, the C++ > compiler will hoist the loop when optimization is on (O1, O2 or whatever). Quite certain - I used the exact command lines given in your posting (optimization is off by default as well, so specifying nothing is equivalent to /Od).
> It's not the resolution, it's the interval which is the culprit. We're talking about the same thing - 15ms precision is quite sufficient for measuring intervals of 500ms or more and certainly won't account for a 50% measurement error for such intervals - only 3% or so.
> That's very surprising, QueryPerformanceFrequency (StopWatch.Frequency) > should not be that high, notice that this Frequency is not the CPU clock [quoted text clipped - 3 lines] > If on your system it's much higher than 1GHz, you might have an issue with > your system. (You made a typo - on your system it's 3579545Hz, not MHz)
If your machine uses the MP HAL (which mine does), then QPC uses the RDTSC instruction which does report actual CPU core clocks. If your system doesn't use the MP HAL, then QPC uses the system board timer, which generally has a clock speed of 1X or 0.5X the NTSC color burst frequency of 3.57954545 MHz. Note that this rate has absolutely nothing to do with your CPU clock - it's a completely independent crystal oscillator on the MB.
> | With that change, I get a time of 15.8us for the C++ code and 42.3us > for [quoted text clipped - 5 lines] > possible > really, looks like the loop is hoisted or your clock is broken ;-). I agree - it doesn't add up. I'm quite sure that I did unoptimized builds, and the results are 100% reproducible. But see below.
> | It seems that there's something significantly different about your > machine [quoted text clipped - 4 lines] > Looks like you have to investigate the Frequency value returned first, and > inspect your code. Well, it's your code - not mine. The Frequency value is right on for this machine.
I'm at my office right now, on a different computer. This one's a 3GHz Pentium D. I modified the samples as before to make nanosecPerTick double instead of Int64 and added code to print the value of Stopwatch.Frequency and the raw Ticks and nanosecPerTick. Here are the results:
C:\Dev\Misc\fortest>fortest0312cs
Stopwatch frequency=3052420000
0.327608913583321 ns/tick
240117 ticks
78664.4695028862 nanoseconds

C:\Dev\Misc\fortest>fortest0312cpp
Stopwatch frequency=3052420000
0.327608913583321 ns/tick
49225 ticks
16126.548771139 nanoseconds
Increasing the loop count by a factor of 10 increases the times by a factor of 10. Decreasing by a factor of 10 decreases the times by a factor of 10. Clearly the loop has not been optimized out, but that still doesn't explain the apparent execution speed of more than 200 adds per clock cycle (I know modern CPUs are somewhat super-scalar, but 200 adds/clock? I don't think so!)
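The "200 adds per clock" figure can be reproduced from the output above. The sketch below is portable C++ with an illustrative function name, and it rests on two assumptions: that the run used the 10000000-iteration loop from the posted sample, and that each reported tick corresponds to one core clock, as the reported frequency of 3052420000 suggests.

```cpp
// Loop iterations completed per timer tick, from a reported tick count.
constexpr double iterations_per_tick(double iterations, double elapsed_ticks)
{
    return iterations / elapsed_ticks;
}
```

For the C++ run, `iterations_per_tick(10000000.0, 49225.0)` comes out at a little over 203. Under the assumptions above that is about 203 loop iterations per clock cycle, which no CPU of that era could deliver, so the tick count itself has to be suspect.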
I don't know what's going on here, but two things seem to be true:
1. The C++ code is faster on these machines. If I increase the loop count to 1,000,000,000 I can clearly see the difference in execution time with my eyes.
2. The Stopwatch class doesn't appear to work correctly on these machines - it's measuring times that are orders of magnitude too short, yet still proportional to the actual time spent.
Working on the assumption that #2 is true, I modified the code to call QueryPerformanceCounter/QueryPerformanceFrequency directly. Here are the results:
C:\Dev\Misc\fortest>fortest0312cpp
QPC frequency=3052420000
0.327608913583321 ns/tick
22388910 ticks
7334806.48141475 nanoseconds

C:\Dev\Misc\fortest>fortest0312cs
QPC frequency=3052420000
0.327608913583321 ns/tick
58980368 ticks
19322494.2832245 nanoseconds
The times are now much more reasonable - Stopwatch apparently doesn't work correctly with such a high value from QPF (it's apparently off by a factor of 1000). The ratio of times remains about equal though- the C++ code is still nearly 2X faster on this machine (despite the fact that that makes no sense at all, it seems to be true).
-cd
Carl Daniel [VC++ MVP] - 12 Mar 2006 22:03 GMT "Carl Daniel [VC++ MVP]" <cpdaniel_remove_this_and_nospam@mvps.org.nospam> wrote in message
> The times are now much more reasonable - Stopwatch apparently doesn't work > correctly with such a high value from QPF (it's apparently off by a factor > of 1000). The ratio of times remains about equal though- the C++ code is > still nearly 2X faster on this machine (despite the fact that that makes > no sense at all, it seems to be true). Follow-up -
It appears that Stopwatch scales the QPF/QPC values internally if the frequency is "high", causing Stopwatch.ElapsedTicks to report a scaled value, but Stopwatch.Frequency still reports the full resolution value returned by QPF.
Stopwatch.ElapsedMilliseconds and Stopwatch.Elapsed both return correctly scaled values.
This is clearly a bug in the Stopwatch class.
-cd
Willy Denoyette [MVP] - 13 Mar 2006 00:19 GMT | "Carl Daniel [VC++ MVP]" <cpdaniel_remove_this_and_nospam@mvps.org.nospam> | wrote in message [quoted text clipped - 17 lines] | | -cd I see, but it still doesn't explain this:
| C:\Dev\Misc\fortest>fortest0312cpp | QPC frequency=3052420000 [quoted text clipped - 7 lines] | 58980368 ticks | 19322494.2832245 nanoseconds Why is C++ almost 3 times faster than C#? Are we sure the ticks are accurate, are we sure the OS counter is updated for every tick, Are we sure the OS goes to the HAL to read the HW clock tick value at each call of QueryPerformanceCounter (this must be quite expensive)?
And why is it 2 and 5 times faster than on my AMD box, while the results are comparable (AMD a little faster) when I run it on Intel 3GHz non HT (see my previous post) ?
That means that the native code must be different, while it is on my AMD box (despite the fact that the IL is different).
add edx,0x1
cmp edx,0x989680
setl al
movzx eax,al
test eax,eax
jnz 00d100d2
Which is not really the best algorithm for X86, wonder how it looks like on Intel. Grr.. micro benchmarks, what a mess ;-)
Willy.
Carl Daniel [VC++ MVP] - 13 Mar 2006 00:37 GMT > That means that the native code must be different, while it is on my AMD > box [quoted text clipped - 8 lines] > Which is not really the best algorithm for X86, wonder how it looks like on > Intel. Grr.. micro benchmarks, what a mess ;-) Here's what I see (loops going 1 billion times):
The JIT'd C++ code:
// ---------------------------------------------------
for (int i = 0; i < 1000000000; ++i) {}
00000077 xor edx,edx
00000079 mov dword ptr [esp],edx
0000007c nop
0000007d jmp 00000082
// start of loop
0000007f inc dword ptr [esp]
00000082 cmp dword ptr [esp],3B9ACA00h
00000089 jge 0000008E
0000008b nop
0000008c jmp 0000007F
// end of loop

The JIT'd C# code:
// ---------------------------------------------------
for (int i =0;i < 1000000000; ++i) {}
00000098 xor ebx,ebx
0000009a nop
0000009b jmp 000000A0
// start of loop
0000009d nop
0000009e nop
0000009f inc ebx
000000a0 cmp ebx,3B9ACA00h
000000a6 setl al
000000a9 movzx eax,al
000000ac mov dword ptr [ebp-6Ch],eax
000000af cmp dword ptr [ebp-6Ch],0
000000b3 jne 0000009D
// end of loop
Neither of these represents ideal code by any stretch of the imagination - but instruction count alone probably accounts for the bulk of the difference between the two programs on this machine. Why the results are so different from what you see on your AMD machine I can't even guess.
-cd
Willy Denoyette [MVP] - 13 Mar 2006 02:08 GMT | > That means that the native code must be different, while it is on my AMD | > box [quoted text clipped - 50 lines] | | -cd Thanks, that's almost exactly what I've noticed see my previous reply.
C# Intel...
00000030 90 nop
00000031 90 nop
00000032 46 inc esi
00000033 81 FE 80 96 98 00 cmp esi,989680h
00000039 0F 9C C0 setl al
0000003c 0F B6 C0 movzx eax,al
0000003f 8B F8 mov edi,eax
00000041 85 FF test edi,edi
00000043 75 EB jne 00000030

C# AMD...
add edx,0x1
cmp edx,0x989680
setl al
movzx eax,al
test eax,eax
jnz 00d100d2

Conclusion: the JIT takes care of the CPU type even in debug builds! So it generates different X86 even from the same IL. This is extremely weird; for instance the inc esi used on Intel is an add edx,1 on AMD, so different register allocations and a different instruction. Well, I know add on AMD is preferred over an inc (according to their "Optimization guide for AMD64 Processors"), can you believe MSFT went that far with the JIT (in debug builds)?
Willy.
Carl Daniel [VC++ MVP] - 13 Mar 2006 04:25 GMT > "Optimization guide for AMD64 Processors"), can you believe MSFT went > that far with the JIT (in debug builds)? Well, yeah. Maybe. I'm under the (possibly misguided) impression that debug primarily stops the JIT from inlining and hoisting - things that change the relative order of the native code compared to the IL code. Within those guidelines, I guess it still picks the best codegen it can based on the machine.
My belief is that there are multiple full-time Intel and AMD employees at MSFT that do nothing but work on the compiler back-ends, including the CLR JIT.
-cd
Willy Denoyette [MVP] - 13 Mar 2006 10:06 GMT | > "Optimization guide for AMD64 Processors"), can you believe MSFT went | > that far with the JIT (in debug builds)? [quoted text clipped - 8 lines] | MSFT that do nothing but work on the compiler back-ends, including the CLR | JIT. Well, I would expect this for the C++ compiler back-end, but not directly for the JIT compiler which is more time constrained, but I guess I'm wrong.
Willy.
Willy Denoyette [MVP] - 13 Mar 2006 13:01 GMT || > "Optimization guide for AMD64 Processors"), can you believe MSFT went || > that far with the JIT (in debug builds)? [quoted text clipped - 13 lines] | | Willy. Some more fun.
Consider this program:
//C++/CLI code
// File : EmptyLoop.cpp
#using <System.dll>
using namespace System;
using namespace System::Diagnostics;

#pragma unmanaged
void ForLoopTest( void )
{
    __asm {
        xor esi,esi; 0 -> esi
        jmp begin;
    iter:;
        inc esi; i++
    begin:;
        cmp esi,989680h ; i < 10000000?
        jl iter; no
    }
    return;
}

#pragma managed
int main()
{
    Int64 nanosecPerTick = (1000L * 1000L * 1000L) / System::Diagnostics::Stopwatch::Frequency;
    Stopwatch^ sw = gcnew Stopwatch;
    sw->Start();
    ForLoopTest();
    sw->Stop();
    Int64 ticks = sw->Elapsed.Ticks;
    Console::WriteLine("{0} nanoseconds", ticks * nanosecPerTick);
}
Compiled with:
cl /clr /O2 EmptyLoop.cpp
output: 24935346 nanoseconds
cl /clr /Od EmptyLoop.cpp
output: 37636821 nanoseconds
See, the loop is in assembly, pure unmanaged X86 code; the code produced by the C++ compiler [1] is the same except for the function prolog and epilog, although the results are different. Any takers?
[1] /Od build

void ForLoopTest( void )
{
00401000 55 push ebp
00401001 8B EC mov ebp,esp
00401003 56 push esi
__asm {
xor esi,esi; 0 -> esi
00401004 33 F6 xor esi,esi
jmp begin;
00401006 EB 01 jmp begin (401009h)
iter:;
inc esi; i++
00401008 46 inc esi
begin:;
cmp esi,989680h ; < 10000000?
00401009 81 FE 80 96 98 00 cmp esi,989680h
jl iter; no
0040100F 7C F7 jl iter (401008h)
}
return;
}
00401011 5E pop esi
00401012 5D pop ebp
00401013 C3 ret

/O2 build

void ForLoopTest( void )
{
00401000 56 push esi
xor esi,esi; 0 -> esi
00401001 33 F6 xor esi,esi
jmp begin;
00401003 EB 01 jmp begin (401006h)
iter:;
inc esi; i++
00401005 46 inc esi
begin:;
cmp esi,989680h ; < 10000000?
00401006 81 FE 80 96 98 00 cmp esi,989680h
jl iter; no
0040100C 7C F7 jl iter (401005h)
__asm {
0040100E 5E pop esi
}
return;
}
0040100F C3 ret
Willy.
Willy Denoyette [MVP] - 13 Mar 2006 14:05 GMT Ok, final update. The Stopwatch.Ticks is broken, so the calculated nanoseconds are incorrect on all platforms.
Using StopWatch.Elapsed.Milliseconds gives the following results.
Values are averages for 10 runs.
C# ~12.8 msec. for 10.000.000 loops
C++/CLI ~9.1 msec.
Release build:
C# ~9.1 msec.
C++/CLI - loop hoisted by C++/CLI compiler (no IL body)
The X86 code for the loop in the C++/CLI /Od build and the C# optimized build is nearly the same (different registers allocated, and inc instead of add).
Now this:
#using <System.dll>
using namespace System;
using namespace System::Diagnostics;

#pragma unmanaged
void ForLoopTest( void )
{
    __asm {
        xor esi,esi; 0 -> esi
        jmp begin;
    iter:;
        inc esi; i++
    begin:;
        cmp esi,100000000 ; < 100000000?
        jl iter; no
    }
    return;
}

#pragma managed
int main()
{
    Stopwatch^ sw = gcnew Stopwatch;
    sw->Reset();
    sw->Start();
    ForLoopTest();
    sw->Stop();

    Int64 ms = sw->Elapsed.Milliseconds;
    Console::WriteLine("{0} msec.", ms);
}
compiled with: cl /clr /Od bcca.cpp
output: for 100.000.000 loops!! avg. 135 msec.

compiled with: cl /clr /O2 bcca.cpp
output: for 100.000.000 loops!! avg. 91 msec.
Notice that the C# optimized build and the C++/CLI optimized build with the loop in assembly give the same result. The question remains why the debug build is that much slower. My guess is that the CLR starts some extra work when running debug builds; IMO there is a GC/finalizer run after the call to Stopwatch.Start and before the loop runs. That would explain the different behavior (better results) on an HT CPU: the finalizer runs on a second core or logical CPU, so it doesn't disturb the user thread, whereas on a single-core CPU the finalizer pre-empts the user thread. I'll try to get a HW analyzer from the lab to check this; it simply can't be checked with SW tools alone.
Willy.
Willy Denoyette [MVP] - 13 Mar 2006 20:21 GMT | Ok, final update. | The Stopwatch.Ticks is broken, so the calculated nanoseconds are incorrect | on all platforms. Followup. !!! Stopwatch.Elapsed.Ticks != Stopwatch.ElapsedTicks !!!
One should not use Elapsed.Ticks to calculate the elapsed time in nanoseconds. The only correct way to get this high precision count is by using Stopwatch.ElapsedTicks like this:
long nanosecPerTick = (1000L*1000L*1000L) / Stopwatch.Frequency;
...
long ticks = sw.ElapsedTicks;
Console.WriteLine("{0} nanoseconds", ticks * nanosecPerTick);

or use Stopwatch.ElapsedMilliseconds.
Note that the Stopwatch code is not broken, the code I posted used Stopwatch.Elapsed.Ticks which is wrong in this context. Sorry for all the confusion.
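[Editor's illustration, not part of the original post.] The mix-up is easier to see with numbers. A TimeSpan tick (what Elapsed.Ticks returns) is always 100 ns, while an ElapsedTicks tick lasts 1/Stopwatch.Frequency seconds. The sketch below is plain standard C++ rather than .NET code; the function names are invented for the example, and the 9.1 msec. interval and the 3579545 Hz frequency are simply values quoted elsewhere in this thread.

```cpp
#include <cstdint>

// A TimeSpan tick is a fixed 100 ns unit, independent of the hardware counter.
constexpr std::int64_t kNanosecPerTimeSpanTick = 100;

// Correct conversion for a value read from Elapsed.Ticks.
constexpr std::int64_t timespan_ticks_to_ns(std::int64_t timespan_ticks) {
    return timespan_ticks * kNanosecPerTimeSpanTick;
}

// The mistake discussed above: scaling TimeSpan ticks by the counter's
// (integer) nanoseconds-per-tick factor, as if they were ElapsedTicks.
constexpr std::int64_t misread_as_counter_ticks_ns(std::int64_t timespan_ticks,
                                                   std::int64_t counter_frequency_hz) {
    return timespan_ticks * (1000000000LL / counter_frequency_hz);
}

// 9.1 ms is 91,000 TimeSpan ticks, i.e. 9,100,000 ns.
static_assert(timespan_ticks_to_ns(91000) == 9100000, "9.1 ms in ns");
// Misread with a 3579545 Hz counter (279 ns per tick), the same reading
// comes out as 25,389,000 ns, almost three times too long.
static_assert(misread_as_counter_ticks_ns(91000, 3579545) == 25389000,
              "wrong scale factor inflates the result");
```

With a counter frequency above 1 GHz the integer factor becomes 0, so the misread value collapses to 0 instead.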
Willy.
Willy Denoyette [MVP] - 13 Mar 2006 21:14 GMT || Ok, final update. || The Stopwatch.Ticks is broken, so the calculated nanoseconds are incorrect [quoted text clipped - 20 lines] | | Willy. Mystery solved, finally :-).
A C++/CLI debug build (/Od flag, the default) does not generate sequence points in IL; however, it does generate optimized IL. A sequence point marks a spot in the IL code that corresponds to a specific location in the original source. If you look at the IL generated by C# when compiled with /o-, you'll notice the nops inserted in the stream. These nops are used by the JIT to produce sequence points, but the /o- flag doesn't produce optimized IL. To get the same behavior in C# as /Od in C++/CLI, you need to set /debug+ /o+. This generates debug builds without the nops that trigger the sequence points, just like C++/CLI does. The "empty loop" C# sample compiled with /debug+ /o+ runs just as fast as the C++/CLI sample built with /Od. The IL produced is identical.
Willy.
Carl Daniel [VC++ MVP] - 13 Mar 2006 21:28 GMT > Mystery solved, finally :-). > [quoted text clipped - 13 lines] > as > the C++/CLI sample built with /Od. The IL produced is identical. Good sleuthing! In the end, they really ought to be about the same - having the C++ code execute 2x faster just didn't make sense.
-cd
Ajay Kalra - 13 Mar 2006 21:50 GMT This is very useful info. It was causing confusion given mixed information coming from MSFT itself.
-------- Ajay Kalra ajaykalra@yahoo.com
Willy Denoyette [MVP] - 13 Mar 2006 22:36 GMT | This is very useful info. It was causing confusion given mixed | information coming from MSFT itself. | | -------- | Ajay Kalra | ajaykalra@yahoo.com Well, the C++/CLI team did not want to generate explicit sequence points in the IL, so the JIT compiler can only rely on the implicit sequence points (that is when the evaluation stack is empty). That means also that it's not possible to synchronise the IL with the actual code while debugging C++/CLI in managed mode and you need the PDB to set breakpoint in your code, not a big deal IMO.
Willy.
Carl Daniel [VC++ MVP] - 13 Mar 2006 21:23 GMT > Followup. > !!! Stopwatch.Elapsed.Ticks != Stopwatch.ElapsedTicks !!! A ha! I obviously hadn't looked at the code closely enough to realize that it was using Elapsed.Ticks and not ElapsedTicks.
> One should not use Elapsed.Ticks to calculate the elapsed time in > nanoseconds. True - one should use it to calculate the elapsed time in 0.1us units, since that's what TimeSpan.Ticks is expressed as.
> The only correct way to get this high precision count is by using > Stopwatch.ElapsedTicks like this: > > long nanosecPerTick = (1000L*1000L*1000L) / Stopwatch.Frequency; but make this a double. Stopwatch.Frequency is more than 1E9 on modern machines using the MP HAL.
double nanosecPerTick = 1000.0 * 1000L * 1000L / Stopwatch.Frequency;
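[Editor's illustration, not part of the original post.] A small sketch of why the double matters, written in standard C++ instead of C#. The function names are made up for the example; the two frequencies (3579545 Hz and 3052420000 Hz) and the 49225-tick reading are values reported elsewhere in this thread.

```cpp
#include <cstdint>

// Integer version, as in the quoted snippet: the division truncates, and
// the result is 0 as soon as the counter frequency exceeds 1e9 Hz.
constexpr std::int64_t nanosec_per_tick_int(std::int64_t frequency_hz) {
    return (1000LL * 1000LL * 1000LL) / frequency_hz;
}

// Floating-point version: keeps the fractional nanoseconds per tick.
constexpr double nanosec_per_tick(std::int64_t frequency_hz) {
    return 1e9 / static_cast<double>(frequency_hz);
}

// Convert a raw counter reading to nanoseconds without losing precision.
constexpr double ticks_to_nanoseconds(std::int64_t ticks, std::int64_t frequency_hz) {
    return static_cast<double>(ticks) * 1e9 / static_cast<double>(frequency_hz);
}

// System board timer: 1e9 / 3579545 = 279.36..., truncated to 279.
static_assert(nanosec_per_tick_int(3579545) == 279, "fraction is dropped");
// RDTSC-based counter on a 3 GHz machine: the integer factor is 0.
static_assert(nanosec_per_tick_int(3052420000LL) == 0, "truncates to zero");
```

On the 3052420000 Hz machine, every tick count multiplied by the integer factor would print as 0 nanoseconds, while ticks_to_nanoseconds(49225, 3052420000) gives roughly 16126.5 ns.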
-cd
Willy Denoyette [MVP] - 13 Mar 2006 21:57 GMT | > Followup. | > !!! Stopwatch.Elapsed.Ticks != Stopwatch.ElapsedTicks !!! [quoted text clipped - 17 lines] | | double nanosecPerTick = 1000.0 * 1000L * 1000L / Stopwatch.Frequency; Sure, or use picoseconds :-)
long picosecPerTick = 1000L * 1000L * 1000L * 1000L / Stopwatch.Frequency;
90614831400 picoseconds Looks real crazy isn't it?
Willy.
Carl Daniel [VC++ MVP] - 14 Mar 2006 01:01 GMT > Sure, or use picoseconds :-) Nah - that's short sighted. Let's standardize on Attoseconds :)
long attosecPerTick = 1000L * 1000L * 1000L * 1000L * 1000L * 1000L /Stopwatch.Frequency;
now that's just getting silly... for the next few decades at least.
-cd
Willy Denoyette [MVP] - 15 Mar 2006 10:46 GMT | > Sure, or use picoseconds :-) | [quoted text clipped - 6 lines] | | -cd LOL, I'll keep it in mind for a next life maybe :-)
Willy.
Willy Denoyette [MVP] - 12 Mar 2006 23:49 GMT | > "Carl Daniel [VC++ MVP]" <cpdaniel_remove_this_and_nospam@mvps.org.nospam> | > Are you sure you compiled non optimized (/Od and /o-)? As I said, the C++ [quoted text clipped - 9 lines] | measuring intervals of 500ms or more and certainly won't account for a 50% | measurement error for such intervals - only 3% or so. Yes, but not for a loop of 10.000.000 (as in Don's code), which takes only 37 msecs. to complete. And as I said, on SMP systems this interval can be as large as 60 msecs. (as I have measured here on a Compaq Proliant 8 way system).
| > That's very surprising, QueryPerformanceFrequency (StopWatch.Frequency) | > should not be that high, notice that this Frequency is not the CPU clock [quoted text clipped - 5 lines] | | (You made a typo - on your system it's 3579545Hz, not MHz) Right, sorry for that.
| If your machine uses the MP HAL (which mine does), then QPC uses the RDTSC | instruction which does report actual CPU core clocks. If your system | doesn't use the MP HAL, then QPC uses the system board timer, which | generally has a clock speed of 1X or 0.5X the NTSC color burst frequency of | 3.57954545 Mhz. Note that this rate has absolutely nothing to do with your | CPU clock - it's a completely independent crystal oscillator on the MB. True, the MP HAL uses the external CPU clock (yours runs at 3.052420000 GHz), but the 3.57954545 MHz clock is derived from a divider; or, otherwise stated, the (internal) CPU clock is always a multiple of this 3.57954545 MHz. For instance, an Intel PIII 1GHz stepping 5 clocks at 3.57954545 MHz * 278 = 995MHz. The stepping number is important here, as it may change the divider's value.
No, my current test machine is not MP or HT, so it doesn't use an MP HAL, and you didn't specify that either in your previous reply. It's quite important, as I know about the MP HAL.
| > | With that change, I get a time of 15.8us for the C++ code and 42.3us | > for [quoted text clipped - 20 lines] | Well, it's your code - not mine. The Frequency value is right on for this | machine. Well ..., it's Don's code. What do you mean with the Frequency value is right? The Frequency is also right on mine :-).
| I'm at my office right now, on a different computer. This one's a 3GHz | Pentium D. I modified the samples as before to make nanosecPerTick double | instead of Int64 and added code to print the value of Stopwatch.Frequency | and the raw Ticks and nanosecPerTick. Here are the results:
| C:\Dev\Misc\fortest>fortest0312cs | Stopwatch frequency=3052420000 [quoted text clipped - 7 lines] | 49225 ticks | 16126.548771139 nanoseconds That's for 10000000 loops I assume.
| Increasing the loop count by a factor of 10 increases the times by a factor | of 10. Decreasing by a factor of 10 decreases the times by a factor of 10. | Clearly the loop has not been optimized out, but that still doesn't explain | the apparent execution speed of more than 200 adds per clock cycle (I know | modern CPUs are somewhat super-scalar, but 200 adds/clock? I don't think | so!) That's not possible; Intel Pentium IV CPUs fetch and execute 2 instructions per cycle. The AMD Athlon 64 fetches and executes a max. of 3 instructions per cycle (mine clocks at 2.2GHz).
These are the results on PIV 3GHz not HT running W2K3 R2. C# Frequency = 3579545 46632867 nanoseconds
C++ Frequency = 3579545 40659177 nanoseconds
Notice the difference between C++ and C#; it looks like the X86 JIT'd code is not exactly the same, I'll have to check this. Remember the results on AMD 64 bit (XP SP2) - 37368702 nanoseconds. That means that the AMD and the Intel 3GHz show comparable results, as expected.
| I don't know what's going on here, but two things seem to be true: | | 1. The C++ code is faster on these machines. If I increase the loop count | to 1,000,000,000 I can clearly see the difference in execution time with my | eyes. Assuming the timings are correct, it's simply not possible to execute that number of instructions during that time, so there must be something going on here.
| 2. The Stopwatch class doesn't appear to work correctly on these machines - | it's measuring times that are orders of magnitude too short, yet still [quoted text clipped - 15 lines] | 58980368 ticks | 19322494.2832245 nanoseconds How many loops here?
| The times are now much more reasonable - Stopwatch apparently doesn't work | correctly with such a high value from QPF (it's apparently off by a factor | of 1000). This is really strange as Stopwatch uses the same QueryPerformanceCounter and Frequency under the hood.
| The ratio of times remains about equal though - the C++ code is | still nearly 2X faster on this machine (despite the fact that that makes no | sense at all, it seems to be true). Time to inspect the Stopwatch code, and I'll try to prepare a multicore or HT box to do some more tests.
wd.
Carl Daniel [VC++ MVP] - 13 Mar 2006 00:19 GMT > "Carl Daniel [VC++ MVP]" <cpdaniel_remove_this_and_nospam@mvps.org.nospam> > | If your machine uses the MP HAL (which mine does), then QPC uses the [quoted text clipped - 16 lines] > dividers > value. Not (necessarily) true. For example, this Pentium D machine uses a BCLK frequency of 200Mhz with a multiplier of 15. There's no requirement (imposed by the CPU or MCH) that the CPU clock be related to color burst frequency at all.
Now, it's entirely possible that the motherboard generates that 200Mhz BCLK by multiplying a color burst crystal by 56 (200.45Mhz), but that's a motherboard detail that's unrelated to the CPU. Without really digging, there's no way I can tell one way or another - just looking at the MB, I see at least 4 different crystal oscillators of unknown frequency. Historically, the only reason color burst crystals are used is that they're cheap - they're manufactured by the gazillion for NTSC televisions.
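[Editor's illustration, not part of the original post.] The clock arithmetic quoted in this sub-thread can be checked with a few lines of standard C++. The helper name is invented for the example; the base frequencies and multipliers are the ones mentioned by the posters (color burst x 56, color burst x 278, and a 200 MHz BCLK x 15).

```cpp
// Color burst frequency in Hz, as quoted in the thread (3.57954545 MHz).
constexpr double kColorBurstHz = 3579545.45;

// Derived clock = base clock times an integer multiplier.
constexpr double multiplied_clock_hz(double base_hz, int multiplier) {
    return base_hz * multiplier;
}

// Color burst x 56 lands at about 200.45 MHz, close to but not exactly 200 MHz.
static_assert(multiplied_clock_hz(kColorBurstHz, 56) > 200.4e6 &&
              multiplied_clock_hz(kColorBurstHz, 56) < 200.5e6, "about 200.45 MHz");
// Color burst x 278 gives the roughly 995 MHz figure quoted for the PIII.
static_assert(multiplied_clock_hz(kColorBurstHz, 278) > 995.0e6 &&
              multiplied_clock_hz(kColorBurstHz, 278) < 995.2e6, "about 995 MHz");
// A 200 MHz BCLK with a multiplier of 15 gives a 3 GHz core clock.
static_assert(multiplied_clock_hz(200.0e6, 15) == 3.0e9, "3 GHz");
```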
> | Working on the assumpting that #2 is true, I modified the code to call > | QueryPerformanceCounter/QueryPerformanceFrequency directly. Here are [quoted text clipped - 14 lines] > > How many loops here? That's 10,000,000 loops - 2.2 clock cycles per loop sounds like a pretty reasonable rate to me - certainly not off by orders of magnitude.
> | I don't know what's going on here, but two things seem to be true: > | [quoted text clipped - 7 lines] > number instructions during that time, so there must be something going on > here. It's completely reasonable based on the times reported directly by QPC, not the bogus values from Stopwatch, which is off by a factor of 1000 on these machines.
So, any theory why the C++ code consistently runs faster than the C# code on both of my machines? I can't think of any reasonable argument why having a dual core or HT CPU would make the C++ code run faster. Clearly the JIT'd code is different for the two loops - maybe there's some pathological code in the C# case that the P4 executes much more slowly than AMD, or some optimal code in the C++ case that the P4 executes much more quickly than AMD. I'd be curious to hear the details of Don's machine - Intel/AMD, Single/HT/Dual, etc.
-cd
Don Kim - 13 Mar 2006 01:25 GMT > So, any theory why the C++ code consistently runs faster than the C# code on > both of my machines? I can't think of any reasonable argument why having a [quoted text clipped - 4 lines] > AMD. I'd be curious to hear the details of Don's machine - Intel/AMD, > Single/HT/Dual, etc. Wow, this is becoming interesting. We're getting down to discussions of CPU architecture and instruction sets. Talk about getting down to the metal!
Anyway, I just reran my test code with larger loop factors, as well as the other code with my original and larger loop factors, and C++/CLI still came out around 2X faster.
I ran these both on my laptop and desktop. Here's the configuration:
Laptop: Pentium Centrino 1.86 GHz, 1 GB Ram, Windows XP Pro, SP 2 Desktop: Pentium 4, 2.8 GHz, 1 GB RAM, Windows XP Pro, SP2
I know someone who has an AMD computer, and I'm going to run my programs on that computer to see if there's something in the CPU that's causing the discrepancies.
-Don Kim
Willy Denoyette [MVP] - 13 Mar 2006 11:19 GMT | > So, any theory why the C++ code consistently runs faster than the C# code on | > both of my machines? I can't think of any reasonable argument why having a [quoted text clipped - 8 lines] | CPU architecture and instructions sets. Talk about getting down to the | metal! That's true: if you are running empty loops, you are not only comparing compiler optimizations, you are measuring architectural differences at the CPU, L1/L2 cache & memory controller level. That's also why such micro-benchmarks have little or no value.
| Anyway, I just reran my test code with larger loop factors, as well as | the other code with my original and larger loop factors, and C++/CLI [quoted text clipped - 4 lines] | Laptop: Pentium Centrino 1.86 GHz, 1 GB Ram, Windows XP Pro, SP 2 | Desktop: Pentium 4, 2.8 GHz, 1 GB RAM, Windows XP Pro, SP2 Just curious what the QPD is on the Centrino.
| I know someone who has an AMD computer, and I'm going to run my programs | on that computer to see if there's something in the CPU that's causing | the discrepencies. Well, I noticed that for debug builds, C++/CLI produces smaller IL, and different X86 code produced by the JIT for both C# and C++/CLI, here are the for loops...
X86 for C# (debug)
..
00000030 90                nop
00000031 90                nop
00000032 46                inc esi
00000033 81 FE 80 96 98 00 cmp esi,989680h
00000039 0F 9C C0          setl al
0000003c 0F B6 C0          movzx eax,al
0000003f 8B F8             mov edi,eax
00000041 85 FF             test edi,edi
00000043 75 EB             jne 00000030
...

X86 for C++/CLI (debug)
0000001f 46                inc esi
00000020 81 FE 80 96 98 00 cmp esi,989680h
00000026 7D 03             jge 0000002B
00000028 90                nop
00000029 EB F4             jmp 0000001F

An optimized C# build produces an even shorter code path:
..
0000001c 46                inc esi
0000001d 81 FE 80 96 98 00 cmp esi,989680h
00000023 7C F7             jl 0000001C
..
Now, while one would think that the run times would be better, they are not; all take the same time to finish.
The reason for this (AFAIK) is that superscalar CPUs like the AMD prefer longer code paths (longer than a cache line) in order to feed the instruction pipeline with longer bursts. I don't know how this behaves on the Intel Centrino and PIV HT, but it looks like they behave differently. (I'll try this with an assembly code program.)
Anyway, I don't care that much about this; empty loops are not that common, I guess (and C++ will hoist them anyway). Once you do something reasonable inside the loop, the loop overhead is reduced to dust and the pipeline gets filled in a more optimal way.
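[Editor's illustration, not part of the original post.] A minimal sketch of a loop that does some work and cannot be hoisted, written in standard C++ with std::chrono::steady_clock standing in for the .NET Stopwatch. The struct and function names are assumptions made for this example; the volatile accumulator is one common way to keep an optimizer from deleting the loop, not the only one.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

struct LoopTiming {
    std::uint64_t checksum;  // result of the work done inside the loop
    double milliseconds;     // wall-clock time measured around the loop
};

LoopTiming time_loop(std::uint32_t iterations) {
    // volatile forces every load and store to happen, so the compiler
    // cannot remove the loop the way it removes an empty one.
    volatile std::uint64_t sink = 0;

    const auto start = std::chrono::steady_clock::now();
    for (std::uint32_t i = 0; i < iterations; ++i) {
        sink = sink + i;
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    assert(elapsed.count() >= 0.0);  // steady_clock never goes backwards
    return LoopTiming{sink, elapsed.count()};
}
```

Calling time_loop(10000000) several times and averaging the milliseconds field mirrors the 10-run averages posted in this thread; checking the checksum field confirms that the loop body actually ran.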
Willy.
Willy Denoyette [MVP] - 13 Mar 2006 01:30 GMT | > "Carl Daniel [VC++ MVP]" <cpdaniel_remove_this_and_nospam@mvps.org.nospam> | > | If your machine uses the MP HAL (which mine does), then QPC uses the [quoted text clipped - 21 lines] | (imposed by the CPU or MCH) that the CPU clock be related to color burst | frequency at all. Carl, I'm not saying this is the case for all types of CPUs and motherboards; I'm only saying that it's true for Pentiums up to the III. Things are different for other types of CPUs. See, AMD clocks at 200MHz with a multiplier of 11 or 12 depending on the type (and CPU id); this 200MHz clock can be adjusted (overclocked or underclocked), yet the frequency returned by QueryPerformanceFrequency stays the same. The same is true for recent PIVs, Pentium M and D. So here it's true that the two aren't related: the 3.57954545MHz clock is derived from the on-board graphics controller, or from an external clock source (on the mobo or not) when there is no on-board graphics controller, but the value remains the same 3.57954545MHz unless you are using an MP HAL.
| Now, it's entirely possible that the motherboard generates that 200Mhz BCLK | by multipliying a color burst crystal by 56 (200.45Mhz), but that's a [quoted text clipped - 3 lines] | the only reason color burst crystals are used is that they're cheap - | they're manufactured by the gazillion for NTSC televisions. I know, Carl, I've been working for IHVs (HP, before that Compaq, before that DEC ...), so I know what you are talking about. Even on DEC Alpha (AXP) systems, the QueryPerformance frequency was 3.57954545MHz using the mono CPU HAL, while on SMP boxes like the Alpha 8400 range (with the MP HAL) that was also not the case. Jeez, what a bunch of problems we had when porting W2K (never released, for well-known reasons) from Intel code to AXP, just because some drivers and core OS components did not expect QueryPerformanceCounter speeds higher than 1GHz (that is, when we overclocked an 800MHz CPU).
| > | Working on the assumpting that #2 is true, I modified the code to call | > | QueryPerformanceCounter/QueryPerformanceFrequency directly. Here are [quoted text clipped - 17 lines] | That's 10,000,000 loops - 2.2 clock cycles per loop sounds like a pretty | resonable rate to me - certainly not off by orders of magnitude. Sure it is, I was wrong when reading the tick values (largely over midnight here, time to go to bed).
| > | I don't know what's going on here, but two things seem to be true: | > | [quoted text clipped - 22 lines] | | -cd Well, I have investigated the native code generated on the Intel PIV (see previous post). Here is (part of) the disassembly (VS2005) for C++:
...
0000001f 46                inc esi
00000020 81 FE 80 96 98 00 cmp esi,989680h
00000026 7D 03             jge 0000002B
00000028 90                nop ---> not sure what this one is good for, it's ignored by the CPU anyway
00000029 EB F4             jmp 0000001F
...
That means 4 instructions per loop compared to 6 on AMD. And the results are comparable to yours (for C++). I did not look at the C# code and its result, but the above shows that the JIT compiler generates (better?) code for PIV (I don't know what the __cpuid call returns, but I know the CLR checks it when booting). Again, notice this is an unoptimized code build (/Od flag set); optimized code is a totally different story.
Willy.
Willy Denoyette [MVP] - 13 Mar 2006 01:53 GMT || > "Carl Daniel [VC++ MVP]" | <cpdaniel_remove_this_and_nospam@mvps.org.nospam> [quoted text clipped - 135 lines] | | Willy. Last follow up, (before my spouse pulls the plugs). Here is the X86 output of a C# release build on both AMD and Intel PIV:
[1]
0000001c 46                inc esi
0000001d 81 FE 80 96 98 00 cmp esi,989680h
00000023 7C F7             jl
this results in 6.235684 msec on AMD and 7.023547 msec on PIV (10.000.000 loops).
while this is the debug build on Intel:
00000030 90                nop
00000031 90                nop
00000032 46                inc esi
00000033 81 FE 80 96 98 00 cmp esi,989680h
00000039 0F 9C C0          setl al
0000003c 0F B6 C0          movzx eax,al
0000003f 8B F8             mov edi,eax
00000041 85 FF             test edi,edi
00000043 75 EB             jne 00000030

See that the release build is the most optimal X86 code possible for the loop. The C++/CLI compiler in an optimized build hoists the loop completely, so I can't compare. Carl, could you look at the disassembly on your box? Not a problem if you can't (it doesn't mean that much anyway); it looks like on your box the C++/CLI output looks more like [1] above.
Willy.
Carl Daniel [VC++ MVP] - 13 Mar 2006 01:54 GMT > Pentium M and D. So here it's true that both aren't related, and the > 3.57954545MHz clock is derived from the on baord Graphics controller or an > external clock source (on mobo or not) when no on board graphics > controller, > but the value remains the same 3.57954545MHz unless you are using a MP > HAL. I'm 99.99% sure that my old P-II machine produced a QPC frequency of 1/2 color burst, or 1.7897727Mhz. But this particular branch has drifted far from the real point of this thread - interesting though (made me go look at the Pentium D data sheet, after all!)
-cd
Willy Denoyette [MVP] - 13 Mar 2006 02:16 GMT | > Pentium M and D. So here it's true that both aren't related, and the | > 3.57954545MHz clock is derived from the on baord Graphics controller or an [quoted text clipped - 7 lines] | from the real point of this thread - interesting though (made me go look at | the Pentium D data sheet, afterall!) Can't remember this, but I guess you are right; much depends on the chip set used. I was on the Alpha team at that time (where we built the AXP HALs and drivers), and I moved to Intel architectures after the Compaq merger ;-). Digital had their own chip sets for Alpha systems (that's why they were too expensive, right?), nothing commodity like what is available now.
Willy.
Tim Roberts - 13 Mar 2006 05:42 GMT "Carl Daniel [VC++ MVP]" <cpdaniel_remove_this_and_nospam@mvps.org.nospam> wrote:
>I'm 99.99% sure that my old P-II machine produced a QPC frequency of 1/2 >color burst, or 1.7897727Mhz. Nope, it was actually 1/3 of the color burst, 1.193182 MHz. The original PC had a 14.31818 MHz crystal (4x the color burst), and they divided it by 12 for the counter.
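[Editor's illustration, not part of the original post.] The divider chain described above, checked in standard C++. The constant names are invented for the example; the only outside fact assumed is the exact NTSC color burst definition of 315/88 MHz.

```cpp
// NTSC color burst (color subcarrier) frequency: 315/88 MHz, about 3.579545 MHz.
constexpr double kNtscColorBurstHz = 315.0e6 / 88.0;

// Original PC crystal: 4x the color burst, about 14.31818 MHz.
constexpr double kPcCrystalHz = 4.0 * kNtscColorBurstHz;

// Counter input: crystal divided by 12, about 1.193182 MHz.
constexpr double kPcCounterHz = kPcCrystalHz / 12.0;

// Dividing 4x the color burst by 12 is the same as 1/3 of the color burst.
static_assert(kPcCounterHz > 1193181.0 && kPcCounterHz < 1193183.0,
              "counter runs at about 1.193182 MHz");
static_assert(kPcCrystalHz > 14318181.0 && kPcCrystalHz < 14318183.0,
              "crystal runs at about 14.31818 MHz");
```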
 Signature - Tim Roberts, timr@probo.com Providenza & Boekelheide, Inc.
Carl Daniel [VC++ MVP] - 13 Mar 2006 06:24 GMT > r"Carl Daniel [VC++ MVP]" > <cpdaniel_remove_this_and_nospam@mvps.org.nospam> wrote: [quoted text clipped - 5 lines] > original PC had a 14.31818 MHz crystal (4x the color burst), and they > divided it by 12 for the counter. Yep. That sounds right - 1.789 just didn't feel quite right :)
-cd
Willy Denoyette [MVP] - 13 Mar 2006 10:29 GMT | r"Carl Daniel [VC++ MVP]" <cpdaniel_remove_this_and_nospam@mvps.org.nospam> | wrote: [quoted text clipped - 5 lines] | PC had a 14.31818 MHz crystal (4x the color burst), and they divided it by | 12 for the counter. Yep, an old 200MHz (199.261) P6 "Model 1, Stepping 7" of mine, gives a QPC of 1.193182 MHz, that is CPU clock/167.
Willy.
MichaelG - 13 Mar 2006 17:46 GMT Richard Grimes's article 'Is Managed Code Slower than Unmanaged Code' might be of interest. http://www.grimes.demon.co.uk/dotnet/man_unman.htm
It seems to indicate that there isn't much to choose between C# and C++/CLI. C# can be faster in some circumstances.
Michael
> Ok, so I posted a rant earlier about the lack of marketing for C++/CLI, > and it forked over into another rant about which was the faster compiler. [quoted text clipped - 57 lines] > > -Don Kim