.NET Forum / .NET Framework / CLR / April 2007
Managed vs Unmanaged Bare Bones Performance Test
adhingra - 19 Apr 2007 21:52 GMT
At our company we are currently at a decisive point to choose between managed and unmanaged code on the basis of their performance. I have read stuff about this on various blogs and other websites. Then I decided to take my own test as I am more concerned with basic performance at this point.
By basic I mean, just the basic stuff inside the CLR i.e. function calling cost, for loop, variable declaration, etc. Let us not consider GC, memory allocation costs, etc.
To my surprise the managed code I generated in my test through C# was lagging behind to a considerable degree when compared with the code generated by the C++ compiler.
I was wondering if someone could take a quick look at this and tell me why this is the case. I was under the assumption that, once JIT compilation has happened, the CLR would give the same performance as the native C++ compiler does (as we are talking about basic stuff only: no objects, just pure language constructs and primitive data types).
I created two sample console applications (one in C# and the other in C++). Both call a function from inside a for loop, passing an int by value. Nothing happens inside the function. I used the QueryPerformanceCounter and QueryPerformanceFrequency APIs for measurement. (The code is pasted at the bottom of this posting.)
Here are the results (for release mode running from console, with default settings in the IDE)
C# Test for loop (50000 iterations)   0.000023931 (23 microseconds)
C++ Test for loop (50000 iterations)  0.000000350 (0.35 microseconds)
So it's as if the C++ compiler is nearly 70 times faster than the managed CLR jitter (0.000023931 / 0.000000350 is about 68). And if I also remove the time taken by the QueryPerformanceCounter calls, the diff is even more.
Can anyone please elaborate.
Thanks adhingra
===========================================
C# Code PROGRAM.CS
===========================================
using System;
using System.Collections.Generic;
using System.Text;
using System.Runtime.InteropServices;

namespace ConsoleApp
{
    class Program
    {
        //API declarations for frequency timers
        [DllImport("kernel32.dll")]
        extern static short QueryPerformanceCounter(ref long x);
        [DllImport("kernel32.dll")]
        extern static short QueryPerformanceFrequency(ref long x);

        static long m_lStart = 0, m_lStop = 0, m_lFreq = 0;
        static long m_lOverhead = 0;
        static decimal m_mTotalTime = 0;

        static void Main(string[] args)
        {
            //get the CPU frequency
            QueryPerformanceFrequency(ref m_lFreq);

            //record the overhead for calling the performance counter API
            QueryPerformanceCounter(ref m_lStart);
            QueryPerformanceCounter(ref m_lStop);

            m_lOverhead = m_lStop - m_lStart;

            Console.WriteLine("Starting with a simple For Loop calling a simple function");

            QueryPerformanceCounter(ref m_lStart);
            for (int i = 0; i < 50000; i++)
            {
                Run(i);
            }
            QueryPerformanceCounter(ref m_lStop);

            long lDiff = m_lStop - m_lStart;
            Console.WriteLine(lDiff);
            //Comment or Uncomment the overhead lines to see the times drop
            //
            //if (lDiff > m_lOverhead)
            //{
            //    lDiff = lDiff - m_lOverhead;
            //}

            m_mTotalTime = ((Decimal)lDiff)/((Decimal)m_lFreq);
            Console.WriteLine(m_mTotalTime);

            Console.WriteLine("Press Enter to Continue");
            Console.ReadLine();
        }

        static void Run(int i)
        {
            //Console.WriteLine(i);
        }
    }
}
===============================================
C++ Code ConsoleApp.cpp
===============================================
// ConsoleApp.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"

void Run(int i)
{
    //printf("%d\n",i);
}

int _tmain(int argc, _TCHAR* argv[])
{
    LARGE_INTEGER m_start, m_stop, m_freq;
    ::QueryPerformanceFrequency(&m_freq);

    //record the overhead for calling the performance counter API
    ::QueryPerformanceCounter(&m_start);
    ::QueryPerformanceCounter(&m_stop);

    LONGLONG m_overhead = m_stop.QuadPart - m_start.QuadPart;
    m_start.QuadPart = 0;
    m_stop.QuadPart = 0;

    printf("%s\n","Starting with a simple For Loop calling a simple function");

    QueryPerformanceCounter(&m_start);
    for (int i = 0; i < 50000; i++)
    {
        Run(i);
    }
    QueryPerformanceCounter(&m_stop);

    LONGLONG lDiff = m_stop.QuadPart - m_start.QuadPart;
    //lDiff is a 64-bit value, so it needs %lld rather than %d
    printf("%lld\n",lDiff);
    //Comment or Uncomment the overhead lines to see the times drop
    //
    //if (lDiff > m_overhead)
    //{
    //    lDiff = lDiff - m_overhead;
    //}

    double totalTime = ((double)lDiff) / ((double)m_freq.QuadPart);
    printf("%15.15f\n",totalTime);

    printf("%s", "Press Enter to Continue");

    int c = getchar();
    return 0;
}
Willy Denoyette [MVP] - 19 Apr 2007 22:22 GMT
> At our company we are currently at a decisive point to choose between managed
> and unmanaged code on the basis of their performance. I have read stuff about
[quoted text clipped - 155 lines]
> return 0;
> }

This kind of benchmark is meaningless. The reason for the huge difference is that the C++ compiler hoists the loop (optimizes it away entirely), as it sees no sensible reason to call an empty function 50000 times. The C# compiler does not do this; it simply calls the function, which contains only a ret. So what you are comparing is the time taken to return from one QueryPerformanceCounter call plus the time to make the next one, against the time taken to call an empty function 50000 times.
Willy.
Ben Voigt - 19 Apr 2007 22:38 GMT
> This kind of benchmark is meaningless.
> The reason for the huge difference is that the C++ compiler hoists the
> loop, as it sees no sensible reason to call an empty function 50000 times,
> the C# compiler does not do this, it simply calls the function which only
> contains a ret.

Inlining and optimizing away a call to an empty function is well within the capabilities of the CLR JIT.
> So what you are comparing is the time taken for a return from
> QueryPerformanceCounter plus the time to call QueryPerformanceCounter,
> against the time taken to call 50000 times an empty function.
>
> Willy.

Jon Skeet [C# MVP] - 19 Apr 2007 22:42 GMT
> > This kind of benchmark is meaningless.
> > The reason for the huge difference is that the C++ compiler hoists the
[quoted text clipped - 4 lines]
> Inlining and optimizing away a call to an empty function is well within the
> capabilities of the CLR JIT.

That was my thought too. I suspect it'll still perform the loop iteration, however, whereas the C++ compiler may well have removed that loop completely, which still means it's not a good benchmark.
Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Ben Voigt - 19 Apr 2007 22:48 GMT
>> > This kind of benchmark is meaningless.
>> > The reason for the huge difference is that the C++ compiler hoists the
[quoted text clipped - 11 lines]
> iteration, however, whereas the C++ compiler may well have removed that
> loop completely, which still means it's not a good benchmark.

Oh, and if it's desired not to have the loop optimized away, touch a volatile variable from inside the function.
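
A minimal C++ sketch of the volatile suggestion above, not code from the thread. The names g_sink and RunLoop are made up for illustration. Every write to a volatile object counts as an observable side effect, so an optimizing compiler has to keep each store and therefore the loop around it.

```cpp
#include <cstdint>

// Hypothetical sink variable (not in the original post). Accesses to a
// volatile object may not be removed or merged by the optimizer.
volatile std::int32_t g_sink = 0;

// Same shape as the Run(int) in the thread, but with one volatile store
// so the call has an effect the compiler must preserve.
void Run(int i) {
    g_sink = i;
}

// Calls Run() `iterations` times and returns the last value stored.
std::int32_t RunLoop(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        Run(i);
    }
    return g_sink;
}
```

Timing RunLoop(50000) in place of the original empty loop measures 50000 stores plus the loop overhead, rather than a loop the compiler is free to delete.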
Willy Denoyette [MVP] - 20 Apr 2007 11:15 GMT
>>> > This kind of benchmark is meaningless.
>>> > The reason for the huge difference is that the C++ compiler hoists the
[quoted text clipped - 11 lines]
> Oh, and if it's desired not to have the loop optimized away, touch a volatile variable
> from inside the function.

True, with the following results:
Build                                Ticks  Seconds
C# (JIT32)                           244    0.0000681650880209635582175947
C# (JIT64)                           86     0.0000240253998762412541258735
C++ (O2 switch), 32 bit and 64 bit   164    0.000045815878834
As you can see, C++ is faster than the JIT32 code but slower than the JIT64 code for this particular case (50000 iterations). However, micro-benchmarks of this kind have absolutely no value. For instance, make the loop count an odd number (e.g. 49999) and you get the same results for C# JIT64 and C++. Note that the JIT32 is not known for being a great loop optimizer ;-).
Willy.
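
Each build in the results above prints two numbers: the raw counter delta, and that delta divided by the counter frequency. Here is a small C++ sketch of the conversion. The frequency of 3,579,545 Hz is an assumption: the thread never states it, and it is only inferred from the figures (244 / 0.0000681650880 is roughly 3,579,545). The function and constant names are made up for illustration.

```cpp
#include <cstdint>

// Assumed counter frequency in Hz, inferred from the posted figures.
// It is NOT given anywhere in the thread.
constexpr std::int64_t kAssumedFrequencyHz = 3579545;

// Converts a performance-counter tick delta to seconds.
double TicksToSeconds(std::int64_t ticks, std::int64_t frequency_hz) {
    return static_cast<double>(ticks) / static_cast<double>(frequency_hz);
}

// Duration of a single tick in seconds, i.e. the timer's resolution.
double SecondsPerTick(std::int64_t frequency_hz) {
    return 1.0 / static_cast<double>(frequency_hz);
}
```

At that assumed rate one tick lasts about 0.28 microseconds, so differences of a few dozen ticks sit close to the timer's resolution.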
Willy Denoyette [MVP] - 20 Apr 2007 10:27 GMT
>> This kind of benchmark is meaningless.
>> The reason for the huge difference is that the C++ compiler hoists the loop, as it sees
[quoted text clipped - 3 lines]
> Inlining and optimizing away a call to an empty function is well within the capabilities
> of the CLR JIT.

True for optimized builds: the call is hoisted, but the loop is not hoisted by the JIT, whereas the C++ compiler (optimized build) effectively hoists the loop. What's produced by the JIT depends on the version of the CLR.
This snippet of the code:

    for (int i = 0; i < 50000; i++)
    {
        Run(i);
    }

is turned into:

    xor r11d,r11d
    add r11d,4
    cmp r11d,0C350h
    jl 00000642`80150341 (jump to add r11d, 4 if less than)

by the JIT64, while the JIT32 (both v2 of the CLR) produces:

    xor eax,eax
    add eax,1
    cmp eax,0C350h
    jl 001e014b (jump to add eax, 1 if less than)
See the subtle difference: add r11d, 4 versus add eax, 1.

Here the JIT64 is cheating: it advances the counter by 4 on each pass, so it makes only a quarter of the loop trips. No big deal in this case, but I would prefer more consistent behavior across JIT versions. By that I mean: hoist the loop, or keep the loop as is, but don't cheat.
Willy.
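
To make the "add 4" versus "add 1" difference concrete, here is a hedged C++ sketch of a loop unrolled by four next to the plain loop. It illustrates the general technique only and makes no claim about how the JIT64 is implemented. Body, g_calls, LoopPlain and LoopUnrolledBy4 are hypothetical names. Each function returns the number of loop trips (branches back to the top) it made.

```cpp
// Hypothetical volatile counter so the calls are not optimized away.
volatile int g_calls = 0;

void Body() {
    g_calls = g_calls + 1;
}

// One call per trip: n trips for n calls (the "add eax, 1" shape).
int LoopPlain(int n) {
    int trips = 0;
    for (int i = 0; i < n; ++i) {
        Body();
        ++trips;
    }
    return trips;
}

// Four calls per trip (the "add r11d, 4" shape), followed by a
// remainder loop for counts that are not a multiple of four.
int LoopUnrolledBy4(int n) {
    int trips = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        Body();
        Body();
        Body();
        Body();
        ++trips;
    }
    for (; i < n; ++i) {
        Body();
        ++trips;
    }
    return trips;
}
```

With n = 50000 (0C350h) the unrolled version makes 12500 trips and no remainder work. With n = 49999 it needs three extra remainder trips, which shows why a count that is not a multiple of four is less convenient for this transformation.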
Ben Voigt - 19 Apr 2007 22:43 GMT
> Here are the results (for release mode running from console, with default
> settings in the IDE)
[quoted text clipped - 6 lines]
> the
> diff is even more

Did you actually measure the time for QueryPerf? OK, I see that you did. Those are native Win32 APIs; C++ will call them much faster than C#.
.35 microseconds is an extremely short time. Even 23 is too short for a useful benchmark. Run more iterations. In fact, run 50000 iterations first, ignoring the result, to force .NET to precompile everything. Then run a half billion or so iterations and compare the results.
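
A rough C++ sketch of the advice above: run a warm-up pass whose result is ignored, then time a much longer pass and report the cost per iteration. This is not code from the thread. It swaps in std::chrono::steady_clock for QueryPerformanceCounter so that it is portable, and Work, g_bench_sink, TimeLoop and Benchmark are hypothetical names. In native code the warm-up only warms caches; on the managed side it is what forces the JIT to compile the method before timing starts.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical volatile sink so the timed loop cannot be removed.
volatile std::int64_t g_bench_sink = 0;

inline void Work(std::int64_t i) {
    g_bench_sink = i;
}

// Times `iterations` calls of Work() and returns elapsed seconds.
double TimeLoop(std::int64_t iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (std::int64_t i = 0; i < iterations; ++i) {
        Work(i);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Runs a discarded warm-up pass, then a measured pass, and returns
// the average seconds per iteration of the measured pass.
double Benchmark(std::int64_t warmup, std::int64_t measured) {
    (void)TimeLoop(warmup);  // result deliberately ignored
    return TimeLoop(measured) / static_cast<double>(measured);
}
```

A call such as Benchmark(50000, 500000000) follows the numbers suggested in the post. Dividing by the iteration count also makes the fixed cost of the two timer calls negligible.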
adhingra - 19 Apr 2007 23:10 GMT
Sorry, I am late with my comments. Shortly after posting this, I realized that this is a problem with my test, as the C++ compiler is optimizing the whole thing away. (I looked at the disassembly.)
However, this does not make the benchmark obsolete; rather than measuring performance, it actually measured the smartness of the two compilers. I did some more research and talked to one of my colleagues here at work who is an expert with C++, and even tried making the code do more so that I could fool the C++ compiler into actually calling the function. But the guy is way too smart, and I was told the reason behind this extreme smartness is "Whole Program Optimization" offered by the VS 2005 linker.
If the compilation unit is different (i.e. my function is in a different cpp file), this would not have happened in VS2003, but 2005 is a beast of its own with this whole program optimization. The linker no longer just combines objs; it's more like an interpreter now, and smart enough to chop up objs.
But as Ben pointed out, inlining and optimizing are in the feature set of the jitter too. I think I know why the jitter in managed code does not do it: the jitter compiles one function at a time, and because of time constraints it does not have the luxury of checking the whole program to see whether the results of a function are used anywhere or not.
However I still think it should have jitted away an empty function.
Thanks All adhingra
Barry Kelly - 20 Apr 2007 00:06 GMT
I wish you wouldn't multipost.

> However this does not make the benchmark obsolete, rather than measuring the
> performance, it actually measured the smartness of the two compilers.

It measured how good the C++ compiler is at doing nothing, versus the .NET JIT compiler. I agree, C++ is good for nothing.
:)
> I did
> some more research and talked to one of my colleagues here at work who is an
> expert with C++ and even try making the code do more so that I can fool the
> C++ compiler to actually call the function. But the guy is way too smart and
> I was told the reason behind this extreme smartness is "Whole Program
> Optimization" offered by the VS 2005 Linker.

.NET necessarily does whole program optimization because compilation happens so late, but it is constrained by the amount of time it has to work with: compilation must occur quickly. Performance will improve over time, when .NET adds techniques that are common in Java, such as recompiling with more aggressive optimization after many iterations.
-- Barry
http://barrkel.blogspot.com/
Jon Skeet [C# MVP] - 20 Apr 2007 07:27 GMT
<snip>

> Performance will improve over time, when .NET adds techniques that
> are common in Java, such as recompiling with more aggressive
> optimization after many iterations.

It'll be interesting to see whether or not this ever happens. In Java, it made a huge difference, because by having dynamic optimisation (and de-optimisation) you can inline virtual methods until they're first overridden. That's really important when the language makes methods virtual by default, but not as important in a world which requires you to specify that methods are virtual (which at least C# does - not sure about VB.NET).
There are other improvements as well, of course, and it could improve start-up time (one would hope) but the effects won't be quite as huge as they were in the Java world.
Jon Skeet [C# MVP] - 20 Apr 2007 07:25 GMT
> I am late with my comments. Shortly after posting this, I realized that this
> is a problem with my test as the C++ compiler is optimizing the whole thing
> away. (Looked at the disassembly)
>
> However this does not make the benchmark obsolete, rather than measuring the
> performance, it actually measured the smartness of the two compilers.

It measures the smartness of the compilers in *one* particular situation. Do you often run a loop which does nothing? I know I don't.
<snip>
> However I still think it should have jitted away an empty function.

I strongly suspect that it did, by inlining. It just didn't optimise away the loop itself.