.NET Forum / Languages / Managed C++ / March 2006

C++/CLI the fastest compiler?  Yes, at least for me. :-)

Don Kim - 12 Mar 2006 02:36 GMT
Ok, so I posted a rant earlier about the lack of marketing for C++/CLI,
and it forked over into another rant about which was the faster
compiler.  Some said C# was just as fast as C++/CLI, whereas others said
C++/CLI was more optimized.

Anyway, I wrote up some very simple test code, and at least on my
computer C++/CLI came out the fastest.  Here's the sample code, and just
for good measure I wrote one in Java, and it was the slowest! ;-)  Also,
I did no optimizing compiler switches and compiled the C++/CLI with
/clr:safe only to compile to pure verifiable .net.

//C++/CLI code
using namespace System;

int main()
{
    long start = Environment::TickCount;
    for (int i = 0; i < 10000000; ++i) {}
    long end = Environment::TickCount;
    Console::WriteLine(end - start);
}

//C# code
using System;

public class ForLoopTest
{
    public static void Main(string[] args)
    {
        long start = Environment.TickCount;
        for (int i =0;i < 10000000; ++i) {}
        long end = Environment.TickCount;
        Console.WriteLine((end-start));
    }
}

//Java code
public class Performance
{
    public static void main(String args[])
    {
        long start = System.currentTimeMillis();
        for (int i=0; i < 10000000; ++i) {}
        long end = System.currentTimeMillis();
        System.out.println(end-start);
    }
}

Results:

C++/CLI -> 15-18 ms
C# -> 31-48 ms
Java -> 65-72 ms

I know, I know, these kinds of tests are not always foolproof, and results
can vary from computer to computer, but at least on my system, C++/CLI had
the fastest results.

Maybe C++/CLI is the most optimized compiler?

-Don Kim
Carl Daniel [VC++ MVP] - 12 Mar 2006 05:42 GMT
> C++/CLI -> 15-18 ms
> C# -> 31-48 ms
[quoted text clipped - 5 lines]
>
> Maybe C++/CLI is the most optimized compiler?

After increasing the length of the loops by a factor of 100, I see about a
2X speed advantage for C++/CLI as well.   Looking at the IL produced by the
two compilers for the respective main functions:

C++:

.method assembly static int32  main() cil managed
{
 // Code size       40 (0x28)
 .maxstack  2
 .locals (int32 V_0,
          int32 V_1,
          int32 V_2)
 IL_0000:  call       int32 [mscorlib]System.Environment::get_TickCount()
 IL_0005:  stloc.2
 IL_0006:  ldc.i4.0
 IL_0007:  stloc.0
 IL_0008:  br.s       IL_000e
// start of loop
 IL_000a:  ldloc.0
 IL_000b:  ldc.i4.1
 IL_000c:  add
 IL_000d:  stloc.0
 IL_000e:  ldloc.0
 IL_000f:  ldc.i4     0x3b9aca00
 IL_0014:  bge.s      IL_0018
 IL_0016:  br.s       IL_000a
// end of loop
 IL_0018:  call       int32 [mscorlib]System.Environment::get_TickCount()
 IL_001d:  stloc.1
 IL_001e:  ldloc.1
 IL_001f:  ldloc.2
 IL_0020:  sub
 IL_0021:  call       void [mscorlib]System.Console::WriteLine(int32)
 IL_0026:  ldc.i4.0
 IL_0027:  ret
} // end of method 'Global Functions'::main

C#:

.method public hidebysig static void  Main(string[] args) cil managed
{
 .entrypoint
 // Code size       47 (0x2f)
 .maxstack  2
 .locals init (int64 V_0,
          int32 V_1,
          int64 V_2,
          bool V_3)
 IL_0000:  nop
 IL_0001:  call       int32 [mscorlib]System.Environment::get_TickCount()
 IL_0006:  conv.i8
 IL_0007:  stloc.0
 IL_0008:  ldc.i4.0
 IL_0009:  stloc.1
 IL_000a:  br.s       IL_0012
// start of loop
 IL_000c:  nop
 IL_000d:  nop
 IL_000e:  ldloc.1
 IL_000f:  ldc.i4.1
 IL_0010:  add
 IL_0011:  stloc.1
 IL_0012:  ldloc.1
 IL_0013:  ldc.i4     0x3b9aca00
 IL_0018:  clt
 IL_001a:  stloc.3
 IL_001b:  ldloc.3
 IL_001c:  brtrue.s   IL_000c
// end of loop
 IL_001e:  call       int32 [mscorlib]System.Environment::get_TickCount()
 IL_0023:  conv.i8
 IL_0024:  stloc.2
 IL_0025:  ldloc.2
 IL_0026:  ldloc.0
 IL_0027:  sub
 IL_0028:  call       void [mscorlib]System.Console::WriteLine(int64)
 IL_002d:  nop
 IL_002e:  ret
} // end of method ForLoopTest::Main

The C++ compiler did generate more optimized IL.  It's surprising to me that
the JIT didn't do a better job of optimizing the C#-produced code.

Note that the C# code converted the time to a 64 bit value (C#'s long is 64
bits, while C++'s long is 32 bits), but that occurred outside the loop so it
should have next to no impact on the overall speed of the code.

-cd
Jochen Kalmbach [MVP] - 12 Mar 2006 07:51 GMT
Hi Carl!

> The C++ compiler did generate more optimized IL.  It's surprising to me that
> the JIT didn't do a better job of optimizing the C#-produced code.

Wasn't there a statement that the JIT for .NET 2.0 is not doing
optimizations (only simple optimizations)?
I just remember a blog entry from someone at blogs.msdn.com... but
couldn't find it anymore...

Greetings
  Jochen

   My blog about Win32 and .NET
   http://blog.kalmbachnet.de/

Jochen Kalmbach [MVP] - 12 Mar 2006 08:11 GMT
>> The C++ compiler did generate more optimized IL.  It's surprising to
>> me that the JIT didn't do a better job of optimizing the C#-produced
[quoted text clipped - 4 lines]
> I just remember a blog-entry from someone at blogs.msdn.com... but
> couldn't find it anymore...

Currently I could only find the confirmation of the "missing"
optimization for the CF. But I thought the same was true for the
"desktop" framework...

http://blogs.msdn.com/stevenpr/archive/2005/12/12/502978.aspx

<quote>
Because the CLR can throw away native code under memory pressure or when
an application moves to the background, it is quite possible that the
same IL code may need to be jit compiled again when the application
continues running.  This fact leads to our second major jit compiler
design decision: the time it takes to compile IL code often takes
precedence over the quality of the resulting native code.  As with all
good compilers, the Compact Framework jit compiler does some basic
optimizations, but because of the need to regenerate code quickly in
order for applications to remain responsive, more extensive
optimizations generally take a back seat to shear compilation speed.
</quote>

Greetings
  Jochen

   My blog about Win32 and .NET
   http://blog.kalmbachnet.de/

Andre Kaufmann - 12 Mar 2006 09:00 GMT
> Currently I could only find the confirmation of the "missing"
> optimization for the CF. But I thought the same was true for the
[quoted text clipped - 14 lines]
> optimizations generally take a back seat to shear compilation speed.
> </quote>

That may be true. But I wonder why there cannot be both ?
A fast IL compiler and one that is slow, but optimizes much better. E.g.
"ngen" could have a command line switch to generate more optimized code.

Andre
Shawn B. - 12 Mar 2006 12:39 GMT
> Wasn't there a statement that the JIT for .NET 2.0 is not doing
> optimizations (only simple optimizations) ?
> I just remember a blog-entry from someone at blogs.msdn.com... but
> couldn't find it anymore...

No, but there is a recent thread in this group where some MVPs insist that
the C++ compiler doesn't do optimized IL code and produces roughly what the
C# compiler does, despite the fact that your test, some VC++ devs,
publications, and my own internal software production have shown that the
C++/CLI compiler is the best optimized for IL of the MS stack.  That said,
the same MVP insists that some MS employees have stated that the C++/CLI
compiler leaves all the optimization to the JIT rather than front-end
optimizing.

Thanks,
Shawn
Jochen Kalmbach [MVP] - 12 Mar 2006 13:56 GMT
Hi Shawn!
>>Wasn't there a statement that the JIT for .NET 2.0 is not doing
>>optimizations (only simple optimizations) ?
[quoted text clipped - 9 lines]
> compiler leaves all the optimization to the JIT rather than front-end
> optimizing.

Really?
I thought the C++/CLI compiler does not care what code it is generating.
It always tries to optimize the "pseudo-code".

Nevertheless... I found neither docs saying that the JIT compiler does
optimization nor docs saying that it does not...

Greetings
  Jochen

   My blog about Win32 and .NET
   http://blog.kalmbachnet.de/

Andre Kaufmann - 12 Mar 2006 14:27 GMT
>> Wasn't there a statement that the JIT for .NET 2.0 is not doing
>> optimizations (only simple optimizations) ?
[quoted text clipped - 3 lines]
> No, but there is a recent thread in this group where some MVP's insist that
> the C++ compiler doesn't do optimized IL code and produces roughly what the

You mean the post where W.D. [MVP] gives an example showing that C++/CLI
doesn't do global optimization on IL code?

It does. IMHO the example is wrong. If I interpret the given example
correctly it's based on a call to an external DLL. So the C++/CLI
compiler must do an optimization over DLL boundaries ?! Since the DLL is
loaded dynamically, how should the C++/CLI compiler do any optimization ?

Why should the C++/CLI compiler not optimize the code ? I don't know how
the C++/CLI compiler is implemented, but I assume that the code
generation of native or CLI code is done by optimizing the generated
intermediate code, before native or managed code is generated. So that
(nearly) the same optimizer is used for "native code compiled to IL
code" and "native x86 code". If my assumption is true it would be plain
nonsense to revert this optimization, already done.

> C# compiler does, despite the fact that your test, some VC++ devs,
> publications, and my own internal software production has proved that the
> C++/CLI compiler is the best optimized for IL of the MS stack.  That said,
> the same MVP insists that some MS employees have stated that the C++/CLI
> compiler leaves all the optimization to the JIT rather than front-end
> optimizing.

If he gives a valid link to the statements, I will believe it. Which
doesn't mean that the statements are true.

> Thanks,
> Shawn

Andre
Carl Daniel [VC++ MVP] - 12 Mar 2006 16:35 GMT
> Why should the C++/CLI compiler not optimize the code ? I don't know
> how the C++/CLI compiler is implemented, but I assume that the code
[quoted text clipped - 3 lines]
> code" and "native x86 code". If my assumption is true it would be
> plain nonsense to revert this optimization, already done.

That is indeed the case.  There's a single front-end for both native and
managed code.  That front end produces CIL ('C' Intermediate Language) which
is then fed to the back end.  The back-end consists of target independent
parts (e.g. CIL optimizations) and target dependent parts (e.g. code
generation).

-cd
Eugene Gershnik - 12 Mar 2006 09:10 GMT
> I did no optimizing compiler switches

[...]

Then the test is meaningless. If you don't ask the compiler to optimize why
should it spend any effort on making your code fast?

[I don't have any stake in C++/CLI, C# or Java -- they can all die as far as
I am concerned -- my objection as an outsider is only about how you tested.]

Eugene
http://www.gershnik.com

Don Kim - 12 Mar 2006 13:18 GMT
> Then the test is meaningless. If you don't ask the compiler to optimize why
> should it spend any effort on making your code fast?

That was the whole point.  If I were to use optimizing options, there
would invariably be arguments that either I did not use the correct
ones, not in the proper order, that certain compiler switches are not
equivalent, etc., etc.  Therefore, I compiled as is w/out any options to
see how each complier would compile on its own.  I also made the test as
simple as possible so as to time how each compiler internally optimizes
a straight iteration of a common for loop.

In this case, it seems C++/CLI is the fastest in the managed Windows
environment.

-Don Kim
Eugene Gershnik - 12 Mar 2006 20:25 GMT
>> Then the test is meaningless. If you don't ask the compiler to
>> optimize why should it spend any effort on making your code fast?

[Rearranging your post a little]

> That was the whole point.

[...]

> In this case, it seems C++/CLI is the fastest in the managed Windows
> environment.

Let me see. Take a world record holder for a 100m dash and take me. Put us
both at the start of a 100m track and ask us to get to the end at whatever
pace we want. He walks. I run. I get there before him. Your conclusion seems
to be that I am a faster runner.

> If I were to use optimizing options, there
> would invariably be arguments that either I did not use the correct
> ones, not in the proper order, that certain compiler switches are not
> equivalent, etc., etc.

Yes, measuring compiler performance is hard. If you want to get meaningful
results you will need to study each one's options in detail, determine what
people usually set in their optimized builds, create a meaningful test set,
etc. If you don't do all this, announcing to the world that compiler X is
faster is a waste of electrons.

> I also made the
> test as simple as possible so as to time how each compiler internally
> optimizes a straight iteration of a common for loop.

You didn't ask them to optimize the loop.

Eugene
http://www.gershnik.com
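
As a rough illustration of the methodology concerns raised in this exchange, here is a minimal sketch in Java (one of the three languages in the original test). It is not code from anyone in the thread: the class and method names are made up, and it assumes that writing each loop result to a volatile field is enough to stop the work being discarded, which a sufficiently aggressive JIT could in principle still defeat.

```java
// Hypothetical timing harness; names are illustrative only.
final class LoopTiming {
    // Written after every run so the computed sum is observably used.
    static volatile long sink;

    // Unlike an empty loop, this loop produces a value the caller consumes.
    static long sumTo(int n) {
        long sum = 0;
        for (int i = 0; i < n; ++i) {
            sum += i;
        }
        return sum;
    }

    // Run the loop several times and keep the fastest run, in nanoseconds.
    static long bestOfNanos(int runs, int n) {
        long best = Long.MAX_VALUE;
        for (int r = 0; r < runs; ++r) {
            long start = System.nanoTime();
            sink = sumTo(n);
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed);
        }
        return best;
    }

    public static void main(String[] args) {
        int n = 10_000_000;   // same iteration count as the original test
        bestOfNanos(5, n);    // warm-up runs so the JIT has compiled the loop
        long best = bestOfNanos(10, n);
        System.out.println("best of 10 runs: " + best / 1_000_000.0 + " ms");
    }
}
```

Taking the best of several runs after a warm-up is one common convention, not the only defensible one; a median or a full distribution would serve equally well for comparing compilers.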

Willy Denoyette [MVP] - 12 Mar 2006 17:40 GMT
| Ok, so I posted a rant earlier about the lack of marketing for C++/CLI,
| and it forked over into another rant about which was the faster
[quoted text clipped - 57 lines]
|
| -Don Kim

Such a micro benchmark has little value; an empty loop will be hoisted in
optimized builds (you ain't gonna do this in real code, do you?).
More important is the way you measure execution time here; it is wrong. The
reason for this is that Environment.TickCount is updated with the real time
clock tick. That is every 10 msec or 15.6 msec or higher, depending on the
CPU type (Intel, AMD, variants...). For instance an AMD 64 ticks at an
interval of 15.5 msec, most Intel-based systems have an interval of 10 msec,
most SMP systems tick at 20 msec or higher.

To get accurate results you need to use the high performance counters or the
Stopwatch class  in V2.
Here is the adapted code:

// C# code
// csc /o- bcs.cs
using System;
using System.Diagnostics;
public class ForLoopTest
{
    public static void Main(string[] args)
    {
        long nanosecPerTick = (1000L*1000L*1000L) / Stopwatch.Frequency;

        Stopwatch sw = new Stopwatch();
        sw.Start();
        for (int i = 0; i < 10000000; ++i) {}
        sw.Stop();
        long ticks = sw.Elapsed.Ticks;
        Console.WriteLine("{0} nanoseconds", ticks * nanosecPerTick);
    }
}

// C++/CLI code
// cl /CLR:safe /Od bcc.cpp
#using <System.dll>
using namespace System;
using namespace System::Diagnostics;

int main()
{
    Int64 nanosecPerTick = (1000L * 1000L * 1000L) /
        System::Diagnostics::Stopwatch::Frequency;

    Stopwatch^ sw = gcnew Stopwatch;
    sw->Start();
    for (int i = 0; i < 10000000; ++i) {}
    sw->Stop();
    Int64 ticks = sw->Elapsed.Ticks;
    Console::WriteLine("{0} nanoseconds", ticks * nanosecPerTick);
}

On my system, the code above, built with the command lines given in the
source (both non-optimized), shows the following results:

C#
37714104 nanoseconds
C++/CLI
37389069 nanoseconds

That means both are equally fast, but again this means nothing; such micro
benchmarks have no value.

Note that an optimized C++ build will hoist the empty loop (removes it
completely from IL). This kind of hoisting is not done by the C# compiler,
and there is a reason for it.
That doesn't mean there is no loop optimization, it's just done at the JIT
level!!.

Willy.
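
The point about tick intervals can be checked empirically. The Java sketch below (names are illustrative, not from the thread) spins until a millisecond clock changes value to see how large each step is, and estimates how much a single step can distort a reading. System.currentTimeMillis() stands in for Environment.TickCount here; that substitution is an assumption, since the two clocks are not guaranteed to share a granularity.

```java
// Hypothetical helper for inspecting clock granularity.
final class ClockGranularity {

    // Spin until System.currentTimeMillis() changes and return the size of
    // that step in milliseconds (typically 1 or more, depending on the OS).
    static long observedStepMillis() {
        long first = System.currentTimeMillis();
        long next;
        do {
            next = System.currentTimeMillis();
        } while (next == first);
        return next - first;
    }

    // Rough worst-case relative error when a duration is read from a clock
    // that only advances every tickIntervalMs milliseconds.
    static double worstCaseRelativeError(double tickIntervalMs, double measuredMs) {
        return tickIntervalMs / measuredMs;
    }

    public static void main(String[] args) {
        System.out.println("observed clock step: " + observedStepMillis() + " ms");
        // 15.6 ms is the tick interval quoted in the post above.
        System.out.printf("possible error on a 16 ms reading:  %.0f%%%n",
                100 * worstCaseRelativeError(15.6, 16.0));
        System.out.printf("possible error on a 500 ms reading: %.1f%%%n",
                100 * worstCaseRelativeError(15.6, 500.0));
    }
}
```

A reading of 15-18 from the original test is about one tick wide, so the clock step alone could account for most of it, while a run of several hundred milliseconds is affected far less.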
Carl Daniel [VC++ MVP] - 12 Mar 2006 18:47 GMT
> wrong. The reason for this is that Environment.TickCount is updated
> with the real time clock tick. That is every 10 msec or 15.6 msec or
[quoted text clipped - 13 lines]
> C++/CLI
> 37389069 nanoseconds

Interesting.  I took Don's sample and increased the loop count by a factor of
100 and consistently got execution times of about 530ms for the C++ code and
1200ms for the C# code.

Granted, the resolution of GetTickCount is poor - but that's a large enough
difference to be significant.

Your results are actually much closer to what I expected - nearly identical
performance, but I can't see why replacing GetTickCount with StopWatch would
have any effect other than to increase the resolution of the time
measurement.

But... here's what I found with your examples:  First, I changed both to
calculate nanosecPerTick as a double instead of a long - on a system with a
tick rate higher than 1GHz, your calculation results in 0 all the time.

With that change, I get a time of  15.8us for the C++ code and 42.3us for
the C# code - about the same difference I saw with GetTickCount.

It seems that there's something significantly different about your machine
as compared to mine & Don's when it comes to the performance of this code -
and that is very interesting!

What's your machine hardware?    I'm running on a 3Ghz P4 with 1GB of RAM
under XP SP2.  I'm suspicious of your times (and mine as well) as I doubt my
machine is 2000 times faster than yours.

-cd
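
The truncation described in this post is easy to reproduce. The Java sketch below uses hypothetical names and the two counter frequencies that are reported elsewhere in the thread (3,579,545 Hz and 3,052,420,000 Hz); it only illustrates integer versus floating-point division and makes no claim about how Stopwatch works internally.

```java
// Hypothetical illustration of the nanosecPerTick calculation.
final class NanosPerTick {
    static final long NANOS_PER_SECOND = 1_000_000_000L;

    // Integer division, as in the posted code: the fraction is discarded.
    static long truncated(long frequencyHz) {
        return NANOS_PER_SECOND / frequencyHz;
    }

    // Floating-point division keeps the fractional part.
    static double exact(long frequencyHz) {
        return (double) NANOS_PER_SECOND / frequencyHz;
    }

    public static void main(String[] args) {
        long[] frequencies = { 3_579_545L, 3_052_420_000L };
        for (long f : frequencies) {
            System.out.println(f + " Hz: truncated=" + truncated(f)
                    + " exact=" + exact(f));
        }
    }
}
```

With the lower frequency the integer result is 279 instead of roughly 279.37, a small error; with any frequency above 1 GHz the integer result is 0, so every computed duration is 0 as well.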
Willy Denoyette [MVP] - 12 Mar 2006 20:07 GMT
| > wrong. The reason for this is that Environment.TickCount is updated
| > with the real time clock tick. That is every 10 msec or 15.6 msec or
[quoted text clipped - 17 lines]
| 100 and consistently got execution times of about 530ms for the C++ code and
| 1200ms for the C# code.

Are you sure you compiled non-optimized (/Od and /o-)? As I said, the C++
compiler will hoist the loop when optimization is on (O1, O2 or whatever).

| Granted, the resolution of GetTickCount is poor - but that's a large enough
| difference to be significant.

It's not the resolution, it's the interval which is the culprit.

| Your results are actually much closer to what I expected - nearly identical
| performance, but I can't see why replacing GetTickCount with StopWatch would
| have any effect other than to increase the resolution of the time
| measurement.

Stopwatch uses the QueryPerformanceCounter and QueryPerformanceFrequency
high resolution counters of the OS.

| But... here's what I found with your examples:  First, I changed both to
| calculate nanosecPerTick as a double instead of a long - on a system with a
| tick rate higher than 1GHz, your calculation results in 0 all the time.

That's very surprising, QueryPerformanceFrequency  (StopWatch.Frequency)
should not be that high, notice that this Frequency is not the CPU clock
frequency, it's the output of a CPU clock divider, its frequency is much
lower, on my System it's 3579545MHz (try with:
Console::WriteLine(System::Diagnostics::Stopwatch::Frequency);)
If on your system it's much higher than 1GHz, you might have an issue with
your system.

| With that change, I get a time of  15.8us for the C++ code and 42.3us for
| the C# code - about the same difference I saw with GetTickCount.

Hmmm, 15.8 µsec for 10000000 loops in which you execute 6 instructions
[1] per loop, that would mean 60000000 instructions in 15.8 µsec or 0.000263
nanosecs/instruction, or ~4.000.000.000.000 instructions/sec - not possible
really, looks like the loop is hoisted or your clock is broken ;-).

| It seems that there's something significantly different about your machine
| as compared to mine & Don's when it comes to the performance of this code -
| and that is very interesting!

Looks like you have to investigate the Frequency value returned first, and
inspect your code.

| What's your machine hardware?    I'm running on a 3Ghz P4 with 1GB of RAM
| under XP SP2.  I'm suspicious of your times (and mine as well) as I doubt my
| machine is 2000 times faster than yours.

I have it running on an AMD64 Athlon 3500+, 2GB, XP SP2, with CPU clock
throttling disabled.
Increasing the loop count by a factor 100 gives me:

3737032857 nanoseconds

or 3.7 seconds,
or 3737032857/1000000000 = 3.737032857 nsec/loop, or ~0.62 nsec per
instruction (avg.)

| -cd

[1]
00d100d2 83c201           add     edx,0x1
00d100d5 81fa80969800     cmp     edx,0x989680
00d100db 0f9cc0           setl    al
00d100de 0fb6c0           movzx   eax,al
00d100e1 85c0             test    eax,eax
00d100e3 75ed             jnz     00d100d2

notes:
- 0x989680 = 10.000.000 decimal
- this is native code, generated by the JIT in a non-optimized build.

Willy.
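
The plausibility check in this post, and the hex loop bounds that appear in the listings, can be reproduced with a few lines of arithmetic. The Java sketch below uses made-up names; the figure of 6 instructions per iteration is taken from the listing above and the timings are the ones quoted in the thread.

```java
// Hypothetical sanity check for reported loop timings.
final class ThroughputCheck {

    // Instructions per second implied by an iteration count, an instruction
    // count per iteration, and an elapsed time in seconds.
    static double instructionsPerSecond(long iterations,
                                        int instructionsPerIteration,
                                        double elapsedSeconds) {
        return iterations * (double) instructionsPerIteration / elapsedSeconds;
    }

    public static void main(String[] args) {
        // Loop bounds as they appear in the native code and IL listings.
        System.out.println(0x989680);    // prints 10000000
        System.out.println(0x3b9aca00);  // prints 1000000000

        // 10,000,000 iterations reported as taking 15.8 microseconds.
        System.out.printf("15.8 us reading:  %.3e instructions/s%n",
                instructionsPerSecond(10_000_000L, 6, 15.8e-6));

        // 1,000,000,000 iterations reported as taking 3.737032857 seconds.
        System.out.printf("3.737 s reading:  %.3e instructions/s%n",
                instructionsPerSecond(1_000_000_000L, 6, 3.737032857));
    }
}
```

The first figure comes out in the trillions of instructions per second, far beyond what a roughly 3 GHz CPU of the period could execute, while the second is on the order of a billion and a half, which is believable.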
Carl Daniel [VC++ MVP] - 12 Mar 2006 21:51 GMT
> "Carl Daniel [VC++ MVP]" <cpdaniel_remove_this_and_nospam@mvps.org.nospam>
> Are you sure you compiled non optimized (/Od and /o-)? As I said, the C++
> compiler will hoist the loop when optimization is on (O1, O2 or whatever).

Quite certain - I used the exact command lines given in your posting
(optimization is off by default as well, so specifying nothing is equivalent
to /Od).

> It's not the resolution, it's the interval which is the culprit.

We're talking about the same thing - 15ms precision is quite sufficient for
measuring intervals of 500ms or more and certainly won't account for a 50%
measurement error for such intervals - only 3% or so.

> That's very surprising, QueryPerformanceFrequency  (StopWatch.Frequency)
> should not be that high, notice that this Frequency is not the CPU clock
[quoted text clipped - 3 lines]
> If on your system it's much higher than 1GHz, you might have an issue with
> your system.

(You made a typo - on your system it's 3579545Hz, not MHz)

If your machine uses the MP HAL (which mine does), then QPC uses the RDTSC
instruction which does report actual CPU core clocks.  If your system
doesn't use the MP HAL, then QPC uses the system board timer, which
generally has a clock speed of 1X or 0.5X the NTSC color burst frequency of
3.57954545 Mhz.  Note that this rate has absolutely nothing to do with your
CPU clock - it's a completely independent crystal oscillator on the MB.

> | With that change, I get a time of  15.8us for the C++ code and 42.3us
> for
[quoted text clipped - 5 lines]
> possible
> really, looks like the loop is hoisted or your clock is broken ;-).

I agree - it doesn't add up.  I'm quite sure that I did unoptimized builds,
and the results are 100% reproducible.  But see below.

> | It seems that there's something significantly different about your
> machine
[quoted text clipped - 4 lines]
> Looks like you have to investigate the Frequency value returned first, and
> inspect your code.

Well, it's your code - not mine.  The Frequency value is right on for this
machine.

I'm at my office right now, on a different computer.  This one's a 3GHz
Pentium D.  I modified the samples as before to make nanosecPerTick double
instead of Int64 and added code to print the value of Stopwatch.Frequency
and the raw Ticks and nanosecPerTick.  Here are the results:

C:\Dev\Misc\fortest>fortest0312cs
Stopwatch frequency=3052420000
0.327608913583321 ns/tick
240117 ticks
78664.4695028862 nanoseconds

C:\Dev\Misc\fortest>fortest0312cpp
Stopwatch frequency=3052420000
0.327608913583321 ns/tick
49225 ticks
16126.548771139 nanoseconds

Increasing the loop count by a factor of 10 increases the times by a factor
of 10.  Decreasing by a factor of 10 decreases the times by a factor of 10.
Clearly the loop has not been optimized out, but that still doesn't explain
the apparent execution speed of more than 200 adds per clock cycle (I know
modern CPUs are somewhat super-scalar, but 200 adds/clock? I don't think
so!)

I don't know what's going on here, but two things seem to be true:

1. The C++ code is faster on these machines.  If I increase the loop count
to 1,000,000,000 I can clearly see the difference in execution time with my
eyes.
2. The Stopwatch class doesn't appear to work correctly on these machines -
it's measuring times that are orders of magnitude too short, yet still
proportional to the actual time spent.

Working on the assumption that #2 is true, I modified the code to call
QueryPerformanceCounter/QueryPerformanceFrequency directly.  Here are the
results:

C:\Dev\Misc\fortest>fortest0312cpp
QPC frequency=3052420000
0.327608913583321 ns/tick
22388910 ticks
7334806.48141475 nanoseconds

C:\Dev\Misc\fortest>fortest0312cs
QPC frequency=3052420000
0.327608913583321 ns/tick
58980368 ticks
19322494.2832245 nanoseconds

The times are now much more reasonable - Stopwatch apparently doesn't work
correctly with such a high value from QPF (it's apparently off by a factor
of 1000).  The ratio of times remains about equal though- the C++ code is
still nearly 2X faster on this machine (despite the fact that that makes no
sense at all, it seems to be true).

-cd
Carl Daniel [VC++ MVP] - 12 Mar 2006 22:03 GMT
"Carl Daniel [VC++ MVP]" <cpdaniel_remove_this_and_nospam@mvps.org.nospam>
wrote in message
> The times are now much more reasonable - Stopwatch apparently doesn't work
> correctly with such a high value from QPF (it's apparently off by a factor
> of 1000).  The ratio of times remains about equal though- the C++ code is
> still nearly 2X faster on this machine (despite the fact that that makes
> no sense at all, it seems to be true).

Follow-up -

It appears that Stopwatch scales the QPF/QPC values internally if the
frequency is "high", causing Stopwatch.ElapsedTicks to report a scaled
value, but Stopwatch.Frequency still reports the full resolution value
returned by QPF.

Stopwatch.ElapsedMilliseconds and Stopwatch.Elapsed both return correctly
scaled values.

This is clearly a bug in the Stopwatch class.

-cd
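
One possible reading of these numbers, which the thread itself does not confirm: in .NET, Stopwatch.Elapsed is a TimeSpan, and TimeSpan.Ticks counts fixed 100 ns units, whereas Stopwatch.Frequency describes the units of Stopwatch.ElapsedTicks. Multiplying Elapsed.Ticks by a factor derived from Frequency would then mix two different tick units. The Java sketch below only does the unit arithmetic with the figures posted above; the class and method names are hypothetical.

```java
// Hypothetical unit conversions for the two kinds of "tick".
final class TickUnits {

    // Counter ticks, in units of 1/frequencyHz seconds, to nanoseconds.
    static double counterTicksToNanos(long ticks, long frequencyHz) {
        return ticks * 1e9 / frequencyHz;
    }

    // TimeSpan-style ticks are fixed 100 ns units, whatever the counter rate.
    static long timeSpanTicksToNanos(long ticks) {
        return ticks * 100L;
    }

    public static void main(String[] args) {
        long frequencyHz = 3_052_420_000L;

        // C# run timed with QueryPerformanceCounter directly.
        System.out.println("QPC ticks as counter ticks:     "
                + counterTicksToNanos(58_980_368L, frequencyHz) + " ns");

        // C# run timed with Stopwatch, treating its ticks as counter ticks
        // (this reproduces the implausibly small figure).
        System.out.println("Stopwatch ticks as counter ticks: "
                + counterTicksToNanos(240_117L, frequencyHz) + " ns");

        // Same tick count, treated as 100 ns units instead.
        System.out.println("Stopwatch ticks as 100 ns units:  "
                + timeSpanTicksToNanos(240_117L) + " ns");
    }
}
```

Under the 100 ns interpretation the Stopwatch run comes to about 24 ms, the same order of magnitude as the roughly 19 ms measured with QueryPerformanceCounter, although the two figures come from separate runs and need not agree exactly.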
Willy Denoyette [MVP] - 13 Mar 2006 00:19 GMT
| "Carl Daniel [VC++ MVP]" <cpdaniel_remove_this_and_nospam@mvps.org.nospam>
| wrote in message
[quoted text clipped - 17 lines]
|
| -cd

I see, but it still doesn't explain this:

| C:\Dev\Misc\fortest>fortest0312cpp
| QPC frequency=3052420000
[quoted text clipped - 7 lines]
| 58980368 ticks
| 19322494.2832245 nanoseconds

Why is C++ almost 3 times faster than C#? Are we sure the ticks are
accurate, are we sure the OS counter is updated for every tick, Are we sure
the OS goes to the HAL to read the HW clock tick value at each call of
QueryPerformanceCounter (this must be quite expensive)?

And why is it 2 and 5 times faster than on my AMD box, while the results are
comparable (AMD a little faster) when I run it on Intel 3GHz non HT (see my
previous post) ?

That means that the native code must be different, while it is on my AMD box
(despite the fact that the IL is different).
add     edx,0x1
cmp     edx,0x989680
setl    al
movzx   eax,al
test    eax,eax
jnz     00d100d2

Which is not really the best algorithm for X86; I wonder what it looks like
on Intel. Grr.. micro benchmarks, what a mess ;-)

Willy.
Carl Daniel [VC++ MVP] - 13 Mar 2006 00:37 GMT
> That means that the native code must be different, while it is on my AMD
> box
[quoted text clipped - 8 lines]
> Which is not really the best algorithm for X86, wonder how it looks like on
> Intel. Grr.. micro benchmarks, what a mess ;-)

Here's what I see (loops going 1 billion times):

The JIT'd C++ code:
// ---------------------------------------------------
   for (int i = 0; i < 1000000000; ++i) {}
00000077  xor         edx,edx
00000079  mov         dword ptr [esp],edx
0000007c  nop
0000007d  jmp         00000082
// start of loop
0000007f  inc         dword ptr [esp]
00000082  cmp         dword ptr [esp],3B9ACA00h
00000089  jge         0000008E
0000008b  nop
0000008c  jmp         0000007F
// end of loop

The JIT'd C# code:
// ---------------------------------------------------
   for (int i =0;i < 1000000000; ++i) {}
00000098  xor         ebx,ebx
0000009a  nop
0000009b  jmp         000000A0
// start of loop
0000009d  nop
0000009e  nop
0000009f  inc         ebx
000000a0  cmp         ebx,3B9ACA00h
000000a6  setl        al
000000a9  movzx       eax,al
000000ac  mov         dword ptr [ebp-6Ch],eax
000000af  cmp         dword ptr [ebp-6Ch],0
000000b3  jne         0000009D
// end of loop

Neither of these represent ideal code by any stretch of the imagination -
but instruction count alone probably accounts for the bulk of the difference
between the two programs on this machine.   Why the results are so different
from what you see on your AMD machine I can't even guess.

-cd
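
To put rough numbers on the instruction-count argument, the Java sketch below divides the QueryPerformanceCounter tick counts posted earlier in the thread by the iteration count. It rests on several assumptions that the thread does not settle: that those runs used 10,000,000 iterations, that one counter tick equals one CPU cycle (as suggested for the MP HAL), and that the listings above correspond to the same builds. The loop bodies above contain 5 instructions (C++) and 9 instructions (C#) per iteration. All names are hypothetical.

```java
// Hypothetical per-iteration cost estimate from posted tick counts.
final class LoopCost {

    // Average counter ticks (assumed to be CPU cycles) spent per iteration.
    static double cyclesPerIteration(long ticks, long iterations) {
        return (double) ticks / iterations;
    }

    public static void main(String[] args) {
        long iterations = 10_000_000L;        // assumed loop count
        double cpp = cyclesPerIteration(22_388_910L, iterations);
        double cs = cyclesPerIteration(58_980_368L, iterations);

        System.out.println("C++ loop: 5 instructions, " + cpp + " cycles/iteration");
        System.out.println("C#  loop: 9 instructions, " + cs + " cycles/iteration");
        System.out.println("instruction ratio: " + 9.0 / 5.0);
        System.out.println("measured ratio:    " + cs / cpp);
    }
}
```

On these assumptions the instruction ratio is 1.8 while the measured ratio is about 2.6, so instruction count would explain a large part of the gap but not all of it.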
Willy Denoyette [MVP] - 13 Mar 2006 02:08 GMT
| > That means that the native code must be different, while it is on my AMD
| > box
[quoted text clipped - 50 lines]
|
| -cd

Thanks, that's almost exactly what I've noticed; see my previous reply.

C# Intel...
00000030 90 nop
00000031 90 nop
00000032 46 inc esi
00000033 81 FE 80 96 98 00 cmp esi,989680h
00000039 0F 9C C0 setl al
0000003c 0F B6 C0 movzx eax,al
0000003f 8B F8 mov edi,eax
00000041 85 FF test edi,edi
00000043 75 EB jne 00000030

C# AMD...
add     edx,0x1
cmp     edx,0x989680
setl    al
movzx   eax,al
test    eax,eax
jnz     00d100d2

Conclusion: the JIT takes the CPU type into account even in debug builds! So
it generates different X86 code even from the same IL.
This is extremely weird; for instance, the inc esi used on Intel is an
add edx, 1 on AMD,
so different register allocations and a different instruction. Well, I know
add on AMD is preferred over inc (according to their "Optimization guide for
AMD64 Processors"); can you believe MSFT went that far with the JIT (in
debug builds)?

Willy.
Carl Daniel [VC++ MVP] - 13 Mar 2006 04:25 GMT
> "Optimization guide for AMD64 Processors"), can you believe MSFT went
> that far with the JIT (in debug builds)?

Well, yeah.  Maybe.  I'm under the (possibly misguided) impression that
debug primarily stops the JIT from inlining and hoisting - things that
change the relative order of the native code compared to the IL code.
Within those guidelines, I guess it still picks the best codegen it can
based on the machine.

My belief is that there are multiple full-time Intel and AMD employees at
MSFT that do nothing but work on the compiler back-ends, including the CLR
JIT.

-cd
Willy Denoyette [MVP] - 13 Mar 2006 10:06 GMT
| > "Optimization guide for AMD64 Processors"), can you believe MSFT went
| > that far with the JIT (in debug builds)?
[quoted text clipped - 8 lines]
| MSFT that do nothing but work on the compiler back-ends, including the CLR
| JIT.

Well, I would expect this for the C++ compiler back-end, but not directly
for the JIT compiler which is more time constrained, but I guess I'm wrong.

Willy.
Willy Denoyette [MVP] - 13 Mar 2006 13:01 GMT
|| > "Optimization guide for AMD64 Processors"), can you believe MSFT went
|| > that far with the JIT (in debug builds)?
[quoted text clipped - 13 lines]
|
| Willy.

Some more fun.

Consider this program:

//C++/CLI code
// File : EmptyLoop.cpp
#using <System.dll>
using namespace System;
using namespace System::Diagnostics;
#pragma unmanaged
void ForLoopTest( void )
{
  __asm {
      xor esi,esi;      0 -> esi
      jmp begin;
  iter:;
      inc         esi; i++
  begin:;
      cmp         esi,989680h ; i < 10000000?
  jl iter;          no
  }
  return;
}
#pragma managed
int main()
{
  Int64 nanosecPerTick = (1000L * 1000L * 1000L) /
System::Diagnostics::Stopwatch::Frequency;
  Stopwatch^ sw = gcnew Stopwatch;
  sw->Start();
  ForLoopTest();
  sw->Stop();
  Int64 ticks = sw->Elapsed.Ticks;
  Console::WriteLine("{0} nanoseconds", ticks * nanosecPerTick);
}

Compiled with:
cl /clr /O2 EmptyLoop.cpp
output:
24935346 nanoseconds

cl /clr /Od EmptyLoop.cpp
output:
37636821 nanoseconds

See, the loop is in assembly, pure unmanaged X86 code. The code produced by
the C++ compiler [1] is the same except for the function prolog and epilog,
although the results are different. Any takers?

[1]
/Od build

void ForLoopTest( void )
{
00401000 55               push        ebp
00401001 8B EC            mov         ebp,esp
00401003 56               push        esi
  __asm {
  xor esi,esi;  0 -> esi
00401004 33 F6            xor         esi,esi
  jmp begin;
00401006 EB 01            jmp         begin (401009h)
  iter:;
  inc         esi; i++
00401008 46               inc         esi
  begin:;
  cmp         esi,989680h ; < 10000000?
00401009 81 FE 80 96 98 00 cmp         esi,989680h
  jl iter;  no
0040100F 7C F7            jl          iter (401008h)
  }
  return;
}
00401011 5E               pop         esi
00401012 5D               pop         ebp
00401013 C3               ret

/O2 build

void ForLoopTest( void )
{
00401000 56               push        esi
  xor esi,esi;  0 -> esi
00401001 33 F6            xor         esi,esi
  jmp begin;
00401003 EB 01            jmp         begin (401006h)
  iter:;
  inc         esi; i++
00401005 46               inc         esi
  begin:;
  cmp         esi,989680h ; < 10000000?
00401006 81 FE 80 96 98 00 cmp         esi,989680h
  jl iter;  no
0040100C 7C F7            jl          iter (401005h)
  __asm {
0040100E 5E               pop         esi
  }
  return;
}
0040100F C3               ret

Willy.
Willy Denoyette [MVP] - 13 Mar 2006 14:05 GMT
Ok, final update.
The Stopwatch.Ticks is broken, so the calculated nanoseconds are incorrect
on all platforms.

Using Stopwatch.Elapsed.Milliseconds gives the following results.

Values are averages for 10 runs.

C#         ~12.8 msec. for 10.000.000 loops
C++/CLI    ~9.1 msec.

Release build:

C#         ~9.1 msec.
C++/CLI     - loop hoisted by C++/CLI compiler (no IL body)

The X86 code for the loop in the C++/CLI /Od build and the C# optimized build
is nearly the same (different registers allocated, and inc instead of add).

Now this:

#using <System.dll>
using namespace System;
using namespace System::Diagnostics;
#pragma unmanaged
void ForLoopTest( void )
{
  __asm {
  xor esi,esi;  0 -> esi
  jmp begin;
  iter:;
  inc         esi; i++
  begin:;
  cmp         esi,100000000  ; < 100000000?
  jl iter;  no
  }
  return;
}
#pragma managed
int main()
{

  Stopwatch^ sw = gcnew Stopwatch;
  sw->Reset();
  sw->Start();
  ForLoopTest();
  sw->Stop();

  Int64 ms = sw->Elapsed.Milliseconds;
  Console::WriteLine("{0} msec.", ms);
}

compiled with:
cl /clr /Od bcca.cpp
output: for 100.000.000 loops!!
avg. 135 msec.

cl /clr /O2 bcca.cpp
output: for 100.000.000 loops!!
avg. 91 msec.

Notice that the C# optimized build gives the same result as the optimized
C++/CLI build with the loop in assembly.
The question remains why the debug build is that much slower. My guess is
that the CLR starts some extra work when running debug builds; IMO there is
a GC/Finalizer run after the call to Stopwatch.Start and before the loop
runs. That would explain the different behavior (better results) on an HT
CPU: there the finalizer runs on a second core or logical CPU and doesn't
disturb the user thread, whereas on a single CPU core the finalizer pre-empts
the user thread.
I'll try to get a HW analyzer from the lab to check this; it is simply not
possible to check with SW tools alone.

Willy.
Willy Denoyette [MVP] - 13 Mar 2006 20:21 GMT
| Ok, final update.
| The Stopwatch.Ticks is broken, so the calculated nanoseconds are incorrect
| on all platforms.

Followup.
!!! Stopwatch.Elapsed.Ticks != Stopwatch.ElapsedTicks !!!

One should not use Elapsed.Ticks to calculate the elapsed time in
nanoseconds.
The only correct way to get this high precision count is by using
Stopwatch.ElapsedTicks like this:

long nanosecPerTick = (1000L*1000L*1000L) / Stopwatch.Frequency;
...
long ticks = sw.ElapsedTicks;
Console.WriteLine("{0} nanoseconds", ticks * nanosecPerTick);

or use Stopwatch.ElapsedMilliseconds.

Note that the Stopwatch code is not broken, the code I posted used
Stopwatch.Elapsed.Ticks which is wrong in this context.
Sorry for all the confusion.
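
To make the unit mix-up concrete, here is a minimal sketch in plain standard
C++ (not C++/CLI, so there is no Stopwatch here; the function names are made
up for illustration). It rests on two assumptions: TimeSpan ticks are 100 ns
units, and the box that printed 24935346 "nanoseconds" had a counter
frequency of 3579545 Hz. This thread reports that frequency for non-MP-HAL
machines, but it is not confirmed for that particular run.

```cpp
#include <cassert>
#include <cstdint>

// Elapsed.Ticks is a TimeSpan tick count: fixed 100 ns units (assumption
// stated in the lead-in), independent of the performance counter frequency.
constexpr std::int64_t kNanosPerTimeSpanTick = 100;

// Whole nanoseconds per raw counter tick, truncated like the posted code.
std::int64_t nanosPerCounterTick(std::int64_t frequencyHz) {
    assert(frequencyHz > 0);
    return 1000000000LL / frequencyHz;
}

// What the posted code printed: a TimeSpan tick count scaled as if it were
// a raw counter tick count.
std::int64_t misreportedNanos(std::int64_t timeSpanTicks, std::int64_t frequencyHz) {
    return timeSpanTicks * nanosPerCounterTick(frequencyHz);
}

// The conversion that matches the unit of Elapsed.Ticks.
std::int64_t timeSpanTicksToNanos(std::int64_t timeSpanTicks) {
    return timeSpanTicks * kNanosPerTimeSpanTick;
}
```

Under those assumptions misreportedNanos(89374, 3579545) gives back the
24935346 that was printed, while the same 89374 TimeSpan ticks are really
8937400 ns, about 8.9 msec.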

Willy.
Willy Denoyette [MVP] - 13 Mar 2006 21:14 GMT
|| Ok, final update.
|| The Stopwatch.Ticks is broken, so the calculated nanoseconds are incorrect
[quoted text clipped - 20 lines]
|
| Willy.

Mystery solved, finally :-).

A C++/CLI debug build (/Od flag - the default) does not generate sequence
points in IL; however, it generates optimized IL.
A sequence point is used to mark a spot in the IL code that corresponds to a
specific location in the original source.  If you look at the IL generated
by C# when compiled with /o-, you'll notice the nop's inserted in the
stream. These nop's are used by the JIT to produce sequence points, but the
/o- flag doesn't produce optimized IL. To get the same behavior in C# as
/Od in C++/CLI, you need to set /debug+ /o+. This generates debug builds
without the nop's that trigger the sequence points, just like C++/CLI does.
The "empty loop" C# sample compiled with /debug+ /o+ runs just as fast as
the C++/CLI sample built with /Od. The IL produced is identical.

Willy.
Carl Daniel [VC++ MVP] - 13 Mar 2006 21:28 GMT
> Mystery solved, finally :-).
>
[quoted text clipped - 13 lines]
> as
> the C++/CLI sample built with /Od. The IL produced is identical.

Good sleuthing!   In the end, they really ought to be about the same -
having the C++ code execute 2x faster just didn't make sense.

-cd
Ajay Kalra - 13 Mar 2006 21:50 GMT
This is very useful info. It was causing confusion given mixed
information coming from MSFT itself.

--------
Ajay Kalra
ajaykalra@yahoo.com
Willy Denoyette [MVP] - 13 Mar 2006 22:36 GMT
| This is very useful info. It was causing confusion given mixed
| information coming from MSFT itself.
|
| --------
| Ajay Kalra
| ajaykalra@yahoo.com

Well, the C++/CLI team did not want to generate explicit sequence points in
the IL, so the JIT compiler can only rely on the implicit sequence points
(that is, when the evaluation stack is empty). That also means that it's not
possible to synchronise the IL with the actual code while debugging C++/CLI
in managed mode, and you need the PDB to set breakpoints in your code; not a
big deal IMO.

Willy.
Carl Daniel [VC++ MVP] - 13 Mar 2006 21:23 GMT
> Followup.
> !!! Stopwatch.Elapsed.Ticks != Stopwatch.ElapsedTicks !!!

A ha!  I obviously hadn't looked at the code closely enough to realize that
it was using Elapsed.Ticks and not ElapsedTicks.

> One should not use Elapsed.Ticks to calculate the elapsed time in
> nanoseconds.

True - one should use it to calculate the elapsed time in 0.1us units, since
that's what TimeSpan.Ticks is expressed as.

> The only correct way to get this high precision count is by using
> Stopwatch.ElapsedTicks like this:
>
> long nanosecPerTick = (1000L*1000L*1000L) / Stopwatch.Frequency;

but make this a double.  Stopwatch.Frequency is more than 1E9 on modern
machines using the MP HAL.

double nanosecPerTick = 1000.0 * 1000L * 1000L / Stopwatch.Frequency;
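
A small sketch of why the integer version breaks, written in plain standard
C++ rather than C++/CLI (function names are hypothetical). With a counter
frequency above 1E9 the integer division truncates to zero, so every
measurement comes out as 0 nanoseconds; the double version keeps the
fraction. The 3052420000 Hz and 49225 ticks are the figures quoted earlier
in this thread, and the sketch only reproduces that arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Integer nanoseconds per tick, as in the original posts.
// Truncates to 0 as soon as frequencyHz exceeds 1,000,000,000.
std::int64_t truncatedNanosPerTick(std::int64_t frequencyHz) {
    assert(frequencyHz > 0);
    return 1000000000LL / frequencyHz;
}

// Floating-point nanoseconds per tick, as suggested above.
double nanosPerTick(std::int64_t frequencyHz) {
    assert(frequencyHz > 0);
    return 1e9 / static_cast<double>(frequencyHz);
}

// Converts a raw counter tick count to nanoseconds without truncation.
double ticksToNanos(std::int64_t ticks, std::int64_t frequencyHz) {
    return static_cast<double>(ticks) * nanosPerTick(frequencyHz);
}
```

For example, truncatedNanosPerTick(3052420000LL) is 0, while
ticksToNanos(49225, 3052420000LL) comes out at roughly 16126.55.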

-cd
Willy Denoyette [MVP] - 13 Mar 2006 21:57 GMT
| > Followup.
| > !!! Stopwatch.Elapsed.Ticks != Stopwatch.ElapsedTicks !!!
[quoted text clipped - 17 lines]
|
| double nanosecPerTick = 1000.0 * 1000L * 1000L / Stopwatch.Frequency;

Sure, or use picoseconds :-)

long picosecPerTick = 1000L * 1000L * 1000L * 1000L / Stopwatch.Frequency;

90614831400 picoseconds
Looks really crazy, doesn't it?

Willy.
Carl Daniel [VC++ MVP] - 14 Mar 2006 01:01 GMT
> Sure, or use picoseconds :-)

Nah - that's short sighted.  Let's standardize on Attoseconds :)

long attosecPerTick = 1000L * 1000L * 1000L * 1000L * 1000L * 1000L
/Stopwatch.Frequency;

now that's just getting silly... for the next few decades at least.

-cd
Willy Denoyette [MVP] - 15 Mar 2006 10:46 GMT
| > Sure, or use picoseconds :-)
|
[quoted text clipped - 6 lines]
|
| -cd

LOL, I'll keep it in mind for a next life maybe :-)

Willy.
Willy Denoyette [MVP] - 12 Mar 2006 23:49 GMT
| > "Carl Daniel [VC++ MVP]" <cpdaniel_remove_this_and_nospam@mvps.org.nospam>
| > Are you sure you compiled non optimized (/Od and /o-)? As I said, the C++
[quoted text clipped - 9 lines]
| measuring intervals of 500ms or more and certainly won't account for a 50%
| measurement error for such intervals - only 3% or so.

Yes, but not for a loop of 10.000.000 (as in Don's code), which takes only
37 msecs. to complete. And as I said, on SMP systems this interval can
be as large as 60 msecs. (as I have measured here on a Compaq Proliant 8 way
system).

| > That's very surprising, QueryPerformanceFrequency  (StopWatch.Frequency)
| > should not be that high, notice that this Frequency is not the CPU clock
[quoted text clipped - 5 lines]
|
| (You made a typo - on your system it's 3579545Hz, not MHz)

Right, sorry for that.

| If your machine uses the MP HAL (which mine does), then QPC uses the RDTSC
| instruction which does report actual CPU core clocks.  If your system
| doesn't use the MP HAL, then QPC uses the system board timer, which
| generally has a clock speed of 1X or 0.5X the NTSC color burst frequency of
| 3.57954545 Mhz.  Note that this rate has absolutely nothing to do with your
| CPU clock - it's a completely independent crystal oscillator on the MB.

True, the MP HAL uses the external CPU clock (yours runs at 3.052420000 GHz),
but the 3.57954545 MHz clock is derived from a divider or, otherwise stated,
the CPU clock (internal) is always a multiple of this 3.57954545 MHz; for
instance an Intel PIII 1GHz stepping 5 clocks at 3.57954545 MHz * 278 =
995MHz. The stepping number is important here, as it may change the divider's
value.

No, my current test machine is not an MP or HT box, so it doesn't use an MP
HAL, and you didn't specify that in your previous reply either; it's quite
important, as I know about the MP HAL.

| > | With that change, I get a time of  15.8us for the C++ code and 42.3us
| > for
[quoted text clipped - 20 lines]
| Well, it's your code - not mine.  The Frequency value is right on for this
| machine.

Well ..., it's Don's code. What do you mean with the Frequency value is
right? The Frequency is also right on mine :-).

| I'm at my office right now, on a different computer.  This one's a 3GHz
| Pentium D.  I modified the samples as before to make nanosecPerTick double
| instead of Int64 and added code to print the value of Stopwatch.Frequency
| and the raw Ticks and nanosecPerTick.  Here are the results:

| C:\Dev\Misc\fortest>fortest0312cs
| Stopwatch frequency=3052420000
[quoted text clipped - 7 lines]
| 49225 ticks
| 16126.548771139 nanoseconds

That's for 10000000 loops I assume.

| Increasing the loop count by a factor of 10 increases the times by a factor
| of 10.  Decreasing by a factor of 10 decreases the times by a factor of 10.
| Clearly the loop has not been optimized out, but that still doesn't explain
| the apparent execution speed of more than 200 adds per clock cycle (I know
| modern CPUs are somewhat super-scalar, but 200 adds/clock? I don't think
| so!)

That's not possible; an Intel Pentium IV CPU fetches and executes 2
instructions per cycle.
The AMD Athlon 64 fetches and executes a max. of 3 instructions per cycle
(mine clocks at 2.2GHz).

These are the results on PIV 3GHz not HT running W2K3 R2.
C#
Frequency = 3579545
46632867 nanoseconds

C++
Frequency = 3579545
40659177 nanoseconds

Notice the difference between C++ and C#; it looks like the X86 JIT'd code is
not exactly the same, I have to check this.
Remember the result on AMD 64 bit (XP SP2) - 37368702 nanoseconds; that
means that the AMD and the Intel 3GHz show comparable results, as expected.

| I don't know what's going on here, but two things seem to be true:
|
| 1. The C++ code is faster on these machines.  If I increase the loop count
| to 1,000,000,000 I can clearly see the difference in execution time with my
| eyes.

Assuming the timings are correct, it's simply not possible to execute that
number of instructions during that time, so there must be something going on
here.

| 2. The Stopwatch class doesn't appear to work correctly on these machines -
| it's measuring times that are orders of magnitude too short, yet still
[quoted text clipped - 15 lines]
| 58980368 ticks
| 19322494.2832245 nanoseconds

How many loops here?

| The times are now much more reasonable - Stopwatch apparently doesn't work
| correctly with such a high value from QPF (it's apparently off by a factor
| of 1000).

This is really strange as Stopwatch uses the same QueryPerformanceCounter
and Frequency under the hood.

| The ratio of times remains about equal though - the C++ code is
| still nearly 2X faster on this machine (despite the fact that that makes no
| sense at all, it seems to be true).

Time to inspect the Stopwatch code, and I'll try to prepare a multicore or HT
box to do some more tests.

wd.
Carl Daniel [VC++ MVP] - 13 Mar 2006 00:19 GMT
> "Carl Daniel [VC++ MVP]" <cpdaniel_remove_this_and_nospam@mvps.org.nospam>
> | If your machine uses the MP HAL (which mine does), then QPC uses the
[quoted text clipped - 16 lines]
> dividers
> value.

Not (necessarily) true.  For example, this Pentium D machine uses a BCLK
frequency of 200Mhz with a multiplier of 15.  There's no requirement
(imposed by the CPU or MCH) that the CPU clock be related to color burst
frequency at all.

Now, it's entirely possible that the motherboard generates that 200Mhz BCLK
by multiplying a color burst crystal by 56 (200.45Mhz), but that's a
motherboard detail that's unrelated to the CPU. Without really digging,
there's no way I can tell one way or another - just looking at the MB, I see
at least 4 different crystal oscillators of unknown frequency. Historically,
the only reason color burst crystals are used is that they're cheap -
they're manufactured by the gazillion for NTSC televisions.

> | Working on the assumpting that #2 is true, I modified the code to call
> | QueryPerformanceCounter/QueryPerformanceFrequency directly.  Here are
[quoted text clipped - 14 lines]
>
> How many loops here?

That's 10,000,000 loops - 2.2 clock cycles per loop sounds like a pretty
reasonable rate to me - certainly not off by orders of magnitude.

> | I don't know what's going on here, but two things seem to be true:
> |
[quoted text clipped - 7 lines]
> number instructions during that time, so there must be something going on
> here.

It's completely reasonable based on the times reported directly by QPC, not
the bogus values from Stopwatch, which is off by a factor of 1000 on these
machines.

So, any theory why the C++ code consistently runs faster than the C# code on
both of my machines?  I can't think of any reasonable argument why having a
dual core or HT CPU would make the C++ code run faster.  Clearly the JIT'd
code is different for the two loops - maybe there's some pathological code
in the C# case that the P4 executes much more slowly than AMD, or some
optimal code in the C++ case that the P4 executes much more quickly than
AMD.   I'd be curious to hear the details of Don's machine - Intel/AMD,
Single/HT/Dual, etc.

-cd
Don Kim - 13 Mar 2006 01:25 GMT
> So, any theory why the C++ code consistently runs faster than the C# code on
> both of my machines?  I can't think of any reasonable argument why having a
[quoted text clipped - 4 lines]
> AMD.   I'd be curious to hear the details of Don's machine - Intel/AMD,
> Single/HT/Dual, etc.

Wow, this is becoming interesting.  We're getting down to discussions of
CPU architecture and instruction sets.  Talk about getting down to the
metal!

Anyway, I just reran my test code with larger loop factors, as well as
the other code with my original and larger loop factors, and C++/CLI
still came out around 2X faster.

I ran these both on my laptop and desktop.  Here's the configuration:

Laptop: Pentium Centrino 1.86 GHz, 1 GB Ram, Windows XP Pro, SP 2
Desktop: Pentium 4, 2.8 GHz, 1 GB RAM, Windows XP Pro, SP2

I know someone who has an AMD computer, and I'm going to run my programs
on that computer to see if there's something in the CPU that's causing
the discrepancies.

-Don Kim
Willy Denoyette [MVP] - 13 Mar 2006 11:19 GMT
| > So, any theory why the C++ code consistently runs faster than the C# code on
| > both of my machines?  I can't think of any reasonable argument why having a
[quoted text clipped - 8 lines]
| CPU architecture and instructions sets.  Talk about getting down to the
| metal!

That's true; if you are running empty loops, you are not only comparing
compiler optimizations, you are measuring architectural differences at the
CPU, L1/L2 cache & memory controller level. That's also why such
micro-benchmarks have little or no value.

| Anyway, I just reran my test code with larger loop factors, as well as
| the other code with my original and larger loop factors, and C++/CLI
[quoted text clipped - 4 lines]
| Laptop: Pentium Centrino 1.86 GHz, 1 GB Ram, Windows XP Pro, SP 2
| Desktop: Pentium 4, 2.8 GHz, 1 GB RAM, Windows XP Pro, SP2

Just curious what the QPF is on the Centrino.

| I know someone who has an AMD computer, and I'm going to run my programs
| on that computer to see if there's something in the CPU that's causing
| the discrepencies.

Well, I noticed that for debug builds C++/CLI produces smaller IL, and that
the X86 code produced by the JIT differs between C# and C++/CLI. Here are the
for loops...

X86 for C# (debug)
..
00000030 90               nop
00000031 90               nop
00000032 46               inc         esi
00000033 81 FE 80 96 98 00 cmp         esi,989680h
00000039 0F 9C C0         setl        al
0000003c 0F B6 C0         movzx       eax,al
0000003f 8B F8            mov         edi,eax
00000041 85 FF            test        edi,edi
00000043 75 EB            jne         00000030
...

X86 for C++/CLI (debug)

0000001f 46               inc         esi
00000020 81 FE 80 96 98 00 cmp         esi,989680h
00000026 7D 03            jge         0000002B
00000028 90               nop
00000029 EB F4            jmp         0000001F

An optimized C# build produces an even shorter code path:
..
0000001c 46 inc esi
0000001d 81 FE 80 96 98 00 cmp esi,989680h
00000023 7C F7 jl 0000001C
..

Now, while one would think that the run times would be better, they are not;
all take the same time to finish.

The reason for this (AFAIK) is that superscalar CPUs like AMD's prefer longer
code paths (longer than a cache line) in order to feed the instruction
pipeline with longer bursts. I don't know how this behaves on Intel Centrino
and PIV HT, but it looks like they behave differently. (I'll try this with
an assembly code program.)

Anyway, I don't care that much about this; empty loops are not that common, I
guess (and C++ will hoist them anyway). Once you do something reasonable
inside the loop, the loop overhead is reduced to dust and the pipeline gets
filled in a more optimal way.
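
For anyone who wants to time a counting loop in native C++ without the
optimizer hoisting it away, here is a minimal sketch in plain standard C++
(not C++/CLI; spin and timeSpinNanos are made-up names). It forces the loop
to survive by making the counter volatile and times it with
std::chrono::steady_clock. This is only an illustration of the hoisting
issue, not the code that was measured in this thread.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Counts up to `iterations` through a volatile counter so that an optimizing
// compiler has to keep the loop; returns the final counter value.
std::int64_t spin(std::int64_t iterations) {
    assert(iterations >= 0);
    volatile std::int64_t i = 0;
    while (i < iterations) {
        i = i + 1;
    }
    return i;
}

// Times one call to spin() with a monotonic clock; returns nanoseconds.
std::int64_t timeSpinNanos(std::int64_t iterations) {
    const auto start = std::chrono::steady_clock::now();
    spin(iterations);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Keep in mind that the volatile counter lives in memory, so each iteration
does a load and a store. The numbers from timeSpinNanos(10000000) are
therefore not comparable with the register-only inc/cmp/jl loops shown in
the disassembly listings.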

Willy.
Willy Denoyette [MVP] - 13 Mar 2006 01:30 GMT
| > "Carl Daniel [VC++ MVP]" <cpdaniel_remove_this_and_nospam@mvps.org.nospam>
| > | If your machine uses the MP HAL (which mine does), then QPC uses the
[quoted text clipped - 21 lines]
| (imposed by the CPU or MCH) that the CPU clock be related to color burst
| frequency at all.

Carl, I'm not saying this is the case for all types of CPUs and mother
boards, I only say that it's true for Pentiums up to III; things are
different for other types of CPUs. See, AMD clocks at 200MHz with a
multiplier of 11 or 12 depending on the type (and CPU id); this 200MHz clock
can be adjusted (overclocked or underclocked), yet the Frequency returned by
QueryPerformanceFrequency stays the same. The same is true for recent PIV's,
Pentium M and D. So here it's true that both aren't related, and the
3.57954545MHz clock is derived from the on-board graphics controller, or from
an external clock source (on mobo or not) when there is no on-board graphics
controller, but the value remains the same 3.57954545MHz unless you are using
an MP HAL.

| Now, it's entirely possible that the motherboard generates that 200Mhz BCLK
| by multipliying a color burst crystal by 56 (200.45Mhz), but that's a
[quoted text clipped - 3 lines]
| the only reason color burst crystals are used is that they're cheap -
| they're manufactured by the gazillion for NTSC televisions.

I know, Carl, I've been working for IHV's (HP, before that Compaq, before
that DEC ...), I know what you are talking about. Even on DEC Alpha (AXP)
systems, the QueryPerformance frequency was 3.57954545MHz using the mono CPU
HAL, while on SMP boxes like the Alpha 8400 range (with the MP HAL) it was
also not the case. Jeez, what a bunch of problems we had when porting W2K
(never released for well known reasons) from Intel code to AXP, just because
some drivers and core OS components did not expect QueryPerformanceCounter
speeds higher than 1GHz (that is when we overclocked an 800MHz CPU).

| > | Working on the assumpting that #2 is true, I modified the code to call
| > | QueryPerformanceCounter/QueryPerformanceFrequency directly.  Here are
[quoted text clipped - 17 lines]
| That's 10,000,000 loops - 2.2 clock cycles per loop sounds like a pretty
| resonable rate to me - certainly not off by orders of magnitude.

Sure it is, I was wrong when reading the tick values (it's well past midnight
here, time to go to bed).

| > | I don't know what's going on here, but two things seem to be true:
| > |
[quoted text clipped - 22 lines]
|
| -cd

Well, I have investigated the native code generated on the Intel PIV (see
previous post).
Here is (part of) the disassembly (VS2005) for C++:
...
0000001f 46 inc esi
00000020 81 FE 80 96 98 00 cmp esi,989680h
00000026 7D 03 jge 0000002B
00000028 90 nop ---> not sure what this one is good for, it's ignored by the
CPU anyway
00000029 EB F4 jmp 0000001F
...

That means 4 instructions per loop compared to 6 on AMD.
And the results are comparable to yours (for C++).
I did not look at the C# code and its result, but the above shows that the JIT
compiler generates (better?) code for PIV (I don't know what the __cpuid call
returns, but I know the CLR checks it when booting). Again, notice this is
an unoptimized build (/Od flag set); optimized code is a totally
different story.

Willy.
Willy Denoyette [MVP] - 13 Mar 2006 01:53 GMT
|| > "Carl Daniel [VC++ MVP]"
| <cpdaniel_remove_this_and_nospam@mvps.org.nospam>
[quoted text clipped - 135 lines]
|
| Willy.

Last follow up, (before my spouse pulls the plugs).
Here is the X86 output of a C# release build on both AMD and Intel PIV:
[1]
0000001c 46 inc esi
0000001d 81 FE 80 96 98 00 cmp esi,989680h
00000023 7C F7 jl

this results in 6.235684 msec on AMD and 7.023547 msec on PIV (10.000.000
loops).

while this is the debug build on Intel:

00000030 90 nop
00000031 90 nop
00000032 46 inc esi
00000033 81 FE 80 96 98 00 cmp esi,989680h
00000039 0F 9C C0 setl al
0000003c 0F B6 C0 movzx eax,al
0000003f 8B F8 mov edi,eax
00000041 85 FF test edi,edi
00000043 75 EB jne 00000030

See that the release build is the most optimal X86 code possible for the
loop. The C++/CLI compiler in an optimized build hoists the loop completely,
so I can't compare.
Carl, could you look at the disassembly on your box? Not a problem if you
can't (it doesn't mean that much anyway); it looks like on your box the
C++/CLI output looks more like [1] above.
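
As a sanity check on the release-build timings quoted above, here is a
minimal sketch in plain standard C++ that turns a wall-clock time into
average clock cycles per loop iteration (cyclesPerIteration is a made-up
name). It assumes the core clocks mentioned elsewhere in this thread, 2.2GHz
for the AMD and 3GHz for the PIV, and assumes they stayed constant during
the run; neither is verified here.

```cpp
#include <cassert>
#include <cstdint>

// Average core clock cycles spent per loop iteration.
// elapsedMs:  measured wall-clock time in milliseconds
// cpuHz:      assumed constant core clock in Hz
// iterations: number of loop iterations that were timed
double cyclesPerIteration(double elapsedMs, double cpuHz, std::int64_t iterations) {
    assert(iterations > 0);
    const double elapsedSeconds = elapsedMs / 1000.0;
    return elapsedSeconds * cpuHz / static_cast<double>(iterations);
}
```

With those assumed clocks, cyclesPerIteration(6.235684, 2.2e9, 10000000)
comes out at about 1.4 cycles and cyclesPerIteration(7.023547, 3.0e9,
10000000) at about 2.1 cycles for the three-instruction loop in [1].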

Willy.
Carl Daniel [VC++ MVP] - 13 Mar 2006 01:54 GMT
> Pentium M and D. So here it's true that both aren't related, and the
> 3.57954545MHz clock is derived from the on baord Graphics controller or an
> external clock source (on mobo or not) when no on board graphics
> controller,
> but the value remains the same 3.57954545MHz  unless you are using a MP
> HAL.

I'm 99.99% sure that my old P-II machine produced a QPC frequency of 1/2
color burst, or 1.7897727Mhz.  But this particular branch has drifted far
from the real point of this thread - interesting though (made me go look at
the Pentium D data sheet, after all!)

-cd
Willy Denoyette [MVP] - 13 Mar 2006 02:16 GMT
| > Pentium M and D. So here it's true that both aren't related, and the
| > 3.57954545MHz clock is derived from the on baord Graphics controller or an
[quoted text clipped - 7 lines]
| from the real point of this thread - interesting though (made me go look at
| the Pentium D data sheet, afterall!)

Can't remember this, but I guess you are right, much depends on the chip set
used. I was on the Alpha team at that time (where we built the AXP HAL's and
drivers); I moved to Intel architectures after the Compaq merger ;-). Digital
had their own chip sets for Alpha systems (that's why they were too
expensive, right?), nothing commodity like what is available now.

Willy.
Tim Roberts - 13 Mar 2006 05:42 GMT
"Carl Daniel [VC++ MVP]" <cpdaniel_remove_this_and_nospam@mvps.org.nospam>
wrote:

>I'm 99.99% sure that my old P-II machine produced a QPC frequency of 1/2
>color burst, or 1.7897727Mhz.

Nope, it was actually 1/3 of the color burst, 1.193182 MHz.  The original
PC had a 14.31818 MHz crystal (4x the color burst), and they divided it by
12 for the counter.
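
The divider arithmetic is easy to check. A minimal sketch in plain standard
C++ (the constant and function names are made up for illustration), using
only the 14.31818 MHz crystal figure given above:

```cpp
#include <cassert>

// Crystal frequency of the original PC as stated above, in MHz.
constexpr double kOriginalPcCrystalMHz = 14.31818;

// Frequency obtained by dividing a source clock by an integer divisor.
double dividedClockMHz(double sourceMHz, int divisor) {
    assert(divisor > 0);
    return sourceMHz / divisor;
}
```

dividedClockMHz(kOriginalPcCrystalMHz, 4) gives the 3.579545 MHz color burst
frequency, and dividedClockMHz(kOriginalPcCrystalMHz, 12) gives about
1.193182 MHz for the counter, i.e. one third of the color burst.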

- Tim Roberts, timr@probo.com
 Providenza & Boekelheide, Inc.

Carl Daniel [VC++ MVP] - 13 Mar 2006 06:24 GMT
> "Carl Daniel [VC++ MVP]"
> <cpdaniel_remove_this_and_nospam@mvps.org.nospam> wrote:
[quoted text clipped - 5 lines]
> original PC had a 14.31818 MHz crystal (4x the color burst), and they
> divided it by 12 for the counter.

Yep.  That sounds right - 1.789 just didn't feel quite right :)

-cd
Willy Denoyette [MVP] - 13 Mar 2006 10:29 GMT
| "Carl Daniel [VC++ MVP]" <cpdaniel_remove_this_and_nospam@mvps.org.nospam>
| wrote:
[quoted text clipped - 5 lines]
| PC had a 14.31818 MHz crystal (4x the color burst), and they divided it by
| 12 for the counter.

Yep, an old 200MHz (199.261) P6 "Model 1, Stepping 7" of mine, gives a QPC
of 1.193182 MHz, that is CPU clock/167.

Willy.
MichaelG - 13 Mar 2006 17:46 GMT
Richard Grimes's article 'Is Managed Code Slower than Unmanaged Code' might
be of interest.
http://www.grimes.demon.co.uk/dotnet/man_unman.htm

It seems to indicate that there isn't much to choose between C# and C++/CLI.
C# can be faster in some circumstances.

Michael

> Ok, so I posted a rant earlier about the lack of marketing for C++/CLI,
> and it forked over into another rant about which was the faster compiler.
[quoted text clipped - 57 lines]
>
> -Don Kim


©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.