.NET Forum / Languages / C# / January 2008
Is there a faster hi resolution timer for diy profiling
|
|
Thread rating:  |
colin - 15 Jan 2008 16:46 GMT Hi,
Im trying to time some parts of my code using the Stopwatch class, but it is so slow, I tried the QueryPerformanceFrequency() but this seems to be just as slow, if I just simply call this function in a loop it takes 21 ticks with a reported frequency of 3.5mhz its is about 6us wich is over 12k instructions on my 2ghz cpu. this negates the idea of having a high res timer ...
Is there any way to access the timer more directly ? I know the cpu actually has a very high resolution timer, it would be nice to be able to just read this value directly, maybe il have to write a c++ library and do it in asm thats assuming I can get the OS to grant me acces to that hardware address.
basicaly im making a diy profiler, I have a profile class wich needs Enter(enum p) funtion at the start and a Leave() function at the end of some functions of interest.
primarily it keeps tarck of the number of calls made, and a lists the calling functions counts too. it also measures the total time spent in the function and the subtracts the time spent in subroutins to provide an actual time too.
It works pretty quick for counts but when I use the timer it is like 20 times slower, wich although I can live with waiting 20 minutes to collect the data it does mean that there is a considerable error for short functions,
I have added a fudge wich subtracts a fiddle value from the timer, wich is increased by 21 ticks every times its called, this might be close but its far from ideal.
Im trying to find out why/where its spending 75% of the cpu time in mscorlib and mscorwks libraries, the amd codeanalyst identifies its spending most of its time in a function called mscorwks::gc_heap::adjust_limit_clr
thanks Colin =^.^=
Nicholas Paldino [.NET/C# MVP] - 15 Jan 2008 17:06 GMT Colin,
There is no way that you are going to get access to the hardware through user-mode code. You will have to write a driver of some sort which has kernel mode access in order to access that value. You definitely can't do it in C#, as you can't write assemblies with exportable functions (which would most definitely be required).
When you use the Stopwatch class, are you checking the static IsHighResolution property? If so, does it return true or false? If it returns true, then your hardware supports the high-resolution performance counter, if not, then the hardware simply does not support it. The Stopwatch class calls the QueryPerformanceFrequency method through the P/Invoke layer to find this out, as well as to get values from the performance counter, which is always going to add some overhead to the calls.
You could create a native dll which would eliminate the overhead in making those calls to the underlying APIs, but then you would have to use the P/Invoke layer to make calls to YOUR native dll, which is going to re-add the overhead.
The point is that the Stopwatch class is about as good as you are going to get for any managed solution.
 Signature - Nicholas Paldino [.NET/C# MVP] - mvp@spam.guard.caspershouse.com
> Hi, > [quoted text clipped - 40 lines] > thanks > Colin =^.^= colin - 15 Jan 2008 20:31 GMT thanks Nicholas
yes im sure high is resolution supported, I looked in the debuger at the flag, and I dump the frequency wich is 3.7~mhz, wich also means its high resolution.
I dont know much about the layers under c#, but it seems to be spending about 10k instructions in the call wich is a lot !
its unfurtunate this p/invoke overhead is there, does unsafe code help can you have shared memory ?
or maybe the low resolution timer is quicker to look at ? i might investigate ...
another option would be to do the informtation exchange the other way, ie. a high priority routine such as a hardware timer interrupt updates a variable in c#, however although this probably has a similar performance hit the timer would only need to update the variable about 100k times a second, wheres I need to read it many times more than this.
maybe I could get away with a windows form timer wich is set to 1ms. hardly high resolution but it might average out ok over enough samples. would have to be in a very high priority thread, would this work or would it likely get delayed if other threads running continuosly ?
Colin =^.^=
> Colin, > [quoted text clipped - 65 lines] >> thanks >> Colin =^.^= Willy Denoyette [MVP] - 15 Jan 2008 20:46 GMT > Hi, > [quoted text clipped - 40 lines] > thanks > Colin =^.^= Please post your code, there must be something wrong with it, QueryPerformanceFrequency() can't take 6µsec to complete. Note that there is no need to call QueryPerformanceFrequency(), this value exposed by the Stopwatch class. Willy.
colin - 15 Jan 2008 21:54 GMT I just copied the code from here http://msdn2.microsoft.com/en-us/library/aa964692(VS.80).aspx
and oops I meant the QueryPerformanceCounter() call is taking so long ...
I just have a loop wich calls it repeatedly 100000 times to measure how long it takes.
the Stopwatch also took just as long.
Colin =^.^=
> Please post your code, there must be something wrong with it, > QueryPerformanceFrequency() > can't take 6µsec to complete. > Note that there is no need to call QueryPerformanceFrequency(), this value > exposed by the Stopwatch class. > Willy. Willy Denoyette [MVP] - 15 Jan 2008 22:14 GMT >I just copied the code from here > http://msdn2.microsoft.com/en-us/library/aa964692(VS.80).aspx [quoted text clipped - 5 lines] > > the Stopwatch also took just as long. Hmmm.. this code is meant for the compact framework (WinCE). The exact PInvoke declaration for Windows is:
[DllImport("kernel32"),SuppressUnmanagedCodeSecurity] static extern int QueryPerformanceFrequency(out long freq);
but as I said in another reply QueryPerformanceFrequency is an expensive call.
Willy.
colin - 15 Jan 2008 22:25 GMT >>I just copied the code from here >> http://msdn2.microsoft.com/en-us/library/aa964692(VS.80).aspx [quoted text clipped - 14 lines] > but as I said in another reply QueryPerformanceFrequency is an expensive > call. Ah yes i found that out after I copied it and had to go search for coredll.dll it seems its in kernel32 for non ce I aded the supress security too in the hope it might speed it up ...
seems these two calls are easy to mix up when writing them lol... i just call frequency func once, and the counter func about 100million times
thanks Colin =^.^=
Willy Denoyette [MVP] - 15 Jan 2008 22:09 GMT >> Hi, >> [quoted text clipped - 47 lines] > exposed by the Stopwatch class. > Willy. Well, QueryPerformanceFrequency must transition into the kernel, which is quite expensive, (> 6000 instructions), so this function may well take +4 µsec on a 2Ghz machine. The same is true for QueryPerformanceCounter, so you need to account for this overhead when profiling.
Windows Vista and WS2008 have a new API QueryThreadCycleTime which returns the CPU cycles count. This API calls the __rdtsc(); VC++ compilers intrinsic function, you can implement a small C function that calls this intrinsic and call this one from C# using PInvoke.
Willy.
Willy Denoyette [MVP] - 15 Jan 2008 22:43 GMT >>> Hi, >>> [quoted text clipped - 61 lines] > > Willy. // C function that uses an intrinsic to retrieve the Processor's cycle counter register // Beware your CPU may not support this! // Compile from the command line: cl /EHsc /O2 <thisfile.cpp> #include <intrin.h> #pragma intrinsic(__rdtsc) extern "C" __declspec(dllexport) unsigned __int64 __stdcall rdtsc() { return __rdtsc(); }
// C# [DllImport("thisfile.dll"),SuppressUnmanagedCodeSecurity] static extern ulong rdtsc();
... ulong cycles = rdtsc();
Willy.
colin - 16 Jan 2008 01:02 GMT > "Willy Denoyette [MVP]" <willy.denoyette@telenet.be> wrote in message
> // C function that uses an intrinsic to retrieve the Processor's cycle > counter register [quoted text clipped - 16 lines] > > Willy. thanks that worked, however it stil takes 9000 clock cycles wich I think is roughly about the same time ... is that PInvoke realy so time consuming ? what does it have todo ?
thanks Colin =^.^=
colin - 16 Jan 2008 01:22 GMT >> "Willy Denoyette [MVP]" <willy.denoyette@telenet.be> wrote in message > [quoted text clipped - 23 lines] > is that PInvoke realy so time consuming ? > what does it have todo ? Ive just used Environment.TickCount, its only 1ms resolution but it seems to be instant.
as ive plenty enough samples it seems to give some results still ...
thanks
Colin =^.^=
Willy Denoyette [MVP] - 16 Jan 2008 07:52 GMT >>> "Willy Denoyette [MVP]" <willy.denoyette@telenet.be> wrote in message >> [quoted text clipped - 26 lines] > Ive just used Environment.TickCount, its only 1ms resolution > but it seems to be instant. To do what? Using TickCount to measure something that takes less that the RTC interval will return the same count, the reason is that TickCount is updated at the rithm of the system clock (every 10 or 15.6 msec. or more....).
Willy.
Willy. Willy.
colin - 16 Jan 2008 11:18 GMT >>>> "Willy Denoyette [MVP]" <willy.denoyette@telenet.be> wrote in message >>> [quoted text clipped - 33 lines] > > Willy. well yes one would think so wich is why I wanted a high res timer, but if you call it 1000 times and only 1 call it changes by 1 then you know on average its taken 1us. ofc this means its subject to random noise in the results. the usuble resolution is timer_resolution/sqrt(number sampls)
Colin =^.^=
Willy Denoyette [MVP] - 16 Jan 2008 07:47 GMT >> "Willy Denoyette [MVP]" <willy.denoyette@telenet.be> wrote in message > [quoted text clipped - 26 lines] > thanks > Colin =^.^= What exactly takes 9000 cycles?
Willy.
colin - 16 Jan 2008 11:22 GMT >>> "Willy Denoyette [MVP]" <willy.denoyette@telenet.be> wrote in message >> [quoted text clipped - 28 lines] >> > What exactly takes 9000 cycles? int cnt=100000; start = timer.Value; for (int x = 0; x < cnt; x++) { ulong cycles = timer.value } timerOverhead = (timer.Value - start)/cnt;
timer.value -> get{return rdtsc();}
timerOverhead comes out as about 9000;
Colin =^.^=
Willy Denoyette [MVP] - 16 Jan 2008 17:08 GMT >>>> "Willy Denoyette [MVP]" <willy.denoyette@telenet.be> wrote in message >>> [quoted text clipped - 42 lines] > > Colin =^.^= Following code:
[DllImport("rtc"),SuppressUnmanagedCodeSecurity] static extern ulong rdtsc();
static void Main()
{ int cnt = 100000; ulong cycles = 0; ulong start = rdtsc(); for (int x = 0; x < cnt; x++) { cycles = rdtsc(); } ulong timerOverhead = (cycles - start) / (ulong)cnt; Console.WriteLine(timerOverhead); } }
comes out as 13 on my box. That means 13 cycles per iteration, the total number of instructions executed is ~35, from which ~30 is Interop call overhead, the remaining in the increment and compare of cnt. The called function itself is only two instructions:
10001000 0f31 rdtsc 10001002 c3 ret
Seems like your processor does not support the rdtsc feature, what CPU do you run this on?
You can call following C function before you are trying to use the Time Stamp Counter, this function returns true if TSC is supported else false....
extern "C" __declspec(dllexport) bool __stdcall tsc() { bool tscFeature = false; int CPUInfo[4] = {-1}; __cpuid(CPUInfo, 0x80000001); int nFeatureInfo = CPUInfo[3]; // check tsc feature bit (bit 27) in CPUInfo[3] if (nFeatureInfo & 0x4000000) tscFeature = true; return tscFeature; }
[DllImport("yourdll"),SuppressUnmanagedCodeSecurity] static extern ulong rdtsc();
Willy.
colin - 16 Jan 2008 18:27 GMT >>>>> "Willy Denoyette [MVP]" <willy.denoyette@telenet.be> wrote in message >>>> [quoted text clipped - 94 lines] > [DllImport("yourdll"),SuppressUnmanagedCodeSecurity] > static extern ulong rdtsc(); thanks for trying it on your machie :D
hmm seems running from the ide debugger is making it incredibly slow I just made completly new files and new c# console project, managed to get 12 too by using ctrl-f5, but when I run with just f5 its 9000 !
is there a way to avoid the overhead the debugger being attached seems to impose ?
I only use the timer in debug mode to do profileing anyway, but I gues I cld detect if the debuger is not atacched and dump it to a file ...
thanks Colin =^.^=
colin - 16 Jan 2008 19:39 GMT > "Willy Denoyette [MVP]" <willy.denoyette@telenet.be> wrote in message ....
> thanks for trying it on your machie :D > [quoted text clipped - 9 lines] > but I gues I cld detect if the debuger is not atacched and dump it to a > file ... I also went back and tried the
Stopwatch.GetTimestamp()
function amd found this is also much faster with no debugger attached, the overhead is only 3 ticks instead of 13, incidently it goes upto 39 ticks if optimization is off too.
3 ticks is still 1us and the rdtsc is much much faster, but I could of lived with 3 ticks, it would of added 100s to the run time instead of over 600.
thanks again Colin =^.^=
Willy Denoyette [MVP] - 16 Jan 2008 21:57 GMT >>>>>> "Willy Denoyette [MVP]" <willy.denoyette@telenet.be> wrote in message >>>>> [quoted text clipped - 101 lines] > managed to get 12 too by using ctrl-f5, > but when I run with just f5 its 9000 ! Let me see, you are running a debug version in the IDE debugger and it returns 9000, right? Well, It will be slower but not that slow, there is clearly something wrong (this 9000 seems like a magic number, weird) .
> is there a way to avoid the overhead the debugger being attached seems to > impose ? No there is no posibility (other than running outside the debugger) to avoid the overhead imposed by the fact you are 1) running non optimized code 2) inside the IDE debugger.
> I only use the timer in debug mode to do profileing anyway, > but I gues I cld detect if the debuger is not atacched and dump it to a > file You should not instrument debug builds, but if you do, you should accept the overhead imposed by the environment (debugger, non optimized JIT code).
Willy.
colin - 16 Jan 2008 22:49 GMT > Let me see, you are running a debug version in the IDE debugger and it > returns 9000, right? Well, It will be slower but not that slow, there is > clearly something wrong (this 9000 seems like a magic number, weird) . yeah strange, maybe some who knows the internals can shed some light on this?
>> is there a way to avoid the overhead the debugger being attached seems to >> impose ? > > No there is no posibility (other than running outside the debugger) to > avoid the overhead imposed by the fact you are 1) running non optimized > code 2) inside the IDE debugger. without actually knowing where the code is spending its time during those 9000 clock cyclces its hard to know whats cuasing it.
>> I only use the timer in debug mode to do profileing anyway, >> but I gues I cld detect if the debuger is not atacched and dump it to a [quoted text clipped - 3 lines] > the overhead imposed by the environment (debugger, non optimized JIT > code). yes good point, however I wanted to make sure the realease build was clean. Theres little difference in the run time beteween debug and release.
I was only looking for the most time consuming parts of the code I didnt expect a 20 times increase in time just becuase of this timer tho lol, the increase occurs in the release or debug version, purly determined by the debugegr being attched or not.
thanks very much for your help finaly got a workable solution anyway :) Colin =^.^=
Willy Denoyette [MVP] - 17 Jan 2008 11:15 GMT >> Let me see, you are running a debug version in the IDE debugger and it >> returns 9000, right? Well, It will be slower but not that slow, there is >> clearly something wrong (this 9000 seems like a magic number, weird) . > > yeah strange, maybe some who knows the internals can shed some light on > this? You need to check your project/debug attributes for this project.
>>> is there a way to avoid the overhead the debugger being attached seems >>> to impose ? [quoted text clipped - 25 lines] > thanks very much for your help finaly got a workable solution anyway :) > Colin =^.^= The debug version of the code I posted, takes 230-350 cycles (that is the time needed to service that rdtsc function)when running in the managed debugger (IDE VS2008) on my box. Which is quite normal, because the CLR has to signal the transition into unmanaged code to the debugging engine (and obviously to the attached debugger).
Willy.
colin - 17 Jan 2008 16:48 GMT ...
> You need to check your project/debug attributes for this project. Ive looked in all the settings and theres hardly any at all that look like they might make a difference, espcially compared to the options for c++, in the build options theres optimise code wich makes a slight difference.
there may well be options I havnt discovered yet, I had to ask how to do 1 or 2 things so far only to be told you cant do that, then someone says its an option in a menu id never seen before by prerssing "ctrl-d" + e
> The debug version of the code I posted, takes 230-350 cycles (that is the > time needed to service that rdtsc function)when running in the managed > debugger (IDE VS2008) on my box. > Which is quite normal, because the CLR has to signal the transition into > unmanaged code to the debugging engine (and obviously to the attached > debugger). so im seeing an increase of 30 times greater than what your seeing, maybe it has to do with VS2005 express ? im using .net2.0 although i cant see that making a difference, I do notice however that there seems to be debug versions of the .net dlls
anyway Ive got my run time down from 3000 seconds to 20 seconds wich is acceptable for now, there seems to be no single thing wich looks like it could easily be optimised further, although some octal tree sectioning <might> help. but this is a complication il think of when ive done some other things, and only if its realy necessary.
although the managed directx does seem to spend quite a while copying vertex information, wich is an array of structs, I cant create an array of structs in advance becuse I dont know the size in advance, so I make a list wich it then has to convert to an array, then copy it to the graphics device lol. unfortunatly I cant make the struct array very big to start with becuase it copies the whole array and complains if I tell it the size is smaller than its length.
il probly try to work out the size in advance or something.
Colin =^.^=
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|