.NET Forum / Languages / Managed C++ / May 2004
/CLR floating point performance, inter-assembly function call performance
Bern McCarty - 05 May 2004 22:34 GMT
I have run an experiment to learn some things about floating point performance in managed C++. I am using Visual Studio 2003. I was hoping to get a feel for whether it would make sense to punch out from managed code to native code (I was using IJW) in order to do some amount of floating point work and, if so, roughly how much work it takes to make that worthwhile.
To attempt to do this I made a program that applies a 3x3 matrix to an array of 3D points (all doubles here folks). The program contains a function that applies 10 different matrices to the same test data set of 5,000,000 3D points. It does this by invoking another workhorse function that does the actual floating point operations. That function takes an input array of 3D points, an output array of 3D points, a point count, and the matrix to use. There are no __gc types in this program. It's just pointers and structs and native arrays. The outer test function looks like this:
void test_applyMatrixToDPoints(TestData *tdP, int ptsPerMultiply)
{
    int jIterations = tdP->pointCnt / ptsPerMultiply;
    for (int i = 0; i < tdP->matrixCnt; ++i)
    {
        for (int j = 0; j < jIterations; ++j)
        {
            // managed-to-native transitions happen here in V2
            DMatrix3d_multiplyDPoint3dArray(tdP->matrices + i, &tdP->outPts[j*ptsPerMultiply], &tdP->inPts[j*ptsPerMultiply], ptsPerMultiply);
        }
    }
}
The program calls the above routine 8 times and records the time elapsed during each call. On the first call the above function calls the workhorse function only once for each of the 10 matrices. In other words, it applies a matrix to all of the 5,000,000 points in the test data set with a single call to the other workhorse function. In the next call to the above function it passes only 50,000 points per call to the other routine, then 5,000, then 500, et cetera, until we get all of the way down to 5, and then finally 1, where there is a function call to DMatrix3d_multiplyDPoint3dArray() for each and every one of the 5,000,000 3D points in the test data set.
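The granularity sweep described above can be sketched in portable C++ as follows. This is only an illustration, not the original test program: the DPoint3d and DMatrix3d layouts are inferred from the member accesses in the posted code, while multiplyPoints, SweepResult, and timeSweep are hypothetical names, and std::chrono stands in for whatever timer the original used.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Layouts assumed from the member accesses in the posted workhorse function.
struct DPoint3d  { double x, y, z; };
struct DMatrix3d { DPoint3d column[3]; };

// Hypothetical stand-in for DMatrix3d_multiplyDPoint3dArray.
void multiplyPoints(const DMatrix3d* m, DPoint3d* out, const DPoint3d* in, int n)
{
    for (int i = 0; i < n; ++i) {
        const double x = in[i].x, y = in[i].y, z = in[i].z;
        out[i].x = m->column[0].x * x + m->column[1].x * y + m->column[2].x * z;
        out[i].y = m->column[0].y * x + m->column[1].y * y + m->column[2].y * z;
        out[i].z = m->column[0].z * x + m->column[1].z * y + m->column[2].z * z;
    }
}

struct SweepResult {
    double seconds;   // wall-clock time for the whole sweep
    long long calls;  // number of calls made to multiplyPoints
};

// Applies every matrix to the whole input, handing ptsPerCall points to the
// workhorse per call. A smaller ptsPerCall means more calls for the same math.
SweepResult timeSweep(const std::vector<DMatrix3d>& matrices,
                      const std::vector<DPoint3d>& in,
                      std::vector<DPoint3d>& out,
                      int ptsPerCall)
{
    const int callsPerMatrix = static_cast<int>(in.size()) / ptsPerCall;
    long long calls = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < matrices.size(); ++i) {
        for (int j = 0; j < callsPerMatrix; ++j) {
            multiplyPoints(&matrices[i], &out[j * ptsPerCall],
                           &in[j * ptsPerCall], ptsPerCall);
            ++calls;
        }
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return SweepResult{elapsed.count(), calls};
}
```

Calling timeSweep repeatedly with a shrinking ptsPerCall keeps the arithmetic constant while the call count grows, which is what isolates call overhead from floating point cost.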
I was hoping someone could help interpret the results. At first I made 3 versions of this program. In all 3 of these versions the DMatrix3d_multiplyDPoint3dArray function was in a geometry.dll and the rest of the code was in my test.exe. The 3 versions were merely different combinations of native versus IL for the two executables:
        test.exe    geometry.dll (contains workhorse function)
        --------    ----------------
v1)     native      native
v2)     managed     native
v3)     managed     managed
Here are the results. All numbers are elapsed time in seconds for calls to the outer function described.
Native->Native:
0.953 0.968 0.968 0.953 0.968 0.952 1.093 1.39
Final run is 146% of first run. Final run is 127% of previous run.

Managed->Native:
0.968 0.968 0.968 0.969 0.968 0.968 1.124 1.952
Final run is 202% of first run. Final run is 174% of previous run.

Managed->Managed:
0.984 1.016 0.985 1 1 1.032 1.516 4.469
Final run is 454% of first run. Final run is 295% of previous run.
This surprised me in two ways. First, I thought that for version 2 the penalty imposed by managed->native transitions would be worse. It's there, you can see performance drop off more as the call granularity becomes very fine toward the end, but it isn't as much as I might have guessed it would be. More surprising was that the managed->managed version, which didn't have any managed->native transitions slowing it down at all, dropped off far worse! The early calls to the test function compare very closely between versions 2 and 3, suggesting that the raw floating point performance of the managed versus native workhorse function is quite similar. So this seemed to point the finger at function call overhead. For some reason function call overhead is just higher for managed code than for native?
On a hunch I decided to make a 4th version of the program that was also managed->managed but which eliminated the inter-assembly call. Instead I just linked everything from geometry.dll right into test.exe. It made a big difference. The results are below. Is there some security/stack-walking stuff going on in the inter-DLL case maybe? Or does it really make sense that managed, inter-assembly calls are that much slower than the equivalent intra-assembly call? Explanations welcomed. The inter-assembly version takes 217% of the time that the intra-assembly version takes on the final call when the call granularity is fine. That seems awfully harsh.
Managed->Managed (one big test.exe):
1 0.999 0.984 1.015 0.984 1.015 1.093 2.061
Final run is 206% of first run. Final run is 189% of previous run.
Even with the improvement yielded by eliminating the inter-assembly calls, the relative performance between the version that has to make managed->native transitions and the all managed version is difficult for me to comprehend. What is it with managed->managed function call overhead that seems worse even than managed->native function call overhead?
I tried to make sure that page faults weren't affecting my test runs and the results I got were very consistent from run to run.
Bern McCarty
Bentley Systems, Inc.
P.S. For the curious, here is what DMatrix3d_multiplyDPoint3dArray looks like. There are no function calls made and it is all compiled into IL.
Public void DMatrix3d_multiplyDPoint3dArray
(
const DMatrix3d *pMatrix,
DPoint3d *pResult,
const DPoint3d *pPoint,
int numPoint
)
{
    int i;
    double x, y, z;
    DPoint3d *pResultPoint;

    for (i = 0, pResultPoint = pResult; i < numPoint; i++, pResultPoint++)
    {
        x = pPoint[i].x;
        y = pPoint[i].y;
        z = pPoint[i].z;

        pResultPoint->x = pMatrix->column[0].x * x + pMatrix->column[1].x * y + pMatrix->column[2].x * z;
        pResultPoint->y = pMatrix->column[0].y * x + pMatrix->column[1].y * y + pMatrix->column[2].y * z;
        pResultPoint->z = pMatrix->column[0].z * x + pMatrix->column[1].z * y + pMatrix->column[2].z * z;
    }
}
Yan-Hong Huang[MSFT] - 06 May 2004 07:09 GMT
Hello Bern,
Generally speaking, the v1 JIT does not currently perform all the FP-specific optimizations that the VC++ backend does, making floating point operations more expensive for now. That may be why managed->managed is more expensive than managed->unmanaged in your test.
So for areas which make heavy use of floating point arithmetic, please use profilers to pick the fragments where the overhead is costing you most, and keep the whole fragment in unmanaged space.
Also, work to minimize the number of transitions you make. If you have some unmanaged code or an interop call sitting in a loop, make the entire loop unmanaged. That way you'll only pay the transition cost twice, rather than for each iteration of the loop.
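The difference between a per-iteration transition and a single transition around the whole loop can be sketched like this. It is a minimal illustration, not code from the thread: sumNative, sumChatty, and sumChunky are hypothetical names. #pragma unmanaged and #pragma managed are the MSVC pragmas that select native or managed compilation per function under /clr; the _MANAGED guard keeps the sketch compilable with other compilers, where everything is simply native.

```cpp
struct DPoint3d { double x, y, z; };

#ifdef _MANAGED
#pragma unmanaged   // functions below compile to native code under /clr
#endif

// Native workhorse: the whole loop lives on the native side.
double sumNative(const DPoint3d* pts, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; ++i)
        total += pts[i].x + pts[i].y + pts[i].z;
    return total;
}

#ifdef _MANAGED
#pragma managed     // back to managed code for the callers
#endif

// Chatty: crosses the managed/native boundary once per point.
double sumChatty(const DPoint3d* pts, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; ++i)
        total += sumNative(pts + i, 1);
    return total;
}

// Chunky: crosses the boundary once for the whole array.
double sumChunky(const DPoint3d* pts, int n)
{
    return sumNative(pts, n);
}
```

Both callers compute the same sum; under /clr only the number of managed/native boundary crossings differs, n for the chatty version and one for the chunky version.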
By looking into ILCode, we can see that when InterOping, there are some extra IL instructions. So minimizing the number of transitions can save many IL instructions and improve performance.
For some more information, you can refer to this chapter online: "Chapter 7 - Improving Interop Performance" http://msdn.microsoft.com/library/en-us/dnpag/html/scalenetchapt07.asp?frame=true#scalenetchapt07_topic12
Hope that helps.
Best regards, Yanhong Huang Microsoft Community Support
Get Secure! - www.microsoft.com/security
This posting is provided "AS IS" with no warranties, and confers no rights.
Bern McCarty - 06 May 2004 13:59 GMT
From reading various things I had already recognized the things that you state as the current conventional wisdom. I went to the trouble to post my results in the hopes of getting some feedback on why it might be that my results run very much against that conventional wisdom. Please consider:
1) Floating point performance of managed code. At least in this little test scenario floating point performance of managed code doesn't seem to be a problem at all. In the first call out of the 8 in a test run the DMatrix3d_multiplyDPoint3dArray function is asked to apply the matrix to a whopping 5,000,000 3D points per call. So it is just sitting there doing floating point operations in a 5,000,000 iteration loop and there are no function calls in that loop at all. The managed version took only 3% longer in that case than the all native version. It seems logical then to rule out floating point performance as the culprit when things quickly change for the worse in the later calls where the call granularity to DMatrix3d_multiplyDPoint3dArray becomes very fine. It makes more sense to attribute the slowdown observed in the fine-grained call cases to function call overhead, not to floating point performance.
2) The expense of transitions. What am I doing wrong? The version of my test program that involves a transition in the call from test_applyMatrixToDPoints->DMatrix3d_multiplyDPoint3dArray is actually FASTER than the all managed version (true for both the intra-assembly and inter-assembly call cases). Furthermore, the more finely-grained the calls are, the more the managed->native version outperforms the managed->managed versions. Since we already established that raw floating point performance of the loop inside of the DMatrix3d_multiplyDPoint3dArray function is nearly equivalent between the managed and native versions, and the conventional wisdom is that managed->native transitions are expensive and bad, then what is to blame for the poor relative performance of the managed->managed versions? The managed->managed version is flat-out beaten by the version that does a transition for each and every call. It would seem that there is some serious penalty associated with making regular managed->managed function calls - not managed->native calls. What might be responsible for it and is it something I have any control over?
3) The surprising difference in cost between inter-assembly and intra-assembly managed->managed calls. Can someone explain this difference and is there anything that can be done about it besides making my program one enormous executable?
4) How can I step through JIT compiled code in assembly language in a debugger for a release executable so that I can see what is going on? I want the JIT to produce "non debug" x86 instructions and yet I want to step through them to see what they do. Tips appreciated. Can I do this with the VS.NET debugger? Windbg? How?
Yan-Hong Huang[MSFT] - 07 May 2004 04:03 GMT
Hi Bern,
By using ildasm.exe, you can look into the IL code of the assembly to see the difference between inter-assembly and intra-assembly managed->managed calls.
At the same time, I have forwarded your questions to our product team for their opinions on it. I will return here as soon as possible.
Thanks.
Best regards, Yanhong Huang Microsoft Community Support
Kang Su Gatlin [MS] - 07 May 2004 23:02 GMT
Bern, you're seeing what looks like a manifestation of the "double thunk" (aka "double p/invoke") problem. The problem is that when your managed code calls the managed code in the DLL, it first goes through a native stub (when using the Win32 DLL mechanism), so you end up transitioning from managed to native and then back to managed.
Try #using the DLL which you have compiled managed, rather than the standard Win32 DLL mechanism. This should help. Let us know if that helps, or if this makes no sense.
Thanks,
Kang Su Gatlin Visual C++ Program Manager
Yan-Hong Huang[MSFT] - 10 May 2004 03:22 GMT
Hello Bern,
Are you still monitoring this thread? We just held a discussion among PSS, SDE and PM.
The listed matrix of tested combination is this:
        test.exe    geometry.dll (contains workhorse function)
        --------    ----------------
v1)     native      native
v2)     managed     native
v3)     managed     managed
The key is that we think that the third variation is using exported functions and an import library to call the function in geometry.dll, as is certainly the case with the first two. If this is the case, then it is mistaken that there are no transitions in this scenario. In fact, there are twice as many transitions in variation 3 as in variation 2. The reason for this is the import libraries. Import libraries are a native construct. Any time a function call is made from managed code to a DLL through a stub in the import lib, a managed-native transition must happen. And then, since the actual implementation of the function in the DLL is managed, there must be another transition back to managed. This is very costly, as you found out.
The good news is that there is a way around these transitions for the managed/managed case. Here is a small example:
Code for DLL:

public __value class Utils { // Must have a public managed type (__value or __gc)
public:
    static int func(int i, int j) { // Must be static unless you don't mind creating instances
        return i + j;
    }
};

Code for EXE:

#using <testdll.dll> // Pull in the types defined in assembly testdll.dll

int main() {
    return Utils::func(0, 0); // Call the function
}
This will eliminate all transitions from the call from the exe into the DLL.
I will email our SDE and let him look into this post also. If you have any more concerns, please feel free to post here. Or you can contact us by removing online from my email address here. Thanks very much.
Best regards, Yanhong Huang Microsoft Community Support
Bern McCarty - 10 May 2004 15:14 GMT
Yes I'm here. Thanks for the answer. That makes a certain amount of sense. I'll see if I can verify it.
I gather that in Whidbey the performance of my inter-assembly, managed->managed version would be much better without my changing anything. Yes?
-Bern
Bern McCarty - 10 May 2004 18:02 GMT
I tried to take the suggestion of doing a #using <geometry.dll> instead of including the corresponding header files, but when I did that the result would not compile:
C:\mycode\geomTest\\test.cpp(74) : error C3861: 'DMatrix3d_multiplyDPoint3dArray': identifier not found, even with argument-dependent lookup
Then I thought, well, maybe I should include the header files AND do a #using <geometry.dll> but make sure to NOT link with the geometry.lib. Then the problem just moves to link time:
test.obj : error LNK2001: unresolved external symbol "void __cdecl DMatrix3d_multiplyDPoint3dArray(struct _dMatrix3d const *,struct _dPoint3d *,struct _dPoint3d const *,int)" (?bsiDMatrix3d_multiplyDPoint3dArray@@$$J0YAXPBU_dMatrix3d@@PAU_dPoint3d@@PBU2@H@Z)
Here is what I can find on the function in the disassembled geometry.dll (I omitted the body):
.method /*0600003F*/ public static void modopt([mscorlib/* 23000001 */]System.Runtime.CompilerServices.CallConvCdecl/* 01000001 */) DMatrix3d_multiplyDPoint3d(valuetype _dMatrix3d/* 02000005 */ modopt([Microsoft.VisualC/* 23000002 */]Microsoft.VisualC.IsConstModifier/* 01000002 */)* pMatrix, valuetype _dPoint3d/* 02000006 */* pPoint) cil managed // SIG: 00 02 20 05 01 0F 20 09 11 14 0F 11 18
Perhaps I am doing something wrong, but it appears to me that you cannot supply the compiler/linker with the information that it needs to call global functions that were compiled into IL via /CLR by simply referencing the assembly at compile time. Does that mean that to avoid the inter-assembly double P/Invoke I have no choice but to wrap all of the functionality in my geometry library in GC classes? That would be a shame since I am able to call it as is just fine - it is just that it is too slow.
Will the double P/Invoke that I am seeing in this case go away as of Whidbey?
-Bern
Yan-Hong Huang[MSFT] - 11 May 2004 02:51 GMT
Hi Bern,
Based on my experience, the best way is to verify it by testing on Whidbey. You can get a build from MSDN Subscriber Downloads.
For the second issue, I think you need to use a __gc wrapper class to export it. Please refer to MSDN for more information. I think the relevant sample is "MCppWrapper Sample: Demonstrates Wrapping a C++ DLL with Managed Extensions" http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vcsample/html/vcsammcppwrappersampledemonstrateswrappingcdllwithmanagedextensions.asp
Do you have any more concerns on the performance issue yet? If yes, please feel free to post here. I am glad to work with you on it. Thanks very much.
Best regards, Yanhong Huang Microsoft Community Support
Bern McCarty - 11 May 2004 14:23 GMT
Yes I am concerned about performance. I had hoped that IJW could be used to compile nearly all of our existing C++ application into IL and that that would eliminate the need for many managed->native transitions and would also free us to begin using GC types throughout our application over time. Our application consists of quite a number of .dlls and there are tons of inter-dll calls. But now what I've learned is that, though the code compiles, links and runs, every inter-dll call is suffering the double P/Invoke problem so indeed my code is littered with managed->unmanaged transitions.
Sure I could wrap every single function/method in my entire application in a GC class, but then IJW isn't at all suitable for what I thought it was. Like I said, it compiles, links and runs which is impressive. It's just too slow and that's too bad. I would still like an answer to know if the double P/Invoke problem will really be fixed in the final Whidbey release. I saw where someone from Microsoft hedged on that saying that it might not be. I hope that is not the case.
As for the Visual Studio 2005 Tech Preview on MSDN, I've already looked at it. I had so much trouble with it I gave up on it. I found myself editing delivered headers just to try to get stuff to compile. Then the result would crash. I haven't seen anyone else posting C++ issues in here that related to this Whidbey build and I kind of reached the conclusion that the VC++ team didn't really circle the wagons for this particular build. I can only assume that they have other better quality builds that people in other programs have access to. I also found that "search" did not work for the MSDN library that came with the build and I find that terribly crippling.
Bern McCarty Bentley Systems, Inc.
Yan-Hong Huang[MSFT] - 12 May 2004 02:43 GMT
Hello Bern,
I have contacted our VC++ developer on it. Unfortunately, under this situation, it will still cost that much even in VS Whidbey. Here is the response from the developer.
-----------------------------
It will still cost that much. The problem is that the managed DLL loading mechanism only exposes managed types, not standalone functions, and exported functions must be callable from native code because of the import lib. Then the kicker is that native code cannot use or call functions that are members of managed types.
If the API needs to be exposed to both native and managed code through one dll, it is possible to do this. Here is my trivial example again, extended to expose a managed interface, and a native interface.
File: testdll.cpp
Compile: cl /clr /LD testdll.cpp

// managed interface
public __value class Utils {
public:
    static int func(int i, int j) {
        return i + j;
    }
};

// native interface
__declspec(dllexport) int func(int i, int j) {
    return Utils::func(i, j);
}

File: managed.cpp
Compile: cl /clr managed.cpp

// Use the managed mechanism to access the API. Note that #using pulls in types, not standalone functions,
// so our API must be a member of a managed type, in this case the class Utils. Also, we have not linked
// to the import lib.
#using <testdll.dll>

int main() {
    return Utils::func(0, 0);
}

File: native.cpp
Compile: cl native.cpp /link testdll.lib

// Now we link with the import lib.
int func(int, int);
int main() {
    return func(0, 0);
}
-----------------------------
I understand that migrating the code to managed C++ wrapper classes is a lot of work if performance is important to you. However, from what we have discussed so far, there seems to be no easier way to achieve it yet. As the developer mentioned, the managed DLL loading mechanism only exposes managed types, not standalone functions. For the time being, to improve the performance of inter-assembly calls, the exported functions need to be exposed as static methods of a wrapper class.
If you have any more concerns on it, please feel free to post here. Thanks very much for your understanding.
Best regards, Yanhong Huang Microsoft Community Support
Bern McCarty - 12 May 2004 14:32 GMT
Thank you for the answer. It was very important for me to know that this will remain the same in Whidbey. The Quake II .NET effort (http://www.codeproject.com/managedcpp/Quake2.asp) is something I've seen Microsoft demonstrate a couple of times. I found that quite compelling. I guess the Quake II .NET program suffers from the double P/Invoke on its inter-mixed-mode-assembly calls too then? Or did they wrap everything that is called inter-assembly and then change every call-site to call the wrapped functions instead?
It seems to me that the suggestion for how to wrap things is backwards. My implementation methods already exist. I don't want to touch them. I just want to wrap them as they are. I would think to do it like this instead:
Compile: cl /clr /LD testdll.cpp

namespace myNameSpace {
    // native interface
    __declspec(dllexport) int func(int i, int j) {
        return i + j;
    }
}

// managed interface
public __value class Utils {
public:
    static int func(int i, int j) {
        return myNameSpace::func(i, j);
    }
};
After seeing Quake II .NET my thoughts were that I could take our large, complex and extensible multi-dll program and arrange our build process so that we could experiment with a /CLR compiled version for a good long time while we learned the ins and outs of C++ interop and how to begin to introduce GC types into its implementation and its documented interfaces. Furthermore I thought that I could slowly add the /CLR switch to individual source files as I found time to conquer them. That has turned out to be a fair amount of work for us because many of our source files are in fact .c files. When compiling them with /CLR I ran into problems at link time and ultimately realized that Microsoft was desupporting the /CLR switch for C source code anyway. Then I realized that modules needed to be converted to C++ prior to adding the /CLR switch and that's where it begins to take quite a bit of effort in such a large application.
But since these many .dlls call each other, and since to get decent inter-assembly call performance I need to wrap all my functions as static methods of a GC type and, here is the kicker, since I have to then change each and every call site to functions that have been compiled with /CLR to instead call the GC wrapped versions of the functions, it now becomes a rather complex task to effect this transition slowly over time. You have to leave call-sites that aren't compiled into IL yet alone, yet you have to alter all the others. The equation of what is IL versus x86 is always changing as I manage to add the /CLR switch to new modules. I guess I would have to arrange to use the preprocessor to substitute the right calls with calls to the wrapped versions. Certainly possible, but a fair amount of work.
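The preprocessor substitution mentioned above might look roughly like the following. This is a sketch under assumptions, not something taken from the thread: GEOM_CALL, GeometryWrappers, geometry_native, and addViaGeometry are hypothetical names, and the wrapper is written as a plain struct so the sketch compiles with any C++ compiler. In a real /clr build the wrapper would be a public __value or __gc class pulled in with #using, and _MANAGED is the macro MSVC predefines when compiling with /clr.

```cpp
// Existing implementation, left untouched (hypothetical example function).
namespace geometry_native {
    inline int func(int i, int j) { return i + j; }
}

// Stand-in for the managed wrapper type. Assumption: in a real /clr build
// this would be a public __value or __gc class exposed by the DLL.
struct GeometryWrappers {
    static int func(int i, int j) { return geometry_native::func(i, j); }
};

// Call sites compiled to IL go through the wrapper; call sites still compiled
// to native code keep calling the exported function directly.
#ifdef _MANAGED
#define GEOM_CALL(name) GeometryWrappers::name
#else
#define GEOM_CALL(name) geometry_native::name
#endif

// A call site written once, valid in both build flavors.
int addViaGeometry(int a, int b)
{
    return GEOM_CALL(func)(a, b);
}
```

The design choice here is that each source file picks the right target automatically when the /CLR switch is added to it, so call sites do not have to be edited a second time as more modules move to IL.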
It seems like lots of folks that see Quake II .NET are going to take it to heart and try the same thing that I have tried and ultimately end up facing the very same problem. It would be nice if the linker could optionally generate and include GC wrappers for my exported functions. Imagine that I supply a namespace::classname for a GC class that I want the linker to create and then the linker dutifully adds static methods to that class which simply wrap each of my exports. It could even generate a .h file for me that mapped the native name to the appropriate method in the generated GC wrapper class.
Then I could maintain both the traditional all native build of my application and a piece-meal mixed-mode build of our application while we work toward adding the /CLR switch to virtually everything in the application. I'm trying to follow the Quake II .NET lead, but my app is a lot larger and more complex and doing it all in one fell swoop isn't practical.
Bern McCarty Bentley Systems, Inc.
Yan-Hong Huang[MSFT] - 13 May 2004 06:24 GMT
Hi Bern,
We will do our best to find out how the Quake II .NET port handles this. Your idea is also good and I will forward it to the product group. We are also looking for the most convenient way to migrate code efficiently.
Thanks very much.
Best regards, Yanhong Huang Microsoft Community Support
Yan-Hong Huang[MSFT] - 17 May 2004 07:39 GMT
Hello Bern,
Do you have any more concerns on it? If there is anything we can do, please feel free to post here.
Thanks very much.
Best regards, Yanhong Huang Microsoft Community Support