I have some code like this:
///////////////
void test(int* a)
{
    a[0] += ((a[1] - a[2]) << 3);
}
////////////////
After compiling with VC.NET 2003, the asm code is:
///////////
PUBLIC ?test@@YIXPAH@Z ; test
; Function compile flags: /Ogty
; File c:\test.cpp
; COMDAT ?test@@YIXPAH@Z
_TEXT SEGMENT
?test@@YIXPAH@Z PROC NEAR ; test, COMDAT
; _a$ = eax
; 144 : void test(int* a)
mov ecx, DWORD PTR [eax+4]
sub ecx, DWORD PTR [eax+8]
add ecx, ecx
add ecx, ecx
add ecx, ecx
add DWORD PTR [eax], ecx
ret 0
?test@@YIXPAH@Z ENDP ; test
////////////////////////////////////////////
Question:
Why does the compiler use "add ecx,ecx" three times instead of a single "shl ecx,3"?
I would have expected shl ecx,3 to be faster.
Thanks for any answers.
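For reference, a minimal sketch (not compiler output) of why the two instruction sequences are interchangeable: doubling a value three times gives the same result as shifting it left by 3. The function names are hypothetical, and unsigned 32-bit integers are assumed so that wraparound is well defined, as it is in a register. It needs C++14 or later for the multi-statement constexpr function.

```cpp
#include <cstdint>

// Hypothetical helper names, used only for illustration.

// Multiply by 8 with one left shift (the effect of "shl ecx, 3").
constexpr std::uint32_t times8_shift(std::uint32_t x) { return x << 3; }

// Multiply by 8 with three doublings (the effect of three "add ecx, ecx").
constexpr std::uint32_t times8_adds(std::uint32_t x) {
    x += x;  // x * 2
    x += x;  // x * 4
    x += x;  // x * 8
    return x;
}

// Both forms agree, including when the high bits overflow and wrap.
static_assert(times8_shift(5) == 40, "shift form");
static_assert(times8_adds(5) == 40, "add form");
static_assert(times8_adds(0xFFFFFFFFu) == times8_shift(0xFFFFFFFFu), "wraparound matches");
```

Since the results are identical, the compiler is free to pick whichever sequence it estimates to be cheaper on the target CPU.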
Tom Widmer - 28 Jan 2005 15:43 GMT
> I have some code like this:
> ///////////////
[quoted text clipped - 27 lines]
>
> I would have expected shl ecx,3 to be faster.
I know very little about assembler, but a quick search on Google seems
to show that you may be wrong, at least for P4:
http://www.intel.com/cd/ids/developer/asmo-na/eng/44010.htm?prn=Y
It says: "Shifts, rotations Avoid if possible, schedule dependent
instructions as far away as possible; replace shl with additions". It
also says: "the shl µop has a latency of 4", whatever that means.
Tom
Carl Daniel [VC++ MVP] - 28 Jan 2005 16:42 GMT
>> Why does the compiler use "add ecx,ecx" three times instead of a
>> single "shl ecx,3"? I would have expected shl ecx,3 to be faster.
[quoted text clipped - 7 lines]
> instructions as far away as possible; replace shl with additions". It
> also says: "the shl µop has a latency of 4", whatever that means.
IIUC, on a Pentium 4 class CPU, the adds will execute in 1/2 clock cycle
each in the execution core - three adds will be ~2X faster than the shift
would be.
-cd
mestupid - 31 Jan 2005 06:35 GMT
Thanks, -cd and Tom.
That's very helpful. I read through the IA-32 Intel Architecture
Optimization manual, and yes, shl has a long latency.
I have another question I need to clear up:
ADD has a latency of 0.5. Does that mean the CPU can handle two ADDs per clock
cycle?
I guess each ADD needs a total of 1.5 clock cycles: one for the instruction and half
for latency. So three ADDs need 4.5 cycles.
But shl needs one instruction cycle plus 4 for latency, so shl takes 5
cycles in total.
Am I right?
Carl Daniel [VC++ MVP] - 31 Jan 2005 06:45 GMT
> Thanks, -cd and Tom.
> That's very helpful. I read through the IA-32 Intel Architecture
[quoted text clipped - 3 lines]
> ADD has a latency of 0.5. Does that mean the CPU can handle two ADDs per
> clock cycle?
That's correct.
> I guess each ADD needs a total of 1.5 clock cycles: one for the
> instruction and half for latency. So three ADDs need 4.5 cycles.
>
> But shl needs one instruction cycle plus 4 for latency, so shl takes
> 5 cycles in total.
> Am I right?
You could be - I didn't check the actual instruction timings.
-cd
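
A rough way to check the arithmetic, as a sketch only: it assumes the figures quoted in this thread (ADD latency 0.5 cycles, shl latency 4 cycles on a Pentium 4 class CPU) and a simplified model in which a chain of dependent instructions costs the sum of their latencies. The names below are hypothetical, and the model ignores fetch, decode, throughput limits and other pipeline effects. Latency as usually defined already covers the time until the result is available, so no extra "instruction cycle" is added per instruction.

```cpp
// Latency figures as quoted in the thread (assumed here, not measured).
constexpr double kAddLatency = 0.5;  // cycles per "add ecx, ecx"
constexpr double kShlLatency = 4.0;  // cycles per "shl ecx, 3"

// Simplified model: each instruction in a dependent chain must wait for the
// previous result, so the chain costs count * latency cycles.
constexpr double chain_latency(int count, double latency_per_instruction) {
    return count * latency_per_instruction;
}

// Three dependent adds versus one shift, under the assumptions above.
static_assert(chain_latency(3, kAddLatency) == 1.5, "three adds");
static_assert(chain_latency(1, kShlLatency) == 4.0, "one shift");
static_assert(chain_latency(3, kAddLatency) < chain_latency(1, kShlLatency),
              "adds finish sooner in this model");
```

Under these assumptions the three adds take about 1.5 cycles against 4 for the shift, rather than 4.5 against 5. Real timings depend on the exact CPU model and the surrounding code.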