I came into some interesting stuff some time ago, while trying to detect performance issues in some ANSI-C software.
Sometimes, while coding, it may be difficult to decide whether to use a recursive function or a loop, and if so, what kind of loop.
The best example, is the factorial computation. That example is typically used to teach recursive function.
I always felt something strange with those examples. Why the hell using a recursive function for a factorial computation, since we can use a while loop, for instance?
If you think about machine code, or assembly, calling a procedure implies to copy the values of some registers (instruction pointer, etc), to execute a jump into another portion of memory (maybe in another segment), to execute the code, to restore the registers, and finally to return to the previous memory location (RETF).
In pure assembly, a simple loop, using the CX register, on Intel, is far more efficient, as there's no jump, nor stack operation to copy the register values.
So what about C? Take a look at this typical example:
unsigned long factorial( unsigned long n )
{
if( n == 0 )
{
return 1;
}
return n * factorial( n - 1 );
}
The same can be achieved with a while loop:
unsigned long x = n;
if( n == 0 )
{
x = 1;
}
else
{
while( n > 1 )
{
x *= --n;
}
}
We can also choose to use a for loop:
unsigned long x = n;
if( n == 0 )
{
x = 1;
}
for( ; n > 1 ; )
{
x = --n * x;
}
In pure assembly, we would also write a loop. So with GCC inline assembly:
unsigned long x;
if( n == 0 )
{
x = 1;
}
__asm__
(
" movq %[n], %%rax \n\t"
" movq %[n], %%rcx \n\t"
" decq %%rcx \n\t"
"FACTORIAL:"
" mulq %%rcx \n\t"
" decq %%rcx \n\t"
" jnz FACTORIAL \n\t"
" movq %%rax, %[x]"
: [ x ] "=m" ( x )
: [ n ] "m" ( n )
);
Which one will be the fastest? At first sight, I would have said inline assembly, while, for, and finally the recursive function.
But that's not the case.
Running each case 10000000 time gives the following results:
Recursive: 0.64492511749267578125 While: 1.47788429260253906250 Assembler: 0.62227106094360351562 For: 1.47960233688354492188
Inline assembly is the fastest, but the recursive function is very very close.
The while and for loops takes much longer.
I was very surprised. Why such results?
The answer is in the way GCC optimizes to source code to generate intermediate object code.
If we take a look closer at what GCC does, we can see it optimizes for us a lot of things.
For instance, our recusrisve function, in pure assembly, generated by GCC:
LFE7:
.globl _factorial_recursive
_factorial_recursive:
LFB8:
pushq %rbp
LCFI3:
movq %rsp, %rbp
LCFI4:
subq $16, %rsp
LCFI5:
movq %rdi, -8(%rbp)
cmpq $0, -8(%rbp)
jne L4
movq $1, -16(%rbp)
jmp L6
L4:
movq -8(%rbp), %rdi
decq %rdi
call _factorial_recursive
movq %rax, %rdx
imulq -8(%rbp), %rdx
movq %rdx, -16(%rbp)
L6:
movq -16(%rbp), %rax
leave
ret
And now the while loop:
LFB9:
pushq %rbp
LCFI6:
movq %rsp, %rbp
LCFI7:
movq %rdi, -24(%rbp)
movq -24(%rbp), %rax
movq %rax, -8(%rbp)
cmpq $0, -24(%rbp)
jne L12
movq $1, -32(%rbp)
jmp L11
L13:
decq -24(%rbp)
movq -8(%rbp), %rax
imulq -24(%rbp), %rax
movq %rax, -8(%rbp)
L12:
cmpq $1, -24(%rbp)
ja L13
movq -8(%rbp), %rax
movq %rax, -32(%rbp)
L11:
movq -32(%rbp), %rax
The same applies for the for loop.
We can see here the compiler (GCC for instance) did a HUGE optimization of our source code.
It means actually that, except if you're a assembly genius, you probably won't code that will run faster, because you did it with inline assembly.
Nowadays, compilers are so efficient that some optimizations just won't work.
In doubt, always remember to benchmark your code, and always remember that compiler can usually generate intermediate assembly code, so you can check what your code will really look like, when executed by the CPU.