I came into some interesting stuff some time ago, while trying to detect performance issues in some ANSI-C software.

Sometimes, while coding, it may be difficult to decide whether to use a recursive function or a loop, and if so, what kind of loop.
The best example, is the factorial computation. That example is typically used to teach recursive function.

I always felt something strange with those examples. Why the hell using a recursive function for a factorial computation, since we can use a `while` loop, for instance?

If you think about machine code, or assembly, calling a procedure implies to copy the values of some registers (instruction pointer, etc), to execute a jump into another portion of memory (maybe in another segment), to execute the code, to restore the registers, and finally to return to the previous memory location (`RETF`).
In pure assembly, a simple loop, using the `CX` register, on Intel, is far more efficient, as there's no jump, nor stack operation to copy the register values.

So what about C? Take a look at this typical example:

```unsigned long factorial( unsigned long n )
{
if( n == 0 )
{
return 1;
}

return n * factorial( n - 1 );
}
```

The same can be achieved with a `while` loop:

```unsigned long x = n;

if( n == 0 )
{
x = 1;
}
else
{
while( n > 1 )
{
x *= --n;
}
}
```

We can also choose to use a `for` loop:

```unsigned long x = n;

if( n == 0 )
{
x = 1;
}

for( ; n > 1 ; )
{
x = --n * x;
}
```

In pure assembly, we would also write a loop. So with GCC inline assembly:

```unsigned long x;

if( n == 0 )
{
x = 1;
}

__asm__
(
"    movq %[n], %%rax \n\t"
"    movq %[n], %%rcx \n\t"
"    decq       %%rcx \n\t"
"FACTORIAL:"
"    mulq %%rcx \n\t"
"    decq %%rcx \n\t"
"    jnz FACTORIAL \n\t"
"    movq %%rax, %[x]"

: [ x ] "=m" ( x )
: [ n ] "m"  ( n )
);
```

Which one will be the fastest? At first sight, I would have said inline assembly, while, for, and finally the recursive function.

But that's not the case.
Running each case 10000000 time gives the following results:

```Recursive:  0.64492511749267578125
While:      1.47788429260253906250
Assembler:  0.62227106094360351562
For:        1.47960233688354492188
```

Inline assembly is the fastest, but the recursive function is very very close.
The while and for loops takes much longer.

I was very surprised. Why such results?
The answer is in the way GCC optimizes to source code to generate intermediate object code.

If we take a look closer at what GCC does, we can see it optimizes for us a lot of things.

For instance, our recusrisve function, in pure assembly, generated by GCC:

```LFE7:
.globl _factorial_recursive
_factorial_recursive:
LFB8:
pushq	%rbp
LCFI3:
movq	%rsp, %rbp
LCFI4:
subq	\$16, %rsp
LCFI5:
movq	%rdi, -8(%rbp)
cmpq	\$0, -8(%rbp)
jne	L4
movq	\$1, -16(%rbp)
jmp	L6
L4:
movq	-8(%rbp), %rdi
decq	%rdi
call	_factorial_recursive
movq	%rax, %rdx
imulq	-8(%rbp), %rdx
movq	%rdx, -16(%rbp)
L6:
movq	-16(%rbp), %rax
leave
ret
```

And now the `while` loop:

```LFB9:
pushq	%rbp
LCFI6:
movq	%rsp, %rbp
LCFI7:
movq	%rdi, -24(%rbp)
movq	-24(%rbp), %rax
movq	%rax, -8(%rbp)
cmpq	\$0, -24(%rbp)
jne	L12
movq	\$1, -32(%rbp)
jmp	L11
L13:
decq	-24(%rbp)
movq	-8(%rbp), %rax
imulq	-24(%rbp), %rax
movq	%rax, -8(%rbp)
L12:
cmpq	\$1, -24(%rbp)
ja	L13
movq	-8(%rbp), %rax
movq	%rax, -32(%rbp)
L11:
movq	-32(%rbp), %rax
```

The same applies for the `for` loop.

We can see here the compiler (GCC for instance) did a HUGE optimization of our source code.
It means actually that, except if you're a assembly genius, you probably won't code that will run faster, because you did it with inline assembly.
Nowadays, compilers are so efficient that some optimizations just won't work.

In doubt, always remember to benchmark your code, and always remember that compiler can usually generate intermediate assembly code, so you can check what your code will really look like, when executed by the CPU.

Author
Ossama
Date
02/04/2011 16:41 Great article, thanks !
Author
Date
02/04/2011 20:08 You're welcome ; ) Author