In my free time, I'm writing an operating system called XEOS.
As the system is written mainly in C, I'm also writing a C99 standard library.
Over the last few days, I wrote optimised versions of some C99 string functions, like strlen(), strchr() or memset().
This set of functions is usually used intensively in any C program or library.
So obviously, the more optimised they are, the better the program will perform.
Let's take a look at the memset() function, which is used to fill a memory buffer with a single character.
Its prototype is:
void * memset( void * p, int c, size_t n );
A straightforward and naive C implementation would typically be:
void * memset( void * p, int c, size_t n )
{
    unsigned char * s = p;

    while( n-- )
    {
        *( s++ ) = ( unsigned char )c;
    }

    return p;
}
Easy, simple, straightforward.
But how does this perform?
Actually, it performs very badly.
In fact, it would be difficult to perform worse unintentionally.
On my computer (Intel Core i7), calling this function one million times with a 4096-byte buffer takes about 9 seconds.
This is huge.
Some programmers think that the simpler the code, the better it will perform, and that optimisation means reducing the amount of code.
Well, that's completely wrong.
If you're such a programmer, I deeply encourage you to revise your thinking.
So let's take a closer look.
The main issue in the previous code is that it writes one byte at a time into the destination buffer.
High-level programmers (including C programmers, as C is a high-level language) might think that memory operations are cheap, in terms of performance.
That's also wrong.
Memory operations aren't cheap.
In other words, the fewer times you read from or write to memory, the better your program will perform.
So how can we achieve this here?
Well, modern CPUs are 64 bits wide, so we could simply write 8 bytes (64 bits) at a time, rather than a single byte (8 bits).
But what if the length of the buffer is not a multiple of 8? If the program doesn't crash (segmentation fault), we would write outside the buffer, which is arguably worse, as it might be much harder to debug.
A simple solution is to write 8 bytes at a time while it's possible, then write the remaining bytes one by one.
Something like:
#include <stdint.h>

void * memset( void * p, int c, size_t n )
{
    uint8_t  * sp;
    uint64_t * lp;
    uint64_t   u64;
    uint8_t    u8;

    /* Replicate the byte into all 8 bytes of a 64-bit word */
    u8   = ( uint8_t )c;
    u64  = ( uint64_t )u8;
    u64 |= u64 << 8;
    u64 |= u64 << 16;
    u64 |= u64 << 32;

    lp = ( uint64_t * )p;

    /* Write 8 bytes at a time while possible */
    while( n >= 8 )
    {
        *( lp++ ) = u64;
        n        -= 8;
    }

    /* Write the remaining bytes one by one */
    sp = ( uint8_t * )lp;

    while( n-- )
    {
        *( sp++ ) = u8;
    }

    return p;
}
Let's run the same test as previously.
Calling this function one million times with a 4096-byte buffer takes about 1.1 seconds.
That's roughly eight times faster than the naive version.
We can see here that simplicity does not equal performance.
But in terms of performance, we may still have an issue with this code.
If the pointer to the memory buffer is not aligned, overall performance would decrease. We might even experience crashes, as some CPU architectures do not allow unaligned access to memory.
The test I ran wasn't affected, as the pointer was aligned, and as the Core i7 allows unaligned access to memory without performance loss anyway.
But to be sure the code will run fast and fine, we should first align the pointer.
We may write bytes one by one until it is aligned, then write 8 bytes at a time:
#include <stdint.h>

void * memset( void * p, int c, size_t n )
{
    uint8_t  * sp;
    uint64_t * lp;
    uint64_t   u64;
    uint8_t    u8;

    /* Replicate the byte into all 8 bytes of a 64-bit word */
    u8   = ( uint8_t )c;
    u64  = ( uint64_t )u8;
    u64 |= u64 << 8;
    u64 |= u64 << 16;
    u64 |= u64 << 32;

    sp = ( uint8_t * )p;

    /* Write single bytes until the pointer is aligned on 8 bytes */
    while( n > 0 && ( ( uintptr_t )sp & 7 ) != 0 )
    {
        *( sp++ ) = u8;
        n--;
    }

    /* Write 8 bytes at a time while possible */
    lp = ( uint64_t * )( ( void * )sp );

    while( n >= 8 )
    {
        *( lp++ ) = u64;
        n        -= 8;
    }

    /* Write the remaining bytes one by one */
    sp = ( uint8_t * )( ( void * )lp );

    while( n-- )
    {
        *( sp++ ) = u8;
    }

    return p;
}
This is actually a decent implementation.
The big advantage is that it's written in C, so it's pretty easy to read and it's completely portable.
Now can we optimise this more?
The answer is of course, yes.
But we might need to stop using C, and write our function using assembly code.
First of all, let's take a minute to talk about coding in assembly, for performance considerations.
Most of the time, it's a bad idea.
Writing assembly code means you'll lose portability: you'll have to write a specific version for each processor architecture, which can be a huge amount of work.
You'll also lose the readability and maintainability of C.
Also remember that writing the same algorithm in assembly will not, by itself, give you better performance.
While it's true that you would be able to get rid of some unnecessary instructions generated by the compiler, this is not the bottleneck.
So it's usually better to stay away from assembly if you do not have very specific reasons to do it.
That being said, a well written assembly code can be pure magic.
In our case, it's completely worth the effort, as a function like memset() has to be very performant.
And also because, using assembly, we have some nice ways to optimise it further.
Using assembly will allow us to use SSE-2 instructions. Of course, not all CPUs support them, so we'll have to check for that first.
If SSE-2 is not available, we can still use the optimised C version as a fallback.
For x86 platforms, we can check if the CPU supports SSE-2 instructions using the CPUID instruction:
    mov     eax, 1
    cpuid
    test    edx, 0x4000000
    jz      .fail

    ; SSE-2 code

.fail:

    ; Non SSE-2 code
Note that the CPUID instruction will trash the contents of EAX, EBX, ECX and EDX.
Now how can SSE-2 help us in the memset() case?
Well SSE-2 registers are 128 bits wide, so using them would allow us to write 16 bytes at a time into memory.
This would result in a huge performance boost.
Note that in order to use SSE-2 instructions, the pointer to the buffer will have to be aligned on a 16-byte boundary.
Otherwise, it will just crash.
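The same idea can be sketched in C with SSE-2 intrinsics, which is easier to follow than the assembly. This is my own simplified illustration, not the GitHub version: it assumes an x86 compiler with SSE-2 enabled, and a pointer that is already aligned on 16 bytes when it enters the 16-byte loop:

```c
#include <emmintrin.h> /* SSE-2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Fills the buffer 16 bytes at a time with aligned SSE-2 stores.
   The caller must pass a 16-byte aligned pointer, otherwise the
   aligned store (movdqa) will fault. */
void * memset_sse2( void * p, int c, size_t n )
{
    __m128i   v = _mm_set1_epi8( ( char )c ); /* 16 copies of the byte */
    uint8_t * s = p;

    while( n >= 16 )
    {
        _mm_store_si128( ( __m128i * )s, v ); /* aligned 16-byte store */
        s += 16;
        n -= 16;
    }

    /* Trailing bytes, one at a time */
    while( n-- )
    {
        *( s++ ) = ( uint8_t )c;
    }

    return p;
}
```

A real implementation would also handle an unaligned head, as the C version above does for 8-byte words.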
I've placed such an optimised memset() function on GitHub, as it's really too long to include here.
You can see the whole code here:
https://github.com/macmade/LibC-String-Optimisations/blob/master/source/memset.64.s
Note the following lines:
    movdqa [ rdi ],       xmm0
    movdqa [ rdi + 16 ],  xmm0
    movdqa [ rdi + 32 ],  xmm0
    movdqa [ rdi + 48 ],  xmm0
    movdqa [ rdi + 64 ],  xmm0
    movdqa [ rdi + 80 ],  xmm0
    movdqa [ rdi + 96 ],  xmm0
    movdqa [ rdi + 112 ], xmm0
This actually writes chunks of 128 bytes into the memory buffer.
Running such a function one million times with a 4096-byte buffer, as before, takes about 0.07 seconds.
That's about 15 times faster than our optimised C version, and roughly 130 times faster than the naive implementation.
This is huge.
This implementation is even slightly faster than the Mac OS X implementation, which runs the same test in 0.09 seconds.
In other words, and as a conclusion, remember that simplicity does not always equal performance.
Also remember that optimisation can represent a huge amount of time, especially when dealing with assembly optimisations.
But as you can see, for some specific cases, it's worth the effort.