Code optimisations - Optimising memset()

In my free time, I'm writing an operating system called XEOS.
As the system is written mainly in C, I'm also writing a C99 standard library.

The last days, I wrote optimised versions of some C99 string functions, like strlen(), strchr() or memset().

This set of functions is usually used intensively in any C program or library.
So obviously, the more optimised they are, the better the program will perform.

Let's take a look a the memset() function, which is used to fill a memory buffer with a single character.
Its prototype is:

void * memset( void * p, int c, size_t n );

A straightforward and naive C implementation would typically be:

void * memset( void * p, int c, size_t n )
{
    unsigned char * s = p;
    
    while( n-- )
    {
        *( s++ ) = ( unsigned char )c;
    }
    
    return p;
}

Easy, simple, straightforward.
But how does this perform?

Actually, it does perform very bad.
In fact, it would be difficult to perform worse unintentionally.

On my computer (Intel Core i7), calling this function one million times with a 4096 bytes buffer takes about 9 seconds.
This is huge.

Some programmers think the more simple is the code, the better it will perform, and that optimisation can be performed by reducing the amount of code.

Well, that's completely wrong.
If you're such a programmer, I deeply encourage you to revise your thinking.

So let's take a closer look.

The main issue in the previous code is that it writes one byte at a time into the destination buffer.
High-level programmers (including C programmers, as C is a high-level language) might think that memory operations are cheap, in terms of performance.

That's also wrong.
Memory operations aren't cheap.

In other words, the less times you read/write from/to memory, the better your program will perform.
So how can we achieve this here?

Well, actual CPUs are 64 bits wide, so we could simply write 8 bytes (64 bits) at a time, rather than a single byte (8 bits).

But what if the length of the buffer is not a multiple of 8? If the program doesn't crash (segmentation fault), we would write outside the buffer, which is worse, and might be harder to debug.

A simple solution is to write 8 bytes while it's possible, then writes the remaining bytes one by one.
Something like:

void * memset( void * p, int c, size_t n )
{
    uint8_t  * sp;
    uint64_t * lp;
    uint64_t   u64;
    uint8_t    u8;
    
    u8  = ( uint8_t )c;
    u64 = ( uint64_t )c;
    u64 = ( u64 << 32 ) | u64;
    lp  = ( uint64_t * )p;
    
    while( ( n / 8 ) > 0 )
    {
        *( lp++ ) = u64;
        n        -= 8;
    }
    
    sp = ( uint8_t * )lp;
    
    while( n-- )
    {
        *( sp++ ) = u8;
    }
    
    return p;
}

Let's run the same test as previously.
Calling this function one million times with a 4096 bytes buffer takes about 1.1 seconds.

9 times faster than the previous one.

We can see here that simplicity does not equal performance.

But in terms of performance, we may still have an issue with this code.
If the pointer to the memory buffer is not aligned, the overall performances would decrease. We might even experience crashes, as some CPU architectures do not allow unaligned access to memory.

The test I ran wasn't affected, as the pointer was aligned, and as the Core i7 allows unaligned access to memory without performance loss anyway.

But to be sure the code will run fast and fine, we should first align the pointer.
We may write bytes one by one util it is aligned, then write 8 bytes at a time:

void * memset( void * p, int c, size_t n )
{
    uint8_t  * sp;
    uint64_t * lp;
    uint64_t   u64;
    uint8_t    u8;
    
    u8  = ( uint8_t )c;
    u64 = ( uint64_t )c;
    u64 = ( u64 << 32 ) | u64;
    sp  = ( uint8_t * )p;
    
    while( n-- && ( ( ( uint64_t )sp & ( uint64_t )-8 ) < ( uint64_t )sp ) )
    {
        *( sp++ ) = u8;
    }
    
    lp  = ( uint64_t * )( ( void * )sp );
    
    while( ( n / 8 ) > 0 )
    {
        *( lp++ ) = u64;
        n        -= 8;
    }
    
    sp = ( uint8_t * )( ( void * )lp );
    
    while( n-- )
    {
        *( sp++ ) = u8;
    }
    
    return p;
}

This is actually a decent implementation.
The big advantage is that it's written in C, so it's pretty easy to read and it's completely portable.

Now can we optimise this more?

The answer is of course, yes.
But we might need to stop using C, and write our function using assembly code.

First of all, let's take a minute to talk about coding in assembly, for performance considerations.
Most of the time, it's usually a bad idea.

Writing assembly code first means you'll loss the portability.
You'll have to write a specific function for each processor architecture, which can be a huge and long work.
You'll also loss the readability and maintainability of C.

Also remember that writing the same algorithm using assembly will not give you more performances.
While it's true that you would be able to get rid of some unnecessary instructions generated by the compiler, this is not the bottleneck.
So it's usually better to stay away from assembly if you do not have very specific reasons to do it.
That being said, a well written assembly code can be pure magic.

In our case, it's completely worth the effort, as a function like memset() has to be very performant.
And also because, using assembly, we have some nice way to optimise it more.

Using assembly will allow us to use SSE-2 instructions. Of course, not all CPUs support this, so we'll have to check it first.
If SSE-2 are not available, we might still use the optimised C version as fallback.

For x86 platforms, we can check if the CPU supports SSE-2 instructions using the CPUID instruction:

mov     eax,    1
cpuid
test    edx,    0x4000000    
jz      .fail

; SSE-2 code

.fail:

; Non SSE-2 code

Note that this code will trash the content of EAX, ECX and EDX.

Now how can SSE-2 help us in the memset() case?
Well SSE-2 registers are 128 bits wide, so using them would allow us to write 16 bytes at a time into memory.
This would result in a huge performance boost.

Note that in order to use SSE-2 instructions, the pointer to the buffer will have to be aligned on a 16-byte boundary.
Otherwise, it will just crash.

I've placed such an optimised memset() function on GitHub, as it's really too long for here.
You can see the whole code here:
https://github.com/macmade/LibC-String-Optimisations/blob/master/source/memset.64.s

Note the following lines:

movdqa  [ rdi       ],  xmm0
movdqa  [ rdi +  16 ],  xmm0
movdqa  [ rdi +  32 ],  xmm0
movdqa  [ rdi +  48 ],  xmm0
movdqa  [ rdi +  64 ],  xmm0
movdqa  [ rdi +  80 ],  xmm0
movdqa  [ rdi +  96 ],  xmm0
movdqa  [ rdi + 112 ],  xmm0

This actually writes chunks of 128 bytes into the memory buffer.

Running such a function one million times with a 4096 bytes buffer, as before, takes about 0.07 seconds.

15 times faster than our optimised C version, and 150 times faster that the naive implementation.
This is huge.

This implementation is even slightly faster than the Mac OS X implementation, which runs the same test in 0.09 seconds.

In other words, and as a conclusion, remember than simplicity does not always equals performance.
Also remember that optimisation can represent a huge amount of time, especially when dealing with assembly optimisations.

But as you can see, for some specific cases, it's worth the effort.

Comments

Author
Popy
Date
08/06/2013 07:38
I disagree with "simplicity does not always equals performance".
Your optimized algorythm is still simple (a bit less than the naïve one, but still), and its ASSEMBLY version is still a simple algorythm written in a hard to read language. In fact, the concept of such functions is meant to keep things simple.
BTW, shouldn't you use sizeof instead of 8 ?
Author
Jean-David Gadina
Date
08/06/2013 07:45
I said does not ** always **... ; )
And the point about simplicity was more about the code-readability/code-length, rather than the algorithm itself. As you pointed it out, they are quite simple, even in assembly.
Keeping things simple is often a good advice, but not when you can improve the performances by 15 times, IMHO, and especially for such library functions.
About sizeof, no need for it as I'm using fixed sized integers (uint64_t). : )
Author
Christian
Date
02/19/2014 13:08
Interesting article.

Just one remark, there is a bug in the optimized C version.
If I pass e.g. 0xFF it should fill all bytes with 0xFF but if it writes u64 it only writes the last byte and middle byte as 0xFF and the others as 0.
Author
Julian
Date
07/09/2014 14:30
I think the optimized C version will most likely crash when setting a single byte (n=1) that is not 64-bit aligned...
When checking the memory alignment

while( n-- && ( ( ( uint64_t )sp & ( uint64_t )-8 ) < ( uint64_t )sp ) )

n is decreased to zero in the first loop cycle and in the next loop cycle, although n is zero, the n-- statement is executed. As size_t is an unsigned integer type, it will have the maximum value then. And the following code will set memory to far beyond the intended end because n is greater than zero...
Author
Jean-David Gadina
Date
09/28/2014 01:46
Thanks for the comment... Note that the second loop uses `n--`, not `--n`.
So if `n` is zero, the second loop won't be executed, as the value of `n` is evaluated before it's decremented.
Author
Trogdan
Date
09/28/2014 01:46
BTW, according to the definition of memset, the unsigned int value is interpreted as an unsigned char, and replicated throughout each byte. your examples use the entire 32-bit value
Author
Jean-David Gadina
Date
09/28/2014 01:46
@Trogdan You mean you could have some junk in the high 24-bits?
Author
Trogdan
Date
09/28/2014 01:46
@Jean Yup! Each byte element needs to be the same unsigned char after casting value, high order 24 bites need to be dropped.
Author
Trogdan
Date
09/28/2014 01:46
@Jean memset predates function prototypes in C, thus, char was not allowed. It was left as is for backwards compatibility Http://stackoverflow.com/questions/5919735/why-does-memset-take-an-int-instead-of-a-char
Author
Alec
Date
09/28/2014 01:46
This is broken in multiple ways.

Here's a version I think is fixed. Apologies in advance for likely messed up formatting.

void * memset(void * p, int c, size_t n) {
uint8_t u8 = (uint8_t)c;
uint64_t u64 = (uint64_t)c;
u64 = (u64 << 48) | (u64 << 32) | (u64 << 16) | u64;
uint8_t *sp = (uint8_t *)p;
while (n && ((uintptr_t)sp & 7)) {
n--;
*sp++ = u8;
}
uint64_t *lp = (uint64_t *)(void *)sp;
while (n >= 8) {
*lp++ = u64;
n -= 8;
}
sp = (uint8_t *)(void *)lp;
while (n--)
*sp++ = u8;
return p;
}
Author
Alec
Date
09/28/2014 01:46
Sorry, that should be:
u64 = (u64 << 56) | (u64 << 48) | (u64 << 40) | (u64 << 32) | (u64 << 24) | (u64 << 16) | (u64 << 8) | u64;
Author
mac
Date
09/28/2014 01:46
Hi Jean-David ,

I'm a beginner and have some questions.

I did not understand the 2nd version of your memset function i.e

A simple solution is to write 8 bytes while it's possible, then writes the remaining bytes one by one.
Something like:


What does this "u64 = ( u64 << 32 ) | u64 " do?
Here is my understanding:

As written in the beginning "Let's take a look a the memset() function, which is used to fill a memory buffer with a single character."
Let's sat we are passing 'zero' (decimal value 48) as a argument to the 2nd version of your memset function.
Then
u64 = 48 (decimal value of character zero)
So, now the line :
u64 = ( u64 << 32 ) | u64
will look like --> u64 = (48<< 2^5) | 48
48<<32 = 48 * 32 = 1536

u64 = 1536 | 48 ;

Hence, u64 = 1536 OR 48 = 11000000000 OR 000001100000 = 110001100000 (binary) = 3168 (decimal value)

is my understanding right?

Coming to the 1st while loop:

while( ( n / 8 ) > 0 )
{
*( lp++ ) = u64; // But this line is placing value '3168' instead of a character in the memory buffer.
n -= 8;
}

sp = ( uint8_t * )lp;

The 2nd while loop looks good to me.
while( n-- )
{
*( sp++ ) = u8; // This places the original argument passed in the memory buffer.
}

Should'nt the value stored in variables u64 and u8 be same?
Can you please explain how this line "A simple solution is to write 8 bytes while it's possible, then writes the remaining bytes one by one." works in your function?

Thanks