Efficient strlen(3) implementation in C (part 2)
Category: Sysadmin.
Some times ago, I wrote [1] a few words about strlen(3) implementation and how inefficient it is on most systems. Although optimizing this function is utterly not the best way to make your program run faster (refer to Reducing complexity
in the previous strlen(3) post), it is interesting to digg deeper in how modern CPUs can help us to compute a string length faster.
Simultaneous access to the memory
Most system implement multi-tasking, and recent CPUs often feature multiple cores. A single application can run faster on a multiple core machine if it has been designed for splitting the work to achieve into smaller tasks and rely on subprocesses (usualy threads) to compute the results.
As threads share resources, many subprocesses can access the same memory area simultaneously. If we know the maximum length of a character string, we can split this area into smaller chunks, launch multiple threads that try to find an EOS character into each chunk in parallel, and collect all threads results to compute the length (this is a simple divide and conquer algorithm
).
Multithreading generaly leads to synchronisation issues, but in our case, each thread only reads data. There is therefore not so much difference between the classical strlen(3) implementation and a multithreaded one: we only have to care about synchronisation issues if we have other threads that may change the string while we compute its length.
From strlen2() to strnlen2()
In the previous article, I wrote a sample of strlen(3) implementation pipeline-effect friendlier. Since it is compatible with the classic strlen(3) implementation, it is not capable to stop after testing a certain amount of data. I have modified it so that it can stop after reading size
bytes, and — as the function can now fail — sets found
according to the search result.
size_t strnlen2(str, size, found) const char *str; const size_t size; int *found; { size_t result = 0; uint32_t *i = (uint32_t *)str; *found = 1; for (;result < size;) { if (!(*i & 0x000000ff)) return result; if (!(*i & 0x0000ff00)) return result + 1; if (!(*i & 0x00ff0000)) return result + 2; if (!(*i & 0xff000000)) return result + 3; result += 4; i++; /* Advance 4 bytes */ } *found = 0; return 0; }
Maybe have you noticed the new result < size
condition in the loop? It is exactly what we wanted to avoid when writing strlen2()! This will effectively ruin performances and make strnlen2() only a bit faster than strlen(3).
Multithreaded implementation: strlen3()
I have implemented strlen3() using Posix.1c threads (pthreads). There are so two functions used to compute a string length: strlen3() and strlen3_thread(). The second function is basically the same as strnlen2(), and the first one is quite simple:
#ifndef N_CPU # define N_CPU 4 #endif size_t strlen3(s, block_size, found) char *s; size_t block_size; int *found; { int i; pthread_t thread[N_CPU]; strlen3_result_t result[N_CPU]; /* Launch threads */ for (i = 0; i < N_CPU; i++) { strlen3_params_t *params = malloc(sizeof(*params)); params->s = s + (i * block_size); params->bs = block_size; params->result = &(result[i]); pthread_create(&thread[i], NULL, strlen3_thread, (void *)(params)); ① } /* Join threads */ for (i =0; i < N_CPU; i++) { pthread_join(thread[i], NULL); ② } /* Get the result */ for (i = 0; i < N_CPU; i++) { if (result[i].found) { *found = 1; return (result[i].position + i * block_size); ③ } } *found = 0; return 0; }
The functions launch N_CPU
threads to scan s
①, wait for them to finish ② and go through the filled-in data structures to locate the EOS char ③. The thread is responsible of the initialisation of its result[i]
data structure and has to free the memory area it reads its parameters from (params
).
The number of subprocess is defined at compile time with the N_CPU macro. You may not set N_CPU
to a value greater than the number of cores on the system, since it would make the function run slower.
Implementations efficiency comparison
In my previous test, I only used external commands to know the execution time of the different strlen() implementations and get an idea of how faster an implementation was compared to another. Unfortunately, this method cannot be accurate: theire is no way to distinguish CPU time used for the string length measurement and for the rest of the execution of the program (startup, memory allocation, etc).
Measuring execution time
While a program is running, the system maintains information about its ressource utilisation, including execution time. The application itself can call getrusage(2) to gather this information at any time.
With a little math, it is easy to get the CPU time needed to run a piece of code:
struct rusage before, after; getrusage(RUSAGE_SELF, &before); /* do something */ getrusage(RUSAGE_SELF, &after); fprintf(stdout, "Execution time: %fs\n", ((double)(after.ru_utime.tv_sec - before.ru_utime.tv_sec) * pow(10,6) + after.ru_utime.tv_usec - before.ru_utime.tv_usec) * pow(10, -6));
Benchmark results
I have first tried this new implementation on my FreeBSD box, a dual core x86. The table 1 below shows the performances of the different strlen() implementations.
MB per core | strlen(3) | strlen2 | strnlen2 | strlen3 |
---|---|---|---|---|
1 | 0.004377 | 0.000459 | 0.000559 | 0.000701 |
2 | 0.010613 | 0.002167 | 0.002320 | 0.003194 |
4 | 0.018185 | 0.003995 | 0.004574 | 0.005078 |
8 | 0.041938 | 0.007513 | 0.008892 | 0.009891 |
16 | 0.084411 | 0.014881 | 0.018281 | 0.019456 |
32 | 0.164846 | 0.029476 | 0.036482 | 0.038390 |
64 | 0.337937 | 0.059229 | 0.072783 | 0.076710 |
128 | 0.673236 | 0.118664 | 0.144407 | 0.146989 |
225 | 1.188218 | 0.208603 | 0.255988 | 0.263317 |
As you can see, the strlen3() implementation is not that fast: it is event the slowest of our fast
implementations. Let's benchmark on another machine: a 8-core server (Table 2).
MB per core | strlen(3) | strlen2 | strnlen2 | strlen3 |
---|---|---|---|---|
1 | 0.034994 | 0.023997 | 0.024996 | 0.043993 |
2 | 0.066990 | 0.049992 | 0.049993 | 0.096985 |
4 | 0.133980 | 0.097985 | 0.099985 | 0.206968 |
8 | 0.266959 | 0.193971 | 0.199970 | 0.426935 |
16 | 0.529919 | 0.389941 | 0.398939 | 0.872868 |
32 | 1.059839 | 0.779882 | 0.812876 | 1.742735 |
64 | 2.125677 | 1.582759 | 1.609755 | 3.510467 |
80 | 2.644598 | 1.954703 | 2.011694 | 4.392332 |
Results are event worst! Two time slower that the strnlen2() function! Something is going wrong...
The answer cam from a friend of mine, Jean-Marie Favreau. Then n thread are running within the same process, their execution time is added to the process time, and therefore n processes on CPU 1 second increase the process counters by n seconds. The results from strlen3() have so to be divided by two in table 1 and by 8 in table 2.
This leads us to the folowing results:
MB per core | strlen(3) | strlen3() |
---|---|---|
1 | 0,004377 | 0,000351 |
2 | 0,010613 | 0,001597 |
4 | 0,018185 | 0,002539 |
8 | 0,041938 | 0,004946 |
16 | 0,084411 | 0,009728 |
32 | 0,164846 | 0,019195 |
64 | 0,337937 | 0,038355 |
128 | 0,673236 | 0,073495 |
225 | 1,188218 | 0,131659 |
MB per core | strlen(3) | strlen3() |
---|---|---|
1 | 0,034994 | 0,005499 |
2 | 0,066990 | 0,012123 |
4 | 0,133980 | 0,025871 |
8 | 0,266959 | 0,053367 |
16 | 0,529919 | 0,109108 |
32 | 1,059839 | 0,217842 |
64 | 2,125677 | 0,438808 |
80 | 2,644598 | 0,549041 |
Looks better doesn't it ;-) ?
As you can see, although the strlen3() implementation is two nearly two times slower than the strlen(3) implementation on the 8-CPU RedHat machine, all 8 cores are working together and it is 4 times fatser to compute the string length with this function.
I took the time to plot all these results in a Gnumeric spreadsheet: Efficient strlen() implementations comparison.
References
1. Efficient strlen(3) implementation in C