November 13, 2007

Efficient strlen(3) implementation in C (part 2)

Category: Sysadmin.

Some times ago, I wrote [1] a few words about strlen(3) implementation and how inefficient it is on most systems. Although optimizing this function is utterly not the best way to make your program run faster (refer to Reducing complexity in the previous strlen(3) post), it is interesting to digg deeper in how modern CPUs can help us to compute a string length faster.

Simultaneous access to the memory

Most system implement multi-tasking, and recent CPUs often feature multiple cores. A single application can run faster on a multiple core machine if it has been designed for splitting the work to achieve into smaller tasks and rely on subprocesses (usualy threads) to compute the results.

As threads share resources, many subprocesses can access the same memory area simultaneously. If we know the maximum length of a character string, we can split this area into smaller chunks, launch multiple threads that try to find an EOS character into each chunk in parallel, and collect all threads results to compute the length (this is a simple divide and conquer algorithm).

Multithreading generaly leads to synchronisation issues, but in our case, each thread only reads data. There is therefore not so much difference between the classical strlen(3) implementation and a multithreaded one: we only have to care about synchronisation issues if we have other threads that may change the string while we compute its length.

From strlen2() to strnlen2()

In the previous article, I wrote a sample of strlen(3) implementation pipeline-effect friendlier. Since it is compatible with the classic strlen(3) implementation, it is not capable to stop after testing a certain amount of data. I have modified it so that it can stop after reading size bytes, and — as the function can now fail — sets found according to the search result.

size_t
strnlen2(str, size, found)
  const char *str;
  const size_t size;
  int *found;
{
  size_t result = 0;
  uint32_t *i = (uint32_t *)str;

  *found = 1;

  for (;result < size;) {
    if (!(*i & 0x000000ff)) return result;
    if (!(*i & 0x0000ff00)) return result + 1;
    if (!(*i & 0x00ff0000)) return result + 2;
    if (!(*i & 0xff000000)) return result + 3;
    result += 4;
    i++; /* Advance 4 bytes */
  }

  *found = 0;
  return 0;
}

Maybe have you noticed the new result < size condition in the loop? It is exactly what we wanted to avoid when writing strlen2()! This will effectively ruin performances and make strnlen2() only a bit faster than strlen(3).

Multithreaded implementation: strlen3()

I have implemented strlen3() using Posix.1c threads (pthreads). There are so two functions used to compute a string length: strlen3() and strlen3_thread(). The second function is basically the same as strnlen2(), and the first one is quite simple:

#ifndef N_CPU
# define N_CPU 4
#endif

size_t
strlen3(s, block_size, found)
  char *s; 
  size_t block_size;
  int *found;
{
  int i;
  pthread_t thread[N_CPU];
  strlen3_result_t result[N_CPU];

  /* Launch threads */
  for (i = 0; i < N_CPU; i++) {
    strlen3_params_t *params = malloc(sizeof(*params));
    params->s = s + (i * block_size);
    params->bs = block_size;
    params->result = &(result[i]);

    pthread_create(&thread[i], NULL, strlen3_thread, (void *)(params)); ①
  }

  /* Join threads */
  for (i =0; i < N_CPU; i++) {
    pthread_join(thread[i], NULL); ②
  }

  /* Get the result */
  for (i = 0; i < N_CPU; i++) {
    if (result[i].found) {
      *found = 1;
      return (result[i].position + i * block_size); ③
    }
  }

  *found = 0;
  return 0;
}

The functions launch N_CPU threads to scan s ①, wait for them to finish ② and go through the filled-in data structures to locate the EOS char ③. The thread is responsible of the initialisation of its result[i] data structure and has to free the memory area it reads its parameters from (params).

The number of subprocess is defined at compile time with the N_CPU macro. You may not set N_CPU to a value greater than the number of cores on the system, since it would make the function run slower.

Implementations efficiency comparison

In my previous test, I only used external commands to know the execution time of the different strlen() implementations and get an idea of how faster an implementation was compared to another. Unfortunately, this method cannot be accurate: theire is no way to distinguish CPU time used for the string length measurement and for the rest of the execution of the program (startup, memory allocation, etc).

Measuring execution time

While a program is running, the system maintains information about its ressource utilisation, including execution time. The application itself can call getrusage(2) to gather this information at any time.

With a little math, it is easy to get the CPU time needed to run a piece of code:

  struct rusage before, after;

  getrusage(RUSAGE_SELF, &before);

  /* do something */

  getrusage(RUSAGE_SELF, &after);

  fprintf(stdout, "Execution time: %fs\n",
          ((double)(after.ru_utime.tv_sec - before.ru_utime.tv_sec) * pow(10,6) +
					 after.ru_utime.tv_usec - before.ru_utime.tv_usec) * pow(10, -6));

Benchmark results

I have first tried this new implementation on my FreeBSD box, a dual core x86. The table 1 below shows the performances of the different strlen() implementations.

Table 1 — strlen() implementations on a dual-core machine
MB per corestrlen(3)strlen2strnlen2strlen3
10.0043770.0004590.0005590.000701
20.0106130.0021670.0023200.003194
40.0181850.0039950.0045740.005078
80.0419380.0075130.0088920.009891
160.0844110.0148810.0182810.019456
320.1648460.0294760.0364820.038390
640.3379370.0592290.0727830.076710
1280.6732360.1186640.1444070.146989
2251.1882180.2086030.2559880.263317

As you can see, the strlen3() implementation is not that fast: it is event the slowest of our fast implementations. Let's benchmark on another machine: a 8-core server (Table 2).

Table 2 — strlen() implementations on a 8-core machine
MB per corestrlen(3)strlen2strnlen2strlen3
10.0349940.0239970.0249960.043993
20.0669900.0499920.0499930.096985
40.1339800.0979850.0999850.206968
80.2669590.1939710.1999700.426935
160.5299190.3899410.3989390.872868
321.0598390.7798820.8128761.742735
642.1256771.5827591.6097553.510467
802.6445981.9547032.0116944.392332

Results are event worst! Two time slower that the strnlen2() function! Something is going wrong...

The answer cam from a friend of mine, Jean-Marie Favreau. Then n thread are running within the same process, their execution time is added to the process time, and therefore n processes on CPU 1 second increase the process counters by n seconds. The results from strlen3() have so to be divided by two in table 1 and by 8 in table 2.

This leads us to the folowing results:

Table 3 — 2 cores
MB per corestrlen(3)strlen3()
10,0043770,000351
20,0106130,001597
40,0181850,002539
80,0419380,004946
160,0844110,009728
320,1648460,019195
640,3379370,038355
1280,6732360,073495
2251,1882180,131659
Table 4 — 8 cores
MB per corestrlen(3)strlen3()
10,0349940,005499
20,0669900,012123
40,1339800,025871
80,2669590,053367
160,5299190,109108
321,0598390,217842
642,1256770,438808
802,6445980,549041

Looks better doesn't it ;-) ?

As you can see, although the strlen3() implementation is two nearly two times slower than the strlen(3) implementation on the 8-CPU RedHat machine, all 8 cores are working together and it is 4 times fatser to compute the string length with this function.

I took the time to plot all these results in a Gnumeric spreadsheet: Efficient strlen() implementations comparison.

References

1. Efficient strlen(3) implementation in C

Tags:

No Comments Yet

Comments RSS feed | Leave a Reply…

top