August 6, 2007

Efficient strlen(3) implementation in C

Category: Sysadmin.

Yesterday, somebody reached my website googling for efficient strlen implementation. This remind me an old discussion on code optimisation and how it was possible to dramatically reduce execution time of some basic functions.

Common implementation

Let's see how strlen(3) is implemented under FreeBSD 6.2 (the system I'm running). Chances are an equivalent implementation exists for nearly every operating system):

size_t
strlen(str)
  const char *str;
{
  const char *s;

  for (s = str; *s; ++s); ①
  return(s - str); ②
}

The design is simple: a pointer (s) runs along the given string (str) looking for an end-of-string character ('\0' == *s) ①. The string length is then computed by substracting the address of the pointer (s) from the address of the beginning of the string ②. While simple and portable, this implementation is not efficient: each loop iteration only performs a simple operation and the breaking condition is based on its result. Consequently, no pipeline effect occurs in the CPU.

Improving pipeline effect

The easiest way to improve pipelining with loops is to unroll them. In other words, create a loop that will, for each iteration, be equivalent to many iterations of the original loop. In other word, we have to make the loop process more data per iteration.

Good news! Most CPUs provide registers wider than 8 bits (commonly 32 or 64 on desktop computers): instead of testing each string chars in sequence, and working with only 8 bits in our registers, we can load 32 bits at once, and then test a string 4-bytes at once instead of 1-byte at once.

This can be implemented as this for a 32 bits i386 architecture:

size_t
strlen2(str)
  const char *str;
{
  size_t result = 0;
  uint32_t *i = (uint32_t *)str; ①

  for (;;) {
    if (!(*i & 0x000000ff)) return result;     ②
    if (!(*i & 0x0000ff00)) return result + 1;
    if (!(*i & 0x00ff0000)) return result + 2;
    if (!(*i & 0xff000000)) return result + 3;
    result += 4;
    i++; /* Advance 4 bytes */ ③
  }
}

This implementation casts the given string (str) as an integer (4 bytes on i386) ①. Each byte that composes the 32 bits integer is then tested to check if it's a NULL character ② (note that we test the caracters in reverse order since i386 is little endian). Finally, it the end of the string has not be read, the next 32 bits are loaded and proceeded ③.

This implementation is circa 4.3 times faster on my machine, unfortunately, it has some drawbacks:

  1. this code is not portable: bytes order will not be the same on big and little endian systems. This can however be worked around with preprocessor macros;
  2. this code is not safe: malicious users may read up to 3 bytes of data after the NULL character under certain circumstances.

Reducing complexity

Improvement is not that much effective? Actually, we are optimizing the wrong way! Really ;)! When optimizing code, you have to, in order:

  1. reduce complexity;
  2. tune your compiler;
  3. optimize your code.

So, how can we reduce complexity (Currently O(n) for both algorithms)? Simply by storing the string length somewhere! The complexity is then O(1) :^D.

A simple implementation would be:

struct {
  size_t len;
  char  *str;
};

Object Pascal programming language (also known as Delphi) provides a string object that offers the flexibility of objects (operators can be use to concatenate strings for example) and backward compatibility with null-terminated strings APIs. This is made possible with a simple trick: when a string is created, a new data structure is allocated but the returned value is basically the address of one member of the structure and not the structure itself. For example, if we consider this code:

var
  string s;
begin
  s := 'abc'; ①
end;

After the assignment ①, the allocated data structure will be (for example) 8 bytes long: 4 bytes will store the string length (i386 architecture), 3 bytes will hold the string itself, and a NULL character will come after the string. Behind the wood, the compiler will allocate the complete data structure but return a pointer 4 bytes after the beginning of it.

We can do the same in C. The xstring.c and xstring.h files provide some basic functions for a similar behavior:

#include "xstring.h"

char *s;
if ((s = string_new("Hello World"))) {
  printf("%s %d\n", s, string_length(s));
  string_free(s);
}

As you can see, it can be totally transparent for the user. You just have to take care to initialize and free your strings using the appropriate functions.

See also

  1. Efficient strlen(3) implementation in C (part 2)
Tags:

No Comments Yet

Comments RSS feed | Leave a Reply…

top