05 Apr 2018

Perfect hashing for fixed-size string set

As aforementioned, we’ve dicussed perfect hashing for integers.

It turns out that the concept can be generalized for string hashing.

Leading out a simple question: How you do check if a string inside a small set?

Let’s concrete the question, consider we’re about to writing a C lexer.

The first step is to parse each token, we need to issue an error if any token mismatch the finite-state machine.

For example, like a variable name categorized into reserved keywords.

Considering we want to write a C89-compatible tokenizer, we thus have following 32 keywords.

We can easily write out first naive draft.

static const char *c89kws[] = {
	"auto", "break", "case", "char", "const", "continue", "default", "do",
	"double", "else", "enum", "extern", "float", "for", "goto", "if",
	"int", "long", "register", "return", "short", "signed", "sizeof", "static",
	"struct", "switch", "typedef", "union", "unsigned", "void", "volatile", "while",
};

int is_rsrv_c89kw(const char *tok)
{
	assert(tok != NULL);
	size_t i, sz = sizeof(c89kws) / sizeof(char *);
	for (i = 0; i < sz; i++)
		if (!strcmp(tok, c89kws[i]))
			return 1;
	return 0;
}

Complexity of above solution is O(m * n), which m is average length of the keyword list, n is the size of the keyword list.

Yet some guys may happen to optimize this by binary search, which yields the second draft.

int is_rsrv_c89kw2(const char *tok)
{
	assert(tok != NULL);

	int l = 0, r = sizeof(c89kws) / sizeof(char *) - 1;
	int m, cmp;

	while (l <= r) {
		m = l + ((r - l) >> 1);
		cmp = strcmp(tok, c89kws[m]);
		if (cmp == 0)
			return 1;
		cmp > 0 ? (l = m + 1) : (r = m - 1);
	}

	return 0;
}

This should be reasonable for real-life implementation, time complexity down to O(m * log2(n).

Now we pursue a third solution, perfect hashing, first we need to probe an array that makes each elements in the list having an unique hash value(reduce the string to hash-value), the we utilize the integer perfect hashing for string matching.

By reducing string to integer value, we have lots of choices, the cheapest way is to hash.

/*
 * This is a bad string-hash function, yet useful when used in fixed-set
 */
#define LOSELOSE2(s) ({         \
    const char *p = s;          \
    unsigned int i, h = 0;      \
    for (i = 0; p[i]; i++)      \
        h += p[i] | (i + 1);    \
    h;                          \
})

Above is one of the possible hash function, when apply to the give list, it yields the result(not yet sorted, any two of them not equal).

446, 528, 414, 422, 557, 884, 754, 212, 645, 427, 439, 672, 542, 329, 444, 207,
334, 434, 892, 683, 567, 638, 659, 660, 686, 670, 763, 556, 875, 437, 890, 541,

P.S. if we replace (i + 1) with 0, the LOSELOSE2() become classic K&R lose-lose.

Now we need to probe an unique magic number which reduce those 32 numbers into range [0, 32), and also any two of them cannot overlap.

static const char *c89kws[] = {
    "auto", "break", "case", "char", "const", "continue", "default", "do",
    "double", "else", "enum", "extern", "float", "for", "goto", "if",
    "int", "long", "register", "return", "short", "signed", "sizeof", "static",
    "struct", "switch", "typedef", "union", "unsigned", "void", "volatile", "while",
};

#define COT 32
#define SLOT 32

/* See previously article for details */
#define TYPE unsigned int
#define SHIFT 27
#define HASH(x) (((TYPE) (x)) >> SHIFT)

static unsigned int kwh[COT];

void probing(void)
{
    TYPE i, j;
    char buff[SLOT];

    for (i = 0x1; i != 0; i++) {
        memset(buff, -1, sizeof(buff));

        for (j = 0; j < COT; j++) {
            /* each bit should mapped into unique slot */
            if (++buff[HASH(kwh[j] * i)] != 0)
                break;
        }

        if (j == COT)
            break;
    }

    if (i == 0) {
        fprintf(stderr, "probe failed\n");
        return;
    }

    printf("%#0*x:\n", (int) ((sizeof(i) + 1) << 1), i);

    for (j = 0; j < COT; j++) {
        /* mapping from bitmap into index */
        buff[HASH(kwh[j] * i)] = j;
    }

    for (j = 0; j < SLOT; j++) {
        printf("%2d,", buff[j]);
        putchar(j % 8 != 7 ? ' ' : '\n');
    }
}

#define LOSELOSE2(s) ({         \
    const char *p = s;          \
    unsigned int i, h = 0;      \
    for (i = 0; p[i]; i++)      \
        h += p[i] | (i + 1);    \
    h;                          \
})

int main(void)
{
    size_t i, sz = sizeof(c89kws) / sizeof(char *);

    for (i = 0; i < sz; i++)
        kwh[i] = LOSELOSE2(c89kws[i]);

    probing();

    return 0;
}

Unfortunately, if you run above program, it will print probe failed message.

If you take a closer look at those string-hash values(I sorted them for convenience).

207, 212, 329, 334, 414, 422, 427, 434, 437, 439, 444, 446, 528, 541, 542, 556,
557, 567, 638, 645, 659, 660, 670, 672, 683, 686, 754, 763, 875, 884, 890, 892,

You’ll find that they’re too close to each other(that’s why I used a variant of lose-lose), which of course proved loselose is an extremely poor hash function.

There are two obvious solutions:

Pursue an uniformly distributed(universal) hash function
Expand the slots, which allow empty(unused) slot.

Difficulty for the first one is that it’s not easy to find such hash function which strike a balance between simplicity and speed.

The drawback for the second one is that it may consume extra space to store unused slots, yet, it may overstated, consider I want to add two extra keywords inline and restrict to the keyword list, which the unused slots can now be used without reinventing a new hash function.

So, if I double the slots number(half of them will be unused thus), and of course we also need to change the hash a little bit:

#define SLOT 64
#define SHIFT 26    /* 32 - log2(64) = 32 - 6 = 26 */

You would see the following output:

0x02da187e:
-1, 31, 12, -1, -1, -1, 21, -1,
-1, -1, -1,  8, 27,  4, -1, -1,
-1, -1, -1, 15, 20, 22, 23,  7,
-1,  6, -1, -1, -1, 25, -1, 11,
26, -1, -1, -1, -1, -1, 19,  2,
-1, 24, 13, -1,  3, -1, 16, 28,
 9, -1, -1, -1, -1, 17,  5, 29,
 1, 10, 30, 18, 14,  0, -1, -1,

Of which, -1 indicate unused slots. and other indices should map to corresponding string in the c89kws array.

Hence we have a third draft, hopefully it’s the final one.

static const char *c89kwmap[] = {
    NULL, "while", "float", NULL, NULL, NULL, "signed", NULL,
    NULL, NULL, NULL, "double", "union", "const", NULL, NULL,
    NULL, NULL, NULL, "if", "short", "sizeof", "static", "do",
    NULL, "default", NULL, NULL, NULL, "switch", NULL, "extern",
    "typedef", NULL, NULL, NULL, NULL, NULL, "return", "case",
    NULL, "struct", "for", NULL, "char", NULL, "int", "unsigned",
    "else", NULL, NULL, NULL, NULL, "long", "continue", "void",
    "break", "enum", "volatile", "register", "goto", "auto", NULL, NULL,
};

#define LOSELOSE2(s) ({         \
    const char *p = s;          \
    unsigned int i, h = 0;      \
    for (i = 0; p[i]; i++)      \
        h += p[i] | (i + 1);    \
    h;                          \
})

#define MAGIC    0x02da187eu
#define SHIFT    26u
#define QHASH(x) (((unsigned int) (x * MAGIC)) >> SHIFT)

int is_rsrv_c89kw3(const char *tok)
{
    assert(tok != NULL);
    const char *p = c89kwmap[QHASH(LOSELOSE2(tok))];
    return p && !strcmp(p, tok);
}

As the code revealed, it hashed twice, and yield one string comparison. Which is O(n) time complexity for any given string.

Even better than the binary search version.

Updated Apr 29, 2018

Yet here comes a fourth draft, which even faster than is_rsrv_c89kw3().

static const char *c89kwmap2[] = {
    "default", NULL, "short", NULL, "goto", "continue", "auto", NULL,
    NULL, "typedef", "signed", "volatile", NULL, "register", NULL, "for",
    NULL, "double", NULL, "if", "int", NULL, NULL, NULL,
    "do", NULL, "break", NULL, NULL, NULL, NULL, "sizeof",
    "static", NULL, NULL, NULL, NULL, NULL, "case", "while",
    "float", NULL, NULL, "switch", NULL, "extern", "char", NULL,
    NULL, NULL, NULL, "else", NULL, NULL, "union", "const",
    "return", NULL, "long", "struct", "unsigned", "void", NULL, "enum",
};

#define LOSELOSE2(s) ({         \
    const char *p = s;          \
    unsigned int i, h = 0;      \
    for (i = 0; p[i]; i++)      \
        h += p[i] | (i + 1);    \
    h;                          \
})

/*
 * Hint:
 * We may go even further by finding an uint8_t, and shift 8 - 6 = 2 bits.
 */
#define MAGIC     0x0414u
#define SHIFT     10u
#define QHASH2(x) (((unsigned short) (x * MAGIC)) >> SHIFT)

int is_rsrv_c89kw4(const char *tok)
{
    assert(tok != NULL);
    const char *p = c89kwmap2[QHASH2(LOSELOSE2(tok))];
    /* False-positive comparison */
    return p && !strcmp(p, tok);
}

Feel free to add more keywords, and then recalculate the string-map and magic number.

If your hash function is universal enough, you may happen to find a magic number without expanding the slots, and even reduce the hash value into an unsigned char, i.e. uint8_t, so the second hash consumes nearly nothing.

The is_rsrv_c89kw4() done nearly perfect, the first string-hash is simple enough, and the second hash reduced first one into an unsigned short, i.e. uint16_t.

You may take further expansion over slots, which yield even smaller magic number(with penalty of extra space).

If you using GNU gperf to generate hash function, put above keywords as input, you’ll find similar pattern.

FYI, I’ve put the machine-generated result here. Happy hacking!

#define TOTAL_KEYWORDS 32
#define MIN_WORD_LENGTH 2
#define MAX_WORD_LENGTH 8
#define MIN_HASH_VALUE 3
#define MAX_HASH_VALUE 44
/* maximum key range = 42, duplicates = 0 */

inline static unsigned int hash(register const char *str, register size_t len)
{
  static unsigned char asso_values[] = {
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 30, 10, 10,
      25, 20,  0, 10, 45,  5, 45, 45, 25, 45,
      10, 10,  5, 45,  0,  5,  0,  0,  0, 20,
      45, 45, 25, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
      45, 45, 45, 45, 45, 45
    };
  register unsigned int hval = len;

  switch (hval) {
      default:
        hval += asso_values[(unsigned char)str[2]];
      /*FALLTHROUGH*/
      case 2:
      case 1:
        hval += asso_values[(unsigned char)str[0]];
        break;
    }
  return hval;
}

const char *in_word_set(register const char *str, register size_t len)
{
  static const char * wordlist[] = {
      "", "", "", "for", "", "", "return", "if", "int", "void",
      "union", "struct", "typedef", "unsigned", "goto", "float",
      "switch", "", "register", "case", "short", "signed", "", "",
      "enum", "const", "extern", "do", "continue", "else", "while",
      "double", "default", "volatile", "auto", "break", "sizeof",
      "", "", "long", "", "static", "", "", "char"
    };

  if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH) {
      register unsigned int key = hash (str, len);

      if (key <= MAX_HASH_VALUE) {
          register const char *s = wordlist[key];

          if (*str == *s && !strcmp (str + 1, s + 1))
            return s;
        }
    }
  return 0;
}

Updated Jan 03, 2020

One drawback of the perfect hashing is that after we got a magic array, the order of the array has been shuffled due to the two-way hashing.

If we want to bound the array with an extra value, we need to introduce an extra value array, with the same organization of the key array.

static const char *c89kw_key[] = {
    "default", NULL, "short", NULL, "goto", "continue", "auto", NULL,
    NULL, "typedef", "signed", "volatile", NULL, "register", NULL, "for",
    NULL, "double", NULL, "if", "int", NULL, NULL, NULL,
    "do", NULL, "break", NULL, NULL, NULL, NULL, "sizeof",
    "static", NULL, NULL, NULL, NULL, NULL, "case", "while",
    "float", NULL, NULL, "switch", NULL, "extern", "char", NULL,
    NULL, NULL, NULL, "else", NULL, NULL, "union", "const",
    "return", NULL, "long", "struct", "unsigned", "void", NULL, "enum",
};

static const void *c89kw_val[] = {
	"value for 'default' key", NULL, "value for 'short'", ...
}

Or we can also initialize the c89kw_val array on-the-fly:

(void) memset(c89kw_val, 0, sizeof(c89kw_val) / sizeof(*c89kw_val));

c89kw_val[QHASH2(LOSELOSE2("default"))] = /* value for 'default' key */;
/* ... */
c89kw_val[QHASH2(LOSELOSE2("enum"))] = /* ... */;

References

C Programming/Language Reference#ANSI C (C89)/ISO C (C90)
Hash Functions
More Hash Function Tests

一个程序员的辩白

Perfect hashing for fixed-size string set

References