Hash Tables

Arrays let you find things instantly - if you know the index. array[5] is immediate. No searching.

But what if you want to look things up by name instead of number? “What’s Alice’s phone number?” You don’t want to search through every contact.

Hash tables solve this. They turn names into numbers, then use those numbers as array indices. It’s like magic, but it’s actually just math.

The Idea

Imagine a library with 1000 shelves, numbered 0 to 999. Instead of organizing books alphabetically (which requires searching), you use a formula:

  1. Take the book’s title
  2. Run it through a math formula that produces a number 0-999
  3. Put the book on that shelf

When someone wants the book, use the same formula on the title, and you know exactly which shelf to check. No searching!

This formula is called a hash function. The number it produces is called a hash code or hash value.

A Simple Hash Function

Here’s a basic hash function for strings:

unsigned int hash(const char *str, int table_size) {
    unsigned int hash_value = 0;

    while (*str != '\0') {
        hash_value = hash_value * 31 + *str;
        str++;
    }

    return hash_value % table_size;
}

It takes each character, multiplies by 31 (a prime number - it spreads things out better), and adds them up. Then we use modulo to keep the result within our table size.

// With table_size = 10:
hash("Alice", 10);    // Returns some number 0-9
hash("Bob", 10);      // Returns some number 0-9
hash("Charlie", 10);  // Returns some number 0-9

Different strings usually get different hash values. Usually.

The Collision Problem

What happens when two different strings get the same hash value? This is called a collision.

It’s like two people having the same phone number. Can’t happen, right? Well, in hash tables, it can.

hash("cat", 10);  // Might return 3
hash("act", 10);  // Might also return 3!

We need a way to handle this.

Solution 1: Chaining

Put a linked list at each slot. If multiple items hash to the same slot, they all live in that slot’s list. In this example, Alice and Alex both hashed to index 1, so they’re chained together. Charlie, Carol, and Chris all hashed to index 4.

A hash table with 5 visible indices. Index 0 is empty. Index 1 contains Alice pointing to Alex (a chain of 2). Index 2 is empty. Index 3 contains just Bob. Index 4 contains Charlie pointing to Carol pointing to Chris (a chain of 3). When multiple items hash to the same index, they form a linked list at that slot.

When looking up a value, hash to find the slot, then search the linked list.

Chained Hash Table Implementation

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 10

// Each entry in the hash table
typedef struct Entry {
    char *key;
    char *value;
    struct Entry *next;  // For chaining
} Entry;

// The hash table
typedef struct {
    Entry *buckets[TABLE_SIZE];
} HashTable;

unsigned int hash(const char *key) {
    unsigned int h = 0;
    while (*key) {
        h = h * 31 + *key;
        key++;
    }
    return h % TABLE_SIZE;
}

void ht_init(HashTable *ht) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        ht->buckets[i] = NULL;
    }
}

void ht_set(HashTable *ht, const char *key, const char *value) {
    unsigned int index = hash(key);

    // Check if key already exists
    Entry *current = ht->buckets[index];
    while (current != NULL) {
        if (strcmp(current->key, key) == 0) {
            // Update existing value
            free(current->value);
            current->value = malloc(strlen(value) + 1);
            strcpy(current->value, value);
            return;
        }
        current = current->next;
    }

    // Create new entry
    Entry *entry = malloc(sizeof(Entry));
    entry->key = malloc(strlen(key) + 1);
    entry->value = malloc(strlen(value) + 1);
    strcpy(entry->key, key);
    strcpy(entry->value, value);

    // Add to front of chain
    entry->next = ht->buckets[index];
    ht->buckets[index] = entry;
}

char *ht_get(HashTable *ht, const char *key) {
    unsigned int index = hash(key);

    Entry *current = ht->buckets[index];
    while (current != NULL) {
        if (strcmp(current->key, key) == 0) {
            return current->value;
        }
        current = current->next;
    }

    return NULL;  // Not found
}

void ht_delete(HashTable *ht, const char *key) {
    unsigned int index = hash(key);

    Entry *current = ht->buckets[index];
    Entry *prev = NULL;

    while (current != NULL) {
        if (strcmp(current->key, key) == 0) {
            if (prev == NULL) {
                ht->buckets[index] = current->next;
            } else {
                prev->next = current->next;
            }
            free(current->key);
            free(current->value);
            free(current);
            return;
        }
        prev = current;
        current = current->next;
    }
}

void ht_free(HashTable *ht) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        Entry *current = ht->buckets[i];
        while (current != NULL) {
            Entry *temp = current;
            current = current->next;
            free(temp->key);
            free(temp->value);
            free(temp);
        }
    }
}

Using the Hash Table

A quick aside: In computer science examples, you’ll see the names Alice, Bob, Eve, and Mallory everywhere. They come from cryptography, where Alice and Bob are trying to communicate, Eve is eavesdropping, and Mallory is a malicious attacker. It’s a tradition - like using foo and bar for placeholder variables. Now you’re in on the joke.

int main(void) {
    HashTable phonebook;
    ht_init(&phonebook);

    ht_set(&phonebook, "jenny", "867-5309");
    ht_set(&phonebook, "alice", "555-1234");
    ht_set(&phonebook, "bob", "555-5678");
    ht_set(&phonebook, "charlie", "555-9999");
    ht_set(&phonebook, "eve", "555-0666");

    printf("Jenny's number: %s\n", ht_get(&phonebook, "jenny"));
    printf("Eve's number: %s\n", ht_get(&phonebook, "eve"));

    char *unknown = ht_get(&phonebook, "mallory");
    if (unknown == NULL) {
        printf("Mallory: Not in our database\n");
    }

    // Update a value
    ht_set(&phonebook, "charlie", "555-0000");
    printf("Charlie's new number: %s\n", ht_get(&phonebook, "charlie"));

    ht_free(&phonebook);
    return 0;
}

Solution 2: Open Addressing

Instead of linked lists, just use the next available slot:

hash("Alice") = 3
Slot 3 is empty, put Alice there.

hash("Alex") = 3 (collision!)
Slot 3 is taken, try slot 4.
Slot 4 is empty, put Alex there.

When searching, if the slot doesn’t have what you want, check the next slot, and the next, until you find it or hit an empty slot.

This is called linear probing. There are fancier versions (quadratic probing, double hashing), but the idea is the same.

#define TABLE_SIZE 20

typedef struct {
    char *key;
    char *value;
    int occupied;  // Is this slot in use?
} Slot;

typedef struct {
    Slot slots[TABLE_SIZE];
} OpenHashTable;

void oht_init(OpenHashTable *ht) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        ht->slots[i].key = NULL;
        ht->slots[i].value = NULL;
        ht->slots[i].occupied = 0;
    }
}

void oht_set(OpenHashTable *ht, const char *key, const char *value) {
    unsigned int index = hash(key);

    for (int i = 0; i < TABLE_SIZE; i++) {
        unsigned int try = (index + i) % TABLE_SIZE;

        // Empty slot or same key?
        if (!ht->slots[try].occupied ||
            (ht->slots[try].key && strcmp(ht->slots[try].key, key) == 0)) {

            // Free old values if updating
            if (ht->slots[try].occupied) {
                free(ht->slots[try].key);
                free(ht->slots[try].value);
            }

            ht->slots[try].key = malloc(strlen(key) + 1);
            ht->slots[try].value = malloc(strlen(value) + 1);
            strcpy(ht->slots[try].key, key);
            strcpy(ht->slots[try].value, value);
            ht->slots[try].occupied = 1;
            return;
        }
    }
    // Table is full!
}

char *oht_get(OpenHashTable *ht, const char *key) {
    unsigned int index = hash(key);

    for (int i = 0; i < TABLE_SIZE; i++) {
        unsigned int try = (index + i) % TABLE_SIZE;

        if (!ht->slots[try].occupied) {
            return NULL;  // Empty slot = not found
        }

        if (strcmp(ht->slots[try].key, key) == 0) {
            return ht->slots[try].value;
        }
    }
    return NULL;
}

When to Use Which?

Chaining:

  • Simple to implement
  • Works well even when table gets full
  • Uses more memory (pointers for linked lists)

Open Addressing:

  • Better cache performance (everything in one array)
  • No extra pointers
  • Performance degrades badly when table gets full

Most real implementations use chaining or sophisticated versions of open addressing.

Load Factor

The load factor is how full your hash table is:

load factor = number of entries / table size

When the load factor gets too high (usually around 0.7), the table gets slow. Too many collisions.

Real hash tables resize - when they get too full, they create a bigger table and rehash everything.

void ht_resize(HashTable *ht, int new_size) {
    // Save old data
    // Create new bigger table
    // Rehash everything into new table
    // Free old table
}

This is O(n) but happens rarely, so the average stays O(1).

Good Hash Functions

A good hash function:

  • Is fast to compute
  • Spreads values evenly across the table
  • Rarely causes collisions

The one we used (multiply by 31, add characters) is simple but decent. Real implementations use more sophisticated functions.

For integers, a common approach:

unsigned int int_hash(int key, int table_size) {
    return ((unsigned int)key * 2654435761) % table_size;
}

That magic number is related to the golden ratio. Math is weird.

Complete Example: Spell Checker

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#define DICT_SIZE 1000

typedef struct Word {
    char *word;
    struct Word *next;
} Word;

typedef struct {
    Word *buckets[DICT_SIZE];
    int count;
} Dictionary;

unsigned int word_hash(const char *word) {
    unsigned int h = 0;
    while (*word) {
        h = h * 31 + tolower(*word);
        word++;
    }
    return h % DICT_SIZE;
}

void dict_init(Dictionary *d) {
    for (int i = 0; i < DICT_SIZE; i++) {
        d->buckets[i] = NULL;
    }
    d->count = 0;
}

void dict_add(Dictionary *d, const char *word) {
    unsigned int index = word_hash(word);

    // Check if already exists
    Word *current = d->buckets[index];
    while (current != NULL) {
        if (strcasecmp(current->word, word) == 0) {
            return;  // Already in dictionary
        }
        current = current->next;
    }

    // Add new word
    Word *new_word = malloc(sizeof(Word));
    new_word->word = malloc(strlen(word) + 1);
    strcpy(new_word->word, word);
    new_word->next = d->buckets[index];
    d->buckets[index] = new_word;
    d->count++;
}

int dict_contains(Dictionary *d, const char *word) {
    unsigned int index = word_hash(word);

    Word *current = d->buckets[index];
    while (current != NULL) {
        if (strcasecmp(current->word, word) == 0) {
            return 1;  // Found!
        }
        current = current->next;
    }
    return 0;  // Not found
}

void dict_free(Dictionary *d) {
    for (int i = 0; i < DICT_SIZE; i++) {
        Word *current = d->buckets[i];
        while (current != NULL) {
            Word *temp = current;
            current = current->next;
            free(temp->word);
            free(temp);
        }
    }
}

int main(void) {
    Dictionary dict;
    dict_init(&dict);

    // Add some words
    dict_add(&dict, "the");
    dict_add(&dict, "quick");
    dict_add(&dict, "brown");
    dict_add(&dict, "fox");
    dict_add(&dict, "jumps");
    dict_add(&dict, "algorithm");
    dict_add(&dict, "variable");

    // Check some words
    char *test_words[] = {"the", "fox", "xyzzy", "algorithm", "zephyr"};
    int num_tests = 5;

    for (int i = 0; i < num_tests; i++) {
        if (dict_contains(&dict, test_words[i])) {
            printf("'%s' is spelled correctly\n", test_words[i]);
        } else {
            printf("'%s' might be misspelled\n", test_words[i]);
        }
    }

    printf("\nDictionary has %d words\n", dict.count);

    dict_free(&dict);
    return 0;
}

Performance

OperationAverage CaseWorst Case
InsertO(1)O(n)
LookupO(1)O(n)
DeleteO(1)O(n)

Average case is constant time - doesn’t matter if you have 100 items or 100 million. That’s the superpower of hash tables.

Worst case (everything collides) is terrible, but with a good hash function, this basically never happens.

Real World Uses

  • Dictionaries/Maps: Python dicts, JavaScript objects, Java HashMaps
  • Caches: Store computed results for quick lookup
  • Symbol tables: Compilers track variable names
  • Databases: Quick lookups by key
  • Spell checkers: Is this word valid?
  • Deduplication: Have I seen this before?

Hash tables are everywhere. Every major programming language has them built in. Now you know how they work.

Try It Yourself

  1. Implement a hash table that stores superhero names and their powers
  2. Add a function that prints all entries in the hash table
  3. Implement automatic resizing when the load factor exceeds 0.7
  4. Write a program that counts word frequency in a text using a hash table

Common Mistakes

  • Forgetting to handle the NULL case in get()
  • Using a bad hash function (too many collisions)
  • Not resizing when the table gets too full
  • Memory leaks when updating or deleting entries

Next Up

Before we move on, there’s something important to discuss. In the next section, we’ll talk about hash functions and false promises - what happens when people build financial systems on hash functions, why “mathematically secure” doesn’t mean what they claim, and how to recognize when someone is trying to use you.


Enjoyed This?

If this helped something click, subscribe to my YouTube channel. More content like this, same approach - making things stick without insulting your intelligence. It’s free, it helps more people find this stuff, and it tells me what’s worth making more of.