Lists

Lists

A list is like an array, except that the length changes.

Items are added to a list over time, and you don't know in advance how many there will be.

This chapter implements lists in two ways in C.

The same ideas apply to lists in almost every other programming language.

Two approaches

There are two approaches that you can take:

Array lists are better when the most common operations are to add or remove at one end.

Linked lists can be better when frequently inserting and deleting in the middle.

Library approach

We will start with array lists, and implement them as if we were creating a library. That means creating functions which are reasonably reusable, and which are easy to use rather than easy to write.

On the other hand the program we are writing doesn't have to do anything other than test the functions.

The first step is to decide what items are going to be stored in the lists.

Generics

Arrays in C are generic. In other words, there can be an array of anything, e.g. char cs[], int ns[], double ds[], and so on.

Unfortunately, C makes it nearly impossible to create generic lists, so that there can be lists of anything.

So our target will be lists of int, but written so that the functions can reasonably easily be copied and reused to make lists of some other type. (We'll write xs for a list of anything, and ns for a list of int.)

Items

Let's make a type synonym:

typedef int item;

Now all the functions can be written as acting on lists of items. For any other simple type, this one typedef can be changed, without changing the functions.

The next step is to decide what we want the functions to look like when they are used.

Calls

A program which uses the lists might look like this:

...
list *ns = newList();
add(ns, 3);
add(ns, 5);
add(ns, 41);
get(ns, 2);
set(ns, 2, 42);
for (int i = 0; i < length(ns); i++) ...
...

A list starts empty, it can have items added to it indefinitely, and there are functions length, get and set similar to an array.

Testing

As well as the list functions, we need to think about how testing is going to work.

A test should check that the result of an operation is a self-consistent list containing given items, like this:

assert(check(ns, 3, (int[]){1, 2, 3}));

This checks that ns is a list with length three containing 1, 2, 3. The raw notation {1, 2, 3} can only be used in declarations, but if the type int[] is made explicit with a 'cast', it can be used elsewhere.

Prototypes

We can now write prototypes of all the functions:

list *newList();

int length(list *xs);

void add(list *xs, item x);

item get(list *xs, int i);

void set(list *xs, int i, item x);

bool check(list *xs, int n, item ys[n]);

First attempt

We could use a flexible array, with a length variable to say how full it is:

int length = 0, capacity = 4;
item *items = malloc(capacity * sizeof(item));
...
if (length >= capacity) {
  capacity = capacity * 3 / 2;
  items = realloc(items, capacity * sizeof(item));
}
...

Add function

The add function seems to need to be like this:

void add(int length, int capacity, item *items, item x) {
...

We need to pass the length and capacity as well as the array and new item. But that's not what the add function is supposed to look like.

Failure

It would be tiresome to passing around the three variables separately. We want calls to look like add(xs, n).

And the function wouldn't work anyway, because it can't update the caller's length variable, the caller's capacity variable, or the caller's items variable in case the array is moved by realloc.

Second attempt

We could pass in all three variables in one go in a structure, and return the updated structure:

struct list { int length, capacity; item *items; };
typedef struct list list;

list add(list xs, item x) {
    ...
}
...
xs = add(xs, x);

Poor attempt

The second attempt does work, but has two flaws.

It is important to be able to write add(xs, x), not xs = add(xs, x).

Third attempt

How do we achieve really simple function calls like add(xs,x) ?

Answer, pass the list structure by pointer:

struct list { int length, capacity; item *items; };
typedef struct list list;

void add(list *xs, item x) {
    ...
}

This treats a list as an 'object'.

Picture

A good picture of the situation may help:

struct list { int length, capacity; item *items; };
typedef struct list list;
...
list *xs;

The xs pointer points to a fixed structure with three fields, one of which is a pointer to a variable array.

Pointer purposes

The pointers in the picture have two different purposes.

The first, xs, allows functions to update the list structure in place. The structure never moves, so the pointer never changes, so it never needs to be updated, so functions never need to return an updated pointer.

The second pointer, xs->items, allows the array to be moved and resized. Only one pointer xs->items needs to be updated.

The newList function

Here's a function to make a new list:

// Make a new empty list
list *newList() {
    list *xs = malloc(sizeof(list));
    item *items = malloc(6 * sizeof(item));
    *xs = (list) { 0, 6, items };
    return xs;
}

How should this be tested?

Test strategy

It is really easy to make mistakes with pointers or with array bounds.

So it is vital to compile with the -fsanitize=undefined and -fsanitize=address options to catch those mistakes as quickly as possible.

Then the next task is to write the check function.

The check function

Here's the check function:

// Check that a list matches a given array.
bool check(list *xs, int n, item ys[n]) {
    bool ok = true;
    if (xs->length != n) ok = false;
    if (xs->capacity < n) ok = false;
    for (int i = 0; i < n; i++) {
        if (xs->items[i] != ys[i]) ok = false;
    }
    return ok;
}

This checks everything it can. It can't check that pointers are valid, but that's checked by the compiler options.

The program

Now we can wrap the functions in a program:

#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <assert.h>

typedef int item;
struct list { int length, capacity; item *items; };
typedef struct list list;

... newList ... check ...

int main() {
    assert(check(newList(), 0, NULL));
}

Testing

What happens when we test the program?

It compiles, but the -fsanitize=address option reports a memory leak, because we have allocated a list and not freed it.

This change to main fixes the problem:

int main() {
    list *ns = newList();
    assert(check(ns, 0, NULL));
    free(ns->items);
    free(ns);
}

Missing function

The memory leak incident suggests that we need another function.

With a real library, a programmer would not be able to get at ns->items to free it.

So a freeList function is needed:

void freeList(list *xs);

The freeList function

Here's the freeList function:

// Free the memory for a list.
void freeList(list *xs) {
    free(xs->items);
    free(xs);
}

The two calls can't be in the opposite order, because it is illegal to use xs after it has been freed. In general, structures need to freed from the bottom upwards.

The next task is to write the length function.

The length function

Here's the length function:

// Find the length of a list
int length(list *ns) {
    return ns->length;
}

It doesn't seem necessary to add any testing for it.

The next task is to prepare for writing the add function.

An expand function

Here's a function to expand a list, to do part of the job of the add function:

// Make a list bigger
static void expand(list *ns) {
    ns->capacity = ns->capacity * 3 / 2;
    ns->items = realloc(
        ns->items, ns->capacity * sizeof(item)
    );
}

This is hidden from the library user by declaring it static. It is only to be called by list functions.

Continuation

ns->items = realloc(
    ns->items, ns->capacity * sizeof(item)
);

This is a single function call which is a bit long to fit on one line. The convention used here (not common in C, but borrowed from Python) is that splitting a statement over several lines is signalled by round brackets in a similar way to curly bracket blocks.

The add function

Here is the add function:

// Add an item to a list
void add(list *ns, item n) {
    if (ns->length >= ns->capacity) expand(ns);
    ns->items[ns->length] = n;
    ns->length++;
}

It is time to add more tests.

Test add

A reasonable way to test the add function is:

int main() {
    list *ns = newList();
    assert(check(ns, 0, NULL));
    add(ns, 40);
    assert(check(ns, 1, (int[]) {40}));
    add(ns, 41);
    assert(check(ns, 2, (int[]) {40, 41}));
    add(ns, 42);
    assert(check(ns, 3, (int[]) {40, 41, 42}));
    freeList(ns);
    return 0;
}

This tests newList, so it replaces the previous testing.

Independence

Some purists would argue that tests should be independent and isolated from each other.

In other words, each test should involve building a list from scratch, then applying an operation and checking the result.

It is an important principle, but it is not an absolute rule. There are also gains to be had from checking sequences of operations on the same structure, and it is certainly simpler in this case.

Next we need to prepare for get and set.

A fail function

Here's a function to report an error:

// Report a list error
static void fail(char *message) {
    fprintf(stderr, "List failure: %s\n", message);
    exit(1);
}

This is not for general use, it is only for the get and set functions to call, so it is declared static.

Get and set

Here are the get and set functions:

// Get a list item, like ns[i]
int get(list *ns, int i) {
    if (i < 0 || i >= ns->length) fail("get");
    return ns->items[i];
}

// Set a list item, like ns[i] = n
void set(list *ns, int i, item n) {
    if (i < 0 || i >= ns->length) fail("set");
    ns->items[i] = n;
}

Tests

Tests for these can be added near the end of main:

int main() {
...
    set(ns, 2, 84);
    assert(check(ns, 3, (int[]) {40, 41, 84}));
    assert(get(ns, 2) == 84);
    freeList(ns);
}

Is there anything else that needs to be tested?

Expansion

The main thing that hasn't been tested is what happens when the list gets longer. Does it expand properly? Instead of adding lots more items to the list, it is probably enough to add a direct test of the expand function at the end:

int main() {
...
    expand(ns);
    assert(check(ns, 3, (int[]) {40, 41, 84}));
    freeList(ns);
}

Section: Linked lists

A problem with array lists is that, to insert or delete an item in the middle, lots of items have to be moved up or down to make space.

Can we find a way of storing a list so that items never have to be moved?

One way is to introduce a pointer to go with each item, pointing to the next item.

Example: primes

A linked list of primes (without 5) might look like this.

Example: insertion

After inserting 5, it might look like this.

Insertion

To insert 5 into the list, these steps are needed:

That's a small fixed number of operations. The list entries end up scattered in memory, but it doesn't matter where they are.

Stack

The easy and efficient operations on a linked list are called the stack operations:

Prototypes

The functions needed for stacks are:

stack *newStack();

void freeStack(stack *xs);

bool isEmpty(stack *xs);

void push(stack *xs, item x);

item top(stack *xs);

item pop(stack *xs);

bool check(stack *xs, int n, item ys[n]);

Cells

This structure holds each item and its next pointer:

struct cell { item x; struct cell *next; };
typedef struct cell cell;

Temptation

It is tempting to say that a stack is just a pointer to the first item, or NULL when the stack is empty:

cell *xs = NULL;

The push function adds a new first cell, so the stack variable xs becomes a different pointer. That means we would have to write xs = push(xs, x). The pop function also updates the xs pointer, but it can't easily return that and the first item.

So let's have a separate unmoving structure for the stack itself, as with array lists.

Start

The stack program needs to start like this:

#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <assert.h>

typedef int item;

struct cell { item x; struct cell *next; };
typedef struct cell cell;

struct stack { cell *first; };
typedef struct stack stack;
...

New stack

The function to create a new stack is:

// Create a new empty stack
stack *newStack() {
    stack *xs = malloc(sizeof(stack));
    xs->first = NULL;
    return xs;
}

The check function

Here's the check function:

// Check that a stack matches a given array.
bool check(stack *xs, int n, item ys[n]) {
    bool ok = true;
    cell *c = xs->first;
    for (int i = 0; i < n; i++) {
        if (c->x != ys[i]) ok = false;
        c = c->next;
    }
    if (c != NULL) ok = false;
    return ok;
}

The main function

To finish the initial version of the program, add this:

int main() {
    stack *ns = newStack();
    assert(check(ns, 0, NULL));
    return 0;
}

Again, we are relying on the sanitize compiler options to check things that the check function can't check.

Everything works, except for the memory leak caused by the lack of a freeStack function.

The freeStack function

Here is the freeStack function:

void freeStack(stack *xs) {
    cell *c = xs->first;
    while (c != NULL) {
        cell *next = c->next;
        free(c);
        c = next;
    }
    free(xs);
}

The next pointer has to be extracted from a cell before the cell is freed, because it is illegal to access the cell contents afterwards.

Check stack empty

The function to check if a stack is empty is:

bool isEmpty(stack *xs) {
    return xs->first == NULL;
}

The push function

The function to push an item onto a stack is:

void push(stack *xs, item x) {
    cell *c = malloc(sizeof(cell));
    *c = (cell) { x, xs->first };
    xs->first = c;
}

Here are pictures of pushing 5 onto a stack:

Classic mistake

A very common mistake with pointer handling is to do things in the wrong order:

void push(stack *xs, item x) {
    cell *c = malloc(sizeof(cell));
    xs->first = c;                        // BAD
    *c = (cell) { x, xs->first };
}

Testing

To test push, main becomes:

int main() {
    stack *ns = newStack();
    assert(check(ns, 0, NULL));
    push(ns, 40);
    assert(check(ns, 1, (int[]) {40}));
    push(ns, 41);
    assert(check(ns, 2, (int[]) {41, 40}));
    push(ns, 42);
    assert(check(ns, 3, (int[]) {42, 41, 40}));
    freeStack(ns);
    return 0;
}

The fail function

Here's a function to call if something goes wrong:

void fail(char *message) {
    fprintf(stderr, "Stack failure: %s\n", message);
    exit(1);
}

The function prints to stderr, and stops the program with an error code (as if returning 1 from main) to play nicely with any scripts that include the program.

The top function

The function to look at the top item is:

item top(stack *xs) {
    if (xs->first == NULL) fail("top of empty");
    return xs->first->x;
}

If the caller tries to get the top item from an empty stack, the fail function is called, to make sure the program doesn't do anything terrible.

The pop function

The function to remove the top item is:

item pop(stack *xs) {
    cell *c = xs->first;
    if (c == NULL) fail("pop of empty");
    xs->first = c->next;
    item x = c->x;
    free(c);
    return x;
}

This has to be written incredibly carefully, saving the first cell in a variable before removing it from the list, and extracting its fields before freeing up its space.

Visualising pop

The main steps in pop are:

Testing

To test top and pop, add:

int main() {
    ...
    assert(top(ns) == 42);
    assert(check(ns, 3, (int[]) {42, 41, 40}));
    assert(pop(ns) == 42);
    assert(check(ns, 2, (int[]) {41, 40}));
    freeStack(ns);
    return 0;
}

Structure lists

To store structures instead of ints, you could include the next field in the structure, e.g.

struct cell {
    char *name;
    int number;
    struct cell *next;
};

The next field can be ignored everywhere except in the list functions.

Although this is common, it doesn't allow an item to be stored in more than one list.

Object lists

A more flexible approach is to store objects, i.e. pointers to structures, in lists:

struct cell {
    struct entry *item;
    struct cell *next;
};

This has an extra layer of pointers, but now an object can appear in any number of lists, and updates to objects are shared by all occurrences.

Efficiency

There is an efficiency problem with what we have done.

All the stack functions are supposed to be O(1), but they may not be.

That is because of the cost of malloc and free which can, at worst, have O(n) behaviour.

Free list

To overcome the problem, it is common for a list structure to contain a free list, i.e. a list (stack) of cells which are currently unused but are free to be re-used.

struct list {
    struct cell *first;
    struct cell *free;
};

You put cells on the free list instead of calling free.

And when you want a new cell, you get it from the free list if possible, and only allocate a new one if the free list is empty.

Modules

Once you have built a good implementation of stacks, it is natural to re-use it in other programs.

To do that, you put the stack functions into a separate module.

And you make sure that programs cannot access the cells being used, and in fact cannot tell how the stack is being implemented - it is just a service, and a robust one.

List variations

The real truth

For implementing general lists, and stacks, an array list is almost always better than a linked list.

Linked lists use a lot of memory, the efficiency of insertion and deletion in the middle is offset by the cost of finding or keeping track of the right place in the list, and they are difficult to encapsulate well.

But the idea of linked lists comes up a lot in computer science, they are often used as part of something else (e.g. hash tables), and variations are often used (e.g. a linked list in an array).