A list is like an array, except that the length changes.
Items are added to a list over time, and you don't know in advance how many there will be.
This chapter implements lists in two ways in C.
The same ideas apply to lists in almost every other programming language.
There are two approaches that you can take:
Array lists are better when the most common operations are to add or remove at one end.
Linked lists can be better when frequently inserting and deleting in the middle.
We will start with array lists, and implement them as if we were creating a library. That means creating functions which are reasonably reusable, and which are easy to use rather than easy to write.
On the other hand the program we are writing doesn't have to do anything other than test the functions.
The first step is to decide what items are going to be stored in the lists.
Arrays in C are generic. In other words, there can be an array of anything,
e.g. char cs[]
, int ns[]
, double ds[]
,
and so on.
Unfortunately, C makes it nearly impossible to create generic lists, so that there can be lists of anything.
So our target will be lists of int
, but written so that the
functions can reasonably easily be copied and reused to make lists of some other
type. (We'll write xs
for a list of anything, and ns
for a list of int
.)
Let's make a type synonym:
typedef int item;
Now all the functions can be written as acting on lists of items. For any other simple type, this one typedef can be changed, without changing the functions.
The next step is to decide what we want the functions to look like when they are used.
A program which uses the lists might look like this:
... list *ns = newList(); add(ns, 3); add(ns, 5); add(ns, 41); get(ns, 2); set(ns, 2, 42); for (int i = 0; i < length(ns); i++) ... ...
A list starts empty, it can have items added to it indefinitely, and there
are functions length
, get
and set
similar to an array.
As well as the list functions, we need to think about how testing is going to work.
A test should check that the result of an operation is a self-consistent list containing given items, like this:
assert(check(ns, 3, (int[]){1, 2, 3}));
This checks that ns
is a list with length three
containing 1
, 2
, 3
. The raw notation
{1, 2, 3}
can only be used in declarations, but if the type
int[]
is made explicit with a 'cast', it can be used elsewhere.
We can now write prototypes of all the functions:
list *newList(); int length(list *xs); void add(list *xs, item x); item get(list *xs, int i); void set(list *xs, int i, item x); bool check(list *xs, int n, item ys[n]);
We could use a flexible array, with a length variable to say how full it is:
int length = 0, capacity = 4; item *items = malloc(capacity * sizeof(item)); ... if (length >= capacity) { capacity = capacity * 3 / 2; items = realloc(items, capacity * sizeof(item)); } ...
The add
function seems to need to be like this:
void add(int length, int capacity, item *items, item x) { ...
We need to pass the length and capacity as well as the array and new
item. But that's not what the add
function is supposed to look like.
It would be tiresome to passing around the three
variables separately. We want calls to look like add(xs, n)
.
And the function wouldn't work anyway, because it can't update the caller's
length
variable, the caller's capacity
variable,
or the caller's items
variable in case the array is moved by
realloc
.
We could pass in all three variables in one go in a structure, and return the updated structure:
struct list { int length, capacity; item *items; }; typedef struct list list; list add(list xs, item x) { ... } ... xs = add(xs, x);
The second attempt does work, but has two flaws.
add(xs,x)
instead
of xs = add(xs,x)
It is important to be able to write add(xs, x)
, not xs = add(xs, x)
.
How do we achieve really simple function calls
like add(xs,x)
?
Answer, pass the list structure by pointer:
struct list { int length, capacity; item *items; }; typedef struct list list; void add(list *xs, item x) { ... }
This treats a list as an 'object'.
A good picture of the situation may help:
struct list { int length, capacity; item *items; }; typedef struct list list; ... list *xs;
The xs
pointer points to a fixed structure with three fields,
one of which is a pointer to a variable array.
The pointers in the picture have two different purposes.
The first, xs
, allows functions
to update the list structure in place. The structure never moves, so the pointer
never changes, so it never needs to be updated, so functions never need to
return an updated pointer.
The second pointer, xs->items
, allows the array to be moved and
resized. Only one pointer xs->items
needs to be updated.
Here's a function to make a new list:
// Make a new empty list list *newList() { list *xs = malloc(sizeof(list)); item *items = malloc(6 * sizeof(item)); *xs = (list) { 0, 6, items }; return xs; }
How should this be tested?
It is really easy to make mistakes with pointers or with array bounds.
So it is vital to compile with the -fsanitize=undefined
and -fsanitize=address
options to catch those mistakes as quickly
as possible.
Then the next task is to write the check
function.
Here's the check
function:
// Check that a list matches a given array. bool check(list *xs, int n, item ys[n]) { bool ok = true; if (xs->length != n) ok = false; if (xs->capacity < n) ok = false; for (int i = 0; i < n; i++) { if (xs->items[i] != ys[i]) ok = false; } return ok; }
This checks everything it can. It can't check that pointers are valid, but that's checked by the compiler options.
Now we can wrap the functions in a program:
#include <stdio.h> #include <stdbool.h> #include <stdlib.h> #include <assert.h> typedef int item; struct list { int length, capacity; item *items; }; typedef struct list list; ... newList ... check ... int main() { assert(check(newList(), 0, NULL)); }
What happens when we test the program?
It compiles, but the -fsanitize=address
option reports a memory
leak, because we have allocated a list and not freed it.
This change to main
fixes the problem:
int main() { list *ns = newList(); assert(check(ns, 0, NULL)); free(ns->items); free(ns); }
The memory leak incident suggests that we need another function.
With a real library, a programmer would not be able to get at
ns->items
to free it.
So a freeList
function is needed:
void freeList(list *xs);
Here's the freeList
function:
// Free the memory for a list. void freeList(list *xs) { free(xs->items); free(xs); }
The two calls can't be in the opposite order, because it is illegal to use
xs
after it has been freed. In general, structures need to freed
from the bottom upwards.
The next task is to write the length
function.
Here's the length
function:
// Find the length of a list int length(list *ns) { return ns->length; }
It doesn't seem necessary to add any testing for it.
The next task is to prepare for writing the add
function.
Here's a function to expand a list, to do part of the job of the
add
function:
// Make a list bigger static void expand(list *ns) { ns->capacity = ns->capacity * 3 / 2; ns->items = realloc( ns->items, ns->capacity * sizeof(item) ); }
This is hidden from the library user by declaring it static
. It
is only to be called by list functions.
ns->items = realloc( ns->items, ns->capacity * sizeof(item) );
This is a single function call which is a bit long to fit on one line. The convention used here (not common in C, but borrowed from Python) is that splitting a statement over several lines is signalled by round brackets in a similar way to curly bracket blocks.
Here is the add
function:
// Add an item to a list void add(list *ns, item n) { if (ns->length >= ns->capacity) expand(ns); ns->items[ns->length] = n; ns->length++; }
It is time to add more tests.
A reasonable way to test the add function is:
int main() { list *ns = newList(); assert(check(ns, 0, NULL)); add(ns, 40); assert(check(ns, 1, (int[]) {40})); add(ns, 41); assert(check(ns, 2, (int[]) {40, 41})); add(ns, 42); assert(check(ns, 3, (int[]) {40, 41, 42})); freeList(ns); return 0; }
This tests newList
, so it replaces the previous
testing.
Some purists would argue that tests should be independent and isolated from each other.
In other words, each test should involve building a list from scratch, then applying an operation and checking the result.
It is an important principle, but it is not an absolute rule. There are also gains to be had from checking sequences of operations on the same structure, and it is certainly simpler in this case.
Next we need to prepare for get
and set
.
Here's a function to report an error:
// Report a list error static void fail(char *message) { fprintf(stderr, "List failure: %s\n", message); exit(1); }
This is not for general use, it is only for the get
and
set
functions to call, so it is declared static
.
Here are the get
and set
functions:
// Get a list item, like ns[i] int get(list *ns, int i) { if (i < 0 || i >= ns->length) fail("get"); return ns->items[i]; } // Set a list item, like ns[i] = n void set(list *ns, int i, item n) { if (i < 0 || i >= ns->length) fail("set"); ns->items[i] = n; }
Tests for these can be added near the end of main:
int main() { ... set(ns, 2, 84); assert(check(ns, 3, (int[]) {40, 41, 84})); assert(get(ns, 2) == 84); freeList(ns); }
Is there anything else that needs to be tested?
The main thing that hasn't been tested is what happens when the list gets
longer. Does it expand properly? Instead of adding lots more items to the list,
it is probably enough to add a direct test of the expand
function
at the end:
int main() { ... expand(ns); assert(check(ns, 3, (int[]) {40, 41, 84})); freeList(ns); }
A problem with array lists is that, to insert or delete an item in the middle, lots of items have to be moved up or down to make space.
Can we find a way of storing a list so that items never have to be moved?
One way is to introduce a pointer to go with each item, pointing to the next item.
A linked list of primes (without 5) might look like this.
After inserting 5, it might look like this.
To insert 5 into the list, these steps are needed:
That's a small fixed number of operations. The list entries end up scattered in memory, but it doesn't matter where they are.
The easy and efficient operations on a linked list are called the stack operations:
The functions needed for stacks are:
stack *newStack(); void freeStack(stack *xs); bool isEmpty(stack *xs); void push(stack *xs, item x); item top(stack *xs); item pop(stack *xs); bool check(stack *xs, int n, item ys[n]);
This structure holds each item and its next pointer:
struct cell { item x; struct cell *next; }; typedef struct cell cell;
It is tempting to say that a stack is just a pointer to the first item, or
NULL
when the stack is empty:
cell *xs = NULL;
The push
function adds a new first cell, so the stack variable
xs
becomes a different pointer. That means we would have to write
xs = push(xs, x)
. The pop
function also updates the
xs
pointer, but it can't easily return that and
the first item.
So let's have a separate unmoving structure for the stack itself, as with array lists.
The stack program needs to start like this:
#include <stdio.h> #include <stdbool.h> #include <stdlib.h> #include <assert.h> typedef int item; struct cell { item x; struct cell *next; }; typedef struct cell cell; struct stack { cell *first; }; typedef struct stack stack; ...
The function to create a new stack is:
// Create a new empty stack stack *newStack() { stack *xs = malloc(sizeof(stack)); xs->first = NULL; return xs; }
Here's the check
function:
// Check that a stack matches a given array. bool check(stack *xs, int n, item ys[n]) { bool ok = true; cell *c = xs->first; for (int i = 0; i < n; i++) { if (c->x != ys[i]) ok = false; c = c->next; } if (c != NULL) ok = false; return ok; }
To finish the initial version of the program, add this:
int main() { stack *ns = newStack(); assert(check(ns, 0, NULL)); return 0; }
Again, we are relying on the sanitize compiler options to check things that
the check
function can't check.
Everything works, except for the memory leak caused by the lack of a
freeStack
function.
Here is the freeStack
function:
void freeStack(stack *xs) { cell *c = xs->first; while (c != NULL) { cell *next = c->next; free(c); c = next; } free(xs); }
The next
pointer has to be extracted from a cell
before the cell is freed, because it is illegal to access the
cell contents afterwards.
The function to check if a stack is empty is:
bool isEmpty(stack *xs) { return xs->first == NULL; }
The function to push an item onto a stack is:
void push(stack *xs, item x) { cell *c = malloc(sizeof(cell)); *c = (cell) { x, xs->first }; xs->first = c; }
Here are pictures of pushing 5
onto a stack:
A very common mistake with pointer handling is to do things in the wrong order:
void push(stack *xs, item x) { cell *c = malloc(sizeof(cell)); xs->first = c; // BAD *c = (cell) { x, xs->first }; }
To test push
, main
becomes:
int main() { stack *ns = newStack(); assert(check(ns, 0, NULL)); push(ns, 40); assert(check(ns, 1, (int[]) {40})); push(ns, 41); assert(check(ns, 2, (int[]) {41, 40})); push(ns, 42); assert(check(ns, 3, (int[]) {42, 41, 40})); freeStack(ns); return 0; }
Here's a function to call if something goes wrong:
void fail(char *message) { fprintf(stderr, "Stack failure: %s\n", message); exit(1); }
The function prints to stderr
, and stops the program with an
error code (as if returning 1
from main
) to play
nicely with any scripts that include the program.
The function to look at the top item is:
item top(stack *xs) { if (xs->first == NULL) fail("top of empty"); return xs->first->x; }
If the caller tries to get the top item from an empty stack,
the fail
function is called, to make sure the program
doesn't do anything terrible.
The function to remove the top item is:
item pop(stack *xs) { cell *c = xs->first; if (c == NULL) fail("pop of empty"); xs->first = c->next; item x = c->x; free(c); return x; }
This has to be written incredibly carefully, saving the first cell in a variable before removing it from the list, and extracting its fields before freeing up its space.
The main steps in pop are:
To test top
and pop
, add:
int main() { ... assert(top(ns) == 42); assert(check(ns, 3, (int[]) {42, 41, 40})); assert(pop(ns) == 42); assert(check(ns, 2, (int[]) {41, 40})); freeStack(ns); return 0; }
To store structures instead of ints, you could include the next
field in the structure, e.g.
struct cell { char *name; int number; struct cell *next; };
The next
field can be ignored everywhere except in the list
functions.
Although this is common, it doesn't allow an item to be stored in more than one list.
A more flexible approach is to store objects, i.e. pointers to structures, in lists:
struct cell { struct entry *item; struct cell *next; };
This has an extra layer of pointers, but now an object can appear in any number of lists, and updates to objects are shared by all occurrences.
There is an efficiency problem with what we have done.
All the stack functions are supposed to be O(1), but they may not be.
That is because of the cost of malloc
and free
which can, at worst, have O(n) behaviour.
To overcome the problem, it is common for a list structure to contain a free list, i.e. a list (stack) of cells which are currently unused but are free to be re-used.
struct list { struct cell *first; struct cell *free; };
You put cells on the free list instead of calling free
.
And when you want a new cell, you get it from the free list if possible, and only allocate a new one if the free list is empty.
Once you have built a good implementation of stacks, it is natural to re-use it in other programs.
To do that, you put the stack functions into a separate module.
And you make sure that programs cannot access the cells being used, and in fact cannot tell how the stack is being implemented - it is just a service, and a robust one.
NULL
testsFor implementing general lists, and stacks, an array list is almost always better than a linked list.
Linked lists use a lot of memory, the efficiency of insertion and deletion in the middle is offset by the cost of finding or keeping track of the right place in the list, and they are difficult to encapsulate well.
But the idea of linked lists comes up a lot in computer science, they are often used as part of something else (e.g. hash tables), and variations are often used (e.g. a linked list in an array).