Error detection and correction

This is a big topic in its own right, mostly concerned with data communications, and it is not covered in depth here. However, memory may be regarded as a mechanism for sending data through time: errors do occur, and means to mitigate them are sometimes included.
Here we just touch on the common principles used.

All error detection and correction uses some amount of redundancy – some extra data – which increases the cost of storage. Which techniques are employed (if any) depends on the statistical likelihood of errors, the importance of correct operation and the willingness to pay for extra RAM (or bandwidth, in the case of communications).

Error detection

Parity

Parity is a simple error detection scheme. It relies on knowing whether there is an even or odd number of ‘1’ bits in each word. This count can be forced to an agreed state by including an extra ‘parity bit’ with each word.
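As an illustrative sketch (not part of the original page), even parity for a 7-bit ASCII character can be computed like this:

```python
def parity_bit(word: int, bits: int = 7) -> int:
    """Even-parity bit: the extra bit that makes the total number of
    '1' bits in <word plus parity bit> even."""
    ones = bin(word & ((1 << bits) - 1)).count("1")
    return ones & 1  # 1 if the count of '1' bits is currently odd

# 'A' is 0b1000001: two '1' bits, so even parity adds a 0
print(parity_bit(ord("A")))  # → 0
# 'C' is 0b1000011: three '1' bits, so even parity adds a 1
print(parity_bit(ord("C")))  # → 1
```

On receipt (or on read-back), the same count is repeated over all eight bits; an odd total signals that some single bit has flipped.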

Below (assuming the image is still maintained) you can see a picture of (historic) punched paper tape storage. The bytes are striped across the tape. Count the holes in each byte: it is likely that this fragment contains 7-bit ASCII data with the parity bit at the bottom.
A clue to interpreting it: there is a 26-character consecutive sequence. Can you read it?

Punched paper tape

Checksums

Sometimes data blocks may be stored correctly but subject to read or transmission problems: a relevant example is reading a sector from a magnetic disk. Here the block may be protected by a checksum, which is a form of signature. If there is an error it is highly likely that the checksum will be wrong, so the error can be detected. This has a low overhead and is useful in cases like the disk, where the read can be retried.

There are many ways to form a checksum. One example, readily available for playing with, is the Unix cksum utility. Try it on a file of your own, before and after a ‘small’ change.
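To experiment without leaving Python, a CRC-style checksum (a common way to form one, though a different variant from the one `cksum` uses) is available in the standard library. The principle is the same: a ‘small’ change gives a completely different signature.

```python
import zlib

data = b"The quick brown fox jumps over the lazy dog"
print(hex(zlib.crc32(data)))

# A 'small' change: flip a single bit in the first byte
corrupted = bytearray(data)
corrupted[0] ^= 0x01
print(hex(zlib.crc32(bytes(corrupted))))

# The two checksums differ, so the error is detected
assert zlib.crc32(data) != zlib.crc32(bytes(corrupted))
```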

Error correction

An Error Correction Code (ECC) is a means of recovering the desired information from data which may have been corrupted. It is most commonly encountered in communications, where it is easy to envisage ‘interference’ on a radio link, for example. However the principle can be (and is) used in some computer memories too. “Soft errors” – where a stored memory bit’s state is changed although the hardware is undamaged – are infrequent but do occur, typically caused by cosmic rays. (Yes, really.) Modern systems – particularly ‘servers’ – have so much RAM that this is a genuine problem.

Hamming code

A simple (i.e. easy to build in hardware) Error Correction Code is a Hamming code which uses the parity of groups of bits. In the example below a 4-bit word is protected by using three extra bits. Any single bit error can be detected and corrected.

The bits can be listed in any order (of course!) but are often ordered as shown below, where D* are the data bits and C* are the extra, check bits. At first glance this order may look strange: it is chosen so that the parity sets (circles) when taken together indicate the bit position of a fault, numbering from ‘1’ from the right, with ‘000’ indicating no error. The chosen parity, in each group, is set to ‘even’.

Believing a particular bit is wrong means it can be ‘flipped’ to the correct value, since any bit can only have two states.
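The scheme above can be sketched in code. This is a minimal, illustrative Hamming (7,4) implementation (the function names are my own): check bits sit at positions 1, 2 and 4, and recomputing the parities on read yields a 3-bit ‘syndrome’ which is exactly the position of the faulty bit, with ‘000’ meaning no error.

```python
def hamming74_encode(d):
    """Encode data bits [D1, D2, D3, D4] into 7 bits (positions 1..7).
    Check bits C1, C2, C3 occupy positions 1, 2 and 4; even parity."""
    d1, d2, d3, d4 = d
    c1 = d1 ^ d2 ^ d4            # covers positions 1, 3, 5, 7
    c2 = d1 ^ d3 ^ d4            # covers positions 2, 3, 6, 7
    c3 = d2 ^ d3 ^ d4            # covers positions 4, 5, 6, 7
    return [c1, c2, d1, c3, d2, d3, d4]

def hamming74_correct(word):
    """Recompute the three parities; read them together as a binary
    number giving the position of the faulty bit (0 = no error)."""
    w = word[:]
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]   # parity over positions 1, 3, 5, 7
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]   # parity over positions 2, 3, 6, 7
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]   # parity over positions 4, 5, 6, 7
    syndrome = (s3 << 2) | (s2 << 1) | s1
    if syndrome:
        w[syndrome - 1] ^= 1         # flip the (single) faulty bit back
    return w

code = hamming74_encode([1, 0, 1, 1])
corrupted = code[:]
corrupted[4] ^= 1                    # damage position 5 (data bit D2)
assert hamming74_correct(corrupted) == code
```

Note how the ‘flip it back’ step works only because a bit has just two states, as described above.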

ECC operation

Note that this code will correct single bit errors; with multiple bit errors it is liable to ‘correct’ the wrong bit.

Within a word, a single bit error can be corrected with approximately log2(word length) extra bits. In detail:

<data bits> + <check bits> <= 2^<check bits> - 1

Longer words have a higher probability of containing more than one bit error, of course, but as the soft error rate in memories is very small, the probability of two errors in one word is still extremely small. Protecting longer words is therefore generally more efficient in storage. Note that the overhead is still quite high: protecting a 32-bit data word against a single bit fault requires 6 extra bits – about 19% more bits stored.
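A short sketch makes the economy of scale concrete: for each data width, find the smallest number of check bits satisfying the inequality above.

```python
def check_bits_needed(data_bits):
    """Smallest c such that data_bits + c <= 2**c - 1."""
    c = 1
    while data_bits + c > 2 ** c - 1:
        c += 1
    return c

for d in (4, 8, 16, 32, 64):
    c = check_bits_needed(d)
    print(f"{d:3} data bits -> {c} check bits ({100 * c / d:.1f}% overhead)")
```

This reproduces the figures above: 3 check bits for a 4-bit word (75% overhead) down to 6 for a 32-bit word (about 19%).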

When the memory is read, the checking and correction are applied transparently. (In the case of an error an exception may be raised so the fault can be repaired, but this is not essential.) When a word is written, the check bits are calculated and stored transparently.

But ... there may be a time penalty when a single byte is written to a word, since calculating the new check bits requires the other bytes to be read from the RAM first: two RAM operations instead of one. Byte (char) writes are typically quite rare, but this could affect performance noticeably during intensive operations on byte (char) arrays, for example.

Caution!

Note that although ECC will mask bit faults, it makes individual bit faults slightly more likely, since more bits are stored and any of them can fail.



