Small memories can be made fast but big memories will be slower. In practice, some memory locations are used more often than others. To maximise performance we want the frequently used locations to be fast; if the infrequently used locations are slow, the impact will not be too serious. The approach relies on the statistics of a ‘typical’ program ... whatever that may mean.
Fortunately, almost all software shares certain characteristics, loosely described by the Pareto principle: most memory accesses are confined to a small subset of the total address range. Whether that is 80% of references to 20% of locations, or 90% to 10%, does not matter exactly and will vary from application to application.
Illustrations of this principle include: instructions inside loops being executed over and over, instruction fetches from consecutive addresses, traversal of arrays and other data structures, and repeated accesses to variables near the top of the stack.
A key word here (as we expect you already know) is “locality”. This comes in two (main) ‘flavours’: temporal locality (a location used recently is likely to be used again soon) and spatial locality (locations near a recently used one are likely to be used soon).
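As a concrete (if simplified) sketch of where these show up, consider a plain loop; the array name and size here are arbitrary choices for illustration:

```python
# Illustrative sketch only: the name 'data' and the sizes are arbitrary.
data = [0] * 1_000_000          # a large array, indexed sequentially below

total = 0
for i in range(len(data)):
    # The loop body's instructions are fetched again on every iteration,
    # so the same code addresses are reused: temporal locality.
    total += data[i]
    # Successive iterations read data[0], data[1], data[2], ...,
    # so neighbouring locations are touched in turn: spatial locality.
```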
Puzzle: considering your knowledge of software behaviour, what examples of locality can you think of?
(There are some suggestions above.)
This principle is the basis of caching in a memory hierarchy. The organisation can be done ‘by hand’ (i.e. by a programmer knowing specific addresses for differing speed/size memory areas) but is usually automated, with an algorithm deciding dynamically which items from the memory should be accelerated.
It's traditional to have something like the next picture about here.
There are bigger pictures with more levels but the principle holds. In this figure there are two levels of cache; in practice there may be more levels.
Note that cache duplicates some part(s) of the memory below it in the hierarchy. Its function is to (try to) improve performance; it is not essential for correct function.
Another ‘tradition’ at this stage is to look at some performance figures and calculations. These are also good for asking test questions about, so here goes.
Assume the time taken for a cache access is the same as a processor cycle. Let the memory take Tmem cycles. If the probability of a cache miss is Pmiss the average memory access time will be:
Tave = 1 + Pmiss×Tmem
For example, if Tmem = 10 and Pmiss = 10% the average time per access would be 2 cycles. Notice this would mean the processor running at half the speed it is capable of.
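As a quick check of the arithmetic, here is a minimal Python sketch of the formula (the function name is just for illustration):

```python
def average_access_time(p_miss, t_mem, t_hit=1):
    """Average memory access time in cycles: Tave = Thit + Pmiss * Tmem."""
    return t_hit + p_miss * t_mem

# The worked example from the text: Tmem = 10 cycles, Pmiss = 10%.
print(average_access_time(0.10, 10))   # -> 2.0 cycles
```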
Note that the miss rate is more commonly quoted than (its complement) the hit rate since hit rates are usually quite high and numbers like 99.8% and 99.9% are harder for readers to differentiate than 0.2% and 0.1%.
Easy puzzle: if the miss rate in the above example is reduced to 5%, what would the average instruction cycle time be?
Activity: experiment with plotting some curves from the range of typical cache values below.
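The interactive plot from the original page is not reproduced here; a short matplotlib sketch along these lines would do, where the chosen Tmem values and miss-rate range are merely plausible assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

p_miss = np.linspace(0.0, 0.20, 100)        # miss rates from 0% to 20%

for t_mem in (5, 10, 20, 50):               # assumed memory times, in cycles
    t_ave = 1 + p_miss * t_mem              # Tave = 1 + Pmiss * Tmem
    plt.plot(p_miss * 100, t_ave, label=f"Tmem = {t_mem} cycles")

plt.xlabel("Miss rate (%)")
plt.ylabel("Average access time (cycles)")
plt.legend()
plt.show()
```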
Notice how easily the system performance can be ‘crippled’ by the memory speed! A low miss rate is therefore very important. The impact of a slow memory may also be alleviated with more cache levels.
Puzzle: given Tmem = 10 what miss rate (Pmiss) would be needed to achieve a 10% performance penalty?
Discussion: in some literature you may see a variant formula: Tave = (1 - Pmiss) + Pmiss×Tmem
which removes the ‘wasted’ (failed) hit attempt from the miss case. However, something still needs to determine whether each access hits or misses. (A quick numeric comparison of the two formulas follows this discussion.)
One strategy would be to start both a cache and memory access immediately and abandon the memory reference if the cache hits. What are the pros and cons of this approach?
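To see how little the two formulas differ when a cache access costs a single cycle, here is a small numeric comparison (the example values are arbitrary):

```python
def t_ave_serial(p_miss, t_mem):
    # Probe the cache first (1 cycle), then go to memory on a miss.
    return 1 + p_miss * t_mem

def t_ave_no_wasted_probe(p_miss, t_mem):
    # Variant formula: the failed hit attempt is not counted on a miss.
    return (1 - p_miss) + p_miss * t_mem

for p in (0.01, 0.05, 0.10):
    print(p, t_ave_serial(p, 10), t_ave_no_wasted_probe(p, 10))
# e.g. with Pmiss = 10% and Tmem = 10: 2.0 vs 1.9 cycles.
```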
The formula above can be extended to multi-level caches.
Tave = 1 + PmissL1×(TL2 + PmissL2×Tmem)
(And so on.)
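A sketch of the two-level version; the L2 and memory times used below are assumed figures, not values from the text:

```python
def t_ave_two_level(p_miss_l1, p_miss_l2, t_l2, t_mem, t_l1=1):
    """Tave = 1 + PmissL1 * (TL2 + PmissL2 * Tmem)."""
    return t_l1 + p_miss_l1 * (t_l2 + p_miss_l2 * t_mem)

# Assumed figures: 5% L1 misses, 20% of those also miss in L2,
# L2 access 4 cycles, main memory 50 cycles.
print(t_ave_two_level(0.05, 0.20, 4, 50))   # -> 1.7 cycles
```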
Discussion: what, if anything, can be done whilst waiting for a response from a cache miss?
We return to this topic later but this is an opportunity to think about what (if anything) might be practicable in these circumstances.
Note that there is a potential analogy with a page fault in a virtual memory system, with the inherent latency of swapping a page in from disk. How similar (or different) is this case?
Next: Caches.