Caches accelerate memory access. Another way of expressing this is to think of cache misses as imposing a performance penalty.
Every memory access can be thought of as one cycle plus a further
penalty for each access which is discovered to miss.
Average memory access time (in cycles) = 1 + miss rate × miss penalty
Thus, if the penalty is high, performance will be significantly impaired unless the miss rate is kept very low. Contemporary main memory will typically be at least 100× slower than the processor.
Note that ‘miss rate’ is more frequently quoted than ‘hit rate’, since hit rates are often values around 99.9%, 99.8%, 99.5% etc., which sound much more similar than their effects really are.
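As a rough illustration (the 100-cycle miss penalty is simply an assumed figure, in keeping with the 100× ratio above), the sketch below applies the formula to those hit rates:

```c
#include <stdio.h>

int main(void)
{
    /* Assumed figures for illustration only: a hit costs 1 cycle and a
       miss adds a 100-cycle penalty (main memory roughly 100x slower). */
    const double miss_penalty = 100.0;
    const double hit_rates[]  = { 0.999, 0.998, 0.995 };

    for (int i = 0; i < 3; i++) {
        double miss_rate = 1.0 - hit_rates[i];
        /* Average access time = 1 + miss rate x miss penalty */
        double average = 1.0 + miss_rate * miss_penalty;
        printf("hit rate %.1f%% -> average %.2f cycles per access\n",
               hit_rates[i] * 100.0, average);
    }
    return 0;
}
```

The three hit rates sound close, yet the extra delay over an ideal cache (0.1, 0.2 and 0.5 cycles per access) differs by up to a factor of five.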
Any cache can be subdivided into two main components: the tag, which records which memory address a stored copy belongs to, and the data, the local copy of that memory's contents. There may also be other information, such as a validity (‘valid’) bit, which indicates whether the copy is there or not.
Many caches exploit spatial locality by sharing a single tag across a block of adjacent addresses; this avoids storing many tags which would otherwise simply repeat the same values. Such a block is known as a ‘cache line’.
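As a minimal sketch of how the address bits divide (treating addresses as 16-bit word addresses with four words per line, anticipating the worked example below; the particular address here is arbitrary):

```c
#include <stdio.h>
#include <stdint.h>

/* With four words per line, the two least significant bits of a (word)
   address pick the word within the line; everything above them is
   common to all four words of the line and so needs storing only once. */
int main(void)
{
    uint16_t address = 0x3A7E;            /* arbitrary example address   */
    uint16_t offset  = address & 0x3;     /* word within the line (0..3) */
    uint16_t line_id = address >> 2;      /* shared by all four words    */

    printf("address %04X: line %04X, word %u within the line\n",
           (unsigned)address, (unsigned)line_id, (unsigned)offset);
    return 0;
}
```

In the fully associative cache introduced next, this whole upper field is exactly what is stored as the tag.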
Various names are used for cache architectures but there are basically two approaches: high associativity and low associativity. It's easiest to introduce fully associative caches first.
Let's save space on this page by working with a 16-bit space: the
principle extends as needed. We'll assume four words per line as in
the figure above. Plus, we'll work in hexadecimal (what else?).
Start with an empty cache: all the lines are marked as invalid.
Decide to cache a memory word (say, one at a particular address of interest).
Choose an invalid line, record the tag there and fill the line with data from across the aligned addresses: i.e. the four words whose addresses differ only in their two least significant bits.
Repeat this process each time another line should be cached.
When checking the cache, the input address (with the lower bits separated off) is compared against all the valid tags in the hope that one will match. There's actually a good chance that one will and, if it does, those (few) separated low-order bits select the word or byte from within the line, as in the sketch below.
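A minimal sketch of that behaviour in C, assuming a tiny fully associative cache with 16-bit word addresses, four words per line and a deliberately naive choice of which line to fill (the sizes and the stand-in memory array are illustrative assumptions, not taken from the text above):

```c
#include <stdint.h>

#define WORDS_PER_LINE 4
#define NUM_LINES      8               /* deliberately small; an assumed size */

struct line {
    int      valid;
    uint16_t tag;                      /* the address with its low bits removed */
    uint16_t data[WORDS_PER_LINE];     /* copies of four adjacent memory words  */
};

static struct line cache[NUM_LINES];
static uint16_t    memory[1 << 16];    /* stand-in for main memory (word addressed) */

/* Read one word through the cache; on a miss, fill a line from the
   aligned block of four addresses and then return the wanted word.    */
uint16_t cache_read(uint16_t address)
{
    uint16_t tag    = address >> 2;    /* everything above the word-select bits */
    uint16_t offset = address & 0x3;   /* which word within the line            */

    /* Check the tag against every valid entry: the 'expensive' part. */
    for (int i = 0; i < NUM_LINES; i++)
        if (cache[i].valid && cache[i].tag == tag)
            return cache[i].data[offset];                  /* hit */

    /* Miss: choose a line (here simply the first invalid one, else line 0)
       and fill it from the aligned addresses tag*4 .. tag*4 + 3.          */
    int chosen = 0;
    for (int i = 0; i < NUM_LINES; i++)
        if (!cache[i].valid) { chosen = i; break; }

    for (int w = 0; w < WORDS_PER_LINE; w++)
        cache[chosen].data[w] = memory[(tag << 2) | w];
    cache[chosen].tag   = tag;
    cache[chosen].valid = 1;

    return cache[chosen].data[offset];
}
```

In hardware, of course, the tag comparisons are done in parallel rather than in a loop, which is exactly where the cost discussed below comes from.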
This is what is described above. Any data element can be cached in any of the cache locations. It is the most flexible architecture and, for a given size, will give the best hit rate. The problem is that the input address needs to be compared with all the tags on every access, which is expensive, especially in energy. It can also be slow because (practically) reading the data has to wait until the correct tag is determined.
This is the opposite extreme. There is only one place any given element can be cached. To extend the example above, some of the low-order address bits (those just above the bits which select the word within the line) are used directly as an index which picks the single line where that address may be held; only the remaining, high-order bits need to be stored and compared as the tag. Assuming, for illustrative purposes, a not-too-big cache, addresses which are far apart but happen to share the same index bits will map to, and compete for, the same line.
This makes ‘choosing’ where to put a fetched line trivial (there is no choice!). It restricts the cache flexibility so it lowers the hit rate. On the other hand it makes a cache look-up much easier since the address being checked only needs to be compared with one entry whose position is known from the incoming address. This saves a lot of energy. It also offers the chance to speed up access since the data read is typically done speculatively, in parallel with the tag comparison in the hope that it will be a hit. In many circumstances a hit is confirmed and the data is already ready to return.
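For comparison, a sketch of a direct mapped look-up under the same assumed parameters (eight lines of four words, so a 3-bit index); the line to check is dictated by the incoming address and only one tag comparison is needed:

```c
#include <stdint.h>

#define WORDS_PER_LINE 4
#define NUM_LINES      8                        /* assumed, as before */

struct dm_line {
    int      valid;
    uint16_t tag;                               /* high-order bits only */
    uint16_t data[WORDS_PER_LINE];
};

static struct dm_line dm_cache[NUM_LINES];

/* Returns 1 and writes *value on a hit, 0 on a miss. */
int dm_lookup(uint16_t address, uint16_t *value)
{
    uint16_t offset = address & 0x3;                    /* word within the line */
    uint16_t index  = (address >> 2) & (NUM_LINES - 1); /* picks the one line   */
    uint16_t tag    = address >> 5;                     /* remaining high bits  */

    struct dm_line *line = &dm_cache[index];
    if (line->valid && line->tag == tag) {
        *value = line->data[offset];                    /* only one comparison  */
        return 1;
    }
    return 0;                                /* miss: the line must be fetched */
}
```

A fill on a miss would proceed just as before, except that there is no choice of which line to use.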
Direct mapped caches can suffer quite badly from conflicts if the application is unlucky. To alleviate this, set associative caches offer a bit more choice. A ‘two-way’ set associative cache is basically two direct mapped caches in parallel, thus offering two possible places any datum may be cached and preventing some conflicts. Each of the two halves (usually called ‘ways’) will be half the size, of course.
A four-way set associative cache will have four quarter-size ways, and so on.
For a given total capacity, a set associative cache will usually have a better hit rate than a direct mapped cache but a worse one than a fully associative cache. Similarly, its energy requirements will lie between the two extremes.
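A sketch of the corresponding two-way set associative look-up, again with assumed sizes (eight lines arranged as four sets of two ways); the index now selects a set and only the two tags within it are compared:

```c
#include <stdint.h>

#define WORDS_PER_LINE 4
#define NUM_SETS       4                     /* 8 lines in total, 2 ways */
#define NUM_WAYS       2

struct sa_line {
    int      valid;
    uint16_t tag;
    uint16_t data[WORDS_PER_LINE];
};

static struct sa_line sa_cache[NUM_SETS][NUM_WAYS];

/* Returns 1 and writes *value on a hit, 0 on a miss. */
int sa_lookup(uint16_t address, uint16_t *value)
{
    uint16_t offset = address & 0x3;                   /* word within the line */
    uint16_t index  = (address >> 2) & (NUM_SETS - 1); /* picks the set        */
    uint16_t tag    = address >> 4;                    /* remaining high bits  */

    /* Only the two lines in the selected set need their tags compared. */
    for (int way = 0; way < NUM_WAYS; way++) {
        struct sa_line *line = &sa_cache[index][way];
        if (line->valid && line->tag == tag) {
            *value = line->data[offset];
            return 1;
        }
    }
    return 0;            /* miss: one of the two ways must be (re)filled */
}
```

On a miss, either of the two ways in the selected set may be refilled; that extra choice is what recovers some of the hit rate lost by direct mapping.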
The tool below allows a comparison of different cache architectures. The scale diagram is accurate as a logical representation of the cache: the physical layout will sometimes differ. The statistics show the overheads and how the memories are organised. An important aspect is that the low-order address bits are used ‘directly’ to exploit spatial locality and the tags are the high-order address bits.
Particular points to notice are:
The areas in the set diagram (on the left) are scaled to the panel size. The relative proportions (widths) of the tag and data areas are to scale.
The areas in the layout (right hand) figure are to scale
with (addressed) lines spread vertically and data indicated
horizontally.
(The data multiplexer, when present, is exaggerated
so it can be seen.)
The physical layout on the chip will be
‘squarer’ than most of these examples.