MMU – the Memory Management Unit

A Memory Management Unit (MMU) is not vital for a computer but is needed to support the operation of virtual memory. Its chief function is address translation; it also provides other services such as memory protection.

Address translation

Logically this is simply a pattern translation. However various expedients are needed to implement the process practically. Translation lies on the path of every memory access and is inherently serial with it, so it can increase the latency of every access. This would be a serious penalty.

The usual approach is to provide a virtually addressed cache to avoid the translation delay ... as long as that cache hits. This first level of cache needs to be fast: it is therefore quite small.
In parallel with the level 1 cache look up, the MMU can provide an address translation in case the cache misses.

Translation is provided via page tables. The page tables can be large: they are stored in memory. Thus each translation is subject to the latency of the main memory. Worse – with 32- and 64-bit address spaces it is impractical to store fully expanded page tables for every possible process, so the page tables are divided into multiple ‘levels’: perhaps two levels in a 32-bit system, more in a 64-bit address space. This means that multiple memory look-ups are needed for each memory reference. Unalleviated, this would be cripplingly slow!
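The multi-level walk can be sketched as follows. This is a minimal illustration, not any real architecture's table format: it assumes a 32-bit address split into a 10-bit level-1 index, a 10-bit level-2 index and a 12-bit page offset, with each table modelled as a dictionary.

```python
# Sketch: a two-level page table walk for a 32-bit address split as
# 10-bit L1 index | 10-bit L2 index | 12-bit page offset.
# The split and table layout are illustrative assumptions.

PAGE_BITS = 12          # 4 KiB pages
L2_BITS = 10
L1_BITS = 10

def table_walk(l1_table, vaddr):
    """Translate a virtual address using two memory reads (one per level)."""
    l1_index = (vaddr >> (PAGE_BITS + L2_BITS)) & ((1 << L1_BITS) - 1)
    l2_index = (vaddr >> PAGE_BITS) & ((1 << L2_BITS) - 1)
    offset   = vaddr & ((1 << PAGE_BITS) - 1)

    l2_table = l1_table.get(l1_index)        # first memory reference
    if l2_table is None:
        raise LookupError("page fault: no level 2 table")
    frame = l2_table.get(l2_index)           # second memory reference
    if frame is None:
        raise LookupError("page fault: page not mapped")
    return (frame << PAGE_BITS) | offset

# Map virtual page 0x12345 (L1 index 0x48, L2 index 0x345) to frame 0xABC
tables = {0x48: {0x345: 0xABC}}
```

Note that even this small example costs two memory references per translation; deeper tables cost proportionally more, which is why the result is cached.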

The Translation Lookaside Buffer (TLB) is a cache of address translations. Using the same principle as memory caches, it holds a subset of the possible page translations; statistically, a process uses only a few of its possible pages, and even fewer at any particular moment, so all or nearly all translations can be looked up very rapidly.
Note that what is cached encompasses all the page table levels in one operation: virtual address in to physical address out.
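A TLB can be sketched as a small cache keyed by virtual page number; the point is that it caches the complete translation, so a hit skips every page-table level. The capacity and the eviction policy (FIFO here, via `OrderedDict`) are illustrative assumptions.

```python
from collections import OrderedDict

PAGE_BITS = 12

class TLB:
    """Sketch of a TLB: virtual page number -> physical frame, whole walk cached."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = OrderedDict()

    def translate(self, vaddr, walk):
        """Return the physical address; call `walk` (the slow path) only on a miss."""
        vpn = vaddr >> PAGE_BITS
        offset = vaddr & ((1 << PAGE_BITS) - 1)
        if vpn not in self.entries:               # TLB miss: do a table walk
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict the oldest entry
            self.entries[vpn] = walk(vpn)
        return (self.entries[vpn] << PAGE_BITS) | offset
```

Repeated accesses within the same page then pay the table-walk cost only once.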

MMU block diagram (video, 5 mins., 2014)

If/when a TLB miss occurs the page tables must be read and references followed. This will take some time – possibly hundreds of processor cycles. The operation is performed by a hardware state machine in a process called "table walking". Whilst this proceeds the originating process has to be suspended.
In a multi-threaded processor it is possible to continue running other threads (which share the same virtual memory map) in parallel with table walking, keeping the processor busy.

Page tables are specific to each process. When context switching the MMU is reprogrammed with a pointer to the appropriate (first level) page table. Note that this invalidates the contents of the level 1 cache(s) and the TLB and they must be flushed, as appropriate. The TLB and instruction caches can typically be discarded: a data cache may contain modified (‘dirty’) locations and thus need a significant time to copy all the updated lines back down the memory hierarchy.
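The steps above can be sketched as follows. All the names here are illustrative (no real architecture exposes registers like this); the point is the ordering: dirty data must be written back before the cache contents are discarded, whereas TLB entries can simply be dropped.

```python
# Sketch of the MMU/cache consequences of a context switch.
# Names and structures are illustrative assumptions, not a real architecture.

class Cache:
    def __init__(self):
        self.lines = {}          # address -> (value, dirty flag)
        self.memory = {}         # stand-in for the backing store
    def flush_dirty_lines(self):
        for addr, (value, dirty) in self.lines.items():
            if dirty:
                self.memory[addr] = value   # copy modified lines back down
    def invalidate(self):
        self.lines.clear()

class MMUState:
    def __init__(self):
        self.page_table_base = None
        self.tlb = {}            # cached translations

def context_switch(mmu, new_table_base, data_cache):
    """New process, new virtual map: old translations and cached data are stale."""
    mmu.page_table_base = new_table_base
    mmu.tlb.clear()                   # TLB entries simply discarded
    data_cache.flush_dirty_lines()    # dirty data written back first...
    data_cache.invalidate()           # ...then the cache is discarded
```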

MPU

Small systems (typically embedded systems) may not justify the effort of implementing a full MMU. In particular, a TLB can consume considerable silicon area. A Memory Protection Unit (MPU) provides permission checks for areas of memory without the virtual-to-physical translation.

Cacheability

A simple but important function of an MMU (or MPU) is to mark which addresses can be cached and which cannot. In general, memory can be cached and this will enable significant speed up. On the other hand, areas of the address space used for I/O must not be cached since they can (and will) change contents without processor actions. Similarly memory which is involved in active DMA transfers should not be cached.
A related issue is whether write operations are allowed to be buffered and this may be another ‘enable’ bit on each page/area.
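These per-page attributes can be pictured as a couple of bits held alongside each translation. The bit names and the example regions below are illustrative assumptions, not any particular architecture's encoding.

```python
# Sketch: per-page 'cacheable' and 'bufferable' attribute bits,
# as an MMU or MPU might hold them. Layout is an illustrative assumption.

CACHEABLE  = 0b01   # reads may be satisfied from the cache
BUFFERABLE = 0b10   # writes may be held in a write buffer

page_attrs = {
    0x00000: CACHEABLE | BUFFERABLE,  # ordinary RAM: full speed-up
    0x40000: 0,                       # I/O region: never cache or buffer
    0x80000: BUFFERABLE,              # e.g. frame buffer: buffer writes, don't cache
}

def may_cache(page, attrs=page_attrs):
    return bool(attrs.get(page, 0) & CACHEABLE)

def may_buffer_writes(page, attrs=page_attrs):
    return bool(attrs.get(page, 0) & BUFFERABLE)
```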


Memory access process

Interactive flowchart of memory accesses: count cycles etc.


Memory as a component

SRAM buffer

In a computing system there is often a need for some ‘bulk’ storage outside the programmer's direct view. This is usually in the form of a buffer. Buffers are used extensively to avoid strict synchronisation between components. For example, data may be streamed from a network connection at the network's rate into a RAM buffer and then read out – perhaps by DMA – at a different rate. Network routers, for example, often do this to store-and-forward packets.

For most on-chip applications this will be implemented with blocks of SRAM.

A problem with using SRAM as a buffer is that a normal SRAM has a single port – i.e. there can only be a single access (read or write) at any time. This imposes a bandwidth limit.

Lookup table

Another application for a RAM (almost always SRAM) – or, indeed a ROM – as a component might be a lookup table. Some functions (such as logarithms or trig. functions) can be quite expensive to calculate: a lookup table is an array of precalculated answers, which can supply an arbitrary function's value in a single memory cycle time.

Often associated with software, a lookup can also be used as part of a hardware calculation system. The input argument(s) form the ‘address’ and the stored data is the result.
Lookups can be useful for monadic (one argument) functions – less useful with two (or more) arguments since the address needs to be the concatenation of the inputs so the size of the table expands rapidly.
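Both points can be illustrated with small tables. The functions chosen here (a scaled sine and an 8-bit multiply) are arbitrary examples; note how the two-argument table is already 256 times larger than the one-argument table.

```python
import math

# Monadic function: 256 entries cover every 8-bit input.
# Values are scaled to fit a signed byte (an illustrative choice).
SINE = [round(127 * math.sin(2 * math.pi * i / 256)) for i in range(256)]

def fast_sine(i):
    return SINE[i & 0xFF]           # one 'memory cycle', no arithmetic

# Dyadic function: the 'address' is the concatenation of both inputs,
# so two 8-bit arguments already need 2**16 = 65536 entries.
MUL = [a * b for a in range(256) for b in range(256)]

def fast_mul(a, b):
    return MUL[(a << 8) | b]        # address = inputs concatenated
```

In hardware the same idea applies directly: the argument bits drive the address lines and the data lines carry the answer.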


Interleaved buffer

Multiport Memory

It is possible to engineer multiport memory arrays but there are some difficulties. For a dual-port RAM: each storage cell needs extra access transistors and a duplicated set of word and bit lines, which considerably increases the cell area; and simultaneous accesses to the same location (especially two writes) must be detected and arbitrated.

An alternative mechanism is to make a unit which is almost as good (probably better when the disadvantages above are averted) by interleaving locations in single-port RAM blocks. When one port is connected to one part of the buffer the other port can still use the other. As long as there are no conflicts the buffer system has twice the bandwidth of either block.

Deduction: in the figure the SRAM arrays have been labelled ‘even’ and ‘odd’. This is a hint! Which addresses (think address bits) would be routed to each block and why?
You need to consider how such a buffer is typically employed and what the address sequences might be to make sense of this.

It is, of course, possible to generalise this concept and use more ‘ways’ of interleaving. This further reduces the chance of conflicts (but to a lesser extent than the first step of interleaving). Assuming there is a minimum practical size to the component arrays (which there is!), further interleaving does increase the minimum size increment of the overall buffer.
E.g. if the SRAM comes in 1 KiB blocks a 2-way interleaved system can be 2 KiB, 4 KiB etc. but a 4-way interleaved system can only be 4 KiB, 8 KiB and so on. The area may be a concern.
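A two-way interleaved buffer can be sketched as follows: the low address bit selects the block, so sequential (streaming) accesses alternate between the blocks and two ports only clash when their addresses fall in the same block. The block size and port model are illustrative assumptions.

```python
# Sketch: two single-port SRAM blocks interleaved on the low address bit.
# Even addresses go to one block, odd addresses to the other.

BLOCK_WORDS = 1024                               # e.g. 1 KiB blocks of words
banks = [[0] * BLOCK_WORDS, [0] * BLOCK_WORDS]   # 'even' and 'odd'

def bank_of(addr):
    return addr & 1              # low address bit selects the block

def write(addr, value):
    banks[bank_of(addr)][addr >> 1] = value      # remaining bits index within the block

def read(addr):
    return banks[bank_of(addr)][addr >> 1]

def conflict(addr_a, addr_b):
    """Two simultaneous accesses clash only if they target the same block."""
    return bank_of(addr_a) == bank_of(addr_b)
```

With sequential streams, a writer at address n and a reader at address n+1 always land in different blocks, so both proceed every cycle and the buffer delivers twice the bandwidth of a single block.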


Next: Memory.