Write buffers

There are two types of memory transaction: loads (reads) and stores (writes). Loads return data which is (presumably!) wanted so their latency is critical; a processor has to stall if it needs data which is still being loaded.

Okay – there may be a few things a processor could do, but they are really beyond our scope here. For example, a processor can, and probably will, issue (internally) non-dependent instructions, possibly out of order. Unlike a page fault, there is unlikely to be time for a process context switch: we are looking at some ‘tens’ of cycles here, not millions, and the context switch itself might take hundreds of cycles (register and virtual cache flushes etc.). Some sophisticated processors employ hyperthreading and can switch threads (not processes – the virtual memory map is retained) during load latency.
None of this is immediately relevant here though. Back to the main story.

Stores, on the other hand, are not (in most cases) time critical. To the processor, a write operation is (usually) ‘fire and forget’: the address and data are sent and the operation can make its own way to memory. There is no immediate dependency; the processor can keep going in parallel with the memory operation.

Write buffer illustration

To exploit this, a write buffer can be used to hold one or more pending write operations. Memory operations (load/store) might account for (say) a third or a quarter of all instructions and loads typically outnumber stores by (say) 2:1, so a store might occur every ten instructions or so. (‘Hand-waving’? Yes. Illustrative? Also yes.) This gives an average store plenty of time to complete, at least into cache.
Note that the store instructions will often not be evenly distributed though; having multiple write buffer locations allows some ‘elasticity’ in the pipeline and smooths out the flow.
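To make this concrete, here is a minimal C sketch of a write buffer as a small FIFO of pending stores. The structure, field names and depth are illustrative assumptions, not any particular implementation.

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES 4   /* illustrative depth: a handful of pending stores */

    /* One pending (buffered) store: address, data and a valid flag. */
    typedef struct {
        bool     valid;
        uint32_t addr;
        uint32_t data;
    } wb_entry_t;

    /* The buffer itself is a small FIFO; 'head' is the oldest pending write. */
    typedef struct {
        wb_entry_t entry[WB_ENTRIES];
        int head;    /* next entry to retire to memory      */
        int tail;    /* next free slot for a new store      */
        int count;   /* number of valid (pending) stores    */
    } write_buffer_t;

    /* Processor side: try to buffer a store.  Returns false (stall) if full. */
    static bool wb_push(write_buffer_t *wb, uint32_t addr, uint32_t data)
    {
        if (wb->count == WB_ENTRIES)
            return false;                       /* buffer full: processor must stall */
        wb->entry[wb->tail] = (wb_entry_t){ true, addr, data };
        wb->tail = (wb->tail + 1) % WB_ENTRIES;
        wb->count++;
        return true;                            /* 'fire and forget' from the CPU's view */
    }

    /* Memory side: retire the oldest pending store when the memory is free. */
    static bool wb_retire(write_buffer_t *wb, wb_entry_t *out)
    {
        if (wb->count == 0)
            return false;                       /* nothing pending */
        *out = wb->entry[wb->head];
        wb->entry[wb->head].valid = false;
        wb->head = (wb->head + 1) % WB_ENTRIES;
        wb->count--;
        return true;
    }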

If there are several queued write operations, what happens to a subsequent read operation? Such a circumstance can occur, for example, with a series of stack PUSHes on procedure entry.
In a primitive implementation the read would have to wait for the write buffer to be cleared, which would negate much of the benefit. However the read can usually overtake the queued writes, keeping the read latency low.
(This does add another complication: see below).

Demonstration

In the example below a three-entry write buffer has been added to a system where the memory takes three cycles to complete either a read or a write transaction. (These cycles are highlighted in different colours.) The assumption/simplification here is that further operations cannot begin whilst a read is outstanding.
The ‘Stall’ button inserts idle cycles, which are needed if the processor is waiting for a memory resource. Other buttons will only respond if the corresponding step can be taken.
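For the curious, here is a rough cycle-by-cycle model of the situation in the demonstration – a three-entry buffer in front of a memory taking three cycles per write – counting how many stall cycles a burst of stores causes. Reads are ignored and the issue pattern is invented purely for illustration.

    #include <stdio.h>

    #define MEM_CYCLES 3   /* cycles for memory to complete one write (as in the demo) */
    #define WB_DEPTH   3   /* three-entry write buffer (as in the demo)                */

    int main(void)
    {
        /* Issue pattern: 1 = the next instruction is a store, 0 = non-memory work. */
        int issue[] = { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
        int n       = sizeof issue / sizeof issue[0];

        int pending = 0;   /* stores currently queued in the buffer   */
        int busy    = 0;   /* cycles left on the write being drained  */
        int stalls  = 0;   /* idle cycles forced on the processor     */
        int i       = 0;   /* next instruction in the pattern         */

        for (int cycle = 0; i < n; cycle++) {
            /* Memory side: continue draining the oldest buffered write. */
            if (busy > 0 && --busy == 0)
                pending--;                      /* that write has now completed   */
            if (busy == 0 && pending > 0)
                busy = MEM_CYCLES;              /* start on the next queued write */

            /* Processor side: a store only stalls if the buffer is full. */
            if (!issue[i]) {
                i++;                            /* non-memory instruction: proceed */
            } else if (pending < WB_DEPTH) {
                pending++;                      /* buffered: 'fire and forget'     */
                i++;
            } else {
                stalls++;                       /* buffer full: retry next cycle   */
            }
        }
        printf("Processor stall cycles: %d\n", stalls);
        return 0;
    }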

Forwarding

Giving memory read operations priority is good for performance but it introduces a hazard: what if the load is using the same address as a preceding store (not that unlikely!) which is still in the buffer? If the read simply overtakes then it will pick up an earlier (unexpected!) value. This would be a Bad Thing.

Thus, if the operations can go out of order it is essential to check the load against the write buffer to see whether it holds a more recent value than the memory (or cache, at whatever level we're looking) and forward the queued data instead. This involves checking all the valid write buffer entries and choosing the most recent matching data, since it is conceivable that more than one store to the same address is present.
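A minimal sketch of such a forwarding check, continuing the illustrative structures from earlier (entries are kept oldest-first in a plain array to keep the scan simple):

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES 4

    typedef struct {
        bool     valid;
        uint32_t addr;
        uint32_t data;
    } wb_entry_t;

    /* Entries are held oldest-first in issue order; 'count' of them are valid. */
    typedef struct {
        wb_entry_t entry[WB_ENTRIES];
        int        count;
    } write_buffer_t;

    /* A load that is about to overtake the queued writes must first check them.
     * Scan from newest to oldest so that, if the same address was stored more
     * than once, the most recent value wins.  Returns true if data was forwarded.
     */
    static bool wb_forward(const write_buffer_t *wb, uint32_t load_addr,
                           uint32_t *data_out)
    {
        for (int i = wb->count - 1; i >= 0; i--) {
            if (wb->entry[i].valid && wb->entry[i].addr == load_addr) {
                *data_out = wb->entry[i].data;   /* newer than memory: forward it */
                return true;
            }
        }
        return false;   /* no match: safe to read from cache/memory */
    }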

Some extra concepts (only for the dedicated!).

Write aggregation

It is sometimes feasible to combine write operations to the same or adjacent locations to reduce the number of memory operations. This may be useful if (for example) a byte string is being written. Care must be taken that this can never have a visible effect on the instruction execution order.
In a processor like the (32-bit) ARM, which has burst transfers, this can be quite beneficial.
Cache line write-backs are likely to be bursts already if the line is wider than the data bus.
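As an illustration, here is a sketch in which a pending entry covers an aligned 32-bit word with per-byte ‘dirty’ flags, so that several byte stores can be merged into a single memory write; the structure and names are assumptions, not any real design.

    #include <stdbool.h>
    #include <stdint.h>

    /* One pending entry covers an aligned 32-bit word, with one 'dirty' flag
     * per byte lane, so that several byte stores can be aggregated into a
     * single memory write.
     */
    typedef struct {
        bool     valid;
        uint32_t word_addr;   /* address of the aligned word            */
        uint8_t  byte[4];     /* data for each byte lane                */
        bool     dirty[4];    /* which lanes this entry actually writes */
    } wb_word_entry_t;

    /* Try to merge a byte store into an existing pending entry for the same
     * word.  Returns true if it was absorbed (no extra memory operation needed).
     */
    static bool wb_try_merge(wb_word_entry_t *e, uint32_t addr, uint8_t value)
    {
        if (!e->valid || (addr & ~3u) != e->word_addr)
            return false;                 /* different word: cannot aggregate   */
        unsigned lane = addr & 3u;
        e->byte[lane]  = value;           /* later store overwrites earlier one */
        e->dirty[lane] = true;
        return true;
    }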

Victim cache

A.k.a. “victim buffer”: an extension to a write buffer which retains recently written values (typically cache lines) after they have been written back. This acts as a fast look-up for subsequent reads and can alleviate the penalty when a line eviction turns out to have been chosen … ‘unfortunately’.
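As a sketch (again with made-up sizes and names), a victim buffer can be pictured as a handful of recently evicted lines which are searched on a cache miss before going out to memory:

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_WORDS   8    /* illustrative line size                  */
    #define VICTIM_LINES 4    /* a victim buffer is typically very small */

    /* One recently evicted ("victim") cache line, kept after write-back. */
    typedef struct {
        bool     valid;
        uint32_t tag;                 /* line address (address / line size) */
        uint32_t data[LINE_WORDS];
    } victim_line_t;

    static victim_line_t victim[VICTIM_LINES];

    /* On a cache miss, check the victim buffer before going out to memory:
     * if the line was evicted only recently it can be recovered quickly.
     * Returns a pointer to the line data, or NULL if it must be fetched.
     */
    static const uint32_t *victim_lookup(uint32_t line_tag)
    {
        for (int i = 0; i < VICTIM_LINES; i++)
            if (victim[i].valid && victim[i].tag == line_tag)
                return victim[i].data;
        return NULL;
    }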

When not to buffer writes

Write buffering can yield significant performance improvements and is generally safe when dealing with simple memory. However it can introduce problems in certain circumstances, typically in shared memory with multiprocessors or other devices which communicate through the memory space. A processor could perform a store operation, load the location again – ostensibly to check – and believe that the memory was up to date when, in fact, the store was still pending in the buffer.
This could lead to Trouble. If an operation were triggered to (say) output the memory to a disk or network, an out-of-date copy could be picked up.
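The pattern being warned about might look something like the sketch below; the buffer and the device-start routine (start_dma_to_disk) are entirely hypothetical.

    #include <stdint.h>

    /* Hypothetical device-start routine: the name is made up purely to
     * illustrate the dangerous pattern described above.
     */
    extern void start_dma_to_disk(volatile uint32_t *buf);

    void save_block(volatile uint32_t *buf)
    {
        buf[0] = 0xDEADBEEF;         /* store: may still be sitting in the write buffer */

        if (buf[0] == 0xDEADBEEF) {  /* load: forwarded from the buffer, so the value   */
                                     /* looks up to date to the processor ...           */
            start_dma_to_disk(buf);  /* ... but the DMA engine reads real memory, which */
                                     /* may still hold the old data.                    */
        }
    }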

Other dangerous scenarios might be envisaged.

There are some means which can protect against potential out-of-order memory accesses:

Unbufferable
Memory management (q.v.) will typically contain status bits at a page-by-page granularity. One such bit may permit (or forbid) locations in that page to be cached; another may control whether write operations to the page may be buffered. If write buffering is forbidden, the processor must stall until the operation has completed; since writes will be ordered, this implies that preceding writes have also completed, and the stall delays any subsequent reads.
Memory barrier
A memory barrier – sometimes called a “fence” – is an instruction which preserves the programmed order of memory operations. Put simply, everything before the barrier must complete before anything after the barrier is allowed to complete.
Thus, placed between a store and a load of the same address, it will force any buffered write operations to be flushed before the load proceeds.
This type of expedient may be used, for example, when part of the memory is shared between processes or processors.
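As an illustration, here is one way this might look in C11, using atomic fences as the barriers; the producer/consumer scenario and the names are invented for the example.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Shared with another processor through ordinary memory. */
    static uint32_t   shared_data;
    static atomic_int data_ready;

    void producer(uint32_t value)
    {
        shared_data = value;           /* may linger in the write buffer          */

        /* Barrier ("fence"): everything before it must complete before anything
         * after it.  On a 32-bit ARM this might compile to a DMB instruction.
         */
        atomic_thread_fence(memory_order_release);

        atomic_store_explicit(&data_ready, 1, memory_order_relaxed);
    }

    uint32_t consumer(void)
    {
        while (atomic_load_explicit(&data_ready, memory_order_relaxed) == 0)
            ;                          /* wait until the producer signals         */

        atomic_thread_fence(memory_order_acquire);  /* matching barrier on reads  */
        return shared_data;            /* now guaranteed to see the new value     */
    }

The release fence forces the buffered store to shared_data to become visible before the data_ready flag is set; the matching acquire fence on the reading side ensures the consumer cannot see the flag without also seeing the data.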


Next: Operation Ordering