Keeping things in order

A computer program – or, at least, a single thread in a computer program – is an ordered list of statements. When you write the code you rely on this order.

The source code is compiled into object code. The compiler may change the order of instruction op-codes from the ‘obvious’ order by swapping some around, which is legitimate as long as there are no dependencies between them. This may bring performance benefits.

            ldr   r1, [r2]       ; Load r1 from memory
            add   r1, r1, r1     ; Depends on load
            sub   r2, r2, r3     ; Does not need the loaded value

might suffer from a stall, waiting for r1 to be returned from slow memory. It can become:

            ldr   r1, [r2]       ; Load r1 from memory
            sub   r2, r2, r3     ; Can 'sneak' in between!
            add   r1, r1, r1     ; Depends on load

harmlessly, with the sub instruction filling a delay slot.

A ‘binary’ file may be produced which runs on all processor implementations but can still benefit from some reordering. Some compilers offer options to order instructions so that they perform best on a particular processor microarchitecture.

Some high-performance processors will reorder instructions even if the compiler has not. This is ‘out of order issue’ – but more of that later. The important point is that the code should still behave in the way the programmer expected.
The really unpleasant problem arises when something upsets the expected order. Suppose the processor is given the first sequence above but swaps the instructions and issues them in the second order, and the load then faults and causes an exception. The sub may already have changed r2, so it is hard to reconstruct what should have happened.
It is possible to fix this sort of thing. Stay tuned because it's covered later.

Changing the order of operation execution is a general technique which can improve performance. Another example is the write buffer which can allow more urgent reads to overtake writes.
Again, this is good unless something goes wrong: the illusion that operations happen in order needs to be maintained.

A place in the code (thread) where it is guaranteed that everything required previously has been completed and nothing later has been committed to is sometimes called a ‘sequence point’.

Potential problems with peripherals

A typical peripheral interface device will have a number of addressable registers. These might include, for example, a command register and a status register, which may be at different addresses.
An operation may involve writing a command and then waiting until the status indicates that the device is ‘not busy’, meaning that the operation is complete. If the processor issues the write and then the read, the read must not be allowed to overtake the write (think about it!), which it could if there were an intervening write buffer.
Note that the usual forwarding, which compares the address of a read with the addresses of buffered writes, would not prevent this because the two addresses are different.

Thus, I/O areas should be unbufferable as well as uncacheable.
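As a rough illustration of the command/status pattern, here is a minimal C sketch. The register layout, the device_regs type and the CMD_START and STATUS_BUSY values are all invented for the example and do not describe any real device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register layout: not taken from any real device. */
typedef struct {
    volatile uint32_t command;   /* written to start an operation  */
    volatile uint32_t status;    /* bit 0 reads as 1 while busy     */
} device_regs;

#define CMD_START   0x1u         /* assumed command code            */
#define STATUS_BUSY 0x1u         /* assumed 'busy' flag             */

/* Write a command, then poll the status until the device is 'not busy'.
 * 'volatile' stops the compiler from removing these accesses or
 * reordering them with respect to each other. It has no effect on a
 * hardware write buffer, which is why the region itself must also be
 * unbufferable. Returns the number of times 'busy' was seen. */
unsigned device_run(device_regs *dev)
{
    assert(dev != NULL);
    unsigned polls = 0;
    dev->command = CMD_START;            /* the write ...               */
    while (dev->status & STATUS_BUSY)    /* ... must reach the device   */
        polls++;                         /* before this read is issued  */
    return polls;
}
```

On real hardware the pointer would be set to the device's base address, in a region mapped as uncacheable and unbufferable; with a buffered mapping the first status read could reach the device before the command does.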

Memory barriers

‘Ordinary’ RAM can usually be written in different orders harmlessly – as long as two stores to the same location are kept in order (which is a ‘WAW hazard’, discussed elsewhere). However, there are times when this could cause problems. Consider a multi-threaded system where (for example) one processor is filling a data buffer after which (as the programmer writes it) it stores a flag which validates the block. If these stores happen out of order while another processor has a thread which waits for the flag and then reads the data ... consider what might happen.

But how could such a thing happen?
Whilst unlikely, consider (for example) an SDRAM controller managing near-simultaneous incoming requests. Indeed, some RAM controllers will have multiple interfaces, each with a buffer, and will try to sequence commands from their buffers to keep the actual SDRAM chip(s) busy. Remember there are delays as rows are opened and closed and access to different banks may be interleaved, so some reordering is generally beneficial.

So can it be fixed?
There may be memory barrier (sometimes called a “fence”) instructions which can be inserted in the instruction stream. This basically means ‘finish all the operations before this point before starting any after it’. In the previous example this would ensure the store operations happen with the necessary separation.
It would not, of course, compel any synchronisation between independent threads on independent processors.
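The buffer-and-flag example might be sketched in C11 as follows. This is only an illustration: the names block, block_valid, publish and consume are invented here, and the acquire fence on the reading side is an addition beyond what the text describes. The fences in stdatomic.h are the language-level way to request a barrier; which machine instruction (if any) results depends on the target.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define BLOCK_SIZE 4

static int        block[BLOCK_SIZE];   /* the data buffer              */
static atomic_int block_valid;         /* flag: 0 = empty, 1 = valid   */

/* Producer: fill the buffer, then store the flag which validates it. */
void publish(const int *src)
{
    assert(src != NULL);
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        block[i] = src[i];
    /* Barrier: the data stores above may not be reordered after the
     * flag store below. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&block_valid, 1, memory_order_relaxed);
}

/* Consumer: wait for the flag, then read the data. Returns the sum. */
int consume(void)
{
    while (atomic_load_explicit(&block_valid, memory_order_relaxed) == 0)
        ;   /* spin until the flag store becomes visible */
    /* Barrier: the data loads below may not be reordered before the
     * flag load above. */
    atomic_thread_fence(memory_order_acquire);
    int sum = 0;
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        sum += block[i];
    return sum;
}
```

In use, publish would run in one thread and consume in another. The barriers order each thread's own accesses; the consumer still has to wait for the flag itself.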

For the programmer ...

In most cases compilers (and processors) will safely and beneficially reorder operations. Most occasions where this needs to be controlled occur in embedded programming or in libraries, dealing with I/O devices etc.

There is not space to go into all the ramifications here. Some control is sometimes available, such as declaring an item as ‘volatile’ in Java or C.
Beware: the implications are subtle and different in different languages; read up on the details carefully if you ever need to enforce such constraints explicitly.

Further reading (interest only): this article might help illustrate some of the potential for confusion.
Before you ask, there will not be exam. questions on this!

Locking

Most memory has a single port which means that only one operation (either a read or a write) is possible at any time. The memory is modified by a processor, which must successively read the value, modify it and write it back. When more than one (notionally) concurrent thread might modify the same memory location there is a danger that these successive operations might interleave in an ‘embarrassing’ way.

         Thread A          Thread B
          Read
          Modify            Read
          Write             Modify
                            Write

In this example any modification from Thread A is lost. This would be a Bad Thing.
Atomicity of the read-modify-write operation should be preserved.

         Thread A          Thread B
          Read
          Modify            Read  ... Not yet
          Write             Read? ... Still busy
                            Read  ... Okay now
                            Modify
                            Write
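To make the arithmetic concrete, the sketch below replays the first interleaving by hand in a single thread (so the result is repeatable) and compares it with two indivisible updates using C11 atomics. The function names are invented for the illustration, and each 'modify' is assumed to be an increment.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Replays the interleaving from the first diagram, step by step. */
int interleaved_update(int initial)
{
    assert(initial <= INT_MAX - 2);   /* keep the additions free of overflow */
    int shared = initial;
    int a = shared;        /* Thread A: read                         */
    a = a + 1;             /* Thread A: modify                       */
    int b = shared;        /* Thread B: read (A has not written yet) */
    shared = a;            /* Thread A: write                        */
    b = b + 1;             /* Thread B: modify                       */
    shared = b;            /* Thread B: write, overwriting A's value */
    return shared;         /* initial + 1: A's update is lost        */
}

/* The same two updates, each as one indivisible read-modify-write. */
int atomic_update(int initial)
{
    assert(initial <= INT_MAX - 2);
    atomic_int shared;
    atomic_init(&shared, initial);
    atomic_fetch_add(&shared, 1);    /* Thread A's update */
    atomic_fetch_add(&shared, 1);    /* Thread B's update */
    return atomic_load(&shared);     /* initial + 2       */
}
```

Starting from 10, the interleaved version ends at 11 and the atomic version at 12.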

At the bus level this can be done with an additional signal: ‘lock’. This is asserted by the read and it prevents access to that location – and, quite possibly, the whole shared bus – from any source except the locking one until the corresponding write is complete.
(Note that this may be some time: the modification may be quick but the load and store cycles may take a considerable time in ‘distant’ memory.)

(Dumb) questions: why not cache the location?
What's stopping it being cached?

Most operations don't care about this so lock is only used on ‘special occasions’. Any processor intended for multiprocessor systems will have a means of achieving this or some comparable protection mechanism.
Some examples:

ARM has a ‘swap’ instruction which reads a memory location into a register and writes another register (or the same one, with its initial value) to that location as an atomic operation.
x86 can perform operations ‘directly’ on memory: in practice this is a read-modify-write sequence within one (CISC) instruction. These are normally separable operations but the instruction (e.g. INC location) can be made atomic by preceding it with a special ‘LOCK’ prefix.
The MC68000 family has (had?) a ‘test-and-set’ (‘TAS’) instruction which reads a location, sets the processor flags according to its value and then (atomically) writes a ‘1’ to its top bit. This also permits exclusive access since it can be seen afterwards whether the write had any true effect or not.

These rare-but-important operations all assert a hardware bus-lock signal.
Note that the logical behaviour of the bus needs to be preserved, even if the information is carried in a different way, such as a serialised bus (e.g. PCIe) or a network-on-chip.
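From C these primitives are normally reached through stdatomic.h rather than by naming instructions. The sketch below shows rough counterparts of the three examples: an atomic swap, an atomic increment, and a test-and-set used to build a simple spin lock. The spinlock type and the function names are invented here, and the instructions a compiler actually emits depend on the target, so treat the correspondence as approximate.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Swap: store new_value and return what was there before. */
int swap_in(atomic_int *location, int new_value)
{
    assert(location != NULL);
    return atomic_exchange(location, new_value);
}

/* Locked increment: returns the value before the increment. */
int counter_increment(atomic_int *counter)
{
    assert(counter != NULL);
    return atomic_fetch_add(counter, 1);
}

/* A spin lock built on test-and-set (hypothetical type). */
typedef struct { atomic_flag held; } spinlock;

void spin_lock(spinlock *l)
{
    assert(l != NULL);
    /* test-and-set returns the previous state: 'true' means another
     * thread already holds the lock, so our write changed nothing. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;   /* busy-wait and try again */
}

void spin_unlock(spinlock *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A spinlock would be initialised with ATOMIC_FLAG_INIT, and a read-modify-write of any size could then be protected by placing it between spin_lock and spin_unlock.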

This will be revisited from the processor view later.
Note that the principle is not just applicable to RAM locations: a similar problem occurs in (for example) databases where it can be handled purely by software.

Further reading (interest only): transactional memory – another solution.

Thread synchronisation

Another synchronisation issue when multithreading is preventing some threads from passing a synchronisation point before other, parallel threads have reached it. For example, if drawing graphical objects there is clearly an opportunity for considerable parallelism but no thread should move to the next (animation) frame before all the current picture is complete.

Whilst important, this is usually the responsibility of software (at least at the moment!) so is of less interest from an architectural perspective. It's mentioned here for completeness – and for an excuse to include this demonstration by a former student.

This could be implemented with a shared variable, initialised with the number of parallel threads. On arrival at the barrier each thread decrements the variable (once) and waits to proceed, rereading the variable until it is zero.
Note that multiple threads write to the shared location, so the read-modify-write operations must be atomic; there is still, ultimately, some hardware involvement in ensuring that this works properly.
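A minimal C11 sketch of such a counter-based barrier follows. The barrier type and function names are invented for the illustration, and this simple form can only be used once: it makes no attempt to reset the counter for the next frame.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct { atomic_int remaining; } barrier;   /* hypothetical type */

/* Initialise with the number of parallel threads. */
void barrier_init(barrier *b, int threads)
{
    assert(b != NULL && threads > 0);
    atomic_init(&b->remaining, threads);
}

/* Record one arrival; returns how many threads are still awaited. */
int barrier_arrive(barrier *b)
{
    assert(b != NULL);
    /* Atomic read-modify-write: no decrement can be lost. */
    return atomic_fetch_sub(&b->remaining, 1) - 1;
}

/* Called once by each thread: decrement, then wait for the others. */
void barrier_wait(barrier *b)
{
    barrier_arrive(b);
    while (atomic_load(&b->remaining) != 0)
        ;   /* reread the variable until it is zero */
}
```

Each thread would call barrier_wait once when it finishes its share of the work. The busy-wait is the simplest option; a real implementation would more likely block or yield.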
