
Lecture 1 - Intro

What?

concurrent
there are several tasks in execution at the same moment, that is, task 2 is started before task 1 ends (a minimal illustration follows these definitions).
parallel
(implies at least a degree of concurrency) there are several processing units working simultaneously, executing parts of several tasks at the same time.
distributed
(implies parallelism) the processing units are spatially distributed.
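
A minimal illustration of the difference (the task names and output are made up for the example): the C++ program below starts task 2 before task 1 ends, which is concurrency; whether the two actually run simultaneously, that is, in parallel, depends on having more than one processing unit.

  #include <iostream>
  #include <thread>

  void task2()
  {
    std::cout << "task 2 running\n"; // may run while main() is still in task 1
  }

  int main()
  {
    std::thread t(task2);            // task 2 is started here...
    std::cout << "task 1 running\n"; // ...before task 1 (main) is done
    t.join();                        // wait for task 2 to finish
  }

On a multi-core machine the two tasks may run in parallel; on a single core they are merely interleaved, which is still concurrency.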

Why?

optimize resource utilization
Historically, the first motivation. While a task was performing an I/O operation, the CPU was free to process another task. Also, while a user thinks about what to type next, the CPU is free to handle input from another user.
increase computing power
Single-processor systems reach their physical limits, set by the speed of light (300 mm/ns) and the minimum size of components. Even single processors have long used parallelism between the phases of instruction execution (pipelining).
integrating local systems
A user may need to access mostly local resources (data), but may also need some data from other users. For performance (access time, latency) and security reasons, it is best to keep local data local, but we also need a mechanism for easily (transparently) accessing remote data.
redundancy
We have multiple units able to perform the same task; when one fails, the others take over. Note that, for software faults (bugs), it is possible that the backup unit has the same bug and fails, too.

Why not (difficulties)

increased complexity
race conditions
What happens if, while an operation is executing, some of the state relevant to it is changed by a concurrent operation? (See the counter sketch after this list.)
deadlocks
Task A waits for task B to do something, while task B waits for task A to do some other thing.
non-determinism
The result of a computation depends on the order of completion of concurrent tasks, which in turn may depend on external factors.
lack of global state; lack of universal chronology (distributed systems only)
A process can read a local variable, but cannot read a remote variable (one that resides in the memory of another processor); it can only request that the value be sent, and, by the time the value arrives, the original value may have changed.
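
A minimal sketch of the race condition referred to above (the names and counts are made up for the example): two threads increment the same counter with no synchronization, so increments can be lost when the two read-modify-write sequences interleave. The final value varies from run to run, which also illustrates the non-determinism above.

  #include <iostream>
  #include <thread>

  int counter = 0;                   // shared state, no synchronization

  void work()
  {
    for (int i = 0; i < 100000; ++i)
      ++counter;                     // read-modify-write, not atomic
  }

  int main()
  {
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::cout << counter << "\n";    // expected 200000; usually less
  }

Formally, this is a data race and therefore undefined behavior in C++; in practice, one observes lost updates.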

Classification

Flynn taxonomy

SISD (single instruction, single data)
SIMD (single instruction, multiple data)
MISD (multiple instruction, single data)
MIMD (multiple instruction, multiple data)

Shared memory vs message passing

Shared memory

SMP (symmetrical multi-processing)
identical processors (cores) accessing the same main memory
AMP (asymmetrical multi-processing)
like SMP, but processors have different capabilities (for example, only one can request I/O)
NUMA (non-uniform memory access)
each processor has a local memory but can also access a main memory and/or the local memory of the other processors.

Message passing

cluster
many computers packed together, possibly linked in a special topology (star, ring, tree, hypercube, etc.)
grid
multiple computers, possibly of different types and characteristics, networked together, with a middleware layer that allows treating them as a single system.

Hardware issues — Multi-processors with shared memory

Idea: several CPUs sharing the same RAM. (Note: by CPU, we mean the circuits that process the instruction flow and execute the arithmetic operations; a PC CPU chip contains additional circuits as well, such as the memory cache, memory controller, etc.)

Memory caches

Problems: high memory latency; memory bottleneck

Solution: use per-processor cache

New problem: ensure cache consistency (consider that one CPU modifies a memory location and, immediately afterwards, another CPU reads the same location).

Solution: cache-to-cache protocol for ensuring consistency (locking, cache invalidation, direct transfer between caches). However, this means that consistency is maintained at the granularity of a whole cache line: if two unrelated variables happen to lie in the same cache line and are written by different CPUs, the line keeps bouncing between the caches as if the variables were truly shared (false sharing), with a significant performance penalty.

Note: see false-sharing.cpp and play with the alignof argument.
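
A minimal sketch in the spirit of that demo (this is not the actual false-sharing.cpp; the 64-byte line size and the structure layout are assumptions): two threads each increment their own counter. With the alignas(64) specifiers each counter sits in its own cache line; removing them puts both counters in the same line and the run becomes noticeably slower, even though the threads never touch each other's data.

  #include <chrono>
  #include <iostream>
  #include <thread>

  struct Counters {
    alignas(64) long a = 0;          // remove alignas(64) to force both
    alignas(64) long b = 0;          // counters into one cache line
  };

  int main()
  {
    Counters c;
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&]{ for (long i = 0; i < 100000000; ++i) ++c.a; });
    std::thread t2([&]{ for (long i = 0; i < 100000000; ++i) ++c.b; });
    t1.join();
    t2.join();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                std::chrono::steady_clock::now() - start).count();
    std::cout << c.a + c.b << " in " << ms << " ms\n";
  }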

Instruction re-ordering

In the beginning, the CPU executed instructions purely sequentially, that is, it started one instruction only after the previous one was fully completed.

However, each instruction consists of several steps (fetch the instruction from memory, decode it, compute the addresses of the operands, get the operands from memory, execute the arithmetic operation, etc.) and sub-steps (a multiplication, for instance, is a complex operation and takes many clock cycles to complete). Thus, executing an instruction takes several clock cycles.

It is possible, however, to overlap the stages of instruction execution, for instance, to fetch the next instruction while the previous one is being decoded. The result is a processing pipeline: at each moment, there are several instructions in various stages of execution. The advantage is that the average execution time per instruction is reduced, but a problem appears if an instruction needs results from a previous instruction before those results are ready. The solution is to insert wait states or to re-order instructions (so that there is enough distance between dependent instructions). Both the waits and the re-orderings can be done either by the compiler or by the CPU itself.

The result, for the programmer, is that instructions can be re-ordered without the programmer knowing about it. The re-ordering is never allowed to change the behavior of a single thread, but it can change the behavior in multi-threaded contexts. Consider the following code:


  bool ready = false;
  int result;

Thread 1:

  result = <some expression>;
  ready = true;

Thread 2:

  while(!ready) {}
  use(result);

Because of re-ordering, the above code may not be correct. The compiler or the CPU can re-order the instructions in Thread 1, because the behavior of Thread 1 is not changed by that. However, this makes Thread 2 believe the result is ready before it actually is.
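
A minimal sketch of one way to make the example correct, assuming C++11 or later (the value 42 stands in for <some expression> and printing stands in for use(result)): declaring ready as std::atomic<bool> removes the data race on the flag and, with the default (sequentially consistent) ordering, forbids the harmful re-ordering, so the write to result is guaranteed to be visible before the flag is seen as true.

  #include <atomic>
  #include <iostream>
  #include <thread>

  std::atomic<bool> ready{false};
  int result;

  void producer()                    // plays the role of Thread 1
  {
    result = 42;                     // stands in for <some expression>
    ready.store(true);               // published only after result is written
  }

  void consumer()                    // plays the role of Thread 2
  {
    while (!ready.load()) {}         // spin until the flag is set
    std::cout << result << "\n";     // stands in for use(result)
  }

  int main()
  {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
  }

An alternative with the same effect on this example is to use ready.store(true, std::memory_order_release) in the producer and ready.load(std::memory_order_acquire) in the consumer.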

Radu-Lucian LUPŞA
2025-10-05