MIMD - memory organization
High performance computing generally requires some understanding
of computer architecture in order to optimize programs
written in high level languages.
The algorithms for solving complicated problems
must also be adapted to the hardware in order to take advantage of
the speed offered by the latest advances in supercomputers.
Computer architecture refers to the components that make up a
computer and the way they are interconnected.
Computer organization (micro-architecture) is concerned with
the implementation of a computer architecture; for instance,
two Dell servers (same architecture - both multiprocessors)
may have different numbers of processors, different memory
data transfer schemes or different speeds. Computer engineering is
concerned with the actual construction of the machine, i.e.,
construction of the chips, cooling and electrical
requirements, ...
Computational scientists are scientists who use computers to
implement their theoretical models. Often they require
enormous amounts of computing to produce results.
These scientists need to be aware
of the advancements in computer architecture and computer
organization in order to optimize their codes.
It is the purpose of this section to survey general
concepts of computer architecture and to introduce
some notions about various types of
computers.
Compilers for high level languages (e.g., C++, C and FORTRAN) can
achieve a very high degree of efficiency in taking advantage
of particular hardware platforms. This is particularly
true of the various scalar architectures that will
be discussed below. However, it is fair to say that
this is far from being the case yet with parallel architectures.
In the latter case a careful study of the algorithms
in light of the hardware characteristics is still very necessary.
Wikipedia has an interesting page on
Computer Architecture.
The basic components of a computer are:
- processor
- the workhorse of the computer, which performs
logical control and arithmetic operations
- memory
- a passive component which accepts information for storage
and retrieves it as needed
- input/output devices
- in charge of long term storage (e.g., disk, solid state storage)
and the human interface
- communication channels
- to link the above three components (e.g., bus, switch, ...)
They can be represented in a PMS diagram (P: processor, M: memory,
S: switch).
Compilers for high level languages (e.g., C)
translate the programmer's instructions
into fetch-decode-execute cycles: fetch the instruction whose
address is held in the program counter, decode it,
and execute it (fetch data from memory into
registers, operate on them and store results in memory).
The speed at which codes execute depends on the efficiency
of the compiler in minimizing the low level instructions
to execute, and on the intrinsic speed of
the processor. The latter is driven by the clock speed
as well as the number of clock cycles necessary to implement
the operations. A Reduced Instruction Set Computer (RISC) is
one in which the set of elementary instructions is kept to a
minimum for better optimization, the higher functions
being assembled by the compiler.
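As a hedged illustration of what the compiler assembles (the
mnemonics and register names in the comments are hypothetical, not
those of any particular chip), a single C statement might become a
handful of elementary instructions:

#include <stdio.h>

int main(void)
{
    double a = 2.0, x = 3.0, b = 1.0;

    /* One high level statement...              */
    double y = a * x + b;
    /* ...might compile to a sequence such as:
       load  f1, a     ; fetch a into a register
       load  f2, x     ; fetch x
       mul   f3, f1, f2
       load  f4, b
       add   f5, f3, f4
       store y, f5     ; write the result back to memory */

    printf("y = %g\n", y);
    return 0;
}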
The units of processor speed are:
- MIPS:
- Million Instructions Per Second
- MFLOPS (GFLOPS, TFLOPS):
- Million Floating Point Operations Per Second (on the order of
1000-3000 MFLOPS, i.e., 1-3 GFLOPS, for a personal computer)
- Clock speed:
- 2 GHz to 4 GHz for Pentium CPU chips.
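A crude way to estimate the MFLOPS rating of one's own machine is to
time a known number of floating point operations; a minimal sketch
(the loop count and the 2 operations per iteration are arbitrary
choices, and a real benchmark such as LINPACK is far more careful):

#include <stdio.h>
#include <time.h>

#define N 100000000L

int main(void)
{
    double s = 0.0, x = 1.000000001;
    long i;

    /* Time N iterations of 2 floating point operations each
       (one multiply, one add), then convert to MFLOPS. */
    clock_t t0 = clock();
    for (i = 0; i < N; i++)
        s = s * x + 1.0;
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("s = %g  ~%.0f MFLOPS\n", s, 2.0 * N / secs / 1.0e6);
    return 0;
}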
Random Access Memory (RAM) is where the information is
stored, both instructions and data. The smallest unit
of information is a bit (0-1), followed by a byte
(8 bits - an ASCII character) and by a word
(a C float or double - either 4 or 8 bytes depending on the
computer architecture). Buses are nowadays mostly
capable of carrying 64-bit wide words (8 bytes).
DDR (Double Data Rate) memory transfers data on both edges of
the clock, doubling the transfer rate.
For instance a DDR-333 memory chip running at 166 MHz x 2
can carry ( 8 bytes x 2 x 166 MHz ) 2656 MB/sec on a
64-bit bus.
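The arithmetic behind that figure, as a small sketch (the module
parameters are the nominal DDR-333 ones used above):

#include <stdio.h>

int main(void)
{
    /* Peak transfer rate of a DDR-333 module: 166 MHz base clock
       x 2 transfers per cycle x 8 bytes per transfer (64-bit bus). */
    double clock_mhz = 166.0;
    double transfers = 2.0 * clock_mhz;   /* DDR: both clock edges */
    double bus_bytes = 8.0;               /* 64 bits */

    printf("peak bandwidth = %.0f MB/s\n", transfers * bus_bytes);
    return 0;
}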
Performance depends
on two further factors: access time and cycle time.
Access time (latency or response time) measures
how fast the memory chips respond to a fetch or a write
from the CPU. The range varies from 80 ns (PC and Macintosh)
down to 10 ns (specialized cache memories - see below).
Memory cycle time measures the minimum period between
successive requests; an 80 ns memory chip cannot in general
respond to a new request every 80 ns.
Many of the memory requests in large computations
come in order; for instance, the instructions are
fetched in the order the compiler produced them, while the elements
of an array will often be defined or used in index
order, i.e., x[0], x[1], x[2], ...
This leads to the idea of a cache memory, a very fast
memory chip which temporarily stores information
that might be needed next. For example,
if x[0] is needed, the processor might request x[0], x[1], ...
to be brought in from main memory to cache, just in case
the other values are also needed. The same occurs
with instruction buffer memory. A typical CPU chip
comes with 512 KB to 1 or 2 MB
of cache. Many chips come equipped with primary and
secondary caches.
The advantage of this arrangement is to allow the processor
to calculate at its potential speed, since fetching
instructions and fetching/storing data is faster
in cache memory (expensive) than in regular memory (cheaper).
But it relies on the designer of the cache to provide
algorithms which fetch instructions and data from main memory
in ways that minimize the misses. It also asks the programmer
to access the data in index order as much as possible for
efficiency! If the elements of an array are used at random,
the probability of misses in the cache increases dramatically!
Much work has gone into designing efficient
cache algorithms and hardware: see the
Wikipedia page.
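As a sketch of this effect (the array size is an arbitrary choice):
C stores two-dimensional arrays row by row, so the two loop orders
below do the same work but use the cache very differently:

#include <stdio.h>

#define N 1024

double a[N][N];

int main(void)
{
    int i, j;

    /* Cache-friendly: the inner loop walks memory in index order,
       so every cache line fetched from main memory is fully used. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] = i + j;

    /* Cache-hostile: the same work, but the inner loop jumps N
       doubles at a time through memory, missing in the cache on
       nearly every access. */
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            a[i][j] = i + j;

    printf("%g\n", a[N-1][N-1]);
    return 0;
}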
In most large multi-user multi-tasking systems, as well as PCs,
the main memory can be supplemented by a swapping
mechanism onto disk. Namely, when the total aggregate
memory required by all the processes running on the machine
exceeds the physical memory, the overflow (either entire
processes or their least recently used parts) is written to
a swap space on disk.
Using virtual memory reduces performance considerably!
The CRAY supercomputers of the past did not implement virtual
memory for the sake of efficiency.
Memory comes in banks, whose sizes are dictated by the computer
designer or the chip technology. For instance, memory SIMMs come
in 32 MB, 64 MB, 128 MB, ... sizes.
The vector CRAYs were designed to use this fact in order to
improve the memory access time by hiding the cycle time.
This technology is widely used nowadays.
Suppose we have 4 banks, each containing 256 bytes. A
block-oriented scheme would assign the
virtual addresses 0...255 to bank 1, 256...511 to bank 2, etc.
An interleaved scheme would assign the
virtual addresses 0,4,8,... to bank 1, then 1,5,9,... to bank 2, etc.
The processor can then receive the contents of memory locations
in different banks on successive clock cycles. Theoretically
the processor can receive one word per processor cycle,
independently of the memory cycle time, provided the information
is fetched with an appropriate stride.
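A sketch of the two address-to-bank mappings for the 4-bank,
256-bytes-per-bank example above (banks numbered from 0 here, and
the sample address is arbitrary):

#include <stdio.h>

int main(void)
{
    unsigned addr = 515;   /* an arbitrary virtual address */

    /* Block-oriented: addresses 0..255 fill bank 0, 256..511
       bank 1, etc., so a stride-1 sweep stays in one bank. */
    unsigned block_bank = addr / 256;        /* bank 2 */

    /* Interleaved: consecutive addresses rotate through the 4
       banks, so a stride-1 sweep touches a different bank on
       each access and the cycle time is hidden. */
    unsigned interleaved_bank = addr % 4;    /* bank 3 */

    printf("block: bank %u, interleaved: bank %u\n",
           block_bank, interleaved_bank);
    return 0;
}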
It is obvious that large computational tasks often require
huge amounts of data to be manipulated. Therefore the speed
of the bus, or the inter-component communication speed, is
of prime importance. In a simple configuration the
communication speed is characterized by a bandwidth,
in units of MB of data which can be transferred per second.
This is in general a complicated
topic which will be the subject of much discussion
in the section below on parallel architecture.
The interest of the computational scientist is to have more
calculations performed per unit time, i.e., to achieve a higher
throughput. Parallelism is often the way to achieve this.
- Job level parallelism:
- most cheaply achieved by having
multiple computers and distributing users among them.
This applies better to large numbers of smaller calculations
than to single large applications.
Any multi-tasking system automatically implements this
kind of parallelism by handling various tasks at once;
the I/O sub-systems also function simultaneously, in
parallel, with the other components of the computer.
- Program level parallelism:
- when a single task is divided
among multiple CPUs. This is the subject of this course.
- Instruction level parallelism:
- pipelined and vector architectures
are the ways to implement this approach. Cray vector computers
were very important in the early days of high performance
computing but have been abandoned nowadays due to cost.
Pipelined CPU chips are described on a
Wikipedia page.
- Arithmetic and bit level parallelism:
- i.e., how does
one handle the 64 bits of the two words in an add operation;
of concern to the chip makers.
We are here concerned primarily with program level parallelism
in our quest to use massively parallel computers.
Instruction level parallelism was important on
vector computers.
The concept of granularity is important to understand at the outset.
A coarse grain system is one in which the operations that
run in parallel are fairly large, on the order of entire
programs. An example could be that of a fluid dynamicist
who sub-divides a large volume into a few large parcels
and lets different processors handle the fluid problem in each
parcel. Fine granularity is when a task is sub-divided into
very many very small tasks. For example, a sum of integers
sum = 0;
for ( i=1; i<=N; i++ )
    sum = sum + i;
could be divided into sub-sums on each processor, resulting in very
few operations to be performed on each individual processor,
and a subsequent global sum of the individual answers.
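A minimal sketch of this decomposition, assuming an OpenMP-capable C
compiler (an assumption; no parallel library has been prescribed
above - compile with, e.g., -fopenmp). Each thread stands in for one
processor and accumulates a private sub-sum; the reduction performs
the global sum at the end:

#include <stdio.h>

#define N 1000000L

int main(void)
{
    long i, sum = 0;

    /* Each thread sums its own share of the indices into a private
       copy of sum; the reduction adds the sub-sums together. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= N; i++)
        sum = sum + i;

    printf("sum = %ld\n", sum);   /* N(N+1)/2 = 500000500000 */
    return 0;
}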
The granularity in computer parallelism is reflected in, or dictated
by, the hardware. Historically, the CM-1 from Thinking Machines
Corporation contained up to 65,536 very simple one-bit processors. The
philosophy there was that these processors were cheap to produce
and could eventually add up to large MFLOPS ratings.
On the other hand, Beowulf clusters, made out of off-the-shelf
computers, have more powerful nodes. The design philosophy here
is that fewer nodes are easier to interconnect and that memory
organization is easier.
It turns out that from the computational scientist's point of
view, the coarse grain model is often easier to adapt to
particular problems. This is borne out by the market, where
most massively parallel machines have now become
obsolete, often driving their parent companies into
bankruptcy.
- SISD
- (single instruction stream, single data stream)
Conventional computer architecture with 1 processor; instructions
are executed sequentially. Your PC or Macintosh is of this type.
- SIMD
- (single instruction stream, multiple data streams)
A single instruction processing unit, called a controller, and
several data processing units (PEs). The Thinking Machines CM-1 and
MasPar MP-1 were examples of this architecture.
The way these systems worked was that the instruction processing
unit fetched an instruction, then broadcast it to all PEs, which
then performed exactly the same operation. The only freedom
was that a PE could be de-activated in a given cycle.
It was hoped that this architecture would reduce the
time devoted to handling program logic. However,
the difficulty was that typical problems could not be adapted easily
to this approach.
However, these ideas are coming back in full force in the context
of Graphics Processing Units (GPUs). These follow from
the realization that graphics boards contain many processors
that can be tapped for computation. Unfortunately these come
in an architecture suitable for graphics, and not for
general computing.
- MISD
- (multiple instruction streams, single data stream)
This category, in which various computing elements
work on a single data stream, has never led to a successful
computer.
- MIMD
- (multiple instruction streams, multiple data streams)
This is the widest and most successful category of parallel computers.
The Cray C90, Cray T3D, Cray T3E, Intel Paragon, IBM SP2, ... all fell
under this category. Much smaller systems, like the Sun SPARCservers,
also implemented this architecture.
Nowadays, Dell, IBM and HP servers are of this type.
In fact, a group of independent workstations on
a network can be made to work as a MIMD computer. This
is the type of computer we will concern ourselves with
in the following.
Processors in coarse grain parallel computers must be fast
in order to achieve high global performance. There are several
major approaches to achieving these fast CPU speeds (besides
the use of RISC): the pipelined, vector, super-scalar
and hyper-threaded approaches.
For pipelining, the analogy is the assembly line of a factory;
the results of one operation are immediately passed to the next.
The goal is to increase the number of instructions completed
per second.
Consider adding 2 floating point numbers (of the form m x 2^e). The
operation can be divided into stages:
1- If e1 < e2, swap the operands. Find the difference ed = e1 - e2.
2- Shift m2 to the right by ed bits.
3- Compute the mantissa of the sum, m1 + m2. The exponent of the sum is e1.
4- Normalize the sum.
A pipelined processor performs the above stages simultaneously on
successive pairs of operands, as when adding two vectors element
by element. This achieves very high efficiency in manipulating vectors.
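A toy model of the four stages (illustrative only: the struct
representation and the 24-bit normalization bound are arbitrary
choices here, and real hardware works with binary fractions, guard
bits and rounding):

#include <stdio.h>

/* A number of the form m * 2^e */
typedef struct { long m; int e; } fp;

fp fp_add(fp a, fp b)
{
    fp s;
    /* Stage 1: if e1 < e2 swap operands; find ed = e1 - e2 */
    if (a.e < b.e) { fp t = a; a = b; b = t; }
    int ed = a.e - b.e;
    /* Stage 2: shift m2 right by ed bits to align the exponents */
    b.m >>= ed;
    /* Stage 3: add the mantissas; the exponent of the sum is e1 */
    s.m = a.m + b.m;
    s.e = a.e;
    /* Stage 4: normalize (here: keep the mantissa within 24 bits) */
    while (s.m >= (1L << 24)) { s.m >>= 1; s.e++; }
    return s;
}

int main(void)
{
    fp a = { 3, 5 }, b = { 8, 2 };   /* 3*2^5 + 8*2^2 = 96 + 32 */
    fp s = fp_add(a, b);
    printf("%ld * 2^%d = %ld\n", s.m, s.e, s.m << s.e);  /* 4 * 2^5 = 128 */
    return 0;
}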
The peak CPU speed of different chips can be found from standard
benchmarks and changes with each new release. The user must however
write programs which avoid loop interdependencies in order to approach
these speeds, and the results are highly dependent on the compiler used.
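A sketch of the difference (variable names arbitrary): the first loop
below has independent iterations that a pipeline can overlap, while
the second has a loop-carried dependency that forces each addition to
wait for the previous one:

#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N], b[N], c[N];
    int i;

    for (i = 0; i < N; i++) { a[i] = i; b[i] = 2*i; }

    /* Independent iterations: c[i] uses only a[i] and b[i], so the
       pipeline can start a new addition every cycle. */
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* Loop-carried dependency: c[i] needs c[i-1], so each addition
       must wait for the previous result to leave the pipeline. */
    for (i = 1; i < N; i++)
        c[i] = c[i-1] + a[i];

    printf("%g\n", c[N-1]);
    return 0;
}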
Vector processors operate on entire vectors with one instruction.
Consider the following addition loop:
for ( i=0; i < N; i++ )
    c[i] = a[i] + b[i];
Fast computing can be achieved if the processor
is capable of handling NV additions (or other arithmetic operations)
at once, hence the term vector processor. If NV >= N, then the sum
above can be calculated in "one" clock cycle instead of "N" cycles
(i.e., fetch a[], fetch b[], do the sum and store the results in c[]).
If NV is smaller than N, only a few (many fewer than N)
clock cycles are needed. The
advantages are that fewer instructions are performed,
producing lower overhead, and that the various elements of the
arrays are worked on in parallel (simultaneously).
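When NV < N, the arrays are processed in chunks of NV elements, a
technique known as strip mining; a sketch (the value of NV is a
hypothetical hardware vector length, and on a real vector machine
the inner loop would be a single vector instruction):

#include <stdio.h>

#define N  1000
#define NV 64     /* hypothetical vector length of the hardware */

int main(void)
{
    double a[N], b[N], c[N];
    int i, j;

    for (i = 0; i < N; i++) { a[i] = i; b[i] = 1.0; }

    /* Strip mining: each pass of the outer loop handles one chunk
       of at most NV elements; the inner loop stands in for one
       vector instruction operating on the whole chunk at once. */
    for (i = 0; i < N; i += NV) {
        int len = (N - i < NV) ? N - i : NV;
        for (j = 0; j < len; j++)
            c[i+j] = a[i+j] + b[i+j];
    }

    printf("%g\n", c[N-1]);   /* 999 + 1 = 1000 */
    return 0;
}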
The Cray computers achieved very high theoretical speeds by
using their own vector processors.
Unfortunately, vector processors have been on their way out; the cost
of production became prohibitive by today's standards. They were
certainly the easiest to program, since very efficient compilers were
written to take advantage of the architecture.
The super-scalar approach uses instruction level
parallelism to achieve speed. These
processors have
- an instruction fetch unit capable of obtaining
multiple instructions at once
- instruction decoding units capable of detecting the
independence of instructions
- multiple execution units to execute instructions in parallel
when possible
- execution units which might also be pipelined for increased
efficiency.
Seymour Cray's CDC 6600 from 1965 is often mentioned as the first
superscalar design.
DEC (Alpha), HP (PA-RISC), IBM (RS/6000), SGI (Power Challenge)
and Sun (SuperSPARC) all offered superscalar processors.
Except for CPUs used in some battery-powered devices, essentially
all general-purpose CPUs developed since about 1998 are superscalar.
Wikipedia describes this architecture in a very informative
page.
Wikipedia describes the hyper-threading architecture in a very
informative page, which describes
HTT as Intel's trademark for their implementation of simultaneous
multithreading technology on the Pentium 4 micro-architecture.
It is a more advanced form of Super-threading that debuted on the
Intel Xeon processors and was later added to Pentium 4 processors.
The technology improves processor performance under certain workloads
by providing useful work for execution units that would otherwise be idle,
for example during a cache miss. A Pentium 4 with Hyper-Threading
enabled is treated by the operating system as two processors instead of one.
Hyper-Threading works by duplicating certain sections of the
processor, those that store the architectural state, but
not duplicating the main execution resources. This allows a Hyper-Threading
equipped processor to pretend to be two "logical" processors to the host
operating system, allowing the operating system to schedule two threads or
processes simultaneously. Where execution resources in a non-Hyper-Threading
capable processor are not used by the current task, and especially when
the processor is stalled, a Hyper-Threading equipped processor may use those
execution resources to execute another scheduled task.
Except for its performance implications, this innovation is transparent to operating
systems and programs. All that is required to take advantage of Hyper-Threading
is symmetric multiprocessing (SMP) support in the operating system,
as the logical processors appear as standard separate processors.
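As a sketch, on Linux and most Unix systems (an assumption about the
platform) one can ask the operating system how many logical
processors it sees; on a Hyper-Threading machine this is twice the
number of physical cores:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* The OS schedules onto logical processors; with Hyper-Threading
       enabled each physical core appears as two of them. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online logical processors: %ld\n", n);
    return 0;
}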
There are two major ways of organizing the memory in
MIMD multiprocessor architectures:
the shared memory and distributed memory approaches.
In the shared memory architecture the various processors all have
access to the entire main memory. These are the simplest systems
to use, since any piece of data is always accessible to all PEs.
Pentium servers now exist in quad-processor format. In its
most elementary mode of operation, the operating system distributes
the various processes among the processors in order to keep them
equally busy. On the other hand, a single task may be distributed
over the processors to achieve true parallelism.
The major drawback of this configuration is the
memory contention problem. This occurs
when two or more processors require the same
data, or worse, when two processors need to modify
the same data. This problem prohibits the implementation of such
systems with large numbers of processors.
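A sketch of the modification problem, using POSIX threads as a
stand-in for two processors sharing memory (the counter and the
iteration count are arbitrary; compile with -pthread). Without the
mutex, two simultaneous read-modify-write cycles on the shared
variable can lose updates:

#include <stdio.h>
#include <pthread.h>

long counter = 0;     /* shared data, as in a shared memory machine */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *work(void *arg)
{
    int i;
    for (i = 0; i < 1000000; i++) {
        /* Serialize access: only one "processor" may modify the
           shared data at a time. */
        pthread_mutex_lock(&lock);
        counter = counter + 1;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, work, NULL);
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("counter = %ld\n", counter);  /* always 2000000 with the lock */
    return 0;
}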
Think of a group of independent computers, each with
its own memory, all coupled together with very fast
communication channels and you have a fairly good
representation of a distributed memory MIMD computer.
This configuration eliminates all memory contention problems
since each processor has its own local memory. The
disadvantage is that each processor might have to fetch
the data it needs from the other processors via the
communication channels (slow compared to the bus speed).
This has a great potential
for critical bottlenecks.
The topology of the communication layout between the nodes
is of extreme importance in establishing fast connections
between the nodes.
Beowulf systems tend to have the nodes share a common
private (hidden) Ethernet.
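A minimal sketch of such a data exchange, assuming the MPI library
(one common message passing choice for these machines, not something
prescribed above; run with, e.g., mpirun -np 2). Rank 1 cannot read
rank 0's memory directly, so the value must travel over the
communication channel:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    double x = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Node 0 owns the data and ships it out explicitly. */
        x = 3.14;
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Node 1 receives it into its own local memory. */
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received x = %g over the network\n", x);
    }

    MPI_Finalize();
    return 0;
}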