PARALLEL COMPUTER ARCHITECTURE
What is a parallel architecture?
A parallel computer (a computer with a parallel architecture) is a collection of processing elements (PEs) that cooperate to solve one large problem.

Why study parallel architecture?
- Application demands
- Technology trends
- Architecture trends
- Economics

Current trends:
- Most of today's microprocessors include support for multiprocessing.
- Servers and workstations with multiprocessor architectures: Sun, SGI, DEC, COMPAQ, ...
- Future (and present) microprocessors are multiprocessors.
Factors that influence the development of computer architecture: technology, programming languages, operating systems, history, and applications.
Parallel Processing
- On single-processor architectures
- On multiprocessor architectures
Classification of Computer Architectures (Flynn, 1972)
Single Instruction-stream, Single Data-stream
[Figure: processor, ALU]
Single Instruction-stream, Multiple Data-stream
[Figure: DS = data stream]
- Data are loaded by the host.
- Instructions (the program) are loaded once and reach every processing element through the same instruction stream.
Multiple Instruction-stream, Multiple Data-stream
Multiple Instruction-stream, Single Data-stream
Flynn's classes by number of streams (1 or many):

Instruction streams   Data streams   Class
1                     1              SISD
1                     many           SIMD
many                  1              MISD
many                  many           MIMD
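As a small illustration of the classification, the sketch below maps a count of instruction streams and a count of data streams to one of the four class names. The function name `flynn_class` is an assumption made for this example only.

```python
def flynn_class(instruction_streams, data_streams):
    """Return Flynn's class name for the given stream counts (hypothetical helper)."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    i = "S" if instruction_streams == 1 else "M"  # single or multiple instruction streams
    d = "S" if data_streams == 1 else "M"         # single or multiple data streams
    return f"{i}I{d}D"


if __name__ == "__main__":
    for i_count, d_count in [(1, 1), (1, 64), (4, 1), (4, 64)]:
        print(i_count, d_count, flynn_class(i_count, d_count))
```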
Computer Categories

Single processor:
- SISD: von Neumann computer (scalar computer)
- SIMD: array computers, vector computers

Parallel processors:
- MIMD:
  - Parallel computers
    (1) Shared-memory multiprocessors
    (2) Message-passing multiprocessors*
  - Distributed computers

Special-purpose computers
SCALAR (1 operation): add r3, r1, r2 computes r3 = r1 + r2 on single registers.
VECTOR (N operations): vadd.vv v3, v1, v2 computes v3 = v1 + v2 element by element over the whole vector length.
Applications: image processing (signal processing), multimedia.
Basic vector instructions

Instr.    Operands   Operation                 Comment
VADD.VV   V1,V2,V3   V1 = V2 + V3              vector + vector
VADD.SV   V1,R0,V2   V1 = R0 + V2              scalar + vector
VMUL.VV   V1,V2,V3   V1 = V2 x V3              vector x vector
VMUL.SV   V1,R0,V2   V1 = R0 x V2              scalar x vector
VLD       V1,R1      V1 = M[R1..R1+63]         load, stride=1
VLDS      V1,R1,R2   V1 = M[R1..R1+63*R2]      load, stride=R2
VLDX      V1,R1,V2   V1 = M[R1+V2i, i=0..63]   indexed ("gather")
VST       V1,R1      M[R1..R1+63] = V1         store, stride=1
VSTS      V1,R1,R2   M[R1..R1+63*R2] = V1      store, stride=R2
VSTX      V1,R1,V2   M[R1+V2i, i=0..63] = V1   indexed ("scatter")

The load and store instructions (VLD through VSTX) provide the operand addressing modes.
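The three addressing modes (unit stride, strided, indexed) can be sketched in a few lines. This is only a model under stated assumptions: memory is a Python list, addresses are element indices rather than byte addresses, and the function names simply mirror the mnemonics.

```python
VLEN = 64  # vector register length used by the instructions above


def vld(mem, base):
    """Unit-stride load: 64 consecutive elements starting at base."""
    return [mem[base + i] for i in range(VLEN)]


def vlds(mem, base, stride):
    """Strided load: every stride-th element starting at base."""
    return [mem[base + i * stride] for i in range(VLEN)]


def vldx(mem, base, index):
    """Indexed load (gather): element i comes from base + index[i]."""
    return [mem[base + index[i]] for i in range(VLEN)]


def vsts(mem, base, stride, v):
    """Strided store: element i goes to base + i * stride."""
    for i in range(VLEN):
        mem[base + i * stride] = v[i]


def vstx(mem, base, index, v):
    """Indexed store (scatter): element i goes to base + index[i]."""
    for i in range(VLEN):
        mem[base + index[i]] = v[i]


if __name__ == "__main__":
    mem = list(range(256))
    print(vlds(mem, 0, 2)[:4])                  # first elements of a stride-2 load
    reverse = list(range(VLEN - 1, -1, -1))
    print(vldx(mem, 10, reverse)[:4])           # gather in reverse order
```

A gather followed by a scatter with the same index vector puts every element back where it came from, which is a quick way to check that the two modes are inverses of each other.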
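Example program for vector computation: Y[0:63] = Y[0:63] + a*X[0:63]

64-element SAXPY: scalar (on a non-vector computer)

      LD    R0,a           # load scalar a
      ADDI  R5,Rx,#512     # R5 = address just past X[63] (64 x 8 bytes)
loop: LD    R2,0(Rx)       # load X[i]
      MULTD R2,R0,R2       # a * X[i]
      LD    R4,0(Ry)       # load Y[i]
      ADDD  R4,R2,R4       # a * X[i] + Y[i]
      SD    R4,0(Ry)       # store Y[i]
      ADDI  Rx,Rx,#8       # advance X pointer
      ADDI  Ry,Ry,#8       # advance Y pointer
      SUB   R20,R5,Rx      # distance left to the end of X
      BNZ   R20,loop       # repeat until all 64 elements are done

64-element SAXPY: vector (on a vector computer)

      LD      R0,a         # load scalar a
      VLD     V1,Rx        # load vector X
      VMUL.SV V2,R0,V1     # vector mult
      VLD     V3,Ry        # load vector Y
      VADD.VV V4,V2,V3     # vector add
      VST     V4,Ry        # store vector Y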
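The contrast between the two listings can be sketched in Python: one function walks the elements one at a time, the other applies whole-vector steps that mirror VMUL.SV and VADD.VV. The function names and the instruction-count helpers are assumptions for this illustration; the counts are obtained by counting the lines of the two listings (2 setup instructions plus 9 per loop iteration, versus 6 in total).

```python
N = 64  # elements in X and Y


def saxpy_scalar(a, x, y):
    """One multiply-add per loop iteration, as in the scalar listing."""
    out = list(y)
    for i in range(len(x)):
        out[i] = out[i] + a * x[i]
    return out


def saxpy_vector(a, x, y):
    """Whole-vector steps, as in the vector listing."""
    v2 = [a * xi for xi in x]               # VMUL.SV V2,R0,V1
    v4 = [p + q for p, q in zip(v2, y)]     # VADD.VV V4,V2,V3
    return v4


def scalar_instruction_count(n=N):
    """2 setup instructions plus a 9-instruction loop body per element."""
    return 2 + 9 * n


def vector_instruction_count():
    """The vector listing is 6 instructions regardless of the 64 elements."""
    return 6


if __name__ == "__main__":
    x = list(range(N))
    y = [1] * N
    assert saxpy_scalar(3, x, y) == saxpy_vector(3, x, y)
    print(scalar_instruction_count(), vector_instruction_count())
```

The vector version fetches and decodes far fewer instructions, though each vector instruction still performs 64 element operations.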
Parallel vs. Distributed
Parallel: several processors work together simultaneously to solve one problem, sharing memory and a clock.
Distributed: the processors do not share memory or a system clock.
Some message-passing multiprocessors fall into the distributed-computer category.
[Figure: SIMD and MIMD]
Increasing Processing Capacity
[Figure: processor, memory (Mem), I/O controllers (I/O ctrl) and I/O devices joined by an interconnect]
- Memory capacity is increased by adding memory modules.
- I/O capacity is increased by adding controllers and I/O devices.
- Processing capacity is increased by adding processors!
Evolution of Computers
- Sequential computer
- With modular memory (memory interleaving)
- With cache memory
- With a pipeline structure inside the processor (stages S1, S2, ..., Sj)
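To make memory interleaving concrete, here is a minimal sketch that assumes low-order interleaving, one common scheme in which consecutive addresses fall in consecutive modules. The slides do not say which scheme is meant, so both the scheme and the function name `interleave` are assumptions.

```python
def interleave(address, modules):
    """Map a word address to (module number, offset inside that module).

    Assumes low-order interleaving: the module is chosen by address mod modules.
    """
    return address % modules, address // modules


if __name__ == "__main__":
    # Eight consecutive addresses spread over four memory modules.
    for address in range(8):
        module, offset = interleave(address, 4)
        print(f"address {address} -> module {module}, offset {offset}")
```

Because neighbouring addresses land in different modules, a sequential access pattern can keep several modules busy at once.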
Multiple functional units (FU) vs. multiprocessor
[Figure: FU, RegFile, ICache and Memory blocks running the DCT and HUF tasks]
Superscalar: PowerPC 604 and Pentium Pro Multiple FU
TMS320C6701 DSP Block Diagram
- Program Cache/Program Memory: 32-bit address, 256-bit data, 512K bits RAM
- 'C67x Floating-Point CPU Core:
  - Program Fetch, Instruction Dispatch, Instruction Decode
  - Data Path 1: A Register File; units L1, S1, M1, D1
  - Data Path 2: B Register File; units L2, S2, M2, D2
  - Control Registers, Control Logic, Test, Emulation, Interrupts
- Data Memory: 32-bit address; 8-, 16-, 32-bit data
- External Memory Interface
- 4-channel DMA
- Host Port Interface
- 2 timers
- 2 multi-channel buffered serial ports (T1/E1)
- Power Down

Here we have a block diagram of the C67, or the floating-point DSP...
TMS320C67x CPU Core: multiple FUs
[Figure: 'C67x floating-point CPU core (Program Fetch, Instruction Dispatch, Instruction Decode; Data Path 1 with A Register File and units L1, S1, M1, D1; Data Path 2 with B Register File and units L2, S2, M2, D2; Control Registers, Control Logic, Test, Emulation, Interrupts). The Arithmetic Logic Unit, Auxiliary Logic Unit and Multiplier Unit are marked as having floating-point capabilities.]

Now to go into more detail about the changes between the 62x CPU and the 67x CPU. In this slide I have enlarged the core to show where we added floating-point capability. We added it to six of the eight functional units: the ALU (Arithmetic Logic Unit), the auxiliary logic unit, and the multiplier in each of the two data paths all support floating point. The D-unit, the address calculation unit, does not care what kind of data it is looking at, so it did not need floating-point capability.
Intel IXP1200 Network Processor
[Figure: SDRAM Ctrl, MicroEng, PCI Interface, SRAM, SA Core, Mini DCache, ICache, Scratch Pad, IX Bus, Hash Engine]
- 126mm, 6.5 million transistors, 432 pins, BGA package
- StrongARM SA-1100 in 0.35 micron process; IXP1200 in 0.28 micron with three metal layers
- 6 RISC engines, 4 contexts each -> 24 threads
- The IX Bus Interface is an interface to the SRAM, SDRAM, PCI interface, and other companion IXP1200s; the architecture is designed to be scalable
- We did not touch the StrongARM here
- We removed the use of the hash engine since we are dealing with IP packets
IXP1200 MicroEngine
[Figure: ALU; two 64-register arrays (A-Bank and B-Bank); 32 SRAM and 32 SDRAM read XFER registers (from SRAM, from SDRAM); 32 SRAM and 32 SDRAM write XFER registers (to SRAM, to SDRAM)]
- All instructions are executed in a single cycle
- Multithreading support for 4 threads
  - usually a zero-overhead context swap: fill with a deferred instruction (like a branch delay slot)
  - 1 cycle if a thread polls for another thread but does not find one
  - a thread explicitly says that it is going to sleep until a specified signal event occurs, or signals a swap when using other IXP1200 resources (e.g. SRAM, SDRAM, PCI, HASH)
- 128 32-bit GPRs in two banks of 64
  - 32 registers per thread are exclusive (relative addressing)
  - absolute addressing allows sharing between the threads
- 128 32-bit transfer registers
  - 8 SRAM and 8 SDRAM read, 8 SRAM and 8 SDRAM write (relative addressing), or all are visible using absolute addressing
- Command bus arbiter and FIFO manage accesses on the IXP1200 bus
- 32-bit RISC instruction set
Intel Pentium Pro Quad
[Figure: P-Pro module (CPU, interrupt controller, 256-KB L2 $, bus interface); P-Pro bus (64-bit data, 36-bit address, 66 MHz); MIU; memory (1-, 2-, or 4-way interleaved DRAM); PCI bridge; PCI bus; I/O cards]
Intel Paragon
[Figure: Intel Paragon node: two i860 processors, each with an L1 $, on a memory bus (64-bit, 50 MHz) with Mem ctrl, DMA, Driver, NI and 4-way interleaved DRAM]
Sandia's Intel Paragon XP/S-based supercomputer: 2D grid network with a processing node attached to every switch; links are 8 bits, 175 MHz, bidirectional.
IBM SP-2
[Figure: IBM SP-2 node: Power 2 CPU with cache on a memory bus; memory controller with 4-way interleaved DRAM; MicroChannel bus with I/O and a NIC (DMA, i860, NI, DRAM); general interconnection network formed from 8-port switches]
Stanford: Hydra Design
- Single-chip multiprocessor
- Four processors
- Separate primary caches
- Write-through data caches to maintain coherence
- Shared 2nd-level cache
- Separate read and write busses
SUN Enterprise
[Figure: Gigaplane bus (256 data, 41 address, 83 MHz); CPU/mem cards (P, $, $2, Mem ctrl, bus interface/switch); I/O cards (bus interface; SBUS; 2 FiberChannel; 100bT, SCSI)]
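The bus parameters give a rough upper bound on bandwidth. The sketch below multiplies data width by clock rate for the Gigaplane figures quoted above; it assumes one full-width transfer on every bus cycle, which a real bus does not sustain, and the function name is hypothetical.

```python
def peak_bandwidth_bytes_per_s(data_width_bits, clock_mhz):
    """Idealized peak: one full-width transfer per bus cycle (an assumption)."""
    bytes_per_transfer = data_width_bits // 8
    return bytes_per_transfer * clock_mhz * 1_000_000


if __name__ == "__main__":
    gigaplane = peak_bandwidth_bytes_per_s(256, 83)
    print(f"Gigaplane idealized peak: {gigaplane / 1e9:.2f} GB/s")
```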
MULTIPROCESSORS
Shared memory: several processors share one memory space.
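A minimal sketch of the shared-memory style, assuming Python threads stand in for processors: every worker computes a private partial sum and then updates one shared location under a lock. The function name `parallel_sum` is an assumption, and CPython threads do not give real parallel speedup for this kind of work; the point is only the programming model.

```python
import threading


def parallel_sum(values, workers=4):
    """Sum a list with several threads that all update one shared total."""
    total = [0]                      # shared location, visible to every thread
    lock = threading.Lock()

    def work(chunk):
        partial = sum(chunk)         # private computation, no sharing needed
        with lock:                   # serialize updates to the shared location
            total[0] += partial

    chunks = [values[k::workers] for k in range(workers)]
    threads = [threading.Thread(target=work, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


if __name__ == "__main__":
    print(parallel_sum(list(range(1, 101))))
```

The lock matters: without it, two workers could read the same old total and one update would be lost.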
The Problem of Cache Coherency (Shared Memory Architectures)
[Figure: three snapshots of a CPU, its cache, memory and I/O; A' and B' are the cached copies of memory locations A and B]
a) Cache and memory coherent: A' = A and B' = B (cache holds 100 and 200; memory holds 100 and 200).
b) Cache and memory incoherent, A' != A: the processor writes 550 to A' in the cache while memory still holds A = 100, so an I/O output of A gives the stale value 100.
c) Cache and memory incoherent, B' != B: an I/O input writes 440 to B in memory while the cache still holds B' = 200, so the cached copy is stale.
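The three cases can be replayed with a toy model. The sketch assumes a write-back cache that never updates memory on a write and is never told about I/O writes; the class and function names are hypothetical, and the values are the ones used in the cases above.

```python
class WriteBackCache:
    """Toy cache: keeps private copies and does not write through to memory."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def read(self, name):
        if name not in self.lines:           # miss: fetch the copy from memory
            self.lines[name] = self.memory[name]
        return self.lines[name]

    def write(self, name, value):
        self.lines[name] = value             # memory is NOT updated


def run_scenario():
    memory = {"A": 100, "B": 200}
    cache = WriteBackCache(memory)
    cache.read("A")
    cache.read("B")                          # case a: cache and memory agree

    cache.write("A", 550)                    # case b: processor writes A'
    io_output_of_a = memory["A"]             # I/O reads memory: stale value

    memory["B"] = 440                        # case c: I/O input writes B
    cpu_read_of_b = cache.read("B")          # CPU hits in cache: stale value
    return io_output_of_a, cpu_read_of_b


if __name__ == "__main__":
    print(run_scenario())
```

Both returned values are the old ones, which is exactly the incoherence that write-through or invalidation schemes are meant to prevent.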
Message-Passing Multiprocessors
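A minimal sketch of the message-passing style, assuming threads with queues stand in for nodes that have private memory and communicate only by explicit send and receive. No variable is shared between the nodes; all data travel in messages. The names `node` and `message_passing_sum` are assumptions for this example.

```python
import queue
import threading


def node(inbox, outbox):
    """A node receives its work as a message and sends back its result."""
    chunk = inbox.get()              # receive
    outbox.put(sum(chunk))           # send


def message_passing_sum(values, nodes=4):
    """Sum a list by mailing one chunk to each node and collecting replies."""
    outbox = queue.Queue()
    inboxes = [queue.Queue() for _ in range(nodes)]
    threads = [threading.Thread(target=node, args=(ib, outbox)) for ib in inboxes]
    for t in threads:
        t.start()
    for k, ib in enumerate(inboxes):
        ib.put(values[k::nodes])     # send each node its share of the data
    total = sum(outbox.get() for _ in range(nodes))
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(message_passing_sum(list(range(1, 101))))
```

No lock is needed here, because each node only touches data it has received in a message.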
[Figure: virtual address spaces for a collection of processes (P1, P2, ..., Pn) communicating via shared addresses. Each process has a private portion and a shared portion of its address space; loads and stores to the shared portion map to common physical addresses in the machine physical address space.]
Steps in Creating a Parallel Program