Revised draft 20000926 Still needs a fair amount of work, but now the major pieces of of this first section are here and over a beer the project's organizaton can be considered. cheers - john -------------------------------------------------------- 1) Computer Architecture and Operating Systems Computer Architecture and Operating Systems have evolved together. Modern multiprocessing operating systems require hardware support to function effectively and architecture intended to support multiprocessing must be designed with some Operating System Model in mind! Because of this, there are some common conceptual models of how the combination of hardware and multiuser operating systems work together to provide the programming environment. What will be presented here is a brief introduction to Computer Architecture and Operating Systems, with general conceptual models that will give a programmer important insights into programming in a modern multiprocessing environment. Most of what is presented will be a sound basis to help in understanding any modern hardware and multiuser operating system (UNIX, Linux, Windows NT , Windows 2000, Windows XP, MAC OS X, OS-2, VMS ... ) and a good stepping off point for someone who wants to study the details of Computer Architecture or Operating System internals. When details of Operating Systems are mentioned, such as the specific names for Process Attributes, UNIX details will be used, other operating syetems have corresponding features. For additional details on real computer architecures and real operating system internals, consult the references. They will fill in the many details omitted here. ------------------------------------------------------------ 1.1) Introduction Computer Architecture is a deep subject with many details. It has both theortical and applied components. Our purpose here is to provide a valid programmer's model of hardware computing systems that will foster an understanding of how a modern multiprocess operating system works and how to design efficient programs, applications, and systems in such an environment. To meet our goals, we will not present a specific real computer architecure here. It is very helpful if the reader either already knows a specific architecture or studies one after covering our introduction. We introduce an abstract computer called BAD (Basic Abstract Design). BAD is a 32 bit Load-Store Architecture with: von Neumann architecture , limited, but very easy to decode instrucion set, two's complement arithmetic, no floating point operations, 16 bit address space, 16 general purpose registers and separate PC, SP and PSW registers typical PSW with NZVC and a priority bit (PRI(, 1 level of memory cache vectored interupts/traps MMU with an expicit table of size 512 bytes, Memory mapped IO for a single serial device (console) and a single block device (disk). This is a very wasteful machine design, but it is very easy to describe, learn and write a simulator for. It provides a basis for looking at the trade offs needed in designing a real architecture. The fact that BAD is bad does not effect its usefulness in understanding how real architectures work and the performance of programs on real architectures. After introducing BAD, we will use it to describe the basic features of current real computer architectures, including: multilevel caches, IEEE floating point standard, separate bus designs, harvard architectures, 8 bit microcontrollers, 64 bit architectures, RISC Architectures and pipelining, VLIW Architectures multiple function units, Multiprocesser systems. Again our purpose here will be to provide the programmer with a basic conceptual model, that will help understand performance issues for a program written in a high level programming language on real systems. In an appendix, a simulator for BAD and a two pass Assembler for BAD are presented. There are also remarks on interpreters and byte code interpreters, which are very easy to understand after looking at a BAD simulator and assembler. ------------------------------------------------------------ 1.2) The BAD Architecture There are several classic computer architectures: Stack (Intel 8087-Pentium Floating Point Units) Accumulator (some microcontrollers) Register (most current computers) Load/Store (One of the parts of RISC design) CISC: Complicated Instruction Set Computers (Intel 8080, Intel 286, Intel 386, Intel 486, IA32 - Pentium) RISC: Reduced Instruction Set Computers (SUN SPARC, HP-PA, IBM RS6000, MIPS, DEC Alpha, PowerPC) VLIW: Very Long Instruction Word; Systems with multiple independent function units (IA64). Mutiprocessor Systems: Systems with more than one CPU. BAD will be a von Neumann Load/Store Architecture. A von Neumann architecture is one that has a single memory space, which is used for both data and machine instructions. A Load/Store architecture is one that has its machine instructions broken into three groups: LOAD: these transfer data from memory into registers. They do not perform any Arithmetic/Logic operations. STORE: these transfer data from memory into registers. They do not perform any Arithmetic/Logic operations. CPU: these perform Arithmetic/Logic operations or other things such as branches that do not use any external memory. RISC machines are all LOAD/STORE architectures, with several other common features (which are not in BAD), that are designed to be highly pipelined and be very easy to implement in VLSI. BAD has a typical Central Processing Unit (CPU). There is nothing "bad" about the BAD CPU itself. It is only the choice of actual machine instructions that is "bad". While all new multiprocessing design is moving to 64 bit architectures, that does not change what is being done. It mainly removes problems with a 32 bit address space and improves performance by supporting wider buses. ------------------------------------------------------------ 1.3) The BAD CPU Registers (each 32 bits wide): PC Program Counter, contains the memory address of the next machine instruction to execute. SP Stack Pointer, contains the memory address of the top of the running processes' stack. PSW Program Status Word, contains: NZVC bits, Negative, Zero, oVerflow and Carry. These contain information on the result of the ALU or Load/Store operation as two's complement arithmetic. PRI, a bit indicating the privilege level the process is running at. PRI = 1, privileged/supervisor/kernel PRI = 0, user CE, a bit to enable/disable caching. MMUE, a bit to enable/disable the MMU. IMASK, 2 bits to mask traps and interupts. TICK Counts Clock Ticks. % 16 General Purpose Registers, %0, %1, ... %F. Internal CPU Registers: IR Instruction Register, contains the next intruction to execute (after it is fetched from memory). This is the input for the Decoder. MA Memory Address Register, Address of the next memory operation (lead/store) MD Memory Data Register, Data for the memory system MC Memory Control Register, determines if the next memory operation is a load or store and handshakes the CPU and memory systems. OP_IN1 OP_IN2 Serve as inputs for the ALU Connects, getting values for the Registers as determined by the CU. OP_OUT1 OP_OUT2: Serve as outputs from the ALU. The values will be moved to registers as directed by the CU. Hardware Blocks in the CPU CU: Control Unit, implements the von Neumann cycle, controlling the internal register busses, external memory operations and the choice of ALU operations. Decoder: Instruction Decoder, takes apart the contents of the IR and provides it as useful information to the CU. ALU: Arithmetic Logic Unit, takes input(s) from OP_IN1 (and OP_IN2) and performs the operation directed by the CU, putting results in OP_OU1 (and sometimes OP_OUT2). Register Bus: An internal CPU bus for transfering values between of registers under the direction of the CU. External Memory Bus: The connections to the outside world for both memory and devices. Most architectures connect external devices by mapping their data transfer and control functions to memory locations. This is called "Memory Mapped IO". L1 Cache (will be discussed in a seperate section later) MMU: Memory management Unit (will be discussed in a seperate section later). Privileged Instructions (only legal when PRI = 1): LOAD/STORE to other Virtual Address Spaces. Modification of the PSW RETT, Return from Trap/Interupt ____________________________________________________________________________ | +-----+ R | | ||<-->| PC | e | | ||<-->| SP | g CCCC PPPP U U | | ||<-->| PSW | i C P P U U | | R ||<-->| Tick| s C PPPP U U | | e ||<-->| %0 | t C P U U | | g ||<-->| %1 | e CCCC P UUUU | | i || ...| ... | r | | s ||<-->| %n | s | | t || +-----+ | | e || Memory Registers | | r || Bus ________________ interupt +-------------+ | | ||<*********>| CU |<---------------||<-> | Control | | | B || Control | Control UNit | ill memory op ||--> | Address | | | u || | * | or page fault ||<-> | Data | | | s || |___________*__| || +-------------+ | | || (PC) * address and || /\ | | ||---------->____________V___--control (rt)->|| || | | || | fetch * | || || | | || (PC++) | instruction | data || \/ | | ||<----------|___________*__|<---------------|| __________________ | | || | * || | | | | || V * M || | | | | || +--------+ * I e || | L1 Cache | | | || | IR | * n m || | | | | || +--------+ * CU control t o || | | | | || | * e r || | | | | || V * r y || | | | | || ____________V___ interupt n || | | | | || | decode |--> to CU a B || | | | | || | instruction | ill inst l u || | | | | || |___________*__| s || | | | | || || * || | | | | || \/ * address and || |________________| | | ||---------->____________V___--control (rd)->|| /\ | | || | load from Mem| || || | | || | source | data || || | | ||<----------|___________*__|<---------------|| \/ | | || || * || __________________ | | || operand1 \/ * || | | | | ||---------->____________V___ || | | | | || operand2 | | interupt || | MMU | | | ||---------->| ALU |--> to CU || | | | | || | | on Errors || | Memory | | | || result1 | Arithmetic | || | Management | | | ||<----------| Logic | || | Unit | | | || result2 | UNit | || | | | | ||<----------|___________*__| || | | | | || || * || | | | | || \/ * address and || | | | | ||---------->____________V___--control (wd)->|| |________________| | | || | store to mem | || |||| | | || | destination | data || |||| | | |______________|--------------->|| |||| | | Internal Clock |||| | |________||______________|_________________________________||||_____|______| || | |||| | External Clock Reset ... External Memory |||| Power Bus ------------------------------------------------------------ 1.4) The von Neumann Cycle The von Neumann Cycle is the basis of most successful computer architectures that have ever been developed. considering it was worked out in some detail more than 50 years ago by John von Neumann, before the first vacuum tube computer was built, this was a major intellectual breakthrough. The von Neumann Cycle while( !HALT ) { fetch the next instruction from memory and increment the PC; decode the instruction; fetch memory operand(s), if needed for the instruction; perform CPU operations, if needed for the instruction; store result(s) in memory, if indicated by the instruction; } Note that: "Memory" here means external memory, not CPU registers. "CPU operations" consist of all operations that do not involve external memory. The Control Unit (CU) is the heart of the CPU, it implements the von Neumann Cycle. In some machines in the past (1960/1970), the CU was actually implemented as a software porgram (called microcode, no relation to current microprocessors) that ran in a hardware microcode-CPU, internal to the CPU! Note the microcode instrucion set would look very different from the CPU's instruction set, it would control the CPU busses, external memory interface, ALU etc... In extreme cases, one could rewrite the microcode for a CPU and download it into the CPU, creating one's own instruction set!!! Current CPUs do not use "software" microcoding, because of the slow down it introduces, instead they use hardwired logic for the CU, typically random logic, or an array of diodes that represent an hardcoded microcode program. Exercise: Try to find some other architectures that have been implemented in commercial hardware (hint Dataflow). Exercise: Consult the BAD Architecture Manual and write a simulator for a BAD computer. Exercise: Consult the BAD Assembler Manual and write an assembler for the BAD assembly language. ------------------------------------------------------------ 1.5) Architecture Extensions 1.5.1) von Neumann vs Harvard ---------- ----------- | CPU | Memory Bus | Text | | |<=================>| Segment | | | | | | | | Data | | | | Segment | | | | | | | ----------- | | ---------- von Neumann Architecture ---------- ----------- | CPU | Instruction Bus | Text | | |<=================>| Segment | | | | | | | ----------- | | ----------- | | Data Bus | Data | | |<=================>| Segment | | | | | ---------- ----------- Harvard Architecture On the Harvard Architecture, where there are physically separate memories for Text and Data, The Text and Data memories may have different widths of addresses and of memory words. 1.5.2) Pipelining Assume a 5 stage von Neumann cycle and that every stage can be carried out in one "clock tick". It takes 5 clock ticks to execute an instruction "x". If we copy Henry Ford, and make the hardware that carries out the von Neumann Cycle line up in 5 independent blocks, like the steps in a factory assembly line, we can see how pipelining works for the 5 steps needed for completing an instruction "x". x_1 x_2 x_3 x_4 x_5 ------------ -------- ------------ -------- ------------ -->|fetch inst|-->|decode|-->|fetch data|-->|execute|-->|store data|--> ------------ -------- ------------ -------- ------------ Hardware Blocks to Perform the von Neumann Cycle (The Arrows represent signal lines between the Blocks.) Now consider a program (a sequence of instructions) without branches. I0 I1 I2 I3 I4 I5 ... In This will be the "input" to the "Factory" above. The "output" of the "Factory" is "completed instructions" from the program. The non-pipelined (sequential/scalar) execution of the program requires 5n clock ticks, as follows Start Time 0 1 2 3 4 5 6 7 8 9 5n-1 Instruction I0_1 I0_2 I0_3 I0_4 I0_5 I0_1 I0_2 I0_3 I0_4 I0_5 ... In_5 Note that at any time 4 out of the 5 blocks of hardware in the "Factory" are "idle". So we "could" start the second instruction as soon as the first instruction moves into the second hardware block, and so forth. Pipelined execution of the program in a 5 stage pipeline requires only n+4 clock ticks, as follows Start Time 0 1 2 3 4 5 6 7 8 9 n-3 Instruction I0_1 I0_2 I0_3 I0_4 I0_5 I1_1 I1_2 I1_3 I1_4 I1_5 I2_1 I2_2 I2_3 I2_4 I2_5 I3_1 I3_2 I3_3 I3_4 I3_5 I4_1 I4_2 I4_3 I4_4 I4_5 I5_1 I5_2 I5_3 I5_4 I5_5 I6_1 I6_2 I6_3 I6_4 I6_5 ... In_1 In_2 In_3 In_4 In_5 Note Branch instructions cause problems for pipelines, so the ideal speed up of a pipeline is not usually seen. The pipeline can only be advanced at the speed of the slowest "station". There are a number of ways real hardware addresses these two (and other) problems, which we will not discuss here. Most real machines use either a 2 or 3 (not 5) stage instruction pipeline. Pipelining is used to speed up many types of computer hardware, such as floating point and arithmetic logic units (ALUs). It is typically heavily used in high performance graphics cards as well as video compressions/decompression hardware. Exercise: Why do most RISC machines have a 2 or 3 stage pipeline, instead of 5 stages? Exercise: If d is the number of steps in a pipeline. 1) How long (pipeline delay) does it take to complete the first instruction? 2) How much "speed up" is possible over a scaler execution? Exercise: Suppose a 32 bit multiply of two numbers is implemented by 32 shifts and adds (not how it is typically done in hardware). consider how a pipelined multiplier might be designed. Exercise: The original "Supercomputers" (eg Cray 1) had special pipelines to perform linear algebra operations, such as adding two vectors, or multiplying a matrix by a vector. Try to consider how a matrix - vector multiplication might be pipelined. 1.5.3) Register Files As part of the now classic Reduced Instructions Set Computer (RISC) design, one ends up with the ability to place more registers on a CPU with space freed up from not implementing complex random logic in a typical Complicated Instruction Set Computer (CISC) design. But using say 1024 registers is not easy. If you wanted instructions to ba able to directly address all 1024 registers, it would require 10 bits in the instruction word to specify each register! A solution is to use a "Register File" and have only a few of the registers available at a time, called the "Register Window". For example, if we wanted to make 32 registers available at a time, we might make the first 8 global registers and take the next 24 from the "Register Window" as shown Virtual Physical Registers Registers --------- 0 ------------> -------------------- 1 | Global Registers | ... | (8 registers) | 7 ------------> -------------------- 8 ----------- -------------------- 9 | | Register File | ... | | (1024 registers) | | | | | | | | | | | | | | | | 31 ------| | | | | | | | | | | | | |--> |------------------| <-- CRW | | Register Window | pointer to Current | | (24 registers) | Register Window | | Only These can | | | be directly | | | addressed by the | | | CPU | |------> |------------------| | | | | | | |------------------| There will be instructions for adjusting the Current Register Window, and usually this will be automatically done as part of CALL and RETURN instructions, to protect contents of registers (holding local data) automatically as part of a function call. In the case of a machine with a register file, when a process is switched out so another process can run, the entire part of the register file in use must be saved (typically to the process's local memory stack). When the process is made active (run) again, the register file must be restored before running the process. If we only had 32 "physical registers", then only those 32 registers would need to be saved when a switch between processes occurs. As in most computer design questions, choosing between options, such as having register windows, makes some operations faster and some slower. It is the hardware design goal to choose those options that make the system perform "best" on a specific class of work and still be cost effective. 1.5.4) Caches (* more below) 1.5.5) Virtual Memory (* more below) 1.5.6) Multiple Functional Units Suppose we had space on our VLSI chip for three independent adder and three independent multiplier units. Could we make use of these to increase performance? Consider the code a = 21*(b-c) + 35*(a-3) - c*(d-e); w = 45*x - 21 + y + z; Breaking this into "three address instructions", assuming all variable are in registers and t is a temporary register, this program can be written as a scaler program: t1=b-c; t2=a-3; t3=d-e; t4=21*t1; t5=35*t2; t6=c*t2; t7=t4+t5; a=t7-t6; t8=45*x; t9=y+z; t10=t8-21; w=t10-t9; If all functional units (adders and multipliers) complete an operation in one clock tick, and if we could get all functional units to work at the same time (getting their operands and storing their results), we could run the following "concurrent" program. Adder_1 Adder_2 Adder_3 Muler_1 Muler_2 Muler_3 Time ------- ------- ------- ------- ------- ------- 0 t1=b-c; t2=a-3; t3=d-e; t8=45*x; 1 t9=y+z; t10=t8-21; t4=21*t1; t5=35*t2; t6=c*t2; 2 t7=t4+t5; w=t10-t9; 3 a=t7-t6; Note: assuming each original three address instruction takes one clock tick, this represents a 3 fold speedup over scaler execution of the program. that in this concurrent case, we complete the second instruction (w = 45*x - 21 + y + z;) before completing the first. Out of 20 possible operations in the 4 time slots, for the concurrent program, we only used 12, so there were still 8 open time slots for operations that could be performed. In practice, designing hardware that can: "get all functional units to work at the same time, getting operands and storing results", is very hard in "single clock ticks". The usual situation is that the functional units are pipelined and require several ticks each, so that loading operands and storing results is "intermixed" with computing operations. This is a very difficult problem compared to making a single pipelined design. One way to make use of multiple function units is to read in a group of "scalar" instructions (as the three address instructions above) into a buffer and then have the CPU do "Instruction Reordering" or "schedule" them into the function units (as we have done in our concurrent program) "on the fly". To the programmer the CPU looks like a scalar machine, but inside... Another approach to Multiple-functon unit machines, is to have machhine instructions that provide explicit concurrent program scheduling of operations on the multiple functional units. These machines are called Very Long Instruction Word (VLIW) machines. 1.5.7) Very Long Instruction Word (VLIW) VLIW macines are multiple function unit machines, where each machine instruction may contain several separate subinstructions for the various functional units in the CPU to execute concucrrently. An example might be taken from the last section, where the instruction would have 5 subparts, each subpart would have to specify an operation, two sources and a destination register. ... -> 2 bits 4 bits 16 bits 16 bits So there might be 38 bits in each subinstruction and a total of 190 bits in a full machine instruction. This is not the start of a reasonable VLIW architecture, but only used to help explain the general concept. Exercise: Give advantages and disadvantages of "Instruction Reordering" compared to "VLIW" designs. 1.5.8) Multiprocessor Architectures Another way to increase performane is to use multiple CPUs. There are two ways to design multiple CPU systems, shared memory and distributed memory. All modern multiprocessing operating systems have features that support both types of multiprocessor architectures. Effective Shared Memory architectures require hardware support as well, which is available for all modern CPUs. Shared memory multiprocessors Memory Switch Shared Memory ---------- Local ----------- | CPU | Memory Bus || | | | |<===========>|| | | ---------- || | | || | | ---------- Local || | | | CPU | Memory Bus ||<====>| | | |<===========>|| | | ---------- || | | || | | ... || | | || | | ---------- Local || | | | CPU | memory Bus || | | | |<===========>|| | | ---------- | | ----------- Distrubuted memory multiprocessors Network Local Memory ---------- Local ----------- | Network | CPU | Memory Bus | | |<-------->| |<===========>| | | ---------- ----------- | | ---------- Local ----------- | Network | CPU | Memory Bus | | |<-------->| |<===========>| | | ---------- ----------- | ... ... | | ---------- Local ----------- | Network | CPU | memory Bus | | |<-------->| |<===========>| | | ---------- ----------- -----------------------------------------------------------a 1.6 Virtual Memory Layout of a Process The concept of a "process" is a central to computing and over time it has evolved to the point where major Multiprocessing Opeating Systems all have processes that, at the upper level, appear very similar. This is no accident, since intense hardware support for effective process management is require and hardware evolved with Operating Systems. Neither exists in a vacuum. UNIX (mid 1970s) was one of the earliest Operating Systems to use the currently widely used process model we present. This model is used by: all UNIX and Linux version, Windows NT and later Micorsoft versions, Mac OS, VMS and also several "mainframe" and real time operating systems. Virtual Memory is the menory the CPU can detect when it is not in "Privileged Mode" and all that a user process sees when performing loads to and stores. We will see later how Virtual memory is mapped to Real (Physical) Memory. Virtual Memory layout of a Process. -------------------------- Virtual Address 0 | | | Unused | No access | | -------------------------- PC - > * | | Text Segment * | Machine Instructions | Read Only * | | -------------------------- | | | Unused | No Access | | -------------------------- * | | * | Static Read Only | Read Only * | | * -------------------------- * | | * | Static Read/Write | Read/Write * | | Data Segment * -------------------------- * | | * | Heap | Read/Write * | grows down | * -----------|-------------- + | v | + | | + | Unused | No Access/Raed/Write + | | + | ^ | SP -> * -----------|-------------- * | grows up | * | Stack | Read/Write * | | -------------------------- | | | Unused | No Access Virtual Address MAX | | -------------------------- Note that normally: the PC, which points to the location where the next machine instruction to execute is located, contains an address in the Text Segment. the SP points to the top of the Stack, which grown upward in memory when things are pushed onto the stack. Most programming languages use this for "Stack Frames" for function calls. A Stack Frame is space on the stack to: 1) Pass arguments to the function being called 2) Hold Local variables for the function being called 3) Return arguments to the calling function 4) Save the 'Context", in case of a "Context Switch". the heap contains space that is dynamically allocated, such as with "malloc()" in C or when "new" is used in C++. When a program attempts to access memory that it does not have permission to access, access permission is checked by the Memory Management Unit (MMU), an interupt is generated. Unless you explicitly catch such an illegal memory interupt in your program, the usual result is the classic fatal error "segmentation fault" or "bus error". Exercise: Write a C program that generates a segmentation fault by referencing memory location 0. Why do you think all processes usually have virtual location 0 set as "Unused" with "No Access"? Exercise: Write a function to find the locations of the text and data segments of a process. Hint, you will want to catch a signal caused by an illegal memory access interupt from the MMU. Exercise: Find out how to determine the sizes of the memory areas for a UNIX process using user commands. Exercise: What shell command allows you to see and set the maximum limits on the virtual memory (text, static, heap and stack) that a process may use? Exercise: Check the man pages for your compiler and linker to see how to alter the default sizes for the heap and stack. Why would you want to alter these sizes. Exercise: What is the system call that actually adds more memory to an address space? It is not malloc(), which is a library function, and it is not always called in malloc() or every time something is pushed on the stack. Exercise: A UNIX process can grow, by adding memory blocks, but it can not shrink, by returning memory blocks to the operating system's pool. That is the memory no longer used when free() is called or the memory on the stack no longer used when you return from a function remains in the processes' address space. Why might this choice have been made? What problems might result? ------------------------------------------------------------ 1.7 Memory Caches In our model of connecting a CPU to it's virtual memory, it should be noted that internal CPU speeds have increased far more rapidly than external memory speeds. In fact, if the current fastest CPUs had to fetch and store all memory accesses to external memory, their effective processing power would be reduced by more than a factor of 10! They would be spending well over 90% of their time waiting for memory operations to complete. To address this problem, memory caches are used between the CPU and its external memory. The intent is to reduce the "average" time required for memory accesses. This is done by keeping copies of some of the memory in the cache, which can transfer data much faster with the CPU than the external memory can. On the same VLSI Chip _______________________________________ CPU Level 1 Cache External Memory ------- ------------------- ------------ | CPU | <=========> |Tag 0 |Slot 0 | <========> |Block 0 | ------- Word ------------------- Block ------------ Transfers |Tag 1 |Slot 1 | Transfers |Block 1 | (Fast) ------------------- (Very Slow)----------- | ... | | ... | | | | | ------------------- | | |Tag t |Slot t | | | ------------------- | | | | | | |Block b | ------------ L1 (Internal) Cache Here we have memory broken into b "Blocks", typically of size about 4 words. The cache consists of a number "Slots", which are each the same size as a Block and "Tags" for each Slot. The Slots hold blocks of memory and the Tag for a Slot tells which Block is in the Slot. L1 caches are "internal" caches, on the same chip as the CPU. They are fast, ideally fast enough to return a memory value if it is in the cache without making the CPU wait. When the CPU reads a word from memory, it places the word's address on the memory bus address lines and indicates that the operation is a read on the memory bus control lines. The cache hardware will then take over and do: if( The Block containing the address is already in a Slot ) { /* Cache hit */ ; } else { /* Cache miss */ if( There is an unused Slot ) { Select it; } else { /* Have to drop a Block from the Cache */ Select a used Slot; IF ( the Slot has been modified since it was last loaded ) { /* write-back the Slot */ store the modified Slot back into its memory Block; } } Transfer the Block containing the address from memory to the Selected Slot and set the Slot's Tag; } Transfer the addressed word from the Cache to the CPU; Operation of a Cache on a word Read. (write-back cache) If the program is reading an array of word data sequentially in memory, and the Block/Slot size is 4 words, then the first time the program goes through the data, 25% of the reads will be cache misses and 75% will be cache hits. However, if the array stays in the cache, the second time through the array results in 100% cache hits. Note that the first time through, the speed might be lower than if there were no cache! Writes to a cache follow a similar prodedure to reads, with one extra twist. When you write to the cache, the cache Slot is now different than the corresponding memory block. Something must be done so that sometime before a modified Block is dropped (FLUSHED) from a cache Slot, the modified results are written back to External Memory. This is called maintaining "Cache Consistancy". There are several solutions to this problem. The two most common are write-back caches (suggested here) and write-through caches, which we will not discuss further. The concept of "Locality of Reference" is critical to getting an increase in performance by using a cache. This concept is that "When a program references a memory location, it will often soon reference the same location again, and other memory locations close to that location." Without "Locality of Reference", a program can perform poorer with a cache that it would without one. When one is designing high performance computational algorithms, this concept must be constantly kept in mind, along with the cache sizes of the machine the algorithm will be run on. An L2 (external) cache) is available for most current architectures. It sits between the L1 cache and memory, operating in the same way an L1 cache does. On the same VLSI Chip External External ____________________________ CPU Level 1 Level 2 External Cache Cache Memory ------- --------- --------- --------- | | | | | | | | | CPU | <========> | | <========> | | <========> | | | | Word | | Block1 | | Block | | ------- Transfers | | Transfers | | Transfers | | (Fast) | | (Slow) | | (Very Slow)| | --------- | | | | | | | | | | | | --------- | | | | | | --------- The Slot/Block for an L2 (External) Cache is typically about 4-8 times the size of the Slot/Block of the L1 (Interal) cache. If a CPU can carry out 100 "io units" / second, the next table shows "typical" L1 and L2 cache sizes and speeds. CPU Speed (100 "io units"/sec) Speed/Hit Slot/Block Size Size L1 Cache 100 4-8 4K-128K L2 Cache 6-30 32-64 64K-4M Memory 2-10 64M-8G "Usual" Cache Speeds/Sizes The next items to consider in detail in cache architecture would be: Choice of algorithm for selection of a used Slot when the cache is full, Choice of number of Slots, Choice of Slot/Block size, Hardware implementation, Cost in time and space, Split (eg separate Text/Data) caches, Cache consistency, single and multiprocessor systems. Program control of cache. In evaluating each of these areas of cache design, a specific class of programs must be considered, so that the performance of the system on that class of programs can be determined. We will not expand on these cache systems options here. Because the control of a cache must be all in hardware for speed, the actions of a cache are transparent to a user program. There are usually only two operations that the CPU can perform on a Cache, a "FLUSH" and "Cache enable/disable". The Flush causes all the Cache Slots to be written back to memory and marked as unused, and is a privileged instruction or an implicit part of TRAP/RETT instructions. In a system with two Cache levels, a Flush will cause both Caches to be cleared. Exercise: Why is it expected that internal CPU speeds will continue to increase in speed more rapidly than external memory speeds? Exercise: Why not make L1 cache (or L2 cache) the same size as External memory and do away with External Memory? Exercise: What might make a program run slower with cache than it would if there were not a cache? Exercise: Determine the Level 1. Level 2 and Memory Sizes for some computers you have access to. Write a program to show the effect of cache misses. Exercise: How can you use user commands to determine the number of cache misses? Exercise: In 2001, why are the Intel XEON Pentium processors used in large computational work and in heavy duty mutiprocess servers, instead of much faster, newer and much cheaper Pentiums that are non-XEON? Exercise: In accessing large arrays in memory in a C program, why might working across each row instead of down each column, be more efficient? Hint, look at how a two dimensional array is laid out in memory. Exercise: Consider "locality of reference" with repest to fetching machine instructions. What can you say about loops and nested loops? What is the "worst" type of code as far as cache performance is concerned? Exercise: disabling/enabling caching may be done be clearing/setting a bit in the PSW. Why should this be a privileged operation? Exercise: What type of real applications benefit most from caches amd which benefit least. Exercise: In pipelined machines, sometimes there are separate caches for instructions and data. Why? Exercise: What extra problems arise with caches in a multiprocessor machines? ------------------------------------------------------------ 1.8) Memory Mamagement Units When a Memory Management Unit (MMU) on a modern mulitprocessing system is properly enabled, the CPU (and a user process) "sees" a linear address memory space. This space is only "Virtual" in that the MMU will translate the "Virtual Memory Address", that the CPU sees, into a "Real (Physical) Memory Address". The MMU also provides support for two additional features, memory protection and demand paging. ------- -------------- | | Data | Real | | CPU | <===============================================> | (Phyiscal) | | | | Memory | | | ------- | | | | Virtual Address | | Real Address | | | | ===================> | MMU | ===================> | | | | | | | | | | | | | | | | | | | | | | Control | | Control | | | | <------------------> | | <------------------> | | ------- Read Interupts | | | | Write | | | | Execute ------- | | Enable/Disable | -------------- ---------- | MMU | | Table | ---------- Just as in Caching. because the translation from a virtual memory address to a real (physical) memory address happens with every memory access, this translation must be implemented in hardware for speed and is now usually part of the same VLSI chip as the CPU and L1 Cache. Unlike the Cache, the operating system has much greater control of the management of Virtual Memory, by setting up the Memory Management Table and servicing interupts from the MMU (for both illegal memory operations and page faults, to be mentioned later). There are several variations on the real hardware implementation of Virtual Memory, but only a simple conceptual model will be presented here. This model will suffice for most uses, other than Operating System Internals. The first Step is to break Virtual Memory into P = 2^p "Pages" and break Real (Physical) memory into F = 2^f "Frames". Both Pages and Frames will have the same size, offset_size bits. The offset_size is usually equal to or slightly larger than 9. Resulting in the Page/Frame size being a small multiple of 512 bytes (eg 512 to 8K = 16*512). In practice, powers of 2 are used for the offset_size so that virtual addresses can be split into a page address and an offset at a bit boundary. This is Virtual_address Real_Address Where represents the bits from "a" concatenated with the low order offset_size bits from "b". Here is a layout of both Virtual and Real memory with 512 byte Blocks/Frames (9 bit wide offsets, offset_size = 9). Virtual Memory Virtual Memory Real Memory Real Memory Page Number Address Frame Number Address -------------- -------------- ------------ -------------- 0 00000000000000 0 00000000000000 00000000000001 00000000000001 ... ... 00000011111111 00000011111111 -------------- ------------ -------------- 1 00000100000000 1 00000100000000 00000100000001 00000100000001 ... ... 00000111111111 00000111111111 -------------- ------------ -------------- ... ... ... ... -------------- ------------ -------------- P XXXXXX00000000 F YYYYYY00000000 XXXXXX00000001 YYYYYY00000001 ... ... XXXXXX11111111 YYYYYY1111111 -------------- ------------ -------------- Where "XXXXXX" is the binary expansion of P-1 (p ones) and "YYYYYY" is the expansion of F-1 (f ones). Now consider the following "MMU Table" with the values shown. Memory Management Table Index Memory Management Table Entries -------------- ---------------------------------------------------- Virtual Memory Real memory Valid Paged Dirty Protection Swap Page Frame Bit Bit Bit rwx bits Block ------------- ------------ ----- ------ ------ ------------------ 0 - (F-1) 0/1 0/1 0/1 0/1 (0 - S-1) ------------ --- --- --- ------------------ 0 0 0 X X XXX XXX 1 5 1 0 0 001 123 2 3 1 0 0 001 32 3 0 0 X X XXX XXX 4 0 0 X X XXX XXX 5 4 1 0 0 100 232 6 7 1 0 0 110 512 7 10 1 0 0 110 8 8 8 1 0 0 110 11 9 14 1 0 0 110 42 10 9 1 0 0 110 55 11 0 0 X X XXX XXX 12 0 0 X X XXX XXX 13 20 1 0 0 000 95 14 17 1 0 0 110 19 15 18 1 0 0 110 66 16 8 0 X X XXX XXX ... P-1 0 0 X X XXX XXX ---------------------------------------------------- The fields in the MMU Table for the Virtual Page "page" indicate: mmu_table[page].frame frame in Real memory where the Virtual Page is currently located. mmu_table[page].valid = 1 indicates the Virtual Page is used mmu_table[page].paged = 1 indicates the Virtual Page is not in Real memory, but in the SWAP area on disk. mmu_table[page].dirty = 1 indicates the Frame has been modified since it was last read in from the Swap Disk. mmu_table[page].read = 1 indicates that a data fetch is allowed mmu_table[page].write = 1 indicates that a data store is allowed mmu_table[page].execute = 1 indicates that an instruction fetch is allowed mmu_table[page].swap is the Block in the Swap Disk where the page will be when it is not in a Memory Frame. The paged bit, dirty bit and swap block will be discussed in the section on "Demand Paging". The MMU Table above would map the following Virtual Memory Space into the Physical Memory Space as shown below. Mote the MMU itself does not "know" that we will use the Pages indicated for the Text and Data Segments as shown, it only knows the mapping and protections as indicated in the given MMU Table. Virtual Memory Real memory ____________________________________ ___________________________ Virtual Memory Virtual Memory Real Memory Real Memory Page Number Page Frame Number Frame -------------- -------------------- ------------ -------------- 0 Unused 0 1 A | Text 1 2 B | Segment 2 3 Unused 3 B 4 Unused 4 C 5 C | Static RO 5 A 6 D | Static RW 6 7 E | 7 D 8 F | 8 F 9 G | Heap 9 H 10 H | 10 E 11 Unused 11 I 12 unused 12 13 I | Stack 13 14 J | 14 G 15 K | 15 16 Unused 16 ... 17 J P = 2^p Unused 18 K 19 20 I ... F = 2^f For a MMU with a Page/Frame size of 2^offset_size bits, the procedure the MMU effectively carries out is to compute the real_address from the virtual_address as follows: offset_mask = (1 << offset_size) -1; /* offset_size ones */ page_mask = (1 << p) -1; /* p ones */ /* compute the page and offset */ page = (virtual_address >> offset_size) & page_mask; offset = virtual_address & offset_mask; if( (! mmu_table[page].valid) || invalid protection for operation ) { /* an illegal memory reference, interupt the processor */ interupt(seg_fault); } else if ( mmu_table[page].paged ) { /* the page is not in Real memory, call the pager to "page" it in */ pager(mmu_table, page); { else { /* The Page is in memory and the operation is legal */ frame = mmu_table[page].frame real_address = (frame << offsrt_size) | offset; return(real_addess); } Note that: Only Virtual Pages that are actually used will be mapped into Real (Physical) Memory pages. The Unused Virtual Pages do not cause wasted Real Memory. Real Pages that are not used are free for use by other processes, which are not currently running, when they start running again. The MMU can cause an interupt, if an operation is attempted on a Virtual Page that does not have permission set for the operation. The disable/enable MMU may be set by clearing/setting a special bit in the PSW, which would be a privileged operation. Some device drivers may require both caching and memory management to be disabled. The MMU Table will normally be in Real Memory that is in the Kernel's Virtual Space. This will usually not be in a user process' address space. The MMU will need to know where the table is. Because of the Speed needed in memory translations, the MMU will usually cache the MMU Table as it is accessed. Exercise: Why is Virtual Page 0 always set with no permissions on every UNIX (and most other) systems. Exercise: Suppose there can only be P = 2^p pages in the Virtual Address Space (p bits in the Page Number). Can a system work if there are less then P Frames of Real memory (FP)? Explain. Exercise: Not every MMU or Operating System supports an Execute permission bit, checking only the Read bit when a Instruction is being fetched. What is an advantage of having an Execute bit? Hint: What is "stack Smashing"? Exercise: For a common real computer architecture, try to determine if the L1 and L2 caches operate on virtual or physical addresses (before or after the MMU). Why to you think that choice was made. Exercise: No real system will use an explicit MMU Table as we have shown it. The table will be implicit, using a more "advanced" data structure. Why? Exercise: The Kernel will keep and manage a separate MMU Table for each process on the system. When switching from one process to another, the Kernel must arrange for the MMU to use the proper MMU Table. In addition to the time needed to actually switch the MMU Table, what other factor related to the MMU will cause a slowdown after a process swap? Exercise: Pages, Frames and Swap Blocks are typically all the same size, a small multiple of 512 bytes. Why? ------------------------------------------------------------ 1.9) Demand Paging What happens when there are 20 processes, each with 100 MBytes of used pages (Text and Data segments), and there is only 1 GByte of Real memory? All the processes can not fit into Real Memory at the same time. First observe that only the running process needs its pages in Real Memory. Second, only the pages of the running process that are being currently accessed need to be in Real memory. These two observations lead to a method of memory management called "Demand Paging". Demand paging is almost universally used in current multi- processing operating systems. Because every memory access requires checking if the page is in memory, hardware support (via the MMU) is required for speed. Looking at our MMU model in the last section, we see that if a Frame is in memory the MMU uses it. If the Frame is not in Real memory, the MMU causes an interupt and a part of the kernel called the "pager" handles the interupt. Consider the following "Frame Table" as a kernel data structure that records the current use of the Frames of real memory. The operations the pager (in the kernel) needs to perform appear in the psuedo code pager() below. Frame_Table ---------------------------- Real memory Processes Frame Using Frame (pids) ---------------------------- 0 Unused 1 27 2 18 3 27,18 4 32 ... ... F-1 Unused /* * The pager() is called when a user process has a "Page Fault" */ pager(mmu_table, page) { pid = process id of current user process; if( There is an unused memory Frame, frame ) { frame_table[frame].pids = pid; } else { Select a currently used memory Frame, frame foreach ( pid in frame_table[frame].pids ) { mmu_table_p = mmu_tables[pid]; page_p = the page in mmu_table_p where frame is used; if( mmu_table_p[page_p].dirty ) { Write frame back to swap block mmu_table_p[page].swap; } mmu_table_p[block_p].swapped = 1; } frame_table[frame].pids = pid; } Put the current process into the Wait Queue; mmu_table[page].valid = 1; mmu_table[page].frame = frame; Place a request to read a block from the disk block mmu_table[page].swap into frame. sched(); } Note When the requested disk read is complete and the frame has been read in from the swap block, as indicated by a interupt, the kernel will move the process that had the page fault from the wait queue to the ready queue. sched() will select the highest priority process in the ready queue and dispatch it (make it the running process). Demand Paging can be viewed as just another layer of caching higher than L2 caching, between Real Memory and Disk. In general, Kernel Pages are never paged out. They are locked into Real Memory. Our models for both the MMU Table and Frame Table are useful for describing memory management operations, but they would never be implemnted as simple tables. Simple tables would be both too large and too slow. (some operations we have indicated would require linear searches). The "dirty bit" is used to detect if a Virtual Memory Page has been modified since it was last loaded into a Real Memory Frame. When a Page is "Swapped Out", only pages that have been modified have to be written back to the "Swap Block". This type of page fault management is called "Write Back". Hardware support from the MMU is used to set the dirty bit for a page in the MMU Table each time a write to the page is requested. The operation "Select a currently used memory Frame, frame" in the pager() code, is quite vague. Choosing a good algorithm here is critical and may have hardware support. A common procedure is to try to choose the "Least Recently Used" (LRU) frame. Allowing a list for "Processes Using a Frame" in the Frame Table make it a bit more complicated, but shows how processes can "share memory", mapping the same Real memory to different processes' Virtual Memory (The Virtual Memory pages do not have to have the same locations in the different Virtual Addess Spaces). There are many important implementation details and options to choose here, as far the Operating System's internals go in making Demand Paging practical. We do not expand on such details here. Exercise: For typical hard disk drives, estimate the time required to page in a Swap Block into a Memory Frame (on a "page fault"). How does this compare with an L2 "cache miss"? Exercise: What steps in the psuedo code for the pager() require linear searches? Exercise: How large would the Frame Table be on a typical system, if implemented as a simple table as shown? Exercise: How can you determine how many page faults a process, or the system as a whole, is having? (shell commnads etc...) Exercise: Considering three factors, hardware, number of processes and individual processes, without modifying the Operating System, what can be done to reduce page faults. Exercise: Which operations in the pager() require linear searches and how long is each? Which of these can be "easily" improved by using a simple linked list in the Frame_table? Exercise: The pager above may write back more blocks then it really needs to when there is more than one process sharing the block. Correct this. ------------------------------------------------------------ 1.10) The Kernel The kernel is an central part of an Operating Syetem, it is a privileged process that manages the hardware and all user processes. After the operating system boots up into it's multiprocessing state, the kernel only runs when the running user process executes a TRAP instruction or when some hardware unit provides an INTERUPT. 1.10.1) Process management There are 5 states a process may be in. The kernel will run as the result of a TRAP or INTERUPT, and if needed it will move a process from state to state by adjusting its tables. The states a process can move through are shown in the following diagram. ------------ ------------ | New | fork() | Exiting | | * * | (TRAP) | * * * | ------------ ------------ \ -------------------- / \ | Time slice up | / exit() v v (clock INTERUPT) | / (TRAP) ------------ ------------ | Ready | ---------------> | Run | | ******** | sched() | * | | ******** | dispatch() | | ------------ ------------ ^ / Completion of service\ / Request for a "slow" (INTERUPT) \ / service via a System Call \ / (TRAP) \ v ------------ | Wait | | ******** | | ******** | ------------ States ------- New Newly Created Processes ( fork() call on UNIX systems ). Ready Processes that are not waiting for services and can run. Run The single Running Process (on a multiprocessor system with k processors (CPUs) there could be k processes in this state. Wait Processes waiting for a operating system service to complete. Exiting Processes that are terminating ("zombies" on UNIX systems). Note that systems have a "Clock INTERUPT" that usually occurs every 1/60th or 1/100th of a second. The Clock INTERUPT is called a "Clock TICK" and the number of Clock TICKS is kept by the Kernel and is a measure of time. When a Clock INTERUPT occurs the Kernel may choose to move a Running Process into the Ready State (if it's "Time Slice" has expired) and select another process to run. Exercise: How can you use "nice(1) and renice(1)" to effect the kernel's selection of the next process to run in sched() on a UNIX system? Exercise: Determine the time in seconds of a Clock TICK on a machine. Exercise: Explain why timing short programs is not easy. Also consider why high precision timing may be difficult in general, What hardware support might be needed for high precision timing? 1.10.2) Security Throught Kernel Services Security and Protection in our model Operating System are supported by having the Kernel provide all input and output (IO) for a User Process. A User Process can only access it's own Virtual Memory (without changing it's MMU Table!) and the non-privileged CPU Registers. All IO, such as disk and network IO, that a user process wants must be done through a system call, so that the Kernel can check that the requested operation is legal and that the process has permission to perform the operation. This protection provides more than simple file read/write/remove access protections. As an example, if a user process could direcly access a disk, a small programming error could corrupt the entire file system! This protection also means that an error in one user process can not accidentially affect another independent user process. Exercise: How can an error in one user process affect another user process? 1.10.3) Device Management The Kernel takes care of all Device management: disks, tapes, cd-roms, cd-ws, dvds, floppies, modems, network interfaces, sound cards, graphics displays etc.... In addition to protection and security, this allows the kernel to present a virtual interface to a user process for each class of device, even if some of the devices in the class require different low level commands. As an example a user program does not have to treat SCSI disk and IDE disks differently. Exercise: List some problems with having all device support go through the kernel. Which types of devices are affected most by this restriction. ------------------------------------------------------------ 1.11) The Process Control Block (PCB) and Kernel Attributes The information the kernel maintains about a process is called that process's Process Control Block (PCB). The PCBs are maintained in the kernel's address space, but a PCB is not really one block or table, rather the information is scattered around a number of tables and data structures. We call some of this information "Process Attributes". For UNIX systems some of the most important Process Attributes are: pid Process ID ppid Parent Process ID uid euid User ID, Effective User ID gid egid Group ID, Effective Group ID pgid Process Group ID sid Session ID cwd Current Working Directory umask file mode creation MASK limits (hard and soft) limits for resources: time maxium CPU seconds for a process file maxium number of blocks that can be written into a file data maxium data segment size stack maxium stack size coredump maxium coredump file size nofiles maxium number of open file descriptors vmemory maxium virtual memory size nice process scheduling priority adjustment root root directory for all paths and commands time Approximate times start system user other process statistics (total page faults etc...) open file descriptor table (files, pipes, sockets, devices). opened IPC objects (semiphores, shared memory, message queues). There are usually both shell commands and system calls for accessing, and in some cases, changing process attributes. On UNIX new processes are generated by using a fork() call, which causes the kernel to make a new process which is almost an exact copy of the process making the fork() call. Therefore process attributes are inherited from a proccess's parent process (except for example pid and ppid). Thus, setting a process attribute in a shell (process) causes all processes you start from that shell to inherit the attribute you had set in the shell. To review, in addition to the process attributes in the PCB, a process consists of a Virtual Memory system, the contents of caches (for the running process), and the CPU State. The Memory Subsystem is made up of: The Virtual Memory image of the process. The MMU Table for the process (in the kernel's address space). Frames and Swap pages for the process. Contents of caches (only for the running process). The CPU state (register values and pipeline state); Exercise: Try to find some other Process Attributes for a UNIX system. Exercise: Try to find a shell command to read and set each of the listed Process Attributes. ------------------------------------------------------------ 1.12) Context Switches via Traps and Interupts In a typical multiprocess operating system there are two types of switches between processes, switching from a User Process to the Kernel Process and switching from the Kernel Process and a User Process. These are sometimes called Context Switches. Now each type of switch will be considered. A switch from a User Process to the Kernel Process is caused by either a TRAP instruction being executed by the User Process or by an INTERUPT from Hardware. Consider what must be done: Save the CPU State of the User Process (to User's memory) Put Parameters in Registers (for a system call) Flush Memory Caches Make the Kernel's MMU Table the current MMU Table Set PRI = 1 Set the PC to the proper Kernel Entry Point. Some of this is done by the TRAP instructions or INTERUPT and some may be done by code in either the User or Kernel Process. A switch from the Kernel Process to a User Process is caused by a RETT (RETurn from Trap) instruction being executed by the Kernel Process. Consider what must be done: Put Return Values in Registers (for return from a system call, not for a typical return from an Interupt). Flush Memory Caches Make the User's MMU Table the Current MMU Table Set PRI = 0 Restore the CPU State of the User Process (from User's memory) Set the PC to the return location in User's memory Just as with the context switch to the Kernel, some of this is done by the RETT instruction and some may be done by code. It is clear that these context switches are going to be very slow. A major cause of this slowdown is the FLUSHING of the Caches, resulting in many cache misses after a context switch. It is expected that a system call may be several humdred times (or more) slower than a function call. 1.12.2) Steps taken in a getpid() system call. The getpid() system call is a "fast" system call, which the kernel can service without delay. So there is no need to switch to another process while the getpid() call completes. Here is a breakdown on the call. Running Cause Action Process ------- ------------- --------------------------------- user_1 getpid(); kernel TRAP service getpid(); copy pid of user_1 into a register continue the user process user_1 RETT pid is still in the register Here a TRAP requires a Context Switch to the Kernel and the RETT requires a Context Switch to the user process. 1.12.3) Steps taken in a read() system call. The read() system call is a "slow" system call, which the kernel cannot service without delay. So there is a need to switch to another process while the read() call completes. Now consider the steps of a "read" system call: read(fd, buff, n) , when n is the size of 2 logical disk blocks and no errors occur. The kernel will only transfer 1 logical disk block at a time from the disk, so two separate transfer operations must be made to service this read() system call. Since those transfers may each take many milliseconds to complete, and read() is by default a "blocking system call", the kernel will schedule other processes while the transfers are going on. A "blocking system call" is one that will not return until the requested service is complete. This could result in the following; Running Cause Action Process ------- ------------- --------------------------------- user_1 read(fd, buf, n); kernel TRAP service read system call verify call is legal request disk read of first block place user_1 in wait state schedule a new user process user_2 RETT other work kernel Disk INTERUPT first block of user_1 data read in transfer block to start of buf continue interupted process user_2 RETT other work kernel Disk INTERUPT second block of user_1 data read in transfer second block to buf place user_1 in ready state continue interupted process user_2 RETT other work ... kernel Clock INTERUPT user_2's time slice is up schedule a new user process (might be user_1) user_1 RETT Time to give user_1 a time slot again. return from the write() system call As in the previous example, every action here requires a context switch. Exercise: In the two examples above, consider the effect of the system calls on the times (elapsed, system, user) kept by the kernel for the process. 1.10.3) Privileged Instructions There are several types of operations that require that the PRI bit be set. Those operations include: Writing to the PSW enabling/disabling caches enabling/disabling MMU RETT, Return from TRAPS (and INTERUPTS) Access to other processes virtual memory spaces FLUSH of caches HALT Exercise: Why are these instruction "privileged"? For each of the operations above, give an example in a multiprocessing operating system, where having a User Process (PRI = 0) execute the operation causes problems. 1.10.4) The Stack and Function Calls, Traps, and Interupts The system stack is read/write memory and could be used as any other memory by a process. The stack is usually only used as a structured way because of its connection with: function linkages (CALL/RET machine instructions), system calls (TRAP/RETT machine instructions), and interpts (hardeware generated interupts and RETT machine instructions). The space on top of the stack allocated for use when there is a function call, system call or hardware interupt is called a "Stack Frame". SP ------> ------------------------------- | SP_old (Next Stack Frame) | | PC_old (return address) | |-- not used until a function is | Arguements | | called from the currrent one | ... | | | Local Variables (automatic) | Top Stack Frame | ... | (Current Function) | Space for Context | | ... | SP_old |-----------------------------| | SP_old1 | | PC_old1 (return address) | current function return address | Arguements | arguements for current function | ... | | Local Variables (automatic) | Next Stack Frame | ... | (Calling function) | Space for Context | | ... | | | | | SP_old1 |-----------------------------| | | ... | | |-----------------------------| | | Bottom Stack Frame | | ------------------------------- Consider the effect of a function call strcpy(s1,s2); 1) copy arguements (s1 and s2) into the Stack Frame (and often passed in registers as well) 2) save any "local" context that needs to be protected (usually only a few registers. RISCs only need to advance the register window, unless there is a register window overflow) 3) save the current SP (as SP_old) in the Stack Frame 4) save the current PC (return address) in the Stack Frame 5) set the SP to the new Top Stack Frame SP = SP + size(space for Context) + (space for Local Variables) + (space for arguements (s1 and s2), 2 words here) + (space for PC and SP_old, 2 words here) 6) set the PC to the new function start address PC = strcpy Consider the effect of a system call write(fd, buf, n) 1) Same 2) save "ALL" system context on the stack. 3) Same 4) Same 5) set the SP to the new Top Stack Frame SP = SP + size(space for Context) + (space for arguements (s1 and s2), 2 words here) + (space for PC and SP_old, 2 words here) 5X) The "new SP" really needs to be saved some where "special" so that the RETT instruction can use it to find where the Stack Top is when returning to the user process. 6X) FLUSH all caches 6X) Switch to the kernel's MMU Table (use Kernel's Virtual Memory) 6X) SET PRI = 1 6X) Restore Kernel's Context (SP, PSW etc...) 6X) set the PC to the entry point for the "write" system call in the kernel's virtual address space. PC = write Consider the effect of an interupt, such as a memory violation This is the same as a system call, except that step 1) is not done and the return address is whereever teh user process is currently running. Note Not all Stack Frames are the same size (because not all functions need the same space for local variables). In C, the term for local variable is "automatic variables". The space in a stack frame reserved for context may not all be used. Systems Calls and INTERUPTS do not require space for local variable on the stack (in the user process virtual address space). The Arguements passed to the current function (which "owns" the Top Stack Frane) are in the Next Stack Frame (below the Top Stack frame, which is "owned" by the the function which called the current function). So the PC, SP_old and Arguements in a Stack Frame are not set or used until a function call, system call or interupt occurs. The exact order of the parts of a Stack Frane (and other details) are machine and operating system specific. Such details as, how much is done as part of the CALL/TRAP instruction itself and how much is carried out by other machine instructions. Exercise: What situation must be addressed with an instruction pipeline when a trap or interuprt occurs? Exercise: Work out that has to be done with a return back from a function (which uses the RET machine instruction) and with a "return"/"context switch" back from a TRAP or INTERUPT (which uses the RETT machine instruction in the kernel). Exercise: Why is a system call linkage so much slower than a function call linkage? ------------------------------------------------------------ 1.13 Memory Mapped IO Character Devices Block Devices ------------------------------------------------------------ REFERENCES Computer Architecture --------------------- RISC SPARC PowerPC INTEL Pentium AMD Athlon INTEL IA64 Operating Systems --------------------- System V v4 BSD 4.4 Linux Windows? UNIX System Programming --------------------- C Language --------------------- GNU Program Development ---------------------