Mark Smotherman.
Last updated: March 1999
Although loose usage has sometimes equated the term "microprogramming" with the idea of "programming a microcomputer", this is not the standard definition. Rather, microprogramming is a systematic technique for implementing the control unit of a computer. It is a form of stored-program logic that substitutes for sequential-logic control circuitry.
A processing unit in a computer system is composed of a data path and a control unit. The data path includes registers, function units such as ALUs (arithmetic and logic units) and shifters, interface units for main memory and/or I/O busses, and internal processor busses. The control unit governs the series of steps taken by the data path during the execution of a user-visible instruction, or macroinstruction (e.g., load, add, store, conditional branch).
Each step in the execution of a macroinstruction is a transfer of information within the data path, possibly including the transformation of data, address, or instruction bits by the function units. A step is thus called a register transfer and is accomplished by gating out register contents (i.e., sending a copy) onto internal processor busses, selecting the operation of ALUs, shifters, etc., and gating in (i.e., receiving) new values for one or more registers. Control signals consist of enabling signals to gates that control sending or receiving of data at the registers (called control points) and operation selection signals. The control signals identify the microoperations required for each register transfer and are supplied by the control unit. A complete macroinstruction is executed by generating an appropriately timed sequence of groups of control signals (microoperations).
Figure 1. Simple data path for a four-instruction computer (the small circles are control points)
For example, the first group of control signals for a macroinstruction would implement instruction fetch by gating the contents of the program counter out over an internal bus and into the memory address register. Either in this group (i.e., during this current time step) or the next, the program counter would be incremented and the memory interface sent a signal indicating a memory read. The following time steps would perform the microoperations required to fetch operands, execute the function specified by the macroinstruction opcode, and store any results. A simple macroinstruction might require five to ten time steps and involve a dozen or more control signals.
time steps T0-T3 for each instruction fetch:
T0: PC_out, MAR_in
T1: read, pcincr
T2: MDR_out, IR_in
T3: time step for decoding opcode in IR
time steps T4-T6 for a load instruction (opcode 00):
T4: IR_out(addr part), MAR_in
T5: read
T6: MDR_out, ACC_in, reset to T0
time steps T4-T7 for an add instruction (opcode 01):
T4: IR_out(addr part), MAR_in
T5: read
T6: ACC_out, aluadd
T7: TEMP_out, ACC_in, reset to T0
time steps T4-T6 for a store instruction (opcode 10):
T4: IR_out(addr part), MAR_in
T5: ACC_out, MDR_in
T6: write, reset to T0
time steps T4-T5 for a brz (branch on zero) instruction (opcode 11):
T4: if (acceq0) then { IR_out(addr part), PC_in }
T5: reset to T0
Figure 2. Time steps for the four-instruction example
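To make the register-transfer view concrete, here is a minimal Python sketch of the Figure 1 datapath stepping through the Figure 2 time steps. The 8-bit instruction layout (2-bit opcode, 6-bit address field), the word-addressable memory, and the tiny demo program are illustrative assumptions, not part of the original example.

```python
class FourInstructionMachine:
    def __init__(self, memory):
        self.mem = memory      # word-addressable main memory (8-bit words)
        self.PC = 0            # program counter
        self.IR = 0            # instruction register: opcode in bits 7-6, address in bits 5-0
        self.MAR = 0           # memory address register
        self.MDR = 0           # memory data register
        self.ACC = 0           # accumulator
        self.TEMP = 0          # ALU output latch

    def step_instruction(self):
        # T0-T3: instruction fetch and decode
        self.MAR = self.PC                           # T0: PC_out, MAR_in
        self.MDR = self.mem[self.MAR]                # T1: read
        self.PC += 1                                 #     pcincr
        self.IR = self.MDR                           # T2: MDR_out, IR_in
        opcode, addr = self.IR >> 6, self.IR & 0x3F  # T3: decode opcode in IR
        # T4 onward: execute
        if opcode == 0b00:                           # load
            self.MAR = addr                          # T4: IR_out(addr part), MAR_in
            self.MDR = self.mem[self.MAR]            # T5: read
            self.ACC = self.MDR                      # T6: MDR_out, ACC_in
        elif opcode == 0b01:                         # add
            self.MAR = addr                          # T4: IR_out(addr part), MAR_in
            self.MDR = self.mem[self.MAR]            # T5: read
            self.TEMP = self.ACC + self.MDR          # T6: ACC_out, aluadd
            self.ACC = self.TEMP                     # T7: TEMP_out, ACC_in
        elif opcode == 0b10:                         # store
            self.MAR = addr                          # T4: IR_out(addr part), MAR_in
            self.MDR = self.ACC                      # T5: ACC_out, MDR_in
            self.mem[self.MAR] = self.MDR            # T6: write
        else:                                        # brz (branch on zero)
            if self.ACC == 0:                        # T4: if (acceq0) then PC_in
                self.PC = addr

mem = [0b01000011] + [0] * 63   # address 0 holds "add" with address field 3
mem[3] = 5
m = FourInstructionMachine(mem)
m.step_instruction()            # ACC becomes 0 + 5 = 5
```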
The logic expression for an individual control signal can be written in the sum-of-products form in which the terms consist of a given time step (or clock phase) and'ed with a decoded signal identifying a specific macroinstruction opcode. The control signal will then be asserted when needed at one or more specific time steps during macroinstruction execution. Additionally, some terms may be and'ed with decoded signals that identify particular conditions within the datapath (e.g., register contents equal to zero) or external conditions (e.g., interrupt request). Conditional assertions are used to implement conditional branch macroinstructions and interrupt handling.
ACC_in = (load & T6) + (add & T7)
ACC_out = (store & T5) + (add & T6)
aluadd = add & T6
IR_in = T2
IR_out(addr part) = (load & T4) + (add & T4) + (store & T4) + (brz & acceq0 & T4)
MAR_in = T0 + (load & T4) + (add & T4) + (store & T4)
MDR_in = store & T5
MDR_out = T2 + (load & T6)
PC_in = brz & acceq0 & T4
PC_out = T0
pcincr = T1
read = T1 + (load & T5) + (add & T5)
TEMP_out = add & T7
write = store & T6
Figure 3. Control signal definitions in sum-of-products form
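The expressions in Figure 3 can be checked mechanically. The following sketch simply evaluates them in software and returns the set of signals asserted for a decoded opcode, time step, and the acceq0 condition; the function name and data types are assumptions for illustration only.

```python
def control_signals(opcode, T, acceq0):
    """Return the control signals asserted for a decoded opcode
    ('load', 'add', 'store', 'brz'), time step T (0-7), and the
    ACC-equals-zero condition, per the Figure 3 expressions."""
    load, add, store, brz = (opcode == m for m in ("load", "add", "store", "brz"))
    asserted = {
        "ACC_in":   (load and T == 6) or (add and T == 7),
        "ACC_out":  (store and T == 5) or (add and T == 6),
        "aluadd":   add and T == 6,
        "IR_in":    T == 2,
        "IR_out":   ((load or add or store) and T == 4) or (brz and acceq0 and T == 4),
        "MAR_in":   T == 0 or ((load or add or store) and T == 4),
        "MDR_in":   store and T == 5,
        "MDR_out":  T == 2 or (load and T == 6),
        "PC_in":    brz and acceq0 and T == 4,
        "PC_out":   T == 0,
        "pcincr":   T == 1,
        "read":     T == 1 or ((load or add) and T == 5),
        "TEMP_out": add and T == 7,
        "write":    store and T == 6,
    }
    return {name for name, on in asserted.items() if on}
```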
The direct implementation of the control signal logic expressions is called a hardwired control unit. Historically, direct implementations were based on individual logic gates; the result is called "random logic" and is characterized by arbitrarily complex circuits. In a microprogrammed control unit, control signals that are to be generated at a given time step (i.e., to be simultaneously active) are stored together in a control word, which is called a microinstruction. The collection of control words that implements a macroinstruction is called a microprogram, and the microprograms are stored in a memory element called the control store.
Figure 4. Control store for the four-instruction computer (control bits of zero not shown)
Thus, in a microprogrammed control unit, the sequence of microoperations needed for macroinstruction execution derives from a series of fetches from the control store rather than the operations of sequential-logic circuitry. The result is a more systematic design for control.
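A rough sketch of what such a control store might look like for the four-instruction machine: one unencoded bit per control signal, a shared fetch sequence, and a per-opcode microprogram. The word layout and dispatch rule here are assumptions for illustration and do not reproduce Figure 4 exactly.

```python
SIGNALS = ("PC_out", "MAR_in", "read", "pcincr", "MDR_out", "IR_in", "IR_out",
           "ACC_in", "ACC_out", "aluadd", "MDR_in", "PC_in", "TEMP_out", "write")

def uword(*active):
    """One control word: a 14-bit vector, one bit per control signal."""
    return tuple(int(s in active) for s in SIGNALS)

# Shared instruction-fetch microinstructions (time steps T0-T2).
FETCH = (uword("PC_out", "MAR_in"),
         uword("read", "pcincr"),
         uword("MDR_out", "IR_in"))

# Per-opcode execute microprograms (time steps T4 onward).
MICROPROGRAMS = {
    "load":  (uword("IR_out", "MAR_in"), uword("read"), uword("MDR_out", "ACC_in")),
    "add":   (uword("IR_out", "MAR_in"), uword("read"),
              uword("ACC_out", "aluadd"), uword("TEMP_out", "ACC_in")),
    "store": (uword("IR_out", "MAR_in"), uword("ACC_out", "MDR_in"), uword("write")),
    "brz":   (uword("IR_out", "PC_in"),),   # asserted only when acceq0 is true
}

def microinstruction_stream(opcode):
    """Yield the control words for one macroinstruction, one per time step."""
    yield from FETCH
    yield from MICROPROGRAMS[opcode]
```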
Since the microprograms stand between the logic design (hardware) and the program being executed (software), they are sometimes referred to as firmware. This term is also subject to loose usage. Ascher Opler first defined it in a 1967 Datamation article as the contents of a writable control store, which could be reloaded as necessary to specialize the user interface of a computer to a particular programming language or application [1]. However, in later general usage, the term came to signify any type of microcode, whether resident in read-only or writable control store. Most recently, the term has been widened to denote anything ROM-resident, including macroinstruction-level routines for BIOS, bootstrap loaders, or specialized applications.
See chapters 4 and 5 of Hamacher, Vranesic, and Zaky [2] and chapter 5 of Patterson and Hennessy [3] for overviews of datapath design, control signals, hardwiring, and microprogramming. Older texts devoted exclusively to microprogramming issues include Agrawala and Rauscher [4], Andrews [5], Habib [6], and Husson [7]. For over thirty years the ACM and IEEE have sponsored an Annual Workshop on Microprogramming and published its proceedings; more recently the conference has been renamed the International Symposium on Microarchitecture, reflecting the shift in emphasis from microprogramming alone to the broader field of microarchitecture, i.e., the internal design of the processor, including pipelining, branch prediction, and multiple instruction execution.
In the late 1940's Maurice Wilkes of Cambridge University started work on a stored-program computer called the EDSAC (Electronic Delay Storage Automatic Calculator). During this effort, Wilkes recognized that the sequencing of control signals within the computer was similar to the sequencing actions required in a regular program and that he could use a stored program to represent the sequences of control signals. In 1951, he published the first paper on this technique, which he called microprogramming [8].
In Wilkes' seminal paper, he described an implementation of a control store using a diode matrix. The microinstructions held in the control store had a simple format: the unencoded control signals were stored with a next-address field. Initial selection of the appropriate microprogram was handled by using the opcode value appended with zeros as a starting address in the control store, and normal sequencing used the contents of the next-address fields thereafter. Conditional transfers were handled by allowing conditions to modify bits in the next-address fields.
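The following sketch models Wilkes-style sequencing as just described: each control-store word carries unencoded signals plus a next-address field, a selected condition may modify the low bit of that address, and initial dispatch appends zeros to the opcode. The field names, widths, and the one-bit condition splice are illustrative assumptions.

```python
from collections import namedtuple

# One control-store word: unencoded control signals plus sequencing fields.
MicroWord = namedtuple("MicroWord", "signals next_addr cond_select")
# cond_select is None for unconditional sequencing, or the name of a datapath
# condition whose current value (0 or 1) replaces the low bit of next_addr.

def next_address(word, conditions):
    """Compute the next control-store address from the current word."""
    if word.cond_select is None:
        return word.next_addr
    bit = 1 if conditions[word.cond_select] else 0
    return (word.next_addr & ~1) | bit          # splice the condition into bit 0

def start_address(opcode_value, addr_bits=8, opcode_bits=2):
    """Initial dispatch: the opcode value appended with zeros forms the
    starting address of that macroinstruction's microprogram."""
    return opcode_value << (addr_bits - opcode_bits)
```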
In an expanded paper published in 1953, Wilkes and his colleague Stringer further described the technique [9]. In this visionary paper, they considered issues of design complexity, test and verification of the control logic, alteration of and later additions to the control logic, support of different macroinstruction sets (by using different matrices), exploiting parallelism within the data path, multiway branches, environment substitution, microsubroutines, variable-length and polyphase timing, and pipelining the access to the control store. The Cambridge University group went on to implement the first microprogrammed computer, the EDSAC 2, in 1957 using an 8-by-6 magnetic core matrix.
Due to the difficulty of manufacturing fast control stores in the 1950's, microprogramming was not yet a mainstream technology. However, John Fairclough at IBM's laboratory in Hursley, England, led a development effort in the late 1950's that explored a read-only magnetic core matrix for the control unit of a small computer. Fairclough's experience with microprogramming played a key role in IBM's 1961 decision to pursue a full range of System/360 compatible computers [10]. All but two of the initial 360 models (the high-end Models 75 and 95) were microprogrammed, and this approach allowed the same program to be executed on various price/performance implementations [11].
|                                     | M30  | M40  | M50   | M65   |
|-------------------------------------|------|------|-------|-------|
| minimum rental fee ($K/mo.)         | 4    | 7    | 15    | 35    |
| Knight index - scientific (Kops)    | 7.94 | 33.4 | 187   | 1390  |
| Knight index - commercial (Kops)    | 17.1 | 50.1 | 149   | 810   |
| internal data path width (bits)     | 8    | 16   | 32    | 64    |
| control store size (K microinsts.)  | 4    | 4    | 2.75  | 2.75  |
| microinstruction width (bits)       | 50   | 52   | 85    | 87    |
| control store technology            | CCROS| TROS | BCROS | BCROS |
| control store cycle time (ns)       | 750  | 625  | 500   | 200   |
| main memory cycle time (ns)         | 1500 | 2500 | 2000  | 750   |
Table 1. Comparison of IBM System/360 Models 30-65 (ca. 1965) [10,12,13].
Because of the vital nature of microprogramming to the plan for compatibility, IBM aggressively pursued three types of read-only control store technologies for the System/360 models [10]. This included Balanced Capacitor Read-Only Storage (BCROS), using two capacitors per bit position in a storage of 2816 words of 100 bits each, and Transformer Read-Only Storage (TROS), using the magnetic core technology developed at Hursley in a storage of 8192 words of 54 bits each. For the Model 30, Card Capacitor Read-Only Storage (CCROS) was developed. This technology was based on mylar cards the size and shape of standard punch cards that encased copper tabs and access lines. A standard card punch could be used to remove some of the tabs from a 12 by 60 bit area in the middle of the card. Removed tabs were read as zeros by sense lines, while the remaining tabs were read as ones. An assembly of 42 boards with 8 cards per board was built into the Model 30 cabinet.
CCROS cards on the Model 30 could be replaced in the field. This made design modifications easier to accomplish, as well as providing for field diagnosis. The control store of the Model 30 was divided into a section for resident routines and a section of six CCROS cards that would hold test-specific routines [14]. The reserved board could be "populated" with one of several sets of CCROS cards by an engineer running machine diagnostics. The System/360 Model 25 and models of the later System/370 series were configured with at least one portion of the control store being read-write for loading microcode patches and microdiagnostics. (On the smaller System/370 models, the control store was allocated in part of main memory). Indeed, IBM invented the eight-inch floppy diskette and its drive to improve the distribution of microcode patches and diagnostics; the diskette was first used in 1971 on the System/370 Model 145, which had a writable control store (WCS) of up to 16K words of 32 bits each. Other computer manufacturers soon followed these techniques.
During the change-over of the IBM product line to the new System/360 in the mid-sixties, the large customer investment in legacy software, especially at the assembly language level, was not fully recognized [15]. Software conversion through recompilation of high-level-language programs was planned, but the pervasive use of machine-dependent code, which customers were unwilling (and could not economically afford) to convert in its entirety, required efforts at providing simulators of the older computers. Initial studies of the simulators, however, indicated that performance would suffer by factors ranging from two, at best, up to ten [10]. Competitors were meanwhile seeking to attract IBM customers by offering less intensive conversion efforts based on more compatible hardware and automatic conversion tools, like the Honeywell "Liberator" program that accepted IBM 1401 programs and converted them into programs for the 1401-like Honeywell H-200.
IBM was spared mass defection of former customers when engineers on the System/360 Model 30 suggested using an extra control store that could be selected by a manual switch and would allow the Model 30 to execute IBM 1401 instructions [16]. The simulator and control store ideas were then combined in a study of casting crucial parts of the simulator programs as microprograms in the control store. Stuart Tucker and Larry Moss led the effort to develop a combination of hardware, software, and microprograms to execute legacy software from not only the IBM 1401 computers but also the IBM 7000 series [17]. Moss felt their work went beyond mere imitation and equaled or excelled the original in performance; thus, he termed their work emulation [10]. The emulators they designed worked well enough that many customers never converted their legacy software and instead ran it on System/360 hardware under emulation for many years.
Because of the success of the IBM System/360 product line, by the late 1960's microprogramming became the implementation technique of choice for most computers except the very fastest and the very simplest. This situation lasted for about two decades. For example, all models of IBM System/370 and all models of the DEC PDP-11 (except the PDP-11/20) were microprogrammed. At perhaps the peak of microprogramming's popularity, the DEC VAX 11/780 was delivered in 1978 with a 4K word read-only control store of 96 bits per word and an additional 1K writable region available for diagnostics and microcode patches. An extra-cost option on the 11/780 was 1K of user-writable control store.
Several early microprocessors were hardwired, but some amount of microprogramming soon became a common control unit design feature. For example, among the major eight-bit microprocessors produced in the 1974 to 1976 time frame, the MC6800 was hardwired while the Intel 8080 and Zilog Z80 were microprogrammed [18]. An interesting comparison between 1978-era 16-bit microprocessors is the hardwired Z8000 [19] and the microcoded Intel 8086 [20]. The 8086 used a control store of 504 entries, each containing a rather generic 21-bit microinstruction. Extra decoding logic was used to tailor the microinstruction to the particular byte-wide or word-wide operation. In 1978 the microprogramming of the Motorola 68000 was described [18,21,22]. This design contained a sophisticated two-level scheme. Each 17-bit microinstruction could contain either a 10-bit microinstruction jump address or a 9-bit "nanoinstruction" address. The nanoinstructions were separately stored 68-bit words, and they identified the microoperations active in a given clock cycle. Additional decoding logic was used along with the nanoinstruction contents to drive the 196 control signals.
An article by Nick Tredennick in 1982 characterized the development of four major uses of microprogramming, which he termed cultures [23].
An important sub-area within the microprogrammable machine culture was the desire to produce a universal host machine with a general data path for efficient emulation of any macroinstruction set architecture or for various high-level-language-directed machine interfaces. Examples of this include the Nanodata QM-1 (ca. 1970) [4,28] and the Burroughs B1700 (ca. 1972) [28,29,30]. ALU operations on the QM-1 were selectable as 18-bit or 16-bit, binary or decimal, unsigned or two's complement or one's complement. However, even with efforts to provide a generalized set of operations such as this, the unavoidable mismatches between host and target machine data types and addressing structures never allowed unqualified success for the universal host concept.
The B1700 is probably the closest commercial machine to Opler's original vision of firmware. Multiple sets of microprograms could be held in its memory, each one presenting a different high-level macroinstruction interface. The B1700 operating system was written in a language called SDL, while application programs were typically written in Fortran, Cobol, and RPG. Thus, the control store for the B1700 could hold microprograms for an SDL-directed interface as well as microprograms for a Fortran-directed interface and a Cobol/RPG-directed interface. On an interrupt, the B1700 would use the SDL-directed microprograms for execution of operating system code, and then on process switch the operating system could switch the system to the Fortran-directed microprograms or the Cobol/RPG-directed microprograms. Baron and Higbie question the B1700's choice of 24-bit word size, lack of two's complement arithmetic support, address space limitations, and limited interrupt and I/O facilities, but they mainly attribute the limited commercial success of the B1700 to the lack of operating system documentation for developers [30].
Starting in the late seventies, macroinstruction sets exhibited a trend of growing complexity. The VAX-11/780 superminicomputer and the MC68020 microprocessor are examples of this trend, which has been labeled CISC, for complex instruction set computer. The trend toward increasing complexity is generally attributed to the success of microprogramming up to that date. Freed from previous constraints arising from implementation issues, the VAX designers emphasized ease of compilation [27]. For example, one design criterion was the generality of operand specification. However, the result was that the VAX had complex, variable-length instructions that made high-performance implementations difficult [31].
Compared to the original MC68000 (ca. 1979), the MC68020 (ca. 1984) added virtual memory support, unaligned data access, additional stack pointers and status register bits, six additional stack frame formats, approximately two dozen new instructions, two new addressing modes (for a total of 14), and 18 new formats for offsets within addressing modes (for a total of 25). To support this, microcode storage increased from approximately 36 Kbits to 85 Kbits [22]. The complex memory indirect addressing modes on the MC68020 are representative of the design style engendered by a control store with an access time that is only a fraction of the main memory access time. In this style, relatively few complex macroinstructions are fetched from main memory, and each one then generates a multitude of register transfers.
Added to the trend of growing complexity within "normal" macroinstruction set architectures was an emphatic call to "bridge the semantic gap" under the assumption that higher-level machine interfaces would simplify software construction and compilation; see chapter 1 of Myers [29] for an overview of this argument. However, starting in the 1980's, there was a reaction to the trend of growing complexity. Several technological developments drove this reaction. One development was that VLSI technology was making on-chip or near-chip cache memories attractive and cost effective; indeed, dual 256-byte instruction and data caches were included on the MC68030 (ca. 1987), and dual 4K-byte instruction and data caches were part of the MC68040 (ca. 1989). Thus effective memory access time was now closer to the processor clock cycle time.
A second development was the ability of VLSI designers to avoid the haphazard layouts and chip area requirements of random logic. Instead, designers could commit sum-of-products logic expressions to a programmable logic array (PLA). This structure consists of an and-gate array followed by an or-gate array, and it provides a straightforward implementation of sum-of-products expressions for control signals.
The change in the logic/memory technological ratio and the availability of a compact implementation for a hardwired control unit set the stage for a design style change to RISC, standing for reduced instruction set computer and named so as to heighten the contrast with CISC designs. The RISC design philosophy argues for simplified instruction sets for ease of hardwired, high-performance, pipelined implementations. In fact, one could view the RISC instruction sets as quite similar to highly-encoded microinstructions and the accompanying instruction cache as a replacement for a writable control store. Thus, instead of using a fixed set of microprograms for interpretation of a higher-level instruction set, RISC could be viewed as compiling directly to microcode. A counterargument to this view, given by some RISC proponents, is that RISC has less to do with exposing the microprogram level to the compiler and more to do with a rediscovery of Seymour Cray's supercomputer design style as exhibited in the CDC 6600 design of the mid-sixties and an agreement with John Cocke's set of hardware/software tradeoffs as exhibited in the IBM 801 design of the early eighties.
The 1980's proved to be a crucial turning point for traditional microprogramming. Without exception, modern-day RISC microprocessors are hardwired. The MC68000 macroinstruction set has also been downsized, so that an implementation of the remaining core macroinstruction set can be hardwired [32]. Even the newest processors for IBM System/390 mainframes are hardwired [33]. Application-specific tailoring of hardware now no longer depends on microprogramming but is typically approached as an ASIC or FPGA design problem.
Microcode is still used in the numerous Intel x86 compatible microprocessors like the Pentium II and AMD K6. However, within these designs, simple macroinstructions have one to four microinstructions (called uops or Rops, respectively) immediately generated by decoders without control store fetches. This suffices for all but the most complex macroinstructions, which still require a stream of microinstructions to be fetched from an on-chip ROM by a microcode sequencer. See Shriver and Smith [34] for a detailed exposition of the internal design of one of the current desktop x86 microprocessors, the AMD K6-2.
The level of abstraction of a microprogram can vary according to the amount of control signal encoding and the amount of explicit parallelism found in the microinstruction format. On one end of a spectrum, a vertical microinstruction is highly encoded and looks like a simple macroinstruction; it might contain a single opcode field and one or two operand specifiers (e.g., Microdata 1600). Thus each microinstruction can specify a single datapath operation that, when decoded, activates multiple control signals. Branches within vertical microprograms are typically handled as separate microinstructions using a "branch" opcode. This style of microprogramming is more natural to the assembly language programmer, but more microinstructions may be needed to accomplish a given task.
At the other end of the spectrum, a horizontal microinstruction might be completely unencoded and each control signal may be assigned to a separate bit position in the microinstruction format (as in Wilkes' original proposal). This extreme, however, is usually impractical since groups of control signals are often mutually exclusive (e.g., gating multiple sources onto the same internal bus) and the resulting representation efficiency is too low. Instead, horizontal microinstructions typically use several operation fields, each encoding one of several mutually exclusive choices (e.g., the System/360 Model 50 uses 25 separate fields in each 85-bit microinstruction). The multiple operation fields allow the explicit specification of parallel activities within the datapath. Branching in horizontal microprograms is also more complicated than in the vertical case; each horizontal microinstruction is likely to have at least one branch condition and associated target address and perhaps includes an explicit next-address field. The programming effect is more like developing a directed graph (with various cycles in the graph) of groups of control signals. Horizontal microprogramming is therefore less familiar to assembly language programmers and can be more error-prone. However, horizontal microprogramming typically provides better performance because of the opportunities to exploit parallelism within the datapath and to fold the microprogram branching decisions into the same clock cycle in which the datapath microoperations are active.
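To illustrate the contrast, here is one execute step of the earlier add instruction (T6: ACC_out, aluadd) expressed in both styles. The field names, widths, and encodings are invented for illustration and do not correspond to any particular machine's format.

```python
# Horizontal: several independent fields, each selecting one of a group of
# mutually exclusive microoperations, plus explicit sequencing information.
horizontal = {
    "bus_source":  "ACC",     # which register drives the internal bus
    "alu_op":      "add",     # ALU function select
    "bus_dest":    "none",    # no register gated in during this cycle
    "cond_select": "none",    # branch condition examined by the sequencer
    "next_addr":   0x1B,      # explicit successor microinstruction address
}

# Vertical: one highly encoded operation per microinstruction, decoded into
# multiple control signals; branching needs a separate microinstruction.
vertical = [
    ("ALU",    "add",    "ACC"),   # opcode, function, operand specifier
    ("BRANCH", "always", 0x1B),    # sequencing as its own microinstruction
]
```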
A combination of vertical and horizontal microinstructions in a two-level scheme is called nanoprogramming and was used in the Nanodata QM-1 and the Motorola 68000. The QM-1 used a 16K word microinstruction control store of 18 bits per word and a 1K word nanoinstruction control store of 360 bits per word. The microinstructions essentially formed calls to routines at the nanoinstruction level. The horizontal-style nanoinstruction was divided into five 72-bit fields. The first field (K) held a 10-bit branch address, various condition select subfields, and some control subfields. The remaining four fields (T1-T4) specified the particular microoperations to be performed; each T field contained 41 subfields. A nanoinstruction was executed in four phases: the K field was continuously active, and the T fields were executed in sequence, one per phase. This approach allowed one 360-bit nanoinstruction to specify the equivalent of four 144-bit nanoinstructions (i.e., the K field appended with a particular T field). Sequencing subfields within the T fields provided for repetitive execution of the same nanoinstruction until certain conditions became true, and other subfields provided for the conditional skipping of the next T field.
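A toy sketch of the two-level idea follows: narrow microinstructions that merely name wide nanoinstructions, so that identical wide control words are stored only once. The word contents here are invented; they are not the QM-1's or the 68000's.

```python
MICRO_STORE = [3, 0, 1, 2, 1]     # narrow words: each names a nano-store entry
NANO_STORE = {                    # wide words: the actual control signal sets
    0: {"ACC_out", "aluadd"},
    1: {"MDR_out", "ACC_in"},
    2: {"IR_out", "MAR_in"},
    3: {"PC_out", "MAR_in"},
}

def control_signal_stream():
    """Fetch each narrow microinstruction, then the wide nanoinstruction
    it points to; repeated uses of a wide word cost only a narrow word."""
    for micro_addr, nano_addr in enumerate(MICRO_STORE):
        yield micro_addr, NANO_STORE[nano_addr]
```

Because many microinstructions can name the same wide word, the total control storage can be much smaller than a single-level horizontal store of the same length.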
Interpretation of microinstructions can become more complicated than interpretation of normal macroinstructions due to some of the following features: bit steering, in which one field's value determines the interpretation of other field(s) in the microinstruction; environment substitution, in which fields from user-level registers are used as counts or operand specifiers for the current microinstruction; residual control, in which one microinstruction deposits control information in a special setup or configuration register that governs the actions and interpretation of subsequent microinstructions; polyphase specification, in which different fields in a microinstruction are active in different clock phases or clock cycles (e.g., Nanodata QM-1 as described above); and, multiphase specification, in which the time duration (i.e., number or length of clock phases or cycles) for an ALU activity such as addition would be explicitly lengthened to accommodate the time required for carry propagation (also a feature of the Nanodata QM-1).
Sequencing of microinstructions is also complicated by a basic rule of avoiding any microinstruction address additions except as simple increments to counters. Thus, address field bits may be or'ed into the control store address register or may conditionally replace all or part of the control store address register. Alternatively, a microinstruction might use multiple next-address fields. Subroutine calls may be available at the microprogram level, but the nesting level is typically restricted by the size of a dedicated control store return-address stack. For decoding macroinstructions, the initial mapping to a particular microprogram can be accomplished either by using the opcode value appended with zeros as a starting control store address (as in Wilkes' scheme) or by translating the opcode through a small mapping ROM or PLA that supplies the starting address.
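Below is a sketch of a microsequencer obeying these rules, with only increment, OR-in, conditional replace, call, and return; the operation names and the four-entry return stack are illustrative assumptions.

```python
class MicroSequencer:
    def __init__(self, stack_depth=4):
        self.csar = 0              # control store address register
        self.stack = []            # dedicated return-address stack
        self.depth = stack_depth

    def advance(self, op, field=0, condition=False):
        if op == "incr":                     # normal sequencing: counter increment
            self.csar += 1
        elif op == "or":                     # OR address field bits into the CSAR
            self.csar |= field
        elif op == "cond_replace":           # replace the CSAR only if condition true
            self.csar = field if condition else self.csar + 1
        elif op == "call":                   # microsubroutine call, bounded nesting
            assert len(self.stack) < self.depth, "return-address stack overflow"
            self.stack.append(self.csar + 1)
            self.csar = field
        elif op == "return":
            self.csar = self.stack.pop()
        return self.csar
```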
Most environments for microprogramming provide standard tools, such as micro-assemblers, linkers, loaders, and machine simulators (which may provide various forms of debugging and tracing facilities). These tools are, of necessity, machine dependent. Over the years, many efforts have been made toward developing high-level microprogramming languages (HLMLs), ranging from machine-dependent approaches to machine-independent approaches. The former typically could not be ported to other implementations or had large development costs required for porting (or instantiation); the latter typically generated rather inefficient microcode. Thus, over the years there arose no widely accepted microprogramming language.
One of the first published discussions of an attempt to provide a higher-level representation for microcode was Stuart Tucker's description of an internal IBM flowchart language for System/360 microcode [11]. Other early work included Samir Husson's 1970 text, in which he described a Fortran-like HLML [7], and Dick Eckhouse's 1971 Ph.D. dissertation, in which he described a PL/I-like language he called MPL. Other notable efforts include STRUM, developed by David Patterson in 1976 with the goal of microcode verification; STRUM was an Algol-like, block-structured language that provided loop assertions (assert) and pre- and post-procedure assertions (assume and conclude). The first HLML implemented on multiple machines, specifically the DEC VAX and the HP 300, was developed by Patterson and colleagues at DEC in 1979 and was named YALLL (Yet Another Low Level Language). See Subrata Dasgupta's and Scott Davidson's survey papers for more information [35,36].
A major design question for an HLML is whether (and, if so, how) to express parallelism and timing constraints. On one hand, a language like Dasgupta's S* (ca. 1978) uses a Pascal-like syntax but includes two timing control structures: cocycle and coend, which bracket parallel operations active in the same clock cycle; and stcycle and stend, which bracket parallel operations that start together in a new clock cycle (but do not necessarily have the same duration). On the other hand, Preston Gurd wrote a microcode compiler that used a subset of the C programming language as its source (Micro-C, ca. 1983) for his master's thesis at the University of Waterloo. Gurd found that microprograms written without explicit constraints were easier to understand and debug. In fact, debugging could take place on any system having a C compiler and could make use of standard debuggers. Moreover, providing a microcode compiler for an existing HLL allows preexisting code to be converted to microcode without further programming effort.
Because the efficiency of microcode is so important to the performance of a system, a major goal of microprogrammers, and consequently of their tools, is optimization. This optimization is called microcode compaction and attempts to ensure both that the microprograms will fit within a given size of control store and that the microprograms execute with maximum performance [5,6]. Such optimization may reorder the microoperations, so long as data and resource dependences are not violated. One advanced technique, called trace scheduling, was developed by Josh Fisher for horizontal microprograms [37]. Fisher's approach is to predict a likely path of execution (i.e., a "trace") and perform major optimization on that path. Off-trace paths will likely require compensation code to retain correctness. Adding code to infrequently-taken paths costs relatively little in execution time, but this optimization may be constrained by control store size limits.
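As a simple illustration of compaction (a much simpler technique than Fisher's trace scheduling), the following sketch performs first-fit packing of microoperations into horizontal microinstructions, honoring data dependences and one resource constraint: each microinstruction field may be used only once per word. The microoperation representation is an assumption for illustration.

```python
def compact(microops):
    """microops: list of dicts with 'reads' and 'writes' (sets of register
    names) and 'field' (the mutually exclusive microinstruction field the
    operation occupies).  Returns a list of microinstruction words, each a
    list of microoperations that may be active in the same time step."""
    words = []
    for op in microops:
        # Earliest legal word: just after the last word holding an operation
        # this one depends on (flow, anti, or output dependence).
        earliest = 0
        for i, word in enumerate(words):
            for prev in word:
                if (op["reads"] & prev["writes"] or
                        op["writes"] & prev["reads"] or
                        op["writes"] & prev["writes"]):
                    earliest = i + 1
        # First fit from there, subject to the field (resource) constraint.
        for word in words[earliest:]:
            if op["field"] not in {o["field"] for o in word}:
                word.append(op)
                break
        else:
            words.append([op])
    return words
```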
Current microprocessor design is characterized by performance and power optimizations. The major performance optimization is pipelining, in which a stream of (macro)instructions progresses from pipeline stage to pipeline stage with overlapping of instruction fetch, decode, and execution. Pipelining is actually a rather old idea, introduced into computer design by two late 1950's supercomputers: the IBM Stretch and the Univac LARC. Today, every microprocessor includes some degree of instruction pipelining [38,39].
Pipelining is disrupted by branches within a program. The IBM Stretch designers recognized this, but instead of stopping the processor at a branch and waiting until the branch was resolved as taken or untaken, they designed the processor such that it would preexecute index-register-related instructions (including index-related branches) and such that it would predict other branches as untaken. When the prediction was incorrect, the processor had to backtrack (i.e., recover) and start again from the taken branch. Most microprocessors today include some form of branch prediction. For example, the classic Pentium contains a 256-entry branch target buffer, which can hold the following information for up to 256 branches: the branch target address and a prediction of taken or untaken based on a 2-bit history. The Pentium Pro, Pentium II, and Pentium III (all of which are based on the same core design) contain an even more sophisticated branch prediction scheme where the branch history is organized not merely by branch instruction address but also by the previous branching patterns. Thus, for example, the Pentium Pro and subsequent models can correctly predict a branch that alternates direction on each execution [3,39].
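The 2-bit-history scheme mentioned above can be sketched as a small branch target buffer whose entries hold a target address and a saturating 2-bit counter. The table size, indexing, and replacement behavior below are illustrative assumptions rather than the Pentium's actual organization.

```python
class BranchTargetBuffer:
    def __init__(self, entries=256):
        self.entries = entries
        self.table = {}                  # index -> [2-bit counter, target]

    def predict(self, branch_addr):
        entry = self.table.get(branch_addr % self.entries)
        if entry is None:
            return False, None           # not in the buffer: predict untaken
        counter, target = entry
        return counter >= 2, target      # counter of 2 or 3: predict taken

    def update(self, branch_addr, taken, target):
        idx = branch_addr % self.entries
        counter, _ = self.table.get(idx, [1, None])
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
        self.table[idx] = [counter, target]
```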
Another emphasis on parallel activities in high-end microprocessors today is the execution of multiple user-level operations in the same cycle. A microprocessor that uses a standard instruction set but fetches, decodes, issues (i.e., starts execution), executes, and writes results back for multiple instructions per cycle is called a superscalar processor. The classic Pentium could perform this for up to two IA-32 instructions per cycle, and the Pentium Pro and subsequent models can do this for up to three IA-32 instructions per cycle [3]. The major problem in building a superscalar processor is to determine data and resource dependencies between the instructions and issue only those instructions that will not conflict. Data dependencies can be reduced by using hardware register renaming, in which each assignment to a register defined in the user-level instruction set causes the allocation of and assignment to a register from a large, processor-internal physical register set. Source register fields in subsequent instructions are renamed to reference the correct physical register. Another aggressive approach is to buffer multiple instructions once they are decoded and issue only those instructions from this buffer that are ready (i.e., that have no data or resource dependency conflicts), even if this means issuing them out of program order. Out-of-order execution was first developed for the CDC-6600 and IBM S/360 Model 91 in the 1960's. Today, the Pentium Pro and subsequent models provide both register renaming and out-of-order execution [3]. Current implementations also provide a result buffer (called a "reorder buffer" or "completion buffer") so that instructions that start and complete out of program order can write to assigned buffer locations, and yet the buffer can be unloaded in program order. This arrangement is used to guarantee that there will be a consistent processor state when branch mispredictions and instruction exceptions occur [38,39].
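To show the renaming idea in isolation, here is a sketch that rewrites a stream of (dest, src1, src2) instructions so that every destination gets a fresh physical register. The register counts, instruction format, and unbounded free list are simplifying assumptions; a real processor also recycles physical registers and handles branches and exceptions.

```python
def rename(instructions, num_arch=8):
    """instructions: list of (dest, src1, src2) architectural register numbers.
    Returns the same stream rewritten to use physical register numbers."""
    mapping = {r: r for r in range(num_arch)}   # architectural -> physical
    next_phys = num_arch                        # stand-in for a free list
    renamed = []
    for dest, src1, src2 in instructions:
        s1, s2 = mapping[src1], mapping[src2]   # read the newest mappings
        mapping[dest] = next_phys               # fresh register for each write
        renamed.append((next_phys, s1, s2))
        next_phys += 1
    return renamed

# Two writes to r1 no longer conflict once renamed:
# rename([(1, 2, 3), (1, 1, 4)])  ->  [(8, 2, 3), (9, 8, 4)]
```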
An alternative approach to multiple operations per cycle is reflected in an instruction set design that resembles horizontal microinstructions; this approach is called VLIW, standing for very long instruction word. In this approach, it is up to the compiler to completely determine data and resource dependencies among multiple operations and pack them into instruction words so that there will be no conflicts during execution. It is not surprising that Josh Fisher has been a major leader in VLIW computer design and that trace scheduling has been used to optimize VLIW programs [37].
A derived form of VLIW is planned for the Intel IA-64 instruction set. The design principles, pioneered by HP as well as Intel, are collectively called EPIC, standing for explicitly parallel instruction computing. The IA-64 instruction set will group three instructions per "bundle" and provide explicit dependency information. The dependency information must be determined by the compiler and will be used by the hardware to schedule the execution of the instructions. Therefore a processor designed according to the EPIC principles stands somewhere between a superscalar and a VLIW. EPIC also calls for predication, a technique that uses a set of predicate registers, each of which can hold a value of true or false. Rather than executing branch instructions to implement an if-then-else control structure within the program, the operations involved can be predicated (i.e., made conditional) on a given predicate register. Thus, operations from both the then-path and the else-path can flow through the pipeline without any disrupting branches, but only one set of operations will be allowed to write results back [40].
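A sketch of predicated execution in this sense: both arms of an if-then-else become straight-line operations guarded by predicate registers, and only the operations whose predicate is true are allowed to write back. The operation format is an assumption for illustration, not IA-64 encoding.

```python
import operator

def execute_predicated(ops, pred_regs, regs):
    """ops: list of (pred_reg, dest, fn, src1, src2).  Every operation flows
    through and is evaluated, but writes back only if its predicate is true."""
    for pred, dest, fn, src1, src2 in ops:
        result = fn(regs[src1], regs[src2])
        if pred_regs[pred]:                  # squash the write-back otherwise
            regs[dest] = result

# if (r1 == 0) r2 = r3 + r4; else r2 = r3 - r4;  with p1 = (r1 == 0), p2 = not p1
regs = {"r1": 0, "r2": 0, "r3": 5, "r4": 2}
pred_regs = {"p1": regs["r1"] == 0, "p2": regs["r1"] != 0}
ops = [("p1", "r2", operator.add, "r3", "r4"),
       ("p2", "r2", operator.sub, "r3", "r4")]
execute_predicated(ops, pred_regs, regs)     # r2 becomes 7, with no branches
```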
High-performance or low-power design is not merely a matter of deciding upon the types and levels of parallel activities and predictions supported by the datapath. The optimal use of any datapath requires an appropriate level of compiler optimization technology. This may include machine-independent optimizations such as constant-folding and procedure inlining as well as machine-dependent optimizations such as instruction scheduling for pipelines and array-tiling for caches [41]. Moreover, the new EPIC designs will require even more aggressive code optimizations to obtain the best performance [42,43].
[1] A Opler. Fourth-Generation Software. Datamation 13n1, January 1967.
[2] VC Hamacher, ZG Vranesic, and SG Zaky. Computer Organization. 3rd ed. New York: McGraw-Hill, 1990.
[3] DA Patterson and JL Hennessy. Computer Organization and Design: The Hardware/Software Interface. 2nd ed. San Mateo, CA: Morgan Kaufmann, 1998.
[4] AK Agrawala and TG Rauscher. Foundations of Microprogramming. New York: Academic Press, 1976.
[5] M Andrews. Principles of Firmware Engineering in Microprogram Control. Potomac, MD: Computer Science Press, 1980.
[6] S Habib (ed.), Microprogramming and Firmware Engineering Methods. New York: van Nostrand, 1988.
[7] SH Husson, Microprogramming: Principles and Practice. Englewood Cliffs, NJ: Prentice Hall, 1970.
[8] MV Wilkes. The Best Way to Design an Automated Calculating Machine. Manchester Univ. Computer Inaugural Conf., 1951, pp. 16-18. Reprinted in: MV Wilkes. The Genesis of Microprogramming. Annals Hist Computing 8n3:116-126, 1986.
[9] MV Wilkes and JB Stringer. Microprogramming and the Design of the Control Circuits in an Electronic Digital Computer. Proc Cambridge Phil Soc 49:230-238, 1953. Reprinted as chapter 11 in: DP Siewiorek, CG Bell, and A Newell. Computer Structures: Principles and Examples. New York: McGraw-Hill, 1982. Also reprinted in: MV Wilkes. The Genesis of Microprogramming. Annals Hist Computing 8n3:116-126, 1986.
[10] EW Pugh, LR Johnson, and JH Palmer. IBM's 360 and Early 370 Systems. Cambridge, MA: MIT Press, 1991.
[11] SG Tucker. Microprogram Control for System/360. IBM Syst Jrnl 6n4:222-241, 1967.
[12] A Padegs. System/360 and Beyond. IBM Jrnl Res Dev 25n5:377-390, 1981.
[13] M Phister Jr. Data Processing Technology and Economics. 2nd ed. Bedford, MA: Digital Press, 1979.
[14] AM Johnson. The Microdiagnostics for the IBM System 360 Model 30. IEEE Trans Comp C-20n7:798-803, 1971.
[15] SG Tucker. Personal communication. February 1999.
[16] MA McCormack, TT Schansman, and KK Womack. 1401 Compatibility Feature on the IBM System/360 Model 30. Comm ACM 8n12:773-776, 1965.
[17] SG Tucker. Emulation of Large Systems. CACM 8n12:753-761, 1965.
[18] F Anceau. The Architecture of Microprocessors. Workingham, England: Addison-Wesley, 1986.
[19] M Shima. Demystifying Microprocessor Design. IEEE Spectrum 16n7:22-30, 1979.
[20] J McKevitt and J Bayliss. New Options from Big Chips. IEEE Spectrum 16n3:28-34, 1979.
[21] S Stritter and N Tredennick. Microprogrammed Implementation of a Single Chip Microprocessor, Proc 11th Ann Microprogramming Workshop, 1978, pp. 8-16.
[22] N Tredennick. Experiences in Commercial VLSI Microprocessor Design. Microproc Microsyst 12n8:419-432, 1988.
[23] N Tredennick. The Cultures of Microprogramming. Proc 15th Ann Microprogramming Workshop, 1982, pp. 79-83.
[24] T Atkinson. Architecture of Series 60/Level 64. Honeywell Comp J 8n2:94-106, 1974.
[25] J Mick and J Brick. Bit-Slice Microprocessor Design. New York: McGraw-Hill, 1980.
[26] JR Larus. A Comparison of Microcode, Assembly Code, and High-Level Languages on the VAX-11 and RISC I. Comp Arch News 10n5:10-15, 1982.
[27] CG Bell, JC Mudge, and JE McNamara. Computer Engineering: A DEC View of Hardware Systems Design. Bedford, MA: Digital Press, 1978.
[28] AB Salisbury. Microprogrammable Computer Architectures. New York: Elsevier, 1976.
[29] GJ Myers. Advances in Computer Architecture. 2nd ed. New York: Wiley, 1978.
[30] RJ Baron and L Higbie. Computer Architecture: Case Studies. Reading, MA: Addison-Wesley, 1992.
[31] D Bhandarkar and DW Clark. Performance from Architecture: Comparing a RISC and a CISC with Similar Hardware Organization. Proc 4th Intl Conf on Arch Support for Prog Lang and Op Systs [ASPLOS], 1991, pp. 310-319.
[32] Motorola ColdFire. Available on-line at http://www.motorola.com/ColdFire
[33] CF Webb and JS Liptay. A High-Frequency Custom CMOS S/390 Microprocessor. IBM Jrnl Res Dev 41n4/5:463-473, 1997.
[34] B Shriver and B Smith. The Anatomy of a High-Performance Microprocessor: A Systems Perspective. Los Alamitos, CA: IEEE Computer Society Press, 1998.
[35] S Dasgupta. Some Aspects of High Level Microprogramming. ACM Comp Sur 12n3:295-323, 1980.
[36] S Davidson. Progress in High-Level Microprogramming. IEEE Software 3n4:18-26, 1986.
[37] BR Rau and J Fisher. Instruction-Level Parallel Processing: History, Overview, and Perspective. J Supercomputing 7:9-50, 1993.
[38] MJ Flynn. Computer Architecture: Pipelined and Parallel Processor Design. Boston: Jones and Bartlett, 1995.
[39] JL Hennessy and DA Patterson. Computer Architecture: A Quantitative Approach. 2nd ed. San Mateo, CA: Morgan Kaufmann, 1996.
[40] C Dulong. The IA-64 Architecture at Work. IEEE Computer 31n7:24-32, 1998.
[41] SS Muchnick. Advanced Compiler Design and Implementation. San Mateo, CA: Morgan Kaufmann, 1997.
[42] IMPACT compiler publications. Available on-line at http://www.crhc.uiuc.edu/IMPACT/index.html
[43] Trimaran compiler publications. Available on-line at http://www.trimaran.org/
mark@cs.clemson.edu