Bruce Jacob: Embedded Systems Research

Bruce Jacob - - Embedded Systems Research

Overview
Design Verification for Heterogeneous Systems
- Performance, Energy, and Memory-System Modeling (CASES 2001, IEEE-TC 2003, SPIE 2005)
- Heterogeneous Systems-on-Chip (SSE 2004)
- Power Issues (CASES 2001, CASES 2003)
Reinvestigating the Hardware/Software Interface (e.g., for predictability)
- Co-Design of RTOS and Microcontrollers (CASES 2001, CODES+ISSS 2003)
- Issues in the Design of DSP Systems (CASES 2001, ISCA 2004)
- Low-Overhead Interrupts (ICCD 2001, HiPC 2001)
- Real-Time Cache Management
Circuit Integrity (EMC 2004, SSE 2004, ASP-DAC 2005)

Embedded Systems Research: Overview

There are three main problems that our group is investigating in the area of embedded systems:

Problem one: modern embedded systems are designed, built, and then verified by hand. Personally, I believe this to be the single most important problem facing embedded systems today.

Problem two: modern microprocessors are built for best-case average performance (throughput), not predictability.

Problem three: modern digital systems are becoming more susceptible to elecromagnetic emissions (noise), thermal effects, and voltage fluctuations. I bundle these issues all together and call the field Circuit Integrity, a term that normally means "the completeness or intactness of an electrical wire or circuit," but here I use it to mean the correct behavior of all the components necessary for the correct operation of a typical digital system (clock network, data/control signals, power/ground references). Since this is not exactly specific to embedded systems, it is given its own home page.

By way of background, an "embedded system" is a computer that has been embedded into some larger framework to supplant a dedicated electro-mechanical system or circuit. The processor's job is typically control and/or signal processing. The reason it replaces the dedicated system/circuit is because microprocessors are (finally) cheaper than even extremely simple hand-built systems/circuits.

Addressing problem one ...

Embedded systems are designed at a high level using something like MatLab. The control portion of the system ("control" as in control theory) is represented as a set of differential equations, and the plant being controlled (e.g. a motor, a robotic arm, a heating coil, a disk spindle, etc.) is also represented as a set of differential equations. This form of modeling is very high-level; an evaluation of the systems's behavior can be made very quickly; the model is reasonably accurate; and this is "the way it is done" in embedded systems. However, once one has identified a set of control laws that work correctly, one must turn that into software running on a microprocessor. This is done by hand -- it cannot be automated (very successfully) with today's tools. Similarly, once one has identified a hardware specification for the plant being controlled (shape of the robotic arms, number and type of sensors, etc.), one must turn that into a physical design -- and this is also done by hand. Because the transformations are done by hand, there is no guarantee as to how accurately the final physical product reflects the differential equations from which it was specified, designed, and built. Moreover, because the software-level modeling is so high-level, the first time one can actually verify the system is at the point the physical components are integrated. That is, you have to build it before you can test it. Needless to say, that is extremely time-consuming, expensive, and error-prone.

(Here is a Quicktime Movie that illustrates what this is about, and what types of things people at the University of Maryland are doing to address the problem)

Our group has developed a simulation framework (called SimBed) in which one can model on a workstation the actual software and hardware components of the embedded system, as well as a functional model of the plant being controlled. Thus, the system can be designed -- and verified -- before you build any hardware at all. The software that is simulated in the verification is the same software that will be used in the final physical system. The benefit: verification can be relegated to software -- i.e., automated -- which means it can be faster, more comprehensive, and infinitely cheaper.

Addressing problem two ...

We are using the simulation environment described above to investigate novel microprocessor designs that enhance the reliability and predictability of embedded systems. Modern processors use many techniques to increase performance ... it is important to note that the AVERAGE performance of the microprocessor is increased by these mechanisms, and this is usually at the expense of the (predictability of the) INSTANTANEOUS performance. This works well for desktops, laptops, and servers, because users of these machines only care about "how fast does my spreadsheet recalculate?" "how fast is my website loaded?" "how fast is the screen redrawn?" This implies good average performance. However, designers of embedded systems care nothing about average performance and instead care only about the instantaneous performance of the system ... each instance of a piece of code must have the same behavior every time it is executed, otherwise static analysis of the system's behavior becomes VERY complicated and/or enormously pessimistic (yielding a highly inefficient and/or overbuilt system).

What we have been doing is looking at the boundary between hardware and software to find instances where functions traditionally performed on one side of that boundary can be moved to the other side, thereby increasing predictability, decreasing cost, and/or decreasing energy consumption ... and, if it is possible, also increasing performance. Specific projects: we have investigated moving queue manipulation from software (the RTOS, or real-time operating system) into hardware, which increases predictability, increases performance, and decreases energy consumption of the embedded system. We have investigated moving control of the processor cache from hardware to software, thereby decreasing cost and increasing predictability of the embedded system. We have investigated turning I/O devices into programmable hardware (a blurring of the HW/SW interface), thereby increasing performance and decreasing cost of the embedded system.

Additional issues ...

In general, the issues facing designers of embedded systems are very different from those involved in building general-purpose systems: typically, when building a general-purpose system, a designer cares first and foremost about performance; a designer of embedded systems typically cares most about reliability, predictability, cost, and then (maybe) performance.

Our group is investigating numerous aspects of embedded systems design; our goal is to make such systems more reliable, more predictable, and cheaper (both cheaper to build and cheaper to operate -- i.e. low power consumption). We have studied low-power issues such as dynamic voltage scaling heuristics and hardware support. We have looked at memory-system design for DSP systems. We have designed a system that makes microprocessors tolerant of intentional EMI ... much of it is below, but much of it is still in production.

If you have any questions, please contact me.

email address

Hardware/Software Co-Design of Embedded Microcontrollers and Real-Time Operating Systems

We have conceptualized a hardware/software co-designed processor architecture and real-time operating system (RTOS) framework that together eliminate most high-overhead operating system functions in an embedded system, thus maximizing the performance and predictability of real-time applications. This project is targeted to design and build a simulator, a prototype processor, and an experimental RTOS to demonstrate these claims.

Research Objectives

Our research goal is to improve the performance, predictability, energy consumption, and reliability of embedded systems at little to no increase in cost. Our prior investigations in RTOS technology and processor architectures have led us to several orthogonal hardware mechanisms that one can add to any processor architecture, either microcontroller or DSP. These mechanisms enable an RTOS to provide a virtual machine environment in an embedded system with near-zero overhead.

We have performed initial measurements of the effectiveness of our scheme; we have built a complete embedded-system simulator that simulates both the embedded microcontroller and the RTOS. This enables us to gather a large amount of information on the behavior of real-time systems and allows us to measure the effect of changes to the system architecture that require modifications to both hardware and software. Other measurement techniques do not allow such flexibility; software simulators that execute applications directly on an emulated processor neglect operating system activity, and systems that attach logic probes to real hardware obtain accurate measurements but do not allow modifications to the processor architecture.

Need for Experimentation

This type of research requires experimental software systems. One cannot obtain these measurements without actually building prototype systems; the variables are so numerous and their interactions so complex that realistic mathematical analysis is incredibly difficult, and when the system is simplified to the point that the analysis is feasible, the results are no longer realistic. Accurate simulation of systems is the preferred method for obtaining measurements of low-level behavior where the operating system meets the hardware.

This project is supported by the National Science Foundation through the following awards:

National Science Foundation. Hardware-Software Co-Design of Real-Time Operating Systems and Embedded Microprocessors. 3/2001-3/2005; $1,400,000. PI, with D. Stewart (Co-PI).
National Science Foundation. Hardware-Software Co-Design of an Experimental Real-Time Operating System and a Microcontroller Architecture. 9/1998-9/2000; $280,000; Co-PI, with D. Stewart (PI).

Papers and Theses on the Hardware Mechanisms Investigated To Date ...

"Extended Split-Issue: Enabling flexibility in the hardware implementation of NUAL VLIW DSPs." Bharath Iyer, Sadagopan Srinivasan, and Bruce Jacob. Proc. 31st International Symposium on Computer Architecture (ISCA'04), pp. 364-375. Munchen Germany, June 2004.
"Hardware support for real-time operating systems." Paul Kohout, Brinda Ganesh, and Bruce Jacob. Proc. First IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS 2003), pp. 45-51. Newport Beach CA, October 2003.
"Transparent data-memory organizations for digital signal processors." Sadagopan Srinivasan, Vinodh Cuppu, and Bruce Jacob. Proc. International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES 2001), pp. 44-48. Atlanta GA, November 2001.
Extended Split-Issue Mechanism in VLIW DSPs to Support SMT and Hardware-ISA Decoupling. Bharath Iyer. MS Thesis, University of Maryland at College Park. Winter 2003.
Nanoprocessors: Configurable Hardware Accelerators for Embedded Systems. Lei Zong. MS Thesis, University of Maryland at College Park. Winter 2003.
Architectural Support for Embedded Operating Systems. Brinda Ganesh. MS Thesis, University of Maryland at College Park. Winter 2003.
Hardware Support for Real-Time Operating Systems. Paul Kohout. MS Thesis, University of Maryland at College Park. Summer 2002.

Performance, Energy, and Memory-Subsystem Modeling

With embedded processor technology moving towards faster and smaller processors and systems on a chip, it becomes increasingly difficult to accurately evaluate real-time performance and power consumption. This research describes an evaluation method using an embedded architecture software emulator that models several embedded microprocessor architectures, including the Motorola M-CORE, the Texas Instruments C6000 DSP, and the Digital/Intel StrongARM.

Details can be found in the following papers and student theses:

"Instruction-level power dissipation in the Intel XScale embedded microprocessor." A. Varma, E. Debes, I. Kozintsev, and B. Jacob. Proc. SPIE's 17th Annual Symposium on Electronic Imaging Science & Technology, San Jose CA, January 2005.
"The performance and energy consumption of embedded real-time operating systems." K. Baynes, C. Collins, E. Fiterman, B. Ganesh, P. Kohout, C. Smit, T. Zhang, and B. Jacob. IEEE Transactions on Computers, vol. 52, no. 11, pp. 1454-1469. November 2003.
"The performance and energy consumption of three embedded real-time operating systems." K. Baynes, C. Collins, E. Fiterman, B. Ganesh, P. Kohout, C. Smit, T. Zhang, and B. Jacob. Proc. International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES 2001), pp. 203-210. Atlanta GA, November 2001.
Hardware Support for Real-Time Operating Systems. Paul Kohout. MS Thesis, University of Maryland at College Park. Summer 2002.
RTOS Performance and Energy Consumption Analysis Based on an Embedded System Testbed. Tiebing Zhang. MS Thesis, University of Maryland at College Park. Spring 2001.
An Evaluation of Embedded System Behavior Using Full-System Software Emulation. Christopher M. Collins. MS Thesis, University of Maryland at College Park. Spring 2000.

The following papers use the SimBed simulation environment:

"Extended Split-Issue: Enabling flexibility in the hardware implementation of NUAL VLIW DSPs." Bharath Iyer, Sadagopan Srinivasan, and Bruce Jacob. Proc. 31st International Symposium on Computer Architecture (ISCA'04), pp. 364-375. Munchen Germany, June 2004.
"Hardware support for real-time operating systems." Paul Kohout, Brinda Ganesh, and Bruce Jacob. Proc. First IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS 2003), pp. 45-51. Newport Beach CA, October 2003.
"A control-theoretic approach to dynamic voltage scheduling." A. Varma, B. Ganesh, M. Sen, S. R. Choudhary, L. Srinivasan, and B. Jacob. Proc. International Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES 2003). San Jose CA, October 2003.
"Transparent data-memory organizations for digital signal processors." Sadagopan Srinivasan, Vinodh Cuppu, and Bruce Jacob. Proc. International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES 2001), pp. 44-48. Atlanta GA, November 2001.

We have also performed extremely realistic modeling of modern DRAM systems, as can be found in the following papers (see the section on DRAM Research for detailed abstracts on each paper):

"A performance comparison of contemporary DRAM architectures." Vinodh Cuppu, Bruce Jacob, Brian Davis, and Trevor Mudge. Proc. 26th International Symposium on Computer Architecture (ISCA'99), pp. 222-233. Atlanta GA, May 1999.
"Concurrency, latency, or system overhead: Which has the largest impact on uniprocessor DRAM-system performance?" Vinodh Cuppu and Bruce Jacob. Proc. 28th International Symposium on Computer Architecture (ISCA'01), pp. 62-71. Goteborg Sweden, June 2001.
"High performance DRAMs in workstation environments." Vinodh Cuppu, Bruce Jacob, Brian Davis, and Trevor Mudge. IEEE Transactions on Computers, vol. 50, no. 11. November 2001. (TC Special Issue on High-Performance Memory Systems)

Highly Integrated, Heterogeneous Systems-on-Chip

(note: this section is cross-listed with the Circuit Integrity page)

It is clear from the decades-long trends of miniaturization and integration that future systems-on-chip will combine all technologies currently found in modern embedded systems -- including digital, analog and MEMS components. We are working to develop interfaces, models, and basic understanding to make this integration possible. Current (published) work focuses on integrating digital signal processing with MEMS sensors, and developing high-level power models for (digital) SoC components such as cores, memories, I/O controllers, and busses. Other work includes topologies and protocols for networks-on-chip and related issues in circuit integrity.

Details can be found in the following papers:

"MEMS-based embedded sensor virtual components for system-on-a-chip (SoC)." M. Afridi, A. Hefner, D. Berning, C. Ellenwood, A. Varma, B. Jacob, S. Semancik. Solid-State Electronics, vol. 48, no. 10/11, pp. 1777-1781. October/November 2004.
"Instruction-level power dissipation in the Intel XScale embedded microprocessor." A. Varma, E. Debes, I. Kozintsev, and B. Jacob. Proc. SPIE's 17th Annual Symposium on Electronic Imaging Science & Technology, San Jose CA, January 2005.

Issues in Digital Signal Processors

Flexibility in Hardware Implementation of VLIW DSPs

Extended Split-Issue is a technique that enables a designer of VLIW microprocessors (examples of VLIW processors include Intel's Itanium processor; Transmeta's Crusoe; and most high-end digital signal processors, such as those from Texas Instruments, Motorola, Analog Devices, and Agere Systems) to overcome the most significant drawback of VLIW architectures--namely, that of cross-compatibility--without sacrificing performance.

The term 'VLIW' stands for 'very long instruction word' and refers to a class of processor architectures popular in modern energy-efficient yet high-performance systems. VLIW architectures take much of the dynamic decision-making logic out of hardware and move it instead into the compiler. The degree to which VLIW specifies the required hardware sets it apart from other architectures and enables the development of high-performance processors that are efficient in their die area and power consumption, relative to other architecture means of achieving high performance. The trade-off is that code written for a VLIW processor cannot easily be run on another VLIW processor with a different hardware configuration. Among other things, this sharply limits the designer's ability to offer multiple processor implementations--for instance, to offer multiple designs at different cost-performance price points such as a DSP with one hardware multiplier, or two hardware multipliers, or none at all using software emulation. Such cross-compatibility enabled the long-lived success of both IBM's System/360 and Intel's x86 architecture families.

The Extended Split-Issue mechanism can be applied to any processor design that uses VLIW, especially "NUAL" VLIWs, in which the compiler has intimate knowledge of the processor pipeline organization. The mechanism allows any hardware implementation to emulate the configuration expected by the compiler, thereby providing cross-compatibility between dissimilar processor implementations. By enabling VLIW code to run on any processor implementation, even if that implementation is not compatible with the code, Extended Split-Issue provides de facto cross-compatibility in the VLIW domain, thus solving a problem that has been open since VLIW was introduced three decades ago.

Details can be found in the following papers:

"Extended Split-Issue: Enabling flexibility in the hardware implementation of NUAL VLIW DSPs." Bharath Iyer, Sadagopan Srinivasan, and Bruce Jacob. Proc. 31st International Symposium on Computer Architecture (ISCA'04), pp. 364-375. Munchen Germany, June 2004.
Extended Split-Issue Mechanism in VLIW DSPs to Support SMT and Hardware-ISA Decoupling. Bharath Iyer. MS Thesis, University of Maryland at College Park. Winter 2003.

Memory Systems for Digital Signal Processors

Today's digital signal processors (DSPs), unlike general-purpose processors, use a non-uniform addressing model in which the primary components of the memory system--the DRAM and dual tagless SRAMs--are referenced through completely separate segments of the address space. The recent trend of programming DSPs in high-level languages instead of assembly code has exposed this memory model as a potential weakness, as the model makes for a poor compiler target. In many of today's high-performance DSPs this non-uniform model is being replaced by a uniform model--a transparent organization like that of most general-purpose systems, in which all memory structures share the same address space as the DRAM system.

In such an architecture, one must replace the DSP's traditional tagless SRAMs with something resembling a general-purpose cache. This study investigates an alternative on-chip data cache design for a high-performance DSP, the Texas Instruments 'C6000 VLIW DSP. Rather than simply adding tags to the large on-chip SRAM structure, we take advantage of the relatively regular memory access behavior of most DSP applications and replace the tagless SRAM with a near-traditional cache that uses a very small number of wide blocks. We find that one can achieve nearly the same performance as a tagless SRAM while using a much smaller footprint.

Details can be found in the following paper:

"Transparent data-memory organizations for digital signal processors." Sadagopan Srinivasan, Vinodh Cuppu, and Bruce Jacob. Proc. International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES 2001), pp. 44-48. Atlanta GA, November 2001.

Low-Power Embedded Systems

This study presents the modeling of embedded systems with SimBed, an execution-driven simulation testbed that measures the execution behavior and power consumption of embedded applications and RTOSs by executing them on an accurate architectural model of a microcontroller with simulated real-time stimuli. We briefly describe the simulation environment and present a study that compares three RTOSs: uC/OS-II, a popular public-domain embedded real-time operating system; Echidna, a sophisticated, industrial-strength (commercial) RTOS; and NOS, a bare-bones multi-rate task scheduler reminiscent of typical "roll-your-own" RTOSs found in many commercial embedded systems. The microcontroller simulated in this study is the Motorola M-CORE processor: a low-power, 32-bit CPU core with 16-bit instructions, running at 20MHz. Our simulations show what happens when RTOSs are pushed beyond their limits, and they depict situations in which unexpected interrupts or unaccounted-for task invocations disrupt timing, even when the CPU is lightly loaded. In general, there appears no clear winner in timing accuracy between preemptive systems and cooperative systems. The power-consumption measurements show that RTOS overhead is a factor of two to four higher than it needs to be, compared to the energy consumption of the minimal scheduler. In addition, poorly designed idle loops can cause the system to double its energy consumption--energy that could be saved by a simple hardware sleep mechanism.

Details can be found in the following papers:

"A control-theoretic approach to dynamic voltage scheduling." A. Varma, B. Ganesh, M. Sen, S. R. Choudhary, L. Srinivasan, and B. Jacob. Proc. International Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES 2003). San Jose CA, October 2003.
"The performance and energy consumption of embedded real-time operating systems." K. Baynes, C. Collins, E. Fiterman, B. Ganesh, P. Kohout, C. Smit, T. Zhang, and B. Jacob. IEEE Transactions on Computers, vol. 52, no. 11, pp. 1454-1469. November 2003.
"The performance and energy consumption of three embedded real-time operating systems." K. Baynes, C. Collins, E. Fiterman, B. Ganesh, P. Kohout, C. Smit, T. Zhang, and B. Jacob. Proc. International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES 2001), pp. 203-210. Atlanta GA, November 2001.
RTOS Performance and Energy Consumption Analysis Based on an Embedded System Testbed. Tiebing Zhang. MS Thesis, University of Maryland at College Park. May 2001.
"The performance and energy consumption of embedded real-time operating systems." Kathleen Baynes, Chris Collins, Eric Fiterman, Christine Smit, Tiebing Zhang, and Bruce Jacob. University of Maryland Systems and Computer Architecture Group Technical Report UMD-SCA-TR-2000-04. November 2000.

Low-Overhead Interrupts

The general-purpose precise interrupt mechanism, which has long been used to handle exceptional conditions that occur infrequently, is now being used increasingly often to handle conditions that are neither exceptional nor infrequent. One example is the use of interrupts to perform memory management--e.g., to handle translation lookaside buffer (TLB) misses in today's microprocessors. Because the frequency of TLB misses tends to increase with memory footprint, there is pressure on the precise interrupt mechanism to become more lightweight. When modern out-of-order processors handle interrupts precisely, they typically begin by flushing the pipeline. Doing so makes the CPU available to execute handler instructions, but it wastes potentially hundreds of cycles of execution time. However, if the handler code is small, it could potentially fit in the reorder buffer along with the user-level code already there. This essentially in-lines the interrupt-handler code. One good example of where this would be both possible and useful is in the TLB-miss handler in a software-managed TLB implementation. The benefits of doing so are two-fold: (1) the instructions that would otherwise be flushed from the pipe need not be re-fetched and re-executed; and (2) any instructions that are independent of the exceptional instruction can continue to execute in parallel with the handler code. In effect, doing so provides us with lockup-free TLBs. We simulate a lockup-free data-TLB facility on a processor model with a 4-way out-of-order core reminiscent of the Alpha 21264. We find that, by using lockup-free TLBs, one can get the performance of a fully associative TLB with a lockup-free TLB of one-fourth the size.

Details can be found in the following papers:

"Improving the precise interrupt mechanism of software-managed TLB miss handlers." Aamer Jaleel and Bruce Jacob. Proc. 8th International Conference on High Performance Computing (HiPC 2001), pp. 282-293. Hyderabad India, December 2001.
"In-line interrupt handling for software-managed TLBs." Aamer Jaleel and Bruce Jacob. Proc. 19th IEEE International Conference on Computer Design (ICCD-19), pp 62-67. Austin TX, September 2001.

Real-Time Cache/Memory Management

This study demonstrates the intractability of achieving statically predictable performance behavior with traditional cache organizations (i.e., the real-time cache problem) and describes a non-traditional organization-combined hardware and software techniques-that can solve the real-time cache problem. We show that the task of placing code and data in the memory system so as to eliminate conflicts in traditional direct-mapped and set-associative caches is NP-complete. We discuss alternatives in both software and hardware that can address the problem: using address translation with software support can eliminate non-predicted conflict misses, and explicit management of the cache contents can eliminate non-predicted capacity misses. We present a theoretical analysis of the performance benefits of managing the cache contents to extend the effective size of the cache.

Details can be found in the following paper:

"Real-time memory management: Compile-time techniques and run-time mechanisms that enable the use of caches in real-time systems." Bruce Jacob and Shuvra Bhattacharyya. University of Maryland Institute of Advanced Computer Studies (UMIACS) Technical Report UMIACS-TR-2000-60. September 2000.

Other Papers on Embedded Systems

The following papers from our research group discuss additional aspects of embedded systems:

"Hardware/software architectures for real-time caching." Bruce L Jacob. Proc. Second Workshop on Compiler and Architecture Support for Embedded Systems (CASES'99), pp. 135-138, Washington DC, October 1999.
"Hardware/software co-design of I/O interfacing hardware and real-time device drivers for embedded systems." David B Stewart and Bruce L Jacob. Proc. Second Workshop on Compiler and Architecture Support for Embedded Systems (CASES'99), Washington DC, October 1999.
and
IEEE Real-Time Applications Symposium--Works-in-Progress Section, Vancouver BC, Canada, June 1999.
"Cache design for embedded real-time systems." Bruce L Jacob. Embedded Systems Conference, Summer 1999. Danvers MA, June 1999.
The slides for the talk are available on-line in PDF format, and include many details not found in the paper.
"Software-managed caches: Architectural support for real-time embedded systems." Bruce L Jacob. CASES'98: Workshop on Compiler and Architecture Support for Embedded Systems. Washington DC, December 1998.

Other Publications