Bruce Jacob - Embedded Systems Research
Embedded Systems Research: Overview
There are three main problems that our group is investigating in the area of embedded systems:
By way of background, an "embedded system" is a computer that has been embedded into some larger framework to supplant a dedicated electro-mechanical system or circuit. The processor's job is typically control and/or signal processing. It replaces the dedicated system or circuit because microprocessors are (finally) cheaper than even extremely simple hand-built systems and circuits.
Addressing problem one ...
Embedded systems are designed at a high level using a tool such as MATLAB. The control portion of the system ("control" as in control theory) is represented as a set of differential equations, and the plant being controlled (e.g. a motor, a robotic arm, a heating coil, a disk spindle, etc.) is also represented as a set of differential equations. This form of modeling is very high-level; an evaluation of the system's behavior can be made very quickly; the model is reasonably accurate; and this is "the way it is done" in embedded systems.
However, once one has identified a set of control laws that work correctly, one must turn them into software running on a microprocessor. This is done by hand -- today's tools cannot automate it very successfully. Similarly, once one has identified a hardware specification for the plant being controlled (shape of the robotic arms, number and type of sensors, etc.), one must turn that into a physical design -- and this is also done by hand. Because the transformations are done by hand, there is no guarantee as to how accurately the final physical product reflects the differential equations from which it was specified, designed, and built. Moreover, because the software-level modeling is so high-level, the first point at which one can actually verify the system is when the physical components are integrated. That is, you have to build it before you can test it. Needless to say, that is extremely time-consuming, expensive, and error-prone.
(Here is a Quicktime Movie that illustrates what this is about, and what types of things people at the University of Maryland are doing to address the problem)
Our group has developed a simulation framework (called SimBed) in which one can model on a workstation the actual software and hardware components of the embedded system, as well as a functional model of the plant being controlled. Thus, the system can be designed -- and verified -- before any hardware is built at all. The software that is simulated in the verification is the same software that will be used in the final physical system. The benefit: verification can be relegated to software -- i.e., automated -- which means it can be faster, more comprehensive, and far cheaper.
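As a rough illustration of the idea (not SimBed itself), the following Python sketch closes the loop between a piece of controller code and a functional plant model entirely in software. The PI control law, the first-order thermal plant, and every gain and constant are hypothetical choices made for this example. The point is only that a function like control_step could, in principle, be the same code that later runs on the target, so its behavior can be checked before any hardware exists.

```python
# Minimal closed-loop simulation sketch. All names and constants are
# hypothetical; this is not SimBed and not a model of any real plant.

def control_step(setpoint, measurement, state, dt, kp=2.0, ki=1.0):
    """One invocation of a PI control law (the 'embedded software')."""
    error = setpoint - measurement
    state["integral"] += error * dt
    return kp * error + ki * state["integral"]


def plant_step(temperature, drive, dt, tau=1.0, gain=1.0):
    """Functional plant model: tau * dT/dt = -T + gain * u (forward Euler)."""
    return temperature + dt * (-temperature + gain * drive) / tau


def simulate(setpoint=50.0, steps=2000, dt=0.01):
    """Run controller and plant together and record the plant output."""
    temperature = 0.0
    state = {"integral": 0.0}
    history = []
    for _ in range(steps):
        drive = control_step(setpoint, temperature, state, dt)
        temperature = plant_step(temperature, drive, dt)
        history.append(temperature)
    return history


history = simulate()
print(f"final temperature: {history[-1]:.2f}")
```

Because the whole loop runs in software, a check such as "the output settles near the setpoint within 20 simulated seconds" can be automated and repeated after every change to the controller.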
Addressing problem two ...
We are using the simulation environment described above to investigate novel microprocessor designs that enhance the reliability and predictability of embedded systems. Modern processors use many techniques to increase performance ... it is important to note that the AVERAGE performance of the microprocessor is increased by these mechanisms, and this is usually at the expense of the (predictability of the) INSTANTANEOUS performance. This works well for desktops, laptops, and servers, because users of these machines only care about questions such as "how fast does my spreadsheet recalculate?" "how fast is my website loaded?" "how fast is the screen redrawn?" All of these come down to good average performance. However, designers of embedded systems care nothing about average performance and instead care only about the instantaneous performance of the system ... each instance of a piece of code must have the same behavior every time it is executed, otherwise static analysis of the system's behavior becomes VERY complicated and/or enormously pessimistic (yielding a highly inefficient and/or overbuilt system).
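The gap between average and instantaneous performance can be seen in a toy model. The sketch below times the same code fragment on repeated invocations against a tiny direct-mapped cache that stays warm between invocations. The latencies, the cache size, and the function name invocation_times are assumptions made for illustration, not figures from any real processor.

```python
# Toy model: the same code fragment timed on repeated invocations.
# Latencies and cache size are hypothetical.
HIT_CYCLES = 1
MISS_CYCLES = 20


def invocation_times(addresses, invocations, num_lines=4):
    """Cycles taken by each invocation on a direct-mapped cache
    (one word per line) that stays warm between invocations."""
    cache = [None] * num_lines
    times = []
    for _ in range(invocations):
        cycles = 0
        for addr in addresses:
            index = addr % num_lines
            if cache[index] == addr:
                cycles += HIT_CYCLES
            else:
                cycles += MISS_CYCLES
                cache[index] = addr
        times.append(cycles)
    return times


times = invocation_times([0, 1, 2, 3], invocations=10)
average = sum(times) / len(times)
worst = max(times)
print(times, average, worst)
```

In this model the first invocation takes 80 cycles and the other nine take 4 cycles each, so the average is 11.6 cycles, yet a static analysis that must guarantee a deadline has to budget for 80.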
What we have been doing is looking at the boundary between hardware and software to find instances where functions traditionally performed on one side of that boundary can be moved to the other side, thereby increasing predictability, decreasing cost, and/or decreasing energy consumption ... and, if it is possible, also increasing performance. Specific projects: we have investigated moving queue manipulation from software (the RTOS, or real-time operating system) into hardware, which increases predictability, increases performance, and decreases energy consumption of the embedded system. We have investigated moving control of the processor cache from hardware to software, thereby decreasing cost and increasing predictability of the embedded system. We have investigated turning I/O devices into programmable hardware (a blurring of the HW/SW interface), thereby increasing performance and decreasing cost of the embedded system.
Additional issues ...
In general, the issues facing designers of embedded systems are very different from those involved in building general-purpose systems: typically, when building a general-purpose system, a designer cares first and foremost about performance; a designer of embedded systems typically cares most about reliability, predictability, cost, and then (maybe) performance.
Our group is investigating numerous aspects of embedded systems design; our goal is to make such systems more reliable, more predictable, and cheaper (both cheaper to build and cheaper to operate -- i.e. low power consumption). We have studied low-power issues such as dynamic voltage scaling heuristics and hardware support. We have looked at memory-system design for DSP systems. We have designed a system that makes microprocessors tolerant of intentional EMI ... much of it is below, but much of it is still in production.
If you have any questions, please contact me.
Hardware/Software Co-Design of Embedded Microcontrollers and Real-Time Operating Systems
We have conceptualized a hardware/software co-designed processor architecture and real-time operating system (RTOS) framework that together eliminate most high-overhead operating system functions in an embedded system, thus maximizing the performance and predictability of real-time applications. This project aims to design and build a simulator, a prototype processor, and an experimental RTOS to demonstrate these claims.
Our research goal is to improve the performance, predictability, energy consumption, and reliability of embedded systems at little to no increase in cost. Our prior investigations in RTOS technology and processor architectures have led us to several orthogonal hardware mechanisms that one can add to any processor architecture, either microcontroller or DSP. These mechanisms enable an RTOS to provide a virtual machine environment in an embedded system with near-zero overhead.
We have performed initial measurements of the effectiveness of our scheme; we have built a complete embedded-system simulator that simulates both the embedded microcontroller and the RTOS. This enables us to gather a large amount of information on the behavior of real-time systems and allows us to measure the effect of changes to the system architecture that require modifications to both hardware and software. Other measurement techniques do not allow such flexibility; software simulators that execute applications directly on an emulated processor neglect operating system activity, and systems that attach logic probes to real hardware obtain accurate measurements but do not allow modifications to the processor architecture.
Need for Experimentation
This type of research requires experimental software systems. One cannot obtain these measurements without actually building prototype systems; the variables are so numerous and their interactions so complex that realistic mathematical analysis is incredibly difficult, and when the system is simplified to the point that the analysis is feasible, the results are no longer realistic. Accurate simulation of systems is the preferred method for obtaining measurements of low-level behavior where the operating system meets the hardware.
This project is supported by the National Science Foundation through the following awards:
Papers and Theses on the Hardware Mechanisms Investigated To Date ...
Performance, Energy, and Memory-Subsystem Modeling
With embedded processor technology moving towards faster and smaller processors and systems on a chip, it becomes increasingly difficult to accurately evaluate real-time performance and power consumption. This research describes an evaluation method using an embedded architecture software emulator that models several embedded microprocessor architectures, including the Motorola M-CORE, the Texas Instruments C6000 DSP, and the Digital/Intel StrongARM.
Details can be found in the following papers and student theses:
Highly Integrated, Heterogeneous Systems-on-Chip
(note: this section is cross-listed with the Circuit Integrity page)
It is clear from the decades-long trends of miniaturization and integration that future systems-on-chip will combine all technologies currently found in modern embedded systems -- including digital, analog and MEMS components. We are working to develop interfaces, models, and basic understanding to make this integration possible. Current (published) work focuses on integrating digital signal processing with MEMS sensors, and developing high-level power models for (digital) SoC components such as cores, memories, I/O controllers, and busses. Other work includes topologies and protocols for networks-on-chip and related issues in circuit integrity.
Details can be found in the following papers:
Issues in Digital Signal Processors
Flexibility in Hardware Implementation of VLIW DSPs
Extended Split-Issue is a technique that enables a designer of VLIW microprocessors to overcome the most significant drawback of VLIW architectures--namely, the lack of cross-compatibility--without sacrificing performance. Examples of VLIW processors include Intel's Itanium processor; Transmeta's Crusoe; and most high-end digital signal processors, such as those from Texas Instruments, Motorola, Analog Devices, and Agere Systems.
The term 'VLIW' stands for 'very long instruction word' and refers to a class of processor architectures popular in modern energy-efficient yet high-performance systems. VLIW architectures take much of the dynamic decision-making logic out of hardware and move it instead into the compiler. The degree to which VLIW specifies the required hardware sets it apart from other architectures and enables the development of high-performance processors that are efficient in their die area and power consumption, relative to other architectural means of achieving high performance. The trade-off is that code written for a VLIW processor cannot easily be run on another VLIW processor with a different hardware configuration. Among other things, this sharply limits the designer's ability to offer multiple processor implementations--for instance, to offer multiple designs at different cost-performance points such as a DSP with one hardware multiplier, or two hardware multipliers, or none at all using software emulation. Such cross-compatibility enabled the long-lived success of both IBM's System/360 and Intel's x86 architecture families.
The Extended Split-Issue mechanism can be applied to any processor design that uses VLIW, especially "NUAL" VLIWs, in which the compiler has intimate knowledge of the processor pipeline organization. The mechanism allows any hardware implementation to emulate the configuration expected by the compiler, thereby providing cross-compatibility between dissimilar processor implementations. By enabling VLIW code to run on any processor implementation, even if that implementation is not compatible with the code, Extended Split-Issue provides de facto cross-compatibility in the VLIW domain, thus solving a problem that has been open since VLIW was introduced three decades ago.
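To make the compatibility problem concrete, here is a deliberately simplified Python sketch of the general idea of issuing one instruction word over several cycles on narrower hardware. It is not the Extended Split-Issue mechanism from the papers, and it ignores NUAL latencies entirely. The register names, the tuple encoding of operations, and the function execute_bundle are all hypothetical. The sketch assumes only the usual VLIW rule that every operation in a word reads its operands before any operation in that word writes a result, which is why results are buffered until the whole word has issued.

```python
# Toy illustration only: one VLIW word issued over several cycles on a
# machine with fewer multipliers than the compiler assumed.

def execute_bundle(regs, bundle, multipliers):
    """Execute one instruction word; return the number of issue cycles.

    bundle is a list of (dest, op, src1, src2) with op in {"mul", "add"}.
    Results are buffered and committed together so that every operation
    sees the register values from before the word started.
    """
    if multipliers < 1:
        raise ValueError("this sketch needs at least one multiplier")
    pending = {}
    remaining = list(bundle)
    cycles = 0
    while remaining:
        multipliers_used = 0
        deferred = []
        for dest, op, src1, src2 in remaining:
            if op == "mul":
                if multipliers_used == multipliers:
                    deferred.append((dest, op, src1, src2))  # wait a cycle
                    continue
                multipliers_used += 1
                pending[dest] = regs[src1] * regs[src2]
            else:
                pending[dest] = regs[src1] + regs[src2]
        remaining = deferred
        cycles += 1
    regs.update(pending)  # commit all results at once
    return cycles


bundle = [("r1", "mul", "r1", "r2"),
          ("r3", "mul", "r1", "r2"),
          ("r4", "add", "r1", "r2")]

regs_wide = {"r1": 2, "r2": 3, "r3": 0, "r4": 0}
regs_narrow = dict(regs_wide)
cycles_wide = execute_bundle(regs_wide, bundle, multipliers=2)
cycles_narrow = execute_bundle(regs_narrow, bundle, multipliers=1)
print(cycles_wide, cycles_narrow, regs_wide == regs_narrow)
```

With one multiplier the word takes two cycles instead of one, but both machines end with the same register contents. Without the buffering, the second multiply would read the already-updated r1 and produce a different answer, which is the kind of incompatibility that has to be hidden from the compiled code.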
Details can be found in the following papers:
Memory Systems for Digital Signal Processors
Today's digital signal processors (DSPs), unlike general-purpose processors, use a non-uniform addressing model in which the primary components of the memory system--the DRAM and dual tagless SRAMs--are referenced through completely separate segments of the address space. The recent trend of programming DSPs in high-level languages instead of assembly code has exposed this memory model as a potential weakness, as the model makes for a poor compiler target. In many of today's high-performance DSPs this non-uniform model is being replaced by a uniform model--a transparent organization like that of most general-purpose systems, in which all memory structures share the same address space as the DRAM system.
In such an architecture, one must replace the DSP's traditional tagless SRAMs with something resembling a general-purpose cache. This study investigates an alternative on-chip data cache design for a high-performance DSP, the Texas Instruments 'C6000 VLIW DSP. Rather than simply adding tags to the large on-chip SRAM structure, we take advantage of the relatively regular memory access behavior of most DSP applications and replace the tagless SRAM with a near-traditional cache that uses a very small number of wide blocks. We find that one can achieve nearly the same performance as a tagless SRAM while using a much smaller footprint.
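The intuition behind a small number of wide blocks can be sketched with a toy miss counter. The code below is an assumption-laden illustration, not the cache design evaluated in the study: it models a fully associative LRU cache, and the access trace (two interleaved sequential streams, as in a dot product), the base addresses, and the block sizes are all hypothetical.

```python
from collections import OrderedDict


def count_misses(addresses, num_blocks, block_words):
    """Misses in a fully associative LRU cache of num_blocks blocks,
    each block_words words wide (word addresses)."""
    cache = OrderedDict()  # block number -> None, ordered LRU to MRU
    misses = 0
    for addr in addresses:
        block = addr // block_words
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(cache) >= num_blocks:
                cache.popitem(last=False)  # evict least recently used
            cache[block] = None
    return misses


# Two interleaved sequential streams, e.g. x[i] and h[i] in a dot product.
N = 256
X_BASE, H_BASE = 0, 4096
trace = []
for i in range(N):
    trace.append(X_BASE + i)
    trace.append(H_BASE + i)

wide = count_misses(trace, num_blocks=2, block_words=16)
narrow = count_misses(trace, num_blocks=32, block_words=1)
print(wide, narrow)
```

Both configurations hold 32 words. On this regular trace the two wide blocks miss 32 times out of 512 accesses, while the 32 one-word blocks miss on every access; the wide organization also needs only two tags. Irregular access patterns would not behave this well, which is why the regularity of DSP workloads matters.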
Details can be found in the following paper:
Low-Power Embedded Systems
This study presents the modeling of embedded systems with SimBed, an execution-driven simulation testbed that measures the execution behavior and power consumption of embedded applications and RTOSs by executing them on an accurate architectural model of a microcontroller with simulated real-time stimuli. We briefly describe the simulation environment and present a study that compares three RTOSs: uC/OS-II, a popular public-domain embedded real-time operating system; Echidna, a sophisticated, industrial-strength (commercial) RTOS; and NOS, a bare-bones multi-rate task scheduler reminiscent of typical "roll-your-own" RTOSs found in many commercial embedded systems. The microcontroller simulated in this study is the Motorola M-CORE processor: a low-power, 32-bit CPU core with 16-bit instructions, running at 20 MHz.
Our simulations show what happens when RTOSs are pushed beyond their limits, and they depict situations in which unexpected interrupts or unaccounted-for task invocations disrupt timing, even when the CPU is lightly loaded. In general, there appears to be no clear winner in timing accuracy between preemptive systems and cooperative systems. The power-consumption measurements show that RTOS overhead is a factor of two to four higher than it needs to be, compared to the energy consumption of the minimal scheduler. In addition, poorly designed idle loops can cause the system to double its energy consumption--energy that could be saved by a simple hardware sleep mechanism.
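A back-of-the-envelope model shows how an idle loop can double energy consumption. The sketch below is not taken from the study's measurements: it assumes a hypothetical CPU that is busy half the time, a busy-wait idle loop that draws the same power as useful work, and an idealized sleep mode that draws nothing. The function name and the power figure are likewise made up for illustration.

```python
def energy_joules(utilization, interval_s, active_w, idle_w):
    """Energy over an interval: active power while busy, idle power otherwise."""
    busy_s = utilization * interval_s
    idle_s = interval_s - busy_s
    return active_w * busy_s + idle_w * idle_s


ACTIVE_W = 0.1      # hypothetical active power in watts
UTILIZATION = 0.5   # hypothetical: CPU does useful work half the time

# Busy-wait idle loop: the CPU keeps executing, so idle power equals active power.
spinning = energy_joules(UTILIZATION, 1.0, ACTIVE_W, idle_w=ACTIVE_W)
# Idealized hardware sleep: no power drawn while idle.
sleeping = energy_joules(UTILIZATION, 1.0, ACTIVE_W, idle_w=0.0)
ratio = spinning / sleeping
print(spinning, sleeping, ratio)
```

Under these assumptions the spinning system uses twice the energy of the sleeping one. At lower utilization the ratio grows, since more of the interval is spent in the idle loop.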
Details can be found in the following papers:
Low-Overhead Interrupts
The general-purpose precise interrupt mechanism, which has long been used to handle exceptional conditions that occur infrequently, is now being used increasingly often to handle conditions that are neither exceptional nor infrequent. One example is the use of interrupts to perform memory management--e.g., to handle translation lookaside buffer (TLB) misses in today's microprocessors. Because the frequency of TLB misses tends to increase with memory footprint, there is pressure on the precise interrupt mechanism to become more lightweight.
When modern out-of-order processors handle interrupts precisely, they typically begin by flushing the pipeline. Doing so makes the CPU available to execute handler instructions, but it wastes potentially hundreds of cycles of execution time. However, if the handler code is small, it could potentially fit in the reorder buffer along with the user-level code already there. This essentially in-lines the interrupt-handler code. One good example of where this would be both possible and useful is in the TLB-miss handler in a software-managed TLB implementation.
The benefits of doing so are two-fold: (1) the instructions that would otherwise be flushed from the pipe need not be re-fetched and re-executed; and (2) any instructions that are independent of the exceptional instruction can continue to execute in parallel with the handler code. In effect, doing so provides us with lockup-free TLBs. We simulate a lockup-free data-TLB facility on a processor model with a 4-way out-of-order core reminiscent of the Alpha 21264. We find that, by using lockup-free TLBs, one can get the performance of a fully associative TLB with a lockup-free TLB of one-fourth the size.
Details can be found in the following papers:
Real-Time Cache/Memory Management
This study demonstrates the intractability of achieving statically predictable performance behavior with traditional cache organizations (i.e., the real-time cache problem) and describes a non-traditional organization--combined hardware and software techniques--that can solve the real-time cache problem. We show that the task of placing code and data in the memory system so as to eliminate conflicts in traditional direct-mapped and set-associative caches is NP-complete. We discuss alternatives in both software and hardware that can address the problem: using address translation with software support can eliminate non-predicted conflict misses, and explicit management of the cache contents can eliminate non-predicted capacity misses. We present a theoretical analysis of the performance benefits of managing the cache contents to extend the effective size of the cache.
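A small example shows why placement matters in a direct-mapped cache. The Python sketch below is a toy model with hypothetical sizes and addresses, not the analysis from the paper: two 16-word arrays that together fit in a 32-word cache are accessed alternately, first at addresses that map to the same cache sets and then with one array relocated. The relocation stands in for what remapping through address translation could achieve.

```python
def misses_direct_mapped(addresses, num_sets, line_words):
    """Count misses in a direct-mapped cache (word addresses)."""
    tags = [None] * num_sets
    misses = 0
    for addr in addresses:
        line = addr // line_words
        index = line % num_sets
        if tags[index] != line:
            misses += 1
            tags[index] = line
    return misses


NUM_SETS, LINE_WORDS = 8, 4            # hypothetical 32-word cache
CACHE_WORDS = NUM_SETS * LINE_WORDS


def alternating_trace(a_base, b_base, length=16, passes=4):
    """Access a[i] then b[i] for each i, repeated for several passes."""
    trace = []
    for _ in range(passes):
        for i in range(length):
            trace.append(a_base + i)
            trace.append(b_base + i)
    return trace


# b starts a multiple of the cache size away from a: same sets, constant conflict.
conflicting = misses_direct_mapped(
    alternating_trace(0, 2 * CACHE_WORDS), NUM_SETS, LINE_WORDS)
# b shifted by 16 words: the arrays now occupy disjoint sets.
relocated = misses_direct_mapped(
    alternating_trace(0, 2 * CACHE_WORDS + 16), NUM_SETS, LINE_WORDS)
print(conflicting, relocated)
```

With the conflicting placement all 128 accesses miss, even though both arrays would fit in the cache at once. After relocation only the 8 compulsory misses remain. Finding such a conflict-free placement by inspection is easy for two arrays; the study's point is that the general placement problem is NP-complete.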
Details can be found in the following paper:
Other Papers on Embedded Systems
The following papers from our research group discuss additional aspects of embedded systems: