Bruce Jacob - Memory Management Research:
Fundamental Memory Management Research
My research in memory-management designs compares the performance of several combinations of MMU and page table, including Ultrix/MIPS, Mach/MIPS, BSD/x86, HP-UX/PA-RISC, and a system with no TLBs and a software-managed cache. These were all compared to a baseline system running with no virtual memory at all. The studies show several things:
The x86 memory-management organization outperforms other schemes, even with the handicap that every page table lookup needs two memory references. One reason is that the scheme does not use the precise interrupt mechanism and so avoids the overhead of taking an interrupt on every TLB miss. Also, because the page table is walked in hardware, the scheme does not use the I-cache and so avoids any memory overhead for fetching handler instructions.
Inverted tables can impact the data caches less than hierarchical page tables, even when their page table entries (PTEs) are four times as large as those of hierarchical tables and therefore should impact the data caches four times as much. This is due to the densely-packed nature of PTEs in an inverted table, as compared to the relatively sparsely-packed PTEs of a hierarchical table.
When one includes the overhead of cache misses inflicted on the application as a result of the VM system displacing user-level code and data, the overhead of the virtual memory system is roughly twice what was previously thought (10-20% rather than 5-10%). These numbers are normally not included in VM studies because, to make a comparison, one must execute the application without any virtual memory system. In addition, when one includes the overhead of handling VM-related interrupts, the total increases to three times what was previously thought: 10-30%.
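The "two memory references" per x86 page table lookup mentioned in the first finding can be illustrated with a minimal sketch of a top-down, two-level walk. It assumes the classic 32-bit x86 layout with 4 KB pages (10-bit directory index, 10-bit table index, 12-bit offset). The `PhysicalMemory` class, the `walk` function, and all addresses are hypothetical and exist only for illustration; this is not the simulator used in the studies.

```python
PAGE_SHIFT = 12                      # 4 KB pages
INDEX_BITS = 10                      # 10-bit directory index, 10-bit table index
INDEX_MASK = (1 << INDEX_BITS) - 1
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


class PhysicalMemory:
    """Toy memory that counts reads (hypothetical helper, not a real API)."""

    def __init__(self):
        self.words = {}
        self.reads = 0

    def read(self, addr):
        self.reads += 1
        return self.words.get(addr)

    def write(self, addr, value):
        self.words[addr] = value


def walk(mem, root_base, vaddr):
    """Translate vaddr top-down; return a physical address, or None on a fault."""
    dir_index = (vaddr >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK
    tbl_index = (vaddr >> PAGE_SHIFT) & INDEX_MASK

    # Reference 1: the directory entry gives the base of a second-level table.
    table_base = mem.read(root_base + 4 * dir_index)
    if table_base is None:
        return None
    # Reference 2: the page table entry gives the physical frame number.
    frame = mem.read(table_base + 4 * tbl_index)
    if frame is None:
        return None
    return (frame << PAGE_SHIFT) | (vaddr & OFFSET_MASK)


# Illustrative mapping: directory slot 1 -> table at 0x2000, table slot 3 -> frame 0x55.
mem = PhysicalMemory()
ROOT, TABLE = 0x1000, 0x2000
mem.write(ROOT + 4 * 1, TABLE)
mem.write(TABLE + 4 * 3, 0x55)

paddr = walk(mem, ROOT, 0x00403ABC)  # directory index 1, table index 3, offset 0xABC
```

After the walk, `mem.reads` holds the number of memory references the lookup cost. In this toy model that count is two for any mapped address, regardless of cache behavior, which the sketch does not model.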
Details can be found in the following paper:
High-Performance Interrupts
The general-purpose precise interrupt mechanism, which has long been used to handle exceptional conditions that occur infrequently, is now being used increasingly often to handle conditions that are neither exceptional nor infrequent. One example is the use of interrupts to perform memory management--e.g., to handle translation lookaside buffer (TLB) misses in today's microprocessors. Because the frequency of TLB misses tends to increase with memory footprint, there is pressure on the precise interrupt mechanism to become more lightweight. When modern out-of-order processors handle interrupts precisely, they typically begin by flushing the pipeline. Doing so makes the CPU available to execute handler instructions, but it wastes potentially hundreds of cycles of execution time. However, if the handler code is small, it could potentially fit in the reorder buffer along with the user-level code already there. This essentially in-lines the interrupt-handler code. One good example of where this would be both possible and useful is in the TLB-miss handler in a software-managed TLB implementation. The benefits of doing so are two-fold: (1) the instructions that would otherwise be flushed from the pipe need not be re-fetched and re-executed; and (2) any instructions that are independent of the exceptional instruction can continue to execute in parallel with the handler code. In effect, doing so provides us with lockup-free TLBs. We simulate a lockup-free data-TLB facility on a processor model with a 4-way out-of-order core reminiscent of the Alpha 21264. We find that, by using lockup-free TLBs, one can get the performance of a fully associative TLB with a lockup-free TLB of one-fourth the size.
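The in-lining condition described above (the handler must fit in the reorder buffer next to the user-level instructions already there) can be sketched as a simple decision. This is a toy model under stated assumptions, not the simulated design: the `ReorderBuffer` class, the `handle_tlb_miss` function, and the fallback to a conventional flush when the handler does not fit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReorderBuffer:
    capacity: int    # total entries (illustrative value, not a real core's size)
    in_flight: int   # user-level instructions currently occupying entries

    def free_entries(self):
        return self.capacity - self.in_flight


def handle_tlb_miss(rob, handler_len):
    """Return (action, instructions_flushed) for a TLB miss in this toy model."""
    if handler_len <= rob.free_entries():
        # The handler fits alongside the user code: in-line it, flush nothing.
        rob.in_flight += handler_len
        return "inline", 0
    # Assumed fallback: a conventional precise interrupt that flushes the pipe.
    flushed = rob.in_flight
    rob.in_flight = min(handler_len, rob.capacity)
    return "flush", flushed


rob = ReorderBuffer(capacity=64, in_flight=40)
first = handle_tlb_miss(rob, handler_len=10)    # 24 entries free, so it fits
second = handle_tlb_miss(rob, handler_len=20)   # only 14 entries free now
```

In the in-lined case nothing is flushed, so the in-flight user instructions need not be re-fetched or re-executed. The second benefit, independent instructions continuing to execute in parallel with the handler, is not modeled here.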
Details can be found in the following papers:
Software-Managed Caches
We present a feasibility study for performing virtual address translation without specialized translation hardware. Removing address translation hardware and instead managing address translation in software has the potential to make the processor design simpler, smaller, and more energy-efficient at little or no cost in performance. The purpose of this study is to describe the design and quantify its performance impact. Trace-driven simulations show that software-managed address translation is just as efficient as hardware-managed address translation. Moreover, mechanisms to support such features as shared memory, superpages, fine-grained protection, and sparse address spaces can be defined completely in software, allowing for more flexibility than in hardware-defined mechanisms.
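A minimal sketch of the idea, assuming a virtually addressed cache whose misses trap to a software handler that performs the translation. Hits need no translation at all, so no TLB is involved. The class name, the dictionary-based page table, and the line and page sizes are hypothetical choices for illustration, not the design evaluated in the study.

```python
PAGE_SHIFT = 12   # 4 KB pages (assumed)
LINE_SHIFT = 5    # 32-byte cache lines (assumed)


class SoftwareManagedCache:
    """Toy virtually addressed cache with a software miss handler."""

    def __init__(self, page_table, memory):
        self.lines = {}               # virtual line number -> cached data
        self.page_table = page_table  # virtual page number -> physical frame
        self.memory = memory          # physical line number -> data
        self.handler_calls = 0

    def load(self, vaddr):
        vline = vaddr >> LINE_SHIFT
        if vline not in self.lines:
            # A cache miss is the only point where translation is needed.
            self.lines[vline] = self.miss_handler(vaddr)
        return self.lines[vline]

    def miss_handler(self, vaddr):
        """Software translation: look up the frame, then fetch the line."""
        self.handler_calls += 1
        frame = self.page_table[vaddr >> PAGE_SHIFT]  # KeyError models a page fault
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        paddr = (frame << PAGE_SHIFT) | offset
        return self.memory[paddr >> LINE_SHIFT]


page_table = {0x2: 0x7}                   # virtual page 2 -> physical frame 7
memory = {0x7040 >> LINE_SHIFT: "line-data"}
cache = SoftwareManagedCache(page_table, memory)

a = cache.load(0x2040)   # miss: the handler translates and fills the line
b = cache.load(0x2044)   # same line: a hit, so no translation is performed
```

Because the page table here is an ordinary software data structure, features such as superpages or finer-grained protection would be changes to `miss_handler` rather than to hardware, which is the flexibility argument made above.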
Details can be found in the following papers:
Virtual Memory Primers
The following is a set of primer-level articles on virtual memory. The articles are listed in order from the most fundamental (easiest to understand for a VM novice) to the most advanced.