
A memory management unit (MMU), sometimes called a paged memory management unit (PMMU), is a computer hardware unit through which all memory references are passed, primarily performing the translation of virtual memory addresses to physical addresses. It is usually implemented as part of the central processing unit (CPU), but it can also be in the form of a separate integrated circuit.

An MMU effectively performs virtual memory management, handling at the same time memory protection, cache control, bus arbitration and, in simpler computer architectures (especially 8-bit systems), bank switching.

Overview

Schematic of the operation of an MMU[1]:186 ff.

Modern MMUs typically divide the virtual address space (the range of addresses used by the processor) into pages, each having a size which is a power of 2, usually a few kilobytes, but they may be much larger. The bottom bits of the address (the offset within a page) are left unchanged. The upper address bits are the virtual page numbers.[2]
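This split can be sketched as follows, assuming a hypothetical 4 KB (2^12-byte) page size; the page size and the sample address are illustrative, not from the text:

```python
PAGE_SHIFT = 12                        # log2 of the page size: 4 KB pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1    # 0xFFF, the bottom 12 bits

def split_virtual_address(va: int) -> tuple[int, int]:
    """Return (virtual page number, offset within the page)."""
    # The offset bits pass through unchanged; the upper bits are the VPN.
    return va >> PAGE_SHIFT, va & OFFSET_MASK

vpn, offset = split_virtual_address(0x00402ABC)   # vpn = 0x402, offset = 0xABC
```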

Page table entries


Most MMUs use an in-memory table of items called a "page table," containing one "page table entry" (PTE) per page, to map virtual page numbers to physical page numbers in main memory. An associative cache of PTEs is called a translation lookaside buffer (TLB) and is used to avoid the necessity of accessing the main memory every time a virtual address is mapped. Other MMUs may have a private array of memory[3] or registers that hold a set of page table entries. The physical page number is combined with the page offset to give the complete physical address.[2]
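A minimal sketch of this lookup, with the page table modeled as a plain dictionary (the VPN-to-PFN mappings below are made up for illustration, and a 4 KB page size is assumed):

```python
PAGE_SHIFT = 12
OFFSET_MASK = (1 << PAGE_SHIFT) - 1

page_table = {0x402: 0x1A3, 0x403: 0x090}    # hypothetical VPN -> PFN mappings

def translate(va: int) -> int:
    """Map a virtual address to a physical address, or raise on a miss."""
    vpn = va >> PAGE_SHIFT
    offset = va & OFFSET_MASK
    try:
        pfn = page_table[vpn]                # the PTE lookup
    except KeyError:
        raise RuntimeError("page fault")     # no valid PTE for this page
    return (pfn << PAGE_SHIFT) | offset      # PFN combined with the page offset
```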

A PTE may also include information about whether the page has been written to (the "dirty bit"), when it was last used (the "accessed bit," for a least recently used (LRU) page replacement algorithm), what kind of processes (user mode or supervisor mode) may read and write it, and whether it should be cached.

Sometimes, a PTE prohibits access to a virtual page, perhaps because no physical random access memory has been allocated to that virtual page. In this case, the MMU signals a page fault to the CPU. The operating system (OS) then handles the situation, perhaps by trying to find a spare frame of RAM and set up a new PTE to map it to the requested virtual address. If no RAM is free, it may be necessary to choose an existing page (known as a "victim"), using some replacement algorithm, and save it to disk (a process called "paging"). With some MMUs, there can also be a shortage of PTEs, in which case the OS will have to free one for the new mapping.[2]
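The fault-handling path above can be sketched as follows. This is a heavily simplified model with invented names: real OS code deals with hardware PTE formats, disk I/O, and locking, all elided here; the LRU policy is just one possible replacement algorithm.

```python
from collections import OrderedDict

NUM_FRAMES = 2
frames = OrderedDict()     # vpn -> frame number, ordered oldest-use first (LRU)
free_frames = [0, 1]
disk = {}                  # vpn -> page contents "paged" out to disk

def handle_page_fault(vpn: int) -> int:
    """Give `vpn` a physical frame, evicting a victim page if RAM is full."""
    if free_frames:
        frame = free_frames.pop()
    else:
        # Choose a victim (least recently used) and save it to disk ("paging").
        victim_vpn, frame = frames.popitem(last=False)
        disk[victim_vpn] = f"contents of page {victim_vpn}"
    frames[vpn] = frame    # set up the new mapping for the requested page
    return frame
```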

The MMU may also generate illegal access error conditions or invalid page faults upon illegal or non-existing memory accesses, respectively, leading to segmentation fault or bus error conditions when handled by the operating system.



Benefits

VLSI VI475 MMU "Apple HMMU" from the Macintosh II, used with the Motorola 68020


In some cases a page fault may indicate a software bug. A key benefit of an MMU is memory protection: an OS can use it to protect against errant programs by disallowing access to memory that a particular program should not have access to. Typically, an OS assigns each program its own virtual address space.[2]

An MMU also mitigates the problem of fragmentation of memory. After blocks of memory have been allocated and freed, the free memory may become fragmented (discontinuous) so that the largest contiguous block of free memory may be much smaller than the total amount. With virtual memory, a contiguous range of virtual addresses can be mapped to several non-contiguous blocks of physical memory.[2]

In some early microprocessor designs, memory management was performed by a separate integrated circuit such as the VLSI VI475 (1986), the Motorola 68851 (1984) used with the Motorola 68020 CPU in the Macintosh II, or the Z8015 (1985)[4] used with the Zilog Z8000 family of processors. Later microprocessors (such as the Motorola 68030 and the Zilog Z280) placed the MMU together with the CPU on the same integrated circuit, as did the Intel 80286 and later x86 microprocessors.

While this article concentrates on modern MMUs, commonly based on pages, early systems used a similar concept for base-limit addressing that further developed into segmentation. Those are occasionally also present on modern architectures. The x86 architecture provided segmentation, rather than paging, in the 80286, and provides both paging and segmentation in the 80386 and later processors (although the use of segmentation is not available in 64-bit operation).

Examples

Most modern systems divide memory into pages that are 4-64 KB in size, often with the capability to use huge pages from 2 MB to 512 MB in size. Page translations are cached in a translation lookaside buffer (TLB). Some systems, mainly older RISC designs, trap into the OS when a page translation is not found in the TLB. Most systems use a hardware-based tree walker. Most systems allow the MMU to be disabled, but some disable the MMU when trapping into OS code.

VAX

VAX pages are 512 bytes, which is very small. An OS may treat multiple pages as if they were a single larger page. For example, Linux on VAX groups eight pages together. Thus, the system is viewed as having 4 KB pages. The VAX divides memory into four fixed-purpose regions, each 1 GB in size. They are:

  • P0 space: used for general-purpose per-process memory such as heaps,
  • P1 space: (or control space) which is also per-process and is typically used for supervisor, executive, kernel, user stacks and other per-process control structures managed by the operating system,
  • S0 space: (or system space) which is global to all processes and stores operating system code and data, whether paged or not, including pagetables, and
  • S1 space: which is unused and "Reserved to Digital".

Page tables are big linear arrays. Normally, this would be very wasteful when addresses are used at both ends of the possible range, but the page table for applications is itself stored in the kernel's paged memory. Thus, there is effectively a two-level tree, allowing applications to have sparse memory layout without wasting a lot of space on unused page table entries. The VAX MMU is notable for lacking an accessed bit. OSes which implement paging must find some way to emulate the accessed bit if they are to operate efficiently. Typically, the OS will periodically unmap pages so that page-not-present faults can be used to let the OS set an accessed bit.

ARM

ARM architecture-based application processors implement an MMU defined by ARM's virtual memory system architecture. The current architecture defines PTEs for describing 4 KB and 64 KB pages, 1 MB sections and 16 MB super-sections; legacy versions also defined a 1 KB tiny page. The ARM uses a two-level page table if using 4 KB and 64 KB pages, or just a one-level page table for 1 MB sections and 16 MB sections.

TLB updates are performed automatically by page table walking hardware. PTEs include read/write access permission based on privilege, cacheability information, an NX bit, and a non-secure bit.[5]

IBM System/370 and successors

The IBM System/370 has had an MMU since the early 1970s. It was initially known as a dynamic address translation (DAT) box. It has the unusual feature of storing accessed and dirty bits outside of the page table. They refer to physical memory rather than virtual memory, and are accessed by special-purpose instructions. This reduces overhead for the OS, which would otherwise need to propagate accessed and dirty bits from the page tables to a more physically oriented data structure. This makes OS-level virtualization easier. These features have been inherited by succeeding mainframe architectures, up to the current z/Architecture.

DEC Alpha

The DEC Alpha processor divides memory into 8 KB pages. After a TLB miss, low-level firmware machine code (here called PALcode) walks a three-level tree-structured page table. Addresses are broken down as follows: 21 bits unused, 10 bits to index the root level of the tree, 10 bits to index the middle level of the tree, 10 bits to index the leaf level of the tree, and 13 bits that pass through to the physical address without modification. Full read/write/execute permission bits are supported.
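The field extraction described above can be sketched directly from the stated bit widths (13-bit offset for 8 KB pages, then three 10-bit indices; the function name is our own):

```python
def alpha_split(va: int) -> tuple[int, int, int, int]:
    """Split a 64-bit Alpha virtual address: 21 unused + 10 + 10 + 10 + 13 bits."""
    offset = va & ((1 << 13) - 1)     # 13 bits pass through to the physical address
    leaf   = (va >> 13) & 0x3FF       # 10 bits: index into the leaf level
    middle = (va >> 23) & 0x3FF       # 10 bits: index into the middle level
    root   = (va >> 33) & 0x3FF       # 10 bits: index into the root level
    return root, middle, leaf, offset
```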

MIPS

The MIPS architecture supports one to 64 entries in the TLB. The number of TLB entries is configurable at CPU configuration before synthesis. TLB entries are dual. Each TLB entry maps a virtual page number (VPN2) to either one of two page frame numbers (PFN0 or PFN1), depending on the least significant bit of the virtual address that is not part of the page mask. This bit and the page mask bits are not stored in the VPN2. Each TLB entry has its own page size, which can be any value from 1 KB to 256 MB in multiples of four. Each PFN in a TLB entry has a caching attribute, a dirty and a valid status bit. A VPN2 has a global status bit and an OS assigned ID which participates in the virtual address TLB entry match, if the global status bit is set to zero. A PFN stores the physical address without the page mask bits.
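A sketch of how one dual entry selects between PFN0 and PFN1, assuming a hypothetical fixed 4 KB page size (so bit 12 is the select bit); the entry's status bits, page mask, and ASID matching are elided, and the dictionary representation is our own:

```python
PAGE_SHIFT = 12

entry = {"vpn2": 5, "pfn0": 0x10, "pfn1": 0x11}   # made-up mapping

def lookup(tlb_entry: dict, va: int) -> int:
    """Translate `va` through one dual TLB entry, or raise on a refill miss."""
    vpn2 = va >> (PAGE_SHIFT + 1)          # page number minus the select bit
    if vpn2 != tlb_entry["vpn2"]:
        raise RuntimeError("TLB refill exception")
    select = (va >> PAGE_SHIFT) & 1        # lowest bit not covered by the page mask
    pfn = tlb_entry["pfn1"] if select else tlb_entry["pfn0"]
    return (pfn << PAGE_SHIFT) | (va & ((1 << PAGE_SHIFT) - 1))
```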

A TLB refill exception is generated when there are no entries in the TLB that match the mapped virtual address. A TLB invalid exception is generated when there is a match but the entry is marked invalid. A TLB modified exception is generated when there is a match but the dirty status is not set. If a TLB exception occurs while processing a TLB exception (a double-fault TLB exception), it is dispatched to its own exception handler.

MIPS32 and MIPS32r2 support 32 bits of virtual address space and up to 36 bits of physical address space. MIPS64 supports up to 64 bits of virtual address space and up to 59 bits of physical address space.

Sun 1

The original Sun 1 was a single-board computer built around the Motorola 68000 microprocessor and introduced in 1982. It included the original Sun 1 memory management unit that provided address translation, memory protection, memory sharing and memory allocation for multiple processes running on the CPU. All accesses by the CPU to private on-board RAM, external Multibus memory, on-board I/O and Multibus I/O ran through the MMU, where they were translated and protected in uniform fashion. The MMU was implemented in hardware on the CPU board.

The MMU consisted of a context register, a segment map and a page map. Virtual addresses from the CPU were translated into intermediate addresses by the segment map, which in turn were translated into physical addresses by the page map. The page size was 2 KB and the segment size was 32 KB which gave 16 pages per segment. Up to 16 contexts could be mapped concurrently. The maximum logical address space for a context was 1024 pages or 2 MB. The maximum physical address that could be mapped simultaneously was also 2 MB.

The context register was important in a multitasking operating system because it allowed the CPU to switch between processes without reloading all the translation state information. The 4-bit context register could switch between 16 sections of the segment map under supervisor control, which allowed 16 contexts to be mapped concurrently. Each context had its own virtual address space. Sharing of virtual address space and inter-context communications could be provided by writing the same values in to the segment or page maps of different contexts. Additional contexts could be handled by treating the segment map as a context cache and replacing out-of-date contexts on a least-recently used basis.

The context register made no distinction between user and supervisor states. Interrupts and traps did not switch contexts which required that all valid interrupt vectors always be mapped in page 0 of context, as well as the valid supervisor stack.[6]

PowerPC

In PowerPC G1, G2, G3, and G4 pages are normally 4 KB. After a TLB miss, the standard PowerPC MMU begins two simultaneous lookups. One lookup attempts to match the address with one of four or eight data block address translation (DBAT) registers, or four or eight instruction block address translation registers (IBAT), as appropriate. The BAT registers can map linear chunks of memory as large as 256 MB, and are normally used by an OS to map large portions of the address space for the OS kernel's own use. If the BAT lookup succeeds, the other lookup is halted and ignored.

The other lookup, not directly supported by all processors in this family, is via a so-called "inverted page table," which acts as a hashed off-chip extension of the TLB. First, the top four bits of the address are used to select one of 16 segment registers. Then 24 bits from the segment register replace those four bits, producing a 52-bit address. The use of segment registers allows multiple processes to share the same hash table.
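The formation of the 52-bit address can be sketched from the stated widths (top 4 bits select one of 16 segment registers; a 24-bit value replaces them, 24 + 28 = 52 bits). The register contents below are made up:

```python
segment_registers = [0x123456 + i for i in range(16)]   # hypothetical 24-bit VSIDs

def to_virtual(ea: int) -> int:
    """Form the 52-bit virtual address from a 32-bit effective address."""
    sr = (ea >> 28) & 0xF                # top 4 bits select a segment register
    vsid = segment_registers[sr]         # 24 bits from the segment register
    return (vsid << 28) | (ea & 0x0FFF_FFFF)   # replace the 4 bits with 24
```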

The 52-bit address is hashed, then used as an index into the off-chip table. There, a group of eight page table entries is scanned for one that matches. If none match due to excessive hash collisions, the processor tries again with a slightly different hash function. If this, too, fails, the CPU traps into the OS (with the MMU disabled) so that the problem may be resolved. The OS needs to discard an entry from the hash table to make space for a new entry. The OS may generate the new entry from a more-normal tree-like page table or from per-mapping data structures which are likely to be slower and more space-efficient. Support for no-execute control is in the segment registers, leading to 256 MB granularity.

A major problem with this design is poor cache locality caused by the hash function. Tree-based designs avoid this by placing the page table entries for adjacent pages in adjacent locations. An operating system running on the PowerPC may minimize the size of the hash table to reduce this problem.

It is also somewhat slow to remove the page table entries of a process. The OS may avoid reusing segment values to delay facing this, or it may elect to suffer the waste of memory associated with per-process hash tables. G1 chips do not search for page table entries, but they do generate the hash, with the expectation that an OS will search the standard hash table via software. The OS can write to the TLB. G2, G3, and early G4 chips use hardware to search the hash table. The latest chips allow the OS to choose either method. On chips that make this optional or do not support it at all, the OS may choose to use a tree-based page table exclusively.

IA-32 / x86

The x86 architecture has evolved over a very long time while maintaining full software compatibility, even for OS code. Thus, the MMU is extremely complex, with many different possible operating modes. Normal operation of the traditional 80386 CPU and its successors (IA-32) is described here.

The CPU primarily divides memory into 4 KB pages. Segment registers, fundamental to the older 8088 and 80286 MMU designs, are not used in modern OSes, with one major exception: access to thread-specific data for applications or CPU-specific data for OS kernels, which is done with explicit use of the FS and GS segment registers. All memory access involves a segment register, chosen according to the code being executed. The segment register acts as an index into a table, which provides an offset to be added to the virtual address. Except when using FS or GS, the OS ensures that the offset will be zero.

After the offset is added, the address is masked to be no larger than 32 bits. The result may be looked up via a tree-structured page table, with the bits of the address being split as follows: 10 bits for the branch of the tree, 10 bits for the leaves of the branch, and the 12 lowest bits being directly copied to the result. Some operating systems, such as OpenBSD with its W^X feature, and Linux with the Exec Shield or PaX patches, may also limit the length of the code segment, as specified by the CS register, to disallow execution of code in modifiable regions of the address space.
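The classic 10+10+12 split can be sketched as:

```python
def ia32_split(va: int) -> tuple[int, int, int]:
    """Split a 32-bit linear address for the two-level IA-32 page walk."""
    va &= 0xFFFF_FFFF              # masked to 32 bits after segmentation
    directory = va >> 22           # top 10 bits: page-directory index
    table = (va >> 12) & 0x3FF     # next 10 bits: page-table index
    offset = va & 0xFFF            # low 12 bits copied directly to the result
    return directory, table, offset
```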

Minor revisions of the MMU introduced with the Pentium have allowed very large 4 MB pages by skipping the bottom level of the tree. Minor revisions of the MMU introduced with the Pentium Pro introduced the physical address extension (PAE) feature, enabling 36-bit physical addresses via three-level page tables (with 9+9+2 bits for the three levels, and the 12 lowest bits being directly copied to the result; large pages become only 2 MB in size). In addition, the page attribute table allowed specification of cacheability by looking up a few high bits in a small on-CPU table.

No-execute support was originally only provided on a per-segment basis, making it very awkward to use. More recent x86 chips provide a per-page no-execute bit in the PAE mode. The W^X, Exec Shield, and PaX mechanisms described above emulate per-page no-execute support on x86 processors lacking the NX bit by setting the length of the code segment, with a performance loss and a reduction in the available address space.

x86-64

x86-64 is a 64-bit extension of x86 that almost entirely removes segmentation in favor of the flat memory model used by almost all operating systems for the 386 or newer processors. In long mode, all segment offsets are ignored, except for the FS and GS segments. When used with 4 KB pages, the page table tree has four levels instead of three. The virtual addresses are divided as follows: 16 bits unused, nine bits each for four tree levels (for a total of 36 bits), and the 12 lowest bits directly copied to the result. With 2 MB pages, there are only three levels of page table, for a total of 27 bits used in paging and 21 bits of offset. Some newer CPUs also support a 1 GB page with two levels of paging and 30 bits of offset.[7] CPUID can be used to determine if 1 GB pages are supported. In all three cases, the 16 highest bits are required to be equal to the 48th bit, or in other words, the low 48 bits are sign extended to the higher bits. This is done to allow a future expansion of the addressable range, without compromising backwards compatibility. In all levels of the page table, the page table entry includes a no-execute bit.
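The sign-extension ("canonical address") requirement and the 9+9+9+9+12 split for 4 KB pages can be sketched as:

```python
def is_canonical(va: int) -> bool:
    """Bits 63..48 must equal bit 47 (sign extension of the low 48 bits)."""
    top = va >> 47                      # the 17 highest bits, 63..47
    return top == 0 or top == 0x1FFFF   # all-zero or all-one

def x86_64_split(va: int) -> tuple[int, int, int, int, int]:
    """Split a canonical address into four 9-bit indices and a 12-bit offset."""
    assert is_canonical(va)
    offset = va & 0xFFF
    # Extract the four 9-bit indices, leaf (PT) up to root (PML4).
    pt, pd, pdpt, pml4 = ((va >> (12 + 9 * i)) & 0x1FF for i in range(4))
    return pml4, pdpt, pd, pt, offset
```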

Unisys MCP Systems (Burroughs B5000)

In a 2006 paper, Tanenbaum et al. pointed out[8] that the B5000 (and descendant systems) have no MMU. To understand the functionality provided by an MMU, it is instructive to study a counterexample: a system that achieves this functionality by other means.

The B5000 was the first commercial system to support virtual memory after the Atlas. It provides the two functions of an MMU in different ways. In the mapping of virtual memory addresses, instead of needing an MMU, the MCP systems are descriptor-based. Each allocated memory block is given a master descriptor with the properties of the block (i.e., the size, address, and whether present in memory). When a request is made to access the block for reading or writing, the hardware checks its presence via the presence bit (pbit) in the descriptor.

A pbit of 1 indicates the presence of the block. In this case, the block can be accessed via the physical address in the descriptor. If the pbit is zero, an interrupt is generated for the MCP (operating system) to make the block present. If the address field is zero, this is the first access to this block, and it is allocated (an init pbit). If the address field is non-zero, it is a disk address of the block, which has previously been rolled out, so the block is fetched from disk and the pbit is set to one and the physical memory address updated to point to the block in memory (another pbit). This makes descriptors equivalent to a page-table entry in an MMU system. System performance can be monitored through the number of pbits. Init pbits indicate initial allocations, but a high level of other pbits indicate that the system may be thrashing.
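The presence check described above can be sketched as follows. The field and function names are our own; actual B5000 descriptors are tagged hardware words, and the interrupt paths here are modeled as exceptions:

```python
from dataclasses import dataclass

@dataclass
class Descriptor:
    pbit: int       # 1 = block present in memory
    address: int    # physical address if present; disk address (or 0) if not

def access(desc: Descriptor) -> int:
    """Return the physical address of the block, or raise a pbit interrupt."""
    if desc.pbit == 1:
        return desc.address   # block present: access via the physical address
    if desc.address == 0:
        # First access to the block: the MCP would allocate it (an init pbit).
        raise RuntimeError("init pbit: first access, allocate the block")
    # Block was rolled out: the MCP would fetch it from the disk address.
    raise RuntimeError("pbit interrupt: fetch the block from disk")
```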

All memory allocation is therefore completely automatic (one of the features of modern systems[9]) and there is no way to allocate blocks other than this mechanism. There are no such calls as malloc or dealloc, since memory blocks are also automatically discarded. The scheme is also lazy, since a block will not be allocated until it is actually referenced. When memory is nearly full, the MCP examines the working set, trying compaction (since the system is segmented, not paged), deallocating read-only segments (such as code-segments which can be restored from their original copy) and, as a last resort, rolling dirty data segments out to disk.

Another way the B5000 provides a function of a MMU is in protection. Since all accesses are via the descriptor, the hardware can check that all accesses are within bounds and, in the case of a write, that the process has write permission. The MCP system is inherently secure and thus has no need of an MMU to provide this level of memory protection. Descriptors are read only to user processes and may only be updated by the system (hardware or MCP). (Words whose tag is an odd number are read-only; descriptors have a tag of 5 and code words have a tag of 3.)

Blocks can be shared between processes via copy descriptors in the process stack. Thus, some processes may have write permission, whereas others do not. A code segment is read only, thus reentrant and shared between processes. Copy descriptors contain a 20-bit address field giving the index of the master descriptor in the master descriptor array. This also implements a very efficient and secure IPC mechanism. Blocks can easily be relocated, since only the master descriptor needs update when a block's status changes.

The only other aspect is performance – do MMU-based or non-MMU-based systems provide better performance? MCP systems may be implemented on top of standard hardware that does have an MMU (for example, a standard PC). Even if the system implementation uses the MMU in some way, this will not be at all visible at the MCP level.

 
