Although it is an interesting subject, the ancient history of microprocessors is not really important to the issues at hand. It might be nice to learn how the young PC grew from a small, budding 4-bit system to the gigantic, strapping 64-bit Pentium. However, there are many books that have covered this subject and unfortunately, I don’t have the space. Besides, the Intel chips on which Linux runs are only the 80386 (or 100-percent compatible clones) and higher processors.
So, instead of setting the way-back machine to Charles Babbage and his Analytic Engine, we leap ahead to 1985 and the introduction of the Intel 80386. Even compared to its immediate predecessor, the 80286, the 80386 (386 for short) was a powerhouse. Not only could it handle twice the amount of data at once (now 32 bits), but its speed rapidly increased far beyond that of the 286.
New advances were added to increase the 386's power. Internal registers were added and their size was increased. Built into the 386 was the concept of virtual memory, which is a way to make it appear as though there is much more memory on the system than there actually is. This substantially increased system efficiency. Another major advance was the inclusion of a 16-byte prefetch cache, with which the CPU could load instructions before it actually processed them, thereby speeding things up even more. The most obvious speed increase came when the clock speed was raised from 8MHz to 16MHz.
Although the 386 had major advantages over its predecessors, at first its cost seemed relatively prohibitive. To give users access to its multitasking capability and still make the chip fit within their customers' budgets, Intel made an interesting compromise: by producing a new chip whose interface to the bus was 16 bits wide instead of 32, Intel made the chip a fair bit cheaper.
Internally, this new chip, designated the 80386SX, is identical to the standard 386. All the registers are there and they are fully 32 bits wide. However, data and instructions are accessed 16 bits at a time, requiring two bus accesses to fill the registers. Despite this shortcoming, the 80386SX is still faster than the 286.
Perhaps the most significant advance of the 386, for Linux as well as other PC-based UNIX systems, was its paging ability. I talked a little about paging in the section on operating system basics, so you already have a general idea of what paging is about, and I will go into more detail in the section on the kernel. However, I will talk about it a little here so you can fully understand the power the 386 has given us and see how the CPU helps the OS.
There are UNIX-like products that run on an 80286, such as SCO XENIX. In fact, there was even a version of SCO XENIX that ran on the 8086. Because Linux was first released for the 386, I won't go into any more detail about the 286 or the differences between the 286 and 386. Instead, I will describe the CPU Linux uses as a sort of abstract entity. In addition, because most of what I will be talking about is valid for the 486 and Pentium as well as the 386, I will simply call it "the CPU" instead of 386, 486, or Pentium.
(Note: Linux will also run on non-Intel CPUs, such as those from AMD or Cyrix. However, the issues I am going to talk about are all common to Intel-based or Intel-derived CPUs.)
I need to take a side-step here for a minute. On PC buses, multiple things are happening at once. The CPU is busily processing while much of the hardware is being accessed via DMA. Although these multiple tasks occur simultaneously on the system, this is not what is referred to as multitasking.
When I talk about multitasking, I am referring to multiple processes being in memory at the same time. Because the time the computer takes to switch between these processes, or tasks, is much shorter than the human brain can recognize, it appears as though the processes are running simultaneously. In reality, each process gets to use the CPU and other system resources for a brief time and then it's another process's turn.
As it runs, a process could use any part of system memory it needs. The problem is that a portion of RAM that one process wants may already contain code from another process. Rather than allowing each process to access any part of memory it wants, protections keep one program from overwriting another. This protection is built into the CPU and is called, quite logically, "protected mode." Without it, Linux could not function.
Note, however, that just because the CPU is in protected mode does not necessarily mean that the protections are being utilized. It simply means that the operating system can take advantage of the built-in abilities if it wants.
Although this capability is built into the CPU, it is not the default mode. Instead, the CPU starts in what I like to call "DOS-compatibility mode." The correct term, however, is "real mode." Real mode is a real danger to an operating system like UNIX. In this mode, there are no protections (which makes sense because protections exist only in protected mode). A process running in real mode has complete control over the entire system and can do anything it wants. Trying to run a multiuser system in real mode would therefore be a nightmare. All the protections would have to be built into the process because the operating system couldn't prevent a process from doing what it wanted.
A third mode, called "virtual mode" (more formally, virtual-8086 mode), is also built in. In virtual mode, the CPU behaves to a limited degree as though it is in real mode. However, when a process attempts to directly access registers or hardware, the instruction is caught, or trapped, and the operating system is allowed to take over.
Let's get back to protected mode, because this is what makes multitasking possible. When in protected mode, the CPU can use virtual memory. As I mentioned, this is a way to trick the system into thinking that there is more memory than there really is. There are two ways of doing this. The first is called swapping, in which the entire process is loaded into memory. It is allowed to run its course for a certain amount of time, and when its turn is over, another process is allowed to run. What happens when there is not enough room for both processes to be in memory at the same time? The only solution is that the first process is copied out to a special part of the hard disk called the swap space, or swap device, and the next process is loaded into memory and allowed its turn. The second way is called paging, and we will get to it in a minute.
Because it takes such a large portion of the system resources to swap processes in and out of memory, swapping can be very inefficient, especially when you have a lot of processes running. So let's take this a step further. What happens if there are so many processes that the system spends all of its time swapping? Not good.
To avoid this problem, a mechanism was devised whereby only those parts of the process that are needed are in memory. As it goes about its business, a program may need to access only a small portion of its code. As I mentioned earlier, empirical tests show that a program spends 80 percent of its time executing 20 percent of its code. So why bother bringing in the parts that aren't being used? Why not wait and see whether they are used?
To make things more efficient, only those parts of the program that are needed (or expected to be needed) are brought into memory. Rather than accessing memory in random units, memory is divided into 4K chunks, called pages. Although there is nothing magic about 4K per se, this value is easily manipulated: the CPU references data in 32-bit (4-byte) chunks, and 1,024 (1K) of these 4-byte chunks make up one 4,096-byte page. Later you will see how this helps things work out.
As I mentioned, only the part of the process currently being used needs to be in memory. When the process wants to read something that is not currently in RAM, the system needs to go out to the hard disk to pull in the other parts of the process; that is, it goes out and reads in new pages. This mechanism is called paging. When the process attempts to read from a part of the process that is not in physical memory, a "page fault" occurs.
One thing you must bear in mind is that a process can jump around a lot. Functions are called, sending the process off somewhere completely different. It is possible, even likely, that the page containing the memory location to which the process needs to jump is not currently in memory. Because the process is trying to read a part of itself that is not in physical memory, this, too, is called a page fault. As memory fills up, pages that haven't been used in some time are replaced by new ones. (I'll talk much more about this whole business later.)
Assume that a process has just made a call to a function somewhere else in the code and the page it needed has been brought into memory. Now there are two pages of the process, from completely different parts of the code, in RAM. Should the process take another jump or return from the function, it needs to know whether the destination is in memory. The operating system could keep track of this, but it doesn't need to; the CPU keeps track for it.
Stop here for a minute! This is not entirely true. The OS must first set up the structures that the CPU uses; the CPU then uses these structures to determine whether a section of a program is in memory. These structures, called page tables, live in RAM rather than in the CPU itself, but the CPU uses them to administer RAM utilization. As their name implies, page tables are simply tables of pages. In other words, they are memory locations in which other memory locations are stored.
Confused? I was at first, so let's look at this concept another way. Each running process has a certain part of its code currently in memory. The system uses page tables to keep track of what is currently in memory and where it is located. To limit the amount of work the CPU has to do, each page table is only 4K, or one page, in size. Because each entry is a 32-bit address, a page table can contain only 1,024 entries.
Although this would imply that a process can have only 4K x 1,024, or 4Mb, loaded at a time, there is more to it. Page tables are grouped into page directories. Like the entries in a page table, the entries in a page directory point to memory locations. However, rather than pointing to a part of the process, page directory entries point to page tables. Again, to reduce the CPU's work, a page directory is only one page in size. Because each entry in the page directory points to a page table, a process can have only 1,024 page tables.
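If it helps to see this in code, here is a minimal C sketch of the two-level structure just described, assuming the sizes given above. The type and constant names are my own inventions for illustration; they are not Linux's actual definitions.

```c
#include <stdint.h>

/* Illustrative only -- these names are not Linux's actual definitions.
 * Each structure is exactly one 4K page holding 1,024 32-bit entries. */
#define ENTRIES_PER_PAGE 1024

typedef uint32_t pte_t;   /* page table entry: refers to a 4K page in RAM */
typedef uint32_t pde_t;   /* page directory entry: refers to a page table */

typedef struct {
    pte_t entry[ENTRIES_PER_PAGE];   /* 1,024 x 4 bytes = 4,096 bytes */
} page_table_t;

typedef struct {
    pde_t entry[ENTRIES_PER_PAGE];   /* also exactly one page in size */
} page_directory_t;
```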
Is this enough? Let's see. A page is 4K or 4,096 bytes, which is 2^12. Each page table can refer to 1,024 pages, which is 2^10. Each page directory can refer to 1,024 page tables, which is also 2^10. Multiplying this out, we have
(page size) x (pages in page table) x (page tables in page directory)
or
(2^12) x (2^10) x (2^10) = 2^32
Because the CPU is only capable of accessing 2^32 bytes, this scheme allows access to every possible memory address that the system can generate.
Are you still with me?
Inside the CPU is a register called Control Register 0, or CR0 for short. In this register is a single bit that turns on the paging mechanism. When paging is turned on, any memory reference the CPU receives is interpreted as a combination of page directories, page tables, and offsets, rather than as an absolute, linear address.
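For the curious, here is a hedged C sketch of the CR0 bits involved. The bit positions (PE in bit 0, PG in bit 31) are Intel-documented; the macro names are mine, and actually touching CR0 requires privileged ring-0 code, so the assembly appears only in a comment.

```c
#include <stdint.h>

/* CR0 control bits (positions per Intel's documentation; names mine). */
#define CR0_PE ((uint32_t)1 << 0)    /* protection enable: protected mode      */
#define CR0_PG ((uint32_t)1 << 31)   /* paging enable: turns on the paging unit */

/* In kernel startup code, paging is switched on roughly like this
 * (shown as an assembly comment because user-level code cannot do it):
 *
 *     mov eax, cr0
 *     or  eax, 0x80000000     ; set the PG bit
 *     mov cr0, eax
 */
```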
Built into the CPU is a special unit that is responsible for translating the virtual address of the process to physical pages in memory. This special unit is called (what else?) the paging unit. To understand how much work the paging unit saves the operating system and the other parts of the CPU, let's see how the address is translated.
Translation of Virtual-to-Physical Address
When paging is turned on, the paging unit receives a 32-bit value that represents a virtual memory location within a process. The paging unit takes these values and translates them, as shown in Figure 0-11. At the top of the figure, we see that the virtual address is handed to the paging unit, which converts it to a linear address. This is not yet the physical address in memory. As you see, the 32-bit linear address is broken down into three components. The first 10 bits (bits 22-31) are an offset into the page directory. The location in memory of the page directory itself is determined by the Page Directory Base Register (PDBR).
The page directory entry is a 32-bit value that points to a specific page table. The entry within that page table, as you see, is determined by bits 12-21. Here again, we have 10 bits, selecting one of 1,024 32-bit entries. Each of these entries points to a specific page in physical memory. Which byte is referenced within that physical page is determined by the offset portion of the linear address, bits 0-11. These 12 bits can address each of the 4,096 (4K) bytes in a physical page.
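To make the three-way split concrete, here is a small, runnable C sketch of how the paging unit carves up a linear address. The example address and variable names are mine; the bit positions are the ones just described.

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t linear = 0x0804A123u;  /* an arbitrary example address */

    uint32_t dir    = (linear >> 22) & 0x3FFu; /* bits 22-31: page directory index */
    uint32_t table  = (linear >> 12) & 0x3FFu; /* bits 12-21: page table index     */
    uint32_t offset =  linear        & 0xFFFu; /* bits 0-11:  byte within the page */

    printf("directory entry %u, table entry %u, offset %u\n",
           dir, table, offset);
    return 0;
}
```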
Keep in mind a couple of things. First, page tables and page directories are not part of the CPU. They can't be: if a page directory were full, it would refer to 1,024 page tables, each a 4K chunk of memory, so you would need 4Mb for the page tables alone! Because holding all of that would make the CPU hundreds of times larger than it is now, page tables and directories are stored in RAM.
Next, page tables and page directories are abstract concepts that the CPU knows how to utilize. They occupy physical RAM, and operating systems such as Linux know how to switch this capability on within the CPU. All the CPU does is the “translation” work. When it starts, Linux turns this capability on and sets up all the structures. These structures are then handed off to the CPU, where the paging unit does the work.
As I said, a process with all of its page directory entries full would require 4Mb just for the page tables. This implies that the entire process is somewhere in memory. Because each of the page table entries points to a physical page in RAM, you would need 4Gb of RAM. Not that I would mind having that much RAM, though it is a bit costly, and even if you had 16Mb SIMMs, you would need 256 of them.
As with the pages of the process itself, it's possible that a linear address passed to the paging unit refers to a page table or even a page directory that is not in memory. Because the system is trying to access a page (one that contains a page table, not part of the process) that is not in memory, a page fault occurs and the system must go get that page.
Because page tables and the page directory are not really part of the process but are important only to the operating system, a page fault causes these structures to be created rather than read in from the hard disk or elsewhere. In fact, as the process starts up, all is without form and is void: no pages, no page tables, and no page directory.
As it starts the process, the system accesses a memory location. The system translates the address, as I described above, and tries to read the page directory. It's not there. A page fault occurs and the page directory must be created. Now that the directory is there, the system finds the entry that points to the page table. Because no page tables exist yet, the slot is empty and another page fault occurs, so the system must create a page table. Then the entry in the page table for the physical page is found to be empty, and yet another page fault occurs. Finally, the system can read in the page that was referenced in the first place.
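Here is a simplified, runnable C sketch of that chain of events. It simulates demand allocation, with ordinary calloc() and malloc() calls standing in for the page faults; none of this is kernel code, just an illustration of the walk from directory to table to page (error handling omitted for brevity).

```c
#include <stdlib.h>
#include <stdint.h>

#define ENTRIES   1024
#define PAGE_SIZE 4096

static void *page_dir[ENTRIES];   /* the process's page directory */

/* Walk from directory to table to page, creating each level the
 * first time it is touched -- each allocation stands in for a page
 * fault (and the resulting creation) described above. */
static void *lookup(uint32_t linear)
{
    uint32_t d = (linear >> 22) & 0x3FFu;
    uint32_t t = (linear >> 12) & 0x3FFu;

    if (page_dir[d] == NULL)                       /* "fault": no page table yet */
        page_dir[d] = calloc(ENTRIES, sizeof(void *));

    void **table = page_dir[d];
    if (table[t] == NULL)                          /* "fault": no page yet */
        table[t] = malloc(PAGE_SIZE);

    return (char *)table[t] + (linear & 0xFFFu);   /* byte within the page */
}
```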
This whole process sounds a bit cumbersome, but bear in mind that this amount of page faulting only occurs as the process is starting. Once the table is created for a given process, it won’t page fault again on that table. Based on the principle of locality, the page tables will hold enough entries for a while, unless, of course, the process bounces around a lot.
The potential for bouncing around brings up an interesting aspect of page tables. Because a given virtual address always translates through the same page table entries, virtual addresses in the same area of the process end up in the same page tables. Page tables therefore fill up, because a process is more likely to execute code near the code it is already executing than code somewhere else entirely (this is spatial locality).
There is quite a lot there, yes? Well, don't get up yet, because we're not finished. There are a few more issues that I haven't addressed.
First, I have often referred to page tables and the page directory. Each process has a single page directory (it doesn't need more). Although the CPU supports multiple page directories, only one directory is in use for the entire system at any given time. When a process is switched out, the entries in the page directory for the old process are overwritten by those for the new process. The location of the page directory in memory is maintained in Control Register 3 (CR3) in the CPU.
There is something here that bothered me in the beginning and may still bother you. As I have described it, each time a memory reference is made, the CPU has to look at the page directory, then a page table, and then calculate the physical address. This means that for every memory reference, the CPU has to make two more references just to find out where the next instruction or piece of data is coming from. I thought that was pretty stupid.
Well, so did the designers of the CPU. They included a functional unit called the Translation Lookaside Buffer, or TLB. The TLB contains 32 entries and, just as the internal and external caches hold recently used instructions and data, it holds recently used page translations. If the page being looked up is in the TLB, a TLB hit occurs (just like a cache hit). As a result of the principle of spatial locality, there is a 98-percent hit rate using the TLB.
When you think about it, this makes a lot of sense. The CPU does not execute just one instruction for a program and then switch to something else; it executes hundreds or even thousands of instructions before another program gets its turn. If each page holds 1,024 instructions and the CPU executes 1,000 of them before it's another program's turn, all 1,000 will most likely be in the same page, so they are all TLB hits.
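To show the idea (and only the idea), here is a deliberately simplified C sketch of a TLB lookup. The real 386 TLB is organized quite differently in hardware; this linear scan of 32 entries merely illustrates how a hit lets the CPU skip the directory/table walk. All names are mine.

```c
#include <stdint.h>
#include <stddef.h>

#define TLB_ENTRIES 32

/* One cached translation: virtual page number -> physical page number. */
struct tlb_entry {
    uint32_t virt_page;   /* linear address >> 12   */
    uint32_t phys_page;   /* physical address >> 12 */
    int      valid;
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns nonzero (a "TLB hit") and fills *phys_page on success. */
static int tlb_lookup(uint32_t virt_page, uint32_t *phys_page)
{
    for (size_t i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].virt_page == virt_page) {
            *phys_page = tlb[i].phys_page;
            return 1;   /* hit: no page directory/table walk needed */
        }
    }
    return 0;           /* miss: the paging unit must walk the tables */
}
```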
Now, let's take a closer look at the page table entries themselves. Each is a 32-bit value that points to a 4K page in RAM. Because it points to an area of memory larger than a byte, it does not need all 32 bits to do it. A page is 4,096 (2^12) bytes, so the entry needs only its upper 20 bits to select a page, leaving the low-order 12 bits free; the CPU uses them for other purposes related to that page. A few of them are unused, and the operating system can, and does, use these for its own purposes. Intel also reserves a couple, and those should not be used.
One bit, the 0th bit, is the present bit. If this bit is set, the CPU knows that the page being referenced is in memory. If it is not set, the page is not in memory and if the CPU tries to access it, a page fault occurs. Also, if this bit is not set, none of the other bits has any meaning. (How can you talk about something that’s not there?)
Another important bit is the accessed bit. Should a page be accessed for either read or write, the CPU sets this bit. Because a page table entry is never filled in until the page is first accessed, this seems a bit redundant, and if that were all there was to it, you'd be right. However, there's more.
At regular intervals, the operating system clears the accessed bit. If a particular page is never accessed again, the system is free to reuse that physical page when memory gets short. When that happens, all the OS needs to do is clear the present bit so the page is considered "invalid."
Another bit used to determine how a page is accessed is the dirty bit. If a page has been written to, it is considered dirty. Before the system can make a dirty page available, it must make sure that whatever was in that page is written to disk, otherwise the data is inconsistent.
Finally, we get to the point of what all this protected mode stuff is all about. The protection in protected mode essentially boils down to two bits in the page table entry. One bit, the user/supervisor bit, determines who has access to a particular page. If the CPU is running at user level, then it only has access to user-level pages. If the CPU is at supervisor level, it has access to all pages.
I need to say here that this is the maximum access a process can have. Other protections may prevent a user-level or even supervisor-level process from getting even this far. However, these are implemented at a higher level.
The other bit in this pair is the read/write bit. As the name implies, this bit determines whether a page can be written to. It is really just an on-off switch: if the page is present, you have the right to read it if you can (that is, either you are a supervisor-level process or the page is a user page). However, if the write ability is turned off, you can't write to it, even as a supervisor.
If you have a 386 CPU, all is well. But if you have a 486 and decide to use one of those bits that I told you were reserved by Intel, you run into trouble. Two bits that were not defined in the 386 are now defined in the 486: page write-through (PWT) and page cache disable (PCD).
PWT determines the write policy (see the section on RAM) for external cache regarding this page. If PWT is set, then this page has a write-through policy. If it is clear, a write-back policy is allowed.
PCD determines whether this page can be cached. If the bit is set, the page cannot be cached (hence the name, page cache disable); if it is clear, caching is allowed. Note that I said "allowed." Clearing this bit does not mean that the page will be cached. Other factors that go beyond what I am trying to get across here are involved.
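Putting the various bits together, here is a hedged C sketch of a page table entry's low-order bits. The bit positions follow Intel's documented layout for the 386/486; the macro and function names are my own.

```c
#include <stdint.h>

/* Low-order bits of a page table entry (positions per Intel; names mine). */
#define PTE_PRESENT   (1u << 0)   /* page is in physical memory            */
#define PTE_RW        (1u << 1)   /* read/write: page may be written       */
#define PTE_USER      (1u << 2)   /* user/supervisor: user-level access OK */
#define PTE_PWT       (1u << 3)   /* 486: page write-through policy        */
#define PTE_PCD       (1u << 4)   /* 486: page cache disable               */
#define PTE_ACCESSED  (1u << 5)   /* set by the CPU on any read or write   */
#define PTE_DIRTY     (1u << 6)   /* set by the CPU when page is written   */

/* The upper 20 bits hold the physical page address. */
#define PTE_FRAME(pte)  ((pte) & 0xFFFFF000u)

/* Example: may a user-level process write through this entry? */
static inline int user_can_write(uint32_t pte)
{
    return (pte & PTE_PRESENT) && (pte & PTE_USER) && (pte & PTE_RW);
}
```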
Well, I’ve talked about how the CPU helps the OS keep track of pages in memory. I also talked about how the CR3 register helps keep track of which page directory needs to be read. I also talked about how pages can be protected by using a few bits in the page table entry. However, one more thing is missing to complete the picture: keeping track of which process is currently running, which is done with the Task Register (TR).
The TR is not where most of the work is done. The CPU simply uses it as a pointer to where the important information is kept. This pointer is the Task State Descriptor (TSD). Like the other descriptors that I’ve talked about, the TSD points to a particular segment. This segment is the Task State Segment (TSS). The TSD contains, among other things, the privilege level at which this task is operating. Using this information along with that in the page table entry, you get the protection that protected mode allows.
The TSS contains essentially a snapshot of the CPU. When a process’s turn on the CPU is over, the state of the entire CPU needs to be saved so that the program can continue where it left off. This information is stored in the TSS. This functionality is built into the CPU. When the OS tells the CPU a task switch is occurring (that is, a new process is getting its turn), the CPU knows to save this data automatically.
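For reference, here is a partial C sketch of what such a snapshot holds. The field order follows Intel's documented 386 TSS layout, but I have omitted some fields (the ring 1 and ring 2 stacks and the I/O permission map), and the names are my own, not Intel's or Linux's.

```c
#include <stdint.h>

/* Partial sketch of the 386 hardware Task State Segment. */
struct tss_sketch {
    uint32_t prev_task;              /* link back to the previous TSS      */
    uint32_t esp0, ss0;              /* ring 0 (kernel) stack pointer      */
    /* ... ring 1 and ring 2 stack fields omitted ... */
    uint32_t cr3;                    /* this task's page directory address */
    uint32_t eip, eflags;            /* where the task left off            */
    uint32_t eax, ecx, edx, ebx;     /* general-purpose registers          */
    uint32_t esp, ebp, esi, edi;
    uint32_t es, cs, ss, ds, fs, gs; /* segment registers                  */
    uint32_t ldt;                    /* local descriptor table selector    */
    /* ... debug-trap flag and I/O permission map base omitted ... */
};
```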
If we put all of these components together, we get an operating system that works together with the hardware to provide a multitasking, multiuser system. Unfortunately, what I talked about here are just the basics. I could spend a whole book just talking about the relationship between the operating system and the CPU and still not be done.
One thing I didn't talk about was the difference between the 80386, 80486, and Pentium. With each new processor come new instructions. The 80486 added an instruction pipeline to improve performance to the point where the CPU could average almost one instruction per cycle. The Pentium has dual instruction paths (pipelines) to increase speed even more. It also contains branch prediction logic, which is used to "guess" where the next instruction should come from.
The Pentium (as well as later CPUs) has a few new features that provide significantly more performance. The first feature is multiple instruction paths, or pipelines, which allow the CPU to work on multiple instructions at the same time. In some cases, the CPU has to wait to finish one before working on another, but this is not always necessary.
The second improvement is called dynamic execution. Normally, instructions are executed one after the other, and if the execution order is changed, the behavior of the whole program changes. Well, not exactly. In some instances, upcoming instructions do not depend on the results of previous ones, so the processor can "jump ahead" and start executing them before the others are finished.
The next advance is branch prediction. Based on previous activity, the CPU can expect certain behavior to continue. For example, the odds are that once the CPU is in a loop, the loop will be repeated. With more than one pipeline executing instructions, multiple possibilities can be attempted. The prediction is not always right, but it is right more than 75 percent of the time!
The PentiumPro (P6) introduced the concept of data flow analysis. Here, instructions are executed as they are ready, not necessarily in the order in which they appear in the program. Often, the result is available before it normally would be. The PentiumPro (P6) also introduced speculative execution, in which the CPU takes a guess at or anticipates what is coming.
The P6 is also new in that it is actually two separate chips, though the function of the second chip is solely the level-2 cache. Both an external bus and a "private" bus connect the CPU to the level-2 cache, and both buses are 64 bits wide.
Both the socket and the CPU itself changed with the Pentium II processor. Instead of a processor with pins sticking out all over the bottom, the Pentium II uses a Single Edge Contact Cartridge (SECC). This reportedly eliminates the need for redesigning the socket with every new CPU generation. In addition, the CPU is encased in plastic, which protects it during handling. Starting at "only" 233MHz, the Pentium II can reach speeds of up to 450MHz.
Increasing performance even further, the Pentium II has increased the internal, level-one cache to 32KiB, with 16KiB for data and 16KiB for instructions. Technically, it may be appropriate to call the level-two cache internal as well, as the 512KiB L2 cache is included within the SECC, making access faster than for a traditional L2 cache. The Dual Independent Bus (DIB) architecture provides higher throughput because there are separate system and cache buses.
The Pentium II also increases performance internally through changes to the processor logic. Using Multiple Branch Prediction, the Pentium II predicts the flow of instructions through several branches. Because computers usually process instructions in loops (that is, repeatedly), it is generally easy to guess what the computer will do next. By predicting multiple branches, the processor reduces "wrong guesses."
Processor "management" has become an important part of the Pentium II. A Built-In Self-Test (BIST) is included, which is used to test things like the cache and the TLB. The Pentium II also includes a diode within the case to monitor the processor's temperature.
The Pentium II Xeon Processor added a “system bus management interface,” which allows the CPU to communicate with other system management components (hardware and software). The thermal sensor, which was already present in the Pentium II, as well as the new Processor Information ROM (PI ROM) and the Scratch EEPROM use this bus.
The PI ROM contains various pieces of information about the CPU, such as the CPU ID, voltage tolerances, and other technical information. The Scratch EEPROM is shipped blank from Intel but is intended for system manufacturers to fill with whatever information they want, such as an inventory of the other components, service information, system defaults, and so forth.
Like the Pentium II, the latest (as of this writing) processor, the Pentium III, also comes in the Single Edge Contact Cartridge. It increases the number of transistors from the 7.5 million in the Pentium II to over 9.5 million. Currently, the Pentium III comes in 450MHz and 500MHz models, with a 550MHz model in the planning stages.
The Pentium III also includes the Internet Streaming SIMD Extensions, 70 new instructions that enhance imaging in general, as well as 3D graphics, streaming audio and video, and speech recognition.
Intel also added a serial number to the Pentium III. This is extremely useful should the CPU, or the computer it has been installed in, get stolen. In addition, the CPU can be uniquely identified across the network, regardless of the network card or other components. In the future, this could be used to prevent improper access to sensitive data, aid in asset management, and help in remote management and configuration.
Even today, people still think that the PC CPU is synonymous with Intel; that is, they assume that if you buy a CPU for your PC, it will be manufactured by Intel. This is not the case. Two other manufacturers, Advanced Micro Devices (AMD) and Cyrix, provide CPUs with comparable functionality. Like any other brand name, Intel CPUs are often more expensive than an equivalent chip from another company with the same performance.