Intel Processors
Although it is an interesting subject, the ancient history of
microprocessors is not really important to the issues at hand. It might be nice
to learn how the young PC grew from a small, budding 4-bit system to the
gigantic, strapping 32-bit Pentium. However, there are many books that have
covered this subject and unfortunately, I don’t have the space. Besides, the
Intel chips on which Linux runs are only the 80386 (or 100-percent compatible
clones) and higher processors.
So, instead of setting the way-back machine to Charles Babbage and his
Analytic Engine, we leap ahead to 1985 and the introduction of the Intel 80386.
Even compared to its immediate predecessor, the 80286, the 80386 (386 for short)
was a powerhouse. Not only could it handle twice the amount of data at once (now
32 bits), but its speed rapidly increased far beyond that of the 286.
New advances were added to increase the 386’s power. Internal registers were
added and their size was increased. Built into the 386 was the concept of
virtual memory, which was a way to make it appear as
though there was much more memory on the system than there actually was. This
substantially increased the system efficiency. Another major advance was the
inclusion of a 16-byte, pre-fetch cache. With this, the
CPU could load instructions before it actually processed
them, thereby speeding things up even more. Then the most obvious speed increase
came when the speed of the processor was increased from 8MHz to 16MHz.
Although the 386 had major advantages over its predecessors, at first its
cost seemed relatively prohibitive. To give users access to the multitasking
capability and still make the chip fit within their budgets, Intel
made an interesting compromise: By making a new chip in which the interface to
the bus was 16 bits wide instead of 32, Intel made the
chip a fair bit cheaper.
Internally, this new chip, designated the 80386SX, is identical to the
standard 386. All the registers are there and it is fully 32 bits wide.
However, data and instructions are accessed 16 bits at a time, therefore
requiring two bus accesses to fill the registers. Despite
this shortcoming, the 80386SX is still faster than the 286.
Perhaps the most significant advance of the 386 for Linux as well as other
PC-based UNIX systems was its paging
abilities. I talked a little about paging in the section on operating system
basics, so you already have a general idea of what paging is
about. I will also go into more detail about paging in the section on the
kernel. However, I will talk about it a little here so you
can fully understand the power that the 386 has given us and see how the
CPU helps the OS.
There are UNIX-like products that run on an 80286, such as SCO XENIX. In
fact, there was even a version of SCO XENIX that ran on the 8086. Because Linux
was first released for the 386, I won’t go into any more detail about the 286 or
the differences between the 286 and 386. Instead, I will just describe the
CPU Linux uses as a sort of abstract entity. In addition,
because most of what I will be talking about is valid for the 486 and Pentium as
well as the 386, I will simply call it “the CPU” instead of 386, 486, or
Pentium.
(Note: Linux will also run on non-Intel CPUs, such as those from AMD or
Cyrix. However, the issues I am going to talk about are all common to
Intel-based or Intel-derived CPUs.)
I need to take a side-step here for a minute. On PC buses, multiple things
are happening at once. The CPU is busily processing while
much of the hardware is being accessed via DMA. Although
these multiple tasks are occurring simultaneously on the system, this is not
what is referred to as multitasking.
When I talk about multitasking, I am referring to multiple processes being in
memory at the same time. Because the time the computer takes to switch
between these processes, or tasks, is much shorter than the human brain can
recognize, it appears as though the processes are running simultaneously. In
reality, each process gets to use the CPU and other system
resources for a brief time and then it’s another process’s turn.
As it runs, the process could use any part of the system memory it needs. The
problem with this is that a portion of RAM that one
process wants may already contain code from another process. Rather than
allowing each process to access any part of memory it wants, protections keep
one program from overwriting another one. This protection is built in as part of
the CPU and is called, quite logically, “protected mode.”
Without it, Linux could not function.
Note, however, that just because the CPU
is in protected mode
does not necessarily mean that the protections are being utilized. It simply
means that the operating system can take advantage of the
built-in abilities if it wants.
Although this capability is built into the CPU,
it is not the default mode. Instead, the CPU starts in what I like to call
“DOS-compatibility mode.” However, the correct term is “real mode.” Real mode
is a real danger to an operating system like
UNIX. In this mode, there are no protections (which
makes sense because protections exist only in protected mode). A process running
in real mode has complete control over the entire system and can do anything it
wants. Therefore, trying to run a multiuser system on a real-mode system would
be a nightmare. All the protections would have to be built into the process
because the operating system wouldn’t be able to prevent a process from doing
what it wanted.
A third mode, called “virtual mode,” is also built in. In virtual mode, the
CPU behaves to a limited degree as though it is in real
mode. However, when a process attempts to directly access registers or
hardware, the instruction is caught, or trapped, and the operating system
is allowed to take over.
Let’s get back to protected mode
because this is what makes multitasking possible. When in protected mode, the
CPU can use virtual memory.
As I mentioned, this is a way to trick the system into thinking that there is
more memory than there really is. There are two ways of doing this. The first
is called swapping, in which the entire process is loaded
into memory. It is allowed to run its course for a certain amount of time. When
its turn is over, another process is allowed to run. What happens when there is
not enough room for both processes to be in memory at the same time? The only
solution is that the first process is copied out to a special part of the hard
disk called the swap space, or swap device. Then, the next
process is loaded into memory and allowed its turn. The second is called
paging and we will get to it in a minute.
Because it takes such a large portion of the system resources to swap
processes in and out of memory, virtual memory can be
very inefficient, especially when you have a lot of processes running. So let’s
take this a step further. What happens if there are too many processes and the
system spends all of its time swapping? Not good.
To avoid this problem, a mechanism was devised whereby only those parts of
the process that are needed are in memory. As it goes about its business, a
program may only need to access a small portion of its code. As I mentioned
earlier, empirical tests show that a program spends 80 percent of its time
executing 20 percent of its code. So why bother bringing in those parts that
aren’t being used? Why not wait and see whether they are used?
To make things more efficient, only those parts of the program that are
needed (or expected to be needed) are brought into memory. Rather than
accessing memory in random units, the memory is divided into 4K chunks, called
pages. Although there is nothing magic about 4K per se, this value is easily
manipulated. In the CPU, data is referenced in 32-bit
(4-byte) chunks, and 1K (1,024) of these chunks make up a page (4,096 bytes).
Later you will see how this helps things work out.
As I mentioned, only that part of the process currently being used needs to
be in memory. When the process wants to read something that is not currently
in RAM, it needs to go out to the hard disk to pull in the
other parts of the process; that is, it goes out and reads in new pages. This
process is called paging. When the process attempts to read
from a part of the process that is not in physical memory,
a “page fault” occurs.
One thing you must bear in mind is that a process can jump around a lot.
Functions are called, sending the process off somewhere completely different.
It is possible, likely, for that matter, that the page containing the memory
location to which the process needs to jump is not currently in memory. Because
it is trying to read a part of the process not in physical memory,
this, too, is called a page fault. As
memory fills up, pages that haven’t been used in some time are replaced by new
ones. (I’ll talk much more about this whole business later.)
Assume that a process has just made a call to a function somewhere else in
the code and the page it needed is brought into memory. Now there are two pages
of the process from completely different parts of the code. Should the process
take another jump or return from the function, it needs to know whether the
place it is going is in memory. The operating system could keep track of
this, but it doesn’t need to; the CPU will keep track for
it.
Stop here for a minute! This is not entirely true. The OS must first set up
the structures that the CPU uses. However, the CPU uses
these structures to determine whether a section of a program is in memory.
Although they reside in RAM rather than in the CPU, the CPU
administers the RAM utilization through page tables. As their name implies, page
tables are simply tables of pages. In other words, they are memory locations in
which other memory locations are stored.
Confused? I was at first, so let’s look at this concept another way. Each
running process has a certain part of its code currently in memory. The system
uses these page tables to keep track of what is currently in memory and where it is
located. To limit the amount the CPU has to work, each of
these page tables is only 4K, or one page, in size. Because each page contains a
set of 32-bit addresses, a page table can contain only
1,024 entries.
Although this would imply that a process can only have 4K x 1,024, or 4MB,
loaded at a time, there is more to it. Page tables are grouped into page
directories. Like the page table, the entries
in a page directory point to memory locations.
However, rather than pointing to a part of the process, page directories point
to page tables. Again, to reduce the CPU’s work, a page directory is only one
page. Because a one-page directory holds 1,024 entries and each entry points to
a page table, a process can only have 1,024 page tables.
Is this enough? Let’s see. A page is 4K or 4,096 bytes, which is
2^12. Each page table can refer to 1,024 pages,
which is 2^10. Each page directory can refer to
1,024 page tables, which is also 2^10. Multiplying this out, we
have
(page size) x (pages in page table) x (page tables in page directory)
or
(2^12) x (2^10) x (2^10) = 2^32
Because the CPU
is only capable of accessing 2^32 bytes, this scheme allows
access to every possible memory address
that the system can generate.
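The arithmetic is easy to check. The following Python sketch simply recomputes the figures from the page size and the 32-bit entry size; the constant names are my own, chosen for illustration, and are not taken from any kernel source.

```python
# Recompute the address-space arithmetic from first principles.
PAGE_SIZE = 4096    # bytes per page (4K)
ENTRY_SIZE = 4      # bytes per 32-bit table entry

# A page table and a page directory are each exactly one page in size.
ENTRIES_PER_TABLE = PAGE_SIZE // ENTRY_SIZE       # pages per page table
TABLES_PER_DIRECTORY = PAGE_SIZE // ENTRY_SIZE    # page tables per directory

addressable_bytes = PAGE_SIZE * ENTRIES_PER_TABLE * TABLES_PER_DIRECTORY

print(ENTRIES_PER_TABLE)               # 1024
print(addressable_bytes == 2 ** 32)    # True
```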
Are you still with me?
Inside of the CPU
is a register called the Control Register 0, or CR0 for short. In this register
is a single bit that turns on this paging mechanism. If
this paging mechanism is turned on, any memory reference that the CPU receives
is interpreted as a combination of page directories, page tables, and offsets,
rather than an absolute, linear address.
Built into the CPU
is a special unit that is responsible for making the translation from the virtual
address of the process to physical pages in memory. This special
unit is called
(what else?) the paging unit. To understand more about the
work the paging unit saves the operating system or other
parts of the CPU, let’s see how the address is translated.
Translation of Virtual-to-Physical Address
When paging is turned on, the paging unit receives a 32-bit value that
represents a virtual memory location within a process. The paging
unit takes these values and translates them, as shown in Figure 0-11. At the top of the
figure, we see that the virtual address is handed to the paging
unit, which converts it to a linear address.
This is not the physical address in memory. As
you see, the 32-bit linear address is broken down into three components. The
first 10 bits (22-31) are an offset into the page directory.
The location in memory of the page directory
is determined by the Page Directory Base Register (PDBR).
Figure: Translation of virtual addresses into physical addresses by the paging unit.
The page directory
entry contains 20 bits that point to a specific page table.
The entry in the page table, as you see, is determined by bits 12-21. Here
again, we have 10 bits, which select one of the 1,024 entries in the table,
each of which is 32 bits wide. These 32-bit entries
point to a specific page in physical memory. Which byte is referenced
in physical memory is determined by the offset portion of
the linear address, which is bits 0-11. These 12 bits
represent the 4,096 (4K) bytes in each physical page.
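To make the bit fields concrete, here is a minimal Python sketch of the split and of the two-level lookup. It is only a model: plain dictionaries stand in for the page directory and page tables held in RAM, the function names are my own, and the sample mapping is invented for the example.

```python
def split_linear_address(addr):
    """Split a 32-bit linear address into its three fields."""
    directory_index = (addr >> 22) & 0x3FF   # bits 22-31
    table_index = (addr >> 12) & 0x3FF       # bits 12-21
    offset = addr & 0xFFF                    # bits 0-11
    return directory_index, table_index, offset


def translate(addr, page_directory):
    """Walk a toy two-level structure and return the physical address.

    page_directory maps a directory index to a page table, and each page
    table maps a table index to the base address of a physical page.
    """
    directory_index, table_index, offset = split_linear_address(addr)
    page_table = page_directory[directory_index]
    page_base = page_table[table_index]
    return page_base + offset


# Hypothetical mapping: directory slot 1, table slot 3 -> page at 0x0009A000.
page_directory = {1: {3: 0x0009A000}}

print(split_linear_address(0x00403025))             # (1, 3, 37)
print(hex(translate(0x00403025, page_directory)))   # 0x9a025
```

Note that only the page base changes during translation; the 12-bit offset is carried over unchanged.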
Keep in mind a couple of things. First, page tables and page directories are
not part of the CPU. They can’t be. If a
page directory were full, it would contain 1,024 references
to 4K chunks
of memory. For the page tables alone, you would need 4MB! Because this would
create a CPU hundreds of times larger than it is now, page tables and
directories are stored in RAM.
Next, page tables and page directories are abstract concepts that the
CPU knows how to utilize. They occupy physical
RAM, and operating systems such as Linux know how to
switch this capability on within the CPU. All the CPU does is the “translation”
work. When it starts, Linux turns this capability on and sets up all the
structures. These structures are then handed off to the CPU, where the
paging unit does the work.
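As a quick check of how much RAM these structures can occupy, the following sketch adds up the worst case for a single page directory whose entries are all in use. The variable names are illustrative only.

```python
PAGE_SIZE = 4096              # bytes; a table or directory is one page
TABLES_PER_DIRECTORY = 1024   # entries in one page directory

page_directory_bytes = PAGE_SIZE
page_table_bytes = TABLES_PER_DIRECTORY * PAGE_SIZE

print(page_table_bytes // (1024 * 1024))          # 4 (megabytes of page tables)
print(page_directory_bytes + page_table_bytes)    # 4198400 bytes in total
```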
As I said, a process with all of its page directory
entries full would require 4MB just for the page tables. This implies that the
entire process is somewhere in memory. Because each of the page table
entries points to physical pages in RAM,
you would need 4GB of RAM (1,024 x 1,024 pages of 4K each). Not that I would
mind having that much RAM, though
it is a bit costly and even if you had 16MB SIMMs, you would need 256 of
them.
Like pages of the process, it’s possible that a linear address
passed to the paging
unit translates to a page table
or even a page directory
that was not in memory. Because the system is trying to access a page (which
contains a page table and not part of the process) that is not in memory, a page
fault occurs and the system must go get that page.
Because page tables and the page directory
are not really part of the process but are important only to the
operating system,
a page fault causes these structures to be created
rather than read in from the hard disk or elsewhere. In fact, as the process starts up,
all is without form and is void: no pages, no page tables, and no page directory.
The system accesses a memory location as it starts the process. The system
translates the address, as I described above, and tries to
read the page directory. It’s not there. A page
fault occurs and the page directory must be created. Now
that the directory is there, the system finds the entry that points to the
page table. Because no page tables exist, the slot is
empty and another page fault occurs. So, the system needs
to create a page table. The entry in the page table for the physical page is
found to be empty, and so yet another page fault occurs. Finally, the system can
read in the page that was referenced in the first place.
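The sequence just described can be modeled in a few lines. The sketch below is a toy model of this description only, not of how any particular kernel or CPU is implemented; the class name, the fault counter, and the way free pages are handed out are all assumptions made for illustration.

```python
class ToyAddressSpace:
    """Creates the directory, tables, and pages on demand, counting faults."""

    def __init__(self):
        self.page_directory = None         # nothing exists at process start
        self.page_faults = 0
        self.next_free_page = 0x00100000   # arbitrary base for new pages

    def access(self, addr):
        directory_index = (addr >> 22) & 0x3FF
        table_index = (addr >> 12) & 0x3FF
        offset = addr & 0xFFF

        if self.page_directory is None:    # fault 1: no page directory yet
            self.page_faults += 1
            self.page_directory = [None] * 1024

        if self.page_directory[directory_index] is None:   # fault 2: no table
            self.page_faults += 1
            self.page_directory[directory_index] = [None] * 1024

        page_table = self.page_directory[directory_index]
        if page_table[table_index] is None:                # fault 3: no page
            self.page_faults += 1
            page_table[table_index] = self.next_free_page
            self.next_free_page += 0x1000

        return page_table[table_index] + offset


space = ToyAddressSpace()
space.access(0x00403025)
print(space.page_faults)   # 3: directory, page table, and page all missing
space.access(0x00403100)
print(space.page_faults)   # 3: same page, no new faults
space.access(0x00404000)
print(space.page_faults)   # 4: new page, but the page table already exists
```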
This whole process sounds a bit cumbersome, but bear in mind that this amount
of page faulting only occurs as the process is starting. Once the table is
created for a given process, it won’t page fault again on
that table. Based on the principle of locality, the page tables will hold enough
entries for a while, unless, of course, the process bounces around a lot.
The potential for bouncing around brings up an interesting aspect of page
tables. Because page tables translate to physical RAM in
the same way all the time, virtual addresses in the same area of the process end
up in the same page tables. Therefore, page tables fill up because the process
is more likely to execute code in the same part of a process rather than
elsewhere (this is spatial locality).
There is quite a lot there, yes? Well, don’t get up yet because we’re not
finished. There are a few more issues that I haven’t addressed.
First, I have often referred to page tables and the
page directory. Each process has a single page directory (it doesn’t need
any more). Although the CPU supports multiple page
directories, there is only one directory for the entire system. When a
process needs to be switched out, the entries in the page directory for the old
process are overwritten by those for the new process. The location of the page
directory in memory is maintained in the Control Register 3 (CR3) in the
CPU.
There is something here that bothered me in the beginning and may still
bother you. As I have described, each time a memory reference is made, the
CPU has to look at the
page directory,
then a page table, then calculate the physical
address. This means that for every memory
reference, the CPU has to make two more references just to find out where the
next instruction or data is coming from. I thought that was pretty stupid.
Well, so did the designers of the CPU.
They have included a functional unit called the Translation Lookaside Buffer,
or TLB. The TLB contains 32 entries and, just as the internal
and external caches point to sets of instructions, it points to pages. If a page
that is being searched for is in the TLB, a TLB hit occurs (just like a
cache hit). As a result of the principle of spatial
locality, there is a 98-percent hit rate using the TLB.
When you think about it, this makes a lot of sense. The CPU
does not just execute one instruction for a program then switch to something
else; it executes hundreds or even thousands of instructions before
another program gets its turn. If each page contains 1,024 instructions and the CPU
executes 1000 before it’s another program’s turn, all 1000 will most likely be in
the same page. Therefore, they are all TLB hits.
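A small simulation shows why sequential execution is so kind to the TLB. This is only a sketch: the 32-entry size comes from the text, but the least-recently-used replacement, the 4-byte instruction size, and the stand-in page-table walk are assumptions made for the example and do not describe the real hardware.

```python
from collections import OrderedDict


class ToyTLB:
    """A 32-entry translation cache with (assumed) LRU replacement."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.entries = OrderedDict()   # page number -> physical page base
        self.hits = 0
        self.misses = 0

    def lookup(self, addr, walk_page_tables):
        page_number = addr >> 12
        if page_number in self.entries:
            self.hits += 1
            self.entries.move_to_end(page_number)   # mark as recently used
        else:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)    # evict the oldest entry
            self.entries[page_number] = walk_page_tables(page_number)
        return self.entries[page_number] + (addr & 0xFFF)


def identity_walk(page_number):
    """Stand-in for the two-level page-table walk (maps pages to themselves)."""
    return page_number << 12


tlb = ToyTLB()
start = 0x00400000                 # start of a page
for i in range(1000):              # 1000 instructions of 4 bytes each
    tlb.lookup(start + 4 * i, identity_walk)

print(tlb.hits, tlb.misses)        # 999 1
```

Only the very first reference has to walk the page tables; the remaining 999 are served from the TLB.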
Now, let’s take a closer look at the page table
entries themselves. Each is a 32-bit value that points to a 4K location in
RAM. Because it points to an area of memory larger than a
byte, it does not need all 32 bits to do it. Therefore, some bits are left over.
Because the page table entry points to an area that has 2^12 bytes
(4,096 bytes = 1 page), it doesn’t need 12 bits. These are the low-order 12 bits
and the CPU uses them for other purposes related to that
page. A few of them are unused and the operating system
can, and does, use them for its own purposes. Intel also reserves a couple, and
they should not be used.
One bit, the 0th bit, is the present bit. If this bit is set, the
CPU knows that the page being referenced is in memory. If
it is not set, the page is not in memory and if the CPU tries to access it, a
page fault occurs. Also, if this bit is not set, none of
the other bits has any meaning. (How can you talk about something that’s not
there?)
Another important bit is the accessed bit. Should a page be accessed for
either read or write, the CPU sets this bit. Because the
page table entry is never filled in until the page is being
accessed, this seems a bit redundant. If that was all there was to it, you’d be
right. However, there’s more.
At regular intervals, the operating system
clears the accessed bit. If the bit is still clear the next time the system
looks, the page has not been used in the meantime, and the system is
free to reuse that physical page if memory gets short. When that happens, all
the OS needs to do is clear the present bit so the page is considered “invalid.”
Another bit used to determine how a page is accessed is the dirty
bit. If a page has been written to, it is considered dirty. Before the system
can make a dirty page available, it must make sure that whatever was in that
page is written to disk, otherwise the data is inconsistent.
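The interplay of the present, accessed, and dirty bits can be sketched as follows. The bit positions (0, 5, and 6) are those of a 32-bit x86 page table entry; everything else, including the function names and the simple first-fit choice of victim, is a simplification of my own and not the algorithm Linux actually uses.

```python
PRESENT = 0x001    # bit 0
ACCESSED = 0x020   # bit 5
DIRTY = 0x040      # bit 6


def clear_accessed_bits(page_table):
    """What the OS does at regular intervals: forget past accesses."""
    for i, entry in enumerate(page_table):
        if entry & PRESENT:
            page_table[i] = entry & ~ACCESSED


def reclaim_one_page(page_table, write_to_disk):
    """Free the first present page that has not been accessed recently."""
    for i, entry in enumerate(page_table):
        if (entry & PRESENT) and not (entry & ACCESSED):
            if entry & DIRTY:
                write_to_disk(i)               # save contents before reuse
            page_table[i] = entry & ~PRESENT   # page is now "invalid"
            return i
    return None                                # every page was used recently


# Three hypothetical entries: recently used, dirty but idle, clean and idle.
table = [0x1000 | PRESENT | ACCESSED,
         0x2000 | PRESENT | DIRTY,
         0x3000 | PRESENT]
written = []

print(reclaim_one_page(table, written.append), written)   # 1 [1]
```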
Finally, we get to the point of what all this protected mode
stuff is all about. The protection in protected mode essentially boils down to
two bits in the page table entry. One bit, the
user/supervisor bit, determines who has access to a particular page. If the
CPU is running at user level, then it only has access to
user-level pages. If the CPU is at supervisor level, it has access to all pages.
I need to say here that this is the maximum access a process can have. Other
protections may prevent a user-level or even supervisor-level process from
getting even this far. However, these are implemented at a higher level.
The other bit in this pair is the read/write bit. As the name implies, this
bit determines whether a page can be written to. This single bit is really
just an on-off switch. If the page is there, you have the right to read it if
you can (that is, either you are a supervisor-level process or the page is a
user page). However, if the write ability is turned off, you can’t write to it,
even as a supervisor.
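The rules in the last few paragraphs boil down to a short check. The sketch below models them exactly as stated here; the bit positions (0, 1, and 2) are those of a 32-bit x86 page table entry, while the function name and its arguments are hypothetical. On real processors, whether supervisor-level writes honor the read/write bit depends on the processor generation and how it is configured, so treat the last rule as an assumption of this model.

```python
PRESENT = 0x001      # bit 0
READ_WRITE = 0x002   # bit 1: set means the page may be written
USER = 0x004         # bit 2: set means user-level code may access the page


def access_allowed(entry, supervisor, write):
    """Return True if the access is permitted under the rules in the text."""
    if not entry & PRESENT:
        return False                  # not in memory: page fault
    if not supervisor and not entry & USER:
        return False                  # user level touching a supervisor page
    if write and not entry & READ_WRITE:
        return False                  # write ability is turned off
    return True


read_only_user_page = PRESENT | USER
print(access_allowed(read_only_user_page, supervisor=False, write=False))  # True
print(access_allowed(read_only_user_page, supervisor=False, write=True))   # False
```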
If you have a 386 CPU, all is well. If you have a 486
and decide to use one of those bits that I told you were reserved by Intel, you
are now running into trouble. Two of these bits were not defined in the 386 but
are now defined in the 486: page write-through (PWT) and page
cache disable (PCD).
PWT determines the write policy
(see the section on RAM) for external cache
regarding this page. If PWT is set, then this page has a write-through policy.
If it is clear, a write-back policy is allowed.
PCD decides whether this page can be cached. If set, this page cannot be
cached. If clear, then caching is allowed. Note that I said “allowed.” Clearing
this bit does not mean that the page will be cached. Other factors that go
beyond what I am trying to get across here are involved.
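To pull the individual bits together, here is a small decoder for the low-order bits of a page table entry. The bit positions follow the 32-bit x86 layout (PWT and PCD apply to the 486 and later); the function name and the sample value are made up for illustration, and the decoder only reports which bits are set without interpreting them.

```python
FLAG_BITS = {
    0: "present",
    1: "read/write",
    2: "user/supervisor",
    3: "PWT",
    4: "PCD",
    5: "accessed",
    6: "dirty",
}


def decode_entry(entry):
    """Return the physical page base and the names of the flag bits set."""
    page_base = entry & 0xFFFFF000    # upper 20 bits locate the page
    flags = [name for bit, name in FLAG_BITS.items() if entry & (1 << bit)]
    return page_base, flags


page_base, flags = decode_entry(0x0009A067)   # hypothetical entry
print(hex(page_base))   # 0x9a000
print(flags)
```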
Well, I’ve talked about how the CPU
helps the OS keep track of pages in memory. I also talked about how the CR3
register helps keep track of which page directory needs
to be read. I also talked about how pages can be protected by using a few bits
in the page table entry. However, one more thing is missing
to complete the picture: keeping track of which process is currently running,
which is done with the Task Register (TR).
The TR is not where most of the work is done. The CPU
simply uses it as a pointer to where the important information is kept. This
pointer is the Task State Descriptor (TSD). Like the other descriptors that
I’ve talked about, the TSD points to a particular segment.
This segment is the Task State Segment (TSS). The TSD contains, among other
things, the privilege level at which this task is operating. Using this
information along with that in the page table entry, you
get the protection that protected mode allows.
The TSS
contains essentially a snapshot of the CPU.
When a process’s turn on the CPU is over, the state of the entire CPU needs to
be saved so that the program can continue where it left off. This information
is stored in the TSS. This functionality is built into the CPU. When the OS
tells the CPU a task switch is occurring (that is, a new process is getting its
turn), the CPU knows to save this data automatically.
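The idea of the TSS as a snapshot can be illustrated with a toy model. The sketch below is not how the hardware lays out a TSS; the handful of registers, the class names, and the dictionary of task states are assumptions chosen to show only the save-and-restore step.

```python
import copy
from dataclasses import dataclass


@dataclass
class CpuState:
    """A tiny stand-in for the register contents saved on a task switch."""
    eip: int = 0   # instruction pointer
    esp: int = 0   # stack pointer
    eax: int = 0   # a general-purpose register
    cr3: int = 0   # location of the task's page directory


class ToyCpu:
    def __init__(self, first_task):
        self.task_register = first_task   # identifies the running task
        self.state = CpuState()

    def task_switch(self, task_states, new_task):
        # Save a snapshot of the outgoing task, then load the incoming one.
        task_states[self.task_register] = copy.copy(self.state)
        self.state = copy.copy(task_states[new_task])
        self.task_register = new_task


task_states = {"A": CpuState(), "B": CpuState(eip=0x2000, cr3=0x5000)}
cpu = ToyCpu("A")
cpu.state.eip = 0x1234               # task A runs for a while
cpu.task_switch(task_states, "B")    # B gets its turn
cpu.task_switch(task_states, "A")    # back to A
print(hex(cpu.state.eip))            # 0x1234: A continues where it left off
```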
If we put all of these components together, we get an
operating system that works together with the hardware to provide a
multitasking, multiuser system. Unfortunately, what I talked about here are
just the basics. I could spend a whole book just talking about the relationship
between the operating system and the CPU and still not be
done.
One thing I didn’t talk about was the difference between the 80386, 80486,
and Pentium. With each new processor comes new instructions. The 80486 added an
instruction pipeline to improve the performance to the point where the
CPU could average almost one instruction per cycle. The
Pentium has dual instruction paths (pipelines) to increase the speed even more.
It also contains branch prediction logic, which is used to “guess”
where the next instruction should come from.
The Pentium (as well as the later CPUs) has a few new features that
make for significantly more performance. The first feature is multiple
instruction paths or pipelines, which allow the CPU to work
on multiple instructions at the same time. In some cases, the CPU will have to
wait to finish one before working on the other, though this is not always
necessary.
The second improvement is called dynamic execution. Normally, instructions
are executed one after the other. If the execution order is changed, the whole
program is changed. Well, not exactly. In some instances, upcoming instructions
do not depend on previous instructions, so the processor can “jump ahead” and
start executing those instructions before others are finished.
The next advance is branch prediction. Based on previous activity, the
CPU can expect certain behavior to continue. For example,
the odds are that once the CPU is in a loop, the loop will be repeated. With
more than one pipeline executing instructions, multiple possibilities can be
attempted. This is not always right, but is right more than 75 percent of the
time!
The PentiumPro (P6) introduced the concept of data flow analysis. Here,
instructions are executed as they are ready, not necessarily in the order in
which they appear in the program. Often, the result is available before it
normally would be. The PentiumPro (P6) also introduced speculative execution, in
which the CPU takes a guess at or anticipates what is
coming.
The P6 is also new in that it is actually two separate chips.
The second chip, however, serves as the level 2 cache.
Both an external bus and a “private” bus
connect the CPU to the level 2 cache,
and both of these are 64 bits wide.
Both the Socket and the CPU
itself changed with the Pentium II processor.
Instead of a processor with pins sticking out all over the bottom, the Pentium
II uses a Single Edge Contact Cartridge (SECC). This reportedly eliminates the
need for redesigning the socket with every new CPU
generation. In addition, the
CPU is encased in plastic, which protects the CPU
during handling. Starting at
“only”, the Pentium II can reach speeds of up to 450 MHz.
Increasing performance even further, the Pentium II has increased the
internal, level-one cache
to 32KiB, with 16 KiB
for data and 16KiB for
instructions. Technically it may be appropriate to call the level-two cache
internal, as the 512KiB L2 cache
is included within the SECC, making access
faster than for a traditional L2 cache.
The Dual Independent Bus (DIB)
architecture provides for higher throughput
as there are separate system and
cache buses.
The Pentium II also increases performance internally through changes to the
processor logic. Using Multiple Branch Prediction, the Pentium II predicts the flow
of instructions through several branches. Because computers usually process
instructions in loops (i.e. repeatedly) it is generally easy to guess what the
computer will do next. By predicting multiple branches, the processor reduces
“wrong guesses.”
Processor “management” has become an important part of the Pentium II. A
Built-In Self-Test (BIST) is included, which is used to test things like the
cache and the TLB.
It also includes a diode within the case to monitor the
processor’s temperature.
The Pentium II Xeon Processor added a “system bus
management interface,”
which allows the CPU
to communicate with other system management components
(hardware and software). The thermal sensor, which was already present in the
Pentium II, as well as the new Processor Information ROM
(PI ROM) and the
Scratch EEPROM use this bus.
The PI ROM
contains various pieces of information about the CPU,
like the
CPU ID, voltage tolerances and other technical information. The Scratch EEPROM
is shipped blank from Intel but is intended for system manufacturers to include
whatever information they want to, such as an inventory of the other components,
service information, system default and so forth.
Like the Pentium II, the latest (as of this writing) processor, the Pentium III,
also comes in the Single Edge Contact Cartridge. It has increased the number of
transistors from the 7.5 million in the Pentium II to over 9.5 million.
Currently, the Pentium III comes in 450MHz and 500MHz models, with a 550MHz
model planned.
The Pentium III also includes the Internet Streaming SIMD Extensions, which
consist of 70 new instructions that enhance imaging in general, as well as 3D
graphics, streaming audio and video, and speech recognition.
Intel also added a serial number to the Pentium III. This is extremely
useful should the CPU itself get stolen or the computer get stolen after the
CPU has been installed. In addition, the CPU
can be uniquely identified across the network,
regardless of the network card or other components. This can be
used in the future to prevent improper access to sensitive data, aid in asset
management and help in remote management and configuration.
Even today, people still think that the PC CPU is synonymous with
Intel. That is, if you are going to buy a CPU for your PC, it will be
manufactured by Intel. This is not the case. Two manufacturers, Advanced Micro
Devices (AMD) and Cyrix, provide CPUs with comparable functionality. Like any
other brand name, Intel CPUs are often more expensive than an equivalent from
another company with the same performance.