CPU knowledge
The CPU, or Central Processing Unit, is responsible for computing and processing information and data, and for controlling the sequence of its own operations. In early computers, the CPU consisted of two separate parts, the arithmetic unit and the controller; later, as circuit integration improved and the microprocessor appeared, both were integrated onto a single chip.

The CPU is used wherever there is a need for intelligent control and large amounts of information processing.

CPUs come in general-purpose and embedded varieties, distinguished by how they are applied. General-purpose CPU chips are usually more powerful and can run complex operating systems and large application software. Embedded CPUs vary widely in functionality and performance. As integration has improved, embedded designs tend to integrate the CPU, memory, and some peripheral circuitry onto a single chip, forming a so-called system-on-chip (SoC); the CPU on the SoC is then called the CPU core.

There are two diametrically opposed directions for optimizing the design of an instruction system. One is to enhance the power of individual instructions: common functions that used to be realized in software are instead implemented as complex hardware instructions. A computer built this way is called a complex instruction set computer (CISC). Intel's early x86 instruction system is a CISC design.

RISC stands for Reduced Instruction Set Computer. Developed in the 1980s, the idea is to simplify instructions as much as possible, keeping only those simple enough to complete within a single clock beat, and to implement more complex functions with subroutines. At present, processor chip manufacturers using RISC architectures include SUN, SGI, IBM (the Power PC series), DEC (the Alpha series), and Motorola (the Dragonball and Power PC), among others.

Introducing the MIPS system.

MIPS is a popular RISC processor family. MIPS stands for "Microprocessor without Interlocked Pipeline Stages"; its approach is to avoid data hazards in the pipeline as far as possible by software means. It was first developed in the early 1980s by a research group led by Prof. Hennessy at Stanford University. The MIPS R series are the industrial RISC microprocessors developed from that work, and many computer companies have used them to build workstations and computer systems.

Instruction system

To talk about the CPU, we must first talk about the instruction system. The instruction system is the set of all instructions a CPU can process, and it is the fundamental property of a CPU. For example, the CPUs we use today all adopt the x86 instruction set, so they are the same type of CPU, whether a PIII, an Athlon, or a Joshua. There are CPUs in the world much faster than the PIII and the Athlon, such as the Alpha, but they do not use the x86 instruction set and cannot run the huge body of software built on it, such as Windows 98. The instruction system is a fundamental property of a CPU precisely because it determines what programs a CPU can run.

All programs written in a high-level language need to be translated (compiled or interpreted) into machine language, which contains the instructions.

1. Instruction format

An instruction generally consists of two parts: the opcode and the address code. The opcode is essentially the serial number of the instruction; it tells the CPU which operation to execute. The address code is a bit more complicated, mainly comprising the source operand addresses, the destination address, and the address of the next instruction. In some instructions the address code can be partially or completely omitted; a no-op, for example, has only an opcode and no address code.

For example, suppose an instruction system has a 32-bit instruction length, an 8-bit opcode, and 8-bit addresses, with instruction 1 being addition and instruction 2 subtraction. When the CPU receives the instruction "00000010 00000100 00000001 00000110", it first extracts the leading 8-bit opcode, 00000010, and determines that this is a subtraction with three addresses: two source operand addresses and one destination address. The CPU therefore fetches the minuend from memory address 00000100 and the subtrahend from 00000001, sends them to the ALU for subtraction, and writes the result to 00000110.

This is just a simplified example, and the actual situation is much more complex
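The worked example above can be sketched in a few lines of code. This is a toy model, not any real instruction set: opcode 1 is addition and opcode 2 is subtraction, as in the text, and "memory" is just a dictionary keyed by address.

```python
def decode(word):
    """Split a 32-bit instruction word into an 8-bit opcode and three 8-bit addresses."""
    opcode = (word >> 24) & 0xFF
    src1 = (word >> 16) & 0xFF   # first source operand address
    src2 = (word >> 8) & 0xFF    # second source operand address
    dest = word & 0xFF           # destination address
    return opcode, src1, src2, dest

def execute(word, memory):
    """Carry out the decoded instruction against a dictionary standing in for memory."""
    opcode, src1, src2, dest = decode(word)
    if opcode == 1:              # instruction 1: addition
        memory[dest] = memory[src1] + memory[src2]
    elif opcode == 2:            # instruction 2: subtraction
        memory[dest] = memory[src1] - memory[src2]

# The instruction from the text: 00000010 00000100 00000001 00000110
word = 0b00000010_00000100_00000001_00000110
memory = {0b00000100: 9, 0b00000001: 4}
execute(word, memory)
print(memory[0b00000110])  # 9 - 4 = 5
```

Note how the fixed field widths make decoding nothing more than shifts and masks; that simplicity is exactly what variable-length instruction sets give up.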

2. Classification and Addressing of Instructions

Generally speaking, there are the following types of instructions in the current instruction system:

(1) Arithmetic-Logic Instructions

Arithmetic-logic instructions include arithmetic instructions such as addition, subtraction, multiplication, and division, and logical instructions such as AND, OR, NOT, and XOR. Modern instruction systems also include some decimal arithmetic instructions and string instructions.

(2) Floating-point instructions

Used to operate on floating-point numbers. Floating-point operations are much more complex than integer operations, so the CPU usually has a floating-point unit dedicated to floating-point operations. Today's floating-point instructions generally include vector instructions, which are used to perform operations directly on matrices and are useful for today's multimedia and 3D processing.

(3) Bit operation instructions

Anyone who has studied C knows that C has a set of bit-operation statements. Correspondingly, the instruction system also has a set of bit-operation instructions, such as shift left by one and shift right by one. Since all data inside the computer is represented in binary, bit operations are a very simple and fast way to manipulate it.
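As a quick illustration (shown in Python rather than C, but the operators are the same), here is what the shift and mask instructions compute:

```python
x = 0b0001_1010  # 26

# Shifts: each left shift multiplies by 2, each right shift halves (discarding the low bit).
assert x << 1 == 52
assert x >> 1 == 13

# AND, OR, XOR used as masking operations on the low 4 bits.
mask = 0b0000_1111
assert x & mask == 0b0000_1010  # AND: keep only the low 4 bits
assert x | mask == 0b0001_1111  # OR: force the low 4 bits to 1
assert x ^ mask == 0b0001_0101  # XOR: flip the low 4 bits

print("all bit operations behave as expected")
```

In hardware these are among the cheapest instructions there are, which is why they complete in a single cycle on essentially every CPU.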

(4) Other instructions

The above three types of instructions are operational, but there are many other non-operational instructions. These instructions include: data transfer instructions, stack operation instructions, transfer type instructions, input/output instructions, and some special instructions such as privileged instructions, multiprocessor control instructions, and wait, stop, and null instructions.

For the address code in an instruction there are many addressing modes, mainly direct addressing, indirect addressing, register addressing, base addressing, indexed addressing, and so on. Some complex instruction systems have dozens of addressing modes.

3. CISC and RISC

CISC stands for Complex Instruction Set Computer, and RISC for Reduced Instruction Set Computer. Although the two terms refer to entire computers, in what follows we will look only at the instruction sets.

(1) The Creation, Development, and Current Status of CISC

In the beginning, a computer's instruction system contained only a few basic instructions, and more complex operations were realized by combining simple instructions when software was compiled. For example, multiplying a by b can be converted into b additions of a, so a multiplication instruction is not strictly necessary. Even so, multiplication instructions appeared in instruction systems very early. Why? Because a hardware multiplier is much faster than a combination of additions.

Because computer parts were expensive and slow at the time, more and more complex instructions were added to the instruction system to increase speed. But a problem soon appeared: the number of instructions in an instruction system is limited by the number of bits in the opcode. If the opcode is 8 bits, the maximum number of instructions is 256 (2 to the power of 8).

So what to do? Widening the instruction is difficult, so clever designers came up with a solution: opcode expansion. As noted earlier, an opcode is followed by an address code, and some instructions use no address code or only a short one; the opcode can be extended into those positions.

As a simple example, suppose an instruction system has a 2-bit opcode; then there can be four different instructions: 00, 01, 10, and 11. Now reserve 11 and extend the opcode to 4 bits, and there can be seven instructions: 00, 01, 10, 1100, 1101, 1110, and 1111. The four instructions 1100, 1101, 1110, and 1111 must make do with address codes two bits shorter.
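The counting in this example can be checked mechanically. The snippet below is purely illustrative: it enumerates the short opcodes, treats 11 as the reserved escape prefix, and enumerates the extended opcodes built on it.

```python
# All 2-bit patterns except the reserved escape prefix "11" remain short opcodes.
short_opcodes = [f"{i:02b}" for i in range(4) if f"{i:02b}" != "11"]

# The escape prefix "11" followed by any 2-bit pattern forms a 4-bit extended opcode.
extended_opcodes = ["11" + f"{i:02b}" for i in range(4)]

print(short_opcodes)     # ['00', '01', '10']
print(extended_opcodes)  # ['1100', '1101', '1110', '1111']
print(len(short_opcodes) + len(extended_opcodes))  # 3 + 4 = 7 instructions
```

The trade is one short opcode for four longer ones, paid for with two bits taken from the address code of the extended instructions.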

Then, to satisfy the prerequisite for opcode expansion, namely a shorter address code, designers racked their brains and invented a variety of addressing modes, such as base addressing and relative addressing, compressing the address code length as much as possible to leave room for the opcode.

In this way the CISC instruction system slowly took shape. A large number of complex instructions, variable instruction lengths, and a multitude of addressing modes are the hallmarks of CISC, and also its drawbacks: they make decoding much more difficult, and with today's fast hardware, the speed gained from complex instructions is far less than the time wasted on decoding. Except for the PC market, which still uses the x86 instruction set, servers and larger systems long ago moved away from CISC. x86 survives only for compatibility with the vast amount of software on the x86 platform.

(2) The Emergence, Development, and Current Status of RISC

In 1975, IBM designer John Cocke studied the IBM 370, a CISC system of the time, and found that the simple instructions, which made up only 20% of the instruction set, accounted for 80% of a program's executed instructions, while the complex instructions, 80% of the set, were used only 20% of the time. From this he developed the concept of RISC.

RISC proved a success: in the late 1980s RISC CPUs from many companies sprang up and took a large share of the market, and by the 1990s x86 CPUs such as the Pentium and K5 were themselves using advanced RISC cores.

The most important features of RISC are a fixed instruction length, few instruction formats, few addressing modes, mostly simple instructions that complete in a single clock cycle, ease of designing superscalar and pipelined implementations, a large number of registers, and operations that take place mostly between registers. Since most of the CPU cores discussed below are RISC cores, there is no need to say more here; RISC core design is covered in detail below.

RISC is currently in full swing, and Intel's Itanium will eventually abandon x86 in favor of a RISC architecture.

CPU core structure

Now let's look at the CPU itself. The CPU core is divided into two main parts: the arithmetic unit and the controller.

(1) Arithmetic unit

1. Arithmetic and Logic Unit (ALU)

The ALU mainly performs fixed-point arithmetic (addition, subtraction, multiplication, division), logical operations (AND, OR, NOT, XOR), and shift operations on binary data. Some CPUs also have a dedicated shifter to handle shift operations.

Typically, an ALU has two inputs and one output. The integer unit is sometimes called the IEU (Integer Execution Unit). When we say "the CPU is XX-bit", we mean the width of the data the ALU can process.

2. Floating-Point Unit (FPU)

The FPU is mainly responsible for floating-point operations and high-precision integer operations. Some FPUs also provide vector operations, while other CPUs have a dedicated vector processing unit.

3. General-purpose registers

The general-purpose registers are the fastest storage in the machine, used to hold the operands and intermediate results of computations.

General-purpose register design differs greatly between RISC and CISC. CISC machines usually have very few registers, mainly because of the hardware costs of the time; the x86 instruction set, for example, has only 8 general-purpose registers. As a result, a CISC CPU spends most of its execution time accessing data in memory rather than in registers, which slows down the whole system. RISC systems, by contrast, tend to have very many general-purpose registers and use techniques such as overlapping register windows and register stacks to make full use of them.

To address the x86 instruction set's limit of 8 general-purpose registers, the latest CPUs from Intel and AMD use a technique called "register renaming", which lets an x86 CPU go beyond the 8-register limit to 32 or more registers. Compared with RISC, however, this technique costs an extra clock cycle per register operation to perform the renaming.

4. Specialized registers

Specialized registers are usually status registers, which cannot be changed by the program; they are controlled by the CPU itself to indicate certain states.

(2) Controller

The arithmetic unit can only perform computations; the controller directs the work of the entire CPU.

1. Instruction Controller

The instruction controller is a very important part of the controller. It fetches instructions, decodes them, and hands them to the execution units (ALU or FPU); it also forms the address of the next instruction.

2. Timing Controller

The timing controller provides control signals for each instruction in the proper time sequence. It consists of a clock generator and a frequency-multiplier unit. The clock generator uses a quartz crystal oscillator to produce a very stable pulse signal, which is the CPU's main clock frequency; the multiplier unit defines how many times the memory (bus) frequency the CPU's main frequency is.
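The relationship the multiplier unit fixes is simple arithmetic: core clock = bus clock × multiplier. The numbers below are purely illustrative, not those of any particular CPU.

```python
bus_mhz = 100       # hypothetical bus (memory) frequency in MHz
multiplier = 10     # hypothetical multiplier setting

# The multiplier unit fixes the ratio between the two clocks.
core_mhz = bus_mhz * multiplier
print(core_mhz)     # 1000 MHz core clock from a 100 MHz bus
```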

3. Bus Controller

The bus controller is mainly used to control the internal and external buses of the CPU, including the address bus, data bus, control bus, and so on.

4. Interrupt Controller

The interrupt controller handles the various interrupt requests, queues them by priority, and passes them to the CPU one by one.

(3) CPU core design

What determines the performance of a CPU? The speed of a single ALU is not decisive, because ALUs all run at roughly the same speed. What determines CPU performance is the design of the CPU core.

1. Superscalar

Since the speed of a single ALU cannot be increased dramatically, what are the alternatives? Once again, parallel processing works powerfully. A so-called superscalar CPU is simply one that integrates multiple ALUs, multiple FPUs, multiple decoders, and multiple pipelines, gaining performance through parallelism.

Superscalar technology should be easy to understand. One caveat: do not read too much into the number before "superscalar", as in "9-way superscalar"; different vendors define it differently, and it is largely a marketing device.

2. Pipeline

The pipeline is an important part of the design of modern RISC cores, and it has greatly improved performance.

The execution of a specific instruction can usually be divided into five parts: instruction fetch, instruction decode, operand fetch, execution (ALU), and write-back. The first three steps are generally handled by the instruction controller and the last two by the arithmetic unit. In the traditional scheme, instructions execute strictly one after another: the instruction controller performs the first three steps of the first instruction, the arithmetic unit performs the last two, then the instruction controller starts on the second instruction, and so on. Clearly, while the instruction controller is working the arithmetic unit is largely idle, and vice versa, wasting considerable resources. The solution is easy to see: as soon as the instruction controller finishes the first three steps of the first instruction, it starts straight in on the second, and the arithmetic unit does likewise. The result is a pipelined system; in this case, a 2-stage pipeline.

In a superscalar system, say with three instruction controllers and two arithmetic units, the CPU can fetch the first instruction, then fetch the second while the first is decoded, then fetch the third while the second is decoded and the first fetches its operand, and so on. This is a 5-stage pipeline. In theory, the average speed of a 5-stage pipeline approaches five times that of non-pipelined execution.
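The timing argument above can be sketched numerically, under the idealized assumptions that every stage takes exactly one clock cycle and the pipeline never stalls:

```python
def cycles(n_instructions, n_stages, pipelined):
    """Cycles to execute n instructions on an ideal n_stages-deep machine."""
    if pipelined:
        # Fill the pipeline once, then retire one instruction per cycle.
        return n_stages + (n_instructions - 1)
    # Non-pipelined: each instruction occupies all stages sequentially.
    return n_stages * n_instructions

n = 1000
seq = cycles(n, 5, pipelined=False)   # 5000 cycles
pipe = cycles(n, 5, pipelined=True)   # 1004 cycles
print(seq / pipe)  # about 4.98: the speedup approaches 5x as n grows
```

The fill cost (the first n_stages - 1 cycles) is why the speedup only approaches, and never quite reaches, the pipeline depth.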

The pipelined system maximizes the use of CPU resources, so that each component works every clock cycle, greatly improving efficiency. However, pipelining has two very big problems: correlation and transfer.

In a pipelined system, if a second instruction needs the result of the first, this is called a dependency. For example, in a 5-stage pipeline, when the second instruction is due to fetch its operand, the first instruction has not finished yet; if the second instruction fetched the operand anyway, it would get a wrong result, so the whole pipeline must stall and wait for the first instruction to complete. This is a nasty problem, especially for long pipelines such as 20 stages, where such a stall usually costs more than 10 clock cycles. The current remedy is out-of-order execution: insert unrelated instructions between two dependent ones to smooth out the pipeline. In the example above, after the first instruction, the CPU executes the third instruction directly (assuming it is independent) and only then the second; by the time the second instruction needs its operand, the first has finished, and the pipeline never stalls. Of course, pipeline stalls cannot be avoided entirely, especially when many instructions are interdependent.

The other big problem is the conditional branch. In the example above, if the first instruction is a conditional branch, it is not known which instruction should execute next, and the CPU would have to wait for the first instruction's result before executing the second. The pipeline stall caused by a conditional branch is even worse than that caused by a dependency, so branch prediction is now used to handle it. Although programs are full of branches and any branch outcome is possible, most of the time a particular direction is taken: a loop-closing branch, for example, almost always continues the loop, except the final time when it exits. On this basis, branch prediction guesses the next instruction and executes it before the result is known. Today's branch predictors are right more than 90% of the time, but when a prediction is wrong, the CPU must flush the entire pipeline and return to the branch point, losing many clock cycles. Further improving branch prediction accuracy therefore remains an active research topic.
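The loop-branch intuition above can be demonstrated with a toy two-bit saturating-counter predictor, a classic textbook scheme (the text does not say which scheme any particular CPU uses, so this is an illustration, not a description of a real core). The counter moves between 0 (strongly not-taken) and 3 (strongly taken); one misprediction does not immediately flip the prediction, which suits loop exits well.

```python
def predict(counter):
    """Predict 'taken' when the counter is in the upper half (2 or 3)."""
    return counter >= 2

def update(counter, taken):
    """Saturating update: move toward 3 on taken, toward 0 on not-taken."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)

# Simulate a loop branch: taken 9 times, then exits once, repeated 10 times.
counter, correct, total = 3, 0, 0
for taken in ([True] * 9 + [False]) * 10:
    if predict(counter) == taken:
        correct += 1
    counter = update(counter, taken)
    total += 1

print(correct / total)  # 0.9: only the 10 loop exits are mispredicted
```

After each mispredicted exit the counter drops only to 2, so the very next loop iteration is still predicted correctly; that hysteresis is the whole point of using two bits instead of one.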

The longer the pipeline, the more serious the dependency and branch problems, so neither a longer pipeline nor a wider superscalar design is automatically better; the important thing is to strike a balance between speed and efficiency.

1. Decoder (Decode Unit)

The decoder is something only x86 CPUs have. Its role is to translate variable-length x86 instructions into fixed-length RISC-like instructions and pass them to the RISC core. Decoding is divided into hardware decoding and microcode decoding: simple x86 instructions need only hardware decoding, which is fast, while complex x86 instructions must be microcode-decoded and broken into a number of simple instructions, which is slow and complicated. Fortunately, these complex instructions are rarely used.

The old CISC x86 instruction set severely limited the performance of both the Athlon and the PIII.

2. First-level and second-level caches

First-level and second-level caches exist to ease the conflict between a fast CPU and slow memory. The first-level cache is usually integrated into the CPU core, while the second-level cache, which runs faster than memory, is attached OnDie or OnBoard. The CPU's cache is especially important for tasks that involve large amounts of data exchange.