The main frequency
The main frequency, also called the clock frequency or clock speed, is measured in megahertz (MHz) or gigahertz (GHz) and indicates the rate at which the CPU processes data.
The main frequency of a CPU = external frequency × multiplier. Many people believe that the main frequency alone determines the CPU's operating speed; this view is not only one-sided but, for servers, outright misleading. To date there is no definitive formula relating main frequency to actual computing speed, and even the two major processor manufacturers, Intel and AMD, disagree sharply on this point. From the development of Intel's product line it is clear that Intel has focused heavily on raising its clock speeds. Other manufacturers have taken different paths: someone once benchmarked a 1 GHz Transmeta processor and found its effective performance comparable to that of a 2 GHz Intel processor.
There is a relationship between the main frequency and the actual computing speed, but it is not a simple linear one. The main frequency merely indicates the oscillation rate of the digital pulse signal inside the CPU; it is not directly tied to the CPU's real computing power. Examples can be seen in Intel's own product line: a 1 GHz Itanium chip performs roughly as fast as a 2.66 GHz Xeon/Opteron, and a 1.5 GHz Itanium 2 is approximately as fast as a 4 GHz Xeon/Opteron. Actual speed also depends on the CPU's pipeline, bus, and other performance characteristics.
In short, the main frequency is related to actual computing speed, but it reflects only one aspect of CPU performance, not the CPU's overall performance.
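As a rough illustration of why clock speed alone is not enough, throughput can be modeled as clock frequency × IPC (instructions completed per cycle). The IPC figures in this sketch are invented for illustration, not measurements of any real chip:

```python
def throughput_mips(clock_hz: float, ipc: float) -> float:
    """Rough instruction throughput in millions of instructions per second."""
    return clock_hz * ipc / 1e6

# Assumed IPC values: a 1 GHz chip with high IPC can outrun
# a 2 GHz chip with low IPC.
slow_clock_high_ipc = throughput_mips(1e9, 3.0)  # 3000 MIPS
fast_clock_low_ipc = throughput_mips(2e9, 1.2)   # 2400 MIPS
print(slow_clock_high_ipc > fast_clock_low_ipc)  # True
```

This is the pattern behind the Transmeta and Itanium comparisons above: a lower-clocked design with more work done per cycle can match or beat a higher-clocked one.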
The external frequency
The external frequency is the base frequency of the CPU, in MHz, and it determines the operating speed of the whole motherboard. In desktop computers, what is commonly called overclocking is overclocking of the CPU's external frequency (the multiplier is generally locked), which is easy enough to understand. For server CPUs, however, overclocking is absolutely not allowed. As noted above, the CPU's external frequency determines the motherboard's operating speed, and the two run synchronously; if a server CPU is overclocked by changing the external frequency, asynchronous operation results (many desktop motherboards do support asynchronous operation), which makes the entire server system unstable.
In the vast majority of current computer systems the external frequency and the motherboard's front-side bus do not run at the same speed, and the external frequency is easily confused with the front-side bus (FSB) frequency; the next section on the front-side bus explains the difference between the two.
Front Side Bus (FSB) frequency
The front-side bus (FSB) frequency (i.e., the bus frequency) directly affects the speed of data exchange between the CPU and memory. Data bandwidth can be computed as: data bandwidth = (bus frequency × data bus width) / 8, so the maximum data-transfer bandwidth depends on the width and the transfer frequency of the bus. For example, the Xeon Nocona, with 64-bit support and an 800 MHz front-side bus, has by this formula a maximum data-transfer bandwidth of 6.4 GB/s.
The difference between the external frequency and the front-side bus (FSB) frequency: the FSB speed refers to the speed of data transfer, while the external frequency is the synchronized operating speed of the CPU and the motherboard. In other words, a 100 MHz external frequency means the digital pulse signal oscillates 100 million times per second, while a 100 MHz front-side bus means the amount of data the CPU can accept per second is 100 MHz × 64 bit ÷ 8 bit/Byte = 800 MB/s.
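The bandwidth formula above can be checked directly; the figures below are the ones used in the text, assuming a 64-bit-wide bus:

```python
def fsb_bandwidth_gb_s(bus_freq_mhz: float, width_bits: int) -> float:
    """Peak data bandwidth in GB/s: (bus frequency x bus width) / 8."""
    return bus_freq_mhz * 1e6 * width_bits / 8 / 1e9

print(fsb_bandwidth_gb_s(800, 64))  # 6.4  (the Xeon Nocona example)
print(fsb_bandwidth_gb_s(100, 64))  # 0.8, i.e. 800 MB/s
```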
In fact, the advent of the HyperTransport architecture has changed the practical meaning of the front-side bus (FSB) frequency. The IA-32 architecture requires three important building blocks: the Memory Controller Hub (MCH), the I/O Controller Hub (ICH), and the PCI Hub, as in typical Intel chipsets such as the Intel 7501 and Intel 7505, which are tailored for dual-processor servers. However, ever-increasing processor performance also creates many problems for the system architecture. The HyperTransport architecture not only alleviates these problems but also improves bus bandwidth more efficiently. In the AMD Opteron, for example, the flexible HyperTransport I/O bus architecture lets the processor integrate the memory controller, so the processor exchanges data directly with memory instead of passing it through the system bus to the chipset. In that case, a front-side bus (FSB) frequency is nowhere to be found in AMD Opteron processors.
CPU bits and word length
Bits: digital circuits and computer technology use binary code, consisting only of "0" and "1", and each "0" or "1" handled by the CPU is one bit.
Word length: the number of bits of binary data that a CPU can process at one time is called, in computer terminology, the word length. A CPU that can process 8 bits of data at a time is therefore called an 8-bit CPU, and likewise a 32-bit CPU can process 32 bits of binary data in one operation. The difference between a byte and the word length: since a common English character can be represented in 8 binary bits, 8 bits are usually called a byte. The word length, by contrast, is not fixed and differs from CPU to CPU: an 8-bit CPU can handle only one byte at a time, a 32-bit CPU can handle 4 bytes at a time, and similarly a 64-bit CPU can handle 8 bytes at a time.
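A minimal sketch of the word-length-to-bytes relationship described above:

```python
def bytes_per_word(word_length_bits: int) -> int:
    """Number of 8-bit bytes a CPU of the given word length handles at once."""
    return word_length_bits // 8

for bits in (8, 32, 64):
    print(f"{bits}-bit CPU handles {bytes_per_word(bits)} byte(s) at a time")
```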
Multiplier
The multiplier is the ratio between the CPU's main frequency and its external frequency. At the same external frequency, the higher the multiplier, the higher the CPU's main frequency. In practice, however, a high-multiplier CPU at a given external frequency is of limited value by itself, because the data-transfer speed between the CPU and the rest of the system is limited: a CPU that chases a high main frequency through a high multiplier runs into an obvious "bottleneck" effect, where the maximum rate at which the CPU can fetch data from the system cannot keep up with the CPU's computing speed. In general, apart from engineering samples, Intel's CPUs are multiplier-locked, with a few exceptions such as the Core-based Pentium Dual-Core E6500K and some Extreme Edition CPUs. AMD did not lock multipliers and has released Black Edition CPUs, unlocked-multiplier versions in which users are free to adjust the multiplier; overclocking by raising the multiplier is much more stable than overclocking by raising the external frequency.
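The main frequency = external frequency × multiplier relationship can be sketched as follows; the 200 MHz external frequency and 16× multiplier are hypothetical figures, not any particular product:

```python
def main_frequency_mhz(external_mhz: float, multiplier: float) -> float:
    """Main frequency = external frequency x multiplier."""
    return external_mhz * multiplier

print(main_frequency_mhz(200, 16))  # 3200 MHz, i.e. 3.2 GHz
# Raising the multiplier at the same external frequency raises the main frequency:
print(main_frequency_mhz(200, 17))  # 3400 MHz
```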
Cache
Cache size is also one of the important indicators of a CPU, and the structure and size of the cache have a very large impact on CPU speed. The cache inside the CPU runs at a very high frequency, generally the same as the processor itself, and is far more efficient than system memory or the hard disk. In practice the CPU often needs to read the same block of data repeatedly, and a larger cache significantly improves the hit rate for internal reads, so the CPU need not go out to memory or disk, which improves system performance. However, because of chip area and cost constraints, caches are very small.
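Why the hit rate matters can be sketched with a simple average-access-time model; the 1 ns cache latency and 100 ns memory latency are assumed round numbers for illustration, not figures from the text:

```python
def avg_access_time_ns(hit_rate: float, cache_ns: float, memory_ns: float) -> float:
    """Average memory access time: hits served by cache, misses by memory."""
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns

# Assumed latencies: 1 ns cache, 100 ns main memory.
print(round(avg_access_time_ns(0.90, 1.0, 100.0), 3))  # 10.9
print(round(avg_access_time_ns(0.99, 1.0, 100.0), 3))  # 1.99
```

Raising the hit rate from 90% to 99% cuts the average access time by a factor of five in this model, which is why a larger cache can matter far more than its raw size suggests.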
The L1 cache is the CPU's first-level cache, divided into a data cache and an instruction cache. The capacity and structure of the built-in L1 cache have a large impact on CPU performance, but cache is built from static RAM, whose structure is complex, and since the CPU die area cannot be made too large, the L1 cache cannot be made very large either. Server-CPU L1 caches are usually 32-256 KB.
The L2 cache is the CPU's second-level cache and comes in two forms, on-die and external. On-die L2 cache runs at the same speed as the main frequency, while external L2 cache runs at only half the main frequency. L2 capacity also affects CPU performance, and in principle the bigger the better: the largest home CPUs once topped out at 512 KB, laptop CPUs now reach 2 MB, and server and workstation CPUs carry still larger L2 caches, reaching 8 MB and beyond.
The L3 cache comes in two kinds: early versions were external, while today's are built in. Its practical role is to further reduce memory latency and improve processor performance on calculations over large volumes of data, both of which also help in gaming. In the server space, adding L3 cache still provides a significant performance boost: a configuration with a larger L3 cache uses physical memory more efficiently, so a slower disk I/O subsystem can handle more data requests, and processors with larger L3 caches exhibit more efficient file-system caching behavior and shorter message and processor queue lengths.
In fact, the earliest L3 cache appeared in AMD's K6-III processors, where, limited by the manufacturing process, it was not integrated on the chip but placed on the motherboard instead. There it could only run synchronously with the system bus frequency, making it not much faster than main memory. L3 cache was later used in Intel's Itanium processors for the server market; Intel went on to introduce an Itanium 2 processor with a 9 MB L3 cache, and later a dual-core Itanium 2 with a 24 MB L3 cache.
Basically, though, L3 cache is not decisive for processor performance: a Xeon MP with 1 MB of L3 cache, for example, was still no match for the Opteron, which shows that increasing front-side bus bandwidth brings more performance than increasing cache.
CPU Extended Instruction Set
CPUs rely on instructions to compute and to control the system, and each CPU is designed with a set of instructions matched to its hardware circuitry. Instruction-set strength is an important indicator of a CPU, and the instruction set is one of the most effective tools for improving microprocessor efficiency. Mainstream architectures today divide into complex instruction sets (CISC) and reduced instruction sets (RISC). In terms of concrete extensions, Intel's MMX (Multi Media eXtensions; "Multi Media Extended" was an outside guess at the name, as Intel never explained the etymology), SSE, SSE2 (Streaming SIMD Extensions 2), SSE3 and SSE4 series, and AMD's 3DNow! are all CPU extended instruction sets. They enhance the CPU's multimedia, graphics, Internet and similar processing capabilities, and are usually referred to simply as the CPU's instruction sets. Of these, SSE3 is the smallest: MMX contains 57 instructions, SSE 50, SSE2 144, and SSE3 only 13. SSE4 is the most advanced; Intel's Core-series processors already support SSE4, AMD planned to add SSE4 support in future dual-core processors, and other x86 vendors were expected to follow.
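To illustrate what a packed SIMD instruction does, here is a pure-Python sketch in the spirit of a packed 16-bit add (such as MMX's paddw): each lane is added independently, wrapping around within the lane rather than carrying into its neighbor. This is a conceptual model of the lane behavior only, not how such instructions are actually invoked:

```python
LANE_BITS = 16
MASK = (1 << LANE_BITS) - 1  # 0xFFFF: keeps each result inside its lane

def packed_add(a: list[int], b: list[int]) -> list[int]:
    """Add corresponding 16-bit lanes independently; overflow wraps within
    the lane and never carries into the next lane, as in a SIMD packed add."""
    return [(x + y) & MASK for x, y in zip(a, b)]

print(packed_add([1, 2, 0xFFFF, 10], [1, 2, 1, 10]))
# [2, 4, 0, 20] -- the third lane wraps to 0 instead of carrying into the fourth
```

A single such operation processes all four values at once, which is where the multimedia speedups of MMX/SSE-style extensions come from.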
CPU core and I/O voltages
From the 586 generation onward, CPU operating voltage has been divided into two kinds, core voltage and I/O voltage, with the core voltage usually less than or equal to the I/O voltage. Core voltage depends on the CPU's manufacturing process: in general, the smaller the process, the lower the core operating voltage. I/O voltage is generally in the range of 1.6-5 V. Lower voltages help solve the problems of excessive power consumption and heat.
Manufacturing process
The micron (or nanometer) figure of a manufacturing process refers to the distance between circuits within the IC. The trend in manufacturing processes is toward ever higher density: a denser IC design means more, and more complex, circuitry in the same die area. Mainstream processes have moved through 180 nm, 130 nm, 90 nm, 65 nm, and 45 nm, and Intel recently reached a 32 nm manufacturing process with the Core i3/i5 series. AMD, for its part, has said its products will largely skip the 32 nm process (a few 32 nm parts, such as Orochi and Llano, were slated for the third quarter of 2010) and move to 28 nm products (names unspecified) in early-to-mid 2011.
Instruction set
(1)CISC instruction set
The CISC instruction set, also known as the complex instruction set (CISC, for Complex Instruction Set Computer), executes a program's instructions serially in sequential order, and the individual operations within each instruction are likewise executed serially. Sequential execution has the advantage of simple control, but the computer's units are poorly utilized and execution is slow. In practice, CISC means the x86 series (i.e., IA-32 architecture) CPUs produced by Intel and compatible CPUs from AMD, VIA, and others. Even the newer x86-64 (also known as AMD64) still belongs to the CISC category.
To understand what an instruction set is, start from today's x86-architecture CPUs. The x86 instruction set was developed by Intel specifically for its first 16-bit CPU, the i8086. The CPU in the world's first PC, launched by IBM in 1981, the i8088 (a simplified i8086), also used x86 instructions, while an x87 chip was added to the computer to improve floating-point processing; the x86 and x87 instruction sets together later came to be referred to collectively as the x86 instruction set.
As CPU technology developed, Intel produced the newer i80386 and i80486, then the PII Xeon, PIII Xeon, Pentium 3 and Pentium 4 series, and finally today's Core 2 series and Xeon (excluding Xeon Nocona). To ensure that computers could keep running all the applications developed in the past, protecting and inheriting that rich software base, all CPUs produced by Intel have continued to use the x86 instruction set, so they still belong to the x86 series. Because the Intel x86 series and its compatible CPUs (such as the AMD Athlon MP) all use the x86 instruction set, they form today's huge lineup of x86-series and x86-compatible CPUs. x86 server CPUs currently fall into two main camps: Intel's and AMD's.
(2)RISC Instruction Set
RISC is the abbreviation of "Reduced Instruction Set Computing". It was developed on the basis of the CISC instruction system: tests on CISC machines showed that the usage frequency of different instructions is very uneven, the most common being a set of relatively simple instructions that make up only about 20% of the instruction count yet account for 80% of occurrences in programs. A complex instruction system inevitably increases the complexity of the microprocessor, making development long and costly, and complex instructions require complex operations, which inevitably reduce the computer's speed. For these reasons, RISC CPUs were born in the 1980s. Compared with CISC CPUs, RISC CPUs not only streamline the instruction system but also adopt superscalar and super-pipelined structures, greatly increasing parallel processing capability. The RISC instruction set is the direction of development for high-performance CPUs, and stands opposite the traditional CISC. By comparison, RISC has a unified instruction format, fewer instruction types, and fewer addressing modes than a complex instruction set, and of course much higher processing speed. CPUs with this instruction system are common in mid-range and high-end servers, and high-end servers in particular all use RISC CPUs. The RISC instruction system is better suited to UNIX, the operating system of high-end servers (Linux, too, is a UNIX-like operating system). RISC CPUs are incompatible with Intel and AMD CPUs in both software and hardware.
Currently, the CPUs that use RISC instructions in mid-range and high-end servers are mainly of the following types: PowerPC processors, SPARC processors, PA-RISC processors, MIPS processors, and Alpha processors.
(3) IA-64
EPIC (Explicitly Parallel Instruction Computing): there has been much debate over whether EPIC is the successor to the RISC and CISC systems. Taken on its own, the EPIC system looks more like an important step by Intel's processors toward the RISC camp. In theory, CPUs designed on the EPIC architecture handle Windows applications much better than Unix-based applications under the same host configuration.
Intel's server CPU built on EPIC technology is the Itanium (development code name Merced). It is a 64-bit processor and the first in the IA-64 family. Microsoft developed an operating system code-named Win64 to support it in software. Having long used the x86 instruction set, Intel turned to this more ambitious 64-bit design because it wanted to move beyond the sprawling x86 architecture and introduce an energetic and powerful instruction set; thus the IA-64 architecture, using the EPIC instruction set, was born.
In many ways, IA-64 was a vast improvement over x86: it broke through many limitations of the traditional IA-32 architecture and achieved breakthroughs in data-processing capability, system stability, security, and availability.
The biggest drawback of IA-64 microprocessors was their lack of x86 compatibility. To let IA-64 processors run legacy x86 software better, Intel introduced an x86-to-IA-64 decoder on its IA-64 processors (Itanium, Itanium 2, and so on) to translate x86 instructions into IA-64 instructions. This decoder was far from the most efficient, nor was it the best way to run x86 code (the best way being to run x86 code directly on an x86 processor), and as a result the performance of Itanium and Itanium 2 when running x86 applications was very poor. This became the root cause of x86-64.
(4) X86-64 (AMD64 / EM64T)
Designed by AMD, x86-64 handles 64-bit integer arithmetic while remaining compatible with the x86-32 architecture. It supports 64-bit logical addressing while providing the option of 32-bit addressing; data-manipulation instructions default to 32-bit and 8-bit operands, with the option of 64-bit and 16-bit operands; and in the general-purpose registers, a 32-bit operation has its result extended to the full 64 bits. This distinction between "direct execution" and "converted execution" keeps the instruction field at 8 or 32 bits, avoiding over-long fields.
The creation of x86-64 (also called AMD64) was no empty gesture: x86 processors' 32-bit addressing space limits them to 4 GB of memory, and IA-64 processors are not compatible with x86. AMD, mindful of customer needs, strengthened the x86 instruction set to support a 64-bit mode of operation, and so named its architecture x86-64. Technically, x86-64 introduces the new general-purpose registers R8-R15 for 64-bit operation, as an expansion of the original x86 registers (in 32-bit environments these registers are not fully utilized); the original registers such as EAX and EBX are widened from 32 to 64 bits; and eight new registers are added to the SSE unit for SSE2 support. The increased register count brings a performance increase. To support both 32- and 64-bit code and registers, the x86-64 architecture gives the processor two operating modes, Long Mode and Legacy Mode, with Long Mode divided into two sub-modes (64-bit mode and Compatibility mode). The standard was introduced in AMD's server line with the Opteron processors.
Intel's 64-bit extension technology, EM64T (called IA-32E before it was officially named EM64T), then brought 64-bit support to Intel's x86 line. EM64T is similar to AMD's x86-64: it supports a 64-bit sub-mode with 64-bit linear flat addressing, adds eight new general-purpose registers (GPRs), and adds eight registers to support SSE instructions. Like AMD64, Intel's 64-bit technology is backward compatible: IA-32E consists of two sub-modes, a 64-bit sub-mode and a 32-bit sub-mode, used when running under a 64-bit operating system, and EM64T is essentially compatible with AMD's x86-64 technology. Nocona processors already included some 64-bit technology, and Intel's Pentium 4E processors also support it. It should be noted that while both are 64-bit microprocessor architectures compatible with the x86 instruction set, there are still some differences between EM64T and AMD64: for example, the NX bit of AMD64 processors was not available in Intel's processors.
Hyperpipelining and superscalar
Before explaining super-pipelining and superscalar, it is important to understand the pipeline. Pipelines were first used by Intel in its 486 chips. A pipeline works like an assembly line in industrial production: inside the CPU, five or six circuit units with different functions form an instruction-processing pipeline, and each x86 instruction is split into five or six steps executed by these units in turn, so that an instruction can be completed in one CPU clock cycle, improving the CPU's computing speed. Each of the classic Pentium's integer pipelines is divided into four pipeline stages, namely instruction prefetch, decode, execute, and result write-back, while its floating-point pipeline is divided into eight stages.
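The benefit of pipelining can be sketched with ideal cycle counts; this model assumes no stalls or hazards, so real pipelines do somewhat worse:

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Without a pipeline, each instruction occupies all stages in turn."""
    return n_instructions * n_stages

def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Ideal pipeline: the first instruction takes n_stages cycles to fill
    the pipeline, then one instruction retires every cycle."""
    return n_stages + (n_instructions - 1)

print(unpipelined_cycles(100, 5))  # 500
print(pipelined_cycles(100, 5))    # 104
```

With the pipeline full, throughput approaches one instruction per cycle regardless of how many stages each instruction passes through.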
Superscalar means executing multiple instructions simultaneously by building in multiple pipelines; in essence it trades space for time. Super-pipelining, on the other hand, refines the pipeline into more stages and raises the main frequency so that one or even several operations complete per machine cycle; in essence it trades time for space. The Pentium 4, for example, has a pipeline 20 stages long. The more stages a pipeline has, the less work each stage does, so the clock frequency can be driven higher; but an over-long pipeline also brings side effects, and a higher-frequency CPU may well end up with lower actual computing speed. Intel's Pentium 4 exhibited exactly this: although its frequency could exceed 1.4 GHz, its computing performance fell short of lower-clocked competitors.

CPU packaging

In general, CPUs installed in sockets use PGA (Pin Grid Array) packaging, while CPUs installed in Slot x slots use SEC (Single Edge Contact) cartridge packaging. There are also PLGA (Plastic Land Grid Array) and OLGA (Organic Land Grid Array) packaging technologies. With market competition growing ever fiercer, the current development of CPU packaging technology is focused on saving cost.
Multithreading
Simultaneous multithreading, or SMT, lets multiple threads on the same processor execute in parallel by duplicating the processor's architectural state while sharing the processor's execution resources. It maximizes wide-issue, out-of-order superscalar processing, increases the utilization of the processor's execution units, and mitigates the latency of memory accesses caused by data dependences or cache misses. When multiple threads are not available, an SMT processor is almost identical to a traditional wide-issue superscalar processor. The appeal of SMT is that it requires only small changes to the processor core design to deliver a significant performance increase at almost no extra cost. Multithreading technology can keep high-speed computing cores busy by preparing more data for them to process. This is attractive even for low-end desktop systems, and Intel supported SMT on processors beginning with the 3.06 GHz Pentium 4.
Multi-core
Multi-core also goes by the name single-chip multiprocessing (Chip Multiprocessors, or CMP). CMP was proposed by Stanford University; the idea is to integrate the SMP (symmetric multiprocessing) of massively parallel processors onto a single chip, with each processor core executing different processes in parallel. Compared with CMP, the SMT processor structure is more flexible. However, once semiconductor processes passed 0.18 micron, wire delays came to exceed gate delays, forcing microprocessor designs to be divided into many smaller, better-localized basic cell structures. Here the CMP structure, which divides the design into multiple processor cores, each simpler and easier to optimize, holds the greater promise. IBM's Power 4 chip and Sun's MAJC 5200 chip both use the CMP architecture. Multi-core processors can share the on-chip cache, improving cache utilization while simplifying the complexity of multiprocessor system design.
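At the software level, exploiting multiple cores means distributing independent work across them. A minimal sketch using Python's standard-library process pool follows; the `square` workload is purely illustrative:

```python
from concurrent.futures import ProcessPoolExecutor

def square(n: int) -> int:
    """A stand-in for any CPU-bound piece of independent work."""
    return n * n

if __name__ == "__main__":
    # By default the pool starts one worker process per CPU core,
    # so independent tasks run genuinely in parallel on a multi-core chip.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(square, range(8)))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

For trivial work like this the process-startup overhead dominates; the pattern pays off when each task does substantial computation.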
In the second half of 2005, new processors from Intel and AMD also adopted the CMP architecture. The Itanium processor developed under the code name Montecito is a dual-core design with a minimum of 18 MB of on-chip cache, built on a 90 nm process and containing approximately 1 billion transistors, with each core holding separate L1, L2, and L3 caches. It was positioned as an outright challenge to the chip industry of its day.
SMP
SMP (Symmetric Multi-Processing), the symmetric multiprocessing architecture, refers to a set of processors (multiple CPUs) assembled in a single computer, sharing the memory subsystem and the bus structure. With this technology, a server system can run multiple processors simultaneously, sharing memory and other host resources. A dual-Xeon system, also called two-way, is the most common type of symmetric multiprocessor system (Xeon MP supports up to four-way, AMD Opteron 1- to 8-way), and a few systems are 16-way. In general, however, SMP machines scale poorly: going much beyond tens of processors is difficult, and the usual configuration is 8 to 16 CPUs, which is enough for most users. SMP is most common in high-performance server and workstation motherboard architectures; UNIX servers, for instance, can support systems of up to 256 CPUs. The necessary conditions for building an SMP system are: SMP-capable hardware, including motherboard and CPUs; an SMP-capable system platform; and SMP-capable application software. For an SMP system to perform efficiently, the operating system must support SMP, as the 32-bit operating systems WINNT, LINUX, and UNIX do, with multitasking and multithreading. Multitasking means the operating system can have different CPUs complete different tasks at the same time; multithreading means the operating system can have different CPUs complete the same task in parallel.
Forming an SMP system places high demands on the chosen CPUs. First, each CPU must have a built-in APIC (Advanced Programmable Interrupt Controller) unit; the use of the APIC is the core of Intel's multiprocessing specification. Second, the production batch matters: when CPUs from two different batches run as a dual-processor system, one CPU may end up overburdened and the other underused, preventing maximum performance, or worse, the system may crash.
NUMA Technology
NUMA, non-uniform-access distributed shared-memory technology, is a system consisting of a number of independent nodes connected through a high-speed private network, where each node may be a single CPU or an SMP system. In NUMA there are several solutions for cache consistency, which require operating-system and special-software support. Figure 2 shows an example of a Sequent NUMA system: three SMP modules are linked by a high-speed private network to form a node, and each node can have up to 12 CPUs. A system like Sequent's can reach a maximum of 64 CPUs, or even 256 CPUs. Clearly this combines the two technologies: it is based on SMP and then extended with NUMA.
Out-of-order execution
Out-of-order execution is a technique by which the CPU dispatches multiple instructions to the appropriate circuit units in an order other than program order. After analyzing the state of each circuit unit and whether each instruction can be executed early, instructions that can be executed early are immediately sent to the corresponding units; during this period instructions execute out of order, and a reorder unit then rearranges the results of the execution units back into program order. The purpose of out-of-order execution is to keep the CPU's internal circuitry fully busy and correspondingly increase the speed at which the CPU runs programs. Branch technology: branch instructions must wait on operation results; unconditional branches simply execute in instruction order, while conditional branches must wait for the processed result before deciding whether to continue in the original order.
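A toy model of out-of-order issue follows: each "instruction" names the registers it reads and writes plus an invented latency, and anything whose inputs are ready may issue regardless of program order. This is a conceptual sketch, not a model of any real scheduler:

```python
def issue_cycles(instructions):
    """Map each instruction name to the cycle it issues in: any instruction
    whose source registers are complete may issue, regardless of order."""
    written = {ins["dst"] for ins in instructions}
    done_at = {}
    for ins in instructions:          # registers produced outside the window
        for src in ins["src"]:        # are treated as ready at cycle 0
            if src not in written:
                done_at[src] = 0
    issued, pending, cycle = {}, list(instructions), 0
    while pending:
        for ins in list(pending):
            if all(done_at.get(s, float("inf")) <= cycle for s in ins["src"]):
                issued[ins["name"]] = cycle
                done_at[ins["dst"]] = cycle + ins["lat"]
                pending.remove(ins)
        cycle += 1
    return issued

prog = [
    {"name": "load r1", "src": ["addr"], "dst": "r1", "lat": 3},
    {"name": "add r2",  "src": ["r1"],   "dst": "r2", "lat": 1},  # waits on the load
    {"name": "mul r3",  "src": ["r4"],   "dst": "r3", "lat": 1},  # independent
]
print(issue_cycles(prog))
# {'load r1': 0, 'mul r3': 0, 'add r2': 3} -- mul issues ahead of program order
```

The independent multiply does not sit idle behind the stalled add; it issues immediately, which is exactly the utilization gain the paragraph above describes.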
Memory controllers inside the CPU
Many applications have more complex read patterns (almost random, especially when cache hits are unpredictable) and cannot use bandwidth efficiently. Business-processing software is typical of this class: even with CPU features such as out-of-order execution, it remains limited by memory latency. The CPU must wait until the data required for an operation has been loaded before it can execute the instruction, whether the data comes from the CPU cache or from the main memory system. With current memory latency at around 120-150 ns for low-end systems and CPU speeds of 3 GHz or more, a single memory request can waste 200-300 CPU cycles. Even with a 99% cache hit rate, the CPU may spend up to 50% of its time waiting for memory requests to finish, because of memory latency.
With the Opteron's integrated memory controller, by contrast, latency is much lower than with a chipset-based dual-channel DDR memory controller. Intel also planned to integrate the memory controller into the processor, which makes the Northbridge chip less important; changing the way the processor accesses main memory helps increase bandwidth, reduce memory latency, and improve processor performance.
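The cycle cost quoted above is simply latency multiplied by clock rate; a quick sketch with illustrative round numbers (not measurements of any particular system):

```python
def wasted_cycles(latency_ns: float, clock_ghz: float) -> float:
    """CPU cycles spent waiting on one memory request: latency x clock rate.
    (1 ns at 1 GHz is exactly one cycle, so the units cancel directly.)"""
    return latency_ns * clock_ghz

print(wasted_cycles(100, 3.0))  # 300.0 cycles lost per request at 3 GHz
print(wasted_cycles(70, 3.0))   # 210.0 -- an assumed lower-latency integrated controller
```

Shaving tens of nanoseconds off memory latency, as an integrated controller does, directly recovers hundreds of otherwise idle cycles per miss.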