The Pentium 4 processor with the NetBurst microarchitecture is a complete processor redesign. It delivers new technologies while advancing many of the innovative features introduced in previous Intel microarchitectures, such as "out-of-order speculative execution" and "superscalar execution". Many of these innovations and advances were made possible by improvements in processor technology, process technology, and circuit design, and could not previously be implemented in high-volume manufacturing. The features of the new microarchitecture and the benefits they bring are described in the following sections.
Designing for Performance
In defining the NetBurst microarchitecture, the benefits of many advanced processor technologies were studied to determine which would most improve overall processor performance over the next few years. The result of this exercise is a microarchitecture that improves frequency capability by more than 40% over the P6 microarchitecture (on the same fabrication process) while keeping average IPC (instructions per clock) within approximately 10% to 20% of the P6 microarchitecture. Despite the lower IPC, the increased frequency capability more than makes up for it (Performance = Frequency x IPC) and gives the end user higher performance across the board. This is made possible by the NetBurst microarchitecture's hyper-pipelined technology, a pipeline twice as deep as that of the P6 microarchitecture. While the deeper pipeline enables higher frequencies, the potential performance costs of a longer pipeline were taken into account and addressed in the design. The design therefore focuses on the following goals:
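The trade-off above can be checked with a few lines of arithmetic. The sketch below plugs the figures quoted in the text (40% more frequency, IPC 10% to 20% below P6) into Performance = Frequency x IPC. The function names are illustrative only, and the result is a back-of-the-envelope estimate rather than a measured benchmark.

```python
def relative_performance(freq_gain: float, ipc_ratio: float) -> float:
    """Performance = Frequency x IPC, relative to a baseline of 1.0."""
    return freq_gain * ipc_ratio


def break_even_ipc_ratio(freq_gain: float) -> float:
    """IPC ratio at which the faster clock exactly cancels the IPC loss."""
    return 1.0 / freq_gain


FREQ_GAIN = 1.40  # "more than 40%" higher frequency than P6 (from the text)

for ipc_ratio in (0.90, 0.80):  # IPC 10% and 20% below P6 (from the text)
    speedup = relative_performance(FREQ_GAIN, ipc_ratio)
    print(f"IPC at {ipc_ratio:.0%} of P6 -> {speedup:.2f}x P6 performance")

print(f"Break-even IPC ratio: {break_even_ipc_ratio(FREQ_GAIN):.2f}")
```

Under these assumptions the design stays ahead of P6 as long as its IPC does not fall below roughly 71% of the P6 value.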
Minimize losses associated with branch prediction
Explanation of Branch Prediction Losses: Like the P6 microarchitecture, the NetBurst microarchitecture uses out-of-order speculative execution. The processor uses a branch prediction algorithm to predict the outcome of branches in program code and then speculatively executes the predicted path. Although branch prediction algorithms are highly accurate, they cannot reach 100% accuracy. If the processor mispredicts a branch, all speculatively executed instructions must be flushed from the pipeline so that execution can restart from the correct branch of the program. In deeper pipeline designs more instructions must be flushed, so recovery from a branch misprediction takes longer. The net result is that applications with many branches, or with branches that are hard to predict, tend to have a lower average IPC.
Minimization of misprediction loss: To minimize the branch misprediction penalty and maximize average IPC, the hyper-pipelined NetBurst microarchitecture greatly reduces the number of branch mispredictions and provides a fast way to recover from any branch that is mispredicted. To do so, it implements an advanced dynamic execution engine and an execution trace cache. These features are described later in this article.
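A simple first-order model shows why a deeper pipeline is more sensitive to mispredictions and why cutting the misprediction rate wins some of that back. Everything numeric below (base IPC, branch frequency, misprediction rate, flush penalties) is a hypothetical value chosen for illustration, not a figure for the P6 or the Pentium 4.

```python
def effective_ipc(base_ipc: float,
                  branch_fraction: float,
                  mispredict_rate: float,
                  flush_penalty_cycles: float) -> float:
    """Average IPC once misprediction stalls are added to the base CPI."""
    base_cpi = 1.0 / base_ipc
    # Extra cycles per instruction spent refilling the pipeline after a flush.
    stall_cpi = branch_fraction * mispredict_rate * flush_penalty_cycles
    return 1.0 / (base_cpi + stall_cpi)


# Hypothetical workload: 1 in 5 instructions is a branch, 5% are mispredicted.
BASE_IPC, BRANCHES, MISS = 2.0, 0.20, 0.05

shallow = effective_ipc(BASE_IPC, BRANCHES, MISS, flush_penalty_cycles=10)
deep = effective_ipc(BASE_IPC, BRANCHES, MISS, flush_penalty_cycles=20)
# Same deep pipeline, but with one third fewer mispredictions.
deep_better_predictor = effective_ipc(BASE_IPC, BRANCHES, MISS * 2 / 3, 20)

print(f"shallow pipeline:           {shallow:.2f} IPC")
print(f"deep pipeline:              {deep:.2f} IPC")
print(f"deep + better prediction:   {deep_better_predictor:.2f} IPC")
```

The model ignores overlap between stalls and other effects, so it only indicates the direction of the trade-off: doubling the flush penalty lowers IPC, and better prediction recovers part of the loss.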
Keeping high-frequency execution units busy (as opposed to sitting and waiting)
However high the processor's frequency capability, it must ensure that the execution units (integer and floating-point) are continuously supplied with instructions, so that these high-frequency units execute instructions rather than sit and wait. In the NetBurst microarchitecture's fast execution engine, the arithmetic logic units run at twice the core frequency, so Intel has implemented a number of features to keep the execution units fed with a continuous stream of instructions: a 400-MHz system bus, an advanced transfer cache, an execution trace cache, an advanced dynamic execution engine, and a low-latency Level 1 data cache. These features work together to deliver instructions and data to the processor's high-performance execution units so that they can execute code at high frequency rather than wait for it.
Shrinking the number of instructions needed to complete a task or program
Many applications perform repetitive operations on large data sets, and the values in those data sets can often be represented with a small number of bits. These two properties can be combined to improve performance: several small values are packed together, and a single instruction operates on all of them at once. This type of operation is called Single Instruction Multiple Data (SIMD), and it reduces the overall number of instructions a program must execute. The NetBurst microarchitecture implements 144 new SIMD instructions, referred to as Streaming SIMD Extensions 2 (SSE2). The SSE2 instruction set extends the earlier SIMD instructions provided by MMX technology and SSE technology. The new instructions support 128-bit SIMD integer operations and 128-bit SIMD double-precision floating-point operations. Because each instruction can operate on twice as much data, a coded loop needs to execute only half as many instructions.
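The "half the instructions" claim follows from counting how many packed operations are needed to cover a data set. The sketch below assumes 16-bit elements and compares a 64-bit register (the MMX register width, which the text does not state explicitly) with a 128-bit SSE2 register. It only counts operations and does not execute real SIMD instructions. The function name and the element count are made up for the example.

```python
import math


def packed_instruction_count(num_elements: int,
                             element_bits: int,
                             register_bits: int) -> int:
    """Number of packed operations needed to touch every element once."""
    lanes = register_bits // element_bits  # elements handled per instruction
    return math.ceil(num_elements / lanes)


N = 1024  # hypothetical number of 16-bit samples

scalar_ops = packed_instruction_count(N, 16, 16)    # one element at a time
mmx_ops = packed_instruction_count(N, 16, 64)       # 4 lanes per instruction
sse2_ops = packed_instruction_count(N, 16, 128)     # 8 lanes per instruction

print(f"scalar: {scalar_ops}, 64-bit SIMD: {mmx_ops}, 128-bit SIMD: {sse2_ops}")
```

Loop overhead, loads, and stores are left out, so real instruction counts will differ, but the ratio between the 64-bit and 128-bit cases is the factor of two described above.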
Intel NetBurst microarchitecture feature details
Hyper-pipelined technology: The NetBurst microarchitecture's hyper-pipelined technology doubles the pipeline depth compared to the P6 microarchitecture. One of the key pipelines, the branch prediction/recovery pipeline, is implemented in 20 stages in the NetBurst microarchitecture, compared to 10 stages in the P6 microarchitecture. This technique significantly improves the performance and frequency scalability of the underlying microarchitecture.
Execution Trace Cache: The Execution Trace Cache is an innovative way to implement a Level 1 instruction cache. It stores decoded x86 instructions (micro-operations), thereby removing the latency of the instruction decoder from the main execution loop. In addition, the execution trace cache stores these micro-operations in the order of program execution flow, so that the targets of branches in the code are placed in the same cache line as the branch. This increases the instruction flow out of the cache and makes better use of the overall cache capacity (12K micro-operations), since the cache no longer stores instructions that are branched over and never executed. The result is a way to deliver a high volume of instructions to the processor's execution units and to reduce the overall time needed to recover from mispredicted branches.
Fast execution engine: Through a combination of architectural, physical, and circuit design, the simple arithmetic logic units (ALUs) in the processor run at twice the processor core frequency. This allows the ALUs to execute certain instructions in half a core clock, resulting in higher execution throughput and lower execution latency.
400MHz System Bus: By using a physical signaling scheme that quad-pumps data transfers over a 100MHz clocked system bus, together with a buffering scheme that allows sustained 400MHz data transfers, the Pentium 4 processor supports Intel's highest-performance desktop system bus. It transfers up to 3.2GB of data per second to and from the processor, compared to 1.06GB/s on the Pentium III processor's 133MHz system bus.
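The two bandwidth figures can be reproduced from the bus clock and the number of transfers per clock. The sketch assumes a 64-bit (8-byte) data bus for both processors. That width is not stated in the text, so treat it as an assumption. The function name is illustrative.

```python
def bus_bandwidth_gb_per_s(clock_mhz: float,
                           transfers_per_clock: int,
                           bus_width_bytes: int = 8) -> float:
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes) for a parallel bus."""
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * bus_width_bytes / 1e9


# Pentium 4: 100MHz clock, four transfers per clock (from the text).
p4 = bus_bandwidth_gb_per_s(clock_mhz=100, transfers_per_clock=4)
# Pentium III: 133MHz clock, one transfer per clock.
p3 = bus_bandwidth_gb_per_s(clock_mhz=133, transfers_per_clock=1)

print(f"Pentium 4 system bus:   {p4:.2f} GB/s")
print(f"Pentium III system bus: {p3:.2f} GB/s")
```

These are peak rates. Sustained throughput depends on the chipset and memory behind the bus.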
Advanced Dynamic Execution: The Advanced Dynamic Execution engine is a deep, out-of-order speculative execution engine that keeps the execution units busy executing instructions. It does this by providing a very large window of instructions from which the execution units can choose. The large out-of-order instruction window lets the processor avoid the stalls that occur when instructions wait for their dependencies to be resolved. A common form of stall is waiting for data to be loaded from memory after a cache miss. This matters in high-frequency designs, where the latency to main memory grows relative to the core clock. The NetBurst microarchitecture can have up to 126 instructions in this window, compared to the much smaller window of the P6 microarchitecture, which can hold only 42 instructions.
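The effect of window size can be illustrated with a toy simulation. The model below is a deliberately simplified, single-issue machine that picks the oldest ready instruction from a fixed-size window and retires in order. It is not a description of the actual NetBurst or P6 scheduler. The window sizes 42 and 126 come from the text, while the workload shape and the 100-cycle miss latency are hypothetical.

```python
def build_workload(blocks=4, independents_per_block=60, miss_latency=100):
    """Each block: a load that misses, one consumer of it, then independent work."""
    latencies, deps = [], []
    for _ in range(blocks):
        load_index = len(latencies)
        latencies.append(miss_latency)   # load that misses in the cache
        deps.append(None)
        latencies.append(1)              # instruction that needs the loaded value
        deps.append(load_index)
        for _ in range(independents_per_block):
            latencies.append(1)          # work that does not depend on the load
            deps.append(None)
    return latencies, deps


def cycles_to_run(latencies, deps, window_size, issue_width=1):
    """Cycles until every instruction has retired, for a given window size."""
    n = len(latencies)
    done_at = [None] * n   # cycle at which each result becomes available
    head = 0               # oldest instruction that has not retired yet
    cycle = 0
    while True:
        # Retire finished instructions strictly in program order.
        while head < n and done_at[head] is not None and done_at[head] <= cycle:
            head += 1
        if head == n:
            return cycle
        # Issue the oldest ready instructions found inside the window.
        issued = 0
        for i in range(head, min(head + window_size, n)):
            if issued == issue_width:
                break
            if done_at[i] is not None:
                continue                 # already issued
            d = deps[i]
            if d is None or (done_at[d] is not None and done_at[d] <= cycle):
                done_at[i] = cycle + latencies[i]
                issued += 1
        cycle += 1


latencies, deps = build_workload()
for window in (42, 126):
    print(f"window {window:3d}: {cycles_to_run(latencies, deps, window)} cycles")
```

With the small window the machine runs out of independent work while the first load is outstanding and cannot reach the next load. With the larger window it starts later loads early, so their latencies overlap.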
The advanced dynamic execution engine also improves branch prediction, allowing the Pentium 4 processor to predict program branches more accurately. It reduces the number of branch mispredictions by roughly 33% or more compared to the branch prediction capability of P6-based processors. This is accomplished by implementing a 4KB branch target buffer that holds more detail about the history of previous branches and by implementing a more advanced branch prediction algorithm. This improved branch prediction is one of the key design elements that reduce the NetBurst microarchitecture's overall sensitivity to the branch misprediction penalty.
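The text does not describe the prediction algorithm itself, so the sketch below uses a textbook scheme, a table of 2-bit saturating counters indexed by branch address, purely to illustrate why a larger table of branch history helps. It is not the Pentium 4's predictor, and the table sizes, addresses, and trace are hypothetical.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by branch address."""

    def __init__(self, entries: int):
        self.entries = entries
        # Counter values 0-1 predict "not taken", 2-3 predict "taken".
        self.counters = [2] * entries

    def _index(self, address: int) -> int:
        return (address >> 2) % self.entries

    def predict(self, address: int) -> bool:
        return self.counters[self._index(address)] >= 2

    def update(self, address: int, taken: bool) -> None:
        i = self._index(address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor: TwoBitPredictor, trace) -> float:
    """Fraction of branches in the trace that were predicted correctly."""
    correct = 0
    for address, taken in trace:
        if predictor.predict(address) == taken:
            correct += 1
        predictor.update(address, taken)
    return correct / len(trace)


# Hypothetical trace: branch at 0x100 is always taken, branch at 0x104 never is.
trace = [(0x100, True), (0x104, False)] * 100

small = accuracy(TwoBitPredictor(entries=1), trace)    # both branches share a counter
large = accuracy(TwoBitPredictor(entries=16), trace)   # each branch gets its own

print(f"1-entry table:  {small:.1%} correct")
print(f"16-entry table: {large:.1%} correct")
```

When the two branches share one counter they keep overwriting each other's history. Giving each branch its own entry removes that interference, which is the general reason a larger buffer of branch history improves accuracy.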
Advanced Transfer Cache: The Level 2 Advanced Transfer Cache is 256KB in size and provides a much higher data throughput channel between the Level 2 cache and the processor core. The Advanced Transfer Cache has a 256-bit (32-byte) interface that transfers data on every core clock. As a result, a 1.4GHz Pentium 4 processor can reach a data transfer rate of 44.8GB/s (32 bytes x 1 data transfer per clock x 1.4GHz = 44.8GB/s). Compared to the 16GB/s transfer rate of the 1GHz Pentium III processor, this helps the Pentium 4 processor keep its high-frequency execution units executing instructions instead of sitting idle.
Streaming SIMD Extensions 2 (SSE2): With the introduction of SSE2, the NetBurst microarchitecture extends the SIMD capabilities of MMX technology and SSE technology by adding 144 new instructions that provide 128-bit SIMD integer arithmetic operations and 128-bit SIMD double-precision floating-point operations. These new instructions reduce the total number of instructions needed to perform a particular program task, which can yield an overall performance improvement. They benefit a broad range of applications, including video, speech, image and photo processing, encryption, financial, engineering, and scientific applications.
Overall Performance Expectations
The Pentium 4 processor delivers an immediate performance improvement for software applications available today. The size of the improvement depends on the type of application, the kinds of instructions the application tends to execute, and how well its instruction sequences are optimized for the new microarchitecture.
Over time, as more applications are optimized, whether by hand at the assembly level for the microarchitecture or by rebuilding with the latest compilers and libraries optimized for the NetBurst microarchitecture, software running on the Pentium 4 processor will see even greater performance scaling.
In summary, the Pentium 4 processor, based on the NetBurst microarchitecture, delivers a performance boost across applications and usages that users can truly experience and appreciate. These applications include 3D visualization, gaming, video, speech, image and photo processing, encryption, financial, engineering, and scientific applications.
The above reflects how the P4, also known as the NetBurst architecture, was viewed when it first came out. From today's perspective, NetBurst is really not that great an architecture. High power consumption, low performance per clock, and brute-force clock speed are its main characteristics.