Pentium processor Simulation of a processor in Java

Introduction

At the University of Karlsruhe, I followed a course Processor architecture and in parallel, I developped in Java a simulation of a processor.

This processor includes following features:

The simulation, in the procsimu package uses the JDK 1.2.

Thanks to Prof. T. Ungerer and to J. Kreuzinger for this really interesting project


DLX pipeline description

The processor is based on the following DLX pipeline model:

Instruction fetch (IF)

The instruction pointed to by the PC is fetched from memory into the instruction register of the CPU, and the PC is incremented to point to the next instruction in the memory.

Instruction decode/register fetch (ID)

The instruction is decoded, and in the second half of the stage the operands are transferred from the register file into the ALU input registers (here meaning: latches).

Execution/effective address calculation (EXE)

The ALU operates on the operands from ALU input registers and eventually puts the result into ALU output register. The contents of this register depend on the type of instruction. If the instruction is:

Memory access/branch completion (MEM)

Only for load, store, and branch instructions. If the instruction is:

Write back (WB)

The result of the instruction execution (register-register or load instruction) is stored into the register file in the first half of the phase.

In particular, the load memory data register or the ALU result register is written into the register file.

Thanks to Prof. T. Ungerer


Classes description

There are a few basic classes.

The procsimu.Control class controls the simulation. It builds the GUI and executes the different action associated to the buttons and the menu items.

In the procsimu.gui package is the Output class. It represents a simple window that will show the information of a stage. The print method will add something at the end of the text in the window.

Components for the simulation

There are two important interfaces in procsimu.comp that follow the Java event model, ie they are listeners:

The procsimu.comp.Latch implements a simple latch. A latch is a container object with a write and a read method. The value wrote in the latch will only be returned by a read at the next clock cycle.

A RAM is implemented in the procsimu.comp.Memory class and a set of registers (32 by defaults) in procsimu.comp.Register.

Moreover, procsimu.comp has a few other usefull classes. Const contains all constants of the application, in particulary all constants that represent the instructions. Conversion has the methods to extract the information of a 32 bit MIPS instruction.

Stage classes

The Stage class in procsimu extends Output and implements WorkListener. That means all classes that extends Stage will see their method work() called at each cycle. The different StageXX classes, StageIF for example, extends Stage and implement the effective code fetchs, decodes, executes... the instruction.

The procsimu package also contains some other important classes for the simulation.


Screenshot

This screenshot shows the simulation that has executed the arth example.


Sources

The sources of the classes are available here.

There are also some examples, in their compiled (.mips files) and source (.s files) forms.


Benchmarks

Branch predictions

The program loop can be used to test the branch prediction mechanisms. It performs a lot of control transfers (in the form of for loop typ).

Here are the number of cycles used to complete the program in simple and super scalar modes and with different branch predictors.
The percentage indicates the percentage of good predictions.

ANT AT BT FT 1B 2B
Simple scalarCycles 313 283 267 339 283 283
% 49 62 78 40 62 62
Super scalar Cycles 277 289 277 313 283 289
% 60 46 60 46 47 47
Table 1: loop benchmarks (with cache)

The percentage of good predictions is really strange in superscalar mode. There may be a bug in the calculation, this issue still needs to be investigated.

BT 2B
Simple scalarCycles 703 726
% 84 76
Super scalar Cycles 735 741
% 71 64
Table 2: loop2 benchmarks (with cache)

IPC benchmarks

The IPC (Instruction Per Cycle) is the number of instruction executed divided by the number of cycle.

The subscal1 example does a lot of simple ALU operations (add, sub...) and is fited to the superscalar processor. As we can see in the table 3, the IPC is quite good in the superscalar mode.

IPC
Simple scalar0,76
Super scalar 1,14
Table 3: supscal1 benchmarks (no cache)

We would first think the IPC of a superscalar processor should be better than the IPC of a simple scalar processor. It is not always true, as showed in the table 4.
Actually, this is an implementation issue. The branch prediction is only implemented in the superscalar mode for the first instruction. Some control transfer instructions are not predicted, and if the branch predictor is good, not predicting a branch is a loss of performance.
A further improvement of the simulation could be to try to implement a branch prediction on both fetched instruction. But it causes some new problems that need to be solved.

CycleExecutedIPC
Simple scalar2491940,78
Super scalar 2591940.75
Table 4: loop benchmarks (no cache, predictor BT)

Superscalar description

The superscalar implementation of the simulation consists in adding a second pipeline that is able to execute all ALU and branch instruction (but no load/store instruction). It is an in-order execution, that's to say in a cycle, the instruction executed in the second pipeline is always after the instruction in the first pipeline.

The following picture gives an overview of the pipelines and their relations.

Reading two instruction in memory in parallel

To implement a superscalar processor, we need to be able to read two instructions pro cycle in the memory. So a second instruction latch, latNextInstr was added in the Memory class. This latch is updated by the work method with the instruction following the instruction written in latInstr.

The second stage classes

With the exception of the IF stage, all other stage are doubled, ie we write a second class StageXX2 for each stage (for example StageID2 is the ID stage of the second pipeline).

Instruction fetching

A completely new class was written for the superscalar mode, InstructionBuffer. The IF stage can fetch two instructions, but the pipelines are not always able to execute them. For example, the second pipeline cannot execute a LOAD instruction or they may stalled because of a data hazard. So if the ID stages can not decode an instruction, they do not remove it from the buffer and it will be decoded by the next cycle.
The IF stage will fetch two instruction if there is enough space in the buffer, otherwise it does not fetch any instruction.

Forwarding

Forwarding and interlocking issues are solved with a simple extension of the simple scalar mode mechanisms.

Control transfer

The second pipeline has its own latch latControlTransfer to indicate to the IF stage there is a control transfer. The instruction already fetched must be thrown away. There are not much differences with the simple scalar mode. The only thing that needs to be added is the following test: if there is a control transfer in the first pipeline, the instruction in the MEM stage of the second pipeline must also be thrown away.

Branch prediction

The branch prediction is done only on one instruction (explained below). This instruction is:

  1. If there still are some instructions in the instruction buffer, the prediction is done on the address of the instruction at the top of the buffer.
  2. If the instruction buffer is empty, the prediction is done on the last instruction fetched (that's to say PC - 8).

We can not do a prediction on the two instruction fetched. If we would do that, we could have no prediction of the first instruction, but a prediction on the second one. The consequence is we would need to read in one cycle two non-consecutive instruction in memory, and that is impossible.

Another important thing to be careful to is to propagate the good PC in both pipelines. It seems quite normal, but some extra code must be added to be sure of that.

Otherwise, the branch prediction is the same implementation as in simple scalar.


Contact me at vincent at oberle dot org.