Simulation of a processor in Java
At the University of Karlsruhe, I followed a course Processor architecture and in parallel, I developped in Java a simulation of a processor.
This processor includes following features:
The simulation, in the procsimu package uses the JDK 1.2.
Thanks to Prof. T. Ungerer and to J. Kreuzinger for this really interesting project
The processor is based on the following DLX pipeline model:

The instruction pointed to by the PC is fetched from memory into
the instruction register of the CPU, and the PC is incremented to
point to the next instruction in the memory.
Instruction decode/register fetch (ID)
The instruction is decoded, and in the second half of the stage the
operands are transferred from the register file into the ALU input
registers (here meaning: latches).
Execution/effective address calculation (EXE)
The ALU operates on the operands from ALU input registers and eventually puts the result into ALU output register. The contents of this register depend on the type of instruction. If the instruction is:
Only for load, store, and branch instructions. If the instruction is:
The result of the instruction execution (register-register or load instruction) is stored into the register file in the first half of the phase.
In particular, the load memory data register or the ALU result register is written into the register file.
Thanks to Prof. T. Ungerer
There are a few basic classes.
The procsimu.Control class controls the simulation. It builds the GUI and executes the different action associated to the buttons and the menu items.
In the procsimu.gui package is the Output class. It represents a simple window that will show the information of a stage. The print method will add something at the end of the text in the window.
There are two important interfaces in procsimu.comp that follow the Java event model, ie they are listeners:
The procsimu.comp.Latch implements a simple latch. A latch is a container object with a write and a read method. The value wrote in the latch will only be returned by a read at the next clock cycle.
A RAM is implemented in the procsimu.comp.Memory class and a set of registers (32 by defaults) in procsimu.comp.Register.
Moreover, procsimu.comp has a few other usefull classes. Const contains all constants of the application, in particulary all constants that represent the instructions. Conversion has the methods to extract the information of a 32 bit MIPS instruction.
The Stage class in procsimu extends Output and implements WorkListener. That means all classes that extends Stage will see their method work() called at each cycle. The different StageXX classes, StageIF for example, extends Stage and implement the effective code fetchs, decodes, executes... the instruction.
The procsimu package also contains some other important classes for the simulation.
This screenshot shows the simulation that has executed the arth example.
The sources of the classes are available here.
There are also some examples, in their compiled (.mips files) and source (.s files) forms.
The program loop can be used to test the branch prediction mechanisms. It performs a lot of control transfers (in the form of for loop typ).
Here are the number of cycles used to complete the program in simple and
super scalar modes and with different branch predictors.
The percentage indicates the percentage of good predictions.
| ANT | AT | BT | FT | 1B | 2B | ||
|---|---|---|---|---|---|---|---|
| Simple scalar | Cycles | 313 | 283 | 267 | 339 | 283 | 283 |
| % | 49 | 62 | 78 | 40 | 62 | 62 | |
| Super scalar | Cycles | 277 | 289 | 277 | 313 | 283 | 289 |
| % | 60 | 46 | 60 | 46 | 47 | 47 |
The percentage of good predictions is really strange in superscalar mode. There may be a bug in the calculation, this issue still needs to be investigated.
| BT | 2B | ||
|---|---|---|---|
| Simple scalar | Cycles | 703 | 726 |
| % | 84 | 76 | |
| Super scalar | Cycles | 735 | 741 |
| % | 71 | 64 |
The IPC (Instruction Per Cycle) is the number of instruction executed divided by the number of cycle.
The subscal1 example does a lot of simple ALU operations (add, sub...) and is fited to the superscalar processor. As we can see in the table 3, the IPC is quite good in the superscalar mode.
| IPC | |
|---|---|
| Simple scalar | 0,76 |
| Super scalar | 1,14 |
We would first think the IPC of a superscalar processor should be better than the IPC of a simple
scalar processor. It is not always true, as showed in the table 4.
Actually, this is an implementation issue. The branch prediction is only implemented
in the superscalar mode for the first instruction. Some control transfer instructions
are not predicted, and if the branch predictor is good, not predicting a branch
is a loss of performance.
A further improvement of the simulation could be to try to implement a branch prediction
on both fetched instruction. But it causes some new problems that need to be solved.
| Cycle | Executed | IPC | |
|---|---|---|---|
| Simple scalar | 249 | 194 | 0,78 |
| Super scalar | 259 | 194 | 0.75 |
The superscalar implementation of the simulation consists in adding a second pipeline that is able to execute all ALU and branch instruction (but no load/store instruction). It is an in-order execution, that's to say in a cycle, the instruction executed in the second pipeline is always after the instruction in the first pipeline.
The following picture gives an overview of the pipelines and their relations.

To implement a superscalar processor, we need to be able to read two instructions pro cycle in the memory. So a second instruction latch, latNextInstr was added in the Memory class. This latch is updated by the work method with the instruction following the instruction written in latInstr.
With the exception of the IF stage, all other stage are doubled, ie we write a second class StageXX2 for each stage (for example StageID2 is the ID stage of the second pipeline).
A completely new class was written for the superscalar mode, InstructionBuffer.
The IF stage can fetch two instructions, but the pipelines are not always able
to execute them. For example, the second pipeline cannot execute a LOAD instruction or
they may stalled because of a data hazard. So if the ID stages can not decode an instruction,
they do not remove it from the buffer and it will be decoded by the next cycle.
The IF stage will fetch two instruction if there is enough space in the buffer, otherwise
it does not fetch any instruction.
Forwarding and interlocking issues are solved with a simple extension of the simple scalar mode mechanisms.
The second pipeline has its own latch latControlTransfer to indicate to the IF stage there is a control transfer. The instruction already fetched must be thrown away. There are not much differences with the simple scalar mode. The only thing that needs to be added is the following test: if there is a control transfer in the first pipeline, the instruction in the MEM stage of the second pipeline must also be thrown away.
The branch prediction is done only on one instruction (explained below). This instruction is:
We can not do a prediction on the two instruction fetched. If we would do that, we could have no prediction of the first instruction, but a prediction on the second one. The consequence is we would need to read in one cycle two non-consecutive instruction in memory, and that is impossible.
Another important thing to be careful to is to propagate the good PC in both pipelines. It seems quite normal, but some extra code must be added to be sure of that.
Otherwise, the branch prediction is the same implementation as in simple scalar.
Contact me at vincent at oberle dot org.