Assignment 5: chArm-v2 System Emulator
C S 429, Fall 2023
1 Introduction
In this lab, you will be implementing several simulators:
- psim, a standalone simulator for the PIPE implementation of the chArm-v2 instruction set architecture (ISA), assuming an ideal memory system;
- csim, a standalone trace-driven cache simulator for a simple two-level memory hierarchy; and
- pcsim, an integrated “PIPE-with-CACHE” simulator, by enhancing psim to handle variable delays in the memory stage, and connecting it to csim, resulting in a simulator for the PIPE implementation of the chArm-v2 ISA with a simple two-level memory hierarchy.
The common executable that you will create will be named se. You will be working on the psim aspect of it in the first two weeks, on the csim aspect in the third week, and on the pcsim aspect in the fourth week.
Outcomes you will gain from this lab include the following:
- You will understand how the SEQ and PIPE- implementations of chArm-v2 work. You will understand the purpose of each stage and how the stages are connected to each other.
- You will understand how the PIPE implementation of chArm-v2 works. You will understand how stalling, squashing, and forwarding help resolve different hazard conditions.
- You will understand the impact that cache memories can have on the performance of programs.
- You will understand the additional changes that need to be made to the PIPE implementation to accommodate a (semi-)realistic memory hierarchy.
2 Logistics
This assignment lasts for four weeks and consists of two interrelated parts. You are required to perform four submissions:
• Week 1, due by Thursday, 02 November 2023 23:59 CT.
• Week 2, due by Thursday, 09 November 2023 23:59 CT.
• Week 3, due by Thursday, 16 November 2023 23:59 CT.
• The final submission, due by Thursday, 30 November 2023 23:59 CT.
Start early enough to get the assignment done before the due date. Assume things will not go according to plan, and so you must allow extra time for heavily loaded systems, dropped internet connections, corrupted files, traffic delays, minor health problems, force majeure, etc.
This is an individual or partner project. If you choose to work in pairs, the team may use as many slip days as the partner with the fewest available slip days. That is, if you have two slip days and your partner has three, the team gets two slip days to use for the entire assignment. Both partners will be charged for any slip days used. Note that using slip days for a checkpoint does not adjust any future due dates. Choose your partner well; you will not be allowed to split up during the project.
All hand-ins are electronic. You may do your coding on any machine you choose, but it is your responsibility to test this assignment for correct build/execution on a UTCS 64-bit x86-64 Linux machine before your final hand-in. You may not share your work on lab assignments with other students outside your team, but feel free to ask instructors for help (e.g., during office hours or discussion sections). Unless it is an implementation-specific question (i.e., private to instructors), please post it on Ed Discussion publicly so that students with similar questions can benefit as well.
Any updates for this lab will be posted on Canvas. Any clarifications or corrections for this lab will be posted on Ed Discussion.
3 Download and Setup
You will be able to clone your assignment to a lab machine as usual. Also note that since you can work in pairs, GitHub Classroom may prompt you to enter who you are working with to set up your repository with them. You will only need one repository for submitting, but if both you and your partner make one, just decide whose to submit later on.
4 Assignment Details
4.1 Repository Structure
Now that you have your private repository of the code base, confirm that you have the following subdirectories within it.
• include: This subdirectory contains all the header files needed for this project. It contains three subdirectories: pipe, which contains the header files needed for the pipelined processor implementation; cache, which contains the header files needed for the cache simulator; and base, which contains the header files needed for the underlying simulator.
• src: This subdirectory contains all the source code files needed for this project. It contains four subdirectories: pipe, which contains the source code files needed for the pipelined processor implementation; cache, which contains the source code files needed for the cache simulator; base, which contains the source code files needed for the underlying simulator; and testbench, which contains the source code of the executables you can use to compare your simulator to the provided reference.
To complete the pipelined processor emulator psim in Part A, you will modify the following files: instr_Fetch.c, instr_Decode.c, instr_Execute.c, instr_Memory.c, instr_Writeback.c, forward.c, and hazard_control.c. All of these files are located in the src/pipe directory.
To complete the standalone cache simulator csim in Part B, you will modify the file src/cache/cache.c. To complete the final pcsim, you will further modify only these files.
• testcases: This subdirectory contains test cases for testing your emulator. The basics, alu, mem, branch, exceptions, and applications subdirectories nested inside contain assembly .s files, disassembled .od files, and ELF binaries that are used to test your simulator. Some of these directories are further subdivided into simple, hazard, and hard, which narrow the focus of the tests. Finally, the cache subdirectory contains memory trace files similar to those that you used in MM Lab, which are used to test your cache simulator.
4.2 Simulator Quirks
There are a few things in the simulator that you would not find in normal hardware, so we would like to note them here.
First, it is important to note that hlt is a privileged instruction, and gcc won’t typically compile files with that instruction. To shut down the emulator, we instead check for a ret instruction with a return address of 0, which we turn into an emulated HLT instruction that stops the simulator after it reaches the writeback stage. This functionality is implemented for you in src/pipe/instr_Fetch.c, and you only need to pass the generated STAT_HLT through the pipeline for it to work.
5 Programming Tasks
This lab is a sequence of two programming parts. In Part A, you will implement psim, a PIPE simulator. In Part B, you will implement pcsim, a “PIPE-with-CACHE” simulator, by implementing a cache simulator csim, enhancing psim to handle variable delays in the memory stage, and connecting them to each other.
The assignment carries 16 points: four points for each week. Your submission will be auto-graded based on its ability to correctly execute the test cases for the corresponding week.
This Wiki page is a work in progress to help you understand the code base that we have given you, and what you need to do with it. It will be updated on an ongoing basis through the assignment.
5.1 Part A: Implementing psim, A Simulator for the PIPE Implementation
The goal for Part A is to implement a PIPE simulator as described in class. You are going to complete the code in files instr_Fetch.c, instr_Decode.c, instr_Execute.c, instr_Memory.c, instr_Writeback.c, forward.c, and hazard_control.c, all of which are located in the src/pipe directory. You will need to complete all functions in the files that are marked as “STUDENT TO-DO”. The hardware elements you need are implemented for you in src/base/hw_elts.c.
5.1.1 Basic Pipelined Implementation
Start by creating the simpler PIPE- implementation we have discussed in class. If you do this correctly, you will be able to pass all the tests in the testcases/basics and testcases/*/simple directories.
You are required to implement five functions that emulate five stages for your PIPE- simulator:
• fetch_instr(): Fetch stage (including PC update actions).
• decode_instr(): Decode stage.
• execute_instr(): Execute stage.
• memory_instr(): Memory stage.
• wback_instr(): Write-back stage.
The fields of the pipeline registers are defined in include/pipe/instr_pipeline.h in several struct types *_instr_impl_t and pipe_reg_t. The “clocking” of these registers is handled for you in src/base/proc.c. When implementing the combinational logic for a pipeline stage, you will be passed in the appropriate structs as your input and output arguments, which are named in and out. There are also a few global variables that represent the “backwards” wires sent from one stage to a prior stage, which you will need to update as well.
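The in/out convention described above can be pictured with a toy stage. This is only a minimal sketch: every type, field, and variable name below (all prefixed demo_ or DEMO_) is hypothetical and much simpler than the real definitions in include/pipe/instr_pipeline.h, and the ADD stands in for a real ALU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the real pipeline-register structs. */
typedef enum { DEMO_STAT_BUB, DEMO_STAT_AOK, DEMO_STAT_HLT } demo_stat_t;

typedef struct {
    demo_stat_t status;  /* status travelling with the instruction */
    uint64_t    val_a;   /* first ALU operand */
    uint64_t    val_b;   /* second ALU operand */
    uint8_t     dst;     /* destination register number */
} demo_x_in_t;           /* what this stage reads ("in") */

typedef struct {
    demo_stat_t status;
    uint64_t    val_ex;  /* ALU result */
    uint8_t     dst;
} demo_m_out_t;          /* what this stage writes ("out") */

/* Hypothetical "backwards" wire from Execute to an earlier stage. */
static bool demo_branch_taken = false;

/* Combinational logic of one stage: read only `in`, write only `out`. */
static void demo_execute(const demo_x_in_t *in, demo_m_out_t *out)
{
    assert(in != NULL && out != NULL);
    out->status = in->status;             /* pass status (e.g. a halt) along unchanged */
    out->dst    = in->dst;
    out->val_ex = in->val_a + in->val_b;  /* a plain ADD stands in for the ALU */
    demo_branch_taken = false;            /* this toy stage never sees a branch */
}
```

Note that the stage never modifies its input struct and copies the status field forward untouched; the same pattern is what lets a halt status generated in Fetch reach the write-back stage.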
5.1.2 Hazard Control
For the second week, you are also required to implement stalling, squashing, and forwarding to deal with data hazards and control hazards, as discussed in class. For forwarding, add your implementation in src/pipe/forward.c and then call the resulting function(s) from src/pipe/instr_Decode.c.
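As an illustration of the forwarding idea, here is a minimal sketch under stated assumptions: the demo_pending_t struct and demo_forward() function are hypothetical and are not the interface expected in src/pipe/forward.c. The sketch assumes that a later stage can offer a value only once that value has actually been computed (for example, a load that is still in Execute would not be marked ready).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of what one later stage is about to write. */
typedef struct {
    bool     ready;  /* stage writes a register AND its value is already known */
    uint8_t  dst;    /* register it will write */
    uint64_t val;    /* value it will write */
} demo_pending_t;

/* Pick the newest available value for source register `src`.
 * Priority runs from the youngest instruction (Execute) to the oldest (Writeback). */
static uint64_t demo_forward(uint8_t src, uint64_t regfile_val,
                             demo_pending_t x, demo_pending_t m, demo_pending_t w)
{
    if (x.ready && x.dst == src) return x.val;
    if (m.ready && m.dst == src) return m.val;
    if (w.ready && w.dst == src) return w.val;
    return regfile_val;  /* nothing in flight: the register file is up to date */
}

/* Usage: register 2 is pending in both Execute and Writeback; Execute wins. */
static void demo_forward_example(void)
{
    demo_pending_t x = { true, 2, 111 };
    demo_pending_t m = { false, 0, 0 };
    demo_pending_t w = { true, 2, 333 };
    assert(demo_forward(2, 999, x, m, w) == 111);  /* youngest writer is chosen */
    assert(demo_forward(7, 999, x, m, w) == 999);  /* no writer: register file value */
}
```

The ordering of the three checks matters: when several in-flight instructions write the same register, the value from the most recently issued one is the one a following instruction must see.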
For stalling, implement the four functions in src/pipe/hazard_control.c, which are called for you in src/base/proc.c. You are responsible for setting up the correct stall and bubble signals for each pipeline stage at each cycle. The task of taking the appropriate actions on the pipeline registers based on these signals is handled for you.
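To make the notion of per-stage stall and bubble signals concrete, the following sketch maps two hazard conditions to signals in the style of a textbook five-stage pipeline. It is an assumption-laden illustration, not the required solution: the struct layout, the function demo_hazard_control(), and the exact set of stages that stall or bubble are hypothetical, and the signals appropriate for chArm-v2 are the ones discussed in class. The sketch also assumes the two conditions never occur in the same cycle.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical control signals for one pipeline register. */
typedef struct { bool stall; bool bubble; } demo_ctl_t;

/* Hypothetical bundle of signals for all five pipeline registers. */
typedef struct { demo_ctl_t fetch, decode, execute, memory, wback; } demo_pipe_ctl_t;

static demo_pipe_ctl_t demo_hazard_control(bool load_use, bool mispredict)
{
    /* Assumption of this sketch: at most one condition is active per cycle. */
    assert(!(load_use && mispredict));

    demo_pipe_ctl_t c = {0};  /* default: every register loads normally */

    if (load_use) {
        c.fetch.stall    = true;  /* hold the next instruction in Fetch */
        c.decode.stall   = true;  /* hold the dependent instruction in Decode */
        c.execute.bubble = true;  /* let a no-op flow into Execute instead */
    } else if (mispredict) {
        c.decode.bubble  = true;  /* squash the two wrongly fetched instructions */
        c.execute.bubble = true;
    }
    return c;
}
```

A register should never be told to stall and bubble in the same cycle, which is why the two cases are kept mutually exclusive here.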
5.1.3 Testing
Make your se executable in the usual manner: make clean; make. The standard set of make targets that we have been using all semester is also used in this assignment.
Run your se executable using the command line
se -i <testfile-name> -v <verbosity-level> -l <cycle-limit> -c <checkpoint-file>
where the -i flag is mandatory, but the rest are optional. The verbosity level can be 0, 1, or 2 (default: 0) and will control how much diagnostic output you will see. The cycle limit can be an integer greater than 0, and will set the limit on the number of cycles your simulator can run for. Its default value is 500, and most tests do not require more than this, save for some of the ones in hard directories. The checkpoint file is a file that will be overwritten with a “checkpoint” of the machine state when the program ends or attempts to load from CHECKPOINT_ADDR.
Debug your program in the normal manner using gdb.
5.1.4 Submission
Submit your checkpoint version using Gradescope, by providing a pointer to the private GitHub repository where you have done your work. Remember to include your partner in your submission, if you have one. Clearing week 1 corresponds to your correctly running test programs in the testcases/basics and testcases/*/simple subdirectories. Clearing week 2 corresponds to your correctly running test programs in the testcases/*/hazard subdirectories.
5.1.5 Evaluation
Part A of the assignment counts for eight points: four for the basic PIPE- implementation (week 1), and four for the full PIPE implementation including handling hazards and forwarding values (week 2). Partial credit is given where applicable, so you’ll earn points for each individual test passed.
5.2 Part B: Implementing pcsim, A Simulator for the PIPE Implementation with A Simple Memory Hierarchy
In this part, you will first write a standalone cache controller simulator csim and test it against a number of memory traces. Correctness will be determined by matching the cache events generated by your simulator against a reference. You will then augment psim and connect it to csim to produce pcsim.
5.2.1 Implementation and Testing
- Start your work in the cache directory.
- Implement the get_line() and select_line() helper functions in the file cache.c. Implement the check_hit() and handle_miss() routines in the file cache.c. This will give you a skeletal cache simulator that implements the control actions (the three-state finite-state machine cache controller discussed in class) of a write-back cache with LRU replacement and write-allocate policies, for arbitrary numbers of sets, associativity values, and block sizes.
- You can assume that each cache read/write only accesses one single cache line.
- Test your code by running make and running test-csim. Your implementation is correct when the test score printed out is 40/40.
- Implement the functions get_word_cache() and set_word_cache() in the file cache/cache.c. After completing this task, you will have a fully functional cache simulator that implements both the control and the data portions of the cache.
- Integrate the memory hierarchy simulator into the PIPE simulator by making the appropriate changes in the files that you updated for Part A. This should involve no more than updating the data memory routines to the corresponding cache routines and handling stalls resulting from cache misses.
- Test the correctness of the combined simulator using test-se. Several ELF binaries will be run with a few different cache configurations. You may test these yourself by adding the flags -A <associativity> -B <line-size> -C <capacity> -d <delay-cycles> to any test from before.
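Two pieces of bookkeeping underlie the steps above: splitting an address into tag, set index, and block offset, and choosing a victim line under LRU. The sketch below illustrates both under stated assumptions. All names (demo_split(), demo_victim(), demo_line_t, and so on) are hypothetical and unrelated to the helpers you must write in cache.c; the capacity C is assumed to be in bytes, A, B, and the resulting number of sets are assumed to be powers of two, and LRU is tracked with a simple per-line timestamp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t tag, index, offset; } demo_addr_parts_t;

/* Base-2 logarithm of a power of two. */
static unsigned demo_log2(uint64_t x)
{
    assert(x != 0 && (x & (x - 1)) == 0);
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

/* Split an address for a cache with associativity A, line size B (bytes),
 * and total capacity C (bytes, an assumption of this sketch). */
static demo_addr_parts_t demo_split(uint64_t addr, uint64_t A, uint64_t B, uint64_t C)
{
    assert(A != 0 && B != 0);
    uint64_t sets = C / (A * B);          /* number of sets */
    unsigned b = demo_log2(B);            /* block-offset bits */
    unsigned s = demo_log2(sets);         /* set-index bits */
    demo_addr_parts_t p;
    p.offset = addr & (B - 1);
    p.index  = (addr >> b) & (sets - 1);
    p.tag    = addr >> (b + s);
    return p;
}

/* Hypothetical cache line metadata; last_used is a timestamp for LRU. */
typedef struct { bool valid; bool dirty; uint64_t tag; uint64_t last_used; } demo_line_t;

/* LRU victim: prefer an invalid line, else the least recently used one. */
static int demo_victim(const demo_line_t *set, int ways)
{
    int victim = 0;
    for (int w = 0; w < ways; w++) {
        if (!set[w].valid) return w;
        if (set[w].last_used < set[victim].last_used) victim = w;
    }
    return victim;
}
```

For example, with A = 4, B = 32, and C = 512 bytes there are 4 sets, so address 0x1234 splits into offset 20, set index 1, and tag 36. In a write-back cache, a victim whose dirty flag is set must be written to memory before the line is reused.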
5.2.2 Application: Matrix Multiplication
To explore the effects of the cache in a real application, we consider several implementations of matrix-matrix multiplication in the tests applications/hard/gemm_ijk, applications/hard/gemm_ikj, and applications/hard/gemm_block. These tests take two 64 × 64 matrices and multiply them, storing the result in another 64 × 64 matrix.
Note that these tests take several million cycles to run to completion, so you need to give them several seconds to finish running. For this reason, do NOT run your simulator with the verbose -v flag enabled, or else you will create several Gigabyte-sized files and waste large amounts of space on the lab machines.
Using these tests and your se executable, answer the following questions. You may use the se-ref executable if you are unable to pass all week 4 tests with your implementation.
• To start, run the tests without the cache enabled. You should set the cycle limit to 8,000,000 with the -l flag in order to bypass the default limit of 500. You can filter the output to a checkpoint file to view the number of cycles the simulator runs for, so an example command would be bin/se -i testcases/applications/hard/gemm_ijk -c checkpoint.out -l 8000000. Note the number of cycles it takes to run each test to completion. Which of the three tests runs in the fewest number of cycles? Which of the three runs in the greatest number of cycles? Provide some intuition on why this behavior occurs.
• Next, run the tests with cache parameters [A,B,C,d] = [4,32,512,100]. Modern processors typically suffer a loss on the order of 100 cycles on a cache miss when the requested data is in DRAM, so this configuration more closely mimics a real scenario. You will now need to increase the cycle limit to 40,000,000 in order to allow the simulator to run to completion, so the example command from before becomes bin/se -i testcases/applications/hard/gemm_ijk -c checkpoint.out -l 40000000 -A 4 -B 32 -C 512 -d 100. Note the number of cycles it takes to run each test to completion. Now which test runs in the fewest/greatest number of cycles? Has this changed from before? Use the memory access patterns of each test to explain why the results did or did not change.
• There will be a separate week 4 assignment on Gradescope for you to submit answers to these questions.
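For reference when thinking about memory access patterns, the sketch below shows what the three loop structures might look like in C. It rests on assumptions: that the test names refer to loop order and to loop blocking (tiling), that the matrices are stored in row-major order, and that a 4 × 4 size with a block size of 2 is representative. The provided tests are assembly programs operating on 64 × 64 matrices, and their actual code may differ. All names here are hypothetical.

```c
#include <assert.h>

#define DEMO_N  4   /* the tests use 64; 4 keeps the sketch small */
#define DEMO_BS 2   /* hypothetical block size; must divide DEMO_N */

typedef long demo_mat_t[DEMO_N][DEMO_N];  /* row-major square matrix */

static void demo_zero(demo_mat_t c)
{
    for (int i = 0; i < DEMO_N; i++)
        for (int j = 0; j < DEMO_N; j++)
            c[i][j] = 0;
}

/* ijk order: the innermost loop varies k, reading a[i][k] and b[k][j]. */
static void demo_gemm_ijk(demo_mat_t a, demo_mat_t b, demo_mat_t c)
{
    for (int i = 0; i < DEMO_N; i++)
        for (int j = 0; j < DEMO_N; j++) {
            long sum = 0;
            for (int k = 0; k < DEMO_N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

/* ikj order: the innermost loop varies j, reading b[k][j] and updating c[i][j]. */
static void demo_gemm_ikj(demo_mat_t a, demo_mat_t b, demo_mat_t c)
{
    demo_zero(c);
    for (int i = 0; i < DEMO_N; i++)
        for (int k = 0; k < DEMO_N; k++) {
            long r = a[i][k];
            for (int j = 0; j < DEMO_N; j++)
                c[i][j] += r * b[k][j];
        }
}

/* Blocked: the same arithmetic, performed one DEMO_BS x DEMO_BS tile at a time. */
static void demo_gemm_block(demo_mat_t a, demo_mat_t b, demo_mat_t c)
{
    assert(DEMO_N % DEMO_BS == 0);
    demo_zero(c);
    for (int ii = 0; ii < DEMO_N; ii += DEMO_BS)
        for (int kk = 0; kk < DEMO_N; kk += DEMO_BS)
            for (int jj = 0; jj < DEMO_N; jj += DEMO_BS)
                for (int i = ii; i < ii + DEMO_BS; i++)
                    for (int k = kk; k < kk + DEMO_BS; k++) {
                        long r = a[i][k];
                        for (int j = jj; j < jj + DEMO_BS; j++)
                            c[i][j] += r * b[k][j];
                    }
}
```

All three routines compute the same product; they differ only in the order in which elements of the three matrices are touched, which is what the cache observes.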
5.2.3 Submission
Submit your final version using Gradescope, by providing a pointer to the private GitHub repository where you have done your work. As before, remember to include your partner in your submission, if you have one.
5.2.4 Evaluation
Part B of the assignment counts for eight points: four for implementing the cache and achieving 40/40 on the cache test (week 3), two for integrating the cache with the PIPE implementation and passing the tests in the testcases/*/hard subdirectories (week 4), and two for correctly answering the matrix multiplication questions. Partial credit is given where applicable, so you will earn points for each individual test passed. We will use test-se to test your integrated simulator.