CS 6290: High-Performance Computer Architecture
Project 2
This project is intended to help you understand caches and performance of out-of-order processors. As with previous projects, for this project you will need VirtualBox and our project virtual machine. Just like in previous projects, you will put your answers in the reddish boxes in this Word document, and then submit it in Canvas (but this time the submitted file name should be PRJ2.docx).
In each answer box, you must first provide your answer to the actual question (e.g. a number). You can then use square brackets to provide any explanations that the question is not asking for but that you feel might help us grade your answer. E.g. answer 9.7102 may be entered as 9.7102 [Because 9.71+0.0002 is 9.7102]. For questions that are asking “why” and/or “explain”, the correct answer is one that concisely states the cause for what the question is describing, and also states what evidence you have for that. Guesswork, even when entirely correct, will only yield up to 50% of the points on such questions.
Additional files to upload are specified in each part of this document. Do not archive (zip, rar, or anything else) the files when you submit them – each file should be uploaded separately, with the file name specified in this assignment. You will lose up to 20 points for not following the file submission and naming guidelines, and if any files are missing you will lose all the points for answers that are in any way related to a missing file (yes, this means an automatic zero score if the PRJ2.docx file is missing).
Most numerical answers should have at least two decimals of precision. Speedups should be computed to at least 4 decimals of precision, using the number of cycles, not the IPC (the IPC reported by report.pl is rounded to only two decimals). You lose points if you round to fewer decimals than required, or if you truncate digits instead of correctly rounding (e.g. a speedup of 3.141592 rounded to four decimals is 3.1416, not 3.1415).
This project can be done either individually or in groups of two students. If doing this project as a two-student group, you can do the simulations and programming work together, but each student is responsible for his/her own project report, and each student will be graded based solely on what that student submits. Finally, no collaboration with other students or anyone else is allowed. If you do have a partner you have to provide his/her name here (enter None here if no partner) and his/her Canvas username here . Note that this means that you cannot change partners once you begin working on the project, i.e. if you do any work with a partner you cannot “drop” your partner and submit the project as your own (or start working with someone else) because the collaboration you already had with your (original) partner then becomes unauthorized collaboration.
In this project we will be using the FMM benchmark with 256 particles and single-core execution, and we will continue to use the cmp4-noc.conf configuration file. Remember to first restore the cmp4-noc.conf file to its default contents. Also, if your branch predictor changes from Project 1 can result in changing the results of the simulation even when using the default (Hybrid) predictor, you need to restore the original code of the simulator. Essentially, you need to undo all the changes made in Project 0 and Project 1. Then you can run the simulation:
cd ~/sesc/apps/Splash2/fmm
If the fmm.mipseb file is not already present in this directory, then build it:
make
and run the first simulation we will need for this project exactly like this (this should be one line where all ‘-‘ characters are the normal “minus” character, the line has a single space between -ofmm.out and -efmm.err, a single space character between fmm.mipseb and -p, a single space between -p and 1, and no spaces, tabs, or anything else after that):
~/sesc/sesc.opt -f Default -c ~/sesc/confs/cmp4-noc.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1
In this command line the -c, -o, and -e simulator options should already be familiar. The
-f option tells the simulator to save the report file as sesc_fmm.mipseb.Default instead of using a random string at the end. This way you can produce report files with the names you need for your Project 2 submission, without having to rename them.
As with every simulation, you should verify that the simulated execution has not crashed (or terminated too soon) due to misspelling of the command line. After a correct simulation, the fmm.err file should be empty, and fmm.out should begin with “Creating a two cluster,
non uniform distribution for 256 particles” and have a “Total time for steps 3 to 5 : 0” line at the end. Having this does not guarantee that everything is OK, but it at least means that FMM did not exit prematurely.
Part 1 [35 points]: Performance impact of caches
In this project, we will be modifying the data caches of the simulated processor, so let’s take a closer look at the [DMemory] section of the configuration file. It says that the structure the processor gets data from is of type “smpcache” (it’s a cache that can work in a multiprocessor, as we will see Project 3), which can store 32KBytes of data (size parameter), is 4-way set associative (assoc parameter), has a 64-byte block/line size (bsize parameter uses cacheLineSize, which is set to 64 earlier in the configuration file), is a write-back cache (writePolicy), uses LRU replacement policy, and has two ports with port occupancy of 1 cycle (so it can handle two accesses every cycle), has a 1-cycle hit time, and takes 1 cycles to detect a miss. If there is a miss, the processor keeps track of it using the DMSHR (data miss handling registers) structure, which is described in the [DMSHR] section as a 64-entry structure where each entry can keep track of a miss to an entire 64-byte block. On a miss, the L1 cache requests data from the core’s local slice of the L2 cache, or from the on-chip router that connects it to the L2 slices of other cores. Note that in this project we will still be using only one core (Core 0) so it gets to use the entire L2 cache (all four slices).
Now let’s change some cache parameters and see how they affect performance. Before we make any changes to the cmp4-noc.conf file, we should save the original so we can restore the default configuration later. In general, you should be very careful about ensuring that you have the correct configuration. The values for one thing (e.g. L1 cache) can affect what happens in other things (e.g. L2 cache), so you should be able to restore the default parameters when needed.
A) Run the fmm benchmark with the default configuration and submit the report from this simulation as sesc_fmm.mipseb.Default
B) Change the L1 cache size to 1kB (leave all other conf parameters unchanged, and make sure to save the original cmp4-noc.conf), run that simulation (using -f SmallL1 this time in the simulation command line), and submit the report as sesc_fmm.mipseb.SmallL1
C) With a 1kB L1 cache, the miss rate in the L1 cache is percent, and with a 32kB L1 cache the miss rate is percent. The overall speedup achieved by replacing a 1KB L1 cache with a 32KB cache is .
D) You may think that it takes less time to simulate fewer cycles, even when the same overall work gets done over fewer cycles, but actually the simulator takes less time because it does do less work when we have a larger L1 cache. What is the simulator’s work that is eliminated?
E) Now let’s compare the default 32kB 4-way set-associative L1 cache with one that has the same size but is direct-mapped. You already ran a simulation with the default (set-associative) configuration, so you only need to run a simulation for the direct-mapped L1 cache. Submit the simulation report for this run as sesc_fmm.mipseb.DMapL1
F) The miss rate with the direct-mapped L1 cache is percent, the miss rate with the 4-way set-associative L1 cache is percent, and the overall speedup achieved by changing from direct-mapped (1-way set-associative) to 4-way set-associative L1 cache is .
G) Now let’s restore the default configuration (32kB, 4-way set associative L1 cache) and change the L1 cache latency to 2 cycles (change both hitDelay and missDelay to 2) and then to 3 cycles. Submit the reports for these simulations as sesc_fmm.mipseb.SlowL1 and sesc_fmm.mipseb.SlowerL1.
H) The speedup of improving L1 latency from 3 to 2 cycles is , the speedup of improving the L1 latency from 2 to 1 cycle is , and the speedup of improving the L1 latency from 3 to 1 cycles is .
I) The change in L1 latency from 3 cycles to 1 cycle represents is a 3X improvement, and the loads and stores represent 20.75% of all dynamic instructions. So a naïve application of Amdahl’s law tells us to expect a speedup of 1.1605. Explain why the actual speedup is very different from that:
Part 2 [20 points]: Changing the simulated cache
The cache implementation in the simulator can only model LRU replacement policy – note that a RANDOM policy can be specified in the configuration file but the code that models the replacement policy will still implement LRU even when RANDOM is specified. Now we will explore what happens when we actually change the cache’s replacement policy. We will implement the NXLRU (Next to Least Recently Used). While LRU replaces the block that is the first in LRU order (i.e. the least recently used block) in the cache set, NXLRU should replace the block that is the second in LRU order in the set, i.e. the second-least-recently-used block in the set.
To implement NXLRU, we need to modify the code of the simulator. The source file which implements the ‘smpcache’ (used for our L1 cache) is in SMPCache.h and SMCache.cpp in the sesc/src/libcmp/ directory. For much of the “basic” cache behavior, the SMPCache uses code in sesc/src/libsuc/CacheCore.h (and CacheCore.cpp). There are separate classes for CacheDM (for direct mapped caches) and CacheAssoc (for set-associative caches). Since direct-mapped caches do not have a replacement policy (they must replace the one line where the new block must go), we will be looking at the CacheAssoc class. First we must add “NXLRU” as an option that can be specified in the conf file and selected when a CacheAssoc object is constructed. Probably a good approach is to look for “LRU” in the code to see how this is done for LRU (and RANDOM), and then add NXLRU. Then we must actually implement this policy. The function that actually implements the cache’s replacement policy is the findLine2Replace method of the CacheAssoc class in CacheCore.cpp. The parameter supplied to this method is the new address that needs a line in the cache. Note that this method does not only implement the replacement policy because an actual replacement (replace one valid line with another) may not be needed. For example, when addr is already in the cache (a cache hit), this method returns the line that contains addr. When the set where addr belongs contains non-valid lines, one of those non-valid lines is used – a valid block may have a cache hit in the future, while a non-valid line cannot, so we should only replace a valid line if the set has no non-valid lines.
Note that you have to add the NXLRU policy as an option in the configuration file, i.e. it is not OK to just change the existing LRU (or RANDOM) code to actually follow the NXLRU policy. Changing the behavior of existing policies will change the behavior of all cache-like structures in the processor, including TLBs. We will want to change the replacement policy only in L1 caches and leave behavior of TLBs, L2 caches, etc. unchanged!
Make the changes needed to implement the NXLRU replacement policy and then:
J) Run a simulation with a 1kB L1 cache, using NXLRU policy, and with all other settings at their default values. Submit the simulation report for this as sesc_fmm.mipseb.L1NXLRU. You already ran a simulation with the same L1 cache that uses LRU (in Part 1B). With a 1kB L1 cache, the LRU policy gave us a miss rate of percent, while NXLRU gives us a miss rate of percent. The number of blocks that are fetched (read) by the L1 cache from the L2 cache changes from with LRU to with NXLRU, and the speedup of using LRU instead of NXLRU is .
Note: Because report.pl does not provide information about how many blocks were fetched from the L2 cache into the L1 cache, you will have to directly examine the report file generated by SESC. This file begins with a copy of the configuration that was used, then reports how many events of each kind were observed in each part of the processor. Events in the DL1 cache of processor zero (the one running the application) are reported in lines that start with “P(0)_DL1:”. In the report file, the number of blocks requested by the L1 cache from the L2 cache is reported as lineFill (these become entire-block reads from the L2 cache), and the number of write-backs the L1 wants to do to the L2 is reported as writeBack (these become entire-block writes to the L2 cache).
Part 3 [45 points]: Classifying misses in the L1 cache
Now we will change the simulator to identify what kind of L1 cache miss we are having each time there is a miss – compulsory, conflict, or capacity. Recall that a miss is a compulsory miss if it would occur in an infinite-sized cache, i.e. if the block has never been in the cache before. A capacity miss is a non-compulsory miss that would occur even in a fully associative LRU cache that has the same block size and overall capacity, and a conflict miss is a miss that is neither a compulsory nor a capacity miss. The L1 cache in the simulator counts read and write misses in separate counters (which appear in the simulation report as readMiss and writeMiss number for each cache, e.g. there is line in the report for “P(0)_DL1:readMiss=something” in the report file. Now you need to have additional counters, which should appear in the simulation report file as compMiss, capMiss, and confMiss counters (these three values should add up to the readMiss+writeMiss value). Each of the new counters should count both read and write misses of that kind. It is OK to also have counters that count compulsory, capacity, and conflict misses separately for reads and writes, or to do this classification of misses for other caches. But report files that do not have the overall (reads+writes) items for P(0)_DL1:compMiss, P(0)_DL1:capMiss, and P(0)_DL1:comfMiss will not be graded.
K) Attach the CacheCore.h, CacheCore.cpp, SMPCache.cpp, and SMPCache.h files that contain your modifications for both Part 2 (NXLRU) and Part 3 (classification of misses). You should not modify (or add/delete) any other files.
L) With your new miss-classification code in the simulator, you should run a simulation with the default configuration (32kB 4-way set-associative LRU L1, cache), with an 1kB 4-way set-associative LRU L1 cache, with a direct-mapped 32kB L1 cache, and with a 32kB 4-way set-associative NXLRU L1 cache. Submit the four simulation reports as sesc_fmm.mipseb.DefLRU, sesc_fmm.mipseb.SmallLRU, sesc_fmm.mispeb.DefDM, and sesc_fmm.mispeb.DefNXLRU.
M) Fill the following table. The first row of fill-in fields is the total number of L1 misses for each configuration, the second is the percentage of all L1 misses for that configuration that are compulsory misses, the third is the percentage of L1 misses that are conflict misses, and the fourth is the percentage of L1 misses that are capacity misses. For example, if a simulation ended up with a total of 1000 L1 misses, and if they include 300 compulsory, 500 conflict, and 200 capacity misses, then the column for that configuration would have the following numbers (from top to bottom): 1000, 30,50,20.
| 32kB SA LRU | 1kB SA LRU | 32kB DM | 32kB SA NXLRU |
Total # of misses | C | C | C | C |
% Compulsory | C | C | C | C |
% Conflict | C | C | C | C |
% Capacity | C | C | C | C |
N) Now we will consider the 32kB 4-way set-associative LRU cache as the baseline and compute, for each of the other three configurations, the percentage increase of various kinds of misses relative to that baseline. For example, if the 32kB 4-way SA LRU configuration has resulted in 200 compulsory misses and the 1kB 4-way SA LRU configuration has resulted in 190 compulsory misses, then we should put -5 in the compulsory-misses entry in the 1kB SA LRU column: the change is -10 compulsory misses (190-200), and -10 is 5% of the baseline’s 200 misses (-10/200 = -0.05).
% Change in | 1kB SA LRU | 32kB DM | 32kB SA NXLRU |
Total # of Misses | C | C | C |
# of Compulsory Misses | C | C | C |
# of Conflict Misses | C | C | C |
# of Capacity Misses | C | C | C |