
Code Profiling & Branch Prediction

This exercise comes in three phases, loosely aimed at a one-per-week schedule. It involves a variety of answer styles and covers a number of different topics. You have some freedom in interpreting the questions, and your numeric answers probably won't exactly match other people's; however, the general conclusions should still end up similar. Let us know the process you applied.

There is some programming involved, but this is not a programming exercise! (We assume you know how to write software by now.) You will need to write some programs in places and may choose to produce some code to help with your analysis elsewhere. This need not be tidied-up ‘release’-quality code – i.e. hacking is okay – as long as it is legible. The code is a means to an end, not the purpose of the exercise. Use whatever language you are most comfortable with (but please don't be so obscure that it baffles the staff).

None of these phases should be unduly stressful, but you are strongly encouraged to think about what you are trying to achieve in each case before leaping for the keyboard. The theory and design is all part of the process and we're not giving you a step-by-step guide!

Tasks & marking

Specific tasks requiring answers are presented in bold, red font. (Contact the staff if this causes you visual problems.) Most answers are either numerical (some leeway is given as long as your answers approximately agree with ours) or short textual: usually one sentence will do, and three is probably a sensible maximum unless otherwise explicitly stated. We don't want you to write essays and we don't want to read essays. (Code should be similarly short – because of course it should!) Where possible, put your answers into a single document – preferably a PDF. Zip this together with anything else (e.g. listings) and submit what you have completed via Blackboard by the published deadline. The exercise is marked out of 100 for convenience; the resulting mark will be scaled to the same proportion as the other exercises.

Context

A basic prefetch unit usually just cycles, incrementing its value by an appropriately sized step. It is occasionally ‘interrupted’ by a branch, which overwrites its value.
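Purely as an illustration of that behaviour, the sketch below models it in a few lines of Python; the function name, the dictionary of taken branches and the step of 4 bytes per 32-bit ARM instruction are assumptions for illustration, not part of the exercise.

    # Minimal sketch of the prefetch behaviour described above (illustrative only).
    def prefetch_addresses(start_pc, taken_branches):
        """Yield fetch addresses; taken_branches maps a branch PC to its target."""
        pc = start_pc
        while True:
            yield pc
            # Normally just increment to the next instruction; occasionally a
            # branch 'interrupts' the sequence and overwrites the value instead.
            pc = taken_branches.get(pc, pc + 4)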

Phase 1: Profiling and Statistics

This first phase analyses the dynamic execution behaviour of a real program. You are provided with two files which have been generated from a run of the espresso logic-minimisation tool compiled for an ARM processor. The run executed just under 3.5 million instructions; instructions which were prefetched and then discarded are not counted in this figure. The actual statistics are:

• 3424177 instructions fetched, of which
• 818792 were conditional, in the sense that a choice is made.

Note that all these ARM instructions are, in principle, conditional (the ‘condition’ for most is ‘always’): 2986820 instructions passed their condition-code check (which includes the ‘always’ cases). The ‘truly’ conditional operations (where a decision is made) are chiefly branches. (Only branches are conditional in the majority of processor ISAs.)

There are two files taken from a profiling run of this code. They are machine-written (hence guaranteed to be regularly structured) and human-readable. They are:

• instr_profile is a map of (the executed parts of) the code, annotated with the number of times each instruction was fetched and decoded. This is chiefly included for interest and cross-referencing if desired, though you will probably need it in the first phase of this exercise.

• the second file is a dynamic trace of the branch operations encountered during the run; its format is described below.

This exercise doesn't rely much on remembering details of ARM code. The important bits are:

• All operations can be predicated with a condition code, although mostly these are branching operations. (Predicates can be used to avoid short, forward branches, but these do not appear here.)

• There are several different ways a ‘branch’ can be invoked (i.e. there is a chance the PC may be overwritten). These have been identified in the trace by the first character on each line:

o ‘B’ is an invariant branch: the target is coded directly in the instruction. This is the class of primary interest here. It includes ‘BL’ (Branch and Link: procedure call) operations.

o ‘R’ is a procedure return made by MOVing a value from another register – almost always the Link Register (LR, R14).

o ‘D’ identifies an operation which calculates a result and deposits it in the PC. These are rare once the procedure returns are separated out.

o ‘M’ codes for (multiple) memory loads which include the PC (R15) amongst the registers loaded. These are usually, but not exclusively, stack ‘pop’ operations returning from procedures.

o ‘L’ operations are single memory loads to the PC (R15), used occasionally to branch to a dynamically set pointer.

The next character is either a ‘*’, which indicates the branch is unconditional, or a ‘?’, which signifies a conditional operation. (This is the only use of these characters in the file.) Finally, there is an indication of whether or not the branch is taken on each encounter.

The code segment of the binary consists of 37232 words. This is slightly misleading in terms of instructions, since ARM code typically has read-only data, particularly strings, embedded in this space; nevertheless, the vast majority of the space contains instructions and this figure can be used. Some of these instructions will be present to handle cases (such as errors) which were not needed in the profiled run. Within the profile, 11339 different instructions were executed at least once. It is often claimed that a ‘typical’ program spends 90% of its time in 10% of the code.

  1. Derive a sensibly accurate estimate of the proportion of the whole binary in which this program spent 90% of its execution time. Briefly show how you derived your answer: the process you followed and any tools you used or wrote. [Answer: numeric or proportion.] [Marks: 2] This may require a little analysis code to extract an answer, but there are standard tools which can help you order the data and extract the desired values: for example, your attention is directed to the Unix/Linux ‘sort’ utility. Other standard utilities that may be useful include ‘grep’ and ‘wc’. A sketch of one possible approach follows.
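The sketch below shows one way such an analysis might be organised in Python. It assumes the execution count is the last whitespace-separated field on each line of instr_profile – check the real file and adjust, since that layout is an assumption here, not a statement of the file's format.

    # Hedged sketch: how many distinct instructions account for 90% of fetches?
    # Assumes each instr_profile line ends with that instruction's execution count.
    counts = []
    with open("instr_profile") as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            try:
                counts.append(int(fields[-1]))
            except ValueError:
                pass                       # skip lines without a numeric count

    counts.sort(reverse=True)              # hottest instructions first
    total = sum(counts)
    cumulative, hot = 0, 0
    for c in counts:
        cumulative += c
        hot += 1
        if cumulative >= 0.9 * total:
            break

    print(hot, "instructions cover 90% of fetches")
    print("proportion of the 37232-word binary:", hot / 37232)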

Basic blocks

During execution, code can be partitioned into “basic blocks”. These are contiguous sequences of instructions with a single entry point and a single potential exit point: the stretch between branches. Your dynamic instruction trace reveals entry and potential exit points. There is a question about what you can't see here: one category is entry points which were never used in the trace. It is possible to identify more of these – including those in fragments of code which were not executed here at all – with a static analysis of the code.

Modern processors are often pipelined to improve their performance, allowing them to operate at higher clock speeds than unpipelined units. An unfortunate consequence of pipelining is that unwanted instructions will have been fetched and decoded before it is discovered that the flow has branched and they are not wanted. This imposes a penalty on performance each time a branch (of any type) is taken.

  1. How many times is a branch taken (i.e. any non-sequential change to the PC) in the provided trace? [Answer: numeric.] [Marks: 2]

  2. Thus, what is the average number of instructions (including the branch) executed before the PC is reloaded? [Answer: numeric.] [Marks: 2]

  3. Assuming that each instruction takes a single cycle (slightly inaccurate for ARM, since a few operations, notably the multi-register transfers, take more than one cycle – ignore this effect) and that two instructions/cycles are discarded for each taken branch, estimate the proportion of the TOTAL execution time wasted due to control hazards. (Make it clear what you are calculating: the total time includes the time wasted.) [Answer: numeric/proportion, with a sentence of justification.] [Marks: 5] A sketch of how this arithmetic might be organised follows.
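For illustration only, the arithmetic might be laid out as below; the taken-branch count is a placeholder to be replaced with your own answer from question 1, not a given value.

    # Sketch of the overhead calculation (placeholder value -- not the answer).
    instructions   = 3424177   # instructions fetched, from the statistics above
    taken_branches = 0         # placeholder: substitute your own count from question 1
    penalty        = 2         # discarded instructions per taken branch, as stated

    wasted = penalty * taken_branches
    total  = instructions + wasted        # total cycles include the wasted ones
    print("proportion wasted:", wasted / total)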

Delayed branches – where the instruction(s) immediately after a branch are always executed – are a cheap way of reducing the penalty imposed by branches; the difficulty is that the compiler cannot always find an instruction (or more) which can be reordered into the slot(s). Since the instruction is going to be executed (before the branch, so it doesn't help unconditional ‘call’ operations) it is often necessary to insert a ‘NOP’ instead, with no consequent performance benefit (indeed a penalty, since this clutters up the caches).

Branch delays have largely been superseded by branch prediction in high-performance ISAs.

A Branch Target Buffer (BTB) attempts to predict branches to alleviate the control hazard. A straightforward BTB can only predict invariant branches, since it is much more difficult to determine a branch destination which may be different each time; it is thus confined to ‘B’ codes in the trace.

  1. What proportion of the ‘branches’ encountered in the trace fall into this category? [Answer: proportion.] [Marks: 3]

  2. What would be the execution overhead if all these branches could be predicted perfectly? [Answer: proportion, with a sentence of justification.] [Marks: 4]

It is not possible to predict every branch successfully, every time. Firstly, a branch must have been encountered once before the BTB can know it is present, so the first encounter will go unpredicted.

  1. For what other reason(s) might branch prediction be imperfect? [Answer: 1 or 2 descriptive phrases.]

  2. What are the ratios of conditional branches taken to those not taken when the branch direction (forwards vs. backwards) is considered? [Answer: numeric ratios.] [Marks: 2]

  3. Is there a simple decision-making formula which might be used to predict a branch based on this information? [Answer: one sentence suggestion.] [Marks: 2]

Phase 2: Modelling

This phase is based around building a software model of some potential hardware to see how it would behave.

A BTB is a form of cache. Its ‘tags’ are the addresses of branch instructions and the associated ‘data’ are the branch targets. If the PC used for an instruction fetch matches a tag then a (possible) branch is about to be fetched, and there is the option to jump directly to the target rather than simply incrementing the PC.
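Purely as an illustration of that structure, a direct-mapped BTB might be sketched as below; the class name, default capacity and indexing by low-order word-address bits are assumptions, and the phase-2 questions ask you to choose and justify your own organisation.

    # Hedged sketch of a direct-mapped BTB: tags are branch addresses, data are targets.
    class BTB:
        def __init__(self, entries=64):
            self.entries = entries
            self.table = [None] * entries      # each slot holds (tag, target) or None

        def _index(self, pc):
            return (pc >> 2) % self.entries    # assumes word-aligned 32-bit instructions

        def lookup(self, pc):
            """Return a predicted target, or None to fall through to pc + 4."""
            slot = self.table[self._index(pc)]
            if slot is not None and slot[0] == pc:
                return slot[1]
            return None

        def insert(self, pc, target):
            self.table[self._index(pc)] = (pc, target)   # simple overwrite on conflict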

The purpose of this phase is to build a model of the BTB's behaviour, to study how effective it might be in a real, dynamic situation. This requires a behavioural model: note that it does not require a complete implementation. A couple of preliminary questions:

  1. A BTB only caches known branch instructions, so it can be quite small. What degree of associativity is appropriate? [Answer: suggestion with a justifying sentence.] [Marks: 3]

  2. What is the appropriate ‘cache line’ length? [Answer: suggestion with a justifying sentence.] [Marks: 2]

  3. Now build a model which is capable of predicting (invariant) branches from the provided trace file. [Answer: source listing – it does not need to be tidy, but its structure should be apparent.] [Marks: 15]

(Note that, initially, your model only has to spot the potential branch (or not) from the trace history, which will be updated as the trace is ‘run’.) You can make some decisions yourself: for example, whether or not you cache a branch when it is not taken (when first encountered), whether to cache only unconditional branches, only forward (or backward) branches, and so on.

For this initial exercise, assume that if a branch has been cached it will be predicted taken when it is encountered again.

When building your model, leave some provision for adding extra information to each BTB entry.

Note that your input trace file is machine-written, so its structure is regular. (Spacing is done with spaces, not TABs.) The code is specific to this job, so you can take expedient ‘short-cuts’ such as picking up characters from particular columns rather than fully parsing the text. At some stage you will need to start selecting BTB entries for replacement. The choice of algorithm is left to you (but make your choice clear in your submission).
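As an example of that ‘short-cut’ style, a parser might look something like the sketch below. The first two character positions follow the trace description earlier; the taken/not-taken indication, branch address and target also live on each line, but their positions are not assumed here – inspect a few lines of the real file and extend the sketch to suit.

    # Hedged sketch of an 'expedient' trace parser; column positions are assumptions.
    def parse_trace(path):
        """Yield (kind, conditional, rest_of_line) for each trace entry."""
        with open(path) as f:
            for line in f:
                if not line.strip():
                    continue
                kind = line[0]                 # 'B', 'R', 'D', 'M' or 'L'
                conditional = line[1] == '?'   # '?' conditional, '*' unconditional
                # The taken/not-taken outcome and any addresses are left in the
                # remainder of the line for you to decode.
                yield kind, conditional, line[2:].rstrip()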

  1. Try a selection of BTB (capacity) sizes and plot the results in terms of overhead (as in phase 1) vs. BTB size. [Answer: graph.] [Marks: 15]

Phase 3: Sophistication

The final phase involves extending your basic BTB and finding out how much you can improve your prediction. Some of this might be bookwork/research rather than invention and experimentation.

As processor pipelines have deepened (and widened, in superscalar processors) the need for better branch prediction has grown. A (relatively) simple improvement can be made by adding some state recording the history of each BTB entry. The simplest change is to note whether a prediction (taken or not taken) was right or wrong each time, and to predict the same thing next time. However, this is simplistic when it comes to loops, where a loop-closing branch is usually taken but is not taken as the loop exits; if the loop is re-entered, it is better to guess that the branch will be taken again next time.
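One common way of realising that idea is a two-bit saturating counter per BTB entry, which only changes its prediction after two consecutive mispredictions. The sketch below assumes that interpretation; the function name and state encoding are illustrative only.

    # Hedged sketch: a 2-bit saturating counter (states 0-1 predict not taken, 2-3 taken).
    def predict_and_update(state, taken):
        """Return (predicted_taken, new_state) given the current state and actual outcome."""
        predicted_taken = state >= 2
        if taken:
            new_state = min(state + 1, 3)
        else:
            new_state = max(state - 1, 0)
        return predicted_taken, new_state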

  1. Add this two-level prediction to your BTB model and record the improvement (if any!) in prediction accuracy.

Note that each entry in the BTB has its own state.

Adding feedback on prediction success adds some complexity to the processor's operation. Your initial BTB model would only be informed when either an unpredicted (possibly newly encountered) branch was taken or a prediction had been made erroneously. In the latter case an ‘unbranch’ is sent to the prefetch unit, which is effectively a branch to the next sequential PC. These are essential because the PC needs to be updated.

With feedback it is also necessary to send a signal to the prefetch unit confirming that a prediction, taken or not, was made correctly, so that the BTB can increase its confidence in its predictions. Your model does not need this much sophistication, since you do not need to model the pipeline. [Answer: graph – maybe superimposed on the previous graph.] [Marks: 12]
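For illustration, the per-branch bookkeeping inside the software model might be organised along the lines sketched below; the function name and the three categories are illustrative, not a required structure.

    # Hedged sketch of classifying each trace entry for the accuracy/overhead statistics.
    def classify(predicted_target, actual_target, branch_taken):
        """Return 'correct', 'mispredict' or 'unpredicted' for one trace entry."""
        if predicted_target is None:                 # BTB made no prediction
            return 'unpredicted' if branch_taken else 'correct'   # fall-through was right
        if branch_taken and predicted_target == actual_target:
            return 'correct'
        return 'mispredict'    # predicted taken but not taken, or wrong target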

Research

More recent branch predictors have more sophisticated mechanisms to improve the accuracy of their speculation as to whether a branch will be taken. High accuracy is important when there might be 30+ instructions in a pipeline.

Remember how many instructions (on average) there were between branches?

  1. With 90% accurate prediction, what are the chances of getting (say) the next five predictions all correct? That's an easy one – come on! [Answer: numeric or proportion.] [Marks: 3]

  2. Investigate and write a brief (and simple) summary of how prediction accuracy has been improved. (Don't simply copy Wikipedia – parse and simplify, describing just the general principle(s).) [Answer: short paragraph summarising the approach(es) discovered.] [Marks: 8]

We have ignored the variant branches so far. Their presence is easy to predict but their corresponding destination is not. From the trace, note that the vast majority of these serve a particular purpose (which is what?). Although rarer than invariant branches, their penalty can still be significant.

  3. Investigate and comment briefly on how these ‘variant branches’ are predicted in modern, high-performance processors. [Answer: short paragraph summarising the approach(es) discovered.] [Marks: 8]

Note that this need has influenced ISAs more recent than this (old) 32-bit ARM, so that such cases are easier to spot and predict. In the given trace there is one(?) occurrence of an unusual type of ‘branch’, not yet discussed.

  4. What higher-level function is this implementing? [Answer: one sentence description.] [Marks: 4]
