const int MAXSIZE = 20000;
double fold, xa[MAXSIZE];
...
for (i = 0; i < MAXSIZE / 2; ++i) {
fold += xa[i] * xa[MAXSIZE - 1 - i];
}
This code is to be executed on a superscalar processor with 4 functional units. The processor is capable of doing one floating-point operation and one integer operation per clock cycle. One of these operations could be a memory access (a load or a store). The integer oepration could be a branch. The following table illustrate the latency and repeat rate of some instructions supported by the processor
Instruction | Latency | Repeat rate |
---|---|---|
Integer add/sub/logical/branch | 1 | 1 |
Integer load/store | 2 | 1 |
Floating-point load/store | 3 | 1 |
Floating-point add/sub/multiply | 2 | 1 |
Floating-point multiply-add | 3 | 1 |
Floating-point division | 19 | 2 |
Write down the instruction involved in the execution of the loop, and the instruction schedule for one iteration. Give the flops per cycle for the schedule, and compute the functional unit utilization
Unroll the loop once, and schedule the instruction. Give the flops per cycle for the schedule, and compute the functional unit utilization
Unroll the loop twice, and schedule the instruction. Give the flops per cycle for the schedule, and compute the functional unit utilization