MATH49111/69111 Scientific Computing Projects 2
List of Projects
Guidelines 1
Gradingcriteria .......................................... 3

1 Neural Networks 5

1.1 NeuralNetworks ...................................... 7

1.2 Evaluatingtheneuralnetoutput............................. 8

1.3 Trainingtheneuralnetwork................................ 9

1.3.1 Stochasticgradientdescent ........................... 11

1.3.2 Evaluating∇Cx{i}(p)............................... 12

1.3.3 Thetrainingalgorithm .............................. 13

1.3.4 Exercises ...................................... 14


1.4 Thecode........................................... 14 1.4.1 Exercises ...................................... 15

1.5 Trainingandusingtheneuralnetwork.......................... 17 1.5.1 Exercises ...................................... 18

1.6 Report............................................ 18


2 Compressed Sensing 20

2.1 IterativeHardThresholding ................................ 21

2.1.1 Gradientdescent.................................. 21

2.1.2 IterativeThresholding............................... 23

2.1.3 ThePhaseTransitionphenomenon ....................... 26

2.1.4 Additionalinformation .............................. 27


2.2 Report............................................ 28


3 Discontinuous Galerkin Methods for Convectiondominated problems 29

3.1 Theory............................................ 29

3.1.1 Approximatingtheunknown,u ......................... 29

3.1.2 Formulatingtheproblem............................. 31

3.1.3 Descriptionofthealgorithm........................... 33


3.2 Exercises........................................... 34

3.3 Report............................................ 37
i

LIST OF PROJECTS ii
4 Revenue Management of Carparks 38

4.1 Theory–TheCarparkModel................................ 38

4.1.1 GeneratingeventsfromaPoissonProcess ................... 40

4.1.2 GeneratingaBooking............................... 40

4.1.3 Calculatedexpectedvaluesofcarparkmeasures . . . . . . . . . . . . . . . 40

4.1.4 OptimalRevenueManagement ......................... 41


4.2 Exercises........................................... 42

4.2.1 TheCustomersclass................................ 42

4.2.2 TheCarParkClass ................................. 43

4.2.3 Revenue management with a fixed allocation strategy . . . . . . . . . . . . 44

4.2.4 Bookingrejectionstrategies ........................... 44

Project 1
Neural Networks and Machine Learning
The idea that computers may be able to think or exhibit intelligence dates back to the dawn of the computing era. Alan Turing1 argued in his seminal 1950 paper Computing Machinery and In telligence (Turing, 1950) that digital computers should be capable of thinking, and introduced the famous ‘Turing test’ to define what was meant by ‘thinking’ in this context: the computer should be able to emulate a human mind. He also suggested that a computer exhibiting human intel ligence could not be programmed in the usual way (where the programmer has ‘a clear mental picture of the state of the machine at each moment in the computation’, as he put it) but would instead learn like a human, drawing inferences from examples. Turing had previously identified that learning of this sort could occur in what he called an unorganised machine: a computational model of a brain consisting of many simple logical units connected together in a network to form a large and complex system.
Although a computer emulation of a complete human mind is still some way off, Turing’s idea that a network could be taught and learn was prescient. Models of this sort are now known as Artificial Neural Networks (ANNs), and the process by which they learn from examples is known as supervised learning, part of the large field of machine learning. In the last decade or so, ANNs have developed to become the best algorithms available for a range of ‘hard’ problems, including image recognition and classification, machine translation, texttospeech, and playing of complex games such as Go.
In this project we will develop a simple neural network and train it to classify data into one of two classes, a socalled binary classification problem. (This problem is discussed further by Higham and Higham (2018), whose notation we use in this project.) Suppose we have a set of points in R2 acquired from some realworld measurements, with each point being of one of two classes, either a blue cross or a red circle:
Figure 1.1: A set of data points in R2, each of which is of one of two types.
This type of data could arise in many applications, for example the success or failure of an industrial chemical process as a function of ambient temperature and humidity, or whether a customer has missed a loan payment as a function of their income and credit score.
Given a new point x ∈ R2, we’d like to be able to predict from the existing data which of the two classes this point is likely to be. Since our example data shows a clear pattern (red circles tend to lie at larger values of x1 than blue crosses), we could sketch a curve by eye that divides the plane into two, with most of the red circles on one side and most of the blue crosses on the other.
x2
x1
Figure 1.2: The plane is divided by a curve, with points on each side classified differently. Given a new
point on the plane, we can now estimate which class it is in by seeing which side of the line it lies.
We could then classify any new point x ∈ R2 by seeing on which side of the curve it lies. Our neural network will perform this classification for us by (1) learning from existing data where the boundary between the two classes of data point lies, and (2) allowing us to evaluate the likely class of a new data point by using the boundary found in step (1). Specifically, the network defines a function F(x), where the curve dividing our two types of point is the zero contour F(x) = 0. Regions where the network evaluates to a positive value F (x) > 0 correspond to one class, and regions where it is negative to the other class.
1.1. NEURAL NETWORKS 7
Although this twodimensional problem is relatively simple, the neural network developed extends naturally to much more difficult classification problems in higher dimensions (where x has hundreds or thousands of components), with more than two classes of point, and where the boundary between different regions is not as clear as in our example above. One example is character recognition, as used by banks and postal services to automatically read handwritten cheques and postcodes. In this case, a segmentation algorithm is applied to split the scanned text into lots of small images each containing a single digit. Suppose each such small image has 25 × 25 = 625 greyscale pixels; the data in it can be represented by a vector x ∈ R625. A neural network of exactly the sort we develop can be used to classify a handwritten character encoded in the vector x into one of 36 classes (one for each of the characters 0–9 and A–Z) with very good accuracy.
1.1 Neural Networks
An artificial neural network consists of a set of neurons. A neuron is a simple scalar functional unit with a single input and a single output, modelled by an nonlinear activation function σ : R → R. In the feedforward networks we will study, these neurons are arranged in a series of layers, with the outputs of neurons in one layer connected to the inputs of neurons in the next. The networks in this project will also be fully connected, meaning that nodes in two adjacent layers and the connections between them form a complete bipartite graph (every neuron in one layer is connected to every neuron in the next).
Input layer Layer 2 Layer 3 Output layer
Figure 1.3: Fullyconnected network topology with two input neurons, two hidden layers of four neu rons each, and one output neuron.
The inputs to the neural network are specified at the first layer; in our classification problem, this consists of two components of the input vector x = (x1, x2). The output of the network is the output of the final layer of neurons; in our example this is just a single neuron, outputting a real number that we hope is close to either 1 or −1, depending on the class of the point. In between are one or more hidden layers. The network as a whole defines a nonlinear function, in this case R2 → R (two inputs, one output).
1.2. EVALUATINGTHENEURALNETOUTPUT 8 As noted above, there are two stages to using a neural network.

Thenetworkistrainedbyspecifyingboththeinputtothenetworkandthedesiredoutput. Through an iterative procedure parameters relating to the connections between neurons (known as the weights and biases, defined below) are modified so that network reproduces the desired output for a wide range of input data. The overall structure of the network (the number of layers, and the number of neurons in each layer) is not changed in the training.

Theweightsandbiasesarethenfixed,andtheoutputofthenetworkisevaluatedforarbi trary new input data.
We will consider initially the second of these steps, assuming that the network has been pre trained and that the weights and biases are known.
1.2 Evaluating the neural net output
Since each neuron is a simple scalar function (R → R), the multiple inputs into a neuron from
neurons in the previous layer must be combined into a single scalar input. To describe how this
is done we first establish some notation. Let L represent number of layers in the network and nl
represent the number of neurons at layer l, for l = 1, 2, . . . , L. The input of the jth neuron at
layer l is denoted by z[l] and the output of the jth neuron at layer l by a[l]. The statement that jj
each neuron is modelled by an activation function σ can then be written algebraically as
a[l] = σ z[l] jj
, for l = 2,3,...,L, j = 1,...,nl, (1.1) and the statement that the inputs to the neural network are specified at the first layer can be
written as
Note that the input to the network xj determines the outputs a[1] of the neurons in the first layer
a[1] = xj, j = 1,...,n1. (1.2) j
rather than their inputs, so (1.1) is not evaluated for the first layer l = 1. The output of the network is the output of the final layer of neurons, a[L]. Consequently, the network has n1 (scalar) inputs
where w[l] are the weights and b[l] are the biases at layer l. These weights and biases are adjusted
k=1
jk j
while the network is learning from the training data, but thereafter remain fixed. Defining b[l] ∈
Rnl as the vector of biases and W[l] ∈ Rnl × Rnl−1 as the matrix of weights at layer l, with components b[l] and w[l] respectively, we can write (1.5) as
Equations (1.7) and (1.8) together describe the feedforward algorithm for evaluating the out put of a neural network given an input vector x (and known weights W[l] and biases b[l]). First (1.8) is used to define a[1], then (1.7) is used for l = 2, 3, . . . , L to obtain the neuron outputs a[2], a[3], . . . , a[L] in turn. The last of these, a[L], is the output of the neural network.
1.3 Training the neural network
The feedforward algorithm outlined in the previous section allows the output of the neural net work to be evaluated from its inputs, provided that the weights and biases at each layer of the network are known. These parameters are determined by training the neural network against measurements or other data for which both the input and desired output of the network are known.
This training data consists of a set of N inputs to the network x{i} with N corresponding de sired outputs, denoted y{i} (i = 1, . . . , N ). We can formalise the idea of how closely our network produces the desired outputs y{i} (when given the corresponding inputs x{i}) by defining a total cost function,
C=N1XN 12∥y{i}−a[L](x{i})∥2, (1.9) i=1
where ‘total cost’ refers to this cost being the sum of costs over all the N training data, and the squared L2 norm ∥ · ∥2 of a vector is the sum of its components squared,
∥x∥2 =x21 +x2 +···+x2n. (1.10)
1.3. TRAINING THE NEURAL NETWORK 10
In the cost function (1.9), for each training datum i the norm is taken of the difference between the desired network output y{i} and the actual output of the neural network a[L](x{i}), evaluated from the corresponding network input x{i} by using the feedforward algorithm. If the network reproduces exactly the desired output for every piece of input data, the cost function is equal to its minimum possible value, zero.
For a given set of training data x{i},y{i}, the cost C is a function of the parameters of the neural network, namely all the weights and biases for each layer. We could write this dependence explicitly by using (1.7) recursively to express a[L](x{i}) in terms of a[1] = x{i} and the weights and biases of layers 2 to L,
(1.11)
The network sketched above with nl = (2, 4, 4, 1) has in total 9 biases (one for each hidden and
output neuron) and 28 weights (one for each connection between a pair of neurons), making the
cost C dependent on 37 parameters in total. For ease of notation we denote by p ∈ RM the
vector containing the weights and biases at each layer that parametrise the network, where M is the number of such parameters (in this case M = 37). We can then write C = C(p) without needing to refer to the weights and biases explicitly.
The process of training the network is equivalent to minimising the difference between the desired and actual neural network outputs for the training data, or in other words, choosing p (and thus the weights and biases at each layer) so as to minimise the cost C(p), which is a nonlinear function of p.
We will solve this problem using an iterative method based on the classical technique of steep est descent. Suppose we have a current vector of parameters p and initial cost C(p), and wish to generate a new vector of parameters p+∆p that such that the new cost C(p+∆p) is minimised. What value should we choose for ∆p? For small ∆p we can use a Taylor series to expand the new cost
where pr is the rth component of the parameter vector p, and terms of order (∆p)2 and higher
have been neglected. Equivalently, we can write this in vector form,
C(p + ∆p) ≈ C(p) + (∇C(p)) · ∆p, (1.13)
where the gradient operator ∇ acts over each of the components of p (i.e. over each weight and bias in the network),
To minimise the new weight C(p + ∆p) we wish to make the final term of (1.13) as negative as
1.3. TRAINING THE NEURAL NETWORK 11
possible. The CauchySchwartz inequality states that
 (∇C(p)) · ∆p ≤ ∥∇C(p)∥2∥∆p∥2, (1.15)
with equality only when ∇C(p) lies in the same direction as (is a multiple of) ∆p. This suggests that to minimise C(p + ∆p) we should choose ∆p in the direction of −η∇C(p), and so we take
∆p = −η∇C(p), (1.16)
where the positive constant η is the learning rate. The training of the neural network therefore starts by choosing some initial parameters p for the network (we will choose these randomly), and repeatedly performing the iteration
p ← p+∆p = p−η∇C(p). (1.17)
We would like this iteration to approach the global minimum of C(p) within a reasonable number of steps. Unfortunately, minimisation of a nonlinear function in high dimensions (recall that even in our simple network p has 37 components) is a fundamentally difficult problem. For sufficiently small η, the Taylor series approximation (1.13) will be valid and each iteration of (1.17) will decrease the cost function.
However, for this to be true in general, η need to be arbitrarily small. In practice, setting η to a very small value means that that p changes only very slightly at each iteration, and a very large number of steps is needed to converge to a minimum. On the other hand, setting η too large means that the approximation (1.13) is rarely valid and the choice of step (1.16) will not decrease the cost function as desired. Choosing an appropriate learning rate η is a balance between these two considerations.
Even if the iteration converges, it may converge to a local minimum of C, where ∇C = 0, but where C is nonetheless much larger than the global minimum cost minp(C(p)). Finding the global minimum of an arbitrary nonlinear function in a highdimensional space is near impossible. We therefore abandon the goal of finding the global minimum of C(p), and instead simply look for a value of p that has a cost C(p) less than some small threshold.
1.3.1 Stochastic gradient descent
Each step of the steepest descent iteration (1.17) requires an evaluation of ∇C(p), the derivative of the cost function with respect to each of the weights and biases in the network. Recalling the definition of the cost function (1.9), we write
where the contribution to the cost from each of the training data points i is
1.3. TRAINING THE NEURAL NETWORK 12
Evaluation of the gradient of the cost (1.18) can be computationally expensive, since the number of points N in the training data set may be large (many tens of thousands), and the number of elements in ∇C (the number of parameters in the weights and biases) may be several million in a large neural network. This means that it takes a long time to calculate each iteration of the steepest descent method (1.17).
A faster alternative is to instead calculate the increment ∆p at each iteration from the gradient of cost associated with a single randomly chosen training point,
p ← p+∆p = p−η∇Cx{i}(p), (1.20)
a technique called stochastic gradient descent, or simply stochastic gradient. In stochastic gra dient the reduction in the total cost function at (1.9) is likely to be smaller than in the steepest descent method (1.17) – in fact many steps are likely to increase the total cost slightly – but since many more iterations can be performed in a given time, the convergence is often quicker.
There are several ways of choosing the training data point i to be used at each step, but we will use the simplest: at each step randomly select (with replacement) a training data point i independently of previous selections.
1.3.2 Evaluating ∇Cx{i} (p)
To use the stochastic gradient algorithm (1.20) we must evaluate ∇Cx{i} (p), the derivative of the cost function associated with a single training data point with respect to each of the network weights and biases.
Recall that this cost (1.19) is a function of the outputs of the final layer of our neural network a[L]. These outputs are related to the inputs to the final layer of the network z[L] through (1.3), and these inputs are related to the outputs of the previous layer of the network a[L−1] through the weights and biases at layer L, as described by (1.6). This iteration, summarised by (1.7) applies recursively at each layer of the network, relating the cost to the input x at the first layer (1.8).
Differentiating this cost with respect to the weights and biases seems extremely complicated, but the recursive nature of the network helps simplify this, through an algorithm called back propagation. Let us define
1.3.3 The training algorithm
The procedure for training a neural network using the stochastic gradient method is therefore as
follows

Choosealearningrateη.

Initialise each weight and bias of the network to a random value, which defines an initial
value for the vector p.

Choosearandomtrainingdatapointi∈{1,2,...,N}.

Applythefeedforwardalgorithmtoevaluatetheneuralnetworkoutputa[L]fromthetrain ing input x{i}:
(a) Evaluatea[1],using(1.8)andthetrainingdatapointx{i}.Setl=2. (b) Evaluate a[l+1] using (1.7).
(c) Ifl<L,increaselbyoneandgobacktostepii.

Applythebackpropagationalgorithmtoevaluatetheerrorsδ[l]:

(a) Evaluate δ[L] using (1.22), using the value a[L] calculated in step 4. Set l = L.

(b) Usetheiteration(1.24)toevaluatetheerroratthepreviouslayerδ[l−1]

(c) Ifl>2,decreaselby1andgobacktostep(b).
Otherwise, all errors δ[l] have been found (recall that no errors are associated with the input layer, so the loop terminates when l = 2).


Apply the stochastic gradient iteration (1.20), using (1.25) and (1.26) to evaluate the com ponents of ∇Cx{i} from errors δ[l] calculated in step 5.

Checkwhetherthenetworkistrained:
(a) Calculate the total cost C using (1.9). Each term in the sum in (1.9) is evaluated by looping over each training vector point x{i} for i = 1, 2, . . . N and evaluating a[L] using the feedforward algorithm as in step 4.
1.4. THECODE 14 (b) If this cost is less than some small threshold, the network is trained and we stop the
algorithm.
We may wish to do step 7 only rarely (perhaps every 100 training points), since evaluation of the total cost C can be relatively expensive compared to step 4, particularly when N is large.

Go back to step 3, until we have already repeated steps 3–6 some large number of times (perhaps 106).

If after this large number of iterations the cost has not decreased below the threshold in step 7(b), the network has failed to converge. Try again from step 1, changing the learning rate, initial weights and biases, convergence threshold for the total cost and/or maximum number of iterations of the stochastic gradient algorithm.
1.3.4 Exercises
1. Supposethatwetakealinearactivationfunctionσ(z)=z.Showthatthenetworkoutput a[L] is then linearly dependent on the input x,
a[L] = Mx + c, (1.27) for some matrix M and constant vector c. Do hidden layers play any useful role in such a
network?
2. Provethatthestatedequationsdefiningthebackpropagationalgorithm(1.22),(1.24),(1.25)
and (1.26) are correct, using the hints in §1.3.2.
1.4. THECODE 15
\\ TODO:
is placed in a comment. At each of these TODO locations, code needs to be written to im plement the algorithm described.
• AmainfunctionandexampleofusingtheNetworkclass,intheExampleNetworkUsefunc tion.
1.4.1 Exercises
Carefully read through the code in neural_template.cpp, particularly the latter two sections where the Network class is defined and used. The following exercises go through the process of finishing the implementation of the neural network by writing code where indicated by the TODO comments.
For each of the exercises you should write some tests in the Network::Test function that check that your code is working as you expect. An example test for the Network::FeedForward method illustrates what each test might contain:
// A example test of FeedForward
{
}
// Make a simple network with two weights and one bias
Network n({2, 1});
// Set the values of these by hand
n.biases[1][0] = 0.5;
n.weights[1](0,0) = 0.3;
n.weights[1](0,1) = 0.2;
// Call function to be tested with x = (0.3, 0.4)
n.FeedForward({0.3, 0.4});
// Display the output value calculated
std::cout << n.activations[1][0] << std::endl;
// Correct value is = tanh(0.5 + (0.3*0.3 + 0.2*0.4)) // = 0.454216432682259... // Fail if error in answer is greater than 10^10: if (std::abs(n.activations[1][0]  0.454216432682259) > 1e10) {
return false; }
A simple network is set up with the weights and biases set by hand. FeedForward is called, and the output neuron activation calculated by this is compared against a value computed by hand.
1.4. THECODE 16
If it does not agree, the test function returns false to indicate that a test has failed. The code is commented to explain what is going on and why a given answer is expcted.
Note that we do not compare floating point values for equality,
if (n.activations[1][0] == 0.454216432682259)
because small unavoidable floatingpoint errors may mean that the numbers are not exactly equal. Instead we check whether the error (absolute value of the difference between the result and the expected answer) is less than some small threshold:
if (std::abs(n.activations[1][0]  0.454216432682259) < 1e10)

Implement and write a test for the Network::Sigma and Network::SigmaPrime func
tions, which return the activation function (1.4) and its derivative.

WritetheimplementationofthemethodNetwork::InitialiseWeightsAndBiasesthat sets all components of the weights and biases in the network to normallydistributed pseudo random numbers with mean zero and standard deviation initsd.

The function as written sets up the appropriate distribution so that you can obtain such a pseudorandom number from the expression dist(rnd).

Look at the top of the Network class for the constructor. The constructor initialises the weight matrices and bias vectors for each layer of the network to the correct sizes, so in InitialiseWeightsAndBiases all that needs doing is to set the value of each component.

The commented table in the constructor may help you see what values to loop over to fill out the rows and columns of the weight matrices, and the elements of the bias vector, at each layer.

In the test you write, you should display some weights and biases generated by this function to verify that it is working as expected, but because these are random you may not be able to automatically determine whether the test has passed or failed.


Implementthefeedforwardalgorithm,equations(1.7)and(1.8),intheNetwork::FeedForward method. To test this implementation you should use the existing example test, but you can
also write your own.

Implement the Network::Cost method, which returns the cost associated with a single training data point – i.e. one term of the sum in (1.9). For the networks we will use there is only one output neuron, and so both y and a[L] have only one component, but your code should work (and be tested) in networks with more than one output neuron.

ImplementtheNetwork::TotalCostmethod,whichevaluates(1.9)overasetoftraining data. You will need to make use of the FeedForward and Cost methods that you have already written.
1.5. TRAINING AND USING THE NEURAL NETWORK 17

Implement the backpropagation algorithm in Network::BackPropagateError, using equations (1.22) and (1.24) to evaluate the error vector δ at each layer of the network from l = L to l = 2. Write a test for the implementation of (1.22) using the same threeneuron network as the example test for FeedForward, with the same weights, biases and input vector x = (0.3, 0.4). With a desired output y = (1), verify that your output node has an errorδ[2] =−0.43318....

Implement the stochastic gradient iteration (1.20), using the formulae (1.25) and (1.26) to generate the components of ∇Cx{i} . You may find the OuterProduct function provided helpful in writing the expressions in vector form.

Finally,implementthetrainingalgorithminsection1.3.3intheNetwork::Trainmember function. The overall structure of the algorithm has been written already, but you should fill in the gaps indicated by the TODO comments. Most of these steps require only a single line of code, calling one of member functions implemented in the previous exercises.
1.5 Training and using the neural network
As discussed above, a trained neural network for binary classification in two dimensions defines a function F (x), which divides the plane into two regions, where F > 0 and F < 0, respectively. The contour F = 0 dividing these regions is a curve dividing the two classes of data (figure 1.2). We will use the neural network code developed to classify the data in figure 1.1, first setting up and training the network with this data, and second using the network to draw the dividing contour sketched in figure 1.2.
The ability of network to successfully classify a set of data (evidenced by a small total cost) is dependent on the number of layers and neurons in the network, as well as the method and pa rameters used in the training. Choosing these parameters is not an exact science, and often some experimentation is required to find parameter values for which the training algorithm (stochastic gradient iteration) converges.
A suitable network for classifying the data in figure 1.1 has two input neurons, two hidden layers with three neurons each, and one output neuron. The stochastic gradient algorithm is ini tialised with each component of the parameter vector p (the weights and biases of the network) distributed normally with zero mean and a standard deviation of 0.1. Applying the stochastic gradient algorithm with a training rate η = 0.1 usually leads to convergence of the total cost to C < 10−4 in about 50000 iterations.
Example code showing how to train and run the network with these parameters is provided in the ClassifyTestData function. After training the network, this function saves two files to disk: test_points.txt, which contains a list of the training points and their class, and test_contour .txt, which contains a 251 × 251 matrix of outputs from the trained network, covering a range of network inputs x = (x1, x2) ∈ [0, 1]2 arranged in a grid, where x1 = 0, 0.004, 0.008, . . . , 1 and x2 = 0, 0.004, 0.008, . . . , 1. In MATLAB or Octave, this training data can be plotted and the zero contour of the network output drawn with the commands:
p = load('test_points.txt');
1
1.6. REPORT 18
cl = p(:,3); plot(p(cl==1,1), p(cl==1,2),'bx'); hold on plot(p(cl==1,1), p(cl==1,2),'ro'); cdat = load('test_contour.txt'); contour(linspace(0,1,251),linspace(0,1,251), cdat',[0 0],'k');
1.5.1 Exercises
1. RuntheClassifyTestDatafunctionandverifythatthecostusuallyconvergesto10−4in around 50000 iterations (if not, check that the code and tests written in the previous set of exercises are correct).
2. Generateplotsofthecost(perhapsonalogarithmicaxis)againstiterationnumber,and/or measures of the convergence rate against η. You may need to modify the Network::Train member function to export the cost data you need.
1.6 Report
You are required to use your neural network to classify the simple training data provided by the ClassifyTestData, and the more complex data from the GetCheckerboardData and GetSpiralData
functions, each of which returns 1000 training data points. N. B. For these more challenging data sets, it may be difficult to reduce the total cost below 0.01, even with a suitable network and sev eral million training iterations.
Write your report as a connected piece of prose. Read the Guidelines section at the start of this document for other instructions on the format of the report. Some questions below have marks in square brackets, indicating the 30 marks available for correct numerical results and code. The other 30 marks are awarded for the overall quality of validation, analysis and the writeup, as detailed in the grading criteria provided in the guidelines.

ProduceaplotoftheresultsforClassifyTestDataasacontourinthestyleoffigure1.2, for example. Are the results accurate? [6]

Investigatetheconvergenceofthestochasticgradientalgorithmwithlearningrate,η.You should include plots of convergence measures against η, and cost against iteration number. Are your results consistent with the known theory of the method? [7]

Investigate how the standard distribution of the initial weights and biases influences the convergence rate. You should include plots or tables of convergence measures against measures of the distribution. Why might a very large standard deviation hinder conver gence? [7]

ProduceplotsoftheresultsforClassifyTestData,andthemorecomplexdatafromthe GetCheckerboardData. Investigate how the number of hidden layers and the number
Project 2
Compressed Sensing
Author: Dr. Martin Lotz
Modern technology depends increasingly on the acquisition and processing of vast amounts of data. For many applications the data has an underlying simplicity that allows it to be greatly compressed; however, this is normally only possible after the full highresolution data set has been acquired. This is a highly inefficient process, artificially slowing down many devices (one example being an MRI scanner). A recent breakthrough has been the discovery that compressible data can be efficiently acquired directly in a compressed form, and a high quality approximation of the full data then recovered by optimisation methods.
Compressible data is defined by there being a representation that approximates the data ac curately, but has substantially fewer nonzero entries. A vector x ∈ Rn is called ksparse, if at most k coordinates are nonzero. The prototype problem in Compressed Sensing consists of finding a ksparse solution x to an underdetermined system of linear equations Ax = b, where A is an m × n measurement matrix and m < n (each row is considered a measurement of the signal x).

Ifthenumberofmeasurementsmsatisfiesm≥n,thenxisthesolutionofaleastsquares problem and can be found by standard methods. However, when x is sparse the number of measurements does not reflect the sparsity of the solution.

Ifxisksparseandm < n,butm > 2k,thenformostmatricesA,thesolutionforxis uniquely determined – but this may require an exhaustive search over submatrices that is impossible for data sizes of practical interest.
An important question is thus whether it is possible to recover a ksparse vector efficiently, while keeping the number of measurements m proportional to the information content k rather than the large ambient dimension n. Classical results from Compressed Sensing show that this is indeed the case. Namely, for most A we can recover a ksparse solution efficiently using optimization methods, provided
m ≥ C · k log(k/n). (2.1) 20
2.1. ITERATIVEHARDTHRESHOLDING 21
The goal of this project is to explore a particular algorithm that is able to accomplish sparse recovery under certain conditions. The algorithm is a gradient projection method called Nor malised Iterative Hard Thresholding (NIHT). By experimenting with the algorithm, you will find out that there is a phase transition phenomenon that governs the performance of the algorithm: there is a threshold m0 < n, such that as the number of equations increases past m0, the probability of NIHT successfully recovering a ksparse vector from Ax = b changes abruptly from almost 0 to almost 1.
2.1 Iterative Hard Thresholding
Before presenting the sparse recovery algorithm, we begin with a discussion of the classical steep est descent method for solving linear least squares problems. While this algorithm is not the best method for linear systems, it serves as a basis for the NIHT algorithm presented in the next section.
2.1.1 Gradient descent
Suppose we would like to solve a linear least squares problem, i.e.
minimise ∥Ax − b∥2 , (2.2)
with A ∈ Rm×n and m ≥ n (the matrix A is “tall and skinny”). Differentiating with respect to the (vector) x, we find that the solution to this minimisation problem is characterised by the normal equations
A⊤Ax = A⊤b, (2.3)
where A⊤ is the transpose of A. This symmetric system of equations can be solved by direct meth ods, or by iterative methods such as Conjugate Gradient. Perhaps the simplest iterative method is Gradient Descent. Given a function f(x) and a starting point x0, one tries to minimise the func tion by iteratively taking steps in the direction of the negative gradient, the steepest descent:
xi+1 = xi − αi∇f(xi). (2.4)
(Here xi denotes the vector x at the ith stage of the algorithm, not the ith component of the vector x). The step length α may be taken to minimise the function α 7→ f (xi − α∇f (xi )). Consider the function f(x) = 21 ∥Ax − b∥2. The residual at x is defined as r = A⊤(b − Ax). It is not difficult to show (try it!) that the gradient and the minimising step length α satisfy
∇f(x) = A⊤(Ax − b) = −r, α = r⊤r . (2.5) r⊤A⊤Ar
2.1. ITERATIVEHARDTHRESHOLDING 22 The gradient descent algorithm then takes the following form. Start with an initial guess x0 (usu
ally, x0 = 0 should be fine) and r0 = A⊤(b − Ax0). Then successively compute αi = ri⊤ri ,
ri⊤ A⊤ Ari xi+1 = xi + αiri,
ri+1 = A⊤(b − Axi+1) = ri − αA⊤Ari
until ∥ri+1∥ is smaller than some tolerance, or a maximum iteration bound is reached.
Exercises
(2.6)
(2.7) (2.8) (2.9)
• Recall/familiariseyourselfwiththeMVectorclassfromthefirstsetofprojects(abasicim plementation of this class can be found on Blackboard). Implement a Matrix class (or use the MMatrix class from the first set of projects):
class Matrix { public:
// Constructors and destructors (see also vector class)
explicit Matrix() : N(0),M(0) {} Matrix(int n, int m) : N(n), M(m), A(n,vector<double>(m)) {}
// Include other methods here
protected: unsigned N, M; // Matrix dimensions vector <vector<double> > A; // Store as a vector of vectors
};
Overload the operator * to implement matrix/vector and matrix/matrix multiplication.

Addamethodtranspose()totheMatrixclassthatreturnsthetransposeofamatrix:

ImplementtheSteepestDescentAlgorithmasafunction
Matrix transpose() const {
// Code to return the transpose of the matrix
}
int SDLS(const Matrix& A, const MVector& b, MVector& x, int maxIterations, double tol)
{
// Code here
2.1. ITERATIVEHARDTHRESHOLDING 23 }
The input consists of a matrix A, a vector b, an initial guess x, the maximum number of iterations maxIterations, and a tolerance (a suitable value might be 10−6 ). The solution is returned by reference in x, and the return value of the function should be the number of iterations taken to reach a solution. Choose a suitable return value for the case where a converged solution is not reached within the specified maximum number of iterations.
Record the trajectory of points xi in R2 (how might we modify the signature of the SDLS
function to optionally output these values?) and write these data to a file. Plot the trajec
tory.

What happens when the bottom row of A is changed to (1.8, −2)? How about when this row is (−2, −2)? Can you explain this behaviour? See for example, (Chong and Żak , 2008, Chapter 8).

Considersimplewaystominimisecomputationalcostofyourimplementationofthisalgo rithm.
2.1.2 Iterative Thresholding
We now turn to a slightly different problem. Consider solving for x in the system
Ax = b, (2.11)
where the matrix m × n matrix A is ‘fat’, that is, the number of rows m is smaller than the number of columns n. We will specify that x that is ksparse for some k < m, that is, x has at most k nonzero entries. (Without this constraint, the problem is underdetermined, since the number of equations m is smaller than the number of unknowns n.)
• In(2.11),Aisanm×nmatrix,xisksparse(where2k≤m)andallm×msubmatricesof A are invertible. Show that x is uniquely determined.
Let I ⊂ [n] = {0,...,n − 1} denote a subset of the column indices, and write AI and xI for the columns of a matrix and entries of a vector indexed by I. For example, if
2.1. ITERATIVEHARDTHRESHOLDING 24
and I = {0, 2}, then
Note that we use the C++ indexing convention, starting with 0 rather than 1.
If we knew exactly which entries of a solution of (2.11) were nonzero (the support of x) then,
setting I to the indices of these entries, we have
AIxI =b. (2.14)
This equation is overdetermined (the m × k matrix A is ‘tall’ since m > k), and we can find xI by solving a least squares problem with any method of choice (for example, the gradient descent method). We can then recover the solution x by filling the entries of x indexed by I with xI , and zero otherwise. However, in general we don’t know which entries of x are nonzero, and searching for the set of nonzero entries may involve a very large number of tries (up to n choose k), making this approach impractical.
An alternative method of solving (2.11) is the Normalised Iterative Thresholding Algorithm (NIHT) by Blumensath and Davies (2010). This algorithm resembles the normal Gradient Descent, except that at each step a thresholding operation is performed: only the k greatest elements of x (in absolute value) are retained:
xi+1 = Hk(xi + αA⊤(b − Axi)), (2.15)
where the operator Hk (v), applied to a vector v, returns v but with all except the k largest (in mod ulus) elements of v to zero. For example, choosing k = 3, H3 ((−9, 6, −7, 8, −5)T ) = (−9, 0, −7, 8, 0)T . This algorithm is an example of a projected gradient method, as it alternates a gradient step with a projection step (to the set of kdimensional coordinate arrangements). In this project we study a simplified, but slightly less effective, version of the NIHT algorithm proposed by Blumensath and Davies (2010).
If the algorithm converges, it returns a ksparse solution. But is it a solution to our problem? Thealgorithmdoesnotalwayssucceedinfindingtherightsparsesolution.1 However,itisknown that when the number of equations m is big enough, then for most matrices A the algorithm solves the problem satisfactorily. By most we mean that if we choose a matrix A at random, ac cording to some distribution, the matrix will be such that NIHT finds a sparse solution of Ax = b.
Exercises
These exercises work towards an implementation of NIHT.
• AddamethodthresholdtotheMVectorclass,whichsetsallbutthelargestkelements (by modulus) to zero.
1 if it could, we would be able to solve an NPhard problem in polynomial time, one of the big open problems in math ematics and computer science!
2.1. ITERATIVEHARDTHRESHOLDING 25
void threshold(int k) {
// Set all but k largest elements to zero
}
– Findthekthlargestelementbyusingasortingalgorithm.Youmayeitheruseoneof the algorithms from the first set of projects, or std::sort from the standard library (in the <algorithm> header). Be careful, as the sorting algorithms tend to change the vector under consideration! You can also think about modifying the quicksort algorithm from the first set of projects into a quickselect algorithm that finds the kth largest element without having to sort the whole array (or use std::nth_element). Then set all the entries smaller than the kth largest one to zero.

ImplementtheNIHTalgorithm.TheinputshouldbeliketheoneforGradientDescent,but in addition contain a sparsity parameter k.
Note that in NIHT we must use (2.8) not (2.9) to calculate the residual, since the latter formula assumes that no thresholding has occurred.
The algorithm should return the number of iterations it took to obtain find a solution. The value of maxIterations may have to be large here.

Inordertotestthealgorithm,itisusefultobeabletobeabletosetbandAtoarandomvec tor / matrix. Implement a method initialize_normal in the Matrix class that sets each entry to a random value, normally distributed, with zero mean and variance 1/m. Write a similar method for the MVector class that initialises the vector to a random ksparse vec tor where each nonzero component has zero mean and unit variance (this method should have an integer parameter k.)
You may find the following function useful; it generates a normally distributed value with variance 1 and zero mean by applying the BoxMüller transform to the uniformly distributed values produced by the std::rand function.
int NIHT(const Matrix& A, const MVector& b, MVector&x, int k, int maxIterations, double tol) {
// Initialise starting vector
x = A.transpose()*b; x.threshold(k); // Get s largest values
// Code here
}
#include <cstdlib> double rand_normal() {
static const double pi = 3.141592653589793238; double u = 0;
2.1. ITERATIVEHARDTHRESHOLDING 26
while (u == 0) // loop to ensure u nonzero, for log {
u = rand() / static_cast<double>(RAND_MAX); }
double v = rand() / static_cast<double>(RAND_MAX);
return sqrt(2.0*log(u))*cos(2.0*pi*v); }
Create a random 1000 × 1 matrix and record the entries in a file. Create a histogram of the data to confirm that the data looks like a normal distribution (you can use a plotting package of your choice, or calculate the histogram yourself in C++).

Totestthealgorithmfirstgeneratearandomksparsevectorx,arandomm×nmatrixA andthevectorb = Axforvariousvaluesofn,mandk(forexample,n = 100,m = 50, k = 10). Test whether the algorithm is able to recover x when given A and b as input. Can you find a combination of parameters for which the algorithm always works? If so, visualise the performance by plotting the residuals.

Canyoufindwaystodetectsignsofnonconvergenceotherthanlettingthemainlooprun maxIterations times?
2.1.3 The Phase Transition phenomenon
In numerical mathematics, a phase transition is a sharp change in the character of a computational problem as its parameters vary. In the context of compressed sensing, an algorithm for recover ing a sparse vector x may succeed with high probability when the number of samples exceeds a threshold that depends on the sparsity level; otherwise, it fails with high probability. Recent work indicates that phase transitions emerge in a variety of random optimisation problems from math ematical signal processing and computational statistics. Figure 2.1 illustrates this phenomenon in the context of compressed sensing, and in particular NIHT. Clearly, if m = 1 (one equation), then no signal can be recovered. On the other extreme, if m = n (full system), then recovery is possible in most cases (if A is not too badly conditioned). In between there is an m0, depending on the sparsity k, at which the probability of successful recovery jumps quite rapidly from 0 to 1.
Exercises
The goal of these exercises is to find the phase transition location for NIHT for a selection of pa rameters.

Generatearandomvectoroflengthn=200withatmostk=10nonzeroentries.

Repeat the above for values of m ranging from 4 to 199 in steps of 5. For each such value m, run NIHT T times (where, for example, T = 50 or 100) on random data, and record the success ratio p(m) as number of successful recoveries divided by T . Plot p(m) as a function
2.1. ITERATIVEHARDTHRESHOLDING
Sparsity k/n
Figure 2.1: For parameters (k/n, m/n) in the red region, sparse recovery succeeds with overwhelming probability, while in the blue region it fails with overwhelming probability.
of m and try to determine the range in which p changes from 0 to 1. Remember that if m < 2k a solution for x is not uniquely determined. Note: This experiment may take some time, so it can be helpful to output some values to the screen during the calculation to keep track of the progress. Keep in mind

Repeattheabovestepsfork=20.Howdoesthesuccessfulrecoveryregionchange?

Ifyoufeelambitious,youcantrytorecreateafiguresuchasFigure2.1(bewareofthecom putational cost involved, this may involve more sophisticated experimental planning!). You may find that the simplified version of NIHT presented in this project is slightly less effective at recovery than the example in figure 2.1.
2.1.4 Additional information
A general introduction to compressed sensing can be found in Foucart and Rauhut (2013). A com prehensive experimental study of NIHT and related algorithms on GPUs (graphics processor units) can be found in Blanchard and Tanner (2013). Phase transition phenomena for greedy algorithms are not yet fully understood. In contrast, a conclusive explanation phase transitions for convex optimisation approaches to finding sparse (and other simple) solutions of underdetermined sys tems has been derived in Amelunxen et al. (2013).
Number of equations m/n
2.2. REPORT 28 2.2 Report
Write your report as a connected piece of prose. Read the Guidelines section at the start of this document for other instructions on the format of the report. Some questions have marks in square brackets, indicating the 30 marks available for correct numerical results and code. The other 30 marks are awarded for the overall quality of validation, analysis and the writeup, as detailed in the grading criteria provided in the guidelines.

Describe the Gradient Descent method and its implementation. Investigate how the algo rithm performs on the 3 × 2 example in equation (2.10) and the two modifications sug gested, and discuss. Produce plots that illustrate the convergence to a solution for these 3 × 2 systems. [10]

Describe the Normalised Iterative Hard Thresholding algorithm and explain how it relates to the Gradient Descent method. Give specific details about how you have implemented the thresholding step in C++, and discuss the computational cost per iteration. Produce tables and/or plots that illustrate your testing of the algorithm, and your investigations of variations in the parameters n, m and k. [10]

DescribethephasetransitionphenomenonfortheNIHTalgorithm.Illustratethisbydeter mining, for vectors of length n = 200 and sparsities k = 20 and k = 50 (that is, sparsity ratios 0.1 and 0.25), the point m0 such that for m > m0, NIHT recovers the sparse vector with overwhelming probability. Plot a graph of the empirical recovery probabilities against m. Explain how you set up the numerical experiments to produce this graph, in particular explain how you detected nonconvergence. [10]
Project 3
Discontinuous Galerkin Methods for Convectiondominated problems
Consider the onedimensional transport equation of a scalar field, u(x, t),
where f (u) is a flux function, the time t ∈ [0, ∞) and x ∈ [a, b] is a bounded spatial domain.
We shall investigate its solution in two special cases:
• f(u) = Cu, where C is a constant — the advection equation,
• f (u) = 12 u2 — the inviscid Burgers’ equation.
In fact, it is very hard to solve these (hyperbolic) equations numerically. In this project, you will implement a method known as the discontinuous Galerkin method that can be used to solve such equations accurately. In particular, it can handle any discontinuities (shocks) that may develop. The use of C++ classes makes it easy to modify the code to handle different equations of the same form by overloading the flux function, f , in different classes.
3.1 Theory
3.1.1 Approximating the unknown, u
The general approach is to split the spatial domain into N “elements” each consisting of two “nodes” x0 (the node on the left) and x1 (the node on the right).1 In order to distinguish nodes
1In order to aid the translation into C++ we shall start all indices from zero. 29
3.1. THEORY 30
that belong to different elements we add an additional index so that xe0 is the lefthand node of the eth element. We label the elements from left to right and if the domain is to be connected, it follows that the righthand node of one element must be equal to the lefthand node of the element to its right, xe−1 = xe , see Figure 1.
are known as the local interpolation (shape) functions.
In each element, we require an approximation to the function u(x, t). We are free to use any
suitable method, but the easiest and (most) obvious approach is to interpolate u using the geo metric shape functions, ψ. Thus, in element e
u(x, t) = ue0 ψ0(s) + ue1 ψ1(s), (3.4)
where uei is the value of u at the position xei . In other words, we approximate the unknown u using linear interpolation within each element.
3.1. THEORY 31
3.1.2 Formulating the problem
Galerkin methods are based on the socalled “weak form” of the differential equation (3.1), ob tained on multiplication by a test function, v(x), and integrating over the problem domain:
We note that if u is a solution of the original equation (3.1) then it is also a solution of the “weak form”. On the other hand, if we insist that equation (3.5) must be satisfied for every (sensible)2 test function, v(x), then the solution of equation (3.5) will also be a solution of the “strong form” (3.1).
In the discontinuous Galerkin method, we integrate the equation over each element sepa rately. In element e, we have:
the final two terms represent the net flux out of the element.
In a continuous formulation the unknown u must be the same in neighbouring elements, i.e.
ue−1 = ue — the same constraint that we impose on the coordinate, x. 10
In a discontinuous formulation, this constraint is relaxed and the flux is replaced by a nu merical flux, h(ue−1,ue), that is used to approximate f(ue). The exact choice of numerical flux
function depends on the equation being solved. The governing equation in each element is
Our linear elements each contain two unknowns and so we require two equations to deter mine these unknown values. The two equations are obtained from equation (3.8) by using two different test functions. In Galerkin methods, the two test functions are precisely the two shape functions (3.3) used to interpolate the unknown. Thus, the two discrete equations that must be
2by sensible, of course, we mean satisfying suitable integrability and smoothness constraints
3.1. THEORY 32 solved in each element are:
Z x e1 ∂ u ∂ ψ
ψ − f(u) 0 dx + h(ue, ue+1)ψ (xe) − h(ue−1, ue)ψ (xe) = 0, (3.9a)
xe0∂t0∂x 10011000 Z x e1 ∂ u ∂ ψ
ψ − f(u) 1 dx + h(ue, ue+1)ψ (xe) − h(ue−1, ue)ψ (xe) = 0. (3.9b) xe0∂t1∂x 10111010
The definition of the shape functions (3.3) implies that ψ0(xe0) = ψ0(s = −1) = 1, ψ0(xe1) = ψ0(s = 1) = 0,ψ1(xe0) = ψ1(s = −1) = 0andψ1(xe1) = ψ1(s = 1) = 1. Thus,thegoverning equations are
3.1. THEORY
33
where the vectors of unknowns ar
Remember that according to our Cstyle index convention i, j ∈ {0, 1}. The remaining vector
Multiplying through by the inverse of the mass matrix gives
U ̇e =(Me)−1Fe(Ue)+H(Ue−1,Ue,Ue+1),
(3.14)
and we can use an explicit timestepper to solve this coupled (2×2) system of ordinary differential equations. If we use simple firstorder finite differences to approximate the time derivative, we obtain the iterative scheme:
hi
Ue(n+1) = Ue(n) + ∆t(Me)−1 Fe(Ue(n)) + H(Ue−1(n),Ue(n),Ue+1(n)) , (3.15)
where Ue(n) is the solution at the nth discrete time level and ∆t is the fixed time increment. The system (3.15) must be assembled and solved for each element, but the computation of the function H requires information from neighbouring elements.
3.1.3 Description of the algorithm
Initialisation
• CreateNelementsandforeachelementeassignspatialcoordinatesxe0andxe1andaninitial
guess for the unknowns ue0 and ue1.
• Setuptheconnectivityinformationforeachelement(assignitsleftandrightneighbours).
Timestepping loop
• Loopoverallelementsandforeachelemente:
– Calculate the mass matrix, Me , and flux vectors, Fe(Ue) and H(Ue−1,Ue,Ue+1). ij
– AssemblethesystemofODEs(3.14).
– Performontimestepofthelinearsystem(3.14)andstoretheresultsfortheadvanced time in temporary storage.
∂ ψ ∂ x
3.2. EXERCISES 34 • Loop over all elements again and update the current unknown values to the previously
stored advancedtime values.
Thus, the algorithm requires a method for numerical integration over each element, e. g. Gauss rule, Trapezium rule, etc, and matrix inversion and multiplication routines. You may (should) reuse classes from previous projects to perform these tasks.
3.2 Exercises
You are advised to go through the following exercises to aid in your understanding of the method
Initialisation
1. Write a C++ class called AdvectionElement that contains storage for two positions and two unknowns, all of which should be double precision variables. Your class should include two functions interpolated_x(s) and interpolated_u(s) that return the values x(s) and u(s) by implementing the equations (3.2) and (3.4) respectively; for example
class AdvectionElement { public:
// Pointer to the left neighbour
AdvectionElement *Left_neighbour_pt;
// Pointer to the right neighbour
AdvectionElement *Right_neighbour_pt;
// Storage for the coordinates
std::vector<double> X; // Storage for the unknowns
std::vector<double> U;
// Constructor: initialise the vectors to hold two entries.
AdvectionElement() { // Resize the vectors to hold two entries each // FILL THIS IN }
// Return the value of the coordinate at local coordinate s using // equation (1.2) double interpolated_x(double s) {//FILL THIS IN}
// Return the value of the unknown at local coordinate s using
3.2. EXERCISES 35
// equation (1.4) double interpolated_u(double s) {//FILL THIS IN}
}; //End of the class definition
2. Testyourclassbywritingamain()functionthatcreatesNuniformlyspacedelementsin the domain x ∈ [0, 2π]. You should do this by creating a vector of elements and looping over them to fill out the member data X, U and the neighbour pointers for each element. Make sure that you set the neighbour pointers correctly. Set periodic boundary conditions by connecting the first and last elements. In addition, set the value of the unknown to the function:
u = 1.5 + sin(x) (3.16) Don’t forget that you will need to set the two values of the coordinates x as well as the two
unknowns u in each element. For example,
int N = 50; std::vector<AdvectionElement> elements(N); for (int i=0; i<N; i++) {
elements[i].X[0] = /* some value here */; elements[i].X[1] = /* some value here */; // write code to fill out other member data for the i'th element
here
}
3. Producegraphsoftheapproximationofthefunctiondefinedinequation(3.16)byplotting the coordinate and the unknown at the centre of each element for N = 10, 100 & 200 elements. What do you notice as you increase N ? Is this what you expect?
Timestepping loop
1. Showthatthecomponentsofthemassmatrixare
Mij = 2
where s is the local coordinate in the element. Hence, show that
2. Showthatthecomponentsofthefluxvectorare

AddafunctiontotheAdvectionElementclassthatcalculatesthefluxforthesimplead vection equation at a value of the scalar field, u. In the first instance let C = 1. The function is virtual so that it can be overloaded later to solve Burgers’ equation.

Write a function that returns the integral of the flux function over the element using, for example, a twopoint Gauss rule, which states that
// Calculate the flux
virtual double flux(double u) {
return //FILL THIS IN }
for any function g(s).
// Calculate the integral of the flux function over the element // using the twopoint Gauss rule double integrate_flux() {
//FILL THIS IN
}
5. Write a function h(a, b) that returns the numerical flux. A good general choice is the local Lax–Friedrichs flux:
h(a,b)= 1(f(a)+f(b))− 1 max f′(ζ)(b−a). (3.17) 2 2 a≤ζ≤b
Again this function is virtual so that it can be overloaded later.
6. Finally,writeatimestep(dt)functionthatcalculatestheupdatedvaluesoftheunknowns U using the scheme (3.15).
virtual double h(double a, double b) {
return //FILL THIS IN }
void timestep(double dt) {
//FILL THIS IN
}
3.3. REPORT 37
3.3
Note that for the scheme to be explicit you must not update the values of the unknowns until the timestep(dt) function has been called for every element, because the numerical flux function must always use values at the present time from the neighbouring elements, see equation (3.15). Thus, you will need to provide additional storage for the values of the unknowns at the advanced time level and a mechanism to update all the values once the timestep(dt) function has been called for every element.
Report
You are required to use the Discontinuous Galerkin method with linear interpolation to find nu merical solutions to the two equations:
∂u + C ∂u = 0, (Advection equation) ∂t ∂x
∂u + u∂u = 0, (inviscid Burgers’ equation). ∂t ∂x
(3.18a) (3.18b)
on the domain x ∈ [0, 2π] with periodic boundary conditions.
Write your report as a connected piece of prose that describes the problem and your solution
to it. Read the Guidelines section at the start of this document for other instructions on the format of the report. Some questions have marks in square brackets, indicating the 30 marks available for correct numerical condition
(
1 0 ≤ x ≤ 1, (3.19) 0 otherwise.
u =
What do you notice in this case? How do you think that this problem could be resolved? [7]
• Solve Burgers’ equation (3.18b) for the sinewave and squarewave initial conditions and produce graphs of the solutions for times in the range t ∈ [0, 2]. What happens to the initial profile as time increases? Is this what you expect? [10]
[You should use inheritance to create a newBurgersElementthat overloads the appropriate flux and numerical flux functions.]
Project 4
Revenue Management of Carparks
Revenue Management is a relatively new area of economic pricing theory and is one that touches our lives on an almost daily basis. Whenever we use online booking systems to purchase perishable goods (such as travel tickets, concert tickets, or a hotel room) we often see products (such as different price classes) appear or disappear in real time. This is a dynamic form of revenue management in which the booking system uses some procedure to evaluate whether a product should be available or not.
Under the setting of a carpark booking system at an airport we use an object orientated ap proach to test different revenue management systems on a randomly generated set of customers. By averaging the results of your model across several realisations (each with different random bookings) we can obtain the expected value of various measures of the car park, such as revenue and occupancy. With these measures of car park performance we can evaluate new strategies for revenue management of the car park.
This project is relatively simple mathematically but involves more interpretation of results and pricing strategies. As such, it is most suitable for students with a background in finance or eco nomics.
4.1 Theory – The Carpark Model
In this project we try to develop algorithms to optimally manage the booking system of an airport car park. We must try to manage the car park as best we can to generate the most revenue pos sible. The car park will have a finite number of spaces available which we can sell to customers. Bookings for the car park may be made in advance on the internet or on arrival at the car park. A booking will consist of three separate times, the time the booking was made, the time of arrival at the car park and the time of departure from the car park. Given the capacity constraints we will need to make sure at the time of booking that space is available for the duration of the stay. The
38
4.1. THEORY–THECARPARKMODEL 39 ith booking made on the system Bi may be written as
where tb is the booking time, ta the arrival time and td is the departure time.
In order to keep track of all bookings made over time and their effect on the number of cars in
the car park, and also the revenue generated by the bookings, we split the problem into discrete
time steps. Take the interval in time t ∈ [a, b] and split it into K equally sized steps of length ∆t
so that
tk = a + k∆t. for k = 0,1,...,K (4.2)
In this project we set the basic unit of time to be a day, and ∆t = 1 represents a time period of one day. Furthermore we set the start time a = 0, and so tk = k. Then let Ck denote the number of cars present in the car park at any time during the day k, where t ∈ [tk, tk+1). We may write
This function f(B) simply describes the fact that a car is regarded as being in a car park for a full day if it either arrives during that day, leaves during that day, or arrives on a previous day and leaves on a later day. We can also calculate the duration of stay for a booking as the number of days for which a single car is present in the car park
This is useful in our model as the price paid will be linked to the duration of stay.
Now let the price per day for a booking Bi be given by a pricing function p(Di) where D is
the duration of stay. It follows that the revenue generated in the kth day is
(4.6)
where Di is the (integer) duration in days and α and β are positive constants. The cost to customer i for their entire booking is therefore
Dip(Di) = βDi + α, (4.7)
4.1. THEORY–THECARPARKMODEL 40
which can be interpreted constant fee of α per booking, plus parking at a rate of β per day. We can calculate the total revenue for the car park as
XX
Vk =
4.1.1 Generating events from a Poisson Process
To generate the set of bookings we suppose that bookings are generated by a Poisson process, i.e. they are generated independently and continuously at some average rate λb bookings per day. For such a process, the time between subsequent bookings is given by an exponential distri bution. If un is an independent random draw from a uniform distribution (0 < un ≤ 1) then the function to generate the time of the next event tn of a Poisson process with intensity λb, given tn−1 the time of the last event is:
tn = tn−1 − λ1 log(un). (4.9) b
4.1.2 Generating a Booking
Given that we have generated the time of the next booking, we need to know when the customer
will arrive in the car park and how long they will stay. Note that if the arrival intensity is given by
λ, then the average time of the next event will be λ1 . So if we wish to have a process where the
average time between booking and arrival is say, 28 days, and the average time between arrival
and departure is 7 days, then we may use Poisson processes with intensity λa = 1 and λd = 1 28 7
for the arrival and departure times respectively.
Put simply, given three random draws from a uniform distribution and the time the previous
(4.10)
booking was made tib we may generate the next booking as:
(assume that the first booking occurs at t0b = 0). The sequence of booking times tb is generated by a Poisson process with rate λb. For each booking, the time between booking and arrival, and the time between arrival and departure, are each exponentially distributed with mean 1/λa and 1/λb respectively. Note that this booking model is not very realistic, but it is very easy to generate, and could easily be extended to allow the intensity parameters to become functions of time to capture weekly or seasonal effects.
4.1.3 Calculated expected values of car park measures
We can use the random bookings that we generate as input to our car park model in section 4.1, and in doing so evaluate various measures of our car park, such as the revenue, and occupancy (number of cars in the car park) for a given day. These values will depend on the parameters
4.1. THEORY–THECARPARKMODEL 41
intrinsic to the car park model (for example, the size of the car park, and the pricing function), but will also depend on the particular booking times chosen by our random booking process. To study the effect of car park parameters (size, pricing function etc.) on our measurements (of revenue, occupancy, etc.) we would like these measurements to be independent of the randomness in the booking times. To do this by taking the expected value, or average, of our measurement over all the possible random bookings. For very simple measurements it may be possible to calculate this expected value algebraically, for most measurements we will need to calculate these expected values numerically, using a socalled Monte Carlo technique.
When N is large this expected value E[Rk] is approximately normally distributed (from the central
limit theorem), and an estimate for its standard deviation is
This standard deviation decreases with increasing N , so by choosing sufficiently large N our ex pected value becomes relatively insensitive to the particular random booking times.
Note that we cannot take N arbitrarily large – it may take too long to simulate the car park model a very large number of times – so the choice of N is a compromise between program run time and accuracy.
4.1.4 Optimal Revenue Management
The explanation of Revenue Management in this project will be brief but we will lay out the ba sics relevant to our simplified model. This project focuses on the problem of capacity allocation. The problem arises when the same product (a space in a car park) is sold to different customers at different prices, so the question arises of how many bookings we should allow the lowprice customers to make when there is a possibility that highprice customers may arrive later on.
In our car park model, we assume that there exists a set of low price customers that take ad vantage of the discount for staying in the car park for a long period. (By discount we mean that, while the total price of a long stay is greater than that of a short stay (4.7), the price per day of a long stay is less than that of a short stay (4.6).) They tend to book into the car park online in advance, and are typically leisure passengers. In contrast, we assume that there exists a set of highprice customers who turn up and pay a high price to stay for only a day or two. They are typ ically business passengers, and do not book trips in advance, and therefore neither do they book their parking in advance. In this model we do not assume that different customers are subject to
4.2. EXERCISES 42
different prices, just that they will receive a discount the longer they stay in the car park. Price is assumed to be given as a function of duration of stay as in equation (4.6) with positive constants α and β.
At first we will decide whether to reject bookings by simply assigning a constant proportion of the car park to each set of customers. By varying this proportion we can then investigate when the maximum revenue is generated, and thus find the optimal proportion to assign to each set of customers.
4.2 Exercises
4.2.1 The Customers class
• CreateafunctionthatreturnsthetimeofthenexteventforaPoissonprocesswithintensity λ. You may use the internal C++ random number generator rand() to generate samples from a uniform distribution. Alternatively, look at the Mersenne Twister code provided for the Gillespie Algorithm project in this booklet. Make sure that you enforce the constraint un > 0 by excluding the case where rand() generates a value of 0.
• Enterthefollowingdatastructureforbookingsintoyourcode:
class Booking { public:
double bookingTime; double arrivalTime; double departureTime;
};
// compare for sort algorithms
bool operator<(const Booking &x,const Booking &y);
(Recall that a struct is simply a class with all members public by default.) Complete the im plementation for the “less than” operator < so that bookings may be ordered by bookingTime . You may also wish to overload stream outputs using operator<<.
• Nowcreateaclass(similartotheMVector)tostoreallofthebookingsforoneclassofcus tomer.
class Customers { private:
std::vector<Booking> vectorBookings; public:
// generate a set of bookings
void generateBookings(double bookingRate,double arrivalRate, double departureRate,double startTime,double finishTime);
};
4.2. EXERCISES 43
Complete the implementation for the generateBookings method. The first three argu ments should be the intensity parameters for the Poisson process. Using startTime as your initial time, keep generating bookings until you reach finishTime. See section 4.1.2 for information on how to generate bookings.

CompletetheclassbywritingmemberfunctionstoallowaccesstovectorBookingsand also to return the number of bookings (see MVector for how we did this with double as the data type.

Generate bookings using λb = 5, λa = 1/14, and λd = 1/7 with tb ∈ [0, 150] and write them to screen. These will be the lowprice leisure customers. (Note that although we re strict tb ≤ 150, we may have bookings where ta > 150 and/or td > 150).

Generate bookings using λb = 25, λa = 1, and λd = 2 with tb ∈ [0, 150] and compare them to the previous booking set. These will be the highprice business customers.

WriteamemberfunctionofCustomerstoaddbookingsfromanotherCustomersclass. The function signature should look like:
void addBookings(Customers &C);
Complete the implementation. Try adding the business and leisure customers together to form one single set of ordered bookings.
You can sort a vector using the std::sort function, in the <algorithm> include library. For a std::vector vectorBookings we writestd::sort(vectorBookings.begin(),vectorBookings.end());
which sorts the vector by booking time, so long as the less than operator has been over loaded correctly. Display your bookings to screen, and check that the sets have combined correctly.
4.2.2 The Car Park Class

Bywritingaclass,orotherwise,createstorageforthenumberofcarspresentinthecarpark in each period k and also the revenue they generate (you could combine both using your own data structure).

NowcreateafunctionCarsPresentthattakesasinput:
– alistofbookingsstoredinaCustomersclass – thestarttime
– thenumberofdaysand gives as output
– avectorcontainingthenumberofcarspresentinthecarparkforeachday.
4.2. EXERCISES 44
You will need to use equation (4.3) to calculate the cars present in the car park for each day from the booking times. (Is there a faster way to calculate Di than simply summing over f as in (4.4)?)

What is the expected number of cars in the car park at t = 150 using either of the sets of customers? Choose what you think is a sensible number of simulations N when calculating the expected value, and give your answer to an appropriate number of digits.

Write in a price function of the form shown in equation (4.6) with α = 10 and β = 5 and use it to calculate the revenues generated at each day, and also the total revenue.

Now modify your car park class to include a capacity (number of spaces in the car park). Write a function that takes a booking and checks if space is available in the car park for the entire duration of that booking. Use this in your CarsPresent function to reject bookings if there is no space.

Using a combined set of both customer types, what is the expected time that a car park with 50 spaces will first reach capacity? As before, think about how many simulations N you should run when calculating the expectation.
4.2.3 Revenue management with a fixed allocation strategy

Forthecombinedsetofcustomers,calculatethetotalexpectedrevenueforcarparkswith capacity 10, 20, 30, . . . , 100.

Nowcreatetwodifferentcarparksforeachsetofcustomerssothatthesumofthecapacity of each car park is equal to capacity of the combined car park. Investigate what is the opti mal proportion of the car park to reserve for each set of customers. For example, you could create a plot of the optimal proportion as a function of total car park size. Optionally, can you estimate the error in your calculation of the optimal proportion?

Youmaywishtoextendthisideabygeneratingacontourplotoftherevenueasafunction of both the car park size and the proportion allocated to business/leisure customers.

Howdoestheoptimalproportionofthecarparktoallocatevarywiththebookingratesλb for each type of customer?
4.2.4 Booking rejection strategies
So far we have used a very simple strategy for rejecting bookings, by allocating a fixed proportion of the car park to each type of customer. Alternatively, we could generate a rejection rule based on the number of spaces left in the car park, the price per period of the booking, and the time remaining the period in question. We should reject a booking only if the total revenue generated from the booking is less than the expected revenue from future bookings that the car will displace over all periods it is present. For example, imagine that it is Friday and we have one space left in the car park on Monday and one space left for Tuesday. A booking is requested on the system
4.3. REPORT 45
that will stay both days, and if we generate our price from (4.6) with α = 10 and β = 5 the the booking will pay a rate of £10 per day. Now if we know that the probability that a someone will turn up and stay for one day on Monday and Tuesday and pay £15 per period is 0.8, then we can say that the net contribution from the booking is (10 − 15 × 0.8) + (10 − 15 × 0.8) = −£4. This means we should reject the booking and take our chances that the highprice customers will arrive. Alternatively if the car park had 100 spaces left, then the car park is unlikely to fill, the value of the spaces displaced is low (zero if there is no chance that the car park will fill up) and we should accept the booking. How we should calculate the notional value of the space that is displaced is left for you to discover.
Investigate booking rejection strategies of this type. The most interesting cases are likely to be found for car parks that are large enough that they will not usually be filled by the highest paying customers alone, but small enough that some customers cannot be accommodated. Can you design a method that improves on the optimal revenue generated when a fixed proportion of the car park is allocated to each type of customer?