CS 615 - Deep Learning
Assignment 4 - Exploring Hyperparameters
Spring 2024

Introduction
In this assignment we will explore the effect of different hyperparameter choices and apply a multi-class classifier to a dataset.
Programming Language/Environment
As per the syllabus, we are working in Python 3.x, and you must constrain yourself to the numpy, matplotlib, pillow, and opencv-python add-on libraries.
Allowable Libraries/Functions
In addition, you cannot use any ML functions to do the training or evaluation for you. Using basic statistical and linear algebra functions like mean, std, cov, etc. is fine, but using ones like train, confusion, etc. is not. Using any ML-related functions may result in a zero for the programming component. In general, use the "spirit of the assignment" (where we're implementing things from scratch) as your guide, but if you want clarification on whether you can use a particular function, DM the professor on Discord.
Grading

Table 1: Grading Rubric

Part 1 (Theory)                                  20pts
Part 2 (Visualizing an Objective Function)       10pts
Part 3 (Exploring Model Initialization Effects)  20pts
Part 4 (Exploring Learning Rate Effects)         20pts
Part 5 (Adaptive Learning Rate)                  20pts
Part 6 (Multi-class Classification)              10pts
Datasets
MNIST Database The MNIST Database is a dataset of hand-written digits from 0 to 9. The original dataset contains 60,000 training samples, and 10,000 testing samples, each of which is a 28 × 28 image.
To keep processing time reasonable, we have extracted 100 observations of each class from the training dataset, and 10 observations of each class from the validation/testing set, to create a new dataset in the files mnist_train_100.csv and mnist_valid_10.csv, respectively.
The files are arranged so that each row pertains to an observation, and in each row the first column is the target class ∈ {0, ..., 9}. The remaining 784 columns are the features of that observation, in this case the pixel values.
For more information about the original dataset, you can visit: http://yann.lecun.com/exdb/mnist/
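For example, the training file can be read with numpy along these lines (a minimal sketch; we assume the CSV files have no header row):

    import numpy as np

    # Each row: [label, pixel_0, ..., pixel_783]
    data = np.loadtxt("mnist_train_100.csv", delimiter=",")
    y = data[:, 0].astype(int)   # first column: target class in {0, ..., 9}
    X = data[:, 1:]              # remaining 784 columns: pixel values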
1 Theory
Whenever possible, please leave your answers as fractions so the question of rounding and loss of precision therein does not come up.
1. What would the one-hot encoding be for the following set of multi-class labels (5pts)?

2. Given inputs

       X = [  1   2 ]
           [ -1   0 ]
           [ -4  -4 ]

   and the fully connected layer having weights

       W = [ 4  -1  -3 ]
           [ ...       ]

   and biases b = [1  0  2], what is the output of the following architecture? Show intermediate computations. For simplicity, do not z-score your inputs. (5pts)

   Input → Fully Connected → Softmax

3. Using the same setup as the previous question, what are the gradients used to update the fully connected layer's weights (both W and b) if we're using a cross-entropy objective function and we have three (3) total classes, with the observations' targets being

       Y = [ 0 ]
           [ 1 ]
           [ ... ]

   Make sure to show the intermediate gradients being passed backwards to make these computations. (5pts)

4. Given the objective function J = (1/4)(x1 w1)^4 - (4/3)(x1 w1)^3 + (3/2)(x1 w1)^2 (I know you already did this in HW3, but it will be relevant for HW4 as well):

   (a) What is the gradient ∂J/∂w1? (1pt)

   (b) What are the locations of the extrema points for your objective function if x1 = 1? Recall that to find these you set the derivative to zero and solve for, in this case, w1. (3pts)

   (c) What does J evaluate to at each of your extrema points, again when x1 = 1? (1pt)
2 Visualizing an Objective Function

For the next few parts we'll use the objective function J = (1/4)(x1 w1)^4 - (4/3)(x1 w1)^3 + (3/2)(x1 w1)^2 from the theory section. First let's get a look at this objective function. Using x1 = 1, plot w1 vs. J, varying w1 from -2 to +5 in increments of 0.1. You will put this figure in your report.
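A minimal sketch of this plot using numpy and matplotlib (the helper name objective is our own choice):

    import numpy as np
    import matplotlib.pyplot as plt

    def objective(w1, x1=1.0):
        # J = (1/4)(x1 w1)^4 - (4/3)(x1 w1)^3 + (3/2)(x1 w1)^2
        u = x1 * w1
        return 0.25 * u**4 - (4.0 / 3.0) * u**3 + 1.5 * u**2

    w = np.arange(-2.0, 5.0 + 0.1, 0.1)   # w1 from -2 to +5 in increments of 0.1
    plt.plot(w, objective(w))
    plt.xlabel("w1")
    plt.ylabel("J")
    plt.title("Objective function with x1 = 1")
    plt.show()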
3 Exploring Model Initialization Effects
Let's explore the effects of choosing different initializations for our parameter(s). In the theory part you derived the partial of J = (1/4)(x1 w1)^4 - (4/3)(x1 w1)^3 + (3/2)(x1 w1)^2 with respect to the parameter w1. Now you will run gradient descent on this for four different initial values of w1 to see the effect of weight initialization and local solutions.
Perform gradient descent as follows:

- Run through 100 epochs.
- Use a learning rate of η = 0.1.
- Evaluate J at each epoch so we can see how/if it converges.
- Assume our only data point is x1 = 1.

Do this for the initialization choices:

- w1 = -1
- w1 = 0.2
- w1 = 0.9
- w1 = 4
In your report provide the four plots of epoch vs. J, superimposing on each plot the final values of w1 and J once 100 epochs have been reached. In addition, based on your visualization of the objective function in Section 2, describe why you think w1 converged to its final place in each case.
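Each experiment might look roughly like the following sketch (reusing the objective helper from the Section 2 sketch; the gradient helper implements the derivative you work out in the theory section):

    import matplotlib.pyplot as plt

    def gradient(w1, x1=1.0):
        # dJ/dw1 by the chain rule, with u = x1 * w1
        u = x1 * w1
        return x1 * (u**3 - 4.0 * u**2 + 3.0 * u)

    eta = 0.1
    for w1_init in (-1.0, 0.2, 0.9, 4.0):
        w1 = w1_init
        J_history = []
        for epoch in range(100):
            J_history.append(objective(w1))
            w1 = w1 - eta * gradient(w1)   # vanilla gradient descent update
        plt.figure()
        plt.plot(J_history)
        plt.xlabel("epoch")
        plt.ylabel("J")
        plt.title(f"w1 init = {w1_init}: final w1 = {w1:.4f}, final J = {objective(w1):.4f}")
    plt.show()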
4 Exploring Learning Rate Effects

Next we're going to look at how your choice of learning rate can affect things. We'll use the same objective function as the previous sections, namely J = (1/4)(x1 w1)^4 - (4/3)(x1 w1)^3 + (3/2)(x1 w1)^2.

For each experiment, initialize w1 = 0.2, use x1 = 1 as your only data point, and once again run each experiment for 100 epochs.

The learning rates for the experiments are:

- η = 0.001
- η = 0.01
- η = 1.0
- η = 5.0
And once again, create plots of epoch vs J for each experiment and superimpose the final values of w1 and J.
NOTE: Due to the potential of overflow, you likely will want to have the evaluation of your J function in a try/except block where you break out of the gradient descent loop if an exception happens.
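For instance, the loop from the previous section could be guarded like this (a sketch reusing the objective and gradient helpers from earlier; with plain Python floats a diverging w1 raises OverflowError, and the isfinite check additionally catches inf/nan values if you compute with numpy floats instead):

    import math

    for eta in (0.001, 0.01, 1.0, 5.0):
        w1 = 0.2
        J_history = []
        for epoch in range(100):
            try:
                J = objective(w1)
                step = eta * gradient(w1)
            except OverflowError:
                break                # J (or the gradient) blew up; stop this run
            if not math.isfinite(J):
                break                # same idea for inf/nan under numpy arithmetic
            J_history.append(J)
            w1 = w1 - step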
5 Adaptive Learning Rate
Finally, let's look at using an adaptive learning rate, à la the Adam algorithm.

For this part of your homework assignment we'll once again look to learn the w1 that minimizes J = (1/4)(x1 w1)^4 - (4/3)(x1 w1)^3 + (3/2)(x1 w1)^2 given the data point x1 = 1. Run gradient descent with ADAM adaptive learning on this objective function for 100 epochs and produce a graph of epoch vs. J. Ultimately, you are implementing ADAM from scratch here.
Your hyperparameter initializations are:

- w1 = 0.2
- η = 5
- ρ1 = 0.9
- ρ2 = 0.999
- δ = 10^-8
In your report provide a plot of epoch vs J.
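A from-scratch sketch of the Adam update for this single parameter, using one common formulation (first- and second-moment accumulators with bias correction; check it against the course's formulation, since some versions place δ slightly differently) and reusing the objective and gradient helpers from the earlier sketches:

    import math
    import matplotlib.pyplot as plt

    eta, rho1, rho2, delta = 5.0, 0.9, 0.999, 1e-8
    w1 = 0.2
    s = 0.0   # first-moment (mean of gradients) accumulator
    r = 0.0   # second-moment (mean of squared gradients) accumulator
    J_history = []
    for t in range(1, 101):
        J_history.append(objective(w1))
        g = gradient(w1)
        s = rho1 * s + (1.0 - rho1) * g
        r = rho2 * r + (1.0 - rho2) * g * g
        s_hat = s / (1.0 - rho1**t)   # bias-corrected first moment
        r_hat = r / (1.0 - rho2**t)   # bias-corrected second moment
        w1 = w1 - eta * s_hat / (math.sqrt(r_hat) + delta)

    plt.plot(J_history)
    plt.xlabel("epoch")
    plt.ylabel("J")
    plt.show()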
6 Multi-Class Classification

Finally, in preparation for our next assignment, let's do multi-class classification. For this we'll use the architecture:
Input → Fully Connected → Softmax → Output w/ Cross–Entropy Objective Function
Download the MNIST dataset from BBlearn and read in the training data. Train your system using the training data, keeping track of the value of your objective function with regards to the training set as you go. In addition, compute the cross-entropy loss for the validation set as well, to watch for overfitting.
Implementation Details

Here are some additional implementation details/specifications:

- Make sure to remember to one-hot-encode your targets!
- Use Xavier Initialization to initialize your weights and biases.
- Use ADAM learning.
- Run your iterations until near-convergence appears (things are mostly flattening out).
- You can decide on your own about things like hyperparameters, batch sizes, z-scoring, etc. Just report those design decisions in your report and state why you made them.
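To make these pieces concrete, here is a rough sketch of the core building blocks under some assumptions of ours (full-batch gradients, a Glorot-uniform flavor of Xavier initialization; all names are our own, and your layer design from earlier assignments takes precedence):

    import numpy as np

    def one_hot(y, num_classes):
        # (N,) integer labels -> (N, num_classes) one-hot matrix
        Y = np.zeros((y.size, num_classes))
        Y[np.arange(y.size), y] = 1.0
        return Y

    # Xavier initialization: uniform in +/- sqrt(6 / (fan_in + fan_out))
    rng = np.random.default_rng(0)
    fan_in, fan_out = 784, 10
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    W = rng.uniform(-bound, bound, size=(fan_in, fan_out))
    b = np.zeros((1, fan_out))

    def softmax(Z):
        Z = Z - Z.max(axis=1, keepdims=True)   # subtract row max for stability
        E = np.exp(Z)
        return E / E.sum(axis=1, keepdims=True)

    def cross_entropy(Yhat, Y, eps=1e-12):
        return -np.mean(np.sum(Y * np.log(Yhat + eps), axis=1))

    def forward_backward(X, Y, W, b):
        # With softmax feeding cross-entropy, the gradient at the fully
        # connected layer's output simplifies to (Yhat - Y) / N
        Yhat = softmax(X @ W + b)
        dZ = (Yhat - Y) / X.shape[0]
        return cross_entropy(Yhat, Y), X.T @ dZ, dZ.sum(axis=0, keepdims=True)

These can then be dropped into the Adam loop from Section 5 (applied elementwise to W and b), with accuracy computed as np.mean(np.argmax(Yhat, axis=1) == y).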
In your final report provide:

- A graph of epoch vs. J for the training data and the validation data. Both plots should be on the same graph, with a legend indicating which is which.
- Your final training and validation accuracy. Make sure to predict the enumerated class using the argmax of the output, and compare it against the original target enumerated class.
- Your hyperparameter design decisions and why you made them.
Submission
For your submission, upload to Blackboard a single zip file containing:
1. PDF Writeup
2. Source Code
3. readme.txt file
The readme.txt file should contain information on how to run your code to reproduce results for each part of the assignment.
The PDF document should contain the following:

1. Part 1:
   (a) Your solutions to the theory questions.

2. Part 2:
   (a) Your plot.

3. Part 3:
   (a) Your four plots of epoch vs. J with the terminal values of w1 and J superimposed on each.
   (b) A description of why you think w1 converged to its final place in each case, justified by the visualization of the objective function.

4. Part 4:
   (a) Your four plots of epoch vs. J with the terminal values of w1 and J superimposed on each.

5. Part 5:
   (a) Your plot of epoch vs. J.

6. Part 6:
   (a) A graph of epoch vs. J for the training data and the validation data. Both plots should be on the same graph with legends indicating which is which.
   (b) Your final training and validation accuracy. Make sure to predict the enumerated class using the argmax of the output, and compare it against the original target enumerated class.
   (c) Any additional design/hyperparameter decisions, and why.