MATH70099 – Big Data: Statistical Scalability with PySpark
Assessed Coursework 2
Due Monday 22nd May 2023, 11.59pm BST. Hand in a report in PDF format, not longer than 10 pages. Write your CID only; do not include your name anywhere in the submission. This coursework counts 20% towards the final grade for MATH70099. When required by the question, present commented code as part of the report. The code will be checked for execution quality, so please ensure that the code is self-contained and executable.
This assessment is marked out of 30 (15 marks for Q1, 12 marks for Q2, and 3 marks allocated to presentation).
Question 1
Consider a sequence of IID discrete random variables $X_1, X_2, \ldots$ such that $X_i \sim \mathrm{Borel}(q)$, denoting a Borel distribution with parameter $q \in (0, 1)$. The distribution has the following PMF:
\[ \mathbb{P}(X = x \mid q) = \frac{\exp(-qx)\,(qx)^{x-1}}{x!}, \qquad x \in \{1, 2, \ldots\}. \tag{1} \]
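As an aside (not part of the question), the Borel distribution is the law of the total progeny of a branching process with Poisson($q$) offspring started from one individual, which is exactly the number of customers served in the M/D/1 busy period of part e. below. A short simulation sketch, assuming NumPy is available, can be used to check the hint of part c., $\mathbb{E}(X_i) = 1/(1-q)$:

    import numpy as np

    rng = np.random.default_rng(0)

    def borel_sample(q: float) -> int:
        # Total progeny of a branching process with Poisson(q) offspring,
        # starting from one individual; this total is Borel(q)-distributed.
        total, current = 1, 1
        while current > 0:
            current = int(rng.poisson(q, size=current).sum())
            total += current
        return total

    q = 0.4
    xs = [borel_sample(q) for _ in range(100_000)]
    print(np.mean(xs), 1 / (1 - q))  # both close to 1.6667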
a. Prove that the distribution belongs to the exponential family, and clearly state $h(\cdot)$, $g(\cdot)$, $B(\cdot)$ and $A(\cdot)$, where the notation for these functions is the same as presented in the lecture notes.

b. From the exponential family representation of the distribution (1), deduce the form of the realisation of a minimal and complete sufficient statistic for the parameter $q \in (0, 1)$, given observations $x_1, x_2, \ldots, x_n$.

c. Consider the estimator
\[ f(X_1, \ldots, X_n) = 1 - n \Big( \sum_{i=1}^{n} X_i \Big)^{-1}. \tag{2} \]
Is $f(X_1, \ldots, X_n)$ a UMVUE for $q$? [Hint: if $X_i \sim \mathrm{Borel}(q)$, then $\mathbb{E}(X_i) = 1/(1-q)$.]

d. If $X_i \sim \mathrm{Borel}(q)$, then $\mathbb{E}(1/X_i) = 1 - q/2$. Using this fact, derive an unbiased estimator of $q$. Does this estimator fit the statistical query model (SQM)? Why?

e. Consider the (contrived) dataset supermarkets.csv, where each line contains the number of minutes $x$ required to clear a so-called M/D/1 queue starting with one customer, in a supermarket $s$, on the day of the week $w$. The entries $x$, $s$, $w$ are comma-separated. The dataset is available on HDFS on Athena at /shared_data/supermarkets.csv. Assume that each observation $x$ is generated from the distribution (1), where the value of $q$ depends on the supermarket $S_i$ and the day of the week $W_i$: $X_i \mid S_i, W_i \sim \mathrm{Borel}(q_{W_i, S_i})$. This corresponds to assuming that each customer in the queue is served in one minute, and new customers join the queue according to a Poisson process with rate $q_{W_i, S_i}$. Write a Hadoop job to calculate estimates of the parameter $q_{W_i, S_i}$ for each supermarket and day of the week, using the estimator (2). The Hadoop job should also include a combiner via the flag -combiner; write the mapper.py and reducer.py files in such a way that the combiner is identical to the file reducer.py (a minimal sketch of such a job is given below). In your answer, report: 5 parameter estimates (one estimate per supermarket, each for a different day of the week), the content of the Python files mapper.py and reducer.py for the map and reduce phases, and the content of a shell file containing the code to execute the Hadoop job.

Total for Q1: [15 marks]
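For illustration, here is a minimal sketch of a job of the shape requested in part e., not a model answer. It assumes the comma-separated layout $x, s, w$ described above with no header row; the streaming jar path and output directory are placeholders to be adapted to the cluster. Since, for a group $g$ with $n_g$ observations, the estimator (2) is $\hat{q}_g = 1 - n_g / \sum_i x_i$, it suffices to accumulate a count and a sum per (supermarket, day) key. The reducer ignores any trailing fields on its input, so its own output can be fed back into it, which is what makes it reusable unchanged as the combiner.

mapper.py:

    #!/usr/bin/env python3
    # mapper.py: emit a partial (count, sum) pair keyed by (supermarket, day).
    import sys

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        x, s, w = line.split(",")
        print(f"{s},{w}\t1\t{float(x)}")

reducer.py (the identical file serves as combiner and reducer):

    #!/usr/bin/env python3
    # reducer.py: aggregate (count, sum) pairs per key, then append the
    # estimate (2). Trailing fields are ignored on input, so applying this
    # script to its own output is harmless, i.e. it can act as the combiner.
    import sys

    def emit(key, count, total):
        print(f"{key}\t{count}\t{total}\t{1.0 - count / total}")

    current, count, total = None, 0, 0.0
    for line in sys.stdin:
        fields = line.strip().split("\t")
        key, c, s = fields[0], int(fields[1]), float(fields[2])
        if key != current:
            if current is not None:
                emit(current, count, total)
            current, count, total = key, 0, 0.0
        count += c
        total += s
    if current is not None:
        emit(current, count, total)

A shell file of the kind requested (the jar location is illustrative):

    #!/bin/bash
    hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -files mapper.py,reducer.py \
        -mapper "python3 mapper.py" \
        -combiner "python3 reducer.py" \
        -reducer "python3 reducer.py" \
        -input /shared_data/supermarkets.csv \
        -output supermarkets_estimates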
Question 2
Consider the datasets breast_cancer_train.csv and breast_cancer_test.csv, stored on HDFS on Athena in /shared-data/. Each row of the datasets contains four measurements $x_i \in \mathbb{R}_+^4$ (radius, texture, smoothness and compactness) about breast tumours, and the diagnosis $y_i \in \{M, B\}$ (malignant, 'M', or benign, 'B') given by a doctor. The objective of this question is to fit a quadratic discriminant analysis (QDA) model for classification of tumours. In QDA, it is assumed that $x_i \mid (y_i = M) \sim \mathcal{N}(\mu_M, \Sigma_M)$ and $x_i \mid (y_i = B) \sim \mathcal{N}(\mu_B, \Sigma_B)$, where $\mu_M, \mu_B \in \mathbb{R}^4$ are group-specific means, and $\Sigma_M, \Sigma_B \in \mathbb{R}^{4 \times 4}$ are group-specific covariance matrices. Under this assumption, an observation $x_i$ is predicted to have label $y_i = M$ if the difference of log-likelihoods under $y_i = M$ and $y_i = B$ is larger than a threshold $k$:
\[ (x_i - \mu_B)^\top \Sigma_B^{-1} (x_i - \mu_B) + \log\{\det(\Sigma_B)\} - (x_i - \mu_M)^\top \Sigma_M^{-1} (x_i - \mu_M) - \log\{\det(\Sigma_M)\} > k. \]
The choice of the threshold $k$, using Bayes' theorem, is the following:
\[ k = 2\{\log[\mathbb{P}(y_i = B)] - \log[\mathbb{P}(y_i = M)]\}. \]
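For concreteness, the decision rule and threshold can be written as a short function; this is an illustrative sketch in NumPy, with all parameters supplied by the caller rather than estimated from the data:

    import numpy as np

    def qda_predict(x, mu_M, Sigma_M, mu_B, Sigma_B, p_M, p_B):
        # Threshold from Bayes' theorem: k = 2{log P(y=B) - log P(y=M)}.
        k = 2 * (np.log(p_B) - np.log(p_M))
        # Difference of log-likelihood terms; predict 'M' when it exceeds k.
        dB, dM = x - mu_B, x - mu_M
        score = (dB @ np.linalg.solve(Sigma_B, dB) + np.log(np.linalg.det(Sigma_B))
                 - dM @ np.linalg.solve(Sigma_M, dM) - np.log(np.linalg.det(Sigma_M)))
        return "M" if score > k else "B"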
Write two Hadoop jobs to fit the QDA classifier using the training set, predict the labels in the test set, and evaluate the performance of the classifier using sensitivity, specificity, precision and accuracy. The two jobs should perform the following tasks:
• In the first job, the model parameters $\mu_M$, $\mu_B$, $\Sigma_M$, $\Sigma_B$, $\mathbb{P}(y_i = M)$ and $\mathbb{P}(y_i = B)$ should be estimated from the training set breast_cancer_train.csv using the method of maximum likelihood. You can assume that the number of observations in the training set is $n = 413$, so you do not need to explicitly calculate this quantity.

• In the second job, the output of the first job should be used to predict the labels for the data points in the test set breast_cancer_test.csv. These labels should then be used to evaluate the performance of the classifier by comparing true and estimated labels, according to the performance metrics listed above. (A minimal sketch of both jobs is given after the reporting requirements below.)
In your answer, report: (i) the estimated mean parameters $\hat{\mu}_M$ and $\hat{\mu}_B$; (ii) the estimated covariances $\hat{\Sigma}_M$ and $\hat{\Sigma}_B$; (iii) $\hat{\mathbb{P}}(y_i = M)$ and $\hat{\mathbb{P}}(y_i = B)$; (iv) the four performance metrics (sensitivity, specificity, precision and accuracy); (v) the content of the Python files mapper.py and reducer.py for the map and reduce phases of the two jobs; and (vi) the shell files containing the code to execute the two Hadoop jobs.
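A minimal sketch of the two jobs follows; it is not a model answer. The column layout (four features followed by the label), the availability of NumPy on the worker nodes, and the name qda_params.txt of the parameter file shipped to the second job are all assumptions. The first job accumulates, per class $c$, the count $n_c$, the sum $\sum_i x_i$ and the sum of outer products $\sum_i x_i x_i^\top$, from which the maximum likelihood estimates are $\hat{\mu}_c = \sum_i x_i / n_c$, $\hat{\Sigma}_c = \sum_i x_i x_i^\top / n_c - \hat{\mu}_c \hat{\mu}_c^\top$ and $\hat{\mathbb{P}}(y_i = c) = n_c / n$ with $n = 413$.

job1_mapper.py:

    #!/usr/bin/env python3
    # Job 1 mapper: re-key each training row by its class label.
    import sys

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        *xs, y = line.split(",")  # assumed layout: four features, then label
        print(y + "\t" + ",".join(xs))

job1_reducer.py:

    #!/usr/bin/env python3
    # Job 1 reducer: accumulate per-class sufficient statistics and print
    # one line per class: label, prior, mean, flattened covariance.
    import sys
    import numpy as np

    n = 413  # training set size, given in the question
    stats = {}
    for line in sys.stdin:
        y, xs = line.strip().split("\t")
        x = np.array([float(v) for v in xs.split(",")])
        c, s1, s2 = stats.get(y, (0, np.zeros(4), np.zeros((4, 4))))
        stats[y] = (c + 1, s1 + x, s2 + np.outer(x, x))
    for y, (c, s1, s2) in stats.items():
        mu = s1 / c
        cov = s2 / c - np.outer(mu, mu)  # maximum likelihood estimate
        print(y, c / n, ",".join(map(str, mu)),
              ",".join(map(str, cov.ravel())), sep="\t")

job2_mapper.py:

    #!/usr/bin/env python3
    # Job 2 mapper: load the job-1 output (shipped via -files under the
    # assumed name qda_params.txt), classify each test row, and emit the
    # confusion-matrix cell it falls into ('M' is the positive class).
    import sys
    import numpy as np

    params = {}
    with open("qda_params.txt") as f:
        for line in f:
            y, p, mu, cov = line.strip().split("\t")
            params[y] = (float(p),
                         np.array([float(v) for v in mu.split(",")]),
                         np.array([float(v) for v in cov.split(",")]).reshape(4, 4))
    (pM, muM, SM), (pB, muB, SB) = params["M"], params["B"]
    k = 2 * (np.log(pB) - np.log(pM))

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        *xs, y = line.split(",")
        x = np.array([float(v) for v in xs])
        dB, dM = x - muB, x - muM
        score = (dB @ np.linalg.solve(SB, dB) + np.log(np.linalg.det(SB))
                 - dM @ np.linalg.solve(SM, dM) - np.log(np.linalg.det(SM)))
        pred = "M" if score > k else "B"
        cell = ("TP" if y == "M" else "FP") if pred == "M" else ("FN" if y == "M" else "TN")
        print(f"{cell}\t1")

job2_reducer.py:

    #!/usr/bin/env python3
    # Job 2 reducer: total the confusion-matrix cells and print the metrics.
    import sys
    from collections import Counter

    counts = Counter()
    for line in sys.stdin:
        cell, c = line.strip().split("\t")
        counts[cell] += int(c)
    tp, fp = counts["TP"], counts["FP"]
    tn, fn = counts["TN"], counts["FN"]
    print("sensitivity", tp / (tp + fn), sep="\t")
    print("specificity", tn / (tn + fp), sep="\t")
    print("precision", tp / (tp + fp), sep="\t")
    print("accuracy", (tp + tn) / (tp + fp + tn + fn), sep="\t")

The shell files would mirror the one sketched for Question 1; the second job should additionally ship the parameter file (e.g. retrieved from HDFS with hdfs dfs -getmerge) via -files, and use a single reducer (-numReduceTasks 1) so that all four counts meet in one place.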