
# MATH70099 Big Data Statistical Scalability with PySpark - Assessed Coursework 2: Borel Distribution and Quadratic Discriminant Analysis


# MATH70099 – Big Data: Statistical Scalability with PySpark

Assessed Coursework 2

Due Monday 22nd May 2023, 11.59pm BST. Hand in a report in PDF format, not longer than 10 pages. Write your CID only; do not include your name anywhere in the submission. This coursework counts 20% towards the final grade for MATH70099. When required by the question, present commented code as part of the report. The code will be checked for execution quality, so please ensure that the code is self-contained and executable.

This assessment is marked out of 30 (15 marks for Q1, 12 marks for Q2, and 3 marks allocated to presentation).

## Question 1

Consider a sequence of IID discrete random variables $X_1, X_2, \dots$ such that $X_i \sim \mathrm{Borel}(\theta)$, denoting a Borel distribution with parameter $\theta \in (0,1)$. The distribution has the following PMF:

$$P(X = x \mid \theta) = \frac{\exp(-\theta x)\,(\theta x)^{x-1}}{x!}, \qquad x \in \{1, 2, \dots\}. \tag{1}$$
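As a quick numerical sanity check of the PMF (1), the sketch below evaluates it on the log scale and sums it over a truncated support. The parameter value `theta = 0.4` and the truncation point are illustrative assumptions, not part of the coursework; the last comparison uses the mean $1/(1-\theta)$ quoted in the hint to part (c).

```python
import math


def borel_pmf(x: int, theta: float) -> float:
    """PMF (1): exp(-theta*x) * (theta*x)**(x-1) / x!, evaluated on the log scale."""
    if x < 1:
        return 0.0
    log_p = -theta * x + (x - 1) * math.log(theta * x) - math.lgamma(x + 1)
    return math.exp(log_p)


theta = 0.4              # illustrative value in (0, 1), not taken from the coursework
support = range(1, 500)  # truncation of the infinite support {1, 2, ...}

total = sum(borel_pmf(x, theta) for x in support)     # should be close to 1
mean = sum(x * borel_pmf(x, theta) for x in support)  # should be close to 1 / (1 - theta)

print(total, mean, 1 / (1 - theta))
```

Working with `math.lgamma` avoids overflow in `x!` for large `x`; the truncation error is negligible here because the terms decay geometrically.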

a. Prove that the distribution belongs to the exponential family, and clearly state $h(\cdot)$, $g(\cdot)$, $B(\cdot)$ and $A(\cdot)$, where the notation for these functions is the same as presented in the lecture notes.

b. From the exponential family representation of the distribution (1), deduce the form of the realisation of a minimal and complete sufficient statistic for the parameter $\theta$, given observations $x_1, x_2, \dots, x_n \in (0,1)$.

c. Consider the estimator

$$f(X_1, \dots, X_n) = 1 - n\left(\sum_{i=1}^{n} X_i\right)^{-1}. \tag{2}$$

Is $f(X_1, \dots, X_n)$ a UMVUE for $\theta$? [Hint: if $X_i \sim \mathrm{Borel}(\theta)$, then $E(X_i) = 1/(1-\theta)$.]

d. If $X_i \sim \mathrm{Borel}(\theta)$, then $E(1/X_i) = 1 - \theta/2$. Using this fact, derive an unbiased estimator of $\theta$. Does this estimator fit the statistical query model (SQM)? Why?

e. Consider the (contrived) dataset `supermarkets.csv`, where each line contains the number of minutes $x$ required to clear a so-called M/D/1 queue starting with one customer, in a supermarket $s$, on the day of the week $w$. The entries `x,s,w` are comma-separated. The dataset is available on HDFS on Athena at `/shared_data/supermarkets.csv`. Assume that each observation $x$ is generated from the distribution (1), where the value of $\theta$ depends on the supermarket $S_i$ and day of the week $W_i$: $X_i \mid S_i, W_i \sim \mathrm{Borel}(\theta_{W_i,S_i})$. This corresponds to assuming that each customer in the queue is served in one minute, and new customers join the queue according to a Poisson process with rate $\theta_{W_i,S_i}$.

Write a Hadoop job to calculate estimates of the parameter $\theta_{S_i,W_i}$ for each supermarket and day of the week, using the estimator (2). The Hadoop job should also include a combiner via the flag `-combiner`. Write the `mapper.py` and `reducer.py` files in such a way that the combiner is identical to the file `reducer.py`. In your answer, report 5 parameter estimates (one estimate per supermarket, each for a different day of the week), the content of the Python files `mapper.py` and `reducer.py` for the map and reduce phases, and the content of a shell file containing the code to execute the Hadoop job.

Total for Q1: [15 marks]
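The requirement in part (e) that the combiner be identical to the reducer can be illustrated locally. The sketch below is not a pair of Hadoop streaming files: it is a plain-Python simulation under the assumption that each key `(s, w)` carries a partial sum and a count, so that the reduce step emits values in the same format it consumes. The records, the function names and the two "splits" are all made up for illustration.

```python
from collections import defaultdict


def map_line(line):
    """Map one 'x,s,w' record to ((s, w), (x, 1)): a partial sum and a count."""
    x, s, w = line.strip().split(",")
    return (s, w), (int(x), 1)


def reduce_pairs(pairs):
    """Add up partial (sum, count) values per key; output format equals input format."""
    totals = defaultdict(lambda: [0, 0])
    for key, (partial_sum, partial_count) in pairs:
        totals[key][0] += partial_sum
        totals[key][1] += partial_count
    return [(key, (total, count)) for key, (total, count) in sorted(totals.items())]


def estimate(total, count):
    """Estimator (2): 1 - n / sum of the observations."""
    return 1.0 - count / total


# Made-up records in 'x,s,w' order, spread over two hypothetical input splits.
split_1 = ["3,A,Mon", "1,A,Mon", "2,B,Tue"]
split_2 = ["4,A,Mon", "1,B,Tue", "1,B,Tue"]

# The same function acts as combiner (per split) and then as reducer (on the merged output).
combined = reduce_pairs(map(map_line, split_1)) + reduce_pairs(map(map_line, split_2))
with_combiner = reduce_pairs(combined)

# Reducer applied directly to all mapper output, with no combiner.
without_combiner = reduce_pairs(map(map_line, split_1 + split_2))

estimates = {key: estimate(total, count) for key, (total, count) in with_combiner}
print(with_combiner == without_combiner, estimates)
```

Because addition of (sum, count) pairs is associative and commutative, applying the function once or twice gives the same totals. How the final estimate is produced from those totals in an actual streaming job is a design decision this sketch leaves open.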

## Question 2

Consider the datasets `breast_cancer_train.csv` and `breast_cancer_test.csv`, stored on HDFS on Athena in `/shared-data/`. Each row of the datasets contains four measurements $x_i \in \mathbb{R}^4_+$ (radius, texture, smoothness and compactness) about breast tumours, and the diagnosis $y_i \in \{M, B\}$ (malignant, 'M', or benign, 'B') given by a doctor. The objective of this question is to fit a quadratic discriminant analysis (QDA) model for classification of tumours. In QDA, it is assumed that $x_i \mid (y_i = M) \sim N(\mu_M, \Sigma_M)$ and $x_i \mid (y_i = B) \sim N(\mu_B, \Sigma_B)$, where $\mu_M, \mu_B \in \mathbb{R}^4$ are group-specific means, and $\Sigma_M, \Sigma_B \in \mathbb{R}^{4 \times 4}$ are group-specific covariance matrices. Under this assumption, an observation $x_i$ is predicted to have label $y_i = M$ if the difference of log-likelihoods under $y_i = M$ and $y_i = B$ is larger than a threshold $k$:

$$(x_i - \mu_B)^\top \Sigma_B^{-1} (x_i - \mu_B) + \log\{\det(\Sigma_B)\} - (x_i - \mu_M)^\top \Sigma_M^{-1} (x_i - \mu_M) - \log\{\det(\Sigma_M)\} > k.$$

The choice of the threshold $k$, using Bayes' theorem, is the following:

$$k = 2\{\log[P(y_i = B)] - \log[P(y_i = M)]\}.$$
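The decision rule and threshold above can be sketched in plain Python. This is a minimal illustration, assuming a Cholesky factorisation to obtain both the quadratic form and the log-determinant; the helper names and the two-dimensional parameter values are made up, although the functions work for any dimension, including the four measurements used here.

```python
import math


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low


def quad_form_plus_logdet(x, mu, sigma):
    """(x - mu)^T sigma^{-1} (x - mu) + log det(sigma), via the Cholesky factor."""
    low = cholesky(sigma)
    diff = [xi - mi for xi, mi in zip(x, mu)]
    z = []
    for i, row in enumerate(low):  # forward substitution: solve low z = diff
        z.append((diff[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(len(low)))
    return sum(v * v for v in z) + log_det


def predict(x, mu_m, sigma_m, mu_b, sigma_b, p_m, p_b):
    """Return 'M' if the left-hand side of the rule exceeds the threshold k, else 'B'."""
    k = 2.0 * (math.log(p_b) - math.log(p_m))
    lhs = quad_form_plus_logdet(x, mu_b, sigma_b) - quad_form_plus_logdet(x, mu_m, sigma_m)
    return "M" if lhs > k else "B"


# Made-up two-dimensional parameters, for illustration only.
mu_m, sigma_m = [3.0, 3.0], [[1.0, 0.2], [0.2, 1.0]]
mu_b, sigma_b = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]

label_near_m = predict([2.8, 3.1], mu_m, sigma_m, mu_b, sigma_b, 0.4, 0.6)
label_near_b = predict([0.1, -0.2], mu_m, sigma_m, mu_b, sigma_b, 0.4, 0.6)
print(label_near_m, label_near_b)
```

Solving a triangular system rather than inverting the covariance matrix is a common choice for numerical stability; any correct way of computing the inverse and determinant gives the same rule.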

Write two Hadoop jobs to fit the QDA classifier using the training set, predict the labels in the test set, and evaluate the performance of the classifier using sensitivity, specificity, precision and accuracy. The two jobs should perform the following tasks:

- In the first job, the model parameters $\mu_M$, $\mu_B$, $\Sigma_M$, $\Sigma_B$, $P(y_i = M)$ and $P(y_i = B)$ should be estimated from the training set `breast_cancer_train.csv` using the method of maximum likelihood. You can assume that the number of observations in the training set is $n = 413$, so you do not need to explicitly calculate this quantity.
- In the second job, the output of the first job should be used to predict the labels for the data points in the test set `breast_cancer_test.csv`. These labels should then be used to evaluate the performance of the classifier by comparing true and estimated labels, according to the performance metrics listed above.
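The maximum likelihood estimates in the first job depend on the data only through per-class counts, sums and sums of outer products, which is what makes them suitable for a map-reduce computation. The sketch below shows that calculation locally in plain Python; it is an assumption about one possible approach, the function names and the two-dimensional rows are made up, and the total count is computed from the toy data rather than fixed at 413.

```python
def accumulate(rows):
    """Per-label count, sum of x, and sum of outer products x x^T."""
    stats = {}
    for x, label in rows:
        d = len(x)
        if label not in stats:
            stats[label] = {"n": 0, "s1": [0.0] * d, "s2": [[0.0] * d for _ in range(d)]}
        acc = stats[label]
        acc["n"] += 1
        for i in range(d):
            acc["s1"][i] += x[i]
            for j in range(d):
                acc["s2"][i][j] += x[i] * x[j]
    return stats


def mle(stats):
    """Maximum likelihood estimates: class proportion, mean, covariance (divisor n)."""
    n_total = sum(acc["n"] for acc in stats.values())
    out = {}
    for label, acc in stats.items():
        n, d = acc["n"], len(acc["s1"])
        mu = [s / n for s in acc["s1"]]
        sigma = [[acc["s2"][i][j] / n - mu[i] * mu[j] for j in range(d)] for i in range(d)]
        out[label] = {"prior": n / n_total, "mean": mu, "cov": sigma}
    return out


# Made-up two-dimensional rows (features, label), for illustration only.
rows = [
    ([1.0, 2.0], "M"), ([3.0, 4.0], "M"),
    ([0.0, 1.0], "B"), ([2.0, 1.0], "B"), ([4.0, 1.0], "B"),
]
params = mle(accumulate(rows))
print(params)
```

The covariance here uses the divisor $n_c$ (the class count), as maximum likelihood requires, rather than $n_c - 1$. The one-pass formula $S_2/n_c - \hat\mu\hat\mu^\top$ can lose precision when the means are large relative to the spread.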

In your answer, report: (i) the estimated mean parameters $\hat\mu_M$ and $\hat\mu_B$, (ii) the estimated covariances $\hat\Sigma_M$ and $\hat\Sigma_B$, (iii) $\hat P(y_i = M)$ and $\hat P(y_i = B)$, (iv) the four performance metrics (sensitivity, specificity, precision and accuracy), (v) the content of the Python files `mapper.py` and `reducer.py` for the map and reduce phases for the two jobs, and (vi) the shell files containing the code to execute the two Hadoop jobs.
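For reference, the four performance metrics can be computed from the counts of a two-by-two confusion table. The sketch below assumes that malignant ('M') is treated as the positive class, which the brief does not state explicitly; the function name and the label vectors are made up for illustration.

```python
def classification_metrics(true_labels, predicted_labels, positive="M"):
    """Sensitivity, specificity, precision and accuracy from paired label lists.

    Assumes both classes occur and at least one positive prediction is made,
    so that none of the denominators is zero.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),    # positive predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Made-up labels, for illustration only.
truth = ["M", "M", "M", "B", "B", "B", "B", "B"]
predicted = ["M", "M", "B", "B", "B", "B", "M", "B"]
metrics = classification_metrics(truth, predicted)
print(metrics)
```

Since all four metrics are functions of the counts `tp`, `fp`, `tn` and `fn`, a distributed job only needs to aggregate those four totals.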
