MATH70099 Big Data Statistical Scalability with PySpark - Assessed Coursework 2: Borel Distribution and Quadratic Discriminant Analysis

MATH70099 – Big Data: Statistical Scalability with PySpark

Assessed Coursework 2

Due Monday 22nd May 2023, 11.59pm BST. Hand in a report in PDF format, no longer than 10 pages. Write your CID only; do not include your name anywhere in the submission. This coursework counts 20% towards the final grade for MATH70099. When required by the question, present commented code as part of the report. The code will be checked for execution quality, so please ensure that the code is self-contained and executable.

This assessment is marked out of 30 (15 marks for Q1, 12 marks for Q2, and 3 marks allocated to presentation).

Question 1

Consider a sequence of IID discrete random variables $X_1, X_2, \dots$ such that $X_i \sim \mathrm{Borel}(q)$, denoting a Borel distribution with parameter $q \in (0,1)$. The distribution has the following PMF:

$$P(X = x \mid q) = \frac{\exp(-qx)\,(qx)^{x-1}}{x!}, \qquad x \in \{1, 2, \dots\}. \tag{1}$$
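As a quick numerical sanity check of the PMF (1), the sketch below evaluates it in log-space and sums it over a truncated support. The function names, the value `q = 0.5`, and the truncation point 500 are illustrative assumptions, not part of the coursework.

```python
import math

def borel_log_pmf(x, q):
    """Log of the PMF (1): -q*x + (x-1)*log(q*x) - log(x!)."""
    return -q * x + (x - 1) * math.log(q * x) - math.lgamma(x + 1)

def borel_pmf(x, q):
    """PMF (1), computed via the log form to avoid overflow in x!."""
    return math.exp(borel_log_pmf(x, q))

q = 0.5                 # illustrative parameter value in (0, 1)
xs = range(1, 501)      # truncated support; the tail beyond is negligible here

total = sum(borel_pmf(x, q) for x in xs)      # should be close to 1
mean = sum(x * borel_pmf(x, q) for x in xs)   # should be close to 1 / (1 - q)
print(total, mean)
```

Working in log-space matters because `x!` overflows a float for moderately large `x`, while `math.lgamma(x + 1)` stays well-behaved.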

a. Prove that the distribution belongs to the exponential family, and clearly state $h(\cdot)$, $g(\cdot)$, $B(\cdot)$ and $A(\cdot)$, where the notation for these functions is the same as presented in the lecture notes.

b. From the exponential family representation of the distribution (1), deduce the form of the realisation of a minimal and complete sufficient statistic for the parameter $q$, given observations $x_1, x_2, \dots, x_n \in (0,1)$.

c. Consider the estimator

$$f(X_1, \dots, X_n) = 1 - n \left( \sum_{i=1}^{n} X_i \right)^{-1}. \tag{2}$$

Is $f(X_1, \dots, X_n)$ a UMVUE for $q$? [Hint: if $X_i \sim \mathrm{Borel}(q)$, then $E(X_i) = 1/(1-q)$.]

d. If $X_i \sim \mathrm{Borel}(q)$, then $E(1/X_i) = 1 - q/2$. Using this fact, derive an unbiased estimator of $q$. Does this estimator fit the statistical query model (SQM)? Why?

e. Consider the (contrived) dataset supermarkets.csv, where each line contains the number of minutes $x$ required to clear a so-called M/D/1 queue starting with one customer, in a supermarket $s$, on the day of the week $w$. The entries x,s,w are comma-separated. The dataset is available on HDFS on Athena at /shared_data/supermarkets.csv. Assume that each observation $x$ is generated from the distribution (1), where the value of $q$ depends on the supermarket $S_i$ and day of the week $W_i$: $X_i \mid S_i, W_i \sim \mathrm{Borel}(q_{W_i,S_i})$. This corresponds to assuming that each customer in the queue is served in one minute, and new customers join the queue according to a Poisson process with rate $q_{W_i,S_i}$. Write a Hadoop job to calculate estimates of the parameter $q_{W_i,S_i}$ for each supermarket and day of the week, using the estimator (2). The Hadoop job should also include a combiner via the flag -combiner. Write the mapper.py and reducer.py files in such a way that the combiner is identical to the file reducer.py. In your answer, report 5 parameter estimates (one estimate per supermarket, each for a different day of the week), the content of the Python files mapper.py and reducer.py for the map and reduce phases, and the content of a shell file containing the code to execute the Hadoop job.

Total for Q1: [15 marks]
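One way to make a reducer reusable as a combiner, as part e requires, is to have it read and write the same record layout. The sketch below is a local simulation under stated assumptions, not a reference solution: the functions `map_lines` and `reduce_lines`, the record layout `key<TAB>sum<TAB>count`, and the toy input lines are all hypothetical. The reducer appends the estimate (2) as a fourth field and ignores any fourth field on input, so its output can be fed back into itself.

```python
from itertools import groupby

def map_lines(lines):
    """Mapper sketch: 'x,s,w' -> 's,w<TAB>x<TAB>1' (key, partial sum, partial count)."""
    for line in lines:
        x, s, w = line.strip().split(",")
        yield f"{s},{w}\t{int(x)}\t1"

def reduce_lines(lines):
    """Reducer/combiner sketch: add up partial sums and counts per key.

    Input must be sorted by key, as Hadoop guarantees for the reduce phase.
    Only the first three fields are read, so the extra estimate field that
    this function emits is ignored when its own output is reduced again.
    """
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in groupby(parsed, key=lambda fields: fields[0]):
        total, n = 0, 0
        for fields in group:
            total += int(fields[1])
            n += int(fields[2])
        # Estimator (2): 1 - n / sum(x); every x >= 1, so total > 0.
        yield f"{key}\t{total}\t{n}\t{1 - n / total}"

# Toy records (hypothetical supermarket and day labels).
records = ["3,A,Mon", "1,A,Mon", "2,B,Tue", "2,A,Mon"]

# Job without a combiner: map, sort, reduce.
direct = list(reduce_lines(sorted(map_lines(records))))

# Job with a combiner: each input split is combined before the shuffle.
split_1 = list(reduce_lines(sorted(map_lines(records[:2]))))
split_2 = list(reduce_lines(sorted(map_lines(records[2:]))))
with_combiner = list(reduce_lines(sorted(split_1 + split_2)))

print(direct)
print(with_combiner)
```

The design choice here is that the sum and count are additive across splits, whereas the estimate itself is not, so the estimate is always recomputed from the merged sum and count rather than averaged.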

Question 2

Consider the datasets breast_cancer_train.csv and breast_cancer_test.csv, stored on HDFS on Athena in /shared-data/. Each row of the datasets contains four measurements $x_i \in \mathbb{R}^4_+$ (radius, texture, smoothness and compactness) about breast tumours, and the diagnosis $y_i \in \{M, B\}$ (malignant, 'M', or benign, 'B') given by a doctor. The objective of this question is to fit a quadratic discriminant analysis (QDA) model for classification of tumours. In QDA, it is assumed that $x_i \mid (y_i = M) \sim N(m_M, S_M)$ and $x_i \mid (y_i = B) \sim N(m_B, S_B)$, where $m_M, m_B \in \mathbb{R}^4$ are group-specific means, and $S_M, S_B \in \mathbb{R}^{4 \times 4}$ are group-specific covariance matrices. Under this assumption, an observation $x_i$ is predicted to have label $y_i = M$ if the difference of log-likelihoods under $y_i = M$ and $y_i = B$ is larger than a threshold $k$:

$$(x_i - m_B)^\top S_B^{-1} (x_i - m_B) + \log\{\det(S_B)\} - (x_i - m_M)^\top S_M^{-1} (x_i - m_M) - \log\{\det(S_M)\} > k.$$

The choice of the threshold $k$, using Bayes' theorem, is the following:

$$k = 2\{\log[P(y_i = B)] - \log[P(y_i = M)]\}.$$
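The decision rule and threshold above can be sketched in plain Python. This is an illustrative sketch only: the helper names (`cholesky`, `quad_and_logdet`, `predict`) and the two-dimensional toy parameters are assumptions, and a Cholesky factorisation is used here as one convenient way to obtain both the quadratic form and the log-determinant of a symmetric positive-definite covariance matrix.

```python
import math

def cholesky(S):
    """Lower-triangular L with L L^T = S, for symmetric positive-definite S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]
    return L

def quad_and_logdet(x, m, S):
    """Return ((x-m)^T S^{-1} (x-m), log det(S)) using the Cholesky factor."""
    L = cholesky(S)
    d = [xi - mi for xi, mi in zip(x, m)]
    z = []
    for i in range(len(d)):  # forward substitution: solve L z = d
        z.append((d[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    quad = sum(v * v for v in z)                          # z^T z = d^T S^{-1} d
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(len(d)))
    return quad, logdet

def predict(x, m_M, S_M, m_B, S_B, p_M, p_B):
    """Predict 'M' when the left-hand side of the rule exceeds k, else 'B'."""
    quad_B, logdet_B = quad_and_logdet(x, m_B, S_B)
    quad_M, logdet_M = quad_and_logdet(x, m_M, S_M)
    k = 2.0 * (math.log(p_B) - math.log(p_M))
    return "M" if quad_B + logdet_B - quad_M - logdet_M > k else "B"

# Toy two-dimensional parameters (hypothetical, for illustration only).
identity = [[1.0, 0.0], [0.0, 1.0]]
print(predict([2.0, 2.0], [2.0, 2.0], identity, [0.0, 0.0], identity, 0.5, 0.5))
print(predict([0.0, 0.0], [2.0, 2.0], identity, [0.0, 0.0], identity, 0.5, 0.5))
```

With equal priors the threshold is $k = 0$, so the rule reduces to comparing the two Gaussian log-likelihoods directly.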

Write two Hadoop jobs to fit the QDA classifier using the training set, predict the labels in the test set, and evaluate the performance of the classifier using sensitivity, specificity, precision and accuracy. The two jobs should perform the following tasks:

• In the first job, the model parameters $m_M$, $m_B$, $S_M$, $S_B$, $P(y_i = M)$ and $P(y_i = B)$ should be estimated from the training set breast_cancer_train.csv using the method of maximum likelihood. You can assume that the number of observations in the training set is $n = 413$, so you do not need to explicitly calculate this quantity.
• In the second job, the output of the first job should be used to predict the labels for the data points in the test set breast_cancer_test.csv. These labels should then be used to evaluate the performance of the classifier by comparing true and estimated labels, according to the performance metrics listed above.
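For the first job, the Gaussian maximum likelihood estimates depend on the data only through per-class counts, sums, and sums of outer products, all of which are additive and can therefore be accumulated in pieces. The sketch below illustrates that idea on toy data; the function names, the dictionary layout, and the toy points are assumptions for illustration and do not reflect the actual file format of the training set.

```python
def init_stats(d):
    """Empty sufficient statistics for one class in dimension d."""
    return {"n": 0, "s": [0.0] * d, "ss": [[0.0] * d for _ in range(d)]}

def add_point(stats, x):
    """Accumulate the count, the sum of x, and the sum of x x^T."""
    stats["n"] += 1
    for i, xi in enumerate(x):
        stats["s"][i] += xi
        for j, xj in enumerate(x):
            stats["ss"][i][j] += xi * xj

def merge(a, b):
    """Combine two partial statistics (e.g. from different input splits)."""
    d = len(a["s"])
    return {
        "n": a["n"] + b["n"],
        "s": [a["s"][i] + b["s"][i] for i in range(d)],
        "ss": [[a["ss"][i][j] + b["ss"][i][j] for j in range(d)] for i in range(d)],
    }

def mle(stats, n_total):
    """Maximum likelihood mean, covariance (divisor n_c), and class prior."""
    n_c, d = stats["n"], len(stats["s"])
    mean = [v / n_c for v in stats["s"]]
    cov = [[stats["ss"][i][j] / n_c - mean[i] * mean[j] for j in range(d)]
           for i in range(d)]
    return mean, cov, n_c / n_total

# Toy class with two 2-dimensional points, split across two partial accumulators.
part_1, part_2 = init_stats(2), init_stats(2)
add_point(part_1, [1.0, 2.0])
add_point(part_2, [3.0, 4.0])
mean, cov, prior = mle(merge(part_1, part_2), n_total=4)
print(mean, cov, prior)
```

Note that the maximum likelihood covariance divides by the class count $n_c$ rather than $n_c - 1$, and that `n_total` plays the role of the known $n$ in the text.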

In your answer, report: (i) the estimated mean parameters $\hat{m}_M$ and $\hat{m}_B$, (ii) the estimated covariances $\hat{S}_M$ and $\hat{S}_B$, (iii) $\hat{P}(y_i = M)$ and $\hat{P}(y_i = B)$, (iv) the four performance metrics (sensitivity, specificity, precision and accuracy), (v) the content of the Python files mapper.py and reducer.py for the map and reduce phases for the two jobs, and (vi) the shell files containing the code to execute the two Hadoop jobs.
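The four performance metrics can all be computed from confusion-matrix counts, which are additive across records. The following sketch assumes that malignant ('M') is treated as the positive class, which the text does not state explicitly; the function names and the toy label pairs are likewise hypothetical.

```python
def confusion_counts(pairs, positive="M"):
    """Count (TP, FP, TN, FN) from (true_label, predicted_label) pairs."""
    tp = fp = tn = fn = 0
    for truth, pred in pairs:
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, precision and accuracy from the four counts."""
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),     # positive predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Toy (true, predicted) pairs, for illustration only.
pairs = [("M", "M"), ("M", "B"), ("B", "B"), ("B", "B"), ("B", "M"), ("M", "M")]
counts = confusion_counts(pairs)
result = metrics(*counts)
print(counts, result)
```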
