Machine Learning and Intelligent Data Analysis
Learning Outcomes
Question 1 Dimensionality Reduction
(a) Explain what is meant by “dimensionality reduction” and why it is sometimes necessary. [4 marks]
What can you say about the underlying nature of this dataset? [4 marks]
Question 2 Classification
(a) Consider the Soft Margin Support Vector Machine learnt in Lecture 4e. Consider also that C = 100 and that we are adopting a linear kernel, i.e., $k(x^{(i)}, x^{(j)}) = {x^{(i)}}^{T} x^{(j)}$.
Assume an illustrative binary classification problem with the following training examples:
Which of the Lagrange multipliers below is(are) a plausible solution(s) for this problem? Justify your answer.
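For illustration only, the linear kernel named in part (a) and a feasibility check on candidate Lagrange multipliers can be sketched as follows. The sketch assumes the standard soft-margin dual constraints (each multiplier between 0 and C, and the label-weighted sum of multipliers equal to zero), which may be written differently in the lecture's formulation. The function names and the three training points are hypothetical, since the training examples and candidate multipliers are not reproduced here. Passing this check is necessary but not sufficient for a candidate to be a solution.

```python
def linear_kernel(u, v):
    """Linear kernel: the inner product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def gram_matrix(X):
    """Matrix of kernel values k(x_i, x_j) for every pair of examples."""
    return [[linear_kernel(u, v) for v in X] for u in X]


def satisfies_dual_constraints(alpha, y, C, tol=1e-9):
    """Check the (assumed) standard soft-margin dual constraints:
    0 <= alpha_i <= C for all i, and sum_i alpha_i * y_i == 0."""
    in_box = all(-tol <= a <= C + tol for a in alpha)
    balanced = abs(sum(a * t for a, t in zip(alpha, y))) <= tol
    return in_box and balanced


# Hypothetical data, not the examples from the question.
X = [[1.0, 2.0], [3.0, 0.0], [0.0, -1.0]]
y = [1, 1, -1]
C = 100

K = gram_matrix(X)
feasible = satisfies_dual_constraints([1.0, 1.0, 2.0], y, C)
too_large = satisfies_dual_constraints([150.0, 0.0, 150.0], y, C)
unbalanced = satisfies_dual_constraints([1.0, 0.0, 0.0], y, C)
```

Here the second candidate fails because a multiplier exceeds C, and the third fails because the label-weighted sum is not zero.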
(b) Consider a binary classification problem where around 5% of the training examples are likely to have their labels incorrectly assigned (i.e., assigned as -1 when the true label was +1, and vice-versa). Which value of k for k-Nearest Neighbours is likely to be better suited for this problem: k = 1 or k = 3? Justify your answer.
[6 marks]
(c) Consider a binary classification problem where you wish to predict whether a piece of machinery is likely to contain a defect. For this problem, 0.5% of the training examples belong to the defective class, whereas 99.5% belong to the non-defective class. When adopting Naïve Bayes for this problem, the non-defective class may almost always be the predicted class, even when the true class is the defective class. Explain why and propose a method to alleviate this issue. [8 marks]
Question 3 Document Analysis
(a) In a small universe of five web pages, one page has a PageRank of 0.4. What does this tell us about this page? [2 marks]
(b) Compare and contrast the TF-IDF and word2vec approaches to document vectorisation. You should explain the essential principles of each method, and highlight their respective advantages and disadvantages. [8 marks]
(c) One possible approach to searching a large linked set of documents is to combine a measure of document similarity such as TF-IDF similarity with a measure of a page’s importance such as that provided by PageRank. Suggest three ways in which this could be done and discuss the advantages and disadvantages of each of them.