Given three documents S1, S2, S3 and a customized query document q:
S1 = {3,4,5},S2 = {0,1,2},S3 = {0,1,3},q = {2,3,4,h(y)};
h(y) = y mod 6.
where y is the last digit of your Student ID. For instance, suppose my Stu- dentID=xxxxx7, my query document would be q= {2, 3, 4, 1}.
1.1 Computing MinHash Signatures
1. Generate the bit-vector representation for {S1, S2, S3, q} in feature space {0, 1, 2, 3, 4, 5}. [1 mark]
2. Generate the MinHash matrix for {S1, S2, S3, q} using the following four MinHash functions. [2 marks]
h1(x) = x mod 6
h2(x)=(x+1) mod6
h3(x)=(x+3) mod6
h4(x)=(x+5) mod6
3. Consider the query q and estimate the signature-based Jaccard similari- ties: J(q, S1), J(q, S2), and J(q, S3). [1 mark]
1.2 Tuning Parameters for rNNS
In our lecture, we have learnt to formulate the collision probability (i.e., S- curve) given the number of bands b and the number of rows per band r as follows: P r(s) = 1 − (1 − s ) .
Consider three sets of parameters (r=2,b=10), (r=6,b=30), (r=10,b=50). The collision probabilities for similarity s in range of [0,1] for each (r,b) are pro- vided accordingly as follows:
1.3 c-Approximate Randomized rNNS [3 marks]
We have learnt that a family of functions H is called (d1, d2, p1, p2)-sensitive with collision probability p1 > p2 and c > 1 if the following conditions hold for any uniformly chosen h ∈ H and ∀ x,y ∈ U:
– If d(x, y) ≤ r, P r[h(x) = h(y)] ≥ p1 for similar points, and
– If d(x, y) ≥ cr, P r[h(x) = h(y)] ≤ p2 for dissimilar points.
Consider a family transformation from (d1, d2, p1, p2)-sensitive to (d1, d2, 1 − (1−pk1)L,1−(1−pk2)L)-sensitive, where k and L refer to the number of hash functions and the number of hash tables, respectively. Briefly describe steps to achieve such transformation. What is the expected impact on probability bounds after the transformation?