Computer Science 320SC – (2023)
Programming Assignment 6
Due: Oct 29 2023 (11:59pm)
Requirements
This 6th assignment lets you get familiar with randomized algorithm design and development. It is worth 5% of your total course marks. We would like you to implement a Bloom Filter to simulate a email spam filtering.
An excessive number of submissions (over 10) for a particular problem will accrue a 20% penalty per that problem if you eventually solve it. Therefore, please write a bruteforce algorithm and test your randomized algorithm with your own generated inputs at scale before submitting to the automated marker.
We only accept Python programs that use built-in packages (i.e. packages that do not require pip install).
1 Email spam filtering 1.1 Problem description
You are implementing a Bloom filter for email spam filtering for the University of Auckland. Given that you know the set A of n trusted email addresses, If the email comes from one of these trusted addresses, it is not a spam. Otherwise, it will be a spam. Since the server memory is very limited, you can not keep the list of trusted addresses on its memory (if an email address requires 25 bytes on average and you have 1 billion trusted emails, it would take 25 GB to store A in the memory).
Due to the limited resource of automarker, instead of using a string, we use a random integer in [0 . . . 231 ] to present an email address. Also, to simpify the task, please use the following code to generate the list of n integers that represent for n email addresses. Note that you have to use random.seed(320) to ensure that you generate the same set of integers as automarker.
You would need to write a Python script to implement a Bloom filter given the setting of m = 8n bits and different numbers of hash functions k = {2, 4, 6, 8}. The used hash function is: f (x) = (a · x + b) mod m where a and b are generated using the following code:
import random random.seed(320)
n = 100
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
for i in range(n):
x = random.randint(0, 2**31)
# Construct several Bloom filters here
2
# Testing here
for line in sys.stdin:
y = int(line)
# Check y is in the set S here
Note that you must not call x = random.randint(0, 2**31) several times since it will generate dif- ferent sets S.
Given k = i, you will use i hash functions, from f1 to fi(x) = (a[i−1]·x+b[i−1]) mod m. For example, given k = 2, you will use
and
f1(x)=(a[0]·x+b[0]) mod m=(x+8) mod m f2(x)=(a[1]·x+b[1]) mod m=(2x+7) mod m.
1.2 Test case description
Your test will be a sequence of 2000 integers, each value per line corresponding to the email address, containing non-spam and spam emails. Your output will be 4 integers, each per line, that count the number of detected spam emails for each value of k = {2, 4, 6, 8}.
There are 4 test cases.
1. Atrialtestcaseofn=100has0mark.
2. A test case of n = 1, 000 has 2 marks.
3. A test case of n = 10,000 and has 2 marks.
4. A test case of n = 100,000 and has 1 mark.