comp10002 Foundations of Algorithms Semester 2, 2022
Assignment 2
In this project, you will demonstrate your understanding of dynamic memory and linked data structures (Chapter 10) and extend your program design, testing, and debugging skills. You will also learn about the problem of process discovery and implement a simple algorithm for discovering a process model from event data.
Process Discovery
Process mining is a relatively young research discipline that combines studies of inferences from data in data mining and machine learning with process modeling and analysis to discover, monitor, and improve real-world processes. The studied processes get formally captured in process models.
Figure 1 shows an example process model captured inBusiness Process Model and Notation (BPMN) language. Inthe figure, actions are drawn as rectangles with rounded coners. Diamonds represent gateways. An exclusive gateway has the “×” marker, while a parallel gateway has the “+” marker inside the diamond shape. Directed arcs encode control flow dependencies. The circles with thin and thick borderlines denote the start and end of the process, respectively. f
A process model describes a collection of possible executions, or traces. A trace is a sequence of events that represent actions encountered when following control flow arcs and gateway routing decisions from the start to the end element of the model. If execution reaches an exclusive gateway, it continues along exactly one of its outgoing arcs. If execution reaches a parallel gate- way with multiple outgoing arcs, all the outgoing arcs are followed simultaneously. A parallel gateway with multiple incoming arcs “waits” for all the incoming arcs to complete and then “passes” the execution to its outgoing arcs. The fact that several paths in a model are executed simultaneously means that the ac- tions encountered on these paths are independent and can occur in any order.
In the real-world, different loan applications can be processed by following the same sequence of actions; thus, a trace can be observed multiple times. The problem of process discovery is a core problem in process mining that, given an event log, that is, a collection of observed traces, aims to automatically construct a process model that describes the traces. A process model constructed in this way aims to aggregate and generalize information on how processes were executed in the real-world.
In this assignment, you will implement a program that reads an event log from input, analyzes patterns of actions in the input traces, and discovers and reports model fragments that could have produced the traces.
Input Data
Your program should read input from stdin and write output to stdout. The input will list traces, one per input line, where each trace is provided as a comma-separated list of events that terminates either with “\n”, that is, one newline character, or EOF, that is, the end of file constant defined in the <stdio.h> header file. An event is
cd abe
Figure 1: A process model.
encoded as a single alphabetic character that when provided as input to the isalpha() function defined in the <ctype.h> header file returns a non-zero integer.
The following file test0.txt uses fifteen lines to specify fifteen traces, each composed of four events.
1 a,b,c,d 2 a,b,f,e 3 a,b,f,e
4 a,b,f,e 5 a,b,c,d 6 a,b,c,d
7 a,b,c,d 8 a,b,c,d 9 a,b,c,d
10 a,b,e,f 11 a,b,e,f 12 a,b,e,f
13 a,b,f,e 14 a,b,c,d 15 a,b,e,f
In general, input data can contain an arbitrary number of traces, with each trace composed of at least one event.
Stage 0 – Reading, Analyzing, and Printing Input Data (10/20 marks)
The first version of your program should read an event log from input, analyze it, and print basic statistics about the event log to the output. The first thirteen lines from the listing below correspond to the output your program should generate for the test0.txt input file in Stage 0.
1 ==STAGE 0============================ 2 Number of distinct events: 6
3 Number of distinct traces: 3
4 Total number of events: 60
5 Total number of traces: 15
6 Most frequent trace frequency: 7 7 abcd
8 a = 15
9 b = 15
10 c = 7
26 d=7
27 e=8
28 f=8
29 256=15
30 =====================================
31 c d e f256
32c07000 57 256 257 258 0700 0000 0004 256 0 15
40e=8 41f=8 42256=15 43 257=7
58 256 0 7 8
59 257 0 0 0
60 258 0 0 0
61 ------------------------------------- 62 259 = CHC(257,258)
==STAGE 2============================ ef256257
e0400
d0 e0 f0
0 0 0 0
256 7 ------------------------------------- 257 = SEQ(c,d)
Number of events removed: 7
f4 256 4 257 0
00 0 40 7 00 0
74 ==THE END============================ 75
-------------------------------------
Lines 2 and 3 of the output report the number of distinct events and traces in the event log, respectively. The event log contains six distinct events, a through f, and three distinct traces, ⟨a, b, c, d⟩, ⟨a, b, e, f⟩, and ⟨a, b, c, d⟩. Lines 4 and 5 of the output report the total number of events and traces in the event log. As there are fifteen traces in the log, each composed of four events, in total, there are sixty events in the event log. Line 6 provides information about the number of times the most frequent trace appears in the event log. Trace ⟨a, b, c, d⟩ is the most frequent trace; it appears seven times in the event log. The subsequent lines of the output should list all most frequent traces of the event log (one occurrence of each trace). You should not make assumptions about the maximum number of traces and events supplied. Use dynamic memory and data structures of your choice, for example, arrays or linked lists, to store the input event log.
The test input will always follow the proposed format.
Stage 1 – Identifying Sequences (16/20 marks)
Extend your program from Stage 0 to identify the most likely sequential (SEQ) patterns of actions based on the directly follows relationships between events in the event log. Two events are in the directly follows relationship if there is a trace in the log in which these events appear consecutively. First, your program should compute and print out all the directly follows relationships. Lines 15 to 21 of the output listing report the directly follows relationships between the events together with their support, that is, the number of times the events follow each other in the traces of the event log, given as a two dimensional matrix. For example, event b directly follows event a once in each trace of the event log. Thus, its support is 15; see 15 in the third column of line 16 of the output; that is, sup(a, b) = 15. Each matrix column uses five characters, and each cell entry is right-justified.
Now, use the directly follows matrix to identify pairs of actions that most likely were arranged in sequences in the model that generated the event log. Given two events x and y, x ̸= y, if sup(x, y) > sup(y, x), then there is a chance that the model that generated the event log has actions x and y arranged in a sequence. To measure this chance, compute two values pd(x, y) = (100 × abs(sup(x, y) − sup(y, x)))/max(sup(x, y), sup(y, x)) and w(x,y) = abs(50 − pd(x,y)) × max(sup(x,y),sup(y,x)); use integer division to obtain the truncated result (for example, 15/4 should result in 3). Here, pd(x, y) is percent difference of sup(x, y) and sup(y, x) that reflects confidence that x is directly followed by y, while w (x, y) is weight that further scales this confidence by the number of times y directly follows x; here, we assume that indeed sup(x, y) > sup(y, x).
Lines 23 to 29 of the output report the identified sequential pattern of actions and summarize information on the abstraction of the pattern in the event log. The most likely sequential pattern identified according to the proposed rules in the input event log is defined by the pair of events (a,b); note that pd(a,b) = 100 and w (a, b) = 750. Line 23 reports the identified pattern, assigning it the code of 256; note that the code of each subsequently identified pattern of actions should be incremented by one; see lines 38, 51, 62, and 71 of the output. Next, the identified pattern is abstracted in the event log to prepare it for discovering subsequent patterns. To abstract a pattern identified by a pair of events, rewrite each occurrence of one of these events in the traces of the event log with the abstract event identified by the code of the pattern. Then, in every trace, replace each subsequence composed of only new abstract events with one such abstract event without changing the order of the other events. For example, the abstraction of pattern 256 in the input event log leads to the below event log; keep the resulting event log in memory and do not write it to a file.
1 256,c,d 2 256,f,e 3 256,f,e
4 256,f,e 5 256,c,d 6 256,c,d
7 256,c,d 8 256,c,d 9 256,c,d
10 256,e,f 11 256,e,f 12 256,e,f
13 256,f,e 14 256,c,d 15 256,e,f
Line 24 of the output reports the number of events removed from the event log as the result of abstracting pattern 256. One event was removed from every trace to result in 15 removed events. Finally, lines 25 to 29 report all distinct events, both original and abstract, and their frequencies in the resulting event log, in the ascending order of their codes, where codes of the original events are ASCII codes of the corresponding characters. Lines 31 to 43 use the same template to report information on the next identified sequential pattern of actions.
The output of two consecutive patterns is separated by a sequence of “=” characters (see, for example, line 30), and the directly follows matrix is separated from the subsequent information on the identified pattern with a sequence of “-” characters (see, for instance, lines 22 and 37). The identification of patterns should continue from the resulting log until no pattern can be discovered based on the proposed rules. NB: In Stage 1, only patterns based on two events from the original input log, that is, non-abstract events, should be discovered.
Stage 2 – Identifying Concurrency and Choices (20/20 marks)
Extend your program from Stage 1 to allow discovering concurrency (CON) and choice (CHC) patterns of actions. Stage 2 of your program should continue from the state reached in Stage 1.
Two actions that can be executed concurrently are independent and, thus, can occur in a trace in any order. Given two events x and y, x ̸= y, if sup(x,y) > 0 and sup(y,x) > 0, there is a chance that the model that generated the event log supports their concurrent execution. Each pair of events (u, v), u ̸= v, sup(u, v) > 0, sup(v, u) > 0, for which pd(u, v) < 30, is a candidate for defining a concurrency pattern of actions. The weight of a concurrency pattern should be computed the same way as the weight of a sequential pattern.
In Stage 2, to prioritize the discovery of concurrency and sequential patterns, the weight of each sequential pattern defined over two non-abstract events from the original input event log and the weight of any concurrency pattern should be additionally multiplied by 100.
In each iteration, your program should attempt to discover one pattern of actions. Each pair of distinct events should be checked for defining a sequential, concurrency, or choice pattern. Check a pair of events for being a choice pattern candidate first and, if confirmed, do not check it for defining other patterns. Among all identified candidate patterns, the one with the highest weight should be selected. If multiple candidates with the same weight exist, the one identified first when traversing the directly follows matrix in row-major order should be chosen. When proceeding to the next iteration, abstract the pattern in the log by following the rules from Stage 1.
Lines 45 to 73 of the output report three further discovered patterns using the same output template as in Stage 1. The patterns reported on lines 23, 38, 51, 62, and 71 of the output induce the hierarchy of patterns SEQ(SEQ(a,b),CHC(SEQ(c,d),CON(e,f))), which rediscovers the model in Figure 1. Note that, in general, the rules proposed in this assignment do not guarantee the rediscoverability of the model. Your program should terminate once no further patterns can be discovered.