1 Introduction

dataLoader is one of the most important classes in pMineR. It acts as the middleware between the Event Log and all the other classes: every other class can access the Event Log only through the method dataLoader::getData().

The most important methods are, in the current version, the following:

dataLoader::load.csv() : to load the event log stored in a CSV file on the local filesystem
dataLoader::load.data.frame() : to load an Event Log stored in a data.frame, in the R environment.
dataLoader::getData() : to format the event log in a way that can be used as input for other classes
dataLoader::applyFilter() : to filter the previously loaded Event Log and generate a new Event Log or a new dataLoader object.

2 First of all

We have to load the needed libraries:

library(pMineR,quietly = TRUE)
library(kableExtra,quietly = TRUE)

3 Preliminary activities: Event Log generation

To make learning easier, we create a synthetic event log using the class syntheticDataCreator(). We generate 200 patients using the method syntheticDataCreator::cohort.RT(), which creates an event log simulating the paths of oncology patients.

objSDG <- syntheticDataCreator()
EL.table <- objSDG$cohort.RT(numOfPat = 200,include.sex.attribute = TRUE)

4 Data Loading

We can instantiate an object and use the method dataLoader::load.data.frame() to load the data.frame. This method takes as input the parameters that declare the role of each field (ID, date, Event) and the date format. Pay attention to the date format: the separators must be indicated too (e.g. “-”, “/”, etc.). verbose.mode is set to FALSE only to suppress the informative messages in R Markdown; in interactive mode, it can be useful for estimating the loading time.

objDL <- dataLoader(verbose.mode = FALSE)
objDL$load.data.frame(mydata = EL.table,  IDName = "ID", EVENTName = "Event", dateColumnName = "Date", format.column.date = "%Y-%m-%d")

If the event log were stored in a CSV file on the filesystem, you could use the dataLoader::load.csv() method instead, making sure to specify the separator as well.
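Before handing a file to the loader, it can help to check in base R that the separator and the date format are the ones you expect. The sketch below is only an illustration: the file content, its `;` separator and the column names are invented for the example, and it relies on `read.csv()` and `as.Date()` rather than on any pMineR method.

```r
# Toy CSV written to a temporary file (hypothetical content, ';' as separator)
tmp.file <- tempfile(fileext = ".csv")
writeLines(c("ID;Event;Date",
             "1;Imaging;2001-01-14",
             "1;Medical Visit;2001-01-28",
             "2;Biopsy;2001-02-11"), tmp.file)

# Read it back, stating the separator explicitly
toy.EL <- read.csv(tmp.file, sep = ";", stringsAsFactors = FALSE)

# Parse the dates with the format string we intend to declare to the loader;
# a format that does not match the text yields NA values
parsed.ok  <- as.Date(toy.EL$Date, format = "%Y-%m-%d")
parsed.bad <- as.Date(toy.EL$Date, format = "%d/%m/%Y")

print(anyNA(parsed.ok))        # are there unparsed dates with the right format?
print(all(is.na(parsed.bad)))  # does the wrong format fail on every row?
```

If the first check reports missing values, the declared date format (including its separators) does not match the file.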

5 Getting the output

Once loaded, the Event Log can be exported and passed as input to the other pMineR classes. This is easily done with the method dataLoader::getData(), in the following way:

objDL.out <- objDL$getData()

The returned variable is a list containing a lot of useful information. Some of its elements are:

arrayAssociativo : contains the set of events retrieved from the Event Log;
MMatrix : is the matrix containing the absolute count of the transitions from each Event (row) to the following one (column). Two virtual events are added here: BEGIN and END. They represent the beginning and the end of every trace in the Event Log.

In our case, the objDL.out$MMatrix is:

	END	Imaging	Medical Visit	Biopsy	partial resection	death	total resection	chemotherapy	radiotherapy
BEGIN	0	100	100	0	0	0	0	0	0
END	0	0	0	0	0	0	0	0	0
Imaging	0	128	133	89	0	9	0	0	0
Medical Visit	22	131	203	94	0	70	0	0	0
Biopsy	0	0	0	0	88	30	65	0	0
partial resection	0	0	21	0	0	37	0	25	5
death	178	0	0	0	0	0	0	0	0
total resection	0	0	17	0	0	21	0	21	6
chemotherapy	0	0	13	0	0	6	0	135	27
radiotherapy	0	0	33	0	0	5	0	0	331
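To make the meaning of these counts concrete, here is a minimal base R sketch that builds the same kind of matrix from a toy event log. It does not use pMineR and is not its implementation; the data and the variable names are made up for illustration.

```r
# Toy event log: two patients, rows already ordered by date within each patient
toy.EL <- data.frame(
  ID    = c("1", "1", "1", "2", "2"),
  Event = c("Imaging", "Biopsy", "death", "Imaging", "Imaging"),
  stringsAsFactors = FALSE
)

# Rows are the "from" event, columns the "to" event
states <- c("BEGIN", "END", unique(toy.EL$Event))
MM <- matrix(0, nrow = length(states), ncol = length(states),
             dimnames = list(states, states))

for (one.trace in split(toy.EL$Event, toy.EL$ID)) {
  # Wrap each trace with the two virtual events
  path <- c("BEGIN", one.trace, "END")
  # Count every pair (current event, next event)
  for (i in seq_len(length(path) - 1)) {
    MM[path[i], path[i + 1]] <- MM[path[i], path[i + 1]] + 1
  }
}
print(MM)
```

The BEGIN row counts how traces start, the END column counts how they finish, and the diagonal counts repetitions of the same event.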

MMatrix.perc : is the MMatrix matrix where each row is normalized to sum to one (rows with no outgoing transitions, such as END, stay at zero). In other words, it contains the probability of moving from one state to another. Here is an example:

	END	Imaging	Medical Visit	Biopsy	partial resection	death	total resection	chemotherapy	radiotherapy
BEGIN	0.0000000	0.5000000	0.5000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
END	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
Imaging	0.0000000	0.3565460	0.3704735	0.2479109	0.0000000	0.0250696	0.0000000	0.0000000	0.0000000
Medical Visit	0.0423077	0.2519231	0.3903846	0.1807692	0.0000000	0.1346154	0.0000000	0.0000000	0.0000000
Biopsy	0.0000000	0.0000000	0.0000000	0.0000000	0.4808743	0.1639344	0.3551913	0.0000000	0.0000000
partial resection	0.0000000	0.0000000	0.2386364	0.0000000	0.0000000	0.4204545	0.0000000	0.2840909	0.0568182
death	1.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
total resection	0.0000000	0.0000000	0.2615385	0.0000000	0.0000000	0.3230769	0.0000000	0.3230769	0.0923077
chemotherapy	0.0000000	0.0000000	0.0718232	0.0000000	0.0000000	0.0331492	0.0000000	0.7458564	0.1491713
radiotherapy	0.0000000	0.0000000	0.0894309	0.0000000	0.0000000	0.0135501	0.0000000	0.0000000	0.8970190
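The normalization can be reproduced by hand. The following base R sketch takes the Biopsy row of the MMatrix shown earlier (only its three non-zero columns) together with the empty END row, and divides each row by its total; it is an illustration, not pMineR code.

```r
# Biopsy and END rows from the count matrix, restricted to three columns
MM <- matrix(c(88, 30, 65,
                0,  0,  0),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("Biopsy", "END"),
                             c("partial resection", "death", "total resection")))

row.tot <- rowSums(MM)
# Divide each row by its total; rows with no outgoing transitions
# are divided by 1 so that they stay at zero instead of becoming NaN
MM.perc <- MM / ifelse(row.tot == 0, 1, row.tot)
print(round(MM.perc, 7))
```

The Biopsy row gives 88/183, 30/183 and 65/183, which are the values reported in the table for that row.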

MMatrix.perc.noLoop : this matrix is similar to MMatrix.perc, but loops have been removed from it. By loops, we mean transitions where, from a state X, the subsequent state is still X: every such entry has been set to 0.
pat.process : this is a list indexed by the identifier declared as IDName when the data were loaded. Each element of this list contains the subset of the Event Log, ordered by date, concerning a specific patient. In this subset two new columns are added: pMineR.internal.ID.Evt, which is an ID for pMineR internal use, and pMineR.deltaDate, which counts, in minutes, how far each row is from the patient's first event. Here is, for example, the table of a patient:

	pMineR.internal.ID.Evt	ID	Event	Date	Sex	pMineR.deltaDate
X100	996	100	Imaging	15/01/2001 00:00:00	M	0
X100.1	997	100	Medical Visit	18/01/2001 00:00:00	M	4320
X100.2	998	100	Imaging	01/02/2001 00:00:00	M	24480
X100.3	999	100	Medical Visit	05/02/2001 00:00:00	M	30240
X100.4	1000	100	Biopsy	09/02/2001 00:00:00	M	36000
X100.5	1001	100	total resection	23/02/2001 00:00:00	M	56160
X100.6	1002	100	chemotherapy	04/03/2001 00:00:00	M	69120
X100.7	1003	100	chemotherapy	06/03/2001 00:00:00	M	72000
X100.8	1004	100	chemotherapy	08/03/2001 00:00:00	M	74880
X100.9	1005	100	chemotherapy	11/03/2001 00:00:00	M	79200
X100.10	1006	100	chemotherapy	13/03/2001 00:00:00	M	82080
X100.11	1007	100	chemotherapy	15/03/2001 00:00:00	M	84960
X100.12	1008	100	Medical Visit	31/07/2001 00:00:00	M	283620
X100.13	1009	100	Medical Visit	23/10/2001 00:00:00	M	404580
X100.14	1010	100	Medical Visit	25/12/2001 00:00:00	M	495360
X100.15	1011	100	Medical Visit	30/04/2002 00:00:00	M	676740
X100.16	1012	100	Medical Visit	31/07/2002 00:00:00	M	809220
X100.17	1013	100	death	01/08/2002 00:00:00	M	810660

The time gap in days can easily be calculated by dividing the column pMineR.deltaDate by 60*24 (the number of minutes in a day), which gives:

	pMineR.internal.ID.Evt	ID	Event	Date	Sex	pMineR.deltaDate
X100	996	100	Imaging	15/01/2001 00:00:00	M	0.0000
X100.1	997	100	Medical Visit	18/01/2001 00:00:00	M	3.0000
X100.2	998	100	Imaging	01/02/2001 00:00:00	M	17.0000
X100.3	999	100	Medical Visit	05/02/2001 00:00:00	M	21.0000
X100.4	1000	100	Biopsy	09/02/2001 00:00:00	M	25.0000
X100.5	1001	100	total resection	23/02/2001 00:00:00	M	39.0000
X100.6	1002	100	chemotherapy	04/03/2001 00:00:00	M	48.0000
X100.7	1003	100	chemotherapy	06/03/2001 00:00:00	M	50.0000
X100.8	1004	100	chemotherapy	08/03/2001 00:00:00	M	52.0000
X100.9	1005	100	chemotherapy	11/03/2001 00:00:00	M	55.0000
X100.10	1006	100	chemotherapy	13/03/2001 00:00:00	M	57.0000
X100.11	1007	100	chemotherapy	15/03/2001 00:00:00	M	59.0000
X100.12	1008	100	Medical Visit	31/07/2001 00:00:00	M	196.9583
X100.13	1009	100	Medical Visit	23/10/2001 00:00:00	M	280.9583
X100.14	1010	100	Medical Visit	25/12/2001 00:00:00	M	344.0000
X100.15	1011	100	Medical Visit	30/04/2002 00:00:00	M	469.9583
X100.16	1012	100	Medical Visit	31/07/2002 00:00:00	M	561.9583
X100.17	1013	100	death	01/08/2002 00:00:00	M	562.9583
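As a quick check of the conversion, this short base R sketch applies it to the first five pMineR.deltaDate values of the patient above. The variable names are arbitrary.

```r
# First values of pMineR.deltaDate for patient 100, in minutes
delta.minutes <- c(0, 4320, 24480, 30240, 36000)

# One day is 60 * 24 = 1440 minutes
minutes.per.day <- 60 * 24
delta.days <- delta.minutes / minutes.per.day
print(delta.days)
```

The result is 0, 3, 17, 21 and 25 days, as in the converted table.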

We can also measure the gap between consecutive rows, preferably by adding a new column (here pMineR.deltaDate.delta) rather than overwriting the existing one:

	pMineR.internal.ID.Evt	ID	Event	Date	Sex	pMineR.deltaDate	pMineR.deltaDate.delta
X100	996	100	Imaging	15/01/2001 00:00:00	M	0.0000	0.00000
X100.1	997	100	Medical Visit	18/01/2001 00:00:00	M	3.0000	3.00000
X100.2	998	100	Imaging	01/02/2001 00:00:00	M	17.0000	14.00000
X100.3	999	100	Medical Visit	05/02/2001 00:00:00	M	21.0000	4.00000
X100.4	1000	100	Biopsy	09/02/2001 00:00:00	M	25.0000	4.00000
X100.5	1001	100	total resection	23/02/2001 00:00:00	M	39.0000	14.00000
X100.6	1002	100	chemotherapy	04/03/2001 00:00:00	M	48.0000	9.00000
X100.7	1003	100	chemotherapy	06/03/2001 00:00:00	M	50.0000	2.00000
X100.8	1004	100	chemotherapy	08/03/2001 00:00:00	M	52.0000	2.00000
X100.9	1005	100	chemotherapy	11/03/2001 00:00:00	M	55.0000	3.00000
X100.10	1006	100	chemotherapy	13/03/2001 00:00:00	M	57.0000	2.00000
X100.11	1007	100	chemotherapy	15/03/2001 00:00:00	M	59.0000	2.00000
X100.12	1008	100	Medical Visit	31/07/2001 00:00:00	M	196.9583	137.95833
X100.13	1009	100	Medical Visit	23/10/2001 00:00:00	M	280.9583	84.00000
X100.14	1010	100	Medical Visit	25/12/2001 00:00:00	M	344.0000	63.04167
X100.15	1011	100	Medical Visit	30/04/2002 00:00:00	M	469.9583	125.95833
X100.16	1012	100	Medical Visit	31/07/2002 00:00:00	M	561.9583	92.00000
X100.17	1013	100	death	01/08/2002 00:00:00	M	562.9583	1.00000
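A gap column of this kind can be obtained in base R with `diff()`. The sketch below uses the first six elapsed times (in days) of the patient above; the variable names are arbitrary.

```r
# Elapsed days since the first event, first six rows of patient 100
delta.days <- c(0, 3, 17, 21, 25, 39)

# diff() returns n - 1 gaps; prepend 0 so that the first event has a gap of zero
delta.from.previous <- c(0, diff(delta.days))
print(delta.from.previous)
```

This yields 0, 3, 14, 4, 4 and 14, matching the first rows of the new column.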

wordSequence.raw : is a list containing, for each patient (indexed by the patient's ID), the ordered sequence of Events.
MM.mean.time : deprecated
MM.density.list : deprecated
MM.den.list.high.det : is a list indexed by an event “from” and a next event “to”. It contains the times related to all occurrences in which a transition from X to Y has been performed, not necessarily directly consecutive.
MM.mean.outflow.time : deprecated
original.CSV : the loaded Event Log.
csv.column.names, csv.IDName, csv.EVENTName, csv.dateColumnName, csv.date.format : are the parameters passed to load the Event Log.
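To give an idea of what a per-patient sequence such as wordSequence.raw looks like, here is a base R sketch that derives one from a toy event log. It is an illustration with invented data, not the code pMineR runs internally.

```r
# Toy event log with rows in arbitrary order
toy.EL <- data.frame(
  ID    = c("2", "1", "1", "2"),
  Event = c("Biopsy", "Medical Visit", "Imaging", "Imaging"),
  Date  = as.Date(c("2001-02-11", "2001-01-28", "2001-01-14", "2001-02-09")),
  stringsAsFactors = FALSE
)

# Order by patient and date, then split the events by patient ID
toy.EL <- toy.EL[order(toy.EL$ID, toy.EL$Date), ]
word.sequence <- split(toy.EL$Event, toy.EL$ID)
print(word.sequence)
```

Each element of the resulting list is the chronological sequence of events for one patient, indexed by the patient ID.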

So, for example, let’s suppose we are interested in plotting the time spent to move from the event Medical Visit to the event Biopsy. We can see the times, in minutes, with:

objDL.out$MM.den.list.high.det$`Medical Visit`$Biopsy

##   [1] 20160 11520 31680  5760 61920  8640  7200 36000 28800  7200 30240 12960
##  [13] 53280 37440 30240 34560 14400  2880  2880 11520 14400 36000  5760  1440
##  [25] 48960 15840 10080 11520 66240 44640 51840 43200 10080 23040 10080  2880
##  [37] 17280 11520 43200 60480 56160 12960 30240 14400 53280 63360 56160  1440
##  [49] 34560 14400 11520 11520 38880 28800 77760 67680 53280 24480 41760 27360
##  [61] 41760 21600 12960 27360 17280 18720 11520 10080 48960 37440 21600 12960
##  [73] 43200 10080 56160 64800 41760 24480  2880 40320 24480 10080 48960 47520
##  [85] 40320 20160 12960 70560 44640 25920 10080 21600 10080 12960  7200  1440
##  [97]  2880  7200 37440 28800 14400 25920 24480 20160 24480 60480 47520 27360
## [109] 56160 54720 37440 15840 44640 31680 38880 24480  7200 24480  7200 36000
## [121] 38880  8640  8640 48960 33120 17280 41760 10080 17280 31680 24480 17280
## [133] 11520 11520 37440 20160  4320 43200 12960 76320 34560 14400  4320 70560
## [145] 14400 38880 12960 36000  7200 11520 18720 11520 10080  7200  4320 18720
## [157] 36000 14400 60480 34560 24480 14400 54720  2880 15840  7200  1440 60480
## [169] 34560  4320 57600  5760 92160 82080 61920 48960 31680 10080 27360  1440
## [181] 54720 28800 11520 56160 33120 12960 30240 28800 12960 33120 23040 15840
## [193]  1440 25920  2880 27360 53280 11520 40320 34560  4320  1440 31680 10080
## [205]  8640 11520 14400 31680 14400 57600 41760  1440 33120 15840  4320 43200
## [217] 41760 12960 44640 20160 51840 31680 25920  7200 66240 51840  5760 30240
## [229] 40320 18720  4320 77760 60480 25920 17280 12960 27360 34560 12960  7200
## [241] 41760 21600 30240 11520  4320 34560 31680  7200  5760 28800  4320 47520
## [253]  8640 86400 48960 64800 57600 34560  4320 12960 66240 37440 14400  8640
## [265] 51840 43200 30240 10080 46080 14400 10080  5760 41760 21600 14400 10080
## [277]  7200 20160  5760 57600 38880 33120 33120 24480 59040 25920  5760 12960
## [289] 60480 54720 44640 37440 11520 41760 43200 24480 14400  5760 44640 10080
## [301]  2880 33120 10080 37440 24480 17280 34560 15840  4320 12960  2880 10080

and we can easily plot the cumulative distribution of these times with:

# get the transition times (in minutes) and convert them to days
measured.time <- objDL.out$MM.den.list.high.det$`Medical Visit`$Biopsy
measured.time <- measured.time / (60*24)
# estimate the density, starting from 0 because times cannot be negative
mes.time.dens <- density(measured.time,from=0)
# approximate the cumulative distribution, scaled so that it ends at 1
cumulative.y.values <- cumsum(mes.time.dens$y )/max(cumsum(mes.time.dens$y ))
# plot it
plot(x = mes.time.dens$x, y = cumulative.y.values, type='l', xlab="days", ylab="cumulative fraction", main="time needed to move \n from 'Medical Visit' to 'Biopsy'")
# index of the last grid point where the cumulative curve is still <= 0.9
idx.90 <- sum(cumulative.y.values <= 0.9)
# mark the 90% level and the corresponding time
abline(v=mes.time.dens$x[idx.90],col='red',lty=2)
abline(h=0.9,col='red', lty=2)

This means that, in 90% of the cases, moving from Medical Visit to Biopsy takes at most 39.3211291 days.
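A simpler cross-check, which skips the density smoothing, is the empirical quantile. The sketch below uses only the first twelve times printed above, so its result is not expected to match the figure obtained from the full vector; on real data one would pass the whole vector to `quantile()`.

```r
# First twelve transition times printed above, in minutes
times.min <- c(20160, 11520, 31680, 5760, 61920, 8640,
               7200, 36000, 28800, 7200, 30240, 12960)
times.days <- times.min / (60 * 24)

# Empirical 90th percentile (default quantile type of base R)
q90 <- quantile(times.days, probs = 0.9, names = FALSE)
print(q90)
```

The density-based estimate and the empirical quantile answer the same question, so on the full data set they should be close but not identical.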

6 Plotting the Patient’s timeline

To get a first, rough but practical overview of a single trace, the method dataLoader::plotPatientTimeline() can be used.

 objDL$plotPatientTimeline(PatID = "2")

7 Filtering

Sometimes we might need to remove some patients or some events according to a given criterion (e.g. an attribute value). To do that, the class dataLoader offers the method dataLoader::applyFilter(), which operates differently depending on the parameters passed:

array.events.to.keep : an array containing the events to keep. All the others will be removed.
array.events.to.remove : an array containing the events to remove. All the others will be kept.
array.pazienti.to.keep : an array containing the patients to keep. All the others will be removed.
array.pazienti.to.removed : an array containing the patients to remove. All the others will be kept.
remove.events.by.attribute.name : the name of an attribute (a column name) which will be used to filter the events. The filter also requires the parameter by.arr.attribute.value to select which values of that attribute will result in the deletion of the corresponding event.
remove.patients.by.attribute.name : this is similar to the previous one, but will remove the entire patient from the Event Log. Again, this needs the parameter by.arr.attribute.value to operate.
keep.events.by.attribute.name : this works in the same way as remove.events.by.attribute.name but keeps the event instead of removing it. Again, this needs the parameter by.arr.attribute.value to operate.
keep.patients.by.attribute.name : this works in the same way as remove.patients.by.attribute.name but keeps the patient instead of removing them. Again, this needs the parameter by.arr.attribute.value to operate.
by.arr.attribute.value : this contains the attribute value(s) which will be used together with remove.events.by.attribute.name, remove.patients.by.attribute.name, keep.events.by.attribute.name or keep.patients.by.attribute.name to filter the events/patients.
whatToReturn : specifies what should be returned. If set to “dataLoader” a new filtered dataLoader object will be returned; if set to “csv” a new filtered csv will be returned. If left blank (the default value) the object itself will be filtered.
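The difference between filtering events and filtering patients by attribute can be sketched in base R on a toy table. This is an illustration of the idea only: the `Ward` attribute is hypothetical, and the rule used for patients (drop a patient as soon as one of their rows matches) is an assumption, not something stated above.

```r
# Toy event log with a hypothetical attribute column "Ward"
toy.EL <- data.frame(
  ID    = c("1", "1", "2", "2"),
  Event = c("Imaging", "Biopsy", "Imaging", "death"),
  Ward  = c("A", "B", "A", "A"),
  stringsAsFactors = FALSE
)
values.to.match <- "B"

# Event-level removal: drop only the rows whose attribute matches
events.removed <- toy.EL[!(toy.EL$Ward %in% values.to.match), ]

# Patient-level removal (assumed rule): drop every row of any patient
# that has at least one matching row
ids.to.drop <- unique(toy.EL$ID[toy.EL$Ward %in% values.to.match])
patients.removed <- toy.EL[!(toy.EL$ID %in% ids.to.drop), ]

print(events.removed)
print(patients.removed)
```

In the first result patient 1 survives with one event fewer; in the second, patient 1 disappears entirely.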

Here are some practical examples.

7.1 Getting the csv of the first 10 patients

We can get the IDs of the first 10 patients and pass them to applyFilter():

arr.ID <- names(objDL.out$pat.process)[1:10]
csvTable <- objDL$applyFilter(array.pazienti.to.keep = arr.ID,whatToReturn = "csv")
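Note that "first 10" means first in the order of names(objDL.out$pat.process). The tables later in this document list patients as 1, 10, 100, which suggests (an inference from that output, not a documented guarantee) that the IDs are ordered as text rather than as numbers. The base R sketch below shows the difference between the two orderings for 200 numeric IDs.

```r
# 200 patient IDs stored as character strings
ids <- as.character(1:200)

# method = "radix" sorts in the C locale, so the result does not
# depend on the locale of the R session
first.as.text <- sort(ids, method = "radix")[1:10]

# Numeric ordering of the same IDs
first.as.number <- as.character(sort(as.numeric(ids))[1:10])

print(first.as.text)
print(first.as.number)
```

If the ten numerically smallest IDs are wanted, it is safer to build arr.ID explicitly, for example from the second vector above.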

7.2 Dropping the males

To drop the males we have to remove the patients where the attribute Sex is set to M:

newDL <- objDL$applyFilter(remove.patients.by.attribute.name = "Sex",by.arr.attribute.value = "M",whatToReturn = "dataLoader")

The variable newDL is now an instance of the class dataLoader, filtered according to the patient’s sex. At this point we have two objects: the original objDL, which still keeps the entire set of patients, and newDL, which is the filtered one.

8 Dictionary

It might happen that the events in the event log are too fine-grained, or that we want to group events by creating new, more representative ones. For example, consider a scenario where each specific laboratory test is associated with an individual event in an event log. In this case, it is quite common to want to group the tests into families to reduce the complexity of the event log and mitigate the subsequent spaghetti effect.

This operation can be performed using dictionaries, which map each original event to the desired translation.

For simplicity, let’s consider the following dictionary as an example.

The first column contains the list of events we want to translate, while the other columns represent possible translations. In this table, the column original can, for example, be remapped according to column translation_01 or column translation_02.

mtr.dictionary <- matrix(c("Medical Visit","Medical Visit","Diagnosis",
"Imaging","Imaging","Diagnosis",
"Biopsy","Biopsy","Diagnosis",
"partial resection","Surgery","Therapy",
"radiotherapy","radiotherapy","Therapy",
"total resection","Surgery","Therapy",
"chemotherapy","chemotherapy","Therapy",
"death","death","death"), ncol=3, byrow = TRUE)
colnames(mtr.dictionary) <- c("original","translation_01","translation_02")
mtr.dictionary %>%
  kbl() %>%
  kable_minimal()

original	translation_01	translation_02
Medical Visit	Medical Visit	Diagnosis
Imaging	Imaging	Diagnosis
Biopsy	Biopsy	Diagnosis
partial resection	Surgery	Therapy
radiotherapy	radiotherapy	Therapy
total resection	Surgery	Therapy
chemotherapy	chemotherapy	Therapy
death	death	death

Let’s load the dictionary. Here we have to indicate which column contains the original event names:

objDL$addDictionaryMatrix(inputMatrix = mtr.dictionary,column.event.name = "original")

Now we can ask for a translation according to original -> translation_01:

objDL.translated.01 <- objDL$getTranslation(column.name = "translation_01",toReturn = "dataLoader")

or a translation according to original -> translation_02:

objDL.translated.02 <- objDL$getTranslation(column.name = "translation_02",toReturn = "dataLoader")
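Conceptually, a translation is a lookup of each event in the chosen dictionary column. The base R sketch below shows that idea with `match()` on a three-row excerpt of the dictionary; it is an illustration of the mapping, not the code behind getTranslation().

```r
# Excerpt of the dictionary defined above
dict <- matrix(c("partial resection", "Surgery", "Therapy",
                 "total resection",   "Surgery", "Therapy",
                 "Biopsy",            "Biopsy",  "Diagnosis"),
               ncol = 3, byrow = TRUE)
colnames(dict) <- c("original", "translation_01", "translation_02")

# A short sequence of events to translate
events <- c("Biopsy", "total resection", "partial resection")

# match() finds, for each event, the dictionary row whose "original" entry equals it
row.idx <- match(events, dict[, "original"])
translated.01 <- dict[row.idx, "translation_01"]
translated.02 <- dict[row.idx, "translation_02"]

print(translated.01)
print(translated.02)
```

With translation_01 the two resections collapse into Surgery while Biopsy is unchanged; with translation_02 the grouping is coarser (Diagnosis and Therapy).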

Not surprisingly, the new dataLoader() objects have the events converted according to the requested translation:

objDL.translated.01$getData()$original.CSV[1:20,]  %>%
  kbl() %>%
  kable_minimal()

	pMineR.internal.ID.Evt	ID	Event	Date	Sex	pMineR.deltaDate
X1	1	1	Imaging	14/01/2001 00:00:00	F	0
X1.1	2	1	Medical Visit	28/01/2001 00:00:00	F	20160
X1.2	3	1	Imaging	09/02/2001 00:00:00	F	37440
X1.3	4	1	Biopsy	11/02/2001 00:00:00	F	40320
X1.4	5	1	Surgery	23/02/2001 00:00:00	F	57600
X1.5	6	1	death	06/03/2001 00:00:00	F	73440
X10	62	10	Medical Visit	07/01/2001 00:00:00	M	0
X10.1	63	10	Biopsy	15/01/2001 00:00:00	M	11520
X10.2	64	10	Surgery	27/01/2001 00:00:00	M	28800
X10.3	65	10	chemotherapy	01/02/2001 00:00:00	M	36000
X10.4	66	10	chemotherapy	03/02/2001 00:00:00	M	38880
X10.5	67	10	radiotherapy	04/02/2001 00:00:00	M	40320
X10.6	68	10	radiotherapy	06/02/2001 00:00:00	M	43200
X10.7	69	10	radiotherapy	08/02/2001 00:00:00	M	46080
X10.8	70	10	radiotherapy	10/02/2001 00:00:00	M	48960
X10.9	71	10	Medical Visit	01/07/2001 00:00:00	M	251940
X10.10	72	10	death	02/07/2001 00:00:00	M	253380
X100	996	100	Imaging	15/01/2001 00:00:00	M	0
X100.1	997	100	Medical Visit	18/01/2001 00:00:00	M	4320
X100.2	998	100	Imaging	01/02/2001 00:00:00	M	24480

objDL.translated.02$getData()$original.CSV[1:20,]  %>%
  kbl() %>%
  kable_minimal()

	pMineR.internal.ID.Evt	ID	Event	Date	Sex	pMineR.deltaDate
X1	1	1	Diagnosis	14/01/2001 00:00:00	F	0
X1.1	2	1	Diagnosis	28/01/2001 00:00:00	F	20160
X1.2	3	1	Diagnosis	09/02/2001 00:00:00	F	37440
X1.3	4	1	Diagnosis	11/02/2001 00:00:00	F	40320
X1.4	5	1	Therapy	23/02/2001 00:00:00	F	57600
X1.5	6	1	death	06/03/2001 00:00:00	F	73440
X10	62	10	Diagnosis	07/01/2001 00:00:00	M	0
X10.1	63	10	Diagnosis	15/01/2001 00:00:00	M	11520
X10.2	64	10	Therapy	27/01/2001 00:00:00	M	28800
X10.3	65	10	Therapy	01/02/2001 00:00:00	M	36000
X10.4	66	10	Therapy	03/02/2001 00:00:00	M	38880
X10.5	67	10	Therapy	04/02/2001 00:00:00	M	40320
X10.6	68	10	Therapy	06/02/2001 00:00:00	M	43200
X10.7	69	10	Therapy	08/02/2001 00:00:00	M	46080
X10.8	70	10	Therapy	10/02/2001 00:00:00	M	48960
X10.9	71	10	Diagnosis	01/07/2001 00:00:00	M	251940
X10.10	72	10	death	02/07/2001 00:00:00	M	253380
X100	996	100	Diagnosis	15/01/2001 00:00:00	M	0
X100.1	997	100	Diagnosis	18/01/2001 00:00:00	M	4320
X100.2	998	100	Diagnosis	01/02/2001 00:00:00	M	24480

01. DataLoader - first steps

Roberto Gatta

14/01/2022