1 Introduction

dataLoader is one of the most important classes in pMineR. It acts as the middleware between the Event Log and all the other classes: they can interact with the Event Log only through the method dataLoader::getData().

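The most important methods in the current version, covered in the sections below, include load.data.frame(), load.csv(), getData(), plotPatientTimeline(), applyFilter(), addDictionaryMatrix() and getTranslation().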

2 First of all

First, we load the required libraries:

library(pMineR,quietly = TRUE)
library(kableExtra,quietly = TRUE)

3 Preliminary activities: Event Log Generation

To facilitate learning, we create a synthetic event log using the class syntheticDataCreator(). We generate 200 patients using the method syntheticDataCreator::cohort.RT(), which creates an event log that simulates the paths of oncology patients.

objSDG <- syntheticDataCreator()
EL.table <- objSDG$cohort.RT(numOfPat = 200,include.sex.attribute = TRUE)

4 Data Loading

We can instantiate an object and use the method dataLoader::load.data.frame() to load the event log. This method takes as input the parameters needed to declare the role of each field (ID, date, Event) and the date format. Pay attention to the date format: separators must be indicated too (e.g. “-”, “/”, etc.). Setting verbose.mode to FALSE serves only to avoid tedious informative prints in R Markdown; in interactive mode, the messages can be useful for estimating the loading time.

objDL <- dataLoader(verbose.mode = FALSE)
objDL$load.data.frame(mydata = EL.table,  IDName = "ID", EVENTName = "Event", dateColumnName = "Date", format.column.date = "%Y-%m-%d")
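Before loading, it can help to check that the format string really matches the date column. The following is a minimal base-R sketch on made-up date strings (not the event log itself); the variable names are illustrative only.

```r
# Hypothetical date strings, written with "-" and with "/" as separators
dates.dash  <- c("2001-01-15", "2001-02-01")
dates.slash <- c("15/01/2001", "01/02/2001")

# The format must mirror both the separators and the order of the fields
parsed.dash  <- as.Date(dates.dash,  format = "%Y-%m-%d")
parsed.slash <- as.Date(dates.slash, format = "%d/%m/%Y")

# A format that does not match yields NA instead of raising an error
parsed.wrong <- as.Date(dates.slash, format = "%Y-%m-%d")

print(parsed.dash)
print(parsed.slash)
print(parsed.wrong)
```

A column that comes back full of NA values is therefore a strong hint that the format string (or its separators) is wrong.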

If the event log is stored in a CSV file on the filesystem, you can use the dataLoader::load.csv() method instead, making sure to specify the separator as well.
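A quick way to confirm the separator before loading a file is to read it with base R and count the columns. The sketch below writes a small made-up table to a temporary file; the table, the file and the variable names are assumptions for illustration and do not involve pMineR.

```r
# Small made-up event log (hypothetical values)
toy.EL <- data.frame(ID = c(1, 1, 2),
                     Event = c("Imaging", "Biopsy", "Imaging"),
                     Date = c("2001-01-14", "2001-02-11", "2001-01-07"),
                     stringsAsFactors = FALSE)

# Write it to a temporary file using ";" as separator
tmp.file <- tempfile(fileext = ".csv")
write.table(toy.EL, file = tmp.file, sep = ";", row.names = FALSE, quote = FALSE)

# With the right separator the three columns are recovered
ok <- read.csv(tmp.file, sep = ";", stringsAsFactors = FALSE)
# With the wrong separator every row collapses into a single column
bad <- read.csv(tmp.file, sep = ",", stringsAsFactors = FALSE)

print(ncol(ok))
print(ncol(bad))
```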

5 Getting the output

Once loaded, the EL can be exported and passed as input to the other pMineR classes. This can easily be done with the method dataLoader::getData() in the following way:

objDL.out <- objDL$getData()

The returned variable is a list containing a lot of useful information; some of its elements are shown below.

In our case, objDL.out$MMatrix, the matrix of transition counts between events, is:

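| | BEGIN | END | Imaging | Medical Visit | Biopsy | partial resection | death | total resection | chemotherapy | radiotherapy |
|---|---|---|---|---|---|---|---|---|---|---|
| BEGIN | 0 | 0 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 0 |
| END | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Imaging | 0 | 0 | 128 | 133 | 89 | 0 | 9 | 0 | 0 | 0 |
| Medical Visit | 0 | 22 | 131 | 203 | 94 | 0 | 70 | 0 | 0 | 0 |
| Biopsy | 0 | 0 | 0 | 0 | 0 | 88 | 30 | 65 | 0 | 0 |
| partial resection | 0 | 0 | 0 | 21 | 0 | 0 | 37 | 0 | 25 | 5 |
| death | 0 | 178 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| total resection | 0 | 0 | 0 | 17 | 0 | 0 | 21 | 0 | 21 | 6 |
| chemotherapy | 0 | 0 | 0 | 13 | 0 | 0 | 6 | 0 | 135 | 27 |
| radiotherapy | 0 | 0 | 0 | 33 | 0 | 0 | 5 | 0 | 0 | 331 |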

The same matrix, normalised by row so that each entry is a transition probability, is:

| | BEGIN | END | Imaging | Medical Visit | Biopsy | partial resection | death | total resection | chemotherapy | radiotherapy |
|---|---|---|---|---|---|---|---|---|---|---|
| BEGIN | 0 | 0.0000000 | 0.5000000 | 0.5000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| END | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| Imaging | 0 | 0.0000000 | 0.3565460 | 0.3704735 | 0.2479109 | 0.0000000 | 0.0250696 | 0.0000000 | 0.0000000 | 0.0000000 |
| Medical Visit | 0 | 0.0423077 | 0.2519231 | 0.3903846 | 0.1807692 | 0.0000000 | 0.1346154 | 0.0000000 | 0.0000000 | 0.0000000 |
| Biopsy | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.4808743 | 0.1639344 | 0.3551913 | 0.0000000 | 0.0000000 |
| partial resection | 0 | 0.0000000 | 0.0000000 | 0.2386364 | 0.0000000 | 0.0000000 | 0.4204545 | 0.0000000 | 0.2840909 | 0.0568182 |
| death | 0 | 1.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| total resection | 0 | 0.0000000 | 0.0000000 | 0.2615385 | 0.0000000 | 0.0000000 | 0.3230769 | 0.0000000 | 0.3230769 | 0.0923077 |
| chemotherapy | 0 | 0.0000000 | 0.0000000 | 0.0718232 | 0.0000000 | 0.0000000 | 0.0331492 | 0.0000000 | 0.7458564 | 0.1491713 |
| radiotherapy | 0 | 0.0000000 | 0.0000000 | 0.0894309 | 0.0000000 | 0.0000000 | 0.0135501 | 0.0000000 | 0.0000000 | 0.8970190 |
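
The relation between the count matrix and the probability matrix shown above can be reproduced with a row normalisation. The sketch below uses only the BEGIN, Biopsy and END rows (restricted to the columns where they are non-zero); the guard for all-zero rows is an assumption that reproduces the END row, not necessarily what pMineR does internally.

```r
# Three rows taken from the count matrix above, restricted to five columns
counts <- rbind(
  BEGIN  = c(100, 100,  0,  0,  0),
  Biopsy = c(  0,   0, 88, 30, 65),
  END    = c(  0,   0,  0,  0,  0)
)
colnames(counts) <- c("Imaging", "Medical Visit", "partial resection",
                      "death", "total resection")

# Divide each row by its total; rows summing to zero are left as zeros
row.tot <- rowSums(counts)
probs <- counts / ifelse(row.tot == 0, 1, row.tot)

print(round(probs, 7))
```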

The event sequence of a single patient (here, patient 100) looks as follows; pMineR.deltaDate is the time elapsed, in minutes, since the patient's first event:

| | pMineR.internal.ID.Evt | ID | Event | Date | Sex | pMineR.deltaDate |
|---|---|---|---|---|---|---|
| X100 | 996 | 100 | Imaging | 15/01/2001 00:00:00 | M | 0 |
| X100.1 | 997 | 100 | Medical Visit | 18/01/2001 00:00:00 | M | 4320 |
| X100.2 | 998 | 100 | Imaging | 01/02/2001 00:00:00 | M | 24480 |
| X100.3 | 999 | 100 | Medical Visit | 05/02/2001 00:00:00 | M | 30240 |
| X100.4 | 1000 | 100 | Biopsy | 09/02/2001 00:00:00 | M | 36000 |
| X100.5 | 1001 | 100 | total resection | 23/02/2001 00:00:00 | M | 56160 |
| X100.6 | 1002 | 100 | chemotherapy | 04/03/2001 00:00:00 | M | 69120 |
| X100.7 | 1003 | 100 | chemotherapy | 06/03/2001 00:00:00 | M | 72000 |
| X100.8 | 1004 | 100 | chemotherapy | 08/03/2001 00:00:00 | M | 74880 |
| X100.9 | 1005 | 100 | chemotherapy | 11/03/2001 00:00:00 | M | 79200 |
| X100.10 | 1006 | 100 | chemotherapy | 13/03/2001 00:00:00 | M | 82080 |
| X100.11 | 1007 | 100 | chemotherapy | 15/03/2001 00:00:00 | M | 84960 |
| X100.12 | 1008 | 100 | Medical Visit | 31/07/2001 00:00:00 | M | 283620 |
| X100.13 | 1009 | 100 | Medical Visit | 23/10/2001 00:00:00 | M | 404580 |
| X100.14 | 1010 | 100 | Medical Visit | 25/12/2001 00:00:00 | M | 495360 |
| X100.15 | 1011 | 100 | Medical Visit | 30/04/2002 00:00:00 | M | 676740 |
| X100.16 | 1012 | 100 | Medical Visit | 31/07/2002 00:00:00 | M | 809220 |
| X100.17 | 1013 | 100 | death | 01/08/2002 00:00:00 | M | 810660 |

The time gap in days can easily be calculated by dividing the column pMineR.deltaDate, which is expressed in minutes, by 60*24:

| | pMineR.internal.ID.Evt | ID | Event | Date | Sex | pMineR.deltaDate |
|---|---|---|---|---|---|---|
| X100 | 996 | 100 | Imaging | 15/01/2001 00:00:00 | M | 0.0000 |
| X100.1 | 997 | 100 | Medical Visit | 18/01/2001 00:00:00 | M | 3.0000 |
| X100.2 | 998 | 100 | Imaging | 01/02/2001 00:00:00 | M | 17.0000 |
| X100.3 | 999 | 100 | Medical Visit | 05/02/2001 00:00:00 | M | 21.0000 |
| X100.4 | 1000 | 100 | Biopsy | 09/02/2001 00:00:00 | M | 25.0000 |
| X100.5 | 1001 | 100 | total resection | 23/02/2001 00:00:00 | M | 39.0000 |
| X100.6 | 1002 | 100 | chemotherapy | 04/03/2001 00:00:00 | M | 48.0000 |
| X100.7 | 1003 | 100 | chemotherapy | 06/03/2001 00:00:00 | M | 50.0000 |
| X100.8 | 1004 | 100 | chemotherapy | 08/03/2001 00:00:00 | M | 52.0000 |
| X100.9 | 1005 | 100 | chemotherapy | 11/03/2001 00:00:00 | M | 55.0000 |
| X100.10 | 1006 | 100 | chemotherapy | 13/03/2001 00:00:00 | M | 57.0000 |
| X100.11 | 1007 | 100 | chemotherapy | 15/03/2001 00:00:00 | M | 59.0000 |
| X100.12 | 1008 | 100 | Medical Visit | 31/07/2001 00:00:00 | M | 196.9583 |
| X100.13 | 1009 | 100 | Medical Visit | 23/10/2001 00:00:00 | M | 280.9583 |
| X100.14 | 1010 | 100 | Medical Visit | 25/12/2001 00:00:00 | M | 344.0000 |
| X100.15 | 1011 | 100 | Medical Visit | 30/04/2002 00:00:00 | M | 469.9583 |
| X100.16 | 1012 | 100 | Medical Visit | 31/07/2002 00:00:00 | M | 561.9583 |
| X100.17 | 1013 | 100 | death | 01/08/2002 00:00:00 | M | 562.9583 |
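
The conversion behind the table above is a plain division of the minutes by 60*24. A minimal sketch on the first four rows of patient 100, rebuilt here as a small data frame (the name toy.trace is illustrative, not part of pMineR):

```r
# First four events of patient 100, with the elapsed time in minutes
toy.trace <- data.frame(
  Event = c("Imaging", "Medical Visit", "Imaging", "Medical Visit"),
  pMineR.deltaDate = c(0, 4320, 24480, 30240),
  stringsAsFactors = FALSE
)

# Replace the column: minutes -> days
toy.trace$pMineR.deltaDate <- toy.trace$pMineR.deltaDate / (60 * 24)

print(toy.trace)
```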

Or we can easily measure the delta between consecutive rows by replacing the column or, better, by adding a new column:

| | pMineR.internal.ID.Evt | ID | Event | Date | Sex | pMineR.deltaDate | pMineR.deltaDate.delta |
|---|---|---|---|---|---|---|---|
| X100 | 996 | 100 | Imaging | 15/01/2001 00:00:00 | M | 0.0000 | 0.00000 |
| X100.1 | 997 | 100 | Medical Visit | 18/01/2001 00:00:00 | M | 3.0000 | 3.00000 |
| X100.2 | 998 | 100 | Imaging | 01/02/2001 00:00:00 | M | 17.0000 | 14.00000 |
| X100.3 | 999 | 100 | Medical Visit | 05/02/2001 00:00:00 | M | 21.0000 | 4.00000 |
| X100.4 | 1000 | 100 | Biopsy | 09/02/2001 00:00:00 | M | 25.0000 | 4.00000 |
| X100.5 | 1001 | 100 | total resection | 23/02/2001 00:00:00 | M | 39.0000 | 14.00000 |
| X100.6 | 1002 | 100 | chemotherapy | 04/03/2001 00:00:00 | M | 48.0000 | 9.00000 |
| X100.7 | 1003 | 100 | chemotherapy | 06/03/2001 00:00:00 | M | 50.0000 | 2.00000 |
| X100.8 | 1004 | 100 | chemotherapy | 08/03/2001 00:00:00 | M | 52.0000 | 2.00000 |
| X100.9 | 1005 | 100 | chemotherapy | 11/03/2001 00:00:00 | M | 55.0000 | 3.00000 |
| X100.10 | 1006 | 100 | chemotherapy | 13/03/2001 00:00:00 | M | 57.0000 | 2.00000 |
| X100.11 | 1007 | 100 | chemotherapy | 15/03/2001 00:00:00 | M | 59.0000 | 2.00000 |
| X100.12 | 1008 | 100 | Medical Visit | 31/07/2001 00:00:00 | M | 196.9583 | 137.95833 |
| X100.13 | 1009 | 100 | Medical Visit | 23/10/2001 00:00:00 | M | 280.9583 | 84.00000 |
| X100.14 | 1010 | 100 | Medical Visit | 25/12/2001 00:00:00 | M | 344.0000 | 63.04167 |
| X100.15 | 1011 | 100 | Medical Visit | 30/04/2002 00:00:00 | M | 469.9583 | 125.95833 |
| X100.16 | 1012 | 100 | Medical Visit | 31/07/2002 00:00:00 | M | 561.9583 | 92.00000 |
| X100.17 | 1013 | 100 | death | 01/08/2002 00:00:00 | M | 562.9583 | 1.00000 |
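
The row-to-row delta in the last column can be obtained with diff(), padding the first row with 0 because it has no predecessor. A minimal sketch on the first six cumulative values (in days) of the table above; the variable names are illustrative:

```r
# Cumulative time since the first event, in days (first six rows above)
delta.days <- c(0, 3, 17, 21, 25, 39)

# Time elapsed since the previous row; the first row gets 0
delta.between.rows <- c(0, diff(delta.days))

print(delta.between.rows)
```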

So, for example, let’s suppose we are interested in plotting the time needed to move from the event Medical Visit to the event Biopsy. We can see the observed times, in minutes, with:

objDL.out$MM.den.list.high.det$`Medical Visit`$Biopsy
##   [1] 20160 11520 31680  5760 61920  8640  7200 36000 28800  7200 30240 12960
##  [13] 53280 37440 30240 34560 14400  2880  2880 11520 14400 36000  5760  1440
##  [25] 48960 15840 10080 11520 66240 44640 51840 43200 10080 23040 10080  2880
##  [37] 17280 11520 43200 60480 56160 12960 30240 14400 53280 63360 56160  1440
##  [49] 34560 14400 11520 11520 38880 28800 77760 67680 53280 24480 41760 27360
##  [61] 41760 21600 12960 27360 17280 18720 11520 10080 48960 37440 21600 12960
##  [73] 43200 10080 56160 64800 41760 24480  2880 40320 24480 10080 48960 47520
##  [85] 40320 20160 12960 70560 44640 25920 10080 21600 10080 12960  7200  1440
##  [97]  2880  7200 37440 28800 14400 25920 24480 20160 24480 60480 47520 27360
## [109] 56160 54720 37440 15840 44640 31680 38880 24480  7200 24480  7200 36000
## [121] 38880  8640  8640 48960 33120 17280 41760 10080 17280 31680 24480 17280
## [133] 11520 11520 37440 20160  4320 43200 12960 76320 34560 14400  4320 70560
## [145] 14400 38880 12960 36000  7200 11520 18720 11520 10080  7200  4320 18720
## [157] 36000 14400 60480 34560 24480 14400 54720  2880 15840  7200  1440 60480
## [169] 34560  4320 57600  5760 92160 82080 61920 48960 31680 10080 27360  1440
## [181] 54720 28800 11520 56160 33120 12960 30240 28800 12960 33120 23040 15840
## [193]  1440 25920  2880 27360 53280 11520 40320 34560  4320  1440 31680 10080
## [205]  8640 11520 14400 31680 14400 57600 41760  1440 33120 15840  4320 43200
## [217] 41760 12960 44640 20160 51840 31680 25920  7200 66240 51840  5760 30240
## [229] 40320 18720  4320 77760 60480 25920 17280 12960 27360 34560 12960  7200
## [241] 41760 21600 30240 11520  4320 34560 31680  7200  5760 28800  4320 47520
## [253]  8640 86400 48960 64800 57600 34560  4320 12960 66240 37440 14400  8640
## [265] 51840 43200 30240 10080 46080 14400 10080  5760 41760 21600 14400 10080
## [277]  7200 20160  5760 57600 38880 33120 33120 24480 59040 25920  5760 12960
## [289] 60480 54720 44640 37440 11520 41760 43200 24480 14400  5760 44640 10080
## [301]  2880 33120 10080 37440 24480 17280 34560 15840  4320 12960  2880 10080

and we can easily plot this with:

# get the times and convert them from minutes to days
measured.time <- objDL.out$MM.den.list.high.det$`Medical Visit`$Biopsy
measured.time <- measured.time / (60*24)
# kernel density estimate, starting from 0 days
mes.time.dens <- density(measured.time,from=0)
# calculate the cumulative, normalised so that it ends at 1
cumulative.y.values <- cumsum(mes.time.dens$y )/max(cumsum(mes.time.dens$y ))
# plot it
plot(x = mes.time.dens$x, y = cumulative.y.values, type='l', xlab="days", ylab="%", main="time needed to move \n from 'Medical Visit' to 'Biopsy'")
# index of the last grid point whose cumulative value is still <= 0.9
idx.90 <- sum(cumulative.y.values <= 0.9)
# mark the 90% threshold on both axes
abline(v=mes.time.dens$x[idx.90],col='red',lty=2)
abline(h=0.9,col='red', lty=2)

This means that in 90% of the cases, moving from Medical Visit to Biopsy takes at most about 39.3 days.
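
As a cross-check on the density-based estimate, the 90% threshold can also be read directly from the empirical quantiles. The sketch below uses only the first twelve values printed above, so its result is not comparable with the figure for the full vector; on the full vector the quantile should be close to, though not identical to, the kernel-based value.

```r
# First twelve Medical Visit -> Biopsy times printed above, in minutes
x.min <- c(20160, 11520, 31680, 5760, 61920, 8640,
           7200, 36000, 28800, 7200, 30240, 12960)

# Convert to days, as in the plot
x.days <- x.min / (60 * 24)

# Empirical 90th percentile (default quantile type of base R)
q90 <- quantile(x.days, probs = 0.9, names = FALSE)

print(q90)
```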

6 Plotting the Patient’s timeline

For a first, rough but practical overview of a single trace, the method dataLoader::plotPatientTimeline() can be used.

 objDL$plotPatientTimeline(PatID = "2")

7 Filtering

Sometimes we might need to remove some Patients or some Events on the basis of some criterion (e.g. an attribute value). To do that, the class dataLoader offers the method dataLoader::applyFilter(), whose behaviour depends on the parameters passed.

Here are some practical examples.

7.1 Getting the csv of the first 10 patients

We can get the IDs of the first 10 patients and pass them to applyFilter():

arr.ID <- names(objDL.out$pat.process)[1:10]
csvTable <- objDL$applyFilter(array.pazienti.to.keep = arr.ID,whatToReturn = "csv")
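
Conceptually, keeping a set of patients amounts to a row filter on the ID column. The base-R sketch below shows the idea on a small made-up event log; it is not how applyFilter() is implemented, only an illustration of the kind of table one would expect back.

```r
# Small made-up event log (hypothetical values)
toy.EL <- data.frame(
  ID = c("1", "1", "2", "3", "3"),
  Event = c("Imaging", "Biopsy", "Imaging", "Medical Visit", "death"),
  stringsAsFactors = FALSE
)

# IDs to keep: here, the first two distinct patients
ids.to.keep <- unique(toy.EL$ID)[1:2]

# Keep only the rows belonging to those patients
toy.filtered <- toy.EL[toy.EL$ID %in% ids.to.keep, ]

print(toy.filtered)
```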

7.2 Dropping the males

To drop the males we have to remove the patients where the attribute Sex is set to M:

newDL <- objDL$applyFilter(remove.patients.by.attribute.name = "Sex",by.arr.attribute.value = "M",whatToReturn = "dataLoader")

The variable newDL is now an instance of the class dataLoader, filtered according to the patient’s sex. We now have two objects: the original objDL, which still keeps the entire set of patients, and newDL, which is the filtered one.
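
The same idea can be sketched in base R on a made-up table: find the patients carrying the attribute value, then drop all of their rows while leaving the original table untouched. This only illustrates the logic under those assumptions and is not pMineR code.

```r
# Small made-up event log with a Sex attribute (hypothetical values)
toy.EL <- data.frame(
  ID = c("1", "1", "2", "2", "3"),
  Event = c("Imaging", "Biopsy", "Imaging", "death", "Medical Visit"),
  Sex = c("F", "F", "M", "M", "F"),
  stringsAsFactors = FALSE
)

# Patients to drop: those with Sex equal to "M"
ids.to.drop <- unique(toy.EL$ID[toy.EL$Sex == "M"])

# New, filtered table; toy.EL itself is not modified
toy.kept <- toy.EL[!(toy.EL$ID %in% ids.to.drop), ]

print(unique(toy.kept$ID))
print(nrow(toy.EL))
```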

8 Dictionary

It might happen that the events in the event log are too fine-grained, or that we want to group events into new, more representative ones. For example, consider a scenario where each specific laboratory test is associated with an individual event in an event log. In this case, it is quite common to want to group the tests into families to reduce the complexity of the event log and mitigate the subsequent spaghetti effect.

This operation can be performed using dictionaries, where for each entry, we map the desired translation.

For simplicity, let’s consider the following dictionary as an example.

The first column contains the list of events we want to translate, while the other columns represent possible translations. In this table, the column original can, for example, be remapped according to column translation_01 or column translation_02.

mtr.dictionary <- matrix(c("Medical Visit","Medical Visit","Diagnosis",
"Imaging","Imaging","Diagnosis",
"Biopsy","Biopsy","Diagnosis",
"partial resection","Surgery","Therapy",
"radiotherapy","radiotherapy","Therapy",
"total resection","Surgery","Therapy",
"chemotherapy","chemotherapy","Therapy",
"death","death","death"), ncol=3, byrow = TRUE)
colnames(mtr.dictionary) <- c("original","translation_01","translation_02")
mtr.dictionary %>%
  kbl() %>%
  kable_minimal()

| original | translation_01 | translation_02 |
|---|---|---|
| Medical Visit | Medical Visit | Diagnosis |
| Imaging | Imaging | Diagnosis |
| Biopsy | Biopsy | Diagnosis |
| partial resection | Surgery | Therapy |
| radiotherapy | radiotherapy | Therapy |
| total resection | Surgery | Therapy |
| chemotherapy | chemotherapy | Therapy |
| death | death | death |
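
To make the mapping concrete, the lookup that a dictionary expresses can be sketched in base R with match(). The snippet rebuilds the dictionary so that it is self-contained and applies it to a short, made-up trace; it illustrates the mapping only and is not the code pMineR runs.

```r
# Same dictionary as above, rebuilt so the snippet runs on its own
dict <- matrix(c("Medical Visit", "Medical Visit", "Diagnosis",
                 "Imaging", "Imaging", "Diagnosis",
                 "Biopsy", "Biopsy", "Diagnosis",
                 "partial resection", "Surgery", "Therapy",
                 "radiotherapy", "radiotherapy", "Therapy",
                 "total resection", "Surgery", "Therapy",
                 "chemotherapy", "chemotherapy", "Therapy",
                 "death", "death", "death"), ncol = 3, byrow = TRUE)
colnames(dict) <- c("original", "translation_01", "translation_02")

# A short, made-up trace
events <- c("Imaging", "Biopsy", "total resection", "chemotherapy", "death")

# Find each event in the "original" column, then read the chosen translation
row.idx <- match(events, dict[, "original"])
translated.01 <- dict[row.idx, "translation_01"]
translated.02 <- dict[row.idx, "translation_02"]

print(translated.01)
print(translated.02)
```

With translation_02 the eight original events collapse into three labels, which is exactly the kind of simplification dictionaries are meant for.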

Let’s load the dictionary. Here we have to indicate which column contains the original event names:

objDL$addDictionaryMatrix(inputMatrix = mtr.dictionary,column.event.name = "original")

Now we can ask for a translation according to original -> translation_01:

objDL.translated.01 <- objDL$getTranslation(column.name = "translation_01",toReturn = "dataLoader")

or for a translation according to original -> translation_02:

objDL.translated.02 <- objDL$getTranslation(column.name = "translation_02",toReturn = "dataLoader")

Not surprisingly, the new dataLoader() objects have the events converted according to the requested translation:

objDL.translated.01$getData()$original.CSV[1:20,]  %>%
  kbl() %>%
  kable_minimal()

| | pMineR.internal.ID.Evt | ID | Event | Date | Sex | pMineR.deltaDate |
|---|---|---|---|---|---|---|
| X1 | 1 | 1 | Imaging | 14/01/2001 00:00:00 | F | 0 |
| X1.1 | 2 | 1 | Medical Visit | 28/01/2001 00:00:00 | F | 20160 |
| X1.2 | 3 | 1 | Imaging | 09/02/2001 00:00:00 | F | 37440 |
| X1.3 | 4 | 1 | Biopsy | 11/02/2001 00:00:00 | F | 40320 |
| X1.4 | 5 | 1 | Surgery | 23/02/2001 00:00:00 | F | 57600 |
| X1.5 | 6 | 1 | death | 06/03/2001 00:00:00 | F | 73440 |
| X10 | 62 | 10 | Medical Visit | 07/01/2001 00:00:00 | M | 0 |
| X10.1 | 63 | 10 | Biopsy | 15/01/2001 00:00:00 | M | 11520 |
| X10.2 | 64 | 10 | Surgery | 27/01/2001 00:00:00 | M | 28800 |
| X10.3 | 65 | 10 | chemotherapy | 01/02/2001 00:00:00 | M | 36000 |
| X10.4 | 66 | 10 | chemotherapy | 03/02/2001 00:00:00 | M | 38880 |
| X10.5 | 67 | 10 | radiotherapy | 04/02/2001 00:00:00 | M | 40320 |
| X10.6 | 68 | 10 | radiotherapy | 06/02/2001 00:00:00 | M | 43200 |
| X10.7 | 69 | 10 | radiotherapy | 08/02/2001 00:00:00 | M | 46080 |
| X10.8 | 70 | 10 | radiotherapy | 10/02/2001 00:00:00 | M | 48960 |
| X10.9 | 71 | 10 | Medical Visit | 01/07/2001 00:00:00 | M | 251940 |
| X10.10 | 72 | 10 | death | 02/07/2001 00:00:00 | M | 253380 |
| X100 | 996 | 100 | Imaging | 15/01/2001 00:00:00 | M | 0 |
| X100.1 | 997 | 100 | Medical Visit | 18/01/2001 00:00:00 | M | 4320 |
| X100.2 | 998 | 100 | Imaging | 01/02/2001 00:00:00 | M | 24480 |

objDL.translated.02$getData()$original.CSV[1:20,]  %>%
  kbl() %>%
  kable_minimal()

| | pMineR.internal.ID.Evt | ID | Event | Date | Sex | pMineR.deltaDate |
|---|---|---|---|---|---|---|
| X1 | 1 | 1 | Diagnosis | 14/01/2001 00:00:00 | F | 0 |
| X1.1 | 2 | 1 | Diagnosis | 28/01/2001 00:00:00 | F | 20160 |
| X1.2 | 3 | 1 | Diagnosis | 09/02/2001 00:00:00 | F | 37440 |
| X1.3 | 4 | 1 | Diagnosis | 11/02/2001 00:00:00 | F | 40320 |
| X1.4 | 5 | 1 | Therapy | 23/02/2001 00:00:00 | F | 57600 |
| X1.5 | 6 | 1 | death | 06/03/2001 00:00:00 | F | 73440 |
| X10 | 62 | 10 | Diagnosis | 07/01/2001 00:00:00 | M | 0 |
| X10.1 | 63 | 10 | Diagnosis | 15/01/2001 00:00:00 | M | 11520 |
| X10.2 | 64 | 10 | Therapy | 27/01/2001 00:00:00 | M | 28800 |
| X10.3 | 65 | 10 | Therapy | 01/02/2001 00:00:00 | M | 36000 |
| X10.4 | 66 | 10 | Therapy | 03/02/2001 00:00:00 | M | 38880 |
| X10.5 | 67 | 10 | Therapy | 04/02/2001 00:00:00 | M | 40320 |
| X10.6 | 68 | 10 | Therapy | 06/02/2001 00:00:00 | M | 43200 |
| X10.7 | 69 | 10 | Therapy | 08/02/2001 00:00:00 | M | 46080 |
| X10.8 | 70 | 10 | Therapy | 10/02/2001 00:00:00 | M | 48960 |
| X10.9 | 71 | 10 | Diagnosis | 01/07/2001 00:00:00 | M | 251940 |
| X10.10 | 72 | 10 | death | 02/07/2001 00:00:00 | M | 253380 |
| X100 | 996 | 100 | Diagnosis | 15/01/2001 00:00:00 | M | 0 |
| X100.1 | 997 | 100 | Diagnosis | 18/01/2001 00:00:00 | M | 4320 |
| X100.2 | 998 | 100 | Diagnosis | 01/02/2001 00:00:00 | M | 24480 |