CFM_tutorial

Data Introduction:
CFM class initialization:
CFM class methods
- Compute graph: plotCFGraph() function
- Inferential Analysis: plotCFGraphComparison() function
CFM Limits

Data Introduction:

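# Load the packages used throughout this tutorial
library(pMineR)      # syntheticDataCreator, dataLoader, careFlowMiner
library(DiagrammeR)  # grViz
library(kableExtra)  # kbl, kable_minimal
library(magrittr)    # %>% pipe

# Generate a synthetic Event Log for 200 patients
objSDG <- syntheticDataCreator()
EventLog <- objSDG$cohort.RT4(numOfPat = 200)

# Show the first rows as a formatted table
head(EventLog) %>%
  kbl() %>%
  kable_minimal()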

ID	Event	Date	Priority	LINAC	Sex	Age	Hospital	Stop_Reason
1	Medical Visit	2001-01-06	3	TO3	Male	45	1	NA
1	Chemotherapy	2001-01-13	3	TO3	Male	45	1	NA
1	RT Start	2001-04-01	3	TO3	Male	45	1	NA
1	Clinical Suspension	2001-04-13	3	TO3	Male	45	1	1
1	Suspension	2001-04-20	3	TO3	Male	45	1	3
1	RT End	2001-05-08	3	TO3	Male	45	1	3

CFM class initialization:

The Care Flow Miner class builds a graph outlining the most frequent paths in an Event Log and supports inferential analysis on that graph.

The first step is to create an object of the Care Flow Miner class and load the Event Log into it.

The Event Log must be loaded into the CFM object using the list returned by the getData method of the dataLoader class. For this reason, we first create a dataLoader (DL) object and use it to read the data.

# DataLoader class initialization
obj.DL <- dataLoader(verbose.mode = FALSE)
obj.DL$load.csv(nomeFile = "../Data/EL_CFM_Demo.csv",
                IDName = "ID",
                EVENTName = "Event",
                dateColumnName = "Date",
                format.column.date = "%Y-%m-%d")

# obj.out is the list we need to load the data into the CFM class
obj.out <- obj.DL$getData()

# CareFlowMiner class initialization
objCFM <- careFlowMiner()
objCFM$loadDataset(obj.out)

CFM class methods

Compute graph: plotCFGraph() function

The plotCFGraph() method creates a graph representing the most typical paths in the Event Log.

The graph is built with the Careflow Miner (CFM) algorithm, which extracts the most “frequent” careflows from process data. The CFM algorithm is inspired by sequential pattern mining techniques. To assess the frequency of a particular trace, it relies on the notion of support. The support of a sequence S is the proportion NS/N, where NS is the number of patients experiencing S and N is the total number of patients in the analyzed population. We define “frequent” patterns as those with support above a user-defined threshold.

The other important parameter of the CFM algorithm is the “maximum length” parameter, which represents a constraint on the maximum number of events included in the careflow.
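The two parameters can be illustrated with a minimal base R sketch on toy data. The traces, the helper `sequence_support()`, and the reading of a careflow as the initial part of a trace are assumptions made for illustration; this is not the pMineR implementation, and whether the threshold comparison is inclusive is also assumed.

```r
# Toy traces: one ordered vector of events per patient (hypothetical data).
traces <- list(
  p1 = c("Medical Visit", "Chemotherapy", "RT Start"),
  p2 = c("Medical Visit", "Chemotherapy", "RT Start", "RT End"),
  p3 = c("Medical Visit", "RT Start", "RT End"),
  p4 = c("Medical Visit", "Chemotherapy", "Suspension")
)

# Support of a sequence s: share of patients whose trace starts with s
# (assumption: a careflow is read as the initial part of a trace).
sequence_support <- function(traces, s) {
  starts_with <- vapply(traces, function(tr) {
    length(tr) >= length(s) && identical(tr[seq_along(s)], s)
  }, logical(1))
  sum(starts_with) / length(traces)
}

sup_visit_chemo <- sequence_support(traces, c("Medical Visit", "Chemotherapy"))
sup_visit_rt    <- sequence_support(traces, c("Medical Visit", "RT Start"))

# A threshold expressed as a number of patients (assumed inclusive here).
abs_threshold    <- 2
keep_visit_chemo <- sup_visit_chemo * length(traces) >= abs_threshold
keep_visit_rt    <- sup_visit_rt * length(traces) >= abs_threshold

# The maximum length constraint: only the first max_len events are considered.
max_len   <- 2
truncated <- lapply(traces, function(tr) head(tr, max_len))
```

In this toy example three of the four patients share the sequence Medical Visit, Chemotherapy, so it passes the threshold, while Medical Visit, RT Start is followed by a single patient and does not.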

The plotCFGraph() function takes, among others, the following inputs:

abs.threshold: integer corresponding to the support threshold (default = NA, which is equivalent to a support threshold of 1)
depth: integer corresponding to the maximum length parameter (default = 2)

The plotCFGraph() function returns a list. The element used for plotting is “script”: passing it to the grViz function of the DiagrammeR package renders the process model.

plot.list<-objCFM$plotCFGraph(depth = Inf, abs.threshold = 1)
grViz(plot.list$script)

plot.list<-objCFM$plotCFGraph(depth = Inf, abs.threshold = 10)
grViz(plot.list$script)

plot.list<-objCFM$plotCFGraph(depth = Inf, abs.threshold = 40)
grViz(plot.list$script)

plot.list<-objCFM$plotCFGraph(depth = 3, abs.threshold = 10)
grViz(plot.list$script)

In the default input configuration of the plotCFGraph() function, the plotted node information includes only the event label and the number of patients that pass through the event.

Each edge is labelled with its support and confidence, calculated as follows:

the support is the number of patients who transition from a specific starting node to the next node, divided by the total number of patients who pass through that starting node;
the confidence is the number of patients who transition from the starting node to the next node, divided by the total number of patients.
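As a small worked example of the two edge ratios defined above, the following base R sketch uses made-up counts; all numbers and variable names are hypothetical.

```r
n_total      <- 200  # patients in the whole population (hypothetical)
n_start_node <- 80   # patients passing through the starting node
n_transition <- 20   # patients moving from the starting node to the next node

# Ratio relative to the starting node (the first definition in the text above).
ratio_vs_start_node <- n_transition / n_start_node

# Ratio relative to the whole population (the second definition in the text above).
ratio_vs_population <- n_transition / n_total
```

With these counts, a quarter of the patients at the starting node follow the edge, which corresponds to a tenth of the whole population.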

It is possible to enrich the graph with additional information.

Let’s assume, for example, that we want to show the ID of each node and the median time elapsed from the “root” node. Additionally, we want to color the graph according to that median time from root (darker colors represent longer median times).

out.list <- objCFM$plotCFGraph(depth = Inf, abs.threshold = 2,
                               printNodeID = TRUE,
                               show.median.time.from.root = TRUE,
                               heatmap.based.on.median.time = c(10, 50, 100),
                               heatmap.base.color = "Gold")
grViz(out.list$script)
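To make the idea of “median time from root” concrete, here is a minimal base R sketch on a toy Event Log. It assumes that the root corresponds to each patient's first event and that each event label maps to a single node, which is a simplification of the CFM graph; the data and column `days_from_root` are hypothetical.

```r
# Hypothetical toy event log with two patients.
toy_log <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 2),
  Event = c("Medical Visit", "Chemotherapy", "RT Start",
            "Medical Visit", "Chemotherapy", "RT Start"),
  Date  = as.Date(c("2001-01-06", "2001-01-13", "2001-04-01",
                    "2001-02-01", "2001-02-15", "2001-03-03")),
  stringsAsFactors = FALSE
)

# Days elapsed since each patient's first event (taken here as the root).
first_date <- ave(as.numeric(toy_log$Date), toy_log$ID, FUN = min)
toy_log$days_from_root <- as.numeric(toy_log$Date) - first_date

# Median days from root for each event label.
median_days <- tapply(toy_log$days_from_root, toy_log$Event, median)
```

A longer median, such as the one for RT Start in this toy data, is the kind of value that would map to a darker color in the heat map.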

Through specific inputs, the plotCFGraph() function can also enrich the graph with the probability of reaching a certain future state. In the following call, the future state of interest is “Technical Suspension”:

out.list <- objCFM$plotCFGraph(depth = Inf, abs.threshold = 2,
                               predictive.model = TRUE,
                               predictive.model.outcome = "Technical Suspension",
                               arr.States.color = c("Technical Suspension" = "Red"))

grViz(out.list$script)

Inferential Analysis: plotCFGraphComparison() function

The CFM implementation in pMineR extends the original technique with several features intended to combine the benefits of Process Discovery with those of inferential statistics.

This is accomplished by splitting the population into two sub-cohorts according to the value of a specific event attribute. The Care Flow Mining algorithm is then applied to each sub-cohort, producing two different outputs. The two resulting CFMs can be compared on several parameters:

Number of patients for each of the two sub-cohorts passing through the nodes. This results in the node-by-node creation of a contingency matrix on which, depending on the observed cardinality, either a Fisher’s exact test or a Pearson’s Chi-square test is applied. If the p-value for that node is lower than the threshold entered in the “fisher.threshold” input, the node will be colored in yellow.

inf.out1<-objCFM$plotCFGraphComparison(depth = Inf,abs.threshold = 10,
                                        stratifyFor = "Priority",
                                        stratificationValues = c("1","4"),
                                        fisher.threshold = 0.005,
                                        kindOfGraph = "dot",nodeShape = "square")
grViz(inf.out1$script)

inf.out2<-objCFM$plotCFGraphComparison(depth = Inf,abs.threshold = 10,
                                        stratifyFor = "LINAC",
                                        stratificationValues = c("CY","TO3"),
                                        fisher.threshold = 0.005,
                                        kindOfGraph = "dot",nodeShape = "square")
grViz(inf.out2$script)
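The node-level test described above can be sketched in base R for a single node. The counts are invented, and the rule used to choose between the two tests (Fisher's exact test when any expected count is below 5) is a common rule of thumb assumed here, not necessarily the criterion used by pMineR.

```r
# Hypothetical counts at one node: patients passing / not passing, per sub-cohort.
node_table <- matrix(
  c(12, 38,   # sub-cohort A: pass, do not pass
    30, 20),  # sub-cohort B: pass, do not pass
  nrow = 2, byrow = TRUE,
  dimnames = list(cohort = c("A", "B"), node = c("pass", "no_pass"))
)

# Assumed rule of thumb: Fisher's exact test when any expected count is below 5.
expected   <- outer(rowSums(node_table), colSums(node_table)) / sum(node_table)
use_fisher <- any(expected < 5)

p_value <- if (use_fisher) {
  fisher.test(node_table)$p.value
} else {
  chisq.test(node_table)$p.value
}

# The node is highlighted when the p-value falls below the chosen threshold.
fisher_threshold <- 0.005
highlight_node   <- p_value < fisher_threshold
```

With these counts every expected cell is well above 5, so the Chi-square test is used, and the difference between the two sub-cohorts is large enough for the node to be highlighted.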

Time required to reach each node. The same steps as in the previous point are followed, but the Mann-Whitney test is used to determine whether the distribution of times differs between the two sub-cohorts.

inf.out3<-objCFM$plotCFGraphComparison(depth = Inf,abs.threshold = 10,
                                       stratifyFor = "LINAC",
                                       stratificationValues = c("CY","TO3"),
                                       checkDurationFromRoot = TRUE,
                                       fisher.threshold = 0.005,
                                       kindOfGraph = "dot",nodeShape = "square")
grViz(inf.out3$script)
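For a single node, the time comparison can be sketched in base R with `wilcox.test()`, which performs the Mann-Whitney test when given two independent samples. The durations below are invented, and the 0.005 threshold simply mirrors the value used in the call above.

```r
# Hypothetical days needed to reach the same node in two sub-cohorts.
days_cohort_a <- c(12, 15, 14, 20, 18, 16, 13, 17)
days_cohort_b <- c(25, 30, 28, 22, 35, 27, 31, 26)

# wilcox.test() on two independent samples is the Mann-Whitney test.
mw <- wilcox.test(days_cohort_a, days_cohort_b)

# Flag the node when the two time distributions differ significantly.
differs <- mw$p.value < 0.005
```

Here every patient in the first sub-cohort reaches the node earlier than every patient in the second, so the test statistic is 0 and the difference is flagged.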

Number of patients passing through the nodes, among those who will later experience a specific “future state”. Unlike the first analysis, the comparison for each sub-cohort is not made on the entire population transiting a node, but only on the subset of patients who will later experience the event specified as the “future state”. Again, either a Fisher’s exact test or a Pearson’s Chi-square test is applied.

inf.out4 <- objCFM$plotCFGraphComparison(depth = Inf, abs.threshold = 10,
                                         stratifyFor = "LINAC",
                                         arr.stratificationValues.A = c("CY", "VE"),
                                         arr.stratificationValues.B = c("TO", "TO3"),
                                         hitsMeansReachAGivenFinalState = TRUE,
                                         fisher.threshold = 0.005,
                                         finalStateForHits = "Technical Suspension",
                                         arr.States.color = c("Technical Suspension" = "Red"),
                                         kindOfGraph = "dot", nodeShape = "square")
grViz(inf.out4$script)
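One possible reading of the “future state” comparison, for a single node, is sketched below in base R. The per-patient table is invented, and the way the contingency table is built (patients at the node, cross-tabulated by sub-cohort and by whether they later reach the state) is an assumption made for illustration rather than a description of pMineR internals.

```r
# Hypothetical per-patient summary for one node.
patients <- data.frame(
  cohort        = rep(c("A", "B"), each = 10),
  passes_node   = c(rep(TRUE, 8), rep(FALSE, 2), rep(TRUE, 6), rep(FALSE, 4)),
  reaches_state = c(rep(TRUE, 6), rep(FALSE, 4), rep(TRUE, 1), rep(FALSE, 9))
)

# Keep only the patients who pass through the node.
at_node <- patients[patients$passes_node, ]

# Cross-tabulate sub-cohort against reaching the future state later on.
hits_table <- table(cohort = at_node$cohort,
                    reaches_state = at_node$reaches_state)

# Counts are small, so Fisher's exact test is used in this sketch.
p_value <- fisher.test(hits_table)$p.value
```

In this toy data 6 of 8 patients at the node reach the state in sub-cohort A against 1 of 6 in sub-cohort B, but with so few patients the difference does not fall below a strict threshold such as 0.005.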

CFM Limits

Here is an example of a use case in which the Care Flow Miner is not the preferred Process Discovery algorithm. In this example, the initial phase of the process shows a lot of heterogeneity among events. This creates many branches early in the graph, and because each branch is kept separate, the final parts of the workflow cannot be merged even though they are similar.

obj.DC<-syntheticDataCreator()
new.DL<-obj.DC$cohort.RT(numOfPat = 150,giveBack = "dataLoader")
new.out<-new.DL$getData()
new.CFM<-careFlowMiner()
new.CFM$loadDataset(new.out)

script<-new.CFM$plotCFGraph(depth = Inf,abs.threshold = 10)
grViz(script$script)

In this case, the model generated by the First Order Markov Model (FOMM) algorithm could be a better choice.

objFOMM<-firstOrderMarkovModel()
objFOMM$loadDataset(new.out)
objFOMM$trainModel()
grViz(objFOMM$getModel(kindOfOutput = "grViz"))
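To see why a first-order Markov model can merge paths that the CFM keeps apart, the following base R sketch estimates transition probabilities from toy traces. The traces are hypothetical and the code is an illustration of the general technique, not the pMineR implementation.

```r
# Hypothetical traces with different beginnings but a similar ending.
traces <- list(
  c("Medical Visit", "Chemotherapy", "RT Start", "RT End"),
  c("Medical Visit", "RT Start", "RT End"),
  c("Medical Visit", "Chemotherapy", "RT Start", "Suspension", "RT End")
)

# Collect every consecutive (from, to) pair across all traces.
pairs <- do.call(rbind, lapply(traces, function(tr) {
  data.frame(from = head(tr, -1), to = tail(tr, -1), stringsAsFactors = FALSE)
}))

# Count the transitions, then row-normalise so that each row holds
# the estimated probabilities of the next event.
counts     <- table(pairs$from, pairs$to)
trans_prob <- prop.table(counts, margin = 1)
```

Because only the current event matters, the transition from RT Start to RT End is a single edge regardless of what happened before it, whereas a careflow tree would repeat it on every branch.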