Show code

knitr::opts_chunk$set(warning= FALSE, message = FALSE)

1.0 Introduction

This assignment is a subset of the bigger Shiny-based Visual Analytics Application project, Understanding Prime Mover (PM) Waiting Time in Yard.

In this markdown blog, there will be 3 components 1) Introduction & Literature Review 2) Exploratory data analysis via treemaps, pie / bar charts, histograms, and parallel plots 3) Uncertainty via confidence intervals with box / violin plots

2.0 Introduction & Literature Review

2.1 Introduction

Maritime trade has been the backbone of international trade as it accounts approximately 92% of world trade (Xu, et al, 2020). In terms of value, global seaborne container trade approximately accounts for 60% of world seaborne trade, equivalent to 7.2 trillion US dollars out of 12 trillion US dollars (Statista, 2020). As the world’s busiest transshipment port, PSA Singapore handles about one-seventh of the world’s transshipped containers. In 2020, PSA Singapore and its container terminals handled 36.6 million TEUs (twenty foot equivalent units) of containers (PSA,2020). Upon a vessel’s arrival at a berth in the container terminal, containers are discharged from and loaded onto it. A typical discharging operation starts with a quay crane picking up a container from the vessel and placing it onto a prime mover (PM), which will then transport the container to a storage yard. At the yard, a yard crane picks up the container from the PM and shifts it to a designated spot. Loading operations involve the transporting of containers in the opposite direction, from the yard to the vessel (Ku, 2018).

Due to the complexity and volume of PSA operations, small productivity gains can bring about large savings to the PSA organisation. Approximately 91,000 PM actions are carried out on an average day in the port. Assuming 10 seconds are saved for each action, this is equivalent to 252 manhours saved per day, or approximately 92,000 hours saved annually.

2.2 Literature Review

Current literature mainly focuses on yard crane productivity (Li, et al, n.d.) or optimising the number of PMs and trucks to reduce average PM waiting time (Staats, n.d.). These previous studies do not focus on existing operations and the various factors, such as the type of containers, dangerous goods containers, and time of day, which may contribute to waiting time. Considering that travel time is fixed due to the speed limit of the prime mover, more emphasis should be used to determine the factors that might cause waiting time, as waiting time is a non-value adding activity.

Current PSA dashboards and visuliasations were created many years ago and have not been recently updated. These visualisations are high level visualisations that aggregate the data into an average productivity by PM faceted by Terminal ID.

This first graph showcases the average PM productivity across all terminals.

The details of each movement has been lost in the aggregation. Only a simple average is present. There are also no counts being presented in the current visualisation, thus, the user cannot check the total number of movements within the dataset. As terminals operate independently of each other, productivity across terminals are not related, therefore it makes little sense to visualise the productivity values as a line graph. There is no legend to describe what each line represents (red = ship ops, orange / purple = ITH, blue = shifting, black = overall). Colours are inconsistent, as different colours represent the same type of movement.

The percentage for each component of the stacked bar chart is not shown, leaving the reader to figure out the approximate percentage of each productivity category. As this view is too aggregated, there is limited possibility of deriving insights from why different terminals are performing differently. Every observation is based on a simple average of productivities.

The categories are too wide in this view, considering that the <15 category takes up to at least 75% of movements in each terminal. More granularity should be introduced, with more categories such as “0-5”, “5-10”, and “10-15”, which would give a better breakdown of waiting time.

Overall, there is no statistical analysis or clear visualisations being put in place in current operations. There are also no parameters that can be adjusted to suit the specific user, with only a one-size-fits-all visualisation available.

3.0 Prototyping

The main goal for the assignment is to explore various R packages and visual designs to determine what will be retained in the final Shiny environment. Due to the size of the dataset, data will be segmented from the full dataset to improve performance.

3.1 Data Preparation

Firstly, data preparation occurs. Data from March 2019 was extracted from the PSA operations database and anonymised using JMP Pro 15. R Packages are loaded using the following code for preparation.

Show code

# Preparing packages
packages = c("tidyverse", "ggstatsplot", "ggpubr", "corrplot","heatmaply", "dendextend", "seriation", "readr" , "parallelPlot", "treemap", "ggthemes", "RColorBrewer", "forcats", "GGally", "viridis", "parallelPlot", "plotly", "lubridate", "d3treeR", "knitr", "htmlwidgets", "htmltools")

for (p in packages){
    if(!require(p, character.only = T)){
        install.packages(p)
    }
    library(p, character.only = T)
}

The data is read and stored as a variable “PM” using the read_delim() function of the readr package. glimpse from the tidyverse package is used to check which variables need to be transformed or cleaned.

Show code

# Reading the csv file
PM <- read_delim("data/PM_201903.txt", delim = ",")

# Inspecting the file structure
glimpse(PM)

Rows: 5,327,083
Columns: 35
$ SHIFT_D          <dbl> 20190330, 20190330, 20190330, 20190330, 201~
$ Terminal_ID      <chr> "V2_0", "V2_0", "V2_0", "V2_0", "V2_0", "V2~
$ PM_N             <chr> "V3_0", "V3_0", "V3_0", "V3_0", "V3_0", "V3~
$ Chassis_N        <chr> "V4_0", "V4_0", "V4_0", "V4_0", "V4_0", "V4~
$ PM_Driver_IC_N   <chr> "V5_0", "V5_0", "V5_0", "V5_0", "V5_0", "V5~
$ EVENT_C          <chr> "DISC", "EQOF", "DISC", "EQOF", "DISC", "DI~
$ EVENT_DT         <chr> "2019/03/30 2:06:59.000 pm", "2019/03/30 2:~
$ EVENT_SHIFT_I    <chr> "D", "D", "D", "D", "D", "D", "D", "N", "N"~
$ MOVE_OP_C        <chr> "O", "O", "O", "O", "O", "O", "O", "O", "O"~
$ CNTR_N           <chr> "V17_0", "V17_0", "V17_1", "V17_1", "V17_2"~
$ LENGTH_Q         <chr> "40", "40", "40", "40", "40", "20", "20", "~
$ CNTR_TYPE_C      <chr> "GP", "GP", "RF", "RF", "GP", "GP", "GP", "~
$ CAT_C            <chr> "HC", "HC", "HR", "HR", "GP", "GP", "GP", "~
$ PURP_C           <chr> "I", "I", "T", "T", "T", "T", "T", "T", "T"~
$ CNTR_ST_C        <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F"~
$ CNTR_OPR_C       <chr> "V25_0", "V25_0", "V25_1", "V25_1", "V25_2"~
$ DG_I             <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"~
$ REEFER_I         <chr> "N", "N", "Y", "Y", "N", "N", "N", "N", "N"~
$ UC_I             <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"~
$ OVER_SIZE_I      <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"~
$ DG_GROUP_C       <chr> "NULL", "NULL", "NULL", "NULL", "NULL", "NU~
$ DG_IMO_CLASS_C   <chr> "NULL", "NULL", "NULL", "NULL", "NULL", "NU~
$ Equipment_N      <chr> "V42_0", "V42_1", "V42_2", "V42_3", "V42_0"~
$ EQUIPMENT_TYPE_C <chr> "QC", "RMG", "QC", "RMG", "QC", "QC", "RMG"~
$ Opr_User_ID      <chr> "V44_0", "V44_1", "V44_2", "V44_1", "V44_0"~
$ Terminal         <chr> "V48_0", "V48_0", "V48_0", "V48_0", "V48_0"~
$ `Berth/BLK`      <chr> "V49_0", "V49_1", "V49_2", "V49_3", "V49_0"~
$ `SLOT/ROW`       <chr> "V50_0", "V50_1", "V50_0", "V50_2", "V50_0"~
$ PM_DISTANCE_Q    <dbl> 545, 740, 705, 750, 1060, 0, 2082, 415, 390~
$ PM_TRAVEL_TIME_Q <dbl> 2.18, 2.96, 2.82, 3.00, 3.13, 0.00, 8.33, 1~
$ PM_WAIT_TIME_Q   <dbl> 3.09, 3.49, 9.48, 17.40, 0.00, 0.00, 15.89,~
$ PM_TIME_DIFF_Q   <dbl> 5.27, 6.45, 12.30, 20.40, 3.13, 0.00, 24.22~
$ BATCH_ID         <dbl> 20190330, 20190330, 20190330, 20190330, 201~
$ LOGON_DT         <chr> "2019/03/30 10:29:41.000 am", "2019/03/30 1~
$ LOGOFF_DT        <chr> "2019/03/30 5:39:35.000 pm", "2019/03/30 5:~

The dataset is filtered using pipes from the tidyverse package, filtering out the events “EQMT”, equipment mount to PM from yard crane, and “EQOF”, equipment offload from PM to yard crane, respectively. The other movement types in the dataset are miscellaneous movements. The ymd_hms, date, and hour functions from the lubridate package is used to transform the original string data in EVENT_DT to the correct datetime format. Only EVENT_DT was transformed as log-on or log-off time is irrelevant to the analysis.

Show code

# Ensuring that data is in the correct format
PM <- PM %>% 
  filter(EVENT_C == "EQMT" | EVENT_C == "EQOF")


PM$EVENT_DT <- ymd_hms(PM$EVENT_DT)
PM$day <- date(PM$EVENT_DT)
PM$hour <- hour(PM$EVENT_DT)

The dataset is transformed and aggregated and grouped with the tidyverse package, according to the variables that we want to explore further. A summarise_if function is used to sum up the numeric variables. glimpse is used to check the variables again.

Show code

## aggregating waiting time 

PM_AGG <- PM %>%
  group_by(Terminal, PM_N, EVENT_SHIFT_I, MOVE_OP_C, LENGTH_Q, CNTR_TYPE_C, DG_I, REEFER_I, OVER_SIZE_I, EQUIPMENT_TYPE_C, day, hour) %>%
  summarise_if(is.numeric, sum) %>%
  ungroup()

glimpse(PM_AGG)

Rows: 1,862,748
Columns: 18
$ Terminal         <chr> "V48_0", "V48_0", "V48_0", "V48_0", "V48_0"~
$ PM_N             <chr> "V3_0", "V3_0", "V3_0", "V3_0", "V3_0", "V3~
$ EVENT_SHIFT_I    <chr> "D", "D", "D", "D", "D", "D", "D", "D", "D"~
$ MOVE_OP_C        <chr> "C", "C", "C", "C", "O", "O", "O", "O", "O"~
$ LENGTH_Q         <chr> "20", "20", "40", "40", "20", "20", "20", "~
$ CNTR_TYPE_C      <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "~
$ DG_I             <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"~
$ REEFER_I         <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"~
$ OVER_SIZE_I      <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"~
$ EQUIPMENT_TYPE_C <chr> "RMG", "RMG", "RMG", "RMG", "FTR", "FTR", "~
$ day              <date> 2019-03-17, 2019-03-17, 2019-03-17, 2019-0~
$ hour             <int> 10, 11, 8, 16, 15, 12, 13, 12, 12, 13, 11, ~
$ SHIFT_D          <dbl> 40380634, 20190317, 20190317, 20190325, 201~
$ PM_DISTANCE_Q    <dbl> 2650, 885, 5241, 5014, 1230, 375, 725, 770,~
$ PM_TRAVEL_TIME_Q <dbl> 10.60, 3.54, 20.96, 20.06, 4.92, 1.50, 2.90~
$ PM_WAIT_TIME_Q   <dbl> 21.25, 46.74, 1.54, 1.71, 5.66, 29.67, 9.68~
$ PM_TIME_DIFF_Q   <dbl> 31.85, 50.28, 12.72, 21.77, 10.58, 31.17, 1~
$ BATCH_ID         <dbl> 40380634, 20190317, 20190317, 20190325, 201~

The dataset is then cleaned and transformed. as.Date is used to change “day” to a date. as.factor to transform the other character inputs into factors. fct_collapse from the forcats function is used to combine, clean, and rename the categories in the dataset. This is to reduce the number of NULLs and zeros within the dataset. It also helps to make many of the variables, such as DG or Reefer, into more understandable categories.

An additional variable was created, called “CNTR_spec” to combine the container length, DG status, and Reefer status into 1 factor. These 3 variables can generally describe the condition of most containers that PSA handles. A final glimpse is used to confirm the statuses of the variables.

Show code

## cleaning
PM_AGG$day <- as.Date(PM_AGG$day)

PM_AGG[,c(1:11)] <- lapply(PM_AGG[,c(1:11)],as.factor)
PM_AGG$hour <- as.factor(PM_AGG$hour)

PM_AGG$DG <- fct_collapse(PM_AGG$DG_I, 
                            DG = "Y",
                            STD = c("N", "NULL"))
PM_AGG$Reefer <- fct_collapse(PM_AGG$REEFER_I, 
                            REEF = "Y",
                            AMB = "N")
PM_AGG$LENGTH_Q <- fct_collapse(PM_AGG$LENGTH_Q, 
                            "20" = "20",
                            "40" = "40", 
                            "Others" = c("NULL", "00", "45"))
PM_AGG$OVER_SIZE_I <- fct_collapse(PM_AGG$OVER_SIZE_I, 
                            "Yes" = "Y",
                            "No" = c("N", "NULL"))

## Merging Container specs 
PM_AGG$CNTR_spec <- paste(PM_AGG$LENGTH_Q,PM_AGG$DG,PM_AGG$Reefer, sep = "-")
PM_AGG$CNTR_spec <- as.factor(PM_AGG$CNTR_spec)
PM_AGG$day <- date(PM_AGG$day)

glimpse(PM_AGG)

Rows: 1,862,748
Columns: 21
$ Terminal         <fct> V48_0, V48_0, V48_0, V48_0, V48_0, V48_0, V~
$ PM_N             <fct> V3_0, V3_0, V3_0, V3_0, V3_0, V3_0, V3_0, V~
$ EVENT_SHIFT_I    <fct> D, D, D, D, D, D, D, D, D, D, D, D, D, D, D~
$ MOVE_OP_C        <fct> C, C, C, C, O, O, O, O, O, O, O, O, O, O, O~
$ LENGTH_Q         <fct> 20, 20, 40, 40, 20, 20, 20, 20, 20, 20, 20,~
$ CNTR_TYPE_C      <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,~
$ DG_I             <fct> N, N, N, N, N, N, N, N, N, N, N, N, N, N, N~
$ REEFER_I         <fct> N, N, N, N, N, N, N, N, N, N, N, N, N, N, N~
$ OVER_SIZE_I      <fct> No, No, No, No, No, No, No, No, No, No, No,~
$ EQUIPMENT_TYPE_C <fct> RMG, RMG, RMG, RMG, FTR, FTR, FTR, FTR, QC,~
$ day              <date> 2019-03-17, 2019-03-17, 2019-03-17, 2019-0~
$ hour             <fct> 10, 11, 8, 16, 15, 12, 13, 12, 12, 13, 11, ~
$ SHIFT_D          <dbl> 40380634, 20190317, 20190317, 20190325, 201~
$ PM_DISTANCE_Q    <dbl> 2650, 885, 5241, 5014, 1230, 375, 725, 770,~
$ PM_TRAVEL_TIME_Q <dbl> 10.60, 3.54, 20.96, 20.06, 4.92, 1.50, 2.90~
$ PM_WAIT_TIME_Q   <dbl> 21.25, 46.74, 1.54, 1.71, 5.66, 29.67, 9.68~
$ PM_TIME_DIFF_Q   <dbl> 31.85, 50.28, 12.72, 21.77, 10.58, 31.17, 1~
$ BATCH_ID         <dbl> 40380634, 20190317, 20190317, 20190325, 201~
$ DG               <fct> STD, STD, STD, STD, STD, STD, STD, STD, STD~
$ Reefer           <fct> AMB, AMB, AMB, AMB, AMB, AMB, AMB, AMB, AMB~
$ CNTR_spec        <fct> 20-STD-AMB, 20-STD-AMB, 40-STD-AMB, 40-STD-~

As this is a prototype, a date filter was added to the dataset using the filter function from tidyverse to reduce the number of data rows being processed to increase processing speed. The final Shiny app would have the entire dataset. Using this method, the number of rows being processed is cut down from 1.7 million entries to approximately 90 thousand. The PM_group dataset will be the main dataset used in the prototype.

Show code

PM_AGG %>% 
  group_by(LENGTH_Q) %>%
  tally()

# A tibble: 3 x 2
  LENGTH_Q       n
  <fct>      <int>
1 Others     14351
2 20        731286
3 40       1117111

Show code

upper_date <- date("2019-03-05")
lower_date <- date("2019-03-02")

PM_group <- PM_AGG %>% 
  filter(between(day, lower_date, upper_date))

head(PM_group)

# A tibble: 6 x 21
  Terminal PM_N  EVENT_SHIFT_I MOVE_OP_C LENGTH_Q CNTR_TYPE_C DG_I 
  <fct>    <fct> <fct>         <fct>     <fct>    <fct>       <fct>
1 V48_0    V3_0  D             O         20       GP          N    
2 V48_0    V3_0  D             O         20       GP          N    
3 V48_0    V3_0  D             O         20       GP          N    
4 V48_0    V3_0  D             O         20       GP          N    
5 V48_0    V3_0  D             O         20       GP          N    
6 V48_0    V3_0  D             O         20       GP          N    
# ... with 14 more variables: REEFER_I <fct>, OVER_SIZE_I <fct>,
#   EQUIPMENT_TYPE_C <fct>, day <date>, hour <fct>, SHIFT_D <dbl>,
#   PM_DISTANCE_Q <dbl>, PM_TRAVEL_TIME_Q <dbl>,
#   PM_WAIT_TIME_Q <dbl>, PM_TIME_DIFF_Q <dbl>, BATCH_ID <dbl>,
#   DG <fct>, Reefer <fct>, CNTR_spec <fct>

Show code

glimpse(PM_group)

Rows: 233,679
Columns: 21
$ Terminal         <fct> V48_0, V48_0, V48_0, V48_0, V48_0, V48_0, V~
$ PM_N             <fct> V3_0, V3_0, V3_0, V3_0, V3_0, V3_0, V3_0, V~
$ EVENT_SHIFT_I    <fct> D, D, D, D, D, D, D, D, D, D, D, D, D, D, D~
$ MOVE_OP_C        <fct> O, O, O, O, O, O, O, O, O, O, O, O, O, O, O~
$ LENGTH_Q         <fct> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 40,~
$ CNTR_TYPE_C      <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, TK, GP,~
$ DG_I             <fct> N, N, N, N, N, N, N, N, N, Y, N, N, N, N, N~
$ REEFER_I         <fct> N, N, N, N, N, N, N, N, N, N, N, N, N, N, N~
$ OVER_SIZE_I      <fct> No, No, No, No, No, No, No, No, No, No, No,~
$ EQUIPMENT_TYPE_C <fct> FTR, RMG, RMG, RMG, RMG, RMG, RMG, RMG, RMG~
$ day              <date> 2019-03-02, 2019-03-02, 2019-03-03, 2019-0~
$ hour             <fct> 15, 13, 11, 12, 18, 7, 8, 12, 13, 12, 8, 13~
$ SHIFT_D          <dbl> 20190302, 20190302, 20190303, 40380606, 201~
$ PM_DISTANCE_Q    <dbl> 1230, 2650, 2220, 3510, 2603, 860, 0, 725, ~
$ PM_TRAVEL_TIME_Q <dbl> 4.92, 10.60, 8.88, 14.04, 10.41, 3.44, 0.00~
$ PM_WAIT_TIME_Q   <dbl> 5.66, 0.07, 2.84, 10.90, 3.39, 18.18, 14.98~
$ PM_TIME_DIFF_Q   <dbl> 10.58, 10.67, 11.72, 24.94, 13.80, 18.38, 1~
$ BATCH_ID         <dbl> 20190302, 20190302, 20190303, 40380606, 201~
$ DG               <fct> STD, STD, STD, STD, STD, STD, STD, STD, STD~
$ Reefer           <fct> AMB, AMB, AMB, AMB, AMB, AMB, AMB, AMB, AMB~
$ CNTR_spec        <fct> 20-STD-AMB, 20-STD-AMB, 20-STD-AMB, 20-STD-~

3.2 Exploratory Data Analysis

Exploratory data analysis will be the first component of the Shiny application to provide a general overview of the performance and breakdown of PM waiting time.

3.2.1 Treemap

One of the issues with the current visualisations is that there is no form of breakdown for the type of containers or number of containers handled in each terminal. Thus, a treemap was conceptualised as a good way to visualise the size and scale that each factor contributes to the waiting time. The treemap function from the treemap package was used to create the visualisations below.

“PM_WAIT_TIME_Q” is used as the base for the size of the category, within the vSize argument in the treemap function. Initial testing used the variables as strings, such as “LENGTH_Q” being mapped directly for the index argument. Further testing allowed for mapping the index directly with variables. These variables are used as a proxy for the R Shiny input values, and can be mapped more easily into the code chunk. However, more testing will need to be carried out to see if a checklist can be used to increase the number of categories being mapped in the treemap at any one time.

A colour palette, Blues from the RColorPalette package was used.

Show code

index1 <- "LENGTH_Q"
index2 <-  "EVENT_SHIFT_I"
vsize <- "PM_WAIT_TIME_Q"




## treemap for waiting time, group by LENGTH_Q
tree_len <- treemap(PM_group, index = c(index1, index2), vSize = vsize, palette = "Blues", title = "Treemap for PM Waiting time by CNTR Specifications and Shift", fontsize.labels = c(15,10))

Show code

index3 <- "CNTR_spec"
## treemap for waiting time, group by terminal ID
tree_cntr <- treemap(PM_group, index = c(index3, index2), vSize = vsize, palette = "Blues", title = "Treemap for PM Waiting time by CNTR Specifications and Shift", fontsize.labels = c(15,10))

Show code

## treemap for waiting time, group by terminal ID

index4 <- "Terminal"

tree_term <- treemap(PM_group, index = c(index3, index4, index2), vSize = vsize, palette = "Blues", title = "Treemap for PM Waiting time by CNTR Specifications and Shift", fontsize.labels = c(15,10))

Show code

## treemap mapping cntr spec 
index5 <- c("LENGTH_Q", "DG", "Reefer")
tree_DG <- treemap(PM_group, index = index5, vSize = vsize, palette = "Blues", title = "Treemap for PM Waiting time by CNTR Specifications and Shift", fontsize.labels = c(15,10))

Interactivity is introduced using the d3tree2 function in the d3treeR package. This helps the user to explore the categories that they are more interested in. One of the treemaps that was created above was mampped into the d3tree2 function. In the shiny app, this output will be shown below the initial treemap for easy reference.

Show code

treemap_inter <- d3tree2(tree_term, rootname = "Container Types")

treemap_inter

As can be seen from the treemaps, 40 footer general purpose (STD + AMB) containers and 20 footer GP containers are the most common type of container that the port handles. Based on visual analysis, Terminval v48_4 operates the most number of 40ft standard, ambient containers.

3.2.2 Bar chart

The next component to introduce as a visualisation are bar charts. These bar charts enable users to visualise the operations in a deeper level of detail versus the treemap. The treemap is envisioned to be something that can be looked at quickly and check rough estimates. The bar charts will allow for greater manipulation of parameters to ensure that users can get the information that they want.

The bar charts are created using ggplot from the ggplot2 package. aes_string is used for aesthetics as it enables us to use variables and strings to map the aesthetics of the plot. This will help when converting the code from a markdown document to a Shiny document.

To create the base bar chart, variables were mapped for the x-axis, bar_var, and the fill argument, bar_fill. A geom_bar was added, with the position argument defined as position_dodge. A theme, theme_few was mapped. The facet_wrap layer is mapped using a variable, bar_wrap, and put into the argument using the as.formula and paste functions. This allows all arguments in the ggplot function as variables. A variable, bar_scale was also mapped to change the scale type of the y-axis when faceting.

Multiple bar charts were created to test out various configurations. Interactivity is added to the chart by passing the named plot through ggplotly from the ggplot2 package. No numbers are placed on the charts to reduce overplotting. The count information is added into the tooltip automatically by ggplotly, thus the decision was made to not include any text.

Due to a limitation on plotly and chart size, the plotly chart was converted into a html file using saveWidget from the htmlwidgets package, and then embedded into the markdown document using the tags$iframe function. This reduces loading time and performance issues on the webpage. This step is repeated for all plotly plots.

Show code

bar_var <- "LENGTH_Q"
bar_fill <- "LENGTH_Q"
bar_wrap <- "EVENT_SHIFT_I"
bar_type <- "dodge"

bar_count <- ggplot(data = PM_group, aes_string(x = bar_var, fill = bar_fill))+
  geom_bar(position = bar_type, width = 0.5)+
  labs(title = "Count of Length of Container")+
  theme_few()+
  facet_wrap(as.formula(paste("~", bar_wrap)))+
  scale_fill_brewer(palette = "BuPu")


p1 <- ggplotly(bar_count)


saveWidget(p1, "p1.html", selfcontained = F, libdir = "lib")
tags$iframe(
  src = "p1.html", 
  scrolling = "yes", 
  seamless = "seamless",
  frameBorder = "0",
  width="500",
  height="500"
)

Show code

## changing variables and changing bar type

bar_var2 <- "DG"
bar_wrap2 <- "Terminal"
bar_type2 <- "stack"
bar_scale <- "free_y"

bar_DG <- ggplot(data = PM_group, aes_string(x = bar_var2, fill = bar_fill))+
  geom_bar(position = bar_type2)+
  labs(title = "Count of Length of Container by DG Status by Terminal ID")+
  theme_few()+
  facet_wrap(as.formula(paste("~", bar_wrap2)), scales = bar_scale)+
  scale_fill_brewer(palette = "BuPu")

p2 <- ggplotly(bar_DG)

saveWidget(p2, "p2.html", selfcontained = F, libdir = "lib")
tags$iframe(
  src = "p2.html", 
  scrolling = "yes", 
  seamless = "seamless",
  frameBorder = "0",
  width="500",
  height="500"
)

Show code

## free y axis for barID, facet by terminal ID 

bar_fill2 <- "MOVE_OP_C"
bar_scale2 <- "fixed"

bar_ID <- ggplot(data = PM_group, aes_string(x = bar_var, fill = bar_fill2))+
  geom_bar(position = position_dodge(), alpha = 0.75, width = 0.5)+ 
  labs(title = "Count of Length of Container by Move ID by Terminal ID")+
  theme_few()+
  facet_wrap(as.formula(paste("~", bar_wrap2)), scales = bar_scale2)+
  scale_fill_brewer(palette = "BuPu")

p3 <- ggplotly(bar_ID)

saveWidget(p3, "p3.html", selfcontained = F, libdir = "lib")
tags$iframe(
  src = "p3.html", 
  scrolling = "yes", 
  seamless = "seamless",
  frameBorder = "0",
  width="500",
  height="500"
)

plot_ly from the plotly package was also experimented with, however, it does not allow for faceting without using lapply or and the subplot function, which may cause problems in the Shiny environment. As such, ggplotly will be used instead for the final chart.

Show code

plot_index1 <- "LENGTH_Q"
plot_fill1 <- "MOVE_OP_C"


p4 <- PM_group %>%
  group_by(Terminal) %>%
  do(p =plot_ly(., x = ~!!as.name(plot_index1), color = ~!!as.name(plot_fill1), colors = "BuPu") %>% 
       add_histogram() %>%  
       layout(barmode = "stack", title= "Length of container by count, breakdown by Terminal ID", 
              showlegend = FALSE, 
              xaxis = list(title = "Terminal ID", tickangle = -45), 
              yaxis = list(title = "Count"))) %>%  ##  barmode = "stack", "relative" and "group" 
  subplot(shareX = TRUE, shareY = TRUE, titleX = FALSE)## stacking bar charts

saveWidget(p4, "p4.html", selfcontained = F, libdir = "lib")
tags$iframe(
  src = "p4.html", 
  scrolling = "yes", 
  seamless = "seamless",
  frameBorder = "0",
  width="500",
  height="500"
)

3.2.3 Histogram

To understand the overall performance and waiting time for the operation, histograms are used to visualise the wait time. These histograms were created using the geom_histogram function of ggplot2. As with the bar charts above, each argument was put as a variable, with the x-axis being hist_var, the fill being hist_fill, and the facet wrap variable being hist_wrap. These were then mapped using aes_string. The bin size was also created as a variable, binsize, which will be used for the binwidth argument in geom_histogram.

To create the basic histogram, the x-axis and fill were mapped with their respective variables. The bin size was defined as 5, which can be changed to a slider or dropdown list later in the Shiny environment. In order to convert facet wrap to take in variables, it was passed through a as.formula function with paste to add the “~”. The theme and palette were kept consistent with the bar charts as above.

A subset of data was created for the histogram to simulate a selection of specific terminals in the Shiny environment. An additional layer, group_activator, was added to add the main dataset to act as a comparison. Additional variables were tested to simulate inputs from the UI segment of the Shiny environment. Interactivity is provided using the ggplotly function from ggplot2. As such, no text was plotted onto the graph considering that tooltips are available to the user.

Show code

## setting variables
hist_var <- "PM_WAIT_TIME_Q"
hist_fill <- "LENGTH_Q"
hist_wrap <- "LENGTH_Q"


binsize <- 5

# Creating the histogram for waiting time for entire group
hist_wait <- ggplot(data = PM_group, aes_string(x = hist_var), fill = hist_fill)+
  geom_histogram(fill = " blue", binwidth = binsize)+
  facet_wrap(as.formula(paste("~", hist_wrap)), labeller ="label_both" )+
  theme_few()+
  scale_fill_brewer(palette = "BuPu")

p5 <- ggplotly(hist_wait)

saveWidget(p5, "p5.html", selfcontained = F, libdir = "lib")
tags$iframe(
  src = "p5.html", 
  scrolling = "yes", 
  seamless = "seamless",
  frameBorder = "0",
  width="500",
  height="500"
)

Show code

## creating data subset to compare difference

T_ID <- c("V48_1", "V48_2", "V48_9")

PM_Terminal <- PM_group %>% filter(Terminal %in% T_ID)

## create variable to compare with overall group
group_activator <- geom_histogram(data= PM_group, fill = "grey", alpha = 0.5, binwidth = 5)


# Creating the histogram for subset
hist_sub <- ggplot(data = PM_Terminal, aes_string(x = hist_var, fill = hist_fill))+
  geom_histogram(fill = "blue", binwidth = binsize)+
  facet_wrap(as.formula(paste("~", hist_wrap)), labeller ="label_both" )+
  theme_few()+
  group_activator

p6 <- ggplotly(hist_sub)

saveWidget(p6, "p6.html", selfcontained = F, libdir = "lib")
tags$iframe(
  src = "p6.html", 
  scrolling = "yes", 
  seamless = "seamless",
  frameBorder = "0",
  width="500",
  height="500"
)

Show code

#Histogram based on terminal ID

hist_wrap2 <- "."

hist_term <- ggplot(data = PM_group, aes_string(x = hist_var, fill = hist_fill))+
  geom_histogram(aes(fill =Terminal), binwidth = binsize )+
  scale_fill_brewer(palette = "BuPu")+
  facet_wrap(as.formula(paste("~", hist_wrap2)), labeller ="label_both" )+
  theme_few()

p7 <- ggplotly(hist_term)

saveWidget(p7, "p7.html", selfcontained = F, libdir = "lib")
tags$iframe(
  src = "p7.html", 
  scrolling = "yes", 
  seamless = "seamless",
  frameBorder = "0",
  width="500",
  height="500"
)

Show code

## adding free_y scale
scale <- "free_y" #"fixed" is other choice
## histogram based on ctnr spec, facet based on cntr spec
hist_spec <- ggplot(data = PM_group, aes(x = PM_TRAVEL_TIME_Q, fill = Terminal))+
  geom_histogram(binwidth = binsize)+
  facet_wrap(~CNTR_spec, scales = scale)+
  theme_few()+
  scale_fill_brewer(palette = "BuPu")

p8 <- ggplotly(hist_spec)

saveWidget(p8, "p8.html", selfcontained = F, libdir = "lib")
tags$iframe(
  src = "p8.html", 
  scrolling = "yes", 
  seamless = "seamless",
  frameBorder = "0",
  width="500",
  height="500"
)

Show code

scale2 <- "fixed" #"fixed" is other choice
hist_fill2 <- "Terminal"

## aggregated histogram
## by hour

hist_wrap3 <- "hour"


limit <- 30
ref_line <- geom_vline(xintercept = limit, linetype = "dashed", colour = "red", size = 0.25)

hist_hour <- ggplot(data = PM_group, aes_string(x = hist_var, fill = hist_fill2))+
  geom_histogram(binwidth = binsize)+
  facet_wrap(as.formula(paste("~", hist_wrap3)), scales = scale2)+
  theme_few()+
  ref_line+
  scale_fill_brewer(palette = "BuPu")

p9 <- ggplotly(hist_hour)

saveWidget(p9, "p9.html", selfcontained = F, libdir = "lib")
tags$iframe(
  src = "p9.html", 
  scrolling = "yes", 
  seamless = "seamless",
  frameBorder = "0",
  width="500",
  height="500"
)

The majority of the events have a 5 - 10 minute waiting time. The distribution is approximately normal, with a left skew. The tail is relatively long, which means that there are events with many long waiting times. In this case, a potential exploration can be done to determine the causes of these long waiting times, as these would be the ones that would bring up the mean waiting time for PM operations. Once the root cause is determined, key actions should be taken to remedy those issues.

3.2.4 Parallel Coordinates

Parallel coordinates are used to visualise multivariate data. The parallelPlot function from the parallelPlot package is used to create this plot. This helps to compare the parameters of individual observations so as to see if any patterns can be derived from these plots.

The data subset of PM_Terminal is used to reduce the processing power required to generate the plot. The relevant columns are selected, removing extra information that is not required. This dataset is named PM_parallel. A categorical list is created to ensure that the parallelPlot function codes categorical variables correctly. A histogram variable and input column variable were also created. These arguments were passed on to the parallelPlot function.

Show code

T_ID2 <- c("V48_1", "V48_9")

PM_par <- PM_group %>% filter(Terminal %in% T_ID2)

PM_parallel <- PM_par[,c(1,3:5,10,12,15:16,19:20)]
categorical1 <- list(c("v48_0","v48_1","v48_2","v48_4","v48_5", "v48_6", "v48_7", "v48_8", "v48_9"),
                    c("N", "Y"),
                    c("C","H","I", "O", "S"),
                    c("20", "40", "45", "Others"),
                    c("BC", "FTR", "NULL", "QC","RMG", "RTG"),
                    c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24"),
                    NULL, ## PM travel TIME
                    NULL, ## PM wait time
                    c("DG", "STD"), ## DG
                    c("REEF", "AMB"))

PM_parallel %>% 
  group_by(Terminal) %>%
  tally()

# A tibble: 2 x 2
  Terminal     n
  <fct>    <int>
1 V48_1    27827
2 V48_9      328

Show code

inputColumns1 <- c(TRUE, 
                  TRUE, 
                  TRUE, 
                  TRUE, 
                  TRUE,
                  TRUE,
                  TRUE,
                  TRUE,
                  TRUE,
                  TRUE)
histoVis <- rep(TRUE, ncol(PM_parallel))
p10 <- parallelPlot(data = PM_parallel, keptColumns = inputColumns1, categorical = categorical1, refColumnDim = "LENGTH_Q", rotateTitle = TRUE, histoVisibility = histoVis)


saveWidget(p10, "p10.html", selfcontained = F, libdir = "lib")
tags$iframe(
  src = "p10.html", 
  scrolling = "yes", 
  seamless = "seamless",
  frameBorder = "0",
  width="500",
  height="500"
)

3.3 Confirmatory Data Analysis

Confirmatory data analysis is carried out next. This will help to analyse the distributions that are observed from the data.

3.3.1 Scatterplots

Scatter plots are plotted next to visualise the correlation between travel time and wait time to check for correlation between them. Variables are created for the plot type and statistical method as scatter_type and scatter_stat respectively. These arguments are fed into the ggscatterstats from the ggstatsplot package to create a scatterplot diagram for users to analyse. Since the x-axis and y-axis are always numerical, they are hardcoded into the arguments. Additional plots are created to map other inputs in the ggscatterstats arguments.

Show code

scatter_type <- "histogram"
scatter_stat <- "parameteric"


ggscatterstats(data = PM_Terminal, x = PM_WAIT_TIME_Q, y = PM_TRAVEL_TIME_Q, type = scatter_stat, marginal.type = scatter_type)



scatter_type1 <- "boxplot"
scatter_stat1 <- "bayes"

ggscatterstats(data = PM_Terminal, x = PM_WAIT_TIME_Q, y = PM_TRAVEL_TIME_Q, type = scatter_stat1, marginal.type = scatter_type1)

Using the scatterplots, we can find out if there are any underlying correlations between waiting time and travel time.

Additional exploration was done using the grouped_ggscatterstats function from the ggstatsplot package. This function enables the user to break down other factors within the dataset. There is no way to remove the grouping variable using the grouped_ggscatterstats function. Therefore, to map both the aggregated dataset and those that are split by groups, there will be an indicator option for users to choose between the normal ggscatterstats function and the grouped_ggscatterstats function.

Show code

grouped_ggscatterstats(data = PM_Terminal, x = PM_WAIT_TIME_Q, y = PM_TRAVEL_TIME_Q, grouping.var = LENGTH_Q, type = scatter_stat, marginal.type = scatter_type)

Show code

grouped_ggscatterstats(data = PM_Terminal, x = PM_WAIT_TIME_Q, y = PM_TRAVEL_TIME_Q, grouping.var = Terminal,  type = scatter_stat1, marginal.type = scatter_type1)

3.3.2 Violin & box plots

Box and violin plots allow us to understand the distribution for the numerical values of the dataset. This will help to compare means between different groups, for example, Terminal ID. Instead of using a normal box plot using the ggplot package, the ggbetweenstats function from the ggstatsplot package was used. This package enables the user to explore statistical tests along with the visualisation of the distributions.

The pairwise.display function was hardcoded as “s”, representing significant values only. However, in the Shiny app, this will be converted to a variable which allows for users to view both significant and all test results. Non-significant results are not the focus, so will not be added later. Both violin and box plots will be used, as they serve their own purpose. Additional explanatory text will be included in the Shiny app to explain the output of ggbetweenstats as not all users will be able to understand the statistical output.

ggplotly from the ggplot2 package was used to give interactivity to the plot generated.

Show code

variable <-  "Terminal"

ggbetweenstats(data = PM_Terminal, x = Terminal, y = PM_WAIT_TIME_Q)

Show code

ggbetweenstats(data = PM_Terminal, x = CNTR_spec, y = PM_WAIT_TIME_Q, type = "np", pairwise.display = "s")

Warning: Number of labels is greater than default palette color count.
 Try using another color `palette` (and/or `package`).

Show code

## shift type

#boxviolin

#box & np + pairwaise display = "s"
box <- ggbetweenstats(data = PM_Terminal, x = MOVE_OP_C, y = PM_WAIT_TIME_Q, plot.type = "box", type = "np", pairwise.display = "s")

p11 <- ggplotly(box)

saveWidget(p11, "p11.html", selfcontained = F, libdir = "lib")
tags$iframe(
  src = "p11.html", 
  scrolling = "yes", 
  seamless = "seamless",
  frameBorder = "0",
  width="500",
  height="500"
)

However, upon further inspection, the interactive version built using ggplotly has removed all the statistical tests that are inherent in the ggbetweenstats function. As a result, a decision was made to remove the interactivity function in the final Shiny environment, considering that the statistical output is more important than the interactivity functionality.

Additional testing was carried out using the grouped_ggbetweenstats function. Its output is similar to that of grouped_ggscatterstats, whereby users can define a grouping variable argument to visualise different factors. However, it also has the same limitations in which it is unable to take in an “all” argument for grouping. Thus, an additional option will need to be built to select for the appropriate function accordingly.

Show code

p12 <- grouped_ggbetweenstats(data = PM_Terminal, x = LENGTH_Q, y = PM_WAIT_TIME_Q, grouping.var = Terminal, type = "np")

p12

4.0 Proposed Visualisation

The proposed visualisation is envisioned to look like this.

It will have 2 main components, exploratory data analysis and confirmatory data analysis. Each section will be split further into its sub-components as demonstrated above. Various formats of selection will be provided where relevant, such as dropdown menus for selection of variables, button menus for argument selection (type of graph in box-violin plots, histogram or box plot visualisation for scatterplots), and sliders for date selection. For date selection, only a limited range will be provided, as this will heavily impact the performance of the site, especially for parallel coordinates.

5.0 Reflections

5.1 Performance

Due to the large size of the initial data set, multiple segmentations were needed to be carried out so as to prevent R Studio from crashing, especially when creating plots with high interactivity (like parallel coordinate plots) or those that require significant computing power (like ANOVA testing from the ggstatsplot packages). As such, more care should be taken to improve performance, especially if this is to be rolled out to general users. Even if it helps operations greatly with visualisation and understanding, users will not use it if it faces significant performance issues.

5.2 Interactivity

Since the data is segmented, interactivity is key to allow users to select the relevant data that they require, such as date selection. As the end-user is much more knowledgeable about the ins and outs of the operation than us, leaving variable selection to the end-user can improve the engagement with the tool as they are more likely to find patterns and insights rather than being prescribed a view. However, interactivity heavily impacts performance, and so some limitations are required to preserve general usability of the tool.

Assignment

1.0 Introduction

2.0 Introduction & Literature Review

2.1 Introduction

2.2 Literature Review

3.0 Prototyping

3.1 Data Preparation

3.2 Exploratory Data Analysis

3.2.1 Treemap

3.2.2 Bar chart

3.2.3 Histogram

3.2.4 Parallel Coordinates

3.3 Confirmatory Data Analysis

3.3.1 Scatterplots

3.3.2 Violin & box plots

4.0 Proposed Visualisation

5.0 Reflections

5.1 Performance

5.2 Interactivity