Creating Data Visualisation Beyond Default

Author

FirGhaz

Published

January 18, 2024

1 Context Setting

OECD education director Andreas Schleicher shared in a BBC article that “Singapore managed to achieve excellence without wide differences between children from wealthy and disadvantaged families.” (2016) Furthermore, several Singapore’s Minister for Education also started an “every school a good school” slogan. The general public, however, strongly believes that there are still disparities that exist, especially between the elite schools and neighborhood school, between students from families with higher socioeconomic status and those with relatively lower socioeconomic status and immigration and non-immigration families.

2 Exercise Objectives

The 2022 Programme for International Student Assessment (PISA) data was released on December 5, 2022. PISA global education survey every three years to assess the education systems worldwide through testing 15 year old students in the subjects of mathematics, reading, and science.

In this take-home exercise, we are required to use appropriate Exploratory Data Analysis (EDA) methods and ggplot2 functions to reveal:

the distribution of Singapore students’ performance in mathematics, reading, and science, and
the relationship between these performances with schools, gender and socioeconomic status (SES) of the students.

3 Data Sets and Preparation

3.1 Data Sets

The PISA 2022 database contains the full set of responses from individual students, school principals and parents. A total of five data sets were extracted and their contents are as follows:

Student questionnaire data file
School questionnaire data file
Teacher questionnaire data file
Cognitive item data file
Questionnaire timing data file

In this exercise, we are to utilize the student questionnaire data set only.

3.2 Data Preparation

3.2.1 Installing R Packages

Installation of mulitple R packages via the use of pacman::p_load() function from the pacman package. See the code chunk below:

Code

pacman::p_load(tidyverse, haven, dplyr, plyr, ggrepel, ggthemes, knitr, kableExtra, intsvy, hrbrthemes, ggridges, ggdist, patchwork, colorspace, reshape2, scales, ggplot2, ggpol, gridExtra)

3.2.2 Uploading & Filtering the Data Set

The students questionaire data file was uploaded as stu_qqq_SG. See the code chunk below:

Code

stu_qqq <- read_sas("data/cy08msp_stu_qqq.sas7bdat")
stu_qqq_SG <- stu_qqq %>% filter(CNT =="SGP")
write_rds(stu_qqq_SG,"data/stu_qqq_SG.rds")
stu_qqq_SG <- read_rds("data/stu_qqq_SG.rds")
head(stu_qqq_SG, 5)

# A tibble: 5 × 1,279
  CNT   CNTRYID CNTSCHID CNTSTUID CYC   NatCen STRATUM SUBNATIO REGION  OECD
  <chr>   <dbl>    <dbl>    <dbl> <chr> <chr>  <chr>   <chr>     <dbl> <dbl>
1 SGP       702 70200052 70200001 08MS  070200 SGP01   7020000   70200     0
2 SGP       702 70200134 70200002 08MS  070200 SGP01   7020000   70200     0
3 SGP       702 70200112 70200003 08MS  070200 SGP01   7020000   70200     0
4 SGP       702 70200004 70200004 08MS  070200 SGP01   7020000   70200     0
5 SGP       702 70200152 70200005 08MS  070200 SGP01   7020000   70200     0
# ℹ 1,269 more variables: ADMINMODE <dbl>, LANGTEST_QQQ <dbl>,
#   LANGTEST_COG <dbl>, LANGTEST_PAQ <dbl>, Option_CT <dbl>, Option_FL <dbl>,
#   Option_ICTQ <dbl>, Option_WBQ <dbl>, Option_PQ <dbl>, Option_TQ <dbl>,
#   Option_UH <dbl>, BOOKID <dbl>, ST001D01T <dbl>, ST003D02T <dbl>,
#   ST003D03T <dbl>, ST004D01T <dbl>, ST250Q01JA <dbl>, ST250Q02JA <dbl>,
#   ST250Q03JA <dbl>, ST250Q04JA <dbl>, ST250Q05JA <dbl>, ST250D06JA <chr>,
#   ST250D07JA <chr>, ST251Q01JA <dbl>, ST251Q02JA <dbl>, ST251Q03JA <dbl>, …

Focusing on the Exercise Objectives, the Data of Interests will be scoped towards selected variables. These variables are:

STRATUM
GENDER
IMMIGRANT STATUS
SES STATUS
PV Values

Code

stu_qqq_SG_selectedV <- stu_qqq_SG %>% select(CNTSTUID, STRATUM, ST004D01T, IMMIG, ESCS,PV1READ:PV10READ, PV1SCIE:PV10SCIE, PV1MATH:PV10MATH)

The variables are all anchored based on each Student’s UNIQUE ID - CNTSTUID

3.2.3 Checking Data Structure and Coherency

Next, we will check for duplicates, missing values and convert data types as part of data pre-processing.

Data Health Handling duplicates

Code

stu_qqq_SG_selectedV[duplicated(stu_qqq_SG_selectedV),]

# A tibble: 0 × 35
# ℹ 35 variables: CNTSTUID <dbl>, STRATUM <chr>, ST004D01T <dbl>, IMMIG <dbl>,
#   ESCS <dbl>, PV1READ <dbl>, PV2READ <dbl>, PV3READ <dbl>, PV4READ <dbl>,
#   PV5READ <dbl>, PV6READ <dbl>, PV7READ <dbl>, PV8READ <dbl>, PV9READ <dbl>,
#   PV10READ <dbl>, PV1SCIE <dbl>, PV2SCIE <dbl>, PV3SCIE <dbl>, PV4SCIE <dbl>,
#   PV5SCIE <dbl>, PV6SCIE <dbl>, PV7SCIE <dbl>, PV8SCIE <dbl>, PV9SCIE <dbl>,
#   PV10SCIE <dbl>, PV1MATH <dbl>, PV2MATH <dbl>, PV3MATH <dbl>, PV4MATH <dbl>,
#   PV5MATH <dbl>, PV6MATH <dbl>, PV7MATH <dbl>, PV8MATH <dbl>, …

Checking missing values

Code

sum(is.na(stu_qqq_SG_selectedV))

[1] 283

There are a total of 283 missing values which is merely 4.2% of the overall data set hence can regarded as statistically negligible.

Converting Data Types: Converting CNTSTUID & GENDER from num type to chrtype as they are categorical in nature. Subsequently, renaming and recoding them to enhance data comprehension.

Code

stu_qqq_SG_selectedV<- stu_qqq_SG_selectedV %>% mutate(CNTSTUID = as.character(CNTSTUID))

names(stu_qqq_SG_selectedV)[names(stu_qqq_SG_selectedV) == 'CNTSTUID'] <- 'STUDENT ID'

names(stu_qqq_SG_selectedV)[names(stu_qqq_SG_selectedV) == 'ST004D01T'] <- 'GENDER'

stu_qqq_SG_selectedV <- stu_qqq_SG_selectedV %>%
  mutate(GENDER = recode(as.character(GENDER), '1' = 'FEMALE', '2' = 'MALE'))

stu_qqq_SG_selectedV <- stu_qqq_SG_selectedV %>%
  mutate(STRATUM = recode(STRATUM, 'SGP01' = 'MAINSTREAM SCH', 'SGP03' = 'PRIVATE SCH'))

stu_qqq_SG_selectedV<- stu_qqq_SG_selectedV %>%
  mutate(IMMIG = recode(IMMIG, '1' = 'NATIVE', '2' = '2ND GEN', '3' = '1ST GEN'))

4 Exploratory Data Analysis

4.1 Explore Score Distribution across all subjects

4.1.1 Observations through Prob Density and Histogram

Utilising the prob density and histogram, we develop the visualisation to observe the distributions and the summary stats across all subject of interests.

All Subjects

Code

stu_qqq_SG_selectedV <- stu_qqq_SG_selectedV %>%
  mutate(Maths = rowSums(stu_qqq_SG_selectedV[paste0('PV', c(1:10), "MATH")], 
                   na.rm = TRUE)/100) %>% 
  mutate(Reading = 
           rowSums(stu_qqq_SG[paste0('PV', c(1:10), "READ")], 
                   na.rm = TRUE)/100) %>% 
  mutate(Science = 
           rowSums(stu_qqq_SG[paste0('PV', c(1:10), "SCIE")], 
                   na.rm = TRUE)/100)
temp_Data <- stu_qqq_SG_selectedV[, c("Science", "Reading", "Maths")]
temp_Data <- melt(temp_Data, variable.name = "Subject")

No id variables; using all as measure variables

Code

ggplot(temp_Data, aes(x = value, y = Subject)) +
  stat_halfeye(aes(fill = Subject), 
               adjust = 0.5,
               justification = 0.1,
               .width = 0,
               point_colour = NA) +
  geom_boxplot(width = 0.2) +
  stat_summary(fun = mean, geom = "point", shape = 16, 
               size = 3, color = "darkred", 
               position = position_nudge(x = 0.0)) +
  stat_summary(fun = mean, colour="darkred", 
               geom = "text", show.legend = FALSE, 
               vjust = -1.5, aes( label=round(after_stat(x), 1))) +
  labs(y = NULL, x = "Scores",
       title = "Distribution of Scores",
       subtitle = "Math subject, yielded greater performance by students compared to Reading and Science") +
  theme_tidybayes()+
  theme(legend.position = "none")

Maths Score

Code

# Calculate mean and median
mean_value <- mean(stu_qqq_SG_selectedV$Maths, na.rm = TRUE)
median_value <- median(stu_qqq_SG_selectedV$Maths, na.rm = TRUE)

# Create the histogram
histogram_plot <- ggplot(data = stu_qqq_SG_selectedV, aes(x = Maths)) +
  geom_histogram(binwidth = 2, fill = "lightblue", color = "black") +  # Adjust binwidth and color here
  geom_vline(xintercept = mean_value, color = "red", linetype = "dashed", linewidth = 0.5) +
  geom_vline(xintercept = median_value, color = "blue", linetype = "dotted", linewidth = 0.5) +
  labs(title = "Histogram of Maths Scores",
       x = "Maths Scores",
       y = "Frequency",
       subtitle = paste("Mean (red):", round(mean_value, 2), 
                        "- Median (blue):", round(median_value, 2)))

# Print the plot
print(histogram_plot)

Reading

Code

# Calculate mean and median
mean_value <- mean(stu_qqq_SG_selectedV$Reading, na.rm = TRUE)
median_value <- median(stu_qqq_SG_selectedV$Reading, na.rm = TRUE)

# Create the histogram
histogram_plot <- ggplot(data = stu_qqq_SG_selectedV, aes(x = Reading)) +
  geom_histogram(binwidth = 2, fill = "lightgreen", color = "black") +  # Adjust binwidth and color here
  geom_vline(xintercept = mean_value, color = "red", linetype = "dashed", linewidth = 0.5) +
  geom_vline(xintercept = median_value, color = "blue", linetype = "dotted", linewidth = 0.5) +
  labs(title = "Histogram of Reading Scores",
       x = "Reading Scores",
       y = "Frequency",
       subtitle = paste("Mean (red):", round(mean_value, 2), 
                        "- Median (blue):", round(median_value, 2)))

# Print the plot
print(histogram_plot)

Science

Code

# Calculate mean and median
mean_value <- mean(stu_qqq_SG_selectedV$Science, na.rm = TRUE)
median_value <- median(stu_qqq_SG_selectedV$Science, na.rm = TRUE)

# Create the histogram
histogram_plot <- ggplot(data = stu_qqq_SG_selectedV, aes(x = Science)) +
  geom_histogram(binwidth = 2, fill = "lightpink", color = "black") +  # Adjust binwidth and color here
  geom_vline(xintercept = mean_value, color = "red", linetype = "dashed", linewidth = 0.5) +
  geom_vline(xintercept = median_value, color = "blue", linetype = "dotted", linewidth = 0.5) +
  labs(title = "Histogram of Science Scores",
       x = "Science Scores",
       y = "Frequency",
       subtitle = paste("Mean (red):", round(mean_value, 2), 
                        "- Median (blue):", round(median_value, 2)))

# Print the plot
print(histogram_plot)

4.1.2 Insights & Observations

The aim of these visualisations was to show the performance distribution of students across all individual subjects.
Trends in each plot showed :
- distribution characteristic is of a Normal Distribution,
- distribution mildly skews to the left.
- From here we can deduce that just about more than half of the cohort scores above the 50% mark.
Students fair better in Math, as a subject, compared to Reading and Science. :::

4.2 Exploring Performance Score by Gender

4.2.1 Observations through boxplots

Utilising the boxplots, we visualise the how gender fair across all subjects.

Code

library(ggplot2)

# Calculate mean and median for each gender
means <- aggregate(Maths ~ GENDER, data = stu_qqq_SG_selectedV, FUN = mean)

# Create the boxplot
ggplot(data = stu_qqq_SG_selectedV, aes(y = Maths, x = GENDER, color = GENDER)) +
  geom_boxplot(width = 0.3) +
  
  # Add mean lines
  geom_errorbar(data = means, aes(ymin = Maths, ymax = Maths, x = GENDER), 
                width = 0.2, color = "darkred") +
  geom_text(data = means, aes(label = round(Maths, 1), y = Maths, x = GENDER), 
            vjust = -1.5, color = "darkred", size = 3) +
  
  # Titles and subtitles
  labs(title = "Maths Scores by Gender",
       subtitle = "Male scores better in Maths in comparison with Female",
       x = "Gender",
       y = "Maths Scores",
       color = "Gender")

Code

library(ggplot2)

# Calculate mean and median for each gender
means <- aggregate(Reading ~ GENDER, data = stu_qqq_SG_selectedV, FUN = mean)

# Create the boxplot
ggplot(data = stu_qqq_SG_selectedV, aes(y = Reading, x = GENDER, color = GENDER)) +
  geom_boxplot(width = 0.3) +
  
  # Add mean lines
  geom_errorbar(data = means, aes(ymin = Reading, ymax = Reading, x = GENDER), 
                width = 0.2, color = "darkred") +
  geom_text(data = means, aes(label = round(Reading, 1), y = Reading, x = GENDER), 
            vjust = -1.5, color = "darkred", size = 3) +
  
  # Titles and subtitles
  labs(title = "Reading Scores by Gender",
       subtitle = "Female scores better in Reading",
       x = "Gender",
       y = "Reading Scores",
       color = "Gender")

Code

library(ggplot2)

# Calculate mean and median for each gender
means <- aggregate(Science ~ GENDER, data = stu_qqq_SG_selectedV, FUN = mean)

# Create the boxplot
ggplot(data = stu_qqq_SG_selectedV, aes(y = Science, x = GENDER, color = GENDER)) +
  geom_boxplot(width = 0.3) +
  
  # Add mean lines
  geom_errorbar(data = means, aes(ymin = Science, ymax = Science, x = GENDER), 
                width = 0.2, color = "darkred") +
  geom_text(data = means, aes(label = round(Science, 1), y = Science, x = GENDER), 
            vjust = -1.5, color = "darkred", size = 3) +
  
  # Titles and subtitles
  labs(title = "Science Scores by Gender",
       subtitle = "Male scores better in Science as compared to their Female counterparts",
       x = "Gender",
       y = "Science Scores",
       color = "Gender")

4.2.2 Insights & Observations

The aim of these visualisations was to show Gender performance across all individual subjects.
Males Mean scores are higher hence we can deduce that Males fair better in Math and Science subjects.
Female fair better in Reading as compared to their Male counterparts.

4.3 Exploring Performance Score by SE Index

4.3.1 SE Distribution by Scores

SE Distribution Let’s first take a look into the socioeconomic (SE) distribution to gain some insights.

Code

stu_qqq_SG_selectedV <- na.omit(stu_qqq_SG_selectedV)
ggplot(data = stu_qqq_SG_selectedV, aes(x = ESCS)) +
  stat_halfeye(adjust = 0.5,
               justification = -0.2,
               .width = 0, point_colour = NA) +
  geom_boxplot(width = 0.20,
               outlier.shape = NA) +
  labs(y = NULL,
       title = "SE Distribution", subtitle = "Left-Skewed SE Distribution")  +
  theme_tidybayes()

ESCS Index by Subject Performance

Code

cor1 <- round(cor(stu_qqq_SG_selectedV$Maths, stu_qqq_SG_selectedV$ESCS),2)
cor2 <- round(cor(stu_qqq_SG_selectedV$Science, stu_qqq_SG_selectedV$ESCS),2)
cor3 <- round(cor(stu_qqq_SG_selectedV$Reading, stu_qqq_SG_selectedV$ESCS),2)

se1 <- ggplot(data = stu_qqq_SG_selectedV,
             aes(y = Maths, x = ESCS)) +
        geom_point(size = 0.1)+
        geom_smooth(method = lm) + 
        annotate("text", x = 2.5, y = 100, label=paste0("r = ", cor1), color = 'lightblue') +
        theme_tidybayes()

se2 <- ggplot(data = stu_qqq_SG_selectedV,
             aes(y = Science, x = ESCS)) +
        geom_point(size = 0.1)+
        geom_smooth(method = lm) +
        annotate("text", x = 2.5, y = 100, label=paste0("r = ", cor2), color = 'lightgreen') + 
        theme_tidybayes()

se3 <- ggplot(data = stu_qqq_SG_selectedV,
             aes(y = Reading, x = ESCS)) +
        geom_point(size = 0.1)+
        geom_smooth(method = lm) + 
        annotate("text", x = 2.5, y = 100, label=paste0("r = ", cor3), color = 'lightpink') +
        theme_tidybayes()

se1/se2/se3

`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

4.3.2 Insights & Observations

Left Skewed Distribution from the SE Index can be deduced that students with normal to high ESCS Index are more prevalent as compared to the lower ESCS index. To bring the conversation further, we can study into the distribution of native, 1st Gen and 2nd Gen by the ESCS index in the future.
Depicts a positive moderate correlation between Scores and ESCS index, hence we can infer that students with higher ESCS Index scores better in their subjects.

5 Summary

Math is the Subject that scores higher as compared to Reading and Science.
Male tends to score in both Maths and Science, while Females are comparatively higher in terms of scoring in Reading.
There is a positive correlation between the Socioecomic Index (ESCS) and performance of students.
From the distribution (SE index), students with normal to high ESCS Index are more prevalent as compared to the lower ESCS index.
Students with higher ESCS Index tends to score better in their subjects.