Self assessment questions

1 Self assessment

If you have not taken the Introduction to R course, you should ccomplete the self-assessment questions below to ensure you are at the same level as those you have taken the Introduction course. If you have taken the course, you can skip this section. Using generative AI tools such as ChatGPT to complete the tasks indicates that you do not have sufficient knowledge of R to complete the intermediate course. You should complete the Introduction to R course before continuing with this course.

The questions require you to use the PISA_student_2022_subset.parquet dataset which you can download here here.

To load the PISA data:

Copy the following code and replace <folder> with the full folder location of where your dataset was saved, make sure that you have .parquet on the end. And keep the (r"[ ]")! Make sure you run all the lines in the code below, including the library(arrow) and library(tidyverse) lines.

examples of what this should look like for PC and Mac

# For Pete (PC) the address format was:
PISA_2022 <- read_parquet(r"[C:\Users\Peter\KCL\MASTEMR\PISA_student_2022_subset.parquet]")

# For Richard (Mac) the address format was:
PISA_2022 <- read_parquet(r"[/Users/k1765032/Documents/Teaching/STEM MA/Quantitative module/Data sets/PISA_student_2022_subset.parquet]")

1.1 Question 1: Data manipulation

Produce a summary table of the mean science scores for countries in PISA 2022 for boys and girls

Show the code

PISA_2022 %>%
  select(CNT, ST004D01T, PV1SCIE) %>%
  group_by(CNT, ST004D01T) %>%
  summarise(mean_sci = mean(PV1SCIE, na.rm = T))

# A tibble: 165 × 3
# Groups:   CNT [80]
   CNT                  ST004D01T mean_sci
   <fct>                <fct>        <dbl>
 1 Albania              Female        390.
 2 Albania              Male          362.
 3 United Arab Emirates Female        445.
 4 United Arab Emirates Male          427.
 5 Argentina            Female        412.
 6 Argentina            Male          418.
 7 Australia            Female        507.
 8 Australia            Male          509.
 9 Australia            <NA>          566.
10 Austria              Female        489.
# ℹ 155 more rows

1.2 Question 2: Graphing

Plot a scatter plot of mean maths scores against mean wealth for the PISA countries. Add labels to countries with mean maths scores over 500, giving the name of the country. Add a linear regression line to the plot

Show the code

library(ggrepel)

plot_data <- PISA_2022 %>%
  select(CNT, HOMEPOS, PV1MATH) %>%
  group_by(CNT) %>%
  summarise(mean_math = mean(PV1MATH, na.rm = T),
            mean_wealth = mean(HOMEPOS, na.rm = T)) %>%
  mutate(label = ifelse(mean_math > 500, as.character(CNT), NA))

ggplot(plot_data, aes(x = mean_wealth, y = mean_math)) +
  geom_point(colour = "lightblue") +
  geom_smooth(method = "lm", colour = "darkblue") +
  geom_text_repel(aes(label = label), vjust = -1, na.rm = TRUE) +
  labs(x = "Mean Wealth", y = "Mean Maths Score") +
  theme_minimal()

1.3 Question 3: Performing tests

Use a t-test to determine if male and female reading scores in the UK are statistically significantly different.
Use a chi-square test to determine if the number of boys and girls in the US and UK are diffrent

Show the code

# a) t-test
# Filter the data for the UK and select relevant columns
uk_reading <- PISA_2022 %>%
  filter(CNT == "United Kingdom") %>%
  select(ST004D01T, PV1READ) %>%
  drop_na()
# Perform the t-test
t.test(PV1READ ~ ST004D01T, data = uk_reading)


    Welch Two Sample t-test

data:  PV1READ by ST004D01T
t = 10.159, df = 12960, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 15.03322 22.22110
sample estimates:
mean in group Female   mean in group Male 
            500.2030             481.5759

Show the code

# p-value < 2.2e-16 so reject the null hypothesis there are statistically signifcant differences in mean reading scores in the UK

# b) chi-square test

US_UK_gender <- PISA_2022 %>%
  filter(CNT %in% c("United States", "United Kingdom")) %>%
  select(CNT, ST004D01T) %>%
  droplevels()

# Create a contingency table
contingency_table <- xtabs(data = US_UK_gender, ~ ST004D01T + CNT)

# Perform the chi-square test
chisq.test(contingency_table)


    Pearson's Chi-squared test with Yates' continuity correction

data:  contingency_table
X-squared = 0.028622, df = 1, p-value = 0.8657

Show the code

# p-value = 0.8657 - p-value is over 0.05 so accept the null that the proportions of boys and girls in the UK and US is similar