1 Why We Use Functions in R

For this session we are going to use the gapminder dataset.

About the Gapminder Dataset

The gapminder dataset contains information about life expectancy, population, and GDP per capita for countries around the world from 1952 to 2007. This dataset was made famous by Hans Rosling, a Swedish physician and statistician who used it in his captivating TED talks to challenge misconceptions about global development. Rosling’s animated visualisations of this data revealed surprising trends about how countries have progressed over time, showing that the world has become healthier and wealthier than many people realise.

The dataset is perfect for demonstrating data analysis techniques because it contains both categorical variables (country, continent) and numerical variables (year, life expectancy, population, GDP per capita), with enough complexity to showcase real-world data challenges while remaining manageable for learning purposes.

Before we start, let’s load the packages we’ll need:

library(tidyverse)
library(gapminder)

Functions are the building blocks of reproducible analysis in R. They:

Help you avoid repeating yourself (DRY principle)
Make your code easier to read, test, and debug
Allow you to generalise analyses to any dataset
Enable collaboration and sharing of code

Today, we’ll learn how to write flexible functions that work with any data, and how to document them for others to use.

2 What the Basic Structure of a Function Looks Like

Here is the basic structure of a function in R:

# Function name assignment
my_function <- function(arg1, arg2) {  # Arguments (parameters)
  # Function body - the code that does the work
  result <- arg1 + arg2
  return(result)  # Return value (optional)
}

Let’s break down the key components:

Function name: my_function - what you’ll call to use the function
Assignment operator: <- - assigns the function to a name
function() keyword: tells R you’re creating a function
Arguments: arg1, arg2 - the inputs your function accepts
Function body: the code between {} that does the actual work
Return value: what the function gives back (optional - R returns the last evaluated expression by default)
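To make the last point concrete, here is a minimal sketch comparing an explicit return() with R's implicit return of the last evaluated expression. The names add_explicit and add_implicit are made up for illustration.

```r
# Two hypothetical functions that differ only in how they return a value
add_explicit <- function(a, b) {
  result <- a + b
  return(result)  # explicit return
}

add_implicit <- function(a, b) {
  a + b  # the last evaluated expression is returned automatically
}

# Both calls give the same answer
identical(add_explicit(2, 3), add_implicit(2, 3))
```

Which style to use is largely a matter of taste; an explicit return() is mainly needed when you want to exit a function early.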

Here’s a simple working example:

# A function to calculate the area of a rectangle
calculate_area <- function(length, width) {
  area <- length * width
  return(area)
}

# Test our function
calculate_area(5, 3)

[1] 15

3 Writing a Basic Function to Show the Structure

Let’s write a function that filters the gapminder dataset for rows where life expectancy is above a given threshold and returns the first 10 matches. We’ll keep it simple and use a fixed column name:

filter_lifeExp <- function(data, threshold) {
  data %>%
    filter(lifeExp > threshold) %>%
    slice_head(n = 10)
}

# Example usage
filter_lifeExp(gapminder, 60)

# A tibble: 10 × 6
   country continent  year lifeExp     pop gdpPercap
   <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
 1 Albania Europe     1962    64.8 1728137     2313.
 2 Albania Europe     1967    66.2 1984060     2760.
 3 Albania Europe     1972    67.7 2263554     3313.
 4 Albania Europe     1977    68.9 2509048     3533.
 5 Albania Europe     1982    70.4 2780097     3631.
 6 Albania Europe     1987    72   3075321     3739.
 7 Albania Europe     1992    71.6 3326498     2497.
 8 Albania Europe     1997    73.0 3428038     3193.
 9 Albania Europe     2002    75.7 3508512     4604.
10 Albania Europe     2007    76.4 3600523     5937.

4 Writing a More Complex Function That Plots Data

Now let’s write a plotting function for the same data:

my_scatter_plot <- function(data) {
  data %>%
    ggplot(aes(x = gdpPercap, y = lifeExp)) +
    geom_point(colour = "blue", alpha = 0.7) +
    theme_minimal()
}

# Example usage
filtered <- filter_lifeExp(gapminder, 60)
my_scatter_plot(filtered)

Try changing the threshold in filter_lifeExp() to see how the plot changes.

The Limitation of Fixed Column Names

But what if we wanted more flexible functions that can take many different column names? Or even work with a completely different dataset with different column names?

Our current functions are quite rigid - they only work with specific column names like lifeExp and gdpPercap. This limits their reusability!
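As a rough illustration of that rigidity, the sketch below calls the same function on R's built-in mtcars data, which has no lifeExp column. The choice of mtcars and the use of tryCatch() to capture the error are assumptions made here for demonstration only.

```r
library(dplyr)

# Same function as before: the column name lifeExp is hard-coded
filter_lifeExp <- function(data, threshold) {
  data %>%
    filter(lifeExp > threshold) %>%
    slice_head(n = 10)
}

# mtcars has no lifeExp column, so the call fails; tryCatch() captures
# the error message instead of stopping the script
outcome <- tryCatch(
  filter_lifeExp(mtcars, 60),
  error = function(e) conditionMessage(e)
)
outcome
```

The function is not wrong, it is simply tied to one dataset's column names.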

5 Working with Column Names as Strings - `[[]]`

Before we dive into tidy evaluation, let’s look at some common scenarios. Normally, we make an explicit choice about which column we want to access in our data.

gapminder %>%
  filter(lifeExp > 60) %>% # or .data$lifeExp
  slice_head(n = 10)

# A tibble: 10 × 6
   country continent  year lifeExp     pop gdpPercap
   <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
 1 Albania Europe     1962    64.8 1728137     2313.
 2 Albania Europe     1967    66.2 1984060     2760.
 3 Albania Europe     1972    67.7 2263554     3313.
 4 Albania Europe     1977    68.9 2509048     3533.
 5 Albania Europe     1982    70.4 2780097     3631.
 6 Albania Europe     1987    72   3075321     3739.
 7 Albania Europe     1992    71.6 3326498     2497.
 8 Albania Europe     1997    73.0 3428038     3193.
 9 Albania Europe     2002    75.7 3508512     4604.
10 Albania Europe     2007    76.4 3600523     5937.

What if you wanted to change that column programmatically and not update your code every time? First, let’s assume your column name is stored as a string.

The [[]] operator with the .data pronoun looks up a column by its string name. Think of it as saying “go find the column with this exact name and give me its values.” The string is used as-is rather than being resolved as a variable name, so it is always clear which column you get.

# Using [[]] with .data for explicit column access
column_name <- "lifeExp"

gapminder %>%
  filter(.data[[column_name]] > 60) %>%
  slice_head(n = 10)

# A tibble: 10 × 6
   country continent  year lifeExp     pop gdpPercap
   <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
 1 Albania Europe     1962    64.8 1728137     2313.
 2 Albania Europe     1967    66.2 1984060     2760.
 3 Albania Europe     1972    67.7 2263554     3313.
 4 Albania Europe     1977    68.9 2509048     3533.
 5 Albania Europe     1982    70.4 2780097     3631.
 6 Albania Europe     1987    72   3075321     3739.
 7 Albania Europe     1992    71.6 3326498     2497.
 8 Albania Europe     1997    73.0 3428038     3193.
 9 Albania Europe     2002    75.7 3508512     4604.
10 Albania Europe     2007    76.4 3600523     5937.

This is particularly useful in functions where you might receive column names as strings:

# Function that takes a column name as a string
filter_by_string_name <- function(data, col_name_string, threshold) {
  data %>%
    filter(.data[[col_name_string]] > threshold)
}

# Test with column name as string
filter_by_string_name(gapminder, "gdpPercap", 30000) %>%
  slice_head(n = 10)

# A tibble: 10 × 6
   country   continent  year lifeExp      pop gdpPercap
   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
 1 Australia Oceania    2002    80.4 19546792    30688.
 2 Australia Oceania    2007    81.2 20434176    34435.
 3 Austria   Europe     2002    79.0  8148312    32418.
 4 Austria   Europe     2007    79.8  8199783    36126.
 5 Belgium   Europe     2002    78.3 10311970    30486.
 6 Belgium   Europe     2007    79.4 10392226    33693.
 7 Canada    Americas   2002    79.8 31902268    33329.
 8 Canada    Americas   2007    80.7 33390141    36319.
 9 Denmark   Europe     2002    77.2  5374693    32167.
10 Denmark   Europe     2007    78.3  5468120    35278.

Equally, it works with ggplot2, which is quite helpful because you often want to write generic plotting functions for multiple variables.

# Function to create a simple scatter plot
my_scatter_plot <- function(data, x_column, y_column) {
  data %>%
    ggplot(aes(x = .data[[x_column]], y = .data[[y_column]])) +
    geom_point(alpha = 0.7) +
    theme_minimal()
}

dat <- filter_by_string_name(gapminder, "gdpPercap", 30000) %>%
  slice_head(n = 10)

# Create a plot
my_scatter_plot(dat, "gdpPercap", "lifeExp")

Try one of the other numeric variables, or reverse x_column and y_column.

Did you know that piping works with your own functions too?

gg <- gapminder %>% 
  filter_by_string_name("gdpPercap", 30000) %>%
  slice_head(n = 10) %>%
  my_scatter_plot("gdpPercap", "lifeExp")

The [[]] approach works, but…

The [[]] syntax with .data is explicit column access - you’re directly looking up columns by their string names. This works well when:

You have column names as strings
You want to be very explicit about data access
You’re building functions that work with string-based specifications
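One case where string-based access pays off is looping over several columns. The following is a minimal sketch, assuming the built-in mtcars data and a hypothetical helper called count_above; the thresholds are arbitrary values picked for illustration.

```r
library(dplyr)

# Hypothetical helper: count the rows where a column (named by a string)
# exceeds a threshold
count_above <- function(data, col_name, threshold) {
  data %>%
    filter(.data[[col_name]] > threshold) %>%
    nrow()
}

# Column names and thresholds stored as plain data
cols <- c("mpg", "hp", "wt")
thresholds <- c(mpg = 20, hp = 150, wt = 3)

# Apply the helper to each column name in turn
counts <- vapply(
  cols,
  function(col) count_above(mtcars, col, thresholds[[col]]),
  integer(1)
)
counts
```

Because the column names are ordinary strings, they can come from a vector, a configuration file, or user input.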

But what if users want to pass unquoted column names like filter_function(data, lifeExp, 75) instead of filter_function(data, "lifeExp", 75)?

That’s where tidy evaluation comes in…

6 What is Tidy Evaluation?

Tidy evaluation is the system that allows tidyverse functions to work with unquoted column names. It’s what makes this possible:

# This works - no quotes around column names despite `year` not being in your environment!
gapminder %>% 
  filter(year == 2007) %>% 
  select(country, continent, lifeExp, gdpPercap)

# A tibble: 142 × 4
   country     continent lifeExp gdpPercap
   <fct>       <fct>       <dbl>     <dbl>
 1 Afghanistan Asia         43.8      975.
 2 Albania     Europe       76.4     5937.
 3 Algeria     Africa       72.3     6223.
 4 Angola      Africa       42.7     4797.
 5 Argentina   Americas     75.3    12779.
 6 Australia   Oceania      81.2    34435.
 7 Austria     Europe       79.8    36126.
 8 Bahrain     Asia         75.6    29796.
 9 Bangladesh  Asia         64.1     1391.
10 Belgium     Europe       79.4    33693.
# ℹ 132 more rows

6.1 The Problem: Functions Don’t Work with dplyr

The challenge comes when you try to write your own functions. Let’s see what happens:

# This function won't work as expected
my_filter <- function(data, column_name, value) {
  data %>%
    filter(column_name == value)  # This doesn't work!
}

# Try to use it
my_filter(gapminder, year, 2007)

Error in `filter()`:
ℹ In argument: `column_name == value`.
Caused by error in `column_name == value`:
! comparison (==) is possible only for atomic and list types

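The function fails because filter() evaluates column_name like any ordinary function argument: R looks for an object called year outside the data frame instead of using the year column. Here it appears to find a function named year (the tidyverse attaches lubridate, which exports year()), and a function cannot be compared with ==, hence the error. This is where tidy evaluation and data masking come to the rescue!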
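A related and arguably worse failure is a silent one. The sketch below uses the built-in mtcars data and a global variable named gear (both chosen here purely for illustration, not part of this session's data): the argument picks up the global value rather than the column, and the function returns every row without raising an error.

```r
library(dplyr)

my_filter <- function(data, column_name, value) {
  data %>%
    filter(column_name == value)  # column_name is NOT treated as a column
}

# A global variable that happens to share its name with a column of mtcars
gear <- 4

# column_name evaluates to the global value 4, so the condition is 4 == 4,
# a single TRUE that is recycled to keep every row
wrong <- my_filter(mtcars, gear, 4)
nrow(wrong)

# What we actually wanted: only the rows where the gear column equals 4
right <- mtcars %>% filter(gear == 4)
nrow(right)
```

Comparing the two row counts shows that the function quietly ignored the column.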

7 Data Masking

What is Data Masking?

Data masking makes column names in your data frame available as if they were regular variables. It is what allows you to write filter(gapminder, year == 2007) instead of filter(gapminder, gapminder$year == 2007). The data frame’s columns “mask” the surrounding environment: inside filter(), year is looked up in the data first, so it behaves as if it existed in your current environment.

This is why you can write:

# This works - year refers to the column
gapminder %>% filter(year == 2007)

# A tibble: 142 × 6
   country     continent  year lifeExp       pop gdpPercap
   <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
 1 Afghanistan Asia       2007    43.8  31889923      975.
 2 Albania     Europe     2007    76.4   3600523     5937.
 3 Algeria     Africa     2007    72.3  33333216     6223.
 4 Angola      Africa     2007    42.7  12420476     4797.
 5 Argentina   Americas   2007    75.3  40301927    12779.
 6 Australia   Oceania    2007    81.2  20434176    34435.
 7 Austria     Europe     2007    79.8   8199783    36126.
 8 Bahrain     Asia       2007    75.6    708573    29796.
 9 Bangladesh  Asia       2007    64.1 150448339     1391.
10 Belgium     Europe     2007    79.4  10392226    33693.
# ℹ 132 more rows

You’ve been using data masking every time you use dplyr functions - you just didn’t know it had a name!
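To see the masking itself, here is a small sketch, assuming the built-in mtcars data and an environment variable deliberately given the same name as one of its columns. It also uses the .env pronoun, the counterpart of .data, which this session does not otherwise cover.

```r
library(dplyr)

# An environment variable with the same name as a column in mtcars
cyl <- 4

# The column masks the environment variable: this keeps the six-cylinder cars
n_six <- mtcars %>% filter(cyl == 6) %>% nrow()

# The .data and .env pronouns make the two sources explicit:
# keep rows where the cyl column equals the environment value (4)
n_four <- mtcars %>% filter(.data$cyl == .env$cyl) %>% nrow()

c(n_six = n_six, n_four = n_four)
```

When a name exists in both places, the column wins unless you say otherwise with a pronoun.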

8 The Curly-Curly Operator `{{}}`

The curly-curly operator {{}} is the solution to our function problem!

Conceptually, {{}} captures and forwards whatever the user passed to your function. Think of it as a “pass-through” mechanism - it takes the unquoted column name the user gave you (like lifeExp) and forwards it to dplyr functions as if you had typed that column name directly in the code.

What does {{}} do?

The {{}} operator (called “curly-curly”) is like a magic wrapper that:

Captures the column name you pass to your function
Passes it through to dplyr functions correctly
Makes it work just like typing the column name directly

Think of {{column_name}} as saying “use whatever column name the user gave me here.”

# Fixed version using curly-curly
my_filter <- function(data, column_name, value) {
  data %>%
    filter({{column_name}} == value)  # {{}} makes this work!
}

# Now it works!
my_filter(gapminder, year, 2007) %>%
  slice_head(n = 10)

# A tibble: 10 × 6
   country     continent  year lifeExp       pop gdpPercap
   <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
 1 Afghanistan Asia       2007    43.8  31889923      975.
 2 Albania     Europe     2007    76.4   3600523     5937.
 3 Algeria     Africa     2007    72.3  33333216     6223.
 4 Angola      Africa     2007    42.7  12420476     4797.
 5 Argentina   Americas   2007    75.3  40301927    12779.
 6 Australia   Oceania    2007    81.2  20434176    34435.
 7 Austria     Europe     2007    79.8   8199783    36126.
 8 Bahrain     Asia       2007    75.6    708573    29796.
 9 Bangladesh  Asia       2007    64.1 150448339     1391.
10 Belgium     Europe     2007    79.4  10392226    33693.

Let’s try a few more simple examples:

# Function to select and arrange data
my_select_arrange <- function(data, column_name) {
  data %>%
    select(country, year, {{column_name}}) %>%
    arrange({{column_name}})
}

# Test it out
my_select_arrange(gapminder, lifeExp) %>%
  slice_head(n = 10)

# A tibble: 10 × 3
   country       year lifeExp
   <fct>        <int>   <dbl>
 1 Rwanda        1992    23.6
 2 Afghanistan   1952    28.8
 3 Gambia        1952    30  
 4 Angola        1952    30.0
 5 Sierra Leone  1952    30.3
 6 Afghanistan   1957    30.3
 7 Cambodia      1977    31.2
 8 Mozambique    1952    31.3
 9 Sierra Leone  1957    31.6
10 Burkina Faso  1952    32.0

# Function to create a simple summary
my_summary <- function(data, group_column, summary_column) {
  data %>%
    group_by({{group_column}}) %>%
    summarise(
      average = mean({{summary_column}}, na.rm = TRUE),
      count = n(),
      .groups = "drop"
    )
}

# Test with different combinations
my_summary(gapminder, continent, lifeExp)

# A tibble: 5 × 3
  continent average count
  <fct>       <dbl> <int>
1 Africa       48.9   624
2 Americas     64.7   300
3 Asia         60.1   396
4 Europe       71.9   360
5 Oceania      74.3    24

8.1 Using `{{}}` with ggplot2

The curly-curly operator also works with ggplot2:

# Function to create a simple scatter plot
my_scatter_plot <- function(data, x_column, y_column) {
  data %>%
    ggplot(aes(x = {{x_column}}, y = {{y_column}})) +
    geom_point(alpha = 0.7) +
    theme_minimal()
}

# Create a plot
my_scatter_plot(gapminder %>% filter(year == 2007), gdpPercap, lifeExp)

Key Point: {{}} works everywhere

The {{}} operator works with:

dplyr functions: filter(), select(), mutate(), group_by(), etc.
ggplot2: inside aes() for aesthetics
Any tidyverse function that uses tidy evaluation

This consistency makes it easy to remember - just wrap your column arguments in {{}} and they’ll work!
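The earlier examples covered filter(), select(), arrange(), group_by(), and summarise(). As one more sketch, here is {{}} inside mutate(), using a hypothetical helper named add_ratio and the built-in mtcars data (both are assumptions for illustration).

```r
library(dplyr)

# Hypothetical helper: add a column holding the ratio of two other columns
add_ratio <- function(data, numerator, denominator) {
  data %>%
    mutate(ratio = {{ numerator }} / {{ denominator }})
}

# Horsepower per unit of weight, using the built-in mtcars data
with_ratio <- add_ratio(mtcars, hp, wt)
head(with_ratio$ratio, 3)
```

The spaces inside the braces are optional; {{numerator}} and {{ numerator }} behave the same.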

9 Using `...` for Ultimate Flexibility

The dot-dot-dot (...) parameter is another powerful tool for creating flexible functions. It allows your function to accept any number of additional arguments and pass them along to other functions.

What does ... do?

The ... parameter (called “dot-dot-dot” or “ellipsis”) allows your function to:

Accept unlimited arguments that you don’t need to specify in advance
Pass arguments through to other functions without knowing what they are
Make wrapper functions that extend existing functionality

Think of ... as saying “accept whatever else the user wants to pass along.”

9.1 Example 1: Passing Arguments to ggplot2

Let’s create a plotting function that can accept any additional geom_point() arguments:

# Function that forwards additional geom_point() arguments via ...
flexible_scatter <- function(data, x_col, y_col, ...) {
  data %>%
    ggplot(aes(x = {{x_col}}, y = {{y_col}})) +
    geom_point(...) +  # Pass ... to geom_point()
    theme_minimal()
}

# Now we can pass any geom_point() arguments
flexible_scatter(gapminder %>% filter(year == 2007),
                gdpPercap, lifeExp,
                colour = "red", size = 2, alpha = 0.6)

9.2 Example 2: Flexible Summary Function

Here’s a summary function that can accept additional arguments for the summary functions:

# Function that passes ... to summary functions
flexible_summary <- function(data, group_col, summary_col, ...) {
  data %>%
    group_by({{group_col}}) %>%
    summarise(
      mean_val = mean({{summary_col}}, ...),  # Pass ... to mean()
      median_val = median({{summary_col}}, ...),  # Pass ... to median()
      count = n(),
      .groups = "drop"
    )
}

# Test with na.rm = TRUE passed through ...
gapminder_with_na <- gapminder
gapminder_with_na$lifeExp[1:5] <- NA  # Add some missing values

flexible_summary(gapminder_with_na, continent, lifeExp, na.rm = TRUE)

# A tibble: 5 × 4
  continent mean_val median_val count
  <fct>        <dbl>      <dbl> <int>
1 Africa        48.9       47.8   624
2 Americas      64.7       67.0   300
3 Asia          60.4       62.0   396
4 Europe        71.9       72.2   360
5 Oceania       74.3       73.7    24

9.3 Example 3: Combining `{{}}` and `...`

You can use both {{}} and ... in the same function for maximum flexibility:

# Ultimate flexible plotting function
ultimate_plot <- function(data, x_col, y_col, plot_type = "point", ...) {
  p <- data %>%
    ggplot(aes(x = {{x_col}}, y = {{y_col}}))
  
  if (plot_type == "point") {
    p <- p + geom_point(...)
  } else if (plot_type == "line") {
    p <- p + geom_line(...)
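  } else {
    # Fail loudly rather than return a plot with no layers
    stop("plot_type must be \"point\" or \"line\"")
  }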
  
  p + theme_minimal()
}

# Use with different geom arguments
ultimate_plot(gapminder %>% filter(country == "Australia"),
              year, lifeExp,
              plot_type = "line",
              colour = "blue", linewidth = 1.2)

When to Use ...

Use ... when you want to:

Wrap existing functions and allow users to pass additional arguments
Create flexible plotting functions that accept aesthetic parameters
Build summary functions that can handle different options (like na.rm)
Make your functions extensible without knowing all possible future arguments

Be careful: ... can make debugging harder since you don’t know what arguments might be passed!
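A small base R sketch of that debugging risk, using a hypothetical wrapper called flexible_mean: a misspelled argument name travels through ... and is silently ignored rather than reported.

```r
# Hypothetical wrapper that forwards everything in ... to mean()
flexible_mean <- function(x, ...) {
  mean(x, ...)
}

values <- c(1, 2, NA)

# Correct spelling: na.rm reaches mean() and the NA is dropped
ok <- flexible_mean(values, na.rm = TRUE)

# Typo (na_rm): no error is raised, the argument is silently ignored,
# and the missing value propagates into the result
typo <- flexible_mean(values, na_rm = TRUE)

c(ok = ok, typo = typo)
```

If this kind of mistake matters for your function, one option is to name the most important arguments explicitly (for example na.rm) and reserve ... for the rest.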

10 Documenting Your Functions

Documenting your functions is essential for sharing and future use. Good documentation helps others (and future you!) understand:

What the function does
What inputs it expects
What it returns
How to use it

The standard in R is to use roxygen2 comments above your function. These special comments start with #' and use specific tags such as @param, @return, and @examples.

Keyboard Shortcuts for Roxygen2

RStudio shortcuts for adding roxygen2 headers:

Windows/Linux: Ctrl + Alt + Shift + R
Mac: Cmd + Option + Shift + R

Place your cursor anywhere inside your function and use this shortcut to automatically generate a roxygen2 template with all the @param tags for your function arguments!

#' Filter a data frame by a column and threshold
#'
#' This function keeps only the rows of a data frame where the specified
#' column's values are above a given threshold, returning the first 10 matches.
#'
#' @param data A data frame to filter
#' @param column_name Column name as a string (e.g., "lifeExp")
#' @param threshold Numeric threshold value for filtering
#' @return A filtered data frame with up to 10 rows
#' @examples
#' filter_gapminder(gapminder, "lifeExp", 75)
#' filter_gapminder(gapminder, "gdpPercap", 30000)
filter_gapminder <- function(data, column_name, threshold) {
  data %>%
    filter(.data[[column_name]] > threshold) %>%
    slice_head(n = 10)
}

10.1 Common roxygen2 Tags

@param: Describes each parameter
@return: Describes what the function returns
@examples: Shows how to use the function

Let’s document one of our curly-curly functions as well:

#' Create a summary by group
#'
#' This function groups data by a specified column and calculates the mean
#' and count for another column.
#'
#' @param data A data frame
#' @param group_column Column to group by (unquoted)
#' @param summary_column Column to summarise (unquoted)
#' @return A data frame with group, average, and count columns
#' @examples
#' my_summary(gapminder, continent, lifeExp)
#' my_summary(gapminder, year, gdpPercap)
my_summary <- function(data, group_column, summary_column) {
  data %>%
    group_by({{group_column}}) %>%
    summarise(
      average = mean({{summary_column}}, na.rm = TRUE),
      count = n(),
      .groups = "drop"
    )
}

Building Packages

If you’re building an R package, you can use devtools::document() to automatically generate help files from your roxygen2 comments. This creates the standard R help documentation that users can access with ?function_name.
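Inside a package, a roxygen2 header usually carries a couple of extra tags. The sketch below is one possible shape, not a prescribed template: filter_above is a hypothetical function name, @export makes the function available to package users, and @importFrom rlang .data declares the .data pronoun so package checks do not report it as an undefined variable.

```r
#' Filter a data frame by a column and threshold
#'
#' @param data A data frame to filter
#' @param column_name Column name as a string (e.g., "lifeExp")
#' @param threshold Numeric threshold value for filtering
#' @return A data frame containing the rows above the threshold
#' @importFrom rlang .data
#' @export
filter_above <- function(data, column_name, threshold) {
  # dplyr:: prefix so the function does not rely on library(dplyr) being attached
  dplyr::filter(data, .data[[column_name]] > threshold)
}

# Quick check outside a package, using the built-in mtcars data
nrow(filter_above(mtcars, "mpg", 25))
```

Calling functions with the package:: prefix is a common package habit because package code should not call library().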

11 Summary

Today we’ve covered:

Basic function structure - the essential components of any R function
String-based column access - using .data[[]] for programmatic column selection
Tidy evaluation - understanding how tidyverse functions work with unquoted names
Data masking - the system that makes column names available as variables
The curly-curly operator {{}} - making your functions work seamlessly with tidyverse
The dot-dot-dot argument ... - passing extra arguments through to other functions
Documentation - using roxygen2 to create professional function documentation

11.1 Key Takeaways

Use {{}} when you want users to pass unquoted column names
Use .data[[]] when you have column names as strings
Always document your functions with roxygen2 comments
Test your functions with different datasets to ensure flexibility
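The last takeaway can be tried in a few lines. This is a minimal sketch, assuming the built-in mtcars and iris datasets as stand-ins for “different datasets” and plain stopifnot() checks in place of a testing framework.

```r
library(dplyr)

# The summary function from earlier in the session
my_summary <- function(data, group_column, summary_column) {
  data %>%
    group_by({{group_column}}) %>%
    summarise(
      average = mean({{summary_column}}, na.rm = TRUE),
      count = n(),
      .groups = "drop"
    )
}

# Run it on two unrelated datasets with different column names
by_cyl <- my_summary(mtcars, cyl, mpg)
by_species <- my_summary(iris, Species, Sepal.Length)

# Basic sanity checks: one row per group, and no rows lost in the counts
stopifnot(
  nrow(by_cyl) == n_distinct(mtcars$cyl),
  sum(by_cyl$count) == nrow(mtcars),
  nrow(by_species) == n_distinct(iris$Species),
  sum(by_species$count) == nrow(iris)
)
```

If stopifnot() returns silently, every check passed.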

11.2 Practice Exercises

Try creating your own functions that:

Filter data based on multiple conditions using {{}}
Create different types of plots with flexible aesthetics
Calculate custom summaries with grouping variables
Include proper roxygen2 documentation

Remember: the best functions are flexible, well-documented, and solve real problems in your analysis workflow!