Introduction

Causal diagrams

To understand bias in causal research, it is important to recognize that biases are often spurious (non-causal) associations that arise from underlying causal structures. These structures can be formally represented as causal diagrams using Directed Acyclic Graphs (DAGs), which are visual models where variables are represented as nodes and causal relationships are depicted as arrows (edges) pointing from causes to their consequences.

In a DAG, a causal path is a sequence of variables connected by directed edges (arrows) where all arrows point in the same direction, from the cause to the consequence. Conversely, a non-causal path contains at least one edge that points against the direction of the assumed causal flow. DAGs are acyclic, meaning they do not include feedback loops: no variable can causally influence itself, either directly or indirectly, within a single time point.

DAGs are qualitative, nonparametric tools. They illustrate assumed causal relationships without specifying their strength or functional form. The arrows do not imply deterministic links, but rather the possibility of influence given appropriate conditions. Despite their simplicity, DAGs provide powerful insights into how statistical associations emerge, and how they can mislead us if misinterpreted. Biases can be characterized graphically based on the structural patterns in DAGs. There are four fundamental causal structures that serve as the basic building blocks for constructing more complex causal relationships in DAGs:

  • Confounder (also called fork or common cause): A variable that causally affects (at least) two others.
  • Mediator (also called pipe or chain): A variable lying on the causal pathway from the cause to the consequence.
  • Collider (also called common effect): A variable that is causally influenced by (at least) two others.
  • Descendant: A variable that is causally influenced by another variable.
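As a minimal sketch of the first three structures (using the dagitty package, which is also loaded in the code below; the variable names are hypothetical), each building block can be encoded and its statistical signature inspected:

```r
library(dagitty)

fork     <- dagitty('dag { V -> C   V -> S }')  # V: confounder (common cause) of C and S
chain    <- dagitty('dag { V -> C   C -> S }')  # C: mediator on the path from V to S
collider <- dagitty('dag { V -> S   C -> S }')  # S: collider (common effect) of V and C

# A fork implies that C and S are independent only *given* V ...
impliedConditionalIndependencies(fork)
# ... whereas a collider implies that V and C are *marginally* independent
impliedConditionalIndependencies(collider)
```

For the chain, conditioning on the mediator C likewise renders V and S independent.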
Code
library(ggdag) # for DAG
library(dagitty)
library(ggplot2) # for visualisation

dag_coords.intro <-
  data.frame(name = c('C', 'I', 'V', 'S', 'D', 'P'),
             x = c(1, 1, 3.5, 6, 6, 3.5),
             y = c(1, 2, 3, 1, 2, 2))

DAG.intro <-
  dagify(I ~ C,
         C ~ V,
         P ~ V,
         S ~ C + V,
         D ~ S, 
         coords = dag_coords.intro)

node_labels <- c(
  C = 'bold(C)',
  I = 'bold(I)',
  V = 'bold(V)',
  S = 'bold(S)',
  D = 'bold(D)',
  P = 'bold(P)'
)

ggplot(data = DAG.intro, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_text(aes(label = node_labels[name]),
                colour = 'black', size = 10, parse = TRUE,
                family = 'mono') +
  geom_dag_edges(arrow_directed = grid::arrow(length = grid::unit(10, 'pt'), type = 'open'),
                 edge_colour = 'black',
                 family = 'mono', 
                 fontface = 'bold') + 
  annotate('text', x = 3.5, y = 3.3, label = 'ventilation', 
           size = 4, hjust = 0.5, colour = 'grey50') +
  annotate('text', x = 1, y = 2.3, label = 'IAQ index', 
           size = 4, hjust = 0.5, colour = 'grey50') +
  annotate('text', x = 6, y = 2.3, label = 'daily productivity', 
           size = 4, hjust = 0.6, colour = 'grey50') +
  annotate('text', x = 3.5, y = 1.7, label = 'power usage', 
             size = 4, hjust = 0.5, colour = 'grey50') +
  annotate('text', x = 1, y = 0.7, label = 'CO2', 
           size = 4, hjust = 0.6, colour = 'grey50') +
  annotate('text', x = 6, y = 0.7, label = 'sleep quality', 
           size = 4, hjust = 0.6, colour = 'grey50') +
  coord_cartesian(xlim = c(0.5, 6.5), ylim = c(0.8, 3.2))  +
  theme_dag()
Fig. 1. Example of a graphical representation of a causal structure via a Directed Acyclic Graph.

For example, let us consider the hypothetical causal structure described in Fig. 1. Here, CO2 concentration \((\text{C})\) affects sleep quality \((\text{S})\), while ventilation \((\text{V})\) reduces CO2 but also affects sleep quality \((\text{S})\), for example, by increasing noise. Moreover, high ventilation increases power usage \((\text{P})\), CO2 concentration influences the Indoor Air Quality (IAQ) index \((\text{I})\), and low sleep quality leads to decreased daily productivity \((\text{D})\). In the context of this example, ventilation can act as a confounder (common cause) of CO2 concentration and sleep quality (i.e., \(\text{C} \leftarrow \text{V} \rightarrow \text{S}\)), while CO2 concentration can be seen as a mediator between ventilation and sleep quality (i.e., \(\text{V} \rightarrow \text{C} \rightarrow \text{S}\)). Sleep quality can serve as a collider (common effect) between ventilation and CO2 concentration (i.e., \(\text{V} \rightarrow \text{S} \leftarrow \text{C}\)), while power usage can function as a descendant of ventilation (i.e., \(\text{V} \rightarrow \text{P}\)). To better understand the four fundamental causal structures, it is helpful to visualize them graphically. Importantly, these fundamental causal structures are not inherent properties of the variables—they depend entirely on the specific DAG and the causal question of interest.
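The paths just described can also be enumerated programmatically with dagitty's paths() function; a sketch on the Fig. 1 structure (rewritten in dagitty syntax so the chunk is self-contained):

```r
library(dagitty)

# The hypothetical structure of Fig. 1 in dagitty syntax
dag.fig1 <- dagitty('dag {
  V -> C   V -> S   V -> P
  C -> I   C -> S
  S -> D
}')

# All paths between C and S, and whether each is open:
# the causal path C -> S and the non-causal path C <- V -> S
paths(dag.fig1, from = 'C', to = 'S')
```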

Confounder and descendant

Code
#example a: C -> S is open
ggplot(data = DAG.intro, aes(x = x, y = y, xend = xend, yend = yend)) +
  #backdoor path: common cause
  geom_segment(x = 1, xend = 3.5, y = 1, yend = 3,
               linewidth = 8, lineend = 'square', colour = '#D55E00', alpha = 0.05) + #'#C77CFF'
  geom_segment(x = 3.5, xend = 6, y = 3, yend = 1,
               linewidth = 8, lineend = 'square', colour = '#D55E00', alpha = 0.05) +
  #visualise causal effect path
  geom_segment(x = 1, xend = 6, y = 1, yend = 1,
               linewidth = 14, lineend = 'round', colour = '#009E73', alpha = 0.05) +  
  
  geom_dag_text(aes(label = node_labels[name]),
                colour = 'black', size = 10, parse = TRUE,
                family = 'mono') +
  geom_dag_edges(arrow_directed = grid::arrow(length = grid::unit(10, 'pt'), type = 'open'),
                 edge_colour = 'black',
                 family = c('mono'), 
                 fontface = c('bold')) + 
  coord_cartesian(xlim = c(0.5, 6.5), ylim = c(0.8, 3.2))  +
  theme_dag()


#example b: C -> S is closed
ggplot(data = DAG.intro, aes(x = x, y = y, xend = xend, yend = yend)) +
  #backdoor path: common cause
  geom_segment(x = 1, xend = 3.5, y = 1, yend = 3,
               linewidth = 8, lineend = 'square', colour = '#0072B2', alpha = 0.05) +
  geom_segment(x = 3.5, xend = 6, y = 3, yend = 1,
               linewidth = 8, lineend = 'square', colour = '#0072B2', alpha = 0.05) +
  #adjusted variable
  geom_point(x = 3.5, y = 3, shape = 22, size = 14, stroke = 0.9, fill = 'grey80', colour = 'black') +
  #visualise causal effect path
  geom_segment(x = 1, xend = 6, y = 1, yend = 1,
               linewidth = 14, lineend = 'round', colour = '#009E73', alpha = 0.05) +  
  
  geom_dag_text(aes(label = node_labels[name]),
                colour = 'black', size = 10, parse = TRUE,
                family = 'mono') +
  geom_dag_edges(arrow_directed = grid::arrow(length = grid::unit(10, 'pt'), type = 'open'),
                 edge_colour = 'black',
                 family = c('mono'), 
                 fontface = c('bold')) + 
  coord_cartesian(xlim = c(0.5, 6.5), ylim = c(0.8, 3.2))  +
  theme_dag()


#example c: C -> S is partially closed
ggplot(data = DAG.intro, aes(x = x, y = y, xend = xend, yend = yend)) +
  #backdoor path: common cause
  geom_segment(x = 1, xend = 3.5, y = 1, yend = 3,
               linewidth = 8, lineend = 'square', colour = '#0072B2', alpha = 0.05, linetype = '12') +
  geom_segment(x = 3.5, xend = 6, y = 3, yend = 1,
               linewidth = 8, lineend = 'square', colour = '#0072B2', alpha = 0.05, linetype = '12') +
  #adjusted variable
  geom_point(x = 3.5, y = 2, shape = 22, size = 14, stroke = 0.9, fill = 'grey80', colour = 'black') +
  #visualise causal effect path
  geom_segment(x = 1, xend = 6, y = 1, yend = 1,
               linewidth = 14, lineend = 'round', colour = '#009E73', alpha = 0.05) +  
  
  geom_dag_text(aes(label = node_labels[name]),
                colour = 'black', size = 10, parse = TRUE,
                family = 'mono') +
  geom_dag_edges(arrow_directed = grid::arrow(length = grid::unit(10, 'pt'), type = 'open'),
                 edge_colour = 'black',
                 family = c('mono'), 
                 fontface = c('bold')) + 
  coord_cartesian(xlim = c(0.5, 6.5), ylim = c(0.8, 3.2))  +
  theme_dag()
Fig. 2. Illustrations of confounder and descendant. Green lines indicate the causal questions of interest; orange lines indicate open paths; light blue lines indicate closed paths; dashed orange and light blue lines indicate partially open and closed paths, respectively; a grey-filled box indicates statistical adjustment.

In Fig. 2, we are interested in the total average causal effect (green line) of CO2 concentration \((\text{C})\) on sleep quality \((\text{S})\). In this setting, ventilation \((\text{V})\) is a confounder (i.e., common cause) of \(\text{C}\) and \(\text{S}\) and creates a non-causal path between them (i.e., \(\text{C} \leftarrow \text{V} \rightarrow \text{S}\)). Leaving a confounder unadjusted keeps the non-causal path open (orange line in Fig. 2 (a)), introducing bias. Adjusting for a confounder (e.g., by including it as a predictor in a regression model) blocks this path. Fig. 2 (b) illustrates this, where the grey-filled box represents the adjustment and the light blue line indicates the blocked path. Adjusting for the descendant of a confounder is illustrated in Fig. 2 (c). In general, adjusting for a descendant has a similar effect to adjusting for its direct parent variable, but the impact depends on how well the descendant captures the parent’s influence. The effect will be identical only if the descendant fully represents the parent (i.e., if the descendant is a perfect proxy for the parent). Assuming that power usage \((\text{P})\) is not a perfect proxy for ventilation \((\text{V})\), adjusting for \(\text{P}\) (grey-filled box) will only partially close the path (dashed light blue line). As a result, bias will remain, but its absolute magnitude will be smaller than with a fully open path.
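These adjustment choices can be checked with a small simulation (base R; the linear coefficients are hypothetical and chosen purely for illustration):

```r
set.seed(42)
n <- 1e5
V <- rnorm(n)                       # ventilation (the confounder)
C <- -1.0 * V + rnorm(n)            # more ventilation -> lower CO2
S <-  0.5 * C - 0.3 * V + rnorm(n)  # true effect of C on S is 0.5
P <-  1.0 * V + rnorm(n)            # power usage: a noisy proxy of V

coef(lm(S ~ C))['C']      # biased: the backdoor path C <- V -> S is open
coef(lm(S ~ C + V))['C']  # close to 0.5: adjusting for V closes the path
coef(lm(S ~ C + P))['C']  # between the two: P only partially closes it
```

With these coefficients, the unadjusted estimate lands around 0.65 and the proxy-adjusted estimate around 0.6: the residual bias shrinks but does not vanish.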

Mediator and descendant

Code
#example a: V -> S is open
ggplot(data = DAG.intro, aes(x = x, y = y, xend = xend, yend = yend)) +
  #mediation path
  geom_segment(x = 1, xend = 6, y = 1, yend = 1,
               linewidth = 8, lineend = 'square', colour = '#D55E00', alpha = 0.05) +
  geom_segment(x = 1, xend = 3.5, y = 1, yend = 3,
               linewidth = 8, lineend = 'square', colour = '#D55E00', alpha = 0.05) +
  #visualise causal effect path
  geom_segment(x = 3.5, xend = 6, y = 3, yend = 1,
               linewidth = 14, lineend = 'round', colour = '#009E73', alpha = 0.05) +
  
  geom_dag_text(aes(label = node_labels[name]),
                colour = 'black', size = 10, parse = TRUE,
                family = 'mono') +
  geom_dag_edges(arrow_directed = grid::arrow(length = grid::unit(10, 'pt'), type = 'open'),
                 edge_colour = 'black',
                 family = c('mono'), 
                 fontface = c('bold')) + 
  coord_cartesian(xlim = c(0.5, 6.5), ylim = c(0.8, 3.2))  +
  theme_dag()


#example b: V -> S is closed
ggplot(data = DAG.intro, aes(x = x, y = y, xend = xend, yend = yend)) +
  #mediation path
  geom_segment(x = 1, xend = 6, y = 1, yend = 1,
               linewidth = 8, lineend = 'square', colour = '#0072B2', alpha = 0.05) +
  geom_segment(x = 1, xend = 3.5, y = 1, yend = 3,
               linewidth = 8, lineend = 'square', colour = '#0072B2', alpha = 0.05) +
  #adjusted variable
  geom_point(x = 1, y = 1, shape = 22, size = 14, stroke = 0.9, fill = 'grey80', colour = 'black') +
  #visualise causal effect path
  geom_segment(x = 3.5, xend = 6, y = 3, yend = 1,
               linewidth = 14, lineend = 'round', colour = '#009E73', alpha = 0.05) +
  
  geom_dag_text(aes(label = node_labels[name]),
                colour = 'black', size = 10, parse = TRUE,
                family = 'mono') +
  geom_dag_edges(arrow_directed = grid::arrow(length = grid::unit(10, 'pt'), type = 'open'),
                 edge_colour = 'black',
                 family = c('mono'), 
                 fontface = c('bold')) + 
  coord_cartesian(xlim = c(0.5, 6.5), ylim = c(0.8, 3.2))  +
  theme_dag()


#example c: V -> S is partially closed
ggplot(data = DAG.intro, aes(x = x, y = y, xend = xend, yend = yend)) +
  #mediation path
  geom_segment(x = 1, xend = 6, y = 1, yend = 1,
               linewidth = 8, lineend = 'square', colour = '#0072B2', alpha = 0.05, linetype = '12') +
  geom_segment(x = 1, xend = 3.5, y = 1, yend = 3,
               linewidth = 8, lineend = 'square', colour = '#0072B2', alpha = 0.05, linetype = '12') +
  #adjusted variable
  geom_point(x = 1, y = 2, shape = 22, size = 14, stroke = 0.9, fill = 'grey80', colour = 'black') +
  #visualise causal effect path
  geom_segment(x = 3.5, xend = 6, y = 3, yend = 1,
               linewidth = 14, lineend = 'round', colour = '#009E73', alpha = 0.05) +
  
  geom_dag_text(aes(label = node_labels[name]),
                colour = 'black', size = 10, parse = TRUE,
                family = 'mono') +
  geom_dag_edges(arrow_directed = grid::arrow(length = grid::unit(10, 'pt'), type = 'open'),
                 edge_colour = 'black',
                 family = c('mono'), 
                 fontface = c('bold')) + 
  coord_cartesian(xlim = c(0.5, 6.5), ylim = c(0.8, 3.2))  +
  theme_dag()
Fig. 3. Illustrations of mediator and descendant. Green lines indicate the causal questions of interest; orange lines indicate open paths; light blue lines indicate closed paths; dashed orange and light blue lines indicate partially open and closed paths, respectively; a grey-filled box indicates statistical adjustment.

In Fig. 3, we are interested in the direct average causal effect (green line) of ventilation \((\text{V})\) on sleep quality \((\text{S})\). In this setting, CO2 concentration \((\text{C})\) is a mediator of \(\text{V}\) and \(\text{S}\) (i.e., \(\text{V} \rightarrow \text{C} \rightarrow \text{S}\)). The path \(\text{V} \rightarrow \text{C} \rightarrow \text{S}\) is a causal path, specifically the indirect causal effect of \(\text{V}\) on \(\text{S}\). However, it is not the causal effect of interest, since we are interested in the direct causal effect of \(\text{V}\) on \(\text{S}\). Leaving the mediator unadjusted keeps that causal path open (orange line in Fig. 3 (a)). Adjusting for a mediator blocks the causal pathway on which it lies. Fig. 3 (b) illustrates this, where the grey-filled box visually represents the adjustment, and the light blue line indicates the blocked path. Adjusting for the descendant of a mediator is illustrated in Fig. 3 (c). Assuming IAQ index \((\text{I})\) is not a perfect proxy of CO2 concentration \((\text{C})\), adjusting for \(\text{I}\) (grey-filled box) will only partially close the path (dashed light blue line).
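A small simulation (base R, hypothetical coefficients) makes the distinction between the total and the direct effect concrete:

```r
set.seed(1)
n <- 1e5
V <- rnorm(n)                       # ventilation
C <- -1.0 * V + rnorm(n)            # mediator: ventilation lowers CO2
S <- -0.3 * V + 0.5 * C + rnorm(n)  # direct effect of V: -0.3; indirect: -1.0 * 0.5 = -0.5

coef(lm(S ~ V))['V']      # total effect: about -0.3 + (-0.5) = -0.8
coef(lm(S ~ V + C))['V']  # direct effect: about -0.3 (mediated path blocked)
```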

Collider and descendant

Code
#example a: V -> C is closed
ggplot(data = DAG.intro, aes(x = x, y = y, xend = xend, yend = yend)) +
  #backdoor path: common effect
  geom_segment(x = 1, xend = 6, y = 1, yend = 1,
               linewidth = 8, lineend = 'square', colour = '#0072B2', alpha = 0.05) +
  geom_segment(x = 3.5, xend = 6, y = 3, yend = 1,
               linewidth = 8, lineend = 'square', colour = '#0072B2', alpha = 0.05) +
  #visualise causal effect path
  geom_segment(x = 1, xend = 3.5, y = 1, yend = 3,
               linewidth = 14, lineend = 'round', colour = '#009E73', alpha = 0.05) +
  
  geom_dag_text(aes(label = node_labels[name]),
                colour = 'black', size = 10, parse = TRUE,
                family = 'mono') +
  geom_dag_edges(arrow_directed = grid::arrow(length = grid::unit(10, 'pt'), type = 'open'),
                 edge_colour = 'black',
                 family = c('mono'), 
                 fontface = c('bold')) + 
  coord_cartesian(xlim = c(0.5, 6.5), ylim = c(0.8, 3.2))  +
  theme_dag()


#example b: V -> C is open
ggplot(data = DAG.intro, aes(x = x, y = y, xend = xend, yend = yend)) +
  #backdoor path: common effect
  geom_segment(x = 1, xend = 6, y = 1, yend = 1,
               linewidth = 8, lineend = 'square', colour = '#D55E00', alpha = 0.05) +
  geom_segment(x = 3.5, xend = 6, y = 3, yend = 1,
               linewidth = 8, lineend = 'square', colour = '#D55E00', alpha = 0.05) +
  #adjusted variable
  geom_point(x = 6, y = 1, shape = 22, size = 14, stroke = 0.9, fill = 'grey80', colour = 'black') +
  #visualise causal effect path
  geom_segment(x = 1, xend = 3.5, y = 1, yend = 3,
               linewidth = 14, lineend = 'round', colour = '#009E73', alpha = 0.05) +
  
  geom_dag_text(aes(label = node_labels[name]),
                colour = 'black', size = 10, parse = TRUE,
                family = 'mono') +
  geom_dag_edges(arrow_directed = grid::arrow(length = grid::unit(10, 'pt'), type = 'open'),
                 edge_colour = 'black',
                 family = c('mono'), 
                 fontface = c('bold')) + 
  coord_cartesian(xlim = c(0.5, 6.5), ylim = c(0.8, 3.2))  +
  theme_dag()


#example c: V -> C is partially open
ggplot(data = DAG.intro, aes(x = x, y = y, xend = xend, yend = yend)) +
  #backdoor path: common effect
  geom_segment(x = 1, xend = 6, y = 1, yend = 1,
               linewidth = 8, lineend = 'square', colour = '#D55E00', alpha = 0.05, linetype = '12') +
  geom_segment(x = 3.5, xend = 6, y = 3, yend = 1,
               linewidth = 8, lineend = 'square', colour = '#D55E00', alpha = 0.05, linetype = '12') +
  #adjusted variable
  geom_point(x = 6, y = 2, shape = 22, size = 14, stroke = 0.9, fill = 'grey80', colour = 'black') +
  #visualise causal effect path
  geom_segment(x = 1, xend = 3.5, y = 1, yend = 3,
               linewidth = 14, lineend = 'round', colour = '#009E73', alpha = 0.05) +
  
  geom_dag_text(aes(label = node_labels[name]),
                colour = 'black', size = 10, parse = TRUE,
                family = 'mono') +
  geom_dag_edges(arrow_directed = grid::arrow(length = grid::unit(10, 'pt'), type = 'open'),
                 edge_colour = 'black',
                 family = c('mono'), 
                 fontface = c('bold')) + 
  coord_cartesian(xlim = c(0.5, 6.5), ylim = c(0.8, 3.2))  +
  theme_dag()
Fig. 4. Illustrations of collider and descendant. Green lines indicate the causal questions of interest; orange lines indicate open paths; light blue lines indicate closed paths; dashed orange and light blue lines indicate partially open and closed paths, respectively; a grey-filled box indicates statistical adjustment.

In Fig. 4, we are interested in the total average causal effect (green line) of ventilation \((\text{V})\) on CO2 concentration \((\text{C})\). In this setting, sleep quality \((\text{S})\) is a collider (i.e., common effect) of \(\text{V}\) and \(\text{C}\) and creates a non-causal path between them (i.e., \(\text{V} \rightarrow \text{S} \leftarrow \text{C}\)). Leaving a collider unadjusted keeps the non-causal path closed (light blue line in Fig. 4 (a)). Adjusting for a collider opens this path, introducing bias. Fig. 4 (b) illustrates this, where the grey-filled box represents the adjustment and the orange line indicates the opened path. Adjusting for the descendant of a collider is illustrated in Fig. 4 (c). Assuming that daily productivity \((\text{D})\) is not a perfect proxy of sleep quality \((\text{S})\), adjusting for \(\text{D}\) (grey-filled box) will only partially open the path (dashed orange line). As a result, bias will occur, but its absolute magnitude will be smaller than with a fully open path.
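Collider bias can likewise be demonstrated with a short simulation (base R, hypothetical coefficients): the unadjusted model recovers the true effect, while adjusting for the collider, or its descendant, distorts it:

```r
set.seed(7)
n <- 1e5
V <- rnorm(n)                       # ventilation
C <- -1.0 * V + rnorm(n)            # true total effect of V on C is -1.0
S <- -0.3 * V + 0.5 * C + rnorm(n)  # sleep quality: collider of V and C
D <-  1.0 * S + rnorm(n)            # daily productivity: descendant of S

coef(lm(C ~ V))['V']      # about -1.0: unadjusted, the collider path stays closed
coef(lm(C ~ V + S))['V']  # biased towards zero: adjusting for S opens V -> S <- C
coef(lm(C ~ V + D))['V']  # between the two: D only partially opens the path
```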

Backdoor criterion and adjustment criterion

A central question in causal inference is whether it is possible to isolate and identify the true causal effect. For this purpose, a simple yet powerful graphical tool exists: the backdoor criterion. While it may sound technical, its logic is intuitive and has become increasingly relevant in human-centric building science research.

The backdoor criterion is a test that can be applied directly to the DAG to determine whether a set (or sets) of variables exists that is sufficient to identify the total average causal effect of interest. The rule can be applied purely visually, without requiring any statistical equations. Two conditions must be met to satisfy the backdoor criterion:

  1. The adjustment set (i.e., the group of variables that must be adjusted for to accurately estimate the causal effect of interest) must not include any descendant of the cause of interest;
  2. The adjustment set must block every path between the cause and its consequence that starts with an arrow pointing into the cause. These paths are called backdoor paths.

The core idea of the backdoor criterion for identifying the total average causal effect of a treatment on the outcome is to block all backdoor paths between them, without adjusting for any descendant of the treatment. When these two conditions are satisfied, the total average causal effect is identifiable, that is, it is in principle estimable from the data. If the conditions are not met, the causal effect may not be identifiable under the given causal assumptions. There are, however, cases where a valid adjustment set exists (i.e., the causal effect can be identified by adjustment), but the backdoor criterion cannot find it. To address this issue, Shpitser et al. introduced the adjustment criterion, which generalizes the backdoor criterion: if a valid adjustment set exists, the adjustment criterion will identify at least one such set.

The adjustment criterion is also based on two conditions. Assuming that \(\text{X}\) is the cause and \(\text{Y}\) is the outcome of interest, the conditions are:

  1. The adjustment set must exclude any descendant of \(\text{X}\) that lies on a causal path from \(\text{X}\) to \(\text{Y}\);
  2. The adjustment set must block all non-causal paths from \(\text{X}\) to \(\text{Y}\).

These conditions differ from those of the backdoor criterion. The first condition permits the inclusion of descendants of \(\text{X}\), provided they do not lie on a causal path from \(\text{X}\) to \(\text{Y}\). The second condition requires closing all non-causal paths, not just the backdoor paths (all backdoor paths are non-causal, but not all non-causal paths are backdoor paths). If either of these conditions is violated, no valid adjustment set exists under the given causal assumptions.
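In practice, valid adjustment sets do not have to be derived by hand: dagitty's adjustmentSets() implements this search. A sketch on the Fig. 1 structure (rewritten here so the chunk is self-contained):

```r
library(dagitty)

dag.fig1 <- dagitty('dag {
  V -> C   V -> S   V -> P
  C -> I   C -> S
  S -> D
}')

# For the total effect of C on S, adjusting for V suffices
adjustmentSets(dag.fig1, exposure = 'C', outcome = 'S')
# For the total effect of V on C, no adjustment is needed (empty set)
adjustmentSets(dag.fig1, exposure = 'V', outcome = 'C')
```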

Causation and prediction

Code
library(ggdag) # for DAG
library(dagitty)
library(dplyr)
library(ggplot2) # for visualisation
library(ggforce)

dag_coords.intro <-
  data.frame(name = c('X5', 'X1', 'X3', 'X7', 'Y', 'X2', 'X4', 'X8', 'X6'),
             x = c(1, 3.5, 3.5, 3.5, 6, 8.5, 8.5, 8.5, 11),
             y = c(2, 4, 3, 1, 2, 4, 3, 1, 2))

DAG.intro <-
  dagify(X3 ~ X1,
         X4 ~ X2,
         Y ~ X3 + X4,
         X7 ~ Y + X5,
         X8 ~ Y + X6,
         coords = dag_coords.intro)

node_labels <- c(
  X1 = 'bold(X[1])', 
  X2 = 'bold(X[2])', 
  X3 = 'bold(X[3])', 
  X4 = 'bold(X[4])', 
  X5 = 'bold(X[5])',
  X6 = 'bold(X[6])', 
  X7 = 'bold(X[7])', 
  X8 = 'bold(X[8])', 
  Y = 'bold(Y)'
)

highlight.nodes_cause <- c('X3', 'X4', 'Y')

highlight.nodes_prediction <- c('X3', 'X4', 'Y', 'X5', 'X7', 'X8', 'X6')

highlight.nodes_collider1 <- c('Y', 'X5', 'X7')

highlight.nodes_collider2 <- c('Y',  'X8', 'X6')


ggplot(data = DAG.intro, aes(x = x, y = y, xend = xend, yend = yend)) +
 geom_mark_hull(
    data = dag_coords.intro %>%
      filter(name %in% highlight.nodes_cause) %>%
      { centroid <- c(mean(.$x), mean(.$y))
        mutate(., 
               x = x + 0.1 * (x - centroid[1]),
               y = y + 0.1 * (y - centroid[2]))
        },
    aes(x = x, y = y), inherit.aes = FALSE,
    colour = alpha('#CC79A7', 0.5),
    size = 2) +
   geom_mark_hull(
    data = dag_coords.intro %>%
      filter(name %in% highlight.nodes_prediction) %>%
      { centroid <- c(mean(.$x), mean(.$y))
        mutate(., 
               x = x + 0.05 * (x - centroid[1]),
               y = y + 0.05 * (y - centroid[2]))
        },
    aes(x = x, y = y), inherit.aes = FALSE,
    colour = alpha('#0072B2', 0.5),
    size = 2) +
     geom_mark_hull(
    data = dag_coords.intro %>%
      filter(name %in% highlight.nodes_collider1) %>%
      { centroid <- c(mean(.$x), mean(.$y))
        mutate(., 
               x = x + 0.1 * (x - centroid[1]),
               y = y + 0.1 * (y - centroid[2]))
        },
    aes(x = x, y = y), inherit.aes = FALSE,
    colour = alpha('#E69F00', 0.5),
    size = 2) +
     geom_mark_hull(
    data = dag_coords.intro %>%
      filter(name %in% highlight.nodes_collider2) %>%
      { centroid <- c(mean(.$x), mean(.$y))
        mutate(., 
               x = x + 0.1 * (x - centroid[1]),
               y = y + 0.1 * (y - centroid[2]))
        },
    aes(x = x, y = y), inherit.aes = FALSE,
    colour = alpha('#E69F00', 0.5), 
    size = 2) +
  geom_dag_text(aes(label = node_labels[name]),
                parse = TRUE,
                colour = 'black',
                size = 10,
                family = 'mono') +
  geom_dag_edges(arrow_directed = grid::arrow(length = grid::unit(10, 'pt'), type = 'open'),
                 edge_colour = 'black',
                 family = 'mono',
                 fontface = 'bold') +
  coord_cartesian(xlim = c(0.5, 11.5), ylim = c(0.8, 4.2)) +
  theme_dag()
Fig. 5. Illustrations of causation and prediction for an outcome \(\text{Y}\). The pink outline indicates the direct causes of \(\text{Y}\); the blue outline indicates the Markov blanket for \(\text{Y}\); the orange outlines indicate the colliders.

As previously mentioned, in a Directed Acyclic Graph (DAG), causation follows the direction of the arrows but association does not. In Fig. 5, for example, \(\text{X}_1\) and \(\text{X}_3\) are associated because there is an edge connecting them, but since the arrow goes from \(\text{X}_1\) to \(\text{X}_3\) (i.e., \(\text{X}_1 \rightarrow \text{X}_3\)), the DAG assumes that \(\text{X}_1\) causes \(\text{X}_3\): changing \(\text{X}_1\) will change \(\text{X}_3\), but changing \(\text{X}_3\) will not affect \(\text{X}_1\). This directionality is critical for causal inference, which seeks to answer “what-if” questions about the effect of actively intervening on a variable. Pure prediction, in contrast, relies on observed associations. To predict \(\text{X}_1\), one could use \(\text{X}_3\), even though \(\text{X}_3\) is a consequence of \(\text{X}_1\). If the association between \(\text{X}_1\) and \(\text{X}_3\) is positive, observing a high value of \(\text{X}_3\) would suggest a high value of \(\text{X}_1\). The important point is that this involves observing \(\text{X}_1\) and \(\text{X}_3\) rather than directly changing them. Manually setting \(\text{X}_3\) to a high value will not affect \(\text{X}_1\).

To better illustrate the difference between causation and prediction, consider that in Fig. 5 the outcome of interest is \(\text{Y}\). Here, \(\text{Y}\) is directly caused by \(\text{X}_3\) and \(\text{X}_4\), and indirectly by \(\text{X}_1\) (i.e., \(\text{X}_1 \rightarrow \text{X}_3 \rightarrow \text{Y}\)) and \(\text{X}_2\) (i.e., \(\text{X}_2 \rightarrow \text{X}_4 \rightarrow \text{Y}\)). As such, all causal information about \(\text{Y}\) is contained in \(\text{X}_3\) and \(\text{X}_4\) (the pink outline in Fig. 5). However, if the goal is to predict \(\text{Y}\), a more accurate prediction can be obtained by also including \(\text{X}_5\), \(\text{X}_6\), \(\text{X}_7\) and \(\text{X}_8\) (the blue outline in Fig. 5). This is because these variables (i.e., \(\text{X}_3\), \(\text{X}_4\), \(\text{X}_5\), \(\text{X}_6\), \(\text{X}_7\) and \(\text{X}_8\)) block all paths between \(\text{Y}\) and the rest of the graph, and thus encompass all the information available in the data about \(\text{Y}\). The set of variables that contains all information about \(\text{Y}\) (or, more formally, the set of variables that renders \(\text{Y}\) conditionally independent of all other variables in the DAG) is called the Markov blanket.
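Both sets can be read off the DAG automatically; a sketch using dagitty (markovBlanket() returns the parents, the children, and the children's other parents of a variable):

```r
library(dagitty)

# The structure of Fig. 5 in dagitty syntax
dag.fig5 <- dagitty('dag {
  X1 -> X3   X2 -> X4
  X3 -> Y    X4 -> Y
  X5 -> X7   Y  -> X7
  X6 -> X8   Y  -> X8
}')

parents(dag.fig5, 'Y')        # direct causes of Y: X3, X4
markovBlanket(dag.fig5, 'Y')  # X3, X4, X5, X6, X7, X8
```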

A predictive model can leverage all the information contained in the Markov blanket to improve accuracy, even though some of the variables are not causal (i.e., \(\text{X}_5\), \(\text{X}_6\), \(\text{X}_7\) and \(\text{X}_8\)). A causal model, by contrast, should only include the variables needed to identify the causal effect of interest. As explained previously, \(\text{X}_7\) and \(\text{X}_8\) are colliders (i.e., \(\text{X}_5 \rightarrow \text{X}_7 \leftarrow \text{Y}\) and \(\text{Y} \rightarrow \text{X}_8 \leftarrow \text{X}_6\); the orange outlines in Fig. 5). Therefore, adjusting for \(\text{X}_7\) and \(\text{X}_8\) in a causal model opens these non-causal paths and introduces bias.