Oui-Love Plots: Outcome-informed Love plots for covariate balance in causal inference

Ehud Karavani

Abstract

Assessing balance between exposure groups by visualizing the absolute standardized mean differences (ASMD) in a Love plot is a common approach to diagnose models like propensity score weighting or matching. However, the ASMD only captures covariate-exposure associations and neglects to integrate information about the associations between covariates and the outcome. Since adjustment sets are often defined a-priori to the data (e.g., via directed acyclic graphs), there is an increased risk of classifying variables as confounders even if they are not actually associated with both the exposure and the outcome in the data. And since confounders are required to be balanced between exposure groups to achieve unbiased effect estimation, assessing the balance of other variables may create a skewed picture of the required balance and may mislead researchers to pursue balancing where it is not strictly necessary.

To overcome this issue, we propose the Oui-Love plot: an OUtcome-Informed Love plot, consisting of two main parts. First, we introduce an outcome-informed ASMD score by incorporating covariate-outcome importance measures with the ASMD. Second, since we love plots, this additional score is visualized through several graphical channels (opacity, marker size, and covariate order) to augment the standard Love plot. This enhanced Love plot can better assist researchers to evaluate balancing on variables that are more likely to bias treatment effects in the data at hand by emphasizing variables that are statistically associated with both exposure and the outcome. The method is modular and easy to implement with modern statistical software and may hopefully benefit causal inference researchers and practitioners.

Introduction

Covariate balance is an essential concept for causal inference from observational studies. Generally, balance diagnostics assess the difference in covariate distributions between exposure levels. Intuitively, if the covariate distribution is the same across exposure groups, there will be no systematic bias in exposure, and differences in outcomes will only be due to the exposure status. Therefore, assessing covariate balance is the most common diagnostic for balancing methods like inverse propensity weighting (IPW) or matching (Ali et al. 2015; Granger et al. 2020).

Ideally, covariate balance is assessed over confounding variables. These are variables affecting both the exposure and the outcome; failing to adjust for them can introduce bias when examining the influence of the exposure on the outcome (Hernán et al. 2002). Adjusting for these confounders reduces their difference across exposure groups, effectively equalizing their influence on the outcome across exposure groups and thus enabling the model to isolate the effect of the exposure.

Whether a variable is a confounder cannot be determined from the data itself. Therefore, the most common approach to select confounders for an adjustment set is to let a domain expert hand pick them manually (Hernán et al. 2002; Tennant et al. 2021). The structure arising from this specification is often formulated and presented as a Directed Acyclic Graph (DAG), where variables are depicted as nodes and two nodes are connected by an arrow if one causally influences the other (Greenland, Pearl, and Robins 1999). When specifying how any pair of variables interact, the omission of an edge – assuming there is no association whatsoever – is a much stronger assumption than its inclusion – allowing the model to infer zero association from the data. Thus, when determining the structure, modellers may prefer to err on the inclusion of edges, rather than their omission (Tennant et al. 2021; Greenland, Pearl, and Robins 1999).

Most importantly, since the structure – and therefore the confounders – are determined a-priori, based on prior knowledge rather than data, we denote them structural confounders. However, an a-priori structural confounder may not necessarily be an a-posteriori statistical confounder. Namely, the prior assumption that a variable is associated with both exposure and outcome may not manifest in the data at hand. This is quite plausible, even in the absence of finite sampling errors, since we often build DAGs one pair at a time, failing to grasp how many factors may interact to determine the exposure or outcome conditioned on all the other factors. For instance, a factor we assumed is considered when prescribing medicine may not actually be used, or we can fail to understand the mechanism under which a certain factor conditionally explains the outcome. Either will result in the factor not being statistically associated with the exposure or the outcome, respectively (or both).

Therefore, covariate balance assessments that only capture covariate-exposure associations can be insufficient. For example, such a putative confounder may heavily influence differential exposure, leading to high imbalance, but have no impact on the outcome (these may be referred to as instrumental variables). In such cases, it will be unnecessary, or even harmful (Ding, VanderWeele, and Robins 2017), to balance over that putative confounder. However, commonly used diagnostics like the (Absolute) Standardized Mean Difference (ASMD) (Austin 2009) and the corresponding Love plot (Ahmed et al. 2006) will fail to capture that and may mislead the researcher to focus where they should not.

To overcome this issue, information about the covariate-outcome association should be incorporated to provide a fuller picture. Fortunately, assessing the statistical importance of a variable for an outcome is a well-known problem in regression modeling and machine learning (Harrell et al. 2001, chap. 9.8.4; Shalev-Shwartz and Ben-David 2014, chap. 25; Lundberg and Lee 2017). Assessing covariate-outcome importance will provide us with additional information orthogonal and complementary to the covariate-exposure balance assessment. This will allow us to, literally speaking, paint a fuller picture of confounder imbalance.

Our contribution in this manuscript is twofold. First, we will provide visual augmentation to the known Love plot based on both standardized mean difference and covariate-outcome importance. Second, we will combine both measures together to suggest a better metric for covariate/model selection and further enhance Love plots or any other balance assessment plots.

Outcome-informed Love plot

Data visualization allows us to encode numeric data values as visual elements. When we “visualize data” we essentially map data values to graphical elements. Good, consistent visualization will map different dimensions of the data to different types of encoding channels using appropriate graphical marks (Wilke 2019, chap. 2).

For instance, a traditional Love plot (Figure 1) maps the covariates to the y-axis - so each covariate gets its own row, the (absolute) standardized mean difference (ASMD) of each covariate to the x-axis, and the type of the model (e.g., weighted/unweighted) to the color (blue/orange) and possibly also to different markers (circle/triangle). All in all, we mapped three data dimensions: covariates, their ASMD, and adjustment model, to three graphical channels: y-axis, x-axis, and color (and possibly a fourth channel of marker shape for emphasis).

In an outcome-informed Love plot, we calculate an additional data dimension: the covariate-outcome impact (“variable importance”) of each covariate on the outcome. The covariate-outcome importance is a non-negative score positively associated with importance. There are several approaches to calculate feature importance, which we describe in more detail in the Methods section, but all adhere to the same principle: a low score indicates relatively small importance – the exclusion of the covariate causes small changes in prediction error, and a high score indicates relatively high importance – the exclusion of the covariate causes large changes in the prediction error. Put differently, more important covariates explain more variation in the outcome and contribute more to the accuracy¹ of the prediction.

To augment the traditional Love plot, the additional dimension for the covariate-outcome importance score is mapped to one or more visual channels. In this work we suggest three candidate channels, which can be arbitrarily combined:

1. The opacity channel. Marks corresponding to more important covariates are more opaque, while less important marks are more transparent.
2. The size channel. Marks corresponding to more important covariates are larger, while less important marks are smaller.
3. The order of the y-axis. Covariates are ranked by their importance with more important covariates appearing on top.

Figure 2 presents the same Love plot but enhanced by each of those channels separately.

The common property of options (1) and (2) is that less important covariates appear less prominent, either being smaller or more transparent. The argument is that if they do not influence the outcome, they will not bias the estimation, and therefore are less important and less interesting to examine. If they are less interesting to examine, there is less need for them to stand out and they can therefore be less salient. This will reduce clutter and allow the viewer to focus on the more important (and thus visually prominent) covariates. Meanwhile, option (3) clusters more important covariates in a specific region of the plot, but breaks the convention of ordering covariates by the unadjusted ASMD that may be familiar to practitioners. All options achieve a similar objective of directing differential attention onto more important covariates, either by differential prominence (transparency and size) or by differential spatial location (order).

Figure 2: Encoding covariate-outcome importance information as marker opacity, marker size, and covariate order. Opacity and size allow less important covariates to appear less salient by making them more transparent or smaller, respectively. Y-axis order moves less important covariates further down the plot, focusing more important covariates in a specific region of the figure.
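The three channels can be illustrated with a minimal sketch. It is not the implementation behind Figure 2: the function names (`importance_to_channels`, `plot_oui_love`), the linear rescaling, and the opacity and size ranges are all assumptions chosen for illustration, and matplotlib is assumed only for the drawing step.

```python
def importance_to_channels(importance, min_alpha=0.2, min_size=20.0, max_size=200.0):
    """Map non-negative importance scores to opacity, marker size, and row order.

    `importance` is a dict {covariate name: score}. The ranges are illustrative defaults.
    """
    top = max(importance.values())
    # Rescale scores to [0, 1]; if every score is zero, treat all as equally unimportant.
    scaled = {k: (v / top if top > 0 else 0.0) for k, v in importance.items()}
    alpha = {k: min_alpha + (1.0 - min_alpha) * s for k, s in scaled.items()}
    size = {k: min_size + (max_size - min_size) * s for k, s in scaled.items()}
    # Most important covariate first, so it is drawn at the top of the plot.
    order = sorted(importance, key=importance.get, reverse=True)
    return alpha, size, order


def plot_oui_love(asmd_by_model, importance, threshold=0.1):
    """Draw a Love plot with importance encoded as opacity, size, and order.

    `asmd_by_model` is a dict {model label: {covariate name: ASMD}}.
    """
    import matplotlib.pyplot as plt  # lazy import: the mapping above works without it

    alpha, size, order = importance_to_channels(importance)
    rows = {name: len(order) - 1 - i for i, name in enumerate(order)}
    fig, ax = plt.subplots()
    for c, (label, asmd) in enumerate(asmd_by_model.items()):
        for name in order:
            ax.scatter(asmd[name], rows[name], s=size[name], alpha=alpha[name],
                       color=f"C{c}", label=label if name == order[0] else None)
    ax.axvline(threshold, linestyle="--", color="gray")
    ax.set_yticks(list(rows.values()))
    ax.set_yticklabels(list(rows.keys()))
    ax.set_xlabel("Absolute standardized mean difference")
    ax.legend()
    return ax
```

Keeping the mapping separate from the drawing makes it possible to enable any subset of the channels, for instance using only `order` while leaving opacity and size constant.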

Outcome-informed balance metrics

Oftentimes, we use the ASMD for model selection. For instance, the maximum ASMD after adjustment (e.g., weighting or matching) across all covariates can be considered as a good summary of the Love plot. Maximal post-adjustment ASMD describes the worst case scenario for imbalance. If our model can keep the max ASMD – and therefore the ASMD for all covariates – under reasonable tolerance (0.1 threshold is arbitrarily but commonly used), then we can gain further trust in the downstream effect being estimated. Once we have a single numeric metric that can diagnose model performance, we can use it to choose between two (or more) candidate models, choosing the one with minimal post-adjustment max ASMD.
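As a small illustration of this selection rule, the sketch below picks the candidate with the smallest maximal post-adjustment ASMD. The model names and ASMD values are made up for the example, and `select_model` is a hypothetical helper rather than part of any library.

```python
def max_asmd(asmd):
    """Worst-case imbalance: the largest ASMD across covariates."""
    return max(asmd.values())


def select_model(candidates, threshold=0.1):
    """Return the candidate with minimal max ASMD and whether it is within tolerance."""
    best = min(candidates, key=lambda name: max_asmd(candidates[name]))
    return best, max_asmd(candidates[best]) < threshold


# Made-up post-adjustment ASMDs for two hypothetical candidate models.
candidates = {
    "ipw_logistic": {"X_0": 0.02, "X_A": 0.12, "X_Y": 0.03, "X_AY": 0.04},
    "ipw_boosting": {"X_0": 0.05, "X_A": 0.06, "X_Y": 0.07, "X_AY": 0.08},
}
best, within_tolerance = select_model(candidates)
print(best, within_tolerance)
```

Note that the first candidate is rejected solely because of its imbalance on `X_A`, even though it balances every other covariate better.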

However, as argued above, ASMD alone can be a poor diagnostic. If the covariate with max ASMD has little influence on the outcome, there is little benefit in making the effort to improve its balance. That covariate should not be part of the model’s objective. In fact, that covariate creates a distorted image of the desired confounder balance.

One possible solution is combining the ASMD with the covariate-outcome importance measure into a single numeric score. In this manuscript, we argue for the multiplication of the two, as it can assess the interaction of the two orthogonal measures. Specifically, it addresses the issue depicted above naturally, by allowing the two measures to cancel each other out (Figure 3). For example, small covariate-outcome importance will lead to a small score overall, regardless of how large or small the corresponding ASMD is, and vice versa. High-scoring covariates will, therefore, be only those with both large ASMD and large covariate-outcome importance, meaning strong covariate-exposure association and strong covariate-outcome association, which is exactly the definition of a confounder.

Figure 3: Outcome-informed ASMD score. A) Inverse propensity weighted ASMD, with 0.1 threshold reference (dashed) B) Covariate-outcome importance score (mean decrease in mean square error). C) Outcome-informed ASMD, generated by multiplying the above two. The ASMD emphasizes covariate-exposure associations (\(X_A, X_{AY}\)), the outcome-importance score emphasizes covariate-outcome associations (\(X_Y, X_{AY}\)), and outcome-informed ASMD emphasizes the interaction of the former two scores (\(X_{AY}\)).
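The cancelling behaviour of the product can be seen in a short sketch. The numbers below are invented to mimic the qualitative pattern of Figure 3 and are not the values behind it; `outcome_informed_asmd` is a hypothetical helper name.

```python
def outcome_informed_asmd(asmd, importance):
    """Multiply each covariate's ASMD by its covariate-outcome importance."""
    return {name: asmd[name] * importance[name] for name in asmd}


# Invented values: X_A is imbalanced but unimportant, X_Y is important but balanced.
asmd = {"X_0": 0.02, "X_A": 0.60, "X_Y": 0.03, "X_AY": 0.55}
importance = {"X_0": 0.0, "X_A": 0.01, "X_Y": 0.45, "X_AY": 0.50}

score = outcome_informed_asmd(asmd, importance)
ranking = sorted(score, key=score.get, reverse=True)
print(ranking)
```

With these values, the covariate with the largest plain ASMD is `X_A`, whereas the covariate with the largest outcome-informed score is `X_AY`, the only one that is large on both measures.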

Methods

Covariate balance measures

The task of assessing covariate balance is essentially a two-sample test between the exposed and unexposed. Since two-sample tests often do not scale, making the comparison of two multivariable distributions ill-defined, researchers resort to comparing multiple univariable distributions by examining each covariate separately. For balance assessment in causal inference modeling, the most common metric used is the standardized mean difference (SMD) (Granger et al. 2020). The SMD is the difference in covariate averages divided by the pooled standard deviation. Mathematically, for each covariate \(j\) we define: \[ SMD_j = \frac{\bar{x}_j \vert_{A=1}- \bar{x}_j\vert_{A=0}}{\sqrt{\hat{\sigma}_j^2\vert_{A=1} + \hat{\sigma}_j^2\vert_{A=0}}} \]

where \(\bar{x}_j \vert_{A=1}\) is the average of feature \(x_j\) among those exposed, and \(\hat{\sigma}_j^2\vert_{A=0}\) is the estimated variance of \(x_j\) among the unexposed. Furthermore, since the direction of the imbalance is irrelevant for our purposes, we further take the absolute value and denote \(ASMD_j = \left\vert SMD_j \right\vert\).
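A minimal NumPy sketch of this computation follows. The denominator mirrors the formula as written above (the square root of the sum of the two group variances); Austin (2009) instead averages the two variances before taking the square root, so the last line should be adapted if that convention is preferred. The optional weights argument, used for post-weighting balance, and the choice of a variance estimator without degrees-of-freedom correction are assumptions of this sketch.

```python
import numpy as np


def asmd(x, a, w=None):
    """ASMD of one covariate `x` between exposure groups `a` (0/1), optionally weighted."""
    x = np.asarray(x, dtype=float)
    a = np.asarray(a).astype(bool)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    # (Weighted) group means.
    mean1 = np.average(x[a], weights=w[a])
    mean0 = np.average(x[~a], weights=w[~a])
    # (Weighted) group variances, without degrees-of-freedom correction.
    var1 = np.average((x[a] - mean1) ** 2, weights=w[a])
    var0 = np.average((x[~a] - mean0) ** 2, weights=w[~a])
    # Denominator as in the formula above: sqrt of the summed variances.
    return abs(mean1 - mean0) / np.sqrt(var1 + var0)
```

Calling `asmd(x, a)` gives the unadjusted value, and passing, for example, inverse propensity weights as `w` gives the adjusted one.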

Covariate importance measures

The task of assessing the influence of covariates on an outcome is well established in statistics, often utilized for dimensionality reduction (feature selection) or model selection. There are multiple approaches to compute this importance: regression models can use the absolute magnitude of coefficients (assuming the input is standardized), or the non-zero coefficients in L1-penalized regression (LASSO). In a more model-agnostic approach, the covariate importance can be assessed by how “excluding” each covariate affects some goodness-of-fit metric. This “exclusion” is either done by removing the feature entirely from the model (Harrell et al. 2001, chap. 9.8.4) or just shuffling its values across observations (Breiman 2001, sec. 10) (the latter may also be evaluated on an out-of-sample test split). The goodness-of-fit measure evaluated can be any arbitrary metric, like the loss or the accuracy. The change in goodness-of-fit can either be multiplicative or additive, grounding the full model (with all covariates) as the baseline to compare against. Covariates that are more important for predicting the outcome will cause a larger decrease in performance relative to the full model (that includes these covariates), meaning they are the ones driving the accuracy of the predictions.

In this work, our goodness-of-fit metric is the natural deviance, which, since the outcome in our simulations is continuous, is the mean squared error. The importance of a covariate is defined using the mean decrease in accuracy (Breiman 2001) – the percentage change in deviance between the full model and the model fitted with that covariate removed. Importantly, we do not consider univariable importance measures, but rather a metric that is always conditional on the rest of the covariates and the exposure (VanderWeele 2019). However, the outcome-informed Love plot can work with any arbitrary non-negative importance measure, as long as lower scores correspond to little importance and higher scores to high importance.
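The removal-based importance described above can be sketched as follows, assuming an ordinary least-squares outcome model evaluated in-sample. The outcome model, the function names (`mse_of_ols`, `drop_column_importance`), and the in-sample evaluation are assumptions of this sketch; any other learner or an out-of-sample split could be substituted.

```python
import numpy as np


def mse_of_ols(design, y):
    """In-sample mean squared error of a least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(y)), design])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.mean((y - design @ coef) ** 2))


def drop_column_importance(covariates, a, y):
    """Percentage increase in MSE when each covariate is removed from the full model.

    `covariates` is a dict {name: 1-D array}. The exposure `a` is always kept,
    so each score is conditional on the exposure and on the other covariates.
    """
    names = list(covariates)
    full = mse_of_ols(np.column_stack([a] + [covariates[n] for n in names]), y)
    importance = {}
    for dropped in names:
        kept = [covariates[n] for n in names if n != dropped]
        reduced = mse_of_ols(np.column_stack([a] + kept), y)
        importance[dropped] = 100.0 * (reduced - full) / full
    return importance
```

Because the reduced models are nested in the full model, the in-sample scores are non-negative up to floating-point error, which matches the requirement of a non-negative importance measure. The sketch assumes the full model does not fit the outcome perfectly, since the full-model MSE is used as the denominator.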

Data

We present our augmented Love plot on a minimally sufficient data simulation. The simulation includes four covariates: one (\(X_0\)) is associated with neither the exposure (\(A\)) nor the outcome (\(Y\)), one (\(X_A\)) is associated only with the exposure, another (\(X_Y\)) only with the outcome, and one true confounder (\(X_{AY}\)) is associated with both. Mathematically, the full generating process is \[ \begin{aligned} Y &\sim A + X_Y + X_{AY} + \epsilon \\ A &\sim \text{Bernoulli}(\pi) \\ \text{logit}(\pi) &= X_A + X_{AY} \\ X_0, X_A, X_Y, X_{AY} &\sim \text{Normal}(0, 1) \\ \epsilon &\sim \text{Normal}(0, 1) \end{aligned} \]
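The generating process above can be sketched in NumPy as follows. The sample size, the seed, and the function name `simulate` are arbitrary choices for illustration and are not taken from the manuscript.

```python
import numpy as np


def simulate(n=10_000, seed=0):
    """Draw one dataset from the generating process described above."""
    rng = np.random.default_rng(seed)
    # Four independent standard-normal covariates.
    x_0, x_a, x_y, x_ay = rng.normal(0.0, 1.0, size=(4, n))
    # Exposure depends on X_A and X_AY through a logistic link.
    pi = 1.0 / (1.0 + np.exp(-(x_a + x_ay)))
    a = rng.binomial(1, pi)
    # Outcome depends on the exposure, X_Y, and X_AY, plus standard-normal noise.
    y = a + x_y + x_ay + rng.normal(0.0, 1.0, size=n)
    covariates = {"X_0": x_0, "X_A": x_a, "X_Y": x_y, "X_AY": x_ay}
    return covariates, a, y
```

In data drawn this way, \(X_A\) and \(X_{AY}\) should differ in mean between exposure groups while \(X_0\) and \(X_Y\) should not, up to sampling noise.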

The directed acyclic graph depicting this setting (Figure 4) describes a setting where \(X_0, X_A, X_Y\) are wrongly considered to be confounders (influence both exposure and outcome) a-priori, but are actually not.

Figure 4: Confounding structure underlying the simulation example. Right is the true confounding structure where \(X_0\) is associated with neither the exposure \(A\) nor the outcome \(Y\), \(X_A\) only influences the exposure, \(X_Y\) only influences the outcome, and \(X_{AY}\) is the only true confounder influencing both \(A\) and \(Y\). Left is the assumed confounding structure where all variables are assumed to be confounders influencing both the treatment and the outcome. Wrongly specified edges are depicted in red and dashed.

Discussion

We further present advantages, limitations, and extensions to the outcome-informed Love plots.

Prognostic and confounding variables

Throughout this manuscript, we have focused on confounding variables (factors influencing both the outcome and treatment assignment) that are necessary for obtaining unbiased effect estimations. However, prognostic variables (factors influencing the outcome but not the treatment assignment) can be of similar importance as they are required for obtaining precise effect estimations (Cinelli, Forney, and Pearl 2024). Including such pre-treatment prognostic factors in balancing models like inverse probability weighting is, therefore, recommended (Brookhart et al. 2006).

This distinction between prognostic and confounding variables immediately translates to the proposed Love plot augmentation with outcome-importance scores and outcome-informed ASMD scores. Outcome-informed Love plots utilizing only outcome-importance scores will favor emphasizing prognostic variables, as they only take covariate-outcome association into consideration. Outcome-informed ASMD scores, taking the product of the covariate-outcome importance with the ASMD (covariate-treatment importance), will favor emphasizing confounding variables. Therefore, choosing whether to augment a Love plot only with outcome-importance scores or with outcome-informed ASMDs can shed light on different aspects of the model depending on the interest of the analyst.

Confounding variables may still be detectable when augmenting a Love plot with just outcome-importance scores. First, because confounding factors do associate with the outcome, they will be visible and shown prominently in the plot. Second, unlike prognostic factors, confounding variables will have higher unadjusted ASMD, which will distinguish them from solely prognostic factors. This is a subtle cue that may impose more cognitive load on anyone examining the plot for potentially biasing imbalances, but the plot will still contain both pieces of information, unlike the outcome-informed ASMD that may obscure prognostic information.

In this manuscript, we nonetheless chose to augment the Love plots with the combined outcome-informed ASMD score instead of just the outcome-importance score. The reasons are simplicity and stronger visual emphasis. First, focusing on unbiased estimation (or confounding) is a single consideration, while unbiasedness and precision are two considerations that paint a slightly more complex picture. Second, in the simulated data used, the distinction between the proposed Love plot and the plain Love plot is more prominent using outcome-informed ASMD. For completeness, however, Figure 5 shows the difference between an outcome-informed Love plot and a Love plot augmented by outcome-informed ASMD score.

Figure 5: Love plot augmented by outcome-informed ASMD. On the left, an outcome-informed Love plot augmented by covariate-outcome importance (Similar to Figure 1, right, but without ordering). On the right, an outcome-informed Love plot, but augmented by the combined outcome-informed ASMD score. While the former may emphasize prognostic variables too much (\(X_Y\)), the latter is able to minimize their importance and instead emphasize confounding variables that have both large covariate-outcome importance *and* large ASMD.

Augmenting more than just Love plots

The augmentation presented, encoding covariate-outcome importance in the opacity and/or size channels, is not limited to the standard, traditional Love plot. Covariate imbalance, pre- and post-adjustment, can sometimes be presented as either scatter plots or slope graphs. In scatter plots, the adjusted and unadjusted ASMDs are plotted on the x and y axes. In slope graphs, the pre/post-adjustment indication is mapped to the x-axis and the ASMD to the y-axis. Scatter plots are better suited for high-dimensional settings where plotting all covariates on the y-axis is infeasible (or at least too crowded). Slope graphs have a similar strength, compromising some information richness (covariate identity) in exchange for emphasizing overall reduction in ASMD. Figure S1 demonstrates that both variants can benefit from the same augmenting encoding.

Assessing the utility of Outcome-informed Love plots

The main limitation of this manuscript is that its fundamental premise is not validated empirically. We believe outcome-informed Love plots are useful since we are practitioners and we found them to be so. However, anecdotal evidence is not good evidence. The proper way to examine the utility of our suggested augmentation is to conduct a user study, providing causal inference practitioners with the traditional Love plot and our extension (even breaking down the different channels and their combinations) and evaluating whether they indeed benefitted from it. This kind of work, however, is out of scope for the current manuscript.

Scale-less relative ranking

A second limitation concerns the lack of an interpretable, absolute scale for the outcome-informed ASMD score. The original ASMD has a clear scale in standard deviation units of the data. It is well-established and can be interpreted as Cohen’s d (Austin 2009), so thresholds, although arbitrary, are still fairly well understood and agreed on. The outcome-informed ASMD score, on the other hand, being the product of the ASMD and an importance score, no longer enjoys the familiarity and interpretability of the original ASMD. Therefore, its utility is only in its ranking, comparing covariates relative to other covariates in the same analytic sample.

However, when used to augment a Love plot, all that matters is how the outcome-informed ASMD score ranks covariates relative to each other, and specifically, ranking confounding variables above prognostic variables. Therefore, the limitation of the outcome-informed ASMD score being scale-less is irrelevant for the visualization task, and since the visualization only requires relative ranking, the outcome-informed ASMD score still faithfully fulfills its job.
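The point that only the relative ranking matters can be illustrated with a short sketch: rescaling the importance scores by any positive constant (for instance, expressing them as fractions rather than percentages) changes the magnitude of the outcome-informed ASMD score but not the ordering of covariates. The values below are invented for illustration, and `rank_covariates` is a hypothetical helper.

```python
def rank_covariates(asmd, importance):
    """Rank covariates by outcome-informed ASMD score, highest first."""
    score = {name: asmd[name] * importance[name] for name in asmd}
    return sorted(score, key=score.get, reverse=True)


# Invented values; importance given once in percent and once as a fraction.
asmd = {"X_0": 0.02, "X_A": 0.60, "X_Y": 0.03, "X_AY": 0.55}
importance_pct = {"X_0": 0.5, "X_A": 1.0, "X_Y": 45.0, "X_AY": 50.0}
importance_frac = {name: value / 100.0 for name, value in importance_pct.items()}

ranking_pct = rank_covariates(asmd, importance_pct)
ranking_frac = rank_covariates(asmd, importance_frac)
print(ranking_pct == ranking_frac)
```

Both calls place the confounder `X_AY` first, so a visualization driven by the ranking (or by scores rescaled to a common range) is unaffected by the unit of the importance measure.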

Summary

We have introduced an augmentation to the Love plot by incorporating additional information about covariate-outcome importance. The Love plot is a common graphical diagnostic for group balancing methods in causal inference, visualizing the (Absolute) Standardized Mean Difference (ASMD) for each covariate before and after adjustment. ASMD alone, however, can be misleading if the covariates under investigation are not true confounding variables, influencing both exposure and outcome. Therefore, the outcome-informed Love plot can help paint a fuller picture, emphasizing covariates that are both imbalanced and drive change in the outcome.

“Oui-Love plots”² is a modular, extendable, and easy-to-implement idea that can support the workflow of causal inference practitioners and we hope it will.

Appendix

Outcome-informed balance plots

Figure S1: Encoding covariate-outcome importance information (right) in scatter plots (top) and slope graphs (bottom).


References

Ahmed, Ali, Ahsan Husain, Thomas E Love, Giovanni Gambassi, Louis J Dell’Italia, Gary S Francis, Mihai Gheorghiade, Richard M Allman, Sreelatha Meleth, and Robert C Bourge. 2006. “Heart Failure, Chronic Diuretic Use, and Increase in Mortality and Hospitalization: An Observational Study Using Propensity Score Methods.” European Heart Journal 27 (12): 1431–39.

Ali, M Sanni, Rolf HH Groenwold, Svetlana V Belitser, Wiebe R Pestman, Arno W Hoes, Kit CB Roes, Anthonius de Boer, and Olaf H Klungel. 2015. “Reporting of Covariate Selection and Balance Assessment in Propensity Score Analysis Is Suboptimal: A Systematic Review.” Journal of Clinical Epidemiology 68 (2): 122–31.

Austin, Peter C. 2009. “Balance Diagnostics for Comparing the Distribution of Baseline Covariates Between Treatment Groups in Propensity-Score Matched Samples.” Statistics in Medicine 28 (25): 3083–3107.

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45: 5–32.

Brookhart, M Alan, Sebastian Schneeweiss, Kenneth J Rothman, Robert J Glynn, Jerry Avorn, and Til Stürmer. 2006. “Variable Selection for Propensity Score Models.” American Journal of Epidemiology 163 (12): 1149–56.

Cinelli, Carlos, Andrew Forney, and Judea Pearl. 2024. “A Crash Course in Good and Bad Controls.” Sociological Methods & Research 53 (3): 1071–1104.

Ding, Peng, TJ VanderWeele, and James M Robins. 2017. “Instrumental Variables as Bias Amplifiers with General Outcome and Confounding.” Biometrika 104 (2): 291–302.

Granger, Emily, Tim Watkins, Jamie C Sergeant, and Mark Lunt. 2020. “A Review of the Use of Propensity Score Diagnostics in Papers Published in High-Ranking Medical Journals.” BMC Medical Research Methodology 20: 1–9.

Greenland, Sander, Judea Pearl, and James M Robins. 1999. “Causal Diagrams for Epidemiologic Research.” Epidemiology 10 (1): 37–48.

Harrell, Frank E et al. 2001. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer.

Hernán, Miguel A, Sonia Hernández-Díaz, Martha M Werler, and Allen A Mitchell. 2002. “Causal Knowledge as a Prerequisite for Confounding Evaluation: An Application to Birth Defects Epidemiology.” American Journal of Epidemiology 155 (2): 176–84.

Lundberg, Scott M, and Su-In Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” In Advances in Neural Information Processing Systems 30, 4765–74. Curran Associates, Inc.

Shalev-Shwartz, Shai, and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

Tennant, Peter WG, Eleanor J Murray, Kellyn F Arnold, Laurie Berrie, Matthew P Fox, Sarah C Gadd, Wendy J Harrison, et al. 2021. “Use of Directed Acyclic Graphs (DAGs) to Identify Confounders in Applied Health Research: Review and Recommendations.” International Journal of Epidemiology 50 (2): 620–32.

VanderWeele, Tyler J. 2019. “Principles of Confounder Selection.” European Journal of Epidemiology 34: 211–19.

Wilke, Claus O. 2019. Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. O’Reilly Media.

Footnotes

¹ We will use “accuracy” in a colloquial manner throughout.
² Yes, we love plots.