Data preparation and input variable standardization

Before you run tar_make(), you must supply a single CSV file containing the extracted study data. The pipeline reads this file as its only external input — every downstream model, plot, and report derives from it. This page describes the required directory layout, the expected column format, and the variable transformations that clean() applies before any statistical modeling begins.

Directory structure

Place your extracted study data at exactly the following path relative to the project root:

data/
├── raw/
│   └── data.csv    ← place your extracted study data here
└── processed/      ← secondary products generated by the pipeline

The data/raw/ directory is scanned by lsData() at pipeline startup. The file must be named data.csv. The data/processed/ directory is reserved for any intermediate or feature-engineered outputs generated during the run.

data.csv must be placed in data/raw/ before you call tar_make(). If the file is missing, the fpath target will error and the entire pipeline will halt at the first step.
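A quick pre-flight check can catch a missing file before the pipeline starts. The sketch below is not part of the project: `check_input()` is a hypothetical helper, and its default assumes that the working directory is the project root.

```r
# Hypothetical helper (not part of the pipeline): verify that the input
# file exists at the location described above.
check_input <- function(root = ".") {
  fpath <- file.path(root, "data", "raw", "data.csv")
  if (!file.exists(fpath)) {
    stop("Input file not found: ", fpath, call. = FALSE)
  }
  # Return the path invisibly so the call can be used in a script
  # without printing anything on success.
  invisible(fpath)
}
```

Calling `check_input()` immediately before `targets::tar_make()` turns a late pipeline failure into an early, explicit error message.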

Input data format

Your data.csv must contain the following columns. Column names are matched exactly as they appear in the source code; extra columns are ignored.

Column	Type	Notes
`Author`	string	Study author name(s)
`Patient's age`	string	Age descriptor; copied as-is into the `Age` variable
`Year of Publication`	numeric	Four-digit year; cast to integer by `clean()`
`Prevalence`	string or numeric	Proportion of inappropriate use; accepts comma (`,`) or dot (`.`) as decimal separator
`Sample size`	numeric	Total number of patients; cast to integer
`Inappropriate indication`	numeric	Count of patients with inappropriate indication; cast to integer
`Continent`	string	Values containing `"Asia"`, `"Europe"`, or `"North America"` as a case-sensitive substring map to that level; anything else maps to `"Other"`
`Setting`	string	Must contain `"Hospital"` as a substring to map to `"Hospital Setting"`; otherwise `"Other"`
`Guideline`	string	`"Yes"` (exact match) maps to `"Followed Guideline(s)"`; any other non-missing value maps to `"No Guideline"`; missing values stay `NA`
`JBI_Classification`	string	Methodological quality classification; used as-is in subgroup and regression models
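Because column names are matched exactly, it can help to confirm that none are missing before running the pipeline. The following is a minimal sketch in base R; `required_cols` and `missing_cols()` are hypothetical names introduced for illustration, and the example data frame is made up.

```r
# Column names copied from the table above.
required_cols <- c(
  "Author", "Patient's age", "Year of Publication", "Prevalence",
  "Sample size", "Inappropriate indication", "Continent", "Setting",
  "Guideline", "JBI_Classification"
)

# Hypothetical helper: return the required columns absent from `tbl`.
missing_cols <- function(tbl) {
  setdiff(required_cols, names(tbl))
}

# Illustrative, deliberately incomplete data frame.
# check.names = FALSE keeps spaces and apostrophes in column names.
tbl <- data.frame(
  "Author" = "Doe", "Sample size" = 120, "Prevalence" = "0,42",
  check.names = FALSE
)
missing_cols(tbl)
```

If you load the CSV yourself with base R's `read.csv()`, pass `check.names = FALSE`; otherwise the names are sanitized on read and no longer match the table.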

Data cleaning

The clean() function in src/R/clean.R applies all variable transformations in a single dplyr::mutate() call. The cleaned data frame is stored as the tbl_clean target and is the direct input to every modeling target.

clean <- function(tbl, ...) {
  #' Clean the Data Frame
  #'
  #' Clean the data frame by standardizing the variables.
  #'
  #' @param tbl A data frame object
  #'
  #' @return A data frame object

  res <- tbl |>
    dplyr::mutate(
      "Age" = `Patient's age`,
      "Year" = as.integer(`Year of Publication`),
      "Prevalence" = gsub(x = Prevalence, ",", ".") |> as.numeric(),
      "Sample_size" = as.integer(`Sample size`),
      "Inappropriate_indication" = as.integer(`Inappropriate indication`),
      "Continent" = dplyr::case_when(
        grepl(x = Continent, "Asia") ~ "Asia",
        grepl(x = Continent, "Europe") ~ "Europe",
        grepl(x = Continent, "North America") ~ "North America",
        .default =  "Other"
      ) |> factor(levels = c("North America", "Asia", "Europe", "Other")),
      "Setting" = dplyr::case_when(
        grepl(x = Setting, "Hospital") ~ "Hospital Setting",
        .default = "Other"
      ) |> factor(levels = c("Hospital Setting", "Other")),
      "use_guideline" = ifelse(
        `Guideline` == "Yes", "Followed Guideline(s)", "No Guideline"
      ),
      "logit_prevalence" = log(Prevalence / (1 - Prevalence)),
      "var_logit_prevalence" = 1 / (Sample_size * logit_prevalence)
    ) %>%
    set_names(names(.) |> make.names())

  return(res)
}

Each transformation is explained below.

Age standardization

"Age" = `Patient's age`

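The backtick-quoted column Patient's age is copied into a new column with the syntactically valid R name Age. No type conversion is applied; the value is carried forward as-is for use in subgroup analysis. Because mutate() adds a column rather than renaming one, the original column also remains in the data frame.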

Year as integer

"Year" = as.integer(`Year of Publication`)

Year of Publication is cast to integer, dropping any trailing decimals that may result from spreadsheet export. Year is used as a continuous covariate in the multivariable meta-regression.
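As a small illustration of that cast, as.integer() truncates toward zero rather than rounding, and it also parses numeric strings. The values below are made up.

```r
# Numeric input: the fractional part is dropped, not rounded.
year_num <- as.integer(c(2019.0, 2020.9))   # 2019 2020

# Character input, as a spreadsheet export might produce.
year_chr <- as.integer("2019.0")            # 2019
```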

Prevalence normalization

"Prevalence" = gsub(x = Prevalence, ",", ".") |> as.numeric()

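Some studies report prevalence with a comma as the decimal separator (e.g., 0,42). gsub() replaces every comma with a dot, and as.numeric() then converts the string to a number. The result is expected to be a proportion between 0 and 1; clean() does not check the range, so a percentage such as 42 passes through unchanged, and a string that cannot be parsed becomes NA.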
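The sketch below applies the same gsub() and as.numeric() steps to a few made-up strings and then flags values that are missing or fall outside the 0 to 1 range. The range check is an addition for illustration and is not part of clean().

```r
raw <- c("0,42", "0.35", "42", "n/a")

# Same conversion as in clean(); suppressWarnings() hides the coercion
# warning that as.numeric() emits for "n/a".
p <- suppressWarnings(as.numeric(gsub(x = raw, ",", ".")))

# Illustrative check (not in clean()): TRUE for unparsable or
# out-of-range values.
bad <- is.na(p) | p < 0 | p > 1
bad   # FALSE FALSE TRUE TRUE
```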

Sample size as integer

"Sample_size" = as.integer(`Sample size`)

The total patient count is stored as an integer. It is used both as a covariate in the multivariable meta-regression and in the variance computation below.

Continent classification

"Continent" = dplyr::case_when(
  grepl(x = Continent, "Asia") ~ "Asia",
  grepl(x = Continent, "Europe") ~ "Europe",
  grepl(x = Continent, "North America") ~ "North America",
  .default =  "Other"
) |> factor(levels = c("North America", "Asia", "Europe", "Other"))

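Free-text continent values are collapsed to four levels using case-sensitive substring matching. case_when() returns the first match, so a value that names several continents is assigned in the order Asia, Europe, North America; missing and non-matching values become "Other". The reference level for modeling is "North America" (the first level of the factor).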
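To see how edge cases are classified, the following base-R sketch mirrors the case_when() logic with nested ifelse() calls. `classify_continent()` is a hypothetical stand-in written for illustration, not the pipeline's code, and the input values are made up.

```r
# Hypothetical base-R equivalent of the case_when() block above.
classify_continent <- function(x) {
  out <- ifelse(grepl("Asia", x), "Asia",
         ifelse(grepl("Europe", x), "Europe",
         ifelse(grepl("North America", x), "North America", "Other")))
  factor(out, levels = c("North America", "Asia", "Europe", "Other"))
}

x <- c("North America", "Asia (East)", "europe", "Europe/Asia", NA)
res <- classify_continent(x)
as.character(res)
# "North America" "Asia" "Other" "Asia" "Other"
```

The lowercase "europe" and the missing value both fall through to "Other", and "Europe/Asia" is assigned to "Asia" because that pattern is tested first.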

Setting dichotomization

"Setting" = dplyr::case_when(
  grepl(x = Setting, "Hospital") ~ "Hospital Setting",
  .default = "Other"
) |> factor(levels = c("Hospital Setting", "Other"))

Any value containing "Hospital" maps to "Hospital Setting"; all others become "Other". The factor reference level is "Hospital Setting".

Guideline use categorization

"use_guideline" = ifelse(
  `Guideline` == "Yes", "Followed Guideline(s)", "No Guideline"
)

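The Guideline column is recoded to a descriptive label. Studies whose value is exactly "Yes" are tagged "Followed Guideline(s)"; any other non-missing value, including variants such as "yes", is tagged "No Guideline". A missing value stays NA, because comparing NA with == yields NA. Unlike Continent and Setting, use_guideline is left as a character vector rather than converted to a factor.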
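A short example with made-up values shows how the ifelse() call treats case differences and missing data.

```r
guideline <- c("Yes", "No", "yes", NA)

use_guideline <- ifelse(
  guideline == "Yes", "Followed Guideline(s)", "No Guideline"
)
use_guideline
# "Followed Guideline(s)" "No Guideline" "No Guideline" NA
```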

Logit prevalence

"logit_prevalence" = log(Prevalence / (1 - Prevalence))

The logit transformation maps a proportion from the interval (0, 1) onto the whole real line, and the logit is the effect measure used throughout the meta-analysis. A prevalence of exactly 0 produces -Inf and a prevalence of exactly 1 produces Inf; verify that no study reports a prevalence at the boundary before running the pipeline.
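The boundary behavior is easy to reproduce. The sketch below computes the logit for a few illustrative proportions and builds a logical flag for boundary values; the flag is an illustrative check and is not part of clean().

```r
p <- c(0, 0.25, 0.5, 1)

# Same expression as in clean(); base R's qlogis(p) gives the same values.
logit_p <- log(p / (1 - p))
logit_p   # -Inf, about -1.0986, 0, Inf

# Illustrative check: TRUE where the logit is not finite.
at_boundary <- p <= 0 | p >= 1
```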

Variance of logit prevalence

"var_logit_prevalence" = 1 / (Sample_size * logit_prevalence)

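This value is passed to the meta-analysis functions as the vi (variance) argument. Note that the expression differs from the usual large-sample variance of a logit-transformed proportion, 1 / (Sample_size * Prevalence * (1 - Prevalence)): because it divides by the logit itself, it is negative whenever Prevalence is below 0.5 and infinite at exactly 0.5. Inspect var_logit_prevalence before relying on the pooled estimates.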
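For comparison, the delta-method variance of a logit-transformed proportion with sample size n and proportion p is 1 / (n p (1 - p)). The sketch below evaluates both that textbook formula and the expression used in clean() for a single hypothetical study; the numbers are illustrative and not taken from the data.

```r
n <- 100    # hypothetical sample size
p <- 0.25   # hypothetical prevalence

logit_p <- log(p / (1 - p))

# Expression used in clean(): divides by the logit, so the sign follows
# the sign of the logit (negative for p < 0.5).
v_pipeline <- 1 / (n * logit_p)

# Textbook delta-method variance of the logit: always positive for
# 0 < p < 1.
v_delta <- 1 / (n * p * (1 - p))

c(pipeline = v_pipeline, delta = v_delta)
```

Here v_delta is about 0.0533, while v_pipeline is negative, which is not a valid variance.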

Column name sanitization

set_names(names(.) |> make.names())

make.names() is applied to all column names in the final step, replacing spaces and special characters with dots so every column is a syntactically valid R name (e.g., Patient's age → Patient.s.age).
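To preview the sanitized names, make.names() can be called directly on the input column names. The vector below is an illustrative subset.

```r
old <- c("Patient's age", "Year of Publication", "Sample size",
         "JBI_Classification", "Age")

new <- make.names(old)
new
# "Patient.s.age" "Year.of.Publication" "Sample.size"
# "JBI_Classification" "Age"
```

Names that are already syntactically valid, such as JBI_Classification and Age, are left unchanged.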
