Sumo Oracle reads match data from a CSV file. Each row represents one head-to-head bout between two wrestlers, referred to throughout the codebase as wrestler 1 (left) and wrestler 2 (right).

CSV structure

The file has nine columns. The first eight are physical and historical features of each wrestler; the ninth is the match outcome.
| Column  | Type    | Units       | Description |
|---------|---------|-------------|-------------|
| weight1 | numeric | pounds      | Body weight of wrestler 1 |
| weight2 | numeric | pounds      | Body weight of wrestler 2 |
| wins1   | numeric | count       | Number of wins for wrestler 1 |
| wins2   | numeric | count       | Number of wins for wrestler 2 |
| age1    | numeric | years       | Age of wrestler 1 |
| age2    | numeric | years       | Age of wrestler 2 |
| height1 | numeric | centimeters | Height of wrestler 1 |
| height2 | numeric | centimeters | Height of wrestler 2 |
| result  | logical |             | 1 = wrestler 1 wins, 0 = wrestler 2 wins, blank/NA = undecided (to predict) |
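
Before fitting anything, it can help to confirm that a file actually has these nine columns. The following is a minimal base-R sketch; `check_sumo_columns` and `expected_cols` are hypothetical names introduced here for illustration and are not part of Sumo Oracle.

```r
# Hypothetical helper (not part of Sumo Oracle): checks that a data frame
# contains exactly the nine documented column names. Order is not checked.
expected_cols <- c("weight1", "weight2", "wins1", "wins2",
                   "age1", "age2", "height1", "height2", "result")

check_sumo_columns <- function(df) {
  missing <- setdiff(expected_cols, names(df))
  extra   <- setdiff(names(df), expected_cols)
  if (length(missing) > 0 || length(extra) > 0) {
    stop("Unexpected columns. Missing: ", paste(missing, collapse = ", "),
         "; extra: ", paste(extra, collapse = ", "))
  }
  invisible(TRUE)
}

# Inline sample so the sketch runs without sumo.csv on disk.
sample_df <- read.csv(text = "weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
401,379,6,5,35,31,190,184,1")
check_sumo_columns(sample_df)
```

The helper stops with an error message listing the missing or extra names, so a mislabeled column surfaces before it reaches the model formula.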

Example rows

These are the first rows of sumo.csv exactly as they appear in the file:
weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
401,379,6,5,35,31,190,184,1
344,370,8,11,36,34,189,192,1
351,245,3,2,22,27,189,169,0
335,291,3,12,32,30,177,176,0
375,306,0,0,35,28,186,185,0
326,295,1,2,29,30,173,174,0
355,375,4,7,25,27,176,185,1
311,328,2,2,22,31,185,183,0
386,333,2,5,37,28,187,187,1

The result column

The result column drives the entire pipeline:
  • 1 — wrestler 1 (left) won the bout.
  • 0 — wrestler 2 (right) won the bout.
  • blank / NA — the match has not yet taken place; these rows are the prediction targets.
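
As a quick illustration of the three cases, the sketch below counts outcomes in a small inline sample. The first three rows are taken from the example rows above; the fourth row is made up for illustration and leaves result blank. It uses only base R and is an assumption about how you might inspect the data, not a step in the pipeline.

```r
# First three rows are copied from the example rows; the fourth row is
# made up for illustration and has a blank result (an undecided bout).
bouts <- read.csv(text = "weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
401,379,6,5,35,31,190,184,1
344,370,8,11,36,34,189,192,1
351,245,3,2,22,27,189,169,0
340,350,4,4,30,29,180,182,")

# useNA = "ifany" adds a separate count for missing outcomes.
outcome_counts <- table(bouts$result, useNA = "ifany")
print(outcome_counts)
```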

Loading data with sumo.read()

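library(dplyr)  # provides tibble(), %>%, mutate_all(), and mutate()

sumo.read <- function(csv) {
  data <- tibble(read.csv(csv)) %>%
    mutate_all(as.numeric) %>%            # every column becomes numeric
    mutate(result = as.logical(result))   # 1/0/NA -> TRUE/FALSE/NA
  return(data)
}

data <- sumo.read('sumo.csv')
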
sumo.read() does three things in sequence:
  1. Reads the CSV into a tibble via read.csv().
  2. Coerces every column to numeric with mutate_all(as.numeric). After this step result holds 1, 0, or NA; read.csv() already reads a blank field in a numeric column as NA, so the coercion mainly converts integer columns to double.
  3. Re-casts result as logical (TRUE/FALSE/NA) so that the model formula result ~ . produces a valid binary response variable.
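
The three steps can be traced one at a time on a small inline sample. This is a base-R sketch that mirrors the steps without dplyr (it uses lapply() in place of mutate_all()); the third data row is made up for illustration and has a blank result.

```r
# Two historical rows from the example rows plus one made-up undecided row.
raw <- read.csv(text = "weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
401,379,6,5,35,31,190,184,1
351,245,3,2,22,27,189,169,0
340,350,4,4,30,29,180,182,")

# Step 1: read.csv() parses whole-number columns as integer.
class(raw$weight1)

# Step 2: coerce every column to numeric (double), as mutate_all(as.numeric) does.
num <- as.data.frame(lapply(raw, as.numeric))

# Step 3: re-cast result to logical; 1 -> TRUE, 0 -> FALSE, NA stays NA.
num$result <- as.logical(num$result)
num$result
```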

Splitting into training and prediction sets

After loading, the data is split into two subsets based on whether result is known:
history   <- na.omit(data)               # rows with a known result — used for training
undecided <- filter(data, is.na(result)) # rows with NA result — used for prediction

| Subset    | Filter                | Purpose |
|-----------|-----------------------|---------|
| history   | na.omit(data)         | All completed bouts; passed to model fitting |
| undecided | filter(is.na(result)) | Future/unknown bouts; passed to predict() |

Note: na.omit() drops any row with any NA value, not just NA in result. Make sure all feature columns are populated for historical rows, or those rows will be silently excluded from training.
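
To see the effect, the sketch below compares na.omit() with a filter on result alone and lists the rows that would be lost. All rows are made up for illustration, and the names `history_strict`, `history_by_result`, and `dropped` are hypothetical; this is one way to audit the split, not how Sumo Oracle does it.

```r
# Made-up rows: row 2 has a known result but a missing age1;
# row 3 is an undecided bout (result is NA).
data <- data.frame(
  weight1 = c(401, 344, 351), weight2 = c(379, 370, 245),
  wins1   = c(6, 8, 3),       wins2   = c(5, 11, 2),
  age1    = c(35, NA, 22),    age2    = c(31, 34, 27),
  height1 = c(190, 189, 189), height2 = c(184, 192, 169),
  result  = c(TRUE, TRUE, NA)
)

history_strict    <- na.omit(data)                # drops rows 2 and 3
history_by_result <- data[!is.na(data$result), ]  # drops only row 3

# Rows with a known result that na.omit() would silently discard.
dropped <- history_by_result[!complete.cases(history_by_result), ]
nrow(dropped)
```

Filtering on result alone keeps the incomplete row visible, but a model fitted on it may still drop that row through its own NA handling, so the safer fix is to repair the missing feature in the CSV.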
