Dataset Format

Sumo Oracle reads match data from a CSV file. Each row represents one head-to-head bout between two wrestlers, referred to throughout the codebase as wrestler 1 (left) and wrestler 2 (right).

CSV structure

The file has nine columns. The first eight are physical and historical features of each wrestler; the ninth is the match outcome.

Column	Type	Units	Description
`weight1`	numeric	pounds	Body weight of wrestler 1
`weight2`	numeric	pounds	Body weight of wrestler 2
`wins1`	numeric	count	Number of wins for wrestler 1
`wins2`	numeric	count	Number of wins for wrestler 2
`age1`	numeric	years	Age of wrestler 1
`age2`	numeric	years	Age of wrestler 2
`height1`	numeric	centimeters	Height of wrestler 1
`height2`	numeric	centimeters	Height of wrestler 2
`result`	logical	—	`1` = wrestler 1 wins, `0` = wrestler 2 wins, blank/`NA` = undecided (to predict)
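
A quick schema check can catch a misnamed or missing column before any modelling happens. The following is a minimal sketch in base R, not part of Sumo Oracle itself: `expected_cols` and `check_sumo_columns()` are hypothetical names introduced here, and the two sample rows are copied from the example rows on this page.

```r
# Column names from the table above, in file order.
expected_cols <- c("weight1", "weight2", "wins1", "wins2",
                   "age1", "age2", "height1", "height2", "result")

# Hypothetical helper (not part of Sumo Oracle): stops with a message if the
# data frame's columns do not match the documented schema.
check_sumo_columns <- function(df) {
  missing_cols <- setdiff(expected_cols, names(df))
  extra_cols   <- setdiff(names(df), expected_cols)
  if (length(missing_cols) > 0 || length(extra_cols) > 0) {
    stop("Unexpected columns. Missing: ", paste(missing_cols, collapse = ", "),
         "; extra: ", paste(extra_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Two rows copied from the example rows, read from an inline string.
sample_rows <- read.csv(text = "weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
401,379,6,5,35,31,190,184,1
344,370,8,11,36,34,189,192,1")

check_sumo_columns(sample_rows)
```

The check compares names only; it does not verify units or value ranges.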

Example rows

These are the first rows of sumo.csv exactly as they appear in the file:

weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
401,379,6,5,35,31,190,184,1
344,370,8,11,36,34,189,192,1
351,245,3,2,22,27,189,169,0
335,291,3,12,32,30,177,176,0
375,306,0,0,35,28,186,185,0
326,295,1,2,29,30,173,174,0
355,375,4,7,25,27,176,185,1
311,328,2,2,22,31,185,183,0
386,333,2,5,37,28,187,187,1

The `result` column

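The result column serves two roles: it is the training label for completed bouts and the marker that identifies rows to predict. It takes one of three values: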

1 — wrestler 1 (left) won the bout.
0 — wrestler 2 (right) won the bout.
blank / NA — the match has not yet taken place; these rows are the prediction targets.
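
To see how many rows fall into each category, the three values can be counted directly. This is an illustrative sketch in base R: the first nine values are the results from the example rows on this page, and the trailing NA is a hypothetical undecided bout added only so that all three categories appear.

```r
# Nine results from the example rows, plus one hypothetical undecided bout (NA).
result <- as.logical(c(1, 1, 0, 0, 0, 0, 1, 0, 1, NA))

counts <- c(
  wrestler1_won = sum(result, na.rm = TRUE),   # TRUE values
  wrestler2_won = sum(!result, na.rm = TRUE),  # FALSE values
  undecided     = sum(is.na(result))           # rows still to predict
)
print(counts)
```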

Loading data with `sumo.read()`

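library(dplyr)  # assumed setup: provides %>%, mutate(), mutate_all(), filter(), and re-exports tibble()

sumo.read <- function(csv) {
  data <- tibble(read.csv(csv)) %>%
    mutate_all(as.numeric) %>%            # every column becomes numeric
    mutate(result = as.logical(result))   # 1/0/NA becomes TRUE/FALSE/NA
  return(data)
}

data <- sumo.read('sumo.csv')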

sumo.read() does three things in sequence:

1. Reads the CSV into a tibble via read.csv(). For a numeric column, read.csv() already parses blank fields and the string NA as NA.
2. Coerces every column to numeric with mutate_all(as.numeric), so that result holds 1, 0, or NA.
3. Re-casts result as logical (TRUE/FALSE/NA) so that the model formula result ~ . has a valid binary response variable.
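
The coercion of result can be traced on a small inline CSV without touching sumo.csv. This is a sketch in base R rather than the dplyr pipeline above. The first two data rows are copied from the example rows; the third, with a blank result, is a hypothetical undecided bout made up for illustration.

```r
# Two decided bouts from the example rows and one hypothetical undecided bout
# (the last row, whose result field is blank).
csv_text <- "weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
401,379,6,5,35,31,190,184,1
351,245,3,2,22,27,189,169,0
340,310,4,4,28,29,180,182,"

raw <- read.csv(text = csv_text)

# Step 1: read.csv() parses the blank field as NA in an integer column.
raw_result <- raw$result

# Step 2: as.numeric() keeps the values but stores them as doubles.
numeric_result <- as.numeric(raw_result)

# Step 3: as.logical() maps 1 to TRUE, 0 to FALSE, and leaves NA as NA.
logical_result <- as.logical(numeric_result)

print(logical_result)
```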

Splitting into training and prediction sets

After loading, the data is split into two subsets based on whether result is known:

history   <- na.omit(data)               # rows with a known result — used for training
undecided <- filter(data, is.na(result)) # rows with NA result — used for prediction

Subset	Filter	Purpose
`history`	`na.omit(data)`	All completed bouts; passed to model fitting
`undecided`	`filter(data, is.na(result))`	Future/unknown bouts; passed to `predict()`

na.omit() drops any row with any NA value, not just NA in result. Make sure all feature columns are populated for historical rows, or those rows will be silently excluded from training.
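
One way to find out whether any completed bouts are being dropped is to compare na.omit() with a filter on result alone. The sketch below uses base R and a small made-up data frame; the values and the names `history_strict`, `history_by_result`, and `dropped` are illustrative only and do not come from sumo.csv or the Sumo Oracle code.

```r
# Made-up data: row 2 has a known result but a missing feature value,
# row 3 is an undecided bout.
data <- data.frame(
  weight1 = c(401, NA, 340),
  weight2 = c(379, 245, 310),
  result  = c(TRUE, FALSE, NA)
)

# What the split above does: drops every row containing any NA.
history_strict <- na.omit(data)

# Alternative that looks at result only, for comparison.
history_by_result <- data[!is.na(data$result), ]

# Completed bouts that na.omit() removes because a feature is missing.
dropped <- data[!is.na(data$result) & !complete.cases(data), ]

cat("Rows kept by na.omit():", nrow(history_strict), "\n")
cat("Rows with a known result:", nrow(history_by_result), "\n")
cat("Completed bouts dropped:", nrow(dropped), "\n")
```

If the last count is non-zero on real data, those rows have a known outcome but at least one missing feature.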
