
Overview

All data is stored in a single flat CSV file (conventionally named sumo.csv). Each row represents one match between two wrestlers. Rows with a known outcome are used as training and evaluation data; rows with a blank result column are the undecided matches the models are asked to predict.

Column Schema

| Column | Type | Description | Example |
|---|---|---|---|
| weight1 | numeric | Weight of wrestler 1 (lbs) | 401 |
| weight2 | numeric | Weight of wrestler 2 (lbs) | 379 |
| wins1 | numeric | Number of wins for wrestler 1 | 6 |
| wins2 | numeric | Number of wins for wrestler 2 | 5 |
| age1 | numeric | Age of wrestler 1 (years) | 35 |
| age2 | numeric | Age of wrestler 2 (years) | 31 |
| height1 | numeric | Height of wrestler 1 (cm) | 190 |
| height2 | numeric | Height of wrestler 2 (cm) | 184 |
| result | logical/numeric | Match outcome: 1 = wrestler 1 wins, 0 = wrestler 2 wins, blank = undecided | 1 |

There are no additional columns. The result column is always the last (rightmost) column.

Example Data

The following is a representative excerpt from sumo.csv:
weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
401,379,6,5,35,31,190,184,1
344,370,8,11,36,34,189,192,1
351,245,3,2,22,27,189,169,0
335,291,3,12,32,30,177,176,0
375,306,0,0,35,28,186,185,0
326,295,1,2,29,30,173,174,0
355,375,4,7,25,27,176,185,1
311,328,2,2,22,31,185,183,0
386,333,2,5,37,28,187,187,1
289,366,2,1,27,29,180,184,1
To represent an undecided match (one you want to predict), leave the result field blank:
weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
370,355,4,3,28,26,183,180,

Encoding Rules

Decided matches

result must be either 1 or 0:
| Value | Meaning |
|---|---|
| 1 | Wrestler 1 wins |
| 0 | Wrestler 2 wins |

Undecided matches

Leave the result field empty (trailing comma, no value). read.csv() will import this as NA. After sumo.read() processes the file, these rows have result = NA and are selected with:
undecided <- filter(data, is.na(result))
A blank result field (trailing comma, no value) and a missing result field (a row with only eight values) are treated the same way: read.csv() pads short rows by default, so both are imported as NA.
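
The blank-versus-missing behaviour can be checked without touching sumo.csv. The following is a minimal sketch using base R only. The inline CSV text and the variable names are illustrative, and the third data row deliberately omits the result field altogether.

```r
# Inline CSV: one decided match, one with a blank result (trailing comma),
# and one where the result field is missing entirely (only eight values).
csv_text <- "weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
401,379,6,5,35,31,190,184,1
370,355,4,3,28,26,183,180,
344,370,8,11,36,34,189,192"

# read.csv() pads short rows by default (fill = TRUE)
raw <- read.csv(text = csv_text)

# Both the blank and the missing result come back as NA
print(is.na(raw$result))  # FALSE TRUE TRUE
```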

Validation Rules

Before loading the file with sumo.read(), confirm the following:
  • All eight feature columns (weight1, weight2, wins1, wins2, age1, age2, height1, height2) must contain numeric values — no text, no empty cells.
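  • result must be 1, 0, or blank. Non-numeric text (e.g. "win") is coerced to NA by as.numeric(), so the row is treated as undecided. A nonzero number other than 1 (e.g. 2) is not rejected: as.logical() converts it to TRUE, so the row is treated as a win for wrestler 1.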
  • The header row must be present and column names must match exactly (case-sensitive).
  • The file must use standard CSV encoding (comma-delimited, UTF-8 or ASCII).
If a feature column contains a non-numeric value, mutate_all(as.numeric) coerces it to NA, and the only signal is an "NAs introduced by coercion" warning. Downstream, glm() and neuralnet() will either drop that row or fail with an error. Inspect your data with summary(data) after loading to catch unexpected NA values.
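
The checks above can also be automated before the file reaches sumo.read(). The following is a rough sketch in base R, not part of the original code: check_sumo is a hypothetical helper name, and it assumes the file has been read with a plain read.csv() call so that the raw column types are still visible.

```r
# Hypothetical helper (not part of the original code base).
# Takes the raw data frame returned by read.csv() and returns a character
# vector describing every problem found; an empty vector means the file passed.
check_sumo <- function(raw) {
  features <- c("weight1", "weight2", "wins1", "wins2",
                "age1", "age2", "height1", "height2")

  # Header must match exactly (case-sensitive), with result as the last column
  if (!identical(names(raw), c(features, "result"))) {
    return("header does not match the expected column names")
  }

  problems <- character(0)

  # Feature columns: numeric, no empty cells
  for (col in features) {
    if (!is.numeric(raw[[col]]) || anyNA(raw[[col]])) {
      problems <- c(problems, paste0(col, ": non-numeric or empty value"))
    }
  }

  # result: only 1, 0, or blank (NA, or "" if the column was read as text)
  res <- raw$result
  blank <- is.na(res) | res == ""
  if (any(!blank & !(res %in% c(0, 1)))) {
    problems <- c(problems, "result: value other than 1, 0, or blank")
  }

  problems
}

# Usage on a small inline example (rows taken from the examples above)
raw <- read.csv(text = "weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
401,379,6,5,35,31,190,184,1
370,355,4,3,28,26,183,180,")
print(check_sumo(raw))  # character(0): no problems found
```

Returning a vector of messages rather than stopping at the first failure makes it possible to fix every bad cell in one pass.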

How sumo.read() Processes the File

sumo.read() performs three sequential transformations:
library(dplyr)  # provides %>%, tibble(), mutate_all(), mutate(), and filter()

sumo.read <- function(csv) {
  data <- tibble(read.csv(csv)) %>%   # 1. Read CSV into a tibble
    mutate_all(as.numeric) %>%         # 2. Coerce every column to numeric
    mutate(result = as.logical(result)) # 3. Cast result: 1→TRUE, 0→FALSE, NA→NA
  return(data)
}

| Step | Operation | Effect on result |
|---|---|---|
| 1 | read.csv() | "1" → 1, "" → NA |
| 2 | mutate_all(as.numeric) | 1 → 1.0, blank is already NA |
| 3 | mutate(result = as.logical(result)) | 1 → TRUE, 0 → FALSE, NA → NA |

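
To see the three steps act on real values, the trace below replays them one at a time on an inline CSV. It is a sketch in base R rather than the dplyr pipeline itself: lapply(..., as.numeric) stands in for mutate_all(as.numeric), and the step1 to step3 names are illustrative.

```r
# Inline CSV with a win, a loss, and an undecided match (rows from the examples above)
csv_text <- "weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
401,379,6,5,35,31,190,184,1
351,245,3,2,22,27,189,169,0
370,355,4,3,28,26,183,180,"

step1 <- read.csv(text = csv_text)        # 1. read: blank result becomes NA

step2 <- step1
step2[] <- lapply(step1, as.numeric)      # 2. base-R stand-in for mutate_all(as.numeric)

step3 <- step2
step3$result <- as.logical(step2$result)  # 3. 1 -> TRUE, 0 -> FALSE, NA -> NA

print(step1$result)
print(step2$result)
print(step3$result)  # TRUE FALSE NA
```
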
After loading, the caller is expected to split the tibble into decided and undecided subsets:
data      <- sumo.read("sumo.csv")
history   <- na.omit(data)                  # decided matches  → train/evaluate models
undecided <- filter(data, is.na(result))    # undecided matches → run inference
Keep decided and undecided matches in the same file. The split is done in code, so you only need to maintain one CSV. Append a new row with a blank result whenever you want a new prediction.
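
Appending an undecided row can be scripted as well. The sketch below uses base R only and writes to a temporary file so it does not touch a real sumo.csv. append_undecided is a hypothetical helper name, and it assumes the existing file ends with a newline (otherwise the new row would be glued to the last line).

```r
# Hypothetical helper: append one undecided match to an existing CSV.
# na = "" writes the NA result as an empty field, i.e. a trailing comma.
append_undecided <- function(path, weight1, weight2, wins1, wins2,
                             age1, age2, height1, height2) {
  row <- data.frame(weight1, weight2, wins1, wins2,
                    age1, age2, height1, height2, result = NA)
  write.table(row, file = path, append = TRUE, sep = ",",
              na = "", row.names = FALSE, col.names = FALSE)
}

# Demo on a temporary copy with one decided match
path <- tempfile(fileext = ".csv")
writeLines(c("weight1,weight2,wins1,wins2,age1,age2,height1,height2,result",
             "401,379,6,5,35,31,190,184,1"), path)

append_undecided(path, 370, 355, 4, 3, 28, 26, 183, 180)
print(readLines(path)[3])  # "370,355,4,3,28,26,183,180,"

# Re-read and confirm the new row lands in the undecided subset
data <- read.csv(path)
data$result <- as.logical(data$result)
undecided <- data[is.na(data$result), ]
print(nrow(undecided))  # 1
```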
