Overview
All data is stored in a single flat CSV file (conventionally named sumo.csv). Each row represents one match between two wrestlers. Rows with a known outcome are used as training and evaluation data; rows with a blank result column are the undecided matches the models are asked to predict.
Column Schema
| Column | Type | Description | Example |
|---|---|---|---|
| weight1 | numeric | Weight of wrestler 1 (lbs) | 401 |
| weight2 | numeric | Weight of wrestler 2 (lbs) | 379 |
| wins1 | numeric | Number of wins for wrestler 1 | 6 |
| wins2 | numeric | Number of wins for wrestler 2 | 5 |
| age1 | numeric | Age of wrestler 1 (years) | 35 |
| age2 | numeric | Age of wrestler 2 (years) | 31 |
| height1 | numeric | Height of wrestler 1 (cm) | 190 |
| height2 | numeric | Height of wrestler 2 (cm) | 184 |
| result | logical/numeric | Match outcome: 1 = wrestler 1 wins, 0 = wrestler 2 wins, blank = undecided | 1 |
The result column is always the last (rightmost) column.
Example Data
The following is a representative excerpt from sumo.csv:
An undecided match leaves the result field blank:
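As a rough illustration of this layout, the sketch below builds two rows in memory and parses them with read.csv(). The values are assumptions: the first row reuses the Example column of the schema table, and the second repeats the same features with result left blank. Neither row is taken from the actual sumo.csv.

```r
# Illustrative rows only (not real match data):
#   row 1 uses the Example column of the schema table,
#   row 2 repeats the same features with a blank result (trailing comma).
csv_text <- paste(
  "weight1,weight2,wins1,wins2,age1,age2,height1,height2,result",
  "401,379,6,5,35,31,190,184,1",
  "401,379,6,5,35,31,190,184,",
  sep = "\n"
)

# read.csv() can parse literal text through its `text` argument
matches <- read.csv(text = csv_text)
print(matches)
```

The blank field in the second row is imported as NA in the result column.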
Encoding Rules
Decided matches
result must be either 1 or 0:
| Value | Meaning |
|---|---|
| 1 | Wrestler 1 wins |
| 0 | Wrestler 2 wins |
Undecided matches
Leave the result field empty (trailing comma, no value). read.csv() will import this as NA. After sumo.read() processes the file, these rows have result = NA and can be selected by testing for NA, for example with is.na().
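A minimal base-R sketch of that selection follows. The data frame sumo is a hypothetical stand-in for the output of sumo.read(), with illustrative feature values from the schema's Example column; the project's own selection expression may differ (for instance, it may use dplyr).

```r
# Hypothetical stand-in for the data frame returned by sumo.read();
# feature values are illustrative, not real match data.
sumo <- data.frame(
  weight1 = c(401, 401), weight2 = c(379, 379),
  wins1   = c(6, 6),     wins2   = c(5, 5),
  age1    = c(35, 35),   age2    = c(31, 31),
  height1 = c(190, 190), height2 = c(184, 184),
  result  = c(TRUE, NA)
)

# Rows to predict: result is NA
undecided <- sumo[is.na(sumo$result), ]

# Rows usable for training and evaluation: result is TRUE or FALSE
decided <- sumo[!is.na(sumo$result), ]
```

Testing with is.na() is needed because a comparison such as result == NA itself evaluates to NA and selects nothing useful.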
A blank result field and a missing result field are equivalent in CSV: both are imported as NA by R's read.csv().
Validation Rules
Before loading the file with sumo.read(), confirm the following:
- All eight feature columns (weight1, weight2, wins1, wins2, age1, age2, height1, height2) must contain numeric values: no text, no empty cells.
- result must be 1, 0, or blank. A non-numeric value (e.g. "win") will be coerced to NA by as.numeric() and the row will be treated as undecided. Other numbers (e.g. 2) are not caught: as.logical() maps any nonzero number to TRUE, so the row would be read as a win for wrestler 1.
- The header row must be present and column names must match exactly (case-sensitive).
- The file must use standard CSV encoding (comma-delimited, UTF-8 or ASCII).
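The checks above could be automated along the following lines. This is a sketch under assumptions: validate_sumo_csv() is a hypothetical helper, not part of the original code, and it reads every column as text so that nothing is coerced before it is inspected.

```r
# Hypothetical helper: returns a character vector describing rule violations
# (empty if the file passes all checks).
validate_sumo_csv <- function(path) {
  features <- c("weight1", "weight2", "wins1", "wins2",
                "age1", "age2", "height1", "height2")
  problems <- character(0)

  # Read all columns as character so bad values are still visible
  raw <- read.csv(path, colClasses = "character", check.names = FALSE)

  # Header: every expected name present (case-sensitive), result last
  missing <- setdiff(c(features, "result"), names(raw))
  if (length(missing) > 0) {
    return(paste("missing column(s):", paste(missing, collapse = ", ")))
  }
  if (names(raw)[ncol(raw)] != "result") {
    problems <- c(problems, "result is not the last column")
  }

  # Feature columns: every cell must parse as a number
  for (col in features) {
    bad <- is.na(suppressWarnings(as.numeric(raw[[col]])))
    if (any(bad)) {
      problems <- c(problems, sprintf(
        "%s: non-numeric or empty value in data row(s) %s",
        col, paste(which(bad), collapse = ", ")))
    }
  }

  # result: only "1", "0", or blank are allowed
  res <- trimws(raw[["result"]])
  res[is.na(res)] <- ""
  bad <- !(res %in% c("1", "0", ""))
  if (any(bad)) {
    problems <- c(problems, sprintf(
      "result: invalid value in data row(s) %s",
      paste(which(bad), collapse = ", ")))
  }

  problems
}
```

Calling validate_sumo_csv("sumo.csv") before sumo.read() would then surface a value such as 2 or "win" in the result column instead of letting it be coerced silently.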
How sumo.read() Processes the File
sumo.read() performs three sequential transformations:
| Step | Operation | Effect on result |
|---|---|---|
| 1 | read.csv() | "1" → 1 (integer), "" → NA |
| 2 | mutate_all(as.numeric) | 1 → 1.0 (double), NA stays NA |
| 3 | mutate(result = as.logical(result)) | 1 → TRUE, 0 → FALSE, NA → NA |
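The three steps can be approximated in base R as shown below. This is a sketch, not the actual sumo.read(): the name sumo_read_sketch() is hypothetical, lapply() stands in for the dplyr mutate_all() call named in the table, and the input rows are illustrative values from the schema's Example column.

```r
# Base-R stand-in for the three steps in the table (not the real sumo.read())
sumo_read_sketch <- function(path) {
  sumo <- read.csv(path)                  # step 1: parse the CSV
  sumo[] <- lapply(sumo, as.numeric)      # step 2: every column to numeric
  sumo$result <- as.logical(sumo$result)  # step 3: 1/0/NA to TRUE/FALSE/NA
  sumo
}

# Usage with illustrative rows written to a temporary file
path <- tempfile(fileext = ".csv")
writeLines(c(
  "weight1,weight2,wins1,wins2,age1,age2,height1,height2,result",
  "401,379,6,5,35,31,190,184,1",
  "401,379,6,5,35,31,190,184,0",
  "401,379,6,5,35,31,190,184,"
), path)

sumo <- sumo_read_sketch(path)
str(sumo$result)
```

After the call, result is a logical vector in which the blank row is NA, and the eight feature columns are stored as doubles.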