CSV structure
The file has nine columns. The first eight are physical and historical features of each wrestler; the ninth is the match outcome.

| Column | Type | Units | Description |
|---|---|---|---|
| `weight1` | numeric | pounds | Body weight of wrestler 1 |
| `weight2` | numeric | pounds | Body weight of wrestler 2 |
| `wins1` | numeric | count | Number of wins for wrestler 1 |
| `wins2` | numeric | count | Number of wins for wrestler 2 |
| `age1` | numeric | years | Age of wrestler 1 |
| `age2` | numeric | years | Age of wrestler 2 |
| `height1` | numeric | centimeters | Height of wrestler 1 |
| `height2` | numeric | centimeters | Height of wrestler 2 |
| `result` | logical | n/a | 1 = wrestler 1 wins, 0 = wrestler 2 wins, blank/NA = undecided (to predict) |
Example rows
These are the first rows of `sumo.csv` exactly as they appear in the file:
The result column
The `result` column drives the entire pipeline:
- `1`: wrestler 1 (left) won the bout.
- `0`: wrestler 2 (right) won the bout.
- blank / `NA`: the match has not yet taken place; these rows are the prediction targets.
Loading data with sumo.read()
`sumo.read()` does three things in sequence:
- Reads the CSV into a tibble via `read.csv()`.
- Coerces every column to numeric with `mutate_all(as.numeric)`; this turns the bare `1`/`0`/blank values in `result` into `1`, `0`, and `NA` respectively.
- Re-casts `result` as logical (`TRUE`/`FALSE`/`NA`) so that the model formula `result ~ .` produces a valid binary response variable.
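
The three steps above might be sketched as follows. This is a minimal reconstruction from the description, not the project's actual source: the function name `sumo_read_sketch`, its `path` argument, and the explicit `as_tibble()` call are assumptions, and the three sample rows are made up purely for illustration.

```r
library(dplyr)   # mutate_all(), mutate(), %>%
library(tibble)  # as_tibble()

# Hypothetical stand-in for sumo.read(); the name and argument are assumptions.
sumo_read_sketch <- function(path) {
  read.csv(path) %>%                      # 1. read the CSV (base R returns a data.frame)
    as_tibble() %>%                       #    assumed conversion to a tibble
    mutate_all(as.numeric) %>%            # 2. coerce every column to numeric; blanks are NA
    mutate(result = as.logical(result))   # 3. 1/0/NA becomes TRUE/FALSE/NA
}

# Tiny made-up file for illustration only (not real sumo data).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "weight1,weight2,wins1,wins2,age1,age2,height1,height2,result",
  "330,310,12,9,27,31,185,182,1",
  "298,342,5,14,24,29,180,188,0",
  "315,305,8,8,26,26,183,184,"
), tmp)

data <- sumo_read_sketch(tmp)
data$result   # TRUE FALSE NA
```

The blank `result` field in the last row is already read as `NA` by `read.csv()`, because blank fields in numeric columns are treated as missing; the numeric coercion and logical cast then carry that `NA` through unchanged.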
Splitting into training and prediction sets
After loading, the data is split into two subsets based on whether `result` is known:
| Subset | Filter | Purpose |
|---|---|---|
| `history` | `na.omit(data)` | All completed bouts; passed to model fitting |
| `undecided` | `filter(is.na(result))` | Future/unknown bouts; passed to `predict()` |
`na.omit()` drops any row with any `NA` value, not just `NA` in `result`. Make sure all feature columns are populated for historical rows, or those rows will be silently excluded from training.
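
A small sketch of the split and of the `na.omit()` caveat, using an abbreviated, made-up table with only two feature columns (the real data has eight). The `dropped` check at the end is not described in this document; it is one possible way to surface completed bouts that would otherwise be excluded without warning.

```r
library(dplyr)   # filter()
library(tibble)  # tibble()

# Made-up rows for illustration only; in practice `data` comes from sumo.read().
data <- tibble(
  weight1 = c(330, 298, NA, 315),
  weight2 = c(310, 342, 300, 305),
  result  = c(TRUE, FALSE, TRUE, NA)
)

history   <- na.omit(data)                 # completed bouts with no NA in any column
undecided <- filter(data, is.na(result))   # bouts still to be predicted

nrow(history)     # 2: the third bout is dropped because weight1 is NA
nrow(undecided)   # 1

# Completed bouts that na.omit() excluded because a feature is missing.
incomplete <- !complete.cases(data)
dropped    <- filter(data, !is.na(result), incomplete)
nrow(dropped)     # 1
```

If `dropped` is non-empty, the corresponding rows in the CSV have a known outcome but at least one missing feature, and they will not contribute to training until those values are filled in.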