Helper Functions
These three functions are defined in sumo.Rmd (sumo.read is also defined in pred_sumo.R). They handle data loading, train/test splitting, and feature normalization.
sumo.read(csv)
Reads the sumo match CSV file and returns a clean tibble ready for modelling. Every column is coerced to numeric first, then the result column is cast to logical so that TRUE represents a win for wrestler 1 and FALSE represents a win for wrestler 2.
Signature
csv: Path to the CSV file containing match data (e.g. "sumo.csv").
Returns: A tibble in which all columns are numeric except result, which is logical (TRUE/FALSE). Rows with a blank result field in the raw CSV are imported as NA and represent undecided matches.
Implementation
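A minimal sketch of how the steps described for sumo.read could fit together. It assumes dplyr is loaded and that the file is read with base R's read.csv(); the reader used in the project is not shown here, and the column names rank1 and rank2 and the temporary file are hypothetical.

```r
library(dplyr)

# Sketch only: read.csv() is an assumption about how the file is loaded.
sumo.read <- function(csv) {
  read.csv(csv) %>%
    as_tibble() %>%
    mutate_all(as.numeric) %>%            # every column, result included, becomes numeric
    mutate(result = as.logical(result))   # 1 -> TRUE, 0 -> FALSE, NA stays NA
}

# Hypothetical three-row file; the last match is undecided (blank result).
tmp <- tempfile(fileext = ".csv")
writeLines(c("rank1,rank2,result", "1,2,1", "3,1,0", "2,4,"), tmp)
matches <- sumo.read(tmp)
matches$result
```

Rows where result is NA can then be separated from the decided matches with is.na(matches$result).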
mutate_all(as.numeric) is applied before as.logical, so the result column first passes through numeric coercion (becoming 1, 0, or NA) and is then converted to TRUE, FALSE, or NA by as.logical.
data.split(data, ratio)
Randomly partitions a tibble into a training set and an evaluation set. Row indices for the training set are drawn with sample(), and the remaining indices are found with setdiff().
Signature
data: The full dataset to split, typically history (decided matches only).
ratio: Proportion of rows to allocate to the training set. Must be between 0 and 1 (exclusive). For example, 0.85 reserves 85% of rows for training and 15% for evaluation.
Returns: A list with two elements:
| Index | Contents |
|---|---|
| [[1]] | Training set: round(nrow(data) * ratio) rows sampled at random |
| [[2]] | Test / evaluation set: the remaining rows |
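The split described above can be sketched as follows. This is an illustration built from the description (sample() for the training indices, setdiff() for the rest), not a copy of the project's code; the toy data frame is hypothetical.

```r
# Sketch of a random train/test split returning a two-element list.
data.split <- function(data, ratio) {
  n <- nrow(data)
  train_idx <- sample(n, round(n * ratio))        # indices drawn without replacement
  test_idx <- setdiff(seq_len(n), train_idx)      # everything not drawn for training
  list(data[train_idx, ], data[test_idx, ])
}

set.seed(42)
toy <- data.frame(a = 1:20, b = 20:1)   # hypothetical 20-row dataset
parts <- data.split(toy, 0.85)
nrow(parts[[1]])   # 17 training rows
nrow(parts[[2]])   # 3 evaluation rows
```

Because the indices come from sample(), calling set.seed() beforehand is what makes a split reproducible.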
normalize(x)
Applies min-max normalization to a numeric vector, scaling all values to the interval [0, 1]. This is required before training a neuralnet model because neural networks are sensitive to feature scale.
Signature
x: A numeric vector (or column) to be scaled. Typically an entire training-set tibble is passed through normalize() column-wise.
Returns: x with all values rescaled to [0, 1] using the formula:
(x - min(x)) / (max(x) - min(x))
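A minimal sketch of min-max scaling as described, with a small worked example. The column-wise helper call at the end is one possible way to apply it to a data frame and is an assumption, as is the toy data.

```r
# Min-max scaling: the smallest value maps to 0, the largest to 1.
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

scaled <- normalize(c(2, 4, 6, 10))
scaled   # 0.00 0.25 0.50 1.00

# One possible column-wise application to a (hypothetical) data frame.
toy <- data.frame(height = c(170, 180, 190), weight = c(100, 150, 200))
toy_scaled <- as.data.frame(lapply(toy, normalize))
```

Note that a constant column makes the denominator zero, so such columns would need separate handling.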
Model-Fitting Functions
These standard R and package functions are used directly in pred_sumo.R and sumo.Rmd.
glm() — Binomial Logistic Regression
Fits a generalized linear model with the binomial family (logit link by default) to predict match outcomes. Provided by the stats package that ships with base R.
Usage in this project
| Argument | Value | Meaning |
|---|---|---|
| formula | result ~ . | Predict result from all other columns |
| data | history / tr | Decided-match rows only |
| family | "binomial" | Logistic regression |
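The arguments in the table combine into a call along these lines. The data frame below is synthetic and its column names are hypothetical; it only stands in for the decided-match data.

```r
set.seed(1)

# Hypothetical stand-in for the decided matches: two numeric features
# and a logical result (TRUE = wrestler 1 wins).
history <- data.frame(rank_diff = rnorm(200), win_rate_diff = rnorm(200))
history$result <- runif(200) <
  plogis(1.2 * history$rank_diff + 0.8 * history$win_rate_diff)

# result ~ . uses every other column as a predictor.
fit <- glm(result ~ ., data = history, family = "binomial")
coef(fit)   # intercept plus one coefficient per feature
```

summary(fit) prints standard errors and significance tests for each coefficient if more detail is needed.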
neuralnet() — Neural Network
Fits a feed-forward neural network. Provided by the neuralnet package.
Usage in this project
| Argument | Value | Meaning |
|---|---|---|
| formula | result ~ . | Predict result from all other columns |
| data | history / normalize(tr) | Training data (normalize before passing) |
| hidden | 8 or c(4, 2) | Hidden-layer sizes |
| linear.output | FALSE | Apply the activation function to the output node |
| act.fct | "logistic" | Logistic (sigmoid) activation |
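A sketch of a call using the arguments in the table, on synthetic data with hypothetical column names. Two assumptions are made for the sake of a self-contained example: result is stored as 0/1 rather than logical, and scaling is applied column by column with a locally defined min-max helper. The dot in the formula requires a reasonably recent neuralnet release.

```r
library(neuralnet)
set.seed(1)

# Hypothetical training data; result is 0/1 here (an assumption made for this sketch).
tr <- data.frame(rank_diff = rnorm(100), win_rate_diff = rnorm(100))
tr$result <- as.numeric(tr$rank_diff + tr$win_rate_diff > 0)

# Min-max helper, applied column by column before training.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
tr_scaled <- as.data.frame(lapply(tr, normalize))

nn <- neuralnet(result ~ ., data = tr_scaled,
                hidden = c(4, 2),        # two hidden layers: 4 units, then 2
                linear.output = FALSE,   # squash the output through the activation
                act.fct = "logistic")
```

Training can stop without converging on harder data; neuralnet reports this with a warning, and raising stepmax is the usual remedy.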
Inference Functions
predict.glm() — GLM Predictions
Generates log-odds predictions from a fitted glm object. Positive values indicate a predicted win for wrestler 1; negative values indicate a predicted win for wrestler 2.
Usage in this project
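A small sketch of predicting on the log-odds scale and thresholding at 0, using synthetic data with a hypothetical feature name. The comparison with type = "response" is added here only to illustrate that a log-odds value above 0 is the same decision as a probability above 0.5.

```r
set.seed(1)

# Hypothetical decided-match data with a single feature.
history <- data.frame(rank_diff = rnorm(200))
history$result <- runif(200) < plogis(1.5 * history$rank_diff)
fit <- glm(result ~ ., data = history, family = "binomial")

# predict() dispatches to predict.glm(); the default type is "link" (log-odds).
log_odds <- predict(fit, newdata = history)
prob <- predict(fit, newdata = history, type = "response")

pred_win <- log_odds > 0                      # TRUE = wrestler 1 predicted to win
same_decision <- all(pred_win == (prob > 0.5))
```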
predict.glm() returns log-odds by default (i.e. the linear predictor, type = "link"). A value greater than 0 corresponds to a predicted probability above 0.5 for wrestler 1 winning.
predict() — Neural Network Predictions
Generates output-layer predictions from a fitted neuralnet object. Values are in [0, 1]; a value ≥ 0.5 is interpreted as a win for wrestler 1.
Usage in this project
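A sketch of scoring new rows with a fitted network and applying the 0.5 cut-off. The data, the feature names x1 and x2, and the 0/1 encoding of result are assumptions made to keep the example self-contained; features are drawn from [0, 1] so no extra scaling step is needed here.

```r
library(neuralnet)
set.seed(1)

# Hypothetical training data with features already in [0, 1].
train <- data.frame(x1 = runif(100), x2 = runif(100))
train$result <- as.numeric(train$x1 > train$x2)

nn <- neuralnet(result ~ ., data = train, hidden = 8,
                linear.output = FALSE, act.fct = "logistic")

# Two hypothetical upcoming matches.
new_matches <- data.frame(x1 = c(0.9, 0.1), x2 = c(0.2, 0.8))

scores <- predict(nn, newdata = new_matches)   # one-column matrix of outputs
pred_win <- scores[, 1] >= 0.5                 # TRUE = wrestler 1 predicted to win
```

New data should be scaled with the same minimum and maximum as the training data, otherwise the scores are not comparable.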
Evaluation Functions
confusionMatrix() — Classification Metrics
Computes a confusion matrix plus accuracy, sensitivity, specificity, and other metrics for a set of predicted vs. actual class labels. Provided by the caret package.
Usage in this project
Both the predicted and the actual values must be converted to factor before passing to confusionMatrix().
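A minimal sketch of the factor conversion and the call, using six hand-written labels as hypothetical data. Fixing the levels explicitly keeps both factors aligned even if one class is absent from the predictions; treating "TRUE" as the positive class is a choice made for this example.

```r
library(caret)

# Hypothetical outcomes for six matches (TRUE = wrestler 1 wins).
actual    <- c(TRUE, TRUE,  FALSE, FALSE, TRUE, FALSE)
predicted <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)

# Both inputs become factors with the same two levels.
lv <- c(FALSE, TRUE)
cm <- confusionMatrix(data = factor(predicted, levels = lv),
                      reference = factor(actual, levels = lv),
                      positive = "TRUE")

cm$table                 # 2 x 2 counts, predictions in rows
cm$overall["Accuracy"]   # 4 of 6 correct
```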
Full evaluation workflow (from sumo.Rmd)
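One way the documented pieces could be chained together, sketched under assumptions: synthetic data stands in for the output of sumo.read, the split helper is rebuilt from its description, and only the glm model is evaluated. It illustrates the flow and is not the code from sumo.Rmd.

```r
library(caret)
set.seed(42)

# Hypothetical decided-match data in place of sumo.read("sumo.csv").
n <- 400
history <- data.frame(rank_diff = rnorm(n), win_rate_diff = rnorm(n))
history$result <- runif(n) <
  plogis(1.5 * history$rank_diff + 0.5 * history$win_rate_diff)

# Split helper rebuilt from its description (sample() plus setdiff()).
data.split <- function(data, ratio) {
  idx <- sample(nrow(data), round(nrow(data) * ratio))
  list(data[idx, ], data[setdiff(seq_len(nrow(data)), idx), ])
}

parts <- data.split(history, 0.85)
tr <- parts[[1]]   # training set
te <- parts[[2]]   # evaluation set

# Fit on the training rows, predict log-odds on the held-out rows.
fit <- glm(result ~ ., data = tr, family = "binomial")
pred <- predict(fit, newdata = te) > 0

# Compare predicted and actual outcomes as factors with matching levels.
lv <- c(FALSE, TRUE)
cm <- confusionMatrix(factor(pred, levels = lv),
                      factor(te$result, levels = lv),
                      positive = "TRUE")
cm$overall["Accuracy"]
```

The same skeleton applies to the neural network by swapping in neuralnet() on normalized data and thresholding its output at 0.5.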