Evaluation happens in the TESTING block of sumo.Rmd. The workflow is: split the labelled history into training and test sets, fit both models on the training set, generate predictions on the test set, binarise those predictions, then compare them against the known results.
Train/test split
spl <- data.split(history, 0.85)
tr <- spl[[1]] # training set (~85 % of rows)
ev <- spl[[2]] # evaluation set (~15 % of rows)
data.split() randomly samples a fraction of rows for training and puts the rest aside for evaluation:
data.split <- function(data, ratio) {
  n <- round(nrow(data) * ratio)
  train <- sample(1:nrow(data), n)
  test <- setdiff(1:nrow(data), train)
  return(list(data[train, ], data[test, ]))
}
A ratio of 0.85 means 85% of completed bouts train the models and the remaining 15% are held back for unbiased testing.
With a small dataset the 15 % evaluation split may contain only a handful of rows. Accuracy figures from a single split can vary significantly — run multiple splits and average the results for a more stable estimate.
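One way to do that, sketched here with a toy data frame and the GLM only (the real workflow would reuse `history` and fit both models):

```r
# Average accuracy over several random splits for a more stable estimate.
# Runs with base R on a simulated stand-in for the bout history.
set.seed(42)

data.split <- function(data, ratio) {
  n <- round(nrow(data) * ratio)
  train <- sample(1:nrow(data), n)
  test <- setdiff(1:nrow(data), train)
  list(data[train, ], data[test, ])
}

# Toy stand-in for `history`: one predictor, binary result
toy <- data.frame(x = rnorm(200))
toy$result <- as.numeric(toy$x + rnorm(200, sd = 0.5) > 0)

accs <- replicate(20, {
  spl <- data.split(toy, 0.85)
  fit <- glm(result ~ ., spl[[1]], family = 'binomial')
  pred <- predict(fit, spl[[2]]) > 0     # log-odds threshold, as below
  mean(pred == (spl[[2]]$result == 1))   # accuracy on this split
})

mean(accs)  # averaged estimate
sd(accs)    # spread shows how much a single split can vary
```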
Fitting on the training set
bin <- glm(result ~ ., tr, family = 'binomial')
nn <- neuralnet(result ~ ., normalize(tr), hidden = c(4, 2),
                linear.output = FALSE, act.fct = 'logistic')
Note that the neural network is fitted on a normalised copy of tr (min–max scaled to [0, 1]) while the GLM receives the raw values. The evaluation set ev is then passed unnormalised to both predict calls. For the network this is a scaling mismatch: it was trained on inputs in [0, 1] but predicts on raw values, which can distort its outputs and is a plausible reason its accuracy trails the GLM's.
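The normalize() helper is not shown in this section. A plausible min–max implementation (the actual function in sumo.Rmd may differ; this sketch assumes no constant columns, which would divide by zero) is:

```r
# Hedged sketch of a min–max normalize(): rescale every numeric column
# to the [0, 1] range, leaving non-numeric columns untouched.
normalize <- function(data) {
  as.data.frame(lapply(data, function(col) {
    if (is.numeric(col)) {
      (col - min(col)) / (max(col) - min(col))
    } else {
      col
    }
  }))
}

df <- data.frame(a = c(2, 4, 6), b = c(10, 20, 30))
normalize(df)  # every numeric column now spans exactly 0..1
```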
Generating and binarising predictions
Raw model outputs are collected and then converted to logical (TRUE/FALSE) values using each model's threshold rule:
bin.ans <- predict.glm(bin, ev)  # default type = 'link': log-odds scores
nn.ans <- predict(nn, ev)        # probabilities in [0, 1]
results <- tibble(bin.ans) %>% cbind(nn.ans) %>% cbind(ev$result)
GLM threshold: bin.ans > 0
results$bin.ans = sapply(results$bin.ans,
                         function(x) if (x > 0) TRUE else FALSE)
The GLM returns a log-odds score. Any positive value means the model favours wrestler 1; any negative value means it favours wrestler 2.
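Since the inverse-logit of 0 is exactly 0.5, thresholding the log-odds at 0 is the same as thresholding the implied probability at 0.5; base R's plogis() makes the conversion explicit:

```r
# Converting GLM log-odds scores to probabilities: a score of 0 maps to
# exactly 50%, so `score > 0` and `plogis(score) > 0.5` pick the same
# winner for every bout.
scores <- c(-1.2, 0, 0.7)
probs <- plogis(scores)  # inverse-logit: 1 / (1 + exp(-x))
probs                    # approx. 0.231, 0.500, 0.668

cbind(score = scores,
      prob = round(probs, 3),
      favours_wrestler1 = scores > 0)
```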
Neural network threshold: nn.ans >= 0.5
results$nn.ans = sapply(results$nn.ans,
                        function(x) if (x >= 0.5) TRUE else FALSE)
The neural network (with linear.output = FALSE) returns a probability. The natural decision boundary is 0.5.
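As an aside, both sapply calls above can be written as plain vectorised comparisons, since R's comparison operators are element-wise and already return logical vectors:

```r
# Equivalent, vectorised binarisation: no anonymous function needed.
# Toy score vectors stand in for the real model outputs.
bin.ans <- c(-0.4, 1.3, 0.0)
nn.ans <- c(0.2, 0.8, 0.5)

glm_pred <- bin.ans > 0    # same result as the sapply/if-else version
nn_pred <- nn.ans >= 0.5

glm_pred  # FALSE TRUE FALSE
nn_pred   # FALSE TRUE TRUE
```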
The results tibble
After binarisation, column names are standardised:
colnames(results) <- c('glm', 'nn', 'truth')
| Column | Type | Description |
|---|---|---|
| glm | logical | GLM’s binarised prediction (TRUE = wrestler 1 wins) |
| nn | logical | Neural network’s binarised prediction |
| truth | logical | Actual match result from the evaluation set |
Confusion matrix
caret’s confusionMatrix() compares each model’s predictions against the ground truth:
confusionMatrix(as.factor(results$glm), as.factor(results$truth))
confusionMatrix(as.factor(results$nn), as.factor(results$truth))
The function reports accuracy, sensitivity, specificity, and other metrics. The GLM consistently achieves around 70% accuracy on the held-out evaluation set — higher than the neural network.
Both vectors must be converted with as.factor() before passing to confusionMatrix(). Passing raw logical vectors will raise an error.
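The numbers confusionMatrix() reports can be cross-checked with base R's table(). Note also that as.factor() on a logical vector puts FALSE first, and caret by default treats the first level as the "positive" class unless its positive argument says otherwise. A small standalone example:

```r
# Cross-checking a confusion matrix with base table():
# rows are predictions, columns are the ground truth.
pred <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
truth <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

# Fixing the levels explicitly guards against one vector
# happening to contain only TRUEs or only FALSEs.
tab <- table(factor(pred, levels = c(FALSE, TRUE)),
             factor(truth, levels = c(FALSE, TRUE)))
tab

accuracy <- sum(diag(tab)) / sum(tab)
accuracy  # 0.6: 3 of the 5 predictions match the truth
```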
Interpreting the results
| Metric | What it tells you |
|---|---|
| Accuracy | Overall fraction of correctly predicted bouts |
| Sensitivity | How often the model correctly predicts wrestler 1 wins when they do |
| Specificity | How often the model correctly predicts wrestler 2 wins when they do |
| Kappa | Agreement corrected for chance; useful when classes are imbalanced |
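These metrics can all be reproduced from the four cells of a 2×2 confusion matrix. The counts below are made up for illustration, not taken from the sumo data:

```r
# Computing the metrics in the table from hypothetical 2x2 counts.
# Convention here (as in caret by default): the first factor level
# is the "positive" class.
tp <- 35; fn <- 15   # positive bouts: predicted right / wrong
tn <- 28; fp <- 22   # negative bouts: predicted right / wrong
n <- tp + fn + tn + fp

accuracy <- (tp + tn) / n       # 0.63
sensitivity <- tp / (tp + fn)   # 0.70
specificity <- tn / (tn + fp)   # 0.56

# Cohen's kappa: observed agreement corrected for chance agreement,
# where chance is estimated from the row and column marginals.
p_obs <- accuracy
p_chance <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2
kappa <- (p_obs - p_chance) / (1 - p_chance)  # 0.26
```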
A ~70% accuracy from the GLM means it picks the correct winner roughly 7 times out of 10, substantially above the 50% random baseline. The neural network’s accuracy sits below the GLM’s on this dataset, likely due to the small training size relative to the network’s capacity.