Evaluation happens in the TESTING block of sumo.Rmd. The workflow is: split the labelled history into training and test sets, fit both models on the training set, generate predictions on the test set, binarise those predictions, then compare them against the known results.

Train/test split

spl <- data.split(history, 0.85)
tr  <- spl[[1]]   # training set  (~85 % of rows)
ev  <- spl[[2]]   # evaluation set (~15 % of rows)
data.split() randomly samples a fraction of rows for training and puts the rest aside for evaluation:
data.split <- function(data, ratio) {
  n     <- round(nrow(data) * ratio)
  train <- sample(1:nrow(data), n)
  test  <- setdiff(1:nrow(data), train)
  return(list(data[train,], data[test,]))
}
A ratio of 0.85 means 85% of completed bouts train the models and 15% are held back for unbiased testing.
With a small dataset the 15 % evaluation split may contain only a handful of rows. Accuracy figures from a single split can vary significantly — run multiple splits and average the results for a more stable estimate.
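The averaging suggested above can be sketched as a small helper. This is not part of sumo.Rmd: mean.glm.accuracy is a hypothetical function, and data.split is repeated here so the sketch is self-contained. It refits the GLM on several random splits and averages the per-split accuracy:

```r
# Repeated from above so this sketch runs on its own.
data.split <- function(data, ratio) {
  n     <- round(nrow(data) * ratio)
  train <- sample(1:nrow(data), n)
  test  <- setdiff(1:nrow(data), train)
  list(data[train, ], data[test, ])
}

# Hypothetical helper: average GLM accuracy over k random splits.
# Expects a data frame with a logical `result` column, like `history`.
mean.glm.accuracy <- function(data, k = 20, ratio = 0.85) {
  accs <- replicate(k, {
    spl  <- data.split(data, ratio)
    fit  <- glm(result ~ ., spl[[1]], family = 'binomial')
    pred <- predict.glm(fit, spl[[2]]) > 0   # binarise on the link scale
    mean(pred == spl[[2]]$result)            # per-split accuracy
  })
  mean(accs)                                 # more stable estimate
}
```

Calling mean.glm.accuracy(history) then gives an accuracy figure that is far less sensitive to which rows happened to land in the 15% evaluation split.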

Fitting on the training set

bin <- glm(result ~ ., tr, family = 'binomial')

nn <- neuralnet(result ~ ., normalize(tr), hidden = c(4, 2),
                linear.output = FALSE, act.fct = 'logistic')
Note that the neural network receives a normalised version of tr (min–max scaled to [0, 1]) while the GLM receives the raw values. The evaluation set ev is passed unnormalised to both predict calls; since the network was trained on normalised inputs, this scale mismatch can distort its predictions.
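normalize() is defined elsewhere in sumo.Rmd and is not shown in this section. A typical min–max scaler matching the [0, 1] range described above might look like this (a sketch, not the file's exact definition):

```r
# Min–max scale every numeric column into [0, 1], leaving
# non-numeric columns (such as a logical result) untouched.
normalize <- function(data) {
  as.data.frame(lapply(data, function(col) {
    if (is.numeric(col)) {
      (col - min(col)) / (max(col) - min(col))  # NaN if a column is constant
    } else {
      col
    }
  }))
}
```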

Generating and binarising predictions

Raw model outputs are collected and then converted to logical (TRUE/FALSE) using the threshold rule for each model:
bin.ans <- predict.glm(bin, ev)  # link-scale (log-odds) scores
nn.ans  <- predict(nn, ev)       # probabilities in [0, 1]

results <- tibble(bin.ans) %>% cbind(nn.ans) %>% cbind(ev$result)

GLM threshold: bin.ans > 0

results$bin.ans <- results$bin.ans > 0
The GLM returns a log-odds score. Any positive value means the model favours wrestler 1; any negative value means it favours wrestler 2.
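Thresholding the log-odds at 0 is the same decision rule as thresholding the implied probability at 0.5, because the logistic function maps 0 to exactly 0.5. A quick check using base R's plogis() (the logistic CDF):

```r
# A log-odds score of 0 corresponds to a probability of exactly 0.5,
# so `score > 0` on the link scale and `p > 0.5` on the probability
# scale always agree.
scores <- c(-2, -0.1, 0, 0.1, 2)          # example link-scale outputs
plogis(scores)                            # the implied probabilities
(scores > 0) == (plogis(scores) > 0.5)    # identical decisions
```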

Neural network threshold: nn.ans >= 0.5

results$nn.ans <- results$nn.ans >= 0.5
The neural network (with linear.output = FALSE) returns a probability. The natural decision boundary is 0.5.

The results tibble

After binarisation, column names are standardised:
colnames(results) <- c('glm', 'nn', 'truth')
Column  Type     Description
glm     logical  GLM's binarised prediction (TRUE = left wins)
nn      logical  Neural network's binarised prediction
truth   logical  Actual match result from the evaluation set

Confusion matrix

caret’s confusionMatrix() compares each model’s predictions against the ground truth:
confusionMatrix(as.factor(results$glm), as.factor(results$truth))
confusionMatrix(as.factor(results$nn),  as.factor(results$truth))
The function reports accuracy, sensitivity, specificity, and other metrics. The GLM consistently achieves around 70% accuracy on the held-out evaluation set — higher than the neural network.
Both vectors must be converted with as.factor() before passing to confusionMatrix(). Passing raw logical vectors will raise an error.
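One subtlety with the as.factor() conversion: if a small evaluation split happens to contain only one outcome, the two factors can end up with different level sets and confusionMatrix() will complain about mismatched levels. Supplying explicit levels avoids this; the to.factor helper below is a defensive sketch, not code from sumo.Rmd:

```r
# Coerce a logical vector to a factor with a fixed level set, so the
# prediction and truth factors always have matching levels even when
# one of them contains only TRUE or only FALSE.
to.factor <- function(x) factor(x, levels = c(FALSE, TRUE))
# Usage: confusionMatrix(to.factor(results$glm), to.factor(results$truth))
```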

Interpreting the results

Metric       What it tells you
Accuracy     Overall fraction of correctly predicted bouts
Sensitivity  How often the model correctly predicts wrestler 1 wins when they do
Specificity  How often the model correctly predicts wrestler 2 wins when they do
Kappa        Agreement corrected for chance; useful when classes are imbalanced
A ~70% accuracy from the GLM means it picks the correct winner roughly 7 times out of 10, substantially above the 50% random baseline. The neural network’s accuracy sits below the GLM’s on this dataset, likely due to the small training size relative to the network’s capacity.
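Accuracy and the majority-class baseline can also be computed directly from the results tibble. The data frame below is purely illustrative, standing in for the real results built above:

```r
# Illustrative stand-in for the real `results` tibble from above.
results <- data.frame(
  glm   = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  truth = c(TRUE, FALSE, FALSE, TRUE, TRUE)
)
mean(results$glm == results$truth)                 # accuracy
max(mean(results$truth), 1 - mean(results$truth))  # majority-class baseline
```

Comparing against the majority-class baseline rather than a flat 50% matters when one wrestler position wins more often than the other.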
