This guide walks through the full training workflow: formatting your data, splitting it into train/test sets, fitting both a GLM and a neural network, and evaluating accuracy with a confusion matrix.
Sumo Oracle expects a CSV file where each row represents one historical match. The columns must be numeric. The result column encodes the outcome as 1 (left wrestler wins) or 0 (right wrestler wins). Leave result blank only for matches you want to predict — those rows are treated as undecided and excluded from training automatically.
weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
401,379,6,5,35,31,190,184,1
344,370,8,11,36,34,189,192,1
351,245,3,2,22,27,189,169,0
335,291,3,12,32,30,177,176,0
The column names are flexible — you can add or remove features freely. The model formula result ~ . automatically uses every column except result as a predictor, so no formula changes are needed when you add columns.
All columns are coerced to numeric by sumo.read(). Categorical values will be converted by R’s as.numeric(), which may not produce meaningful encodings. Encode any categorical features manually before adding them to the CSV.
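For instance, a hypothetical `stance1` column (the column name is illustrative, not part of the sample schema) could be encoded by hand with `match()`, which assigns a stable integer per level:

```r
# Hypothetical example: encode a categorical stance column manually before
# writing it to the CSV. match() maps each value to a fixed integer code.
stances <- c('yotsu', 'oshi', 'yotsu', 'tsuki')
levels  <- c('yotsu', 'oshi', 'tsuki')
codes   <- match(stances, levels)
codes   # 1 2 1 3
```

Integer codes impose an artificial ordering on the levels; for genuinely unordered categories, separate 0/1 indicator columns (one per level) are a safer encoding.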
Loading data
Use sumo.read() to load the CSV. It reads the file, converts every column to numeric, and casts result to a logical vector so the GLM family works correctly.
sumo.read <- function(csv) {
  data <- tibble(read.csv(csv)) %>%
    mutate_all(as.numeric) %>%
    mutate(result = as.logical(result))
  return(data)
}
data <- sumo.read('sumo.csv')
After loading, separate decided matches (the training history) from undecided ones:
history <- na.omit(data)
undecided <- filter(data, is.na(result))
Splitting into train and test sets
data.split() takes a data frame and a ratio, then randomly partitions the rows into a training set and an evaluation set.
data.split <- function(data, ratio) {
  n <- round(nrow(data) * ratio)
  train <- sample(1:nrow(data), n)
  test <- setdiff(1:nrow(data), train)
  return(list(data[train, ], data[test, ]))
}
A ratio of 0.85 allocates 85% of rows to training and 15% to evaluation:
spl <- data.split(history, 0.85)
tr <- spl[[1]]
ev <- spl[[2]]
With small datasets (under ~100 rows), use a higher ratio like 0.9 to ensure the model sees enough examples. With larger datasets you can lower the ratio to 0.7–0.8 to get a more reliable evaluation set.
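Because `data.split()` samples at random, each run produces a different partition. A sketch of making the split reproducible by seeding the RNG first (the row count here is made up for illustration; it uses the same sampling logic as `data.split()`):

```r
set.seed(42)                       # fix the RNG so the split is repeatable
n.rows <- 100                      # illustrative history size
train  <- sample(1:n.rows, round(n.rows * 0.85))
test   <- setdiff(1:n.rows, train)
c(length(train), length(test))     # 85 15
```

Seeding is especially useful when comparing the GLM and the neural network, so both models are evaluated on exactly the same held-out rows.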
Normalization
The normalize() function scales values to the [0, 1] range using min-max normalization. This is required before training the neural network — neuralnet’s gradient-based optimization is sensitive to feature scale.
normalize <- function(x) {
  return( (x - min(x)) / (max(x) - min(x)) )
}
Apply it to each column of the training split when passing data to neuralnet(); the GLM does not require normalization. The scaling must be consistent at prediction time: transform the evaluation features with the min and max computed on the training split. The network was fitted on the normalized scale and cannot map raw values back to it, so unnormalized evaluation rows would fall far outside the range the network saw during training.
Training the GLM
Fit a binomial GLM (logistic regression) on the training split:
bin <- glm(result ~ ., data = tr, family = 'binomial')
result ~ . means: predict result using all other columns.
family = 'binomial' specifies logistic regression, which is appropriate for a binary outcome.
The GLM is fast to train, interpretable, and works well even with limited data. It is a good baseline before adding a neural network.
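One payoff of that interpretability: coefficient signs read directly as directions of effect on the log-odds of a left win. A sketch on synthetic data (the real model is fitted on `tr`, not on this toy frame):

```r
# Synthetic data where the left wrestler tends to win when weight1 > weight2.
set.seed(1)
toy <- data.frame(weight1 = rnorm(200), weight2 = rnorm(200))
toy$result <- toy$weight1 - toy$weight2 + rnorm(200) > 0
toy.fit <- glm(result ~ ., data = toy, family = 'binomial')
coef(toy.fit)   # weight1 positive, weight2 negative, as constructed
```

On the real data, `summary(bin)` additionally reports a z-test per coefficient, which helps identify features that contribute little and could be dropped.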
Training the neural network
tr.norm <- mutate_all(tr, normalize)   # min-max scale each column separately
nn <- neuralnet(result ~ ., tr.norm, hidden = c(4, 2),
                linear.output = FALSE, act.fct = 'logistic')
hidden = c(4, 2) defines two hidden layers with 4 and 2 neurons respectively.
linear.output = FALSE tells the network to apply the activation function to the output layer, which is necessary for classification.
act.fct = 'logistic' uses the sigmoid activation throughout, producing outputs in (0, 1).
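A rough way to gauge overfitting risk is to count the network's free parameters. With the eight features in the sample CSV, hidden = c(4, 2), and one output unit, each layer contributes weights plus biases (assuming fully connected layers, which is what neuralnet builds):

```r
# (inputs * neurons + biases) per layer: 8 -> 4 -> 2 -> 1
params <- (8 * 4 + 4) + (4 * 2 + 2) + (2 * 1 + 1)
params   # 49 free parameters
```

Forty-nine parameters against a few dozen training rows is a recipe for memorization; shrink the hidden layers when data is scarce.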
Evaluating accuracy
After training, generate predictions on the evaluation set and convert them to logical values. predict() on the GLM returns log-odds by default (type = 'link'), so values above 0 mean a predicted left win; the network outputs probabilities in (0, 1), so it is thresholded at 0.5:
bin.ans <- predict(bin, ev)   # log-odds for each evaluation row
# scale the evaluation features with the *training* split's min and max
ev.norm <- as_tibble(Map(function(x, r) (x - min(r)) / (max(r) - min(r)), ev, tr))
nn.ans <- predict(nn, ev.norm)
results <- tibble(glm   = as.vector(bin.ans) > 0,     # log-odds: positive means a left win
                  nn    = as.vector(nn.ans) >= 0.5,   # probability threshold
                  truth = ev$result)
confusionMatrix(as.factor(results$glm), as.factor(results$truth))
confusionMatrix(as.factor(results$nn), as.factor(results$truth))
confusionMatrix() from the caret package reports accuracy, sensitivity, specificity, and other metrics for each model.
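The headline Accuracy figure is simply the fraction of predictions that match the truth column, which you can verify by hand (the vectors below are hypothetical, for illustration only):

```r
glm.pred <- c(TRUE, TRUE, FALSE, TRUE)   # hypothetical model output
truth    <- c(TRUE, FALSE, FALSE, TRUE)  # hypothetical ground truth
mean(glm.pred == truth)                  # 0.75 -> 3 of 4 correct
```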
Complete training script
The full training and evaluation workflow from sumo.Rmd:
library(dplyr)
library(MASS)
library(ggplot2)
library(neuralnet)
library(caret)
sumo.read <- function(csv) {
  data <- tibble(read.csv(csv)) %>%
    mutate_all(as.numeric) %>%
    mutate(result = as.logical(result))
  return(data)
}
data.split <- function(data, ratio) {
  n <- round(nrow(data) * ratio)
  train <- sample(1:nrow(data), n)
  test <- setdiff(1:nrow(data), train)
  return(list(data[train, ], data[test, ]))
}
normalize <- function(x) {
  return( (x - min(x)) / (max(x) - min(x)) )
}
data <- sumo.read('sumo.csv')
history <- na.omit(data)
spl <- data.split(history, 0.85)
tr <- spl[[1]]
ev <- spl[[2]]
bin <- glm(result ~ ., data = tr, family = 'binomial')
tr.norm <- mutate_all(tr, normalize)   # min-max scale each column separately
nn <- neuralnet(result ~ ., tr.norm, hidden = c(4, 2),
                linear.output = FALSE, act.fct = 'logistic')
bin.ans <- predict(bin, ev)   # log-odds for each evaluation row
# scale the evaluation features with the *training* split's min and max
ev.norm <- as_tibble(Map(function(x, r) (x - min(r)) / (max(r) - min(r)), ev, tr))
nn.ans <- predict(nn, ev.norm)
results <- tibble(glm   = as.vector(bin.ans) > 0,     # log-odds threshold
                  nn    = as.vector(nn.ans) >= 0.5,   # probability threshold
                  truth = ev$result)
confusionMatrix(as.factor(results$glm), as.factor(results$truth))
confusionMatrix(as.factor(results$nn), as.factor(results$truth))
plot(bin)
plot(nn)
GLM vs neural network
| Consideration | GLM | Neural network |
|---|---|---|
| Dataset size | Works well with small datasets | Benefits from more data |
| Interpretability | Coefficients are directly readable | Black box |
| Training speed | Near-instant | Slower (gradient descent) |
| Non-linear patterns | Cannot capture them | Can capture them |
| Overfitting risk | Low | Higher — tune hidden layers carefully |
For most sumo datasets with under a few hundred rows, the GLM is the more reliable choice. The neural network may outperform it when you have several hundred or more historical matches and the relationships between features are non-linear.