
Helper Functions

These three functions are defined in sumo.Rmd (and sumo.read is also in pred_sumo.R). They handle data loading, train/test splitting, and feature normalization.

sumo.read(csv)

Reads the sumo match CSV file and returns a clean tibble ready for modelling. Every column is coerced to numeric first, then the result column is cast to logical so that TRUE represents a win for wrestler 1 and FALSE represents a win for wrestler 2.
Signature
sumo.read <- function(csv)
Parameters
csv (character, required): Path to the CSV file containing match data (e.g. "sumo.csv").
Returns
A tibble where all columns are numeric except result, which is logical (TRUE/FALSE). Rows with a blank result field in the raw CSV are imported as NA and represent undecided matches.
Implementation
sumo.read <- function(csv) {
  data <- tibble(read.csv(csv)) %>%
    mutate_all(as.numeric) %>%
    mutate(result = as.logical(result))
  return(data)
}
Example
data <- sumo.read("sumo.csv")

# Separate decided matches from undecided ones
history   <- na.omit(data)            # rows where result is TRUE or FALSE
undecided <- filter(data, is.na(result))  # rows where result is NA
mutate_all(as.numeric) is applied before as.logical, so the result column passes through the numeric coercion step (becoming 1 or 0 or NA) and is then converted to TRUE/FALSE/NA by as.logical.
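The coercion chain is easy to see on a toy vector: blanks come in as NA under as.numeric(), and as.logical() then maps 1/0/NA to TRUE/FALSE/NA. A minimal sketch:

```r
raw <- c("1", "0", NA)    # a result column as read.csv() imports it
num <- as.numeric(raw)    # 1, 0, NA
as.logical(num)           # TRUE, FALSE, NA
```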

data.split(data, ratio)

Randomly partitions a tibble into a training set and an evaluation set. Row indices for the training set are drawn with sample(), and the remaining indices are found with setdiff().
Signature
data.split <- function(data, ratio)
Parameters
data (tibble, required): The full dataset to split, typically history (decided matches only).
ratio (numeric, required): Proportion of rows to allocate to the training set. Must be strictly between 0 and 1. For example, 0.85 reserves 85% of rows for training and 15% for evaluation.
Returns
A list of two tibbles (the list is unnamed, so index it by position):
[[1]]: Training set, round(nrow(data) * ratio) rows sampled at random
[[2]]: Test / evaluation set, the remaining rows
Implementation
data.split <- function(data, ratio) {
  n     <- round(nrow(data) * ratio)
  train <- sample(1:nrow(data), n)
  test  <- setdiff(1:nrow(data), train)
  return(list(data[train,], data[test,]))
}
Example
data <- sumo.read("sumo.csv")
history <- na.omit(data)

spl <- data.split(history, 0.85)
tr  <- spl[[1]]   # training set   (~85% of rows)
ev  <- spl[[2]]   # evaluation set (~15% of rows)
sample() is called without a fixed set.seed(), so results differ on every run. Set a seed before calling data.split() if you need reproducible splits.
set.seed(42)
spl <- data.split(history, 0.85)
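As a self-contained sanity check (data.split() redefined here, with a toy data frame standing in for the real match data), the two pieces always partition the input:

```r
data.split <- function(data, ratio) {
  n     <- round(nrow(data) * ratio)
  train <- sample(1:nrow(data), n)
  test  <- setdiff(1:nrow(data), train)
  return(list(data[train, ], data[test, ]))
}

toy <- data.frame(x = 1:10)   # stand-in for history
set.seed(42)
spl <- data.split(toy, 0.7)

nrow(spl[[1]])                    # 7 training rows
nrow(spl[[2]])                    # 3 evaluation rows
nrow(spl[[1]]) + nrow(spl[[2]])   # 10: together they cover every row
```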

normalize(x)

Applies min-max normalization to a numeric vector, scaling all values to the interval [0, 1]. This is required before training a neuralnet model because neural networks are sensitive to feature scale.
Signature
normalize <- function(x)
Parameters
x (numeric vector, required): A numeric vector (or column) to be scaled. In this project the entire training-set tibble is passed to normalize(), in which case min() and max() range over every value in the tibble rather than operating column by column.
Returns
A numeric vector of the same length as x with all values rescaled to [0, 1] using the formula:
(x - min(x)) / (max(x) - min(x))
Implementation
normalize <- function(x) {
  return( (x - min(x)) / (max(x) - min(x)) )
}
Example
# Apply normalize() to the training tibble before fitting the neural network
nn <- neuralnet(result ~ ., normalize(tr), hidden = c(4, 2),
                linear.output = FALSE, act.fct = "logistic")
Apply normalize() only to the training data. When you call predict() on the evaluation set you should pass the raw (un-normalized) ev tibble, as shown in the testing workflow in sumo.Rmd.
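On a plain numeric vector the effect is easy to verify: the minimum maps to 0, the maximum to 1, and intermediate values scale linearly:

```r
normalize <- function(x) {
  return( (x - min(x)) / (max(x) - min(x)) )
}

normalize(c(2, 4, 6, 10))   # 0.00 0.25 0.50 1.00
```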

Model-Fitting Functions

These standard R and package functions are used directly in pred_sumo.R and sumo.Rmd.

glm() — Binomial Logistic Regression

Fits a generalized linear model with a binomial family (logit link) to predict match outcomes. Provided by base R's stats package.
Usage in this project
# Full-data prediction (pred_sumo.R)
bin <- glm(result ~ ., history, family = "binomial")

# Train-set evaluation (sumo.Rmd)
bin <- glm(result ~ ., tr, family = "binomial")
Arguments
formula: result ~ . (predict result from all other columns)
data: history or tr (decided-match rows only)
family: "binomial" (logistic regression)

neuralnet() — Neural Network

Fits a feed-forward neural network. Provided by the neuralnet package.
Usage in this project
# Single hidden layer, 8 units (pred_sumo.R)
nn <- neuralnet(result ~ ., history, hidden = 8)

# Two hidden layers, normalized training data (sumo.Rmd)
nn <- neuralnet(result ~ ., normalize(tr), hidden = c(4, 2),
                linear.output = FALSE, act.fct = "logistic")
Arguments
formula: result ~ . (predict result from all other columns)
data: history or normalize(tr) (training data; normalize before passing)
hidden: 8 or c(4, 2) (hidden-layer sizes)
linear.output: FALSE (apply the activation function to the output node)
act.fct: "logistic" (logistic / sigmoid activation)

Inference Functions

predict.glm() — GLM Predictions

Generates log-odds predictions from a fitted glm object. Positive values indicate a predicted win for wrestler 1; negative values a win for wrestler 2.
Usage in this project
bin.ans <- predict.glm(bin, undecided)

if (bin.ans > 0) {
  print("GLM says left.")
} else {
  print("GLM says right.")
}
predict.glm() returns log-odds by default (i.e. the linear predictor). A value greater than 0 corresponds to a predicted probability above 0.5 for wrestler 1 winning. Note that if() expects a length-one condition, so this snippet assumes undecided contains a single match; use ifelse() or a vector comparison when scoring several matches at once.
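This equivalence comes from the logistic link: plogis(), base R's logistic CDF, converts log-odds to win probabilities, and plogis(0) is exactly 0.5:

```r
plogis(0)      # 0.5: zero log-odds is a coin-flip
plogis(1.2)    # about 0.77, wrestler 1 favored
plogis(-1.2)   # about 0.23, wrestler 2 favored
```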

predict() — Neural Network Predictions

Generates output-layer predictions from a fitted neuralnet object. Values lie in [0, 1]; a value ≥ 0.5 is interpreted as a win for wrestler 1.
Usage in this project
nn.ans <- predict(nn, undecided)

if (nn.ans >= 0.5) {
  print("NN says left.")
} else {
  print("NN says right.")
}

Evaluation Functions

confusionMatrix() — Classification Metrics

Computes a confusion matrix plus accuracy, sensitivity, specificity, and other metrics for predicted vs. actual class labels. Provided by the caret package.
Usage in this project
# Convert numeric predictions to logical class labels first
results$bin.ans <- sapply(results$bin.ans, function(x) { x > 0 })
results$nn.ans  <- sapply(results$nn.ans,  function(x) { x >= 0.5 })
colnames(results) <- c("glm", "nn", "truth")

confusionMatrix(as.factor(results$glm), as.factor(results$truth))
confusionMatrix(as.factor(results$nn),  as.factor(results$truth))
Both predicted and reference vectors must be converted to factor before passing to confusionMatrix().
Full evaluation workflow (from sumo.Rmd)
data    <- sumo.read("sumo.csv")
history <- na.omit(data)
spl     <- data.split(history, 0.85)
tr   <- spl[[1]]
ev   <- spl[[2]]

bin <- glm(result ~ ., tr, family = "binomial")
nn  <- neuralnet(result ~ ., normalize(tr), hidden = c(4, 2),
                 linear.output = FALSE, act.fct = "logistic")

bin.ans <- predict.glm(bin, ev)
nn.ans  <- predict(nn, ev)

results <- tibble(bin.ans) %>% cbind(nn.ans) %>% cbind(ev$result)

results$bin.ans <- sapply(results$bin.ans,
                          function(x) { if (x > 0) TRUE else FALSE })
results$nn.ans  <- sapply(results$nn.ans,
                          function(x) { if (x >= 0.5) TRUE else FALSE })

colnames(results) <- c("glm", "nn", "truth")

confusionMatrix(as.factor(results$glm), as.factor(results$truth))
confusionMatrix(as.factor(results$nn),  as.factor(results$truth))
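confusionMatrix() requires caret, but the core of what it reports can be reproduced with base R alone; here pred and truth are made-up label vectors for illustration:

```r
pred  <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
truth <- c(TRUE, FALSE, FALSE, FALSE, TRUE)

table(pred, truth)    # the 2x2 confusion matrix
mean(pred == truth)   # accuracy: 4 of 5 labels match
```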
