Sumo Oracle is built around a small set of composable R functions and standard model interfaces. This guide explains the knobs you can turn and how to adapt the tool for other binary sports prediction problems.

Changing the neural network architecture

The hidden argument to neuralnet() controls the number and size of hidden layers.
# One hidden layer with 8 neurons (pred_sumo.R default)
nn <- neuralnet(result ~ ., history, hidden = 8)
Guidelines for choosing architecture:
  • Fewer neurons / shallower networks overfit less and train faster. Start here with small datasets.
  • More neurons / deeper networks can capture more complex patterns but require more data and are more prone to overfitting.
  • For a dataset the size of sumo.csv (~130 rows, 8 features), hidden = c(4, 2) or hidden = 8 are reasonable starting points.
Adding hidden layers does not automatically improve accuracy. Always evaluate changes using confusionMatrix() on a held-out test set before trusting a deeper network on real predictions.
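A quick way to build intuition for why deeper networks demand more data is to count trainable weights. The helper below is written for this example (it is not part of Sumo Oracle); with ~130 rows in sumo.csv, a single layer of 8 neurons already has 81 weights to fit:

```r
# Number of weights (including biases) in a fully connected network
# with `inputs` input features, `hidden` layer sizes, and one output
n.weights <- function(inputs, hidden) {
  sizes <- c(inputs, hidden, 1)
  sum((sizes[-length(sizes)] + 1) * sizes[-1])
}

n.weights(8, 8)        # one layer of 8:  (8+1)*8 + (8+1)*1 = 81
n.weights(8, c(4, 2))  # 4-2 layout:      (8+1)*4 + (4+1)*2 + (2+1)*1 = 49
```

When the weight count approaches the row count, overfitting is likely; this is one reason smaller layouts are the safer default here.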

Adjusting the train/test ratio

data.split() takes a ratio between 0 and 1 controlling what fraction of rows go to training:
data.split <- function(data, ratio) {
  n <- round(nrow(data) * ratio)           # number of training rows
  train <- sample(1:nrow(data), n)         # random row indices for training
  test <- setdiff(1:nrow(data), train)     # remaining rows for evaluation
  return(list(data[train,], data[test,]))
}
# 85% train, 15% test
spl <- data.split(history, 0.85)

# 70% train, 30% test — more reliable evaluation, less training data
spl <- data.split(history, 0.70)
More historical data generally improves accuracy. If you find that model performance varies significantly between runs, your evaluation set is probably too small. Either collect more data or use cross-validation instead of a single split.
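One way to sketch that cross-validation idea, using only base R's glm() (the same binomial model this guide trains) and the fold logic below, is a small helper like this. cv.accuracy is written for this example and is not part of Sumo Oracle; it expects a data frame like the guide's history, with a binary result column:

```r
# k-fold cross-validation: average held-out glm accuracy over k random folds
cv.accuracy <- function(history, k = 5) {
  # Assign each row to one of k folds at random
  folds <- sample(rep(1:k, length.out = nrow(history)))
  accs <- sapply(1:k, function(i) {
    tr <- history[folds != i, ]
    ev <- history[folds == i, ]
    bin <- glm(result ~ ., tr, family = 'binomial')
    pred <- predict(bin, ev) > 0           # link scale: > 0 means p > 0.5
    mean(pred == (ev$result == 1))
  })
  mean(accs)
}
```

Because every row is held out exactly once, the resulting estimate is far less sensitive to one lucky or unlucky split than a single data.split() run.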

Adding or removing features

The formula result ~ . means “predict result using all other columns in the data frame.” As a result, you can add or remove features simply by adding or removing columns in your CSV; no formula changes are needed.
weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
If you want to use only specific features rather than all columns, replace result ~ . with an explicit formula:
# Use only weight and wins
bin <- glm(result ~ weight1 + weight2 + wins1 + wins2, tr, family = 'binomial')
nn  <- neuralnet(result ~ weight1 + weight2 + wins1 + wins2, normalize(tr),
                 hidden = c(4, 2), linear.output = FALSE, act.fct = 'logistic')
All columns in the CSV are coerced to numeric by sumo.read(). Features with many unique values (like IDs or names) will produce meaningless numeric encodings. Either drop those columns or encode them manually before including them.
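Dropping such columns right after reading, before any model sees them, takes one line of base R. The snippet below uses a small inline data frame standing in for sumo.read() output; name1 and name2 are hypothetical ID columns, not part of the real sumo.csv schema:

```r
# Example frame standing in for sumo.read() output; name1 is a
# hypothetical identifier column that should not reach the models
data <- data.frame(name1   = c('a', 'b'),
                   weight1 = c(155, 180),
                   weight2 = c(140, 150),
                   result  = c(1, 0))

# Drop a single column
data$name1 <- NULL

# Drop several columns at once (names absent from the frame are ignored)
data <- data[, setdiff(names(data), c('name1', 'name2'))]
```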

Changing the activation function

The act.fct argument in neuralnet() accepts either a string name or a custom function:
# Logistic sigmoid (default in sumo.Rmd)
nn <- neuralnet(result ~ ., normalize(tr), hidden = c(4, 2),
                linear.output = FALSE, act.fct = 'logistic')

# Hyperbolic tangent
nn <- neuralnet(result ~ ., normalize(tr), hidden = c(4, 2),
                linear.output = FALSE, act.fct = 'tanh')
The logistic sigmoid is a natural choice for binary classification because it maps outputs to (0, 1). The >= 0.5 threshold used in prediction corresponds directly to the midpoint of the sigmoid range.
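You can check that midpoint claim directly with plogis(), base R's logistic CDF (this is a standalone illustration, not Sumo Oracle code):

```r
# plogis() is base R's logistic sigmoid: 1 / (1 + exp(-x))
plogis(0)            # 0.5 — the >= 0.5 threshold sits exactly at x = 0
plogis(c(-4, 4))     # outputs symmetric around 0.5
```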

Adapting to other binary classification sports problems

Sumo Oracle’s structure is not sumo-specific. Any sport where you want to predict a binary outcome (team A wins vs team B wins) can use the same workflow:
1. Define your features

Identify numeric statistics that describe each competitor: historical win rate, physical attributes, recent form, head-to-head record, etc. Add one column per statistic, with 1 and 2 suffixes to distinguish the two sides.
2. Build your CSV

Use the same format: one row per historical match, numeric columns, result as 1 (left/home wins) or 0 (right/away wins), blank result for future matches.
stat1_home,stat1_away,stat2_home,stat2_away,...,result
3. Reuse sumo.read() and data.split() unchanged

These functions are data-agnostic. They work on any CSV matching the expected format.
data <- sumo.read('my_sport.csv')
history <- na.omit(data)
undecided <- filter(data, is.na(result))
spl <- data.split(history, 0.85)
tr <- spl[[1]]
ev <- spl[[2]]
4. Train and evaluate

bin <- glm(result ~ ., tr, family = 'binomial')
nn  <- neuralnet(result ~ ., normalize(tr), hidden = c(4, 2),
                 linear.output = FALSE, act.fct = 'logistic')

# predict.glm() returns log-odds by default, so > 0 corresponds to a
# predicted probability above 0.5; as.numeric() makes the factor levels
# (0/1) match those of ev$result
confusionMatrix(as.factor(as.numeric(predict.glm(bin, ev) > 0)), as.factor(ev$result))

How the R formula interface works

R formulas use the syntax response ~ predictors. In result ~ .:
  • result is the response variable (what you are predicting).
  • . is shorthand for “all other columns in the data frame.”
You can also write explicit formulas to select or transform specific features:
Formula                           Meaning
result ~ .                        All columns except result
result ~ weight1 + weight2        Only weight1 and weight2
result ~ . - height1 - height2    All columns except result, height1, and height2
result ~ weight1 * weight2        weight1, weight2, and their interaction term
Both glm() and neuralnet() accept the same formula syntax, so you can use identical formulas for both models.
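If you want to see exactly which terms a formula expands to against a given data frame, base R's terms() will show you. The column names below are toy stand-ins for illustration:

```r
df <- data.frame(result  = c(0, 1),
                 weight1 = c(150, 160),
                 weight2 = c(140, 155),
                 height1 = c(185, 190))

# '.' expands to every column except the response
attr(terms(result ~ ., data = df), 'term.labels')
# "weight1" "weight2" "height1"

# '-' removes terms from the expansion
attr(terms(result ~ . - height1, data = df), 'term.labels')
# "weight1" "weight2"

# '*' adds an interaction term
attr(terms(result ~ weight1 * weight2, data = df), 'term.labels')
# "weight1" "weight2" "weight1:weight2"
```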
