This guide walks you through everything you need to go from a fresh R installation to a working prediction. The whole process takes about five minutes.

Prerequisites

  • R 4.0 or later — download from cran.r-project.org
  • Git (optional) — only needed if you want to clone the repository
RStudio is not required, but it makes running .R and .Rmd files easier. You can run everything from the R console or a terminal.
1

Install the required packages

Open an R console and install all five dependencies:
install.packages(c("dplyr", "MASS", "neuralnet", "ggplot2", "caret"))
Package | Purpose
dplyr | Data wrangling and tibble construction
MASS | Loaded by pred_sumo.R; glm() itself comes from base R's stats package and fits an ordinary (non-Bayesian) binomial GLM
neuralnet | Neural network training and prediction
ggplot2 | Plotting model diagnostics (used in sumo.Rmd)
caret | Confusion matrix and accuracy evaluation
install.packages() does not skip packages you already have; it reinstalls them. That is harmless, so running the full call is always safe, but it can take longer than necessary.
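If you would rather install only what is missing, a minimal sketch along these lines should work. The helper name missing_packages is hypothetical and not part of the project; it only compares the package list against what installed.packages() reports.

```r
# Hypothetical helper (not part of the project): which packages are not installed yet?
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# The five packages named in this step.
required <- c("dplyr", "MASS", "neuralnet", "ggplot2", "caret")
to_install <- missing_packages(required)
print(to_install)

# Uncomment to install only the missing ones:
# if (length(to_install) > 0) install.packages(to_install)
```

An empty character vector means every package is already available.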
2

Download the project

Clone the repository or download the source directly.
git clone https://github.com/samreeves/sumo.git
cd sumo
The project contains three key files:
  • pred_sumo.R — the main prediction script
  • sumo.Rmd — the R Markdown notebook with testing and evaluation
  • sumo.csv — historical match data (135 matches)
3

Prepare sumo.csv

Open sumo.csv in any text editor. It contains one row per match with nine columns:
weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
401,379,6,5,35,31,190,184,1
344,370,8,11,36,34,189,192,1
351,245,3,2,22,27,189,169,0
Rows with a numeric result (0 or 1) are used as training data. To predict a new match, append a row with the eight wrestler attributes and leave result blank:
weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
350,370,4,7,28,30,182,185,
  • result = 1 means the left wrestler won
  • result = 0 means the right wrestler won
  • result blank means this row will be predicted
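If you prefer not to edit the file by hand, the row can also be appended from R. The sketch below is an assumption about how you might do it, not something the project ships: append_match is a hypothetical helper, and the demo runs on a temporary file so your real sumo.csv is untouched.

```r
# Hypothetical helper: append one undecided match to a sumo.csv-style file.
append_match <- function(csv, weight1, weight2, wins1, wins2,
                         age1, age2, height1, height2) {
  matches <- read.csv(csv)
  new_row <- data.frame(
    weight1 = weight1, weight2 = weight2,
    wins1 = wins1, wins2 = wins2,
    age1 = age1, age2 = age2,
    height1 = height1, height2 = height2,
    result = NA  # NA marks the match as undecided
  )
  matches <- rbind(matches, new_row)
  # na = "" writes the missing result as an empty field, as in the example above.
  write.csv(matches, csv, row.names = FALSE, na = "", quote = FALSE)
  invisible(matches)
}

# Demo on a temporary file containing two of the sample rows.
demo_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "weight1,weight2,wins1,wins2,age1,age2,height1,height2,result",
  "401,379,6,5,35,31,190,184,1",
  "351,245,3,2,22,27,189,169,0"
), demo_csv)
append_match(demo_csv, 350, 370, 4, 7, 28, 30, 182, 185)
readLines(demo_csv)
```

To use it on the real data, pass "sumo.csv" as the first argument from the project folder.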
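Keep exactly one row with a blank result. The script's if checks expect a single prediction per model. If every row has a result, there are no undecided matches to predict and the script stops with an error. If several rows are blank, the if conditions receive more than one value, which is an error in R 4.2 and later (earlier versions warn and use only the first value).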
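Because the script passes each model's prediction straight to if, which expects a single value, it can help to count the blank rows before running it. The following is a rough pre-flight check under that assumption; count_undecided is a hypothetical helper, and the demo file is a stand-in for sumo.csv.

```r
# Hypothetical helper: number of rows whose result field is blank (read as NA).
count_undecided <- function(csv) {
  matches <- read.csv(csv)
  sum(is.na(matches$result))
}

# Demo file: two decided matches and one undecided match.
demo_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "weight1,weight2,wins1,wins2,age1,age2,height1,height2,result",
  "401,379,6,5,35,31,190,184,1",
  "351,245,3,2,22,27,189,169,0",
  "350,370,4,7,28,30,182,185,"
), demo_csv)

n_undecided <- count_undecided(demo_csv)
if (n_undecided != 1) {
  message("Expected exactly one blank result, found ", n_undecided, ".")
} else {
  message("Ready: one match to predict.")
}
```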
4

Run pred_sumo.R

From your terminal, run the script with Rscript:
Rscript pred_sumo.R
Or source it from an R console:
source("pred_sumo.R")
Make sure your working directory is set to the folder containing both pred_sumo.R and sumo.csv. The script loads the CSV with a relative path ('sumo.csv').
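A quick way to confirm the working directory from an R console is to check that both files are visible. This is only a sketch; missing_files is a hypothetical helper introduced here for illustration.

```r
# Files the script expects to find side by side.
needed <- c("pred_sumo.R", "sumo.csv")

# Hypothetical helper: which of the needed files are absent from `dir`?
missing_files <- function(dir = getwd(), files = needed) {
  files[!file.exists(file.path(dir, files))]
}

absent <- missing_files()
if (length(absent) > 0) {
  message("Not found in ", getwd(), ": ", paste(absent, collapse = ", "),
          ". Use setwd() to switch to the project folder.")
} else {
  message("Both files found in ", getwd(), ".")
}
```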
5

Interpret the output

The script prints one prediction line per model, followed by the raw numeric score.
Both models predict the left wrestler wins:
[1] "GLM says left."
        1
0.8473214
[1] "NN says left."
          [,1]
[1,] 0.7341289
Both models predict the right wrestler wins:
[1] "GLM says right."
         1
-1.2038471
[1] "NN says right."
          [,1]
[1,] 0.2847563
How to read the scores:
  • GLM score: a log-odds value. Positive → left wins. Negative → right wins. Larger absolute values indicate higher confidence.
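  • NN score: a value that normally falls between 0 and 1 and can be read roughly as a probability. ≥ 0.5 → left wins. < 0.5 → right wins. Values close to 0.5 indicate low confidence. Because the script calls neuralnet() with its default linear output layer, a score can occasionally fall slightly below 0 or above 1.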
When the two models agree, the prediction is more trustworthy. When they disagree, rely on the GLM, which achieves approximately 70% accuracy on this dataset, compared with a lower rate for the neural network.
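To put the GLM score on the same scale as the NN score, you can convert the log-odds to a probability with plogis(), the logistic function from base R. The sketch below uses the sample scores shown above; final_pick is a hypothetical helper that encodes the "prefer the GLM on disagreement" rule and is not part of pred_sumo.R.

```r
# Sample GLM scores (log-odds) copied from the example output above.
glm_scores <- c(0.8473214, -1.2038471)

# plogis(x) computes 1 / (1 + exp(-x)), turning log-odds into a probability of a left win.
glm_probs <- plogis(glm_scores)
round(glm_probs, 3)

# Hypothetical helper: combine both models, falling back to the GLM when they disagree.
final_pick <- function(glm_score, nn_score) {
  glm_pick <- ifelse(glm_score > 0, "left", "right")
  nn_pick <- ifelse(nn_score >= 0.5, "left", "right")
  ifelse(glm_pick == nn_pick, glm_pick, paste0(glm_pick, " (GLM only)"))
}

# Sample NN scores from the same output.
final_pick(glm_scores, c(0.7341289, 0.2847563))
```

A log-odds of 0 corresponds to a probability of 0.5, which is why the sign check on the GLM score and the 0.5 threshold on the NN score express the same kind of decision.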

Complete working example

The full source of pred_sumo.R, exactly as it appears in the repository:
library(dplyr)
library(MASS)
library(neuralnet)

sumo.read <- function(csv) {
  data <- tibble(read.csv(csv)) %>%
    mutate_all(as.numeric) %>%
    mutate(result = as.logical(result))
  return(data)
}

data <- sumo.read('sumo.csv')
history <- na.omit(data)
undecided <- filter(data, is.na(result))

nn <- neuralnet(result ~ ., history, hidden = 8)
bin <- glm(result ~ ., history, family = 'binomial')

bin.ans <- predict.glm(bin, undecided)
nn.ans <- predict(nn, undecided)

if (bin.ans > 0) {
  print('GLM says left.')
} else {print('GLM says right.')}
bin.ans

if (nn.ans >= 0.5) {
  print('NN says left.')
} else {print('NN says right.')}
nn.ans
Walk through what each section does:
  1. sumo.read() — reads the CSV, coerces all columns to numeric, then casts result to logical so GLM receives a proper binary response variable.
  2. history — all rows where result is not NA; used to train both models.
  3. undecided — all rows where result is NA; the matches to predict.
  4. neuralnet(..., hidden = 8) — trains a single hidden-layer network with 8 units using the full feature set (result ~ .).
  5. glm(..., family = 'binomial') — fits a logistic regression model (a binomial GLM fitted by maximum likelihood, not a Bayesian model) on the same data.
  6. Prediction and thresholding — GLM uses a sign check on the log-odds; the neural network uses a 0.5 probability threshold.
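The if checks in the script compare a single value, so the thresholding step as written handles one undecided match. If you wanted to label several matches at once, one possible variation is to replace if with the vectorised ifelse(). The sketch below illustrates this for the GLM only, on made-up data with two features instead of eight; nothing in it comes from sumo.csv or the repository.

```r
set.seed(1)  # synthetic data only, for a reproducible illustration

# Made-up stand-in for the decided matches: 60 rows, two features.
n <- 60
history <- data.frame(
  weight1 = round(rnorm(n, 350, 40)),
  weight2 = round(rnorm(n, 350, 40))
)
# In this toy data the heavier left wrestler wins more often.
history$result <- rbinom(n, 1, plogis((history$weight1 - history$weight2) / 40)) == 1

# Three undecided matches instead of one.
undecided <- data.frame(weight1 = c(400, 300, 350), weight2 = c(320, 380, 355))

bin <- glm(result ~ ., history, family = "binomial")
bin.ans <- predict(bin, undecided)  # log-odds, one value per undecided row

# ifelse() is vectorised, so every row gets a label; if () needs a single value.
labels <- ifelse(bin.ans > 0, "GLM says left.", "GLM says right.")
data.frame(undecided, log_odds = round(bin.ans, 2), prediction = labels)
```

The same pattern would apply to the neural network score with the 0.5 threshold.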

Next steps

Data format

Full CSV schema reference, NA handling rules, and tips for building your own dataset.

Models

How the GLM and neural network are configured, trained, and evaluated with confusion matrices.
