Quickstart

This guide walks you through everything you need to go from a fresh R installation to a working prediction. The whole process takes about five minutes.

Prerequisites

R 4.0 or later — download from cran.r-project.org
Git (optional) — only needed if you want to clone the repository

RStudio is not required, but it makes running .R and .Rmd files easier. You can run everything from the R console or a terminal.

Install the required packages

Open an R console and install all five dependencies:

install.packages(c("dplyr", "MASS", "neuralnet", "ggplot2", "caret"))

Package	Purpose
`dplyr`	Data wrangling and tibble construction
`MASS`	Loaded by `pred_sumo.R` alongside the binomial GLM (the `glm()` function itself ships with base R's `stats` package)
`neuralnet`	Neural network training and prediction
`ggplot2`	Plotting model diagnostics (used in `sumo.Rmd`)
`caret`	Confusion matrix and accuracy evaluation

If you already have some of these packages installed, install.packages() reinstalls them with the current CRAN version rather than skipping them. Running the full install.packages() call is still safe; it just takes a little longer.
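If you prefer to check what is already installed before calling install.packages(), the following is a minimal sketch using only base R. The helper name missing_packages() is hypothetical and not part of the project; the install call is left commented out so the snippet has no side effects.

```r
# Packages named in this guide
required <- c("dplyr", "MASS", "neuralnet", "ggplot2", "caret")

# Hypothetical helper: return the subset of `pkgs` that is not installed.
# rownames(installed.packages()) lists every package in the library paths.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(required)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```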

Download the project

Clone the repository or download the source directly.

git clone https://github.com/samreeves/sumo.git
cd sumo

The project contains three key files:

pred_sumo.R — the main prediction script
sumo.Rmd — the R Markdown notebook with testing and evaluation
sumo.csv — historical match data (135 matches)

Prepare sumo.csv

Open sumo.csv in any text editor. It contains one row per match with nine columns:

weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
401,379,6,5,35,31,190,184,1
344,370,8,11,36,34,189,192,1
351,245,3,2,22,27,189,169,0
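As a quick sanity check on the nine-column layout, a sketch like the one below reads the three sample rows and compares the header against the expected names. It uses inline text instead of the real file so it runs anywhere; the variable names are illustrative only.

```r
# Column names the guide describes, in order
expected_cols <- c("weight1", "weight2", "wins1", "wins2",
                   "age1", "age2", "height1", "height2", "result")

# The three sample rows shown above, as inline text
sample_csv <- "weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
401,379,6,5,35,31,190,184,1
344,370,8,11,36,34,189,192,1
351,245,3,2,22,27,189,169,0"

matches <- read.csv(text = sample_csv)

# Stop early if the header does not match the expected schema
stopifnot(identical(names(matches), expected_cols))
str(matches)
```

To check your own file, replace read.csv(text = sample_csv) with read.csv("sumo.csv").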

Rows with a numeric result (0 or 1) are used as training data. To predict a new match, append a row with the eight wrestler attributes and leave result blank:

weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
350,370,4,7,28,30,182,185,

result = 1 means the left wrestler won
result = 0 means the right wrestler won
result blank means this row will be predicted
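If you would rather append the new match from R than from a text editor, here is a minimal sketch. It works on a temporary copy of the sample rows so it does not touch your real sumo.csv; the path and variable names are assumptions for illustration.

```r
# Build a small stand-in for sumo.csv in a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "weight1,weight2,wins1,wins2,age1,age2,height1,height2,result",
  "401,379,6,5,35,31,190,184,1",
  "344,370,8,11,36,34,189,192,1",
  "351,245,3,2,22,27,189,169,0"
), csv_path)

# Eight attributes, then a trailing comma so `result` is left blank
new_match <- "350,370,4,7,28,30,182,185,"
write(new_match, file = csv_path, append = TRUE)

# A blank field in a numeric column is read back as NA
matches <- read.csv(csv_path)
print(tail(matches, 1))
```

When appending to a real file, make sure its last line ends with a newline, otherwise the new row is glued onto the previous one.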

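You must have at least one row with a blank result. If every row has a result, the script has no undecided matches to predict and stops with an error at the prediction step. The script as written also reports one match per run: its if () checks expect a single score, and in R 4.2 or later a condition of length greater than one is an error, so keep exactly one blank row.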
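A pre-flight check can catch a missing blank row before the models are trained. The sketch below is not part of pred_sumo.R; count_undecided() is a hypothetical helper, and the demo file is a temporary stand-in for sumo.csv.

```r
# Hypothetical helper: count rows whose `result` is blank (read as NA)
# and stop with a clear message if there are none.
count_undecided <- function(csv_path) {
  matches <- read.csv(csv_path)
  n_blank <- sum(is.na(matches$result))
  if (n_blank == 0) {
    stop("No rows with a blank result: nothing to predict.")
  }
  n_blank
}

# Demo on a temporary file with one decided and one undecided match
demo_path <- tempfile(fileext = ".csv")
writeLines(c(
  "weight1,weight2,wins1,wins2,age1,age2,height1,height2,result",
  "401,379,6,5,35,31,190,184,1",
  "350,370,4,7,28,30,182,185,"
), demo_path)

print(count_undecided(demo_path))
```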

Run pred_sumo.R

From your terminal, run the script with Rscript:

Rscript pred_sumo.R

Or source it from an R console:

source("pred_sumo.R")

Make sure your working directory is set to the folder containing both pred_sumo.R and sumo.csv. The script loads the CSV with a relative path ('sumo.csv').
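To confirm the working directory before running the script, a short check like this one can help. It is a sketch only; files_present() is a hypothetical helper that reports whether each required file exists in a given folder.

```r
required_files <- c("pred_sumo.R", "sumo.csv")

# Hypothetical helper: named logical vector, TRUE where the file exists
files_present <- function(dir = getwd()) {
  present <- file.exists(file.path(dir, required_files))
  names(present) <- required_files
  present
}

print(getwd())          # where R is currently looking
print(files_present())  # both entries should be TRUE before you run the script
```

If either entry is FALSE, change directory with setwd() or cd into the project folder first.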

Interpret the output

The script prints one prediction line per model, followed by the raw numeric score.

Both models predict the left wrestler wins:

[1] "GLM says left."
        1
0.8473214
[1] "NN says left."
          [,1]
[1,] 0.7341289

Both models predict the right wrestler wins:

[1] "GLM says right."
         1
-1.2038471
[1] "NN says right."
          [,1]
[1,] 0.2847563

How to read the scores:

GLM score: a log-odds value. Positive → left wins. Negative → right wins. Larger absolute values indicate higher confidence.
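NN score: a value the script treats as a probability. ≥ 0.5 → left wins. < 0.5 → right wins. Values close to 0.5 indicate low confidence. Because neuralnet() is called with its default linear output, the score is not strictly bounded and can fall slightly outside the 0 to 1 range.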
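Log-odds are easier to compare with the NN score once they are converted to probabilities. The sketch below applies base R's plogis(), the inverse of the logit, to the two GLM scores from the sample output above.

```r
# GLM scores from the two sample outputs (log-odds)
glm_left  <- 0.8473214
glm_right <- -1.2038471

# plogis(x) computes 1 / (1 + exp(-x)), mapping log-odds to a probability
p_left  <- plogis(glm_left)   # roughly 0.70: the left wrestler is favoured
p_right <- plogis(glm_right)  # roughly 0.23: the right wrestler is favoured

print(round(c(p_left, p_right), 2))

# A log-odds of 0 corresponds to a probability of exactly 0.5,
# which is why a sign check and a 0.5 threshold are equivalent.
print(plogis(0))
```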

When the two models agree, the prediction is more trustworthy. When they disagree, rely on the GLM — it achieves approximately 70% accuracy vs. a lower rate for the neural network on this dataset.
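The agreement rule described above can be sketched as a small function. combine_predictions() is a hypothetical helper, not part of the repository; it takes a single GLM log-odds score and a single NN score and lets the GLM decide when the two disagree.

```r
# Hypothetical helper: combine the two scores into one call
combine_predictions <- function(glm_score, nn_score) {
  glm_side <- if (glm_score > 0) "left" else "right"
  nn_side  <- if (nn_score >= 0.5) "left" else "right"
  list(
    prediction   = glm_side,             # the GLM breaks disagreements
    models_agree = glm_side == nn_side   # TRUE means higher trust
  )
}

# Scores from the first sample output: both models say left
str(combine_predictions(0.8473214, 0.7341289))

# Made-up scores where the models disagree: the GLM's call is returned
str(combine_predictions(0.4, 0.3))
```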

Complete working example

The full source of pred_sumo.R, exactly as it appears in the repository:

pred_sumo.R

library(dplyr)
library(MASS)
library(neuralnet)

sumo.read <- function(csv) {
  data <- tibble(read.csv(csv)) %>%
    mutate_all(as.numeric) %>%
    mutate(result = as.logical(result))
  return(data)
}

data <- sumo.read('sumo.csv')
history <- na.omit(data)
undecided <- filter(data, is.na(result))

nn <- neuralnet(result ~ ., history, hidden = 8)
bin <- glm(result ~ ., history, family = 'binomial')

bin.ans <- predict.glm(bin, undecided)
nn.ans <- predict(nn, undecided)

if (bin.ans > 0) {
  print('GLM says left.')
} else {print('GLM says right.')}
bin.ans

if (nn.ans >= 0.5) {
  print('NN says left.')
} else {print('NN says right.')}
nn.ans

Walk through what each section does:

sumo.read(): reads the CSV, coerces all columns to numeric, then casts result to logical so the GLM receives a proper binary response variable.
history: all rows with no NA values; used to train both models.
undecided: all rows where result is NA; the matches to predict.
neuralnet(..., hidden = 8): trains a single hidden-layer network with 8 units using the full feature set (result ~ .).
glm(..., family = 'binomial'): fits a logistic regression model on the same data by maximum likelihood.
Prediction and thresholding: the GLM uses a sign check on the log-odds, which is equivalent to a 0.5 probability threshold; the neural network uses a 0.5 threshold on its raw output.
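The if () checks in the script work on a single score. If you want to score several undecided matches at once, one option is a vectorised threshold with ifelse(). The sketch below is an illustration only: it uses randomly generated stand-in data with two made-up features and fits just the GLM, so the numbers say nothing about real matches.

```r
set.seed(1)

# Synthetic training data (NOT real match data): heavier wrestlers win more often
n <- 60
history <- data.frame(
  weight1 = rnorm(n, mean = 350, sd = 40),
  weight2 = rnorm(n, mean = 350, sd = 40)
)
history$result <- runif(n) < plogis((history$weight1 - history$weight2) / 40)

bin <- glm(result ~ ., history, family = "binomial")

# Two undecided matches instead of one
undecided <- data.frame(weight1 = c(400, 300), weight2 = c(320, 380))

log_odds <- predict(bin, undecided)                     # default scale: log-odds
prob     <- predict(bin, undecided, type = "response")  # probability scale

# ifelse() thresholds every row at once, unlike a scalar if ()
sides <- ifelse(log_odds > 0, "left", "right")
print(data.frame(undecided, log_odds, prob, prediction = sides))
```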

Next steps

Data format

Full CSV schema reference, NA handling rules, and tips for building your own dataset.

Models

How the GLM and neural network are configured, trained, and evaluated with confusion matrices.
