Sumo Oracle is an R-based machine learning tool that predicts the outcome of sumo wrestling matches. Given wrestler attributes for both competitors — weight, age, height, and career wins — it trains two models on historical match data and generates a directional prediction: the left wrestler wins, or the right wrestler wins. The project is open source under the GPL-3.0 license.

What problem it solves

Sumo match outcomes depend on a complex interplay of physical attributes and experience. Sumo Oracle formalises this as a binary classification problem. You supply a CSV of past matches with known outcomes, add one or more rows with a blank result field for the matches you want to forecast, and the tool does the rest.

The two models

Sumo Oracle trains two models on the same historical data and reports a separate prediction from each.

Bayesian GLM

The generalized linear model uses logistic regression via R’s glm() with family = 'binomial'. It is fit on the full history of labeled matches and produces a real-valued log-odds score for each prediction.
  • Accuracy: approximately 70% on held-out data
  • Output threshold: scores above 0 → "GLM says left."; scores at or below 0 → "GLM says right."
The GLM is the more reliable of the two models and should be treated as the primary signal.
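
The fit-and-threshold step described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the code in pred_sumo.R: the training data is synthetic and made up for the example, and the names matches, upcoming, score, and glm_label are hypothetical.

```r
# Synthetic stand-in for sumo.csv: every value here is made up for illustration.
set.seed(1)
n <- 60
matches <- data.frame(
  weight1 = round(runif(n, 250, 420)), weight2 = round(runif(n, 250, 420)),
  wins1   = sample(0:15, n, replace = TRUE),
  wins2   = sample(0:15, n, replace = TRUE),
  age1    = sample(20:38, n, replace = TRUE),
  age2    = sample(20:38, n, replace = TRUE),
  height1 = round(runif(n, 170, 195)), height2 = round(runif(n, 170, 195))
)
# Made-up outcome loosely tied to the difference in wins, plus noise.
matches$result <- (matches$wins1 - matches$wins2 + rnorm(n, sd = 4)) > 0

# Logistic regression on all eight attributes.
fit <- glm(result ~ ., data = matches, family = "binomial")

# One undecided match: the attribute values of a row with a blank result.
upcoming <- data.frame(weight1 = 350, weight2 = 370, wins1 = 4, wins2 = 7,
                       age1 = 28, age2 = 30, height1 = 182, height2 = 185)

# type = "link" returns log-odds; a score of 0 corresponds to probability 0.5.
score <- predict(fit, newdata = upcoming, type = "link")
glm_label <- if (score > 0) "GLM says left." else "GLM says right."
print(glm_label)
```

Thresholding the log-odds at 0 is equivalent to thresholding the predicted probability at 0.5, since plogis(0) is 0.5.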

Neural network

The neural network is fit using R’s neuralnet package with 8 hidden units. It is trained on the same historical data as the GLM and produces a probability between 0 and 1.
  • Output threshold: probability ≥ 0.5 → "NN says left."; probability < 0.5 → "NN says right."
  • Useful as a secondary signal and for comparing model agreement
When both models agree, the prediction is more reliable. When they disagree, treat the GLM result as the primary answer.
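
One way to apply both thresholds and the GLM-first rule is sketched below. combine_signals is a hypothetical helper written for this example and is not part of Sumo Oracle; the input scores are invented.

```r
# Hypothetical helper (not part of Sumo Oracle): combine the two model outputs.
combine_signals <- function(glm_score, nn_prob) {
  glm_left <- glm_score > 0    # GLM threshold: log-odds above 0 means left
  nn_left  <- nn_prob >= 0.5   # NN threshold: probability of 0.5 or more means left
  list(
    winner = if (glm_left) "left" else "right",  # GLM is the primary signal
    agree  = glm_left == nn_left                 # TRUE when both models concur
  )
}

# Invented scores: both models favor the left wrestler.
both_left <- combine_signals(glm_score = 0.8, nn_prob = 0.62)

# Invented scores: the models disagree, so the GLM's "right" is kept.
split_vote <- combine_signals(glm_score = -0.3, nn_prob = 0.55)

str(both_left)
str(split_vote)
```

The agree flag can be used to mark predictions that deserve less trust.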

Input features

Each row in sumo.csv describes a single match between two wrestlers. The eight input columns are:
| Column  | Description                            |
|---------|----------------------------------------|
| weight1 | Weight of the left wrestler (lbs)      |
| weight2 | Weight of the right wrestler (lbs)     |
| wins1   | Number of wins for the left wrestler   |
| wins2   | Number of wins for the right wrestler  |
| age1    | Age of the left wrestler (years)       |
| age2    | Age of the right wrestler (years)      |
| height1 | Height of the left wrestler (cm)       |
| height2 | Height of the right wrestler (cm)      |
All columns are read as numeric values. The result column is cast to logical (TRUE/FALSE) internally.
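
As a rough illustration of that loading behavior, the sketch below reads a few rows in the sumo.csv layout with base R. It is not the project's sumo.read() implementation; the inline csv_text is a stand-in for the file and reuses sample rows from this page.

```r
# Inline stand-in for sumo.csv: two labeled rows and one undecided row.
csv_text <- "weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
401,379,6,5,35,31,190,184,1
351,245,3,2,22,27,189,169,0
350,370,4,7,28,30,182,185,"

# Read every column as numeric; a blank numeric field becomes NA.
matches <- read.csv(text = csv_text, colClasses = "numeric")

# Cast the 0/1 outcome to logical: 1 -> TRUE, 0 -> FALSE, NA stays NA.
matches$result <- as.logical(matches$result)

str(matches)
```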

Prediction output

The result column encodes the match outcome as a binary integer:
  • 1 — the left wrestler wins
  • 0 — the right wrestler wins
Rows with a populated result are used as training data. Rows where result is blank (NA) are treated as undecided matches and passed to both models for prediction. Running pred_sumo.R prints one line per model:
[1] "GLM says left."
[1] "NN says left."
Or, if the right wrestler is predicted to win:
[1] "GLM says right."
[1] "NN says right."
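
The split between training rows and undecided rows can be sketched in base R like this. The variable names (undecided, train, to_predict) are illustrative assumptions, and the code is not the project's data.split() function.

```r
# Three labeled rows and one undecided row, using sample values from this page.
matches <- data.frame(
  weight1 = c(401, 344, 351, 350), weight2 = c(379, 370, 245, 370),
  wins1   = c(6, 8, 3, 4),         wins2   = c(5, 11, 2, 7),
  age1    = c(35, 36, 22, 28),     age2    = c(31, 34, 27, 30),
  height1 = c(190, 189, 189, 182), height2 = c(184, 192, 169, 185),
  result  = c(TRUE, TRUE, FALSE, NA)
)

# A blank result is read as NA, which marks the row as undecided.
undecided <- is.na(matches$result)

# Labeled rows become training data.
train <- matches[!undecided, ]

# Undecided rows keep only the eight attribute columns.
to_predict <- matches[undecided, setdiff(names(matches), "result"), drop = FALSE]

cat(nrow(train), "training rows,", nrow(to_predict), "row(s) to predict\n")
```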

The dataset

sumo.csv is a plain CSV file with 135 historical matches. A sample of the first few rows:
weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
401,379,6,5,35,31,190,184,1
344,370,8,11,36,34,189,192,1
351,245,3,2,22,27,189,169,0
335,291,3,12,32,30,177,176,0
375,306,0,0,35,28,186,185,0
To predict a new match, append a row with all eight attribute columns filled in and leave result empty:
weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
350,370,4,7,28,30,182,185,
Both models require at least one row with a blank result to produce a prediction. If every row has a result value, the script has nothing to predict and will raise an error.
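
A possible way to append a prediction row and check for it before running the models is sketched below. It writes to a temporary file rather than the real sumo.csv, and the guard and its error message are assumptions added for illustration, not behavior taken from pred_sumo.R.

```r
# Temporary stand-in for sumo.csv with two labeled rows.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "weight1,weight2,wins1,wins2,age1,age2,height1,height2,result",
  "401,379,6,5,35,31,190,184,1",
  "351,245,3,2,22,27,189,169,0"
), path)

# Append the undecided match; the trailing comma leaves result empty.
cat("350,370,4,7,28,30,182,185,\n", file = path, append = TRUE)

matches <- read.csv(path)

# Count rows whose result was left blank (read as NA).
n_undecided <- sum(is.na(matches$result))

# Hypothetical guard: fail early with a clear message if nothing needs predicting.
if (n_undecided == 0) {
  stop("No rows with a blank result: nothing to predict.")
}
cat(n_undecided, "undecided match(es) found\n")
```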

License

Sumo Oracle is released under the GNU General Public License v3.0 (GPL-3.0). You are free to use, modify, and distribute the code under the terms of that license.

Where to go next

Quickstart

Install the required R packages and run your first prediction end to end.

Data format

Learn the full CSV schema, how training and prediction rows differ, and how to prepare your own data.

Models

Understand the GLM and neural network architectures, training process, and accuracy evaluation.

Functions

Reference documentation for sumo.read(), data.split(), and normalize().
