Introduction

Sumo Oracle is an R-based machine learning tool that predicts the outcome of sumo wrestling matches. It trains two models on historical match data and, given the weight, age, height, and career wins of both competitors, predicts which side wins: the left wrestler or the right wrestler. The project is open source under the GPL-3.0 license.

What problem it solves

Sumo match outcomes depend on a complex interplay of physical attributes and experience. Sumo Oracle formalises this as a binary classification problem. You supply a CSV of past matches with known outcomes, add one or more rows with a blank result field for the matches you want to forecast, and the tool does the rest.

The two models

Sumo Oracle trains two models on the same data and reports a prediction from each.

Bayesian GLM

The generalized linear model uses logistic regression via R’s glm() with family = 'binomial'. It is fit on the full history of labeled matches and produces a real-valued log-odds score for each prediction.

Accuracy: approximately 70% on held-out data
Output threshold: scores above 0 → "GLM says left.", scores at or below 0 → "GLM says right."

The GLM is the more reliable of the two models and should be treated as the primary signal.
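The fit-and-threshold step described above can be sketched in a few lines of base R. This is a minimal illustration, not the code in pred_sumo.R: the training data is synthetic (generated with a fixed seed), and the formula `result ~ .` is an assumption about how the eight attributes enter the model.

```r
# Synthetic stand-in for sumo.csv (NOT the project's data): 100 labeled
# matches whose outcome loosely depends on weight and win differences.
set.seed(42)
n <- 100
train <- data.frame(
  weight1 = round(rnorm(n, 330, 40)),
  weight2 = round(rnorm(n, 330, 40)),
  wins1   = sample(0:15, n, replace = TRUE),
  wins2   = sample(0:15, n, replace = TRUE),
  age1    = sample(20:38, n, replace = TRUE),
  age2    = sample(20:38, n, replace = TRUE),
  height1 = round(rnorm(n, 185, 6)),
  height2 = round(rnorm(n, 185, 6))
)
log_odds <- 0.02 * (train$weight1 - train$weight2) +
  0.1 * (train$wins1 - train$wins2)
train$result <- rbinom(n, size = 1, prob = plogis(log_odds)) == 1

# Logistic regression on all eight attribute columns (assumed formula).
fit <- glm(result ~ ., data = train, family = "binomial")

# One undecided match; the attribute values are illustrative.
undecided <- data.frame(
  weight1 = 350, weight2 = 370, wins1 = 4, wins2 = 7,
  age1 = 28, age2 = 30, height1 = 182, height2 = 185
)

# type = "link" returns the log-odds; a score of 0 is a probability of 0.5.
score <- unname(predict(fit, newdata = undecided, type = "link"))
print(if (score > 0) "GLM says left." else "GLM says right.")
```

Thresholding the log-odds at 0 is equivalent to thresholding the predicted probability at 0.5, since `plogis(0)` is 0.5.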

Neural network

The neural network is fit using R’s neuralnet package with 8 hidden units. It is trained on the same historical data as the GLM and produces a probability between 0 and 1.

Output threshold: probability ≥ 0.5 → "NN says left.", probability < 0.5 → "NN says right."
Useful as a secondary signal and for comparing model agreement

When both models agree, the prediction is more reliable. When they disagree, treat the GLM result as the primary answer.
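The two thresholds and the agreement rule can be expressed as a small helper. The function name `combine_signals` and its return structure are hypothetical, introduced here only for illustration; the project itself may simply print the two lines.

```r
# Hypothetical helper: map the raw model outputs to the printed labels
# and record whether the two models agree.
combine_signals <- function(glm_score, nn_prob) {
  # GLM: log-odds above 0 means left, at or below 0 means right.
  glm_side <- if (glm_score > 0) "left" else "right"
  # NN: probability of at least 0.5 means left, below 0.5 means right.
  nn_side <- if (nn_prob >= 0.5) "left" else "right"
  list(
    glm     = sprintf("GLM says %s.", glm_side),
    nn      = sprintf("NN says %s.", nn_side),
    agree   = glm_side == nn_side,
    primary = glm_side  # the GLM is treated as the primary signal
  )
}

# Example with made-up scores where the models disagree.
res <- combine_signals(glm_score = 0.8, nn_prob = 0.35)
print(res$glm)
print(res$nn)
print(res$agree)
```

Note the boundary cases differ between the models: a GLM score of exactly 0 maps to right, while a neural network probability of exactly 0.5 maps to left.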

Input features

Each row in sumo.csv describes a single match between two wrestlers. The eight input columns are:

Column	Description
`weight1`	Weight of the left wrestler (lbs)
`weight2`	Weight of the right wrestler (lbs)
`wins1`	Number of wins for the left wrestler
`wins2`	Number of wins for the right wrestler
`age1`	Age of the left wrestler (years)
`age2`	Age of the right wrestler (years)
`height1`	Height of the left wrestler (cm)
`height2`	Height of the right wrestler (cm)

All columns are read as numeric values. The result column is cast to logical (TRUE/FALSE) internally.
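As a rough sketch of that loading step, the snippet below reads a few rows as numeric and then casts result to logical. It uses an inline string instead of the sumo.csv file so it runs on its own, and it is an assumption about how the loading could be done, not the project's sumo.read() implementation.

```r
# Two labeled rows and one undecided row, inlined so the example is
# self-contained. In practice the text would come from sumo.csv.
csv_text <- "weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
401,379,6,5,35,31,190,184,1
351,245,3,2,22,27,189,169,0
350,370,4,7,28,30,182,185,"

# Read every column as numeric; the blank result field becomes NA.
matches <- read.csv(text = csv_text, colClasses = "numeric")

# Cast the outcome to logical: 1 -> TRUE, 0 -> FALSE, NA stays NA.
matches$result <- as.logical(matches$result)

str(matches)
```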

Prediction output

The result column encodes the match outcome as a binary integer:

1 — the left wrestler wins
0 — the right wrestler wins

Rows with a populated result are used as training data. Rows where result is blank (NA) are treated as undecided matches and passed to both models for prediction. Running pred_sumo.R prints one line per model:

[1] "GLM says left."
[1] "NN says left."

Or, if the right wrestler is predicted to win:

[1] "GLM says right."
[1] "NN says right."

The dataset

sumo.csv is a plain CSV file with 135 historical matches. A sample of the first few rows:

weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
401,379,6,5,35,31,190,184,1
344,370,8,11,36,34,189,192,1
351,245,3,2,22,27,189,169,0
335,291,3,12,32,30,177,176,0
375,306,0,0,35,28,186,185,0

To predict a new match, append a row with all eight attribute columns filled in and leave result empty:

weight1,weight2,wins1,wins2,age1,age2,height1,height2,result
350,370,4,7,28,30,182,185,

Both models require at least one row with a blank result to produce a prediction. If every row has a result value, the script has nothing to predict and fails with an error.
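One way to separate training rows from prediction rows, and to fail early with a clear message when no blank result is present, is sketched below. The helper name `split_by_result` is hypothetical and is not one of the project's documented functions.

```r
# Hypothetical helper: split a match table into labeled rows (training)
# and rows with a missing result (to be predicted).
split_by_result <- function(matches) {
  undecided <- is.na(matches$result)
  if (!any(undecided)) {
    stop("No rows with a blank result: nothing to predict.")
  }
  list(
    train   = matches[!undecided, , drop = FALSE],
    predict = matches[undecided, , drop = FALSE]
  )
}

# Small illustrative table: two decided matches and one undecided match.
matches <- data.frame(
  weight1 = c(401, 351, 350), weight2 = c(379, 245, 370),
  wins1   = c(6, 3, 4),       wins2   = c(5, 2, 7),
  age1    = c(35, 22, 28),    age2    = c(31, 27, 30),
  height1 = c(190, 189, 182), height2 = c(184, 169, 185),
  result  = c(1, 0, NA)
)

parts <- split_by_result(matches)
print(nrow(parts$train))
print(nrow(parts$predict))
```

Checking for undecided rows before fitting gives a readable error instead of a failure deeper inside the prediction step.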

License

Sumo Oracle is released under the GNU General Public License v3.0 (GPL-3.0). You are free to use, modify, and distribute the code under the terms of that license.

Where to go next

Quickstart

Install the required R packages and run your first prediction end to end.

Data format

Learn the full CSV schema, how training and prediction rows differ, and how to prepare your own data.

Models

Understand the GLM and neural network architectures, training process, and accuracy evaluation.

Functions

Reference documentation for sumo.read(), data.split(), and normalize().
