The Exploratory Data Analysis (EDA) page gives you a set of interactive tools for understanding the structure and characteristics of your data before running threat detection, or independently of it. You can examine feature distributions, plot relationships between variables, view a correlation heatmap, count categorical values, and run a quick anomaly scan powered by IsolationForest. EDA is useful both for validating that your data is clean and for building intuition about which features tend to separate normal from anomalous behavior.

Choose your data source

1. Navigate to Exploratory Data Analysis

Use the sidebar dropdown to select Exploratory Data Analysis.

2. Select a data source

Use the radio selector to choose between Upload CSV and Use sample dataset.
  • Upload CSV — upload any CSV file you want to explore. This does not need to be formatted for threat detection.
  • Use sample dataset — loads the bundled insider_threat_clean_dataset.csv from the AI_Model_Code/ directory. This is the dataset used to train the model.
Once loaded, the app confirms the number of records.

3. View the dataset snapshot

The Dataset Snapshot section displays the shape of your data (rows × columns) and a collapsible preview of the first 50 rows.

4. Review the quick summary

The Quick Summary section shows a describe() table covering all columns (count, mean, min, max, quartiles for numeric columns; count and frequency for categorical columns). Alongside it, a bar chart displays any columns with missing values, sorted by count. If no missing values are present, the app confirms this.

5. Explore features interactively

Use the sections below the summary to drill into specific aspects of your data. See EDA features for details on each tool.
If you do not have a dataset ready, select Use sample dataset to immediately explore the feature space ThreatDetect was trained on. This is a good way to understand which columns the model expects and what typical value ranges look like before you prepare your own data.
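
The snapshot and summary steps above can be approximated outside the app with pandas. This is a minimal sketch, not the app's actual code: the inline CSV and its column names are hypothetical stand-ins for an uploaded file, and `describe(include="all")` is assumed to be how the summary covers both numeric and categorical columns.

```python
import io

import pandas as pd

# Hypothetical stand-in for an uploaded CSV; column names are illustrative only.
csv_text = """user_id,department,logins,files_accessed,is_malicious
u1,IT,12,40,0
u2,HR,3,,0
u3,IT,25,310,1
u4,Finance,7,22,0
"""
df = pd.read_csv(io.StringIO(csv_text))

# Dataset snapshot: shape (rows x columns) and a preview of the first 50 rows.
n_rows, n_cols = df.shape
preview = df.head(50)

# Quick summary: describe() over numeric and categorical columns together.
summary = df.describe(include="all")

# Missing values per column, keeping only affected columns, sorted by count.
missing = df.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)

print(f"{n_rows} rows x {n_cols} columns")
print(summary)
print(missing if not missing.empty else "No missing values")
```

Running the same checks on your own file before uploading it is a quick way to catch empty columns or unexpected data types.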

EDA features

Numeric feature exploration

Select any numeric column from the dropdown to generate two side-by-side charts:
  • Histogram with KDE — shows the distribution of values for that column, with a smoothed density curve overlaid.
  • Boxplot — shows the median, interquartile range, and outliers.
Together these reveal skew, modality, and the presence of extreme values in any numeric feature.
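
The quantities these two charts show can also be computed directly. The sketch below uses made-up values and assumes the conventional boxplot rule of flagging points more than 1.5 × IQR beyond the quartiles; the app's exact plotting settings are not documented here.

```python
import pandas as pd

# Illustrative values only; in the app this would be the selected numeric column.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40], name="logins")

# Quartiles and interquartile range, as drawn by the box.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Assumed whisker rule: points beyond 1.5 * IQR from the quartiles are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Sample skewness: a positive value indicates a longer right tail.
skewness = values.skew()

print(f"median={median}, IQR={iqr}, skew={skewness:.2f}")
print("outliers:", outliers.tolist())
```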

Scatter / relationship explorer

Choose an X axis column, a Y axis column, and optionally a Color by column (hue). ThreatDetect renders a scatterplot with semi-transparent points. When a hue column is provided, points are coloured by that variable — for example, by the target label — to reveal whether the two numeric features separate classes visually. This tool is useful for identifying linear or non-linear relationships between features and for spotting clusters that correspond to threat status.
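
A comparable plot can be sketched with matplotlib, drawing one semi-transparent series per hue value. The data, column names, and the `alpha=0.5` setting are assumptions for illustration; the library and styling used by the app itself may differ.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: two numeric features and a binary hue column.
df = pd.DataFrame({
    "logins": [5, 7, 6, 8, 30, 28],
    "files_accessed": [20, 25, 22, 30, 300, 280],
    "is_malicious": [0, 0, 0, 0, 1, 1],
})

fig, ax = plt.subplots()
# One scatter call per hue value gives each class its own colour and legend entry.
for label, group in df.groupby("is_malicious"):
    ax.scatter(group["logins"], group["files_accessed"],
               alpha=0.5, label=f"is_malicious={label}")
ax.set_xlabel("logins")
ax.set_ylabel("files_accessed")
ax.legend()
# fig.savefig("scatter.png")  # uncomment to write the figure to disk
```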

Correlation matrix

The Correlation Matrix section computes Pearson correlations across all numeric columns and displays a heatmap using a coolwarm colour scale centred at zero. Red cells indicate positive correlation; blue cells indicate negative correlation. If a target column is auto-detected (see Target column detection), the app also displays a ranked table of the top 10 features most correlated with the target by absolute correlation value. Features near the top of this list tend to carry the most predictive signal for classifying the target.
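
The ranking by absolute correlation can be reproduced roughly as follows. This is a sketch on synthetic data: the column names are hypothetical, and `files_accessed` is constructed to correlate with the target so that the ordering is visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "is_malicious": target,
    "files_accessed": target * 5 + rng.normal(size=n),  # related by construction
    "logins": rng.normal(size=n),                       # unrelated noise
    "department": rng.choice(["IT", "HR"], size=n),     # non-numeric, excluded
})

# Pearson correlations across numeric columns only.
corr = df.select_dtypes(include="number").corr(method="pearson")

# Correlation of each feature with the target, ranked by absolute value.
target_corr = corr["is_malicious"].drop("is_malicious")
top10 = target_corr.sort_values(key=lambda s: s.abs(), ascending=False).head(10)

print(top10)
```

Sorting on the absolute value keeps strongly negative correlations near the top, while the printed values retain their sign.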

Categorical feature counts

Select any categorical column from the dropdown to see a horizontal bar chart of value counts (up to the top 30 values). This is useful for checking class imbalance, spotting unexpected categories, and understanding the distribution of nominal fields like department or role.
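
The counts behind this chart correspond to a pandas `value_counts()` call truncated to 30 entries. A small sketch with made-up department values:

```python
import pandas as pd

# Illustrative values; None stands for a missing entry.
departments = pd.Series(
    ["IT", "HR", "IT", "Finance", "IT", "HR", None], name="department"
)

# value_counts() sorts by frequency and drops missing values by default.
counts = departments.value_counts().head(30)

# Relative shares make class imbalance easier to read.
shares = counts / counts.sum()

print(counts)
print(shares.round(2))
```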

Quick anomaly scan

Click Run quick anomaly scan to fit an IsolationForest model (100 estimators, 1% contamination) on a random sample of up to 2,000 rows using all numeric columns. The app reports the approximate number of outliers detected in the top 1% of anomaly scores and displays a histogram of anomaly score values across the sample.
The Quick Anomaly Scan uses a lightweight IsolationForest fitted only on the EDA sample — it is independent of the trained threat detection model. Use it for a fast sense-check of data quality and outlier prevalence, not as a substitute for the full detection pipeline.
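
A scan with the same parameters can be sketched with scikit-learn. The synthetic data, the fixed `random_state`, and the median imputation of missing values are assumptions added for this example; how the app handles missing values is not stated here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic stand-in for a loaded dataset; column names are hypothetical.
df = pd.DataFrame({
    "logins": rng.normal(10, 2, size=3000),
    "files_accessed": rng.normal(50, 10, size=3000),
    "department": rng.choice(["IT", "HR"], size=3000),
})

# Random sample of up to 2,000 rows, numeric columns only.
sample = df.sample(n=min(len(df), 2000), random_state=42)
X = sample.select_dtypes(include="number")
X = X.fillna(X.median())  # assumption: simple median imputation

# 100 estimators, 1% contamination, as described above.
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
labels = model.fit_predict(X)    # -1 = outlier, 1 = inlier
scores = model.score_samples(X)  # lower scores are more anomalous

n_outliers = int((labels == -1).sum())
print(f"Approximate outliers: {n_outliers} of {len(X)} sampled rows")
```

With 1% contamination, roughly 1% of the sampled rows are labelled as outliers whatever the data looks like, so the score histogram is usually more informative than the count alone.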

Target column detection

ThreatDetect automatically checks your column names for any of the following: is_malicious, malicious, label, target, is_threat, threat. The first matching column is designated as the target, and the app:
  1. Displays its value counts (so you can see class balance at a glance).
  2. Uses it as the default hue option in the Scatter / Relationship Explorer.
  3. Shows correlations with the target in the Correlation Matrix section.
If none of these column names are present, the app skips target-specific analysis silently. You can still use all other EDA tools on your data.
If your binary label column uses a name not in the auto-detection list, the target-specific panels will not appear. You can still explore all features manually. To take full advantage of the target correlation panel, rename your column to one of the supported names before uploading.
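
The detection rule can be sketched as a simple name lookup. Two details are assumptions here rather than documented behaviour: matching is exact and case-sensitive, and precedence follows the order of the list above. `detect_target` is a hypothetical helper name, not a function from the app.

```python
# Supported target names, in the order listed above.
TARGET_CANDIDATES = ["is_malicious", "malicious", "label", "target", "is_threat", "threat"]


def detect_target(columns):
    """Return the first supported target name present in columns, or None."""
    for name in TARGET_CANDIDATES:  # assumed precedence: list order
        if name in columns:         # assumed: exact, case-sensitive match
            return name
    return None


print(detect_target(["user_id", "logins", "label", "threat"]))  # label
print(detect_target(["user_id", "logins", "is_insider"]))       # None

# Renaming an unsupported label column makes it detectable.
columns = ["user_id", "logins", "is_insider"]
renamed = ["label" if c == "is_insider" else c for c in columns]
print(detect_target(renamed))  # label
```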
All EDA tools (distributions, scatter explorer, correlation matrix, categorical counts, and anomaly scan) work with any tabular CSV, whether or not a target column is present. Only the target-specific sub-panels are skipped when no recognised target column is found.
The app has no hard row limit for EDA. However, the Quick Anomaly Scan samples at most 2,000 rows for performance reasons, and the SHAP tools on other pages sample at most 100 records. Very large files may cause slower rendering of charts in your browser.