The Exploratory Data Analysis (EDA) page gives you a set of interactive tools to understand the structure and characteristics of your data before, or independently of, running threat detection. You can examine feature distributions, scatter relationships between variables, a correlation heatmap, categorical value counts, and a quick anomaly scan powered by IsolationForest. EDA is useful both for validating that your data is clean and for building intuition about which features tend to separate normal from anomalous behavior.
Choose your data source
Select a data source
Use the radio selector to choose between Upload CSV and Use sample dataset.
- Upload CSV: upload any CSV file you want to explore. It does not need to be formatted for threat detection.
- Use sample dataset: loads the bundled insider_threat_clean_dataset.csv from the AI_Model_Code/ directory. This is the dataset used to train the model.
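As a rough illustration of the two options, the sketch below loads either an uploaded file or the bundled sample with pandas. The helper name `load_eda_data` and the way the sample path is joined are assumptions made for this example; the app's own loading code may differ.

```python
import io
from pathlib import Path

import pandas as pd

# Assumed location of the bundled sample, built from the names given above.
SAMPLE_PATH = Path("AI_Model_Code") / "insider_threat_clean_dataset.csv"


def load_eda_data(uploaded=None, sample_path=SAMPLE_PATH):
    """Return a DataFrame from an upload if one is given, else from the sample CSV.

    `uploaded` can be any path or file-like object that pandas.read_csv accepts.
    """
    source = uploaded if uploaded is not None else sample_path
    return pd.read_csv(source)


# Example with an in-memory "upload"; any CSV works, labelled or not.
demo = load_eda_data(io.StringIO("user,logins,label\na,3,0\nb,40,1\n"))
print(demo.shape)
```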
View the dataset snapshot
The Dataset Snapshot section displays the shape of your data (rows × columns) and a collapsible preview of the first 50 rows.
Review the quick summary
The Quick Summary section shows a describe() table covering all columns (count, mean, min, max, and quartiles for numeric columns; count and frequency for categorical columns). Alongside it, a bar chart displays any columns with missing values, sorted by count. If no missing values are present, the app confirms this.
Explore features interactively
Use the sections below the summary to drill into specific aspects of your data. See EDA features for details on each tool.
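The Dataset Snapshot and Quick Summary steps can be approximated in a few lines of pandas. This is a minimal sketch on a made-up DataFrame, not the app's code; the descending sort order for missing-value counts is an assumption, since the page only says the chart is sorted by count.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for an uploaded CSV.
df = pd.DataFrame({
    "logins": [3, 5, np.nan, 40],
    "department": ["IT", "HR", None, None],
    "label": [0, 0, 0, 1],
})

rows, cols = df.shape                  # dataset shape (rows x columns)
preview = df.head(50)                  # first 50 rows for the preview
summary = df.describe(include="all")   # numeric and categorical columns together

# Columns with at least one missing value, largest count first (assumed order).
missing = df.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)

if missing.empty:
    print("No missing values found.")
else:
    print(missing)
```

Passing `include="all"` is what makes `describe()` report categorical columns alongside numeric ones; without it, pandas summarises only the numeric columns when any are present.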
EDA features
Numeric feature exploration
Select any numeric column from the dropdown to generate two side-by-side charts:
- Histogram with KDE: shows the distribution of values for that column, with a smoothed density curve overlaid.
- Boxplot: shows the median, interquartile range, and outliers.
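To make the boxplot's contents concrete, the sketch below computes the numbers such a chart summarises. It assumes the conventional 1.5 × IQR whisker rule for outliers; the page does not state which rule the app's boxplot uses, and `boxplot_stats` is a hypothetical helper name.

```python
import pandas as pd


def boxplot_stats(values: pd.Series, whisker: float = 1.5) -> dict:
    """Median, quartiles, IQR, and points outside the whisker fences."""
    values = values.dropna()
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Fences: points beyond these are drawn as individual outliers.
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = values[(values < lower) | (values > upper)]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "outliers": outliers.tolist(),
    }


stats = boxplot_stats(pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100]))
print(stats)
```

A plotting library such as seaborn can draw comparable charts with `histplot(..., kde=True)` and `boxplot(...)`; whether the app uses those exact calls is not stated here.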
Scatter / relationship explorer
Choose an X axis column, a Y axis column, and optionally a Color by column (hue). ThreatDetect renders a scatterplot with semi-transparent points. When a hue column is provided, points are coloured by that variable (for example, by the target label) to reveal whether the two numeric features separate classes visually. This tool is useful for identifying linear or non-linear relationships between features and for spotting clusters that correspond to threat status.
Correlation matrix
The Correlation Matrix section computes Pearson correlations across all numeric columns and displays a heatmap using a coolwarm colour scale centred at zero. Red cells indicate positive correlation; blue cells indicate negative correlation. If a target column is auto-detected (see Target column detection), the app also displays a ranked table of the top 10 features most correlated with the target by absolute correlation value. Features near the top of this list have the strongest linear association with the target and tend to carry the most predictive signal for classifying it.
Categorical feature counts
Select any categorical column from the dropdown to see a horizontal bar chart of value counts (up to the top 30 values). This is useful for checking class imbalance, spotting unexpected categories, and understanding the distribution of nominal fields like department or role.
Quick anomaly scan
Click Run quick anomaly scan to fit an IsolationForest model (100 estimators, 1% contamination) on a random sample of up to 2,000 rows using all numeric columns. The app reports the approximate number of outliers detected in the top 1% of anomaly scores and displays a histogram of anomaly score values across the sample.
The Quick Anomaly Scan uses a lightweight IsolationForest fitted only on the EDA sample. It is independent of the trained threat detection model. Use it for a fast sense-check of data quality and outlier prevalence, not as a substitute for the full detection pipeline.
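A scan with the settings described above could look roughly like the following scikit-learn sketch. The function name `quick_anomaly_scan`, the fixed random seed, and the median fill for missing values are assumptions for illustration; the page does not say how the app handles gaps or seeds its sampling.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def quick_anomaly_scan(df, max_rows=2000, seed=0):
    """Fit an IsolationForest on a sample of the numeric columns.

    Returns the number of rows flagged as outliers and the anomaly scores.
    """
    numeric = df.select_dtypes(include="number")
    sample = numeric.sample(n=min(len(numeric), max_rows), random_state=seed)
    # Assumption: fill gaps with column medians so every row is complete.
    sample = sample.fillna(sample.median())

    model = IsolationForest(n_estimators=100, contamination=0.01, random_state=seed)
    labels = model.fit_predict(sample)        # -1 = outlier, 1 = inlier
    scores = model.decision_function(sample)  # lower = more anomalous
    return int((labels == -1).sum()), scores


# Synthetic data with one planted extreme row.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a", "b", "c"])
demo.loc[0] = [25.0, 25.0, 25.0]

n_outliers, scores = quick_anomaly_scan(demo)
print(n_outliers, "rows flagged out of", len(scores))
```

With `contamination=0.01`, the model places its decision threshold so that about 1% of the fitted sample is flagged, which is why the reported outlier count tracks the sample size rather than any ground-truth label.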
Target column detection
ThreatDetect automatically checks your column names for any of the following: is_malicious, malicious, label, target, is_threat, threat. The first matching column is designated as the target, and the app:
- Displays its value counts (so you can see class balance at a glance).
- Uses it as the default hue option in the Scatter / Relationship Explorer.
- Shows correlations with the target in the Correlation Matrix section.
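The detection rule and the target correlation ranking can be sketched as follows. Two assumptions are baked in: "first matching" is read as the first name in the candidate list that appears among the columns (exact, case-sensitive match), and the target is numeric so that Pearson correlation applies. The helper names `detect_target` and `top_correlated` are hypothetical.

```python
import pandas as pd

# Candidate names taken from the list above, in the order given.
TARGET_CANDIDATES = ["is_malicious", "malicious", "label", "target", "is_threat", "threat"]


def detect_target(columns):
    """Return the first candidate name present in `columns`, or None."""
    present = set(columns)
    for name in TARGET_CANDIDATES:
        if name in present:
            return name
    return None


def top_correlated(df, target, k=10):
    """Rank numeric features by absolute Pearson correlation with the target."""
    corr = df.select_dtypes(include="number").corr()  # Pearson by default
    ranked = corr[target].drop(target).abs().sort_values(ascending=False)
    return ranked.head(k)


df = pd.DataFrame({
    "logins": [1, 2, 3, 4, 5, 6],
    "after_hours": [0, 0, 1, 0, 5, 6],
    "department": list("ABABAB"),
    "label": [0, 0, 0, 0, 1, 1],
})

target = detect_target(df.columns)
if target is not None:
    print(df[target].value_counts())   # class balance at a glance
    print(top_correlated(df, target))  # strongest linear associations first
```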
What if my target column has a different name?
If your binary label column uses a name not in the auto-detection list, the target-specific panels will not appear. You can still explore all features manually. To take full advantage of the target correlation panel, rename your column to one of the supported names before uploading.
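One way to rename the column before uploading is a short pandas step. The column name `insider_flag` below is a made-up example; substitute your own label column.

```python
import pandas as pd

# Hypothetical data whose label column is not in the auto-detection list.
df = pd.DataFrame({"user": ["a", "b"], "insider_flag": [0, 1]})

# Rename to one of the supported names so the target panels appear.
df = df.rename(columns={"insider_flag": "label"})

# to_csv returns the CSV text when no path is given; pass a path to write a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```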
Can I use the EDA page with data that has no threat labels?
Yes. All EDA tools — distributions, scatter explorer, correlation matrix, categorical counts, and anomaly scan — work with any tabular CSV regardless of whether a target column is present. Only the target-specific sub-panels are skipped when no recognised target column is found.
How large a dataset can I upload for EDA?
The app has no hard row limit for EDA. However, the Quick Anomaly Scan samples at most 2,000 rows for performance reasons, and the SHAP tools on other pages sample at most 100 records. Very large files may cause slower rendering of charts in your browser.