Overview
The DataProfiler class analyzes a dataset and produces a comprehensive profile, including:
- Schema information (columns, types)
- Statistical summaries
- Missing value analysis
- Categorical variable analysis
- Target variable analysis
Class Definition
DataProfiler
Parameters:
- Path to the dataset file (CSV or Parquet)
- Name of the target column for prediction
- Type of ML task: 'classification' or 'regression'
Methods
load_data()
Load the dataset from file into a pandas DataFrame.
Returns:
pd.DataFrame - Loaded pandas DataFrame
Raises:
- FileNotFoundError - If the dataset file does not exist
- ValueError - If the file format is not supported or the file cannot be parsed
Supported formats:
- CSV (.csv)
- Parquet (.parquet)
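
The loading behaviour described for load_data() can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the function name `load_dataset` and the `SUPPORTED_SUFFIXES` mapping are hypothetical, and the sketch assumes the format is chosen from the file extension and that parsing failures are re-raised as ValueError.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the pandas reader that handles it.
SUPPORTED_SUFFIXES = {".csv": "read_csv", ".parquet": "read_parquet"}


def load_dataset(path):
    """Load a CSV or Parquet file, mirroring the documented error behaviour."""
    file_path = Path(path)
    if not file_path.exists():
        raise FileNotFoundError(f"Dataset file not found: {file_path}")

    suffix = file_path.suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Unsupported file format: {suffix!r}")

    # pandas is imported lazily so the checks above run even without it installed.
    import pandas as pd

    reader = getattr(pd, SUPPORTED_SUFFIXES[suffix])
    try:
        return reader(file_path)
    except Exception as exc:
        # Assumption: any parsing failure is surfaced as ValueError.
        raise ValueError(f"Could not parse {file_path}: {exc}") from exc
```

Checking for existence before checking the extension means a missing file always reports FileNotFoundError, even when its extension is also unsupported.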
profile()
Generate a complete profile of the dataset.
Returns:
DataProfile - Pydantic model containing:
Total number of rows in the dataset
Total number of columns
List of all column names
Mapping of column names to their data types
List of numeric column names
List of categorical column names (object, category, bool types)
Name of the target column
Type of target: 'numeric' or 'categorical'
Count of missing values per column
Percentage of missing values per column
Statistical summary for numeric columns:
- mean: Mean value
- std: Standard deviation
- min: Minimum value
- 25%, 50%, 75%: Quartiles
- max: Maximum value
- skew: Skewness coefficient
Statistics for categorical columns:
- n_unique: Number of unique values
- top_values: Top 10 most frequent values
- cardinality_ratio: Ratio of unique values to total values
Target-specific statistics:
For classification:
- n_classes: Number of unique classes
- class_distribution: Count per class
- class_balance: Ratio of the smallest class count to the largest
For regression:
- Same statistics as numeric columns (mean, std, quartiles, etc.)
Raises:
- ValueError - If the target column is not found or the dataset has insufficient rows
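
The derived ratios in the profile (missing-value percentage, cardinality_ratio, class_balance) are easiest to see on a tiny example. The sketch below uses plain Python rather than pandas so that the arithmetic is explicit. The helper names are hypothetical, missing values are represented as None, and the denominator of cardinality_ratio is assumed to be the number of non-missing values; the real profiler may define these details differently.

```python
from collections import Counter


def missing_summary(rows, columns):
    """Count and percentage of missing (None) values per column."""
    n_rows = len(rows)
    counts = {col: sum(1 for row in rows if row.get(col) is None) for col in columns}
    percentages = {col: 100.0 * counts[col] / n_rows for col in columns}
    return counts, percentages


def categorical_stats(values, top_n=10):
    """n_unique, top_values and cardinality_ratio for one categorical column."""
    present = [v for v in values if v is not None]
    freq = Counter(present)
    return {
        "n_unique": len(freq),
        "top_values": dict(freq.most_common(top_n)),
        # Assumption: the denominator is the number of non-missing values.
        "cardinality_ratio": len(freq) / len(present),
    }


def classification_target_stats(labels):
    """n_classes, class_distribution and class_balance for a classification target."""
    distribution = Counter(labels)
    return {
        "n_classes": len(distribution),
        "class_distribution": dict(distribution),
        # 1.0 means perfectly balanced; values near 0 mean strong imbalance.
        "class_balance": min(distribution.values()) / max(distribution.values()),
    }


# Four sample rows: one missing colour, three "yes" labels and one "no".
rows = [
    {"color": "red", "label": "yes"},
    {"color": "blue", "label": "no"},
    {"color": None, "label": "yes"},
    {"color": "red", "label": "yes"},
]
counts, percentages = missing_summary(rows, ["color", "label"])
color_stats = categorical_stats([row["color"] for row in rows])
target_stats = classification_target_stats([row["label"] for row in rows])
```

For these sample rows, `color` has one missing value out of four (25%), two unique values among three present ones (cardinality ratio 2/3), and the target has a class balance of 1/3 (one `no` against three `yes`). Empty inputs are not handled in this sketch.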
get_profile_summary()
Get a human-readable text summary of the profile.
Returns:
str - Formatted summary including:
- Dataset name and shape
- Target information
- Feature counts
- Missing value summary
- Target variable statistics
to_dict()
Convert the profile to a dictionary for JSON serialization.
Returns:
dict - Dictionary representation of the DataProfile
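
To illustrate why a plain dictionary is convenient here, the sketch below round-trips a profile-like object through JSON. `ProfileStandIn` is a hypothetical dataclass standing in for DataProfile, and its field names are invented for the example; `asdict` plays the role that to_dict() plays on the real class.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProfileStandIn:
    """Hypothetical stand-in for DataProfile with a few illustrative fields."""
    n_rows: int
    n_cols: int
    columns: list = field(default_factory=list)
    missing_pct: dict = field(default_factory=dict)


profile = ProfileStandIn(
    n_rows=4,
    n_cols=2,
    columns=["color", "label"],
    missing_pct={"color": 25.0, "label": 0.0},
)

profile_dict = asdict(profile)                 # plays the role of to_dict()
payload = json.dumps(profile_dict, indent=2)   # only built-in types, so no custom encoder
restored = json.loads(payload)                 # reads back as an equal dictionary
```

Because the dictionary contains only strings, numbers, lists and nested dictionaries, it can be written to a file or sent over an API without a custom encoder.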
Complete Example
Data Validation
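The profiler automatically validates its input. The errors documented above cover a missing dataset file, an unsupported or unparseable file format, a missing target column, and a dataset with insufficient rows.
Source Location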
~/workspace/source/src/execution/data_profiler.py