Phase 1 is the foundation of the entire HRIA pipeline. Before any transformation or modeling can take place, this phase loads all 11 raw LinkedIn CSV files, inspects their structure, and documents every dimension that will inform downstream cleaning and analysis decisions. By establishing baseline counts, cardinality, and missing-value rates here, the team ensures that every subsequent step is grounded in empirical evidence rather than assumptions about data quality.
Notebook
Fase1_Exploracion_Inicial.ipynb
Libraries
| Library | Version | Purpose |
|---|---|---|
| pandas | 2.2.2 | DataFrame loading, profiling, and aggregation |
| NumPy | 2.0.2 | Numerical operations and null checks |
| warnings | stdlib | Suppress non-critical deprecation warnings |
Dataset Files
All 11 CSV files are sourced from the archive/ directory. Together they form a relational schema with one central fact table (postings.csv) and ten dimension/bridge tables.
| File | Rows | Columns | Description |
|---|---|---|---|
| postings.csv | 123,849 | 31 | Core job postings (the fact table) |
| companies.csv | 24,428 | 14 | Company profiles |
| company_industries.csv | 36,853 | 2 | Company ↔ industry mapping |
| company_specialities.csv | 34,904 | 2 | Company ↔ speciality mapping |
| job_industries.csv | 148,668 | 2 | Job posting ↔ industry mapping |
| job_skills.csv | 993,400 | 2 | Job posting ↔ skill mapping |
| benefits.csv | 254,126 | 4 | Benefits offered per posting |
| salaries.csv | 9,008 | 8 | Structured salary ranges |
| skills.csv | 35,916 | 2 | Skill ID → skill name lookup |
| industries.csv | 149 | 2 | Industry ID → industry name lookup |
| employee_counts.csv | 54,262 | 5 | Company headcount snapshots |
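As a rough illustration of how the files listed above could be loaded and their shapes compared against the table, here is a minimal sketch. The helper names `load_tables` and `shape_report` are hypothetical, and the sketch assumes all files sit directly in one directory; the notebook may organize this differently.

```python
from pathlib import Path

import pandas as pd

# File names taken from the table above; the directory is supplied by the caller.
CSV_FILES = [
    "postings.csv", "companies.csv", "company_industries.csv",
    "company_specialities.csv", "job_industries.csv", "job_skills.csv",
    "benefits.csv", "salaries.csv", "skills.csv", "industries.csv",
    "employee_counts.csv",
]


def load_tables(data_dir):
    """Read every expected CSV found in data_dir into a dict keyed by file stem."""
    tables = {}
    for name in CSV_FILES:
        path = Path(data_dir) / name
        if path.exists():  # skip absent files rather than failing the whole load
            tables[path.stem] = pd.read_csv(path)
    return tables


def shape_report(tables):
    """Return one row per loaded table with its row and column counts."""
    rows = [
        {"table": name, "rows": df.shape[0], "columns": df.shape[1]}
        for name, df in tables.items()
    ]
    return pd.DataFrame(rows, columns=["table", "rows", "columns"])
```

Calling `shape_report(load_tables("archive"))` would then give a compact table to check against the row and column counts documented above.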
Key Profiling Performed
For every file, Phase 1 applies the same systematic profiling sequence:
- `.shape`: confirm row and column counts before any joins
- `.dtypes`: identify numeric vs. object fields and flag mistyped columns (e.g., salary stored as string)
- `.head()`: visual spot-check of raw values and formatting inconsistencies
- `.isnull().sum()` / `.isnull().mean()`: absolute and percentage missing-value rates per column
- `.nunique()`: cardinality analysis to distinguish identifier columns from categorical ones
- `.value_counts()`: frequency distributions for categorical fields such as work_type, formatted_experience_level, and formatted_work_type
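The profiling sequence above can be bundled into one helper so each file is inspected the same way. This is only a sketch: the function name `profile` and the toy DataFrame are assumptions for illustration, not the notebook's actual code.

```python
import pandas as pd


def profile(df, categorical=()):
    """Collect the profiling outputs described above for one DataFrame."""
    per_column = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_count": df.isnull().sum(),
        "null_pct": (df.isnull().mean() * 100).round(2),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })
    frequencies = {
        col: df[col].value_counts(dropna=False)
        for col in categorical
        if col in df.columns
    }
    return {
        "shape": df.shape,
        "head": df.head(),
        "columns": per_column,
        "frequencies": frequencies,
    }


# Toy stand-in for a raw file (hypothetical values).
sample = pd.DataFrame({
    "job_id": [1, 2, 3, 4],
    "formatted_work_type": ["Full-time", "Full-time", "Contract", "Full-time"],
    "min_salary": ["50000", None, None, "72000"],  # salary stored as string on purpose
})
report = profile(sample, categorical=["formatted_work_type"])
print(report["shape"])
print(report["columns"])
```

Keeping the per-column metrics in a single DataFrame makes it easy to sort by `null_pct` or `n_unique` when deciding which columns need attention in later phases.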
Critical Findings
Volume and Breadth
- 123,849 total job postings spanning 24,428 unique companies
- 72,521 unique job titles; the most frequent is Sales Manager with 673 occurrences, confirming an extreme long-tail distribution in job title naming
Work Type and Experience
- Roughly 80% of postings are Full-time (98,814 of 123,849 rows, or 79.8%), meaning Part-time and Contract segments are structurally underrepresented in any aggregate analysis
- Mid-Senior level is the dominant experience band with 41,489 occurrences, followed by Associate (23,904) and Entry Level (14,157)
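Shares like the Full-time figure above follow directly from category counts. The sketch below recomputes the quoted share from the two counts in the text and shows one way to derive counts and percentages from a categorical column. The helper `category_shares` and the toy series are hypothetical.

```python
import pandas as pd

# Counts quoted in the text above.
total_postings = 123_849
full_time = 98_814
full_time_share = round(full_time / total_postings * 100, 1)
print(f"Full-time share: {full_time_share}%")


def category_shares(series):
    """Return the count and percentage share of each category, NaN included."""
    counts = series.value_counts(dropna=False)
    shares = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "share_pct": shares})


# Toy column with the same kind of imbalance (not real data).
work_type = pd.Series(["Full-time"] * 8 + ["Contract", "Part-time"])
shares = category_shares(work_type)
print(shares)
```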
Salary Landscape
- Median normalized salary ≈ 535M+, confirming the need for IQR-based outlier removal in Phase 2
- Salary data is sparse: only ~7 % of postings carry any salary information, and the pattern of missingness is non-random
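To make the IQR-based outlier rule mentioned above concrete, here is a minimal sketch using Tukey-style fences. The multiplier `k=1.5` is the conventional default and an assumption here; the document does not state which threshold Phase 2 uses. The salary values are invented for illustration.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy annual salaries with one implausibly large value (hypothetical data).
salaries = pd.Series(
    [48_000, 55_000, 62_000, 70_000, 85_000, 535_000_000], dtype="float64"
)
low, high = iqr_bounds(salaries)
kept = salaries[salaries.between(low, high)]  # keep values inside the fences
print(low, high)
print(kept.tolist())
```

Because the fences are built from quartiles rather than the mean, a single extreme value barely moves them, which is why this rule suits salary data with a few very large entries.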
Missing-Value Pattern
The salary columns (min_salary, max_salary, med_salary) exhibit the highest null rates in the dataset. Crucially, the companies with missing salary data are not a random sample — large, well-known employers tend to disclose salaries more readily, while smaller or less competitive employers appear to omit this information strategically.
Phase 1 is where the Missing Not At Random (MNAR) hypothesis is first observed: companies
with less competitive compensation appear to systematically omit salary data. This pattern is
documented here as an empirical observation and is formally tested and quantified in
Phase 3 — Statistical Analysis.
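A simple descriptive check of this pattern is to compare the salary null rate across groups of companies. The sketch below is illustrative only: the column names `company_id` and `employee_count`, the 1,000-employee threshold, and the toy data are all assumptions, and a gap between groups is an observation rather than a test of MNAR.

```python
import pandas as pd

# Toy stand-ins for postings and company headcounts (hypothetical values).
postings = pd.DataFrame({
    "company_id": [1, 1, 2, 2, 3, 3, 3, 4],
    "med_salary": [90000, None, None, None, 75000, 80000, None, None],
})
companies = pd.DataFrame({
    "company_id": [1, 2, 3, 4],
    "employee_count": [12000, 40, 8000, 15],
})

merged = postings.merge(companies, on="company_id", how="left")

# Assumed cut-off: 1,000 or more employees counts as "large".
merged["size_band"] = merged["employee_count"].map(
    lambda n: "large" if n >= 1000 else "small"
)
merged["salary_missing"] = merged["med_salary"].isnull()

# The mean of a boolean column is the share of True values, i.e. the null rate.
missing_rate = merged.groupby("size_band")["salary_missing"].mean()
print(missing_rate)
```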
Initial Load — Sample Code
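A minimal sketch of an initial load followed by a null-rate ranking, assuming pandas and a small in-memory CSV in place of the real postings.csv. In the notebook the argument to `pd.read_csv` would presumably be a path under archive/. The rows are invented for illustration.

```python
import io
import warnings

import pandas as pd

# Suppress non-critical deprecation warnings, as noted in the Libraries table.
warnings.filterwarnings("ignore", category=DeprecationWarning)

# In-memory stand-in for postings.csv (hypothetical rows).
raw_csv = io.StringIO(
    "job_id,title,min_salary,med_salary,max_salary,formatted_work_type\n"
    "1,Sales Manager,,,,Full-time\n"
    "2,Data Analyst,60000,,80000,Full-time\n"
    "3,Sales Manager,,,,Contract\n"
    "4,Nurse,,35,,Part-time\n"
)
postings = pd.read_csv(raw_csv)
print(postings.shape)

# Fraction of missing values per column, highest first.
null_ranking = postings.isnull().mean().sort_values(ascending=False)
print(null_ranking)
```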
The salary columns (med_salary, min_salary, max_salary) consistently appear at the top of the per-column null-rate ranking, reinforcing the MNAR hypothesis even before any statistical test is applied.
Next Step
With the raw schema fully understood and baseline statistics documented, Phase 2 merges all 11 files into a single master DataFrame, normalizes salary figures to annual USD, and removes outliers in preparation for statistical analysis.
Phase 2: Data Cleaning and Master Dataset Preparation
Merge all 11 CSVs, normalize salaries, filter to data roles, and produce three clean output files.