DataTransformer handles the Transform phase of the ETL pipeline entirely in memory using Pandas. It receives a DataFrame produced by DataExtractor, applies one or more column-level transformations, and writes each result into a new column, leaving the values in the source columns unchanged (date_transform() additionally converts its source column to a datetime dtype). Because every method both modifies the DataFrame in place and returns it, you can chain calls or pass the DataFrame directly into subsequent pipeline steps without intermediate copies.
Class: DataTransformer
Constructor
DataTransformer.__init__() accepts no parameters. The class holds no state — all transformation logic lives entirely in the methods below, making each instance safe to reuse across multiple DataFrame operations.
Methods
capitalize_transform()
Applies a text case operation to a string column and writes the result to a new column. The original column is preserved. Use 'upper' to normalise names or codes for warehouse key matching, or 'lower' for case-insensitive searches.
The source DataFrame to transform. Modified in-place; the same object is also returned.
Name of the existing string column to read from. Must be a string (object dtype) Pandas column so that the .str accessor methods work correctly.
Name of the column to write the transformed values into. If a column with this name already exists, it is overwritten; otherwise it is appended.
Transformation to apply. Accepted values:
| Value | Pandas call | Result |
|---|---|---|
| 'upper' | .str.upper() | All characters uppercased |
| 'lower' | .str.lower() | All characters lowercased |
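As a rough illustration of the table above, the following sketch applies the two listed Pandas calls directly to a small DataFrame. It uses plain Pandas rather than DataTransformer itself, and the column names (name, name_upper, name_lower) are made up for the example.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({"name": ["Ana Lopez", "luis PEREZ"]})

# Equivalent of the 'upper' option: the result goes to a new column.
df["name_upper"] = df["name"].str.upper()

# Equivalent of the 'lower' option.
df["name_lower"] = df["name"].str.lower()

print(df)
```

The source column name keeps its original mixed-case values; only the two new columns hold the transformed text.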
The modified DataFrame (the same object that was passed in). The new_column now contains the transformed values.
concat_transform()
Concatenates two string columns with a single space separator and writes the result to a new column. Useful for building full-name or address fields that do not exist as a single column in the OLTP source.
The operation applied is:
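A minimal Pandas sketch of that operation, assuming two hypothetical string columns first_name and last_name (these names, and full_name, are illustrative and not taken from the project):

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions for this sketch.
df = pd.DataFrame({
    "first_name": ["Ana", "Luis"],
    "last_name": ["Lopez", "Perez"],
})

# Join the left and right columns element-wise with a single space.
df["full_name"] = df["first_name"] + " " + df["last_name"]

print(df["full_name"].tolist())
```

Both input columns are left as they were; only full_name is added.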
The source DataFrame to transform. Modified in-place; the same object is also returned.
Name of the column to write the concatenated result into. Created if it does not exist; overwritten if it does.
Name of the first (left) column. Its values appear before the space separator.
Name of the second (right) column. Its values appear after the space separator.
The modified DataFrame with the new concatenated column appended.
The separator is always a single space character. If you need a different separator (e.g., a comma or hyphen), apply a direct Pandas expression on the returned DataFrame before passing it to DataLoader.
date_transform()
Extracts a single date component (year, month, or day) from a date or timestamp column and writes the integer result to a new column. The source column is converted to datetime64 in-place via pd.to_datetime() as a first step, which means its dtype changes from object or str to datetime64[ns] after the call.
The source DataFrame to transform. Modified in-place; the same object is also returned.
Name of the date/timestamp column to read from. The column is first coerced to datetime64 using pd.to_datetime(df[column]), so string dates in common ISO or locale formats are accepted.
Name of the column to write the extracted integer component into. Created if it does not exist; overwritten if it does.
Date component to extract. Accepted values:
| Value | Pandas accessor | Result dtype |
|---|---|---|
| 'year' | .dt.year | int64 (int32 on pandas 2.0 and later): four-digit year (e.g., 2024) |
| 'month' | .dt.month | int64 (int32 on pandas 2.0 and later): month number 1–12 |
| 'day' | .dt.day | int64 (int32 on pandas 2.0 and later): day of month 1–31 |
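The two steps described for this method (coerce the source column, then read one accessor from the table above) can be sketched in plain Pandas as follows. This is an illustration under assumed column names (order_date, order_year, and so on), not the method's actual source.

```python
import pandas as pd

# Hypothetical sample data with ISO-formatted date strings.
df = pd.DataFrame({"order_date": ["2024-03-15", "2023-12-01"]})

# Step 1: coerce the source column to datetime; its dtype changes here.
df["order_date"] = pd.to_datetime(df["order_date"])

# Step 2: extract one component per new column via the .dt accessor.
df["order_year"] = df["order_date"].dt.year
df["order_month"] = df["order_date"].dt.month
df["order_day"] = df["order_date"].dt.day

print(df.dtypes)
```

If the source column contains missing dates (NaT), Pandas returns the extracted component as a float column with NaN in those rows rather than as an integer column.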
The modified DataFrame. The source column is now datetime64[ns] and new_column contains the extracted integer component.
Chaining Transformations
All three methods accept and return the same DataFrame object, so multiple transformations can be composed in sequence. The ETLPipeline class uses exactly this pattern, iterating over a column_mappings list and calling the appropriate transformer method for each entry.
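The sketch below shows one way such a loop could look. The three functions are hypothetical stand-ins that mimic the behaviour documented on this page; their names, their signatures, and the shape of the column_mappings entries are all assumptions made for the example, not the project's actual API.

```python
import pandas as pd

# Hypothetical stand-ins for the three documented methods. Names and
# signatures are assumptions for this sketch only.
def capitalize(df, column, new_column, mode):
    if mode not in ("upper", "lower"):
        raise ValueError(f"unsupported mode: {mode}")
    df[new_column] = getattr(df[column].str, mode)()
    return df

def concat(df, new_column, left, right):
    df[new_column] = df[left] + " " + df[right]
    return df

def date_part(df, column, new_column, part):
    df[column] = pd.to_datetime(df[column])        # source dtype changes
    df[new_column] = getattr(df[column].dt, part)  # 'year', 'month' or 'day'
    return df

# Hypothetical mapping list: one (function, keyword arguments) pair per step.
column_mappings = [
    (concat, {"new_column": "full_name", "left": "first_name", "right": "last_name"}),
    (capitalize, {"column": "full_name", "new_column": "full_name_upper", "mode": "upper"}),
    (date_part, {"column": "signup_date", "new_column": "signup_year", "part": "year"}),
]

source = pd.DataFrame({
    "first_name": ["Ana"],
    "last_name": ["Lopez"],
    "signup_date": ["2024-03-15"],
})

df = source
for step, kwargs in column_mappings:
    df = step(df, **kwargs)  # each step returns the object it received

print(df is source)
```

Because each step mutates and returns the same object, the order of the entries matters: the capitalize step here reads full_name, which only exists after the concat step has run.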