DataExtractor is the entry point of every ETL run. It opens a connection to the OLTP source database through db_client.get_oltp_connection(), executes a SQL query against SQL Server, and returns the result as a Pandas DataFrame ready for downstream transformation. All SQL generation is handled internally — callers only need to supply a table name, an optional column list, and an optional row limit.
Class overview
DataExtractor is constructed with a single db_client argument (a DatabaseClient instance). It exposes three public methods: one for simple table selects, one for arbitrary SQL queries, and one for schema introspection.
Public methods
extract_by_table
Builds a SELECT statement from its arguments and runs it against the OLTP database. The generated SQL depends on whether columns and limit are supplied:
| Scenario | Generated SQL |
|---|---|
| No columns, no limit | SELECT * FROM {table_name} |
| Columns provided, no limit | SELECT col1, col2 FROM {table_name} |
| No columns, limit provided | SELECT TOP {limit} * FROM {table_name} |
| Columns and limit provided | SELECT TOP {limit} col1, col2 FROM {table_name} |
- When columns is None, the selector defaults to *.
- When limit is set, SQL Server’s TOP clause is used. This is a server-side limit, not a Python-side slice, so only the specified number of rows is transferred over the wire.
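The four scenarios in the table can be reproduced with a small string-building function. This is a minimal sketch, not the library's actual code: build_select is a hypothetical name, and treating an empty column list the same as None is an assumption made here for convenience.

```python
from typing import Optional, Sequence


def build_select(table_name: str,
                 columns: Optional[Sequence[str]] = None,
                 limit: Optional[int] = None) -> str:
    """Return SELECT text matching the four scenarios in the table (hypothetical helper)."""
    # None (or, by assumption, an empty list) means "select every column".
    selector = ", ".join(columns) if columns else "*"
    # TOP is SQL Server's row-limiting clause; it sits directly after SELECT.
    top = f"TOP {int(limit)} " if limit is not None else ""
    return f"SELECT {top}{selector} FROM {table_name}"


print(build_select("dbo.Customers"))
print(build_select("dbo.Customers", ["CustomerID", "Name"], limit=10))
```

Casting limit with int() keeps a non-numeric value from being spliced into the statement. The table name and column names are still interpolated as-is, so in this sketch they would need to come from a trusted source such as the list returned by extract_tables().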
table_name should be schema-qualified (e.g., dbo.Customers, Sales.SalesOrderHeader) to avoid ambiguity when the OLTP database contains multiple schemas with identically named tables.
extract_by_query
Runs an arbitrary SQL query against the OLTP database. When limit is provided, the original query is wrapped as a subquery and TOP is applied to the outer select.
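The wrapping step can be sketched as follows. The function name wrap_with_limit and the alias limited_query are illustrative assumptions; the exact text generated by DataExtractor may differ. SQL Server does require an alias on a derived table, which is why the sketch adds one.

```python
from typing import Optional


def wrap_with_limit(query: str, limit: Optional[int] = None) -> str:
    """Wrap a query as a subquery and apply TOP to the outer select (hypothetical helper)."""
    if limit is None:
        # No limit requested: run the caller's SQL unchanged.
        return query
    # A trailing semicolon would be a syntax error inside parentheses.
    inner = query.strip().rstrip(";")
    return f"SELECT TOP {int(limit)} * FROM ({inner}) AS limited_query"


print(wrap_with_limit("SELECT CustomerID FROM dbo.Customers;", limit=100))
```

One consequence of this approach: SQL Server rejects an ORDER BY inside a derived table unless TOP, OFFSET, or FOR XML is also specified, so a query that ends in ORDER BY may need adjusting before a limit is applied this way.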
extract_tables
Uses SQLAlchemy’s inspect() and returns a list of all user-facing tables in schema.table format. It iterates over every schema name exposed by the engine and skips the following system schemas:
SQL Server built-in schemas
- sys
- INFORMATION_SCHEMA
- guest
Fixed database roles
- db_owner
- db_accessadmin
- db_securityadmin
- db_ddladmin
- db_backupoperator
- db_datareader
- db_datawriter
- db_denydatareader
- db_denydatawriter
The filter calls schema.lower() on each schema name before testing membership in system_schemas. All-lowercase entries in the set (sys, guest, db_owner, and the other db_* role names) are therefore reliably excluded. The set also contains "INFORMATION_SCHEMA" in its original uppercase form; because the comparison lowercases the incoming name to "information_schema" before testing, this entry never matches. In practice, SQL Server’s SQLAlchemy dialect returns INFORMATION_SCHEMA in uppercase, and its lowercased form information_schema is not present in the set, so the schema is not filtered out. User schemas with entirely lowercase names are unaffected.
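The case-sensitivity behavior is easier to see in a runnable sketch. Everything here is an assumption made for illustration: list_user_tables is a hypothetical function, a plain dict stands in for the schema and table names that the SQLAlchemy inspector would return, and the catalog contents are made up.

```python
# Same entries as the lists above, with INFORMATION_SCHEMA kept in uppercase.
SYSTEM_SCHEMAS = {
    "sys", "INFORMATION_SCHEMA", "guest",
    "db_owner", "db_accessadmin", "db_securityadmin", "db_ddladmin",
    "db_backupoperator", "db_datareader", "db_datawriter",
    "db_denydatareader", "db_denydatawriter",
}


def list_user_tables(tables_by_schema, system_schemas=SYSTEM_SCHEMAS):
    """Return schema.table strings, skipping schemas whose lowercased name is in the set."""
    result = []
    for schema, tables in tables_by_schema.items():
        # The incoming name is lowercased, but the set entries are compared as written.
        if schema.lower() in system_schemas:
            continue
        result.extend(f"{schema}.{table}" for table in tables)
    return result


# Made-up catalog standing in for the inspector's output.
catalog = {
    "dbo": ["Customers"],
    "sys": ["objects"],
    "INFORMATION_SCHEMA": ["TABLES"],
    "Sales": ["SalesTerritory"],
}
print(list_user_tables(catalog))
# A set normalized to lowercase would also exclude INFORMATION_SCHEMA.
print(list_user_tables(catalog, {name.lower() for name in SYSTEM_SCHEMAS}))
```

In this sketch the first call keeps the INFORMATION_SCHEMA entry while the second drops it, which suggests that lowercasing the set entries would be one possible way to make the filter consistent.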
All entries in the returned list use the schema.table format, for example dbo.Customers or Sales.SalesTerritory. Pass these strings directly to extract_by_table() as the table_name argument.
The Streamlit UI calls extract_tables() at startup to populate the Source Table dropdown. Any user-visible table in the OLTP database will appear there automatically; system schemas are always hidden.
How DataExtractor connects to the database
DataExtractor never manages connections directly. It delegates to db_client.get_oltp_connection(), which returns a SQLAlchemy connection context manager backed by oltp_engine. All connection pooling, pre-ping health checks, and teardown are handled by DatabaseClient.
The text() wrapper from SQLAlchemy ensures the query string is treated as literal SQL and passed to the driver without additional interpretation.
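The delegation pattern described in this section can be sketched with the standard library alone. The assumptions are significant: an in-memory sqlite3 database stands in for SQL Server and oltp_engine, FakeDatabaseClient and MiniExtractor are hypothetical classes, rows come back as tuples rather than a DataFrame, and no SQLAlchemy text() wrapper or connection pool is involved.

```python
import sqlite3
from contextlib import contextmanager


class FakeDatabaseClient:
    """Hypothetical stand-in for DatabaseClient; it owns the connection lifecycle."""

    def __init__(self):
        self.closed_count = 0

    @contextmanager
    def get_oltp_connection(self):
        # Open a throwaway in-memory database with two sample rows.
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
        conn.executemany("INSERT INTO customers VALUES (?, ?)",
                         [(1, "Ada"), (2, "Grace")])
        try:
            yield conn
        finally:
            # Teardown happens here, never in the extractor.
            conn.close()
            self.closed_count += 1


class MiniExtractor:
    """Hypothetical extractor that only borrows a connection from the client."""

    def __init__(self, db_client):
        self.db_client = db_client

    def extract_by_query(self, query):
        with self.db_client.get_oltp_connection() as conn:
            return conn.execute(query).fetchall()


client = FakeDatabaseClient()
rows = MiniExtractor(client).extract_by_query(
    "SELECT id, name FROM customers ORDER BY id")
print(rows)
```

Because the extractor only enters and exits the context manager, the client can change how connections are created, checked, or released without any change to the extraction code.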