Profile File Paths

The Delta Sharing Python Connector supports multiple ways to specify profile file paths, enabling flexible deployment scenarios.

fsspec Support

Profile paths for SharingClient and load_as_pandas support any URL format recognized by fsspec (Filesystem Spec).
fsspec is a Python library that provides a unified interface to different filesystems. It allows you to use the same code to work with local files, cloud storage (S3, Azure, GCS), HTTP endpoints, and more.
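The unified interface means the same open call works whether the path is local or remote. A minimal sketch using a local file (the /tmp path is only for illustration; the cloud schemes shown below require their respective optional dependencies):

```python
import fsspec

# fsspec resolves the scheme (file://, s3://, abfs://, gs://, ...) to a
# filesystem implementation; plain local paths need no extra dependencies.
with fsspec.open("/tmp/example.share", "w") as f:
    f.write('{"shareCredentialsVersion": 1}')

# The same call with an explicit file:// URL reads the file back.
with fsspec.open("file:///tmp/example.share", "r") as f:
    content = f.read()
```

Swapping the URL for s3://, abfs://, or gs:// is all that changes when the file moves to cloud storage.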

Local File System

import delta_sharing

# Absolute path
client = delta_sharing.SharingClient("/home/user/profiles/production.share")

# Relative path
client = delta_sharing.SharingClient("./profiles/production.share")

Amazon S3

# S3 path (requires s3fs: pip install delta-sharing[s3])
profile_path = "s3://my-bucket/profiles/production.share"
client = delta_sharing.SharingClient(profile_path)

# Load table with S3 profile
table_url = f"{profile_path}#my_share.my_schema.my_table"
df = delta_sharing.load_as_pandas(table_url)
Install S3 support with: pip install delta-sharing[s3]

Azure Blob Storage

# Azure Blob Storage (requires adlfs: pip install delta-sharing[abfs])
profile_path = "abfs://container@account.dfs.core.windows.net/profiles/production.share"
client = delta_sharing.SharingClient(profile_path)
Install Azure Blob Storage support with: pip install delta-sharing[abfs]

Google Cloud Storage

# GCS path (requires gcsfs: pip install delta-sharing[gcs])
profile_path = "gs://my-bucket/profiles/production.share"
client = delta_sharing.SharingClient(profile_path)
Install GCS support with: pip install delta-sharing[gcs]

DBFS Support

When using Databricks File System (DBFS), you can access profile files using the /dbfs/ prefix to treat them as local files:
import delta_sharing

# DBFS path accessed as local file
profile_path = "/dbfs/mnt/profiles/production.share"
client = delta_sharing.SharingClient(profile_path)

# Construct table URL
table_url = f"{profile_path}#my_share.my_schema.my_table"
df = delta_sharing.load_as_pandas(table_url)
The /dbfs/ prefix works for SharingClient and load_as_pandas, but load_as_spark requires Hadoop FileSystem-compatible paths (e.g., dbfs:/mnt/profiles/production.share).

Spark Profile Paths

For load_as_spark and load_table_changes_as_spark, profile paths must be compatible with Hadoop FileSystem:
import delta_sharing

# Hadoop-compatible paths for Spark
table_url = "s3a://my-bucket/profile.share#my_share.my_schema.my_table"
df = delta_sharing.load_as_spark(table_url)

# DBFS path for Spark
table_url = "dbfs:/mnt/profiles/production.share#my_share.my_schema.my_table"
df = delta_sharing.load_as_spark(table_url)
Use s3a:// (not s3://) for S3 paths with Spark, and dbfs:/ (not /dbfs/) for DBFS paths.

JSON Predicate Hints

Predicate hints allow you to push down filters to the Delta Sharing server, reducing the amount of data transferred and improving query performance.

Predicate Format

Predicates are specified as JSON strings following the Delta Sharing Protocol specification.
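Rather than hand-writing JSON strings, you can build the predicate tree as a Python dict and serialize it with json.dumps, which avoids quoting and escaping mistakes. The column_equals helper below is a hypothetical convenience for this guide, not part of the connector API:

```python
import json

def column_equals(name: str, value: str, value_type: str = "string") -> str:
    """Build a JSON predicate hint for column = literal.

    Hypothetical helper: emits the protocol's {"op": "equal", ...}
    tree as a JSON string suitable for jsonPredicateHints.
    """
    return json.dumps({
        "op": "equal",
        "children": [
            {"op": "column", "name": name, "valueType": value_type},
            {"op": "literal", "value": value, "valueType": value_type},
        ],
    })

# Equivalent to the hand-written '{"op": "equal", ...}' examples below
predicate = column_equals("country", "USA")
```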

Basic Predicates

Equality

import delta_sharing

# Filter where country = 'USA'
predicate = '''{
  "op": "equal",
  "children": [
    {"op": "column", "name": "country", "valueType": "string"},
    {"op": "literal", "value": "USA", "valueType": "string"}
  ]
}'''

df = delta_sharing.load_as_pandas(
    table_url,
    jsonPredicateHints=predicate
)

Less Than / Greater Than

# Filter where age > 21
predicate = '''{
  "op": "greaterThan",
  "children": [
    {"op": "column", "name": "age", "valueType": "int"},
    {"op": "literal", "value": "21", "valueType": "int"}
  ]
}'''

df = delta_sharing.load_as_pandas(
    table_url,
    jsonPredicateHints=predicate
)

IS NULL

# Filter where email is null
predicate = '''{
  "op": "isNull",
  "children": [
    {"op": "column", "name": "email", "valueType": "string"}
  ]
}'''

df = delta_sharing.load_as_pandas(
    table_url,
    jsonPredicateHints=predicate
)

Composite Predicates

AND

# Filter where country = 'USA' AND age > 21
predicate = '''{
  "op": "and",
  "children": [
    {
      "op": "equal",
      "children": [
        {"op": "column", "name": "country", "valueType": "string"},
        {"op": "literal", "value": "USA", "valueType": "string"}
      ]
    },
    {
      "op": "greaterThan",
      "children": [
        {"op": "column", "name": "age", "valueType": "int"},
        {"op": "literal", "value": "21", "valueType": "int"}
      ]
    }
  ]
}'''

df = delta_sharing.load_as_pandas(
    table_url,
    jsonPredicateHints=predicate
)

OR

# Filter where country = 'USA' OR country = 'Canada'
predicate = '''{
  "op": "or",
  "children": [
    {
      "op": "equal",
      "children": [
        {"op": "column", "name": "country", "valueType": "string"},
        {"op": "literal", "value": "USA", "valueType": "string"}
      ]
    },
    {
      "op": "equal",
      "children": [
        {"op": "column", "name": "country", "valueType": "string"},
        {"op": "literal", "value": "Canada", "valueType": "string"}
      ]
    }
  ]
}'''

df = delta_sharing.load_as_pandas(
    table_url,
    jsonPredicateHints=predicate
)

Date and Timestamp Predicates

# Filter where hire_date = '2021-04-29'
predicate = '''{
  "op": "equal",
  "children": [
    {"op": "column", "name": "hire_date", "valueType": "date"},
    {"op": "literal", "value": "2021-04-29", "valueType": "date"}
  ]
}'''

df = delta_sharing.load_as_pandas(
    table_url,
    jsonPredicateHints=predicate
)

Supported Operators

  • equal: Equality comparison
  • lessThan: Less than
  • lessThanOrEqual: Less than or equal
  • greaterThan: Greater than
  • greaterThanOrEqual: Greater than or equal
  • and: Logical AND
  • or: Logical OR
  • not: Logical NOT
  • isNull: Check if value is NULL
  • isNotNull: Check if value is NOT NULL
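Of these operators, not is the only one without an example above; it wraps exactly one child predicate. A sketch negating an equality filter:

```python
import json

# NOT (country = 'USA'): "not" takes a single child predicate
predicate = json.dumps({
    "op": "not",
    "children": [
        {
            "op": "equal",
            "children": [
                {"op": "column", "name": "country", "valueType": "string"},
                {"op": "literal", "value": "USA", "valueType": "string"},
            ],
        }
    ],
})
```

The resulting string is passed as jsonPredicateHints to load_as_pandas, exactly as in the earlier examples.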

Supported Value Types

  • bool: Boolean values
  • int: Integer values
  • long: Long integer values
  • float: Floating-point values
  • double: Double-precision floating-point values
  • string: String values
  • date: Date values (format: YYYY-MM-DD)
  • timestamp: Timestamp values (format: YYYY-MM-DDThh:mm:ss[.SSS]Z)
Predicate hints are optimization hints only. The server may return more data than specified by the predicate, so additional client-side filtering may be necessary for correctness.
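Because the hint is advisory, a defensive pattern is to re-apply the filter in pandas after loading. A sketch with made-up data standing in for the result of load_as_pandas:

```python
import pandas as pd

# Suppose the server returned extra rows despite a country = 'USA' hint
df = pd.DataFrame({
    "country": ["USA", "Canada", "USA"],
    "age": [34, 28, 45],
})

# Re-apply the predicate client-side to guarantee correctness
df = df[df["country"] == "USA"]
```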

Memory Optimization with convert_in_batches

For large tables, the convert_in_batches parameter can significantly reduce memory consumption by converting Parquet files to pandas DataFrames one batch at a time instead of one file at a time.

When to Use

Use convert_in_batches=True when:
  • Working with large tables that don’t fit comfortably in memory
  • Individual Parquet files are large
  • You want to reduce peak memory usage
  • Query performance is less critical than memory usage

Trade-offs

Advantages:
  • Lower peak memory usage
  • Prevents out-of-memory errors on constrained systems
  • Especially effective for Parquet-format queries
Disadvantages:
  • May take longer to complete
  • May download more data due to batch overhead
  • Slightly more CPU overhead for batch processing

Example Usage

import delta_sharing

# Load large table with batch conversion
df = delta_sharing.load_as_pandas(
    table_url,
    convert_in_batches=True
)

# Combine with other optimizations
df = delta_sharing.load_as_pandas(
    table_url,
    use_delta_format=True,
    convert_in_batches=True,
    jsonPredicateHints=my_predicate
)

With Change Data Feed

# Load large CDF query with batch conversion
changes = delta_sharing.load_table_changes_as_pandas(
    table_url,
    starting_version=0,
    ending_version=1000,
    convert_in_batches=True,
    use_delta_format=True
)

Delta Format vs Parquet Format

Delta Sharing supports two data formats for transferring table data: Delta format and Parquet format.

Format Comparison

Feature             Delta Format                   Parquet Format
Performance         Faster for most queries        Slower for large tables
Predicate Pushdown  Excellent                      Limited
Server Support      Requires Delta format support  Universal
Data Transfer       More efficient                 Less efficient
Compatibility       Newer feature                  Widely supported

Using Delta Format

import delta_sharing

# Explicitly use Delta format
df = delta_sharing.load_as_pandas(
    table_url,
    use_delta_format=True
)

# Delta format with predicates (best performance)
df = delta_sharing.load_as_pandas(
    table_url,
    use_delta_format=True,
    jsonPredicateHints=my_predicate
)

Automatic Format Selection

If use_delta_format is not specified, the connector automatically selects the best format based on table metadata:
# Connector chooses format automatically
df = delta_sharing.load_as_pandas(table_url)

Format for Change Data Feed

For Change Data Feed queries, Delta format is recommended:
# CDF with Delta format for best performance
changes = delta_sharing.load_table_changes_as_pandas(
    table_url,
    starting_version=0,
    ending_version=100,
    use_delta_format=True
)
When using Delta format for CDF, the connector properly handles metadata changes by replaying the Delta log, ensuring data consistency across versions.

Working with DeltaSharingProfile Objects

Instead of using profile file paths, you can create and use DeltaSharingProfile objects directly for more control over authentication.

Reading from File

import delta_sharing
from delta_sharing.protocol import DeltaSharingProfile

# Read profile from file
profile = DeltaSharingProfile.read_from_file("/path/to/profile.share")

# Use with SharingClient
client = delta_sharing.SharingClient(profile)

# Use with load_as_spark
df = delta_sharing.load_as_spark(
    "my_share.my_schema.my_table",
    delta_sharing_profile=profile
)

Profile Authentication Types

The connector supports multiple authentication types:

Bearer Token (Version 1)

{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "your-token-here",
  "expirationTime": "2024-12-31T23:59:59Z"
}

Bearer Token (Version 2)

{
  "shareCredentialsVersion": 2,
  "type": "bearer_token",
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "your-token-here",
  "expirationTime": "2024-12-31T23:59:59Z"
}

OAuth Client Credentials

{
  "shareCredentialsVersion": 2,
  "type": "oauth_client_credentials",
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "tokenEndpoint": "https://auth.example.com/oauth/token",
  "clientId": "your-client-id",
  "clientSecret": "your-client-secret",
  "scope": "delta-sharing"
}

Basic Authentication

{
  "shareCredentialsVersion": 2,
  "type": "basic",
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "username": "your-username",
  "password": "your-password"
}

Programmatic Profile Creation

import delta_sharing
from delta_sharing.protocol import DeltaSharingProfile

# Create profile programmatically
profile = DeltaSharingProfile(
    share_credentials_version=2,
    type="bearer_token",
    endpoint="https://sharing.example.com/delta-sharing",
    bearer_token="your-token-here",
    expiration_time="2024-12-31T23:59:59Z"
)

# Use with connector
client = delta_sharing.SharingClient(profile)

Performance Best Practices

1. Use Delta Format When Possible

# Best performance for most queries
df = delta_sharing.load_as_pandas(
    table_url,
    use_delta_format=True
)

2. Apply Predicate Hints

# Reduce data transfer with predicates
df = delta_sharing.load_as_pandas(
    table_url,
    use_delta_format=True,
    jsonPredicateHints=my_predicate
)

3. Use Limits for Exploration

# Sample data before loading full table
sample = delta_sharing.load_as_pandas(table_url, limit=1000)
print(sample.describe())

# Load full table only if needed
if looks_good(sample):
    df = delta_sharing.load_as_pandas(table_url)

4. Enable Batch Conversion for Large Tables

# Reduce memory usage for large tables
df = delta_sharing.load_as_pandas(
    table_url,
    convert_in_batches=True
)

5. Cache Profile Files Locally

# Avoid repeated S3 reads of profile files
import shutil
import fsspec
import delta_sharing

# Download profile once
with fsspec.open("s3://bucket/profile.share", "rb") as src:
    with open("/tmp/profile.share", "wb") as dst:
        shutil.copyfileobj(src, dst)

# Use local copy
profile = delta_sharing.DeltaSharingProfile.read_from_file("/tmp/profile.share")
client = delta_sharing.SharingClient(profile)

6. Reuse SharingClient Instances

# Create client once
client = delta_sharing.SharingClient(profile_file)

# Reuse for multiple operations
tables = client.list_all_tables()
for table in tables:
    # Client connection is reused
    print(table.name)

Error Handling

Common Errors and Solutions

import delta_sharing
from requests.exceptions import HTTPError

try:
    df = delta_sharing.load_as_pandas(table_url)
except HTTPError as e:
    if e.response.status_code == 401:
        print("Authentication failed. Check your bearer token.")
    elif e.response.status_code == 403:
        print("Access denied. You don't have permission for this table.")
    elif e.response.status_code == 404:
        print("Table not found. Check the table URL.")
    else:
        print(f"HTTP error: {e}")
except ValueError as e:
    print(f"Invalid URL or parameters: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Validating Profile Files

from delta_sharing.protocol import DeltaSharingProfile

try:
    profile = DeltaSharingProfile.read_from_file(profile_path)
    print(f"Profile valid. Endpoint: {profile.endpoint}")
except FileNotFoundError:
    print(f"Profile file not found: {profile_path}")
except ValueError as e:
    print(f"Invalid profile format: {e}")

Complete Advanced Example

import delta_sharing
from delta_sharing.protocol import DeltaSharingProfile
import json

# Load profile from S3
profile_path = "s3://my-bucket/profiles/production.share"
profile = DeltaSharingProfile.read_from_file(profile_path)

# Create client
client = delta_sharing.SharingClient(profile)

# Explore available tables
print("Available tables:")
tables = client.list_all_tables()
for table in tables:
    print(f"  {table.share}.{table.schema}.{table.name}")

# Select a table
table_url = f"{profile_path}#{tables[0].share}.{tables[0].schema}.{tables[0].name}"

# Get table metadata
metadata = delta_sharing.get_table_metadata(table_url)
print(f"\nTable: {metadata.name}")
print(f"Format: {metadata.format}")
print(f"Partition columns: {metadata.partition_columns}")

# Build predicate
predicate = json.dumps({
    "op": "and",
    "children": [
        {
            "op": "greaterThan",
            "children": [
                {"op": "column", "name": "timestamp", "valueType": "timestamp"},
                {"op": "literal", "value": "2024-01-01T00:00:00Z", "valueType": "timestamp"}
            ]
        },
        {
            "op": "equal",
            "children": [
                {"op": "column", "name": "status", "valueType": "string"},
                {"op": "literal", "value": "active", "valueType": "string"}
            ]
        }
    ]
})

# Load with optimizations
df = delta_sharing.load_as_pandas(
    table_url,
    use_delta_format=True,
    jsonPredicateHints=predicate,
    convert_in_batches=True
)

print(f"\nLoaded {len(df)} rows matching predicate")
print(df.head())

# Query change data feed
changes = delta_sharing.load_table_changes_as_pandas(
    table_url,
    starting_version=0,
    use_delta_format=True,
    convert_in_batches=True
)

print(f"\nTotal changes: {len(changes)}")
print(changes['_change_type'].value_counts())
