Prerequisites

Before you begin, ensure you have:
  • Installed the delta-sharing package (see Installation)
  • A Delta Sharing profile file (.share file) with credentials to access a Delta Sharing server
A sample profile file with credentials for our example Delta Sharing server is available for download.

Profile Files

A profile file is a JSON file containing credentials to access a Delta Sharing Server. Here’s what a typical profile file looks like:
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "your-bearer-token-here"
}
Profile files can be stored:
  • On your local file system: /path/to/profile.share
  • On cloud storage with FSSPEC support: s3://bucket/profile.share
  • On Databricks File System (DBFS): /dbfs/path/to/profile.share
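As a sketch of the expected structure, the snippet below writes a minimal profile file (using the placeholder endpoint and token from the example above) and reads it back to confirm it is valid JSON with the required keys:

```python
import json
import os
import tempfile

# A minimal profile, using the placeholder values from above
profile = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.example.com/delta-sharing/",
    "bearerToken": "your-bearer-token-here",
}

# Write it to a .share file that delta_sharing could consume
with tempfile.NamedTemporaryFile(mode="w", suffix=".share",
                                 delete=False) as f:
    json.dump(profile, f)
    profile_file = f.name

# Read it back to confirm it is valid JSON with the expected keys
with open(profile_file) as f:
    loaded = json.load(f)
print(sorted(loaded))  # ['bearerToken', 'endpoint', 'shareCredentialsVersion']
os.remove(profile_file)
```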

Basic Usage with Pandas

Creating a SharingClient

The SharingClient allows you to explore available shares, schemas, and tables:
import delta_sharing

# Point to your profile file
profile_file = "/path/to/profile.share"

# Create a SharingClient
client = delta_sharing.SharingClient(profile_file)

# List all shared tables
tables = client.list_all_tables()
for table in tables:
    print(f"{table.share}.{table.schema}.{table.name}")

Loading Data as Pandas DataFrame

To load a shared table, construct a table URL in the format: <profile-path>#<share>.<schema>.<table>
import delta_sharing

# Construct table URL
profile_file = "/path/to/profile.share"
table_url = f"{profile_file}#my_share.my_schema.my_table"

# Load entire table as pandas DataFrame
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
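To make the URL format concrete, here is a small helper (hypothetical, not part of the library) that splits a table URL back into its profile path and table coordinates:

```python
def parse_table_url(table_url: str):
    """Split '<profile-path>#<share>.<schema>.<table>' into its parts."""
    # The profile path ends at the first '#'; dots in the path are fine
    profile_path, _, table_path = table_url.partition("#")
    share, schema, table = table_path.split(".")
    return profile_path, share, schema, table

parts = parse_table_url("/path/to/profile.share#my_share.my_schema.my_table")
print(parts)  # ('/path/to/profile.share', 'my_share', 'my_schema', 'my_table')
```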

Sampling Data

For large tables, use the limit parameter to fetch only a sample:
# Fetch only 10 rows to explore the table structure
sample_df = delta_sharing.load_as_pandas(table_url, limit=10)
print(sample_df)
The limit parameter is useful for exploration but does not guarantee which rows are returned. For production use cases, load the full table or use appropriate filtering.

Time Travel Queries

Query historical versions of a table using version or timestamp:
# Load table at a specific version
df = delta_sharing.load_as_pandas(
    table_url,
    version=5
)

# Load the table as of a specific timestamp
df = delta_sharing.load_as_pandas(
    table_url,
    timestamp="2024-01-01T00:00:00Z"
)

Using Delta Format

For better performance with supported tables, use Delta format:
# Explicitly use Delta format for reading
df = delta_sharing.load_as_pandas(
    table_url,
    use_delta_format=True
)
Delta format provides more efficient data transfer and better predicate pushdown. The connector automatically chooses the best format if use_delta_format is not specified.

Memory-Efficient Loading

For large tables, use batch conversion to reduce memory consumption:
# Convert parquet files to pandas one batch at a time
df = delta_sharing.load_as_pandas(
    table_url,
    convert_in_batches=True
)
This approach:
  • Reduces peak memory usage
  • May take longer to complete
  • Particularly beneficial for parquet format queries
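The idea behind convert_in_batches can be illustrated generically: instead of materializing the whole result at once, work on one chunk at a time so peak memory is bounded by a single chunk. This is a plain-Python sketch of the pattern, not the connector's implementation:

```python
def iter_batches(rows, batch_size):
    """Yield successive slices of `rows`, holding one slice at a time."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# Process 10 rows in batches of 4: only one batch is live at a time
rows = list(range(10))
totals = [sum(batch) for batch in iter_batches(rows, batch_size=4)]
print(totals)  # [6, 22, 17]
```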

Using with PySpark

To use load_as_spark, you must be running in a PySpark environment with the Apache Spark Connector for Delta Sharing installed. See Apache Spark Connector documentation for setup instructions.

Loading as Spark DataFrame

import delta_sharing
from pyspark.sql import SparkSession

# Ensure you're in a PySpark session
spark = SparkSession.builder \
    .appName("DeltaSharing") \
    .config("spark.jars.packages", "io.delta:delta-sharing-spark_2.12:3.1.0") \
    .getOrCreate()

# Load table as Spark DataFrame
table_url = f"{profile_file}#my_share.my_schema.my_table"
df = delta_sharing.load_as_spark(table_url)

# Use Spark DataFrame operations
df.show()
df.printSchema()

Time Travel with Spark

df = delta_sharing.load_as_spark(
    table_url,
    version=10
)

Reading Change Data Feed (CDF)

If a table has Change Data Feed enabled, you can query table changes:

CDF with Pandas

import delta_sharing

# Load table changes from version 0 to version 5
changes_df = delta_sharing.load_table_changes_as_pandas(
    table_url,
    starting_version=0,
    ending_version=5
)

print(changes_df.head())
The resulting DataFrame includes these columns:
  • All original table columns
  • _change_type: Type of change (insert, update_preimage, update_postimage, delete)
  • _commit_version: Version number of the change
  • _commit_timestamp: Timestamp of the change
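For example, you can filter on _change_type to work with just the inserted rows. The sketch below uses a hand-built DataFrame with the CDF columns described above in place of a live query result (assumes pandas is available):

```python
import pandas as pd

# Hand-built stand-in for a CDF result, with the same metadata columns
changes_df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "_change_type": ["insert", "update_preimage",
                     "update_postimage", "delete"],
    "_commit_version": [1, 2, 2, 3],
    "_commit_timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03"]
    ),
})

# Keep only rows that were inserted
inserts = changes_df[changes_df["_change_type"] == "insert"]
print(len(inserts))  # 1
```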

CDF with Version Range

# Get all changes from version 10 to latest
changes_df = delta_sharing.load_table_changes_as_pandas(
    table_url,
    starting_version=10
)

CDF with Timestamp Range

# Get changes between two timestamps
changes_df = delta_sharing.load_table_changes_as_pandas(
    table_url,
    starting_timestamp="2024-01-01T00:00:00Z",
    ending_timestamp="2024-01-31T23:59:59Z"
)

CDF with Spark

# Load table changes as Spark DataFrame
changes_df = delta_sharing.load_table_changes_as_spark(
    table_url,
    starting_version=0,
    ending_version=5
)

changes_df.show()

Memory-Efficient CDF

# Use batch conversion for large change sets
changes_df = delta_sharing.load_table_changes_as_pandas(
    table_url,
    starting_version=0,
    ending_version=100,
    convert_in_batches=True,
    use_delta_format=True
)

Getting Table Metadata

Retrieve table metadata and version information:
import delta_sharing

# Get current table version
version = delta_sharing.get_table_version(table_url)
print(f"Current version: {version}")

# Get version at specific timestamp
version_at_time = delta_sharing.get_table_version(
    table_url,
    starting_timestamp="2024-01-15T10:00:00Z"
)

# Get table metadata
metadata = delta_sharing.get_table_metadata(table_url)
print(f"Table ID: {metadata.id}")
print(f"Schema: {metadata.schema_string}")

# Get table protocol
protocol = delta_sharing.get_table_protocol(table_url)
print(f"Min reader version: {protocol.min_reader_version}")
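The schema_string is the table's Delta schema serialized as JSON. As a sketch (using a hand-written schema string rather than one fetched from a live table), you can parse it to list column names and types:

```python
import json

# Hand-written example of a Delta schema string
schema_string = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": False, "metadata": {}},
        {"name": "name", "type": "string", "nullable": True, "metadata": {}},
    ],
})

# Parse the schema and pull out (column, type) pairs
schema = json.loads(schema_string)
columns = [(f["name"], f["type"]) for f in schema["fields"]]
print(columns)  # [('id', 'long'), ('name', 'string')]
```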

Complete Example

Here’s a complete example that explores and reads shared data:
import delta_sharing
import pandas as pd

# Initialize client
profile_file = "/path/to/profile.share"
client = delta_sharing.SharingClient(profile_file)

# Explore available data
print("Available shares:")
for share in client.list_shares():
    print(f"  - {share.name}")

# List all tables
print("\nAll tables:")
tables = client.list_all_tables()
for table in tables:
    print(f"  - {table.share}.{table.schema}.{table.name}")

# Select first table
if tables:
    first_table = tables[0]
    table_url = f"{profile_file}#{first_table.share}.{first_table.schema}.{first_table.name}"
    
    # Get metadata
    version = delta_sharing.get_table_version(table_url)
    print(f"\nTable version: {version}")
    
    # Sample the data
    print("\nSample data (first 5 rows):")
    sample = delta_sharing.load_as_pandas(table_url, limit=5)
    print(sample)
    
    # Load full table
    print("\nLoading full table...")
    df = delta_sharing.load_as_pandas(table_url)
    print(f"Loaded {len(df)} rows, {len(df.columns)} columns")

Next Steps

API Reference

Explore detailed API documentation

Advanced Usage

Learn about predicates, profiles, and optimization
