
Documentation Index

Fetch the complete documentation index at: https://mintlify.com/terrafloww/rasteret/llms.txt

Use this file to discover all available pages before exploring further.

Function Signature

rasteret.build_from_table(
    path: str | Path | pa.Table | pads.Dataset,
    *,
    name: str = "",
    data_source: str = "",
    workspace_dir: str | Path | None = None,
    column_map: dict[str, str] | None = None,
    href_column: str | None = None,
    band_index_map: dict[str, int] | None = None,
    url_rewrite_patterns: dict[str, str] | None = None,
    filesystem: Any | None = None,
    columns: list[str] | None = None,
    filter_expr: Any | None = None,
    enrich_cog: bool = False,
    band_codes: list[str] | None = None,
    cloud_config: Any = None,
    max_concurrent: int = 300,
    force: bool = False,
    backend: StorageBackend | None = None,
) -> Collection

Description

Build a Collection from an external Parquet/GeoParquet record table. A record table is a Parquet dataset where each row is a raster item (satellite scene, drone image, derived product, etc.) with at minimum id, datetime, geometry, and assets columns, or columns that can be normalized into them via column_map and href_column. This is the heavy ingest path: it can normalize the schema and optionally enrich COG headers. For in-memory tables that are already read-ready, use as_collection(). When enrich_cog=True, COG headers are parsed from the asset URLs and cached as {band}_metadata struct columns in the Parquet index, enabling fast tiled reads and TorchGeo integration.
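The column_map aliasing described above can be sketched in plain Python. This is an illustration of the documented behavior (source columns preserved, contract-name columns added as aliases to the same data), not rasteret's actual implementation; the record values are made up.

```python
# Illustrative sketch of column_map aliasing (not rasteret's code):
# source columns are kept, and contract-name keys are added that
# point at the same underlying data.
column_map = {
    "scene_id": "id",
    "capture_date": "datetime",
    "footprint": "geometry",
}

record = {
    "scene_id": "S2A_0001",
    "capture_date": "2024-06-01T10:30:00Z",
    "footprint": b"\x01",  # placeholder for WKB geometry bytes
}

aliased = dict(record)  # source columns preserved
for source_name, contract_name in column_map.items():
    if source_name in record:
        aliased[contract_name] = record[source_name]  # alias, same value

print(sorted(aliased))
```

In the real library the aliases are zero-copy at the Arrow level, so no column data is duplicated.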

Parameters

path
str | Path | pa.Table | pads.Dataset
required
Path/URI to a Parquet/GeoParquet file or dataset directory, or an in-memory Arrow object (pyarrow.Table or pyarrow.dataset.Dataset).
name
str
default:""
Optional collection name. When given without workspace_dir, the collection is cached in the default workspace.
data_source
str
default:""
Data source identifier for band mapping and URL policy.
workspace_dir
str | Path
Persist the collection as partitioned Parquet at this path. Defaults to ~/rasteret_workspace/{name}_records/ when name is provided.
column_map
dict[str, str]
{source_name: contract_name} alias map. Source columns are preserved; contract-name columns are added as zero-copy aliases.
href_column
str
Column containing COG URLs. When set and assets is absent after aliasing, the normalization layer constructs the assets struct from this column and band_index_map.
band_index_map
dict[str, int]
{band_code: sample_index} for multi-band COGs.
url_rewrite_patterns
dict[str, str]
{source_prefix: target_prefix} for URL rewriting during assets construction.
filesystem
pyarrow.fs.FileSystem
PyArrow filesystem for reading remote URIs (e.g., S3FileSystem(anonymous=True)).
columns
list[str]
Scan-time column projection.
filter_expr
pyarrow.dataset.Expression
Scan-time predicate pushdown.
enrich_cog
bool
default:"False"
Parse COG headers and add per-band metadata columns.
band_codes
list[str]
Bands to enrich. Defaults to all bands in assets.
cloud_config
CloudConfig
Cloud configuration for URL rewriting.
max_concurrent
int
default:"300"
Maximum concurrent HTTP connections for COG header parsing.
force
bool
default:"False"
Rebuild even if a cached collection already exists at the resolved workspace path.
backend
StorageBackend
I/O backend for authenticated range reads during COG header parsing. See create_backend().
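To make the interplay of href_column and band_index_map concrete, here is a hypothetical sketch of how an assets mapping could be derived from a single multi-band COG URL. The field names in the per-band dict ("href", "sample_index") are assumptions for illustration only, not rasteret's actual assets struct schema.

```python
# Hypothetical sketch: deriving per-band assets entries from one
# href_column value plus band_index_map for a multi-band COG.
# The dict layout ("href", "sample_index") is illustrative only.
href = "s3://example-bucket/scene1/rgb.tif"  # value from href_column
band_index_map = {"R": 0, "G": 1, "B": 2}    # {band_code: sample_index}

assets = {
    band: {"href": href, "sample_index": index}
    for band, index in band_index_map.items()
}

print(assets["G"])
```

All bands share the same URL; the sample index tells the reader which plane of the COG to decode for each band code.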

Returns

collection
Collection
A Collection object ready for spatial queries and pixel reads.

Usage Example

import rasteret
import pyarrow.dataset as pads

# Build from a GeoParquet file
collection = rasteret.build_from_table(
    "s3://example-bucket/records.parquet",
    name="custom-dataset",
    data_source="custom",
    enrich_cog=True,
)

# Build with column mapping
collection = rasteret.build_from_table(
    "/path/to/table.parquet",
    name="mapped-dataset",
    column_map={
        "scene_id": "id",
        "capture_date": "datetime",
        "footprint": "geometry",
    },
    href_column="image_url",
    band_index_map={"R": 0, "G": 1, "B": 2},
    enrich_cog=True,
)

# Build with filtering and enrichment
filter_expr = (
    (pads.field("year") == 2024) &
    (pads.field("cloud_cover") < 10)
)

collection = rasteret.build_from_table(
    "s3://bucket/large-dataset/",
    name="filtered-2024",
    filter_expr=filter_expr,
    enrich_cog=True,
    band_codes=["B04", "B03", "B02"],
    max_concurrent=50,
)

# Build from in-memory Arrow table
import pyarrow as pa

table = pa.table({
    "id": ["scene1", "scene2"],
    "datetime": pa.array(["2024-01-01", "2024-01-02"], type=pa.timestamp("us")),
    "geometry": [...],  # WKB binary
    "assets": [...],  # struct array
})

collection = rasteret.build_from_table(
    table,
    name="in-memory",
    data_source="custom",
)
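The url_rewrite_patterns parameter takes {source_prefix: target_prefix} pairs. The sketch below illustrates the assumed semantics (longest-prefix-style replacement during assets construction) in plain Python; it mirrors the parameter description above, not rasteret's internal code.

```python
# Illustrative sketch of url_rewrite_patterns semantics: each key is a
# source prefix that gets replaced by the target prefix when assets
# are constructed. Not rasteret's actual implementation.
url_rewrite_patterns = {
    "https://storage.example.com/cogs/": "s3://example-bucket/cogs/",
}

def rewrite_url(url: str, patterns: dict[str, str]) -> str:
    for source_prefix, target_prefix in patterns.items():
        if url.startswith(source_prefix):
            return target_prefix + url[len(source_prefix):]
    return url  # unchanged when no prefix matches

print(rewrite_url(
    "https://storage.example.com/cogs/scene1/B04.tif",
    url_rewrite_patterns,
))
```

Rewriting HTTPS mirror URLs to s3:// URIs like this lets the backend issue authenticated range reads directly against object storage.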
