
Documentation Index

Fetch the complete documentation index at: https://mintlify.com/terrafloww/rasteret/llms.txt

Use this file to discover all available pages before exploring further.

Function Signature

rasteret.build_from_table(
    path: str | Path | pa.Table | pads.Dataset,
    *,
    name: str = "",
    data_source: str = "",
    workspace_dir: str | Path | None = None,
    column_map: dict[str, str] | None = None,
    href_column: str | None = None,
    band_index_map: dict[str, int] | None = None,
    url_rewrite_patterns: dict[str, str] | None = None,
    filesystem: Any | None = None,
    columns: list[str] | None = None,
    filter_expr: Any | None = None,
    enrich_cog: bool = False,
    band_codes: list[str] | None = None,
    cloud_config: Any = None,
    max_concurrent: int = 300,
    force: bool = False,
    backend: StorageBackend | None = None,
) -> Collection

Description

Build a Collection from an external Parquet/GeoParquet record table. A record table is a Parquet dataset where each row is a raster item (satellite scene, drone image, derived product, etc.) with at minimum id, datetime, geometry, and assets columns, or columns that can be normalized into them via column_map and href_column. This is the heavy ingest path: it can normalize the schema and optionally enrich COG headers. For in-memory tables that are already read-ready, use as_collection(). When enrich_cog=True, COG headers are parsed from the asset URLs and cached as {band}_metadata struct columns in the Parquet index, enabling fast tiled reads and TorchGeo integration.
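The column_map aliasing described above can be sketched in plain Python. This is an illustration of the documented behavior (source columns preserved, contract-name columns added as aliases to the same data), not rasteret's actual implementation; the record values are made up.

```python
# Illustrative sketch of column_map aliasing (not rasteret's code):
# source columns are kept, and contract-name keys are added that
# point at the same underlying data.
column_map = {
    "scene_id": "id",
    "capture_date": "datetime",
    "footprint": "geometry",
}

record = {
    "scene_id": "S2A_0001",
    "capture_date": "2024-06-01T10:30:00Z",
    "footprint": b"\x01",  # placeholder for WKB geometry bytes
}

aliased = dict(record)  # source columns preserved
for source_name, contract_name in column_map.items():
    if source_name in record:
        aliased[contract_name] = record[source_name]  # alias, same value

print(sorted(aliased))
```

In the real library the aliases are zero-copy at the Arrow level, so no column data is duplicated.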

Parameters

path
str | Path | pa.Table | pads.Dataset
required
Path/URI to a Parquet/GeoParquet file or dataset directory, or an in-memory Arrow object (pyarrow.Table or pyarrow.dataset.Dataset).
name
str
default:""
Optional collection name. When given without workspace_dir, the collection is cached in the default workspace.
data_source
str
default:""
Data source identifier for band mapping and URL policy.
workspace_dir
str | Path
Persist the collection as partitioned Parquet at this path. Defaults to ~/rasteret_workspace/{name}_records/ when name is provided.
column_map
dict[str, str]
{source_name: contract_name} alias map. Source columns are preserved; contract-name columns are added as zero-copy aliases.
href_column
str
Column containing COG URLs. When set and assets is absent after aliasing, the normalization layer constructs the assets struct from this column and band_index_map.
band_index_map
dict[str, int]
{band_code: sample_index} for multi-band COGs.
url_rewrite_patterns
dict[str, str]
{source_prefix: target_prefix} for URL rewriting during assets construction.
filesystem
pyarrow.fs.FileSystem
PyArrow filesystem for reading remote URIs (e.g., S3FileSystem(anonymous=True)).
columns
list[str]
Scan-time column projection.
filter_expr
pyarrow.dataset.Expression
Scan-time predicate pushdown.
enrich_cog
bool
default:"False"
Parse COG headers and add per-band metadata columns.
band_codes
list[str]
Bands to enrich. Defaults to all bands in assets.
cloud_config
CloudConfig
Cloud configuration for URL rewriting.
max_concurrent
int
default:"300"
Maximum concurrent HTTP connections for COG header parsing.
force
bool
default:"False"
Rebuild even if a cached collection already exists at the resolved workspace path.
backend
StorageBackend
I/O backend for authenticated range reads during COG header parsing. See create_backend().
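To make the interplay of href_column and band_index_map concrete, here is a hypothetical sketch of how an assets mapping could be derived from a single multi-band COG URL. The field names in the per-band dict ("href", "sample_index") are assumptions for illustration only, not rasteret's actual assets struct schema.

```python
# Hypothetical sketch: deriving per-band assets entries from one
# href_column value plus band_index_map for a multi-band COG.
# The dict layout ("href", "sample_index") is illustrative only.
href = "s3://example-bucket/scene1/rgb.tif"  # value from href_column
band_index_map = {"R": 0, "G": 1, "B": 2}    # {band_code: sample_index}

assets = {
    band: {"href": href, "sample_index": index}
    for band, index in band_index_map.items()
}

print(assets["G"])
```

All bands share the same URL; the sample index tells the reader which plane of the COG to decode for each band code.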

Returns

collection
Collection
A Collection object ready for spatial queries and pixel reads.

Usage Example

import rasteret
import pyarrow.dataset as pads

# Build from a GeoParquet file
collection = rasteret.build_from_table(
    "s3://example-bucket/records.parquet",
    name="custom-dataset",
    data_source="custom",
    enrich_cog=True,
)

# Build with column mapping
collection = rasteret.build_from_table(
    "/path/to/table.parquet",
    name="mapped-dataset",
    column_map={
        "scene_id": "id",
        "capture_date": "datetime",
        "footprint": "geometry",
    },
    href_column="image_url",
    band_index_map={"R": 0, "G": 1, "B": 2},
    enrich_cog=True,
)

# Build with filtering and enrichment
filter_expr = (
    (pads.field("year") == 2024) &
    (pads.field("cloud_cover") < 10)
)

collection = rasteret.build_from_table(
    "s3://bucket/large-dataset/",
    name="filtered-2024",
    filter_expr=filter_expr,
    enrich_cog=True,
    band_codes=["B04", "B03", "B02"],
    max_concurrent=50,
)

# Build from in-memory Arrow table
import pyarrow as pa

table = pa.table({
    "id": ["scene1", "scene2"],
    "datetime": pa.array(["2024-01-01", "2024-01-02"], type=pa.timestamp("us")),
    "geometry": [...],  # WKB binary
    "assets": [...],  # struct array
})

collection = rasteret.build_from_table(
    table,
    name="in-memory",
    data_source="custom",
)
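The url_rewrite_patterns parameter takes {source_prefix: target_prefix} pairs. The sketch below illustrates the assumed semantics (longest-prefix-style replacement during assets construction) in plain Python; it mirrors the parameter description above, not rasteret's internal code.

```python
# Illustrative sketch of url_rewrite_patterns semantics: each key is a
# source prefix that gets replaced by the target prefix when assets
# are constructed. Not rasteret's actual implementation.
url_rewrite_patterns = {
    "https://storage.example.com/cogs/": "s3://example-bucket/cogs/",
}

def rewrite_url(url: str, patterns: dict[str, str]) -> str:
    for source_prefix, target_prefix in patterns.items():
        if url.startswith(source_prefix):
            return target_prefix + url[len(source_prefix):]
    return url  # unchanged when no prefix matches

print(rewrite_url(
    "https://storage.example.com/cogs/scene1/B04.tif",
    url_rewrite_patterns,
))
```

Rewriting HTTPS mirror URLs to s3:// URIs like this lets the backend issue authenticated range reads directly against object storage.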
