Once you’ve built a Collection, you’ll often want to narrow it down to specific scenes before loading pixel data. Rasteret offers two ways to do this: the high-level subset() method for common filters, and where() for arbitrary PyArrow expressions.

Quick Filtering with subset()

The subset() method returns a filtered view of the Collection (no data copying):
import rasteret

collection = rasteret.build(
    "earthsearch/sentinel-2-l2a",
    name="madrid",
    bbox=(-3.75, 40.38, -3.65, 40.48),
    date_range=("2024-01-01", "2024-06-30"),
)

print(f"Original: {len(collection)} scenes")

# Filter by cloud cover
filtered = collection.subset(cloud_cover_lt=15)
print(f"Cloud < 15%: {len(filtered)} scenes")
Key filters:
  • cloud_cover_lt: Keep scenes with eo:cloud_cover below this threshold (0-100)
  • date_range: Temporal range ("2024-03-01", "2024-04-30")
  • bbox: Spatial extent (minx, miny, maxx, maxy)
  • geometries: Filter by one or more geometries
  • split: Filter by a split column (e.g. "train", "val", "test")

Combining Filters

All filters are combined with AND:
# Cloud < 10% + March 2024 + specific bbox
filtered = collection.subset(
    cloud_cover_lt=10,
    date_range=("2024-03-01", "2024-03-31"),
    bbox=(-3.72, 40.40, -3.68, 40.44),
)

print(f"Combined filters: {len(filtered)} scenes")
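
To make the AND semantics concrete, here is a minimal plain-Python sketch that does not use Rasteret at all. The scene records and their keys are made up for illustration, and treating both ends of the date range as inclusive is an assumption rather than documented behavior.

```python
from datetime import date

# Made-up scene records; the keys are illustrative, not Rasteret column names.
scenes = [
    {"id": "a", "cloud": 4.0, "day": date(2024, 3, 10)},
    {"id": "b", "cloud": 25.0, "day": date(2024, 3, 12)},  # too cloudy
    {"id": "c", "cloud": 2.0, "day": date(2024, 5, 1)},    # outside March
]

# One predicate per filter argument.
# Assumption: both ends of the date range are treated as inclusive here.
predicates = [
    lambda s: s["cloud"] < 10,
    lambda s: date(2024, 3, 1) <= s["day"] <= date(2024, 3, 31),
]

# AND semantics: a scene survives only if every predicate passes.
kept = [s["id"] for s in scenes if all(p(s) for p in predicates)]
print(kept)
```

Each extra argument can only shrink the result. A bbox argument would join the list as one more predicate.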

Temporal Filtering

# Spring 2024
spring = collection.subset(date_range=("2024-03-01", "2024-05-31"))

# Single month
april = collection.subset(date_range=("2024-04-01", "2024-04-30"))

Spatial Filtering

Bounding Box

# Filter by bbox (WGS84 / EPSG:4326)
subset = collection.subset(bbox=(-122.45, 37.75, -122.35, 37.85))
Requirements: The Collection must have scalar bbox columns (bbox_minx, bbox_miny, bbox_maxx, bbox_maxy). Collections built with rasteret>=1.0.0 include these automatically.

Geometries

Filter by one or more geometries (points, polygons, etc.):
import geopandas as gpd
from shapely.geometry import box

# From Shapely
aoi = box(-122.45, 37.75, -122.35, 37.85)
filtered = collection.subset(geometries=aoi)

# From GeoDataFrame
aois = gpd.read_file("regions.geojson")
filtered = collection.subset(geometries=aois.geometry)

# Multiple bboxes as tuples
bboxes = [
    (-122.45, 37.75, -122.35, 37.85),
    (-122.50, 37.70, -122.40, 37.80),
]
filtered = collection.subset(geometries=bboxes)
Note: Geometry filtering uses bbox overlap (not full intersection). A scene is kept if its bbox overlaps any input geometry’s bbox.
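
The practical consequence of bbox overlap is that a scene can be kept even though the geometry itself never touches it. The sketch below illustrates this with plain tuples. The helper names are hypothetical, and counting boxes that merely share an edge as overlapping is an assumption, not confirmed Rasteret behavior.

```python
def polygon_bbox(coords):
    """Bounding box (minx, miny, maxx, maxy) of a list of (x, y) vertices."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def bbox_overlaps(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


scene_bbox = (0.0, 0.0, 4.0, 4.0)

# A triangle hugging the far corner: every point inside it has x + y >= 13,
# while the scene's largest x + y is 8, so the two shapes never intersect.
triangle = [(3.0, 10.0), (10.0, 10.0), (10.0, 3.0)]
triangle_bbox = polygon_bbox(triangle)  # (3.0, 3.0, 10.0, 10.0)

kept_by_bbox_test = bbox_overlaps(scene_bbox, triangle_bbox)
far_away = bbox_overlaps(scene_bbox, (20.0, 20.0, 30.0, 30.0))
print(kept_by_bbox_test, far_away)
```

If exact intersection matters, treat the bbox result as a coarse prefilter and refine the surviving scenes with a geometry library afterwards.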

Cloud Cover Filtering

# Very clear scenes
clear = collection.subset(cloud_cover_lt=5)

# Acceptable for analysis
usable = collection.subset(cloud_cover_lt=20)

# Combine with temporal filter
summer_clear = collection.subset(
    cloud_cover_lt=10,
    date_range=("2024-06-01", "2024-08-31"),
)

Split Filtering (ML Workflows)

If you’ve annotated your Collection with train/val/test splits:
# Filter to training split
train = collection.subset(split="train")

# Or use the convenience method
train = collection.select_split("train")

# Multiple splits
train_val = collection.subset(split=["train", "val"])
See the ML Training guide for details on assigning splits.

Advanced: Arrow Expressions

For custom queries, use where() with raw PyArrow expressions:
import pyarrow.dataset as ds

# Cloud cover between 5 and 15%
filtered = collection.where(
    (ds.field("eo:cloud_cover") >= 5) &
    (ds.field("eo:cloud_cover") < 15)
)

# Scenes from specific satellite
filtered = collection.where(
    ds.field("platform") == "sentinel-2b"
)

# Combined expression
filtered = collection.where(
    (ds.field("eo:cloud_cover") < 10) &
    (ds.field("proj:epsg") == 32630)  # UTM zone 30N
)
Arrow operators:
  • Comparisons: ==, !=, <, <=, >, >=
  • Logic: & (AND), | (OR), ~ (NOT)
  • Membership: ds.field("column").isin([val1, val2])
  • Pattern matching: pc.match_substring(ds.field("id"), "S2A_"), with import pyarrow.compute as pc (expression objects have no match_substring method)

Complex Queries

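import pandas as pd
import pyarrow.compute as pc
import pyarrow.dataset as ds

# Sentinel-2A scenes in summer with low cloud cover
expr = (
    # Compute functions accept field expressions; expressions have no .match_substring()
    pc.match_substring(ds.field("platform"), "sentinel-2a") &
    (ds.field("eo:cloud_cover") < 15) &
    (ds.field("datetime") >= pd.Timestamp("2024-06-01")) &
    # Exclusive upper bound so scenes acquired during 31 August are kept
    (ds.field("datetime") < pd.Timestamp("2024-09-01"))
)

filtered = collection.where(expr)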

Combining Methods

You can chain subset() and where() calls:
# High-level filter first
spring = collection.subset(
    date_range=("2024-03-01", "2024-05-31"),
    cloud_cover_lt=20,
)

# Then apply custom logic
final = spring.where(
    ds.field("proj:epsg").isin([32630, 32631])  # Specific UTM zones
)

Inspecting Results

# Count scenes
print(f"Total: {len(filtered)} scenes")

# View summary
print(filtered.describe())

# Check date range
if filtered.start_date and filtered.end_date:
    print(f"Date range: {filtered.start_date} to {filtered.end_date}")

# Available bands
print(f"Bands: {filtered.bands}")

# Spatial extent
if filtered.bounds:
    print(f"Bounds: {filtered.bounds}")

Real-World Example: Multi-Season Analysis

import rasteret

# Build once
collection = rasteret.build(
    "earthsearch/sentinel-2-l2a",
    name="farm-monitoring",
    bbox=(11.3, 48.1, 11.5, 48.3),
    date_range=("2024-01-01", "2024-12-31"),
)

# Define seasons
seasons = {
    "winter": collection.subset(
        date_range=("2024-01-01", "2024-02-29"),
        cloud_cover_lt=30,  # More lenient in winter
    ),
    "spring": collection.subset(
        date_range=("2024-03-01", "2024-05-31"),
        cloud_cover_lt=15,
    ),
    "summer": collection.subset(
        date_range=("2024-06-01", "2024-08-31"),
        cloud_cover_lt=10,  # Clearest season
    ),
    "fall": collection.subset(
        date_range=("2024-09-01", "2024-11-30"),
        cloud_cover_lt=15,
    ),
}

for season_name, subset in seasons.items():
    print(f"{season_name}: {len(subset)} scenes")

Performance Notes

Fast Operations

  • subset() and where() create views (no data copying)
  • Filters are applied lazily (only evaluated when reading pixels)
  • Arrow predicate pushdown means filtering happens at the Parquet scan level

Column Availability

Some filters require specific columns:
  • cloud_cover_lt: Requires eo:cloud_cover
  • bbox: Requires bbox_minx, bbox_miny, bbox_maxx, bbox_maxy
  • geometries: Requires scalar bbox columns
  • split: Requires a split column (you must add this via PyArrow after building)
If a required column is missing, Rasteret raises a ValueError with a clear message.
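
If you would rather check up front than catch the ValueError, a small pre-flight helper is enough. The sketch below is hypothetical and not part of Rasteret. It encodes the requirements listed above and assumes the split column is literally named "split".

```python
# Required columns per subset() filter, taken from the list above.
# Assumption: the split column is named "split".
REQUIRED_COLUMNS = {
    "cloud_cover_lt": ["eo:cloud_cover"],
    "bbox": ["bbox_minx", "bbox_miny", "bbox_maxx", "bbox_maxy"],
    "geometries": ["bbox_minx", "bbox_miny", "bbox_maxx", "bbox_maxy"],
    "split": ["split"],
}


def missing_columns(available, filters):
    """Return {filter_name: [missing columns]} for filters that cannot run."""
    available = set(available)
    report = {}
    for name in filters:
        missing = [c for c in REQUIRED_COLUMNS.get(name, []) if c not in available]
        if missing:
            report[name] = missing
    return report


# Hypothetical schema: bbox columns present, no cloud cover or split column.
columns = ["id", "datetime", "bbox_minx", "bbox_miny", "bbox_maxx", "bbox_maxy"]
report = missing_columns(columns, ["bbox", "cloud_cover_lt", "split"])
print(report)
```

For a real Collection, the column names could presumably be read from collection.dataset.schema.names, since collection.dataset is used as a PyArrow dataset elsewhere on this page. Treat that as an assumption to verify.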

Common Patterns

Best Scene Per Month

# Load the metadata columns, then compute the lowest cloud cover per (year, month).
# Note: this returns the minimum value for each month, not the id of the scene that has it.
table = collection.dataset.to_table(
    columns=["id", "datetime", "eo:cloud_cover", "year", "month"]
)

best = (
    table.group_by(["year", "month"])
    .aggregate([("eo:cloud_cover", "min")])
)

print(best.to_pandas())
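
The aggregation above reports the lowest cloud cover per month but not which scene achieved it. One way to recover the scene ids is to pick the row with the minimum value in each group. The sketch below does this with pandas on an invented table that stands in for table.to_pandas().

```python
import pandas as pd

# Invented rows standing in for table.to_pandas().
df = pd.DataFrame({
    "id": ["s1", "s2", "s3", "s4"],
    "year": [2024, 2024, 2024, 2024],
    "month": [3, 3, 4, 4],
    "eo:cloud_cover": [12.0, 4.0, 30.0, 7.5],
})

# idxmin() gives the index label of the clearest scene in each (year, month) group.
best_index = df.groupby(["year", "month"])["eo:cloud_cover"].idxmin()
best_rows = df.loc[best_index]
best_ids = best_rows["id"].tolist()
print(best_ids)
```

The resulting ids can then be passed back through the isin() pattern shown on this page, for example collection.where(ds.field("id").isin(best_ids)).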

Exclude Bad Scenes

# Scenes to exclude (known issues)
bad_ids = ["scene_123", "scene_456"]

filtered = collection.where(
    ~ds.field("id").isin(bad_ids)
)

UTM Zone Consistency

# Keep only scenes in UTM zone 32N
utm_32n = collection.where(
    ds.field("proj:epsg") == 32632
)
