Overview

OpenAVM Kit supports remote storage services for syncing input data and outputs across different environments. This is especially useful for:
  • Sharing datasets across team members
  • Backing up your work to the cloud
  • Accessing public datasets maintained by organizations
  • Deploying models in production environments

Supported Storage Services

OpenAVM Kit currently supports three remote storage services:

Microsoft Azure

Azure Blob Storage for enterprise deployments

Hugging Face

Hugging Face datasets for ML workflows

SFTP

Secure FTP for traditional file servers

Environment Configuration

All cloud storage credentials must be stored in a .env file located in the notebooks/ directory.
Security Notice: The .env file is already in .gitignore, but always verify you don’t accidentally commit credentials to version control. Never share this file publicly.

Creating the .env File

Create a plain text file at notebooks/.env with your configuration:
notebooks/
  ├── .env          # <-- Create this file
  ├── pipeline/
The file should follow this format:
SOME_VARIABLE=some_value
ANOTHER_VARIABLE=another_value
YET_ANOTHER_VARIABLE=123
You only need to provide values for the service you’re actually using. The library will automatically detect which cloud service to use based on the available environment variables.
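The auto-detection described above can be sketched roughly as follows. This is an illustrative sketch, not OpenAVM Kit’s actual implementation; the marker variables match the service sections documented below.

```python
import os

# Map each service to the environment variable that signals it is configured.
# Illustrative sketch of auto-detection, not openavmkit's real code.
_SERVICE_MARKERS = {
    "azure": "AZURE_STORAGE_CONNECTION_STRING",
    "huggingface": "HF_REPO_ID",
    "sftp": "SFTP_HOST",
}

def detect_cloud_service(env=os.environ):
    """Return the first service whose marker variable is set, else None."""
    for service, marker in _SERVICE_MARKERS.items():
        if env.get(marker):
            return service
    return None

# Example: only Hugging Face variables are present
print(detect_cloud_service({"HF_REPO_ID": "your-username/localities-data"}))
# → huggingface
```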

Azure Blob Storage

Configure Azure Blob Storage for enterprise-grade cloud storage.

Required Environment Variables

AZURE_ACCESS
string
required
Access level for your Azure account. Options:
  • read_only - Download files only
  • read_write - Download and upload files
AZURE_STORAGE_CONTAINER_NAME
string
required
The name of your Azure storage container
AZURE_STORAGE_CONNECTION_STRING
string
required
The connection string for your Azure storage account. Find this in the Azure Portal under your storage account’s “Access keys” section.

Example Configuration

# Azure Blob Storage Configuration
AZURE_ACCESS=read_write
AZURE_STORAGE_CONTAINER_NAME=openavmkit-data
AZURE_STORAGE_CONNECTION_STRING=DefaultEndpointsProtocol=https;AccountName=...

Getting Your Connection String

  1. Navigate to your Storage Account in the Azure Portal
  2. Go to Security + networking > Access keys
  3. Click Show keys
  4. Copy one of the connection strings
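The connection string is a semicolon-delimited list of key=value pairs. A quick sanity check that the required fields are present might look like this (a hypothetical helper for illustration, not part of OpenAVM Kit):

```python
def parse_connection_string(conn_str):
    """Split an Azure-style connection string into a dict of key=value parts."""
    parts = {}
    for segment in conn_str.split(";"):
        if not segment:
            continue
        # partition (not split) keeps "=" padding inside base64 account keys intact
        key, _, value = segment.partition("=")
        parts[key] = value
    return parts

conn = "DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=abc123==;EndpointSuffix=core.windows.net"
parsed = parse_connection_string(conn)

# The fields a complete connection string should carry at a minimum:
for required in ("DefaultEndpointsProtocol", "AccountName", "AccountKey"):
    assert required in parsed, f"connection string is missing {required}"

print(parsed["AccountName"])  # → myaccount
```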

Hugging Face

Configure Hugging Face for ML-focused workflows and easy dataset sharing.

Required Environment Variables

HF_ACCESS
string
required
Access level for your Hugging Face account. Options:
  • read_only - Download datasets only
  • read_write - Download and upload datasets
HF_REPO_ID
string
required
The Hugging Face repository ID in format username/repo-name or organization/repo-name
HF_TOKEN
string
Your Hugging Face API token. Required for private repositories or write access. Optional for public read-only access.

Example Configuration

# Hugging Face Configuration
HF_ACCESS=read_write
HF_REPO_ID=your-username/localities-data
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
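A quick format check for the repository ID can catch a common misconfiguration before any network call. The regex below is illustrative only; Hugging Face’s own naming rules are slightly broader.

```python
import re

# "username/repo-name" or "organization/repo-name": exactly one slash, no spaces.
# Illustrative pattern, not how the library itself validates repo IDs.
_REPO_ID_RE = re.compile(r"^[\w.-]+/[\w.-]+$")

def is_valid_repo_id(repo_id):
    return bool(_REPO_ID_RE.match(repo_id))

print(is_valid_repo_id("your-username/localities-data"))  # → True
print(is_valid_repo_id("no-slash-here"))                  # → False
```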

Getting Your API Token

  1. Go to huggingface.co/settings/tokens
  2. Click New token
  3. Give it a descriptive name like “openavmkit”
  4. Select the appropriate permissions (read or write)
  5. Copy the token and add it to your .env file
Using Public Datasets: You can access public Hugging Face datasets without a token by using read_only access. This is perfect for getting started with datasets from The Center for Land Economics or other public sources.

SFTP (Secure File Transfer Protocol)

Configure SFTP for traditional file server setups.

Required Environment Variables

SFTP_ACCESS
string
required
Access level for your SFTP account. Options:
  • read_only - Download files only
  • read_write - Download and upload files
SFTP_HOST
string
required
The hostname or IP address of your SFTP server
SFTP_USERNAME
string
required
Your SFTP username
SFTP_PASSWORD
string
Your SFTP password (required if not using key-based authentication)
SFTP_PORT
number
default: 22
The port number for your SFTP server

Example Configuration

# SFTP Configuration
SFTP_ACCESS=read_write
SFTP_HOST=sftp.example.com
SFTP_PORT=22
SFTP_USERNAME=your-username
SFTP_PASSWORD=your-secure-password
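Collecting these variables with a sensible port fallback might look like the following. This is a sketch under the assumption that only SFTP_HOST and SFTP_USERNAME are strictly needed; OpenAVM Kit handles this internally.

```python
import os

def load_sftp_config(env=os.environ):
    """Collect SFTP settings from the environment, defaulting the port to 22."""
    return {
        "host": env.get("SFTP_HOST"),
        "port": int(env.get("SFTP_PORT", "22")),
        "username": env.get("SFTP_USERNAME"),
        "password": env.get("SFTP_PASSWORD"),
        "access": env.get("SFTP_ACCESS", "read_only"),
    }

cfg = load_sftp_config({"SFTP_HOST": "sftp.example.com", "SFTP_USERNAME": "your-username"})
print(cfg["port"])  # → 22
```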

Switching Between Cloud Services

If you work with multiple projects that use different cloud storage services, you can configure the service selection in your settings.json:
{
  "cloud": {
    "type": "huggingface",
    "access": "read_write"
  }
}
cloud.type
string
Force a specific cloud service. Options:
  • azure
  • huggingface
  • sftp
cloud.access
string
Override the access level from the .env file. Options:
  • read_only
  • read_write
Never Store Credentials in settings.json: Only use settings.json to specify which cloud service to use, not to store credentials. The settings file may be synced to cloud storage, creating a security risk. Always keep credentials in the .env file.
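Combining the two sources (an explicit settings.json override first, then environment-based detection) could be sketched like this. The helper is hypothetical; the real precedence logic lives inside the library.

```python
import json
import os

def resolve_cloud_type(settings_path="settings.json", env=os.environ):
    """Prefer an explicit cloud.type in settings.json; otherwise infer from env vars."""
    try:
        with open(settings_path) as f:
            settings = json.load(f)
        forced = settings.get("cloud", {}).get("type")
        if forced:
            return forced
    except FileNotFoundError:
        pass
    # Fall back to environment-based detection (illustrative marker variables).
    if env.get("AZURE_STORAGE_CONNECTION_STRING"):
        return "azure"
    if env.get("HF_REPO_ID"):
        return "huggingface"
    if env.get("SFTP_HOST"):
        return "sftp"
    return None

# No settings file at this path, so detection falls back to the environment
print(resolve_cloud_type("/nonexistent/settings.json",
                         {"HF_REPO_ID": "your-username/localities-data"}))
# → huggingface
```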

Using Cloud Storage in Code

The cloud storage functionality is integrated into the pipeline:
from openavmkit.pipeline import download_inputs, upload_outputs

# Download inputs from cloud storage
download_inputs(
    locality="us-tx-travis",
    cloud_type="huggingface"  # or "azure", "sftp"
)

# Upload outputs to cloud storage
upload_outputs(
    locality="us-tx-travis",
    cloud_type="huggingface"
)

Troubleshooting

Connection Errors

Azure: Verify your connection string is complete and includes AccountName, AccountKey, and DefaultEndpointsProtocol. Copy it directly from the Azure Portal.
Hugging Face: For private repositories or write access, ensure your HF_TOKEN is valid and has the appropriate permissions. Regenerate the token if necessary.
SFTP: Check that:
  • The hostname is correct and reachable
  • The port is correct (default is 22)
  • Your firewall allows outbound connections on the SFTP port
  • The SFTP server is running
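The first two checks can be automated with a short TCP probe (illustrative; any socket client would do):

```python
import socket

def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# A refused or unreachable port simply returns False rather than raising.
print(can_reach("127.0.0.1", 9))
```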

Permission Errors

If you encounter permission errors when uploading:
  1. Verify your access is set to read_write in the .env file
  2. Check that your account has write permissions on the remote storage
  3. For Hugging Face, ensure your token has write access
  4. For Azure, verify your connection string is from an account with write permissions

File Not Found

If files aren’t downloading:
  1. Verify the file exists in the remote storage
  2. Check that the locality slug matches exactly
  3. Ensure your access level allows reading
  4. For Hugging Face, verify the repository ID is correct

Best Practices

Use Read-Only for Shared Datasets: When accessing shared or public datasets, use read_only access to prevent accidental modifications.
Separate Development and Production: Use different storage containers or repositories for development and production environments to avoid data conflicts.
Version Your Data: For Hugging Face repositories, use git tags or branches to version your datasets and ensure reproducibility.

Next Steps

Settings Configuration

Learn more about configuring settings.json

Getting Started

Return to the getting started guide