Dataset configurations are managed in the data/__init__.py file. This centralized approach allows you to define multiple datasets and control their usage during training.
Here’s a complete example showing how to add a custom dataset:
data/__init__.py
    # Define your dataset
    MY_DATASET = {
        "annotation_path": "/data/my_dataset/annotations.json",
        "data_path": "/data/my_dataset/images/",
    }

    # Example: Using an existing public dataset
    CAMBRIAN_737K = {
        "annotation_path": "/data/cambrian/cambrian_737k.json",
        "data_path": "/data/cambrian/images/",
    }

    # Register datasets in the data dictionary
    data_dict = {
        "my_dataset": MY_DATASET,
        "cambrian_737k": CAMBRIAN_737K,
    }
Control the proportion of data used from each dataset by appending %X to the dataset name, where X is the percentage:
    # Use 50% of the dataset
    dataset_names = ["my_dataset%50"]

    # Use 20% of the dataset
    dataset_names = ["my_dataset%20"]

    # Use 100% of the dataset (default)
    dataset_names = ["my_dataset%100"]

    # Or simply:
    dataset_names = ["my_dataset"]
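To make the `%X` convention concrete, here is a minimal sketch of how such a suffix could be split into a dataset name and a sampling rate. The helper name `split_sampling_suffix` is hypothetical and used only for illustration; the parsing code in data/__init__.py may differ.

```python
import re
from typing import Tuple


def split_sampling_suffix(dataset_name: str) -> Tuple[str, float]:
    """Split 'name%X' into (name, X / 100).

    Hypothetical helper: a name without a %X suffix is treated as 100%.
    """
    match = re.search(r"%(\d+)$", dataset_name)
    if match is None:
        return dataset_name, 1.0
    base_name = dataset_name[: match.start()]
    return base_name, int(match.group(1)) / 100.0


print(split_sampling_suffix("my_dataset%50"))  # ('my_dataset', 0.5)
print(split_sampling_suffix("my_dataset"))     # ('my_dataset', 1.0)
```

Anchoring the pattern at the end of the string keeps any `%` that appears earlier in a name from being read as a sampling rate.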
You can combine multiple datasets with different sampling rates:
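    # Every name used here must be registered in data_dict
    # (llava_next is assumed to be registered the same way as the datasets above)
    dataset_names = [
        "my_dataset%100",    # Use all of my_dataset
        "cambrian_737k%30",  # Use 30% of cambrian_737k
        "llava_next%50",     # Use 50% of llava_next
    ]
    configs = data_list(dataset_names)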
Sampling rates are applied independently to each dataset. This allows you to balance dataset sizes and prevent larger datasets from dominating the training.
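As a rough illustration of independent sampling, the sketch below subsamples two toy datasets separately and then concatenates them. The `subsample` helper, the fixed seed, and the dataset sizes are assumptions made for this example and do not describe how the training code actually draws samples.

```python
import random


def subsample(annotations, sampling_rate, seed=0):
    """Return a random subset holding roughly sampling_rate of the items.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    keep = round(len(annotations) * sampling_rate)
    return rng.sample(annotations, keep)


# Toy stand-ins for annotation lists: (items, sampling rate)
datasets = {
    "my_dataset": (list(range(1000)), 1.0),
    "cambrian_737k": (list(range(10000)), 0.3),
}

mixed = []
for name, (annotations, rate) in datasets.items():
    # Each dataset is subsampled on its own, before mixing
    mixed.extend(subsample(annotations, rate))

print(len(mixed))  # 1000 + 3000 = 4000
```

In this toy setup the larger dataset shrinks from ten times the size of the smaller one to three times, which is the balancing effect described above.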