Supported Dataset Formats

CVAT supports over 30 dataset formats for import and export, making it compatible with most computer vision workflows and frameworks. All format conversions are powered by the Datumaro framework.

Format Overview

The table below shows all supported formats with import/export capabilities:

Format	Import	Export	Use Case
CVAT for images	✔️	✔️	CVAT native format for image annotation
CVAT for video	✔️	✔️	CVAT native format for video annotation
Datumaro	✔️	✔️	Universal format with full feature support
PASCAL VOC	✔️	✔️	Object detection, classification
PASCAL VOC Segmentation	✔️	✔️	Semantic segmentation
YOLO	✔️	✔️	Darknet YOLO object detection
MS COCO Object Detection	✔️	✔️	Object detection with MS COCO
MS COCO Keypoints	✔️	✔️	Keypoint/pose estimation
Cityscapes	✔️	✔️	Urban scene segmentation
MOT	✔️	✔️	Multi-object tracking
MOTS PNG	✔️	✔️	Multi-object tracking with segmentation
LabelMe	✔️	✔️	General-purpose annotation
ImageNet	✔️	✔️	Image classification
CamVid	✔️	✔️	Semantic segmentation for autonomous driving
WIDER Face	✔️	✔️	Face detection
VGGFace2	✔️	✔️	Face recognition
Market-1501	✔️	✔️	Person re-identification
ICDAR13/15	✔️	✔️	Text detection and recognition
Open Images V6	✔️	✔️	Large-scale object detection
KITTI	✔️	✔️	Autonomous driving benchmarks
KITTI Raw	✔️	✔️	Raw KITTI sensor data
LFW	✔️	✔️	Face verification
Supervisely Point Cloud	✔️	✔️	3D point cloud annotation
Ultralytics YOLO Detection	✔️	✔️	YOLOv8+ object detection
Ultralytics YOLO OBB	✔️	✔️	Oriented bounding boxes
Ultralytics YOLO Segmentation	✔️	✔️	Instance segmentation
Ultralytics YOLO Pose	✔️	✔️	Pose estimation
Ultralytics YOLO Classification	✔️	✔️	Image classification

Format Details

CVAT Formats

CVAT for images 1.1 and CVAT for video 1.1 are CVAT’s native XML-based formats. Features:

Full support for all CVAT annotation types
Preserves all metadata and attributes
Tracks, shapes, tags, and skeleton annotations
Best for backup and transfer between CVAT instances

When to use:

Backing up CVAT projects
Migrating tasks between CVAT servers
Preserving annotations completely, including attributes and metadata

Export example:

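from cvat_sdk import make_client

# Host, credentials, and task ID are placeholders; later examples assume a `task` obtained this way.
with make_client("http://localhost:8080", credentials=("user", "password")) as client:
    task = client.tasks.retrieve(123)
    task.export_dataset(format_name="CVAT for images 1.1", filename="cvat_backup.zip")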

See the CVAT XML format specification in the CVAT GitHub repository for details.

Datumaro

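Datumaro is a dataset management framework with its own format, and CVAT uses it to convert between formats. Converting to a less expressive format can drop information that the target format cannot represent. Features: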

Supports all CVAT annotation types
Python-based dataset manipulation
Format conversion and validation
Dataset versioning and comparison

When to use:

Complex dataset transformations
Converting between incompatible formats
Dataset analysis and statistics
Building custom ML pipelines

Export example:

task.export_dataset(
    format_name="Datumaro 1.0",
    filename="datumaro_dataset.zip",
    include_images=True
)

Learn more: Datumaro documentation

PASCAL VOC

PASCAL VOC is a classic format for object detection and segmentation. Features:

XML annotation files
Bounding boxes with classes
Segmentation masks (separate format)
Attributes stored in XML

When to use:

Object detection with classic frameworks
Academic research and benchmarks
Simple detection tasks

Structure:

dataset/
├── Annotations/
│   ├── image1.xml
│   └── image2.xml
├── JPEGImages/
│   ├── image1.jpg
│   └── image2.jpg
└── ImageSets/
    └── Main/
        └── train.txt
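
Each file under Annotations/ describes one image. As a rough sketch, assuming the conventional PASCAL VOC XML layout (object, name, and bndbox elements with xmin/ymin/xmax/ymax), the boxes can be read with the Python standard library. The sample XML and the helper name read_voc_boxes are illustrative and not part of CVAT.

```python
import xml.etree.ElementTree as ET

# Illustrative annotation in the conventional PASCAL VOC layout (made-up values).
SAMPLE_XML = """
<annotation>
  <filename>image1.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>person</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox>
  </object>
</annotation>
"""

def read_voc_boxes(xml_text):
    """Return (label, xmin, ymin, xmax, ymax) tuples from one VOC annotation."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        # Coordinates are parsed via float first in case a tool wrote "10.0".
        coords = [int(float(box.findtext(tag))) for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((obj.findtext("name"), *coords))
    return boxes

print(read_voc_boxes(SAMPLE_XML))
```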

Export example:

cvat-cli task export-dataset \
  --format "PASCAL VOC 1.1" \
  123 voc_dataset.zip

YOLO Formats

CVAT supports multiple YOLO format variants:

YOLO 1.1 (Darknet)

Original YOLO format with text-based annotations. Features:

One .txt file per image
Normalized bounding box coordinates
Class ID, center_x, center_y, width, height
Requires obj.names and obj.data files

Format:

# image1.txt
0 0.5 0.5 0.3 0.4
1 0.2 0.3 0.1 0.2
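
Because the coordinates are normalized, a consumer has to multiply them by the image size to get pixels. The sketch below converts one annotation line into pixel corner coordinates; the function name yolo_to_pixels and the 640x480 image size are assumptions made for illustration.

```python
def yolo_to_pixels(line, img_w, img_h):
    """Convert 'class cx cy w h' (normalized) to (class_id, left, top, right, bottom) in pixels."""
    class_id, cx, cy, w, h = line.split()
    # Scale the normalized center and size by the image dimensions.
    cx, cy = float(cx) * img_w, float(cy) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    # Move from center/size to corner coordinates.
    return int(class_id), cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

print(yolo_to_pixels("0 0.5 0.5 0.3 0.4", 640, 480))
```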

Export example:

task.export_dataset(
    format_name="YOLO 1.1",
    filename="yolo_dataset.zip"
)

Ultralytics YOLO

Modern YOLO format (YOLOv8+) with YAML configuration. Variants:

Detection: Bounding boxes for object detection
Segmentation: Polygon annotations for instance segmentation
OBB: Oriented bounding boxes for rotated objects
Pose: Keypoint annotations for pose estimation
Classification: Image-level labels

Structure:

dataset/
├── data.yaml
├── train/
│   ├── images/
│   └── labels/
└── valid/
    ├── images/
    └── labels/
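
A quick sanity check after export is to confirm that every image has a label file. The sketch below assumes the layout shown above and the usual Ultralytics convention that a label file shares the image's base name with a .txt extension; the helper name pair_images_with_labels is hypothetical, and an actual export may arrange its directories differently.

```python
from pathlib import Path

def pair_images_with_labels(root, split):
    """Return (image_path, label_path or None) pairs for one split of the layout above."""
    images_dir = Path(root) / split / "images"
    labels_dir = Path(root) / split / "labels"
    pairs = []
    for image in sorted(images_dir.iterdir()):
        # The label file is expected to share the image's base name.
        label = labels_dir / (image.stem + ".txt")
        pairs.append((image, label if label.exists() else None))
    return pairs
```

Calling pair_images_with_labels("dataset", "train") and filtering for pairs whose label is None lists the images that have no annotations.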

Export example:

task.export_dataset(
    format_name="Ultralytics YOLO Detection 1.0",
    filename="yolov8_dataset.zip"
)

Learn more: Ultralytics YOLO formats

COCO Formats

MS COCO is an industry-standard format for complex annotations.

COCO Object Detection

Features:

JSON-based annotations
Bounding boxes with categories
Image metadata (size, file name)
Supports attributes as custom fields
Crowd annotations for groups

Structure:

{
  "images": [{"id": 1, "file_name": "image1.jpg", "width": 640, "height": 480}],
  "annotations": [{"id": 1, "image_id": 1, "category_id": 1, "bbox": [x, y, w, h]}],
  "categories": [{"id": 1, "name": "person", "supercategory": "human"}]
}
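
The three top-level lists are linked by IDs: each annotation points to an image through image_id and to a category through category_id. The sketch below resolves those links and converts each [x, y, w, h] box to corner coordinates. The sample values and the helper name boxes_by_file are illustrative.

```python
import json
from collections import defaultdict

# Illustrative COCO-style document (made-up numbers) with the three top-level keys.
COCO_JSON = """
{
  "images": [{"id": 1, "file_name": "image1.jpg", "width": 640, "height": 480}],
  "annotations": [{"id": 1, "image_id": 1, "category_id": 1, "bbox": [100, 50, 30, 40]}],
  "categories": [{"id": 1, "name": "person", "supercategory": "human"}]
}
"""

def boxes_by_file(coco):
    """Map file name -> list of (category name, x_min, y_min, x_max, y_max)."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    files = {i["id"]: i["file_name"] for i in coco["images"]}
    result = defaultdict(list)
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]  # COCO stores the top-left corner plus width and height
        result[files[ann["image_id"]]].append((names[ann["category_id"]], x, y, x + w, y + h))
    return dict(result)

print(boxes_by_file(json.loads(COCO_JSON)))
```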

When to use:

Training modern detection models (Faster R-CNN, YOLO, etc.)
Complex multi-class detection tasks
Integration with popular ML frameworks

COCO Keypoints

Features:

Keypoint annotations for pose estimation
Skeleton definitions
Visibility flags per keypoint
Compatible with pose estimation models
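
In COCO, the keypoints of one object are stored as a flat list of x, y, v triples, where v is the visibility flag: 0 for not labeled, 1 for labeled but not visible, and 2 for labeled and visible. A minimal sketch of unpacking that list follows; the helper names and the example values are illustrative.

```python
def split_keypoints(flat):
    """Turn COCO's flat [x1, y1, v1, x2, y2, v2, ...] list into (x, y, v) triples."""
    if len(flat) % 3 != 0:
        raise ValueError("keypoints length must be a multiple of 3")
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]

def labeled_points(flat):
    """Keep only keypoints that were labeled (v > 0)."""
    return [(x, y, v) for x, y, v in split_keypoints(flat) if v > 0]

example = [120, 80, 2, 0, 0, 0, 130, 95, 1]
print(labeled_points(example))
```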

Export example:

task.export_dataset(
    format_name="COCO Keypoints 1.0",
    filename="coco_keypoints.zip"
)

Learn more: COCO format specification

Cityscapes

Cityscapes format for urban scene understanding. Features:

Semantic segmentation masks
Instance segmentation annotations
19 classes used in the standard Cityscapes evaluation of street scenes
Polygons and pixel-level masks

When to use:

Autonomous driving applications
Urban scene segmentation
Street-level computer vision

Export example:

task.export_dataset(
    format_name="Cityscapes 1.0",
    filename="cityscapes_dataset.zip"
)

Learn more: Cityscapes dataset

MOT Formats

MOT (Multiple Object Tracking) formats for tracking tasks.

MOT 1.1

Features:

Track annotations over time
CSV-based format
Frame number, track ID, bounding box
Compatible with the MOTChallenge benchmark

Format:

# frame, id, bb_left, bb_top, bb_width, bb_height, conf, class, visibility
1, 1, 100, 50, 30, 40, 1, 1, 1
2, 1, 105, 52, 30, 40, 1, 1, 1
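
Since every row carries a frame number and a track ID, rebuilding the tracks is a matter of grouping rows by ID. The sketch below does this with the standard csv module, using the column order from the example above; the helper name read_tracks is illustrative.

```python
import csv
from collections import defaultdict

SAMPLE = """1, 1, 100, 50, 30, 40, 1, 1, 1
2, 1, 105, 52, 30, 40, 1, 1, 1
"""

def read_tracks(text):
    """Group boxes by track ID: {track_id: [(frame, left, top, width, height), ...]}."""
    tracks = defaultdict(list)
    for row in csv.reader(text.splitlines(), skipinitialspace=True):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and any header comment
        frame, track_id = int(row[0]), int(row[1])
        left, top, width, height = (float(v) for v in row[2:6])
        tracks[track_id].append((frame, left, top, width, height))
    return dict(tracks)

print(read_tracks(SAMPLE))
```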

MOTS PNG

Features:

Instance segmentation tracks
PNG masks for each frame
Pixel-level tracking annotations
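
This page does not describe how the PNG pixels are encoded. The sketch below assumes the convention used by the MOTS benchmark, where each pixel value equals class_id * 1000 + instance_id; treat the constant and the helper names as assumptions and check them against an actual export before relying on them.

```python
MAX_INSTANCES = 1000  # assumed MOTS convention, not confirmed by this page

def encode_mots_pixel(class_id, instance_id, max_instances=MAX_INSTANCES):
    """Pack a class ID and an instance (track) ID into one pixel value."""
    return class_id * max_instances + instance_id

def decode_mots_pixel(value, max_instances=MAX_INSTANCES):
    """Split a pixel value back into (class_id, instance_id)."""
    return value // max_instances, value % max_instances

print(decode_mots_pixel(encode_mots_pixel(2, 5)))
```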

Export example (MOT 1.1):

cvat-cli task export-dataset \
  --format "MOT 1.1" \
  456 mot_tracks.zip

LabelMe

LabelMe 3.0 format for general-purpose annotation. Features:

XML annotations per image
Polygons, rectangles, and points
Flexible attribute system
Compatible with the web-based LabelMe annotation tool

When to use:

General object detection and segmentation
Research projects
Legacy LabelMe tool compatibility

Export example:

task.export_dataset(
    format_name="LabelMe 3.0",
    filename="labelme_annotations.zip"
)

ImageNet

ImageNet format for image classification. Features:

Directory-based class organization
Image-level labels only
Standard classification dataset structure

Structure:

dataset/
├── train/
│   ├── class1/
│   │   ├── img1.jpg
│   │   └── img2.jpg
│   └── class2/
│       └── img3.jpg
└── val/
    └── ...
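
With this layout the label of an image is simply the name of its parent directory. A minimal sketch of collecting (image, label) pairs for one split is shown below; the helper name list_labeled_images is illustrative.

```python
from pathlib import Path

def list_labeled_images(split_dir):
    """Return sorted (image_path, class_name) pairs, using each subdirectory name as the label."""
    samples = []
    for class_dir in sorted(Path(split_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files next to the class directories
        for image in sorted(class_dir.iterdir()):
            if image.is_file():
                samples.append((image, class_dir.name))
    return samples
```

For the tree above, list_labeled_images("dataset/train") would pair img1.jpg and img2.jpg with class1 and img3.jpg with class2.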

When to use:

Image classification tasks
Transfer learning
Training classification networks

CamVid

CamVid format for video segmentation. Features:

Semantic segmentation for video
11 or 32 predefined classes
Per-frame segmentation masks

When to use:

Video segmentation tasks
Autonomous driving research
Sequential scene understanding

WIDER Face

WIDER Face format for face detection. Features:

Face bounding boxes
Occlusion and pose attributes
Specialized for face detection benchmarks

When to use:

Face detection model training
Benchmarking face detectors
Large-scale face detection datasets

VGGFace2

VGGFace2 format for face recognition. Features:

Face identity labels
Bounding boxes and landmarks
Identity-based organization

When to use:

Face recognition training
Face verification tasks
Identity classification

Market-1501

Market-1501 format for person re-identification. Features:

Person bounding boxes
Identity labels across cameras
Track IDs for same person

When to use:

Person re-identification
Multi-camera tracking
Pedestrian recognition

ICDAR

ICDAR13/15 formats for text detection. Features:

Text region bounding boxes or polygons
Transcription labels
Oriented text support

When to use:

Scene text detection
OCR training data
Text recognition tasks

Open Images

Open Images V6 format for large-scale detection. Features:

Hierarchical label taxonomy
Image-level and object-level labels
Relationship annotations
Attributes and groups

When to use:

Large-scale detection tasks
Multi-label classification
Complex object relationships

KITTI Formats

KITTI formats for autonomous driving.

KITTI Detection

Features:

2D bounding boxes (3D cuboids are covered by KITTI Raw)
Object detection in driving scenes
Occlusion and truncation flags

KITTI Raw

Features:

Raw sensor data format
Calibration information
Multi-modal data (camera, LiDAR)

When to use:

Autonomous driving research
3D object detection
Sensor fusion tasks

LFW

Labeled Faces in the Wild format. Features:

Face verification pairs
Identity labels
Standard face recognition benchmark

When to use:

Face verification
Face recognition benchmarking

Supervisely Point Cloud

Supervisely Point Cloud Format for 3D annotation. Features:

3D bounding boxes
Point cloud annotations
3D object detection

When to use:

LiDAR data annotation
3D object detection
Autonomous driving 3D perception

Ultralytics YOLO

See YOLO Formats above for detailed information on Ultralytics YOLO variants.

Choosing the Right Format

Consider these factors when selecting a format:

By Task Type

Object Detection: COCO, YOLO, PASCAL VOC, Open Images
Instance Segmentation: COCO, YOLO Segmentation, MOTS PNG
Semantic Segmentation: Cityscapes, CamVid, PASCAL VOC Segmentation
Classification: ImageNet, YOLO Classification
Pose/Keypoints: COCO Keypoints, YOLO Pose
Tracking: MOT, MOTS PNG
3D Detection: KITTI Raw, Supervisely Point Cloud
Face Tasks: WIDER Face, VGGFace2, LFW
Text Detection: ICDAR

By Framework

PyTorch: COCO, YOLO, ImageNet
TensorFlow: COCO, PASCAL VOC
Darknet: YOLO 1.1
Ultralytics: Ultralytics YOLO variants
MMDetection: COCO
Detectron2: COCO

By Complexity

Simple: ImageNet, YOLO
Moderate: PASCAL VOC, LabelMe
Complex: COCO, Open Images, Datumaro

Format Limitations

Some formats have specific limitations:

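YOLO: Each variant stores a single annotation type (bounding boxes for YOLO 1.1 and Ultralytics Detection, polygons for Ultralytics Segmentation, and so on)
ImageNet: Only supports image-level classification
PASCAL VOC: Limited attribute support compared to CVAT
COCO: Shapes are stored as bounding boxes, polygons, or masks; other shapes such as ellipses must be converted first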
MOT: Primarily for tracking, limited object attributes

For format-specific conversions and workarounds, see Format Conversion.

Next Steps

Import & Export - Learn how to use these formats
Format Conversion - Convert between formats
Datumaro Documentation - Advanced dataset operations
