CVAT supports over 30 dataset formats for import and export, making it compatible with most computer vision workflows and frameworks. All format conversions are powered by the Datumaro framework.
Format Overview
The table below shows all supported formats with import/export capabilities:

| Format | Import | Export | Use Case |
|---|---|---|---|
| CVAT for images | ✔️ | ✔️ | CVAT native format for image annotation |
| CVAT for video | ✔️ | ✔️ | CVAT native format for video annotation |
| Datumaro | ✔️ | ✔️ | Universal format with full feature support |
| PASCAL VOC | ✔️ | ✔️ | Object detection, classification |
| PASCAL VOC Segmentation | ✔️ | ✔️ | Semantic segmentation |
| YOLO | ✔️ | ✔️ | Darknet YOLO object detection |
| MS COCO Object Detection | ✔️ | ✔️ | Object detection with MS COCO |
| MS COCO Keypoints | ✔️ | ✔️ | Keypoint/pose estimation |
| Cityscapes | ✔️ | ✔️ | Urban scene segmentation |
| MOT | ✔️ | ✔️ | Multi-object tracking |
| MOTS PNG | ✔️ | ✔️ | Multi-object tracking with segmentation |
| LabelMe | ✔️ | ✔️ | General-purpose annotation |
| ImageNet | ✔️ | ✔️ | Image classification |
| CamVid | ✔️ | ✔️ | Semantic segmentation for autonomous driving |
| WIDER Face | ✔️ | ✔️ | Face detection |
| VGGFace2 | ✔️ | ✔️ | Face recognition |
| Market-1501 | ✔️ | ✔️ | Person re-identification |
| ICDAR13/15 | ✔️ | ✔️ | Text detection and recognition |
| Open Images V6 | ✔️ | ✔️ | Large-scale object detection |
| KITTI | ✔️ | ✔️ | Autonomous driving benchmarks |
| KITTI Raw | ✔️ | ✔️ | Raw KITTI sensor data |
| LFW | ✔️ | ✔️ | Face verification |
| Supervisely Point Cloud | ✔️ | ✔️ | 3D point cloud annotation |
| Ultralytics YOLO Detection | ✔️ | ✔️ | YOLOv8+ object detection |
| Ultralytics YOLO OBB | ✔️ | ✔️ | Oriented bounding boxes |
| Ultralytics YOLO Segmentation | ✔️ | ✔️ | Instance segmentation |
| Ultralytics YOLO Pose | ✔️ | ✔️ | Pose estimation |
| Ultralytics YOLO Classification | ✔️ | ✔️ | Image classification |
Format Details
CVAT Formats
CVAT for images 1.1 and CVAT for video 1.1 are CVAT's native XML-based formats.

Features:
- Full support for all CVAT annotation types
- Preserves all metadata and attributes
- Tracks, shapes, tags, and skeleton annotations
- Best for backup and transfer between CVAT instances

Use cases:
- Backing up CVAT projects
- Migrating tasks between CVAT servers
- Any workflow that needs complete annotation preservation
Datumaro
Datumaro is a universal dataset framework providing lossless conversion between formats.

Features:
- Supports all CVAT annotation types
- Python-based dataset manipulation
- Format conversion and validation
- Dataset versioning and comparison

Use cases:
- Complex dataset transformations
- Converting between incompatible formats
- Dataset analysis and statistics
- Building custom ML pipelines
PASCAL VOC
PASCAL VOC is a classic format for object detection and segmentation.

Features:
- XML annotation files
- Bounding boxes with classes
- Segmentation masks (separate format)
- Attributes stored in XML

Use cases:
- Object detection with classic frameworks
- Academic research and benchmarks
- Simple detection tasks
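
To make the XML layout concrete, here is a minimal sketch that reads bounding boxes from a VOC-style annotation using only the Python standard library. The sample document, the file name `example.jpg`, and the helper `read_voc_boxes` are illustrative assumptions and are not part of CVAT.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the usual VOC layout (one <object> per box).
VOC_XML = """
<annotation>
  <filename>example.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>car</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
</annotation>
"""


def read_voc_boxes(xml_text):
    """Return a list of (class_name, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        coords = tuple(
            int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((obj.findtext("name"), coords))
    return boxes


boxes = read_voc_boxes(VOC_XML)
print(boxes)
```

VOC stores corner coordinates in pixels, so converting to a normalized or width/height-based format requires the image size from the `<size>` element.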
YOLO Formats
CVAT supports multiple YOLO format variants:

YOLO 1.1 (Darknet)
Original YOLO format with text-based annotations.

Features:
- One .txt file per image
- Normalized bounding box coordinates
- Class ID, center_x, center_y, width, height
- Requires obj.names and obj.data files
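
The coordinate convention listed above can be sketched as a small helper that turns a pixel-space box into one line of a label file. The function name `to_yolo_line` and the sample image size are illustrative assumptions, not part of CVAT or Darknet.

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Format a pixel-space box as 'class_id center_x center_y width height'.

    The four geometry values are normalized to [0, 1] by the image size.
    """
    center_x = (x_min + x_max) / 2 / img_w
    center_y = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {center_x:.6f} {center_y:.6f} {width:.6f} {height:.6f}"


# Hypothetical 640x480 image with a box from (160, 120) to (480, 360).
line = to_yolo_line(0, 160, 120, 480, 360, 640, 480)
print(line)  # 0 0.500000 0.500000 0.500000 0.500000
```

Because the values are normalized, the same label line stays valid if the image is resized without cropping.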
Ultralytics YOLO
Modern YOLO format (YOLOv8+) with YAML configuration.

Variants:
- Detection: Bounding boxes for object detection
- Segmentation: Polygon annotations for instance segmentation
- OBB: Oriented bounding boxes for rotated objects
- Pose: Keypoint annotations for pose estimation
- Classification: Image-level labels
COCO Formats
MS COCO is an industry-standard format for complex annotations.

COCO Object Detection

Features:
- JSON-based annotations
- Bounding boxes with categories
- Image metadata (size, file name)
- Supports attributes as custom fields
- Crowd annotations for groups

Use cases:
- Training modern detection models (Faster R-CNN, YOLO, etc.)
- Complex multi-class detection tasks
- Integration with popular ML frameworks
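
As a rough illustration of the JSON layout, the sketch below builds a one-image COCO-style dictionary and groups its boxes by file name. The ids, the file name, and the helper `boxes_by_file` are hypothetical; only the top-level keys (`images`, `categories`, `annotations`) and the `[x, y, width, height]` box convention come from the COCO format itself.

```python
import json

# Hypothetical one-image dataset; ids and file name are placeholders.
coco = {
    "images": [{"id": 1, "file_name": "example.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "car"}],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [48, 240, 147, 131],  # [x, y, width, height] in pixels
            "area": 147 * 131,
            "iscrowd": 0,  # 1 marks a crowd region
        }
    ],
}


def boxes_by_file(dataset):
    """Group (category_name, bbox) pairs under each image file name."""
    names = {cat["id"]: cat["name"] for cat in dataset["categories"]}
    files = {img["id"]: img["file_name"] for img in dataset["images"]}
    grouped = {}
    for ann in dataset["annotations"]:
        entry = (names[ann["category_id"]], ann["bbox"])
        grouped.setdefault(files[ann["image_id"]], []).append(entry)
    return grouped


# Round-trip through JSON to show the structure serializes unchanged.
grouped = boxes_by_file(json.loads(json.dumps(coco)))
print(grouped)
```

Annotations reference images and categories by id, so a loader typically builds lookup tables first, as `boxes_by_file` does.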
COCO Keypoints
Features:
- Keypoint annotations for pose estimation
- Skeleton definitions
- Visibility flags per keypoint
- Compatible with pose estimation models
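
The per-keypoint visibility flag can be illustrated with a short sketch. In COCO, the `keypoints` field is a flat list of `x, y, v` triples, where `v` is 0 (not labeled), 1 (labeled but not visible), or 2 (labeled and visible). The helper `split_keypoints` and the sample values are illustrative assumptions.

```python
def split_keypoints(flat, names):
    """Group a flat COCO 'keypoints' list into (name, x, y, visibility) tuples."""
    if len(flat) != 3 * len(names):
        raise ValueError("expected three values (x, y, v) per keypoint")
    return [
        (names[i], flat[3 * i], flat[3 * i + 1], flat[3 * i + 2])
        for i in range(len(names))
    ]


# Hypothetical three-point skeleton; the last point is not labeled (v = 0).
names = ["nose", "left_eye", "right_eye"]
flat = [320, 200, 2, 310, 190, 1, 0, 0, 0]

points = split_keypoints(flat, names)
labeled = [name for name, _x, _y, v in points if v > 0]
print(labeled)  # ['nose', 'left_eye']
```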
Cityscapes
Cityscapes format for urban scene understanding.

Features:
- Semantic segmentation masks
- Instance segmentation annotations
- 19 standard classes for street scenes
- Polygons and pixel-level masks

Use cases:
- Autonomous driving applications
- Urban scene segmentation
- Street-level computer vision
MOT Formats
MOT (Multiple Object Tracking) formats for tracking tasks.

MOT 1.1

Features:
- Track annotations over time
- CSV-based format
- Frame number, track ID, bounding box
- Compatible with MOT challenge
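
A minimal sketch of reading this CSV layout follows. It assumes the MOTChallenge column order, where the first six fields are frame number, track ID, and the box as left, top, width, height; any further columns are ignored here. The sample rows and the helper `load_tracks` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows: frame, track_id, left, top, width, height, extra columns.
MOT_TEXT = """1,1,100,120,40,80,1,1,1.0
1,2,300,110,50,90,1,1,1.0
2,1,104,121,40,80,1,1,1.0
"""


def load_tracks(text):
    """Return {track_id: [(frame, left, top, width, height), ...]}."""
    tracks = defaultdict(list)
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        frame, track_id = int(row[0]), int(row[1])
        left, top, width, height = (float(value) for value in row[2:6])
        tracks[track_id].append((frame, left, top, width, height))
    return dict(tracks)


tracks = load_tracks(MOT_TEXT)
print({track_id: len(obs) for track_id, obs in tracks.items()})  # {1: 2, 2: 1}
```

Grouping rows by track ID recovers each object's trajectory across frames, which is the main structural difference from per-image detection formats.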
MOTS PNG
Features:
- Instance segmentation tracks
- PNG masks for each frame
- Pixel-level tracking annotations
LabelMe
LabelMe 3.0 format for general-purpose annotation.

Features:
- XML annotations per image
- Polygons, rectangles, and points
- Flexible attribute system
- Compatible with the web-based LabelMe annotation tool

Use cases:
- General object detection and segmentation
- Research projects
- Legacy LabelMe tool compatibility
ImageNet
ImageNet format for image classification.

Features:
- Directory-based class organization
- Image-level labels only
- Standard classification dataset structure

Use cases:
- Image classification tasks
- Transfer learning
- Training classification networks
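
Directory-based class organization means the label of an image is simply the name of the folder that contains it. The sketch below builds a throwaway directory tree and indexes it. The class names, file names, and the helper `index_by_class` are illustrative assumptions.

```python
import tempfile
from pathlib import Path


def index_by_class(root):
    """Map each class directory name to the sorted file names inside it."""
    return {
        class_dir.name: sorted(f.name for f in class_dir.iterdir() if f.is_file())
        for class_dir in sorted(Path(root).iterdir())
        if class_dir.is_dir()
    }


# Build a hypothetical dataset layout: <root>/<class_name>/<image file>.
with tempfile.TemporaryDirectory() as tmp:
    for class_name, file_name in [("cat", "a.jpg"), ("cat", "b.jpg"), ("dog", "c.jpg")]:
        class_dir = Path(tmp) / class_name
        class_dir.mkdir(exist_ok=True)
        (class_dir / file_name).touch()
    index = index_by_class(tmp)

print(index)  # {'cat': ['a.jpg', 'b.jpg'], 'dog': ['c.jpg']}
```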
CamVid
CamVid format for video segmentation.

Features:
- Semantic segmentation for video
- 11 or 32 predefined classes
- Per-frame segmentation masks

Use cases:
- Video segmentation tasks
- Autonomous driving research
- Sequential scene understanding
WIDER Face
WIDER Face format for face detection.

Features:
- Face bounding boxes
- Occlusion and pose attributes
- Specialized for face detection benchmarks

Use cases:
- Face detection model training
- Benchmarking face detectors
- Large-scale face recognition
VGGFace2
VGGFace2 format for face recognition.

Features:
- Face identity labels
- Bounding boxes and landmarks
- Identity-based organization

Use cases:
- Face recognition training
- Face verification tasks
- Identity classification
Market-1501
Market-1501 format for person re-identification.

Features:
- Person bounding boxes
- Identity labels across cameras
- Track IDs for the same person

Use cases:
- Person re-identification
- Multi-camera tracking
- Pedestrian recognition
ICDAR
ICDAR13/15 formats for text detection.

Features:
- Text region bounding boxes or polygons
- Transcription labels
- Oriented text support

Use cases:
- Scene text detection
- OCR training data
- Text recognition tasks
Open Images
Open Images V6 format for large-scale detection.

Features:
- Hierarchical label taxonomy
- Image-level and object-level labels
- Relationship annotations
- Attributes and groups

Use cases:
- Large-scale detection tasks
- Multi-label classification
- Complex object relationships
KITTI Formats
KITTI formats for autonomous driving.

KITTI Detection

Features:
- 3D bounding boxes
- Object detection in driving scenes
- Occlusion and truncation flags
KITTI Raw
Features:
- Raw sensor data format
- Calibration information
- Multi-modal data (camera, LiDAR)

Use cases:
- Autonomous driving research
- 3D object detection
- Sensor fusion tasks
LFW
Labeled Faces in the Wild format.

Features:
- Face verification pairs
- Identity labels
- Standard face recognition benchmark

Use cases:
- Face verification
- Face recognition benchmarking
Supervisely Point Cloud
Supervisely Point Cloud format for 3D annotation.

Features:
- 3D bounding boxes
- Point cloud annotations
- 3D object detection

Use cases:
- LiDAR data annotation
- 3D object detection
- Autonomous driving 3D perception
Ultralytics YOLO
See YOLO Formats above for detailed information on Ultralytics YOLO variants.

Choosing the Right Format
Consider these factors when selecting a format:

By Task Type
- Object Detection: COCO, YOLO, Pascal VOC, Open Images
- Instance Segmentation: COCO, YOLO Segmentation, MOTS
- Semantic Segmentation: Cityscapes, CamVid, Pascal VOC Segmentation
- Classification: ImageNet, YOLO Classification
- Pose/Keypoints: COCO Keypoints, YOLO Pose
- Tracking: MOT, MOTS PNG
- 3D Detection: KITTI, Supervisely Point Cloud
- Face Tasks: WIDER Face, VGGFace2, LFW
- Text Detection: ICDAR
By Framework
- PyTorch: COCO, YOLO, ImageNet
- TensorFlow: COCO, Pascal VOC
- Darknet: YOLO 1.1
- Ultralytics: Ultralytics YOLO variants
- MMDetection: COCO
- Detectron2: COCO
By Complexity
- Simple: ImageNet, YOLO
- Moderate: Pascal VOC, LabelMe
- Complex: COCO, Open Images, Datumaro
Format Limitations
Some formats have specific limitations:
- YOLO: Only supports bounding boxes or polygons (depending on variant)
- ImageNet: Only supports image-level classification
- Pascal VOC: Limited attribute support compared to CVAT
- COCO: Polygons only (no ellipses without conversion)
- MOT: Primarily for tracking, limited object attributes
Next Steps
- Import & Export - Learn how to use these formats
- Format Conversion - Convert between formats
- Datumaro Documentation - Advanced dataset operations