We present the UTUAV dataset, which consists of three different scenes captured in Medellín, the second largest city of Colombia. Road user classes representative of those in emerging countries such as Colombia have been chosen: motorcycles (MC), light vehicles (LV) and heavy vehicles (HV).
The dataset was initially annotated by means of the Viper annotation tool. Subsequently, the annotations were converted to the Pascal VOC (XML) format (directories named "Annotations", bounding boxes in absolute coordinates xmin, xmax, ymin, ymax) and to the Ultralytics YOLOv8 format (directories named "labels"; each line holds a class label (0: motorbike, 1: LV, 2: HV) followed by a bounding box in normalised xywh coordinates relative to image width and height: xcentroid, ycentroid, width, height). The images are stored in directories named "images".
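For concreteness, the sketch below shows how one of these VOC boxes maps to a YOLO label line. It is a minimal illustration, not the converter we used, and the XML class-name strings ("motorbike", "LV", "HV") are assumptions based on the class list above.

```python
# Hedged sketch: convert the objects in one Pascal VOC XML file into
# YOLO label lines (class xc yc w h, normalised to image size).
import xml.etree.ElementTree as ET

CLASSES = {"motorbike": 0, "LV": 1, "HV": 2}  # assumed XML class names

def voc_to_yolo(xml_path):
    root = ET.parse(xml_path).getroot()
    img_w = float(root.findtext("size/width"))
    img_h = float(root.findtext("size/height"))
    lines = []
    for obj in root.iter("object"):
        cls = CLASSES[obj.findtext("name")]
        b = obj.find("bndbox")
        xmin, xmax = float(b.findtext("xmin")), float(b.findtext("xmax"))
        ymin, ymax = float(b.findtext("ymin")), float(b.findtext("ymax"))
        # YOLO stores the box centroid and size, normalised by image size
        xc = (xmin + xmax) / 2 / img_w
        yc = (ymin + ymax) / 2 / img_h
        bw = (xmax - xmin) / img_w
        bh = (ymax - ymin) / img_h
        lines.append(f"{cls} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
    return lines
```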
Publications:
If you use these datasets, please cite the following publication:
@article{felipe2025utuav,
  title={UTUAV: A Drone Dataset for Urban Traffic Analysis},
  author={Felipe, Lepin and Velastin, Sergio A and Le{\'o}n, Roberto and Garc{\'\i}a-Herrero, Jes{\'u}s and Rojas-Mart{\'\i}nez, Gonzalo and Espinosa-Oviedo, Jorge Ernesto},
  journal={Drones},
  volume={10},
  number={1},
  pages={15},
  year={2025},
  publisher={MDPI AG}
}
And for an earlier publication:

@article{espinosa2020detection,
  title={Detection of motorcycles in urban traffic using video analysis: A review},
  author={Espinosa, Jorge E and Velast{\'\i}n, Sergio A and Branch, John W},
  journal={IEEE Transactions on Intelligent Transportation Systems},
  volume={22},
  number={10},
  pages={6115--6130},
  year={2020},
  publisher={IEEE}
}
Researchers can use these datasets provided that it is only for research, not commercial, purposes. To access these datasets, please contact Prof. Sergio A Velastin with your name, institution and the purpose of your research.
UTUAV-A Dataset (road side)
This dataset is an extension of the Espinosa et al. dataset, which originally contained only annotated motorbikes in 10,000 frames with a resolution of 640x364 pixels. The images were taken from an unmanned aerial vehicle (UAV) elevated 4.5 meters above the ground. The UAV is kept in the same position, although small camera movements are noticeable. The extension adds annotations for light and heavy vehicles. The following table presents the main dataset characteristics, including the number of annotated vehicles, the mean area of the vehicle bounding boxes, the total number of occluded vehicles, the mean duration of total occlusions measured in frames, and the mean displacement in pixels of objects while they are occluded. Note that, due to the limited elevation and the capture angle of the sequence, occlusions appear frequently and object sizes change significantly.
(NB: the numbers are indicative only, as they might differ slightly from the final annotations; in this case there are 10,050 images.)
| Vehicle | Motorcycle | Light Vehicle | Heavy Vehicle |
|---|---|---|---|
| Number of annotations | 56,970 | 44,415 | 44,415 |
| Annotated objects | 318 | 159 | 16 |
| Mean Size (pixels) | 1,763 | 4,546 | 4,771 |
| Totally occluded objects | 7 | 6 | 3 |
| Mean occlusion duration (frames) | 8.1 | 2.9 | 269.3 |
| Mean occlusion displacement (pixels) | 31.2 | 6.3 | 369.4 |
Fully annotated dataset
The dataset is split 80:10:10 into training, validation and evaluation sub-sets. This also contains JSON files in COCO format that we have used to train and evaluate DETR models.
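As an illustration of what such a conversion involves, here is a hedged sketch that builds a minimal COCO-style dictionary from YOLO label files. The fixed image size, file layout and output name are assumptions; the JSON files shipped with the dataset are authoritative.

```python
# Hedged sketch: assemble a minimal COCO-format annotation file from
# YOLO labels. COCO boxes are [top_left_x, top_left_y, width, height]
# in absolute pixels, unlike YOLO's normalised centroid format.
import glob, json, os

W, H = 3840, 2160  # assumed image size (UTUAV-B/C); UTUAV-A differs
coco = {"images": [], "annotations": [],
        "categories": [{"id": 0, "name": "motorbike"},
                       {"id": 1, "name": "LV"},
                       {"id": 2, "name": "HV"}]}
ann_id = 0
for img_id, lbl in enumerate(sorted(glob.glob("labels/*.txt"))):
    fname = os.path.basename(lbl).replace(".txt", ".jpg")
    coco["images"].append({"id": img_id, "file_name": fname,
                           "width": W, "height": H})
    for line in open(lbl):
        c, xc, yc, bw, bh = map(float, line.split())
        w, h = bw * W, bh * H
        coco["annotations"].append({"id": ann_id, "image_id": img_id,
                                    "category_id": int(c),
                                    "bbox": [xc * W - w / 2, yc * H - h / 2, w, h],
                                    "area": w * h, "iscrowd": 0})
        ann_id += 1
with open("utuav_coco.json", "w") as f:
    json.dump(coco, f)
```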
Bird's-eye Datasets
Currently, we are concentrating on the following two sets ("B" and "C"), which are bird's-eye views of urban traffic taken from two different (but similar) heights over two different road topologies. These sets are useful to experiment with:
- Detection methods able to detect both very small and larger objects.
- Measuring generalisation capabilities, e.g. how a model trained on B performs on images from C and vice versa (a sketch follows this list).
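A minimal sketch of such a cross-scene experiment with the Ultralytics API (the weights path is hypothetical; the data YAML is the one shipped with the datasets):

```python
# Hedged sketch: train on B, then measure generalisation on C.
from ultralytics import YOLO

model = YOLO("runs/detect/train_B/weights/best.pt")  # model trained on UTUAV-B
metrics = model.val(data="C_Dataset.yaml", split="test")  # evaluate on UTUAV-C
print(metrics.box.map50)  # mAP@0.5 on the unseen scene
```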
Global Data
When you download these datasets, you will find the following sub-directories:
- Visualise: The images with added bounding boxes on the annotated objects (each class shown with a different colour)
- Annotations: The annotated objects in XML-encoded Pascal VOC format (absolute image coordinates)
- images: the JPEG images of that dataset
- labels: the YOLO (Ultralytics)-formatted annotations (ground truth) for those images
- You might also find the following files:
- *.xgtf: original annotations in Viper-GT format (historical)
- default_copy.yaml: hyperparameter configuration for YOLO (Ultralytics) training
- X_Dataset.yaml: X=B or C; Ultralytics definitions of file locations (train, val, test) and classes (an illustrative example is sketched below)
- Note that the downloadable datasets contain ALL images and labels (annotations) without distinguishing between training, validation and testing sub-sets. These are defined by "partitions" as explained below.
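For orientation, a dataset YAML in this layout typically looks like what the sketch below writes out; the paths are illustrative placeholders and the shipped B_Dataset.yaml is authoritative.

```python
# Hedged sketch: write an illustrative Ultralytics dataset definition
# in the spirit of B_Dataset.yaml (paths are placeholders).
from pathlib import Path

Path("B_Dataset.yaml").write_text(
    "path: /data/UTUAV/B   # dataset root (placeholder)\n"
    "train: images/train\n"
    "val: images/val\n"
    "test: images/test\n"
    "names:\n"
    "  0: motorbike\n"
    "  1: LV\n"
    "  2: HV\n"
)
```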
New! (April 2026): Annotations for static vehicles and vehicle IDs for tracking
We are now providing annotations for previously non-annotated stationary vehicles (thanks to Claudio Inal from Universidad Tecnica Federico Santa Maria, Chile). These are offered in two ways:
- Explicit separate classes for static motorbikes, static cars and static buses (large vehicles). This way, the datasets contain 6 classes (including the non-stationary vehicles).
- Since detection training and inference are likely to be based on separate static images, we also provide annotations where the stationary classes have been "merged" into their corresponding moving classes (e.g. stationary motorbikes simply become motorbikes); a sketch of such a merge appears below.
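A minimal sketch of such a merge on YOLO label files follows. The class-id convention (3, 4, 5 as the static variants of 0, 1, 2) and the directory names are assumptions; check the class list shipped with the 6-class labels.

```python
# Hedged sketch: collapse assumed static class ids into their moving
# counterparts, writing new 3-class YOLO label files.
import glob, os

MERGE = {3: 0, 4: 1, 5: 2}  # assumed static -> moving id mapping

os.makedirs("labels_3class", exist_ok=True)
for path in glob.glob("labels_6class/*.txt"):
    merged = []
    for line in open(path):
        parts = line.split()
        parts[0] = str(MERGE.get(int(parts[0]), int(parts[0])))
        merged.append(" ".join(parts))
    out = os.path.join("labels_3class", os.path.basename(path))
    open(out, "w").write("\n".join(merged) + "\n")
```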
We now provide data in the following formats:
- Traditional YOLO (Ultralytics) "labels" format in normalised coordinates (class centroid_x, centroid_y, width, height), one file per image
- XML pseudo Pascal VOC format, where we have added the attributes "track_id" (a unique number for each vehicle as it goes from one frame to the next) and "occlusion" (currently always the same value, kept for future annotation refinement), one file per image
- MOT format (as per the MOT challenge): a single file of comma-separated rows for each image: frame_id, track_id, top_left_x, top_left_y, width, height, confidence (always 1), class_id, visibility. These files have been generated from the pseudo Pascal VOC files; a sketch of the conversion appears after this list.
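The sketch below shows one way such MOT rows could be generated from the pseudo Pascal VOC files; the exact XML tag name for the track id and the numeric class ids are assumptions based on the description above.

```python
# Hedged sketch: emit MOT-challenge rows from pseudo Pascal VOC files.
import glob
import xml.etree.ElementTree as ET

CLASS_IDS = {"motorbike": 0, "LV": 1, "HV": 2}  # assumed class names/ids

with open("gt.txt", "w") as gt:
    for frame_id, xml_path in enumerate(sorted(glob.glob("Annotations/*.xml")), 1):
        for obj in ET.parse(xml_path).getroot().iter("object"):
            b = obj.find("bndbox")
            x, y = float(b.findtext("xmin")), float(b.findtext("ymin"))
            w = float(b.findtext("xmax")) - x
            h = float(b.findtext("ymax")) - y
            tid = obj.findtext("track_id")      # added attribute (assumed tag name)
            cid = CLASS_IDS[obj.findtext("name")]
            # frame_id, track_id, top_left_x, top_left_y, width, height,
            # confidence (always 1), class_id, visibility
            gt.write(f"{frame_id},{tid},{x:.1f},{y:.1f},{w:.1f},{h:.1f},1,{cid},1.0\n")
```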
Data Partitions
This refers to how the data is partitioned into training, validation and evaluation (test) sub-sets. We do not mix B with C, as one of the purposes is to check generalisation (see above). Originally, we "naively" created such partitions by random selection of frames. This, however, tends to "inflate" evaluation metrics, because the trained model has been exposed to similar images. This is aggravated by the fact that the images come from temporal sequences (videos). So, we provide the following partitioning schemes:
- Original: Where train/val/test frames were randomly selected from the global data (separately for B and C). This is mainly of historical interest.
- We then first separate out an evaluation (test) sub-set consisting of the last temporal segment of the video sequence. The same evaluation sub-set is used for all tests (for all partitions except "Original"). Training and validation sub-sets are then provided using the following approaches:
- Random: where training and validation images have been randomly selected from the remaining images
- Sequential: where the training sub-set is extracted from the start of the video sequence up to the point in time where it contains a given proportion of the frames; the frames that follow (after the evaluation set was extracted) form the validation sub-set
- Sequential_gap: where we drop a number of frames (100) around the dividing line between the training and validation frames and between the validation and evaluation frames.
- Sequential_gap_300: as above, but with a gap of 300 frames between the train/val and val/eval boundaries. This is the data we used in our Drones paper and what we recommend as a baseline; a sketch of the idea appears after the note below. The other possible partitions have been left as historical records.
- These data partitions are stored as simple text files (e.g. training.txt) that contain a list of image files
- The partitions for the B dataset are stored in a directory B_Split and those for C in C_Split
(Note: we use terminology such as "B_Sequential_Gap300Stat" to indicate: B dataset, Sequential_gap_300 partition, Stat: 6 classes; and "B_Sequential_Gap300" for the 3 "merged" classes.)
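To make the boundaries concrete, here is a hedged sketch of the Sequential_gap_300 idea; the exact proportions used to build the shipped partition files may differ, so treat the text files as authoritative.

```python
# Hedged sketch: evaluation = last temporal segment; 300-frame gaps are
# dropped at the train/val and val/eval boundaries.
def sequential_gap_split(frames, train_frac=0.8, eval_frac=0.1, gap=300):
    n = len(frames)
    eval_start = n - int(n * eval_frac)       # last segment -> evaluation
    usable = eval_start - gap                 # frames left of the val/eval gap
    train_end = int(usable * train_frac)      # illustrative proportioning
    train = frames[:train_end]
    val = frames[train_end + gap:eval_start - gap]
    evaluation = frames[eval_start:]
    return train, val, evaluation
```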
To generate Ultralytics-compatible images and labels directories using the above text files and the global datasets, we provide the Python script GenDataset.py (it works on Linux and by default creates symbolic links to the original global files to reduce storage requirements). The script can be downloaded from here.
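For readers who want to see the core idea without running it, here is a hedged re-sketch of what such a script does (directory names are illustrative; the shipped GenDataset.py is authoritative):

```python
# Hedged sketch: materialise one split of a partition by symlinking the
# listed images (and matching labels) from the global dataset.
import os

def materialise(list_file, split, src="global", dst="B_SeqGap300"):
    for name in open(list_file).read().split():
        stem, ext = os.path.splitext(name)
        for sub, e in (("images", ext), ("labels", ".txt")):
            os.makedirs(os.path.join(dst, sub, split), exist_ok=True)
            target = os.path.abspath(os.path.join(src, sub, stem + e))
            link = os.path.join(dst, sub, split, stem + e)
            if not os.path.lexists(link):
                os.symlink(target, link)  # links, not copies, to save storage

materialise("training.txt", "train")
```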
Segment GTs
We have also generated "segment" GTs compatible with Ultralytics (each object is represented by a polygon corresponding to the segment contour). We did this using Meta's SAM (version 1; perhaps later versions produce better results?). As with the normal (bounding box) annotations, the "labels" directory for segment GT contains the GT for all the images. You will then need to separate these into training, validation and evaluation sub-sets for a given partition (we recommend the Sequential_gap partition). Note that this has NOT been done for Sequential_gap_300.
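For illustration, the sketch below turns a binary mask (such as SAM produces) into a YOLO segmentation label line; it assumes OpenCV and a single object per mask, and is not the exact pipeline we used.

```python
# Hedged sketch: binary mask -> "class x1 y1 x2 y2 ..." segment line
# with polygon coordinates normalised to the image size.
import cv2
import numpy as np

def mask_to_segment_line(mask: np.ndarray, cls: int) -> str:
    h, w = mask.shape
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    poly = max(contours, key=cv2.contourArea).reshape(-1, 2)  # largest contour
    coords = " ".join(f"{x / w:.6f} {y / h:.6f}" for x, y in poly)
    return f"{cls} {coords}"
```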
Oriented Bounding Boxes (OBB) GTs
Using SAM-1, we have generated oriented bounding boxes (OBBs) for researchers to test whether these give better results than traditional axis-aligned bounding boxes. For more details, see the Ultralytics documentation for OBB. (Thanks to UC3M student Luis Garrido for doing this transformation.) Note that this has NOT been done for Sequential_gap_300.
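A hedged sketch of one way to derive such an OBB label from a segment polygon (Ultralytics OBB labels list the four corner points, normalised; this is an illustration, not the exact transformation used):

```python
# Hedged sketch: segment polygon -> minimum-area rotated rectangle ->
# "class x1 y1 x2 y2 x3 y3 x4 y4" OBB label line (normalised corners).
import cv2
import numpy as np

def polygon_to_obb_line(poly: np.ndarray, cls: int, w: int, h: int) -> str:
    rect = cv2.minAreaRect(poly.astype(np.float32))  # ((cx, cy), (bw, bh), angle)
    corners = cv2.boxPoints(rect)                    # 4 x 2 array of corners
    coords = " ".join(f"{x / w:.6f} {y / h:.6f}" for x, y in corners)
    return f"{cls} {coords}"
```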
UTUAV-B Dataset
Exploiting the top-view angle that a high-elevation UAV can reach (approx. 100 meters), the second dataset is composed of 6,500 labelled images with a resolution of 3840x2160 (4K) pixels. The entire sequence is captured with a top view, and turbulence or camera movement is hardly perceptible. The visualisation colours for the annotated bounding boxes (see pictures below) are red for light vehicles, blue for motorbikes and green for heavy vehicles.

| Vehicle | Motorcycle | Light Vehicle | Heavy Vehicle |
|---|---|---|---|
| Number of annotations | 70,064 | 331,508 | 18,864 |
| Annotated objects | 128 | 282 | 13 |
| Mean Size (pixels) | 992 | 3,318 | 5,882 |
| Totally occluded objects | 80 | 84 | 4 |
| Mean occlusion duration (frames) | 19.6 | 18.6 | 34.0 |
| Mean occlusion displacement (pixels) | 108.4 | 130.7 | 197.5 |
Segment GT dataset (99MB)
OBB GT dataset (5.3MB)
New (3- and 6-class annotations, detection and tracking) (57 MB) (images are downloaded using the "Full" link above)
UTUAV-C Dataset
The third dataset is a sequence of 10,560 frames, of which 6,600 are annotated, with a resolution of 3840x2160 (4K) pixels. This video sequence was also captured from a UAV, elevated 120 meters above the ground. The road configuration is different; these differences were introduced intentionally to test generalisation ability, e.g. detection on this dataset using a model trained with UTUAV-B. This dataset also uses a top-view angle and the same colour code for annotations as UTUAV-B.
(NB: the numbers are indicative only, as they differ from the final annotations; in this case there are only 6,600 annotated images.)
| Vehicle | Motorcycle | Light Vehicle | Heavy Vehicle |
|---|---|---|---|
| Number of annotations | 463,009 | 1,477,287 | 130,142 |
| Annotated objects | 456 | 997 | 86 |
| Mean Size (pixels) | 467 | 1,722 | 4,275 |
| Totally occluded objects | 211 | 265 | 31 |
| Mean occlusion duration (frames) | 89.9 | 86.8 | 110.8 |
| Mean occlusion displacement (pixels) | 226.9 | 210.1 | 260.3 |
Segment GT dataset (226MB)
OBB GT dataset (14MB)
New (3- and 6-class annotations, detection and tracking) (143 MB). Note: the 3-class version does not include stationary vehicles (images are downloaded using the "Full" link above)
For any queries related to these datasets, please contact Jorge Espinosa or Sergio A Velastin.

