2025 Model Training: YOLO11n vs YOLO11s
Inference latency: YOLO11n = YOLOv8n < YOLO11s (3.8 ms)
Accuracy: YOLO11s > YOLO11n > YOLOv8n
Size: YOLO11n < YOLOv8n < YOLO11s
Since this is a real-time system running inference on the airside computer, where hardware resources are limited, high latency can cause framerate drops as the video input worker queue fills up, leading to decisions based on outdated positions.
Our safest bet is YOLO11n, which has slightly higher accuracy and similar or lower latency (depending on the dataset) than YOLOv8n, the model proven to work last term.
However, YOLO11s can be noticeably more accurate than YOLO11n, at the cost of 10-20% higher latency than YOLOv8n.
It might be worth trying YOLO11s, since accuracy was more of an issue than latency last term, though YOLO11n might already be slightly more accurate than YOLOv8n.
This depends on our latency requirements and on how much latency changes when running inference on the airside hardware.
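Regardless of which model we pick, we can bound the damage from latency spikes by keeping only the newest frame in the input queue; a minimal sketch (the queue and function names are hypothetical, and a single producer is assumed):

```python
import queue

# Hold at most one frame so the consumer always sees the newest position.
frame_queue: queue.Queue = queue.Queue(maxsize=1)

def put_latest(frame) -> None:
    """Producer side: drop the stale frame instead of blocking when full."""
    try:
        frame_queue.put_nowait(frame)
    except queue.Full:
        try:
            frame_queue.get_nowait()  # discard the outdated frame
        except queue.Empty:
            pass  # consumer grabbed it first; queue now has room
        frame_queue.put_nowait(frame)
```

With this pattern a latency spike costs us intermediate frames rather than an ever-growing backlog, so the detector never acts on stale positions.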
Model performance can be highly dataset-dependent, as demonstrated in the results below, which use a dataset other than COCO.
Validation with a Roboflow dataset on an RTX 4070 (metric definitions at the end):
Model | Latency (ms) | mAP 50-95 | mAP 50 | Precision | Recall
---|---|---|---|---|---
YOLOv8n | 2.9 | 0.78 | 0.994 | 0.983 | 0.99
YOLO11n | 2.9 | 0.809 | 0.995 | 0.991 | 0.992
YOLO11s | 3.8 | 0.82 | 0.995 | 0.995 | 0.992
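A minimal sketch of how these numbers can be reproduced with the Ultralytics Python API, assuming a standard Roboflow-exported data.yaml (the dataset path and checkpoint filenames are placeholders for our trained weights):

```python
from ultralytics import YOLO

# Validate each trained checkpoint on the same Roboflow-exported dataset.
# "datasets/roboflow/data.yaml" and the .pt filenames are placeholders.
for weights in ("yolov8n.pt", "yolo11n.pt", "yolo11s.pt"):
    metrics = YOLO(weights).val(data="datasets/roboflow/data.yaml", plots=True)
    print(
        weights,
        f"latency={metrics.speed['inference']:.1f} ms",
        f"mAP50-95={metrics.box.map:.3f}",
        f"mAP50={metrics.box.map50:.3f}",
        f"precision={metrics.box.mp:.3f}",
        f"recall={metrics.box.mr:.3f}",
    )
```

With plots=True, validation also saves the confusion matrices and the precision/recall/F1 curves shown below to the runs/detect/val* directory.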
[Figure grid, one column per model (YOLOv8n, YOLO11n, YOLO11s): Confusion Matrix, Normalized Confusion Matrix, Precision-Recall Curve, F1-Confidence Curve, Precision-Confidence Curve, and Recall-Confidence Curve.]
Ultralytics Performance Metrics (COCO Dataset)
https://docs.ultralytics.com/models/yolo11/#performance-metrics
https://docs.ultralytics.com/models/yolov8/#performance-metrics
Roboflow leaderboard (COCO 2017): Computer Vision Model Leaderboard
Other research: Evaluating the Evolution of YOLO models
Definitions:
mAP val: mean Average Precision computed during the validation phase of model training (the area under the precision-recall curve, where precision measures the accuracy of positive predictions and recall measures whether every instance is detected).
mAP50: mean Average Precision at an Intersection over Union (IoU) threshold of 0.5; a detection counts as correct if the predicted bounding box overlaps the ground-truth box by at least 50%.
mAP50-95: mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
Speed ONNX (ms): inference time using ONNX Runtime (Open Neural Network Exchange, an open-source format for representing machine learning models); measured on CPU in the Ultralytics tables.
Speed T4 TensorRT10 (ms): inference time using NVIDIA TensorRT 10 on an NVIDIA T4 GPU.
FLOPs: floating-point operations (a measure of model complexity).
F1 score: harmonic mean of precision (accuracy of positive predictions) and recall (how well the model identifies all relevant instances); see the formulas below.
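Written out explicitly (these are the standard COCO/Ultralytics conventions):

```latex
\mathrm{mAP}_{50\text{-}95} = \frac{1}{10} \sum_{t \in \{0.50,\,0.55,\,\ldots,\,0.95\}} \mathrm{mAP}_t
\qquad
F_1 = \frac{2 \, P \, R}{P + R}
```

For example, YOLO11s in the table above gives F1 = 2(0.995)(0.992)/(0.995 + 0.992) ≈ 0.993.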
System Requirements: https://docs.ultralytics.com/help/FAQ/#what-are-the-system-requirements-for-running-ultralytics-models
For Raspberry Pi: https://docs.ultralytics.com/guides/raspberry-pi/
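If the airside computer ends up being a Raspberry Pi, the guide above recommends exporting to NCNN for the best on-device performance; a minimal sketch (the checkpoint filename is a placeholder for our trained weights):

```python
from ultralytics import YOLO

# Export the trained checkpoint (placeholder filename) to NCNN,
# the format the Raspberry Pi guide recommends for ARM CPUs.
model = YOLO("yolo11n.pt")
model.export(format="ncnn")  # creates a 'yolo11n_ncnn_model' directory

# Load the exported model and run a quick sanity-check inference.
ncnn_model = YOLO("yolo11n_ncnn_model")
results = ncnn_model("https://ultralytics.com/images/bus.jpg")
```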