2025 Model Training: YOLO11n vs YOLO11s
Inference latency: YOLO11n = YOLOv8n < YOLO11s (3.8 ms)
Accuracy: YOLO11s > YOLO11n > YOLOv8n
Size: YOLO11n < YOLOv8n < YOLO11s
Since this is a real-time system running inference on the airside computer, where hardware resources are limited, high latency can cause framerate drops as the video input worker queue fills up, leading to decisions based on outdated positions.
Our safest bet is YOLO11n, which has slightly higher accuracy and similar or lower latency (depending on the dataset) than YOLOv8n, the model proven to work last term.
However, YOLO11s can be noticeably more accurate than YOLO11n, at the cost of 10-20% higher latency than YOLOv8n.
It might be worth trying YOLO11s, since accuracy was more of an issue than latency last term, though YOLO11n might already be slightly more accurate than YOLOv8n.
This depends on our latency requirements and on how much latency changes when running inference on the airside hardware.
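Regardless of which model we pick, we can bound the damage from latency spikes by keeping only the newest frame in the input queue; a minimal sketch (the queue and function names are hypothetical, and a single producer is assumed):

```python
import queue

# Hold at most one frame so the consumer always sees the newest position.
frame_queue: queue.Queue = queue.Queue(maxsize=1)

def put_latest(frame) -> None:
    """Producer side: drop the stale frame instead of blocking when full."""
    try:
        frame_queue.put_nowait(frame)
    except queue.Full:
        try:
            frame_queue.get_nowait()  # discard the outdated frame
        except queue.Empty:
            pass  # consumer grabbed it first; queue now has room
        frame_queue.put_nowait(frame)
```

With this pattern a latency spike costs us intermediate frames rather than an ever-growing backlog, so the detector never acts on stale positions.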
Model performance can be highly dataset-dependent, as demonstrated in the results below, which use a dataset other than COCO.
Validation with a Roboflow dataset on an RTX 4070 (metric definitions at the end):
Model | Latency (ms) | mAP 50-95 | mAP 50 | Precision | Recall
---|---|---|---|---|---
YOLOv8n | 2.9 | 0.78 | 0.994 | 0.983 | 0.99
YOLO11n | 2.9 | 0.809 | 0.995 | 0.991 | 0.992
YOLO11s | 3.8 | 0.82 | 0.995 | 0.995 | 0.992
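A minimal sketch of how these numbers can be reproduced with the Ultralytics Python API, assuming a standard Roboflow-exported data.yaml (the dataset path and checkpoint filenames are placeholders for our trained weights):

```python
from ultralytics import YOLO

# Validate each trained checkpoint on the same Roboflow-exported dataset.
# "datasets/roboflow/data.yaml" and the .pt filenames are placeholders.
for weights in ("yolov8n.pt", "yolo11n.pt", "yolo11s.pt"):
    metrics = YOLO(weights).val(data="datasets/roboflow/data.yaml", plots=True)
    print(
        weights,
        f"latency={metrics.speed['inference']:.1f} ms",
        f"mAP50-95={metrics.box.map:.3f}",
        f"mAP50={metrics.box.map50:.3f}",
        f"precision={metrics.box.mp:.3f}",
        f"recall={metrics.box.mr:.3f}",
    )
```

With plots=True, validation also saves the confusion matrices and the precision/recall/F1 curves shown below to the runs/detect/val* directory.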
[Figure grid, one column per model (YOLOv8n, YOLO11n, YOLO11s): Confusion Matrix, Normalized Confusion Matrix, Precision-Recall Curve, F1-Confidence Curve, Precision-Confidence Curve, and Recall-Confidence Curve.]
Ultralytics Performance Metrics (COCO Dataset)
https://docs.ultralytics.com/models/yolo11/#performance-metrics
https://docs.ultralytics.com/models/yolov8/#performance-metrics
Roboflow leaderboard (COCO 2017): Computer Vision Model Leaderboard
Other research: Evaluating the Evolution of YOLO models
Definitions:
mAP val: mean Average Precision computed during the validation phase of model training (the area under the precision-recall curve, where precision measures the accuracy of positive predictions and recall measures whether every instance is detected).
mAP50: mean Average Precision at an Intersection over Union (IoU) threshold of 0.5; a detection counts as correct if the predicted bounding box overlaps the ground-truth box by at least 50%.
mAP50-95: mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
Speed ONNX (ms): inference time using ONNX Runtime (Open Neural Network Exchange, an open-source format for representing machine learning models); measured on CPU in the Ultralytics tables.
Speed T4 TensorRT10 (ms): inference time using NVIDIA TensorRT 10 on an NVIDIA T4 GPU.
FLOPs: floating-point operations (a measure of model complexity).
F1 score: harmonic mean of precision (accuracy of positive predictions) and recall (how well the model identifies all relevant instances); see the formulas below.
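Written out explicitly (these are the standard COCO/Ultralytics conventions):

```latex
\mathrm{mAP}_{50\text{-}95} = \frac{1}{10} \sum_{t \in \{0.50,\,0.55,\,\ldots,\,0.95\}} \mathrm{mAP}_t
\qquad
F_1 = \frac{2 \, P \, R}{P + R}
```

For example, YOLO11s in the table above gives F1 = 2(0.995)(0.992)/(0.995 + 0.992) ≈ 0.993.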
System Requirements: https://docs.ultralytics.com/help/FAQ/#what-are-the-system-requirements-for-running-ultralytics-models
For Raspberry Pi: https://docs.ultralytics.com/guides/raspberry-pi/
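If the airside computer ends up being a Raspberry Pi, the guide above recommends exporting to NCNN for the best on-device performance; a minimal sketch (the checkpoint filename is a placeholder for our trained weights):

```python
from ultralytics import YOLO

# Export the trained checkpoint (placeholder filename) to NCNN,
# the format the Raspberry Pi guide recommends for ARM CPUs.
model = YOLO("yolo11n.pt")
model.export(format="ncnn")  # creates a 'yolo11n_ncnn_model' directory

# Load the exported model and run a quick sanity-check inference.
ncnn_model = YOLO("yolo11n_ncnn_model")
results = ncnn_model("https://ultralytics.com/images/bus.jpg")
```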