
Features and Capabilities

Computer Vision Architecture: Object detection & tracking

Autonomous identification of intruders and explosives requires object detection and classification; the YOLO and R-CNN families of neural networks are commonly used for this purpose. Weighing model performance against ease of implementation, YOLOv5 was chosen: its average mAP over IoU thresholds from 0.5 to 0.95 is 37.2, and its average detection time on CPU is 98 ms. To ensure accuracy, over 1000 images will be used to train the network to identify both intruders and explosives. Once trained, the network is expected to output both bounding box and centroid coordinates, which are needed to locate the intruder or explosive in latitude and longitude and to navigate the drone towards it.

Note: mAP stands for mean Average Precision; it indicates the detection performance of the model.
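
As a concrete illustration, the sketch below shows how bounding boxes and centroids could be pulled from YOLOv5's output, assuming the model is loaded through PyTorch Hub; the weight file name is a placeholder, not the actual WARG artifact.

    # Minimal sketch: run YOLOv5 on a frame and extract bounding boxes and
    # centroids. The weight file name "intruder_explosive.pt" is an
    # illustrative placeholder.
    import torch

    model = torch.hub.load("ultralytics/yolov5", "custom", path="intruder_explosive.pt")

    def detect(frame):
        """Return (bounding box, centroid, confidence, class) per detection."""
        results = model(frame)
        detections = []
        # results.xyxy[0] is an N x 6 tensor: x1, y1, x2, y2, confidence, class
        for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
            centroid = ((x1 + x2) / 2, (y1 + y2) / 2)
            detections.append(((x1, y1, x2, y2), centroid, conf, int(cls)))
        return detections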

Computer Vision Architecture: Processing and Interfacing

WARG's CV system is a Python program. The processing system is composed of subsystems instantiated as singleton classes within a mediator class, which acts as the main entry point of the computer vision system. The subsystems include a video source module, which grabs frames from the video source; a processing module, which runs the appropriate neural network for each task; and location modules, which convert the neural network's output into GPS coordinates or relative locations. The system's output is an API that provides this location data to ZeroPilot's telemetry manager.
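
The sketch below illustrates this arrangement; the class and method names (VideoSource, Processor, Locator, Mediator) are assumptions for illustration, not the actual WARG module names.

    # Sketch of the mediator/singleton arrangement described above. All
    # class and method names are illustrative assumptions.
    class Singleton:
        _instances = {}

        def __new__(cls, *args, **kwargs):
            # Each subsystem class gets exactly one instance per process.
            if cls not in cls._instances:
                cls._instances[cls] = super().__new__(cls)
            return cls._instances[cls]

    class VideoSource(Singleton):
        def get_frame(self):
            raise NotImplementedError  # grab the next frame from the camera

    class Processor(Singleton):
        def run(self, frame):
            raise NotImplementedError  # run the appropriate neural network

    class Locator(Singleton):
        def to_gps(self, detection):
            raise NotImplementedError  # pixel coordinates -> GPS / relative

    class Mediator:
        """Main entry point: wires the subsystems together and exposes the
        location API consumed by the telemetry manager."""

        def __init__(self):
            self.video_source = VideoSource()
            self.processor = Processor()
            self.locator = Locator()

        def get_locations(self):
            frame = self.video_source.get_frame()
            return [self.locator.to_gps(d) for d in self.processor.run(frame)]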

Autonomous Identification of Humans

For autonomous identification of humans, three major components work together to make the entire system function: intruder detection, intruder tracking, and tracking-based driving.

The intruder detection component is based on a deep learning approach. We will collect custom data of people (intruders) walking and running as seen from an aerial view, and train our own YOLOv5 model starting from the pre-trained weights provided by Ultralytics, which were trained on the MS COCO dataset. With this component, the drone will have the capability to detect objects in the camera's field of view in real time. The component takes in the video stream from the camera and returns the bounding box of the intruder whenever they are in the field of view.
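
A minimal sketch of such a detection loop, assuming OpenCV for frame capture and a fine-tuned YOLOv5 model loaded via PyTorch Hub; the weight file name and class index are placeholders.

    # Sketch of the real-time detection loop: read frames from the camera
    # stream and yield intruder bounding boxes. The weight file name and
    # class index are illustrative assumptions.
    import cv2
    import torch

    model = torch.hub.load("ultralytics/yolov5", "custom", path="intruder.pt")
    INTRUDER_CLASS = 0  # assumed index of the intruder/person class

    def intruder_bboxes(capture):
        """Yield (frame, boxes) pairs; boxes is empty when no intruder is seen."""
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            results = model(frame)
            boxes = [det[:4] for det in results.xyxy[0].tolist()
                     if int(det[5]) == INTRUDER_CLASS]
            yield frame, boxes

    # Usage, with the default camera as a stand-in for the drone's stream:
    # for frame, boxes in intruder_bboxes(cv2.VideoCapture(0)): ...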

The intruder tracking component is based on Simple Online and Realtime Tracking with a Deep Association Metric (DeepSORT for short). Compared to other tracking techniques such as SORT, Boosting, the KCF tracker, and the GOTURN tracker, DeepSORT can track objects through longer periods of occlusion, effectively reducing the number of identity switches. The intruder tracking component takes the bounding box produced by the intruder detection component as its starting point and tracks the movement of the intruder as long as they are in the field of view.
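
As an illustration, the sketch below feeds detector output into DeepSORT via the third-party deep-sort-realtime package; this is one common DeepSORT implementation, and the parameters shown are assumptions rather than WARG's tuned values.

    # Sketch: hand detections to DeepSORT so each intruder keeps a stable
    # track ID through occlusions. Uses the deep-sort-realtime package
    # (one common DeepSORT implementation; the actual code may differ).
    from deep_sort_realtime.deepsort_tracker import DeepSort

    tracker = DeepSort(max_age=30)  # keep lost tracks alive across occlusion

    def track_intruders(frame, detections):
        """detections: list of ([left, top, width, height], confidence, class)."""
        tracks = tracker.update_tracks(detections, frame=frame)
        for track in tracks:
            if not track.is_confirmed():
                continue  # skip tentative tracks with too few detections
            # The persistent track_id is what reduces identity switches.
            yield track.track_id, track.to_ltrb()  # (left, top, right, bottom)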

The tracking-based driving component is an assisting module for the intruder tracking component. While the tracking component follows the intruder, the coordinates and size of the intruder's bounding box change in real time. The tracking-based driving component takes this bounding box information and calculates how large the bounding box is and where its centre lies relative to the image. If the bounding box is large compared to the image, the intruder is close to the drone and the drone needs to back off, and vice versa. If the bounding box is not centred in the image, the drone needs to translate sideways or up and down so that the intruder remains in the centre of the image. The component then sends out commands to drive the drone based on this logic.
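
A minimal sketch of this logic follows; the thresholds and command names are illustrative placeholders, and a real controller would emit ZeroPilot commands with tuned gains.

    # Sketch of the tracking-based driving logic. Thresholds and command
    # names are illustrative assumptions, not tuned values.
    def driving_commands(bbox, frame_w, frame_h,
                         near=0.4, far=0.1, deadband=0.1):
        """bbox is (left, top, right, bottom) in pixels."""
        left, top, right, bottom = bbox
        area_ratio = (right - left) * (bottom - top) / (frame_w * frame_h)

        commands = []
        # Large box relative to the image -> intruder is close -> back off.
        if area_ratio > near:
            commands.append("back_off")
        elif area_ratio < far:
            commands.append("approach")

        # Off-centre box -> translate to re-centre the intruder.
        # Note: image y grows downward, so dy > 0 means the box sits low.
        dx = ((left + right) / 2 - frame_w / 2) / frame_w
        dy = ((top + bottom) / 2 - frame_h / 2) / frame_h
        if abs(dx) > deadband:
            commands.append("right" if dx > 0 else "left")
        if abs(dy) > deadband:
            commands.append("down" if dy > 0 else "up")
        return commands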
