Tech Report - Ray
Features and Capabilities
Computer Vision Architecture: Object detection & tracking
Autonomous identification of intruders and explosives requires object detection and classification, for which the YOLO and R-CNN families of neural networks are commonly used. Weighing model performance against ease of implementation, we deployed YOLOv5, which achieves an mAP of 37.2 averaged over IoU thresholds from 0.5 to 0.95, with an average CPU detection time of 98 ms. To improve accuracy, the network will be trained on more than 1,000 images covering both intruders and explosives. Once trained, it is expected to output a bounding box and centroid coordinates for each detection; these are needed both to locate the intruder or explosive in latitude and longitude and to navigate the drone towards it.
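Note: mAP stands for mean Average Precision, a standard measure of object detection accuracy (higher is better). The mAP@0.5:0.95 figure averages the AP over ten intersection-over-union (IoU) thresholds, from 0.5 to 0.95 in steps of 0.05, so it rewards precise box placement as well as correct detections.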
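To make the centroid and the IoU thresholds concrete, here is a minimal sketch, assuming boxes are given as (x1, y1, x2, y2) pixel corners, which is the layout YOLOv5 reports in its xyxy output. The helper names are illustrative only. It computes a box centroid, the IoU between a prediction and a ground-truth box, and the thresholds at which that prediction would count as a match (assuming its class is also correct).

```python
# Illustrative helpers: box centroid, IoU, and the ten IoU thresholds behind mAP@0.5:0.95.
# Boxes are assumed to be (x1, y1, x2, y2) pixel corners.

def centroid(box):
    """Return the (cx, cy) centre of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# mAP@0.5:0.95 averages AP over these ten thresholds: 0.5, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def matched_at(pred, truth):
    """Thresholds at which a (correctly classified) prediction matches a ground-truth box."""
    overlap = iou(pred, truth)
    return [t for t in IOU_THRESHOLDS if overlap >= t]


if __name__ == "__main__":
    pred, truth = (100, 100, 200, 200), (110, 100, 210, 200)
    print(centroid(pred))              # (150.0, 150.0)
    print(round(iou(pred, truth), 3))  # 0.818
    print(matched_at(pred, truth))     # [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8]
```

A box that is slightly offset still counts at the looser thresholds but not the strict ones, which is why mAP@0.5:0.95 is lower than mAP at 0.5 alone.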
Computer Vision Architecture: Processing and Interfacing
WARG's CV system is a Python program. It is composed of subsystems, each instantiated once (singleton pattern) within a mediator class that serves as the main entry point of the computer vision system. The subsystems are a video source module, which grabs frames from the video source; a processing module, which runs the appropriate neural network for each task; and location modules, which turn the neural network's output into GPS coordinates or relative locations. The system's output is an API that provides this location data to Zeropilot's telemetry manager.
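The mediator layout can be sketched as follows. This is a hypothetical illustration, not WARG's code: all class and method names are invented, the neural network is stubbed out, and the location module assumes a downward-facing pinhole camera with square pixels, a hypothetical 60° horizontal field of view, flat ground, and an image top aligned with the drone's heading. It converts a box centroid to latitude and longitude with a small-offset approximation.

```python
import math
from dataclasses import dataclass

# Hypothetical sketch of the mediator layout; names are illustrative only.

EARTH_RADIUS_M = 6_378_137.0  # WGS-84 equatorial radius in metres


@dataclass
class DronePose:
    lat: float          # degrees
    lon: float          # degrees
    altitude_m: float   # height above (assumed flat) ground
    heading_deg: float  # 0 = north, increasing clockwise


class VideoSource:
    """Grabs frames; this stub only reports a fixed frame size."""
    def get_frame(self):
        return {"width": 640, "height": 480}


class Processing:
    """Runs the neural network; this stub returns one (x1, y1, x2, y2) box."""
    def detect(self, frame):
        return [(400, 200, 480, 280)]


class Location:
    """Pixel centroid -> lat/lon for an assumed downward-facing camera over flat ground."""
    def __init__(self, hfov_deg=60.0):
        self.hfov_deg = hfov_deg  # hypothetical horizontal field of view

    def to_gps(self, box, frame, pose):
        cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
        ground_width = 2 * pose.altitude_m * math.tan(math.radians(self.hfov_deg) / 2)
        m_per_px = ground_width / frame["width"]
        right = (cx - frame["width"] / 2) * m_per_px     # metres to the drone's right
        forward = (frame["height"] / 2 - cy) * m_per_px  # metres ahead of the drone
        h = math.radians(pose.heading_deg)
        north = forward * math.cos(h) - right * math.sin(h)
        east = forward * math.sin(h) + right * math.cos(h)
        # Small-offset approximation: metres -> degrees.
        dlat = math.degrees(north / EARTH_RADIUS_M)
        dlon = math.degrees(east / (EARTH_RADIUS_M * math.cos(math.radians(pose.lat))))
        return pose.lat + dlat, pose.lon + dlon


class Mediator:
    """Main entry point; a singleton that owns one instance of each subsystem."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            inst = super().__new__(cls)
            inst.video = VideoSource()
            inst.processing = Processing()
            inst.location = Location()
            cls._instance = inst
        return cls._instance

    def step(self, pose):
        """One pass: frame -> detections -> GPS coordinates for the telemetry API."""
        frame = self.video.get_frame()
        return [self.location.to_gps(box, frame, pose)
                for box in self.processing.detect(frame)]


if __name__ == "__main__":
    pose = DronePose(lat=43.47, lon=-80.54, altitude_m=30.0, heading_deg=0.0)
    print(Mediator().step(pose))
```

Routing every call through the mediator keeps the subsystems unaware of each other, so a real video source or detector can replace a stub without touching the other modules.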
Autonomous Identification of humans:
Autonomous identification of humans relies on three components working together: intruder detection, intruder tracking and tracking-based driving.
The intruder detection component uses a deep learning approach. We will collect custom aerial-view data of people (intruders) walking and running, and train our own YOLOv5 model starting from the Ultralytics weights pre-trained on the MS COCO dataset. This component gives the drone the ability to detect objects in the onboard camera's field of view in real time: it takes in the camera's video stream and returns the bounding box of the intruder whenever one is in view.
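With YOLOv5, custom weights are typically loaded with `torch.hub.load("ultralytics/yolov5", "custom", path=...)`, and calling the model on a frame returns a results object whose `xyxy[0]` holds one row per detection. The sketch below covers only the last step, picking the intruder's box from those rows (converted to a list, e.g. via `.tolist()`). It assumes rows of (x1, y1, x2, y2, confidence, class_id); the intruder class index 0 (which is "person" in COCO) and the 0.5 confidence cut-off are assumptions.

```python
# Sketch: choose the intruder's box from one frame's detections.
# Rows are assumed to be (x1, y1, x2, y2, confidence, class_id).

INTRUDER_CLASS = 0    # "person" in COCO; assumed index for the custom model
CONF_THRESHOLD = 0.5  # hypothetical confidence cut-off


def intruder_box(detections, conf_threshold=CONF_THRESHOLD):
    """Return the highest-confidence intruder box, or None if no intruder is in view."""
    candidates = [d for d in detections
                  if int(d[5]) == INTRUDER_CLASS and d[4] >= conf_threshold]
    if not candidates:
        return None
    best = max(candidates, key=lambda d: d[4])
    return tuple(float(v) for v in best[:4])


if __name__ == "__main__":
    frame_detections = [
        (120, 80, 180, 240, 0.91, 0),   # intruder
        (400, 300, 460, 360, 0.88, 1),  # some other class
        (10, 10, 40, 90, 0.30, 0),      # low-confidence person, ignored
    ]
    print(intruder_box(frame_detections))  # (120.0, 80.0, 180.0, 240.0)
```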
The intruder tracking component is based on Simple Online and Realtime Tracking with a Deep Association Metric (DeepSORT). Compared with other tracking techniques such as SORT, the Boosting tracker, the KCF tracker and the GOTURN tracker, DeepSORT can keep tracking objects through longer periods of occlusion, which reduces the number of identity switches. The intruder tracking component takes the bounding box produced by the intruder detection component as its starting point and tracks the intruder's movement for as long as they remain in the field of view.
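The idea of keeping an identity through occlusion can be illustrated with a much simpler tracker. The sketch below is a deliberately reduced stand-in and is not DeepSORT: it has no Kalman filter and no appearance features, only greedy IoU matching. What it does show is the track bookkeeping: each track keeps a persistent ID and survives up to `max_age` frames without a matching detection. The class names and thresholds are hypothetical.

```python
# Simplified stand-in for DeepSORT-style track management (NOT DeepSORT itself):
# greedy IoU matching, persistent IDs, and a max_age budget for missed frames.


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class Track:
    def __init__(self, track_id, box):
        self.id = track_id
        self.box = box
        self.misses = 0  # consecutive frames without a matching detection


class SimpleTracker:
    def __init__(self, iou_threshold=0.3, max_age=30):
        self.iou_threshold = iou_threshold
        self.max_age = max_age
        self.tracks = []
        self._next_id = 1

    def update(self, detections):
        """Match this frame's boxes to tracks; return {track_id: box} for tracks seen now."""
        unmatched = list(detections)
        for track in self.tracks:
            best = max(unmatched, key=lambda d, t=track: iou(t.box, d), default=None)
            if best is not None and iou(track.box, best) >= self.iou_threshold:
                track.box, track.misses = best, 0
                unmatched.remove(best)
            else:
                track.misses += 1  # occluded or out of view this frame
        # Forget tracks that have been missing for more than max_age frames.
        self.tracks = [t for t in self.tracks if t.misses <= self.max_age]
        for box in unmatched:  # unmatched detections start new tracks
            self.tracks.append(Track(self._next_id, box))
            self._next_id += 1
        return {t.id: t.box for t in self.tracks if t.misses == 0}


if __name__ == "__main__":
    tracker = SimpleTracker()
    print(tracker.update([(100, 100, 200, 300)]))  # {1: (100, 100, 200, 300)}
    print(tracker.update([]))                      # {} (intruder occluded)
    print(tracker.update([(110, 100, 210, 300)]))  # {1: (110, 100, 210, 300)}
```

DeepSORT replaces the plain IoU test with a motion model plus an appearance descriptor, which is what lets it bridge longer occlusions than this sketch can.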
The tracking-based driving component assists the intruder tracking component. While the intruder is being tracked, the position and size of its bounding box change in real time. The tracking-based driving component uses this bounding box information to calculate how large the bounding box is and where its centre lies relative to the image. If the bounding box is large relative to the image, the intruder is close to the drone and the drone needs to back off; if it is small, the drone needs to move closer. If the bounding box is not centred in the image, the drone needs to move sideways or up and down so that the intruder stays in the centre of the image. The component then sends drive commands to the drone based on this logic.
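The driving logic above can be sketched as a single function. This assumes a forward-facing camera (so an off-centre box maps to sideways and vertical motion); the target box-to-image area ratio, the tolerances and the command names are all hypothetical. The tolerance bands act as dead zones so the drone does not oscillate around the target position.

```python
# Sketch of tracking-based driving for an assumed forward-facing camera.
# Target area, tolerances and command names are hypothetical.

def drive_command(box, frame_w, frame_h, target_area=0.10, area_tol=0.03, center_tol=0.05):
    """Map an (x1, y1, x2, y2) box to a list of drive commands."""
    x1, y1, x2, y2 = box
    area_ratio = (x2 - x1) * (y2 - y1) / (frame_w * frame_h)
    # Offset of the box centre from the image centre, normalised to [-0.5, 0.5].
    dx = ((x1 + x2) / 2) / frame_w - 0.5
    dy = ((y1 + y2) / 2) / frame_h - 0.5
    cmds = []
    if area_ratio > target_area + area_tol:
        cmds.append("back_off")   # box too large: intruder too close
    elif area_ratio < target_area - area_tol:
        cmds.append("approach")   # box too small: intruder too far
    if dx > center_tol:
        cmds.append("move_right")
    elif dx < -center_tol:
        cmds.append("move_left")
    if dy > center_tol:
        cmds.append("move_down")  # image y grows downward
    elif dy < -center_tol:
        cmds.append("move_up")
    return cmds or ["hold"]


if __name__ == "__main__":
    print(drive_command((224, 160, 416, 320), 640, 480))  # ['hold']
    print(drive_command((560, 220, 600, 260), 640, 480))  # ['approach', 'move_right']
```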
QR Scanner methodology for pickup/dropoff:
In the computer vision codebase, the QR scanner extracts two kinds of information and distinguishes between them: the latitude and longitude of the intruder's break-in location, and information for the pilot (questions, date, time, device ID and sensor ID). The Python packages “pyzbar” and “opencv” are used to locate the QR code in the image frame and decode it into a string of fields separated by semicolons. The extracted information is then sent through output pipelines to the flight controller and to the ground station for the pilot.
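With pyzbar, `pyzbar.pyzbar.decode(image)` returns the decoded symbols in a frame, each with a `data` field (bytes) and a `rect` giving its position; `data.decode("utf-8")` yields the semicolon-separated string. The sketch below covers only the parsing and differentiation step. The two payload layouts are assumptions, since the exact format is not specified here: a location payload of two numeric fields ("latitude;longitude") and a pilot payload of five fields in the order listed above.

```python
# Sketch: parse a semicolon-separated QR payload and tell the two kinds apart.
# Assumed (hypothetical) layouts:
#   location:   "<latitude>;<longitude>"
#   pilot info: "<question>;<date>;<time>;<device id>;<sensor id>"

PILOT_KEYS = ("question", "date", "time", "device_id", "sensor_id")


def parse_qr_payload(text):
    """Return a dict tagged 'location' or 'pilot_info'; raise ValueError otherwise."""
    fields = [f.strip() for f in text.strip().split(";")]
    if len(fields) == 2:
        try:
            lat, lon = float(fields[0]), float(fields[1])
        except ValueError:
            pass  # not numeric, so not a location payload
        else:
            if -90 <= lat <= 90 and -180 <= lon <= 180:
                return {"type": "location", "lat": lat, "lon": lon}
    if len(fields) == len(PILOT_KEYS):
        return {"type": "pilot_info", **dict(zip(PILOT_KEYS, fields))}
    raise ValueError(f"unrecognised QR payload: {text!r}")


if __name__ == "__main__":
    print(parse_qr_payload("43.4723;-80.5449"))
    print(parse_qr_payload("Is the area clear?;2022-05-01;14:30;D01;S07"))
```

Returning a tagged dictionary lets the output pipelines route location data to the flight controller and pilot information to the ground station without re-inspecting the raw string.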
Novel approach and Novel elements
DeepSORT
Object detection alone is not sufficient for this year's task, so an object tracking method, DeepSORT, was added. DeepSORT is a tracking-by-detection algorithm: to associate the detections in a new frame with previously tracked objects, it considers both the bounding box parameters of the detections and the appearance of the tracked objects. It is also an online tracking algorithm, meaning it uses only the current and previous frames to make predictions about the current frame, without needing to process the whole video at once. Because the drone must react to a live video stream, this makes DeepSORT a good fit for our purpose. When following a person with object detection alone, tracking can be unstable: the confidence of the bounding box on that person can fluctuate from frame to frame, and if the bounding box locks onto an object that is not a person, the target is lost. DeepSORT follows the person using information from previous frames, reducing the likelihood of losing track of the target.
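How motion and appearance are combined can be sketched as follows. DeepSORT weights a motion distance and an appearance distance, c = λ·d_motion + (1 − λ)·d_appearance, and rejects pairs that fall outside either gate. This sketch simplifies under stated assumptions: (1 − IoU) stands in for DeepSORT's Kalman-filter Mahalanobis distance, appearance features are tiny hand-made vectors instead of CNN embeddings, and greedy matching replaces the matching cascade and Hungarian assignment. The weight and gate values are hypothetical.

```python
import math

# Simplified DeepSORT-style association cost (motion + appearance with gating).


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cosine_distance(u, v):
    """1 - cosine similarity between two appearance vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm if norm else 1.0


def association_cost(track, det, lam=0.5, max_motion=0.7, max_appearance=0.3):
    """Weighted motion + appearance cost; pairs outside either gate get infinite cost."""
    motion = 1.0 - iou(track["box"], det["box"])
    # Smallest distance to any appearance stored in the track's gallery.
    appearance = min(cosine_distance(f, det["feature"]) for f in track["features"])
    if motion > max_motion or appearance > max_appearance:
        return math.inf
    return lam * motion + (1 - lam) * appearance


def associate(tracks, detections, **kwargs):
    """Greedily pair tracks with detections in order of increasing cost."""
    pairs = sorted(
        (association_cost(t, d, **kwargs), ti, di)
        for ti, t in enumerate(tracks)
        for di, d in enumerate(detections)
    )
    used_tracks, used_dets, matches = set(), set(), []
    for cost, ti, di in pairs:
        if math.isinf(cost):
            break  # all remaining pairs are gated out
        if ti not in used_tracks and di not in used_dets:
            matches.append((ti, di))
            used_tracks.add(ti)
            used_dets.add(di)
    return matches


if __name__ == "__main__":
    person = {"box": (100, 100, 200, 300), "features": [(1.0, 0.0, 0.0)]}
    decoy = {"box": (105, 100, 205, 300), "feature": (0.0, 1.0, 0.0)}  # overlaps, looks different
    same = {"box": (130, 100, 230, 300), "feature": (0.9, 0.1, 0.0)}   # moved, looks the same
    print(associate([person], [decoy, same]))  # [(0, 1)]
```

The decoy box overlaps the track more than the true detection does, but its appearance falls outside the gate, so the track stays on the person. This is the failure mode described above for detection-only following.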