2020-2021 Computer Vision Architecture Overview

Goals

Note: the goals stated here are for the 2020-2021 season. The competition for this season has concluded; however, we will retain this architecture, and future competition goals may be similar to this one.

If you’re unfamiliar with the competition we’re participating in, here’s a summary of the competition CONOPS (Concept of Operations, basically the rules/goals): https://uwarg-docs.atlassian.net/wiki/spaces/AD/pages/1089830913

 

As a quick overview, some of the competition’s main tasks involve identifying a “medical clinic”, retrieving a package from it, and delivering that package to a “remote depot”, with each site marked by a group of four pylons. Points are awarded throughout the competition based on the extent of autonomy with which our vehicle can perform these tasks.

 

Nomenclature note: before this competition became all-virtual, the CONOPS described the packages as being placed within tents. The codebase includes many references to such “tents”; in practice, this simply refers to the groups of pylons.

With respect to these goals, the Computer Vision subteam at WARG is responsible for completing a few main tasks:

  • Identifying a clinic/depot from the air from within a video stream

  • Correlating the location of the landing site within the image to an actual position on the ground

  • Searching for, and geolocating, the boxes once on the ground

  • Taxiing to the boxes in coordination with the firmware system

  • Scanning QR codes attached to the boxes

To do so, the CV system is segmented into a set of modules, each of which assists in accomplishing one of the above tasks.


The Modules

Note: it may be useful to follow along exploring our CV repository as you read this guide (https://github.com/UWARG/computer-vision-python ).

DeckLinkSRC

Video data for the CV system originates from a GoPro mounted on the plane. The GoPro feeds a Blacksheep video transmitter, which sends the live stream to the ground. The receiver on the ground is attached to a capture card on the CV computer, known as the DeckLink, which interprets and processes the video input. The DeckLinkSRC module was designed to extract video data from this card and feed it into the CV program’s data pipelines.

However, this implementation has since changed, and the module now simply reads from an OBS stream which pulls data directly from the DeckLink. This module is also responsible for handling the life cycle of the video stream, including saving it for future training and closing the stream when prompted.
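
As a rough illustration of what reading from such a stream looks like, here is a minimal sketch using OpenCV; the capture source below is a placeholder assumption, not the team’s actual configuration.

```python
# Minimal sketch of pulling frames from a capture source with OpenCV. The
# capture source below is a placeholder; the real module reads the OBS
# stream that wraps the DeckLink card.
import cv2

capture = cv2.VideoCapture(0)  # placeholder: device index or stream URL
while capture.isOpened():
    ok, frame = capture.read()
    if not ok:
        break
    # frame is a NumPy array (height x width x 3, BGR) that can be pushed
    # into the CV program's data pipelines or saved for training
    cv2.imshow("DeckLinkSRC preview", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
capture.release()
cv2.destroyAllWindows()
```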

Command

The command module serves as the communication interface between the firmware and CV systems. It includes two major components: the POGI (Plane Out, Ground In) system, responsible for transferring telemetry data from the XBee (firmware RF receiver) to the CV system, and the PIGO (Plane In, Ground Out) system, responsible for transferring CV data about pylon or box locations to the firmware autopilot. Data is transferred via two shared JSON files that are used by both the CV and firmware systems.
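
To make the file-based handoff concrete, here is a hedged sketch of reading a POGI file and writing a PIGO file. The file names and JSON keys are illustrative assumptions; the real schema is agreed upon with the firmware team.

```python
# Illustrative sketch of the shared-JSON handoff; file names and keys are
# hypothetical placeholders, not the actual schema agreed with firmware.
import json

POGI_PATH = "pogi.json"  # assumed name: telemetry written by the firmware side
PIGO_PATH = "pigo.json"  # assumed name: CV commands read by the firmware side

def read_pogi():
    """Read the latest telemetry (Plane Out, Ground In) from the shared file."""
    with open(POGI_PATH, "r") as file:
        return json.load(file)  # e.g. {"latitude": ..., "longitude": ..., "altitude": ...}

def write_pigo(data):
    """Write CV output (Plane In, Ground Out) for the autopilot to consume."""
    with open(PIGO_PATH, "w") as file:
        json.dump(data, file)
```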

Target Acquisition

This module hosts and runs the YOLOv5 object detection neural network, which detects pylons within a video frame. Object detection involves creating a so-called “bounding box” that identifies the rough location of a target object within an image. The bounding box includes data about its height, width, and the x-y coordinates of its top-right corner. Note that the coordinate system of an image is such that the origin is at the top left, and the bottom right is defined by (+IMAGE_WIDTH, +IMAGE_HEIGHT).
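
As a small illustration of this representation and the coordinate convention, here is a hedged sketch; the BoundingBox class and example values are assumptions for demonstration, not the repository’s actual types.

```python
# Illustrative sketch of a bounding box record and the image coordinate
# convention described above; not the repository's actual class.
from dataclasses import dataclass

@dataclass
class BoundingBox:
    x: int       # x pixel coordinate of the box's reference corner
    y: int       # y pixel coordinate of the box's reference corner
    width: int   # box width in pixels
    height: int  # box height in pixels

# Image coordinates: the origin (0, 0) is the top-left pixel, x grows to
# the right and y grows downward, so the bottom-right of the frame is
# (IMAGE_WIDTH, IMAGE_HEIGHT).
IMAGE_WIDTH, IMAGE_HEIGHT = 1920, 1080
box = BoundingBox(x=800, y=450, width=120, height=200)
assert 0 <= box.x <= IMAGE_WIDTH and 0 <= box.y <= IMAGE_HEIGHT
```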

Geolocation

The geolocation module implements a complex mathematical algorithm designed to correlate data about bounding boxes detected within an image to actual coordinates on the ground, based on telemetry data such as altitude, the GPS coordinates of the plane, and the camera gimbal orientation. The algorithm itself is a bit too involved to cover here. You may refer to the following link if you’re interested in learning more about the math behind it: https://docs.google.com/document/d/1VKYIrmWJYfpLPCAjQPiH1J5fnQPtNcjEC2NE5Eg4QxU/edit .

Search

Once the plane has landed, the search module uses stored coordinates of the depot/clinic locations to tell the firmware program how to turn so that the plane faces the rough direction of the boxes. Note that these stored locations will not be very accurate, due to GPS limitations and unaccounted-for variables in the geolocation process; the point of this module is simply to get the plane facing roughly the right way so that the Taxi program can begin package detection.
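
As an illustration of the kind of calculation involved, the sketch below computes the initial bearing from the plane’s GPS fix to a stored depot coordinate and the corresponding turn; the coordinates, heading, and function names are placeholder assumptions.

```python
# Hedged sketch of turning a stored depot/clinic coordinate into a rough
# heading for the firmware; the formula is the standard initial bearing
# between two latitude/longitude points, and all values are placeholders.
import math

def initial_bearing(lat1, lon1, lat2, lon2):
    """Bearing in degrees (0 = north, clockwise) from point 1 to point 2."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_lon = math.radians(lon2 - lon1)
    x = math.sin(delta_lon) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(delta_lon)
    return math.degrees(math.atan2(x, y)) % 360.0

plane_lat, plane_lon, plane_heading = 43.4700, -80.5400, 10.0  # placeholder telemetry
depot_lat, depot_lon = 43.4705, -80.5390                       # placeholder stored coordinate

bearing = initial_bearing(plane_lat, plane_lon, depot_lat, depot_lon)
turn_clockwise = (bearing - plane_heading) % 360.0  # how far the plane should rotate
```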

Taxi

The taxi module uses another YOLOv5 object detection model to search for cardboard boxes (the packages) within a video frame. It also includes logic for calculating the distance from the plane to the box, estimated from the size of the box within the image and the camera’s focal length. This data is sent to the firmware system to help it navigate towards the box. Once the plane has travelled relatively close, manual control is handed over to the pilot to allow for the precise maneuvers needed for QR code scanning.
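
The distance estimate follows the usual pinhole camera relationship; a hedged sketch is shown below, where the box width and focal length values are placeholder assumptions.

```python
# Rough sketch of the distance-from-size idea using the pinhole camera
# model; the box width and focal length below are placeholder assumptions.
def estimate_distance_m(box_width_px, real_width_m, focal_length_px):
    """Distance to an object of known real width, given its apparent width
    in pixels and the camera's focal length expressed in pixels."""
    return real_width_m * focal_length_px / box_width_px

BOX_REAL_WIDTH_M = 0.45   # assumed physical width of the cardboard box
FOCAL_LENGTH_PX = 1000.0  # assumed focal length from camera calibration

distance_m = estimate_distance_m(box_width_px=180,
                                 real_width_m=BOX_REAL_WIDTH_M,
                                 focal_length_px=FOCAL_LENGTH_PX)  # ~2.5 m
```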

QR Scanner

The QR scanner module uses OpenCV and the pyzbar library to detect QR codes within a video frame, decode them, and overlay the decoded information on the frame with a bounding box for the pilot to see.
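
A minimal sketch of this detect/decode/overlay flow with OpenCV and pyzbar might look like the following; the function name and overlay style are illustrative, and the real module’s interface may differ.

```python
# Minimal sketch of the detect/decode/overlay flow with OpenCV and pyzbar;
# the function name and overlay style are illustrative, not the module's
# actual interface.
import cv2
from pyzbar.pyzbar import decode

def scan_and_annotate(frame):
    """Detect QR codes in a BGR frame, draw their bounding boxes, and
    overlay the decoded text for the pilot to see."""
    for qr in decode(frame):
        left, top, width, height = qr.rect
        text = qr.data.decode("utf-8")
        cv2.rectangle(frame, (left, top), (left + width, top + height), (0, 255, 0), 2)
        cv2.putText(frame, text, (left, top - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return frame
```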

Timestamp

The timestamp module simply attaches timestamps to pieces of data as they are queued into the data pipelines (more on that below), to help with tasks such as matching image data with telemetry data.
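
A minimal sketch of this idea, assuming a simple (timestamp, data) tuple format (an illustrative choice, not necessarily the module’s actual representation):

```python
# Sketch of the timestamping idea: wrap each item with the time it entered
# the pipeline so frames can later be matched against telemetry. The tuple
# format is an illustrative choice, not necessarily the module's own.
import time

def with_timestamp(data):
    """Return (unix_timestamp, data), ready to be put on a pipeline queue."""
    return (time.time(), data)
```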

 

Main Program & Multiprocessing

The CV architecture is heavily object-oriented. The modules discussed above are all instantiated as objects within the main program (located in the root directory of the repository).

At any given time, there are several different tasks the CV system is responsible for. During flight, for instance, the program needs to retrieve frames from the video stream, call a detection model on them, and run the geolocation module if a pylon is found - all while using the command module to receive telemetry and communicate with the autopilot. All of this is simply not achievable with a sequential program.

 

Instead, parallelism (running multiple programs/processes at once) becomes important. This is enabled by a concept and Python library called multiprocessing. Multiprocessing takes advantage of multiple CPU cores to perform several operations at the same time (in parallel, rather than merely interleaved). In the CV system, most modules run in their own “process”.



With several operations happening concurrently, passing data around is quite different from the regular sequential implementations you may be used to. Instead, the CV architecture uses so-called “pipelines” of data - FIFO (First In, First Out) data structures called queues, which handle streams of data. One pipeline may hold POGI information, while another may carry video frame data.

These pipelines feed into and out of so-called “worker functions” that exist for each module within the CV architecture. These functions define what data (if any) the specific module takes in, and what data (if any) the module outputs. Worker functions that only take in data are called consumers, workers that only emit data are called producers, and workers that do both are called producer-consumers.
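
As a hedged sketch of how producers, consumers, and producer-consumers fit together over multiprocessing queues (the function names, placeholder data, and sentinel convention are illustrative assumptions, not the repository’s exact interface):

```python
# Illustrative producer/consumer workers over multiprocessing queues; names,
# placeholder data, and the sentinel convention are assumptions, not the
# repository's exact interface.
import multiprocessing as mp

SENTINEL = None  # assumed end-of-stream marker

def frame_producer(frame_queue):
    """Producer: pushes raw frames into a pipeline."""
    for frame in ["frame0", "frame1", "frame2"]:  # stand-in for a video source
        frame_queue.put(frame)
    frame_queue.put(SENTINEL)

def detection_worker(frame_queue, bbox_queue):
    """Producer-consumer: consumes frames, emits (placeholder) bounding boxes."""
    while True:
        frame = frame_queue.get()
        if frame is SENTINEL:
            bbox_queue.put(SENTINEL)
            break
        bbox_queue.put({"frame": frame, "bbox": (0, 0, 10, 10)})  # placeholder detection

def logger_consumer(bbox_queue):
    """Consumer: only takes data in."""
    while True:
        item = bbox_queue.get()
        if item is SENTINEL:
            break
        print(item)

if __name__ == "__main__":
    frames, bboxes = mp.Queue(), mp.Queue()
    workers = [
        mp.Process(target=frame_producer, args=(frames,)),
        mp.Process(target=detection_worker, args=(frames, bboxes)),
        mp.Process(target=logger_consumer, args=(bboxes,)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```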

 

In this manner, producer-consumers, pipelines, and processes compose the main concepts of the CV multiprocessing system.

Approach to Testing

Unit tests are included in each of the modules, and are created with the pytest library. The integration testing strategy is covered here: https://uwarg-docs.atlassian.net/wiki/spaces/CV/pages/1686208530 .
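
For reference, a unit test in the pytest style might look like the following; the function under test is a hypothetical stand-in rather than actual repository code.

```python
# Hypothetical illustration of the pytest style used for module unit tests;
# the function under test is a stand-in, not actual repository code.
def clamp_to_image(x, y, image_width, image_height):
    """Clamp a pixel coordinate so it lies within the image bounds."""
    return min(max(x, 0), image_width), min(max(y, 0), image_height)

def test_point_inside_image_is_unchanged():
    assert clamp_to_image(50, 60, 100, 100) == (50, 60)

def test_point_outside_image_is_clamped():
    assert clamp_to_image(-5, 150, 100, 100) == (0, 100)
```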