...
In computer vision, it is a common task to convert pixel coordinates in an image to geographical coordinates in the real world and vice versa. The geolocation module in the airside system repository identifies the relative positioning of ground targets from the given sensor data. This is done through the geolocation algorithm.
Note: Throughout this document bolded lowercase variables represent vector quantities and bolded UPPERCASE variables represent matrices.
Intrinsics
...
An image is a grid of pixels. Pixels are the smallest component in a digital image. The resolution of an image is the number of pixels in an image. The top-left pixel in the image is the origin. The positive x-direction points in the right direction and the positive y-direction points downwards.
The camera coordinate system follows the North-East-Down (NED) coordinate system. In the NED system, the x-axis is the forward direction, the y-axis is the rightward direction, and the z-axis the downward direction. In the camera coordinate system, c corresponds to the x-direction, u corresponds to the y-direction, and v corresponds to the z-direction. The optical center corresponds to the origin in the camera coordinate system. See more information about the NED system here: https://en.wikipedia.org/wiki/Local_tangent_plane_coordinates.
Image to Camera Space
If you are not familiar with the Field of View of Cameras, I suggest reading this document to understand the necessary background for this section https://uwarg-docs.atlassian.net/wiki/spaces/CV/pages/2237530135/Cameras#Field-of-View.
...
To calculate the vectors c, u and v, we assume that the magnitude of the c vector is 1 (The magnitude of c can be arbitrary as u and v will scale, we assume the magnitude c is 1 for convenience). Knowing that the magnitude of the c vector is 1 and the Field of View we can calculate the magnitude of the u and v vectors (the magnitude of u and v correspond to a and b in the diagram above respectively). This is done with basic trigonometry (we can take the tangent of half of the FOV angle since the magnitude of the c vector is 1).
If the ratio of a:b is not the same as the ratio of rx:ry then the pixels are not square, which doesn’t really matter but it is useful to note. Usually the pixels will be square.
To map a pixel in the image space to its corresponding vector in the camera space, we can apply the scaling function f(p) = 2(p/R) - 1, where p is the pixel location (in x or y axis) and R is the resolution (in x or y axis). This function is chosen since it maps the domain of the image space, which is [0, R], to the codomain [-1, 1]. This is achieved by scaling the normalized value p/R by 2 and translating it down by 1.
...
We can multiply the value of the scaling function with the vectors u and v to get the horizontal and vertical components of the pixel vector (upixel = f(px) * u and vpixel = f(py) * v, where px and py are the pixel's x and y locations). Thus, the pixel vector, p, is equal to p = c + upixel + vpixel.
Extrinsic
Three-Dimensional Rotation Matrices
It is useful to think about matrices as a transformation of space. When we multiply a vector with a matrix, we can visually see this as transforming a vector from one coordinate space to another. To better understand matrices as transformations, you can look at the resources below.
A rotation matrix is a special type of matrix, as it transforms the coordinate space by revolving it around the origin. A visualization of a rotation matrix can be seen below.
...
Rotation matrices are useful since they allow us to model the orientation of an object in 3D space. Multiplying a vector with a rotation matrix allows us to rotate a vector and change its orientation. The rotation matrices for 3D space are shown below. For more information on rotation matrices, see here: https://en.wikipedia.org/wiki/Rotation_matrix.
...
It is important to note that matrix multiplication is not commutative. This means that the product of two matrices A and B is generally not the same if the order of the matrices is switched (AB != BA).
The orientation of a rigid body in three-dimensional space can be described with three angles (Tait-Bryan angles). The names of these three angles in aviation are yaw, pitch, and roll. Yaw describes the angle around the z-axis, pitch describes the angle around the y-axis, and roll describes the angle around the x-axis.
...
Camera to Drone Space
Once we are able to describe a pixel in the camera space, we need to convert it to a vector in the drone space. This vector will point to the object in the image from the perspective of the drone. To convert the vector from the camera space to the drone space, we need to know how the camera is positioned and oriented with respect to the drone.
When calculating the position of the camera in relation to the drone, the measurements must be reported from the center of the drone to the center of the camera. Any unit of measurement can be used as long as the units are consistent. Once we have the measurements of the camera in the x, y, and z axis, we can generate a vector to model the position of the camera in the drone space.
When calculating the orientation of the camera in relation to the drone, the measurements must be reported in radians when the drone is on a flat surface. The camera’s yaw, pitch, and roll with respect to the drone are used to generate a rotation matrix that models how the camera is oriented with respect to the drone.
Let’s say R is the rotation matrix describing the camera’s orientation in relation to the drone, t is the vector representing the position of the camera with respect to the drone, and p is a vector in the camera space. If we want to convert the vector p into a vector in the drone space (let’s call this p'), we can perform the following calculation: p' = Rp + t.
TODO: Redraw diagrams. Not sure what was the issue with the old ones.
World
Two-Dimensional Rotation and Translation
TODO: Not sure what to put in this section or why it is needed.
Drone to World Space
Once we know where the object is in the drone space, we need to convert it into a vector in the world space. To do this we can get the drone’s yaw, pitch, and roll in the world space and create a rotation matrix. This provides us with a vector that points to the detected object in the world coordinate system.
Projective Perspective Transform Matrix
The goal of the geolocation module is to determine where a particular pixel in an image maps to on the ground. The following sections dive into the theory of how geolocation works.
...
The diagram above displays the vectors used in the geolocation algorithm. The world space is the coordinate system used to describe the position of an object in the world. The world space is shown in the diagram with the black coordinate system. The camera space is the coordinate system used to describe the position of an object in relation to the camera. The camera space is shown in the diagram with vectors c, u, v. Note that bolded variables are vector quantities.
The table below outlines what each variable represents.
...
Vector | What it represents |
---|---|
o | The location of the camera in the world space (latitude and longitude of camera). |
c | Orientation of the camera in the world space (yaw, pitch, and roll of camera). |
u | Horizontal axis of the image in the camera space (right is positive). |
v | Vertical axis of the image in the camera space (down is positive). |
a | Camera location to individual pixel |
The geolocation module works under the following assumptions.
Geolocation assumes the ground is flat. To be clear I am not saying the Earth is flat, but we can assume the ground is flat at small distances because the radius of the earth is so large. The image below displays this assumption. This assumption allows us to assume a planar coordinate system for our calculations. We make this assumption because incorporating GIS data is difficult.
...
World to Ground Space
Given a vector pointing to the object in the world space, we can compute where the object is on the ground by finding the intersection of the vector with the ground. Since we are assuming a planar coordinate system, the ground is the plane z = 0. We can calculate the scalar multiple, t, that extends the vector to the ground. Knowing the t value, we can then compute the x and y values and determine the ground location of the object.
...
Using this calculation, we can now get the ground location of a target. However, these calculations become costly if a large number of pixels need to be translated into ground locations.
To resolve this issue, we can compute a perspective transform matrix. A perspective transform matrix is a 3x3 matrix that maps a 2D point in one plane to a 2D point in a different plane. We use the perspective transform matrix to map a point in the image plane to a point in the world space. We use the OpenCV library to calculate the transform matrix. To get the matrix, the library needs 4 points in the image plane and the corresponding 4 points on the ground plane. No 3 of these 4 points may be collinear (lie on the same line). Below is a diagram of the 4 points used to calculate the perspective transform matrix in the geolocation module.
...
Using the algorithm from above, we can find the corresponding ground locations for the 4 pixels above. We can then send the image points and ground points to the getPerspectiveTransform function in OpenCV and compute the ground locations for any pixels of interest in the image.
TODO State that this document assumes basic linear algebra knowledge. Link the 3Blue1Brown videos or similar.
An example is Ax + b which represents the vector x multiplied by the matrix A and summed with the vector b.
Intrinsics
TODO explain that this is all internal to camera
Image Coordinate System
TODO Image diagram
Explain that coordinate system is not discrete, top left of pixel is “coordinate” of the pixel (e.g. 0, 0 is the top left of the pixel at 0, 0)
Diagram this as well (Like the MS Paint Mihir drew)
Possible: Other coordinate systems (e.g. rectangular pixels, triangular pixels (CRT monitors).
Explain that Autonomy uses square pixels
An image is a grid of pixels. Pixels are the smallest component in a digital image. The resolution of an image is the number of pixels in an image. The top-left pixel in the image is the origin. The positive x-direction points in the right direction and the positive y-direction points downwards.
Camera Coordinate System
TODO Camera diagram
TODO Split this image into two, one for image coordinate system and one for camera coordinate system
TODO Fix camera diagram because someone looking at it doesn’t expect it to be 3D. Stick with a side view. Similar to the MS paint image.
TODO The camera coordinate system is a right-hand coordinate system, not NED. Explain the direction of the c, u, and v vectors. As this is a different coordinate system, the x axis is the image coord system is different from the y axis coord system.
The camera coordinate system follows the North-East-Down (NED) coordinate system. In the NED system, the x-axis is the forward direction, the y-axis is the rightward direction, and the z-axis the downward direction. In the camera coordinate system, c corresponds to the x-direction, u corresponds to the y-direction, and v corresponds to the z-direction. The optical center corresponds to the origin in the camera coordinate system. See more information about the NED system here: https://en.wikipedia.org/wiki/Local_tangent_plane_coordinates.
Image to Camera Space
If you are not familiar with the Field of View of Cameras, I suggest reading this document to understand the necessary background for this section https://uwarg-docs.atlassian.net/wiki/spaces/CV/pages/2237530135/Cameras#Field-of-View.
TODO: Just do a quick explanation of the Field of View. Just the angle between the edges of the image
...
TODO Just do 1 triangle for the image, similar to the MS paint thing pasted above. Get rid of the intermediate variables (alpha and a).
To calculate the vectors c, u and v, we assume that the magnitude of the c vector is 1 (The magnitude of c can be arbitrary as u and v will scale, we assume the magnitude c is 1 for convenience). Knowing that the magnitude of the c vector is 1 and the Field of View we can calculate the magnitude of the u and v vectors (the magnitude of u and v correspond to a and b in the diagram above respectively). This is done with basic trigonometry (we can take the tangent of half of the FOV angle since the magnitude of the c vector is 1).
TODO For the text above make it an ordered list to make it easier to read. Like a math proof.
If the ratio of a:b is not the same as the ratio of rx:ry then the pixels are not square, which doesn’t really matter but it is useful to note. Usually the pixels will be square.
Explain that the ratio between the magnitudes u and v are the same as the ratio between rx and ry because the assumption is that the pixels are square
To map a pixel in the image space to its corresponding vector in the camera space, we can apply the scaling function f(p) = 2(p/R) - 1, where p is the pixel location (in x or y axis) and R is the resolution (in x or y axis). This function is chosen since it maps the domain of the image space, which is [0, R], to the codomain [-1, 1]. This is achieved by scaling the normalized value p/R by 2 and translating it down by 1.
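The scaling function can be sketched in a few lines of Python. This is only an illustration: the name `scale_pixel` and the 1920-pixel width are made up for the example and are not taken from the geolocation module.

```python
def scale_pixel(p: float, resolution: float) -> float:
    """Map a pixel location in [0, resolution] to the range [-1, 1]."""
    if resolution <= 0:
        raise ValueError("resolution must be positive")
    # Normalize to [0, 1], scale by 2, then translate down by 1.
    return 2.0 * (p / resolution) - 1.0


# Left edge, centre, and right edge of a hypothetical 1920-pixel-wide image.
left = scale_pixel(0, 1920)
centre = scale_pixel(960, 1920)
right = scale_pixel(1920, 1920)
```

The centre of the image maps to 0, so a pixel at the centre contributes nothing along u or v and its vector is c itself.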
...
Add line diagram
...
We can multiply the value of the scaling function with the vectors u and v to get the horizontal and vertical components of the pixel vector (upixel = f(px) * u and vpixel = f(py) * v, where px and py are the pixel's x and y locations). Thus, the pixel vector, p, is equal to p = c + upixel + vpixel.
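Putting the pieces of this section together, a minimal sketch of the image-to-camera mapping might look like the following. It assumes the axis assignment described above (c along x, u along y, v along z) and a field of view given in radians; the function name and the example numbers are hypothetical.

```python
import math


def pixel_to_camera_vector(px, py, res_x, res_y, fov_x, fov_y):
    """Return the camera-space vector p = c + f(px) * u + f(py) * v."""
    # c has magnitude 1 along the camera's forward (x) axis.
    c = (1.0, 0.0, 0.0)
    # Because |c| = 1, |u| and |v| are the tangents of half the field of view.
    u = (0.0, math.tan(fov_x / 2), 0.0)
    v = (0.0, 0.0, math.tan(fov_y / 2))
    # Scale the pixel location from [0, R] to [-1, 1] on each axis.
    fx = 2.0 * (px / res_x) - 1.0
    fy = 2.0 * (py / res_y) - 1.0
    return tuple(c[i] + fx * u[i] + fy * v[i] for i in range(3))


# Centre and bottom-right corner of a 1000 x 1000 image with a 90 degree FOV.
centre_vector = pixel_to_camera_vector(500, 500, 1000, 1000, math.pi / 2, math.pi / 2)
corner_vector = pixel_to_camera_vector(1000, 1000, 1000, 1000, math.pi / 2, math.pi / 2)
```

With a 90 degree field of view, tan(45°) is 1, so the bottom-right corner points one unit forward, one unit right, and one unit down.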
Add diagram that shows this
Extrinsic
Motivation
TODO Now that we are done with the camera intrinsics, you want to know what is around the camera which is the camera extrinsic.
Want to mount camera on external platform, external platform has its own coordinate system in 3D, how to transform between them: 3D rotation matrix
Example: Drone forward direction is x axis, but camera is pointing down, so the c vector is in a different direction
Rotation
Three-Dimensional Rotation Matrices
Matrices can be thought of as transformations of space. Multiplying a vector in one coordinate system by a matrix results in the vector in another coordinate space. Multiplying a vector by a rotation matrix reorients the vector.
When we multiply a vector with a matrix, we can visually see this as transforming a vector from one coordinate space to another. To better understand matrices as transformations, you can look at the resources below.
A rotation matrix is a special type of matrix, as it transforms the coordinate space by revolving it around the origin. A visualization of a rotation matrix can be seen below.
...
Rotation matrices are useful since they allow us to model the orientation of an object in 3D space. Multiplying a vector with a rotation matrix allows us to rotate a vector and change its orientation. The rotation matrices for 3D space are shown below. For more information on rotation matrices, see here: https://en.wikipedia.org/wiki/Rotation_matrix.
...
It is important to note that matrix multiplication is not commutative. This means that the product of two matrices A and B is generally not the same if the order of the matrices is switched (AB != BA).
TODO Do not teach why matrices are not commutative, just say that they are not commutative and mention that if you multiply in another order the answer is different
The standard for calculating rotations is with Tait-Bryan Z-Y-X.
Diagram:
Rotate around z axis
Rotate around new y axis (pitch axis, not the original y axis)
Rotate around new new x axis (roll axis, not the original x axis)
The orientation of a rigid body in three-dimensional space can be described with three angles (Tait-Bryan angles). The names of these three angles in aviation are yaw, pitch, and roll. Yaw describes the angle around the z-axis, pitch describes the angle around the y-axis, and roll describes the angle around the x-axis (the coordinate system abides by the NED system).
...
Intrinsic rotations are elemental rotations that occur about the axes of a coordinate system attached to a moving body. In contrast, extrinsic rotations are elemental rotations that occur about the axes of the fixed coordinate system. The rotations in geolocation are intrinsic. There are 6 different orders in which the intrinsic rotations can be performed (since there are 6 permutations of the x, y, and z axes). The order matters because, as mentioned above, matrix multiplication is not commutative. The intrinsic rotation that is performed in the geolocation module is z-y-x or 3-2-1. If we have X as the rotation matrix for roll, Y as the rotation matrix for pitch, and Z as the rotation matrix for yaw, the overall transformation matrix T would be T = ZYX. The diagram below showcases the intrinsic rotation for z-y-x. For more information you can check out the link here: https://www.wikiwand.com/en/Euler_angles#Tait%E2%80%93Bryan_angles .
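As a rough illustration of T = ZYX, the sketch below builds the three elemental matrices and composes them. It assumes the standard right-handed rotation matrices and angles in radians; the helper names are invented for this example and are not the module's API.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def rot_x(roll):
    c, s = math.cos(roll), math.sin(roll)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_y(pitch):
    c, s = math.cos(pitch), math.sin(pitch)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rot_z(yaw):
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_zyx(yaw, pitch, roll):
    """Compose T = Z Y X for the intrinsic z-y-x (3-2-1) sequence."""
    return matmul(matmul(rot_z(yaw), rot_y(pitch)), rot_x(roll))


def rotate(matrix, vector):
    """Multiply a 3x3 matrix with a 3-element vector."""
    return [sum(matrix[i][k] * vector[k] for k in range(3)) for i in range(3)]


forward = [1.0, 0.0, 0.0]
# A yaw of 90 degrees turns the forward (x) axis onto the rightward (y) axis.
yawed = rotate(rotation_zyx(math.pi / 2, 0.0, 0.0), forward)
# A pitch of 90 degrees turns the forward axis onto -z, which is up in NED.
pitched = rotate(rotation_zyx(0.0, math.pi / 2, 0.0), forward)

# Multiplying the same three matrices in the opposite order gives a different result.
zyx = rotation_zyx(0.3, 0.2, 0.1)
xyz = matmul(matmul(rot_x(0.1), rot_y(0.2)), rot_z(0.3))
max_difference = max(abs(zyx[i][j] - xyz[i][j]) for i in range(3) for j in range(3))
```

The nonzero `max_difference` is the practical consequence of non-commutativity: the same yaw, pitch, and roll applied in a different order describe a different orientation.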
...
Camera to Drone Space
TODO Add more diagrams.
TODO Create a document somewhere else to explain how to set the config parameters for the camera wrt to the drone. Possibly put it in the Airside repo documentation.
Once we are able to describe a pixel as a vector in the camera space, we need to convert it to a vector in the drone space. This vector will point to the object in the image from the perspective of the drone. To convert the vector from the camera space to the drone space, we need to know how the camera is positioned and oriented with respect to the drone.
When calculating the position of the camera in relation to the drone, the measurements must be reported from the center of the drone to the center of the camera. Any unit of measurement can be used as long as the units are consistent. Once we have the measurements of the camera in the x, y, and z axis, we can generate a vector to model the position of the camera in the drone space.
When calculating the orientation of the camera in relation to the drone, the measurements must be reported in radians when the drone is on a flat surface. The camera’s yaw, pitch, and roll with respect to the drone are used to generate a rotation matrix that models how the camera is oriented with respect to the drone.
Let’s say R is the rotation matrix describing the camera’s orientation in relation to the drone, t is the vector representing the position of the camera with respect to the drone, and p is a vector in the camera space. If we want to convert the vector p into a vector in the drone space (let’s call this p'), we can perform the following calculation: p' = Rp + t.
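The calculation p' = Rp + t can be sketched as below. The mounting values are hypothetical: a camera pitched by -90 degrees so that it points straight down, sitting 0.1 units forward of and 0.05 units below the drone's centre. The function name is also made up for the example.

```python
def camera_to_drone(rotation, translation, p):
    """Apply p' = R p + t to a 3-element vector p."""
    rotated = [sum(rotation[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [rotated[i] + translation[i] for i in range(3)]


# Rotation matrix for a pitch of -90 degrees, written out by hand (assumed mounting).
R_DOWN = [
    [0.0, 0.0, -1.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
]
# Camera position relative to the drone's centre, in consistent units (assumed values).
T_OFFSET = [0.1, 0.0, 0.05]

# The camera's forward vector c = (1, 0, 0) expressed in the drone space.
p_drone = camera_to_drone(R_DOWN, T_OFFSET, [1.0, 0.0, 0.0])
```

Here the camera's forward axis ends up along the drone's z-axis (down), shifted by the mounting offset.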
TODO: Redraw diagrams. Not sure what was the issue with the old ones.
Drone to World Space
Once we know the vector pointing to the object in the drone space, we need to convert it into a vector in the world space. To do this we can get the drone’s yaw, pitch, and roll in the world space and create a rotation matrix. This provides us with a vector that points to the detected object in the world coordinate system.
Let’s say W is the rotation matrix describing the drone’s orientation in the world space, and p' is the vector pointing to the object in the drone space. To get the vector pointing to the object in the world space (let’s call this p''), we can perform the following calculation: p'' = Wp'.
Translation
TODO If your optical center is not at the center of the drone, we need to translate it. Where is the camera in the world? First where is the drone in the world? Then where is the camera with respect to the drone. End goal is to get
Projective Perspective Transform Matrix
Now that the rotation and translation of a pixel to world is known, there is one final step to figure out where that maps to on the ground in the world.
...
TODO Move this diagram to the bottom to reiterate as a summary of the geolocation algorithm
The diagram above displays the vectors used in the geolocation algorithm. The world space is the coordinate system used to describe the position of an object in the world. The world space is shown in the diagram with the black coordinate system. The camera space is the coordinate system used to describe the position of an object in relation to the camera. The camera space is shown in the diagram with vectors c, u, v. Note that bolded variables are vector quantities.
TODO the above paragraph reiterates a lot of stuff said, take it out.
The table below outlines what each variable represents.
Vector | What it represents |
---|---|
o | The location of the camera in the world space (latitude and longitude of camera). |
c | Orientation of the camera in the world space (yaw, pitch, and roll of camera). |
u | Horizontal axis of the image in the camera space (right is positive). |
v | Vertical axis of the image in the camera space (down is positive). |
a | Camera location to individual pixel |
The geolocation module works under the following assumptions.
Geolocation assumes the ground is flat. To be clear I am not saying the Earth is flat, but we can assume the ground is flat at small distances because the radius of the earth is so large. The image below displays this assumption. This assumption allows us to assume a planar coordinate system for our calculations. We make this assumption because incorporating GIS data is difficult.
...
World to Ground Space
Given a vector pointing to the object in the world space, we can compute where the object is on the ground by finding the intersection of the vector with the ground. Since we are assuming a planar coordinate system, the ground is the plane z = 0. We can calculate the scalar multiple, t, that extends the vector to the ground. Knowing the t value, we can then compute the x and y values and determine the ground location of the object.
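A minimal sketch of the ground intersection follows. It assumes the camera position and the pointing vector are both expressed in a NED world frame (so a camera 10 units above the ground has z = -10); the function name and the numbers are illustrative only.

```python
def intersect_ground(origin, direction):
    """Return the (x, y) point where origin + t * direction meets the plane z = 0.

    Returns None when the vector is parallel to the ground or points away from it.
    """
    if direction[2] == 0:
        return None
    # Solve origin_z + t * direction_z = 0 for the scalar multiple t.
    t = -origin[2] / direction[2]
    if t <= 0:
        return None
    return (origin[0] + t * direction[0], origin[1] + t * direction[1])


# Camera 10 units above the ground, looking forward and down at 45 degrees.
ground_point = intersect_ground((0.0, 0.0, -10.0), (1.0, 0.0, 1.0))
# A vector pointing up never reaches the ground.
no_hit = intersect_ground((0.0, 0.0, -10.0), (1.0, 0.0, -1.0))
```

Rejecting non-positive t avoids reporting an intersection that lies behind the camera.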
...
Using this calculation, we can now get the ground location of a target. However, these calculations become costly if a large number of pixels need to be translated into ground locations.
To resolve this issue, we can compute a perspective transform matrix. A perspective transform matrix is a 3x3 matrix that maps a 2D point in one plane to a 2D point in a different plane. We use the perspective transform matrix to map a point in the image plane to a point in the world space. We use the OpenCV library to calculate the transform matrix. To get the matrix, the library needs 4 points in the image plane and the corresponding 4 points on the ground plane. No 3 of these 4 points may be collinear (lie on the same line). Below is a diagram of the 4 points used to calculate the perspective transform matrix in the geolocation module.
...
Using the algorithm from above, we can find the corresponding ground locations for the 4 pixels above. We can then send the image points and ground points to the getPerspectiveTransform function in OpenCV and compute the perspective transform matrix to map any pixel in the image to a location on the ground.
Let’s say P is the perspective transform matrix and a pixel in an image is p = (x, y).
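A 3x3 perspective matrix is normally applied by writing the pixel in homogeneous coordinates (x, y, 1), multiplying by P, and dividing by the third component of the result. The sketch below shows that step in plain Python; the matrix values are made up, and in practice P would come from getPerspectiveTransform.

```python
def apply_perspective(P, x, y):
    """Map the image point (x, y) through the 3x3 matrix P."""
    # Multiply P by the homogeneous point (x, y, 1).
    xh = P[0][0] * x + P[0][1] * y + P[0][2]
    yh = P[1][0] * x + P[1][1] * y + P[1][2]
    w = P[2][0] * x + P[2][1] * y + P[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this transform")
    # Divide by w to return to ordinary 2D coordinates.
    return (xh / w, yh / w)


# Hypothetical matrices, chosen only to make the arithmetic easy to follow.
P_AFFINE = [[2.0, 0.0, 1.0], [0.0, 2.0, 3.0], [0.0, 0.0, 1.0]]
P_SCALED = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 2.0]]

affine_point = apply_perspective(P_AFFINE, 4.0, 5.0)
scaled_point = apply_perspective(P_SCALED, 4.0, 5.0)
```

The division by w is what separates a perspective transform from a plain affine one; when the bottom row of P is (0, 0, 1), w is always 1 and the division has no effect.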
Logs
Geolocation saves logs locally for the detections it makes in logs/{date+time}/geolocation_worker.log.
Below is an example of what it could look like (full logs in onedrive Geolocation_GroundTest_Oct4_2024):
Code Block |
---|
12:36:56: [INFO] [/home/warg/computer-vision-python/modules/geolocation/geolocation_worker.py | geolocation_worker | 39] Logger initialized
12:37:02: [INFO] [/home/warg/computer-vision-python/modules/geolocation/geolocation.py | run | 320] <class 'modules.detection_in_world.DetectionInWorld'>, vertices: [[0.6552691643408014, 9.98856225523965], [0.6482836937744504, 9.986302393051126], [0.6578720594692677, 9.980787791068765], [0.6508864409775312, 9.97852507854267]], centre: [ 0.65308 9.9835], label: 0, confidence: 1.0
12:37:05: [INFO] [/home/warg/computer-vision-python/modules/geolocation/geolocation.py | run | 320] <class 'modules.detection_in_world.DetectionInWorld'>, vertices: [[0.4275230900336949, 10.152804162814556], [0.2980973314064277, 10.114202092691663], [0.46949445525840316, 10.023504609670422], [0.340985020395882, 9.982833647293372]], centre: [ 0.38457 10.068], label: 0, confidence: 1.0
12:37:08: [INFO] [/home/warg/computer-vision-python/modules/geolocation/geolocation.py | run | 320] <class 'modules.detection_in_world.DetectionInWorld'>, vertices: [[0.49811098655414815, 10.170383875702877], [0.36995788863253354, 10.095773204556759], [0.5752676468673406, 10.046568235880605], [0.44786116828996236, 9.969197821890818]], centre: [ 0.47329 10.071], label: 0, confidence: 1.0 |
If the operation is successful, the [INFO] tag is used to log the results; otherwise, an [ERROR] tag will be present. The input is a bounding box (with 4 corners and a center), so the output also has 4 corners and a center. The vertices are in the order: (top left, top right, bottom left, bottom right).
Each point will have 2 or 3 coordinates, in the NED format (in meters). If there are 2 coordinates, it would be (North, East), and if there are 3 coordinates, it would be (North, East, Down).
All of the distances are relative to the home location, so [0.65, 9.98] indicates that the point is 0.65 meters north and 9.98 meters east of the home location (take-off position).
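If the detections need to be pulled back out of a log file, something like the sketch below could work. It is written against the example lines above only, so the regular expression is an assumption about the log format rather than a guaranteed contract, and `parse_detection` is a hypothetical helper, not part of the module.

```python
import json
import re

# Matches the detection fields as they appear in the example log lines.
DETECTION_PATTERN = re.compile(
    r"vertices: (\[\[.*?\]\]), centre: \[\s*(.*?)\s*\], label: (\d+), confidence: ([\d.]+)"
)


def parse_detection(line):
    """Return the detection fields from one log line, or None if it has none."""
    match = DETECTION_PATTERN.search(line)
    if match is None:
        return None
    return {
        # The vertices are printed as a nested list, which is valid JSON.
        "vertices": json.loads(match.group(1)),
        # The centre is printed without commas, so split on whitespace.
        "centre": [float(value) for value in match.group(2).split()],
        "label": int(match.group(3)),
        "confidence": float(match.group(4)),
    }


sample = (
    "12:37:02: [INFO] [/home/warg/computer-vision-python/modules/geolocation/geolocation.py | run | 320] "
    "<class 'modules.detection_in_world.DetectionInWorld'>, "
    "vertices: [[0.6552691643408014, 9.98856225523965], [0.6482836937744504, 9.986302393051126], "
    "[0.6578720594692677, 9.980787791068765], [0.6508864409775312, 9.97852507854267]], "
    "centre: [ 0.65308 9.9835], label: 0, confidence: 1.0"
)
detection = parse_detection(sample)
north, east = detection["centre"]
```

Lines without a detection, such as the "Logger initialized" line, return None and can be skipped.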
Break ---------------------------------------------------------------
...