Shot Type Constraints in UAV Cinematography For
Autonomous Target Tracking
Iason Karakostas*, Ioannis Mademlis*, Nikos Nikolaidis and Ioannis Pitas
Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
Abstract
During the past years, camera-equipped Unmanned Aerial Vehicles (UAVs) have revolutionized aerial cinematography, allowing easy acquisition of impressive footage. In this context, autonomous functionalities based on machine learning and computer vision modules are gaining ground. During live coverage of outdoor events, an autonomous UAV may visually track and follow a specific target of interest, under a specific desired shot type, mainly adjusted by choosing appropriate focal length and UAV/camera trajectory relative to the target. However, the selected UAV/camera trajectory and the object tracker requirements (which impose limits on the maximum allowable focal length) affect the range of feasible shot types, thus constraining cinematography planning. Therefore, this paper explores the interplay between cinematography and computer vision in the area of autonomous UAV filming. UAV target-tracking trajectories are formalized and geometrically modeled, so as to analytically compute the maximum allowable focal length per scenario and avoid 2D visual tracker failure. Based on this constraint, formulas for estimating the appropriate focal length to achieve the desired shot type in each situation are extracted, so as to determine shot feasibility. Such rules can be embedded into practical UAV intelligent shooting systems, in order to enhance their robustness by facilitating on-the-fly adjustment of the cinematography plan.
Keywords: UAV cinematography, shot type, target tracking, autonomous drones
1. Introduction
Automation in applications involving cinematic video footage (e.g., TV/movie pro-
duction, outdoor event coverage, advertising, etc.) is constantly improving, both in the
post-production stage (e.g., shot cut/scene change detection [26], automated editing [3]
or framing [1], etc.) and during production (e.g., [6]). Relevant algorithms typically
* The first two authors contributed equally and are joint first authors.
© 2019. This manuscript version is made available under the CC-BY-NC-ND 4.0 license:
http://creativecommons.org/licenses/by-nc-nd/4.0/
Preprint submitted to Journal of LaTeX Templates, August 10, 2019
utilize expert knowledge about the film creative process and the cinematic grammar, in
order to assist in footage shooting, indexing, annotation, and/or post-processing.
While filming, the most important creative decisions made by the director pertain
to the shot type and the camera motion type. The shot type is defined mainly by the per-
centage of the video frame area covered by the target being filmed. In traditional film
grammar the target is assumed to be a human subject, but this is not strictly necessary
(for instance, it can be a static or moving vehicle). If the distance between the target
and the camera remains constant, the shot type is controlled primarily by changing the
camera focal length f, hence adjusting the zoom level. The camera motion type refers
to the camera motion trajectory relative to the target for the duration of a shot.
Despite the presence of a large body of research dedicated to automated shot type
and camera motion type recognition in existing footage during post-production (e.g.,
[37] [4] [11] [8]), little work has been performed on autonomously capturing new
videos with desired shot type/camera motion type combinations. Such methods are
typically given the label of intelligent shooting. In dynamic environments, relevant ap-
proaches require robotic cameras that partially rely on real-time machine learning and
computer vision algorithms, for visually detecting/tracking [25] [38] [19] [27] [31] [32]
and physically following a specific desired target (e.g., the lead athlete in a race). How-
ever, to the best of our knowledge, the interplay between 2D visual tracker operation
and cinematographic properties, i.e., shot type and camera motion type, has not been
thoroughly investigated.
An important issue in this respect is determining the range of feasible shot types
at each time point, so that visual tracking algorithms do not fail. The selected shot
type severely affects the perceived 2D displacement of a moving target image between
consecutive video frames, due to the effects of zooming. Thus, real-time visual object
tracking [18] is heavily influenced by cinematography decisions, given that virtually all
trackers search a restricted video frame region for the next target instance, positioned
around the previously found one. Although the size of this search region in pixels is
partially adaptive, according to the target’s image area on the previous video frame, it
is practically limited by the video frame dimensions. Thus, the shot type requested by
the director for a particular scenario at a certain time instance may not be feasible, de-
pending on the specifics of the target and the camera motion velocities and trajectories.
Vertical Take-off and Landing (VTOL) Unmanned Aerial Vehicles (UAVs, or “drones”)
equipped with professional cameras have recently become an indispensable asset in the
cinematographer’s arsenal. They permit rapid capture of impressive footage, flexible
shot setup, novel shot types and access to narrow or hard-to-reach spaces, at a small
fraction of the cost associated with spidercams, helicopters and cranes. Essentially,
they provide a level of camera motion freedom that, so far, was only available in an-
imation. Typically, in professional productions, the UAV and its mounted camera are
manually remote-controlled by two different operators, acting in synchronization under
a rough cinematography plan defined by the director. The latter can be conceived as a
sequence of desired target assignments, shot types and UAV/camera motion trajectories
relative to the target.
There is, however, a growing trend of increasing automation in drone functions,
so as to reduce the challenges arising from fully manual operation [21] [24]. This
is especially important in cinematography applications, where great precision and co-
ordination may be required in order to properly capture the desired shot. Thus, in
the near future, production costs are expected to be significantly reduced, with semi-
autonomous or fully autonomous drones replacing human crews currently required and
shifting production focus to the direct realization of the director’s creative vision, rather
than the minutiae of drone operation.
Autonomous UAV filming is, therefore, a promising emerging offshoot of intelli-
gent shooting with potentially exceptional industrial impact. However, challenges such
as tracking fast and unpredictably moving targets in real-time, as well as the lack of
standardization in UAV shot types and meaningful UAV/camera motion trajectories,
are a reality interfering with the ability to on-the-fly adjust the cinematography plan,
according to dynamic environment conditions. The restrictions imposed on the feasi-
ble shot types by the requirements of the 2D visual tracker, especially, are particularly
significant for autonomous UAVs, when contrasted with indoor robotic cameras, due to
the possibly higher target speed in outdoor settings and the increased camera mobility
offered by a drone.
Therefore, although the above apply to autonomous filming in general, this pa-
per focuses on outdoor target-following UAV cinematography applications (e.g., for
live sports event coverage). By significantly extending preliminary work [23] [40]
[20] [22], it presents a theoretical study of the constraints imposed on cinematography
decision-making during autonomous UAV shooting. The contributions of this paper
are:
• Formalizing and geometrically modelling a range of common, target-following UAV motion types.
• Analytically determining the maximum permissible camera focal length fmax, so that 2D visual object tracking does not get lost, for each UAV motion type.
• Extracting formulas for determining the feasibility of the requested shot type (dependent on fmax and on the appropriate focal length fs for that shot type).
• Providing specific examples and simulated scenarios that showcase the practical applicability of the proposed study.
Current industry practice simply ignores constraints implicitly imposed on zoom
level/shot type by 2D visual tracker requirements. This is problematic, since it dis-
regards the possibility of the target ROI going out of frame (or simply getting too
spatially displaced in 2D pixel coordinates) among consecutive time instances, due to
the target’s abrupt 3D motion and too high a focal length, thus breaking visual track-
ing. Therefore, to the best of our knowledge, our proposed, analytically derived rule
set marks the first time this issue is studied in-depth in the context of autonomous UAV
cinematography.
Incorporating shot type permissibility rules into media production automation soft-
ware, such as intelligent UAV shooting algorithms [15] [16] [30] [35], is expected
to greatly enhance the robustness of autonomous drones deployed in cinematography
applications, by facilitating tracker-aware on-the-fly adjustment of the pre-computed
cinematography plan.
Table 1: Shot types and their corresponding ROI to video frame height ratio percentage.

Shot type                   Video frame height coverage
Extreme Long Shot (ELS)     < 5%
Very Long Shot (VLS)        5 - 20%
Long Shot (LS)              20 - 40%
Medium Shot (MS)            40 - 60%
Medium Close-Up (MCU)       60 - 75%
Close-Up (CU)               > 75%
2. UAV Cinematography Modelling
In cinematography, each camera motion type can be combined with a subset of the
available shot types, so as to achieve an aesthetically pleasing visual result. Thus, a
shot can be described by the combination of a camera motion type and a shot type.
Below, shot types and camera motion types are studied for the specific case of UAV
cinematography.
Each shot type is mainly defined by the ratio of the Region-of-Interest (ROI) height
to the video frame height. The ratio can vary from less than 5% for the Extreme Long
Shot, to more than 75% for a Close-Up shot. The taxonomy presented in Table 1 is
derived/adapted from traditional ground and aerial cinematography [5] [7] [34], based
on extensive visual inspection of professional and semi-professional UAV footage.
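The taxonomy of Table 1 can be encoded as a simple threshold lookup. The following minimal Python sketch (the function and variable names are illustrative, not part of any existing system) maps a measured ROI-to-frame height ratio to a shot type label:

```python
def shot_type(roi_height_px, frame_height_px):
    """Map the ROI-to-frame height ratio (Table 1) to a shot type label."""
    c = roi_height_px / frame_height_px  # video frame height coverage
    if c < 0.05:
        return "ELS"   # Extreme Long Shot
    elif c < 0.20:
        return "VLS"   # Very Long Shot
    elif c < 0.40:
        return "LS"    # Long Shot
    elif c < 0.60:
        return "MS"    # Medium Shot
    elif c < 0.75:
        return "MCU"   # Medium Close-Up
    return "CU"        # Close-Up

print(shot_type(180, 720))  # 25% coverage -> "LS"
```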
In a typical scenario, the on-board camera is mounted on a gimbal that allows rapid
camera rotation around its yaw, pitch and roll axes. Additionally, a zoom lens with
adjustable focal length f (within certain limits) is employed. Simply altering f is
typically sufficient for achieving the shot type desired by the director and prescribed
in the cinematography plan. Thus, any constraints on the maximum permissible focal
length directly correspond to restrictions in the range of feasible shot types at each time
instance.
Regarding UAV/camera motion, several industry-standard types have emerged since
the popularization of UAVs, with most of them being derived/adapted from traditional
ground and aerial cinematography. For outdoor events (e.g., in live sports broadcast-
ing), the most important motion types are relative to a still or moving target being
tracked.
Recent aerial videography literature [7] [34] contains a description of a few such
UAV motion types. However, no systematic analysis has been presented in the literature
so far. Below, 8 UAV industry-standard camera motion types are detailed, geometri-
cally modelled and matched to compatible shot types, based on our extensive visual
survey of professional UAV footage. For instance, in a Chase shot (where the UAV
follows/leads a moving target from behind/from the front, while maintaining a steady
distance), the viewer is meant to experience a “simulation” of the target motion within
its environment, while the target is fully visible. Thus, a CU that excludes most of
the surroundings from the video frame is an unsuitable shot type in this context. Such
findings are summarized in Table 2.
The mathematical treatment in this paper assumes a realistic setting similar to [35],
where the autonomous UAV operates in a consistent, global, Cartesian 3D map, upon
Table 2: Compatibility of UAV camera motion and shot types.
Camera motion Shot types
MAPMT LS, MS, MCU
MATMT LS, MS
LTS VLS, LS, MS, MCU
VTS VLS, LS, MS, MCU
ORBIT LS, MS, MCU, CU
FLYOVER LS, MS, MCU, CU
FLYBY LS, MS, MCU, CU
CHASE VLS, LS, MS
which both the drone itself and the target are constantly localized. This can be achieved
by employing Global Positioning System (GPS) receivers [10] on both the UAV and
the target. For increased robustness, GPS-derived drone localization information can
be aligned and fused with Visual SLAM results [28], preferably derived by jointly
exploiting stereoscopic 3D camera and Inertial Measurement Unit (IMU) [29] inputs,
based on a similarity transformation [13]. Issues such as the possibility of temporarily
losing the GPS signal, or the usual GPS position error (in the range of up to 5 me-
ters [10]), may be overcome by fusing IMU/GPS and Visual SLAM localization, or
by replacing GPS with an Active Radio-Frequency IDentification (RFID) positioning
system [14]. Regarding the target, the output of 2D visual tracking itself can also be
exploited for augmenting target localization precision (assuming a calibrated camera),
thus making it even more imperative to reduce the chance of visual tracker failure.
Below, given a camera frame-rate $F$, time $t$ is discrete and proceeds in steps of $\frac{1}{F}$ seconds. A separate timeline is employed for each shot description, i.e., $t = 0$ indicates the start of a shot shooting session. At each time instance $t$, the 3D positions $\tilde{x}_t = [\tilde{x}_{t1}, \tilde{x}_{t2}, \tilde{x}_{t3}]^T$, $\tilde{p}_t = [\tilde{p}_{t1}, \tilde{p}_{t2}, \tilde{p}_{t3}]^T$ of the UAV and the target, respectively (assuming they are 3D points), as well as an estimated 3D target velocity vector $\tilde{u}_t$, are assumed known (as in [35]) in a fixed, orthonormal, right-handed World Coordinate System (WCS) $\tilde{i}, \tilde{j}, \tilde{k}$, with its $\tilde{k}$-axis perpendicular to a local tangent plane (hereafter shortened to "ground plane"). A local East-North-Up (ENU) coordinate system may be employed [9]. Note that the term "local tangent plane" is employed for a plane parallel to the local sea level, while the term "terrain tangent plane" is reserved for the plane instantaneously tangent to the local terrain surface.

Additionally, at each time instance $t$, a current, orthonormal, right-handed target-centered coordinate system (TCS) $i, j, k$ is defined. Its origin lies on the current target position, its $k$-axis is perpendicular to the ground plane and its $i$-axis is the L2-normalized projection of the current target velocity vector onto the ground plane. In the case of a still target, the TCS $i$-axis is defined as parallel to the projection of the vector $\tilde{p}_0 - \tilde{x}_0$ onto the ground plane. In both coordinate systems, the $ij$-plane is parallel to the ground plane and the $k$-component is called "altitude". Below, vectors expressed in TCS are denoted without the tilde symbol (e.g., $x_t$, $p_t$, $q_t$ and $u_t$).
Transforming between the two coordinate systems is trivial. A subset of the presented motion types requires pre-specification of motion parameters meant to adapt the
UAV motion trajectory to concrete directorial guidelines (e.g., distance to be covered
by the UAV).
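For concreteness, the following minimal NumPy sketch builds the TCS axes and expresses a WCS point in TCS, assuming an ENU-style WCS with $\tilde{k} = [0, 0, 1]^T$; the function names and the near-zero velocity tolerance are illustrative assumptions:

```python
import numpy as np

def tcs_axes(u_wcs, still_target_dir=None):
    """Unit TCS axes (i, j, k) expressed in WCS coordinates.

    u_wcs: estimated target velocity in WCS. For a (nearly) still target, pass the
    WCS vector p0 - x0 via still_target_dir, as prescribed above."""
    u = np.asarray(u_wcs, float)
    k = np.array([0.0, 0.0, 1.0])                 # WCS up axis (ENU assumption)
    ref = u if np.linalg.norm(u[:2]) > 1e-9 else np.asarray(still_target_dir, float)
    i = np.array([ref[0], ref[1], 0.0])           # projection onto the ground plane
    i /= np.linalg.norm(i)
    j = np.cross(k, i)                            # completes a right-handed frame
    return i, j, k

def wcs_to_tcs(point_wcs, target_wcs, axes):
    """Express a WCS point in the target-centered coordinate system."""
    i, j, k = axes
    d = np.asarray(point_wcs, float) - np.asarray(target_wcs, float)
    return np.array([d @ i, d @ j, d @ k])

# Example: UAV 30 m to the side of a target moving east at 10 m/s.
axes = tcs_axes([10.0, 0.0, 0.0])
print(wcs_to_tcs([100.0, 30.0, 5.0], [100.0, 0.0, 5.0], axes))   # ~[0, 30, 0]
```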
In mobile robotics literature, an additional, vehicle-centered coordinate system is
typically employed, having its origin located at a fixed distance from the UAV-mounted
camera. Since the scope of this paper does not include UAV control per se, we do not
make use of such a coordinate frame and limit our analysis to cinematography issues.
Additionally, for reasons of simplicity, the employed modelling ignores the distinction
between the drone and its mounted camera, since it is typically trivial to compute the
3D pose of the one given the other and gimbal feedback.
The 3D scene point where the camera looks at time instance $t$ is denoted by $l_t$ (in TCS). The LookAt vector at time instance $t$ is a scalar multiple of the camera axis and denoted by $o_t = l_t - x_t$ (or $\tilde{o}_t$, when expressed in WCS). Below, it is assumed that $l_t = p_t$ and, therefore, $o_t = -x_t$. As a result, the selected target point is visible at the center of the video frame. This is a simple and common framing approach, called "central composition". Standard measurement units for the implicated quantities are also assumed, i.e., distance is measured in meters, speed in meters per second and the video frame-rate in frames per second.
In a number of cases, the UAV/camera motion type is only meaningful if the target
is moving linearly. Moreover, such an assumption is additionally made below in cases
where the future target or UAV position needs to be predicted, for reasons of modelling
convenience (these cases are appropriately marked in the following analysis). Constant
linear motion is assumed for both these scenarios, although extending the formulas
for the case of constantly accelerated linear motion is trivial (assuming that the target
acceleration vector can be reliably estimated).
The eight target-tracking UAV motion types are illustrated in Figure 1 and de-
scribed below:
1) Lateral Tracking Shot (LTS) [7] [34] and 2) Vertical Tracking Shot (VTS) are non-parametric camera motion types, where the camera gimbal does not rotate and the camera is directly locked on the moving target. In LTS, the camera axis is approximately perpendicular both to the local target trajectory and to the WCS vertical axis vector $\tilde{k}$, while the UAV flies sideways/in parallel to the target, matching its speed (if possible). In VTS, the camera axis is perpendicular to the target trajectory and the UAV flies exactly above the target, matching its speed (if possible). In both cases, $\tilde{p}_t$ refers to a varying target position in WCS. During shooting, the UAV position remains constant in TCS, but varies in WCS.

The base mathematical description for both these UAV/camera motion types is fairly simple:

$\tilde{v}_t = \tilde{u}_t, \quad \tilde{o}_t^T\tilde{u}_t \approx 0, \quad x_t = x_{t-1}, \quad l_t = p_t, \quad \forall t.$  (1)

Additionally, the following relations hold for LTS and VTS, respectively:

$o_t \times j \approx 0, \quad x_{03} \approx 0,$  (2)

$o_t^T j \approx 0, \quad x_{03} > 0.$  (3)
Figure 1: Examples of different target-tracking UAV camera motion types: a) Lateral Tracking Shot (LTS); b) Vertical Tracking Shot (VTS); c) Moving Aerial Pan with Moving Target (MAPMT); d) Moving Aerial Tilt with Moving Target (MATMT); e) Fly-By (FLYBY); f) Fly-Over (FLYOVER); g) Chase/Follow (CHASE); and h) Orbit (ORBIT).
3) Moving Aerial Pan with Moving Target (MAPMT) and 4) Moving Aerial Tilt with Moving Target (MATMT) are parametric camera motion types, where the camera gimbal rotates (mainly with respect to the yaw/pitch axis, for MAPMT/MATMT, respectively) so as to always keep the linearly moving target centrally framed, while the UAV is flying at a linear trajectory with constant velocity. $\tilde{p}_t$ refers to the target position, varying over time in such a manner that the target and the UAV velocity vector projections onto the ground plane are approximately perpendicular/parallel to each other, for MAPMT/MATMT, respectively.

The drone velocity vector $\tilde{v}_t = [\tilde{v}_{t1}, \tilde{v}_{t2}, \tilde{v}_{t3}]^T$ must be specified. The base mathematical description for both these UAV/camera motion types is given by:

$\tilde{v}_t = \tilde{v}_{t-1}, \quad \tilde{x}_t = \tilde{x}_0 + \frac{\tilde{v}_t}{F}t, \quad l_t = p_t, \quad \forall t.$  (4)

Additionally, the following relations hold for MAPMT and MATMT, respectively:

$[\tilde{u}_{t1}, \tilde{u}_{t2}, 0][\tilde{v}_{t1}, \tilde{v}_{t2}, 0]^T \approx 0,$  (5)

$[\tilde{u}_{t1}, \tilde{u}_{t2}, 0]^T \times [\tilde{v}_{t1}, \tilde{v}_{t2}, 0]^T \approx 0.$  (6)
5) Fly-By (FLYBY) and 6) Fly-Over (FLYOVER) [34]. They are parametric camera motion types, where the camera gimbal is rotating, so that the still or linearly moving target is always centrally framed. The UAV intercepts the target from behind/from the front (and to the left/right, in the case of FLYBY), at a steady altitude (in TCS) with constant velocity, flies exactly above it/passes it by (for FLYOVER/FLYBY, respectively) and keeps on flying at a linear trajectory, with the camera still pointing at the receding target. The UAV and target velocity vector projections onto the ground plane remain approximately parallel during shooting. They can have either identical or opposite direction. $\tilde{p}_t$ refers to a varying or static target position in WCS.

The common parameter that must be specified is $K$, i.e., the time (in seconds) until the UAV is located exactly above the target (for FLYOVER), or until the distance between the target and the UAV is minimized (for FLYBY). Additionally, the length $d$ of the projection of that minimum distance vector onto the ground plane must be specified for FLYBY. Below, the target velocity is assumed constant for reasons of modelling convenience. The mathematical description common to both camera motion types is the following one, for $t \in [0, 2KF]$:

$v_0 = \left[\frac{u_{01}K - x_{01}}{K},\ 0,\ u_{03}\right]^T,$  (7)

$\tilde{v}_t = \tilde{v}_{t-1}, \quad \tilde{u}_t = \tilde{u}_{t-1}, \quad l_t = p_t, \quad \forall t,$  (8)

$\tilde{x}_t = \tilde{x}_0 + \frac{t}{KF}\left(\tilde{x}_{KF} - \tilde{x}_0\right),$  (9)

$[\tilde{u}_{t1}, \tilde{u}_{t2}, 0]^T \times [\tilde{v}_{t1}, \tilde{v}_{t2}, 0]^T \approx 0.$  (10)

Additionally, the following relations hold for FLYOVER:

$\tilde{x}_{KF} = [\tilde{p}_{01} + \tilde{u}_{01}K,\ \tilde{p}_{02} + \tilde{u}_{02}K,\ \tilde{x}_{03} + \tilde{u}_{03}K]^T,$  (11)

$x_{t2} \approx 0, \quad x_t^T j \approx 0, \quad \forall t,$  (12)

and the following hold for FLYBY:

$|x_{02}| = d > 0, \quad x_{t2} = x_{02}, \quad \forall t,$  (13)

$x_{KF} = [0,\ x_{02},\ x_{03}]^T.$  (14)
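As a rough illustration of Eqs. (9) and (11), the sketch below generates the WCS waypoint sequence of a FLYOVER under the stated constant-target-velocity assumption (the function name and the frame-sampling convention are illustrative, not part of any existing system):

```python
import numpy as np

def flyover_waypoints(x0_wcs, p0_wcs, u_wcs, K, F):
    """UAV WCS positions for a FLYOVER of duration 2K seconds (Eqs. (9), (11)).

    x0_wcs: initial UAV position, p0_wcs: initial target position,
    u_wcs: (assumed constant) target velocity, K: seconds until the UAV is
    exactly above the target, F: camera frame rate."""
    x0 = np.asarray(x0_wcs, float)
    p0 = np.asarray(p0_wcs, float)
    u = np.asarray(u_wcs, float)
    # Eq. (11): UAV position when exactly above the target; altitude follows u3.
    x_KF = np.array([p0[0] + u[0] * K, p0[1] + u[1] * K, x0[2] + u[2] * K])
    frames = int(round(2 * K * F))
    # Eq. (9): linear interpolation/extrapolation along the constant-velocity path.
    return [x0 + (t / (K * F)) * (x_KF - x0) for t in range(frames + 1)]

wps = flyover_waypoints([-30.0, 0.0, 10.0], [0.0, 0.0, 0.0], [10.0, 0.0, 0.0], K=10, F=25)
print(wps[0], wps[250], wps[-1])   # start, directly above the target, end of the shot
```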
7) Chase/Follow Shot (CHASE) is a non-parametric camera motion type, where the camera gimbal does not rotate and the camera always points at the target [34]. The UAV follows/leads the target from behind/from the front, while maintaining a steady distance by matching its speed, if possible. $\tilde{p}_t$ refers to a varying target position in WCS. The mathematical description is the following:

$\tilde{v}_t \approx \tilde{u}_t,$  (15)

$x_{t2} = x_{02} \approx 0, \quad x_t = x_{t-1}, \quad l_t = p_t, \quad \forall t.$  (16)
8) Orbit (ORBIT). It is a parametric camera motion type, where the camera gimbal is slowly rotating, so as to always keep the still or linearly moving target properly framed, while the UAV (semi-)circles around the target and, simultaneously, follows the target linear trajectory (if the target is moving) [7] [34]. During shooting, the UAV altitude remains constant in TCS, but may vary in WCS. $\tilde{p}_t$ refers to a varying or static target position in WCS.

The parameters that must be specified are the desired 3D Euclidean distance $d_{3D} = \|\tilde{x}_t - \tilde{p}_t\|_2 = \|x_t\|_2$ (constant over time), the rotation angle $\theta$ around the target and the desired UAV angular velocity $\omega$. Additionally, we can easily derive the initial angle $\theta_0$ formed by the TCS $i$-axis (of time instance $t = 0$) and the vector from $p_0$ to the projection of the known initial position $x_0$ onto the TCS $ij$-plane. Then, ORBIT may be described in TCS using a planar circular motion, for $t \in [0, \frac{F\theta}{\omega}]$:

$\theta_0 = \arctan\left(\frac{x_{02}}{x_{01}}\right),$  (17)

$x_{t3} = x_{03}, \quad \forall t,$  (18)

$\lambda = \sqrt{d_{3D}^2 - x_{t3}^2},$  (19)

$x_t = \left[\lambda\cos\left(\frac{t\omega}{F} + \theta_0\right),\ \lambda\sin\left(\frac{t\omega}{F} + \theta_0\right),\ x_{t3}\right]^T,$  (20)

$l_t = p_t.$  (21)
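The ORBIT description of Eqs. (17)-(21) can be turned into a TCS waypoint generator as sketched below; the use of atan2 instead of a plain arctangent and the frame-count convention are implementation assumptions:

```python
import numpy as np

def orbit_positions_tcs(x0_tcs, d3d, omega, theta, F):
    """UAV TCS positions for an ORBIT shot (Eqs. (17)-(21)).

    x0_tcs: initial UAV position in TCS, d3d: desired constant 3D distance,
    omega: angular velocity (rad/s), theta: total rotation angle (rad), F: frame rate."""
    x0 = np.asarray(x0_tcs, float)
    theta0 = np.arctan2(x0[1], x0[0])          # Eq. (17), via atan2 for quadrant safety
    x3 = x0[2]                                 # Eq. (18): constant TCS altitude
    lam = np.sqrt(d3d**2 - x3**2)              # Eq. (19): circle radius on the ij-plane
    frames = int(round(F * theta / omega))
    return [np.array([lam * np.cos(t * omega / F + theta0),
                      lam * np.sin(t * omega / F + theta0),
                      x3])                      # Eq. (20)
            for t in range(frames + 1)]

# Half-circle around the target, starting 30 m away on the i-axis at 10 m TCS altitude.
wps = orbit_positions_tcs([30.0, 0.0, 10.0], d3d=float(np.linalg.norm([30.0, 0.0, 10.0])),
                          omega=np.pi / 20, theta=np.pi, F=25)
print(len(wps), wps[0], wps[-1])
```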
3. Constraints on Maximum Focal Length
In order for a visual tracker to operate properly, the location (in pixel coordinates) of the target ROI should not differ by more than a threshold between successive video frames/time instances. This requirement places a constraint on the maximum target speed and on the maximum camera focal length f (the main factor determining the maximum achievable zoom level), since a given 3D target displacement (in WCS) corresponds to a greater 2D ROI displacement (in pixels) at a greater zoom level. Proper estimation of the maximum allowable f in each shooting case is of utmost importance in cinematography applications, since it directly affects the range of permissible shot types.
Without loss of generality, we always consider time instance $t = 0$ and, thus, examine an entire shooting session as a sequence of repeated transitions between the "first" ($t = 0$) and the "second" video frame ($t + 1 = 1$). We also assume that the target ROI center is always meant to be fixed at the principal point (image center) of all video frames (central composition). Target position $\tilde{p}_t$ is initially known and $\tilde{p}_{t+1}$ can be predicted using the estimated velocity vector $\tilde{u}_t$, i.e., $\tilde{p}_{t+1} = \tilde{p}_t + \tilde{u}_t\frac{1}{F}$. If the prediction is accurate, the target ROI indeed remains at the center of the $(t+1)$-th video frame.

In contrast, if the actual current target motion differs from the predicted one by the unknown velocity deviation vector $\tilde{q}_t = [\tilde{q}_{t1}, \tilde{q}_{t2}, \tilde{q}_{t3}]^T$, the target ROI at time $t+1$ has to be explicitly localized via 2D visual tracking (in pixel coordinates), so that it can be exploited for 3D target position $\tilde{p}_{t+1}$ estimation and/or for adjusting the framing. Since $\tilde{q}_t$ and, therefore, $\tilde{p}_{t+1}$ are unknown, the following analysis utilizes the TCS defined by the expected/predicted target position at time instance $t+1$.

Whenever $\tilde{q}_t$ is a non-zero vector and, therefore, prediction of $\tilde{p}_{t+1}$ fails, the results of 2D visual tracking and actual $\tilde{p}_{t+1}$ estimation must be employed for updating the target velocity vector and, hopefully, achieving a better prediction during the next time instance. Given that tracker behavior varies per algorithm, we simply assume a maximum search radius $R_{max}$ (in pixels) defining the video frame region within which the tracked object ROI of time instance $t+1$ must lie, relative to the video frame center, in order to permit successful tracking. Thus, a distance $R_{t+1}$ between the actual target ROI center of $t+1$ and the center of that video frame, where $R_{t+1} > R_{max}$, implies tracking failure. The case where $R_{t+1} = R_{max}$ marks the limit scenario where the tracker marginally succeeds. Note that $R_{max}$ is not fixed, since modern trackers adapt the size of their search region to the current ROI size.
3.1. Maximum focal length
In order to find the maximum focal length so that there is no target tracking failure, we assume that the expected position of the target in TCS is always at $[0, 0, 0]^T$. Let $o_t = l_t - x_t$ be the LookAt vector at time instance $t$ and $d_t = \sqrt{x_{t1}^2 + x_{t2}^2}$ be the distance between the target and the UAV, projected on the $ij$-plane.

Based on the above and the camera projection equations [36], the following hold:

$x_d(t+1) = o_x - \frac{f}{s_x}\,\frac{r_1^T(p_{t+1} - x_{t+1})}{r_3^T(p_{t+1} - x_{t+1})},$  (22)

$y_d(t+1) = o_y - \frac{f}{s_y}\,\frac{r_2^T(p_{t+1} - x_{t+1})}{r_3^T(p_{t+1} - x_{t+1})},$  (23)
where $x_d(t+1)$, $y_d(t+1)$ are the target center pixel coordinates at time instance $(t+1)$, $o_x$, $o_y$ define the image center in pixel coordinates and $s_x$, $s_y$ denote the pixel size (in mm) along the horizontal and vertical directions. $r_1$, $r_2$ and $r_3$ refer, respectively, to the first, second and third row of the rotation matrix $R$ that orients the camera gimbal according to the LookAt vector.

In general, the coordinate transform matrix from TCS to the camera coordinate system can be found by two rotations and one translation of the unit TCS vectors. The required rotations are around the TCS $k$-axis and $j$-axis. Thus, $R$ can be described as follows [2]:

$R = \begin{pmatrix} \cos(\theta_z)\cos(\theta_y) & -\sin(\theta_z) & \cos(\theta_z)\sin(\theta_y) \\ \sin(\theta_z)\cos(\theta_y) & \cos(\theta_z) & \sin(\theta_z)\sin(\theta_y) \\ -\sin(\theta_y) & 0 & \cos(\theta_y) \end{pmatrix},$  (24)

where $\theta_z$ and $\theta_y$ are the appropriate angles of rotation for $R_z$ and $R_y$, respectively. However, given that $R$ is an orthogonal change-of-basis matrix and that, in most of the motion types, the UAV does not fly exactly above the target, it is easier to obtain the rows of $R$ as follows. Since the camera axis points directly at the target, the unit vector of the $k$-axis of the Camera Coordinate System, i.e., $r_3$, can be obtained from $x_{t+1}$ as follows:

$r_3 = \left(-\frac{x_{t+1}}{\|x_{t+1}\|}\right)^T.$  (25)

For motion types where the UAV does not fly exactly above the target, $r_1$ is the cross product of $r_3$ with the unit vector $k$:

$r_1' = \left(k \times \frac{x_{t+1}}{\|x_{t+1}\|}\right)^T,$  (26)

$r_1 = \frac{r_1'}{\|r_1'\|}.$  (27)

Thus, $r_2$ is given by the cross product $r_3 \times r_1$:

$r_2' = \left(-\frac{x_{t+1}}{\|x_{t+1}\|} \times \left(k \times \frac{x_{t+1}}{\|x_{t+1}\|}\right)\right)^T,$  (28)

$r_2 = \frac{r_2'}{\|r_2'\|}.$  (29)
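A minimal sketch of this row-wise construction is given below, following the reconstruction of Eqs. (25)-(29) above and falling back to the fixed matrix of the special case treated later in Eq. (35) when the UAV is directly above the target (function name and tolerance are illustrative):

```python
import numpy as np

def lookat_rotation(x_next):
    """Rows r1, r2, r3 of the camera rotation matrix R (Eqs. (25)-(29)).

    x_next: UAV position in TCS at t+1, with the expected target at the origin."""
    x = np.asarray(x_next, float)
    k = np.array([0.0, 0.0, 1.0])
    if np.hypot(x[0], x[1]) < 1e-9:            # d_{t'} = 0: UAV directly above target
        return np.array([[-1.0, 0.0, 0.0],
                         [ 0.0, 1.0, 0.0],
                         [ 0.0, 0.0, -1.0]])
    x_hat = x / np.linalg.norm(x)
    r3 = -x_hat                                # camera axis points at the target
    r1 = np.cross(k, x_hat)
    r1 /= np.linalg.norm(r1)
    r2 = np.cross(r3, r1)                      # already unit-length (r3, r1 orthonormal)
    return np.vstack([r1, r2, r3])
```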
In our approach we consider central composition, thus the target ROI center should be located at $(o_x, o_y)$ at all times. Assuming that at time instance $t$ the target ROI center is aligned with the frame center, at time instance $t' = t + 1$ the target ROI center will be translated to new pixel coordinates, due to camera and target movement in the real world. The central pixel translation of the ROI, $R$, can be calculated by employing Eqs. (22) and (23) and simple geometrical rules, as depicted in Fig. 2. By setting a maximum $R$ value, thus applying the limit constraint $R_{t+1} = R_{max}$, we derive the following equation:

$R_{max} = \sqrt{(x_d(t+1) - o_x)^2 + (y_d(t+1) - o_y)^2}.$  (30)

Figure 2: ROI translation between two consecutive video frames, for time instances $t$ and $t' = t + 1$. The distance between the central pixels of the two ROIs, $R$, can be calculated by employing the results of Eqs. (22) and (23).
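For illustration, the following self-contained sketch evaluates this ROI displacement for an actual (deviated) target position, by combining the row construction of Eqs. (25)-(29) with Eqs. (22), (23) and (30); the example geometry, deviation and focal length are arbitrary assumptions, and the UAV is assumed not to be directly above the target:

```python
import numpy as np

def roi_displacement_px(x_next, p_next, f, s_x, s_y):
    """Pixel distance between the projected actual target and the frame centre
    (Eqs. (22), (23), (30)); the camera is oriented at the expected target (TCS origin)."""
    x = np.asarray(x_next, float)
    k = np.array([0.0, 0.0, 1.0])
    x_hat = x / np.linalg.norm(x)
    r3 = -x_hat                                   # camera axis toward the expected target
    r1 = np.cross(k, x_hat); r1 /= np.linalg.norm(r1)
    r2 = np.cross(r3, r1)
    v = np.asarray(p_next, float) - x             # actual target relative to the camera
    xd = (f / s_x) * (r1 @ v) / (r3 @ v)          # Eq. (22) with o_x = 0 (sign immaterial here)
    yd = (f / s_y) * (r2 @ v) / (r3 @ v)          # Eq. (23) with o_y = 0
    return float(np.hypot(xd, yd))                # Eq. (30)

# Example: CHASE-like geometry, 7.5 m/s deviation along the j-axis at F = 25 fps, f = 200 mm.
print(roi_displacement_px((30.0, 0.0, 10.0), (0.0, 7.5 / 25, 0.0), f=200, s_x=0.009, s_y=0.009))
```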
Assuming that $x_{t'} = [x_{t'1}, x_{t'2}, x_{t'3}]^T$ and $p_{t'} = [\frac{q_{t1}}{F}, \frac{q_{t2}}{F}, \frac{q_{t3}}{F}]^T$, where $t' = t + 1$, and substituting Eqs. (22) and (23) in Eq. (30), $R_{max}$ can be obtained by:

$R_{max} = \sqrt{\frac{f_{max}^2\,\|x_{t'}\|^2E_3^2}{s_x^2\,N\,(E_1 + F\|x_{t'}\|^2)^2} + \frac{f_{max}^2\,(q_{t3}N - E_2x_{t'3})^2}{s_y^2\,N\,(E_1 + F\|x_{t'}\|^2)^2}},$  (31)

where

$N = x_{t'1}^2 + x_{t'2}^2.$

Eq. (31) can be solved for $f$ to obtain the maximum focal length $f_{max}$ for motion types having $d_{t'} > 0$:

$f_{max} = \frac{R_{max}\,d_{t'}\,s_xs_y\,\left|E_1 + F\|x_{t'}\|^2\right|}{\sqrt{(s_xq_{t3}d_{t'}^2 - s_xx_{t'3}E_2)^2 + s_y^2E_3^2\|x_{t'}\|^2}},$  (32)

where

$E_1 = -q_{t1}x_{t'1} - q_{t2}x_{t'2} - q_{t3}x_{t'3},$
$E_2 = q_{t1}x_{t'1} + q_{t2}x_{t'2},$
$E_3 = q_{t2}x_{t'1} - q_{t1}x_{t'2}.$
Since most of the UAV motion types are not affected by target altitude changes between successive video frames, which are less likely to happen than direction and speed changes, $p_{t'}$ can be expressed as follows:

$p_{t'} = \left[\frac{q_{t1}}{F},\ \frac{q_{t2}}{F},\ 0\right]^T.$  (33)

In this case, the maximum focal length is given by:

$f_{max} = \frac{R_{max}\,d_{t'}\,s_xs_y\,\left|-E_2 + F\|x_{t'}\|^2\right|}{\sqrt{s_x^2E_2^2x_{t'3}^2 + s_y^2E_3^2\|x_{t'}\|^2}}.$  (34)
When the UAV/camera is located exactly above the target for the $(t+1)$-th video frame, i.e., $x_{t'} = [0, 0, x_{t'3}]^T$, $R$ cannot be derived as described in Eqs. (25)-(29), since $r_1' = k \times \frac{x_{t+1}}{\|x_{t+1}\|} = 0$. In this special case, where $d_{t'} = 0$, it is easier to calculate the rotation matrix using (24), for $\theta_z = 0$ and $\theta_y = 180^{\circ}$:

$R = \begin{pmatrix} -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & -1 \end{pmatrix}.$  (35)

Then, the maximum focal length is given by:

$f_{max} = \frac{R_{max}\,F\,x_{t'3}\,s_xs_y}{\sqrt{s_y^2q_{t1}^2 + s_x^2q_{t2}^2}}.$  (36)
As it can be seen from the above, in general, the derived formulas rely on knowing,
predicting or estimating a velocity deviation vector qtthat models the degree to which
instantaneous target 3D motion differs from uniform linear motion. Several options are
available for obtaining qt. A reasonable choice would be to assume an instantaneously
constant acceleration vector at each time instance. A stricter policy would be to
derive fmax for various candidate velocity deviations, which displace the target towards
different spatial directions, and output the minimum among the computed fmax values.
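A minimal sketch of this policy, directly implementing Eqs. (32) and (36) and taking the minimum over a set of externally supplied candidate deviation vectors, could look as follows (the candidate set and the example geometry are assumptions for illustration only):

```python
import numpy as np

def f_max(x_next, q, R_max, s_x, s_y, F):
    """Maximum allowable focal length (mm) for one candidate velocity deviation q,
    per Eqs. (32) and (36); x_next is the UAV TCS position at t+1."""
    x1, x2, x3 = x_next
    q1, q2, q3 = q
    d = np.hypot(x1, x2)
    if d < 1e-9:   # Eq. (36): UAV exactly above the target (assumes q3 = 0 here)
        return R_max * F * x3 * s_x * s_y / np.sqrt(s_y**2 * q1**2 + s_x**2 * q2**2)
    norm2 = x1**2 + x2**2 + x3**2
    E1 = -q1 * x1 - q2 * x2 - q3 * x3
    E2 = q1 * x1 + q2 * x2
    E3 = q2 * x1 - q1 * x2
    num = R_max * d * s_x * s_y * abs(E1 + F * norm2)
    den = np.sqrt((s_x * q3 * d**2 - s_x * x3 * E2)**2 + s_y**2 * E3**2 * norm2)
    return num / den

# Conservative policy: minimum f_max over several candidate deviations (m/s).
candidates = [(7.5, 0, 0), (-7.5, 0, 0), (0, 7.5, 0), (0, -7.5, 0),
              (7.5, 7.5, 0), (-7.5, 7.5, 0), (-7.5, -7.5, 0), (7.5, -7.5, 0)]
print(min(f_max((30.0, 0.0, 10.0), q, R_max=360, s_x=0.009, s_y=0.009, F=25)
          for q in candidates))
```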
3.2. Simulations for specific UAV/camera motion types
In order to investigate the maximum possible focal length for a specific motion type shot, we simulated the motion for various representative UAV shooting scenarios. We studied 8 different cases for the deviation vector $q_t$. In the first two cases, the target linearly accelerates/decelerates, i.e., $q_{t1} = [7.5, 0, 0]^T$, $q_{t2} = [-7.5, 0, 0]^T$. Velocity deviations are expressed in meters/second. In the third and fourth cases, the target is moving along a different direction than the expected one ($q_{t3} = [0, 7.5, 0]^T$, $q_{t4} = [0, -7.5, 0]^T$), but remains on the TCS $j$-axis. In the remaining cases, the target is moving diagonally to the TCS axes, with $q_{t5}$ to $q_{t8}$ covering the four diagonal directions $[\pm 7.5, \pm 7.5, 0]^T$. Figure 3 depicts the expected against the actual position of the target in each case.
Figure 3: The expected against the actual target position in the $(t+1)$-th time instance, for the 8 simulated cases. TCS $i$ and $j$ axes are denoted by black and grey color, respectively.

The following parameters have been used in the performed simulations. Maximum tracker search radius $R_{max}$ was generously fixed to 360 pixels, so as to model the obvious constraint that the central target ROI pixel stays visible among consecutive video frames (when using a High Definition camera sensor); otherwise, visual tracking fails. This is a hard upper bound on $R_{max}$, thus bypassing the need for adaptive $R_{max}$
in this set of experiments. The pixel size was set to $s_x = s_y = 0.009$ mm and the video frame rate to $F = 25$ fps. All of the experiments were carried out on a Linux PC
equipped with an Intel i7 CPU and 32 GB of RAM. However, the proposed rules can
be easily computed in real-time on an embedded system (e.g. nVidia Jetson, Intel NUC,
etc.), in conjunction with a fast 2D visual tracker.
3.2.1. Lateral Tracking Shot
In LTS, the UAV flies alongside the target, as described in Section 2. In this motion type, even small target altitude variations have a great impact on picture framing. Therefore, we assume that $q_{t3} \neq 0$. The UAV position is given by $x_{t+1} = [0, x_{t2}, 0]^T$. As $p_{t+1} = [\frac{q_{t1}}{F}, \frac{q_{t2}}{F}, \frac{q_{t3}}{F}]^T$, Eq. (32) can now be rewritten as follows:

$f_{max} = \frac{R_{max}\,s_xs_y\,|q_{t2} - Fx_{t2}|}{\sqrt{s_y^2q_{t1}^2 + s_x^2q_{t3}^2}}.$  (37)
The LTS simulation was performed for varying values of $q_{t3}$. The horizontal distance between the UAV and the target was chosen to be $\lambda = x_{t2} = 30$ m. Simulation results are shown in Figure 4. As expected, variations in altitude affect all study cases 1 - 8. When the target deviates from its expected TCS position $[0, 0, 0]^T$, but is located on the $j$-axis, i.e., $p_{t+1} = [0, \frac{q_{t2}}{F}, 0]^T$, $f_{max}$ is only affected by altitude changes. This behavior is reasonable, since the camera $k$-axis unit vector can be expressed in TCS as $k_c = [0, -1, 0]^T$. Consequently, the projected ROI center will not change in pixel
as kc= [0,1,0]T. Consequently, the projected ROI center will not change in pixel
coordinates, therefore, this target deviation should have no impact at all on fmax, when
qt3= 0. The other results are affected by linear target acceleration/deceleration along
the TCS i-axis. As expected, fmax is maximized for these cases (1, 2 and 5 - 8) when
the target altitude does not vary between successive video frames. Due to the position
of the UAV, target acceleration and deceleration have identical impact on fmax.
Figure 4: Simulation results for LTS: fmax against qt3.

Figure 5: Simulation results for VTS: fmax against altitude (xt3).
3.2.2. Vertical Tracking Shot
In VTS, the UAV flies exactly above the target, therefore, the maximum focal length
is given by Eq. (36). The UAV is positioned at xt+1 = [0,0, xt3]T. The 8 case studies
were simulated for various UAV TCS altitudes, i.e., for various values of xt3. Thus,
we obtained the maximum focal length allowed in the VTS scenario for various UAV
altitudes, under the assumption that target altitude remains approximately constant be-
tween successive video frames, i.e., $q_{t3} = 0$. Target position at time $t+1$ is given by $p_{t+1} = [\frac{q_{t1}}{F}, \frac{q_{t2}}{F}, 0]^T$. The results are presented in Figure 5, where the horizontal
axis unit is meters and the vertical axis unit is millimetres. As expected, the maximum
focal length increases linearly with xt3. When the target is moving diagonally to the
TCS axes (cases 5 - 8) the maximum possible focal length is lower than in cases 1 -
4. Target motion along the j-axis (cases 3 and 4) and target linear acceleration (cases
1 and 2) have similar effect on the maximum allowed focal length, since the UAV is
positioned exactly above the target.
3.2.3. Moving Aerial Pan with Moving Target/Moving Aerial Tilt with Moving Target
Given the mathematical description for MAPMT/MATMT in (4) and the fact that the target is moving along the $i$-axis, we can assume that $x_{t+1} = [x_{t1}, x_{t2} + \frac{v_{t2}}{F}, x_{t3}]^T$ for MAPMT and $x_{t+1} = [x_{t1} + \frac{v_{t1}}{F}, x_{t2}, x_{t3}]^T$ for MATMT. For the UAV position at time instance $t+1$, the target position in the next video frame is given by Eq. (33). By substituting $x_{t+1}$ in Eq. (34), the following relations hold:

$f_{max} = \frac{R_{max}\,d_{mp}\,s_xs_y\,\left|-E_{mp1} + F\|x_{t+1}\|^2\right|}{\sqrt{s_x^2E_{mp1}^2x_{t3}^2 + s_y^2E_{mp2}^2\|x_{t+1}\|^2}},$  (38)

$f_{max} = \frac{R_{max}\,d_{mt}\,s_xs_y\,\left|-E_{mt1} + F\|x_{t+1}\|^2\right|}{\sqrt{s_x^2E_{mt1}^2x_{t3}^2 + s_y^2E_{mt2}^2\|x_{t+1}\|^2}},$  (39)

for MAPMT and MATMT, respectively, where:

$d_{mp} = \sqrt{x_{t1}^2 + \left(x_{t2} + \frac{v_{t2}}{F}\right)^2},$
$E_{mp1} = q_{t1}x_{t1} + q_{t2}\left(x_{t2} + \frac{v_{t2}}{F}\right),$
$E_{mp2} = -q_{t2}x_{t1} + q_{t1}\left(x_{t2} + \frac{v_{t2}}{F}\right),$
$d_{mt} = \sqrt{x_{t2}^2 + \left(x_{t1} + \frac{v_{t1}}{F}\right)^2},$
$E_{mt1} = q_{t2}x_{t2} + q_{t1}\left(x_{t1} + \frac{v_{t1}}{F}\right),$
$E_{mt2} = -q_{t1}x_{t2} + q_{t2}\left(x_{t1} + \frac{v_{t1}}{F}\right).$
For simulation purposes, $f_{max}$ was studied for varying distances between the target and the UAV, corresponding to consecutive time instances of the UAV/camera motion type execution. The following initial values were selected: $x_{01} = 30$ m, $x_{02} = -60$ m (MAPMT), $x_{01} = -60$ m, $x_{02} = 30$ m (MATMT), $x_{03} = 10$ m, $v_{t2} = 10$ m/s (both). The similarities between Figures 6 and 7, for MAPMT and MATMT, respectively, are evident. As expected, cases 1, 2/3, 4 of MAPMT correspond to cases 3, 4/1, 2 of MATMT, since these two motion types differ only in the UAV motion direction: it is parallel to the $j$-axis/$i$-axis in MAPMT/MATMT, respectively. The impact on $f_{max}$ of target motion deviation along the TCS $j$-axis for MAPMT will be the same as the impact of target motion deviation along the TCS $i$-axis for MATMT, and vice versa, as Figure 8 demonstrates. Therefore, cases 5, 6 and 7, 8 produce identical results in both motion types.
Studying the results of cases 1 and 2 for MAPMT and cases 3 and 4 for MATMT,
fmax takes its maximum value when xt2= 0 and xt1= 0, respectively. The reason
is that, in these positions, the UAV in MAPMT is above the i-axis, while in MATMT
above the j-axis, thus any deviations in target motion affect minimally the ROI location
in the next video frame. On the other hand, in all other cases, these UAV positions are
approximately where any target motion deviations have the greatest impact on the next
ROI location.
Figure 6: Simulation results for MAPMT: fmax against xt2.

Figure 7: Simulation results for MATMT: fmax against xt1.
Figure 8: Target velocity deviation vectors as seen from the UAV camera, when the camera axis lies on: a) the j-axis and b) the i-axis. The black dot denotes the expected target position. Black vectors correspond to cases 1 and 2, grey vectors to cases 3 and 4 and, finally, the dashed vectors to cases 5-8. In a), target velocity deviation along the j-axis affects fmax less than target linear speed changes, while in b) the opposite holds.
3.2.4. Fly-By/Fly-Over
In these motion types, where shot duration is specified by $K$, we can determine the maximum focal length directly over time ($t \in [0, 2K]$). For FLYBY, the UAV position in TCS is given by $x_{t+1} = [-\frac{x_{01}}{K}t + x_{01}, x_{02}, x_{03}]^T$. We study these motion types together, since FLYOVER is a special case of FLYBY, where $x_{02} = 0$.

By substituting $x_{t+1}$ in Eq. (34), $f_{max}$ is given by:
$f_{max} = \frac{R_{max}\,d_{fb}\,s_xs_y\,\left|-E_{fb1} + F\|x_{t+1}\|^2\right|}{\sqrt{s_x^2E_{fb1}^2x_{t3}^2 + s_y^2E_{fb2}^2\|x_{t+1}\|^2}},$  (40)

$f_{max} = \frac{R_{max}\,d_{fo}\,s_xs_y\,\left|-E_{fo1} + F\|x_{t+1}\|^2\right|}{\sqrt{s_x^2E_{fo1}^2x_{t3}^2 + s_y^2E_{fo2}^2\|x_{t+1}\|^2}},$  (41)

for FLYBY and FLYOVER, respectively, where:

$d_{fb} = \sqrt{\left(-\frac{x_{01}}{K}t + x_{01}\right)^2 + x_{02}^2},$
$E_{fb1} = q_{t1}\left(-\frac{x_{01}}{K}t + x_{01}\right) + q_{t2}x_{t2},$
$E_{fb2} = q_{t2}\left(-\frac{x_{01}}{K}t + x_{01}\right) - q_{t1}x_{t2},$
$d_{fo} = \left|-\frac{x_{01}}{K}t + x_{01}\right|,$
$E_{fo1} = q_{t1}\left(-\frac{x_{01}}{K}t + x_{01}\right),$
$E_{fo2} = q_{t2}\left(-\frac{x_{01}}{K}t + x_{01}\right).$
The following parameter values were chosen for the simulation: $x_{01} = -30$ m, $x_{03} = 10$ m, $K = 10$, thus $t \in [0, 20]$. Additionally, $x_{02} = 15$ m for FLYBY. Results are shown in Figures 9 and 10, for FLYBY and FLYOVER, respectively. The gap in FLYOVER for $t = 10$ stems from the fact that the UAV is actually above the target and, thus, the motion type is momentarily converted to VTS.
In cases 1 and 2, both motion types produce similar results. As the UAV approaches
the target, the maximum focal length decreases, before increasing again as the UAV is
flying parallel to the i-axis. When the drone is positioned far from the target, any
change in target speed corresponds to a small change in the distance between the UAV
and the target.
In general, for cases 3 and 4 of FLYBY, where the target deviates from its expected
position but remains on the j-axis, fmax increases with rising distance between the
UAV and the target. Additionally, fmax also slightly increases when the UAV is very
close to the target. Then, the latter’s velocity deviation corresponds to a small change
in distance between the target and the UAV, mapped to a small ROI displacement and,
thus, greater focal length tolerance. In FLYOVER, where any deviation of the target
motion on the j-axis will always displace the target ROI to the left or right of the video
frame, fmax is significantly smaller for cases 3 and 4.
Finally, in cases 5-8 of FLYBY, $f_{max}$ depends on the angle between the LookAt vector and the $i$-axis: it has lower values when this angle is close to $\frac{\pi}{2}$ ($t = 10$ in the simulation). In FLYOVER, the overall minimum values of $f_{max}$ are also obtained
the simulation). In FLYOVER, the overall minimum values of fmax are also obtained
for cases 5-8 when t= 10, since, then, the 3D distance between the expected and
the actual target position is slightly greater compared to cases 1-4, as it can be seen in
Figure 3, leading to greater 2D ROI displacement.
3.2.5. Chase
The focal length constraint for this motion type is a special case of Eq. (34) where $x_{t2} = 0$. Since the UAV is always located in front of/behind the target and at a steady distance, its position at time instance $t+1$ is given by $x_{t+1} = [x_{t1}, 0, x_{t3}]^T$. Target position in the next time instance is given by Eq. (33).

Figure 9: Simulation results for FLYBY: fmax over time t.

Figure 10: Simulation results for FLYOVER: fmax over time t.

Figure 11: Simulation results for CHASE: fmax against distance from target.

By combining Eqs. (33) and (34), the following relation holds:
$f_{max} = \frac{R_{max}\,s_xs_y\,\varphi_c\,\left|-F\varphi_c^2 + x_{t1}q_{t1}\right|}{x_{t1}\sqrt{s_y^2\varphi_c^2q_{t2}^2 + s_x^2x_{t3}^2q_{t1}^2}},$  (42)

where

$\varphi_c = \sqrt{x_{t1}^2 + x_{t3}^2}.$  (43)
For simulation purposes, we studied fmax using varying distances between the
target and the UAV, as well as constant TCS altitude (xt3= 10 m). The results are
shown in Figure 11. As expected, the maximum focal length increases with rising
distance between the UAV and the target. In cases 1 and 2, fmax is much larger than
in the other cases, since an increase or a decrease of the target speed will simply move the target slightly farther from or closer to the UAV. When the distance between the UAV and
the target is increased, the target has to deviate more from its expected position, so
that Rt+1 > Rmax in the next video frame. This is due to the fact that target speed
deviation has less effect on target position in the next video frame, as this UAV/camera
motion type starts to produce a visual result similar to that of LTS, but with the UAV
located ahead/behind the target.
On the contrary, for cases 3 and 4 where the target deviates along the j-axis in
the next video frame, this UAV/camera motion type is highly affected. As Figure 8b
demonstrates, if the target moves along the j-axis, the ROI center in the next video
frame is displaced according to target motion velocity deviation. However, this dis-
placement is also inversely proportional to the distance between the target and the
UAV/camera, due to perspective projection. Thus, lower focal length tolerances and
a more linear increase in fmax as xt1rises is expected. Similar conclusions can be
drawn for cases 5 - 8.
3.2.6. Orbit
For the ORBIT motion type, the target position is given by Eq. (33). By using Eqs. (17) - (21), $f_{max}$ is given by substituting

$x_{t+1} = \left[\lambda\cos\left(\frac{\omega}{F} + \theta_0\right),\ \lambda\sin\left(\frac{\omega}{F} + \theta_0\right),\ x_{t3}\right]^T$  (44)

in (34):

$f_{max} = \frac{R_{max}\,d_{or}\,s_xs_y\,\left|-E_{or1} + F\|x_{t+1}\|^2\right|}{\sqrt{s_x^2E_{or1}^2x_{t3}^2 + s_y^2E_{or2}^2\|x_{t+1}\|^2}},$  (45)

where:

$d_{or} = \sqrt{\left(\lambda\cos\left(\frac{\omega}{F} + \theta_0\right)\right)^2 + \left(\lambda\sin\left(\frac{\omega}{F} + \theta_0\right)\right)^2},$
$E_{or1} = q_{t1}\lambda\cos\left(\frac{\omega}{F} + \theta_0\right) + q_{t2}\lambda\sin\left(\frac{\omega}{F} + \theta_0\right),$
$E_{or2} = -q_{t1}\lambda\sin\left(\frac{\omega}{F} + \theta_0\right) + q_{t2}\lambda\cos\left(\frac{\omega}{F} + \theta_0\right).$
The following parameter values were used in the simulations: $\lambda = 30$ m, $x_{03} = 10$ m, $\omega = \frac{\pi}{20}$ rad/sec. The results are depicted in Figure 12. The horizontal axis
represents the current θ0, i.e., the angle denoting the current UAV position relative
to the target along a circular trajectory. The estimated fmax complies with intuitive
expectations in all cases. For instance, in case 1, the target linearly accelerates. If the UAV lies exactly behind the target ($\theta_0 = 0^{\circ}$), $f_{max}$ takes its maximum value, since, from that perspective, a linear acceleration will not significantly alter the target ROI center pixel coordinates. In contrast, linear acceleration will have a much greater impact from a lateral perspective ($\theta_0 = 90^{\circ}$). Indeed, $f_{max}$ takes its minimum value in this case. As expected, $f_{max}$ varies periodically as the UAV view changes from a
lateral one to a collinear one and vice versa. Similar conclusions can be drawn for the
scenario of linear target deceleration (case 2), where the target trajectory also remains
identical to the expected one.
In cases 3 and 4, if the UAV is positioned collinearly to the estimated target velocity vector ($\theta_0 = 0^{\circ}$), it has in fact a lateral view of the actual target motion. If it is positioned perpendicularly to the estimated velocity vector ($\theta_0 = 90^{\circ}$), it has in fact a collinear (frontal/rear) view of the actual target motion. Therefore, the plots of cases 1, 2 and of cases 3, 4 have a relative phase difference of $\frac{\pi}{2}$, as one would expect.
As shown in Figure 12, in cases 5 and 6, where the target moves diagonally to its expected trajectory, the corresponding plots have an absolute phase difference of $\frac{\pi}{8}$ relative to the previously described plots. Additionally, the $f_{max}$ values are lower than those of cases 3 and 4. These observations are reasonable, since, when $\theta_0 = 45^{\circ}$,
the UAV has in fact a frontal/rear view of the actual target motion. Also, this scenario
presents the greatest difference (in pixel coordinates) between the expected and the
actual target ROI center location. Therefore, greater limitations are naturally imposed
on fmax, so that 2D visual tracking is successful.
Figure 12: Simulation results for ORBIT motion type: fmax against θ0.
Finally, cases 7 and 8 produce similar results, since the target again moves diagonally to the TCS axes. However, when compared to cases 5 and 6, the perpendicularity of the motion directions leads to a phase difference of $\frac{\pi}{4}$.
4. Shot Type Feasibility
In cinematography planning, it is important to be able to determine whether a de-
sired shot type is feasible, given a specific camera motion type and the target’s physical
dimensions. The shot type is primarily defined by the ratio of the target ROI height to
the video frame height, therefore, it is linked to the video frame area being covered by
the target ROI. Thus, below, video frame coverage refers to the ROI-to-video-frame-
height ratio.
In order to examine the feasibility of a shot type, the appropriate focal length fs
leading to the desired target video frame coverage must be calculated. For motion
types where the distance between the target and the UAV varies over time, keeping a
constant target video frame coverage by constantly adjusting the camera focal length
simulates the cinematographic “dolly zoom” effect [5].
The shot type can be achieved without risking 2D visual tracking failure, if the following relation holds:

$f_s \leq f_{max}.$  (46)
In order to calculate the appropriate fsfor achieving the shot types described in
Section 2 with respect to the desired UAV/camera motion type, we model the target as
a sphere, with its center located at the TCS point $[0,0,0]^T$ and having constant radius $R_t$. Simple sphere-modelling allows us to consider its image on the video frame as a circle, with no perspective distortion when $l_t = [0,0,0]^T$.
This rather simplistic target volume modelling facilitates the derivation of closed forms for $f_s$, without much deviation from reality when the object is not very flattened. In the case of significantly flattened targets, which could be better modelled with a rectangular parallelepiped, sphere-based modelling results in an overestimation of $f_s$. Then, a simple solution is to perform the same analysis considering three different sphere radii, i.e., one for each parallelepiped dimension, and use either their mean, their maximum or their minimum. However, in the case of human heads, which is very important in cinematic media imaging, simple bounding-sphere-based modelling is already quite accurate.
Below, the deviation vector $q_t$ is assumed to be equal to $[0,0,0]^T$ for the desired $f_s$ calculations. Thus, no target motion deviations are taken into consideration, since
they do not significantly affect the resulting video frame coverage percentage.
4.1. Constant target video frame coverage
Determining the video frame coverage for every UAV/camera motion type would
normally include projecting the target sphere onto the video frame, finding the cor-
responding radius of the projected circle and computing the resulting coverage. This
requires a search for the radius of the projected circle. The parameters determining
the video frame coverage are the distance between UAV/camera and target, the camera
focal length f and the physical target dimensions. Thus, without loss of generality,
instead of directly projecting the target onto the current image plane, we determine the
video frame coverage as if the UAV/camera was positioned exactly above the target in
an altitude equal to the actual distance between them. Thus, it is trivial to find a 3D
point being projected on the target image circle. Then, the latter’s radius is the distance
between the projection of the above 3D point and the principal point. This projec-
tion can be obtained by Eqs. (22) and (23) in pixel coordinates. The corresponding
continuous coordinates of xim and yim on the image sensor are given by:
xim =xdsx, yim =ydsy.(47)
Thus, the video frame coverage percentage for the circular target ROI is given by:
cs=2Rim
Hsy
, Rim =qx2
im +y2
im.(48)
where His the height of the video frame in pixels and sythe physical height of one
pixel.
The above equations can be further simplified by defining Rim as the perspective
projection of pr= [Rt,0,0]T(in TCS), where Rtis target radius, and by positioning
the UAV/camera at x0=xt+1 = [0,0, zd]Twhere zd=px2
t01+x2
t02+x2
t03is the
distance between the target and the camera. Then, yim = 0, thus, Rim =xim and:
xim =1
2csHsy(49)
By utilizing Eqs. (22) and (47), and setting ox= 0:
xim =fs
r1(prx0)
r3(prx0).(50)
The rotation matrix in this case is described by Eq. (35), and the appropriate focal
length can be obtained by:
fs=csHsyzd
2Rt
.(51)
24
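Eq. (51), together with the feasibility rule of Eq. (46), reduces to a one-line computation; the following sketch is illustrative only (the example target radius, distance, frame height and the externally supplied fmax bound are assumptions):

```python
def f_s(c_s, H, s_y, z_d, R_t):
    """Focal length (mm) achieving frame-height coverage c_s for a spherical target of
    radius R_t (m) at distance z_d (m), per Eq. (51); H in pixels, s_y in mm/pixel."""
    return c_s * H * s_y * z_d / (2.0 * R_t)

# Example: Long Shot coverage (25%) of a 1 m-radius target, 30 m away, 720-line frame.
fs = f_s(0.25, 720, 0.009, 30.0, 1.0)
f_max_current = 194.4            # would come from Eq. (32) for the current geometry
print(fs, fs <= f_max_current)   # shot feasibility check, Eq. (46)
```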
Table 3: Shot type feasibility for UAV/camera motion types with constant distance from the target.

Motion type    min fmax    fs, when cs = 25%    fs, when cs = 85%
LTS            194.4 mm    78.57 mm             267.14 mm
CHASE          142.4 mm    78.57 mm             267.14 mm
ORBIT          241.5 mm    78.57 mm             267.14 mm
Figure 13: Maximum focal length fmax and fs for Medium Shot and Close-Up Shot, against various UAV altitudes, for VTS.
4.2. Simulations for constant target video frame coverage
In order to investigate the target tracking feasibility for specific shot type-UAV/camera motion type combinations, one can repeat the simulations described in Section 3.2 and determine if the desired $f_s$ is below the minimum value of $f_{max}$ for all cases. A trivial addition, which is omitted here for brevity, would include a check for violations of lens-specific upper/lower focal length limits.

For the UAV/camera motion types where the distance between the camera and the target remains constant (i.e., CHASE, ORBIT, LTS), the desired $f_s$ is also constant for the entire shot. On the contrary, when the distance between the target and the UAV/camera varies (i.e., MAPMT, MATMT, FLYBY, FLYOVER, VTS), the appropriate $f_s$ varies correspondingly. Although VTS is normally a UAV/camera motion type where the distance between the UAV and the target remains constant, it was studied for varying $z_d$ in our simulations. Hence, in the first group of camera motion types, shot feasibility can be determined simply by two values, the minimum $f_{max}$ and the desired $f_s$. In the second group, feasibility should be examined for the entire shot duration, or for a range of $z_d$ values in the case of VTS.

For simulation purposes, we assume a sphere-shaped target positioned at $p = [0,0,0]^T$ (in TCS), with radius $R_t = 1$ m (e.g., a racing bicycle during sports event coverage). In all motion types, the UAV and target position/motion/deviation properties comply with the descriptions in Section 3.2. In addition, the video frame resolution was set to $W = 1280$ pixels and $H = 720$ pixels. Simulations were carried out for
two desired video frame coverage percentages, i.e., $c_s = 25\%$ and $c_s = 85\%$, corresponding to a Long Shot and a Close-Up Shot, respectively.

Figure 14: Maximum focal length fmax and fs for Medium Shot and Close-Up Shot, against time t, for FLYBY.

Figure 15: Maximum focal length fmax and fs for Medium Shot and Close-Up Shot, against time t, for FLYOVER.

Figure 16: Maximum focal length fmax and fs for Medium Shot and Close-Up Shot, against various UAV positions, for MAPMT.

Figure 17: Maximum focal length fmax and fs for Medium Shot and Close-Up Shot, against various UAV positions, for MATMT.

Table 3 indicates that a
Long Shot is achievable for the UAV/camera motion types CHASE, ORBIT and LTS, while a Close-Up is not feasible for any of these motion types.
For VTS, FLYBY, FLYOVER, MAPMT and MATMT the results are presented in
Figures 13, 14, 15, 16 and 17 respectively. In these motion types, a Long Shot is
achievable at all times (fs< fmax), but a Close-Up could cause visual tracking failure
in the presence of target velocity deviations.
The simulation results lead to the conclusion that 2D visual tracking of a real target
is indeed a fairly challenging task at greater zoom levels, if the target deviates non-
negligibly from the expected position on the next video frame.
4.3. Maximum permissible velocity deviation vector
By inverting the analysis made for fmax and fixing focal length to the fsneeded
for a specific shot type, we can define the maximum permissible norm of the target
velocity deviation vector qt= [qt1, qt2,0]T. This way, one can pre-determine whether
a shot type is feasible from known/expected target/target route characteristics.
Below, we assume for simplicity that:

$q_t = q_{t1} = q_{t2},$  (52)

to demonstrate the process. By denoting $t' = t + 1$, $q_t$ is then given by solving the following equation, derived from Eq. (34):

$(f_s^2D_q - A_q^2B_q^2)q_t^2 + 2A_q^2B_qC_qq_t - A_q^2C_q^2 = 0,$  (53)

where $A_q = R_{max}d_{t'}s_xs_y$, $B_q = x_{t'1} + x_{t'2}$, $C_q = F\|x_{t'}\|^2$ and $D_q = s_x^2x_{t'3}^2B_q^2 + s_y^2\|x_{t'}\|^2(x_{t'1} - x_{t'2})^2$.

When $q_t > 0$, as in case 5 of the performed simulations, $q_t$ can be directly obtained by:

$q_t = \frac{A_qF\|x_{t'}\|^2}{f_s\sqrt{D_q} + A_q(x_{t'1} + x_{t'2})}.$  (54)

The maximum $q_t$ can be obtained similarly for other cases and UAV/camera motion types, in order to estimate the range of permissible target velocity deviations for a specific shot type-UAV/camera motion type combination.
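As an illustration, the positive-deviation case of Eq. (54) can be evaluated directly as sketched below (the example geometry and focal length are assumptions for illustration only):

```python
import numpy as np

def q_t_max(f_s, x_next, R_max, s_x, s_y, F):
    """Maximum permissible velocity deviation q_t (with q_t1 = q_t2 = q_t, q_t3 = 0, q_t > 0)
    for a fixed shot focal length f_s, per Eq. (54)."""
    x1, x2, x3 = x_next
    d = np.hypot(x1, x2)
    norm2 = x1**2 + x2**2 + x3**2
    A = R_max * d * s_x * s_y
    B = x1 + x2
    D = s_x**2 * x3**2 * B**2 + s_y**2 * norm2 * (x1 - x2)**2
    return A * F * norm2 / (f_s * np.sqrt(D) + A * B)

# Example: CHASE-like geometry at a 200 mm focal length; result in m/s.
print(q_t_max(200.0, (30.0, 0.0, 10.0), R_max=360, s_x=0.009, s_y=0.009, F=25))
```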
4.4. AirSim simulations for evaluating shot feasibility rules
In order to evaluate the presented shot feasibility rules under realistic media production conditions, a simulation was developed that implements the platform setup discussed thus far and incorporates the proposed rules. To this end, AirSim [33] was employed, i.e., an open-source, highly realistic UAV simulation environment based on the Unreal Engine 4 real-time 3D graphics engine. For evaluation purposes, two different scenarios were developed: a cycling scenario and a track-and-field scenario. In both scenarios, the generated shots involve a moving target (a cyclist or a running athlete) and a UAV equipped with a cinematographic camera, controlled by an API script that follows the target according to the desired shot type/camera motion type combination.
Figure 18: Snapshot from the synthetic, realistic evaluation environment. The UAV follows the target (a bicycle) while performing an ORBIT motion type. The focal length of the camera is set to 50 mm, resulting in a Long Shot.
Figure 19: Snapshot from the track-and-field scenario in the synthetic, realistic evaluation environment. The UAV follows a running athlete while performing an ORBIT motion type.
Snapshots from the generated footage are depicted in Figures 18 and 19, while an example 2D plot of the target and UAV trajectories during an ORBIT is shown in Figure 20.
The various parameters (e.g., focal length, UAV height, initial position relative to the target, etc.) were set similarly to the evaluation in Section 3.2. $R_{\max}$ was set adaptively to $\min\left(\frac{1}{2}H, \frac{wk}{s_y}R_{im}\right)$, where the latter term is the search region size, defined by the 2D target ROI radius (in pixels) $\frac{1}{s_y}R_{im}$, a constant scaling factor $w$ (set here to 1.5, as is the default value in [12]) and a varying scaling factor $k \in [0, 1]$ that shrinks the search region according to the proximity of the current ROI to the video frame borders, so as to restrict out-of-frame ROI translations that would cause 2D tracker drift and gimbal control failure.
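A minimal sketch of this adaptive rule follows; it is illustrative only, and the specific border-proximity heuristic used to compute $k$ is an assumption, not the exact implementation.

```python
def adaptive_r_max(W, H, R_im, s_y, roi_center, w=1.5):
    """Adaptive R_max = min(H/2, (w*k/s_y)*R_im), with k in [0, 1] shrinking
    the search region as the current ROI approaches the frame borders.
    W, H: frame width/height in pixels; (1/s_y)*R_im: target ROI radius in
    pixels; roi_center: (x, y) pixel coordinates of the current ROI center."""
    cx, cy = roi_center
    # Assumed heuristic: normalized distance of the ROI center to the
    # nearest frame border (0 at a border, 1 near the frame center).
    k = min(cx, W - cx, cy, H - cy) / (0.5 * min(W, H))
    k = max(0.0, min(1.0, k))
    return min(0.5 * H, (w * k / s_y) * R_im)
```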
Datasets created in such a manner provide fully accurate ground-truth 3D locations for both the target and the UAV. However, this is not in line with a real-world scenario involving noisy GPS sensors. Thus, the 3D positions of both the target and the UAV at every time instance $t$ were distorted by additive Gaussian noise, so as to simulate GPS measurements.
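A sketch of this distortion step is given below; the noise standard deviation is an assumed placeholder, not the value used in the paper's experiments.

```python
import numpy as np

def add_gps_like_noise(positions_xyz, sigma_m=1.0, rng=None):
    """Distort ground-truth 3D positions with zero-mean Gaussian noise
    (sigma_m is an assumed standard deviation, in meters)."""
    rng = rng or np.random.default_rng()
    return positions_xyz + rng.normal(0.0, sigma_m, size=np.shape(positions_xyz))
```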
The experiments were carried out for all motion types, while attempting to achieve three different shot types: Long Shot (LS), Medium Close-Up (MCU) and Close-Up (CU).
Figure 20: 2D plot of the UAV and target trajectories in WCS, during an ORBIT session in the AirSim simulator. (X-Y plot; curves: UAV trajectory, target trajectory.)
For evaluation purposes, we obtained the noisy 3D positions of both the target and the UAV at every time instance $t$. Additionally, the previous noisy 3D position of the target (from time instance $t-1$) was employed to calculate its velocity. Assuming that the target momentarily follows a linear trajectory, we estimate its 3D position at the next time instance ($t' = t + 1$) and adjust the UAV motion, so that the desired central composition framing is maintained. Then, at time instance $t'$, we compare the 2D projection of the estimated 3D target position with the 2D projection of the ground-truth 3D target position. If the distance between the two ROI center points, $R_f$, exceeds the $R_{\max}$ limit, ground-truth tracking failure is assumed ($R_f > R_{\max}$). This is then compared with the predictions of Eq. (32) for the current maximum permissible focal length and Eq. (51) for the desired one, regarding the current shot's feasibility, given the noisy 3D positions of the target and the UAV, the calculated target velocity and the estimated target position on the next video frame. In this manner, the proposed method predicts tracking failure when the desired focal length given by Eq. (51) is greater than the result of Eq. (32), as described by Eq. (46). The velocity deviation vector $\mathbf{q}_t$ in Eq. (32) is simply calculated as the difference between the estimated target velocity at time instance $t-1$ and the actual target velocity at time instance $t$ (distorted by noise).
Therefore, a reasonable assumption of temporally localized constant target acceleration is made. Thus, true/false positive/negative prediction labels (TP, FP, TN, FN) are computed for each time instance. Then, precision is calculated as $P = \frac{TP}{TP+FP}$, the recall rate as $R = \frac{TP}{TP+FN}$ and the F-Measure as $F = \frac{2TP}{2TP+FP+FN}$.
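The per-frame labelling and metric computation can be summarized by the sketch below. It is illustrative only: it assumes that a predicted tracking failure (Eq. (51) focal length exceeding Eq. (32) focal length) is treated as the positive class, and that those two quantities are supplied per frame by the caller.

```python
def evaluate_feasibility_predictions(frames):
    """frames: iterable of per-time-instance records with boolean fields
       'predicted_failure' (rule-based prediction) and 'actual_failure'
       (ground-truth check R_f > R_max). Returns (precision, recall, F)."""
    tp = fp = tn = fn = 0
    for f in frames:
        if f["predicted_failure"] and f["actual_failure"]:
            tp += 1
        elif f["predicted_failure"] and not f["actual_failure"]:
            fp += 1
        elif not f["predicted_failure"] and f["actual_failure"]:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return precision, recall, f_measure

# Hypothetical per-frame records, purely for demonstration:
frames = [
    {"predicted_failure": False, "actual_failure": False},
    {"predicted_failure": True,  "actual_failure": True},
    {"predicted_failure": True,  "actual_failure": False},
]
print(evaluate_feasibility_predictions(frames))
```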
In the first (cycling) evaluation scenario, the mean precision, recall and F-Measure of the proposed rules over all motion types were 0.929, 0.994 and 0.960, respectively. Table 4 depicts the evaluation results per shot type, while Figure 21 contains the F-Measure box-plots for all motion types, separately for each shot type. In the second scenario of the running athlete, the mean precision, recall and F-Measure were 0.927, 0.995 and 0.961, respectively, while the individual results per shot type are depicted in Table 5. Figure 22 demonstrates the F-Measure box-plots for all motion types in the second scenario, separated per shot type.
Figure 21: Box-plot of F-Measure for the three different shot types (LS, MCU, CU) in the AirSim cycling evaluation test. The line inside the boxes demonstrates the median value in each case. Overall, CHASE performed the best and FLYOVER the worst.
Figure 22: Box-plot of F-Measure for the three different shot types (LS, MCU, CU) in the AirSim track and field evaluation test. The line inside the boxes demonstrates the median value in each case. Overall, VTS performed the best and LTS the worst.
Table 4: Mean evaluation results for the proposed shot feasibility rules over all motion types, in the realistic
AirSim cycling setup.
Shot type F-Measure Precision Recall
LS 0.992 0.991 0.997
MCU 0.956 0.923 0.993
CU 0.926 0.872 0.990
Mean 0.960 0.929 0.994
Table 5: Mean evaluation results for the proposed shot feasibility rules over all motion types, in the realistic
AirSim track and field setup.
Shot type F-Measure Precision Recall
LS 0.999 0.991 0.997
MCU 0.971 0.944 0.991
CU 0.913 0.845 0.999
Mean 0.961 0.927 0.995
In addition, the target ROI size calculation methodology was evaluated. As already mentioned, we treat the target as a sphere-shaped object in order to derive the desired focal length $f_s$. This can lead to approximation errors in video frame coverage estimation, especially for flattened targets. The focal length necessary to maintain the desired shot type was calculated for each video frame, using the noisy 3D UAV and target positions, as well as the target ROI prediction for the next video frame.
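For intuition, under a simple pinhole/spherical-target approximation the desired focal length grows with the camera-to-target distance and the requested coverage percentage. The sketch below illustrates this relationship only; it is not a reproduction of Eq. (51), and the sensor height is an assumed value.

```python
def desired_focal_length_mm(c_s, distance_m, target_radius_m=1.0,
                            sensor_height_mm=24.0):
    """Pinhole approximation: pick f_s so the projected target diameter
    covers a fraction c_s of the frame height.
    c_s * h_sensor ~ f_s * 2 * R_t / d  =>  f_s ~ c_s * h_sensor * d / (2 * R_t)."""
    return c_s * sensor_height_mm * distance_m / (2.0 * target_radius_m)

# Hypothetical example: Close-Up (c_s = 0.85) of a 1 m-radius target at 30 m.
print(round(desired_focal_length_mm(0.85, 30.0), 1), "mm")
```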
The actual ROI-to-video-frame-height ratio was calculated at each time instance and compared with the desired value of $c_s$, as defined by each shot type. Figure 23 depicts the distribution of the actual video frame coverage versus the estimated one. Despite variations in the actual target ROI size, the proposed $f_s$ calculation manages to keep the estimated target ROI size within the video frame coverage range of the desired shot type. Table 6 demonstrates the mean video frame coverage values for the three evaluated shot types, over all the simulated motion types.
Figure 23: Box-plot of the estimated vs the actual target video frame coverage for the three desired framing shot types (LS, MCU, CU). Despite the simple sphere-based target modeling and the target/UAV localization noise, the estimated target ROI size lies within the range of the same shot type as the actual target ROI size.
Table 6: Desired, actual and estimated mean video frame coverage.
Shot type    Desired $c_s$    Actual $c_s$    Estimated $c_s$
LS 0.3 0.307 0.310
MCU 0.6 0.606 0.620
CU 0.85 0.872 0.880
Desired $c_s$ is the video frame coverage percentage requested by the director, actual $c_s$ is the video frame coverage percentage achieved by the produced ROIs, while estimated $c_s$ refers to the coverage percentage that would be achieved if ground-truth, non-noisy UAV and target 3D positions were available. The largest deviation is observed in the CU case where, as already demonstrated in Section 4, target tracking is not feasible most of the time.
5. Conclusions
In this paper, a close examination of the shot type constraints arising in computer vision-assisted UAV active target following for cinematography applications has been performed. To this end, a number of industry-standard target-tracking UAV motion types have been strictly defined and geometrically modelled, while compatible shot types have been identified for each case. Subsequently, the maximum permissible camera focal length, so that 2D visual tracking does not fail, as well as shot type feasibility conditions, were analytically determined. The derived formulas can readily be employed as low-level rules in UAV intelligent shooting and cinematography planning systems. Practical simulations showcase the validity of our findings, since the results comply with intuitive expectations in all cases.
Several extensions can be envisioned for the proposed rules. For instance, tighter integration with a specific real-time 2D visual tracker may lead to improvements. Additionally, since our formulas rely on the estimated velocity deviation vector $\mathbf{q}_t$ at each time instance, learning to predict this vector from visual data (e.g., the expected target route) would be a promising avenue for future research. Such a prediction may concurrently benefit the 2D visual tracker itself, as in [17, 39].
6. Acknowledgement
Funding: The research leading to these results has received funding from the Euro-
pean Union’s Horizon 2020 research and innovation programme under grant agreement
No 731667 (MULTIDRONE). This publication reflects the authors’ views only. The
European Commission is not responsible for any use that may be made of the informa-
tion it contains.
References
[1] Computational UAV cinematography for intelligent shooting based on semantic
visual analysis.
[2] J. Angeles. Fundamentals of robotic mechanical systems, volume 2. Springer,
2002.
[3] I. Arev, H. S. Park, Y. Sheikh, J. K. Hodgins, and A. Shamir. Automatic editing
of footage from multiple social cameras. ACM Transactions on Graphics, 33(4):
81, 2014.
[4] S. Bhattacharya, R. Mehran, R. Sukthankar, and M. Shah. Classification of cine-
matographic shots using lie algebra and its application to complex event recogni-
tion. IEEE Transactions on Multimedia, 16(3):686–696, 2014.
[5] B. Brown. Cinematography: Theory and Practice: Image Making for Cinematog-
raphers and Directors. Focal Press, 3rd edition, 2016.
[6] P. Carr, M. Mistry, and I. Matthews. Hybrid robotic/virtual pan-tilt-zoom cam-
eras for autonomous event recording. In Proceedings of the ACM International
Conference on Multimedia. ACM, 2013.
[7] E. Cheng. Aerial Photography and Videography Using Drones. Peachpit Press,
2016.
[8] L.-Y. Duan, J. S. Jin, Q. Tian, and C.-S. Xu. Nonparametric motion characteri-
zation for robust classification of camera motion patterns. IEEE Transactions on
Multimedia, 8(2):323–340, 2006.
[9] H. Fourati and D.E.C. Belkhiat. Multisensor Attitude Estimation: Fundamental
Concepts and Applications. CRC Press LLC, 2016.
[10] M. S. Grewal, L. R. Weill, and A. P. Andrews. Global Positioning Systems,
inertial navigation, and integration. John Wiley & Sons, 2007.
[11] M. A. Hasan, M. Xu, X. He, and C. Xu. CAMHID: Camera motion histogram
descriptor and its application to cinematographic shot classification. IEEE Trans-
actions on Circuits and Systems for Video Technology, 24(10):1682–1695, 2014.
[12] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with
kernelized correlation filters. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, 37(3):583–596, 2015.
[13] B. K. P. Horn. Closed-form solution of absolute orientation using unit quater-
nions. Journal of the Optical Society of America A, 4(4):629–642, 1987.
[14] X. Huang, R. Janaswamy, and A. Ganz. Scout: Outdoor localization using Ac-
tive RFID technology. In Proceedings of the IEEE Conference on Broadband
Communications, Networks and Systems (BROADNETS), pages 1–10, 2006.
[15] N. Joubert, M. Roberts, A. Truong, F. Berthouzoz, and P. Hanrahan. An interac-
tive tool for designing quadrotor camera shots. ACM Transactions on Graphics,
34(6):238, 2015.
[16] N. Joubert, D. B. Goldman, F. Berthouzoz, M. Roberts, J. A. Landay, and P. Han-
rahan. Towards a drone cinematographer: Guiding quadrotor cameras using vi-
sual composition principles. arXiv preprint arXiv:1610.01691, 2016.
[17] T. Li. Single-road-constrained positioning based on deterministic trajectory ge-
ometry. IEEE Communications Letters, 23(1):80–83, 2018.
[18] N. Liang, G. Wu, W. Kang, Z. Wang, and D. D. Feng. Real-time long-term
tracking with prediction-detection-correction. IEEE Transactions on Multimedia,
PP(99):1–1, 2018.
[19] C. Liu, P. Liu, W. Zhao, and X. Tang. Robust tracking and re-detection: Collab-
oratively modeling the target and its context. IEEE Transactions on Multimedia,
2017.
[20] I. Mademlis, V. Mygdalis, C. Raptopoulou, N. Nikolaidis, N. Heise, T. Koch,
J. Grunfeld, T. Wagner, A. Messina, F. Negro, S. Metta, and I. Pitas. Overview
of drone cinematography for sports filming. In European Conference on Visual
Media Production (CVMP) (short), 2017.
[21] I. Mademlis, V. Mygdalis, N. Nikolaidis, and I. Pitas. Challenges in Autonomous
UAV Cinematography: An Overview. In Proceedings of the IEEE International
Conference on Multimedia and Expo (ICME), 2018.
[22] I. Mademlis, V. Mygdalis, N. Nikolaidis, M. Montagnuolo, F. Negro, A. Messina,
and I. Pitas. High-level multiple-UAV cinematography tools for covering outdoor
events. IEEE Transactions on Broadcasting, 2019.
[23] I. Mademlis, N. Nikolaidis, A. Tefas, I. Pitas, T. Wagner, and A. Messina. Au-
tonomous UAV cinematography: A tutorial and a formalized shot type taxonomy.
ACM Computing Surveys, 2019. accepted for publication.
[24] I. Mademlis, N. Nikolaidis, A. Tefas, I. Pitas, T. Wagner, and A. Messina. Au-
tonomous unmanned aerial vehicles filming in dynamic unstructured outdoor en-
vironments. IEEE Signal Processing Magazine, 36(1):147–153, 2019.
[25] S. Minaeian, J. Liu, and Y.-J. Son. Effective and efficient detection of moving
targets from a UAV’s camera. IEEE Transactions on Intelligent Transportation
Systems, 2018.
[26] P. P. Mohanta, S. K. Saha, and B. Chanda. A model-based shot boundary detection
technique using frame transition parameters. IEEE Transactions on Multimedia,
14(1):223–233, 2012.
[27] M. Mueller, N. Smith, and B. Ghanem. A benchmark and simulator for UAV
tracking. In Proceedings of the European Conference on Computer Vision
(ECCV). Springer, 2016.
[28] R. Mur-Artal and J. D. Tardós. ORB-SLAM2: an open-source SLAM system for
monocular, stereo and RGB-D cameras. arXiv preprint arXiv:1610.06475, 2016.
[29] R. Mur-Artal and J. D. Tardós. Visual-inertial monocular SLAM with map reuse.
IEEE Robotics and Automation Letters, 2(2):796–803, 2017.
[30] T. Nägeli, L. Meier, A. Domahidi, J. Alonso-Mora, and O. Hilliges. Real-time
planning for automated multi-view drone cinematography. ACM Transactions on
Graphics, 36(4):132:1–132:10, 2017.
[31] P. Nousi, E. Patsiouras, A. Tefas, and I. Pitas. Convolutional neural networks for
visual information analysis with limited computing resources. In Proceedings of
the IEEE International Conference on Image Processing (ICIP), 2018.
[32] P. Nousi, I. Mademlis, I. Karakostas, A. Tefas, and I. Pitas. Embedded UAV
Real-time Visual Object Detection and Tracking. In Proceedings of the IEEE
International Conference on Real-time Computing and Robotics (RCAR), 2019.
[33] S. Shah, D. Dey, C. Lovett, and A. Kapoor. AirSim: High-Fidelity Visual and
Physical Simulation for Autonomous Vehicles. In Proceedings of the Field and
Service Robotics Conference, 2017.
[34] C. Smith. The Photographer’s Guide to Drones. Rocky Nook, 2016.
[35] A. Torres-González, J. Capitán, R. Cunha, A. Ollero, and I. Mademlis. A mul-
tidrone approach for autonomous cinematography planning. In Proceedings of
the Iberian Robotics Conference (ROBOT), 2017.
[36] E. Trucco and A. Verri. Introductory Techniques for 3-D Computer Vision. Pren-
tice Hall, 1998.
[37] I. Tsingalis, A. Tefas, N. Nikolaidis, and I. Pitas. Shot type characterization in
2D and 3D video content. In Proceedings of the IEEE International Workshop on
Multimedia Signal Processing (MMSP), 2014.
[38] X. Wang, H. Zhu, D. Zhang, D. Zhou, and X. Wang. Vision-based detection and
tracking of a mobile ground target using a fixed-wing UAV. International Journal
of Advanced Robotic Systems, 11, 2014.
[39] L. Xu, Y. Liang, Z. Duan, and G. Zhou. Route-based dynamics modeling and
tracking with application to air traffic surveillance. IEEE Transactions on Intelli-
gent Transportation Systems, 2019.
[40] O. Zachariadis, V. Mygdalis, I. Mademlis, N. Nikolaidis, and I. Pitas. 2D visual
tracking for sports UAV cinematography applications. In Proceedings of the IEEE
Global Conference on Signal and Information Processing (GlobalSIP), 2017.
In transportation networks, the majority of moving vehicles are route-based or trajectory-scheduled. Taking advantage of such predictive information generally produces more accurate dynamic models and better surveillance performance. This paper is concerned with the route-based dynamic modeling along with the route-aided tracking. First, the evolution of the positions across the route is formulated as a stationary Markov process from the characteristics of the route-based dynamics, which follows that the second- and third-order models of the straight-line route-based motions are constructed. This novel modeling strategy is in reverse to the conventional ones starting from the acceleration and its resultant dynamic models are easy to implement due to the linearity with respect to the system states. Second, an optimal initialization technique for route-aided tracking is proposed by utilizing the stationary process information sufficiently. Furthermore, an extension to the circular route-based dynamic modeling and a combinational modeling structure are also presented. Finally, in the context of aerial surveillance, numerical simulations are provided to show the effectiveness of the proposed dynamic modeling and to verify the theoretical results given in the paper.