ISSN: 04532198
Volume 62, Issue 09, October, 2020
3D RECONSTRUCTION TECHNIQUE FROM 2D
SEQUENTIAL HUMAN BODY IMAGES IN
SPORTS: A REVIEW
Azrulhizam Shapii1, Sanaz Pichak2, Zainal Rasyid Mahayuddin3
Center for Artificial Intelligence Technology, Universiti Kebangsaan Malaysia1,2,3
ABSTRACT 3D reconstruction is a fundamental problem in computer vision. Recent research has addressed it
successfully with motion capture systems that use body-worn markers and multiple cameras, but recovering a
3D reconstruction of full-body human pose from a single camera remains challenging. Noisy backgrounds,
variation in human appearance, and self-occlusion are among the challenges. This study investigated methods
of 3D reconstruction from monocular image sequences of vigorous activities such as sports. Six current
methods were selected based on their focus on fully automated estimation of 3D human pose from 2D joint
locations. Each of these works developed an algorithm that addresses this ill-posed problem. The evaluation
was divided into two stages. First, a theoretical and comparative study of each method identified the technique
used, the problems addressed, and the results achieved. The advantages and disadvantages of each method
were then listed, and factors such as accuracy and self-occlusion handling were compared across the methods.
In the second stage, based on the advantages found in the first stage, three methods were chosen for evaluation
on a specific dataset. First, the code of the three methods was run on the PennAction dataset (tennis), and their
3D reconstruction performance is shown. Then the methods were tested on sequences of varied activities from
the CMU motion capture database. The novelty of this study is the evaluation of current methods based on the
accuracy of their performance on a tennis player dataset. We also propose a technique that combines the
particular advantages of each method to create a more efficient method for 3D reconstruction from 2D
sequential images of outdoor activities.
KEYWORDS: 3D Reconstruction, Sports, Human pose, Image sequence.
1. INTRODUCTION
Multimedia equipment can capture video or multiple photographs in real time during a sports activity, and the
footage can be replayed to an athlete after the game to identify and rectify faults in technique. Although this
approach is flexible, the images provide only a single perspective (a single camera view), which considerably
reduces the ability to conduct an in-depth analysis [4]. Multiple cameras can capture the player's performance
simultaneously and address this issue, but such a setup is costly and complicated. It also requires
post-processing, which limits the time available for motion capture. On the other hand, 3D reconstruction of
the human body from sequential images must overcome several challenges. In this article, the following
considerations were taken into account when analyzing the different methods to determine the most suitable
approach for tennis. First, a realistic human body is difficult to model because of variations in individual body
shape and clothing. Second, self-occlusion must be recognized accurately: with a stationary camera, some
limbs block other body parts in the image. Third, finding proper image descriptors helps resolve many pose
ambiguities, and usually requires trial-and-evaluation procedures to determine the most competitive
representations. Finally, special attention was given to real-world conditions such as cluttered backgrounds,
uncontrolled scenes, noisy data, and the speed of the moving person across sequential frames [12]. This
research therefore sought the method that best addresses the ill-posed problem and handles outdoor
conditions in the tennis court environment. This paper has three objectives. First, we evaluate different
methods for 3D reconstruction of the human body from a sequence of monocular images to determine which
performs best under occlusion and noise in real-world data. Second, we compare the developed and
implemented 3D reconstruction methods to identify the advantages and drawbacks of each. Third, we propose
a new technique that combines the benefits of the methods studied and is more accurate in a particular
application (tennis) with a fixed, ordinary camera.
2. Literature Review
Considerable research has addressed the challenge of human motion capture from imagery. For example, [7],
[11], and [14] reconstruct 3D human motion from feature tracks in monocular image sequences under
arbitrary camera motion, relying on previously trained base poses. They address both periodic and
non-periodic movement. This review covers the methods proposed by [11], [3], [1], [15], [14], and [6]. These
six recent 3D reconstruction algorithms were selected for analysis based on their reported performance and
results. The theoretical approach of each method is discussed, and the details of its mathematical model are
identified. [11] offered an activity-independent model that recovers the 3D configuration of a human figure
from the 2D locations of anatomical landmarks in a single image, leveraging a large motion capture corpus as
a substitute for visual memory. [3] developed three principled approaches to enhance particle filtering by
integrating bottom-up information, either as a proposal density for obtaining more diverse particles or as
complementary cues to improve likelihood computation during the correction step. The author also
demonstrated that a feedback mechanism from top-down modeling can further adapt the bottom-up predictors
and improve tracking performance. [1] modeled how joint limits vary with pose in order to favor valid poses.
They collected a motion capture dataset that explores a wide range of human poses and learned a
pose-dependent model of joint limits that forms their prior. [15] proposed integrating a sparsity-driven 3D
geometric prior with temporal smoothness, covering both the case where the image locations of the human
joints are provided and the case where they are unknown. In the latter case, the image locations of the joints
are treated as latent variables to account for the uncertainty in the 2D joint locations. The approach of [14]
addresses the estimation of non-rigid 3D human shape and motion from image sequences captured by
uncalibrated cameras. As in other state-of-the-art solutions, the 2D observations are factorized into camera
parameters, base poses, and mixing coefficients. Compared with existing methods, the novelty of this method
is that it handles arbitrary camera motion without a predefined skeleton or anthropometric constraints. In
contrast, other approaches require favorable camera motion during the sequence to obtain a proper 3D
reconstruction.
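The factorization mentioned for [14], in which 2D observations are explained by camera parameters, base
poses, and mixing coefficients, can be illustrated with a minimal Python sketch. It assumes an orthographic
camera and random toy base poses. The function names `compose_pose` and `project_orthographic`, the joint
count, and all numeric values are hypothetical and are not taken from any of the reviewed implementations.

```python
import numpy as np


def compose_pose(base_poses, coeffs):
    """Return a 3D pose (3, J) as a linear combination of K base poses (K, 3, J)."""
    return np.tensordot(coeffs, base_poses, axes=1)


def project_orthographic(rotation, pose_3d):
    """Rotate a (3, J) pose and keep the x and y rows (orthographic camera)."""
    return (rotation @ pose_3d)[:2]


# Hypothetical toy data: 4 random base poses with 15 joints each.
rng = np.random.default_rng(0)
n_bases, n_joints = 4, 15
base_poses = rng.normal(size=(n_bases, 3, n_joints))
coeffs = np.array([0.6, 0.3, 0.1, 0.0])  # mixing coefficients for one frame

# Assumed camera: rotated 30 degrees about the vertical (y) axis.
angle = np.deg2rad(30.0)
rotation = np.array([
    [np.cos(angle), 0.0, np.sin(angle)],
    [0.0, 1.0, 0.0],
    [-np.sin(angle), 0.0, np.cos(angle)],
])

pose_3d = compose_pose(base_poses, coeffs)           # shape (3, 15)
joints_2d = project_orthographic(rotation, pose_3d)  # shape (2, 15)
print(joints_2d.shape)
```

Reconstruction runs this model in reverse: given the 2D joints, a method of this kind searches for the camera and
the mixing coefficients that best reproduce them, which is why the quality of the trained base poses matters.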
[6] aimed to make 3D motion reconstruction more accurate by adding built-in knowledge: a height-map was
introduced into the algorithmic scheme for reconstructing 3D pose and motion from a single-view calibrated
camera. Our own work is a comparative study of methods for 3D reconstruction of the human body from a 2D
image sequence of a tennis player. We focused on evaluating methods that studied sports poses by analyzing
factors that are still not fully resolved, such as the accuracy of human pose estimation, self-occlusion, and
noisy backgrounds. We ran the code of these algorithms in MATLAB on the Penn Action dataset to obtain 3D
reconstruction results. After collecting and comparing the results, we proposed a new technique that combines
three methods: the Xiaowei method, the Wandt method, and the Du method. The proposed approach aims to
improve 2D joint localization and occlusion handling so that 2D images can be reconstructed in 3D with
realistic results and minimal requirements.
3. Methodology
The methodology used in this research consisted of four phases, shown in Figure 1. Phase one was the analysis
of recently published methods for 3D reconstruction to identify the six methods most relevant to this research.
Phase two compared the experimental results reported by the authors of each technique on several factors
(projection, camera, realistic reconstruction, self-occlusion, accuracy, noisy background, and processing
speed) to shorten the list to three methods. In phase three, these three selected methods were evaluated on
specific sequential images from the tennis player dataset, and the results were compared. Finally, phase four
was the proposal of a new, improved method for 3D reconstruction from 2D sequential images that combines
the strengths of each technique evaluated.
Figure 1. Methodology Used for Comparative Study of 3D Reconstruction
Figure 2 shows the pathway followed to evaluate the performance of the three chosen methods. The first step
was to analyze the mathematics described for each method. The method was then implemented in MATLAB.
Each implementation was first run on the dataset used by its authors to verify that the code worked without
error. When the code was not provided, additional work was required: the method was programmed from the
mathematical description given by the authors. Specific factors were then evaluated on our own dataset to
assess the methods' performance and compare their accuracy in 3D reconstruction [8]. The outputs were
compared on a tennis player dataset (Penn Action). Finally, the methods were evaluated on the CMU dataset
to measure their 3D reconstruction error and accuracy percentage, and to compare the results.
A. Shapii, S. Pichak and Z. R. Mahayuddin, 2020 Technology Reports of Kansai University
Figure 2. Process Used for The Evaluation of Three Selected Methods
The final stage of this research compiled the advantages found in the evaluated methods. Specific benefits
were integrated into the core method (i.e., the method that showed the best performance on the proposed
dataset) to overcome the disadvantages found and improve 3D reconstruction efficacy. The final proposed
method incorporates these strengths and provides a novel approach for 3D reconstruction from 2D sequential
images of tennis.
4. Results
The review of each selected method is presented in Table 1, which describes the advantages and
disadvantages of each technique.
Table 1. Summary of Advantages and Disadvantages of Each Algorithm.

Ramakrishna method
Specification of application:
1- Human pose recovery based on sparse representation in an over-complete dictionary
2- Enforces a mandatory criterion on the sum of squared limb lengths
3- Enforces constancy for eight selected limbs
4- Utilizes a model with 23 landmarks of human anatomy
5- Uses the Matching Pursuit (MP) algorithm to estimate the sparse representation of the 3D pose and the
relative camera from the 2D image only
Pros:
1- Solves anthropometric regularity
2- Robustness to missing data
3- Joint sensitivity to noise
4- Describes an expansive range of actions by a statistical model of human pose
5- Solves for the pose and camera by reducing the image reprojection error
6- Low RMS reconstruction error
7- Provides accurate results using a single image, with no requirement for annotation to resolve ambiguities
8- Accuracy in the camera estimation
9- Applied to frames of monocular video streams
10- Able to recover the pose from non-standard viewpoints
11- Good generalization to an extensive range of poses and viewpoints
Cons:
1- Limb proportions differ between individuals
2- Does not support occlusion handling
3- Cannot recover the correct pose under strong perspective effects or when the mean pose is not a
reasonable initialization

Atul method
Specification of application:
1- Fully automated 3D human pose and shape analysis of human targets in videos, recognizing their
activities and characterizing their behavior
2- Combines top-down and bottom-up methods
3- Uses advanced particle filtering (PF) algorithms
4- Uses the framework of a non-parametric density propagation system based on particle filtering
Pros:
1- High efficacy under substantial ambiguities
2- Overcomes limitations of particle filtering by improving the proposal density modeling and the likelihood
computation function
3- Improves tracking
4- Solves non-rigid deformable surface reconstruction
5- Articulated body pose recovery in static images
Cons:
1- Self-occlusion and partial occlusion in unseen scenarios
2- Optimization problems
3- Uses fixed bone-length priors

Akhter method
Specification of application:
1- A physically motivated prior allows anthropometrically valid poses and restricts invalid poses
2- The prior is combined with a selected sparse representation of poses from an over-complete dictionary
3- Formulates the prior over the two endpoints of each 3D bone location
4- Uses Orthogonal Matching Pursuit (OMP)
Pros:
1- Good generalization while avoiding invalid 3D poses
2- Pose parameterization is accurate and straightforward
3- Improves pose estimation by a grouping of body parts (extended torso)
4- Defines a kinematic skeleton tree structure to apply joint-angle limits
5- Avoids non-rigid self-calibration by selecting linear coefficients from a cosine function
Cons:
1- Depth ambiguities at several joints
2- Incorrect estimation of the camera matrix
3- Algorithm is sensitive to Gaussian noise
4- Needs multi-camera setups

Xiaowei method
Specification of application:
1- Combines a sparsity-driven 3D geometric prior and a 3D temporal smoothness prior
2- Uses a deep convolutional neural network (CNN) architecture to detect body parts
3- Uses an Expectation-Maximization (EM) framework to retrieve a sparse model of the 3D human pose
sequence
4- Casts the 2D joint locations as latent variables
Pros:
1- Highly effective against detector error, occlusion, and ambiguity
2- No requirement for synchronized 2D-3D data
3- Handles the 2D estimation uncertainty in a statistical framework
4- Good accuracy on in-the-wild videos
5- Improves the initialization results
6- Improves 2D joint localization
7- Uses a single camera
Cons:
1- Cannot handle multiple subjects
2- Assumes manually labeled 2D joint locations

Wandt method
Specification of application:
1- A periodic model of the mixing coefficients for periodic and quasi-periodic motions
2- A regularization term based on a temporal bone-length constancy prior for non-periodic motion
3- Based on a-priori trained base poses
4- Models the 3D pose as a linear combination of base poses
Pros:
1- Estimates non-rigid human body pose captured by an uncalibrated camera
2- Solves an unstable 3D motion reconstruction
3- Accurate algorithm for estimating periodic motion
4- Handles arbitrary camera motion
5- Stability of the method
6- Handles noise and occlusions in real-world data
7- Does not use a predefined skeleton or anthropometric constraints
Cons:
1- Restrictive assumptions on the possible 3D configurations
2- Ambiguous camera placement and 3D shape deformation

Du method
Specification of application:
1- RGB-based 2D joint detection algorithm
2- A dual-stream deep convolutional network (ConvNet) to detect 2D landmarks of human joints
3- The algorithm utilizes the height-map as a type of built-in prior knowledge to capture 3D articulated
skeleton motion under a single-view calibrated camera
Pros:
1- Improves skeleton-based 3D human pose estimation from inaccurately localized 2D joints by temporal
constraints on the camera
2- Lower reconstruction error
Cons:
1- Discriminative parts are missed in pose estimation
2- Fails on complex human motion
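Several entries in Table 1 rely on a sparse representation of the pose in an over-complete dictionary, estimated
with Matching Pursuit or Orthogonal Matching Pursuit (OMP). The following Python sketch shows the generic
OMP procedure on a toy dictionary. It is a textbook-style illustration under stated assumptions, not the code
of [11] or [1]: the function name `orthogonal_matching_pursuit`, the six-atom dictionary, and the signal are all
hypothetical, and the reviewed methods add camera estimation and anthropometric constraints that are omitted
here.

```python
import numpy as np


def orthogonal_matching_pursuit(dictionary, signal, n_nonzero):
    """Greedy sparse coding: find coeffs with at most n_nonzero nonzero
    entries such that dictionary @ coeffs approximates signal.

    dictionary: (m, n) array whose columns (atoms) have unit norm.
    signal: (m,) array.
    """
    signal = np.asarray(signal, dtype=float)
    residual = signal.copy()
    support = []             # indices of the atoms selected so far
    solution = np.zeros(0)
    for _ in range(n_nonzero):
        correlations = np.abs(dictionary.T @ residual)
        if support:
            correlations[support] = 0.0  # never pick the same atom twice
        support.append(int(np.argmax(correlations)))
        # Re-fit all selected atoms jointly (the "orthogonal" step).
        solution, *_ = np.linalg.lstsq(dictionary[:, support], signal, rcond=None)
        residual = signal - dictionary[:, support] @ solution
        if np.linalg.norm(residual) < 1e-10:
            break
    coeffs = np.zeros(dictionary.shape[1])
    if support:
        coeffs[support] = solution
    return coeffs


# Hypothetical over-complete dictionary: 6 unit-norm atoms in a 4-D space.
s = 1.0 / np.sqrt(2.0)
dictionary = np.array([
    [1.0, 0.0, 0.0, 0.0, s, 0.0],
    [0.0, 1.0, 0.0, 0.0, s, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, s],
    [0.0, 0.0, 0.0, 1.0, 0.0, s],
])
signal = np.array([3.0, 0.0, 0.0, -2.0])  # built from atoms 0 and 3 only
coeffs_demo = orthogonal_matching_pursuit(dictionary, signal, n_nonzero=2)
print(np.round(coeffs_demo, 6))
```

In a pose setting, each atom would be a base pose and the signal would be the observed joint locations, so the
few nonzero coefficients select the base poses that explain the observation.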
After identifying each algorithm's strengths and weaknesses, the experimental results for different parameters
are presented in Table 2. These results are based on the authors' own assessments on the databases they
selected. For example, [11] and [15] focused their analysis on noisy backgrounds, whereas [3], [1], and [6]
focused on self-occlusion. Moreover, some of the methods were evaluated on a single image, whereas others
were evaluated on image sequences. Three methods were found suitable for the application proposed in this
study. Two of them (the Xiaowei and Wandt methods) showed successful outcomes with noisy backgrounds,
self-occlusion, and realistic reconstruction, making them ideal for further evaluation. From the remaining
methods, the Du method was selected for its results and because the parameters it requires suit the chosen
dataset. The other methods showed no significant advantages and were discarded because they lacked
handling of noisy backgrounds, realistic reconstruction, or both. These three selected methods were therefore
the most suitable for analysis on a database of a tennis player.
Table 2. Comparison of Methods for 3D Reconstruction.
Parameters | Ramakrishna method | Atul method | Akhter method | Xiaowei method | Wandt method | Du method
Algorithm | Projected Matching Pursuit (MP) | Advanced particle filtering (PF) | Orthogonal Matching Pursuit (OMP) | Expectation-Maximization | Periodic / non-periodic | Height-map
Projection | Single image | Image sequence | Single image | Image sequence | Image sequence | Image sequence
Camera | Arbitrary | Fixed | Arbitrary | Arbitrary | Fixed | Fixed
Code of algorithm | Restricted | Open-source | Restricted | Open-source | Restricted | Open-source
Noisy background | Focus | No focus | No focus | Focus | Focus | No focus
Self-occlusion | No focus | Focus | No focus | Focus | Focus | Focus
Realistic reconstruction (average 3D error, cm) | Not given | Not given | 121.56 | 113.01 | 187.1 | 118.69
Average 2D joint localization accuracy (pixel) | 94.5 | 27.65 | Not given | 10.85 | 8.43 | 95.8
5. Evaluation of Methods on PennAction
This section demonstrates the application of the selected approaches to pose estimation on in-the-wild image
sequences. Results are presented for an action from the PennAction dataset. The "tennis forehand" action was
chosen for evaluation because it is not a simple pose and presents challenges such as large pose variability,
self-occlusion, and image blur caused by fast motion. From the 31-image sequence in the dataset, we selected
six frames (2, 8, 14, 20, 25, 30) on which the main factors could be evaluated. Tables 3 to 8 illustrate the 3D
results of each method on these frames.
Table 3. 3D Results of Frame #2
The analysis of frame #2 shows that the Wandt method did not produce a correct result for the left arm (red),
because the racquet acts as noise in the scene.
Table 4. 3D Results of Frame #8 (panels: input frame #8, Xiaowei, Wandt, Du)
The analysis of frame #8 shows that the Du method did not produce accurate results for the right leg (violet),
because the right leg is occluded by the left leg (self-occlusion).
Table 5. 3D Results of Frame #14 (panels: input frame #14, Xiaowei, Wandt, Du)
The analysis of frame #14 shows that neither the Wandt method nor the Du method produced good results for
the arms, because the player's arms are behind the racquet, which causes occlusion.
Table 6 3D Results of Frame #20
The analysis of frame #20 shows that the Xiaowei method produced the best result at this viewing angle, with
the least missing data.
Table 7. 3D Results of Frame #25
The analysis of frame #25 shows that the Wandt method produced a poor result for the arms and shoulder
(yellow) but a better result for the legs than the Du method.
Table 8. 3D Results of Frame #30
The analysis of frame #30 shows that the Wandt method is sensitive to the viewing angle of the image and
could not reconstruct the pose accurately.
Table 9 summarizes the results on the tennis player dataset. The methods proposed by [15] and [6] produced
similar 3D results, but the Xiaowei algorithm is more robust to noise, handles occlusions, and reconstructs the
occluded body parts correctly. The Wandt method, however, performed better in reconstructing the legs.
Table 9. A comparison of PCP scores on PennAction dataset
Methods | Upper arms | Lower arms | Upper legs | Lower legs | Average
Xiaowei | 0.93 | 0.71 | 0.96 | 0.84 | 0.86
Du | 0.92 | 0.68 | 0.94 | 0.82 | 0.84
Wandt | 0.90 | 0.61 | 0.98 | 0.85 | 0.83
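The PCP (Percentage of Correct Parts) scores in Table 9 can be related to joint predictions with a short Python
sketch. It assumes the common PCP definition in which a limb counts as correct when both predicted endpoints
lie within half the ground-truth limb length of their true positions. The paper does not state which PCP variant
or threshold was used, so the threshold `alpha`, the function name `pcp_scores`, and the toy joint coordinates
are assumptions for illustration only.

```python
import numpy as np


def pcp_scores(pred, gt, limbs, alpha=0.5):
    """Per-limb PCP over a sequence.

    pred, gt: (F, J, 2) arrays of predicted / ground-truth 2D joints.
    limbs: list of (joint_a, joint_b) index pairs.
    A limb counts as correct in a frame when both predicted endpoints lie
    within alpha * (ground-truth limb length) of their true positions.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    scores = []
    for a, b in limbs:
        limb_len = np.linalg.norm(gt[:, a] - gt[:, b], axis=1)
        err_a = np.linalg.norm(pred[:, a] - gt[:, a], axis=1)
        err_b = np.linalg.norm(pred[:, b] - gt[:, b], axis=1)
        correct = (err_a <= alpha * limb_len) & (err_b <= alpha * limb_len)
        scores.append(correct.mean())
    return np.array(scores)


# Hypothetical arm with joints 0 = shoulder, 1 = elbow, 2 = wrist (pixels).
gt = np.array([
    [[0, 0], [0, 10], [0, 20]],
    [[0, 0], [0, 10], [0, 20]],
])
pred = np.array([
    [[1, 0], [1, 10], [1, 20]],  # frame 0: every joint 1 px off
    [[0, 0], [0, 10], [8, 20]],  # frame 1: wrist 8 px off
])
limbs = [(0, 1), (1, 2)]         # upper arm, lower arm
scores = pcp_scores(pred, gt, limbs)
print(scores)
```

In this toy case the upper arm is correct in both frames and the lower arm in only one. The Average column of
Table 9 is consistent with a plain mean of the four part scores; for the Xiaowei row,
(0.93 + 0.71 + 0.96 + 0.84) / 4 = 0.86.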
5.1 Running Time
Table 10 shows the processing time of the three methods evaluated. The method proposed by Xiaowei is the
fastest at processing the 3D reconstruction, given the computing specifications used in this study. The
algorithms usually converge in 20 iterations, with an average CPU time below 150 s for a sequence of 31
frames.
Table 10. The Processing Time of Each Algorithm
Method | Processing time (seconds) per frame
Xiaowei method | 3
Wandt method | 6
Du method | 5
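As a small worked check, the per-frame times in Table 10 can be scaled to the 31-frame sequence used in the
PennAction evaluation. The Python sketch below does only this arithmetic. Reading the reported figure of
"below 150 s" as an average over the three methods is an interpretation made here, not something the paper
states.

```python
# Per-frame processing times from Table 10 (seconds).
seconds_per_frame = {"Xiaowei": 3, "Wandt": 6, "Du": 5}
n_frames = 31  # length of the tennis forehand sequence

# Total time per method for the whole sequence.
total_seconds = {name: t * n_frames for name, t in seconds_per_frame.items()}

# Mean total time across the three methods.
mean_total = sum(total_seconds.values()) / len(total_seconds)

print(total_seconds)
print(round(mean_total, 1))
```

Under this reading, only the mean across methods falls below 150 s; the Du and Wandt totals individually
exceed it.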
5.2 Evaluation of Methods on CMU
The methods were also evaluated on sequences of varied activities from the CMU motion capture database.
Care was taken to ensure that the motion capture frames used for testing were not those used to train the shape
bases. The reconstruction results for the jumping sequences were inferior to those for the other sequences. This
is because jumping motions differ between individuals much more than running motions do. As a result, a
new, untrained jumping motion was insufficiently explained by the base poses, whereas each new running
pattern closely resembled those in the training data. The evaluation of 3D motion recovery was carried out
with the ground-truth 2D joint locations. The 3D reconstruction errors in millimeters are reported in Table 11.
The standard evaluation metric, the per-joint error (mm) in 3D, was computed between the reconstructed pose
and the ground truth in the camera frame after aligning their root locations.
Table 11. 3D reconstruction error in mm
Method | Run (mm) | Jump (mm)
Wandt | 28.05 | 31.13
Du | 58.3 | 64.4
Xiaowei | 20.99 | 47.57
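Table 11 shows that the 3D reconstructions of the Xiaowei method are highly realistic, as reflected in its 3D
error: it has the lowest error on the running sequence, while the Wandt method has the lowest error on the
jumping sequence.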
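A 3D reconstruction error of the kind reported in Table 11 is commonly computed as the mean Euclidean
distance per joint after the root joints of the two poses have been aligned. The Python sketch below illustrates
that computation. Whether the reviewed implementations use exactly this variant is an assumption, and the
function name `mean_per_joint_error`, the choice of joint 0 as root, and the toy coordinates are hypothetical.

```python
import numpy as np


def mean_per_joint_error(pred, gt, root=0):
    """Mean per-joint 3D error after root alignment.

    pred, gt: (F, J, 3) arrays of reconstructed / ground-truth joints in mm.
    The root joint of each frame is moved to the origin in both poses, so a
    global translation of the whole skeleton does not count as error.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    pred_rel = pred - pred[:, root:root + 1]
    gt_rel = gt - gt[:, root:root + 1]
    return float(np.linalg.norm(pred_rel - gt_rel, axis=2).mean())


# Hypothetical single frame with three joints along the y axis (mm).
gt = np.array([[[0, 0, 0], [0, 100, 0], [0, 200, 0]]])
pred = gt + np.array([50.0, 0.0, 0.0])   # whole skeleton shifted by 50 mm
pred[0, 2] += np.array([0.0, 0.0, 30.0])  # third joint also 30 mm off in depth
error_mm = mean_per_joint_error(pred, gt)
print(error_mm)
```

The global 50 mm shift is removed by the root alignment, so only the depth error of the third joint contributes,
averaged over the three joints.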
6. Discussion
The evaluation indicates that the method proposed by [15] is highly recommended for 3D reconstruction from
tennis player images. In this method, 2D joint heat maps that capture positional uncertainty are generated with
a deep, fully convolutional neural network (CNN). These heat maps are combined with a sparse model of 3D
human pose within an Expectation-Maximization framework that estimates the 3D parameters over the entire
sequence. This method provides a solution for most of the challenges in 3D reconstruction, such as large pose
variability, self-occlusion, and image blur caused by fast motion. However, it needs manually labeled 2D joint
locations, which reduces accuracy. To address this issue, we propose a 3D human pose estimation framework
based on the one presented by Wandt et al. (2016). It is a synthesis of discriminative image-based estimation
and 3D reconstruction. It treats 2D joint locations as latent variables whose uncertainty distributions are given
by a deep, fully convolutional neural network. The 3D poses are modeled by a sparse representation. The 3D
parameter estimates are obtained via an Expectation-Maximization algorithm, in which the 2D joint location
uncertainties can be conveniently marginalized out during inference. Further, to improve the robustness of the
method against occlusion and reconstruction ambiguity, a 3D temporal smoothness prior, as considered by
[6], is imposed on the 3D pose and viewpoint parameters. Therefore, using the method proposed by [15] as the
core and integrating the advantages described for the methods of [14] and [6] might provide an effective
method for 3D reconstruction from image sequences on a specific dataset. The proposed method might
improve 2D joint localization and occlusion handling, so that 2D images can be reconstructed in 3D with
realistic results and minimal requirements.
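The role of a temporal smoothness prior can be illustrated with a simplified Python sketch. It penalizes squared
frame-to-frame differences of the 3D joints and smooths a jittery sequence by solving a small regularized
least-squares problem. This is a generic quadratic illustration, not the Expectation-Maximization formulation
of [15] or the constraints of [6]. The function names `temporal_smoothness_penalty` and `smooth_sequence`, the
weight value, and the synthetic data are assumptions.

```python
import numpy as np


def temporal_smoothness_penalty(poses):
    """Sum of squared frame-to-frame differences for poses of shape (F, J, 3)."""
    return float(np.sum(np.diff(poses, axis=0) ** 2))


def smooth_sequence(poses, weight):
    """Minimise ||X - poses||^2 + weight * ||D X||^2 over sequences X,
    where D takes first differences along time. The minimiser solves
    (I + weight * D^T D) X = poses.
    """
    n_frames = poses.shape[0]
    diff_op = np.diff(np.eye(n_frames), axis=0)  # (F-1, F) difference operator
    system = np.eye(n_frames) + weight * diff_op.T @ diff_op
    flat = poses.reshape(n_frames, -1)
    return np.linalg.solve(system, flat).reshape(poses.shape)


# Hypothetical data: a static 15-joint pose with per-frame jitter, 31 frames.
rng = np.random.default_rng(0)
n_frames, n_joints = 31, 15
static_pose = rng.normal(size=(1, n_joints, 3))
noisy = static_pose + 0.1 * rng.normal(size=(n_frames, n_joints, 3))
smoothed = smooth_sequence(noisy, weight=5.0)

print(temporal_smoothness_penalty(noisy), temporal_smoothness_penalty(smoothed))
```

A larger weight trusts the prior more and the per-frame estimates less, which helps when individual frames are
corrupted by occlusion or blur but can over-smooth genuinely fast motion such as a tennis stroke.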
7. Conclusion
This paper is a review of methods for 3D reconstruction of the human body from 2D image sequences of tennis
players. Among all sports, we chose tennis because it is played at very high speed and requires the
development of technical skills. The sport also challenges 3D reconstruction because of the occlusion and
self-occlusion that occur during play. Many technologies have tried to help raise the level of athletic technique
and to reduce officiating errors and physical injury. However, they face problems such as high cost, long
processing time, and heavy equipment [10]. We believe that simulating a tennis player's movement, captured
by an arbitrary camera, through 3D reconstruction of sequential images might be economically viable and save
time compared with traditional technologies. Moreover, this approach can help players and coaches improve
their skills significantly. In addition, the increasing demand for 3D reconstruction, especially of the human
body, opens further applications in movies, gaming, and medicine, so the results of this research can help
other industries as well. In particular, generating 3D poses from a sequence of images is much cheaper than
marker-based technologies. Modeling the 3D human body from image sequences is a challenging problem and
has been a research topic for many years [2]. Significant theoretical and algorithmic results have been achieved
in extracting even complicated poses of the human body. Research on human pose has approached the
problem from many different angles in an attempt to implement a robust, accurate, and automatic full-body
system. In this paper, we focused on evaluating methods that studied sports poses by analyzing several factors
that are still not fully resolved in this area. For instance, background clutter in realistic scenes, variation in
human appearance, and self-occlusion are challenges that require in-depth investigation. We also identified
the most suitable method, one that addresses the ill-posed problem and can handle outdoor conditions, for
implementation in the tennis court environment with high processing speed [13]. To reach this goal, a
two-step evaluation was carried out. First, we chose six current methods based on their focus on features such
as image sequences, camera setup, and sports poses in the real world. These methods improved on several
limitations of older methods and offered recommendations for future work. The advantages and
disadvantages of the methods of [11], [3], [1], [15], [14], and [6] were discussed and compared theoretically.
Factors such as the accuracy of human pose estimation, self-occlusion, and noisy backgrounds were analyzed
in their experimental results.
In the next step of the evaluation, the three top methods were selected for further, more in-depth analysis. We
ran the code of their algorithms in MATLAB on the Penn Action dataset to obtain 3D reconstruction results.
The code of [15] and [6] was obtained from the Internet, and we implemented the method of [14] ourselves. To
obtain final results, we also tested these methods on the CMU MoCap database. We then concluded that the
method proposed by Xiaowei might be the most suitable for 3D reconstruction applied to tennis. This method
proved faster than the other methods evaluated and produced outstanding results in accuracy. The method
proposed by Wandt provided better accuracy when dealing with self-occlusion. Finally, the method proposed
by Du showed the lowest accuracy and poor performance when occlusion was involved. We eventually
proposed a new technique that combines elements of the three methods: the Xiaowei method, the Wandt
method, and the Du method. It is a framework for 3D human pose estimation from monocular images that
consists of a novel synthesis of a deep learning-based 2D part regressor, the sparsity-driven 3D reconstruction
approach of the Wandt method, and the 3D temporal smoothness prior of the Du method. This joint
formulation combines the discriminative power of state-of-the-art 2D part detectors, the expressiveness of 3D
pose models, and regularization by aggregating information over time, so it can go directly from 2D
appearance to 3D geometry. The proposed method can improve 2D joint localization for tennis players in
outdoor conditions from image sequences taken by an arbitrary camera.
8. Acknowledgment
This research was funded by the University Grants PP-FTSM-2020.
9. References
[1] Akhter, I. and Black, M. J. 2015. Pose-Conditioned Joint Angle Limits for 3D Human Pose
Reconstruction. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1446-1455.
[2] Ashraf, Y. A., Venkat, I. and Belaton, B. 2014. Reconstruction of 3D Faces by Shape Estimation and
Texture Interpolation. Asia-Pacific Journal of Information Technology and Multimedia, 3(1): pp. 15-21.
[3] Atul, K. 2014. Coupling Top-down and Bottom-up Methods for 3D Human Pose and Shape
Estimation from Monocular Image Sequences. Pattern Recognition :1410-0117.
[4] Tompson, J. J., Jain, A., LeCun, Y. & Bregler, C. 2014. Joint training of a convolutional network
and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems:
pp. 1799-1807. arXiv:1406.2984.
[5] Rasheed, Nada & Nordin, Md Jan. (2018). Classification and Reconstruction Algorithms for the
Archaeological Fragments. Journal of King Saud University - Computer and Information Sciences.
10.1016/j.jksuci.2018.09.019.
[6] Du, Y., Wong, Y., Liu, Y., Han, F., Gui, Y., Wang, Z., Kankanhalli, M. & Geng, W. 2016. Marker-
less 3D human motion capture with monocular image sequence and height-maps. In European Conference on
Computer Vision. pp. 20-36.
[7] Gotardo, P.F. & Martinez, A.M. 2011. Non-rigid structure from motion with complementary rank-3
spaces. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on: pp. 3065-3072.
[8] Li, F., Li, K. & Lv, J. 2013. Research on biomechanics technology based on the tennis sports. In: Du
W. (eds) Informatics and Management Science III. Lecture Notes in Electrical Engineering, vol 206. Springer,
London. pp. 415-420.
[9] Mousavi Kahaki, Seyed Mostafa & Nordin, Md Jan & Ashtari, Amir & Zahra, Sophia. (2016).
Invariant Feature Matching for Image Registration Application Based on New Dissimilarity of Spatial
Features. PloS one. 11. e0149710. 10.1371/journal.pone.0149710.
[10] Norshaliza, K. 2016. Active Contour Model Using Fractional Sinc Wave Function for Medical Image
Segmentation. Asia-Pacific Journal of Information Technology and Multimedia, 5(2): 47-61.
[11] Ramakrishna, V., Kanade, T. & Sheikh, Y. 2012. Reconstructing 3D human pose from 2D image
landmarks. In European Conference on Computer Vision. pp. 573-586.
[12] Saima, A. L., Rosziati, I., Nik Shahidah, A. M. T., Norhalina, S. and Suhaila, S. 2018. Thresholding
and Quantization Algorithms for Image Compression Techniques: A Review. Asia-Pacific Journal of
Information Technology and Multimedia, 7(1): 83-89.
[13] Shingade, A. & Ghotkar, A. 2014. Animation of 3D human model using markerless motion capture
applied to sports. International Journal of Computer Graphics & Animation (IJCGA) 4(1): 27-39.
[14] Wandt, B., Ackermann, H. & Rosenhahn, B. 2016. 3D Reconstruction of Human Motion from
Monocular Image Sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8): pp.
1505-1516.
[15] Zhou, X., Zhu, M., Leonardos, S., Derpanis, K.G. and Daniilidis, K. 2016. Sparseness meets
deepness: 3D human pose estimation from monocular video. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. pp. 4966-4975.
[16] Azrulhizam Shapii, Sanaz Pichak, Mohd Baharuddin & Iida Hiroyuki. (2020). Comparative Study of
3D Reconstruction Methods from 2D Sequential Images in Sports. Asia-Pacific Journal of Information
Technology and Multimedia. 09. 40-57. 10.17576/apjitm-2020-0901-04.
This work is licensed under a Creative Commons Attribution Non-Commercial 4.0
International License.