Research of a Neuron Model with Signal
Accumulation for Motion Detection
Alexander Kugaevskikh
Department of information technologies
Novosibirsk State University
Novosibirsk, Russia
https://orcid.org/0000-0002-6676-0518
Abstract—The paper presents a new model of the MT neuron (middle temporal area neuron), which detects movement and determines its direction and speed without using recurrent connections. The model is based on the accumulation of the signal and is organized using a space-time vector that sets the weight coefficients. Despite its combinatorial redundancy, the model is expected to be more resistant to glare than optical flow.
Keywords—neural network, motion detection, MT neuron, bio-inspired model
I. INTRODUCTION
In the brain, motion analysis begins in the primary visual cortex. Although its primary function is to highlight edges, its complex cells respond to movement in a particular direction within their receptive field. A more in-depth analysis of movement is performed in areas V3 and V5 (MT). Eventually, a general map of the movements within the visual field is formed.
In computer vision, the problem of motion analysis is most often solved by applying the optical flow equation. When training neural networks to detect motion, we can also speak of the optical flow equation, or rather its underlying mechanism of finding the direction of change in pixel brightness. The approach based on recurrent links, such as GRU [1], LSTM [2], STCNN [3], ResNet [4], and STAL [5], is considered generally accepted for motion detection. The problem with such neural networks is that if a network is not trained to detect movement in a particular direction, it will not detect it. This is especially true for video analytics systems: if we tilt the camera by 45 degrees and do not retrain the neural network, the movement will not be detected. We offer an alternative approach that uses a bio-inspired neuron model for motion detection. In bio-inspired architectures, the most common model is the Heeger model [6, 7], which is analyzed in detail in [8]. The Heeger model is based on the Gabor energy, which combines the real and imaginary parts of the Gabor filter convolved with the image. Using the imaginary part is reasonable in signal processing, since it compensates for redundancy at low frequencies, but it makes no sense in image processing when convolving with pixel brightness.
II. EDGE DETECTION
In our proposed model, we not only get rid of the Gabor energy but also construct the MT neuron in a different way, to better match the optical flow equation. Motion analysis begins with edge detection. For this purpose, we constructed a two-layer neural network [9] based on the Gabor filter and the hyperbolic tangent as the functions that generate the receptive fields of the neurons.
The input of the edge selection neural network is the L* channel (pixel lightness) of the image in the CIE L*a*b* colour space [10]. Given the specifics of this colour space, we no longer need to use the imaginary part of the Gabor filter.
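For illustration, the following minimal sketch shows how such an input can be prepared; it assumes scikit-image is available (the paper does not prescribe any particular tooling):

```python
# A minimal sketch of preparing the network input, assuming scikit-image;
# the paper itself does not prescribe a specific library.
import numpy as np
from skimage import color, io

def l_channel(path: str) -> np.ndarray:
    """Load an RGB image and return its L* channel in CIE L*a*b*."""
    rgb = io.imread(path)
    lab = color.rgb2lab(rgb)   # convert RGB to CIE L*a*b*
    return lab[:, :, 0]        # L* (lightness), range [0, 100]
```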
For the implementation of brightness segmentation, a two-layer neural network is proposed. The first layer highlights lines of a certain orientation. The second layer is responsible for selecting combinations of lines, including corners. The difference from trained convolutional layers, in particular from the first layer of simple cells in the neocognitron, is the use of pre-configured receptive fields for the neurons of the first layer, which increases the predictability and interpretability of the results of such a neural network.
Each layer contains three types of neurons that differ in the configuration of their receptive fields, and the links between the layers are organized in a special way: each neuron of the second layer is connected to only two neurons of the first layer. Thus, the neurons of the second layer detect lines and corners (in the case of the Gabor filter) and quadrilaterals (in the case of the hyperbolic tangent).
The receptive fields based on the Gabor filter are given by

$$g^{p}(x, y) = \exp\left(-\frac{x'^{2} + \gamma^{2} y'^{2}}{2\sigma^{2}}\right) \cos\left(\frac{2\pi}{\lambda} x' + \varphi\right), \quad (1)$$

$$x' = x\cos\theta + y\sin\theta, \qquad y' = -x\sin\theta + y\cos\theta,$$

where $\varphi$, responsible for the symmetry of the filter kernel, was introduced as an alternative to a quadrature pair of filters; it can take the values $\pi/2$ or $-\pi/2$ to detect antisymmetric components and $0$ or $\pi$ for symmetric components. Here $\sigma$ is the filter scale and $\gamma$ is the degree of the filter ellipticity (it defines the elongation of the filter kernel along the ordinate axis). An auxiliary parameter is introduced solely to simplify the selection of the optimal kernel without choosing different values of $\sigma$ and $\lambda$, and it can be expressed through these parameters.
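As a reference, here is a minimal sketch of generating such a kernel per (1); this is our illustration, and the discrete grid and parameter names are our choices:

```python
import numpy as np

def gabor_kernel(size: int, theta: float, sigma: float, lam: float,
                 gamma: float, phi: float) -> np.ndarray:
    """Real-valued Gabor kernel as in Eq. (1).

    phi = +/-pi/2 gives antisymmetric kernels; phi = 0 or pi symmetric ones.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)     # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / lam + phi)
    return envelope * carrier
```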
The first two types of neurons respond to lines of their preferred orientation; their receptive fields are formed using the Gabor filter. The third type of neuron is required to detect zones of brightness difference, and a smooth function must be used to form its receptive field in order to account for effects such as blurring in fog and shadows. For this reason, the Haar wavelet cannot be applied, but the receptive field can be configured using the hyperbolic tangent.
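A minimal sketch of such a receptive field, assuming a tanh profile taken across the preferred orientation (the exact parameterization used in the paper may differ):

```python
import numpy as np

def tanh_kernel(size: int, theta: float, slope: float = 1.0) -> np.ndarray:
    """Smooth edge-detecting kernel: tanh profile across orientation theta.

    A hypothetical parameterization: `slope` controls how sharp the
    transition between the dark and bright half-planes is.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # signed distance from the line of preferred orientation
    d = -x * np.sin(theta) + y * np.cos(theta)
    k = np.tanh(slope * d)
    return k / np.abs(k).sum()   # normalize so responses stay comparable
```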
The neurons of the first layer (denoted $U_{C1}$) of the edge selection neural network use a linear activation function

$$U_{C1}^{p}(i, j) = \sum_{x}\sum_{y} g^{p}(x, y)\, L(i + x, j + y), \quad (2)$$

where $p$ is the neuron's type, $(i, j)$ are the convolution coordinates, and $L$ is the pixel brightness matrix of the input image.
The neurons of the second layer use the sigmoid activation function and operate on the "winner-takes-all" (WTA) principle:

$$U_{C2}(i, j) = \operatorname{WTA}_{p}\left[\frac{1}{1 + e^{-\left(U_{C1}^{p_{a}}(i, j) + U_{C1}^{p_{b}}(i, j)\right)}}\right], \quad (3)$$

where $p_{a}$ and $p_{b}$ denote the two first-layer neurons afferent to the given second-layer neuron.
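A sketch of both layers following (2) and (3); this is our illustration, and the specific pairing of first-layer types feeding each second-layer neuron is an assumption:

```python
import numpy as np
from scipy.ndimage import correlate

def first_layer(L: np.ndarray, kernels: list) -> np.ndarray:
    """Linear responses U_C1^p, Eq. (2): correlate L with each kernel."""
    return np.stack([correlate(L, k, mode="nearest") for k in kernels])

def second_layer(u_c1: np.ndarray, pairs: list) -> np.ndarray:
    """Sigmoid over each pair of afferent U_C1 maps, then WTA, Eq. (3)."""
    u = np.stack([1.0 / (1.0 + np.exp(-(u_c1[a] + u_c1[b])))
                  for a, b in pairs])
    winners = u.argmax(axis=0)           # strongest type at each pixel
    rows, cols = np.indices(winners.shape)
    wta = np.zeros_like(u)
    wta[winners, rows, cols] = u[winners, rows, cols]
    return wta
```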
III. MT NEURON MODEL
The task of motion detection dictates the spatio-temporal organization of the motion detection neural network. Movement, in this case, is the sequential activation over time, i.e. from frame to frame, of several edge selection neurons located along the same direction in a certain neighborhood. Thus, the MT neuron can give the direction of movement α and its speed v. The MT neuron, like the previous neurons, is created for each type p. The connections of the MT neuron with the $U_{C2}$ neurons of the corresponding type determine its receptive field.
To detect linear motion, the receptive field of the MT neuron ($U_{MT}^{\{l\}}$) includes a sequence of $U_{C2}$ neurons along the α direction. To detect rotation, the receptive field of the corresponding MT neuron ($U_{MT}^{\{r\}}$) instead comprises connections with neurons located at the same center of the receptive field but having different orientations θ. The rotation detection neuron is created twice, once for each direction of rotation; in the future, this will make it possible to apply inhibitory connections that suppress parasitic activation.
$$U_{MT}^{\{l\}} = \sum_{t=1}^{T} \sum_{k=1}^{K} w(k, t)\, U_{C2}\big(i + k\cos\alpha,\; j + k\sin\alpha,\; t\big), \quad (4)$$

$$U_{MT}^{\{r\}} = \sum_{t=1}^{T} \sum_{k=1}^{K} w(k, t)\, U_{C2}\big(i,\; j,\; \theta \pm k\Delta\theta,\; t\big), \quad (5)$$

where $K$ is the spatial size of the receptive field, $T$ is its temporal depth, and the sign in (5) selects the direction of rotation.
The weights of MT neurons are set using the product of two Gaussians: the first is responsible for the spatial characteristic, and the second sets the attenuation coefficient of the link weight over time.
$$w(k, t) = \exp\left(-\frac{k^{2}}{2\sigma_{s}^{2}}\right) \exp\left(-\frac{(t - t_{k})^{2}}{2\sigma_{t}^{2}}\right), \quad (6)$$

where $\sigma_{s}$ and $\sigma_{t}$ are the spatial and temporal scales and $t_{k}$ is the $k$-th component of the space-time vector t.
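A sketch of the weighting and accumulation for the linear-motion detector, following (4) and (6); this is our illustration, and the σ defaults and the staggering of the space-time vector t are assumptions:

```python
import numpy as np

def mt_weights(K: int, T: int, sigma_s: float = 1.0,
               sigma_t: float = 1.0) -> np.ndarray:
    """Space-time weight matrix w(k, t), Eq. (6).

    t_k is staggered so that position k is expected to fire at frame k,
    which is one plausible reading of the space-time vector t.
    """
    k = np.arange(K)[:, None]      # spatial positions along direction alpha
    t = np.arange(T)[None, :]      # frame indices
    spatial = np.exp(-k**2 / (2 * sigma_s**2))
    temporal = np.exp(-(t - k)**2 / (2 * sigma_t**2))
    return spatial * temporal

def mt_response(u_c2_seq: np.ndarray, w: np.ndarray) -> float:
    """Accumulated activation U_MT over K positions x T frames, Eq. (4).

    u_c2_seq[k, t] is the U_C2 response at the k-th position along the
    preferred direction alpha in frame t.
    """
    return float((w * u_c2_seq).sum())
```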
A uniform filling of such a neuron, i.e. a stationary dark area covering the entire receptive field, will not produce the required activation (Fig. 1). The attenuation coefficient obeys a certain law of change over t: at the beginning of the movement in the receptive field, when the first $U_{C2}$ neuron is activated, the vector t takes its initial values; by the time the activation reaches the end of the receptive field, the vector t has shifted accordingly. As a result, the value of the attenuation coefficient changes over the course of the movement.
Fig. 1. Scheme of the MT neuron operation.
IV. EXPERIMENTS
For an experimental test, we run movements at different angles and check the resulting activation of the $U_{C2}$ neurons in the 45-degree direction. Figs. 2-4 show the frame-by-frame activation of the MT neurons. Ideally, the response should be a thin line, but due to the low resolution there is false activation within 10-20 degrees of the true direction.
Fig. 2. First frame (start of motion; vertically - the angle of movement, horizontally - the response of MT neurons).
Fig. 3. Second frame (vertically - the angle of movement, horizontally - the response of MT neurons).
Fig. 4 shows that the maximum activation is achieved on the second frame, after which the response fades, which confirms our assumption: attenuation is necessary so that the MT neuron does not fire on stationary objects.
Fig. 4. Third frame (end of motion; vertically - the angle of movement, horizontally - the response of MT neurons).
In general, the longer the movement lasts, the more
accurately its direction is determined.
If we increase the size of the receptive field of the space-time vector from 3 to 7, the accuracy of determining the direction of movement increases (Figs. 5-6).
Fig. 5. Fourth frame (vertically - the angle of movement, horizontally - the response of MT neurons).
Fig. 6. Seventh frame (end of motion; vertically - the angle of movement, horizontally - the response of MT neurons).
Another important aspect is the response to changes in the speed of movement within the receptive field. An increase in speed is expressed as an increase in the activation step of the previous-layer neurons in the spatial dimension while the temporal step remains constant. At normal speed, the $U_{MT}$ value is 4.3469; when $U_{C2}$ is activated skipping one neuron, the activation of $U_{MT}$ drops sharply to 0.8448; when skipping two neurons, the activation of $U_{MT}$ is 0.6883.
V. CONCLUSION
The presented model of the MT neuron does not react to a stationary object, since a uniform filling of the receptive field of such a neuron yields an output value close to zero. The movement is encoded by its direction and speed: neurons with different receptive fields are responsible for the direction, while the speed can be determined from the output value of such a neuron.
The best accuracy in determining the direction of movement is obtained with a space-time vector of size (7×7, 7).
The experiments have shown that the proposed model of
the MT neuron responds to movement in the expected way.
ACKNOWLEDGMENT
This paper was financially supported by the Russian Foundation for Basic Research (Grant No. 19-57-45006).
REFERENCES
[1] Y. Cai, J. Liu, Y. Guo, S. Hu, and S. Lang, "Video anomaly detection with multi-scale feature and temporal information fusion," Neurocomputing, vol. 423, pp. 264–273, Jan. 2021, doi: 10.1016/j.neucom.2020.10.044.
[2] R. Szeto, X. Sun, K. Lu, and J. J. Corso, "A Temporally-Aware Interpolation Network for Video Frame Inpainting," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 5, pp. 1053–1068, May 2020, doi: 10.1109/TPAMI.2019.2951667.
[3] C. Jing, P. Wei, H. Sun, and N. Zheng, "Spatiotemporal neural networks for action recognition based on joint loss," Neural Comput. & Applic., vol. 32, no. 9, pp. 4293–4302, May 2020, doi: 10.1007/s00521-019-04615-w.
[4] R. Xu, X. Li, B. Zhou, and C. C. Loy, "Deep Flow-Guided Video Inpainting," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, Jun. 2019, pp. 3718–3727, doi: 10.1109/CVPR.2019.00384.
[5] G. Chen, J. Lu, M. Yang, and J. Zhou, "Spatial-Temporal Attention-Aware Learning for Video-Based Person Re-Identification," IEEE Trans. on Image Process., vol. 28, no. 9, pp. 4192–4205, Sep. 2019, doi: 10.1109/TIP.2019.2908062.
[6] D. J. Heeger, "Model for the extraction of image flow," J. Opt. Soc. Am. A, vol. 4, no. 8, pp. 1455–1471, Aug. 1987, doi: 10.1364/josaa.4.001455.
[7] E. P. Simoncelli and D. J. Heeger, "A model of neuronal responses in visual area MT," Vision Research, vol. 38, no. 5, pp. 743–761, Mar. 1998, doi: 10.1016/S0042-6989(97)00183-1.
[8] M. Chessa, S. P. Sabatini, and F. Solari, "A systematic analysis of a V1–MT neural model for motion estimation," Neurocomputing, vol. 173, pp. 1811–1823, Jan. 2016, doi: 10.1016/j.neucom.2015.08.091.
[9] A. V. Kugaevskikh and A. A. Sogreshilin, "Analyzing the Efficiency of Segment Boundary Detection Using Neural Networks," Optoelectron. Instrument. Proc., vol. 55, no. 4, Jul. 2019, doi: 10.3103/S8756699019040137.
[10] ISO 11664-4:2008. Colorimetry. Part 4: CIE 1976 L*a*b* colour space. 2008-11-01.