Research of a Neuron Model with Signal
Accumulation for Motion Detection
Alexander Kugaevskikh
Department of information technologies
Novosibirsk State University
Novosibirsk, Russia
https://orcid.org/0000-0002-6676-0518
Abstract—The paper presents a new model of the MT neuron (middle temporal area neuron), which detects movement and determines its direction and speed without using recurrent connections. The model is based on the accumulation of the signal and is organized using a space-time vector that sets the weight coefficients. Despite its combinatorial redundancy, the model is expected to be more resistant to glare than optical flow.
Keywords—neural network, motion detection, MT neuron, bio-inspired model
I. INTRODUCTION
In the brain, motion analysis begins in the primary visual cortex. Although its primary function is to highlight edges, its complex cells respond to movement in a particular direction within their receptive field. A more in-depth analysis of movement is performed in areas V3 and V5 (MT). Eventually, a general map of the movements within the visual field is formed.
In computer vision, the problem of motion analysis is most often solved by applying the optical flow equation. When training neural networks to detect motion, we can also speak of the optical flow equation, or rather its underlying mechanism of finding the direction of change in pixel brightness. The approach based on recurrent links, such as GRU [1], LSTM [2], STCNN [3], ResNet [4], and STAL [5], is considered generally accepted for motion detection. The problem with such neural networks is that if a network is not trained to detect movement in a particular direction, it will not detect it. This is especially true for video analytics systems: if we tilt the camera by 45 degrees and do not retrain the neural network, the movement will not be detected. We offer an alternative approach that uses a bio-inspired neuron model for motion detection. In bio-inspired architectures, the most common model is the Heeger model [6, 7], which is analyzed in detail in [8]. The Heeger model is based on the Gabor energy, which combines the real and imaginary parts of the Gabor filter convolved with the image. Using the imaginary part is reasonable in signal processing, since it compensates for redundancy at low frequencies, but it makes no sense in image processing when convolving with pixel brightness.
II. EDGE DETECTION
In our proposed model, we not only get rid of the Gabor energy but also construct the MT neuron in a different way, to better match the optical flow equation. Motion analysis begins with edge detection. For this purpose, we constructed a two-layer neural network [9] based on the Gabor filter and the hyperbolic tangent as the functions that generate the receptive fields of the neurons.
The input of the edge selection neural network is the L* channel (pixel lightness) of the image in the CIE L*a*b* colour space [10]. Given the specifics of this colour space, we no longer need to use the imaginary part of the Gabor filter.
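For illustration, the following minimal sketch shows how such an input can be prepared; it assumes scikit-image is available (the paper does not prescribe any particular tooling):

```python
# A minimal sketch of preparing the network input, assuming scikit-image;
# the paper itself does not prescribe a specific library.
import numpy as np
from skimage import color, io

def l_channel(path: str) -> np.ndarray:
    """Load an RGB image and return its L* channel in CIE L*a*b*."""
    rgb = io.imread(path)
    lab = color.rgb2lab(rgb)   # convert RGB to CIE L*a*b*
    return lab[:, :, 0]        # L* (lightness), range [0, 100]
```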
For the implementation of brightness segmentation, a two-layer neural network is proposed. The first layer highlights lines of a certain orientation. The second layer is responsible for selecting combinations of lines, including corners. The difference from trained convolutional layers, in particular from the first layer of simple cells in the neocognitron, is the use of pre-configured receptive fields for the neurons of the first layer, which increases the predictability and interpretability of the results of such a neural network.
Each layer contains three types of neurons that differ in the configuration of their receptive fields, and the links between the layers are organized in a special way: each neuron of the second layer is connected to only two neurons of the first layer. Thus, the neurons of the second layer detect lines and corners (in the case of the Gabor filter) and quadrilaterals (in the case of the hyperbolic tangent).
The receptive fields based on the Gabor filter are given by

$$g^{p}(x, y) = \exp\left(-\frac{x'^{2} + \gamma^{2} y'^{2}}{2\sigma^{2}}\right) \cos\left(\frac{2\pi}{\lambda} x' + \varphi\right), \quad (1)$$

$$x' = x\cos\theta + y\sin\theta, \qquad y' = -x\sin\theta + y\cos\theta,$$

where $\varphi$, responsible for the symmetry of the filter kernel, was introduced as an alternative to a quadrature pair of filters; it can take the values $\pi/2$ or $-\pi/2$ to detect antisymmetric components and $0$ or $\pi$ for symmetric components. Here $\sigma$ is the filter scale and $\gamma$ is the degree of the filter ellipticity (it defines the elongation of the filter kernel along the ordinate axis). An auxiliary parameter is introduced solely to simplify the selection of the optimal kernel without choosing different values of $\sigma$ and $\lambda$, and it can be expressed through these parameters.
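As a reference, here is a minimal sketch of generating such a kernel per (1); this is our illustration, and the discrete grid and parameter names are our choices:

```python
import numpy as np

def gabor_kernel(size: int, theta: float, sigma: float, lam: float,
                 gamma: float, phi: float) -> np.ndarray:
    """Real-valued Gabor kernel as in Eq. (1).

    phi = +/-pi/2 gives antisymmetric kernels; phi = 0 or pi symmetric ones.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)     # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / lam + phi)
    return envelope * carrier
```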
The first two types of neurons respond to lines of their preferred orientation; their receptive fields are formed using the Gabor filter. The third type of neuron is required to detect zones of brightness difference, and a smooth function must be used to form its receptive field in order to account for effects such as blurring in fog and shadows. For this reason, the Haar wavelet cannot be applied, but the receptive field can be configured using the hyperbolic tangent.
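A minimal sketch of such a receptive field, assuming a tanh profile taken across the preferred orientation (the exact parameterization used in the paper may differ):

```python
import numpy as np

def tanh_kernel(size: int, theta: float, slope: float = 1.0) -> np.ndarray:
    """Smooth edge-detecting kernel: tanh profile across orientation theta.

    A hypothetical parameterization: `slope` controls how sharp the
    transition between the dark and bright half-planes is.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # signed distance from the line of preferred orientation
    d = -x * np.sin(theta) + y * np.cos(theta)
    k = np.tanh(slope * d)
    return k / np.abs(k).sum()   # normalize so responses stay comparable
```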
The neurons of the first layer (denoted $U_{C1}$) of the edge selection neural network use a linear activation function

$$U_{C1}^{p}(i, j) = \sum_{x}\sum_{y} g^{p}(x, y)\, L(i + x, j + y), \quad (2)$$

where $p$ is the neuron's type, $(i, j)$ are the convolution coordinates, and $L$ is the pixel brightness matrix of the input image.
The neurons of the second layer use the sigmoid activation function and operate on the "winner-takes-all" (WTA) principle:

$$U_{C2}(i, j) = \operatorname{WTA}_{p}\left[\frac{1}{1 + e^{-\left(U_{C1}^{p_{a}}(i, j) + U_{C1}^{p_{b}}(i, j)\right)}}\right], \quad (3)$$

where $p_{a}$ and $p_{b}$ denote the two first-layer neurons afferent to the given second-layer neuron.
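A sketch of both layers following (2) and (3); this is our illustration, and the specific pairing of first-layer types feeding each second-layer neuron is an assumption:

```python
import numpy as np
from scipy.ndimage import correlate

def first_layer(L: np.ndarray, kernels: list) -> np.ndarray:
    """Linear responses U_C1^p, Eq. (2): correlate L with each kernel."""
    return np.stack([correlate(L, k, mode="nearest") for k in kernels])

def second_layer(u_c1: np.ndarray, pairs: list) -> np.ndarray:
    """Sigmoid over each pair of afferent U_C1 maps, then WTA, Eq. (3)."""
    u = np.stack([1.0 / (1.0 + np.exp(-(u_c1[a] + u_c1[b])))
                  for a, b in pairs])
    winners = u.argmax(axis=0)           # strongest type at each pixel
    rows, cols = np.indices(winners.shape)
    wta = np.zeros_like(u)
    wta[winners, rows, cols] = u[winners, rows, cols]
    return wta
```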
III. MT NEURON MODEL
The task of motion detection dictates the spatio-temporal organization of the motion detection neural network. Movement, in this case, is the sequential activation over time, i.e. from frame to frame, of several edge selection neurons located along the same direction in a certain neighborhood. Thus, the MT neuron can give the direction of movement α and its speed v. The MT neuron, like the previous neurons, is created for each type p. The connections of the MT neuron with the $U_{C2}$ neurons of the corresponding type determine its receptive field.
To detect linear motion, the receptive field of the MT neuron ($U_{MT}^{\{l\}}$) includes a sequence of $U_{C2}$ neurons along the α direction. To detect rotation, the receptive field of the corresponding MT neuron ($U_{MT}^{\{r\}}$) instead comprises connections with neurons located at the same center of the receptive field but having different orientations θ. The rotation detection neuron is created twice, once for each direction of rotation; in the future, this will make it possible to apply inhibitory connections that suppress parasitic activation.
$$U_{MT}^{\{l\}} = \sum_{t=1}^{T} \sum_{k=1}^{K} w(k, t)\, U_{C2}\big(i + k\cos\alpha,\; j + k\sin\alpha,\; t\big), \quad (4)$$

$$U_{MT}^{\{r\}} = \sum_{t=1}^{T} \sum_{k=1}^{K} w(k, t)\, U_{C2}\big(i,\; j,\; \theta \pm k\Delta\theta,\; t\big), \quad (5)$$

where $K$ is the spatial size of the receptive field, $T$ is its temporal depth, and the sign in (5) selects the direction of rotation.
The weights of MT neurons are set using the product of two Gaussians: the first is responsible for the spatial characteristic, and the second sets the attenuation coefficient of the link weight over time.
$$w(k, t) = \exp\left(-\frac{k^{2}}{2\sigma_{s}^{2}}\right) \exp\left(-\frac{(t - t_{k})^{2}}{2\sigma_{t}^{2}}\right), \quad (6)$$

where $\sigma_{s}$ and $\sigma_{t}$ are the spatial and temporal scales and $t_{k}$ is the $k$-th component of the space-time vector t.
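A sketch of the weighting and accumulation for the linear-motion detector, following (4) and (6); this is our illustration, and the σ defaults and the staggering of the space-time vector t are assumptions:

```python
import numpy as np

def mt_weights(K: int, T: int, sigma_s: float = 1.0,
               sigma_t: float = 1.0) -> np.ndarray:
    """Space-time weight matrix w(k, t), Eq. (6).

    t_k is staggered so that position k is expected to fire at frame k,
    which is one plausible reading of the space-time vector t.
    """
    k = np.arange(K)[:, None]      # spatial positions along direction alpha
    t = np.arange(T)[None, :]      # frame indices
    spatial = np.exp(-k**2 / (2 * sigma_s**2))
    temporal = np.exp(-(t - k)**2 / (2 * sigma_t**2))
    return spatial * temporal

def mt_response(u_c2_seq: np.ndarray, w: np.ndarray) -> float:
    """Accumulated activation U_MT over K positions x T frames, Eq. (4).

    u_c2_seq[k, t] is the U_C2 response at the k-th position along the
    preferred direction alpha in frame t.
    """
    return float((w * u_c2_seq).sum())
```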
A uniform filling of such a neuron, i.e. a stationary dark area covering the entire receptive field, will not produce the required activation (Fig. 1). The attenuation coefficient obeys a certain law of change over t: at the beginning of the movement in the receptive field, when the first $U_{C2}$ neuron is activated, the vector t takes its initial values; by the time the activation reaches the end of the receptive field, the vector t has shifted accordingly. As a result, the value of the attenuation coefficient changes over the course of the movement.
Fig. 1. Scheme of the MT neuron operation.
IV. EXPERIMENTS
For an experimental test, we run movements at different angles and check the resulting activation of the $U_{C2}$ neurons in the 45-degree direction. Figs. 2-4 show the frame-by-frame activation of the MT neurons. Ideally, the response should be a thin line, but due to the low resolution there is false activation within 10-20 degrees of the true direction.
Fig. 2. First frame (start of motion; vertically - the angle of movement, horizontally - the response of MT neurons).
Fig. 3. Second frame (vertically - the angle of movement, horizontally - the response of MT neurons).
Fig. 4 shows that the maximum activation is achieved on the second frame, after which the response fades, which confirms our assumption: attenuation is necessary so that the MT neuron does not fire on stationary objects.
Fig. 4. Third frame (end of motion; vertically - the angle of movement, horizontally - the response of MT neurons).
In general, the longer the movement lasts, the more
accurately its direction is determined.
If we increase the size of the receptive field of the space-time vector from 3 to 7, the accuracy of determining the direction of movement increases (Figs. 5-6).
Fig. 5. Fourth frame (vertically - the angle of movement, horizontally - the response of MT neurons).
Fig. 6. Seventh frame (end of motion; vertically - the angle of movement, horizontally - the response of MT neurons).
Another important aspect is the response to changes in the speed of movement within the receptive field. An increase in speed is expressed as an increase in the activation step of the previous-layer neurons in the spatial dimension while the temporal step remains constant. At normal speed, the $U_{MT}$ value is 4.3469; when $U_{C2}$ is activated skipping one neuron, the activation of $U_{MT}$ drops sharply to 0.8448; when skipping two neurons, the activation of $U_{MT}$ is 0.6883.
V. CONCLUSION
The presented model of the MT neuron does not react to a stationary object, since a uniform filling of the receptive field of such a neuron yields an output value close to zero. The movement is encoded by its direction and speed: neurons with different receptive fields are responsible for the direction, while the speed can be determined from the output value of such a neuron.
The best accuracy in determining the direction of movement is obtained with a space-time vector of size (7×7, 7).
The experiments have shown that the proposed model of
the MT neuron responds to movement in the expected way.
ACKNOWLEDGMENT
This paper was financially supported by the Russian Foundation for Basic Research (Grant No. 19-57-45006).
REFERENCES
[1] Y. Cai, J. Liu, Y. Guo, S. Hu, and S. Lang, "Video anomaly detection with multi-scale feature and temporal information fusion," Neurocomputing, vol. 423, pp. 264–273, Jan. 2021, doi: 10.1016/j.neucom.2020.10.044.
[2] R. Szeto, X. Sun, K. Lu, and J. J. Corso, "A Temporally-Aware Interpolation Network for Video Frame Inpainting," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 5, pp. 1053–1068, May 2020, doi: 10.1109/TPAMI.2019.2951667.
[3] C. Jing, P. Wei, H. Sun, and N. Zheng, "Spatiotemporal neural networks for action recognition based on joint loss," Neural Comput. & Applic., vol. 32, no. 9, pp. 4293–4302, May 2020, doi: 10.1007/s00521-019-04615-w.
[4] R. Xu, X. Li, B. Zhou, and C. C. Loy, "Deep Flow-Guided Video Inpainting," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, Jun. 2019, pp. 3718–3727, doi: 10.1109/CVPR.2019.00384.
[5] G. Chen, J. Lu, M. Yang, and J. Zhou, "Spatial-Temporal Attention-Aware Learning for Video-Based Person Re-Identification," IEEE Trans. on Image Process., vol. 28, no. 9, pp. 4192–4205, Sep. 2019, doi: 10.1109/TIP.2019.2908062.
[6] D. J. Heeger, "Model for the extraction of image flow," J. Opt. Soc. Am. A, vol. 4, no. 8, pp. 1455–1471, Aug. 1987, doi: 10.1364/josaa.4.001455.
[7] E. P. Simoncelli and D. J. Heeger, "A model of neuronal responses in visual area MT," Vision Research, vol. 38, no. 5, pp. 743–761, Mar. 1998, doi: 10.1016/S0042-6989(97)00183-1.
[8] M. Chessa, S. P. Sabatini, and F. Solari, "A systematic analysis of a V1–MT neural model for motion estimation," Neurocomputing, vol. 173, pp. 1811–1823, Jan. 2016, doi: 10.1016/j.neucom.2015.08.091.
[9] A. V. Kugaevskikh and A. A. Sogreshilin, "Analyzing the Efficiency of Segment Boundary Detection Using Neural Networks," Optoelectron. Instrument. Proc., vol. 55, no. 4, Jul. 2019, doi: 10.3103/S8756699019040137.
[10] ISO 11664-4:2008. Colorimetry. Part 4: CIE 1976 L*a*b* colour space. 2008-11-01.