Multi-view domain-adaptive representation learning for EEG-based emotion recognition

Graphical Abstract

Multi-view Domain-adaptive Representation Learning for EEG-based Emotion Recognition
Chao Li, Ning Bian, Ziping Zhao, Haishuai Wang, Björn W. Schuller

Graphical abstract: source and target multi-view EEG signals are fed to a feature extractor built from a multi-view cross-attention module, a band-removal step, and a dilated causal convolutional neural network with feature-level fusion, which extracts the most informative spatio-temporal features, compresses the frequency-band dimension, and captures the causal relationships among features; the fused representation is passed to a label classifier (negative, neutral, positive) and, through a gradient reversal layer (GRL), to a domain discriminator.
Highlights

Multi-view Domain-adaptive Representation Learning for EEG-based Emotion Recognition
Chao Li, Ning Bian, Ziping Zhao, Haishuai Wang, Björn W. Schuller

The CADD-DCCNN can extract the most important features from EEG signals and minimize individual differences between subjects.
Multi-view learning is applied to explore the complementarity between different channels.
A cross-attention mechanism, combined with multi-view learning, can extract discriminative spatio-temporal features.
A dilated causal convolutional neural network is utilized to capture the causal interactions among multi-channel EEG signals.
A domain discriminator enhances model generalization by constraining similar feature distributions between different domains.
Multi-view Domain-adaptive Representation Learning for EEG-based Emotion Recognition

Chao Li a, Ning Bian a, Ziping Zhao a,∗, Haishuai Wang b, Björn W. Schuller c

a College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China
b College of Computer Science, Zhejiang University, Hangzhou, China
c University of Augsburg, Germany and GLAM, Imperial College London, UK
Abstract

Current research suggests that there exist certain limitations in EEG emotion recognition, including redundant and meaningless time-frames and channels, as well as inter- and intra-individual differences in EEG signals from different subjects. To deal with these limitations, a Cross-attention-based Dilated Causal Convolutional Neural Network with Domain Discriminator (CADD-DCCNN) for multi-view EEG-based emotion recognition is proposed to minimize individual differences and automatically learn more discriminative emotion-related features. First, differential entropy (DE) features are obtained from the raw EEG signals using the short-time Fourier transform (STFT). Second, each channel of the DE features is regarded as a view, and attention mechanisms are utilized at different views to aggregate the discriminative affective information at the level of the time-frame of EEG. Then, a dilated causal convolutional neural network is employed to distill nonlinear relationships among different time frames. Next, feature-level fusion is used to fuse features from multiple channels, aiming to explore the potential complementary information among different views and enhance the representational ability of the features. Finally, to minimize individual differences, a domain discriminator is employed to generate domain-invariant features, which projects data from the different domains into the same representation space. We evaluated our proposed method on two public datasets, SEED and DEAP. The experimental results illustrate that our CADD-DCCNN method outperforms the SOTA methods.

Keywords: Affective computing, cross-attention, domain adaptation, EEG, emotion recognition, multi-view learning
1. Introduction

Emotion is a state that combines a human's feelings, thoughts, and behaviors, and which influences their rational decision-making, perception, and cognition; accordingly, emotion has a significant impact on interpersonal communication [1]. Therefore, emotions play an important role in a wide range of studies. Emotion recognition has also been widely put into practice in many fields such as depression diagnosis [2] and human–computer interaction [3]. Typically, multiple modalities are used for emotion recognition, including facial expressions [4], body gestures [5], voice [6], electrocardiography (ECG) [7], electroencephalography (EEG) [8], and electromyography (EMG) [9]. A great number of studies [10, 11] have found that multi-view EEG signals, which have a strong relationship with emotion, are more difficult to disguise compared to other modalities. Therefore, they can be utilized as an effective approach for detecting emotions [12, 13].
∗Corresponding author at: College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China.
Email addresses: superlee@tjnu.edu.cn (Chao Li), bianning0622@163.com (Ning Bian), ztianjin@126.com (Ziping Zhao), haishuai.wang@zju.edu.cn (Haishuai Wang), schuller@tum.de (Björn W. Schuller)
In EEG-based emotion recognition, significant progress has been made by numerous researchers in recent years. However, there are still two problems in urgent need of a solution. First, multi-view EEG signals can vary significantly among individuals due to differences in brain structure and patterns of brain activity across different subjects. Training a common model that can generalize across different datasets or subjects is a challenging task. Therefore, the primary problem in this context is that of how to deal with individual differences between different subjects or even different sessions of the same subject. Second, because the different time frames and channels of multi-view EEG signals contribute to emotion recognition to varying degrees, it is vital to develop better ways of identifying and utilizing these signals in emotion recognition classifiers. Therefore, it is necessary to obtain more distinguishing spatio-temporal features related to emotions to boost the effectiveness of emotion recognition.
To address the previously mentioned concerns in multi-view EEG-based emotion recognition, we propose a Cross-attention-based Dilated Causal Convolutional Neural Network with Domain Discriminator for multi-view EEG-based emotion recognition, called CADD-DCCNN. We use multiple source domains that correspond to one target domain for domain adaptation to learn the commonalities between different domains. Moreover, a multi-view cross-attention mechanism is applied to learn the connection between EEG signals and emotional stimuli from multiple channel views, thereby enhancing the efficiency and stability of emotion recognition. The CADD-DCCNN method we propose takes a multi-view EEG signal as input, which is represented by a sequence of data from different electrodes, and produces an emotion label corresponding to this input. We focus on resolving several key issues in emotion recognition by adopting the following two strategies: (1) we aim to explore the complementary information between multiple channels through multi-view learning and combine attention mechanisms to identify the most informative spatio-temporal features for emotion recognition by selecting the most relevant channels and time frames; (2) we use domain adaptation techniques to eliminate individual differences among different subjects and build domain-invariant features.
Channel and time frame selection. Collected multi-view EEG signals have varying relevance to emotion, and different time frames may also activate emotions to varying degrees. In this paper, our purpose is to identify the most informative features for emotion recognition by assigning different weights to different channels and time frames. We use neuro-physiological research and a data-driven approach to explore subtle relations both within and between channels and time frames. We treat each channel as a view and analyze EEG signals from multiple views. To achieve this goal, a multi-view cross-attention mechanism is used in our proposed CADD-DCCNN. Firstly, attention mechanisms are applied within each view to explore the activation levels of different time frames. Then, through multi-view learning, we explore the importance of different views to search for the optimal channels and time frames relevant to emotions. We employ multi-view learning to explore the complementarity and correlation between multi-view EEG signals, which helps enhance the performance and generalizability of emotion recognition.
Domain-invariant features. Because of the shift in data distribution among different individuals, many previous studies in multi-view EEG-based emotion recognition build the model based on an individual's brain responses. Although user-dependent models are popular, some recent studies [14–16] suggest building specially designed user-independent models. Accordingly, to deal with the problem of distribution shift in data, we integrate a domain discriminator to constrain the distributions of the features obtained from the training (source) and testing (target) data to be similar.
In summary, our proposed CADD-DCCNN method utilizes a multi-view cross-attention mechanism to extract distinctive emotion-related features by capturing the nonlinear connections between multiple channels. Furthermore, it utilizes a global domain discriminator to ensure a domain-invariant data representation. Our CADD-DCCNN method is evaluated on two widely used EEG emotional datasets (SEED [17] and DEAP [18]) to demonstrate its effectiveness. Additionally, ablation studies are executed to exhibit the effectiveness of our multi-view learning module, cross-attention mechanism, dilated causal convolutional neural network, and domain discriminator module. In particular, the primary contributions of this paper are:
1) We employ a multi-view cross-attention mechanism, where each channel of the DE features is considered as a view. Attention mechanisms are then applied to each view to extract discriminative information for each emotion. We explore the importance of different time frames on each channel and uncover the complementarity between different views to enhance the performance of our method. Our experimental results indicate the effectiveness of the multi-view cross-attention mechanism in selecting emotion-related channels and time frames.

2) A dilated causal convolutional neural network is utilized to capture the causal interactions among the features.

3) A domain discriminator is integrated to generate a common feature space that constrains similar feature distributions between different (source and target) domains; this allows us to not only reduce the discrepancy in data distribution in cross-subject scenarios but also improve model generalization in cross-session experiments.

4) We propose a framework for multi-view EEG-based emotion recognition, CADD-DCCNN, which addresses the challenges of channel and time frame selection and of building domain-invariant features. We assess our proposed approach using two publicly available datasets. Our approach obtains SOTA performance on both datasets. Specifically, on SEED, it achieves an average accuracy of 92.44%. On DEAP, it achieves mean accuracies of 69.45% and 70.50% for valence and arousal, respectively.
The remainder of this paper is organized as follows. Section 2 describes the related work. The proposed CADD-DCCNN-based emotion recognition method is presented in Section 3. Experiments conducted on the two emotion datasets and the analysis of the experimental results are reported in Section 4. Finally, in Section 5, we conclude this paper.
2. Related Work
2.1. EEG Features for Emotion Recognition
Traditional handcrafted features and machine learning classifiers are commonly utilized in emotion detection. However, traditional machine learning approaches are significantly constrained by feature design and selection, which often demand a substantial amount of prior knowledge. To overcome these limitations, deep learning technology was developed to learn data representations [19, 20]. Inspired by the success of deep learning in speech recognition and natural language processing [21–23], some scholars have utilized deep learning to extract features from EEG signals in emotion recognition.
Time-domain features, such as amplitude, variance, and mean, are usually implemented to extract the time-domain statistics of EEG signals, which are the most intuitive and accessible features in EEG signal analysis. Frequency-domain features show how the EEG waveform changes with frequency. The main idea behind frequency-domain features is to transform the time-domain signals into frequency-domain signals; the results reflect the way the features of the signal change with frequency, allowing the distribution of the various rhythms in EEGs to be observed more intuitively. In order to extract frequency-domain features, EEG signals are usually decomposed into several frequency bands (δ band (1–3 Hz), θ band (4–7 Hz), α band (8–13 Hz), β band (14–30 Hz), γ band (31–50 Hz)) [24, 25]. Then, methods such as differential entropy (DE) [25], wavelet transform (WT) [26], and power spectral density (PSD) are utilized to extract frequency-domain features from every frequency band. Time-frequency features consider the above two aspects of the signal, describing the ways in which it changes with both time and frequency. Methods such as the STFT [27] and the Hilbert–Huang transform (HHT), among others, are commonly employed to extract time-frequency features.
2.2. Domain Adaptation
Creating subject-specific models for every subject is a feasible but impractical solution for addressing individual differences, as it would require a significant amount of effort to collect a labeled dataset for each subject. Another way to tackle this issue is to employ domain adaptation (DA) methods, which intend to minimize the distribution differences between different domains, thereby aiding the learning of transferable features for emotion recognition.
Wang et al. [28] categorized deep domain adaptation into discrepancy-based, adversarial-based, and reconstruction-based methods. Adversarial-based methods aim to minimize the distance between different domains by training a domain discriminator to classify both domain types. Bao et al. [29] developed a two-level domain adaptation neural network (TDANN) that reduces the distribution discrepancy of deep features between different domains through employment of maximum mean discrepancy (MMD) and a domain adversarial neural network (DANN). Chen et al. [30] proposed a multi-source EEG-based emotion recognition network (MEERNet), which adopts multiple source domains to adapt to an individual target domain separately for domain adaptation, in order to extract domain-invariant and domain-specific features. Liu et al. [31] proposed an extended domain adaptation method based on subject clustering (DASC), which mitigates the effects of "negative transfer" by incorporating subject clustering. However, although these methods can effectively reduce individual differences and improve generalization performance, the feature extraction methods they choose may ignore important information in EEG signals. TDANN selected a deep CNN method, which incorporates multiple convolutional layers and two max pooling layers, but which may lead to a lot of valuable information being lost and the relationships between the whole and the parts being ignored. MEERNet and DASC chose to use a multi-layer perceptron (MLP) method, which might cause the omission of the spatial information of EEG signals. In summary, when extracting features, these methods may neglect the relationships within and between channels as well as the temporal dimension. This also raises a second problem that needs to be solved, namely that of how to identify EEG samples that contain a higher amount of emotional information.
2.3. Multi-view Learning for EEG Emotion Recognition
Data is often portrayed using various perspectives, incorporating multiple modalities or features [32]. In the field of medical research, Alzheimer's disease (AD) encompasses different types of data modalities, including Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET), along with diverse features [33]. Multimedia data, including text, video, and audio, is commonly sourced from various origins or described by multiple features [34, 35], such as temporal and frequency features [36]. Empirical evidence from various studies [37, 38] consistently indicates that multi-view learning combines different views to explore the complementarity between them, thereby improving model accuracy. Multi-view learning has gained extensive popularity and adoption across diverse tasks, owing to its notable effectiveness [39, 40].
Moreover, several recent works in the field of multi-view learning have successfully incorporated deep learning techniques [41–43]. However, few researchers have applied multi-view learning to emotion recognition in EEG signals. EasyDA [44] utilizes an approximate empirical kernel map generated from samples in the source and target domains to map each view into a domain-generalization feature space, followed by a parameterless weighted combination of the views. However, this method overlooks the interrelationships between multiple channels in EEG signals and the degree of association between different channels and different emotions. Since different channels play different roles under different emotions, we consider each channel (i.e., electrode) as a view and explore the complementarity between multiple channels. We conduct emotion recognition research on EEG signals using multi-view learning based on multiple channels.
2.4. Attention Mechanism
Some researchers have utilized attention mechanisms to extract emotion-related spatio-temporal features in EEG-based emotion recognition. Jia et al. [45] utilized an attention mechanism to transform channels into a 2D map when calculating weights, and used pooling to calculate an attention matrix. Li et al. [46] proposed the transferable attention neural network (TANN), which incorporates a global attention layer to merge the features from the entire brain regions and emphasize significant regions for emotion classification. However, these methods tend to neglect the interactions between EEG signal channels and temporal relevance. To address this limitation, the Transformer model, based on the self-attention mechanism, assigns different weights to different time frames in the temporal dimension. This allows the temporal dynamics of emotions to be fully considered. Consequently, many researchers have integrated the Transformer model into EEG-based emotion recognition tasks. Wang et al. [47] utilized weight-shared transformer encoders to adaptively capture the importance of different time frames within each channel. They also combined a hierarchical spatial encoder to capture the correlations between channels. Si et al. [48] proposed a hierarchical hybrid model called MACTN, which extracts local emotional features, global emotional features, and emotion-relevant channels through a CNN, a Transformer, and channel attention, respectively. Transformer models mainly focus on global temporal feature extraction, often overlooking local temporal information. On the other hand, CNNs and Transformers have different abilities to extract information at various scales. Therefore, some scholars have combined CNNs and Transformers to extract both local and global temporal information. However, it is worth noting that convolution and pooling operations can disrupt temporal and channel correlations.
To address this issue, we note that Hao et al. [49] opted to employ an attention mechanism, which is popularly employed in computer vision, natural language processing, and multivariate time series (MTS) classification tasks due to its ability to locate key regions in images, key parts in sentences, and key variables in MTS. We hypothesize that this model also has the ability to locate global key information. We consider each channel as a view, and each view contains multiple time frames. Based on this, we first utilize an attention mechanism in the temporal dimension to locate key time frames. Then, through multi-view learning, we extract key information and complementary information between multiple views, forming a cross-attention mechanism based on multi-view learning. This allows the model to automatically identify key emotion-related information in multi-view EEG signals.
3. Proposed Methodology
3.1. System Overview
In our model, as shown in Fig. 1, features (Fig. 1(a)) containing temporal and spatial information are generated from the raw signals for each subject. These features are then input into a domain adversarial neural network in which the deep representation of the features is extracted by means of a multi-view cross-attention mechanism (Fig. 1(b)) and a dilated causal convolutional neural network (Fig. 1(d)). Finally, the deep representation, fused through a max-pooling layer, is fed to two parts: a domain discriminator (Fig. 1(f)) and a label classifier (Fig. 1(e)). The domain discriminator is utilized to determine the domain from which the input originates (training data or testing data), in order to narrow the distribution shift. The label classifier assigns the deep representation to a class label within a classification space.

Figure 1: The network structure of the CADD-DCCNN framework for EEG emotion recognition: (a) input EEG signals (source and target data), (b) multi-view cross attention, (c) remove band, (d) dilated causal convolution (three convolutional layers with 72, 48, and 24 output channels, followed by pooling and fusion), (e) label classifier, and (f) domain discriminator with a gradient reversal layer (GRL).
3.2. Input EEG Signal Representation
Since DE features have exhibited remarkable performance in multi-view EEG-based emotion recognition [50], we implement the proposed CADD-DCCNN method with DE features obtained from multi-view EEG signals as input. Using (1), we formulate the DE:

f(S) = -∫_{-∞}^{+∞} (1/√(2πσ²)) e^{-(s-µ)²/(2σ²)} log( (1/√(2πσ²)) e^{-(s-µ)²/(2σ²)} ) ds = (1/2) log(2πeσ²)    (1)
where S is a slice of the EEG signal that obeys the Gaussian distribution N(µ, σ²). In particular, in the datasets referred to in Section 4.1, every subject has multiple captured trials, and every trial consists of a series of multi-view EEG signals of a specific duration recorded while experiencing a specific emotion. For each trial, DE features are extracted from five distinct frequency bands of each channel using the STFT with a non-overlapping 1-second Hanning window. The size of the DE features for 1-second windows is (62, 5) and (32, 5) for the SEED and DEAP datasets, respectively. We concatenate the DEs of all windows into a feature vector that represents one trial.
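As a rough illustration of this step, the sketch below computes per-band DE values from one EEG channel over non-overlapping 1-second Hanning windows, using the closed form 0.5·log(2πeσ²) with the windowed band power as the variance estimate. The band edges, the sampling rate, and the use of SciPy's `stft` are assumptions made for illustration and are not taken from the authors' code.

```python
import numpy as np
from scipy.signal import stft

BANDS = {"delta": (1, 3), "theta": (4, 7), "alpha": (8, 13),
         "beta": (14, 30), "gamma": (31, 50)}

def de_features(eeg, fs=200):
    """Differential entropy per 1-second window and frequency band.

    eeg: 1-D array for a single channel; fs: sampling rate in Hz (assumed).
    Returns an array of shape (n_windows, n_bands).
    """
    # Non-overlapping 1-second Hanning windows, as described in the text.
    freqs, _, Z = stft(eeg, fs=fs, window="hann",
                       nperseg=fs, noverlap=0, boundary=None)
    power = np.abs(Z) ** 2                       # (n_freqs, n_windows)
    feats = []
    for lo, hi in BANDS.values():
        idx = (freqs >= lo) & (freqs <= hi)
        sigma2 = power[idx].mean(axis=0) + 1e-12  # band variance proxy per window
        feats.append(0.5 * np.log(2 * np.pi * np.e * sigma2))
    return np.stack(feats, axis=1)                # (n_windows, 5)
```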
The proposed CADD-DCCNN method uses an input matrix denoted as t_m, which corresponds to the m-th subject. Specifically, t_m = [t_{m,1}, t_{m,2}, ..., t_{m,n}] ∈ R^{n×d_s}, where d_s represents the size of a DE feature vector, and the vector t_{m,i}, with d_s dimensions, represents the feature associated with the i-th trial of the m-th subject.
3.3. Attention-based Feature Extractor
The feature extractor consists of a multi-view cross-attention mechanism (MvCA), a band removal mechanism, and
a dilated causal convolutional neural network (DCCNN).
3.3.1. Multi-view Cross-Attention Mechanism (MvCA)
The input of the feature extractor is a multivariate (time frame and channel) EEG signal sequence. Human emotions are progressive and diverse, and the activation degrees of different emotions also differ across brain regions. Therefore, we consider each channel as a view. Additionally, since different emotions exhibit varying activation patterns across different time frames, we employ an attention mechanism within each view to calculate attention weights along the time dimension. Finally, we employ a similar attention mechanism across multiple views to dynamically learn the weights for each view, and then combine them with the original signal to build a new vector representation for the EEG trial using multi-view learning.
The MvCA module in our model comprises two modules: multi-view time-frame attention (MvTFA) and multi-view attention (MvA). In this paper, we consider a channel as a view. MvTFA extracts the long- and short-term dependencies of past values in each view of the signals. MvA evaluates connections between multiple views. The MvCA module initially applies the MvTFA module. We employ S = (s_1^1, s_1^2, ..., s_1^L), ..., (s_C^1, s_C^2, ..., s_C^L) to embody the original data feature sequence, where L represents the count of time frames and C denotes the quantity of views. The specific computation process for one view is demonstrated in Fig. 2.

Figure 2: Time-frame attention mechanism for calculating the attention of one channel: S ∈ R^{L×B} is mapped by 1×1 convolutions W_Q, W_K, and W_V to Q(S) ∈ R^{L×B}, K(S) ∈ R^{L×B}, and V(S) ∈ R^{L×B}; H ∈ R^{L×L} is normalized by a softmax into α ∈ R^{L×L}, producing O_TFA ∈ R^{L×B}.
As the first stage in the MvCA module, S is converted into three distinct feature domains (Q, K, and V) using (2):

Q(S) = S · W_Q,  K(S) = S · W_K,  V(S) = S · W_V    (2)

where W_Q, W_K, and W_V are weight matrices with dimension B × B. The outputs Q(S), K(S), and V(S) share the dimension L × B and represent the query space, key space, and value space, respectively. The MvCA seizes the connections between a potential query and the key–value pairs in the data. The conversion of S into each feature space can be accomplished by utilizing a 1 × 1 convolutional operation on S.
During the second stage, the time-frame attention for a single view, indicated as α in Fig. 2, is computed from the features in the query and key spaces by the two equations depicted in (3) and (4):

H = Q(S) · K(S)^T    (3)

α_{q,k} = exp(H̃_{q,k}) / Σ_{j=1}^{q} exp(H̃_{q,j}),  1 ≤ k ≤ q ≤ L.    (4)

Here, H_{q,k} is a hidden state within the H matrix (where H ∈ R^{L×L}) that records the attention of features from previous time step k to current time step q. To ensure that H_{q,k} is only used when k ≤ q, H is updated to H̃ by setting H_{m,n} to zero when m < n. That is to say, the upper-right corner of the H matrix is set to zero, resulting in a lower triangular matrix. Hence, H̃ can directly seize the attention of both long- and short-term past values. Finally, (4) is used to normalize the attention, which produces the attention matrix α.
During the third stage, α is applied to the features in V(S) to compute the attention output O_MvTFA using (5). Next, O_MvTFA is merged with S to obtain the hidden states Y using (6):

O_MvTFA = α · V(S)    (5)

Y = O_MvTFA + S.    (6)
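For concreteness, a minimal PyTorch sketch of the time-frame attention in (2)–(6) is given below, assuming one view is stored as a tensor of shape (batch, B, L) so that 1×1 convolutions play the role of W_Q, W_K, and W_V; the module and variable names are ours, and the sketch is an interpretation of the equations rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class TimeFrameAttention(nn.Module):
    """Masked time-frame attention for one view, following Eqs. (2)-(6)."""
    def __init__(self, feat_dim):
        super().__init__()
        # 1x1 convolutions realise the B x B weight matrices W_Q, W_K, W_V.
        self.w_q = nn.Conv1d(feat_dim, feat_dim, kernel_size=1)
        self.w_k = nn.Conv1d(feat_dim, feat_dim, kernel_size=1)
        self.w_v = nn.Conv1d(feat_dim, feat_dim, kernel_size=1)

    def forward(self, s):                               # s: (batch, B, L)
        q, k, v = self.w_q(s), self.w_k(s), self.w_v(s)
        h = torch.einsum("bdq,bdk->bqk", q, k)          # (batch, L, L), Eq. (3)
        # Zero the upper triangle so frame q only attends to frames k <= q.
        mask = torch.tril(torch.ones_like(h, dtype=torch.bool))
        h = h.masked_fill(~mask, float("-inf"))
        alpha = torch.softmax(h, dim=-1)                # Eq. (4)
        o = torch.einsum("bqk,bdk->bdq", alpha, v)      # Eq. (5)
        return o + s                                    # residual, Eq. (6)
```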
To our knowledge, to compute attention, current attention mechanisms either disregard the temporal order of features [45] or only consider features that already encode the temporal order. The MvTFA mechanism in our proposed model differs from these existing attention mechanisms because it can capture the time dependency of values directly.
Apart from developing the MvTFA module to learn the long- and short-term dependencies of features in the time sequence of each view, we also construct the MvA module to assess the relationships between views. The input of the MvA is a feature sequence at time step t from the C views, denoted as Y_t = (y_1^t, y_2^t, ..., y_C^t). Using a process similar to that used to compute α in the MvTFA module, we compute a normalized view attention. We then apply the view attention to Y_t and denote the output as O_MvA. Finally, O_MvA is merged with Y once more to obtain the hidden states Z using (7):

Z = O_MvA + Y.    (7)
The ultimate features in Z, which combine both O_MvTFA and O_MvA, are referred to as multi-view cross-attention features.
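A view-level counterpart can be sketched in the same style: at a given time step, the C view vectors are attended over instead of the L time frames, and no causal mask is applied because views carry no temporal order. This is our reading of how the "similar process" in the MvA module could be realized, not a definitive implementation.

```python
import torch
import torch.nn as nn

class ViewAttention(nn.Module):
    """View-level attention at a single time step (MvA sketch, Eq. (7))."""
    def __init__(self, feat_dim):
        super().__init__()
        self.w_q = nn.Conv1d(feat_dim, feat_dim, kernel_size=1)
        self.w_k = nn.Conv1d(feat_dim, feat_dim, kernel_size=1)
        self.w_v = nn.Conv1d(feat_dim, feat_dim, kernel_size=1)

    def forward(self, y_t):                             # y_t: (batch, B, C)
        q, k, v = self.w_q(y_t), self.w_k(y_t), self.w_v(y_t)
        h = torch.einsum("bdq,bdk->bqk", q, k)          # (batch, C, C)
        alpha = torch.softmax(h, dim=-1)                # view attention weights
        o = torch.einsum("bqk,bdk->bdq", alpha, v)      # attended view features
        return o + y_t                                  # residual, Eq. (7)
```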
3.3.2. Band Removal
The second part is a band removal mechanism. We perform a 1 × 1 × 1 convolution on the features from the MvCA to compress the frequency-band dimension and reduce the amount of subsequent computation.
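One plausible reading of this step, assuming the MvCA output is laid out as (batch, bands, views, time frames), is a pointwise convolution that maps the five frequency bands to a single plane while leaving the view and time axes untouched; the shapes below are placeholders of ours.

```python
import torch
import torch.nn as nn

# Pointwise convolution over the band axis: 5 frequency bands -> 1 plane.
band_removal = nn.Conv2d(in_channels=5, out_channels=1, kernel_size=1)

x = torch.randn(8, 5, 62, 235)       # (batch, bands, views, time frames)
y = band_removal(x).squeeze(1)       # (batch, views, time frames)
```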
3.3.3. Dilated Causal Convolutional Neural Network (DCCNN)
Figure 3: A dilated causal convolution with dilation factors d = 1, 2, 4 and filter size k = 3, stacking an input layer, two hidden layers, and an output layer.
The third part of the feature extractor is a DCCNN. The DCCNN is formed by the combination of dilated convolution and causal convolution. Causal convolution is a convolutional model used for solving time series problems, where each node only considers preceding nodes, ensuring that information cannot flow into the future and capturing the causal relationships between time frames. Dilated convolution, on the other hand, expands the receptive field, allowing the current node to look 'very far' into the past. When the dilation factor is 1, dilated convolution is equivalent to a standard Convolutional Neural Network (CNN). Fig. 3 shows an illustration of a DCCNN, where the kernel size k is 3 and the dilation factors d of the three dilated layers are 1, 2, and 4, respectively. A 1D dilated convolution is calculated as shown in (8):

p(s) = Σ_{l=0}^{k-1} f(l) · h(s − d·l).    (8)

Here, h(·) represents the input, p(·) represents the output of the dilated convolution, f(l) represents the filter of length k, and s − d·l indexes the input in the historical direction of position s.
In simple terms, dilated convolution, without increasing the number of parameters, incorporates local information at different scales by setting different dilation factors at different layers. Therefore, by using the DCCNN, we can further extract the temporal features and causal relationships of EEG signals while reducing the number of parameters for subsequent calculations, thereby improving computational speed.
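A 1-D dilated causal layer matching (8) can be sketched in PyTorch by left-padding the sequence by (k − 1)·d before an ordinary dilated convolution, so that output frame s never sees inputs later than s. The channel widths below are placeholders, and the stack follows the toy setting of Fig. 3 (k = 3, d = 1, 2, 4) rather than the exact configuration reported in Section 4.2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv1d(nn.Module):
    """p(s) = sum_{l=0}^{k-1} f(l) * h(s - d*l), as in Eq. (8)."""
    def __init__(self, in_ch, out_ch, k=3, d=1):
        super().__init__()
        self.left_pad = (k - 1) * d                 # pad only the past side
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=k, dilation=d)

    def forward(self, x):                           # x: (batch, in_ch, L)
        x = F.pad(x, (self.left_pad, 0))            # keeps the output causal
        return self.conv(x)                         # (batch, out_ch, L)

# Three stacked layers with k = 3 and d = 1, 2, 4, mirroring Fig. 3.
dccnn = nn.Sequential(
    DilatedCausalConv1d(1, 16, k=3, d=1), nn.ReLU(),
    DilatedCausalConv1d(16, 16, k=3, d=2), nn.ReLU(),
    DilatedCausalConv1d(16, 16, k=3, d=4), nn.ReLU(),
)
```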
3.4. Domain Discriminator (DD)
Motivated by the generative adversarial network (GAN), we develop an adversarial training process for the feature extractor and the DD. Throughout the training process, the DD seeks to determine whether the features belong to the source or target domain. Meanwhile, the feature extractor is trained to transform the inputs from different domains into a shared latent space. By reducing the classification capability of the DD, the feature extractor is trained to produce features that are domain-independent. By adopting this approach, our proposed method can reduce the problem of feature distribution shift.
In particular, we first use average pooling to fuse the deep representation from the multi-view features and transform the input P(S) into a vector d_m. Then, we introduce a gradient reversal layer (GRL), which can maximize the DD loss, prior to applying the ReLU activation to d_m as calculated by (9). The GRL has no effect during forward propagation, but reverses the direction of gradient transfer during backpropagation, so that the updating direction is inverted.

d_m^r = ReLU(W_r · d_m + b_r)    (9)

d_m^s = softmax(W_s · d_m^r + b_s).    (10)
Finally, we derive the probability of the input originating from the source or target domain by transforming d_m^r into a 2D space and applying a softmax function in (10), where W_r, b_r, W_s, and b_s are weight matrices and bias vectors that can be learned during training.
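A common way to realize the GRL is a custom autograd function that acts as the identity in the forward pass and multiplies the gradient by −λ in the backward pass; the sketch below pairs it with the two layers of (9) and (10). The hidden size and the constant λ are our assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient by -lambda backward."""
    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambda_ * grad_output, None

class DomainDiscriminator(nn.Module):
    def __init__(self, feat_dim, hidden=64, lambda_=1.0):
        super().__init__()
        self.lambda_ = lambda_
        self.fc_r = nn.Linear(feat_dim, hidden)     # Eq. (9): ReLU layer
        self.fc_s = nn.Linear(hidden, 2)            # Eq. (10): source vs. target

    def forward(self, d_m):                         # d_m: pooled deep features
        d_m = GradReverse.apply(d_m, self.lambda_)
        d_r = torch.relu(self.fc_r(d_m))
        return torch.softmax(self.fc_s(d_r), dim=-1)
```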
3.5. Label Classifier
The label classifier is connected to the emotion recognition task and is trained to learn the distribution of emotions in order to output emotion labels from deep representations of EEG signals. We first use average pooling to fuse the multi-view features to learn more discriminative representations. Then, to decode these deep representations, we use several fully connected (FC) layers to construct a classifier. The label classifier predicts emotions by mapping the deep representations from the common space to the emotion space. Specifically, the classifier comprises three FC layers and a softmax function, which transforms the network predictions into emotions using (11):

d_{m,f}^s = softmax(W_s · d_m + b_s).    (11)

Here, W_s and b_s are the learnable weight matrix and bias vector.
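Using the layer widths reported later in Section 4.2 (1024, 150, and the number of emotion classes), the classifier can be sketched as three FC layers followed by a softmax, as in (11); the input width is a placeholder.

```python
import torch.nn as nn

def build_label_classifier(feat_dim, n_classes=3):
    """Three fully connected layers + softmax, Eq. (11); widths follow Section 4.2."""
    return nn.Sequential(
        nn.Linear(feat_dim, 1024), nn.ReLU(),
        nn.Linear(1024, 150), nn.ReLU(),
        nn.Linear(150, n_classes),
        nn.Softmax(dim=-1),
    )
```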
4. Experiments and Results
4.1. Datasets
4.1.1. SEED
The SEED dataset is a public affective EEG dataset for emotion recognition. It contains EEG data acquired from 15 subjects, recorded via 62 EEG electrodes while they watched 15 film clips, each lasting about four minutes. The clips elicited three types of emotions (positive, neutral, and negative). Each subject participated in the experiments three times on different days, watching the same 15 clips in each experiment; these settings allow for subject-dependent (SD) and subject-independent (SI) experiments to validate the robustness and transferability of emotion recognition models. In SEED, a bandpass filter ranging from 1.0 to 75.0 Hz was employed. Then the DE features were extracted. Since the 15 trials undertaken by each subject were of different durations, we performed a zero-filling operation: specifically, we took the longest trial duration as the final duration and then appended zeros to the other trials to unify the trial length.
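The zero-filling can be done by padding each trial's DE sequence along the time axis up to the length of the longest trial; a small NumPy sketch with placeholder shapes follows.

```python
import numpy as np

def pad_trials(trials):
    """Zero-pad variable-length trials to the length of the longest one.

    trials: list of arrays shaped (n_frames_i, 62, 5)  # DE features per trial
    returns: array shaped (n_trials, max_frames, 62, 5)
    """
    max_len = max(t.shape[0] for t in trials)
    out = np.zeros((len(trials), max_len) + trials[0].shape[1:],
                   dtype=trials[0].dtype)
    for i, t in enumerate(trials):
        out[i, :t.shape[0]] = t          # trailing frames stay zero
    return out
```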
4.1.2. DEAP
The DEAP dataset is also a public affective EEG dataset for emotion recognition, containing data collected from 32 subjects watching 40 one-minute music videos. Participants rated their levels from 1 to 9 after watching each video on four dimensions: Arousal, Valence, Liking, and Dominance. In this paper, we removed the eight peripheral channels and used only EEG signals for emotion recognition. We split the valence and arousal dimensions into high/low separately, resulting in two binary classification tasks. In DEAP, a bandpass filter ranging from 4.0 to 45.0 Hz was used, as reported in [51]. Initially, we decompose the EEG signals into the same five frequency bands. Next, DE features are extracted from each channel for every frequency band.
4.2. Experimental Settings
The feature extractor of our model includes a cross-attention module, a multi-view fusion module, and a DCCNN. The DCCNN includes three convolutional layers with kernels of (1, 8), (1, 5), and (1, 3), respectively. The dilation factors for the second and third layers are (1, 8) and (1, 5). As the DCCNN only works in the time dimension, the convolutional kernel and dilation factor are set to 1 in the channel dimension. The output tensors of the three convolutional layers have 72, 48, and 24 channels, respectively. The label classifier consists of three fully connected layers, with outputs of size 1024, 150, and 3 (for the SEED dataset) or 2 (for the DEAP dataset). During the training process, we utilized a batch size of 5 for 200 epochs and employed the Adam optimizer with a learning rate of 1e-4.
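As a rough, runnable illustration of these settings, the sketch below wires toy stand-ins for the three components into a single adversarial training step with the Adam optimizer, learning rate 1e-4, and batch size 5; the stand-in layers, the shapes, and the omission of the GRL (the two losses are simply summed here) are simplifications of ours, not the actual model.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the modules of Section 3, just to make the step runnable.
feature_extractor = nn.Sequential(nn.Flatten(), nn.Linear(62 * 5, 24))
label_classifier = nn.Sequential(nn.Linear(24, 1024), nn.ReLU(),
                                 nn.Linear(1024, 150), nn.ReLU(),
                                 nn.Linear(150, 3))
domain_discriminator = nn.Sequential(nn.Linear(24, 64), nn.ReLU(),
                                     nn.Linear(64, 2))

params = (list(feature_extractor.parameters())
          + list(label_classifier.parameters())
          + list(domain_discriminator.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)        # learning rate from 4.2
criterion = nn.CrossEntropyLoss()

# One training step on a dummy batch of size 5 (62 channels x 5 bands).
src_x, src_y = torch.randn(5, 62, 5), torch.randint(0, 3, (5,))
tgt_x = torch.randn(5, 62, 5)
optimizer.zero_grad()
src_feat, tgt_feat = feature_extractor(src_x), feature_extractor(tgt_x)
cls_loss = criterion(label_classifier(src_feat), src_y)
dom_logits = domain_discriminator(torch.cat([src_feat, tgt_feat]))
dom_labels = torch.cat([torch.zeros(5, dtype=torch.long),
                        torch.ones(5, dtype=torch.long)])
dom_loss = criterion(dom_logits, dom_labels)
(cls_loss + dom_loss).backward()     # the real model routes dom_loss through a GRL
optimizer.step()
```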
4.3. Compared Methods
In this part, we present experimental results on two commonly adopted EEG datasets (as outlined above, SEED and DEAP) and compare our proposed CADD-DCCNN with several other methods, which are listed below:

Two traditional shallow machine learning techniques: SVM [52] and kNN [53];

Three deep neural network models: DGCNN [50], SparseD [54], and MATCN [48];

Nine cutting-edge domain adaptation models for emotion recognition in EEG: ATDD-LSTM [51], MEERNet [30], MS-MDA [55], AD-TCNs [56], HVF2N-DBR [57], MMDA-VAE [58], MSDA-SFE [59], TMLP+SRDANN [60], and TSFIN [61].
While SVM and kNN can only process EEG signal channels one at a time, other deep networks are capable of
handling multi-channel EEG signals. These methods are all characteristic approaches in prior research studies on
emotion recognition from EEG signals. To ensure a persuasive comparison with the proposed approach, we directly
quoted the outcomes from the relevant literature. Additionally, we have conducted ablation studies to investigate the
roles of each part in CADD-DCCNN and their influence on the whole method.
4.4. Experiment on Two Publicly Available Datasets
4.4.1. Experiment on SEED Dataset
SD Experiment. Unlike the DEAP dataset, each subject in SEED participated in three sessions at different times. We utilized this characteristic to design an SD cross-session experiment that forecasts the emotions of the same subject at different times. The results illustrate the robustness of different models over time, making this experimental setting more useful for practical applications. However, few research works have conducted cross-session experiments until now. In the cross-session scenario, we used leave-one-session-out cross-validation. In particular, two out of the three sessions of each subject were utilized as training data, whereas the remaining session was employed for testing. The classification accuracy for each subject was computed as the mean accuracy across the three folds. Finally, the mean classification accuracy (ACC) and standard deviation (STD) across the 15 subjects were calculated as the final result of the cross-session experiment.
SI Experiment. We used leave-one-subject-out cross-validation (LOSO-CV), where 14 subjects were utilized as training data and the remaining subject was employed for testing. We computed the ACC and STD across the 15 subjects as the final results for the cross-subject experiment.
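The LOSO-CV protocol can be written as the following loop, where `train_and_eval` is a placeholder for training the model on the 14 remaining subjects and returning its test accuracy on the held-out subject; the data layout is an assumption made for illustration.

```python
import numpy as np

def loso_accuracy(features_by_subject, labels_by_subject, train_and_eval):
    """Leave-one-subject-out CV: train on all but one subject, test on it.

    features_by_subject / labels_by_subject: dicts keyed by subject id.
    train_and_eval(train_x, train_y, test_x, test_y) -> accuracy in [0, 1].
    Returns the mean accuracy and standard deviation across subjects.
    """
    accs = []
    for test_subj in features_by_subject:
        train_x = np.concatenate([x for s, x in features_by_subject.items()
                                  if s != test_subj])
        train_y = np.concatenate([y for s, y in labels_by_subject.items()
                                  if s != test_subj])
        accs.append(train_and_eval(train_x, train_y,
                                   features_by_subject[test_subj],
                                   labels_by_subject[test_subj]))
    return float(np.mean(accs)), float(np.std(accs))
```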
4.4.2. Experiment on DEAP Dataset
To perform a binary classification task, we categorized the emotions in the DEAP dataset as high/low arousal and
valence, using the same partition scheme and threshold as described in [18].
SD Experiment. We applied leave-one-clip-out cross-validation, where 39 out of the 40 trials of one subject were utilized as training data and the remaining trial was employed for testing. The ACC and STD over the 40 trials of each of the 32 subjects were computed as the final result of the SD experiment. In the DEAP dataset, each trial of each subject contains only one data point; in this experiment, using one trial for testing would result in an insufficient amount of data. Therefore, we applied sliding windows with a size of 9 seconds and no overlap to split the data into multiple segments, which divided each trial into 7 data points. Thus, the training data was made up of 273 (39×7) data points, and the testing data contained 7 (1×7) data points for one subject.
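The segmentation can be sketched as below, assuming the DE features are stored as one frame per second; with 63 one-second frames per DEAP recording, 9-second non-overlapping windows yield the 7 data points mentioned above (this frame count is our assumption).

```python
import numpy as np

def split_into_segments(trial, win=9):
    """Split a trial of 1-second DE frames into non-overlapping windows.

    trial: array of shape (n_frames, n_channels, n_bands)
    returns: array of shape (n_frames // win, win, n_channels, n_bands)
    """
    n_segments = trial.shape[0] // win
    return trial[:n_segments * win].reshape(n_segments, win, *trial.shape[1:])

trial = np.random.randn(63, 32, 5)       # 63 one-second frames, 32 channels
segments = split_into_segments(trial)    # -> (7, 9, 32, 5)
```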
SI Experiment. Similar to our approach on SEED, we applied LOSO-CV in this experiment. This means that 31 out of the 32 subjects were utilized as training data, and the remaining subject was employed for testing. We computed the ACC and STD across the 32 subjects as the final result of the SI experiment.
4.4.3. Results and Analysis
The results of our experiments are summarized in Tables 1 to 4. Bold indicates the highest average accuracy. Based
on the experimental results, we made three key observations:
Table 1: Mean accuracies (%) and STD for SD achieved by different methods on SEED dataset.
Model ACC/STD
kNN [53] 72.00/12.60
DGCNN [50] 73.06/10.36
SVM [52] 81.19/14.79
MEERNet[30] 86.20/05.80
CADD-DCCNN 87.41/01.80
Table 2: Mean accuracies (%) and STD for SI achieved by different methods on SEED dataset.
Model ACC/STD
SVM [52] 56.73/16.29
kNN [53] 73.93/09.95
DGCNN [50] 79.95/09.02
TMLP+SRDANN [60] 81.04/06.28
MMDA-VAE [58] 85.07/11.81
MEERNet [30] 87.10/02.00
HVF2N-DBR [57] 89.33/10.13
MS-MDA [55] 89.63/06.79
MSDA-SFE [59] 91.65/02.91
CADD-DCCNN 92.44/06.16
Table 3: Mean accuracies (%) and STD for SD achieved by different methods on DEAP dataset.
Model Valence (ACC/STD) Arousal (ACC/STD)
kNN [53] 50.11/21.92 57.76/23.14
SVM [52] 53.76/19.56 55.67/20.91
DGCNN [50] 86.06/02.61 85.61/02.44
ATDD-LSTM [51] 90.91/12.95 90.87/11.32
SparseD [54] 95.72/09.52 91.75/05.23
CADD-DCCNN 90.97/13.96 92.42/12.72
(1) Our proposed CADD-DCCNN method outperformed all comparable methods on both the SEED and DEAP datasets. Specifically, compared to non-domain-adaptation methods such as SVM, kNN, and DGCNN, the average accuracy improvement of CADD-DCCNN was approximately 20.97%, 16.96%, and 13.42% on the SEED dataset. On the DEAP dataset, the average accuracy increased by around 25.52%, 27.97%, 8.54%, 2.03%, and 3.35% for valence and 27.30%, 25.39%, 8.11%, 2.89%, and 2.7% for arousal when compared to SVM, kNN, DGCNN, SparseD, and MATCN, respectively. These results validate the effectiveness of our learned transferable data representation for EEG-based emotion recognition and demonstrate the practicality of the domain adaptation method in cross-subject EEG-based emotion recognition.

(2) Our proposed CADD-DCCNN method outperformed existing domain adaptation methods. Compared to models trained using domain adaptation learning strategies, such as ATDD-LSTM, MEERNet, MS-MDA, AD-TCNs, HVF2N-DBR, MMDA-VAE, MSDA-SFE, TMLP+SRDANN, and TSFIN, the CADD-DCCNN achieved an average accuracy improvement of 5.14%, along with improvements of 4.00% for valence and 3.98% for arousal, in SI scenarios on the SEED and DEAP datasets, respectively. These results indicate that our proposed multi-view cross-attention mechanism can effectively explore the complementary and consistent information among channels. Furthermore, the use of attention mechanisms within each view enables effective learning of more discriminative temporal information.

Table 4: Mean accuracies (%) and STD for SI achieved by different methods on DEAP dataset.
Model Valence (ACC/STD) Arousal (ACC/STD)
kNN [53] 54.38/09.27 54.38/11.93
SVM [52] 55.62/09.44 52.66/14.47
TMLP+SRDANN [60] 57.70/07.23 61.88/05.55
DGCNN [50] 59.29/06.83 61.10/12.28
SparseD [54] 60.65/06.24 65.39/09.41
AD-TCNs [56] 64.33/07.06 63.25/04.62
MATCN [48] 66.10/06.10 67.80/08.10
TSFIN [61] 67.03/- 68.13/-
HVF2N-DBR [57] 68.91/- 69.22/-
MSDA-SFE [59] 69.26/- 70.10/-
CADD-DCCNN 69.45/05.60 70.50/09.39
(3) On the DEAP dataset, we observed that the model performance was much lower in SI scenarios than in SD scenarios, which demonstrates that individual differences have a negative impact on multi-view EEG-based emotion recognition. In cross-session scenarios on the SEED dataset, we found that differences between different testing times for the same subject are even harder to eliminate than individual differences between different subjects, resulting in lower accuracy in SD scenarios than in SI scenarios. However, despite these challenges, our method still outperformed the compared methods. Therefore, overall, our proposed CADD-DCCNN method has an advantage in minimizing both intra-individual and inter-individual differences.
Table 5: Mean accuracies (%) and STD for SI achieved in an ablation study on SEED dataset.
Model ACC/STD
DANN 88.15/07.99
DCCNN-DANN 89.48/07.34
TFA-DANN 88.44/06.89
MvCA-DANN 90.37/07.11
CADD-DCCNN 92.44/06.16
4.5. Ablation Study and Visualization
4.5.1. Ablation Study
We conducted an ablation study on the SEED and DEAP datasets to evaluate the contribution of each part of our CADD-DCCNN approach. CADD-DCCNN was constructed based on DANN, with three supplementary modules: a time-frame attention mechanism, a multi-view cross-attention mechanism, and a DCCNN. In particular, we compared our full CADD-DCCNN method with four alternative methods:
Table 6: Mean accuracies (%) and STD for SI achieved in an ablation study on DEAP dataset.
Model Valence (ACC/STD) Arousal (ACC/STD)
DANN 66.95/07.48 68.67/10.36
DCCNN-DANN 67.19/07.29 69.14/09.79
TFA-DANN 67.50/05.86 68.79/10.20
MvCA-DANN 68.91/06.38 69.45/09.43
CADD-DCCNN 69.45/05.60 70.50/09.39
DANN: the plain DANN model without the supplementary modules;
DCCNN-DANN: the DANN model with only the DCCNN;
TFA-DANN: the DANN model with only the time-frame attention mechanism;
MvCA-DANN: the DANN model with only the multi-view cross-attention mechanism.
The outcomes of our ablation study are presented in Tables 5 and 6. Our model achieves, by far, the best recognition performance. Furthermore, DANN's recognition accuracy is 2.87% lower than CADD-DCCNN's, showing the effectiveness of the DCCNN and the multi-view cross-attention mechanism. The accuracy of DCCNN-DANN is 0.68% higher than that of DANN, demonstrating the positive role of the DCCNN in extracting contextual relationships in multi-channel EEG signals. The accuracy of TFA-DANN is 0.32% higher than that of DANN, validating that the same channel exhibits different emotional responses across different time frames. By employing an attention mechanism to assign different weights to different time frames, more influential emotion-related features can be extracted. The accuracy of MvCA-DANN is 1.86% higher than that of TFA-DANN, indicating that the information between multiple channels is complementary. Learning the complementary information among multiple channels can effectively improve the performance of the model, further validating the effectiveness of multi-view learning. Drawing on these findings, we believe that the proposed CADD-DCCNN method is beneficial for improving EEG-based emotion recognition performance.
4.5.2. Impact of Kernel Size in DCCNN
Fig. 4 shows the relationship between different convolutional kernel configurations and the performance of CADD-DCCNN. The DCCNN increases the receptive field by introducing dilations in the convolutional kernel. Therefore, using different kernel sizes can capture features at different scales. Larger kernels can capture global contextual information, while smaller kernels can extract local detailed information. As shown in the figure, using the same kernel size in all three layers cannot simultaneously capture global and local information. By using different kernel sizes and combining features at different levels, the model can better capture multi-scale sequential patterns, enhancing its modeling capability. Additionally, placing larger kernels at earlier layers allows the model to capture global contextual information first when processing input sequences, helping it to better understand the overall sequence structure. Furthermore, gradually decreasing kernel sizes reduces the total number of model parameters. Therefore, we chose the combination c5.
Figure 4: Line chart of the average accuracy of CADD-DCCNN under different combinations of kernel sizes in the DCCNN. In the temporal dimension, c1 uses kernel sizes (3, 3, 3), c2 uses (5, 5, 5), c3 uses (8, 8, 8), c4 uses (3, 5, 8), and c5 uses (8, 5, 3); c5 is the combination used in our model.

4.5.3. Validation of Model Components
To further comprehend and demonstrate the impacts of multi-view learning, multi-view cross-attention (MvCA), DCCNN, and the domain discriminator in EEG-based emotion recognition, we utilized two visualization techniques
and a set of comparative experiments. Specifically, we employed EEG topographic maps to visualize the spatial distribution of electrical activity in different brain regions under various conditions, which enabled us to observe how different patterns of brain activity were associated with different emotions in the presence of multi-view cross-attention. We conducted comparative experiments by combining the DCCNN and a standard CNN with the baseline model DANN to validate the effectiveness of the DCCNN. Additionally, we used 2-dimensional t-SNE plots to analyze the shift in feature distribution between training and testing data, as well as to evaluate the performance of the domain discriminator.
As shown in Fig. 5, to verify the usability of multi-view learning, we display the activation degrees of all channels at the same moment under different emotions. In negative and neutral emotions, the activation levels are relatively higher in the blue regions, while in positive emotions, the activation levels are relatively lower. In negative emotions, the activation levels are relatively lower in the green regions, while in positive and neutral emotions, the opposite is observed. In neutral emotions, the activation levels are relatively higher in the purple regions, while in positive and negative emotions, the opposite is observed. This indicates that the activation degrees of different channels vary significantly under different emotions. Therefore, we can consider each channel as a view and explore the complementarity between multiple views to learn emotion-related features more comprehensively, further enhancing the model's generalization ability.
Figure 5: Brain topography maps generated from raw EEG signals under different emotional states: (a) negative, (b) neutral, and (c) positive emotion, each shown for the (i) δ, (ii) θ, (iii) α, (iv) β, and (v) γ bands.
To analyze the effectiveness of the MvCA, we plotted heatmaps using the multi-view time-frame attention (MvTFA) and EEG topographic maps after applying the MvCA. We selected key channels based on the EEG topographic map and conducted experiments after removing those key channels. First, we plotted heatmaps of the same experiment for different subjects and of different trials for the same subject after applying MvTFA, as shown in Fig. 6: (a) and (b) represent the same trial for different subjects, while (b) and (c) represent different trials for the same subject. The higher the activation level, the lighter the color. From the figure, it can be observed that during the time intervals of 30 s to 45 s, 81 s to 96 s, and 189 s to 207 s, (a) and (b) exhibit relatively higher activation levels, while (c) shows relatively lower activation levels. However, during the time interval of 0 s to 21 s, the activation levels are reversed, indicating that the activation levels in the temporal dimension are similar for the same trial, while they differ for different trials. Therefore, we can extract key time frames using MvTFA.
Figure 6: (a) The 5th trial from the 1st session of the 2nd subject, (b) The 5th trial from the 1st session of the 4th subject, (c) The 2nd trial from the
2nd session of the 4th subject.
Next, we plotted topographic maps of EEG signals after applying the MvCA (as shown in Fig. 7). Fig. 7 depicts a time frame of EEG data from the 7th subject in the SEED dataset. Panels (a), (b), and (c), respectively, show the activation of brain regions in the five frequency bands under negative, neutral, and positive emotions. Darker colors indicate higher activation levels and greater attention weights. From Fig. 7, we can observe that during negative emotion, the pre-frontal and frontal lobes exhibit high activation levels, and the right brain region in the β and γ bands also shows high activity. During neutral emotion, the pre-frontal cortex is more active. During positive emotion, the left brain region exhibits high activity. These findings are consistent with those of neuroscience studies [62, 63].
Figure 7: EEG topographic maps under various emotions on SEED: (a) negative, (b) neutral, and (c) positive emotion, each shown for the (i) δ, (ii) θ, (iii) α, (iv) β, and (v) γ bands.
Based on Fig. 7, we selected the top five channels with the highest activation levels, namely FP2, F7, F5, T7, and P7. We conducted experiments after removing these five channels; the number of channels and the results are shown in Table 7. After removing the key channels, the accuracy decreased by 13.92%, demonstrating that the MvCA can indeed extract key channels and thus validating its effectiveness.
Table 7: Channels of EEG signals, mean accuracies (%), and STD for SI achieved on SEED dataset.
Parameters All-Channels No-Key-Channels
channels 62 57
ACC/STD 92.44/06.16 78.52/09.67
We conducted a series of comparative experiments to assess the effectiveness of the DCCNN. In our study, we integrated the DCCNN and a standard CNN into the baseline DANN model for EEG-based emotion recognition. The parameters and experimental results are shown in Table 8. The accuracy of CNN-DANN is 88.30%, which is 0.15% higher than that of the DANN model but 1.18% lower than that of DCCNN-DANN. These findings suggest that the CNN model is proficient at extracting emotion-related features. However, when employing multi-layer convolution operations in a CNN, the feature map size tends to decrease and the receptive field remains limited. In contrast, the DCCNN utilizes dilated convolution, which not only preserves the feature map size but also expands the receptive field through dilation factors. Consequently, the DCCNN effectively captures temporal features at various scales, enabling the extraction of both local and global temporal information.
Table 8: Parameters of DCCNN-DANN and CNN-DANN, mean accuracies (%), and STD for SI achieved on SEED dataset.
Parameters DCCNN-DANN CNN-DANN
convolution layers 3 3
kernel size (1, 8), (1, 5), (1, 3) (1, 8), (1, 5), (1, 3)
dilation factors (1, 8), (1, 5) none
output tensors 72, 48, 24 72, 48, 24
ACC/STD 89.48/07.34 88.30/07.00
Finally, to show the effectiveness of the domain discriminator, we display the training and testing feature distributions in the same 2D space before and after passing through the domain discriminator (see Fig. 8). Each colored shape stands for a subject, i.e., a domain. Fig. 8 displays the feature distribution of each subject in SEED and DEAP before and after passing through the domain discriminator. As depicted in (a) and (c), the distribution of EEG data among the various subjects (represented by distinct colors) is similar, with the majority of trials clustering together and only a small number of outliers occurring for certain subjects. However, after passing through the domain discriminator, the feature distribution of each subject becomes more uniform. Therefore, the inclusion of the domain discriminator in our model allows us to ensure data representation invariance while minimizing significant feature distribution shifts between different subjects.
Fig. 9 presents the outcomes of our experiments on SEED. Specifically, the figure includes a confusion matrix (Fig. 9(a)) displaying percentages with row normalization, where the horizontal axis represents the true label and the vertical axis represents the predicted label. The element (i, j) represents the percentage of samples in class i that were classified as class j, with the blocks on the diagonal indicating the probability of correct prediction. Additionally, the figure includes a feature map (Fig. 9(b)) in which red points represent true labels for negative emotion, green points represent true labels for neutral emotion, and blue points correspond to true labels for positive emotion. From the results presented in Fig. 9, it is clear that our model achieves high average accuracies for the recognition of all emotions in general. However, the recognition accuracy for negative emotions is relatively poor when compared to that for neutral and positive emotions. In the experiments, negative emotions were more prone to be misclassified as positive emotions. One possible reason is that the EEG signals activated during negative and positive emotions bear similarities: subjects may exhibit strong responses to both types of emotions. Additionally, the dataset may have a relatively smaller number of samples for negative emotions, leading to misclassification of negative emotions as positive emotions.
5. Discussion
In this paper, we proposed the CADD-DCCNN method for emotion recognition from EEG signals, achieving SOTA
outcomes on the popular SEED and DEAP datasets. The success of our method can be largely attributed to its utilization
of multi-view learning combined with attention mechanisms to eectively select emotion-related channels and time
frames. Additionally, the model leverages the ability of dilated causal convolutional neural networks to extract temporal
19
40 20 0 20 40
20
10
0
10
20
source
target
(a) SEED-before-DD
10 0 10 20
20
10
0
10
20
source
target
(b) SEED-after-DD
30 20 10 0 10 20 30 40
40
20
0
20
40
source
target
(c) DEAP-before-DD
20 10 0 10 20
30
20
10
0
10
20
source
target
(d) DEAP-after-DD
Figure 8: T-SNE-based visualization of feature embedding on every subject of SEED and DEAP. (a) displays the feature space on the SEED of using
CADD-DCCNN before the domain discriminator (DD). (b) displays the feature space on the SEED of using CADD-DCCNN after the domain
discriminator.
20
negative
neutral
positive
Predicted label
negative
neutral
positive
True label
88.00% 5.33% 6.67%
2.67% 92.89% 4.44%
3.11% 4.89% 92.00%
confusion_matrix_svc
0.2
0.4
0.6
0.8
(a) The confusion matrix
20 10 0 10 20
30
20
10
0
10
20
negative
neutral
positive
(b) The feature maps
Figure 9: The confusion matrix and feature maps of the SI EEG emotion recognition results using our CADD-DCCNN on the SEED dataset.
Moreover, feature-level fusion is performed on the features from multiple views, enabling the exploration of complementary information between different views. Furthermore, a domain discriminator is incorporated to ensure uniform feature distribution coverage and an invariant data representation, thus mitigating the data distribution shift in cross-subject scenarios and improving model generalization. Finally, we performed an ablation study to evaluate the individual contributions of each component, and the experimental outcomes confirmed their validity. Future work will need to verify the method's effectiveness in real-world ‘in-the-wild’ settings.
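As a toy illustration of the feature-level fusion mentioned above, one common realization is to concatenate the per-view feature vectors before the label classifier; the view count and feature dimension here are assumptions for illustration only, not the paper's exact configuration.

import torch

num_views, batch, feat_dim = 62, 8, 32          # e.g., one view per EEG channel
view_features = [torch.randn(batch, feat_dim) for _ in range(num_views)]
fused = torch.cat(view_features, dim=1)          # shape: (batch, num_views * feat_dim)
print(fused.shape)                               # torch.Size([8, 1984])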
CRediT authorship contribution statement
Chao Li: Conceptualization, Data curation, Writing – review & editing, Supervision, Funding acquisition. Ning Bian: Methodology, Writing – original draft, Software, Visualization, Funding acquisition. Ziping Zhao: Writing – review & editing, Supervision, Validation, Funding acquisition. Haishuai Wang: Validation, Supervision, Funding acquisition. Björn W. Schuller: Writing – review & editing, Supervision, Funding acquisition.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.
Acknowledgements
This work was substantially supported by the National Natural Science Foundation of China (Grant Nos: 62071330,
61702370, 61902282), the National Science Fund for Distinguished Young Scholars (Grant No: 61425017), the
Key Program of the National Natural Science Foundation of China (Grant No: 61831022), the Key Program of the
Natural Science Foundation of Tianjin (Grant No: 18JCZDJC36300), the Technology Plan of Tianjin (Grant No:
18ZXRHSY00100), and the Tianjin Postgraduate Scientific Research Innovation Project (Grant No: 2022SKYZ267).