Deep learning

Yann LeCun1,2, Yoshua Bengio3 & Geoffrey Hinton4,5

Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
1Facebook AI Research, 770 Broadway, New York, New York 10003, USA. 2New York University, 715 Broadway, New York, New York 10003, USA. 3Department of Computer Science and Operations Research, Université de Montréal, Pavillon André-Aisenstadt, PO Box 6128 Centre-Ville STN, Montréal, Quebec H3C 3J7, Canada. 4Google, 1600 Amphitheatre Parkway, Mountain View, California 94043, USA. 5Department of Computer Science, University of Toronto, 6 King's College Road, Toronto, Ontario M5S 3G4, Canada.
Machine-learning technology powers many aspects of modern
society: from web searches to content filtering on social net-
works to recommendations on e-commerce websites, and
it is increasingly present in consumer products such as cameras and
smartphones. Machine-learning systems are used to identify objects
in images, transcribe speech into text, match news items, posts or
products with users’ interests, and select relevant search results.
Increasingly, these applications make use of a class of techniques called
deep learning.
Conventional machine-learning techniques were limited in their
ability to process natural data in their raw form. For decades, con-
structing a pattern-recognition or machine-learning system required
careful engineering and considerable domain expertise to design a fea-
ture extractor that transformed the raw data (such as the pixel values
of an image) into a suitable internal representation or feature vector
from which the learning subsystem, often a classifier, could detect or
classify patterns in the input.
Representation learning is a set of methods that allows a machine to
be fed with raw data and to automatically discover the representations
needed for detection or classification. Deep-learning methods are
representation-learning methods with multiple levels of representa-
tion, obtained by composing simple but non-linear modules that each
transform the representation at one level (starting with the raw input)
into a representation at a higher, slightly more abstract level. With the
composition of enough such transformations, very complex functions
can be learned. For classification tasks, higher layers of representation
amplify aspects of the input that are important for discrimination and
suppress irrelevant variations. An image, for example, comes in the
form of an array of pixel values, and the learned features in the first
layer of representation typically represent the presence or absence of
edges at particular orientations and locations in the image. The second
layer typically detects motifs by spotting particular arrangements of
edges, regardless of small variations in the edge positions. The third
layer may assemble motifs into larger combinations that correspond
to parts of familiar objects, and subsequent layers would detect objects
as combinations of these parts. The key aspect of deep learning is that
these layers of features are not designed by human engineers: they
are learned from data using a general-purpose learning procedure.
Deep learning is making major advances in solving problems that
have resisted the best attempts of the artificial intelligence commu-
nity for many years. It has turned out to be very good at discovering
intricate structures in high-dimensional data and is therefore applica-
ble to many domains of science, business and government. In addition
to beating records in image recognition1–4 and speech recognition5–7, it
has beaten other machine-learning techniques at predicting the activity
of potential drug molecules8, analysing particle accelerator data9,10,
reconstructing brain circuits11, and predicting the effects of mutations
in non-coding DNA on gene expression and disease12,13. Perhaps more
surprisingly, deep learning has produced extremely promising results
for various tasks in natural language understanding14, particularly
topic classification, sentiment analysis, question answering15 and
language translation16,17.
We think that deep learning will have many more successes in the
near future because it requires very little engineering by hand, so it
can easily take advantage of increases in the amount of available com-
putation and data. New learning algorithms and architectures that are
currently being developed for deep neural networks will only acceler-
ate this progress.
Supervised learning
The most common form of machine learning, deep or not, is super-
vised learning. Imagine that we want to build a system that can classify
images as containing, say, a house, a car, a person or a pet. We first
collect a large data set of images of houses, cars, people and pets, each
labelled with its category. During training, the machine is shown an
image and produces an output in the form of a vector of scores, one
for each category. We want the desired category to have the highest
score of all categories, but this is unlikely to happen before training.
We compute an objective function that measures the error (or dis-
tance) between the output scores and the desired pattern of scores. The
machine then modifies its internal adjustable parameters to reduce
this error. These adjustable parameters, often called weights, are real
numbers that can be seen as ‘knobs’ that define the input–output func-
tion of the machine. In a typical deep-learning system, there may be
hundreds of millions of these adjustable weights, and hundreds of
millions of labelled examples with which to train the machine.
To properly adjust the weight vector, the learning algorithm com-
putes a gradient vector that, for each weight, indicates by what amount
the error would increase or decrease if the weight were increased by a
tiny amount. The weight vector is then adjusted in the opposite direc-
tion to the gradient vector.
The objective function, averaged over all the training examples, can
be seen as a kind of hilly landscape in the high-dimensional space of
weight values. The negative gradient vector indicates the direction
of steepest descent in this landscape, taking it closer to a minimum,
where the output error is low on average.
In practice, most practitioners use a procedure called stochastic
gradient descent (SGD). This consists of showing the input vector
for a few examples, computing the outputs and the errors, computing
the average gradient for those examples, and adjusting the weights
accordingly. The process is repeated for many small sets of examples
from the training set until the average of the objective function stops
decreasing. It is called stochastic because each small set of examples
gives a noisy estimate of the average gradient over all examples. This
simple procedure usually finds a good set of weights surprisingly
quickly when compared with far more elaborate optimization
techniques18. After training, the performance of the system is measured
on a different set of examples called a test set. This serves to test the
generalization ability of the machine — its ability to produce sensible
answers on new inputs that it has never seen during training.
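As a minimal sketch of this procedure (assuming, purely for illustration, a linear scoring model, a squared-error objective and made-up data; none of these specifics come from the text), the loop below shows a small batch of examples, the average gradient over the batch and the weight update in the opposite direction:

```python
import numpy as np

def sgd_train(X, T, epochs=20, batch_size=32, lr=0.1, seed=0):
    """Minimal stochastic gradient descent for a linear scoring model.

    X: (n_examples, n_features) inputs; T: (n_examples, n_classes) desired scores.
    Each step uses the average gradient over a small batch of examples and moves
    the weights a little in the opposite direction.
    """
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((X.shape[1], T.shape[1]))  # adjustable 'knobs'
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ W - T[idx]          # output scores minus desired scores
            grad = X[idx].T @ err / len(idx)   # average gradient of 0.5*err^2 w.r.t. W
            W -= lr * grad                     # step opposite to the gradient
    return W

# Toy usage: targets come from a hidden linear rule, so the error should shrink.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
T = np.eye(3)[(X @ rng.standard_normal((5, 3))).argmax(axis=1)]  # one-hot targets
W = sgd_train(X, T)
print("training accuracy:", ((X @ W).argmax(axis=1) == T.argmax(axis=1)).mean())
```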
Many of the current practical applications of machine learning use
linear classifiers on top of hand-engineered features. A two-class linear
classifier computes a weighted sum of the feature vector components.
If the weighted sum is above a threshold, the input is classified as
belonging to a particular category.
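A toy illustration of such a two-class classifier, with hand-picked feature values, weights and threshold that are invented purely for the example, might look like this:

```python
import numpy as np

def linear_classify(features, weights, threshold):
    """Two-class linear classifier: weighted sum of feature values versus a threshold."""
    score = np.dot(weights, features)
    return "category A" if score > threshold else "category B"

# Illustrative hand-chosen features, weights and threshold (not from any real system).
features = np.array([0.8, 0.1, 0.5])
weights = np.array([1.5, -2.0, 0.7])
print(linear_classify(features, weights, threshold=0.5))
```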
Since the 1960s we have known that linear classifiers can only carve
their input space into very simple regions, namely half-spaces
separated by a hyperplane19. But problems such as image and speech recog-
nition require the input–output function to be insensitive to irrelevant
variations of the input, such as variations in position, orientation or
illumination of an object, or variations in the pitch or accent of speech,
while being very sensitive to particular minute variations (for example,
the difference between a white wolf and a breed of wolf-like white
dog called a Samoyed). At the pixel level, images of two Samoyeds in
different poses and in different environments may be very different
from each other, whereas two images of a Samoyed and a wolf in the
same position and on similar backgrounds may be very similar to each
other. A linear classifier, or any other ‘shallow’ classifier operating on
Figure 1 | Multilayer neural networks and backpropagation. a, A multi-
layer neural network (shown by the connected dots) can distort the input
space to make the classes of data (examples of which are on the red and
blue lines) linearly separable. Note how a regular grid (shown on the left)
in input space is also transformed (shown in the middle panel) by hidden
units. This is an illustrative example with only two input units, two hidden
units and one output unit, but the networks used for object recognition
or natural language processing contain tens or hundreds of thousands of
units. Reproduced with permission from C. Olah (http://colah.github.io/).
b, The chain rule of derivatives tells us how two small effects (that of a small
change of x on y, and that of y on z) are composed. A small change Δx in
x gets transformed first into a small change Δy in y by getting multiplied
by ∂y/∂x (that is, the definition of partial derivative). Similarly, the change
Δy creates a change Δz in z. Substituting one equation into the other
gives the chain rule of derivatives — how Δx gets turned into Δz through
multiplication by the product of ∂y/∂x and ∂z/∂y. It also works when x,
y and z are vectors (and the derivatives are Jacobian matrices). c, The
equations used for computing the forward pass in a neural net with two
hidden layers and one output layer, each constituting a module through
which one can backpropagate gradients. At each layer, we first compute
the total input z to each unit, which is a weighted sum of the outputs of
the units in the layer below. Then a non-linear function f(.) is applied to
z to get the output of the unit. For simplicity, we have omitted bias terms.
The non-linear functions used in neural networks include the rectified
linear unit (ReLU) f(z) = max(0, z), commonly used in recent years, as
well as the more conventional sigmoids, such as the hyperbolic tangent,
f(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z)), and the logistic function,
f(z) = 1/(1 + exp(−z)). d, The equations used for computing the backward
pass. At each hidden layer we compute the error derivative with respect to
the output of each unit, which is a weighted sum of the error derivatives
with respect to the total inputs to the units in the layer above. We then
convert the error derivative with respect to the output into the error
derivative with respect to the input by multiplying it by the gradient of f(z).
At the output layer, the error derivative with respect to the output of a unit
is computed by differentiating the cost function. This gives yl − tl if the cost
function for unit l is 0.5(yl − tl)2, where tl is the target value. Once ∂E/∂zk
is known, the error derivative for the weight wjk on the connection from
unit j in the layer below is just yj ∂E/∂zk.
[Figure 1 panels (illustration): a, a small network with two input units, two sigmoid hidden
units and one sigmoid output unit distorting the input space; b, the chain rule,
Δz = (∂z/∂y)(∂y/∂x)Δx; c, forward pass through input units i, hidden units j (H1), hidden
units k (H2) and output units l: zj = Σi wij xi, yj = f(zj); zk = Σj wjk yj, yk = f(zk);
zl = Σk wkl yk, yl = f(zl); d, backward pass, comparing outputs with the correct answer to
get error derivatives: ∂E/∂yl = yl − tl, ∂E/∂zl = (∂E/∂yl)(∂yl/∂zl); ∂E/∂yk = Σl wkl ∂E/∂zl,
∂E/∂zk = (∂E/∂yk)(∂yk/∂zk); ∂E/∂yj = Σk wjk ∂E/∂zk, ∂E/∂zj = (∂E/∂yj)(∂yj/∂zj).]
raw pixels could not possibly distinguish the latter two, while putting
the former two in the same category. This is why shallow classifiers
require a good feature extractor that solves the selectivity–invariance
dilemma — one that produces representations that are selective to
the aspects of the image that are important for discrimination, but
that are invariant to irrelevant aspects such as the pose of the animal.
To make classifiers more powerful, one can use generic non-linear
features, as with kernel methods
20
, but generic features such as those
arising with the Gaussian kernel do not allow the learner to general-
ize well far from the training examples
21
. The conventional option is
to hand design good feature extractors, which requires a consider-
able amount of engineering skill and domain expertise. But this can
all be avoided if good features can be learned automatically using a
general-purpose learning procedure. This is the key advantage of
deep learning.
A deep-learning architecture is a multilayer stack of simple mod-
ules, all (or most) of which are subject to learning, and many of which
compute non-linear input–output mappings. Each module in the
stack transforms its input to increase both the selectivity and the
invariance of the representation. With multiple non-linear layers, say
a depth of 5 to 20, a system can implement extremely intricate func-
tions of its inputs that are simultaneously sensitive to minute details
— distinguishing Samoyeds from white wolves — and insensitive to
large irrelevant variations such as the background, pose, lighting and
surrounding objects.
Backpropagation to train multilayer architectures
From the earliest days of pattern recognition22,23, the aim of research-
ers has been to replace hand-engineered features with trainable
multilayer networks, but despite its simplicity, the solution was not
widely understood until the mid 1980s. As it turns out, multilayer
architectures can be trained by simple stochastic gradient descent.
As long as the modules are relatively smooth functions of their inputs
and of their internal weights, one can compute gradients using the
backpropagation procedure. The idea that this could be done, and
that it worked, was discovered independently by several different
groups during the 1970s and 1980s24–27.
The backpropagation procedure to compute the gradient of an
objective function with respect to the weights of a multilayer stack
of modules is nothing more than a practical application of the chain
rule for derivatives. The key insight is that the derivative (or gradi-
ent) of the objective with respect to the input of a module can be
computed by working backwards from the gradient with respect to
the output of that module (or the input of the subsequent module)
(Fig.1). The backpropagation equation can be applied repeatedly to
propagate gradients through all modules, starting from the output
at the top (where the network produces its prediction) all the way to
the bottom (where the external input is fed). Once these gradients
have been computed, it is straightforward to compute the gradients
with respect to the weights of each module.
Many applications of deep learning use feedforward neural net-
work architectures (Fig. 1), which learn to map a fixed-size input
(for example, an image) to a fixed-size output (for example, a prob-
ability for each of several categories). To go from one layer to the
next, a set of units compute a weighted sum of their inputs from the
previous layer and pass the result through a non-linear function. At
present, the most popular non-linear function is the rectified linear
unit (ReLU), which is simply the half-wave rectifier f(z) = max(z, 0).
In past decades, neural nets used smoother non-linearities, such as
tanh(z) or 1/(1 + exp(−z)), but the ReLU typically learns much faster
in networks with many layers, allowing training of a deep supervised
network without unsupervised pre-training28. Units that are not in
the input or output layer are conventionally called hidden units. The
hidden layers can be seen as distorting the input in a non-linear way
so that categories become linearly separable by the last layer (Fig. 1).
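The following sketch mirrors the forward and backward passes of Figure 1c, d for a small network with two hidden layers of ReLUs; the layer sizes, the random weights and the squared-error cost with linear output units are illustrative assumptions rather than details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# Illustrative layer sizes (input -> H1 -> H2 -> output); biases omitted as in Fig. 1.
W1 = 0.1 * rng.standard_normal((4, 8))
W2 = 0.1 * rng.standard_normal((8, 8))
W3 = 0.1 * rng.standard_normal((8, 3))

x = rng.standard_normal(4)         # one made-up input example
t = np.array([1.0, 0.0, 0.0])      # desired output scores

# Forward pass: at each layer, z is a weighted sum of the layer below and y = f(z).
z1 = x @ W1
y1 = relu(z1)
z2 = y1 @ W2
y2 = relu(z2)
z3 = y2 @ W3
y3 = z3                             # linear output units with a squared-error cost
E = 0.5 * np.sum((y3 - t) ** 2)

# Backward pass: propagate error derivatives from the output back towards the input.
dE_dz3 = y3 - t                     # dE/dy_l = y_l - t_l; output units are linear
dE_dy2 = W3 @ dE_dz3                # weighted sum of derivatives from the layer above
dE_dz2 = dE_dy2 * (z2 > 0)          # multiply by the gradient of the ReLU
dE_dy1 = W2 @ dE_dz2
dE_dz1 = dE_dy1 * (z1 > 0)

# Weight gradients: dE/dw_jk is y_j * dE/dz_k (outer products here).
dE_dW3 = np.outer(y2, dE_dz3)
dE_dW2 = np.outer(y1, dE_dz2)
dE_dW1 = np.outer(x, dE_dz1)
print("cost:", E)
```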
In the late 1990s, neural nets and backpropagation were largely
forsaken by the machine-learning community and ignored by the
computer-vision and speech-recognition communities. It was widely
thought that learning useful, multistage, feature extractors with lit-
tle prior knowledge was infeasible. In particular, it was commonly
thought that simple gradient descent would get trapped in poor local
minima — weight configurations for which no small change would
reduce the average error.
In practice, poor local minima are rarely a problem with large net-
works. Regardless of the initial conditions, the system nearly always
reaches solutions of very similar quality. Recent theoretical and
empirical results strongly suggest that local minima are not a serious
issue in general. Instead, the landscape is packed with a combinato-
rially large number of saddle points where the gradient is zero, and
the surface curves up in most dimensions and curves down in the
Figure 2 | Inside a convolutional network. The outputs (not the filters)
of each layer (horizontally) of a typical convolutional network architecture
applied to the image of a Samoyed dog (bottom left; and RGB (red, green,
blue) inputs, bottom right). Each rectangular image is a feature map
corresponding to the output for one of the learned features, detected at each
of the image positions. Information flows bottom up, with lower-level features
acting as oriented edge detectors, and a score is computed for each image class
in output. ReLU, rectified linear unit.
[Figure 2 (illustration): red, green and blue input channels at the bottom; repeated stages
of convolutions and ReLU followed by max pooling above them; class scores at the top, for
example Samoyed (16); Papillon (5.7); Pomeranian (2.7); Arctic fox (1.0); Eskimo dog (0.6);
white wolf (0.4); Siberian husky (0.4).]
remainder29,30. The analysis seems to show that saddle points with
only a few downward curving directions are present in very large
numbers, but almost all of them have very similar values of the objec-
tive function. Hence, it does not much matter which of these saddle
points the algorithm gets stuck at.
Interest in deep feedforward networks was revived around 2006
(refs31–34) by a group of researchers brought together by the Cana-
dian Institute for Advanced Research (CIFAR). The researchers intro-
duced unsupervised learning procedures that could create layers of
feature detectors without requiring labelled data. The objective in
learning each layer of feature detectors was to be able to reconstruct
or model the activities of feature detectors (or raw inputs) in the layer
below. By ‘pre-training’ several layers of progressively more complex
feature detectors using this reconstruction objective, the weights of a
deep network could be initialized to sensible values. A final layer of
output units could then be added to the top of the network and the
whole deep system could be fine-tuned using standard
backpropagation33–35. This worked remarkably well for recognizing handwritten
digits or for detecting pedestrians, especially when the amount of
labelled data was very limited36.
The first major application of this pre-training approach was in
speech recognition, and it was made possible by the advent of fast
graphics processing units (GPUs) that were convenient to program37
and allowed researchers to train networks 10 or 20 times faster. In
2009, the approach was used to map short temporal windows of coef-
ficients extracted from a sound wave to a set of probabilities for the
various fragments of speech that might be represented by the frame
in the centre of the window. It achieved record-breaking results on a
standard speech recognition benchmark that used a small
vocabulary38 and was quickly developed to give record-breaking results on
a large vocabulary task39. By 2012, versions of the deep net from 2009
were being developed by many of the major speech groups6 and were
already being deployed in Android phones. For smaller data sets,
unsupervised pre-training helps to prevent overfitting40, leading to
significantly better generalization when the number of labelled exam-
ples is small, or in a transfer setting where we have lots of examples
for some ‘source’ tasks but very few for some ‘target’ tasks. Once deep
learning had been rehabilitated, it turned out that the pre-training
stage was only needed for small data sets.
There was, however, one particular type of deep, feedforward net-
work that was much easier to train and generalized much better than
networks with full connectivity between adjacent layers. This was
the convolutional neural network (ConvNet)41,42. It achieved many
practical successes during the period when neural networks were out
of favour and it has recently been widely adopted by the computer-
vision community.
Convolutional neural networks
ConvNets are designed to process data that come in the form of
multiple arrays, for example a colour image composed of three 2D
arrays containing pixel intensities in the three colour channels. Many
data modalities are in the form of multiple arrays: 1D for signals and
sequences, including language; 2D for images or audio spectrograms;
and 3D for video or volumetric images. There are four key ideas
behind ConvNets that take advantage of the properties of natural
signals: local connections, shared weights, pooling and the use of
many layers.
The architecture of a typical ConvNet (Fig. 2) is structured as a
series of stages. The first few stages are composed of two types of
layers: convolutional layers and pooling layers. Units in a convolu-
tional layer are organized in feature maps, within which each unit
is connected to local patches in the feature maps of the previous
layer through a set of weights called a filter bank. The result of this
local weighted sum is then passed through a non-linearity such as a
ReLU. All units in a feature map share the same filter bank. Differ-
ent feature maps in a layer use different filter banks. The reason for
this architecture is twofold. First, in array data such as images, local
groups of values are often highly correlated, forming distinctive local
motifs that are easily detected. Second, the local statistics of images
and other signals are invariant to location. In other words, if a motif
can appear in one part of the image, it could appear anywhere, hence
the idea of units at different locations sharing the same weights and
detecting the same pattern in different parts of the array. Mathemati-
cally, the filtering operation performed by a feature map is a discrete
convolution, hence the name.
Although the role of the convolutional layer is to detect local con-
junctions of features from the previous layer, the role of the pooling
layer is to merge semantically similar features into one. Because the
relative positions of the features forming a motif can vary somewhat,
reliably detecting the motif can be done by coarse-graining the posi-
tion of each feature. A typical pooling unit computes the maximum
of a local patch of units in one feature map (or in a few feature maps).
Neighbouring pooling units take input from patches that are shifted
by more than one row or column, thereby reducing the dimension of
the representation and creating an invariance to small shifts and dis-
tortions. Two or three stages of convolution, non-linearity and pool-
ing are stacked, followed by more convolutional and fully-connected
layers. Backpropagating gradients through a ConvNet is as simple as
through a regular deep network, allowing all the weights in all the
filter banks to be trained.
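A deliberately naive sketch of one convolution, ReLU and max-pooling stage on a single-channel 2D array is given below; the filter-bank shape, pooling size and random input are illustrative assumptions, and practical ConvNet libraries implement these operations far more efficiently:

```python
import numpy as np

def conv2d_relu(image, filters):
    """Valid 2D convolution with a bank of filters, followed by a ReLU.

    image: (H, W); filters: (n_filters, k, k). Returns (n_filters, H-k+1, W-k+1),
    one feature map per filter, with every location sharing the same weights.
    """
    n, k, _ = filters.shape
    H, W = image.shape
    out = np.zeros((n, H - k + 1, W - k + 1))
    for f in range(n):
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[f, i, j] = np.sum(image[i:i + k, j:j + k] * filters[f])
    return np.maximum(out, 0.0)  # ReLU non-linearity

def max_pool(maps, size=2):
    """Max pooling: take the maximum over non-overlapping size x size patches."""
    n, H, W = maps.shape
    H2, W2 = H // size, W // size
    pooled = maps[:, :H2 * size, :W2 * size].reshape(n, H2, size, W2, size)
    return pooled.max(axis=(2, 4))

# Toy usage: a random 8x8 'image' and a bank of three random 3x3 filters.
rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
filters = rng.standard_normal((3, 3, 3))
feature_maps = conv2d_relu(image, filters)   # shape (3, 6, 6)
print(max_pool(feature_maps).shape)          # shape (3, 3, 3)
```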
Deep neural networks exploit the property that many natural sig-
nals are compositional hierarchies, in which higher-level features
are obtained by composing lower-level ones. In images, local combi-
nations of edges form motifs, motifs assemble into parts, and parts
form objects. Similar hierarchies exist in speech and text from sounds
to phones, phonemes, syllables, words and sentences. The pooling
allows representations to vary very little when elements in the previ-
ous layer vary in position and appearance.
The convolutional and pooling layers in ConvNets are directly
inspired by the classic notions of simple cells and complex cells in
visual neuroscience43, and the overall architecture is reminiscent of
the LGN–V1–V2–V4–IT hierarchy in the visual cortex ventral path-
way44. When ConvNet models and monkeys are shown the same pic-
ture, the activations of high-level units in the ConvNet explain half
of the variance of random sets of 160 neurons in the monkey’s infer-
otemporal cortex45. ConvNets have their roots in the neocognitron46,
the architecture of which was somewhat similar, but did not have an
end-to-end supervised-learning algorithm such as backpropagation.
A primitive 1D ConvNet called a time-delay neural net was used for
the recognition of phonemes and simple words47,48.
There have been numerous applications of convolutional net-
works going back to the early 1990s, starting with time-delay neu-
ral networks for speech recognition47 and document reading42. The
document reading system used a ConvNet trained jointly with a
probabilistic model that implemented language constraints. By the
late 1990s this system was reading over 10% of all the cheques in the
United States. A number of ConvNet-based optical character recog-
nition and handwriting recognition systems were later deployed by
Microsoft49. ConvNets were also experimented with in the early 1990s
for object detection in natural images, including faces and hands50,51,
and for face recognition52.
Image understanding with deep convolutional networks
Since the early 2000s, ConvNets have been applied with great success to
the detection, segmentation and recognition of objects and regions in
images. These were all tasks in which labelled data was relatively abun-
dant, such as traffic sign recognition53, the segmentation of biological
images54, particularly for connectomics55, and the detection of faces,
text, pedestrians and human bodies in natural images36,50,51,56–58. A major
recent practical success of ConvNets is face recognition59.
Importantly, images can be labelled at the pixel level, which will have
applications in technology, including autonomous mobile robots and
self-driving cars60,61. Companies such as Mobileye and NVIDIA are
using such ConvNet-based methods in their upcoming vision sys-
tems for cars. Other applications gaining importance involve natural
language understanding14 and speech recognition7.
Despite these successes, ConvNets were largely forsaken by the
mainstream computer-vision and machine-learning communities
until the ImageNet competition in 2012. When deep convolutional
networks were applied to a data set of about a million images from
the web that contained 1,000 different classes, they achieved spec-
tacular results, almost halving the error rates of the best compet-
ing approaches1. This success came from the efficient use of GPUs,
ReLUs, a new regularization technique called dropout62, and tech-
niques to generate more training examples by deforming the existing
ones. This success has brought about a revolution in computer vision;
ConvNets are now the dominant approach for almost all recognition
and detection tasks4,58,59,63–65 and approach human performance on
some tasks. A recent stunning demonstration combines ConvNets
and recurrent net modules for the generation of image captions
(Fig.3).
Recent ConvNet architectures have 10 to 20 layers of ReLUs, hun-
dreds of millions of weights, and billions of connections between
units. Whereas training such large networks could have taken weeks
only two years ago, progress in hardware, software and algorithm
parallelization have reduced training times to a few hours.
The performance of ConvNet-based vision systems has caused
most major technology companies, including Google, Facebook,
Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly
growing number of start-ups to initiate research and development
projects and to deploy ConvNet-based image understanding products
and services.
ConvNets are easily amenable to efficient hardware implemen-
tations in chips or field-programmable gate arrays66,67. A number
of companies such as NVIDIA, Mobileye, Intel, Qualcomm and
Samsung are developing ConvNet chips to enable real-time vision
applications in smartphones, cameras, robots and self-driving cars.
Distributed representations and language processing
Deep-learning theory shows that deep nets have two different expo-
nential advantages over classic learning algorithms that do not use
distributed representations21. Both of these advantages arise from the
power of composition and depend on the underlying data-generating
distribution having an appropriate componential structure40. First,
learning distributed representations enables generalization to new
combinations of the values of learned features beyond those seen
during training (for example, 2^n combinations are possible with n
binary features)68,69. Second, composing layers of representation in
a deep net brings the potential for another exponential advantage70
(exponential in the depth).
The hidden layers of a multilayer neural network learn to repre-
sent the network’s inputs in a way that makes it easy to predict the
target outputs. This is nicely demonstrated by training a multilayer
neural network to predict the next word in a sequence from a local
Figure 3 | From image to text. Captions generated by a recurrent neural
network (RNN) taking, as extra input, the representation extracted by a deep
convolutional neural network (CNN) from a test image, with the RNN trained to
‘translate’ high-level representations of images into captions (top). Reproduced
with permission from ref. 102. When the RNN is given the ability to focus its
attention on a different location in the input image (middle and bottom; the
lighter patches were given more attention) as it generates each word (bold), we
found86 that it exploits this to achieve better ‘translation’ of images into captions.
[Figure 3 (illustration): top, a vision deep CNN feeding a language-generating RNN that
produces captions such as “A group of people shopping at an outdoor market. There are many
vegetables at the fruit stand.”; middle and bottom, attention-based examples including “A
woman is throwing a frisbee in a park.”, “A little girl sitting on a bed with a teddy bear.”,
“A group of people sitting on a boat in the water.”, “A giraffe standing in a forest with
trees in the background.”, “A dog is standing on a hardwood floor.” and “A stop sign is on
a road with a mountain in the background.”]
context of earlier words71. Each word in the context is presented to
the network as a one-of-N vector, that is, one component has a value
of 1 and the rest are 0. In the first layer, each word creates a different
pattern of activations, or word vectors (Fig. 4). In a language model,
the other layers of the network learn to convert the input word vec-
tors into an output word vector for the predicted next word, which
can be used to predict the probability for any word in the vocabulary
to appear as the next word. The network learns word vectors that
contain many active components each of which can be interpreted
as a separate feature of the word, as was first demonstrated27 in the
context of learning distributed representations for symbols. These
semantic features were not explicitly present in the input. They were
discovered by the learning procedure as a good way of factorizing
the structured relationships between the input and output symbols
into multiple ‘micro-rules’. Learning word vectors turned out to also
work very well when the word sequences come from a large corpus
of real text and the individual micro-rules are unreliable71. When
trained to predict the next word in a news story, for example, the
learned word vectors for Tuesday and Wednesday are very similar, as
are the word vectors for Sweden and Norway. Such representations
are called distributed representations because their elements (the
features) are not mutually exclusive and their many configurations
correspond to the variations seen in the observed data. These word
vectors are composed of learned features that were not determined
ahead of time by experts, but automatically discovered by the neural
network. Vector representations of words learned from text are now
very widely used in natural language applications14,17,72–76.
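To make the mechanics concrete, the sketch below uses a made-up vocabulary and dimensions to show how a one-of-N input selects a word vector in the first layer and how an output layer turns a two-word context into a probability for every word in the vocabulary; in a real language model all of these parameters would be learned by backpropagation rather than left random:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "Tuesday", "Wednesday"]  # toy vocabulary
V, d = len(vocab), 4                  # vocabulary size and word-vector dimension

E = 0.1 * rng.standard_normal((V, d))          # first layer: one word vector per word
W_out = 0.1 * rng.standard_normal((2 * d, V))  # maps a 2-word context to output scores

def one_hot(word):
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0        # one-of-N coding: a single component is 1
    return v

def next_word_probabilities(w1, w2):
    """Look up the word vectors of a 2-word context and score every possible next word."""
    context = np.concatenate([one_hot(w1) @ E, one_hot(w2) @ E])  # embedding lookup
    scores = context @ W_out
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()            # softmax over the whole vocabulary

# Untrained example; training would adjust E and W_out so that, e.g., 'Tuesday'
# and 'Wednesday' end up with similar word vectors.
p = next_word_probabilities("the", "cat")
print(vocab[int(np.argmax(p))], p.round(3))
```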
The issue of representation lies at the heart of the debate between
the logic-inspired and the neural-network-inspired paradigms for
cognition. In the logic-inspired paradigm, an instance of a symbol is
something for which the only property is that it is either identical or
non-identical to other symbol instances. It has no internal structure
that is relevant to its use; and to reason with symbols, they must be
bound to the variables in judiciously chosen rules of inference. By
contrast, neural networks just use big activity vectors, big weight
matrices and scalar non-linearities to perform the type of fast ‘intui-
tive’ inference that underpins effortless commonsense reasoning.
Before the introduction of neural language models71, the standard
approach to statistical modelling of language did not exploit distrib-
uted representations: it was based on counting frequencies of occur-
rences of short symbol sequences of length up to N (called N-grams).
The number of possible N-grams is on the order of V^N, where V is
the vocabulary size, so taking into account a context of more than a
handful of words would require very large training corpora. N-grams
treat each word as an atomic unit, so they cannot generalize across
semantically related sequences of words, whereas neural language
models can because they associate each word with a vector of real-
valued features, and semantically related words end up close to each
other in that vector space (Fig. 4).
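For contrast, a bigram (N = 2) count model of the kind described above can be sketched with a toy corpus; because each word is treated as an atomic unit, counts for one word say nothing about a semantically related one:

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the mat".split()   # toy corpus

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of length-2 symbol sequences
contexts = Counter(corpus[:-1])              # counts of each 1-word context

def p_next(word, context):
    """Estimate P(word | context) from raw bigram counts (no smoothing)."""
    return bigrams[(context, word)] / contexts[context] if contexts[context] else 0.0

print(p_next("sat", "cat"))   # 1.0: 'cat' was always followed by 'sat'
print(p_next("mat", "the"))   # 0.5: 'the' was followed by cat, mat, dog, mat
print(p_next("sat", "fox"))   # 0.0: an unseen word generalizes to nothing
```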
Recurrent neural networks
When backpropagation was first introduced, its most exciting use was
for training recurrent neural networks (RNNs). For tasks that involve
sequential inputs, such as speech and language, it is often better to
use RNNs (Fig. 5). RNNs process an input sequence one element at a
time, maintaining in their hidden units a ‘state vector’ that implicitly
contains information about the history of all the past elements of
the sequence. When we consider the outputs of the hidden units at
different discrete time steps as if they were the outputs of different
neurons in a deep multilayer network (Fig. 5, right), it becomes clear
how we can apply backpropagation to train RNNs.
RNNs are very powerful dynamic systems, but training them has
proved to be problematic because the backpropagated gradients
either grow or shrink at each time step, so over many time steps they
typically explode or vanish77,78.
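A minimal sketch of this recurrence, with invented dimensions and random parameters, shows the state vector being updated from the current input and the previous state using the same matrices U, W and V at every time step (compare Fig. 5):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 5, 2      # illustrative sizes

U = 0.1 * rng.standard_normal((n_hidden, n_in))       # input -> state
W = 0.1 * rng.standard_normal((n_hidden, n_hidden))   # previous state -> state
V = 0.1 * rng.standard_normal((n_out, n_hidden))      # state -> output

def rnn_forward(xs):
    """Process an input sequence one element at a time, carrying a state vector."""
    s = np.zeros(n_hidden)           # state implicitly summarizing the past
    outputs = []
    for x in xs:
        s = np.tanh(U @ x + W @ s)   # same parameters U, W reused at every time step
        outputs.append(V @ s)        # o_t depends on all inputs up to time t
    return outputs

sequence = [rng.standard_normal(n_in) for _ in range(4)]
print([o.round(3) for o in rnn_forward(sequence)])
```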
Thanks to advances in their architecture79,80 and ways of training
them81,82, RNNs have been found to be very good at predicting the
next character in the text83 or the next word in a sequence75, but they
can also be used for more complex tasks. For example, after reading
an English sentence one word at a time, an English ‘encoder’ network
can be trained so that the final state vector of its hidden units is a good
representation of the thought expressed by the sentence. This thought
vector can then be used as the initial hidden state of (or as extra input
to) a jointly trained French ‘decoder’ network, which outputs a prob-
ability distribution for the first word of the French translation. If a
particular first word is chosen from this distribution and provided
as input to the decoder network it will then output a probability dis-
tribution for the second word of the translation and so on until a
full stop is chosen17,72,76. Overall, this process generates sequences of
French words according to a probability distribution that depends on
the English sentence. This rather naive way of performing machine
translation has quickly become competitive with the state-of-the-art,
and this raises serious doubts about whether understanding a sen-
tence requires anything like the internal symbolic expressions that are
manipulated by using inference rules. It is more compatible with the
view that everyday reasoning involves many simultaneous analogies
Figure 4 | Visualizing the learned word vectors. On the left is an illustration
of word representations learned for modelling language, non-linearly projected
to 2D for visualization using the t-SNE algorithm103. On the right is a 2D
representation of phrases learned by an English-to-French encoder–decoder
recurrent neural network75. One can observe that semantically similar words
or sequences of words are mapped to nearby representations. The distributed
representations of words are obtained by using backpropagation to jointly learn
a representation for each word and a function that predicts a target quantity
such as the next word in a sequence (for language modelling) or a whole
sequence of translated words (for machine translation)18,75.
[Figure 4 (illustration): left, a 2D t-SNE map in which related words such as “community”,
“communities”, “organizations”, “institutions”, “society”, “industry”, “company”, “school”
and “agencies” form clusters; right, a 2D map of phrases from the encoder–decoder model in
which semantically similar expressions such as “over the past few months”, “In the last few
days”, “in the coming months” and “over the last two decades” lie close to one another.]
that each contribute plausibility to a conclusion84,85.
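Under toy assumptions (tiny invented vocabularies and random, untrained weights), the following sketch shows the shape of such an encoder–decoder scheme: the encoder's final state vector initializes a decoder that emits one word distribution at a time until a full stop is chosen. It is a structural illustration only; a trained system would learn all of these parameters jointly:

```python
import numpy as np

rng = np.random.default_rng(0)
src_vocab = ["the", "cat", "sleeps", "."]    # toy English vocabulary
tgt_vocab = ["le", "chat", "dort", "."]      # toy French vocabulary
d = 8                                        # state-vector size (illustrative)

def embed(vocab):
    return {w: 0.1 * rng.standard_normal(d) for w in vocab}

src_vec, tgt_vec = embed(src_vocab), embed(tgt_vocab)
W_enc = 0.1 * rng.standard_normal((d, d))
W_dec = 0.1 * rng.standard_normal((d, d))
W_out = 0.1 * rng.standard_normal((d, len(tgt_vocab)))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def translate(sentence, max_len=10):
    # Encoder: read the source sentence one word at a time into a 'thought vector'.
    h = np.zeros(d)
    for word in sentence:
        h = np.tanh(W_enc @ h + src_vec[word])
    # Decoder: start from the thought vector and emit words until a full stop.
    out, prev = [], np.zeros(d)
    for _ in range(max_len):
        h = np.tanh(W_dec @ h + prev)
        word = tgt_vocab[int(np.argmax(softmax(h @ W_out)))]  # greedy choice
        out.append(word)
        if word == ".":
            break
        prev = tgt_vec[word]                                  # feed the chosen word back in
    return out

print(translate(["the", "cat", "sleeps", "."]))   # gibberish until the weights are trained
```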
Instead of translating the meaning of a French sentence into an
English sentence, one can learn to ‘translate’ the meaning of an image
into an English sentence (Fig. 3). The encoder here is a deep Con-
vNet that converts the pixels into an activity vector in its last hidden
layer. The decoder is an RNN similar to the ones used for machine
translation and neural language modelling. There has been a surge of
interest in such systems recently (see examples mentioned in ref. 86).
RNNs, once unfolded in time (Fig. 5), can be seen as very deep
feedforward networks in which all the layers share the same weights.
Although their main purpose is to learn long-term dependencies,
theoretical and empirical evidence shows that it is difficult to learn
to store information for very long78.
To correct for that, one idea is to augment the network with an
explicit memory. The first proposal of this kind is the long short-term
memory (LSTM) networks that use special hidden units, the natural
behaviour of which is to remember inputs for a long time79. A special
unit called the memory cell acts like an accumulator or a gated leaky
neuron: it has a connection to itself at the next time step that has a
weight of one, so it copies its own real-valued state and accumulates
the external signal, but this self-connection is multiplicatively gated
by another unit that learns to decide when to clear the content of the
memory.
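A scalar caricature of that memory cell is sketched below; the gate weights are hand-picked for illustration rather than learned, and a real LSTM has further input and output gates not shown here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_cell(signals, clear_signals, w_gate=-4.0, b_gate=2.0):
    """Scalar caricature of an LSTM memory cell.

    The cell copies its own state through a self-connection of weight one and
    accumulates the incoming signal; the self-connection is multiplied by a
    gate (a logistic unit) that decides when to keep or clear the memory.
    """
    c = 0.0
    trace = []
    for x, clear in zip(signals, clear_signals):
        gate = sigmoid(w_gate * clear + b_gate)   # ~1 keeps the memory, ~0 clears it
        c = gate * c + x                          # gated copy of itself + new signal
        trace.append(round(c, 3))
    return trace

signals = [0.5, 0.5, 0.5, 0.0, 0.5]
clear_signals = [0.0, 0.0, 0.0, 1.0, 0.0]         # ask the cell to forget at step 4
print(memory_cell(signals, clear_signals))        # memory accumulates, then is cleared
```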
LSTM networks have subsequently proved to be more effective
than conventional RNNs, especially when they have several layers for
each time step87, enabling an entire speech recognition system that
goes all the way from acoustics to the sequence of characters in the
transcription. LSTM networks or related forms of gated units are also
currently used for the encoder and decoder networks that perform
so well at machine translation17,72,76.
Over the past year, several authors have made different proposals to
augment RNNs with a memory module. Proposals include the Neural
Turing Machine in which the network is augmented by a ‘tape-like’
memory that the RNN can choose to read from or write to88, and
memory networks, in which a regular network is augmented by a
kind of associative memory89. Memory networks have yielded excel-
lent performance on standard question-answering benchmarks. The
memory is used to remember the story about which the network is
later asked to answer questions.
Beyond simple memorization, neural Turing machines and mem-
ory networks are being used for tasks that would normally require
reasoning and symbol manipulation. Neural Turing machines can
be taught ‘algorithms’. Among other things, they can learn to output
a sorted list of symbols when their input consists of an unsorted
sequence in which each symbol is accompanied by a real value that
indicates its priority in the list88. Memory networks can be trained
to keep track of the state of the world in a setting similar to a text
adventure game and after reading a story, they can answer questions
that require complex inference90. In one test example, the network is
shown a 15-sentence version of The Lord of the Rings and correctly
answers questions such as “where is Frodo now?”89.
The future of deep learning
Unsupervised learning91–98 had a catalytic effect in reviving interest in
deep learning, but has since been overshadowed by the successes of
purely supervised learning. Although we have not focused on it in this
Review, we expect unsupervised learning to become far more important
in the longer term. Human and animal learning is largely unsupervised:
we discover the structure of the world by observing it, not by being told
the name of every object.
Human vision is an active process that sequentially samples the optic
array in an intelligent, task-specific way using a small, high-resolution
fovea with a large, low-resolution surround. We expect much of the
future progress in vision to come from systems that are trained end-to-
end and combine ConvNets with RNNs that use reinforcement learning
to decide where to look. Systems combining deep learning and rein-
forcement learning are in their infancy, but they already outperform
passive vision systems99 at classification tasks and produce impressive
results in learning to play many different video games100.
Natural language understanding is another area in which deep learn-
ing is poised to make a large impact over the next few years. We expect
systems that use RNNs to understand sentences or whole documents
will become much better when they learn strategies for selectively
attending to one part at a time76,86.
Ultimately, major progress in artificial intelligence will come about
through systems that combine representation learning with complex
reasoning. Although deep learning and simple reasoning have been
used for speech and handwriting recognition for a long time, new
paradigms are needed to replace rule-based manipulation of symbolic
expressions by operations on large vectors101.
Received 25 February; accepted 1 May 2015.
1. Krizhevsky, A., Sutskever, I. & Hinton, G. ImageNet classification with deep
convolutional neural networks. In Proc. Advances in Neural Information
Processing Systems 25 1090–1098 (2012).
This report was a breakthrough that used convolutional nets to almost halve
the error rate for object recognition, and precipitated the rapid adoption of
deep learning by the computer vision community.
2. Farabet, C., Couprie, C., Najman, L. & LeCun, Y. Learning hierarchical features for
scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1915–1929 (2013).
3. Tompson, J., Jain, A., LeCun, Y. & Bregler, C. Joint training of a convolutional
network and a graphical model for human pose estimation. In Proc. Advances in
Neural Information Processing Systems 27 1799–1807 (2014).
4. Szegedy, C. et al. Going deeper with convolutions. Preprint at http://arxiv.org/
abs/1409.4842 (2014).
5. Mikolov, T., Deoras, A., Povey, D., Burget, L. & Cernocky, J. Strategies for training
large scale neural network language models. In Proc. Automatic Speech
Recognition and Understanding 196–201 (2011).
6. Hinton, G. et al. Deep neural networks for acoustic modeling in speech
recognition. IEEE Signal Processing Magazine 29, 82–97 (2012).
This joint paper from the major speech recognition laboratories, summarizing
the breakthrough achieved with deep learning on the task of phonetic
classification for automatic speech recognition, was the first major industrial
application of deep learning.
7. Sainath, T., Mohamed, A.-R., Kingsbury, B. & Ramabhadran, B. Deep
convolutional neural networks for LVCSR. In Proc. Acoustics, Speech and Signal
Processing 8614–8618 (2013).
8. Ma, J., Sheridan, R. P., Liaw, A., Dahl, G. E. & Svetnik, V. Deep neural nets as a
method for quantitative structure-activity relationships. J. Chem. Inf. Model. 55,
263–274 (2015).
9. Ciodaro, T., Deva, D., de Seixas, J. & Damazio, D. Online particle detection with
neural networks based on topological calorimetry information. J. Phys. Conf.
Series 368, 012030 (2012).
10. Kaggle. Higgs boson machine learning challenge. Kaggle https://www.kaggle.com/c/higgs-boson (2014).
11. Helmstaedter, M. et al. Connectomic reconstruction of the inner plexiform layer
in the mouse retina. Nature 500, 168–174 (2013).
Figure 5 | A recurrent neural network and the unfolding in time of the
computation involved in its forward computation. The artificial neurons
(for example, hidden units grouped under node s with values st at time t) get
inputs from other neurons at previous time steps (this is represented with the
black square, representing a delay of one time step, on the left). In this way, a
recurrent neural network can map an input sequence with elements xt into an
output sequence with elements ot, with each ot depending on all the previous
xtʹ (for tʹ ≤ t). The same parameters (matrices U,V,W ) are used at each time
step. Many other architectures are possible, including a variant in which the
network can generate a sequence of outputs (for example, words), each of
which is used as inputs for the next time step. The backpropagation algorithm
(Fig. 1) can be directly applied to the computational graph of the unfolded
network on the right, to compute the derivative of a total error (for example,
the log-probability of generating the right sequence of outputs) with respect to
all the states st and all the parameters.
12. Leung, M. K., Xiong, H. Y., Lee, L. J. & Frey, B. J. Deep learning of the tissue-
regulated splicing code. Bioinformatics 30, i121–i129 (2014).
13. Xiong, H. Y. et al. The human splicing code reveals new insights into the genetic
determinants of disease. Science 347, 6218 (2015).
14. Collobert, R., et al. Natural language processing (almost) from scratch. J. Mach.
Learn. Res. 12, 2493–2537 (2011).
15. Bordes, A., Chopra, S. & Weston, J. Question answering with subgraph
embeddings. In Proc. Empirical Methods in Natural Language Processing http://
arxiv.org/abs/1406.3676v3 (2014).
16. Jean, S., Cho, K., Memisevic, R. & Bengio, Y. On using very large target
vocabulary for neural machine translation. In Proc. ACL-IJCNLP http://arxiv.org/
abs/1412.2007 (2015).
17. Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning with neural
networks. In Proc. Advances in Neural Information Processing Systems 27
3104–3112 (2014).
This paper showed state-of-the-art machine translation results with the
architecture introduced in ref. 72, with a recurrent network trained to read a
sentence in one language, produce a semantic representation of its meaning,
and generate a translation in another language.
18. Bottou, L. & Bousquet, O. The tradeoffs of large scale learning. In Proc. Advances
in Neural Information Processing Systems 20 161–168 (2007).
19. Duda, R. O. & Hart, P. E. Pattern Classification and Scene Analysis (Wiley, 1973).
20. Schölkopf, B. & Smola, A. Learning with Kernels (MIT Press, 2002).
21. Bengio, Y., Delalleau, O. & Le Roux, N. The curse of highly variable functions
for local kernel machines. In Proc. Advances in Neural Information Processing
Systems 18 107–114 (2005).
22. Selfridge, O. G. Pandemonium: a paradigm for learning in mechanisation of
thought processes. In Proc. Symposium on Mechanisation of Thought Processes
513–526 (1958).
23. Rosenblatt, F. The Perceptron — A Perceiving and Recognizing Automaton. Tech.
Rep. 85-460-1 (Cornell Aeronautical Laboratory, 1957).
24. Werbos, P. Beyond Regression: New Tools for Prediction and Analysis in the
Behavioral Sciences. PhD thesis, Harvard Univ. (1974).
25. Parker, D. B. Learning Logic Report TR–47 (MIT Press, 1985).
26. LeCun, Y. Une procédure d’apprentissage pour Réseau à seuil assymétrique
in Cognitiva 85: a la Frontière de l’Intelligence Artificielle, des Sciences de la
Connaissance et des Neurosciences [in French] 599–604 (1985).
27. Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by
back-propagating errors. Nature 323, 533–536 (1986).
28. Glorot, X., Bordes, A. & Bengio, Y. Deep sparse rectifier neural networks. In Proc.
14th International Conference on Artificial Intelligence and Statistics 315–323
(2011).
This paper showed that supervised training of very deep neural networks is
much faster if the hidden layers are composed of ReLU.
29. Dauphin, Y. et al. Identifying and attacking the saddle point problem in high-
dimensional non-convex optimization. In Proc. Advances in Neural Information
Processing Systems 27 2933–2941 (2014).
30. Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B. & LeCun, Y. The loss
surface of multilayer networks. In Proc. Conference on AI and Statistics http://
arxiv.org/abs/1412.0233 (2014).
31. Hinton, G. E. What kind of graphical model is the brain? In Proc. 19th
International Joint Conference on Artificial intelligence 1765–1775 (2005).
32. Hinton, G. E., Osindero, S. & Teh, Y.-W. A fast learning algorithm for deep belief
nets. Neural Comp. 18, 1527–1554 (2006).
This paper introduced a novel and effective way of training very deep neural
networks by pre-training one hidden layer at a time using the unsupervised
learning procedure for restricted Boltzmann machines.
33. Bengio, Y., Lamblin, P., Popovici, D. & Larochelle, H. Greedy layer-wise training
of deep networks. In Proc. Advances in Neural Information Processing Systems 19
153–160 (2006).
This report demonstrated that the unsupervised pre-training method
introduced in ref. 32 significantly improves performance on test data and
generalizes the method to other unsupervised representation-learning
techniques, such as auto-encoders.
34. Ranzato, M., Poultney, C., Chopra, S. & LeCun, Y. Efficient learning of sparse
representations with an energy-based model. In Proc. Advances in Neural
Information Processing Systems 19 1137–1144 (2006).
35. Hinton, G. E. & Salakhutdinov, R. Reducing the dimensionality of data with
neural networks. Science 313, 504–507 (2006).
36. Sermanet, P., Kavukcuoglu, K., Chintala, S. & LeCun, Y. Pedestrian detection with
unsupervised multi-stage feature learning. In Proc. International Conference
on Computer Vision and Pattern Recognition http://arxiv.org/abs/1212.0142
(2013).
37. Raina, R., Madhavan, A. & Ng, A. Y. Large-scale deep unsupervised learning
using graphics processors. In Proc. 26th Annual International Conference on
Machine Learning 873–880 (2009).
38. Mohamed, A.-R., Dahl, G. E. & Hinton, G. Acoustic modeling using deep belief
networks. IEEE Trans. Audio Speech Lang. Process. 20, 14–22 (2012).
39. Dahl, G. E., Yu, D., Deng, L. & Acero, A. Context-dependent pre-trained deep
neural networks for large vocabulary speech recognition. IEEE Trans. Audio
Speech Lang. Process. 20, 33–42 (2012).
40. Bengio, Y., Courville, A. & Vincent, P. Representation learning: a review and new
perspectives. IEEE Trans. Pattern Anal. Machine Intell. 35, 1798–1828 (2013).
41. LeCun, Y. et al. Handwritten digit recognition with a back-propagation network.
In Proc. Advances in Neural Information Processing Systems 396–404 (1990).
This is the first paper on convolutional networks trained by backpropagation
for the task of classifying low-resolution images of handwritten digits.
42. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to
document recognition. Proc. IEEE 86, 2278–2324 (1998).
This overview paper lays out the principles of end-to-end training of modular
systems, such as deep neural networks, using gradient-based optimization. It
showed how neural networks (and in particular convolutional nets) can be
combined with search or inference mechanisms to model complex, interdependent
outputs, such as the sequences of characters associated with the content of a document.
43. Hubel, D. H. & Wiesel, T. N. Receptive fields, binocular interaction, and functional
architecture in the cat’s visual cortex. J. Physiol. 160, 106–154 (1962).
44. Felleman, D. J. & Essen, D. C. V. Distributed hierarchical processing in the
primate cerebral cortex. Cereb. Cortex 1, 1–47 (1991).
45. Cadieu, C. F. et al. Deep neural networks rival the representation of primate
IT cortex for core visual object recognition. PLoS Comp. Biol. 10, e1003963
(2014).
46. Fukushima, K. & Miyake, S. Neocognitron: a new algorithm for pattern
recognition tolerant of deformations and shifts in position. Pattern Recognition
15, 455–469 (1982).
47. Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K. & Lang, K. Phoneme
recognition using time-delay neural networks. IEEE Trans. Acoustics Speech
Signal Process. 37, 328–339 (1989).
48. Bottou, L., Fogelman-Soulié, F., Blanchet, P. & Lienard, J. Experiments with time
delay networks and dynamic time warping for speaker independent isolated
digit recognition. In Proc. EuroSpeech 89 537–540 (1989).
49. Simard, P. Y., Steinkraus, D. & Platt, J. C. Best practices for convolutional neural
networks. In Proc. Document Analysis and Recognition 958–963 (2003).
50. Vaillant, R., Monrocq, C. & LeCun, Y. Original approach for the localisation of
objects in images. In Proc. Vision, Image, and Signal Processing 141, 245–250
(1994).
51. Nowlan, S. & Platt, J. in Neural Information Processing Systems 901–908 (1995).
52. Lawrence, S., Giles, C. L., Tsoi, A. C. & Back, A. D. Face recognition: a
convolutional neural-network approach. IEEE Trans. Neural Networks 8, 98–113
(1997).
53. Ciresan, D., Meier, U., Masci, J. & Schmidhuber, J. Multi-column deep neural
network for traffic sign classification. Neural Networks 32, 333–338 (2012).
54. Ning, F. et al. Toward automatic phenotyping of developing embryos from
videos. IEEE Trans. Image Process. 14, 1360–1371 (2005).
55. Turaga, S. C. et al. Convolutional networks can learn to generate affinity graphs
for image segmentation. Neural Comput. 22, 511–538 (2010).
56. Garcia, C. & Delakis, M. Convolutional face finder: a neural architecture for
fast and robust face detection. IEEE Trans. Pattern Anal. Machine Intell. 26,
1408–1423 (2004).
57. Osadchy, M., LeCun, Y. & Miller, M. Synergistic face detection and pose
estimation with energy-based models. J. Mach. Learn. Res. 8, 1197–1215
(2007).
58. Tompson, J., Goroshin, R., Jain, A., LeCun, Y. & Bregler, C. Efficient object
localization using convolutional networks. In Proc. Conference on Computer
Vision and Pattern Recognition http://arxiv.org/abs/1411.4280 (2014).
59. Taigman, Y., Yang, M., Ranzato, M. & Wolf, L. Deepface: closing the gap to
human-level performance in face verification. In Proc. Conference on Computer
Vision and Pattern Recognition 1701–1708 (2014).
60. Hadsell, R. et al. Learning long-range vision for autonomous off-road driving.
J. Field Robot. 26, 120–144 (2009).
61. Farabet, C., Couprie, C., Najman, L. & LeCun, Y. Scene parsing with multiscale
feature learning, purity trees, and optimal covers. In Proc. International
Conference on Machine Learning http://arxiv.org/abs/1202.2160 (2012).
62. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R.
Dropout: a simple way to prevent neural networks from overfitting. J. Machine
Learning Res. 15, 1929–1958 (2014).
63. Sermanet, P. et al. Overfeat: integrated recognition, localization and detection
using convolutional networks. In Proc. International Conference on Learning
Representations http://arxiv.org/abs/1312.6229 (2014).
64. Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for
accurate object detection and semantic segmentation. In Proc. Conference on
Computer Vision and Pattern Recognition 580–587 (2014).
65. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale
image recognition. In Proc. International Conference on Learning Representations
http://arxiv.org/abs/1409.1556 (2014).
66. Boser, B., Sackinger, E., Bromley, J., LeCun, Y. & Jackel, L. An analog neural
network processor with programmable topology. J. Solid State Circuits 26,
2017–2025 (1991).
67. Farabet, C. et al. Large-scale FPGA-based convolutional networks. In Scaling
up Machine Learning: Parallel and Distributed Approaches (eds Bekkerman, R.,
Bilenko, M. & Langford, J.) 399–419 (Cambridge Univ. Press, 2011).
68. Bengio, Y. Learning Deep Architectures for AI (Now, 2009).
69. Montufar, G. & Morton, J. When does a mixture of products contain a product of
mixtures? J. Discrete Math. 29, 321–347 (2014).
70. Montufar, G. F., Pascanu, R., Cho, K. & Bengio, Y. On the number of linear regions
of deep neural networks. In Proc. Advances in Neural Information Processing
Systems 27 2924–2932 (2014).
71. Bengio, Y., Ducharme, R. & Vincent, P. A neural probabilistic language model. In
Proc. Advances in Neural Information Processing Systems 13 932–938 (2001).
This paper introduced neural language models, which learn to convert a word
symbol into a word vector or word embedding composed of learned semantic
features in order to predict the next word in a sequence; a minimal sketch of
such a model appears after the reference list.
72. Cho, K. et al. Learning phrase representations using RNN encoder-decoder
for statistical machine translation. In Proc. Conference on Empirical Methods in
Natural Language Processing 1724–1734 (2014).
73. Schwenk, H. Continuous space language models. Computer Speech Lang. 21,
492–518 (2007).
74. Socher, R., Lin, C. C-Y., Manning, C. & Ng, A. Y. Parsing natural scenes and
natural language with recursive neural networks. In Proc. International
Conference on Machine Learning 129–136 (2011).
75. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed
representations of words and phrases and their compositionality. In Proc.
Advances in Neural Information Processing Systems 26 3111–3119 (2013).
76. Bahdanau, D., Cho, K. & Bengio, Y. Neural machine translation by jointly
learning to align and translate. In Proc. International Conference on Learning
Representations http://arxiv.org/abs/1409.0473 (2015).
77. Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen [Investigations
    on dynamic neural networks; in German] Diploma thesis, T. U. Munich (1991).
78. Bengio, Y., Simard, P. & Frasconi, P. Learning long-term dependencies with
gradient descent is difficult. IEEE Trans. Neural Networks 5, 157–166 (1994).
79. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9,
1735–1780 (1997).
This paper introduced LSTM recurrent networks, which have become a crucial
ingredient in recent advances with recurrent networks because they are good
at learning long-range dependencies; a minimal sketch of a single LSTM step
appears after the reference list.
80. ElHihi, S. & Bengio, Y. Hierarchical recurrent neural networks for long-term
dependencies. In Proc. Advances in Neural Information Processing Systems 8
http://papers.nips.cc/paper/1102-hierarchical-recurrent-neural-networks-for-
long-term-dependencies (1995).
81. Sutskever, I. Training Recurrent Neural Networks. PhD thesis, Univ. Toronto
(2012).
82. Pascanu, R., Mikolov, T. & Bengio, Y. On the difficulty of training recurrent neural
networks. In Proc. 30th International Conference on Machine Learning 1310–
1318 (2013).
83. Sutskever, I., Martens, J. & Hinton, G. E. Generating text with recurrent neural
networks. In Proc. 28th International Conference on Machine Learning 1017–
1024 (2011).
84. Lakoff, G. & Johnson, M. Metaphors We Live By (Univ. Chicago Press, 2008).
85. Rogers, T. T. & McClelland, J. L. Semantic Cognition: A Parallel Distributed
Processing Approach (MIT Press, 2004).
86. Xu, K. et al. Show, attend and tell: Neural image caption generation with visual
attention. In Proc. International Conference on Learning Representations http://
arxiv.org/abs/1502.03044 (2015).
87. Graves, A., Mohamed, A.-R. & Hinton, G. Speech recognition with deep recurrent
neural networks. In Proc. International Conference on Acoustics, Speech and
Signal Processing 6645–6649 (2013).
88. Graves, A., Wayne, G. & Danihelka, I. Neural Turing machines. http://arxiv.org/
abs/1410.5401 (2014).
89. Weston, J., Chopra, S. & Bordes, A. Memory networks. http://arxiv.org/
abs/1410.3916 (2014).
90. Weston, J., Bordes, A., Chopra, S. & Mikolov, T. Towards AI-complete question
answering: a set of prerequisite toy tasks. http://arxiv.org/abs/1502.05698
(2015).
91. Hinton, G. E., Dayan, P., Frey, B. J. & Neal, R. M. The wake-sleep algorithm for
unsupervised neural networks. Science 268, 1158–1161 (1995).
92. Salakhutdinov, R. & Hinton, G. Deep Boltzmann machines. In Proc. International
Conference on Artificial Intelligence and Statistics 448–455 (2009).
93. Vincent, P., Larochelle, H., Bengio, Y. & Manzagol, P.-A. Extracting and composing
robust features with denoising autoencoders. In Proc. 25th International
Conference on Machine Learning 1096–1103 (2008).
94. Kavukcuoglu, K. et al. Learning convolutional feature hierarchies for visual
recognition. In Proc. Advances in Neural Information Processing Systems 23
1090–1098 (2010).
95. Gregor, K. & LeCun, Y. Learning fast approximations of sparse coding. In Proc.
International Conference on Machine Learning 399–406 (2010).
96. Ranzato, M., Mnih, V., Susskind, J. M. & Hinton, G. E. Modeling natural images
using gated MRFs. IEEE Trans. Pattern Anal. Machine Intell. 35, 2206–2222
(2013).
97. Bengio, Y., Thibodeau-Laufer, E., Alain, G. & Yosinski, J. Deep generative
stochastic networks trainable by backprop. In Proc. 31st International
Conference on Machine Learning 226–234 (2014).
98. Kingma, D., Rezende, D., Mohamed, S. & Welling, M. Semi-supervised learning
with deep generative models. In Proc. Advances in Neural Information Processing
Systems 27 3581–3589 (2014).
99. Ba, J., Mnih, V. & Kavukcuoglu, K. Multiple object recognition with visual
attention. In Proc. International Conference on Learning Representations http://
arxiv.org/abs/1412.7755 (2014).
100. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature
518, 529–533 (2015).
101. Bottou, L. From machine learning to machine reasoning. Mach. Learn. 94,
133–149 (2014).
102. Vinyals, O., Toshev, A., Bengio, S. & Erhan, D. Show and tell: a neural image
caption generator. In Proc. Conference on Computer Vision and Pattern Recognition
http://arxiv.org/abs/1411.4555 (2015).
103. van der Maaten, L. & Hinton, G. E. Visualizing data using t-SNE. J. Mach. Learn.
Res. 9, 2579–2605 (2008).
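The sketches below are illustrative only; they are not reproduced from the paper or from any cited implementation, and every layer size, learning rate, dataset and variable name in them is an invented assumption. The first sketch accompanies refs 27 and 28: a small network with two hidden layers of rectified linear units, trained on synthetic data by computing a forward pass, backpropagating error derivatives, and taking plain gradient-descent steps.

```python
# Illustrative sketch only (not the authors' code): a two-hidden-layer network
# with ReLU units trained by backpropagation on synthetic data. Sizes, learning
# rate and data are invented assumptions; biases are omitted for brevity.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: 256 examples, 8 inputs, 1 target.
X = rng.normal(size=(256, 8))
t = np.sin(X.sum(axis=1, keepdims=True))

def relu(z):
    return np.maximum(0.0, z)

# Weights for input -> hidden1 -> hidden2 -> output.
W1 = rng.normal(scale=0.5, size=(8, 16))
W2 = rng.normal(scale=0.5, size=(16, 16))
W3 = rng.normal(scale=0.5, size=(16, 1))

lr, n = 0.01, X.shape[0]
for step in range(500):
    # Forward pass: weighted sums followed by ReLU non-linearities.
    z1 = X @ W1; y1 = relu(z1)
    z2 = y1 @ W2; y2 = relu(z2)
    y = y2 @ W3                      # linear output unit
    err = y - t                      # derivative of 0.5 * (y - t)^2 w.r.t. y

    # Backward pass: propagate error derivatives layer by layer.
    dW3 = y2.T @ err
    d2 = (err @ W3.T) * (z2 > 0)     # ReLU gradient is 1 where z > 0, else 0
    dW2 = y1.T @ d2
    d1 = (d2 @ W2.T) * (z1 > 0)
    dW1 = X.T @ d1

    # Gradient-descent step on the mean squared error.
    W1 -= lr * dW1 / n
    W2 -= lr * dW2 / n
    W3 -= lr * dW3 / n

y = relu(relu(X @ W1) @ W2) @ W3
print("training mean squared error:", float(np.mean((y - t) ** 2)))
```

Because the ReLU gradient is simply 0 or 1, the backward pass never multiplies by small saturating derivatives, which is one informal way to see why training with ReLUs tends to be faster than with sigmoid units.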
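A second illustrative sketch, in the spirit of refs 32 and 33 but using plain auto-encoders rather than restricted Boltzmann machines: each layer is trained, without labels, to reconstruct its own input, and its hidden code then becomes the input to the next layer. The stacked encoders can afterwards initialise a deep network for supervised fine-tuning. All sizes, data and names are invented assumptions.

```python
# Illustrative sketch only: greedy layer-wise unsupervised pre-training with
# one-hidden-layer auto-encoders (a stand-in for the RBM-based procedure of
# ref. 32, as generalised in ref. 33). Sizes and data are invented assumptions.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(data, n_hidden, lr=0.1, steps=300):
    """Train encoder/decoder weights to reconstruct `data`; return the encoder."""
    n_examples, n_in = data.shape
    W_enc = rng.normal(scale=0.1, size=(n_in, n_hidden))
    W_dec = rng.normal(scale=0.1, size=(n_hidden, n_in))
    for _ in range(steps):
        h = sigmoid(data @ W_enc)           # hidden code
        recon = h @ W_dec                   # linear reconstruction of the input
        err = recon - data                  # derivative of the squared error
        dW_dec = h.T @ err
        dh = (err @ W_dec.T) * h * (1 - h)  # backprop through the sigmoid
        dW_enc = data.T @ dh
        W_dec -= lr * dW_dec / n_examples
        W_enc -= lr * dW_enc / n_examples
    return W_enc

# Unlabelled synthetic data: 512 examples with 20 features.
X = rng.normal(size=(512, 20))

# Greedily pre-train a stack of encoders; each layer sees the codes of the layer below.
encoders, inputs = [], X
for n_hidden in (12, 6):
    W = train_autoencoder(inputs, n_hidden)
    encoders.append(W)
    inputs = sigmoid(inputs @ W)

# The stacked encoders now provide an initialisation for a deep network that
# could be fine-tuned with backpropagation on a supervised task.
print([W.shape for W in encoders])
```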
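A third illustrative sketch, loosely following ref. 71: each word index is mapped to a learned embedding vector, the embeddings of a two-word context are concatenated, and a softmax over the vocabulary predicts the next word. The toy corpus, dimensions and variable names are invented assumptions, not the original model.

```python
# Illustrative sketch only: a tiny neural language model with learned word
# embeddings and a softmax next-word predictor. Corpus, sizes and names are
# invented assumptions; biases and hidden layers are omitted for brevity.
import numpy as np

rng = np.random.default_rng(2)

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}
ids = [word_to_id[w] for w in corpus]

V, D, CONTEXT = len(vocab), 8, 2                   # vocabulary, embedding dim, context length
E = rng.normal(scale=0.1, size=(V, D))             # word embeddings (learned)
W = rng.normal(scale=0.1, size=(CONTEXT * D, V))   # softmax weights (learned)

lr = 0.1
for _ in range(500):
    for i in range(len(ids) - CONTEXT):
        ctx, target = ids[i:i + CONTEXT], ids[i + CONTEXT]
        x = E[ctx].reshape(-1)                     # concatenated context embeddings
        logits = x @ W
        p = np.exp(logits - logits.max()); p /= p.sum()
        dlogits = p.copy(); dlogits[target] -= 1.0 # cross-entropy gradient
        dx = W @ dlogits
        W -= lr * np.outer(x, dlogits)
        E[ctx] -= lr * dx.reshape(CONTEXT, D)      # update the context embeddings

# After training, query the most probable word following the context "the cat".
x = E[[word_to_id["the"], word_to_id["cat"]]].reshape(-1)
p = np.exp(x @ W); p /= p.sum()
print(vocab[int(np.argmax(p))])
```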
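A final illustrative sketch for ref. 79: one step of an LSTM cell, showing how the forget, input and output gates regulate an additively updated memory cell, which is what allows gradients to survive across many time steps. Biases are omitted, no training loop is shown, and every size and initialisation is an invented assumption.

```python
# Illustrative sketch only: a single LSTM step (forward pass, no training).
# Sizes and initialisation are invented assumptions; biases are omitted.
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D_IN, D_HID = 4, 8                # input and state sizes
# One weight matrix per gate, applied to the concatenated [input, previous hidden state].
Wf, Wi, Wo, Wc = (rng.normal(scale=0.1, size=(D_IN + D_HID, D_HID)) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    """One time step: return the new hidden state h and memory cell c."""
    xh = np.concatenate([x, h_prev])
    f = sigmoid(xh @ Wf)          # forget gate: how much old memory to keep
    i = sigmoid(xh @ Wi)          # input gate: how much new content to write
    o = sigmoid(xh @ Wo)          # output gate: how much of the memory to expose
    c_tilde = np.tanh(xh @ Wc)    # candidate content
    c = f * c_prev + i * c_tilde  # additive memory update eases long-range credit assignment
    h = o * np.tanh(c)
    return h, c

# Run the cell over a short random sequence.
h, c = np.zeros(D_HID), np.zeros(D_HID)
for _ in range(10):
    h, c = lstm_step(rng.normal(size=D_IN), h, c)
print(h.round(3))
```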
Acknowledgements The authors would like to thank the Natural Sciences and
Engineering Research Council of Canada, the Canadian Institute For Advanced
Research (CIFAR), the National Science Foundation and Office of Naval Research
for support. Y.L. and Y.B. are CIFAR fellows.
Author Information Reprints and permissions information is available at
www.nature.com/reprints. The authors declare no competing financial
interests. Readers are welcome to comment on the online version of this
paper at go.nature.com/7cjbaa. Correspondence should be addressed to Y.L.
(yann@cs.nyu.edu).