IET Smart Cities
Review Article
City brain: practice of large-scale artificial
intelligence in the real world
eISSN 2631-7680
Received on 10th May 2019
Accepted on 20th May 2019
doi: 10.1049/iet-smc.2019.0034
www.ietdl.org
Jianfeng Zhang¹, Xian-Sheng Hua¹, Jianqiang Huang¹, Xu Shen¹, Jingyuan Chen¹, Qin Zhou¹, Zhihang Fu¹, Yiru Zhao¹,²
¹DAMO Academy, Alibaba Group, 969 West Wenyi Road, Hangzhou, Zhejiang Province, People's Republic of China
²Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, People's Republic of China
E-mail: xiansheng.hxs@alibaba-inc.com
Abstract: A city is an aggregate of a huge amount of heterogeneous data. However, extracting meaningful values from that
data remains a challenge. City Brain is an end-to-end system whose goal is to glean irreplaceable values from big city data,
specifically from videos, with the assistance of rapidly evolving artificial intelligence technologies and fast-growing computing
capacity. From cognition to optimisation, to decision-making, from search to prediction and ultimately, to intervention, City Brain
improves the way to manage the city, as well as the way to live in it. In this study, the authors introduce current practices of the
City Brain platform in a few cities in China, including what they can do to achieve the goal and make it a reality. Then they focus
on the system overview and key technical details of each component of the City Brain system, from cognition to intervention.
Lastly, they present a few deployment cases of City Brain in various cities in China.
1 Introduction
1.1 About City Brain
As early as 2016, Smart City was presented as a national strategy
in China: ‘We should profoundly understand the role of the Internet in nation management and society governance, taking the implementation of e-government and building new smart cities as the key points. We will build a nationally integrated big data centre through data integration and promote technology convergence, business integration, and data convergence to achieve collaborative management and services across geographies, systems, departments, and services.’ Today, the first batch of ‘Digital Twin
Cities’ using artificial intelligence (AI) technologies have realised
the Internet mode of data sharing, data co-creation, and data
automatic control with the help of Alibaba City Brain.
The City Brain is the ‘commanding heights’ of technologies in
Alibaba Group. Based on the elastic computing and large-scale
data processing platform of Alibaba Cloud, integrated with the top
capabilities of interdisciplinary fields such as machine vision,
large-scale topological network computing, and traffic flow
analysis, the City Brain is capable of massive multi-source data
collection, real-time processing, and intelligent computing. There
are three metrics for a real ‘City Brain’: (1) it can deal with ultra-large-scale and multi-source data that humans cannot understand in real time (global cognition); (2) it can understand the complex hidden rules that humans have not discovered (machine learning); (3) it can formulate a globally optimal strategy that surpasses the locally suboptimal decisions made by humans (global coordination).
The City Brain has become a powerful assistant for city
managers in cognising, transforming, and operating cities. It
transcends human capabilities with four kinds of ‘super powers’:
(1) machine vision cognitive capability to enhance perception of
urban data; (2) the full-scale data platform construction capacity to
enhance the ‘data density’ and ‘particle management’ level; (3)
real-time computing capability under large-scale dynamic topology
networks; (4) the City Brain open platform capability to empower
the digital city industry.
The City Brain is deployed according to five major application
scenarios: urban traffic checkup, urban police monitoring, urban
traffic micro-control, urban special vehicles, and urban strategic
planning. (1) Urban traffic checkup can completely quantify the
urban ‘vital signs’ via the fusion and integration of full-scale, full-
network, and cross-domain data, avoiding one-sided solutions for
urban problems due to single-source data; (2) by taking advantage of machine learning and computer vision, automatic police monitoring can liberate police officers from laborious legwork and let the data run the errands instead of the officers; (3) urban traffic micro control-and-feedback loop, which opens the
feedback control system between ‘brains’, ‘eyes’, and ‘hands and
feet’. Based on multi-source data, the global intelligent algorithm
provides a fine-grained control of city-scale traffic signals to
improve mobility in the city; (4) Route optimisation for emergency
vehicles. City Brain identifies the quickest route for emergency
vehicles to arrive at the scene within the shortest time frame; (5)
Urban layout planning and verification, which analyses the effect
of a proposed urban construction blueprint on the cloud with the
simulation data model.
1.2 History
In April 2016, the concept of ‘city brain’ was formally proposed.
City Brain is a new infrastructure built on massive data, which
utilises AI to solve urban governance and development issues that
cannot be solved by the human brain. It is a program that offers a
comprehensive suite of acquisition, integration, and analysis of big
and heterogeneous data generated by a diversity of sources in
urban spaces through video and image recognition, data mining
and machine learning technologies. With these, city councils and urban planners will be able to make better decisions for the community.
In November 2017, Alibaba Cloud ET City Brain was selected
as one of the first four AI innovation platforms by the Ministry of
Science and Technology, marking a major contribution of Chinese technology to urban areas worldwide.
On January 29, 2018, the Malaysia Digital Economy Corp
(MDEC) and the Dewan Bandaraya Kuala Lumpur (DBKL) jointly
announced the introduction of Alibaba Cloud ET City Brain. The
AI will be fully applied to Malaysia's traffic management, urban
planning, and environmental protection. This was the first time the City Brain served customers outside China.
In the three years since its birth, the City Brain has been launched in Hangzhou, Shanghai,
Chongqing, Suzhou, Haikou, Beijing, Chengdu, Quzhou, Jiaxing,
Kuala Lumpur, Macao, and many other cities.
2 Overview of the City Brain
In this project, the challenges we face all revolve around three keywords: cost, value, and difference: whether the cost of such a computation-, storage-, and network-intensive task is manageable; whether the technology is ready to extract value from those data; and whether that value is sufficiently significant. An even sharper question is how the system differs from ‘video surveillance’ and ‘edge computing’.
These questions can be well answered by taking a closer look at
the City Brain (Fig. 1). First, we have a bunch of data from the city,
including the video data. The first step is to acquire the data and
understand the data. We call this step ‘Cognition’, which includes
recognising what is on the road and what is happening on the road, such as cars, people, cyclists, traffic status, accidents, etc. [1].
Then, in the second step, ‘Decision and Optimisation’, we make
decisions or optimise the ways we run the city based on the
cognitive results, e.g. automatic accident alerting [2], traffic light
optimisation. Thereafter, in the ‘Search and Mining’ step, we put
everything the cameras have seen into a database and build an
index, thus we can apply search on this data. For example, we find
a suspicious car or discover patterns in the data, such as finding the
root cause of traffic congestion somewhere in the city [3].
Next, based on current and historical data, we can predict what
is going to happen next, either in the short term, such as the probability that an intersection is congested 20 min later, or for the next day, such as a road section's accident probability given the city's weather conditions and event information.
Last, based on predicted results, resources can be pre-allocated
to respond to those situations more effectively. For example, if we
know that the probability of accidents will triple given tomorrow's bad weather and a few events that will gather large numbers of people, we can adjust the traffic lights and send out traffic advice to prevent those incidents from happening. We call
this ‘Prediction’ and ‘Intervention’.
In the remaining part of this paper, we will present more details
about the aforementioned parts, as well as the specifically designed
large-scale visual computing platform.
3 Large-scale visual computing platform
3.1 System overview
With the rapid development of urbanisation, a large amount of video data is generated every day in a city. These videos play critical roles in city management, public safety, traffic control, environmental protection, etc. However, video data is unstructured; how to effectively store, analyse, and further take advantage of these videos has been a worldwide problem.
To address this problem, our team built a large-scale visual computing platform to meet the requirements of real-time, comprehensive, large-scale smart video analysis, making joint perception, prediction, alarm, and prevention in smart city management possible for the first time.
The overall architecture of the platform is illustrated in Fig. 2. It comprises three core systems, namely ‘the Access and Transmitting system’, ‘the Computing system’, and ‘the Searching system’. The access and transmitting stage performs data accessing, data pre-processing, data resource scheduling, data transmitting, and video streaming.
Based on the stream-processing framework (Flink [4]), the
computing system has the following key capabilities: batch
computing, stream computing, model parallelisation, model
scheduling, graphical calculation, and atlas calculation. These key
techniques are able to support the top-level applications such as
online/offline video analysis, trajectory tracking, feature quantisation, etc.
The searching system consists of the large-scale search engine,
online feature extraction service, and search strategy engine. The
search engine performs real-time index compression. Online
feature extraction is responsible for extracting features of the city
objects from video frames. The search strategy engine links the
former two modules and provides an image search service to target
customers.
The visual computing platform can be deployed on the cloud. It
could be shared and reused through the cloud resource pool, fully
exploiting the efficiency of multi-core and ensuring elastic
expansion. Besides, by means of the peak staggered multiplexing,
the platform achieves flexible and efficient resource utilisation.
The distributed deployment of cloud hosts can provide intelligent analysis capability on demand, thus improving analysis efficiency. With the large-scale visual
computing platform, we provide the capabilities of AI, large-scale
data processing and cloud computing to the upper-level application
layer, allowing customers to focus on business innovation.
3.2 Key technical details
3.2.1 Distributed heterogeneous scheduling engine: The
large-scale video computing resource scheduling system manages
the cloud video computing resources and dynamically adjusts the
resource allocation to best utilise the computing ability [4, 5]. Its
core functions include single-node heterogeneous computing
scheduling, distributed heterogeneous computing resource
scheduling, and distributed task dynamic allocation.
Single-node heterogeneous computing scheduling: this part
evaluates the model's requirements for computing resources, and
allocates appropriate heterogeneous computing resources (central
processing unit, graphics processing unit etc.) and model operating
parameters to the model on a single node according to the actual
configurations of the machines. In this way, we can improve the
resource utilisation rate as well as the number of video streams that
can be processed on a single node.
Distributed Heterogeneous Computing Resource Scheduling:
this part analyses and evaluates the computing resources for all
tasks running on the streaming computing platform and allocates
Fig. 1 100 feet view of the City Brain
Fig. 2 Architecture of the large-scale visual computing platform
different tasks to various computing nodes according to the
composition of heterogeneous computing resources and the
resource requirements of different tasks. By using the cloud
resource pool to share and reuse, the multi-core efficiency can be
fully utilised to ensure flexible expansion, thus improving the
utilisation of heterogeneous resources of the entire cluster and
finally reducing the energy consumption of the entire cluster.
Dynamic allocation of distributed tasks: Due to changes in time
periods and scenarios, resource requirements for different tasks
may change dramatically across time and space. Distributed task
dynamic allocation performs real-time statistical analysis on the
running status of tasks and effectively redistributes these tasks.
Flexible and efficient resource utilisation can be achieved through
peak staggered multiplexing.
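To make the scheduling flow concrete, the following minimal Python sketch greedily places tasks onto heterogeneous nodes. The resource model, the largest-demand-first policy, and all names here are simplifying assumptions for illustration, not the production engine's logic.

```python
def assign_tasks(tasks, nodes):
    """Greedy sketch of heterogeneous task placement (Section 3.2.1).

    tasks: list of (task_id, {'cpu': c, 'gpu': g}) resource demands;
    nodes: {node_id: {'cpu': free, 'gpu': free}} remaining capacity.
    Both the resource model and the placement policy are illustrative
    assumptions, far simpler than the production scheduler.
    """
    placement = {}
    # Place the most demanding tasks first so large jobs are not starved.
    for task_id, need in sorted(tasks, key=lambda t: -sum(t[1].values())):
        fits = [n for n, free in nodes.items()
                if all(free[k] >= v for k, v in need.items())]
        if not fits:
            placement[task_id] = None  # hold until resources free up
            continue
        # Prefer the node with the most total headroom (spreads load).
        best = max(fits, key=lambda n: sum(nodes[n].values()))
        for k, v in need.items():
            nodes[best][k] -= v
        placement[task_id] = best
    return placement
```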
3.2.2 Graph computation: Traditional video analysis systems
mainly focus on recognising certain objects within frames, which is
far from enough for scene perception. To fully understand the
scene, we need not only to recognise each separate object, but also
analyse the relationships among these objects. Towards this end, the scene graph is designed to model the relationships among objects.
Thereafter, the graph can be indexed and retrieved to support upper
layer applications such as searching and prediction based on the
scene graph. To achieve this goal, our large-scale visual computing
platform is designed to support the functions of graph indexing [6]
and graph searching, which will be detailed in the following
sections.
Graph indexing and graph query: graph indexing is a very important pre-processing step in graph query. A unique index guarantees the uniqueness of each row of data in a database table; more importantly, an index can greatly speed up data retrieval, which is the main reason for creating one. However, creating and maintaining an index takes extra time and physical space, which increases the maintenance cost of the data.
To enable query based on graph data, the large-scale visual
computing platform adopts the state-of-the-art index and search
techniques. By taking the relationships among the graph nodes into
consideration, we can make globally optimised predictions and
interventions on the real-time city events.
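As an illustration of how indexing graph data by its components supports relationship queries, consider the minimal sketch below. The (subject, relation, object) triple schema and the class interface are assumptions for exposition, not the platform's actual storage format.

```python
from collections import defaultdict

class SceneGraphIndex:
    """Minimal scene-graph index sketch (Section 3.2.2).

    Each observation is a (subject, relation, object) triple, e.g.
    ('car_17', 'behind', 'bus_3'). Triples are indexed by every
    component so relationship queries avoid full scans.
    """
    def __init__(self):
        self.triples = []
        self.by_part = defaultdict(set)  # component -> triple ids

    def add(self, subj, rel, obj):
        tid = len(self.triples)
        self.triples.append((subj, rel, obj))
        for part in (subj, rel, obj):
            self.by_part[part].add(tid)

    def query(self, *parts):
        """Return all triples containing every given component."""
        ids = set.intersection(*(self.by_part[p] for p in parts))
        return [self.triples[t] for t in sorted(ids)]

# g = SceneGraphIndex(); g.add('car_17', 'behind', 'bus_3')
# g.query('behind', 'bus_3')  # -> [('car_17', 'behind', 'bus_3')]
```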
3.2.3 Model quantisation and acceleration: To efficiently
execute deep models on the proposed large-scale visual computing
platform, we introduce network quantisation techniques to reduce the computation load [7].
Our work is devoted to quantising full-precision networks into
low-bit networks. Existing methods formulate the low-bit
quantisation of networks as an approximation or optimisation
problem. Approximation-based methods confront the gradient
mismatch problem, while optimisation-based methods are only
suitable for quantising weights and can introduce high
computational cost during the training stage. In our large-scale
visual computing platform, we provide a simple and uniform way
for weights and activations quantisation by formulating it as a
differentiable non-linear function. As shown in Fig. 3, the
quantisation function is formed as a linear combination of several
Sigmoid functions with learnable biases and scales. In this way, the
proposed quantisation function can be learned in a lossless and
end-to-end manner and works for any weights and activations in
neural networks, thereby avoiding the gradient mismatch problem.
It can further be trained via continuous relaxation of the steepness
of the Sigmoid functions (shown in Fig. 4).
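The following PyTorch sketch illustrates the idea of a quantisation function built as a learnable linear combination of sigmoids with a temperature; the initialisation and the temperature schedule are illustrative assumptions rather than the exact settings used on the platform.

```python
import torch
import torch.nn as nn

class SoftQuantizer(nn.Module):
    """Differentiable quantisation as a sum of scaled, shifted sigmoids.

    A minimal sketch of Section 3.2.3: as the temperature grows, each
    sigmoid approaches a step and the output approaches discrete levels.
    """
    def __init__(self, n_levels: int = 4):
        super().__init__()
        # One sigmoid per quantisation step; biases spread the steps out.
        self.scales = nn.Parameter(torch.ones(n_levels - 1))
        self.biases = nn.Parameter(torch.linspace(-1.0, 1.0, n_levels - 1))
        self.temperature = 1.0  # raised gradually during training (plain attr in this sketch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = torch.zeros_like(x)
        # Linear combination of sigmoids with learnable biases and scales.
        for s, b in zip(self.scales, self.biases):
            y = y + s * torch.sigmoid(self.temperature * (x - b))
        return y
```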
4 Cognition
4.1 System overview
City management involves a lot of data resources. Video data, with
its intuitive, massive, and real-time characteristics, is an important
part of the data resources of the city. The traditional way of city
patrolling mainly relies on laborious manual monitoring. In
contrast, through the processing and analysis of massive video, the
cognition system can not only obtain the running status of the
urban public area in real time, but also detect abnormal events in
specific areas in time. According to the architecture shown in
Fig. 5, the system consists of three main stages: visual data access
stage, multimedia processing stage, and visual algorithm
application stage.
In the visual data access stage, video resources from different
manufacturers are accessed through standard video protocols. The
system has the ability to access large-scale video data based on the
cloud platform, which meets the demand of comprehensive city
cognition. The accessed data includes online video streams, offline
video files, and static images, which will further be preprocessed
and transcoded at the multimedia processing stage.
In the multimedia processing stage, visual data is transmitted to
the system through the local area network of a city. The large-scale
video and images are decoded, transcoded, or preprocessed in this
stage. Furthermore, this stage also collects parameters of video
sources including camera position and alarm configurations to
comprehensively manage multimedia information.
In the visual algorithm application stage, the all-time all-area
cognition system integrates fundamental tasks such as image
recognition, object detection, object tracking, scene recognition,
and anomaly detection. These tasks are formed into independent
modules to support top-level algorithm applications. Specifically,
traffic accident perception integrates image recognition, object
detection, and object tracking tasks. The road congestion
perception involves object detection and object tracking tasks. The
sudden violence event perception is based on scene recognition and
anomaly detection tasks. Object detection and anomaly detection tasks are utilised to raise alarms for persons and vehicles in restricted areas. Based on the aforementioned rich top-
level visual algorithm applications, the system is further applied to
a variety of public scenes in the city, such as transportation,
subway, campus, and community.
Fig. 3 Quantisation function for a neural network
Fig. 4 Relaxation process of a quantisation function during training,
which goes from a straight line to steps as the temperature T increases
(a) No quantisation, (b) T = 1, (c) T = 11, (d) T = 121, (e) Complete quantisation
Fig. 5 Architecture of the cognition system in City Brain
4.2 Key technical details
The all-time all-area city cognition system pursues a precise
understanding of a variety of scenarios. It recognises what is on the
road and what is happening on the road before making decisions or
alarms. In this section, we will introduce our object detection and
anomaly detection methods deployed in this system.
4.2.1 Object detection and tracking: Object detection is one of
the core tasks in cognition problems. In the cognition system,
detecting objects on the road, such as vehicles and pedestrians, is
the primary step for perception applications. Therefore, the high
accuracy of the detection algorithm is a prerequisite for subsequent
applications. We have devoted great effort to object detection research.
For vehicle detection, we proposed a scheme, which is
illustrated in Fig. 6, based on multi-task deep convolutional neural
networks (CNN), region-of-interest (RoI) voting, and multi-level
localisation, denoted by RV-CNN [1]. In the design of CNN
architecture, we enriched the supervised information with
subcategory, region overlap, bounding-box regression, and
category of each training RoI as a multi-task learning framework.
This design allows the CNN model to share visual knowledge
among different vehicle attributes simultaneously, and thus,
detection robustness can be effectively improved. We introduced
the subcategory classification task to enforce the CNN model to
learn a good representation for vehicles under different occlusions,
truncations, and viewpoints. In addition, we utilised the CNN
model to predict the offset direction of each RoI boundary toward
the corresponding ground truth. Then, each RoI could vote those
suitable adjacent bounding boxes, which are consistent with this
additional information. For clarity, suppose a predicted box has coordinates $b = \{x_1, y_1, x_2, y_2\}$ and score $s$, and denote its neighbouring RoIs by $B$, the number of RoIs in $B$ by $N$, and the $i$th RoI, with assigned score $s^i$ and predicted directions $D_l^i, D_t^i, D_r^i, D_d^i$, by $b^i = \{x_1^i, y_1^i, x_2^i, y_2^i\}$. Then we formulate the voting scheme as

$$s^* = s + \lambda \sum_{\beta \in \{l, t, r, d\}} \sum_{i = 1}^{N} R_\beta(b, b^i) \qquad (1)$$

in which

$$R_l(b, b^i) = \begin{cases} s^i, & \text{if } x_1 < x_1^i \text{ and } D_l^i = \text{go to left}, \\ -s^i, & \text{if } x_1 < x_1^i \text{ and } D_l^i = \text{go to right}, \\ -s^i, & \text{if } x_1 > x_1^i \text{ and } D_l^i = \text{go to left}, \\ s^i, & \text{if } x_1 > x_1^i \text{ and } D_l^i = \text{go to right}. \end{cases} \qquad (2)$$
The other $R_\beta(b, b^i)$ functions follow the same rule as $R_l(b, b^i)$. After voting, the scores of all predicted boxes are recomputed, and the voting results are combined with the score of each RoI itself to find a more accurate location from a large number of candidates.
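A minimal sketch of the voting rule in (1) and (2) is given below; the direction encoding ('low'/'high' per boundary) and the data layout are assumptions made for illustration.

```python
def roi_vote(box, score, neighbours, lam=0.5):
    """Re-score one predicted box by RoI voting, a sketch of (1)-(2).

    box: (x1, y1, x2, y2); neighbours: list of (nbox, nscore, ndir),
    where ndir maps each boundary 'l'/'t'/'r'/'d' to the neighbour's
    predicted offset direction ('low' = toward smaller coordinate,
    'high' = toward larger).
    """
    # Which coordinate each boundary compares: l/t use x1/y1, r/d use x2/y2.
    coord = {'l': 0, 't': 1, 'r': 2, 'd': 3}
    total = 0.0
    for nbox, nscore, ndir in neighbours:
        for b, c in coord.items():
            # An agreeing direction supports the box (+s_i);
            # a disagreeing one penalises it (-s_i), as in (2).
            expected = 'low' if box[c] < nbox[c] else 'high'
            total += nscore if ndir[b] == expected else -nscore
    return score + lam * total
```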
For pedestrian detection, we introduced a previewer block [8]
which previews the objectness probability for the potential
regression region of each prior box, using the stronger features
with larger receptive fields and more contextual information for
better predictions. The proposed previewer blocks preselect regions
with high confidences containing objects by involving enough
contextual information. The detector then classifies and relocates
the prior boxes in these regions. In addition, we introduced a new metric, the intersection-of-ground-truth (IoG) ratio, to formulate the containment relations between the previewer region and ground-truth bounding boxes:
$$\mathrm{IoG}_{i,j}^{l} = \max_{n = 1, 2, \ldots, N} \frac{\mathrm{area}\left(P_{(i,j)}^{l} \cap GT_n\right)}{\mathrm{area}(GT_n)}$$

$$\mathrm{status}_{i,j}^{l} = \begin{cases} 1, & \mathrm{IoG}_{i,j}^{l} = 1 \text{ and } \mathrm{IoG}_{i,j}^{\eta} < 1, \ \eta = 1, \ldots, l - 1, \\ -1, & \mathrm{IoG}_{i,j}^{l} < 0.8, \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$
where N is the number of ground-truth objects. An object is
completely contained by the previewer region when IoG = 1.0, and
we assign a positive label to this region. A previewer region will
get a negative label if IoG < 0.8. Furthermore, the label of a larger
region which contains an object is set to be ignored (neither positive nor negative during training) when that object is already contained in a smaller previewer region. With the previewer blocks, plenty of small-scale false positives are eliminated during inference, and we achieved strong performance on pedestrian detection.
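For reference, the IoG ratio in (3) can be computed per region/ground-truth pair as in the following sketch, where boxes are (x1, y1, x2, y2) tuples.

```python
def iog(region, gt):
    """Intersection-of-Ground-truth ratio: area(region ∩ gt) / area(gt).

    A minimal sketch of the quantity maximised over ground truths in (3).
    """
    ix1, iy1 = max(region[0], gt[0]), max(region[1], gt[1])
    ix2, iy2 = min(region[2], gt[2]), min(region[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / gt_area if gt_area > 0 else 0.0
```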
Besides, we use the well-known kernelised correlation filters [9] for multiple-object tracking based on object detection results. Object tracking effectively maps the corresponding detected objects between different frames. Combined with object detection, the tracking module first constructs the trajectories of vehicles and pedestrians over a period of time and then identifies target behaviours.
4.2.2 Event detection: Anomalous event detection in real-world
video scenes is a challenging problem due to the complexity of
‘anomaly’ as well as the cluttered backgrounds, objects and
motions in the scenes. Most existing methods use hand-crafted
features in local spatial regions to identify anomalies. We proposed
a Spatio-Temporal AutoEncoder (ST AutoEncoder or STAE) [2],
which utilises deep neural networks to learn video representation
automatically and extracts features from both spatial and temporal
dimensions by performing three-dimensional (3D) convolutions.
Fig. 7 shows the details of the framework: an encoder followed by
two branches of decoder for reconstructing past frames and
predicting future frames, respectively.
In addition to the reconstruction loss used in existing typical
autoencoders, we introduced a weight-decreasing prediction loss
for generating future frames, which enhances the motion feature
learning in videos. Specifically, the reconstruction branch and the
prediction branch share the same hidden feature layer but perform
different tasks: reconstructing the past sequence and predicting the
future sequence, respectively. The prediction task guides the model
to capture the trajectory of moving objects and enforce the encoder
to better extract the temporal features. The prediction loss is
formulated by:
Fig. 6 Illustration of RV-CNN multi-task framework. RoI pooling layer is proposed to extract features for each RoI. Then the pooled features are used for
category classification, bounding box regression, overlap prediction, and subcategory classification
$$L_{\mathrm{pred}} = \frac{1}{N} \sum_{i = 1}^{N} \frac{1}{T^2} \sum_{t = 1}^{T} (T - t) \left\| X_{i+T}^{t} - f_{\mathrm{pred}}(X_i)^{t} \right\|_2^2 \qquad (4)$$

where $X_i$ is the input hyper-cuboid, $f_{\mathrm{pred}}(X_i)$ is the output of the prediction branch, $X_{i+T}$ is the ground truth of the future $T$ frames, and the superscript $t$ in $X^t$ denotes the $t$th frame of the video clip $X$. The $t$th frame has a weight of $T - t$, which decreases as $t$ increases.
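A minimal PyTorch sketch of this weight-decreasing loss is shown below, assuming a (N, T, C, H, W) tensor layout for the predicted and ground-truth future frames; the layout is an assumption for illustration.

```python
import torch

def weighted_prediction_loss(pred, target):
    """Weight-decreasing prediction loss, a sketch of Eq. (4).

    pred and target: (N, T, C, H, W) tensors of N clips with T future
    frames each. Frame t carries weight (T - t), so nearer frames count more.
    """
    n, t = pred.shape[0], pred.shape[1]
    # Weights T-1, T-2, ..., 0 for frames t = 1 .. T.
    weights = torch.arange(t - 1, -1, -1, dtype=pred.dtype, device=pred.device)
    # Squared L2 error per frame, collapsed over channels and pixels: (N, T).
    per_frame = ((pred - target) ** 2).flatten(2).sum(dim=2)
    return (weights * per_frame).sum() / (n * t ** 2)
```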
With the anomaly detection framework, the all-time all-area
patrolling and alerting system can detect abnormal events in a
variety of scenarios in real time, and then notify the city manager
in the form of alarms. Real-time alarms for anomalous events detected through video surveillance can help government officials quickly detect and even prevent abnormal emergencies, ensuring the public safety and operational efficiency of a city.
5 Decision and optimisation
5.1 System overview
Based on the acquisition, integration, and analysis of big and
heterogeneous data generated by a diversity of sources in urban
spaces, the City Brain can optimise the flow of vehicles and traffic
signals, and upgrade the city governance and decision-making on
traffic command and road construction. The whole decision and
optimisation system is depicted in Fig. 8, which consists of three main stages: the data perception stage, the data fusion stage, and the decision and optimisation stage.
In the data perception stage, data from various sources in urban
spaces and departments are collected and analysed. First is the
video data, including general video streams and bayonet camera
streams. Traffic accidents (collision, jam etc.) and traffic
parameters (road traffic flow, traffic light status, traffic volume and
speed in particular lanes etc.) are generated from these video
streams. For map data, high-definition map with road network
topology, origin-destination data, floating car data, and reported
incidents from the public are collected. For structured traffic data,
SCATS data, induction coil data, and bayonet car-passing data are
collected. Meteorological data mainly contains the weather and
temperature data. Road administration data consists of information
about road infrastructure, road marking, and road construction
status.
In the data fusion stage, the first layer contains multi-modality
data fusion module and data quality management module. For
multi-modality data fusion, AI is adopted to merge all structured
summaries of data from the perception stage into a single-centre
data platform. Besides, the data quality management module filters
out invalid data, reduces replicated data, and completes missing
data based on the synthesis of information from different sources.
The second layer is about unifying traffic evaluations, traffic
parameters, and traffic representations. Unified traffic evaluations
consist of flow speed, delay, line length etc. Unified traffic
parameters include lane parameters, intersection parameters, road
parameters, and area parameters. Unified traffic representations are
map representations, video representations, and structured traffic
representations.
In the decision and optimisation stage, based on the unified
summaries of structured traffic data, intelligence algorithms are
adopted for traffic signal optimisation, traffic organisation
optimisation, traffic guidance, traffic command and dispatch. For
traffic signal optimisation, traffic light timing schedule is
dynamically adjusted to improve mobility of an intersection, road
or area. For traffic organisation optimisation, the system tries to
optimise the spatial distribution and function configuration of the
city road network. For traffic guidance, the quickest outbound routes
are planned for the public in order to avoid traffic incidents or
traffic jams. Specifically, when faced with emergencies, by
integrating and analysing real-time data, the system can optimise
urban traffic flow such as by identifying the quickest route for
emergency vehicles to arrive at the scene within the shortest time
frame. For traffic command and dispatch, the system automatically
performs traffic accidents reporting, monitoring, and disposition.
More importantly, all the traffic patrolmen are dynamically
dispatched for each accident, which improves the efficiency of
traffic management.
Based on the aforementioned descriptions, we can see that this
system can be applied to many scenarios for city management,
such as city traffic monitoring, traffic flow guidance, city road
construction planning etc.
5.2 Key technical details
5.2.1 Real-time road traffic prediction with spatial–temporal
correlations: The spatiotemporal relationship is an essential aspect
of road traffic prediction. The fundamental observation is that the
traffic condition at a link is affected by the immediate past traffic
conditions of some number of its neighbouring links. A time lag
function defines how traffic flows are related in the temporal
dimension. In parallel, the spatial structure defines which
neighbouring links have an effect on the traffic characteristics of
other links, as a function of road type, speed, etc.
We developed a new method that provides a complete description
of the most important spatiotemporal interactions in a road network
while maintaining the estimatability of the model [10]. It improves
upon existing methods proposed in the area and provides high
accuracy on both urban and expressway roads. We adopt a
multivariate spatial–temporal autoregressive (MSTAR) model to
account for transient behaviour on the traffic network. The standard
Vector-ARMA(p,q), or VARMA(p,q), model is
$$\left(I - \sum_{d = 1}^{p} \Phi_d B^d\right) X_t = \left(I + \sum_{d = 1}^{q} \Theta_d B^d\right) a_t \qquad (5)$$
Fig. 7 Architecture of the network. An encoder followed by two branches of decoder for reconstructing past frames and predicting future frames, respectively
Fig. 8 Architecture of the decision and optimisation system in the City
Brain
This transient model accounts for both spatial and temporal
interactions but does not respond to needs for parsimony in the
model definition. To respond to that requirement, we make use of
decomposition of time into intervals, or templates, $r = 1, \ldots, R$, that permit combining time periods into like sets.
Furthermore, we make use of the data history to induce not only
a set of mean values for the speed and volume but in parallel a set
of spatial matrices. In other words, each reference period, $i = 1, \ldots, I$, has associated with it a spatial correlation matrix which corresponds best, on average, to the relevant neighbouring links during the period.
The resulting parsimonious transient model is thus defined as
$$\sum_{l = 1}^{p} \sum_{i = 1}^{I} \Phi_{lir} S_{ri} X_{t-l, r} = a_t + \sum_{j = 1}^{q} \sum_{i = 1}^{I} \Theta_{jir} S_{ri} a_{t-j, r} \qquad (6)$$
The proposed traffic prediction algorithm is implemented and tested against actual traffic volume/speed over a medium-sized road network on a real-time basis. The road network consists of 502 links (149 category A, 246 category B, 29 category C, 38 category D, 22 category E, and 18 slip-road). The forecast, up to one hour ahead, is issued every 5 min using the most recent actual traffic data.
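To illustrate the flavour of such spatial-temporal autoregression, the sketch below fits a drastically simplified model with one scalar coefficient per lag; the full MSTAR model of (6) uses per-period spatial matrices and is far richer, so this is only an assumed toy parameterisation.

```python
import numpy as np

def fit_star(X, S, p=2):
    """Least-squares fit of a tiny spatial-temporal AR sketch (cf. Eq. (6)).

    X: (T, L) speeds for L links over T steps; S: (L, L) spatial weight
    matrix selecting each link's relevant neighbours. Assumed model:
    X_t ≈ sum_{l=1..p} phi_l * (S @ X_{t-l}).
    """
    T, L = X.shape
    rows, targets = [], []
    for t in range(p, T):
        # Stack lagged, spatially mixed observations as regressors: (L, p).
        rows.append(np.stack([S @ X[t - l] for l in range(1, p + 1)], axis=1))
        targets.append(X[t])
    A = np.concatenate(rows)      # ((T-p)*L, p) design matrix
    y = np.concatenate(targets)   # ((T-p)*L,) targets
    phi, *_ = np.linalg.lstsq(A, y, rcond=None)
    return phi                    # one coefficient per lag
```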
5.2.2 Vehicular traffic prediction with link interactions and
multiple data sources: In order to estimate a vehicle arrival time,
we invent a system which receives information representing prior
travel times of vehicles between pre-determined vehicle stops
along a vehicle route [11, 12].
The system comprises a memory device and a processor being
connected to the memory device. The system receives information
representing prior travel times of vehicles between vehicle stops
along a vehicle route. The system receives real-time data
representing a current journey. The current journey refers to a
movement of a vehicle currently traveling along the route. The
system calculates a regular trend representing the current journey
based on the received prior travel times information and the
received real-time data. The system computes a deviation from the
regular trend in the current journey. The system determines a future
traffic status in subsequent vehicle stops in the current journey. The
system estimates, for the vehicle, each arrival time of each
subsequent vehicle stop based on the calculated regular trend, the
computed deviation, and the determined future traffic status.
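A toy sketch of the trend-plus-deviation idea follows. The exponentially decaying deviation term and the assumption that the journey covers segments in order are our own simplifications for illustration, not the patented estimator.

```python
def estimate_arrivals(prior_means, recent_actual, decay=0.7):
    """Arrival-time sketch: historical trend plus a propagated deviation.

    prior_means[k]: historical mean travel time of segment k along the
    route; recent_actual: (segment_index, observed_time) pairs for the
    segments already travelled, assumed to be 0..len(recent_actual)-1.
    """
    # Deviation of the current journey from its historical trend.
    deltas = [obs - prior_means[k] for k, obs in recent_actual]
    dev = sum(deltas) / len(deltas) if deltas else 0.0
    eta, etas = 0.0, []
    start = len(recent_actual)
    for step, k in enumerate(range(start, len(prior_means))):
        # Future segments: trend plus a deviation that fades with distance.
        eta += prior_means[k] + dev * (decay ** step)
        etas.append(eta)
    return etas  # cumulative ETA at each remaining stop
```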
5.2.3 Providing navigational guidance using the states of
traffic signals: We invented a method and apparatus by which vehicular traffic predictions can be calculated both more accurately and faster than with conventional methods, and which can be used in the presence of missing real-time data [13]. The missing data is
estimated using a calibration model comprising historical data, periodically updated, from select links constituting a relationship vector.
The missing data can be estimated off-line whereafter it can be
used to predict traffic for at least a part of the network, the traffic
prediction being calculated by using a deviation from historical
traffic on the network. The invention further discloses a method for
in-vehicle navigation; and a method for traffic prediction for a
single lane.
First, as shown in step 101 (Fig. 9), one must perform a division of time and space into, preferably, relatively homogeneous subsets.
An example of dividing time into relatively homogeneous intervals
is to consider each day of the week and each hour of the 24-hour
day separately. As regards spatial decomposition, the network in the exemplary embodiment is also divided into the links included in
the network. In step 102 a relationship vector for every network
link to be predicted is defined. The relationship vector for each link
contains the other links of the network whose traffic has an impact
on that link. Once these steps are performed, the next step 103 of
the method exemplarily described herein is to compute off-line
average-case estimates of the traffic for each link and for each time
period.
This method provides an exemplary technique for determining
the traffic state characteristics (e.g. speed, density, flow, etc.) that
best characterise the progression of that state into the future.
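The off-line average-case estimates of step 103 can be sketched as a simple aggregation over homogeneous (weekday, hour) bins; the record layout is assumed for illustration.

```python
from collections import defaultdict

def offline_averages(records):
    """Off-line average-case traffic estimates (sketch of steps 101-103).

    records: iterable of (link_id, weekday, hour, speed) tuples. Time is
    divided into homogeneous (weekday, hour) bins as in the text.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for link, weekday, hour, speed in records:
        key = (link, weekday, hour)
        sums[key] += speed
        counts[key] += 1
    # Mean speed per link and time bin.
    return {key: sums[key] / counts[key] for key in sums}
```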
6 Search and mining
6.1 System overview
In the ‘Search and Mining’ system, we aim to put everything the cameras have seen into a database, so that we can search over the indexed data. Towards this end, we propose a progressive
video search engine to localise objects, such as missing people and
hit-and-run vehicles, among the tremendous volume of videos
quickly through progressive human–machine interactions. The
architecture of the progressive video search engine is shown in
Fig. 10. The system consists of three major stages, including
stream accessing stage, visual structuring stage, and large-scale
visual search stage. Many related technologies are used in this
progressive video search engine, among which are video content
structuring, target re-identification (ReID), indexing, and searching
strategies.
In the stream accessing stage, the platform accesses the sensor data of the city, including various cameras, MAC signals, GPS signals, Internet data, etc. Specifically, the visual data from different manufacturers is accessed through standard video protocols. Based on the cloud platform, unified resource scheduling, comprehensive analysis, and reliable storage can be easily realised. The obtained data is then fed into the visual structuring stage to be transformed into unified standard structured data.
In the visual structuring stage, we use deep learning algorithms
to analyse the information of pedestrians, non-vehicles, and
vehicles based on real-time video content captured from cameras
deployed in the city. Specifically, object detection, scene
recognition, and attribute recognition algorithms are employed to
extract the perceived objects (i.e. pedestrians, non-vehicles,
vehicles, and events) and the corresponding attribute features. For
example, we consider gender, age, and clothing style for pedestrians, and colour, type, and moving direction for vehicles. The
generated unified standard structured data is used to finally support
various applications of the ecosystem through the search engine.
In the visual search stage, we build a database to visually index
the whole city and a large-scale search engine for city object
retrieval. Generally, there are two phases here. In the first phase,
the representative features from the pixels are effectively extracted
and stored in the database. In the second phase, the queries, i.e.
high-dimensional features calculated from a query image, are fed
into the database. The accuracy and recall of the search process are
Fig. 9 Flowchart of an exemplary prediction algorithm
Fig. 10 Architecture of the Search and Mining System in City Brain
guaranteed with the help of effective indexes combined with high-
dimensional global and local features. It is worth noting that
challenges may arise in real-world scenarios. For instance,
performance loss would appear as the data expands in both volume and dimension. In order to tackle such challenges, different indexing structures, including the M-tree, R-tree, k-d tree, etc., should be implemented on top of the database. Furthermore, the
proposed search engine performs search with great efficiency,
where a single query among hundreds of billions of images can be
executed within one or several hundred milliseconds.
Based on the introduced architecture, the progressive video
search engine is widely applied in various scenarios of the city,
such as security, transportation, environmental protection, and
community service.
6.2 Key technical details
Person ReID is at the core of progressive video search engine.
Given a query person, the task aims at matching the same person
from multiple non-overlapping cameras. Compared with other
image search tasks, person ReID is still very challenging due to the
following reasons: (1) dramatic background variations caused by
different images from different cameras, (2) significant variations
in visual appearance caused by changes in human pose across time
and space, and (3) clutter or occlusions. In this section, we will
introduce our efforts in image-based person ReID, video-based
person ReID, and large-scale similarity search.
6.2.1 Image-based person ReID: We first propose a novel deep
Siamese architecture [3] based on CNN and multi-level similarity
perception. According to the distinct characteristics of diverse
feature maps, we effectively apply different similarity constraints
to both low-level and high-level feature maps, during the training
stage. Fig. 11 shows the overall architecture of the proposed
network at the training stage. Our network can efficiently learn
discriminative feature representations at different levels, which
significantly improves the ReID performance. Besides, the
proposed framework has two additional benefits. First,
classification constraints can be easily incorporated into the
framework, forming a unified multi-task network with similarity
constraints. For concrete demonstration, we separately optimise
similarity constraints on low-level feature map (e.g. Pool1 layer)
and a high-level feature map (e.g. the FC7 layer). Meanwhile, a softmax loss is also utilised to optimise the classification constraints.
Second, as similarity comparable information has been encoded in
the network's learning parameters via back-propagation, pairwise
input is not necessary at test time. That means we can extract
features of each gallery image in an off-line manner and combine
with the indexing techniques to further improve the retrieval
efficiency, which is essential for large-scale real-world
applications. Experimental results on two large data sets CUHK03
[14] and Market-1501 [15] demonstrate that our method
outperforms the current state-of-the-art approaches by large
margins, and we also achieve competitive performance on the
small-size data set CUHK01 [16].
Since the human body consists of well-defined parts, i.e. head, torso, and legs, a better approach to handling the varied appearances caused by pose changes and local differences is a part-based model. To merge the global and local features, we propose a set of
local operations as a generic family of building blocks for
synthesising local and global information in any CNNs layer,
termed Local CNN [17]. This building block can be inserted into
any convolutional modules with only a small amount of prior
knowledge about the approximate locations of local parts. As a complement to the global path, our local path consists of four
components: localisation module, sampling module, feature
extraction module, and fusion module. The localisation module is
designed to locate the positions of head, torso, and legs. The
sampling module is formulated as an explicit 2D form of attention,
yielding local patches of smoothly varying locations and scales.
The feature extraction module consists of several convolution,
ReLU, and batch normalisation layers as in general convolution
blocks. The current form of the feature extraction module is
restricted to one convolutional layer with filter size 3 × 3. The
fusion module is formed as a concatenation layer of global and
local outputs followed by a 1 × 1 convolutional layer. In practice,
any building block of existing backbone CNNs can be viewed as
the global path and the proposed local path can easily be inserted
into these blocks without any change in the training scheme.
Furthermore, the architecture of each component in the local
operations is quite flexible for different configurations. This model
outperforms state-of-the-art attention-based and part-based
methods on three large-scale benchmarks, including Market-1501,
CUHK03, and DukeMTMC-ReID [18].
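The sketch below illustrates the global/local fusion pattern of such a block in PyTorch. The fixed head/torso/legs split stands in for the learned localisation and sampling modules, so it is an illustrative simplification rather than the published architecture.

```python
import torch
import torch.nn as nn

class LocalBranch(nn.Module):
    """Sketch of a Local CNN-style block (Section 6.2.1).

    A global path (any backbone block) runs in parallel with a local
    path that crops part regions, extracts features, and fuses them
    back by concatenation + 1x1 convolution. Part locations are fixed
    here for simplicity; the paper learns them.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.local_conv = nn.Sequential(            # feature extraction module
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # fusion module

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x.shape[2]
        # Crude head/torso/legs split along height as stand-in sampling.
        parts = [x[:, :, :h // 3], x[:, :, h // 3:2 * h // 3], x[:, :, 2 * h // 3:]]
        local = torch.cat([self.local_conv(p) for p in parts], dim=2)
        local = nn.functional.interpolate(local, size=x.shape[2:])  # align sizes
        return self.fuse(torch.cat([x, local], dim=1))
```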
6.2.2 Video-based person ReID: Video-based person ReID
plays an important role in video analysis, expanding image-based
methods by learning features of multiple frames. We propose an
attribute-driven method [19] for feature disentangling and frame
re-weighting. The features of single frames are disentangled into groups of sub-features, each of which corresponds to specific semantic attributes. The sub-features are re-weighted by the confidence of
attribute recognition and then aggregated at the temporal
dimension as the final representation. By means of this strategy, the
most informative regions of each frame are enhanced and
contribute to a more discriminative sequence representation. An
example of our proposed method is shown in Fig. 12. The feature
of one frame is disentangled into several sub-features
corresponding to specific semantic attribute groups. In the
displayed image sequences, frame-1 captures a clear frontal face, so it has a higher weight in the Head group. Since the bag is invisible in frame-1, the weights of the Bag group are mainly concentrated on frame-2 and frame-3. Frame-2 also has the highest weight in the Shoes group. The weights of frame-T are relatively low because of the poor detection bounding box and cluttered background. The re-
weighted sub-features are aggregated at the temporal dimension
and then concatenated as the representation of the input sequence.
We refine the temporal weights to the sub-feature level for
handling various poses, occlusions, and detection localisations
within the sequence.
Extensive ablation studies verify the effectiveness of feature
disentangling as well as temporal re-weighting. The experimental
results on the iLIDS-VID [20], PRID-2011 [21], and MARS [22]
data sets demonstrate that our proposed method outperforms
existing state-of-the-art approaches.
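A minimal sketch of the re-weighting and temporal aggregation follows; the softmax-over-time normalisation and the tensor shapes are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def aggregate_sequence(sub_feats, attr_conf):
    """Attribute-driven temporal aggregation (Section 6.2.2 sketch).

    sub_feats: (T, G, D) per-frame features split into G attribute
    groups; attr_conf: (T, G) attribute-recognition confidences used
    as weights.
    """
    weights = torch.softmax(attr_conf, dim=0)       # per-group weights over time
    weighted = sub_feats * weights.unsqueeze(-1)    # re-weight each sub-feature
    per_group = weighted.sum(dim=0)                 # (G, D) temporal aggregation
    return per_group.flatten()                      # concatenate groups -> (G*D,)
```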
6.2.3 Large-scale similarity search: The visual structuring stage helps to obtain feature representations (i.e. high-dimensional features) for a large number of pedestrians, non-vehicles, and vehicles in the whole city. Then we need to construct a large-scale
retrieval system for efficient similarity search and clustering of
dense vectors. To tackle the challenge of ultra-efficient high-
dimensional similarity search, we propose a high queries-per-
second (QPS) vector search engine, namely CrazySearch.
CrazySearch operates in fast register memory and is flexible enough to be fused with other kernels. Similar to Faiss (https://github.com/facebookresearch/faiss/wiki), we apply coarse quantisation based on product quantisation (PQ), which enables a nearest-neighbour implementation that is 8× faster than prior state-of-the-art methods. Our implementation enables k-NN search
Fig. 11 Illustration of multi-task framework during training. For concrete
demonstration, we separately optimise similarity constraints on low-level
Pool1 layer and high-level FC7 layer
among billions of images with approximately tens of thousands of
QPS. Specifically, a single query delay is 10 milliseconds.
Moreover, we adopt an elastic mechanism for expansion, which
can flexibly expand the distributed systems cluster to handle the
massive volume of data. We apply CrazySearch in the progressive video search scenarios of the City Brain. The key technique used in CrazySearch is the coarse quantiser.
An exhaustive comparison of the query vector with all vectors
is impractical for very large data sets. The coarse quantiser [23] is
designed for non-exhaustive search. It retrieves a candidate set
first, then searches within the candidate set for nearest neighbours
based on PQ [23]. We introduce a modified inverted file structure
[24] to rapidly access the most relevant vectors. A coarse quantiser
is used to implement this inverted file structure, where vectors
corresponding to a cluster (index) are stored in an associated list.
The vectors in the list are represented by short vectors generated by
the product quantiser, which encodes the residual vector with
respect to the cluster center. This approach significantly accelerates
the search at the cost of a few additional bits/bytes per descriptor.
Furthermore, it slightly improves the search accuracy, as encoding
the residual is more precise than encoding the vector itself.
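The same inverted-file-plus-PQ structure can be reproduced with the open-source Faiss library, as in the sketch below. CrazySearch itself is an in-house engine, so the parameters here are purely illustrative.

```python
import numpy as np
import faiss  # https://github.com/facebookresearch/faiss

def build_ivfpq(features, nlist=1024, m=16):
    """IVF + product-quantisation index, the structure of Section 6.2.3.

    features: (n, d) float32 array; m must divide d. nlist, m, nbits,
    and nprobe are illustrative assumptions, not CrazySearch settings.
    """
    d = features.shape[1]
    coarse = faiss.IndexFlatL2(d)                      # coarse quantiser
    index = faiss.IndexIVFPQ(coarse, d, nlist, m, 8)   # 8 bits per sub-vector
    index.train(features)                              # learn centroids + codebooks
    index.add(features)                                # encode residuals into lists
    index.nprobe = 32                                  # candidate lists per query
    return index

# queries: (nq, d) float32 array; D, I = build_ivfpq(db).search(queries, 10)
# returns distances and ids of the 10 nearest neighbours per query.
```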
7 Prediction and intervention
7.1 System overview
Based on the cognition of the city data mentioned above, further
prediction and intervention are important in many smart city
application scenarios. Different from the previous system, we
project multi-modal data into 3D models for global and
comprehensive prediction and intervention. The system
architecture is shown in Fig. 13, which is mainly divided into data
access stage, data processing stage, algorithm stage, and
application stage.
The data access stage consists of two parts: static offline data
and dynamic real-time data. Offline data is mainly used to
reconstruct city scenes, such as a square, a building and its
surroundings. Offline data mainly includes aerial pictures taken by
unmanned aerial vehicles and Internet photos, as well as design
drawing data of buildings such as Computer-Aided Design (CAD)
and Building Information Modelling (BIM). Real-time data mainly
includes surveillance videos and extensive IoT sensor data.
The data processing stage is designed to process and analyse the
aforementioned data. Three-dimensional models of city scenes can
be obtained from image data and design drawings based on 3D
reconstruction and scene modelling techniques. Utilising the
computer vision technology mentioned above, intelligent analyses such as detection, tracking, crowd counting, and anomaly detection are performed on objects in surveillance videos. For different
application scenarios, IoT sensor devices complement the
perception with other information besides visual information, such
as temperature, humidity, smoke, and so on.
In order to realise the global perception of a city, the perceived operating status of the city from the 2D videos is mapped to the 3D scene in real time through the coordinate mapping algorithm.
Thereafter, crowd counting and forecasting are allowed for specific
3D spaces. Moreover, road planning can be adjusted based on the
directional analysis of traffic flow and crowd flow. In addition,
emergency plans are obtained in advance by simulation in the
constructed virtual scenes. These algorithms can provide service in
various application scenarios such as public security, fire
protection, subway, and campus.
7.2 Key technical details
The most important problem to be addressed in this system is reconstructing digital city scenes. As mentioned above, city scenes can be modelled from images or CAD/BIM data, and the former will be introduced in the following section. For the algorithm stage, we
will also present a graph-based method to predict traffic and
pedestrian flow.
7.2.1 Digital city modelling: Image-based 3D reconstruction is a
widely studied problem [25], and the main procedure is shown in
Fig. 14. Given a set of images taken around the target scene, the
first step is matching features for each image pair. There have been
various algorithms to detect and describe local image key-points, which are divided into two categories: hand-crafted methods [26, 27] and neural network methods [28, 29]. After filtering out the
erroneous matches with RANSAC [30], we can extract the point tracks in
the scene. Each track is a set of feature points from different views
corresponding to the same physical point. The next step is to figure
out the 3D position of each track together with the intrinsic/
extrinsic parameters of each view. The optimisation is performed
iteratively and the most classical algorithm is bundle adjustment
[31], which is extended in the following years [32–34]. Given the
sparse point cloud, Multi-View Stereo [35] is utilised to reconstruct a depth map for each view and generate a dense point
cloud. Finally, the whole model is produced by mesh
reconstruction and texturing.
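The first stage of this pipeline, pairwise feature matching with RANSAC filtering, can be sketched with OpenCV as follows; SIFT stands in for the hand-crafted descriptors of [26, 27], and the ratio and RANSAC thresholds are illustrative assumptions.

```python
import cv2
import numpy as np

def match_pair(img1, img2, ratio=0.75):
    """Feature matching + RANSAC filtering (first step of Fig. 14)."""
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(img1, None)
    k2, d2 = sift.detectAndCompute(img2, None)
    # Lowe's ratio test on 2-nearest-neighbour matches.
    good = []
    for pair in cv2.BFMatcher().knnMatch(d1, d2, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    pts1 = np.float32([k1[m.queryIdx].pt for m in good])
    pts2 = np.float32([k2[m.trainIdx].pt for m in good])
    # RANSAC [30] rejects matches inconsistent with one fundamental matrix.
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
    inliers = mask.ravel().astype(bool)
    return pts1[inliers], pts2[inliers]
```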
Although image-based 3D reconstruction has been successfully applied to modelling various objects, some problems remain in large-scale city scenes. The first problem is the
Fig. 12 Illustration of attribute-driven method. The feature of one frame is
disentangled into several sub-features corresponding to a specific semantic
group
Fig. 13 Architecture of Prediction and Intervention system in the City
Brain
Fig. 14 Main procedure of digital city modelling, including four steps:
image matching, point track extracting, point cloud optimisation, and mesh
reconstruction
data scale. Thousands of photos enter the computation for reconstructing a campus-sized place. The feature matching and parameter optimisation are extremely time-consuming at such a data scale, which can be addressed by the feature indexing and calculation acceleration technology we introduced before.
problem is that the moving city objects (people, vehicles)
appearing in videos need to be projected into the 3D model for
global and comprehensive prediction and intervention. A direct approach is to involve the surveillance images in the reconstruction to obtain the intrinsic/extrinsic parameters of each camera. In order
to deal with the cluttered video frames, background extraction is
performed first. The domain gap between video frames and photos
should also be taken into consideration when selecting the feature
descriptors.
7.2.2 Flow prediction: Accurate prediction for crowd and traffic
flow is the basis of intervention. For example, traffic prediction is
important for the adjustment of the traffic light. However, accurate
traffic forecast is a challenging problem due to the large-scale
problem size, as well as the complex and dynamic nature of
spatiotemporal dependency of traffic flow.
Most existing graph-based CNNs attempt to capture the static
relations while largely neglecting the dynamics underlying
sequential data. We proposed a dynamic spatiotemporal graph-based CNN (called DST-GCNN) [36] that learns expressive features to represent spatiotemporal structures and predict future traffic from the historical traffic flow. In particular, DST-GCNN is
a two-stream network. In the flow prediction stream, we present a
novel graph-based spatiotemporal convolutional (STC) layer to
extract features from a graph representation of traffic flow. Then
several such layers are stacked together to predict future traffic
over time. Meanwhile, the proximity relations between nodes in the
graph are often time variant as the traffic condition changes over
time. To capture the graph dynamics, we use the graph prediction
stream to predict the dynamic graph structures, and the predicted
structures are fed into the flow prediction stream.
The overview of the proposed framework is shown in Fig. 15.
The network consists of two streams: the first stream predicts the dynamic traffic conditions, which are encoded in an affinity matrix. The second stream, equipped with the predicted traffic conditions and the proposed STC layers, first predicts future flow from $t+1$ to $t+T_F-1$, then predicts the target future flow at $t+T_F$.
Predicting the dynamic graph enables DST-GCNN to adapt to
the fast-varying traffic condition. In the future, we plan to apply the
proposed framework to other traffic prediction tasks like pedestrian
crowd prediction.
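A minimal sketch of a graph-then-time convolution in the spirit of the STC layer is shown below; the shapes and the exact mixing rule are assumptions for illustration, not the DST-GCNN definition.

```python
import torch
import torch.nn as nn

class STCLayer(nn.Module):
    """Sketch of a spatiotemporal graph convolution (Section 7.2.2).

    Mixes each node's features with its neighbours via a (possibly
    predicted) affinity matrix A, then convolves over time.
    """
    def __init__(self, channels: int, kernel_t: int = 3):
        super().__init__()
        self.temporal = nn.Conv2d(channels, channels, (kernel_t, 1),
                                  padding=(kernel_t // 2, 0))

    def forward(self, x: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, nodes); A: (nodes, nodes) affinity.
        x = torch.einsum('bctn,nm->bctm', x, A)  # spatial mixing over the graph
        return torch.relu(self.temporal(x))      # temporal convolution
```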
8 Practices of the city brain
Powered by Alibaba Cloud's large-scale computing engine Apsara,
City Brain offers a comprehensive suite of acquisition, integration,
and analysis of big and heterogeneous data generated by a diversity
of sources in urban spaces. The power and functionality of the City
Brain enable urban planners and city officials to upgrade their city
governance and decision-making to turn the city into an intelligent
one. A few current deployment cases of City Brain are listed as
follows:
Xiong'an New District: On 8 November 2017, Alibaba signed a strategic cooperation agreement with Xiong'an New District to plan and design the future city through the City Brain.
Chongqing: Alibaba is creating an intelligent Chongqing based on the City Brain, driving smart city, smart manufacturing, and smart service initiatives.
Macao: Since 2017, the City Brain has improved livelihood and the visitor experience in Macao through smart services.
Guangzhou: The real-time scheduling of the City Brain enabled Baiyun Airport to increase the dispatch usage rate of parking spaces by 73%.
Malaysia: The City Brain is being applied to Malaysia's transportation management, urban planning, environmental protection, etc.; in the first phase, it was used to alleviate congestion in Kuala Lumpur.
Shanghai: The City Brain is widely applied to protecting public safety and providing community services. By optimising the traffic-light timing strategy, the average travel time dropped by 8% and the roadway congestion index dropped by 15%.
Hangzhou: By building a city traffic index and optimising the traffic-light timing strategy, the ambulance response time dropped by 50% and the average travel time dropped by 15.3%. Moreover, real-time traffic incident detection reaches 95% accuracy. The formalisation of co-operation between Alibaba Group and the Sports Bureau of Zhejiang Province provides an opportunity to build the intelligent engine for the Hangzhou 2022 Asian Games.
Suzhou: Dynamic adjustment of bus departure times increased the number of bus passengers by 17%.
Quzhou: With progressive video search, we located 50% more people than before. We are able to locate a person with only one photo, even a photo of the person's back.
Wuzhen: The City Brain provided comprehensive support for the fourth World Internet Conference.
9 Conclusion
In summary, we introduced the City Brain project, which aims at extracting meaningful and irreplaceable values from a huge aggregate of heterogeneous data, with a focus on city-scale AI technologies and applications. Emerging technologies empower AI and enable us to create the City Brain. As a platform, the City Brain can incubate, accelerate, and solidify many more AI technologies and applications in the future. From cognition to optimisation, to decision-making, from search to prediction and, ultimately, to intervention, City Brain improves the way we manage the city, as well as the way we live in it.
10 Acknowledgments
Jianfeng Zhang and Xian-Sheng Hua have contributed equally.
11 References
[1] Chu, W., Liu, Y., Shen, C., et al.: ‘Multi-task vehicle detection with region-of-
interest voting’, IEEE Trans. Image Process., 2018, 27, (1), pp. 432–441
[2] Zhao, Y., Deng, B., Shen, C., et al.: ‘Spatio-temporal autoencoder for video
anomaly detection’. Proc. 25th ACM Int. Conf. on Multimedia, 2017, pp.
1933–1941
[3] Shen, C., Jin, Z., Zhao, Y., et al.: ‘Deep Siamese network with multi-level
similarity perception for person re-identification’. Proc. 25th ACM Int. Conf.
on Multimedia, 2017, pp. 1942–1950
[4] https://github.com/apache/flink/tree/blink
[5] Eswari, R., Nickolas, S.: ‘Effective task scheduling for heterogeneous
distributed systems using firefly algorithm’, Int. J. Comput. Sci. Eng., 2015,
11, (2), pp. 132–142
[6] Yan, X., Yu, P.S., Han, J.: ‘Graph indexing: a frequent structure-based
approach’. Proc. 2004 ACM SIGMOD Int. Conf. on Management of Data,
2004, pp. 335–346
[7] Yang, J., Shen, X., Xing, J., et al.: ‘Quantization networks’. Conference
Computer Vision and Pattern Recognition, 2019
[8] Fu, Z., Jin, Z., Qi, G.-J., et al.: ‘Previewer for multiscale object detector’. Proc. 26th ACM Int. Conf. on Multimedia, 2018, pp. 265–273
[9] Henriques, J.F., Caseiro, R., Martins, P., et al.: ‘High-speed tracking with
kernelized correlation filters’, IEEE Trans. Pattern Anal. Mach. Intell., 2015,
37, (3), pp. 583–596
[10] Min, W., Wynter, L.: ‘Real-time road traffic prediction with spatio-temporal correlations’, Transp. Res. C, Emerg. Technol., 2011, 19, (4), pp. 606–616
Fig. 15 Framework of the proposed DST-GCNN, which contains two
stream. The first stream predicts the dynamic traffic conditions and the
second predicts the future flow
[11] Min, W., Wynter, L.: ‘Vehicle arrival prediction using multiple data sources including passenger bus arrival prediction’. US Patent 9,177,473, 2017
[12] Wynter, L., Min, W., Morris, B.G.: ‘Method and structure for vehicular traffic
prediction with link interactions and missing real-time data’. US Patent
8,755,991, 2014
[13] Min, W., Wynter, L.: ‘Method and apparatus for providing navigational guidance using the states of traffic signal’. US Patent 9,599,488, 2017
[14] Li, W., Zhao, R., Xiao, T., et al.: ‘Deepreid: deep filter pairing neural network
for person re-identification’. 2014 IEEE Conf. on Computer Vision and
Pattern Recognition, 2014, pp. 152–159
[15] Zheng, L., Shen, L., Tian, L., et al.: ‘Scalable person re-identification: a
benchmark’. IEEE Int. Conf. on Computer Vision (ICCV), 2015, pp. 1116–
1124
[16] Li, W., Zhao, R., Wang, X.: ‘Human reidentification with transferred metric
learning’. Asian Conf. on Computer Vision (ACCV), 2012, pp. 31–44
[17] Yang, J., Shen, X., Tian, X., et al.: ‘Local convolutional neural networks for
person re-identification’. Proc. 26th ACM Int. Conf. on Multimedia, 2018, pp.
1074–1082
[18] Zheng, Z., Zheng, L., Yang, Y.: ‘Unlabeled samples generated by GAN
improve the person re-identification baseline in vitro’. IEEE Int. Conf. on
Computer Vision, 2017, pp. 3774–3782
[19] Zhao, Y., Shen, X., Jin, Z., et al.: ‘Attribute-driven feature disentangling and
temporal aggregation for video person re-identification’. Conf. Computer
Vision and Pattern Recognition, 2019
[20] Wang, T., Gong, S., Zhu, X., et al.: ‘Person re-identification by video
ranking’. 13th European Conf. on Computer Vision, ECCV, 2014, pp. 688–
703
[21] Hirzer, M., Beleznai, C., Roth, P.M., et al.: ‘Person re-identification by descriptive and discriminative classification’. 17th Scandinavian Conf. on Image Analysis, SCIA, 2011, pp. 91–102
[22] Zheng, L., Bie, Z., Sun, Y., et al.: ‘MARS: a video benchmark for large-scale person re-identification’. 14th European Conf. on Computer Vision, ECCV, 2016, pp. 868–884
[23] Jégou, H., Douze, M., Schmid, C.: ‘Product quantization for nearest neighbor
search’, IEEE Trans. Pattern Anal. Mach. Intell., 2011, 33, (1), pp. 117–128
[24] Sivic, J., Zisserman, A.: ‘Video Google: a text retrieval approach to object
matching in videos’. Int. Conf. on Computer Vision (ICCV 2003), 2003, pp.
1470–1477
[25] Agarwal, S., Snavely, N., Simon, I., et al.: ‘Building Rome in a day’. 2009
IEEE 12th Int. Conf. on Computer Vision, 2009, pp. 72–79
[26] Bay, H., Tuytelaars, T., Gool, L.V.: ‘Surf: speeded up robust features’.
European Conf. on Computer Vision, 2006, pp. 404–417
[27] Lowe, D.G.: ‘Distinctive image features from scale-invariant keypoints’, Int.
J. Comput. Vis., 2004, 60, (2), pp. 91–110
[28] DeTone, D., Malisiewicz, T., Rabinovich, A.: ‘Superpoint: self-supervised
interest point detection and description’. Proc. IEEE Conf. on Computer
Vision and Pattern Recognition Workshops, 2018, pp. 224–236
[29] Zhao, Y., Li, Y., Shao, Z., et al.: ‘LSOD: local sparse orthogonal descriptor
for image matching’. Proc. 24th ACM Int. Conf. on Multimedia, 2016, pp.
232–236
[30] Fischler, M.A., Bolles, R.C.: ‘Random sample consensus: a paradigm for
model fitting with applications to image analysis and automated cartography’,
Commun. ACM, 1981, 24, (6), pp. 381–395
[31] Triggs, B., McLauchlan, P.F., Hartley, R.I., et al.: ‘Bundle adjustment - a modern synthesis’. Int. Workshop on Vision Algorithms, 1999, pp. 298–372
[32] Agarwal, S., Snavely, N., Seitz, S.M., et al.: ‘Bundle adjustment in the large’.
European Conf. on Computer Vision, 2010, pp. 29–42
[33] Sibley, D., Mei, C., Reid, I.D., et al.: ‘Adaptive relative bundle adjustment’.
Robotics: Science and Systems, 2009, vol. 32, p. 33
[34] Wu, C., Agarwal, S., Curless, B., et al.: ‘Multicore bundle adjustment’. IEEE
Computer Society Conf. on Computer Vision and Pattern Recognition
(CVPR), 2011, pp. 3057–3064
[35] Goesele, M., Snavely, N., Curless, B., et al.: ‘Multi-view stereo for
community photo collections’. 2007 IEEE 11th Int. Conf. on Computer
Vision, 2007, pp. 1–8
[36] Wang, M., Lai, B., Jin, Z., et al.: ‘Dynamic spatiotemporal graph-based CNNs
for traffic prediction’, arXiv preprint:1812.02019, 2018