Machine Learning-based Wait-time Prediction for Autonomous Mobility-on-Demand Systems

April 2019

DOI:10.1109/SoutheastCon42311.2019.9020461

Conference: IEEE SoutheastCon 2019
At: Huntsville, AL

Authors:

Trevor Hillsgrove

Florida Polytechnic University

Robert Steele

Quantic School of Business & Technology

Machine Learning-based Wait-time Prediction for Autonomous Mobility-on-Demand Systems

Trevor Hillsgrove

Florida Polytechnic University

4700 Research Way

Lakeland, FL, USA

thillsgrove@floridapoly.edu

Robert Steele

Florida Polytechnic University

4700 Research Way

Lakeland, FL, USA

rsteele@floridapoly.edu

Abstract— The development of more sophisticated autonomous

course-determination mechanisms for Autonomous Mobility-on-

Demand systems is an active area of research and development. In

the case of traditional ridesharing systems, there are various

factors to be considered such as efficient use of vehicular assets,

minimizing passenger wait times and selecting course of travel. In

this paper we consider the use of machine learning to aid the

selection of a destination that is predicted to result in a lower wait

time until the next rideshare ride request will occur in the vicinity

of a trip’s destination. We draw upon a real-world ridesharing

dataset to build and evaluate predictive machine learning models

to provide an exploratory analysis of the utility of this approach to

destination selection and demonstrate promising performance.

Keywords— data mining, mobility, mobility-on-demand,

autonomous vehicles

I. INTRODUCTION

In conventional vehicular systems, selection of the

destination or course of travel is the result of a complex

assessment by a human operator. Typically, this is based largely

on the current travel needs of a vehicle’s operator or users. In

the case of ridesharing systems, the selection of destination, or

course, or other ‘mission planning’, can be based upon the

system-wide needs of a population of users and the efficient

utilization goals of the rideshare service.

In this paper we consider the particular case of ridesharing in an Autonomous Mobility-on-Demand (AMoD) system [1][2]. Ridesharing to date has typically involved vehicles owned and operated by members of the public, but in the case of AMoD systems it is anticipated that a fleet of autonomous vehicles can be utilized. An autonomous vehicle would travel to a specific location requested by a client/passenger, and then transport the passenger to a destination indicated by the passenger, potentially sharing the vehicle with other passengers of the service along the way from source to destination. In the case of AMoD, the approach to mission planning can be sophisticated and make use of algorithmic optimization and predictive approaches to improve local and system-wide performance. AMoD systems are complex systems in which the mission choices can be based upon invariant data, system-wide parameters, real-time data and historical data.

There is a range of existing modeling and control approaches [3], but none of these completely addresses the complexity of such systems. In this paper, we consider the specific problem of predicting via machine learning techniques, for a specific rideshare trip and at the time the trip starts or ends, how long it will be until there is another rideshare request in the vicinity of the endpoint of that trip. Here, vicinity is defined as having the same ZIP code.

The information needed to know this deterministically is not available within the AMoD system, as passenger requests for a vehicle are not known to the system until they are made. For that reason we draw upon a machine learning-based predictive approach. We develop and evaluate this approach based upon a real-world rideshare dataset generated from the usage activity of Ride Austin [4], a non-profit ridesharing service located in Austin, Texas.

MoD systems tend to lead to a build-up of vehicles concentrated in certain areas due to the greater popularity of some destinations, leading to the inefficiency of additional trips needed to rebalance the vehicle locations [3]. One of the benefits of being able to predict the wait time of autonomous vehicles for a given location and point in time is that it can aid the autonomous decision as to if, when and where autonomous vehicles should relocate after completing a trip. Specifically, it assists with two problems:

1) assisting a vehicle to estimate, at the point of pick-up of a passenger, how long it may need to wait after dropping the passenger off. This can be an input variable to individual vehicle or system-wide pick-up decision-making algorithms

2) at the time that a passenger has been dropped off, providing a data input to decide whether a vehicle should move to ‘rebalance’ the locations of the fleet vehicles or wait for the next pickup in the current vicinity

In this exploratory study of this machine learning-based approach we demonstrate that it can generate good predictive performance. To our knowledge, this mechanism for intelligent decision making in relation to the vehicular resources has not previously been addressed in the literature. The exploratory study suggests that this approach can offer benefits in creating AMoD systems able to incorporate predictive models into local and global decision making. Such an addition should be differentiated from course decision-making algorithms that are hand-modeled based upon static data or real-time data inputs.

The remainder of the paper is structured as follows. Section II reviews the most recent relevant literature. Section III describes the methodology used in evaluating this approach to wait-time prediction. Section IV provides the results of the evaluation. Section V discusses the technique and its implications and applicability. Section VI concludes the paper.

II. LITERATURE REVIEW

Our review of the literature indicates that there has yet to be

an approach based upon machine learning techniques to predict

time until next ride request in ridesharing or AMoD systems.

Many of the existing approaches to modeling such systems are based upon queue-based system modeling or optimization, and suffer the limitation of providing guidance based upon static system parameters such as vehicle numbers.

Examples of existing works include predictive positioning for Mobility on Demand (MoD) systems where a limited number of autonomous vehicles are available to provide rides to customers, so the objective is to improve customer quality of service (QoS) in terms of minimizing customer wait times [5]. In this 2017 study based upon the MIT on-campus MoD system, the authors identify the use of machine learning as a future direction to improve customer QoS but do not utilize it within that study.

Iglesias et al. [6] draw upon 2018 work at Stanford

University to formulate a Model Predictive Control (MPC)

approach to controlling autonomous vehicles within an MoD

system. This MPC approach predicts short-term future

customer demand utilizing an optimization algorithm rather

than a machine learning approach and does not do this in a

vicinity-specific way.

The SAMoD system proposed in November 2018 [7] utilizes a reinforcement learning approach to address shared autonomous mobility-on-demand. In particular, this work considers a decentralized approach to determining vehicle relocation in a ridesharing scenario.

Another approach to vehicle balancing published in 2017

[8] utilizes dynamic region partitions and when evaluated on a

taxi ride dataset, demonstrates the average total idle driving

time is reduced by 30%.

Pendleton et al. (2017) [9] provide a survey paper, originating from the Singapore-MIT Alliance for Research and Technology, on perception, planning, control and coordination of autonomous vehicles. In relation to planning, the overarching area of this current work, the authors identify three sub-categories: mission planning, behavioral planning and motion planning. This current work tackles one problem in mission planning, which deals with such high-level tasks as pickup/dropoff decision making. The Pendleton et al. survey paper does not identify prior works applying machine learning to mission planning.

The challenge of machine learning-based wait time

prediction for a given vicinity, the topic of this current paper, is

a valuable problem and yet to be addressed elsewhere in the

literature.

III. METHODOLOGY

In this exploratory study we are interested in evaluating whether machine learning can be used to effectively predict time until next ride in a given location and at a specific point in time. While various ‘rules of thumb’ used by human drivers may provide insights and judgements into this for given circumstances, for a whole city or urban area it would be beneficial to be able to predict this systematically, automatically and at any point in time for an AMoD service. This would use real-time inference upon predictive models, likely trained off-line, to assist local vehicle or AMoD system-wide decision making.

AMoD data log records are still in their early stages and in some cases proprietary, with the widespread deployment of AMoD systems yet to occur; in particular, large-scale openly available AMoD datasets are currently not available. For the purposes of this approach we require a large dataset that represents real-world usage in a large urban environment. Hence for this exploratory study we utilize a historical dataset of a real-world ridesharing service that contains attributes that would be minimally available to any future AMoD system. This lets us evaluate the base effectiveness of the general approach upon a specific combination of urban area and rideshare service, and consider its potential broader applicability for AMoD systems. We chose a dataset from the ridesharing service RideAustin [4] as it had a richer attribute set than the data we examined from Uber or other current commercial ridesharing services.

A. Data source

We draw upon a dataset made public by the non-profit Austin-based ridesharing service RideAustin [4]. The dataset initially consists of 911,057 ride records from the RideAustin service from June 2016 to February 2017. The dataset initially contains 28 data attributes for each rideshare trip record, which include: the time the trip was completed (day/hour/min/second); distance traveled; end location latitude and longitude; when the trip started (day/hour/min/second); the rating of the driver for the trip; the rating of the rider for the trip; the trip start ZIP code; the trip end ZIP code; the category of car requested; some billing attributes such as free credit used; the surge factor; the start location latitude and longitude; various car-related attributes such as color, make, model and year; and various attributes related to the weather on that day, such as amount of precipitation, maximum temperature, minimum temperature, average wind, wind gust speed, and whether there was fog, heavy fog or thunder.

In its initial form, the dataset describes the rideshare trips

offered by this service during that period combined with a set

of attributes providing external contextual information, the

pertinent weather information in this case.

B. Data Preparation

From the initial dataset, a number of data cleaning and data preparation activities were then carried out. First, a number of duplicate ride records were removed from the dataset. The removed records were readily identifiable as duplicates, as they had identical start and end times to the second and all other attributes were also the same.

Additionally, as monthly ride frequency rose significantly during the initial start-up months of the service in mid-2016, a subset representing a ‘steady-state’ month with high ride volumes was chosen. The month chosen was January 2017.

As a prerequisite step towards model training and evaluation, a complex data engineering task was carried out that involved the creation of a new derived target attribute for the machine learning process, that is:

● Wait time until next pickup in the vicinity (W_i): for each row/ride R_i that has a trip end ZIP code of Z_i and completes at time T_i, a new target attribute ‘wait time’, W_i, is created that is equal to the amount of time after T_i before the next ride departed from Z_i, the destination ZIP code of the completed trip
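The derivation of the wait-time target can be sketched as follows. This is a minimal illustration rather than the procedure actually used in the study: the field names (start_zip, end_zip, started_at, completed_at), the representation of times as seconds, and the choice to count only rides that start strictly after the completion time are all assumptions. Rides with no later departure from their end ZIP code are given None.

```python
from bisect import bisect_right
from collections import defaultdict


def add_wait_times(rides):
    """Return one wait time (seconds) per ride, or None if no later ride
    departs from the ride's end ZIP code.

    Each ride is a dict with hypothetical keys: start_zip, end_zip,
    started_at and completed_at (times as seconds on a common clock).
    """
    # Index all ride start times by the ZIP code they departed from.
    starts_by_zip = defaultdict(list)
    for ride in rides:
        starts_by_zip[ride["start_zip"]].append(ride["started_at"])
    for starts in starts_by_zip.values():
        starts.sort()

    waits = []
    for ride in rides:
        starts = starts_by_zip.get(ride["end_zip"], [])
        # Position of the first start time strictly after completion.
        pos = bisect_right(starts, ride["completed_at"])
        if pos < len(starts):
            waits.append(starts[pos] - ride["completed_at"])
        else:
            waits.append(None)
    return waits


rides = [
    {"start_zip": "78701", "end_zip": "78704", "started_at": 0, "completed_at": 600},
    {"start_zip": "78704", "end_zip": "78701", "started_at": 900, "completed_at": 1500},
    {"start_zip": "78701", "end_zip": "78701", "started_at": 1800, "completed_at": 2400},
]
waits = add_wait_times(rides)
```

Sorting the start times once per ZIP code and using a binary search keeps the derivation close to linearithmic in the number of rides, which matters for a dataset of several hundred thousand records.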

This data engineering step facilitates the building of predictive models for wait time prediction. Subsequently, a random sample of 50,000 rows was drawn from the January 2017 rides of this newly engineered dataset. This subset size was chosen to decrease model training and evaluation times.

A number of attributes were then removed as not applicable to predicting wait time: these included car make, car model, car year, car color, charity_id, free_credit_used, driver rating, rating and rider rating. The median wait time in the dataset was then calculated. Then, after wait times had already been calculated to the second, time values such as time of completion of ride and time of start of ride were converted into a single hour value (in 24 hour time) to effect a binning of ride records around hours of the day.
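The hour-of-day binning can be illustrated with a short sketch. The timestamp format string is an assumption made for the example, as the format of the raw data is not described here, and the function name is hypothetical.

```python
from datetime import datetime


def hour_of_day(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Map a timestamp string to its hour of day in 24 hour time (0-23).

    The default format string is an assumption for illustration.
    """
    return datetime.strptime(timestamp, fmt).hour


completed_hour = hour_of_day("2017-01-15 17:42:09")
```

Note that the binning is applied only after the wait times have been computed to the second, so the target attribute keeps its full resolution.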

Finally, two different versions of this dataset were created, with all of the same input attributes but different versions of the target attribute:

● Classification dataset (CD): the target attribute is a binary nominal value, set to 1 for rows with a wait time greater than or equal to the median wait time, and set to 0 for rows with a wait time less than the median wait time. This is a dataset of 50,000 rows.

● Regression dataset (RD): the target attribute is a numerical value equal to the wait time in seconds. This is a dataset of 10,000 rows, chosen as a further random subset of the 50,000 rows to accommodate the greater computational load and time needed for regression model training.

In these datasets there are 20 attributes, including the target attribute, as shown in Table I.
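The construction of the two target variants can be sketched as below. The function names and the fixed random seed are hypothetical, and the toy wait times are invented purely to show the thresholding rule (1 for a wait at or above the median, 0 otherwise).

```python
import random
from statistics import median


def binarize_by_median(wait_times):
    """Return (labels, threshold) with label 1 when wait >= median, else 0."""
    threshold = median(wait_times)
    labels = [1 if w >= threshold else 0 for w in wait_times]
    return labels, threshold


def random_subset(rows, size, seed=0):
    """Draw a random subset without replacement (seed is an assumption)."""
    return random.Random(seed).sample(rows, size)


# Toy wait times in seconds, invented for illustration only.
labels, threshold = binarize_by_median([30, 120, 209, 400, 900])
subset = random_subset(list(range(100)), 20)
```

Thresholding at the median gives two classes of roughly equal size, so a trivial majority-class predictor would achieve a TP rate of about 0.5.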

C. Data Exploration

An initial exploration of the 50,000 row dataset provides the

following simple descriptive statistics. The mean distance

travelled was 7,999 meters or approximately 8 kilometers, and

the mean wait time was 209 seconds, or approximately three

and a half minutes.

D. Model Development

An open source machine learning toolkit [10] was used to develop numerous classifiers for the CD dataset and numerous regressors for the RD dataset. These were created using 10-fold cross validation on the 50,000 row CD dataset and, similarly, 10-fold cross validation on the 10,000 row RD dataset.
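For readers unfamiliar with the procedure, 10-fold cross validation can be sketched as follows. This is a plain, unstratified split written for illustration; the toolkit used in the study may partition the data differently (for example with stratification), and the function name and seed are assumptions.

```python
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation.

    Every row appears in exactly one test fold and in the training
    portion of the other k - 1 folds.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(50, k=10))
```

The reported metric for a model is then computed from its predictions on the held-out folds, so every row contributes to the evaluation exactly once.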

Numerous classifier base models of such types as Bayesian, instance-based learners (such as k-Nearest Neighbors and KStar), decision trees (such as C4.5) and rule-based (such as Decision Table) were initially trialed for the CD dataset. Based upon the best performing of these base models, various ensemble variants were then developed using such meta approaches as bagging, boosting and stacking [11]. The Bayesian models in general performed significantly better in terms of predictive performance.

For boosting we utilized the AdaBoost (adaptive boosting) meta-algorithm [12]. While this is often utilized with a decision tree base model, the boosted decision tree models that we trialed in this case performed significantly worse than the Bayesian models.

For the RD dataset, various base regression models were trialed including Support Vector Machine (SVM) regression, instance-based approaches such as k-Nearest Neighbors and KStar, linear regression, decision tables and decision trees such as C4.5 and REPTree. Again, for the best performing of these base models, additional ensemble model variants were also trialed. Among the regression models, a wider range of base models demonstrated higher performance levels.

TABLE I. MODEL ATTRIBUTES

| ATTRIBUTE NAME | DESCRIPTION |
|---|---|
| completed_on | Hour of day completed, in 24 hour time |
| distance_travelled | Distance in meters |
| end_location_lat | Latitude of end of trip |
| end_location_lon | Longitude of end of trip |
| started_on | Hour of day trip started, in 24 hour time |
| start_zip_code | ZIP code where trip started |
| end_zip_code | ZIP code where trip ended |
| requested_car_category | Category of car requested: REGULAR, SUV, LUX, PREMIUM |
| surge_factor | 15 distinct surge factor values |
| start_location_lat | Latitude of where trip started |
| start_location_lon | Longitude of where the trip started |
| PRCP | Precipitation in millimeters |
| TMAX | Maximum temperature of the day |
| TMIN | Minimum temperature of the day |
| AWND | Average wind speed |
| Gustspeed2 | Speed of wind gusts |
| Fog | Fog level described by 9 distinct values |
| HeavyFog | Binary indicating heavy fog or not |
| Thunder | Binary indicating presence of thunder during day or not |
| WAIT | For CD: binary, indicating either above or below median wait time. For RD: exact wait time in seconds |

E. Evaluation

A number of different performance evaluation metrics were captured for each type of model trained. For each classifier the Area Under the Receiver Operating Characteristic curve (AUC), the True Positive (TP) rate (weighted between the two classes) and the Area Under the Precision-Recall Curve (AUPRC) were noted.
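Two of these classifier metrics can be sketched in a few lines. The AUC is computed here through its rank interpretation (the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half), and the weighted TP rate as the per-class TP rate weighted by class frequency. The function names are hypothetical and this is not the toolkit's implementation.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    # Pairwise comparison is quadratic but keeps the definition explicit.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_tp_rate(labels, predictions):
    """Per-class TP rate (recall), weighted by each class's share of rows."""
    total = len(labels)
    rate = 0.0
    for cls in set(labels):
        members = [p for y, p in zip(labels, predictions) if y == cls]
        recall = sum(1 for p in members if p == cls) / len(members)
        rate += recall * len(members) / total
    return rate


auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
tp_rate = weighted_tp_rate([1, 1, 0, 0], [1, 0, 0, 0])
```

With this weighting the weighted TP rate is algebraically equal to overall accuracy, since each class's correct predictions are divided by the class size and then multiplied back by it.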

For the regression models Correlation Coefficient and Mean

Absolute Error (MAE) were captured.
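The two regression metrics can be sketched as below, assuming that the correlation coefficient reported is Pearson's correlation between actual and predicted wait times. The function names and the toy values are for illustration only.

```python
from math import sqrt


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def correlation_coefficient(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / sqrt(var_a * var_p)


# Toy wait times in seconds, invented for illustration only.
actual = [100, 200, 300]
predicted = [110, 190, 330]
mae = mean_absolute_error(actual, predicted)
r = correlation_coefficient(actual, predicted)
```

The two metrics capture different things: correlation measures how well the predictions track the ordering and trend of the actual wait times, while MAE measures the typical size of the error in seconds. This is why the model with the best correlation need not have the lowest MAE.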

IV. RESULTS

The performance of the best performing classifiers developed and evaluated with 10-fold cross validation, using the 50,000 row CD dataset, is summarized in Table II.

TABLE II. CLASSIFIER PERFORMANCE - 10-FOLD CROSS VALIDATION ON 50,000 ROW CD DATASET

| MODEL | AUC | WEIGHTED TP RATE | AUPRC |
|---|---|---|---|
| BAYES NETWORK | 0.821 | 0.744 | 0.812 |
| BOOSTED BAYES NETWORK (ADA BOOST) | 0.825 | 0.752 | 0.810 |
| BAGGED BAYES NETWORK | 0.821 | 0.744 | 0.812 |
| NAIVE BAYES | 0.804 | 0.726 | 0.785 |
| BOOSTED NAIVE BAYES (ADA BOOST) | 0.824 | 0.754 | 0.809 |
| BAGGED NAIVE BAYES | 0.804 | 0.726 | 0.786 |
| STACKING - META: NAIVE BAYES, STACKED MODELS: NAIVE BAYES, BAYES NET, ONER | 0.827 | 0.750 | 0.815 |

Fig. 1 and Fig. 2 provide the ROC curves (in terms of the low delay class and high delay class respectively) for the best performing classifier, the Stacking model listed in Table II. The OneR model, one of the base models used by the Stacking model, is a simple rule-based model that uses a rule based upon just the one input attribute that best predicts the target.
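The idea behind OneR can be sketched as follows. This simplified version assumes all input attributes are nominal (numeric attributes would first need to be discretized) and is not the toolkit's implementation; the attribute names in the toy rows are invented for illustration.

```python
from collections import Counter, defaultdict


def train_one_r(rows, target):
    """Pick the single attribute whose value-to-class rule makes the fewest
    training errors. Returns (attribute, rule, default_class)."""
    default = Counter(r[target] for r in rows).most_common(1)[0][0]
    best = None
    for attr in (a for a in rows[0] if a != target):
        # Count class frequencies for each value of this attribute.
        counts = defaultdict(Counter)
        for r in rows:
            counts[r[attr]][r[target]] += 1
        # The rule maps each attribute value to its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for r in rows if rule[r[attr]] != r[target])
        if best is None or errors < best[0]:
            best = (errors, attr, rule)
    return best[1], best[2], default


def predict_one_r(model, row):
    """Apply the rule, falling back to the majority class for unseen values."""
    attr, rule, default = model
    return rule.get(row[attr], default)


rows = [
    {"hour_bin": "peak", "fog": "no", "wait": 0},
    {"hour_bin": "peak", "fog": "yes", "wait": 0},
    {"hour_bin": "night", "fog": "no", "wait": 1},
    {"hour_bin": "night", "fog": "yes", "wait": 1},
]
model = train_one_r(rows, target="wait")
prediction = predict_one_r(model, {"hour_bin": "night", "fog": "no"})
```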

FIG. 1: ROC THRESHOLD CURVE FOR STACKED MODEL - META: NAIVE BAYES, STACKED: NAIVE BAYES, BAYES NET, ONER - LOW DELAY TIME

FIG. 2: ROC THRESHOLD CURVE FOR STACKED MODEL - META: NAIVE BAYES, STACKED: NAIVE BAYES, BAYES NET, ONER - HIGH DELAY TIME

The best performing regression models developed and

evaluated using 10-fold cross-validation, on the 10,000 row RD

dataset are shown in Table III.

The MAE value is given in seconds.

V. DISCUSSION

The best performing classification models achieve an AUC

of over 0.82, with the stacking approach (see Table II)

achieving the highest AUC of 0.827. This would be considered

a good level of discriminative performance for a model, with

above 0.8 considered ‘good’ and above 0.9 considered

‘excellent’ predictive performance [13].

TABLE III. REGRESSION MODEL PERFORMANCE - 10-FOLD CROSS VALIDATION ON 10,000 ROW RD DATASET

| MODEL | CORRELATION COEFFICIENT | MEAN ABSOLUTE ERROR (SECS) |
|---|---|---|
| SVM REGRESSION | 0.355 | 145.8866 |
| LINEAR REGRESSION | 0.3282 | 160.683 |
| K-NEAREST NEIGHBORS | 0.2872 | 195.1336 |
| REPTREE | 0.5546 | 154.7369 |
| DECISION TABLE | 0.5464 | 160.5489 |
| REPTREE - BAGGING | 0.5747 | 148.8249 |
| ADDITIVE REGRESSION - DECISIONTABLE | 0.5871 | 156.4668 |
| MULTISCHEME: ADDITIVE REGRESSION - DECISIONTABLE + REPTREE - BAGGING | 0.5782 | 148.1987 |
| MULTISCHEME: DECISIONTABLE + REPTREE | 0.5412 | 155.9432 |
| STACKING - META: DECISIONTABLE, STACKED: REPTREE, DECISIONTABLE | 0.5591 | 157.2248 |
| RANDOMSUBSPACE - REPTREE | 0.5836 | 158.1587 |
| RANDOMSUBSPACE - DECISIONTABLE | 0.5695 | 166.9425 |

This indicates that the machine learning technique, even for a basic set of attributes that would be available in any anticipated AMoD system, is able to discriminate well at the coarse level of high vs low wait time until next ride request for that vicinity (ZIP code), either at the point of preceding passenger pickup or at the time of completing the passenger drop off. It could be anticipated that emerging AMoD systems will additionally capture far more per-ride/per-request data attributes that have the potential to improve upon predictive model performance.

The best performing regression model in terms of correlation coefficient is Decision Table-based Additive Regression, achieving a correlation coefficient of 0.5871. Such a result might be considered on the lower boundary of a ‘strong correlation’ [14]. It achieves an MAE of 156.5 seconds. Other similarly performing regression models include Random Subspace - REPTree and Bagged REPTree, with correlation coefficients of 0.5836 and 0.5747 respectively. The MAEs are in the range of 150 seconds (approximately two and a half minutes), which may be a relatively small amount of time compared with that typically required to relocate an autonomous vehicle to another ZIP code. This suggests the time-to-wait prediction may have value in the decision as to whether to relocate or not, or to where. It should be noted that the regression model with the lowest MAE was SVM regression, achieving an MAE of 145.89 seconds.
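As a purely illustrative sketch of how a predicted wait time might feed a relocate-or-wait decision, the following compares staying in the current ZIP code against relocating to another. The decision rule, the function name should_relocate, its parameters and the default 150 second margin (chosen here only because it matches the MAE range reported above) are assumptions for illustration and are not part of the evaluated approach.

```python
def should_relocate(wait_here_s, wait_there_s, relocation_time_s, margin_s=150.0):
    """Return True if relocating is expected to end idle time sooner than
    waiting in place. All arguments are in seconds; names are hypothetical.
    """
    # If the vehicle stays, it is idle until the next local request.
    idle_if_stay = wait_here_s
    # If it moves, it cannot serve a request before it arrives, so it is
    # engaged no earlier than the later of arrival and the remote request.
    idle_if_move = max(relocation_time_s, wait_there_s)
    # Require the gain to exceed a margin reflecting prediction error.
    return idle_if_move + margin_s < idle_if_stay


long_local_wait = should_relocate(900, 60, 300)
short_local_wait = should_relocate(209, 60, 300)
```

The margin reflects the point made above: a prediction error of a few minutes matters little when it is small relative to the relocation time, but it should stop the vehicle from moving when the expected gain is within the error of the model.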

Here we have concerned ourselves with evaluating the

predictive performance of exploratory or demonstrator models,

but various techniques can also draw upon these predictive

models so as to integrate the uncertain knowledge provided by

these or combine the uncertain predictions of multiple models

[15][16].

The results suggest that models, even based upon a limited

number of per-trip data attributes, can provide good predictive

performance.

A. Generalizability of Models

Such trained models for MoD or AMoD systems are inherently city/region or geography specific, and also service specific. That is, for a given AMoD system, the predictive model would be trained from recent service usage data for each particular city/region, so the model is specific to a given city’s demand patterns. Such models would typically be trained off-line on historical data, and then be used in real-time inference to assist pickup and mission decision-making. It is possible to update the training of the models fairly continuously as new AMoD service data becomes available for that area. Low-latency, fog computing-based architectures may provide the distributed architecture for real-time inference computation [17]. There is the potential in future work to extend the training datasets with additional contextual attributes, such as fine-grained neighborhood-specific population numbers or in-area workforce or pedestrian numbers, and for ridesharing services to draw in desired attributes through such customizable data collection approaches as crowdsensing [18].

The models are AMoD service specific in that they are

dependent upon the customer demand patterns of the service

users. Such demand patterns may have similarities or

differences between services, depending on whether the

customer bases differ in their location distribution and typical

usage/travel patterns.

A possible immediate future step is to consider such models trained on datasets specific to a class of vehicle service, as the demand patterns for different service levels are likely to be distinct.

VI. CONCLUSION

In this paper we have described an exploratory study

demonstrating and evaluating the efficacy of machine learning-

based predictive models in predicting wait time until the next

ride request in a given vicinity for Autonomous Mobility-on-

Demand systems. Given the lack of available large-scale

historical AMoD trip datasets, we have demonstrated and

evaluated this approach on a real-world, ridesharing service

dataset, provided from such a service in Austin, TX. The

predictive performance demonstrated is good both in terms of

the classification and regression models developed, suggesting

the value and promise of extending such an approach to future,

more attribute-rich AMoD ride datasets.

REFERENCES

[1] Spieser, K., Treleaven, K., Zhang, R., Frazzoli, E., Morton, D., & Pavone, M. “Toward a systematic approach to the design and evaluation of automated mobility-on-demand systems: A case study in Singapore”. In Road vehicle automation, 2014 (pp. 229-245). Springer, Cham.

[2] Pavone, M. “Autonomous mobility-on-demand systems for future urban mobility”. In Autonomes Fahren, 2015 (pp. 399-416). Springer Vieweg, Berlin, Heidelberg.

[3] Zhang, R., Spieser, K., Frazzoli, E., & Pavone, M. “Models, algorithms, and evaluation for autonomous mobility-on-demand systems”. In American Control Conference (ACC), July, 2015 (pp. 2573-2587). IEEE.

[4] Ride Austin. Data File and Dictionary. Available from: https://data.world/ride-austin/ride-austin-june-6-april-13. Accessed Jan 2, 2019.

[5] J. Miller & J.P. How. “Predictive positioning and quality of service ridesharing for campus mobility on demand systems”. In 2017 IEEE International Conference on Robotics and Automation (ICRA) (pp. 1402-1408), 2017, IEEE.

[6] Iglesias, R., Rossi, F., Wang, K., Hallac, D., Leskovec, J., & Pavone, M. “Data-driven model predictive control of autonomous mobility-on-demand systems”. In 2018 IEEE International Conference on Robotics and Automation (ICRA) (pp. 1-7). May, 2018, IEEE.

[7] Guériau, M., & Dusparic, I. “SAMoD: Shared Autonomous Mobility-on-Demand using Decentralized Reinforcement Learning”. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC) (pp. 1558-1563), Nov. 2018, IEEE.

[8] Miao, F., Han, S., Hendawi, A. M., Khalefa, M. E., Stankovic, J. A., & Pappas, G. J. “Data-driven distributionally robust vehicle balancing using dynamic region partitions”. In Proceedings of the 8th International Conference on Cyber-Physical Systems, April, 2017 (pp. 261-271), ACM.

[9] Pendleton, S. D., Andersen, H., Du, X., Shen, X., Meghjani, M., Eng, Y. H., ... & Ang, M. H. “Perception, planning, control, and coordination for autonomous vehicles”. Machines, 5(1), 6, 2017.

[10] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, & I.H. Witten. “The WEKA data mining software: an update”. ACM SIGKDD Explorations Newsletter, 11(1), 10-18, 2009.

[11] Džeroski, S., & Ženko, B. “Is combining classifiers with stacking better than selecting the best one?”. Machine Learning, 54(3), 255-273, 2004.

[12] Collins, M., Schapire, R. E., & Singer, Y. “Logistic regression, AdaBoost and Bregman distances”. Machine Learning, 48(1-3), 253-285, 2002.

[13] J.A. Hanley and B.J. McNeil. “The meaning and use of the area under a receiver operating characteristic (ROC) curve”. Radiology, 143(1), 29-36, 1982.

[14] BMJ, Online: [https://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/11-correlation-and-regression]

[15] Schmidt, S., Steele, R., Dillon, T. S., & Chang, E. “Fuzzy trust evaluation and credibility development in multi-agent systems”. Applied Soft Computing, 7(2), 492-505, 2007.

[16] Zhang, R., Rossi, F., & Pavone, M. “Model predictive control of autonomous mobility-on-demand systems”. In 2016 IEEE International Conference on Robotics and Automation (ICRA), May, 2016 (pp. 1382-1389). IEEE.

[17] Jaimes, L. G., Chakeri, A., & Steele, R. “Localized cooperation for crowdsensing in a fog computing-enabled internet-of-things”. Journal of Ambient Intelligence and Humanized Computing, 1-13, 2019. doi: 10.1007/s12652-018-0818-z

[18] Steele, R., & Jaimes, L. G. “Crowdsensing sub-populations in a region”. Journal of Ambient Intelligence and Humanized Computing, 1-10, 2019. doi: 10.1007/s12652-018-0799-y

The Effect of Training Set Timeframe on Future Performance of Machine Learning-based Malware Detection Models

Conference Paper

Full-text available

Dec 2020

The occurrence of previously unseen malicious code or malware is an implicit and ongoing issue for all software-based systems. It has been recognized that machine learning, applied to features statically extracted from binary executable files, offers a number of promising benefits, such as its ability to detect malware that has not been previously encountered. Nevertheless it is understood that these models will not continue to perform equally well over time as new and potentially less recognizable malwares occur. In this study, we have applied a range of machine learning models to the features extracted from a large collection of software executables in Portable Executable format ordered by the date the binary was first encountered, consisting of both malware and benign examples, whilst considering different training set configurations and timeframes. We analyze and quantify the relative performance deterioration of these machine learning models on future test sets of these features, and discuss some insights into the characteristics and rate of machine learning-based malware detection performance deterioration and training set selection.

Data-Driven Framework for Understanding & Modeling Ride-sourcing Transportation Systems

Thesis

Full-text available

Jun 2022

Bishoy Kelleny

Ride-sourcing transportation services offered by transportation network companies (TNCs) like Uber and Lyft are disrupting the transportation landscape. The growing demand on these services, along with their potential short and long-term impacts on the environment, society, and infrastructure emphasize the need to further understand the ride-sourcing system. There were no sufficient data to fully understand the system and integrate it within regional multimodal transportation frameworks. This can be attributed to commercial and competition reasons, given the technology-enabled and innovative nature of the system. Recently, in 2019, the City of Chicago the released an extensive and complete ride-sourcing trip-level data for all trips made within the city since November 1, 2018. The data comprises the trip ends (pick-up and drop-off locations), trip timestamps, trip length and duration, fare including tipping amounts, and whether the trip was authorized to be shared (pooled) with another passenger or not. Therefore, the main goal of this dissertation is to develop a comprehensive data-driven framework to understand and model the system using this data from Chicago, in a reproducible and transferable fashion. Using data fusion approach, sociodemographic, economic, parking supply, transit availability and accessibility, built environment and crime data are collected from open sources to develop this framework. The framework is predicated on three pillars of analytics: (1) explorative and descriptive analytics, (2) diagnostic analytics, and (3) predictive analytics. The dissertation research framework also provides a guide on the key spatial and behavioral explanatory variables shaping the utility of the mode, driving the demand, and governing the interdependencies between the demand’s willingness to share and surge price. Thus, the key findings can be readily challenged, verified, and utilized in different geographies. 
In the explorative and descriptive analytics, the spatial and temporal dimensions of the ride-sourcing system are analyzed to achieve two objectives: (1) explore, reveal, and assess the significance of spatial effects, i.e., spatial dependence and heterogeneity, in the system behavior, and (2) develop a behavioral market segmentation and trend mining of the willingness to share. This is linked to the diagnostic analytics layer, as the revealed spatial effects motivate the adoption of spatial econometric models to analytically identify the ride-sourcing system determinants. Multiple linear regression (MLR) is used as a benchmark model against the spatial error model (SEM), the spatially lagged X (SLX) model, and the geographically weighted regression (GWR) model. Two innovative modeling constructs are introduced in the diagnostic analytics layer to deal with the ride-sourcing system’s spatial effects and multicollinearity: (1) the Calibrated Spatially Lagged X Ridge Model (CSLXR) and (2) the Calibrated Geographically Weighted Ridge Regression (CGWRR). The determinants identified in the diagnostic analytics layer are then fed into the predictive analytics layer to develop an interpretable machine learning (ML) modeling framework. The system’s annual average weekday origin-destination (AAWD OD) flow is modeled using the following state-of-the-art ML models: (1) Multilayer Perceptron (MLP) Regression, (2) Support Vector Machines Regression (SVR), and (3) tree-based ensemble learning methods, i.e., Random Forest Regression (RFR) and Extreme Gradient Boosting (XGBoost). The CGWRR construct developed in the diagnostic analytics is then validated in a predictive context and is found to outperform the state-of-the-art ML models, with a testing score of 0.914 compared with 0.906 for XGBoost, 0.84 for RFR, 0.89 for SVR, and 0.86 for MLP. The CGWRR also outperforms these models in terms of root mean squared error (RMSE) and mean absolute error (MAE).
The findings of this dissertation partially bridge the gap between practice and research on understanding and integrating ride-sourcing transportation systems. The empirical findings made in the descriptive and explorative analytics can be further utilized by regional agencies to fill practice and policymaking gaps on regulating ride-sourcing services using corridor or cordon tolls, optimally allocating standing areas to minimize deadheading, especially during off-peak periods, and promoting the willingness to share rides in disadvantaged communities. The CGWRR provides researchers and practitioners with a reliable modeling and simulation tool to integrate the ride-sourcing system in multimodal transportation modeling frameworks, a simulation testbed for testing long-range impacts of policies on ride-sourcing, like improved transit supply, congestion pricing, or increased parking rates, and a means to plan ahead for similar futuristic transportation modes, like shared autonomous vehicles.

SAMoD: Shared Autonomous Mobility-on-Demand using Decentralized Reinforcement Learning

Conference Paper

Nov 2018

Localized cooperation for crowdsensing in a fog computing-enabled internet-of-things

Article

May 2018

In this article, we describe and evaluate a crowdsensing approach that entails local cooperation between crowdsensing participants in smart environments, utilizing an underlying fog computing-enabled Internet of Things. A fog computing-based Internet-of-Things architecture involves a layer of computing nodes residing closer to the sensing devices, with this layer of fog nodes lying in between mobile and sensing devices at the network edge and the cloud. This motivates us to propose a model for crowdsensing in smart environments that involves both competition and cooperation between nearby crowdsensing participants at the edge network. Comprehensive simulations are presented to evaluate the performance of the proposed approach. The work shows desirable characteristics in terms of number of active participants, number of samples collected within a given budget and coverage, resulting from localized cooperation by crowdsensing participants at the edge layer that can support various smart environment applications.

Crowdsensing sub-populations in a region

Article

Apr 2019

Crowdsensing refers to an approach for collecting data from a large number of smart devices and sensors carried by many individuals and has been employed for numerous applications, which include pollution monitoring, traffic monitoring and noise sensing. It is an important mechanism for building applications in the smart environments enabled by the internet-of-things. However, a given problem may often dictate that samples are drawn from a defined sub-population of participants, for example based on characteristics of the participant such as location, demographics or other profile attributes, rather than from any possible member of the whole population. In this article we introduce an approach for crowdsensing with a consideration for how to sample from specific sub-populations in a region, delineated in a dimension-based way analogous to the multi-dimensional data model used in data warehousing. Simulation and performance results are provided demonstrating the approach’s ability to maintain active participants, provide coverage of the region of interest, and scalably sample the variable of interest in relation to the sub-population. This is the first work to our knowledge to address and propose an approach to the specific problem of crowdsourcing from specific attribute-defined sub-populations.

Perception, Planning, Control, and Coordination for Autonomous Vehicles

Article

Feb 2017

Autonomous vehicles are expected to play a key role in the future of urban transportation systems, as they offer potential for additional safety, increased productivity, greater accessibility, better road efficiency, and positive impact on the environment. Research in autonomous systems has seen dramatic advances in recent years, due to the increases in available computing power and reduced cost in sensing and computing technologies, resulting in maturing technological readiness level of fully autonomous vehicles. The objective of this paper is to provide a general overview of the recent developments in the realm of autonomous vehicle software systems. Fundamental components of autonomous vehicle software are reviewed, and recent developments in each area are discussed.

Autonomous Mobility-on-Demand Systems for Future Urban Mobility

Chapter

Jan 2015

Marco Pavone

This chapter discusses the operational and economic aspects of autonomous mobility-on-demand (AMoD) systems, a transformative and rapidly developing mode of transportation wherein robotic, self-driving vehicles transport passengers in a given environment. Specifically, AMoD systems are addressed along three dimensions: (1) modeling, that is analytical models capturing salient dynamic and stochastic features of customer demand, (2) control, that is coordination algorithms for the vehicles aimed at throughput maximization, and (3) economic, that is fleet sizing and financial analyses for case studies of New York City and Singapore. Collectively, the models and methods presented in this chapter enable a rigorous assessment of the value of AMoD systems.

Data-Driven Model Predictive Control of Autonomous Mobility-on-Demand Systems

Conference Paper

May 2018

Predictive positioning and quality of service ridesharing for campus mobility on demand systems

Conference Paper

May 2017

Data-Driven Distributionally Robust Vehicle Balancing Using Dynamic Region Partitions

Conference Paper

May 2017

With the transformation to smarter cities and the development of technologies, a large amount of data is collected from sensors in real-time. This paradigm provides opportunities for improving transportation systems' performance by proactively allocating vehicles towards predicted mobility demand. However, how to deal with uncertainties in the demand probability distribution to improve the average system performance is still a challenging and unsolved task. Considering this problem, in this work, we develop a data-driven distributionally robust vehicle balancing method to minimize the worst-case expected cost. We design an efficient algorithm for constructing uncertainty sets of random demand probability distributions, and leverage a quad-tree dynamic region partition method for better capturing the dynamic spatial-temporal properties of the uncertain demand. We then prove an equivalent computationally tractable form for numerically solving the distributionally robust problem. We evaluate the performance of the data-driven vehicle balancing framework based on four years of taxi trip data for New York City. We show that the average total idle driving distance is reduced by 30% with the distributionally robust vehicle balancing method using the quad-tree dynamic region partition method, compared with vehicle balancing solutions based on static region partitions without considering demand uncertainties. This is about 60 million miles or 8 million dollars cost reduction annually in NYC.

Model predictive control of autonomous mobility-on-demand systems

Conference Paper

May 2016

Models, algorithms, and evaluation for autonomous mobility-on-demand systems

Article

Jul 2015

This tutorial paper examines the operational and economic aspects of autonomous mobility-on-demand (AMoD) systems, a rapidly emerging mode of personal transportation wherein robotic, self-driving vehicles transport customers in a given environment. We address AMoD systems along three dimensions: (1) modeling - analytical models capable of capturing the salient dynamic and stochastic features of customer demand, (2) control - coordination algorithms for the vehicles aimed at stability and subsequently throughput maximization, and (3) economic - fleet sizing and financial analyses for case studies of New York City and Singapore. Collectively, the models and algorithms presented in this paper enable a rigorous assessment of the value of AMoD systems. In particular, the case study of New York City shows that the current taxi demand in Manhattan can be met with about 8,000 robotic vehicles (roughly 70% of the size of the current taxi fleet), while the case study of Singapore suggests that an AMoD system can meet the personal mobility need of the entire population of Singapore with a number of robotic vehicles that is less than 40% of the current number of passenger vehicles. Directions for future research on AMoD systems are presented and discussed.
