Machine Learning-based Wait-time Prediction for Autonomous Mobility-on-Demand Systems

Abstract

The development of more sophisticated autonomous course-determination mechanisms for Autonomous Mobility-on-Demand systems is an active area of research and development. In the case of traditional ridesharing systems, there are various factors to be considered such as efficient use of vehicular assets, minimizing passenger wait times and selecting course of travel. In this paper we consider the use of machine learning to aid the selection of a destination that is predicted to result in a lower wait time until the next rideshare ride request will occur in the vicinity of a trip's destination. We draw upon a real-world ridesharing dataset to build and evaluate predictive machine learning models to provide an exploratory analysis of the utility of this approach to destination selection and demonstrate promising performance.
Trevor Hillsgrove
Florida Polytechnic University
4700 Research Way
Lakeland, FL, USA
thillsgrove@floridapoly.edu
Robert Steele
Florida Polytechnic University
4700 Research Way
Lakeland, FL, USA
rsteele@floridapoly.edu
Keywords: data mining, mobility, mobility-on-demand,
autonomous vehicles
I. INTRODUCTION
In conventional vehicular systems, selection of the
destination or course of travel is the result of a complex
assessment by a human operator. Typically, this is based largely
on the current travel needs of a vehicle’s operator or users. In
the case of ridesharing systems, the selection of destination, or
course, or other ‘mission planning’, can be based upon the
system-wide needs of a population of users and the efficient
utilization goals of the rideshare service.
In this paper we consider the particular case of ridesharing
in an Autonomous Mobility-on-Demand (AMoD) system
[1][2]. Ridesharing to-date has typically involved vehicles
owned and operated by members of the public, but in the case
of AMoD systems it is anticipated a fleet of autonomous
vehicles can be utilized. An autonomous vehicle would travel
to a specific location requested by a client/passenger, and then
transport the passenger to a destination location indicated by the
passenger, potentially sharing the vehicle with other passengers
also using the service, along the way from source to destination.
In the case of AMoD, the approach to mission planning can be
sophisticated and make use of algorithmic optimization and
predictive approaches to improve local and system-wide
performance. AMoD systems are complex systems in which the
mission choices can be based upon invariant data, system-wide
parameters, real-time data and historical data.
There are a range of existing modeling and control
approaches [3], but none of these completely addresses the
complexity of such systems. In particular in this paper, we
consider the specific problem of being able to predict via
machine learning techniques, for a specific rideshare trip at the
time the trip starts and ends, how long it will be until there will
be another rideshare request in the vicinity of the endpoint of
the rideshare trip. In this case, vicinity is defined as having the
same ZIP code.
The information to know this deterministically is not within
the AMoD system, as passenger requests for a vehicle are not
known to the system until requested. For that reason we draw
upon a machine learning-based predictive approach. We
develop and evaluate this approach based upon a real-world
rideshare dataset generated from the usage activity of Ride
Austin [4], a non-profit ridesharing service located in Austin,
Texas.
MoD systems tend to lead to a build-up of vehicles
concentrated in certain areas due to the greater popularity of
some destinations, leading to the inefficiency of additional trips
needed to rebalance the vehicle locations [3]. One of the
benefits of being able to predict the wait time of autonomous
vehicles for a given location and point-in-time, is that it can aid
in the autonomous decision as to if, when and to where
autonomous vehicles should relocate to after completing a trip.
Specifically, it assists in two problems:
1) assisting a vehicle to decide at the point of pick-up of
a passenger, how long they may need to wait after
dropping the passenger off. This can be an input
variable to individual vehicle or system-wide pick-up
decision making algorithms
2) at the time that a passenger has been dropped off, this
provides a data input to decide whether a vehicle
should move to ‘rebalance’ the locations of the fleet
vehicles or wait for a next pickup in the current
vicinity
In this exploratory study of this machine learning-based
approach we demonstrate that this approach can generate good
predictive performance, and this mechanism for intelligent
decision making in relation to the vehicular resources has not
previously been addressed in the literature. The exploratory
study suggests that this approach can offer benefits in creating
AMoD systems able to incorporate predictive models into local
and globalized decision making. Such an addition should be
differentiated from course decision-making algorithms that are
hand-modeled based upon static data or real-time data inputs.
The remainder of the paper is structured as follows. Section
II reviews the most recent relevant literature. Section III
describes the methodology used in evaluating this approach to
wait-time prediction. Section IV provides the results of the
evaluation. Section V discusses the technique and its
implications and applicability. Finally, this is followed by the
conclusion of the paper.
II. LITERATURE REVIEW
Our review of the literature indicates that there has yet to be
an approach based upon machine learning techniques to predict
time until next ride request in ridesharing or AMoD systems.
Many of the existing approaches to modeling such systems
are based upon queue-based system modeling or optimization,
and also suffer the limitation of providing guidance based upon
static system parameters such as vehicle numbers.
Examples of existing works include predictive positioning
for Mobility on Demand (MoD) systems where a limited
number of autonomous vehicles are available to provide rides
to customers, so the objective is to improve customer quality of
service (QoS) in terms of minimizing customer wait times [5].
In this 2017 study based upon the MIT on-campus MoD system,
the authors identify the use of machine learning as a future
direction to improve customer QoS but do not utilize it within
that study.
Iglesias et al. [6] draw upon 2018 work at Stanford
University to formulate a Model Predictive Control (MPC)
approach to controlling autonomous vehicles within an MoD
system. This MPC approach predicts short-term future
customer demand utilizing an optimization algorithm rather
than a machine learning approach and does not do this in a
vicinity-specific way.
The proposed SAMoD system, Nov. 2018 [7] utilizes a
reinforcement learning approach to addressing shared
autonomous mobility-on-demand. In particular this work
considers a decentralized approach to determining vehicle
relocation in the case of a ridesharing scenario.
Another approach to vehicle balancing published in 2017
[8] utilizes dynamic region partitions and when evaluated on a
taxi ride dataset, demonstrates the average total idle driving
time is reduced by 30%.
Pendleton et al. (2017) [9] provide a survey paper
originating from the Singapore-MIT Alliance for Research and
Technology, in relation to perception, planning, control and
coordination of autonomous vehicles. In relation to planning,
the overarching area of this current work, the authors of this
2017 paper identify three sub-categories of planning, namely:
mission planning, behavioral planning and motion planning. In
particular, this current work tackles one problem in relation to
mission planning; mission planning deals with such high-level
tasks as pickup/dropoff decision making. The Pendleton et al.
survey paper does not identify prior works applying machine
learning to mission planning.
The challenge of machine learning-based wait time
prediction for a given vicinity, the topic of this current paper, is
a valuable problem and yet to be addressed elsewhere in the
literature.
III. METHODOLOGY
In this exploratory study we are interested in evaluating
whether machine learning can be used to effectively predict
time until next ride in a given location and at a specific point in
time. While various ‘rules of thumb’ used by human drivers
may provide insights and judgements into this for given
circumstances, for a whole city or urban area, it would be
beneficial to be able to predict this systematically,
automatically and at any point in time for an AMoD service,
using real-time predictive model inference upon models, likely
trained off-line, to assist local vehicle or AMoD system-wide
decision making.
AMoD data logging is still in its early stages and in some
cases proprietary, as the widespread deployment of AMoD
systems is yet to occur; in particular, large-scale AMoD
datasets are not currently openly available. For the purposes
of this approach we require a large dataset that represents
real-world usage in a large urban environment. Hence for this
exploratory study, we utilize a historical dataset from a
real-world ridesharing service, containing attributes that
would at a minimum be available to any future AMoD system.
This allows us to evaluate the base effectiveness of the general
approach on one specific combination of urban area and
rideshare service, and to consider its potential broader
applicability to AMoD systems. We chose a dataset from the
ridesharing service RideAustin [4] as it had a richer attribute
set than the data we examined from Uber or other current
commercial ridesharing services.
A. Data source
We draw upon a dataset made public by the non-profit
Austin-based ridesharing service called RideAustin [4]. The
dataset initially consists of 911,057 ride records from the
RideAustin service from June 2016 to February 2017. The
dataset initially contains 28 data attributes for each rideshare
trip record, which include such information as the time the trip
was completed (day/hour/min/second), distance traveled, end
location latitude, end location longitude, when the trip started
(day/hour/min/second), the rating of the driver for the trip, the
rating of the rider for the trip, the trip start ZIP code, the trip end
ZIP code, the category of car requested, some billing attributes
such as free credit used, the surge factor, the start location
latitude, the start location longitude, various car-related
attributes such as color, make, model, year, and various
attributes related to the weather on that day such as amount of
precipitation, maximum temperature, minimum temperature,
average wind, wind gust speed, and whether there was fog,
heavy fog or thunder.
In its initial form, the dataset describes the rideshare trips
offered by this service during that period combined with a set
of attributes providing external contextual information, the
pertinent weather information in this case.
B. Data Preparation
From the initial dataset, a number of data cleaning and data
preparation activities were then carried out. First this involved
the removal of a number of duplicate ride records in the dataset.
Such removal was of records readily identifiable as duplicates
as they had identical start and end times to the second and all
other attributes also the same.
Additionally, as monthly ride frequency significantly rose
during the months of initial start-up of the service in mid-2016,
a subset representing a ‘steady-state’ month was chosen at
which point the ride volumes were high, and that month chosen
was January 2017.
As a prerequisite step towards model training and evaluation
a complex data engineering task was carried out, that involved
the creation of a new derived target attribute for the machine
learning process, that is:
Wait time until next pickup in the vicinity (W_i): for
each row/ride R_i that has a trip end ZIP code of Z_i and
completes at time T_i, a new target attribute is created
for ‘wait time’, W_i, that is equal to the amount of time
after T_i until the next ride departs from that same
destination ZIP code, Z_i.
This data engineering step facilitates the building of
predictive models for wait time prediction. Subsequently a
random subset of 50,000 rows of this newly engineered dataset
corresponding to a random sample of the January 2017 rides
was created. This size subset was chosen to decrease model
training and evaluation times.
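The derivation of the wait-time target attribute can be sketched as follows. This is a minimal illustration rather than the authors' actual data engineering code: the dictionary keys (start_zip, end_zip, started_at, completed_at), the use of seconds since an epoch, and the choice to count only pickups strictly after the completion time are all assumptions made for the example.

```python
from bisect import bisect_right
from collections import defaultdict


def add_wait_times(rides):
    """Return one wait time (seconds) per ride, or None if no later pickup exists.

    Each ride is a dict with hypothetical keys: start_zip, end_zip,
    started_at and completed_at (times as seconds since an epoch).
    """
    # Collect the pickup times for every start ZIP code, sorted for binary search.
    pickups = defaultdict(list)
    for ride in rides:
        pickups[ride["start_zip"]].append(ride["started_at"])
    for times in pickups.values():
        times.sort()

    waits = []
    for ride in rides:
        times = pickups.get(ride["end_zip"], [])
        # Index of the first pickup strictly after this ride's completion time.
        idx = bisect_right(times, ride["completed_at"])
        if idx < len(times):
            waits.append(times[idx] - ride["completed_at"])
        else:
            waits.append(None)  # no later pickup in this ZIP code
    return waits


example_rides = [
    {"start_zip": "78701", "end_zip": "78702", "started_at": 0, "completed_at": 600},
    {"start_zip": "78702", "end_zip": "78701", "started_at": 700, "completed_at": 1300},
    {"start_zip": "78702", "end_zip": "78703", "started_at": 900, "completed_at": 1500},
]
example_waits = add_wait_times(example_rides)
```

Rides with no subsequent pickup in their destination ZIP code have no defined wait time; how such rows were handled in the study is not stated, so the sketch simply returns None for them.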
A number of attributes were then removed as not applicable
to predicting wait time: these included car make, car model, car
year, car color, charity_id, free_credit_used, driver rating,
rating and rider rating. The median wait time in the dataset was
then calculated. In addition, after wait times had been
calculated to the second, time values such as the time of
completion and the time of start of each ride were converted
into a single hour value (in 24 hour time), binning ride records
by hour of the day.
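The hour binning described above might look like the following sketch. The timestamp format string is an assumption for illustration; the raw dataset's actual format is not given here.

```python
from datetime import datetime


def hour_bin(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Reduce a full timestamp to its hour of day (0-23).

    The default format string is hypothetical, not taken from the dataset.
    """
    return datetime.strptime(timestamp, fmt).hour


example_hour = hour_bin("2017-01-15 17:42:09")
```

Note that the wait time is computed from the full timestamps before this reduction, since binning to the hour discards the minute and second resolution.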
Finally two different versions of this dataset were created,
with all of the same input attributes but different versions of the
target attribute:
Classification dataset (CD): the target attribute is a
binary nominal value, set to 1 for rows with a wait time
greater than or equal to the median wait time, and set
to 0 for rows with a wait time less than the median wait
time. This is a dataset of 50,000 rows.
Regression dataset (RD): the target attribute is a
numerical value equal to the wait time in seconds. This
is a dataset of 10,000 rows, further chosen as a random
subset of the 50,000 rows to accommodate the greater
computational load and time needed for regression
model training.
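The construction of the binary classification target from the median wait time can be sketched as below. The function name is hypothetical; the threshold rule (1 for greater than or equal to the median, 0 otherwise) follows the description of the classification dataset.

```python
from statistics import median


def label_by_median(wait_times):
    """Return the median and a 0/1 label per wait time (1 if >= median)."""
    m = median(wait_times)
    labels = [1 if w >= m else 0 for w in wait_times]
    return m, labels


example_median, example_labels = label_by_median([30, 90, 200, 400])
```

Thresholding at the median yields two roughly balanced classes, which makes metrics such as AUC and the weighted true positive rate easier to interpret.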
In these datasets there are 20 attributes, including the target
attribute as shown in Table I.
C. Data Exploration
An initial exploration of the 50,000 row dataset provides the
following simple descriptive statistics. The mean distance
travelled was 7,999 meters or approximately 8 kilometers, and
the mean wait time was 209 seconds, or approximately three
and a half minutes.
D. Model Development
An open source machine learning toolkit [10] was used to
develop numerous classifiers for the CD dataset and numerous
regressors for the RD dataset. These were trained and evaluated
using 10-fold cross validation on the 50,000 row CD dataset and
similarly using 10-fold cross validation upon the 10,000 row RD
dataset.
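For readers unfamiliar with k-fold cross validation, the following is a minimal sketch of how row indices can be split into 10 folds. It is a plain, unstratified split written for illustration; the toolkit used in the study may shuffle and stratify differently, and the function name and seed parameter are assumptions.

```python
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


example_splits = list(k_fold_indices(50, k=10, seed=1))
```

Each row is used for testing exactly once and for training in the other nine folds, so the reported metrics are averaged over predictions on data not seen during training.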
Numerous classifier base models of such types as Bayesian,
instance-based learners (such as k-Nearest Neighbors and
KStar), decision trees (such as C4.5) and rule-based (such as
Decision Table) were initially trialed for the CD dataset and
then based upon the best performing of these base models,
various ensemble model variants of the high performing base
models were also developed, based upon such meta approaches
as bagging, boosting and stacking [11]. The Bayesian models in
general performed significantly better in terms of predictive
performance.
For boosting we utilized the Ada Boost (adaptive boosting)
meta-algorithm [12]. While this is often utilized with a decision
tree base model, the boosted decision tree models that
we trialed in this case performed significantly worse than the
Bayesian models.
For the RD dataset, various base regression models were
trialed including Support Vector Machine (SVM) regression,
instance-based approaches such as k-Nearest Neighbors and
KStar, linear regression, decision tables and decision trees such
as C4.5 and REPTree. Again for the best performing of these
base models, additional ensemble model variants were also
trialed. In terms of regression models, a wider range of base
models demonstrated higher performance levels.
TABLE I. MODEL ATTRIBUTES
ATTRIBUTE NAME            DESCRIPTION
completed_on              Hour of day completed, in 24 hour time
distance_travelled        Distance in meters
end_location_lat          Latitude of end of trip
end_location_lon          Longitude of end of trip
started_on                Hour of day trip started, in 24 hour time
start_zip_code            ZIP code where trip started
end_zip_code              ZIP code where trip ended
requested_car_category    Category of car requested: REGULAR, SUV, LUX, PREMIUM
surge_factor              15 distinct surge factor values
start_location_lat        Latitude of where trip started
start_location_lon        Longitude of where the trip started
PRCP                      Precipitation in millimeters
TMAX                      Maximum temperature of the day
TMIN                      Minimum temperature of the day
AWND                      Average wind speed
Gustspeed2                Speed of wind gusts
Fog                       Fog level described by 9 distinct values
HeavyFog                  Binary indicating heavy fog or not
Thunder                   Binary indicating presence of thunder during day or not
WAIT                      For CD: binary, indicating either above or below median wait time
                          For RD: exact wait time in seconds
E. Evaluation
A number of different performance evaluation metrics were
captured for each type of model trained. For each classifier the
Area Under the Receiver Operator Characteristic curve (AUC),
the True Positive (TP) rate (weighted between the two classes)
and the Area Under the Precision-Recall Curve (AUPRC) were
noted.
For the regression models Correlation Coefficient and Mean
Absolute Error (MAE) were captured.
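The metrics named above can be written out in a few lines. The sketch below gives textbook definitions of AUC (as the probability that a positive instance is scored above a negative one), MAE and the Pearson correlation coefficient. It is illustrative only and is not the toolkit's implementation; the pairwise AUC computation is quadratic and suited only to small inputs.

```python
from math import sqrt


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def correlation(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / sqrt(var_a * var_p)


example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
example_mae = mean_absolute_error([100, 200, 300], [110, 190, 330])
example_corr = correlation([1, 2, 3], [2, 4, 6])
```

For the regression models, MAE is expressed in the same units as the target (seconds of wait time), whereas the correlation coefficient is unitless and measures only how well predictions track the actual values linearly.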
IV. RESULTS
The performance of the best performing classifiers
developed and evaluated with 10-fold cross validation, using
the 50,000 row CD dataset, are summarized in Table II.
TABLE II. CLASSIFIER PERFORMANCE - 10-FOLD CROSS VALIDATION ON 50,000 ROW CD DATASET
MODEL                                                                        AUC     TP WEIGHTED   AUPRC
BAYES NETWORK                                                                0.821   0.744         0.812
BOOSTED BAYES NETWORK (ADA BOOST)                                            0.825   0.752         0.810
BAGGED BAYES NETWORK                                                         0.821   0.744         0.812
NAIVE BAYES                                                                  0.804   0.726         0.785
BOOSTED NAIVE BAYES (ADA BOOST)                                              0.824   0.754         0.809
BAGGED NAIVE BAYES                                                           0.804   0.726         0.786
STACKING - META: NAIVE BAYES, STACKED MODELS: NAIVE BAYES, BAYES NET, ONER   0.827   0.750         0.815
Fig. 1 and Fig. 2 provide the ROC curves (in terms of the
low delay class and high delay class respectively) for the best
performing classifier, the Stacking model listed in Table II. The
OneR model referred to as one of the base models used by the
Stacking model is a simple rule-based model that uses a rule
based upon just one input attribute that best predicts the target.
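The one-attribute rule idea behind OneR can be sketched as follows. This is a simplified illustration that assumes all input attributes are nominal; it is not the toolkit's implementation, which also handles numeric attributes. The attribute names in the example rows are hypothetical.

```python
from collections import Counter, defaultdict


def one_r(rows, target):
    """Pick the single attribute whose value-to-majority-class rule makes
    the fewest training errors. Returns (attribute, rule), where rule maps
    each attribute value to a predicted class."""
    best = None
    for attr in rows[0]:
        if attr == target:
            continue
        # Count class frequencies for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(row[target] != rule[row[attr]] for row in rows)
        if best is None or errors < best[0]:
            best = (errors, attr, rule)
    return best[1], best[2]


example_rows = [
    {"hour_band": "peak", "fog": "no", "WAIT": 0},
    {"hour_band": "peak", "fog": "yes", "WAIT": 0},
    {"hour_band": "off_peak", "fog": "no", "WAIT": 1},
    {"hour_band": "off_peak", "fog": "yes", "WAIT": 1},
]
example_attr, example_rule = one_r(example_rows, "WAIT")
```

In this toy data the hour band separates the two classes perfectly while fog does not, so the hour band is selected as the single predictive attribute.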
FIG. 1: ROC THRESHOLD CURVE FOR STACKED MODEL - META: NAIVE
BAYES, STACKED: NAIVE BAYES, BAYES NET, ONER - LOW DELAY TIME
FIG. 2: ROC THRESHOLD CURVE FOR STACKED MODEL - META: NAIVE
BAYES, STACKED: NAIVE BAYES, BAYES NET, ONER - HIGH DELAY TIME
The best performing regression models developed and
evaluated using 10-fold cross-validation, on the 10,000 row RD
dataset are shown in Table III.
The MAE value is given in seconds.
V. DISCUSSION
The best performing classification models achieve an AUC
of over 0.82, with the stacking approach (see Table II)
achieving the highest AUC of 0.827. This would be considered
a good level of discriminative performance for a model, with
above 0.8 considered ‘good’ and above 0.9 considered
‘excellent’ predictive performance [13].
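TABLE III. REGRESSION MODEL PERFORMANCE - 10-FOLD CROSS VALIDATION ON 10,000 ROW RD DATASET
MODEL                                       CORRELATION COEFFICIENT   MEAN ABSOLUTE ERROR (SECS)
SVM REGRESSION                              0.355                     145.8866
(model name missing)                        0.3282                    160.683
(model name missing)                        0.2872                    195.1336
(model name missing)                        0.5546                    154.7369
(model name missing)                        0.5464                    160.5489
BAGGED REPTREE                              0.5747                    148.8249
DECISION TABLE-BASED ADDITIVE REGRESSION    0.5871                    156.4668
(model name missing)                        0.5782                    148.1987
(model name missing)                        0.5412                    155.9432
(model name missing)                        0.5591                    157.2248
RANDOM SUBSPACE - REPTREE                   0.5836                    158.1587
(model name missing)                        0.5695                    166.9425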
This indicates that the machine learning technique, even with
a basic set of attributes that would be available in any
anticipated AMoD system, is able to discriminate well at the
coarse level of high vs low wait time until the next ride
request for that vicinity (ZIP code), either at the point of
the preceding passenger pickup or at the time of completing the
passenger drop-off. It could be anticipated that emerging AMoD
systems will additionally capture far more per-ride/per-request
data attributes that have the potential to improve
predictive model performance.
The best performing regression model in terms of
correlation coefficient is Decision Table-based Additive
Regression, achieving a correlation coefficient of 0.5871. Such
a result might be considered on the lower boundary of a ‘strong
correlation’ [14]. It achieves an MAE of 156.5 seconds. Other
similarly performing regression models include Random
Subspace - REPTree and Bagged REPTree with correlation
coefficients of 0.5836 and 0.5747 respectively. MAEs are in
the range of 150 seconds (approximately two and a half
minutes), which may be a relatively small amount of time
compared with that typically required to relocate an
autonomous vehicle to another ZIP code. This suggests the
time-to-wait prediction may have value in the decision as to
whether to relocate or not, or to where. It should be noted that the
regression model with the lowest MAE was SVM regression
achieving an MAE of 145.89 seconds.
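One way a predicted wait time could feed into a relocation decision is sketched below. This rule is purely hypothetical and is not proposed or evaluated in this paper: the function name, its parameters and the use of the MAE as a safety margin are all assumptions made to illustrate the reasoning above.

```python
def should_relocate(predicted_wait_here, travel_time_there,
                    predicted_wait_there, margin=150.0):
    """Hypothetical rule: relocate only if the expected time saved exceeds
    a margin, here set near the regression MAE reported above (seconds)."""
    cost_of_staying = predicted_wait_here
    cost_of_moving = travel_time_there + predicted_wait_there
    return cost_of_staying - cost_of_moving > margin


example_move = should_relocate(900, 300, 120)   # expected saving of 480 seconds
example_stay = should_relocate(400, 300, 60)    # expected saving of 40 seconds
```

Using a margin of roughly the model's typical error is a conservative choice: small predicted savings are within the noise of the prediction and would not justify the cost of an empty relocation trip.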
Here we have concerned ourselves with evaluating the
predictive performance of exploratory or demonstrator models,
but various techniques can also draw upon these predictive
models so as to integrate the uncertain knowledge provided by
these or combine the uncertain predictions of multiple models
[15][16].
The results suggest that models, even based upon a limited
number of per-trip data attributes, can provide good predictive
performance.
A. Generalizability of Models
Such trained models for MoD or AMoD systems are
inherently city/region or geography specific, and also service
specific. That is, for a given AMoD system, the predictive
model would be trained from recent service usage data for each
particular city/region. That is, the model is specific to a given
city’s demand patterns. Such models would typically be trained
off-line on historical data, and then be used in real-time
inference to assist pickup and mission decision-making. It is
possible to update the training of the models fairly continuously
as new AMoD service data becomes available for that area.
Low-latency, fog computing-based architectures may provide
the distributed architecture for real-time inference
computation [17]. There is the potential in future work to extend
the training datasets with additional contextual attributes such
as fine-grained neighborhood-specific population numbers or
in-area workforce or pedestrian numbers or the potential for
ridesharing services to draw in desired attributes through such
customizable data collection approaches as crowdsensing [18].
The models are AMoD service specific in that they are
dependent upon the customer demand patterns of the service
users. Such demand patterns may have similarities or
differences between services, depending on whether the
customer bases differ in their location distribution and typical
usage/travel patterns.
A possible immediate future step is to consider such models
trained on datasets specific to each class of vehicle service, as
the demand patterns for different service levels may be distinct.
VI. CONCLUSION
In this paper we have described an exploratory study
demonstrating and evaluating the efficacy of machine learning-
based predictive models in predicting wait time until the next
ride request in a given vicinity for Autonomous Mobility-on-
Demand systems. Given the lack of available large-scale
historical AMoD trip datasets, we have demonstrated and
evaluated this approach on a real-world, ridesharing service
dataset, provided from such a service in Austin, TX. The
predictive performance demonstrated is good both in terms of
the classification and regression models developed, suggesting
the value and promise of extending such an approach to future,
more attribute-rich AMoD ride datasets.
REFERENCES
[1] Spieser, K., Treleaven, K., Zhang, R., Frazzoli, E., Morton, D., & Pavone,
M. “Toward a systematic approach to the design and evaluation of
automated mobility-on-demand systems: A case study in Singapore”. In
Road vehicle automation, 2014 (pp. 229-245). Springer, Cham.
[2] Pavone, M. “Autonomous mobility-on-demand systems for future urban
mobility”. In Autonomes Fahren, 2015. (pp. 399-416). Springer Vieweg,
Berlin, Heidelberg.
[3] Zhang, R., Spieser, K., Frazzoli, E., & Pavone, M. “Models, algorithms,
and evaluation for autonomous mobility-on-demand systems”. In
American Control Conference (ACC), July, 2015 (pp. 2573-2587). IEEE.
[4] Ride Austin. Data File and Dictionary. Available from:
https://data.world/ride-austin/ride-austin-june-6-april-13. Accessed Jan 2,
2019.
[5] J. Miller & J.P. How. “Predictive positioning and quality of service
ridesharing for campus mobility on demand systems”. In 2017 IEEE
International Conference on Robotics and Automation (ICRA) (pp. 1402-
1408), 2017, IEEE.
[6] Iglesias, R., Rossi, F., Wang, K., Hallac, D., Leskovec, J., & Pavone, M.
“Data-driven model predictive control of autonomous mobility-on-
demand systems”. In 2018 IEEE International Conference on Robotics
and Automation (ICRA) (pp. 1-7). May, 2018, IEEE.
[7] Guériau, M., & Dusparic, I. “SAMoD: Shared Autonomous Mobility-on-
Demand using Decentralized Reinforcement Learning”. In 2018 21st
International Conference on Intelligent Transportation Systems (ITSC)
(pp. 1558-1563), Nov. 2018, IEEE.
[8] Miao, F., Han, S., Hendawi, A. M., Khalefa, M. E., Stankovic, J. A., &
Pappas, G. J. “Data-driven distributionally robust vehicle balancing using
dynamic region partitions”. In Proceedings of the 8th International
Conference on Cyber-Physical Systems, April, 2017 (pp. 261-271), ACM.
[9] Pendleton, S. D., Andersen, H., Du, X., Shen, X., Meghjani, M., Eng, Y.
H., ... & Ang, M. H. “Perception, planning, control, and coordination for
autonomous vehicles”. Machines, 5(1), 6, 2017.
[10] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, & I.H.
Witten, “The WEKA data mining software: an update”. ACM SIGKDD
explorations newsletter, 11(1), 10-18, 2009.
[11] Džeroski, S., & Ženko, B. “Is combining classifiers with stacking better
than selecting the best one?”. Machine learning, 54(3), 255-273, 2004.
[12] Collins, M., Schapire, R. E., & Singer, Y. “Logistic regression,
AdaBoost and Bregman distances”. Machine Learning, 48(1-3), 253-285, 2002.
[13] J.A. Hanley and B.J. McNeil. “The meaning and use of the area under a
receiver operating characteristic (ROC) curve”. Radiology, 143(1), 29-36,
1982.
[14] BMJ, Online: [https://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/11-correlation-and-regression]
[15] Schmidt, S., Steele, R., Dillon, T. S., & Chang, E. “Fuzzy trust evaluation
and credibility development in multi-agent systems”. Applied Soft
Computing, 7(2), 492-505, 2007.
[16] Zhang, R., Rossi, F., & Pavone, M. “Model predictive control of
autonomous mobility-on-demand systems”. In 2016 IEEE International
Conference on Robotics and Automation (ICRA), May, 2016, (pp. 1382-
1389). IEEE.
[17] Jaimes, L. G., Chakeri, A., & Steele, R. “Localized cooperation for
crowdsensing in a fog computing-enabled internet-of-things”. Journal of
Ambient Intelligence and Humanized Computing, 1-13, 2019. doi:
10.1007/s12652-018-0818-z
[18] Steele, R., & Jaimes, L. G. “Crowdsensing sub-populations in a region”.
Journal of Ambient Intelligence and Humanized Computing, 1-10, 2019.
doi: 10.1007/s12652-018-0799-y
... instances 15,001-20,000, 20,001-25,000 ….. 45,001-50,000) the model has been trained on each of the preceding histories (the prior 10,000, the prior 15,000 and all preceding instances) in each case using 10-fold cross validation. Note that for the first test period (15,(1)(2)(3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13)(14)(15)(16)(17)(18)(19)(20),000), the 3 group (of 5000) history will necessarily be the same as the all preceding instances/groups history. ...
... It is anticipated, but not yet comprehensively demonstrated or even addressed in this current paper, that the performance deterioration characteristics of ML models trained on software artifacts such as malare/goodware datasets will be more generalizable than models developed for application domains that can be nation or state dependent [12] or even city specific [16]. Observations include that in terms of accuracy of the model considered, while the results are mixed and somewhat variable, accuracy is sometimes lower for a shorter history and sometimes higher as per Figure 5. ...
Conference Paper
Full-text available
The occurrence of previously unseen malicious code or malware is an implicit and ongoing issue for all software-based systems. It has been recognized that machine learning, applied to features statically extracted from binary executable files, offers a number of promising benefits, such as its ability to detect malware that has not been previously encountered. Nevertheless it is understood that these models will not continue to perform equally well over time as new and potentially less recognizable malwares occur. In this study, we have applied a range of machine learning models to the features extracted from a large collection of software executables in Portable Executable format ordered by the date the binary was first encountered, consisting of both malware and benign examples, whilst considering different training set configurations and timeframes. We analyze and quantify the relative performance deterioration of these machine learning models on future test sets of these features, and discuss some insights into the characteristics and rate of machine learning-based malware detection performance deterioration and training set selection.
Thesis
Full-text available
Ride-sourcing transportation services offered by transportation network companies (TNCs) like Uber and Lyft are disrupting the transportation landscape. The growing demand for these services, along with their potential short- and long-term impacts on the environment, society, and infrastructure, emphasizes the need to further understand the ride-sourcing system. There have not been sufficient data to fully understand the system and integrate it within regional multimodal transportation frameworks. This can be attributed to commercial and competition reasons, given the technology-enabled and innovative nature of the system. Recently, in 2019, the City of Chicago released an extensive and complete ride-sourcing trip-level dataset for all trips made within the city since November 1, 2018. The data comprise the trip ends (pick-up and drop-off locations), trip timestamps, trip length and duration, fare including tipping amounts, and whether the trip was authorized to be shared (pooled) with another passenger or not. Therefore, the main goal of this dissertation is to develop a comprehensive data-driven framework to understand and model the system using this data from Chicago, in a reproducible and transferable fashion. Using a data fusion approach, sociodemographic, economic, parking supply, transit availability and accessibility, built environment, and crime data are collected from open sources to develop this framework. The framework is predicated on three pillars of analytics: (1) explorative and descriptive analytics, (2) diagnostic analytics, and (3) predictive analytics. The dissertation research framework also provides a guide on the key spatial and behavioral explanatory variables shaping the utility of the mode, driving the demand, and governing the interdependencies between the demand’s willingness to share and surge price. Thus, the key findings can be readily challenged, verified, and utilized in different geographies.
In the explorative and descriptive analytics, the spatial and temporal dimensions of the ride-sourcing system are analyzed to achieve two objectives: (1) explore, reveal, and assess the significance of spatial effects, i.e., spatial dependence and heterogeneity, in the system behavior, and (2) develop a behavioral market segmentation and trend mining of the willingness to share. This is linked to the diagnostic analytics layer, as the revealed spatial effects motivate the adoption of spatial econometric models to analytically identify the ride-sourcing system determinants. Multiple linear regression (MLR) is used as a benchmark model against the spatial error model (SEM), the spatially lagged X (SLX) model, and the geographically weighted regression (GWR) model. Two innovative modeling constructs are introduced in the diagnostic analytics layer to deal with the ride-sourcing system’s spatial effects and multicollinearity: (1) the Calibrated Spatially Lagged X Ridge Model (CSLXR) and (2) the Calibrated Geographically Weighted Ridge Regression (CGWRR). The identified determinants in the diagnostic analytics layer are then fed into the predictive analytics layer to develop an interpretable machine learning (ML) modeling framework. The system’s annual average weekday origin-destination (AAWD OD) flow is modeled using the following state-of-the-art ML models: (1) Multilayer Perceptron (MLP) Regression, (2) Support Vector Machines Regression (SVR), and (3) tree-based ensemble learning methods, i.e., Random Forest Regression (RFR) and Extreme Gradient Boosting (XGBoost). The innovative modeling construct of CGWRR developed in the diagnostic analytics is then validated in a predictive context and is found to outperform the state-of-the-art ML models, with a testing score of 0.914 compared with 0.906 for XGBoost, 0.84 for RFR, 0.89 for SVR, and 0.86 for MLP. The CGWRR also outperforms them in terms of the root mean squared error (RMSE) and mean absolute error (MAE).
The findings of this dissertation partially bridge the gap between practice and research on understanding and integrating ride-sourcing transportation systems. The empirical findings made in the descriptive and explorative analytics can be further utilized by regional agencies to fill practice and policymaking gaps on regulating ride-sourcing services using corridor or cordon tolls, optimally allocating standing areas to minimize deadheading, especially during off-peak periods, and promoting ride-share willingness in disadvantaged communities. The CGWRR provides researchers and practitioners with a reliable modeling and simulation tool to integrate the ride-sourcing system into multimodal transportation modeling frameworks, a simulation testbed for testing the long-range impacts of policies on ride-sourcing, such as improved transit supply, congestion pricing, or increased parking rates, and a means to plan ahead for similar futuristic transportation modes, such as shared autonomous vehicles.
Article
In this article, we describe and evaluate a crowdsensing approach that entails local cooperation between crowdsensing participants in smart environments, utilizing an underlying fog computing-enabled Internet of Things. A fog computing-based Internet-of-Things architecture involves a layer of computing nodes residing closer to the sensing devices, with this layer of fog nodes lying in between mobile and sensing devices at the network edge and the cloud. This motivates us to propose a model for crowdsensing in smart environments that involves both competition and cooperation between nearby crowdsensing participants at the edge network. Comprehensive simulations are presented to evaluate the performance of the proposed approach. The work shows desirable characteristics in terms of number of active participants, number of samples collected within a given budget and coverage, resulting from localized cooperation by crowdsensing participants at the edge layer that can support various smart environment applications.
Article
Crowdsensing refers to an approach for collecting data from a large number of smart devices and sensors carried by many individuals and has been employed for numerous applications, which include pollution monitoring, traffic monitoring and noise sensing. It is an important mechanism for building applications in the smart environments enabled by the internet-of-things. However, a given problem may often dictate that samples are drawn from a defined sub-population of participants, for example based on characteristics of the participant such as location, demographics or other profile attributes, rather than from any possible member of the whole population. In this article we introduce an approach for crowdsensing with a consideration for how to sample from specific sub-populations in a region, delineated in a dimension-based way analogous to the multi-dimensional data model used in data warehousing. Simulation and performance results are provided demonstrating the approach’s ability to maintain active participants, provide coverage of the region of interest, and scalably sample the variable of interest in relation to the sub-population. This is the first work to our knowledge to address and propose an approach to the specific problem of crowdsourcing from specific attribute-defined sub-populations.
Article
Autonomous vehicles are expected to play a key role in the future of urban transportation systems, as they offer potential for additional safety, increased productivity, greater accessibility, better road efficiency, and a positive impact on the environment. Research in autonomous systems has seen dramatic advances in recent years, due to increases in available computing power and the reduced cost of sensing and computing technologies, resulting in a maturing technological readiness level for fully autonomous vehicles. The objective of this paper is to provide a general overview of the recent developments in the realm of autonomous vehicle software systems. Fundamental components of autonomous vehicle software are reviewed, and recent developments in each area are discussed.
Chapter
This chapter discusses the operational and economic aspects of autonomous mobility-on-demand (AMoD) systems, a transformative and rapidly developing mode of transportation wherein robotic, self-driving vehicles transport passengers in a given environment. Specifically, AMoD systems are addressed along three dimensions: (1) modeling, that is analytical models capturing salient dynamic and stochastic features of customer demand, (2) control, that is coordination algorithms for the vehicles aimed at throughput maximization, and (3) economic, that is fleet sizing and financial analyses for case studies of New York City and Singapore. Collectively, the models and methods presented in this chapter enable a rigorous assessment of the value of AMoD systems.
Conference Paper
With the transformation to smarter cities and the development of technologies, a large amount of data is collected from sensors in real-time. This paradigm provides opportunities for improving transportation systems' performance by proactively allocating vehicles towards predicted mobility demand. However, how to deal with uncertainties in the demand probability distribution to improve the average system performance is still a challenging and unsolved task. Considering this problem, in this work, we develop a data-driven distributionally robust vehicle balancing method to minimize the worst-case expected cost. We design an efficient algorithm for constructing uncertainty sets of random demand probability distributions, and leverage a quad-tree dynamic region partition method for better capturing the dynamic spatial-temporal properties of the uncertain demand. We then prove an equivalent computationally tractable form for numerically solving the distributionally robust problem. We evaluate the performance of the data-driven vehicle balancing framework based on four years of taxi trip data for New York City. We show that the average total idle driving distance is reduced by 30% with the distributionally robust vehicle balancing method using the quad-tree dynamic region partition method, compared with vehicle balancing solutions based on static region partitions without considering demand uncertainties. This corresponds to a reduction of about 60 million miles, or 8 million dollars in cost, annually in NYC. CCS CONCEPTS • Mathematics of computing → Stochastic control and optimization; Probabilistic algorithms; • Networks → Network algorithms; • Computer systems organization → Embedded and cyber-physical systems
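As a rough illustration of the general idea behind a quad-tree region partition, the sketch below recursively splits a rectangular area into four quadrants until each region holds at most a few demand points, so dense areas end up with finer regions than sparse ones. It is a generic quad-tree under stated assumptions, not the cited paper's algorithm: the names `QuadNode`, `build_quadtree`, and `leaves`, the `max_points` and `max_depth` parameters, and the toy coordinates are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]  # hypothetical (x, y) location of a ride request


@dataclass
class QuadNode:
    x0: float
    y0: float
    x1: float
    y1: float
    points: List[Point] = field(default_factory=list)
    children: List["QuadNode"] = field(default_factory=list)


def build_quadtree(points: List[Point], x0: float, y0: float, x1: float,
                   y1: float, max_points: int = 4, max_depth: int = 6,
                   depth: int = 0) -> QuadNode:
    """Split a region into four quadrants until it holds <= max_points."""
    node = QuadNode(x0, y0, x1, y1, list(points))
    if len(points) <= max_points or depth >= max_depth:
        return node  # leaf region
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    # Assign each point to exactly one quadrant by comparing to the midpoints.
    buckets: List[List[Point]] = [[], [], [], []]
    for p in points:
        idx = (1 if p[0] >= xm else 0) + (2 if p[1] >= ym else 0)
        buckets[idx].append(p)
    bounds = [(x0, y0, xm, ym), (xm, y0, x1, ym),
              (x0, ym, xm, y1), (xm, ym, x1, y1)]
    for bucket, (bx0, by0, bx1, by1) in zip(buckets, bounds):
        node.children.append(build_quadtree(
            bucket, bx0, by0, bx1, by1, max_points, max_depth, depth + 1))
    node.points = []  # interior nodes hand their points to the children
    return node


def leaves(node: QuadNode) -> List[QuadNode]:
    """Collect the leaf regions, i.e. the final partition of the area."""
    if not node.children:
        return [node]
    out: List[QuadNode] = []
    for child in node.children:
        out.extend(leaves(child))
    return out


# Toy demand: a dense cluster near the origin plus two isolated requests.
demand = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.2, 0.2), (0.15, 0.15),
          (0.9, 0.9), (0.8, 0.3)]
tree = build_quadtree(demand, 0.0, 0.0, 1.0, 1.0, max_points=2)
regions = leaves(tree)
print(len(regions), [len(r.points) for r in regions])
```

Rebuilding the tree as the demand points change is one simple way such a partition could adapt over time; how the cited work updates its regions is not specified here.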
Article
This tutorial paper examines the operational and economic aspects of autonomous mobility-on-demand (AMoD) systems, a rapidly emerging mode of personal transportation wherein robotic, self-driving vehicles transport customers in a given environment. We address AMoD systems along three dimensions: (1) modeling - analytical models capable of capturing the salient dynamic and stochastic features of customer demand, (2) control - coordination algorithms for the vehicles aimed at stability and subsequently throughput maximization, and (3) economic - fleet sizing and financial analyses for case studies of New York City and Singapore. Collectively, the models and algorithms presented in this paper enable a rigorous assessment of the value of AMoD systems. In particular, the case study of New York City shows that the current taxi demand in Manhattan can be met with about 8,000 robotic vehicles (roughly 70% of the size of the current taxi fleet), while the case study of Singapore suggests that an AMoD system can meet the personal mobility needs of the entire population of Singapore with a number of robotic vehicles that is less than 40% of the current number of passenger vehicles. Directions for future research on AMoD systems are presented and discussed.