Article

Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh – A Python package)

Authors: Maximilian Christ, Nils Braun, Julius Neuffer, Andreas W. Kempa-Liehr

Abstract

Time series feature engineering is a time-consuming process, because scientists and engineers have to consider the multifarious algorithms of signal processing and time series analysis for identifying and extracting meaningful features from time series. The Python package tsfresh (Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests) accelerates this process by combining 63 time series characterization methods, which by default compute a total of 794 time series features, with feature selection on the basis of automatically configured hypothesis tests. By identifying statistically significant time series characteristics at an early stage of the data science process, tsfresh closes feedback loops with domain experts and fosters the development of domain-specific features early on. The package implements the standard APIs of time series and machine learning libraries (e.g. pandas and scikit-learn) and is designed both for exploratory analyses and for straightforward integration into operational data science applications.
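The workflow the abstract describes can be sketched in a few lines using tsfresh's documented entry points (extract_features, impute, select_features). The long-format column names and the synthetic data below are illustrative assumptions, not part of the paper:

    import numpy as np
    import pandas as pd
    from tsfresh import extract_features, select_features
    from tsfresh.utilities.dataframe_functions import impute

    rng = np.random.default_rng(42)
    # Long-format input: one row per observation, grouped by a series id.
    df = pd.DataFrame({
        "id": np.repeat(np.arange(20), 50),
        "time": np.tile(np.arange(50), 20),
        "value": rng.normal(size=1000),
    })
    # One binary label per series; here it is tied to the series mean so
    # that at least some extracted features are genuinely relevant.
    y = (df.groupby("id")["value"].mean() > 0).astype(int)

    # Compute the full default feature matrix (one row per id).
    X = extract_features(df, column_id="id", column_sort="time")
    X = impute(X)  # replace NaN/inf left by features undefined on this data

    # Keep only features passing the automatically configured hypothesis tests.
    X_selected = select_features(X, y)
    print(X.shape, "->", X_selected.shape)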
... The second method utilizes the Python tool tsfresh, a package that generates time-series features using 78 different feature calculation modules (Christ et al., 2018). tsfresh runs these calculation modules over the time series in the dataset and derives additional features from their results. ...
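The modular design described in this excerpt is exposed through tsfresh's settings dictionaries, which map calculator names to parameter lists. A small sketch with toy data and hypothetical column names:

    import numpy as np
    import pandas as pd
    from tsfresh import extract_features
    from tsfresh.feature_extraction import MinimalFCParameters

    df = pd.DataFrame({
        "id": np.repeat([0, 1], 50),
        "time": np.tile(np.arange(50), 2),
        "value": np.random.default_rng(0).normal(size=100),
    })

    # Run only a hand-picked subset of calculation modules.
    fc_parameters = {
        "mean": None,                     # parameter-free calculators map to None
        "autocorrelation": [{"lag": 1}],  # parametrized ones map to dicts
        "linear_trend": [{"attr": "slope"}],
    }
    X_small = extract_features(df, column_id="id", column_sort="time",
                               default_fc_parameters=fc_parameters)

    # MinimalFCParameters() selects a cheap, predefined baseline set instead.
    X_min = extract_features(df, column_id="id", column_sort="time",
                             default_fc_parameters=MinimalFCParameters())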
... made use of the processing module netAI (Zander and Williams, 2011) to create network-specific flow features. Stan et al. (2020) also implemented the use of a feature generation module called Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh) by Christ et al. (2018) that allowed the calculation of features from time-series datasets. Self-Organizing Maps (SOM) were used by da Silva Rodrigues et al. (2021) to generate new features, with noted improvements in F1-scores depending on the classification model used. Brownlee (2020) highlighted other feature generation methods such as polynomial transformation. ...
Article
Full-text available
The MIL-STD-1553B data bus protocol is used in both civilian and military aircraft to enable communications between subsystems. These interconnected subsystems are responsible for core services such as communications, the flow of instrument data and aircraft control. With aircraft modernization, threat vectors are introduced through increased inter-connectivity internal and external to the aircraft. The resulting potential for exploitation introduces a requirement for an intrusion detection capability in order to maintain the integrity, availability and reliability of data transmitted using the MIL-STD-1553B protocol, the safety of the aircraft and, overall, to achieve mission assurance. Research in recent years has investigated signature-, statistical- and machine learning-based solutions to detect attacks on MIL-STD-1553B buses. Of the different techniques, those based on machine learning have shown extremely good results. The aim of this research is to improve the performance of an existing Long Short-Term Memory Auto-Encoder by refining the feature engineering phase of its pipeline. The improvement in the detector's overall effectiveness was accomplished through feature engineering focused on feature generation and selection. Five different attack datasets were used as the starting point, consisting of four different denial-of-service attacks and one data integrity attack. From an initial extraction of 155 features, two feature generation techniques were employed to create over 38,000 candidate features. Using five different MIL-STD-1553B datasets and three feature selection techniques, fifteen different Long Short-Term Memory Auto-Encoder models were created, trained and evaluated using common performance metrics, and compared to the original anomaly detector. This research demonstrated a marked performance improvement achieved by the feature engineering refinements in comparison to the original model. Equally important, this research also showed a significant reduction in the number of features required to achieve this performance gain. In the context of military air operations, the ability to improve detection capabilities with less data is important to the technical solutions that contribute to the achievement of cyber mission assurance.
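For orientation, a generic reconstruction-error detector of the kind this abstract builds on might look as follows in PyTorch. This is not the authors' model; the window length, latent size, and threshold policy are placeholder assumptions (only the 155 matches the abstract's initial feature count):

    import torch
    import torch.nn as nn

    class LSTMAutoEncoder(nn.Module):
        # Encode a window of feature vectors into a latent state, then
        # reconstruct it; a high reconstruction error flags an anomaly.
        def __init__(self, n_features, latent_dim=32):
            super().__init__()
            self.encoder = nn.LSTM(n_features, latent_dim, batch_first=True)
            self.decoder = nn.LSTM(latent_dim, n_features, batch_first=True)

        def forward(self, x):                  # x: (batch, seq_len, n_features)
            _, (h, _) = self.encoder(x)
            z = h[-1].unsqueeze(1).repeat(1, x.size(1), 1)
            recon, _ = self.decoder(z)
            return recon

    model = LSTMAutoEncoder(n_features=155)
    x = torch.randn(8, 64, 155)                 # 8 windows of 64 time steps
    loss = nn.functional.mse_loss(model(x), x)  # training objective
    # At inference, windows whose error exceeds a threshold calibrated on
    # benign traffic would be flagged as anomalous.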
... As well as the core dependencies, aeon also includes a range of optional/soft dependencies on packages such as statsmodels (Seabold and Perktold, 2010), tensorflow (Abadi et al., 2015), and tsfresh (Christ et al., 2018). These are commonly used to create wrappers for algorithms present in these packages, or are used as a framework for estimators such as deep learners. ...
... They can be series-to-series transformations, which both take and output a time series, such as the Fourier transform or channel selection for multivariate series (Dhariyal et al., 2023). Alternatively, transformers can be series-to-features, which take a series as input but output a feature vector, such as basic summary statistics or TSFresh (Christ et al., 2018). ...
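The distinction drawn in this excerpt can be illustrated with two toy transformations (plain numpy, not aeon's transformer API):

    import numpy as np

    def fourier_magnitudes(series):
        # Series-to-series: a time series in, another sequence out.
        return np.abs(np.fft.rfft(series))

    def summary_features(series):
        # Series-to-features: a time series in, a fixed-length vector out.
        return np.array([series.mean(), series.std(), series.min(), series.max()])

    x = np.sin(np.linspace(0, 10, 256))
    print(fourier_magnitudes(x).shape)  # (129,) -- still a sequence
    print(summary_features(x).shape)    # (4,)   -- a flat feature vector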
Preprint
Full-text available
aeon is a unified Python 3 library for all machine learning tasks involving time series. The package contains modules for time series forecasting, classification, extrinsic regression and clustering, as well as a variety of utilities, transformations and distance measures designed for time series data. aeon also has a number of experimental modules for tasks such as anomaly detection, similarity search and segmentation. aeon follows the scikit-learn API as much as possible to help new users and enable easy integration of aeon estimators with useful tools such as model selection and pipelines. It provides a broad library of time series algorithms, including efficient implementations of the very latest advances in research. Using a system of optional dependencies, aeon integrates a wide variety of packages into a single interface while keeping the core framework with minimal dependencies. The package is distributed under the 3-Clause BSD license and is available at https://github.com/aeon-toolkit/aeon. This version was submitted to the JMLR journal on 02 Nov 2023 for v0.5.0 of aeon. At the time of this preprint, aeon had released v0.9.0 and had undergone substantial changes.
... With the emergence of machine learning technology, techniques including classification [19], clustering [15], ensemble learning [32], and time series forecasting [14] are applied to time series anomaly detection. In addition, tsfresh has inspired the window-to-feature approach, enhancing the efficiency of feature extraction in time series analysis [8]. ROCKET's focus on sub-sequence patterns through random convolutional kernels has inspired advancements in capturing local temporal patterns [11]. ...
Preprint
Time Series Anomaly Detection (TSAD) finds widespread applications across various domains such as financial markets, industrial production, and healthcare. Its primary objective is to learn the normal patterns of time series data, thereby identifying deviations in test samples. Most existing TSAD methods focus on modeling data from the temporal dimension, while ignoring the semantic information in the spatial dimension. To address this issue, we introduce a novel approach, called Spatial-Temporal Normality learning (STEN). STEN is composed of a sequence Order prediction-based Temporal Normality learning (OTN) module that captures the temporal correlations within sequences, and a Distance prediction-based Spatial Normality learning (DSN) module that learns the relative spatial relations between sequences in a feature space. By synthesizing these two modules, STEN learns expressive spatial-temporal representations for the normal patterns hidden in the time series data. Extensive experiments on five popular TSAD benchmarks show that STEN substantially outperforms state-of-the-art competing methods. Our code is available at https://github.com/mala-lab/STEN.
... However, the choice of the base classifiers could induce a bias favoring or hampering some methods. In order to clarify this, we have repeated the experiments replacing MiniROCKET with two base classifiers: WEASEL 2.0 (Schäfer & Leser, 2023), and the XGBoost classifier using features produced by tsfresh (Christ et al., 2018). Both of these classifiers have already been tested within the ECTS literature in (Schäfer & Leser, 2020; Lv et al., 2019) and (Achenchabe et al., 2021a), respectively. ...
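A sketch of that second base classifier, pairing tsfresh's one-call extraction-and-selection helper with XGBoost; the toy data and label construction are assumptions for illustration:

    import numpy as np
    import pandas as pd
    from tsfresh import extract_relevant_features
    from xgboost import XGBClassifier

    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=30)
    df = pd.DataFrame({
        "id": np.repeat(np.arange(30), 40),
        "time": np.tile(np.arange(40), 30),
        # Shift class-1 series so that some features are truly informative.
        "value": rng.normal(size=1200) + np.repeat(labels, 40) * 0.8,
    })
    y = pd.Series(labels, index=np.arange(30))

    # Extract and statistically filter features in one step, then fit.
    X = extract_relevant_features(df, y, column_id="id", column_sort="time")
    clf = XGBClassifier(n_estimators=100).fit(X, y)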
Preprint
Full-text available
In many situations, the measurements of a studied phenomenon are provided sequentially, and the prediction of its class needs to be made as early as possible so as not to incur too high a time penalty, but not too early and risk paying the cost of misclassification. This problem has been particularly studied in the case of time series, and is known as Early Classification of Time Series (ECTS). Although it has been the subject of a growing body of literature, there is still a lack of a systematic, shared evaluation protocol to compare the relative merits of the various existing methods. This document begins by situating these methods within a principle-based taxonomy. It defines dimensions for organizing their evaluation, and then reports the results of a very extensive set of experiments along these dimensions involving nine state-of-the art ECTS algorithms. In addition, these and other experiments can be carried out using an open-source library in which most of the existing ECTS algorithms have been implemented (see \url{https://github.com/ML-EDM/ml_edm}).
... It represents a straight line pattern when plotted on a graph, where the data points follow a linear relationship as time progresses. For the time series values, it computes a linear least-squares regression, and "p-value", "correlation coefficient", "intercept", "slope", and "standard error" are obtained [32]. In this scenario, the three time series correspond to the 3-phase currents monitored by CT_PV. ...
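This excerpt describes tsfresh's linear_trend feature calculator, which wraps scipy's linear regression; the same five attributes can be computed directly (synthetic series assumed):

    import numpy as np
    from scipy.stats import linregress

    # A synthetic drifting signal standing in for one monitored current.
    x = np.cumsum(np.random.default_rng(1).normal(0.1, 1.0, size=500))
    t = np.arange(x.size)

    fit = linregress(t, x)   # linear least-squares fit against time
    features = {
        "slope": fit.slope,
        "intercept": fit.intercept,
        "rvalue": fit.rvalue,   # correlation coefficient
        "pvalue": fit.pvalue,
        "stderr": fit.stderr,   # standard error of the slope
    }
    print(features)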
Preprint
Full-text available
Conventional relays face challenges for transmission lines connected to inverter-based resources (IBRs). In this article, a single-ended intelligent protection scheme for the transmission line in the zone between the grid and the PV farm is suggested. The method employs a fuzzy logic and random forest (RF)-based hybrid system to detect faults based on combined linear trend attributes of the 3-phase currents. The fault location is determined and the faulty phase is detected. RF feature selection is used to obtain the optimal linear trend feature. The performance of the methodology is examined for abnormal events such as faults and capacitor- and load-switching operations, simulated in PSCAD/EMTDC on the IEEE 9-bus system by varying various fault and switching parameters. Additionally, when validating the suggested strategy, consideration is given to the effects of conditions such as the presence of double-circuit lines, PV capacity, sampling rate, data window length, noise, high-impedance faults, CT saturation, compensation devices, evolving and cross-country faults, and far-end and near-end faults. The findings indicate that the suggested strategy can deal with a variety of system configurations and situations while still safeguarding such complex power transmission networks.
Article
Machine learning has been widely applied to predict the spatial or temporal likelihood of debris flows by leveraging its powerful capability to fit nonlinear features and uncover underlying patterns or rules in the complex formation mechanisms of debris flows. However, traditional approaches, including some current machine learning-based prediction models, still have limitations when used for debris flow prediction. These include the lack of a specific network structure or model that considers the updating of debris flow critical conditions in relation to geographical background conditions, limiting the universality of prediction models when transferring them to different places. This study proposes a deep learning network designed to predict the spatiotemporal probability of rainfall-induced debris flows, incorporating the Similarity Mechanism of Debris Flow Critical Conditions (SM-DFCC). The model comprehensively integrates the mining of rainfall-triggering features and couples them with geographical background features to fit the nonlinear relationship with debris flow formation. The model was trained on data from various historical debris flows triggered by different storms across Liangshan Prefecture from 2020 to 2022. The results indicated that: (i) the method is effective in predicting the spatiotemporal likelihood of debris flows under catchment units, with accuracy scores (ACC) ranging from 0.724 to 0.835; (ii) after optimization using the AVOA algorithm, the predictive performance of the model improved significantly, with an increase of 27.24% in ACC scores for SVC and 8.81% for XGBoost; and (iii) factor importance analysis revealed that rainfall-triggering factors have higher cumulative contribution rates when distinguishing between the occurrence and non-occurrence of debris flows. In addition, taking a rainstorm on 6 September 2020 as a case, this research quantitatively revealed the pattern of debris flow formation, in which high-frequency disaster areas exhibit lower rainfall thresholds of debris flows, represented by absolute energy (AE). Despite these findings, the accuracy and reliability of rainfall data remain the most challenging obstacle in basin- or regional-scale debris flow prediction. The integration of multiple sources of rainfall data, including station data, satellite rainfall and radar rainfall, is necessary to accurately quantify the impact of rainfall on debris flow formation when applying this method to debris flow monitoring and early warning tasks. Overall, this method shows great potential in providing a scientific reference for the construction of debris flow monitoring and early warning systems in the future.
Article
Background: Large-scale crisis events such as COVID-19 often have secondary impacts on individuals' mental well-being. University students are particularly vulnerable to such impacts. Traditional survey-based methods to identify those in need of support do not scale over large populations and they do not provide timely insights. We pursue an alternative approach through social media data and machine learning. Our models aim to complement surveys and provide early, precise, and objective predictions of students disrupted by COVID-19.
Objective: This study aims to demonstrate the feasibility of language on private social media as an indicator of crisis-induced disruption to mental well-being.
Methods: We modeled 4124 Facebook posts provided by 43 undergraduate students, spanning over 2 years. We extracted temporal trends in the psycholinguistic attributes of their posts and comments. These trends were used as features to predict how COVID-19 disrupted their mental well-being.
Results: The social media-enabled model had an F1-score of 0.79, which was a 39% improvement over a model trained on the self-reported mental state of the participant. The features we used showed promise in predicting other mental states such as anxiety, depression, social isolation, and suicidal behavior (F1-scores varied between 0.85 and 0.93). We also found that selecting windows of time 7 months after the COVID-19-induced lockdown produced better results, thereby paving the way for data minimization.
Conclusions: We predicted COVID-19-induced disruptions to mental well-being by developing a machine learning model that leveraged language on private social media. The language in these posts described psycholinguistic trends in students' online behavior. These longitudinal trends helped predict mental well-being disruption better than models trained on correlated mental health questionnaires. Our work inspires further research into the potential applications of early, precise, and automatic warnings for individuals concerned about their mental health in times of crisis.
Article
Full-text available
Phenotype measurements frequently take the form of time series, but we currently lack a systematic method for relating these complex data streams to scientifically meaningful outcomes, such as relating the movement dynamics of organisms to their genotype or measurements of brain dynamics of a patient to their disease diagnosis. Previous work addressed this problem by comparing implementations of thousands of diverse scientific time-series analysis methods in an approach termed highly comparative time-series analysis. Here, we introduce hctsa, a software tool for applying this methodological approach to data. hctsa includes an architecture for computing over 7,700 time-series features and a suite of analysis and visualization algorithms to automatically select useful and interpretable time-series features for a given application. Using exemplar applications to high-throughput phenotyping experiments, we show how hctsa allows researchers to leverage decades of time-series research to quantify and understand informative structure in time-series data.
Article
Full-text available
This work presents an introduction to feature-based time-series analysis. The time series as a data type is first described, along with an overview of the interdisciplinary time-series analysis literature. I then summarize the range of feature-based representations for time series that have been developed to aid interpretable insights into time-series structure. Particular emphasis is given to emerging research that facilitates wide comparison of feature-based representations that allow us to understand the properties of a time-series dataset that make it suited to a particular feature-based representation or analysis algorithm. The future of time-series analysis is likely to embrace approaches that exploit machine learning methods to partially automate human learning to aid understanding of the complex dynamical patterns in the time series we measure from the world.
Article
Full-text available
The all-relevant problem of feature selection is the identification of all strongly and weakly relevant attributes. This problem is especially hard to solve for time series classification and regression in industrial applications such as predictive maintenance or production line optimization, for which each label or regression target is associated with several time series and meta-information simultaneously. Here, we propose an efficient, scalable feature extraction algorithm which filters the available features in an early stage of the machine learning pipeline with respect to their significance for the classification or regression task, while controlling the expected percentage of selected but irrelevant features. The proposed algorithm combines established feature extraction methods with a feature importance filter. It has a low computational complexity, allows one to start on a problem with only limited domain knowledge available, can be trivially parallelized, is highly scalable and is based on well-studied non-parametric hypothesis tests. We benchmark our proposed algorithm on all binary classification problems of the UCR time series classification archive, as well as on time series from a production line optimization project and simulated stochastic processes with an underlying qualitative change of dynamics.
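For the simplest case of a binary target and real-valued features, the filtering idea compresses to a few lines: one non-parametric test per feature, then a Benjamini-Yekutieli correction to control the expected share of irrelevant features among those selected. The sketch below is a simplification; the published algorithm handles further feature/target type combinations:

    import numpy as np
    from scipy.stats import mannwhitneyu
    from statsmodels.stats.multitest import multipletests

    def fresh_filter(X, y, fdr_level=0.05):
        # X: (n_samples, n_features), y: binary labels.
        pvals = np.array([
            mannwhitneyu(X[y == 0, j], X[y == 1, j]).pvalue
            for j in range(X.shape[1])
        ])
        # Benjamini-Yekutieli keeps the false discovery rate at fdr_level.
        keep = multipletests(pvals, alpha=fdr_level, method="fdr_by")[0]
        return keep

    rng = np.random.default_rng(7)
    y = rng.integers(0, 2, size=200)
    X = rng.normal(size=(200, 50))
    X[:, 0] += y                              # make feature 0 genuinely relevant
    print(np.where(fresh_filter(X, y))[0])    # -> [0] (typically)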
Poster
Full-text available
This poster illustrates challenges that occur during the analysis of time series in the BMBF-funded research project iPRODICT and similar industrial applications.
Conference Paper
The monitoring of real-time objects such as steel billets during their casting process creates myriads of events. Complex Event Processing (CEP) is the technology to analyze the resulting event streams as fast as possible. But classic CEP is not able to consider events that have not happened yet. It is not clear how to transform CEP from a technology that reacts to past events into one that anticipates near-future events. Conditional density estimation allows combining both the estimate and the expected uncertainty of the next occurrence of a given event in one mathematical object. Moreover, it allows calculating the probability of event patterns, which are the basis for CEP. Hence, we introduce the concept of Conditional Event Occurrence Density Estimation (CEODE) to CEP. We present a reference architecture for combining CEP engines with predictive analytics using CEODEs. On the basis of concrete guidelines for transforming classical event processing rules into proactive ones, we demonstrate how CEP evolves from being reactive to becoming both predictive and prescriptive.
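In miniature, and under strong simplifying assumptions (an unconditional density in place of a genuinely conditional one, and a Gaussian kernel as a placeholder model), the CEODE idea reads as: estimate the occurrence density of the next event and integrate it over a near-future window, with a proactive rule firing when that probability crosses a threshold:

    import numpy as np
    from sklearn.neighbors import KernelDensity

    # Historical inter-arrival times of a monitored event (synthetic here).
    gaps = np.random.default_rng(3).gamma(shape=4.0, scale=2.0, size=500)

    # Estimate the occurrence density of the next event ...
    kde = KernelDensity(kernel="gaussian", bandwidth=0.5)
    kde.fit(gaps.reshape(-1, 1))

    # ... and integrate it over a near-future window: the probability that
    # the next event occurs 5 to 8 time units from now.
    grid = np.linspace(5.0, 8.0, 300).reshape(-1, 1)
    density = np.exp(kde.score_samples(grid))
    p_window = float(density.sum() * (grid[1, 0] - grid[0, 0]))
    print(f"P(next event in [5, 8)) ~ {p_window:.2f}")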