Article

Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh – A Python package)

Authors: Maximilian Christ, Nils Braun, Julius Neuffer, Andreas W. Kempa-Liehr

Abstract

Time series feature engineering is a time-consuming process, because scientists and engineers have to consider the multifarious algorithms of signal processing and time series analysis for identifying and extracting meaningful features from time series. The Python package tsfresh (Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests) accelerates this process by combining 63 time series characterization methods, which by default compute a total of 794 time series features, with feature selection on the basis of automatically configured hypothesis tests. By identifying statistically significant time series characteristics at an early stage of the data science process, tsfresh closes feedback loops with domain experts and fosters the development of domain-specific features early on. The package implements the standard APIs of time series and machine learning libraries (e.g. pandas and scikit-learn) and is designed both for exploratory analyses and for straightforward integration into operational data science applications.
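The workflow the abstract describes can be sketched in a few lines using tsfresh's documented entry points (extract_features, impute, select_features). The long-format column names and the synthetic data below are illustrative assumptions, not part of the paper:

    import numpy as np
    import pandas as pd
    from tsfresh import extract_features, select_features
    from tsfresh.utilities.dataframe_functions import impute

    rng = np.random.default_rng(42)
    # Long-format input: one row per observation, grouped by a series id.
    df = pd.DataFrame({
        "id": np.repeat(np.arange(20), 50),
        "time": np.tile(np.arange(50), 20),
        "value": rng.normal(size=1000),
    })
    # One binary label per series; here it is tied to the series mean so
    # that at least some extracted features are genuinely relevant.
    y = (df.groupby("id")["value"].mean() > 0).astype(int)

    # Compute the full default feature matrix (one row per id).
    X = extract_features(df, column_id="id", column_sort="time")
    X = impute(X)  # replace NaN/inf left by features undefined on this data

    # Keep only features passing the automatically configured hypothesis tests.
    X_selected = select_features(X, y)
    print(X.shape, "->", X_selected.shape)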
... The second method utilizes the Python tool tsfresh, a package that generates time-series features using 78 different feature calculation modules (Christ et al., 2018). tsfresh runs these calculation modules over the time series in the dataset and derives additional features from their results. ...
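The modular design described in this excerpt is exposed through tsfresh's settings dictionaries, which map calculator names to parameter lists. A small sketch with toy data and hypothetical column names:

    import numpy as np
    import pandas as pd
    from tsfresh import extract_features
    from tsfresh.feature_extraction import MinimalFCParameters

    df = pd.DataFrame({
        "id": np.repeat([0, 1], 50),
        "time": np.tile(np.arange(50), 2),
        "value": np.random.default_rng(0).normal(size=100),
    })

    # Run only a hand-picked subset of calculation modules.
    fc_parameters = {
        "mean": None,                     # parameter-free calculators map to None
        "autocorrelation": [{"lag": 1}],  # parametrized ones map to dicts
        "linear_trend": [{"attr": "slope"}],
    }
    X_small = extract_features(df, column_id="id", column_sort="time",
                               default_fc_parameters=fc_parameters)

    # MinimalFCParameters() selects a cheap, predefined baseline set instead.
    X_min = extract_features(df, column_id="id", column_sort="time",
                             default_fc_parameters=MinimalFCParameters())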
... made use of the processing module netAI (Zander and Williams, 2011) to create network-specific flow features. Stan et al. (2020) also implemented the use of a feature generation module called Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh) by Christ et al. (2018) that allowed the calculation of features from time-series datasets. Self-Organizing Maps (SOM) were used by da Silva Rodrigues et al. (2021) to generate new features, with noted improvements in F1-scores depending on the classification model used. Brownlee (2020) highlighted other feature generation methods such as polynomial transformation. ...
Article
Full-text available
The MIL-STD-1553B data bus protocol is used in both civilian and military aircraft to enable communications between subsystems. These interconnected subsystems are responsible for core services such as communications, the flow of instrument data and aircraft control. With aircraft modernization, threat vectors are introduced through increased inter-connectivity internal and external to the aircraft. The resulting potential for exploitation introduces a requirement for an intrusion detection capability in order to maintain the integrity, availability and reliability of data transmitted using the MIL-STD-1553B protocol, the safety of the aircraft and, overall, to achieve mission assurance. Research in recent years has investigated signature-, statistical- and machine learning-based solutions to detect attacks on MIL-STD-1553B buses. Of the different techniques, those based on machine learning have shown extremely good results. The aim of this research is to improve the performance of an existing Long Short-Term Memory Auto-Encoder by refining the feature engineering phase of its pipeline. The improvement in the detector's overall effectiveness was accomplished through feature engineering focused on feature generation and selection. Five different attack datasets were used as the starting point, consisting of four different denial-of-service attacks and one data integrity attack. From an initial extraction of 155 features, two feature generation techniques were employed to create over 38,000 candidate features. Using five different MIL-STD-1553B datasets and three feature selection techniques, fifteen different Long Short-Term Memory Auto-Encoder models were created, trained and evaluated using common performance metrics, and compared to the original anomaly detector. This research demonstrated a marked performance improvement achieved by the feature engineering refinements in comparison to the original model. Equally important, this research also showed a significant reduction in the number of features required to achieve this performance gain. In the context of military air operations, the ability to improve detection capabilities with less data is important to the technical solutions that contribute to the achievement of cyber mission assurance.
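For orientation, a generic reconstruction-error detector of the kind this abstract builds on might look as follows in PyTorch. This is not the authors' model; the window length, latent size, and threshold policy are placeholder assumptions (only the 155 matches the abstract's initial feature count):

    import torch
    import torch.nn as nn

    class LSTMAutoEncoder(nn.Module):
        # Encode a window of feature vectors into a latent state, then
        # reconstruct it; a high reconstruction error flags an anomaly.
        def __init__(self, n_features, latent_dim=32):
            super().__init__()
            self.encoder = nn.LSTM(n_features, latent_dim, batch_first=True)
            self.decoder = nn.LSTM(latent_dim, n_features, batch_first=True)

        def forward(self, x):                  # x: (batch, seq_len, n_features)
            _, (h, _) = self.encoder(x)
            z = h[-1].unsqueeze(1).repeat(1, x.size(1), 1)
            recon, _ = self.decoder(z)
            return recon

    model = LSTMAutoEncoder(n_features=155)
    x = torch.randn(8, 64, 155)                 # 8 windows of 64 time steps
    loss = nn.functional.mse_loss(model(x), x)  # training objective
    # At inference, windows whose error exceeds a threshold calibrated on
    # benign traffic would be flagged as anomalous.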
... As well as the core dependencies, aeon also includes a range of optional/soft dependencies on packages such as statsmodels (Seabold and Perktold, 2010), tensorflow (Abadi et al., 2015), and tsfresh (Christ et al., 2018). These are commonly used to create wrappers for algorithms present in these packages, or are used as a framework for estimators such as deep learners. ...
... They can be series-to-series transformations, which both take and output a time series, such as the Fourier transform or channel selection for multivariate series (Dhariyal et al., 2023). Alternatively, transformers can be series-to-features, which take a series as input but output a feature vector, such as basic summary statistics or TSFresh (Christ et al., 2018). ...
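The distinction drawn in this excerpt can be illustrated with two toy transformations (plain numpy, not aeon's transformer API):

    import numpy as np

    def fourier_magnitudes(series):
        # Series-to-series: a time series in, another sequence out.
        return np.abs(np.fft.rfft(series))

    def summary_features(series):
        # Series-to-features: a time series in, a fixed-length vector out.
        return np.array([series.mean(), series.std(), series.min(), series.max()])

    x = np.sin(np.linspace(0, 10, 256))
    print(fourier_magnitudes(x).shape)  # (129,) -- still a sequence
    print(summary_features(x).shape)    # (4,)   -- a flat feature vector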
Preprint
Full-text available
aeon is a unified Python 3 library for all machine learning tasks involving time series. The package contains modules for time series forecasting, classification, extrinsic regression and clustering, as well as a variety of utilities, transformations and distance measures designed for time series data. aeon also has a number of experimental modules for tasks such as anomaly detection, similarity search and segmentation. aeon follows the scikit-learn API as much as possible to help new users and enable easy integration of aeon estimators with useful tools such as model selection and pipelines. It provides a broad library of time series algorithms, including efficient implementations of the very latest advances in research. Using a system of optional dependencies, aeon integrates a wide variety of packages into a single interface while keeping the core framework with minimal dependencies. The package is distributed under the 3-Clause BSD license and is available at https://github.com/aeon-toolkit/aeon. This version was submitted to the JMLR journal on 02 Nov 2023 for v0.5.0 of aeon. At the time of this preprint, aeon had released v0.9.0 and had undergone substantial changes.
... With the emergence of machine learning technology, techniques including classification [19], clustering [15], ensemble learning [32], and time series forecasting [14] are applied to time series anomaly detection. In addition, tsfresh has inspired the window-to-feature approach, enhancing the efficiency of feature extraction in time series analysis [8]. ROCKET's focus on sub-sequence patterns through random convolutional kernels has inspired advancements in capturing local temporal patterns [11]. ...
Preprint
Time Series Anomaly Detection (TSAD) finds widespread applications across various domains such as financial markets, industrial production, and healthcare. Its primary objective is to learn the normal patterns of time series data, thereby identifying deviations in test samples. Most existing TSAD methods focus on modeling data from the temporal dimension, while ignoring the semantic information in the spatial dimension. To address this issue, we introduce a novel approach, called Spatial-Temporal Normality learning (STEN). STEN is composed of a sequence Order prediction-based Temporal Normality learning (OTN) module that captures the temporal correlations within sequences, and a Distance prediction-based Spatial Normality learning (DSN) module that learns the relative spatial relations between sequences in a feature space. By synthesizing these two modules, STEN learns expressive spatial-temporal representations for the normal patterns hidden in the time series data. Extensive experiments on five popular TSAD benchmarks show that STEN substantially outperforms state-of-the-art competing methods. Our code is available at https://github.com/mala-lab/STEN.
... However, the choice of the base classifiers could induce a bias favoring or hampering some methods. In order to clarify this, we have repeated the experiments replacing MiniROCKET with two base classifiers: WEASEL 2.0 (Schäfer & Leser, 2023), and the XGBoost classifier using features produced by tsfresh (Christ et al., 2018). Both of these classifiers have already been tested within the ECTS literature in (Schäfer & Leser, 2020; Lv et al., 2019) and (Achenchabe et al., 2021a), respectively. ...
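A sketch of that second base classifier, pairing tsfresh's one-call extraction-and-selection helper with XGBoost; the toy data and label construction are assumptions for illustration:

    import numpy as np
    import pandas as pd
    from tsfresh import extract_relevant_features
    from xgboost import XGBClassifier

    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=30)
    df = pd.DataFrame({
        "id": np.repeat(np.arange(30), 40),
        "time": np.tile(np.arange(40), 30),
        # Shift class-1 series so that some features are truly informative.
        "value": rng.normal(size=1200) + np.repeat(labels, 40) * 0.8,
    })
    y = pd.Series(labels, index=np.arange(30))

    # Extract and statistically filter features in one step, then fit.
    X = extract_relevant_features(df, y, column_id="id", column_sort="time")
    clf = XGBClassifier(n_estimators=100).fit(X, y)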
Preprint
Full-text available
In many situations, the measurements of a studied phenomenon are provided sequentially, and the prediction of its class needs to be made as early as possible so as not to incur too high a time penalty, but not too early and risk paying the cost of misclassification. This problem has been particularly studied in the case of time series, and is known as Early Classification of Time Series (ECTS). Although it has been the subject of a growing body of literature, there is still a lack of a systematic, shared evaluation protocol to compare the relative merits of the various existing methods. This document begins by situating these methods within a principle-based taxonomy. It defines dimensions for organizing their evaluation, and then reports the results of a very extensive set of experiments along these dimensions involving nine state-of-the art ECTS algorithms. In addition, these and other experiments can be carried out using an open-source library in which most of the existing ECTS algorithms have been implemented (see \url{https://github.com/ML-EDM/ml_edm}).
... It represents a straight line pattern when plotted on a graph, where the data points follow a linear relationship as time progresses. For the time series values, it computes a linear least-squares regression, and "p-value", "correlation coefficient", "intercept", "slope", and "standard error" are obtained [32]. In this scenario, the three time series correspond to the 3-phase currents monitored by CT_PV. ...
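This excerpt describes tsfresh's linear_trend feature calculator, which wraps scipy's linear regression; the same five attributes can be computed directly (synthetic series assumed):

    import numpy as np
    from scipy.stats import linregress

    # A synthetic drifting signal standing in for one monitored current.
    x = np.cumsum(np.random.default_rng(1).normal(0.1, 1.0, size=500))
    t = np.arange(x.size)

    fit = linregress(t, x)   # linear least-squares fit against time
    features = {
        "slope": fit.slope,
        "intercept": fit.intercept,
        "rvalue": fit.rvalue,   # correlation coefficient
        "pvalue": fit.pvalue,
        "stderr": fit.stderr,   # standard error of the slope
    }
    print(features)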
Preprint
Full-text available
Conventional relays face challenges for transmission lines connected to inverter-based resources (IBRs). In this article, a single-ended intelligent protection scheme for the transmission line in the zone between the grid and the PV farm is suggested. The method employs a fuzzy logic and random forest (RF)-based hybrid system to detect faults based on combined linear trend attributes of the 3-phase currents. The fault location is determined and the faulty phase is detected. RF feature selection is used to obtain the optimal linear trend feature. The performance of the methodology is examined for abnormal events such as faults and capacitor- and load-switching operations, simulated in PSCAD/EMTDC on the IEEE 9-bus system by varying various fault and switching parameters. Additionally, when validating the suggested strategy, consideration is given to the effects of conditions such as the presence of double-circuit lines, PV capacity, sampling rate, data window length, noise, high-impedance faults, CT saturation, compensation devices, evolving and cross-country faults, and far-end and near-end faults. The findings indicate that the suggested strategy can deal with a variety of system configurations and situations while still safeguarding such complex power transmission networks.
Article
Machine learning has been widely applied to predict the spatial or temporal likelihood of debris flows by leveraging its powerful capability to fit nonlinear features and uncover underlying patterns or rules in the complex formation mechanisms of debris flows. However, traditional approaches, including some current machine learning-based prediction models, still have limitations when used for debris flow prediction. These include the lack of a specific network structure or model that considers the updating of debris flow critical conditions in relation to geographical background conditions, limiting the universality of prediction models when transferring them to different places. This study proposes a deep learning network designed to predict the spatiotemporal probability of rainfall-induced debris flows, incorporating the Similarity Mechanism of Debris Flow Critical Conditions (SM-DFCC). The model comprehensively integrates the mining of rainfall-triggering features and couples them with geographical background features to fit the nonlinear relationship with debris flow formation. The model was trained on data from various historical debris flows triggered by different storms across Liangshan Prefecture from 2020 to 2022. The results indicated that: (i) the method is effective in predicting the spatiotemporal likelihood of debris flows under catchment units, with accuracy scores (ACC) ranging from 0.724 to 0.835; (ii) after optimization using the AVOA algorithm, the predictive performance of the model improved significantly, with an increase of 27.24% in ACC scores for SVC and 8.81% for XGBoost; and (iii) factor importance analysis revealed that rainfall-triggering factors have higher cumulative contribution rates when distinguishing between the occurrence and non-occurrence of debris flows. In addition, taking a rainstorm on 6 September 2020 as a case, this research quantitatively revealed the pattern of debris flow formation, in which high-frequency disaster areas exhibit lower rainfall thresholds of debris flows, represented by absolute energy (AE). Despite these findings, the accuracy and reliability of rainfall data remain the most challenging obstacle in basin- or regional-scale debris flow prediction. The integration of multiple sources of rainfall data, including station data, satellite rainfall and radar rainfall, is necessary to accurately quantify the impact of rainfall on debris flow formation when applying this method to debris flow monitoring and early warning tasks. Overall, this method shows great potential in providing a scientific reference for the construction of debris flow monitoring and early warning systems in the future.
Article
Background: Large-scale crisis events such as COVID-19 often have secondary impacts on individuals' mental well-being. University students are particularly vulnerable to such impacts. Traditional survey-based methods to identify those in need of support do not scale over large populations and they do not provide timely insights. We pursue an alternative approach through social media data and machine learning. Our models aim to complement surveys and provide early, precise, and objective predictions of students disrupted by COVID-19.
Objective: This study aims to demonstrate the feasibility of language on private social media as an indicator of crisis-induced disruption to mental well-being.
Methods: We modeled 4124 Facebook posts provided by 43 undergraduate students, spanning over 2 years. We extracted temporal trends in the psycholinguistic attributes of their posts and comments. These trends were used as features to predict how COVID-19 disrupted their mental well-being.
Results: The social media-enabled model had an F1-score of 0.79, which was a 39% improvement over a model trained on the self-reported mental state of the participant. The features we used showed promise in predicting other mental states such as anxiety, depression, social isolation, and suicidal behavior (F1-scores varied between 0.85 and 0.93). We also found that selecting windows of time 7 months after the COVID-19-induced lockdown produced better results, thereby paving the way for data minimization.
Conclusions: We predicted COVID-19-induced disruptions to mental well-being by developing a machine learning model that leveraged language on private social media. The language in these posts described psycholinguistic trends in students' online behavior. These longitudinal trends helped predict mental well-being disruption better than models trained on correlated mental health questionnaires. Our work inspires further research into the potential applications of early, precise, and automatic warnings for individuals concerned about their mental health in times of crisis.
Article
Full-text available
Phenotype measurements frequently take the form of time series, but we currently lack a systematic method for relating these complex data streams to scientifically meaningful outcomes, such as relating the movement dynamics of organisms to their genotype or measurements of brain dynamics of a patient to their disease diagnosis. Previous work addressed this problem by comparing implementations of thousands of diverse scientific time-series analysis methods in an approach termed highly comparative time-series analysis. Here, we introduce hctsa, a software tool for applying this methodological approach to data. hctsa includes an architecture for computing over 7,700 time-series features and a suite of analysis and visualization algorithms to automatically select useful and interpretable time-series features for a given application. Using exemplar applications to high-throughput phenotyping experiments, we show how hctsa allows researchers to leverage decades of time-series research to quantify and understand informative structure in time-series data.
Article
Full-text available
This work presents an introduction to feature-based time-series analysis. The time series as a data type is first described, along with an overview of the interdisciplinary time-series analysis literature. I then summarize the range of feature-based representations for time series that have been developed to aid interpretable insights into time-series structure. Particular emphasis is given to emerging research that facilitates wide comparison of feature-based representations that allow us to understand the properties of a time-series dataset that make it suited to a particular feature-based representation or analysis algorithm. The future of time-series analysis is likely to embrace approaches that exploit machine learning methods to partially automate human learning to aid understanding of the complex dynamical patterns in the time series we measure from the world.
Article
Full-text available
The all-relevant problem of feature selection is the identification of all strongly and weakly relevant attributes. This problem is especially hard to solve for time series classification and regression in industrial applications such as predictive maintenance or production line optimization, for which each label or regression target is associated with several time series and meta-information simultaneously. Here, we propose an efficient, scalable feature extraction algorithm which filters the available features in an early stage of the machine learning pipeline with respect to their significance for the classification or regression task, while controlling the expected percentage of selected but irrelevant features. The proposed algorithm combines established feature extraction methods with a feature importance filter. It has a low computational complexity, allows one to start on a problem with only limited domain knowledge available, can be trivially parallelized, is highly scalable and is based on well-studied non-parametric hypothesis tests. We benchmark our proposed algorithm on all binary classification problems of the UCR time series classification archive, as well as on time series from a production line optimization project and simulated stochastic processes with an underlying qualitative change of dynamics.
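For the simplest case of a binary target and real-valued features, the filtering idea compresses to a few lines: one non-parametric test per feature, then a Benjamini-Yekutieli correction to control the expected share of irrelevant features among those selected. The sketch below is a simplification; the published algorithm handles further feature/target type combinations:

    import numpy as np
    from scipy.stats import mannwhitneyu
    from statsmodels.stats.multitest import multipletests

    def fresh_filter(X, y, fdr_level=0.05):
        # X: (n_samples, n_features), y: binary labels.
        pvals = np.array([
            mannwhitneyu(X[y == 0, j], X[y == 1, j]).pvalue
            for j in range(X.shape[1])
        ])
        # Benjamini-Yekutieli keeps the false discovery rate at fdr_level.
        keep = multipletests(pvals, alpha=fdr_level, method="fdr_by")[0]
        return keep

    rng = np.random.default_rng(7)
    y = rng.integers(0, 2, size=200)
    X = rng.normal(size=(200, 50))
    X[:, 0] += y                              # make feature 0 genuinely relevant
    print(np.where(fresh_filter(X, y))[0])    # -> [0] (typically)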
Poster
Full-text available
This poster illustrates challenges that occur during the analysis of time series in the BMBF-funded research project iPRODICT and similar industrial applications.
Conference Paper
The monitoring of real-time objects such as steel billets during their casting process creates myriads of events. Complex Event Processing (CEP) is the technology to analyze the resulting event streams as fast as possible. But classic CEP is not able to consider events that have not happened yet. It is not clear how to transform CEP from a technology that reacts to past events into one that anticipates near-future events. Conditional density estimation allows combining both the estimate and the expected uncertainty of the next occurrence of a given event in one mathematical object. Moreover, it allows calculating the probability of event patterns, which are the basis for CEP. Hence, we introduce the concept of Conditional Event Occurrence Density Estimation (CEODE) to CEP. We present a reference architecture for combining CEP engines with predictive analytics using CEODEs. On the basis of concrete guidelines for transforming classical event processing rules into proactive ones, we demonstrate how CEP evolves from being reactive to becoming both predictive and prescriptive.
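In miniature, and under strong simplifying assumptions (an unconditional density in place of a genuinely conditional one, and a Gaussian kernel as a placeholder model), the CEODE idea reads as: estimate the occurrence density of the next event and integrate it over a near-future window, with a proactive rule firing when that probability crosses a threshold:

    import numpy as np
    from sklearn.neighbors import KernelDensity

    # Historical inter-arrival times of a monitored event (synthetic here).
    gaps = np.random.default_rng(3).gamma(shape=4.0, scale=2.0, size=500)

    # Estimate the occurrence density of the next event ...
    kde = KernelDensity(kernel="gaussian", bandwidth=0.5)
    kde.fit(gaps.reshape(-1, 1))

    # ... and integrate it over a near-future window: the probability that
    # the next event occurs 5 to 8 time units from now.
    grid = np.linspace(5.0, 8.0, 300).reshape(-1, 1)
    density = np.exp(kde.score_samples(grid))
    p_window = float(density.sum() * (grid[1, 0] - grid[0, 0]))
    print(f"P(next event in [5, 8)) ~ {p_window:.2f}")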