Chapter

Functional Data Analysis

Abstract

When either data or the models for them involve functions, and when only weak assumptions about these functions such as smoothness are permitted, familiar statistical methods must be modified and new approaches developed in order to take advantage of this smoothness. The first part of the article considers some general issues such as characteristics of functional data, uses of derivatives in functional modelling, estimation of phase variation by the alignment or registration of curve features, the nature of error, and so forth. The second part describes functional versions of traditional methods such as principal components analysis and linear modelling, and also mentions purely functional approaches that involve working with and estimating differential equations in the functional data analysis process.

... Second, the basic element of the analytical process performed in FDA is the whole function itself, not the individual data elements of which it is composed. Finally, functional data usually appear associated with some sort of temporal variable and are also assumed to satisfy some regularity conditions [1]. ...
... In recent years, FDA has gained momentum, as evidenced by a rise in popularity in several applied research areas and the publication of multiple works, including monographs [1] and review articles [2]. Now that this knowledge is available to the public and FDA's theoretical basis and applications are beginning to be established, researchers are starting to consider the use of FDA tools in extended setups, such as the field of medical imaging. ...
... In the context of biomedicine, there is great interest in medical imaging data, such as images obtained from brain scanners or of tumor tissues, among others [1]. Nevertheless, the smoothing methods proposed in the scientific literature to date for imaging data (e.g., kernel smoothing, tensor product smoothing, …) suffer from a severe leakage problem for high-complexity data structures, i.e., poor estimation in difficult regions resulting from inappropriate smoothing across boundary regions. ...
Article
Full-text available
In the field of medical imaging, one of the most common research setups consists of comparing two groups of images, a pathological set against a control set, in order to search for statistically significant differences in brain activity. Functional Data Analysis (FDA), a relatively new field of statistics dealing with data expressed in the form of functions, uses methodologies which can be easily extended to the study of imaging data. Examples of this have been proposed in previous publications, where the authors lay the mathematical groundwork and properties of the proposed estimators. The methodology tested herein allows for the estimation of mean functions and simultaneous confidence corridors (SCC), also known as simultaneous confidence bands, for imaging data and for the difference between two groups of images. FDA applied to medical imaging presents at least two advantages over previous methodologies: it avoids loss of information in complex data structures and avoids the multiple comparison problem arising from traditional pixel-to-pixel comparisons. Nonetheless, computing times for this technique have only been explored in reduced and simulated setups. In the present article, we apply this procedure to a practical case with data extracted from open neuroimaging databases; then, we measure computing times for the construction of Delaunay triangulations and for the computation of mean functions and SCCs for one-group and two-group approaches. The results suggest that previous research has been too conservative in parameter selection and that computing times for this methodology are reasonable, confirming that this method should be further studied and applied to the field of medical imaging.
... With the advance of modern technology, it has become increasingly common for data to arrive in the form of functions. The rapid development of new statistical methods for this type of data has created the field of functional data analysis (FDA) (Yao et al. 2005b; Ramsay and Silverman 2006; Ferraty and Vieu 2006; Hsing and Eubank 2015; Kokoszka and Reimherr 2017; Lin et al. 2018; Lin and Yao 2019). Many semiparametric and nonparametric methods have been proposed for regression with functional response and/or covariates. ...
... Denote the functional observations by {(Y_i(t), X_i(t)), i = 1, …, n; t ∈ T}, where T is an arbitrary set. We want to investigate the relationship between X and Y. Ramsay and Silverman (2006) proposed the concurrent linear model (CLM), Y_i(t) = α(t) + β(t)X_i(t) + ε_i(t), i = 1, …, n, (1) where α(t) and β(t) are unknown functions to be estimated and the ε_i(t), independent of X_i(t), are i.i.d. random errors. ...
... Model (1) assumes that the value of Y at t depends on X at the same point t only. Ramsay and Silverman (2006) also proposed the functional linear model (FLM), Y_i(t) = α(t) + ∫_T β(s,t)X_i(s) ds + ε_i(t), i = 1, …, n, (2) where Y(t) depends on the whole function X(·), and α(t) and β(s,t) are unknown functions to be estimated. Both models (1) and (2) assume that Y depends linearly on X, which could be restrictive for some applications. ...
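To make the concurrent model (1) concrete, the sketch below fits it pointwise by ordinary least squares on a common observation grid. This is a minimal illustration under an assumed dense, regular design, not the cited authors' estimator; in practice α(t) and β(t) would also be smoothed across t.

```python
import numpy as np

def fit_concurrent_model(Y, X):
    """Pointwise OLS fit of the concurrent linear model
    Y_i(t) = alpha(t) + beta(t) X_i(t) + eps_i(t).
    Y, X: (n_subjects, n_gridpoints) arrays on a common grid."""
    n, m = Y.shape
    alpha, beta = np.empty(m), np.empty(m)
    for j in range(m):
        # OLS at grid point t_j; a roughness penalty across j
        # would normally be added to borrow strength between points
        A = np.column_stack([np.ones(n), X[:, j]])
        coef, *_ = np.linalg.lstsq(A, Y[:, j], rcond=None)
        alpha[j], beta[j] = coef
    return alpha, beta
```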
Article
Regression models with a functional response and functional covariate have received significant attention recently. While various nonparametric and semiparametric models have been developed, there is an urgent need for model selection and diagnostic methods. In this article, we develop a unified framework for estimation and model selection in nonparametric function-on-function regression. We propose a general nonparametric functional regression model with the model space constructed through smoothing spline analysis of variance (SS ANOVA). The proposed model reduces to some of the existing models when selected components in the SS ANOVA decomposition are eliminated. We propose new estimation procedures under either L1 or L2 penalty and show that the combination of the SS ANOVA decomposition and L1 penalty provides powerful tools for model selection and diagnostics. We establish consistency and convergence rates for estimates of the regression function and each component in its decomposition under both the L1 and L2 penalties. Simulation studies and real examples show that the proposed methods perform well. Technical details and additional simulation results are available in online supplementary materials.
... These two types of functional data often require different sets of techniques to model. The book [33] offers a comprehensive perspective on FDA methods for densely observed functional data, and the paper [26] provides a nice review of sparse functional data analysis. ...
... For dense functional data, where a large number of regularly observed measurements for each subject are attainable, some well-developed methods for the two-sample mean function testing problem exist, including the pointwise t-test [33], the L²-norm-based test and the F-type test [41,9,42,43], Hotelling's T² test with permutation [31], tests involving basis representations [1,16,15,17,21,36], and the pseudo-likelihood ratio test [37]. Extensions to multiple functional samples are discussed in [35,8,6,27]. ...
... For a related purpose, [4] developed simultaneous confidence bands for mean functions. The pointwise F-test and a simultaneous test based on the permutation distribution of the supremum of the pointwise F-test have been proposed to test the parameter functions in function-on-scalar regressions [33,34]. When the scalar-valued predictor is defined as a dummy variable indicating the group label of the functional response, testing whether the parameter function is 0 is equivalent to the two-sample mean function inference problem in this paper. ...
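For illustration, here is a minimal sketch of such a simultaneous permutation test, using the supremum of pointwise two-sample t-statistics (for two groups the sup-F and sup-t tests coincide, since F = t²). The common-grid assumption and all names are illustrative, not taken from the cited papers.

```python
import numpy as np

def perm_sup_t_test(A, B, n_perm=999, seed=0):
    """Two-sample test of equal mean functions via the supremum of
    pointwise t-statistics, calibrated by permuting group labels.
    A, B: (n_curves, n_grid) arrays observed on a common grid."""
    rng = np.random.default_rng(seed)

    def sup_t(G1, G2):
        diff = G1.mean(axis=0) - G2.mean(axis=0)
        se = np.sqrt(G1.var(axis=0, ddof=1) / len(G1)
                     + G2.var(axis=0, ddof=1) / len(G2))
        return np.max(np.abs(diff / se))

    observed = sup_t(A, B)
    pooled, n_a = np.vstack([A, B]), len(A)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        exceed += sup_t(pooled[idx[:n_a]], pooled[idx[n_a:]]) >= observed
    return observed, (exceed + 1) / (n_perm + 1)
```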
... FDA is a branch of statistics that provides tools for describing and modelling sets of functions (or curves) rather than vectors of discrete values (Ramsay, 2006). The guiding idea of this approach is to describe data as parameterized functions, and to use these parameters for clustering, comparing or interpolating functions. ...
... The guiding idea of this approach is to describe data as parameterized functions, and to use these parameters for clustering, comparing or interpolating functions. In particular, classical statistical tools can be adapted to functional data, such as functional principal component analysis (fPCA), which summarizes and characterizes significant variation, in finite dimension, among a sample of curves (Dabo-Niang and Ferraty, 2008; Ramsay, 2006). Functional analysis of variance (fANOVA) uses all the information in each mean functional curve to test for possible differences between datasets, based on the shape and temporal (along the depth) variability of the curves (Cuevas et al., 2004; Ramsay, 2006). ...
Article
The dynamics of the thermohaline structure of the upper ocean, which depend on ocean-atmosphere interactions, drive most near-surface oceanic processes, including the control of gas and heat fluxes and nutrient availability in the photic layer. The thermohaline structure of the southwestern tropical Atlantic (SWTA), a key region for diagnosing variation of the Atlantic Meridional Overturning Circulation, has a prime impact on global climate. Characterising the thermohaline structure is typically based on applying classical statistical methods to vertical profiles. Such an approach has important limitations, since classical methods do not explicitly contemplate the vertical nature of the profiles. Functional Data Analysis (FDA) is a new alternative that overcomes these drawbacks. Here, we apply an FDA approach to characterise the 3D canonical thermohaline structure of the SWTA in austral spring and fall. Our results reveal a clear spatial pattern with the presence of three areas of significantly different thermohaline structure. Area 1, mostly located along the continental slope, reflects the western boundary current system, with low static stability and a high frequency of occurrence of barrier layers (BL). Conversely, Area 2, located along the Fernando de Noronha chain, presents strong static stability with a well-marked thermocline. This area, under the influence of the eastern Atlantic, is characterised by a low BL frequency, which is seasonally modulated by the latitudinal oscillation of the Intertropical Convergence Zone, controlling the regime of precipitation. In turn, Area 3 behaves as a transition zone between Areas 1 and 2, with the presence of a core of maximum-salinity water in the subsurface and therefore strong-to-moderate BL. Beyond this study, the FDA approach emerges as a powerful way to describe, characterise, classify and compare ocean patterns and processes. It can be applied to in situ data but could also be used to deeply and comprehensively explore ocean model output.
... First, we describe the oceanscape, i.e., the vertical features of the environmental factors (currents, current shear, stratification, oxygen concentration and fluorescence) for each hydrodynamic system (WBCS and SECS) according to the season (spring and autumn). For that, we used a functional data analysis approach [44] to statistically test for seasonal differences in the vertical profiles of each environmental parameter in a given hydrodynamic system. This branch of statistics works on functions instead of discretized vectors to analyse the distribution and variability of data according to the physical dimension in which they are measured, the depth in our case. ...
... The higher the number of K-functions, the more complexity is preserved. Once this was done, we used functional analysis of variance (fANOVA), which uses all the information in each mean functional curve to test for differences based on the shape and spatial (along-depth) variability of the curves [27,44,47]. The significance level considered in this study was 0.05. ...
Article
Full-text available
Ocean dynamics initiate the structure of nutrient input driving primary producers, and these, in turn, shape the distribution of subsequent trophic levels until the whole pelagic community reflects the physicochemical structure of the ocean. Despite the importance of bottom-up structuring in pelagic ecosystems, fine-scale studies of biophysical interactions along depth are scarce and challenging. To improve our understanding of such relationships, we analyzed the vertical structure of key oceanographic variables along with the distribution of acoustic biomass from multi-frequency acoustic data (38, 70, and 120 kHz) as a reference for pelagic fauna. In addition, we took advantage of species distribution databases collected at the same time to provide further interpretation. The study was performed in the Southwestern Tropical Atlantic off northeast Brazil in spring 2015 and autumn 2017, periods representative of canonical spring and autumn conditions in terms of thermohaline structure and current dynamics. We show that chlorophyll-a, oxygen, currents, and stratification are important drivers of the distribution of sound-scattering biota, but that their relative importance depends on the area, the depth range, and the diel cycle. Prominent sound-scattering layers (SSLs) in the epipelagic layer were associated with strong stratification and the subsurface chlorophyll-a maximum. In areas where chlorophyll-a maxima were deeper than the peak of stratification, SSLs were more correlated with stratification than with subsurface chlorophyll maxima. Dissolved oxygen seems to be a driver in locations where lower oxygen concentrations occur in the subsurface. Finally, our results suggest that organisms seem to avoid strong current cores. However, future work is needed to better understand the role of currents in the vertical distribution of organisms.
... Instead of basing the posterior probability on the parametric model, a semiparametric approach is applied. We therefore draw on Rossi, Wang, and Ramsay (2002), who used the basis function expansion approach (Ramsay & Silverman, 2005) on logit functions. This approach is semiparametric in so far as the IRF is still parametric, but the unknown functional form of the IRF is approximated by a finite number of basis functions. ...
... 2.1. Semiparametric IRF estimation using B-spline functions Using the basis function expansion approach (Ramsay & Silverman, 2005), the logit function can be described as ...
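The quoted sentence is truncated. Based on the description in the abstract below (a linear term plus a flexible basis expansion), the semiparametric IRF would take a form like the following sketch, with B_k denoting B-spline basis functions and c_k their coefficients (notation assumed, not the paper's exact display):

```latex
\operatorname{logit} P(\theta) \;=\; a + b\,\theta \;+\; \sum_{k=1}^{K} c_k\, B_k(\theta)
```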
Article
Full-text available
When scaling data using item response theory, valid statements based on the measurement model are only permissible if the model fits the data. Most item fit statistics used to assess the fit between observed item responses and the item responses predicted by the measurement model show significant weaknesses, such as the dependence of fit statistics on sample size and number of items. In order to assess the size of misfit and to thus use the fit statistic as an effect size, dependencies on properties of the data set are undesirable. The present study describes a new approach and empirically tests it for consistency. We developed an estimator of the distance between the predicted item response functions (IRFs) and the true IRFs by semiparametric adaptation of IRFs. For the semiparametric adaptation, the approach of extended basis functions due to Ramsay and Silverman (2005) is used. The IRF is defined as the sum of a linear term and a more flexible term constructed via basis function expansions. The group lasso method is applied as a regularization of the flexible term, and determines whether all parameters of the basis functions are fixed at zero or freely estimated. Thus, the method serves as a selection criterion for items that should be adjusted semiparametrically. The distance between the predicted and semiparametrically adjusted IRF of misfitting items can then be determined by describing the fitting items by the parametric form of the IRF and the misfitting items by the semiparametric approach. In a simulation study, we demonstrated that the proposed method delivers satisfactory results in large samples (i.e., N ≥ 1,000).
... The basic idea of viewing transformations of densely observed asset price data as sequentially observed stochastic processes appears in studies such as Kokoszka and Reimherr (2013) and Constantinou et al. (2018), among others. We refer the reader to Ramsay and Silverman (2006) and Bosq (2000) for a review of functional data analysis and linear functional time series. ...
... The functions φ_j can be chosen in a number of ways, including using a deterministic basis system such as polynomials, B-splines, or the Fourier basis, as well as using a functional principal component basis; see e.g. Chapter 6 of Ramsay and Silverman (2006). Cerovecki et al. (2019) and Aue et al. (2017) suggest using the principal component basis determined by the squared processes X_i²(t), which we also consider below. ...
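As a minimal sketch of the deterministic-basis option, the snippet below builds a Fourier basis on [0, 1] and computes least-squares basis coefficients for discretely observed curves (all names and the common-grid assumption are illustrative):

```python
import numpy as np

def fourier_basis(t, K):
    """First 2K+1 Fourier basis functions on [0, 1], evaluated on grid t."""
    cols = [np.ones_like(t)]
    for k in range(1, K + 1):
        cols.append(np.sqrt(2.0) * np.sin(2 * np.pi * k * t))
        cols.append(np.sqrt(2.0) * np.cos(2 * np.pi * k * t))
    return np.column_stack(cols)

def basis_coefficients(curves, t, K=3):
    """Least-squares coefficients of each curve in the Fourier basis.
    curves: (n_curves, n_grid) array of discretized functions on grid t."""
    Phi = fourier_basis(t, K)                       # (n_grid, 2K+1)
    coef, *_ = np.linalg.lstsq(Phi, curves.T, rcond=None)
    return coef.T                                   # (n_curves, 2K+1)
```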
Article
Full-text available
Functional data objects derived from high-frequency financial data often exhibit volatility clustering. Versions of functional generalized autoregressive conditionally heteroscedastic (FGARCH) models have recently been proposed to describe such data; however, so far basic diagnostic tests for these models are not available. We propose two portmanteau-type tests to measure conditional heteroscedasticity in the squares of asset return curves. A complete asymptotic theory is provided for each test. We also show how such tests can be adapted and applied to model residuals to evaluate the adequacy, and inform the order selection, of FGARCH models. Simulation results show that both tests have good size and power to detect conditional heteroscedasticity and model mis-specification in finite samples. In an application, the tests show that intra-day asset return curves exhibit conditional heteroscedasticity. This conditional heteroscedasticity cannot be explained by the magnitude of inter-daily returns alone, but it can be adequately modeled by an FGARCH(1,1) model.
... There is rather extensive literature on the use of machine learning to analyse functional data, e.g., adapting Support Vector Machine models to functional data (Blanquero et al. 2019; Chaovalitwongse et al. 2008), using regression trees to detect critical intervals (Blanquero et al. 2023), or novel forms of interpretability when dealing with functional data (Carrizosa et al. 2022; Martín-Barragán et al. 2014). See also Aneiros et al. (2022) and Ramsay (2006) for an overview of methods for functional data analysis. ...
Article
Full-text available
Counterfactual explanations have become a very popular interpretability tool to understand and explain how complex machine learning models make decisions for individual instances. Most of the research on counterfactual explainability focuses on tabular and image data and much less on models dealing with functional data. In this paper, a counterfactual analysis for functional data is addressed, in which the goal is to identify the samples of the dataset that the counterfactual explanation is made of, as well as how they are combined so that the individual instance and its counterfactual are as close as possible. Our methodology can be used with different distance measures for multivariate functional data and is applicable to any score-based classifier. We illustrate our methodology using two different real-world datasets, one univariate and another multivariate.
... were smoothed. Missing data during winter are filled with 0. We apply a maximum moving-window filter and functional data analysis (Ramsay, 2006) to remove the noise while preserving the maximum values. The time series is then smoothed with a B-spline basis. ...
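A minimal sketch of this kind of peak-preserving preprocessing follows, under assumed parameter choices (the window length and smoothing factor are guesses, not the paper's values):

```python
import numpy as np
from scipy.ndimage import maximum_filter1d
from scipy.interpolate import UnivariateSpline

def smooth_preserving_maxima(t, y, window=5, s=None):
    """Moving-maximum filter followed by a smoothing spline (cubic
    B-spline basis): suppresses noise while keeping peak values.
    t must be strictly increasing."""
    y_filled = np.nan_to_num(y, nan=0.0)      # winter gaps filled with 0
    y_max = maximum_filter1d(y_filled, size=window)
    spline = UnivariateSpline(t, y_max, s=s)  # s controls smoothness
    return spline(t)
```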
Article
Full-text available
Dinophysis acuminata produces Diarrhetic Shellfish Toxins (DST) that contaminate natural and farmed shellfish, leading to public health risks and economic impacts on mussel farms. For this reason, there is high interest in understanding and predicting D. acuminata blooms. This study assesses the environmental conditions and develops a sub-seasonal (7-28 days) forecast model to predict D. acuminata cell abundance in the Lyngen fjord in northern Norway. A Support Vector Machine (SVM) model is trained to predict future D. acuminata cell abundance using past cell concentration, sea surface temperature (SST), Photosynthetically Active Radiation (PAR), and wind speed. Cell concentrations of Dinophysis spp. were measured in situ from 2006 to 2019, and SST, PAR, and surface wind speed were obtained by satellite remote sensing. D. acuminata explains only 40% of DST variability from 2006 to 2011, but this rises to 65% after 2011, when the prevalence of D. acuta declined. The D. acuminata blooms can reach concentrations up to 3954 cells l⁻¹ and are restricted to the summer in warmer waters, varying from 7.8 to 12.7 °C. The forecast model predicts with fair accuracy the seasonal development and amplitude of the blooms, showing a coefficient of determination varying from 0.46 to 0.55. SST has been found to be a useful predictor of the seasonal development of the blooms, while past cell abundance is needed for updating the current status and adjusting bloom timing and amplitude. The calibrated model should be tested operationally in the future to provide an early warning of D. acuminata blooms in the Lyngen fjord. The approach can be generalized to other regions by recalibrating the model with local observations of D. acuminata blooms and remote sensing data.
... For practical details on the implementation, the interested reader could refer to [14]. This decomposition can be applied to each recorded muscle to express the temporal activation profile into a set of weights. ...
Chapter
Full-text available
Monitoring workers’ status is crucial to prevent work-related musculoskeletal disorders and to enable safe human-robot interaction. This is typically achieved relying on muscle activation recordings, commonly performed via wearable electromyographic (EMG) sensors. However, to properly acquire whole-body muscular status, a large number of sensors is needed. This represents a limitation for real deployment of wearable acquisition systems, due to cost and wearability constraints. To overcome this problem, we propose a solution to provide reliable muscle estimation from a limited number of EMG recordings. Our method exploits the covariation patterns between muscle activations to complement the recordings coming from a reduced set of optimally placed sensors, minimizing the estimation uncertainty. Using a dataset of EMG data recorded from 10 subjects, we demonstrate that it is possible to reconstruct the temporal evolution of 10 whole-body muscles with a maximum normalized estimation error of 13%, using only 7 EMG sensors. Keywords: Ergonomics, Human motion control, EMG, Optimal sensing
... There is rather extensive literature on the use of machine learning to analyse functional data, e.g., adapting Support Vector Machine models to functional data [Blanquero et al., 2019, Chaovalitwongse et al., 2008], using regression trees to detect critical intervals [Blanquero et al., 2023], or novel forms of interpretability when dealing with functional data [Carrizosa et al., 2022, Martín-Barragán et al., 2014]. See also Aneiros et al. [2022] and Ramsay [2006] for an overview of methods for functional data analysis. ...
Preprint
Full-text available
Counterfactual explanations have become a very popular interpretability tool to understand and explain how complex machine learning models make decisions for individual instances. Most of the research on counterfactual explainability focuses on tabular and image data and much less on models dealing with functional data. In this paper, a counterfactual analysis for functional data is addressed, in which the goal is to identify the samples of the dataset that the counterfactual explanation is made of, as well as how they are combined so that the individual instance and its counterfactual are as close as possible. Our methodology can be used with different distance measures for multivariate functional data and is applicable to any score-based classifier. We illustrate our methodology using two different real-world datasets, one univariate and another multivariate.
... An important milestone during that period was the methodology proposed by Lee and Carter (1992) for modelling and extrapolating long-term trends in mortality rates; since then, it has been applied widely and several extensions and modifications have been formulated (for example, Booth et al. [2002] and Renshaw and Haberman [2003]). The method of Hyndman and Ullah (2007) falls within the functional data paradigm (Ramsay, 2006). A functional datum is not a single observation but a set of measurements along a continuum which, taken together, are regarded as a single entity, curve or image. ...
Article
Full-text available
Chronic non-communicable diseases (NCDs) are conditions that are not transmitted through person-to-person contact and are characterized by their generally slow progression. In Argentina, they are the leading cause of death and disability; just two groups of them (cardiovascular diseases and cancer) account for half of all deaths and 27% of potential years of life lost (PYLL). The general objective of this quantitative, cross-sectional, descriptive study is to describe and analyse the age- and sex-specific profile of NCD mortality rates in Argentina using the functional data model (FDM) of Hyndman and Ullah (2007). This method also makes it possible to forecast the behaviour of the rates by taking into account age-related changes and the trend observed over time. The relative difference in mortality between the beginning of the study period (1985 to 2014) and the forecast for 2025 indicates that, if the prevailing behaviour continues, declines of around 50% would be reached for men between 30 and 50 years of age, and of 20% for women between 20 and 35 years of age. More generally, these results indicate that mortality rates for age groups under 70, whose deaths are termed premature, are clearly declining for both sexes, although the case of men stands out: while they show higher NCD mortality rates, their decline is more marked.
... In the analysis of longitudinal data, besides the mean function, it might also be interesting to consider a nonparametric covariance function, which captures the patterns of subject-specific trajectories. A popular nonparametric modeling approach for estimating both mean and covariance functions of longitudinal data is via methods of functional data analysis (Ramsay & Silverman, 2006), in which one views the values of a longitudinal trajectory as observations from a random smooth function; see Guo (2002) and Yao et al. (2003, 2005). The advantage of functional data methods compared to traditional parametric longitudinal methods is that they allow for unspecified mean/covariance functions, which may better capture complex variation patterns of subject-specific trajectories. ...
Article
Full-text available
In functional data analysis for longitudinal data, the observation process is typically assumed to be non-informative, which is often violated in real applications. Thus, methods that fail to account for the dependence between observation times and longitudinal outcomes may result in biased estimation. For longitudinal data with informative observation times, we find that under a general class of shared random effect models, a commonly used functional data method may lead to inconsistent model estimation while another functional data method results in consistent and even rate-optimal estimation. Indeed, we show that the mean function can be estimated appropriately via penalized splines and that the covariance function can be estimated appropriately via penalized tensor-product splines, both with specific choices of parameters. For the proposed method, theoretical results are provided, and simulation studies and a real data analysis are conducted to demonstrate its performance.
... Additionally, some classical measures are also used to describe sound patterns, such as Sound Pressure Level (SPL) [52], functions that describe signal variations, such as Roughness [48] and Rugosity [38], Root Mean Square (RMS), the mean of the Power Spectral Density (PSD) [67], Signal-to-noise ratio (SNR) [2], and Mel-frequency Cepstrum Coefficients (MFCC) [57]. ...
Article
Full-text available
In soundscape ecology analysis, the use of acoustic features is well established and offers important baselines for ecological analyses. However, in many cases the problem is difficult due to high class overlap in terms of time-frequency characteristics, as well as the presence of noise. Deep neural networks have become state-of-the-art for feature learning in many multi-class applications, but they often present issues such as over-fitting or unbalanced performance across classes, which can hamper the deployment of such models in realistic scenarios. In the context of counting the number of classes in observations, the quantification task is attracting attention and has been shown to be effective in other applications. This paper investigates the use of quantification combined with classification loss in order to train a convolutional neural network to classify species of birds and anurans. Results indicate quantification has advantages over both acoustic features alone and regular classification networks, in particular in terms of generalization and class recall, making it a suitable choice for segregation tasks related to soundscape ecology. Moreover, we show that a more compact network can outperform a deeper one for fine-grained scenarios of bird and anuran species.
... However, summarising curves to a single value results in loss of information and loss of sensitivity, and does not account for the quality of the fit of the parametric model [29]. A more powerful approach is to employ techniques from functional data analysis [30][31][32] and use the whole melting curve for statistics [29]. ...
Article
Full-text available
The thermal stability of proteins can be altered when they interact with small molecules or other biomolecules, or when they are subject to post-translational modifications. Thus, monitoring the thermal stability of proteins under various cellular perturbations can provide insights into protein function, as well as potentially determine drug targets and off-targets. Thermal proteome profiling is a highly multiplexed mass-spectrometry method for monitoring the melting behaviour of thousands of proteins in a single experiment. In essence, thermal proteome profiling assumes that proteins denature upon heating and hence become insoluble. Thus, by tracking the relative solubility of proteins at sequentially increasing temperatures, one can report on the thermal stability of a protein. Standard thermodynamics predicts a sigmoidal relationship between temperature and relative solubility, and this is the basis of current robust statistical procedures. However, current methods do not model deviations from this behaviour and they do not quantify uncertainty in the melting profiles. To overcome these challenges, we propose the application of Bayesian functional data analysis tools which allow complex temperature-solubility behaviours. Our methods have improved sensitivity over the state-of-the-art, identify new drug-protein associations, and have less restrictive assumptions than current approaches. Our methods allow for comprehensive analysis of proteins that deviate from the predicted sigmoid behaviour, and we uncover potentially biphasic phenomena with a series of published datasets.
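For context, a common parameterization of the sigmoidal temperature-solubility relationship mentioned above, fitted by nonlinear least squares, is sketched below. The functional form is standard in thermal proteome profiling, but the starting values are assumptions; the Bayesian functional approach of the paper instead models whole curves and allows deviations from this shape.

```python
import numpy as np
from scipy.optimize import curve_fit

def melting_sigmoid(T, a, b, plateau):
    """Relative solubility as a function of temperature T (a standard
    sigmoid used for protein melting curves)."""
    return (1.0 - plateau) / (1.0 + np.exp(b - a / T)) + plateau

def fit_melting_curve(T, solubility):
    """Nonlinear least-squares fit of the melting sigmoid."""
    p0 = (550.0, 10.0, 0.05)  # rough starting values (assumed)
    params, _ = curve_fit(melting_sigmoid, T, solubility, p0=p0, maxfev=10000)
    return params
```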
... For more details of functional principal component analysis, please refer to Ramsay and Silverman (2006). Note that in practice, only a few eigenvalues and eigenfunctions are capable of capturing the majority of variation in the data. ...
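As an illustration of that point, here is a minimal discretized sketch of functional PCA: the covariance operator's eigen-problem is approximated on the observation grid with quadrature weights, and only the leading eigenpairs are kept (all names and the grid assumption are illustrative):

```python
import numpy as np

def fpca(curves, t, n_comp=3):
    """Discretized functional PCA. curves: (n_subjects, n_grid) array on
    grid t. Returns leading eigenvalues, eigenfunctions and scores."""
    w = np.gradient(t)                      # quadrature weights
    mu = curves.mean(axis=0)
    C = np.cov(curves, rowvar=False)        # covariance surface on the grid
    # symmetrized weighted eigenproblem approximating the integral operator
    sw = np.sqrt(w)
    vals, vecs = np.linalg.eigh(sw[:, None] * C * sw[None, :])
    order = np.argsort(vals)[::-1][:n_comp]
    eigenvalues = vals[order]
    eigenfunctions = vecs[:, order] / sw[:, None]
    scores = (curves - mu) @ (eigenfunctions * w[:, None])
    return eigenvalues, eigenfunctions, scores
```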
Article
Full-text available
Process monitoring using profile data remains an important and challenging problem in various manufacturing industries. Motivated by an application case of motherboard testing processes, we develop a novel modeling and monitoring framework for heterogeneous multivariate profiles. In this framework, a heterogeneous graphical model is constructed to depict the complicated heterogeneous relationship among profile channels. Then monitoring the heterogeneous relationship among profile channels can be reduced to monitoring the graphical networks. Besides, we investigate several theoretical results concerning the accuracy of the estimated graphical structure. Finally, we demonstrate the proposed method through extensive simulations and a real case study. Supplementary materials for this article are available online.
... These functions are latent and infinite-dimensional, cannot be calculated analytically, and need to be approximated. The study of these functions falls under the name of functional data analysis (FDA), which was originally introduced by Ramsay et al. [19,20]. It involves smoothing techniques, data reduction, adjustment for clustering, functional linear modelling, and forecasting methods [21][22][23][24]. ...
Article
Full-text available
Raman spectral data are best described by mathematical functions; however, due to the spectroscopic measurement setup, only discrete points of these functions are measured. Therefore, we investigated Raman spectral data for the first time in the functional framework. First, we approximated the Raman spectra using B-spline basis functions. Afterwards, we applied functional principal component analysis followed by linear discriminant analysis (FPCA-LDA) and compared the results with those of classical principal component analysis followed by linear discriminant analysis (PCA-LDA). In this context, simulated and experimental Raman spectra were used. In the simulated Raman spectra, normal and abnormal spectra were used for a classification model, where the abnormal spectra were built by shifting one peak position. We showed that the mean sensitivities of the FPCA-LDA method were higher than those of the PCA-LDA method, especially when the signal-to-noise ratio is low and the shift of the peak position is small. For a higher signal-to-noise ratio, however, both methods performed equally well. Additionally, a slight improvement in mean sensitivity could be shown when the FPCA-LDA method was applied to experimental Raman data.
... Therefore, these plots can only show values at those discrete time stamps and may not capture possible variations in between. Hence, this paper uses a more principled statistical approach to comparing key features of clusters, based on functional data analysis (FDA) (Ramsay, 2006). ...
Article
Smartphone-based activity-travel surveys have enabled the collection of continuous, multi-day data on individuals’ trips and activities with high spatial and temporal resolution. However, the multi-dimensional nature of these data makes it challenging to compare activity-travel patterns and identify clusters of individuals with similar patterns and use them to study behaviors and forecast travel demands. To address this challenge, we adopt a discrete, step-based view on time and transform the episodic-based diary into a sequence of states observed at a sample interval. The resulting sequences can visually characterize variations in activity-travel patterns across days of a week and among different individuals. Motivated by techniques in genomics and data science, we apply sequence alignment methods to measure the pairwise similarity between these activity-trip sequences. To address its practical implementation in transportation studies, we define and compare four weighting schemas: (1) the unit-cost distance, which assigns equal weights to all substitution operations between states; (2) the fixed-flexible weighted distance based on the time geography framework, where costs differ for substitutions involving fixed and flexible activities; (3) the trip-activity weighted distance considering travel as a derived demand, where costs differ for substitutions between trips and activities; and (4) transition-based weighted distance, where costs are based on the global or time-varying activity and trip transition rates estimated from the data. Then, we calculate the pairwise distances between individuals’ sequences and use them as inputs to a hierarchical clustering algorithm to group individuals with similar sequences. We visualize the state distributions of the identified clusters to infer and compare behavior patterns, and use functional data regression methods to estimate the time-dependent probabilities of engaging in various activities and trips. To demonstrate our methods, we analyze a smartphone-based survey dataset collected in the Minneapolis-St. Paul metropolitan area. We also conduct sensitivity analysis on the selection of cost metrics and sample intervals to ensure the robustness of our methods and discuss their implications in practice. By identifying population subgroups with distinct daily activity-travel patterns and explaining how these patterns vary over one day and depend on user profiles, our weighted sequence alignment approach provides an intuitive and flexible method for extracting and characterizing individuals’ activity-travel behaviors for use in transportation planning.
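A compact sketch of the weighted alignment distance underlying the four schemas described above: a dynamic-programming edit distance in which the substitution weight between two states comes from a cost lookup (the uniform indel cost and all names are assumptions, not the paper's exact formulation):

```python
import numpy as np

def weighted_alignment_distance(seq1, seq2, sub_cost, indel=1.0):
    """Edit distance between two activity-trip state sequences, with
    state-pair substitution weights sub_cost[a][b] (e.g. unit costs,
    or higher costs for substituting fixed activities)."""
    n, m = len(seq1), len(seq2)
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = indel * np.arange(n + 1)
    D[0, :] = indel * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = min(
                D[i - 1, j - 1] + sub_cost[seq1[i - 1]][seq2[j - 1]],
                D[i - 1, j] + indel,   # deletion
                D[i, j - 1] + indel,   # insertion
            )
    return D[n, m]
```

The pairwise distance matrix produced this way can then be fed directly into a hierarchical clustering routine, as the abstract describes.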
... For example, FDA uses smoothing methods to create continuous functions to represent data [30]. This means that the decision of time granularity, and time as a discrete measurement, is just a processing step before we represent the data as a function with continuous time. This may lead us to question whether time granularity should even be considered. ...
Article
Full-text available
Introduction: The use of digital biomarker data in dementia research provides the opportunity for frequent cognitive and functional assessments that was not previously available using conventional approaches. Assessing high-frequency digital biomarker data can potentially increase the opportunities for early detection of cognitive and functional decline because of improved precision of person-specific trajectories. However, we often face a decision to condense time-stamped data into a coarser time granularity, defined as the frequency at which measurements are observed or summarized, for statistical analyses. It is important to find a balance between ease of analysis by condensing data and the integrity of the data, which is reflected in a chosen time granularity. Methods: In this paper, we discuss factors that need to be considered when faced with a time granularity decision. These factors include follow-up time, variables of interest, pattern detection, and signal-to-noise ratio. Results: We applied our procedure to real-world data which include longitudinal in-home monitored walking speed. The example sheds light on typical problems that such data present and how we could use the above factors in exploratory analysis to choose an appropriate time granularity. Discussion: Further work is required to explore issues with missing data and computational efficiency.
... General functional data clustering methods cluster functional data by modeling the curves within a common functional subspace, which is a difficult task because of the infinite-dimensional space to which the data belong. For different functional clustering approaches proposed in the recent literature see, for instance, [3][4][5][37][38][40]. A model-based clustering approach is provided by [11]. ...
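As a simple baseline illustrating the subspace idea (a stand-in sketch, not the model-based method of [11]): project the discretized curves onto their leading principal component scores and cluster the scores.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_curves(curves, n_comp=3, n_clusters=3, seed=0):
    """Baseline functional clustering: PCA scores of discretized curves
    followed by k-means on the low-dimensional scores."""
    centered = curves - curves.mean(axis=0)
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    scores = U[:, :n_comp] * S[:n_comp]          # principal component scores
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(scores)
```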
Article
Full-text available
Students' migratory mobility is the new form of migration: students migrate to improve their skills and become more valued in the job market. The data concern the migration of Italian Bachelor's graduates who enrolled at the Master's degree level, typically moving from poorer to richer areas. This paper investigates migration and other possible determinants of Master's students' performance. The Clustering of Effects approach for Quantile Regression Coefficients Modelling has been used to cluster the effects of some variables on students' performance for three Italian macro-areas. Results show evidence of similarity between Southern and Central students, as compared with Northern ones.
... Recent statistical research in semiparametric or nonparametric methods inspires us to understand the effect of environmental variation via nonparametric models that do not impose these assumptions. Specifically, we consider spline-based methods (see Wood (2000), Ramsay (2006), Ramsay, Hooker and Graves (2009) and Ruppert, Wand and Carroll (2003)) to predict nonlinear responses under environmental fluctuation. ...
... Between-group differences in network metrics for the mid-follicular and late-luteal menstrual phases were tested for each density threshold by nonparametric permutations using GAT. In addition, a summary measure for each network metric across the range of densities was calculated using functional data analysis (FDA) (Bassett et al., 2008; Ramsay, 2006), and between-group differences in this measure were tested using nonparametric permutations. Relationships between global measures and difficulties in emotion regulation were tested using Pearson's correlation. ...
Article
The female predominance in the prevalence of depression is partially accounted for by reactivity to hormonal fluctuations. Premenstrual dysphoric disorder (PMDD) is a reproductive subtype of depression characterized by cyclic emotional and somatic symptoms that recur before menstruation. Despite the growing understanding that most psychiatric disorders arise from dysfunctions in distributed brain circuits, the brain's functional connectome and its network properties of segregation and integration have not been investigated in PMDD. To this end, we examined the brain's functional network organization in PMDD using graph theoretical analysis. Twenty-four drug-naïve women with PMDD and 27 controls without premenstrual symptoms underwent two resting-state fMRI scans, during the mid-follicular and late-luteal menstrual cycle phases. Functional connectivity MRI, graph theory metrics and levels of sex hormones were computed during each menstrual phase. Altered network topology was found in PMDD across symptomatic and remitted stages in major graph metrics (characteristic path length, clustering coefficient, transitivity, local and global efficiency, centrality), indicating decreased functional network segregation and increased functional network integration. In addition, PMDD patients exhibited hypoconnectivity of the anterior temporal lobe and hyperconnectivity of the basal ganglia and thalamus across menstrual phases. Furthermore, the relationship between difficulties in emotion regulation and PMDD was mediated by specific patterns of functional connectivity, including connections of the striatum, thalamus and prefrontal cortex. The shifts in the functional connectome and its topology in PMDD may suggest trait vulnerability markers of the disorder.
... Functional principal component analysis [15] assumes that the observations are functional in nature (functions of time) and adapts the concepts of PCA accordingly: the rows of the data matrix become functions, a functional inner product is used in place of the vector inner product, and an integral transform is the analog of the eigen-equation [16]. ...
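Spelled out in standard fPCA notation (a sketch; the symbols are not taken from [15,16]), the functional inner product and the integral eigen-equation read:

```latex
\langle f, g \rangle = \int f(t)\, g(t)\, dt,
\qquad
\int v(s, t)\, \xi_k(s)\, ds = \lambda_k\, \xi_k(t),
```

where v(s,t) is the covariance function and the eigenfunctions ξ_k, with eigenvalues λ_k, play the role that eigenvectors of the covariance matrix play in ordinary PCA.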
Article
Full-text available
Principal Component Analysis (PCA) is a method based on statistics and linear algebra techniques, used in hyperspectral satellite imagery for the data dimensionality reduction required in order to speed up and increase the performance of subsequent hyperspectral image processing algorithms. This paper introduces gaPCA, a PCA approximation method based on a geometric construction approach: an alternative algorithm for computing the principal components based on a geometrically constructed approximation of standard PCA, and presents its application to remote-sensing hyperspectral images. gaPCA has the potential of yielding better land classification results by preserving a higher degree of information related to the smaller objects of the scene (or to rare spectral objects) than standard PCA, being focused not on maximizing the variance of the data, but the range. The paper validates gaPCA on four distinct datasets and performs comparative evaluations and metrics with the standard PCA method. A comparative land classification benchmark of gaPCA and standard PCA using statistical-based tools is also described. The results show gaPCA is an effective dimensionality-reduction tool, with performance similar to, and in several cases even higher than, standard PCA on specific image classification tasks. gaPCA was shown to be more suitable for hyperspectral images with small structures or objects that need to be detected, or where preponderantly spectral classes or spectrally similar classes are present.
Article
Data depth is an efficient tool for robustly summarizing the distribution of functional data and detecting potential magnitude and shape outliers. Commonly used functional data depth notions, such as the modified band depth and extremal depth, are estimated from the pointwise depth of each observed function. However, these techniques reduce each functional observation to one single depth value, which may not be sufficient to characterize the distribution of the functional data and detect potential outliers. This article presents an innovative approach to make the best use of pointwise depth. We propose using the pointwise depth distribution for magnitude-outlier visualization and the correlation between pairwise depths for shape-outlier detection. Furthermore, a bootstrap-based testing procedure is introduced for the correlation, to test whether there is any shape outlier. The proposed univariate methods are then extended to bivariate functional data. The performance of the proposed methods is examined and compared to conventional outlier detection techniques in intensive simulation studies. In addition, the developed methods are applied to simulated solar energy datasets from a photovoltaic system. Results reveal that the proposed method offers superior detection performance over conventional techniques. These findings will benefit engineers and practitioners in monitoring photovoltaic systems by detecting unnoticed anomalies and outliers.
Article
The Hilbert-Schmidt independence criterion (HSIC) is a dependence measure based on reproducing kernel Hilbert spaces. This measure can be used for the global sensitivity analysis of numerical simulators, whose objective is to identify the inputs most influential on the output(s) of the code. For this purpose, HSIC-based sensitivity measures and independence tests can be used for the ranking and screening of inputs, respectively. In this framework, this work proposes several improvements in the use of HSIC to widen its application spectrum and make the associated independence tests more powerful. First, we introduce a new method to perform the tests in a non-asymptotic framework. This method is much less CPU-time-expensive than the permutation-based one, while remaining just as efficient. Then, the use of HSIC-based independence tests is extended to the case of some space-filling designs, where the independent-and-identically-distributed condition on the observations is lifted. For this, a new procedure based on a conditional randomization test is used. In addition, we also propose a more powerful test that relies on a well-chosen parameterization of the HSIC statistics: the kernel bandwidth parameter is optimized instead of taking the standard choices. Numerical studies are performed to assess the efficiency of these procedures and compare them to existing tests in the literature. Finally, HSIC-based indices for functional outputs are defined, relying on appropriate and relevant kernels for this type of data. Illustrations are provided on temporal outputs of an analytical function and of a compartmental epidemiological model.
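For reference, here is a minimal sketch of the standard (biased) HSIC estimator with Gaussian kernels; in practice the bandwidths would be set by the median heuristic or, as proposed above, optimized:

```python
import numpy as np

def hsic(x, y, sx=1.0, sy=1.0):
    """Biased V-statistic estimator of HSIC with Gaussian kernels:
    HSIC = trace(K H L H) / (n - 1)^2, with H the centering matrix."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)

    def gram(z, s):
        sq = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq / (2.0 * s ** 2))

    K, L = gram(x, sx), gram(y, sy)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```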
Preprint
Full-text available
This work proposes a functional data analysis (FDA) approach for morphometrics in classifying three shrew species (S. murinus, C. monticola and C. malayana) from Peninsular Malaysia. Functional data geometric morphometrics (FDGM) for 2D landmark data is introduced and its performance is compared with classical geometric morphometrics (GM). The FDGM approach converts 2D landmark data into continuous curves, which are then represented as linear combinations of basis functions. The landmark data was obtained from 90 crania of shrew specimens based on three craniodental views (dorsal, jaw, and lateral). Principal component analysis (PCA) and linear discriminant analysis (LDA) were applied to both GM and FDGM methods to classify the three shrew species. This study also compared four machine learning approaches (naïve Bayes, support vector machine, random forest, and generalised linear models) using predicted PC scores obtained from both methods (combination of all three craniodental views and individual views). The analyses favoured FDGM and the dorsal view was the best view for distinguishing the three species. Overall, the generalised linear models (GLM) was the most accurate (95.4% accuracy) among the four classification models.
Article
Full-text available
In this paper, we propose a novel approach to the problem of functional outlier detection. Our method leverages a low-dimensional and stable representation of functions using Reproducing Kernel Hilbert Spaces (RKHS). We define a depth measure based on density kernels that satisfies desirable properties. We also address the challenges associated with estimating the density kernel depth. Through a Monte Carlo simulation, we assess the performance of our functional depth measure in the outlier detection task under different scenarios. To illustrate the effectiveness of our method, we showcase it in action by studying outliers in mortality rate curves.
Article
Objective: A clinician's operative skill-the ability to safely and effectively perform a procedure-directly impacts patient outcomes and well-being. Therefore, it is necessary to accurately assess skill progression during medical training as well as develop methods to most efficiently train healthcare professionals. Methods: In this study, we explore whether time-series needle angle data recorded during cannulation on a simulator can be analyzed using functional data analysis methods to (1) identify skilled versus unskilled performance and (2) relate angle profiles to degree of success of the procedure. Results: Our methods successfully differentiated between types of needle angle profiles. In addition, the identified profile types were associated with degrees of skilled and unskilled behavior of subjects. Furthermore, the types of variability in the dataset were analyzed, providing particular insight into the overall range of needle angles used as well as the rate of change of angle as cannulation progressed in time. Finally, cannulation angle profiles also demonstrated an observable correlation with degree of cannulation success, a metric that is closely related to clinical outcome. Conclusion: In summary, the methods presented here enable rich assessment of clinical skill since the functional (i.e., dynamic) nature of the data is duly considered.
Article
Current histocytometry methods enable single-cell quantification of biomolecules in tumor tissue sections by multiple detection technologies, including multiplex fluorescence-based immunohistochemistry or in situ hybridization. Quantitative pathology platforms can provide distributions of cellular signal intensity (CSI) levels of biomolecules across the entire cell populations of interest within the sampled tumor tissue. However, the heterogeneity of CSI levels is usually ignored, and the simple mean signal intensity (MSI) value is considered as a cancer biomarker. Here, we consider the entire distribution of CSI expression levels of a given biomolecule in the cancer cell population as a predictor of clinical outcome. The proposed Quantile Index (QI) biomarker is defined as the weighted average of CSI distribution quantiles in individual tumors. The weight for each quantile is determined by fitting a functional regression model for a clinical outcome. That is, the weights are optimized so that the resulting QI has the highest power to predict a relevant clinical outcome. The proposed QI biomarkers were derived for proteins expressed in cancer cells of malignant breast tumors and demonstrated improved prognostic value as compared to the standard MSI predictors. The R package Qindex implementing QI biomarkers has been developed. The proposed approach is not limited to immunohistochemistry data and can be based on any cell level expressions of proteins or nucleic acids.
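From the description above, the Quantile Index takes a form like the following (standard notation, assumed rather than quoted from the article):

```latex
\mathrm{QI}_i = \int_0^1 w(p)\, Q_i(p)\, dp
```

where Q_i(p) is the p-th quantile of the CSI distribution in tumor i and the weight function w(p) is chosen, via the functional regression fit, to maximize predictive power for the clinical outcome. The simple MSI predictor corresponds to the special case of weighting all quantiles equally, since the mean equals the integral of the quantile function.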
Article
Compared with the conditional mean regression-based scalar-on-function regression model, the scalar-on-function quantile regression is robust to outliers in the response variable. However, it is susceptible to outliers in the functional predictor (called leverage points). This is because the influence function of the regression quantiles is bounded in the response variable but unbounded in the predictor space. The leverage points may alter the eigenstructure of the predictor matrix, leading to poor estimation and prediction results. This study proposes a robust procedure to estimate the model parameters in the scalar-on-function quantile regression method and produce reliable predictions in the presence of both outliers and leverage points. The proposed method is based on a functional partial quantile regression procedure. We propose a weighted partial quantile covariance to obtain functional partial quantile components of the scalar-on-function quantile regression model. After the decomposition, the model parameters are estimated via a weighted loss function, where the robustness is obtained by iteratively reweighting the partial quantile components. The estimation and prediction performance of the proposed method is evaluated by a series of Monte-Carlo experiments and an empirical data example. The results are compared favorably with several existing methods. The method is implemented in an R package robfpqr.
Article
This paper introduces the functional tensor singular value decomposition (FTSVD), a novel dimension reduction framework for tensors with one functional mode and several tabular modes. The problem is motivated by high-order longitudinal data analysis. Our model assumes the observed data to be a random realization of an approximate CP low-rank functional tensor measured on a discrete time grid. Incorporating tensor algebra and the theory of Reproducing Kernel Hilbert Space (RKHS), we propose a novel RKHS-based constrained power iteration with spectral initialization. Our method can successfully estimate both singular vectors and functions of the low-rank structure in the observed data. With mild assumptions, we establish the non-asymptotic contractive error bounds for the proposed algorithm. The superiority of the proposed framework is demonstrated via extensive experiments on both simulated and real data.
Article
We present an approach for characterizing complex temporal behavior in the sensor measurements of a system in order to support detection of anomalies in that system. We first characterize typical behavior by extending a hidden Markov model-based approach to time series alignment. We then use a trace of that learned behavior to develop a particle filter that enables efficient estimation of the filtering distribution on the state space. This produces filtered residuals that can then be used in an anomaly detection framework. Our motivating example is the daily behavior of a building’s heating, ventilation, and air conditioning (HVAC) system, using sensor measurements that arrive every minute and induce a state space with 15,120 states. We provide an end-to-end demonstration of our approach showing improved performance of anomaly detection after application of alignment and filtering compared to the unaligned data. The proposed model is implemented as a computationally efficient R package alignts (align time series) built with R and Fortran 95 with OpenMP support.
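The filtering step can be summarized by a generic bootstrap particle filter; the article's contribution lies in the structured state space and trace-based dynamics that would be supplied here as rtrans and dmeas (all four function arguments below are hypothetical user-supplied pieces):

  pf <- function(yobs, n_part, rinit, rtrans, dmeas) {
    x <- rinit(n_part)                       # sample initial particles
    for (t in seq_along(yobs)) {
      x <- rtrans(x)                         # propagate through the learned dynamics
      w <- dmeas(yobs[t], x)                 # weight by the observation density
      x <- x[sample.int(n_part, n_part, replace = TRUE, prob = w)]  # resample
    }
    x                                        # particles approximating the filtering distribution
  }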
Article
Full-text available
Best estimate plus uncertainty for the safety assessment of nuclear power plant transients requires, among others, estimating the probability density function (PDF) of physical model parameters in thermal-hydraulic system codes. In that context, Bayesian calibration based on experimental data from separate-effect test facilities is increasingly popular for informing the PDF of a single thermal-hydraulic phenomenon. These calibrations are, however, time intensive, especially when considering multiple time-dependent outputs. Calibrating on many tests with different boundary conditions and potentially different phenomena to derive PDFs applicable to complex transients appears intractable, even using hierarchical modeling. In this paper, we start investigating this problem by considering a set of Flooding Experiments with Blocked Arrays reflood tests with different boundary conditions. We use TRACE v5.0p3 to model time- and space-dependent temperature profiles, pressure drops, and liquid carry-over. Global sensitivity analysis helps screen out noninfluential parameters and gain a detailed understanding of the modeled physics of reflood. The analysis shows that, for all tests, the outputs were sensitive to a similar set of influential model parameters. In turn, Bayesian calibration yields similar posterior PDFs for the influential parameters, and forward propagation of these posterior PDFs yields similar confidence intervals. As such, the information in the investigated tests can be well represented by a single posterior PDF. Such simplifications, although not general, are welcome to help manage the intensive calibration effort necessary for dealing with complex thermal-hydraulic transients encountered in nuclear power plants.
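The final propagation step is conceptually simple: push posterior draws through the code and summarize the outputs (a sketch in which post, a matrix of posterior parameter draws, and model, a wrapper around the simulation, are hypothetical; in practice each TRACE run is expensive and is often replaced by a surrogate):

  out <- apply(post, 1, function(theta) model(theta))  # one prediction per posterior draw
  quantile(out, c(0.025, 0.975))                       # 95% uncertainty interval on the output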
Article
The scalar-on-function regression model has become a popular analysis tool to explore the relationship between a scalar response and multiple functional predictors. Most of the existing approaches to estimate this model are based on the least-squares estimator, which can be seriously affected by outliers in empirical datasets. When outliers are present in the data, it is known that the least-squares-based estimates may not be reliable. This paper proposes a robust functional partial least squares method that allows a robust estimate of the regression coefficients in a scalar-on-multiple-function regression model. In our method, the functional partial least squares components are computed via partial robust M-regression. The predictive performance of the proposed method is evaluated using several Monte Carlo experiments and two chemometric datasets: glucose concentration spectrometric data and sugar process data. The results produced by the proposed method compare favorably with those of several classical functional or multivariate partial least squares and functional principal component analysis methods.
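The classical (non-robust) first partial least squares step that such methods robustify is just the covariance direction between the discretized curves and the response (X and y hypothetical; partial robust M-regression replaces this covariance step with a weighted version):

  Xc <- scale(X, scale = FALSE); yc <- y - mean(y)
  w1 <- drop(crossprod(Xc, yc)); w1 <- w1 / sqrt(sum(w1^2))  # first PLS weight vector
  t1 <- Xc %*% w1                                            # first component scores
  fit <- lm(yc ~ t1 - 1)                                     # regression on the component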
Chapter
The accurate assessment of upper limb motion impairment induced by stroke, one of the primary causes of disability worldwide, is the first step to successfully monitor and guide patients' recovery. As of today, the majority of assessment procedures rely on clinical scales, which are mostly based on ordinal scaling, operator-dependent, and subject to floor and ceiling effects. In this work, we intend to overcome these limitations by proposing a novel approach to analytically evaluate the level of pathological movement coupling, based on the quantification of movement complexity. To this end, we consider the variations of functional principal components applied to the reconstruction of joint angle trajectories of the upper limb during daily living task execution, and compare these variations between two conditions, i.e. the affected and the non-affected arm. A Dissimilarity Index, which codifies the severity of the upper limb motor impairment with respect to the movement complexity of the non-affected arm, is then proposed. This methodology was validated as a proof of concept on a set of four chronic stroke subjects with mild to moderate arm and hand impairments. As a first step, we evaluated whether the derived outcomes differentiate between the two conditions on the whole dataset. Secondly, we exploited this concept to discern between different subjects and impairment levels. Results show that: (i) differences in terms of movement variability between the affected and non-affected upper limb are detectable, and (ii) different impairment profiles can be characterized for single subjects using the proposed approach.
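As a purely illustrative reading of complexity-based comparison (not the chapter's actual Dissimilarity Index; affected and nonaffected are hypothetical repetitions-by-time matrices of joint angle trajectories), one could compare how many principal components each side needs:

  nc <- function(M, share = 0.9) {            # components needed for a fixed variance share
    v <- prcomp(M)$sdev^2
    which(cumsum(v) / sum(v) >= share)[1]
  }
  DI <- abs(nc(affected) - nc(nonaffected))   # crude complexity dissimilarity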
Chapter
The rich variety of human upper limb movements requires an extraordinary coordination of different joints according to specific spatio-temporal patterns. However, unveiling these motor schemes is a challenging task. Principal components have often been used for analogous purposes, but such an approach relies on the hypothesis that upper limb poses are temporally uncorrelated.
Article
In this work, we predict the outcomes of high fidelity multivariate computer simulations from low fidelity counterparts using function-to-function regression. The high fidelity simulation takes place on a high definition mesh, while its low fidelity counterpart takes place on a coarsened and truncated mesh. We showcase our approach by applying it to a complex finite element simulation of an abdominal aortic aneurysm, which provides the displacement field of a blood vessel under pressure. In order to link the two multidimensional outcomes, we compress them and then fit a function-to-function regression model. The data are high dimensional but of low sample size, meaning that only a few simulations are available, while the output of both low and high fidelity simulations is on the order of several thousands of values. To match this specific condition, our compression method assumes a Gaussian Markov random field that takes the finite element geometry into account and needs only a small amount of data. In order to solve the function-to-function regression model, we construct an appropriate prior with a shrinkage parameter which follows naturally from a Bayesian view of the Karhunen–Loève decomposition. Our model enables genuinely multivariate predictions on the complete grid instead of resorting to the outcome at specific points.
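Stripped of the Bayesian machinery, the score-on-score skeleton looks like this (Xlow and Yhigh are hypothetical matrices with one simulation per row; the paper replaces prcomp by a geometry-aware Gaussian Markov random field compression and adds a Karhunen–Loève shrinkage prior):

  pcX <- prcomp(Xlow, rank. = 5)   # compress low-fidelity fields
  pcY <- prcomp(Yhigh, rank. = 5)  # compress high-fidelity fields
  fit <- lm(pcY$x ~ pcX$x)         # link the two score spaces
  pred <- cbind(1, pcX$x) %*% coef(fit) %*% t(pcY$rotation) +   # decompress predictions
          matrix(pcY$center, nrow(Xlow), ncol(Yhigh), byrow = TRUE)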
Article
Expectile regression is a useful alternative to conditional mean and quantile regression for characterizing a conditional response distribution, especially when the distribution is asymmetric or when its tails are of interest. In this article, we propose a class of scalar‐on‐function linear expectile regression models where the functional slope parameter is assumed to reside in a reproducing kernel Hilbert space (RKHS). Our approach addresses numerous drawbacks to existing estimators based on functional principal components analysis (FPCA), which make implicit assumptions about RKHS eigenstructure. We show that our proposed estimator can achieve an optimal rate of convergence by establishing asymptotic minimax lower and upper bounds on the prediction error. Under this framework, we propose a flexible implementation based on the alternating direction method of multipliers algorithm. Simulation studies and an analysis of real‐world neuroimaging data validate our methodology and theoretical findings and, furthermore, suggest its superiority over FPCA‐based approaches in numerous settings.
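Expectiles are the minimizers of an asymmetrically weighted squared-error loss; for a sample they can be computed by a few reweighting iterations (a plain scalar sketch of the target quantity itself, not the paper's RKHS estimator):

  expectile <- function(y, tau, tol = 1e-10) {
    m <- mean(y)
    repeat {
      w <- ifelse(y >= m, tau, 1 - tau)       # asymmetric weights
      m_new <- sum(w * y) / sum(w)            # weighted-mean update
      if (abs(m_new - m) < tol) return(m_new)
      m <- m_new
    }
  }
  expectile(rnorm(1000), 0.9)                 # upper-tail expectile of a normal sample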
Article
Nonlinear differential equations (DEs) are used in a wide range of scientific problems to model complex dynamic systems. The differential equations often contain unknown parameters that are of scientific interest, which have to be estimated from noisy measurements of the dynamic system. Generally, there is no closed-form solution for nonlinear DEs, and the likelihood surface for the parameters of interest is multi-modal and very sensitive to different parameter values. We propose a Bayesian framework for nonlinear DE systems. A flexible nonparametric function is used to represent the dynamic process so that expensive numerical solvers can be avoided. A sequential Monte Carlo algorithm in the annealing framework is proposed to conduct Bayesian inference for parameters in DEs. In our numerical experiments, we use examples of ordinary differential equations and delay differential equations to demonstrate the effectiveness of the proposed algorithm. We developed an R package that is available at https://github.com/shijiaw/smcDE.
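The annealing idea can be sketched for a toy one-parameter posterior (yobs hypothetical; the smcDE package implements the full algorithm, including the nonparametric representation of the DE solution):

  loglik <- function(theta) sum(dnorm(yobs, mean = theta, sd = 1, log = TRUE))
  n <- 1000
  theta <- rnorm(n)                          # particles from a N(0, 1) prior
  temps <- seq(0, 1, length.out = 21)        # annealing schedule
  for (k in 2:length(temps)) {
    logw <- (temps[k] - temps[k - 1]) * sapply(theta, loglik)       # incremental weights
    theta <- sample(theta, n, replace = TRUE, prob = exp(logw - max(logw)))
    prop <- theta + rnorm(n, sd = 0.2)       # Metropolis move at temperature temps[k]
    logr <- dnorm(prop, log = TRUE) - dnorm(theta, log = TRUE) +
            temps[k] * (sapply(prop, loglik) - sapply(theta, loglik))
    keep <- log(runif(n)) < logr
    theta[keep] <- prop[keep]
  }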
Chapter
Full-text available
Functional Data Analysis (FDA) is the field of statistics which deals with the analysis of data expressed in the form of functions, which is extensible to data in the form of images. In a recent publication, Wang et al. [1] settled the mathematical groundwork for the application of FDA to the estimation of mean function and simultaneous confidence corridors (SCC) for a group of images and for the difference between two groups of images. This approach presents at least two advantages compared to previous methodologies: it avoids loss of information in complex data structures and also avoids the multiple comparison problem which arises from pixel-to-pixel comparison techniques. However, the computational costs of applying these procedures are yet to be fully explored and could outweigh the benefits resulting from the use of an FDA approach. In the present study, we aim to apply these novel procedures to simulated data and measure computing times both for the estimation of mean function and SCC for a one-group approach, for the comparison between two groups of images, and for the construction of Delaunay triangulations necessary for the implementation of the methodology. We also provide the computational tools to ensure replicability of the results herein presented.
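For reference, a Delaunay triangulation of a planar point set can be built and timed in a couple of lines (the geometry package is used here as a stand-in; the SCC methodology triangulates the imaging domain itself):

  library(geometry)
  pts <- as.matrix(expand.grid(x = seq(0, 1, length.out = 40), y = seq(0, 1, length.out = 40)))
  system.time(tri <- delaunayn(pts))   # tri: triangle vertex indices, one row per triangle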
Article
Full-text available
In this article we propose an extension of singular spectrum analysis for interval-valued time series. The proposed methods can be used to decompose and forecast the dynamics governing a set-valued stochastic process. The components into which the interval time series is decomposed can be understood as interval trendlines, cycles, or noise. Forecasting can be conducted through a linear recurrent method, and we devise generalizations of the decomposition method for the multivariate setting. The performance of the proposed methods is showcased in a simulation study. We apply the proposed methods to track the dynamics governing the Argentina Stock Market (MERVAL) in real time, in a case study that covers the most recent period of turbulence that led to discussions between the government of Argentina and the International Monetary Fund.
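The point-valued backbone of singular spectrum analysis is short enough to state directly (y is a hypothetical numeric series; the article applies the analogous decomposition to the interval bounds):

  L <- 30; N <- length(y)
  Xtraj <- sapply(1:(N - L + 1), function(i) y[i:(i + L - 1)])      # L x K trajectory matrix
  s <- svd(Xtraj)
  X1 <- s$d[1] * s$u[, 1] %*% t(s$v[, 1])                           # leading rank-1 component
  trend <- sapply(1:N, function(k) mean(X1[row(X1) + col(X1) - 1 == k]))  # diagonal averaging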
Article
The water and sediment fluxes into the estuary from upstream strongly influence the estuary ecosystem. The lack of methods to evaluate the impact of daily water and sediment fluxes on the ecosystem has limited the ecological management of estuaries. Therefore, it is important to find a quantitative method for the response of the estuarine ecosystem to daily water and sediment fluxes into the estuary. Based on long time-series data, we aim to determine the dynamic response of fish communities to the interannual and seasonal variations of water and sediment fluxes, providing a novel method for the regulation of water and sediment processes in dammed rivers. We establish a functional regression model and improve the Functional Linear Regression That is Interpretable (FLiRTI) method to quantify the dynamic response relationship between the annual catch per unit effort (CPUE) and the daily water and sediment fluxes into the Yellow River Estuary from 1980 to 2011. The results showed that the water and sediment fluxes into the estuary explained 56% of the variability in the CPUE. Low fluxes during the spawning period (April-May) and high fluxes during the dam discharge period (June-July) were not conducive to estuarine fisheries. By increasing freshwater inflow in spring to 3.1 or 4.3 billion m³, the cumulative effect on estuarine fish communities can be increased by 20% or 44%, respectively. We propose a functional regression method to quantify the response of estuarine fish communities to water and sediment fluxes into the estuary. We suggest that by advancing the dam discharge time to spring, a suitable spawning environment can be provided for fish, effectively increasing fish production in the estuary. This study overcomes the limitation on the number of species as well as the disunity of hydrological and ecological data on the time scale, providing new ideas for estuarine ecology research.
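In the same spirit, an interpretable scalar-on-function fit can be imitated with a sparsity penalty on the discretized coefficient function (a glmnet sketch with hypothetical Xflux, an n x 365 matrix of daily flux curves, and response cpue; FLiRTI instead penalizes derivatives of the coefficient function to obtain interpretable zero regions):

  library(glmnet)
  fit <- cv.glmnet(Xflux, cpue, alpha = 1)
  beta <- as.vector(coef(fit, s = "lambda.min"))[-1]   # daily weights, zero on irrelevant days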
Article
Modelling conditional distributions for functional data extends the concept of a mean response in functional regression settings, where vector predictors are paired with functional responses. This extension is challenging because of the non-existence of well-defined densities, cumulative distribution functions or quantile functions in the Hilbert space where the response functions are located. To address this challenge, we simplify the problem by assuming that the response functions are Gaussian processes, which means that the conditional distribution of the responses is determined by the conditional mean and conditional covariance. We demonstrate that these quantities can be obtained by applying global and local Fréchet regression, where the local version is more flexible and applicable when the covariate dimension is low and covariates are continuous, while the global version is not subject to these restrictions but is based on the assumption of a more restrictive regression relation. Convergence rates for the proposed estimates are obtained under the framework of M-estimation. The corresponding estimation of conditional distributions is illustrated with simulations and an application to bike sharing data. We also show that our methods are applicable to the challenging problem of studying functional fragments. Such data are observed in accelerated longitudinal studies and correspond to functional data observed over short domain segments. We demonstrate the utility of conditional distributions in this context by using the time (age) at which a subject enters the domain of a fragment, in addition to other covariates, as predictors, and the function observed over the domain of the fragment as response.
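The local, kernel-weighted estimate of a conditional mean function is the simplest ingredient and fits in one function (Y a hypothetical n x m matrix of response curves and x a scalar covariate; the paper develops Fréchet regression versions, including for the conditional covariance):

  cond_mean <- function(Y, x, x0, h) {
    w <- dnorm((x - x0) / h)     # Gaussian kernel weights centred at x0
    colSums(Y * w) / sum(w)      # weighted average curve, evaluated pointwise
  }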
Article
Functional data analysis plays an increasingly important role in medical research because patients are followed over time. Thus, the measurements of a particular biomarker for each patient are often registered as curves. Hence, it is of interest to estimate the mean function under certain conditions as an average of the observed functional data over a given period. However, this is often difficult, as follow-up studies of this kind are confronted with the challenge of some individuals dropping out before study completion. For these individuals, only a partial functional observation is available. In this study, we propose an estimator for the functional mean when the functions may be censored from the right and are thus only partly observed. Unlike sparse functional data, the censored curves are observed until some (random) time, and this censoring time may depend on the trajectory of the functional observations. Our approach is model-free and fully nonparametric, although the proposed methods can also be incorporated into regression models. The use of the functional structure of the data distinguishes our approach from longitudinal data approaches. In addition, we propose a bootstrap-based confidence band for the mean function, examine the estimation of the covariance function, and apply our new approach to functional principal component analysis. Employing an extensive simulation study, we demonstrate that our method outperforms the only two existing approaches. Furthermore, we apply our new estimator to a real data example on lung growth, measured by changes in pulmonary function, for girls in the United States.
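For contrast, the naive available-case estimator that such data tempt one to use is a single line, and it is biased whenever dropout depends on the trajectories (Ymat is a hypothetical curves-by-time matrix with NA entries after each dropout time):

  naive_mean <- colMeans(Ymat, na.rm = TRUE)   # pointwise average over curves still under observation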
Article
Time series data have grown at an explosive rate in numerous domains and have stimulated a surge of time series modeling research. A comprehensive comparison of different time series models, for a considered data analytics task, provides useful guidance on model selection for data analytics practitioners. Data scarcity is a universal issue that occurs in a vast range of data analytics problems, due to the high costs associated with collecting, generating and labeling data, as well as data quality issues such as missing data. In this paper, we focus on the temporal classification/regression problem that attempts to build a mathematical mapping from multivariate time series inputs to a discrete class label or a real-valued response variable. For this specific problem, we identify two types of scarce data: scarce data with small samples, and scarce data with sparsely and irregularly observed time series covariates. Observing that existing works are incapable of utilizing sparse time series inputs for proper model building, we propose a model called the sparse functional multilayer perceptron (SFMLP) for handling sparsity in the time series covariates. The effectiveness of the proposed SFMLP under each of the two types of data scarcity, in comparison with conventional deep sequential learning models (e.g., Recurrent Neural Networks and Long Short-Term Memory), is investigated through mathematical arguments and numerical experiments.
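The basic trick of a functional input layer can be imitated by hand: replace each sparsely observed curve by its least-squares basis coefficients and feed those to an ordinary network (hypothetical lists times and values and label vector labels; the SFMLP estimates the functional weights inside the network rather than in a separate projection step, and each curve needs enough points for the projection to be well defined):

  library(nnet); library(splines)
  scores <- t(mapply(function(tt, vv) {
    B <- bs(tt, df = 5, Boundary.knots = c(0, 1))   # common spline basis on [0, 1]
    qr.coef(qr(B), vv)                              # least-squares basis coefficients
  }, times, values))
  fit <- nnet(scores, class.ind(labels), size = 8, softmax = TRUE)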
Article
This paper proposes a sequential design for maximizing a stochastic computer simulator output, y(x), over an unknown optimization domain. The training data used to estimate the optimization domain are a set of (historical) inputs, often from a physical system modeled by the simulator. Two methods are provided for estimating the simulator input domain. An extension of the well-known efficient global optimization algorithm is presented to maximize y(x). The domain estimation/maximization procedure is applied to two readily understood analytic examples. It is also used to solve a problem in nuclear safety by maximizing the k-effective “criticality coefficient” of spent fuel rods, considered as one-dimensional heterogeneous fissile media. One of the two domain estimation methods relies on expertise-type constraints. We show that these constraints, initially chosen to address the spent fuel rod example, are robust in that they also lead to good results in the second analytic optimization example. Of course, in other applications, it could be necessary to design alternative constraints that are more suitable for these applications.
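The workhorse acquisition criterion in efficient global optimization is expected improvement, which has a closed form under a Gaussian predictive distribution (mu and s from any surrogate model, ymax the current best observed value):

  ei <- function(mu, s, ymax) {                 # expected improvement for maximization
    z <- (mu - ymax) / s
    (mu - ymax) * pnorm(z) + s * dnorm(z)
  }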
Article
Full-text available
Athletic groin pain (AGP) is a chronic, painful condition which is prevalent in players of field sports that require rapid changes of direction. Following successful rehabilitation, systematic changes have been observed in the kinetics and kinematics of pre-planned change of direction manoeuvres, providing insight into potential foci for rehabilitation monitoring and for the assessment of interventions. However, changing direction in field sports is often reactive rather than pre-planned, and it is not known whether such post-rehabilitation changes are seen in reactive manoeuvres. We analysed the stance phase kinetics and kinematics of a 90° reactive cutting manoeuvre in 35 AGP patients before and after a successful exercise intervention programme. Following the intervention, transverse plane rotation of the pelvis towards the intended direction of travel increased, and the body centre of mass was positioned more anteriorly relative to the centre of pressure. Ankle dorsiflexion also increased, and participants demonstrated greater ankle plantar flexor internal moment and power during the second half of stance. These findings provide insight into mechanical variables of potential importance in AGP, as identified during a manoeuvre based on a common sporting task.
Chapter
Full-text available
The sensory-motor architecture of the human upper limb and hand is characterized by a complex inter-relation of multiple elements, such as ligaments, muscles, and joints. Nonetheless, humans are able to generate coordinated and meaningful motor actions to interact with, and eventually explore, the external environment. Such complexity reduction is usually studied within the framework of synergistic control, whose focus has been mostly limited to human grasping and manipulation. Little attention has been devoted to the spatio-temporal characterization of human upper limb kinematic strategies and to how the purposeful exploitation of environmental constraints shapes human execution of manipulative actions. In this chapter, we report results on the evidence of a synergistic control of the human upper limb during manipulation involving the environment. We propose functional analysis to characterize the main spatio-temporal coordinated patterns of arm joints. Furthermore, we study how the environment influences human grasping synergies. The effect of cutaneous impairment is also evaluated. Applications to the design and control of robotic and assistive devices are finally discussed.
Article
Full-text available
A new class of longitudinal data has emerged with the use of technological devices for scientific data collection. This class of data is called intensive longitudinal data (ILD). This book features applied statistical modelling strategies developed by leading statisticians and methodologists working in conjunction with behavioural scientists.
Article
Full-text available
The methods of functional data analysis are used to estimate item response functions (IRFs) nonparametrically. The EM algorithm is used to maximize the penalized marginal likelihood of the data. The penalty controls the smoothness of the estimated IRFs, and is chosen so that, as the penalty is increased, the estimates converge to shapes closely represented by the three-parameter logistic family. The one-dimensional latent trait model is recast as a problem of estimating a space curve or manifold, and, expressed in this way, the model no longer involves any latent constructs, and is invariant with respect to choice of latent variable. Some results from differential geometry are used to develop a data-anchored measure of ability and a new technique for assessing item discriminability. Functional data-analytic techniques are used to explore the functional variation in the estimated IRFs. Applications involving simulated and actual data are included.
Book
Full-text available
Semiparametric regression is concerned with the flexible incorporation of non-linear functional relationships in regression analyses. Any application area that benefits from regression analysis can also benefit from semiparametric regression. Assuming only a basic familiarity with ordinary parametric regression, this user-friendly book explains the techniques and benefits of semiparametric regression in a concise and modular fashion. The authors make liberal use of graphics and examples plus case studies taken from environmental, financial, and other applications. They include practical advice on implementation and pointers to relevant software. The 2003 book is suitable as a textbook for students with little background in regression, as well as a reference book for statistically oriented scientists, such as biostatisticians, econometricians, quantitative social scientists, and epidemiologists, with a good working knowledge of regression and the desire to begin using more flexible semiparametric models. Even experts on semiparametric regression should find something new here.
Article
Full-text available
The authors examined how timing accuracy in tapping sequences is influenced by sequential effects of preceding finger movements and biomechanical interdependencies among fingers. Skilled pianists tapped sequences at 3 rates; in each sequence, a finger whose motion was more or less independent of other fingers' motion was preceded by a finger to which it was more or less coupled. Less independent fingers and those preceded by a more coupled finger showed large timing errors and change in motion because of the preceding finger's motion. Motion change correlated with shorter intertap intervals and increased with rate. Thus, timing of sequence elements is not independent of the motion trajectories that individuals use to produce them. Neither motion nor its relation to timing is invariant across rates.
Article
Full-text available
Data often arrive as curves: functions sampled at regular times or frequencies. Functional principal components (Ramsay and Silverman, 1997) can be used to describe the modes of variation of these functions. In many situations we do not get complete measurements of the individual curves. For example, growth curves are sampled functions, consisting of measurements such as bone density at different ages in a child's development. These measurements are often taken at an irregular and sparse set of time points, which can differ widely across individuals. We develop principal component models for representing the modes of variation of these curves. These models are estimated in a reduced-rank mixed-effects framework.
Article
Full-text available
This method attempts to estimate the principal component curve directly, rather than estimating an entire covariance matrix and computing its first eigenvector.
Chapter
So far, we have considered Green’s functions only for ordinary differential equations, i.e., in one-dimensional problems. However, Green’s functions are useful for multidimensional spaces and general curvilinear coordinates as well as Cartesian coordinates. Thus an inverse operator can be defined for a partial differential operator as well as an ordinary differential operator. The Green’s function will again satisfy the given differential equation with a (now multidimensional) δ function as the forcing function. Let’s consider what is to be meant by a multidimensional δ function.
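Concretely, for a linear differential operator L the Green's function satisfies L G(x, x′) = δ(x − x′), and u(x) = ∫ G(x, x′) f(x′) dx′ then solves L u = f; the multidimensional δ function is exactly what makes this inversion work in higher dimensions.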
Article
Contents: Introduction; Life Course Data in Criminology; The Nondurable Goods Index; Bone Shapes from a Paleopathology Study; Modeling Reaction Time Distributions; Zooming in on Human Growth; Time Warping Handwriting and Weather Records; How Do Bone Shapes Indicate Arthritis?; Functional Models for Test Items; Predicting Lip Acceleration from Electromyography; Variable Seasonal Trend in the Goods Index; The Dynamics of Handwriting Printed Characters; A Differential Equation for Juggling.
Article
Functional data analysis techniques are used to analyze a sample of handwriting in Chinese. The goals are (a) to identify a differential equation that satisfactorily models the data's dynamics, and (b) to use the model to classify handwriting samples taken from different individuals. After preliminary smoothing and registration steps, a second-order linear differential equation, for which the forcing function is small, is found to provide a good reconstruction of the original script records. The equation is also able to capture a substantial amount of the variation in the scripts across replications. The cross-validated classification process is 100% effective for the samples analyzed.
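A crude finite-difference version of fitting such a second-order equation conveys the idea (x is a hypothetical smoothed, registered coordinate record sampled at spacing dt; principal differential analysis does this properly with basis expansions):

  dx  <- diff(x) / dt                  # first derivative
  d2x <- diff(dx) / dt                 # second derivative
  n <- length(d2x)
  fit <- lm(d2x ~ x[1:n] + dx[1:n])    # D2x = b0 + b1*x + b2*Dx; small residuals mean a small forcing function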
Article
The authors develop a functional linear model in which the values at time t of a sample of curves y_i(t) are explained in a feed-forward sense by the values of covariate curves x_i(s) observed at times s ≤ t. They give special attention to the case s ∈ [t − δ, t], where the lag parameter δ is estimated from the data. They use the finite element method to estimate the bivariate regression function β(s, t), which is defined on the triangular domain s ≤ t. They apply their model to the problem of predicting the acceleration of the lower lip during speech on the basis of electromyographical recordings from a muscle depressing the lip. They also provide simulation results to guide the calibration of the fitting process.
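In current notation this is the historical functional linear model, y_i(t) = α(t) + ∫_{t−δ}^{t} β(s, t) x_i(s) ds + ε_i(t), with the coefficient surface β supported on the triangle s ≤ t.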
Article
We present a technique for extending generalized linear models to the situation where some of the predictor variables are observations from a curve or function. The technique is particularly useful when only fragments of each curve have been observed. We demonstrate, on both simulated and real data sets, how this approach can be used to perform linear, logistic and censored regression with functional predictors. In addition, we show how functional principal components can be used to gain insight into the relationship between the response and functional predictors. Finally, we extend the methodology to apply generalized linear models and principal components to standard missing data problems.
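The standard implementation route is to use functional principal component scores as ordinary GLM covariates (hypothetical X, a matrix of discretized predictor curves, and binary response y; fragmented curves require estimating the scores from partial observations first):

  sc <- prcomp(X)$x[, 1:3]              # leading FPC scores of the predictor curves
  fit <- glm(y ~ sc, family = binomial) # logistic regression with a functional predictor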
Article
This article describes flexible statistical methods that may be used to identify and characterize nonlinear regression effects. These methods are called "generalized additive models". For example, a commonly used statistical model in medical research is the logistic regression model for binary data. Here we relate the mean of the binary response μ = P(y = 1) to the predictors via a linear model and the logit link function: log(μ/(1 − μ)) = α + β1x1 + … + βpxp; the generalized additive model replaces each linear term βjxj with a smooth function sj(xj).
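In modern R such an additive logistic model can be fitted directly (mgcv shown for concreteness; it postdates the article):

  library(mgcv)
  fit <- gam(y ~ s(x1) + s(x2), family = binomial)  # smooth terms replace the linear ones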
Chapter
Most statistical analyses involve one or more observations taken on each of a number of individuals in a sample, with the aim of making inferences about the general population from which the sample is drawn. In an increasing number of fields, these observations are curves or images. Curves and images are examples of functions, since an observed intensity is available at each point on a line segment, a portion of a plane, or a volume. For this reason, we call observed curves and images ‘functional data,’ and statistical methods for analyzing such data are described by the term ‘functional data analysis.’ It is the smoothness of the processes generating functional data that differentiates this type of data from more classical multivariate observations. This smoothness means that we can work with the information in the derivatives of functions or images. This article includes several illustrative examples.
Article
The option characteristic curve, the relation between ability and probability of choosing a particular option for a test item, can be estimated by nonparametric smoothing techniques. What is smoothed is the relation between some function of estimated examinee ability rankings and the binary variable indicating whether or not the option was chosen. This paper explores the use of kernel smoothing, which is particularly well suited to this application. Examples show that, with some help from the fast Fourier transform, estimates can be computed about 500 times as rapidly as when using commonly used parametric approaches such as maximum marginal likelihood estimation using the three-parameter logistic distribution. Simulations suggest that there is no loss of efficiency even when the population curves are three-parameter logistic. The approach lends itself to several interesting extensions.
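The estimator itself is one call to a kernel smoother (theta, hypothetical ability scores derived from examinee rank order, and choice, the 0/1 option indicator, are assumed):

  occ <- ksmooth(theta, choice, kernel = "normal", bandwidth = 0.4,
                 x.points = seq(-3, 3, length.out = 101))   # smoothed option characteristic curve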
Article
Consideration is given to the determination of the parameters of a functional relation between two variables by means of factor analysis techniques. If the function can be separated into a sum of products of functions of the individual parameters and corresponding functions of the independent variable, particular values of the functions of the parameters and of the functions of the independent variable may be found by factor analysis. Otherwise, approximate solutions may be determined. These solutions may represent important results from experimental investigations.
Article
We propose a new method for estimating parameters in models that are defined by a system of non-linear differential equations. Such equations represent changes in system outputs by linking the behaviour of derivatives of a process to the behaviour of the process itself. Current methods for estimating parameters in differential equations from noisy data are computationally intensive and often poorly suited to the realization of statistical objectives such as inference and interval estimation. The paper describes a new method that uses noisy measurements on a subset of variables to estimate the parameters defining a system of non-linear differential equations. The approach is based on a modification of data smoothing methods along with a generalization of profiled estimation. We derive estimates and confidence intervals, and show that these have low bias and good coverage properties respectively for data that are simulated from models in chemical engineering and neurobiology. The performance of the method is demonstrated by using real world data from chemistry and from the progress of the autoimmune disease lupus.
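A basic trajectory-matching alternative makes the contrast clear: it solves the system numerically inside the least-squares criterion, which is exactly the repeated cost the profiling approach avoids (a deSolve sketch with a hypothetical logistic equation and data y0, tobs, yobs):

  library(deSolve)
  rhs <- function(t, y, p) list(p["a"] * y * (1 - y / p["K"]))      # dy/dt for a logistic model
  sse <- function(p) {
    out <- ode(y = c(y = y0), times = tobs, func = rhs, parms = p)  # one numerical solve per evaluation
    sum((out[, "y"] - yobs)^2)
  }
  est <- optim(c(a = 1, K = 1), sse)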
Article
We develop a flexible model-based procedure for clustering functional data. The technique can be applied to all types of curve data but is particularly useful when individuals are observed at a sparse set of time points. In addition to producing final cluster assignments, the procedure generates predictions and confidence intervals for missing portions of curves. Our approach also provides many useful tools for evaluating the resulting models. Clustering can be assessed visually via low-dimensional representations of the curves, and the regions of greatest separation between clusters can be determined using a discriminant function. Finally, we extend the model to handle multiple functional and finite-dimensional covariates and show how it can be applied to standard finite-dimensional clustering problems involving missing data.
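The scores-then-cluster shortcut that such a model generalizes is two lines (hypothetical X of fully observed curves; the mixture of mixed-effects models additionally handles sparse observation and predicts the missing curve segments):

  sc <- prcomp(X)$x[, 1:2]               # low-dimensional curve representation
  cl <- kmeans(sc, centers = 3)$cluster  # hard cluster assignments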
Article
The maximum of a Gaussian random field was used by Worsley et al. (1992) to test for activation at an unknown point in positron emission tomography images of blood flow in the human brain. The Euler characteristic of excursion sets was used as an estimator of the number of regions of activation. The expected Euler characteristic of excursion sets of stationary Gaussian random fields has been derived by Adler and Hasofer (1976) and Adler (1981). In this paper we extend the results of Adler (1981) to χ², F and t fields. The theory is applied to some three-dimensional images of cerebral blood flow from a study on pain perception.
Article
We introduce a technique for extending the classical method of Linear Discriminant Analysis to data sets where the predictor variables are curves or functions. This procedure, which we call functional linear discriminant analysis (FLDA), is particularly useful when only fragments of the curves are observed. FLDA possesses all of the usual LDA tools. In particular, it can be used to produce classifications of new (test) curves, give an estimate of the discriminant function between classes, and provide a low (one or two) dimensional pictorial representation of a set of curves. We also extend this procedure to provide generalizations of quadratic and regularized discriminant analysis. Key words: functional discriminant analysis; classification; low dimensional graphical representation.
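The simplest functional LDA recipe, useful as a baseline, is LDA on basis coefficients (hypothetical curves X on a common grid and group labels g; the article's likelihood-based formulation is what extends this to fragments):

  library(MASS); library(splines)
  B <- bs(grid, df = 8)                    # common spline basis on the observation grid
  coefs <- t(qr.coef(qr(B), t(X)))         # one row of basis coefficients per curve
  fit <- lda(coefs, grouping = g)          # classical LDA on the coefficient vectors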
Blanchard, P., Devaney, R. L. and Hall, G. R. (2006) Differential Equations, Third Edition. Belmont, CA: Thomson Brooks/Cole.
Bliss, M. (1991) Plague: A Story of Smallpox in Montreal. Toronto: HarperCollins Publishers.
Heckman, N. and Ramsay, J. O. Canadian Journal of Statistics.
Ramsay, J. O. and Silverman, B. W. (2005) Functional Data Analysis, Second Edition. New York: Springer.