Sean Moran's research while affiliated with JPMorgan Chase & Co. and other places

Publications (24)

Conference Paper
Full-text available
This paper introduces an unsupervised method to estimate the class separability of text datasets from a topological point of view. Using persistent homology, we demonstrate how tracking the evolution of embedding manifolds during training can inform about class separability. More specifically, we show how this technique can be applied to detect wh...
Preprint
Full-text available
The recent rapid advancements in both sensing and machine learning technologies have given rise to the universal collection and utilization of people's biometrics, such as fingerprints, voices, retina/facial scans, or gait/motion/gestures data, enabling a wide range of applications including authentication, health monitoring, or much more sophistic...
Article
Full-text available
The recent rapid advancements in both sensing and machine learning technologies have given rise to the universal collection and utilization of people’s biometrics, such as fingerprints, voices, retina/facial scans, or gait/motion/gestures data, enabling a wide range of applications including authentication, health monitoring, or much more sophistic...
Preprint
Full-text available
Machine learning models trained on sensitive or private data can inadvertently memorize and leak that information. Machine unlearning seeks to retroactively remove such details from model weights to protect privacy. We contribute a lightweight unlearning algorithm that leverages the Fisher Information Matrix (FIM) for selective forgetting. Prior wo...
Chapter
Finding relevant and high-quality datasets to train machine learning models is a major bottleneck for practitioners. Furthermore, to address ambitious real-world use-cases there is usually the requirement that the data come labelled with high-quality annotations that can facilitate the training of a supervised model. Manually labelling data with hi...
Preprint
Full-text available
This paper proposes a method to estimate the class separability of an unlabeled text dataset by inspecting the topological characteristics of sentence-transformer embeddings of the text. Experiments conducted involve both binary and multi-class cases, with balanced and imbalanced scenarios. The results demonstrate a clear correlation and a better c...
Preprint
Full-text available
This paper investigates the effectiveness of large language models (LLMs) in email spam detection by comparing prominent models from three distinct families: BERT-like, Sentence Transformers, and Seq2Seq. Additionally, we examine well-established machine learning techniques for spam detection, such as Naïve Bayes and LightGBM, as baseline methods...
Preprint
Full-text available
Finding relevant and high-quality datasets to train machine learning models is a major bottleneck for practitioners. Furthermore, to address ambitious real-world use-cases there is usually the requirement that the data come labelled with high-quality annotations that can facilitate the training of a supervised model. Manually labelling data with hi...
Chapter
We present CV4Code, a compact and effective computer vision method for source code understanding. Our method leverages the contextual and the structural information available from the code snippet by treating each snippet as a two-dimensional image, which naturally encodes the context and retains the underlying structural information through an exp...
Preprint
Full-text available
The concept of Green AI has been gaining attention within the deep learning community given the recent trend of ever larger and more complex neural network models. Some large models have billions of parameters causing the training time to take up to hundreds of GPU/TPU-days. The estimated energy consumption can be comparable to the annual total ene...
Preprint
Full-text available
When designing a new API for a large project, developers need to make smart design choices so that their code base can grow sustainably. To ensure that new API components are well designed, developers can learn from existing API components. However, the lack of a standardized method for comparing API designs makes this learning process time-consuming...
Preprint
The use of packaged libraries can significantly shorten the software development cycle by improving the quality and readability of code. In this paper, we present a recommendation engine called Librarian for open source libraries. A candidate library package is recommended for a given context if: 1) it has been frequently used with the imported lib...
Chapter
The use of biometrics such as fingerprints, voices, and images is becoming increasingly ubiquitous in people’s daily lives, in applications ranging from authentication and identification to much more sophisticated analytics, thanks to the recent rapid advances in both sensing hardware technologies and machine learning techniques. While...
Preprint
Machine learning on source code (MLOnCode) promises to transform how software is delivered. By mining the context and relationship between software artefacts, MLOnCode augments the software developers' capabilities with code auto-generation, code recommendation, code auto-tagging and other data-driven enhancements. For many of these tasks a script l...
Preprint
We present CV4Code, a compact and effective computer vision method for source code understanding. Our method leverages the contextual and the structural information available from the code snippet by treating each snippet as a two-dimensional image, which naturally encodes the context and retains the underlying structural information through an expl...
Preprint
Federated learning reduces the risk of information leakage, but remains vulnerable to attacks. We investigate how several neural network design decisions can defend against gradient inversion attacks. We show that overlapping gradients provide numerical resistance to gradient inversion on the highly vulnerable dense layer. Specifically, we propos...
Preprint
Chest Computed Tomography (CT) scans offer low cost, speed and objectivity for COVID-19 diagnosis, and deep learning methods have shown great promise in assisting the analysis and interpretation of these images. Most hospitals or countries can train their own models using in-house data; however, empirical evidence shows that those models perfo...
Preprint
Machine learning on source code (MLOnCode) is a popular research field that has been driven by the availability of large-scale code repositories and the development of powerful probabilistic and deep learning models for mining source code. Code-to-code recommendation is a task in MLOnCode that aims to recommend relevant, diverse and concise code sn...

Citations

... We use BERTopic to cluster topics from preprocessed content. Previous literature indicates that BERTopic consistently outperforms its competitors in various topic modeling tasks (Grootendorst 2022) and is frequently used in empirical software engineering research (Diamantopoulos et al. 2023; Gu et al. 2023; Tao et al. 2023). BERT embeddings are adept at capturing semantic nuances, often outperforming sparse, high-dimensional bag-of-words or n-gram models. ...
... Code-to-code recommendation is an approach where the recommendation tool understands the current context of the developer's code and recommends code accordingly. There have been multiple studies of code-to-code recommendation, such as Aroma [1], Senatus [2], Strathcona [3], and Example Overflow [4], using techniques such as clustering and ranking of code snippets, code structure, Jaccard similarity and MinHash-LSH on code, and TF-IDF. In text-to-code recommendation, developers describe the details and behaviors of the code they require to the recommendation tool in natural language text. ...
... A first version of our work appeared in ESORICS 2022 [37]. The work here is considerably extended with enhanced algorithm designs as well as expanded experimental evaluation. ...
... (3) Most experiments do not consider the execution time problem. It is necessary to shorten the execution time with appropriate data preprocessing [192][193][194][195] strategies or GPU acceleration. (4) The experiments discussed in this paper use chest CT or chest X-ray images as the input datasets of CNN and have achieved good performance. ...
... The latter work has presented an extensive comparative analysis on this data wherein boosting algorithms have admitted promising results. A very recent work [22] that takes advantage of the dynamical changes of the Bitcoin transaction graph of Elliptic data has introduced graphlet spectral correlation analysis. This work has proposed a two-stage random forest classifier associated with the GCN model using different configurations on train/test set split. ...