Home
Princeton University
Department of Computer Science
Sanjeev Arora

Sanjeev Arora
Princeton University | PU · Department of Computer Science

About

162

Publications

20,785

Reads

16,676

Citations

Skills and Expertise

Theoretical Computer Science

Algorithm Analysis

Publications

Language Models as Science Tutors

Preprint

Full-text available

Feb 2024

NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TUTOREVAL and TUTORCHA...

Evaluating gradient inversion attacks and defenses

Chapter

Jan 2024

Figure 2: TINT simulates the forward pass of a linear layer as a H-head...

Figure 4: TINT simulates the backward pass of a linear layer as a...

Figure 5: TINT computes the parameter gradients for a linear layer as a...

Figure 9: TINT simulates the backward pass of the self-attention layer...

Few-shot (k = 32) results with different loss types, input formats, and...

Trainable Transformer in Transformer

Preprint

Full-text available

Jul 2023

Recent works attribute the capability of in-context learning (ICL) in large pre-trained language models to implicitly simulating and fine-tuning an internal model (e.g., linear or 2-layer MLP) during inference. However, such constructions require large memory overhead, which makes simulation of more sophisticated internal models intractable. In thi...

Experiments modifying the variance of MeZO using d as the gradient norm...

present the full results of OPT-30B and OPT-66B, with detailed MeZO...

Fine-Tuning Language Models with Just Forward Passes

Preprint

Full-text available

May 2023

Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this wo...

Why (and When) does Local SGD Generalize Better than SGD?

Preprint

Mar 2023

Local SGD is a communication-efficient variant of SGD for large-scale training, where multiple GPUs perform SGD independently and average the model parameters periodically. It has been recently observed that Local SGD can not only achieve the design goal of reducing the communication overhead but also lead to higher test accuracy than the correspon...

Task-Specific Skill Localization in Fine-tuned Language Models

Preprint

Full-text available

Feb 2023

Pre-trained language models can be fine-tuned to solve diverse NLP tasks, including in few-shot settings. Thus fine-tuning allows the model to quickly pick up task-specific ``skills,'' but there has been limited study of where these newly-learnt skills reside inside the massive model. This paper introduces the term skill localization for this probl...

Do Transformers Parse while Predicting the Masked Word?

Conference Paper

Jan 2023

Figure 2: ImageNet image and masked versions for labels desk and...

Figure 4: Worst case completeness v.s. soundness plot on Imagenette for...

Figure 5: Images containing both elephant(s) and zebra(s), and the...

Figure 6: Our masks and masks by Phang et al. [31] for ImageNet images...

Figure 7: Masks of 10 Imagenette examples generated by different...

New Definitions and Evaluations for Saliency Methods: Staying Intrinsic, Complete and Sound

Preprint

Full-text available

Nov 2022

Saliency methods compute heat maps that highlight portions of an input that were most {\em important} for the label assigned to it by a deep net. Evaluations of saliency methods convert this heat map into a new {\em masked input} by retaining the $k$ highest-ranked pixels of the original input and replacing the rest with \textquotedblleft uninforma...

A Kernel-Based View of Language Model Fine-Tuning

Preprint

Oct 2022

It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural...

Understanding Influence Functions and Datamodels via Harmonic Analysis

Preprint

Full-text available

Oct 2022

Influence functions estimate effect of individual data points on predictions of the model on test data and were adapted to deep learning in Koh and Liang [2017]. They have been used for detecting data poisoning, detecting helpful and harmful examples, influence of groups of datapoints, etc. Recently, Ilyas et al. [2022] introduced a linear regressi...

Implicit Bias of Gradient Descent on Reparametrized Models: On Equivalence to Mirror Descent

Preprint

Jul 2022

As part of the effort to understand implicit bias of gradient descent in overparametrized models, several results have shown how the training trajectory on the overparametrized model can be understood as mirror descent on a different objective. The main result here is a characterization of this phenomenon under a notion termed commuting parametriza...

Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction

Preprint

Jun 2022

Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets, but they clearly also help generalization, even in not-so-deep nets. Motivated by the long-held belief that flatter minima lead to better generalization, this paper gives mathematical analysis and supportin...

Figure 11: SVAG experiments on GPT pretrained on WikiText-103 with...

Figure 12: VGG-16 trained on CIFAR-10 using RMSprop are close for...

Figure 20: ResNet-50 trained on ImageNet using Adam are close for...

Figure 21: ResNet-50 trained on ImageNet using Adam are close for...

Optimization hyperparameters of baseline RoBERTa model during pretraining.

On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

Preprint

Full-text available

May 2022

Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there wer...

Figure 1: GD on √ L oscillates around the manifold Γ = {(x, y) | y = 0}...

Figure 2: We verify our theoretical claims during limiting flow phase...

Figure 3: Normalized GD and Riemannian flow have almost the same...

Figure 4: The trajectory of Normalized GD is very close to that of the...

Understanding Gradient Descent on Edge of Stability in Deep Learning

Preprint

Full-text available

May 2022

Deep learning experiments in Cohen et al. (2021) using deterministic Gradient Descent (GD) revealed an {\em Edge of Stability (EoS)} phase when learning rate (LR) and sharpness (\emph{i.e.}, the largest eigenvalue of Hessian) no longer behave as in traditional optimization. Sharpness stabilizes around $2/$LR and loss goes up and down across iterati...

Understanding Contrastive Learning Requires Incorporating Inductive Biases

Preprint

Feb 2022

Contrastive learning is a popular form of self-supervised learning that encourages augmentations (views) of the same input to have more similar representations compared to augmentations of different inputs. Recent attempts to theoretically explain the success of contrastive learning on downstream classification tasks prove guarantees depending on p...

Figure 2: Attack is weakened when private labels are not available. (a)...

Figure 6: Reconstruction results under different defenses for a more...

Figure 7: Reconstruction results under different defenses with batch...

Figure 8: Reconstruction results of MNIST digits under different...

Figure 9: Original and reconstructed batches of 16 images under MixUp...

Evaluating Gradient Inversion Attacks and Defenses in Federated Learning

Preprint

Full-text available

Nov 2021

Gradient inversion attack (or input recovery from gradient) is an emerging threat to the security and privacy preservation of Federated learning, whereby malicious eavesdroppers or participants in the protocol can recover (partially) the clients' private data. This paper evaluates existing attacks and defenses. We find that some attacks make strong...

Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity Bias

Preprint

Oct 2021

The generalization mystery of overparametrized deep nets has motivated efforts to understand how gradient descent (GD) converges to low-loss solutions that generalize well. Real-life neural networks are initialized from small random values and trained with cross-entropy loss for classification (unlike the "lazy" or "NTK" regime of training where an...

What Happens after SGD Reaches Zero Loss? --A Mathematical Framework

Preprint

Full-text available

Oct 2021

Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function $L$ can form a manifold. Intuitively, with a sufficiently small learning rate $\eta$, SGD tracks Gradient Descent (GD) until it gets close to such...

Figure 7. Validation accuracy for PreResNet32 with BatchNorm (left),...

Figure 14. Further verification for our theory on predicting the...

On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs)

Preprint

Full-text available

Feb 2021

It is generally recognized that finite learning rate (LR), in contrast to infinitesimal LR, is important for good generalization in real-life deep nets. Most attempted explanations propose approximating finite-LR SGD with Ito Stochastic Differential Equations (SDEs). But formal justification for this approximation (e.g., (Li et al., 2019a)) only ap...

Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets?

Preprint

Oct 2020

Convolutional neural networks often dominate fully-connected counterparts in generalization performance, especially on image classification tasks. This is often explained in terms of 'better inductive bias'. However, this has not been made mathematically rigorous, and the hurdle is that the fully connected net can always simulate the convolutional...

Performance on the GLUE tasks for both baseline (standard finetuning)...

Success rate of 50 independent gradients matching attacks. Baseline is...

TextHide: Tackling Data Privacy in Language Understanding Tasks

Preprint

Full-text available

Oct 2020

An unsolved challenge in distributed or federated learning is to effectively mitigate privacy risks without slowing down training or reducing accuracy. In this paper, we propose TextHide aiming at addressing this challenge for natural language understanding tasks. It requires all participants to add a simple encryption step to prevent an eavesdropp...

A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks

Preprint

Oct 2020

Autoregressive language models pretrained on large corpora have been successful at solving downstream tasks, even with zero-shot usage. However, there is little theoretical justification for their success. This paper considers the following questions: (1) Why should learning the distribution of natural language help with downstream classification t...

Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate

Preprint

Oct 2020

Recent works (e.g., (Li and Arora, 2020)) suggest that the use of popular normalization schemes (including Batch Normalization) in today's deep learning can move it far from a traditional optimization viewpoint, e.g., use of exponentially increasing learning rates. The current paper highlights other ways in which behavior of normalized nets departs...

A Sample Complexity Separation between Non-Convex and Convex Meta-Learning

Preprint

Feb 2020

One popular trend in meta-learning is to learn from many training tasks a common initialization for a gradient-based method that can be used to solve a new task with few samples. The theory of meta-learning is still in its early stages, with several recent learning-theoretic analyses of methods such as Reptile [Nichol et al., 2018] being for convex...

Provable Representation Learning for Imitation Learning via Bi-level Optimization

Preprint

Feb 2020

A common strategy in modern learning systems is to learn a representation that is useful for many tasks, a.k.a. representation learning. We study this strategy in the imitation learning setting for Markov decision processes (MDPs) where multiple experts' trajectories are available. We formulate representation learning as a bi-level optimization pro...

Over-parameterized Adversarial Training: An Analysis Overcoming the Curse of Dimensionality

Preprint

Feb 2020

Adversarial training is a popular method to give neural nets robustness against adversarial perturbations. In practice adversarial training leads to low robust training loss. However, a rigorous explanation for why this happens under natural conditions is still missing. Recently a convergence theory for standard (non-adversarial) supervised trainin...

TextHide: Tackling Data Privacy in Language Understanding Tasks

Conference Paper

Jan 2020

Enhanced Convolutional Neural Tangent Kernels

Preprint

Nov 2019

Recent research shows that for training with $\ell_2$ loss, convolutional neural networks (CNNs) whose width (number of channels in convolutional layers) goes to infinity correspond to regression with respect to the CNN Gaussian Process kernel (CNN-GP) if only the last layer is trained, and correspond to regression with respect to the Convolutional...

An Exponential Learning Rate Schedule for Deep Learning

Preprint

Oct 2019

Intriguing empirical evidence exists that deep learning can work well with exoticschedules for varying the learning rate. This paper suggests that the phenomenonmay be due to Batch Normalization or BN(Ioffe & Szegedy, 2015), which is ubiquitous and provides benefits in optimization and generalization across all standardarchitectures. The following...

Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks

Preprint

Oct 2019

Recent research shows that the following two models are equivalent: (a) infinitely wide neural networks (NNs) trained under l2 loss by gradient descent with infinitesimally small learning rate (b) kernel regression with respect to so-called Neural Tangent Kernels (NTKs) (Jacot et al., 2018). An efficient algorithm to compute the NTK, as well as its...

Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets

Preprint

Jun 2019

Mode connectivity is a surprising phenomenon in the loss landscape of deep nets. Optima---at least those discovered by gradient-based optimization---turn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piece-wise linear, with as few as two segments. We give mathematical expla...

Implicit Regularization in Deep Matrix Factorization

Preprint

May 2019

Efforts to understand the generalization mystery in deep learning have led to the belief that gradient-based optimization induces a form of implicit regularization, a bias towards models of low "complexity." We study the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing --- a model referr...

A Simple Saliency Method That Passes the Sanity Checks

Preprint

May 2019

There is great interest in *saliency methods* (also called *attribution methods*), which give "explanations" for a deep net's decision, by assigning a *score* to each feature/pixel in the input. Their design usually involves credit-assignment via the gradient of the output with respect to input. Recently Adebayo et al. [arXiv:1810.03292] questioned...

On Exact Computation with an Infinitely Wide Neural Net

Preprint

Apr 2019

How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its "width" --- namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers --- is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoreti...

A Theoretical Analysis of Contrastive Unsupervised Representation Learning

Preprint

Feb 2019

Recent empirical works have successfully used unlabeled data to learn feature representations that are broadly useful in downstream classification tasks. Several of these methods are reminiscent of the well-known word2vec embedding algorithm: leveraging availability of pairs of semantically "similar" data points and "negative samples," the learner...

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

Preprint

Jan 2019

Recent works have cast some light on the mystery of why deep nets fit any data and generalize despite being very overparametrized. This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: (i) Using a tighter characterization of training speed...

Theoretical Analysis of Auto Rate-Tuning by Batch Normalization

Preprint

Dec 2018

Batch Normalization (BN) has become a cornerstone of deep learning across diverse architectures, appearing to help optimization as well as generalization. While the idea makes intuitive sense, theoretical analysis of its effectiveness has been lacking. Here theoretical support is provided for one of its conjectured properties, namely, the ability t...

A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks

Preprint

Oct 2018

We analyze speed of convergence to global optimum for gradient descent training a deep linear neural network (parameterized as $x\mapsto W_N \cdots W_1x$) by minimizing the $\ell_2$ loss over whitened data. Convergence at a linear rate is guaranteed when the following hold: (i) dimensions of hidden layers are at least the minimum of the input and o...

A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors

Preprint

May 2018

Motivations like domain adaptation, transfer learning, and feature learning have fueled interest in inducing embeddings for rare or unseen words, n-grams, synsets, and other textual features. This paper introduces a la carte embedding, a simple and general alternative to the usual word2vec-based approaches for building such representations that is...

Learning topic models --- provably and efficiently

Article

Mar 2018

An Analysis of the t-SNE Algorithm for Data Visualization

Article

Mar 2018

A first line of attack in exploratory data analysis is data visualization, i.e., generating a 2-dimensional representation of data that makes clusters of similar points visually identifiable. Standard Johnson-Lindenstrauss dimensionality reduction does not produce data visualizations. The t-SNE heuristic of van der Maaten and Hinton, which is based...

On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization

Article

Full-text available

Feb 2018

Conventional wisdom in deep learning states that increasing depth improves expressiveness but complicates optimization. This paper suggests that, sometimes, increasing depth can speed up optimization. The effect of depth on optimization is decoupled from expressiveness by focusing on settings where additional layers amount to overparameterization -...

Stronger generalization bounds for deep nets via a compression approach

Article

Feb 2018

Deep nets generalize well despite having more parameters than the number of training samples. Recent works try to give an explanation using PAC-Bayes and Margin-based analyses, but do not as yet result in sample complexity bounds better than naive parameter counting. The current paper shows generalization bounds that're orders of magnitude better i...

A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors

Conference Paper

Jan 2018

Theoretical limitations of Encoder-Decoder GAN architectures

Article

Nov 2017

Encoder-decoder GANs architectures (e.g., BiGAN and ALI) seek to add an inference mechanism to the GANs setup, consisting of a small encoder deep net that maps data-points to their succinct encodings. The intuition is that being forced to train an encoder alongside the usual generator forces the system to learn meaningful mappings from the code to...

Provable learning of noisy-OR networks

Conference Paper

Jun 2017

Many machine learning applications use latent variable models to explain structure in data, whereby visible variables (= coordinates of the given datapoint) are explained as a probabilistic function of some hidden variables. Learning the model ---that is, the mapping from hidden variables to visible ones and vice versa---is NP-hard even in very sim...

Provable benefits of representation learning

Article

Jun 2017

There is general consensus that learning representations is useful for a variety of reasons, e.g. efficient use of labeled data (semi-supervised learning), transfer learning and understanding hidden structure of data. Popular techniques for representation learning include clustering, manifold learning, kernel-learning, autoencoders, Boltzmann machi...

Figure 1: Summary of Experimental Setup: We learn a shared response for...

Figure 3: Visualization of Semantic Annotation Vector Weightings: We...

Figure 4: Visualizing Concatenation: We visualize what the single...

Figure 8: Varying Previous Timesteps: For the DMN-A region, choosing...

Mapping Between fMRI Responses to Movies and their Natural Language Annotations

Article

Full-text available

Jun 2017

Several research groups have shown how to map fMRI responses to the meanings of presented stimuli. This paper presents new methods for doing so when only a natural language annotation is available as the description of the stimulus. We study fMRI data gathered from subjects watching an episode of BBCs Sherlock (Chen et al., 2017), and learn bidirec...

Extending and Improving Wordnet via Unsupervised Word Embeddings

Article

Apr 2017

This work presents an unsupervised approach for improving WordNet that builds upon recent advances in document and sense representation via distributional semantics. We apply our methods to construct Wordnets in French and Russian, languages which both lack good manual constructions.1 These are evaluated on two new 600-word test sets for word-to-sy...

Generalization and Equilibrium in Generative Adversarial Nets (GANs)

Article

Mar 2017

This paper makes progress on several open theoretical issues related to Generative Adversarial Networks. A definition is provided for what it means for the training to generalize, and it is shown that generalization is not guaranteed for the popular distances between distributions such as Jensen-Shannon or Wasserstein. We introduce a new metric cal...

On the ability of neural nets to express distributions

Article

Feb 2017

Deep neural nets have caused a revolution in many classification tasks. A related ongoing revolution---also theoretically not understood---concerns their ability to serve as generative models for complicated types of data such as images and texts. These models are trained using ideas like variational autoencoders and Generative Adversarial Networks...

Automated WordNet Construction Using Word Embeddings

Conference Paper

Jan 2017

Provable learning of Noisy-or Networks

Article

Dec 2016

Many machine learning applications use latent variable models to explain structure in data, whereby visible variables (= coordinates of the given datapoint) are explained as a probabilistic function of some hidden variables. Finding parameters with the maximum likelihood is NP-hard even in very simple settings. In recent years, provably efficient a...

A Latent Variable Model Approach to PMI-based Word Embeddings

Article

Dec 2016

Semantic word embeddings represent the meaning of a word via a vector, and are created by diverse methods. Many use nonlinear operations on co-occurrence statistics, and have hand-tuned hyperparameters and reweighting methods. This paper proposes a new generative model, a dynamic version of the log-linear topic model of Mnih and Hinton (2007). The...

Mapping Between Natural Movie fMRI Responses and Word-Sequence Representations

Article

Full-text available

Oct 2016

This work provides support for the notion that distributional methods of representing word meaning from computational linguistics are useful for capturing neural correlates of real life multi-sensory stimuli, where the stimuli ---in this case, a movie being watched by the human subjects--- have been given text annotations. We present an approach to...

Provable Algorithms for Inference in Topic Models

Article

May 2016

Recently, there has been considerable progress on designing algorithms with provable guarantees -- typically using linear algebraic methods -- for parameter learning in latent variable models. But designing provable algorithms for inference has proven to be more challenging. Here we take a first step towards provable inference in topic models. We l...

A Combinatorial, Primal-Dual Approach to Semidefinite Programs

Article

May 2016

Semidefinite programs (SDPs) have been used in many recent approximation algorithms. We develop a general primal-dual approach to solve SDPs using a generalization of the well-known multiplicative weights update rule to symmetric matrices. For a number of problems, such as SPARSEST CUT and BALANCED SEPARATOR in undirected and directed weighted grap...

Linear Algebraic Structure of Word Senses, with Applications to Polysemy

Article

Full-text available

Jan 2016

Word embeddings are ubiquitous in NLP and information retrieval, but it's unclear what they represent when the word is polysemous, i.e., has multiple senses. Here it is shown that multiple word senses reside in linear superposition within the word embedding and can be recovered by simple sparse coding. The success of the method ---which applies to...

Computing a Nonnegative Matrix Factorization---Provably

Article

Jan 2016

In the nonnegative matrix factorization (NMF) problem we are given an $n \times m$ nonnegative matrix $M$ and an integer $r > 0$. Our goal is to express $M$ as $A W$, where $A$ and $W$ are nonnegative matrices of size $n \times r$ and $r \times m$, respectively. In some applications, it makes sense to ask instead for the product $AW$ to approximate...

Why are deep nets reversible: A simple theory, with implications for training

Article

Full-text available

Nov 2015

Generative model approaches to deep learning are of interest in the quest for both better understanding as well as training methods requiring fewer labeled samples. Recent works use generative model approaches to produce the deep net's input given the value of a hidden layer several levels above. However, there is no accompanying "proof of correctn...

Simple, Efficient, and Neural Algorithms for Sparse Coding

Article

Mar 2015

Sparse coding is a basic task in many fields including signal processing, neuroscience and machine learning where the goal is to learn a basis that enables a sparse representation of a given set of data, if one exists. Its standard formulation is as a non-convex optimization problem which is solved in practice by heuristics based on alternating min...

Random Walks on Context Spaces: Towards an Explanation of the Mysteries of Semantic Word Embeddings

Article

Full-text available

Feb 2015

The papers of Mikolov et al. 2013 as well as subsequent works have led to dramatic progress in solving word analogy tasks using semantic word embeddings. This leverages linear structure that is often found in the word embeddings, which is surprising since the training method is usually nonlinear. There were attempts ---notably by Levy and Goldberg...

Provable Bounds for Learning Some Deep Representations

Conference Paper

Oct 2014

We give algorithms with provable guarantees that learn a class of deep nets in the generative model view popularized by Hinton and others. Our generative model is an $n$ node multilayer neural net that has degree at most $n^backslashgamma$ for some $backslashgamma textless1$ and each edge has a random edge weight in $[-1,1]$. Our algorithm learns $...

More Algorithms for Provable Dictionary Learning

Article

Jan 2014

In dictionary learning, also known as sparse coding, the algorithm is given samples of the form $y = Ax$ where $x\in \mathbb{R}^m$ is an unknown random sparse vector and $A$ is an unknown dictionary matrix in $\mathbb{R}^{n\times m}$ (usually $m > n$, which is the overcomplete case). The goal is to learn $A$ and $x$. This problem has been studied i...

Provable Bounds for Learning Some Deep Representations

Article

Oct 2013

We give algorithms with provable guarantees that learn a class of deep nets in the generative model view popularized by Hinton and others. Our generative model is an $n$ node multilayer neural net that has degree at most $n^{\gamma}$ for some $\gamma <1$ and each edge has a random edge weight in $[-1,1]$. Our algorithm learns {\em almost all} netwo...

New Algorithms for Learning Incoherent and Overcomplete Dictionaries

Article

Aug 2013

A matrix $A \in \R^{n \times m}$ is said to be $\mu$-incoherent if each pair of columns has inner product at most $\mu / \sqrt{n}$. Starting with the pioneering work of Donoho and Huo such matrices (often called {\em dictionaries}) have played a central role in signal processing, statistics and machine learning. They allow {\em sparse recovery}: th...

Towards a Better Approximation for Sparsest Cut?

Article

Apr 2013

We give a new $(1+\epsilon)$-approximation for sparsest cut problem on graphs where small sets expand significantly more than the sparsest cut (sets of size $n/r$ expand by a factor $\sqrt{\log n\log r}$ bigger, for some small $r$; this condition holds for many natural graph families). We give two different algorithms. One involves Guruswami-Sinop...

A Practical Algorithm for Topic Modeling with Provable Guarantees

Article

Dec 2012

Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model inference have been based on a maximum likelihood objective. Efficient algorithms exist that approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced...

Testing Permanent Oracles -- Revisited

Article

Full-text available

Jul 2012

Suppose we are given an oracle that claims to approximate the permanent for most matrices X, where X is chosen from the Gaussian ensemble (the matrix entries are i.i.d. univariate complex Gaussians). Can we test that the oracle satisfies this claim? This paper gives a polynomial-time algorithm for the task. The oracle-testing problem is of interest...

Provable ICA with Unknown Gaussian Noise, and Implications for Gaussian Mixtures and Autoencoders

Article

Jun 2012

We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. In particular, suppose we are given samples of the form $y = Ax + \eta$ where $A$ is an unknown $n \times n$ matrix and $x$ is a random variable whose components are independent and have a fourth moment strictly less than that of a standar...

Learning topic models - Going beyond SVD

Article

Apr 2012

Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings, and perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works both in machine learning and in theory have suggested a probabilistic model for documents, whereby docu...

The Multiplicative Weights Update Method: a Meta Algorithm and Applications

Article

Full-text available

Jan 2012

Algorithms in varied fields use the idea of maintaining a distribution over a certain set and use the multiplicative update rule to iteratively change these weights. Their analyses are usually very similar and rely on an exponential potential function. In this survey we present a simple meta-algorithm that unifies many of these disparate algorithm...

Message-Passing Algorithms and Improved LP Decoding

Article

Jan 2012

Linear programming (LP) decoding for low-density parity-check codes (and related domains such as compressed sensing) has received increased attention over recent years because of its practical performancecoming close to that of iterative decoding algorithmsand its amenability to finite-blocklength analysis. Several works starting with the work of F...

Finding Overlapping Communities in Social Networks: Toward a Rigorous Approach

Article

Dec 2011

A community in a social network is usually understood to be a group of nodes more densely connected with each other than with the rest of the network. This is an important concept in most domains where networks arise: social, technological, biological, etc. For many years algorithms for finding communities implicitly assumed communities are nonover...

Computing a Nonnegative Matrix Factorization -- Provably

Article

Full-text available

Nov 2011

The Nonnegative Matrix Factorization (NMF) problem has a rich history spanning quantum mechanics, probability theory, data analysis, polyhedral combinatorics, communication complexity, demography, chemometrics, etc. In the past decade NMF has become enormously popular in machine learning, where the factorization is computed using a variety of local...

New Tools for Graph Coloring

Chapter

Aug 2011

How to color 3 colorable graphs with few colors is a problem of longstanding interest. The best polynomial-time algorithm uses n 0.2072 colors. There are no indications that coloring using say O(logn) colors is hard. It has been suggested that SDP hierarchies could be used to design algorithms that use n ε colors for arbitrarily small ε > 0. We e...

New Algorithms for Learning in Presence of Errors

Conference Paper

Jul 2011

We give new algorithms for a variety of randomly-generated instances of computational problems using a linearization technique that reduces to solving a system of linear equations. These algorithms are derived in the context of learning with structured noise, a notion introduced in this paper. This notion is best illustrated with the learning pari...

Computational Complexity and Information Asymmetry in Financial Products

Article

May 2011

This paper introduces notions from computational complexity into the study of financial derivatives. Tradi- tional economics argues that derivatives, like CDOs and CDSs, ameliorate the negative costs imposed due to asymmetric information between buyers and sellers. This is because securitization via these derivatives allows the informed party to fi...

A Reformulation of the Arora-Rao-Vazirani Structure Theorem

Article

Feb 2011

In a well-known paper[ARV], Arora, Rao and Vazirani obtained an O(sqrt(log n)) approximation to the Balanced Separator problem and Uniform Sparsest Cut. At the heart of their result is a geometric statement about sets of points that satisfy triangle inequalities, which also underlies subsequent work on approximation algorithms and geometric embeddi...

Subexponential Algorithms for Unique Games and Related Problems

Conference Paper

Oct 2010

Subexponential time approximation algorithms are presented for the Unique Games and Small-Set Expansion problems. Specifically, for some absolute constant c, the following two algorithms are presented. (1) An exp(kn&epsi;)-time algorithm that, given as input a k-alphabet unique game on n variables that has an assignment satisfying 1-&epsi;c fractio...

Semidefinite Programming and Approximation Algorithms: A Survey

Conference Paper

Jun 2010

Sanjeev Arora

Computing approximate solutions for NP-hard problems is an important research endeavor. Since the work of Goemans-Williamson in 1993, semidefinite programming (a form of convex programming in which the variables are vector inner products) has been used to design the current best approximation algorithms for problems such as MAX-CUT, MAX-3SAT, SPARS...

Improved Algorithms for Unique Games via Divide and Conquer.

Article

Full-text available

Jan 2010

Learning Parities with Structured Noise.

Article

Jan 2010

O(sqrt(log(n)) Approximation to SPARSEST CUT in Õ(n

Article

Jan 2010

Computational Complexity and Information Asymmetry in Financial Products (Extended Abstract).

Conference Paper

Jan 2010

O(logn) approximation to sparsest cut in O ˜(n 2 ) time

Article

Jan 2010

This paper shows how to compute O(logn)-approximations to the sparsest cut and balanced separator problems in O ˜(n 2 ) time, thus improving upon the recent algorithm of S. Arora, S. Rao and U. Vazirani [Proceedings of the 36th annual ACM symposium on theory of computing (STOC 2004), 222–231 (2004; Zbl 1192.68467)]. Their algorithm uses semidefinit...

Towards a Study of Low-Complexity Graphs

Conference Paper

Jul 2009

We propose the study of graphs that are defined by low-complexity distributed and deterministic agents. We suggest that this viewpoint may help introduce the element of individual choice in models of large scale social networks. This viewpoint may also provide interesting new classes of graphs for which to design algorithms. We focus largely on th...

Message-Passing Algorithms and Improved LP Decoding

Conference Paper

Full-text available

May 2009

Linear programming decoding for low-density parity check codes (and related domains such as compressed sensing) has received increased attention over recent years because of its practical performance |coming close to that of iterative de- coding algorithms| and its amenability to nite-blocklength analysis. Several works starting with the work of Fe...

Computational Complexity: A Modern Approach

Book

Jan 2009

This beginning graduate textbook describes both recent achievements and classical results of computational complexity theory. Requiring essentially no background apart from mathematical maturity, the book can be used as a reference for self-study for anyone interested in complexity, including physicists, mathematicians, and other scientists, as wel...

Geometry, Flows, and Graph-Partitioning Algorithms

Article

Oct 2008

Graph partitioning is a computational problems to divide the vertices of a graph into two large pieces with minimum number of the edges. The application of partitioning can be used for computer vision, data analysis, image segmentation, and image analysis. The geometric approach of partitioning start with drawing the graph in a geometric space by k...

Unique games on expanding constraint graphs are easy: extended abstract

Conference Paper

May 2008

We present an efficient algorithm to find a good solution to the Unique Games problem when the constraint graph is an expander. We introduce a new analysis of the standard SDP in this case that involves correlations among distant vertices. It also leads to a parallel repetition theorem for unique games when the graph is an expander.

Unique Games on expanding constraint graphs are easy

Article

Jan 2008

Fréchet Embeddings of Negative Type Metrics

Article

Dec 2007

We show that every n-point metric of negative type (in particular, every n-point subset of L 1) admits a Fréchet embedding into Euclidean space with distortion $O(\sqrt{\log n}\cdot \log \log n)$ , a result which is tight up to the O(log log n) factor, even for Euclidean metrics. This strengthens our recent work on the Euclidean distortion of met...

A combinatorial, primal-dual approach to semidefinite programs

Conference Paper

Jun 2007

Semidefinite programs (SDP) have been used in many recent approximation algorithms. We develop a general primal-dual approach to solve SDPs using a generalization of the well-known multiplicative weights update rule to symmetric matrices. For a number of problems, such as Sparsest Cut and Balanced Separator in undirected and directed weighted graph...

A 2 + ɛ approximation algorithm for the k MST problem

Article

Full-text available

Jul 2006

For any ε>0 we give a (2+ε)-approximation algorithm for the problem of finding a minimum tree spanning any k vertices in a graph (k-MST), improving a 3-approximation algorithm by N. Garg [A 3-approximation for the minimum tree spanning k vertices. in: Proceedings of the 37 IEEE Symp. on Foundations of Computer Science (FOCS), 302–309 (1996)]. As in...

New approximation guarantee for chromatic number

Conference Paper

May 2006

We describe how to color every 3-colorable graph with O(n0.2111) colors, thus improving an algorithm of Blum and Karger from almost a decade ago. Our analysis uses new geometric ideas inspired by the recent work of Arora, Rao, and Vazirani on SPARSEST CUT, and these ideas show promise of leading to further improvements.

Approximation Algorithms for Geometric TSP

Chapter

May 2006

Sanjeev Arora

In the Euclidean traveling salesman problem, we are given n nodes in ℝ2 (more generally, in ℝd and desire the minimum cost salesman tour for these nodes, where the cost of the edge between nodes (x1,y1) and (x2,y2) is $ \sqrt {(x_1 - x_2 )^2 + (y_1 - y_2 )^2 } $ The decision version of the problem (“Does a tour of cost ≤ C exist?”) is NP-hard [65...

Local Versus Global Properties of Metric Spaces

Conference Paper

Full-text available

Jan 2006

Motivated by applications in combinatorial optimization, we initiate a study of the extent to which the global properties of a metric space (especially, embeddability in l(1) with low distortion) are determined by the properties of small subspaces. We note connections to similar issues studied already in Ramsey theory, complexity theory (especially...

Proving Integrality Gaps Without Knowing the Linear Program

Article

Full-text available

Jan 2006

Proving integrality gaps for linear relaxations of NP optimization problems is a difficult task and usually undertaken on a case-by-case basis. We initiate a more systematic approach. We prove an integrality gap of $2 -o(1)$ for three families of linear relaxations for VERTEX COVER, and our methods seem relevant to other problems as well.