Publications: 67
Reads: 18,740
Citations: 957
Publications (67)
In these notes we provide additional reference material demonstrating the impact of our work during the 2013 Philippine General Elections. Our work, as described in the main paper, was used by the ABS-CBN News corporation, the largest media organization in the Philippines, during the 2013 general elections. Using our system the media house was able to...
Methods, computer program products and systems are provided for mining for sub-patterns within a text data set. The embodiments facilitate finding a set of N frequently occurring sub-patterns within the data set, extracting the N sub-patterns from the data set, and clustering the extracted sub-patterns into K groups, where each extracted sub-patter...
Described herein are methods, systems, apparatuses and products for efficient development of a rule-based system. An aspect provides a method including accessing data records; converting said data records to an intermediate form; utilizing intermediate forms to compute similarity scores for said data records; and selecting as an example to be provi...
Systems, methods, and computer products for optimally managing large rule sets are disclosed. Rule dependencies of rules within a set of rules may be determined as a function of rules execution frequency data generated from applying the rules over a data set. The rules within the set of rules may be clustered into rules clusters based on the determ...
A clustering-based approach to data standardization is provided. Certain embodiments take as input a plurality of addresses, identify one or more features of the addresses, cluster the addresses based on the one or more features, utilize the cluster(s) to provide a data-based context useful in identifying one or more synonyms for elements contained...
Described herein are methods, systems, apparatuses and products for efficient development of a rule-based system. An aspect provides a method including accessing data records; converting said data records to an intermediate form; utilizing intermediate forms to compute similarity scores for said data records; and selecting as an example to be provi...
A method of blocking column selection can include determining a first parameter for each column set of a plurality of column sets, wherein the first parameter indicates distribution of blocks in the column set, and determining a second parameter for each column set. The second parameter can indicate block size for the column set. For each column se...
Due to the advent of technology and the internet over the past few years, a significant number of customers have started shopping online and accessing their bank accounts through various channels such as net banking and mobile banking. In this paper, we describe the Edge Analytics framework, which delivers analytics as a service that can be hosted by a financial i...
The last few years have seen an exponential increase in the amount of social media data generated daily. Thus, researchers have started exploring the use of social media data in building recommendation systems and prediction models, improving disaster management, discovering trending topics, etc. An interesting application of social media is for the predi...
Data quality is a perennial problem for many enterprise data assets. To improve data quality, businesses often employ rule based data standardization systems in which domain experts code rules for handling important and prevalent patterns. Finding these patterns is laborious and time consuming, particularly for noisy or highly specialized data sets...
In this paper we investigate the problem of processing multi-way spatial joins on a map-reduce platform. We look at two common spatial predicates: overlap and range. We address these two classes of join queries, discuss the challenges and outline novel approaches for executing these queries on a map-reduce framework. We then discuss how we can proce...
Techniques for identifying one or more inconsistencies between an unstructured document and a back-end fact-base are provided. The techniques include automatically parsing a query document and comparing the document with a back-end fact-base comprising facts relevant to the document, identifying one or more inconsistencies between information menti...
To date, ILP models in drug design have largely focussed on models in first-order logic that relate two- or three-dimensional molecular structure of a potential drug (a ligand) to its activity (for example, inhibition of some protein). In modelling terms: (a) the models have largely been logic-based (although there have been some attempts at probab...
The popularity of social media as a medium for sharing information has made extracting information of interest a challenge. In this work we provide a system that can return posts published on social media covering various aspects of a concept being searched. We present a faceted model for navigating social media that provides a consistent, usable an...
This paper addresses the problem of discovering activities and their temporal significance in surveillance videos in an unsupervised manner. We propose a generative model that can jointly capture the activities and their behaviour over time. We use a multinomial distribution over local motion features to model activities and a mixture distribution ov...
In this paper, we address the problem of discovering activities and their temporal significance in an area under surveillance. Discovering activities along with their expectation of occurrence at a particular time plays an important role in many surveillance applications. We propose an unsupervised model, called Time pLSA model, that extends the prob...
Enterprise datasets are often noisy. Several columns can have non-standard, erroneous or missing information. Poor quality data can lead to incorrect reporting and wrong conclusions being drawn. Data cleansing involves standardizing such data to improve its quality. Often data cleansing tasks involve writing rules manually. The step involves unders...
Record Linkage is an essential but expensive step in enterprise data management. In most deployments, blocking techniques are employed which can reduce the number of record pair comparisons and hence, the computational complexity of the task. Blocking algorithms require a careful selection of column(s) to be used for blocking. Selection of appropri...
The threats of the 21st century are too complex, difficult and time consuming to discern with traditional intelligence practices that shun advances in information technology and rely heavily on human experts. Good information is fundamental to understanding and responding to 21st century national security threats. Without comprehensive information, decis...
Nearly two decades of research in the area of Inductive Logic Programming (ILP) have seen steady progress in clarifying its theoretical foundations and regular demonstrations of its applicability to complex problems in very diverse domains. These results are necessary, but not sufficient, for ILP to be adopted as a tool for data analysis in an era...
In recent years, social networking sites have not only enabled people to connect with each other using social links but have also allowed them to share, communicate and interact over diverse geographical regions. Social networks provide a rich source of heterogeneous data which can be exploited to discover previously unknown relationships and intere...
Several citizen service databases, such as police, national citizen identity, passport and vehicle registration, store both biographical and biometric information containing a huge number of records. Achieving scalability and high accuracy for a 1:N person identification task on these databases is a huge challenge. In this work, we propose to use com...
Surveillance videos of public places often consist of group activities composed from multiple co-occurring individual activities. However, latent topic models, such as Latent Dirichlet Allocation (LDA), which have been successfully used to discover individual activities, do not discover group activities. In this paper we propose a method to discove...
Motivated by the task of vernacular news analysis using known news topics from national newspapers, we study the task of topic analysis, where given source datasets with observed topics, data items from a target dataset need to be assigned either to observed source topics or to new ones. Using Hierarchical Dirichlet Processes for addressing this t...
Social graphs have received renewed interest as a research topic with the advent of social networking websites. These online networks provide a rich source of data to study user relationships and interaction patterns on a large scale. In this paper, we propose a generative Bayesian model for extracting latent communities from a social graph. We ass...
Businesses require the contact center agents to meet pre-specified customer satisfaction levels while keeping the cost of operations low or meeting sales targets, objectives that end up being complementary and difficult to achieve in real-time. In this paper, we describe a speech enabled real-time conversation management system that tracks customer...
With the increase in the number of people communicating through the internet, there has been a steady increase in the amount of text available online. Most such text is different from the standard language, as people try to use various kinds of short forms for words to save time and effort. We call that noisy text. Part-Of-Speech (POS) tagging has...
Businesses are increasingly realizing the value of creating a single view of their customers and partners by integrating information residing in 'siloed' datasets within and outside the enterprise. However, the task of augmenting data available within the enterprise with data purchased from third-party providers or that residing in a public...
Data quality improvement is an important aspect of enterprise data management. Data characteristics can change with customers, with domain and geography making data quality improvement a challenging task. Data quality improvement is often an iterative process which mainly involves writing a set of data quality rules for standardization and eliminat...
Enterprises today accumulate huge quantities of data which is often noisy and unstructured in nature, making data cleansing an important task. Data cleansing refers to standardizing data from different sources to a common format so that data can be better utilized. Most of the enterprise data cleansing models are rule based, involving a lot of manual e...
Address Cleansing is very challenging, particularly for geographies with variability in writing addresses. Supervised learners can be easily trained for different data sources. However, training requires labeling large corpora for each data source which is time consuming and labor intensive to create. We propose a method to automatically transfer s...
Data processing on the cloud is increasingly used for offering cost effective services. In this paper, we present a method for resource allocation for data processing services over the cloud taking into account not just the processing power and memory requirements, but the network speed, reliability and data throughput. We also present algorithms f...
Poor Data Quality is a serious problem affecting enterprises. Enterprise databases are large and manual data cleansing is not feasible. For such large databases it is logical to attempt to cleanse the data in an automated way. This has led to the development of commercial tools for automatic cleansing. However, offering data cleansing as a service...
We propose a novel method to automatically acquire a term-frequency-based taxonomy from a corpus using an unsupervised method. A term-frequency-based taxonomy is useful for application domains where the frequency with which terms occur on their own and in combination with other terms imposes a natural term hierarchy. We highlight an application for...
There is often a transient need within enterprises for data cleansing which can be satisfied by offering data cleansing as a transient service. Every time a data cleansing need arises it should be possible to provision hardware, software and staff for accomplishing the task and then dismantle the setup. In this paper we present such a system tha...
Recent times have seen a tremendous growth in mobile based data services that allow people to use Short Message Service (SMS) to access these data services. In a multilingual society it is essential that data services that were developed for a specific language be made accessible through other local languages also. In this paper, we present a...
In this paper we look at the problem of cleansing noisy text using a statistical machine translation model. Noisy text is produced in informal communications such as Short Message Service (SMS), Twitter and chat. A typical Statistical Machine Translation system is trained on parallel text comprising noisy and clean sentences. In this paper we propo...
Rule based systems for processing text data encode the knowledge of a human expert into a rule base to take decisions based on interactions of the input data and the rule base. Similarly, supervised learning based systems can learn patterns present in a given dataset to make decisions on similar and other related data. Performances of both these cl...
Topic models such as probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) have been successfully used to discover individual activities in a scene. However these methods do not discover group activities which are commonly observed in real life videos of public places. In this paper we address the problem of discoverin...
Protecting sensitive information while preserving the shareability and usability of data is becoming increasingly important in the outsourced business process industry. Particularly in the context of call-centers a lot of customer related sensitive information is stored in audio recordings. In this work, we address the problem of protecting sensit...
In this paper, we present a first of a kind system, called business intelligence from voice of customer (BIVoC), that can: 1) combine unstructured information and structured information in an information intensive enterprise and 2) derive richer business insights from the combined data. Unstructured information, in this paper, refers to voice of cu...
Often, in the real world noise is ubiquitous in text communications. Text produced by processing signals intended for human use is often noisy for automated computer processing. Automatic speech recognition, optical character recognition and machine translation all introduce processing noise. Also digital text produced in informal settings such as...
Short Messaging Service (SMS) is popularly used to provide information access to people on the move. This has resulted in the growth of SMS based Question Answering (QA) services. However automatically handling SMS questions poses significant challenges due to the inherent noise in SMS questions. In this work we present an automatic FAQ-based...
In this paper, we address the problem of detecting sensitive events in speech signals, such as the exchange of credit card information. Although close in nature to the word spotting problem, variability in the linguistic content constituting an event and their composition makes event detection a harder task, especially in the context where it is applied...
Protecting sensitive information while preserving the share-ability and usability of data is becoming increasingly important. In call-centers a lot of customer related sensitive information is stored in audio recordings. In this work, we address the problem of protecting sensitive information in audio recordings and speech transcripts. We present a...
Utterance classification is an important task in spoken-dialog systems. The response of the system is dependent on the category assigned to the speaker's utterance by the classifier. However, often the input speech is spontaneous and noisy, which results in high word error rates. This results in unsatisfactory system performance. In this paper we descri...
Voice application development requires specialized speech related skills besides the general programming ability. Encapsulating the speech specific behavior and complexities in prepackaged, configurable User Interface (UI) components will ease and expedite the voice application development. These components can be used across applications and are c...
We describe a new algorithm for the Decoding problem in Statistical Machine Translation. Our algorithm is based on the Alternating Optimization framework and employs dynamic programming. The time complexity of the algorithm is O(m^2), where m is the length of the sentence to be translated, which is the best among all known algorithms for the pro...
This paper describes a morphing-based audio driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and synthesized expressions. A novel scheme to implement a language independent system for audio-driven facial animation given a speech recognition system for just one language, in our...
The decoding problem in Statistical Machine Translation (SMT) is a computationally hard combinatorial optimization problem. In this paper, we propose a new algorithmic framework for solving the decoding problem and demonstrate its utility. In the new algorithmic framework, the decoding problem can be solved both exactly and approximately. The...
Recently statistical methods for natural language translation have become popular and found reasonable success. In this paper we describe an English-Hindi statistical machine translation system. Our machine translation system is based on IBM Models 1, 2, and 3. We present experimental results on an English-Hindi parallel corpus consisting of 150,00...
This paper describes a morphing-based audio driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and synthesized expressions. A novel scheme to implement a language independent system for audio-driven facial animation given a speech recognition system for just one language, in our...
This paper describes a morphing based automated audio driven facial animation system. A novel scheme to implement a language independent system for audio-driven facial animation given a speech recognition system for just one language, in our case, English, is presented. New viseme and expression combinations are synthesized to be able to generate a...
Orthogonal information present in the video signal associated with the audio helps in improving the accuracy of a speech recognition system. Audio-visual speech recognition involves extraction of both the audio as well as visual features from the input signal. Extraction of visual parameters is done by the recognition of speech dependent features f...
Audio-driven facial animation is an interesting and evolving technique for human-computer interaction. Based on an incoming audio stream, a face image is animated with full lip synchronization. This requires a speech recognition system in the language in which audio is provided to get the time alignment for the phonetic sequence of the audio signal...
In this paper, we demonstrate a morphing based automated audio driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and expression. An animation sequence using optical flow between visemes is constructed, given an incoming audio stream and still pictures of a face speaking differe...
Several promising image/video data compression techniques that explicitly exploit self-similarity in images and videos have been proposed in the recent past. While most of these fractal techniques are variants and/or enhancements of Jacquin's Iterated Function Systems (IFS), the Weighted Finite Automata (WFA) techniques (introduced by Culik and Kar...
We present our work on visual pruning in an audio-visual (AV) speech recognition scenario. Visual speech information has been successfully used in circumstances where audio-only recognition suffers (e.g. noisy environments). Tracking and extraction of region-of-interest (ROI) (e.g., speaker's mouth region) from video is an essential component of su...
In this paper, we demonstrate a morphing based automated audio driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and expression. An animation sequence using optical flow between visemes is constructed, given an incoming audio stream and still pictures of a face speaking diff...
Audio-driven facial animation is an interesting and evolving technique for human-computer interaction. Based on an incoming audio stream, a face image is animated with full lip synchronization. This requires a speech recognition system in the language in which audio is provided to get the time alignment for the phonetic sequence of the audio signal...
Orthogonal information present in the video signal associated with the audio helps in improving the accuracy of a speech recognition system. Audio-visual speech recognition involves extraction of both the audio as well as visual features from the input signal. Extraction of visual parameters is done by the recognition of speech dependent features f...
Using visual information in speech recognition has been an area of interest because it can significantly improve the speech recognition efficiency in conditions where audio-only recognition suffers due to a noisy environment. In this paper, we present a new approach to combine audio and video to improve the robustness of the speech recognition sy...
The emergence of very large machine-generated datasets raises a question of some importance for ILP, namely: can an ILP system construct models efficiently using datasets whose sizes are too large to fit in random access memory? In this paper, we examine the applicability to ILP of a popular distributed computing approach that, in principle, allows...
In this paper we address the problem of obtaining structured information about products in the form of attribute-value pairs by leveraging a combination of enterprise internal product descriptions and external data. Product descriptions are short text strings used internally within enterprises to describe a product. These strings usually compris...