Tanveer A. Faruquie

18,740

Reads

957

Citations

Publications

Tracking political elections on social media: Applications and Experience (Supplementary Material)

Technical Report

Full-text available

Apr 2015

In these notes we provide additional reference material demonstrating the impact of our work during the 2013 Filipino General Elections. Our work, as described in the main paper, was used by the ABS-CBN News corporation, the largest media organization in the Philippines, during the 2013 general elections. Using our system, the media house was able to...

Automatically mining patterns for rule based data standardization systems

Patent

Mar 2015

Methods, computer program products and systems are provided for mining for sub-patterns within a text data set. The embodiments facilitate finding a set of N frequently occurring sub-patterns within the data set, extracting the N sub-patterns from the data set, and clustering the extracted sub-patterns into K groups, where each extracted sub-patter...

Efficient development of a rule-based system using crowd-sourcing

Patent

Feb 2015

Described herein are methods, systems, apparatuses and products for efficient development of a rule-based system. An aspect provides a method including accessing data records; converting said data records to an intermediate form; utilizing intermediate forms to compute similarity scores for said data records; and selecting as an example to be provi...

Rule set management

Patent

Full-text available

Apr 2014

Systems, methods, and computer products for optimally managing large rule sets are disclosed. Rule dependencies of rules within a set of rules may be determined as a function of rules execution frequency data generated from applying the rules over a data set. The rules within the set of rules may be clustered into rules clusters based on the determ...

Systems and methods for discovering synonymous elements using context over multiple similar addresses

Patent

Full-text available

Mar 2014

A clustering-based approach to data standardization is provided. Certain embodiments take as input a plurality of addresses, identify one or more features of the addresses, cluster the addresses based on the one or more features, utilize the cluster(s) to provide a data-based context useful in identifying one or more synonyms for elements contained...

Systems and methods for efficient development of a rule-based system using crowd-sourcing

Patent

Jan 2014

Automatic selection of blocking column for de-duplication

Patent

Full-text available

Oct 2013

A method of blocking column selection can include determining a first parameter for each column set of a plurality of column sets, wherein the first parameter indicates distribution of blocks in the column set, and determining a second parameter for each column set. The second parameter can indicate block size for the column set. For each column se...

Edge analytics as service — A service oriented framework for real time and personalised recommendation analytics

Conference Paper

Full-text available

Jul 2013

Due to the advent of technology and the internet over the past few years, a significant number of customers have started shopping online and accessing their bank accounts through various channels such as netbanking and mobile banking. In this paper, we describe the Edge Analytics framework, which delivers analytics as a service that can be hosted by a financial i...

Understanding election candidate approval ratings using social media data

Conference Paper

May 2013

The last few years have seen an exponential increase in the amount of social media data generated daily. Thus, researchers have started exploring the use of social media data in building recommendation systems and prediction models, improving disaster management, discovering trending topics, etc. An interesting application of social media is for the predi...

Automating pattern discovery for rule based data standardization systems

Conference Paper

Full-text available

Apr 2013

Data quality is a perennial problem for many enterprise data assets. To improve data quality, businesses often employ rule based data standardization systems in which domain experts code rules for handling important and prevalent patterns. Finding these patterns is laborious and time consuming, particularly for noisy or highly specialized data sets...

Processing multi-way spatial joins on map-reduce

Conference Paper

Full-text available

Mar 2013

In this paper we investigate the problem of processing multi-way spatial joins on a map-reduce platform. We look at two common spatial predicates: overlap and range. We address these two classes of join queries, discuss the challenges and outline novel approaches for executing these queries on a map-reduce framework. We then discuss how we can proce...

Detecting factual inconsistencies between a document and a fact-base

Patent

Feb 2013

Techniques for identifying one or more inconsistencies between an unstructured document and a back-end fact-base are provided. The techniques include automatically parsing a query document and comparing the document with a back-end fact-base comprising facts relevant to the document, identifying one or more inconsistencies between information menti...

Topic Models with Relational Features for Drug Design

Conference Paper

Jan 2013

To date, ILP models in drug design have largely focussed on models in first-order logic that relate two- or three-dimensional molecular structure of a potential drug (a ligand) to its activity (for example, inhibition of some protein). In modelling terms: (a) the models have largely been logic-based (although there have been some attempts at probab...

Faceted Browsing over Social Media

Conference Paper

Dec 2012

The popularity of social media as a medium for sharing information has made extracting information of interest a challenge. In this work we provide a system that can return posts published on social media covering various aspects of a concept being searched. We present a faceted model for navigating social media that provides a consistent, usable an...

Unsupervised Discovery of Activities and Their Temporal Behaviour

Conference Paper

Sep 2012

This paper addresses the problem of discovering activities and their temporal significance in surveillance videos in an unsupervised manner. We propose a generative model that can jointly capture the activities and their behaviour over time. We use multinomial distribution over local motion features to model activities and a mixture distribution ov...

Discovering Activities and Their Temporal Significance

Conference Paper

Sep 2012

In this paper, we address the problem of discovering activities and their temporal significance in an area under surveillance. Discovering activities along with their expectation of occurrence at a particular time plays an important role in many surveillance applications. We propose an unsupervised model, called Time pLSA model, that extends the prob...

Managing data quality by identifying the noisiest data samples

Conference Paper

Full-text available

Jul 2012

Enterprise datasets are often noisy. Several columns can have non-standard, erroneous or missing information. Poor quality data can lead to incorrect reporting and wrong conclusions being drawn. Data cleansing involves standardizing such data to improve its quality. Often data cleansing tasks involve writing rules manually. The step involves unders...

Automated selection of blocking columns for record linkage

Conference Paper

Full-text available

Jul 2012

Record Linkage is an essential but expensive step in enterprise data management. In most deployments, blocking techniques are employed which can reduce the number of record pair comparisons and hence, the computational complexity of the task. Blocking algorithms require a careful selection of column(s) to be used for blocking. Selection of appropri...

Data consolidation solution for internal security needs

Conference Paper

Full-text available

Jul 2012

The threats of the 21st century are too complex, difficult and time consuming to discern with traditional intelligence practices that shun advances in information technology and rely heavily on human experts. Good information is fundamental to understand and respond to 21st century national security threats. Without comprehensive information, decis...

Data and task parallelism in ILP using MapReduce

Article

May 2012

Nearly two decades of research in the area of Inductive Logic Programming (ILP) have seen steady progress in clarifying its theoretical foundations and regular demonstrations of its applicability to complex problems in very diverse domains. These results are necessary, but not sufficient, for ILP to be adopted as a tool for data analysis in an era...

Using content and interactions for discovering communities in social networks

Conference Paper

Full-text available

Apr 2012

In recent years, social networking sites have not only enabled people to connect with each other using social links but have also allowed them to share, communicate and interact over diverse geographical regions. Social networks provide a rich source of heterogeneous data which can be exploited to discover previously unknown relationships and intere...

Fusing biographical and biometric classifiers for improved person identification

Conference Paper

Full-text available

Jan 2012

Several citizen service databases, such as police, national citizen identity, passport and vehicle registration, store both biographical and biometric information and contain a huge number of records. Achieving scalability and high accuracy for a 1:N person identification task on these databases is a huge challenge. In this work, we propose to use com...

Latent topic model-based group activity discovery

Article

Dec 2011

Surveillance videos of public places often consist of group activities composed from multiple co-occurring individual activities. However, latent topic models, such as Latent Dirichlet Allocation (LDA), which have been successfully used to discover individual activities, do not discover group activities. In this paper we propose a method to discove...

Learning Dirichlet Processes from Partially Observed Groups

Conference Paper

Full-text available

Dec 2011

Motivated by the task of vernacular news analysis using known news topics from national newspapers, we study the task of topic analysis, where given source datasets with observed topics, data items from a target dataset need to be assigned either to observed source topics or to new ones. Using Hierarchical Dirichlet Processes for addressing this t...

Probabilistic model for discovering topic based communities in social networks

Conference Paper

Full-text available

Oct 2011

Social graphs have received renewed interest as a research topic with the advent of social networking websites. These online networks provide a rich source of data to study user relationships and interaction patterns on a large scale. In this paper, we propose a generative Bayesian model for extracting latent communities from a social graph. We ass...

Discovering customer intent in real-time for streamlining service desk conversations

Conference Paper

Full-text available

Oct 2011

Businesses require the contact center agents to meet pre-specified customer satisfaction levels while keeping the cost of operations low or meeting sales targets, objectives that end up being complementary and difficult to achieve in real-time. In this paper, we describe a speech enabled real-time conversation management system that tracks customer...

Adapting a WSJ trained Part-of-Speech tagger to Noisy Text: Preliminary Results

Article

Full-text available

Sep 2011

With the increase in the number of people communicating through the internet, there has been a steady increase in the amount of text available online. Most such text is different from the standard language, as people try to use various kinds of short forms for words to save time and effort. We call that noisy text. Part-Of-Speech (POS) tagging has...

Data Augmentation as a Service for Single View Creation

Conference Paper

Aug 2011

Businesses are increasingly realizing the value of creating a single view of their customers and partners by integrating information residing in 'siloed' datasets within and outside the enterprise. However, the task of augmenting data available within the enterprise with data purchased from third-party providers or that residing in a public...

Data Cleansing Techniques for Large Enterprise Datasets

Conference Paper

Full-text available

May 2011

Data quality improvement is an important aspect of enterprise data management. Data characteristics can change with customers, domain and geography, making data quality improvement a challenging task. Data quality improvement is often an iterative process which mainly involves writing a set of data quality rules for standardization and eliminat...

Optimal Training Data Selection for Rule-Based Data Cleansing Models

Article

Full-text available

Mar 2011

Enterprises today accumulate huge quantities of data, which is often noisy and unstructured in nature, making data cleansing an important task. Data cleansing refers to standardizing data from different sources to a common format so that data can be better utilized. Most of the enterprise data cleansing models are rule based, involving a lot of manual e...

Transfer of Supervision for Improved Address Standardization

Conference Paper

Full-text available

Aug 2010

Address Cleansing is very challenging, particularly for geographies with variability in writing addresses. Supervised learners can be easily trained for different data sources. However, training requires labeling large corpora for each data source which is time consuming and labor intensive to create. We propose a method to automatically transfer s...

Resource Allocation and SLA Determination for Large Data Processing Services over Cloud

Conference Paper

Full-text available

Jul 2010

Data processing on the cloud is increasingly used for offering cost effective services. In this paper, we present a method for resource allocation for data processing services over the cloud taking into account not just the processing power and memory requirements, but the network speed, reliability and data throughput. We also present algorithms f...

A Knowledge Acquisition Method for Improving Data Quality in Services Engagements

Conference Paper

Full-text available

Jul 2010

Poor Data Quality is a serious problem affecting enterprises. Enterprise databases are large and manual data cleansing is not feasible. For such large databases it is logical to attempt to cleanse the data in an automated way. This has led to the development of commercial tools for automatic cleansing. However, offering data cleansing as a service...

Automatically Generating Term Frequency Induced Taxonomies.

Conference Paper

Full-text available

Jan 2010

We propose a novel method to automatically acquire a term-frequency-based taxonomy from a corpus using an unsupervised method. A term-frequency-based taxonomy is useful for application domains where the frequency with which terms occur on their own and in combination with other terms imposes a natural term hierarchy. We highlight an application for...

Data Cleansing as a Transient Service

Conference Paper

Full-text available

Jan 2010

There is often a transient need within enterprises for data cleansing which can be satisfied by offering data cleansing as a transient service. Every time a data cleansing need arises it should be possible to provision hardware, software and staff for accomplishing the task and then dismantle the setup. In this paper we present such a system tha...

Handling Noisy Queries in Cross Language FAQ Retrieval.

Conference Paper

Full-text available

Jan 2010

Recent times have seen a tremendous growth in mobile based data services that allow people to use Short Message Service (SMS) to access these data services. In a multilingual society it is essential that data services that were developed for a specific language be made accessible through other local languages also. In this paper, we present a...

Unsupervised cleansing of noisy text

Conference Paper

Full-text available

Jan 2010

In this paper we look at the problem of cleansing noisy text using a statistical machine translation model. Noisy text is produced in informal communications such as Short Message Service (SMS), Twitter and chat. A typical Statistical Machine Translation system is trained on parallel text comprising noisy and clean sentences. In this paper we propo...

Estimating accuracy for text classification tasks on large unlabeled data

Conference Paper

Full-text available

Oct 2010

Rule based systems for processing text data encode the knowledge of a human expert into a rule base to take decisions based on interactions of the input data and the rule base. Similarly, supervised learning based systems can learn patterns present in a given dataset to make decisions on similar and other related data. Performances of both these cl...

Unsupervised discovery of activity correlations using latent topic models

Article

Dec 2010

Topic models such as probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) have been successfully used to discover individual activities in a scene. However these methods do not discover group activities which are commonly observed in real life videos of public places. In this paper we address the problem of discoverin...

Protecting Sensitive Customer Information in Call Center Recordings

Conference Paper

Full-text available

Oct 2009

Protecting sensitive information while preserving the shareability and usability of data is becoming increasingly important in the outsourced business process industry. Particularly in the context of call-centers, a lot of customer related sensitive information is stored in audio recordings. In this work, we address the problem of protecting sensit...

Business Intelligence from Voice of Customer

Conference Paper

Full-text available

May 2009

In this paper, we present a first of a kind system, called business intelligence from voice of customer (BIVoC), that can: 1) combine unstructured information and structured information in an information intensive enterprise and 2) derive richer business insights from the combined data. Unstructured information, in this paper, refers to voice of cu...

A survey of types of text noise and techniques to handle noisy text

Conference Paper

Full-text available

Jul 2009

In the real world, noise is ubiquitous in text communications. Text produced by processing signals intended for human use is often noisy for automated computer processing. Automatic speech recognition, optical character recognition and machine translation all introduce processing noise. Also digital text produced in informal settings such as...

SMS based interface for FAQ retrieval

Conference Paper

Full-text available

Jan 2009

Short Messaging Service (SMS) is popularly used to provide information access to people on the move. This has resulted in the growth of SMS based Question Answering (QA) services. However, automatically handling SMS questions poses significant challenges due to the inherent noise in SMS questions. In this work we present an automatic FAQ-based...

Time based Activity Inference using Latent Dirichlet Allocation

Conference Paper

Jan 2009

HMM based event detection in audio conversation

Conference Paper

May 2008

In this paper, we address the problem of detecting sensitive events in a speech signal, such as the exchange of credit card information. Although close in nature to the word spotting problem, variability in the linguistic content constituting an event and their composition makes event detection a harder task, especially in the context where it is applied...

Exploiting context to detect sensitive information in call center conversations

Conference Paper

Full-text available

Oct 2008

Protecting sensitive information while preserving the share-ability and usability of data is becoming increasingly important. In call-centers a lot of customer related sensitive information is stored in audio recordings. In this work, we address the problem of protecting sensitive information in audio recordings and speech transcripts. We present a...

Improving Automatic Call Classification using Machine Translation

Conference Paper

May 2007

Utterance classification is an important task in spoken-dialog systems. The response of the system is dependent on the category assigned to the speaker's utterance by the classifier. However, the input speech is often spontaneous and noisy, which results in high word error rates. This results in unsatisfactory system performance. In this paper we descri...

Reusable Dialog Component Framework for Rapid Voice Application Development

Conference Paper

May 2005

Voice application development requires specialized speech related skills besides the general programming ability. Encapsulating the speech specific behavior and complexities in prepackaged, configurable User Interface (UI) components will ease and expedite the voice application development. These components can be used across applications and are c...

A New Decoding Algorithm for Statistical Machine Translation: Design and Implementation.

Conference Paper

Jan 2005

We describe a new algorithm for the Decoding problem in Statistical Machine Translation. Our algorithm is based on the Alternating Optimization framework and employs dynamic programming. The time complexity of the algorithm is O(m^2), where m is the length of the sentence to be translated, which is the best among all known algorithms for the pro...

An architecture for pluggable disambiguation mechanism for RDC based voice applications

Conference Paper

Sep 2005

Animating Expressive Faces Across Languages

Article

Full-text available

Jan 2005

This paper describes a morphing-based audio driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and synthesized expressions. A novel scheme to implement a language independent system for audio-driven facial animation given a speech recognition system for just one language, in our...

An algorithmic framework for the decoding problem in statistical machine translation

Article

Aug 2004

The decoding problem in Statistical Machine Translation (SMT) is a computationally hard combinatorial optimization problem. In this paper, we propose a new algorithmic framework for solving the decoding problem and demonstrate its utility. In the new algorithmic framework, the decoding problem can be solved both exactly and approximately. The...

An English-Hindi Statistical Machine Translation System

Conference Paper

Mar 2004

Recently statistical methods for natural language translation have become popular and found reasonable success. In this paper we describe an English-Hindi statistical machine translation system. Our machine translation system is based on IBM Models 1, 2, and 3. We present experimental results on an English-Hindi parallel corpus consisting of 150,00...

Animating expressive faces across languages.

Article

Full-text available

Jan 2004

Animating Expressive Faces To Speak In Indian

Article

Mar 2002

This paper describes a morphing based automated audio driven facial animation system. A novel scheme to implement a language independent system for audio-driven facial animation given a speech recognition system for just one language, in our case, English, is presented. New viseme and expression combinations are synthesized to be able to generate a...

Large Vocabulary Audio-Visual Speech Recognition Using Active Shape Models

Article

Full-text available

Jul 2001

Orthogonal information present in the video signal associated with the audio helps in improving the accuracy of a speech recognition system. Audio-visual speech recognition involves extraction of both the audio as well as visual features from the input signal. Extraction of visual parameters is done by the recognition of speech dependent features f...

Translingual Visual Speech Synthesis

Article

Jul 2001

Audio-driven facial animation is an interesting and evolving technique for human-computer interaction. Based on an incoming audio stream, a face image is animated with full lip synchronization. This requires a speech recognition system in the language in which audio is provided to get the time alignment for the phonetic sequence of the audio signal...

Audio Driven Facial Animation For Audio-Visual Reality

Article

Full-text available

Jul 2001

In this paper, we demonstrate a morphing based automated audio driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and expression. An animation sequence using optical flow between visemes is constructed, given an incoming audio stream and still pictures of a face speaking differe...

An Efficient Decoding Algorithm for WFA

Article

Mar 2001

Several promising image/video data compression techniques that explicitly exploit self-similarity in images and videos have been proposed in the recent past. While most of these fractal techniques are variants and/or enhancements of Jacquin's Iterated Function Systems (IFS), the Weighted Finite Automata (WFA) techniques (introduced by Culik and Kar...

Robust detection of visual ROI for automatic speechreading

Conference Paper

Full-text available

Feb 2001

We present our work on visual pruning in an audio-visual (AV) speech recognition scenario. Visual speech information has been successfully used in circumstances where audio-only recognition suffers (e.g. noisy environments). Tracking and extraction of region-of-interest (ROI) (e.g., speaker's mouth region) from video is an essential component of su...

Audio Driven Facial Animation For Audio-Visual Reality.

Conference Paper

Full-text available

Jan 2001

In this paper, we demonstrate a morphing based automated audio driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and expression. An animation sequence using optical flow between visemes is constructed, given an incoming audio stream and still pictures of a face speaking diff...

Translingual visual speech synthesis

Conference Paper

Feb 2000

Large vocabulary audio-visual speech recognition using active shape models

Conference Paper

Full-text available

Feb 2000

Late Integration In Audio-Visual Continuous Speech Recognition

Article

Dec 1999

Using visual information in speech recognition has been an area of interest because it can significantly improve speech recognition efficiency in conditions where audio-only recognition suffers due to a noisy environment. In this paper, we present a new approach to combine audio and video to improve the robustness of the speech recognition sy...

Text from Pitman Shorthand Scripted

Article

Exact Data Parallel Computation for Very Large ILP Datasets

Article

The emergence of very large machine-generated datasets raises a question of some importance for ILP, namely: can an ILP system construct models efficiently using datasets whose sizes are too large to fit in random access memory? In this paper, we examine the applicability to ILP of a popular distributed computing approach that, in principle, allows...

Using Text Reviews for Product Entity Completion

Article

Full-text available

In this paper we address the problem of obtaining structured information about products in the form of attribute-value pairs by leveraging a combination of enterprise internal product descriptions and external data. Product descriptions are short text strings used internally within enterprises to describe a product. These strings usually compris...