Tanveer A. Faruquie

About

67
Publications
18,740
Reads
957
Citations

Publications (67)
Technical Report
Full-text available
In these notes we provide additional reference material demonstrating the impact of our work during the 2013 Filipino General Elections. Our work, as described in the main paper, was used by the ABS-CBN News corporation, the largest media organization in the Philippines, during the 2013 General Elections. Using our system the media house was able to...
Patent
Methods, computer program products and systems are provided for mining for sub-patterns within a text data set. The embodiments facilitate finding a set of N frequently occurring sub-patterns within the data set, extracting the N sub-patterns from the data set, and clustering the extracted sub-patterns into K groups, where each extracted sub-patter...
Patent
Described herein are methods, systems, apparatuses and products for efficient development of a rule-based system. An aspect provides a method including accessing data records; converting said data records to an intermediate form; utilizing intermediate forms to compute similarity scores for said data records; and selecting as an example to be provi...
Patent
Full-text available
Systems, methods, and computer products for optimally managing large rule sets are disclosed. Rule dependencies of rules within a set of rules may be determined as a function of rules execution frequency data generated from applying the rules over a data set. The rules within the set of rules may be clustered into rules clusters based on the determ...
Patent
Full-text available
A clustering-based approach to data standardization is provided. Certain embodiments take as input a plurality of addresses, identify one or more features of the addresses, cluster the addresses based on the one or more features, utilize the cluster(s) to provide a data-based context useful in identifying one or more synonyms for elements contained...
Patent
Described herein are methods, systems, apparatuses and products for efficient development of a rule-based system. An aspect provides a method including accessing data records; converting said data records to an intermediate form; utilizing intermediate forms to compute similarity scores for said data records; and selecting as an example to be provi...
Patent
Full-text available
A method of blocking column selection can include determining a first parameter for each column set of a plurality of column sets, wherein the first parameter indicates distribution of blocks in the column set, and determining a second parameter for each column set. The second parameter can indicate block size for the column set. For each column se...
Conference Paper
Full-text available
Due to the advent of technology and the internet over the past few years, a significant number of customers have started shopping online and accessing their bank accounts through various channels such as Netbanking, Mobile banking, etc. In this paper, we describe the Edge Analytics framework, which delivers analytics as a service that can be hosted by a financial i...
Conference Paper
The last few years have seen an exponential increase in the amount of social media data generated daily. Thus, researchers have started exploring the use of social media data in building recommendation systems and prediction models, improving disaster management, discovering trending topics, etc. An interesting application of social media is for the predi...
Conference Paper
Full-text available
Data quality is a perennial problem for many enterprise data assets. To improve data quality, businesses often employ rule based data standardization systems in which domain experts code rules for handling important and prevalent patterns. Finding these patterns is laborious and time consuming, particularly for noisy or highly specialized data sets...
Conference Paper
Full-text available
In this paper we investigate the problem of processing multi-way spatial joins on a map-reduce platform. We look at two common spatial predicates - overlap and range. We address these two classes of join queries, discuss the challenges and outline novel approaches for executing these queries on a map-reduce framework. We then discuss how we can proce...
Patent
Techniques for identifying one or more inconsistencies between an unstructured document and a back-end fact-base are provided. The techniques include automatically parsing a query document and comparing the document with a back-end fact-base comprising facts relevant to the document, identifying one or more inconsistencies between information menti...
Conference Paper
To date, ILP models in drug design have largely focussed on models in first-order logic that relate two- or three-dimensional molecular structure of a potential drug (a ligand) to its activity (for example, inhibition of some protein). In modelling terms: (a) the models have largely been logic-based (although there have been some attempts at probab...
Conference Paper
The popularity of social media as a medium for sharing information has made extracting information of interest a challenge. In this work we provide a system that can return posts published on social media covering various aspects of a concept being searched. We present a faceted model for navigating social media that provides a consistent, usable an...
Conference Paper
This paper addresses the problem of discovering activities and their temporal significance in surveillance videos in an unsupervised manner. We propose a generative model that can jointly capture the activities and their behaviour over time. We use multinomial distribution over local motion features to model activities and a mixture distribution ov...
Conference Paper
In this paper, we address the problem of discovering activities and their temporal significance in an area under surveillance. Discovering activities along with their expectation of occurrence at a particular time plays an important role in many surveillance applications. We propose an unsupervised model, called Time pLSA model, that extends the prob...
Conference Paper
Full-text available
Enterprise datasets are often noisy. Several columns can have non-standard, erroneous or missing information. Poor quality data can lead to incorrect reporting and wrong conclusions being drawn. Data cleansing involves standardizing such data to improve its quality. Often data cleansing tasks involve writing rules manually. The step involves unders...
Conference Paper
Full-text available
Record Linkage is an essential but expensive step in enterprise data management. In most deployments, blocking techniques are employed which can reduce the number of record pair comparisons and hence, the computational complexity of the task. Blocking algorithms require a careful selection of column(s) to be used for blocking. Selection of appropri...
Conference Paper
Full-text available
The threats of the 21st century are too complex, difficult and time consuming to discern with traditional intelligence practices that shun advances in information technology and rely heavily on human experts. Good information is fundamental to understand and respond to 21st century national security threats. Without comprehensive information, decis...
Article
Nearly two decades of research in the area of Inductive Logic Programming (ILP) have seen steady progress in clarifying its theoretical foundations and regular demonstrations of its applicability to complex problems in very diverse domains. These results are necessary, but not sufficient, for ILP to be adopted as a tool for data analysis in an era...
Conference Paper
Full-text available
In recent years, social networking sites have not only enabled people to connect with each other using social links but have also allowed them to share, communicate and interact over diverse geographical regions. Social networks provide a rich source of heterogeneous data which can be exploited to discover previously unknown relationships and intere...
Conference Paper
Full-text available
Several citizen service databases, such as police, national citizen identity, passport and vehicle registration, store both biographical and biometric information containing a huge number of records. Achieving scalability and high accuracy for a 1:N person identification task on these databases is a huge challenge. In this work, we propose to use com...
Article
Surveillance videos of public places often consist of group activities composed from multiple co-occurring individual activities. However, latent topic models, such as Latent Dirichlet Allocation (LDA), which have been successfully used to discover individual activities, do not discover group activities. In this paper we propose a method to discove...
Conference Paper
Full-text available
Motivated by the task of vernacular news analysis using known news topics from national newspapers, we study the task of topic analysis, where given source datasets with observed topics, data items from a target dataset need to be assigned either to observed source topics or to new ones. Using Hierarchical Dirichlet Processes for addressing this t...
Conference Paper
Full-text available
Social graphs have received renewed interest as a research topic with the advent of social networking websites. These online networks provide a rich source of data to study user relationships and interaction patterns on a large scale. In this paper, we propose a generative Bayesian model for extracting latent communities from a social graph. We ass...
Conference Paper
Full-text available
Businesses require the contact center agents to meet pre-specified customer satisfaction levels while keeping the cost of operations low or meeting sales targets, objectives that end up being complementary and difficult to achieve in real-time. In this paper, we describe a speech enabled real-time conversation management system that tracks customer...
Article
Full-text available
With the increase in the number of people communicating through the internet, there has been a steady increase in the amount of text available online. Most such text is different from the standard language, as people try to use various kinds of short forms for words to save time and effort. We call that noisy text. Part-Of-Speech (POS) tagging has...
Conference Paper
Businesses are increasingly realizing the value of creating a single view of its customers and partners by integrating information residing in 'siloed' datasets within and outside the enterprise. However, the task of augmenting data available within the enterprise with data purchased from third-party providers or that residing in a public...
Conference Paper
Full-text available
Data quality improvement is an important aspect of enterprise data management. Data characteristics can change with customers, with domain and geography making data quality improvement a challenging task. Data quality improvement is often an iterative process which mainly involves writing a set of data quality rules for standardization and eliminat...
Article
Full-text available
Enterprises today accumulate huge quantities of data which is often noisy and unstructured in nature, making data cleansing an important task. Data cleansing refers to standardizing data from different sources to a common format so that data can be better utilized. Most of the enterprise data cleansing models are rule based, involving a lot of manual e...
Conference Paper
Full-text available
Address Cleansing is very challenging, particularly for geographies with variability in writing addresses. Supervised learners can be easily trained for different data sources. However, training requires labeling large corpora for each data source which is time consuming and labor intensive to create. We propose a method to automatically transfer s...
Conference Paper
Full-text available
Data processing on the cloud is increasingly used for offering cost effective services. In this paper, we present a method for resource allocation for data processing services over the cloud taking into account not just the processing power and memory requirements, but the network speed, reliability and data throughput. We also present algorithms f...
Conference Paper
Full-text available
Poor Data Quality is a serious problem affecting enterprises. Enterprise databases are large and manual data cleansing is not feasible. For such large databases it is logical to attempt to cleanse the data in an automated way. This has led to the development of commercial tools for automatic cleansing. However, offering data cleansing as a service...
Conference Paper
Full-text available
We propose a novel method to automatically acquire a term-frequency-based taxonomy from a corpus using an unsupervised method. A term-frequency-based taxonomy is useful for application domains where the frequency with which terms occur on their own and in combination with other terms imposes a natural term hierarchy. We highlight an application for...
Conference Paper
Full-text available
There is often a transient need within enterprises for data cleansing which can be satisfied by offering data cleansing as a transient service. Every time a data cleansing need arises it should be possible to provision hardware, software and staff for accomplishing the task and then dismantling the set up. In this paper we present such a system tha...
Conference Paper
Full-text available
Recent times have seen a tremendous growth in mobile based data services that allow people to use Short Message Service (SMS) to access these data services. In a multilingual society it is essential that data services that were developed for a specific language be made accessible through other local languages also. In this paper, we present a...
Conference Paper
Full-text available
In this paper we look at the problem of cleansing noisy text using a statistical machine translation model. Noisy text is produced in informal communications such as Short Message Service (SMS), Twitter and chat. A typical Statistical Machine Translation system is trained on parallel text comprising noisy and clean sentences. In this paper we propo...
Conference Paper
Full-text available
Rule based systems for processing text data encode the knowledge of a human expert into a rule base to take decisions based on interactions of the input data and the rule base. Similarly, supervised learning based systems can learn patterns present in a given dataset to make decisions on similar and other related data. Performances of both these cl...
Article
Topic models such as probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) have been successfully used to discover individual activities in a scene. However these methods do not discover group activities which are commonly observed in real life videos of public places. In this paper we address the problem of discoverin...
Conference Paper
Full-text available
Protecting sensitive information while preserving the share-ability and usability of data is becoming increasingly important in the outsourced business process industry. Particularly in the context of call-centers a lot of customer related sensitive information is stored in audio recordings. In this work, we address the problem of protecting sensit...
Conference Paper
Full-text available
In this paper, we present a first of a kind system, called business intelligence from voice of customer (BIVoC), that can: 1) combine unstructured information and structured information in an information intensive enterprise and 2) derive richer business insights from the combined data. Unstructured information, in this paper, refers to voice of cu...
Conference Paper
Full-text available
Often, in the real world noise is ubiquitous in text communications. Text produced by processing signals intended for human use is often noisy for automated computer processing. Automatic speech recognition, optical character recognition and machine translation all introduce processing noise. Also digital text produced in informal settings such as...
Conference Paper
Full-text available
Short Messaging Service (SMS) is popularly used to provide information access to people on the move. This has resulted in the growth of SMS based Question Answering (QA) services. However automatically handling SMS questions poses significant challenges due to the inherent noise in SMS questions. In this work we present an automatic FAQ-based...
Conference Paper
In this paper, we address the problem of detecting sensitive events in a speech signal, such as the exchange of credit card information. Although close in nature to the word spotting problem, variability in the linguistic content constituting an event and their composition makes event detection a harder task, especially in the context where it is applied...
Conference Paper
Full-text available
Protecting sensitive information while preserving the share-ability and usability of data is becoming increasingly important. In call-centers a lot of customer related sensitive information is stored in audio recordings. In this work, we address the problem of protecting sensitive information in audio recordings and speech transcripts. We present a...
Conference Paper
Utterance classification is an important task in spoken-dialog systems. The response of the system is dependent on the category assigned to the speaker's utterance by the classifier. However, often the input speech is spontaneous and noisy which results in high word error rates. This results in unsatisfactory system performance. In this paper we descri...
Conference Paper
Voice application development requires specialized speech related skills besides the general programming ability. Encapsulating the speech specific behavior and complexities in prepackaged, configurable User Interface (UI) components will ease and expedite the voice application development. These components can be used across applications and are c...
Conference Paper
We describe a new algorithm for the Decoding problem in Statistical Machine Translation. Our algorithm is based on the Alternating Optimization framework and employs dynamic programming. The time complexity of the algorithm is O(m^2), where m is the length of the sentence to be translated, which is the best among all known algorithms for the pro...
Article
Full-text available
This paper describes a morphing-based audio driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and synthesized expressions. A novel scheme to implement a language independent system for audio-driven facial animation given a speech recognition system for just one language, in our...
Article
The decoding problem in Statistical Machine Translation (SMT) is a computationally hard combinatorial optimization problem. In this paper, we propose a new algorithmic framework for solving the decoding problem and demonstrate its utility. In the new algorithmic framework, the decoding problem can be solved both exactly and approximately. The...
Conference Paper
Recently statistical methods for natural language translation have become popular and found reasonable success. In this paper we describe an English-Hindi statistical machine translation system. Our machine translation system is based on IBM Models 1, 2, and 3. We present experimental results on an English-Hindi parallel corpus consisting of 150,00...
Article
Full-text available
This paper describes a morphing-based audio driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and synthesized expressions. A novel scheme to implement a language independent system for audio-driven facial animation given a speech recognition system for just one language, in our...
Article
This paper describes a morphing based automated audio driven facial animation system. A novel scheme to implement a language independent system for audio-driven facial animation given a speech recognition system for just one language, in our case, English, is presented. New viseme and expression combinations are synthesized to be able to generate a...
Article
Full-text available
Orthogonal information present in the video signal associated with the audio helps in improving the accuracy of a speech recognition system. Audio-visual speech recognition involves extraction of both the audio as well as visual features from the input signal. Extraction of visual parameters is done by the recognition of speech dependent features f...
Article
Audio-driven facial animation is an interesting and evolving technique for human-computer interaction. Based on an incoming audio stream, a face image is animated with full lip synchronization. This requires a speech recognition system in the language in which audio is provided to get the time alignment for the phonetic sequence of the audio signal...
Article
Full-text available
In this paper, we demonstrate a morphing based automated audio driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and expression. An animation sequence using optical flow between visemes is constructed, given an incoming audio stream and still pictures of a face speaking differe...
Article
Several promising image/video data compression techniques that explicitly exploit self-similarity in images and videos have been proposed in the recent past. While most of these fractal techniques are variants and/or enhancements of Jacquin's Iterated Function Systems (IFS), the Weighted Finite Automata (WFA) techniques (introduced by Culik and Kar...
Conference Paper
Full-text available
We present our work on visual pruning in an audio-visual (AV) speech recognition scenario. Visual speech information has been successfully used in circumstances where audio-only recognition suffers (e.g. noisy environments). Tracking and extraction of region-of-interest (ROI) (e.g., speaker's mouth region) from video is an essential component of su...
Conference Paper
Full-text available
In this paper, we demonstrate a morphing based automated audio driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and expression. An animation sequence using optical flow between visemes is constructed, given an incoming audio stream and still pictures of a face speaking diff...
Conference Paper
Audio-driven facial animation is an interesting and evolving technique for human-computer interaction. Based on an incoming audio stream, a face image is animated with full lip synchronization. This requires a speech recognition system in the language in which audio is provided to get the time alignment for the phonetic sequence of the audio signal...
Conference Paper
Full-text available
Orthogonal information present in the video signal associated with the audio helps in improving the accuracy of a speech recognition system. Audio-visual speech recognition involves extraction of both the audio as well as visual features from the input signal. Extraction of visual parameters is done by the recognition of speech dependent features f...
Article
Using visual information in speech recognition has been an area of interest because it can significantly improve speech recognition efficiency in conditions where audio-only recognition suffers due to a noisy environment. In this paper, we present a new approach to combine audio and video to improve the robustness of the speech recognition sy...
Article
The emergence of very large machine-generated datasets raises a question of some importance for ILP, namely: can an ILP system construct models efficiently using datasets whose sizes are too large to fit in random access memory? In this paper, we examine the applicability to ILP of a popular distributed computing approach that in principle, allows...
Article
Full-text available
In this paper we address the problem of obtaining structured information about products in the form of attribute-value pairs by leveraging a combination of enterprise internal product descriptions and external data. Product descriptions are short text strings used internally within enterprises to describe a product. These strings usually compris...
