Table 2 - uploaded by Petr Sojka
Scalability test results (run on a machine with 448 GiB RAM and eight 8-core 64-bit Intel Xeon X7560 2.26 GHz processors).


Source publication
Article
We demonstrate searching of mathematical expressions in technical digital libraries on the MREC collection of 439,423 real scientific documents with more than 158 million mathematical formulae. Our solution, the WebMIaS system, allows the retrieval of mathematical expressions written in TeX or MathML. TeX queries are converted on-the-fly into tree repr...

Context in source publication

Context 1
... is shown in Table 2 on page 83, the performance of the system scales linearly. This gives feasible response times even for our billions of indexed subformulae. ...

Similar publications

Article
The modelling of the user profile and its integration into the search process is an effective way in personalized information search within a repository of educational digital resources. Therefore, it raises gradually the issue concerning the dynamic development of this profile so as the information requester sets up queries. In our approach presente...
Article
A Spatial Data Infrastructure (SDI) is a framework of geospatial data, metadata, users and tools intended to provide an efficient and flexible way to use spatial information. One of the key software components of an SDI is the catalogue service which is needed to discover, query, and manage the metadata. Catalogue services in an SDI are typically b...
Conference Paper
Search Engines play important roles in helping users to rapidly retrieve relevant information. The technology underlying Search Engines has been improved in the last years, both in terms of hardware capabilities and in terms of software. However, they are still affected by many issues due to the continuously growing amount of data and the various f...
Article
Rapid progress in proteomics and large-scale profiling of biological systems at the protein level necessitates the continued development of efficient computational tools for the analysis and interpretation of proteomics data. Here, we present the piNET server that facilitates integrated annotation, analysis and visualization of quantitative proteom...
Preprint
Software bug reports reported on bug-tracking systems often lack crucial information for the developers to promptly resolve them, costing companies billions of dollars. There has been significant research on effectively eliciting information from bug reporters in bug tracking systems using different templates that bug reporters need to use. However...

Citations

... An earlier effort to develop a test collection started with the Mathematical REtrieval Collection (MREC) [94], a set of 439,423 scientific documents that contained more than 158 million formulas. ...
Thesis
Large collections containing millions of math formulas are available online. Retrieving math expressions from these collections is challenging. Users can use formula, formula+text, or math questions to express their math information needs. The structural complexity of formulas requires specialized processing. Despite the existence of math search systems and online community question-answering websites for math, little is known about mathematical information needs. This research first explores the characteristics of math searches using a general search engine. The findings show how math searches are different from general searches. Then, test collections for math-aware search are introduced. The ARQMath test collections have two main tasks: 1) finding answers for math questions and 2) contextual formula search. In each test collection (ARQMath-1 to -3) the same collection is used, Math Stack Exchange posts from 2010 to 2018, introducing different topics for each task. Compared to the previous test collections, ARQMath has a much larger number of diverse topics, and an improved evaluation protocol. Another key role of this research is to leverage text and math information for improved math information retrieval. Three formula search models that only use the formula, with no context, are introduced. The first model is an n-gram embedding model using both symbol layout tree and operator tree representations. The second model uses tree-edit distance to re-rank the results from the first model. Finally, a learning-to-rank model that leverages full-tree, sub-tree, and vector similarity scores is introduced. To use context, Math Abstract Meaning Representation (MathAMR) is introduced, which generalizes AMR trees to include math formula operations and arguments. This MathAMR is then used for contextualized formula search using a fine-tuned Sentence-BERT model.
The experiments show that tree-edit distance ranking achieves the current state-of-the-art results on the contextual formula search task, and that the MathAMR model can be beneficial for re-ranking. This research also addresses the answer retrieval task, introducing a two-step retrieval model in which similar questions are first found and then answers previously given to those similar questions are ranked. The proposed model fine-tunes two Sentence-BERT models, one for finding similar questions and another for ranking the answers. For the Sentence-BERT models, raw text as well as MathAMR are used.
... We use the MREC corpus (Líška et al., 2011) as a source. The MREC corpus contains around 450k articles from arXMLiv (Stamerjohanns et al., 2010), an on-going project aiming at converting the arXiv repository from LaTeX to XML, a format more suited to machine processing. ...
Preprint
We introduce a novel task consisting in assigning a proof to a given mathematical statement. The task is designed to improve the processing of research-level mathematical texts. Applying Natural Language Processing (NLP) tools to research-level mathematical articles is challenging, since it is a highly specialized domain which mixes natural language and mathematical formulae. It is also an important requirement for developing tools for mathematical information retrieval and computer-assisted theorem proving. We release a dataset for the task, consisting of over 180k statement-proof pairs extracted from mathematical research articles. We carry out preliminary experiments to assess the difficulty of the task. We first experiment with two bag-of-words baselines. We show that considering the assignment problem globally and using weighted bipartite matching algorithms helps a lot in tackling the task. Finally, we introduce a self-attention-based model that can be trained either locally or globally and outperforms baselines by a wide margin.
... We performed a speed evaluation of MIaS on the MREC dataset of 439,423 documents [13] (see Table 1), a quality and speed evaluation on the NTCIR-10 Math [1,12] dataset of 100,000 documents, and a quality and speed evaluation on the NTCIR-11 Math-2 [2,16] (see Tables 2 and 3), and NTCIR-12 MathIR [22,15] dataset of 105,120 documents that were split into 8,301,578 paragraphs. Speed evaluation shows that the indexing time of our system is linear in the number of indexed documents and that the average query time is 469 ms. ...
Preprint
Digital mathematical libraries (DMLs) such as arXiv, Numdam, and EuDML contain mainly documents from STEM fields, where mathematical formulae are often more important than text for understanding. Conventional information retrieval (IR) systems are unable to represent formulae and they are therefore ill-suited for math information retrieval (MIR). To fill the gap, we have developed and open-sourced the MIaS MIR system. MIaS is based on the full-text search engine Apache Lucene. On top of text retrieval, MIaS also incorporates a set of tools for preprocessing mathematical formulae. We describe the design of the system and present speed and quality evaluation results. We show that MIaS is both efficient and effective, as evidenced by our victory in the NTCIR-11 Math-2 task.
... MREC [8] is a dataset of scientific papers in arxiv.org translated to XML. ...
Article
MathML is a standard markup language for describing math expressions. MathML consists of two sets of elements: Presentation Markup and Content Markup. The former is widely used to display math expressions in Web pages, while the latter is more suited to the calculation of math expressions. In this letter, we focus on the former and consider classifying Presentation MathML expressions. Identifying the classes of given Presentation MathML expressions is helpful for several applications, e.g., Presentation to Content MathML conversion, text-to-speech, and so on. We propose a method for classifying Presentation MathML expressions by using multilayer perceptron. Experimental results show that our method classifies MathML expressions with high accuracy.
... We have constructed an annotated data set of sentences for building variable typing classifiers. The sentences in our corpus are sourced from the Mathematical REtrieval Corpus (MREC) (Líška et al., 2011), a subset of arXiv (over 439,000 papers) with all LaTeX formulae converted to MathML. The data set is split into a standard training/development/test machine learning partitioning scheme as outlined in Table 1. ...
... We used 70% of them for training data and 30% for test data. [2] is a dataset of scientific papers in arxiv.org translated to XML. ...
Conference Paper
MathML consists of two sets of elements: Presentation Markup and Content Markup. The former is more widely used to display math expressions in Web pages, while the latter is more suited to the calculation of math expressions. In this paper, we consider classifying math expressions in Presentation Markup. In general, a math expression in Presentation Markup cannot be uniquely converted into the corresponding expression in Content Markup. If the class of a given math expression can be identified automatically, such conversions can be done more appropriately. Moreover, identifying the class of a given math expression is useful for text-to-speech of math expressions. In this paper, we propose a method for classifying math expressions in Presentation Markup by using a kind of deep learning: a multilayer perceptron. Experimental results show that our method classifies math expressions with high accuracy.
... WebMIaS [24] allows the retrieval of mathematical expressions written in TeX or MathML, converting TeX queries on-the-fly into tree representations of presentation MathML, which is used for indexing. The queries can be composed of plain text and mathematical formulae. ...
Article
The paper presents an overview of the current development of tools for search for mathematical formulae and their implementation in Digital Mathematical Libraries and reference databases such as zbMATH, MathSciNet and EuDML for mathematical scholarly literature.
... The user must be given the possibility to quickly determine whether the matched document is interesting or not. Therefore the problem of how to present summaries of the selected documents in the result list is of fundamental importance [LSLM11,LSR14,MG08b,WG10,You05,You06,You07,You08]. Even highlighting correctly the bits of the summary that match the query can make a significant difference in the user experience [LSLM11,LSR14,You05,You06]. ...
... Therefore the problem of how to present summaries of the selected documents in the result list is of fundamental importance [LSLM11,LSR14,MG08b,WG10,You05,You06,You07,You08]. Even highlighting correctly the bits of the summary that match the query can make a significant difference in the user experience [LSLM11,LSR14,You05,You06]. The list of results must be the starting point for further investigations by the user. ...
... The Document Retrieval problem is formulated in a way that is agnostic of the encoding. However, the user is likely to enter formulae in the query using a presentation language (mostly LaTeX, even if MathML starts to be used [LSLM11,LSR14,MG08b]). Some authors have provided evidence that precision is improved when exploiting parallel markup, even when the content part is automatically generated from the presentation part [NKTA14]. ...
Conference Paper
We present a short survey of the literature on indexing and retrieval of mathematical knowledge, with pointers to 72 papers and tentative taxonomies of both retrieval problems and recurring techniques.
... As the underlying document collection, we have used the Mathematical Retrieval Corpus (MREC) (Líška et al., 2011), which contains more than 439,000 mathematical publications, complete with mathematical formulae converted to machine-readable MathML. Similarly, we have made mathematical expressions in our topics accessible to MIR systems by converting all LaTeX embedded in MO questions into MathML using the LaTeXML tool-kit. ...
... For EuDML, we have added on-the-fly rendering of math, as autodetected in LaTeX and MathML formats. We have added facets for searching in different document fields [6]. Most importantly, we have had the privilege of mining EuDML search logs for user ...
Article
We are designing and developing a web user interface for digital mathematics libraries called WebMIaS. It allows queries to be expressed by mathematicians through a faceted search interface. Users can combine standard textual autocompleted keywords with keywords in the form of mathematical formulae in LaTeX or MathML formats. Formulae are shown rendered by the web browser on-the-fly for users' feedback. We describe WebMIaS design principles and our experiences deploying in the European Digital Mathematics Library (EuDML). We further describe the issues addressed by formulae canonicalization and by extending the MIaS indexing engine with Content MathML support.