Matko Bošnjak's research while affiliated with Ruđer Bošković Institute and other places


Publications (13)


Synthetic clickstreams for Videolectures.net recommender system challenge
  • Data
  • File available

April 2013 · 44 Reads · Matko Bosnjak · [...]


Figure 1: Experiment timeline and target analysis. (a) Timeline for the CAFA experiment. (b) Number of target sequences per organism. The graph shows the number of target sequences for each of the ontologies (Molecular Function and Biological Process) as well as the total number of targets, obtained as a union between sequences in the two ontologies. Of 866 proteins, 531 had Molecular Function annotations and 587 had Biological Process annotations. (c) Distribution of target sequences in each ontology according to the number of leaf terms available for each protein sequence. For example, in the Molecular Function category, 79% of proteins had one leaf term, 16% had two leaf terms, and so on. A term is considered a leaf term for a particular target if no other GO term associated with that sequence is its descendant.
Figure 2: Overall performance evaluation. (a,b) The maximum F-measure for the top-performing methods for Molecular Function ontology (a) and Biological Process ontology (b). All panels show the top ten participating methods in each category as well as the BLAST and Naive baseline methods. Note that 33 models outperformed BLAST in the Molecular Function category, whereas 26 models outperformed BLAST in the Biological Process category (cutoff scores below which methods were excluded from the panels were 0.468 and 0.300 for the Molecular Function and Biological Process categories, respectively). In the Molecular Function category, proteins with “protein binding” as their only leaf term were excluded from the analysis because the protein binding term was not considered informative (results that include those proteins are presented in Supplementary ). A perfect predictor would be characterized with Fmax = 1. Confidence intervals (95%) were determined using bootstrapping with n = 10,000 iterations on the set of target sequences. For cases in which a principal investigator participated in multiple teams, only the results of the best-scoring method are presented.
Figure 3: Domain analysis and performance evaluation for single-domain versus multidomain eukaryotic targets. (a) Distribution of target proteins with respect to the number of Pfam domains they contain. (b) Performance evaluation in the Molecular Function category. Each of the ten top-performing methods showed higher accuracy (higher Fmax) on single-domain proteins. Confidence intervals (95%) were determined using bootstrapping with n = 10,000 iterations on the set of target sequences.
Figure 4: Case study on the human PNPT1 gene. (a) Domain architecture of human PNPT1 gene according to the Pfam classification. For each domain, the numbers of different leaf terms (for the Molecular Function and Biological Process categories) associated with any protein in Swiss-Prot database containing this domain are shown. (b) Molecular Function terms (six of which are leaves) associated with the human PNPT1 gene in Swiss-Prot as of December 2011. Colored circles represent the predicted terms for three representative methods as well as two baseline methods. The prediction threshold for each method was selected to correspond to the point in the precision-recall space that provides the maximum F-measure. J (blue), Jones-UCL; O (magenta), Team Orengo; d (navy blue), dcGO; B (green), BLAST; N (brown), Naive. Dashed lines indicate the presence of other terms between the source and destination nodes.
A large-scale evaluation of computational protein function prediction

January 2013 · 863 Reads · 709 Citations

Nature Methods

Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools. Supplementary information The online version of this article (doi:10.1038/nmeth.2340) contains supplementary material, which is available to authorized users.
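The figure captions for this paper rank methods by the maximum F-measure (Fmax) and report 95% confidence intervals from bootstrapping over the target sequences. The following is a minimal sketch of how such a score could be computed, not the evaluation code used in the study. It assumes protein-centric averaging (precision averaged over proteins with at least one prediction at a threshold, recall averaged over all targets) and a percentile bootstrap; the function names, the threshold grid, and the toy data are illustrative assumptions.

```python
import random


def f_max(predictions, truth, thresholds=None):
    """Maximum F-measure over score thresholds.

    predictions: {protein: {term: score in [0, 1]}}
    truth:       {protein: set of true terms}
    Assumption: precision is averaged over proteins with at least one
    prediction at the threshold, recall over all proteins in `truth`.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            if not true_terms:
                continue  # a target without annotations cannot be scored
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(true_terms))
        if not precisions or not recalls:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


def bootstrap_ci(predictions, truth, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Fmax, resampling target proteins."""
    rng = random.Random(seed)
    proteins = list(truth)
    scores = []
    for _ in range(n_iter):
        sample = [rng.choice(proteins) for _ in proteins]
        # Index the keys so a protein drawn twice counts twice.
        boot_truth = {(i, p): truth[p] for i, p in enumerate(sample)}
        boot_pred = {(i, p): predictions.get(p, {}) for i, p in enumerate(sample)}
        scores.append(f_max(boot_pred, boot_truth))
    scores.sort()
    lower = scores[int((alpha / 2) * n_iter)]
    upper = scores[int((1 - alpha / 2) * n_iter) - 1]
    return lower, upper


# Toy data with made-up protein and term names.
toy_predictions = {"P1": {"A": 0.9, "B": 0.4}, "P2": {"C": 0.8}}
toy_truth = {"P1": {"A"}, "P2": {"C", "D"}}
toy_fmax = f_max(toy_predictions, toy_truth)
toy_ci = bootstrap_ci(toy_predictions, toy_truth, n_iter=200)
```

On the toy data the best threshold keeps only the high-scoring terms, giving precision 1 and recall 0.75. The interval from two proteins is of course not meaningful; the caption reports n = 10,000 iterations over the full target set.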


Phyletic Profiling with Cliques of Orthologs Is Enhanced by Signatures of Paralogy Relationships

January 2013 · 223 Reads · 40 Citations

PLOS Computational Biology


New microbial genomes are sequenced at a high pace, allowing insight into the genetics of not only cultured microbes, but a wide range of metagenomic collections such as the human microbiome. To understand the deluge of genomic data we face, computational approaches for gene functional annotation are invaluable. We introduce a novel model for computational annotation that refines two established concepts: annotation based on homology and annotation based on phyletic profiling. The phyletic profiling-based model that includes both inferred orthologs and paralogs (homologs separated by a speciation and a duplication event, respectively) provides more annotations at the same average Precision than the model that includes only inferred orthologs. For experimental validation, we selected 38 poorly annotated Escherichia coli genes for which the model assigned one of three GO terms with high confidence: involvement in DNA repair, protein translation, or cell wall synthesis. Results of antibiotic stress survival assays on E. coli knockout mutants showed high agreement with our model's estimates of accuracy: out of 38 predictions obtained at the reported Precision of 60%, we confirmed 25 predictions, indicating that our confidence estimates can be used to make informed decisions on experimental validation. Our work will contribute to making experimental validation of computational predictions more approachable, both in cost and time. Our predictions for 998 prokaryotic genomes include ∼400,000 specific annotations with the estimated Precision of 90%, ∼19,000 of which are highly specific (e.g. "penicillin binding," "tRNA aminoacylation for protein translation," or "pathogenesis"), and are freely available at http://gorbi.irb.hr/.





Memory biased random walk approach to synthetic clickstream generation

January 2012 · 748 Reads · 3 Citations

Personalized recommender systems rely on the personal usage data of each user in the system. However, privacy policies protecting users' rights prevent this data from being made publicly available to the wider research community. In this work, we propose a memory biased random walk model (MBRW), based on real clickstream graphs, as a generator of synthetic clickstreams that conform to the statistical properties of the real clickstream data while adhering to privacy protection policies. We show that synthetic clickstreams can be used to learn recommender system models that achieve high recommender performance on real data while ensuring that strong de-minimization guarantees are provided.
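The abstract does not spell out how the walk is biased, so the following is only a rough sketch of the general idea of a random walk over a clickstream graph whose next step is influenced by recently visited items. The transition-count graph, the co-occurrence bias, the `memory_size` parameter, and all function names are illustrative assumptions and should not be read as the MBRW model itself.

```python
import random
from collections import defaultdict


def build_transition_counts(clickstreams):
    """Directed graph: counts[a][b] = how often b was clicked right after a."""
    counts = defaultdict(lambda: defaultdict(int))
    for stream in clickstreams:
        for a, b in zip(stream, stream[1:]):
            counts[a][b] += 1
    return counts


def build_cooccurrence(clickstreams):
    """Undirected graph: co[a][b] = number of streams containing both items."""
    co = defaultdict(lambda: defaultdict(int))
    for stream in clickstreams:
        items = set(stream)
        for a in items:
            for b in items:
                if a != b:
                    co[a][b] += 1
    return co


def generate_clickstream(counts, co, start, length, memory_size=2, rng=None):
    """Walk the transition graph, favouring items that co-occur with memory.

    Assumed bias: each candidate's transition count is multiplied by
    1 + (its co-occurrence with the last `memory_size` visited items).
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        neighbours = counts.get(walk[-1])
        if not neighbours:
            break  # dead end: no observed transition out of this item
        memory = walk[-memory_size:]
        candidates = list(neighbours)
        weights = [
            neighbours[nxt] * (1 + sum(co[m].get(nxt, 0) for m in memory))
            for nxt in candidates
        ]
        walk.append(rng.choices(candidates, weights=weights, k=1)[0])
    return walk


# Made-up "real" clickstreams over item identifiers a-d.
real_streams = [["a", "b", "c"], ["a", "b", "d"], ["b", "c", "a"]]
transition_counts = build_transition_counts(real_streams)
cooccurrence = build_cooccurrence(real_streams)
synthetic = generate_clickstream(
    transition_counts, cooccurrence, start="a", length=5, rng=random.Random(1)
)
```

Every step of the synthetic stream follows a transition that was observed in the input, so the generated sequences stay within the structure of the original graph without copying any single user's session.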


REVIGO Summarizes and Visualizes Long Lists of Gene Ontology Terms

July 2011 · 8,707 Reads · 5,302 Citations

Outcomes of high-throughput biological experiments are typically interpreted by statistical testing for enriched gene functional categories defined by the Gene Ontology (GO). The resulting lists of GO terms may be large and highly redundant, and thus difficult to interpret. REVIGO is a Web server that summarizes long, unintelligible lists of GO terms by finding a representative subset of the terms using a simple clustering algorithm that relies on semantic similarity measures. Furthermore, REVIGO visualizes this non-redundant GO term set in multiple ways to assist in interpretation: multidimensional scaling and graph-based visualizations accurately render the subdivisions and the semantic relationships in the data, while treemaps and tag clouds are also offered as alternative views. REVIGO is freely available at http://revigo.irb.hr/.
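As a rough illustration of how a representative subset might be picked using semantic similarity, the sketch below applies a simple greedy rule: terms are visited in order of increasing p-value, and a term is kept only if it is not too similar to any term already kept. This is a simplified stand-in and not REVIGO's actual algorithm. The precomputed similarity table, the 0.7 cutoff, the function names, and the placeholder term identifiers are all assumptions.

```python
def pair_similarity(similarity, a, b):
    """Look up a symmetric similarity; unknown pairs count as dissimilar."""
    return similarity.get(frozenset((a, b)), 0.0)


def reduce_terms(p_values, similarity, cutoff=0.7):
    """Greedy redundancy reduction over a list of enriched terms.

    p_values:   {term: enrichment p-value}
    similarity: {frozenset({term_a, term_b}): similarity in [0, 1]}
    Returns the kept terms and a map from every term to its representative.
    """
    kept = []
    representative = {}
    for term in sorted(p_values, key=p_values.get):
        # Most similar term among those already kept, if any.
        closest = max(
            kept,
            key=lambda k: pair_similarity(similarity, term, k),
            default=None,
        )
        if closest is not None and pair_similarity(similarity, term, closest) >= cutoff:
            representative[term] = closest  # redundant: fold into closest term
        else:
            kept.append(term)
            representative[term] = term
    return kept, representative


# Placeholder identifiers, not real GO accessions.
toy_p_values = {"GO:A": 0.001, "GO:B": 0.01, "GO:C": 0.02}
toy_similarity = {
    frozenset(("GO:A", "GO:B")): 0.9,
    frozenset(("GO:A", "GO:C")): 0.1,
    frozenset(("GO:B", "GO:C")): 0.2,
}
kept_terms, term_clusters = reduce_terms(toy_p_values, toy_similarity)
```

Here `GO:B` is folded into `GO:A` because the two are highly similar and `GO:A` has the smaller p-value, while `GO:C` survives as its own representative.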


ECML-PKDD 2011 Discovery Challenge overview

January 2011 · 28 Reads · 4 Citations

This year's Discovery Challenge was dedicated to solving video lecture recommendation problems, based on data collected at the VideoLectures.Net site. The challenge had two tasks: task 1, which simulated the new-user/new-item recommendation problem, and task 2, which simulated clickstream-based recommendation. In this overview we present the challenge datasets, tasks, and evaluation measure, and we analyze the solutions and results.


RSCTC’2010 Discovery Challenge: Mining DNA Microarray Data for Medical Diagnosis and Treatment

June 2010 · 203 Reads · 29 Citations

The RSCTC’2010 Discovery Challenge was a special event of the Rough Sets and Current Trends in Computing conference. The challenge was organized as an interactive online competition on the TunedIT.org platform, held between Dec 1, 2009 and Feb 28, 2010. The task concerned feature selection in the analysis of DNA microarray data and the classification of samples for the purpose of medical diagnosis or treatment. Prizes were awarded for the best solutions. This paper describes the organization of the competition and the winning solutions.


Citations (8)


... Our novel features provide a much larger coverage than existing methods while maintaining a high accuracy. 2) Preliminary experiments on standard WEBSPAM-UK2007 [5], ClueWeb-2009 [6], and ECML-PKDD-2011 [7] benchmark datasets demonstrate the effectiveness of the novel features on learning the classifier for detecting web spam. The rest of the paper is formed as follows: We review the previous research work in Section 2. In section 3, we describe the proposed groups of novel web spam features. ...

Reference:

Novel Features for Web Spam Detection
ECML-PKDD 2011 Discovery Challenge overview
  • Citing Article
  • January 2011

... RapidMiner is often successfully used in the application of classification algorithms [7]. Furthermore, it provides a support for Meta learning for classification [8] and constructing of recommender system workflow templates [9]. In this paper, we focus on building recommender system for higher education students. ...

Constructing recommender systems workflow templates in RapidMiner
  • Citing Article

... In recent years there has been a significant expansion in the use of telemedicine to improve safety and efficacy of treatment as well as to improve patient education [10]. ...

HEARTFAID's eCRF: Lessons Learnt from Using a Two-Level Data Acquisition and Storage System for Knowledge Discovery Tasks within an Electronic Platform for Managing Heart Failure Patients

Bio-Algorithms and Med-Systems

... Accurately predicting protein function is a cornerstone in molecular biology, with extensive applications in drug design, drug discovery and disease modeling (Rezaei et al., 2020). However, the complexity and variability of proteins pose significant challenges for computational prediction models (Radivojac et al., 2013; Schauperl & Denny, 2022). The functionality of a protein is affected by its three-dimensional structure, often dictating its interactions with other molecules (Ivanisenko et al., 2005). ...

A large-scale evaluation of computational protein function prediction

Nature Methods

... In the original study, orthologs for a protein of interest were identified as those matching the query sequence with a score above an alignment threshold relative to the size of the searched database [24]. Since then, profile elements have been identified using bit-score thresholds, protein domains, membership in Clusters of Orthologous Groups of proteins (COGS), and methods for distinguishing between orthologs and paralogs [25,37,38,39,34,40]. ...

Phyletic Profiling with Cliques of Orthologs Is Enhanced by Signatures of Paralogy Relationships
PLOS Computational Biology


... Four datasets were used as a case study for the feature selection algorithms. Two sets are microarray datasets that are used for research on psoriasis [31,32,33,34] and cancer [35]. The two other sets are mass spectrometry datasets, used for research on cancer [36] and micro organisms [37]. ...

RSCTC’2010 Discovery Challenge: Mining DNA Microarray Data for Medical Diagnosis and Treatment

... Medical knowledge is a cognitive and technical component, i.e., it comprises the individual's perspectives, beliefs, talents, and expertise. Challenging aspects of medical plans and medical knowledge itself are (i) time, data gathering may last years, while the answer can require only a few seconds; (ii) space, because data may arrive from many different health care units, in distinct formats; and (iii) medicine's inherent complexity, the depth of knowledge that each medical specialty offers [Jovic et al., 2007c] [Gamberger et al., 2008]. ...

Attribute ranking for intelligent data analysis in medical applications
  • Citing Conference Paper
  • July 2008