
INTELLIGENT ANALYTICAL SYSTEM AS A TOOL TO ENSURE THE REPRODUCIBILITY OF BIOMEDICAL CALCULATIONS // Artificial Intelligence. 2020. No. 3. pp. 65-78.

January 2020


Authors:

Vasyl Gorbachuk

National Academy of Sciences of Ukraine

Natalia Novoselova

National Academy of Sciences of Belarus

S. P. Osypenko

Vadim Skobtsov

United Institute of Informatics Problems



Figures (images not reproduced): comparative density distributions for the full data set (n=12000), the imbalanced data set (n=3720), and the balanced data set (n=8280).

ISSN 1561-5359. Artificial Intelligence, 2020, № 3

UDC 519.8

T.O. Bardadym (1), V.M. Gorbachuk (1), N.A. Novoselova (2), S.P. Osypenko (1), V.Yu. Skobtsov (2)

(1) V.M. Glushkov Institute of Cybernetics, National Academy of Sciences of Ukraine, Ukraine

Academician Glushkov Ave., 40, Kyiv, 03187

(2) United Institute of Informatics Problems of the National Academy of Sciences of Belarus, Belarus

6, Surganava St., Minsk, 220012

INTELLIGENT ANALYTICAL SYSTEM AS A TOOL TO ENSURE THE REPRODUCIBILITY OF BIOMEDICAL CALCULATIONS


Abstract. The experience of using applied containerized biomedical software tools in a cloud environment is summarized. The reproducibility of scientific computing in relation to modern technologies of scientific calculation is discussed. The main approaches to biomedical data preprocessing and integration within the framework of the intelligent analytical system are described. Under pandemic conditions, the success of a health care system depends significantly on the regular implementation of effective research tools and population monitoring. The earlier the risks of disease can be identified, the more effective preventive measures or treatments can be. This publication concerns the creation of a prototype of such a tool within the project «Development of methods, algorithms and intelligent analytical system for processing and analysis of heterogeneous clinical and biomedical data to improve the diagnosis of complex diseases» (M/99-2019, M/37-2020, with the support of the Ministry of Education and Science of Ukraine), implemented by the V.M. Glushkov Institute of Cybernetics, National Academy of Sciences of Ukraine, together with the United Institute of Informatics Problems, National Academy of Sciences of Belarus (F19UKRG-005, with the support of the Belarusian Republican Foundation for Fundamental Research). Insurers entering the market can insure mostly low risks by encouraging consumers (policyholders) to change insurers more often, stirring up the overall health insurance market. Socio-demographic variables can serve as risk adjusters. Since age and gender have relatively little explanatory power, other socio-demographic variables have been studied: marital status, retirement status, disability status, educational level, and income level. Because insurers benefit from favorable diagnoses for their policyholders, they also have an interest in how the relevant information is interpreted (upcoding): insurers can encourage their policyholders to consult doctors more often in order to collect as many diagnoses as possible. Many countries and health care systems use diagnostic information to determine the reimbursement of service providers, which makes the necessary data available. To process and analyze these data, software implementations are being developed for constructing classifiers, selecting informative features, and processing heterogeneous medical and biological variables for scientific research in clinical medicine. In particular, attention is paid to the containerization of biomedical applications (Docker and Singularity containerization technologies), which makes it possible to reproduce the conditions in which the calculations took place (an unchanged set of software and libraries); to software pipelining technologies, which make it possible to organize calculations as a flow; and to technologies for parameterizing the software environment, which make it possible to recreate an identical computing environment when necessary. Experience with the developed linear classifier, gained while testing it on artificial and real data, points to several advantages of the containerized form of the application: it provides access to real data located in a cloud environment; calculations for research problems can be performed on cloud resources both with the developed tools and with cloud services; this form of research organization makes numerical experiments reproducible, i.e. any other researcher can compare the results of their own developments on specific data that have already been studied by others, in order to verify the conclusions and the technical feasibility of new results; and the developed tools can be used universally on devices of various classes, from a personal computer to a powerful cluster.

Keywords: classifier; cloud service; containerized application; gene expression data; isolated software environment; reproducibility of calculations; biomarker.


© T.O. Bardadym, V.M. Gorbachuk, N.A. Novoselova, S.P. Osypenko, V.Yu. Skobtsov


Introduction

This publication summarizes the experience of using applied containerized software tools in a cloud environment, which the authors gained during the project «Development of methods, algorithms and intelligent analytical system for processing and analysis of heterogeneous clinical and biomedical data in order to improve the diagnosis of complex diseases», carried out by a team from the United Institute of Informatics Problems of the NAS of Belarus and the V.M. Glushkov Institute of Cybernetics of the NAS of Ukraine. The main approaches and software tools for developing the intelligent analytical system are described.

Problem formulation

Under pandemic conditions, the success of a health care system depends significantly on the regular implementation of effective research tools and population monitoring [1-2]. Decisions on which people's lives depend are regularly made not only by individuals but also by the legislative and executive institutions that carry out the functions of the state health care system. These decisions take into account the possibility of preserving and prolonging human life by means of scarce resources (for example, financial, human, and time resources). Government institutions also make such decisions when carrying out the functions of defence and security, law and order, macroeconomic management, and protection of property rights.

Analysis of recent research and publications

Countries that have a national health service or national health insurance usually let government agencies decide on orders for new products (pharmaceuticals, therapies, and medical devices). As a rule, innovations that offer therapeutic treatment with a lower probability of early death within a certain risk group predominate [3]. Because such innovations involve additional costs, cost-cutting innovations are often neglected.

For instance, a multimillion-dollar mobile coronary unit can help treat heart attack patients quickly, significantly reducing the number of deaths on the way to hospital. Long-term drug therapy for hypertensive patients who take antihypertensive drugs can also prevent heart attacks, while significantly supporting the economics of research and development (R&D) in pharmaceuticals. The installation of dialysis equipment for patients with chronic renal failure promotes R&D in medical equipment manufacturing.

Goal of the research

The earlier the risks of disease can be identified, the more effective preventive measures or treatments can be [4]. Life-saving costs are borne not only in health care. In transport, locations with a higher number of road accidents raise the issue of improving road quality (not only the road surface), which local communities and government agencies must address; there is also the issue of properly arranging roads within residential areas in order to reduce vehicle speeds and to conduct permanent video surveillance. Of course, addressing these issues in practice requires certain expenditures from the local or state budget.

Main results

In environmental protection, questions arise about the adequacy of safety systems at dangerous facilities such as nuclear power plants or chemical plants: if the safety system is insufficient, an accident can occur that threatens the lives of millions of people. One of the consequences of the 1986 Chornobyl disaster was an increase in cancer cases, especially in Ukraine and Belarus. For coal-fired thermal power plants, questions arise about the cost of filters that can capture sulfur dioxide and other harmful emissions into the atmosphere. Such emissions increase the incidence of respiratory diseases in the population.

In all the above cases, government institutions cannot make rational decisions without a comprehensive and accurate assessment of the future gains (and losses) caused by the implementation of a particular project, and without comparing such gains with the present value of the cost flow associated with the project. It is important for decision makers to measure gains and costs in the same units. Since project costs are usually measured in monetary terms, it makes sense to measure all gains in monetary terms as well. Therefore, the prolongation of life or improvement of human health caused by the implementation of a project should also be measured in monetary units. Since it is difficult to assess a person's health and life in monetary units, economists have developed alternative methods for assessing the state of health and human life.
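The comparison of monetised gains with the present value of a project's cost flow can be illustrated with a minimal Python sketch. The function names, the discount rate, and all amounts below are hypothetical and chosen only for illustration; they are not taken from the project.

```python
def present_value(flows, rate):
    """Discount yearly amounts (year 0 first) to their value today."""
    return sum(amount / (1.0 + rate) ** year for year, amount in enumerate(flows))


def net_present_benefit(gains, costs, rate):
    """Present value of monetised gains minus present value of the cost flow."""
    return present_value(gains, rate) - present_value(costs, rate)


# Hypothetical project (illustrative numbers only, arbitrary monetary units):
# most costs fall in year 0, while monetised health gains accrue later.
gains = [0.0, 60.0, 60.0, 60.0]
costs = [100.0, 10.0, 10.0, 10.0]
discount_rate = 0.05  # assumed annual discount rate

result = net_present_benefit(gains, costs, discount_rate)
print(f"net present benefit: {result:.2f}")  # a positive value favours the project
```

Measuring both flows in the same monetary units is what makes the subtraction meaningful; a different discount rate can change the sign of the result.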

Different approaches to the economic assessment of health compare the benefits of a medical intervention with its costs. Gains from an intervention can be measured in physical units on a one-dimensional scale, in monetary units, or in units of a cardinal utility function that condenses the multidimensional concept of health into a scalar index.

Since the 1990s, several countries have taken steps to increase competition among their health insurers, hoping to improve efficiency in health insurance and health care. Under competition, the generalized equality of price and marginal cost means that competing health insurers will charge a high premium for high risks and a low premium for low risks, because high risks are characterized by a relatively high expected cost of treatment due to the high probability of disease. As the state wants all its citizens to have health insurance, issues of risk selection arise in health insurance markets.

One way to ensure universal access to health insurance is to provide targeted subsidies to the poorer strata of the population to cover insurance premiums. In practice, governments regulate premiums, effectively eliminating the dependence of the premium charged by an insurer on risk: in the United States, for example, premium regulation takes the form of so-called community rating. In addition, German and Swiss regulators typically require insurers to follow an open enrollment policy and accept all applications. In the United States, Medicare gives its beneficiaries a choice between the Medicare plan itself and competing health care plans, which receive a capitation payment for every policyholder.

Therefore, in the countries mentioned, there is a natural incentive for risk selection. If each person pays the same insurance premium, the insurer expects losses on high-risk individuals and gains on low-risk individuals. The economic viability and balance of any health insurer presuppose a sufficient number of low-risk insured persons, so insurers try to attract as many such persons as possible. Therefore, under competitive pressure, all insurers will take part in cream-skimming, attracting favorable risks and avoiding adverse risks.
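The incentive for cream-skimming under a uniform premium can be shown with a small numerical sketch in Python. The two risk types, their head counts, and their expected costs are hypothetical values chosen for illustration, and the helper names are not from the source.

```python
def uniform_premium(groups):
    """Break-even premium when everyone pays the same amount."""
    people = sum(count for count, _ in groups.values())
    expected_cost = sum(count * cost for count, cost in groups.values())
    return expected_cost / people


def expected_profit(groups, premium):
    """Insurer's expected profit on a portfolio at a given uniform premium."""
    return sum(count * (premium - cost) for count, cost in groups.values())


# Hypothetical market: risk type -> (number of persons, expected yearly cost).
market = {"low": (800, 1000.0), "high": (200, 6000.0)}
premium = uniform_premium(market)

# Expected margin per person: positive for low risks, negative for high risks.
margins = {risk: premium - cost for risk, (_, cost) in market.items()}

# An insurer that attracts more low risks than the market average
# earns a profit at the same premium.
skimmer = {"low": (900, 1000.0), "high": (100, 6000.0)}

print(premium, margins, expected_profit(skimmer, premium))
```

An insurer whose portfolio mirrors the market breaks even, whereas the portfolio tilted toward low risks is profitable at the same premium; this gap is the gain that risk selection aims at.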

Risk selection can take many forms. On the one hand, health insurers can practice direct risk selection by influencing who signs the insurance contract: for example, insurers may ignore an application from a high-risk person. Individuals who are likely to need some medical care may be asked to sign a contract that provides additional discount services or outright payments. On the other hand, indirect risk selection means designing benefit packages or contracting with service providers in ways that appeal to low-risk individuals but not to high-risk persons. Direct risk selection concerns the problem of individual access to a service, whereas indirect risk selection concerns the problem of quality.

Both forms of risk selection occur only when insurers or their consumers possess information about individual health care costs. Direct risk selection requires insurers to be able to observe characteristics of individuals that correlate with their expected costs: gender, age, social behaviour, and so on. For instance, if healthy people use the Internet more often, a risk selection strategy is to market insurance contracts online; in this case people do not have to know their own risk type. In indirect risk selection, however, people do need to know their risk type: for example, they need to know how likely they are to use certain services. Such personal data allow insurers to develop benefit packages and contract with service providers aimed at different risk types.

Direct and indirect risk selection can take place simultaneously, and measures that exclude one form of selection do not necessarily affect the other. For instance, if the benefit package is strictly regulated, preventing indirect risk selection, insurers may remain interested in attracting favorable risks and thus turn to direct risk selection. Conversely, if insurers cannot select risks directly, they retain the incentive to develop a benefit package that attracts low risks and deters high risks. Indirect risk selection is closely related to the phenomenon of adverse selection in insurance markets, which occurs when policyholders have more information about their risk type than their insurers do. Adverse selection takes place regardless of the actions of the state, whereas indirect risk selection is a consequence of state regulation of premiums.

To prevent unwanted risk-selecting behavior by insurers, certain measures can be taken that rest on the assumption of compulsory health insurance, forcing insurers to cover high risks by means of low risks.

First, open enrollment guarantees that some insurers will take on some high risks. At the same time, legislation, regulation, and reporting may close off obvious opportunities for direct risk selection: for example, the law may limit the insurer's financial and other benefits from taking low risks.

Second, the measure against indirect risk selection is regulation of the benefit package. On the one hand, lower bounds on benefits can be set, forcing insurers to offer benefits that are important for high risks (say, treatment of different types of diabetes). On the other hand, upper bounds on benefits may prevent insurers from including services that appeal to low risks (say, fitness center services) in their contracts. In addition, certain types of benefits that are convenient for risk selection can be regulated by separate provisions. However, the benefit package also includes the supply of services by specific partners named in the contract (say, subcontractors), whom the insurer may select. Such selection is especially important in managed care: for example, by engaging many sports medicine professionals, the insurer can count on attracting healthy-lifestyle advocates (low-risk consumers).

Third, a risk adjustment scheme (RAS) creates incentives through additional payments to insurers who take on high risks and financial sanctions on insurers who skim the cream (favorable risks). These payments depend on observed characteristics such as age and gender. A cost reimbursement scheme (CRS) reimburses a share of the actual costs of medical treatment. The idea of the CRS is to reduce the gains from risk selection by decreasing the impact of costs on insurers' profits. At the same time, the CRS reduces insurers' incentives to control their costs.
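How a risk adjustment scheme and a cost reimbursement scheme shrink the gain from risk selection can be sketched in Python. This is a stylised illustration under stated assumptions: the premium, the expected costs, the perfectly informative RAS transfer, and the 50% reimbursement share are all hypothetical and not taken from any real scheme.

```python
def margin(premium, cost, ras_transfer=0.0, crs_share=0.0):
    """Insurer's margin on one policyholder.

    ras_transfer: risk adjustment payment received (negative means paid).
    crs_share:    fraction of actual treatment cost reimbursed to the insurer.
    """
    return premium + ras_transfer - (1.0 - crs_share) * cost


def selection_gain(margins):
    """Extra margin from insuring a low risk instead of a high risk."""
    return margins["low"] - margins["high"]


# Hypothetical expected yearly costs by risk type and a uniform premium.
cost = {"low": 1000.0, "high": 6000.0}
premium = 2000.0

no_scheme = {r: margin(premium, c) for r, c in cost.items()}

# Assumed RAS: the transfer equals the type's expected cost minus the premium,
# i.e. the observed characteristics identify the risk type perfectly.
with_ras = {r: margin(premium, c, ras_transfer=c - premium) for r, c in cost.items()}

# Assumed CRS: half of the actual cost is reimbursed.
with_crs = {r: margin(premium, c, crs_share=0.5) for r, c in cost.items()}

for name, m in [("none", no_scheme), ("RAS", with_ras), ("CRS", with_crs)]:
    print(name, selection_gain(m))
```

In this sketch each unit of cost saved raises the margin by only `1 - crs_share`, which is the weakened cost-control incentive of a CRS. Real risk adjusters such as age and gender identify risk types only imperfectly, so an actual RAS would leave part of the selection gain in place.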

The RAS and CRS can be justified by modeling risk selection. First, because insurers may differ for various reasons in their terms of insurance for the population, the RAS and CRS can create a competitive system in which an insurer's favorable risk structure does not give it a starting advantage. Second, the health insurance market may be destabilized as new insurers enter the market and move from high to low risks. The RAS and CRS can reduce premium differences among insurers, thereby reducing the incentive for policyholders to switch insurers.

Insurers entering the market can insure mostly low risks by encouraging consumers (policyholders) to change insurers more often, stirring up the overall health insurance market. Because insurers that entered the market earlier are left with high risks, they eventually have to raise their premiums or file for bankruptcy. In such circumstances, insurers have no incentive to invest in providing effective benefits.

Indeed, there is evidence of higher mobility of low risks in the German health insurance market, based on a comparison of the health care expenditure (HCE) of those who change insurers and those who do not: depending on the age category, people who changed insurers had on average 45 to 85% lower HCE than those who did not. Studies based on the German Socio-Economic Panel have shown that adults who remained loyal to their insurer had significantly worse health status than those who changed insurers. In the United States, there is the case of Harvard University's decision to increase employees' contributions to insurance premiums if they did not choose the cheapest option (a Health Maintenance Organization (HMO) plan). Risk types became apparent within a year: those who switched from the most expensive insurance plans to HMOs had a mean age of 46 years and HCE 9% above the overall average, whereas those who remained on expensive plans had a mean age of 50 years and HCE 16% above the overall average. The rapid loss of low risks by the broad insurance plans forced the experiment to stop.

Thus, the RAS and CRS can help ensure a level playing field during the transition to a competitive market and stabilize the health insurance market. In the absence of schemes such as the RAS and CRS, the market may lose its most efficient insurers. For actuaries and other financial professionals, risk adjustment means setting a premium or per capita payment in proportion to the expected expenses of an individual or group. The RAS is based on risk adjusters, that is, the observed characteristics of individuals. Developing an RAS and searching for appropriate risk adjusters require empirical testing of their ability to predict HCE.
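One simple way to test how well candidate risk adjusters predict HCE is to predict each person's HCE by the mean of their cell (the combination of adjuster values) and compute the share of variance explained. The Python sketch below assumes this cell-based approach and uses made-up records; neither the method nor the data comes from the source, and the in-sample figure it computes overstates out-of-sample predictive power.

```python
from collections import defaultdict


def cell_means(records, keys):
    """Mean HCE for every combination of the chosen risk adjusters."""
    sums = defaultdict(lambda: [0.0, 0])
    for r in records:
        cell = tuple(r[k] for k in keys)
        sums[cell][0] += r["hce"]
        sums[cell][1] += 1
    return {cell: total / count for cell, (total, count) in sums.items()}


def r_squared(records, keys):
    """In-sample share of HCE variance explained by the cell means."""
    means = cell_means(records, keys)
    overall = sum(r["hce"] for r in records) / len(records)
    ss_tot = sum((r["hce"] - overall) ** 2 for r in records)
    ss_res = sum((r["hce"] - means[tuple(r[k] for k in keys)]) ** 2 for r in records)
    return 1.0 - ss_res / ss_tot


# Hypothetical records (illustrative values only, not real insurance data).
records = [
    {"age": "young", "disabled": False, "hce": 500.0},
    {"age": "young", "disabled": False, "hce": 700.0},
    {"age": "young", "disabled": True, "hce": 3000.0},
    {"age": "old", "disabled": False, "hce": 1500.0},
    {"age": "old", "disabled": False, "hce": 1700.0},
    {"age": "old", "disabled": True, "hce": 6000.0},
]

r2_age = r_squared(records, ["age"])
r2_age_disability = r_squared(records, ["age", "disabled"])
print(f"age only: {r2_age:.3f}, age and disability: {r2_age_disability:.3f}")
```

Adding an adjuster can never lower this in-sample measure, so a practical evaluation would compare adjuster sets on held-out data.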

Socio-demographic variables can serve as risk adjusters. Since age and gender have relatively little explanatory power, other socio-demographic variables have been studied: marital status, retirement status, disability status, educational level, and income level. Data from German health insurance funds showed that elderly pensioners with disabilities have significantly higher HCE. In addition, single retirees and low-income individuals show higher HCE.

HCE in previous periods is an obvious indicator of morbidity: an increase in HCE in one period leads to an increase in HCE in the next period of 20 to 30%. At the same time, the explanatory power of past HCE should be weighed against the weakening of a person's incentives to reduce costs, because higher current HCE will to some extent be compensated to the person later. It is through HCE that insurers try to identify favorable risks, and there may be no better risk adjuster. Prescription medications in previous periods have also been found to predict HCE. Morbidity can be measured by gathering available diagnostic information to identify chronically ill patients and to classify individuals according to their expected HCE. This classification can be done by various methods. Empirical studies show that diagnostic information gives an accurate prediction of HCE. However, gathering this information can be expensive. Because insurers benefit from favorable diagnoses for their policyholders, they also have an interest in how the relevant information is interpreted (upcoding): insurers can encourage their policyholders to consult doctors more often in order to collect as many diagnoses as possible.

Many countries and health care systems use diagnostic information to determine the reimbursement of service providers, which makes the necessary data available. To process and analyze these data, software implementations are being developed for constructing classifiers, selecting informative features, and processing heterogeneous medical and biological variables for scientific research in clinical medicine.

One of the goals of the research is the development of approaches and software tools to ensure the reproducibility of the numerical experiments conducted in the framework of the joint project. The goal of the project is to develop effective methods and software for constructing classifiers and selecting informative features, and to create a prototype of an intelligent analytical system that is a software implementation of all stages of data processing and analysis and is aimed at conducting research in the field of clinical medicine. This system will implement the functions of integrating clinical and molecular patient data, determining diagnostic biomarkers and their combinations, building classifiers of complex diseases (oncological diseases) based on integrated data, and identifying new disease subtypes to improve treatment methods and increase their efficiency. The second goal includes the development of the approaches to

The large amount of research devoted to the development of mathematical methods of data handling, particularly classification models, is due, on the one hand, to the wide range of possible applications and, on the other hand, to the complexity of these problems, which requires the development and improvement of means to solve them (see, for example, [5−9]). In addition to general requirements for the efficiency of the created software, attention must be paid to the availability of large and heterogeneous data sets, to the ability to transfer programs from one hardware platform to another, and to their performance in cloud computing.

ISSN 1561-5359. Artificial Intelligence, 2020, № 3

Moreover, one of the most important requirements is the reproducibility of numerical experiments. Reproducibility of research is one of the basic scientific principles. However, a "reproducibility crisis" has been recognized in science [10−11]. This crisis has affected almost all branches of science and, to a large extent, biology and medicine. Much effort has been made recently to overcome this crisis, including the development of software and software platforms to ensure the reproducibility of scientific computing. Computing in biology and medicine involves the use of high-performance computing technologies (including clusters and grid technologies). However, the introduction of modern technologies to ensure the reproducibility of calculations in this area is quite slow [12, p. 731]. As a result, on clusters that do not have the appropriate software installed, there is a contradiction between modern requirements for the reproducibility of scientific calculations and the ability to achieve it by old means.

Creating a containerized application was not a planned stage of our study. The need arose primarily from the way the real data on which the software was tested had to be accessed. Only later did the authors realize that they had gained other advantages, the most important of which is the reproducibility of numerical experiments. The first purpose of this publication is to share this experience.

The second purpose is to briefly describe our efforts to develop specialized computer methods and models for solving vital tasks in the field of biomedicine. An enormous amount of biomedical and clinical data has now been collected in public and private repositories. These data can be freely accessed and offer a wide field for experiments with newly developed scientific approaches and for their comparison. The integration of heterogeneous information sources is one of the urgent applied problems that we have tried to solve in our project. The hybrid classification model forms the basis of the intelligent analytical system and aims to integrate several sources of biomedical information in order to improve the diagnosis and prognosis of complex diseases.

Based on the approaches presented in [13−14], optimization models and methods for constructing linear classifiers have been developed. In particular, the problem of constructing classifiers for linearly inseparable sets was formulated as the problem of minimizing the band of incorrect classification of training sample points. This model belongs to the class of non-convex programming problems and is multiextremal. Various formulations of this problem are proposed, and approaches to constructing approximate solutions and computing estimates of optimal values are considered. An interesting geometric interpretation of the problems of constructing linear classifiers can be found in [15].

To solve these optimization problems, methods of non-smooth optimization were used, namely the r-algorithms of N.Z. Shor [16−17] and exact penalty functions [18–19]. When creating the corresponding software, modern linear algebra libraries such as [20−22] should be used to speed up arithmetic operations. This combination of algorithms based on non-smooth optimization methods with modern linear algebra libraries was implemented in the developed software module NonSmoothSVC.

To test the capabilities of the new classifier NonSmoothSVC, it was compared with existing tools. Methods integrated into the scikit-learn library [12; 23] were chosen, namely LinearSVC, NuSVC, and AdaBoost. The last two methods are non-linear classifiers; they were chosen to obtain additional information about the advantages of different methods for different problems. The first numerical experiments were carried out on specially generated artificial data.

The computational experiments aimed to establish the speed and predictive properties of the new software compared with existing tools. Both artificially created data and real medical data were used in the test problems. Training and control samples for the randomly generated problems were formed as identically distributed data points in a unit cube in the feature space. Then the points of the first class were shifted in the first coordinate by the value δ, and the points of the second class were shifted in the first coordinate by the value (-1-δ). When δ>0, the training and control samples are linearly separable, and when δ<0, they are linearly inseparable. Next, a rotation (linear transformation) of the space was performed so that the separating hyperplane depended on many coordinates of the space. The need to test the new software on real data forced us to package the software module NonSmoothSVC into a containerized application (using Docker technology [24]) for use on a personal computer as well as in cluster, grid, and cloud environments.
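The construction of the randomly generated test problems described above can be sketched as follows. This is a minimal illustration, not the authors' generator: the function names, the use of NumPy, the uniform distribution in the cube, and the choice of a random orthogonal matrix (obtained from a QR decomposition) as the "rotation" are all assumptions made for the example.

```python
import numpy as np

def make_test_problem(n_per_class, n_features, delta, seed=0):
    """Two classes drawn from a unit cube, shifted along the first axis, then rotated."""
    rng = np.random.default_rng(seed)
    # Identically distributed points in the unit cube [0, 1)^n_features (assumed uniform).
    first = rng.random((n_per_class, n_features))
    second = rng.random((n_per_class, n_features))
    # Shift class 1 by delta and class 2 by (-1 - delta) in the first coordinate.
    first[:, 0] += delta
    second[:, 0] += -1.0 - delta
    X = np.vstack([first, second])
    y = np.concatenate([np.ones(n_per_class, dtype=int),
                        -np.ones(n_per_class, dtype=int)])
    # Random orthogonal matrix (a rotation, possibly combined with a reflection),
    # so that the separating hyperplane depends on many coordinates.
    Q, _ = np.linalg.qr(rng.standard_normal((n_features, n_features)))
    return X @ Q, y, Q

def separated_by_known_hyperplane(X_rot, y, Q):
    """Check separation by the hyperplane whose normal is the rotated first axis."""
    # X_rot @ Q[0] recovers the first coordinate of each point before the rotation.
    projection = X_rot @ Q[0]
    return projection[y == 1].min() > projection[y == -1].max()

X_train, y_train, Q = make_test_problem(n_per_class=200, n_features=10, delta=0.05)
print(separated_by_known_hyperplane(X_train, y_train, Q))
```

With a positive δ the two shifted cubes are 2δ apart along the original first axis, so the check succeeds; with a negative δ the cubes overlap and the same check fails for any reasonably large sample.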

Fig. 1. Comparative density distribution

for full data set (n=12000)

This made it possible to access real data on the Cancer Genomics Cloud [25], a specialized cloud platform that provides free access to genetic and medical databases, in particular The Cancer Genome Atlas (TCGA) [26], and to more than 450 public applications designed to analyze such data. Users can extend this list with their own applications, data sets, and research results (currently there are more than one million on this service) and can involve other researchers in their projects. The computational experiments demonstrated that on some data sets NonSmoothSVC has qualitative advantages over the other methods in the comparison but is inferior in speed. In particular, on linearly separable samples NonSmoothSVC gained an advantage over LinearSVC in the number of cases with better classification accuracy.

Fig. 2. Comparative density distribution

for imbalanced data set (n=3720)

On the imbalanced samples, the NonSmoothSVC software slightly outperformed the LinearSVC software in the number of cases with better classification accuracy on average, and demonstrated an advantage in some parts of the classification accuracy scale (Fig. 1–6).

A full description of the numerical experiments and the testing results can be found in the reports (in Ukrainian) at http://moderninform.icybcluster.org.ua/ais/.


Thanks to the containerized form, the developed software can be made publicly available as tools and applications on this and other services for constructing optimized linear classifiers with modern linear algebra libraries.

Fig. 3. Comparative density distribution

for large data set (n=7200)

Where technically possible, parallelization on microprocessor networks looks promising.

This approach is especially recommended for large data samples, when the dimension of the feature space is in the tens of thousands. It is also necessary to take into account the features of the optimization problems in specific cases. In particular, additional requirements formulated by specialists may reduce the number of informative features.

The processing and study of biomedical data have some peculiarities. These include, in particular, possible large errors that arise in the processing of medical information; a huge number of features that need to be taken into account, which increases the dimensionality of the corresponding optimization problems; and missing measurements, which require the use of specialized methods for their processing and analysis.

In order to improve the diagnosis and treatment of complex diseases, much attention is paid to the comprehensive analysis of various biomedical and clinical data to understand the processes occurring in the body at the cellular level and the changes caused by the development of the disease.

It is known that the cause of complex diseases, along with external factors, is a combination of genetic failures, so a single genetic mutation cannot be fixed as a biomarker. The difficulty also lies in the fact that individual genetic factors can differ, and individual cases of the same disease (phenotype) can be caused by different genetic changes. In addition, when several mutations act in combination, the individual effect of each of them can be rather insignificant and therefore difficult to detect.

It is also necessary to take into account the high heterogeneity of a complex disease, i.e. the heterogeneity of its observed manifestations (phenotypes).

Recently, the methods of systems biology have become widely used to study complex diseases; these methods draw on knowledge about the interactions between genes, their products, and small molecules, which form a complex network of interactions. This approach makes it possible to explain the appearance of similar phenotypes despite different genetic causes, namely through their interconnection and their influence (dysregulation) on the same component of the cellular system. Thus, the use of the interactome in conjunction with other data from biogenetic studies can contribute to understanding the processes occurring at the molecular level in complex diseases. Combining heterogeneous data makes it possible to determine dysregulated cellular pathways, to reveal the relationship between genotype and phenotype, and to explain the heterogeneity of a complex disease.


Fig. 4. Comparative density distribution

for small data set (n=2400)

Natural approaches here are to increase the efficiency of tools for solving such optimization problems and to use methods for selecting informative features. The works [27−30] pay attention to the preliminary preparation of available medical data in order to select informative features.

In the course of the project, algorithms for preprocessing and extracting biomarkers from biomedical data were developed, including an algorithm for ranking features by information content for classification [23] and an algorithm for identifying combinations of biomarkers that takes into account the correlation of features and makes it possible to exclude their influence.

Fig. 5. Comparative density distribution

for balanced data set (n=8280)

Fig. 6. Comparative density distribution

for imbalanced+small data set (n=720)


Moreover, several approaches were analyzed for identifying a subset of informative features from several data sources, namely gene expression data and data on functional and physical interactions of genes and their products, presented in the form of networks. Based on the analysis of existing approaches, an algorithm for identifying a subset of features has been developed that integrates interactomic and transcriptomic data to determine functional subnetworks associated with the disease. Preprocessing of biomedical data made it possible to reduce the feature space and thereby increase the accuracy of classification models.

A detailed description of the algorithms and related information can be found in the report at http://moderninform.icybcluster.org.ua/ais/ (in Russian).

In one of the numerical experiments, the real data contained gene expression information for cancer patients (143 observations of 60,483 features) obtained from The Cancer Genome Atlas (TCGA). Using the simplified feature ranking method proposed by Novoselova [28], the 23 most informative features for predicting the vital status of patients diagnosed with glioblastoma were identified from these data. This approach substantially reduces the numerical difficulties of subsequent data processing.
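To illustrate the general idea of reducing tens of thousands of features to a short ranked list, the sketch below scores each feature by the difference of class means relative to the within-class spread and keeps the top-ranked ones. This is not the simplified method of [28], whose details are not given here; the scoring rule, the function name `rank_features`, and the synthetic data are assumptions chosen only for illustration.

```python
import numpy as np

def rank_features(X, y, top_k):
    """Rank features by |mean difference| / (sum of class standard deviations)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    positive, negative = X[y == 1], X[y == 0]
    # A small constant keeps the division safe for constant features.
    spread = positive.std(axis=0) + negative.std(axis=0) + 1e-12
    score = np.abs(positive.mean(axis=0) - negative.mean(axis=0)) / spread
    order = np.argsort(-score)  # indices of features, best first
    return order[:top_k], score

# Synthetic stand-in for an expression matrix: 40 observations, 500 features,
# with feature 7 made informative on purpose.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((40, 500))
y_demo = np.array([0] * 20 + [1] * 20)
X_demo[y_demo == 1, 7] += 5.0

top_features, scores = rank_features(X_demo, y_demo, top_k=23)
X_reduced = X_demo[:, top_features]  # reduced matrix passed on to the classifier
print(top_features[:5], X_reduced.shape)
```

Any univariate ranking of this kind ignores correlations between features, which is one reason the project also considers combinations of biomarkers.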

Because various sources of biological information characterize different changes occurring in the body at the cellular level during the development of a complex disease, it is assumed that combining them will improve the accuracy of diagnosing the disease subtype and the reliability of the disease prognosis and of the predicted response to therapy [31]. In addition, combining heterogeneous data will make it possible to discover relationships between various biomedical entities (genes, proteins, metabolites, etc.) directly related to the development of the disease, to compensate for noise and errors in individual data sources, and thereby to obtain more reliable results. A common difficulty in this task is how to combine information from different data sources. In our study, the methods of interest are those for constructing classifiers based on various sources of multidimensional data, which, as a rule, have heterogeneous representations. Consequently, the task is to unify this representation, determine the base classifier, build classification models on each data source, and select ways to combine the predicted values obtained using the constructed models.
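One simple way to combine the predicted values of models built on different data sources is a weighted average of their class probabilities. The sketch below is only an example of such a combination rule, not the integration scheme used in the project; the function name, the weights, and the probability values are hypothetical.

```python
def combine_predictions(per_source_probs, weights):
    """Weighted average of class-1 probabilities predicted by per-source models."""
    if len(per_source_probs) != len(weights):
        raise ValueError("one weight is required for each data source")
    total_weight = sum(weights)
    n_patients = len(per_source_probs[0])
    combined = []
    for i in range(n_patients):
        weighted_sum = sum(w * probs[i] for w, probs in zip(weights, per_source_probs))
        combined.append(weighted_sum / total_weight)
    return combined

# Hypothetical outputs of two models for three patients.
expression_model = [0.9, 0.2, 0.6]
copy_number_model = [0.7, 0.4, 0.5]

combined = combine_predictions([expression_model, copy_number_model], weights=[2.0, 1.0])
labels = [1 if p >= 0.5 else 0 for p in combined]
print(combined, labels)
```

The weights can be read as a rough measure of how much each data source contributes to the final decision.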

The core of the intelligent analytical system being developed is a hybrid classification model that combines several sources of biological information about patients in order to diagnose subtypes of complex diseases characterized by genetic disorders. The proposed hybrid model is a classification ensemble with the following distinctive features:

1. Uniform representation of information from various data sources by constructing a matrix of object-object distances using various kernel functions (density functions), including the Gaussian function, the polynomial function, the scalar product of vectors, etc.
2. A procedure for selecting classification features for each individual data source.
3. Construction of a basic (individual) classifier of the hybrid model, which can be either a single classifier or an ensemble of classifiers built on a single data source.
4. Several ways of integrating the individual classifiers of the model.
5. Analysis of the information content of individual classifiers by assessing their weight coefficients.
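The uniform representation in item 1 can be illustrated with standard kernel functions: whatever the number of features in a data source, the result is an object-by-object matrix of the same size. The sketch below uses the textbook definitions of the linear, polynomial, and Gaussian kernels; the parameter values, function names, and random data are assumptions for the example and are not taken from the project software.

```python
import numpy as np

def linear_kernel(X):
    """Scalar products of all pairs of objects (rows of X)."""
    return X @ X.T

def polynomial_kernel(X, degree=2, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree."""
    return (X @ X.T + coef0) ** degree

def gaussian_kernel(X, gamma=0.5):
    """Gaussian kernel exp(-gamma * ||x - y||^2)."""
    squared_norms = np.sum(X * X, axis=1)
    squared_dists = squared_norms[:, None] + squared_norms[None, :] - 2.0 * (X @ X.T)
    # Rounding can produce tiny negative values; clip them to zero.
    np.maximum(squared_dists, 0.0, out=squared_dists)
    return np.exp(-gamma * squared_dists)

# Two hypothetical data sources describing the same 6 patients
# with different numbers of features.
rng = np.random.default_rng(0)
expression = rng.standard_normal((6, 1000))
copy_number = rng.standard_normal((6, 40))

K_expression = gaussian_kernel(expression, gamma=1.0 / expression.shape[1])
K_copy_number = polynomial_kernel(copy_number)
print(K_expression.shape, K_copy_number.shape)
```

Both matrices are 6 by 6, so classifiers that accept a precomputed kernel can be trained on either source in the same way.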

The method for constructing the hybrid model combines the bagging procedure and the aggregation of ranked lists, which are used to build the basic classifiers, with a pruning procedure that determines the final structure of the model. This makes it possible to adjust the ensemble adaptively to the type of data being classified.
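The aggregation of ranked lists can be illustrated with a Borda count, in which each list awards points to a feature according to its position and the points are summed. The source does not say which aggregation rule is used in the hybrid model, so the Borda rule, the function name, and the gene labels below are assumptions made for the example.

```python
from collections import defaultdict

def borda_aggregate(ranked_lists):
    """Merge several ranked feature lists into one consensus ranking (Borda count)."""
    scores = defaultdict(int)
    for ranking in ranked_lists:
        n = len(ranking)
        for position, feature in enumerate(ranking):
            # The first place gets n points, the last place gets 1 point.
            scores[feature] += n - position
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda feature: (-scores[feature], feature))

# Hypothetical rankings obtained on three bootstrap samples (as in bagging).
rankings = [
    ["g1", "g2", "g3"],
    ["g2", "g1", "g3"],
    ["g2", "g3", "g1"],
]
consensus = borda_aggregate(rankings)
print(consensus)
```

In a bagging setting each ranked list would come from a different bootstrap sample of the training data, and the consensus ranking would then be used to build a basic classifier.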

Preliminary experiments on the TCGA data [26] showed that ensembles built on heterogeneous data sources can sufficiently increase the accuracy of classification and prediction of subtypes of complex diseases, since each of the data sources describes the organism under study from a different perspective: gene expression data, ribonucleic acid (RNA) sequencing, metabolic data, gene copy number data, etc.

Ensuring the reproducibility of calculations is a prerequisite for the reproducibility of scientific research as a whole. The conditions for computational reproducibility are the availability of source data, the ability to reproduce an identical computing environment (or an environment that does not lead to different calculation results), and the availability of the results of computations. Biomedical calculations have their own specific features that should be taken into account when planning them. Let us mention some of them.

Modern biomedical calculations, especially those based on genome data, are very large and cumbersome. "Classic" biomedical applications (PAML, Muscle, MAFFT, MrBayes, BLAST, etc.) and large libraries implementing biomedical algorithms in different programming languages (C/C++, Java, R, Go, Scala, Haskell, Perl, Python, Ruby, Erlang, Julia, etc. [32]) are quite often used simultaneously in one study. Moreover, biomedical calculations often involve methods of artificial intelligence, such as machine learning and pattern recognition, and the corresponding libraries (e.g., scikit-learn [6], [17]). Such a variety of software requires careful configuration of the computing environment with control over the versions of the libraries used (dozens or even hundreds of libraries may be involved).

Otherwise, the results of calculations may not be reproducible. On clusters, creating such environments (separately for each user) and maintaining them in a conflict-free state is quite a burdensome task (unless special software configuration tools are used, such as Conda or Bioconda, or applications are containerized using, for example, Singularity). Most of the libraries and applications used in biomedical computing do not make efficient use of parallel multithreaded computing on multi-core processors; at the same time, many of them fit the "embarrassingly parallel" model, in which individual pieces of data are processed in parallel by identical instances of computational processes without passing messages between them (for example, using Apache Hadoop technology) [12].
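The "embarrassingly parallel" model can be shown on a toy task: each piece of data is handled by the same function, and the workers never exchange messages. The sketch below uses the Python standard library rather than Apache Hadoop; the task (GC content of short sequences) and the function names are assumptions chosen only to keep the example small. A thread pool is used for simplicity; for CPU-bound work, `ProcessPoolExecutor` from the same module offers the same `map` interface.

```python
from concurrent.futures import ThreadPoolExecutor

def gc_content(sequence):
    """Fraction of G and C bases in one sequence (independent of all other sequences)."""
    s = sequence.upper()
    if not s:
        return 0.0
    return (s.count("G") + s.count("C")) / len(s)

def process_all(sequences, max_workers=4):
    """Apply the same computation to every piece of data in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves the input order; workers do not communicate with each other.
        return list(pool.map(gc_content, sequences))

results = process_all(["GGCC", "ATAT", "ATGC"])
print(results)
```

Because the pieces are independent, increasing the number of workers (or containers running the same code) scales the computation horizontally.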

Taking into account the peculiarities of biomedical computing, reproducibility and horizontal scaling (the ability to increase the number of identical computing units working on one problem) can be achieved through the use of containerized applications, software pipelines, and parameterization of the software environment.

Technologies of containerization of software applications. Containerization of biomedical applications (with Docker or Singularity) makes it possible to achieve reproducibility of the conditions in which the calculations took place (an unchanged software stack, including applications and libraries) and horizontal scaling, provided that the "embarrassingly parallel" model is used, in cluster (Singularity) and cloud (Docker) calculations.

Technologies of software pipelining of calculations. A software pipeline makes it possible to organize flow calculations (calculations in which the inputs and outputs of processes are interconnected). With tools for automating flow calculations (workflow engines) such as CWL (Common Workflow Language), GWL (Guix Workflow Language), Snakemake, and Nextflow, a specific calculation can be presented in the form of a task (a text file, usually in YAML or JSON format) whose results can be reproduced [7]. In addition, there are tools that make it possible to create or display such tasks in the form of a graph of processes and data flows. An example of such a tool is RABIX (Reproducible Analyses for Bioinformatics), a graphical editor for CWL. Some pipeline tools also use containerization (for example, CWL); such tasks can be performed both on a personal computer and in a cloud environment. An important feature of workflow automation tools is that the task description syntax makes it possible to specify the scale of the calculations by indicating the amount of resources required. Seven Bridges' product, the Cancer Genomics Cloud (CGC, see http://www.cancergenomicscloud.org/), is an example of a cloud software platform for performing reproducible biomedical computations using containerization and pipelining. It was the use of containerization in creating an application for constructing a linear classifier at the V.M. Glushkov Institute of Cybernetics of the National Academy of Sciences of Ukraine that made it possible to conduct testing on real, very voluminous medical data located at the CGC.
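As an illustration of a task description that names a container image and the required resources, the sketch below builds a CWL `CommandLineTool` document as a Python dictionary and serializes it to JSON. The field names (`cwlVersion`, `DockerRequirement`, `dockerPull`, `ResourceRequirement`, `coresMin`, `ramMin`) belong to the CWL standard; the image name, the command, and the input and output names are hypothetical and do not describe the actual NonSmoothSVC application.

```python
import json

def build_cwl_tool(image, command, cores, ram_mb):
    """Return a CWL CommandLineTool description as a plain dictionary."""
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": command,
        "requirements": [
            # The container image fixes the software environment of the step.
            {"class": "DockerRequirement", "dockerPull": image},
            # The resource request tells the platform how to scale the step.
            {"class": "ResourceRequirement", "coresMin": cores, "ramMin": ram_mb},
        ],
        "inputs": {
            "training_data": {"type": "File", "inputBinding": {"position": 1}},
        },
        "outputs": {
            "model": {"type": "File", "outputBinding": {"glob": "model.json"}},
        },
    }

# Hypothetical image and command, used only for illustration.
tool = build_cwl_tool(
    image="example/linear-classifier:0.1",
    command=["python", "train.py"],
    cores=4,
    ram_mb=8192,
)
cwl_text = json.dumps(tool, indent=2)
print(cwl_text)
```

Since the whole task is a text document, it can be stored next to the data and results and rerun later on a personal computer or on a cloud platform that supports CWL.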

Technologies for parameterization of the software environment. Parameterization of the software environment makes it possible to reproduce, if necessary, an identical computing environment. GNU Guix, Conda, and Bioconda are examples of tools that make it possible to create an isolated software environment for individual users on a cluster [12].
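Tools such as GNU Guix and Conda describe an environment declaratively. A much simpler, complementary step is to record the versions of the interpreter and the installed packages together with the results of a calculation. The sketch below does this with the Python standard library only; it is an assumption-level illustration of recording an environment, not a replacement for the tools named above, and the function name is hypothetical.

```python
import json
import platform
import sys
from importlib import metadata

def snapshot_environment():
    """Collect interpreter and package versions of the current Python environment."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "packages": dict(sorted(packages.items())),
    }

snapshot = snapshot_environment()
snapshot_text = json.dumps(snapshot, indent=2)
print(snapshot_text[:200])
```

Saving such a snapshot with every run makes it easier to rebuild a matching environment later or to explain why two runs differ.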

At present, there is a range of technologies for ensuring the reproducibility of scientific calculations in cloud and cluster environments. This makes it possible to create biomedical applications adapted to these environments. As a result, we obtain a computational basis that satisfies modern requirements for computational reproducibility.

The experience of using the developed linear classifier, gained during its testing on artificial and real data, allows us to identify several advantages of the containerized form of the created application: it provides access to real data located in a cloud environment; calculations for research problems can be performed on cloud resources both with the developed tools and with cloud services; this form of research organization makes numerical experiments reproducible, i.e. any other researcher can compare the results of their own developments on specific data that have already been studied by others in order to verify the conclusions and the technical feasibility of new results; and the developed tools can be used on technical devices of various classes, from a personal computer to a powerful cluster.

Conclusions

The next steps of the project include the development of a common software interface for the experimental prototype of the intelligent analytical system in order to integrate the developed methods and software modules for biomedical data preprocessing, clustering, and classification. It will make it possible to perform all the steps of data analysis within a single framework and to conduct research in the field of biomedicine. The hybrid classification model, as the core of the intelligent system, will make it possible to integrate multidimensional, heterogeneous biomedical data in order to better understand the molecular courses of disease origin and development and to improve the identification of disease subtypes and disease prognosis. Much attention will be paid to experimentation with different computational approaches on real data sets, taking into account the reproducibility of results.

References

1. Knopov P.S., Norkin V.I., Atoyev K.L., Gorbachuk V.M., Kyryliuk V.S., Bila H.D., Samosyonok O.S., Bogdanov O.V. (2020). Some approaches to the use of stochastic models of epidemiology to the COVID-19 problem. Kyiv: V.M.Glushkov Institute of Cybernetics, Retrieved from http://incyb.kiev.ua/archives/3988/dejaki-pidhodi-vikoristannja-stohastichnih-modelej-epidemiologii-do-problemi-covid-19/ (In Ukrainian).
2. Gorbachuk V., Gavrilenko S. (2020). Analysis for dynamics of COVID-19 spreading in Ukraine and neighboring countries on May 1–10, 2020. Global and regional problems of informatization in society and nature using 2020. Kyiv: National University of Life and Environmental Sciences of Ukraine, 56–60. (In Ukrainian).
3. Gorbachuk V.M., Dunaievskyi M.S., Suleimanov S.-B. (2020). Management and administration in the field of health care services. Management and administration in the field of services: selected examples. T.Pokusa, T.Nestorenko (eds.) Opole: Academy of Management and Administration, 268−279. (In Ukrainian).

4. Gorbachuk  V.M.,  Suleimanov  S.-B.,  Batih  L.O. 
(2020). Decision making criteria  in  the branch of 
health care. Measurement  and control in complex 
systems.  Vinnytsia:  VNTU,  149–151.                   
(In Ukrainian). 
5. Vorontsov K.V. Mathematical methods of learning 
by  precedents  (Machine  Learning  Theory)  (in 
Russian),  Retrieved  from: 
http://www.machinelearning.ru/wiki/images/6/6d/
Voron-ML-1.pdf 
6. Gupal  A.M.,  Sergienko  I.V.  Symmetry  in  DNA. 
Methods  for    Discrete  Sequences  Recognition. 
Kyiv. Naukova Dumka (in Russian). 
7. Baldi P., Hatfield W.G. (2011). DNA Microarrays 
and Gene Expression. From Experiments  to Data 
Analysis  and  Modeling.  Cambridge  University 
Press.  
8. Kuhn M.,  Johnson K.  (2013). Applied  predictive 
modeling. New York: Springer. 
9. Heath  L.S.,  Ramakrishnan  N.  (2010).  Problem 
solving  handbook  in  computational  biology  and 
bioinformatics. NY: Springer Science & Business 
Media. 
10. Ioannidis J. (2005). Why Most Published Research 
Findings Are False. PLoS Medicine, vol. 2, no. 8, 
p. 124. 
11. Baker  M.  (2016). Reproducibility  crisis?  Nature, 
vol. 26, no. 533, 353-66. 
12. Strozzi  F.  et  al. (2019). Scalable  workflows  and 
reproducible  data  analysis  for  genomics. 
Evolutionary Genomics, 2nd ed., New York, NY: 
Humana Press, 723-745. 
13. Zhuravlev  Y.,  Laptin  Y.,  Vinogradov  A., 
Zhurbenko  N., Lykhovyd  O.,  Berezovskyi  O. 
(2017).  Linear  classifiers  and  selection  of 
informative  features.  Pattern  Recogn.  and Image 
Anal., vol. 27, no. 3, 426-432. 
14. Zhuravlev  Y.,  Laptin  Y.,  Vinogradov  A. (2014). 
Comparison of Some Approaches to Classification 
Problems,  and Possibilities to  Construct Optimal 
Solutions  Efficiently.  Pattern  Recogn.  and  Image 
Anal., vol. 24, no. 2, 189-195. 
15. Zhurbenko  N.G.  (2020).  Linear  classifier  and 
projection on polytop. Cybern. Syst. Anal., vol. 56, 
no. 3, 1-8. 
16. Shor N.Z., Zhurbenko N.G. (1971). A minimization 
method  using  the  operation  of  extension  of  the 
space  in  the  direction  of  the  difference  of  two 
successive gradients. Cybernetics, vol. 7, 450-459. 
17. Shor  N.Z.  (1998). Nondifferentiable Optimization 
and Polynomial Problems. London: Kluwer Acad. 
Publ.  
18. Laptin Yu.P. (2016). Exact penalty functions and 
convex  extensions  of  functions  in  decomposition 
schemes  in  variables.  Cybernetics  and  Systems 
Analysis,  vol.  52, 85–95.   DOI:  10.1007/s10559-
016-9803-8. 
19. Laptin  Yu.P.,  Bardadym  T.A.  (2019).  Problems 
related  to  estimating  the  coefficients  of  exact 
penalty  functions.  Cybernetics  and  Systems 
Analysis,  vol.  55,  no.  3,  400-412.  DOI: 
10.1007/s10559-019-00147-2. 
20. Chang,  Chih-Chung;  Lin,  Chih-Jen  LIBSVM  – 
A Library for Support Vector Machines. Retrieved 
from https://www.csie.ntu.edu.tw/~cjlin/libsvm/. 
21. BLAS  (Basic  Linear  Algebra  Subprograms). 
Retrieved from http://www.netlib.org/blas/. 
22. LAPACK—Linear  Algebra  PACKage.  Retrieved 
from http://www.netlib.org/lapack/. 
23. Free  software  machine  learning  library  for  the 
Python  programming  language.  Retrieved  from 
https://scikit-learn.org/stable/index.html 
24. Tools  for  creation  of  isolated  Linux-containers. 
Retrieved from https://www.docker.com/ 
25. The  Cancer  Genomics  Cloud.  Retrieved  from 
http://www.cancergenomicscloud.org/  
26. The  Cancer  Genome  Atlas  (TCGA).  Retrieved 
from  https://www.cancer.gov/about-
nci/organization/ccg/research/structural-
genomics/tcga 
27. Novoselova  N.A.,  Tom  I.E.  (2018).  Integrated 
network approach  to  protein function  prediction. 
The  Scientific  Journal  of  Riga  Technical 
University.  Information  Technology  and 
Management  Science,  vol. 21,  98–103.  DOI: 
10.7250/itms-2018-0016 
28. Tom I.E.  (2016). Information technologies in the 
analysis of medical data. Science and innovations, 
no. 3, 28-31. 
29. Novoselova  N.A.,  Tom  I.E.  (2016).  Method  for 
constructing clusters in genetic data. Informatika, 
no.1(49), 64-74. 
30. Novoselova N.A., Tom I.E. (2013). Algorithm for 
ranking features for detecting biomarkers in gene 
expression  data. Artificial Intelligence, no.  3, 58-
68.  
31. Novoselova  N.A., Tom  I.E., Ablameyko  S.V. 
(2011).  Evolutionary  design  of  the  classifier 
ensemble. Artificial Intelligence, no. 3, 429-438. 
32. Bonnal  R.  et  al.  (2019).  Sharing  Programming 
Resources  Between  Bio*  Projects.  Evolutionary 
Genomics,  2nd  ed.,  New  York,  NY:  Humana 
Press,  747-766.  DOI:  10.1007/978-1-4939-9074-
0_25  
Received 10.06.2020 
Accepted 12.08.2020 
 

ResearchGate has not been able to resolve any citations for this publication.

Analysis for dynamics of COVID-19 spreading in Ukraine and neighboring countries on May 1–10, 2020

Conference Paper

Full-text available

May 2020

As of May 10, 2020, the number of people infected by COVID-19 in Dnipropetrovsk, Zakarpattya, Ivano-Frankivsk, Kyiv, Lviv, Odesa, Rivne, Ternopil, Chernivtsi regions, city of Kyiv exceeded that indicator in Georgia (country). Besides, that number in Chernivtsi region and city of Kyiv exceeded the indicators in Slovakia and Bulgaria. Taking into account the moderate, in comparison with other countries, number of testings on COVID-19 in Ukraine, the actual epidemic situation in Ukraine is worse than that in a range of neighboring countries despite of tougher declared restrictive measures and correspondingly higher social-economic losses. Therefore, the accurate data study for dynamics of COVID-19 spreading in Ukraine and neighboring countries is the topical issue as well as the analysis for factors of real condition of epidemic situation in Ukraine. The social distance, related with sociocultural traditions, social organization, implementation of state functions, is the important factor. The World Bank selects the five basic state functions: defense and security, law and order, macroeconomic management, protection of property rights, state system of health care. In the contemporary information era, successful implementation of those functions presumes an efficient application of modern information and communication technologies based on competitive domestic scientific or practical research and development. Thus, the current epidemic situation in Ukraine (which refers to some other infectious diseases) is determined by the general level of statehood, responsibility of state employees, and social consciousness. The conditions of institutions and workers in Ukraine's health care sector during the COVID-19 epidemic highlighted the complex of problems available, caused by the shortcomings in organization of applications of modern information and communication technologies.

Management and administration in the field of health care services

Chapter

Full-text available

Nov 2020

Based on the principle «Who canʼt measure cannot manage», the article introduces key performance indicators in the field of health care services: average and incremental cost-effectiveness ratios, cost-utility ratios, and cost-benefit ratios, used within cost-effectiveness analysis, cost-utility analysis, and cost-benefit analysis, respectively. Both the indicators and the analyses are compared in terms of practical applications. The indicators of quality-adjusted life years and net present value are discussed as well.
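The abstract names these indicators without giving formulas. As a rough illustration only, the sketch below uses the common textbook definitions: the average ratio is cost divided by effect, the incremental ratio is the cost difference divided by the effect difference (effects could be measured in quality-adjusted life years), and net present value is the sum of discounted cash flows. The function names and all numbers are hypothetical and are not taken from the chapter.

```python
def average_cost_effectiveness_ratio(cost, effect):
    """Average ratio: cost per unit of effect (for example, per QALY)."""
    if effect == 0:
        raise ValueError("effect must be non-zero")
    return cost / effect


def incremental_cost_effectiveness_ratio(cost_new, effect_new, cost_old, effect_old):
    """Incremental ratio: extra cost per extra unit of effect of the new option."""
    delta_effect = effect_new - effect_old
    if delta_effect == 0:
        raise ValueError("the two options have the same effect")
    return (cost_new - cost_old) / delta_effect


def net_present_value(rate, cash_flows):
    """Discounted sum of cash flows; cash_flows[0] occurs at time 0."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cash_flows))


if __name__ == "__main__":
    # Hypothetical numbers: the new treatment costs 2000 more and adds 1 QALY.
    print(incremental_cost_effectiveness_ratio(12000, 6.0, 10000, 5.0))
    print(net_present_value(0.05, [-1000, 400, 400, 400]))
```

A lower incremental ratio means the new option buys each additional unit of effect more cheaply. Whether it is acceptable depends on a willingness-to-pay threshold that this sketch does not model.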

Sharing Programming Resources Between Bio* Projects

Chapter

Full-text available

Jul 2019

Open-source software encourages computer programmers to reuse software components written by others. In evolutionary bioinformatics, open-source software comes in a broad range of programming languages, including C/C++, Perl, Python, Ruby, Java, and R. To avoid writing the same functionality multiple times for different languages, it is possible to share components by bridging computer languages and Bio* projects, such as BioPerl, Biopython, BioRuby, BioJava, and R/Bioconductor. In this chapter, we compare the three principal approaches for sharing software between different programming languages: by remote procedure call (RPC), by sharing a local "call stack," and by calling program to programs. RPC provides a language-independent protocol over a network interface; examples are SOAP and Rserve. The local call stack provides a between-language mapping, not over the network interface but directly in computer memory; examples are R bindings, RPy, and languages sharing the Java virtual machine stack. This functionality provides strategies for sharing of software between Bio* projects, which can be exploited more often. Here, we present cross-language examples for sequence translation and measure throughput of the different options. We compare calling into R through native R, RSOAP, Rserve, and RPy interfaces, with the performance of native BioPerl, Biopython, BioJava, and BioRuby implementations and with call stack bindings to BioJava and the European Molecular Biology Open Software Suite (EMBOSS). In general, call stack approaches outperform native Bio* implementations, and these, in turn, outperform "RPC"-based approaches. To test and compare strategies, we provide a downloadable Docker container with all examples, tools, and libraries included.

Integrated Network Approach to Protein Function Prediction

Article

Full-text available

Dec 2018

One of the main problems in functional genomics is the prediction of unknown gene/protein functions. With the rapid development of high-throughput technologies, a vast amount of biological data describing different aspects of cellular functioning has become available, making it possible to use these data as additional information sources for function prediction and to improve prediction accuracy. In our research, we describe an approach to protein function prediction based on the integration of several biological datasets. Initially, each dataset is represented as a graph (or network), where the nodes represent genes or their products and the edges represent physical, functional, or chemical relationships between nodes. The integration process makes it possible to estimate the importance of each network for the prediction of a particular function, taking into account the imbalance between the functional annotations, notably the disproportion between positively and negatively annotated proteins. Protein function prediction consists of applying the label propagation algorithm to the integrated biological network in order to annotate unknown proteins or to assign new functions to already known proteins. A comparative analysis of prediction efficiency with several integration schemes shows a positive effect in terms of several performance measures.
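The label propagation step mentioned above can be sketched as follows. This is a generic iteration on one already-integrated toy network, not the authors' integration scheme or weighting. It assumes a symmetric weighted adjacency matrix, seed labels of +1 (positively annotated), -1 (negatively annotated), and 0 (unknown), and a hypothetical damping parameter `alpha`.

```python
def propagate_labels(adjacency, labels, alpha=0.8, iterations=100):
    """Iterate f <- alpha * P f + (1 - alpha) * y, with P the row-normalised adjacency.

    adjacency: square list of lists with non-negative edge weights.
    labels:    seed labels y (+1, -1, or 0 for unknown nodes).
    Returns the propagated scores; the sign suggests the predicted annotation.
    """
    n = len(adjacency)
    # Row-normalise so that each node averages over its neighbours.
    transition = []
    for row in adjacency:
        total = sum(row)
        transition.append([w / total if total else 0.0 for w in row])

    scores = [float(y) for y in labels]
    for _ in range(iterations):
        scores = [
            alpha * sum(transition[i][j] * scores[j] for j in range(n))
            + (1.0 - alpha) * labels[i]
            for i in range(n)
        ]
    return scores


if __name__ == "__main__":
    # Toy chain 0 - 1 - 2 - 3: node 0 is annotated positive, node 3 negative.
    chain = [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ]
    print(propagate_labels(chain, [1, 0, 0, -1]))
```

In the toy chain, the unknown node next to the positive seed receives a positive score and the one next to the negative seed a negative score. Handling class imbalance, which the abstract emphasizes, would require additional weighting that is not shown here.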

EVOLUTIONARY DESIGN OF THE CLASSIFIER ENSEMBLE

Article

Full-text available

Jan 2011

This paper presents two novel approaches to the evolutionary design of a classifier ensemble. The first formulates a single-objective optimization of feature set partitioning together with feature weighting for the construction of the individual classifiers. The second deals with multi-objective optimization of classifier ensemble design. The proposed approaches have been tested on two data sets from the machine learning repository and one real data set on transient ischemic attack. The experiments show the advantages of feature weighting in terms of classification accuracy when dealing with multivariate data sets, and the possibility of obtaining, in one run of a multi-objective genetic algorithm, non-dominated ensembles of different sizes, thereby skipping the tedious iterative search for the best ensemble of fixed size.

Linear Classifier and Projection Onto a Polytope

Article

May 2020

N. G. Zhurbenko

An algorithm for constructing binary linear classifiers is considered. Objects of recognition are presented by points of an n-dimensional Euclidean space. The algorithm is based on solving the problem of projecting zero onto the convex hull of a finite number of points of Euclidean space.
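The abstract does not say how the projection problem is solved. As a hedged illustration, the sketch below uses Gilbert's iteration (a Frank-Wolfe-type method) as a stand-in, and assumes a homogeneous classifier without a bias term: each sample x with label y in {+1, -1} is mapped to z = y * x, and the projection w of zero onto the convex hull of the z points, if non-zero, satisfies w · z > 0 for every sample. All function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_zero_onto_hull(points, tol=1e-12, max_iter=10000):
    """Approximate the minimum-norm point of conv(points) by Gilbert's iteration."""
    w = [float(c) for c in points[0]]
    for _ in range(max_iter):
        # Vertex of the hull that lies furthest in the direction of -w.
        z = min(points, key=lambda p: dot(w, p))
        gap = dot(w, w) - dot(w, z)
        if gap <= tol:  # w is (nearly) the projection of zero
            break
        # Exact line search on the segment from w to z, clamped to the segment.
        d = [wi - zi for wi, zi in zip(w, z)]
        t = min(1.0, dot(w, d) / dot(d, d))
        w = [wi - t * di for wi, di in zip(w, d)]
    return w


def train_linear_classifier(samples, labels):
    """Return w such that sign(w . x) separates the classes when they are separable."""
    signed = [[y * xi for xi in x] for x, y in zip(samples, labels)]
    return project_zero_onto_hull(signed)


if __name__ == "__main__":
    samples = [(1, 2), (-2, -1), (3, 3)]
    labels = [1, -1, 1]
    w = train_linear_classifier(samples, labels)
    print(w, [dot(w, x) > 0 for x in samples])
```

If the returned vector is (numerically) zero, the origin lies in the hull and the samples are not separable by a hyperplane through the origin under this construction.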

Article

May 2019

New approaches to estimating the coefficients of exact penalty functions for constrained optimization problems are considered. The results of computational experiments using simplified procedures for estimating coefficients in solving some classes of problems are presented. Such approaches are most relevant when applying methods of decomposition by variables (generalized Benders decomposition methods). They make it possible to overcome the difficulties related to an implicit description of the feasible region of the master problem.

Linear classifiers and selection of informative features

Article

Jul 2017

In this work, to construct classifiers for two linearly inseparable sets, the problem of minimizing the margin of incorrect classification is formulated, and approaches to obtaining an approximate solution and to computing estimates of the optimal value for this problem are considered. Results of computational experiments that compare the proposed approaches with SVM are presented. The problem of identifying informative features for large-dimensional diagnostic applications is analyzed, and algorithms for its solution are developed.

Why most published research findings are false

Article