International Journal of Computer Applications Technology and Research, Volume 4, Issue 12, pp. 952-955, 2015, ISSN: 2319-8656, www.ijcat.com
Significant Role of Statistics in Computational Sciences
Rakesh Kumar Singh
Scientist-D
G.B. Pant Institute of Himalayan Environment & Development
Kosi-Katarmal, Almora, Uttarakhand, India

Neeraj Tiwari
Professor & Head
Department of Statistics
Kumaun University, SSJ Campus, Almora, Uttarakhand, India

R.C. Prasad
Scientist-F
G.B. Pant Institute of Himalayan Environment & Development
Kosi-Katarmal, Almora, Uttarakhand, India
Abstract: This paper focuses on issues related to optimizing statistical approaches in the emerging fields of Computer Science and Information Technology, with particular emphasis on the role of statistical techniques in modern data mining. Statistics is the science of learning from data and of measuring, controlling, and communicating uncertainty. Statistical approaches can make significant contributions in software engineering, neural networks, data mining, bioinformatics and other allied fields. Statistical techniques not only help build scientific models but also quantify the reliability, reproducibility and general uncertainty associated with those models. In the current scenario, large amounts of data are recorded automatically by computers and managed with database management systems (DBMS) for storage and fast retrieval. The practice of examining large pre-existing databases in order to generate new information is known as data mining. Data mining has attracted substantial attention in both the research and commercial arenas and involves the application of a variety of statistical techniques. Twenty years ago most data were collected manually and data sets were simple in form; today the nature of data has changed considerably. Statistical techniques and computer applications can be used to obtain the maximum information from the fewest possible measurements and so reduce the cost of data collection.
Keywords: Statistics, Data Mining, Software Engineering, DBMS, Neural Networks
1. INTRODUCTION
Statistics is a scientific discipline with sophisticated methods for statistical inference, prediction, quantification of uncertainty and experimental design. From ancient to modern times statistics has been fundamental to advances in computer science, and it encompasses a wide range of research areas. The future of the World Wide Web (WWW) will depend on the development of many new statistical ideas and algorithms. The approaches most productively involved with statistics are the computational and the mathematical. Modern statistics encompasses the collection, presentation and characterization of information to assist in both data analysis and the decision-making process. Statistical advances made in collaboration with other sciences can address various challenges in science and technology. Computer science uses statistics in many ways to ensure that products available on the market are accurate, reliable, and helpful[1][2].
Statistical Computing: The term "statistical computing" refers to the computational methods that enable statistical methods. Statistical computing includes numerical analysis, database methodology, computer graphics, software engineering and the computer-human interface[1].
Computational Statistics: The term "computational statistics" is used somewhat more broadly to include not only the methods of statistical computing but also modern statistical methods that are computationally intensive. Thus, to some extent, "computational statistics" refers to a large class of modern statistical methods. Computational statistics is grounded in mathematical statistics, statistical computing and applied statistics, and it is concerned with advancing statistical theory and methods through the use of computational methods. Computation in statistics is based on algorithms which originate in numerical mathematics or in computer science. The groups of algorithms from computer science most relevant to computational statistics are machine learning, artificial intelligence (AI), and knowledge discovery in databases or data mining. These developments have given rise to a new research area on the borderline between statistics and computer science[1].
Computer Science vs. Statistics: Statistics and computer science are both about data. Massive amounts of data are present in today's world; statistics lets us summarize and understand them with the help of computer science, and it also lets data do our work for us[2].
Fig.1. Relation between statistics, computer science, statistical computing and computational statistics.
2. STATISTICAL APPROACHES IN
COMPUTATIONAL SCIENCES
Statistics is essential to the field of computer science in
ensuring effectiveness, efficiency, reliability, and high-quality
products for the public. Statistical thinking not only helps make scientific discoveries, but it also quantifies the reliability, reproducibility and general uncertainty associated with these discoveries. The following is a brief listing of areas in computer science that use statistics to varying degrees at various times[6][7][8]:
Data Mining: Data mining is the analysis of
information in a database, using tools that look for
trends or irregularities in large data sets. In other
words "finding useful information from the
available data sets using statistical techniques".
Data Compression: Data compression is the coding of data using compact formulas, called algorithms, and utilities to save storage space or transmission time (a frequency-based example is sketched after this list).
Speech Recognition: Speech recognition is the
identification of spoken words by a machine. The
spoken words are turned into a sequence of numbers
and matched against coded dictionaries.
Vision and Image Analyses: Vision and image
analyses use statistics to solve contemporary and
practical problems in computer vision, image
processing, and artificial intelligence.
Human/Computer Interaction: Human/Computer
interaction uses statistics to design, implement, and
evaluate new technologies that are usable, useful,
and appealing to a broad cross-section of people.
Network/Traffic Modeling: Network/Traffic
modeling uses statistics to avoid network congestion
while fully exploiting the available bandwidth.
Stochastic Optimization: Stochastic optimization
uses chance and probability models to develop the
most efficient code for finding the solution to a
problem.
Stochastic Algorithms: Stochastic algorithms
follow a detailed sequence of actions to perform or
accomplish a task in the face of uncertainty.
Artificial Intelligence: Artificial intelligence is
concerned with modelling aspects of human thought
on computers.
Machine Learning: Machine learning is the ability
of a machine or system to improve its performance
based on previous results.
Capacity Planning: Capacity planning determines
what equipment and software will be sufficient
while providing the most power for the least cost.
Storage and Retrieval: Storage and retrieval
techniques rely on statistics to ensure computerized
data is kept and recovered efficiently and reliably.
Quality Management: Quality management uses
statistics to analyze the condition of manufactured
parts (hardware, software, etc.) using tools and
sampling to ensure a minimum level of defects.
Software Engineering: Software engineering is a
systematic approach to the analysis, design,
implementation, and maintenance of computer
programs.
Performance Evaluation: Performance evaluation
is the process of examining a system or system
component to determine the extent to which
specified properties are present.
Hardware Manufacturing: Hardware
manufacturing is the creation of the physical
material parts of a system, such as the monitor or
disk drive.
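Several of the areas above rest on explicitly probabilistic ideas. As a concrete illustration of the statistical side of data compression, the sketch below (a minimal example of my own, not taken from the paper; the sample string and function name are invented) builds a Huffman code from observed symbol frequencies, so that frequent symbols receive shorter codewords.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a prefix code whose codeword lengths reflect symbol frequencies."""
    freq = Counter(text)                       # empirical symbol distribution
    # Each heap entry: (frequency, unique tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                         # degenerate single-symbol input
        return {next(iter(freq)): "0"}
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)        # two least frequent subtrees
        f2, i, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))
    return heap[0][2]

codes = huffman_code("statistics in computer science")
print(codes)          # frequent symbols get shorter codewords
```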
3. STATISTICS IN SOFTWARE
ENGINEERING
Software engineering aims to develop methodologies and procedures to control the whole software development process. Researchers now attempt to bridge the islands of knowledge and experience between statistics and software engineering by enunciating a new interdisciplinary field: statistical software engineering. Design of Experiments (DOE) uses statistical techniques to test and construct models of engineering components and systems. Quality control and process control use statistics as tools to manage conformance to specifications of manufacturing processes and their products. Time and methods engineering uses statistics to study repetitive operations in manufacturing in order to set standards and find optimum (in some sense) manufacturing procedures. Reliability engineering uses statistics to measure the ability of a system to perform its intended function (over its intended time) and provides tools for improving performance. Probabilistic design applies probability to product and system design. Essential to statistical software engineering is the role of data: wherever data are used or can be generated in the software life cycle, statistical methods can be brought to bear for description, estimation, and prediction. Departments of software engineering and statistics train multiskilled engineers in the processing of information, in both its statistical and computational forms, for use in various business professions.
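To make the "data in the software life cycle" point concrete, here is a hedged sketch of one of the simplest statistical tools that could be applied to code-review or test data: a proportion estimate with a normal-approximation confidence interval for a module defect rate. The counts and the function are hypothetical illustrations, not results or methods from the paper.

```python
import math

def defect_rate_ci(defective, inspected, z=1.96):
    """Point estimate and ~95% normal-approximation CI for a defect proportion."""
    p = defective / inspected
    se = math.sqrt(p * (1 - p) / inspected)    # standard error of the proportion
    return p, (max(0.0, p - z * se), min(1.0, p + z * se))

# Hypothetical review data: 14 defective modules out of 320 inspected.
rate, (low, high) = defect_rate_ci(14, 320)
print(f"defect rate {rate:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```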
4. STATISTICS IN HARDWARE
MANUFACTURING
Hardware manufacturing companies apply statistical approaches to create plans of action that work more efficiently and to forecast the future productivity of the enterprise[8]. Statistical approaches are adopted to (a forecasting sketch follows this list):
Forecast production under both stable and uncertain demand.
Pinpoint when, and through which inputs, a specific model will be affected by uncertainty.
Calculate summary statistics in order to characterise sample data.
Carry out market analysis and process optimization.
Track and predict quality improvement statistically.
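As a minimal sketch of the forecasting item above, the following example applies simple exponential smoothing to invented monthly demand figures; it assumes reasonably stable demand and is only one of many forecasting techniques a manufacturer might adopt.

```python
def exponential_smoothing(demand, alpha=0.3):
    """One-step-ahead forecasts: each forecast blends the latest observation
    with the previous forecast, weighted by the smoothing factor alpha."""
    forecast = [demand[0]]                 # initialise with the first observation
    for d in demand[:-1]:
        forecast.append(alpha * d + (1 - alpha) * forecast[-1])
    return forecast

# Hypothetical monthly unit demand for a hardware component.
demand = [120, 132, 101, 134, 190, 170, 140, 155]
print(exponential_smoothing(demand))
```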
5. STATISTICS IN DATABASE
MANAGEMENT
Databases are packages designed to create, edit, manipulate and analyze data. To be suitable for a database, the data must consist of records which provide information on individual cases, people, places, features, etc. Optimizer statistics are a collection of data that describe the database and the objects in it in more detail. The optimizer statistics are stored in the data dictionary and can be viewed using data dictionary views. Because the objects in a database are constantly changing, statistics must be regularly updated so that they accurately describe these database objects. These statistics are used by the query optimizer to choose the best execution plan for each SQL statement[5]; a simple cardinality-estimation sketch follows the listing below. Optimizer statistics include the following:
Table Statistics
Number of rows
Number of blocks
Average row length
Column Statistics
Number of distinct values (NDV) in column
Number of nulls in column
Data distribution (histogram)
Index Statistics
Number of leaf blocks
Levels
Clustering factor
System Statistics
I/O performance and utilization
CPU performance and utilization
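The cardinality-estimation sketch promised above shows, in a deliberately simplified form, how table and column statistics feed the optimizer: under the common uniform-distribution assumption, an equality predicate is expected to return roughly (rows - nulls) / NDV rows. This is an illustrative model only, not the actual algorithm of any particular DBMS.

```python
def estimate_equality_rows(num_rows, num_distinct, num_nulls=0):
    """Cardinality estimate for `column = constant`, assuming values are
    uniformly spread over the distinct non-null values."""
    non_null = num_rows - num_nulls
    if num_distinct == 0:
        return 0.0
    return non_null / num_distinct      # selectivity 1/NDV applied to non-null rows

# Hypothetical table statistics: 1,000,000 rows, 50,000 distinct values, 2,000 nulls.
print(estimate_equality_rows(1_000_000, 50_000, 2_000))   # ~19.96 rows expected
```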
Statistical packages for databases include SAS, SPSS, R, etc., which are available over a wide range of operating systems. Numerous other packages have been developed specifically for the PC DOS environment, and S is a commonly available statistical package for UNIX.
6. STATISTICS IN ARTIFICIAL
INTELLIGENCE
Artificial intelligence (AI) is intelligence exhibited by machines or software. Popular AI approaches include statistical methods, computational intelligence, machine learning and traditional symbolic AI. The goals of AI include reasoning, knowledge, planning, learning, natural language processing, perception and the ability to move and manipulate objects. A large number of tools are used in AI, including versions of search and mathematical optimization, logic, methods based on probability and economics, and many others[4]. The simplest AI applications can be divided into two types:
Classifiers: Classifiers are functions that use pattern matching to determine the closest match. A classifier can be trained in various ways; there are many statistical and machine learning approaches, of which the most widely used is the neural network.
Controllers: Controllers also classify conditions before inferring actions, so classification forms a central part of many AI systems.
Fig.2. Graphical approach of Artificial Intelligence.
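One of the statistical classifiers commonly named alongside neural networks, the naive Bayes classifier, is small enough to sketch in full. The example below is a minimal Gaussian naive Bayes written from scratch on invented two-dimensional data; the class and the data are illustrative assumptions, not part of the original article.

```python
import numpy as np

class GaussianNaiveBayes:
    """Minimal Gaussian naive Bayes: per-class feature means/variances plus priors."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mean = {c: X[y == c].mean(axis=0) for c in self.classes}
        self.var = {c: X[y == c].var(axis=0) + 1e-9 for c in self.classes}
        self.prior = {c: np.mean(y == c) for c in self.classes}
        return self

    def predict(self, X):
        preds = []
        for x in X:
            # log prior + sum of log Gaussian densities, one term per feature
            scores = {c: np.log(self.prior[c])
                         - 0.5 * np.sum(np.log(2 * np.pi * self.var[c]))
                         - 0.5 * np.sum((x - self.mean[c]) ** 2 / self.var[c])
                      for c in self.classes}
            preds.append(max(scores, key=scores.get))
        return np.array(preds)

# Hypothetical training data: two well-separated clusters.
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [5.0, 5.2], [4.8, 5.1], [5.2, 4.9]])
y = np.array([0, 0, 0, 1, 1, 1])
model = GaussianNaiveBayes().fit(X, y)
print(model.predict(np.array([[1.0, 1.0], [5.0, 5.0]])))   # expected: [0 1]
```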
7. STATISTICS IN NEURAL NETWORK
The term neural network originally referred to a network of biological neurons; artificial neural networks refer to networks of artificial neurons or nodes. Biological neural networks are made up of real biological neurons that are connected or functionally related in the peripheral or central nervous system. Artificial neural networks are made up of interconnected artificial neurons (programming constructs that mimic the properties of biological neurons). Artificial neural networks may be used either to gain an understanding of biological neural networks or to solve artificial intelligence problems without necessarily creating a model of a real biological system. Because the inner product is a linear operator in the input space, the perceptron can only perfectly classify a set of data for which the different classes are linearly separable in the input space, and it often fails completely for non-separable data. While the development of the algorithm initially generated some enthusiasm, partly because of its apparent relation to biological mechanisms, the later discovery of this inadequacy caused such models to be abandoned until the introduction of non-linear models into the field[4].
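The linear-separability limitation of the perceptron is easy to demonstrate. The hedged sketch below (invented data, my own minimal implementation) trains a perceptron on the logical AND function, which is linearly separable and therefore learnable; the same loop would never converge on XOR.

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Classic perceptron rule: nudge the weights whenever a point is misclassified."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            update = lr * (target - pred)   # zero when the point is classified correctly
            w += update * xi
            b += update
    return w, b

# AND function: linearly separable, so the perceptron converges.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([(1 if x @ w + b > 0 else 0) for x in X])   # expected: [0, 0, 0, 1]
```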
8. STATISTICS IN BIOINFORMATICS
Bioinformatics is the application of computational biology to the management and analysis of biological data. Concepts from computer science, discrete mathematics and statistics are being used increasingly to study and describe biological systems. Bioinformatics would not be possible without advances in computer hardware and software: analysis of algorithms, data structures and software engineering. Implementing algorithms on computers has also increased awareness of more recent statistical methods. Statistical analysis of differentially expressed genes is best carried out via hypothesis tests; more complex data may require analysis via ANOVA or general linear models[8].
Fig.3. Taxonomy of Bioinformatics.
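A hypothesis test of the kind mentioned above takes only a few lines. The sketch below runs Welch's two-sample t-test on invented expression measurements for a single gene in control and treated samples; the numbers are illustrative only.

```python
from scipy import stats

# Hypothetical expression levels of one gene in control and treated samples.
control = [7.1, 6.8, 7.4, 7.0, 6.9]
treated = [8.3, 8.6, 8.1, 8.5, 8.2]

# Welch's t-test (no equal-variance assumption) for a difference in mean expression.
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a small p-value suggests differential expression
```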
9. STATISTICS IN DATA MINING
Data mining is the process of discovering previously unknown and potentially useful hidden patterns in data. Advances in information technology have resulted in a much more data-based society. Data touch almost every aspect of our lives: commerce on the web, measuring our fitness and safety, the treatment of our illnesses, economic decisions that affect entire nations, and so on. Alone, however, data are not useful for knowledge
discovery. Data mining is transitioning from data-poor to data-rich practice by using methods such as data exploration, statistical inference and the understanding of variability and uncertainty[5].
Statistical Elements Present in Data Mining
Contrived serendipity, creating the conditions for
fortuitous discovery.
Exploratory data analysis with large data sets, in
which the data are as far as possible allowed to
speak for themselves, independently of subject area
assumptions and of models which might explain
their pattern. There is a particular focus on the
search for unusual or interesting features.
Specialised problems: fraud detection.
The search for specific known patterns.
Standard statistical analysis problems with large
data sets.
Data Mining from a Statistical Perspective
For data sets which are relatively large and homogeneous, it may be reasonable to use mainstream statistical techniques on the whole data set or on a very large subset of it.
Analyses done with mainstream statistics have an intended outcome, such as reducing a set of data to a small amount of readily assimilated information.
The outcome may include graphs, summary statistics, equations that can be used for prediction, or a decision tree.
If a large volume of data can be reduced, without loss of information, to a much smaller summary form, this can enormously aid the subsequent analysis task.
It becomes much easier to make graphical and other checks that give the analyst assurance that predictive models or other analysis outcomes are meaningful and valid.
Statistics vs. Data Mining

Feature | Statistics | Data Mining
Type of problem | Well structured | Unstructured / semi-structured
Role of inference | Explicit inference plays a great role in any analysis | No explicit inference
Objective of the analysis and data collection | The objective is formulated first, and data are then collected | Data are rarely collected for the objective of the analysis/modelling
Size of data set | Small and, hopefully, homogeneous | Large and heterogeneous
Paradigm/approach | Theory-based (deductive) | Synergy of theory-based and heuristic-based approaches (inductive)
Type of analysis | Confirmative | Explorative
Number of variables | Small | Large
Methods/techniques | Dependence methods: discriminant analysis, logistic regression. Interdependence methods: correlation analysis, correspondence analysis, cluster analysis | Predictive data mining: classification, regression. Discovery data mining: association analysis, sequence analysis, clustering
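To connect the "discovery data mining: clustering" entry of the table to something concrete, the sketch below runs a small k-means clustering, written from scratch, on invented two-dimensional records; it is a minimal illustration of unsupervised structure-finding, not a method described in the paper.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: alternately assign points to the nearest centroid and
    recompute each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical records forming two obvious groups.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [8.0, 8.1], [8.2, 7.9], [7.9, 8.0]])
labels, centroids = kmeans(X, k=2)
print(labels)       # points 0-2 share one label, points 3-5 the other
```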
10. PROPERTIES OF STATISTICAL
PACKAGES
Statistical packages offer a range of types of statistical analysis[3]. Statistical packages typically include:
Database functions, such as editing and printing reports.
Capabilities for graphic output, particularly graphs, though many also produce maps.
Common packages are SAS, SPSS, R, etc., available over a wide range of operating systems.
Some have been "ported" to (rewritten for) the IBM PC, and numerous other packages have been developed specifically for the PC DOS environment.
S is a commonly available statistical package for UNIX.
11. CONCLUSION
In this paper, many areas of computer science have been described in which statistics plays a vital role in data and information management.
cross-fertilization of ideas between scientific fields
(biological, physical, and social sciences), industry, and
government. The statistical and algorithmic issues are both
important in the context of data mining. Statistics is an
essential and valuable component for any data mining
exercise. The future success of data mining will depend
critically on our ability to integrate techniques for modeling
and inference from statistics into the mainstream of data
mining practice.
12. REFERENCES
[1] Lauro, C. (1996). Computational Statistics or Statistical Computing, is that the question? Computational Statistics and Data Analysis, Vol. 23, pp. 191-193.
[2] Billard, L. and Gentle, J.E. (1993). The middle years of the Interface. Computing Science and Statistics, Vol. 25, pp. 19-26.
[3] Yates, F. (1966). Computers: the second revolution in statistics. Biometrics, Vol. 22.
[4] Cheng, B. and Titterington, D.M. (1994). Neural networks: a review from a statistical perspective. Statistical Science, Vol. 9, pp. 2-54.
[5] Elder, J.F. and Pregibon, D. (1996). A statistical perspective on knowledge discovery in databases. Advances in Knowledge Discovery and Data Mining, MIT Press, pp. 83-115.
[6] Gentle, J.E. (2004). Courses in statistical computing and computational statistics. The American Statistician, Vol. 58, pp. 2-5.
[7] Grier, D.A. (1991). Statistics and the introduction of digital computers. Chance, Vol. 4(3), pp. 30-36.
[8] Friedman, J.H. and Fisher, N.I. (1999). Bump hunting in high-dimensional data. Statistics and Computing, Vol. 9, pp. 123-143.