Ethics of Data Mining: a New Zealand Survey
Howard Robert Edward Iles
Eastern Institute of Technology
Napier, Hawke’s Bay
New Zealand
06 974 8000 ext 6017
riles@eit.ac.nz
Abstract
This research paper looks at the core issues of data mining and ethics. It covers both
the direct and indirect ethical issues of data mining. The paper also reports a small
survey of New Zealand data mining companies, analysing their websites to see
which ethical issues they refer to. To put this New Zealand survey into context,
a small survey of overseas data mining companies’ websites was also carried out. The results
are striking in that none of the New Zealand companies surveyed mention ethical issues in
relation to data mining. It is suggested that data mining companies should make
potential ethical issues clear at the outset. The future of data mining is then explored
in relation to the explosive growth of data and new sources of data.
Keywords
Data Mining, Big Data, Ethical Issues, Data Mining New Zealand, BI, Business
Intelligence, Knowledge Discovery in Databases, KDD, Machine Learning
Table of Contents
Abstract
Keywords
Introduction
What is Data Mining?
What is Ethics?
Direct Ethical Issues of Data Mining
    Loss of privacy
    Mis-interpretation of mined information
    Troubles with Anonymising Information
Indirect Issues of Data Mining
    Who Does the Information Belong to?
    How Data is Collected?
    Onus is on whom?
Survey Methods
New Zealand Data Mining Companies
Sample of Overseas Companies that Data Mine
Analysis
The Future of Data Mining
Discussion
Conclusion
References
Appendices
    Appendix One
    Appendix Two
    Appendix Three
Introduction
This paper first looks at the meanings of both data mining and
ethics. To start with I will outline what data mining is and give a brief explanation of
ethics. I will then outline the core ethical issues with data mining, which are of
both a direct and an indirect nature, and then survey the current New Zealand and
international data mining scene, with reference to the main New Zealand-based
players in this field, and analyse the ethical views these companies give on their
websites. Following this is a glimpse into the future of data mining, then a
discussion and a conclusion.
What is Data Mining?
Witten defines data mining as “the extraction of implicit, previously unknown, and
potentially useful information from [electronic] data” (Witten, 2011, p. xxiii). This
potential information is discovered from patterns in the data (Witten, 2011, p. 5). Data
mining is the broad term, and “machine learning” is the technical side of data
mining. The term “big data” is often confused with data mining (and the two may now be
used synonymously) but strictly means very large data sets. Big data can be data mined
with some special preparation. For most of this essay I will concentrate on the
social and business uses of data mining, but it has important uses in the
sciences as well. Data mining goes beyond standard Business Intelligence (BI) but
is likely to be included in it, if it is not already. Another term that needs to be understood
is KDD, Knowledge Discovery in Databases, which is often used synonymously with
data mining. In this essay I will use data mining as an umbrella term.
Tools such as WEKA (University of Waikato, 2013) are bringing data mining techniques to the
general IT user. Universities are offering a number of undergraduate and graduate
papers on data mining, and even MOOCs on data mining are being offered (WEKA
MOOC, 2013). However, data mining is a complex business and should only be
attempted with the aid of well-trained practitioners with a high level of statistical
knowledge (Seltzer, 2005). This complexity is evident in WEKA, which offers a
huge range of customisation options for any data set (see Appendix One).
Large amounts of data are being created: “From 2005 to 2020, the digital universe
will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, or 40 trillion
gigabytes (more than 5,200 gigabytes for every man, woman, and child in 2020).
From now until 2020, the digital universe will about double every two years” (IDC,
2012, p. 1). Data mining will therefore only increase as data increases and as data mining
becomes more mainstream. Indeed, the AIIM survey showed that a data mining
killer app is seen by business as the next big thing (AIIM, 2012).
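The IDC figures quoted above can be sanity-checked with a little arithmetic. The sketch below is only an illustration: it assumes decimal (SI) units, and it assumes that "from now" in the quotation means 2012, the year of the IDC report. The variable names are invented for this example.

```python
# Figures quoted from IDC (2012) in the paragraph above.
EXABYTES_2005 = 130
EXABYTES_2020 = 40_000
GIGABYTES_PER_EXABYTE = 10**9  # assumption: decimal (SI) units

# "grow by a factor of 300": the quoted end points give roughly 308.
growth_factor = EXABYTES_2020 / EXABYTES_2005

# "40,000 exabytes, or 40 trillion gigabytes"
gigabytes_2020 = EXABYTES_2020 * GIGABYTES_PER_EXABYTE

# World population implied by 5,200 gigabytes per person.
implied_population = gigabytes_2020 / 5_200

# Doubling every two years from 2012 (assumed start year) to 2020.
doublings = (2020 - 2012) // 2
implied_exabytes_2012 = EXABYTES_2020 / 2**doublings

print(f"growth factor 2005-2020: {growth_factor:.0f}x")
print(f"gigabytes in 2020: {gigabytes_2020:,}")
print(f"implied 2020 population: {implied_population / 1e9:.1f} billion")
print(f"implied size in 2012: {implied_exabytes_2012:,.0f} exabytes")
```

Under these assumptions the quoted numbers are consistent with one another: the rounded "factor of 300", the conversion to 40 trillion gigabytes, and a per-person figure that corresponds to a population of a little under 8 billion.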
What is Ethics?
Ethics, also known as moral philosophy, is a branch of philosophy that involves
systematizing, defending and recommending concepts of right and wrong conduct.
The term comes from the Greek word ethos, which means "character". In
philosophy, ethics studies the moral behaviour in humans and how one should act.
Ethics as related to data mining covers many areas, but I will concentrate on the
business use of mined information as it relates to privacy. I will also mention the
relationship between privacy, ethics and law. Ethical issues are, however, often quite
personal, and views on the nature and seriousness of any given ethical issue vary
widely within a population. Ethical issues also vary between countries and cultures, so
there are no hard and fast rules when dealing with them. People also
tend to blur the line between what is legal and what is ethical. Data in itself is ethically
neutral (Davis, 2012).
Direct Ethical Issues of Data Mining
Loss of privacy
The loss of privacy is the greatest issue facing data mining. It can result from
disclosure, either accidental or deliberate. It may stem from a belief that the data
has been made anonymous, but this can prove hard to do and there are many
methods of re-identification: it is said that 50 percent of Americans can be
identified from just city, birth date, and gender (Sweeney, 2000). Privacy can also be
lost when data is handed over to other parties to do the data mining. Data may also
be lost or hacked; while this is not directly a result of data mining, the data may
have been kept only because of its potential for data mining.
Mis-interpretation of mined information
Ethical issues can arise from misinterpretation of the results of data mining.
One example is treating data mining results as evidence of causation when they show
only correlation, even though causation is often the desired result. For instance, as
ice cream sales increase so do drownings, but it is unlikely that ice cream sales
cause drownings. It is more likely that both are related to better
weather (WEKA MOOC, 2013). Poor data mining techniques or the misuse of
statistical evaluations can lead to improper conclusions from data mining (Seltzer,
2005). Data can also be “over fitted”, which leads to unrealistic views of the data
mining analysis.
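The ice cream example can be illustrated with a short sketch. All of the numbers below are invented for illustration and are not real data; the `pearson` helper is written out here rather than taken from a library. The point is only that two series which both follow the weather will correlate strongly with each other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented weekly figures, for illustration only (not real data).
temperature_c = [14, 17, 20, 23, 26, 29, 32]
ice_cream_sales = [120, 150, 165, 210, 240, 255, 300]
drownings = [1, 2, 2, 3, 4, 4, 5]

r_sales_drownings = pearson(ice_cream_sales, drownings)
r_temp_sales = pearson(temperature_c, ice_cream_sales)
r_temp_drownings = pearson(temperature_c, drownings)

print(f"ice cream vs drownings:   r = {r_sales_drownings:.2f}")
print(f"temperature vs ice cream: r = {r_temp_sales:.2f}")
print(f"temperature vs drownings: r = {r_temp_drownings:.2f}")
```

A high coefficient between sales and drownings says nothing about one causing the other; in this made-up data both simply rise with temperature, which is the kind of confounding variable a practitioner needs to look for before acting on a mined pattern.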
Troubles with Anonymising Information
Anonymising information, either before or after data mining
occurs, is an important way to protect the privacy of people’s information. However,
there are many cases where information was believed to have been made
anonymous but ways of deducing individuals were later discovered (Hussien, 2013).
Another problem can occur when data is believed to be anonymous but
can still be used to incriminate a certain group; this may result from something as
trivial as leaving a post code in the data.
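One way to check how well a data set has been anonymised is to count how many records share each combination of seemingly harmless attributes (often called quasi-identifiers). The sketch below is a minimal illustration and is not taken from the paper: the records are hypothetical, the function names are invented for this example, and the measure it reports (the smallest group size, usually called k in "k-anonymity") is a standard idea that the paper itself does not name.

```python
from collections import Counter

# Hypothetical "anonymised" records: names removed, other attributes kept.
records = [
    {"city": "Napier", "birth_date": "1975-02-11", "gender": "F", "diagnosis": "asthma"},
    {"city": "Napier", "birth_date": "1975-02-11", "gender": "F", "diagnosis": "diabetes"},
    {"city": "Napier", "birth_date": "1982-07-30", "gender": "M", "diagnosis": "asthma"},
    {"city": "Hastings", "birth_date": "1990-12-05", "gender": "F", "diagnosis": "migraine"},
]

# The three attributes mentioned in the re-identification example above.
QUASI_IDENTIFIERS = ("city", "birth_date", "gender")


def group_sizes(rows, keys):
    """Count how many rows share each combination of the given attributes."""
    return Counter(tuple(row[key] for key in keys) for row in rows)


def unique_records(rows, keys):
    """Return rows whose attribute combination appears exactly once."""
    sizes = group_sizes(rows, keys)
    return [row for row in rows if sizes[tuple(row[key] for key in keys)] == 1]


def smallest_group(rows, keys):
    """Smallest group size; a value of 1 means someone can be singled out."""
    return min(group_sizes(rows, keys).values())


print("smallest group size:", smallest_group(records, QUASI_IDENTIFIERS))
print("records that stand alone:", len(unique_records(records, QUASI_IDENTIFIERS)))
```

Any record that stands alone on these attributes could be matched against another source, such as an electoral roll, that holds the same attributes alongside a name.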
Indirect Issues of Data Mining
Indirect issues of data mining arise where an individual or a group of people is
affected in some way by data mining, but not directly through a breach of privacy. An
example is applying for a mortgage: your application is assessed on criteria derived
from data mining rather than on your individual circumstances.
Who Does the Information Belong to?
The question can be asked of who owns the data, for how long it should be kept,
and for what purposes. There seems to be a kind of consensus that if the data is
about a person, then that person is viewed as owning the information or at least
as having a say in its use. To an extent this may be protected by laws such as the
Privacy Act 1993 in New Zealand (see footnote 1). A grey area also develops when two data sets are
matched together: an individual may feel that together they are more than the sum
of the original two databases in terms of the implicit release of information. Data mining
from joined data sets can become very powerful (WEKA MOOC, 2013).
How Data is Collected?
Data can be collected with an individual’s agreement as to its use, although many
individuals will be unaware of the potential of data mining and the manipulation of
data. However, most data is captured with little knowledge on the part of, or respect for, the individual,
from sources such as street surveillance and seemingly innocuous events such
as the use of loyalty cards. Data can be gathered from the internet through
the legitimate giving of data, but also through malware where no consent
was given, and this kind of collection is moving into the mobile arena (Erturk, 2012). There is
also the problem that not all data mining occurs in New Zealand: it
may be done by overseas companies that may not share the same view
of ethics or have any understanding of the New Zealand legal situation.
Onus is on whom?
A degree of onus should fall on the individual to understand how and why data is
being collected and what it is used for, but businesses should also be relatively
transparent about how they use people’s data. Some degree of responsibility may also fall on
governments (Van Wel, 2004). Organisations with a vested interest in data mining on behalf of
their members may create guidelines, as the ADMA (Association for Data-driven
Marketing and Advertising) did (ADMA, 2013). It should be made clear to the
consumer that data is going to be shared through varying levels of a business. But it
seems that most of the onus of data mining ethics should fall on the owners of the
data and those doing the data mining. The data miners should be very sure of their
results before using them or making them public. Data mining is by its very nature
discriminatory, because discriminating between cases is the basis of its use: for example,
deciding who should get a loan, or estimating your chance of having an accident when you apply for car
insurance.
1 The Privacy Act 1993 can be found at the Legislation New Zealand website
http://www.legislation.govt.nz/act/public/1993/0028/latest/DLM296639.html
Survey Methods
A survey was undertaken of a sample of New Zealand-based data mining companies and a
sample of data mining companies outside New Zealand. I found the data mining
companies through Google searches (see footnote 2), using the phrases “Data Mining Companies in
New Zealand” and “Data Mining Companies”. The searches were carried out between
28/09/2013 and 05/10/2013, and the analysis of the sites was carried out during the
same period. The pages of each website were analysed for references to the
terms “ethics” and “ethical”, and for the terms “privacy” and “legal” where these
appeared to be used in an ethical context. The data mining tool WEKA makes no
mention of ethical considerations before use, either in the software or in the download area (University of
Waikato, 2013).
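The paper does not say whether the term check was done by eye or with a tool. As a rough illustration of how such a check could be automated, the sketch below scans the text of a page for the four survey terms. It is an assumption-laden example rather than the method actually used: the function name and the sample text are hypothetical, fetching pages from the web is left out, and judging whether "privacy" or "legal" is used in an ethical context would still need a human reader.

```python
import re

# Terms from the survey method described above.
DIRECT_TERMS = ("ethics", "ethical")
CONTEXT_TERMS = ("privacy", "legal")  # need a manual check of context


def find_terms(page_text, terms):
    """Report, for each term, whether it occurs as a whole word in the text."""
    lowered = page_text.lower()
    return {
        term: bool(re.search(rf"\b{re.escape(term)}\b", lowered))
        for term in terms
    }


# Hypothetical page text, for illustration only.
sample_page = "Our Privacy Policy explains how data is handled. Illegal use is prohibited."

print(find_terms(sample_page, DIRECT_TERMS))
print(find_terms(sample_page, CONTEXT_TERMS))
```

The whole-word match means "illegal" is not counted as a hit for "legal"; by the same rule "unethical" would not count as "ethical", so a real survey might widen the term list.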
New Zealand Data Mining Companies
No.  Company Name        Area of Expertise
1    Datamine            Data mining, statistical analysis, forecasting, predictive
                         modelling, explanatory modelling, optimization, and big
                         data. http://www.datamine.com/
2    Pingar              Content Management, Data Mining and Metadata
                         http://www.pingar.com/
3    Harmonic Analytics  Data analysis, use of R statistical package, Data Mining
                         http://www.harmonic.co.nz

No.  Company Name        Mention of Ethical Issues on their Website
1    Datamine            No
2    Pingar              No
3    Harmonic Analytics  No
2 Google.co.nz was used without being logged in to a Google account
Sample of Overseas Companies that Data Mine
No.  Company Name               Area of Expertise
1    IBM                        Data mining and lots more http://www.ibm.com
2    Teradata                   Data Analysis, Business Intelligence and Data
                                Mining http://www.teradata.com
3    Data Mining International  Data Mining
                                http://www.datamininginternational.com/

No.  Company Name               Mention of Ethical Issues on their Website
1    IBM                        Yes
2    Teradata                   Yes
3    Data Mining International  No
Analysis
The results of the admittedly small sample showed that ethical issues with data
mining are often not made obvious to companies that may want to use
data mining services. It seemed from the results that the bigger the organisation offering
the data mining service, the more likely ethical issues were to be explained. That the
New Zealand data mining companies did not mention ethical issues at all
is both a bit curious and a bit worrying.
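The counts behind this analysis can be reproduced directly from the two survey tables. The sketch below simply tallies the Yes/No entries as they appear in the tables; the variable names are invented for this example.

```python
from collections import Counter

# (company, region, mentions ethical issues on website), copied from the tables.
survey = [
    ("Datamine", "New Zealand", False),
    ("Pingar", "New Zealand", False),
    ("Harmonic Analytics", "New Zealand", False),
    ("IBM", "Overseas", True),
    ("Teradata", "Overseas", True),
    ("Data Mining International", "Overseas", False),
]

# Overall Yes/No tally across all six websites.
overall = Counter(mentions for _, _, mentions in survey)

# Tally split by region.
by_region = Counter((region, mentions) for _, region, mentions in survey)

print("Yes:", overall[True], "No:", overall[False])
for region in ("New Zealand", "Overseas"):
    print(region, "- Yes:", by_region[(region, True)], "No:", by_region[(region, False)])
```

With only six websites the tally is descriptive rather than statistically meaningful, which is consistent with the paper's own caution about the small sample.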
Figure 1: Bar graph to show difference between the mentioning of ethical issues on
their website.
[Bar graph omitted. Chart title: Mentioning of Ethical Issues on Website; categories: Yes, No; series: No. mentioning ethical issues on website; vertical axis 0 to 5.]
Figure 2: Bar graph to show difference between the mentioning of ethical issues on
their website for New Zealand and Overseas.
The Future of Data Mining
Data mining will only increase as organisations realise the leverage and business
information that can be gleaned from it. Again, discussion needs to occur
about people’s data privacy and the other ethical issues of data mining. Applications
that make data mining easier to do should be viewed with suspicion, because
data mining requires a very sophisticated understanding of statistics, data
mining principles and the ethical issues in dealing with data. No one statistical method
provides a one-stop shop for data mining, so a data mining
killer app is an unlikely scenario. More and more data is being collected from more and more sources.
Some of these are new, such as data from smart city applications, the internet of things,
new products such as Google Glass (and other wearable computing), and
the implementation of e-government. These new sources of data will need to be
monitored, and the onus to understand what data is being collected, and where, will
to some extent have to fall on the individual. Data mining will also open up new areas
of understanding of human behaviour. This could be used to help society and
individuals but may also create greater restrictions and controls, so again people will
have to be aware of how data is being used and interpreted.
[Figure 2 bar graph omitted. Chart title: New Zealand versus Overseas Companies Mentioning Ethical Issues on Website; categories: New Zealand websites not mentioning ethical issues, Overseas websites not mentioning ethical issues; series: No. mentioning ethical issues on website; vertical axis 0 to 4.]
Discussion
The ethical issues of data mining are important to discuss and understand from both
the business and the personal point of view. Information gathered could be around
for a long time, so understanding what is held is important for the future as
well as for its current use. Additional legislation to increase understanding of digital
privacy and the related use of information for data mining may help. The introduction of
the Copyright (Infringing File Sharing) Amendment Act 2011 in New Zealand helped clarify copyright issues (Erturk,
2013; Hooper, 2010), so the same may be true if additional legislation were introduced
around data mining. Future analysis may need to be done with a larger set of
data mining websites and perhaps a breakdown by country. There may also be an
area of future research regarding end users’ understanding of their data in the hands
of multinationals. Some current research suggests that businesses should look at the
data they currently have and the BI methods they currently use before embarking on
data mining (Fitzsimmons, 2013).
Conclusion
The ethical issues of data mining need to be raised, and awareness of them increased, among
practitioners of data mining, owners of data and the people the data is about. More and more
data is likely to be collected; data mining techniques are likely to become more
sophisticated and to become available to more data store owners. People will
need to understand more about what data is being held and what advantages or
disadvantages there are from data mining. People will need to get a clearer idea of
their current rights to privacy, what data mining means for those rights, and what is
not covered. Companies may need to become more transparent as to how data is
being collected and used. Companies that infringe ethical rights, or appear to do so,
face litigation, bad publicity, and damage to their reputations (Fule, 2004). As has
been shown in the website analysis, companies that are practitioners of data mining
are often not forthcoming about the ethical issues in data mining at the point where
a business begins using their services. That would seem to be a good place to start bringing
the ethical issues of data mining to the forefront.
References
ADMA (2013). Best Practice Guideline: Big Data – 2013. Retrieved from
http://www.adma.com.au/assets/Uploads/Downloads/Big-Data-Best-Practice-Guidelines.pdf
AIIM (2012). Big Data – extracting value from your digital landfills. Retrieved from
http://www.aiim.org/pdfdocuments/IW_Big-Data_2012.pdf
Davis, K., & Patterson, D. (2012). Ethics of Big Data [Safari Books version]. Retrieved
from Safari Books
Erturk, E. (2012). A case study in open source software security and privacy: Android
adware. In 2012 World Congress on Internet Security (WorldCIS), 10-12
June 2012, 189-191. Retrieved from http://ieeexplore.ieee.org
Erturk, E. (2013). The impact of intellectual property policies on ethical attitudes
toward internet piracy. Knowledge Management: An International Journal, 12(1),
101-109. Retrieved from http://ijmk.cgpublisher.com/product/pub.257/prod.12
Fitzsimmons, C. (2013). Big Data? Big Deal. BRW. October 10-16, 2013, 17-22.
Fule, P., & Roddick, J. F. (2004). Detecting privacy and ethical sensitivity in data
mining results. In Proceedings of the 27th Australasian conference on Computer
science - Volume 26 (ACSC '04), Estivill-Castro (Ed.), Vol. 26. Australian Computer
Society, Inc., Darlinghurst, Australia, 159-166. Retrieved from
http://dl.acm.org/citation.cfm?id=979942
Hooper, T. & Evans, T. B. (2010). The Value Congruence of Social Networking Services
- a New Zealand Assessment of Ethical Information Handling. The Electronic
Journal of Information Systems Evaluation. 13(2), 121-132 Retrieved from
http://www.ejise.com
Hussien, A. A., Hamza, N., & Hefny, H. A. (2013). Attacks on anonymization-based
privacy-preserving: A survey for data mining and data publishing. Journal of
Information Security, 4(2), 101-112. Retrieved from
http://search.proquest.com/docview/1349963671?accountid=39646
IDC (2012). The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and
Biggest Growth in the Far East. Retrieved from
http://www.emc.com/leadership/digital-universe/iview/executive-summary-a-universe-of.htm
Seltzer, W. (2005). The Promise and Pitfalls of Data Mining: Ethical Issues. Retrieved
from http://www.amstat.org/committees/ethics/linksdir/Jsm2005Seltzer.pdf
Sweeney, L. (2000). Simple Demographics Often Identify People Uniquely. Carnegie
Mellon University, Data Privacy Working Paper 3. Pittsburgh. Retrieved from
http://dataprivacylab.org/projects/identifiability/paper1.pdf
University of Waikato (2013). WEKA: Waikato Environment for Knowledge Analysis
(3.6.10) [Data Mining Tool]. Retrieved from
http://www.cs.waikato.ac.nz/ml/weka/
Van Wel, L., & Royakkers, L. (2004). Ethical issues in web data mining. Ethics and
Information Technology, 6(2), 129-140. Retrieved from
http://search.proquest.com/docview/222253386?accountid=39646
WEKA MOOC (2013). Online Data Mining Paper. Waikato University [Run from 9th
September 2013 to the 20th October 2013]. Retrieved from
https://weka.waikato.ac.nz/
Witten, I. H., Frank, E., & Hall, M. A. (2011). Data Mining: Practical Machine Learning
Tools and Techniques (3rd ed.). Burlington, MA: Morgan Kaufmann
Appendices
Appendix One
Weka Version 3.6.10 was used to get a better understanding of data mining
techniques. A MOOC from Waikato University (Data Mining with Weka) was also
done during the period that the essay was written. The MOOC lasted for 5 weeks
with 6 lessons per week and two assessments. (The Weka MOOC ran from the 9th
September 2013 to the 20th October 2013.)
Screen shot of WEKA in action
(Figure 1: Weka running a NaiveBayes classification of a dataset)
Appendix Two
(Figure 2: Data Mining with Weka MOOC)
Each class consisted of a video tutorial with an associated activity using the Weka
data mining tool. One whole lesson was devoted to ethical issues.
Figure 3: Lesson and Activity Structure from the Waikato University Weka MOOC
Appendix Three
Figure 4: Completion Certificate
The whole Data Mining MOOC with Weka was fairly painless. Ian Witten was a very
good teacher and kept the course interesting. There was lots of help if needed with
forums, wikis etc.