Page 31 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING
Unit 2: Data Mining Concept
Task: Data mining is a term used to describe the “process of discovering patterns and
trends in large data sets in order to find useful decision-making information.” Discuss.
2.5 How Data Mining Works?
Data mining is one component of a wider process called “knowledge discovery from databases”.
It involves scientists and statisticians, as well as those working in other fields such as machine
learning, artificial intelligence, information retrieval and pattern recognition.
Before a data set can be mined, it first has to be “cleaned”. This cleaning process removes errors,
ensures consistency and takes missing values into account. Next, computer algorithms are used
to “mine” the clean data looking for unusual patterns. Finally, the patterns are interpreted to
produce new knowledge.
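
The three stages described above (clean, mine, interpret) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field meanings, the mean-fill strategy for missing values and the "two standard deviations" test for an unusual value are all invented for the example and are not prescribed by the text.

```python
from statistics import mean, stdev

# Hypothetical raw records: (customer_id, monthly_balance).
# None marks a missing value; customer 5 appears twice (a duplicate entry).
raw = [(1, 120.0), (2, None), (3, 130.0), (4, 125.0),
       (5, 900.0), (5, 900.0), (6, 118.0), (7, 127.0)]

def clean(records):
    """Cleaning: drop exact duplicates and fill missing values with the mean."""
    unique = list(dict.fromkeys(records))          # keeps first occurrence, preserves order
    known = [v for _, v in unique if v is not None]
    fill = mean(known)
    return [(cid, v if v is not None else fill) for cid, v in unique]

def mine(records, threshold=2.0):
    """Mining: flag values more than `threshold` standard deviations from the mean."""
    values = [v for _, v in records]
    mu, sigma = mean(values), stdev(values)
    return [(cid, v) for cid, v in records if abs(v - mu) > threshold * sigma]

def interpret(patterns):
    """Interpretation: turn each flagged pattern into a statement an analyst can review."""
    return [f"Customer {cid} has an unusual balance of {v:.2f}" for cid, v in patterns]

cleaned = clean(raw)
patterns = mine(cleaned)
for line in interpret(patterns):
    print(line)
```

In practice each stage is far more involved, but the ordering is the point: the mining step only ever sees data that has already been cleaned, and its output still needs a human-readable interpretation.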
The following example illustrates how data mining can help bankers enhance their businesses.
Records containing information such as the age, sex, marital status, occupation and number of
children of the bank’s customers over the years are used in the mining process. First, an algorithm
is used to identify characteristics that distinguish customers who took out a particular kind of
loan from those who did not. Eventually, it develops “rules” by which it can identify customers
who are likely to be good candidates for such a loan. These rules are then used to identify such
customers on the remainder of the database. Next, another algorithm is used to sort the database
into clusters, or groups of people with many similar attributes, in the hope that these might
reveal interesting and unusual patterns. Finally, the patterns revealed by these clusters are
interpreted by the data miners, in collaboration with bank personnel.
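
The banking example can be sketched as follows. The text does not name the algorithms involved, so this sketch assumes a simple one-attribute rule learner (often called 1R) for the rule-finding step and a basic k-means procedure for the clustering step. All customer records, field names and starting centroids are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical labelled customers (whether they took the loan is already known).
labelled = [
    {"marital": "married", "occupation": "salaried",      "took_loan": True},
    {"marital": "married", "occupation": "self-employed", "took_loan": True},
    {"marital": "single",  "occupation": "salaried",      "took_loan": False},
    {"marital": "single",  "occupation": "student",       "took_loan": False},
    {"marital": "married", "occupation": "student",       "took_loan": True},
    {"marital": "single",  "occupation": "self-employed", "took_loan": False},
]

def one_r(records, attributes, target):
    """Return (attribute, rule, errors) for the attribute whose
    value -> majority-class rule misclassifies the fewest records."""
    best = None
    for attr in attributes:
        counts = defaultdict(Counter)
        for r in records:
            counts[r[attr]][r[target]] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[r[attr]] != r[target] for r in records)
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

attr, rule, errors = one_r(labelled, ["marital", "occupation"], "took_loan")

# Apply the learned rule to the remainder of the (hypothetical) database.
remaining = [
    {"id": 101, "marital": "married", "occupation": "salaried"},
    {"id": 102, "marital": "single",  "occupation": "salaried"},
    {"id": 103, "marital": "married", "occupation": "student"},
]
candidates = [r["id"] for r in remaining if rule.get(r[attr], False)]

def k_means(points, centroids, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Hypothetical (age, number_of_children) pairs; starting centroids are chosen by hand.
points = [(25, 0), (27, 0), (24, 1), (45, 2), (48, 3), (50, 2)]
clusters = k_means(points, centroids=[(25, 0), (45, 2)])

print("Best attribute:", attr, "with", errors, "errors")
print("Loan candidates:", candidates)
print("Clusters:", clusters)
```

On this toy data the rule learner picks marital status, and the clustering separates younger customers with few children from older customers with more. Whether such groups mean anything for the business is exactly the interpretation step that the data miners and bank personnel carry out together. A real analysis would also scale the attributes before clustering, since age otherwise dominates the distance.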
2.6 Data Mining – On What Kind of Data?
Data mining should be applicable to any kind of data repository, as well as to transient data,
such as data streams. The data repository may include relational databases, data warehouses,
transactional databases, advanced database systems, flat files, data streams, and the World Wide
Web. Advanced database systems include object-relational databases and specific
application-oriented databases, such as spatial databases, time-series databases, text databases, and
multimedia databases. The challenges and techniques of mining may differ for each of the
repository systems.
A brief introduction to each of the major data repository systems listed above follows.
2.6.1 Flat Files
Flat files are actually the most common data source for data mining algorithms, especially at the
research level. Flat files are simple data files in text or binary format with a structure known by
the data mining algorithm to be applied. The data in these files can be transactions, time-series
data, scientific measurements, etc.
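
As a small illustration of a structure "known by the data mining algorithm", the sketch below reads a text flat file of transactions and counts how often each item occurs. The file contents, the column names `transaction_id` and `items`, and the semicolon separator are assumptions made for this example. An in-memory string stands in for a file on disk.

```python
import csv
import io
from collections import Counter

# Stand-in for a text flat file; the layout below is hypothetical.
FLAT_FILE = """transaction_id,items
1,bread;milk
2,bread;butter
3,milk;butter;bread
"""

def load_transactions(handle):
    """Parse each row into a set of items, relying on the agreed column layout."""
    reader = csv.DictReader(handle)
    return [set(row["items"].split(";")) for row in reader]

transactions = load_transactions(io.StringIO(FLAT_FILE))

# A trivial "mining" step: count how many transactions contain each item.
item_counts = Counter(item for t in transactions for item in t)
print(item_counts.most_common())
```

The loader works only because it knows the layout in advance. A flat file carries no schema of its own, which is the main practical difference from the database systems described next.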
2.6.2 Relational Databases
A database system or a Database Management System (DBMS) consists of a collection of
interrelated data, known as a database, and a set of software programs to manage and access the
data. The software programs involve the following functions:
1. Mechanisms to create the definition of database structures
2. Data storage