Page 31 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING

Unit 2: Data Mining Concept






              Task    Data mining is a term used to describe the “process of discovering patterns and
             trends in large data sets in order to find useful decision-making information.” Discuss.

          2.5 How Data Mining Works?

          Data mining is a component of a wider process called “knowledge discovery in databases” (KDD).
          It involves scientists and statisticians, as well as those working in other fields such as machine
          learning, artificial intelligence, information retrieval and pattern recognition.
          Before a data set can be mined, it first has to be “cleaned”. This cleaning process removes errors,
          ensures consistency and takes missing values into account. Next, computer algorithms are used
          to “mine” the clean data looking for unusual patterns. Finally, the patterns are interpreted to
          produce new knowledge.
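
          The three stages described above (clean, mine, interpret) can be sketched in a few lines.
          This is only an illustrative sketch: the records, the field names, the median fill for
          missing values and the "more than five times the median" test for an unusual pattern are
          all assumptions made for the example, not a prescribed method.

```python
from statistics import median

# Hypothetical raw records (invented values). None marks a missing value.
raw = [
    {"age": 34, "sex": "M", "income": 52000},
    {"age": None, "sex": "f", "income": 48000},
    {"age": 29, "sex": "F", "income": None},
    {"age": -5, "sex": "m", "income": 51000},    # impossible age: an error
    {"age": 41, "sex": "F", "income": 950000},
]

def clean(records):
    """Stage 1: remove errors, ensure consistency, handle missing values."""
    # Remove errors: drop records whose age is present but impossible.
    kept = [dict(r) for r in records if r["age"] is None or 0 <= r["age"] <= 120]
    # Ensure consistency: use one spelling per category.
    for r in kept:
        r["sex"] = r["sex"].upper()
    # Missing values: fill with the median of the observed values (an assumption).
    for field in ("age", "income"):
        fill = median(r[field] for r in kept if r[field] is not None)
        for r in kept:
            if r[field] is None:
                r[field] = fill
    return kept

def mine(records, factor=5):
    """Stage 2: flag incomes far above the typical (median) income."""
    typical = median(r["income"] for r in records)
    return [r for r in records if r["income"] > factor * typical]

cleaned = clean(raw)
unusual = mine(cleaned)

# Stage 3: interpretation is left to people; the code only reports candidates.
for r in unusual:
    print(f"Unusual income {r['income']} for customer aged {r['age']}: needs review")
```

          Note that the code stops at reporting candidate patterns. Deciding whether a flagged record
          is new knowledge or simply a data-entry mistake is the interpretation step, which remains
          a human task.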

          The following example illustrates how data mining can help bankers enhance their business.
          Records of the bank’s customers collected over the years, including information such as age, sex,
          marital status, occupation and number of children, are used in the mining process. First, an algorithm
          is used to identify characteristics that distinguish customers who took out a particular kind of
          loan from those who did not. Eventually, it develops “rules” by which it can identify customers
          who are likely to be good candidates for such a loan. These rules are then used to identify such
          customers in the remainder of the database. Next, another algorithm is used to sort the database
          into clusters, or groups of people with many similar attributes, in the hope that these might
          reveal interesting and unusual patterns. Finally, the patterns revealed by these clusters are
          interpreted by the data miners, in collaboration with bank personnel.
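
          The two algorithmic steps in the banking example can be sketched as follows. The text does
          not name the algorithms used, so this sketch substitutes two simple, well-known ones: a
          one-attribute rule learner (often called 1R) for the "rules" step and k-means for the
          clustering step. All customer records, field names and starting centroids are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical customers whose loan decision is already known.
labelled = [
    {"marital": "married", "occupation": "salaried", "loan": True},
    {"marital": "married", "occupation": "self-employed", "loan": True},
    {"marital": "married", "occupation": "salaried", "loan": True},
    {"marital": "single", "occupation": "salaried", "loan": False},
    {"marital": "single", "occupation": "self-employed", "loan": False},
    {"marital": "single", "occupation": "salaried", "loan": False},
    {"marital": "married", "occupation": "self-employed", "loan": False},
    {"marital": "single", "occupation": "self-employed", "loan": False},
]
# Hypothetical remainder of the database (loan decision unknown).
remainder = [
    {"id": 101, "marital": "married", "occupation": "salaried", "age": 52, "children": 3},
    {"id": 102, "marital": "single", "occupation": "salaried", "age": 25, "children": 0},
    {"id": 103, "marital": "married", "occupation": "self-employed", "age": 55, "children": 2},
    {"id": 104, "marital": "single", "occupation": "self-employed", "age": 27, "children": 0},
    {"id": 105, "marital": "single", "occupation": "salaried", "age": 30, "children": 1},
    {"id": 106, "marital": "married", "occupation": "salaried", "age": 58, "children": 3},
]

def one_rule(records, attributes, target):
    """1R: keep the attribute whose 'value -> majority class' rule errs least."""
    best = None
    for attr in attributes:
        by_value = defaultdict(Counter)
        for r in records:
            by_value[r[attr]][r[target]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - c[rule[v]] for v, c in by_value.items())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

def k_means(points, centroids, max_iter=20):
    """Plain k-means with squared Euclidean distance and fixed starting centroids."""
    for _ in range(max_iter):
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(pt, centroids[k])))
            for pt in points
        ]
        new_centroids = []
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            new_centroids.append(
                tuple(sum(dim) / len(members) for dim in zip(*members))
                if members else centroids[k]
            )
        if new_centroids == centroids:   # assignments have stabilised
            break
        centroids = new_centroids
    return labels, centroids

# Step 1: learn a rule from known customers, then apply it to the remainder.
attr, rule, errors = one_rule(labelled, ["marital", "occupation"], "loan")
candidates = [r["id"] for r in remainder if rule.get(r[attr], False)]

# Step 2: cluster the remainder on (age, children); features are left unscaled
# here for brevity, so age dominates the distance.
points = [(r["age"], r["children"]) for r in remainder]
labels, centroids = k_means(points, [(25, 0), (58, 3)])

print("rule on", attr, "->", rule, "| candidates:", candidates)
print("cluster labels:", labels)
```

          The printed rule and cluster labels are only raw output. As the example stresses, deciding
          what they mean for the bank is done by the data miners together with bank personnel.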

          2.6 Data Mining – On What Kind of Data?


          Data mining should be applicable to any kind of data repository, as well as to transient data,
          such as data streams. The data repository may include relational databases, data warehouses,
          transactional databases, advanced database systems, flat files, data streams, and the World Wide
          Web. Advanced database systems include object-relational databases and specific application-
          oriented databases, such as spatial databases, time-series databases, text databases, and
          multimedia databases. The challenges and techniques of mining may differ for each of the
          repository systems.
          A brief introduction to each of the major data repository systems listed above follows.

          2.6.1 Flat Files

          Flat files are actually the most common data source for data mining algorithms, especially at the
          research level. Flat files are simple data files in text or binary format with a structure known by
          the data mining algorithm to be applied. The data in these files can be transactions, time-series
          data, scientific measurements, etc.
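
          The point that the file structure must be known in advance can be illustrated with a short
          sketch. The two layouts below (a comma-separated text file of transactions and a binary
          file of timestamped measurements) are assumptions invented for the example, and in-memory
          data stands in for files on disk.

```python
import csv
import io
import struct

# Text flat file. Assumed layout: transaction_id,item,quantity (no header row).
TEXT_FILE = "1,bread,2\n1,milk,1\n2,bread,1\n"
FIELDS = ["transaction_id", "item", "quantity"]

def read_text_flat_file(handle):
    """Parse each line using the agreed field order and types."""
    rows = []
    for row in csv.DictReader(handle, fieldnames=FIELDS):
        rows.append({
            "transaction_id": int(row["transaction_id"]),
            "item": row["item"],
            "quantity": int(row["quantity"]),
        })
    return rows

transactions = read_text_flat_file(io.StringIO(TEXT_FILE))

# Binary flat file. Assumed layout: each record is a little-endian
# 32-bit integer timestamp followed by a 64-bit float measurement.
RECORD = struct.Struct("<id")
binary_file = b"".join(RECORD.pack(t, v) for t, v in [(0, 20.5), (60, 21.0)])

series = [
    RECORD.unpack_from(binary_file, offset)
    for offset in range(0, len(binary_file), RECORD.size)
]

print(transactions)
print(series)
```

          Nothing in either file describes its own layout. If the reader assumed a different field
          order or record size, the same bytes would be parsed into meaningless values, which is why
          the structure has to be known by the mining algorithm.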

          2.6.2 Relational Databases

          A database system or a Database Management System (DBMS) consists of a collection of
          interrelated data, known as a database, and a set of software programs to manage and access the
          data. The software programs provide the following functions:
          1.   Mechanisms to create the definition of database structures

          2.   Data storage



                                           Lovely Professional University