Page 140 - DCAP208_Management Support Systems

P. 140

Unit 9: Data Mining

memos in textual forms often exchanged by e-mail. These messages are regularly stored Notes
in digital form for future use and reference creating formidable digital libraries.

The World Wide Web repositories: Since the inception of the World Wide Web in 1993,
documents of all sorts of formats, content and description have been collected and inter-
connected with hyperlinks making it the largest repository of data ever built. Despite its
dynamic and unstructured nature, its heterogeneous characteristic, and its very often
redundancy and inconsistency, the World Wide Web is the most important data collection
regularly used for reference because of the broad variety of topics covered and the infinite
contributions of resources and publishers. Many believe that the World Wide Web will
become the compilation of human knowledge.

9.1.2 Data Mining and Knowledge Discovery

With the enormous amount of data stored in files, databases, and other repositories, it is
increasingly important, if not necessary, to develop powerful means for analysis and perhaps
interpretation of such data and for the extraction of interesting knowledge that could help in
decision-making.

Data Mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the
nontrivial extraction of implicit, previously unknown and potentially useful information from
data in databases.

Did u know? While data mining and knowledge discovery in databases (or KDD) are
frequently treated as synonyms, data mining is actually part of the knowledge discovery
process.
The figure 9.1 shows data mining as a step in an iterative knowledge discovery process.
Figure 9.1: Data Mining

Source: http://webdocs.cs.ualberta.ca/~zaiane/courses/cmput690/notes/Chapter1/
The Knowledge Discovery in Databases process comprises of a few steps leading from raw data
collections to some form of new knowledge. The iterative process consists of the following
steps:

Data cleaning: It is also known as data cleansing, it is a phase in which noise data and
irrelevant data are removed from the collection.

LOVELY PROFESSIONAL UNIVERSITY 133

135 136 137 138 139 140 141 142 143 144 145