Page 142 - DCAP208_Management Support Systems
Unit 9: Data Mining
Flat files: Flat files are actually the most common data source for data mining algorithms,
especially at the research level. Flat files are simple data files in text or binary format with
a structure known by the data mining algorithm to be applied. The data in these files can
be transactions, time-series data, scientific measurements, etc.
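As a minimal sketch of what "a structure known by the data mining algorithm" can mean in practice, the Python fragment below parses a small text flat file of transactions. The layout (transaction ID, customer ID, then a semicolon-separated item list) and the names SAMPLE and read_transactions are assumptions made for illustration only; they do not come from the text.

```python
import io

# Hypothetical flat-file layout, one transaction per line:
#   transaction_id,customer_id,item;item;...
SAMPLE = """\
T100,C1,I2;I6;I10
T101,C2,I1;I2
T102,C1,I6
"""

def read_transactions(handle):
    """Parse each non-blank line into (transaction_id, customer_id, [items])."""
    transactions = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        trans_id, customer_id, items = line.split(",")
        transactions.append((trans_id, customer_id, items.split(";")))
    return transactions

# io.StringIO stands in for a file opened with open(path)
transactions = read_transactions(io.StringIO(SAMPLE))
```

The parser works only because the field order and separators are agreed in advance; a flat file carries no schema of its own.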
Relational Databases: Briefly, a relational database consists of a set of tables containing
either values of entity attributes, or values of attributes from entity relationships. Tables
have columns and rows, where columns represent attributes and rows represent tuples.
A tuple in a relational table corresponds to either an object or a relationship between
objects and is identified by a set of attribute values representing a unique key.
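The idea that a tuple is identified by a unique key can be sketched with Python's built-in sqlite3 module. The column names below (customerID, itemID, borrowDate) and the sample values are assumptions chosen for illustration, not the actual schema of the video store. Customer and Items hold objects, while Borrow holds a relationship between them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Entity tables: each row (tuple) describes one object.
conn.execute("CREATE TABLE Customer (customerID INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE Items (itemID INTEGER PRIMARY KEY, title TEXT)")

# Relationship table: each row links a customer to an item.
# The unique key is the combination of three attribute values.
conn.execute("""
    CREATE TABLE Borrow (
        customerID INTEGER REFERENCES Customer(customerID),
        itemID     INTEGER REFERENCES Items(itemID),
        borrowDate TEXT,
        PRIMARY KEY (customerID, itemID, borrowDate)
    )
""")

conn.execute("INSERT INTO Customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO Items VALUES (10, 'Some Title')")
conn.execute("INSERT INTO Borrow VALUES (1, 10, '2024-01-05')")

# A second tuple with the same key values is rejected by the database.
try:
    conn.execute("INSERT INTO Borrow VALUES (1, 10, '2024-01-05')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```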
Example: In Figure 9.2 we present some relations, Customer, Items, and Borrow,
representing business activity in a fictitious video store, OurVideoStore. These relations are just
a subset of what could be a database for the video store and are given as an example.
Figure 9.2: Fragments of Some Relations from a Relational Database for OurVideoStore
Source: http://webdocs.cs.ualberta.ca/~zaiane/courses/cmput690/notes/Chapter1/
The most commonly used query language for relational databases is SQL, which allows retrieval
and manipulation of the data stored in the tables, as well as the calculation of aggregate functions
such as average, sum, min, max and count. For instance, an SQL query to count the videos
in each category would be:
SELECT category, count(*) FROM Items WHERE type = 'video' GROUP BY category;
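To try such a query, the sketch below runs it against a small in-memory table using Python's sqlite3 module. The Items columns and sample rows are assumptions for illustration and are not taken from Figure 9.2. The query selects category alongside the count so that each count is labelled, and writes 'video' as a quoted string literal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed, simplified version of the Items relation.
conn.execute(
    "CREATE TABLE Items (itemID INTEGER PRIMARY KEY, type TEXT, title TEXT, category TEXT)"
)
conn.executemany(
    "INSERT INTO Items VALUES (?, ?, ?, ?)",
    [
        (1, "video", "Title A", "Comedy"),
        (2, "video", "Title B", "Comedy"),
        (3, "video", "Title C", "Drama"),
        (4, "game", "Title D", "Action"),  # excluded by the WHERE clause
    ],
)

# Count the videos in each category; ORDER BY makes the row order predictable.
rows = conn.execute(
    "SELECT category, count(*) FROM Items "
    "WHERE type = 'video' GROUP BY category ORDER BY category"
).fetchall()

for category, n in rows:
    print(category, n)
```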
Data mining algorithms using relational databases can be more versatile than data mining
algorithms specifically written for flat files, since they can take advantage of the structure
inherent to relational databases. While data mining can benefit from SQL for data selection,
transformation and consolidation, it goes beyond what SQL can provide, supporting tasks such as
prediction, comparison, and the detection of deviations.
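The division of labour described here can be sketched as follows: SQL selects and consolidates the data, and a separate analysis step looks for deviations. The Borrow columns, the sample counts, and the simple z-score rule with a threshold of 2.0 are all assumptions chosen for illustration; they stand in for whatever deviation-detection method a real data mining tool would apply.

```python
import sqlite3
from statistics import mean, pstdev

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Borrow (customerID INTEGER, itemID INTEGER)")

# Assumed sample data: customers 1-5 borrow 2 or 3 items, customer 6 borrows 20.
borrow_counts = {1: 2, 2: 3, 3: 2, 4: 3, 5: 2, 6: 20}
rows = [(cid, item) for cid, n in borrow_counts.items() for item in range(n)]
conn.executemany("INSERT INTO Borrow VALUES (?, ?)", rows)

# Step 1 (SQL): selection and consolidation into one count per customer.
per_customer = conn.execute(
    "SELECT customerID, count(*) FROM Borrow GROUP BY customerID ORDER BY customerID"
).fetchall()

# Step 2 (beyond the SQL query): flag customers whose count lies more than
# two standard deviations from the mean.
counts = [n for _, n in per_customer]
mu, sigma = mean(counts), pstdev(counts)
outliers = [cid for cid, n in per_customer if abs(n - mu) / sigma > 2.0]
```

With this sample data only customer 6 is flagged. Real deviation detection would need a method suited to the data rather than a fixed threshold.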
Data Warehouses: A data warehouse, as a storehouse, is a repository of data collected from
multiple data sources (often heterogeneous) and is intended to be used as a whole under
the same unified schema. A data warehouse gives the option to analyze data from different
sources under the same roof. Let us suppose that OurVideoStore becomes a franchise in
North America. Many video stores belonging to the OurVideoStore company may have
different databases and different structures. If the executive of the company wants to
access the data from all stores for strategic decision-making, future direction, marketing,
LOVELY PROFESSIONAL UNIVERSITY 135