Page 218 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING
Notes: The performance of these query processing techniques is compared using disk I/O cost. Finally, we propose a recommendation for Database Management Systems (DBMSs) to select the most cost-effective query processing technique based on the cost model.
Before we can analyze and interpret customers' buying behaviour, we have to store all the historical information about those customers. This voluminous amount of data is most suitably stored in a data warehouse. Current DBMSs mainly deal with the daily operations of a business, such as Transaction Processing Systems (TPS), which involve both reading and writing data. A data warehouse, by contrast, handles large volumes of read-only information, and the volume of data in a data warehouse is growing rapidly. If a query were made over this large volume of data, the response time would, as expected, be poor. The next question that comes to mind is: how can we achieve a reasonable response time? Response time relies heavily on the query processing technique, which motivates experimenting with new techniques. A new, reliable and faster query processing technique will greatly enhance the performance of data warehousing processes.
11.4.1 Relational Model
Suppose we are working with a relation R that has m tuples {t1, t2, …, tm} and o attributes {a1, a2, …, ao}, where the sizes S of the o attributes are {s1, s2, …, so} and the tuple instances are {i11, i12, …, imo}. The terms relation and table are used interchangeably. Table 11.1 shows the relational model R.
Table 11.1: The Relational Model R

    a1     a2     ....   a4     ....
    i11    i12    ....   i14    ....
    :      :      :      :      :
    im1    im2    ....   im4    ....
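The relational model above can be sketched as a simple data structure. This is a minimal illustration only; the class and method names (`Relation`, `insert`, `project`) are assumptions for the sketch, not part of the text.

```python
# A minimal sketch of the relational model R described above:
# m tuples {t1..tm}, o attributes {a1..ao} with sizes {s1..so},
# and tuple instances i11..imo stored row by row.

class Relation:
    def __init__(self, attributes, sizes):
        self.attributes = attributes                # {a1, ..., ao}
        self.sizes = dict(zip(attributes, sizes))   # size s_j of each attribute
        self.tuples = []                            # rows of instances

    def insert(self, row):
        """Add one tuple; the row must supply every attribute."""
        assert len(row) == len(self.attributes)
        self.tuples.append(dict(zip(self.attributes, row)))

    def project(self, *attrs):
        """Return only the requested attributes (e.g. a1 and a4)."""
        return [{a: t[a] for a in attrs} for t in self.tuples]

# Illustrative instance of R with o = 4 attributes and m = 2 tuples:
R = Relation(["a1", "a2", "a3", "a4"], [100, 200, 300, 400])
R.insert(["i11", "i12", "i13", "i14"])
R.insert(["i21", "i22", "i23", "i24"])
print(R.project("a1", "a4"))
```

The `project` method anticipates the scenario discussed below, where a client needs only a subset of the attributes rather than every column of R.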
Assumptions
Let us assume that:
The access time per tuple of 1,000 bytes is 0.1 sec.
There are 10,000 tuples in relation R.
A disk I/O fetches 100 bytes in 0.01 sec.
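Under these assumptions the cost arithmetic is direct: a full scan touches all 10,000 tuples at 0.1 sec each, about 1,000 seconds, while fetching only the bytes actually needed costs one 0.01-sec I/O per 100 bytes. A minimal sketch (the per-tuple figure of 200 bytes for a two-attribute projection is an illustrative assumption, not from the text):

```python
# Rough cost arithmetic under the stated assumptions (illustrative only).
TUPLE_ACCESS_TIME = 0.1   # seconds per 1,000-byte tuple
NUM_TUPLES = 10_000       # tuples in relation R
IO_TIME = 0.01            # seconds per disk I/O
IO_BYTES = 100            # bytes fetched per disk I/O

def full_scan_time(num_tuples=NUM_TUPLES):
    """Every tuple is read in full at the table level."""
    return num_tuples * TUPLE_ACCESS_TIME

def io_time_for(bytes_needed):
    """Time to fetch only bytes_needed via 100-byte disk I/Os."""
    ios = -(-bytes_needed // IO_BYTES)   # ceiling division
    return ios * IO_TIME

print(full_scan_time())                 # ~1,000 seconds for a full scan of R
# Assumed example: a projection needing only 200 bytes per tuple
print(io_time_for(200 * NUM_TUPLES))    # far cheaper than the full scan
```

Comparing the two figures shows why reading only the required attributes, rather than scanning whole tuples, is the key to a reasonable response time.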
In a data warehousing environment, the information sources are made up of data combined from different sources. These data are fed into a central repository from the dispersed client locations and together form the relation R. In such a situation, the respective clients are usually interested only in their own data sets. Suppose a particular client wants to display a1 and a4. Given this situation, how can we achieve a reasonable response time?
11.4.2 Cost Model
Here, we introduce a cost model to measure the performance of query processing techniques. In general, three components make up the response time of a query. Scanning the data at the table level only excludes the access time of the index level, as in the case of the full scan technique; scanning the data at the index level only excludes the access time of the table level, as in the case of inverted partitioned indexes. Table 11.2 shows the cost model for the respective query processing techniques. To measure the response time of a query, we need to supply the query processing technique as a parameter, such as full scan, index techniques or inverted partitioned index; the DBMS will then be able to determine which components are required.
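The selection logic described above can be sketched as follows. The component costs and the component lists per technique are illustrative placeholders, not figures from Table 11.2:

```python
# A hypothetical cost model in the spirit of the text: each technique's
# response time is the sum of only the components it actually touches.
# All numeric values below are assumed, for illustration only.

COMPONENTS = {
    "table_access": 1000.0,   # full table-level scan cost (seconds)
    "index_access": 50.0,     # index-level traversal cost
    "result_fetch": 20.0,     # fetching the qualifying tuples
}

# Which cost components each technique incurs:
TECHNIQUES = {
    "full_scan": ["table_access"],                    # table level only
    "index": ["index_access", "result_fetch"],        # index, then fetch rows
    "inverted_partitioned_index": ["index_access"],   # index level only
}

def response_time(technique):
    """Sum the components required by the given technique."""
    return sum(COMPONENTS[c] for c in TECHNIQUES[technique])

def cheapest():
    """The DBMS picks the most cost-effective technique."""
    return min(TECHNIQUES, key=response_time)

for t in TECHNIQUES:
    print(t, response_time(t))
print("best:", cheapest())
```

With these assumed numbers the inverted partitioned index wins, matching the intuition that an index-only scan avoids table-level access entirely; in practice the DBMS would recompute the comparison per query from its own measured component costs.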
Lovely Professional University