Page 161 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING
Unit 8: Data Warehouse Refreshment
In any case, the refreshment of a data warehouse is considered to be a difficult and critical
problem for three main reasons:
1. First, the volume of data stored in a warehouse is usually large and is predicted to grow in
the near future. Recent inquiries show that 100 GB warehouses are becoming commonplace.
Also, a study from META Group published in January 1996 reported that 52% of the
warehouses surveyed would be 20 GB to 1 TB or larger in 12-18 months. In particular, the
level of detail required by the business leads to fundamentally new volumes of warehoused
data. Further, the refreshment process must be propagated along the various levels of data
(ODS, CDW and data marts), which enlarges the volume of data that must be refreshed.
2. Second, the refreshment of a warehouse requires the execution of transactional workloads of
varying complexity. In fact, the refreshment of a warehouse poses different performance
challenges depending on its level in the architecture. The refreshment of an ODS involves
many transactions that need to access and update a few records. Thus, the performance
requirements for refreshment are those of general purpose record-level update processing.
The refreshment of a global data warehouse involves heavy load and access transactions.
Possibly large volumes of data are periodically loaded in the data warehouse, and once
loaded, these data are accessed either for informational processing or for refreshing
the local warehouses. Power for loading is now measured in GB per hour and several
companies are moving to parallel architectures when possible to increase their processing
power for loading and refreshment. The network interconnecting the data sources to the
warehouse can also be a bottleneck during refreshment and calls for compression techniques
for data transmission. Finally, the refreshment of local warehouses
involves transactions that access many data, perform complex calculations to produce
highly summarized and aggregated data, and update a few records in the local warehouses.
This is particularly true for the local data warehouses that usually contain the data cubes
manipulated by OLAP applications. Thus, a considerable processing time may be needed
to refresh the warehouses. This is a problem because there is always a limited time frame
during which the refreshment is expected to happen. Even if this time frame goes up to
several hours and does not occur at peak periods, it may be challenging to guarantee that
the data warehouse will be refreshed within it.
3. Third, the refreshment of a warehouse may be run concurrently with the processing of
queries. This may happen because the time frame during which the data warehouse is not
queried is either too short or nonexistent.
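The time-frame constraint mentioned in the second reason can be illustrated with a simple calculation. The sketch below assumes a sustained loading rate expressed in GB per hour and checks whether a given volume can be loaded within the available window. The function names and all figures are hypothetical and serve only as an illustration.

```python
def refresh_hours(volume_gb: float, load_rate_gb_per_hour: float) -> float:
    """Estimate the hours needed to load volume_gb at a sustained rate."""
    if load_rate_gb_per_hour <= 0:
        raise ValueError("load rate must be positive")
    return volume_gb / load_rate_gb_per_hour


def fits_window(volume_gb: float, load_rate_gb_per_hour: float,
                window_hours: float) -> bool:
    """Return True if the estimated load time does not exceed the window."""
    return refresh_hours(volume_gb, load_rate_gb_per_hour) <= window_hours


# Hypothetical figures: 100 GB to load, 6-hour refreshment window.
print(fits_window(100, 25, 6))  # 25 GB/hour needs 4 hours
print(fits_window(100, 10, 6))  # 10 GB/hour needs 10 hours
```

With these assumed figures, a rate of 25 GB per hour fits the window while 10 GB per hour does not. The estimate ignores network transfer, aggregation and concurrent queries, all of which would lengthen the real refreshment time.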
Task: The weather data is stored for different locations in a warehouse. The weather
data consists of ‘temperature,’ ‘pressure,’ ‘humidity,’ and ‘wind velocity.’ The location is
defined in terms of ‘latitude,’ ‘longitude,’ ‘altitude’ and ‘time.’ Assume that nation() is a
function that returns the name of the country for a given latitude and longitude. Propose a
warehousing model for this case.
8.2 Incremental Data Extraction
The way incremental data extraction can be implemented depends on the characteristics of the
data sources and also on the desired functionality of the data warehouse system.
Data sources are heterogeneous and can include conventional database systems and nontraditional
sources like flat files, XML and HTML documents, knowledge systems and legacy systems.
The mechanisms offered by each data source to help the detection of changes are also quite
heterogeneous.
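When a source such as a flat file offers no change-detection mechanism of its own, one possible fallback is to compare two successive snapshots of the source. The sketch below is a minimal illustration of that idea, assuming each record is a tuple whose first field is a unique key. The function snapshot_delta and the sample rows are hypothetical and do not come from any particular product.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple  # assumption: the first field of each record is a unique key


def snapshot_delta(old_rows: Iterable[Record],
                   new_rows: Iterable[Record]) -> Dict[str, List[Record]]:
    """Compare two snapshots and return inserted, updated and deleted records."""
    old = {row[0]: row for row in old_rows}
    new = {row[0]: row for row in new_rows}

    inserts = [new[k] for k in new if k not in old]
    deletes = [old[k] for k in old if k not in new]
    updates = [new[k] for k in new if k in old and new[k] != old[k]]
    return {"insert": inserts, "update": updates, "delete": deletes}


# Hypothetical snapshots of a source taken at two different times.
previous = [(1, "a"), (2, "b"), (3, "c")]
current = [(1, "a"), (2, "B"), (4, "d")]

delta = snapshot_delta(previous, current)
print(delta)
```

Only the records in the delta would then need to be propagated to the warehouse. Sources that do provide their own mechanisms, for example logs or timestamps where available, could avoid the cost of a full comparison.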