Page 161 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING

Unit 8: Data Warehouse Refreshment




          In any case, the refreshment of a data warehouse is considered to be a difficult and critical
          problem for three main reasons:
          1.   First, the volume of data stored in a warehouse is usually large and is predicted to grow in
               the near future. Recent inquiries show that 100 GB warehouses are becoming commonplace.
               Also, a study from META Group published in January 1996 reported that 52% of the
               warehouses surveyed would be 20 GB to 1 TB or larger in 12-18 months. In particular, the
               level of detail required by the business leads to fundamentally new volumes of warehoused
               data. Further, the refreshment process must be propagated along the various levels of data
               (ODS, CDW and data marts), which enlarges the volume of data that must be refreshed.
          2.   Second, the refreshment of a warehouse requires the execution of transactional workloads of
               varying complexity. In fact, the refreshment of warehouses yields different performance
               challenges depending on the level of the warehouse in the architecture. The refreshment of
               an ODS involves many transactions that each need to access and update a few records. Thus,
               the performance requirements for refreshment are those of general purpose record-level
               update processing. The refreshment of a global data warehouse involves heavy load and
               access transactions. Possibly large volumes of data are periodically loaded in the data
               warehouse, and once loaded, these data are accessed either for informational processing or
               for refreshing the local warehouses. Power for loading is now measured in GB per hour, and
               several companies are moving to parallel architectures when possible to increase their
               processing power for loading and refreshment. The network interconnecting the data sources
               to the warehouse can also be a bottleneck during refreshment and calls for compression
               techniques for data transmission. Finally, the refreshment of local warehouses involves
               transactions that access large amounts of data, perform complex calculations to produce
               highly summarized and aggregated data, and update a few records in the local warehouses.
               This is particularly true for the local data warehouses that usually contain the data cubes
               manipulated by OLAP applications. Thus, a considerable processing time may be needed
               to refresh the warehouses. This is a problem because there is always a limited time frame
               during which the refreshment is expected to happen. Even if this time frame goes up to
               several hours and does not occur at peak periods, it may be challenging to guarantee that
               the data warehouse will be refreshed within it.
          3.   Third, the refreshment of a warehouse may be run concurrently with the processing of
               queries. This may happen because the time frame during which the data warehouse is not
               queried is either too short or nonexistent.
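
The second reason can be made concrete with a little arithmetic. The sketch below estimates how long a refresh takes when it is propagated level by level (ODS, CDW, data marts) and checks the result against a fixed time frame. It is only an illustration: the names `Level`, `refresh_hours` and `fits_window`, the volumes and load rates, and the assumption that parallel hardware gives an ideal linear speedup are all hypothetical and do not come from the text.

```python
from dataclasses import dataclass


@dataclass
class Level:
    """One level of the warehouse architecture (hypothetical figures)."""
    name: str
    volume_gb: float         # data to be refreshed at this level
    rate_gb_per_hour: float  # sustained load rate on a single processor


def refresh_hours(levels, parallel_degree=1):
    """Hours needed when the levels are refreshed one after another.

    Assumes an ideal linear speedup from parallel loading, which real
    systems rarely reach.
    """
    if parallel_degree < 1:
        raise ValueError("parallel_degree must be at least 1")
    return sum(
        lvl.volume_gb / (lvl.rate_gb_per_hour * parallel_degree)
        for lvl in levels
    )


def fits_window(levels, window_hours, parallel_degree=1):
    """True if the whole refresh finishes inside the available time frame."""
    return refresh_hours(levels, parallel_degree) <= window_hours


# Illustrative numbers only.
levels = [
    Level("ODS", volume_gb=5, rate_gb_per_hour=10),
    Level("CDW", volume_gb=100, rate_gb_per_hour=20),
    Level("Data marts", volume_gb=20, rate_gb_per_hour=8),
]

print(refresh_hours(levels))                                     # 8.0
print(fits_window(levels, window_hours=6))                       # False
print(fits_window(levels, window_hours=6, parallel_degree=2))    # True
```

With these invented figures a sequential refresh needs 8 hours and misses a 6-hour time frame, while doubling the loading power brings it down to 4 hours. This is the kind of reasoning behind the move to parallel architectures mentioned above.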




              Task    The weather data is stored for different locations in a warehouse. The weather
             data consists of ‘temperature,’ ‘pressure,’ ‘humidity,’ and ‘wind velocity.’ The location is
             defined in terms of ‘latitude,’ ‘longitude,’ ‘altitude’ and ‘time.’ Assume that nation() is a
             function that returns the name of the country for a given latitude and longitude. Propose a
             warehousing model for this case.


          8.2 Incremental Data Extraction

          The way incremental data extraction can be implemented depends on the characteristics of the
          data sources and also on the desired functionality of the data warehouse system.
          Data sources are heterogeneous and can include conventional database systems and nontraditional
          sources like flat files, XML and HTML documents, knowledge systems and legacy systems.
          The mechanisms offered by each data source to help the detection of changes are also quite
          heterogeneous.
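
As an illustration, consider a source such as a flat file that offers no mechanism of its own for detecting changes. One possible approach, assumed here and not prescribed by the text above, is to compare the previous snapshot of the file with the current one and derive the inserted, updated and deleted records. The function names `read_snapshot` and `snapshot_diff`, the CSV layout and the key column `id` are hypothetical.

```python
import csv
import io


def read_snapshot(text, key):
    """Parse CSV text into a dict that maps each key value to its row."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[key]: row for row in reader}


def snapshot_diff(old, new):
    """Compare two snapshots and return (inserts, updates, deletes)."""
    inserts = [new[k] for k in new if k not in old]
    updates = [new[k] for k in new if k in old and new[k] != old[k]]
    deletes = [old[k] for k in old if k not in new]
    return inserts, updates, deletes


# Two hypothetical snapshots of the same flat file.
yesterday = "id,name,city\n1,Ann,Oslo\n2,Bob,Rome\n3,Cy,Lima\n"
today = "id,name,city\n1,Ann,Oslo\n2,Bob,Paris\n4,Di,Cairo\n"

inserts, updates, deletes = snapshot_diff(
    read_snapshot(yesterday, "id"),
    read_snapshot(today, "id"),
)
print([r["id"] for r in inserts])  # ['4']
print([r["id"] for r in updates])  # ['2']
print([r["id"] for r in deletes])  # ['3']
```

Only the three changed records would then be sent on to the warehouse instead of the whole file. A database source that records modification timestamps or keeps a log could supply the same information without a full comparison, which is why the extraction method depends on the source.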





                                           Lovely Professional University                                   155