Page 95 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING
P. 95

Unit 5: Data Warehouse Research – Issues and Research




          introduction                                                                          notes

          An enterprise data warehouse often fetches records from several disparate systems and store
          them centrally in an enterprise-wide warehouse. But what is the guarantee that the quality of
          data will not degrade in the process of centralization?

          Many of the data warehouses are built on n-tier architecture with multiple data extraction and
          data insertion jobs between two consecutive tiers. As it happens, the nature of the data changes
          as it passes from one tier to the next tier. Data reconciliation is the method of reconciling or tie-up
          the data between any two consecutive tiers (layers).

          5.1 Data extraction

          Extraction  is  the  operation  of  extracting  data  from  a  source  system  for  further  use  in  a  data
          warehouse environment. This is the first step of the ETL process. After the extraction, this data
          can be transformed and loaded into the data warehouse.
          The source systems for a data warehouse are typically transaction processing applications.


                 Example: One of the source systems for a sales analysis data warehouse might be an
          order entry system that records all of the current order activities.
          Designing and creating the extraction process is often one of the most time-consuming tasks
          in the ETL process and, indeed, in the entire data warehousing process. The source systems
          might be very complex and poorly documented, and thus determining which data needs to be
          extracted can be difficult. The data has to be extracted normally not only once, but several times
          in a periodic manner to supply all changed data to the data warehouse and keep it up-to-date.
          Moreover, the source system typically cannot be modified, nor can its performance or availability
          be adjusted, to accommodate the needs of the data warehouse extraction process.

          These are important considerations for extraction and ETL in general. This unit, however, focuses
          on the technical considerations of having different kinds of sources and extraction methods. It
          assumes that the data warehouse team has already identified the data that will be extracted, and
          discusses common techniques used for extracting data from source databases.

          extraction Methods in Data Warehouses

          The extraction method you should choose is highly dependent on the source system and also from
          the business needs in the target data warehouse environment. Very often, there is no possibility
          to add additional logic to the source systems to enhance an incremental extraction of data due to
          the performance or the increased workload of these systems. Sometimes even the customer is not
          allowed to add anything to an out-of-the-box application system.
          The estimated amount of the data to be extracted and the stage in the ETL process (initial load or
          maintenance of data) may also impact the decision of how to extract, from a logical and a physical
          perspective. Basically, you have to decide how to extract data logically and physically.

          Logical Extraction Methods

          There are two types of logical extraction:
          1.   Full Extraction: The data is extracted completely from the source system. Because this
               extraction reflects all the data currently available on the source system, there’s no need to
               keep track of changes to the data source since the last successful extraction. The source data
               will be provided as-is and no additional logical information (for example, timestamps) is




                                           LoveLy professionaL university                                    89
   90   91   92   93   94   95   96   97   98   99   100