Page 95 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING
P. 95
Unit 5: Data Warehouse Research – Issues and Research
introduction notes
An enterprise data warehouse often fetches records from several disparate systems and store
them centrally in an enterprise-wide warehouse. But what is the guarantee that the quality of
data will not degrade in the process of centralization?
Many of the data warehouses are built on n-tier architecture with multiple data extraction and
data insertion jobs between two consecutive tiers. As it happens, the nature of the data changes
as it passes from one tier to the next tier. Data reconciliation is the method of reconciling or tie-up
the data between any two consecutive tiers (layers).
5.1 Data extraction
Extraction is the operation of extracting data from a source system for further use in a data
warehouse environment. This is the first step of the ETL process. After the extraction, this data
can be transformed and loaded into the data warehouse.
The source systems for a data warehouse are typically transaction processing applications.
Example: One of the source systems for a sales analysis data warehouse might be an
order entry system that records all of the current order activities.
Designing and creating the extraction process is often one of the most time-consuming tasks
in the ETL process and, indeed, in the entire data warehousing process. The source systems
might be very complex and poorly documented, and thus determining which data needs to be
extracted can be difficult. The data has to be extracted normally not only once, but several times
in a periodic manner to supply all changed data to the data warehouse and keep it up-to-date.
Moreover, the source system typically cannot be modified, nor can its performance or availability
be adjusted, to accommodate the needs of the data warehouse extraction process.
These are important considerations for extraction and ETL in general. This unit, however, focuses
on the technical considerations of having different kinds of sources and extraction methods. It
assumes that the data warehouse team has already identified the data that will be extracted, and
discusses common techniques used for extracting data from source databases.
extraction Methods in Data Warehouses
The extraction method you should choose is highly dependent on the source system and also from
the business needs in the target data warehouse environment. Very often, there is no possibility
to add additional logic to the source systems to enhance an incremental extraction of data due to
the performance or the increased workload of these systems. Sometimes even the customer is not
allowed to add anything to an out-of-the-box application system.
The estimated amount of the data to be extracted and the stage in the ETL process (initial load or
maintenance of data) may also impact the decision of how to extract, from a logical and a physical
perspective. Basically, you have to decide how to extract data logically and physically.
Logical Extraction Methods
There are two types of logical extraction:
1. Full Extraction: The data is extracted completely from the source system. Because this
extraction reflects all the data currently available on the source system, there’s no need to
keep track of changes to the data source since the last successful extraction. The source data
will be provided as-is and no additional logical information (for example, timestamps) is
LoveLy professionaL university 89