Page 181 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING
Unit 9: Data Warehouse Refreshment – II
The refreshment process differs from the loading process mainly in the
following respects. First, the refreshment process may have complete asynchronism between its
different activities (preparation, integration, aggregation and customisation). Second, there may
be a high level of parallelism within the preparation activity itself, each data source having its
own availability window and its own strategy of extraction. The synchronization is done by the
integration activity. Another difference lies in the source availability. While the loading phase
requires a long period of availability, the refreshment phase should not overload the operational
applications which use the data sources. Each source therefore provides a specific access frequency
and a restricted availability duration. Finally, there are more constraints on response time for
the refreshment process than for the loading process. Indeed, with respect to the users, the data
warehouse does not exist before the initial loading, so the computation time is included within
the design project duration. After the initial loading, the data becomes visible and should satisfy
user requirements in terms of data availability, accessibility and freshness.
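The parallelism of the preparation activity and the synchronizing role of integration can be illustrated with a minimal sketch. The `Source` class, the availability windows and the sample rows are hypothetical and serve only as an illustration; a real system would read from the operational databases.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Source:
    """A hypothetical operational source with its own availability window."""
    name: str
    window_start: time
    window_end: time
    rows: tuple  # stand-in for the data the source would return

    def is_available(self, now: time) -> bool:
        return self.window_start <= now <= self.window_end


def prepare(source: Source, now: time) -> list:
    """Preparation: extract from one source, only inside its window."""
    if not source.is_available(now):
        return []  # skip the source rather than overload it
    return [(source.name, row) for row in source.rows]


def integrate(prepared: list) -> list:
    """Integration: the point where per-source results are synchronized."""
    merged = [row for batch in prepared for row in batch]
    return sorted(merged)


def refresh(sources: list, now: time) -> list:
    # Each source is prepared independently and in parallel.
    with ThreadPoolExecutor() as pool:
        # list(...) blocks until every batch is ready.
        prepared = list(pool.map(lambda s: prepare(s, now), sources))
    return integrate(prepared)


sources = [
    Source("orders", time(1, 0), time(3, 0), (101, 102)),
    Source("billing", time(2, 0), time(4, 0), (7,)),
    Source("crm", time(22, 0), time(23, 0), (55,)),
]
result = refresh(sources, now=time(2, 30))
```

In this sketch the "crm" source is outside its window at 02:30 and is therefore skipped, while the other two sources are extracted concurrently and merged only at the integration step.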
View Maintenance vs. Data Refreshment
The propagation of changes during the refreshment process is done through a set of independent
activities among which we find the maintenance of the views stored in the ODS and CDW levels.
The view maintenance phase consists in propagating a certain change raised in a given source
over a set of views stored at the ODS or CDW level. Such a phase is a classical materialized
view maintenance problem except that, in data warehouses, the changes to propagate into the
aggregated views are not exactly those that occurred in the sources, but the result of pre-treatments
performed by other refreshment activities such as data cleaning and multi-source data
reconciliation.
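The following sketch illustrates this point under simple assumptions: raw source changes first pass through a cleaning step and a reconciliation step, and only the resulting delta is applied to an aggregated view. The functions `clean`, `reconcile` and `maintain_view`, the record layout and the "last change wins" reconciliation rule are all hypothetical choices made for the illustration.

```python
from collections import defaultdict


def clean(change: dict):
    """Hypothetical cleaning step: drop invalid changes, normalise region names."""
    if change["amount"] is None:
        return None
    return {**change, "region": change["region"].strip().upper()}


def reconcile(changes: list) -> list:
    """Hypothetical reconciliation: keep one change per order_id,
    preferring the one seen last across sources."""
    latest = {}
    for change in changes:
        latest[change["order_id"]] = change
    return list(latest.values())


def maintain_view(view: dict, changes: list) -> None:
    """Apply the pre-treated insertions to an aggregated view (sum per region)."""
    for change in changes:
        view[change["region"]] += change["amount"]


raw_changes = [
    {"source": "A", "order_id": 1, "region": " north ", "amount": 10},
    {"source": "B", "order_id": 1, "region": "North", "amount": 10},    # same order, other source
    {"source": "A", "order_id": 2, "region": "south", "amount": None},  # invalid, dropped
    {"source": "B", "order_id": 3, "region": "south", "amount": 5},
]

sales_by_region = defaultdict(int, {"NORTH": 100, "SOUTH": 40})

cleaned = [c for c in map(clean, raw_changes) if c is not None]
delta = reconcile(cleaned)
maintain_view(sales_by_region, delta)
```

Four changes were raised in the sources, but only two reach the view: the delta propagated into the aggregate is the output of the pre-treatments, not the raw source changes.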
The view maintenance problem has been intensively studied in the database research community.
Most of the references focus on the problems raised by the maintenance of a set of materialized
(also called concrete) views derived from a set of base relations when the current state of the base
relations is modified. The main results concern:
1. Self-maintainability: Results concerning self-maintainability are generalized to a set
of views: a set of views V is self-maintainable with respect to changes to the underlying
base relations if the changes can be propagated to every view in V without querying the
base relations (i.e. the information stored in the concrete views plus the instance of the
changes are sufficient to maintain the views).
2. Coherent and efficient update propagation: Various algorithms are provided to schedule
update propagation through each individual view, taking care of interdependencies
between views, which may lead to possible inconsistencies. For this purpose, auxiliary views
are often introduced to facilitate update propagation and to enforce self-maintainability.
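One simple way to respect interdependencies between views when scheduling update propagation is to process the views in a topological order of their derivation graph. The sketch below is only an illustration of that idea, not one of the published algorithms; the view names are hypothetical, and it assumes Python 3.9 or later for the standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical derivation graph: each view maps to the views or base
# relations it is derived from.
derived_from = {
    "ods_orders": {"src_orders"},
    "ods_customers": {"src_customers"},
    "cdw_sales": {"ods_orders", "ods_customers"},
    "cdw_sales_by_month": {"cdw_sales"},
}


def propagation_order(graph: dict) -> list:
    """Return an order in which every view comes after what it is derived from."""
    return list(TopologicalSorter(graph).static_order())


order = propagation_order(derived_from)
position = {name: i for i, name in enumerate(order)}
```

Propagating updates in this order ensures that a view such as `cdw_sales_by_month` is never maintained before the views it depends on have received the same changes.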
Results on the self-maintainability of a set of views are of great interest in the data warehouse
context, and it is commonly accepted that the set of views stored in a data warehouse has
to be globally self-maintainable. The rationale behind this recommendation is that
self-maintainability is a strong requirement imposed by the operational sources so as not to
overload their regular activity.
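As a minimal sketch of self-maintainability, consider a view that exposes an average per key. The average alone cannot be updated from the changes, but storing an auxiliary (sum, count) pair makes the view maintainable from insertions and deletions only, without ever querying the source. The class `AverageView` and its data are hypothetical.

```python
class AverageView:
    """Average amount per key, kept self-maintainable by storing
    an auxiliary (sum, count) pair for each key."""

    def __init__(self):
        self._aux = {}  # key -> [sum, count]

    def apply(self, inserted=(), deleted=()):
        """Maintain the view from the changes alone; no source access."""
        for key, amount in inserted:
            entry = self._aux.setdefault(key, [0, 0])
            entry[0] += amount
            entry[1] += 1
        for key, amount in deleted:
            entry = self._aux[key]
            entry[0] -= amount
            entry[1] -= 1
            if entry[1] == 0:
                del self._aux[key]  # no rows left for this key

    def average(self, key):
        total, count = self._aux[key]
        return total / count


view = AverageView()
view.apply(inserted=[("north", 10), ("north", 20), ("south", 8)])
view.apply(inserted=[("north", 30)], deleted=[("north", 10), ("south", 8)])
```

Not every aggregate can be treated this way: a minimum or maximum, for instance, cannot in general be maintained under deletions from the view and the changes alone.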
Research on data warehouse refreshment has mainly focused on update propagation through
materialized views. Many papers have been published on this topic, but very few are devoted to
the whole refreshment process as defined before. We consider view maintenance just as one step
of the complete refreshment process. Other steps concern data cleaning, data reconciliation, data
customisation and, if needed, data archiving. Moreover, extraction and cleaning strategies
may vary from one source to another, and update propagation may vary from
one user view to another, depending for example on the desired freshness of the data. So the data
warehouse refreshment process cannot be limited to a view maintenance process.
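The dependence of update propagation on the desired freshness can be sketched as follows. Each user view carries its own freshness requirement, and only the views whose data has become too stale are selected for propagation. The `UserView` class, the `max_staleness` attribute and the sample views are assumptions made for the illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class UserView:
    name: str
    max_staleness: timedelta  # hypothetical freshness requirement
    last_refreshed: datetime


def views_due(views, now):
    """Select the views whose data is older than their freshness requirement."""
    return [v.name for v in views if now - v.last_refreshed >= v.max_staleness]


now = datetime(2024, 1, 1, 12, 0)
views = [
    UserView("daily_sales_dashboard", timedelta(hours=1), datetime(2024, 1, 1, 10, 30)),
    UserView("monthly_finance_report", timedelta(days=1), datetime(2024, 1, 1, 0, 0)),
]
due = views_due(views, now)
```

Here only the dashboard is due for propagation, since the report tolerates data up to one day old.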