Data Warehousing and Data Mining
the cleaning process is triggered immediately after the extraction process, it is actually executed
only if the extraction process has gathered some source changes. Consequently, the state of the
input data store of each activity can be regarded as a condition for actually executing that
activity.
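The condition described above can be sketched in Python. This is only a minimal illustration under stated assumptions, not the implementation of any particular workflow system: the names DataStore, Activity and run_if_ready are hypothetical, and a data store is modeled simply as a list of pending changes.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DataStore:
    """Hypothetical data store: just a list of pending changes."""
    changes: List[dict] = field(default_factory=list)


@dataclass
class Activity:
    """Hypothetical workflow activity reading one store and writing another."""
    name: str
    source: DataStore
    target: DataStore
    transform: Callable[[dict], dict]

    def run_if_ready(self) -> bool:
        # The activity may be triggered at any time, but it only executes
        # when its input data store actually holds changes.
        if not self.source.changes:
            return False
        self.target.changes.extend(self.transform(c) for c in self.source.changes)
        self.source.changes.clear()
        return True


extracted = DataStore()
cleaned = DataStore()
cleaning = Activity("cleaning", extracted, cleaned, lambda c: {**c, "clean": True})

first_run = cleaning.run_if_ready()    # triggered, but nothing was extracted
extracted.changes.append({"duration": 12})
second_run = cleaning.run_if_ready()   # one extracted change is now cleaned
```

Here the first trigger has no effect because the extraction store is empty, while the second one executes the cleaning step and empties its input store.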
Within the workflow that represents the refreshment process, activities may have different
origins and different semantics; the refreshment strategy is logically independent of what the
activities actually do. However, at the operational level, some activities can be merged
(e.g., extraction and cleaning), and others decomposed (e.g., integration). The flexibility
claimed for workflow systems should make it possible to tailor the refreshment activities and
the coordinating events dynamically.
There is another way to represent the workflow and its triggering strategies. Instead of
considering external events, such as temporal events or termination events of the different
activities, we can consider data changes as events. Each input data store of the refreshment
workflow is then treated as an event queue that triggers the corresponding activity.
However, to represent different refreshment strategies, this approach needs a parametric
synchronization mechanism that triggers the activities at the right moment. This can
be done by introducing composite events which combine, for example, data change events and
temporal events. An alternative is to put locks on data stores and remove them after an
activity or a set of activities decides to commit. Under a long-term synchronization policy,
which some data warehouses use, this latter approach is not sufficient.
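A composite event of the kind mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the names EventQueueStore, composite_event and trigger are hypothetical, time is passed in as a plain number, and the parameters period and min_changes stand in for the parametric part of the synchronization mechanism.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EventQueueStore:
    """Hypothetical input data store acting as a queue of change events."""
    pending: List[dict] = field(default_factory=list)
    last_run: float = 0.0  # time of the last activity execution

    def push(self, change: dict) -> None:
        self.pending.append(change)


def composite_event(store: EventQueueStore, now: float,
                    period: float, min_changes: int = 1) -> bool:
    """Hold only when a data change event AND a temporal event coincide."""
    data_event = len(store.pending) >= min_changes
    temporal_event = (now - store.last_run) >= period
    return data_event and temporal_event


def trigger(store: EventQueueStore, now: float, period: float) -> List[dict]:
    """Run the activity if the composite event holds; return consumed changes."""
    if not composite_event(store, now, period):
        return []
    batch, store.pending = store.pending, []
    store.last_run = now
    return batch


store = EventQueueStore()
store.push({"pc": 1})
too_early = trigger(store, now=10.0, period=60.0)  # change present, period not elapsed
on_time = trigger(store, now=60.0, period=60.0)    # both conditions hold
no_data = trigger(store, now=120.0, period=60.0)   # period elapsed, no changes
```

Changing period or min_changes yields different refreshment strategies (for example, periodic batches versus near-immediate propagation) without touching the activity itself.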
The Workflow Agents
Two main agent types are involved in the refreshment workflow: human agents, which define
requirements, constraints and strategies, and computer agents, which process activities. Among
human agents we can distinguish users, the data warehouse administrator, and source
administrators. Among computer agents, we can mention source management systems, the database
systems used for the data warehouse and data marts, wrappers, and mediators. For simplicity,
agents are not represented in the refreshment workflow, which concentrates on the activities
and their coordination.
9.4.2 Defining Refreshment Scenarios
To illustrate different workflow scenarios, we consider the following example, which concerns
three national Telecom billing sources represented by three relations S1, S2, and S3. Each relation
has the same (simplified) schema: (#PC, date, duration, cost). An aggregated view V with schema
(avg-duration, avg-cost, country) is defined in a data warehouse from these sources as the average
duration and cost of a phone call in each of the three countries associated with the sources, during
the last six months. We assume that the construction of the view follows the steps explained
before. During the preparation step, the data of the last six months contained in each source is
cleaned (e.g., all cost units are converted to euros). Then, during the integration phase, a base
relation R with schema (date, duration, cost, country) is constructed by taking the union of the
data coming from each source and generating an extra attribute (country). Finally, the view is
computed using aggregates (Figure 9.5).
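The three steps of this example (preparation, integration, aggregation) can be sketched in Python. This is only an illustration under stated assumptions: the sample rows, the country labels A, B and C, the conversion rates, the reference date AS_OF and the 183-day approximation of "six months" are all hypothetical and do not come from the text.

```python
from datetime import date, timedelta
from statistics import mean

# Hypothetical sample rows with schema (#PC, date, duration, cost);
# costs are expressed in each source's local currency.
S1 = [(1, date(2024, 5, 1), 10, 2.0), (2, date(2023, 1, 1), 99, 9.0)]
S2 = [(1, date(2024, 4, 1), 20, 4.0)]
S3 = [(1, date(2024, 6, 1), 30, 8.0), (2, date(2024, 6, 2), 10, 4.0)]

# Assumed country labels and conversion rates to euros (illustrative only).
SOURCES = {"A": (S1, 1.0), "B": (S2, 0.5), "C": (S3, 0.25)}
AS_OF = date(2024, 6, 30)
CUTOFF = AS_OF - timedelta(days=183)  # "last six months", approximated


def prepare(rows, rate):
    """Preparation: keep the last six months and convert cost to euros."""
    return [(d, dur, cost * rate) for _, d, dur, cost in rows if d >= CUTOFF]


def integrate(sources):
    """Integration: union the cleaned sources, adding a country attribute."""
    return [(d, dur, cost, country)
            for country, (rows, rate) in sources.items()
            for d, dur, cost in prepare(rows, rate)]


def aggregate(r):
    """View V: one (avg-duration, avg-cost, country) tuple per country."""
    view = []
    for country in sorted({row[3] for row in r}):
        rows = [row for row in r if row[3] == country]
        view.append((mean(x[1] for x in rows), mean(x[2] for x in rows), country))
    return view


R = integrate(SOURCES)  # base relation (date, duration, cost, country)
V = aggregate(R)        # aggregated view
```

With this sample data, the old row in S1 is dropped during preparation, so R holds four tuples and V holds one tuple per country.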