
Data Warehousing and Data Mining




                                   the cleaning process is triggered immediately after the extraction process, it is actually executed
                                   only if the extraction process has gathered some source changes. Consequently, the state of the
                                   input data store of each activity can be regarded as a condition for effectively executing
                                   that activity.
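
                                   The distinction between triggering and executing an activity can be sketched as follows. This is
                                   only an illustration: the function run_if_input_ready and the placeholder clean step are
                                   hypothetical names, and the input data store is assumed to be a simple in-memory list.

```python
# Hypothetical sketch: an activity is triggered, but executes only if its
# input data store holds pending changes.

def run_if_input_ready(activity, input_store):
    """Run `activity` on the pending items of `input_store`, if there are any."""
    if not input_store:           # condition on the state of the input data store
        return None               # triggered, but not executed
    pending = list(input_store)
    input_store.clear()           # the pending changes are consumed
    return activity(pending)


def clean(changes):
    # Placeholder cleaning step (assumption): strip whitespace from each value.
    return [c.strip() for c in changes]


extracted = []                                   # extraction found no source changes
skipped = run_if_input_ready(clean, extracted)   # cleaning is not executed

extracted = [" call-1 ", "call-2 "]              # extraction gathered two changes
cleaned = run_if_input_ready(clean, extracted)   # cleaning is executed
```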
                                   Within the workflow that represents the refreshment process, activities may have different
                                   origins and different semantics; the refreshment strategy is logically considered independent
                                   of what the activities actually do. However, at the operational level, some activities can be merged
                                   (e.g., extraction and cleaning), and others decomposed (e.g., integration). The flexibility
                                   claimed for workflow systems should make it possible to tailor the refreshment activities and
                                   the coordinating events dynamically.
                                   There may be another way to represent the workflow and its triggering strategies. Instead of
                                   considering external events such as temporal events or termination events of the different
                                   activities, we can consider data changes as events. Each input data store of the refreshment
                                   workflow is then treated as an event queue that triggers the corresponding activity.
                                   However, to represent different refreshment strategies, this approach needs a parametric
                                   synchronization mechanism that triggers the activities at the right moment. This can
                                   be done by introducing composite events which combine, for example, data change events and
                                   temporal events. An alternative is to put locks on data stores and remove them after an
                                   activity or a set of activities decides to commit. In the case of a long-term synchronization policy,
                                   as sometimes happens in data warehouses, this latter approach is not sufficient.
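
                                   One possible reading of a composite event is sketched below: an activity fires only when a data
                                   change event and a temporal event both hold. The names DataStore, composite_event and
                                   maybe_trigger, the period parameter, and the use of plain hour counts for time are all
                                   assumptions made for illustration, not part of any particular workflow system.

```python
from collections import deque


class DataStore:
    """Input data store treated as a queue of data-change events (hypothetical)."""

    def __init__(self):
        self.changes = deque()

    def add_change(self, change):
        self.changes.append(change)


def composite_event(store, now, last_run, period, min_changes=1):
    """True when a data-change event and a temporal event both hold."""
    data_changed = len(store.changes) >= min_changes
    period_elapsed = (now - last_run) >= period
    return data_changed and period_elapsed


def maybe_trigger(activity, store, now, last_run, period):
    """Run `activity` on the queued changes if the composite event fires.

    Returns the (possibly updated) time of the last run and the activity result.
    """
    if not composite_event(store, now, last_run, period):
        return last_run, None
    batch = list(store.changes)
    store.changes.clear()
    return now, activity(batch)


store = DataStore()
store.add_change("S1: new call record")

# Times are plain hour counts (assumption). The activity here just counts changes.
last_run, result_early = maybe_trigger(len, store, now=2, last_run=0, period=24)
last_run, result_due = maybe_trigger(len, store, now=24, last_run=last_run, period=24)
```

                                   Changing period or min_changes gives different refreshment strategies without touching the
                                   activity itself, which is the role the text assigns to the parametric synchronization mechanism.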

                                   The Workflow Agents

                                   Two main agent types are involved in the refreshment workflow: human agents, which define
                                   requirements, constraints and strategies, and computer agents, which process activities. Among
                                   human agents we can distinguish users, the data warehouse administrator, and source
                                   administrators. Among computer agents, we can mention source management systems, the database
                                   systems used for the data warehouse and data marts, wrappers, and mediators. For simplicity,
                                   agents are not represented in the refreshment workflow, which concentrates on the activities
                                   and their coordination.

                                   9.4.2 Defining Refreshment Scenarios

                                   To illustrate different workflow scenarios, we consider the following example, which concerns
                                   three national Telecom billing sources represented by three relations S1, S2, and S3. Each relation
                                   has the same (simplified) schema: (#PC, date, duration, cost). An aggregated view V with schema
                                   (avg-duration, avg-cost, country) is defined in a data warehouse from these sources as the average
                                   duration and cost of a phone call in each of the three countries associated with the sources, during
                                   the last six months. We assume that the construction of the view follows the steps explained
                                   before. During the preparation step, the data of the last six months contained in each source is
                                   cleaned (e.g., all cost units are converted to euros). Then, during the integration phase, a base
                                   relation R with schema (date, duration, cost, country) is constructed by taking the union of the
                                   data coming from each source and generating an extra attribute (country). Finally, the view is
                                   computed using aggregates (Figure 9.5).
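
                                   The three steps of this example (preparation, integration, aggregation) can be sketched in a few
                                   lines. Everything concrete below is an assumption made for illustration: the sample tuples, the
                                   country codes, the exchange rates in RATE_TO_EUR, the reference date, and the approximation of
                                   "six months" as 183 days.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical exchange rates to euros (not from the source text).
RATE_TO_EUR = {"EUR": 1.0, "GBP": 1.25}


def prepare(source, currency, today):
    """Preparation: keep the last six months (approximated as 183 days)
    and convert costs to euros. Drops the #PC attribute."""
    cutoff = today - timedelta(days=183)
    return [(d, dur, cost * RATE_TO_EUR[currency])
            for (_pc, d, dur, cost) in source if d >= cutoff]


def integrate(prepared_by_country):
    """Integration: union the prepared sources into
    R(date, duration, cost, country), generating the country attribute."""
    return [(d, dur, cost, country)
            for country, rows in prepared_by_country.items()
            for (d, dur, cost) in rows]


def compute_view(R):
    """Aggregation: V(avg-duration, avg-cost, country), sorted by country."""
    groups = defaultdict(list)
    for (_d, dur, cost, country) in R:
        groups[country].append((dur, cost))
    return [(sum(du for du, _ in rows) / len(rows),
             sum(co for _, co in rows) / len(rows),
             country)
            for country, rows in sorted(groups.items())]


today = date(2024, 7, 1)  # hypothetical refresh date
# Hypothetical source tuples with schema (#PC, date, duration, cost).
S1 = [(1, date(2024, 6, 1), 10, 2.0), (2, date(2024, 5, 1), 20, 4.0),
      (3, date(2023, 1, 1), 99, 9.0)]           # last tuple is older than six months
S2 = [(1, date(2024, 6, 15), 30, 4.0)]          # costs in GBP
S3 = [(1, date(2024, 4, 1), 40, 8.0), (2, date(2024, 3, 1), 60, 12.0)]

R = integrate({
    "FR": prepare(S1, "EUR", today),
    "UK": prepare(S2, "GBP", today),
    "DE": prepare(S3, "EUR", today),
})
V = compute_view(R)
```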















