
Data Warehousing and Data Mining




                                   can avoid mistakes related to schema information. Based on this analysis, we can safely argue
                                   that different roles imply a different collection of quality aspects, which should ideally be treated
                                   in a consistent and meaningful way.
                                   From the above it follows that, on the one hand, the quality of data is highly subjective in nature
                                   and should ideally be treated differently for each user. At the same time, the quality goals of the
                                   involved stakeholders are highly diverse in nature. They can be neither assessed nor achieved
                                   directly but require complex measurement, prediction, and design techniques, often in the form
                                   of an interactive process. On the other hand, the reasons for data deficiencies, non-availability
                                   or reachability problems are definitely objective, and depend mostly on the information system
                                   definition and implementation. Furthermore, the prediction of data quality for each user must be
                                   based on objective quality factors that are computed and compared to users’ expectations. The
                                   question that arises, then, is how to organize the design, administration and evolution of the data
                                   warehouse in such a way that all the different, and sometimes opposing, quality requirements
                                   of the users can be simultaneously satisfied. As the number of users and the complexity of data
                                   warehouse systems make it impossible to reach total quality for every user, another question is how
                                   to prioritize these requirements in order to satisfy them according to their importance. This
                                   problem is typically illustrated by the physical design of the data warehouse, where the task
                                   is to find a set of materialized views that optimizes both the response time of user requests and
                                   the global maintenance cost of the data warehouse.
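
The trade-off between query response time and maintenance cost in materialized view selection can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the query names, the view names, the cost figures, and the `weight` parameter that expresses how much maintenance cost matters relative to response time. The search simply tries every subset of candidate views, which is only feasible for a handful of candidates.

```python
from itertools import combinations

# Hypothetical workload: cost of each query without any view ("base")
# and its cost when a helpful view is materialized.
QUERY_COSTS = {
    "q_sales_by_month": {"base": 100.0, "v_monthly_sales": 5.0},
    "q_sales_by_region": {"base": 80.0, "v_regional_sales": 4.0},
    "q_top_products": {"base": 60.0, "v_monthly_sales": 30.0},
}

# Hypothetical cost of keeping each materialized view up to date.
MAINTENANCE_COSTS = {"v_monthly_sales": 40.0, "v_regional_sales": 90.0}


def query_cost(query, views):
    """Cheapest way to answer one query given the materialized views."""
    costs = QUERY_COSTS[query]
    return min([costs["base"]] + [costs[v] for v in views if v in costs])


def total_cost(views, weight=1.0):
    """Total response time of all queries plus weighted maintenance cost."""
    response = sum(query_cost(q, views) for q in QUERY_COSTS)
    maintenance = sum(MAINTENANCE_COSTS[v] for v in views)
    return response + weight * maintenance


def best_view_set(weight=1.0):
    """Exhaustively search every subset of candidate views."""
    candidates = sorted(MAINTENANCE_COSTS)
    subsets = (
        frozenset(combo)
        for size in range(len(candidates) + 1)
        for combo in combinations(candidates, size)
    )
    return min(subsets, key=lambda s: total_cost(s, weight))


if __name__ == "__main__":
    for w in (0.1, 1.0, 10.0):
        chosen = best_view_set(weight=w)
        print(w, sorted(chosen), total_cost(chosen, w))
```

With these made-up figures, a low maintenance weight favours materializing both views, an equal weight keeps only the cheaper view, and a high weight materializes nothing. The number of subsets grows exponentially with the number of candidate views, so larger view sets are usually handled with heuristics rather than the exhaustive search shown here.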
                                   Before answering these questions, though, it is useful to define clearly the
                                   major concepts in these data warehouse quality management problems. To give an idea of the
                                   complexity involved, let us present a verbal description of the situation. The interpretability
                                   of the data and the processes of the data warehouse is heavily dependent on the design process (the
                                   level of the description of the data and the processes of the warehouse) and the expressive power
                                   of the models and the languages which are used. Both the data and the systems architecture (i.e.
                                   where each piece of information resides and what the architecture of the system is) are part of the
                                   interpretability dimension. The integration process is related to the interpretability dimension,
                                   since it tries to produce minimal schemata. Furthermore, processes like query optimization (possibly
                                   using semantic information about the kind of data queried, e.g. temporal, aggregate, etc.)
                                   and multidimensional aggregation (e.g. containment of views, which can guide the choice of
                                   the appropriate relations to answer a query) depend on the interpretability of the data
                                   and the processes of the warehouse. The accessibility dimension of quality depends on the
                                   kind of data sources and the design of the data and the processes of the warehouse. The kind of
                                   views stored in the warehouse, the update policy and the querying processes all influence
                                   the accessibility of the information. Query optimization is related to the accessibility dimension,
                                   since the sooner the queries are answered, the higher the transaction availability is. The extraction
                                   of data from the sources also influences (in fact, determines) the availability of the data
                                   warehouse. Consequently, one of the primary goals of the update propagation policy should
                                   be to achieve high availability of the data warehouse (and the sources). The update policies, the
                                   evolution of the warehouse (amount of purged information) and the kind of data sources all
                                   influence the timeliness and consequently the usefulness of data. Furthermore, the timeliness
                                   dimension influences the data warehouse design and the querying of the information stored in
                                   the warehouse (e.g., query optimization could take advantage of temporal
                                   relationships in the data warehouse). The believability of the data in the warehouse is obviously
                                   influenced by the believability of the data in the sources. Furthermore, the desired level of
                                   believability influences the design of the views and processes of the warehouse. Consequently,
                                   the source integration should take into account the believability of the data, whereas the data
                                   warehouse design process should also take into account the believability of the processes. The
                                   validation of all the processes of the data warehouse is another issue, related to every task in
                                   the data warehouse environment and especially to the design process. Redundant information
                                   in the warehouse can be used by the aggregation, customization and query optimization
                                   processes in order to obtain information faster. Replication issues are also related to these tasks.
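
The dependencies described verbally above can be made easier to inspect by writing them down as a small lookup table. The sketch below is only one possible encoding: the factor labels are shorthand chosen here for the phrases in the text, and the helper function `dimensions_affected_by` is a hypothetical name, not part of any established quality framework.

```python
# Quality dimension -> warehouse factors that the text says influence it.
# Labels are illustrative shorthand for the phrases used in the paragraph.
INFLUENCES = {
    "interpretability": {
        "design process",
        "models and languages",
        "source integration",
    },
    "accessibility": {
        "data sources",
        "design process",
        "stored views",
        "update policy",
        "querying processes",
        "query optimization",
        "data extraction",
    },
    "timeliness": {
        "update policy",
        "warehouse evolution",
        "data sources",
    },
    "believability": {
        "data sources",
    },
}


def dimensions_affected_by(factor):
    """Return, in alphabetical order, the dimensions a factor influences."""
    return sorted(dim for dim, factors in INFLUENCES.items() if factor in factors)


if __name__ == "__main__":
    # Which quality dimensions are touched by a change to the update policy?
    print(dimensions_affected_by("update policy"))
    # Which are touched by the choice of data sources?
    print(dimensions_affected_by("data sources"))
```

A reverse lookup of this kind shows at a glance that a single design decision, such as the update policy or the choice of data sources, touches several quality dimensions at once, which is the source of the conflicting requirements discussed earlier.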





          96                               LoveLy professionaL university