can avoid mistakes related to schema information. Based on this analysis, we can safely argue that different roles imply different collections of quality aspects, which should ideally be treated in a consistent and meaningful way.
From the previous discussion it follows that, on one hand, the quality of data is of a highly subjective nature and should ideally be treated differently for each user. At the same time, the quality goals of the involved stakeholders are highly diverse in nature. They can be neither assessed nor achieved directly, but require complex measurement, prediction, and design techniques, often in the form of an interactive process. On the other hand, the reasons for data deficiencies, non-availability or reachability problems are definitely objective, and depend mostly on the information system definition and implementation. Furthermore, the prediction of data quality for each user must be based on objective quality factors that are computed and compared to users’ expectations. The question that arises, then, is how to organize the design, administration and evolution of the data warehouse in such a way that all the different, and sometimes opposing, quality requirements of the users can be satisfied simultaneously. Since the number of users and the complexity of data warehouse systems do not permit reaching total quality for every user, another question is how to prioritize these requirements in order to satisfy them according to their importance. This problem is typically illustrated by the physical design of the data warehouse, where the task is to find a set of materialized views that simultaneously optimizes the response time of user requests and the global maintenance cost of the data warehouse.
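To make this trade-off concrete, the following minimal sketch (in Python, with a hypothetical greedy_view_selection function and purely illustrative view names and cost figures, none of which come from a specific system) selects views by benefit per unit of maintenance cost until a maintenance budget is exhausted.

# Hypothetical sketch: greedy selection of materialized views that balances
# query response time against maintenance cost. All numbers are illustrative.

def greedy_view_selection(candidate_views, maintenance_budget):
    """Pick views with the best benefit per unit of maintenance cost
    until the maintenance budget is exhausted."""
    selected = []
    remaining = maintenance_budget
    # Rank candidates by (query-time saving) / (maintenance cost), best first.
    ranked = sorted(candidate_views,
                    key=lambda v: v["saving"] / v["maintenance"],
                    reverse=True)
    for view in ranked:
        if view["maintenance"] <= remaining:
            selected.append(view["name"])
            remaining -= view["maintenance"]
    return selected

# Illustrative candidates: each view saves some aggregate query time (seconds/day)
# and costs some refresh effort (seconds/day) during update propagation.
candidates = [
    {"name": "sales_by_day_product", "saving": 120.0, "maintenance": 40.0},
    {"name": "sales_by_month_region", "saving": 90.0, "maintenance": 10.0},
    {"name": "returns_by_customer",   "saving": 15.0, "maintenance": 30.0},
]

print(greedy_view_selection(candidates, maintenance_budget=60.0))
# -> ['sales_by_month_region', 'sales_by_day_product']

Exhaustive view selection quickly becomes intractable as the number of candidates grows, which is why greedy or heuristic strategies of this kind are commonly used in practice.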
Before answering these questions, though, it would be useful to give a clear-cut definition of the major concepts in these data warehouse quality management problems. To give an idea of the complexity involved, let us first describe the situation verbally.

The interpretability of the data and the processes of the data warehouse depends heavily on the design process (the level at which the data and the processes of the warehouse are described) and on the expressive power of the models and languages which are used. Both the data and the systems architecture (i.e. where each piece of information resides and what the architecture of the system is) are part of the interpretability dimension. The integration process is related to the interpretability dimension through its attempt to produce minimal schemata. Furthermore, processes like query optimization (possibly using semantic information about the kind of data queried, e.g. temporal or aggregate data) and multidimensional aggregation (e.g. containment of views, which can guide the choice of the appropriate relations to answer a query) depend on the interpretability of the data and the processes of the warehouse.
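As a rough illustration of how containment of views can guide the choice of relations to answer a query, the following sketch (hypothetical view names and row counts; functional dependencies such as day → month are deliberately ignored) picks the smallest materialized aggregate whose grouping attributes cover those requested by the query.

# Hypothetical sketch: choosing a materialized aggregate by view containment.
# A query grouped on a set of attributes can be answered from any view whose
# grouping attributes are a superset of that set (for distributive measures
# such as SUM); among those, prefer the smallest view.

views = {
    "sales_by_day_product_store": {"group_by": {"day", "product", "store"}, "rows": 50_000_000},
    "sales_by_day_product":       {"group_by": {"day", "product"},          "rows": 5_000_000},
    "sales_by_month":             {"group_by": {"month"},                   "rows": 36},
}

def pick_view(query_group_by):
    """Return the cheapest view that can answer the query, or None."""
    usable = [(info["rows"], name)
              for name, info in views.items()
              if query_group_by <= info["group_by"]]
    return min(usable)[1] if usable else None

print(pick_view({"product"}))   # -> sales_by_day_product
print(pick_view({"month"}))     # -> sales_by_month
print(pick_view({"customer"}))  # -> None (only the base data can answer it)

A view whose grouping is finer than the query can be aggregated further at query time, which is exactly the containment reasoning mentioned above.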
The accessibility dimension of quality depends on the kind of data sources and on the design of the data and the processes of the warehouse. The kind of views stored in the warehouse, the update policy and the querying processes all influence the accessibility of the information. Query optimization is related to the accessibility dimension, since the sooner queries are answered, the higher the transaction availability is. The extraction of data from the sources also influences (in fact, determines) the availability of the data warehouse. Consequently, one of the primary goals of the update propagation policy should be to achieve high availability of the data warehouse (and the sources).
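One hedged way to picture such an update propagation policy is sketched below: instead of recomputing a materialized aggregate from scratch, source deltas are applied incrementally, keeping refresh windows short and the warehouse available. The view contents and the delta format are purely illustrative.

# Hypothetical sketch: incremental update propagation to a materialized
# SUM aggregate. Applying small deltas keeps refresh windows short, so the
# warehouse (and the sources) remain available for queries.

from collections import defaultdict

# Materialized view: total sales amount per product.
sales_by_product = defaultdict(float, {"p1": 1000.0, "p2": 250.0})

def propagate(deltas):
    """Apply a batch of source changes: ('insert' | 'delete', product, amount)."""
    for op, product, amount in deltas:
        if op == "insert":
            sales_by_product[product] += amount
        elif op == "delete":
            sales_by_product[product] -= amount

propagate([("insert", "p1", 50.0),
           ("insert", "p3", 20.0),
           ("delete", "p2", 100.0)])

print(dict(sales_by_product))
# -> {'p1': 1050.0, 'p2': 150.0, 'p3': 20.0}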
The update policies, the evolution of the warehouse (the amount of purged information) and the kind of data sources all influence the timeliness, and consequently the usefulness, of the data. Furthermore, the timeliness dimension influences the design of the data warehouse and the querying of the information stored in it (e.g., query optimization could take advantage of temporal relationships in the data warehouse).
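One possible way in which query optimization can exploit such temporal relationships is partition pruning, sketched below with an illustrative partition layout: only the fragments whose time range overlaps the query window are read.

# Hypothetical sketch: pruning time-based partitions so a query touches only
# the fragments that overlap its time window.

from datetime import date

# Each partition of the fact table covers a closed date interval.
partitions = [
    ("sales_2023_q4", date(2023, 10, 1), date(2023, 12, 31)),
    ("sales_2024_q1", date(2024, 1, 1),  date(2024, 3, 31)),
    ("sales_2024_q2", date(2024, 4, 1),  date(2024, 6, 30)),
]

def prune(query_start, query_end):
    """Keep only partitions whose interval overlaps [query_start, query_end]."""
    return [name for name, lo, hi in partitions
            if lo <= query_end and hi >= query_start]

print(prune(date(2024, 2, 1), date(2024, 4, 15)))
# -> ['sales_2024_q1', 'sales_2024_q2']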
The believability of the data in the warehouse is obviously influenced by the believability of the data in the sources. Furthermore, the level of believability desired influences the design of the views and processes of the warehouse. Consequently, source integration should take into account the believability of the data, whereas the data warehouse design process should also take into account the believability of the processes.
The validation of all the processes of the data warehouse is another issue, related to every task in the data warehouse environment and especially to the design process. Redundant information in the warehouse can be used by the aggregation, customization and query optimization processes in order to obtain information faster. Replication issues are also related to these tasks.