Page 151 - DCAP208_Management Support Systems
most important phases in the knowledge discovery process. Data mining techniques should be
able to handle noise in data or incomplete information.
More than the size of the data, the size of the search space is decisive for data mining
techniques. The size of the search space often depends on the number of dimensions in
the domain space, and it usually grows exponentially as the number of dimensions
increases. This is known as the curse of dimensionality. This “curse” degrades the performance
of some data mining approaches so badly that it is becoming one of the most urgent issues to solve.
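The exponential growth described above can be illustrated with a small sketch. It assumes, purely for illustration, that each dimension is split into the same number of intervals, so that the search space is a grid of cells; the function names and the figure of one million records are hypothetical and do not come from the text.

```python
def grid_cells(bins_per_dim: int, dims: int) -> int:
    """Cells in a grid where each of `dims` attributes is split into `bins_per_dim` intervals."""
    return bins_per_dim ** dims


def max_occupied_fraction(n_records: int, bins_per_dim: int, dims: int) -> float:
    """Upper bound on the share of cells that can hold at least one record."""
    return min(1.0, n_records / grid_cells(bins_per_dim, dims))


if __name__ == "__main__":
    # Assumed setting: 10 intervals per dimension, one million records.
    for d in (1, 2, 5, 10, 20):
        cells = grid_cells(10, d)
        frac = max_occupied_fraction(1_000_000, 10, d)
        print(f"dims={d:2d}  cells={cells:.1e}  max occupied fraction={frac:.1e}")
```

Under these assumptions, adding a dimension multiplies the number of cells by ten, while the number of records stays fixed. With ten dimensions, one million records can cover at most one cell in ten thousand, which is why methods that must explore the whole space degrade so quickly.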
Performance Issues
Many artificial intelligence and statistical methods exist for data analysis and interpretation.
However, these methods were often not designed for the very large data sets that data mining
deals with today, where terabyte sizes are common. This raises the issues of scalability and efficiency
of data mining methods when processing very large data sets. Algorithms with
exponential and even medium-order polynomial complexity are of no practical use for data
mining; linear algorithms are usually the norm. In the same vein, a sample can be mined
instead of the whole dataset, although concerns such as completeness and the choice of
samples may arise. Other performance topics are incremental updating and parallel
programming. There is no doubt that parallelism can help solve the size problem if the dataset
can be subdivided and the results can be merged later. Incremental updating is important for
merging results from parallel mining, or updating data mining results when new data becomes
available without having to re-analyze the complete dataset.
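As a minimal sketch of how subdividing, merging, and incremental updating fit together, consider a very simple "mining result": the count, mean, and variance of one numeric attribute. The text does not name a particular algorithm; the example below uses Welford's single-pass update and the standard pairwise merge formula for partial variances as an illustration, and the names `Summary` and `summarize` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    n: int = 0          # number of records seen
    mean: float = 0.0   # running mean
    m2: float = 0.0     # sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Incremental update with one new record (Welford's method)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "Summary") -> "Summary":
        """Combine two summaries computed on disjoint partitions."""
        n = self.n + other.n
        if n == 0:
            return Summary()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Summary(n, mean, m2)

    @property
    def variance(self) -> float:
        """Population variance of all records seen so far."""
        return self.m2 / self.n if self.n else 0.0


def summarize(values) -> Summary:
    """Single linear pass over one partition."""
    s = Summary()
    for v in values:
        s.update(v)
    return s


if __name__ == "__main__":
    # Two partitions that could be processed in parallel, then merged.
    merged = summarize([1, 2, 3, 4]).merge(summarize([5, 6, 7, 8]))
    # New data arrives later: update the result without rescanning.
    merged.update(9)
    print(merged.n, merged.mean, merged.variance)
```

Each partition is scanned once, so the cost stays linear in the number of records, and a new record changes the result in constant time. Real mining results such as frequent itemsets or cluster models are harder to merge than these statistics, so this should be read as an illustration of the idea rather than a general recipe.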
Data Source Issues
There are many issues related to data sources. Some are practical, such as the diversity of data
types, while others are philosophical, like the data glut problem. We certainly have an excess of
data since we already have more data than we can handle and we are still collecting data at an
even higher rate. If the spread of database management systems has helped increase the gathering
of information, the advent of data mining is certainly encouraging more data harvesting. The
current practice is to collect as much data as possible now and process it, or try to process it, later.
The concern is whether we are collecting the right data in the appropriate amount, whether we
know what we want to do with it, and whether we distinguish between what data is important
and what data is insignificant. Regarding the practical issues related to data sources, there is the
subject of heterogeneous databases and the focus on diverse complex data types. We are storing
different types of data in a variety of repositories. It is difficult to expect a data mining system to
effectively and efficiently achieve good mining results on all kinds of data and sources. Different
kinds of data and sources may require distinct algorithms and methodologies. Currently, there
is a focus on relational databases and data warehouses, but other approaches need to be pioneered
for other specific complex data types. A versatile data mining tool, for all sorts of data, may not
be realistic. Moreover, the proliferation of heterogeneous data sources, at structural and semantic
levels, poses important challenges not only to the database community but also to the data
mining community.
Task: Make a report on various issues related to data mining.
Self Assessment
Fill in the blanks:
1. Data mining is also known as ........................ in Databases.
144 LOVELY PROFESSIONAL UNIVERSITY