Page 151 - DCAP208_Management Support Systems

Management Support Systems




                                   most important phases in the knowledge discovery process. Data mining techniques should be
                                   able to handle noise in data or incomplete information.
                                   Even more decisive for data mining techniques than the size of the data is the size of the
                                   search space. The size of the search space often depends on the number of dimensions in
                                   the domain space, and it usually grows exponentially as the number of dimensions
                                   increases. This is known as the curse of dimensionality. This “curse” degrades the performance
                                   of some data mining approaches so badly that it has become one of the most urgent issues to solve.
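The exponential growth described above can be illustrated with a small sketch. It assumes, purely for illustration, that every dimension is split into the same number of bins, so that the search space is a regular grid. The function names and the figures of 10 bins and one million records are hypothetical and do not come from the text.

```python
def grid_cells(bins_per_dimension: int, dimensions: int) -> int:
    """Number of grid cells when every dimension is split into the same number of bins."""
    return bins_per_dimension ** dimensions


def expected_records_per_cell(num_records: int, bins_per_dimension: int, dimensions: int) -> float:
    """Average number of records per cell if the records were spread evenly over the grid."""
    return num_records / grid_cells(bins_per_dimension, dimensions)


if __name__ == "__main__":
    # Hypothetical setting: one million records, 10 bins per dimension.
    for d in (1, 2, 5, 10, 20):
        cells = grid_cells(10, d)
        density = expected_records_per_cell(1_000_000, 10, d)
        print(f"dimensions={d:2d}  cells={cells}  records per cell={density:g}")
```

Each added dimension multiplies the number of cells by the number of bins, so a dataset of fixed size quickly becomes too sparse to cover the space.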

                                   Performance Issues

                                   Many artificial intelligence and statistical methods exist for data analysis and interpretation.
                                   However, these methods were often not designed for the very large data sets that data mining
                                   deals with today; terabyte sizes are common. This raises issues of scalability and efficiency
                                   when data mining methods process very large data sets. Algorithms with
                                   exponential or even medium-order polynomial complexity are of no practical use for data
                                   mining; linear algorithms are usually the norm. In the same vein, a sample can be mined
                                   instead of the whole dataset, although concerns such as completeness and the choice of
                                   samples may arise. Other performance topics are incremental updating and parallel
                                   programming. Parallelism can certainly help solve the size problem if the dataset
                                   can be subdivided and the results merged later. Incremental updating is important for
                                   merging results from parallel mining, or for updating data mining results when new data becomes
                                   available, without having to re-analyze the complete dataset.
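As a minimal sketch of the subdivide, merge, and incrementally update idea, the example below keeps a running count, mean, and variance. It uses Welford's online update for single records and the pairwise merge formula of Chan, Golub, and LeVeque for combining partitions. These techniques are chosen here for illustration and are not prescribed by the text. The names `Summary` and `summarize` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Summary:
    """Mergeable summary of a numeric attribute (illustrative, not from the text)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Incremental update with one new record (Welford's online algorithm)."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "Summary") -> "Summary":
        """Combine two partial results without revisiting the underlying records."""
        if self.count == 0:
            return Summary(other.count, other.mean, other.m2)
        if other.count == 0:
            return Summary(self.count, self.mean, self.m2)
        n = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return Summary(n, mean, m2)

    @property
    def variance(self) -> float:
        """Population variance of all records seen so far."""
        return self.m2 / self.count if self.count else 0.0


def summarize(values: Iterable[float]) -> Summary:
    """Build a summary for one partition in a single linear pass."""
    summary = Summary()
    for value in values:
        summary.update(value)
    return summary


if __name__ == "__main__":
    # Two partitions that could be processed on different workers.
    left = summarize([1.0, 2.0, 3.0, 4.0])
    right = summarize([5.0, 6.0, 7.0, 8.0])
    merged = left.merge(right)
    merged.update(9.0)  # a new record arrives later; no re-analysis of old data
    print(merged.count, merged.mean, merged.variance)
```

Each partition is scanned once, so the cost stays linear in the number of records, and new data only touches the small summary instead of the full dataset.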

                                   Data Source Issues

                                   There are many issues related to data sources. Some are practical, such as the diversity of data
                                   types, while others are philosophical, like the data glut problem. We certainly have an excess of
                                   data: we already have more data than we can handle, and we are still collecting data at an
                                   even higher rate. If the spread of database management systems has helped increase the gathering
                                   of information, the advent of data mining is certainly encouraging more data harvesting. The
                                   current practice is to collect as much data as possible now and process it, or try to process it, later.
                                   The concern is whether we are collecting the right data in the appropriate amount, whether we
                                   know what we want to do with it, and whether we can distinguish between data that is important
                                   and data that is insignificant. On the practical side, there is the
                                   subject of heterogeneous databases and the focus on diverse complex data types. We are storing
                                   different types of data in a variety of repositories. It is difficult to expect a data mining system to
                                   achieve good mining results effectively and efficiently on all kinds of data and sources. Different
                                   kinds of data and sources may require distinct algorithms and methodologies. Currently, there
                                   is a focus on relational databases and data warehouses, but other approaches need to be pioneered
                                   for other specific complex data types. A versatile data mining tool for all sorts of data may not
                                   be realistic. Moreover, the proliferation of heterogeneous data sources, at both the structural and
                                   semantic levels, poses important challenges not only to the database community but also to the data
                                   mining community.




                                      Task  Make a report on various issues related to data mining.

                                   Self Assessment

                                   Fill in the blanks:

                                   1.  Data mining is also known as ........................ in Databases.


          144                               LOVELY PROFESSIONAL UNIVERSITY