Notes: The guideline of striving for high intraclass similarity and low interclass similarity still applies.
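The following minimal sketch illustrates this guideline on two hypothetical two-dimensional clusters, using Euclidean distance and only the Python standard library; the points and cluster assignments are invented purely for illustration.

from itertools import combinations, product
from math import dist

cluster_a = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]   # hypothetical cluster 1
cluster_b = [(5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]   # hypothetical cluster 2

def average(values):
    return sum(values) / len(values)

# Intraclass (within-cluster) distance: small values mean high intraclass similarity.
intra = average([dist(p, q)
                 for cluster in (cluster_a, cluster_b)
                 for p, q in combinations(cluster, 2)])

# Interclass (between-cluster) distance: large values mean low interclass similarity.
inter = average([dist(p, q) for p, q in product(cluster_a, cluster_b)])

print(f"average intraclass distance: {intra:.2f}")
print(f"average interclass distance: {inter:.2f}")

A good clustering keeps the first number well below the second; the same intuition underlies cluster validity measures such as the silhouette coefficient.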
In data mining, efforts have focused on finding methods for efficient and effective cluster analysis
in large databases. Active themes of research focus on the scalability of clustering methods, the
effectiveness of methods for clustering complex shapes and types of data, high-dimensional
clustering techniques, and methods for clustering mixed numerical and categorical data in large
databases.
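Several of the requirements listed below, notably requirement 3, refer to distance measures. As a point of reference, a minimal sketch of the Euclidean and Manhattan distances between two points follows; the coordinates are hypothetical.

from math import sqrt

p = (2.0, 3.0)
q = (5.0, 7.0)

# Euclidean distance: straight-line distance between the two points.
euclidean = sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))   # 5.0

# Manhattan distance: sum of absolute coordinate differences.
manhattan = sum(abs(a - b) for a, b in zip(p, q))           # 7.0

print(euclidean, manhattan)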
Clustering is a challenging field of research in which potential applications pose their own special requirements. The following are typical requirements of clustering in data mining.
1. Scalability: Many clustering algorithms work well on small data sets containing fewer than 200 data objects; however, a large database may contain millions of objects. Clustering on a sample of such a large data set may lead to biased results, so highly scalable clustering algorithms are needed.
2. Ability to deal with different types of attributes: Many algorithms are designed to cluster
interval-based (numerical) data. However, applications may require clustering other types
of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data
types.
3. Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures (as illustrated in the sketch above). Algorithms based on such distance measures tend to find spherical clusters of similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
4. Minimal requirements for domain knowledge to determine input parameters: Many
clustering algorithms require users to input certain parameters in cluster analysis (such as
the number of desired clusters). The clustering results are often quite sensitive to input
parameters. Parameters are often hard to determine, especially for data sets containing
high-dimensional objects. This not only burdens users, but also makes the quality of
clustering difficult to control.
5. Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may produce clusters of poor quality.
6. Insensitivity to the order of input records: Some clustering algorithms are sensitive to the order of input data; for example, the same set of data, when presented to such an algorithm in different orderings, may generate dramatically different clusters. It is important to develop algorithms that are insensitive to the order of input.
7. High dimensionality: A database or a data warehouse may contain many dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions, and human eyes are good at judging the quality of clustering for up to three dimensions. It is challenging to cluster data objects in high-dimensional space, especially considering that data in high-dimensional space can be very sparse and highly skewed.
8. Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automated teller machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city’s rivers and highway networks, and customer requirements per region. A challenging task is to find groups of data with good clustering behaviour that satisfy specified constraints.
9. Interpretability and usability: Users expect clustering results to be interpretable,
comprehensible, and usable. That is, clustering may need to be tied up with specific