1. Introduction

The goal of this project is to find identical references to the same article, book, or other work, and then to remove the duplicates or group the identical references together as a single reference.
More generally, identifying identical references has other applications as well, such as optimizing web search engines, where various techniques such as text clustering are used to improve the accuracy and speed of responses.
Clustering, or document clustering, is a method for organizing texts over large volumes of data, with the goal of finding similarities or shared patterns within textual data. Here, identical references are placed in clusters in such a way that similarity among the references within a cluster is maximized, while similarity between any two different clusters is minimized.
Clustering continues until every group of identical references ends up in its own cluster and can be treated as a single reference.

2. Related Works

2.1. Document clustering

Document clustering (or text clustering) is the application of cluster analysis to text documents; it is used for automatic document organization, topic extraction, and fast information retrieval or filtering. It is closely related to data clustering.

A web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information. Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories, as is achieved by Enterprise Search engines such as Northern Light and Vivisimo, consumer search engines such as PolyMeta and Helioid, or open source software such as Carrot2.

Examples:

  • Clustering divides the results of a search for "cell" into groups like "biology," "battery," and "prison."
  • FirstGov.gov, the official web portal for the U.S. government, uses document clustering to automatically organize its search results into categories. For example, if a user submits "immigration", next to the list of results they will see categories for "Immigration Reform", "Citizenship and Immigration Services", "Employment", "Department of Homeland Security", and more.

Probabilistic Latent Semantic Analysis (PLSA) can also be used to perform document clustering.

Document clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that describe the contents within the cluster. Document clustering is generally considered to be a centralized process. Examples of document clustering include web document clustering for search users.
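
As a small illustration, a descriptor for each cluster can be approximated by the highest-weight TF-IDF terms of the documents it contains. The sketch below assumes scikit-learn is available; the toy corpus, the cluster labels, and the helper name extract_descriptors are all illustrative.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_descriptors(tfidf, labels, terms, top_k=3):
    # for every cluster, return the terms with the highest mean TF-IDF weight
    descriptors = {}
    for c in sorted(set(labels)):
        rows = np.where(labels == c)[0]
        centroid = np.asarray(tfidf[rows].mean(axis=0)).ravel()
        descriptors[c] = [terms[i] for i in centroid.argsort()[::-1][:top_k]]
    return descriptors

docs = ["lithium battery cell charge", "battery cell voltage",
        "prison cell block inmate", "inmate prison sentence"]
labels = np.array([0, 0, 1, 1])                 # assumed output of a clustering step
vec = TfidfVectorizer()
print(extract_descriptors(vec.fit_transform(docs), labels, vec.get_feature_names_out()))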

The applications of document clustering can be categorized into two types, online and offline. Online applications are usually constrained by efficiency problems when compared to offline applications.

In general, there are two common classes of algorithms. The first is the hierarchical-based algorithms, which include single linkage, complete linkage, group average, and Ward's method. By aggregating or dividing, documents can be clustered into a hierarchical structure, which is suitable for browsing. However, such algorithms usually suffer from efficiency problems. The other class is based on the k-means algorithm and its variants. It is usually more efficient, but less accurate, than the hierarchical algorithms. These algorithms can further be classified as hard or soft clustering algorithms. Hard clustering computes a hard assignment: each document is a member of exactly one cluster. The assignment of soft clustering algorithms is soft: a document's assignment is a distribution over all clusters. In a soft assignment, a document has fractional membership in several clusters. Latent semantic indexing, a form of dimensionality reduction, is a soft clustering algorithm.
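
As a small illustration of the difference, the sketch below contrasts a hard assignment with a soft assignment for a single document; the three document-to-centre distances are assumed toy values.

import numpy as np

distances = np.array([0.9, 0.4, 1.5])                    # distances to three cluster centres (toy values)

hard = distances.argmin()                                 # hard assignment: exactly one cluster
soft = np.exp(-distances) / np.exp(-distances).sum()      # soft assignment: a distribution over all clusters

print("hard:", hard)                                      # -> 1
print("soft:", soft.round(3))                             # fractional memberships that sum to 1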

Other algorithms involve graph-based clustering, ontology-supported clustering, and order-sensitive clustering.

K-means
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.

The problem is computationally difficult (NP-hard); however, efficient heuristic algorithms are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions, in that both use an iterative refinement approach. Additionally, both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.

A partial example of a document clustering implementation in Python (the MajorClust algorithm applied to a TF-IDF distance graph):

import sys
from math import log, sqrt
from itertools import combinations

def choose_cluster(node, cluster_lookup, edges):
    # reassign a node to the cluster carrying the largest total edge weight
    # among the node's neighbours
    new = cluster_lookup[node]
    if node in edges:
        seen, num_seen = {}, {}
        for target, weight in edges.get(node, []):
            seen[cluster_lookup[target]] = seen.get(
                cluster_lookup[target], 0.0) + weight
        for k, v in seen.items():
            num_seen.setdefault(v, []).append(k)
        new = num_seen[max(num_seen)][0]
    return new

def majorclust(graph):
    cluster_lookup = dict((node, i) for i, node in enumerate(graph.nodes))
    # ... (the iteration over choose_cluster is elided in the original example)

def get_distance_graph(documents):
    class Graph(object):
        def __init__(self):
            self.edges = {}
    # ... (construction of edges from pairwise document similarities is elided)

def get_documents():
    texts = [...]          # placeholder for the raw input texts

def main(args):
    documents = get_documents()
    add_tfidf_to(documents)
    dist_graph = get_distance_graph(documents)

    for cluster in majorclust(dist_graph):
        print("=========")
        for doc_id in cluster:
            print(documents[doc_id]["text"])
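
The snippet above is actually MajorClust rather than k-means proper. For comparison, the minimal sketch below runs classical k-means over TF-IDF vectors; it assumes scikit-learn is installed and the sample texts are placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = ["reference to paper A", "citation of paper A",          # placeholder documents
         "reference to book B", "book B, second edition"]

vectors = TfidfVectorizer().fit_transform(texts)                  # documents as TF-IDF vectors
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

for label, doc in zip(km.labels_, texts):
    print(label, doc)                                             # documents with the same label share a cluster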

[Figure: a simple hierarchical clustering diagram]

2.2. Markov logic network

Briefly, it is a collection of formulas from first order logic, to each of which is assigned a real number, the weight. Taken as a Markov network, the vertices of the network graph are atomic formulas, and the edges are the logical connectives used to construct the formula. Each formula is considered to be a clique, and the Markov blanket is the set of formulas in which a given atom appears. A potential function is associated to each formula, and takes the value of one when the formula is true, and zero when it is false. The potential function is combined with the weight to form the Gibbs measure and partition function for the Markov network.
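
In the standard formulation, the ground Markov network defined in this way gives the following log-linear (Gibbs) distribution over possible worlds $x$:

$$P(X = x) = \frac{1}{Z} \exp\left(\sum_i w_i \, n_i(x)\right), \qquad Z = \sum_{x'} \exp\left(\sum_i w_i \, n_i(x')\right)$$

where $w_i$ is the weight of formula $F_i$, $n_i(x)$ is the number of true groundings of $F_i$ in the world $x$, and $Z$ is the partition function.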

The above definition glosses over a subtle point: atomic formulas do not have a truth value unless they are grounded and given an interpretation; that is, until they are ground atoms with a Herbrand interpretation. Thus, a Markov logic network becomes a Markov network only with respect to a specific grounding and interpretation; the resulting Markov network is called the ground Markov network. The vertices of the graph of the ground Markov network are the ground atoms. The size of the resulting Markov network thus depends strongly (exponentially) on the number of constants in the domain of discourse.

3. Experiments

A good clustering of text requires effective feature selection and a proper choice of the algorithm for the task at hand. Among the different classes of algorithms, the distance-based methods are among the most popular in a wide variety of applications.
In recent years, the main trend in research in this area has been in the context of two kinds of text data:

  • Dynamic applications: The large amounts of text data created by dynamic applications such as social networks or online chat applications have created an immense need for streaming text clustering applications. Such streaming applications need to be applicable to text which is not very clean, as is often the case for applications such as social networks.
  • Heterogeneous applications: Text applications increasingly arise in heterogeneous settings in which the text is available in the context of links and other heterogeneous multimedia data. For example, in social networks such as Flickr, clustering often needs to be applied in such scenarios. Therefore, it is critical to effectively adapt text-based algorithms to heterogeneous multimedia scenarios.

We note that the field of text clustering is too vast to cover comprehensively in a single report. Some methods, such as committee-based clustering [1], cannot even be neatly incorporated into any class of methods, since they use a combination of different clustering methods in order to create a final clustering result. The main purpose of this section is to provide a comprehensive overview of the main algorithms which are often used in the area, as a starting point for further study.

3.1. Distance-based Clustering Algorithms

  • Agglomerative and Hierarchical Clustering Algorithms

The general concept of agglomerative clustering is to successively merge documents into clusters based on their similarity with one another. Almost all hierarchical clustering algorithms successively merge groups based on the best pairwise similarity between these groups of documents. The main differences between these classes of methods are in terms of how this pairwise similarity is computed between the different groups of documents. For example, the similarity between a pair of groups may be computed as the best-case similarity, average-case similarity, or worst-case similarity between documents which are drawn from these pairs of groups. Conceptually, the process of agglomerating documents into successively higher levels of clusters creates a cluster hierarchy (or dendrogram) in which the leaf nodes correspond to individual documents, and the internal nodes correspond to the merged groups of clusters. When two groups are merged, a new node is created in this tree corresponding to this larger merged group. The two children of this node correspond to the two groups of documents which have been merged to form it.
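
A brief sketch of this process over TF-IDF vectors is given below; the cosine distance, the average linkage, and the sample texts are illustrative assumptions (SciPy and scikit-learn are assumed to be available).

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

texts = ["stars and galaxies", "galaxy formation", "oil painting", "watercolour painting"]

vectors = TfidfVectorizer().fit_transform(texts).toarray()
distances = pdist(vectors, metric="cosine")             # pairwise document distances

# 'average' linkage corresponds to average-case group similarity; 'single' and
# 'complete' correspond to the best-case and worst-case variants mentioned above.
tree = linkage(distances, method="average")             # the dendrogram as a merge tree
print(fcluster(tree, t=2, criterion="maxclust"))        # cut the tree into two flat clusters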

  • Distance-based Partitioning Algorithms

Partitioning algorithms are widely used in the database literature in order to efficiently create clusters of objects. The two most widely used distance-based partitioning algorithms are as follows:

3.1.1. k-medoid clustering algorithms:

In k-medoid clustering algorithms, we use a set of points from the original data as the anchors (or medoids) around which the clusters are built. The key aim of the algorithm is to determine an optimal set of representative documents from the original corpus around which the clusters are built. Each document is assigned to its closest representative from the collection. This creates a running set of clusters from the corpus which are successively improved by a randomized process.
The algorithm works with an iterative approach in which the set of k representatives is successively improved with the use of randomized interchanges. Specifically, we use the average similarity of each document in the corpus to its closest representative as the objective function which needs to be improved during this interchange process. In each iteration, we replace a randomly picked representative in the current set of medoids with a randomly picked representative from the collection, if this improves the clustering objective function. This approach is applied until convergence is achieved.
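
A compact sketch of this randomized interchange, using cosine similarity over TF-IDF vectors, is shown below; the function names and the toy corpus are assumptions.

import random
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def objective(sim, medoids):
    # average similarity of every document to its closest medoid
    return sim[:, medoids].max(axis=1).mean()

def k_medoids(texts, k=2, iterations=200, seed=0):
    rng = random.Random(seed)
    sim = cosine_similarity(TfidfVectorizer().fit_transform(texts))
    medoids = rng.sample(range(len(texts)), k)
    for _ in range(iterations):
        candidate = list(medoids)
        candidate[rng.randrange(k)] = rng.randrange(len(texts))   # randomized interchange
        if objective(sim, candidate) > objective(sim, medoids):   # keep it only if it helps
            medoids = candidate
    return medoids, sim[:, medoids].argmax(axis=1)                # medoids and document assignments

texts = ["neural network training", "deep neural networks",
         "stock market prices", "market price volatility"]
print(k_medoids(texts))
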
There are two main disadvantages of the use of k-medoids based clustering algorithms, one of which is specific to the case of text data. One general disadvantage of k-medoids clustering algorithms is that they require a large number of iterations in order to achieve convergence and are therefore quite slow. This is because each iteration requires the computation of an objective function whose time requirement is proportional to the size of the underlying corpus.
The second key disadvantage is that k-medoid algorithms do not work very well for sparse data such as text. This is because a large fraction of document pairs do not have many words in common, and the similarities between such document pairs are small (and noisy) values. Therefore, a single document medoid often does not contain all the concepts required in order to effectively build a cluster around it. This characteristic is specific to the case of the information retrieval domain, because of the sparse nature of the underlying text data.

3.1.2. k-means clustering algorithms:

The k-means clustering algorithm also uses a set of k representatives around which the clusters are built. However, these representatives are not necessarily obtained from the original data and are refined somewhat differently than in a k-medoids approach. The simplest form of the k-means approach is to start off with a set of k seeds from the original corpus, and assign documents to these seeds on the basis of closest similarity. In the next iteration, the centroid of the points assigned to each seed is used to replace the seed from the last iteration. In other words, the new seed is defined so that it is a better central point for this cluster. This approach is continued until convergence. One of the advantages of the k-means method over the k-medoids method is that it requires an extremely small number of iterations in order to converge. Observations seem to suggest that for many large data sets, it is sufficient to use 5 or fewer iterations for an effective clustering. The main disadvantage of the k-means method is that it is still quite sensitive to the initial set of seeds picked during the clustering. Secondly, the centroid for a given cluster of documents may contain a large number of words, which will slow down the similarity calculations in the next iteration. A number of methods can be used to reduce these effects.
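
The seed-refinement step described above can be written out directly; the sketch below shows one iteration on assumed toy vectors (a real run would use TF-IDF vectors and repeat both steps until the assignments stop changing).

import numpy as np

docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]])   # toy document vectors
seeds = docs[[0, 2]]                                                  # k = 2 seeds drawn from the corpus

# Step 1: assign every document to its closest seed.
assign = np.linalg.norm(docs[:, None, :] - seeds[None, :, :], axis=2).argmin(axis=1)

# Step 2: replace each seed by the centroid of the documents assigned to it.
seeds = np.array([docs[assign == j].mean(axis=0) for j in range(len(seeds))])

print(assign, seeds)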

3.2. Word and Phrase-based Clustering

Since text documents are drawn from an inherently high-dimensional domain, it can be useful to view the problem in a dual way, in which important clusters of words may be found and utilized for finding clusters of documents. In a corpus containing d terms and n documents, one may view a term-document matrix as an n × d matrix, in which the (i,j)th entry is the frequency of the jth term in the ith document. We note that this matrix is extremely sparse, since a given document contains an extremely small fraction of the universe of words. We note that the problem of clustering rows in this matrix is that of clustering documents, whereas that of clustering columns in this matrix is that of clustering words. In reality, the two problems are closely related, as good clusters of words may be leveraged in order to find good clusters of documents and vice versa. For example, the work in [2] determines frequent itemsets of words in the document collection, and uses them to determine compact clusters of documents. This is somewhat analogous to the use of clusters of words for determining clusters of documents. The most general technique for simultaneous word and document clustering is referred to as co-clustering. This approach simultaneously clusters the rows and columns of the term-document matrix in order to create such clusters. This can also be considered to be equivalent to the problem of re-ordering the rows and columns of the term-document matrix so as to create dense rectangular blocks of non-zero entries in this matrix. In some cases, the ordering information among words may be used in order to determine good clusters. The work in [3] determines the frequent phrases in the collection and leverages them in order to determine document clusters.
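
A short co-clustering sketch is given below using spectral co-clustering on the term-document matrix; it assumes scikit-learn's SpectralCoclustering is available and uses a placeholder corpus.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import SpectralCoclustering

texts = ["gene protein expression", "protein sequence gene",
         "stock market trading", "market price trading"]

X = CountVectorizer().fit_transform(texts)               # the n x d term-document matrix

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(X)
print("document clusters:", model.row_labels_)           # clusters over rows (documents)
print("word clusters:    ", model.column_labels_)        # clusters over columns (terms)
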
It is important to understand that the problems of word clustering and document clustering are essentially dual problems which are closely related to one another. The former is related to dimensionality reduction, whereas the latter is related to traditional clustering. The boundary between the two problems is quite fluid, because good word clusters provide hints for finding good document clusters and vice versa. For example, a more general probabilistic framework which determines word clusters and document clusters simultaneously is referred to as topic modeling [4]. Topic modeling is a more general framework than either clustering or dimensionality reduction.
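
As a small illustration of this framework, the sketch below uses LDA (a Bayesian relative of the PLSA in [4], used here as an assumed stand-in) to obtain soft document clusters and soft word clusters at the same time; scikit-learn and the toy corpus are assumptions.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts = ["gene protein expression", "protein sequence gene",
         "stock market trading", "market price trading"]

X = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

print(lda.transform(X).round(2))      # per-document topic distributions (soft document clusters)
print(lda.components_.round(2))       # per-topic word weights (soft word clusters)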

3.3. Semi-Supervised Clustering

In some applications, prior knowledge may be available about the kinds of clusters that are present in the underlying data. This prior knowledge may take the form of labels attached to the documents, which indicate their underlying topics. For example, if we wish to use the broad distribution of topics in the Yahoo! taxonomy in order to supervise the clustering process of a new web collection, one way to perform supervision would be to add some labeled pages from the Yahoo! taxonomy to the collection. Typically such pages would contain labels of the form @Science@Astronomy or @Arts@Painting, which indicate the subject area of the added pages. Such knowledge can be very useful in creating significantly more coherent clusters, especially when the total number of clusters is large. The process of using such labels to guide the clustering process is referred to as semi-supervised clustering. This form of learning is a bridge between the clustering and classification problems, because it uses the underlying class structure, but it is not completely tied down by the specific structure. As a result, such an approach finds applicability in both clustering and classification scenarios.
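
One simple way to exploit such labels, sketched below under assumed names and toy pages, is to compute a centroid for each labeled category and use these centroids to seed k-means over the unlabeled collection (scikit-learn is assumed).

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

labeled = {"@Science@Astronomy": ["galaxy star telescope", "planet orbit astronomy"],
           "@Arts@Painting":     ["oil painting canvas", "watercolour brush painting"]}
unlabeled = ["new telescope observes a distant galaxy", "an exhibition of canvas paintings"]

vec = TfidfVectorizer().fit([t for ts in labeled.values() for t in ts] + unlabeled)
seeds = np.vstack([np.asarray(vec.transform(ts).mean(axis=0)).ravel()      # one centroid per labeled category
                   for ts in labeled.values()])

km = KMeans(n_clusters=len(labeled), init=seeds, n_init=1).fit(vec.transform(unlabeled))
print(list(labeled), km.labels_)       # cluster i corresponds to the i-th labeled category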

4. References

  • [1] P. Pantel, D. Lin. Document Clustering with Committees, ACM SIGIR Conference, 2002.

  • [2] F. Beil, M. Ester, X. Xu. Frequent term-based text clustering, ACM KDD Conference, 2002.

  • [3] O. Zamir, O. Etzioni. Web Document Clustering: A Feasibility Demonstration, ACM SIGIR Conference, 1998.

  • [4] T. Hofmann. Probabilistic Latent Semantic Indexing, ACM SIGIR Conference, 1999.

  • [5] C. C. Aggarwal, Y. Zhao, P. S. Yu. On Text Clustering with Side Information, ICDE Conference, 2012.

  • [6] C. C. Aggarwal, P. S. Yu. On Effective Conceptual Indexing and Similarity Search in Text, ICDM Conference, 2001.

  • [7] C. C. Aggarwal, C. Procopiuc, J. Wolf, P. S. Yu, J.-S. Park. Fast Algorithms for Projected Clustering, ACM SIGMOD Conference, 1999.

  • [8] R. Agrawal, J. Gehrke, P. Raghavan, D. Gunopulos. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, ACM SIGMOD Conference, 1999.

  • [9] M. Franz, T. Ward, J. McCarley, W.-J. Zhu. Unsupervised and supervised clustering for topic tracking, ACM SIGIR Conference, 2001.

5. Useful Links


    شما در قسمت ازمایش بیشتر روش ها و الگوریتم ها را توضیح دادید من در قسمت ازمایش انتظار داشتم که نتیجه کار هایی که انجام داده بودید مشاهده کنم . ولی در کل مفاهیم پروژه را بخوبی توضیح داده اید.