In text classification the goal is to assign topical labels to the documents at hand. This is essentially a supervised problem: a set of text documents already grouped by topic is given to the system as training data, so that by learning from this collection it can assign new incoming documents to one of these topical groups.

In this project, different methods for classifying text documents are studied and implemented for the Persian language.


GitHub

Abstract


TODO

Introduction


Data is being generated at a rapid rate, and with it grows the need for better web search, spam filtering, content recommendation, and so on, all of which fall within the scope of document classification. Document classification is the problem of assigning one or more pre-specified categories to documents.
A typical document classification process consists of training and building a classifier from a training set, which is then given documents to classify. Depending on the training set available, two learning approaches exist:

  • Supervised learning—This approach became popular, largely displacing knowledge-engineering approaches, and is the way document classification problems are solved in many domains. It requires a large labeled set of examples to build the classifier.

    Many common classifiers are used, such as Naive Bayes, decision trees, SVMs, k-NNs, etc. Naive Bayes has been shown to be a good classifier in terms of accuracy and computational efficiency, while remaining simple.

  • Semi-supervised learning—A recent trend is to benefit from both labeled and unlabeled data. This approach, which sits between supervised and unsupervised learning, is exciting from both theoretical and practical points of view. Labeled data is difficult, expensive, and time-consuming to obtain, while unlabeled data is cheap and easy to collect. In addition, results have shown that one can reach higher accuracy simply by adding unlabeled data to an existing supervised classification system.

    Some often-used methods in semi-supervised classification include: EM with generative mixture models, self-training, co-training, transductive support vector machines and graph-based methods.

Hence one of the major decisions when solving a document classification problem is choosing the right approach and method. For example, a poor match between problem structure and model assumptions can lead to lower accuracy in semi-supervised learning.

Another problem is the ever-increasing number of features and variables in these methods. This raises the issue of finding good features, which can be more difficult than using those features to create a classification model. Using too many features can lead to overfitting, hinders interpretability, and is computationally expensive.

Related Works


Here we study a few methods and compare them.

Preprocessing

Before doing any real processing, a few preprocessing tasks should be performed to increase effectiveness. These tasks are mostly information retrieval techniques, which are also used in document classification since both process text documents. Stemming smooths the data set by grouping similar words and words with the same root. For example, imagine the evidence that *voting* provides in favor of political documents being diluted because some documents use *voting* and others *vote*: although both carry the same meaning, they are treated as different features. As we will see in the next section (Dimensionality Reduction), not all terms are worth processing, and only a fraction of the whole term set will be chosen to represent the document. A high percentage of the terms in a document are very frequent but useless terms, such as pronouns, referred to as stop words. If we simply chose highly frequent terms for document representation, stop words would occupy the feature set, resulting in low accuracy.
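
As a rough illustration of these steps, the sketch below removes stop words and applies a toy suffix stemmer; the stop-word list and suffix rules are illustrative placeholders, not a real linguistic resource (a Persian pipeline would use a proper stemmer such as the one in hazm):

```python
# Stop-word removal plus a toy suffix stemmer. The stop-word list and the
# suffix rules are illustrative placeholders, not a real linguistic resource.

STOP_WORDS = {"the", "a", "an", "in", "of", "to", "is", "and"}
SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order

def stem(token):
    # Strip the first matching suffix so that e.g. "voting" and "votes"
    # both collapse to the same feature "vot".
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = text.lower().split()
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Voting in the election and counting the votes"))
# ['vot', 'election', 'count', 'vot']
```

Note how *voting* and *votes* end up as the same feature while the stop words disappear.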

Dimensionality Reduction

Normally in practical and sometimes in academic settings, the high number of terms in the documents causes problems, since some classification algorithms cannot handle that many features. It may also cause the classifier to tune itself to the training data, resulting in very good accuracy when reclassifying the data it was trained on but much worse accuracy on unseen data, and it makes training and classification slow. There are two approaches to dealing with this issue: feature selection and feature extraction. We will examine each briefly.

Feature Selection

Feature selection is the problem of selecting a relevant subset ($T ^{\prime}$) of the term set ($T$) that yields the highest effectiveness, and ignoring the rest. Done right, it improves prediction performance, speeds up predictors, reduces their cost, and decreases storage requirements. Methods used to achieve this include ranking, filters, wrappers, and embedded methods. Depending on the goal, one can outperform the others; e.g., if the task is to predict as accurately as possible, an algorithm with a safeguard against overfitting might beat plain ranking. A major factor in feature selection is the aggressiveness $\cfrac{\vert T \vert}{\vert T ^{\prime} \vert}$ of the reduction, which directly affects effectiveness: a highly aggressive reduction removes some useful terms, so it should be chosen with care. To give a sense of how feature selection works, consider filtering. Filtering keeps the terms that score high according to an importance measure; many measures are available, one of them being document frequency. Using document frequency as the measure, the feature selection algorithm chooses the terms that occur in the highest number of documents. With this measure it is possible to reduce the dimensionality by a factor of 10 with no loss in effectiveness.
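
Filtering by document frequency can be sketched as follows; the corpus and the number of kept terms are toy values:

```python
from collections import Counter

def select_by_document_frequency(documents, keep):
    """Keep the `keep` terms that occur in the largest number of documents."""
    df = Counter()
    for doc in documents:
        df.update(set(doc.split()))  # count each term at most once per document
    return [term for term, _ in df.most_common(keep)]

docs = [
    "sport match team win",
    "sport team score",
    "politics vote team",
    "politics vote election",
]
print(select_by_document_frequency(docs, keep=2))
```

In a real setting, `keep` would be derived from the chosen aggressiveness of the reduction rather than fixed by hand.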

Feature Extraction

Feature extraction generates a set of artificial terms to improve effectiveness. The difference is that this term set is not a subset of the original terms; the generated terms may not appear in the data at all, which is why this approach is sometimes called feature construction. Feature construction methods include term clustering, basic linear transforms, latent semantic indexing, and more sophisticated techniques. The clustering method replaces a group of similar terms by their cluster centroid, which becomes a feature, usually via K-means or hierarchical clustering algorithms.
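
A minimal sketch of clustering-based feature construction: each term is represented by its binary document-occurrence vector, and terms are greedily merged into a cluster when their vectors are similar enough. This greedy merge is a stand-in for K-means; the similarity threshold and corpus are toy values.

```python
# Toy feature construction by term clustering: each term is represented by a
# binary document-occurrence vector, and terms are greedily merged into an
# existing cluster when Jaccard similarity to its representative is high enough.

def occurrence_vector(term, docs):
    # 1 if the term appears in the document, else 0, for each document.
    return tuple(1 if term in doc.split() else 0 for doc in docs)

def jaccard(a, b):
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def cluster_terms(docs, threshold=0.5):
    clusters = []  # each entry: (member terms, representative occurrence vector)
    for term in sorted({t for doc in docs for t in doc.split()}):
        vec = occurrence_vector(term, docs)
        for members, rep in clusters:
            if jaccard(rep, vec) >= threshold:
                members.append(term)  # merge into this cluster
                break
        else:
            clusters.append(([term], vec))  # start a new cluster
    return [members for members, _ in clusters]

docs = ["vote election party", "vote election", "team goal", "team league goal"]
print(cluster_terms(docs))  # [['election', 'party', 'vote'], ['goal', 'league', 'team']]
```

Each resulting cluster would then be collapsed into a single artificial feature.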

Supervised Classifiers

Naive Bayes


As mentioned earlier, Naive Bayes is a simple probabilistic classifier. Its output P(y \vert x), for y a category in the set of categories (Y) and x a document in the set of documents (X), is the probability of x belonging to class y. Simply put, a Naive Bayes classifier learns from the number of occurrences of words (features) independently: it ignores the effect of the presence or absence of one word on another, which reduces the computations. This assumption also has its drawbacks.

y_{1}, ..., y_{\left\vert{Y}\right\vert} \in Y
x_{1}, ..., x_{\left\vert{X}\right\vert} \in X

We compute the category of an unseen document d with word list W as follows:

w_{1}, ..., w_{l_{d}} \in W
y^{*}=argmax_{y_{j} \in Y} P(y_{j}) \prod_{i=1}^{l_{d}} P(w_{i} \vert y_{j})

Where P(y_{j}) is the a priori probability of class y_{j} and P(w_{i} \vert y_{j}) is the conditional probability of w_{i} given class y_{j}. Applying the Laplace law of succession to estimate P(w_{i} \vert y_{j}), we get:

P(w_{i} \vert y_{j}) = \cfrac{n_{ij} + 1}{n_{j} + k_{j}}

Where n_{j} is the total number of words in class y_{j}, n_{ij} is the number of occurrences of word w_{i} in class y_{j}, and k_{j} is the vocabulary size of class y_{j}.
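
The classifier and the smoothed estimate above can be sketched directly (toy training set; logs of the probabilities are summed rather than multiplied, to avoid floating-point underflow):

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: list of (word_list, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)  # label -> word occurrence counts
    for words, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
    total = sum(class_counts.values())
    priors = {y: n / total for y, n in class_counts.items()}
    return priors, word_counts

def classify(priors, word_counts, words):
    best_label, best_score = None, -math.inf
    for y, prior in priors.items():
        n_j = sum(word_counts[y].values())  # total words in class y
        k_j = len(word_counts[y])           # vocabulary size of class y
        # Log-space product of Laplace-smoothed P(w_i | y_j).
        score = math.log(prior) + sum(
            math.log((word_counts[y][w] + 1) / (n_j + k_j)) for w in words
        )
        if score > best_score:
            best_label, best_score = y, score
    return best_label

train_set = [
    (["vote", "election", "party"], "politics"),
    (["vote", "parliament"], "politics"),
    (["match", "team", "goal"], "sport"),
    (["team", "league"], "sport"),
]
priors, counts = train(train_set)
print(classify(priors, counts, ["vote", "party"]))  # politics
```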

Decision Tree


A decision tree classifier constructs a tree whose internal nodes represent features. With a binary document representation, the branch taken out of a node depends on whether the feature is present in the document. Starting from the root and testing features repeatedly, we reach a leaf, which represents the category assigned to the document. In contrast to the Naive Bayes method above, which is quantitative and hard for humans to interpret, decision tree outputs are symbolic and easily interpreted. Decision tree methods include ID3, C4.5, and C5.

An algorithm for creating a decision tree might proceed as follows. For category c_{i}, at each step a term t_{k} is chosen, usually according to an entropy or information gain measure; the document set is then divided into groups by the selected term t_{k}, each group is placed in a separate subtree, and the procedure repeats until each leaf contains only training documents assigned to the same category c_{i}, which is then chosen as the label for that leaf.

Decision trees are prone to overfitting, since some branches may be too specific to the training data. To prevent the classifier from building huge trees with overly specific branches, parameters such as maximum depth or minimum observations per leaf, as well as pruning techniques, are used.
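
The term-selection step can be sketched as an information-gain computation over a binary term feature; the labels and the presence flags below are toy values:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(labels, has_term):
    """Gain from splitting the documents on presence of a term t_k."""
    with_term = [y for y, present in zip(labels, has_term) if present]
    without = [y for y, present in zip(labels, has_term) if not present]
    remainder = sum(
        len(part) / len(labels) * entropy(part)
        for part in (with_term, without) if part
    )
    return entropy(labels) - remainder

labels = ["politics", "politics", "sport", "sport"]
contains_vote = [True, True, False, False]  # "vote" separates the classes perfectly
contains_team = [True, False, True, False]  # "team" is uninformative
print(information_gain(labels, contains_vote))  # 1.0
print(information_gain(labels, contains_team))  # 0.0
```

At each node the tree builder would pick the term with the highest gain and recurse into the resulting subsets.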

Semi-supervised Classifiers

Self-training


This method is mostly used to replace a supervised classifier with a semi-supervised one, enjoying the benefits of a semi-supervised algorithm with little extra effort. First a classifier is trained on the small amount of labeled data available and is used to classify the unlabeled data. At each step, the most promising documents are added to the training set and the classifier is re-trained on the enlarged set. In other words, the classifier uses its own predictions to train itself. Note that the effect of a mistaken prediction is doubled if the document is chosen to be added to the training set; some algorithms detect this with a threshold and try to avoid it by unlearning those documents.
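
The loop can be sketched with a toy base classifier that scores a label by word overlap with that label's accumulated vocabulary (a stand-in for a real classifier such as Naive Bayes; the confidence threshold and the documents are illustrative):

```python
# Self-training sketch. The base classifier is a stand-in that scores a label
# by word overlap with that label's accumulated vocabulary.

def predict_with_confidence(vocabularies, words):
    scores = {
        label: len(set(words) & vocab) / max(len(words), 1)
        for label, vocab in vocabularies.items()
    }
    label = max(scores, key=scores.get)
    return label, scores[label]

def self_train(labeled, unlabeled, threshold=0.5, max_rounds=10):
    # Seed each label's vocabulary from the small labeled set.
    vocabularies = {}
    for words, label in labeled:
        vocabularies.setdefault(label, set()).update(words)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        # Collect the most promising (confident) predictions this round.
        scored = [(w, *predict_with_confidence(vocabularies, w)) for w in pool]
        confident = [(w, lbl) for w, lbl, conf in scored if conf >= threshold]
        if not confident:
            break
        # "Re-train" by growing the vocabularies with the new pseudo-labels.
        for words, label in confident:
            vocabularies[label].update(words)
            pool.remove(words)
    return vocabularies, pool

labeled = [(["vote", "election"], "politics"), (["team", "goal"], "sport")]
unlabeled = [["vote", "parliament"], ["parliament", "debate"], ["goal", "league"]]
vocabularies, remaining = self_train(labeled, unlabeled)
print(sorted(vocabularies["politics"]))  # ['debate', 'election', 'parliament', 'vote']
```

Note how a document with no overlap in the first round becomes classifiable in the second, once a confident neighbor has enlarged the vocabulary.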

Experiments


To experiment with the described methods, a Naive Bayes classifier with a bag-of-words document representation has been implemented. The dimensionality reduction approach is feature selection, and the selected measure is term frequency over the whole corpus. With a little tweaking, a ratio of 10% for \cfrac{\vert T ^{\prime} \vert}{\vert T \vert} showed the best accuracy on the old Hamshahri corpus. Training with 1800 documents and testing on the next 200 documents yields an accuracy of 0.53.
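
The accuracy figure (and the per-class precision and recall one would typically report alongside it) can be computed as in this sketch; the gold and predicted label lists are toy stand-ins for actual classifier output:

```python
def evaluate(gold, predicted):
    """Overall accuracy plus per-class precision and recall for single-label output."""
    accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
    metrics = {}
    for label in set(gold):
        tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[label] = (precision, recall)
    return accuracy, metrics

gold = ["politics", "politics", "sport", "sport"]
pred = ["politics", "sport", "sport", "sport"]
accuracy, metrics = evaluate(gold, pred)
print(accuracy)             # 0.75
print(metrics["politics"])  # (1.0, 0.5)
```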

Source code

References

  • F Sebastiani, "Machine learning in automated text categorization", ACM computing surveys (CSUR), 2002.

  • SL Ting, WH Ip and AHC Tsang, "Is Naïve Bayes a Good Classifier for Document Classification?", International Journal of Software Engineering and Its Applications Vol. 5, No. 3, July, 2011.

  • X Zhu, "Semi-supervised learning literature survey", University of Wisconsin-Madison, 2006.

  • J Turian, L Ratinov, Y Bengio, "Word representations: a simple and general method for semi-supervised learning", ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010.

  • I Guyon, A Elisseeff, "An introduction to variable and feature selection", The Journal of Machine Learning Research, Vol. 3, 2003.

  • Y. H. Li, A. K. Jain, "Classification of Text Documents", The Computer Journal, 1998.

  • R Kohavi, GH John, "Wrappers for feature subset selection", Vol. 97, Artificial Intelligence, 1997.

  • MW Berry (ed.), "Survey of Text Mining I: Clustering, Classification, and Retrieval", Vol. 1, Springer, 2004.



Rejected

In my opinion, the related works have been covered very well.

Approved

This is a very good project and it is hard to find fault with it. My score is based on the reasons I lay out below. I offer a few points as suggestions:

  • The explanations are concise and to the point. I think this is one of the strong points of your documentation. The explanations on semi-supervised methods were also interesting.

  • It would be better to delete the few introductory lines of Persian as well!!! (:D)

  • Very good and recent references have been used. It would be better to cite the source in the text wherever you quote a statement from it.

  • The project asks you to work on classifying Persian documents. Besides statistical methods, it would be better to also exploit structural features of the Persian language.

  • Add old_hamshahri_reader to hazm.

  • Could classification and clustering methods be combined for categorization?(!) Or could data such as the vocabulary of topical encyclopedias, for example, be brought into play?(?)

Alireza Nourian

It was good. But given the good references you included, you could have described more related works, or the same methods in more detail. That said, it is very good that your implementation is complete at this stage.

One hallmark of scientific writing is placing a citation after every claim, and the related-works section in particular rests on citations. You need at least one citation in almost every paragraph of related works. In fact, putting a paper's title in the references section means you have already referred to it somewhere in the text.

There is no such thing as a semi-supervised classifier; the term semi-supervised refers to a family of methods, which you yourself have listed. For instance, the method you described in this section is called Bootstrapping.

For the evaluation, please also report Precision and Recall.

Rejected

The explanations given were complete, and the algorithms presented in the related works section are explained well. The initial implementation and experimental results have also been recorded, which shows the project's progress.