In text classification, the goal is to assign topical labels to the documents at hand. This is essentially a supervised problem: a collection of text documents already grouped by topic is given to the system as training data, so that by learning from this collection it can assign new input documents to one of those topical groups.

In this project, various methods for classifying text documents are studied and implemented for the Persian language.


Github

1. Abstract

TODO

2. Introduction

Data is being generated at a rapid rate, and with it grows the need for better web search, spam filtering, content recommendation, and so on, all of which fall within the scope of document classification. Document classification is the problem of assigning one or more pre-specified categories to documents.
A typical document classification process consists of training and building a classifier from a training set, then giving it documents to classify. Depending on the training set available, two learning approaches are possible:

  • Supervised learning: This approach became popular, displacing Knowledge Extraction, and is the standard way to solve document classification problems in many domains. It requires a large labeled set of examples to build the classifier.

    Many common classifiers are used, such as Naive Bayes, decision trees, SVMs, kNN, etc. Naive Bayes has been shown to be a good classifier in terms of accuracy and computational efficiency, while also being simple.

  • Semi-supervised learning: A recent trend is to benefit from both labeled and unlabeled data. This approach, which sits between supervised and unsupervised learning, is exciting from both theoretical and practical points of view. Labeled data is difficult, expensive, and time-consuming to obtain, while unlabeled data is cheap and easy to collect. Moreover, results have shown that higher accuracy can be reached simply by adding unlabeled data to an existing supervised classification system.

    Some often-used methods in semi-supervised classification include EM with generative mixture models, self-training, co-training, transductive support vector machines, and graph-based methods.

Hence one of the major decisions when solving a document classification problem is choosing the right approach and method. For example, a poor match between problem structure and model assumptions can lower accuracy in semi-supervised learning.

Another problem is the ever-increasing number of features and variables in these methods. This raises the issue of finding good features, which can be more difficult than using those features to build a classification model. Using too many features can lead to overfitting, hinders interpretability, and is computationally expensive.

3. Related Work

Here we study a few methods and compare them.

3.1. Preprocessing

Before doing any real processing, a few preprocessing tasks should be performed to increase effectiveness. Most of these tasks are Information Retrieval techniques that are also used in document classification, since both fields process text documents.

Stemming smooths the data set by grouping similar words and words with the same root. For example, the weight of voting as evidence in favor of political documents is diluted when some documents use voting and others use vote: although both carry the same meaning, without stemming they are treated as different features.

As we'll see in the next section (Dimensionality Reduction), not all terms are worth processing, and only a fraction of the whole term set will be chosen to represent the document. A high percentage of the terms in a document are very frequent but useless terms, such as pronouns, referred to as stop words. If we then choose highly frequent terms for document representation, stop words will crowd the feature set, resulting in low accuracy.
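To make this concrete, here is a minimal preprocessing sketch using NLTK; the tokenizer, stemmer, stop-word list, and sample sentence are illustrative choices, not the project's actual pipeline (for Persian, a library such as hazm could play the same role):

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# requires: nltk.download("punkt"); nltk.download("stopwords")
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # Tokenize, drop stop words, and stem so that "voting" and "vote"
    # collapse into the same feature.
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]

print(preprocess("They are voting on the new vote today"))
# ['vote', 'new', 'vote', 'today']
```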

3.2. Dimensionality Reduction

In practical, and sometimes academic, settings the high number of terms in the documents causes problems, since some classification algorithms cannot handle that many. It may also cause classifiers to tune to the training data, yielding very good accuracy when reclassifying the documents they were trained on but much worse accuracy on unseen data, and it makes training and classification slow. There are two approaches to deal with this issue: feature selection and feature extraction. We'll examine each briefly.

3.2.1. Feature Selection

Feature selection is the problem of selecting the relevant subset T^{\prime} of the terms T that yields the highest effectiveness, ignoring the rest. Done right, it improves prediction performance, improves speed, reduces the cost of predictors, and decreases storage requirements. Methods used to achieve this include ranking, filters, wrappers, and embedded methods. Depending on the goal, one may outperform the others; e.g., if the task is to predict as accurately as possible, an algorithm with a safeguard against overfitting might beat plain ranking. A major factor in feature selection is the aggressivity \cfrac{\vert T \vert}{\vert T^{\prime} \vert} of the reduction, which directly affects effectiveness: high aggressivity removes some useful terms, so it should be chosen with care.

To give a sense of how feature selection works, we'll look at filtering. Filtering keeps the terms that score high according to an importance measure. A variety of measures are available; one of them is document frequency. Using document frequency as the measure, the feature selection algorithm keeps the terms that occur in the highest number of documents. With this measure it is possible to reduce the dimensionality by a factor of 10 with no loss in effectiveness.
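As a sketch under simple assumptions (documents arrive as lists of tokens; the function name is made up for illustration), document-frequency filtering takes only a few lines:

```python
from collections import Counter

def top_terms_by_df(tokenized_docs, k):
    """Keep the k terms that occur in the highest number of documents."""
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))  # count each term at most once per document
    return [term for term, _ in df.most_common(k)]

docs = [["vote", "senate", "bill"],
        ["vote", "match", "goal"],
        ["vote", "bill", "tax"]]
print(top_terms_by_df(docs, 2))  # ['vote', 'bill']
```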

3.2.2. Feature Extraction

Feature extraction generates a set of artificial terms to improve effectiveness. The difference is that this term set is not a subset of the original terms, and its terms may not appear in the data at all, which is why the approach is sometimes called feature construction. Feature construction methods include clustering, basic linear transforms, latent semantic indexing, and more sophisticated techniques. The clustering method replaces a group of similar terms by a cluster centroid, which becomes a feature, mostly via k-means and hierarchical clustering algorithms.
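A hedged sketch of the clustering variant with scikit-learn: terms are represented by their tf-idf profile across documents, grouped with k-means, and each document's term weights are summed per cluster to form the new features. The tiny corpus, the cluster count, and the merging-by-sum step are illustrative assumptions, not the project's implementation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the senate vote passed", "the match ended with a goal",
          "tax bill vote delayed", "the team scored a late goal"]

X = TfidfVectorizer().fit_transform(corpus)  # documents x terms
term_vectors = X.T.toarray()                 # one row per term

# Group similar terms into 3 clusters.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(term_vectors)

# Merge each group of terms into a single artificial feature by summing
# a document's weights over the terms in that cluster.
indicator = np.zeros((term_vectors.shape[0], 3))
indicator[np.arange(len(km.labels_)), km.labels_] = 1.0
doc_features = X @ indicator                 # documents x clusters
print(doc_features.shape)                    # (4, 3)
```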

3.3. Supervised Classifiers

3.3.1. Naive Bayes

As mentioned earlier, Naive Bayes is a simple probabilistic classifier. Its output P(y \vert x), for y a category in the set of categories Y and x a document in the set of documents X, is the probability of x belonging to class y. Simply put, the Naive Bayes classifier learns from the number of occurrences of words (features) independently: it does not model the effect of the presence or absence of one word on another, which reduces computation. This assumption also has its drawbacks.

y_{1}, ..., y_{\left\vert{Y}\right\vert} \in Y
x_{1}, ..., x_{\left\vert{X}\right\vert} \in X

We compute the category of an unseen document d with word list W as follows:

w_{1}, ..., w_{l_{d}} \in W
y^{*}=argmax_{y_{j} \in Y} P(y_{j}) \prod_{i=1}^{l_{d}} P(w_{i} \vert y_{j})

Where P(y_{j}) is the a priori probability of class y_{j} and P(w_{i} \vert y_{j}) is the conditional probability of w_{i} given class y_{j}. Applying the Laplace law of succession to estimate P(w_{i} \vert y_{j}) gives:

P(w_{i} \vert y_{j}) = \cfrac{n_{ij} + 1}{n_{j} + k_{j}}

Where n_{j} is the total number of words in class y_{j}, n_{ij} is the number of occurrences of word w_{i} in class y_{j}, and k_{j} is the vocabulary size of class y_{j}.
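The formulas above translate directly into code. Here is a minimal multinomial Naive Bayes with Laplace smoothing; the input format (token lists plus labels) and the function names are assumptions for illustration:

```python
import math
from collections import Counter, defaultdict

def train(docs, labels):
    prior = Counter(labels)             # class counts, for P(y_j)
    word_counts = defaultdict(Counter)  # n_ij: occurrences of w_i in class y_j
    for tokens, y in zip(docs, labels):
        word_counts[y].update(tokens)
    return prior, word_counts

def classify(tokens, prior, word_counts):
    total = sum(prior.values())
    best, best_score = None, float("-inf")
    for y, wc in word_counts.items():
        n_j = sum(wc.values())          # total words in class y_j
        k_j = len(wc)                   # vocabulary size of class y_j
        # log P(y_j) + sum_i log P(w_i | y_j), with Laplace smoothing
        score = math.log(prior[y] / total)
        score += sum(math.log((wc[w] + 1) / (n_j + k_j)) for w in tokens)
        if score > best_score:
            best, best_score = y, score
    return best
```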

3.3.2. Decision Tree

A decision tree classifier constructs a tree whose nodes represent features. In a binary document representation, the branch taken out of a node depends on whether the feature is present in the document. Starting from the root and checking features repeatedly, we reach a leaf that represents the corresponding category of the document. In contrast to the Naive Bayes method above, which is quantitative and hard for humans to interpret, decision tree outputs are symbolic and easily interpreted. Decision tree methods include ID3, C4.5, and C5.

An algorithm for creating a decision tree might work as follows. For category c_{i}, at each step a term t_{k} is chosen, usually according to an entropy or information gain measure; the documents are then divided into groups by the selected term t_{k}, each group is placed in a separate subtree, and the procedure repeats until each leaf of the tree contains only training documents assigned to the same category c_{i}, which is then chosen as the label for that leaf.

Decision trees are prone to overfitting, since some branches may be too specific to the training data. To prevent the classifier from building huge trees with overly specific branches, parameters such as maximum depth or minimum observations per leaf are set, and pruning techniques are used.
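A brief sketch with scikit-learn showing the binary representation and the depth/leaf limits mentioned above; the training texts, labels, and parameter values are placeholders, not the project's actual configuration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

train_texts = ["senate passed the vote", "the team scored a goal"]  # placeholders
train_labels = ["Siasi", "Vrzsh"]                                   # placeholders

vectorizer = CountVectorizer(binary=True)  # binary document representation
X = vectorizer.fit_transform(train_texts)

# Entropy-based splits plus depth/leaf limits to curb overfitting.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=20, min_samples_leaf=5)
clf.fit(X, train_labels)
print(clf.predict(vectorizer.transform(["a late goal won the match"])))
```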

3.4. Semi-supervised Classifiers

3.4.1. Self-training

This method is mostly used to replace a supervised classifier with a semi-supervised one, enjoying the benefits of a semi-supervised algorithm with little extra effort. First, a classifier is trained on the small amount of labeled data available and is used to classify the unlabeled data. At each step, the most promising documents are added to the training set and the classifier is re-trained on it. In other words, the classifier uses its own predictions to train itself. Note that a mistaken prediction's effect is compounded if the document is added to the training set; some algorithms detect this with a threshold and try to recover by unlearning those documents.
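A minimal self-training loop under stated assumptions: X_labeled/y_labeled and X_unlabeled are precomputed dense feature matrices, and the base learner (MultinomialNB) and the 0.95 confidence threshold are arbitrary choices for the sketch:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95, max_iter=10):
    clf = MultinomialNB()
    for _ in range(max_iter):
        clf.fit(X_labeled, y_labeled)
        if X_unlabeled.shape[0] == 0:
            break
        proba = clf.predict_proba(X_unlabeled)
        confident = proba.max(axis=1) >= threshold  # the most promising documents
        if not confident.any():
            break
        pseudo_labels = clf.classes_[proba.argmax(axis=1)][confident]
        # Add the classifier's own confident predictions to the training set.
        X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])
        y_labeled = np.concatenate([y_labeled, pseudo_labels])
        X_unlabeled = X_unlabeled[~confident]
    return clf
```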

4. Experiments

To evaluate the methods above and compare them with each other, using different parameters and in different situations, we need an evaluation method. To this end, a confusion matrix is built while classifying the test set. Confusion matrices show how well the classifier predicted and separated the classes. A confusion matrix is an n×n matrix, with n the number of classes: rows represent the class assigned by experts, columns the class predicted by the classifier, and each classified document increments m[i][j], where i is the index of the expert label and j the index of the predicted label.
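A sketch of this bookkeeping (function names are illustrative): rows index the expert label, columns the predicted label, and macro-averaged precision and recall fall out of the matrix:

```python
import numpy as np

def confusion_matrix(true_idx, pred_idx, n_classes):
    m = np.zeros((n_classes, n_classes), dtype=int)
    for i, j in zip(true_idx, pred_idx):
        m[i][j] += 1  # i = expert label index, j = predicted label index
    return m

def macro_precision_recall(m):
    tp = np.diag(m)
    precision = np.mean(tp / np.maximum(m.sum(axis=0), 1))  # per predicted class
    recall = np.mean(tp / np.maximum(m.sum(axis=1), 1))     # per reference class
    return precision, recall

m = confusion_matrix([0, 0, 1, 2, 2], [0, 1, 1, 2, 0], 3)
print(m)
print(macro_precision_recall(m))
```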

<small style="text-align: center;">Table 1 - A confusion matrix in which rows are reference and cols are test (class labels are romanized Persian: Siasi = politics, Vrzsh = sports, Eqts = economics).</small>

|       | Siasi | Vrzsh | Eqts |
|-------|-------|-------|------|
| Siasi | 24    | 1     | 5    |
| Vrzsh | 2     | 21    | 3    |
| Eqts  | 7     | 2     | 17   |

| Classifier      | Accuracy | Precision | Recall   |
|-----------------|----------|-----------|----------|
| Naive Bayes     | 0.564    | 0.510183  | 0.308059 |
| Decision Tree   | 0.446    | 0.301544  | 0.244987 |
| SVM (SVC)       | 0.33     | 0.52518   | 0.121543 |
| SVM (LinearSVC) | 0.594    | 0.417879  | 0.333864 |
Approved

Hello,
You used the SVM method in part of the project but gave no explanation of it.
There are newer methods for text classification, yet you used older ones; it would have been better to implement a more recent method.
It would also have been better to number the citations in the text, so it is clear which part each paper refers to.
In the experiments section, walking through an example would have made the process clearer.
Using requirements.pip and a LICENSE was a nice idea.
Good job :)

Rejected

The project is done very well and thoroughly, but it would have been better to complete the Abstract section phase by phase.
It would also have been better to write a guide for running the code.

Approved

Positive points:
The project looks interesting and practical.
The introduction builds motivation for the reader to keep reading and is well written.
The English writing is generally good.
The mathematical formulas and the way the parameters are computed are precise and clearly presented.
Combining the Artificial Intelligence and Information Storage and Retrieval courses is very interesting, practical, and well done.

Negative points:
The algorithms are described only as prose. It would have been better to present them step by step (step 1, step 2, ...) and to use pseudocode in places.
The complete absence of figures and charts is a bit bothersome. It would have been nice to also show the experiment results with charts; charts greatly help the reader grasp the concepts. Likewise, figures conceptually illustrating the clustering system or the algorithms could have been useful.
In the introduction you said the classification would be done on Persian texts, but in the experiment results I believe romanized Persian (Finglish) words are used.
Given that this paper is supposed to classify Persian texts, using more Persian would, in my opinion, have been better.

Let me give it a 5-star rating, since the criticisms and negative points are not very problematic, can be fixed quickly, and do not harm the core of the project (although some of them are really bothersome...).