In text classification the goal is to assign topical labels to the documents at hand. This is essentially a supervised problem: a collection of text documents already grouped by topic is given to the system as training data, so that by learning from this collection it can assign new input documents to one of these topical groups.

In this research, various methods for classifying text documents are studied and implemented for the Persian language.


GitHub

Abstract


TODO

Introduction


Data is being generated at a rapid rate, and with it grows the need for better web search, spam filtering, content recommendation and so on, all of which fall within the scope of document classification. Document classification is the problem of assigning one or more pre-specified categories to documents.
A typical document classification process consists of training a classifier on a training set and then using it to classify new documents. Depending on the training data available, two learning approaches exist:

  • Supervised learning—This approach gained popularity, displacing knowledge-extraction techniques, and is the standard way to solve document classification problems in many domains. It requires a large labeled set of examples to build the classifier.

    Many common classifiers are used, such as Naive Bayes, decision trees, SVMs, KNN, etc. Naive Bayes has been shown to be a good classifier in terms of both accuracy and computational efficiency, while remaining simple.

  • Semi-supervised learning—A recent trend is to benefit from both labeled and unlabeled data. This approach, which sits between supervised and unsupervised learning, is exciting from both theoretical and practical points of view. Labeled data is difficult, expensive and time-consuming to obtain, while unlabeled data is cheap and easy to collect. In addition, results have shown that one can reach higher accuracy simply by adding unlabeled data to an existing supervised classification system.

    Some often-used methods in semi-supervised classification include: EM with generative mixture models, self-training, co-training, transductive support vector machines and graph-based methods.
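To make the self-training idea above concrete, here is a minimal sketch in Python. The nearest-centroid base classifier, the toy one-dimensional data and the confidence threshold are all illustrative choices of ours, not part of the methods studied in this report; a real system would use a text classifier and a principled confidence estimate.

```python
# Self-training sketch: repeatedly pseudo-label confident unlabeled points
# and refit. Base learner: nearest centroid on 1-D points (illustrative).

def fit_centroids(labeled):
    """labeled: list of (x, y) pairs; returns {class: mean of its x values}."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Return (label, confidence). Confidence compares the nearest and
    farthest centroid distances (only meaningful for two classes)."""
    dists = sorted((abs(x - c), y) for y, c in centroids.items())
    d_near, label = dists[0]
    d_far = dists[-1][0]
    conf = d_far / (d_near + d_far) if (d_near + d_far) else 1.0
    return label, conf

def self_train(labeled, unlabeled, threshold=0.8, rounds=10):
    """Move confident predictions from the unlabeled pool into the
    labeled set, refit, and repeat until nothing moves."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        centroids = fit_centroids(labeled)
        keep, moved = [], False
        for x in pool:
            y, conf = predict(centroids, x)
            if conf >= threshold:
                labeled.append((x, y))   # pseudo-label a confident point
                moved = True
            else:
                keep.append(x)
        pool = keep
        if not moved:
            break
    return fit_centroids(labeled)
```

With labeled seeds at 0.0 ("a") and 10.0 ("b") and unlabeled points 1, 2, 8, 9, the first round pseudo-labels all four points and the centroids shift toward the cluster means.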

Hence one of the major decisions when solving document classification problems is choosing the right approach and method. For example, a poor match between the problem structure and the model assumptions can lead to lower accuracy in semi-supervised learning.

Another problem is the ever-increasing number of features and variables in these methods. This raises the issue of finding good features, which can be more difficult than using those features to build a classification model. Using too many features can lead to overfitting, hinders interpretability and is computationally expensive.

Related Works


Here we study a few methods and compare them. It should be noted that some preprocessing steps are currently omitted to keep the document concise.

Feature Selection


Feature selection is the problem of selecting a relevant subset of features and ignoring the rest. Done right, it improves prediction performance, speeds up predictors and reduces their cost, and decreases storage requirements. Methods used to achieve this include ranking, filters, wrappers and embedded methods. Depending on the goal, one can outperform the others; e.g. if the task is to predict as accurately as possible, an algorithm with a safeguard against overfitting may do better than simple ranking.
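A filter-style ranking can be sketched in a few lines. The score below (the absolute difference of a word's relative frequency between two classes) is a deliberately simple stand-in for measures such as information gain or chi-square, and the toy documents are invented for illustration.

```python
# Filter-style feature ranking: score each word by how unevenly it is
# distributed across two classes, then keep the top-scoring words.

from collections import Counter

def rank_features(docs):
    """docs: list of (word_list, label) pairs with exactly two labels.
    Returns words sorted by |P(w|c1) - P(w|c2)|, most discriminative first."""
    counts = {}   # label -> Counter of word occurrences in that class
    totals = {}   # label -> total number of words seen in that class
    for words, label in docs:
        counts.setdefault(label, Counter()).update(words)
        totals[label] = totals.get(label, 0) + len(words)
    c1, c2 = sorted(counts)
    vocab = set(counts[c1]) | set(counts[c2])
    def score(w):
        return abs(counts[c1][w] / totals[c1] - counts[c2][w] / totals[c2])
    return sorted(vocab, key=score, reverse=True)
```

Words that appear often in one class and rarely in the other rise to the top, while words spread evenly across classes sink to the bottom and can be dropped.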

Naive Bayes


As mentioned earlier, Naive Bayes is a simple probabilistic classifier. The output P(y|x), for y a category in the set of categories Y and x a document in the set of documents X, is the probability of x belonging to class y. Simply put, a Naive Bayes classifier learns from the number of occurrences of words (features) independently: it ignores the effect of the presence or absence of one word on the others, which reduces the computation. This assumption also has its drawbacks.

y_{1}, ..., y_{\left\vert{Y}\right\vert} \in Y
x_{1}, ..., x_{\left\vert{X}\right\vert} \in X

We'd compute the category for an unseen document d with the word list W using the following:

w_{1}, ..., w_{l_{d}} \in W
y^{*}=argmax_{y_{j} \in Y} P(y_{j}) \prod_{i=1}^{l_{d}} P(w_{i} \vert y_{j})

Where P(yj) is the prior probability of class yj and P(wi | yj) is the conditional probability of wi given class yj. Applying the Laplace law of succession to estimate P(wi | yj) gives:

P(w_{i} \vert y_{j}) = \cfrac{n_{ij} + 1}{n_{j} + k_{j}}

Where nj is the total number of words in class yj, nij is the number of occurrences of word wi in class yj and kj is the vocabulary size of class yj.
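The decision rule above translates almost line for line into code. The sketch below assumes a multinomial bag-of-words model, estimates the prior P(yj) from document counts, and works in log space to avoid numerical underflow on long documents; the toy training documents are illustrative.

```python
# Multinomial Naive Bayes with Laplace smoothing:
#   P(w_i | y_j) = (n_ij + 1) / (n_j + k_j)
#   y* = argmax_y  log P(y) + sum_i log P(w_i | y)

import math
from collections import Counter

def train(docs):
    """docs: list of (word_list, label). Returns the model parameters."""
    word_counts = {}        # y -> Counter of word occurrences (n_ij)
    doc_counts = Counter()  # y -> number of documents, for the prior P(y)
    for words, y in docs:
        word_counts.setdefault(y, Counter()).update(words)
        doc_counts[y] += 1
    return word_counts, doc_counts, len(docs)

def classify(model, words):
    """Return the argmax class for the word list of an unseen document."""
    word_counts, doc_counts, n_docs = model
    best, best_score = None, -math.inf
    for y, wc in word_counts.items():
        n_j = sum(wc.values())   # total words in class y
        k_j = len(wc)            # vocabulary size of class y
        score = math.log(doc_counts[y] / n_docs)        # log P(y)
        for w in words:
            score += math.log((wc[w] + 1) / (n_j + k_j))  # log P(w|y)
        if score > best_score:
            best, best_score = y, score
    return best
```

Note that k_j is taken per class, exactly as the formula above defines it; many implementations instead use the size of the global vocabulary, which changes the smoothing slightly.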

Experiments


PLACEHOLDER

References

  • SL Ting, WH Ip and AHC Tsang, "Is Naïve Bayes a Good Classifier for Document Classification?", International Journal of Software Engineering and Its Applications Vol. 5, No. 3, July, 2011.

  • X Zhu, "Semi-supervised learning literature survey", University of Wisconsin-Madison, 2006.

  • J Turian, L Ratinov, Y Bengio, "Word representations: a simple and general method for semi-supervised learning", ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010.

  • I Guyon, A Elisseeff, "An introduction to variable and feature selection", The Journal of Machine Learning Research, Vol. 3, 2003.

  • Y. H. Li, A. K. Jain, "Classification of Text Documents", The Computer Journal, 1998.

  • R Kohavi, GH John, "Wrappers for feature subset selection", Vol. 97, Artificial Intelligence, 1997.

  • Berry, Michael W., ed. Survey of Text Mining I: Clustering, Classification, and Retrieval. Vol. 1. Springer, 2004.


Useful Links

Mohsen Imani

Your work on the first phase of this project was very good, and I thank you for the effort you put in.

However, a few points come to mind that you should pay attention to in the next phases:

  • Several places in your text had spelling and grammatical errors, and in my opinion there is no reason to write the project report in English at all.

  • When you explain a method or present a definition in the text, it is better to immediately cite the reference you took it from, so that a reader who wants more detail about, say, semi-supervised methods or the Naive Bayes model can consult that reference.

  • Text classification is a problem on which a great deal of work has already been done, so you are also expected to move into newer areas, such as multi-class methods, or the use of features specific to the Persian language, and so on.