News articles clustering

News articles clustering

Gorgi Kakasevski
Dear all,
I found Carrot2 on the web, and I have the following problem:

I crawl news from several news sites (text only), and I want to group (cluster) similar articles. I have about 4000 articles from the last 2 days that I want to group. Using tf-idf, I extract the 7 most important keywords for each article. Now I need an (agglomerative/hierarchical) clustering algorithm to cluster the articles. Can Carrot2 do this job? Can you give me some advice?

Regards,

Gorgi Kakasevski, M.Sc.

Faculty of Information and Communication Technologies
FON University - Skopje, Republic of Macedonia
e-mail: [hidden email]


_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers
Re: News articles clustering

Stanislaw Osinski
Administrator
Hi,

> I crawl news from several news sites (text only), and I want to group (cluster) similar articles. I have about 4000 articles from the last 2 days that I want to group. Using tf-idf, I extract the 7 most important keywords for each article. Now I need an (agglomerative/hierarchical) clustering algorithm to cluster the articles. Can Carrot2 do this job? Can you give me some advice?

Carrot2 won't be suitable in this case -- it operates on the raw text of documents rather than on pre-extracted keywords. You can still try clustering the raw news text with Carrot2, but it will not be able to exploit the domain-specific properties of your documents (e.g. news release dates) -- clustering will be based only on the text of the articles.

For other clustering algorithms, you may want to take a look at the Apache Mahout project.
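
To make the approach from the question concrete, here is a minimal Python sketch of it: keep the top 7 tf-idf terms per article, then merge articles bottom-up with average-link agglomerative clustering on cosine similarity. It does not use the Carrot2 or Mahout APIs. The function names, the whitespace tokenizer, the smoothed idf formula and the 0.3 similarity threshold are all assumptions made for illustration, not recommendations.

```python
import math
from collections import Counter


def tfidf_keywords(docs, k=7):
    """Return, for each document, a dict {term: weight} of its k top tf-idf terms."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {
            term: (count / len(tokens)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Sort by descending weight, then alphabetically to break ties.
        top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        vectors.append(dict(top))
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerative(vectors, threshold=0.3):
    """Average-link clustering; stops when no pair of clusters reaches threshold."""
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]  # start with one cluster per document
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if avg > best:
                    best, pair = avg, (a, b)
        if best < threshold:
            break
        a, b = pair
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)
    return clusters


if __name__ == "__main__":
    articles = [
        "central bank raises interest rates again",
        "interest rates raised by central bank",
        "football team wins championship final",
        "championship final won by football team",
    ]
    print(agglomerative(tfidf_keywords(articles)))
```

This naive version recomputes all cluster pairs on every merge, so it is only practical for small inputs; for around 4000 articles a library implementation of hierarchical clustering would be the more realistic choice.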

Cheers,

Staszek
