org.carrot2.examples.clustering.ClusteringDocumentList

reinhard
Hi developers,

Maybe you remember me after a long time...
I'm trying Carrot2 again, and have restarted with
org.carrot2.examples.clustering.ClusteringDocumentList

I feed it with search results retrieved from Google.
One cluster is:

Yoga-krems-wachau-j C3 BCrgen-ullrich (2 documents, score:
8.036041263619147)
  [23]
http://www.herold.at/gelbe-seiten/krems-an-der-donau/lDk8q/yoga-krems-wachau-j%C3%BCrgen-ullrich/
       yoga krems wachau - Jürgen Ullrich / 3500 Krems an der Donau ...
Zu yoga krems wachau - Jürgen Ullrich in 3500 Krems an der Donau liefert
HEROLD.at Gelbe Seiten Kontaktdaten wie Adresse und Telefonnummer sowie
den ...www.herold.at/.../krems.../yoga-krems-wachau-jürgen-ullrich/ -  
www.herold.at/.../krems.../yoga-krems-wachau-jürgen-ullrich/ -  

  [35]
http://tupalo.com/de/krems-an-der-donau/yoga-krems-wachau-j%C3%BCrgen-ullrich
       Yoga Krems Wachau Jürgen Ullrich - Yogastudio - Krems an der
Donau ... Yoga Krems Wachau Jürgen Ullrich - Krems an der Donau,
Österreich Erfahrungsberichte von Leuten die dort wohnen. Die besten
Restaurants, Cafes, Friseure und
...tupalo.com/de/krems.../yoga-krems-wachau-jürgen-ullrich -  
tupalo.com/de/krems.../yoga-krems-wachau-jürgen-ullrich -  

My query to Google is "wachau krems". The URLs are URL-encoded exactly as
they appear in Google's search results, with no further processing.
Does Carrot2 expect the URLs to be decoded, as I have done in the snippet?

Regards,
Reinhard


------------------------------------------------------------------------------
Download Intel® Parallel Studio Eval
Try the new software tools for yourself. Speed compiling, find bugs
proactively, and fine-tune applications for parallel performance.
See why Intel Parallel Studio got high marks during beta.
http://p.sf.net/sfu/intel-sw-dev
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Re: org.carrot2.examples.clustering.ClusteringDocumentList

Stanislaw Osinski
Administrator
Hi Reinhard,

> Maybe you remember me after a long time...

Welcome back! While you were away, we've rewritten Lingo to make the code clearer, should you need to dive into it again :-)
 
> I'm trying Carrot2 again, and have restarted with
> org.carrot2.examples.clustering.ClusteringDocumentList
>
> I feed it with search results retrieved from Google.
> One cluster is:
>
> Yoga-krems-wachau-j C3 BCrgen-ullrich (2 documents, score:
> 8.036041263619147)
>  [23]
> http://www.herold.at/gelbe-seiten/krems-an-der-donau/lDk8q/yoga-krems-wachau-j%C3%BCrgen-ullrich/
>       yoga krems wachau - Jürgen Ullrich / 3500 Krems an der Donau ...
> Zu yoga krems wachau - Jürgen Ullrich in 3500 Krems an der Donau liefert
> HEROLD.at Gelbe Seiten Kontaktdaten wie Adresse und Telefonnummer sowie
> den ...www.herold.at/.../krems.../yoga-krems-wachau-jürgen-ullrich/ -
> www.herold.at/.../krems.../yoga-krems-wachau-jürgen-ullrich/ -
>
>  [35]
> http://tupalo.com/de/krems-an-der-donau/yoga-krems-wachau-j%C3%BCrgen-ullrich
>       Yoga Krems Wachau Jürgen Ullrich - Yogastudio - Krems an der
> Donau ... Yoga Krems Wachau Jürgen Ullrich - Krems an der Donau,
> Österreich Erfahrungsberichte von Leuten die dort wohnen. Die besten
> Restaurants, Cafes, Friseure und
> ...tupalo.com/de/krems.../yoga-krems-wachau-jürgen-ullrich -
> tupalo.com/de/krems.../yoga-krems-wachau-jürgen-ullrich -
>
> My query to Google is "wachau krems". The URLs are URL-encoded exactly as
> they appear in Google's search results, with no further processing.
> Does Carrot2 expect the URLs to be decoded, as I have done in the snippet?

Currently, Carrot2's text clustering algorithms (Lingo, STC) don't take the URL field into account when clustering; only the simple ByUrlClusteringAlgorithm does. However, any URLs that appear inside the title or snippet text are processed along with the rest of that text.
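
For reference, the `%C3%BC` sequence in the URLs above is the percent-encoded UTF-8 form of "ü". If URL-like text does end up in a title or snippet, one option is to decode it on the caller's side first. The following is only a minimal sketch using the standard `java.net.URLDecoder`; the class and method names (`UrlDecodeSketch`, `decode`) are made up for illustration and are not part of Carrot2.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper, not part of Carrot2: decodes percent-encoded UTF-8 in a URL.
class UrlDecodeSketch {

    static String decode(String encoded) {
        try {
            // URLDecoder implements form decoding, where '+' means a space.
            // Protect literal '+' characters so they survive unchanged.
            String protectedPlus = encoded.replace("+", "%2B");
            return URLDecoder.decode(protectedPlus, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String url = "http://tupalo.com/de/krems-an-der-donau/"
                + "yoga-krems-wachau-j%C3%BCrgen-ullrich";
        // The two bytes C3 BC are decoded into the single character U+00FC.
        System.out.println(decode(url));
    }
}
```

Whether decoding is needed at all depends on which fields the text is placed in, since only the title and snippet are clustered by Lingo and STC.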

In theory, "full" URLs (starting with a protocol specification) appearing in titles and snippets should be ignored when clustering. In practice though, looking at the label you got, not all URLs seem to be recognized properly. I've filed a bug for it and will commit a temporary fix. Once the build passes (http://builds.carrot2.org/browse/C2HEAD-CORE), you should be able to grab the 3.3.0-dev binaries from: http://download.carrot2.org/head/.
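
Until a fixed build is available, a caller could also drop full URLs from titles and snippets before handing them to the clustering algorithm. The sketch below is an assumption-laden workaround, not the tokenizer fix mentioned above: the class name `UrlStripSketch`, the method `stripFullUrls`, and the regular expression (a scheme, then `://`, then a run of non-whitespace characters) are all illustrative choices.

```java
import java.util.regex.Pattern;

// Hypothetical pre-processing helper, not Carrot2 code.
class UrlStripSketch {

    // Assumption: a "full" URL is <scheme>://<non-whitespace run>.
    private static final Pattern FULL_URL =
            Pattern.compile("\\b[a-zA-Z][a-zA-Z0-9+.-]*://\\S+");

    static String stripFullUrls(String text) {
        // Replace each URL with a space, then collapse repeated whitespace.
        String withoutUrls = FULL_URL.matcher(text).replaceAll(" ");
        return withoutUrls.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String snippet = "Yoga Krems Wachau "
                + "http://tupalo.com/de/krems-an-der-donau/yoga-krems-wachau-j%C3%BCrgen-ullrich"
                + " Yogastudio";
        System.out.println(stripFullUrls(snippet));
    }
}
```

Note that this only matches URLs that start with a protocol; shortened display forms without one, such as those beginning with `www.`, are left in place.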

Later today, I'll go through the RFC and fix the tokenizer properly.

Thanks for reporting this!

S.
