Benchmark dataset for STC algorithm

Benchmark dataset for STC algorithm

Han4Me
Hi,

   I am looking for a benchmark dataset that I can use to evaluate the Suffix Tree Clustering (STC) algorithm.
   Does anybody know of one I could use?

Regards,

Re: Benchmark dataset for STC algorithm

Stanislaw Osinski
Administrator

> I am searching for a benchmark dataset that I can use to evaluate the
> (Suffix Tree Clustering) STC algorithm.
> Does anybody have any idea or know which ones I can use?

Quality benchmarking of clustering algorithms is tricky; there are lots of papers on that topic alone. For quick benchmarks you can use the AMBIENT data set (http://credo.fub.it/ambient/), which is also available as a Carrot2 document source:

http://fisheye3.atlassian.com/browse/carrot2/trunk/core/carrot2-source-ambient

You can use the components from the carrot2-output-metrics package to calculate various quality metrics:

http://fisheye3.atlassian.com/browse/carrot2/trunk/core/carrot2-output-metrics
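
To give a rough idea of what such a quality metric computes, here is a minimal, self-contained sketch of cluster purity. It is an illustration only and is not the carrot2-output-metrics code: the class name ClusterPurity, the purity method and the topic labels are all made up for this example. It assumes every document carries a ground-truth topic label (as the documents in AMBIENT do) and that a document appearing in several clusters is counted once per cluster.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Carrot2. */
final class ClusterPurity {

    /**
     * Computes purity: the fraction of cluster members that carry the
     * most frequent ground-truth topic of their cluster.
     *
     * @param clusters one inner list per cluster, holding the ground-truth
     *                 topic label of each document assigned to that cluster
     */
    static double purity(List<List<String>> clusters) {
        int total = 0;        // number of cluster memberships seen
        int majoritySum = 0;  // memberships that match their cluster's majority topic
        for (List<String> cluster : clusters) {
            Map<String, Integer> counts = new HashMap<>();
            int best = 0;
            for (String topic : cluster) {
                int c = counts.merge(topic, 1, Integer::sum);
                best = Math.max(best, c);
            }
            total += cluster.size();
            majoritySum += best;
        }
        return total == 0 ? 0.0 : (double) majoritySum / total;
    }

    public static void main(String[] args) {
        // Made-up labels: three clusters over results for an ambiguous query.
        List<List<String>> clusters = List.of(
            List.of("jaguar-car", "jaguar-car", "jaguar-animal"),
            List.of("jaguar-animal", "jaguar-animal"),
            List.of("jaguar-os", "jaguar-car", "jaguar-os"));
        System.out.println(purity(clusters)); // prints 0.75
    }
}
```

Purity alone rewards many tiny clusters, so it is usually reported together with a measure that penalizes splitting one topic across clusters.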

Finally, for example benchmarking code, please see:

http://fisheye3.atlassian.com/browse/carrot2/trunk/applications/carrot2-examples/src/org/carrot2/examples/research/ClusteringQualityBenchmark.java?r=trunk

Thanks,

Staszek

_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Re: Benchmark dataset for STC algorithm

Han4Me
Thanks a lot for your fast reply.

Yes, I have looked at the AMBIENT data set and its implementation in Carrot2, but I am looking for a standard benchmark data set that is suitable for the STC algorithm and that I can base reliable results on.

I also found some other data sets, but they are not suitable for the nature of the STC algorithm.
Have you used, or do you know of, any that I could use, either free for academic purposes or paid?

Best Regards,

Re: Benchmark dataset for STC algorithm

Stanislaw Osinski
Administrator

> Thanks a lot for your fast reply.
>
> Yes, actually I looked at the AMBIENT data set and the implementation in
> Carrot2, but I am searching for a standard benchmark data set that is
> suitable for the STC algorithm and to build reliable results based on it.

I'm not aware of any "standard" data set for testing search results clustering. AMBIENT is aiming to establish such a standard.

Cheers,

S.


Re: Benchmark dataset for STC algorithm

Han4Me
OK, thanks a lot Stanislaw for your help.