Improving performance of STC

Improving performance of STC

willycws
Hi,

I am currently using Carrot2 STC as part of my experiments. Is there a manual describing which parameters to set to improve the performance of STC?

Thanks,
Willy
 
CONFIDENTIALITY: This email is intended solely for the person(s) named. The contents may be confidential and/or privileged. If you are not the intended recipient, please delete it, notify us, and do not copy or use it, nor disclose its contents. Thank you.
 
Save the Earth: Print Only When you Need it.

------------------------------------------------------------------------------
ThinkGeek and WIRED's GeekDad team up for the Ultimate
GeekDad Father's Day Giveaway. ONE MASSIVE PRIZE to the
lucky parental unit.  See the prize list and enter to win:
http://p.sf.net/sfu/thinkgeek-promo
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Re: Improving performance of STC

Dawid Weiss-2
Can you specify what kind of data you want to process? How many
documents, and what's their total size? STC won't scale to super-large
data sets (because it is implemented to run completely in main
memory). You can take a look at the Apache Mahout project; it
implements large-scale clustering algorithms.
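
Since STC keeps everything in main memory, the JVM heap limit is the practical ceiling on input size. A minimal sketch for checking that limit from Java follows; the class name HeapCheck is made up for illustration and is not part of Carrot2.

```java
/** Hypothetical helper (not part of Carrot2): reports the JVM heap limit. */
class HeapCheck {

    /** Maximum amount of memory the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Started with a larger heap setting (for example the standard -Xmx option), it should report a value at or somewhat below the configured limit, depending on the JVM.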

Dawid

On Sun, Jun 20, 2010 at 5:18 AM, #CHUA WEE SIONG WILLY#
<[hidden email]> wrote:

> Hi,
>
> I am currently using Carrot2 STC as part of my experiments. Is there a
> manual describing which parameters to set to improve the performance
> of STC?
>
> Thanks,
> Willy

Re: Improving performance of STC

willycws
Hi,

I'm processing keywords from news articles, about 500 documents at most. I'm currently trying to use STC to merge similar clusters to get better results. STC has quite a few parameters, but changing them sometimes has no effect on the clusters. I would appreciate your advice on which essential STC parameters have a large effect on the result set.

The parameters I have tried so far, all in STCClusteringParameters.class, are:
maxBaseClusters
maxPhraseOverlap
minBaseClusterScore
mergeThreshold
mostGeneralPhraseCoverage

Thanks,
Willy

________________________________________
From: Dawid Weiss [[hidden email]]
Sent: Sunday, 20 June, 2010 4:42:04 PM
To: Carrot2-developers
Subject: Re: [C2-devel] Improving performance of STC

Can you specify what kind of data you want to process? How many
documents, and what's their total size? STC won't scale to super-large
data sets (because it is implemented to run completely in main
memory). You can take a look at the Apache Mahout project; it
implements large-scale clustering algorithms.

Dawid


Re: Improving performance of STC

Dawid Weiss-2
There should be absolutely no problems with memory when clustering 500
documents (if you do have problems, increase the default Java heap
with -Xmx1024m, for example, but this shouldn't be necessary). As for
parameters -- you will find a description of these parameters (and
their meaning) in the research paper by Zamir and Etzioni:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.4719

There is no way to predict how each and every parameter will affect
your clusters, because this depends on many factors. Experiment (the
Workbench is a good tool for this) and find your best settings
manually.
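
One reason a parameter change can leave the clusters untouched is that many of these settings act as thresholds: nothing changes until the value crosses a point where some comparison flips. The sketch below illustrates this with the merge step described in the Zamir and Etzioni paper, where two base clusters are linked when their shared documents make up more than a given fraction of each cluster (the paper uses 0.5), and linked groups are then merged. It is a toy model, not Carrot2 code: the class MergeSketch, the sample document ids, and the idea that mergeThreshold plays the role of this fraction are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy illustration of STC-style base cluster merging (not Carrot2 code). */
class MergeSketch {

    /** True if the shared documents exceed `threshold` of BOTH clusters. */
    static boolean similar(Set<Integer> a, Set<Integer> b, double threshold) {
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size() > threshold * a.size()
            && common.size() > threshold * b.size();
    }

    /** Number of merged clusters = connected components of the similarity graph. */
    static int mergedClusterCount(List<Set<Integer>> base, double threshold) {
        int[] parent = new int[base.size()];
        for (int i = 0; i < parent.length; i++) {
            parent[i] = i;
        }
        int components = base.size();
        for (int i = 0; i < base.size(); i++) {
            for (int j = i + 1; j < base.size(); j++) {
                if (similar(base.get(i), base.get(j), threshold)) {
                    int ri = find(parent, i);
                    int rj = find(parent, j);
                    if (ri != rj) {          // link two previously separate groups
                        parent[ri] = rj;
                        components--;
                    }
                }
            }
        }
        return components;
    }

    /** Union-find root lookup with path halving. */
    static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    public static void main(String[] args) {
        // Made-up base clusters, each a set of document ids.
        List<Set<Integer>> base = List.of(
            Set.of(1, 2, 3, 4),
            Set.of(2, 3, 4, 5),
            Set.of(4, 5, 6, 7),
            Set.of(8, 9));
        for (double t : new double[] {0.2, 0.4, 0.6, 0.8}) {
            System.out.println("threshold " + t + " -> "
                + mergedClusterCount(base, t) + " clusters");
        }
    }
}
```

With this sample data, thresholds 0.2 and 0.4 both give 2 clusters, 0.6 gives 3 and 0.8 gives 4, so moving the threshold between 0.2 and 0.4 has no visible effect. Sweeping a parameter over a range like this, rather than trying a single alternative value, makes it easier to see where it starts to matter.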

Dawid


On Sun, Jun 20, 2010 at 11:36 AM, #CHUA WEE SIONG WILLY#
<[hidden email]> wrote:

> Hi,
>
> I'm processing keywords from news articles, about 500 documents at most. I'm currently trying to use STC to merge similar clusters to get better results. STC has quite a few parameters, but changing them sometimes has no effect on the clusters. I would appreciate your advice on which essential STC parameters have a large effect on the result set.
>
> The parameters I have tried so far, all in STCClusteringParameters.class, are:
> maxBaseClusters
> maxPhraseOverlap
> minBaseClusterScore
> mergeThreshold
> mostGeneralPhraseCoverage
>
> Thanks,
> Willy

Re: Improving performance of STC

willycws
Thanks a lot. I will check the research paper.

Thanks,
Willy

________________________________________
From: Dawid Weiss [[hidden email]]
Sent: Sunday, 20 June, 2010 8:06:37 PM
To: Carrot2-developers
Subject: Re: [C2-devel] Improving performance of STC

There should be absolutely no problems with memory when clustering 500
documents (if you do have problems, increase the default Java heap
with -Xmx1024m, for example, but this shouldn't be necessary). As for
parameters -- you will find a description of these parameters (and
their meaning) in the research paper by Zamir and Etzioni:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.4719

There is no way to predict how each and every parameter will affect
your clusters, because this depends on many factors. Experiment (the
Workbench is a good tool for this) and find your best settings
manually.

Dawid

