Rule of thumb for setting Lingo desired base cluster size?

cmg
Hi -

I am experimenting to see whether cluster labels could serve as tags from a set of search results. Is there a rule of thumb for setting the Lingo algorithm's desired cluster count base? Also, what does a cluster's score weight indicate? Does a higher score indicate in some way how strongly a label relates to the documents in a cluster?

Thanks very much.

-greg
Fwd: Rule of thumb for setting Lingo desired base cluster size?

Stanislaw Osinski
Administrator
Hi Greg,

> I am experimenting to see if cluster labels might be able to serve as tags
> from a set of search results. Is there a rule of thumb for setting the
> Lingo algorithm's desired cluster count base?

Here's the bit of code that converts cluster count base to the actual number of clusters Lingo will attempt to create:


If you'd like to specify the number of clusters directly, set the cluster count base to the value that the inverse of this function gives for your desired count.

Ultimately, the final number of clusters may be smaller than the result of the above method, because overlapping clusters are removed and merged.
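
The snippet referenced in this reply is not present in this copy of the thread. As a rough sketch of the kind of conversion being described, the Java below assumes that the target cluster count grows with the square root of the number of input documents, scaled by the base divided by 10, and is capped at the document count. That formula, the class name `ClusterCountSketch`, and both method names are assumptions made for illustration; check the Lingo source of your Carrot2 version before relying on them.

```java
/** Illustrative only: the scaling formula below is an assumption, not code quoted from this thread. */
final class ClusterCountSketch {
    private ClusterCountSketch() {}

    /** Assumed mapping from cluster count base to the number of clusters Lingo will attempt. */
    static int clusterCount(int clusterCountBase, int documentCount) {
        int scaled = (int) ((clusterCountBase / 10.0) * Math.sqrt(documentCount));
        return Math.min(scaled, documentCount);
    }

    /** Inverse of the assumed mapping: a base that yields at least the desired cluster count. */
    static int baseForClusterCount(int desiredClusters, int documentCount) {
        if (documentCount <= 0 || desiredClusters <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        // The count can never exceed the number of documents, so clamp the target first.
        int target = Math.min(desiredClusters, documentCount);
        int base = (int) Math.ceil(10.0 * target / Math.sqrt(documentCount));
        // Guard against floating-point rounding and the integer truncation in clusterCount.
        while (clusterCount(base, documentCount) < target) {
            base++;
        }
        return base;
    }

    public static void main(String[] args) {
        int docs = 100;
        int base = baseForClusterCount(15, docs);
        System.out.println("base=" + base + " -> clusters=" + clusterCount(base, docs));
    }
}
```

Under this assumed formula the base is not a cluster count itself: the same base produces more clusters as the input grows, which is why inverting the function is needed to aim for a specific number. The value is still only an upper bound once overlapping clusters are removed or merged.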

 
> Also, what does a cluster's score
> weight indicate? Does a higher score indicate in some way how strongly a
> label relates to the documents in a cluster?

Technically, the label score is the cosine similarity between the label text and the corresponding column in the dimensionality-reduced VSM matrix. This translates to how "certain" Lingo is that this label is a "strong" topic in the input data. Unfortunately, the score does not have a direct connection to the strength of the relationship between the label and the cluster's documents. The latter relationship is very simple -- the cluster contains those documents that contain all of the cluster label's words.
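
To make the two relationships concrete, the sketch below computes a cosine similarity between two plain vectors and, separately, applies the "document contains all of the label's words" rule. It is not Carrot2 code: the class `LabelScoreSketch`, its methods, and the simple tokenization are assumptions for illustration, and the real algorithm works on its own preprocessed terms and a reduced matrix rather than on raw strings.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: not Carrot2 code; names and tokenization are assumptions. */
final class LabelScoreSketch {
    private LabelScoreSketch() {}

    /** Cosine similarity of two equal-length vectors; returns 0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Lower-cases the text and splits it on runs of characters that are neither letters nor digits. */
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                result.add(w);
            }
        }
        return result;
    }

    /** A document belongs to a cluster when it contains every word of the cluster label. */
    static boolean belongsToCluster(String document, String label) {
        return words(document).containsAll(words(label));
    }
}
```

Note that the two functions share no inputs: the score is computed from vectors, while membership is a plain word-containment test. That separation is why a high label score says nothing direct about how well the label fits the documents in its cluster.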

Stanislaw


_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers
Re: Fwd: Rule of thumb for setting Lingo desired base cluster size?

cmg

Hi Stanislaw -

Thank you very much, this is very helpful. I do have a follow-up question that I hope you can help with:

Is there any relationship between the number of clusters and the specificity of the tags?  Intuitively I would think that more clusters would result in fewer documents per cluster and therefore more specific labels, but I'm not sure that is true unless the documents in the cluster use very similar terms.

Thanks again!
-greg

(In reply to the message from Stanislaw Osinski of Jul 10, 2014, 6:16 AM.)

Fwd: Fwd: Rule of thumb for setting Lingo desired base cluster size?

Stanislaw Osinski
Administrator
Hi Greg,

Your intuition would be right for a "conventional" clustering algorithm, such as k-means. Lingo uses a completely different principle to ensure that cluster labels are readable. The algorithm first selects a number of phrases that would form good cluster labels and only then assigns documents to them. Therefore, generating more clusters would not make them smaller, but would likely increase the overlap between clusters (such that one document would belong to two, three or more clusters). To get smaller / more specific clusters, we'd need to filter out short phrases and/or phrases with a lot of matching documents. Looking at the list of current Lingo options, I don't see settings for that; let me see over the next few days if I can easily add them.
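
The label-first principle and the proposed filtering can be sketched as follows. This is not Lingo itself: the candidate labels are simply given, matching uses the "all label words present" rule on lower-cased, whitespace-separated tokens, and the `minWords` and `maxDocFraction` thresholds are hypothetical parameters standing in for the settings that, according to this reply, Lingo did not offer at the time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative only: a toy label-first clustering, not the Lingo implementation. */
final class LabelFirstSketch {
    private LabelFirstSketch() {}

    /** Lower-cased, whitespace-separated tokens of the text. */
    static Set<String> words(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+")));
    }

    /** Labels come first; each label then collects the indices of all documents containing its words. */
    static Map<String, List<Integer>> assign(List<String> labels, List<String> documents) {
        Map<String, List<Integer>> clusters = new LinkedHashMap<>();
        for (String label : labels) {
            List<Integer> members = new ArrayList<>();
            for (int i = 0; i < documents.size(); i++) {
                if (words(documents.get(i)).containsAll(words(label))) {
                    members.add(i);
                }
            }
            clusters.put(label, members);
        }
        return clusters;
    }

    /** Hypothetical filter: drop labels that are too short or that match too large a share of documents. */
    static Map<String, List<Integer>> keepSpecific(Map<String, List<Integer>> clusters,
                                                   int documentCount, int minWords, double maxDocFraction) {
        Map<String, List<Integer>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : clusters.entrySet()) {
            boolean longEnough = words(e.getKey()).size() >= minWords;
            boolean narrowEnough = e.getValue().size() <= maxDocFraction * documentCount;
            if (longEnough && narrowEnough) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    /** Number of documents that ended up in more than one cluster. */
    static int overlappingDocuments(Map<String, List<Integer>> clusters, int documentCount) {
        int[] memberships = new int[documentCount];
        for (List<Integer> members : clusters.values()) {
            for (int doc : members) {
                memberships[doc]++;
            }
        }
        int overlapping = 0;
        for (int m : memberships) {
            if (m > 1) {
                overlapping++;
            }
        }
        return overlapping;
    }
}
```

In this toy setup, adding the label "data mining" next to "mining" leaves the "mining" cluster exactly as large as before and only adds documents that now sit in both clusters. Specificity changes only when `keepSpecific` discards the short or broadly matching labels.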

Thanks,

Stanislaw

(In reply to the message from cmg of Fri, Jul 11, 2014, 2:13 AM.)