Snippet Assignment Threshold in Cluster Content Discovery

seyfullahd
Hi,

I have been studying the Carrot2 framework and its Java implementation of the Lingo clustering algorithm for a while. I don't yet fully understand the flow of the cluster content discovery phase (the methods public void assign(PreprocessingContext context) in DocumentAssigner and void assignDocuments(LingoProcessingContext context) in ClusterBuilder), but I couldn't find a snippet assignment threshold there. Has the implementation of the cluster content discovery phase changed? And if there is no longer a snippet assignment threshold, has the similarity measurement between the column vectors of the term-cluster label matrix and those of the original term-document matrix also been dropped, with a new algorithm now used to assign documents to clusters?

Thanks,

Seyfullah

Re: Snippet Assignment Threshold in Cluster Content Discovery

Stanislaw Osinski-3
Hi Seyfullah,

> I have been studying Carrot2 framework and Lingo Clustering Algorithm
> implementation (in Java) for a while. For the cluster content discovery
> phase (public void assign(PreprocessingContext context) : DocumentAssigner
> and void assignDocuments(LingoProcessingContext context) : ClusterBuilder
> methods) I can't understand the business flow completely yet, but I couldn't
> see a snippet assignment threshold. And wanted to ask, if the implementation
> of the cluster content discovery phase changed.

Yes, the implementation changed. Currently, a document is assigned to a cluster when it contains all the words of the cluster label (except stop words). While this gives less control over the number of documents in each cluster, it generally produces more precise assignments than the earlier vector space model (VSM) approach.
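The rule described above can be sketched roughly as follows. This is only an illustration, not the actual Carrot2 DocumentAssigner code: the class name LabelContainmentSketch, the tiny stop-word list and the simple tokenization are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration of label-containment assignment; not Carrot2 code. */
final class LabelContainmentSketch {
    // Assumed, deliberately tiny stop-word list used only for this example.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "the"));

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    /** Returns true when the document contains every non-stop word of the label. */
    static boolean isAssigned(String label, String document) {
        Set<String> required = tokens(label);
        required.removeAll(STOP_WORDS);
        // A label made only of stop words would match every document, so reject it.
        return !required.isEmpty() && tokens(document).containsAll(required);
    }
}
```

For instance, with these assumptions the label "Clustering of Search Results" would be assigned to a snippet reading "Search results clustering with Lingo", but not to one that mentions only "search".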

 
> And also wondered if there
> is no snippet assignment threshold yet, also is measuring the similarity
> between term-cluster label matrix's column vectors and original
> term-document matrix's column vectors gone and a new algorithm is employed
> to assign documents to each clusters.

That is correct. Currently the similarity function is really binary: either the document contains all words of the label or not.
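To make the contrast concrete, here is a minimal sketch of the two kinds of similarity function over a shared term index. It assumes the earlier approach compared column vectors by cosine similarity against a threshold; the class name, method names and vector layout are hypothetical and do not come from the Carrot2 code base.

```java
/** Hypothetical comparison of thresholded and binary assignment; not Carrot2 code. */
final class AssignmentSimilaritySketch {
    /** Cosine similarity of two equal-length term vectors; 0 if either is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Threshold-style rule: assign when the similarity reaches the given threshold. */
    static boolean assignByThreshold(double[] labelVector, double[] documentVector,
                                     double threshold) {
        return cosine(labelVector, documentVector) >= threshold;
    }

    /** Binary rule: 1.0 if the document has every term of the label, otherwise 0.0. */
    static double binarySimilarity(boolean[] labelTerms, boolean[] documentTerms) {
        for (int i = 0; i < labelTerms.length; i++) {
            if (labelTerms[i] && !documentTerms[i]) {
                return 0.0;
            }
        }
        return 1.0;
    }
}
```

With the threshold-style rule, a document sharing one of two label terms can still be assigned if the threshold is low enough; with the binary rule there is nothing to tune, and such a document is simply not assigned.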

S.

_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers