Order docs

Order docs

cccefet
When I iterate over the documents that make up a cluster, what criterion is used to order them inside the cluster?

Re: Order docs

Stanislaw Osinski
Administrator

When I iterate over the documents that make up a cluster, what criterion is
used to order them inside the cluster?

Currently, the documents within a cluster keep the order in which they were fed to the algorithm on input. In other words, the clustering algorithms do not provide scores for document-cluster assignments. The reason is that Carrot2 uses very simple document-cluster assignment criteria: a document must contain all the words of the cluster's label (this applies to STC and Lingo, but not to k-means clustering).
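To make the criterion concrete, here is a minimal Python sketch of "contains all the words of the label" assignment. It is only an illustration, not Carrot2's actual code: the `tokenize` and `assign_documents` helpers and the sample documents are hypothetical, and real tokenization, stemming and stop-word handling are ignored.

```python
def tokenize(text):
    """Lowercase and split on whitespace, trimming surrounding punctuation.

    A deliberately naive, hypothetical tokenizer used only for this sketch.
    """
    return [w.strip(".,;:!?()\"'") for w in text.lower().split()]


def assign_documents(label, documents):
    """Return the documents that contain every word of the label.

    The list comprehension walks the input in order, so the result keeps
    the order in which the documents were supplied; no score is computed.
    """
    label_words = set(tokenize(label))
    return [doc for doc in documents if label_words <= set(tokenize(doc))]


documents = [
    "Data mining techniques for large text collections",
    "Mining equipment and heavy machinery",
    "Text mining: extracting structured data from text",
    "Open data portals",
]

cluster = assign_documents("data mining", documents)
for doc in cluster:
    print(doc)
```

Because membership is a yes/no test rather than a score, the only ordering available inside the cluster is the input order.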

Cheers,

Staszek

------------------------------------------------------------------------------
All of the data generated in your IT infrastructure is seriously valuable.
Why? It contains a definitive record of application performance, security
threats, fraudulent activity, and more. Splunk takes this data and makes
sense of it. IT sense. And common sense.
http://p.sf.net/sfu/splunk-d2d-c2
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

[Solved]: Order docs

cccefet
Thanks for the answer. I have another question: can I reduce the number of documents inside one cluster? When I cluster documents, I get clusters with 600-6000 docs, and I would like clusters with an upper bound of, for instance, 25 docs.

Re: [Solved]: Order docs

Stanislaw Osinski
Administrator

Thanks for the answer. I have another question: can I reduce the number of
documents inside one cluster? When I cluster documents, I get clusters with
600-6000 docs, and I would like clusters with an upper bound of, for instance, 25 docs.

There's no direct way to do this for the time being, I'm afraid. Two things you can try are:

1. Set http://download.carrot2.org/head/manual/#section.attribute.DocumentAssigner.exactPhraseAssignment to true -- this will make the clusters with multiword labels smaller (though the one-word labelled clusters will stay the same).

2. Increase http://download.carrot2.org/head/manual/#section.attribute.LingoClusteringAlgorithm.desiredClusterCountBase -- this will create more clusters, and most of the new clusters should be smaller. Note that increasing this attribute will increase the processing time too.
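The effect of the first option can be sketched in a few lines of Python. This is an illustration under assumptions, not Carrot2's implementation: it assumes that exact phrase assignment means the label's words must appear contiguously and in the same order, and the helper names and sample documents are hypothetical.

```python
def tokenize(text):
    """Naive, hypothetical tokenizer: lowercase, split, trim punctuation."""
    return [w.strip(".,;:!?()\"'") for w in text.lower().split()]


def contains_all_words(label, doc):
    """Loose criterion: every label word occurs somewhere in the document."""
    return set(tokenize(label)) <= set(tokenize(doc))


def contains_exact_phrase(label, doc):
    """Strict criterion (assumed): label words occur contiguously, in order."""
    label_words = tokenize(label)
    doc_words = tokenize(doc)
    n = len(label_words)
    return any(
        doc_words[i:i + n] == label_words
        for i in range(len(doc_words) - n + 1)
    )


documents = [
    "Data mining techniques for large text collections",
    "Mining equipment and heavy machinery",
    "Text mining: extracting structured data from text",
    "Open data portals",
]

# Multiword label: the exact-phrase cluster is a subset of the loose one.
loose = [d for d in documents if contains_all_words("data mining", d)]
exact = [d for d in documents if contains_exact_phrase("data mining", d)]

# One-word label: both criteria select the same documents.
loose_single = [d for d in documents if contains_all_words("mining", d)]
exact_single = [d for d in documents if contains_exact_phrase("mining", d)]

print(len(loose), len(exact), len(loose_single), len(exact_single))
```

In this toy data the multiword label loses the document where "data" and "mining" appear apart, while the one-word label selects the same documents either way, which matches the behaviour described in option 1.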

Cheers,

Staszek
