Incremental clustering

Incremental clustering

Bogdan94202
Hi Carrot2 team,

I would like to ask whether, and how, the following scenario can be implemented with the Carrot2 framework:
1) cluster an initial set of documents (e.g. text files) and store the list of clusters
2) add a new document to the initial set and then run the clustering over the newly added file

Would it be possible to get a relevance ratio (distance) of the newly added file to each of the clusters?
Another question: is it possible to extract the keywords that are most relevant for a particular cluster? Is such data available, and which API should I use to get it?

Would there be an example which is already close to what I am trying to achieve?
I walked through some of the examples but I am not that acquainted with all the APIs.
 
Thanks in advance!

Best regards,
Bogdan

Re: Incremental clustering

Dawid Weiss-2
> 2) add a new document to the initial set and then run the clustering over
> the newly added file

Carrot2 does not have such a feature. This is called incremental clustering.

> Would it be possible to get a relevance ratio (distance) of the newly
> added file to each of the clusters?

No.

> Another question: is it possible to extract the keywords that are most
> relevant for a particular cluster? Is such data available, and
> which API should I use to get it?

No. The cluster label contains the unique feature binding the cluster's
documents. This is specific to Carrot2 algorithms -- clusters will
usually be smaller, but their content is clearly related to
the label.

D.
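
For readers who still need a per-cluster score for a new document, here is a minimal sketch of one workaround done outside Carrot2: represent each stored cluster by the average term-frequency vector of its documents, then compare a new document to every centroid with cosine similarity. This is not a Carrot2 feature or API. All function names, the cluster labels, and the sample texts are hypothetical, the tokenizer is deliberately naive, and no stop-word filtering is applied.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tf_vector(text):
    """Raw term-frequency vector of a single document."""
    return Counter(tokenize(text))


def centroid(docs):
    """Average term-frequency vector over all documents in one cluster."""
    total = Counter()
    for doc in docs:
        total.update(tf_vector(doc))
    return {term: count / len(docs) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_terms(cluster_centroid, k=3):
    """The k heaviest terms of a centroid; ties are broken alphabetically."""
    ranked = sorted(cluster_centroid.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical clusters saved from an earlier clustering run.
clusters = {
    "data mining": ["data mining finds patterns in data", "mining large data sets"],
    "web search": ["web search engines rank pages", "search results on the web"],
}
centroids = {label: centroid(docs) for label, docs in clusters.items()}

# Score a newly added document against every stored cluster.
new_doc = "patterns in large data"
scores = {label: cosine(tf_vector(new_doc), c) for label, c in centroids.items()}
best_label = max(scores, key=scores.get)
```

The same centroids give a rough answer to the keyword question: `top_terms(centroids[label])` lists the terms that weigh most in a cluster. Adding stop-word removal or TF-IDF weighting would likely make both the scores and the term lists more useful.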

------------------------------------------------------------------------------
Return on Information:
Google Enterprise Search pays you back
Get the facts.
http://p.sf.net/sfu/google-dev2dev
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Re: Incremental clustering

Bogdan94202
OK, how is it then possible to write another algorithm and plug it into the framework? For example, I would like to leverage the workbench or the web server.
I actually want a topic extraction algorithm that can give me more details about the extracted topic: the most relevant words for a topic and then for each document. In other words, I would like to know why the machine decided to classify a document under one topic/cluster or another.

Re: Incremental clustering

Dawid Weiss-2
You are encouraged to write your own algorithms, of course. The
existing source code is the best manual -- start by looking at
existing algorithms and modifying them.

Dawid

On Sun, Dec 13, 2009 at 2:09 PM, Bogdan94202 <[hidden email]> wrote:

>
> OK, how is it then possible to write another algorithm and plug it into the
> framework? For example, I would like to leverage the workbench or the web server.
> I actually want a topic extraction algorithm that can give me more
> details about the extracted topic: the most relevant words for a topic and
> then for each document. In other words, I would like to know why the machine
> decided to classify a document under one topic/cluster or another.
> --
> View this message in context: http://n2.nabble.com/Incremental-clustering-tp4158120p4159528.html
> Sent from the Carrot2 Users and Developers Forum mailing list archive at Nabble.com.

Re: Incremental clustering

Stanislaw Osinski
Administrator

> You are encouraged to write your own algorithms, of course. The
> existing source code is the best manual -- start by looking at
> existing algorithms and modifying them.



Re: Incremental clustering

Bogdan94202
>The easiest algorithm to start with is probably the
>ByFieldClusteringAlgorithm:

I haven't given up on the Lingo algorithm yet (I am not sure I have explored or understood all its capabilities).
I am still having trouble with clustering on PPT file content, though.
It seems that Lingo clusters based on the "title" field when the Solr document source is used.
Does the Lingo algorithm always operate on titles?
How can I tell the algorithm to use the content of the documents?
How can I do that for the Solr document source plug-in?

BR,
Bogdan

Re: Incremental clustering

Stanislaw Osinski
Administrator

> How can I do that for the Solr document source plug-in?

Again, the manual has the answer: http://wiki.apache.org/solr/ClusteringComponent (linked from: http://download.carrot2.org/head/manual/#section.advanced-topics.integration-with-solr)

In Solr's configuration you can set the Solr fields that will be passed to Carrot2 as the title, content and, optionally, url fields.
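
As a rough illustration of that field mapping, the sketch below builds a Solr request URL that names the fields to use as title, content and url. The parameter names (`clustering`, `clustering.results`, `carrot.title`, `carrot.snippet`, `carrot.url`) are the ones I recall from the ClusteringComponent wiki page, so please verify them against your Solr version. Whether they can be passed per request rather than set in the configuration file is also an assumption here. The host, handler path and field names are hypothetical.

```python
from urllib.parse import urlencode


def clustering_url(base_url, query, title_field, content_field, url_field=None):
    """Build a query URL that maps Solr fields to Carrot2 document fields."""
    params = {
        "q": query,
        "clustering": "true",             # enable the clustering component
        "clustering.results": "true",     # cluster the search results
        "carrot.title": title_field,      # Solr field used as the document title
        "carrot.snippet": content_field,  # Solr field used as the document content
    }
    if url_field is not None:
        params["carrot.url"] = url_field  # optional url field
    return base_url + "?" + urlencode(params)


# Hypothetical host, handler path and field names.
url = clustering_url(
    "http://localhost:8983/solr/clustering",
    "*:*",
    title_field="title",
    content_field="content",
    url_field="url",
)
```

Pointing the content parameter at the field that holds the extracted file text is what should make Lingo see more than titles.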

Cheers,

Staszek

--
Stanislaw Osinski, [hidden email]
http://www.carrot-search.com


Re: Incremental clustering

Bogdan94202
Thanks Staszek!
This helped a lot.
I have made some progress as a Carrot2 user and have some other questions now, but I will open a new thread for them.
Best regards,
Bogdan