Document content Clustering using Carrot2

Gaurang Patel
All-

I am looking into using the Carrot2 algorithms (Lingo, STC, etc.) to cluster the contents of a single document. The aim is to get clustered and summarized content from an HTML document.

The approach I am following is to divide the document into a list of sub-documents and then pass this list to the SimpleController::process() API. I am trying to write an API similar to carrot2-source-microsoft/carrot2-source-google as far as its input and output are concerned.

Concerns:
1) Any ideas or feedback on the approach? Has any similar development been done? Any kind of help is welcome.
2) Once they already have the search results in the form of a document list, do the Carrot2 clustering algorithms (please try to include details of all three algorithms) take the search query into consideration while actually making clusters?

Let me know of any ideas, or of any work done before.

Thanks & Regards,
Gaurang

Re: Document content Clustering using Carrot2

Gaurang Patel
One more concern:

Just to confirm: do all three clustering algorithms use the actual contents of the documents to assign them to clusters, rather than just the document snippets?

Regards,
Gaurang


Re: Document content Clustering using Carrot2

Dawid Weiss-2
In reply to this post by Gaurang Patel
> The approach I am following is to divide the document into a list of
> sub-documents and then pass this list to the SimpleController::process()
> API. I am trying to write an API similar to
> carrot2-source-microsoft/carrot2-source-google as far as its input and
> output are concerned.

You can also use the simple input (provide paragraphs or slices of
your document as a list of Document instances). Look at the examples
-- there is one that shows how to do this. Then you won't need to
write your own inputs (which requires some skill).
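
A minimal sketch of the splitting step suggested above, assuming the input is simple HTML and that one paragraph per sub-document is acceptable. `PassageSplitter` and `Passage` are hypothetical names, not part of Carrot2, and the tag handling is a crude regex rather than a real HTML parser. Each resulting passage would still have to be mapped onto a Carrot2 Document instance, for example with the paragraph text in the summary field.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Carrot2): splits HTML into paragraph-sized passages. */
class PassageSplitter {

    /** One sub-document: a short title plus the full paragraph text as its summary. */
    static final class Passage {
        final String title;
        final String summary;

        Passage(String title, String summary) {
            this.title = title;
            this.summary = summary;
        }
    }

    /** Crude HTML-to-text step: block-level tags become paragraph breaks, other tags are dropped. */
    static String stripTags(String html) {
        return html
                .replaceAll("(?i)</?(p|div|br|h[1-6]|li)\\b[^>]*>", "\n\n")
                .replaceAll("<[^>]+>", " ");
    }

    /** Splits on blank lines; each non-empty block becomes one passage. */
    static List<Passage> split(String html) {
        List<Passage> passages = new ArrayList<>();
        for (String block : stripTags(html).split("\\n\\s*\\n")) {
            String text = block.replaceAll("\\s+", " ").trim();
            if (text.isEmpty()) {
                continue;
            }
            // Use the first eight words as a stand-in title (an arbitrary choice).
            String[] words = text.split(" ");
            StringBuilder title = new StringBuilder();
            for (int i = 0; i < Math.min(8, words.length); i++) {
                if (i > 0) {
                    title.append(' ');
                }
                title.append(words[i]);
            }
            passages.add(new Passage(title.toString(), text));
        }
        return passages;
    }

    public static void main(String[] args) {
        String html = "<h1>Clustering</h1><p>Lingo builds labels first.</p><p>STC uses suffix trees.</p>";
        for (Passage p : split(html)) {
            System.out.println(p.title + " | " + p.summary);
        }
    }
}
```

Very short passages (such as a lone heading) carry little text to cluster on, so in practice it may help to merge them into the following paragraph before clustering.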

> 1) Any ideas or feedback on the approach? Has any similar development been done?

People have used Carrot2 for this, but I don't know what the results were.
Our algorithms work best for search results because these are usually
already focused on a single topic and only its "context" needs to be
clustered. For totally unrelated paragraphs, the results may be weird.
Hard to say.

> 2) Once they already have the search results in the form of a document
> list, do the Carrot2 clustering algorithms (please try to include details
> of all three algorithms) take the search query into consideration while
> actually making clusters?

Yes, terms from the query are penalized to avoid the creation of trivial
clusters. Both STC and Lingo do this.
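
To illustrate the idea of penalizing query terms, here is a rough sketch. It is not the actual weighting used by STC or Lingo; the class name, the overlap measure, and the `penalty` parameter are all assumptions made for the example. A candidate cluster label loses score in proportion to how many of its words come straight from the query.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: down-weights candidate labels that repeat the query. */
class QueryPenalty {

    /** Fraction of the label's words that also occur in the query (0.0 to 1.0). */
    static double queryOverlap(String label, String query) {
        Set<String> queryTerms = new HashSet<>(
                Arrays.asList(query.toLowerCase(Locale.ROOT).trim().split("\\s+")));
        String[] words = label.toLowerCase(Locale.ROOT).trim().split("\\s+");
        int hits = 0;
        for (String word : words) {
            if (queryTerms.contains(word)) {
                hits++;
            }
        }
        return (double) hits / words.length;
    }

    /** Scales a raw label score down by penalty * overlap; penalty is assumed to lie in [0, 1]. */
    static double penalizedScore(double rawScore, String label, String query, double penalty) {
        return rawScore * (1.0 - penalty * queryOverlap(label, query));
    }

    public static void main(String[] args) {
        String query = "data mining";
        System.out.println(penalizedScore(10.0, "data mining", query, 1.0));
        System.out.println(penalizedScore(10.0, "text clustering", query, 1.0));
    }
}
```

With a penalty of 1.0, a label that merely repeats the query scores zero, so it cannot form a trivial cluster containing nearly every document, while labels unrelated to the query keep their full score.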

> Just to confirm: do all three clustering algorithms use the actual contents
> of the documents to assign them to clusters, rather than just the document
> snippets?

All algorithms use whatever is passed in the fields of the Document
object (by default these fields are title, url and summary). So for
search results only snippets and titles are actually parsed, not the
full content.

Dawid

------------------------------------------------------------------------------
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day
trial. Simplify your report design, integration and deployment - and focus on
what you do best, core application coding. Discover what's new with
Crystal Reports now.  http://p.sf.net/sfu/bobj-july
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Re: Document content Clustering using Carrot2

Gaurang Patel
Hey Dawid,

Thanks for the quick reply.

I searched through the example code but was not able to find the example that clusters paragraphs. Is it ClusteringDataFromPubMed.java?

Please confirm.


Thanks,
Gaurang



Re: Document content Clustering using Carrot2

Dawid Weiss-2
> I searched through the example code but was not able to find the example
> that clusters paragraphs. Is it ClusteringDataFromPubMed.java?

There are no ready-to-use solutions. You need to split your documents
into paragraphs and then model your code after ClusteringDocumentList:

http://fisheye3.atlassian.com/browse/carrot2/trunk/applications/carrot2-examples/src/org/carrot2/examples/clustering/ClusteringDocumentList.java?r=3482

Dawid


Re: Document content Clustering using Carrot2

Gaurang Patel
In reply to this post by Dawid Weiss-2
hey,


>> Just to confirm: do all three clustering algorithms use the actual contents
>> of the documents to assign them to clusters, rather than just the document
>> snippets?
>
> All algorithms use whatever is passed in the fields of the Document
> object (by default these fields are title, url and summary). So for
> search results only snippets and titles are actually parsed, not the
> full content.

Then what is the case with ByUrlClusteringAlgorithm? I observed that this algorithm fails when document URLs are not set, even if snippets and titles are set.

What does it consider and what does it ignore? Please let me know.

Regards,
Gaurang


Re: Document content Clustering using Carrot2

Dawid Weiss-2
> Then what is the case with ByUrlClusteringAlgorithm? I observed that this
> algorithm fails when document URLs are not set, even if snippets and titles
> are set.

This is not a clustering algorithm, but a partitioning algorithm (it
partitions input documents by URLs and their segments).
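
The difference can be sketched as follows. This is not the ByUrlClusteringAlgorithm implementation; `UrlPartitioner` is a hypothetical class that groups documents by host name only, which is a simplification of partitioning by URL segments. It shows why titles and snippets play no role in this kind of grouping, and why documents without URLs give it nothing to work with.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: groups document indices by the host part of their URL. */
class UrlPartitioner {

    static final String NO_URL = "(no url)";

    /** Returns the lower-cased host without a leading "www.", or NO_URL if none can be extracted. */
    static String hostOf(String url) {
        if (url == null || url.trim().isEmpty()) {
            return NO_URL;
        }
        try {
            String host = URI.create(url.trim()).getHost();
            if (host == null) {
                return NO_URL;
            }
            host = host.toLowerCase(Locale.ROOT);
            return host.startsWith("www.") ? host.substring(4) : host;
        } catch (IllegalArgumentException e) {
            // Not a syntactically valid URI.
            return NO_URL;
        }
    }

    /** Maps each host to the indices of the documents whose URL has that host. */
    static Map<String, List<Integer>> partition(List<String> urls) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (int i = 0; i < urls.size(); i++) {
            groups.computeIfAbsent(hostOf(urls.get(i)), k -> new ArrayList<>()).add(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://www.carrot2.org/a", "http://carrot2.org/b", null);
        System.out.println(partition(urls));
    }
}
```

When no document has a URL, every document lands in the single fallback group, so there is nothing meaningful to partition regardless of what the titles and snippets contain.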

Dawid
