Help - Suffix Tree Clustering Algorithm "OutOfMemoryError"

Help - Suffix Tree Clustering Algorithm "OutOfMemoryError"

Han4Me
Hi,

   I have a problem when using the Suffix Tree Clustering algorithm to cluster my corpus: I can only cluster a limited number of documents (about 450, each 10-20 lines long).

   If I add more documents, I get an "out of memory" error:

    --- Exception in thread "main" java.lang.OutOfMemoryError: Java heap space


   Is there any way to cluster a larger corpus?

   Thanks for your help.

Best Regards,

Re: Help - Suffix Tree Clustering Algorithm "OutOfMemoryError"

Stanislaw Osinski
Administrator
Hi,

To cluster larger numbers of documents, you need to increase the heap size of your JVM. This can be done by adding, e.g., -Xmx256m to the JVM command line.
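
For example, if you start your application from the command line (the jar name below is just a placeholder):

   java -Xmx256m -jar your-application.jar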

If you're using Carrot2 Document Clustering Workbench, see the manual for instructions on increasing heap size:

http://download.carrot2.org/head/manual/#section.troubleshooting.workbench.heap-size

Finally, Carrot2 is suited to clustering small to medium document collections; you may not be able to cluster thousands of documents:

http://project.carrot2.org/faq.html#scalability

Cheers,

S.



Re: Help - Suffix Tree Clustering Algorithm "OutOfMemoryError"

Han4Me
Hi,

   Thanks a lot, Dr. Stanislaw, for your help.
   Actually, I had already tried increasing the JVM heap size in Eclipse before posting, but I got the same result. (Sorry I didn't mention that before.)

   Do you have any idea how I could cluster a huge number of documents using the STC algorithm?
   Let's say two thousand documents, initially.
   
Regards,  



Re: Help - Suffix Tree Clustering Algorithm "OutOfMemoryError"

Dawid Weiss-2

There should be no problem clustering two thousand documents. Can you provide an XML file in the Carrot2 format that contains those documents?

Dawid


Re: Help - Suffix Tree Clustering Algorithm "OutOfMemoryError"

Han4Me
Hello Dr. Dawid,

   Sorry, I don't understand what you mean by "XML in the Carrot2 format that contains documents".
   But I am using the Reuters corpus "Reuters-21578".

   Note: I converted all the news stories in this corpus to separate ".txt" files, one news item per file. I then read these files, create a Document object for each one, fill it with the file's contents, and run STC clustering.
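
In rough outline, this is what my code does (a sketch only; I'm assuming the Carrot2 3.x Java API here, and "reuters-txt" is just a placeholder directory name):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.carrot2.clustering.stc.STCClusteringAlgorithm;
import org.carrot2.core.Controller;
import org.carrot2.core.ControllerFactory;
import org.carrot2.core.Document;
import org.carrot2.core.ProcessingResult;

public class ReutersStc {
    public static void main(String[] args) throws IOException {
        // Read each ".txt" file (one news story per file) into a Carrot2 Document.
        List<Document> documents = new ArrayList<Document>();
        try (DirectoryStream<Path> files =
                Files.newDirectoryStream(Paths.get("reuters-txt"), "*.txt")) {
            for (Path file : files) {
                String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                // The file name becomes the title, the file contents the body.
                documents.add(new Document(file.getFileName().toString(), text));
            }
        }

        // Cluster the whole collection with the Suffix Tree Clustering algorithm.
        Controller controller = ControllerFactory.createSimple();
        ProcessingResult result =
                controller.process(documents, "reuters", STCClusteringAlgorithm.class);
        System.out.println(result.getClusters().size() + " clusters created.");
    }
}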

Best Regards,



Re: Help - Suffix Tree Clustering Algorithm "OutOfMemoryError"

Dawid Weiss-2

Hi, apologies for the delay.

> Sorry, I don't understand what you mean by "XML in the Carrot2 format that contains documents".

There is a pseudo-standard XML schema that we use for feeding documents to, for example, the DCS server. It goes something like this:

<searchresult>
   <query>data mining</query>
   <document id="0">
     <title>...</title>
     <snippet>...</snippet>
     <url>...</url>
   </document>
   <document id="1">
   ...

and so on.
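
A complete, minimal, well-formed file in that format would look something like this (all field values below are placeholders):

<searchresult>
   <query>data mining</query>
   <document id="0">
     <title>First document title</title>
     <snippet>A short summary of the first document.</snippet>
     <url>http://example.com/doc-0</url>
   </document>
   <document id="1">
     <title>Second document title</title>
     <snippet>A short summary of the second document.</snippet>
     <url>http://example.com/doc-1</url>
   </document>
</searchresult>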

> But I am using the Reuters corpus "Reuters-21578".

Ok, I understand now. How big is the Reuters corpus after you converted it to TXT? I guess it may be too big to fit in memory for in-memory clustering. It's not the number of documents but their length that matters here. You could also try other software packages dedicated to full-text clustering, such as Cluto.

Dawid


Re: Help - Suffix Tree Clustering Algorithm "OutOfMemoryError"

eishay
In reply to this post by Stanislaw Osinski
Hi,
I'm running a modified version of STC that creates n-grams out of the tokens.
With 8k news articles, trimmed to 1k characters each and using (4..100)-grams, it consumes most of the 2g heap I allocated to it. With (3..4)-grams it takes less space, so you can keep more characters per article.
Either way, it takes a long time to process and should be run as an offline batch job.
So a large number of articles is doable, with some constraints.
Eishay

Re: Help - Suffix Tree Clustering Algorithm "OutOfMemoryError"

eishay
I wish to correct my previous note: the heap size was actually 512m. After increasing it, STC could handle more than 12k articles trimmed at 1k chars, using about 900m of heap.
As expected, it took much longer to complete. If you run it in the Workbench, expect the UI part to take a long time when Velocity does a template merge (processing-result.vm).

Re: Help - Suffix Tree Clustering Algorithm "OutOfMemoryError"

Dawid Weiss-2
In reply to this post by eishay

> I'm running a modified version of STC that creates n-grams out of the tokens.
> With 8k news articles [...] it consumes most of the 2g heap I allocated to it.
> I wish to correct my previous note: the heap size was actually 512m. After
> increasing it, STC could handle more than 12k articles trimmed at 1k chars,
> using about 900m of heap.

The current implementation of STC is fairly memory-intensive (lots of objects are
created to represent the suffix tree, although the algorithm itself is quite
fast). A way to really improve it would be to use a suffix array to construct
the base clusters. This is something I am considering, because one of my
students did a great job writing efficient suffix array construction
algorithms. We shall see what comes out of it.
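
For the curious, here is the idea as a naive construction sketch (illustration only; this is nothing like the efficient algorithms mentioned above, and not what Carrot2 currently does):

import java.util.Arrays;
import java.util.Comparator;

public class NaiveSuffixArray {
    // Returns the start offsets of all suffixes of 'text' in lexicographic
    // order. A suffix array carries the same information as a suffix tree,
    // but as a flat array of integers instead of a web of node objects,
    // which makes it far friendlier to the Java heap.
    static Integer[] build(final String text) {
        Integer[] sa = new Integer[text.length()];
        for (int i = 0; i < sa.length; i++) {
            sa[i] = i;
        }
        // Naive O(n^2 log n): sort suffix start offsets by comparing the
        // suffixes themselves.
        Arrays.sort(sa, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return text.substring(a).compareTo(text.substring(b));
            }
        });
        return sa;
    }

    public static void main(String[] args) {
        String text = "banana";
        for (int start : build(text)) {
            System.out.println(start + ": " + text.substring(start));
        }
    }
}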

> As expected, it took much longer to complete. If you run it in the
> Workbench, expect the UI part to take a long time when Velocity does a
> template merge (processing-result.vm).

Yep, this will be a pain in the Workbench, unfortunately. It really is tuned to
display search results rather than large document sets.

Dawid


Re: Help - Suffix Tree Clustering Algorithm "OutOfMemoryError"

Han4Me
In reply to this post by Dawid Weiss-2
Hi Dr. Dawid,

   Thank you very much for your help.

Best Regards,
