Improoving performance (stc)

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

Improoving performance (stc)

eishay
Hi,

Please let me know if there is a better place for such discussion.
I noticed that we can make the loop that does the create links in base clusters graph at STCEngine.createMergedClusters run in threads and by that make it much faster for large amount of documents.
Using it on 8k documents it decreased the time for the same code section from 67971ms to 20465ms (more then three times faster).

Here is an example code, please let me know if there is a better way to share it:
    private void createLinksInBaseClustersGraph(final float MERGE_THRESHOLD)
    {
      //I use a four cores machine.
      ThreadPoolExecutor tpe = new ThreadPoolExecutor(4, 4, 1000, TimeUnit.SECONDS, new ArrayBlockingQueue<Runnable>(baseClusters.size()));
      List<FutureTask<Object>> list = new ArrayList<FutureTask<Object>>();
      for (int i = 1; i < baseClusters.size(); i++)
      {
        final int index = i;
        FutureTask task = new FutureTask(new Callable(){

          public Object call() throws Exception
          {
            BaseCluster a = (BaseCluster) baseClusters.elementAt(index);
            final long a_docCount = a.getNode().getSuffixedDocumentsCount();
 
            for (int j = 0; j < index; j++)
            {
                BaseCluster b = (BaseCluster) baseClusters.elementAt(j);
 
                final double a_and_b_docCount = a.getNode()
                    .getInternalDocumentsRepresentation().numberOfSetBitsAfterAnd(
                        b.getNode().getInternalDocumentsRepresentation());
 
                // BUG: This check should be bidirectional (see Zamir's paper).
                if (((a_and_b_docCount / b.getNode().getSuffixedDocumentsCount()) > MERGE_THRESHOLD)
                    && ((a_and_b_docCount / a_docCount) > MERGE_THRESHOLD))
                {
                    // add links to base cluster graph. This is actually redundant as
                    // we're adding two
                    // directed edges.
                    a.addLink(b);
                    b.addLink(a);
                }
            }
            return null;
          }});
        tpe.execute(task);
        list.add(task);
      }
      for(FutureTask task: list)
      {
        try
        {
          task.get();
        }
        catch (Exception e)
        {
          e.printStackTrace();
        }
      }
    }
Reply | Threaded
Open this post in threaded view
|

Re: Improoving performance (stc)

Dawid Weiss-2

Hi Eishay,

This patch looks sensible -- I guess it will speed up things on multi-core
processors, but still.

The way to go is probably to add a JIRA issue with the patch to existing code. I
would change the following things in your patch:

- the number of worker threads could be taken from Runtime.availableProcessors(),

- rethrow a runtime exception instead of fall through handler,

- you should shut down the thread pool executor once done with processing and
exceptions.

Dawid

eishay wrote:

> Hi,
>
> Please let me know if there is a better place for such discussion.
> I noticed that we can make the loop that does the create links in base clusters graph at STCEngine.createMergedClusters run in threads and by that make it much faster for large amount of documents.
> Using it on 8k documents it decreased the time for the same code section from 67971ms to 20465ms (more then three times faster).
>
> Here is an example code, please let me know if there is a better way to share it:
>     private void createLinksInBaseClustersGraph(final float MERGE_THRESHOLD)
>     {
>       //I use a four cores machine.
>       ThreadPoolExecutor tpe = new ThreadPoolExecutor(4, 4, 1000, TimeUnit.SECONDS, new ArrayBlockingQueue<Runnable>(baseClusters.size()));
>       List<FutureTask> list = new ArrayList<FutureTask>();
>       for (int i = 1; i < baseClusters.size(); i++)
>       {
>         final int index = i;
>         FutureTask task = new FutureTask(new Callable(){
>
>           public Object call() throws Exception
>           {
>             BaseCluster a = (BaseCluster) baseClusters.elementAt(index);
>             final long a_docCount = a.getNode().getSuffixedDocumentsCount();
>  
>             for (int j = 0; j < index; j++)
>             {
>                 BaseCluster b = (BaseCluster) baseClusters.elementAt(j);
>  
>                 final double a_and_b_docCount = a.getNode()
>                     .getInternalDocumentsRepresentation().numberOfSetBitsAfterAnd(
>                         b.getNode().getInternalDocumentsRepresentation());
>  
>                 // BUG: This check should be bidirectional (see Zamir's paper).
>                 if (((a_and_b_docCount / b.getNode().getSuffixedDocumentsCount()) > MERGE_THRESHOLD)
>                     && ((a_and_b_docCount / a_docCount) > MERGE_THRESHOLD))
>                 {
>                     // add links to base cluster graph. This is actually redundant as
>                     // we're adding two
>                     // directed edges.
>                     a.addLink(b);
>                     b.addLink(a);
>                 }
>             }
>             return null;
>           }});
>         tpe.execute(task);
>         list.add(task);
>       }
>       for(FutureTask task: list)
>       {
>         try
>         {
>           task.get();
>         }
>         catch (Exception e)
>         {
>           e.printStackTrace();
>         }
>       }
>     }

------------------------------------------------------------------------------
Stay on top of everything new and different, both inside and
around Java (TM) technology - register by April 22, and save
$200 on the JavaOne (SM) conference, June 2-5, 2009, San Francisco.
300 plus technical and hands-on sessions. Register today.
Use priority code J9JMT32. http://p.sf.net/sfu/p
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers
Reply | Threaded
Open this post in threaded view
|

Re: Improoving performance (stc)

zhu hui
Hi, all

I am a new commer to carrot, and nowadays want to build a academic paper search engine with clustering results.

I just want to know that carrot seems to real time cluster web pages according to all the contents (accturally i am not quite sure about carrot's mechanisms). while i think i can create a vector of may be 40 words for each paper by batch processing, and while doing cluster, only use this generated 40 words. so when lucence search hits 400 papers, and i will use their 40 words to do clustering job.

does that make sense to increase the throughtput?

looking forward to feedbacks.

Eric. Syu

On Sat, Apr 18, 2009 at 7:15 PM, Dawid Weiss <[hidden email]> wrote:

Hi Eishay,

This patch looks sensible -- I guess it will speed up things on multi-core
processors, but still.

The way to go is probably to add a JIRA issue with the patch to existing code. I
would change the following things in your patch:

- the number of worker threads could be taken from Runtime.availableProcessors(),

- rethrow a runtime exception instead of fall through handler,

- you should shut down the thread pool executor once done with processing and
exceptions.

Dawid

eishay wrote:
> Hi,
>
> Please let me know if there is a better place for such discussion.
> I noticed that we can make the loop that does the create links in base clusters graph at STCEngine.createMergedClusters run in threads and by that make it much faster for large amount of documents.
> Using it on 8k documents it decreased the time for the same code section from 67971ms to 20465ms (more then three times faster).
>
> Here is an example code, please let me know if there is a better way to share it:
>     private void createLinksInBaseClustersGraph(final float MERGE_THRESHOLD)
>     {
>       //I use a four cores machine.
>       ThreadPoolExecutor tpe = new ThreadPoolExecutor(4, 4, 1000, TimeUnit.SECONDS, new ArrayBlockingQueue<Runnable>(baseClusters.size()));
>       List<FutureTask> list = new ArrayList<FutureTask>();
>       for (int i = 1; i < baseClusters.size(); i++)
>       {
>         final int index = i;
>         FutureTask task = new FutureTask(new Callable(){
>
>           public Object call() throws Exception
>           {
>             BaseCluster a = (BaseCluster) baseClusters.elementAt(index);
>             final long a_docCount = a.getNode().getSuffixedDocumentsCount();
>
>             for (int j = 0; j < index; j++)
>             {
>                 BaseCluster b = (BaseCluster) baseClusters.elementAt(j);
>
>                 final double a_and_b_docCount = a.getNode()
>                     .getInternalDocumentsRepresentation().numberOfSetBitsAfterAnd(
>                         b.getNode().getInternalDocumentsRepresentation());
>
>                 // BUG: This check should be bidirectional (see Zamir's paper).
>                 if (((a_and_b_docCount / b.getNode().getSuffixedDocumentsCount()) > MERGE_THRESHOLD)
>                     && ((a_and_b_docCount / a_docCount) > MERGE_THRESHOLD))
>                 {
>                     // add links to base cluster graph. This is actually redundant as
>                     // we're adding two
>                     // directed edges.
>                     a.addLink(b);
>                     b.addLink(a);
>                 }
>             }
>             return null;
>           }});
>         tpe.execute(task);
>         list.add(task);
>       }
>       for(FutureTask task: list)
>       {
>         try
>         {
>           task.get();
>         }
>         catch (Exception e)
>         {
>           e.printStackTrace();
>         }
>       }
>     }

------------------------------------------------------------------------------
Stay on top of everything new and different, both inside and
around Java (TM) technology - register by April 22, and save
$200 on the JavaOne (SM) conference, June 2-5, 2009, San Francisco.
300 plus technical and hands-on sessions. Register today.
Use priority code J9JMT32. http://p.sf.net/sfu/p
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers



--
Nothing Impossible

------------------------------------------------------------------------------
Stay on top of everything new and different, both inside and
around Java (TM) technology - register by April 22, and save
$200 on the JavaOne (SM) conference, June 2-5, 2009, San Francisco.
300 plus technical and hands-on sessions. Register today.
Use priority code J9JMT32. http://p.sf.net/sfu/p
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers
Reply | Threaded
Open this post in threaded view
|

Re: Improoving performance (stc)

eishay
In reply to this post by Dawid Weiss-2
Absolutely Dawid, the snippet wasn't meant to be a patch :-)

I actually thought about some generalization of the pool. There might be more places in the code where we could use concurrency, could we have the same pool be used in other places and not create it per transaction? Maybe have it created and destroyed as part of the bundle lifescycle?

In addition, there might be a threshold where there is no sense in creating new runnables (it might even be slower). I'll experiment with that.

As a side note, the best way to do this kind of concurrency might be using the Java Fork Join libraries. Unfortunately they can run only on java6+ and many machines are still using java5.  
Reply | Threaded
Open this post in threaded view
|

Re: Improoving performance (stc)

eishay
In reply to this post by zhu hui
Hi Eric,

I totally makes sense. I am trying to use Carrot in a bit different way, i.e. not for real time but more of a batch process (while still enabling the usage pattern you described). I am doing experiments with large amounts of semi disconnected articles and it seems like the number of words to be tokenized does make a difference in quality up to a certain point. This point depends a lot on the type of the algorithm and article content (academic paper, news article, static wiki page, etc.) used.

Eishay
Reply | Threaded
Open this post in threaded view
|

Re: Improoving performance (stc)

Dawid Weiss-2
In reply to this post by eishay


> I actually thought about some generalization of the pool. There might be more
> places in the code where we could use concurrency, could we have the same
> pool be used in other places and not create it per transaction? Maybe have it
> created and destroyed as part of the bundle lifescycle?

I do get your point. We generally consider a clustering process to be
single-threaded (because in search results clustering scenario processing cores
are saturated by HTTP requests anyway). Multi-threading is for running multiple
clustering processes concurrently and this can be done safely via
CachingController, for example.

I guess what I'm trying to say is that I don't have any good hints as to where
such a generalization could be placed in the existing codebase. Since you're
working with larger data sets it's your call to suggest something -- if anybody
doesn't like it, you can count on our assertiveness :)

> In addition, there might be a threshold where there is no sense in creating
> new runnables (it might even be slower). I'll experiment with that.

This is correct.

> As a side note, the best way to do this kind of concurrency might be using
> the Java Fork Join libraries. Unfortunately they can run only on java6+ and
> many machines are still using java5.

Can you give me a pointer to this? Sounds interesting. Java6 is really a huge
boost in performance. We will try to keep backward compatibility with 1.5, but I
would tell everyone to make the switch to 1.6 asap.

Dawid


------------------------------------------------------------------------------
Stay on top of everything new and different, both inside and
around Java (TM) technology - register by April 22, and save
$200 on the JavaOne (SM) conference, June 2-5, 2009, San Francisco.
300 plus technical and hands-on sessions. Register today.
Use priority code J9JMT32. http://p.sf.net/sfu/p
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers
Reply | Threaded
Open this post in threaded view
|

Re: Improving performance (stc)

eishay
> I do get your point. We generally consider a clustering process to be
> single-threaded (because in search results clustering scenario processing cores
> are saturated by HTTP requests anyway). Multi-threading is for running multiple
> clustering processes concurrently and this can be done safely via
> CachingController, for example.
I'll check the use of CachingController for throughput, but right now I'm more concerned about the latency. Though I know I cannot make it fast enough for online processing with the data profile I'm going to use, I would like to have it complete asap.

> I guess what I'm trying to say is that I don't have any good hints as to where
> such a generalization could be placed in the existing codebase. Since you're
> working with larger data sets it's your call to suggest something -- if anybody
> doesn't like it, you can count on our assertiveness :)
I'm examining the application with a profiler (YourKit) and trying find parallelize only in places it is possible and really takes a significant amount of cpu time. I'll keep you updated :-)

>> Can you give me a pointer to this? Sounds interesting. Java6 is really a huge
>> boost in performance. We will try to keep backward compatibility with 1.5, but I
>> would tell everyone to make the switch to 1.6 asap.
FJ is really cool:
http://www.infoq.com/news/2007/07/concurrency-java-se-7
http://gee.cs.oswego.edu/dl/papers/fj.pdf
Unfortunately I'll be stuck in the next few months with java5 on my development environment (waiting for Apple to fix a bug which is a blocker for me). But as soon I'll be able to use java6 I'll convert whatever threadpool code I'll do to FJ.

<Scala fanboy mumbling to himself> Another option is to use Scala's Actors which is also based on FJ and runs on java5 :-) <end of mumbling>

Eishay
Reply | Threaded
Open this post in threaded view
|

Re: Improoving performance (stc)

zhu hui
In reply to this post by eishay
Hi, Eishay

Thanks very much for your kindly reply.

I have just readed the doctor thesis of David Weiss.

I noticed that doctor's Descriptive K-means algorithm seems to meet my needs very well, I guess there is no open source library available now , and not implemented in Carrot, so may be i will try to implement it in my system.

and I really want to get some advice from doctor David Weiss. ^_^

Best Wishes!

Eric Syu.

On Sat, Apr 18, 2009 at 11:42 PM, eishay <[hidden email]> wrote:

Hi Eric,

I totally makes sense. I am trying to use Carrot in a bit different way, i.e. not for real time but more of a batch process (while still enabling the usage pattern you described). I am doing experiments with large amounts of semi disconnected articles and it seems like the number of words to be tokenized does make a difference in quality up to a certain point. This point depends a lot on the type of the algorithm and article content (academic paper, news article, static wiki page, etc.) used.

Eishay
--
View this message in context: http://n2.nabble.com/Improoving-performance-%28stc%29-tp2653763p2655974.html
Sent from the Carrot2 Users and Developers Forum mailing list archive at Nabble.com.


------------------------------------------------------------------------------
Stay on top of everything new and different, both inside and
around Java (TM) technology - register by April 22, and save
$200 on the JavaOne (SM) conference, June 2-5, 2009, San Francisco.
300 plus technical and hands-on sessions. Register today.
Use priority code J9JMT32. http://p.sf.net/sfu/p
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers



--
Nothing Impossible

------------------------------------------------------------------------------
Stay on top of everything new and different, both inside and
around Java (TM) technology - register by April 22, and save
$200 on the JavaOne (SM) conference, June 2-5, 2009, San Francisco.
300 plus technical and hands-on sessions. Register today.
Use priority code J9JMT32. http://p.sf.net/sfu/p
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers
Reply | Threaded
Open this post in threaded view
|

Re: Improving performance (stc)

Dawid Weiss-2
In reply to this post by eishay

Interesting reading, thanks. I don't think switching to FJ can improve things a
lot in our case because the number of fork/switches we have is not that large
and if you use a local (or better -- reusable) thread pool, there will be no
loss on creating new threads and synchronization costs should be minimal.

That said, it's something worth trying, of course.

Dawid

eishay wrote:

>> I do get your point. We generally consider a clustering process to be
>> single-threaded (because in search results clustering scenario processing cores
>> are saturated by HTTP requests anyway). Multi-threading is for running multiple
>> clustering processes concurrently and this can be done safely via
>> CachingController, for example.
> I'll check the use of CachingController for throughput, but right now I'm more concerned about the latency. Though I know I cannot make it fast enough for online processing with the data profile I'm going to use, I would like to have it complete asap.
>
>> I guess what I'm trying to say is that I don't have any good hints as to where
>> such a generalization could be placed in the existing codebase. Since you're
>> working with larger data sets it's your call to suggest something -- if anybody
>> doesn't like it, you can count on our assertiveness :)
> I'm examining the application with a profiler (YourKit) and trying find parallelize only in places it is possible and really takes a significant amount of cpu time. I'll keep you updated :-)
>
>>> Can you give me a pointer to this? Sounds interesting. Java6 is really a huge
>>> boost in performance. We will try to keep backward compatibility with 1.5, but I
>>> would tell everyone to make the switch to 1.6 asap.
> FJ is really cool:
> http://www.infoq.com/news/2007/07/concurrency-java-se-7
> http://gee.cs.oswego.edu/dl/papers/fj.pdf
> Unfortunately I'll be stuck in the next few months with java5 on my development environment (waiting for Apple to fix a bug which is a blocker for me). But as soon I'll be able to use java6 I'll convert whatever threadpool code I'll do to FJ.
>
> <Scala fanboy mumbling to himself> Another option is to use Scala's Actors which is also based on FJ and runs on java5 :-) <end of mumbling>
>
> Eishay
>

------------------------------------------------------------------------------
Stay on top of everything new and different, both inside and
around Java (TM) technology - register by April 22, and save
$200 on the JavaOne (SM) conference, June 2-5, 2009, San Francisco.
300 plus technical and hands-on sessions. Register today.
Use priority code J9JMT32. http://p.sf.net/sfu/p
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers
Reply | Threaded
Open this post in threaded view
|

Re: Improving performance (stc)

Stanislaw Osinski
Administrator
Interesting reading, thanks. I don't think switching to FJ can improve things a
lot in our case because the number of fork/switches we have is not that large
and if you use a local (or better -- reusable) thread pool, there will be no
loss on creating new threads and synchronization costs should be minimal.

When it comes to reusing thread pools, there already is some kind of mechanism for it implemented by org.carrot2.core.ProcessingComponentBase.getSharedExecutor(int maxConcurrentThreads, Class<?> clazz). Looking at the usage pattern, the reuse is currently happening on a per-component class basis.

Cheers,

S.

------------------------------------------------------------------------------
Stay on top of everything new and different, both inside and
around Java (TM) technology - register by April 22, and save
$200 on the JavaOne (SM) conference, June 2-5, 2009, San Francisco.
300 plus technical and hands-on sessions. Register today.
Use priority code J9JMT32. http://p.sf.net/sfu/p
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers
Reply | Threaded
Open this post in threaded view
|

Re: Improving performance (stc)

eishay
Thanks Stanislaw, I'll look into it.
Eishay