strategies for clustering big data


zimzaz
Ok,

I have Carrot2 Workbench up and running with Solr 1.4/Nutch and successfully producing clusters.  Yay!

I am about to upgrade my machine to a 7.5 GB RAM / 4-core AWS instance.

Now I need to figure out some strategies for crawling, indexing, and clustering against a copy of Wikipedia (11M articles).

What I want to do is feed Carrot2 queries from a list of keywords I have, then save the resulting answer sets and clusters so that a script can extract identifiers and pass them to another machine.
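Roughly, here is the batch driver I have in mind (just a sketch; the Solr URL, the clustering parameters, and the "id" field are guesses from my local setup and may need adjusting):

    import json
    import os
    import urllib.parse
    import urllib.request

    SOLR = "http://localhost:8983/solr/select"  # assumed Solr endpoint

    def cluster_query(keyword, rows=50):
        # Ask Solr for matching documents plus Carrot2 clusters
        # (assumes the clustering contrib is enabled in solrconfig.xml).
        params = urllib.parse.urlencode({
            "q": keyword,
            "rows": rows,
            "wt": "json",
            "clustering": "true",
            "clustering.results": "true",
        })
        with urllib.request.urlopen(SOLR + "?" + params) as resp:
            return json.load(resp)

    os.makedirs("out", exist_ok=True)
    with open("keywords.txt") as f:
        keywords = [line.strip() for line in f if line.strip()]

    for keyword in keywords:
        result = cluster_query(keyword)
        # Save the full answer set and clusters for this keyword.
        with open("out/%s.json" % keyword.replace(" ", "_"), "w") as out:
            json.dump(result, out)
        # Identifiers to hand off to the other machine ("id" is a guess
        # at the unique key field in my schema).
        ids = [doc["id"] for doc in result["response"]["docs"]]
        print(keyword, len(ids), "documents")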

First question: am I in the ballpark with this machine? Will I be able to make sufficient progress running batch queries?

I see from the Carrot2 documentation that the STC algorithm is recommended for large data sets. Any and all advice would be most welcome.

Fred


-----------------------------------------------------
Subscribe to the Nimble Books Mailing List  http://eepurl.com/czS- for monthly updates



Re: strategies for clustering big data

Dawid Weiss
Hi Fred,

This question is better suited to the Solr/Lucene mailing list, because Carrot2 and its clustering algorithms sit on top of the search results, which are typically not very large (and don't require vast amounts of memory). For Solr and Lucene, more RAM will help, although an on-disk index typically gains more from faster disks (SSDs) than from RAM alone.

What queries do you plan on executing? Are they regular queries returning a few hundred snippets, or something else? If they're regular queries, then both Lingo and STC should be more than fine. If you plan to run large-scale clusterings, they will probably slow down significantly with larger inputs (an inherent property of these algorithms; they were designed for small and medium data sets). If you hit this, you may want to try Lingo3G (Carrot Search's commercial algorithm). It also won't scale to super-large inputs, but it should do better than the open-source algorithms on large data.
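If you want to see where things start to strain on your data, a quick timing loop against Solr's clustering handler will tell you (a rough sketch; the URL and the parameter names depend on how your clustering contrib is configured):

    import json
    import time
    import urllib.parse
    import urllib.request

    SOLR = "http://localhost:8983/solr/select"  # assumed Solr endpoint

    def timed_clustering(query, rows):
        # Run one clustered query and measure wall-clock time.
        params = urllib.parse.urlencode({
            "q": query,
            "rows": rows,
            "wt": "json",
            "clustering": "true",
            "clustering.results": "true",
        })
        start = time.time()
        with urllib.request.urlopen(SOLR + "?" + params) as resp:
            data = json.load(resp)
        return time.time() - start, len(data.get("clusters", []))

    # Increase the input size and watch the clustering time grow.
    for rows in (50, 200, 1000, 5000):
        elapsed, n = timed_clustering("wikipedia", rows)
        print("rows=%-5d clusters=%-3d %.2fs" % (rows, n, elapsed))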

Dawid



Re: strategies for clustering big data

zimzaz
Thanks -- got it.  Queries are simple keyword searches intended to bring back 25-50 documents.  Fred




Re: strategies for clustering big data

Dawid Weiss
You should be fine with Lingo or STC then. Speak up if you have any problems.

Dawid


_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers