stoplist and stemmer question

stoplist and stemmer question

milos
Hello,

I have a Carrot2 3.0 webapp + Lucene 2.3.2 installation.
My documents are in Serbian, and all of them are indexed and stored in a
Lucene index. Sorry for repeating some questions...

1) What does Carrot use for stop-list and stemmer in the clustering
process by default? (Maybe no stop-list and no stemmer by default, or
the English stop-list and stemmer?)

2) Suppose I want to make my own stop-list for Serbian or for Serbian and
English together. Where should I put that list and how do I activate it?

3) What is your opinion about stop-list and its impact on quality of
clustering?

4) Is stemming independent of stop-list removal? How can I
activate/deactivate it, and can I add two stemmers for two languages?

5) What is your opinion about stemming and its impact on quality of
clustering (I do not have a Serbian stemmer, so is it a serious problem to
turn off stemming)?

Sincerely, Milos







------------------------------------------------------------------------------
This SF.net email is sponsored by:
High Quality Requirements in a Collaborative Environment.
Download a free trial of Rational Requirements Composer Now!
http://p.sf.net/sfu/www-ibm-com
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Re: stoplist and stemmer question

Dawid Weiss-2

Hi Milos,

> 1) What does Carrot use for stop-list and stemmer in the clustering
> process by default? (Maybe no stop-list and no stemmer by default, or
> the English stop-list and stemmer?)

In general it depends on which clustering algorithm you use, because the
algorithms differ in their implementation. In the 3.0 line we tried to
consolidate the linguistic/preprocessing layer, though, so linguistic resources
for STC and Lingo are taken from the implementation of ILanguageModelFactory,
with the default being DefaultLanguageModelFactory, which in turn has the
following attribute:

     @Required
     @Processing
     @Input
     @Attribute(key = AttributeNames.ACTIVE_LANGUAGE)
     public LanguageCode current = LanguageCode.ENGLISH;

You can change the active language by passing a processing attribute keyed with
AttributeNames.ACTIVE_LANGUAGE ("active-language") and valued using one of the
supported ISO codes. We currently support the following:

     DANISH ("da"),
     DUTCH ("nl"),
     ENGLISH ("en"),
     FINNISH ("fi"),
     FRENCH ("fr"),
     GERMAN ("de"),
     HUNGARIAN ("hu"),
     ITALIAN ("it"),
     NORWEGIAN ("no"),
     POLISH ("pl"),
     PORTUGUESE ("pt"),
     ROMANIAN ("ro"),
     RUSSIAN ("ru"),
     SPANISH ("es"),
     SWEDISH ("sv"),
     TURKISH ("tr");

Serbian is not on the list, so we would love it if you could contribute back
at least the stopword list.
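
As an illustration of the list above, here is a minimal Java sketch of looking up a language by its ISO code. The SupportedLanguage enum and its forIsoCode method are hypothetical stand-ins written for this example; they are not Carrot2's own LanguageCode class, whose exact API is not shown in this thread.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical stand-in mirroring the list above; not Carrot2's LanguageCode.
enum SupportedLanguage {
    DANISH("da"), DUTCH("nl"), ENGLISH("en"), FINNISH("fi"), FRENCH("fr"),
    GERMAN("de"), HUNGARIAN("hu"), ITALIAN("it"), NORWEGIAN("no"), POLISH("pl"),
    PORTUGUESE("pt"), ROMANIAN("ro"), RUSSIAN("ru"), SPANISH("es"),
    SWEDISH("sv"), TURKISH("tr");

    private final String isoCode;

    SupportedLanguage(String isoCode) {
        this.isoCode = isoCode;
    }

    String isoCode() {
        return isoCode;
    }

    /** Returns the language for a two-letter ISO code, or empty if it is not listed. */
    static Optional<SupportedLanguage> forIsoCode(String code) {
        String normalized = code.trim().toLowerCase(Locale.ROOT);
        for (SupportedLanguage language : values()) {
            if (language.isoCode.equals(normalized)) {
                return Optional.of(language);
            }
        }
        return Optional.empty();
    }
}

class LanguageLookupDemo {
    public static void main(String[] args) {
        System.out.println(SupportedLanguage.forIsoCode("pl")); // prints Optional[POLISH]
        System.out.println(SupportedLanguage.forIsoCode("sr")); // prints Optional.empty
    }
}
```

Returning an empty Optional for "sr" makes the missing Serbian entry explicit, so a caller can decide whether to fall back to English or to plug in its own resources.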

> 2) Suppose I want to make my own stop-list for Serbian or for Serbian and
> English together. Where should I put that list and how do I activate it?

If you have mixed languages in your documents then yes, a combined list would
be the way to do it. Our open source algorithms currently don't have the ability
to cluster multi-lingual document collections. Ideally, you would run a
pre-filtering step that identifies the language of each document, split the
collection by language, and then run clustering on each sub-group separately.
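
The pre-filtering idea can be sketched roughly as follows. This is only an illustration: the LanguageSplitter class, the toy detector and the hint words are all assumptions made for this example, and a real system would use a proper language identifier before handing each group to the clustering algorithm with the matching active language.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper for this example; not part of Carrot2.
class LanguageSplitter {
    // A few common Serbian function words, used only by the toy detector below.
    private static final Set<String> SERBIAN_HINTS = Set.of("je", "su", "na", "za", "se");

    /** Groups documents by the language label the detector assigns to each one. */
    static Map<String, List<String>> splitByLanguage(
            List<String> documents, Function<String, String> detector) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String document : documents) {
            groups.computeIfAbsent(detector.apply(document), key -> new ArrayList<>())
                  .add(document);
        }
        return groups;
    }

    /** Toy detector: "sr" if any hint word occurs in the text, otherwise "en". */
    static String toyDetect(String document) {
        for (String token : document.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (SERBIAN_HINTS.contains(token)) {
                return "sr";
            }
        }
        return "en";
    }

    public static void main(String[] args) {
        List<String> documents = List.of(
                "Ovo je prvi dokument",
                "This is an English document",
                "Klasterovanje je korisno");
        Map<String, List<String>> groups = splitByLanguage(documents, LanguageSplitter::toyDetect);
        // Each group would then be clustered on its own, with its own resources.
        groups.forEach((language, docs) -> System.out.println(language + " -> " + docs));
    }
}
```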

> 3) What is your opinion about stop-list and its impact on quality of
> clustering?

A stop list is important, especially if you cluster small numbers of documents.
For very large collections both stemming and stop lists have been shown to be of
lower importance (they don't improve quality that much). For Slavic languages
there are usually many more word forms for a single stem than in English, so
stemming will matter more there.

I guess what I'm trying to say is that it all depends on the context of your
documents, their content, language, etc.

> 4) Is stemming independent of stop-list removal? How can I
> activate/deactivate it, and can I add two stemmers for two languages?

You can use a fake stemmer (which simply returns the word form as a stem) and a
real stop word list, which would answer question one. We don't currently support
multi-lingual situations in Carrot2, so it would require some hacking in
DefaultLanguageModelFactory and LanguageCode, for example creating a fake
language ID and then combining stemmers and stop word lists.
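
A rough sketch of the "fake stemmer plus real stop word list" idea is shown below. The Stemmer interface, the class name and the tiny word lists are hypothetical and exist only for this example; they are not Carrot2's interfaces, and wiring such resources into DefaultLanguageModelFactory would still need the hacking described above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical types for illustration only; not Carrot2's API.
class FakeLanguageResources {
    /** Minimal stemmer contract assumed for this sketch. */
    interface Stemmer {
        String stem(String word);
    }

    /** Fake stemmer: returns the word form unchanged as its own stem. */
    static final Stemmer IDENTITY = word -> word;

    /** Combines two stop word lists into one, e.g. Serbian and English. */
    static Set<String> mergeStopWords(Set<String> first, Set<String> second) {
        Set<String> merged = new HashSet<>(first);
        merged.addAll(second);
        return merged;
    }

    /** Lower-cases, splits on whitespace, drops stop words, then stems the rest. */
    static List<String> preprocess(String text, Set<String> stopWords, Stemmer stemmer) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty() || stopWords.contains(token)) {
                continue;
            }
            terms.add(stemmer.stem(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> serbian = Set.of("je", "na", "za", "se");
        Set<String> english = Set.of("the", "is", "of");
        Set<String> combined = mergeStopWords(serbian, english);
        // prints [klasterovanje, grouping, dokumenata]
        System.out.println(preprocess("Klasterovanje je the grouping of dokumenata",
                combined, IDENTITY));
    }
}
```

Because stop word removal and stemming are separate steps here, either one can be swapped or switched off without touching the other.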

> 5) What is your opinion about stemming and its impact on quality of
> clustering (I do not have a Serbian stemmer, so is it a serious problem to
> turn off stemming)?

I suggest that you try and see what happens. It's hard to predict. I'm guessing
stemming could be helpful, but even with a decent stop word list you should be
able to get something sensible out of the system.

D.


Re: stoplist and stemmer question

milos
Hi D,

I use Lingo for clustering. If I understood correctly, the 3.0 webapp uses
English as the default language? Does that mean that Carrot applies the English
stemmer and stop list when clustering with Lingo?

> You can change the active language by passing a processing attribute keyed
> with
> AttributeNames.ACTIVE_LANGUAGE ("active-language") and valued using one of

How do I do that in the 3.0 webapp?

In my GET request to the webapp server I do not have the following:
EToolsDocumentSource.language=ENGLISH

> Serbian is not on the list, so we would love it if you could contribute
> back
> at least the stopword list.

I'll do that, I promise.
I also contacted a Serbian professor in the USA to obtain a Serbian stemmer.

Regards, Milos




Re: stoplist and stemmer question

Dawid Weiss-2

Hi Milos,

Sorry for the delay.

> I use Lingo for clustering. If I understood correctly, the 3.0 webapp uses
> English as the default language? Does that mean that Carrot applies the English
> stemmer and stop list when clustering with Lingo?

Yes, English is used as the default for clustering in the webapp (and in other
applications).

>> You can change the active language by passing a processing attribute keyed
>> with
>> AttributeNames.ACTIVE_LANGUAGE ("active-language") and valued using one of
>
> How do I do that in the 3.0 webapp?

The webapp passes all request parameters to the clustering process by default,
so you can simply add the parameters to your URI. For example, if you add:

active-language=POLISH&DefaultLanguageModelFactory.mergeResources=false

(merge resources affects how stopwords from different languages are treated; if
you merge them, stemming will have very little influence on the results), then
the result is:

http://demo.carrot2.org/head/search?source=web&view=tree&skin=fancy-compact&query=data+mining&results=100&algorithm=lingo&EToolsDocumentSource.country=ALL&EToolsDocumentSource.language=ENGLISH&EToolsDocumentSource.safeSearch=false&active-language=POLISH&DefaultLanguageModelFactory.mergeResources=false

Compare it to the baseline, in which English is used:

http://demo.carrot2.org/head/search?source=web&view=tree&skin=fancy-compact&query=data+mining&results=100&algorithm=lingo&EToolsDocumentSource.country=ALL&EToolsDocumentSource.language=ENGLISH&EToolsDocumentSource.safeSearch=false
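
For anyone assembling such a request in code rather than by hand, a minimal Java sketch follows. The parameter names and the base URL are the ones from the URLs above; the WebappUrlBuilder class is hypothetical, and the sketch assumes Java 10 or later for the URLEncoder.encode(String, Charset) overload.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for this example; not part of the Carrot2 webapp.
class WebappUrlBuilder {
    /** Appends URL-encoded key=value pairs to the base address, in insertion order. */
    static String buildUrl(String base, Map<String, String> parameters) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> entry : parameters.entrySet()) {
            query.add(URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8)
                    + "="
                    + URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
        }
        return base + "?" + query;
    }

    public static void main(String[] args) {
        Map<String, String> parameters = new LinkedHashMap<>();
        parameters.put("source", "web");
        parameters.put("query", "data mining");
        parameters.put("results", "100");
        parameters.put("algorithm", "lingo");
        // The two parameters that switch the clustering language resources.
        parameters.put("active-language", "POLISH");
        parameters.put("DefaultLanguageModelFactory.mergeResources", "false");
        System.out.println(buildUrl("http://demo.carrot2.org/head/search", parameters));
    }
}
```

The LinkedHashMap keeps the parameters in the order they were added, which makes the generated URL easy to compare against the hand-written ones.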

> In my GET request to the webapp server I do not have the following:
> EToolsDocumentSource.language=ENGLISH

The language (or region) is a parameter of many components, not only clustering.
The parameter above sets the default language of the eTools search engine (so
that you get the results filtered to English only). This will also be the case
with other sources -- MSN or Yahoo BOSS have their own language and region codes.

And no, these codes are not unified or consistent in any way, unfortunately...

>> Serbian is not on the list, so we would love it if you could contribute
>> back at least the stopword list.
>
> I'll do that, I promise.

Contribute the stop word list -- this has been shown to provide a big boost in
quality for most algorithms. Stemming is of secondary importance.

Dawid

