Enhanced stopword management

Enhanced stopword management

Bogdan94202
Hi,

I am wondering whether it would be possible to provide a more sophisticated stopword mechanism.
For example, I would like to define different sets (clusters :) ) of stopwords that I could easily enable or disable. Imagine I am running against documents where personal names are used frequently: at one point I would like to filter the names out, but at another point I would like to use them in the clustering process. This means I need to be able to include or exclude individual word sets in the stopwords. I would also like to be able to define different sets easily (e.g. personal names, locations, etc.).
Which classes should I keep an eye on?

Additionally, I would like to be able to define stopwords using regular expressions. Do you think this is achievable? If so, how?

Best regards,
Bogdan

Re: Enhanced stopword management

Dawid Weiss-2
At the moment the stop words are loaded once and cannot be changed at
runtime. There is a JIRA enhancement request that targets runtime
changes of stopwords and other aspects of processing, but at the time
of writing this is impossible.

Regular expressions are not supported either, but you could modify the
source code and supply your own stop word marker implementation with
the features you need. The reality is most people use a fixed set of
stop words, so the application you have is quite specific and may
require custom development.
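
To make the regular expression idea concrete, here is a minimal standalone sketch of a stop word check that combines a fixed word list with regex patterns. The class name RegexStopWords and its methods are hypothetical and are not part of Carrot2; the thread does not show the stop word marker's actual interface, so wiring something like this into the source code is left as an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Carrot2 class): literal stop words plus regex patterns. */
final class RegexStopWords {
    private final Set<String> literals = new HashSet<>();
    private final List<Pattern> patterns = new ArrayList<>();

    RegexStopWords(Iterable<String> words, Iterable<String> regexes) {
        // Literal words are stored lower-cased so lookups ignore case.
        for (String w : words) {
            literals.add(w.toLowerCase(Locale.ROOT));
        }
        // Patterns are compiled once up front, not once per token.
        for (String r : regexes) {
            patterns.add(Pattern.compile(r, Pattern.CASE_INSENSITIVE));
        }
    }

    /** True if the token is a literal stop word or is matched in full by a pattern. */
    boolean isStopWord(String token) {
        if (literals.contains(token.toLowerCase(Locale.ROOT))) {
            return true;
        }
        for (Pattern p : patterns) {
            // matches() requires the whole token to match, not just a substring.
            if (p.matcher(token).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        RegexStopWords stop = new RegexStopWords(
            Arrays.asList("the", "and"),
            Arrays.asList("\\d+", "mr\\.?|mrs\\.?"));
        for (String t : Arrays.asList("The", "2009", "Mrs.", "clustering")) {
            System.out.println(t + " -> " + stop.isStopWord(t));
        }
    }
}
```

The hash set lookup runs first because it is cheap; the patterns are tried only for tokens that are not literal stop words. With many patterns, the linear scan may become the bottleneck and could be replaced by a single combined alternation.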

D.

On Thu, Dec 24, 2009 at 5:07 PM, Bogdan94202 <[hidden email]> wrote:

> [...]
> --
> View this message in context: http://n2.nabble.com/Enhanced-stopword-management-tp4213674p4213674.html
> Sent from the Carrot2 Users and Developers Forum mailing list archive at Nabble.com.

------------------------------------------------------------------------------
This SF.Net email is sponsored by the Verizon Developer Community
Take advantage of Verizon's best-in-class app development support
A streamlined, 14 day to market process makes app distribution fast and easy
Join now and get one step closer to millions of Verizon customers
http://p.sf.net/sfu/verizon-dev2dev 
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Re: Enhanced stopword management

Bogdan94202
> The reality is most people use a fixed set of
> stop words, so the application you have is quite specific and may
> require custom development.

I am OK with doing some custom development, but it would be good if you could point me to the classes that would be best to modify.

Re: Enhanced stopword management

Dawid Weiss-2
> I am OK with doing some custom development, but it would be good if you
> could point me to the classes that would be best to modify.

Look at the preprocessing infrastructure sub-project; it's in
carrot-util-text. The class I'd look at is StopListMarker. You could
also modify the entire infrastructure concerned with languages and
their specific resources, but the stop list marker class is a
central point where various experiments would be easiest to apply.
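
As one possible starting point for such experiments, the sketch below shows named stop word sets (personal names, locations, and so on) that can be switched on and off between runs. SwitchableStopSets and its methods are hypothetical names, not Carrot2 API; the assumption is that a modified stop list marker would consult something like isStopWord() for each token.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not a Carrot2 class): named stop word sets that can be toggled. */
final class SwitchableStopSets {
    private final Map<String, Set<String>> sets = new HashMap<>();
    private final Set<String> enabled = new HashSet<>();

    /** Defines (or redefines) a named set; new sets start out disabled. */
    void define(String name, String... words) {
        Set<String> normalized = new HashSet<>();
        for (String w : words) {
            normalized.add(w.toLowerCase(Locale.ROOT));
        }
        sets.put(name, normalized);
    }

    /** Enables or disables a previously defined set. */
    void setEnabled(String name, boolean on) {
        if (!sets.containsKey(name)) {
            throw new IllegalArgumentException("Unknown stop set: " + name);
        }
        if (on) {
            enabled.add(name);
        } else {
            enabled.remove(name);
        }
    }

    /** True if the token belongs to at least one currently enabled set. */
    boolean isStopWord(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        for (String name : enabled) {
            if (sets.get(name).contains(t)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SwitchableStopSets stop = new SwitchableStopSets();
        stop.define("names", "Bogdan", "Dawid");
        stop.define("locations", "Warsaw", "Sofia");

        stop.setEnabled("names", true);
        System.out.println("names on:  " + stop.isStopWord("Dawid"));
        stop.setEnabled("names", false);
        System.out.println("names off: " + stop.isStopWord("Dawid"));
    }
}
```

This sketch is not thread-safe, and toggling a set affects only subsequent lookups. Since stop words are currently loaded once, a real integration would also need to decide when the marker re-reads the enabled sets, for example once before each clustering run.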

Let us know what worked and what didn't.

Dawid
