german stopwords

german stopwords

reinhard
trying the query
"marillenbaum krankheit"
source: yahoo, 100 results
cluster labels contain stopwords such as "hat", "habe", "haben"

http://search.carrot2.org/stable/search?source=boss-web&view=tree&skin=fancy-compact&query=marillenbaum+krankheit&results=100&algorithm=lingo&BossWebSearchService.filter=&BossSearchService.languageAndRegion=&BossSearchService.sites=#


regards
reinhard


------------------------------------------------------------------------------

_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Re: german stopwords

Stanislaw Osinski
Administrator
Hi,

We don't have automatic language identification yet, so you'd need to specify the language and region manually (in advanced options):

http://search.carrot2.org/stable/search?source=boss-web&view=tree&skin=fancy-compact&query=marillenbaum+krankheit&results=100&algorithm=lingo&BossWebSearchService.filter=&BossSearchService.languageAndRegion=GERMAN&BossSearchService.sites=

This time the German stopwords would get applied, but looking at the results, things could still be improved...
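
As a rough illustration of the point above, the search URL can be assembled programmatically so that the language is always set explicitly. This is only a sketch: the parameter names and values are copied from the URL posted in this thread, while the function name `build_search_url` is made up for the example, and nothing here contacts the server.

```python
from urllib.parse import urlencode

# Base address taken from the URLs posted in this thread.
BASE_URL = "http://search.carrot2.org/stable/search"


def build_search_url(query, language_and_region="", results=100, base_url=BASE_URL):
    """Build a search URL shaped like the ones in this thread (hypothetical helper)."""
    params = [
        ("source", "boss-web"),
        ("view", "tree"),
        ("skin", "fancy-compact"),
        ("query", query),
        ("results", str(results)),
        ("algorithm", "lingo"),
        ("BossWebSearchService.filter", ""),
        # An empty value means no language is selected, so no German stopwords.
        ("BossSearchService.languageAndRegion", language_and_region),
        ("BossSearchService.sites", ""),
    ]
    return base_url + "?" + urlencode(params)


url = build_search_url("marillenbaum krankheit", "GERMAN")
print(url)
```

Calling it with an empty `language_and_region` reproduces the original query without a language, which is the case where the German stopwords were not applied.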

S.

Re: german stopwords

reinhard
hi,

ok.
i would then add "hat" and "mir" to the stopword list.

regards
reinhard

Re: german stopwords

Stanislaw Osinski
Administrator

> i would then add "hat" and "mir" to the stopword list.

Good idea! I added a few others and now the results seem better indeed. It's still in trunk only, so the search URL for the most up-to-date version is different ("head" instead of "stable" in the URL):

http://search.carrot2.org/head/search?source=boss-web&view=tree&skin=fancy-compact&query=marillenbaum+krankheit&results=100&algorithm=lingo&BossWebSearchService.filter=&BossSearchService.languageAndRegion=GERMAN&BossSearchService.sites=
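
To make the effect of a longer stopword list concrete, here is a small sketch of one way a stopword set can be used to reject candidate cluster labels. It is not Carrot2's actual implementation. The rule (reject a label that starts or ends with a stopword) is an assumption chosen for illustration, the set holds only the four words named in this thread, and the sample labels are invented.

```python
# Only the words mentioned in this thread; a real list would be much longer.
GERMAN_STOPWORDS = {"hat", "habe", "haben", "mir"}


def is_acceptable_label(label, stopwords=GERMAN_STOPWORDS):
    """Return False for labels that are empty or start/end with a stopword."""
    words = label.lower().split()
    if not words:
        return False
    return words[0] not in stopwords and words[-1] not in stopwords


# Invented candidate labels, for illustration only.
candidates = ["Marillenbaum Krankheit", "Marillenbaum hat", "Haben"]
kept = [label for label in candidates if is_acceptable_label(label)]
print(kept)
```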

I've also addressed the <wbr> tag problem: Yahoo seems to preserve the tag in its results, but it seems safe to simply remove the tag altogether.
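
For readers curious what "remove the tag altogether" might look like, here is a minimal sketch using a regular expression. It is not the code that went into trunk. The function name `strip_wbr` and the sample snippet are made up, and the pattern is deliberately limited to the `wbr` tag.

```python
import re

# Matches <wbr>, <wbr/>, <wbr /> and </wbr>, in any letter case.
WBR_TAG = re.compile(r"</?wbr\s*/?>", re.IGNORECASE)


def strip_wbr(text):
    """Remove word-break-opportunity tags, joining the text around them."""
    return WBR_TAG.sub("", text)


# Invented snippet, for illustration only.
print(strip_wbr("Marillen<wbr>baum Krank<WBR/>heit"))
```

Since `<wbr>` only marks a possible line-break position and carries no text of its own, deleting it rejoins the word it split.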

S.



Re: german stopwords

reinhard
the language and country/region parameter is somehow strange.

http://search.carrot2.org/head/search?source=boss-web&view=tree&skin=fancy-compact&query=phytoplasma&results=200&algorithm=lingo&BossWebSearchService.filter=&BossSearchService.languageAndRegion=&BossSearchService.sites=
96500 results

http://search.carrot2.org/head/search?source=boss-web&view=tree&skin=fancy-compact&query=phytoplasma&results=200&algorithm=lingo&BossWebSearchService.filter=&BossSearchService.languageAndRegion=GERMAN&BossSearchService.sites=
96200 results

http://search.carrot2.org/head/search?source=boss-web&view=tree&skin=fancy-compact&query=phytoplasma&results=200&algorithm=lingo&BossWebSearchService.filter=&BossSearchService.languageAndRegion=AUSTRIA&BossSearchService.sites=
96300 results

this is quite strange. more results for austria than for german and
nearly same as without filter parameter.
what is going on here?

if i query "phytoplasma" at
http://de.search.yahoo.com/
i get quite different results for language german or country/region austria.

255 results from austria
2230 results for german

regards
reinhard

Re: german stopwords

Stanislaw Osinski
Administrator

> this is quite strange. more results for austria than for german and
> nearly same as without filter parameter.
> what is going on here?

I wish I knew myself ;-) I tried making requests directly to the BOSS API and I'm getting similar/problematic results:

http://boss.yahooapis.com/ysearch/web/v1/phytoplasma?appid=txRLTt7V34GgabH9baqIrsnRLuy87i4dQ2kQyok0IIqlUXdw4HmxjE59xhq2_6mT0LM-&start=0&count=50&format=xml&sites=&lang=de&region=at

http://boss.yahooapis.com/ysearch/web/v1/phytoplasma?appid=txRLTt7V34GgabH9baqIrsnRLuy87i4dQ2kQyok0IIqlUXdw4HmxjE59xhq2_6mT0LM-&start=0&count=50&format=xml&sites=&lang=de&region=de

http://boss.yahooapis.com/ysearch/web/v1/phytoplasma?appid=txRLTt7V34GgabH9baqIrsnRLuy87i4dQ2kQyok0IIqlUXdw4HmxjE59xhq2_6mT0LM-&start=0&count=50&format=xml&sites=&lang=en&region=us

The results differ a bit, which suggests that the region parameter hasn't been completely ignored, but still lots of non-German documents appear. Interestingly, Bing behaves the same. No clear explanation springs to my mind without deeper searching of BOSS/Bing support lists.
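
One way to put a number on "differ a bit" is to compare the result URLs returned for two language/region settings. The sketch below only builds request URLs in the shape shown above and computes a set overlap. It makes no network requests. `boss_url`, `overlap` and the `YOUR_APP_ID` placeholder are hypothetical names, and the sample result lists are invented.

```python
from urllib.parse import quote, urlencode

# Endpoint prefix copied from the request URLs above.
BOSS_BASE = "http://boss.yahooapis.com/ysearch/web/v1/"


def boss_url(query, lang, region, appid="YOUR_APP_ID", start=0, count=50):
    """Build a BOSS request URL shaped like the ones above (no request is sent)."""
    params = [
        ("appid", appid),
        ("start", start),
        ("count", count),
        ("format", "xml"),
        ("sites", ""),
        ("lang", lang),
        ("region", region),
    ]
    return BOSS_BASE + quote(query) + "?" + urlencode(params)


def overlap(urls_a, urls_b):
    """Jaccard overlap of two result lists: 1.0 means identical sets."""
    a, b = set(urls_a), set(urls_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


print(boss_url("phytoplasma", "de", "at"))
# Invented result lists, for illustration only.
print(overlap(["u1", "u2", "u3"], ["u2", "u3", "u4"]))
```

An overlap close to 1.0 between the `de`/`at` and `en`/`us` result sets would support the impression that the language and region settings are mostly ignored.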

S.
