cluster labels

reinhard
hi developers,

i am comparing my own cluster algorithm with carrot2/lingo,
especially the cluster labels.
if i search for "girsule" at yahoo with 100 results,
carrot2 shows me a cluster label "Ihres".
in german, i would classify this term as stop word.
which stop words are used in carrot2 for german?

regards
reinhard

------------------------------------------------------------------------------
Download Intel® Parallel Studio Eval
Try the new software tools for yourself. Speed compiling, find bugs
proactively, and fine-tune applications for parallel performance.
See why Intel Parallel Studio got high marks during beta.
http://p.sf.net/sfu/intel-sw-dev
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Re: cluster labels

Dawid Weiss-2
> if i search for "girsule" at yahoo with 100 results,
> carrot2 shows me a cluster label "Ihres".
> in german, i would classify this term as stop word.
> which stop words are used in carrot2 for german?

Assuming proper language has been passed from the input source (Yahoo
Boss) or forced manually, stopwords.de is the resource you're after. I
just looked into it and it contains:

ihr
ihre
ihren

I've just added ihres to the SVN, thanks.
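
For readers who want to check a label against such a list themselves, here is a minimal sketch in Java. It is not Carrot2 code: the class StopLabelFilter and its methods are hypothetical, and the file format (one word per line, '#' starting a comment line) is an assumption about how a resource like stopwords.de is laid out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Carrot2): flags labels made up only of stop words. */
final class StopLabelFilter {
    private final Set<String> stopWords = new HashSet<>();

    /** Assumed format: one word per line; blank lines and lines starting with '#' are skipped. */
    StopLabelFilter(Iterable<String> lines) {
        for (String line : lines) {
            String word = line.trim().toLowerCase(Locale.GERMAN);
            if (!word.isEmpty() && !word.startsWith("#")) {
                stopWords.add(word);
            }
        }
    }

    /** Returns true if the label is empty or every whitespace-separated token is a stop word. */
    boolean isStopLabel(String label) {
        String trimmed = label.trim();
        if (trimmed.isEmpty()) {
            return true;
        }
        for (String token : trimmed.split("\\s+")) {
            if (!stopWords.contains(token.toLowerCase(Locale.GERMAN))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        StopLabelFilter filter = new StopLabelFilter(
                Arrays.asList("# German stop words", "ihr", "ihre", "ihren", "ihres"));
        System.out.println(filter.isStopLabel("Ihres"));   // true
        System.out.println(filter.isStopLabel("Girsule")); // false
    }
}
```

The comparison is case-insensitive because labels are usually capitalized for display while stop word lists are typically lowercase.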

Dawid


Re: cluster labels

reinhard
Dawid Weiss wrote:

>> if i search for "girsule" at yahoo with 100 results,
>> carrot2 shows me a cluster label "Ihres".
>> in german, i would classify this term as stop word.
>> which stop words are used in carrot2 for german?
>>    
>
> Assuming proper language has been passed from the input source (Yahoo
> Boss) or forced manually, stopwords.de is the resource you're after. I
> just looked into it and it contains:
>
> ihr
> ihre
> ihren
>
> I've just added ihres to the SVN, thanks.
>
> Dawid
there is another issue with this search:
same query, "girsule" and yahoo boss source 100 results.
there is one cluster labeled "wbr".
this "wbr" obviously does not come from the snippet text.
is this some kind of break(br)? this should be filtered out i guess...

these are the snippets of the cluster:

Bibliothek / Donau-Universität Krems. ... Susanne *Girsule*. Phone: +43
(0)2732 893-2231. Fax: +43 (0)2732 893-4230. E-Mail:
susanne.*girsule*@donau-uni<wbr>.ac.at ...

Geschichte und Arbeitsmaterialien zu Jansdorf. Viele Infos auch für
Familien- und Ahnenforscher(Genealogen<wbr>) ... Dr. Norbert *Girsule*
veröffentlicht (Es ist kein wirklich ...

Last updated: Mon, 1 Apr 1996 4:14:59 +0100. Main Menu -- Main ... 0.12
0.00 48 0 | /n-online/veranst/logos/girsul<wbr>e.gif>*girsule* 0.14 0.59
55 1724400 | /n-online/veranst/messen<wbr>. ...

... 'n','Turnen','2000'); REPLACE INTO mitglied VALUES
('183','Herr','','Norbert<wbr>','*GIRSULE*','0000-00-00'<wbr>,'Friedhofstraße
22/12','3133','Traismauer<wbr>' ...


regards
reinhard



Re: cluster labels

Dawid Weiss-2
These are typically issues with search sources -- Web pages that
double-escape entities or search engines that escape them incorrectly.
If Yahoo Boss returns such encoded entities as part of its result, we
really can't do much about it other than recursive HTML-entity
stripping, but it's generally a pain.
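
To illustrate what recursive entity stripping could look like, here is a rough Java sketch. It is not what Carrot2 does: EntityUnescaper is a hypothetical class, it knows only four named entities plus decimal numeric ones, and the pass limit of 5 is an arbitrary assumption. A real implementation would more likely rely on a complete entity table from a library such as Apache Commons Text (StringEscapeUtils.unescapeHtml4).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: decodes entities repeatedly to undo double escaping. */
final class EntityUnescaper {
    // Only a handful of entities are covered in this sketch.
    private static final Pattern ENTITY =
            Pattern.compile("&(amp|lt|gt|quot|#(\\d{1,7}));");
    private static final int MAX_PASSES = 5; // arbitrary safety limit

    /** Decodes one level of escaping, so "&amp;lt;" becomes "&lt;". */
    static String unescapeOnce(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            switch (m.group(1)) {
                case "amp":  replacement = "&";  break;
                case "lt":   replacement = "<";  break;
                case "gt":   replacement = ">";  break;
                case "quot": replacement = "\""; break;
                default: // decimal numeric entity such as &#174;
                    int cp = Integer.parseInt(m.group(2));
                    replacement = Character.isValidCodePoint(cp)
                            ? new String(Character.toChars(cp))
                            : m.group(); // leave invalid code points untouched
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Repeats unescapeOnce until the text stops changing or the limit is hit. */
    static String unescapeFully(String text) {
        String current = text;
        for (int i = 0; i < MAX_PASSES; i++) {
            String next = unescapeOnce(current);
            if (next.equals(current)) {
                break;
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(unescapeFully("donau-uni&amp;lt;wbr&amp;gt;.ac.at")); // donau-uni<wbr>.ac.at
    }
}
```

The pain is visible even here: repeated decoding cannot tell a double-escaped tag from text that was legitimately meant to show "&lt;" to the reader, so it can corrupt snippets that were escaped correctly.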

Dawid


Re: cluster labels

Dawid Weiss-2
To be exact, this is what is passed back from Yahoo BOSS:

<abstract><![CDATA[gerta <b>girsule</b> - HEROLD.at hat für Sie 1
Telefonbucheinträge zu gerta <b>girsule</b> gefunden. <b>...</b>
http://www.herold.at/telefonbu<wbr>ch/gerta-<b>girsule</b>/ Link in
einer Internetseite <b>...</b>]]></abstract>

As you can see, we just display this literally (to keep the search
engine's highlights, for example); <wbr> is simply not interpreted
correctly by the browser. Why, I have no idea.
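
One way to keep the tag name out of the cluster labels, sketched in Java under some assumptions: SnippetCleaner is a hypothetical class (not Carrot2 code), the snippet is assumed to arrive exactly as in the abstract above, and the tag pattern is deliberately naive, so it would also swallow literal text such as "a < b > c".

```java
import java.util.regex.Pattern;

/** Hypothetical helper: removes <wbr> hints from search snippets. */
final class SnippetCleaner {
    // Matches <wbr>, <wbr/>, <wbr /> and </wbr>, ignoring case.
    private static final Pattern WBR =
            Pattern.compile("</?wbr\\s*/?>", Pattern.CASE_INSENSITIVE);
    // Naive pattern for any remaining tag, such as <b> and </b>.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    /** Drops only the word-break hints, so <b> highlights survive for display. */
    static String forDisplay(String snippet) {
        return WBR.matcher(snippet).replaceAll("");
    }

    /** Drops word-break hints and all other tags before the text is tokenized. */
    static String forClustering(String snippet) {
        return ANY_TAG.matcher(forDisplay(snippet)).replaceAll("");
    }

    public static void main(String[] args) {
        String snippet = "http://www.herold.at/telefonbu<wbr>ch/gerta-<b>girsule</b>/";
        System.out.println(forDisplay(snippet));    // http://www.herold.at/telefonbuch/gerta-<b>girsule</b>/
        System.out.println(forClustering(snippet)); // http://www.herold.at/telefonbuch/gerta-girsule/
    }
}
```

The hint is replaced with the empty string rather than a space because <wbr> only marks a place where a line may break; the characters on both sides belong to the same word.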

Dawid
