Questions about LINGO and STC


serghiño80
Good morning,

I downloaded the Carrot2 Document Clustering Workbench to test it, using XML input that I created myself.
I have several questions about how the application works:

1-Does Carrot2 Document Clustering Workbench support multiple languages? (i.e., can I pass in documents written in different languages?)

2-The "Processing Language" attribute is required in the LINGO and STC algorithms. I assume the language chosen for this attribute should be the same as the language of the input documents, is that right?

3-Does the "Merge Lexical Resources" attribute apply the stop word and stop label filters across different languages?

4-Suppose I feed documents into Carrot2 Document Clustering Workbench in the following form:
 
<?xml version="1.0" encoding="UTF-8"?>
<searchresult>

<document id="0">
<title>king</title>
<snippet>dona_maria_de_las_mercedes_de_borbon la_mareta orleans king sar_dona_maria_de_las_mercedes barcelona casa_de_su_majestad_el_rey casa_del_rey familia_real</snippet>
<url>http://www.20000102_606_C33.source.es</url>
</document>
.......
.......
</searchresult>

Does the application use the title field or the snippet field?
Is this type of input correct for the application?


Thank you very much in advance.

------------------------------------------------------------------------------
Open Source Business Conference (OSBC), March 24-25, 2009, San Francisco, CA
-OSBC tackles the biggest issue in open source: Open Sourcing the Enterprise
-Strategies to boost innovation and cut costs with open source participation
-Receive a $600 discount off the registration fee with the source code: SFAD
http://p.sf.net/sfu/XcvMzF8H
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Re: Questions about LINGO and STC

Stanislaw Osinski
Administrator
Hello,
 
1-Does Carrot2 Document Clustering Workbench support multiple languages? (i.e., can I pass in documents written in different languages?)

The real question is whether the Carrot2 algorithms support multilingual clustering; Workbench is just one of many interfaces to Carrot2, and all interfaces use the same algorithms.

Carrot2 currently supports clustering in different languages; however, you must declare the language in which the documents are written. In other words, Carrot2 can't yet intelligently handle a mix of documents in different languages on input.
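
One way to work within this constraint is to split the input by language before clustering, so that each batch can be processed with a single declared language. The snippet below is only a minimal sketch in plain Python, not Carrot2 code: the `language` key on each document and the `group_by_language` helper are assumptions made for illustration, and how each batch is then handed to Carrot2 is left out.

```python
from collections import defaultdict


def group_by_language(documents):
    """Split documents into batches keyed by their declared language.

    Each document is assumed (hypothetically) to be a dict carrying a
    "language" entry; Carrot2 itself does not detect the language.
    """
    batches = defaultdict(list)
    for doc in documents:
        batches[doc["language"]].append(doc)
    return dict(batches)


# Hypothetical sample input mixing Spanish and English documents.
docs = [
    {"id": 0, "language": "es", "title": "rey"},
    {"id": 1, "language": "en", "title": "king"},
    {"id": 2, "language": "es", "title": "casa real"},
]

batches = group_by_language(docs)
for language, batch in sorted(batches.items()):
    print(language, [doc["id"] for doc in batch])
```

Each batch could then be clustered separately with the matching "Processing Language" setting.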
 
2-The "Processing Language" attribute is required in the LINGO and STC algorithms. I assume the language chosen for this attribute should be the same as the language of the input documents, is that right?

Exactly, that should give the best results. Essentially, the algorithms will use the stop word list and stemmer appropriate for the "Processing Language". I've noticed that the documentation of Processing Language is a bit misleading; I'll correct it in a second.
 
3-Does the "Merge Lexical Resources" attribute apply the stop word and stop label filters across different languages?

Quoting from the documentation (http://download.carrot2.org/head/manual/#section.attribute.DefaultLanguageModelFactory.mergeResources):

Merge Lexical Resources

Merges stop words and stop labels from all known languages. If set to false, only stop words and stop labels of the active language will be used. If set to true, stop words from all LanguageCodes will be used together and stop labels from all languages will be used together, no matter the active language. Lexical resource merging is useful when clustering data in a mix of different languages and should increase clustering quality in such settings.
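
To make the difference between the two settings concrete, here is a toy sketch in plain Python. It is not Carrot2's implementation: the `STOP_WORDS` table, its tiny word lists, and the helper names are all hypothetical, and the sketch only mimics the behaviour described in the quoted documentation (active language only versus the union over all languages).

```python
# Hypothetical, tiny stop word lists; real lexical resources are far larger.
STOP_WORDS = {
    "en": {"the", "of", "and"},
    "es": {"el", "la", "de", "y"},
}


def effective_stop_words(active_language, merge_resources):
    """Return the stop words in effect for the given settings."""
    if merge_resources:
        # Merged mode: union of the stop words of every known language.
        merged = set()
        for words in STOP_WORDS.values():
            merged |= words
        return merged
    # Default mode: only the active language's stop words.
    return set(STOP_WORDS[active_language])


def filter_tokens(tokens, active_language, merge_resources=False):
    """Drop tokens that are stop words under the given settings."""
    stop = effective_stop_words(active_language, merge_resources)
    return [t for t in tokens if t.lower() not in stop]


tokens = "la casa of the king".split()
english_only = filter_tokens(tokens, "en")
merged = filter_tokens(tokens, "en", merge_resources=True)
print(english_only)  # ['la', 'casa', 'king']
print(merged)        # ['casa', 'king']
```

With merging off, the Spanish article "la" survives because only the English list is consulted; with merging on, it is filtered out even though the active language is English.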

4-Suppose I feed documents into Carrot2 Document Clustering Workbench in the following form:
 
<?xml version="1.0" encoding="UTF-8"?>
<searchresult>

<document id="0">
<title>king</title>
<snippet>dona_maria_de_las_mercedes_de_borbon la_mareta orleans king sar_dona_maria_de_las_mercedes barcelona casa_de_su_majestad_el_rey casa_del_rey familia_real</snippet>
<url>http://www.20000102_606_C33.source.es</url>
</document>
.......
.......
</searchresult>

Does the application use the title field or the snippet field?
Is this type of input correct for the application?

Well, you should definitely get some results :-) Carrot2 is suitable for clustering "natural text" into well-labeled clusters and is based heavily on frequent phrases (sequences of words). So if your documents are just "bags of features" without any meaningful sequences, I'm guessing that other algorithms may do better than Carrot2.
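
To illustrate the point about frequent phrases, here is a rough sketch in plain Python. It is neither Lingo nor STC, only a toy count of word sequences shared between documents; the `frequent_phrases` helper and both sample inputs are made up for the example.

```python
from collections import Counter


def frequent_phrases(snippets, n=2, min_count=2):
    """Return n-word phrases that occur in at least min_count snippets."""
    counts = Counter()
    for snippet in snippets:
        words = snippet.lower().split()
        # Use a set so each phrase is counted once per snippet.
        phrases = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(phrases)
    return {" ".join(p): c for p, c in counts.items() if c >= min_count}


# Natural text: the same word sequences recur across documents.
natural = [
    "the royal family visited barcelona",
    "members of the royal family attended",
]

# "Bag of features" input: tokens appear in arbitrary order.
features = [
    "familia_real barcelona casa_del_rey",
    "casa_del_rey orleans familia_real",
]

print(frequent_phrases(natural))   # shared phrases such as "royal family"
print(frequent_phrases(features))  # {} because no sequence repeats
```

In the second input the individual features do repeat, but no sequence of them does, so a phrase-based approach has little to build cluster labels from.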

Cheers,

S.
