integration with MS FSIS / FAST ESP and adding new language - Croatian

zvonimirm
Hi,

I would like to integrate Carrot2 with our Microsoft FSIS (FAST ESP 5.3) search and also add support for the Croatian language.
Does anybody have any experience with FAST ESP 5.3 (or MS FSIS), or should I focus on XML as input?
We also have access to advanced language tools for Croatian and would like to use them with Carrot2, so any pointers on how to do it would be greatly appreciated :).

Best regards,

Zvonimir

Re: integration with MS FSIS / FAST ESP and adding new language - Croatian

Dawid Weiss-2
Hi Zvonimir.

> Does anybody have any experience with FAST ESP 5.3 (or MS FSIS), or should
> I focus on XML as input?

Not that I'm aware of. If there is a way to fetch results from FAST
ESP in XML, then Carrot2 can be configured to fetch them and
potentially convert them to the required format using XSLT. I would
start by trying to fetch these results using the Workbench -- see here:

http://download.carrot2.org/head/manual/index.html#section.getting-started.xml-files

Once you have this working, configuring the XML document source with
an XSLT shouldn't be much of an issue.
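
To make the XSLT step concrete, here is a minimal Java sketch that converts a search result XML into the layout Carrot2 reads from XML files. The input format (`<results>`, `<hit>`, `<teaser>`) is entirely hypothetical and is not the actual FAST ESP output; the output element names (`searchresult`, `query`, `document`, `title`, `url`, `snippet`) follow the Carrot2 XML format as recalled from the manual and should be checked against the page linked above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class FastXmlToCarrot2 {
    // Hypothetical stylesheet: maps each <hit> of the assumed input format
    // to a Carrot2-style <document> element.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/results'>"
      + "<searchresult>"
      + "<query><xsl:value-of select='@query'/></query>"
      + "<xsl:for-each select='hit'>"
      + "<document id='{position() - 1}'>"
      + "<title><xsl:value-of select='title'/></title>"
      + "<url><xsl:value-of select='url'/></url>"
      + "<snippet><xsl:value-of select='teaser'/></snippet>"
      + "</document>"
      + "</xsl:for-each>"
      + "</searchresult>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the given XML string and returns the result. */
    static String convert(String sourceXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(sourceXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT conversion failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample input, for illustration only.
        String input = "<results query='zagreb'>"
            + "<hit><title>Zagreb</title><url>http://example.com/1</url>"
            + "<teaser>Capital of Croatia</teaser></hit>"
            + "</results>";
        System.out.println(convert(input));
    }
}
```

Once a stylesheet like this works on a saved response, the same XSLT can be handed to the XML document source instead of being applied by hand.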

A lot also depends on non-functional requirements -- if your query
rate is high then you'll probably want to configure Carrot2 as a
service queried by your application (so that clusters can be displayed
lazily after or during the rest of the application loading time). For
this the DCS (Document Clustering Server) component would probably be
ideal; it's REST over HTTP.

All depends on the use case, in other words.

> We also have access to advanced language tools for Croatian and would
> like to use them with Carrot2, so any pointers on how to do it
> would be greatly appreciated :).

I assume these tools you mention are not open source, are they?
Because if they are it'd be nice to integrate them into the codebase.
Let me know what you have and how it works; I'll try to help with the
integration even if they're proprietary.

Dawid

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

RE: integration with MS FSIS / FAST ESP and adding new language - Croatian

zvonimirm

Hi Dawid,

Thank you for the suggestion about integrating with FAST ESP. I will also try the Carrot2 C# API; does it have the same feature set as the DCS?

Unfortunately, the language tools we have come from a third party, we can use them only for this project, and they are not open source.

What do we need to implement Croatian in Carrot2? We can probably get it if needed.

Best regards,

Zvonimir

Re: integration with MS FSIS / FAST ESP and adding new language - Croatian

Dawid Weiss-2
Hi Zvonimir,

> Thank you for the suggestion about integrating with FAST ESP. I will also
> try the Carrot2 C# API; does it have the same feature set as the DCS?

The C# API is a cross-compilation from Java to .NET using IKVM. We
don't publish the build process or the tiny tweaks that we have made
(to allow assembly resources etc.). The C# port is, in other words,
free but not open source. The C# API will allow you to use Carrot2
clustering components from C# directly but will not be suitable for
code-level customizations unless you recompile Carrot2 yourself (not a
big task, but it requires some time to get familiar with IKVM).

> Unfortunately, the language tools we have come from a third party, we can
> use them only for this project, and they are not open source.

Ok, clear.

> What do we need to implement Croatian in Carrot2? We can probably get it
> if needed.

I'll need to add rudimentary Croatian support so that it appears in the
list of supported languages first... I'll do it today. If you want to
override the entire lexical pipeline, then look at the example class
UsingCustomLanguageModel. It declares a custom stemmer, tokenizer and
lexical data (stop words). For now you'll need to either ignore the
LanguageCode and use Croatian for everything or pick one of the
languages that are present and substitute its resources with yours.

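        // Replace the default stemmer, tokenizer and lexical data (stop words)
        // with custom factories.
        Map<String, Object> attrs = Maps.newHashMap();
        BasicPreprocessingPipelineDescriptor.attributeBuilder(attrs)
            .stemmerFactory(CustomStemmerFactory.class)
            .tokenizerFactory(CustomTokenizerFactory.class)
            .lexicalDataFactory(CustomLexicalDataFactory.class);
        // Pass the overrides to the controller as initialization attributes.
        controller.init(attrs);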

Like I said, it will require some coding and modifications to the
baseline codebase if you have proprietary tools.
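
As an illustration of the kind of logic a custom stemmer factory would eventually delegate to, here is a small standalone sketch. It deliberately does not implement any Carrot2 interface (the real ones are shown in UsingCustomLanguageModel), and the suffix list is a toy assumption for demonstration, not real Croatian morphology; a proprietary tool would take its place.

```java
import java.util.Locale;

public class CroatianStemmerSketch {
    // Toy suffix list, longest first. An assumption for illustration only.
    static final String[] SUFFIXES = {"ama", "ima", "om", "a", "e", "i", "u", "o"};

    // Never strip a suffix if fewer than this many characters would remain.
    static final int MIN_STEM_LENGTH = 3;

    /** Lower-cases the word and strips the first matching suffix, if any. */
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= MIN_STEM_LENGTH) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        // Two inflected forms of the same noun should map to one stem.
        System.out.println(stem("knjiga"));
        System.out.println(stem("knjigama"));
    }
}
```

The point of stemming here is only that inflected variants of a word are counted together when cluster labels are formed; the quality of the stems themselves depends entirely on the tool behind the factory.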

Dawid

RE: integration with MS FSIS / FAST ESP and adding new language - Croatian

zvonimirm

Hi Dawid,

We are mostly .NET programmers. If we manage to recompile the C# API, can we make the needed changes directly in C#, or do they have to be made in the original Java source code?

Especially if changes are needed to support Croatian.

Best regards,

Zvonimir


Re: integration with MS FSIS / FAST ESP and adding new language - Croatian

Dawid Weiss-2
You will need to modify Java code and compile it to .NET using IKVM.
The result is an assembly that you'll be able to use from C#. The Java
examples will work for you when ported, although class and method
naming may look weird to a C# programmer. Like I said, what we
distribute as the C# API layer is not open source and you won't have
access to that.

Dawid


RE: integration with MS FSIS / FAST ESP and adding new language - Croatian

zvonimirm

Thank you for the explanation.

Best regards,

Zvonimir


Re: integration with MS FSIS / FAST ESP and adding new language - Croatian

Dawid Weiss-2
Alternatively, modify the Java source code to your liking and build the
DCS, then from C# use the DCS as a REST (XML over HTTP) service. That
may be simpler to start with.
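
A rough sketch of what such a client call could look like follows, written in Java to match the other snippet in this thread; the same HTTP request can be issued from C#. The endpoint path and the parameter names `dcs.c2stream` and `dcs.output.format` are assumptions recalled from the Carrot2 3.x DCS documentation and should be verified against the DCS version you build.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class DcsClientSketch {
    /** Builds a multipart/form-data body with one text part per field. */
    static String multipartBody(Map<String, String> fields, String boundary) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("--").append(boundary).append("\r\n")
              .append("Content-Disposition: form-data; name=\"")
              .append(e.getKey()).append("\"\r\n")
              .append("Content-Type: text/plain; charset=UTF-8\r\n\r\n")
              .append(e.getValue()).append("\r\n");
        }
        return sb.append("--").append(boundary).append("--\r\n").toString();
    }

    /** POSTs documents in Carrot2 XML to the DCS and returns the raw response. */
    static String cluster(String endpoint, String carrot2Xml) throws IOException {
        String boundary = "dcs-sketch-" + System.nanoTime();
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("dcs.c2stream", carrot2Xml);   // assumed parameter name
        fields.put("dcs.output.format", "XML");   // assumed parameter name
        byte[] body = multipartBody(fields, boundary).getBytes(StandardCharsets.UTF_8);

        HttpURLConnection conn =
            (HttpURLConnection) URI.create(endpoint).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type",
            "multipart/form-data; boundary=" + boundary);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

A call such as `cluster("http://localhost:8080/dcs/rest", xml)` assumes a locally running DCS at that address; the host, port and path are placeholders.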

Dawid


Re: integration with MS FSIS / FAST ESP and adding new language - Croatian

Dawid Weiss-2
Hi Zvonimir,

Could you take a look at this issue:
http://issues.carrot2.org/browse/CARROT-946

and verify that stopwords.hk (UTF-8 encoded plain text) indeed
contains sensible stop words for Croatian? Stop words should be
high-frequency function words that typically wouldn't be used on their
own and wouldn't form cluster labels.
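
For a quick sanity check of such a file outside Carrot2, a sketch along these lines can help. It assumes one stop word per line and `#` for comment lines, which is an assumption about the file layout rather than something stated in this thread; the sample words are common Croatian function words used purely as test data.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class StopWordsSketch {
    /** Parses stop words from text: one word per line, '#' starts a comment line. */
    static Set<String> parse(String text) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : text.split("\\R")) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty() && !word.startsWith("#")) {
                words.add(word);
            }
        }
        return words;
    }

    /** Case-insensitive membership test against the parsed stop word set. */
    static boolean isStopWord(Set<String> stopWords, String token) {
        return stopWords.contains(token.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // In practice the text would come from the UTF-8 file, for example via
        // Files.readString(path, StandardCharsets.UTF_8).
        Set<String> stopWords = parse("# sample\ni\nu\nna\nje\n");
        System.out.println(isStopWord(stopWords, "Na"));
        System.out.println(isStopWord(stopWords, "Zagreb"));
    }
}
```

Running real query results through a check like this shows whether any content words, which could be useful cluster labels, have slipped into the list.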

Dawid


RE: integration with MS FSIS / FAST ESP and adding new language - Croatian

zvonimirm

Hi Dawid,

The stop words are Croatian ones and look fine.

Just a small question about the file extension, .hk: is this meant for Croatian (Hrvatski)? The standard code should be hr.

Best regards,

Zvonimir

Re: integration with MS FSIS / FAST ESP and adding new language - Croatian

Dawid Weiss-2
> (Hrvatski)? The standard code should be hr.

A mistake on my part -- will fix, thanks for pointing this out!

Dawid
