Mapping between Lucene and Carrot document fields

milos
Hello,
Sorry for the long post, but I must solve this problem...

I have a Lucene index with the following fields:

title (field.index.tokenized, field.store.yes)
body  (field.index.tokenized, field.store.no)
meta  (field.index.tokenized, field.store.yes)
classes (field.index.tokenized, field.store.yes)
summary (field.index.no, field.store.yes)
url     (field.index.no, field.store.yes)
date  (field.index.no, field.store.yes)

I understood that when I issue the query to Carrot2 it searches over ALL
searchable fields (field.index.tokenized) in Lucene index to obtain Lucene
documents that will be further clustered (in my case title, body, meta and
classes).

Question 1: is that true?

Now Carrot2 performs clustering on retrieved Lucene docs using text from
Lucene fields that are defined using SimpleFieldMapper.contentField and
SimpleFieldMapper.titleField and the mapping is stored in
lucene.atributes.xml file of my webapp. I put the following mapping:
titleField = title and contentField = summary.

Question 2: Carrot2 will now perform clustering using title and summary
lucene fields only?

Question 3: Carrot2 will perform clustering using the summary field even
though it is not tokenized in the Lucene index, but just stored in it?

Question 4: Is it possible to tell Carrot2 to cluster not only on title
and summary, but also on classes and meta Lucene fields (since they are
also stored in my index)? If yes how (merging fields into summary is not
ok since then I'll have ugly display in summary for the doc)?

Question 5: Is it possible for each document to display in results page
not only title, summary and url, but also classes, meta and date
information by adding these 3 fields to carrot2 Document object when it is
created and then to change somehow documents.xsl to show additional
fields?

Regards, Milos




------------------------------------------------------------------------------
Create and Deploy Rich Internet Apps outside the browser with Adobe(R)AIR(TM)
software. With Adobe AIR, Ajax developers can use existing skills and code to
build responsive, highly engaging applications that combine the power of local
resources and data with the reach of the web. Download the Adobe AIR SDK and
Ajax docs to start building applications today-http://p.sf.net/sfu/adobe-com
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Re: Mapping between Lucene and Carrot document fields

Stanislaw Osinski
Administrator
> I have a Lucene index with following fields:
>
> title (field.index.tokenized, field.store.yes)
> body  (field.index.tokenized, field.store.no)
> meta  (field.index.tokenized, field.store.yes)
> classes (field.index.tokenized, field.store.yes)
> summary (field.index.no, field.store.yes)
> url     (field.index.no, field.store.yes)
> date  (field.index.no, field.store.yes)
>
> I understood that when I issue the query to Carrot2 it searches over ALL
> searchable fields (field.index.tokenized) in Lucene index to obtain Lucene
> documents that will be further clustered (in my case title, body, meta and
> classes).
>
> Question 1: is that true?

No. If you use the default SimpleFieldMapper (http://download.carrot2.org/head/javadoc/org/carrot2/source/lucene/SimpleFieldMapper.html), only the fields that get mapped to document title and snippet will be searched.
 
> Now Carrot2 performs clustering on retrieved Lucene docs using text from
> Lucene fields that are defined using SimpleFieldMapper.contentField and
> SimpleFieldMapper.titleField and the mapping is stored in
> lucene.atributes.xml file of my webapp. I put the following mapping:
> titleField = title and contentField = summary.
>
> Question 2: Carrot2 will now perform clustering using title and summary
> lucene fields only?

Yes.
 
> Question 3: Carrot2 will perform clustering using the summary field even
> though it is not tokenized in the Lucene index, but just stored in it?

Yes, tokenization does not matter, Carrot2 will always use the raw text and perform tokenization on its own.
 
> Question 4: Is it possible to tell Carrot2 to cluster not only on title
> and summary, but also on classes and meta Lucene fields (since they are
> also stored in my index)? If yes how (merging fields into summary is not
> ok since then I'll have ugly display in summary for the doc)?

Not at the moment. Carrot2 does not have an algorithm that can cluster based on both unstructured text and some 'structured' fields, like class. This concept is known as 'semi-supervised clustering'. I'm currently researching whether we can use a similar approach in Carrot2, but it will be a long time before the results make it into the public repository. I'll let you know when I get some promising results, though.

> Question 5: Is it possible for each document to display in results page
> not only title, summary and url, but also classes, meta and date
> information by adding these 3 fields to carrot2 Document object when it is
> created and then to change somehow documents.xsl to show additional
> fields?

Right, this is what I missed when answering your previous post :-) To achieve this, you'd need to do two things:

1. Modify documents.xsl in the way I described in the previous post
2. Create your own implementation of IFieldMapper (http://download.carrot2.org/head/javadoc/org/carrot2/source/lucene/IFieldMapper.html) that adds the extra fields you want to display to the document.
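
As a rough illustration of step 2, the sketch below shows the general shape of a mapper that copies extra stored fields (classes, meta, date) onto the document next to the usual title, snippet and URL. It deliberately does not use the real Carrot2 or Lucene classes: `FieldMapperSketch`, `ExtraFieldMapper`, `mapExample` and the map-based documents are hypothetical stand-ins so that the example runs on its own. A real implementation would implement `org.carrot2.source.lucene.IFieldMapper`, and its actual method signatures should be checked against the Javadoc linked above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtraFieldMapperSketch {

    /** Hypothetical stand-in for Carrot2's IFieldMapper; NOT the real interface. */
    public interface FieldMapperSketch {
        void map(Map<String, String> storedLuceneFields, Map<String, Object> carrotDocument);
    }

    /** Copies the standard fields plus a few extra stored fields for display. */
    public static class ExtraFieldMapper implements FieldMapperSketch {
        // Extra stored fields from the index described in this thread.
        private static final String[] EXTRA_FIELDS = {"classes", "meta", "date"};

        @Override
        public void map(Map<String, String> stored, Map<String, Object> doc) {
            // Standard mapping: title -> title, summary -> snippet, url -> url.
            doc.put("title", stored.get("title"));
            doc.put("snippet", stored.get("summary"));
            doc.put("url", stored.get("url"));

            // Extra fields are attached under their own names so that the
            // presentation layer (e.g. documents.xsl) can render them.
            for (String name : EXTRA_FIELDS) {
                String value = stored.get(name);
                if (value != null) {
                    doc.put(name, value);
                }
            }
        }
    }

    /** Maps one made-up document; all values here are illustrative only. */
    public static Map<String, Object> mapExample() {
        Map<String, String> stored = new LinkedHashMap<>();
        stored.put("title", "Example title");
        stored.put("summary", "Example summary text");
        stored.put("url", "http://example.com/doc1");
        stored.put("classes", "news sports");
        stored.put("meta", "example keywords");
        // "date" is deliberately missing to show that absent fields are skipped.

        Map<String, Object> doc = new LinkedHashMap<>();
        new ExtraFieldMapper().map(stored, doc);
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(mapExample());
    }
}
```

Only stored fields can be copied this way; a field such as body (field.store.no) has no stored value to attach.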

S.


Re: Mapping between Lucene and Carrot document fields

milos

Hello,

>> Question 5: Is it possible for each document to display in results page
>> not only title, summary and url, but also classes, meta and date
>> information by adding these 3 fields to carrot2 Document object when it
>> is created and then to change somehow documents.xsl to show additional
>> fields?
>
>
> Right, this is what I missed when answering your previous post :-) To
> achieve this, you'd need to do two things:
>
> 1. Modify documents.xsl in the way I described in the previous post
> 2. Create your own implementation of IFieldMapper (
> http://download.carrot2.org/head/javadoc/org/carrot2/source/lucene/IFieldMapper.html)
> that adds the extra fields you want to display to the document.

I changed Carrot2 Document to include more fields from Lucene index and
changed the documents.xsl as you suggested and everything works fine!
Thank you :)

But still..

>>
>> I have a Lucene index with following fields:
>>
>> title (field.index.tokenized, field.store.yes)
>> body  (field.index.tokenized, field.store.no)
>> meta  (field.index.tokenized, field.store.yes)
>> classes (field.index.tokenized, field.store.yes)
>> summary (field.index.no, field.store.yes)
>> url     (field.index.no, field.store.yes)
>> date  (field.index.no, field.store.yes)
>>
>> I understood that when I issue the query to Carrot2 it searches over ALL
>> searchable fields (field.index.tokenized) in Lucene index to obtain
>> Lucene documents that will be further clustered (in my case title, body,
>> meta and classes).
>>
>> Question 1: is that true?
>
>
> No. If you use the default SimpleFieldMapper (
> http://download.carrot2.org/head/javadoc/org/carrot2/source/lucene/SimpleFieldMapper.html),
> only the fields that get mapped to document title and snippet will be
> searched.
>

... is there any way to tell Carrot2 to search over the title, body and meta
fields? I mean, to change the default behaviour of SimpleFieldMapper. I
noticed that if I explicitly specify a Lucene field other than the default
in the query, everything is OK. But that is an ugly way to search over a
Lucene index.

Best regards, Milos




Re: Mapping between Lucene and Carrot document fields

Stanislaw Osinski
Administrator
> ... is there any way to tell Carrot2 to search over the title, body and meta
> fields? I mean, to change the default behaviour of SimpleFieldMapper. I
> noticed that if I explicitly specify a Lucene field other than the default
> in the query, everything is OK. But that is an ugly way to search over a
> Lucene index.

Simply write your own IFieldMapper and implement the getSearchFields() method to return the names of the fields you want to search.
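
A minimal sketch of the idea, under stated assumptions: `SearchFieldSource` and `MultiFieldMapper` are hypothetical stand-ins rather than the real Carrot2 types, and only the method name `getSearchFields()` comes from this thread. The helper `expandTerm` is purely illustrative; it spells out, in Lucene query syntax, roughly what searching one term across several fields amounts to, which is what had to be typed by hand before.

```java
import java.util.ArrayList;
import java.util.List;

public class SearchFieldsSketch {

    /** Hypothetical stand-in for the search-field part of IFieldMapper. */
    public interface SearchFieldSource {
        String[] getSearchFields();
    }

    /** Returns a fixed list of index field names to search. */
    public static class MultiFieldMapper implements SearchFieldSource {
        private final String[] searchFields;

        public MultiFieldMapper(String... searchFields) {
            // Defensive copy so callers cannot change the field list later.
            this.searchFields = searchFields.clone();
        }

        @Override
        public String[] getSearchFields() {
            return searchFields.clone();
        }
    }

    /**
     * Illustration only: expands a single bare term into an explicit
     * per-field query string such as "title:x OR body:x".
     * Does not handle phrases or special characters.
     */
    public static String expandTerm(String term, String[] fields) {
        List<String> parts = new ArrayList<>();
        for (String field : fields) {
            parts.add(field + ":" + term);
        }
        return String.join(" OR ", parts);
    }

    public static void main(String[] args) {
        SearchFieldSource mapper = new MultiFieldMapper("title", "body", "meta");
        // prints: title:carrot OR body:carrot OR meta:carrot
        System.out.println(expandTerm("carrot", mapper.getSearchFields()));
    }
}
```

Note that searching does not require a field to be stored, only indexed, so body can be listed here even though it cannot be shown or clustered.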

S.
