Problem with clustering results from multiple sources


knowledge
Hello everybody! I have a problem with clustering the results from more than one source.

I know that for this I should write my own program, and that a good example is the etools clustering, which clusters the results from Google and etools.
With this in mind I clustered two different Solr sources and one Google source with carrot2 3.6.2, and it worked great.

Then I realized that carrot2 latest version was 3.7 and I tried to do the same, but without any results.

The problem is that the same type of sources have to get the same attribute.

Am I missing something? Is there any other way to do it?

Re: Problem with clustering results from multiple sources

Dawid Weiss-2
> Then I realized that carrot2 latest version was 3.7 and I tried to do the
> same, but without any results.
> The problem is that the same type of sources have to get the same attribute.

Can you post your source and an example that doesn't work on github or
as a gist? It'd be easier to help you if I saw the code.

Dawid

------------------------------------------------------------------------------
Try New Relic Now & We'll Send You this Cool Shirt
New Relic is the only SaaS-based application performance monitoring service
that delivers powerful full stack analytics. Optimize and monitor your
browser, app, & servers with just a few lines of code. Try New Relic
and get this awesome Nerd Life shirt! http://p.sf.net/sfu/newrelic_d2d_apr
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Re: Problem with clustering results from multiple sources

knowledge
I have posted a sample of my code to
https://gist.github.com/vcharpas/5452799


2013/4/24 Dawid Weiss <dawid.weiss@cs.put.poznan.pl>

> Can you post your source and an example that doesn't work on github or
> as a gist? It'd be easier to help you if I saw the code.
>
> Dawid

Re: Problem with clustering results from multiple sources

knowledge
Excuse me, that gist (https://gist.github.com/vcharpas/5452799) is my previous code, which worked with 3.6.2.

The new version of the code is https://gist.github.com/vcharpas/5452969



Re: Problem with clustering results from multiple sources

Dawid Weiss-2
There have been some changes to SolrDocumentSource but I honestly
don't think they should affect the source you pasted. solrIdFieldName
is not even obligatory so you don't have to pass it, unless you want
Solr-side clusters or highlighter output... This may be it, now that I
think of it. There is a functionally incompatible change to
SolrDocumentSource -- if you have both a field and its highlighted
fragments then (as of 3.7.0) the highlighter will take precedence. You
can disable this by setting solr*.useHighlighterOutput to false. Let
me know if this worked.

If not, describe what's not working because I'm not sure if I get you.

Dawid

On Wed, Apr 24, 2013 at 5:24 PM, Vassilis Charpantidis
<[hidden email]> wrote:

> Excuse me, this is my previous code, that worked,
> https://gist.github.com/vcharpas/5452799 with the 3.6.2.
>
> The the new version of code is https://gist.github.com/vcharpas/5452969


Re: Problem with clustering results from multiple sources

knowledge
In reply to this post by Dawid Weiss-2
Thank you so much for your answer. I think the problem is exactly what you describe. I switched to 3.7 for the highlighting.

The code crashes in the Document class, where the following lines throw an error:

if (doc.getStringId() != null) {
    if (!ids.add(doc.getStringId())) {
        throw new IllegalArgumentException(
            "Identifiers must be unique, duplicated identifier: " + doc.getStringId());
    }
}

I tried commenting out the throw lines, but then the clustering crashes, although the results are displayed correctly.
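
For anyone reading along, the check quoted above can be reproduced in isolation. What follows is only a minimal sketch, not Carrot2 code: the class name `DuplicateIdCheck`, the helper `firstDuplicate` and the sample identifiers are all made up for illustration. It assumes that two sources each number their documents independently, so the same identifier shows up twice once the results are merged.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateIdCheck {
    /**
     * Returns the first identifier that occurs twice, or null if all
     * non-null identifiers are unique. Null identifiers are skipped,
     * mirroring the null guard in the snippet quoted above.
     */
    static String firstDuplicate(List<String> ids) {
        Set<String> seen = new HashSet<>();
        for (String id : ids) {
            // Set.add returns false when the element is already present.
            if (id != null && !seen.add(id)) {
                return id;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical merged result list: two sources both start at "0",
        // and two further documents carry no identifier at all.
        List<String> merged = Arrays.asList("0", "1", "0", null, null);
        System.out.println(firstDuplicate(merged)); // prints 0
    }
}
```

Documents without an identifier never trip the check, which is why dropping the ID is one way around the exception.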

Re: Problem with clustering results from multiple sources

Dawid Weiss-2
In reply to this post by knowledge
If you specify the ID field then:

1) make sure it is indeed present in Solr search results (it needs to
be returned in the set of returned fields to match with highlighter
output);
2) make sure it is indeed unique for each document;
3) make sure you didn't make a typo in the name of that ID field by accident.

An ID field in Solr is typically defined in the schema with uniqueKey,
for example:

<uniqueKey>id</uniqueKey>

If this still doesn't work please attach an example of Solr's output
and I'll dig.

Dawid

From what you're describing it seems that your ID field


Re: Problem with clustering results from multiple sources

knowledge
I've been experimenting with the code and I found the following:
1. When I use only the default Solr source, everything is OK (including the use of highlighting).
2. I made a simpler example and found that the problem is not caused by having two Solr sources, but by having a Solr source in the mix at all. Even a single Solr source combined with another source causes the same problem.

I have deleted the previous gist documents and created a new one with my complete code of the simple case of 1 solr source and 1 google source https://gist.github.com/vcharpas/5459392

Re: Problem with clustering results from multiple sources

Dawid Weiss-2
Ah, yep -- true; if your Document instances are aggregated from
multiple sources you must implement the logic required to make
Document identifiers unique (or leave them empty). So inside your
implementation of WebDocumentSource copy over (clone) the Document
instances from all Solr instances and from Google and do *not* copy
the ID field (or copy it over but make sure it is unique across the
merged collection).
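
The second option (copy the ID over but keep it unique) could look roughly like the sketch below. It is only an illustration under stated assumptions: `Doc` is a hypothetical stand-in and not the Carrot2 Document class, and `merge`, the source names and the ":" separator are invented for the example. It also assumes each source's IDs are already unique within that source and that source names are distinct and contain no ":".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeSources {
    /** Hypothetical stand-in for a search result; NOT the Carrot2 Document class. */
    static final class Doc {
        final String id;    // may be null when the source provides no identifier
        final String title;

        Doc(String id, String title) {
            this.id = id;
            this.title = title;
        }
    }

    /**
     * Copies the documents of every source into one list. Each non-null
     * identifier is prefixed with its source name so that equal IDs coming
     * from different sources no longer collide; null IDs are left empty.
     */
    static List<Doc> merge(Map<String, List<Doc>> bySource) {
        List<Doc> merged = new ArrayList<>();
        for (Map.Entry<String, List<Doc>> entry : bySource.entrySet()) {
            for (Doc d : entry.getValue()) {
                String id = (d.id == null) ? null : entry.getKey() + ":" + d.id;
                merged.add(new Doc(id, d.title));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<Doc>> bySource = new LinkedHashMap<>();
        bySource.put("solr", Arrays.asList(new Doc("1", "A"), new Doc("2", "B")));
        bySource.put("google", Arrays.asList(new Doc("1", "C"), new Doc(null, "D")));

        for (Doc d : merge(bySource)) {
            System.out.println(d.id + " " + d.title);
        }
    }
}
```

Dropping the ID altogether is simpler and is enough when nothing downstream needs to map a clustered document back to its Solr record; the prefixing variant keeps that mapping available.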

Dawid

On Thu, Apr 25, 2013 at 2:49 PM, knowledge <[hidden email]> wrote:

> I've been experimenting with the code and I realized the followings:
> 1. When I use the default solr source only, everything is ok (including the
> use of highlighting)
> 2. I made a simpler example and I found that the problem was not the
> existence of two solr sources, but the existence of them. Only 1 Solr source
> can cause the same problem.
>
> I have deleted the previous gist documents and created a new one with my
> complete code of the simple case of 1 solr source and 1 google source
> https://gist.github.com/vcharpas/5459392

Re: Problem with clustering results from multiple sources

knowledge
Thank you very much! That was the answer. To save someone else the effort, here is the fix. I replaced

response.results.add(doc);

with

Document temp = new Document(doc.getTitle(), doc.getSummary(), doc.getContentUrl());
temp.setField(Document.SOURCES, Lists.newArrayList("MySolrSource"));
response.results.add(temp);

