Invalid Document

amine
Hi,

I installed carrot2-webapp-3.5.2 and want to add guardian.co.uk as a new document source. I followed the docs and came up with the following:

suites/source-guardian.xml

<component-suite>
  <sources>
    <source component-class="org.carrot2.source.opensearch.OpenSearchDocumentSource" id="guardian"
            attribute-sets-resource="source-guardian-attributes.xml">
      <label>Guardian</label>
      <title>News from The Guardian</title>
      <mnemonic>G</mnemonic>
      <description>Searches news from guardian.co.uk</description>
      <example-queries>
        <example-query>london</example-query>
        <example-query>nato</example-query>
        <example-query>riots</example-query>
      </example-queries>
    </source>
  </sources>
</component-suite>

suites/source-guardian-attributes.xml

<attribute-sets>
  <attribute-set id="jobs">
    <value-set>
      <label>Guardian</label>
      <attribute key="OpenSearchDocumentSource.feedUrlTemplate">
         <value type="java.lang.String" value="http://content.guardianapis.com/search?q=${searchTerms}&amp;format=xml&amp;api-key=xxx&amp;start=${startIndex}&amp;limit=${count}" />
      </attribute>
      <attribute key="OpenSearchDocumentSource.resultsPerPage">
        <value type="java.lang.Integer" value="50" />
      </attribute>
      <attribute key="OpenSearchDocumentSource.maximumResults">
        <value type="java.lang.Integer" value="400" />
      </attribute>
      <attribute key="XmlDocumentSource.xslt">
        <value>
          <wrapper class="org.carrot2.util.resource.FileResource" absolute-path="/xxx/carrot2-webapp-3.5.2/WEB-INF/suites/transform-guardian-to-carrot2.xsl"/>
        </value>
      </attribute>
    </value-set>
  </attribute-set>
</attribute-sets>

suites/transform-guardian-to-carrot2.xsl

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
     xmlns:media="http://search.yahoo.com/mrss">

  <xsl:output indent="yes" omit-xml-declaration="no"
       media-type="application/xml" encoding="UTF-8" />

  <xsl:template match="/">
    <searchresult>
      <query>Tomate</query>
      <xsl:for-each select="response/results/content">
        <document>
          <title><xsl:value-of select="@web-title" /></title>
          <snippet>
            snippet
          </snippet>
          <url><xsl:value-of select="@web-url" /></url>
        </document>
      </xsl:for-each>
    </searchresult>
  </xsl:template>
</xsl:stylesheet>

What's going on?

When I request this source by hand, take the XML response and run the transformation myself, the result matches what Carrot2 expects (according to: Doc-10.2.1 Carrot2 input XML format).
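For anyone who wants to repeat that manual check, here is a minimal sketch using the JDK's built-in XSLT support. The class name XsltCheck, the trimmed-down stylesheet and the sample response are assumptions made up for illustration; the sample only mirrors the response/results/content structure that the stylesheet above selects, and the real Guardian response may differ.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper: applies a stylesheet to an XML string outside of Carrot2. */
final class XsltCheck {

    // Trimmed-down version of transform-guardian-to-carrot2.xsl (title and url only).
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/'>"
        + "<searchresult>"
        + "<xsl:for-each select='response/results/content'>"
        + "<document>"
        + "<title><xsl:value-of select='@web-title'/></title>"
        + "<url><xsl:value-of select='@web-url'/></url>"
        + "</document>"
        + "</xsl:for-each>"
        + "</searchresult>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Made-up sample input; attribute names are taken from the stylesheet above.
    static final String SAMPLE =
        "<response><results>"
        + "<content web-title='Riots in London' web-url='http://example.com/1'/>"
        + "</results></response>";

    /** Runs the XSLT over the XML and returns the serialized result. */
    static String transform(String xslt, String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // The printed XML should contain one <document> per <content> element.
        System.out.println(transform(STYLESHEET, SAMPLE));
    }
}
```

Comparing the printed XML against the Carrot2 input format shows whether the stylesheet is at fault or whether the problem lies in what the source actually fetches.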

When I add my source to suite-webapp.xml and perform a search, I get an "Invalid Document" error.

Has anyone run into this?

JSON handling

Some services don't provide APIs that respond in XML format. Could you please tell me where I should start if I want to contribute JSON handling to the Carrot2 code?

Thanks a lot
Amine

Re: Invalid Document

Dawid Weiss-2
Can you attach these descriptor files, please? Or send them to my
e-mail directly.

Dawid

On Mon, Dec 5, 2011 at 7:40 AM, amine <[hidden email]> wrote:

_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Re: Invalid Document

amine
Hello Dawid,

Here are the files:

source-guardian.xml
source-guardian-attributes.xml
transform-guardian-to-carrot2.xsl

I switched from OpenSearchDocumentSource to XmlDocumentSource. The XSLT now works fine, but the searchTerms are not taken into account.

Thanks a lot,
Amine

Re: Invalid Document

Dawid Weiss-2
That's because XmlDocumentSource doesn't do any paging and supports a
different set of URL parameters; only these two, to be specific:

            attributes.put("query", (query != null ? query : ""));
            attributes.put("results", (results != -1 ? results : ""));
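
To illustrate why a ${searchTerms} placeholder stays unexpanded when only "query" and "results" are supplied, here is a minimal sketch. It assumes placeholders are filled by plain string substitution and that unknown ones are left alone; UrlTemplateSketch and expand are hypothetical names, not Carrot2's actual implementation, and the example.com URLs are placeholders.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: a naive ${name} substitution, not Carrot2's real code. */
final class UrlTemplateSketch {

    /** Replaces each ${key} with its value; placeholders without a value are left untouched. */
    static String expand(String template, Map<String, Object> attributes) {
        String result = template;
        for (Map.Entry<String, Object> entry : attributes.entrySet()) {
            result = result.replace("${" + entry.getKey() + "}",
                                    String.valueOf(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // The same two attributes as in the snippet above.
        Map<String, Object> attributes = new LinkedHashMap<>();
        attributes.put("query", "london");
        attributes.put("results", 50);

        // OpenSearch-style placeholder: there is no matching attribute, so it stays as-is.
        System.out.println(expand("http://example.com/search?q=${searchTerms}", attributes));

        // Placeholders named after the two attributes are filled in.
        System.out.println(expand("http://example.com/search?q=${query}&limit=${results}", attributes));
    }
}
```

Under these assumptions, a template written with OpenSearch-style names would need to be rewritten in terms of query and results before the search terms reach the remote service.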

From what I can see, the Guardian's XML format is not OpenSearch, so you
won't be able to use that source directly. The API returns paged XML or
JSON, so you either need to stick with XML and the first 50 (maximum)
results returned by the Guardian, or write your own IDocumentSource that
fetches more results and parses them (from JSON or XML) into Carrot2's
Document list for further processing.
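
As a rough sketch of the paging part of such a custom source, the snippet below only works out which page offsets would have to be requested. PagingSketch and pageStarts are hypothetical names, zero-based offsets are an assumption (the remote API may count differently), and the actual fetching, parsing and the IDocumentSource wiring are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: computes the start offsets a paged fetch loop would request. */
final class PagingSketch {

    /** Returns the start offset of every page needed to cover up to maxResults results. */
    static List<Integer> pageStarts(int resultsPerPage, int maxResults) {
        if (resultsPerPage <= 0) {
            throw new IllegalArgumentException("resultsPerPage must be positive");
        }
        List<Integer> starts = new ArrayList<>();
        for (int start = 0; start < maxResults; start += resultsPerPage) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Values from source-guardian-attributes.xml: 50 per page, 400 maximum.
        for (int start : pageStarts(50, 400)) {
            // A real source would fetch and parse one page here, then add the
            // parsed entries to the list of documents handed to Carrot2.
            System.out.println("would request results starting at " + start);
        }
    }
}
```

With the values from the attribute file this comes to eight requests per query, so a real source would probably also want to stop early once a page comes back with fewer results than requested.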

Dawid

On Mon, Dec 5, 2011 at 8:30 AM, amine <[hidden email]> wrote:

> Hello Dawid,
>
> Here are the files:
>
> http://carrot2-users-and-developers-forum.607571.n2.nabble.com/file/n7062024/source-guardian.xml
> source-guardian.xml
> http://carrot2-users-and-developers-forum.607571.n2.nabble.com/file/n7062024/source-guardian-attributes.xml
> source-guardian-attributes.xml
> http://carrot2-users-and-developers-forum.607571.n2.nabble.com/file/n7062024/transform-guardian-to-carrot2.xsl
> transform-guardian-to-carrot2.xsl
