I think I find a bug in org.carrot2.text.preprocessing.LanguageModelStemmer

taojian
Hello,
I think I may have found a bug in org.carrot2.text.preprocessing.LanguageModelStemmer.

The function addStemStatistics sets the attributes in preprocessContext.allStems.
The resulting allStems.tfByDocument arrays may not be ordered by document index.
For example, tfByDocument[stemIndex] may be {2, 15, 8, 20, 4, 2, 6, 30}: the array holds (document index, term frequency) pairs, so this stem appears in documents 2, 8, 4 and 6, in that order.

However, in the function buildTermDocumentMatrix of TermDocumentMatrixBuilder, Carrot2 computes the weights as follows:
           
            int tfByDocumentIndex = 0;
            for (int documentIndex = 0; documentIndex < documentCount; documentIndex++)
            {
                if (tfByDocumentIndex * 2 < tfByDocument.length
                    && tfByDocument[tfByDocumentIndex * 2] == documentIndex)
                {
                    double weight = termWeighting.calculateTermWeight(
                        tfByDocument[tfByDocumentIndex * 2 + 1], df, documentCount);

                    weight *= getWeightBoost(titleFieldIndex, fieldIndices);
                    tfByDocumentIndex++;

                    tdMatrix.set(i, documentIndex, weight);
                }
            }

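For the above example, the weights of documents 4 and 6 for stemIndex in the matrix stay 0: once the entries for documents 2 and 8 have been matched, documentIndex has already passed 4 and 6, so those entries are never reached. I think this is inconsistent with the actual data.

Waiting for your answer. ^^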
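
To make the effect concrete, here is a minimal, self-contained sketch (it is not the Carrot2 source). It assumes the layout implied by the loop above, where tfByDocument holds (document index, term frequency) pairs, and it stores the raw term frequency instead of calling termWeighting.calculateTermWeight. The class and method names are made up for illustration, and the second method is only one possible order-independent loop, not necessarily how the library should be fixed.

```java
import java.util.Arrays;

final class TfByDocumentLoopDemo {
    /** Same control flow as the quoted loop: only works if document indices ascend. */
    static double[] orderDependent(int[] tfByDocument, int documentCount) {
        double[] row = new double[documentCount];
        int tfByDocumentIndex = 0;
        for (int documentIndex = 0; documentIndex < documentCount; documentIndex++) {
            if (tfByDocumentIndex * 2 < tfByDocument.length
                && tfByDocument[tfByDocumentIndex * 2] == documentIndex) {
                // Raw tf stands in for the real term weight (assumption).
                row[documentIndex] = tfByDocument[tfByDocumentIndex * 2 + 1];
                tfByDocumentIndex++;
            }
        }
        return row;
    }

    /** Hypothetical alternative: walk the pairs themselves, so their order is irrelevant. */
    static double[] orderIndependent(int[] tfByDocument, int documentCount) {
        double[] row = new double[documentCount];
        for (int j = 0; j + 1 < tfByDocument.length; j += 2) {
            row[tfByDocument[j]] = tfByDocument[j + 1];
        }
        return row;
    }

    public static void main(String[] args) {
        int[] tfByDocument = {2, 15, 8, 20, 4, 2, 6, 30};
        System.out.println(Arrays.toString(orderDependent(tfByDocument, 10)));
        System.out.println(Arrays.toString(orderIndependent(tfByDocument, 10)));
    }
}
```

With the example array, the first method leaves the cells for documents 4 and 6 at zero, while the second fills all four documents.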

       

Re: I think I find a bug in org.carrot2.text.preprocessing.LanguageModelStemmer

Dawid Weiss-2
I'll look into this, Taojian, but in general, if you could provide a
failing JUnit test case, it'd be much simpler to investigate and fix :)
There are lots of unit tests for the preprocessing context, so it shouldn't
be too hard to write a new one based on those.

Dawid

In reply to the post by taojian of Wed, Apr 11, 2012 at 4:11 AM.

_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Re: I think I find a bug in org.carrot2.text.preprocessing.LanguageModelStemmer

Dawid Weiss-2
This indeed looks like something is not right. Not only the indices
but also the logic in that loop is off. We're looking into this; I
filed a bug to track it:

http://issues.carrot2.org/browse/CARROT-905

Thanks for reporting and keep digging -- I'm sure there are plenty of
issues left! :)

Dawid

In reply to the post by Dawid Weiss of Wed, Apr 11, 2012 at 9:39 AM.


Re: I think I find a bug in org.carrot2.text.preprocessing.LanguageModelStemmer

Stanislaw Osinski
Administrator
In reply to this post by taojian
Thanks for the report, good catch! I've just pushed a fix for the issue.

Out of curiosity -- did you get the incorrect results when using TermDocumentMatrixBuilder along with the Carrot2 PreprocessingPipeline? Looking quickly at the preprocessing pipeline, document indexes in tfByDocument arrays should actually be increasing (because the tokens array is built in the order of documents on input and then all sorting is done using IndirectSorter.mergesort(), which is stable), so the bug should not manifest itself.
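
The stability argument can be illustrated with a small stand-alone sketch. It does not use IndirectSorter; it relies on java.util's List.sort, which is also a stable merge sort, as a stand-in. The Token class and the method names are invented for this example. Tokens are added in document order and then sorted by word only, so occurrences of the same word keep their original (ascending) document order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class StableSortDemo {
    /** One token occurrence: the word and the document it came from. */
    static final class Token {
        final String word;
        final int documentIndex;

        Token(String word, int documentIndex) {
            this.word = word;
            this.documentIndex = documentIndex;
        }
    }

    /** Sorts a copy by word only; the sort is stable, so equal words keep input order. */
    static List<Token> sortByWord(List<Token> tokensInDocumentOrder) {
        List<Token> sorted = new ArrayList<>(tokensInDocumentOrder);
        sorted.sort(Comparator.comparing((Token t) -> t.word));
        return sorted;
    }

    /** Collects the document indices of one word in the order they occur in the list. */
    static List<Integer> documentsOf(String word, List<Token> tokens) {
        List<Integer> documents = new ArrayList<>();
        for (Token t : tokens) {
            if (t.word.equals(word)) {
                documents.add(t.documentIndex);
            }
        }
        return documents;
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("data", 0));
        tokens.add(new Token("mining", 0));
        tokens.add(new Token("data", 1));
        tokens.add(new Token("cluster", 2));
        tokens.add(new Token("data", 3));
        System.out.println(documentsOf("data", sortByWord(tokens)));
    }
}
```

The same reasoning would not hold for an unstable sort, which is free to reorder equal keys.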

That aside, the code in the loop didn't make much sense; it's fixed now.

If you find any other bugs, please let us know (ideally, with a JUnit test case :-) ).

Staszek


Re: I think I find a bug in org.carrot2.text.preprocessing.LanguageModelStemmer

Dawid Weiss-2
> Out of curiosity -- did you get the incorrect results when using
> TermDocumentMatrixBuilder along with Carrot2 PreprocessingPipeline?

Exactly -- Taojian, how did you get the unordered example in the first
place? Was it derived analytically or observed empirically? If it was
observed empirically, then something else is wrong and needs to be checked.

Dawid


Re: I think I find a bug in org.carrot2.text.preprocessing.LanguageModelStemmer

taojian
In reply to this post by Stanislaw Osinski (April 11, 2012, 5:37 PM)
Because of this bug, a stemmer that we expected to improve the results ended up producing the wrong matrix.

I found the bug when I tried to change the stemmer.

Re: I think I find a bug in org.carrot2.text.preprocessing.LanguageModelStemmer

Dawid Weiss-2
If you can provide a full example (with a fake stemmer class if the
one you're using is proprietary) it'd be great. These indices should
be ordered, with any stemmer, so I wonder what's going wrong and
where.
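
As a starting point for such an example, the ordering property itself is easy to check in isolation. The sketch below is plain Java without JUnit, and the class and method names are hypothetical; it assumes tfByDocument stores (document index, term frequency) pairs and reports whether the document indices are strictly ascending. A call like this could form the assertion of a test that runs the pipeline with a custom stemmer.

```java
final class TfByDocumentChecks {
    /** Returns true if the document indices (even positions) are strictly ascending. */
    static boolean isSortedByDocument(int[] tfByDocument) {
        for (int j = 2; j < tfByDocument.length; j += 2) {
            if (tfByDocument[j - 2] >= tfByDocument[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The unordered example from the original report.
        System.out.println(isSortedByDocument(new int[] {2, 15, 8, 20, 4, 2, 6, 30}));
        // The same pairs in ascending document order.
        System.out.println(isSortedByDocument(new int[] {2, 15, 4, 2, 6, 30, 8, 20}));
    }
}
```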

Dawid

In reply to the post by Taojian Lu of 2012/4/11.


Re: I think I find a bug in org.carrot2.text.preprocessing.LanguageModelStemmer

taojian
In reply to this post by taojian (April 11, 2012, 5:54 PM)
:) I am new to JUnit. Most of the time I still debug step by step by hand. I think I should learn to use JUnit.

Re: I think I find a bug in org.carrot2.text.preprocessing.LanguageModelStemmer

taojian
I think the source of the bug is in "target.add(SparseArray.mergeSparseArrays(source))":
    private void storeTfByDocument(
        ArrayList<int []> target, ArrayList<int []> source)
    {
        assert source.size() > 0 : "Empty source document list?";
        if (source.size() == 1)
        {
            // Just copy the reference over if a single list is available.
            target.add(source.get(0));
        }
        else
        {
            // Merge sparse representations if more than one.
            target.add(SparseArray.mergeSparseArrays(source));
        }
    }
In every AllWord.tfByDocument array the document indices are increasing, but for stems that does not seem to be the case: the merge does not order the document indices coming from different words.
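
If the merged arrays really do need to be in document order, one way to get that is sketched below. This is not SparseArray.mergeSparseArrays and makes no claim about how it works; the class and method names are hypothetical. It assumes each source array holds (document index, term frequency) pairs and that frequencies of the same document should be added up, and it uses a TreeMap so the output comes out ascending by document index.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SortedSparseMerge {
    /** Merges (document index, tf) pair arrays, summing tf per document, sorted by document. */
    static int[] mergeSortedByDocument(List<int[]> sources) {
        TreeMap<Integer, Integer> tfPerDocument = new TreeMap<>();
        for (int[] source : sources) {
            for (int j = 0; j + 1 < source.length; j += 2) {
                tfPerDocument.merge(source[j], source[j + 1], Integer::sum);
            }
        }
        // Flatten the sorted map back into the pair layout.
        int[] merged = new int[tfPerDocument.size() * 2];
        int k = 0;
        for (Map.Entry<Integer, Integer> entry : tfPerDocument.entrySet()) {
            merged[k++] = entry.getKey();
            merged[k++] = entry.getValue();
        }
        return merged;
    }

    public static void main(String[] args) {
        int[] firstWord = {2, 15, 8, 20};
        int[] secondWord = {4, 2, 6, 30, 8, 1};
        System.out.println(Arrays.toString(
            mergeSortedByDocument(Arrays.asList(firstWord, secondWord))));
    }
}
```

A TreeMap keeps the sketch short; a two-pointer merge over already sorted inputs would avoid the boxing if performance matters.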
 

In reply to the post by Taojian Lu of April 11, 2012, 5:59 PM.

Re: I think I find a bug in org.carrot2.text.preprocessing.LanguageModelStemmer

Dawid Weiss-2
Good point. This shouldn't play any role now, but it's worth mentioning
in the javadocs that these arrays don't need to come in sorted df
order (they used to, but that's no longer the case).

D.

In reply to the post by Taojian Lu of 2012/4/11.

> I think the source of bug is in
> "target.add((SparseArray.mergeSparseArrays(source))":
>     private void storeTfByDocument(
>         ArrayList<int []> target, ArrayList<int []> source)
>     {
>         assert source.size() > 0 : "Empty source document list?";
>         if (source.size() == 1)
>         {
>             // Just copy the reference over if a single list is available.
>             target.add(source.get(0));
>         }
>         else
>         {
>             // Merge sparse representations if more than one.
>             target.add(SparseArray.mergeSparseArrays(source));
>         }
>     }
> The document indexes in every AllWords.tfByDocument are increasing, but
> for stems they are not: the merging doesn't order the document indexes
> from different words...
>
>
> On April 11, 2012 at 5:59 PM, Taojian Lu <[hidden email]> wrote:
>
>> :) I am new to JUnit.. Most of the time I still debug step by step by
>> hand... I think I should learn to use JUnit hoho~~
>> On April 11, 2012 at 5:54 PM, Taojian Lu <[hidden email]> wrote:
>>
>>> This bug causes the stemmer, which we think can improve the results, to
>>> finally produce the wrong matrix...
>>>
>>> I found the bug when I tried to change the stemmer~~
>>> On April 11, 2012 at 5:37 PM, Stanislaw Osinski <[hidden email]> wrote:
>>>>
>>>> Thanks for the report, good catch! I've just pushed a fix for the issue.
>>>>
>>>> Out of curiosity -- did you get the incorrect results when using
>>>> TermDocumentMatrixBuilder along with Carrot2 PreprocessingPipeline? Looking
>>>> quickly at the preprocessing pipeline, document indexes
>>>> in tfByDocument arrays should actually be increasing (because the tokens
>>>> array is built in the order of documents on input and then all sorting is
>>>> done using IndirectSorter.mergesort(), which is stable), so the bug should
>>>> not manifest itself.
>>>>
>>>> That aside, the code in the loop didn't make much sense, it's fixed now.
>>>>
>>>> If you find any other bugs, please let us know (ideally, with a JUnit
>>>> test case :-) ).
>>>>
>>>> Staszek
>>>>
>>>> On Wed, Apr 11, 2012 at 04:11, taojian <[hidden email]> wrote:
>>>>>
>>>>> Hello~...
>>>>> I think I may have found a bug in
>>>>> org.carrot2.text.preprocessing.LanguageModelStemmer...
>>>>>
>>>>> The function addStemStatistics sets the attributes in
>>>>> preprocessContext.allStems.
>>>>> The final result of allStems.tfByDocument may not be ordered by
>>>>> document index.
>>>>> In other words, tfByDocument[/stemIndex/] may be, for example,
>>>>> {*2*, 15, *8*, 20, *4*, 2, *6*, 30} (this stem appears in documents
>>>>> 2, 8, 4, 6).
>>>>>
>>>>> However, in the function buildTermDocumentMatrix of
>>>>> TermDocumentMatrixBuilder, carrot2 computes the weights as follows:
>>>>>
>>>>>            int tfByDocumentIndex = 0;
>>>>>            for (int documentIndex = 0; documentIndex < documentCount; documentIndex++)
>>>>>            {
>>>>>                if (tfByDocumentIndex * 2 < tfByDocument.length
>>>>>                    && tfByDocument[tfByDocumentIndex * 2] == documentIndex)
>>>>>                {
>>>>>                    double weight = termWeighting.calculateTermWeight(
>>>>>                        tfByDocument[tfByDocumentIndex * 2 + 1], df, documentCount);
>>>>>
>>>>>                    weight *= getWeightBoost(titleFieldIndex, fieldIndices);
>>>>>                    tfByDocumentIndex++;
>>>>>
>>>>>                    tdMatrix.set(i, documentIndex, weight);
>>>>>                }
>>>>>            }
>>>>>
>>>>> Obviously, for the above example, the weights of documents 4 and 6 for
>>>>> /stemIndex/ in the matrix are 0... I think this is inconsistent with the
>>>>> actual situation.
>>>>>
>>>>> Waiting for your answer~... ^^
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: I think I find a bug in org.carrot2.text.preprocessing.LanguageModelStemmer

taojian
In reply to this post by Stanislaw Osinski
OK... I don't agree with "The issue seems to be relevant only when TermDocumentMatrixBuilder is used separately from Carrot2 preprocessing pipeline"...

I think the bug can affect the clustering results even when we use the original preprocessing pipeline in Carrot2... I don't think it has anything to do with my stemmer.

I debugged the source and found that, after stemming, the document indices in AllStems.tfByDocument are not in order. As I mentioned, merging the tfByDocument arrays of different words produces a stem's tfByDocument that is out of order.

Therefore, I think the fix can improve the clustering results in Carrot2...

Do you agree with me?

My English is just so-so... I hope you can understand my point~~

Looking forward to your reply~
^.^
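
To make the point concrete, here is a small self-contained sketch of the effect. Please read it under these assumptions: concatMerge is a hypothetical stand-in that simply appends the pairs of each word (it is not the actual SparseArray.mergeSparseArrays code), the two words occur in disjoint documents, and documentsSeenBySinglePass mirrors only the control flow of the loop quoted at the start of the thread, without the weighting.

```java
import java.util.ArrayList;
import java.util.List;

class OutOfOrderMergeDemo {
    /** Hypothetical merge: appends the {documentIndex, tf} pairs of each source array. */
    static int[] concatMerge(List<int[]> sources) {
        int total = 0;
        for (int[] s : sources) {
            total += s.length;
        }
        int[] merged = new int[total];
        int pos = 0;
        for (int[] s : sources) {
            System.arraycopy(s, 0, merged, pos, s.length);
            pos += s.length;
        }
        return merged;
    }

    /** Same control flow as the quoted loop; returns the documents that would get a weight. */
    static List<Integer> documentsSeenBySinglePass(int[] tfByDocument, int documentCount) {
        List<Integer> seen = new ArrayList<>();
        int tfByDocumentIndex = 0;
        for (int documentIndex = 0; documentIndex < documentCount; documentIndex++) {
            if (tfByDocumentIndex * 2 < tfByDocument.length
                && tfByDocument[tfByDocumentIndex * 2] == documentIndex) {
                seen.add(documentIndex);
                tfByDocumentIndex++;
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        List<int[]> sources = new ArrayList<>();
        sources.add(new int[] {2, 15, 8, 20}); // word A: documents 2 and 8
        sources.add(new int[] {4, 2, 6, 30});  // word B: documents 4 and 6
        int[] merged = concatMerge(sources);   // {2, 15, 8, 20, 4, 2, 6, 30}
        // prints [2, 8]: documents 4 and 6 are never matched
        System.out.println(documentsSeenBySinglePass(merged, 10));
    }
}
```

Each word's own array is sorted, but the stem's merged array is not, so the single pass waits for document 4 after it has already moved past it.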
Reply | Threaded
Open this post in threaded view
|

Re: I think I find a bug in org.carrot2.text.preprocessing.LanguageModelStemmer

Stanislaw Osinski
Administrator
> I debugged the source and found that, after stemming, the document indices
> in AllStems.tfByDocument are not in order. As I mentioned, merging the
> tfByDocument arrays of different words produces a stem's tfByDocument
> that is out of order.
>
> Therefore, I think the fix can improve the clustering results in
> Carrot2...
>
> Do you agree with me?

You are right, document indices will be out of order after merging indeed. If you pull the latest source code from GitHub, it has the fix already.

Staszek
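
For anyone who has other code that still assumes increasing document indices, one possible workaround is to sort the pairs after merging. The sketch below is only an illustration of that idea and is not necessarily what the fix on GitHub does; the class and method names are hypothetical.

```java
import java.util.Arrays;

class SortPairsDemo {
    /** Returns a copy of {documentIndex, tf, ...} with the pairs ordered by documentIndex. */
    static int[] sortByDocument(int[] tfByDocument) {
        int pairCount = tfByDocument.length / 2;
        int[][] pairs = new int[pairCount][];
        for (int p = 0; p < pairCount; p++) {
            pairs[p] = new int[] {tfByDocument[2 * p], tfByDocument[2 * p + 1]};
        }
        // Sort whole pairs so that each tf stays attached to its document index.
        Arrays.sort(pairs, (a, b) -> Integer.compare(a[0], b[0]));
        int[] sorted = new int[pairCount * 2];
        for (int p = 0; p < pairCount; p++) {
            sorted[2 * p] = pairs[p][0];
            sorted[2 * p + 1] = pairs[p][1];
        }
        return sorted;
    }

    public static void main(String[] args) {
        int[] tfByDocument = {2, 15, 8, 20, 4, 2, 6, 30};
        // prints [2, 15, 4, 2, 6, 30, 8, 20]
        System.out.println(Arrays.toString(sortByDocument(tfByDocument)));
    }
}
```

Sorting costs O(n log n) per stem, so reading the pairs in an order-independent way is probably the cheaper option where that is possible.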


------------------------------------------------------------------------------
For Developers, A Lot Can Happen In A Second.
Boundary is the first to Know...and Tell You.
Monitor Your Applications in Ultra-Fine Resolution. Try it FREE!
http://p.sf.net/sfu/Boundary-d2dvs2
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers