Where is the effect of constant factor(s = 2.5)?


Where is the effect of constant factor(s = 2.5)?

zivelian
Hi Stanislaw,
[from my previous post...]

Stanislaw Osinski wrote
> > Is it 3 (there are 3 values in term 1's row that are not 0), so term 1's
> > weight = log(7/3) = 0.368
>
> It's this one.
>
> S.
If it's 3, then where is the effect of the constant factor (s = 2.5), given that we only count the values that are not 0?

So where is the effect of the constant factor (s = 2.5)?

Re: Where is the effect of constant factor(s = 2.5)?

Stanislaw Osinski
Administrator
> If it's 3, then where is the effect of the constant factor (s = 2.5), given
> that we only count the values that are not 0?
>
> So where is the effect of the constant factor (s = 2.5)?

There is no effect on the idf, because it depends on document counts, not term occurrences. The effect is only on tf: you count each occurrence of a word in a document's title as 2.5 (or whatever value of s you assume).
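As a rough sketch of that distinction (not Carrot2's actual code; the function names are hypothetical, and the base-10 logarithm is an assumption chosen because it reproduces the 0.368 figure used in this thread):

```python
import math
from collections import Counter

TITLE_BOOST = 2.5  # the constant factor s discussed in this thread


def boosted_tf(title_tokens, body_tokens, s=TITLE_BOOST):
    """Term frequency: each title occurrence counts as s, each body occurrence as 1."""
    tf = Counter()
    for token in title_tokens:
        tf[token] += s
    for token in body_tokens:
        tf[token] += 1.0
    return dict(tf)


def idf(term, docs):
    """log10(N / df): depends only on how many documents contain the term.

    Assumes the term occurs in at least one document.
    """
    n_docs = len(docs)
    df = sum(1 for tokens in docs if term in tokens)
    return math.log10(n_docs / df)


# Hypothetical document: "carrot" appears once in the title and once in the body.
print(boosted_tf(["carrot"], ["carrot", "search"]))  # {'carrot': 3.5, 'search': 1.0}

# A term present in 3 of 7 documents has idf 0.368, however often or wherever it occurs.
docs = [{"a"}, {"a"}, {"a"}, {"b"}, {"b"}, {"c"}, {"c"}]
print(round(idf("a", docs), 3))  # 0.368
```

Note that s appears only in `boosted_tf`; `idf` never sees it.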

Cheers,

S.

_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Re: Where is the effect of constant factor(s = 2.5)?

zivelian
Hello Stanislaw,

Stanislaw Osinski wrote
> > If it's 3, then where is the effect of the constant factor (s = 2.5), given
> > that we only count the values that are not 0?
> >
> > So where is the effect of the constant factor (s = 2.5)?
>
> There is no effect on the idf, because it depends on document counts, not
> term occurrences. The effect is only on tf: you count each occurrence of a
> word in a document's title as 2.5 (or whatever value of s you assume).
>
> Cheers,
>
> S.

So is it like this?

Let's assume that term 1 appears in the title of document 3 and of document 7, so its tf in those documents is multiplied by the constant factor s = 2.5. The matrix then looks like this:

tf weighting

    d1   d2   d3   d4   d5   d6   d7
[ 0.00 0.00 2.50 1.00 0.00 0.00 2.50 ] --> term 1, idf = log(N/dfi) = log(7/3) = 0.368
[ 1.00 1.00 0.00 0.00 0.00 1.00 0.00 ] --> term 2, idf = log(N/dfi) = log(7/3) = 0.368
[ 1.00 1.00 0.00 0.00 0.00 1.00 0.00 ] --> term 3, idf = log(N/dfi) = log(7/3) = 0.368
[ 1.00 0.00 0.00 0.00 1.00 0.00 0.00 ] --> term 4, idf = log(N/dfi) = log(7/2) = 0.544
[ 0.00 0.00 1.00 1.00 0.00 0.00 0.00 ] --> term 5, idf = log(N/dfi) = log(7/2) = 0.544

Then comes the tf-idf weighting; for example, we multiply 2.5 * 0.368 = 0.92. (What if a value like this is greater than 1.00? Is that permitted?)

     d1    d2    d3    d4    d5    d6    d7
[ 0.000 0.000 0.920 0.368 0.000 0.000 0.920 ]
[ 0.368 0.368 0.000 0.000 0.000 0.368 0.000 ]
[ 0.368 0.368 0.000 0.000 0.000 0.368 0.000 ]
[ 0.544 0.000 0.000 0.000 0.544 0.000 0.000 ]
[ 0.000 0.000 0.544 0.544 0.000 0.000 0.000 ]

Are the two matrices above right?

And what if a tf-idf weight is greater than 1.00? Is that permitted?

And is the normalization step needed because of this problem (tf-idf weights greater than 1.00)?



Thanks.

Re: Where is the effect of constant factor(s = 2.5)?

Stanislaw Osinski
Administrator
Hi,
 
> So is it like this?
>
> Let's assume that term 1 appears in the title of document 3 and of
> document 7, so its tf in those documents is multiplied by the constant
> factor s = 2.5. The matrix then looks like this:
>
> tf weighting
>
>     d1   d2   d3   d4   d5   d6   d7
> [ 0.00 0.00 2.50 1.00 0.00 0.00 2.50 ] --> term 1, idf = log(N/dfi) = log(7/3) = 0.368
> [ 1.00 1.00 0.00 0.00 0.00 1.00 0.00 ] --> term 2, idf = log(N/dfi) = log(7/3) = 0.368
> [ 1.00 1.00 0.00 0.00 0.00 1.00 0.00 ] --> term 3, idf = log(N/dfi) = log(7/3) = 0.368
> [ 1.00 0.00 0.00 0.00 1.00 0.00 0.00 ] --> term 4, idf = log(N/dfi) = log(7/2) = 0.544
> [ 0.00 0.00 1.00 1.00 0.00 0.00 0.00 ] --> term 5, idf = log(N/dfi) = log(7/2) = 0.544
>
> Then comes the tf-idf weighting; for example, we multiply 2.5 * 0.368 = 0.92.
> (What if a value like this is greater than 1.00? Is that permitted?)

It is very likely to happen even without weighting, e.g. if a term appears twice in the same document. So multiplying by 2.5 is no different from multiple occurrences of the same word in one document. That's why for some applications you need to normalize the columns of the td (term-document) matrix.
 
>      d1    d2    d3    d4    d5    d6    d7
> [ 0.000 0.000 0.920 0.368 0.000 0.000 0.920 ]
> [ 0.368 0.368 0.000 0.000 0.000 0.368 0.000 ]
> [ 0.368 0.368 0.000 0.000 0.000 0.368 0.000 ]
> [ 0.544 0.000 0.000 0.000 0.544 0.000 0.000 ]
> [ 0.000 0.000 0.544 0.544 0.000 0.000 0.000 ]
>
> Are the two matrices above right?

Looks ok.
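
For anyone who wants to check the numbers, here is a small sketch that recomputes the second matrix from the first. It assumes a base-10 logarithm (which is what reproduces 0.368 and 0.544) and takes df as the number of non-zero entries in a term's row; the function names are made up for this example.

```python
import math

# tf matrix from the post: rows are terms 1-5, columns are documents d1-d7.
tf = [
    [0.0, 0.0, 2.5, 1.0, 0.0, 0.0, 2.5],
    [1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
]


def idf_weights(tf_matrix):
    """One idf per term: log10(N / df), with df = number of non-zero entries in the row."""
    n_docs = len(tf_matrix[0])
    return [math.log10(n_docs / sum(1 for v in row if v != 0)) for row in tf_matrix]


def tfidf(tf_matrix):
    """Multiply every tf value by its term's idf, rounded to 3 decimals for display."""
    weights = idf_weights(tf_matrix)
    return [[round(v * w, 3) for v in row] for row, w in zip(tf_matrix, weights)]


for row in tfidf(tf):
    print(row)
# first row printed: [0.0, 0.0, 0.92, 0.368, 0.0, 0.0, 0.92]
```

The title boost changes the 2.50 entries in the tf matrix but not df for term 1, which stays at 3.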
 
> And is the normalization step needed because of this problem (tf-idf
> weights greater than 1.00)?

It's needed because otherwise, longer documents would get higher values in the matrix (because they simply have more words), and that's sometimes undesirable.
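
A minimal sketch of one common way to do this, assuming normalization to unit Euclidean length (the thread does not say which norm is meant, and `normalize_columns` is a hypothetical helper, not a Carrot2 function):

```python
import math


def normalize_columns(matrix):
    """Scale each column (one document) to unit Euclidean length.

    All-zero columns are left as zeros to avoid dividing by zero.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    norms = [
        math.sqrt(sum(matrix[r][c] ** 2 for r in range(n_rows)))
        for c in range(n_cols)
    ]
    return [
        [matrix[r][c] / norms[c] if norms[c] else 0.0 for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Hypothetical weights: the second document is the first one repeated twice,
# so every weight doubles although the content is the same.
td = [
    [1.0, 2.0],
    [2.0, 4.0],
]
for row in normalize_columns(td):
    print([round(v, 3) for v in row])
# prints [0.447, 0.447] then [0.894, 0.894]
```

After normalization both documents have the same column, so the longer one no longer gets larger values just for having more words, and no entry can exceed 1.00.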

Cheers,

S.
