Get the SVD matrices in the LINGO algorithm

Eloi Zablocki
Hello everyone,

First of all, congratulations on the really nice work on Carrot2, and thanks for making it open source!
I have a question regarding the LINGO algorithm.
Since it does not scale well to a very large number of documents, I would like to know if it is possible to cluster only a sample of all the documents (let's say 10,000) and get the matrices of the SVD.
The advantage would be that the learned SVD matrices could then be used to cluster all the documents. In practice, we could project the remaining documents (the ones that have not been clustered) into a good word space (discovered with the SVD) and compute the similarity between clustered and unclustered documents in this space. Finally, each unclustered document would be assigned a cluster based on the labels of its closest clustered documents.
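
The projection-and-assignment idea above can be sketched roughly as follows. This is only a minimal illustration in NumPy, not Carrot2 or LINGO code: the function names (fit_svd, fold_in, assign_labels), the toy term-by-document matrix and the labels "a"/"b" are all assumptions made up for the example, and it uses the standard LSI "fold-in" projection with a 1-nearest-neighbour rule under cosine similarity.

```python
import numpy as np

def fit_svd(A_sample, k):
    """Truncated SVD of a term-by-document matrix (terms in rows, documents in columns)."""
    U, s, Vt = np.linalg.svd(A_sample, full_matrices=False)
    # U_k: term space basis, s_k: singular values, V_k: one k-dim row per sampled document
    return U[:, :k], s[:k], Vt[:k, :].T

def fold_in(A_new, U_k, s_k):
    """Project unseen document columns into the k-dim space: d_hat = S_k^-1 U_k^T d."""
    return (A_new.T @ U_k) / s_k

def assign_labels(new_coords, sample_coords, sample_labels):
    """Give each new document the label of its most cosine-similar sampled document."""
    def unit(X):
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        return X / np.maximum(norms, 1e-12)  # guard against all-zero rows
    sims = unit(new_coords) @ unit(sample_coords).T
    return [sample_labels[i] for i in sims.argmax(axis=1)]

# Hypothetical toy data: 4 terms, 4 sampled documents that were already clustered.
A_sample = np.array([[2.0, 1.0, 0.0, 0.0],
                     [1.0, 2.0, 0.0, 0.0],
                     [0.0, 0.0, 3.0, 1.0],
                     [0.0, 0.0, 1.0, 3.0]])
sample_labels = ["a", "a", "b", "b"]  # stand-ins for the cluster labels of the sample

U_k, s_k, V_k = fit_svd(A_sample, k=2)

# Two documents that were not part of the sample (same 4-term vocabulary).
A_new = np.array([[1.0, 0.0],
                  [1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 3.0]])
labels = assign_labels(fold_in(A_new, U_k, s_k), V_k, sample_labels)
print(labels)
```

Note that the unseen documents must be vectorised with the same vocabulary and term weighting as the sample; terms that only occur outside the sample are simply invisible to the projection.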

Before jumping into the code and modifying the algorithm, I have the following questions:
- Have you tried the above approach? Are the results OK?
- Is there a built-in way or easy way to get the SVD matrices that you would recommend?

Thanks a lot in advance,
Best,

Eloi

------------------------------------------------------------------------------
Attend Shape: An AT&T Tech Expo July 15-16. Meet us at AT&T Park in San
Francisco, CA to explore cutting-edge tech and listen to tech luminaries
present their vision of the future. This family event has something for
everyone, including kids. Get more information and register today.
http://sdm.link/attshape
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Re: Get the SVD matrices in the LINGO algorithm

Dawid Weiss
Hi Eloi,

> First of all, congratulations on the really nice work on Carrot2, and
> thanks for making it open source!

Thank you!

> I would like to know if it is possible to cluster only a sample of all the documents
> (let's say 10,000) and get the matrices of the SVD.
> The advantage would be that the learned SVD matrices could then be used to
> cluster all the documents. In practice, we could project the remaining
> documents (the ones that have not been clustered) into a good word space
> (discovered with the SVD) and compute the similarity between clustered
> and unclustered documents in this space. Finally, each unclustered document
> would be assigned a cluster based on the labels of its closest clustered
> documents.

Sure, you could do this. I guess you could even do a bit better by
running a few clustering rounds over a random subset of documents and
then somehow comparing the output (or SVD decompositions?) to
eliminate potential noise.
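
One possible reading of "comparing SVD decompositions" across runs is to check how much the top-k term subspaces of two random subsets overlap, using the cosines of the principal angles between them. The sketch below is only an illustration under that assumption: the helper names (top_k_basis, subspace_overlap) and the synthetic data are hypothetical, and both subsets must share the same vocabulary (the same rows).

```python
import numpy as np

def top_k_basis(A, k):
    """Orthonormal basis (first k left singular vectors) of a term-by-document matrix."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :k]

def subspace_overlap(U1, U2):
    """Cosines of the principal angles between two subspaces given by orthonormal bases.

    Values near 1 mean the direction is shared by both runs; values near 0
    suggest a direction found in one run only, which may be noise.
    """
    cosines = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return np.clip(cosines, 0.0, 1.0)

# Synthetic stand-in for a corpus: 50 terms, 400 documents, 3 planted topics plus noise.
rng = np.random.default_rng(0)
topics = rng.random((50, 3))
weights = rng.random((3, 400))
A = topics @ weights + 0.01 * rng.standard_normal((50, 400))

# Two independent random subsets of documents (columns).
cols_a = rng.choice(400, size=100, replace=False)
cols_b = rng.choice(400, size=100, replace=False)

overlap = subspace_overlap(top_k_basis(A[:, cols_a], 3),
                           top_k_basis(A[:, cols_b], 3))
print(overlap)
```

Directions whose cosine stays close to 1 across several subset pairs are the stable ones; how to threshold the rest is a judgment call that would need testing on real data.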

> Before jumping into the code and modifying the algorithm, I have the following questions:
> - Have you tried the above approach? Are the results OK?

I haven't (Staszek will follow up, I'm sure), but at first glance
it looks like a combination of methods or ideas used
elsewhere (LSI, random projections).

> - Is there a built-in way or easy way to get the SVD matrices that you would recommend?

You'd probably need to tweak the source code, which shouldn't be too
scary, actually. The algorithm is simple enough that you could even
try to build a simpler proof-of-concept in R or any other environment
that would accelerate development. Then if this works, you could try
to port it to Java/Carrot2.

Dawid
