Cluster preprocessed documents

Heba Ezzat
Hello all

I am new to Carrot2 :)
I am using carrot2-java-api-3.5.0-dev in Eclipse.

I want to first preprocess a set of Arabic documents, then apply the Lingo clustering algorithm to the preprocessed text.

I found that I can apply the Lingo algorithm with the following couple of lines:

        final Controller controller = ControllerFactory.createCachingPooling(IDocumentSource.class);
        final ProcessingResult byTopicClusters = controller.process(documents, null, LingoClusteringAlgorithm.class);

But the clusters returned were very bad and were labelled with stopwords, so I decided that some preprocessing must be done first.
I found that preprocessing can be done using CompletePreprocessingPipeline as follows:

        final CompletePreprocessingPipeline preprocessingPipeline = new CompletePreprocessingPipeline();
        final PreprocessingContext context = preprocessingPipeline.preprocess(documents, null, LanguageCode.ARABIC);

But how can I pass the preprocessed documents to the controller?
I don't know whether my reasoning is sound or not.

Thanks,
Heba

Re: Cluster preprocessed documents

Dawid Weiss-2
Hi Heba,

You need to specify the input documents' language manually if you're
clustering using the API only. Take a look at this example:

ClusteringNonEnglishContent.java

It contains the code that you can modify to work with Arabic (simply
specify LanguageCode.ARABIC as the input documents' language). Let us
know if this helped (and if you have any improvements of course!).

Dawid


_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Re: Cluster preprocessed documents

Heba Ezzat
Thanks, Dawid, for your reply.

My problem is mainly in the preprocessing phase...
How can I preprocess the documents before clustering?

Re: Cluster preprocessed documents

Dawid Weiss-2
Other than by modifying the internals of Carrot2, you can't. What kind of
preprocessing do you have in mind? A lot of preprocessing is built into
Carrot2, so maybe you're trying to do something that is in there
already?

Dawid

Re: Cluster preprocessed documents

Heba Ezzat
I want to use CompletePreprocessingPipeline before clustering, but I don't know how.

I know that I can use CompletePreprocessingPipeline as follows:

      final CompletePreprocessingPipeline preprocessingPipeline = new CompletePreprocessingPipeline();
      final PreprocessingContext context = preprocessingPipeline.preprocess(documents, null, LanguageCode.ARABIC);

But the part I am missing is how to pass the preprocessed documents on to be clustered.


Heba

Re: Cluster preprocessed documents

Dawid Weiss-2
Heba, you won't be able to use CompletePreprocessingPipeline before
clustering because it is an internal component that is part of the
clustering process... You could replace this component and use
something of your own, but I don't see the point. Why don't you just
cluster your documents using the code snippet I provided before (in
the example)? This uses the complete preprocessing pipeline internally
anyway.

The LanguageCode.ARABIC flag needs to be set on each document, not on
the preprocessing object (because in a multi-lingual setting the
clusterer will invoke preprocessing with each language separately).

Dawid
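
The point about setting the language on each document can be pictured with a small sketch. This is not Carrot2 code: `Doc`, `Lang`, and `groupByLanguage` are hypothetical stand-ins invented for illustration, and the real clusterer's internals may well differ. It only shows why a per-document language tag lets a clusterer hand each language group to its own preprocessing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT Carrot2 classes.
final class PerLanguageGrouping {
    enum Lang { ARABIC, ENGLISH }

    static final class Doc {
        final String text;
        final Lang lang;

        Doc(String text, Lang lang) {
            this.text = text;
            this.lang = lang;
        }
    }

    /** Buckets documents by the language tag each one carries. */
    static Map<Lang, List<Doc>> groupByLanguage(List<Doc> docs) {
        Map<Lang, List<Doc>> groups = new EnumMap<>(Lang.class);
        for (Doc d : docs) {
            groups.computeIfAbsent(d.lang, k -> new ArrayList<>()).add(d);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("first Arabic document", Lang.ARABIC),
            new Doc("an English document", Lang.ENGLISH),
            new Doc("second Arabic document", Lang.ARABIC));
        // Each bucket could then be preprocessed with that language's own resources.
        groupByLanguage(docs).forEach((lang, group) ->
            System.out.println(lang + ": " + group.size() + " document(s)"));
    }
}
```

A language set only on a shared preprocessing object could not express a mixed collection like this one, which is presumably why the tag lives on the document.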

Re: Cluster preprocessed documents

Heba Ezzat
I've used ClusteringNonEnglishContent.java and specified LanguageCode.ARABIC as the input documents' language.
But I found that the resulting clusters contain stopwords that should have been removed during preprocessing.

For example:
فى كل (7 documents, score: 30.634754283025156)
في اي (6 documents, score: 39.022973055181886)
كل ما (2 documents, score: 36.69559989409924)
لا بد أن (2 documents, score: 27.13532862854613)

That's why I thought that some preprocessing needs to take place before clustering.

How come the clusters contain stopwords if CompletePreprocessingPipeline is invoked internally?


Thanks,
Heba

Re: Cluster preprocessed documents

Dawid Weiss-2
Can you check if these stopwords occur in the stopwords.ar file (or in
stoplabels.ar)? Maybe they are simply missing from there and need to be
added?

Dawid
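
A check like the one suggested here can be done by hand, or with a few lines of plain Java. The sketch below is only an illustration: it assumes the stopword file holds one word per line with `#` starting a comment, and the words in `main` are a made-up sample, not the real contents of stopwords.ar. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class StopwordLabelCheck {
    /** Parses stopword lines, skipping blanks and '#' comments (assumed file format). */
    static Set<String> parseStopwords(List<String> lines) {
        Set<String> words = new HashSet<>();
        for (String line : lines) {
            String w = line.trim();
            if (!w.isEmpty() && !w.startsWith("#")) {
                words.add(w);
            }
        }
        return words;
    }

    /** Returns the words of a label that are NOT in the stopword set (exact match). */
    static List<String> missingFrom(Set<String> stopwords, String label) {
        List<String> missing = new ArrayList<>();
        for (String token : label.trim().split("\\s+")) {
            if (!token.isEmpty() && !stopwords.contains(token)) {
                missing.add(token);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up sample list, written with Unicode escapes to avoid source-encoding issues.
        Set<String> stopwords = parseStopwords(Arrays.asList(
            "# sample only",
            "\u0641\u064A",   // feh + yeh (U+064A)
            "\u0643\u0644",   // kaf + lam
            "\u0645\u0627")); // meem + alef
        List<String> labels = Arrays.asList(
            "\u0641\u0649 \u0643\u0644",  // first word ends in alef maksura (U+0649)
            "\u0641\u064A \u0627\u064A",
            "\u0643\u0644 \u0645\u0627");
        for (String label : labels) {
            System.out.println(label + " -> missing: " + missingFrom(stopwords, label));
        }
    }
}
```

The comparison is an exact string match, so spelling variants such as a final U+0649 versus U+064A count as different words; whether that matters for the labels in this thread is not something the sketch can tell.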

Re: Cluster preprocessed documents

Heba Ezzat
Surprisingly, the words do exist in stopwords.ar!

I debugged the code, but I couldn't find where it invokes CompletePreprocessingPipeline or where it checks the stopwords.
Any suggestions?

Heba

Re: Cluster preprocessed documents

Dawid Weiss-2
Heba,

I will add some logging statements to state which language the
clustering is performed in. Can you send me the snippet of code you
are using for clustering? We suspect you are still using the English
preprocessing infrastructure.

Dawid

Re: Cluster preprocessed documents

Heba Ezzat
Here is the code snippet I used:

File location = new File("C:\\Text\\");
final ArrayList<Document> documents = new ArrayList<Document>();
// Read every regular file in the directory as one Arabic document.
if (location.isDirectory())
{
    for (File f : location.listFiles())
    {
        if (f.isFile())
        {
            String fileContents = FileUtils.readFileToString(f);
            //String fileContents = FileUtils.readFileToString(f, "Cp1252");
            documents.add(new Document("", fileContents, LanguageCode.ARABIC));
        }
    }
}
// Cluster the documents with Lingo and print the resulting clusters.
final Controller controller = ControllerFactory.createCachingPooling(IDocumentSource.class);
final ProcessingResult byTopicClusters = controller.process(documents, null, LingoClusteringAlgorithm.class);
final List<Cluster> clustersByTopic = byTopicClusters.getClusters();

ConsoleFormatter.displayClusters(clustersByTopic, 0);
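
The snippet reads each file without naming a charset. Assuming `FileUtils` here is the Apache Commons IO class, that overload decodes with the platform default encoding, so it may be worth ruling the encoding out. The sketch below uses only the JDK and assumes the files are UTF-8, which the thread does not state; `ReadTextFiles` and `readAll` are hypothetical names, and this is not necessarily related to the stopword problem.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class ReadTextFiles {
    /** Reads every regular file directly under dir, decoding with the given charset. */
    static List<String> readAll(Path dir, Charset charset) throws IOException {
        List<String> contents = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                if (Files.isRegularFile(f)) {
                    contents.add(new String(Files.readAllBytes(f), charset));
                }
            }
        }
        return contents;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the directory is the first argument and the files are UTF-8.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> texts = readAll(dir, StandardCharsets.UTF_8);
        System.out.println("Read " + texts.size() + " file(s)");
    }
}
```

Each returned string could then be wrapped in a document tagged with the Arabic language code, exactly as in the snippet above.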


Re: Cluster preprocessed documents

Dawid Weiss-2
I've just committed the logging statement to track this. I also
checked the example -- if you're using the trunk (SVN checkout), I
replaced in ClusteringNonEnglishContent the following line:


        for (Document document : SampleDocumentData.DOCUMENTS_DATA_MINING)
        {
            documents.add(new Document(document.getTitle(), document.getSummary(),
                document.getContentUrl(), LanguageCode.ARABIC));
        }

Then I configured log4j.xml in carrot2-examples project to debug:

    <!-- Root logger -->
    <root>
        <priority value="debug" />
        <appender-ref ref="appender-console" />
    </root>

and I get these logs:


2011-04-05 11:26:03,709 DEBUG org.carrot2.text.clustering.MultilingualClustering: Performing monolingual clustering in: Arabic
2011-04-05 11:26:03,789 DEBUG org.carrot2.util.resource.ResourceCache: Loading resources, locations: [org.carrot2.util.resource.ContextClassLoaderLocator [current: sun.misc.Launcher$AppClassLoader@546b97fd]]
2011-04-05 11:26:03,791 DEBUG org.carrot2.util.resource.ResourceLookup: getFirst():
        stopwords.ar
        - 1 hit from: org.carrot2.util.resource.ContextClassLoaderLocator [current: sun.misc.Launcher$AppClassLoader@546b97fd]
                - file:/D:/carrot2/carrot2.sf.trunk/core/carrot2-util-text/src-resources/stopwords.ar

(among other things)

So it is making use of stopwords.ar... can you verify that this is the
case in your setup as well? If yes and stopwords still pop up in the
results we will need to take a closer look and I will ask you to
format a ZIP bundle with the example (and a few documents, either
inline or as separate files).

Thanks,
Dawid
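
For anyone verifying this in a long debug log, a small scan can pull out the language named in the clustering message. The sketch below is a convenience written for this thread, not part of Carrot2; the class and method names are hypothetical, and the message text is taken from the log excerpt above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ClusteringLanguageInLog {
    // Message text taken from the log excerpt above; \s+ tolerates wrapped lines.
    private static final Pattern LINE = Pattern.compile(
        "Performing\\s+monolingual\\s+clustering\\s+in:\\s+(\\S+)");

    /** Returns the language named in the first matching log message, if any. */
    static Optional<String> clusteringLanguage(List<String> logLines) {
        Matcher m = LINE.matcher(String.join(" ", logLines));
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "2011-04-05 11:26:03,709 DEBUG",
            "org.carrot2.text.clustering.MultilingualClustering: Performing",
            "monolingual clustering in: Arabic");
        // Prints "Arabic" for this sample.
        System.out.println(clusteringLanguage(sample).orElse("(no such line)"));
    }
}
```

An empty result means the message never appeared in the lines that were scanned.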

Re: Cluster preprocessed documents

Heba Ezzat
I've run ClusteringNonEnglishContent as you said.

Here is a subset of the log:

2011-04-06 09:30:27,453 DEBUG org.carrot2.util.resource.ResourceCache: Loading resources, locations: [org.carrot2.util.resource.ContextClassLoaderLocator [current: sun.misc.Launcher$AppClassLoader@1a46e30]]
2011-04-06 09:30:27,456 DEBUG org.carrot2.util.resource.ResourceLookup: getFirst():
- 2 hits from: org.carrot2.util.resource.ContextClassLoaderLocator [current: sun.misc.Launcher$AppClassLoader@1a46e30]
- file:/C:/carrot2/trunk/core/carrot2-util-text/src-resources/stopwords.ar
- file:/C:/carrot2/trunk/core/carrot2-util-text/src-resources/stopwords.ar
.
.
.
2011-04-06 09:30:27,466 DEBUG org.carrot2.util.resource.ResourceLookup: getFirst():
- 2 hits from: org.carrot2.util.resource.ContextClassLoaderLocator [current: sun.misc.Launcher$AppClassLoader@1a46e30]
- file:/C:/carrot2/trunk/core/carrot2-util-text/src-resources/stopwords.bg
- file:/C:/carrot2/trunk/core/carrot2-util-text/src-resources/stopwords.bg


But the following line does not appear:
2011-04-05 11:26:03,709 DEBUG
org.carrot2.text.clustering.MultilingualClustering: Performing monolingual clustering in: Arabic


what does that mean :)

Thanks,
Heba




Re: Cluster preprocessed documents

Heba Ezzat
Yes, you are right :)

This line appeared:
2011-04-06 10:05:30,496 DEBUG org.carrot2.text.clustering.MultilingualClustering: Performing monolingual clustering in: Arabic

But the clusters still show stopwords!

Heba

On Wed, Apr 6, 2011 at 9:40 AM, Dawid Weiss <[hidden email]> wrote:
> But, the following line doesnt exist:
> 2011-04-05 11:26:03,709 DEBUG
> org.carrot2.text.clustering.MultilingualClustering: Performing monolingual
> clustering in: Arabic
>
> what does that mean :)

It may mean you are not working with an SVN checkout, because I've just
added it for you :) Send me an entire working example, including Arabic
documents (either inline or as separate files), and I'll try to help
you.

Dawid



Re: Cluster preprocessed documents

Dawid Weiss-2
Ok, please send me the ZIP file showing this, possibly indicating
which terms are stopwords, ok? It looks like something is wrong, so
it's definitely worth investigating.
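
To point out which label terms are stopwords, a rough check like the one below may help. It is a sketch, not Carrot2 code: it assumes the stopword file holds one word per line with '#' starting a comment line, it compares tokens by exact match after splitting on whitespace, and the words in the usage note are English placeholders. StopwordLabelCheck and its methods are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Carrot2: flags stopwords inside cluster labels.
class StopwordLabelCheck {

    /** Parses stopword lines, skipping blank lines and '#' comments (assumed file format). */
    static Set<String> parseStopwords(List<String> lines) {
        Set<String> stopwords = new HashSet<>();
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty() && !word.startsWith("#")) {
                stopwords.add(word);
            }
        }
        return stopwords;
    }

    /** Returns, in order, the whitespace-separated tokens of a label that are stopwords. */
    static List<String> stopwordsIn(String label, Set<String> stopwords) {
        List<String> found = new ArrayList<>();
        for (String token : label.trim().split("\\s+")) {
            if (stopwords.contains(token)) {
                found.add(token);
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a stopword file; remaining args: cluster labels to check.
        if (args.length < 2) {
            System.err.println("usage: StopwordLabelCheck <stopword-file> <label>...");
            return;
        }
        Set<String> stopwords = parseStopwords(
            Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8));
        for (int i = 1; i < args.length; i++) {
            System.out.println(args[i] + " -> " + stopwordsIn(args[i], stopwords));
        }
    }
}
```

For example, with a stopword list containing "the" and "of", the label "the history of mining" would be reported with both words flagged, while "data mining" would come back with an empty list. Exact matching ignores whatever normalization the clustering pipeline applies, so treat the output as a hint only.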

Dawid


Re: Cluster preprocessed documents

Heba Ezzat
Attached is a zipped folder containing the 100 files I am testing on,
along with the created clusters, in which the stopwords are highlighted.

Thanks,
Heba


Attachments: Text.rar (23K), Created Clusters.docx (15K)

Re: Cluster preprocessed documents

Dawid Weiss-2
I've filed an issue here:

http://issues.carrot2.org/browse/CARROT-795

Can you attach the snippet of code that you use for clustering these
files as well? Thanks. Also, if you add yourself to the watchers list
on this issue, you'll know when we find out what's happening.

Dawid


Re: Cluster preprocessed documents

Heba Ezzat
Thank you for your help.
Attached is the code snippet I used.

Heba

On Wed, Apr 6, 2011 at 10:34 AM, Dawid Weiss <[hidden email]> wrote:
I've filed an issue here:

http://issues.carrot2.org/browse/CARROT-795

Can you attach the snippet of code that you use for clustering these
files as well? Thanks. Also, if you add yourself to the watchers list
on this issue, you'll know when we find out what's happening.

Dawid

On Wed, Apr 6, 2011 at 10:30 AM, Heba Ezzat <[hidden email]> wrote:
> Attached is a zipped folder containing 100 files which I am testing on.
> in addition to the crested clusters where the stopwords are highlighted
>
> Thanks,
> Heba
>
> On Wed, Apr 6, 2011 at 10:13 AM, Dawid Weiss <[hidden email]>
> wrote:
>>
>> Ok, please send me the ZIP file showing this, possibly indicating
>> which terms are stopwords, ok? It looks like something is wrong, so
>> it's definitely worth investigating.
>>
>> Dawid
>>
>> On Wed, Apr 6, 2011 at 10:11 AM, Heba Ezzat <[hidden email]>
>> wrote:
>> > :) :) yes, you are right
>> > this line appeared:
>> > 2011-04-06 10:05:30,496 DEBUG
>> > org.carrot2.text.clustering.MultilingualClustering: Performing
>> > monolingual
>> > clustering in: Arabic
>> > but still the clusters show stopwords !!
>> > Heba
>> > On Wed, Apr 6, 2011 at 9:40 AM, Dawid Weiss
>> > <[hidden email]>
>> > wrote:
>> >>
>> >> > But, the following line doesnt exist:
>> >> > 2011-04-05 11:26:03,709 DEBUG
>> >> > org.carrot2.text.clustering.MultilingualClustering:
>> >> > Performing monolingual
>> >> > clustering in: Arabic
>> >> >
>> >> > what does that mean :)
>> >>
>> >> It may mean you not working with SVN checkout because I've just added
>> >> it for you? :) Send me an entire working example, including Arabic
>> >> documents (either inline or as separate files) and I'll try to help
>> >> you.
>> >>
>> >> Dawid
>> >
>> >
>
>


------------------------------------------------------------------------------
Xperia(TM) PLAY
It's a major breakthrough. An authentic gaming
smartphone on the nation's most reliable network.
And it wants your games.
http://p.sf.net/sfu/verizon-sfdev
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Attachment: MyCarrotExp.java (1K)