Arabic stop label list


Dawid Weiss-2
Hi Lora,

Thanks for the feedback; comments below.

> Yes, those are very common phrases where not each word of them is (necessarily) a stop word by itself.

This won't work. The way Carrot2 is designed, each stop word must be
a single word (multi-word entries won't be matched because of the way
the preprocessing infrastructure works). You can use stoplabels to get
rid of entire phrases that are nonsensical.
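
To illustrate why a multi-word entry in a stop word list never matches, here is a minimal Java sketch. It is not Carrot2's preprocessing code: the whitespace tokenizer, the class name StopWordLookupSketch and the sample entries are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class StopWordLookupSketch {
    /** Counts how many whitespace-separated tokens of the text occur in the stop word set. */
    static int countStopWordHits(String text, Set<String> stopWords) {
        int hits = 0;
        for (String token : text.trim().split("\\s+")) {
            // Each lookup sees exactly one token, so an entry containing spaces can never match.
            if (stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical list: two single-word entries and one multi-word entry.
        Set<String> stopWords = new HashSet<>(Arrays.asList("in", "the", "as well as"));
        // Only "in" and the two occurrences of "the" are counted; "as well as" is never hit.
        System.out.println(countStopWordHits("in the house as well as the garden", stopWords));
    }
}
```

The multi-word entry is harmless but useless in such a list, which is why phrases belong in the stop label list instead.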

> We don't find Lucene's Arabic stop words list relevant for our needs and hence we didn't use it (btw you can find in the net some criticism about this list), instead we create one of our own. But as you asked we will go over the merged list and correct it.

I know about the criticism, but I can't judge it myself -- don't know
what's in the list. There have been recent JIRA patches that
supposedly fix that stop word list. Thanks for working on this,
however. If you wish, you could contribute such a list back to Lucene
and Carrot2, we don't mind at all.

> Using the DCS (after adding the stop words updating the language and successfully compiled and rebuild) we received good clustering for English xml-s but only 1 cluster – “other topics” for Arabic xml-s. (Please find attached my mail regarding this problem)

This should not be the case. I checked with the "iran.xml" file you've
provided and curl (simple HTTP POST utility). Examples:

curl http://localhost:8080/dcs/rest -# -F "dcs.clusters.only=true" \
     -F "active-language=ARABIC" -F "dcs.c2stream=@iran.xml" \
     -o iran-arabic.xml

curl http://localhost:8080/dcs/rest -# -F "dcs.clusters.only=true" \
     -F "active-language=ENGLISH" -F "dcs.c2stream=@iran.xml" \
     -o iran-english.xml

For the DCS, you'll have to force Arabic as the processing language
(as in the first example above). Compare the results for the English
and Arabic active languages -- the stemmer and stop words applied
should be different. I attach an example to prove it really works.

Dawid

------------------------------------------------------------------------------
Come build with us! The BlackBerry(R) Developer Conference in SF, CA
is the only developer event you need to attend this year. Jumpstart your
developing skills, take BlackBerry mobile applications to market and stay
ahead of the curve. Join us from November 9 - 12, 2009. Register now!
http://p.sf.net/sfu/devconference
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Attachment: iran-arabic.xml (6K)

Re: Arabic stop label list

Dawid Weiss-2
Hi Lora,

Apologies for the late response.

> Regarding the stop words list:
> We cleaned all the “phrases” (more than a single word), and removed the ones
> that are relevant to the stop labels list.

Excellent.

> Regarding the merged list, we cleaned much of Lucene’s list:
>
> There are some prefixes which are *not* a separate word, like: ب ا أ (in
> English we write: “in the house” in Arabic it’s more like “inthehous” where
> “in” and “the” are not separated words and hence not stop words)
> Comma is not a stop word

As far as I remember, Lucene's stop word list contained "normalized"
entries -- entries that matched tokens after the normalization engine
digested them. Perhaps this is the source of these weird tokens. There
are commas and other punctuation characters because Lucene uses a very
simple analyzer for Arabic (breaking on white space). In Carrot2,
this should not be an issue.

> Some name prefixes like ابو: they are very common prefixes, like the
> typically Dutch prefix "van". If we omit it, “Van Damme” becomes “Damme”.

Such things should be forbidden as cluster labels by themselves, but
allowed in general -- we should place them in the stoplabels file, but
not in the stopwords file.
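
The distinction can be sketched in Java as follows. This is a simplification under stated assumptions: it treats a stop label as an exact, case-insensitive match on the whole candidate label, whereas the real stoplabels file format may be richer; StopLabelSketch and its entries are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class StopLabelSketch {
    // Hypothetical stop label entries: forbidden only when they form the entire label.
    static final Set<String> STOP_LABELS = new HashSet<>(Arrays.asList("van", "abu"));

    /** A candidate label is rejected only if the whole label equals a stop label entry. */
    static boolean isAcceptableLabel(String label) {
        return !STOP_LABELS.contains(label.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isAcceptableLabel("Van"));       // the prefix alone is rejected
        System.out.println(isAcceptableLabel("Van Damme")); // the full name is kept intact
    }
}
```

Because the prefix is not a stop word, it still takes part in phrase discovery, so "Van Damme" survives as a label while "Van" on its own does not.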

> 1.      Lucene’s list contains a lot of “common mistakes and typos” of stop
> words as stop words. A (I hope appropriate) similar example is the Polish
> word “złoty” (l with stroke), which is commonly written as “zloty”. This is a
> typo (and if not – I’m sorry, I will look for another example). If “złoty”
> were a stop word, would you like to put “zloty” in the list as well?

Excellent example. Zloty is not even a typo, it's usually laziness on
the part of people who type (l-stroke requires the alt-l key
combination, it's faster to type without diacritics). Back to your
question -- I believe we should focus on correct words only, hoping
they are dominant in the corpus of Web documents.

> 2.      Lucene’s list contains several numbers (I don’t know why these
> numbers were picked and not others. Moreover, whether to include
> numbers in a stop words list is a known controversy.)

True. I think we should at most put them in the stoplabels.ar file;
they are definitely not stopwords to me.

> I would like to give inline credit to the people who worked on the lists:

Feel free to put inline credits on lines starting with # (at the start
of the file, perhaps).
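
As a rough illustration of how such credit lines stay out of the word list itself, the Java sketch below parses a list with one entry per line and skips blank lines and lines starting with #. It is not Carrot2's loader; WordListParserSketch and parseEntries are hypothetical names, and the one-entry-per-line layout is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class WordListParserSketch {
    /** Returns the entries of a word list, ignoring blank lines and '#' comment lines. */
    static List<String> parseEntries(String fileContent) {
        List<String> entries = new ArrayList<>();
        for (String line : fileContent.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // credits and other comments are not entries
            }
            entries.add(trimmed);
        }
        return entries;
    }

    public static void main(String[] args) {
        String content = "# Credits: (contributor names go here)\nfoo\n\nbar\n";
        System.out.println(parseEntries(content)); // only the two entries remain
    }
}
```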

Dawid


Re: Arabic stop label list

OrbDevelopment
In reply to this post by Dawid Weiss-2
Hello Dawid,
I've been trying to test Arabic-language clustering, so far with no success. I changed the active language to Arabic using service.AddFormValue("active-language ", "ARABIC"); I'm using the sample DCS client code in C#. The server responds with an Internal Server Error (500). Should I be changing anything in the encoding? It works perfectly fine for all languages except Arabic. The example code is below; please advise.

Cheers,

AG

using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;

namespace Org.Carrot2
{
    /// <summary>
    /// Stream data for the form.
    /// </summary>
    internal sealed class StreamData
    {
        internal Stream stream;
        internal string streamName;

        internal StreamData(string name, Stream stream)
        {
            this.stream = stream;
            this.streamName = name;
        }
    }

    /// <summary>
    /// A simplified HTTP POST <code>multipart/form-data</code> uploader.
    /// </summary>
    public class MultipartFileUpload
    {
        private static readonly uint BUFFER_SIZE = 1024 * 18;

        private Uri uri;
        private IList<KeyValuePair<string, object>> formData
            = new List<KeyValuePair<string, object>>();

        public MultipartFileUpload(Uri target)
        {
            this.uri = target;
        }

        /// <summary>
        /// Reset the object for reuse.
        /// </summary>
        public void Reset()
        {
            formData.Clear();
        }

        /// <summary>
        /// Adds a form value to the request. Values are appended in order; an existing
        /// field with the same name is not replaced.
        /// </summary>
        /// <param name="name">the name of the form field</param>
        /// <param name="value">value of the form field</param>
        public void AddFormValue(String name, String value)
        {
            AddFormValueInternal(name, (object) value);
        }

        /// <summary>
        /// Adds a string or stream data to the request's form data.
        /// </summary>
        /// <param name="name">the name of the form field</param>
        /// <param name="value">string or StreamData object.</param>
        private void AddFormValueInternal(String name, object value)
        {
            this.formData.Add(new KeyValuePair<string,object>(name, value));
        }

        /// <summary>
        /// Attach a stream (file) to the post method. The parameter name and stream cannot be null.
        /// </summary>
        /// <param name="parameterName">the parameter that your web server expects to be associated with a file</param>
        /// <param name="fileDisplayName">the name of the file as it should appear to the web server</param>
        /// <param name="stream">the actual content of the file you want to upload</param>
        /// <exception cref="ArgumentNullException">the stream is null or unreadable, or the parameter name is null</exception>
        public void AddFormStream(String parameterName, String fileDisplayName, Stream stream)
        {
            if (stream == null || !stream.CanRead)
            {
                throw new ArgumentNullException("stream", "You must pass a reference to a readable stream");
            }

            if (parameterName == null)
            {
                throw new ArgumentNullException("parameterName", "You must provide the name of the file parameter.");
            }

            AddFormValueInternal(parameterName, new StreamData(fileDisplayName, stream));
        }

        /// <summary>
        /// Performs the actual upload.
        /// </summary>
        /// <returns>the response as a byte array</returns>
        public byte[] Post()
        {
            // generate parameter boundary
            string boundaryRaw = "boundary" + DateTime.Now.Ticks.ToString("x");
            string boundary = "--" + boundaryRaw;

            HttpWebRequest webrequest = (HttpWebRequest) WebRequest.Create(uri);
            webrequest.ContentType = "multipart/form-data; boundary=" + boundaryRaw;
            webrequest.Method = "POST";
            webrequest.KeepAlive = false;

            // Encode form parameters and push them to the stream.
            using (Stream requestStream = webrequest.GetRequestStream())
            {
                byte[] buffer = new byte[BUFFER_SIZE];
                int bytesRead = 0;
                byte[] bytes;

                foreach (KeyValuePair<string, object> kv in formData)
                {
                    string key = kv.Key;
                    object value = kv.Value;

                    if (value is string)
                    {
                        string part =
                            boundary + "\r\n"
                            + "content-disposition: form-data; name=\""
                            + key + "\"\r\n\r\n"
                            + (value as string)
                            + "\r\n";

                        bytes = Encoding.UTF8.GetBytes(part);
                        requestStream.Write(bytes, 0, bytes.Length);
                    }
                    else if (value is StreamData)
                    {
                        StreamData sd = value as StreamData;
                        string part =
                            boundary + "\r\n"
                            + "content-disposition: form-data; name=\""
                            + key + "\"; filename=\"" + sd.streamName + "\"\r\n"
                            + "content-type: application/octet-stream\r\n\r\n";

                        bytes = Encoding.UTF8.GetBytes(part);
                        requestStream.Write(bytes, 0, bytes.Length);

                        // Copy stream contents.
                        using (Stream s = sd.stream)
                        {
                            while ((bytesRead = s.Read(buffer, 0, buffer.Length)) != 0)
                            {
                                requestStream.Write(buffer, 0, bytesRead);
                            }
                        }

                        requestStream.WriteByte((byte) '\r');
                        requestStream.WriteByte((byte) '\n');
                    }
                    else
                    {
                        throw new Exception("Panic: object of unknown type: " + value);
                    }
                }

                bytes = Encoding.UTF8.GetBytes(boundary + "--\r\n");
                requestStream.Write(bytes, 0, bytes.Length);
                requestStream.Close();

                // Copy the response to a byte array in memory (so that we don't need
                // to track web connection resources after returning).
                WebResponse response = webrequest.GetResponse();
                Stream responseStream = response.GetResponseStream();
                MemoryStream memStream = new MemoryStream();
                while ((bytesRead = responseStream.Read(buffer, 0, buffer.Length)) != 0)
                {
                    memStream.Write(buffer, 0, bytesRead);
                }

                responseStream.Close();
                response.Close();

                Reset();
                return memStream.ToArray();
            }
        }
    }
}


Re: Arabic stop label list

Dawid Weiss-2
Can you send us the XML document you're trying to cluster?

Dawid

On Mon, Jan 11, 2010 at 3:01 PM, OrbDevelopment <[hidden email]> wrote:

>
> Hello Dawid,
>          I've been trying to test the Arabic Language Clustering yet with
> no success. I changed the active language to Arabic using
> service.AddFormValue("active-language ", "ARABIC"); I'm using the sample DCS
> code sample in CSharp. The server gives an Internal Server Error (500).
> Should I be changing anything in the encoding? It works perfectly fine for
> all languages except for Arabic. Below is the example code, please advise.
>
> Cheers,
>
> AG

Re: Arabic stop label list

OrbDevelopment
There you go. This XML works fine for English, Danish or Russian, but not for Arabic.

In addition, I attached the error position. Below is the error from the log file:

2010-01-11 19:12:02,530 [WARN ] [jetty] /dcs/rest
java.lang.IllegalArgumentException: No enum const class org.carrot2.text.linguistic.LanguageCode.ARABIC

Amir



Attachments: Input.xml, Error.gif

Re: Arabic stop label list

Dawid Weiss-2
Hi Amir,

I can't reproduce this error. Which version of the DCS are you using?
I just compiled from the trunk, changed the C# code to emit UTF8
output by adding:

        public static void Main()
        {
            Console.OutputEncoding = System.Text.Encoding.UTF8;

and then adding the following to the ClusterFromStream method:

            // Pass query hint.
            service.AddFormValue("query", queryHint);
            service.AddFormValue("active-language", "ARABIC");

I also changed the path reference to point to the file you made available:

            ClusterFromFile(service, "..\\..\\..\\shared\\Input.xml",
                "data mining");

After compiling, the result is:

## Clustering documents from a file...
  اليمن [6 document(s)]
  حملة واسعة [3 document(s)]
  عملية [3 document(s)]
  الاتحاد من أجل المتوسط [2 document(s)]
  السفارة الأميركية في صنعاء فتح أبوابها [2 document(s)]
  السفارة الامريكية [2 document(s)]
  تنظيم القاعدة في ثلاث [2 document(s)]
  شنت قوات [2 document(s)]
  لقتال القاعدة [2 document(s)]
  Other Topics [1 document(s)]

## Clustering documents from an XML string...
  اليمن [6 document(s)]
  حملة واسعة [3 document(s)]
  عملية [3 document(s)]
...

Can you check the DCS's logs and console -- is there any information
about this error? If not, feel free to send me the entire DCS
distribution for evaluation (private e-mail, please).

Dawid


On Mon, Jan 11, 2010 at 3:39 PM, OrbDevelopment <[hidden email]> wrote:

>
> There you go. This XML will works fine for English, Danish or Russian but not
> for Arabic
>
> In addition I attached the error position
>
> Amir
> http://n2.nabble.com/file/n4285685/Input.xml Input.xml
> http://n2.nabble.com/file/n4285685/Error.gif Error.gif

Re: Arabic stop label list

OrbDevelopment
Error from log file below:

2010-01-11 19:12:02,530 [WARN ] [jetty] /dcs/rest
java.lang.IllegalArgumentException: No enum const class org.carrot2.text.linguistic.LanguageCode.ARABIC

I'm using version 3.1.1 of the DCS, the latest version available at the download link.

Re: Arabic stop label list

Stanislaw Osinski
Administrator

> Error from log file below:
>
> 2010-01-11 19:12:02,530 [WARN ] [jetty] /dcs/rest
> java.lang.IllegalArgumentException: No enum const class
> org.carrot2.text.linguistic.LanguageCode.ARABIC
>
> I use the 3.1.1 version of the DCS. Latest version available at download
> link

Arabic support has not yet been officially released (it will be in version 3.2.0). Please use the development version:

http://download.carrot2.org/head/carrot2-dcs-3.2.0-dev.zip

Cheers,

Staszek
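
The logged message has the form that Java's Enum.valueOf produces when asked for a constant the enum does not declare, which fits the explanation that release 3.1.1 has no ARABIC entry. The Java sketch below reproduces that behaviour with a stand-in enum. OldLanguageCode and resolve are hypothetical names, not Carrot2 classes, and how the DCS actually parses the active-language parameter is an assumption.

```java
public class LanguageLookupSketch {
    // Hypothetical stand-in for an older release's language enum without an ARABIC constant.
    enum OldLanguageCode { ENGLISH, DANISH, RUSSIAN }

    /** Resolves a language name, reporting unknown names instead of propagating the exception. */
    static String resolve(String requested) {
        try {
            return OldLanguageCode.valueOf(requested).name();
        } catch (IllegalArgumentException e) {
            // valueOf throws IllegalArgumentException for any name that is not a declared constant.
            return "unsupported: " + requested;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("ENGLISH"));
        System.out.println(resolve("ARABIC"));
    }
}
```

Note that valueOf matches names exactly, so a value with stray whitespace or different letter case would fail in the same way.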

