Question on ITokenType.TT_FULL_URL specification

Tom Pines
Hello,

The string "t.e.st." is tokenized as ITokenType.TT_FULL_URL, which seems consistent with this rule in ExtendedWhitespaceTokenizerImpl.jflex:
<<
(("http" | "https" | "ftp") "://")? {BARE_URL} {URL_PATH}?   { return ITokenType.TT_FULL_URL; }
>>

Was this intended, or should it be something like:

("http" | "https" | "ftp") "://" {BARE_URL} {URL_PATH}?   { return ITokenType.TT_FULL_URL; }

or maybe:

("http" | "https" | "ftp" | "file" | "jar") "://" {BARE_URL} {URL_PATH}?   { return ITokenType.TT_FULL_URL; }
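
To make the difference concrete, here is a rough Java sketch that compares the current rule (optional scheme) with the first alternative (mandatory scheme). The BARE_URL and URL_PATH patterns below are simplified stand-ins assumed for illustration; the real macro definitions are not shown in this thread and may differ. The class name UrlRuleSketch and its helper are hypothetical, and whole-token regex matching only approximates JFlex's longest-match behaviour.

```java
import java.util.regex.Pattern;

final class UrlRuleSketch {
    // ASSUMPTION: simplified stand-ins for the JFlex macros, not the real definitions.
    // BARE_URL: two or more dot-separated labels, e.g. "t.e.st" or "example.com".
    static final String BARE_URL = "[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+";
    // URL_PATH: a run of characters commonly seen after the host part (includes ".").
    static final String URL_PATH = "[A-Za-z0-9/?#&=%_.~+-]+";
    static final String SCHEME = "(?:http|https|ftp)://";

    // Current rule: the scheme prefix is optional.
    static final Pattern OPTIONAL_SCHEME =
            Pattern.compile("(?:" + SCHEME + ")?" + BARE_URL + "(?:" + URL_PATH + ")?");

    // First proposed alternative: the scheme prefix is mandatory.
    static final Pattern MANDATORY_SCHEME =
            Pattern.compile(SCHEME + BARE_URL + "(?:" + URL_PATH + ")?");

    // True when the whole token matches the given rule.
    static boolean isFullUrl(Pattern rule, String token) {
        return rule.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] tokens = {"t.e.st.", "http://t.e.st.", "example.com/path", "plain"};
        for (String token : tokens) {
            System.out.println(token
                    + " optional=" + isFullUrl(OPTIONAL_SCHEME, token)
                    + " mandatory=" + isFullUrl(MANDATORY_SCHEME, token));
        }
    }
}
```

Under these assumed macros, "t.e.st." is accepted only when the scheme is optional, while "http://t.e.st." is accepted by both variants.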

Thanks,
Tom

Re: Question on ITokenType.TT_FULL_URL specification

Stanislaw Osinski
Administrator
Hi Tom,

> The string "t.e.st." is tokenized as ITokenType.TT_FULL_URL, which seems
> correct per the ExtendedWhitespaceTokenizerImpl.jflex snippet:
> <<
> (("http" | "https" | "ftp") "://")? {BARE_URL} {URL_PATH}?   { return ITokenType.TT_FULL_URL; }
> >>
>
> Was this intended, or should it be something like:
>
> ("http" | "https" | "ftp") "://" {BARE_URL} {URL_PATH}?   { return ITokenType.TT_FULL_URL; }
>
> or maybe:
>
> ("http" | "https" | "ftp" | "file" | "jar") "://" {BARE_URL} {URL_PATH}?   { return ITokenType.TT_FULL_URL; }

I think that depends on the use case really -- does yours require mandatory presence of the protocol prefixes?

As far as I recall, from the clustering standpoint, it doesn't matter much because full URLs are omitted during clustering (I need to make sure we don't have the same bug you picked up in the numeric filter).

S.

_______________________________________________
Carrot2-developers mailing list
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Re: Question on ITokenType.TT_FULL_URL specification

Tom Pines
"I think that depends on the use case really -- does yours require mandatory presence of the protocol prefixes?"

Not currently. Just curious.

I noticed that ITokenType.TT_FULL_URL tokens aren't indexed.

Thanks for the quick responses on the posts!

Tom