Zend Lucene fails all searches with special characters - zend-search-lucene

If anyone knows a simple answer to this, I won't have to wade through creating an extra index with escaped strings and crying my eyes out while littering my pretty code.
Basically, the Lucene search we have running cannot handle any non-letter characters. Space, percent signs, dots, dashes, slashes, you name it. This is highly infuriating, because I cannot search for items containing these characters, whether I escape them or not.
I have two options: Kill these characters in a separate index and strip them from the names I'm searching or stop goddamn searching.

You can escape special characters using '/'. Lucene treats the following as special characters, and you will have to escape them to make it work.
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
If you want to search "2+3", query should be "2/+3"

Use QueryParser.escape(String s) to escape the query string.

According to http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/queryparsersyntax.html#-
The escape character is the backslash (\), not the forward slash (/).
And to answer Ankit, $ doesn't seem to have to be escaped since it's not a special character.
Escaping the dash as suggested by Ralph doesn't make a difference for me (Zend Lucene). You'd think that when a word 'abc-def' is indexed and you search for 'abc-def' you'll somehow find that word, regardless of whether the dash is ignored at the indexing step or not. Same input should have same result. The word seems to be indexed as two separate tokens 'abc' and 'def'. Yet searching for 'abc-def' gives no results when 'abc def' does.
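As a rough illustration of the backslash escaping discussed above, here is a minimal Ruby sketch. The helper name escape_lucene_query is hypothetical (it is not part of Zend Search Lucene or Lucene itself), and it only covers the characters in the list quoted above. As the last answer notes, escaping alone may not help if the term was already split on the dash at index time.

```ruby
# Hypothetical helper, not a Zend or Lucene API: puts a backslash in front
# of every special character or operator from the list above.
def escape_lucene_query(text)
  special = /&&|\|\||[+\-!(){}\[\]^"~*?:\\]/
  text.gsub(special) { |match| "\\#{match}" }
end

puts escape_lucene_query("2+3")      # prints 2\+3
puts escape_lucene_query("abc-def")  # prints abc\-def
```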

Related

Elastic search: Create tokens separated by either <space> or "-" and greater than 3 chars

In my elastic search setup, I would like to create tokens separated by either " " or "-" and greater than 3 chars.
I believe pattern tokenizer can work but I am not able to create the regular expression.
Please help me with the regular expression.
You should be able to use the following regex in the pattern field of your pattern tokenizer:
([^\s-]{3,})
The \s means any whitespace character.
The - means the literal dash character.
Putting the two of them between [^ and ] means match any character that isn't the ones in the list (in this case, anything not whitespace and not a dash)
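The {3,} means the previous match has to occur 3 times or more (use {4,} if tokens must be strictly longer than 3 characters).
The parentheses around the entire expression capture what is inside. The pattern tokenizer takes its tokens from that capture group only if its group parameter is set to 1; with the default of -1, the pattern is used to split the text instead.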
You can play with this regex here and see how it will split your string:
https://regex101.com/r/2e9p34/1
On a side note, there may be other better ways to do this that will better handle edge cases you aren't thinking of, but I decided to answer your question exactly as you asked it. I highly recommend exploring all of the options ElasticSearch provides for its analyzers for your use case to see which one best fits your needs.
Hope this helps!
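To see which tokens the regex above would pick out, here is a small sketch in plain Ruby. It only imitates the capture-group behaviour with String#scan; it does not call Elasticsearch, and the helper name tokens_for is made up for the example.

```ruby
# Plain Ruby, not Elasticsearch: String#scan with one capture group returns
# an array of one-element arrays, so flatten it into a list of tokens.
def tokens_for(text)
  text.scan(/([^\s-]{3,})/).flatten
end

p tokens_for("foo-bar ab hello world-x")  # prints ["foo", "bar", "hello", "world"]
```

Runs shorter than three characters ("ab" and "x") are dropped, and both the space and the dash act as separators.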

Replace non-word characters, unless given sequence matches

I have a string like this:
"Jim-Bob's email ###hl###address###endhl### is: jb#example.com"
I want to replace all non-word characters (symbols and whitespace), except the ### delimiters.
I'm currently using:
str.gsub(/[^\w#]+/, 'X')
which yields:
"JimXBobXsXemailX###hl###address###endhl###XisXjb#exampleXcom"
In practice, this is good enough, but it offends me for two reasons:
The # in the email address is not replaced.
The use of [^\w] instead of \W feels sloppy.
How do I replace all non-word characters, unless those characters make up the ###hl### or ###endhl### delimiter strings?
str.gsub(/(###.*?###|\w+)|./) { $1 || "X" }
# => "JimXBobXsXemailX###hl###address###endhl###XisXXjbXexampleXcom"
This approach uses the fact that alternations work like a case structure: the first alternative that matches consumes the corresponding string, and no further matching is done on it. Thus, ###.*?### will consume a marker (like ###hl###); nothing else will be matched inside it. We also match any sequence of word characters. If either of those is captured, we can just return it as-is ($1). If not, then we match any other character (i.e. not inside a marker, and not a word character) and replace it with "X".
Regarding your second point, I think you are asking too much; there is no simple way to avoid that.
Regarding the first point, a simple way is to temporarily replace "###" with a character that you will never use (let's say your system never produces "\r", so that character is free; we can use it as a temporary replacement).
"Jim-Bob's email ###hl###address###endhl### is: jb#example.com"
.gsub("###", "\r").gsub(/[^\w\r]/, "X").gsub("\r", "###")
# => "JimXBobXsXemailX###hl###address###endhl###XisXXjbXexampleXcom"
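For comparison, the two answers above can be wrapped in methods and run side by side. This is only a sketch: the method names are invented for the example, and the second method assumes, as that answer does, that "\r" never occurs in the input.

```ruby
# Approach 1: keep markers and word runs (captured in $1), replace the rest.
def mask_with_alternation(str)
  str.gsub(/(###.*?###|\w+)|./) { $1 || "X" }
end

# Approach 2: swap "###" for a placeholder ("\r" is assumed unused),
# replace everything else, then swap the placeholder back.
def mask_with_placeholder(str)
  str.gsub("###", "\r").gsub(/[^\w\r]/, "X").gsub("\r", "###")
end

input = "Jim-Bob's email ###hl###address###endhl### is: jb#example.com"
puts mask_with_alternation(input) == mask_with_placeholder(input)  # prints true

puts mask_with_alternation("a ### b")  # prints aXXXXXb
puts mask_with_placeholder("a ### b")  # prints aX###Xb
```

The two agree on the sample string but differ on an unpaired "###": the alternation version needs a closing "###" to recognise a marker, while the placeholder version protects every "###" it sees.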

Select multiple words and characters using GREP

I need some GREP help. I'm trying to search for text in an InDesign file that has less-than "<" and greater-than ">" characters on either end. The text could be one word or more and could include spaces and even numbers. But here's the catch: there may be several bracketed items on a line, such as <12 peaches> and <3 plums>.
I tried using <(.+)> but this picks up the whole paragraph and removes the beginning and ending brackets < > but leaves the ones in the middle.
Anyone know the proper wildcard structure that would find any text or number between these brackets and even if more than one set of "<>" appear in a single paragraph?
FYI, GREP in InDesign is not exactly the same as it's used on the web.
The + metacharacter is greedy, meaning it matches as much as it can and so tends to catch more than needed. You can restrain it with a question mark, like this:
<.+?>
or change your pattern to
<[^>]+>
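The difference between the three patterns can be checked outside InDesign. The sketch below uses Ruby's regex engine, not InDesign's GREP, on the assumption that greedy and lazy quantifiers behave the same way for these simple patterns; the sample sentence and the method name are made up.

```ruby
# Returns every substring of text that the given pattern matches.
def bracketed_items(text, pattern)
  text.scan(pattern)
end

sample = "We have <12 peaches> and <3 plums> today."

p bracketed_items(sample, /<.+>/)     # greedy: ["<12 peaches> and <3 plums>"]
p bracketed_items(sample, /<.+?>/)    # lazy: ["<12 peaches>", "<3 plums>"]
p bracketed_items(sample, /<[^>]+>/)  # negated class: ["<12 peaches>", "<3 plums>"]
```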

Regular expression help to skip first occurrence of a special character while allowing for later special chars but no whitespace

I'm looking for words starting with a hashtag: "#yolo"
My regex for this was very simple: /#\w+/
This worked fine until I hit words that ended with a question mark: "#yolo?".
I updated my regex to allow for words and any non whitespace character as well: /#[\w\S]*/.
The problem is I sometimes need to pull a match from a word starting with two '#' characters, up until whitespace, that may contain a special character in it or at the end of the word (which I need to capture).
Example:
"##yolo?"
And I would like to end up with:
"#yolo?"
Note: the regular expressions are for Ruby.
P.S. I'm testing these out here: http://rubular.com/
Maybe this would work
#(#?[\S]+)
What about
#[^#\s]+
\w is a subset of \S (i.e. [^\s]), so you don't need both. Also, I assume you don't want any more #s in the match, so we use [^#\s], which excludes both whitespace and # characters.
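A quick Ruby check of the second suggestion, using the example from the question plus one made-up sentence (the method name hashtags is only for illustration):

```ruby
# Returns every hashtag: one "#" followed by characters that are neither
# "#" nor whitespace.
def hashtags(text)
  text.scan(/#[^#\s]+/)
end

p hashtags("##yolo?")                # prints ["#yolo?"]
p hashtags("so #yolo? and ##swag!")  # prints ["#yolo?", "#swag!"]
```

With a doubled "##", the match cannot start at the first "#" because the next character is also "#", so it starts at the second one.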

regular expressions matches characters on different lines at the start

My question is how to match the first three characters of certain lines within a string using regular expressions. The regex I have should work; however, when I run the program it only matches the first three characters of the first line. The string is:
.V/RTEE/EW\n.N/ERER/JAN/21
My regex is ^(.[VN]/)* and it needs to match .V/ and .N/. I would be very grateful for any help.
You need to suppress the special meaning of the . and / characters.
Put a \ in front of each of them.
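The question does not say which language is being used, so the following is only a Ruby sketch with a made-up method name. It escapes the dot as suggested and drops the outer ( )* so that each line prefix is returned as its own match. In Ruby, ^ already matches at the start of every line; in many other engines it matches only at the start of the whole string unless a multiline flag is enabled, which would also explain why only the first line matched.

```ruby
# In Ruby, ^ matches at the start of every line, so no extra flag is needed.
# %r{...} is used so the / inside the pattern does not need escaping.
def line_prefixes(text)
  text.scan(%r{^\.[VN]/})
end

p line_prefixes(".V/RTEE/EW\n.N/ERER/JAN/21")  # prints [".V/", ".N/"]
```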
