Why is preg_replace removing umlauts? - codeigniter

I am trying to create a preg_replace for my search form, but it keeps stripping umlauts too...
Code:
$zoekwoord = $this->input->get('q', TRUE);
$zoekwoord = preg_replace('/[^a-zA-Z0-9_ %\[\]\.\(\)%&-]u/s', '', $zoekwoord);
Any idea how to keep special chars? (like ö)

Your pattern removes every character except ASCII letters, digits, _ and a few special characters.
Replace a-zA-Z0-9_ with \w and make the pattern Unicode-aware with the /u modifier.
Use
'/[^\w %[\].()%&-]+/u'
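Note that only ] needs to be escaped inside this character class. The /s modifier is redundant because it only affects an unescaped dot outside a character class, and the pattern has none. I believe the u at the end of your pattern is a typo: placed before the closing delimiter, it is matched as a literal u instead of acting as a modifier.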
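The difference between the ASCII-only class and a Unicode-aware one can be sketched as follows. The example is in Ruby purely for illustration, not PHP: Ruby has no /u modifier and its \w stays ASCII-only, so the Unicode-aware variant spells out \p{L} and \p{N} instead. The method names and the sample query are assumptions.

```ruby
# Illustration only: a Ruby stand-in for the PHP preg_replace call.
# strip_ascii_only mirrors the original character class;
# strip_unicode_aware mirrors the intent of the \w + /u version.

def strip_ascii_only(query)
  query.gsub(/[^a-zA-Z0-9_ %\[\].()&-]+/, '')
end

def strip_unicode_aware(query)
  # \p{L} = any Unicode letter, \p{N} = any Unicode number
  query.gsub(/[^\p{L}\p{N}_ %\[\].()&-]+/, '')
end

sample = "K\u00F6ln (50%) <b>"    # "Köln (50%) <b>"
puts strip_ascii_only(sample)      # the umlaut and the angle brackets are stripped
puts strip_unicode_aware(sample)   # only the angle brackets are stripped
```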

How do you print the underscore in a zebra label?

I’m printing a parameter returned from a query that’s a string of letters and underscores.
The label prints just the letters without the underscores, and I’m not sure how to fix it.
^FD<String>^FS
^FH^FD<String>^FS
Thank you very much.
(Removing the ^FH only reads up to the first underscore.)
The ^FH command without a parameter defaults to underscore as the hexadecimal escape character. Either remove the ^FH or specify a different escape character, such as backslash: ^FH\^FD<String>^FS.
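If the label is assembled in code, the second suggestion could look roughly like this. The helper name zpl_field and the sample value are hypothetical; the sketch only builds the command string and assumes the data itself contains no backslash followed by two hex digits, which the printer would then decode.

```ruby
# Hypothetical helper: wraps a value in a ZPL field that declares
# `indicator` as the ^FH hex escape character, so that underscores
# in the data are no longer treated as the start of a hex escape.
def zpl_field(value, indicator: "\\")
  "^FH#{indicator}^FD#{value}^FS"
end

puts zpl_field("LOT_2024_A")   # prints ^FH\^FDLOT_2024_A^FS
```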

Ignoring a character along with word boundary in regex

I am using gsub in Ruby to make a word within text bold. I am using a word boundary so as to not make letters within other words bold, but am finding that this ignores words that have a quote after them. For example:
text = "I said, 'look out below'"
word = "below"
text.gsub(/#{word}\b/i, "<b>#{word}</b>")
In this case the word below is not made bold. Is there any way to ignore certain characters along with a word boundary?
All that escaping in the Regexp.new is looking quite ugly. You could greatly simplify that by using a Regexp literal:
word = 'below'
text = "I said, 'look out below'"
reg = /\b#{word}\b/i
text.gsub!(reg, '<b>\0</b>')
Also, you can call the in-place form, gsub!, directly, unless that string is aliased somewhere else in code that you are not showing us. Lastly, if you use a single-quoted string literal for the replacement in your gsub call, you don't need to escape the backslash.
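Both remarks can be checked with a small sketch. The method names are made up for illustration, and Regexp.escape is added as a precaution that the snippet above does not use.

```ruby
# Returns a new string; the receiver is left untouched.
def bold_copy(text, word)
  text.gsub(/\b#{Regexp.escape(word)}\b/i, '<b>\0</b>')
end

# Mutates the receiver; every variable that refers to the same
# String object sees the change.
def bold_in_place!(text, word)
  text.gsub!(/\b#{Regexp.escape(word)}\b/i, '<b>\0</b>')
  text   # gsub! returns nil when nothing matched, so return text explicitly
end

original  = +"I said, 'look out below'"   # unary + yields an unfrozen string
alias_ref = original

puts bold_copy(original, "below")   # new string with <b>below</b>
puts original                       # still the plain text

bold_in_place!(original, "below")
puts alias_ref                      # the alias now shows the bold markup too

# Single quotes keep the backslash as-is; double quotes need it doubled.
puts '<b>\0</b>' == "<b>\\0</b>"    # true
```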
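Be very careful with your \b boundaries: \b only marks a transition between a word character (\w) and a non-word character, so it stops behaving as expected when the search term itself starts or ends with a non-word character.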
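One way to see the problem, together with one possible workaround, is sketched below. The method names and the C++ sample are assumptions for illustration; the lookaround variant is a commonly used alternative, not something this answer prescribes.

```ruby
# \b on both sides: fails when the term ends in a non-word character,
# because "+" followed by " " is not a word/non-word transition.
def naive_match?(text, term)
  text.match?(/\b#{Regexp.escape(term)}\b/i)
end

# Lookarounds only require that no word character touches the term.
def lookaround_match?(text, term)
  text.match?(/(?<!\w)#{Regexp.escape(term)}(?!\w)/i)
end

text = "I like C++ a lot"
puts naive_match?(text, "c++")         # false
puts lookaround_match?(text, "c++")    # true
puts lookaround_match?("sunday", "day") # false, still no match inside a word
```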
Interpolating #{word} does work inside a Regexp literal, but if you build the pattern from a string, use Regexp.new:
word = "below"
text = "I said, 'look out below'"
reg = Regexp.new("\\b#{word}\\b", true)
text = text.gsub(reg, "<b>\\0</b>")
Note that when building the pattern from a double-quoted string you need to write \b as \\b, or it is interpreted as a backspace. If word may contain special regex characters, escape it with Regexp.escape.
Also, replacing the match with <b>#{word}</b> may change the casing of the text: "BeloW" would be replaced with "below". \0 avoids this by inserting the text that was actually matched. In addition, I added \\b at the beginning: you don't want to look for "day" and end up matching inside "sunday".
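Putting these notes together, a minimal sketch might look like this. The method names bold and bold_lossy are hypothetical, and Regexp::IGNORECASE is used here in place of the bare true flag.

```ruby
# Builds the pattern from a string: \\b becomes \b in the regex,
# and Regexp.escape neutralises any regex metacharacters in word.
def bold(text, word)
  pattern = Regexp.new("\\b#{Regexp.escape(word)}\\b", Regexp::IGNORECASE)
  text.gsub(pattern, "<b>\\0</b>")       # \0 = the text that actually matched
end

# Same pattern, but the replacement re-inserts word itself,
# so the original casing of the match is lost.
def bold_lossy(text, word)
  pattern = Regexp.new("\\b#{Regexp.escape(word)}\\b", Regexp::IGNORECASE)
  text.gsub(pattern, "<b>#{word}</b>")
end

puts bold("Look out BeloW", "below")        # Look out <b>BeloW</b>
puts bold_lossy("Look out BeloW", "below")  # Look out <b>below</b>
puts bold("sunday is a day", "day")         # sunday is a <b>day</b>
```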

Regex that considers apostrophes as word characters? [duplicate]

I want to split a string in Java on any non-alphanumeric characters.
Currently I have been doing it like this:
words = Str.split("\\W+");
However, I want to keep apostrophes ("'") in there. Is there a regular expression that preserves apostrophes but drops the rest of the junk? Thanks.
words = Str.split("[^\\w']+");
Just add it to the character class. \W is equivalent to [^\w], which you can then add ' to.
Do note, however, that \w also actually includes underscores. If you want to split on underscores as well, you should be using [^a-zA-Z0-9'] instead.
For basic English characters, use
words = Str.split("[^a-zA-Z0-9']+");
If you want to include English words with special characters (such as fiancé) or for languages that use non-English characters, go with
words = Str.split("[^\\p{L}0-9']+");
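The three character classes can be compared side by side. The sketch below uses Ruby rather than Java, purely for illustration: Ruby's \w is likewise limited to ASCII letters, digits, and underscore, and \p{L} has the same meaning, although split itself may differ between the two languages in other details. The method names and the sample sentence are assumptions.

```ruby
# Keeps underscores and apostrophes, but splits on non-ASCII letters.
def split_word_chars(text)
  text.split(/[^\w']+/)
end

# Also splits on underscores; still ASCII-only.
def split_ascii_alnum(text)
  text.split(/[^a-zA-Z0-9']+/)
end

# Keeps any Unicode letter, so accented words stay whole.
def split_unicode_letters(text)
  text.split(/[^\p{L}0-9']+/)
end

sample = "it's a fianc\u00E9_thing, really!"   # "it's a fiancé_thing, really!"
p split_word_chars(sample)       # "fianc" and "_thing" are separated at the é
p split_ascii_alnum(sample)      # "fianc" and "thing"
p split_unicode_letters(sample)  # "fiancé" and "thing"
```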

How do I match a UTF-8 encoded hashtag with embedded punctuation characters?

I want to extract #hashtags from a string, also those that have special characters such as #1+1.
Currently I'm using:
@hashtags ||= string.scan(/#\w+/)
But it doesn't work with those special characters. Also, I want it to be UTF-8 compatible.
How do I do this?
EDIT:
If the last character is a special character it should be removed, such as #hashtag, #hashtag. #hashtag! #hashtag? etc...
Also, the hash sign at the beginning should be removed.
The Solution
You probably want something like:
'#hash+tag'.encode('UTF-8').scan /\b(?<=#)[^#[:punct:]]+\b/
=> ["hash+tag"]
Note that the zero-width assertion at the beginning is required to avoid capturing the pound sign as part of the match.
References
String#encode
Ruby's POSIX Character Classes
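The role of the zero-width assertion can be seen by comparing it with a capture group. This is a simplified sketch: it uses [[:alnum:]] instead of the character class above, so it does not handle tags such as #1+1, and the method names are made up.

```ruby
# Lookbehind: "#" must precede the match but is not part of it.
def tags_with_lookbehind(text)
  text.scan(/(?<=#)[[:alnum:]]+/)
end

# Capture group: "#" is matched, but only the group is returned.
def tags_with_group(text)
  text.scan(/#([[:alnum:]]+)/).flatten
end

text = "Visit #ruby and #\u00FCber today"   # second tag is "#über"
puts tags_with_lookbehind(text).inspect
puts tags_with_group(text).inspect          # same two tags, without the "#"
```

[[:alnum:]] is Unicode-aware for UTF-8 strings in Ruby, which is why the second tag is picked up in full.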
This should work:
@hashtags = str.scan(/#([[:graph:]]*[[:alnum:]])/).flatten
Or if you don't want your hashtag to start with a special character:
@hashtags = str.scan(/#((?:[[:alnum:]][[:graph:]]*)?[[:alnum:]])/).flatten
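A quick check of the first pattern against the cases from the question could look like this. The method name extract_hashtags and the sample string are assumptions.

```ruby
# Greedy [[:graph:]]* swallows everything up to the next whitespace,
# then backtracks until the last character is alphanumeric, which
# drops trailing punctuation such as "," "!" or "?".
def extract_hashtags(text)
  text.scan(/#([[:graph:]]*[[:alnum:]])/).flatten
end

tags = extract_hashtags("#hashtag, #1+1 #hashtag! #\u00FCber?")
puts tags.length   # 4
puts tags[1]       # 1+1
```

One caveat of this pattern: [[:graph:]] also matches "#", so two tags written without a space between them, as in "#a#b", come back as the single tag "a#b".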
How about this:
@hashtags ||= string.match(/(#[[:alpha:]]+)|#[\d\+-]+\d+/).to_s[1..-1]
Takes care of #alphabets or #2323+2323 #2323-2323 #2323+65656-67676
Also removes the # at the beginning
Or if you want it in array form:
@hashtags ||= string.scan(/#[[:alpha:]]+|#[\d\+-]+\d+/).collect{|x| x[1..-1]}
This took a long time, and I still don't understand why scan(/#[[:alpha:]]+|#[\d\+-]+\d+/) works but scan(/(#[[:alpha:]]+)|#[\d\+-]+\d+/) does not on my computer. The only difference is the () in the second scan call. The parentheses make no difference, as expected, when I use the match method.
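The difference comes from how String#scan treats capture groups: without groups it returns the matched strings, and with groups it returns one array of captures per match, with nil for a group that did not take part. MatchData#to_s always returns the whole match, which is why match is unaffected. A minimal sketch, with made-up method names:

```ruby
def scan_without_group(text)
  text.scan(/#[[:alpha:]]+|#[\d\+-]+\d+/)
end

def scan_with_group(text)
  text.scan(/(#[[:alpha:]]+)|#[\d\+-]+\d+/)
end

# (?: ) groups without capturing, so scan returns whole matches again.
def scan_non_capturing(text)
  text.scan(/(?:#[[:alpha:]]+)|#[\d\+-]+\d+/)
end

text = "#ruby #1+1"
p scan_without_group(text)   # => ["#ruby", "#1+1"]
p scan_with_group(text)      # => [["#ruby"], [nil]]
p scan_non_capturing(text)   # => ["#ruby", "#1+1"]
```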
