How to split by Replacement Character "\uF0B7" in Ruby? - ruby

I have this character (U+F0B7), or see screenshot below. It's the "replacement character" in Ruby.
I'm using an external API that does parsing, and unfortunately it returns this character instead of - for unordered list points.
I would like to split what is returned by this character, but I've been unsuccessful with the code below.
text.split(//)
How can I split by this character?

These will match any non-ASCII character:
[^\x00-\x7F] or [^[:ascii:]].
As noted by @engineersmnky, this may not be the ideal solution if the data you are parsing could contain other unrecognized characters.
Use this regex if you want to split on only the U+F0B7 character:
[\uF0B7]
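A minimal sketch of both approaches, assuming the API returns U+F0B7 in front of each list item. The sample `text` and the `items` / `items_any` names are made up for illustration.

```ruby
# Hypothetical sample input: each bullet point starts with U+F0B7.
text = "\uF0B7 first item\n\uF0B7 second item"

# Split on that one code point only, then tidy up the pieces.
items = text.split(/\uF0B7/).map(&:strip).reject(&:empty?)

# Broader alternative: split on any non-ASCII character.
items_any = text.split(/[^[:ascii:]]/).map(&:strip).reject(&:empty?)

p items
p items_any
```

The narrower pattern is the safer choice if the text may legitimately contain other non-ASCII characters.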

Related

Regex for capital letters not matching accented characters

I am new to Ruby and I'm trying to work with regex.
I have a text which looks something like:
HEADING
Some text which is always non capitalized. Headings are always capitalized, followed by a space or nothing more.
YOU CAN HAVE MULTIPLE WORDS IN HEADING
I'm using this regular expression to choose all headings:
^[A-Z]{2,}\s?([A-Z]{2,}\s?)*$
However, it only matches headings that do not contain characters such as Č, Š, Ž (Slovenian characters).
So I'm guessing [A-Z] only matches ASCII characters? How can I match UTF-8 characters?
You are right: when you define the ASCII range A-Z, the match is made literally only for those characters. This has to do with the history of characters on computers. More and more characters have been added over time, and they are not always arranged in an encoding in ways that are easy to use.
You could make a larger character class that matches the Slovenian characters you need by listing them.
But there is a shortcut. The necessary data is already part of the Unicode data, so you can write a shorter match for "all uppercase characters": /[[:upper:]]/. See http://ruby-doc.org//core-2.1.4/Regexp.html for more.
Altering your regular expression with just this adjustment:
^[[:upper:]]{2,}\s?([[:upper:]]{2,}\s?)*$
You may need to adjust it further. For instance, it would not match the heading "I AM A HEADING", because the pattern insists that each word is at least two letters long.
Without seeing all your examples, I would probably simplify the group matching and just allow spaces anywhere:
^[[:upper:]\s]+$
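The difference between the three patterns can be checked with a short sketch. The sample `lines` are invented for illustration (the third is a made-up Slovenian-style heading), and the source file is assumed to be UTF-8.

```ruby
# Hypothetical sample lines.
lines = [
  "HEADING",
  "Some text which is always non capitalized.",
  "ČŠŽ NASLOV",
  "I AM A HEADING"
]

ascii_only = /^[A-Z]{2,}\s?([A-Z]{2,}\s?)*$/               # original pattern
strict     = /^[[:upper:]]{2,}\s?([[:upper:]]{2,}\s?)*$/   # Unicode-aware, words of 2+ letters
loose      = /^[[:upper:]\s]+$/                            # uppercase letters and whitespace anywhere

ascii_matches  = lines.grep(ascii_only)
strict_matches = lines.grep(strict)
loose_matches  = lines.grep(loose)

p ascii_matches
p strict_matches
p loose_matches
```

Note that the loose pattern would also accept a line made up of whitespace only, so it may need tightening depending on the input.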
You can use the Unicode uppercase letter property:
\p{Lu}
Your regex:
\b\p{Lu}{2,}(?:\s*\p{Lu}{2,})\b

regular expressions matches characters on different lines at the start

My question is how to match the first three characters of certain lines within a string using regular expressions. The regex I have should work; however, when I run the program it only matches the first three characters of the first line. The string is:
.V/RTEE/EW\n.N/ERER/JAN/21
My regex is ^(.[VN]/)* and it needs to match .V/ and .N/. I would be very grateful for any help.
You need to suppress the special meaning of the . and / characters.
Put a \ in front of them.
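The question does not say which language is in use, so the following is only a sketch that assumes Ruby, where ^ matches at the start of every line. In many other languages a multiline flag would be needed as well. The `prefixes` name is made up for illustration.

```ruby
# The string from the question, with a real newline between the two lines.
input = ".V/RTEE/EW\n.N/ERER/JAN/21"

# \. matches a literal dot; \/ is a literal slash inside a /.../ regex literal.
prefixes = input.scan(/^\.[VN]\//)

p prefixes
```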

Parsing out abnormal characters

I have to work with text that was previously copied and pasted from an Excel document into a .txt file. There are a few characters that I assume mean something to Excel but that show up as an unrecognised character (i.e. that '?' symbol in gedit, or one of those rectangles in some other text editors). I want to parse those out somehow, but I'm unsure how to do so. I know regular expressions can be helpful, but there really isn't a pattern that matches unrecognisable characters. How should I go about doing this?
You could maybe work with http://spreadsheet.rubyforge.org/ to read and parse the data.
I suppose you're getting these characters because the text file contains invalid Unicode characters. That means your '?'s and rectangles could actually be unrecognized multibyte sequences.
If you want to handle the spreadsheet contents properly, I recommend first exporting the data to CSV using (Open|Libre)Office and choosing UTF-8 as the file encoding.
https://en.wikipedia.org/wiki/Comma-separated_values
If you are not worried about multibyte sequences, I find this regex handy:
line.gsub( /[^0-9a-zA-Z\-_]/, '*' )
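As a rough sketch of how this could fit together, assuming the file is read as UTF-8 and contains stray invalid bytes: `String#scrub` (Ruby 2.1+) drops or replaces byte sequences that are not valid in the string's encoding, after which the regex above can run safely. The sample `raw` line is invented for illustration.

```ruby
# Hypothetical line containing two bytes that are not valid UTF-8.
raw = "Total\xA0 42\xFF units"

puts raw.valid_encoding?   # the stray bytes make the string invalid

# Remove invalid byte sequences (pass "?" instead of "" to mark them).
clean = raw.scrub("")

# Then apply the whitelist regex from above; note it also masks spaces.
masked = clean.gsub(/[^0-9a-zA-Z\-_]/, "*")

puts clean
puts masked
```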

Regexp Greek chars by number

I deal with strings that contain Greek and English (Latin) text. I'd like to use a regex to catch all the Greek words that contain 4 or more characters.
Using the regexp manual, I figured out that I can use \p{Greek} to grab all Greek words and \w{4,} to grab words of 4 or more characters. However, from the various tests I made, these two don't work together.
Is there any way to do what I want using a single regular expression? Strings are UTF-8 and come out of tweets.
Regards
Are you using the UTF-8 pattern modifier?
/\p{Greek}{4,}/u
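A small sketch of that pattern in Ruby, assuming a UTF-8 source file and precomposed (NFC) Greek letters. The sample `tweet` is made up for illustration.

```ruby
# Hypothetical tweet mixing Greek and Latin text.
tweet = "Καλημέρα to all, και καλή week"

# Runs of four or more consecutive Greek-script characters.
greek_words = tweet.scan(/\p{Greek}{4,}/u)

p greek_words
```

The three-letter word "και" is skipped because it is shorter than four characters.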

gsub ASCII code characters from a string in ruby

I am using Nokogiri to screen scrape some HTML. In some cases I am getting some weird characters back. I have tracked down the ASCII codes for these characters with the following code:
@parser.leads[0].phone_numbers[0].each_byte do |c|
  puts "char=#{c}"
end
The characters in question have ASCII codes 194 and 160.
I want to somehow strip these characters out while parsing.
I have tried the following code but it does not work.
@parser.leads[0].phone_numbers[0].gsub(/160.chr/,'').gsub(/194.chr/,'')
Can anyone tell me how to achieve this?
I found this question while trying to strip out invisible characters when "trimming" a string.
s.strip did not work for me, and I found that the invisible character had the ord number 194.
None of the methods above worked for me, but then I found the question "Convert non-breaking spaces to spaces in Ruby", which says:
Use /\u00a0/ to match non-breaking spaces: s.gsub(/\u00a0/, ' ') converts all non-breaking spaces to regular spaces
Use /[[:space:]]/ to match all whitespace, including Unicode whitespace like non-breaking spaces. This is unlike /\s/, which matches only ASCII whitespace.
So glad I found that! Now I'm using:
s.gsub(/[[:space:]]/,'')
This doesn't answer the question of how to gsub specific character codes, but if you're just trying to remove whitespace it seems to work pretty well.
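To make the connection to the question explicit: bytes 194 and 160 (0xC2 0xA0) are the UTF-8 encoding of U+00A0, the non-breaking space. A short sketch with a made-up phone number shows the difference between the patterns.

```ruby
# Hypothetical scraped value containing a non-breaking space (U+00A0).
phone = "555\u00A0123 4567"

nbsp_bytes = "\u00A0".bytes                 # the two bytes seen in the question

to_spaces  = phone.gsub(/\u00a0/, " ")      # NBSP -> regular space
no_spaces  = phone.gsub(/[[:space:]]/, "")  # removes all whitespace, NBSP included
ascii_only = phone.gsub(/\s/, "")           # leaves the NBSP in place

p nbsp_bytes
p to_spaces
p no_spaces
p ascii_only
```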
Your problem is that you want to do a method call but instead you're creating a Regexp. You're searching and replacing strings consisting of the string "160" followed by any character and then the string "chr", and then doing the same except with "160" replaced with "194".
Instead, do gsub(160.chr, '').
Update (2018): This code does not work in current Ruby versions. Please refer to other answers.
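The first point can be checked directly. As one possible way around the 2018 caveat (an assumption, not something the original answer states), `Integer#chr` accepts an encoding argument, so the character can be built as a UTF-8 string before being passed to gsub.

```ruby
# /160.chr/ is a regex: "160", then any one character, then "chr".
pattern = /160.chr/
matches_literal = pattern.match?("160-chr")
matches_nbsp    = pattern.match?("\u00A0")

# Build the actual character as a UTF-8 string and use it as the pattern.
nbsp    = 160.chr(Encoding::UTF_8)
cleaned = "555\u00A0123".gsub(nbsp, "")

p matches_literal
p matches_nbsp
p cleaned
```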
You can also try
s.gsub(/\xA0|\xC2/, '')
or
s.delete 160.chr+194.chr
My first thought is: should you be using gsub! instead of gsub?
gsub returns a new string, while gsub! performs the substitution in place.
I was getting an "invalid multibyte escape" error while trying the above solution, but in a different situation. Google was returning \xA0 when the number was greater than 999 and I wanted to remove it. So I used return_value.gsub(/[\xA0]/n,"") instead, and it worked perfectly fine for me.
