How do I remove a non-breaking space in Ruby


I have a string that looks like this:
d = "foo\u00A0\bar"
When I check the length, it says the string is 7 characters long. I checked online and found out that \u00A0 is a non-breaking space. Could someone show me how to remove all the non-breaking spaces in a string?

If you do not care about the non-breaking space specifically, but about any "special" Unicode whitespace character that might appear in your string, you can remove it using the POSIX bracket expression for whitespace:
s.gsub(/[[:space:]]/, '')
These bracket expressions (as opposed to matchers like \s) match not only ASCII characters but all Unicode characters in the class.
For more details, see the Ruby documentation.
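As a minimal sketch of the difference, assuming Ruby 1.9 or later and a UTF-8 string (the helper name remove_unicode_space and the sample string are made up for illustration):

```ruby
# Hypothetical helper: delete every character in the Unicode-aware
# POSIX whitespace class, not just ASCII whitespace.
def remove_unicode_space(str)
  str.gsub(/[[:space:]]/, '')
end

# "\u00A0" is the non-breaking space; the second gap is a plain ASCII space.
sample = "foo\u00A0bar baz"

p sample.gsub(/\s/, '')         # \s leaves the non-breaking space in place
p remove_unicode_space(sample)  # [[:space:]] removes it as well
```

If only the non-breaking space should go and ordinary spaces should stay, matching "\u00A0" directly is the narrower choice.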

irb(main):001:0> d = "foo\u00A0\bar"
=> "foo \bar"
irb(main):002:0> d.gsub("\u00A0", "")
=> "foo\bar"

It's an old thread but maybe it helps somebody.
I found myself looking for a solution to the same problem when I discovered that strip doesn't do the job. I used ord to check what the character was, then used chr to represent it in gsub:
2.2.3 :010 > 160.chr("UTF-8")
=> " "
2.2.3 :011 > 160.chr("UTF-8").strip
=> " "
2.2.3 :012 > nbsp = 160.chr("UTF-8")
=> " "
2.2.3 :013 > nbsp.gsub(160.chr("UTF-8"),"")
=> ""
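I couldn't understand why strip doesn't remove something that looked like a space to me, so I looked up what character 160 actually is: the non-breaking space (U+00A0), which lies outside ASCII and which strip does not treat as whitespace.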
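A rough sketch of the same investigation as reusable helpers, assuming Ruby 1.9 or later with UTF-8 strings (code_points and strip_unicode are hypothetical names, not built-in methods):

```ruby
# Hypothetical helper: list the code point of every character in a string.
def code_points(str)
  str.each_char.map { |c| format('U+%04X', c.ord) }
end

# Hypothetical helper: like String#strip, but also trims Unicode
# whitespace such as the non-breaking space from both ends.
def strip_unicode(str)
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')
end

padded = "#{160.chr('UTF-8')}foo "

p code_points(padded)    # the first entry is U+00A0
p padded.strip           # only the trailing ASCII space is removed
p strip_unicode(padded)  # both ends are trimmed
```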

d.gsub("\u00A0", "") does not work in Ruby 1.8. Instead use d.gsub(/\302\240/,"")
See http://blog.grayproductions.net/articles/understanding_m17n for lots more on the character encoding differences between 1.8 and 1.9.
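To see where \302\240 comes from, here is a small sketch for a modern Ruby (1.9 or later; utf8_octal_bytes is a hypothetical helper name). It prints the UTF-8 bytes of the non-breaking space in octal, which is the notation the 1.8 workaround uses:

```ruby
# Hypothetical helper: the UTF-8 bytes of a string, written in octal.
def utf8_octal_bytes(str)
  str.encode('UTF-8').bytes.map { |b| b.to_s(8) }
end

p "\u00A0".bytes             # [194, 160]
p utf8_octal_bytes("\u00A0") # ["302", "240"]
```

Ruby 1.8 strings are plain byte sequences, so the regex there has to spell out those two bytes.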

Related

Replacing regex capture with the same capture and an extra string

I am trying to escape certain characters in a string. In particular, I want to turn
abc/def.ghi into abc\/def\.ghi
I tried to use the following syntax:
1.9.3p125 :076 > "abc/def.ghi".gsub(/([\/.])/, '\\\1')
=> "abc\\1def\\1ghi"
Hmm. This behaves as if capture replacements didn't work. Yet, when I tried this:
1.9.3p125 :075 > "abc/def.ghi".gsub(/([\/.])/, '\1')
=> "abc/def.ghi"
... I got the replacement to work, but, of course, my prefixes weren't part of it.
What is the correct syntax to do something like this?

This should be easier
gsub(/(?=[.\/])/, "\\")

If you are trying to prepare a string to be used as a regex pattern, use the right tool:
Regexp.escape('abc/def.ghi')
=> "abc/def\\.ghi"
You can then use the resulting string to create a regex:
/#{ Regexp.escape('abc/def.ghi') }/
=> /abc\/def\.ghi/
or:
Regexp.new(Regexp.escape('abc/def.ghi'))
=> /abc\/def\.ghi/
From the docs:
Escapes any characters that would have special meaning in a regular expression. Returns a new escaped string, or self if no characters are escaped. For any string, Regexp.new(Regexp.escape(str))=~str will be true.
Regexp.escape('\*?{}.') #=> \\\*\?\{\}\.
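A short sketch of what the escaping buys you (literal_regexp is a hypothetical helper name): the escaped pattern matches only the original text, while the unescaped one treats "." as a wildcard.

```ruby
# Hypothetical helper: build a regex that matches +text+ literally.
def literal_regexp(text)
  Regexp.new(Regexp.escape(text))
end

text = 'abc/def.ghi'

p literal_regexp(text) =~ text           # 0: the escaped pattern matches its own source
p Regexp.new(text) =~ 'abc/defXghi'      # 0: unescaped, "." matches any character
p literal_regexp(text) =~ 'abc/defXghi'  # nil: escaped, "." only matches a dot
```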

You can pass a block to gsub:
>> "abc/def.ghi".gsub(/([\/.])/) {|m| "\\#{m}"}
=> "abc\\/def\\.ghi"
Not nearly as elegant as @sawa's answer, but it was the only way I could find to get it to work if you need the replacing string to contain the captured group/backreference (rather than inserting the replacement before the look-ahead).
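For comparison, a sketch of three ways to get the same result (the method names are made up for illustration). The backreference form needs an extra level of escaping because gsub itself reads a doubled backslash in the replacement string as one literal backslash:

```ruby
# The single-quoted literal below is the four characters: backslash,
# backslash, backslash, 1. gsub reads that as an escaped backslash
# followed by the backreference to group 1.
def escape_with_backreference(str)
  str.gsub(/([\/.])/, '\\\\\1')
end

# The block's return value is inserted verbatim, so no second level
# of escaping is needed.
def escape_with_block(str)
  str.gsub(/[\/.]/) { |m| "\\#{m}" }
end

# Zero-width lookahead: insert a backslash before each "/" or ".".
def escape_with_lookahead(str)
  str.gsub(/(?=[.\/])/, '\\')
end

puts escape_with_backreference('abc/def.ghi')  # abc\/def\.ghi
puts escape_with_block('abc/def.ghi')          # abc\/def\.ghi
puts escape_with_lookahead('abc/def.ghi')      # abc\/def\.ghi
```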

How to detect this unprintable character using regex or other ways in Ruby?

I came upon a strange character (using Nokogiri).
irb(main):081:0> sss.dump
=> "\"\\u{a0}\""
irb(main):082:0> puts sss
=> nil
irb(main):083:0> sss
=> " "
irb(main):084:0> sss =~ /\s/
=> nil
irb(main):085:0> sss =~ /[[:print:]]/
=> 0
irb(main):087:0> sss == ' '
=> false
irb(main):088:0> sss.length
=> 1
Any idea what is this strange character?
When it's displayed in a webpage, it's a white space, but it doesn't match a whitespace \s
using regular expression. Ruby even thinks it's a printable character!
How do I detect characters like this and exclude them or flag them as whitespace (if possible)?
Thanks

It's the non-breaking space. In HTML, it's used pretty frequently and often written as &nbsp;. One way to find out the identity of a character like "\u{a0}" is to search the web for U+00A0 (using four or more hexadecimal digits) because that's how the Unicode specification notates Unicode code points.
The non-breaking space and other things like it are included in the regex /[[:space:]]/.
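One possible way to flag such characters, sketched under the assumption of Ruby 1.9 or later and UTF-8 input (non_ascii_whitespace is a hypothetical helper name):

```ruby
# Hypothetical helper: report the code point of every whitespace
# character that is not plain ASCII whitespace.
def non_ascii_whitespace(str)
  str.each_char
     .select { |c| c =~ /[[:space:]]/ && !c.ascii_only? }
     .map { |c| format('U+%04X', c.ord) }
end

p non_ascii_whitespace("a\u00A0b c")  # ["U+00A0"]
p non_ascii_whitespace("plain text")  # []
```

The printed code points can then be looked up by their U+ notation as described above.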

Use a regex to match which is stored in a hash

Here is some Ruby code that shows the issue:
1.8.7 :018 > pattern[:key] = '554 5.7.1.*Service unavailable; Helo command .* blocked using'
=> "554 5.7.1.*Service unavailable; Helo command .* blocked using"
1.8.7 :019 > line = '554 5.7.1 Service unavailable; Helo command [abc] blocked using dbl'
=> "554 5.7.1 Service unavailable; Helo command [abc] blocked using dbl"
1.8.7 :020 > line =~ /554 5.7.1.*Service unavailable; Helo command .* blocked using/
=> 0
1.8.7 :021 > line =~ /pattern[:key]/
=> nil
1.8.7 :022 >
The regex works when I write it directly as a literal, but not when I use the pattern stored as a hash value.

This isn't a Ruby question per se, it's how to construct a regex pattern that accomplishes what you want.
In "regex-ese", /pattern[:key]/ means:
Find pattern.
Following pattern look for one of :, k, e or y.
Ruby doesn't automatically interpolate variables in strings or regex patterns like Perl does, so, instead, we have to mark where the variable is using #{...} inline.
If you're only using /pattern[:key]/ as a pattern, don't bother interpolating it into a pattern. Instead, take the direct path and let Regexp do it for you:
pattern[:key] = 'foo'
Regexp.new(pattern[:key])
=> /foo/
Which is the same result as:
/#{pattern[:key]}/
=> /foo/
but doesn't waste CPU cycles.
Another of your attempts used ., [ and ], which are reserved characters in patterns, used to help define patterns. If you need to use such characters, you can have Ruby's Regexp.escape add \ escape characters appropriately, preserving their normal/literal meaning in the string:
Regexp.escape('5.7.1 [abc]')
=> "5\\.7\\.1\\ \\[abc\\]"
which, in real life is "5\.7\.1\ \[abc\]" (when not being displayed in IRB)
To use that in a regex, use:
Regexp.new(Regexp.escape('5.7.1 [abc]'))
=> /5\.7\.1\ \[abc\]/

line =~ /#{pattern[:key]}/
or...
line =~ Regexp.new(pattern[:key])
If you want to escape regex special characters...
line =~ /#{Regexp.quote pattern[:key]}/
Edit: Since you're new to ruby, may I suggest you do this, wherever pattern is defined:
pattern[:key] = Regexp.new '554 5.7.1.*Service unavailable; Helo command .* blocked using'
Then you can simply use the Regexp object stored in pattern:
line =~ pattern[:key]
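Put together, a minimal sketch of keeping compiled patterns in a hash (SMTP_PATTERNS, :helo_blocked and matching_keys are hypothetical names; the pattern and the line are the ones from the question):

```ruby
# Hypothetical table of patterns, stored as Regexp objects rather than strings.
SMTP_PATTERNS = {
  :helo_blocked => Regexp.new('554 5.7.1.*Service unavailable; Helo command .* blocked using')
}

# Return the keys of all patterns that match the given line.
def matching_keys(line, patterns)
  patterns.select { |_key, regex| line =~ regex }.keys
end

line = '554 5.7.1 Service unavailable; Helo command [abc] blocked using dbl'
p matching_keys(line, SMTP_PATTERNS)  # [:helo_blocked]
```

Compiling each pattern once when the hash is built also avoids rebuilding the regex on every comparison.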

How to get Ruby 1.9 regexp supports \p{Nonspacing_Mark}?

Shouldn't the diacritical mark above "a" be removed by the regex?
"hǎo".gsub(/\p{Nonspacing_Mark}/, '')
=> "hǎo"
"hǎo".gsub(/\p{Mn}/, '')
=> "hǎo"
Update:
I kind of get it from how it works in Java.
Normalizer.normalize("hǎo", Form.NFD).replaceAll("\\p{Mn}+", "")
I need to normalize it first to split the "ǎ" into "a" and the diacritical mark.

puts UnicodeUtils.nfkd("ﻺ (hǎo)").gsub(/[\p{Nonspacing_Mark}]/, '')
See How to replace the Unicode gem on Ruby 1.9?
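On newer Rubies the same idea needs no gem. The sketch below assumes Ruby 2.2 or later, where String#unicode_normalize is built in; that is an assumption beyond the Ruby 1.9 setting of the question, and strip_marks is a hypothetical helper name:

```ruby
# Hypothetical helper: decompose to NFD so that "ǎ" becomes "a" plus a
# combining caron, then delete every nonspacing mark.
def strip_marks(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

precomposed = "h\u01CEo"  # "hǎo" with a single precomposed U+01CE

p precomposed.gsub(/\p{Mn}/, '') == precomposed  # true: nothing to remove yet
p strip_marks(precomposed)                       # "hao"
```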

How to extract a single character (as a string) from a larger string in Ruby?

What is the Ruby idiomatic way for retrieving a single character from a string as a one-character string? There is the str[n] method of course, but (as of Ruby 1.8) it returns a character code as a fixnum, not a string. How do you get to a single-character string?

In Ruby 1.9, it's easy. In Ruby 1.9, Strings are encoding-aware sequences of characters, so you can just index into it and you will get a single-character string out of it:
'µsec'[0] # => 'µ'
However, in Ruby 1.8, Strings are sequences of bytes and thus completely unaware of the encoding. If you index into a string and that string uses a multibyte encoding, you risk indexing right into the middle of a multibyte character (in this example, the 'µ' is encoded in UTF-8):
'µsec'[0] # => 194
'µsec'[0].chr # => Garbage
'µsec'[0,1] # => Garbage
However, Regexps and some specialized string methods support at least a small subset of popular encodings, among them some Japanese encodings (e.g. Shift-JIS) and (in this example) UTF-8:
'µsec'.split('')[0] # => 'µ'
'µsec'.split(//u)[0] # => 'µ'
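For Ruby 1.9 and later, a small sketch of the character versus byte distinction (char_at is a hypothetical helper name; "\u00B5" is the 'µ' from the example above):

```ruby
# Hypothetical helper: the character at +index+ as a one-character string.
def char_at(str, index)
  str[index, 1]
end

word = "\u00B5sec"

p char_at(word, 0)   # the whole 'µ', not half of it
p word.chars.first   # same character via String#chars
p word.getbyte(0)    # 194, the first of the two UTF-8 bytes
```

The 194 is the same number that Ruby 1.8 returns from 'µsec'[0], which is why indexing there lands in the middle of the character.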

Before Ruby 1.9:
'Hello'[1].chr # => "e"
Ruby 1.9+:
'Hello'[1] # => "e"
A lot has changed in Ruby 1.9 including string semantics.

Should work for Ruby before and after 1.9:
'Hello'[2,1] # => "l"
Please see Jörg Mittag's comment: this is correct only for single-byte character sets.

'abc'[1..1] # => "b"

'abc'[1].chr # => "b"
