I have octal escapes in my HTML (stored as a string), which display in the browser as �.
e.g. "Thanks for the update\205.nt"
Is there a way to remove these from the string, or to make it render properly in the browser?
A blunt solution:
"Thanks for the update\205".encode('ascii', :invalid => :replace, :replace => "")
=>"Thanks for the update"
See String#encode for a more subtle approach.
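For instance (a sketch only; treating the raw bytes as ASCII-8BIT is an assumption, adjust the source encoding to whatever your data really is), you can name the source encoding explicitly and keep a visible marker wherever a byte was replaced:

```ruby
dirty = "Thanks for the update\205.nt"

# Convert from raw bytes (ASCII-8BIT) to UTF-8, marking each byte that
# has no UTF-8 meaning with a visible "?" instead of silently dropping it.
clean = dirty.encode("UTF-8", "ASCII-8BIT",
                     invalid: :replace, undef: :replace, replace: "?")
# => "Thanks for the update?.nt"
```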
.gsub(/[^[:print:]]/, ' ') works perfectly.
http://geek.michaelgrace.org/2010/10/remove-non-printable-characters-from-string-using-ruby-regex/
I'm learning Ruby and trying to get filenames from an FTP server. The strings I get are encoded in GB2312 (simplified Chinese). This succeeds in most cases with these lines:
str = str.force_encoding("gb2312")
str = str.encode("utf-8")
but it raises the error "in encode': "\xFD" followed by "\x88" on GB2312 (Encoding::InvalidByteSequenceError)" if the string contains a symbol such as "[" or "【".
Ruby's Encoding class allows a lot of introspection, so you can usually find out how to handle a given String:
"【".encoding
=> #<Encoding:UTF-8>
"【".valid_encoding?
=> true
"【".force_encoding("gb2312").valid_encoding?
=> false
That shows that this character is not valid in the given character set. If you need to transform all such characters, you can use the encode method and replace invalid or undefined characters like so:
"【".encode("gb2312", invalid: :replace, undef: :replace)
=> "\x{A1BE}"
If you have a String that has mixed character Encodings, you are pretty screwed. There is no way to find out without a lot of guessing.
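If you still have to salvage such a string, one common heuristic (only a sketch; the candidate list below is an assumption about your data, not a general solution) is to try a few likely encodings in order and keep the first one whose byte structure validates:

```ruby
# Try candidate encodings in order; return the name of the first one
# for which the string's bytes form a valid sequence, or nil if none do.
def guess_encoding(str, candidates = %w[UTF-8 GB2312])
  candidates.find { |enc| str.dup.force_encoding(enc).valid_encoding? }
end

guess_encoding("【")         # the UTF-8 bytes validate first
guess_encoding("\xA1\xBE")   # the GB2312 bytes for 【
```

This can misfire on short strings (many byte sequences are valid in several encodings at once), which is exactly the "lot of guessing" mentioned above.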
I have a problem with saving records in MongoDB using Mongoid when they contain multibyte characters. This is the string:
a="Chris \xA5\xEB\xAE\xDFe\xA5"
I first force the string's encoding to BINARY and then gsub it like this:
a.force_encoding("BINARY").gsub(0xA5.chr,"oo")
...which works fine:
=> "Chris oo\xEB\xAE\xDFeoo"
But it seems that I cannot use the chr method with a Regexp:
a.force_encoding("BINARY").gsub(/0x....?/.chr,"")
NoMethodError: undefined method `chr' for /0x....?/:Regexp
Anybody with the same issue?
Thanks a lot...
You can do that with interpolation:
a.force_encoding("BINARY").gsub(/#{0xA5.chr}/,"")
gives
"Chris \xEB\xAE\xDFe"
EDIT: based on the comments, here is a version that translates the binary-encoded string to an ASCII representation and runs a regex on that string:
a.unpack('A*').to_s.gsub(/\\x[A-F0-9]{2}/,"")[2..-3] #=>"Chris "
The [2..-3] at the end gets rid of the leading [" and the trailing "].
NOTE: to just get rid of the special characters, you could also use
a.gsub(/\W/,"") #=> "Chris"
The actual string does not contain the literal characters \xA5: that is just how characters that would otherwise be unprintable are shown to you (similarly, when a string contains a newline, Ruby shows you \n).
If you want to replace all the non-ASCII bytes, you could do this:
a="Chris \xA5\xEB\xAE\xDFe\xA5"
a.force_encoding('BINARY').encode('ASCII', :invalid => :replace, :undef => :replace, :replace => 'oo')
This starts by forcing the string to the binary encoding (you always want to start with a string whose bytes are valid for its encoding; binary is always valid, since it can contain arbitrary bytes). Then it converts the string to ASCII. Normally this would raise an error, since there are bytes it doesn't know what to do with, but the extra options tell it to replace invalid or undefined sequences with the characters 'oo'.
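Running that on the sample string from the question, each of the five non-ASCII bytes comes out as one "oo":

```ruby
a = "Chris \xA5\xEB\xAE\xDFe\xA5"

# Bytes >= 0x80 are valid in BINARY but undefined when converting to
# ASCII, so :undef => :replace turns each one into "oo".
cleaned = a.force_encoding("BINARY")
           .encode("ASCII", invalid: :replace, undef: :replace, replace: "oo")
# => "Chris ooooooooeoo"
```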
I'm reading data from a remote source, and occasionally get some characters in another encoding. They're not important.
I'd like to get a "best guess" UTF-8 string, and ignore the invalid data.
Main goal is to get a string I can use, and not run into errors such as:
Encoding::UndefinedConversionError: "\xFF" from ASCII-8BIT to UTF-8:
invalid byte sequence in utf-8
I thought this was it:
string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
This replaces all invalid and undefined characters with '?'.
To drop them entirely, use :replace => '':
string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")
Edit:
I'm not sure this is reliable. I've gone into paranoid-mode, and have been using:
string.encode("UTF-8", ...).force_encoding('UTF-8')
The script seems to be running OK now. But I'm pretty sure I'd gotten errors with this earlier.
Edit 2:
Even with this, I continue to get intermittent errors. Not every time, mind you. Just sometimes.
String#chars or String#each_char can also be used.
# Table 3-8. Use of U+FFFD in UTF-8 Conversion
# http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf
str = "\x61" + "\xF1\x80\x80" + "\xE1\x80" + "\xC2" +
      "\x62" + "\x80" + "\x63" + "\x80" + "\xBF" + "\x64"
p [
'abcd' == str.chars.collect { |c| (c.valid_encoding?) ? c : '' }.join,
'abcd' == str.each_char.map { |c| (c.valid_encoding?) ? c : '' }.join
]
String#scrub can be used since Ruby 2.1.
p [
'abcd' == str.scrub(''),
'abcd' == str.scrub{ |c| '' }
]
This works great for me:
"String".encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "").force_encoding('UTF-8')
To ignore all parts of the string that aren't correctly UTF-8 encoded, the following (as you originally posted) almost does what you want.
string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")
The caveat is that encode does nothing if it thinks the string is already UTF-8. So you need to change encodings, going via an encoding that can still represent the full set of Unicode characters that UTF-8 can encode. (If you don't, you'll corrupt any characters that aren't in that encoding; 7-bit ASCII would be a really bad choice!) So go via UTF-16:
string.encode('UTF-16', :invalid => :replace, :replace => '').encode('UTF-8')
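For example (a sketch with a deliberately broken sample string), a stray \xFF byte in the middle of otherwise-valid UTF-8 is stripped by the UTF-16 round trip:

```ruby
# UTF-8 "café", then one invalid byte, then more valid text.
s = "caf\xC3\xA9 \xFF ok"

# Going via UTF-16 forces a real conversion, so :invalid => :replace
# actually fires and the bad byte is dropped.
clean = s.encode("UTF-16", invalid: :replace, replace: "")
         .encode("UTF-8")
# => "café  ok"
```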
With a bit of help from #masakielastic I have solved this problem for my personal purposes using the #chars method.
The trick is to break the string down into individual characters so that Ruby can fail on each one separately.
Ruby needs to fail when it confronts binary code, etc. If you don't allow Ruby to go ahead and fail, it's a tough road when it comes to this stuff. So I use String#chars to break the given string into an array of characters, then pass each one into a sanitizing method that allows the code to have "microfailures" (my coinage) within the string.
So, given a "dirty" string, let's say you used File#read on a picture (my case):
def num_or_letter?(char)
  if char =~ /[a-zA-Z0-9]/
    true
  elsif char =~ Regexp.union(" ", ".", "?", "-", "+", "/", ",", "(", ")")
    true
  end
end

dirty = File.open(filepath).read
clean_chars = dirty.chars.select do |c|
  begin
    num_or_letter?(c)
  rescue ArgumentError
    next
  end
end
clean = clean_chars.join("")
Allowing the code to fail somewhere along the way seems to be the best way to work through it. So long as you contain those failures within blocks, you can grab whatever is readable by the UTF-8-only-accepting parts of Ruby.
I have not had luck with the one-line uses of String#encode, such as string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?"); they do not work reliably for me.
But I wrote a pure-Ruby "backfill" of String#scrub for MRI 1.9, 2.0, or any other Ruby that does not offer String#scrub:
https://github.com/jrochkind/scrub_rb
It makes String#scrub available in rubies that don't have it; if loaded in MRI 2.1, it will do nothing and you'll still be using the built-in String#scrub, so it can allow you to easily write code that will work on any of these platforms.
Its implementation is somewhat similar to some of the other char-by-char solutions proposed in other answers, but it does not use exceptions for flow control (don't do that), is tested, and provides an API compatible with MRI 2.1's String#scrub.
I have a string that looks like this:
d = "foo\u00A0\bar"
When I check the length, it says that it is 7 characters long. I checked online and found out that it is a non-breaking space. Could someone show me how to remove all the non-breaking spaces in a string?
In case you do not care about the non-breaking space specifically, but about any "special" unicode whitespace character that might appear in your string, you can replace it using the POSIX bracket expression for whitespace:
s.gsub(/[[:space:]]/, '')
These bracket expressions (as opposed to matchers like \s) match not only ASCII characters but all Unicode characters of a class.
For more details, see the Ruby documentation.
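As a quick illustration of the difference (a minimal sketch): Ruby's \s covers only ASCII whitespace, so it leaves a non-breaking space alone, while the POSIX class catches it:

```ruby
s = "foo\u00A0bar"          # contains a non-breaking space (U+00A0)

s.gsub(/\s/, "")            # \s is ASCII-only, so U+00A0 survives
s.gsub(/[[:space:]]/, "")   # => "foobar"
```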
irb(main):001:0> d = "foo\u00A0\bar"
=> "foo \bar"
irb(main):002:0> d.gsub("\u00A0", "")
=> "foo\bar"
It's an old thread but maybe it helps somebody.
I found myself looking for a solution to the same problem when I discovered that strip doesn't do the job. I checked with ord what the character was, and used chr to represent it in gsub:
2.2.3 :010 > 160.chr("UTF-8")
=> " "
2.2.3 :011 > 160.chr("UTF-8").strip
=> " "
2.2.3 :012 > nbsp = 160.chr("UTF-8")
=> " "
2.2.3 :013 > nbsp.gsub(160.chr("UTF-8"),"")
=> ""
I couldn't understand why strip doesn't remove something that looked like a space to me, so I checked here what ASCII 160 actually is.
d.gsub("\u00A0", "") does not work in Ruby 1.8. Instead use d.gsub(/\302\240/,"")
See http://blog.grayproductions.net/articles/understanding_m17n for lots more on the character encoding differences between 1.8 and 1.9.
I thought this code would work, but the regular expression doesn't ever match the \r\n. I have viewed the data I am reading in a hex editor and verified there really is a hex D and hex A pattern in the file.
I have also tried the regular expressions /\xD\xA/m and /\x0D\x0A/m but they also didn't match.
This is my code right now:
lines2 = lines.gsub( /\r\n/m, "\n" )
if ( lines == lines2 )
print "still the same\n"
else
print "made the change\n"
end
In addition to alternatives, it would be nice to know what I'm doing wrong (to facilitate some learning on my part). :)
Use String#strip
Returns a copy of str with leading and trailing whitespace removed.
e.g
" hello ".strip #=> "hello"
"\tgoodbye\r\n".strip #=> "goodbye"
Using gsub
string = string.gsub(/\r/," ")
string = string.gsub(/\n/," ")
Generally when I deal with stripping \r or \n, I'll look for both by doing something like
lines.gsub(/\r\n?/, "\n");
I've found that depending on how the data was saved (the OS used, editor used, Jupiter's relation to Io at the time) there may or may not be the newline after the carriage return. It does seem weird that you see both characters in hex mode. Hope this helps.
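A quick check of that pattern against all three line-ending styles (a minimal sketch):

```ruby
mixed = "unix\nwindows\r\nold mac\rend"

# \r\n? matches CRLF as one unit, or a lone CR, so every style
# normalizes to a single "\n".
mixed.gsub(/\r\n?/, "\n")
# => "unix\nwindows\nold mac\nend"
```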
If you are using Rails, there is a squish method
"\tgoodbye\r\n".squish => "goodbye"
"\tgood \t\r\nbye\r\n".squish => "good bye"
What do you get when you do puts lines? That will give you a clue.
By default, File.open opens the file in text mode, so your \r\n characters will be automatically converted to \n. Maybe that's the reason lines always equals lines2. To prevent Ruby from converting the line ends, use the rb mode:
C:\> copy con lala.txt
a
file
with
many
lines
^Z
C:\> irb
irb(main):001:0> text = File.open('lala.txt').read
=> "a\nfile\nwith\nmany\nlines\n"
irb(main):002:0> bin = File.open('lala.txt', 'rb').read
=> "a\r\nfile\r\nwith\r\nmany\r\nlines\r\n"
irb(main):003:0>
But from your question and code I see you simply need to open the file with the default modifier. You don't need any conversion and may use the shorter File.read.
modified_string = string.gsub(/\s+/, ' ').strip
lines2 = lines.split.join("\n")
"still the same\n".chomp
or
"still the same\n".chomp!
http://www.ruby-doc.org/core-1.9.3/String.html#method-i-chomp
How about the following?
irb(main):003:0> my_string = "Some text with a carriage return \r"
=> "Some text with a carriage return \r"
irb(main):004:0> my_string.gsub(/\r/,"")
=> "Some text with a carriage return "
irb(main):005:0>
Or...
irb(main):007:0> my_string = "Some text with a carriage return \r\n"
=> "Some text with a carriage return \r\n"
irb(main):008:0> my_string.gsub(/\r\n/,"\n")
=> "Some text with a carriage return \n"
irb(main):009:0>
I think your regex is almost complete - here's what I would do:
lines2 = lines.gsub(/[\r\n]+/m, "\n")
In the above, I've put \r and \n into a character class (that way it doesn't matter in which order they might appear) and added the "+" quantifier (so that "\r\n\r\n\r\n" also matches once, and the whole run is replaced with a single "\n").
Just another variant:
lines.delete(" \n")
Why not read the file in text mode, rather than binary mode?
lines.map(&:strip).join(" ")
You can use this :
my_string.strip.gsub(/\s+/, ' ')
def dos2unix(input)
  # Rebuild the string byte by byte, skipping every carriage return
  # (byte 13); the nil entries disappear in join.
  input.each_byte.map { |c| c.chr unless c == 13 }.join
end
remove_all_the_carriage_returns = dos2unix(some_blob)