Could anyone help me understand when to use each of the four methods
below:
strict_decode64(str)
strict_encode64(bin)
urlsafe_encode64(bin)
urlsafe_decode64(str)
The docs don't give any examples either, so examples with an
explanation would help me understand.
Thanks in advance
An example of usage would be:
require "base64"
Base64.strict_encode64('Stuff to be encoded')
Base64.strict_decode64("U3R1ZmYgdG8gYmUgZW5jb2RlZA==")
Strict means that whitespace (including CR/LF) is rejected at decode and that no line breaks are added at encode.
Note that the following is accepted:
Base64.decode64("U3R1ZmYgdG8gYmUgZW5jb2RlZA==\n")
With strict, the above is not accepted because of the trailing \n (linefeed), and the following line raises ArgumentError: invalid base64:
Base64.strict_decode64("U3R1ZmYgdG8gYmUgZW5jb2RlZA==\n")
So strict accepts only characters from the Base64 alphabet (A-Z, a-z, 0-9, +, / and = as padding) at decode, and returns only those characters at encode.
Try the following and see how encode64 wraps the output with '\n' (linefeed) every 60 encoded characters, while strict_encode64 doesn't:
print Base64.encode64('I will not use spaces and new lines. I will not use spaces and new lines. I will not use spaces and new lines. I will not use spaces and new lines.I will not use spaces and new lines.')
print Base64.strict_encode64('I will not use spaces and new lines. I will not use spaces and new lines. I will not use spaces and new lines. I will not use spaces and new lines.I will not use spaces and new lines.')
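As a rough sketch of the difference described above (the sample text and variable names are only illustrative, and this assumes the base64 library is available, which on recent Ruby versions means the bundled gem):

```ruby
require "base64"

# Illustrative input; long enough that the encoded form exceeds 60 characters.
text = "I will not use spaces and new lines. " * 3

wrapped = Base64.encode64(text)        # inserts "\n" every 60 encoded characters
strict  = Base64.strict_encode64(text) # a single line, no "\n" at all

puts wrapped.include?("\n") # true
puts strict.include?("\n")  # false

# The lenient decoder ignores the line feeds ...
puts Base64.decode64(wrapped) == text # true

# ... while the strict decoder rejects them.
begin
  Base64.strict_decode64(wrapped)
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

In practice strict_encode64 is the safer choice when the result goes into a header, a JSON field or anything else where embedded line feeds would be a problem.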
The encode and decode methods do opposite things: the first converts a normal string into an encoded string, and the second converts an encoded string back into a normal string.
str = "Hello!"
str == Base64.decode64(Base64.encode64(str)) # This is true
The difference between strict_ and urlsafe_ is the set of characters used in the encoded string. Not all characters are allowed inside a URL (/ for instance has a special meaning there), so when you need to pass your string inside a URL you should use the urlsafe_ version, which uses - and _ in place of + and /.
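A small sketch of that alphabet difference. The two input bytes are chosen only for illustration, because their standard encoding happens to contain both + and /:

```ruby
require "base64"

# Two bytes picked so that the standard alphabet needs both + and /.
bin = [0xFB, 0xFF].pack("C*")

puts Base64.strict_encode64(bin)  # +/8=
puts Base64.urlsafe_encode64(bin) # -_8=

# Each decoder reverses its own encoder.
puts Base64.urlsafe_decode64("-_8=") == bin # true
puts Base64.strict_decode64("+/8=") == bin  # true
```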
Related
I have a character that is displayed as the "replacement character" in Ruby.
I'm using an external API that does parsing, and unfortunately it returns this character instead of - for unordered list points.
I would like to split what is returned on this character, but I've been unsuccessful with the code below.
text.split(//)
How can I split by this character?
These will match any non ASCII character:
[^\x00-\x7F] or [^[:ascii:]].
As noted by @engineersmnky, this may not be ideal if the data you are parsing could contain other unrecognized characters.
Use this regex if you want to split only on that character:
[\uF0B7]
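For instance, a minimal sketch of the split, assuming the API response uses U+F0B7 as its bullet (the sample text and variable names are made up for illustration):

```ruby
# Hypothetical API response using U+F0B7 as the bullet character.
text = "\uF0B7 first item\n\uF0B7 second item\n\uF0B7 third item"

items = text.split(/\uF0B7/)  # split only on that code point
            .map(&:strip)     # trim surrounding whitespace and newlines
            .reject(&:empty?) # drop the empty piece before the first bullet

p items # ["first item", "second item", "third item"]
```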
I have a string like this:
"Jim-Bob's email ###hl###address###endhl### is: jb#example.com"
I want to replace all non-word characters (symbols and whitespace), except the ### delimiters.
I'm currently using:
str.gsub(/[^\w#]+/, 'X')
which yields:
"JimXBobXsXemailX###hl###address###endhl###XisXjb#exampleXcom"
In practice, this is good enough, but it offends me for two reasons:
The # in the email address is not replaced.
The use of [^\w] instead of \W feels sloppy.
How do I replace all non-word characters, unless those characters make up the ###hl### or ###endhl### delimiter strings?
str.gsub(/(###.*?###|\w+)|./) { $1 || "X" }
# => "JimXBobXsXemailX###hl###address###endhl###XisXXjbXexampleXcom"
This approach uses the fact that alternations work like a case structure: the first alternative that matches consumes the corresponding text, and no further matching is done on it. Thus, ###.*?### will consume a whole marker (like ###hl###), and nothing else will be matched inside it. We also match any sequence of word characters. If either of those is captured, we just return it as-is ($1). If not, then we have matched some other character (i.e. not inside a marker, and not a word character) and replace it with "X".
Regarding your second point, I think you are asking too much; there is no simple way to avoid that.
Regarding the first point, a simple way is to temporarily replace "###" with a character that you will never use (let's say your system never produces "\r", so that character does not occur in the input; we can use it as a temporary replacement).
"Jim-Bob's email ###hl###address###endhl### is: jb#example.com"
.gsub("###", "\r").gsub(/[^\w\r]/, "X").gsub("\r", "###")
# => "JimXBobXsXemailX###hl###address###endhl###XisXXjbXexampleXcom"
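If only the two known delimiters should survive, the alternation approach can be tightened. The following is just a sketch under that assumption; the variable name marker_or_word is made up here, and the pattern lists hl and endhl explicitly instead of accepting anything between two ### runs:

```ruby
str = "Jim-Bob's email ###hl###address###endhl### is: jb#example.com"

# Keep the two known markers and runs of word characters;
# every other single character becomes "X".
marker_or_word = /(###(?:hl|endhl)###|\w+)|./

result = str.gsub(marker_or_word) { $1 || "X" }
puts result
# JimXBobXsXemailX###hl###address###endhl###XisXXjbXexampleXcom
```

With this stricter pattern, a stray ### that is not part of one of the two markers is replaced character by character like any other symbol.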
I want to extract #hashtags from a string, also those that have special characters such as #1+1.
Currently I'm using:
@hashtags ||= string.scan(/#\w+/)
But it doesn't work with those special characters. Also, I want it to be UTF-8 compatible.
How do I do this?
EDIT:
If the last character is a special character, it should be removed, as in #hashtag, #hashtag. #hashtag! #hashtag? etc.
Also, the hash sign at the beginning should be removed.
The Solution
You probably want something like:
'#hash+tag'.encode('UTF-8').scan /\b(?<=#)[^#[:punct:]]+\b/
=> ["hash+tag"]
Note that the zero-width assertion at the beginning is required to avoid capturing the pound sign as part of the match.
References
String#encode
Ruby's POSIX Character Classes
This should work:
@hashtags = str.scan(/#([[:graph:]]*[[:alnum:]])/).flatten
Or if you don't want your hashtag to start with a special character:
@hashtags = str.scan(/#((?:[[:alnum:]][[:graph:]]*)?[[:alnum:]])/).flatten
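A quick check of the first pattern against the cases from the question (the sample sentence is invented for illustration, and a UTF-8 source file is assumed):

```ruby
# Made-up input covering a symbol inside a tag, trailing punctuation and UTF-8.
str = "Try #1+1, #hashtag! and #café?"

hashtags = str.scan(/#([[:graph:]]*[[:alnum:]])/).flatten
p hashtags # ["1+1", "hashtag", "café"]
```

Because the capture must end in an alphanumeric character, trailing punctuation is dropped, and because the # sits outside the group, it is not part of the result. Note that [[:graph:]] also matches #, so adjacent tags without a space between them come back as a single entry.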
How about this:
@hashtags ||= string.match(/(#[[:alpha:]]+)|#[\d\+-]+\d+/).to_s[1..-1]
This takes care of #alphabets or #2323+2323, #2323-2323 and #2323+65656-67676.
It also removes the # at the beginning.
Or if you want it in array form:
@hashtags ||= string.scan(/#[[:alpha:]]+|#[\d\+-]+\d+/).collect{|x| x[1..-1]}
This took a long time, and I still don't understand why scan(/#[[:alpha:]]+|#[\d\+-]+\d+/) works but scan(/(#[[:alpha:]]+)|#[\d\+-]+\d+/) does not on my computer. The only difference is the () in the second scan call, which has no such effect when I use the pattern with the match method.
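The likely explanation is how String#scan treats capture groups: when the pattern contains groups, scan returns the captured groups for each match rather than the whole match, so a match that goes through the ungrouped alternative yields nil. A short sketch (the sample string is made up):

```ruby
str = "#ruby and #12+34"

# No groups: scan returns each whole match.
plain = str.scan(/#[[:alpha:]]+|#[\d\+-]+\d+/)
p plain # ["#ruby", "#12+34"]

# One group: scan returns only the group, which is nil whenever
# the second (ungrouped) alternative is the one that matched.
grouped = str.scan(/(#[[:alpha:]]+)|#[\d\+-]+\d+/)
p grouped # [["#ruby"], [nil]]

# A non-capturing group (?:...) restores the first behaviour.
non_capturing = str.scan(/(?:#[[:alpha:]]+)|#[\d\+-]+\d+/)
p non_capturing # ["#ruby", "#12+34"]
```

match is unaffected because to_s on its MatchData gives the whole matched text regardless of groups.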
I am trying to come up with a regex to remove all special characters except some. For example, I have a string:
str = "subscripción gustaría♥"
I want the output to be "subscripción gustaría".
My approach was to match anything that is not an ASCII character (00 - 7F) and not one of the special characters I want to keep, and replace it with an empty string.
str.gsub(/(=?[^\x00-\x7F])(=?^\xC3\xB3)(=?^\xC3\xA1)/,'')
This doesn't work. The last special character is not removed.
Can someone help? (This is ruby 1.8)
Update: I am trying to make the question a little more clear. The string is utf-8 encoded. And I am trying to whitelist the ascii characters plus ó and í and blacklist everything else.
Oniguruma has support for all the characters you care about without having to deal with codepoints. You can just add the unicode characters inside the character class you're whitelisting, followed by the 'u' option.
ruby-1.8.7-p248 > str = "subscripción gustaría♥"
=> "subscripci\303\263n gustar\303\255a\342\231\245"
ruby-1.8.7-p248 > puts str.gsub(/[^a-zA-Z\sáéíóúÁÉÍÓÚ]/u,'')
subscripción gustaría
=> nil
str.split('').find_all {|c| (0x00..0x7f).include? c.ord }.join('')
The question is a bit vague. It says nothing about the encoding of the string. Also, do you want to whitelist characters or blacklist them? Which ones?
But you get the idea: decide what you want, and then use proper ranges as others here have already proposed. Some examples:
If str = "subscripción gustaría♥" is UTF-8,
then you can blacklist all characters above the range (excluding whitespace):
str.gsub(/[^\x{0021}-\x{017E}\s]/,'')
If the string is in the ISO-8859-1 code page, you can try to match all quirky characters like the "heart" at the beginning of the ASCII range:
str.gsub(/[\x01-\x1F]/,'')
The problem here is with the regex and has nothing to do with Ruby. You will probably need to experiment more.
It is not completely clear which characters you want to keep and which you want to delete. The example string's character is some Unicode character that, in my browser, displays as a heart symbol. But it seems you are dealing with 8-bit ASCII characters (since you are using ruby 1.8 and your regular expressions point that way).
Nonetheless, you should be able to do it in one of two ways; either specify the characters you want to keep or, alternatively, specify the characters you want to delete. For example, the following specifies that all characters 0x00-0x7F and 0xC0-0xF6 should be kept (remove everything that is not in that group):
puts str.gsub(/[^\x00-\x7F\xC0-\xF6]/,'')
This next example specifies that characters 0xA1 and 0xC3 should be deleted.
puts str.gsub(/[\xA1\xC3]/,'')
I ended up doing this: str.gsub(/[^\x00-\x7FÁáÉéÍíÑñÓóÚúÜü]/,''). It doesn't work on my mac but works on linux.
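On Ruby 1.9 or later (an assumption; the question itself is about 1.8), strings carry their encoding, so the same whitelist can be sketched without raw byte ranges. This assumes a UTF-8 source file, and the list of accented letters is simply the one from the line above:

```ruby
str = "subscripción gustaría♥"

# Keep ASCII plus an explicit list of accented letters; drop everything else.
cleaned = str.gsub(/[^[:ascii:]ÁáÉéÍíÑñÓóÚúÜü]/, "")

puts cleaned # subscripción gustaría
```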
I am using nokogiri to screen scrape some HTML. In some cases I get some weird characters back. I have tracked down the ASCII codes for these characters with the following code:
@parser.leads[0].phone_numbers[0].each_byte do |c|
  puts "char=#{c}"
end
The characters in question have an ASCII code of 194 and 160.
I want to somehow strip these characters out while parsing.
I have tried the following code but it does not work.
@parser.leads[0].phone_numbers[0].gsub(/160.chr/,'').gsub(/194.chr/,'')
Can anyone tell me how to achieve this?
I found this question while trying to strip out invisible characters when "trimming" a string.
s.strip did not work for me, and I found that the invisible character had the ord number 194.
None of the methods above worked for me, but then I found the question "Convert non-breaking spaces to spaces in Ruby", which says:
Use /\u00a0/ to match non-breaking spaces: s.gsub(/\u00a0/, ' ') converts all non-breaking spaces to regular spaces
Use /[[:space:]]/ to match all whitespace, including Unicode whitespace like non-breaking spaces. This is unlike /\s/, which matches only ASCII whitespace.
So glad I found that! Now I'm using:
s.gsub(/[[:space:]]/,'')
This doesn't answer the question of how to gsub specific character codes, but if you're just trying to remove whitespace it seems to work pretty well.
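To make the connection explicit: bytes 194 and 160 are the UTF-8 encoding of a single non-breaking space (U+00A0). A small sketch, with a made-up phone number standing in for the scraped value:

```ruby
# Hypothetical scraped value padded with non-breaking spaces.
s = "\u00A0555\u00A01234\u00A0"

p "\u00A0".bytes # [194, 160]

# strip and \s only know ASCII whitespace, so they leave U+00A0 alone.
p s.strip == s              # true
p "\u00A0".match?(/\s/)     # false

# [[:space:]] is Unicode-aware and does match it.
p "\u00A0".match?(/[[:space:]]/) # true

p s.gsub(/\u00A0/, " ")      # " 555 1234 "
p s.gsub(/[[:space:]]/, "")  # "5551234"
```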
Your problem is that you want to do a method call but instead you're creating a Regexp. You're searching and replacing strings consisting of the string "160" followed by any character and then the string "chr", and then doing the same except with "160" replaced with "194".
Instead, do gsub(160.chr, '').
Update (2018): This code does not work in current Ruby versions. Please refer to other answers.
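A sketch of both points, assuming a current Ruby and a UTF-8 string (the phone value is hypothetical): the original pattern is read as literal text rather than a method call, and the character can be built from its code point with an explicit encoding instead of from raw bytes.

```ruby
# /160.chr/ is a pattern: the text "160", any one character, then "chr".
p "160-chr".match?(/160.chr/) # true

# Bytes 194 and 160 together are the UTF-8 encoding of code point 160 (U+00A0),
# so build that character as a UTF-8 string and remove it as a whole.
nbsp  = 160.chr(Encoding::UTF_8)
phone = "555\u00A01234" # hypothetical scraped value

p nbsp == "\u00A0"     # true
p phone.gsub(nbsp, "") # "5551234"
```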
You can also try
s.gsub(/\xA0|\xC2/, '')
or
s.delete 160.chr+194.chr
My first thought: should you be using gsub! instead of gsub?
gsub returns a new string, while gsub! performs the substitution in place.
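A brief illustration of that difference (the value is made up, and dup is used only in case string literals are frozen):

```ruby
s = "555 1234".dup

copy = s.gsub(" ", "") # returns a new string
p s    # "555 1234" (unchanged)
p copy # "5551234"

s.gsub!(" ", "")       # modifies s itself
p s    # "5551234"

# gsub! returns nil when nothing was replaced, so avoid chaining on it.
p s.gsub!(" ", "") # nil
```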
I was getting an "invalid multibyte escape" error while trying the above solution, but in a different situation. Google was returning \xA0 when the number was greater than 999 and I wanted to remove it. So I used return_value.gsub(/[\xA0]/n,"") instead, and it worked perfectly fine for me.