Ruby Base64 check if it's encoded [duplicate] - ruby

I may receive either of these two strings:
base = Base64.encode64(File.open("/home/usr/Desktop/test", "rb").read)
=> "YQo=\n"
string = File.open("/home/usr/Desktop/test", "rb").read
=> "a\n"
What I have tried so far is to check the string with a regular expression, i.e. /([A-Za-z0-9+\/]{4})*([A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}==$)/, but this would be very heavy if the file is big.
I have also tried base.encoding.name and string.encoding.name, but both return the same value.
I have also seen this post and found the regular expression solution there, but is there any other solution?
Any ideas? I just want to find out whether the string is plain text or Base64-encoded text.

You can use something like this. It is not very performant, but it returns true only when the value survives a decode/encode round trip unchanged:
require 'base64'
def base64?(value)
  value.is_a?(String) && Base64.strict_encode64(Base64.decode64(value)) == value
end
The use of strict_encode64 versus encode64 prevents Ruby from inserting newlines if you have a long string. As a consequence, output produced by encode64 (such as "YQo=\n") only passes if you strip the newlines first. See this post for details.
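A variant of the round-trip check, sketched under the assumption that newline-wrapped output from encode64 should also count as Base64. The method name base64_like? is made up for illustration. It relies on Base64.strict_decode64 raising ArgumentError for input that is not valid Base64. Keep in mind that no check of this kind can tell the two cases apart with certainty: a plain-text string such as "abcd" is also well-formed Base64.

```ruby
require 'base64'

# Hypothetical helper: true when the value is well-formed Base64,
# ignoring the line breaks that Base64.encode64 inserts.
def base64_like?(value)
  return false unless value.is_a?(String)

  stripped = value.delete("\r\n")
  return false if stripped.empty?

  Base64.strict_decode64(stripped) # raises ArgumentError on malformed input
  true
rescue ArgumentError
  false
end

puts base64_like?("YQo=\n") # output of encode64 for "a\n"
puts base64_like?("a\n")    # the raw file content
puts base64_like?("abcd")   # plain text that is also valid Base64
```

The last call shows the limit of the approach: the check answers "could this be Base64?", not "was this produced by a Base64 encoder?".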

Related

Beginner question: What is a role of plus in front of a string in ruby?

I have code that looks like the following:
value = +"#{x}/part"
value << "/part2"
I understand that value would contain something like valueOfX/part/part2, but I don't understand why there is + in front of the string. I tried searching for it, but search engines are not very good at understanding what "plus in front of a string ruby" means. I also tried to run this in online ruby repl with no difference when + is added or not added.
So the question is: why might it be useful to have + like this?
Unary plus on a string (String#+@) behaves as follows:
If the string is frozen, it returns a duplicated, mutable string.
If the string is not frozen, it returns the string itself.
source: https://ruby-doc.org/core/String.html#method-i-2B-40
So in your case, since your string is not frozen, your code is equivalent to:
value = "#{x}/part"
EDIT:
As explained by @stefan in the comments, in Ruby 2.x interpolated strings were frozen under frozen_string_literal: true, so value = +"#{x}/part" was not equivalent to value = "#{x}/part" there. This is no longer the case in Ruby 3.
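A small sketch of the two cases described in the documentation quote. The variable names are illustrative only; String.new is used for the mutable case so the result does not depend on any frozen_string_literal setting.

```ruby
# Frozen receiver: unary plus returns a mutable copy.
frozen = "part".freeze
thawed = +frozen
thawed << "/part2"
puts thawed          # part/part2
puts frozen          # part (the frozen original is untouched)

# Unfrozen receiver: unary plus returns the very same object.
mutable = String.new("part")
puts((+mutable).equal?(mutable)) # true
```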

Ruby gsub with string manipulation

I am new to Ruby and am writing an expression that replaces the string between XML tags with a hash of that value.
I did the following to replace with the new password
puts "<password>check1</password>".gsub(/(?<=password\>)[^\/]+(?=\<\/password)/,'New \0')
RESULT: <password>New check1</password> (EXPECTED)
My expectation is to get the result like this (Md5 checksum of the value "New check1")
<password>6aaf125b14c97b307c85fc6e681c410e</password>
I tried the following, and none of the attempts was successful (I have included the required library with require 'digest').
puts "<password>check1</password>".gsub(/(?<=password\>)[^\/]+(?=\<\/password)/,Digest::MD5.hexdigest('\0'))
puts "<password>check1</password>".gsub(/(?<=password\>)[^\/]+(?=\<\/password)/,Digest::MD5.hexdigest '\0')
puts "<password>check1</password>".gsub(/(?<=password\>)[^\/]+(?=\<\/password)/, "Digest::MD5.hexdigest \0")
Any help in achieving this would be very much appreciated.
This will work:
require 'digest'
line = "<other>stuff</other><password>check1</password><more>more</more>"
# Use the block form: a replacement argument is evaluated before sub runs,
# so it cannot see the match. Inside the block, $~ holds the MatchData.
line.sub(/<password>(?<pwd>[^<]+)<\/password>/) { "<password>#{Digest::SHA2.hexdigest($~[:pwd])}</password>" }
=> "<other>stuff</other><password>8a859fd2a56cc37285bc3e307ef0d9fc1d2ec054ea3c7d0ec0ff547cbfacf8dd</password><more>more</more>"
Make sure the input is processed one line at a time; you'll probably want sub, not gsub.
P.S.: I agree with Tom Lord's comment. If your XML is not gargantuan in size, try to use an XML library to parse it, such as Ox or Nokogiri.
Different libraries have different advantages.
This is a variant of Tilo's answer.
require 'digest'
line = "<other>stuff</other><password>check1</password><more>more</more>"
r = /(?<=<password>).+?(?=<\/password>)/
line.sub(r) { |pwd| Digest::SHA2.hexdigest(pwd) }
#=> "<other>stuff</other><password>8a859fd2a56cc37285bc3e307ef0d9f
# c1d2ec054ea3c7d0ec0ff547cbfacf8dd</password><more>more</more>"
(I've displayed the returned string on two lines to make it readable without the need for horizontal scrolling.)
The regular expression reads, "match '<password>' in a positive lookbehind ((?<=...)), followed by any number of characters, lazily ('?'), followed by the string '</password>' in a positive lookahead ((?=...))".
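For the MD5 variant the question originally asked for, the same block form applies. The attempts in the question fail because Digest::MD5.hexdigest('\0') is evaluated once, before gsub runs, so it hashes the two literal characters backslash and zero rather than the matched text. A minimal sketch, reusing the lookaround pattern from the answer above (the variable names are illustrative):

```ruby
require 'digest'

xml     = "<password>check1</password>"
pattern = /(?<=<password>).+?(?=<\/password>)/

# The block runs once per match, so the digest sees the actual matched text.
hashed = xml.gsub(pattern) { |match| Digest::MD5.hexdigest("New #{match}") }
puts hashed
```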

Check if a string contains a character in a unicode range (using Ruby)

I want to create a simple function in Ruby that will check if the given string contains any unicode characters in the ranges such as the following:
U+007B -- U+00BF
U+02B0 -- U+037F
U+2000 -- U+2BFF
How can I accomplish this? Google is coming up blank for me: all the results are about removing Unicode characters or checking whether a string contains Unicode at all.
The easiest thing would probably be a regex using String#index, String#match, or even String#[]:
string.index(/[\u007B-\u00BF\u02B0-\u037F\u2000-\u2BFF]/)
string.match(/[\u007B-\u00BF\u02B0-\u037F\u2000-\u2BFF]/)
string[/[\u007B-\u00BF\u02B0-\u037F\u2000-\u2BFF]/]
All three will give you nil (which is falsey) if they don't find the pattern and non-nil (which will be truthy) if they do.
I would do it as below (this checks only the first of the three ranges):
my_string = "{ How are you ?}"
puts my_string.chars.any? { |chr| ("\u007B".."\u00BF").include?(chr) }
#=> true
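Both answers can be wrapped into the "simple function" the question asks for. The sketch below shows a regex version and a code-point version covering all three ranges; the constant and method names are made up for illustration, and the input is assumed to be a validly encoded UTF-8 string.

```ruby
# The three ranges from the question, as integer code points and as a character class.
SPECIAL_RANGES  = [0x007B..0x00BF, 0x02B0..0x037F, 0x2000..0x2BFF].freeze
SPECIAL_PATTERN = /[\u007B-\u00BF\u02B0-\u037F\u2000-\u2BFF]/

# Regex version: match? returns a plain boolean and sets no match globals.
def special_char_regex?(string)
  string.match?(SPECIAL_PATTERN)
end

# Code-point version: compares integers, no regex involved.
def special_char_codepoints?(string)
  string.each_codepoint.any? do |codepoint|
    SPECIAL_RANGES.any? { |range| range.cover?(codepoint) }
  end
end

puts special_char_regex?("{ How are you ?}")      # "{" is U+007B
puts special_char_codepoints?("plain text")       # nothing in range
```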

Encoding::UndefinedConversionError

I keep getting an Encoding::UndefinedConversionError ("\xC2" from ASCII-8BIT to UTF-8) every time I try to convert a hash into a JSON string. I tried .encode and .force_encoding with both "UTF-8" and "ASCII-8BIT", chaining them in either order and switching parameters, but nothing seemed to work, so I caught the error like this:
begin
menu.to_json
rescue Encoding::UndefinedConversionError
puts $!.error_char.dump
p $!.error_char.encoding
end
Here menu is a Sequel dataset.to_hash with content from a MySQL DB using the utf8_general_ci collation. The rescue block printed this:
"\xC2"
#<Encoding:ASCII-8BIT>
The encoding never changes, no matter what .encode/.force_encoding I use. I've even tried to replace the string .gsub!(/\\\xC2/) without luck.
Any ideas?
menu.to_s.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
This worked perfectly, I had to replace some extra characters but there are no more errors.
What do you expect for "\xC2"? Probably an Â.
With ASCII-8BIT you have binary data, and Ruby can't decide what the bytes are supposed to be.
You must first set the encoding with force_encoding.
You may try the following code:
Encoding.list.each{|enc|
begin
print "%-10s\t" % [enc]
print "\t\xC2".force_encoding(enc)
print "\t\xC2".force_encoding(enc).encode('utf-8')
rescue => err
print "\t#{err}"
end
print "\n"
}
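The results are the possible values of your "\xC2" in the different encodings.
The output may depend on your output format, but I think you can make a good guess as to which encoding you have.
Once you have identified the encoding you need (probably cp1252), apply it before serializing. force_encoding is a String method, so call it on the string values inside menu rather than on the hash itself:
value.force_encoding('cp1252').encode('UTF-8')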
See also Kashyaps comment.
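Since force_encoding is defined on String rather than on Hash, the retagging has to be applied to every string inside the structure before calling to_json. A minimal sketch, assuming the bytes really are Windows-1252; the helper name transcode_strings and the sample data are hypothetical. Note that a few byte values are undefined in Windows-1252 and would still raise an error during encode.

```ruby
require 'json'

# Hypothetical helper: walks hashes and arrays, retags each string with the
# assumed source encoding, and transcodes it to UTF-8. Other objects pass through.
def transcode_strings(obj, source_encoding)
  case obj
  when String
    obj.dup.force_encoding(source_encoding).encode('UTF-8')
  when Hash
    obj.to_h { |key, value| [transcode_strings(key, source_encoding), transcode_strings(value, source_encoding)] }
  when Array
    obj.map { |value| transcode_strings(value, source_encoding) }
  else
    obj
  end
end

# Sample data standing in for the dataset: a binary string containing 0xC2.
menu = { "name" => "Cr\xC2me".b, "prices" => [1, 2] }

clean = transcode_strings(menu, 'Windows-1252')
puts clean.to_json
```

The originals are duplicated before retagging, so the source hash keeps its binary strings.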
If you don't care about losing the strange characters, you can blow them away:
str.force_encoding("ASCII-8BIT").encode('UTF-8', undef: :replace, replace: '')
Your auto-accepted solution doesn't work: there are effectively no errors, but the result is NOT JSON.
I solved the problem using the oj gem, and it now works fine. It is also faster than the standard JSON library.
Writing:
menu_json = Oj.dump menu
Reading:
menu2 = Oj.load menu_json
See https://github.com/ohler55/oj for more details. I hope it helps.
The :fallback option can be useful if you know which characters you want to replace:
"Text 🙂".encode("ASCII", "UTF-8", fallback: {"🙂" => ":)"})
#=> "Text :)"
From docs:
Sets the replacement string by the given object for undefined character. The object should be a Hash, a Proc, a Method, or an object which has [] method. Its key is an undefined character encoded in the source encoding of current transcoder. Its value can be any encoding until it can be converted into the destination encoding of the transcoder.
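As the documentation quote says, the fallback object can also be a Proc. The sketch below contrasts the Hash form with a lambda that handles any unmapped character; the "U+XXXX" placeholder format is only an illustration, not something the question requires.

```ruby
text = "Text \u{1F642}" # "Text " followed by the slightly smiling face emoji

# Hash fallback: only characters present as keys are handled.
with_hash = text.encode("ASCII", fallback: { "\u{1F642}" => ":)" })

# Proc fallback: called with each character that has no ASCII equivalent.
with_proc = text.encode("ASCII", fallback: ->(char) { format("U+%04X", char.ord) })

puts with_hash
puts with_proc
```

With a Hash, a character that is missing from the keys still raises Encoding::UndefinedConversionError, so the Proc form is the safer choice when the set of offending characters is not known in advance.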

when we import csv data, how eliminate "invalid byte sequence in UTF-8"

We allow users to import data via CSV (using Ruby 1.9.2, hence it's FasterCSV).
Being user data, of course, it might not be properly sanitized.
When we try to display the data in an /index method, we sometimes get the error "invalid byte sequence in UTF-8" pointing to the ERB template where we display one of the fields, widget.name.
When we do the import we'd like to FORCE the incoming data to be valid. Is there a Ruby operator that will map a string to a valid UTF-8 string, e.g. something like
goodstring = badstring.no_more_invalid_bytes
One example of 'bad' data is a character that looks like a hyphen but is not a regular ASCII hyphen. We'd prefer to map the non-UTF-8 characters to a reasonable ASCII equivalent (u-umlaut going to u, for example), BUT we're okay with simply stripping the character too.
Since this happens while importing lots of data, it needs to be a fast built-in operator, hopefully.
Note: here is an example of the data. The file comes from Windows and is 8-bit ASCII. When we import it and display widget.name.inspect (instead of widget.name) in our ERB, we get:
"Chains \x96 Accessories"
So one example of the data is a "hyphen" that's actually the 8-bit code 0x96.
When we changed our CSV parsing to assign fldval = d.encode('UTF-8'),
it throws this error:
Encoding::UndefinedConversionError in StoresController#importfinderitems
"\x96" from ASCII-8BIT to UTF-8
What we're looking for is a simple way to just force it to be valid UTF-8 regardless of the origin type, even if we simply strip non-ASCII characters.
While not as 'nice' as forcing the encoding, this works at a slight expense to our import time:
d.to_s.strip.gsub(/\P{ASCII}/, '')
Thank you, Mladen!
Ruby 1.9's CSV has a new parser that is m17n-aware. The parser works in the Encoding of the IO or String object being read. The methods ::foreach, ::open, ::read, and ::readlines accept an optional :encoding option with which you can specify the Encoding.
For example:
CSV.read('/path/to/file', :encoding => 'windows-1251:utf-8')
Would convert all strings to UTF-8.
You can also use the more standard encoding name 'ISO-8859-1' (the options are passed as keywords rather than as a braced hash, which also works on Ruby 3):
CSV.read('/..', :headers => true, :col_sep => ';', :encoding => 'ISO-8859-1')
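To see what the encoding conversion does to the sample value from the question, here is a sketch that assumes the file is Windows-1252, which is a guess based on "comes from Windows"; the question does not confirm it. In Windows-1252 the byte 0x96 is an en dash (U+2013), so retagging and transcoding yields a valid UTF-8 string, which can then be mapped to an ASCII hyphen if desired.

```ruby
# The sample value from the question, as raw bytes.
raw = "Chains \x96 Accessories".b

# Retag the bytes as Windows-1252 (assumed source encoding), then transcode.
utf8 = raw.dup.force_encoding('Windows-1252').encode('UTF-8')
puts utf8.valid_encoding? # true

# Optional: map the en dash to a plain ASCII hyphen.
ascii_friendly = utf8.tr("\u2013", "-")
puts ascii_friendly       # Chains - Accessories
```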
CSV.parse(File.read('/path/to/csv').scrub)
I answered a similar question that deals with reading external files in 1.9.2 with non-UTF-8 encodings. I think that answer will help you a lot: Character Encoding issue in Rails v3/Ruby 1.9.2
Note that you need to know the source encoding in order to convert it to anything reliably. There are libraries, like the one I linked to in my other answer, that can help you determine this.
Also, if you aren't loading the data from a file, you can convert the encoding of a string in 1.9.2 quite easily:
'string'.encode('UTF-8')
However, it's rare that you're building a string in another encoding, and it's best to convert it at the time it's read into your environment if possible.
Ruby 1.9 can change string encoding with invalid detection and replacement:
str = str.encode('UTF-8', :invalid => :replace)
For unusual strings such as strings loaded from a file of unknown encoding, it's wise to use #encode instead of a regex, #gsub, or #delete, because these all need the string to be parsed-- but if the string is broken, it can't be parsed, so those methods fail.
If you get a message like this:
error ** from ASCII-8BIT to UTF-8
then you're probably trying to convert a string that is tagged as binary (ASCII-8BIT) but whose bytes are already UTF-8, and you can force UTF-8:
str.force_encoding('UTF-8')
If you know the original string is not UTF-8 tagged as binary, or if the output string has illegal characters, then read up on Ruby encoding transliterations.
If you are using Rails, you can try to fix it with the following:
'Your string with strange stuff ##~'.mb_chars.tidy_bytes
It replaces the invalid UTF-8 bytes with valid characters.
More info: https://apidock.com/rails/String/mb_chars
Upload the CSV file to Google Docs Spreadsheet and re-download it as a CSV file. Import and voila! (Worked in my case)
Presumably Google converts it to the wanted format..
Source: Excel to CSV with UTF-8 Encoding
As mentioned by someone else, scrub works well to clean this up in Ruby 2.1+. Note that the snippet below still reads the whole file into memory, which you may want to avoid for a large file; it strips the invalid bytes before parsing:
data = IO::read(file_path).scrub("")
CSV.parse(data, :col_sep => ',', :headers => true) do |row|
puts row
end
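For files too large to read in one go, one option is to scrub line by line. This is a minimal sketch, not a drop-in replacement: the method name each_scrubbed_row is hypothetical, and it assumes that no quoted field contains an embedded newline, because each physical line is parsed on its own.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: yields one parsed row per physical line, with invalid
# bytes removed first. Only one line is held in memory at a time.
def each_scrubbed_row(path)
  return enum_for(:each_scrubbed_row, path) unless block_given?

  File.foreach(path, encoding: 'UTF-8') do |line|
    yield CSV.parse_line(line.scrub(''))
  end
end

# Usage with a small sample file containing the stray 0x96 byte.
sample = Tempfile.new(['import', '.csv'])
sample.binmode
sample.write("id,name\n1,Chains \x96 Accessories\n".b)
sample.close

rows = each_scrubbed_row(sample.path).to_a
rows.each { |row| p row }
```

Stripping the byte loses the dash entirely; when the source encoding is known, transcoding is usually preferable to scrubbing.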
I am using a Mac and was getting the same error:
rescue in parse:Invalid byte sequence in UTF-8 in line 1 (CSV::MalformedCSVError)
I added :encoding => 'ISO-8859-1', which resolved the error, and the CSV file could be read:
results = CSV.read("query_result.csv", :headers => true, :encoding => 'ISO-8859-1')
:headers => true : If set to :first_row or true, the initial row of the CSV file will be treated as a row of headers. If set to an Array, the contents will be used as the headers. If set to a String, the String is run through a call of ::parse_line with the same :col_sep, :row_sep, and :quote_char as this instance to produce an Array of headers. This setting causes #shift to return rows as CSV::Row objects instead of Arrays and #read to return CSV::Table objects instead of an Array of Arrays.
irb(main):024:0> rows = CSV.new(StringIO.new("a,b,c\n1,2,3"), headers: true)
=> #<CSV io_type:StringIO encoding:UTF-8 lineno:0 col_sep:"," row_sep:"\n" quote_char:"\"" headers:true>
irb(main):025:0> rows = CSV.new(StringIO.new("a,b,c\n1,2,3"), headers: true).to_a
=> [#<CSV::Row "a":"1" "b":"2" "c":"3">]
irb(main):026:0> rows.first['a']
=> "1"
In the above example you can clearly see that this also lets us access the data like a hash.
The one thing to be careful about when using headers: true is duplicate headers: a header acts like a hash key, so a lookup such as row['a'] returns only the first field with that header.
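A short sketch of how header lookups behave when a header repeats. The sample data is invented for illustration; the point is that the row keeps every field, while a lookup by header name returns the first match.

```ruby
require 'csv'

# Column "a" appears twice in the header row.
table = CSV.parse("a,b,a\n1,2,3\n", headers: true)
row   = table[0]

p row['a']     # lookup by header: first matching field
p row.fields   # all fields are still present, in order
p row.headers  # the headers, including the duplicate
```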
Only do this
anyobject.to_csv(:encoding => 'utf-8')
