Is it possible to identify the format of a string? - ruby

Is it possible to recognize if a string is formatted as a BSON ObjectID?
For strings we could do:
"hello".is_a?(String) # => true
That would not work since the ObjectID is a String anyway. But is it possible to analyze the string to determine if it's formatted as a BSON ObjectID?
Usually, ObjectIDs have this format.
52f4e2274d6f6865080c0000
The formatting criteria are stated in the docs:
ObjectId is a 12-byte BSON type, constructed using:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
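As a rough illustration of that layout, the sketch below slices the 24 hex characters into the four fields. It assumes the layout quoted above; `split_object_id` is a hypothetical helper name, not part of any gem.

```ruby
# Hypothetical helper: split a 24-character hex ObjectId string into the
# fields described in the quoted docs (4 + 3 + 2 + 3 bytes).
def split_object_id(hex)
  {
    time:    Time.at(hex[0, 8].to_i(16)).utc, # 4 bytes: seconds since the Unix epoch
    machine: hex[8, 6],                       # 3 bytes: machine identifier (kept as hex)
    pid:     hex[14, 4].to_i(16),             # 2 bytes: process id
    counter: hex[18, 6].to_i(16)              # 3 bytes: counter
  }
end

fields = split_object_id('52f4e2274d6f6865080c0000')
puts fields[:time].to_i # seconds since the epoch, taken from the first 8 hex digits
puts fields[:machine]
```

Note that this only interprets the bytes; it does not prove the string was actually generated as an ObjectId.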

Any 24-character hexadecimal string is a valid BSON object id, so you can check it using this regular expression:
'52f4e2274d6f6865080c0000' =~ /\A\h{24}\z/
# => 0
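If you would rather not depend on a driver gem, the same check can be wrapped in a small predicate. This is a minimal sketch; `object_id_string?` and `OBJECT_ID_FORMAT` are hypothetical names, and `String#match?` assumes Ruby 2.4 or later.

```ruby
# Exactly 24 hexadecimal characters, anchored to the whole string.
OBJECT_ID_FORMAT = /\A\h{24}\z/

# Hypothetical helper: true only for strings shaped like a BSON ObjectId.
def object_id_string?(value)
  value.is_a?(String) && value.match?(OBJECT_ID_FORMAT)
end

puts object_id_string?('52f4e2274d6f6865080c0000') # => true
puts object_id_string?('hello')                    # => false
```

Using `\A` and `\z` rather than `^` and `$` matters here, because the latter match at line boundaries and would accept strings containing extra lines.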
Both the moped (used by mongoid) and the bson (used by mongo_mapper) gems encapsulate this check in a legal? method:
require 'moped'
Moped::BSON::ObjectId.legal?('00' * 12)
# => true
require 'bson'
BSON::ObjectId.legal?('00' * 12)
# => true

In Mongoid, use the .is_a?(Moped::BSON::ObjectId) syntax.
Example:
some_id = YourModel.first.id
some_id.is_a?(Moped::BSON::ObjectId)
Note:
"52d7874679478f45e8000001".is_a?(String) # Prints true

Related

How can I convert a UUID to a string using a custom character set in Ruby?

I want to create a valid IFC GUID (IfcGloballyUniqueId) according to the specification here:
http://www.buildingsmart-tech.org/ifc/IFC2x3/TC1/html/ifcutilityresource/lexical/ifcgloballyuniqueid.htm
It's basically a UUID or GUID (128 bit) mapped to a set of 22 characters to limit storage space in a text file.
I currently have this workaround, but it's merely an approximation:
guid = '';22.times{|i|guid<<'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$'[rand(64)]}
It seems best to use ruby SecureRandom to generate a 128 bit UUID, like in this example (https://ruby-doc.org/stdlib-2.3.0/libdoc/securerandom/rdoc/SecureRandom.html):
SecureRandom.uuid #=> "2d931510-d99f-494a-8c67-87feb05e1594"
This UUID needs to be mapped to a string with a length of 22 characters according to this format:
1 2 3 4 5 6
0123456789012345678901234567890123456789012345678901234567890123
"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$";
I don't understand this exactly.
Should the 32-character hex number be converted to a 128-bit binary number, then divided into 22 groups of 6 bits (except for one that gets the remaining 2 bits), each of which can be converted to a decimal number from 0 to 63? Each number could then be replaced by the corresponding character from the conversion table.
I hope someone can verify if I'm on the right track here.
And if I am, is there a computationally faster way in Ruby to convert the 128-bit number to the 22 values in the range 0-63 than using all these separate conversions?
Edit: For anyone having the same problem, this is my solution for now:
require 'securerandom'
# possible characters in GUID
guid64 = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$'
guid = ""
# SecureRandom.uuid: creates a 128 bit UUID hex string
# tr('-', ''): removes the dashes from the hex string
# pack('H*'): converts the hex string to a binary number (high nibble first) (?) is this correct?
# This reverses the number so we end up with the leftover bit on the end, which helps with chopping the string into pieces.
# It needs to be reversed again to end up with a string in the original order.
# unpack('b*'): converts the binary number to a bit string (128 0's and 1's) and places it into an array
# [0]: gets the first (and only) value from the array
# to_s.scan(/.{1,6}/m): chops the string into pieces of 6 characters (bits), with the leftover on the end.
[SecureRandom.uuid.tr('-', '')].pack('H*').unpack('b*')[0].to_s.scan(/.{1,6}/m).each do |num|
# take the number (0 - 63) and find the matching character in guid64, add the found character to the guid string
guid << guid64[num.to_i(2)]
end
guid.reverse
Base64 encoding is pretty close to what you want here, but the mappings are different. No big deal, you can fix that:
require 'securerandom'
require 'base64'
# Define the two mappings here, side-by-side
BASE64 = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
IFCB64 = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$'
def ifcb64(hex)
# Convert from hex to binary, then from binary to Base64
# Trim off the == padding, then convert mappings with `tr`
Base64.encode64([ hex.tr('-', '') ].pack('H*')).gsub(/\=*\n/, '').tr(BASE64, IFCB64)
end
ifcb64(SecureRandom.uuid)
# => "fa9P7E3qJEc1tPxgUuPZHm"
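Another way to picture the mapping is plain integer arithmetic: treat the UUID as one 128-bit integer and peel off 22 base-64 digits. The sketch below assumes the leftover 2 bits belong to the first character (so it always falls in 0-3), which is not the same bit alignment as the Base64 approach above, where the padding ends up at the end. Check the IFC specification before relying on either. `uuid_to_ifc_chars` and `IFC_CHARS` are hypothetical names.

```ruby
require 'securerandom'

IFC_CHARS = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$'

# Hypothetical helper: map a UUID string to 22 characters of IFC_CHARS.
# Assumption: the most significant character carries only 2 bits.
def uuid_to_ifc_chars(uuid)
  n = uuid.delete('-').to_i(16) # the UUID as a 128-bit integer
  digits = Array.new(22) do
    n, rem = n.divmod(64)       # peel off the lowest 6 bits each round
    IFC_CHARS[rem]
  end
  digits.reverse.join           # digits were produced least significant first
end

puts uuid_to_ifc_chars(SecureRandom.uuid)
```

Because 2 + 21 * 6 = 128, the 22 characters hold exactly the 128 bits with nothing left over.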

ruby string interpolation issue with net/imap library

I am using Ruby's built-in net/imap library. There seems to be an issue with string interpolation. The following works fine:
imap = Net::IMAP.new('imap.gmail.com',993, true)
imap.login('myuser@gmail.com','password')
imap.select('INBOX')
ids = imap.search(["SINCE", "26-Sep-2016"])
=> [1, 2]
I used a string literal above. However, when I replace the string literal with the following, an error occurs:
imap = Net::IMAP.new('imap.gmail.com',993, true)
imap.login('myuser@gmail.com','password')
imap.select('INBOX')
ids = imap.search(["SINCE", Time.now.strftime('%d-%b-%y')])
=> Net::IMAP::BadResponseError: Could not parse command
Why does only a string literal work, when the result of Time.now.strftime is also a string? This is a standard library in Ruby, so surely it must work. What am I missing?
%y in Time#strftime is the 2-digit year, and Net::IMAP#search requires a 4-digit year. You want %Y:
Time.now.strftime('%d-%b-%y') # => "28-Sep-16"
Time.now.strftime('%d-%b-%Y') # => "28-Sep-2016"
This isn't just a requirement of Net::IMAP, it's actually defined by the IMAP specification on pages 84 and 85.
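To keep the format in one place, the search key can be built by a small helper. This is a sketch only; `imap_since` is a hypothetical name, and it returns the same two-element array that the question passes to `imap.search`.

```ruby
# Hypothetical helper: build a SINCE search key with a 4-digit year.
def imap_since(time)
  ['SINCE', time.strftime('%d-%b-%Y')] # %Y is the 4-digit year, %y only 2 digits
end

p imap_since(Time.utc(2016, 9, 28)) # => ["SINCE", "28-Sep-2016"]
```

Ruby's `%b` always yields the English month abbreviation, which is what the IMAP date format expects.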

How to convert a Base64 encoded string to UUID format

How can I convert a Base64 encoded string to a hex encoded string with dashes(basically to uuid format)?
For example if I have
'FWLalpF2T5mmyxS03Q+hNQ0K'
then how can I convert it to:
1562da96-9176-4f99-a6cb-14b4dd0fa135
I was familiar with unpack but this prompted me to learn the directive as pointed out by cremno.
simplest form:
b64 = 'FWLalpF2T5mmyxS03Q+hNQ0K'
b64.unpack("m0").first.unpack("H8H4H4H4H12").join('-')
#=> "1562da96-9176-4f99-a6cb-14b4dd0fa135"
b64.unpack("m0")
give us:
#=> ["\x15b\xDA\x96\x91vO\x99\xA6\xCB\x14\xB4\xDD\x0F\xA15\r\n"]
which is an array so we use .first to grab the string and unpack again using the directive to format it in the 8-4-4-4-12 format:
b64.unpack("m0").first.unpack("H8H4H4H4H12")
gives us:
#=> ["1562da96", "9176", "4f99", "a6cb", "14b4dd0fa135"]
an array of strings, so now we just join it with the -:
b64.unpack("m0").first.unpack("H8H4H4H4H12").join('-')
#=> "1562da96-9176-4f99-a6cb-14b4dd0fa135"
OOPS
The accepted answer has a flaw:
b64 = 'FWLalpF2T5mmyxS03Q+hNQ0K'
b64.unpack("m0").first.unpack("H8H4H4H4H12").join('-')
# => "1562da96-9176-4f99-a6cb-14b4dd0fa135"
Changing the last char in the b64 string results in the same UUID:
b64 = 'FWLalpF2T5mmyxS03Q+hNQ0L'
b64.unpack("m0").first.unpack("H8H4H4H4H12").join('-')
# => "1562da96-9176-4f99-a6cb-14b4dd0fa135"
To prevent this, you might want to hash your input (base64 or anything else) to the correct length e.g. with MD5:
require "digest"
b64 = 'FWLalpF2T5mmyxS03Q+hNQ0K'
Digest::MD5.hexdigest(b64).unpack("a8a4a4a4a12").join('-')
# => "df71c785-6552-a977-e0ac-8edb8fd63f6f"
Now the full input is relevant, altering the last char results in a different UUID:
require "digest"
b64 = 'FWLalpF2T5mmyxS03Q+hNQ0L'
Digest::MD5.hexdigest(b64).unpack("a8a4a4a4a12").join('-')
# => "2625f170-d05a-f65d-38ff-5d9a7a972382"
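If the goal is to decode the UUID rather than derive a new one by hashing, another option is to reject input that does not decode to exactly 16 bytes. The sketch below assumes a trailing "\r\n" (as in the example string) is the only extra data you want to tolerate; `base64_to_uuid` is a hypothetical helper, and `unpack1` assumes Ruby 2.4 or later.

```ruby
# Hypothetical helper: strict Base64 -> UUID conversion that refuses
# input carrying unexpected extra bytes.
def base64_to_uuid(b64)
  bytes = b64.unpack1('m0') # strict Base64 decode, raises on malformed input
  # Assumption: tolerate exactly one trailing CRLF, as in the example string.
  bytes = bytes[0, 16] if bytes.bytesize == 18 && bytes.end_with?("\r\n")
  raise ArgumentError, "expected 16 bytes, got #{bytes.bytesize}" unless bytes.bytesize == 16

  bytes.unpack('H8H4H4H4H12').join('-')
end

puts base64_to_uuid('FWLalpF2T5mmyxS03Q+hNQ0K')
# => 1562da96-9176-4f99-a6cb-14b4dd0fa135
```

With this check, the variant ending in `L` raises an ArgumentError instead of silently mapping to the same UUID.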

String is unexpectedly converted to hex

I tried to get some data from Firebird database. I have a field "UID", whose value is de6c50a94aee524d9d287a43158360f4 String(16).
When I get it with Ruby, I got:
"UID"=>"\xDElP\xA9J\xEERM\x9D(zC\x15\x83`\xF4"
Why didn't I get a string?
conn.query(:hash , 'SELECT FIRST 1 UID FROM cmd').first
The UID you receive is a binary array, which in ruby is represented as a packed string. To unpack it do the following:
"\xDElP\xA9J\xEERM\x9D(zC\x15\x83`\xF4".unpack('n*').map { |x| x.to_s(16) }.join
# => "de6c50a94aee524d9d287a43158360f4"
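One caveat with `to_s(16)` on 16-bit chunks is that leading zeros are dropped whenever a chunk is smaller than 0x1000. The example value happens not to hit that case. A sketch that avoids the problem by using the `H` directive follows; `binary_to_hex` and `binary_to_uuid` are hypothetical helper names, and `unpack1` assumes Ruby 2.4 or later.

```ruby
# The 16 raw bytes from the question, tagged as binary.
FIREBIRD_UID = "\xDElP\xA9J\xEERM\x9D(zC\x15\x83`\xF4".b

# Hypothetical helper: hex string, two digits per byte, zeros preserved.
def binary_to_hex(bytes)
  bytes.unpack1('H*')
end

# Hypothetical helper: the same bytes in 8-4-4-4-12 UUID layout.
def binary_to_uuid(bytes)
  bytes.unpack('H8H4H4H4H12').join('-')
end

puts binary_to_hex(FIREBIRD_UID)  # => de6c50a94aee524d9d287a43158360f4
puts binary_to_uuid(FIREBIRD_UID) # => de6c50a9-4aee-524d-9d28-7a43158360f4

# The pitfall: a chunk with leading zeros loses them with to_s(16).
puts "\x00\x01".b.unpack('n*').map { |x| x.to_s(16) }.join # => 1
puts binary_to_hex("\x00\x01".b)                           # => 0001
```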
Your UID is a 128bit value. The hex string representation of UID can be built with unpack:
str = "%08x%04x%04x%04x%04x%08x" % UID.unpack("NnnnnN")
=> "de6c50a94aee524d9d287a43158360f4"
The reason for this specific grouping is that the code is really meant for UUIDs:
str = "%08x-%04x-%04x-%04x-%04x%08x" % UID.unpack("NnnnnN")
=> "de6c50a9-4aee-524d-9d28-7a43158360f4"
As I commented, I guess the datatype of UID in your Firebird database is a CHAR(16) CHARACTER SET OCTETS, this is a binary datatype. Firebird (before Firebird 4) doesn't know the SQL types BINARY or VARBINARY, but fields with CHARACTER SET OCTETS are binary.
The value you are retrieving is probably a UUID. You either need to use the value as a binary, or select a human 'readable' UUID string using UUID_TO_CHAR:
SELECT FIRST 1 UUID_TO_CHAR(UID) FROM cmd

When we import CSV data, how do we eliminate "invalid byte sequence in UTF-8"?

We allow users to import data via CSV (using Ruby 1.9.2, hence it's FasterCSV).
Being user data, of course, it might not be properly sanitized.
When we try to display the data in an /index method, we sometimes get the error "invalid byte sequence in UTF-8" pointing to our erb where we display one of the fields, widget.name.
When we do the import, we'd like to FORCE the incoming data to be valid. Is there a Ruby operator that will map a string to a valid UTF-8 string, e.g., something like
goodstring = badstring.no_more_invalid_bytes
One example of 'bad' data is a char that looks like a hyphen but is not a regular ASCII hyphen. We'd prefer to map the non-UTF-8 chars to a reasonable ASCII equivalent (u-umlaut going to u, for example), BUT we're okay with simply stripping the character too.
Since this happens when importing lots of data, it needs to be a fast built-in operator, hopefully.
Note: here is an example of the data. The file comes from Windows and is 8-bit ASCII. When we import it and display widget.name.inspect (instead of widget.name) in our erb, we get:
"Chains \x96 Accessories"
So one example of the data is a "hyphen" that's actually the 8-bit code 0x96.
When we changed our CSV parse to assign fldval = d.encode('UTF-8'),
it throws this error:
Encoding::UndefinedConversionError in StoresController#importfinderitems
"\x96" from ASCII-8BIT to UTF-8
What we're looking for is a simple way to just force it to be valid UTF-8 regardless of origin type, even if we simply strip non-ASCII.
While not as 'nice' as forcing the encoding, this works at a slight expense to our import time:
d.to_s.strip.gsub(/\P{ASCII}/, '')
Thank you, Mladen!
Ruby 1.9's CSV has a new parser that works with m17n. The parser works with the Encoding of the IO object or string it reads from. The methods ::foreach, ::open, ::read, and ::readlines accept an optional :encoding option with which you can specify the Encoding.
For example:
CSV.read('/path/to/file', :encoding => 'windows-1251:utf-8')
Would convert all strings to UTF-8.
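For the data in the question (a Windows file where 0x96 shows up as a dash), a sketch of the same idea might look like the following. It assumes the source file is Windows-1252, which the question does not confirm; `read_windows_csv` is a hypothetical helper, and the temporary file only stands in for the uploaded CSV.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: read a CSV stored as Windows-1252 (an assumption
# about the source) and return rows whose strings are UTF-8.
def read_windows_csv(path)
  CSV.read(path, headers: true, encoding: 'Windows-1252:UTF-8')
end

# Stand-in for the uploaded file: one header line, one row containing byte 0x96.
Tempfile.create(['widgets', '.csv']) do |file|
  file.binmode
  file.write("name\nChains \x96 Accessories\n".b)
  file.flush
  puts read_windows_csv(file.path).first['name']
end
```

In Windows-1252, byte 0x96 is an en dash (U+2013), so the value comes back as valid UTF-8 instead of raising at display time.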
You can also use the more standard encoding name 'ISO-8859-1':
CSV.read('/..', :headers => true, :col_sep => ';', :encoding => 'ISO-8859-1')
CSV.parse(File.read('/path/to/csv').scrub)
I answered a similar question that deals with reading external files in 1.9.2 with non-UTF-8 encodings. I think that answer will help you a lot: Character Encoding issue in Rails v3/Ruby 1.9.2
Note that you need to know the source encoding to convert it to anything reliably. There are libraries like the one I linked to in my other answer that can help you determine this.
Also, if you aren't loading the data from a file, you can convert the encoding of a string in 1.9.2 quite easily:
'string'.encode('UTF-8')
However, it's rare that you're building a string in another encoding, and it's best to convert it at the time it's read into your environment if possible.
Ruby 1.9 can change string encoding with invalid detection and replacement:
str = str.encode('UTF-8', :invalid => :replace)
For unusual strings such as strings loaded from a file of unknown encoding, it's wise to use #encode instead of a regex, #gsub, or #delete, because these all need the string to be parsed-- but if the string is broken, it can't be parsed, so those methods fail.
If you get a message like this:
error ** from ASCII-8BIT to UTF-8
Then you're probably trying to convert a binary string that's already in UTF-8, and you can force UTF-8:
str.force_encoding('UTF-8')
If you know the original string is not in binary UTF-8, or if the output string has illegal characters, then read up on Ruby encoding transliterations.
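The options above behave differently on the same bytes, which the following sketch tries to make concrete. It uses the "Chains \x96 Accessories" value from the question. The three method names are hypothetical, 'Windows-1252' is an assumed source encoding, and `scrub` requires Ruby 2.1 or later.

```ruby
RAW = "Chains \x96 Accessories".b # bytes as read from the file, tagged binary

# 1. Source encoding known (assumed Windows-1252 here): transcode properly.
def from_known_encoding(bytes, source = 'Windows-1252')
  bytes.dup.force_encoding(source).encode('UTF-8')
end

# 2. Source unknown: drop every byte that is not plain ASCII.
#    :undef is needed because binary bytes >= 0x80 have no UTF-8 mapping.
def strip_non_ascii(bytes)
  bytes.b.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
end

# 3. Mostly UTF-8 with a few bad bytes: keep the text, patch the bad bytes.
def scrub_utf8(bytes, replacement = '?')
  bytes.dup.force_encoding('UTF-8').scrub(replacement)
end

puts from_known_encoding(RAW) # dash preserved as U+2013
puts strip_non_ascii(RAW)     # dash removed
puts scrub_utf8(RAW)          # dash replaced with "?"
```

Only the first variant keeps the character, which is why knowing the source encoding is worth the effort when you can find it out.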
If you are using Rails, you can try to fix it with the following
'Your string with strange stuff ##~'.mb_chars.tidy_bytes
It removes the invalid UTF-8 chars and replaces them with valid ones.
More info: https://apidock.com/rails/String/mb_chars
Upload the CSV file to Google Docs Spreadsheet and re-download it as a CSV file. Import and voila! (Worked in my case)
Presumably Google converts it to the wanted format..
Source: Excel to CSV with UTF-8 Encoding
As mentioned by someone else, scrub works well to clean this up in Ruby 2.1+. The example below scrubs the file contents before parsing; note that IO.read still loads the whole file into memory:
data = IO::read(file_path).scrub("")
CSV.parse(data, :col_sep => ',', :headers => true) do |row|
puts row
end
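If the file is too large to read in one go, one option is to scrub and parse it line by line. This is a sketch under a significant assumption: no field contains an embedded newline, since each physical line is parsed on its own. `each_scrubbed_row` is a hypothetical helper name.

```ruby
require 'csv'

# Hypothetical helper: yield one parsed row per physical line, dropping
# invalid UTF-8 bytes first. Assumes no quoted field spans multiple lines.
def each_scrubbed_row(path)
  return enum_for(:each_scrubbed_row, path) unless block_given?

  File.foreach(path, encoding: 'UTF-8') do |line|
    yield CSV.parse_line(line.scrub(''))
  end
end
```

Calling `each_scrubbed_row(file_path) { |row| puts row.inspect }` then holds only one line in memory at a time.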
I am on a Mac and was getting the same error:
rescue in parse:Invalid byte sequence in UTF-8 in line 1 (CSV::MalformedCSVError)
Adding :encoding => 'ISO-8859-1' resolved the error, and the CSV file could be read:
results = CSV.read("query_result.csv", :headers => true, :encoding => 'ISO-8859-1')
:headers => true : If set to :first_row or true, the initial row of the CSV file will be treated as a row of headers. If set to an Array, the contents will be used as the headers. If set to a String, the String is run through a call of ::parse_line with the same :col_sep, :row_sep, and :quote_char as this instance to produce an Array of headers. This setting causes #shift to return rows as CSV::Row objects instead of Arrays and #read to return CSV::Table objects instead of an Array of Arrays.
irb(main):024:0> rows = CSV.new(StringIO.new("a,b,c\n1,2,3"), headers: true)
=> <#CSV io_type:StringIO encoding:UTF-8 lineno:0 col_sep:"," row_sep:"\n" quote_char:"\"" headers:true>
irb(main):025:0> rows = CSV.new(StringIO.new("a,b,c\n1,2,3"), headers: true).to_a
=> [#<CSV::Row "a":"1" "b":"2" "c":"3">]
irb(main):026:0> rows.first['a']
=> "1"
In the example above, you can see that this also lets us access the row data like a hash.
The one thing to be careful about when using headers: true is duplicate headers, since keys are unique in hashes.
Only do this
anyobject.to_csv(:encoding => 'utf-8')
