how can I query for unicode characters in mongodb using Ruby? - ruby

Let's say I have a record in my database that has:
name: "World\u0092s Greatest Jet Fighter Pilot"
OK I need to get in there and clean out the \u0092 (there were a ton of these in the db). I can query like this:
# encoding: UTF-8
...
def self.by_partial name
  return Movie.find(:all, :conditions => {:name => /^.*#{name}.*/i})
end
# console:
>> sel = Movie.by_partial(/Greatest/) and sel.size
=> 1
and get back the correct number of records. But when I throw in the unicode, it fails:
>> sel = Movie.by_partial(/\u0092/) and sel.size
=> 0
>> sel = Movie.by_partial(/\\u0092/) and sel.size
=> 0
>> sel = Movie.by_partial('\u0092') and sel.size
=> 0
>> sel = Movie.by_partial('\\u0092') and sel.size
=> 0
What do I need to do to be able to query for records that contain unicode characters? Is this a setting in the rails console? I managed to solve this by iterating the records and checking like so if mov.name =~ /\u0092/ ... but I can't figure out how to pass a unicode string into my mongoid selector. Iterating the records seemed way too brute force. Luckily I don't need to do this very often.

I don't think your problem is with Unicode, your problems are:
The string interpolation inside by_partial.
And \u only works inside double quoted strings.
Second things first:
> '\u0070'
=> "\\u0070"
> '\\u0070'
=> "\\u0070"
> "\u0070"
=> "p"
So Movie.by_partial("\u0092") should work.
Your first problem is that you're passing /\u0092/ (which does match the character in question) to by_partial but by_partial does this:
/^.*#{name}.*/i
With a Regexp argument, that becomes /^.*#{/\u0092/}.*/i, which ends up as /^.*(?-mix:\u0092).*/i. I'd guess that the MongoDB driver is having some issues translating that Ruby regex into a JavaScript regex.
The MongoDB driver doesn't seem to like \u in a regex at all. Feeding /^\u0070/ into MongoDB doesn't get me any matches but /^p/ does find what I'm expecting, /^#{"\u0070"}/ also works. I'm not sure what's going on in the guts of the MongoDB regex translator but we're not the only ones to come across this. I'd guess that the MongoDB regex translator doesn't understand \u so it ends up being converted to a raw \\u0092 and since you don't have that sequence of six characters in your database, you don't find anything.
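Here is a small sketch of both points that runs without MongoDB. The helper name partial_pattern is made up for illustration, and I'm only assuming that handing its result to Mongoid behaves like the /^#{"\u0070"}/ case above; I haven't checked that against a real database.

```ruby
# \u is only interpreted inside double-quoted strings.
single = '\u0092'   # six characters: backslash, u, 0, 0, 9, 2
double = "\u0092"   # one character: U+0092

# Interpolating a Regexp keeps the \u escape as text in the source.
from_regex = /^.*#{/\u0092/}.*/i

# Hypothetical helper: take a String, escape it, and build the pattern so
# that the regex source contains the real character instead of an escape.
def partial_pattern(name)
  /^.*#{Regexp.escape(name.to_s)}.*/i
end

from_string = partial_pattern("\u0092")

puts single.length       # 6
puts double.length       # 1
puts from_regex.source   # ^.*(?-mix:\u0092).*
```

With this shape, by_partial would take a double-quoted string such as "\u0092" rather than a regex literal.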

Related

Ruby Base64 check if it's encoded [duplicate]

I may receive these two strings:
base = Base64.encode64(File.open("/home/usr/Desktop/test", "rb").read)
=> "YQo=\n"
string = File.open("/home/usr/Desktop/test", "rb").read
=> "a\n"
What I have tried so far is to check the string with a regular expression, i.e. /([A-Za-z0-9+\/]{4})*([A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}==$)/ but this would be very heavy if the file is big.
I have also tried base.encoding.name and string.encoding.name, but both return the same value.
I have also seen this post and got the regular expression solution, but is there any other solution?
Any idea? I just want to find out whether the string is plain text or Base64-encoded text.
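You can use something like this. It is not very performant, but it only accepts strings that round-trip exactly through Base64. Keep in mind that some plain text (for example "abcd") is also valid Base64, so no check can fully tell the two apart: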
require 'base64'
def base64?(value)
  value.is_a?(String) && Base64.strict_encode64(Base64.decode64(value)) == value
end
The use of strict_encode64 versus encode64 prevents Ruby from inadvertently inserting newlines if you have a long string. See this post for details.
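A quick sketch of how that check behaves on the strings from the question. One thing to watch: encode64 appends a newline, so its output does not round-trip through strict_encode64. The base64_ignoring_newlines? variant is my own hypothetical addition for that case, not part of the answer above.

```ruby
require 'base64'

def base64?(value)
  value.is_a?(String) && Base64.strict_encode64(Base64.decode64(value)) == value
end

# Hypothetical variant: ignore the line feeds that encode64 inserts.
def base64_ignoring_newlines?(value)
  value.is_a?(String) &&
    Base64.strict_encode64(Base64.decode64(value)) == value.delete("\n")
end

puts base64?("YQo=")                      # true
puts base64?("YQo=\n")                    # false, because of the trailing newline
puts base64_ignoring_newlines?("YQo=\n")  # true
puts base64?("a\n")                       # false
puts base64?("abcd")                      # true, although it may just be text
```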

Ruby: Why does unpack('Q') give a different result than manual conversion?

I'm trying to write a function that will .unpack('Q') (unpack to uint64_t) without access to the unpack method.
When I manually convert from string to binary to uint64, I get a different result than .unpack('Q'):
Integer('abcdefgh'.unpack('B*').first, 2) # => 7017280452245743464
'abcdefgh'.unpack('Q').first # => 7523094288207667809
I don't understand what's happening here.
I also don't understand why the output of .unpack('Q') is fixed regardless of the size of the input. If I add a thousand characters after 'abcdefgh' and then unpack('Q') it, I still just get [7523094288207667809]?
Byte order matters:
Integer('abcdefgh'.
  each_char.
  flat_map { |c| c.unpack('B*') }.
  reverse.
  join, 2)
#⇒ 7523094288207667809
'abcdefgh'.unpack('Q*').first
#⇒ 7523094288207667809
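Your code produces a different result because it reads the bytes in big-endian order, whereas unpack('Q') uses the machine's native byte order, which is little-endian on most common platforms. After converting to binary, the bytes must be reversed.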
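If the goal is to avoid unpack altogether, the same thing can be sketched with plain integer arithmetic. The two helper names below are made up for illustration, and the little-endian one matches unpack('Q') only on a little-endian machine.

```ruby
# Hypothetical helpers: build an unsigned 64-bit integer from the first
# 8 bytes of a string without calling unpack.
def uint64_big_endian(str)
  str.bytes.first(8).inject(0) { |acc, byte| (acc << 8) | byte }
end

def uint64_little_endian(str)
  # Reversing the bytes makes the last byte the most significant one.
  str.bytes.first(8).reverse.inject(0) { |acc, byte| (acc << 8) | byte }
end

puts uint64_big_endian('abcdefgh')     # 7017280452245743464
puts uint64_little_endian('abcdefgh')  # 7523094288207667809
```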
For the last part of your question, the output of .unpack('Q') doesn't change with a longer input string because the format specifies a single 64-bit value, so any characters after the first 8 are ignored. If you specified a format of Q2 and a 16 character string you'd decode 2 values:
> 'abcdefghihjklmno'.unpack('Q2')
=> [7523094288207667809, 8029475498074204265]
and again you'd find adding additional characters wouldn't change the result:
> 'abcdefghihjklmnofoofoo'.unpack('Q2')
=> [7523094288207667809, 8029475498074204265]
A format of Q* returns one value for every complete 64-bit (8-byte) group in the input; any leftover bytes at the end are ignored:
> 'abcdefghihjklmnopqrstuvw'.unpack('Q*')
=> [7523094288207667809, 8029475498074204265, 8608196880778817904]
> 'abcdefghihjklmnopqrstuvwxyz'.unpack('Q*')
=> [7523094288207667809, 8029475498074204265, 8608196880778817904]
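The Q* behaviour can also be imitated by hand. This is only a sketch: unpack_uint64_le is a hypothetical name, and it always decodes little-endian, which is what unpack('Q*') does on a little-endian machine.

```ruby
# Hypothetical helper: one little-endian value per complete 8-byte group,
# ignoring any trailing bytes, similar to unpack('Q<*').
def unpack_uint64_le(str)
  str.bytes.each_slice(8)
     .select { |chunk| chunk.size == 8 }
     .map { |chunk| chunk.reverse.inject(0) { |acc, byte| (acc << 8) | byte } }
end

values = unpack_uint64_le('abcdefghihjklmnopqrstuvwxyz')
puts values.size   # 3, the trailing "xyz" is dropped
puts values.first  # 7523094288207667809
```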

ruby extract string between two string

I have a string as below:
str1='"{\"#Network\":{\"command\":\"Connect\",\"data\":
{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'
I want to extract the somename string from the above string. The values of xx:xx:xx:xx:xx:xx, somename and 123456789 can change, but the syntax will remain the same as above.
I saw similar posts on this site but don't know how to use a regex in this case.
Any ideas how to extract that string?
Parse the string to JSON and get the values that way.
require 'json'
str = "{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"
json = JSON.parse(str.strip)
name = json["#Network"]["data"]["Name"]
pwd = json["#Network"]["data"]["Pwd"]
Since you don't know regex, let's leave them out for now and try manual parsing which is a bit easier to understand.
Your original input, without the outer apostrophes and name of variable is:
"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"
You say that you need to get the 'somename' value and that the 'grammar will not change'. Cool!
First, look at what delimits that value: it has quotes, then there's a colon to the left and comma to the right. However, looking at other parts, such layout is also used near the command and near the pwd. So, colon-quote-data-quote-comma is not enough. Looking further to the sides, there's a \"Name\". It never occurs anywhere in the input data except this place. This is just great! That means, that we can quickly find the whereabouts of the data just by searching for the \"Name\" text:
inputdata = .....
estposition = inputdata.index('\"Name\"')
raise "well-known marker was not found in the input" unless estposition
now, we know:
where the part starts
and that after the "Name" text there's always a colon, a quote, and then the-interesting-data
and that there's always a quote after the interesting-data
let's find all of them:
colonquote = inputdata.index(':\"', estposition)
datastart = colonquote+3
lastquote = inputdata.index('\"', datastart)
dataend = lastquote-1
index returns the start position of the match, so it returns the position of the : and the position of the \. Since we want the text between them, we must add/subtract a few positions to move past the :\" at the beginning or move back from the \" at the end.
Then, fetch the data from between them:
value = inputdata[datastart..dataend]
And that's it.
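Put together, the marker search might look like the sketch below. I've written the input on a single line for brevity, and value_after is a hypothetical helper name. It assumes the value is a quoted string, so it would not work for the "data" key, whose value is a nested object.

```ruby
inputdata = '"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

# Hypothetical helper: return the text between :\" and the next \" that
# follows the given key.
def value_after(inputdata, key)
  marker = inputdata.index('\"' + key + '\"')
  raise "marker for #{key} was not found in the input" unless marker

  colonquote = inputdata.index(':\"', marker)
  datastart = colonquote + 3                     # skip the three characters :\"
  lastquote = inputdata.index('\"', datastart)   # position of the closing backslash
  inputdata[datastart...lastquote]               # exclusive range stops before it
end

puts value_after(inputdata, 'Name')  # somename
puts value_after(inputdata, 'Pwd')   # 123456789
```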
Now, step back and look at the input data once again. You say that grammar is always the same. The various bits are obviously separated by colons and commas. Let's try using it directly:
parts = inputdata.split(/[:,]/)
=> ["\"{\\\"#Network\\\"",
"{\\\"command\\\"",
"\\\"Connect\\\"",
"\\\"data\\\"",
"\n{\\\"Id\\\"",
"\\\"xx",
"xx",
"xx",
"xx",
"xx",
"xx\\\"",
"\\\"Name\\\"",
"\\\"somename\\\"",
"\\\"Pwd\\\"",
"\\\"123456789\\\"}}}\\0\""]
Please ignore the regex for now. Just assume it says a colon or comma. Now, in parts you will get all the, well, parts, that were detected by cutting the inputdata to pieces at every colon or comma.
If the layout never changes, then your interesting-data will always be the 13th element (index 12):
almostvalue = parts[12]
=> "\\\"somename\\\""
Now, just strip the spurious characters. Since the grammar is constant, there are 2 chars to be cut from each side:
value = almostvalue[2..-3]
OK, another way. Since regex already showed up, let's try with them. We know:
data is prefixed with \"Name\" then a colon and a backslash-quote
data consists of some text without quotes inside (well, at least I guess so)
data ends with a backslash-quote
the parts in regex syntax would be, respectively:
\"Name\":\"
[^\"]*
\"
together:
inputdata =~ /\\"Name\\":\\"([^\"]*)\\"/
value = $1
Note that I surrounded the interesting part with (), hence after a successful match that part is available in the $1 special variable.
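A self-contained version of the regex route, as a sketch. It uses String#[] with a capture index instead of $1, and excludes backslashes as well as quotes from the captured text, which is a small tightening of my own. The input is written on one line for brevity.

```ruby
inputdata = '"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

# \\ matches one literal backslash; the capture group holds the value.
name = inputdata[/\\"Name\\":\\"([^\\"]*)\\"/, 1]

# A key that is absent simply gives nil instead of raising.
missing = inputdata[/\\"Missing\\":\\"([^\\"]*)\\"/, 1]

puts name             # somename
puts missing.inspect  # nil
```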
Yet another way:
If you look at the grammar carefully, it really resembles a set of embedded hashes:
\"
{ \"#Network\" :
{ \"command\" : \"Connect\",
\"data\" :
{ \"Id\" : \"xx:xx:xx:xx:xx:xx\",
\"Name\" : \"somename\",
\"Pwd\" : \"123456789\"
}
}
}
\0\"
If we'd write something similar as Ruby hashes:
{ "#Network" =>
{ "command" => "Connect",
"data" =>
{ "Id" => "xx:xx:xx:xx:xx:xx",
"Name" => "somename",
"Pwd" => "123456789"
}
}
}
What's the difference? The colon was replaced with =>, and the backslashes before the quotes are gone. Oh, and also the opening and closing quote are gone, and that \0 at the end is gone too. Let's play:
tmp = inputdata[1..-4] # remove the opening " and the closing \0"
tmp.gsub!('\"', '"') # replace every \" with just "
Now, what about the colons? We cannot just replace : with =>, because it would damage the internal colons of the xx:xx:xx:xx:xx:xx part. But look: all the other colons always have a quote before them!
tmp.gsub!('":', '"=>') # replace every quote-colon with quote-arrow
Now our tmp is:
{"#Network"=>{"command"=>"Connect","data"=>{"Id"=>"xx:xx:xx:xx:xx:xx","Name"=>"somename","Pwd"=>"123456789"}}}
formatted a little:
{ "#Network"=>
{ "command"=>"Connect",
"data"=>
{ "Id"=>"xx:xx:xx:xx:xx:xx","Name"=>"somename","Pwd"=>"123456789" }
}
}
So, it looks just like a Ruby hash. Let's try 'destringizing' it (eval runs arbitrary code, so only do this with input you trust):
packeddata = eval(tmp)
value = packeddata['#Network']['data']['Name']
Done.
Well, this has grown a bit and Jonas was obviously faster, so I'll leave the JSON part to him since he wrote it already ;) The data was so similar to a Ruby hash because it was obviously formatted as JSON, which is a hash-like structure too. Using the proper format-reading tools is usually the best idea, but mind that the JSON library, when asked to read the data, will read all of the data, and then you can ask it "what was inside at the key xx/yy/zz", just like I showed you with the read-it-as-a-Hash attempt. Sometimes, when your program is very short on time, you cannot afford to read it all. Then, scanning with a regex or scanning manually for "known markers" may (not must) be much faster and thus preferable. But, still, much less convenient. Have fun.
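For the escaped, single-quoted form of the input used in this answer, the cleanup steps combine with JSON.parse instead of eval roughly as follows. This is a sketch that assumes the outer quote and the trailing \0" are always present; the input is written on one line for brevity.

```ruby
require 'json'

inputdata = '"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

tmp = inputdata[1..-4]        # drop the opening " and the closing \0"
tmp = tmp.gsub('\"', '"')     # turn every backslash-quote into a plain quote

packeddata = JSON.parse(tmp)  # no eval needed, the text is now plain JSON
name = packeddata['#Network']['data']['Name']
pwd  = packeddata['#Network']['data']['Pwd']

puts name  # somename
puts pwd   # 123456789
```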

Ruby,Rhomobile,JqueryMobile and Single Quote

In Rhomobile, which is based on Ruby, I parse a file and save it to an SQLite db with code like this:
Questions.delete_all()
file_name = File.join(Rho::RhoApplication::get_model_path('app','Settings'), 'questions.txt')
file = File.new(file_name)
file.each_line("\n") do |row|
  col = row.split("|")
  @question=Questions.create(
    {"id" => col[0], "question" => col[1],"answered"=>'0',"show"=>'1',"tutorial"=>col[4]}
  )
  break if file.lineno > 1500
end
file.close
When the text contains a single quote (aka '), for example the expression
It's funny
Then after parsing, saving and populating I get
It�s funny
Any idea where this comes from (Ruby, SQLite, or something else) and how to solve it?
I would check to make sure that your parsing isn't doing something funny. Rhodes handles all of the necessary escaping in its ORM. I've never had any issues with quotes in the db.

when we import csv data, how eliminate "invalid byte sequence in UTF-8"

We allow users to import data via CSV (using Ruby 1.9.2, hence it's FasterCSV).
Being user data, of course, it might not be properly sanitized.
When we try to display the data in an /index method we sometimes get the error "invalid byte sequence in UTF-8" pointing to our erb where we display one of the fields, widget.name.
When we do the import we'd like to FORCE the incoming data to be valid. Is there a Ruby operator that will map a string to a valid UTF-8 string, e.g. something like
goodstring = badstring.no_more_invalid_bytes
One example of 'bad' data is a character that looks like a hyphen but is not a regular ASCII hyphen. We'd prefer to map the non-UTF-8 characters to a reasonable ASCII equivalent (u-umlaut going to u, for example), BUT we're okay with simply stripping the character too.
Since this happens when importing lots of data, it needs to be a fast built-in operator, hopefully.
Note: here is an example of the data. The file comes from Windows and is 8-bit ASCII. When we import it and display widget.name.inspect (instead of widget.name) in our erb, we get:
"Chains \x96 Accessories"
So one example of the data is a "hyphen" that's actually the 8-bit code 0x96.
When we changed our CSV parse to assign fldval = d.encode('UTF-8')
it throws this error:
Encoding::UndefinedConversionError in StoresController#importfinderitems
"\x96" from ASCII-8BIT to UTF-8
What we're looking for is a simple way to just force it to be valid UTF-8 regardless of origin type, even if we simply strip non-ASCII.
While not as 'nice' as forcing the encoding, this works at a slight expense to our import time:
d.to_s.strip.gsub(/\P{ASCII}/, '')
Thank you, Mladen!
Ruby 1.9's CSV has a new parser that works with m17n. The parser works with the Encoding of the IO object or String it reads. The methods ::foreach, ::open, ::read, and ::readlines take an optional :encoding option in which you can specify the Encoding.
For example:
CSV.read('/path/to/file', :encoding => 'windows-1251:utf-8')
Would convert all strings to UTF-8.
You can also use the more standard encoding name 'ISO-8859-1':
CSV.read('/..', :headers => true, :col_sep => ';', :encoding => 'ISO-8859-1')
CSV.parse(File.read('/path/to/csv').scrub)
I answered a similar question that deals with reading external files in 1.9.2 with non-UTF-8 encodings. I think that answer will help you a lot: Character Encoding issue in Rails v3/Ruby 1.9.2
Note that you need to know the source encoding in order to convert it to anything reliably. There are libraries like the one I linked to in my other answer that can help you determine this.
Also, if you aren't loading the data from a file, you can convert the encoding of a string in 1.9.2 quite easily:
'string'.encode('UTF-8')
However, it's rare that you're building a string in another encoding, and it's best to convert it at the time it's read into your environment if possible.
Ruby 1.9 can change string encoding with invalid detection and replacement:
str = str.encode('UTF-8', :invalid => :replace)
For unusual strings, such as strings loaded from a file of unknown encoding, it's wise to use #encode instead of a regex, #gsub, or #delete, because these all need the string to be parsed, and if the string is broken it can't be parsed, so those methods fail.
If you get a message like this:
error ** from ASCII-8BIT to UTF-8
Then you're probably trying to convert a binary string that's already in UTF-8, and you can force UTF-8:
str.force_encoding('UTF-8')
If you know the original string is not in binary UTF-8, or if the output string has illegal characters, then read up on Ruby encoding transliterations.
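Applied to the "Chains \x96 Accessories" value from the question, the two situations might be handled as in this sketch. That the file is Windows-1252 is my assumption (0x96 is an en dash, U+2013, in that encoding); for binary input, :undef => :replace is needed alongside :invalid, because the byte counts as an undefined conversion rather than an invalid sequence.

```ruby
# Bytes as they might arrive from the Windows export, tagged as binary.
raw = "Chains \x96 Accessories".b

# Source encoding known (assumed Windows-1252 here): label the bytes,
# then transcode them to UTF-8.
converted = raw.dup.force_encoding('Windows-1252').encode('UTF-8')

# Source encoding unknown: drop every byte that cannot be converted.
stripped = raw.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')

puts converted.valid_encoding?  # true
puts stripped                   # Chains  Accessories
```

The first form keeps the dash as a real UTF-8 character; the second loses it but never raises.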
If you are using Rails, you can try to fix it with the following
'Your string with strange stuff ##~'.mb_chars.tidy_bytes
It removes the invalid UTF-8 characters and replaces them with valid ones.
More info: https://apidock.com/rails/String/mb_chars
Upload the CSV file to Google Docs Spreadsheet and re-download it as a CSV file. Import and voila! (Worked in my case)
Presumably Google converts it to the wanted format..
Source: Excel to CSV with UTF-8 Encoding
As mentioned by someone else, scrub works well to clean this up in Ruby 2.1+. Note that this example reads the whole file into memory before scrubbing it, which you may want to avoid for a large file:
data = IO::read(file_path).scrub("")
CSV.parse(data, :col_sep => ',', :headers => true) do |row|
  puts row
end
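For a file that is too large to read in one go, one option is to scrub it line by line. This is only a sketch: each_scrubbed_row is a hypothetical helper, it does not handle headers, and it assumes no quoted field contains an embedded newline, since every physical line is parsed on its own.

```ruby
require 'csv'

# Hypothetical helper: yield one parsed row per physical line, removing
# invalid UTF-8 bytes from each line before it reaches the CSV parser.
def each_scrubbed_row(file_path, col_sep: ',')
  File.foreach(file_path, encoding: 'UTF-8') do |line|
    yield CSV.parse_line(line.scrub(''), col_sep: col_sep)
  end
end
```

Usage would look like each_scrubbed_row(file_path) { |row| puts row.inspect }.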
I am on a Mac and I was having the same error:
rescue in parse:Invalid byte sequence in UTF-8 in line 1 (CSV::MalformedCSVError)
I added :encoding => 'ISO-8859-1', which resolved the error, and the CSV file could be read.
results = CSV.read("query_result.csv", :headers => true, :encoding => 'ISO-8859-1')
:headers => true : If set to :first_row or true, the initial row of the CSV file will be treated as a row of headers. If set to an Array, the contents will be used as the headers. If set to a String, the String is run through a call of ::parse_line with the same :col_sep, :row_sep, and :quote_char as this instance to produce an Array of headers. This setting causes #shift to return rows as CSV::Row objects instead of Arrays and #read to return CSV::Table objects instead of an Array of Arrays.
irb(main):024:0> rows = CSV.new(StringIO.new("a,b,c\n1,2,3"), headers: true)
=> #<CSV io_type:StringIO encoding:UTF-8 lineno:0 col_sep:"," row_sep:"\n" quote_char:"\"" headers:true>
irb(main):025:0> rows = CSV.new(StringIO.new("a,b,c\n1,2,3"), headers: true).to_a
=> [#<CSV::Row "a":"1" "b":"2" "c":"3">]
irb(main):026:0> rows.first['a']
=> "1"
In the above example you can clearly see that this also lets us access the data like a hash.
The only thing to be careful about when using headers: true is duplicate headers: looking up a field by header returns only the first matching column.
Only do this
anyobject.to_csv(:encoding => 'utf-8')

Resources