Ruby - read bytes from a file, convert to integer

I'm trying to read unsigned integers from a file (stored as consecutive bytes) and convert them to Integers. I've tried this:
file = File.new(filename,"r")
num = file.read(2).unpack("S") #read an unsigned short
puts num #value will be less than expected
What am I doing wrong here?

You're not reading enough bytes. As you say in the comment to tadman's answer, you get 202 instead of 3405691582.
Notice that the first byte of 0xCAFEBABE is 0xCA = 202.
If you really want all 8 bytes in a single number, then you need to read more than an unsigned short. Try:
num = file.read(8).unpack("L_")  # "L_" = native-size unsigned long
The underscore tells "L" to use the platform's native long size; this assumes the native long is 8 bytes, which is definitely not guaranteed.
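To stay out of native-size trouble, the explicit-width directives pin down both size and byte order. A minimal sketch, assuming the value is stored big-endian ("data.bin" is a made-up filename):
File.open("data.bin", "rb") do |file|
  num = file.read(4).unpack("N").first  # "N" = 32-bit big-endian unsigned
  puts num                              # => 3405691582 for bytes CA FE BA BE
end
For a full 8-byte value, "Q>" (64-bit big-endian, Ruby 1.9.3+) is the equivalent directive.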

How about looking in The Pickaxe? (Ruby 1.9, p. 44)
File.open("testfile")
do |file|
file.each_byte {|ch| print "#{ch.chr}:#{ch} " }
end
each_byte iterates over a file byte by byte.

There are a couple of libraries that help with parsing binary data in Ruby by letting you declare the data format in a simple, high-level declarative DSL; they then figure out all the packing, unpacking, bit-twiddling, shifting and endian conversions by themselves.
I have never used one of these, but here are two examples, with a rough sketch of the declarative style after the list (there are more, but I don't know them):
BitStruct
BinData
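I've not verified this against the gem, but a BinData declaration looks roughly like the following; the record and field names are made up for illustration:
require 'bindata'

class TagHeader < BinData::Record
  endian :big       # one declaration handles all endian conversion
  uint32 :magic     # e.g. 0xCAFEBABE
  uint16 :version
end

header = TagHeader.read(File.open("data.bin", "rb"))
puts header.magic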

OK, I got it to work:
num = file.read(8).unpack("N")  # "N" = 32-bit big-endian unsigned; only the first 4 of the 8 bytes are used
Thanks for all of your help.

In what format are the numbers stored in the file? Hex? Your code looks correct to me.

When dealing with binary data you need to be sure you're opening the file in binary mode if you're on Windows. This goes for both reading and writing.
open(filename, "rb") do |file|
num = file.read(2).unpack("S")
puts num
end
There may also be issues with endianness (byte order) depending on the source platform: for instance, big-endian PowerPC-based machines, which include old Mac systems, IBM Power servers, PS3 clusters, and Sun SPARC servers.
Can you post an example of how it's "less"? Usually there's an obvious pattern to the data.
For example, if you want 0x1234 but you get 0x3412 it's an endian problem.
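With Ruby 1.9.3 or later you can also sidestep the guessing entirely, since the explicit-width directives name the byte order:
"\x12\x34".unpack("S>").first  # => 4660  (0x1234, big-endian)
"\x12\x34".unpack("S<").first  # => 13330 (0x3412, little-endian)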

Related

Ruby unpack binary

I am working on unpacking a binary file for the first time in Ruby. I already found the unpack method, which works pretty nicely and which, according to the docs, is perfect for 8-bit (1 byte), 16-bit (2 byte), 32-bit (4 byte) and 64-bit (8 byte) values.
But now I have to unpack 5 bytes. How do I do this?
Thanks in advance!
To literally unpack five bytes: str.unpack 'C5'
That gives you five byte values as unsigned ints. The question is how to reinterpret those ints as a single data type. Pack/unpack only recognize the standard power-of-two sizes, so you'll have to do that part manually.
For example, to get a little-endian unsigned 40-bit int:
bytes = str.unpack 'C5'
int = bytes.map.with_index { |byte, i| byte << (i * 8) }.reduce(:+)  # least significant byte first
If you need to do something more sophisticated like a signed type or a float... good luck.
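That said, the signed case is mostly a matter of sign-extending the unsigned result yourself. A minimal sketch for a little-endian signed 40-bit int:
bytes = str.unpack 'C5'
uint  = bytes.map.with_index { |byte, i| byte << (i * 8) }.reduce(:+)
int   = uint >= (1 << 39) ? uint - (1 << 40) : uint  # top bit set means negative in two's complement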

Ruby convert Hexadecimal to integer (16 bits signed)

I am developing software for using an RFID reader with Ruby on Rails and, after opening the socket and getting the tags, I convert the data to hexadecimal with:
while line = s.gets
  puts line.unpack('H*').to_s
end
Then I get "a55a0019833400393939393939303030303232fd6f02080d0a" for one tag.
The RFID reader's user manual says:
Remark:RSSI express as complement code, total 16 bits,which is 10 times the real value. For example, the real value is -65.7dBm,then RSSI=fd6f
I have found online calculators (mathsinfun and calc.penjee.com) where I am able to convert fd6f to -657.
I would like to know how I can do this conversion in Ruby 2.3.1 to continue with my project. Any help will be appreciated.
s> is the correct unpack directive for a 16-bit signed big-endian number, so:
"\xfd\x6f".unpack('s>')[0] / 10.0
Result is:
-65.7
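To pull the RSSI straight out of the raw frame, something like this should work on Ruby 2.3.1, assuming the RSSI always occupies the two bytes before the four-byte trailer (a position inferred from your sample tag, not from the manual):
frame = ['a55a0019833400393939393939303030303232fd6f02080d0a'].pack('H*')
rssi  = frame[-6, 2].unpack('s>').first / 10.0  # => -65.7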

How to read a large file into a string

I'm trying to save and load the states of Matrices (using Matrix) during the execution of my program with the functions dump and load from Marshal. I can serialize the matrix and get a ~275 KB file, but when I try to load it back as a string to deserialize it into an object, Ruby gives me only the beginning of it.
# when I want to save
mat_dump = Marshal.dump(@mat) # serialize object - OK
File.open('mat_save', 'w') {|f| f.write(mat_dump)} # write String to file - OK
# somewhere else in the code
mat_dump = File.read('mat_save') # read String from file - only reads like 5%
@mat = Marshal.load(mat_dump) # deserialize object - "ArgumentError: marshal data too short"
I tried to change the arguments for load but didn't find anything yet that doesn't cause an error.
How can I load the entire file into memory? If I could read the file chunk by chunk, then loop to store it in the String and then deserialize, it would work too. The file has basically one big line so I can't even say I'll read it line by line, the problem stays the same.
I saw some questions about the topic:
"Ruby serialize array and deserialize back"
"What's a reasonable way to read an entire text file as a single string?"
"How to read whole file in Ruby?"
but none of them seem to have the answers I'm looking for.
Marshal is a binary format, so you need to read and write in binary mode. The easiest way is to use IO.binwrite and IO.binread.
...
IO.binwrite('mat_save', mat_dump)
...
mat_dump = IO.binread('mat_save')
@mat = Marshal.load(mat_dump)
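If you prefer explicit file handles, the same fix spelled out with binary-mode File.open:
File.open('mat_save', 'wb') { |f| f.write(mat_dump) }  # 'wb' = write, binary
mat_dump = File.open('mat_save', 'rb', &:read)         # 'rb' = read, binary
@mat = Marshal.load(mat_dump)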
Remember that marshaling is Ruby version dependent; it's only compatible with other Ruby versions under specific circumstances. Keep this in mind:
In normal use, marshaling can only load data written with the same major version number and an equal or lower minor version number.

Parsing a big string in Ruby

I have a file of a few hundred megabytes containing strings:
str1 x1 x2\n
str2 xx1 xx2\n
str3 xxx1 xxx2\n
str4 xxxx1 xxxx2\n
str5 xxxxx1 xxxxx2
where x1 and x2 are some numbers. How big the numbers x(...x)1 and x(...x)2 are is unknown.
Each line ends with "\n". I have a list of strings: str2 and str4.
I want to find the corresponding numbers for those strings.
What I'm doing is pretty straightforward (and probably not efficient performance-wise):
source_str = read_from_file() # source_str contains the whole file content, a few hundred megabytes
str_to_find = [str2, str4]
res = []
str_to_find.each do |x|
index = source_str.index(x)
if index
a = source_str[index .. index + x.length] # a contains "str2"
#?? how do I "select" xx1 and xx2 ??
# and finally...
# res << num1
# res << num2
end
end
Note that I can't apply source_str.split("\n") due to the error ArgumentError: invalid byte sequence in UTF-8, and I can't fix it by changing the file in any way. The file can't be changed.
You want to avoid reading a hundred of megabytes into memory, as well as scanning them repeatedly. This has the potential of taking forever, while clogging the machine's available memory.
Try to re-frame the problem so that you can treat the large input file as a stream: instead of asking for each string "does it exist in my file?", ask for each line in the file "does it contain a string I am looking for?".
str_to_find = [str2, str4]
numbers = []
File.foreach('foo.txt') do |li|
  columns = li.split
  numbers += columns if str_to_find.include?(columns.shift)  # shift removes the key string, leaving its two numbers
end
Also, read theTinMan's answer again regarding the file encoding: what he is suggesting is that you may be able to fine-tune the reading of the file to avoid the error without changing the file itself.
If you have a very large number of items in str_to_find, I'd suggest that you use a Set instead of an Array for better performance:
require 'set'
str_to_find = [str1, str2, ... str5000].to_set
If you want to find lines in a text file, which is what it sounds like you are reading, then read the file line by line.
The IO class has the foreach method, which makes it easy to read a file line-by-line, which also makes it possible to easily locate lines that contain the particular string you want to find.
If you had your source input file saved as "foo.txt", you could read it using something like:
str2 = 'some value'
str4 = 'some other value'
numbers = []
File.foreach('foo.txt') do |li|
  numbers << li.split[2] if li[str2] || li[str4]  # String#[] returns the substring if found, else nil
end
At the end of the loop numbers should contain the numbers you want.
You say you're getting an encoding error, but you don't give us any clue which characters are causing it. Without that information we can't really help you fix that problem, except to say that you need to tell Ruby what the file's encoding is. You can do that when the file is opened by setting the open arguments to the proper encoding. Odds are good it should be ISO-8859-1 or Windows-1252, since those are very common on Windows machines.
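For example, a minimal sketch assuming the file is actually Windows-1252, transcoding it to UTF-8 as it is read:
File.foreach('foo.txt', encoding: 'Windows-1252:UTF-8') do |li|
  # li is now valid UTF-8, so split and String#[] behave
end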
I have to find a list of values; iterating through each line doesn't seem sensible because I'd have to iterate over and over again for each value.
We can only work with the examples you give us. Since that wasn't clearly explained in your question, you got an answer based on what was initially said.
Ruby's Regexp has the tools necessary to make this work, but to do it correctly requires taking advantage of Perl's Regexp::Assemble library, since Ruby has nothing close to it. See "Is there an efficient way to perform hundreds of text substitutions in ruby?" for more information.
Note that this will let you scan through a huge string in memory; however, that is still not a good way to process what you are talking about. I'd use a database instead, since they are designed for this sort of task.
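Short of pulling in Regexp::Assemble, Ruby's built-in Regexp.union will at least collapse the whole list into one (unoptimized) alternation, so the file is only scanned once. A sketch, reusing the str_to_find and numbers variables from above:
pattern = Regexp.union(str_to_find.to_a)  # one pattern matching any of the strings
File.foreach('foo.txt') do |li|
  numbers += li.split.drop(1) if li =~ pattern
end
Note this matches the strings anywhere in the line, not only in the first column.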

Ruby on Rails - generating bit.ly style identifiers

I'm trying to generate UUIDs with the same style as bit.ly URLs, like:
http://bit [dot] ly/aUekJP
or cloudapp ones:
http://cl [dot] ly/1hVU
which are even smaller.
How can I do it?
I'm now using the UUID gem for Ruby, but I'm not sure if it's possible to limit the length and get something like this.
I am currently using this:
UUID.generate.split("-")[0]  # => "b9386070"
But I would like something even smaller, while still knowing that it will be unique.
Any help would be pretty much appreciated :)
edit note: replaced dot letters with [dot] for workaround of banned short link
You are confusing two different things here. A UUID is a universally unique identifier. It has a very high probability of being unique even if millions of them were being created all over the world at the same time. It is generally displayed as a 36-character string. You cannot chop off the first 8 characters and expect them to be unique.
Bitly, TinyURL et al. store links and generate a short code to represent each link. They do not reconstruct the URL from the code; they look it up in a data-store and return the corresponding URL. These are not UUIDs.
Without knowing your application it is hard to advise on which method you should use, but you could store whatever you are pointing at in a data-store with a numeric key and then rebase the key to base32 using the 10 digits and 22 lowercase letters, perhaps avoiding the obvious typo problems like 'o', 'i', 'l', etc.
EDIT
On further investigation, there is a Ruby base32 gem available that implements Douglas Crockford's Base32 encoding.
A 5-character Base32 string can represent over 33 million integers, and a 6-character string over a billion.
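If you'd rather not pull in the gem, the encoding itself is only a few lines. A minimal sketch of Crockford's Base32 (10 digits plus 22 letters, skipping I, L, O and U to avoid look-alikes):
CROCKFORD = '0123456789ABCDEFGHJKMNPQRSTVWXYZ'.freeze

def crockford_encode(n)
  return '0' if n.zero?
  out = ''
  while n > 0
    out = CROCKFORD[n % 32] + out  # peel off five bits at a time
    n /= 32
  end
  out
end

crockford_encode(33_554_431)  # => "ZZZZZ", the largest 5-character value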
If you are working with numbers, you can use Ruby's built-in methods:
6175601989.to_s(30)  # => "8e45ttj"
and to go back:
"8e45ttj".to_i(30)   # => 6175601989
So you don't have to store anything; you can always decode an incoming short_code.
This works OK for a proof of concept, but you aren't able to avoid ambiguous characters like 1lji0o. If you are just looking to use the code to obfuscate database record IDs, this will work fine. In general, short codes are supposed to be easy to remember and to transfer from one medium to another, like reading them off someone's presentation slide or hearing them over the phone. If you need to avoid characters that are hard to read or hard to hear, you might need to switch to a process where you generate an acceptable code and store it; a sketch of that approach follows.
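A sketch of that generate-and-store approach; the alphabet (look-alike characters removed) and the in-memory Set standing in for a real datastore are both illustrative choices:
require 'set'

SHORT_ALPHABET = (('a'..'z').to_a + ('2'..'9').to_a - %w[i l o]).freeze

def new_code(taken, length = 6)
  loop do
    code = Array.new(length) { SHORT_ALPHABET.sample }.join
    return code if taken.add?(code)  # Set#add? returns nil if the code already exists
  end
end

taken = Set.new
puts new_code(taken)  # e.g. "x7kq3m"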
I found this to be short and reliable:
def create_uuid(prefix = nil)
  time = (Time.now.to_f * 10_000_000).to_i
  jitter = rand(10_000_000)
  key = "#{jitter}#{time}".to_i.to_s(36)
  [prefix, key].compact.join('_')
end
This spits out unique keys that look like this: '3qaishe3gpp07w2m'
Reduce the 'jitter' size to reduce the key size.
Caveat:
This is not guaranteed to be unique (use SecureRandom.uuid for that), but it is highly reliable:
10_000_000.times.map {create_uuid}.uniq.length == 10_000_000
The only way to guarantee uniqueness is to keep a global count and increment it for each use: 0000, 0001, etc.
