Ruby: unpack a file compressed in the snappy framing format (*.sz)

I need to unpack snappy *.sz files in Ruby.
Format specification is here:
https://github.com/google/snappy/blob/master/framing_format.txt
I have found 2 gems so far.
https://github.com/miyucy/snappy - seems to be completely useless.
https://github.com/willglynn/snappy-ruby - is able to unpack individual snappy chunks, but not a whole framed snappy file.
QUESTION:
Is there a working Ruby gem that would allow me to do something like:
framing_snappy.unpack('filename.sz')
or is the only way to write my own code that parses the bytes and messes with bitwise shifts?

Just in case someone is facing a similar issue:
I finally came up with code along these lines, and it seems to be working.
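A minimal sketch of the framing-format reader (assuming the snappy gem's Snappy.inflate, or an equivalent raw-chunk decompressor; the CRC-32C checksums are skipped rather than verified):

require 'snappy'

def framing_snappy_unpack(filename)
  out = ''.force_encoding(Encoding::BINARY)
  File.open(filename, 'rb') do |f|
    until f.eof?
      header = f.read(4)
      type   = header.unpack('C').first        # 1-byte chunk type
      length = header.unpack('V').first >> 8   # 3-byte little-endian length
      data   = f.read(length)
      case type
      when 0xff                                # stream identifier chunk
        raise 'not a framed snappy stream' unless data == 'sNaPpY'
      when 0x00                                # compressed data chunk
        out << Snappy.inflate(data[4..-1])     # first 4 bytes are the checksum
      when 0x01                                # uncompressed data chunk
        out << data[4..-1]
      when 0x02..0x7f
        raise format('unskippable reserved chunk 0x%02x', type)
      else
        # 0x80..0xfe are skippable padding/reserved chunks: ignore them
      end
    end
  end
  out
end

puts framing_snappy_unpack('filename.sz')

Each chunk header is one type byte followed by a 3-byte little-endian length, which is why the unpack('V') result is shifted right by 8 bits.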

Related

Missing data when decompressing zlib data with Ruby

Well, I have deflated JSON encoded in a request log and I need to decompress it.
I tried to use Zlib, i.e.:
Zlib::Inflate.new(-Zlib::MAX_WBITS).inflate(File.read("PATH_OF_FILE"))
It shows only a part of the JSON. Something like:
"{\"seq\":53,\"app_id\":\"567067343352427\",\"app_ver\":\"10.3.2\",\"build_num\":\"46395473\",\"device_id\":\"c12f541a-5936-4477-b6fc-653db675d16"
There is a lot of missing data because the deflated data is too big.
Deflate full data:
Check it here.
After testing, I figured out that only this part is being decompressed:
Check it here.
Well, I'm a bit confused by this. Could someone please help me with it?
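One thing worth ruling out before blaming Zlib is the file read itself: File.read can mangle binary data when text-mode translation or encoding conversion kicks in. A binary-safe version of the same raw inflate might look like this (a sketch; PATH_OF_FILE stands in for the real path):

require 'zlib'

# Read the payload in binary mode so no bytes are altered or dropped,
# then run the same raw-deflate inflate (negative window bits = no zlib header).
data = File.open('PATH_OF_FILE', 'rb') { |f| f.read }
json = Zlib::Inflate.new(-Zlib::MAX_WBITS).inflate(data)
puts json.bytesize   # compare against the size you expect

If the input really is truncated, a plain inflate just returns whatever it could decode, which would match the partial JSON shown above.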

How to convert hadoop sequence file to json format?

As the name suggests, I'm looking for some tool which will convert the existing data from hadoop sequence file to json format.
My initial googling has only turned up results related to jaql, which I'm desperately trying to get to work.
Is there any tool from Apache available for this very purpose?
NOTE:
I have a Hadoop sequence file sitting on my local machine and would like to get the data in the corresponding JSON format.
So, in effect, I'm looking for some tool/utility which will take a Hadoop sequence file as input and produce output in JSON format.
Thanks
Apache Hadoop might be a good tool for reading sequence files.
All kidding aside, though, why not write the simplest possible Java Mapper program that uses, say, Jackson to serialize each key and value pair it sees? That would be a pretty easy program to write.
I thought there must be some tool which will do this, given that it's such a common requirement. Yes, it should be pretty easy to code, but then again, why do so if something already exists that does just the same?
Anyway, I figured out how to do it using jaql. Here is a sample query which worked for me:
read({type: 'hdfs', location: 'some_hdfs_file', inoptions: {converter: 'com.ibm.jaql.io.hadoop.converter.FromJsonTextConverter'}});

How to get faster YAML loading in Ruby 1.8.7?

In this Ruby 1.8.7 application, YAML deserialization (done with YAML.load) is needed because the existing data is stored in many relatively small YAML documents, but it has become a performance bottleneck.
Is there a way, or a library, that does this faster? Upgrading to Ruby 1.9 is not an option.
I am not an expert, but if it's possible for you to convert the YAML documents to Marshal documents and then use Marshal.load in the application, it should be much faster. I used this gist a while back to compare YAML vs. Marshal performance.
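A sketch of that one-time conversion, under the assumption that the documents live as .yml files on disk (the paths here are hypothetical; the code stays 1.8.7-compatible):

require 'yaml'

# Parse each YAML document once the slow way, then re-serialize it with
# Marshal so every later load in the app takes the fast Marshal.load path.
Dir.glob('data/**/*.yml').each do |path|
  obj = YAML.load(File.open(path, 'rb') { |f| f.read })
  File.open(path.sub(/\.yml\z/, '.marshal'), 'wb') { |f| f.write(Marshal.dump(obj)) }
end

# Later, in the application:
# obj = Marshal.load(File.open('data/example.marshal', 'rb') { |f| f.read })

The trade-off is that Marshal output is Ruby-specific and not human-readable, so it makes sense to keep the YAML around as the source of truth.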
I didn't find a way to do this. I tried converting YAML to JSON via string manipulation, then parsing that with fast JSON parsers such as Yajl and Oj, but the overhead of converting YAML to JSON already took longer than actually parsing the YAML.
My conversion script probably wasn't as fast as it could have been if someone smart had really dedicated a lot of time to it, but I gave up early after realizing that even if I optimized my own script, it still wouldn't beat the YAML parsing time by enough to warrant the whole approach.
According to this experiment, using ZAML under 1.8.7 will be faster than the YAML parser.

Reading in CSV files smaller than 10K from S3 with Ruby 1.9.2 p290

The following code snippet works fine for CSV file sizes larger than 10 K.
lines = CSV.read(open(resource.csv(:original)))
This reads the CSV file stored in Amazon S3 using the Paperclip gem.
If the file size is smaller than 10 K however, I get the following error:
ActionView::Template::Error (can't convert StringIO into String):
I googled and found the following post:
http://adayinthepit.com/?p=269
So I tried to use the fastercsv gem. When I ran my program again, here is the error I got:
ActionView::Template::Error (Please switch to Ruby 1.9's standard CSV library. It's FasterCSV plus support for Ruby 1.9's m17n encoding engine.):
Looks like a Catch-22. How can I process files smaller than 10 K in Ruby 1.9.2-p290?
Please advise.
Thanks.
Bharat
I'm going to guess that CSV.read is being handed a StringIO when it wants a String. If so, then you should be able to stick a read call in and switch to CSV.parse to make everyone happy:
lines = CSV.parse(open(resource.csv(:original)).read)
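For context on the 10 K boundary: open-uri buffers small response bodies (under OpenURI::Buffer::StringMax, 10240 bytes) in a StringIO and spills larger ones to a Tempfile, which is why only the small files break. Calling read on either object yields a String, so a wrapper like this hypothetical helper handles both cases:

require 'csv'
require 'open-uri'

# open-uri returns a StringIO under ~10 KB and a Tempfile above that;
# #read normalizes both to a String that CSV.parse accepts.
def parse_remote_csv(url)
  open(url) { |io| CSV.parse(io.read) }
end

# lines = parse_remote_csv(resource.csv(:original))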

Ruby feed parsing: "Input is not proper UTF-8, indicate encoding!"

I am trying to parse RSS feeds using Feedzirra.
Some of them are ok, but others return the error:
Error while parsing. Input is not proper UTF-8, indicate encoding !
How do I fix it?
This does not seem to be a Feedzirra issue, IMO.
Your libxml or Nokogiri dependencies may not be up to date. Update these gems and try again.
As mentioned here, encoding detection is not 100% accurate.
If you'd like to ignore the feeds which give you errors, Feedzirra has callback functions:
Another feature present in Feedzirra is the ability to create callback functions that get called “on success” and “on failure” when getting a feed. This makes it easy to do things like log errors or update data stores.
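A sketch of those callbacks in use, here to log and skip feeds whose XML fails to parse (the lambda signatures follow Feedzirra's README of that era; verify them against your installed version, and the URL is hypothetical):

require 'feedzirra'

feed_urls = ['http://example.com/feed.xml']   # hypothetical feed URL

Feedzirra::Feed.fetch_and_parse(feed_urls,
  :on_success => lambda { |url, feed|
    puts "parsed #{url}: #{feed.title}"       # feed fetched and parsed fine
  },
  :on_failure => lambda { |url, response_code, response_header, response_body|
    warn "skipping #{url} (HTTP #{response_code})"  # log the failure and move on
  })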
Also, please give us more context on what code gives you the error, or which file you are trying to parse.
