Reading a big .tar.gz file in Ruby line by line

I have a big gzipped file (in GBs) with the extension .tar.gz, which contains one huge SQL query per line, so I want to read it line by line. Currently I am using Gem::Package::TarReader and Zlib to do it this way:
tar_extract = Gem::Package::TarReader.new(Zlib::GzipReader.open('/Users/sachin.japate/Desktop/80637.tar.gz'))
tar_extract.rewind
tar_extract.each do |entry|
  if entry.file?
    query_list = entry.read.split("\n")
    puts query_list
  end
end
Now the issue is: when the file is very big, the read goes out of integer range and raises this error:
/usr/lib/ruby-flo/lib/ruby/2.1.0/rubygems/package/tar_reader/entry.rb:126:in `read': integer 7422013772 too big to convert to `int' (RangeError)
Then I tried it this way:
query = ""
while byte = entry.getc do
  query = query + byte
  if byte == "\n"
    puts query
    query = ""
  end
end
This works, but it is very slow and hurts performance, so my question is: is there any way to read a big TarReader entry object line by line in a memory-efficient way?
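One option (a sketch, untested on multi-GB archives, and assuming a RubyGems version where Gem::Package::TarReader::Entry#read accepts a length argument like IO#read does): stream the entry in fixed-size chunks and split complete lines out of a small buffer, so only one chunk plus one partial line is ever held in memory:

chunk_size = 1024 * 1024  # 1 MiB per read; tune as needed
tar_extract.each do |entry|
  next unless entry.file?
  buffer = ""
  # Entry#read(len) returns up to len bytes, or nil at end of entry.
  while chunk = entry.read(chunk_size)
    buffer << chunk
    # Emit every complete line; keep the trailing partial line buffered.
    while newline = buffer.index("\n")
      puts buffer.slice!(0..newline)
    end
  end
  puts buffer unless buffer.empty?  # final line without a trailing newline
end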

Related

Ruby Zlib compression gives different outputs for the same input

I have this Ruby method for compressing a string:
require 'zlib'
require 'stringio'

def compress_data(data)
  output = StringIO.new
  gz = Zlib::GzipWriter.new(output)
  gz.write(data)
  gz.close
  output.string
end
When I call this method with the same input, I get different outputs at different times. I am trying to get the byte arrays for the compressed outputs and compare them.
The output is "Different" when I run the code below:
input = "hello world"
output1 = (compress_data input).bytes.to_a
sleep 1
output2 = (compress_data input).bytes.to_a
if output1 == output2
  puts 'Same'
else
  puts 'Different'
end
The output is Same when I remove the sleep. Does the compression algorithm have something to do with the current time?
Option 1 - fixed mtime:
Yes. The gzip header stores a modification timestamp. You can use the mtime= method to set the time to a fixed value, which will resolve your problem:
gz = Zlib::GzipWriter.new(output)
gz.mtime = 1  # any fixed value works; identical input now yields identical bytes
gz.write(data)
gz.close
Note that the Ruby documentation says that setting mtime to zero disables the timestamp. I tried it, and it does not work. I also looked at the source code, and it appears this functionality is missing; it seems like a bug. So you have to set it to something other than 0 (but see the comments below: this will be fixed in future releases).
Option 2 - skip the header:
Another option is to just skip the header when comparing. The gzip header is 10 bytes long, so to check only the data:
data = compress_data(input).bytes[10..-1]
Note that you do not need to call to_a on bytes. It is already an Array:
str.bytes -> an_array
Returns an array of bytes in str. This is a shorthand for str.each_byte.to_a.
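Putting option 1 back into the original method, a minimal deterministic version (a sketch; the mtime keyword argument and its default are my own additions) could look like this:

require 'zlib'
require 'stringio'

def compress_data(data, mtime: 1)
  output = StringIO.new
  gz = Zlib::GzipWriter.new(output)
  gz.mtime = mtime  # fixed timestamp keeps the header deterministic
  gz.write(data)
  gz.close
  output.string
end

output1 = compress_data("hello world").bytes
sleep 1
output2 = compress_data("hello world").bytes
puts output1 == output2 ? 'Same' : 'Different'  # => Same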

How to decode IFC using Ruby

In Ruby, I'm reading an .ifc file to get some information, but I can't decode it. For example, the file content:
"'S\X2\00E9\X0\jour/Cuisine'"
should be:
"'Séjour/Cuisine'"
I'm trying to encode it with:
puts ifcFileLine.encode("Windows-1252")
puts ifcFileLine.encode("ISO-8859-1")
puts ifcFileLine.encode("ISO-8859-5")
puts ifcFileLine.encode("iso-8859-1").force_encoding("utf-8")
But nothing gives me what I need.
I don't know anything about IFC, but based solely on the page Denis linked to and your example input, this works:
ESCAPE_SEQUENCE_EXPR = /\\X2\\(.*?)\\X0\\/

def decode_ifc(str)
  str.gsub(ESCAPE_SEQUENCE_EXPR) do
    $1.gsub(/..../) { $&.to_i(16).chr(Encoding::UTF_8) }
  end
end
str = 'S\X2\00E9\X0\jour/Cuisine'
puts "Input:", str
puts "Output:", decode_ifc(str)
All this code does is replace every sequence of four characters (/..../) between the delimiters, which will each be a Unicode code point in hexadecimal, with the corresponding Unicode character.
Note that this code handles only this specific encoding. A quick glance at the implementation guide shows other encodings, including an \X4 directive for Unicode characters outside the Basic Multilingual Plane. This ought to get you started, though.
See it on eval.in: https://eval.in/776980
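For what it's worth, here is a hedged sketch of how the same trick could be extended to the \X4 directive mentioned above, assuming (per my reading of the implementation guide) that \X4\ sequences use eight hex digits per code point and are also terminated by \X0\:

X4_ESCAPE_SEQUENCE_EXPR = /\\X4\\(.*?)\\X0\\/

def decode_ifc_x4(str)
  str.gsub(X4_ESCAPE_SEQUENCE_EXPR) do
    # Each group of eight hex digits is one Unicode code point,
    # which covers characters outside the Basic Multilingual Plane.
    $1.gsub(/.{8}/) { $&.to_i(16).chr(Encoding::UTF_8) }
  end
end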
If someone is interested, here is some Python code I wrote that decodes three of the IFC encodings: \X, \X2\ and \S\
import re

def decodeIfc(txt):
    # In regex, "\" is hard to manage in Python... I use this workaround
    txt = txt.replace('\\', 'µµµ')
    txt = re.sub('µµµX2µµµ([0-9A-F]{4,})+µµµX0µµµ', decodeIfcX2, txt)
    txt = re.sub('µµµSµµµ(.)', decodeIfcS, txt)
    txt = re.sub('µµµXµµµ([0-9A-F]{2})', decodeIfcX, txt)
    txt = txt.replace('µµµ', '\\')
    return txt

def decodeIfcX2(match):
    # X2 encodes characters as groups of 4 hexadecimal digits.
    return ''.join(chr(int(x, 16)) for x in re.findall('([0-9A-F]{4})', match.group(1)))

def decodeIfcS(match):
    return chr(ord(match.group(1)) + 128)

def decodeIfcX(match):
    # Sometimes IFC files were made on old Macs... which use MacRoman encoding.
    num = int(match.group(1), 16)
    if num <= 127 or num >= 160:
        return chr(num)
    else:
        return bytes.fromhex(match.group(1)).decode("macroman")

libsvm: read vectors from word2vec

Is there an easy way to use w2v's output vectors in libsvm?
There are two output formats for w2v: binary and text. In the text format each line begins with a word followed by a space-separated vector. e.g.:
something -0.197045 -0.292196 -0.107292 -0.168469 0.114897 -0.006383 -0.000056 0.068514 -0.079548 0.251488 0.185607 0.248675 -0.058647 0.062771 0.129014 -0.024715 -0.168974 -0.035367 -0.009597 0.090379 0.030133 0.017338 0.062264 -0.219165 -0.214198 0.226869 -0.058710 0.034563 -0.046304 0.2
Found a way with Ruby:
First require the libsvm wrapper:
require 'libsvm'
read the vectors file (assuming textual form):
lines = File.readlines('vectors.txt')
insert into a hash:
words = {}
lines[1..-1].each{ |l| sp = l.strip.split; words[sp[0]] = sp[1..-1].map(&:to_f) }
and finally use libsvm:
examples = words.values.map { |ary| Libsvm::Node.features(ary) }
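From here, a sketch of how the features might feed into actual training with the rb-libsvm gem (the labels array below is a hypothetical placeholder; you would supply whatever class each word belongs to):

# Hypothetical labels: +1 / -1 per word, purely for illustration.
labels = words.keys.map { |word| word.start_with?("a") ? 1 : -1 }

problem = Libsvm::Problem.new
problem.set_examples(labels, examples)

parameter = Libsvm::SvmParameter.new
parameter.cache_size = 100  # MB
parameter.eps = 0.001
parameter.c = 10

model = Libsvm::Model.train(problem, parameter)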

Recovering hex data from a large log-file using Ruby and RegEx

I'm trying to filter/append lines of hex data from a large log-file, using Ruby and RegEx.
The lines of the log-file that I need look like this:
Data: 10 55 61 (+ lots more hex data)
I want to collect all of the hex data for further processing later. The regex /^\sData:(.+)/ should do the trick.
My Ruby-program looks like this:
puts "Start"
fileIn = File.read("inputfile.txt")
fileOut = File.new("outputfile.txt", "w+")
fileOut.puts "Start of regex data\n"
fileIn.each_line do
  dataLine = fileIn.match(/^\sData:(.+)/).captures
  fileOut.write dataLine
end
fileOut.puts "\nEOF"
fileOut.close
puts "End"
It works - sort of - but the lines in the output file are all the same, just repeating the result of the first regex match.
What am I doing wrong?
You are matching against the entire file on every iteration instead of against the current line. Yield each line into the block, and guard against lines that do not match (match returns nil for those):
fileIn.each_line do |line|
  if (match = line.match(/^\sData:(.+)/))
    fileOut.write match[1]
  end
end
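For very large logs, a variant worth considering (a sketch along the same lines, not from the original answer) streams the file with File.foreach instead of loading it all into memory with File.read:

puts "Start"

File.open("outputfile.txt", "w+") do |file_out|
  file_out.puts "Start of regex data\n"
  # File.foreach yields one line at a time, so the whole log
  # never has to fit in memory.
  File.foreach("inputfile.txt") do |line|
    if (match = line.match(/^\sData:(.+)/))
      file_out.write match[1]
    end
  end
  file_out.puts "\nEOF"
end

puts "End"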

Get numbers from a list in a file, output to another file in Ruby?

I have a big text file that contains - among others- lines like these:
"X" : "452345230"
I want to find all lines that contain "X", and take just the number (without the quotation marks), and then output the numbers to another file, in this fashion:
452349532
234523452
213412411
219456433
etc.
What I did so far is this:
myfile = File.open("myfile.txt")
x = []
myfile.grep(/"X"/) {|line|
x << line.match( /"(\d{9})/ ).values_at( 1 )[0]
puts x
File.open("output.txt", 'w') {|f| f.write(x) }
}
It works, but the list it produces is of this form:
["23419230", "2349345234" , ... ]
How do I output it like I showed before, just numbers and each number in a line?
Thanks.
Here's a solution that doesn't leave files open:
File.open("output.txt", 'w') do |output|
  File.open("myfile.txt").each do |line|
    output.puts line[/\d{9}/] if line[/"X"/]
  end
end
I couldn't reproduce what you saw:
$ cat myfile.txt
"X" : "452345230"
"X" : "452345231"
"X" : "452345232"
"X" : "452345233"
$ ./scanner.rb
452345230
452345230
452345231
452345230
452345231
452345232
452345230
452345231
452345232
452345233
$ cat output.txt
452345230452345231452345232452345233$
However, I did notice that your application is incredibly wasteful and probably not doing what you expect: you open output.txt, write some content to it, then close it again. The next time it is opened in the loop, it is truncated and overwritten. If your file is 1,000 lines long this won't be so bad; you're only reopening the file 1,000 times. If your file is 1,000,000 lines long, this is going to be a pretty horrible performance penalty, as you create the file, write into it, and throw its contents away again one million times. Oops.
I re-wrote your tool a little bit:
$ cat scanner.rb
#!/usr/bin/ruby -w
myfile = File.open("myfile.txt")
output = File.open("output.txt", 'w')
myfile.grep(/"X"/) { |line|
  x = line.match( /"(\d{9})/ ).values_at( 1 )[0]
  puts x
  output.write(x + "\n")
}
This opens each file exactly once, writes each new line one at a time, and then lets both be closed when the application quits. Depending on whether this is a small portion of your application or the entire thing, this might be alright. (If it is a small portion of the program, then definitely close the files when you're done with them.)
This might still be wasteful for one million matched lines -- those writes are almost certainly handed straight to the system call write(2), which will involve some overhead.
How many of these will you be running? Millions? Billions? If this needs more refinement feel free to ask...
Solution:
myfile = File.open("myfile.txt")
File.open("output.txt", 'w') do |output|
  content = myfile.lines.map { |line| line.scan(/^"X".*(\d{9})/) }.flatten.join("\n")
  output.write(content)
end
Edited: I updated the code, reducing it a bit. If the example above seems complicated, you can also grab the data you want with the following statement (it may be a little clearer about what's happening):
content = myfile.lines.select { |line| line =~ /"X"/ }.map { |line| line.scan(/\d{9}/) }.join("\n")
