Reading files in a zip archive, without unzipping the archive - ruby

I have a directory with 100+ zip files and I need to read the files inside the zip files to do some data processing, without unzipping the archive.
Is there a Ruby library to read the contents of files in zip archives, without unzipping the file?
Using rubyzip gives an error:
require 'zip'
Zip::File.open('my_zip.zip') do |zip_file|
  # Handle entries one by one
  zip_file.each do |entry|
    # Extract to file/directory/symlink
    puts "Extracting #{entry.name}"
    entry.extract('here')
    # Read into memory
    content = entry.get_input_stream.read
  end
end
Gives this error:
test.rb:12:in `block (2 levels) in <main>': undefined method `read' for Zip::NullInputStream:Module (NoMethodError)
from .gem/ruby/gems/rubyzip-1.1.6/lib/zip/entry_set.rb:42:in `call'
from .gem/ruby/gems/rubyzip-1.1.6/lib/zip/entry_set.rb:42:in `block in each'
from .gem/ruby/gems/rubyzip-1.1.6/lib/zip/entry_set.rb:41:in `each'
from .gem/ruby/gems/rubyzip-1.1.6/lib/zip/entry_set.rb:41:in `each'
from .gem/ruby/gems/rubyzip-1.1.6/lib/zip/central_directory.rb:182:in `each'
from test.rb:6:in `block in <main>'
from .gem/ruby/gems/rubyzip-1.1.6/lib/zip/file.rb:99:in `open'
from test.rb:4:in `<main>'

Zip::NullInputStream is returned when the entry is a directory rather than a file. Could that be the case here?
Here's a more robust variation of the code:
#!/usr/bin/env ruby
require 'rubygems'
require 'zip'
Zip::File.open('my_zip.zip') do |zip_file|
  # Handle entries one by one
  zip_file.each do |entry|
    if entry.directory?
      puts "#{entry.name} is a folder!"
    elsif entry.symlink?
      puts "#{entry.name} is a symlink!"
    elsif entry.file?
      puts "#{entry.name} is a regular file!"
      # Read into memory; declare content first so it is still visible after the block
      content = nil
      entry.get_input_stream { |io| content = io.read }
      # Output
      puts content
    else
      puts "#{entry.name} is something unknown, oops!"
    end
  end
end

I came across the same issue. Checking entry.file? before calling entry.get_input_stream.read resolved it:
require 'zip'
Zip::File.open('my_zip.zip') do |zip_file|
  # Handle entries one by one
  zip_file.each do |entry|
    # Extract to file/directory/symlink
    puts "Extracting #{entry.name}"
    entry.extract('here')
    # Read into memory
    if entry.file?
      content = entry.get_input_stream.read
    end
  end
end

Related

Getting "Unknown file type" in ruby

Here's my code:
#!/usr/bin/ruby
require 'fileutils'
Dir.chdir "/home/john/Documents"
if (Dir.exist?("Photoshoot") === false) then
  Dir.mkdir "Photoshoot"
  puts "Directory: 'Photoshoot' created"
end
Dir.chdir "/run/user/1000/gvfs"
camdirs = Dir.glob('*')
numcams = camdirs.length
camnum = 0
campath = []
while camnum < numcams do
  campath.push("/run/user/1000/gvfs/#{camdirs[camnum]}/DCIM")
  puts campath[camnum]
  camnum += 1
end
campath.each do |path|
  Dir.chdir (path)
  foldnum = 0
  foldir = Dir.glob('*')
  puts foldir
  Dir.entries("#{path}/#{foldir[foldnum]}").each do |filename|
    filetype = File.extname(filename)
    if filetype == ".JPG"
      FileUtils.mv("#{path}/#{foldir[foldnum]}/#{filename}", "/home/john/Documents/Photoshoot")
    end
    foldnum += 1
  end
end
puts "#{numcams} cameras detected"
I'm just trying to go into some cameras I have connected and extract all the images into a folder, but it's giving me this error. One of the things that's messing me up is that the images are stored in sub-folders under DCIM. When I just use .entries, it gives me the folders the images are in as well as the images.
/usr/lib/ruby/2.3.0/fileutils.rb:1387:in `copy': unknown file type: /run/user/1000/gvfs/gphoto2:host=%5Busb%3A002%2C021%5D/DCIM//IMG_0092.JPG (RuntimeError)
from /usr/lib/ruby/2.3.0/fileutils.rb:472:in `block in copy_entry'
from /usr/lib/ruby/2.3.0/fileutils.rb:1498:in `wrap_traverse'
from /usr/lib/ruby/2.3.0/fileutils.rb:469:in `copy_entry'
from /usr/lib/ruby/2.3.0/fileutils.rb:530:in `rescue in block in mv'
from /usr/lib/ruby/2.3.0/fileutils.rb:527:in `block in mv'
from /usr/lib/ruby/2.3.0/fileutils.rb:1571:in `block in fu_each_src_dest'
from /usr/lib/ruby/2.3.0/fileutils.rb:1585:in `fu_each_src_dest0'
from /usr/lib/ruby/2.3.0/fileutils.rb:1569:in `fu_each_src_dest'
from /usr/lib/ruby/2.3.0/fileutils.rb:517:in `mv'
from /home/john/Desktop/TestExtract.rb:34:in `block (2 levels) in <main>'
from /home/john/Desktop/TestExtract.rb:31:in `each'
from /home/john/Desktop/TestExtract.rb:31:in `block in <main>'
from /home/john/Desktop/TestExtract.rb:26:in `each'
from /home/john/Desktop/TestExtract.rb:26:in `<main>'
/run/user/1000/gvfs/gphoto2:host=%5Busb%3A002%2C022%5D/DCIM
/run/user/1000/gvfs/gphoto2:host=%5Busb%3A002%2C021%5D/DCIM
/run/user/1000/gvfs/gphoto2:host=%5Busb%3A002%2C020%5D/DCIM
104___03
105___04
106___05
102___01
[Finished in 0.1s with exit code 1]
[shell_cmd: ruby "/home/john/Desktop/TestExtract.rb"]
[dir: /home/john/Desktop]
[path: /home/john/bin:/home/john/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin]
Any advice? I can't figure out what's wrong.
The path to your files looks strange because your camera storage has been mounted using FUSE. If you look very closely, you'll see that it is looking for:
/run/user/1000/gvfs/gphoto2:host=%5Busb%3A002%2C021%5D/DCIM//IMG_0092.JPG
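You have two forward slashes before the final filename, which suggests that the sub-folder name between DCIM and the filename came out empty. In your code that part is foldir[foldnum], and foldnum is incremented once per file, so it most likely runs past the end of the folder list. Try correcting this on line 34 of your app.
If the problem persists, it is possible that the user running the Ruby script does not have permission on that filesystem, or that the way FUSE constructs the paths is not compatible with Ruby's FileUtils.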
You can try to run:
cat /run/user/1000/gvfs/gphoto2:host=%5Busb%3A002%2C021%5D/DCIM/IMG_0092.JPG
as the same user that is running the Ruby process to ensure you have read permission to the filesystem.
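One way to avoid building the path by hand is to walk the sub-folders under DCIM explicitly and join the parts with File.join. The following is a sketch rather than a verified fix for gvfs mounts: move_jpgs is a hypothetical helper, the extension check is case-insensitive (the original only matched ".JPG"), and Dir.children assumes Ruby 2.5 or newer (the traceback above shows 2.3, where Dir.entries minus "." and ".." would be needed instead).

```ruby
require 'fileutils'

# Move every JPEG found one level below dcim_root
# (e.g. DCIM/104___03/IMG_0092.JPG) into dest_dir.
# Returns the destination paths of the moved files.
def move_jpgs(dcim_root, dest_dir)
  FileUtils.mkdir_p(dest_dir)
  moved = []

  Dir.children(dcim_root).sort.each do |folder|
    folder_path = File.join(dcim_root, folder)
    next unless File.directory?(folder_path)

    Dir.children(folder_path).sort.each do |filename|
      next unless File.extname(filename).casecmp?('.jpg')

      source = File.join(folder_path, filename)
      next unless File.file?(source)

      FileUtils.mv(source, dest_dir)
      moved << File.join(dest_dir, filename)
    end
  end

  moved
end

# Usage (paths taken from the question):
# move_jpgs('/run/user/1000/gvfs/gphoto2:host=%5Busb%3A002%2C021%5D/DCIM',
#           '/home/john/Documents/Photoshoot')
```

Because each sub-folder gets its own inner loop, no index has to be kept in step with the file list.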

Create a tar.gz with contents of a specific path (without chdir) with Ruby

I'm working on a method in Ruby that creates a tar.gz file archiving the directories and files under a certain path (cdpath). It is expected to be similar to tar -C cdpath -zcf targzfile srcs, but without changing the CWD (to keep it thread safe). I'm using Gem::Package::TarWriter to create the tar stream and writing it through Zlib::GzipWriter to compress.
Here's what I came up with (this is just a simple standalone test):
require 'rubygems/package'
require 'zlib'
require 'pathname'
require 'find'
cdpath="/absolute/path/to/some/place"
targzfile="test.tar.gz"
src=["some-dir-name-at-cdpath"]
BLOCKSIZE_TO_READ = 1024 * 1000
path = Pathname.new(cdpath)
raise "path #{cdpath} should be an absolute path" unless path.absolute?
raise "path #{cdpath} should be a directory" unless File.directory? cdpath
raise "Destination tar.gz file #{targzfile} already exists" if File.exist? targzfile
raise "no file or directory to tar" if !src || src.length == 0
src.each { |p| p.sub! /^/, "#{cdpath}/" }
File.open targzfile, 'wb' do |otargzfile|
  Zlib::GzipWriter.wrap otargzfile do |gz|
    Gem::Package::TarWriter.new gz do |tar|
      Find.find *src do |f|
        relative_path = f.sub "#{cdpath}/", ""
        mode = File.stat(f).mode
        if File.directory? f
          tar.mkdir relative_path, mode
        else
          File.open f, 'rb' do |rio|
            tar.add_file relative_path, mode do |tio|
              tio.write rio.read
            end
          end
        end
      end
    end
  end
end
However, I'm hitting the following exception and I can't seem to figure out what I'm doing wrong.
/usr/lib/ruby/2.1.0/rubygems/package/tar_writer.rb:108:in `add_file': Gem::Package::NonSeekableIO (Gem::Package::NonSeekableIO)
from tartest2.rb:29:in `block (5 levels) in <main>'
from tartest2.rb:28:in `open'
from tartest2.rb:28:in `block (4 levels) in <main>'
from /usr/lib/ruby/2.1.0/find.rb:48:in `block (2 levels) in find'
from /usr/lib/ruby/2.1.0/find.rb:47:in `catch'
from /usr/lib/ruby/2.1.0/find.rb:47:in `block in find'
from /usr/lib/ruby/2.1.0/find.rb:42:in `each'
from /usr/lib/ruby/2.1.0/find.rb:42:in `find'
from tartest2.rb:22:in `block (3 levels) in <main>'
from /usr/lib/ruby/2.1.0/rubygems/package/tar_writer.rb:85:in `new'
from tartest2.rb:21:in `block (2 levels) in <main>'
from tartest2.rb:20:in `wrap'
from tartest2.rb:20:in `block in <main>'
from tartest2.rb:19:in `open'
from tartest2.rb:19:in `<main>'
EDIT: I was able to resolve this by using TarWriter's add_file_simple instead of add_file. The file size needs to be obtained with File.stat; details are in the answer below.
As described in the question, the solution is to use the add_file_simple method instead of add_file. add_file has to seek back in the output to fill in the header once the size is known, and a GzipWriter stream is not seekable, hence the NonSeekableIO error. add_file_simple takes the size up front, so you need to obtain it with File.stat.
Here's a working method:
# Similar to 'tar -C cdpath -zcf targzfile srcs': 'srcs' are resolved relative to
# 'cdpath', but the current working directory is never changed.
def self.cdtargz(cdpath, targzfile, *src)
  path = Pathname.new(cdpath)
  raise "path #{cdpath} should be an absolute path" unless path.absolute?
  raise "path #{cdpath} should be a directory" unless File.directory? cdpath
  raise "Destination tar.gz file #{targzfile} already exists" if File.exist? targzfile
  raise "no file or directory to tar" if !src || src.length == 0
  src.each { |p| p.sub! /^/, "#{cdpath}/" }
  File.open targzfile, 'wb' do |otargzfile|
    Zlib::GzipWriter.wrap otargzfile do |gz|
      Gem::Package::TarWriter.new gz do |tar|
        Find.find *src do |f|
          relative_path = f.sub "#{cdpath}/", ""
          mode = File.stat(f).mode
          size = File.stat(f).size
          if File.directory? f
            tar.mkdir relative_path, mode
          else
            tar.add_file_simple relative_path, mode, size do |tio|
              # Binary mode, so the bytes written match the size from File.stat
              File.open f, 'rb' do |rio|
                tio.write rio.read
              end
            end
          end
        end
      end
    end
  end
end
EDIT: After reviewing the answer in this question, I revised the above slightly to avoid "slurping" the files. In my case 95% of the files are quite small, but a few are very big, so this makes a lot of sense. Here's the updated version:
BLOCKSIZE_TO_READ = 1024 * 1000
def self.cdtargz(cdpath, targzfile, *src)
  path = Pathname.new(cdpath)
  raise "path #{cdpath} should be an absolute path" unless path.absolute?
  raise "path #{cdpath} should be a directory" unless File.directory? cdpath
  raise "Destination tar.gz file #{targzfile} already exists" if File.exist? targzfile
  raise "no file or directory to tar" if !src || src.length == 0
  src.each { |p| p.sub! /^/, "#{cdpath}/" }
  File.open targzfile, 'wb' do |otargzfile|
    Zlib::GzipWriter.wrap otargzfile do |gz|
      Gem::Package::TarWriter.new gz do |tar|
        Find.find *src do |f|
          relative_path = f.sub "#{cdpath}/", ""
          mode = File.stat(f).mode
          size = File.stat(f).size
          if File.directory? f
            tar.mkdir relative_path, mode
          else
            tar.add_file_simple relative_path, mode, size do |tio|
              File.open f, 'rb' do |rio|
                while buffer = rio.read(BLOCKSIZE_TO_READ)
                  tio.write buffer
                end
              end
            end
          end
        end
      end
    end
  end
end
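To check that an archive written this way really contains paths relative to cdpath, it can be read back with Gem::Package::TarReader. This is a small verification sketch, not part of the original answer; list_targz is a hypothetical helper name.

```ruby
require 'rubygems/package'
require 'zlib'

# Return [name, type, size] for every entry in a gzip-compressed tar file,
# where type is :directory or :file.
def list_targz(path)
  entries = []
  File.open(path, 'rb') do |file|
    Zlib::GzipReader.wrap(file) do |gz|
      Gem::Package::TarReader.new(gz) do |tar|
        tar.each do |entry|
          type = entry.directory? ? :directory : :file
          entries << [entry.full_name, type, entry.header.size]
        end
      end
    end
  end
  entries
end

# Usage:
# list_targz('test.tar.gz').each { |name, type, size| puts "#{type} #{size} #{name}" }
```

None of the listed names should start with the absolute cdpath prefix.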

Undefined method each in Ruby Regexp task

I have a series of zip files under @workingdir, and am trying to unzip the files that match @Regexp, and print the lines from them.
require 'zip/zip'
@workingdir = '/my/dir/structure/*.zip'
@Regexp = '/yup:maybe.*nope/i'
Dir.glob(@workingdir) do |zips|
  Zip::ZipFile.open(zips) do |file|
    file.each do |search|
      tempFile = file.read(search)
      tempFile.each do |line|
        if (line =~ @Regexp ) then
          p line
        end
      end
    end
  end
end
Below is the error message from IRB:
NoMethodError: undefined method `each' for #<String:0x0000000168bf40>
from (irb):70:in `block (3 levels) in irb_binding'
from /var/lib/gems/1.9.1/gems/rubyzip2-2.0.2/lib/zip/zip.rb:1122:in `each'
from /var/lib/gems/1.9.1/gems/rubyzip2-2.0.2/lib/zip/zip.rb:1122:in `each'
from /var/lib/gems/1.9.1/gems/rubyzip2-2.0.2/lib/zip/zip.rb:1265:in `each'
from (irb):68:in `block (2 levels) in irb_binding'
from /var/lib/gems/1.9.1/gems/rubyzip2-2.0.2/lib/zip/zip.rb:1381:in `open'
from (irb):67:in `block in irb_binding'
from (irb):66:in `glob'
from (irb):66
from /usr/bin/irb:12:in `<main>'
I tried tempFile.grep and received the same error, except that grep was the undefined method. I believe I need to define a class.
Turns out my code had two problems. 1) My regular expression was being processed as a string (I should not have used the quotes). 2) Seeing as it runs fine otherwise on Ruby 1.8.7, I suspect there is a difference in how 1.8.7 and 1.9.1 process the 'each' method. If anyone has additional insights, I'm more than happy to hear them. The code below works fine on 1.8.7:
require 'zip/zip'
@workingdir = '/my/dir/structure/*.zip'
@Regexp = /regexp/i
Dir.glob(@workingdir) do |zips|
  Zip::ZipFile.open(zips) do |file|
    file.each do |search|
      tempFile = file.read(search)
      tempFile.each do |line|
        if (line =~ @Regexp) then
          puts zips + ': ' + line.chomp
        end
      end
    end
  end
end
Thanks again everyone!
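On the second point: String#each was removed in Ruby 1.9, where String is no longer Enumerable (which is also why grep is undefined). String#each_line is the replacement and exists in 1.8.7 as well. A minimal sketch of the matching step on its own, with matching_lines as a hypothetical helper name:

```ruby
# Return the lines of text that match pattern, with trailing newlines removed.
# pattern must be a Regexp literal such as /yup:maybe.*nope/i, not a quoted string.
def matching_lines(text, pattern)
  text.each_line.select { |line| line =~ pattern }.map(&:chomp)
end

# Usage inside the loop from the post:
# matching_lines(file.read(search), @Regexp).each { |line| puts zips + ': ' + line }
```

Replacing tempFile.each with tempFile.each_line in the original loop should have the same effect on both Ruby versions.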

error related to REXML

I'm not sure whether this is a REXML or a Ruby issue, but it happens when I work with REXML.
The program below should access the elements of each XML file in the directory.
#!/usr/bin/ruby -w
require 'rexml/document'
include REXML
p "Current directory was: " + Dir.pwd
Dir.chdir("/home/askar/xml_files1") {
  p "Now we're in: " + Dir.pwd
  if File.exist?(Dir.pwd)
    xml_files = Dir.glob("ShipmentRequest*.xml")
    Dir.foreach(Dir.pwd) do |file|
      xmlfile = File.new(file)
      xmldoc = Document.new(xmlfile)
    end
  else
    puts "It's empty"
  end
}
When I run:
ruby import_xml.rb
Errors:
"Current directory was: /home/askar/Dropbox/rails_studio/xml_to_mysql"
"Now we're in: /home/askar/xml_files1"
There're 6226 files in the folder...
/home/askar/.rvm/rubies/ruby-1.9.3-p429/lib/ruby/1.9.1/rexml/source.rb:148:in `read': Is a directory - . (Errno::EISDIR)
from /home/askar/.rvm/rubies/ruby-1.9.3-p429/lib/ruby/1.9.1/rexml/source.rb:148:in `initialize'
from /home/askar/.rvm/rubies/ruby-1.9.3-p429/lib/ruby/1.9.1/rexml/source.rb:14:in `new'
from /home/askar/.rvm/rubies/ruby-1.9.3-p429/lib/ruby/1.9.1/rexml/source.rb:14:in `create_from'
from /home/askar/.rvm/rubies/ruby-1.9.3-p429/lib/ruby/1.9.1/rexml/parsers/baseparser.rb:127:in `stream='
from /home/askar/.rvm/rubies/ruby-1.9.3-p429/lib/ruby/1.9.1/rexml/parsers/baseparser.rb:116:in `initialize'
from /home/askar/.rvm/rubies/ruby-1.9.3-p429/lib/ruby/1.9.1/rexml/parsers/treeparser.rb:9:in `new'
from /home/askar/.rvm/rubies/ruby-1.9.3-p429/lib/ruby/1.9.1/rexml/parsers/treeparser.rb:9:in `initialize'
from /home/askar/.rvm/rubies/ruby-1.9.3-p429/lib/ruby/1.9.1/rexml/document.rb:245:in `new'
from /home/askar/.rvm/rubies/ruby-1.9.3-p429/lib/ruby/1.9.1/rexml/document.rb:245:in `build'
from /home/askar/.rvm/rubies/ruby-1.9.3-p429/lib/ruby/1.9.1/rexml/document.rb:43:in `initialize'
from import_xml.rb:20:in `new'
from import_xml.rb:20:in `block (2 levels) in <main>'
from import_xml.rb:17:in `foreach'
from import_xml.rb:17:in `block in <main>'
from import_xml.rb:8:in `chdir'
from import_xml.rb:8:in `<main>'
When I comment out:
#xmldoc = Document.new(xmlfile)
it's not giving errors.
Folder /home/askar/xml_files1 contains only 3 xml files.
I'm using Linux Mint Nadia and
ruby -v
ruby 1.9.3p429 (2013-05-15 revision 40747) [x86_64-linux]
Also, for some reason the error shows ruby 1.9.1. Is this an issue?
I think @halfelf is correct here. The API docs say that Dir.foreach will iterate over every entry in the directory, and on Unix that includes the two directory entries . and .. (the current and the parent directory).
A couple lines before your Dir.foreach call, you use glob to build an array of files called xml_files. What happens if you iterate over that in your loop instead?
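The difference between the two listings can be seen in a throwaway directory. This is an illustrative sketch only: readable_files is a hypothetical helper, and the file names are made up to mirror the question.

```ruby
require 'tmpdir'

# Return the plain files in dir whose names match pattern, so that
# directory entries never reach a parser expecting file contents.
def readable_files(dir, pattern = '*')
  Dir.glob(File.join(dir, pattern)).select { |path| File.file?(path) }.sort
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'ShipmentRequest1.xml'), '<root/>')
  File.write(File.join(dir, 'notes.txt'), 'not xml')

  # Dir.foreach yields every entry, including "." and ".."
  p Dir.foreach(dir).to_a.sort
  # Dir.glob returns only the names that match the pattern
  p readable_files(dir, 'ShipmentRequest*.xml').map { |path| File.basename(path) }
end
```

Passing each path from readable_files to File.new and Document.new would skip the entries that trigger Errno::EISDIR.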
Just a guess: Not everything returned by Dir.foreach(Dir.pwd) is a file that can be read. Some of them are directories.
Using Nokogiri, here's how I'd write this:
#!/usr/bin/ruby -w
require 'nokogiri'
DIRNAME = "/home/askar/xml_files1"
puts "Current directory is: #{ Dir.pwd }"
Dir.chdir(DIRNAME) do
  puts "Now in: #{ DIRNAME }"
  xml_files = Dir.glob("ShipmentRequest*.xml")
  if xml_files.empty?
    puts "#{ DIRNAME } is empty."
  else
    xml_files.each do |file|
      # The block form closes the file handle once parsing is done
      doc = File.open(file) { |f| Nokogiri::XML(f) }
      # ... do something with the doc ...
    end
  end
end

IOError: closed stream in Ruby SFTP

The following code tries to list the entries of a remote directory via SFTP and Net::SFTP, but it causes a "closed stream" IOError if the directory contains a large number of files (~ 6000 files):
require 'net/ssh'
require 'net/sftp'
Net::SFTP.start('hostname', 'username', :password => 'password') do |sftp|
  # list the entries in a directory
  sftp.dir.foreach("/") do |entry|
    puts entry.longname
  end
end
What is the best way to avoid it? Versions are net-sftp Gem: 2.0.5 and net-ssh Gem: 2.2.1, Ruby: 1.8.7. The full error message reads:
IOError: closed stream
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/ruby_compat.rb:33:in `select'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/ruby_compat.rb:33:in `io_select'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/ruby_compat.rb:32:in `synchronize'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/ruby_compat.rb:32:in `io_select'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/transport/packet_stream.rb:73:in `available_for_read?'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/transport/packet_stream.rb:85:in `next_packet'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/transport/session.rb:170:in `poll_message'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/transport/session.rb:165:in `loop'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/transport/session.rb:165:in `poll_message'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/connection/session.rb:451:in `dispatch_incoming_packets'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/connection/session.rb:213:in `preprocess'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/connection/session.rb:197:in `process'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/connection/session.rb:161:in `loop'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/connection/session.rb:161:in `loop_forever'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/connection/session.rb:161:in `loop'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/connection/session.rb:110:in `close'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-sftp-2.0.5/lib/net/sftp.rb:36:in `start'
The behavior could be deliberate. If we take a look at the dir source code in net-sftp/lib/net/sftp/operations/dir.rb, we see a close operation:
def foreach(path)
  # ...
ensure
  sftp.close!(handle) if handle
end
It is possible that this close operation causes the closed stream error. If it does not indicate a bug, it is possible to catch the IOError exception. It also seems to help to run the SSH event loop occasionally:
begin
  # ...
  sftp.dir.foreach("/") do |entry|
    puts entry.longname
    # ...
    sftp.loop # Runs the SSH event loop
  end
rescue IOError => e
  puts "*** We are done: " + e.message
end

Resources