Efficiently processing a large quantity of files in Ruby

I am writing a script that needs to traverse a file system and return the SHA1 sum of each file.
The code I am using is this:
time ruby -r'digest/sha1' -r'find' -e 'Find.find("/") {|x| next unless File.file?(x) ; Digest::SHA1.hexdigest(File.read(x))}'
The problem is that I get this error message about 5 seconds after starting it:
-e:1:in `read': failed to allocate memory (NoMemoryError)
from -e:1:in `open'
from -e:1:in `block in <main>'
from /usr/share/ruby/find.rb:43:in `block in find'
from /usr/share/ruby/find.rb:42:in `catch'
from /usr/share/ruby/find.rb:42:in `find'
from -e:1:in `<main>'
Why am I getting this error, and what is the "best practice" for handling a task like this?
Help appreciated.

It doesn't seem to be well documented (or at least, I'm not looking in the right place), but the Digest library provides a way of hashing files by reading them in chunks and updating the hash as it goes, whereas File.read loads the whole file into memory.
The working code would be:
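require 'digest/sha1'
require 'find'

begin
  Find.find("/") do |file|
    next unless File.file?(file)
    # Digest::SHA1.file reads the file in chunks instead of loading it whole
    puts "#{Digest::SHA1.file(file)} #{file}"
  end
rescue => e
  puts e
end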
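To compare the two approaches on a small scale, here is a minimal sketch. The helper name sha1_of_tree and the temporary directory are assumptions made for illustration; only Digest::SHA1.file, Digest::SHA1.hexdigest and Find.find come from the code above. Unlike the version above, it rescues errors per file, so one unreadable file does not end the traversal.

```ruby
require 'digest/sha1'
require 'find'
require 'tmpdir'

# Hypothetical helper: returns { path => SHA1 hex string } for every regular
# file under root. Digest::SHA1.file reads each file in chunks, so memory use
# does not grow with file size.
def sha1_of_tree(root)
  sums = {}
  Find.find(root) do |path|
    next unless File.file?(path)
    begin
      sums[path] = Digest::SHA1.file(path).hexdigest
    rescue SystemCallError, IOError => e
      # Unreadable file (permissions, removed mid-scan, ...): report and move on.
      warn "skipped #{path}: #{e.message}"
    end
  end
  sums
end

# Demo on a throwaway directory rather than "/".
Dir.mktmpdir do |dir|
  sample = File.join(dir, 'sample.txt')
  File.write(sample, 'hello')
  sums = sha1_of_tree(dir)
  # Chunked hashing gives the same digest as hashing the whole string at once.
  puts sums[sample] == Digest::SHA1.hexdigest('hello')
end
```

Returning a hash instead of printing makes the result easy to compare between runs; printing inside the loop, as above, is the better choice when the tree is too large to keep every digest in memory.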

Why make it difficult by putting this in a one-liner?
If you put your code in a script like this, everything runs smoothly on my system and every file on my HD is read.
On a data disk you had better find a way to handle large files, like the solution at https://www.ruby-forum.com/topic/58563, which I adapted for SHA1.
require 'digest/sha1'
require 'find'
Find.find("/") do |file|
  next unless File.file?(file)
  begin
    sha = File.open(file, 'rb') do |io|
      dig = Digest::SHA1.new
      buf = ""
      dig.update(buf) while io.read(4096, buf)
      dig
    end
    puts "#{sha} #{file}"
  rescue => e
    puts e.backtrace
  end
end
gives
ba4aeced8ab461b75ff87d989ff16ca2464ea787 /$AVG/$VAULT/vault.db
31d8730390451d236b80c4351b6b287d6853570c /$AVG/$VAULT/vvfolder.idx
b4c783e3478e5b6f795e92d3cf5d85837fffd128 /$Recycle.Bin/S-1-5-21-50811273-296787125-2640436092-1000/desktop.ini
b4c783e3478e5b6f795e92d3cf5d85837fffd128 /$Recycle.Bin/S-1-5-21-50811273-296787125-2640436092-1011/desktop.ini
3109805dcc447395f58fec8b5e8a8fca1d20892b /.rnd
61fc34796b7cc67caf9da685e59461c9d13fba29 /4nt500/4NT.INI
...

Related

Getting "Unknown file type" in ruby

Here's my code:
#!/usr/bin/ruby
require 'fileutils'
Dir.chdir "/home/john/Documents"
if (Dir.exist?("Photoshoot") === false) then
Dir.mkdir "Photoshoot"
puts "Directory: 'Photoshoot' created"
end
Dir.chdir "/run/user/1000/gvfs"
camdirs = Dir.glob('*')
numcams = camdirs.length
camnum = 0
campath = []
while camnum < numcams do
campath.push("/run/user/1000/gvfs/#{camdirs[camnum]}/DCIM")
puts campath[camnum]
camnum += 1
end
campath.each do |path|
Dir.chdir (path)
foldnum = 0
foldir = Dir.glob('*')
puts foldir
Dir.entries("#{path}/#{foldir[foldnum]}").each do |filename|
filetype = File.extname(filename)
if filetype == ".JPG"
FileUtils.mv("#{path}/#{foldir[foldnum]}/#{filename}", "/home/john/Documents/Photoshoot")
end
foldnum += 1
end
end
puts "#{numcams} cameras detected"
I'm just trying to go into some cameras I have connected and extract all the images into one folder, but it's giving me this error. One of the things that's tripping me up is that the images are stored in sub-folders under DCIM. When I just use .entries it gives me the folders the images are in as well as the images.
/usr/lib/ruby/2.3.0/fileutils.rb:1387:in `copy': unknown file type: /run/user/1000/gvfs/gphoto2:host=%5Busb%3A002%2C021%5D/DCIM//IMG_0092.JPG (RuntimeError)
from /usr/lib/ruby/2.3.0/fileutils.rb:472:in `block in copy_entry'
from /usr/lib/ruby/2.3.0/fileutils.rb:1498:in `wrap_traverse'
from /usr/lib/ruby/2.3.0/fileutils.rb:469:in `copy_entry'
from /usr/lib/ruby/2.3.0/fileutils.rb:530:in `rescue in block in mv'
from /usr/lib/ruby/2.3.0/fileutils.rb:527:in `block in mv'
from /usr/lib/ruby/2.3.0/fileutils.rb:1571:in `block in fu_each_src_dest'
from /usr/lib/ruby/2.3.0/fileutils.rb:1585:in `fu_each_src_dest0'
from /usr/lib/ruby/2.3.0/fileutils.rb:1569:in `fu_each_src_dest'
from /usr/lib/ruby/2.3.0/fileutils.rb:517:in `mv'
from /home/john/Desktop/TestExtract.rb:34:in `block (2 levels) in <main>'
from /home/john/Desktop/TestExtract.rb:31:in `each'
from /home/john/Desktop/TestExtract.rb:31:in `block in <main>'
from /home/john/Desktop/TestExtract.rb:26:in `each'
from /home/john/Desktop/TestExtract.rb:26:in `<main>'
/run/user/1000/gvfs/gphoto2:host=%5Busb%3A002%2C022%5D/DCIM
/run/user/1000/gvfs/gphoto2:host=%5Busb%3A002%2C021%5D/DCIM
/run/user/1000/gvfs/gphoto2:host=%5Busb%3A002%2C020%5D/DCIM
104___03
105___04
106___05
102___01
[Finished in 0.1s with exit code 1]
[shell_cmd: ruby "/home/john/Desktop/TestExtract.rb"]
[dir: /home/john/Desktop]
[path: /home/john/bin:/home/john/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin]
Any advice? I can't figure out what's wrong.
The path to your files looks strange because your camera storage has been mounted using FUSE. If you look very closely, you'll see that it is looking for:
/run/user/1000/gvfs/gphoto2:host=%5Busb%3A002%2C021%5D/DCIM//IMG_0092.JPG
You have two forward slashes before the final filename. Try correcting this on line 34 of your app.
If the problem persists, it is possible that the user running the Ruby script does not have permission to access that filesystem, or that the way FUSE constructs the paths is not compatible with Ruby's FileUtils.
You can try to run:
cat /run/user/1000/gvfs/gphoto2:host=%5Busb%3A002%2C021%5D/DCIM/IMG_0092.JPG
as the same user that is running the Ruby process to ensure you have read permission to the filesystem.
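An empty path segment such as DCIM//IMG_0092.JPG is what string interpolation produces when the sub-folder name is nil. In the question's loop, foldnum is incremented once per file, so foldir[foldnum] can run past the end of the folder list. Below is a minimal sketch of one way to avoid that, building paths with File.join and walking each sub-folder explicitly. The helper name jpgs_under and the temporary directory layout are assumptions for illustration, and it matches the extension case-insensitively, which the original code did not. Whether FileUtils.mv then succeeds on a gvfs mount is not something this sketch can confirm.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: returns the full path of every .JPG exactly one level
# below dcim_root (for example DCIM/104___03/IMG_0092.JPG).
def jpgs_under(dcim_root)
  (Dir.entries(dcim_root) - %w[. ..]).sort.flat_map do |folder|
    folder_path = File.join(dcim_root, folder)
    # Skip plain files sitting directly in DCIM.
    next [] unless File.directory?(folder_path)

    Dir.entries(folder_path).sort
       .select { |name| File.extname(name).downcase == '.jpg' }
       .map { |name| File.join(folder_path, name) }
  end
end

# Demo with a fake camera layout in a throwaway directory.
Dir.mktmpdir do |root|
  dcim = File.join(root, 'DCIM')
  dest = File.join(root, 'Photoshoot')
  FileUtils.mkdir_p(File.join(dcim, '104___03'))
  FileUtils.mkdir_p(dest)
  File.write(File.join(dcim, '104___03', 'IMG_0092.JPG'), 'fake image data')

  jpgs_under(dcim).each { |path| FileUtils.mv(path, dest) }
  puts Dir.entries(dest).include?('IMG_0092.JPG')
end
```

Because every path is assembled from a folder name that was actually listed, no segment can be empty, and the folder index no longer depends on how many files have been processed.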

knowing if file is YAML or not

I would expect YAML.load_file(foo) from Ruby's YAML module to return nil if foo is not a YAML file, but I get an exception:
did not find expected alphabetic or numeric character while scanning an alias at line 3 column 3 (Psych::SyntaxError)
from /usr/lib/ruby/2.4.0/psych.rb:377:in `parse_stream'
from /usr/lib/ruby/2.4.0/psych.rb:325:in `parse'
from /usr/lib/ruby/2.4.0/psych.rb:252:in `load'
from /usr/lib/ruby/2.4.0/psych.rb:473:in `block in load_file'
from /usr/lib/ruby/2.4.0/psych.rb:472:in `open'
from /usr/lib/ruby/2.4.0/psych.rb:472:in `load_file'
from ./select.rb:27:in `block in selecting'
from ./select.rb:26:in `each'
from ./select.rb:26:in `selecting'
from ./select.rb:47:in `block (2 levels) in <main>'
from ./select.rb:46:in `each'
from ./select.rb:46:in `block in <main>'
from ./select.rb:44:in `each'
from ./select.rb:44:in `<main>'
How can I determine whether a file is a YAML file without an exception? In my case, I navigate to a directory and process markdown files: I add to a list the markdown files that have a key output: word, and I return that list.
mylist = Array.new
mylist = []
for d in (directory - excludinglist)
begin
info = YAML.load_file(d)
if info
if info.has_key?('output')
if info['output'].has_key?(word)
mylist.push(d)
end
end
end
rescue Psych::SyntaxError => error
return []
end
end
return mylist
When I catch the exception, the loop does not continue pushing elements onto my list.
The short answer: you can't.
Because YAML is just a text file, the only way to know whether a given text file is YAML or not is to parse it. The parser will try to parse the file, and if it is not valid YAML, it will raise an error.
Errors and exceptions are a common part of Ruby, especially in the world of IO. There's no reason to be afraid of them. You can easily rescue from them and continue on your way:
begin
yaml = YAML.load_file(foo)
rescue Psych::SyntaxError => e
# handle the bad YAML here
end
You mentioned that the following code will not work because you need to handle multiple files in a directory:
def foo
  mylist = []
  for d in (directory - excludinglist)
    begin
      info = YAML.load_file(d)
      if info
        if info.has_key?('output')
          if info['output'].has_key?(word)
            mylist.push(d)
          end
        end
      end
    rescue Psych::SyntaxError => error
      return []
    end
  end
  return mylist
end
The only issue here is that when you hit an error, you respond by returning from the function early. If you don't return, the for-loop will continue and you will get your desired functionality:
def foo
  mylist = []
  for d in (directory - excludinglist)
    begin
      info = YAML.load_file(d)
      if info
        if info.has_key?('output')
          if info['output'].has_key?(word)
            mylist.push(d)
          end
        end
      end
    rescue Psych::SyntaxError => error
      # do nothing!
      # puts "or you could display an error message!"
    end
  end
  return mylist
end
Psych::SyntaxError gets raised by Psych::Parser#parse, the source for which is written in C. So unless you want to work with C, you can't write a patch for the method in Ruby to prevent the exception from getting raised.
Still, you could certainly rescue the exception, like so:
begin
foo = YAML.load_file("not_yaml.txt")
rescue Psych::SyntaxError => error
puts "bad yaml"
end
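To get the nil-returning behaviour the question asks for, the rescue can be wrapped in a small helper. This is only a sketch: the names load_yaml_or_nil and files_with_output are hypothetical, and the sample file contents are made up. Note that plain text often parses as a valid YAML string rather than raising, so the sketch checks for a Hash before looking up keys.

```ruby
require 'yaml'
require 'tempfile'

# Hypothetical helper: the parsed content, or nil when the file is not valid YAML.
def load_yaml_or_nil(path)
  YAML.load_file(path)
rescue Psych::SyntaxError
  nil
end

# Hypothetical helper mirroring the question's loop: keeps the paths whose
# YAML has an 'output' mapping that contains the key word.
def files_with_output(paths, word)
  paths.select do |path|
    info = load_yaml_or_nil(path)
    # A file that parses to a String, nil or false is simply skipped.
    info.is_a?(Hash) && info['output'].is_a?(Hash) && info['output'].key?(word)
  end
end

# Demo with one valid and one invalid file (contents are illustrative).
good = Tempfile.new(['good', '.md'])
good.write("output:\n  html_document: default\n")
good.close

bad = Tempfile.new(['bad', '.md'])
bad.write("key: [unclosed\n")
bad.close

p files_with_output([good.path, bad.path], 'html_document') == [good.path]
```

Keeping the rescue inside the helper means the calling loop never sees the exception, so it cannot be cut short by one bad file.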

Reading files in a zip archive, without unzipping the archive

I have a directory with 100+ zip files and I need to read the files inside the zip files to do some data processing, without unzipping the archive.
Is there a Ruby library to read the contents of files in zip archives, without unzipping the file?
Using rubyzip gives an error:
require 'zip'
Zip::File.open('my_zip.zip') do |zip_file|
# Handle entries one by one
zip_file.each do |entry|
# Extract to file/directory/symlink
puts "Extracting #{entry.name}"
entry.extract('here')
# Read into memory
content = entry.get_input_stream.read
end
end
Gives this error:
test.rb:12:in `block (2 levels) in <main>': undefined method `read' for Zip::NullInputStream:Module (NoMethodError)
from .gem/ruby/gems/rubyzip-1.1.6/lib/zip/entry_set.rb:42:in `call'
from .gem/ruby/gems/rubyzip-1.1.6/lib/zip/entry_set.rb:42:in `block in each'
from .gem/ruby/gems/rubyzip-1.1.6/lib/zip/entry_set.rb:41:in `each'
from .gem/ruby/gems/rubyzip-1.1.6/lib/zip/entry_set.rb:41:in `each'
from .gem/ruby/gems/rubyzip-1.1.6/lib/zip/central_directory.rb:182:in `each'
from test.rb:6:in `block in <main>'
from .gem/ruby/gems/rubyzip-1.1.6/lib/zip/file.rb:99:in `open'
from test.rb:4:in `<main>'
The Zip::NullInputStream is returned if the entry is a directory and not a file, could that be the case?
Here's a more robust variation of the code:
#!/usr/bin/env ruby
require 'rubygems'
require 'zip'
Zip::File.open('my_zip.zip') do |zip_file|
  # Handle entries one by one
  zip_file.each do |entry|
    if entry.directory?
      puts "#{entry.name} is a folder!"
    elsif entry.symlink?
      puts "#{entry.name} is a symlink!"
    elsif entry.file?
      puts "#{entry.name} is a regular file!"
      # Read into memory; declare content first so it is visible after the block
      content = nil
      entry.get_input_stream { |io| content = io.read }
      # Output
      puts content
    else
      puts "#{entry.name} is something unknown, oops!"
    end
  end
end
I came across the same issue; checking entry.file? before calling entry.get_input_stream.read resolved it.
require 'zip'
Zip::File.open('my_zip.zip') do |zip_file|
# Handle entries one by one
zip_file.each do |entry|
# Extract to file/directory/symlink
puts "Extracting #{entry.name}"
entry.extract('here')
# Read into memory
if entry.file?
content = entry.get_input_stream.read
end
end
end

Zlib::BufError when using progressbar/ruby-progressbar gem

I use the following Ruby snippet to download an 8.9 MB file.
require 'open-uri'
require 'net/http'
require 'uri'
def http_download_no_progress_bar(uri, filename)
uri.open(read_timeout: 500) do |file|
open filename, 'w' do |io|
file.each_line do |line|
io.write line
end
end
end
end
I want to add the progressbar gem to visualize the download process:
require 'open-uri'
require 'progressbar'
require 'net/http'
require 'uri'
def http_download_with_progressbar(uri, filename)
progressbar = nil
uri.open(
read_timeout: 500,
content_length_proc: lambda { |total|
if total && 0 < total.to_i
progressbar = ProgressBar.new("...", total)
progressbar.file_transfer_mode
end
},
progress_proc: lambda { |step|
progressbar.set step if progressbar
}
) do |file|
open filename, 'w' do |io|
file.each_line do |line|
io.write line
end
end
end
end
However, it now fails with the following error:
/home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http/response.rb:357:in `finish':
buffer error (Zlib::BufError)oooooo | 8.0MB 8.6MB/s ETA: 0:00:00
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http/response.rb:357:in `finish'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http/response.rb:262:in `ensure in inflater'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http/response.rb:262:in `inflater'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http/response.rb:274:in `read_body_0'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http/response.rb:201:in `read_body'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/open-uri.rb:328:in `block (2 levels) in open_http'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:1415:in `block (2 levels) in transport_request'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http/response.rb:162:in `reading_body'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:1414:in `block in transport_request'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:1405:in `catch'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:1405:in `transport_request'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:1378:in `request'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/open-uri.rb:319:in `block in open_http'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:853:in `start'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/open-uri.rb:313:in `open_http'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/open-uri.rb:724:in `buffer_open'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/open-uri.rb:210:in `block in open_loop'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/open-uri.rb:208:in `catch'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/open-uri.rb:208:in `open_loop'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/open-uri.rb:149:in `open_uri'
from /home/user/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/open-uri.rb:704:in `open'
Meanwhile I also tried the ruby-progressbar gem:
require 'open-uri'
require 'ruby-progressbar'
require 'net/http'
require 'uri'
def http_download_with_ruby_progressbar(uri, filename)
progressbar = nil
uri.open(
read_timeout: 500,
content_length_proc: lambda { |total|
if total && 0 < total.to_i
progressbar = ProgressBar.create(title: filename, total: total)
end
},
progress_proc: lambda { |step|
progressbar.progress = step if progressbar
}
) do |file|
open filename, 'w' do |io|
file.each_line do |line|
io.write line
end
end
end
end
It fails with the same error. Here is the associated issue for the problem.
The problem lies in the file you are trying to download, since every method works with this file: https://androidnetworktester.googlecode.com/files/1mb.txt.
Your file is larger than it says it is. The content_length_proc reports 8549968 bytes (8.15MB), whereas the file is actually 101187668 bytes (96.5MB) (check with ls after downloading the file). Here is an alternative that does not crash and still shows progress:
def http_download_with_words(uri, filename)
bytes_total = nil
uri.open(
read_timeout: 500,
:content_length_proc => lambda{|content_length|
bytes_total = content_length},
:progress_proc => lambda{|bytes_transferred|
if bytes_total
# Print progress
print("\r#{bytes_transferred}/#{bytes_total}")
else
# We don’t know how much we get, so just print number
# of transferred bytes
print("\r#{bytes_transferred} (total size unknown)")
end
}
) do |file|
open filename, 'w' do |io|
file.each_line do |line|
io.write line
end
end
end
end
http_download_with_words(URI( 'http://data.wien.gv.at/daten/geo?service=WFS&request=GetFeature&version=1.1.0&typeName=ogdwien%3aBAUMOGD&srsName=EPSG:4326' ), 'temp.txt')
which is pretty self-explanatory (seen here).
The part I haven't been able to figure out is how exactly the progressbar gem interferes with Zlib. Most things seem to work fine inside the procs (e.g. having them print random stuff), so I assume both of these progress bars do something odd on completion that somehow messes with the transfer. I'd be very interested if anyone can figure out why that is.
In my testing when this occurred it was due to the raise in #set. As for why it results in an error in Zlib, that's not clear. Perhaps some strange exception handling in there. In my case I did "progbar.set(count) rescue nil" to get rid of the issue.
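If the raise comes from reporting more bytes than the announced total, another option is to clamp the value before handing it to the progress bar. The sketch below is gem-free: clamped_progress is a hypothetical helper, and the block stands in for whatever call updates the bar. That clamping avoids the Zlib error is an assumption based on the observation above, not something verified here.

```ruby
# Hypothetical helper: returns a callable suitable for progress_proc that
# never reports more than total, however many bytes actually arrive.
def clamped_progress(total, &report)
  lambda do |step|
    report.call([step, total].min)
  end
end

# Demo using the sizes mentioned above: 8549968 announced, 101187668 received.
seen = []
progress_proc = clamped_progress(8_549_968) { |n| seen << n }
[4_000_000, 8_549_968, 101_187_668].each { |step| progress_proc.call(step) }
p seen # => [4000000, 8549968, 8549968]
```

Compared with "rescue nil", clamping keeps the bar at 100% instead of silently swallowing every later update, though neither tells the user that the server announced the wrong length.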

IOError: closed stream in Ruby SFTP

The following code tries to list the entries of a remote directory via SFTP using Net::SFTP, but it raises a "closed stream" IOError if the directory contains a large number of files (~ 6000 files):
require 'net/ssh'
require 'net/sftp'
Net::SFTP.start('hostname', 'username', :password => 'password') do |sftp|
# list the entries in a directory
sftp.dir.foreach("/") do |entry|
puts entry.longname
end
end
What is the best way to avoid it? Versions are net-sftp Gem: 2.0.5 and net-ssh Gem: 2.2.1, Ruby: 1.8.7. The full error message reads:
IOError: closed stream
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/ruby_compat.rb:33:in `select'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/ruby_compat.rb:33:in `io_select'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/ruby_compat.rb:32:in `synchronize'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/ruby_compat.rb:32:in `io_select'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/transport/packet_stream.rb:73:in `available_for_read?'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/transport/packet_stream.rb:85:in `next_packet'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/transport/session.rb:170:in `poll_message'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/transport/session.rb:165:in `loop'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/transport/session.rb:165:in `poll_message'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/connection/session.rb:451:in `dispatch_incoming_packets'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/connection/session.rb:213:in `preprocess'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/connection/session.rb:197:in `process'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/connection/session.rb:161:in `loop'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/connection/session.rb:161:in `loop_forever'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/connection/session.rb:161:in `loop'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-ssh-2.2.1/lib/net/ssh/connection/session.rb:110:in `close'
from ~/.rvm/gems/ruby-1.8.7-p330/gems/net-sftp-2.0.5/lib/net/sftp.rb:36:in `start'
The behavior could be deliberate. If we look at the dir source code in net-sftp/lib/net/sftp/operations/dir.rb, we see a close operation:
def foreach(path)
..
ensure
sftp.close!(handle) if handle
end
It is possible that this close operation causes the closed stream error. If it does not indicate a bug, it is possible to catch the IOError exception. It also seems to help to run the SSH event loop occasionally:
begin
  ..
  sftp.dir.foreach("/") do |entry|
    puts entry.longname
    # ...
    sftp.loop # Runs the SSH event loop
  end
rescue IOError => e
  puts "*** We are done: " + e.message
end

Resources