I have a file on the order of a few hundred MB that needs to be compressed. I don't need to read through the file myself, so I am free to either shell out via system or use Zlib as explained in this SO question.
I am inclined towards system, because then my Ruby process doesn't have to read the file and bloat its memory; I'd just run the well-known gzip command through system. I also get the exit status, so I know how it went.
Am I missing anything? Is there a best practice around this? Any pitfalls?
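For the record, the shell-out I have in mind is roughly this (the path is just a placeholder):
path = "/data/big_dump.sql"      # hypothetical file

# Array form of system avoids shell interpretation of the filename.
ok = system("gzip", path)        # true on success, false/nil otherwise
raise "gzip failed: #{$?}" unless ok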
If you use a system command, you can't intervene in the compression. So you won't be able to redirect the compressed output to a socket, report progress externally, build a custom tar archive on the fly, etc. These things can matter when compressing large files.
Please look at the following example using ruby-zstds (zstd is a better choice than gzip these days).
require "socket"
require "zstds"
require "minitar"
TCPSocket.open "google.com", 80 do |socket|
writer = ZSTDS::Stream::Writer.new socket
begin
Minitar::Writer.open writer do |tar|
tar.add_file_simple "file.txt" do |tar_writer|
File.open "file.txt", "r" do |file|
tar_writer.write(file.read(512)) until file.eof?
end
end
tar.add_file_simple "file2.txt" ...
end
ensure
writer.close
end
end
We are reading file.txt in a streaming way, adding it to a tar archive and sending the compressed portions to google.com immediately. We never need to store a compressed file on disk.
I'd like to read and write a file atomically in Ruby between multiple independent Ruby processes (not threads).
I found atomic_write from ActiveSupport. This writes to a temp file, then moves it over the original and sets all permissions. However, this does not prevent the file from being read while it is being written.
I have not found any atomic_read. (Are file reads already atomic?)
Do I need to implement my own separate 'lock' file that I check for before reads and writes? Or is there a better mechanism already present in the file system for flagging a file as 'busy' that I could check before any read/write?
The motivation is dumb, but included here because you're going to ask about it.
I have a web application using Sinatra and served by Thin which (for its own reasons) uses a JSON file as a 'database'. Each request to the server reads the latest version of the file, makes any necessary changes, and writes out changes to the file.
This would be fine if I only had a single instance of the server running. However, I was thinking about having multiple copies of Thin running behind an Apache reverse proxy. These are discrete Ruby processes, and thus running truly in parallel.
Upon further reflection I realize that I really want to make the act of read-process-write atomic. At which point I realize that this basically forces me to process only one request at a time, and thus there's no reason to have multiple instances running. But the curiosity about atomic reads, and preventing reads during write, remains. Hence the question.
You want to use File#flock in exclusive mode. Here's a little demo. Run this in two different terminal windows.
filename = 'test.txt'

File.open(filename, File::RDWR) do |file|
  file.flock(File::LOCK_EX)   # blocks until the exclusive lock is acquired

  puts "content: #{file.read}"
  puts 'doing some heavy-lifting now'
  sleep(10)
end                           # closing the file releases the lock
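For the read-process-write cycle in the question, the same exclusive lock can wrap the whole operation. A sketch, assuming the 'database' is a JSON file called db.json:
require 'json'

File.open('db.json', File::RDWR | File::CREAT, 0644) do |f|   # 'db.json' stands in for the question's JSON 'database'
  f.flock(File::LOCK_EX)                           # blocks until we own the file
  data = f.size.zero? ? {} : JSON.parse(f.read)

  data['counter'] = data.fetch('counter', 0) + 1   # the "process" step

  f.rewind
  f.write(JSON.pretty_generate(data))
  f.flush
  f.truncate(f.pos)                                # drop leftover bytes from the old version
end                                                # closing the file releases the lock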
Take a look at transaction and open_and_lock_file methods in "pstore.rb" (Ruby stdlib).
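PStore wraps exactly that locking for you; a minimal sketch (the file name is arbitrary):
require 'pstore'

store = PStore.new('data.pstore')    # arbitrary file name
store.transaction do                 # read-write transactions take an exclusive lock
  store[:counter] = (store[:counter] || 0) + 1
end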
YAML::Store works fine for me. So when I need to read/write atomically I (ab)use it to store data as a Hash.
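YAML::Store shares PStore's transaction API, so the usage is nearly identical; a sketch (keys and values here are made up):
require 'yaml/store'

store = YAML::Store.new('data.yml')
store.transaction do
  store['users'] ||= {}
  store['users']['bob'] = { 'active' => true }
end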
I have thousands (or more) of gzipped files in a directory (on a Windows system) and one of my tools consumes those gzipped files. If it encounters corrupt gzip files, it conveniently ignores them instead of raising an alarm.
I have been trying to write a Perl program that loops through each file and makes a list of files which are corrupt.
I am using the Compress::Zlib module and have tried reading the first 1KB of each file, but that did not work, since some of the files are corrupted towards the end (verified during a manual extract; the alarm was raised only towards the end) and reading the first 1KB doesn't reveal the problem. I am wondering if a CRC check of these files would be of any help.
Questions:
Will CRC validation work in this case? If so, how does it work? Is the true CRC part of the gzip header, and are we supposed to compare it with a CRC calculated from the file we have? How do I accomplish this in Perl?
Are there any other simpler ways to do this?
In short, the only way to check a gzip file is to decompress it until you get an error, or get to the end successfully. You do not however need to store the result of the decompression.
The CRC stored at the end of a gzip file is the CRC of the uncompressed data, not the compressed data. To use it for verification, you have to decompress all of the data. This is what gzip -t does, decompressing the data and checking the CRC, but not storing the uncompressed data.
Often a corruption in the compressed data will be detected before getting to the end. But if not, then the CRC, as well as a check against an uncompressed length also stored at the end, will with a probability very close to one detect a corrupted file.
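The question asks for Perl, but for illustration here is the same idea in Ruby (the language used elsewhere on this page) with the stdlib Zlib: stream each file to the end, keep nothing, and treat any Zlib error, including a bad CRC or length in the trailer, as corruption. The glob pattern is an assumption.
require 'zlib'

def corrupt?(path)
  Zlib::GzipReader.open(path) do |gz|
    nil while gz.read(1 << 16)      # decompress in chunks, discard the output
  end
  false
rescue Zlib::Error => e             # covers CRC, length and truncation errors
  warn "#{path}: #{e.class}: #{e.message}"
  true
end

bad = Dir.glob('*.gz').select { |f| corrupt?(f) }   # '*.gz' glob is an assumption
puts bad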
The Archive::Zip FAQ gives some very good guidance on this.
It looks like the best option for you is to check the CRC of each member of the archives, and a sample program that does this -- ziptest.pl -- comes with the Archive::Zip module installation.
It should be easy to test whether a file is corrupt by just using the "gunzip -t" command; gunzip is available for Windows as well and comes with the gzip package.
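If shelling out per file is acceptable, a throwaway script along these lines (written in Ruby here; the glob pattern is assumed) would collect the failures:
# Exit status of "gzip -t" is non-zero for a corrupt file.
corrupt = Dir.glob('*.gz').reject { |f| system('gzip', '-t', '-q', f) }
puts corrupt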
I wrote a script that operates on my Mac just fine. It has this line of code in it:
filename = "2011"
File.open(filename, File::WRONLY|File::CREAT|File::EXCL) do |logfile|
logfile.puts "MemberID,FirstName,LastName,BadEmail,gender,dateofbirth,ActiveStatus,Phone"
On Windows the script runs and creates the logfile 2011, but it doesn't actually write anything to it: the file is created, the script runs, but the logging doesn't happen.
Does anyone know why? I can't think of what would have changed in the actual functionality of the script that would cause the logging to cease.
First, for clarity I wouldn't use the flags to specify how to open/create the file. I'd use:
File.open(filename, 'a')
That's the standard mode for log files: you want to create it if it doesn't exist, and you want to append if it does.
Logging typically means writing to the same file many times over the lifetime of an application. People like to open the log and leave it open, but there's potential for problems if the code crashes before the file is closed or flushed by Ruby or the OS. Also, the built-in buffering by Ruby and the OS means output accumulates and is then flushed in bursts; when you're tailing the file it jumps in big chunks, which isn't much good if you're watching for something.
You can tell Ruby to force flushing immediately when you write to the file by setting sync = true:
logfile = File.open(filename, 'a')
logfile.sync = true
logfile.puts 'foo'
logfile.close
You could use fsync, which also forces the OS to flush its buffer.
The downside to forcing sync either way is that you negate the advantage of buffering your I/O. For normal file writing, like to a text file, don't use sync, because you'll slow your application down; instead let normal I/O happen as Ruby and the OS want. For logging, though, it's acceptable, because logging should periodically emit a line, not a big blob of text.
You could instead flush explicitly after each write, but that gets redundant and violates the DRY principle:
logfile = File.open(filename, 'a')
logfile.puts 'foo'
logfile.flush
logfile.puts 'bar'
logfile.flush
logfile.close
close flushes before actually closing the file I/O.
You can wrap your logging output in a method:
def log(text)
  File.open(log_file, 'a') do |logout|
    logout.puts(text)
  end
end
That'll open, then close, the log file on each call, automatically flushing the buffer and removing the need for sync.
Or you could take advantage of Ruby's Logger class and let it do all the work for you.
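A sketch of the Logger route (the file name is assumed); note that by default Logger prefixes each line with a timestamp and severity:
require 'logger'

logger = Logger.new('2011.log')   # file name assumed; opens for append, creating if missing
logger.info 'MemberID,FirstName,LastName,BadEmail,gender,dateofbirth,ActiveStatus,Phone'
logger.close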
I am trying to download more than 1M pages (URLs ending in a sequence ID). I have implemented a kind of multi-purpose download manager with a configurable number of download threads and one processing thread. The downloader downloads files in batches:
curl = Curl::Easy.new

batch_urls.each { |url_info|
  curl.url = url_info[:url]
  curl.perform

  file = File.new(url_info[:file], "wb")
  file << curl.body_str
  file.close
  # ... some other stuff
}
I have tried downloading a sample of 8000 pages. Using the code above, I get 1000 in 2 minutes. When I write all the URLs into a file and run this in a shell:
cat list | xargs curl
I get all 8000 pages in two minutes.
The thing is, I need to have this in Ruby code, because there is other monitoring and processing code involved.
I have tried:
Curl::Multi - it is somewhat faster, but misses 50-90% of the files (it does not download them and gives no reason/code)
multiple threads with Curl::Easy - around the same speed as single-threaded
Why is a reused Curl::Easy slower than successive command-line curl calls, and how can I make it faster? Or what am I doing wrong?
I would prefer to fix my download manager code rather than handle downloading for this case in a different way.
Before this, I was calling command-line wget, which I fed a file with a list of URLs. However, not all errors were handled, and it was not possible to specify a separate output file for each URL when using a URL list.
Now it seems to me that the best way would be to use multiple threads with a system call to the 'curl' command. But why do that when I can use Curl directly from Ruby?
Code for the download manager is here, if it might help: Download Manager (I have played with timeouts, from not setting them to various values; it did not seem to help).
Any hints appreciated.
This could be a fitting task for Typhoeus
Something like this (untested):
require 'typhoeus'

def write_file(filename, data)
  file = File.new(filename, "wb")
  file.write(data)
  file.close
  # ... some other stuff
end

hydra = Typhoeus::Hydra.new(:max_concurrency => 20)

batch_urls.each do |url_info|
  req = Typhoeus::Request.new(url_info[:url])
  req.on_complete do |response|
    write_file(url_info[:file], response.body)
  end
  hydra.queue req
end

hydra.run
Come to think of it, you might run into memory problems because of the enormous amount of files. One way to prevent that would be to never store the data in a variable but instead stream it to the file directly. You could use em-http-request for that.
require 'em-http'   # provided by the em-http-request gem

EventMachine.run {
  http = EventMachine::HttpRequest.new('http://www.website.com/').get
  http.stream { |chunk| print chunk }
  # ...
}
So, if you don't set an on_body handler, then curb will buffer the download in memory. If you're downloading files, you should use an on_body handler. If you want to download multiple files using Ruby Curl, try the Curl::Multi.download interface:
require 'rubygems'
require 'curb'
urls_to_download = [
  'http://www.google.com/',
  'http://www.yahoo.com/',
  'http://www.cnn.com/',
  'http://www.espn.com/'
]
path_to_files = [
  'google.com.html',
  'yahoo.com.html',
  'cnn.com.html',
  'espn.com.html'
]
Curl::Multi.download(urls_to_download, {:follow_location => true}, {}, path_to_files) {|c,p|}
If you just want to download a single file:
Curl::Easy.download('http://www.yahoo.com/')
Here is a good resource: http://gist.github.com/405779
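And if you want to stream a single download straight to disk via the on_body handler mentioned above, it looks roughly like this (URL and file name are placeholders):
require 'curb'

File.open('yahoo.com.html', 'wb') do |f|
  c = Curl::Easy.new('http://www.yahoo.com/')
  c.follow_location = true
  c.on_body { |data| f.write(data) }   # must return the number of bytes handled; write does
  c.perform
end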
There have been benchmarks comparing curb with other libraries such as HTTPClient. The winner, in almost all categories, was HTTPClient. In addition, there are documented cases where curb does NOT work in multi-threaded scenarios.
Like you, I've had the same experience. I ran curl system commands in 20+ concurrent threads and it was 10x faster than running curb in 20+ concurrent threads. No matter what I tried, this was always the case.
I've since switched to HTTPClient, and the difference is huge. Now it runs as fast as 20 concurrent curl system commands, and uses less CPU as well.
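Not a benchmark, just a sketch of how that looks with HTTPClient across a few threads (batch_urls comes from the question; the thread count and file handling are assumptions):
require 'httpclient'

client = HTTPClient.new                              # one instance shared across threads
per_thread = [(batch_urls.size / 20.0).ceil, 1].max  # roughly 20 threads, assumed

threads = batch_urls.each_slice(per_thread).map do |slice|
  Thread.new do
    slice.each do |url_info|
      File.binwrite(url_info[:file], client.get_content(url_info[:url]))
    end
  end
end
threads.each(&:join)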
First let me say that I know almost nothing about Ruby.
What I do know is that Ruby is an interpreted language; it's not surprising that it's slower than heavily optimised code that's been compiled for a specific platform. Every file operation will probably have checks around it that curl doesn't. The "some other stuff" will slow things down even more.
Have you tried profiling your code to see where most of the time is being spent?
Stiivi,
any chance that Net::HTTP would suffice for simple downloading of HTML pages?
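For plain HTML pages with no redirects it really can be this small (URL and file name are made up):
require 'net/http'

# Fetch one page and write it to disk; repeat per sequence ID.
body = Net::HTTP.get(URI('http://www.website.com/page?id=1'))
File.binwrite('page_1.html', body)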
You didn't specify a Ruby version, but threads in 1.8.x are green (user-space) threads, not scheduled by the OS, so the entire Ruby interpreter only ever uses one CPU/core. On top of that there is a Global Interpreter Lock, and probably other locks as well, interfering with concurrency. Since you're trying to maximize network throughput, you're probably underutilizing the CPUs anyway.
Spawn as many processes as the machine has memory for, and limit the reliance on threads.
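A rough sketch of that approach, assuming Ruby 1.9+ for Process.spawn (the curl flags, child limit and batch_urls are assumptions):
MAX_CHILDREN = 8
running = []

batch_urls.each do |url_info|
  running << Process.spawn('curl', '-s', '-o', url_info[:file], url_info[:url])
  if running.size >= MAX_CHILDREN
    finished = Process.wait            # wait for any child to exit
    running.delete(finished)
  end
end
running.each { |pid| Process.waitpid(pid) }   # reap the remaining children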
I'm working on a project right now where I need to read header data from files on remote servers. I'm talking about many large files, so I can't read whole files, just the header data I need.
The only solution I have is to mount the remote server with FUSE and then read the headers from the files as if they were on my local computer. I've tried it and it works, but it has some drawbacks, especially with FTP:
Really slow (FTP compared to SSH with curlftpfs): from the same server, 90 files were read in 18 seconds over SSH, versus 10 files in 39 seconds over FTP.
Not dependable: sometimes the mountpoint will not be unmounted.
If the server is active and a passive mount is done, the mountpoint and its parent folder get locked up within about 3 minutes.
It times out, even while data is being transferred (I guess this is the FTP protocol and not curlftpfs).
FUSE is a solution, but I don't like it very much because I don't feel I can trust it. So my question is basically whether there are other solutions to the problem. The language is preferably Ruby, but any other will work if Ruby does not support the solution.
Thanks!
What type of information are you looking for?
You could try using ruby's open-uri module.
The following example is from http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/index.html
require 'open-uri'

open("http://www.ruby-lang.org/en") {|f|
  p f.base_uri         # <URI::HTTP:0x40e6ef2 URL:http://www.ruby-lang.org/en/>
  p f.content_type     # "text/html"
  p f.charset          # "iso-8859-1"
  p f.content_encoding # []
  p f.last_modified    # Thu Dec 05 02:45:02 UTC 2002
}
EDIT: It seems that the OP wanted to retrieve ID3 tag information from the remote files. This is more complex.
This appears to be a difficult problem. From the Wikipedia article on ID3:
Tag location within file
Only with the ID3v2.4 standard has it been possible to place the tag data at the end of the file, in common with ID3v1. ID3v2.2 and 2.3 require that the tag data precede the file. Whilst for streaming data this is absolutely required, for static data it means that the entire audio file must be updated to insert data at the front of the file. For initial tagging this incurs a large penalty as every file must be re-written. Tag writers are encouraged to introduce padding after the tag data in order to allow for edits to the tag data without requiring the entire audio file to be re-written, but these are not standard and the tag requirements may vary greatly, especially if APIC (associated pictures) are also embedded.
This means that depending on the ID3 tag version of the file, you may have to read different parts of the file.
Here's an article that outlines the basics of reading ID3 tags using Ruby (for ID3 tag v1.1), but it should serve as a good starting point: http://rubyquiz.com/quiz136.html
You could also look into using an ID3 parsing library, such as id3.rb or id3lib-ruby; however, I'm not sure whether either supports parsing a remote file (most likely they could with some modifications).
A "better than nothing" solution would be to start the transfer and stop it once the downloaded file has enough bytes. Since not many (if any) libraries allow interrupting the connection, it is more complex and will probably require you to hand-code a specific FTP client with two threads: one doing the FTP connection and transfer, and the other monitoring the size of the downloaded file and killing the first thread.
Or, at least, you could parallelize the file transfers, so that you don't have to wait for every file to be fully transferred before analyzing the start of each one; the transfers can then continue in the background.
There has been a proposal for a RANG command, which would allow retrieving only part of a file (here, the first bytes).
However, I didn't find any reference to this proposal being adopted, nor any implementation.
So, for a specific server, it could be worth testing (or checking the FTP server's docs) and using it if available.