I need to transfer a file via sockets:
# sender
require 'socket'

SIZE = 1024 * 1024 * 10

TCPSocket.open('127.0.0.1', 12345) do |socket|
  File.open('c:/input.pdf', 'rb') do |file|
    while chunk = file.read(SIZE)
      socket.write(chunk)
    end
  end
end
# receiver
require 'socket'
require 'benchmark'

SIZE = 1024 * 1024 * 10

server = TCPServer.new("127.0.0.1", 12345)
puts "Server listening..."
client = server.accept

time = Benchmark.realtime do
  # 'wb' so Windows doesn't translate line endings in the binary PDF
  File.open('c:/output.pdf', 'wb') do |file|
    while chunk = client.read(SIZE)
      file.write(chunk)
    end
  end
end
file_size = File.size('c:/output.pdf') / 1024 / 1024
puts "Time elapsed: #{time}. Transferred #{file_size} MB. Transfer per second: #{file_size / time} MB" and exit
Using Ruby 1.9 I get a transfer rate of ~16 MB/s (~22 MB/s using 1.8) when transferring an 80 MB PDF file from/to localhost. I'm new to socket programming, but that seems pretty slow compared to just using FileUtils.cp. Is there anything I'm doing wrong?
Well, even with localhost you still have to go through some of the TCP stack, which introduces inevitable delays from packet fragmentation and reassembly. It probably doesn't go out on the wire, where you'd be limited to 100 megabits per second (~12.5 MB/s) or a gigabit (~125 MB/s) theoretical maximum.
None of that overhead exists for a raw disk-to-disk file copy. Keep in mind that even SATA 1 gave you 1.5 gigabits per second, and I'd be surprised if you were still running on something that old. On top of that, your OS will undoubtedly be caching a lot of the file data, which isn't possible when sending over the TCP stack.
16MB per second doesn't sound too bad to me.
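If you want to shave off some of the Ruby-level overhead anyway, one thing worth trying (just a sketch, reusing the host, port and paths from the question) is IO.copy_stream, which is built into 1.9 and does the chunked copy loop for you, handing the work to the kernel where it can:

require 'socket'

TCPSocket.open('127.0.0.1', 12345) do |socket|
  File.open('c:/input.pdf', 'rb') do |file|
    IO.copy_stream(file, socket)   # one call replaces the manual read/write loop
  end
end

The receiving side can use IO.copy_stream(client, file) in the same way.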
I know this question is old, but why can't you compress before you send, then decompress on the receiving end?
require 'zlib'

def compress(input)
  Zlib::Deflate.deflate(input)
end

def decompress(input)
  Zlib::Inflate.inflate(input)
end
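One caveat: PDFs are usually already compressed internally, so the gain may be modest. If you do go this route, a minimal sketch (reusing the sender from the earlier question) is to wrap the socket in a gzip stream so the data is compressed as it is written:

require 'socket'
require 'zlib'

TCPSocket.open('127.0.0.1', 12345) do |socket|
  gz = Zlib::GzipWriter.new(socket)
  File.open('c:/input.pdf', 'rb') do |file|
    while chunk = file.read(1024 * 1024)
      gz.write(chunk)   # compressed on the fly
    end
  end
  gz.finish             # flush the gzip trailer without closing the socket
end

On the receiving side, Zlib::GzipReader.new(client) gives you an IO-like object you can read from in the usual loop.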
(Shameless plug) AFT (https://github.com/wlib/aft) already does what you're making
I have the following script, which works well locally (Windows 10 IIS, Windows 2003 Server) but not on our hosting server (Windows 2003 Server). Anything over 4 MB downloads really slowly and then times out before it gets to the end of the file. Locally, however, it downloads fast and in full.
Doing a direct download (a link to the file itself) pulls a 26.5 MB file in 5 seconds from our hosting provider's server, so there is no download limit issue. The issue seems to be with the hosting server and this script. Any ideas?
Response.AddHeader "content-disposition", "filename=" & strfileName
Response.ContentType = "application/x-zip-compressed" 'here your content-type

Dim strFilePath, lSize, lBlocks
Const CHUNK = 2048

Set objStream = CreateObject("ADODB.Stream")
objStream.Open
objStream.Type = 1 'binary
objStream.LoadFromFile Server.MapPath("up/" & strfileName & "")

lSize = objStream.Size
Response.AddHeader "Content-Size", lSize
lBlocks = 1
Response.Buffer = False

Do Until objStream.EOS Or Not Response.IsClientConnected
    Response.BinaryWrite(objStream.Read(CHUNK))
Loop

objStream.Close
Just looking at the code snippet, it appears to be fine and is the very approach I would use for downloading large files (I especially like the use of Response.IsClientConnected).
However, having said that, it's likely the size of the chunks being read in relation to the size of the file.
Very roughly the formula is something like this...
time to read = ((file size / chunk size) * read time)
So if we use your example of a 4 MB file (4,194,304 bytes) and say it takes 100 milliseconds to read each chunk, then the following applies:
Chunk Size of 2048 bytes (2 KB) will take approx. 3 minutes to read.
Chunk Size of 20480 bytes (20 KB) will take approx. 20 seconds to read.
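Spelling out the arithmetic for those two cases: 4,194,304 / 2,048 = 2,048 chunks, and 2,048 × 0.1 s ≈ 205 seconds (a bit over three minutes); at 20 KB it's roughly 205 chunks × 0.1 s ≈ 20 seconds.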
Classic ASP pages on IIS 7 and above have a default scriptTimeout of 00:01:30, so in the example above a 4 MB file constantly read at 100 milliseconds in 2 KB chunks would time out before the script could finish.
Now these are just rough figures; your read time won't stay constant and will likely be faster than 100 milliseconds (depending on disk read speeds), but I think you get the point.
So just try increasing the CHUNK.
Const CHUNK = 20480 'Read in chunks of 20 KB
The code I have is a bit different, using a For...Next loop instead of a Do...Until loop. I'm not 100% sure this will really work in your case, but it's worth a try. Here is my version of the code:
For i = 1 To iSz / chunkSize
    If Not Response.IsClientConnected Then Exit For
    Response.BinaryWrite objStream.Read(chunkSize)
Next

If iSz Mod chunkSize > 0 Then
    If Response.IsClientConnected Then
        Response.BinaryWrite objStream.Read(iSz Mod chunkSize)
    End If
End If
Basically it's due to the script timeout. I had the same problem with 1 GB files on IIS 10 after upgrading to Windows Server 2016 (the default timeout is shorter there).
I use chunks of 256000 and Server.ScriptTimeout = 600 '10 minutes
I'm working on a class to download videos from a URL.
I want to stream these videos instead of downloading them all at once, so my program uses less RAM.
The function is the following:
require 'httpclient' # gem 'httpclient'

def get_file(url, max_segment_size)
  http_client = HTTPClient.new
  segment = nil

  http_client.get_content(url) do |chunk|
    segment.nil? ? segment = chunk : segment << chunk

    if segment.size >= max_segment_size
      # send part to s3
      send_part(segment)
      segment = nil
    end
  end

  # send last part
  send_part(segment) if segment
end
However, the program still uses a lot of RAM. For example, streaming a 30 MB file makes the process consume 150 MB, which is about the same as downloading the whole file at once. (I tried using net/http with the read_body method; same results.)
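For reference, the net/http variant I mean is presumably something like the sketch below (send_part and max_segment_size are the same placeholders as in the code above):

require 'net/http'

def get_file_via_net_http(url, max_segment_size)
  uri = URI(url)
  segment = nil

  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(Net::HTTP::Get.new(uri)) do |response|
      response.read_body do |chunk|   # body is yielded in pieces
        if segment.nil?
          segment = chunk
        else
          segment << chunk
        end

        if segment.size >= max_segment_size
          send_part(segment)
          segment = nil
        end
      end
    end
  end

  send_part(segment) if segment
end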
My understanding was that setting segment = nil should free up the memory that the string was using.
Is this expected to happen? Is there a way to manually free up this space in Ruby?
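Whether the chunks are actually being collected is something you can measure directly. This is not a fix, just a quick check (a sketch using the objspace extension that ships with MRI):

require 'objspace'

# Call this, e.g. right after send_part(segment), to see how many bytes
# are held by live String objects at that point.
def live_string_bytes
  GC.start                             # force a collection first
  ObjectSpace.memsize_of_all(String)
end

If that number stays low while the process RSS stays high, the strings are being freed but MRI is simply not returning the emptied heap pages to the operating system, which is common behaviour.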
I have to calculate a hash from various streams (StringIO, File, chunked http responses...), and the sources are pretty big (around 100MB - 1GB). For example, I have the following code
require 'digest'
require 'stringio'

sha = Digest::SHA256.new

stream = StringIO.new("test\nfoo\nbar\nhello world")
# this could also be a File.open('my_file.txt')
# or a chunked http response

while content = stream.read(2)
  sha.update content
end

puts sha.to_s
This works so far, but I was wondering how the sha.update method works. Does it store a copy of the overall String in its instance, so that the whole content is held in memory?
This could lead to some serious memory issues when loading 1 GB of data into RAM (and doing this in multiple processes on the same machine).
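As far as I know, Digest::SHA256 only keeps a small, fixed-size internal state (the running hash state plus at most one partial block), so update does not retain the data you feed it. A quick sketch to convince yourself that feeding the content in pieces is equivalent to hashing it in one go:

require 'digest'

data = "test\nfoo\nbar\nhello world"

incremental = Digest::SHA256.new
data.chars.each_slice(2) { |pair| incremental.update(pair.join) }  # feed 2 bytes at a time

puts incremental.hexdigest == Digest::SHA256.hexdigest(data)       # => true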
I need to calculate MD5s of very large files (>1 TB) for deduplication purposes. For example, I have a 10 GiB file and want to simultaneously calculate the MD5 of the whole file and of each of the 10 sequential 1 GiB chunks.
I cannot use the form:
Digest::MD5.hexdigest(IO.read("file_10GiB.txt"))
because Ruby first reads the whole file into memory before calculating the MD5, so I quickly run out of memory.
So I can read it in 1MiB chunks with the following code.
require "digest"
fd = File.open("file_10GiB.txt")
digest_1GiB = Digest::MD5.new
digest_10GiB = Digest::MD5.new
10.times do
1024.times do
data_1MiB = fd.read(2**20) # Each MiB is read only once so
digest_1GiB << data_1MiB # there is no duplicate IO operations.
digest_10GiB << data_1MiB # It is then sent to both digests.
end
puts digest_1GiB.hexdigest
digest_1GiB.reset
end
puts digest_10GiB.hexdigest
That takes around 40 seconds. The results are similar using openssl instead of digest.
If I comment out digest_1GiB << data_1MiB or digest_10GiB << data_1MiB, it unsurprisingly goes twice as fast (20 seconds), and the bash command md5sum file_10GiB.txt also takes around 20 seconds, so that is consistent.
Clearly both MD5s are being calculated in the same thread and on the same core, so I thought I could use some multithreading. I'm using Ruby MRI 2.2.1, which doesn't have truly parallel threads, but with subprocesses I can calculate the MD5s on several cores simultaneously:
fd = File.open("file_10GiB.txt")

IO.popen("md5sum", "r+") do |md5sum_10GiB|
  10.times do
    IO.popen("md5sum", "r+") do |md5sum_1GiB|
      1024.times do
        data_1MiB = fd.read(2**20)   # Each MiB is read only once, so
        md5sum_1GiB  << data_1MiB    # there are no duplicate IO operations.
        md5sum_10GiB << data_1MiB    # It is then sent to both digests.
      end
      md5sum_1GiB.close_write
      puts md5sum_1GiB.gets
    end
  end
  md5sum_10GiB.close_write
  puts md5sum_10GiB.gets
end
But this takes 120 seconds, three times slower. Why would that be?
Strangely, if I comment out md5sum_1GiB << data_1MiB or md5sum_10GiB << data_1MiB it doesn't take 60 seconds as expected, but 40, which is still half of the theoretical speed.
The results are similar using Open3::popen2 instead of IO::popen. Same for openssl md5 instead of md5sum.
I have confirmed the speed differences of these pieces of code with significantly larger files, to make sure they aren't just measurement noise, and the proportions stay the same.
I have very fast IO storage with about 2.5 GiB/s sequential read, so I don't think the disk is the limiting factor here.
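One way to narrow it down (just a sketch, reusing Benchmark from earlier in the thread) is to time the file reads and the pipe writes separately, so you can see whether the extra time is spent reading or blocking on the md5sum pipe:

require 'benchmark'

fd = File.open("file_10GiB.txt")
read_time = write_time = 0.0

IO.popen("md5sum", "r+") do |md5sum|
  10240.times do                        # 10 GiB in 1 MiB chunks
    data = nil
    read_time  += Benchmark.realtime { data = fd.read(2**20) }
    write_time += Benchmark.realtime { md5sum << data }
  end
  md5sum.close_write
  puts md5sum.gets
end

puts "read: #{read_time.round(1)}s, pipe write: #{write_time.round(1)}s"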
I am running a ruby script from the command line. The script downloads a file (15 MB), unzips it, parses it as JSON and then populates a mysql db with it.
When I run it, I get a simple 'Killed' message back. What's going on? How can I find out what the problem is?
I am using it on an EC2 micro instance.
Thanks
Here's the script
require 'open-uri'
require 'zlib'
require 'json'
require_relative '../db/db.rb'

dl = open('........')
ex = Zlib::GzipReader.new dl
json = JSON.parse ex.read

events = json['resultsPage']['results']['event']
puts "starting to parse #{events.count} event(s)..."

created = 0
updated = 0

events[1..10].each do |event|
  performances = event['performance']
  performances.each do |performance|
    ar_show = Show.find_or_initialize_by_songkick_id performance['id']
    ar_show.artist_name = performance['displayName']
    ar_show.new_record? ? created += 1 : updated += 1
    ar_show.save!
  end
end

Import.create :updated => updated, :new => created
puts "complete. new: #{created} - updated: #{updated}"
You are almost certainly running out of memory, as a micro instance doesn't have much memory or swap space available. I've had this happen with Perl programs. Dynamic languages can use a lot of memory when processing large chunks of data.
The best way to test this theory is to spin up a small or large instance for under an hour (so you won't pay much for it) and try the script there. If it runs, you know that a micro instance is too small for your program to run on.
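If you want to confirm it before resizing, a rough sketch (Linux only, and the checkpoints are just illustrative) is to print the process RSS around the expensive steps of the script:

# Rough memory checkpoints (reads the process's own /proc entry, so Linux only).
def rss_mb
  File.read("/proc/#{Process.pid}/status")[/VmRSS:\s+(\d+) kB/, 1].to_i / 1024
end

puts "RSS after download: #{rss_mb} MB"
json = JSON.parse ex.read
puts "RSS after JSON.parse: #{rss_mb} MB"

A plain 'Killed' with no Ruby backtrace is usually the kernel's OOM killer at work, and dmesg on the instance should show the corresponding message.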