I am running a Ruby script from the command line. The script downloads a file (15 MB), unzips it, parses it as JSON and then populates a MySQL db with it.
When I run it, I get a simple 'Killed' message back. What's going on? How can I find out what the problem is?
I am running it on an EC2 micro instance.
Thanks
Here's the script:
require 'open-uri'
require 'zlib'
require 'json'
require_relative '../db/db.rb'
dl = open('........')
ex = Zlib::GzipReader.new dl
json = JSON.parse ex.read
events = json['resultsPage']['results']['event']
puts "starting to parse #{events.count} event(s)..."
created = 0
updated = 0
events[1..10].each do |event|
  performances = event['performance']
  performances.each do |performance|
    ar_show = Show.find_or_initialize_by_songkick_id performance['id']
    ar_show.artist_name = performance['displayName']
    ar_show.new_record? ? created += 1 : updated += 1
    ar_show.save!
  end
end
Import.create :updated => updated, :new => created
puts "complete. new: #{created} - updated: #{updated}"
You are almost certainly running out of memory, as a micro instance doesn't have much memory or swap space available. I've had this happen with Perl programs. Dynamic languages can use a lot of memory when processing large chunks of data.
The best way to test this theory is to spin up a small or large instance for under an hour (so you won't pay much for it) and try the script there. If it runs, you know that a micro instance is too small for your program to run on.
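A cheaper way to check the theory first is to watch the script's resident memory as it moves through the download, gunzip and parse steps. This is only a rough sketch, assuming a Linux host where /proc is available:

# rough diagnostic: print this process's resident set size (Linux-only assumption)
def print_rss(label)
  rss_kb = File.read("/proc/#{Process.pid}/status")[/^VmRSS:\s+(\d+)/, 1]
  puts "#{label}: #{rss_kb} kB resident"
end

print_rss 'after download'   # sprinkle calls between the download, gunzip and parse steps

If it is the kernel's OOM killer that sent the 'Killed', dmesg will usually also show an "Out of memory: Kill process" line right after the script dies.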
Related
I wrote an algorithm inspired by the merge part of the merge sort.
def self.merge(arr)
  if arr.length == 1
    return arr
  end
  groups = []
  (0...-(-arr.length/2)).each do |i|
    groups << []
    if !arr[2*i+1].nil?
      arr[2*i].each do |cal1|
        arr[2*i+1].each do |cal2|
          mergecal = func(cal1, cal2)
          if mergecal
            groups[i] << mergecal
          else
            mergecal = nil
          end
        end
      end
    else
      groups[i] = arr[2*i]
    end
  end
  arr = nil
  return merge(groups)
end
After the page using this algorithm is rendered, Task Manager reported around 500MB of RAM usage. By refreshing the same page again, memory usage has now reached 1GB. I tried adding GC.start(full_mark: true) to the controller just after the function call, but nothing seems to have changed. I'm not sure whether the memory leak has to do with my code or with Ruby itself.
Ruby garbage collection doesn't immediately reduce the amount of memory your Ruby program has allocated. Memory allocation is expensive, so even if the objects you create are collected by the GC right away, the memory is only slowly released back to the OS. If you think this function has a memory leak, you should try running it in a non-Rails process where you have more control over object lifecycles. You can use GC.stat to get information about the number of live and free objects before and after you run the GC. It's also worth reading up on how Ruby GC works; I like this article.
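For example, a standalone (non-Rails) check could look something like the sketch below. merge_some_calendars is just a hypothetical stand-in for your merge(arr) call, and the exact GC.stat key names vary a little between Ruby versions:

def live_slots
  GC.stat[:heap_live_slots]   # slots currently occupied by live objects
end

before = live_slots
1_000.times { merge_some_calendars }   # hypothetical stand-in for merge(arr)
GC.start(full_mark: true, immediate_sweep: true)
after = live_slots

puts "live slots before: #{before}, after GC: #{after}"
# If `after` keeps climbing across repeated runs, objects are being retained;
# if it falls back down, the memory just hasn't been handed back to the OS yet.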
I must be missing something, but every single application I write in Ruby seems to leak some memory. I use Ruby MRI 2.3 but I see the same behaviour with other versions.
Whenever I write a test application that does something inside a loop, it slowly leaks memory.
while true
  # do something
  sleep 0.1
end
For instance, I can write to an array and then clear it in a loop, or just send an HTTP POST request.
Here is just one example, but I have many more like this:
require 'net/http'
require 'json'
require 'openssl'
class Tester
  def send_http some_json
    begin
      @uri = URI('SERVER_URL')
      @http = Net::HTTP.new(@uri.host, @uri.port)
      @http.use_ssl = true
      @http.keep_alive_timeout = 10
      @http.verify_mode = OpenSSL::SSL::VERIFY_NONE
      @http.read_timeout = 30
      @req = Net::HTTP::Post.new(@uri.path, 'Content-Type' => 'application/json')
      @req.body = some_json.to_json
      res = @http.request(@req)
    rescue Exception => e
      puts e.message
      puts e.backtrace.inspect
    end
  end

  def run
    while true
      some_json = {"name": "My name"}
      send_http(some_json)
      sleep 0.1
    end
  end
end

Tester.new.run
The leak I see is very small, around 0.5 MB every hour.
I ran the code with MemoryProfiler and with GC::Profiler.enable, and neither shows a leak. So it must be one of two things:
There is a memory leak in C code. This might be possible, but I don't use any external gems, so I find it hard to believe that Ruby itself is leaking.
There is no memory leak and this is some sort of Ruby memory management mechanism. The thing is that I can definitely see the memory growing. Until when will it grow? How long do I need to wait to know whether it is a leak or not?
The same code runs perfectly fine with JRuby without any leaks.
I was amazed reading a Stack Overflow post from Joe Edgar:
Ruby’s history is mostly as a command line tool for text processing and therefore it values quick startup and a small memory footprint. It was not designed for long-running daemon/server processes.
If what is written there is true and Ruby doesn't release memory back to the OS, then... we will always have a leak, right?
For instance:
Ruby asks for memory from the OS.
The OS provides the memory to Ruby.
Ruby frees the memory, but the GC still hasn't run.
Ruby asks for more memory from the OS.
The OS provides more memory to Ruby.
Ruby runs GC, but it is too late as Ruby has already asked twice.
And so on and on.
What am I missing here?
Look Into GC Compaction and (Un)frozen String Literals
"Identical" Strings Aren't Necessarily Identical
Prior to Ruby 2.7.0, mainline Ruby didn't have compacting garbage collection. While I don't fully understand all the internals, the gist is that objects couldn't be moved once allocated, so the heap could stay fragmented. Since you're using Ruby 2.3, that's something to keep in mind as you work on your memory allocation issues. Other non-YARV VMs handle their internals differently, which is why you might see variation when using alternative engines like JRuby.
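For what it's worth, on Rubies that do ship compaction (2.7 and later, so not your 2.3), it can be triggered by hand; a minimal sketch, with the stat key name being my assumption:

# Ruby 2.7+ only: run a full GC, then defragment the heap
GC.start
GC.compact
puts GC.stat[:compact_count]   # number of compactions so far (key name assumed)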
Even with Ruby 3.0.0-preview2, String literals aren't frozen by default, so your current implementation is creating a new String object with a unique object ID every tenth of a second. Consider the following:
3.times.map { 'foo'.__id__ }
#=> [240, 260, 280]
Even though the String objects seem identical, Ruby is actually allocating each one as a unique object in memory. Because a loop iteration is not a scope gate, those String objects can't be collected or compacted by YARV.
Enable Frozen String Literals by Default
You may have other issues as well, but it seems likely that your largest issue is keeping all of those String literals in scope indefinitely within your endless while-loop. You may be able to resolve your garbage collection problem (it's not a memory leak) by using frozen String literals instead. Consider the following:
# run irb with universally-frozen string literals
RUBYOPT="--enable-frozen-string-literal" irb
3.times.map { 'foo'.__id__ }
#=> [240, 240, 240]
You can solve this within your code in other ways as well, but reducing the number of String literals that remain in scope seems like a very sensible place to start.
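If you'd rather not set RUBYOPT for the whole process, the per-file magic comment (available since Ruby 2.3, so it works on your version) has the same effect for the literals in that file:

# frozen_string_literal: true
# With the magic comment above, 'foo' (and the literals in your loop body)
# is allocated once and reused on every iteration.
3.times.map { 'foo'.__id__ }
#=> the same object ID three times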
I'm working on a class to download videos from a URL.
I want to stream these videos instead of downloading them all at once, so my program uses less RAM.
The function is the following:
def get_file(url, max_segment_size)
  http_client = HTTPClient.new
  segment = nil
  http_client.get_content(url) do |chunk|
    segment.nil? ? segment = chunk : segment << chunk
    if segment.size >= max_segment_size
      # send part to s3
      send_part(segment)
      segment = nil
    end
  end
  # send last part
  send_part(segment) if segment
end
However, the program still uses a lot of RAM. For example, streaming a 30MB file makes the process consume 150MB. Compared to downloading the whole file at once, it uses about the same amount of RAM. (I tried using net/http with the read_body method; same results.)
My understanding was that setting segment = nil should free up the memory that the variable was using.
Is this expected to happen? Is there a way to manually free up this space in Ruby?
Greetings, all,
I need to run a potentially long-running process from Ruby 1.9.2 on Windows and subsequently capture and parse the data from the external process's standard output and error. A large amount of data can be sent to each, but I only need to work with one line at a time (not capture and store the whole of the output).
After a bit of research, I found that the Open3 module would take care of executing the process and giving me IO objects connected to the process's standard output and error (via popen3).
Open3.popen3("external-program.bat") do |stdin, out, err, thread|
  # Step3.profit() ?
end
However, I'm not sure how to continually read from both streams without blocking the program. Since calling IO#readlines on out or err results in a memory allocation error when a lot of data has been sent, I'm trying to continuously check both streams for available input, but I'm not having much luck with any of my implementations.
Thanks in advance for any advice!
After a lot of different trial and error attempts, I eventually came up with using two threads, one to read from each stream (generator.rb is just a script I wrote to output things to standard out and err):
require 'open3'
data = {}
Open3.popen3("ruby generator.rb") do |stdin, out, err, external|
  # Create a thread to read from each stream
  { :out => out, :err => err }.each do |key, stream|
    Thread.new do
      until (line = stream.gets).nil? do
        data[key] = line
      end
    end
  end
  # Don't exit until the external process is done
  external.join
end
puts data[:out]
puts data[:err]
It simply outputs the last line sent to standard output and error by the external program, but it could obviously be extended to do additional processing (with different logic in each thread). A method I was using before I finally came up with this was failing occasionally due to race conditions; I don't know if this code is still vulnerable, but I've yet to experience a similar failure.
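One caveat worth noting (my own observation, not part of the original answer): external.join only waits for the child process to exit, not for the two reader threads, so in principle the last lines of output could be missed. A sketch that also joins the readers before leaving the block might look like this:

require 'open3'

data = { :out => [], :err => [] }
Open3.popen3("ruby generator.rb") do |stdin, out, err, external|
  # keep the reader threads so they can be joined after the process exits
  readers = { :out => out, :err => err }.map do |key, stream|
    Thread.new do
      until (line = stream.gets).nil?
        data[key] << line   # collect (or process) each line as it arrives
      end
    end
  end
  external.join          # wait for the external process to finish...
  readers.each(&:join)   # ...and for both streams to be fully drained
end

puts data[:out].last
puts data[:err].last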
I have a datastore with a cache and a db, simple. The tricky part is that I want a way to control whether the datastore hits the db in real time. That is to say, while the process is running I want to be able to toggle whether it's connected to the db or not.
I looked into env variables, but it doesn't seem like those get updated while the process runs. Is there a simple way to get a bit from the command line into the running process, or do I just need to rely on ops being able to drop the db listeners in case of disaster?
Note that this is all being done in vanilla ruby - not ruby on rails.
Thanks!
-Jess
I think you can use named pipes for simple communication:
#pipes.rb:
f = File.open 'mypipe', 'r+'
loop do
  begin
    s = f.read_nonblock 1
  rescue Exception
    # nothing to read right now
  end
  case s
  when '0'
    puts 'Turn off DB access!'
  when '1'
    puts 'Turn on DB access!'
  end
  sleep 1
end
And you can control your db access externally by writing to the named pipe:
jablan-mbp:dev $ echo 101 > mypipe
Which results in:
jablan-mbp:dev $ ruby pipes.rb
Turn on DB access!
Turn off DB access!
Turn on DB access!
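One detail worth spelling out (my addition): the named pipe has to exist as a FIFO before either side opens it, otherwise the shell redirection would just create a regular file. Something like this in a setup step first:

# create the FIFO once before starting pipes.rb
# (File.mkfifo is available from Ruby 2.3; `mkfifo mypipe` from the shell works too)
File.mkfifo('mypipe') unless File.exist?('mypipe')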
A shared-memory strategy might be worth considering. Assuming you're running on a POSIX system, check out mmap for memory-mapped files, and SysVIPC for message queues, semaphores, and shared memory.
Assuming *NIX, have you considered signals? (kill -HUP pid) - http://ruby-doc.org/core/classes/Signal.html
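If you go the signal route, here is a rough sketch of what the handler could look like inside your datastore process; the $db_enabled flag and the read_from_db / read_from_cache methods are just illustrative names, not anything from your code:

# flip a global flag whenever the process receives SIGHUP
$db_enabled = true
Signal.trap('HUP') { $db_enabled = !$db_enabled }

# the datastore consults the flag before touching the db
# (read_from_db / read_from_cache are hypothetical)
def fetch(key)
  $db_enabled ? read_from_db(key) : read_from_cache(key)
end

Then kill -HUP <pid> from the command line toggles the bit while the process keeps running.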