Ruby paging over API response dataset causes memory spike

I'm experiencing a large memory spike when I page through a dataset returned by an API. The API returns ~150k records; I'm requesting 10k records at a time and paging through 15 pages of data. The data is an array of hashes, each hash containing 25 keys with ~50-character string values. This process kills my 512MB Heroku dyno.
I have a method used for paging an API response dataset.
def all_pages value_key = 'values', &block
  response = {}
  values = []
  current_page = 1
  total_pages = 1
  offset = 0
  begin
    response = yield offset
    # The following seems to be the culprit
    values += response[value_key] if response.key? value_key
    offset = response['offset']
    total_pages = (response['totalResults'].to_f / response['limit'].to_f).ceil if response.key? 'totalResults'
  end while (current_page += 1) <= total_pages
  values
end
I call this method like so:
all_pages("items") do |current_page|
  get "#{data_uri}/data", query: {offset: current_page, limit: 10000}
end
I know it's the concatenation of the arrays that is causing the issue, as removing that line allows the process to run with no memory problems. What am I doing wrong? The whole dataset is probably no larger than 20MB - how is that consuming all the dyno memory? What can I do to improve the efficiency here?
Update
Response looks like this: {"totalResults":208904,"offset":0,"count":1,"hasMore":true,"limit":"10000","items":[...]}
Update 2
Running with memory reporting shows the following:
[HTTParty] [2014-08-13 13:11:22 -0700] 200 "GET 29259/data" -
Memory 171072KB
[HTTParty] [2014-08-13 13:11:26 -0700] 200 "GET 29259/data" -
Memory 211960KB
... removed for brevity ...
[HTTParty] [2014-08-13 13:12:28 -0700] 200 "GET 29259/data" -
Memory 875760KB
[HTTParty] [2014-08-13 13:12:33 -0700] 200 "GET 29259/data" -
Errno::ENOMEM: Cannot allocate memory - ps ax -o pid,rss | grep -E "^[[:space:]]*23137"
Update 3
I can recreate the issue with the basic script below. The script is hard coded to only pull 100k records and already consumes over 512MB of memory on my local VM.
#!/usr/bin/ruby
require 'uri'
require 'net/http'
require 'json'

uri = URI.parse("https://someapi.com/data")
offset = 0
values = []

begin
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.set_debug_output($stdout)
  request = Net::HTTP::Get.new(uri.request_uri + "?limit=10000&offset=#{offset}")
  request.add_field("Content-Type", "application/json")
  request.add_field("Accept", "application/json")
  response = http.request(request)
  json_response = JSON.parse(response.body)
  values << json_response['items']
  offset += 10000
end while offset < 100_000
values
Update 4
I've made a couple of improvements that seem to help, but they don't completely alleviate the issue.
1) Using symbolize_keys turned out to consume less memory. This is because the keys of each hash are the same, and it's cheaper to symbolize them than to parse them as separate Strings.
2) Switching to ruby-yajl for JSON parsing consumes significantly less memory as well.
Memory consumption of processing 200k records:
JSON.parse(response.body): 861080KB (Before completely running out of memory)
JSON.parse(response.body, symbolize_keys: true): 573580KB
Yajl::Parser.parse(response.body): 357236KB
Yajl::Parser.parse(response.body, symbolize_keys: true): 264576KB
This is still an issue though.
Why does a dataset that's no more than 20MB take that much memory to process?
What is the "right way" to process large datasets like this?
What does one do when the dataset becomes 10x larger? 100x larger?
I will buy a beer for anyone who can thoroughly answer these three questions!
Thanks a lot in advance.

You've identified the problem: using += with your array. Every values += page builds a brand-new array containing everything accumulated so far, copies all the elements into it, and leaves the old array behind for the garbage collector, so the copying cost and the garbage grow with every page. The solution is to append to the existing array in place instead of creating a new one each time:
values.concat(response[value_key]) if response.key? value_key
(push or << would also avoid the reallocation, but they would append each page as a single nested array rather than merging its elements into values.)
You should only use += if you actually want a new array. You don't appear to want a new array here; you just want all the elements collected in a single array.
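To see the difference between the three ways of appending, here is a small standalone snippet (not part of the question's code):
page = ["a", "b", "c"]

a = []; a += page       # allocates a brand-new array on every call
b = []; b.concat(page)  # appends the elements in place; b stays flat
c = []; c << page       # appends the whole page as a single nested element

p a  #=> ["a", "b", "c"]
p b  #=> ["a", "b", "c"]
p c  #=> [["a", "b", "c"]]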

Related

Improve code result speed by multiprocessing

I'm self-studying Python and this is my first piece of code.
I'm analyzing logs from our servers, usually a full day of logs at a time. I created the script below (an example with simplified logic) just to check speed. With straightforward single-process code, analyzing 20 million rows takes about 12-13 minutes; I need to process 200 million rows in about 5 minutes.
What I tried:
Multiprocessing (I hit a shared-memory issue, which I think I fixed), but the result is 300K rows = 20 seconds no matter how many processes I use. (I also need to control the number of processes in advance.)
Threading, which gave no speedup: 300K rows = 2 seconds, the same as the plain code.
asyncio (I thought the script was slow because it reads many files), with the same result as threading: 300K rows = 2 seconds.
So I suspect all three of my scripts are incorrect and not working as intended.
PS: I'm trying to avoid specialized Python modules (like pandas) because that would make the script harder to run on different servers; I'd rather stick to the standard library.
Please help me check the first one - multiprocessing.
import csv
import os
from multiprocessing import Process, Queue, Value, Manager

file = {"hcs.log", "hcs1.log", "hcs2.log", "hcs3.log"}

def argument(m, a, n):
    proc_num = os.getpid()
    a_temp_m = a["vod_miss"]
    a_temp_h = a["vod_hit"]
    with open(os.getcwd() + '/' + m, newline='') as hcs_1:
        hcs_2 = csv.reader(hcs_1, delimiter=' ')
        for j in hcs_2:
            if j[3].find('MISS') != -1:
                a_temp_m[n] = a_temp_m[n] + 1
            elif j[3].find('HIT') != -1:
                a_temp_h[n] = a_temp_h[n] + 1
    a["vod_miss"][n] = a_temp_m[n]
    a["vod_hit"][n] = a_temp_h[n]

if __name__ == '__main__':
    procs = []
    manager = Manager()
    vod_live_cuts = manager.dict()
    i = "vod_hit"
    ii = "vod_miss"
    cpu = 1
    n = 1
    vod_live_cuts[i] = manager.list([0] * cpu)
    vod_live_cuts[ii] = manager.list([0] * cpu)
    for m in file:
        proc = Process(target=argument, args=(m, vod_live_cuts, (n - 1)))
        procs.append(proc)
        proc.start()
        if n >= cpu:
            n = 1
            proc.join()
        else:
            n += 1
    [proc.join() for proc in procs]
    [proc.close() for proc in procs]
My expectation is that each file is processed by an independent process via def argument, and that all the results end up in the dict vod_live_cuts. I added an independent list in the dict for each process, which I thought would help with sharing this data across processes, but maybe that's the wrong approach :(
Using IPC is costly, so only use "shared objects" for saving the final result, not for intermediate results while parsing the file.
Limiting the number of processes is done with a multiprocessing.Pool. The following code uses one to reach the maximum hard-disk speed; you only need to post-process the results.
You can only parse data as fast as your HDD can read it (typically 30-80 MB/s), so if you need to improve performance further you should use an SSD or RAID0 for higher disk speed; you cannot get much faster than this without changing your hardware.
import csv
import os
from multiprocessing import Process, Queue, Value, Manager, Pool

file = {"hcs.log", "hcs1.log", "hcs2.log", "hcs3.log"}

def argument(m, a):
    proc_num = os.getpid()
    a_temp_m_n = 0  # make it local to process
    a_temp_h_n = 0  # as shared lists use IPC
    with open(os.getcwd() + '/' + m, newline='') as hcs_1:
        hcs_2 = csv.reader(hcs_1, delimiter=' ')
        for j in hcs_2:
            if j[3].find('MISS') != -1:
                a_temp_m_n = a_temp_m_n + 1
            elif j[3].find('HIT') != -1:
                a_temp_h_n = a_temp_h_n + 1
    a["vod_miss"].append(a_temp_m_n)
    a["vod_hit"].append(a_temp_h_n)

if __name__ == '__main__':
    manager = Manager()
    vod_live_cuts = manager.dict()
    i = "vod_hit"
    ii = "vod_miss"
    cpu = 1
    vod_live_cuts[i] = manager.list()
    vod_live_cuts[ii] = manager.list()
    with Pool(cpu) as pool:
        tasks = []
        for m in file:
            task = pool.apply_async(argument, args=(m, vod_live_cuts))
            tasks.append(task)
        for task in tasks:
            task.get()
    print(list(vod_live_cuts[i]))
    print(list(vod_live_cuts[ii]))

PySpark performance tuning - cache or not to cache?

I am trying to speed up the calculations from multiple operations that I am adding as columns in a PySpark data frame, and I came across the sparkbyexamples article on performance tuning. I am considering how to use cache and the spark.sql.shuffle.partitions solutions.
Would cache be appropriate for code that first joins multiple data frames and then adds calculations over different windows?
What happens when I reassign the cached data frame (see below)?
Example:
df = dfA.join(dfB, on=['key'], how='left')  # should I add .cache here?

w_u = Window.partitionBy('user')
w_m = Window.partitionBy(['user', 'month']).orderBy('month')\
    .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

MLAB = ['val1', 'val2']  # example to indicate that I run similar operations multiple times
for mlab in MLAB:
    percent_50 = F.expr('percentile_approx(' + mlab + ', 0.5)')
    df = df.withColumn(mlab + '_md', percent_50.over(w_u))  # what happens with the cache when I reassign it?
Afterwards I am adding additional operations that include aggregations, such as:
radius_df = (df
    # number of visits per stop
    .groupby('userId', 'locationId').agg(F.count(F.lit(1)).alias('n_i'),
                                         F.first('locationLongitude').alias('locationLongitude'),
                                         F.first('locationLatitude').alias('locationLatitude'))
    # compute center of mass (lat/lon) per user
    .withColumn('center_lon', F.avg(F.col('locationLongitude')).over(w))
    .withColumn('center_lat', F.avg(F.col('locationLatitude')).over(w))
    # compute total visits
    .withColumn('N', F.sum(F.col('n_i')).over(w))
    # compute (r_i - r_cm)
    .withColumn('distance', distance(F.col('locationLatitude'), F.col('locationLongitude'), F.col('center_lat'), F.col('center_lon')))
    # compute n_i(r_i - r_cm)^2 / N
    .withColumn('distance2', F.col('n_i') * (F.col('distance') * F.col('distance')) / F.col('N'))
    # compute sum(n_i(r_i - r_cm)^2)
    .groupBy('userId').agg(F.sum(F.col('distance2')).alias('sum_dist2'))
    # square root
    .withColumn('radius_gyr', F.sqrt(F.col('sum_dist2')))
    .select('userId', 'radius_gyr')
)
df_f = df.join(radius_df.dropDuplicates(), on='userId', how='left')
I am open to any suggestions on how to speed up the code. Many thanks.

Ruby Zlib compression gives different outputs for the same input

I have this ruby method for compressing a string -
def compress_data(data)
  output = StringIO.new
  gz = Zlib::GzipWriter.new(output)
  gz.write(data)
  gz.close
  compressed_data = output.string
  compressed_data
end
When I call this method with the same input, I get different outputs at different times. I am trying to get the byte array for the compressed outputs and compare them.
The output is Different when I run the below -
input = "hello world"
output1 = (compress_data input).bytes.to_a
sleep 1
output2 = (compress_data input).bytes.to_a

if output1 == output2
  puts 'Same'
else
  puts 'Different'
end
The output is Same when I remove the sleep. Does the compression algorithm have something to do with the current time?
Option 1 - fixed mtime:
Yes. The compression time is stored in the header. You can use the mtime method to set the time to a fixed value, which will resolve your problem:
gz = Zlib::GzipWriter.new(output)
gz.mtime = 1
gz.write(data)
gz.close
Note that the Ruby documentation says that setting mtime to zero will disable the timestamp. I tried it, and it does not work. I also looked at the source code, and it appears this functionality is missing. Seems like a bug. So you have to set it to something other than 0 (but see the comments below - it will be fixed in future releases).
Option 2 - skip the header:
Another option is to just skip the header when checking for similar data. The header is 10 bytes long, so to only check the data:
data = compress_data(input).bytes[10..-1]
Note that you do not need to call to_a on bytes. It is already an Array:
String.bytes -> an_array
Returns an array of bytes in str. This is a shorthand for str.each_byte.to_a.
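Putting option 1 to use, here is a small sketch of the comparison from the question with a fixed mtime (this assumes you modify compress_data as shown above):
require 'zlib'
require 'stringio'

def compress_data(data)
  output = StringIO.new
  gz = Zlib::GzipWriter.new(output)
  gz.mtime = 1  # fixed timestamp so the 10-byte gzip header no longer varies
  gz.write(data)
  gz.close
  output.string
end

output1 = compress_data("hello world").bytes
sleep 1
output2 = compress_data("hello world").bytes
puts output1 == output2 ? 'Same' : 'Different'  #=> Same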

Parse Apache Formatted URLs in Ruby

How can I take in an Apache Common Log format file and list all of the URLs in it in a neat histogram like:
/favicon.ico ##
/manual/mod/mod_autoindex.html #
/ruby/faq/Windows/ ##
/ruby/faq/Windows/index.html #
/ruby/faq/Windows/RubyonRails #
/ruby/rubymain.html #
/robots.txt ########
Sample of test file:
65.54.188.137 - - [03/Sep/2006:03:50:20 -0400] "GET /~longa/geomed/ppa/doc/localg/localg.htm HTTP/1.0" 200 24834
65.54.188.137 - - [03/Sep/2006:03:50:32 -0400] "GET /~longa/geomed/modules/sv/scen1.html HTTP/1.0" 200 1919
65.54.188.137 - - [03/Sep/2006:03:53:51 -0400] "GET /~longa/xlispstat/code/statistics/introstat/axis/code/axisDens.lsp HTTP/1.0" 200 15962
65.54.188.137 - - [03/Sep/2006:04:03:03 -0400] "GET /~longa/geomed/modules/cluster/lab/nm.pop HTTP/1.0" 200 66302
65.54.188.137 - - [03/Sep/2006:04:11:15 -0400] "GET /~longa/geomed/data/france/names.txt HTTP/1.0" 200 20706
74.129.13.176 - - [03/Sep/2006:04:14:35 -0400] "GET /~jbyoder/ambiguouslyyours/ambig.rss HTTP/1.1" 304 -
This is what I have right now (but I'm not sure how to make the histogram):
...
---
$apache_line = /\A(?<ip_address>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>GET|POST) (?<url>\S+) \S+?" (?<status>\d+) (?<bytes>\S+)/
$parts = apache_line.match(file)
$p parts[:ip_address], parts[:status], parts[:method], parts[:url]

def get_url(file)
  hits = Hash.new { |h, k| h[k] = 0 }
  File.read(file).to_a.each do |line|
    while $p parts[:url]
      if k = k
        h[k] += 1
        puts "%-15s %s" % [k, '#' * h[k]]
      end
    end
  end
...
---
---
Here is the full question: http://pastebin.com/GRPS6cTZ Pseudo code is fine.
You can create a hash mapping each path to the number of hits. For convenience, I suggest using a Hash that sets the value to 0 when you ask for a path it hasn't seen before. For example:
hits = Hash.new{ |h,k| h[k]=0 }
...
hits["/favicon.ico"] += 1
hits["/ruby/faq/Windows/"] += 1
hits["/favicon.ico"] += 1
p hits
#=> {"/favicon.ico"=>2, "/ruby/faq/Windows/"=>1}
In case the log file is really huge, instead of slurping the whole thing into memory, process the lines one at a time. (Look through the methods of the File class.)
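For instance, File.foreach yields one line at a time without loading the whole file into memory (log_file here is a hypothetical path):
File.foreach(log_file) do |line|
  # parse and count this line, then let it be garbage collected
end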
Because Apache log file formats don't have standard delimiters, I'd suggest using a regular expression to take each line and separate it into the chunks you want. Assuming you're using Ruby 1.9, I'm going to use named captures for clean access to the matches later on. For example:
apache_line = /\A(?<ip_address>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>GET|POST) (?<url>\S+) \S+?" (?<status>\d+) (?<bytes>\S+)/
...
parts = apache_line.match(log_line)
p parts[:ip_address], parts[:status], parts[:method], parts[:url]
You might choose to filter these based on the status code. For example, do you want to include in your graph all the 404 hits where someone mistyped a URL? If you're not slurping all the lines into memory, you won't be using Array#select, but will instead skip over those lines during your loop.
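A sketch of skipping during the loop, reusing apache_line and hits from the snippets above (log_file is again a hypothetical path):
File.foreach(log_file) do |line|
  parts = apache_line.match(line)
  next if parts.nil?                   # skip lines the pattern didn't match
  next unless parts[:status] == "200"  # e.g. ignore 404s and other non-200 hits
  hits[parts[:url]] += 1
end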
After you have gathered all your hits, then it's time to write out the results. Some helpful tips:
Hash#keys can give you all the keys of the hash (the paths) at once. You probably want to write out all the paths padded to the same width, so you need to figure out which is the longest. Perhaps you want to map the paths to their lengths and then get the max element, or perhaps you want to use max_by to find the longest path and then take its length.
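For example, using the hits hash from earlier:
width = hits.keys.map(&:length).max
# or, equivalently:
width = hits.keys.max_by(&:length).length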
Although geeky, using sprintf or String#% is a great way to lay out formatted reports. For example:
puts "%-15s %s" % ["Hello","####"]
#=> "Hello ####"
Just like you needed to find the longest name for good formatting, you might want to find the URL with the most hits, so that you can scale your longest run of hashes to that value. Hash#values will give you an array of all the values. Alternatively, perhaps you have a requirement that one # must always represent 100 hits, or something.
Note that String#* lets you create a string by repetition:
p '#'*10
#=> "##########"
If you have specific questions with your code, ask more questions!
Since this is homework, I won't give you the exact answer, but Simone Carletti has implemented a Ruby class to parse Apache log files. You might start there and look at how he does things.

Testing time critical code

I've written a feature for my library Rubikon that displays a throbber (a spinning indicator, as you may have seen in other console apps) as long as some other code is running.
To test this feature I capture the output of the throbber in a StringIO and compare it with the expected value. Because the throbber is only displayed while the other code is running, the content of the IO gets longer the longer that code runs. In my tests I do a simple sleep 1, so I should have a constant 1-second delay. This works most of the time, but sometimes (apparently due to external factors like heavy load on the CPU) it fails, because the code doesn't run for 1 second but for a bit more, so the throbber prints a few additional characters.
My question is: Is there any possibility to test such time critical features in Ruby?
From your github repository, I found this test for the Throbber class:
should 'work correctly' do
  ostream = StringIO.new
  thread = Thread.new { sleep 1 }
  throbber = Throbber.new(ostream, thread)
  thread.join
  throbber.join
  assert_equal " \b-\b\\\b|\b/\b", ostream.string
end
I'll assume that a throbber iterates over ['-', '\', '|', '/'], backspacing before each write, once per second. Consider the following test:
should 'work correctly' do
  ostream = StringIO.new
  started_at = Time.now
  ended_at = nil
  thread = Thread.new { sleep 1; ended_at = Time.now }
  throbber = Throbber.new(ostream, thread)
  thread.join
  throbber.join
  duration = ended_at - started_at
  iterated_chars = " -\\|/"
  expected = ""
  if duration >= 1
    # After n seconds we should have n copies of " -\\|/", excluding \b for now
    expected << iterated_chars * duration.to_i
  end
  # Next append the characters we'd get from working for fractions of a second:
  remainder = duration - duration.to_i
  expected << iterated_chars[0..((iterated_chars.length * remainder).to_i)] if remainder > 0.0
  expected = expected.split('').join("\b") + "\b"
  assert_equal expected, ostream.string
end
The last assignment of expected is a bit unpleasant, but I made the assumption that the throbber would write character/backspace pairs atomically. If this is not true, you should be able to insert the \b escape sequence into the iterated_chars string and remove the last assignment entirely.
This question is similar (I think, although I'm not completely sure) to this one:
Only a real time operating system can give you such precision. You can assume Thread.Sleep has a precision of about 20 ms, so you could, in theory, sleep until the desired time minus the actual time is about 20 ms and THEN spin for 20 ms, but you'll have to waste those 20 ms. And even that doesn't guarantee that you'll get real time results; the scheduler might just take your thread out just when it was about to execute the RELEVANT part (just after spinning).
The problem is not Ruby (probably - I'm no expert in Ruby); the problem is the real-time capabilities of your operating system.
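If you want to experiment with the sleep-then-spin idea from that quote, here is a rough Ruby sketch (sleep_until is a made-up helper, and the 20 ms margin is an assumption about typical scheduler granularity, not a guarantee):
def sleep_until(target, margin = 0.02)
  remaining = target - Time.now
  sleep(remaining - margin) if remaining > margin  # coarse sleep, wake up early
  while Time.now < target; end                     # busy-wait the final stretch
end

deadline = Time.now + 1
sleep_until(deadline)
puts "Overshot the deadline by #{Time.now - deadline} seconds"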

Resources