Stream-based parsing and writing of JSON - Ruby

I fetch about 20,000 datasets from a server in batches of 1,000. Each dataset is a JSON object. Persisted, this comes to around 350 MB of uncompressed plaintext.
I have a memory limit of 1 GB. Hence, I write each batch of 1,000 JSON objects as an array into a raw JSON file in append mode.
The result is a file with 20 JSON arrays that need to be aggregated. I need to touch them anyway, because I want to add metadata. Generally, the Ruby Yajl parser makes this possible, like so:
raw_file = File.new(path_to_raw_file, 'r')
json_file = File.new(path_to_json_file, 'w')

datasets = []
parser = Yajl::Parser.new
parser.on_parse_complete = Proc.new { |o| datasets += o }
parser.parse(raw_file) # parse each appended array; the proc collects the objects

hash = { date: Time.now, datasets: datasets }
Yajl::Encoder.encode(hash, json_file)
Where is the problem with this solution? The problem is that the whole JSON is still parsed into memory, which is exactly what I must avoid.
Basically, what I need is a solution which parses the JSON from an IO object and encodes it to another IO object at the same time.
I assumed Yajl offers this, but I haven't found a way, nor did its API give any hints, so I guess not. Is there a JSON Parser library which supports this? Are there other solutions?
The only solution I can think of is to use IO.seek. Write all the dataset arrays one after another ([...][...][...]), and after every array, seek back and overwrite the ][ boundary with ,, effectively connecting the arrays manually.
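For what it's worth, a rough sketch of that seek idea (untested; append_batch is a name introduced here, and it assumes non-empty batches):

def append_batch(path, batch)
  json = Yajl::Encoder.encode(batch)        # e.g. "[{...},{...}]"
  if File.exist?(path) && File.size(path) > 0
    File.open(path, 'r+') do |f|
      f.seek(-1, IO::SEEK_END)              # sit on the previous closing "]"
      f.write(',' + json[1..-1])            # "," replaces "]", then the new items and "]"
    end
  else
    File.write(path, json)                  # first batch: write the array as-is
  end
end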

Why can't you retrieve a single record at a time from the database, process it as necessary, convert it to JSON, then emit it with a trailing/delimiting comma?
If you started with a file that only contained [, then appended all your JSON strings, then, on the final entry didn't append a comma, and instead used a closing ], you'd have a JSON array of hashes, and would only have to process one row's worth at a time.
It'd be a tiny bit slower (maybe) but wouldn't impact your system. And DB I/O can be very fast if you use blocking/paging to retrieve a reasonable number of records at a time.
For instance, here's a combination of some Sequel example code, and code to extract the rows as JSON and build a larger JSON structure:
require 'json'
require 'sequel'
DB = Sequel.sqlite # memory database
DB.create_table :items do
  primary_key :id
  String :name
  Float :price
end
items = DB[:items] # Create a dataset
# Populate the table
items.insert(:name => 'abc', :price => rand * 100)
items.insert(:name => 'def', :price => rand * 100)
items.insert(:name => 'ghi', :price => rand * 100)
add_comma = false
puts '['
items.order(:price).each do |item|
  puts ',' if add_comma
  add_comma ||= true
  print JSON[item]
end
puts "\n]"
Which outputs:
[
{"id":2,"name":"def","price":3.714714089426208},
{"id":3,"name":"ghi","price":27.0179624376119},
{"id":1,"name":"abc","price":52.51248221170203}
]
Notice the order is now by "price".
Validation is easy:
require 'json'
require 'pp'
pp JSON[<<EOT]
[
{"id":2,"name":"def","price":3.714714089426208},
{"id":3,"name":"ghi","price":27.0179624376119},
{"id":1,"name":"abc","price":52.51248221170203}
]
EOT
Which results in:
[{"id"=>2, "name"=>"def", "price"=>3.714714089426208},
{"id"=>3, "name"=>"ghi", "price"=>27.0179624376119},
{"id"=>1, "name"=>"abc", "price"=>52.51248221170203}]
This validates the JSON and demonstrates that the original data is recoverable. Each row retrieved from the database should be a minimal, bite-sized piece of the overall JSON structure you want to build.
Building upon that, here's how to read JSON stored in the database, manipulate it, then emit it as a JSON file:
require 'json'
require 'sequel'
DB = Sequel.sqlite # memory database
DB.create_table :items do
  primary_key :id
  String :json
end
items = DB[:items] # Create a dataset
# Populate the table
items.insert(:json => JSON[:name => 'abc', :price => rand * 100])
items.insert(:json => JSON[:name => 'def', :price => rand * 100])
items.insert(:json => JSON[:name => 'ghi', :price => rand * 100])
items.insert(:json => JSON[:name => 'jkl', :price => rand * 100])
items.insert(:json => JSON[:name => 'mno', :price => rand * 100])
items.insert(:json => JSON[:name => 'pqr', :price => rand * 100])
items.insert(:json => JSON[:name => 'stu', :price => rand * 100])
items.insert(:json => JSON[:name => 'vwx', :price => rand * 100])
items.insert(:json => JSON[:name => 'yz_', :price => rand * 100])
add_comma = false
puts '['
items.each do |item|
  puts ',' if add_comma
  add_comma ||= true
  print JSON[
    JSON[
      item[:json]
    ].merge('foo' => 'bar', 'time' => Time.now.to_f)
  ]
end
puts "\n]"
Which generates:
[
{"name":"abc","price":3.268814929005337,"foo":"bar","time":1379688093.124606},
{"name":"def","price":13.871147312377719,"foo":"bar","time":1379688093.124664},
{"name":"ghi","price":52.720984131655676,"foo":"bar","time":1379688093.124702},
{"name":"jkl","price":53.21477190840114,"foo":"bar","time":1379688093.124732},
{"name":"mno","price":40.99364022416619,"foo":"bar","time":1379688093.124758},
{"name":"pqr","price":5.918738444452265,"foo":"bar","time":1379688093.124803},
{"name":"stu","price":45.09391752439902,"foo":"bar","time":1379688093.124831},
{"name":"vwx","price":63.08947792357426,"foo":"bar","time":1379688093.124862},
{"name":"yz_","price":94.04921035056373,"foo":"bar","time":1379688093.124894}
]
I added the timestamp so you can see that each row is processed individually, AND to give you an idea how fast the rows are being processed. Granted, this is a tiny, in-memory database with no network I/O to contend with, but a normal network connection through a switch to a database on a reasonable DB host should be pretty fast too. Telling the ORM to read the DB in chunks can speed up the processing, because the DBM will be able to return larger blocks and fill the packets more efficiently. You'll have to experiment to determine what chunk size you need, because it will vary based on your network, your hosts, and the size of your records.
Your original design isn't good when dealing with enterprise-sized databases, especially when your hardware resources are limited. Over the years we've learned how to parse BIG databases, which make 20,000-row tables appear minuscule. VM slices are common these days and we use them for crunching, so they're often the PCs of yesteryear: single CPU with small memory footprints and dinky drives. We can't beat them up or they'll become bottlenecks, so we have to break the data into the smallest atomic pieces we can.
Harping about DB design: Storing JSON in a database is a questionable practice. DBMs these days can spew JSON, YAML and XML representations of rows, but forcing the DBM to search inside stored JSON, YAML or XML strings is a major hit in processing speed, so avoid it at all costs unless you also have the equivalent lookup data indexed in separate fields so your searches are at the highest possible speed. If the data is available in separate fields, then doing good ol' database queries, tweaking in the DBM or your scripting language of choice, and emitting the massaged data becomes a lot easier.

It is possible via the JSON::Stream or Yajl::FFI gems. You will have to write your own callbacks, though. Some hints on how to do that can be found here and here.
Facing a similar problem, I created the json-streamer gem, which spares you the need to write your own callbacks. It yields each object one by one, removing it from memory afterwards. You could then pass these to another IO object as intended.
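For example, going from the gem's README (so double-check the exact option names against its docs):

require 'json/streamer'

file = File.open(path_to_raw_file, 'r')
streamer = Json::Streamer.parser(file_io: file, chunk_size: 1024)

# yields each object found at nesting level 1, i.e. the objects inside each top-level array
streamer.get(nesting_level: 1) do |object|
  # hand the object to your writer here
end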

There is a library called oj that does exactly that. It can do parsing and generation. For example, for parsing you can use Oj::Doc:
Oj::Doc.open('[3,[2,1]]') do |doc|
  result = {}
  doc.each_leaf() do |d|
    result[d.where?] = d.fetch()
  end
  result
end #=> {"/1" => 3, "/2/1" => 2, "/2/2" => 1}
You can even backtrack in the file using doc.move(path). It seems very flexible.
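For instance, a small illustration of moving around the same document (a sketch based on the documented move, where? and fetch methods):

Oj::Doc.open('[3,[2,1]]') do |doc|
  doc.move('/2')   # jump to the nested array
  doc.where?       #=> "/2"
  doc.fetch        #=> [2, 1] (value at the current location)
end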
For writing documents, you can use Oj::StreamWriter:
require 'oj'
doc = Oj::StreamWriter.new($stdout)
def write_item(doc, item)
  doc.push_object
  doc.push_key "type"
  doc.push_value "item"
  doc.push_key "value"
  doc.push_value item
  doc.pop
end

def write_array(doc, array)
  doc.push_object
  doc.push_key "type"
  doc.push_value "array"
  doc.push_key "value"
  doc.push_array
  array.each do |item|
    write_item(doc, item)
  end
  doc.pop
  doc.pop
end
write_array(doc, [{a: 1}, {a: 2}]) #=> {"type":"array","value":[{"type":"item","value":{":a":1}},{"type":"item","value":{":a":2}}]}
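As a sketch for the original question (untested, and assuming Oj's documented push_json, which appends a raw JSON string without re-parsing it), you could combine a streaming parser with Oj::StreamWriter so that only one 1,000-object batch is in memory at a time:

require 'oj'
require 'yajl'

File.open(path_to_json_file, 'w') do |out|
  writer = Oj::StreamWriter.new(out)
  writer.push_object
  writer.push_value(Time.now.to_s, 'date')
  writer.push_array('datasets')

  # Yajl calls on_parse_complete once per top-level JSON document,
  # i.e. once per appended batch array in the raw file
  parser = Yajl::Parser.new
  parser.on_parse_complete = proc do |batch|
    batch.each { |dataset| writer.push_json(Oj.dump(dataset)) }
  end
  File.open(path_to_raw_file) { |raw| parser.parse(raw) }

  writer.pop_all
end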

Related

How to parse a Hash of Hashes from a CSV file

I have a CSV file that I need to read and extract all rows which have a "created_at" within a certain range. The CSV itself is about 5000 lines in Excel.
This is how I am pulling the info from the file:
CSV.foreach("sample_data.csv", :headers => true, :header_converters => :symbol, :converters => :all) do |row|
data[row.fields[0]] = Hash[row.headers[1..-1].zip(row.fields[1..-1])]
end
Here's the last Hash created after using CSV.foreach:
2760=>{:created_at=>1483189568, :readable_date=>"12/31/2016", :first_name=>"Louise", :last_name=>"Garza", :email=>"lgarza24n@drupal.org", :gender=>"Female", :company=>"Cogilith", :currency=>"EUR", :word=>"orchestration", :drug_brand=>"EPIVIR", :drug_name=>"lamivudine", :drug_company=>"State of Florida DOH Central Pharmacy", :pill_color=>"Maroon", :frequency=>"Yearly", :token=>"_", :keywords=>"in faucibus", :bitcoin_address=>"19jTjXLPQUL1nEmHrpqeqM1FdtDFZmUZ2E"}}
When I run data[2759].first I get:
created_at
1309380645
I need to pull every hash where created_at is between range = 1403321503..1406082945. I tried about twenty different methods using each and collect on the data hash with no success. My last attempt printed out an empty {} for each original hash.
I'm trying to test this with no success:
data.each do |hash|
  if hash.first.to_s.to_i > 1403321503 && hash.first.to_s.to_i < 1406082945
    puts hash
  end
end
I'm not sure how to isolate the value of key:created_at and then see if it is within the range. I also tried doing hash.first.to_s.to_i =/== range.
I am able to get just the :created_at value by using data[1].first.last but when I try to use that in a method it errors out.
Here is a link to the original CSV: goo.gl/NOjAPo
It is not on my work computer so I can't do a pastebin of it.
I would only store rows in the data hash that are within the range. IMO that performs better, because it needs less memory than reading everything into data and removing the unwanted entries in a second step.
DATE_RANGE = (1403321503..1406082945)

CSV.foreach("sample_data.csv",
            :headers => true,
            :header_converters => :symbol,
            :converters => :all) do |row|
  attrs = Hash[row.headers[1..-1].zip(row.fields[1..-1])]
  data[row.fields[0]] = attrs if DATE_RANGE.cover?(attrs[:created_at])
end
It might make sense to check the condition before actually building the attrs hash, by running DATE_RANGE.cover? against the raw field value (is created_at in row.fields[1]?), as sketched below.
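A hypothetical version of that (assuming created_at really does arrive in row.fields[1]):

CSV.foreach("sample_data.csv",
            :headers => true,
            :header_converters => :symbol,
            :converters => :all) do |row|
  next unless DATE_RANGE.cover?(row.fields[1])
  data[row.fields[0]] = Hash[row.headers[1..-1].zip(row.fields[1..-1])]
end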
Use Enumerable#select
hash.select do |_, v|
  (1403321503..1406082945) === v[:created_at]
end
Here we use Range#===, also known as case equality or triple-equals, to check whether the value is inside the range.
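For example, with a trimmed-down version of the asker's data hash:

data = {
  2759 => { :created_at => 1309380645 },
  2760 => { :created_at => 1404000000 }
}
data.select { |_, v| (1403321503..1406082945) === v[:created_at] }
#=> {2760=>{:created_at=>1404000000}}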

build a hash from iterating over a hash with nested arrays

I'd like to structure data I get back from an Instagram API call:
{"attribution"=>nil,
"tags"=>["loudmouth"],
"location"=>{"latitude"=>40.7181015, "name"=>"Fontanas Bar", "longitude"=>-73.9922791, "id"=>31443955},
"comments"=>{"count"=>0, "data"=>[]},
"filter"=>"Normal",
"created_time"=>"1444181565",
"link"=>"https://instagram.com/p/8hJ-UwIDyC/",
"likes"=>{"count"=>0, "data"=>[]},
"images"=>
{"low_resolution"=>{"url"=>"https://scontent.cdninstagram.com/hphotos-xaf1/t51.2885-15/s320x320/e35/12145134_169501263391761_636095824_n.jpg", "width"=>320, "height"=>320},
"thumbnail"=>
{"url"=>"https://scontent.cdninstagram.com/hphotos-xfa1/t51.2885-15/s150x150/e35/c135.0.810.810/12093266_813307028768465_178038954_n.jpg", "width"=>150, "height"=>150},
"standard_resolution"=>
{"url"=>"https://scontent.cdninstagram.com/hphotos-xaf1/t51.2885-15/s640x640/sh0.08/e35/12145134_169501263391761_636095824_n.jpg", "width"=>640, "height"=>640}},
"users_in_photo"=>
[{"position"=>{"y"=>0.636888889, "x"=>0.398666667},
"user"=>
{"username"=>"ambersmelson",
"profile_picture"=>"http://photos-h.ak.instagram.com/hphotos-ak-xfa1/t51.2885-19/11909108_1492226137759631_1159527917_a.jpg",
"id"=>"194780705",
"full_name"=>""}}],
"caption"=>
{"created_time"=>"1444181565",
"text"=>"the INCOMPARABLE Amber Nelson closing us out! #loudmouth",
"from"=>
{"username"=>"alex3nglish",
"profile_picture"=>"http://photos-f.ak.instagram.com/hphotos-ak-xaf1/t51.2885-19/s150x150/11906214_483262888501413_294704768_a.jpg",
"id"=>"30822062",
"full_name"=>"Alex English"}}
I'd like to structure it in this way:
hash = {
  "item1" => {
    :location => {"latitude"=>40.7181015, "name"=>"Fontanas Bar", "longitude"=>-73.9922791, "id"=>31443955},
    :created_time => "1444181565",
    :images => "https://scontent.cdninstagram.com/hphotos-xaf1/t51.2885-15/s320x320/e35/12145134_169501263391761_636095824_n.jpg",
    :user => "Alex English"
  }
}
I'm iterating over 20 objects, each with their location, images, etc. How can I get a hash structure like the one above?
This is what I've tried:
array_images = Array.new

# iterate through the response object to extract what is needed
response.each do |item|
  array_images << { :image => item.images.low_resolution.url,
                    :location => item.location,
                    :created_time => Time.at(item.created_time.to_i),
                    :user => item.user.full_name }
end
This works fine. So what is the better way, the fastest one?
The hash that you gave is one item in the array stored at the key "data" in a larger hash, right? At least that's how it is for the tags/ endpoint, so I'll assume it's the same here. (I'm referring to that array of hashes as data.)
hash = {}

data.each_with_index do |h, idx|
  hash["item#{idx + 1}"] = {
    location: h["location"], # grabs the entire hash at "location", since you want all of that data
    created_time: h["created_time"],
    image: h["images"]["low_resolution"]["url"], # replace with whichever resolution you prefer
    caption: h["caption"]["from"]["full_name"]
  }
end
I feel like you want a simpler solution, but I'm not sure how that's going to happen, as you want things nested at different levels and you are pulling them from diverse levels of nesting.

I want to do an autocomplete with Ruby and Sequel

I am using Sequel with Postgres and Sinatra. I want to do an autocomplete search. I've verified that my jQuery, which sends a GET request, works fine.
The Ruby code is:
get '/search' do
  search = params[:search]
  DB[:candidates].select(:last).where('last LIKE ?', '_a_').each do |row|
    l = row[:last]
  end
end
The problem is the Sequel query. I have tried every possible configuration of the query that I can think of, with no luck.
So, for example, in the above query I get all the people who have an "a" in their last name, but when I change the query to:
DB[:candidates].select(:last).where('last LIKE ?', 'search')
or
DB[:candidates].select(:last).where('last LIKE ?', search) # (without '')
I get nothing.
I have done warn params.inspect which indicates the param search is being passed, so I am stuck.
Any ideas how the query should be written?
Finally, the second part of the question: the results (when it works, with '_a_') are rendered as {:last=>"Yao"}, but I would like just Yao. How can I do that?
I have tried numerous different types of query including raw SQL but no luck. Or is the approach just plain wrong?
I just installed Sequel and made a working example:
require "rubygems"
require "sequel"
# connect to an in-memory database
DB = Sequel.sqlite
# create an items table
DB.create_table :items do
  primary_key :id
  String :name
  Float :price
end
# create a dataset from the items table
items = DB[:items]
# populate the table
items.insert(:name => 'abc', :price => rand * 100)
items.insert(:name => 'def', :price => rand * 100)
items.insert(:name => 'ghi', :price => rand * 100)
items.insert(:name => 'gui', :price => rand * 100)
# print out the number of records
puts "Item count: #{items.count}"
# print out the average price
puts "The average price is: #{items.avg(:price)}"
recs = items.select(:name).where(Sequel.like(:name, 'g%'))
recs.each do |rec|
  puts rec.values
end
I think you will get the point.
UPDATED
So in your case you should try this:
DB[:candidates]
  .select(:last)
  .where(Sequel.like(:last, "#{search}%"))
  .map { |rec| rec.values }.flatten
It should return array of found strings.
Copy/pasting from the Sequel documentation:
You can search SQL strings in a case sensitive manner using the Sequel.like method:
items.where(Sequel.like(:name, 'Acme%')).sql
#=> "SELECT * FROM items WHERE (name LIKE 'Acme%')"
You can search SQL strings in a case insensitive manner using the Sequel.ilike method:
items.where(Sequel.ilike(:name, 'Acme%')).sql
#=> "SELECT * FROM items WHERE (name ILIKE 'Acme%')"
You can specify a Regexp as a like argument, but this will probably only work on PostgreSQL and MySQL:
items.where(Sequel.like(:name, /Acme.*/)).sql
#=> "SELECT * FROM items WHERE (name ~ 'Acme.*')"
Like can also take more than one argument:
items.where(Sequel.like(:name, 'Acme%', /Beta.*/)).sql
#=> "SELECT * FROM items WHERE ((name LIKE 'Acme%') OR (name ~ 'Beta.*'))"
Open up a Sequel console (not your Sinatra app) and play with the query until you get results back. Since you say you want only the last column your query should be something like:
# Search anywhere inside the last name
DB[:candidates].where( Sequel.ilike(:last, "%#{search}%") ).select_map(:last)
# Find last names starting with the search string
DB[:candidates].where( Sequel.ilike(:last, "#{search}%") ).select_map(:last)
Uglier alternatives:
DB[:candidates]
  .select(:last)
  .where( Sequel.ilike(:last, "%#{search}%") )
  .all
  .map{ |hash| hash[:last] }

DB[:candidates]
  .select(:last)
  .where( Sequel.ilike(:last, "%#{search}%") )
  .map( :last )
If you want to rank the search results by the best matches, you might be interested in my free LiqrrdMetal library. Instead of searching on the DB, you would pull a full list of all last names into Ruby and use LiqrrdMetal to search through them. This would allow a search string of "pho" to match both "Phong" as well as "Phrogz", with the former scoring higher in the rankings.

Speed up CSV import

I want to import a big amount of CSV data (not directly to AR, but after some fetches), and my code is very slow.
def csv_import
  require 'csv'
  file = File.open("/#{Rails.public_path}/uploads/shate.csv")
  csv = CSV.open(file, "r:ISO-8859-15:UTF-8", {:col_sep => ";", :row_sep => :auto, :headers => :first_row})

  csv.each do |row|
    # ename, esupp = row[1].split(/_/)
    # (ename, esupp, foo) = row[1].split('_')
    abrakadabra = row[0].to_s()
    (ename, esupp) = abrakadabra.split(/_/)
    eprice = row[6]
    eqnt = row[1]
    # logger.info("1) ")
    # logger.info(ename)
    # logger.info("---")
    # logger.info(esupp)
    #----
    # ename = row[4]
    # eprice = row[7]
    # eqnt = row[10]
    # esupp = row[12]
    if ename.present? && ename.size > 3
      search_condition = "*" + ename.upcase + "*"
      if esupp.present?
        # supplier = @suppliers.find { |item| item['SUP_BRAND'] =~ Regexp.new(".*#{esupp}.*") }
        supplier = Supplier.where("SUP_BRAND like ?", "%#{esupp}%").first
        logger.warn("!!! *** supp !!!")
        # logger.warn(supplier)
      end
      if supplier.present?
        @search = ArtLookup.find(:all, :conditions => ['MATCH (ARL_SEARCH_NUMBER) AGAINST(? IN BOOLEAN MODE)', search_condition.gsub(/[^0-9A-Za-z]/, '')])
        @articles = Article.find(:all, :conditions => { :ART_ID => @search.map(&:ARL_ART_ID) })
        @art_concret = @articles.find_all { |item| item.ART_ARTICLE_NR.gsub(/[^0-9A-Za-z]/, '').include?(ename.gsub(/[^0-9A-Za-z]/, '')) }
        @aa = @art_concret.find { |item| item['ART_SUP_ID'] == supplier.SUP_ID }
        if @aa.present?
          @art = Article.find_by_ART_ID(@aa)
        end
        if @art.present?
          @art.PRICEM = eprice
          @art.QUANTITYM = eqnt
          @art.datetime_of_update = DateTime.now
          @art.save
        end
      end
      logger.warn("------------------------------")
    end
    # logger.warn(esupp)
  end
end
Even if I delete everything and keep only this, it is slow.
def csv_import
  require 'csv'
  file = File.open("/#{Rails.public_path}/uploads/shate.csv")
  csv = CSV.open(file, "r:ISO-8859-15:UTF-8", {:col_sep => ";", :row_sep => :auto, :headers => :first_row})

  csv.each do |row|
  end
end
Could anybody help me increase the speed using fastercsv?
I don't think it will get much faster.
That said, some testing shows that a significant part of the time is spent on transcoding (about 15% for my test case). So if you could skip that (e.g. by creating the CSV in UTF-8 already), you would see some improvement.
Besides, according to ruby-doc.org the "primary" interface for reading CSVs is foreach, so this should be preferred:
def csv_import
  require 'csv'
  CSV.foreach("/#{Rails.public_path}/uploads/shate.csv", {:encoding => 'ISO-8859-15:UTF-8', :col_sep => ';', :row_sep => :auto, :headers => :first_row}) do |row|
    # use row here...
  end
end
Update
You could also try splitting the parsing into several threads. I saw some performance increase experimenting with this code (handling of the header row left out):
N = 10000

def csv_import
  all_lines = File.read("/#{Rails.public_path}/uploads/shate.csv").lines

  # parts will contain the parsed CSV data of the different chunks/slices
  # threads will contain the threads
  parts, threads = [], []

  # iterate over chunks/slices of N lines of the CSV file
  all_lines.each_slice(N) do |plines|
    # add an array object for the current chunk to parts
    parts << result = []

    # create a thread for parsing the current chunk, handing it the chunk
    # and the current parts sub-array
    threads << Thread.new(plines.join, result) do |tsrc, tresult|
      # parse the chunk
      parsed = CSV.parse(tsrc, {:encoding => 'ISO-8859-15:UTF-8', :col_sep => ";", :row_sep => :auto})
      # add the parsed data to the parts sub-array
      tresult.replace(parsed.to_a)
    end
  end

  # wait for all threads to finish
  threads.each(&:join)

  # merge all the parts sub-arrays into one big array and iterate over it
  parts.flatten(1).each do |row|
    # use row (Array)
  end
end
This splits the input into chunks of 10,000 lines and creates a parsing thread for each chunk. Each thread gets handed a sub-array in the array parts for storing its result. When all threads are finished (after threads.each(&:join)), the results of all chunks in parts are joined, and that's it.
As its name implies, FasterCSV is, well, faster :)
http://fastercsv.rubyforge.org
Also see the following for some more info:
Ruby on Rails Moving from CSV to FasterCSV
I'm curious how big the file is, and how many columns it has.
Using CSV.foreach is the preferred way. It would be interesting to see the memory profile as your app is running. (Sometimes the slowness is due to printing, so make sure you don't do more of that than you need.)
You might be able to preprocess it, and exclude any row that doesn't have the esupp, as it looks like your code only cares about those rows. Also, you could truncate any right-side columns you don't care about.
Another technique would be to gather up the unique components and put them in a hash, since it seems like you are firing the same query multiple times.
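For instance, a sketch of caching the supplier lookups (supplier_cache is a name introduced here, not from your code):

# build each supplier query result at most once per distinct esupp
supplier_cache = Hash.new do |cache, esupp|
  cache[esupp] = Supplier.where("SUP_BRAND like ?", "%#{esupp}%").first
end

csv.each do |row|
  ename, esupp = row[0].to_s.split(/_/)
  next unless ename.present? && ename.size > 3
  supplier = supplier_cache[esupp] if esupp.present?
  # ... rest of the per-row work ...
end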
You just need to profile it and see where it's spending its time.
Check out the gem smarter_csv! It can read CSV files in chunks, and you can then create Resque jobs to further process and insert those chunks into a database.
https://github.com/tilo/smarter_csv
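A minimal sketch along the lines of its README (option names from memory, so double-check them; CsvChunkImporter is a hypothetical job class):

require 'smarter_csv'

SmarterCSV.process("/#{Rails.public_path}/uploads/shate.csv",
                   :chunk_size => 100,
                   :col_sep => ';') do |chunk|
  # chunk is an array of up to 100 row hashes; hand it to a background job
  Resque.enqueue(CsvChunkImporter, chunk)
end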

ORM for SQL Scripting

What is the best way to run simple SQL scripts in a database (preferably DBM implementation agnostically)?
So, for illustration purposes, using your best/suggested way, I'd like to see a script that creates a few tables with names from an array ['cars_table', 'ice_cream_t'], deletes all elements with id=5 in a table, and does a join between two tables and prints the result formatted in some nice way.
I've heard of Python and PL/SQL to do this
Ruby/Datamapper seems very attractive
Java + JDBC, maybe
Others?
Some of these are mostly used in a full application or within a framework. I'd like to see them used simply in scripts.
Ruby/Sequel is currently my weapon of choice.
Short example from the site:
require "rubygems"
require "sequel"
# connect to an in-memory database
DB = Sequel.sqlite
# create an items table
DB.create_table :items do
  primary_key :id
  String :name
  Float :price
end
# create a dataset from the items table
items = DB[:items]
# populate the table
items.insert(:name => 'abc', :price => rand * 100)
items.insert(:name => 'def', :price => rand * 100)
items.insert(:name => 'ghi', :price => rand * 100)
# print out the number of records
puts "Item count: #{items.count}"
# print out the average price
puts "The average price is: #{items.avg(:price)}"
By using SQL DDL (Data Definition Language), which can be done DB-agnostically if you're careful.
There are examples at the Wikipedia article:
http://en.wikipedia.org/wiki/Data_Definition_Language
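For comparison, the question's illustration done with Sequel might look roughly like this (a sketch; everything beyond the two given table names is made up here):

require 'sequel'
DB = Sequel.sqlite

# create a few tables with names from an array
['cars_table', 'ice_cream_t'].each do |name|
  DB.create_table?(name.to_sym) do
    primary_key :id
    String :label
  end
end

# delete all elements with id=5 in a table
DB[:cars_table].where(:id => 5).delete

# join two tables and print the result
DB[:cars_table].join(:ice_cream_t, :id => :id).each do |row|
  puts row.inspect
end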
