My Sinatra app has to parse a ~60MB XML file. This file hardly ever changes: on a nightly cron job, it is overwritten with another one.
Are there tricks or ways to keep the parsed file in memory, as a variable, so that I can read from it on incoming requests, but not have to parse it over and over for each incoming request?
Some pseudocode to illustrate my problem:
get '/projects/:id' do
  return @nokogiri_object.search("//projects/project[@id=#{params[:id]}]/name/text()")
end
post '/projects/update' do
  if params[:token] == "s3cr3t"
    @nokogiri_object = reparse_the_xml_file
  end
end
What I need to know is how to create such a @nokogiri_object so that it persists while Sinatra runs. Is that possible at all? Or do I need some storage for that?
You could try:
configure do
  @@nokogiri_object = parse_xml
end
Then @@nokogiri_object will be available in your request methods. It's a class variable rather than an instance variable, but it should do what you want.
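A minimal sketch of how the route from the question could then read it (the XPath is taken from the question's pseudocode):

get '/projects/:id' do
  @@nokogiri_object.search("//projects/project[@id=#{params[:id]}]/name/text()").to_s
end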
The proposed solution gives a warning:
warning: class variable access from toplevel
You can use a class method to access the class variable, and the warning will disappear:
require 'sinatra'

class Cache
  @@count = 0

  def self.init()
    @@count = 0
  end

  def self.increment()
    @@count = @@count + 1
  end

  def self.count()
    return @@count
  end
end
configure do
  Cache::init()
end
get '/' do
  if Cache::count() == 0
    Cache::increment()
    "First time"
  else
    Cache::increment()
    "Another time #{Cache::count()}"
  end
end
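Applied to the original question, the same class-method pattern might look like the sketch below. The XmlCache name and file path are assumptions, and reload! stands in for the question's reparse_the_xml_file:

require 'sinatra'
require 'nokogiri'

class XmlCache
  @@doc = nil

  # Parse (or re-parse) the XML file and keep the document in a class variable.
  def self.reload!(path = '/path/to/projects.xml') # assumed path
    @@doc = Nokogiri::XML(File.read(path))
  end

  def self.doc
    @@doc
  end
end

configure do
  XmlCache.reload!
end

get '/projects/:id' do
  XmlCache.doc.search("//projects/project[@id=#{params[:id]}]/name/text()").to_s
end

post '/projects/update' do
  XmlCache.reload! if params[:token] == "s3cr3t"
end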
Two options:
Save the parsed file to a new file and always read that one.
You can serialize to a file a hash with two keys: 'last-modified' and 'data'.
The 'last-modified' value is a date, and on every request you check whether it is today. If it is not, a new file is downloaded, parsed and stored with today's date.
The 'data' value is the parsed file.
That way you only parse once, a sort of cache (see the sketch below).
Save the parsed data to a NoSQL database, for example Redis.
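A minimal sketch of the first option (the cache file name and the parse_xml helper are assumptions; note that a Nokogiri document itself cannot be marshalled, so 'data' would have to be plain Ruby objects extracted from the XML):

require 'date'

CACHE_PATH = 'parsed_feed.cache' # assumed location

def cached_data
  if File.exist?(CACHE_PATH)
    cache = Marshal.load(File.binread(CACHE_PATH))
    return cache['data'] if cache['last-modified'] == Date.today
  end

  # Stale or missing: re-parse and store together with today's date.
  data = parse_xml # your existing parsing code
  File.binwrite(CACHE_PATH, Marshal.dump('last-modified' => Date.today, 'data' => data))
  data
end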
I have the following Ruby code:
module BigTime
  FOO1_MONEY_PIT = 500
  FOO2_MONEY_PIT = 501

  class LoseMoney
    @@SiteName = 'FOO1'
    @site_num = @@SiteName_MONEY_PIT

    def other_unimportant_stuff
      whatever
    end
  end
end
So, what I'm trying to do here is set the SiteName and then use SiteName and combine it with the string _MONEY_PIT so I can access FOO1_MONEY_PIT and store its contents (500 in this case) in @site_num. Of course, the above code doesn't work, but there must be a way I can do this?
Thanks!!
If you want to dynamically get the value of a constant, you can use Module#const_get:
module BigTime
  FOO1_MONEY_PIT = 500
  FOO2_MONEY_PIT = 501

  class LoseMoney
    @@SiteName = 'FOO1'
    @site_num = BigTime.const_get(:"#{@@SiteName}_MONEY_PIT")
  end
end
Do not, under any circumstance, use Kernel#eval for this. Kernel#eval is extremely dangerous in any context where there is even the slightest possibility that an attacker may be able to control parts of the argument.
For example, if a user can choose the name of the site, and they name their site require 'fileutils'; FileUtils.rm_rf('/'), then Ruby will happily evaluate that code, just like you told it to!
Kernel#eval is very dangerous and you should not get into the habit of just throwing an eval at a problem. It is a very specialized tool that should only be employed when there is no other option (spoiler alert: there almost always is another option), and only after a thorough security review.
Please note that dynamically constructing variable names is already a code smell by itself, regardless of whether you use eval or not. It pretty much always points to a design flaw somewhere. In general, you can almost always replace the multiple variables with a data structure. E.g. in this case something like this:
module BigTime
  MONEY_PITS = {
    'FOO1' => 500,
    'FOO2' => 501,
  }.freeze

  class LoseMoney
    @@SiteName = 'FOO1'
    @site_num = MONEY_PITS[@@SiteName]
  end
end
You can refactor this to use a Hash for your name lookups, and a getter method to retrieve the value for easy testing/validation. For example:
module BigTime
  MONEY_PITS = { FOO1: 500, FOO2: 501 }
  MONEY_PIT_SUFFIX = '_MONEY_PIT'

  class LoseMoney
    @@site = :FOO1

    def initialize
      site_name
    end

    def site_name
      @site_name ||= '%d%s' % [MONEY_PITS[@@site], MONEY_PIT_SUFFIX]
    end
  end
end
BigTime::LoseMoney.new.site_name
#=> "500_MONEY_PIT"
I'm trying to call a method but I keep getting an error. This is my code:
require 'rubygems'
require 'net/http'
require 'uri'
require 'json'

class AlchemyAPI
  # Setup the endpoints
  @@ENDPOINTS = {}
  @@ENDPOINTS['taxonomy'] = {}
  @@ENDPOINTS['taxonomy']['url'] = '/url/URLGetRankedTaxonomy'
  @@ENDPOINTS['taxonomy']['text'] = '/text/TextGetRankedTaxonomy'
  @@ENDPOINTS['taxonomy']['html'] = '/html/HTMLGetRankedTaxonomy'

  @@BASE_URL = 'http://access.alchemyapi.com/calls'

  def initialize()
    begin
      key = File.read('C:\Users\KVadher\Desktop\api_key.txt')
      key.strip!

      if key.empty?
        # The key file shouldn't be blank
        puts 'The api_key.txt file appears to be blank, please copy/paste your API key in the file: api_key.txt'
        puts 'If you do not have an API Key from AlchemyAPI please register for one at: http://www.alchemyapi.com/api/register.html'
        Process.exit(1)
      end

      if key.length != 40
        # Keys should be exactly 40 characters long
        puts 'It appears that the key in api_key.txt is invalid. Please make sure the file only includes the API key, and it is the correct one.'
        Process.exit(1)
      end

      @apiKey = key
    rescue => err
      # The file doesn't exist, so show the message and create the file.
      puts 'API Key not found! Please copy/paste your API key into the file: api_key.txt'
      puts 'If you do not have an API Key from AlchemyAPI please register for one at: http://www.alchemyapi.com/api/register.html'

      # create a blank file to hold the key
      File.open("api_key.txt", "w") {}
      Process.exit(1)
    end
  end

  # Categorizes the text for a URL, text or HTML.
  # For an overview, please refer to: http://www.alchemyapi.com/products/features/text-categorization/
  # For the docs, please refer to: http://www.alchemyapi.com/api/taxonomy/
  #
  # INPUT:
  # flavor -> which version of the call, i.e. url, text or html.
  # data -> the data to analyze, either the url, text or html code.
  # options -> various parameters that can be used to adjust how the API works, see below for more info on the available options.
  #
  # Available Options:
  # showSourceText -> 0: disabled (default), 1: enabled.
  #
  # OUTPUT:
  # The response, already converted from JSON to a Ruby object.
  #
  def taxonomy(flavor, data, options = {})
    unless @@ENDPOINTS['taxonomy'].key?(flavor)
      return { 'status'=>'ERROR', 'statusInfo'=>'Taxonomy info for ' + flavor + ' not available' }
    end

    # Add the URL encoded data to the options and analyze
    options[flavor] = data
    return analyze(@@ENDPOINTS['taxonomy'][flavor], options)
    print
  end

  **taxonomy(text,"trees",1)**

end
In ** ** I have entered my call. Am I doing something incorrect? The error I receive is:
C:/Users/KVadher/Desktop/testrub:139:in `<class:AlchemyAPI>': undefined local variable or method `text' for AlchemyAPI:Class (NameError)
from C:/Users/KVadher/Desktop/testrub:6:in `<main>'
I feel as though I'm calling it as normal and that there is something wrong with the API code itself? Although I may be wrong.
Yes, as jon snow says, the function (method) call must be outside of the class. The methods are defined along with the class.
Also, options should be a Hash, not a number, since the method calls options[flavor] = data, which is going to cause you another problem.
I believe maybe you meant to put text in quotes, as that is one of your flavors.
Furthermore, because you declared a class, this is called an instance method, and you must make an instance of the class to use this:
my_instance = AlchemyAPI.new
my_taxonomy = my_instance.taxonomy("text", "trees")
That's enough to get it to work, it seems like you have a ways to go to get this all working though. Good luck!
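If you also want to pass options, they go in as a Hash; here is a short sketch using the showSourceText option documented in the method's comment block (this assumes the rest of the SDK, including the analyze method and a valid api_key.txt, is in place):

api = AlchemyAPI.new
# The third argument is the options Hash described in the comment block above.
result = api.taxonomy('text', 'trees', 'showSourceText' => 1)
puts result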
I'm setting up Backburner as a work queue, and my job items need to return JSON for the resulting data they create. I'm not sure how to structure this. As a test I've tried doing:
class PrintJob
  include Backburner::Performable

  def self.print(text)
    puts text
    return "results"
  end
end
Backburner.configure do |config|
  config.beanstalk_url = ["beanstalk://127.0.0.1"]
  # etc
end
val = PrintJob.async.print('some cool text')
puts val
and running Backburner.work inside IRB. The puts works but the return value comes back as true instead of "results".
Is there a way to get return values out of async methods? Or should I try a different approach, e.g. having one queue for jobs and another for results? If so, how can I associate the result 'job' with the original work it belongs to?
Note: I'm eventually using Sinatra and not Rails.
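One possible approach, sketched under the assumption that a shared store such as Redis is available: since the worker's return value never reaches the enqueuing process, pass a correlation id into the job and have the job write its result under that id, then look it up later (for example from a Sinatra route). The job_id parameter and key names below are illustrative, not part of Backburner:

require 'securerandom'
require 'redis'

class PrintJob
  include Backburner::Performable

  # Write the result to Redis keyed by the caller-supplied job id,
  # because the return value of an async job is discarded.
  def self.print(text, job_id)
    result = "printed: #{text}"
    Redis.new.set("print_job:#{job_id}", result)
  end
end

job_id = SecureRandom.uuid
PrintJob.async.print('some cool text', job_id)

# Later, e.g. in a Sinatra route, fetch the result by the same id:
result = Redis.new.get("print_job:#{job_id}")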
I'm trying to use the Ruby standard CSV lib to dump an array of objects to a CSV file called 'a.csv'.
http://ruby-doc.org/stdlib-1.9.3/libdoc/csv/rdoc/CSV.html#method-c-dump
dump(ary_of_objs, io = "", options = Hash.new)
But with this method, how can I dump into a file?
There are no examples of this in the docs, and Googling didn't turn up an example for me either...
Also, the docs say:
The next method you can provide is an instance method called
csv_headers(). This method is expected to return the second line of
the document (again as an Array), which is to be used to give each
column a header. By default, ::load will set an instance variable if
the field header starts with an @ character or call send() passing the
header as the method name and the field value as an argument. This
method is only called on the first object of the Array.
Does anyone know how to provide the instance method csv_headers() for this dump function?
I haven't tested this out yet, but it looks like io should be set to a file. According to the doc you linked "The io parameter can be used to serialize to a File"
Something like:
f = File.open("filename", "w")
CSV.dump(ary_of_objs, f)
The accepted answer doesn't really answer the question so I thought I'd give a useful example.
First of all if you look at the docs at http://ruby-doc.org/stdlib-1.9.3/libdoc/csv/rdoc/CSV.html, if you hover over the method name for dump you see you can click to show source. If you do that you'll see that the dump method attempts to call csv_headers on the first object you pass in from ary_of_objs:
obj_template = ary_of_objs.first
...snip...
headers = obj_template.csv_headers
Then later you see that the method will call csv_dump on each object in ary_of_objs and pass in the headers:
ary_of_objs.each do |obj|
  begin
    csv << obj.csv_dump(headers)
  rescue NoMethodError
    csv << headers.map do |var|
      if var[0] == ?@
        obj.instance_variable_get(var)
      else
        obj[var[0..-2]]
      end
    end
  end
end
So we need to augment each entry in ary_of_objs to respond to those two methods. Here's an example wrapper class that takes a Hash, returns the hash keys as the CSV headers, and dumps each row based on the headers.
class CsvRowDump
  def initialize(row_hash)
    @row = row_hash
  end

  def csv_headers
    @row.keys
  end

  def csv_dump(headers)
    headers.map { |h| @row[h] }
  end
end
There's one more catch though. This dump method wants to write an extra line at the top of the CSV file before the headers, and there's no way to skip that if you call this method due to this code at the top:
# write meta information
begin
  csv << obj_template.class.csv_meta
rescue NoMethodError
  csv << [:class, obj_template.class]
end
Even if you return '' from CsvRowDump.csv_meta, that will still be a blank line where a parse expects the headers. So instead let's let dump write that line and then remove it afterwards when we call dump. This example assumes you have an array of hashes that all have the same keys (which will be the CSV header).
@rows = @hashes.map { |h| CsvRowDump.new(h) }

File.open(@filename, "wb") do |f|
  str = CSV::dump(@rows)
  f.write(str.split(/\n/)[1..-1].join("\n"))
end
I have a 1.6gb xml file, and when I parse it with Sax Machine it does not seem to be streaming or eating the file in chunks - rather it appears to be loading the whole file into memory (or maybe there is a memory leak somewhere?) because my ruby process climbs upwards of 2.5gb of ram. I don't know where it stops growing because I ran out of memory.
On a smaller file (50mb) it also appears to be loading the whole file. My task iterates over the records in the xml file and saves each record to a database. It takes about 30 seconds of "idling" and then all of a sudden the database queries start executing.
I thought SAX was supposed to allow you to work with large files like this without loading the whole thing in memory.
Is there something I am overlooking?
Many thanks
Update to add code sample
class FeedImporter

  class FeedListing
    include ::SAXMachine

    element :id
    element :title
    element :description
    element :url

    def to_hash
      {}.tap do |hash|
        self.class.column_names.each do |key|
          hash[key] = send(key)
        end
      end
    end
  end

  class Feed
    include ::SAXMachine
    elements :listing, :as => :listings, :class => FeedListing
  end

  def perform
    open('~/feeds/large_feed.xml') do |file|
      # I think that SAXMachine is trying to load all of the listing elements into this one ruby object.
      puts 'Parsing'
      feed = Feed.parse(file)

      # We are now iterating over each of the listing elements, but they have been "parsed" from the feed already.
      puts 'Importing'
      feed.listings.each do |listing|
        Listing.import(listing.to_hash)
      end
    end
  end
end
As you can see, I don't care about the <listings> element in the feed. I just want the attributes of each <listing> element.
The output looks like this:
Parsing
... wait forever
Importing (actually, I don't ever see this on the big file (1.6gb) because too much memory is used :(
Here's a Reader that will yield each listing's XML to a block, so you can process each Listing without loading the entire document into memory:
reader = Nokogiri::XML::Reader(file)
while reader.read
  if reader.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT and reader.name == 'listing'
    listing = FeedListing.parse(reader.outer_xml)
    Listing.import(listing.to_hash)
  end
end
If listing elements could be nested, and you wanted to parse the outermost listings as single documents, you could do this:
require 'rubygems'
require 'nokogiri'

# Monkey-patch Nokogiri to make this easier
class Nokogiri::XML::Reader
  def element?
    node_type == TYPE_ELEMENT
  end

  def end_element?
    node_type == TYPE_END_ELEMENT
  end

  def opens?(name)
    element? && self.name == name
  end

  def closes?(name)
    (end_element? && self.name == name) ||
      (self_closing? && opens?(name))
  end

  def skip_until_close
    raise "node must be TYPE_ELEMENT" unless element?
    name_to_close = self.name

    if self_closing?
      # DONE!
    else
      level = 1
      while read
        level += 1 if opens?(name_to_close)
        level -= 1 if closes?(name_to_close)
        return if level == 0
      end
    end
  end

  def each_outer_xml(name, &block)
    while read
      if opens?(name)
        yield(outer_xml)
        skip_until_close
      end
    end
  end
end
Once you have it monkey-patched, it's easy to deal with each listing individually:
open('~/feeds/large_feed.xml') do |file|
  reader = Nokogiri::XML::Reader(file)
  reader.each_outer_xml('listing') do |outer_xml|
    listing = FeedListing.parse(outer_xml)
    Listing.import(listing.to_hash)
  end
end
Unfortunately there are now three different repos for sax-machine. And worse, the gemspec version was not bumped.
Despite the comment on Greg Weber's blog, I don't think this code was integrated into pauldix's or ezkl's forks. To use the lazy, fiber-based version of the code, I think you need to specifically reference gregwebs' version in your Gemfile like this:
gem 'sax-machine', :git => 'https://github.com/gregwebs/sax-machine'
I forked sax-machine so that it uses constant memory: https://github.com/gregwebs/sax-machine
Good news: there is a new maintainer who is planning on merging my changes.
The new maintainer and I have been using my fork without issue for a year now.
You are right, SAXMachine reads the whole document eagerly. Have a look at its handler source: https://github.com/pauldix/sax-machine/blob/master/lib/sax-machine/sax_handler.rb
To solve your problem, I would use http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/SAX/Parser.html directly and implement the handler yourself.
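As a rough illustration of that approach (a sketch only; the ListingHandler name is an assumption, the element names come from the question's feed, and Listing.import is the question's own method), a hand-rolled Nokogiri SAX handler can collect each <listing>'s fields and import the record as soon as the element closes, so only one record is held in memory at a time:

require 'nokogiri'

class ListingHandler < Nokogiri::XML::SAX::Document
  FIELDS = %w[id title description url]

  def start_element(name, attrs = [])
    if name == 'listing'
      # Start collecting a new record.
      @listing = {}
    elsif @listing && FIELDS.include?(name)
      @current = name
      @buffer = ''
    end
  end

  def characters(string)
    # Text nodes may arrive in several chunks, so append.
    @buffer << string if @current
  end

  def end_element(name)
    if @current == name
      @listing[name] = @buffer
      @current = nil
    elsif name == 'listing' && @listing
      Listing.import(@listing) # from the question's code
      @listing = nil
    end
  end
end

Nokogiri::XML::SAX::Parser.new(ListingHandler.new).parse_file('large_feed.xml')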