How to override or edit the last printed lines in a ruby CLI script? - ruby

I am trying to build a script that gives me feedback about its progress on the command line. At the moment it just prints a new line for every n-th progress step. The console output looks like:
10:30:00 Parsed 0 of 1'000'000 data entries (0 %)
10:30:10 Parsed 1'000 of 1'000'000 data entries (1 %)
10:30:20 Parsed 2'000 of 1'000'000 data entries (2 %)
[...] etc [...]
11:00:00 Parsed 1'000'000 of 1'000'000 data entries (100 %)
The timestamps and progress numbers are made up, but you can see the problem.
What I want is to do it wget-style, with a progress bar that is updated in place on the command line, taking the line width into account.
First I thought of curses, which I had some hands-on experience with while trying to learn C, but I never warmed to it, and it feels bloated for the purpose of manipulating just a few lines. I also don't need any coloring, and most of the other libraries I found seem to specialize in exactly that.
Can someone help me with this problem?

A while ago I created a class that maintains a status line on which you can replace part of the text in place. It might be useful to you.
Here is the class, with an example use:
class StatusText
  def initialize(parms={})
    @previous_size = 0
    @stream = parms[:stream] == nil ? $stdout : parms[:stream]
    @parms = parms
    @parms[:verbose] = true if parms[:verbose] == nil
    @header = []
    @on_change = nil
    pushHeader(@parms[:base]) if @parms[:base]
  end
  def setText(complement)
    text = "#{@header.join(" ")}#{@parms[:before]}#{complement}#{@parms[:after]}"
    printText(text)
  end
  def cleanAll
    printText("")
  end
  def cleanContent
    printText "#{@parms[:base]}"
  end
  def nextLine(text=nil)
    if @parms[:verbose]
      @previous_size = 0
      @stream.print "\n"
    end
    if text != nil
      line(text)
    end
  end
  def line(text)
    printText(text)
    nextLine
  end
  # Callback in case the status text changes.
  # Might be useful to log the status changes.
  # The callback block receives the new text.
  def onChange(&block)
    @on_change = block
  end
  def pushHeader(head)
    @header.push(head)
  end
  def popHeader
    @header.pop
  end
  def setParm(parm, value)
    @parms[parm] = value
    if parm == :base
      @header[-1] = value
    end
  end
  private
  def printText(text)
    # If not verbose, leave without printing
    if @parms[:verbose]
      if @previous_size > 0
        # go back
        @stream.print "\033[#{@previous_size}D"
        # clean
        @stream.print(" " * @previous_size)
        # go back again
        @stream.print "\033[#{@previous_size}D"
      end
      # print
      @stream.print text
      @stream.flush
      # store size (ignoring color escape sequences)
      @previous_size = text.gsub(/\e\[\d+m/,"").size
    end
    # Call the callback if one was registered
    @on_change.call(text) if @on_change
  end
end
a = StatusText.new(:before => "Evolution (", :after => ")")
(1..100).each {|i| a.setText(i.to_s); sleep(1)}
a.nextLine
Just copy and paste it into a Ruby file and try it out. It uses ANSI escape sequences to reposition the cursor.
The class has lots of features I needed at the time (like stacking elements in the status bar) that you can use to complement your solution, or you can strip it down to its core.
I hope it helps.

In the meantime I found some gems that provide a progress bar; I will list them here:
ProgressBar from paul on GitHub
a more recent version from pgericson on GitHub
ruby-progressbar from jfelchner on GitHub
simple_progressbar from bitboxer on GitHub
I tried the ones from pgericson and jfelchner; both have pros and cons, but both fit my needs. I will probably fork and extend one of them in the future.
I hope this helps others find more quickly what I spent months searching for.

Perhaps replace your output with this:
print "Progress #{progress_var}%\r"

Related

Looking for a cleaner way to scrape from website by avoiding repeating

Hi, I am doing a bit of refactoring on a small CLI web-scraping project I wrote in Ruby, and I was wondering if there is a cleaner way to write a particular section without repeating so much code.
With the code below I pull data from a website, but I have to do it per page. You will notice that both methods differ only in their name and the source URL.
def self.scrape_first_page
  html = open("https://www.texasblackpages.com/united-states/san-antonio")
  doc = Nokogiri::HTML(html)
  doc.css('div.grid_element').each do |business|
    biz = Business.new
    biz.name = business.css('a b').text
    biz.type = business.css('span.hidden-xs').text
    biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
  end
end

def self.scrape_second_page
  html = open('https://www.texasblackpages.com/united-states/san-antonio?page=2')
  doc = Nokogiri::HTML(html)
  doc.css('div.grid_element').each do |business|
    biz = Business.new
    biz.name = business.css('a b').text
    biz.type = business.css('span.hidden-xs').text
    biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
  end
end
Is there a way for me to streamline this into a single method that pulls from one source but can access different pages within the same site, or is this pretty much the only way? The owners of the website do not offer a public API for me to pull from, in case anyone is wondering.
Remember that in programming you want to steer towards code that follows the Zero, One or Infinity Rule and avoid the dreaded two. In other words, write methods that take no arguments, a fixed argument (one), or an array of unspecified size (infinity).
So the first step is to clean up the scraping function to make it as generic as possible:
def scrape(page)
  doc = Nokogiri::HTML(open(page))
  # Use map here to return an array of Business objects
  doc.css('div.grid_element').map do |business|
    # Use tap to modify this object before returning it
    Business.new.tap do |biz|
      biz.name = business.css('a b').text
      biz.type = business.css('span.hidden-xs').text
      biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
    end
  end
end
Note that apart from the extraction code there's nothing page-specific about this: it takes a URL and returns Business objects in an Array.
In order to generate pages 1..N, consider this:
def pages(base_url, start: 1)
  page = start
  Enumerator.new do |y|
    loop do
      y << base_url % page
      page += 1
    end
  end
end
Now that's an infinite series, but you can always cap it to whatever you want with take(n) or by instead looping until you get an empty list:
# Collect all businesses from each of the pages...
businesses = pages('https://www.texasblackpages.com/united-states/san-antonio?page=%d').lazy.map do |page|
  # ...by scraping the page...
  scrape(page)
end.take_while do |results|
  # ...and iterating until there are no results, i.e. Array#any? is false.
  results.any?
end.to_a.flatten
The .lazy part means "evaluate each element through the whole chain as it is needed", as opposed to the default behaviour of evaluating each stage to completion before moving on. This is important, or else it would try to download an infinite number of pages before ever reaching the take_while test.
The .to_a on the end forces the chain to run to completion. The .flatten squishes all the page-wise results into a single result set.
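A tiny illustration of why .lazy matters with an infinite enumerator (contrived numbers standing in for pages):

(1..Float::INFINITY).lazy.map { |n| n * 2 }.take_while { |x| x < 10 }.to_a
#=> [2, 4, 6, 8]

Without .lazy, map would try to exhaust the infinite range before take_while ever ran.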
Of course if you want to scrape the first N pages, it's a lot easier:
pages('https://www.texasblackpages.com/.../san-antonio?page=%d').take(n).flat_map do |page|
  scrape(page)
end
It's almost no code!
This was suggested by @Todd A. Jacobs:
def self.scrape(url)
  html = open(url)
  doc = Nokogiri::HTML(html)
  doc.css('div.grid_element').each do |business|
    biz = Business.new
    biz.name = business.css('a b').text
    biz.type = business.css('span.hidden-xs').text
    biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
  end
end
The downside is that, since there is no public API, I have to invoke the method as many times as I need it, because the URLs represent different pages within the website; but this is fine, because I was able to get rid of the repeated methods.
def make_listings
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio")
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=2")
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=3")
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=4")
end
I once had the same problem as you; I use a loop for this. Usually, if the site supports pagination, there is a good chance the first page also accepts the page query parameter.
def self.scrape
  page = 1
  loop do
    url = "https://www.texasblackpages.com/united-states/san-antonio?page=#{page}"
    # do nokogiri parse
    # do data scraping
    page += 1
  end
end
You can break out of the loop on a certain page condition.
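For example, the break could key off a page coming back empty. A sketch, assuming (like the other answers) that listings live in div.grid_element and that an empty page marks the end:

require 'open-uri'
require 'nokogiri'

def self.scrape
  page = 1
  loop do
    url = "https://www.texasblackpages.com/united-states/san-antonio?page=#{page}"
    doc = Nokogiri::HTML(URI.open(url))
    entries = doc.css('div.grid_element')
    break if entries.empty?          # no more listings: stop paging
    entries.each do |business|
      # build Business objects here, as in the other answers
    end
    page += 1
  end
end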

How to write a while loop properly

I'm trying to scrape a website, but I cannot seem to get my while loop to break once it hits a page with no more information:
def scrape_verse_items(keyword)
  pg = 1
  while pg < 1000
    puts "page #{pg}"
    url = "https://www.bible.com/search/bible?page=#{pg}&q=#{keyword}&version_id=1"
    doc = Nokogiri::HTML(open(url))
    items = doc.css("ul.search-result li.reference")
    error = doc.css('div#noresults')
    until error.any? do
      if keyword != ''
        item_hash = {}
        items.each do |item|
          title = item.css("h3").text.strip
          content = item.css("p").text.strip
          item_hash[title] = content
        end
      else
        puts "Please enter a valid search"
      end
      if error.any?
        break
      end
    end
    pg += 1
  end
  item_hash
end
puts scrape_verse_items('joy')
I know this doesn't exactly answer your question, but perhaps you might consider using a different approach altogether.
Using while and until loops can get a bit confusing, and usually isn't the most performant way of doing things.
Maybe you would consider using recursion instead.
I've written a small script that seems to work:
class MyScrapper
  def initialize;end

  def call(keyword)
    return puts("Please enter a valid search") unless keyword
    scrape({}, keyword, 1)
  end

  private

  def scrape(results, keyword, page)
    doc = load_page(keyword, page)
    return results if doc.css('div#noresults').any?
    build_new_items(doc).merge(scrape(results, keyword, page+1))
  end

  def load_page(keyword, page)
    url = "https://www.bible.com/search/bible?page=#{page}&q=#{keyword}&version_id=1"
    Nokogiri::HTML(open(url))
  end

  def build_new_items(doc)
    items = doc.css("ul.search-result li.reference")
    items.reduce({}) do |list, item|
      title = item.css("h3").text.strip
      content = item.css("p").text.strip
      list[title] = content
      list
    end
  end
end
You call it by doing MyScrapper.new.call("Keyword"). (It might make more sense to have this as a module you include, or even as class methods, to avoid the need to instantiate the class.)
What it does is call a method named scrape, passing it the starting results, the keyword, and the page. That method loads the page and, if there are no results, returns the results it has found so far.
Otherwise it builds a hash from the page it loaded, then calls itself and merges the results with the new hash it just built. It does this until there are no more results.
If you want to limit the number of pages, you can just change this line:
return results if doc.css('div#noresults').any?
to this:
return results if doc.css('div#noresults').any? || page > 999
Note: You might want to double-check the results that are being returned are correct. I think they should be but I wrote this quite quickly, so there could always be a small bug hiding somewhere in there.

Toggling true/false: editing a file in ruby

I have some code that tries to change 'false' to 'true' in a ruby file, but it only works once while the script is running.
toggleto = true
text = File.read(filename)
text.gsub!("#{!toggleto}", "#{toggleto}")
File.open(filename, 'w+') {|file| file.write(text); file.close}
As far as I know, as long as I close the file, I should be able to read it afterwards with what I previously wrote, and thus toggle it back and forth as many times as I like.
Larger Context:
def toggleAutoAction
  require "#{@require_path}/options"
  filename = "#{@require_path}/options.rb"
  writeToggle(filename, !OPTIONS[:auto])
  0
end
def writeToggle(filename, toggleto)
  text = File.read(filename)
  text.gsub!(":auto => #{!toggleto}", ":auto => #{toggleto}")
  File.open(filename, 'w+') {|file| file.write(text); file.close}
end
def exitOrMenu
  puts "Are you done? (y/n)"
  prompt
  if gets.chomp == 'n'
    whichAction
  else
    exit
  end
end
def whichAction
  if action == 5
    toggleAutoAction
  else
    puts "Sorry, that isn't an option...returning"
    return 1
  end
  exitOrMenu
end
The problem lies within this method:
def toggleAutoAction
  require "#{@require_path}/options" # here
  filename = "#{@require_path}/options.rb"
  writeToggle(filename, !OPTIONS[:auto])
  0
end
Ruby will not load options.rb a second time (i.e. with the exact same path name), hence your !OPTIONS[:auto] will only be evaluated once (otherwise you would get an already-initialized-constant warning, provided OPTIONS is defined in options.rb). See the Kernel#require docs.
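A quick illustration of the require-once behaviour, using a standard library file (assuming it has not been loaded yet):

require 'set'   #=> true, the file is loaded and evaluated
require 'set'   #=> false, already loaded, nothing is re-evaluated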
You could, of course, do crazy stuff like
eval File.read("#{@require_path}/options.rb")
but I would not recommend that (performance-wise).
As noted above, reading/writing from/to YAML files is less painful ;-)
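For example, a minimal sketch of keeping the flag in a YAML file instead of rewriting Ruby source (the options.yml file name and the 'auto' key are assumptions):

require 'yaml'

OPTIONS_FILE = 'options.yml'   # hypothetical path; adapt to your layout

def read_options
  File.exist?(OPTIONS_FILE) ? YAML.load_file(OPTIONS_FILE) : { 'auto' => false }
end

def toggle_auto_option
  options = read_options
  options['auto'] = !options['auto']
  File.write(OPTIONS_FILE, options.to_yaml)
  options['auto']
end

# Each call flips and persists the flag:
# toggle_auto_option #=> true
# toggle_auto_option #=> false

Because the file is re-read on every call, the toggle works as many times as you like within a single run of the script.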

Parsing large file with SaxMachine seems to be loading the whole file into memory

I have a 1.6 GB XML file, and when I parse it with SAX Machine it does not seem to be streaming or eating the file in chunks. Rather, it appears to be loading the whole file into memory (or maybe there is a memory leak somewhere?), because my Ruby process climbs to upwards of 2.5 GB of RAM. I don't know where it stops growing because I ran out of memory.
On a smaller file (50 MB) it also appears to load the whole file. My task iterates over the records in the XML file and saves each record to a database. It takes about 30 seconds of "idling" and then all of a sudden the database queries start executing.
I thought SAX was supposed to allow you to work with large files like this without loading the whole thing in memory.
Is there something I am overlooking?
Many thanks
Update to add code sample
class FeedImporter
  class FeedListing
    include ::SAXMachine
    element :id
    element :title
    element :description
    element :url
    def to_hash
      {}.tap do |hash|
        self.class.column_names.each do |key|
          hash[key] = send(key)
        end
      end
    end
  end
  class Feed
    include ::SAXMachine
    elements :listing, :as => :listings, :class => FeedListing
  end
  def perform
    open('~/feeds/large_feed.xml') do |file|
      # I think that SAXMachine is trying to load all of the listing elements into this one Ruby object.
      puts 'Parsing'
      feed = Feed.parse(file)
      # We are now iterating over each of the listing elements, but they have already been "parsed" out of the feed.
      puts 'Importing'
      feed.listings.each do |listing|
        Listing.import(listing.to_hash)
      end
    end
  end
end
As you can see, I don't care about the <listings> element in the feed. I just want the attributes of each <listing> element.
The output looks like this:
Parsing
... wait forever
Importing (actually, I don't ever see this on the big file (1.6gb) because too much memory is used :(
Here's a Reader that will yield each listing's XML to a block, so you can process each listing without loading the entire document into memory:
reader = Nokogiri::XML::Reader(file)
while reader.read
  if reader.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT and reader.name == 'listing'
    listing = FeedListing.parse(reader.outer_xml)
    Listing.import(listing.to_hash)
  end
end
If listing elements could be nested, and you wanted to parse the outermost listings as single documents, you could do this:
require 'rubygems'
require 'nokogiri'

# Monkey-patch Nokogiri to make this easier
class Nokogiri::XML::Reader
  def element?
    node_type == TYPE_ELEMENT
  end

  def end_element?
    node_type == TYPE_END_ELEMENT
  end

  def opens?(name)
    element? && self.name == name
  end

  def closes?(name)
    (end_element? && self.name == name) ||
      (self_closing? && opens?(name))
  end

  def skip_until_close
    raise "node must be TYPE_ELEMENT" unless element?
    name_to_close = self.name
    if self_closing?
      # DONE!
    else
      level = 1
      while read
        level += 1 if opens?(name_to_close)
        level -= 1 if closes?(name_to_close)
        return if level == 0
      end
    end
  end

  def each_outer_xml(name, &block)
    while read
      if opens?(name)
        yield(outer_xml)
        skip_until_close
      end
    end
  end
end
Once you have it monkey-patched, it's easy to deal with each listing individually:
open('~/feeds/large_feed.xml') do |file|
  reader = Nokogiri::XML::Reader(file)
  reader.each_outer_xml('listing') do |outer_xml|
    listing = FeedListing.parse(outer_xml)
    Listing.import(listing.to_hash)
  end
end
Unfortunately there are now three different repos for sax-machine. And worse, the gemspec version was not bumped.
Despite the comment on Greg Weber's blog, I don't think this code was integrated into pauldix's or ezkl's forks. To use the lazy, fiber-based version of the code, I think you need to reference gregwebs's version specifically in your Gemfile, like this:
gem 'sax-machine', :git => 'https://github.com/gregwebs/sax-machine'
I forked sax-machine so that it uses constant memory: https://github.com/gregwebs/sax-machine
Good news: there is a new maintainer that is planning on merging my changes.
The new maintainer and I have been using my fork without issue for a year now.
You are right, SAXMachine reads the whole document eagerly. Have a look at its handler source: https://github.com/pauldix/sax-machine/blob/master/lib/sax-machine/sax_handler.rb
To solve your problem, I would use Nokogiri's SAX parser (http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/SAX/Parser.html) directly and implement the handler yourself.
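For illustration, a rough sketch of such a handler. The element names follow the feed structure shown in the question, and Listing.import is the questioner's own persistence call; treat it as a starting point, not a drop-in solution.

require 'nokogiri'

class ListingHandler < Nokogiri::XML::SAX::Document
  FIELDS = %w[id title description url].freeze

  def start_element(name, _attrs = [])
    if name == 'listing'
      @current = {}
    elsif @current && FIELDS.include?(name)
      @field  = name
      @buffer = +''
    end
  end

  def characters(string)
    @buffer << string if @buffer
  end

  def end_element(name)
    if @buffer && name == @field
      @current[@field] = @buffer.strip
      @buffer = nil
    elsif name == 'listing' && @current
      Listing.import(@current)   # hand each record off as soon as it is complete
      @current = nil
    end
  end
end

# Streams the document instead of building it all in memory:
# Nokogiri::XML::SAX::Parser.new(ListingHandler.new).parse(File.open('large_feed.xml'))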

Sinatra with a persistent variable

My Sinatra app has to parse a ~60 MB XML file. This file hardly ever changes: it is overwritten with a new one by a nightly cron job.
Are there tricks or ways to keep the parsed file in memory, as a variable, so that I can read from it on incoming requests without having to parse it over and over for each request?
Some pseudocode to illustrate my problem:
get '/projects/:id' do
  return @nokogiri_object.search("//projects/project[@id=#{params[:id]}]/name/text()")
end
post '/projects/update' do
  if params[:token] == "s3cr3t"
    @nokogiri_object = reparse_the_xml_file
  end
end
What I need to know is how to create such a @nokogiri_object so that it persists while Sinatra runs. Is that possible at all, or do I need some storage for that?
You could try:
configure do
  @@nokogiri_object = parse_xml
end
Then @@nokogiri_object will be available in your request methods. It's a class variable rather than an instance variable, but it should do what you want.
The proposed solution gives a warning:
warning: class variable access from toplevel
You can use a class method to access the class variable, and the warning will disappear:
require 'sinatra'
class Cache
  @@count = 0

  def self.init()
    @@count = 0
  end

  def self.increment()
    @@count = @@count + 1
  end

  def self.count()
    return @@count
  end
end
configure do
  Cache::init()
end

get '/' do
  if Cache::count() == 0
    Cache::increment()
    "First time"
  else
    Cache::increment()
    "Another time #{Cache::count()}"
  end
end
Two options:
1. Save the parsed file to a new file and always read that one.
You can save to a file – serialize – a hash with two keys: 'last-modified' and 'data'.
The 'last-modified' value is a date, and on every request you check whether that date is today. If it is not today, a new file is downloaded, parsed, and stored with today's date.
The 'data' value is the parsed file.
That way you only parse once a day – a sort of cache.
2. Save the parsed file to a NoSQL database, for example Redis.
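A related in-memory variant of the same caching idea (a sketch, not the file-serialization approach described above): keep the parsed document in a module and re-parse only when the XML file's modification time changes. The file name is hypothetical.

require 'sinatra'
require 'nokogiri'

module XmlCache
  FILE = 'projects.xml'   # hypothetical path to the nightly-updated file

  def self.doc
    mtime = File.mtime(FILE)
    if @mtime != mtime            # first request, or the cron job replaced the file
      @doc   = Nokogiri::XML(File.read(FILE))
      @mtime = mtime
    end
    @doc
  end
end

get '/projects/:id' do
  XmlCache.doc.search("//projects/project[@id=#{params[:id]}]/name/text()").to_s
end

Because the module object lives for the lifetime of the process, the parsed document survives across requests without any class-variable warnings.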
