how do I save the parsed data to a file - ruby

I wounder how I can save the parsed data to a txt file. My script is only saving the last parsed. Do i need to add .each do ? kind of lost right now
here is my code and if maybe somebody could explain to me how save the parsed info on a new line
here is the code
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.clearsearch.se/foretag/-/q_advokat/1/"
doc = Nokogiri::HTML(open(url))
doc.css(".gray-border-bottom").each do |item|
title = item.css(".medium").text.strip
phone = item.css(".grayborderwrapper > .bold").text.strip
adress = item.css(".grayborder span").text.strip
www = item.css(".click2www").map { |link| link['href'] }
puts "#{title} ; \n"
puts "#{phone} ; \n"
puts "#{adress} ; \n"
puts "#{www} ; \n\n\n"
puts "Writing"
company = "#{title}; #{phone}; #{adress}; #{www} \n\n"
puts "saving"
file = File.open("exporterad.txt", "w")
file.write(company)
file.close
puts "done"
end
puts "done"

Calling File.open inside your loop truncates the file to zero length with each invocation. Instead, open the file outside your loop (using the block form):
File.open("exporterad.txt", "w") do |file|
doc.css(".gray-border-bottom").each do |item|
# ...
file.write(company)
# ...
end
end # <- file is closed automatically at the end of the block

Related

Refactoring my code so that file closes automatically once loaded, how does the syntax work?

My program loads a list from a file, and I'm trying to change the method so that it closes automatically.
I've looked at the Ruby documentation, the broad stackoverflow answer, and this guy's website, but the syntax is always different and doesn't mean much to me yet.
My original load:
def load_students(filename = "students.csv")
if filename == nil
filename = "students.csv"
elsif filename == ''
filename = "students.csv"
end
file = File.open(filename, "r")
file.readlines.each do |line|
name, cohort = line.chomp.split(",")
add_students(name).to_s
end
file.close
puts "List loaded from #{filename}."
end
My attempt to close automatically:
def load_students(filename = "students.csv")
if filename == nil
filename = "students.csv"
elsif filename == ''
filename = "students.csv"
end
open(filename, "r", &block)
line.each do |line|
name, cohort = line.chomp.split(",")
add_students(name).to_s
end
puts "List loaded from #{filename}."
end
I'm looking for the same result, but without having to manually close the file.
I don't think it'll be much different, so how does the syntax work for automatically closing with blocks?
File.open(filename, 'r') do |file|
file.readlines.each do |line|
name, cohort = line.chomp.split(",")
add_students(name).to_s
end
end
I’d refactor the whole code:
def load_students(filename = "students.csv")
filename = "students.csv" if filename.to_s.empty?
File.open(filename, "r") do |file|
file.readlines.each do |line|
add_students(line.chomp.split(",").first)
end
end
puts "List loaded from #{filename}."
end
Or, even better, as suggested by Kimmo Lehto in comments:
def load_students(filename = "students.csv")
filename = "students.csv" if filename.to_s.empty?
File.foreach(filename) do |line|
add_students(line.chomp.split(",").first)
end
puts "List loaded from #{filename}."
end

Executing program from command line

I have done a program that sends requests to a url and saves them in a file. The program is this, and is working perfectly:
require 'open-uri'
n = gets.to_i
out = gets.chomp
output = File.open( out, "w" )
for i in 1..n
response = open('http://slowapi.com/delay/10').read
output << (response +"\n")
puts response
end
output.close
I want to modify it so that I can execute it from command line. I must run it like this:
fle --test abc -n 300 -f output
What must I do?
Something like this should do the trick:
#!/usr/bin/env ruby
require 'open-uri'
require 'optparse'
# Prepare the parser
options = {}
oparser = OptionParser.new do |opts|
opts.banner = "Usage: fle [options]"
opts.on('-t', '--test [STRING]', 'Test string') { |v| options[:test] = v }
opts.on('-n', '--count COUNT', 'Number of times to send request') { |v| options[:count] = v.to_i }
opts.on('-f', '--file FILE', 'Output file', :REQUIRED) { |v| options[:out_file] = v }
end
# Parse our options
oparser.parse! ARGV
# Check if required options have been filled, print help and exit otherwise.
if options[:count].nil? || options[:out_file].nil?
$stderr.puts oparser.help
exit 1
end
File::open(options[:out_file], 'w') do |output|
options[:count].times do
response = open('http://slowapi.com/delay/10').read
output.puts response # Puts the response into the file
puts response # Puts the response to $stdout
end
end
Here's a more idiomatic way of writing your code:
require 'open-uri'
n = gets.to_i
out = gets.chomp
File.open(out, 'w') do |fo|
n.times do
response = open('http://slowapi.com/delay/10').read
fo.puts response
puts response
end
end
This uses File.open with a block, which allows Ruby to close the file once the block exits. It's a much better practice than assigning the file handle to a variable and use that to close the file later.
How to handle passing in variables from the command-line as options is handled in the other answers.
The first step would be to save you program in a file, add #!/usr/bin/env ruby at the top and chmod +x yourfilename to be able to execute your file.
Now you are able to run your script from the command line.
Secondly, you need to modify your script a little bit to pick up command line arguments. In Ruby, the command line arguments are stored inside ARGV, so something like
ARGV.each do|a|
puts "Argument: #{a}"
end
allows you to retrieve command line arguments.

Ruby output is not displayed on the sinatra browser

I want to bulid a multi threaded application. If i do not use threads, everything works fine. When i try to use threads, then nothing is displayed on the browser. when i use the syntax 'puts "%s" %io.read' then it displays on the command prompt and not on the browser. Any help would be appreciated.
require 'sinatra'
require 'thread'
set :environment, :production
get '/price/:upc/:rtype' do
Webupc = "#{params[:upc]}"
Webformat = "#{params[:rtype]}"
MThread = Thread.new do
puts "inside thread"
puts "a = %s" %Webupc
puts "b = %s" %Webformat
#call the price
Maxupclen = 16
padstr = ""
padupc = ""
padlen = (Maxupclen - Webupc.length)
puts "format type: #{params[:rtype]}"
puts "UPC: #{params[:upc]}"
puts "padlen: %s" %padlen
if (Webformat == 'F')
puts "inside format"
if (padlen == 0 ) then
IO.popen("tstprcpd.exe #{Webupc}")
{ |io|
"%s" %io.read
}
elsif (padlen > 0 ) then
for i in 1 .. padlen
padstr = padstr + "0"
end
padupc = padstr + Webupc
puts "padupc %s" %padupc
IO.popen("tstprcpd.exe #{padupc}") { |io|
"%s" %io.read
}
elsif (padlen < 0 ) then
IO.popen("date /T") { |io|
"UPC length must be 16 digits or less." %io.read
}
end
end
end
end
Your code has several problems:
It is not formatted properly
You are using Uppercase names for variables; that makes them constants!
puts will not output to the browser, but to the console. The browser will recieve the return value of the block, i.e. the return value of the last statement in the block. Therefore, you need to build your output differently (see below).
You are never joining the thread
Here's a minimal sinatra app that uses a thread. However, the thread makes no sense in this case because you must wait for its termination anyway before you can output the result to the browser. In order to build the output I have used StringIO, which you can use with puts to build a multiline string conveniently. However, you could also simply initialize res with an empty string with res = "" and then append your lines to this string with res << "new line\n".
require 'sinatra'
require 'thread'
require 'stringio'
get '/' do
res = StringIO.new
th = Thread.new do
res.puts 'Hello, world!'
end
th.join
res.string
end

Nokogiri and XPath: saving text result of scrape

I would like to save the text results of a scrape in a file. This is my current code:
require "rubygems"
require "open-uri"
require "nokogiri"
class Scrapper
attr_accessor :html, :single
def initialize(url)
download = open(url)
#page = Nokogiri::HTML(download)
#html = #page.xpath('//div[#class = "quoteText"andfollowing-sibling::div[1][#class = "quoteFooter" and .//a[#href and normalize-space() = "hard-work"]]]')
end
def get_quotes
#quotes_array = #html.collect {|node| node.text.strip}
#single = #quotes_array.each do |quote|
quote.gsub(/\s{2,}/, " ")
end
end
end
I know that I can write a file like this:
File.open('text.txt', 'w') do |fo|
fo.write(content)
but I don't know how to incorporate #single which holds the results of my scrape. Ultimate goal is to insert the information into a database.
I have come across some folks using Yaml but I am finding it hard to follow the step to step guide.
Can anyone point me in the right direction?
Thank you.
Just use:
#single = #quotes_array.map do |quote|
quote.squeeze(' ')
end
File.open('text.txt', 'w') do |fo|
fo.puts #single
end
Or:
File.open('text.txt', 'w') do |fo|
fo.puts #quotes_array.map{ |q| q.squeeze(' ') }
end
and don't bother creating #single.
Or:
File.open('text.txt', 'w') do |fo|
fo.puts #html.collect { |node| node.text.strip.squeeze(' ') }
end
and don't bother creating #single or #quotes_array.
squeeze is part of the String class. This is from the documentation:
" now is the".squeeze(" ") #=> " now is the"

Download a file only if it exists with ruby

I'm doing a scraper to download all the issues of The Exile available at http://exile.ru/archive/list.php?IBLOCK_ID=35&PARAMS=ISSUE.
So far, my code is like this:
require 'rubygems'
require 'open-uri'
DATA_DIR = "exile"
Dir.mkdir(DATA_DIR) unless File.exists?(DATA_DIR)
BASE_exile_URL = "http://exile.ru/docs/pdf/issues/exile"
for number in 120..290
numero = BASE_exile_URL + number.to_s + ".pdf"
puts "Downloading issue #{number}"
open(numero) { |f|
File.open("#{DATA_DIR}/#{number}.pdf",'w') do |file|
file.puts f.read
end
}
end
puts "done"
The thing is, a lot of the issue links are down, and the code creates a PDF for every issue and, if it's empty, it will leave an empty PDF. How can I change the code so that it can only create and copy a file if the link exists?
require 'open-uri'
DATA_DIR = "exile"
Dir.mkdir(DATA_DIR) unless File.exists?(DATA_DIR)
url_template = "http://exile.ru/docs/pdf/issues/exile%d.pdf"
filename_template = "#{DATA_DIR}/%d.pdf"
(120..290).each do |number|
pdf_url = url_template % number
print "Downloading issue #{number}"
# Opening the URL downloads the remote file.
open(pdf_url) do |pdf_in|
if pdf_in.read(4) == '%PDF'
pdf_in.rewind
File.open(filename_template % number,'w') do |pdf_out|
pdf_out.write(pdf_in.read)
end
print " OK\n"
else
print " #{pdf_url} is not a PDF\n"
end
end
end
puts "done"
open(url) downloads the file and provides a handle to a local temp file. A PDF starts with '%PDF'. After reading the first 4 characters, if the file is a PDF, the file pointer has to be put back to the beginning to capture the whole file when writing a local copy.
you can use this code to check if exist the file:
require 'net/http'
def exist_the_pdf?(url_pdf)
url = URI.parse(url_pdf)
Net::HTTP.start(url.host, url.port) do |http|
puts http.request_head(url.path)['content-type'] == 'application/pdf'
end
end
Try this:
require 'rubygems'
require 'open-uri'
DATA_DIR = "exile"
Dir.mkdir(DATA_DIR) unless File.exists?(DATA_DIR)
BASE_exile_URL = "http://exile.ru/docs/pdf/issues/exile"
for number in 120..290
numero = BASE_exile_URL + number.to_s + ".pdf"
open(numero) { |f|
content = f.read
if content.include? "Link is missing"
puts "Issue #{number} doesnt exists"
else
puts "Issue #{number} exists"
File.open("./#{number}.pdf",'w') do |file|
file.write(content)
end
end
}
end
puts "done"
The main thing I added is a check to see if the string "Link is missing". I wanted to do it using HTTP status codes but they always give a 200 back, which is not the best practice.
The thing to note is that with my code you always download the whole site to look for that string, but I don't have any other idea to fix it at the moment.

Resources