How do I avoid EOFError with a Ruby script?

I have a Ruby script (1.9.2p290) where I am trying to call a number of URLs, and then append information from those URLs into a file. The issue is that I keep getting an end of file error - EOFError. An example of what I'm trying to do is:
require "open-uri"
proxy_uri = URI.parse("http://IP:PORT")
somefile = File.open("outputlist.txt", 'a')
(1..100).each do |num|
page = open('SOMEURL' + num, :proxy => proxy_uri).read
pattern = "<img"
tags = page.scan(pattern)
output << tags.length
end
somefile.puts output
somefile.close
I don't know why I keep getting this end of file error, or how I can avoid getting the error. I think it might have something to do with the URL that I'm calling (based on some dialogue here: What is an EOFError in Ruby file I/O?), but I'm not sure why that would affect the I/O or cause an end of file error.
Any thoughts on what I might be doing wrong here or how I can get this to work?
Thanks in advance!

The way you are writing your file isn't idiomatic Ruby. This should work better:
output = []

(1..100).each do |num|
  page = open('SOMEURL' + num.to_s, :proxy => proxy_uri).read
  pattern = "<img"
  tags = page.scan(pattern)
  output << tags.length
end

File.open("outputlist.txt", 'a') do |fo|
  fo.puts output
end
I suspect the file is being closed because it is opened up front and then not written to while all 100 pages are processed. If that takes a while, I can see it being closed out from under you to keep apps from using up all the file handles. Writing it the Ruby way, with a block, closes the file immediately after the write, avoiding holding a handle open artificially.
As a secondary thing, rather than use a simple pattern match to try to locate image tags, use a real HTML parser. There will be little difference in processing speed, but potentially more accuracy.
Replace:
page = open('SOMEURL' + num.to_s, :proxy => proxy_uri).read
pattern = "<img"
tags = page.scan(pattern)
output << tags.length
with:
require 'nokogiri'

doc = Nokogiri::HTML(open('SOMEURL' + num.to_s, :proxy => proxy_uri))
output << doc.search('img').size
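Putting the pieces together, a minimal end-to-end sketch, keeping the placeholder URL and proxy from the question, might look like this:
require 'open-uri'
require 'nokogiri'

proxy_uri = URI.parse('http://IP:PORT')   # placeholder proxy from the question
output = []

(1..100).each do |num|
  # open-uri fetches the page through the proxy; Nokogiri parses it
  doc = Nokogiri::HTML(open('SOMEURL' + num.to_s, :proxy => proxy_uri))
  output << doc.search('img').size
end

# the block form closes the file as soon as the write finishes
File.open('outputlist.txt', 'a') do |fo|
  fo.puts output
end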

Related

Ruby - iterate tasks with files

I am struggling to iterate tasks with files in Ruby.
(Purpose of the program = every week, I have to save 40 pdf files off the school system containing student scores, then manually compare them to last week's pdfs and update one spreadsheet with every student who has passed their target this week. This is a task for a computer!)
I have converted a pdf file to text, and my program then extracts the correct data from the text files and turns each student into an array [name, score, house group]. It then checks each new array against the data in the csv file, and adds any new results.
My program works on a single pdf file, because I've manually typed in:
f = File.open('output\agb summer report.txt')
agb = []
f.each_line do |line|
  agb.push line
end
But I have a whole folder of pdf files that I want to run the program on iteratively. I've also had problems when I try to write each result to a new-named file.
I've tried things with variables and code blocks, but I now don't think you can use a variable in that way?
Dir.foreach('output') do |ea|
  f = File.open(ea)
  agb = []
  f.each_line do |line|
    agb.push line
  end
end
^ This doesn't work. I've also tried exporting the directory names to an array, and doing something like:
a.each do |ea|
  var = '\'output\\' + ea + '\''
  f = File.open(var)
  agb = []
  f.each_line do |line|
    agb.push line
  end
end
I think I'm fundamentally confused about what sort of objects File and Dir are. I've searched a lot and haven't found a solution yet. I am fairly new to Ruby.
Anyway, I'm sure this can be done - my current backup plan is to copy my program 40 times with different details, but that sounds absurd. Please offer thoughts?
You're very close. Dir.foreach returns only the file names, whereas File.open wants a path. A crude example to illustrate this:
directory = 'example_directory'

Dir.foreach(directory) do |file|
  # Assuming Unix style filesystem, skip . and ..
  next if file.start_with? '.'

  # Simply puts the contents
  path = File.join(directory, file)
  puts File.read(path)
end
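Extending that into the shape of your task, here is a rough sketch; process_report is a hypothetical stand-in for your per-file parsing logic:
# process each text file in the output folder once, instead of copying the program 40 times
def process_report(path)
  agb = File.open(path) { |f| f.to_a }   # one array element per line
  # ...extract the scores and update the CSV here...
end

directory = 'output'
Dir.foreach(directory) do |file|
  next unless file.end_with? '.txt'
  process_report(File.join(directory, file))
end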
Use Globbing for File Lists
You need to use Dir.glob to get your list of files. For example, given three PDF files in /tmp/pdf, you collect them with a glob like so:
Dir.glob('/tmp/pdf/*pdf')
# => ["/tmp/pdf/1.pdf", "/tmp/pdf/2.pdf", "/tmp/pdf/3.pdf"]
Dir.glob('/tmp/pdf/*pdf').class
# => Array
Once you have a list of filenames, you can iterate over them with something like:
Dir.glob('/tmp/pdf/*pdf').each do |pdf|
  # "-" tells pdftotext to write the extracted text to standard output
  text = %x(pdftotext "#{pdf}" -)
  # do something with your textual data
end
If you're on a Windows system, then you might need a gem like pdf-reader or something else from Ruby Toolbox that suits you better to actually parse the PDF. Regardless, you should use globbing to create a file list; what you do after that depends on what kind of data the file actually holds. IO#read and descendants like File#read are good places to start.
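For example, a minimal pdf-reader sketch (assuming the gem is installed and the PDFs contain extractable text) might look like this:
require 'pdf-reader'

Dir.glob('/tmp/pdf/*pdf').each do |pdf|
  reader = PDF::Reader.new(pdf)
  # join the text of every page into one string for later parsing
  text = reader.pages.map(&:text).join("\n")
  # do something with your textual data
end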
Handling Text Files
If you're dealing with text files rather than PDF files, then something like this will get you started:
Dir.glob('/tmp/pdf/*txt').each do |text|
  # Do something with your textual data. In this case, just
  # dump the files to standard output.
  p File.read(text)
end
You can use Dir.new("./") to get all the files in the current directory, so something like this should work:
file_names = Dir.new "./"

file_names.each do |file_name|
  if file_name.end_with? ".txt"
    f = File.open(file_name)
    agb = []
    f.each_line do |line|
      agb.push line
    end
  end
end
By the way, you can just use agb = f.to_a to convert the file contents into an array where each element is a line from the file.
file_names = Dir.new "./"

file_names.each do |file_name|
  if file_name.end_with? ".txt"
    f = File.open file_name
    agb = f.to_a
    # do whatever processing you need to do
  end
end
If you assign your target folder like this, /path/to/your/folder/*.txt, it will only iterate over text files.
2.2.0 :009 > target_folder = "/home/ziya/Desktop/etc3/example_folder/*.txt"
=> "/home/ziya/Desktop/etc3/example_folder/*.txt"
2.2.0 :010 > Dir[target_folder].each do |texts|
2.2.0 :011 > puts texts
2.2.0 :012?> end
/home/ziya/Desktop/etc3/example_folder/ex4.txt
/home/ziya/Desktop/etc3/example_folder/ex3.txt
/home/ziya/Desktop/etc3/example_folder/ex2.txt
/home/ziya/Desktop/etc3/example_folder/ex1.txt
iteration over text files is ok
2.2.0 :002 > Dir[target_folder].each do |texts|
2.2.0 :003 > File.open(texts, 'w') {|file| file.write("your content\n")}
2.2.0 :004?> end
results
2.2.0 :008 > system ("pwd")
/home/ziya/Desktop/etc3/example_folder
=> true
2.2.0 :009 > system("for f in *.txt; do cat $f; done")
your content
your content
your content
your content

Reading Several URIs in ruby

I need to read the contents of a web page several times and extract some information out of it, for which I use regular expressions. I am using open-uri to read the contents of the page, and the sample code I have written is as follows:
require 'open-uri'

def getResults(words)
  results = []
  words.each do |word|
    results.push getAResult(word)
  end
  results
end

def getAResult(word)
  file = open("http://www.somapage.com?option=#{word}")
  contents = file.read
  file.close
  contents.match /some-regex-here/
  $1.empty? ? -1 : $1.to_f
end
The problem is that unless I comment out the file.close line, getAResult always returns -1. When I try this code in the console, getAResult immediately returns -1, but the Ruby process runs for another two to three seconds or so.
If I remove the file.close line, getAResult returns the correct result, but now getResults is a bunch of -1s except for the first one. I tried to use the curb gem for reading the page, but a similar problem appears.
This seems like an issue related to threading. However, I couldn't come up with anything reasonable to search for to find a corresponding solution. What do you think the problem might be?
NOTE: The web page I'm trying to read does not return results quickly. It takes some time.
Try hpricot or Nokogiri instead. Either can search your HTML document via XPath, rather than relying on regular expressions.
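As a rough sketch, assuming Nokogiri; the URL is from the question and the selector is a placeholder, since the real regex isn't shown:
require 'open-uri'
require 'nokogiri'

def getAResult(word)
  doc = Nokogiri::HTML(open("http://www.somapage.com?option=#{word}"))
  # 'span.result' is a placeholder selector for whatever the regex was matching
  node = doc.at('span.result')
  node ? node.text.to_f : -1
end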
You should grab the match result, like the following:
1.9.3-327 (main):0 > contents.match /div/
=> #<MatchData "div">
1.9.3-327 (main):0 > $1
=> nil
1.9.3-327 (main):0 > contents.match /(div)/
=> #<MatchData "div" 1:"div">
1.9.3-327 (main):0 > $1
=> "div"
If you are worried about thread safety, then you shouldn't use the $n regexp variables. Capture your results directly, like this:
value = contents[/regexp/]
Specifically, here's a more Ruby-like formatting of that method:
def getAResult(word)
  contents = open("http://www.somapage.com?option=#{word}") { |f| f.read }
  value = contents[/some-regex-here/]
  value.nil? || value.empty? ? -1 : value.to_f
end
The block form of #open (as above) automatically closes the file when you are done with it.

Trying to open a file in Ruby - Getting TypeError: can't convert String into Integer

Not sure what's going on here, or what the integer could be in this case. Here's the code:
def build_array_from_file(filename)
  contents = []
  File.read(File.expand_path('lib/project_euler/' + filename), 'r') do |file|
    while line = file.get
      contents << line
    end
  end
  contents
end
filename is a string and I've checked to make sure the path comes up valid.
Any thoughts? Thanks.
File.read takes neither a mode as its second argument nor a block; that's File.open:
contents_string = File.read(File.expand_path('lib/project_euler/' + filename))
Note that you can also write:
contents = File.open(path).lines # returns a lazy enumerator, keeps the file open
Or:
contents = File.readlines(path) # returns an array, the file is closed.
File.read doesn't need the mode 'r' - you already request 'read' by calling File.read. The parameters of File.read are - after the filename - the length and the offset (that's why an integer was expected in the error message).
You may give the mode as File.read(filename, :mode => 'r'). This may be useful if you need the mode 'rb' or 'r:utf-8' (but there is also an :encoding option).
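Putting that together, a minimal sketch of the corrected method, keeping the question's path layout, could be:
def build_array_from_file(filename)
  # File.readlines opens the file, reads it line by line into an array, and closes it
  File.readlines(File.expand_path('lib/project_euler/' + filename))
end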

Script that saves a series of pages then tries to combine them but only combines one?

Here's my code:
require "open-uri"
base_url = "http://en.wikipedia.org/wiki"
(1..5).each do |x|
# sets up the url
full_url = base_url + "/" + x.to_s
# reads the url
read_page = open(full_url).read
# saves the contents to a file and closes it
local_file = "my_copy_of-" + x.to_s + ".html"
file = open(local_file,"w")
file.write(read_page)
file.close
# open a file to store all entrys in
combined_numbers = open("numbers.html", "w")
entrys = open(local_file, "r")
combined_numbers.write(entrys.read)
entrys.close
combined_numbers.close
end
As you can see, it basically scrapes the contents of Wikipedia articles 1 through 5 and then attempts to combine them into a single file called numbers.html.
It does the first bit right, but when it gets to the second, it only seems to write the contents of the fifth article in the loop.
I can't see where I'm going wrong, though. Any help?
You chose the wrong mode when opening your summary file. "w" overwrites existing files while "a" appends to existing files.
So use this to get your code working:
combined_numbers = open("numbers.html", "a")
Otherwise with each pass of the loop the file contents of numbers.html are overwritten with the current article.
Besides I think you should use the contents in read_page to write to numbers.html instead of reading them back in from your freshly written file:
require "open-uri"
(1..5).each do |x|
# set up and read url
url = "http://en.wikipedia.org/wiki/#{x.to_s}"
article = open(url).read
# saves current article to a file
# (only possible with 1.9.x use open too if on 1.8.x)
IO.write("my_copy_of-#{x.to_s}.html", article)
# add current article to summary file
open("numbers.html", "a") do |f|
f.write(article)
end
end

Converting python script to ruby (downloading part of a file)

I've been at this for a couple of days, and am having no luck at all. Despite reading over these two posts, I can't seem to rewrite this little Python script I did up in Ruby.
clean_link = link['href'].replace(' ', '%20')
mp3file = urllib2.urlopen(clean_link)
output = open('temp.mp3','wb')
output.write(mp3file.read(2000))
output.close()
I've been looking at using open-uri and net/http to do the same in Ruby, but keep hitting a URL redirect issue. So far I have:
clean_link = link.attributes['href'].gsub(' ', '%20')
link_pieces = clean_link.scan(/http:\/\/(?:www\.)?([^\/]+?)(\/.*?\.mp3)/)
host = link_pieces[0][0]
path = link_pieces[0][1]

Net::HTTP.start(host) do |http|
  resp = http.get(path)
  open("temp.mp3", "wb") do |file|
    file.write(resp.body)
  end
end
Is there a simpler way to do this in ruby? Also, as with the python script, is there a way to only download part of the file?
EDIT: progress updated
see here & here
http.request_get('/index.html') { |res|
  size = 0
  res.read_body do |chunk|
    size += chunk.size
    # do some processing
    break if size >= 2000
  end
}
but you can't control the chunk sizes here.
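Another option, if the server supports it, is to ask for only the first part of the file with an HTTP Range header. This is a sketch, assuming the host and path variables from the snippet above:
require 'net/http'

Net::HTTP.start(host) do |http|
  req = Net::HTTP::Get.new(path)
  # ask the server for only the first 2000 bytes (ignored if ranges aren't supported)
  req['Range'] = 'bytes=0-1999'
  resp = http.request(req)

  open('temp.mp3', 'wb') do |file|
    file.write(resp.body)
  end
end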