File system crawler - iteration bugs - ruby

I'm currently building a file system crawler with the following code:
require 'find'
require 'spreadsheet'

Spreadsheet.client_encoding = 'UTF-8'

count = 0
Find.find('/Users/Anconia/crawler/') do |file|
  if file =~ /\b.xls$/ # check if filename ends in desired format
    contents = Spreadsheet.open(file).worksheets
    contents.each do |row|
      if row =~ /regex/
        puts file
        count += 1
      end
    end
  end
end
puts "#{count} files were found"
I am receiving the following output:
0 files were found
The regex is tested and correct - I currently use it in another crawler that works.
The output of row.inspect is
#<Spreadsheet::Excel::Worksheet:0x003ffa5d418538 #row_addresses= #default_format= #selected= #dimensions= #name=Sheet1 #workbook=#<Spreadsheet::Excel::Workbook:0x007ff4bb147140> #rows=[] #columns=[] #links={} #merged_cells=[] #protected=false #password_hash=0 #changes={} #offsets={} #reader=#<Spreadsheet::Excel::Reader:0x007ff4bb1f3b98> #ole=#<Ole::Storage::RangesIOMigrateable:0x007ff4bb126fa8> #offset=15341 #guts={} #rows[3]> - certainly nothing to iterate over.

Try this:
content = Spreadsheet.open(file)
sheet = content.worksheet 0
sheet.each do |row|
  ...

As Diego mentioned, I should have been iterating over the contents of each worksheet - really appreciate the clarification! It should also be noted that each row must be converted to a string before matching against the regex.
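For reference, here is a minimal corrected sketch of the whole crawler, assuming the same path and the placeholder /regex/. Note the escaped dot in /\.xls$/: the unescaped dot in the original pattern matches any character, not just a literal period.

require 'find'
require 'spreadsheet'

Spreadsheet.client_encoding = 'UTF-8'

count = 0
Find.find('/Users/Anconia/crawler/') do |file|
  next unless file =~ /\.xls$/ # escape the dot so only ".xls" filenames match
  Spreadsheet.open(file).worksheets.each do |sheet|
    sheet.each do |row| # each worksheet yields its rows
      if row.to_s =~ /regex/ # rows must be stringified before matching
        puts file
        count += 1
      end
    end
  end
end
puts "#{count} files were found"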

Related

Ruby - Can't find a string in a txt file

I'm writing code to search for a string in all txt files in a directory. The code works OK in 2 of 3 files.
search = ['first', 'second', ...]

Dir["directory/*.txt"].each do |txt|
  file = File.read(txt, encoding: "ISO8859-1:utf-8")
  search.each do |se|
    puts se if file.include? se # added to see if it finds a record - not working
    file.each_line do |li|
      if li.include? se
        puts li # I removed everything else to see if this works - not working
      end
    end
  end
end
Like I said before, it works fine with 2 of the 3 files (80 MB, 88 MB, 224 MB). I left just the 224 MB file (the one that is not working) in the directory, but still nothing.
I have been searching all day but didn't find anything that would help me. Why would it not work on the 224 MB file, if it has the same txt format and comes from the same source?
EDIT:
Not working means it doesn't find the string that I know is there, and this only happens for the third file mentioned.
Edit2:
I did li.split("\t") and know that data[2] (the third tab-separated column) is where the search string is.
Then I changed the code to:
file.each_line.with_index do |li, line|
  data = li.split("\t")
  if line == 3
    puts data[2] # I got the string I'm looking for in the console
  end
  # but then when I try to use it I can't
  if data[2] == search # this is false - I tried both .to_s and .to_i
    puts li
  end
end
I did another test like:
puts data[2].to_i + 1 #result is 1 when data[2] is just numbers
I downloaded the file again and tried once more, but nothing seems to work. It's like it can return the string data[2] but doesn't recognize it or can't do anything with it. And like I said, it's just 1 file out of 3.
[EDIT]
The problem was that the txt files were damaged at the source. Months later I tried this code again with newly generated txt files, and it worked with no issues.
Thanks all for comments and answers.
I've seen similar issues when working with strings that exceed the threshold of some memory limitation somewhere.
I would try breaking the large files up into smaller chunks like this (note that a match can straddle a chunk boundary, so treat this as a diagnostic rather than a complete fix):
FILE_SIZE_LIMIT_IN_MB = 80
search = ['first', 'second', ...]

def read_file(path)
  File.open(path, 'r') do |f|
    until f.eof? do
      yield f.read(FILE_SIZE_LIMIT_IN_MB * 1024 * 1024)
    end
  end
end
Dir["directory/*.txt"].each do |txt|
read_file(txt) do |file|
search.each do |se|
puts se if file.include? se #added to see if it finds a record - not working
file.each_line do |li|
if li.include? se
puts li # I removed everything else to see if this works - not working
end
end
end
end
end
It looks like you are searching line by line. If so, you can save a ton of memory overhead by actually reading line by line instead of slurping whole files into memory. To do that, move the search.each loop inside the loop that reads the file. Here's my attempt:
search = ['first', 'second', ...]

Dir["directory/*.txt"].each do |txt|
  File.foreach(txt, encoding: "ISO8859-1:utf-8") do |li|
    search.each do |se|
      puts se if li.include? se
    end
  end
end
The foreach method doesn't slurp in the entire file.
This doesn't work if the search string stretches across a newline boundary. If you have some other separator that would work better, you can optionally override the default:
File.foreach(txt, "\t", {encoding: "ISO8859-1:utf-8"}) do |r| # Tab-separated records

Ruby - iterate tasks with files

I am struggling to iterate tasks with files in Ruby.
(Purpose of the program: every week, I have to save 40 pdf files off the school system containing student scores, then manually compare them to last week's pdfs and update one spreadsheet with every student who has passed their target this week. This is a task for a computer!)
I have converted a pdf file to text, and my program then extracts the correct data from the text files and turns each student into an array [name, score, house group]. It then checks each new array against the data in the csv file, and adds any new results.
My program works on a single pdf file, because I've manually typed in:
f = File.open('output\agb summer report.txt')
agb = []
f.each_line do |line|
  agb.push line
end
But I have a whole folder of pdf files that I want to run the program on iteratively. I've also had problems when I try to write each result to a new-named file.
I've tried things with variables and code blocks, but I now don't think you can use a variable in that way?
Dir.foreach('output') do |ea|
  f = File.open(ea)
  agb = []
  f.each_line do |line|
    agb.push line
  end
end
^ This doesn't work. I've also tried exporting the directory names to an array, and doing something like:
a.each do |ea|
  var = '\'output\\' + ea + '\''
  f = File.open(var)
  agb = []
  f.each_line do |line|
    agb.push line
  end
end
I think I'm fundamentally confused about what sorts of objects File and Dir are. I've searched a lot and haven't found a solution yet. I am fairly new to Ruby.
Anyway, I'm sure this can be done - my current backup plan is to copy my program 40 times with different details, but that sounds absurd. Please offer thoughts?
You're very close. Dir.foreach() yields just the names of the files, whereas File.open() wants a path. A crude example to illustrate this:
directory = 'example_directory'

Dir.foreach(directory) do |file|
  # Assuming a Unix-style filesystem, skip . and ..
  next if file.start_with? '.'
  # Simply puts the contents
  path = File.join(directory, file)
  puts File.read(path)
end
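A slightly more defensive variant (my addition, not part of the answer above) skips anything that is not a regular file, which covers . and .. as well as subdirectories without relying on the leading-dot convention:

directory = 'example_directory'

Dir.foreach(directory) do |file|
  path = File.join(directory, file)
  next unless File.file?(path) # skips directories, including . and ..
  puts File.read(path)
end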
Use Globbing for File Lists
You need to use Dir.glob to get your list of files. For example, given three PDF files in /tmp/pdf, you collect them with a glob like so:
Dir.glob('/tmp/pdf/*pdf')
# => ["/tmp/pdf/1.pdf", "/tmp/pdf/2.pdf", "/tmp/pdf/3.pdf"]
Dir.glob('/tmp/pdf/*pdf').class
# => Array
Once you have a list of filenames, you can iterate over them with something like:
Dir.glob('/tmp/pdf/*pdf').each do |pdf|
  text = %x(pdftotext "#{pdf}" -) # the trailing "-" makes pdftotext print to stdout
  # do something with your textual data
end
If you're on a Windows system, you might need a gem like pdf-reader or something else from Ruby Toolbox that suits you better to actually parse the PDF. Regardless, you should use globbing to create your file list; what you do after that depends on what kind of data the file actually holds. IO.read and descendants like File.read are good places to start.
Handling Text Files
If you're dealing with text files rather than PDF files, then something like this will get you started:
Dir.glob('/tmp/pdf/*txt').each do |text|
  # Do something with your textual data. In this case, just
  # dump the files to standard output.
  p File.read(text)
end
You can use Dir.new("./") to get all the files in the current directory, so something like this should work:
file_names = Dir.new "./"

file_names.each do |file_name|
  if file_name.end_with? ".txt"
    f = File.open(file_name)
    agb = []
    f.each_line do |line|
      agb.push line
    end
  end
end
By the way, you can just use agb = f.to_a to convert the file contents into an array where each element is a line from the file.
file_names = Dir.new "./"

file_names.each do |file_name|
  if file_name.end_with? ".txt"
    f = File.open file_name
    agb = f.to_a
    # do whatever processing you need to do
  end
end
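As an aside, File.readlines collapses the open/to_a pair into a single call and closes the file for you, which the snippets above never do:

agb = File.readlines(file_name) # one array element per line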
If you assign your target folder like this, /path/to/your/folder/*.txt, it will only iterate over text files.
2.2.0 :009 > target_folder = "/home/ziya/Desktop/etc3/example_folder/*.txt"
=> "/home/ziya/Desktop/etc3/example_folder/*.txt"
2.2.0 :010 > Dir[target_folder].each do |texts|
2.2.0 :011 > puts texts
2.2.0 :012?> end
/home/ziya/Desktop/etc3/example_folder/ex4.txt
/home/ziya/Desktop/etc3/example_folder/ex3.txt
/home/ziya/Desktop/etc3/example_folder/ex2.txt
/home/ziya/Desktop/etc3/example_folder/ex1.txt
Iteration over the text files is OK:
2.2.0 :002 > Dir[target_folder].each do |texts|
2.2.0 :003 > File.open(texts, 'w') {|file| file.write("your content\n")}
2.2.0 :004?> end
Results:
2.2.0 :008 > system ("pwd")
/home/ziya/Desktop/etc3/example_folder
=> true
2.2.0 :009 > system("for f in *.txt; do cat $f; done")
your content
your content
your content
your content

Script that saves a series of pages then tries to combine them but only combines one?

Here's my code:
require "open-uri"
base_url = "http://en.wikipedia.org/wiki"
(1..5).each do |x|
# sets up the url
full_url = base_url + "/" + x.to_s
# reads the url
read_page = open(full_url).read
# saves the contents to a file and closes it
local_file = "my_copy_of-" + x.to_s + ".html"
file = open(local_file,"w")
file.write(read_page)
file.close
# open a file to store all entrys in
combined_numbers = open("numbers.html", "w")
entrys = open(local_file, "r")
combined_numbers.write(entrys.read)
entrys.close
combined_numbers.close
end
As you can see, it basically scrapes the contents of the Wikipedia articles 1 through 5 and then attempts to combine them into a single file called numbers.html.
It does the first bit right, but when it gets to the second, it only seems to write the contents of the fifth article in the loop.
I can't see where I'm going wrong though. Any help?
You chose the wrong mode when opening your summary file. "w" overwrites existing files while "a" appends to existing files.
So use this to get your code working:
combined_numbers = open("numbers.html", "a")
Otherwise with each pass of the loop the file contents of numbers.html are overwritten with the current article.
Besides, I think you should use the contents already in read_page to write to numbers.html, instead of reading them back in from your freshly written file:
require "open-uri"
(1..5).each do |x|
# set up and read url
url = "http://en.wikipedia.org/wiki/#{x.to_s}"
article = open(url).read
# saves current article to a file
# (only possible with 1.9.x use open too if on 1.8.x)
IO.write("my_copy_of-#{x.to_s}.html", article)
# add current article to summary file
open("numbers.html", "a") do |f|
f.write(article)
end
end

Strange number conversion while reading a csv file with ruby

I've got a strange problem in Ruby on Rails.
There is a csv file, made with Excel 2003.
5437390264172534;Mark;5
I have a page with an upload input, and I read the file like this:
file = params[:upload]['datafile']
file.read.split("\n").each do |line|
  num, name, type = line.split(";")
  logger.debug "row: #{num} #{name} #{type}"
end
etc.
So finally I've got the following:
num = 5437...2534
name = Mark
type = 5
Why does num have such a strange value?
I also tried this:
str = file.read
csv = CSV.parse(str)
csv.each do |line|
  RAILS_DEFAULT_LOGGER.info "######## #{line.to_yaml}"
end
but again I got:
######## ---
- !str:CSV::Cell "5437...2534;Mark;5"
The csv file is in win1251 (I can't change the file encoding).
The ruby file is in UTF-8.
Ruby version: 1.8.4
Rails version: 2.0.2
If it indeed has a strange value, it probably has to do with the code you didn't post. Edit your question, and include the smallest bit of code that will run independently and still produce your questionable output.
split() returns an array of Strings, so the first value of your CSV file is a String, not a Bignum. Maybe you need num.to_i, or a test like num.is_a?(Bignum), somewhere in your code.
file = File.open("test.csv", "r")
# Just getting the first line
line = file.gets
num,name,type = line.split(";")
# split() returns an array of String
puts num.class
puts num
# Make num a number
puts num.to_i.class
puts num.to_i
file.close
Running that file here gives me this:
$ ruby test.rb
String
5437390264172534
Bignum
5437390264172534
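The !str:CSV::Cell dump above also hints at the parsing problem: the whole 5437390264172534;Mark;5 line came back as a single cell because CSV defaults to a comma separator. The sketch below uses the modern CSV API for illustration; the 1.8-era csv.rb that shipped with Rails 2 took the field separator as a positional argument instead.

require 'csv'

line = "5437390264172534;Mark;5"
num, name, type = CSV.parse_line(line, col_sep: ';')

puts num            # => 5437390264172534 (still a String)
puts num.to_i.class # => Integer (Bignum on pre-2.4 rubies)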

Ruby - Reading csv file and executing value in loop is skipping over lines in the csv file

I'm sure this is a completely ignorant question, but here it goes. The following code's objective is to read a list of ids from a standard csv file, use each value to append to a URL, call the URL, and extract a specific attribute via XPath. The problem I'm having is that the loop seems to be skipping some lines.
For example, here is a sample of 10 values:
777961
777972
781033
781044
781055
847066
744187
893908
369009
369010
The code is only reading every other line. The actual file has around 6000 lines, not huge, but I'm only getting about 2500 values returned in the second file.
f = File.open('test.csv', 'r+')
url_f = File.open("url.csv", "w")

for line in f
  f.each_line do |item|
    item = f.gets
    url = "http://test.com/testid=" + item
    client = HTTPClient.new
    resp = client.get_content(url)
    doc = Nokogiri::HTML(resp)
    doc.xpath("//link[@rel='canonical']/@href").each do |attr|
      url_f.puts attr.value
      puts attr.value
    end
    puts item
  end
end
Never mind, I figured it out.
I had the line item = f.gets, which would pull the next line every time the loop ran, thus skipping every other line. I knew it was a noob question. :P
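For anyone who lands here, a cleaned-up sketch of the loop with the redundant iterators removed; File.foreach reads each id exactly once, and the HTTP client only needs to be built once:

require 'httpclient'
require 'nokogiri'

url_f = File.open('url.csv', 'w')
client = HTTPClient.new

File.foreach('test.csv') do |item|
  item = item.strip # drop the trailing newline from each id
  url = "http://test.com/testid=" + item
  doc = Nokogiri::HTML(client.get_content(url))
  doc.xpath("//link[@rel='canonical']/@href").each do |attr|
    url_f.puts attr.value
    puts attr.value
  end
  puts item
end

url_f.close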
