Ruby - Can't find a string in a txt file

I'm writing code to search for a string in all txt files of a directory. The code works OK in 2 of 3 files.
search = ['first', 'second', ...]
Dir["directory/*.txt"].each do |txt|
file = File.read(txt, encoding: "ISO8859-1:utf-8")
search.each do |se|
puts se if file.include? se #added to see if it finds a record - not working
file.each_line do |li|
if li.include? se
puts li # I removed everything else to see if this works - not working
end
end
end
end
Like I said before, it works fine with 2 of the 3 files (80 MB, 88 MB, 224 MB). I left just the 224 MB file in the directory (the one that is not working), but still nothing.
I have been searching all day, but didn't find anything that would help me. Why would it not work for the 224 MB file, if it has the same txt format and comes from the same source?
EDIT:
"Not working" means it doesn't find a string that I know is there, and this only happens for the third file mentioned.
Edit2:
I did li.split("\t") and know that the third field (index 2) is the column that contains the search string.
Then I changed the code to:
file.each_line.with_index do |li, line|
data = li.split("\t")
if line == 3
puts data[2] # in the console I got the string that I'm looking for
end
# but then when I try to use it, I can't
if data[2] == search # this is false; I tried .to_s and .to_i on both
puts li
end
end
I did another test like:
puts data[2].to_i + 1 #result is 1 when data[2] is just numbers
I downloaded the file again and tried again, but nothing seems to work. It's like it can return the string data[2] but doesn't recognize it or can't do anything with it. And like I said, this happens in just 1 file out of 3.
[EDIT]
The problem was that the txt files were damaged at the source. Months later I tried this code again with newly generated txt files, and it worked with no issues.
Thanks all for the comments and answers.
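For anyone hitting the same symptom (a field that prints as expected but fails comparisons and to_i), here is a minimal sketch for inspecting the raw bytes. The field value is hypothetical and not taken from the real file; it just assumes stray NUL bytes, which is one way a damaged or differently encoded file can look. Note also that comparing a String to the search Array with == is always false, so include? is used here.

```ruby
# Hypothetical field value, NOT taken from the real file: it can print like
# "123" on a terminal but carries NUL bytes between the digits.
field = "\x001\x002\x003"
search = ["123", "456"]

p field                     # the inspect form makes hidden bytes visible
p field.bytes               # raw byte values
p search.include?(field)    # false: the bytes differ from "123"

# Strip NUL bytes and surrounding whitespace before comparing.
cleaned = field.delete("\x00").strip
p search.include?(cleaned)  # true once the stray bytes are gone
p cleaned.to_i + 1
```

If p shows escapes such as \u0000 or \xFF, the file content is not what it appears to be on screen, and the fix belongs at the source or in the encoding passed to File.read.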

I've seen similar issues when working with strings that exceed some memory limit.
I would try breaking the large files up into smaller chunks like this:
FILE_SIZE_LIMIT_IN_MB = 80
search = ['first', 'second', ...]
def read_file(path)
File.open(path, 'r') do |f|
until f.eof? do
yield f.read(FILE_SIZE_LIMIT_IN_MB * 1024 * 1024)
end
end
end
Dir["directory/*.txt"].each do |txt|
read_file(txt) do |file|
search.each do |se|
puts se if file.include? se #added to see if it finds a record - not working
file.each_line do |li|
if li.include? se
puts li # I removed everything else to see if this works - not working
end
end
end
end
end
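One caveat with fixed-size chunks is that a line (and therefore a match) can straddle a chunk boundary. A rough sketch of one way around that, assuming newline-terminated records: keep the trailing partial line and prepend it to the next chunk. The helper name each_line_in_chunks and the sample data are made up for illustration.

```ruby
require "stringio"

# Yields each complete line from io while reading chunk_size bytes at a time.
# The unfinished tail of every chunk is carried over into the next one.
def each_line_in_chunks(io, chunk_size)
  carry = String.new
  while (chunk = io.read(chunk_size))
    carry << chunk
    lines = carry.split("\n", -1)  # -1 keeps a trailing empty string
    carry = lines.pop || String.new # last piece is incomplete (or empty)
    lines.each { |line| yield line }
  end
  yield carry unless carry.empty?   # final line without a trailing newline
end

search = ["first", "second"]
found = []
io = StringIO.new("alpha first\nbeta\ngamma second\n")
each_line_in_chunks(io, 5) do |line|
  found << line if search.any? { |se| line.include?(se) }
end
p found
```

A tiny chunk size is used here only to force lines to be split across chunks; with a real file you would pass a File object and a size in the megabyte range.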

It looks like you are searching line by line. If so, you can save a ton of memory overhead, and avoid searching through arrays, by reading the file line by line. In order to do that, you'll want to move the search.each loop inside the loop that reads the files. Here's my attempt:
search = ['first', 'second', ...]
Dir["directory/*.txt"].each do |txt|
File.foreach(txt, encoding: "ISO8859-1:utf-8") do |li| # keyword form also works on Ruby 3
search.each do |se|
puts se if li.include? se
end
end
end
The foreach method doesn't slurp in the entire file.
This doesn't work if the search string stretches across a newline boundary. If you have some other separator that would work better, you can optionally override the default:
File.foreach(txt, "\t", encoding: "ISO8859-1:utf-8") do |r| # Tab-separated records
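As a possible variation on the line-by-line approach, the search terms can be combined into a single pattern with Regexp.union, so each line is scanned once instead of once per term. This is only a sketch: the temp file and its contents are invented so the example can run on its own.

```ruby
require "tempfile"

search = ["first", "second"]
pattern = Regexp.union(search)  # matches any of the search terms

matches = []
Tempfile.create("sample") do |f|
  # Made-up sample data standing in for one of the txt files.
  f.write("only first here\nnothing\nsecond and first\n")
  f.flush

  File.foreach(f.path, encoding: "ISO-8859-1:UTF-8") do |line|
    hit = line[pattern]          # leftmost matching term, or nil
    matches << [hit, line.chomp] if hit
  end
end
p matches
```

Regexp.union escapes plain strings, so terms containing characters like "." or "(" are still matched literally.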

Related

Ruby - iterate tasks with files

I am struggling to iterate tasks with files in Ruby.
(Purpose of the program = every week, I have to save 40 pdf files off the school system containing student scores, then manually compare them to last week's pdfs and update one spreadsheet with every student who has passed their target this week. This is a task for a computer!)
I have converted a pdf file to text, and my program then extracts the correct data from the text files and turns each student into an array [name, score, house group]. It then checks each new array against the data in the csv file, and adds any new results.
My program works on a single pdf file, because I've manually typed in:
f = File.open('output\agb summer report.txt')
agb = []
f.each_line do |line|
agb.push line
end
But I have a whole folder of pdf files that I want to run the program on iteratively. I've also had problems when I try to write each result to a new-named file.
I've tried things with variables and code blocks, but I now don't think you can use a variable in that way?
Dir.foreach('output') do |ea|
f = File.open(ea)
agb = []
f.each_line do |line|
agb.push line
end
end
^ This doesn't work. I've also tried exporting the directory names to an array, and doing something like:
a.each do |ea|
var = '\'output\\' + ea + '\''
f = File.open(var)
agb = []
f.each_line do |line|
agb.push line
end
end
I think I'm fundamentally confused about the sorts of object File and Dir are? I've searched a lot and haven't found a solution yet. I am fairly new to Ruby.
Anyway, I'm sure this can be done - my current backup plan is to copy my program 40 times with different details, but that sounds absurd. Please offer thoughts?
You're very close. Dir.foreach() yields only the names of the directory entries, whereas File.open() needs a path to the file. A crude example to illustrate this:
directory = 'example_directory'
Dir.foreach(directory) do |file|
# Assuming Unix style filesystem, skip . and ..
next if file.start_with? '.'
# Simply puts the contents
path = File.join(directory, file)
puts File.read(path)
end
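A related sketch, assuming Ruby 2.5 or newer: Dir.children returns the entry names without "." and "..", so hidden files don't have to be skipped just to avoid those two. The temporary directory and file names below are invented for the example.

```ruby
require "tmpdir"

contents = {}
Dir.mktmpdir do |directory|
  # Made-up entries: a text file, a hidden file and a subdirectory.
  File.write(File.join(directory, "a.txt"), "alpha\n")
  File.write(File.join(directory, ".hidden"), "secret\n")
  Dir.mkdir(File.join(directory, "sub"))

  Dir.children(directory).sort.each do |name|
    path = File.join(directory, name)  # name alone is not a usable path
    next unless File.file?(path)       # skip subdirectories
    contents[name] = File.read(path)
  end
end
p contents
```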
Use Globbing for File Lists
You need to use Dir.glob to get your list of files. For example, given three PDF files in /tmp/pdf, you collect them with a glob like so:
Dir.glob('/tmp/pdf/*pdf')
# => ["/tmp/pdf/1.pdf", "/tmp/pdf/2.pdf", "/tmp/pdf/3.pdf"]
Dir.glob('/tmp/pdf/*pdf').class
# => Array
Once you have a list of filenames, you can iterate over them with something like:
Dir.glob('/tmp/pdf/*pdf').each do |pdf|
text = %x(pdftotext "#{pdf}" -) # "-" sends the extracted text to stdout
# do something with your textual data
end
If you're on a Windows system, then you might need a gem like pdf-reader or something else from Ruby Toolbox that suits you better to actually parse the PDF. Regardless, you should use globbing to create a file list; what you do after that depends on what kind of data the file actually holds. IO.read and File.read are good places to start.
Handling Text Files
If you're dealing with text files rather than PDF files, then something like this will get you started:
Dir.glob('/tmp/pdf/*txt').each do |text|
# Do something with your textual data. In this case, just
# dump the files to standard output.
p File.read(text)
end
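The question also mentions trouble writing each result to a newly named file. Here is a minimal sketch of one way to derive the output name from the input name with File.basename. The "_result.csv" naming scheme, the helper result_path_for, and the sample rows are all assumptions for illustration, not part of the original program.

```ruby
require "tmpdir"

# Builds an output path next to the input file (naming scheme is assumed).
def result_path_for(input_path)
  base = File.basename(input_path, ".txt")
  File.join(File.dirname(input_path), "#{base}_result.csv")
end

written = {}
Dir.mktmpdir do |dir|
  # Made-up input files standing in for the converted reports.
  File.write(File.join(dir, "agb.txt"), "Ann 7 Red\nBob 9 Blue\n")
  File.write(File.join(dir, "xyz.txt"), "Cy 4 Red\n")

  Dir.glob(File.join(dir, "*.txt")).sort.each do |path|
    rows = File.readlines(path, chomp: true).map(&:split)
    out = result_path_for(path)
    File.write(out, rows.map { |r| r.join(",") }.join("\n") + "\n")
    written[File.basename(out)] = File.read(out)
  end
end
p written
```

The point is that the loop variable holds an ordinary String, so it can be passed to File.read, File.basename, or string interpolation like any other value.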
You can use Dir.new("./") to get all the files in the current directory, so something like this should work:
file_names = Dir.new "./"
file_names.each do |file_name|
if file_name.end_with? ".txt"
f = File.open(file_name)
agb = []
f.each_line do |line|
agb.push line
end
end
end
By the way, you can just use agb = f.to_a to convert the file contents into an array where each element is a line from the file.
file_names = Dir.new "./"
file_names.each do |file_name|
if file_name.end_with? ".txt"
f = File.open file_name
agb = f.to_a
# do whatever processing you need to do
end
end
If you assign your target folder like this, /path/to/your/folder/*.txt, it will only iterate over text files.
2.2.0 :009 > target_folder = "/home/ziya/Desktop/etc3/example_folder/*.txt"
=> "/home/ziya/Desktop/etc3/example_folder/*.txt"
2.2.0 :010 > Dir[target_folder].each do |texts|
2.2.0 :011 > puts texts
2.2.0 :012?> end
/home/ziya/Desktop/etc3/example_folder/ex4.txt
/home/ziya/Desktop/etc3/example_folder/ex3.txt
/home/ziya/Desktop/etc3/example_folder/ex2.txt
/home/ziya/Desktop/etc3/example_folder/ex1.txt
Iteration over the text files works. Next, writing to each of them:
2.2.0 :002 > Dir[target_folder].each do |texts|
2.2.0 :003 > File.open(texts, 'w') {|file| file.write("your content\n")}
2.2.0 :004?> end
Results:
2.2.0 :008 > system ("pwd")
/home/ziya/Desktop/etc3/example_folder
=> true
2.2.0 :009 > system("for f in *.txt; do cat $f; done")
your content
your content
your content
your content

Ignoring multiple header lines in a CSV

I've worked a bit with Ruby's CSV module, but am having some problems getting it to ignore multiple header lines.
Specifically, here are the first twenty lines of a file I want to parse:
USGS Digital Spectral Library splib06a
Clark and others 2007, USGS, Data Series 231.
For further information on spectrsocopy, see: http://speclab.cr.usgs.gov
ASCII Spectral Data file contents:
line 15 title
line 16 history
line 17 to end: 3-columns of data:
wavelength reflectance standard deviation
(standard deviation of 0.000000 means not measured)
( -1.23e34 indicates a deleted number)
----------------------------------------------------
Olivine GDS70.a Fo89 165um W1R1Bb AREF
copy of splib05a r 5038
0.205100 -1.23e34 0.090781
0.213100 -1.23e34 0.018820
0.221100 -1.23e34 0.005416
0.229100 -1.23e34 0.002928
The actual headers are given on the tenth line, and the seventeenth line is where the actual data start.
Here's my code:
require "nyaplot"
# Note that DataFrame basically just inherits from Ruby's CSV module.
class SpectraHelper < Nyaplot::DataFrame
class << self
def from_csv filename
df = super(filename, col_sep: ' ') do |csv|
csv.convert do |field, info|
STDERR.puts "Field is #{field}"
end
end
end
end
def csv_headers
[:wavelength, :reflectance, :standard_deviation]
end
end
def read_asc filename
f = File.open(filename, "r")
16.times do
line = f.gets
puts "Ignoring #{line}"
end
d = SpectraHelper.from_csv(f)
end
The output suggests that my calls to f.gets are not actually ignoring those lines, and I can't understand why. Here are the first few lines of output:
Field is Clark
Field is and
Field is others
Field is 2007,
Field is USGS,
I tried looking for a tutorial or example which shows processing of more complicated CSV files, but haven't had much luck. If someone could point me towards a resource which answers this question, I would be grateful (and would prefer to mark that as accepted over a solution to my specific problem — but both would be appreciated).
Using Ruby 2.1.
I believe that you are using ::open, which uses IO.open. This method will open the file again.
I modified the script a bit
require 'csv'
class SpectraHelper < CSV
def self.from_csv(filename)
df = open(filename, 'r' , col_sep: ' ') do |csv|
csv.drop(16).each {|c| p c}
end
end
end
def read_asc(filename)
SpectraHelper.from_csv(filename)
end
read_asc "data/csv1.csv"
It turns out the problem here was not with my understanding of CSV, but rather with how Nyaplot::DataFrame handles CSV files.
Basically, Nyaplot doesn't actually store things as CSVs. CSV is just an intermediate format. So a simple way to handle the files makes use of @khelli's suggestion:
def read_asc filename
Nyaplot::DataFrame.new(CSV.open(filename, 'r',
col_sep: ' ',
headers: [:wavelength, :reflectance, :standard_deviation],
converters: :numeric).
drop(16).
map do |csv_row|
csv_row.to_h.delete_if { |k,v| k.nil? }
end)
end
Thanks, everyone, for the suggestions.
I wouldn't use the CSV module, since your file is not well formatted. The following code will read the file and give you an array of your records:
lines = File.readlines(filename) # reads all lines and closes the file
lines.slice!(0,16)
records = lines.map {|line| line.chomp.split}
The records output:
[["0.205100", "-1.23e34", "0.090781"], ["0.213100", "-1.23e34", "0.018820"], ["0.221100", "-1.23e34", "0.005416"], ["0.229100", "-1.23e34", "0.002928"]]
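For what it's worth, the original idea of calling gets to skip lines does work with the plain CSV class, because CSV.new reads from wherever the IO is currently positioned. A small sketch under stated assumptions: the sample data is invented, only 2 header lines are skipped instead of 16, and columns are separated by exactly one space (runs of spaces, as the real file may have, would produce empty fields).

```ruby
require "csv"
require "stringio"

# Made-up data: two header lines followed by space-separated numbers.
io = StringIO.new(<<~DATA)
  title line
  history line
  0.205100 -1.23e34 0.090781
  0.213100 -1.23e34 0.018820
DATA

2.times { io.gets }  # consume the header lines from the same IO object

# CSV.new continues from the current position of the IO.
rows = CSV.new(io, col_sep: " ", converters: :numeric).read
p rows
```

With a real file you would pass the File object you called gets on, rather than the file name, so that it is not reopened from the start.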

File system crawler - iteration bugs

I'm currently building a file system crawler with the following code:
require 'find'
require 'spreadsheet'
Spreadsheet.client_encoding = 'UTF-8'
count = 0
Find.find('/Users/Anconia/crawler/') do |file|
if file =~ /\b.xls$/ # check if filename ends in desired format
contents = Spreadsheet.open(file).worksheets
contents.each do |row|
if row =~ /regex/
puts file
count += 1
end
end
end
end
puts "#{count} files were found"
And am receiving the following output:
0 files were found
The regex is tested and correct - I currently use it in another crawler that works.
The output of row.inspect is
#<Spreadsheet::Excel::Worksheet:0x003ffa5d418538 @row_addresses= @default_format= @selected= @dimensions= @name=Sheet1 @workbook=#<Spreadsheet::Excel::Workbook:0x007ff4bb147140> @rows=[] @columns=[] @links={} @merged_cells=[] @protected=false @password_hash=0 @changes={} @offsets={} @reader=#<Spreadsheet::Excel::Reader:0x007ff4bb1f3b98> @ole=#<Ole::Storage::RangesIOMigrateable:0x007ff4bb126fa8> @offset=15341 @guts={} @rows[3]>
Certainly nothing to iterate over.
Try this:
content = Spreadsheet.open(file)
sheet = content.worksheet 0
sheet.each do |row|
...
As Diego mentioned, I should have been iterating over the rows of a worksheet rather than over the worksheets themselves - really appreciate the clarification! It should also be noted that row must be converted to a string before it can be matched against the regex.
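Two smaller points can be sketched without the spreadsheet gem. First, the extension check /\b.xls$/ has an unescaped dot, which matches any character; File.extname is a stricter alternative. Second, a row behaves like an array of cells, so it needs joining into a string before a regex match. The rows and the pattern below are hypothetical stand-ins, not data from the crawler.

```ruby
# The original check: the unescaped dot accepts any character before "xls".
loose = /\b.xls$/
p !!("report-xls" =~ loose)  # true, although this is not an .xls file

# Stricter check based on the actual file extension.
def xls_file?(path)
  File.extname(path) == ".xls"
end

p xls_file?("/Users/Anconia/crawler/book.xls")
p xls_file?("report-xls")

# Hypothetical rows standing in for worksheet rows (arrays of cell values).
rows = [["Alice", 42, nil], ["Bob", "n/a", 3.5]]
pattern = %r{\bn/a\b}

# Join the cells into one string before matching against the pattern.
count = rows.count { |row| row.join(" ").match?(pattern) }
p count
```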

Ruby: How to replace text in a file?

The following code is a line in an xml file:
<appId>455360226</appId>
How can I replace the number between the 2 tags with another number using ruby?
There is no way to modify a file's content in one step (at least none I know of when the file size would change).
You have to read the file and store the modified text in another file.
replace="100"
infile = "xmlfile_in"
outfile = "xmlfile_out"
File.open(outfile, 'w') do |out|
out << File.open(infile).read.gsub(/<appId>\d+<\/appId>/, "<appId>#{replace}</appId>")
end
Or you read the file content into memory and afterwards overwrite the file with the modified content:
replace="100"
filename = "xmlfile_in"
outdata = File.read(filename).gsub(/<appId>\d+<\/appId>/, "<appId>#{replace}</appId>")
File.open(filename, 'w') do |out|
out << outdata
end
(Hope it works, the code is not tested)
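Since the code above is marked as untested, here is a small self-contained sketch of the second (read, then overwrite) variant that can be run against a throwaway file. The helper name replace_app_id and the sample XML are made up for the example.

```ruby
require "tmpdir"

# Reads the whole file, swaps the number inside <appId>, writes it back.
def replace_app_id(path, new_id)
  text = File.read(path)
  updated = text.gsub(%r{<appId>\d+</appId>}, "<appId>#{new_id}</appId>")
  File.write(path, updated)
  updated
end

result = Dir.mktmpdir do |dir|
  path = File.join(dir, "sample.xml")
  # Made-up sample file containing the line from the question.
  File.write(path, "<app>\n  <appId>455360226</appId>\n</app>\n")
  replace_app_id(path, "100")
  File.read(path)
end
puts result
```

The whole file is held in memory, which is fine for small XML files but not for very large ones.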
You can do it in one line like this:
IO.write(filepath, File.open(filepath) {|f| f.read.gsub(/<appId>\d+<\/appId>/, "<appId>42</appId>")})
IO.write truncates the given file by default, so if you read the text first, perform the regex String#gsub and return the resulting string using File.open in block mode, it will replace the file's content in one fell swoop.
I like the way this reads, but it can be written in multiple lines too of course:
IO.write(filepath, File.open(filepath) do |f|
f.read.gsub(/<appId>\d+<\/appId>/, "<appId>42</appId>")
end
)
replace="100"
File.open("xmlfile").each do |line|
if line[/<appId>/ ]
line.sub!(/<appId>\d+<\/appId>/, "<appId>#{replace}</appId>")
end
puts line
end
The right way is to use an XML parsing tool, an example of which is XmlSimple.
You did tag your question with regex. If you really must do it with a regex, then
s = "Blah blah <appId>455360226</appId> blah"
s.sub(/<appId>\d+<\/appId>/, "<appId>42</appId>")
is an illustration of the kind of thing you can do but shouldn't.
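As a rough illustration of the parser route, here is a sketch using REXML rather than XmlSimple (REXML is my substitution; it ships with Ruby as a bundled gem in recent versions). The surrounding config element is invented to make a complete document.

```ruby
require "rexml/document"

# Made-up document wrapping the line from the question.
xml = "<config><appId>455360226</appId></config>"

doc = REXML::Document.new(xml)

# Set the text of every appId element, wherever it sits in the tree.
doc.elements.each("//appId") { |el| el.text = "42" }

updated = doc.to_s
puts updated
```

Unlike the regex, this keeps working if the tag gains attributes or the number is wrapped in whitespace.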

Strange number conversion while reading a csv file with ruby

I've got a strange problem in Ruby on Rails.
There is a CSV file, made with Excel 2003:
5437390264172534;Mark;5
I have a page with an upload input, and I read the file like this:
file = params[:upload]['datafile']
file.read.split("\n").each do |line|
num,name,type = line.split(";")
logger.debug "row: #{num} #{name} #{type}"
end
etc
So, finally I get the following:
num = 5437...2534
name = Mark
type = 5
Why does num have such a strange value?
I also tried this:
str = file.read
csv = CSV.parse(str)
csv.each do |line|
RAILS_DEFAULT_LOGGER.info "######## #{line.to_yaml}"
end
but again I got
######## ---
- !str:CSV::Cell "5437...2534;Mark;5"
The CSV file is in win1251 (I can't change the file encoding)
The Ruby file is in UTF-8
Ruby version 1.8.4
Rails version 2.0.2
If it indeed has a strange value, it probably has to do with the code you didn't post. Edit your question, and include the smallest bit of code that will run independently and still produce your questionable output.
split() returns an array of strings. So the first value of your CSV file is a String, not a Bignum. Maybe you need num.to_i, or a test like num.is_a?(Bignum) somewhere in your code.
file = File.open("test.csv", "r")
# Just getting the first line
line = file.gets
num,name,type = line.split(";")
# split() returns an array of String
puts num.class
puts num
# Make num a number
puts num.to_i.class
puts num.to_i
file.close
Running that file here gives me this:
$ ruby test.rb
String
5437390264172534
Bignum
5437390264172534
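On a current Ruby (1.9 or later, so not the 1.8.4 from the question), the same row can also be handled with the CSV class by setting col_sep to ";" and converting from Windows-1251 first. This is only a sketch: the Cyrillic name bytes are a hypothetical stand-in for non-ASCII content, since the sample row in the question is plain ASCII.

```ruby
require "csv"

# Hypothetical raw upload: the name field holds four Windows-1251 bytes
# (a Cyrillic name), followed by a Windows line ending.
raw = "5437390264172534;\xCC\xE0\xF0\xEA;5\r\n".b

# Label the bytes with their real encoding, then convert to UTF-8.
text = raw.force_encoding("Windows-1251").encode("UTF-8")

rows = CSV.parse(text, col_sep: ";")
num, name, type = rows.first

p num.to_i        # fields arrive as Strings; convert explicitly
p name.encoding
p type.to_i
```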
