Finding lines in a text file matching a regular expression - ruby

Can anyone explain how I could use regular expressions in Ruby to return only the lines that match a given string?
For example, if the code reads in a .txt file with a series of names in it:
John Smith
James Jones
David Brown
Tom Davidson
etc etc
...and the word to match is typed in as 'ohn', it would then return just 'John Smith' and none of the other names.

Note: In modern Rubies, use IO.foreach (or the equivalent File.foreach) rather than opening the file yourself and calling each_line. For instance:
[1] pry(main)> IO.foreach('./.bashrc') do |l|
[1] pry(main)* puts l
[1] pry(main)* end
export PATH=~/bin:$PATH
export EDITOR='vi'
export VISUAL=$EDITOR
Progress happens and things change.
Here are some different ways to get where you're going.
Notice first that I'm using a more idiomatic way of writing the code for reading lines from a file. Ruby's IO and File classes make it very easy to open, read, and close the file in one neat package.
File.foreach('file.txt') do |li|
  puts li if li['ohn']
end
That looks for 'ohn' anywhere in the line, but doesn't bother with a regular expression.
File.foreach('file.txt') do |li|
  puts li if li[/ohn/]
end
That looks for the same string, only it uses a regex to get there. Functionally it's the same as the first example.
File.foreach('file.txt') do |li|
  puts li if li[/ohn\b/]
end
This is a smarter way of looking for names that end in 'ohn'. It uses a regex, but also specifies that the pattern has to occur at the end of a word; \b means "word boundary".
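Putting those pieces together, here's a minimal sketch (the file name and query value are assumptions) that matches a user-typed word, escaping it so any regex metacharacters in the input are treated literally:
query = 'ohn'
File.foreach('names.txt') do |line|
  puts line if line[/#{Regexp.escape(query)}/]
end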
Also, when reading files, it's important to always think ahead about whether the file being read could ever exceed the RAM available to your app. It's easy to read an entire file into memory in one pass, then process it from RAM, but you can cripple or kill your app or machine if you exceed the physical RAM available to you.
Do you know if the code shown by the other answers is in fact loading the entire file into RAM or is somehow optimized by streaming from the readlines function to the select function?
From the IO.readlines documentation:
Reads the entire file specified by name as individual lines, and returns those lines in an array. Lines are separated by sep.
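So yes, readlines slurps: it builds the whole array before select runs. By contrast, File.foreach without a block returns an enumerator, so a chained select only ever holds the matching lines ('names.txt' is a placeholder):
# Slurps every line into an array, then filters:
matches = File.readlines('names.txt').select { |l| l[/ohn/] }

# Streams line by line; only the matches are retained:
matches = File.foreach('names.txt').select { |l| l[/ohn/] }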
An additional consideration is memory allocation during a large, bulk read. Even if you have sufficient RAM, you can run into situations where a language chokes as it reads in the data, finds it hasn't allocated enough memory to the variable, and has to pause as it grabs more. That cycle repeats until the entire file is loaded.
I became sensitive to this many years ago when I was loading a very big data file into a Perl app on HP's biggest mini, which I managed. The app would pause for a couple of seconds periodically and I couldn't figure out why. I dropped into the debugger and couldn't find the problem. Finally, by tracing the run using old-school print statements, I isolated the pauses to a file "slurp". I had plenty of RAM and plenty of processing power, but Perl wasn't allocating enough memory. I switched to reading line by line and the app flew through its processing. Ruby, like Perl, has good I/O, and can read a big file very quickly when it's reading line by line. I have never found a good reason for slurping a text file, except when it's possible to have content I want spread across several lines, but that is not a common occurrence.

Maybe I'm not understanding the problem fully, but you could do something like:
File.readlines("path/to/file.txt").select { |line| line =~ /ohn/ }
to get an array of all the lines that match your criteria.

query = 'ohn'
names = File.readlines('names.txt')
matches = names.select { |name| name[/#{query}/i] }
#=> ["John Smith"]
Remove the i at the end of the regex if you wish the query to be case sensitive.

Old question, but Array#grep can also be used to search a list of strings:
File.readlines("names.txt").grep /#{query}/i

Related

Difference between two ways of working with pipes using ARGF?

Using ARGF I can create Ruby programs that respect pipelines. Suppose I want to constantly read new entries:
$ tail -f log/test.log | my_prog
I can do this using:
ARGF.each_line do |line|
...
end
Also, I found another way:
while input = ARGF.gets
input.each_line do |line|
...
end
end
It looks like both variants do the same thing, or is there a difference between them? If so, what is it?
Thanks in advance.
As Stefan mentioned, you made a small mistake in the second case. The proper way of using the ARGF.gets approach in your case would look like:
while input = ARGF.gets
# input here represents a line
end
If you rewrite the second example as above, you will not see any difference in behavior.
The actual difference between ARGF#gets and ARGF#each_line is in their semantics: each_line accepts a block (or returns an enumerator), while gets returns the next line if one is available.
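A small sketch of that difference (input on stdin is assumed):
# each_line without a block returns an Enumerator over the remaining lines:
enum = ARGF.each_line
first = enum.next  # raises StopIteration at end of input

# gets simply returns the next line, or nil at end of input:
line = ARGF.gets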
Another option is to use Kernel#gets. Beware: its behavior may differ from ARGF#gets in some cases, especially if you change the separator:
A separator of nil reads the entire contents, and a zero-length separator reads the input one paragraph at a time, where paragraphs are divided by two consecutive newlines.
But for reading (and then printing) constantly from stdin you may use it as follows:
print while gets

Fastest way to check that a PDF is corrupted (Or just missing EOF) in Ruby?

I am looking for a way to check if a PDF is missing an end of file character. So far I have found I can use the pdf-reader gem and catch the MalformedPDFError exception, or of course I could simply open the whole file and check if the last character was an EOF. I need to process lots of potentially large PDF's and I want to load as little memory as possible.
Note: all the files I want to detect will be lacking the EOF marker, so I feel this is a more specific scenario than detecting general PDF "corruption". What is the best, fastest way to do this?
TL;DR
Looking for %%EOF, with or without related structures, is relatively speedy even if you scan the entirety of a reasonably-sized PDF file. However, you can gain a speed boost if you restrict your search to the last kilobyte, or the last 6 or 7 bytes if you simply want to validate that %%EOF\n is the only thing on the last line of a PDF file.
Note that only a full parse of the PDF file can tell you if the file is corrupted, and only a full parse of the File Trailer can fully validate the trailer's conformance to standards. However, I provide two approximations below that are reasonably accurate and relatively fast in the general case.
Check Last Kilobyte for File Trailer
This option is fairly fast, since it only looks at the tail of the file, and uses a string comparison rather than a regular expression match. According to Adobe:
Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.
Therefore, the following will work by looking for the file trailer instruction within that range:
def valid_file_trailer?(filename)
  File.open(filename) { |f| f.seek(-1024, :END); f.read.include?('%%EOF') }
end
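One caveat with that sketch: IO#seek raises Errno::EINVAL when the file is shorter than 1024 bytes. A variant that clamps the offset to the file size (still only an approximation of real validation):
def valid_file_trailer?(filename)
  File.open(filename) do |f|
    f.seek(-[f.size, 1024].min, :END)
    f.read.include?('%%EOF')
  end
end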
A Stricter Check of the File Trailer via Regex
However, the ISO standard is both more complex and a lot more strict. It says, in part:
The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets (<< … >>) (using LESS-THAN SIGNs (3Ch) and GREATER-THAN SIGNs (3Eh)).
Without actually parsing the PDF, you won't be able to validate this with perfect accuracy using regular expressions, but you can get close. For example:
def valid_file_trailer?(filename)
  pattern = /^startxref\n\d+\n%%EOF\n\z/m
  File.open(filename) { |f| !!(f.read.scrub =~ pattern) }
end
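Either version can then be used to sweep a batch of PDFs; the glob pattern here is hypothetical:
bad_pdfs = Dir.glob('pdfs/**/*.pdf').reject { |f| valid_file_trailer?(f) }
puts bad_pdfs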

Line count in csv doesn't match

I have a large CSV with a large number of columns. I am trying to count the number of lines using
File.open(file).readlines.to_a.compact.count.to_i
It displays 57 although there are only 56 rows. Upon close examination I found that a part of one line is wrapped to form the next line. How to get the correct count?
Upon close examination I found that a part of one line is wrapped to form the next line. How to get the correct count?
You need to show an example of the incoming data if you want us to help beyond generic answers.
To fix the problem, you have to be able to identify the offending line. We can't help you there because it could look like anything. Making a wild guess, I'd say one of the columns has an embedded new-line in it, which forces the line to wrap.
If the file is a true CSV file, that column should be wrapped in double-quotes, so you could search the file for lines that do NOT end with whatever data should be in the last column, read the next line, join them, and rewrite the file. But, again, we have nothing to work with, because your file's format could be a huge number of different things.
Your best bet is to use the CSV class that comes with Ruby, and let it read the file, instead of trying to treat it like a text file. CSV files are text, but they are formatted to maintain the columns and rows, so using the CSV class will give you a better chance of getting at the data.
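For instance, a minimal sketch using the standard library's CSV class ('data.csv' is a placeholder). The parser treats a quoted, embedded newline as part of its field, so the row count comes out right:
require 'csv'

row_count = CSV.foreach('data.csv', headers: true).count
puts row_count  # number of data rows, header excluded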
Looking at your code:
There are a number of ways to count the number of lines in a file, including the easiest, which is:
`wc -l /path/to/file`.to_i
if you're using *nix.
Using File.open(file).readlines.to_a is horribly redundant and not fast or scalable if your file is big.
readlines returns an array.
to_a returns an array.
Why turn the array into an array?
readlines loads an entire file into memory, then splits it on line ends into an array. That process can be a lot slower than simply reading the file line-by-line and incrementing a counter, plus "slurping" can make your program crawl if the file is larger than available memory.
See "Why is "slurping" a file not a good practice?" for more information.
compact removes nils from an array. readlines should never return any nils so compact will iterate over the array looking for something that shouldn't exist.
count returns an integer.
to_i converts the receiver to an integer.
In other words, to_i is turning an integer into an integer. Why?
If you want to do it in Ruby instead of using wc -l, do something simple and fast:
lines_in_file = 0
File.foreach(some_file) { lines_in_file += 1 }
After running that, lines_in_file will contain the number of lines read. Memory won't be impacted and it'll run like blue blazes on huge files.

Most reliable way to get text into ruby script

I have a Ruby script that’ll do some text parsing (à la markdown). It does it in a sequence of steps, like
string = string.gsub # more code here
string = string.gsub # more code here
# and so on
What is the best (i.e. most reliable) way to feed text into string in the first place? It’s a script, and the text it’ll be fed can vary a lot: it can be multilingual, have characters that might trip up a shell (like ", ', ’, &, $; you get the idea), and will likely be multi-line.
Is there some trick along the lines of
cat << EOF
bunch of text here
EOF
Additional considerations
I’m not looking for a markdown parser, this is something I want to do, not something I want a tool for.
I’m not a big ruby user (I’m starting to use it), so the more detailed the answer you can provide, the better.
It must be completely scriptable (i.e., no interrupting to ask the user for information).
The Kernel#gets method reads a record, delimited by the record separator, from stdin or from files specified on the command line. So if you use that, you can do things like:
yourscript <filename  # read from filename
yourscript file1 file2  # read both file1 and file2
yourscript  # lets you type at your script
So to run something like:
cat <<'eof' |ruby yourscript.rb
This' & will $all 'eof' be 'fine'''
eof
Script might contain something like:
s = gets() # read a line
lines = readlines() # read all lines into an array
That's fairly standard for command-line scripts. If you want to have a user-interface then you'll want something more complex. There is an option to the Ruby interpreter to set the encoding of files as they are read.
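That option is presumably the -E (--encoding) switch, which sets the default external (and optionally internal) encoding for everything the script reads, e.g.:
ruby -E UTF-8 yourscript.rb <filename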
Just read from stdin (which is an IO object):
$stdin.read
As you can see, stdin is provided in the global variable $stdin. Since it’s an IO object, there are a lot of other methods available if read doesn’t suit your needs.
Here’s a simple one-line example in the shell:
$ echo "foo\nbar" | ruby -e 'puts $stdin.read.upcase'
FOO
BAR
Obviously reading from stdin is extremely flexible since you can pipe input in from anywhere.
Ruby is very adept at encodings (see e.g. the Encoding docs). To get text into Ruby, one typically uses gets, reads File objects, or uses a GUI, which one can build with the gtk2 gem or rugui (if it is finished by now). If you are getting text from the wild internet, security should be a concern. Ruby used to have four $SAFE levels, but after some discussion there may now be only three of them left. In any case, the best strategy for handling strings is to know as much as possible in advance about the properties of the strings you expect. Handling absolutely arbitrary strings is a surprisingly difficult task. Try to limit the number of possible encodings, and figure out the maximum size of string you expect.
Also, with respect to your stated goal of writing something like a markdown processor, you might not want to reinvent the wheel (unless it is for didactic purposes). There is this SO post:
Better ruby markdown interpreter?
The answer will direct you to kramdown gem, which gets a lot of praise, though I have not tried it personally.

Is Regex faster than array comparison in this case?

Say I have an incoming string that I want scan to see if it contains any of the words I have chosen to be "bad." :)
Is it faster to split the string into an array, keep the bad words in an array as well, and then iterate through each bad word and each incoming word to see if there's a match, kind of like:
badwords.each do |badword|
  incoming.each do |word|
    trigger = true if badword == word
  end
end
OR is it faster to do this:
incoming.each do |word|
  trigger = true if badwords.include? word
end
OR is it faster to leave the string as it is and run a .match() with a regex that looks something like:
/\bbadword1\b|\bbadword2\b|\bbadword3\b/
Or is the performance difference almost completely negligible? Been wondering this for a while.
You're giving the regex an advantage by not stopping your loop when it finds a match. Try:
incoming.find { |word| badwords.include?(word) }
My money is still on the regex, though, which should be simplified to:
/\b(badword1|badword2|badword3)\b/
or to make it a fair fight:
/\A(badword1|badword2|badword3)\z/
Once it is compiled, the regex is the fastest in real life (i.e., a really long incoming string, many similar bad words, etc.), since it can run on incoming in situ and will handle overlapping parts of your "bad words" really well.
The answer probably depends on the number of bad words to check: if there is only one bad word, it probably doesn't make a huge difference; if there are 50, checking an array would probably get slow. On the other hand, with tens or hundreds of thousands of words, the regexp probably won't be too fast either.
If you need to handle large numbers of bad words, you might want to consider splitting the string into individual words and then using a Bloom filter to test whether each word is likely to be bad or not.
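For the curious, here's a toy Bloom filter sketch (the bit count and number of hashes are illustrative, not tuned). include? can return a false positive, but never a false negative:
require 'digest'

class BloomFilter
  def initialize(bits = 10_000, hashes = 3)
    @bits = Array.new(bits, false)
    @hashes = hashes
  end

  # Derive several bit positions per word from salted SHA-256 digests.
  def indexes(word)
    (1..@hashes).map { |i| Digest::SHA256.hexdigest("#{i}:#{word}").to_i(16) % @bits.size }
  end

  def add(word)
    indexes(word).each { |i| @bits[i] = true }
  end

  def include?(word)
    indexes(word).all? { |i| @bits[i] }
  end
end

filter = BloomFilter.new
%w[badword1 badword2].each { |w| filter.add(w) }
filter.include?('badword1')  # => true
filter.include?('hello')     # => false (almost certainly)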
This does not exactly answer your question, but it will definitely help you solve it.
Take some examples of what you are trying to achieve and put them through benchmarks. You can find out how to do benchmarking in Ruby here:
http://ruby.about.com/od/tasks/f/benchmark.htm
http://ruby-doc.org/stdlib-1.9.3/libdoc/benchmark/rdoc/Benchmark.html
Just put the various forms between report blocks, get the benchmarks, and decide for yourself what suits you best.
For better results, use real data in your tests.
Benchmarks are always better than discussions :)
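A hypothetical harness along those lines (the word lists and iteration count are made up; swap in your real data):
require 'benchmark'

badwords = %w[badword1 badword2 badword3]
incoming = 'a long incoming string with badword2 buried in it'.split
incoming_str = incoming.join(' ')
pattern = /\b(?:#{badwords.join('|')})\b/
n = 100_000

Benchmark.bm(12) do |x|
  x.report('nested each') do
    n.times do
      trigger = false
      badwords.each { |b| incoming.each { |w| trigger = true if b == w } }
    end
  end
  x.report('include?') do
    n.times { incoming.any? { |w| badwords.include?(w) } }
  end
  x.report('regex') do
    n.times { pattern =~ incoming_str }
  end
end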
If you want to scan a string for occurrences of words, use scan to find them.
Use Regexp.union to build a pattern that will find the strings in your black-list. You will want to wrap the result with \b to force matching word-boundaries, and use a case-insensitive search.
To give you an idea of how Regexp.union can help:
words = %w[foo bar]
Regexp.union(words)
=> /foo|bar/
'Daniel Foo killed him a bar'.scan(/\b#{Regexp.union(words)}\b/i)
=> ["bar"]
Note that interpolating a Regexp embeds its own options as (?-mix:...), which overrides the outer /i here, so "Foo" is not matched.
You could also build the pattern using Regexp.new or /.../ if you want a bit more control:
Regexp.new('\b(?:' + words.join('|') + ')\b', Regexp::IGNORECASE)
=> /\b(?:foo|bar)\b/i
/\b(?:#{words.join('|')})\b/i
=> /\b(?:foo|bar)\b/i
'Daniel Foo killed him a bar'.scan(/\b(?:#{words.join('|')})\b/i)
=> ["Foo", "bar"]
As a word of advice, black-listing words you find offensive is easily defeated by users, and often gives wrong results because many "offensive" words are only offensive in a certain context. A user can deliberately misspell them or use "l33t" speak, and has an almost inexhaustible supply of alternate spellings that will make you constantly update your list. It's a source of enjoyment for some people to fool such a system.
I was once given a similar task and wrote a translator to supply alternate spellings for "offensive" words. I started with a list of words and terms I'd gleaned from the Internet and set my code running. After several million alternates had been added to the database, I pulled the plug and showed management it was a fool's errand because it was trivial to fool it.
