How to check for multiple words inside a folder - Ruby

I have words in a text file called words.txt, and I need to check whether any of those words appear in my Source folder, which also contains sub-folders and files.
I was able to get all of the words into an array using this code:
array_of_words = []
File.readlines('words.txt').map do |word|
  array_of_words << word
end
And I also have (kinda) figured out how to search through the whole Source folder, including its sub-folders and their files, for a specific word using:
Dir['Source/**/*'].select { |f| File.file?(f) }.each do |filepath|
  puts filepath
  puts File.readlines(filepath).any? { |l| l['api'] }
end
Instead of searching for one word like api, I want to search the Source folder for the whole array of words (if that is possible).

Consider this:
File.readlines('words.txt').map do |word|
  array_of_words << word
end
will read the entire file into memory, then convert it into individual elements in an array. You could accomplish the same thing using:
array_of_words = File.readlines('words.txt')
A potential problem is that it's not scalable: if words.txt is larger than the available memory your code will have problems, so be careful.
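If words.txt could ever be big, a minimal sketch using foreach builds the same array without slurping (assuming one word per line):
array_of_words = []
# Read words.txt line by line; chomp strips the trailing newline
File.foreach('words.txt') do |line|
  array_of_words << line.chomp
end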
Searching a file for an array of words can be done a number of ways, but I've always found it easiest to use a regular expression. Perl has a great module called Regexp::Assemble that makes it easy to convert a list of words into a very efficient pattern, but Ruby is missing that sort of functionality. See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for one solution I put together in the past to help with that.
Ruby does have Regexp.union, however it's only a partial help.
words = %w(foo bar)
re = Regexp.union(words) # => /foo|bar/
The pattern generated has flags for the expression so you have to be careful with interpolating it into another pattern:
/#{re}/ # => /(?-mix:foo|bar)/
(?-mix: will cause you problems so don't do that. Instead use:
/#{re.source}/ # => /foo|bar/
which will generate the pattern and behave like we expect.
Unfortunately, that's not a complete solution either, because the words could be found as sub-strings in other words:
'foolish'[/#{re.source}/] # => "foo"
The way to work around that is to set word-boundaries around the pattern:
/\b(?:#{re.source})\b/ # => /\b(?:foo|bar)\b/
which then looks for whole words:
'foolish'[/\b(?:#{re.source})\b/] # => nil
More information is available in Ruby's Regexp documentation.
Once you have a pattern you want to use, it becomes a simpler matter to search. Ruby has the Find module, which makes it easy to recursively search directories for files. The documentation covers how to use it.
Alternately, you can cobble your own method using the Dir class. Again, it has examples in the documentation to use it, but I usually go with Find.
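For instance, here's a minimal sketch of using Find to collect every file under the Source directory from your example:
require 'find'

files = []
# Find.find walks the tree recursively, yielding every path it encounters
Find.find('Source') do |path|
  files << path if File.file?(path)
end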
When reading the files you're scanning I'd recommend using foreach to read the files line-by-line. File.read and File.readlines are not scalable and can make your program behave erratically as Ruby tries to read a big file into memory. Instead, foreach will result in very scalable code that runs more quickly. See "Why is "slurping" a file not a good practice?" for more information.
Using the links above you should be able to put something together quickly that'll run efficiently and be flexible.
This untested code should get you started:
WORD_ARRAY = File.readlines('words.txt').map(&:chomp)
WORD_RE = /\b(?:#{Regexp.union(WORD_ARRAY).source})\b/
Dir['Source/**/*'].select { |f| File.file?(f) }.each do |filepath|
  puts "#{filepath}: #{!!File.read(filepath)[WORD_RE]}"
end
It will output the file it's reading, and "true" or "false" depending on whether one of the words in the list was found in it.
It's not scalable because of readlines and read, and could suffer serious slowdowns if any of the files are huge. Again, see the caveats in the "slurp" link above.
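As a sketch of the more scalable variation, here's untested code combining Find and foreach so no file is slurped; it assumes the files under Source are text:
require 'find'

WORD_ARRAY = File.readlines('words.txt').map(&:chomp)
WORD_RE = /\b(?:#{Regexp.union(WORD_ARRAY).source})\b/

Find.find('Source') do |path|
  next unless File.file?(path)
  # foreach reads line-by-line; any? stops at the first matching line
  found = File.foreach(path).any? { |line| line[WORD_RE] }
  puts "#{path}: #{found}"
end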

This recursively searches the directory tree for any of the words contained in words.txt:
re = /#{File.readlines('words.txt').map { |word| Regexp.quote(word.strip) }.join('|')}/
Dir['Source/**/*.{cpp,txt,html}'].select { |f| File.file?(f) }.each do |filepath|
  puts filepath
  puts File.readlines(filepath, encoding: 'ascii').grep(re).any?
end

Related

Delete Duplicate Lines Ruby

I'm working on a JSON file, I think. Regardless, I'm working with a lot of different hashes, fetching different values, etc. This is the kind of data:
{"notification_rule"=>
{"id"=>"0000000",
"contact_method"=>
{"id"=>"000000",
"address"=>"cod.lew#gmail.com",}
{"notification_rule"=>
{"id"=>"000000"
"contact_method"=>
{"id"=>"PO0JGV7",
"address"=>"cod.lew#gmail.com",}
Essentially, this is the type of hash I'm currently working with. I wanted to stop duplicates of the same thing appearing in the text file, because whenever I run this code it brings in the address from both of these hashes. I understand why: it's looping over again. I thought the code I added would help resolve that issue:
Final UPDATE
if jdoc["notification_rule"]["contact_method"]["address"].to_s.include?(".com")
numbers.print "Employee Name: "
numbers.puts jdoc["notification_rule"]["contact_method"]["address"].gsub(/#target.com/, '').gsub(/\w+/, &:capitalize)
file_names = ['Employee_Information.txt']
file_names.each do |file_name|
text = File.read(file_name)
lines = text.split("\n")
new_contents = lines.uniq.join("\n")
File.open(file_name, "w") { |file| file.puts new_contents }
end
else
nil
end
This code looks really confused and lacking a specific purpose. Generally Ruby that's this tangled up is on the wrong track, as with Ruby there's usually a simple way of expressing something simple, and testing for duplicated addresses is one of those things that shouldn't be hard.
One of the biggest sources of confusion is the responsibility of a chunk of code. In that example you're not only trying to import data, loop over documents, clean up email addresses, and test for duplicates, but somehow facilitate printing out the results. That's a lot of things going on all at once, and they all have to work perfectly for that chunk of code to be fully operational. There's no way of getting it partially working, and no way of knowing if you're even on the right track.
Always try and break down complex problems into a few simple stages, then chain those stages together as necessary.
Here's how you can define a method to clean up your email addresses:
def address_scrub(address)
  address.gsub(/@target\.com/, '').gsub(/\w+/, &:capitalize)
end
Where that can be adjusted as necessary, and presumably tested to ensure it's working correctly, which you can now do independently of the other code.
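For example, a quick sanity check (the sample address is invented for illustration):
address_scrub("cod.lew@target.com") # => "Cod.Lew"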
As for the rest, it looks like this:
require 'set'

# Read in duplicated addresses from a file, clean up with chomp, using a Set
# for fast lookups.
duplicates = Set.new(
  File.open("Employee_Information.txt", "r").readlines.map(&:chomp)
)

# Extract addresses from jdoc document array
filtered = jdocs.map do |jdoc|
  # Convert to jdoc/address pair
  [jdoc, address_scrub(jdoc["notification_rule"]["contact_method"]["address"])]
end.reject do |jdoc, address|
  # Remove any that are already in the duplicates list
  duplicates.include?(address)
end.map do |jdoc, _|
  # Return only the document
  jdoc
end
Where that processes jdocs, an array of jdoc structures, and removes duplicates in a series of simple steps.
With the chaining approach you can see what's happening before you add on the next "link", so you can work incrementally towards a solution, adjusting as you go. Any mistakes are fairly easy to catch because you're able to, at any time, inspect the intermediate products of those stages.

Regex in Ruby for a URL that is an image

So I'm working on a crawler to get a bunch of images on a page that are saved as links. The relevant code, at the moment, is:
def parse_html(html)
  html_doc = Nokogiri::HTML(html)
  nodes = html_doc.xpath("//a[@href]")
  nodes.inject([]) do |uris, node|
    uris << node.attr('href').strip
  end.uniq
end
I am currently getting a bunch of links, most of which are images, but not all. I want to narrow down the links before downloading with a regex. So far, I haven't been able to come up with a Ruby-friendly regex for the job. The best I have is:
^https?:\/\/(?:[a-z0-9\-]+\.)+[a-z]{2,6}(?:/[^\/?]+)+\.(?:jpg|gif|png)$.match(nodes)
Admittedly, I got that regex from someone else and tried to edit it to work, and I'm failing. One of the big problems I'm having is that the original regex I took had a few "#"s in it, and I don't know if that is a character I can escape, or if Ruby is just going to stop reading at that point. Help much appreciated.
I would consider modifying your XPath to include your logic. For example, if you only wanted the a elements that contained an img you can use the following:
"//a[img][#href]"
Or even go further and extract just the URIs directly from the href values:
uris = html_doc.xpath("//a[img]/@href").map(&:value)
As some have said, you may not want to use Regex for this, but if you're determined to:
^http(s?):\/\/.*\.(jpeg|jpg|gif|png)
Is a pretty simple one that will grab anything beginning with http or https and ending with one of the file extensions listed. You should be able to figure out how to extend this one, Rubular.com is good for experimenting with these.
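For example (sample URLs invented for illustration):
re = /^http(s?):\/\/.*\.(jpeg|jpg|gif|png)/
# match? returns true/false (Ruby 2.4+)
re.match?('https://example.com/images/cat.png') # => true
re.match?('https://example.com/page.html')      # => false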
Regexps are a very powerful tool but, compared to simple string comparisons, they are pretty slow.
For your simple example, I would suggest using a simple condition like:
IMAGE_EXTS = %w[gif jpg png]
if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
  # ...
end
In the context of your question, you might want to change your method to:
IMAGE_EXTS = %w[gif jpg png]
def parse_html(html)
  uris = []
  Nokogiri::HTML(html).xpath("//a[@href]").each do |node|
    uri = node.attr('href').strip
    uris << uri if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
  end
  uris.uniq
end
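A quick smoke test of that method (the HTML snippet is made up):
require 'nokogiri'

html = '<a href="http://example.com/cat.png">cat</a><a href="http://example.com/page.html">page</a>'
parse_html(html) # => ["http://example.com/cat.png"]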

Parsing a big string in Ruby

I have a file of a few hundred megabytes containing strings:
str1 x1 x2\n
str2 xx1 xx2\n
str3 xxx1 xxx2\n
str4 xxxx1 xxxx2\n
str5 xxxxx1 xxxxx2
where x1 and x2 are some numbers. How big the numbers x(...x)1 and x(...x)2 are is unknown.
Each line has a "\n" in it. I have a list of the strings str2 and str4.
I want to find the corresponding numbers for those strings.
What I'm doing is pretty straightforward (and, probably, not efficient performance-wise):
source_str = read_from_file() # source_str contains the whole file content, a few hundred megabytes
str_to_find = [str2, str4]
res = []
str_to_find.each do |x|
  index = source_str.index(x)
  if index
    a = source_str[index..index + x.length] # a contains "str2"
    # ?? how do I "select" xx1 and xx2 ??
    # and finally...
    # res << num1
    # res << num2
  end
end
Note that I can't apply source_str.split("\n") due to the error ArgumentError: invalid byte sequence in UTF-8 and I can't fix it by changing a file in any way. The file can't be changed.
You want to avoid reading hundreds of megabytes into memory, as well as scanning them repeatedly. This has the potential of taking forever, while clogging the machine's available memory.
Try to re-frame the problem, so you can treat the large input file as a stream, so instead of asking for each string you want to find "does it exist in my file?", try asking for each line in the file "does it contain a string I am looking for?".
str_to_find = [str2, str4]
numbers = []
File.foreach('foo.txt') do |li|
  columns = li.split
  numbers += columns if str_to_find.include?(columns.shift)
end
Also, read again @theTinMan's answer regarding the file encoding: what he is suggesting is that you may be able to fine-tune the reading of the file to avoid the error, without changing the file itself.
If you have a very large number of items in str_to_find, I'd suggest that you use a Set instead of an Array for better performance:
str_to_find = [str1, str2, ... str5000].to_set
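Combining the two ideas, a sketch of the loop with a Set (the file name and search strings are placeholders):
require 'set'

str_to_find = Set['str2', 'str4']
numbers = []
File.foreach('foo.txt') do |li|
  columns = li.split
  # shift removes the leading string; what's left are the two numbers
  numbers += columns if str_to_find.include?(columns.shift)
end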
If you want to find a line in a text file, which it sounds like you are reading, then read the file line-by-line.
The IO class has the foreach method, which makes it easy to read a file line-by-line, which also makes it possible to easily locate lines that contain the particular string you want to find.
If you had your source input file saved as "foo.txt", you could read it using something like:
str2 = 'some value'
str4 = 'some other value'
numbers = []
File.foreach('foo.txt') do |li|
  numbers += li.split[1..2] if li[str2] || li[str4]
end
At the end of the loop numbers should contain the numbers you want.
You say you're getting an encoding error, but you don't give us any clue what the characters are that are causing it. Without that information we can't really help you fix that problem, except to say you need to tell Ruby what the file encoding is. You can do that when the file is opened; you'd probably set the open_args to whatever the encoding should be. Odds are good it's ISO-8859-1 or Win-1252, since those are very common on Windows machines.
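For example, a sketch that opens a Win-1252 file and transcodes each line to UTF-8 as it's read (the file name is a placeholder):
File.open('foo.txt', 'r:Windows-1252:UTF-8') do |f|
  f.each_line do |line|
    # line arrives here transcoded to valid UTF-8
  end
end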
I have to find a list of values, iterating through each line doesn't seem sensible because I'd have to iterate for each value over and over again.
We can only work with the examples you give us. Since that wasn't clearly explained in your question you got an answer based on what was initially said.
Ruby's Regexp has the tools necessary to make this work, but to do it correctly requires taking advantage of Perl's Regexp::Assemble library, since Ruby has nothing close to it. See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for more information.
Note that this will allow you to scan through a huge string in memory, however that is still not a good way to process what you are talking about. I'd use a database instead, which are designed for this sort of task.

Easier way to search through large files in Ruby?

I'm writing a simple log sniffer that will search logs for specific errors that are indicative of issues with the software I support. It allows the user to specify the path to the log and specify how many days back they'd like to search.
If users have log roll over turned off, the log files can sometimes get quite large. Currently I'm doing the following (though not done with it yet):
File.open(@log_file, "r") do |file_handle|
  file_handle.each do |line|
    if line.match(/\d+-\d+-\d+/)
      etc...
The line.match obviously looks for the date format we use in the logs, and the rest of the logic will be below. However, is there a better way to search through the file without .each_line? If not, I'm totally fine with that. I just wanted to make sure I'm using the best resources available to me.
Thanks
fgrep as a standalone, or called from system('fgrep ...'), may be a faster solution
File.readlines might be faster, but it's a time-space tradeoff
Have a look at this little research; the last approaches seem to be rather fast.
Here are some coding hints...
Instead of:
File.open(@log_file, "r") do |file_handle|
  file_handle.each do |line|
use:
File.foreach(@log_file) do |line|
  next unless line[/\A\d+-\d+-\d+/]
foreach simplifies opening and looping over the file.
next unless... makes a tight loop skipping every line that does NOT start with your target string. The less you do before figuring out whether you have a good line, the faster your code will run.
Using an anchor at the start of your pattern, like \A, gives the regex engine a major hint about where to look in the line, and allows it to bail out very quickly if the line doesn't match. Also, using line[/\A\d+-\d+-\d+/] is a bit more concise.
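Put together, the tightened loop looks something like this (the date pattern is a guess at your log format):
File.foreach(@log_file) do |line|
  next unless line[/\A\d+-\d+-\d+/]
  # handle a matching line here
end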
If your log file is sorted by date, then you can avoid having search through the entire file by doing a binary search. In this case you'd:
Open the file like you are doing
Use seek to fast-forward to the middle of the file.
Check if the date at the beginning of the line is higher or lower than the date you are looking for.
Continue splitting the file in halves until you find what you need.
I do however think your file needs to be very large for the above to make sense.
Edit
Here is some code which shows the basic idea. It finds a line containing the search date, not necessarily the first such line. This can be fixed either by more binary searches, or by doing a linear search from the last midpoint which did not contain the date. There also isn't a termination condition in case the date is not in the file. These small additions are left as an exercise to the reader :-)
require 'date'

def bin_fsearch(search_date, file)
  f = File.open file
  search = { min: 0, max: f.size }
  while true
    # go to file midpoint
    f.seek((search[:max] + search[:min]) / 2)
    # read in until EOL
    f.gets
    # record the actual mid-point we are using
    pos = f.pos
    # read in next line
    line = f.gets
    # get date from line
    line_date = Date.parse(line)
    if line_date < search_date
      search[:min] = f.pos
    elsif line_date > search_date
      search[:max] = pos
    else
      f.seek pos
      return
    end
  end
end
bin_fsearch(Date.new(2013, 5, 4), '/var/log/system.log')
Try this: it will search one line at a time, should be pretty fast, and takes less memory.
File.open(file, 'r') do |f|
  f.each_line do |line|
    # do stuff here to line
  end
end
Another, faster option is to read the whole file into one array. It would be fast, but will take a LOT of memory.
File.readlines(file).each do |line|
  # do stuff with each line
end
Further, if you need the fastest approach with the least amount of memory, try grep, which is specifically tuned for searching through large files, so it should be fast and memory-efficient.
`grep -e regex bigfile`.split(/\n/).each do |line|
  # ... (called on each matching line) ...
end
Faster than line-by-line is reading the file in chunks:
File.open('file.txt') do |f|
  buff = f.read(10240)
  # ...
end
But since you are using a regexp to match dates, you might get incomplete lines at the chunk boundaries. You will have to deal with that in your logic.
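A common way to deal with that is to carry the trailing partial line over into the next chunk; here's a minimal sketch:
File.open('file.txt') do |f|
  leftover = ''
  while (buff = f.read(10240))
    buff = leftover + buff
    # split with -1 keeps a trailing empty string when the chunk ends in \n
    lines = buff.split("\n", -1)
    # the last element may be an incomplete line; save it for the next chunk
    leftover = lines.pop
    lines.each do |line|
      # match dates against complete lines only
    end
  end
  # leftover now holds the final (possibly unterminated) line, if any
end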
Also, if performance is that important, consider writing a really simple C extension.
If the log file can get huge, and that is your concern, then maybe you should consider saving the errors in a database. Then you will get faster responses.

How to make a word list from a text file in a more Ruby way?

I'd like to get a word list from a text file using Ruby. I found how to use regex to parse only words here, so I made a script like following:
src = File.open("text.txt")
word_list = []
src.each do |line|
  word_list << line.downcase.split(/[^[:alpha:]]/).delete_if { |x| x == "" }
end
word_list.flatten!.uniq!.sort!
p word_list
And the following is a sample text file text.txt:
TextMate may be the latest craze for developing Ruby on Rails
applications, but Vim is forever. This plugin offers the following
features for Ruby on Rails application development.
Automatically detects buffers containing files from Rails applications, and applies settings to those buffers (and only those
buffers). You can use an autocommand to apply your own custom
settings as well.
Unintrusive. Only files in a Rails application should be affected; regular Ruby scripts are left untouched. Even when enabled, the
plugin should keep out of your way if you're not using its features.
Easy navigation of the Rails directory structure. gf considers context and knows about partials, fixtures, and much more. There are
two commands, :A (alternate) and :R (related) for easy jumping between
files, including favorites like model to migration, template to
helper, and controller to functional test. For more advanced usage,
:Rmodel, :Rview, :Rcontroller, and several other commands are
provided.
As a Ruby novice, I'd like to learn better (clearer, more concise, and following conventions) solutions for this problem. Thanks for any advice and corrections.
More idiomatic code (using File.foreach to enumerate the lines, since IO#lines is deprecated) would be:
word_list = File.foreach("text.txt")
  .flat_map { |line| line.downcase.split(/[^[:alpha:]]/).reject(&:empty?) }
  .uniq
  .sort
# I suppose you want each line and collect the results
word_list = File.open("text.txt").each_line.collect do |line|
  # collecting is done via collect above, no need anymore
  # .reject(&:empty?) calls .empty? on each element
  line.downcase.split(/[^[:alpha:]]/).reject(&:empty?)
  # you can chain on blocks as well
end.flatten.uniq.sort
# note: non-bang flatten/uniq/sort are used because the bang versions
# return nil when nothing changes, which would break the chain
p word_list
