Delete Duplicate Lines Ruby - ruby

I'm working on a JSON file, I think. Regardless, I'm working with a lot of different hashes and fetching different values from them. This is the data:
{"notification_rule"=>
  {"id"=>"0000000",
   "contact_method"=>
    {"id"=>"000000",
     "address"=>"cod.lew@gmail.com",}
{"notification_rule"=>
  {"id"=>"000000",
   "contact_method"=>
    {"id"=>"PO0JGV7",
     "address"=>"cod.lew@gmail.com",}
Essentially, this is the type of hash I'm currently working with. With my code:
I want to stop duplicates of the same thing from ending up in the text file. Whenever I run this code, it writes the address from both of these hashes. I understand why: it's looping over the documents again. But I thought the code I added would resolve that issue:
Final UPDATE
if jdoc["notification_rule"]["contact_method"]["address"].to_s.include?(".com")
  numbers.print "Employee Name: "
  numbers.puts jdoc["notification_rule"]["contact_method"]["address"].gsub(/@target.com/, '').gsub(/\w+/, &:capitalize)
  file_names = ['Employee_Information.txt']
  file_names.each do |file_name|
    text = File.read(file_name)
    lines = text.split("\n")
    new_contents = lines.uniq.join("\n")
    File.open(file_name, "w") { |file| file.puts new_contents }
  end
else
  nil
end

This code looks really confused and lacking a specific purpose. Generally Ruby that's this tangled up is on the wrong track, as with Ruby there's usually a simple way of expressing something simple, and testing for duplicated addresses is one of those things that shouldn't be hard.
One of the biggest sources of confusion is the responsibility of a chunk of code. In that example you're not only trying to import data, loop over documents, clean up email addresses, and test for duplicates, but somehow facilitate printing out the results. That's a lot of things going on all at once, and they all have to work perfectly for that chunk of code to be fully operational. There's no way of getting it partially working, and no way of knowing if you're even on the right track.
Always try and break down complex problems into a few simple stages, then chain those stages together as necessary.
Here's how you can define a method to clean up your email addresses:
def address_scrub(address)
  # Strip the domain, then capitalize each remaining word
  address.gsub(/@target\.com/, '').gsub(/\w+/, &:capitalize)
end
That can be adjusted as necessary and tested on its own to ensure it's working correctly, independently of the other code.
As for the rest, it looks like this:
require 'set'
# Read in duplicated addresses from a file, clean up with chomp, using a Set
# for fast lookups.
duplicates = Set.new(
  File.readlines("Employee_Information.txt").map(&:chomp)
)
# Extract addresses from jdoc document array
filtered = jdocs.map do |jdoc|
  # Convert to jdoc/address pair
  [ jdoc, address_scrub(jdoc["notification_rule"]["contact_method"]["address"]) ]
end.reject do |jdoc, address|
  # Remove any that are already in the duplicates list
  duplicates.include?(address)
end.map do |jdoc, _|
  # Return only the document
  jdoc
end
That processes jdocs, an array of jdoc structures, and removes duplicates in a series of simple steps.
With the chaining approach you can see what's happening before you add on the next "link", so you can work incrementally towards a solution, adjusting as you go. Any mistakes are fairly easy to catch because you're able to, at any time, inspect the intermediate products of those stages.
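To make the idea of inspecting intermediate products concrete, here is a minimal sketch that runs entirely in memory. The sample documents, the `already_seen` set (standing in for the contents of the text file), and the `@target.com` domain are all made up for illustration.

```ruby
require 'set'

# Same idea as address_scrub above; the domain is an assumption.
def address_scrub(address)
  address.gsub(/@target\.com/, '').gsub(/\w+/, &:capitalize)
end

# Stand-in for the names already written to the text file.
already_seen = Set.new(['Cod.Lew'])

# Made-up documents in the same shape as the hashes in the question.
jdocs = [
  { 'notification_rule' => { 'contact_method' => { 'address' => 'cod.lew@target.com' } } },
  { 'notification_rule' => { 'contact_method' => { 'address' => 'ann.ray@target.com' } } },
  { 'notification_rule' => { 'contact_method' => { 'address' => 'ann.ray@target.com' } } }
]

# Stage 1: extract and clean the addresses.
names = jdocs.map do |jdoc|
  address_scrub(jdoc['notification_rule']['contact_method']['address'])
end

# Stage 2: drop names that are already on file.
fresh = names.reject { |name| already_seen.include?(name) }

# Stage 3: drop repeats within this batch.
unique = fresh.uniq

p names   # inspect any stage before adding the next one
p unique
```

Each stage is an ordinary array, so printing it with `p` between steps shows exactly where a surprise comes from.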

Related

Regex in Ruby for a URL that is an image

So I'm working on a crawler to get a bunch of images on a page that are saved as links. The relevant code, at the moment, is:
def parse_html(html)
  html_doc = Nokogiri::HTML(html)
  nodes = html_doc.xpath("//a[@href]")
  nodes.inject([]) do |uris, node|
    uris << node.attr('href').strip
  end.uniq
end
I am currently getting a bunch of links, most of which are images, but not all. I want to narrow down the links with a regex before downloading. So far, I haven't been able to come up with a Ruby-friendly regex for the job. The best I have is:
^https?:\/\/(?:[a-z0-9\-]+\.)+[a-z]{2,6}(?:/[^\/?]+)+\.(?:jpg|gif|png)$.match(nodes)
Admittedly, I got that regex from someone else, and tried to edit it to work and I'm failing. One of the big problems I'm having is the original Regex I took had a few "#"'s in it, which I don't know if that is a character I can escape, or if Ruby is just going to stop reading at that point. Help much appreciated.
I would consider modifying your XPath to include your logic. For example, if you only wanted the a elements that contained an img you can use the following:
"//a[img][@href]"
Or even go further and extract just the URIs directly from the href values:
uris = html_doc.xpath("//a[img]/@href").map(&:value)
As some have said, you may not want to use Regex for this, but if you're determined to:
^http(s?):\/\/.*\.(jpeg|jpg|gif|png)$
is a pretty simple one that will grab anything beginning with http or https and ending with one of the file extensions listed. You should be able to figure out how to extend it; Rubular.com is good for experimenting with these.
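As a rough sketch of how such a pattern could be applied to an array of links, the following uses `Array#grep`. The constant name and the sample URLs are made up, and the pattern is a slightly tightened variant (anchored with `\A` and `\z`, case-insensitive), not the exact regex above.

```ruby
# \A and \z pin the match to the whole string; /i makes the
# extension check case-insensitive.
IMAGE_URL = %r{\Ahttps?://.+\.(?:jpe?g|gif|png)\z}i

# Made-up sample links for illustration.
links = [
  'http://example.com/a.jpg',
  'https://example.com/path/b.PNG',
  'https://example.com/page.html',
  'ftp://example.com/c.gif'
]

# grep keeps only the elements the pattern matches.
image_links = links.grep(IMAGE_URL)
p image_links
```

Using `%r{...}` avoids having to escape every forward slash in the URL part of the pattern.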
Regular expressions are a very powerful tool, but compared to simple string comparisons they are pretty slow.
For your simple example, I would suggest using a simple condition like:
IMAGE_EXTS = %w[gif jpg png]
if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
  # ...
end
In the context of your question, you might want to change your method to:
IMAGE_EXTS = %w[gif jpg png]
def parse_html(html)
  uris = []
  Nokogiri::HTML(html).xpath("//a[@href]").each do |node|
    uri = node.attr('href').strip
    uris << uri if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
  end
  uris.uniq
end

How to check for multiple words inside a folder

I have words in a text file called words.txt, and I need to check if any of those words are in my Source folder, which also contains sub-folders and files.
I was able to get all of the words into an array using this code:
array_of_words = []
File.readlines('words.txt').map do |word|
  array_of_words << word
end
And I also have (kinda) figured out how to search through the whole Source folder including the sub-folders and sub-files for a specific word using:
Dir['Source/**/*'].select{|f| File.file?(f) }.each do |filepath|
  puts filepath
  puts File.readlines(filepath).any?{ |l| l['api'] }
end
Instead of searching for one word like api, I want to search the Source folder for the whole array of words (if that is possible).
Consider this:
File.readlines('words.txt').map do |word|
  array_of_words << word
end
will read the entire file into memory, then convert it into individual elements in an array. You could accomplish the same thing using:
array_of_words = File.readlines('words.txt')
A potential problem is that it's not scalable. If "words.txt" is larger than the available memory, your code will have problems, so be careful.
Searching a file for an array of words can be done a number of ways, but I've always found it easiest to use a regular expression. Perl has a great module called Regexp::Assemble that makes it easy to convert a list of words into a very efficient pattern, but Ruby is missing that sort of functionality. See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for one solution I put together in the past to help with that.
Ruby does have Regexp.union; however, it's only a partial help.
words = %w(foo bar)
re = Regexp.union(words) # => /foo|bar/
The pattern generated has flags for the expression so you have to be careful with interpolating it into another pattern:
/#{re}/ # => /(?-mix:foo|bar)/
(?-mix: will cause you problems so don't do that. Instead use:
/#{re.source}/ # => /foo|bar/
which will generate the pattern and behave like we expect.
Unfortunately, that's not a complete solution either, because the words could be found as sub-strings in other words:
'foolish'[/#{re.source}/] # => "foo"
The way to work around that is to set word-boundaries around the pattern:
/\b(?:#{re.source})\b/ # => /\b(?:foo|bar)\b/
which then looks for whole words:
'foolish'[/\b(?:#{re.source})\b/] # => nil
More information is available in Ruby's Regexp documentation.
Once you have a pattern you want to use, searching becomes a simpler matter. Ruby has the Find module, which makes it easy to recursively search directories for files. The documentation covers how to use it.
Alternately, you can cobble your own method using the Dir class. Again, it has examples in the documentation to use it, but I usually go with Find.
When reading the files you're scanning I'd recommend using foreach to read the files line-by-line. File.read and File.readlines are not scalable and can make your program behave erratically as Ruby tries to read a big file into memory. Instead, foreach will result in very scalable code that runs more quickly. See "Why is "slurping" a file not a good practice?" for more information.
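Putting those pieces together, a minimal sketch might look like the following. The `files_matching` helper name is made up, and the throwaway directory only exists so the example runs without a real Source folder. It assumes Ruby 2.4 or later (for `match?`) and text files in a valid encoding; binary files could raise an encoding error.

```ruby
require 'find'
require 'tmpdir'

# Returns the paths under `root` that contain at least one match.
def files_matching(root, pattern)
  hits = []
  Find.find(root) do |path|
    next unless File.file?(path)
    # File.foreach yields one line at a time instead of loading the
    # whole file; any? stops at the first matching line.
    hits << path if File.foreach(path).any? { |line| line.match?(pattern) }
  end
  hits.sort
end

words = %w[foo bar]
pattern = /\b(?:#{Regexp.union(words).source})\b/

# Throwaway directory tree with made-up contents.
results = Dir.mktmpdir do |dir|
  Dir.mkdir(File.join(dir, 'sub'))
  File.write(File.join(dir, 'a.txt'), "nothing here\nfoolish talk\n")
  File.write(File.join(dir, 'sub', 'b.txt'), "first line\nraise the bar\n")
  files_matching(dir, pattern).map { |path| path.sub("#{dir}/", '') }
end

p results
```

Only `sub/b.txt` is reported: "foolish" in `a.txt` contains "foo" but not as a whole word.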
Using the links above you should be able to put something together quickly that'll run efficiently and be flexible.
This untested code should get you started:
WORD_ARRAY = File.readlines('words.txt').map(&:chomp)
# Both word boundaries sit outside the group so they apply to every alternative
WORD_RE = /\b(?:#{Regexp.union(WORD_ARRAY).source})\b/
Dir['Source/**/*'].select{|f| File.file?(f) }.each do |filepath|
  puts "#{filepath}: #{!!File.read(filepath)[WORD_RE]}"
end
It will output the file it's reading, and "true" or "false" whether there is a hit finding one of the words in the list.
It's not scalable because of readlines and read and could suffer serious slowdown if any of the files are huge. Again, see the caveats in the "slurp" link above.
Recursively searches directory for any of the words contained in words.txt
re = /#{File.readlines('words.txt').map { |word| Regexp.quote(word.strip) }.join('|')}/
Dir['Source/**/*.{cpp,txt,html}'].select{|f| File.file?(f) }.each do |filepath|
  puts filepath
  # Pass the encoding as an option; a second positional argument would be
  # treated as the line separator
  puts File.readlines(filepath, encoding: "ascii").grep(re).any?
end

Algorithm in Ruby to trigger a method based on presence of certain texts

I don't know if it can be called an algorithm, but I think it's close.
I will be pulling data from an API that will have certain words in the title, eg:
Great Software 2.0 Download Now
Buy Great Software for just $10
Great Software Torrent Download
So, I want to do different things based on the presence of certain words such as Download, Buy, etc. For example, if the title has the word 'buy' in it, I would like to extract that word and the amount present in the title and show them in another div, so in this case it would be "Buy for $10" or "Buy $10". I could use if/else, but I'd rather not, because there could be more such conditions in the future. So I am thinking about using the send method, e.g.:
def buy(string)
  'Buy for just ' + string.scan(/\$\d+/).first
end
def whichkeyword(title)
  send (title.scan(/(download|buy)/i)[0][0]).downcase.to_sym, title
end
whichkeyword('Buy this software for $10 now')
Is there a better way to do this? Or is this even a good way to do it? Any help would be appreciated.
First of all, use send only if you need to call a private method; use public_send otherwise.
In this particular case metaprogramming is overkill. It requires too much redundant code, plus it requires the code to be changed for new items. I would go with building a hash like:
@hash = { 'buy' => { text: 'Buy for just %{placeholder}', re: /\$\d+/ } }
This hash might be placed somewhere outside of the code, e.g. it might be stored in a YAML file next to the code and loaded in advance. That way you can change the behaviour without modifying the code, which is handy, for instance, in a gem.
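As a rough illustration of keeping the rules outside the code, the sketch below loads them from YAML. The YAML layout, the `download` rule, and the `describe` helper are assumptions made up for this example; the YAML is inlined so the sketch is self-contained, whereas in practice it would live in its own file. It assumes Ruby 2.4 or later for `transform_values`.

```ruby
require 'yaml'

# Hypothetical rules file content. Patterns are stored as plain
# strings because YAML has no portable regex type.
RULES_YAML = <<~'YAML'
  buy:
    text: "Buy for just %{placeholder}"
    re: '\$\d+'
  download:
    text: "Download version %{placeholder}"
    re: '\d+\.\d+'
YAML

# Turn each stored pattern string into a Regexp once, at load time.
rules = YAML.safe_load(RULES_YAML).transform_values do |rule|
  { text: rule['text'], re: Regexp.new(rule['re']) }
end

# Returns the formatted text, or nil when no keyword is present.
def describe(title, rules)
  key = title[/#{Regexp.union(rules.keys).source}/i]
  return nil unless key
  rule = rules[key.downcase]
  rule[:text] % { placeholder: title[rule[:re]] }
end

puts describe('Buy this software for $10 now', rules)
puts describe('Great Software 2.0 Download Now', rules)
```

Adding a new keyword then means adding a YAML entry rather than a new method.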
As we have a hash defined/loaded, I would call the method:
def format string
  key = string[/#{Regexp.union(@hash.keys).source}/i].downcase
  puts @hash[key][:text] % { placeholder: string[@hash[key][:re]] }
end
Yielding:
▶ format("Buy this software for $10 now")
#⇒ Buy for just $10
There are many advantages over declaring methods, e.g. matches can now contain spaces, and you can easily add or remove matchers.
First of all, your algorithm can work, but it has some problems, such as what happens when the title contains no keyword.
I have two solutions for you:
NLP
If you want something much more dynamic, you can use NLP (natural language processing). NLP will find the main words in your sentence, and then you can pick a suitable action for each.
A good gem for that is Treat that you can use with stanford-core-nlp. After processing the data you can find the verbs and even synonyms in the sentence and figure out what to do.
sentence('Buy this software for $10 now').verbs # ['buy']
Simple Hash
This solution is less dynamic, but much simpler. As you did with the scan, just use a constant to manage your keywords and the output for each of them (I would do it with lambdas). You can also add a default to the hash:
# String keys, so the lookup below (which produces a String) can find them
KEYWORDS = Hash.new('Default Title').merge(
  'buy' => -> { },
  'download' => -> { }
)
# to_s turns a failed match (nil) into '', which falls through to the default
KEYWORDS[sentence[/(#{KEYWORDS.keys.join('|')})/i].to_s.downcase]
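A filled-in sketch of the same idea, with the lambdas actually doing something, might look like this. The `HANDLERS` and `headline` names and the output strings are made up for illustration; the default lambda covers the case where no keyword is found.

```ruby
# Hypothetical handlers; each lambda receives the full title.
HANDLERS = {
  'buy'      => ->(title)  { "Buy for #{title[/\$\d+/]}" },
  'download' => ->(_title) { 'Download now' }
}
# Fallback used when the title contains no known keyword.
HANDLERS.default = ->(title) { title }
HANDLERS.freeze

# Whole-word, case-insensitive match on any known keyword.
KEYWORD_RE = /\b(?:#{Regexp.union(HANDLERS.keys).source})\b/i

def headline(title)
  # to_s turns a failed match (nil) into '', which hits the default.
  keyword = title[KEYWORD_RE].to_s.downcase
  HANDLERS[keyword].call(title)
end

puts headline('Buy Great Software for just $10')
puts headline('Great Software Torrent Download')
puts headline('Great Software')
```

Because every value is callable, including the default, the caller never has to check which case it is dealing with.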
I think this solution is good enough.
The only thing that looks strange is scan(/(download|buy)/i)[0][0].
Personally, I don't much like using the [] syntax in Ruby.
I think using scan here is not necessary.
What about
def whichkeyword(title)
  title =~ /(download|buy)/i
  send $1.downcase.to_sym, title unless $1.nil?
end
UPDATE
def whichkeyword(title)
  action = title[/(download|buy)/i]
  public_send action.downcase.to_sym, title if action
end

Regex causing high CPU load, causing Rails to not respond

I have a Ruby 1.8.7 script to parse iOS localization files:
singleline_comment = /\/\/(.*)$/
multiline_comment = /\/\*(.*?)\*\//m
string_line = /\s*"(.*?)"\s*=\s*"(.*?)"\s*\;\s*/xm
out = decoded_src.scan(/(?:#{singleline_comment}|#{multiline_comment})?\s*?#{string_line}/)
It used to work fine, but today we tested it with an 800 KB file that doesn't have ; at the end of each line. The result was a high CPU load and no response from the Rails server. My assumption is that it took the whole file as a single string in the capturing group and that blocked the server.
The solution was to add ? (the regex quantifier for 0 or 1 times) to the ; literal character:
/\s*"(.*?)"\s*=\s*"(.*?)"\s*\;?\s*/xm
Now it works fine again even with those files in the old iOS format, but my fear now is, what if a user submits a malformed file, like one with no ending ". Will my server get blocked again?
And how do I prevent this? Is there any way to try to run this only for five seconds? What can I do to avoid halting my whole Rails application?
It looks like you're trying to parse an entire configuration as if it was a string. While that is doable, it's error-prone. Regular expression engines have to do a lot of looking forward and backward, and poorly written patterns can end up wasting a huge amount of CPU time. Sometimes a minor tweak will fix the problem, but the more text being processed, and the more complex the expression, the higher the chance of something happening that will mess you up.
From benchmarking different ways of getting at data for my own work, I've learned that anchoring regexp patterns can make a huge difference in speed. If you can't anchor a pattern somehow, then you are going to suffer from the backtracking and greediness of patterns unless you can limit what the engine wants to do by default.
I have to parse a lot of device configurations, but instead of trying to treat them as a single string, I break them down into logical blocks consisting of arrays of lines, and then I can provide logic to extract data from those blocks based on knowledge that blocks contain certain types of information. Small blocks are faster to search, and it's a lot easier to write patterns that can be anchored, providing huge speedups.
Also, don't hesitate to use Ruby's String methods, like split to tear apart lines, and sub-string matching to find lines containing what you want. They're very fast and less likely to induce slowdowns.
If I had a string like:
config = "name:\n foo\ntype:\n thingie\nlast update:\n tomorrow\n"
chunks = config.split("\n").slice_before(/^\w/).to_a
# => [["name:", " foo"], ["type:", " thingie"], ["last update:", " tomorrow"]]
command_blocks = chunks.map{ |k, v| [k[0..-2], v.strip] }.to_h
command_blocks['name'] # => "foo"
command_blocks['last update'] # => "tomorrow"
slice_before is a very useful method for this sort of task as it lets us define a pattern that is then used to test for breaks in the master array, and group by those. The Enumerable module has lots of useful methods in it, so be sure to look through it.
The same data could be parsed.
Of course, without sample data for what you're trying to do it's difficult to suggest something that works better, but the idea is, break down your input into small manageable chunks and go from there.
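As a rough illustration of that idea applied to a localization file, the sketch below handles one line at a time with a pattern anchored at both ends. The sample data is made up, and the sketch assumes one entry per line and only `//` comments; `/* */` blocks and values spanning several lines would need extra handling. The `ENTRY` and `parse_strings` names are hypothetical.

```ruby
# Anchored at both ends, so a malformed line fails quickly instead of
# letting the engine wander across the rest of the file. The two
# alternatives inside each group cannot match the same character,
# which keeps backtracking bounded.
ENTRY = /\A\s*"((?:[^"\\]|\\.)*)"\s*=\s*"((?:[^"\\]|\\.)*)"\s*;?\s*\z/

# Returns [pairs, rejected_line_numbers].
def parse_strings(source)
  pairs = {}
  rejected = []
  source.each_line.with_index(1) do |line, number|
    stripped = line.strip
    next if stripped.empty? || stripped.start_with?('//')
    if (m = ENTRY.match(stripped))
      pairs[m[1]] = m[2]
    else
      rejected << number   # remember bad lines instead of hanging on them
    end
  end
  [pairs, rejected]
end

# Made-up sample: one comment, two valid entries, one malformed entry.
sample = <<~'STRINGS'
  // Greeting shown on launch
  "hello" = "Hello, world";
  "bye" = "Goodbye"
  "broken" = "no closing quote;
STRINGS

pairs, rejected = parse_strings(sample)
p pairs
p rejected
```

A line with a missing closing quote is simply reported by number, so a malformed upload costs one failed match per line rather than a scan of the whole file.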
A comment on how you're defining your patterns:
Instead of using /\/.../ (which is known as "leaning-toothpicks syndrome") use %r which allows you to define a different delimiter:
singleline_comment = /\/\/(.*)$/ # => /\/\/(.*)$/
singleline_comment = %r#//(.*)$# # => /\/\/(.*)$/
multiline_comment = /\/\*(.*?)\*\//m # => /\/\*(.*?)\*\//m
multiline_comment = %r#/\*(.*?)\*/#m # => /\/\*(.*?)\*\//m
The first line in each sample above is how you're doing it, and the second is how I'd do it. They result in identical regexp objects, but the second ones are easier to understand.
You can even have Regexp help you by escaping things for you:
NONGREEDY_CAPTURE_NONE_TO_ALL_CHARS = '(.*?)'
GREEDY_CAPTURE_NONE_TO_ALL_CHARS = '(.*)'
EOL = '$'
Regexp.new(Regexp.escape('//') + GREEDY_CAPTURE_NONE_TO_ALL_CHARS + EOL) # => /\/\/(.*)$/
Regexp.new(Regexp.escape('/*') + NONGREEDY_CAPTURE_NONE_TO_ALL_CHARS + Regexp.escape('*/'), Regexp::MULTILINE) # => /\/\*(.*?)\*\//m
Doing this you can iteratively build up extremely complex expressions while keeping them relatively easy to maintain.
As far as halting your Rails app goes, don't try to process the files in the same Ruby process. Run a separate job that watches for the files, processes them, and stores whatever you're looking for so it can be accessed as needed later. That way your server will continue to respond rather than lock up. I wouldn't do it in a thread, but would write a separate Ruby script that looks for incoming data, and if nothing is found, sleeps for some interval of time and then looks again. Ruby's sleep method will help with that, or you could use the cron capability of your OS.

Ruby - Files - gets method

I am following the book Wicked Cool Ruby Scripts.
Here, there are two files, file_output = file_list.txt and oldfile_output = file_list.old. These two files contain the list of all files the program went through and is going to go through.
If a 'file_list.txt' file exists, it is renamed to the old file.
After that, I am not able to understand the code.
Apparently every line of the file is read and stored in the oldfile hash.
Can someone explain the code from the 4th line on?
Also, why is gets used here? Why can't an .each method be used to read through every line?
if File.exist?(file_output)
  File.rename(file_output, oldfile_output)
  File.open(oldfile_output, 'rb') do |infile|
    while (temp = infile.gets)
      line = /(.+)\s{5,5}(\w{32,32})/.match(temp)
      puts "#{line[1]} ---> #{line[2]}"
      oldfile_hash[line[1]] = line[2]
    end
  end
end
Judging from the redundant use of quantifiers ({5,5} and {32,32}) in the regex (which would be better written as {5}, {32}), it looks like the person who wrote that code is not a professional Ruby programmer. So you can assume that the choice taken in the code is not necessarily the best.
As you pointed out, the code could have used each instead of while with gets. The latter approach is sort of an old-school Ruby way of doing it. There is nothing wrong with using it. Until the end of file is reached, gets will return a string, and when it does reach the end of file, gets will return nil, so the while loop works the same as when you use each; in each iteration, it reads the next line.
It looks like each line is supposed to represent a key-value pair. The regex assumes that the key is not an empty string, that the key and the value are separated by five whitespace characters, and that the value consists of exactly thirty-two word characters (letters, digits, or underscores). Each key-value pair is printed (perhaps for monitoring the progress), and is stored in oldfile_hash, which is most likely a hash.
So the point of using .gets is to tell when the file is finished being read. Essentially, it's tied to the
while (condition)
....
end
block. So gets serves as a little method that keeps giving Ruby the next line of the file until there are no more lines to give.
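For comparison, here is a minimal sketch of the same loop written with each_line. A StringIO with two made-up entries stands in for the opened file so the example is self-contained, and the regex uses the shorter {5} and {32} quantifiers.

```ruby
require 'stringio'

# Stand-in for the opened file; both entries are made up.
infile = StringIO.new(
  "notes.txt     #{'a' * 32}\n" \
  "photo.jpg     #{'b' * 32}\n"
)

oldfile_hash = {}
# each_line yields every line in turn and stops at end of file,
# which is what the while/gets loop does by hand.
infile.each_line do |temp|
  line = /(.+)\s{5}(\w{32})/.match(temp)
  next unless line   # skip lines that do not fit the expected format
  oldfile_hash[line[1]] = line[2]
end

p oldfile_hash.keys
```

The same block works unchanged on a real file handle inside File.open, since File and StringIO both provide each_line.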
