Getting timeouts for huge arrays - ruby

I am taking some sentences in an array and some keywords in queries to check whether the keywords are present in the sentences. For small sentence arrays it works fine, but for huge sentence arrays it times out every time. Any idea on how to optimise this? TIA
def textQueries(sentences, queries)
  queries.map { |query|
    index_arr = []
    sentences.map.with_index { |sentence, index|
      sentence_arr = sentence.split(' ')
      if query.split(' ').all? { |qur| sentence_arr.include?(qur) }
        index_arr << index
      end
    }
    index_arr << -1 if index_arr.empty?
    puts index_arr.join " "
  }
end
Example inputs:
**Sentences**:
it go will away
go do art
what to will east
**Queries**
it will
go east will
will
**Expected Result**
0
-1
0 2

There are a few optimizations that I see at first glance:
You are currently splitting each sentence for every query. Your sample data has 3 sentences and 3 queries. This means each sentence is split 3 times (once for each query). Since the result doesn't depend on the query, you should do this up front. Each sentence should only be split once.
You are currently using sentences.map to iterate the sentences, but you don't capture the result; you only use it for iteration and push results into index_arr. map creates a new array which you never use, meaning you are chewing up memory that could be used elsewhere. It can be changed to each, which is far more efficient when you don't use the return value.
The code query.split(' ').all? { |qur| sentence_arr.include?(qur) } isn't really optimal, since it starts searching for a specific word from the front of sentence_arr each time. Checking if a certain collection is a subset or superset of another collection is something where Set often shines.
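For illustration, Set#>= (also available as superset?) performs that check in a single call, using words from the sample data:

require 'set'

sentence_words = Set.new(%w[it go will away])
query_words = Set.new(%w[it will])

sentence_words >= query_words          # => true, every query word is in the sentence
sentence_words.superset?(query_words)  # => true, same check spelled out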
With all the above in mind, something like this should be a lot faster:
require 'set'

def text_queries(sentences, queries)
  sentences = sentences.map { |sentence| Set.new(sentence.split(' ')) }

  queries.map do |query|
    query = Set.new(query.split(' '))
    indexes = sentences.each_index.select { |index| sentences[index] >= query }
    indexes << -1 if indexes.empty?
    indexes
  end
end
Note: If you decide to output the values to the console (as shown in the question):
puts indexes.join(' ')
Then there is no reason to use queries.map since an array with nil values will be returned (puts always returns nil). Change the map to each in this scenario.
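For reference, running text_queries against the sample data from the question produces the expected output:

sentences = ["it go will away", "go do art", "what to will east"]
queries   = ["it will", "go east will", "will"]

text_queries(sentences, queries).each { |indexes| puts indexes.join(' ') }
# 0
# -1
# 0 2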

Related

Loop through elements of XML file to see if they include any value within an array?

I've read through tons of questions and solutions to determine whether this was already answered elsewhere, but it seems that none of the things I found were exactly what I was trying to get at.
I have an XML document that has hundreds of entries of text, and each entry also lists a URL. Each URL is a string (within tags), ending with a unique 4-digit number. The XML file is basically formatted like so:
<entry>
  [other content]
  <id>http://www.URL.com/blahblahblah-1234</id>
  [other content]
</entry>
I want to essentially single out only the URLs that have a particular number at the end, out of a list of numbers. I put all of the numbers in an array, with the values set as strings (numbers = ["1234", "8649", etc.]). I've been using nokogiri for other parts of my script, and when I am only looking for a particular string, I just use include?, which works perfectly. However, I'm not sure how to automate this when I have hundreds of strings within the "numbers" array. This is essentially what I logistically need to happen:
id = nokodoc.css("id")
id.each { |id|
hyperlink = id.text
if hyperlink.include?(numbers)
puts "yes!"
else
puts "no :("
end
}
Obviously this doesn't work, because include? expects a string, whereas I'm passing an entire array. (For instance, if I do include?(numbers[0]), it works.) I've tried this with any? but it doesn't seem to work in this case.
Is there a Ruby method that I'm not aware of, that can tell me whether any of the values within an array is present in any of the nodes that I'm looping through? Let me know if any of this needs to be clarified—phrasing the proper question is often the hardest part!
Edit: As a sidenote, ultimately I'd like to remove all entries that correspond to any links that do not end with one of the numbers in the array, i.e.
if hyperlink.include? (any number from the array)
puts "this one is good"
else
id.parent.remove
So I would somehow need the final product to remain parsable with nokogiri.
Thank you so much in advance, for any and all insight!
You can do this:
numbers = ['1234', '8649', ..]
urls = nokodoc.css('id').map(&:text)
urls = urls.select { |url| numbers.any? { |n| url.include? n } }
But it's not efficient. If you know the pattern, extract the number and then check whether it's in the array. For example, if it's always the last 4 digits:
numbers = ['1234', '8649', ..]
urls = nokodoc.css('id').map(&:text)
urls = urls.select { |url| numbers.include? url[-4..-1] }
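If the numbers list grows into the hundreds, a further optional tweak is to keep the numbers in a Set so each lookup is a hash lookup instead of a linear scan; a rough sketch:

require 'set'

number_set = Set.new(['1234', '8649']) # same values as before, just stored in a Set
urls = nokodoc.css('id').map(&:text)
urls = urls.select { |url| number_set.include?(url[-4..-1]) }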
UPDATE
For the change in the question:
numbers = ['1234', '8649', ..]
nodes = nokodoc.css('id')

nodes.each do |node|
  url = node.text
  if numbers.any? { |n| url.include? n }
    puts 'this one is good'
  else
    node.parent.remove
  end
end
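Since the final result needs to stay parsable with Nokogiri, here is a small sketch of writing the pruned document back out (the output file name is just an example):

# Serialize the modified document and write it to a new file.
File.write('Cleaned.xml', nokodoc.to_xml)

# It can then be re-parsed like any other XML file.
cleaned = Nokogiri::XML(File.read('Cleaned.xml'))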

What programming patterns or strategy should I use to deal with small inconsistencies in data processing?

In the Ruby gem I am writing, I take certain known query parameters as input, massage them into a query string, and then use that constructed URL as a REST endpoint to retrieve data.
Now there are some weird inconsistencies in the incoming inputs, and I am forking my code to normalize them into a consistent output.
def build_query(params, endpoint)
  limit = Hash[limit: params[:limit] || 0]
  skip = Hash[skip: params[:skip] || 0]
  asc = Hash[asc: params[:asc] || ""]
  desc = Hash[desc: params[:desc] || ""]

  query = [limit, skip, asc, desc].select { |hash| hash.values.none? { |val| val == '' || val == 0 } }
  encoded = query.map { |q| q.to_query }.join("&")

  references = build_references(params[:include]) || ""

  query_string = references.empty? ? "#{endpoint}#{encoded}" : "#{endpoint}#{references}&#{encoded}"
end
You will see above that the references piece of the params is not handled the same way as the rest of the parameters. There are more slightly inconsistent edge cases coming soon. And the only way I know how to deal with these is to keep forking my code inside this function. It's going to get messy soon!
So how should I now refactor this code? Where should I go from here to manage this complexity? Should I use collaborating objects (ParamsBuilder or QueryManager) and some kind of polymorphism strategy?
I would like to keep my code simple and functional as much as possible.
plain = %i[limit skip asc desc]                             # plain values, passed through as-is
built = { include: ->(input) { build_references(input) } }  # values built via procs

query = (plain + built.to_a).map do |p|
  case p
  when Symbol then { p => params[p] }
  when Array  then { p.first => p.last.(params[p.first]) }
  end
end.reject { |h| h.values.all?(&:blank?) }.map(&:to_query).join("&")

[endpoint, query].join
Basically, you have two types of parameters: those you pass through as-is (like :limit) and those you transform (like :include).
The former are just passed through; the latter are transformed using the lambdas listed at the beginning of the snippet.
Since you were using to_query in the original question, I assume you are on Rails, so you have blank? on hand and there is no need to explicitly check for nil or empty strings (a literal 0 would still need its own check if you want to drop it too).
In the last step, we reject blanks and join everything with an ampersand.
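As a rough usage sketch: the params hash and endpoint below are made up, and build_references is stubbed out only so the example is self-contained (it stands in for the asker's own method):

# Hypothetical stand-in for the asker's build_references, for demonstration only.
def build_references(input)
  input.to_s
end

params   = { limit: 10, skip: 0, asc: "", include: "details" }
endpoint = "https://api.example.com/items?"

# With the snippet above wrapped back into build_query(params, endpoint),
# the result would be roughly:
#   "https://api.example.com/items?limit=10&skip=0&include=details"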

In Ruby, what is the best way to convert alphanumeric entries to integers for a column of a CSV containing a huge number of rows?

My CSV contains about 60 million rows. The 10th column contains some alphanumeric entries, some of which repeat, that I want to convert into integers with a one-to-one mapping. That is, I don't want the same entry in Original.csv to have multiple corresponding integer values in Processed.csv. So, initially, I wrote the following code:
require 'csv'

udids = []
CSV.open('Processed.csv', "wb") do |csv|
  CSV.foreach('Original.csv', :headers => true) do |row|
    unless udids.include?(row[9])
      udids << row[9]
    end
    udid = udids.index(row[9]) + 1
    array = [udid]
    csv << array
  end
end
But the program was taking a lot of time, which I soon realized was because it had to check all the previously seen values for every row, to make sure only new values get assigned a new integer and existing ones keep the value they were already given.
So, I thought of hashing them, because when exploring the web about this issue, I learnt that hashing is faster than sequential comparing, somehow (I have not read the details about the how, but anyway...) So, I wrote the following code to hash them:
arrayUDID = []
arrayUser = []
arrayHash = []
array1 = []

f = File.open("Original.csv", "r")
f.each_line { |line|
  row = line.split(",")
  arrayUDID << row[9]
  arrayUser << row[9]
}

arrayUser = arrayUser.uniq
arrayHash = []
for i in 0..arrayUser.size-1
  arrayHash << arrayUser[i]
  arrayHash << i
end

hash = Hash[arrayHash.each_slice(2).to_a]
array1 = hash.values_at(*arrayUDID)

logfile = File.new("Processed.csv", "w")
for i in 0..array1.size-1
  logfile.print("#{array1[i]}\n")
end
logfile.close
But here again, I observed that the program was taking a lot of time, which I realized must be due to the hash array (or hash table) running out of memory.
So, can you kindly suggest any method that will work for my huge file in a reasonable amount of time? By reasonable amount, I mean within 10 hours, because I realize that it's going to take some hours at least as it took about 5 hours to extract that dataset from an even bigger dataset. So, with my aforementioned codes, it was not getting finished even after 2 days of running the programs. So, if you can suggest a method which can do the task by leaving the computer on overnight, that would be great. Thanks.
I think this should work:
require 'csv'

udids = {}
unique_count = 1
output_csv = CSV.open("Processed.csv", "w")

CSV.foreach("Original.csv").with_index do |row, i|
  output_csv << row and next if i == 0 # skip first row (header info)

  val = row[9]
  if udids[val.to_sym]
    row[9] = udids[val.to_sym]
  else
    udids[val.to_sym] = unique_count
    row[9] = unique_count
    unique_count += 1
  end
  output_csv << row
end

output_csv.close
The performance depends heavily on how many duplicates there are (the more the better), but basically it keeps track of each value as a key in a hash, and checks to see if it has encountered that value yet or not. If so, it uses the corresponding value, and if not it increments a counter, stores that count as the new value for that key and continues.
I was able to process a 10 million line test CSV file in about 3 minutes.

Ruby // Random number between range, ensure uniqueness against others existing stored ones

I'm currently trying to generate a random number in a specific range and ensure that it is unique against other stored records. Using MySQL. It could be like an incremented id, but it can't be that. I'm currently testing against other existing records in an 'expensive' manner, but I'm pretty sure there is a clean one- or two-line way to do this.
Currently using:
test = 0
Order.all.each do |ord|
  test = (0..899999).to_a.sample.to_s.rjust(6, '0')
  if Order.find_by_number(test).nil? then
    break
  end
end
return test
Thanks for any help
Here is my one-line solution. It is also the quickest one, since it calls .pluck to retrieve the numbers from the Order table. .select instantiates an Order object for every record (which is very costly and unnecessary), while .pluck does not. It also avoids iterating over each object again with .map to get the "number" field. We can avoid the second .map as well if we convert the value to a number in the database, using CAST in this case.
(Array(0...899999) - Order.pluck("CAST(`number` AS UNSIGNED)")).sample.to_s.rjust(6, '0')
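For contrast, here are the two ways of pulling that column (model and column names as in the question):

# Builds an Order object for every row, then calls #number on each one:
Order.all.select(:number).map(&:number)

# Skips model instantiation and returns the raw column values directly:
Order.pluck(:number)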
I would do something like this:
# gets all existing IDs
existing_ids = Order.all.select(:number).map(&:number).map(&:to_i)
# removes them from the acceptable range
available_numbers = (0..899999).to_a - existing_ids
# choose one (which is not in the DB)
available_numbers.sample.to_s.rjust(6, '0')
I think you can do something like the below:
def uniq_num_add(arr)
  loop do
    rndm = "%02d" % rand(1..15) # I took this range as an example
    # the random number is only added to the array when it is
    # not already present
    break arr << rndm unless arr.include?(rndm)
  end
end
array = []
3.times do
uniq_num_add(array)
end
array # => ["02", "15", "04"]

Ruby, Count syllables

I am using Ruby to calculate the Gunning Fog Index of some content that I have. I can successfully implement the algorithm described here:
Gunning Fog Index
I am using the below method to count the number of syllables in each word:
Tokenizer = /([aeiouy]{1,3})/

def count_syllables(word)
  len = 0
  if word[-3..-1] == 'ing' then
    len += 1
    word = word[0...-3]
  end
  got = word.scan(Tokenizer)
  len += got.size()
  if got.size() > 1 and got[-1] == ['e'] and
     word[-1].chr() == 'e' and
     word[-2].chr() != 'l' then
    len -= 1
  end
  return len
end
It sometimes picks up words with only 2 syllables as having 3 syllables. Can anyone give any advice or is aware of a better method?
text = "The word logorrhoea is often used pejoratively to describe prose that is highly abstract and contains little concrete language. Since abstract writing is hard to visualize, it often seems as though it makes no sense and all the words are excessive. Writers in academic fields that concern themselves mostly with the abstract, such as philosophy and especially postmodernism, often fail to include extensive concrete examples of their ideas, and so a superficial examination of their work might lead one to believe that it is all nonsense."
# used to get rid of any punctuation
text = text.gsub(/\W+/, ' ')
word_array = text.split(' ')

word_array.each do |word|
  puts word if count_syllables(word) > 2
end
"themselves" is being counted as 3 but it's only 2
The function I gave you before is based upon these simple rules:
Each vowel (a, e, i, o, u, y) in a word counts as one syllable, subject to the following sub-rules:
Ignore final -ES, -ED, -E (except for -LE).
Words of three letters or less count as one syllable.
Consecutive vowels count as one syllable.
Here's the code:
def new_count(word)
  word.downcase!
  return 1 if word.length <= 3
  word.sub!(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '')
  word.sub!(/^y/, '')
  word.scan(/[aeiouy]{1,2}/).size
end
Obviously, this isn't perfect either, but all you'll ever get with something like this is a heuristic.
EDIT:
I changed the code slightly to handle a leading 'y' and fixed the regex to handle 'les' endings better (such as in "candles").
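A few spot checks of the updated function (values follow from tracing the code above):

new_count("candles".dup)    # => 2, the "-les" ending keeps its syllable
new_count("yellow".dup)     # => 2, the leading "y" is stripped before scanning
new_count("themselves".dup) # => 2, the silent "e" in "-ves" is not counted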
Here's a comparison using the text in the question:
# used to get rid of any punctuation
text = text.gsub(/\W+/, ' ')
words = text.split(' ')

words.each do |word|
  old = count_syllables(word.dup)
  new = new_count(word.dup)
  puts "#{word}: \t#{old}\t#{new}" if old != new
end
The output is:
logorrhoea: 3 4
used: 2 1
makes: 2 1
themselves: 3 2
So it appears to be an improvement.
One thing you ought to do is teach your algorithm about diphthongs. If I'm reading your code correctly, it would incorrectly flag "aid" as having two syllables.
You can also add "es" and the like to your special-case endings (you already have "ing") and just not count it as a syllable, but that might still result in some miscounts.
Finally, for best accuracy, you should convert your input to a spelling scheme or alphabet that has a definite relationship to the word's pronunciation. With your "themselves" example, the algorithm has no reliable way to know that the "e" in "ves" is dropped. However, if you respelled it as "themselvz", or taught the algorithm the IPA and fed it [ðəmsɛlvz], it becomes very clear that the word is only pronounced with two syllables. That, of course, assumes you have control over the input, and is probably more work than just counting the syllables yourself.
To begin with, it seems you should decrement len for the suffixes that should be excluded:
len -= 1 if /(ing|es|ed)$/.match(word)
You could also check out Lingua::EN::Readability.
It can also calculate several readability measures, such as a Fog Index and a Flesch-Kincaid level.
PS. I think I know where you got the function from. DS.
There is also a rubygem called Odyssey that calculates Gunning Fog, along with some of the other popular ones (Flesch-Kincaid, SMOG, etc.)