Parsing one large array into several sub-arrays - ruby

I have a list of adjectives (found here) that I would like to be the basis for a "random_adjective(category)" method.
I'm really just taking a stab at this, as my first real attempt at a useful program.
Step 1: Open file, remove formatting. No problem.
list = File.read('adjectivelist')
list.gsub!(/\n/, " ")
The next step is to break the string up by category.
words = list.split(" ")
Now I have an array of every word in the file. Neat. The ones with a tilde before them represent the category names.
Now I would like to break up this LARGE array into several smaller ones, based on category.
I need help with the syntax here, although the pseudocode for this would be something like
Scan the array for an element which begins with a tilde.
Create a new array named after that element sans the tilde, and also place this "category name" into a "categories" array. Then pull elements from the main array and push them into the sub-array until you meet another tilde. Repeat the process until there are no more elements in the array.
Finally I would pull a random word from the category named in the parameter. If there was no category name matching the parameter, it would return false and exit (this is simply in case I want to add more categories later.)
Tips would be appreciated

You may want to go back and split the first time around like this:
categories = list.split(" ~")
Then each list item will start with the category name. This will save you having to go back through your data structure as you suggest. Consider that a tip: sometimes it's better to re-think the start of a coding problem than to head inexorably forwards.
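For instance (the sample data here is made up, since the actual file isn't shown):
list = "~colors red blue ~sizes big small"
list.split(" ~")
# => ["~colors red blue", "sizes big small"]
# note the first chunk keeps its leading tilde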
The structure you are reaching towards is probably a Hash, where the keys are category names, and the values are arrays of all the matching adjectives. It might look like this:
{
  'category' => [ 'word1', 'word2', 'word3' ]
}
So you might do this:
words_in_category = Hash.new
categories.each do |category_string|
  cat_name, *words = category_string.split(" ")
  words_in_category[cat_name] = words
end
Finally, to pick a random element from an array, Ruby provides the very useful method Array#sample, so you can just do this:
words_in_category[ chosen_category ].sample
. . . assuming chosen_category contains the string name of an actual category. I'll leave it to you to figure out how to put this all together and handle errors, bad input, etc.
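Putting those pieces together, a minimal sketch of the whole thing might look like the following. The sub(/\A~/, "") strips the tilde that survives on the first chunk; random_adjective and the adjectivelist filename come from the question, while the category name in the usage line is made up:
# A minimal sketch, assuming the file format described in the question
def random_adjective(category, filename = 'adjectivelist')
  list = File.read(filename).gsub(/\n/, " ")
  words_in_category = {}
  list.split(" ~").each do |category_string|
    cat_name, *words = category_string.split(" ")
    words_in_category[cat_name.sub(/\A~/, "")] = words
  end
  words = words_in_category[category]
  words ? words.sample : false  # false when the category doesn't exist
end

random_adjective("colors")  # => e.g. "red", or false if there's no such category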

Use slice_before:
categories = list.split(" ").slice_before(/~\w+/)
This will create a sub-array for each word starting with ~, containing that word and everything up to the next match.
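To get from those chunks to the category hash from the previous answer, one option is (a sketch, assuming every chunk starts with its ~category word):
words_in_category = categories.each_with_object({}) do |(cat, *words), h|
  h[cat.delete("~")] = words
end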

If this file format is your own and you have the freedom to change it, then I recommend you save the data in YAML or JSON format and read it when needed. There are libraries to do this. That is all: no need to worry about the mess, and no time spent reinventing the wheel.
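For example, with the standard library's YAML module the whole parsing step collapses to a single call (a sketch; the adjectives.yml filename and the category names are hypothetical):
require 'yaml'

# adjectives.yml might contain, e.g.:
#   colors: [red, blue, green]
#   sizes:  [big, small]
words_in_category = YAML.load_file('adjectives.yml')
words_in_category['colors'].sample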

Comparing two files in Ruby with different data types

I had an interview today and wanted input on how you would solve this issue that came up. I answered the question, but in my mind I was thinking there is a better way.
Here is the scenario. You have two files that you need to compare. In the first file you have a list, in string format, of NFL team abbreviations, for example:
ARI
CHIC
GB
NYG
DET
WASH
PHL
PITT
STL
SF
CLEV
IND
DAL
KC
In the second file you would have the following information in a hash or JSON, for example:
"data":
{"description": name: "CLEV","totfd":26,"totyds":396,"pyds":282,"ryds":114,"pen":4,"penyds":24,
"trnovr":0,"pt":4,"ptyds":163,"ptavg":36,"top":"37:05"}},"players":null}
How would you take the strings in the first file (the abbreviations) and see if that abbreviation was included somewhere in the data of the second file? So, for example, I want to see if CLEV, ARI, WASH, and so on appear anywhere in the second file. If an abbreviation is included, I would want to extract information based on that abbreviation.
Here was my answer:
I would iterate over each abbreviation looking for that specific abbreviation inside the second file.
I felt my answer was poor, but I wanted to see if others had a good idea on what they would do.
thanks
Mike Riley
You should ask questions in your interview. Some questions I'd ask:
Will the hash/json include duplicate data for teams? Meaning, will CLEV have multiple records in there? If not, now you know you have unique data so there's no need to group anything ahead of time.
If it's not unique, I'd get a list of all the names that exist in the hash, so you can do a comparison between the array given and the other file.
This is O(n) for the traversal, with O(1) average-time hash lookups:
hash = [{ 'description': 'some team', 'name': 'CLEV', 'totfd': 26, 'totyds': 396, 'pyds': 282 },
        { 'description': 'some team', 'name': 'PHL', 'totfd': 26, 'totyds': 396, 'pyds': 282 }]
hash_names = hash.map { |team| team[:name] }
Now that we have a list of names in the hash, we can find out where there is an overlap. We can add the two arrays together and figure out who shows up more than once. There are many ways to do that, but we should keep our run time at O(n):
list = ["ARI","CHIC","GB","NYG","DET","WASH","PHL","PITT","STL","SF","CLEV","IND","DAL"]
teams_in_both = (list + hash_names).group_by { |team| team }.keep_if { |_, occ| occ.size > 1 }.map(&:first)
Now we have a list of:
["PHL", "CLEV"]
We know enough to say who's important to us and can fetch the remaining data accordingly.
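As a footnote, Ruby's Array#& intersection gives the same overlap more tersely, and a Set keeps the final record fetch linear (a sketch, reusing the hash, hash_names and list variables from above):
require 'set'

teams_in_both = list & hash_names  # => ["PHL", "CLEV"]

# Fetch the full record for each matched team; the Set makes include? O(1)
wanted = teams_in_both.to_set
matched_records = hash.select { |team| wanted.include?(team[:name]) }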

Logic for parsing names

I want to solve this problem, but I am unsure how to structure the logic correctly. I am given a list of user names and told to find an extracted name for each. So, for example, I'll see a list of user names such as this:
jason
dooley
smith
rob.smith
kristi.bailey
kristi.betty.bailey
kristi.b.bailey
robertvolk
robvolk
k.b.dula
kristidula
kristibettydula
kristibdula
kdula
kbdula
alexanderson
caesardv
joseluis.lopez
jbpritzker
jean-luc.vey
dvandewal
malami
jgarciathome
christophertroethlisberger
How can I then turn each user name into an extracted name? The only parameter I am given is that every user name is guaranteed to have at least a partial person's name.
So for example, kristi.bailey would be turned into "Kristi Bailey"
alexanderson would be turned into "Alex Anderson"
So, the pattern I see is that if I see one period, I will split the string in two (possibly a first and last name). If I see two periods, then it will be first, middle, and last. The part I am having trouble finding the logic for is when the name is just clumped together, like alexanderson or jgarciathome. How can I turn that into an extracted name? I was thinking of separating the names whenever I see two consonants and a vowel in a row, but I don't think that'll work.
Any ideas?
I'd use the string.StartsWith and string.EndsWith methods and determine the maximum overlap on each. As long as it's more than 2 characters, call that the common name. Sort the names into buckets based on the common name. It's a naive implementation, but that's where I'd start.
Example:
string name1 = "kristi.bailey";
string name2 = "kristi.betty.bailey";

// We've got a 6 character overlap for the first name:
bool firstOverlap = name2.StartsWith(name1.Substring(0, 6)); // true

// We've got a 6 character overlap for the last name:
bool lastOverlap = name2.EndsWith(name1.Substring(7)); // true
HTH!

Condense nested for loop to improve processing time with text analysis python

I am working on an untrained classifier model, in Python 2.7. I have a loop that looks like this:
features = [0 for i in xrange(len(dictionary))]
for bgrm in new_scored:
    for i in xrange(len(dictionary)):
        if bgrm[0] == dictionary[i]:
            features[i] = int(bgrm[1])
            break
I have a "dictionary" of bigrams that I have collected from a data set containing customer reviews and I would like to construct feature arrays of each review corresponding to the dictionary I have created. It would contain the frequencies of the bigrams found within the review of the features in the dictionary (I hope that makes sense). new_scored is a list of tuples which contains the bigrams found within a particular review paired with their relative frequency of occurrence in that review. The final feature arrays will be the same length as the original dictionary with few non zero entries.
The above works fine but I am looking at a data set of 13000 reviews, for each review to loop through this code is going to take for eeever (if my computer doesnt run out of RAM first). I have been sitting with it for a while and cannot see how I can condense it.
I am very new to python so I was hoping a more experienced could help with condensing it or perhaps point me in the right direction towards a library that will contain the function I need.
Thank you in advance!
Consider making dictionary an actual dict object (or some fancier subclass of dict if it better suits your needs), as opposed to an iterable (list or tuple seems like what it is now). dictionary could map bigrams as keys to an integer identifier that would identify a feature position.
If you refactor dictionary that way, then the loop can be rewritten as:
features = [0 for key in dictionary]
for bgram in new_scored:
    try:
        features[dictionary[bgram[0]]] = int(bgram[1])
    except KeyError:
        # do something if the bigram is not in the dictionary for some reason
        pass
This should convert what was an O(n) traversal through dictionary into an O(1) average-time hash lookup.
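If your current dictionary is a list of bigrams, building that mapping is a one-liner (a sketch; bigram_list stands in for whatever iterable holds the bigrams now):
# bigram_list is hypothetical: the existing list/tuple of bigrams
dictionary = {bigram: i for i, bigram in enumerate(bigram_list)}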
Hope this helps.

Using Nokogiri with multiple search elements

In this XML snippet I need to replace the data in the UID for some of the blocks. The actual file contains more than 100 similar blocks.
Although I have been able to extract subsets based on name="Track (TimeLine)", I am struggling to reduce this subset to the specific block I need by also using the data in <TrackID>: if name="Track (TimeLine)" and the text of <TrackID> is 0x1200, then set UID to xxxx.
I am new to Nokogiri and, although I write test scripts, I do not consider myself a programmer.
<StructuralMetadata key="06.0E.2B.34.02.53.01.01.0D.01.01.01.01.01.3B.00" length="116" name="Track (TimeLine)">
  <EditRate>25/1</EditRate>
  <Origin>0</Origin>
  <Sequence>32-04-25-67-E7-A7-86-4A-9B-28-53-6F-66-74-65-6C</Sequence>
  <TrackID>0x1200</TrackID>
  <TrackName>Softel VBI Data</TrackName>
  <TrackNumber>0x17010101</TrackNumber>
  <UID>34-C1-B9-B9-5F-07-A4-4E-8F-F4-53-6F-66-74-65-6C</UID>
</StructuralMetadata>
<StructuralMetadata key="06.0E.2B.34.02.53.01.01.0D.01.01.01.01.01.3B.00" length="116" name="Track (TimeLine)">
  <EditRate>25/1</EditRate>
  <Origin>0</Origin>
  <Sequence>35-12-2D-86-E6-74-0B-4C-B4-24-53-6F-66-74-65-6C</Sequence>
  <TrackID>0x1300</TrackID>
  <TrackName>Softel VBI Data</TrackName>
  <TrackNumber>0x0</TrackNumber>
  <UID>37-0C-80-34-4C-8D-CE-41-85-F3-53-6F-66-74-65-6C</UID>
</StructuralMetadata>
Using XPath:
//StructuralMetadata
will select all StructuralMetadata elements in your XML. The double slash at the start means to select nodes wherever they appear in the document.
You don't want all the nodes though, you can filter the ones you want with a predicate:
//StructuralMetadata[@name="Track (TimeLine)" and TrackID="0x1200"]
This will select all StructuralMetadata elements that have a name attribute with the value Track (TimeLine), and a TrackID child element with contents 0x1200.
As you're interested in the UID element, you can further refine the expression:
//StructuralMetadata[@name="Track (TimeLine)" and TrackID="0x1200"]/UID
This expression will match all the UID elements that are children of StructuralMetadata elements that match the predicate described above.
Putting this to use:
require 'nokogiri'
# Parse the document, assuming xml_file is a File object containing the XML
doc = Nokogiri::XML(xml_file)
# I'm assuming there is only one element in the document that matches
# the criteria, so I'm using at_xpath
node = doc.at_xpath('//StructuralMetadata[@name="Track (TimeLine)" and TrackID="0x1200"]/UID')
# At this point, doc contains a representation of the xml, and node points to
# the UID node within that representation. We can update the contents of
# this node
node.content = 'XXX'
# Now write out the updated XML. This just writes it to standard output,
# you could write it to a file or elsewhere if needed
puts doc.to_xml
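If you do want the result on disk, one way (the output filename here is made up):
File.write('updated.xml', doc.to_xml)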
A great way to approach this problem is with the ‘map reduce’ style of programming, which works to take a large list of things and narrow it down and combine it into the result you're after. Specifically, Array#find and Array#select are really useful for this sort of problem. Check out this example:
require 'nokogiri'

xml = Nokogiri::XML.parse(File.read("sample.xml"))
element = xml.css('StructuralMetadata').find { |item|
  item['name'] == "Track (TimeLine)" and item.css('TrackID').text == "0x1200"
}
puts element.to_xml
This little program first uses the CSS selector to get all of the <StructuralMetadata> elements in the document. It returns an array, which we can filter to just what we want using the Array#find method. Array#select is its cousin which returns an array of all the matching objects instead of the first one it happens to find.
Inside the block we have a test to check if the <StructuralMetadata> tag is the one we’re after. Then it puts the element.to_xml string to the console so you can see which element it found if you run this as a command-line script. Now that you can find the element, you can modify it in the usual way and save out a new XML file.
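To finish the job with this approach, something like the following should work ('xxxx' is the placeholder value from the question, and the output filename is made up):
element.at_css('UID').content = 'xxxx'
File.write('sample-updated.xml', xml.to_xml)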

Selecting random phrase from a list

I've been playing around with a .lua file which passes a random phrase using the following line:
SendChatMessage(GetRandomArgument("text1", "text2", "text3", "text4"), "RAID")
My problem is that I have a lot of phrases and the one line of code is very long indeed.
Is there a way to hold
text1
text2
text3
text4
in a list somewhere else in the code (or externally) and call a random value from the main code? That would make maintaining the list of text options easier.
For lists up to a few hundred elements, the following will work:
messages = {
  "text1",
  "text2",
  "text3",
  "text4",
  -- ...
}
SendChatMessage(GetRandomArgument(unpack(messages)), "RAID")
For longer lists, you would be well served to replace GetRandomArgument with GetRandomElement that would take a single table as its argument and return a random entry from the table.
Edit: Olle's answer shows one way that something like GetRandomElement might be implemented. But it uses table.getn on every call, which is deprecated in Lua 5.1, and its replacement (table.maxn) has a runtime cost proportional to the number of elements in the table.
The function table.maxn is only required if the table in use might have missing elements in its array part. However, in this case of a list of items to choose among, there is likely to be no reason to need to allow holes in the list. If you need to edit the list at run time, you can always use table.remove to remove an item since it will also close the gap.
With a guarantee of no gaps in the array of text, then you can implement GetRandomElement like this:
function GetRandomElement(a)
  return a[math.random(#a)]
end
So that you send the message like this:
SendChatMessage(GetRandomElement(messages), "RAID")
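If you'd rather keep the list outside the main code entirely, one option is a separate Lua file that returns the table (a sketch; messages.lua is a hypothetical filename):
-- messages.lua
return {
  "text1",
  "text2",
  -- ...
}

-- in the main script
local messages = dofile("messages.lua")
SendChatMessage(GetRandomElement(messages), "RAID")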
You want a table to contain your phrases, like
phrases = { "text1", "text2", "text3" }
table.insert(phrases, "text4") -- alternative syntax
SendChatMessage(phrases[math.random(table.getn(phrases))], "RAID")
Note: getn gets the size of the table; math.random gets a random number (with a max of the size of the phrases table) and the phrases[] syntax returns the table element at the index inside [].
