Parsing text files in Ruby when the content isn't well formed - ruby

I'm trying to read files and create a hashmap of the contents, but I'm having trouble at the parsing step. An example of the text file is
put 3
returns 3
between
3
paragraphs 1
4
3
#foo 18
****** 2
The word becomes the key and the number is the value. Notice that the spacing is fairly erratic. The word isn't always a word (which doesn't get picked up by /\w+/) and the number associated with that word isn't always on the same line. This is why I'm calling it not well-formed. If there were one word and one number on one line, I could just split it, but unfortunately, this isn't the case. I'm trying to create a hashmap like this.
{"put"=>3, "#foo"=>18, "returns"=>3, "paragraphs"=>1, "******"=>2, "4"=>3, "between"=>3}
Coming from Java, it's fairly easy. Using Scanner I could just use scanner.next() for the next key and scanner.nextInt() for the number associated with it. I'm not quite sure how to do this in Ruby when it seems I have to use regular expressions for everything.

I'd recommend just using split, as in:
h = Hash[*s.split]
where s is your text (e.g. s = File.read('filename')). Believe it or not, this will give you precisely what you're after.
EDIT: I realized you wanted the values as integers. You can add that as follows:
h.each{|k,v| h[k] = v.to_i}
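Putting the two steps together, a minimal sketch might look like the following. The sample text mirrors the question's file, and the helper name parse_pairs is made up for illustration. It assumes the tokens strictly alternate key, value, key, value; Integer raises on a value that is missing or not a number, which is probably what you want for malformed input.

```ruby
# Sample input mirroring the question's file. In practice the text would come
# from File.read on the real file. The quoted heredoc tag disables interpolation.
text = <<~'TEXT'
  put 3
  returns 3
  between
  3
  paragraphs 1
  4
  3
  #foo 18
  ****** 2
TEXT

# split with no arguments breaks on any run of whitespace, newlines included,
# so the tokens simply alternate key, value, key, value.
# parse_pairs is a hypothetical helper name, not part of any library.
def parse_pairs(text)
  text.split.each_slice(2).map { |key, value| [key, Integer(value)] }.to_h
end

counts = parse_pairs(text)
puts counts.inspect
```

Using Integer instead of to_i is a design choice: to_i silently turns garbage into 0, while Integer fails loudly.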

Related

Logic for parsing names

I am wanting to solve this problem, but am kind of unsure how to correctly structure the logic for doing this. I am given a list of user names and I am told to find an extracted name for that. So, for example, I'll see a list of user names such as this:
jason
dooley
smith
rob.smith
kristi.bailey
kristi.betty.bailey
kristi.b.bailey
robertvolk
robvolk
k.b.dula
kristidula
kristibettydula
kristibdula
kdula
kbdula
alexanderson
caesardv
joseluis.lopez
jbpritzker
jean-luc.vey
dvandewal
malami
jgarciathome
christophertroethlisberger
How can I then turn each user name into an extracted name? The only parameter I am given is that every user name is guaranteed to have at least a partial person's name.
So for example, kristi.bailey would be turned into "Kristi Bailey"
alexanderson would be turned into "Alex Anderson"
So, the pattern I see is that if there is one period, I split the user name into two strings (possibly a first and last name). If there are two periods, it will be first, middle, and last. The part I am having trouble finding the logic for is when the name is clumped together, like alexanderson or jgarciathome. How can I turn that into an extracted name? I was thinking of something like separating the names whenever I see 2 consonants and a vowel in a row, but I don't think that'll work.
Any ideas?
I'd use a string.StartsWith method and a string.EndsWith method and determine the maximum overlap on each. As long as it's more than 2 characters, call that the common name. Sort them into buckets based on the common name. It's a naive implementation, but that's where I'd start.
Example:
string name1 = "kristi.bailey";
string name2 = "kristi.betty.bailey";
// We've got a 6 character overlap for first name:
name2.StartsWith(name1.Substring(0,6)) // this is true
// We've got a 6 character overlap for last name:
name2.EndsWith(name1.Substring(7)) // this is true
HTH!
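For readers working in Ruby, the same overlap idea could be sketched roughly as below. The helper names and the rule that both ends must overlap are assumptions made for illustration. Note that a raw character comparison also counts the separator and any shared initial, so the measured overlap for the two example names is longer than the first name alone.

```ruby
# Length of the longest shared prefix of two strings.
def common_prefix_length(a, b)
  limit = [a.length, b.length].min
  (0...limit).take_while { |i| a[i] == b[i] }.length
end

# Length of the longest shared suffix, found by comparing the reversed strings.
def common_suffix_length(a, b)
  common_prefix_length(a.reverse, b.reverse)
end

# Hypothetical bucketing rule: treat two user names as the same person when
# both the leading and the trailing overlap exceed min_overlap characters.
def same_person?(a, b, min_overlap: 2)
  common_prefix_length(a, b) > min_overlap &&
    common_suffix_length(a, b) > min_overlap
end

name1 = "kristi.bailey"
name2 = "kristi.betty.bailey"
puts common_prefix_length(name1, name2) # shared "kristi.b"
puts common_suffix_length(name1, name2) # shared ".bailey"
puts same_person?(name1, name2)
```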

Ruby - Files - gets method

I am following the book Wicked Cool Ruby Scripts.
There are two files: file_output = file_list.txt and oldfile_output = file_list.old. These two files contain the list of all files the program has gone through and is going to go through.
If a 'file_list.txt' file exists, it is renamed to the old file.
After that, I am not able to understand the code.
Apparently every line of the file is read and stored in the oldfile hash.
Can someone explain the code from the 4th line onward?
Also, why is gets used here? Why can't an .each method be used to read through every line?
if File.exists?(file_output)
  File.rename(file_output, oldfile_output)
  File.open(oldfile_output, 'rb') do |infile|
    while (temp = infile.gets)
      line = /(.+)\s{5,5}(\w{32,32})/.match(temp)
      puts "#{line[1]} ---> #{line[2]}"
      oldfile_hash[line[1]] = line[2]
    end
  end
end
Judging from the redundant use of quantifiers ({5,5} and {32,32}) in the regex (which would be better written as {5}, {32}), it looks like the person who wrote that code is not a professional Ruby programmer. So you can assume that the choice taken in the code is not necessarily the best.
As you pointed out, the code could have used each instead of while with gets. The latter is an old-school Ruby way of doing it, and there is nothing wrong with it. Until the end of the file is reached, gets returns a string; at the end of the file it returns nil, so the while loop behaves the same as each would: every iteration reads the next line.
It looks like each line is supposed to represent a key-value pair. The regex assumes that the key is not an empty string, that the key and the value are separated by five whitespace characters, and that the value consists of thirty-two word characters (letters, digits, or underscores). Each key-value pair is printed (perhaps to monitor progress) and stored in oldfile_hash, which is most likely a hash.
So the point of using .gets is to tell when the file is finished being read. Essentially, it's tied to the
while (condition)
....
end
block. So gets serves as a little method that keeps giving Ruby the next line of the file until there are no more lines to give.
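For comparison, here is a rough each_line version of the same loop. It reads from a StringIO with made-up sample data instead of a real file, so it runs on its own, and read_pairs is a hypothetical helper name. Two deliberate differences from the book's code: lines that do not match are skipped instead of raising on nil, and the quantifiers are written as {5} and {32}. Also note that File.exists? was removed in Ruby 3.2, so File.exist? is the spelling to use around code like this today.

```ruby
require "stringio"

# Same pattern as the original, with the quantifiers written compactly.
LINE_PATTERN = /(.+)\s{5}(\w{32})/

# Reads every line from any IO-like object and collects the two captures
# into a hash. read_pairs is a hypothetical helper name.
def read_pairs(io)
  pairs = {}
  io.each_line do |text|
    match = LINE_PATTERN.match(text)
    next unless match # skip lines that do not fit the expected layout

    pairs[match[1]] = match[2]
  end
  pairs
end

# Made-up sample: one well-formed line and one malformed line.
sample = StringIO.new("notes.txt" + " " * 5 + "a" * 32 + "\n" + "malformed line\n")
oldfile_hash = read_pairs(sample)
puts oldfile_hash.inspect
```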

Parsing one large array into several sub-arrays

I have a list of adjectives (found here), that I would like to be the basis for a "random_adjective(category)" method.
I'm really just taking a stab at this, as my first real attempt at a useful program.
Step 1: Open file, remove formatting. No problem.
list=File.read('adjectivelist')
list = list.gsub(/\n/, " ")
The next step is to break the string up by category.
words = list.split(" ")
Now I have an array of every word in the file. Neat. The ones with a tilde before them represent the category names.
Now I would like to break up this LARGE array into several smaller ones, based on category.
I need help with the syntax here, although the pseudocode for this would be something like
Scan the array for an element which begins with a tilde.
Now create a new array based on the name of that element sans the tilde, and ALSO place this "category name" into the "categories" array. Now pull all the elements from the main array, and pop them into the sub-array, until you meet another tilde. Then repeat the process until there are no more elements in the array.
Finally I would pull a random word from the category named in the parameter. If there was no category name matching the parameter, it would return false and exit (this is simply in case I want to add more categories later.)
Tips would be appreciated
You may want to go back and split first time around like this:
categories = list.split(" ~")
Then each list item will start with the category name. This will save you having to go back through your data structure as you suggest. Consider that a tip: sometimes it's better to re-think the start of a coding problem than to head inexorably forwards.
The structure you are reaching towards is probably a Hash, where the keys are category names, and the values are arrays of all the matching adjectives. It might look like this:
{
'category' => [ 'word1', 'word2', 'word3' ]
}
So you might do this:
words_in_category = Hash.new
categories.each do |category_string|
  cat_name, *words = category_string.split(" ")
  words_in_category[cat_name] = words
end
Finally, to pick a random element from an array, Ruby provides a very useful method sample, so you can just do this
words_in_category[ chosen_category ].sample
... assuming chosen_category contains the string name of an actual category. I'll leave it to you to figure out how to put this all together and handle errors, bad input, etc.
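One possible way to put those pieces together is sketched below. The sample word list and the helper name build_categories are made up for illustration. One detail worth watching: when the text starts with a category, split(" ~") leaves the tilde on the first chunk, so the sketch strips it.

```ruby
# Made-up sample data in the "~category word word ..." layout.
list = "~colors red green blue ~sizes tiny huge"

# Builds { "colors" => [...], "sizes" => [...] } from the flat string.
# build_categories is a hypothetical helper name.
def build_categories(list)
  list.split(" ~").each_with_object({}) do |chunk, result|
    cat_name, *words = chunk.split
    next if cat_name.nil? # ignore empty chunks

    # The first chunk keeps its leading "~", so strip it if present.
    result[cat_name.delete_prefix("~")] = words
  end
end

# Returns a random word from the category, or false for an unknown
# (or empty) category, as the question asks.
def random_adjective(words_in_category, category)
  words = words_in_category[category]
  return false if words.nil? || words.empty?

  words.sample
end

words_in_category = build_categories(list)
puts random_adjective(words_in_category, "colors")
puts random_adjective(words_in_category, "shapes")
```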
Use slice_before:
categories = list.split(" ").slice_before(/~\w+/)
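This will create a sub-array for each word starting with ~, containing that word and all the words before the next matching word. Note that slice_before returns an enumerator; call to_a on it if you need an actual array.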
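As a rough illustration of what slice_before produces, here is a sketch on made-up sample data. It uses a block instead of the regex and then turns the groups into a hash keyed by category name; the variable names are assumptions.

```ruby
# Made-up sample data in the "~category word word ..." layout.
list = "~colors red green blue ~sizes tiny huge"

# Start a new group at every word that begins with "~".
groups = list.split.slice_before { |word| word.start_with?("~") }.to_a
puts groups.inspect

# Turn each group into a [category, words] pair, dropping the tilde.
by_category = groups.map { |name, *words| [name.delete_prefix("~"), words] }.to_h
puts by_category.inspect
```

If the text has words before the first tilde, they end up in a first group with no category name, so real input may need an extra check.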
If this file format is your own and you are free to change it, then I recommend you save the data in YAML or JSON format and read it when needed. There are libraries to do this. That is all. No need to worry about the mess. Don't spend time reinventing the wheel.
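A minimal sketch of the YAML route, using Ruby's standard yaml library. The category names and words are made up, and the text is inlined here so the example runs on its own; with a real file you would pass the file's contents instead.

```ruby
require "yaml"

# Made-up data; in practice this text would live in its own .yml file.
yaml_text = <<~YAML
  colors:
    - red
    - green
  sizes:
    - tiny
    - huge
YAML

# safe_load only builds plain data (hashes, arrays, strings, numbers),
# which is all a word list needs.
words_in_category = YAML.safe_load(yaml_text)
puts words_in_category["colors"].sample
```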

Ruby on Rails - generating bit.ly style identifiers

I'm trying to generate UUIDs with the same style as bit.ly urls like:
http://bit [dot] ly/aUekJP
or cloudapp ones:
http://cl [dot] ly/1hVU
which are even smaller
how can I do it?
I'm now using the UUID gem for Ruby, but I'm not sure if it's possible to limit the length and get something like this.
I am currently using this:
UUID.generate.split("-")[0] => b9386070
But I would like something even shorter that I know will still be unique.
Any help would be pretty much appreciated :)
edit note: replaced dot letters with [dot] for workaround of banned short link
You are confusing two different things here. A UUID is a universally unique identifier. It has a very high probability of being unique even if millions of them were being created all over the world at the same time. It is generally displayed as a 36-character string (32 hexadecimal digits and 4 hyphens). You cannot keep just the first 8 characters and expect the result to be unique.
Bitly, tinyurl et al. store links and generate a short code to represent each link. They do not reconstruct the URL from the code; they look it up in a data store and return the corresponding URL. These are not UUIDs.
Without knowing your application it is hard to advise on which method to use. However, you could store whatever you are pointing at in a data store with a numeric key, and then convert the key to base 32 using the 10 digits and 22 lowercase letters, perhaps leaving out letters that invite typos, such as 'o', 'i', and 'l'.
EDIT
On further investigation, there is a Ruby base32 gem available that implements Douglas Crockford's Base 32 encoding.
A 5 character Base32 string can represent over 33 million integers and a 6 digit string over a billion.
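To make the idea concrete, here is a hand-rolled sketch of turning a numeric key into a base-32 code and back. It does not use the gem's API. The alphabet is a lowercase version of Crockford's (digits plus 22 letters, leaving out i, l, o, and u), and the function names are made up for illustration.

```ruby
# 10 digits plus 22 lowercase letters; i, l, o and u are left out.
ALPHABET = "0123456789abcdefghjkmnpqrstvwxyz".freeze

# Converts a non-negative integer key to a short base-32 code.
def encode_id(number)
  return ALPHABET[0] if number.zero?

  code = String.new
  while number.positive?
    number, remainder = number.divmod(32)
    code.prepend(ALPHABET[remainder])
  end
  code
end

# Converts a code back to the integer key. Characters outside the
# alphabet are not handled in this sketch.
def decode_id(code)
  code.each_char.reduce(0) { |total, char| total * 32 + ALPHABET.index(char) }
end

puts encode_id(6175601989)
puts decode_id(encode_id(6175601989))
puts 32**5 # number of distinct 5-character codes
```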
If you are working with numbers, you can use the built in ruby methods
6175601989.to_s(30)
=> "8e45ttj"
to go back
"8e45ttj".to_i(30)
=> 6175601989
So you don't have to store anything, you can always decode an incoming short_code.
This works ok for proof of concept, but you aren't able to avoid ambiguous characters like: 1lji0o. If you are just looking to use the code to obfuscate database record IDs, this will work fine. In general, short codes are supposed to be easy to remember and transfer from one medium to another, like reading it on someone's presentation slide, or hearing it over the phone. If you need to avoid characters that are hard to read or hard to 'hear', you might need to switch to a process where you generate an acceptable code, and store it.
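A rough sketch of that generate-and-store approach follows. The Set stands in for whatever data store you really use, the character list simply drops the ambiguous characters named above, and the names SAFE_CHARS and generate_code are made up for illustration.

```ruby
require "securerandom"
require "set"

# Digits and lowercase letters without the ambiguous 0, 1, i, j, l and o.
SAFE_CHARS = "23456789abcdefghkmnpqrstuvwxyz".chars.freeze

# Draws random codes until one is not already taken, then records it.
# `taken` is a Set standing in for a real data store with a uniqueness check.
def generate_code(taken, length: 6)
  loop do
    code = Array.new(length) { SAFE_CHARS[SecureRandom.random_number(SAFE_CHARS.size)] }.join
    # Set#add? returns nil when the code is already present, so we retry.
    return code if taken.add?(code)
  end
end

taken = Set.new
codes = Array.new(5) { generate_code(taken) }
puts codes.inspect
```

Uniqueness here comes from the check against the store, not from the randomness, so the store lookup and insert would need to be atomic in a real application.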
I found this to be short and reliable:
def create_uuid(prefix=nil)
  time = (Time.now.to_f * 10_000_000).to_i
  jitter = rand(10_000_000)
  key = "#{jitter}#{time}".to_i.to_s(36)
  [prefix, key].compact.join('_')
end
This spits out unique keys that look like this: '3qaishe3gpp07w2m'
Reduce the 'jitter' size to reduce the key size.
Caveat:
This is not guaranteed unique (use SecureRandom.uuid if you need a standard UUID with a negligible collision probability), but it is highly reliable:
10_000_000.times.map {create_uuid}.uniq.length == 10_000_000
The only way to guarantee uniqueness is to keep a global count and increment it for each use: 0000, 0001, etc.

Selecting random phrase from a list

I've been playing around with a .lua file which passes a random phrase using the following line:
SendChatMessage(GetRandomArgument("text1", "text2", "text3", "text4"), "RAID")
My problem is that I have a lot of phrases and the one line of code is very long indeed.
Is there a way to hold
text1
text2
text3
text4
in a list somewhere else in the code (or externally) and call a random value from the main code. Would make maintaining the list of text options easier.
For lists of up to a few hundred elements, the following will work:
messages = {
  "text1",
  "text2",
  "text3",
  "text4",
  -- ...
}
SendChatMessage(GetRandomArgument(unpack(messages)), "RAID")
For longer lists, you would be well served to replace GetRandomArgument with GetRandomElement that would take a single table as its argument and return a random entry from the table.
Edit: Olle's answer shows one way that something like GetRandomElement might be implemented. But it calls table.getn every time, which is deprecated in Lua 5.1 in favor of the length operator #. The alternative that tolerates holes, table.maxn, has a runtime cost proportional to the number of elements in the table.
The function table.maxn is only required if the table in use might have missing elements in its array part. However, in this case of a list of items to choose among, there is likely to be no reason to need to allow holes in the list. If you need to edit the list at run time, you can always use table.remove to remove an item since it will also close the gap.
With a guarantee of no gaps in the array of text, then you can implement GetRandomElement like this:
function GetRandomElement(a)
  return a[math.random(#a)]
end
So that you send the message like this:
SendChatMessage(GetRandomElement(messages), "RAID")
You want a table to contain your phrases like
phrases = { "text1", "text2", "text3" }
table.insert(phrases, "text4") -- alternative syntax
SendChatMessage(phrases[math.random(table.getn(phrases))], "RAID")
Note: getn gets the size of the table; math.random gets a random number (with a max of the size of the phrases table) and the phrases[] syntax returns the table element at the index inside [].
