Match Multiple Patterns in a String and Return Matches as Hash - ruby

I'm working with some log files, trying to extract pieces of data.
Here's an example of a file which, for the purposes of testing, I'm loading into a variable named sample. NOTE: The column layout of the log files is not guaranteed to be consistent from one file to the next.
sample = "test script result
Load for five secs: 70%/50%; one minute: 53%; five minutes: 49%
Time source is NTP, 23:25:12.829 UTC Wed Jun 11 2014
D
MAC Address IP Address MAC RxPwr Timing I
State (dBmv) Offset P
0000.955c.5a50 192.168.0.1 online(pt) 0.00 5522 N
338c.4f90.2794 10.10.0.1 online(pt) 0.00 3661 N
990a.cb24.71dc 127.0.0.1 online(pt) -0.50 4645 N
778c.4fc8.7307 192.168.1.1 online(pt) 0.00 3960 N
"
Right now, I'm just looking for IPv4 and MAC addresses; eventually the search will need to include more patterns. To accomplish this, I'm using two regular expressions and passing them to Regexp.union:
patterns = Regexp.union(/(?<mac_address>\h{4}\.\h{4}\.\h{4})/, /(?<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/)
As you can see, I'm using named groups to identify the matches.
The result I'm trying to achieve is a Hash. The key should equal the capture group name, and the value should equal what was matched by the regular expression.
Example:
{"mac_address"=>"0000.955c.5a50", "ip_address"=>"192.168.0.1"}
{"mac_address"=>"338c.4f90.2794", "ip_address"=>"10.10.0.1"}
{"mac_address"=>"990a.cb24.71dc", "ip_address"=>"127.0.0.1"}
{"mac_address"=>"778c.4fc8.7307", "ip_address"=>"192.168.1.1"}
Here's what I've come up with so far:
sample.split(/\r?\n/).each do |line|
  hashes = []
  line.split(/\s+/).each do |val|
    match = val.match(patterns)
    if match
      hashes << Hash[match.names.zip(match.captures)].delete_if { |k, v| v.nil? }
    end
  end
  results = hashes.reduce({}) { |r, h| h.each { |k, v| r[k] = v }; r }
  puts results if results.length > 0
end
I feel like there should be a more "elegant" way to do this. My chief concern, though, is performance.
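For reference, here is a more condensed sketch of the same per-token matching, reusing the patterns variable defined above (this is not necessarily faster, just tighter):

# Sketch: same approach, condensed. Builds one hash per line and
# drops lines that matched nothing.
results = sample.each_line.map do |line|
  line.split(/\s+/).each_with_object({}) do |val, h|
    if (m = val.match(patterns))
      m.names.zip(m.captures).each { |k, v| h[k] = v if v }
    end
  end
end.reject(&:empty?)

results.each { |h| p h }
#=> {"mac_address"=>"0000.955c.5a50", "ip_address"=>"192.168.0.1"}
#   ...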

Related

Ruby Zlib compression gives different outputs for the same input

I have this ruby method for compressing a string -
require 'zlib'
require 'stringio'

def compress_data(data)
  output = StringIO.new
  gz = Zlib::GzipWriter.new(output)
  gz.write(data)
  gz.close
  output.string
end
When I call this method with the same input, I get different outputs at different times. I am trying to get the byte array for the compressed outputs and compare them.
The output is Different when I run the following:
input = "hello world"
output1 = compress_data(input).bytes.to_a
sleep 1
output2 = compress_data(input).bytes.to_a
if output1 == output2
  puts 'Same'
else
  puts 'Different'
end
The output is Same when I remove the sleep. Does the compression algorithm have something to do with the current time?
Option 1 - fixed mtime:
Yes. The compression time is stored in the header. You can use the mtime method to set the time to a fixed value, which will resolve your problem:
gz = Zlib::GzipWriter.new(output)
gz.mtime = 1
gz.write(data)
gz.close
Note that the Ruby documentation says that setting mtime to zero will disable the timestamp. I tried it, and it does not work; I also looked at the source code, and it appears this functionality is missing. It seems like a bug, so you have to set mtime to something other than 0 (but see the comments below - it will be fixed in future releases).
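Put together as a self-contained sketch (compress_deterministic is my name for the variant, not part of the original code):

require 'zlib'
require 'stringio'

def compress_deterministic(data)
  output = StringIO.new
  gz = Zlib::GzipWriter.new(output)
  gz.mtime = 1          # pin the header timestamp so output is reproducible
  gz.write(data)
  gz.close
  output.string
end

a = compress_deterministic("hello world")
sleep 1
b = compress_deterministic("hello world")
puts a == b ? 'Same' : 'Different'   #=> Same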
Option 2 - skip the header:
Another option is to just skip the header when checking for similar data. The header is 10 bytes long, so to only check the data:
data = compress_data(input).bytes[10..-1]
Note that you do not need to call to_a on bytes. It is already an Array:
String.bytes -> an_array
Returns an array of bytes in str. This is a shorthand for str.each_byte.to_a.
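For completeness, Option 2 as a runnable check using the question's compress_data method (a sketch; everything after the 10-byte header - the deflate payload, CRC, and size trailer - should be deterministic for identical input):

# Sketch: compare compressed output while ignoring the 10-byte gzip header.
a = compress_data("hello world").bytes[10..-1]
sleep 1
b = compress_data("hello world").bytes[10..-1]
puts a == b ? 'Same' : 'Different'   #=> Same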

Ruby splitting a record into multiple records based on contents of a field

Record layout contains two fields:
Requisition
Test Names
Example record:
R00000001,"4 Calprotectin, 1 Luminex xTAG, 8 H. pylori stool antigen (IgA), 9 Lactoferrin, 3 Anti-gliadin IgA, 10 H. pylori Panel, 6 Fecal Fat, 11 Antibiotic Resistance Panel, 2 C. difficile Tox A/ Tox B, 5 Elastase, 7 Fecal Occult Blood, 12 Shigella"
The current Ruby code snippet that is used in the LIMS (Lab Info Management System) system is this:
subj.get_value('Tests').join(', ')
What I need to be able to do in the Ruby code snippet is create a new record off each comma-separated value in the second field.
NOTE:
the number of values in the 'Test Names' field varies from 1 to 20...or more.
There can be hundreds of Requisition records
Final result would be:
R00000001,"4 Calprotectin"
R00000001,"1 Luminex xTAG"
R00000001,"8 H. pylori stool antigen (IgA)"
R00000001,"9 Lactoferrin"
R00000001,"3 Anti-gliadin IgA"
R00000001,"10 H. pylori Panel"
R00000001,"6 Fecal Fat"
R00000001,"11 Antibiotic Resistance Panel"
R00000001,"2 C. difficile Tox A/ Tox B"
R00000001,"5 Elastase"
R00000001,"7 Fecal Occult Blood"
R00000001,"12 Shigella"
If your data reliably looks like the string shown in your example, here's a method:
data = subj.get_value('Tests').join(', ') # assuming this gives your string object
def split_data(data)
  arr = data.gsub('"', '').split(',')
  arr[1..-1].map { |l| %Q{#{arr[0]},"#{l.strip}"} }
end
puts split_data(data)
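For a version that doesn't depend on stripping quotes by hand, the standard CSV library can separate the requisition from the quoted test list first. A sketch (split_records is a hypothetical name):

require 'csv'

def split_records(line)
  # CSV handles the quoting, so the requisition and the test list
  # come apart cleanly even though the second field contains commas.
  req, tests = CSV.parse_line(line)
  tests.split(',').map { |t| %Q{#{req},"#{t.strip}"} }
end

puts split_records('R00000001,"4 Calprotectin, 1 Luminex xTAG"')
# R00000001,"4 Calprotectin"
# R00000001,"1 Luminex xTAG"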

Create array from csv using readlines ruby

I can't seem to get this to work.
I know I can do this with the csv gem, but I'm trying out new stuff and I want to do it this way. All I'm trying to do is read lines in from a CSV file and then create one array from each line. I then want to print the third element of each array (the Reference column).
So far I have
filed = "/Users/me/Documents/Workbook3.csv"
if File.exists?(filed)
  File.readlines(filed).map { |d| puts d.split(",").to_a }
else
  puts "No file here"
end
The problem is that this creates one array which has all the lines in it whereas I want a separate array for each line (perhaps an array of arrays?)
Test data
Trade date,Settle date,Reference,Description,Unit cost (p),Quantity,Value (pounds)
04/09/2014,09/09/2014,S5411,Plus500 Ltd ILS0.01 152 # 419,419,152,624.93
02/09/2014,05/09/2014,B5406,Biomarin Pharmaceutical Com Stk USD0.001 150 # 4284.75,4284.75,150,-6439.08
29/08/2014,03/09/2014,S5398,Hargreaves Lansdown plc Ordinary 0.4p 520 # 1116.84,1116.84,520,5795.62
What I would like
S5411
B5406
S5398
Let's write your data to a file:
s =<<THE_BITTER_END
Trade date,Settle date,Reference,Description,Unit cost (p),Quantity,Value (pounds)
04/09/2014,09/09/2014,S5411,Plus500 Ltd ILS0.01 152 # 419,419,152,624.93
02/09/2014,05/09/2014,B5406,Biomarin Pharmaceutical Com Stk USD0.001 150 # 4284.75,4284.75,150,-6439.08
29/08/2014,03/09/2014,S5398,Hargreaves Lansdown plc Ordinary 0.4p 520 # 1116.84,1116.84,520,5795.62
THE_BITTER_END
IO.write('temp',s)
#=> 363
We can then do this:
arr = File.readlines('temp').map { |s| s.split(',') }
#=> [["Trade date", "Settle date", "Reference", "Description", "Unit cost (p)",
"Quantity", "Value (pounds)\n"],
["04/09/2014", "09/09/2014", "S5411",
"Plus500 Ltd ILS0.01 152 # 419", "419", "152", "624.93\n"],
["02/09/2014", "05/09/2014", "B5406",
"Biomarin Pharmaceutical Com Stk USD0.001 150 # 4284.75",
"4284.75", "150", "-6439.08\n"],
["29/08/2014", "03/09/2014", "S5398",
"Hargreaves Lansdown plc Ordinary 0.4p 520 # 1116.84", "1116.84",
"520", "5795.62\n"]]
The values you want begin with the second element of arr and are the third element in each of those arrays. Therefore, you can pluck them out as follows:
arr[1..-1].map { |a| a[2] }
#=> ["S5411", "B5406", "S5398"]
Adopting @Stefan's suggestion of putting [2] within the block containing split, we can write this more compactly as follows:
File.readlines('temp')[1..-1].map { |s| s.split(',')[2] }
#=> ["S5411", "B5406", "S5398"]
You can also use the built-in CSV class to do this very easily.
require "csv"
s =<<THE_BITTER_END
Trade date,Settle date,Reference,Description,Unit cost (p),Quantity,Value (pounds)
04/09/2014,09/09/2014,S5411,Plus500 Ltd ILS0.01 152 # 419,419,152,624.93
02/09/2014,05/09/2014,B5406,Biomarin Pharmaceutical Com Stk USD0.001 150 # 4284.75,4284.75,150,-6439.08
29/08/2014,03/09/2014,S5398,Hargreaves Lansdown plc Ordinary 0.4p 520 # 1116.84,1116.84,520,5795.62
THE_BITTER_END
arr = CSV.parse(s, :headers=>true).collect { |row| row["Reference"] }
p arr
#=> ["S5411", "B5406", "S5398"]
PS: I have borrowed the string from @Cary's answer.
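If you'd rather read straight from the file than from a string, CSV.foreach streams one row at a time (a sketch, reusing the filed path variable from the question):

require 'csv'

refs = []
CSV.foreach(filed, headers: true) { |row| refs << row["Reference"] }
p refs
#=> ["S5411", "B5406", "S5398"]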

Ruby - URL to Markdown

TOTAL rookie here.
I'm working on customizing a script made by Brett Terpstra - http://brettterpstra.com/2013/11/01/save-pocket-favorites-to-nvalt-with-ifttt-and-hazel/
Mine is a different use: I'd like to save my pinboard bookmarks with a specific tag to a file in dropbox in Markdown.
I feed it a text file such as:
Title: Yesterday is over.
URL: http://www.jonacuff.com/blog/want-to-change-the-world-get-doing/
Tags: 2md, 2wcx, 2pdf
Date: June 20, 2013 at 06:20PM
Image: notused
Excerpt: You can't start the next chapter of your life if you keep re-reading the last one.
And it outputs the markdown file.
Everything works great except when the 'excerpt' (see above) is more than one line. Sometimes it's a couple of paragraphs. When that happens, it stops working. When I hit enter from the command line, it's still waiting for more input.
Here's an example of a file that it doesn't work on:
Title: Talking ’bout my Generation.
URL: http://blog.greglaurie.com/?p=8881
Tags: 2md, 2wcx, 2pdf
Date: June 28, 2013 at 09:46PM
Image: notused
Excerpt: Contrast two men from the 19th century: Max Jukes and Jonathan Edwards.
Max Jukes lived in New York. He did not believe in Christ or in raising his children in the way of the Lord. He refused to take his children to church, even when they asked to go. Of his 1,026 descendants:
•300 were sent to prison for an average term of 13 years
•190 were prostitutes
•680 were admitted alcoholics
His family, thus far, has cost the state in excess of $420,000 and has made no contribution to society.
Jonathan Edwards also lived in New York, at the same time as Jukes. He was known to have studied 13 hours a day and, in spite of his busy schedule of writing, teaching, and pastoring, he made it a habit to come home and spend an hour each day with his children. He also saw to it that his children were in church every Sunday. Of his 929 descendants:
•430 were ministers
•86 became university professors
•13 became university presidents
•75 authored good books
•7 were elected to the United States Congress
•1 was Vice President of the United States
Edwards’ family never cost the state one cent.
We tend to think that our decisions only affect ourselves, but they have ramifications for generations to come.
Here's a screenshot of what it looks like after I run the command: https://www.dropbox.com/s/i9zg483k7nkdp6f/Screenshot%202013-11-22%2016.39.17.png
I'm hoping it's something easy. Any ideas?
#!/usr/bin/env ruby
# Works with IFTTT recipe https://ifttt.com/recipes/125999
#
# Set Hazel to watch the folder you specify in the recipe.
# Make sure nvALT is set to store its notes as individual files.
# Edit the $target_folder variable below to point to your nvALT
# notes folder.
require 'date'
require 'open-uri'
require 'net/http'
require 'fileutils'
require 'cgi'

$target_folder = "~/Dropbox/messx/urls2md"

def url_to_markdown(url)
  res = Net::HTTP.post_form(URI.parse("http://heckyesmarkdown.com/go/"), {'u' => url, 'read' => '1'})
  if res.code.to_i == 200
    res.body
  else
    false
  end
end

file = ARGV[0]

begin
  input = IO.read(file).force_encoding('utf-8')
  headers = {}
  input.each_line {|line|
    key, value = line.split(/: /)
    headers[key] = value.strip || ""
  }
  outfile = File.join(File.expand_path($target_folder), headers['Title'].gsub(/["!*?'|]/, '') + ".txt")
  date = Time.now.strftime("%Y-%m-%d %H:%M")
  date_added = Date.parse(headers['Date']).strftime("%Y-%m-%d %H:%M")
  content = "Title: #{headers['Title']}\nDate: #{date}\nDate Added: #{date_added}\nSource: #{headers['URL']}\n"
  tags = false
  if headers['Tags'].length > 0
    tag_arr = headers['Tags'].split(", ")
    tag_arr.map! {|tag|
      %Q{"#{tag.strip}"}
    }
    tags = tag_arr.join(" ")
    content += "Keywords: #{tags}\n"
  end
  markdown = url_to_markdown(headers['URL']).force_encoding('utf-8')
  if markdown
    content += headers['Image'].length > 0 ? "\n\n> #{headers['Excerpt']}\n\n---#{markdown}\n" : "\n\n" + markdown
  else
    content += headers['Image'].length > 0 ? "\n\n![](#{headers['Image']})\n\n#{headers['Excerpt']}\n" : "\n\n" + headers['Excerpt']
  end
  File.open(outfile, 'w') {|f|
    f.puts content
  }
  if tags && File.exists?("/usr/local/bin/openmeta")
    %x{/usr/local/bin/openmeta -a #{tags} -p "#{outfile}"}
  end
  # FileUtils.rm(file)
rescue Exception => e
  puts e
end
How about this? Modify your input.each_line area accordingly:
headers = {}
key = nil
input.each_line do |line|
  match = /^(?<key>\w+)\s*:\s*(?<value>.*)/.match(line)
  if match
    key = match[:key].strip
    headers[key] = match[:value].strip
  else
    headers[key] += line
  end
end
First, splitting on just ":" is dangerous, since a colon can appear in the content. Instead, a regex of /^\w+:.*/ (simplified from the code above) will only match lines of the form "Word: Content". Since the lines after "Excerpt:" aren't prefixed with a key, you need to hang on to the last seen key and just append when a line has no key of its own. You may need to add a newline in there, depending on what you're doing with that header information, but it seems to work.
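To see the fix in action, here is the same logic wrapped in a method and run on a made-up two-line excerpt (parse_headers is my name, not from the answer):

def parse_headers(input)
  headers = {}
  key = nil
  input.each_line do |line|
    match = /^(?<key>\w+)\s*:\s*(?<value>.*)/.match(line)
    if match
      key = match[:key].strip
      headers[key] = match[:value].strip
    else
      headers[key] += line   # continuation line: append to the last key
    end
  end
  headers
end

h = parse_headers("Title: Example\nExcerpt: First line.\nSecond line continues.\n")
p h["Excerpt"]
#=> "First line.Second line continues.\n"
# Note the missing newline between the joined lines - the caveat above.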

Sorting and Balancing Across Multiple Columns

Problem
I have a Hash of data that looks something like this.
{ "GROUP_A" => [22, 440],
"GROUP_B" => [14, 70],
"GROUP_C" => [60, 620],
"GROUP_D" => [174, 40],
"GROUP_E" => [4, 12]
# ...few hundred more
}
GROUP_A has 22 accounts and they are using 440GB of data...and so on. There are a couple hundred of these groups. Some have a lot of accounts but use very little storage and some have only a few users and use A LOT of storage, some are just average.
I have X number of buckets (servers) that I want to put these groups of accounts into, and I want there to be approximately the same number of accounts per bucket and have each bucket also contain approximately the same amount of data. Number of groups is not important, so if a bucket had 1 group of 1000 accounts using 500GB of data and the next bucket had 10 groups of 97 accounts (970 total) using 450GB of data...I'd call it good.
So far I haven't come up with an algorithm that will do this. In my mind, I'm thinking of something like this, perhaps?
PASS 1
Bucket 1: Group with largest data, 60 users.
Bucket 2: Next largest data group, 37 users.
Bucket 3: Next largest data group, 72 users.
Bucket 4: etc....
PASS 2
Bucket 1: Add a group with small amount of data, but more users than average.
# There's probably a ratio I can calculate to figure this out...divide users/data maybe?
Bucket 2: Find a "small data" group where sum of users in Bucket 1 ~= sum of users in Bucket 2
# But then there's no guarantee that the data usages will be close enough
Bucket 3: etc...
PASS 3
Bucket 1: Now what? Back to next largest data group?
I still think there's a better way to figure this out but it's not coming to me. If anyone has any thoughts I'm open to suggestions.
Matt
Solution 1.1 - Brute Force Update
Well....here's an update to the first attempt. This is still not a "knapsack-problem" solution - just brute forcing the data so the accounts balance across buckets. This time I added some logic so that if a bucket is proportionally fuller on accounts than on data, it will find the largest group (by data) that still fits, based on number of accounts. I get a much better distribution of data now vs. my first attempt (see the edit history if you want to look at the first attempt).
Right now I load each bucket in sequence, filling bucket one, then bucket two, etc... I think if I was to modify the code so that I filled them simultaneously (or nearly so) I'd get a better data balance.
e.g. 1st department into bucket 1, 2nd department into bucket 2, etc...until all buckets have one department... Then start back with bucket 1 again.
dept_arr_sorted_by_acct = dept_hsh.sort_by { |key, value| value[0] }
ap "MAX ACCTS: #{max_accts} AVG ACCTS: #{avg_accts}"
ap "MAX SIZE: #{max_size} AVG SIZE: #{avg_data}"
# puts dept_arr_sorted_by_acct
# exit
bucket_arr = Array.new
used_hsh = Hash.new
server_names.each do |s|
  bucket_hsh = Hash.new
  this_accts = 0
  this_data = 0
  my_key = ""
  my_val = []
  accts = 0
  data = 0
  accts_space_pct_used = 0
  data_space_pct_used = 0
  while this_accts < avg_accts
    if accts_space_pct_used <= data_space_pct_used
      # This loop runs if the % used of accts is less than % used of data
      dept_arr_sorted_by_acct.each do |val|
        # Sorted by num accts - ascending. Loop until we find the last entry
        # in the array that has <= accts than what we need
        next if used_hsh.has_key?(val[0])
        if val[1][0] <= avg_accts - this_accts
          my_key = val[0]
          my_val = val[1]
          accts = val[1][0]
          data = val[1][1]
        end
      end
    else
      # This loop runs if the % used of data is less than % used of accts
      dept_arr_sorted_by_data = dept_arr_sorted_by_acct.sort { |a, b| b[1][1] <=> a[1][1] }
      dept_arr_sorted_by_data.each do |val|
        # Sorted by size - descending. Find the first (largest data) entry
        # where accts <= what we need
        next if used_hsh.has_key?(val[0])
        if val[1][0] <= avg_accts - this_accts
          my_key = val[0]
          my_val = val[1]
          accts = val[1][0]
          data = val[1][1]
          break
        end
      end
    end
    used_hsh[my_key] = my_val
    bucket_hsh[my_key] = my_val
    this_accts = this_accts + accts
    this_data = this_data + data
    accts_space_pct_used = this_accts.to_f / avg_accts * 100
    data_space_pct_used = this_data.to_f / avg_data * 100
  end
  bucket_arr << [this_accts, this_data, bucket_hsh]
end

x = 0
while x < bucket_arr.size do
  th = bucket_arr[x][2]
  list_of_depts = []
  th.each_key do |key|
    list_of_depts << key
  end
  ap "Bucket #{x}: #{bucket_arr[x][0]} accounts :: #{bucket_arr[x][1]} data :: #{list_of_depts.size} departments"
  # ap list_of_depts
  x = x + 1
end
...and the results...
"MAX ACCTS: 2279 AVG ACCTS: 379"
"MAX SIZE: 1693315 AVG SIZE: 282219"
"Bucket 0: 379 accounts :: 251670 data :: 7 departments"
"Bucket 1: 379 accounts :: 286747 data :: 10 departments"
"Bucket 2: 379 accounts :: 278226 data :: 14 departments"
"Bucket 3: 379 accounts :: 281292 data :: 19 departments"
"Bucket 4: 379 accounts :: 293777 data :: 28 departments"
"Bucket 5: 379 accounts :: 298675 data :: 78 departments"
(379 * 6 != 2279.) I still need to figure out how to handle the case where MAX_ACCTS is not evenly divisible by the number of buckets. I tried adding a 1% pad to the AVG_ACCTS value, which in this case means the average would be 383, I think, but then all the buckets say they have 383 accounts in them...which can't be true, because then there would be more accounts in the buckets than MAX_ACCTS. I've got a mistake in the code somewhere that I haven't found yet.
This is an example of the knapsack problem. There are a few approaches, but it's a genuinely tricky problem, and it's better to research a good existing solution than to try to invent your own.
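As a starting point short of a full knapsack-style solver, a common greedy heuristic is to sort the groups by data descending and always drop the next group into the least-loaded bucket, scoring load on both dimensions at once. A sketch, assuming the dept_hsh, avg_accts, and avg_data values from the question and six buckets as in the output above:

# Greedy heuristic (not optimal): place each group, largest data first,
# into the bucket whose combined accounts+data load is currently lowest.
buckets = Array.new(6) { { accts: 0, data: 0, groups: [] } }

dept_hsh.sort_by { |_, (_, data)| -data }.each do |name, (accts, data)|
  # Score each bucket as the sum of its fill fractions on both dimensions.
  target = buckets.min_by { |b| b[:accts] / avg_accts.to_f + b[:data] / avg_data.to_f }
  target[:groups] << name
  target[:accts] += accts
  target[:data]  += data
end

buckets.each_with_index do |b, i|
  puts "Bucket #{i}: #{b[:accts]} accounts :: #{b[:data]} data :: #{b[:groups].size} departments"
end

Because every group goes into whichever bucket is furthest behind, account counts and data sizes tend to even out together without the sequential-fill drift described above.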
