n-way file merge in Ruby - ruby

I have several log files from a Java application (Gigaspaces logs), collected from multiple hosts, which I need to merge based on their date/time values.
Since every log file is already sorted, I need to read the first record from every file into an array, decide which one has the key with the minimum value, append that record to the result file, read a new record from the same file, and repeat.
A record's first line has a key and all following lines have no key. Example:
2015-04-05 02:33:42,135 GSC SEVERE [com.gigaspaces.lrmi] - LRMI Transport Protocol caught server exception caused by [/10.0.1.2:46949] client.; Caused by: java.lang.IllegalArgumentException
at java.nio.ByteBuffer.allocate(ByteBuffer.java:311)
at com.gigaspaces.lrmi.SmartByteBufferCache.get(SmartByteBufferCache.java:50)
at com.gigaspaces.lrmi.nio.Reader.readBytesFromChannelNoneBlocking(Reader.java:410)
at com.gigaspaces.lrmi.nio.Reader.readBytesNonBlocking(Reader.java:644)
at com.gigaspaces.lrmi.nio.Reader.bytesToStream(Reader.java:509)
at com.gigaspaces.lrmi.nio.Reader.readRequest(Reader.java:112)
at com.gigaspaces.lrmi.nio.ChannelEntry.readRequest(ChannelEntry.java:121)
at com.gigaspaces.lrmi.nio.Pivot.handleReadRequest(Pivot.java:445)
at com.gigaspaces.lrmi.nio.selector.handler.ReadSelectorThread.handleRead(ReadSelectorThread.java:81)
at com.gigaspaces.lrmi.nio.selector.handler.ReadSelectorThread.handleConnection(ReadSelectorThread.java:45)
at com.gigaspaces.lrmi.nio.selector.handler.AbstractSelectorThread.doSelect(AbstractSelectorThread.java:74)
at com.gigaspaces.lrmi.nio.selector.handler.AbstractSelectorThread.run(AbstractSelectorThread.java:50)
at java.lang.Thread.run(Thread.java:662)
Ideally, each record in the result file should contain the key, the directory/filename.log it came from, and the rest of the record.
Questions:
How do I read a record from a file in Ruby?
How do I open multiple files and iterate through them using the algorithm described above?

Code
Read all lines from all files that begin with a date-time string into an array, then sort the array by the date-time strings:
require 'date'

def get_key_rows(*fnames)
  fnames.flat_map do |fname|
    IO.foreach(fname).with_object([]) do |s, arr|
      dt = DateTime.strptime(s[0, 19], '%Y-%m-%d %H:%M:%S') rescue nil
      arr << [s[0, 19], fname, s[19..-1].rstrip] if dt
    end
  end.sort_by(&:first)
end
This method returns an array of three-element arrays. Each three-element array corresponds to a key line in one of the files and consists of the date/time string, the filename and the part of the line that follows the date/time string. Note that key lines need not even be ordered within each file. The method uses:
DateTime#strptime to identify key rows;
Enumerable#flat_map, rather than Enumerable#map followed by Array#flatten; and
Enumerable#sort_by to sort the key rows by date/time.
Regarding sort_by, note that the rows can be sorted by their date/time strings, rather than by the corresponding DateTime objects, because the format 'yyyy-mm-dd hh:mm:ss' is fixed-width with fields ordered from most to least significant, so lexicographic order coincides with chronological order.
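For instance, plain String comparison already yields chronological order:

'2015-04-05 02:33:42' < '2015-04-05 04:33:42' #=> true (String#< compares lexicographically)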
Examples
Let's create some files:
IO.write("f0", "2015-04-05 02:33:42,135 more stuff in f0\n" +
"more in f0\n" +
"2015-04-05 04:33:42,135 more stuff in f0\n" +
"even more in f0")
#=> 108
IO.write("f1", "2015-04-04 02:33:42,135 more stuff in f1\n" +
"2015-04-06 02:33:42,135 more stuff in f1\n" +
"more in f1")
#=> 92
IO.write("f2", "something in f2\n" +
"2015-04-05 02:33:43,135 more stuff in f2\n" +
"even more in f2\n" +
"2015-04-04 02:23:42,135 more stuff in f2")
#=> 113
get_key_rows('f0', 'f1', 'f2')
#=> [["2015-04-04 02:23:42", "f2", ",135 more stuff in f2"],
# ["2015-04-04 02:33:42", "f1", ",135 more stuff in f1"],
# ["2015-04-05 02:33:42", "f0", ",135 more stuff in f0"],
# ["2015-04-05 02:33:43", "f2", ",135 more stuff in f2"],
# ["2015-04-05 04:33:42", "f0", ",135 more stuff in f0"],
# ["2015-04-06 02:33:42", "f1", ",135 more stuff in f1"]]

Related

How to get a block at an offset in the IO.foreach loop in ruby?

I'm using the IO.foreach loop to find a string using regular expressions. I want to append the next block (next line) to the file_names list. How can I do that?
file_names = [""]
IO.foreach("a.txt") { |block|
if block =~ /^file_names*/
dir = # get the next block
file_names.append(dir)
end
}
Actually my input looks like this:
file_names[174]:
name: "vector"
dir_index: 1
mod_time: 0x00000000
length: 0x00000000
file_names[175]:
name: "stl_bvector.h"
dir_index: 2
mod_time: 0x00000000
length: 0x00000000
I have a list of file_names, and I want to capture each of the name, dir_index, mod_time and length properties and put them into the file_names array at the index given by the file_names index in the text.
You can use #each_cons to get the value of the next 4 rows from the text file:
files = IO.foreach("text.txt").each_cons(5).with_object([]) do |block, o|
  if block[0] =~ /file_names.*/
    o << block[1..4].map { |e| e.split(':')[1] }
  end
end
puts files
#=> "vector"
# 1
# 0x00000000
# 0x00000000
# "stl_bvector.h"
# 2
# 0x00000000
# 0x00000000
Keep in mind that the files array contains subarrays of 4 elements each. If the : symbol can also occur later in a line, you could replace the third line of my code with this:
o << block[1..4].map{ |e| e.partition(':').last.strip}
I also added #strip in case you want to remove the whitespace around the values. With this line changed, the actual array will look like this:
p files
#=>[["\"vector\"", "1", "0x00000000", "0x00000000"], ["\"stl_bvector.h\"", "2", "0x00000000", "0x00000000"]]
(the values don't contain the \ escape character; that's just the way #p displays them).
Another option: if you know that the pattern (1 filename line followed by 4 value lines) holds throughout the entire text file, and the file always starts with a filename line, you can replace #each_cons with #each_slice and remove the regex completely, which will also speed up the entire process:
IO.foreach("text.txt").each_slice(5).with_object([]) do |block, o|
o << block[1..4].map{ |e| e.partition(':').last.strip }
end
It's actually pretty easy to carve up a series of lines based on a pattern using slice_before:
File.readlines("data.txt").slice_before(/\Afile_names/)
Now you have an array of arrays that looks like:
[
  [
    "file_names[174]:\n",
    " name: \"vector\"\n",
    " dir_index: 1\n",
    " mod_time: 0x00000000\n",
    " length: 0x00000000\n"
  ],
  [
    "file_names[175]:\n",
    " name: \"stl_bvector.h\"\n",
    " dir_index: 2\n",
    " mod_time: 0x00000000\n",
    " length: 0x00000000"
  ]
]
Each of these groups could be transformed further, like for example into a Ruby Hash using those keys.
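One possible sketch of that transformation (field names taken from the sample data above; data.txt is the hypothetical input file):

records = File.readlines("data.txt")
              .slice_before(/\Afile_names/)
              .map do |group|
  group.drop(1).each_with_object({}) do |line, h|
    key, _sep, value = line.strip.partition(':')
    h[key] = value.strip
  end
end
records[0]["name"] #=> "\"vector\""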

Create array from csv using readlines ruby

I can't seem to get this to work.
I know I can do this with the csv gem but I'm trying out new stuff and I want to do it this way. All I'm trying to do is read lines in from a csv file and then create one array from each line. I then want to print one particular element (the reference) from each array.
So far I have
filed="/Users/me/Documents/Workbook3.csv"
if File.exists?(filed)
File.readlines(filed).map {|d| puts d.split(",").to_a}
else puts "No file here”
The problem is that this creates one array which has all the lines in it whereas I want a separate array for each line (perhaps an array of arrays?)
Test data
Trade date,Settle date,Reference,Description,Unit cost (p),Quantity,Value (pounds)
04/09/2014,09/09/2014,S5411,Plus500 Ltd ILS0.01 152 # 419,419,152,624.93
02/09/2014,05/09/2014,B5406,Biomarin Pharmaceutical Com Stk USD0.001 150 # 4284.75,4284.75,150,-6439.08
29/08/2014,03/09/2014,S5398,Hargreaves Lansdown plc Ordinary 0.4p 520 # 1116.84,1116.84,520,5795.62
What I would like
S5411
B5406
S5398
Let's write your data to a file:
s =<<THE_BITTER_END
Trade date,Settle date,Reference,Description,Unit cost (p),Quantity,Value (pounds)
04/09/2014,09/09/2014,S5411,Plus500 Ltd ILS0.01 152 # 419,419,152,624.93
02/09/2014,05/09/2014,B5406,Biomarin Pharmaceutical Com Stk USD0.001 150 # 4284.75,4284.75,150,-6439.08
29/08/2014,03/09/2014,S5398,Hargreaves Lansdown plc Ordinary 0.4p 520 # 1116.84,1116.84,520,5795.62
THE_BITTER_END
IO.write('temp',s)
#=> 363
We can then do this:
arr = File.readlines('temp').map { |s| s.split(',') }
#=> [["Trade date", "Settle date", "Reference", "Description", "Unit cost (p)",
#     "Quantity", "Value (pounds)\n"],
#    ["04/09/2014", "09/09/2014", "S5411",
#     "Plus500 Ltd ILS0.01 152 # 419", "419", "152", "624.93\n"],
#    ["02/09/2014", "05/09/2014", "B5406",
#     "Biomarin Pharmaceutical Com Stk USD0.001 150 # 4284.75",
#     "4284.75", "150", "-6439.08\n"],
#    ["29/08/2014", "03/09/2014", "S5398",
#     "Hargreaves Lansdown plc Ordinary 0.4p 520 # 1116.84", "1116.84",
#     "520", "5795.62\n"]]
The values you want begin at the second element of arr (the first is the header row) and are the third element of each of those subarrays. Therefore, you can pluck them out as follows:
arr[1..-1].map { |a| a[2] }
#=> ["S5411", "B5406", "S5398"]
Adopting #Stefan's suggestion of putting [2] within the block containing split, we can write this more compactly as follows:
File.readlines('temp')[1..-1].map { |s| s.split(',')[2] }
#=> ["S5411", "B5406", "S5398"]
You can also use the built-in CSV class to do this very easily.
require "csv"
s =<<THE_BITTER_END
Trade date,Settle date,Reference,Description,Unit cost (p),Quantity,Value (pounds)
04/09/2014,09/09/2014,S5411,Plus500 Ltd ILS0.01 152 # 419,419,152,624.93
02/09/2014,05/09/2014,B5406,Biomarin Pharmaceutical Com Stk USD0.001 150 # 4284.75,4284.75,150,-6439.08
29/08/2014,03/09/2014,S5398,Hargreaves Lansdown plc Ordinary 0.4p 520 # 1116.84,1116.84,520,5795.62
THE_BITTER_END
arr = CSV.parse(s, :headers=>true).collect { |row| row["Reference"] }
p arr
#=> ["S5411", "B5406", "S5398"]
PS: I have borrowed the string from #Cary's answer
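If the data lives in a file rather than a string, CSV.foreach can stream it row by row; a short sketch using the question's (hypothetical) path:

require 'csv'
refs = CSV.foreach("/Users/me/Documents/Workbook3.csv", headers: true)
          .map { |row| row["Reference"] }
#=> ["S5411", "B5406", "S5398"]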

YAML deserializer with position information?

Does anyone know of a YAML deserializer that can provide position information for the constructed objects?
I know how to deserialize a YAML file into a Java object. Simple instructions on http://yamlbeans.sourceforge.net/.
However, I want to do some algorithmic validation on the deserialized object and report errors back to the user, pointing to the position in the YAML that caused the error.
Example:
=========YAML file==========
name: Nathan Sweet
age: 28
address: 4011 16th Ave S
=======JAVA class======
public class Contact {
    public String name;
    public int age;
    public String address;
}
Imagine if I want to first load the YAML into the Contact class and then validate the address against some repository and report an error if it's invalid. Something like:
'Line 3 Column 9: The address does not match valid entry in the database'
The problem is, currently there is no way to get the position inside a deserialized object from YAML.
Anyone know a solution to this issue?
Most YAML parsers that keep any position information around drop it while constructing the language-native objects.
In ruamel.yaml ¹, I keep more information around because I want to be able to round-trip with minimal loss of original layout (e.g. keeping comments and key order in mappings).
I don't keep information on individual key-value pairs, but I do on the "upper-left" position of a mapping². Because of the kept order of the mapping items you can give some rather nice feedback. Given an input file:
- name: anthon
  age: 53
  adres: Rijn en Schiekade 105
- name: Nathan Sweet
  age: 28
  address: 4011 16th Ave S
And a program that you call with the input file as argument:
#! /usr/bin/env python2.7
# coding: utf-8
# http://stackoverflow.com/questions/30677517/yaml-deserializer-with-position-information?noredirect=1#comment49491314_30677517

import sys
import ruamel.yaml

up_arrow = '↑'

def key_error(key, value, line, col, error, e='E'):
    print('E[{}:{}]: {}'.format(line, col, error))
    print('{}{}: {}'.format(' '*col, key, value))
    print('{}{}'.format(' '*(col), up_arrow))
    print('---')

def value_error(key, value, line, col, error, e='E'):
    val_col = col + len(key) + 2
    print('{}[{}:{}]: {}'.format(e, line, val_col, error))
    print('{}{}: {}'.format(' '*col, key, value))
    print('{}{}'.format(' '*(val_col), up_arrow))
    print('---')

def value_warning(key, value, line, col, error):
    value_error(key, value, line, col, error, e='W')

class Contact(object):
    def __init__(self, vals):
        for offset, k in enumerate(vals):
            self.check(k, vals[k], vals.lc.line+offset, vals.lc.col)
        for k in ['name', 'address', 'age']:
            if k not in vals:
                print('K[{}:{}]: {}'.format(
                    vals.lc.line+offset, vals.lc.col, "missing key: "+k
                ))
                print('---')

    def check(self, key, value, line, col):
        if key == 'name':
            if value[0].lower() == value[0]:
                value_error(key, value, line, col,
                            'value should start with uppercase')
        elif key == 'age':
            if value < 50:
                value_warning(key, value, line, col,
                              'probably too young for knowing ALGOL 60')
        elif key == 'address':
            pass
        else:
            key_error(key, value, line, col,
                      "unexpected key")

data = ruamel.yaml.load(open(sys.argv[1]), Loader=ruamel.yaml.RoundTripLoader)
for x in data:
    contact = Contact(x)
giving you E(rrors), W(arnings) and K(eys missing):
E[0:8]: value should start with uppercase
name: anthon
↑
---
E[2:2]: unexpected key
adres: Rijn en Schiekade 105
↑
---
K[2:2]: missing key: address
---
W[4:7]: probably too young for knowing ALGOL 60
age: 28
↑
---
You should be able to parse that output in a calling program in any language to give feedback. The check method of course needs adjusting to your requirements. This is not as nice as being able to do that in the language the rest of your application is in, but it might be better than nothing.
In my experience handling the above format is certainly simpler than extending an existing (open source) YAML parser.
¹ Disclaimer: I am the author of that package
² I want to use that kind of information at some point to preserve spurious newlines, inserted for readability
In Python, you can readily write custom Dumper/Loader objects and use them to load (or dump) your YAML code. You can have these objects track the file/line info:
import yaml
from collections import OrderedDict

class YamlOrderedDict(OrderedDict):
    """
    An OrderedDict that was loaded from a yaml file, and is annotated
    with file/line info for reporting about errors in the source file
    """
    def _annotate(self, node):
        self._key_locs = {}
        self._value_locs = {}
        nodeiter = node.value.__iter__()
        for key in self:
            subnode = nodeiter.next()
            self._key_locs[key] = subnode[0].start_mark.name + ':' + \
                str(subnode[0].start_mark.line+1)
            self._value_locs[key] = subnode[1].start_mark.name + ':' + \
                str(subnode[1].start_mark.line+1)

    def key_loc(self, key):
        try:
            return self._key_locs[key]
        except (AttributeError, KeyError):  # a tuple is needed to catch both exception types
            return ''

    def value_loc(self, key):
        try:
            return self._value_locs[key]
        except (AttributeError, KeyError):
            return ''

# Use YamlOrderedDict objects for yaml maps instead of normal dict
yaml.add_representer(OrderedDict, lambda dumper, data:
                     dumper.represent_dict(data.iteritems()))
yaml.add_representer(YamlOrderedDict, lambda dumper, data:
                     dumper.represent_dict(data.iteritems()))

def _load_YamlOrderedDict(loader, node):
    rv = YamlOrderedDict(loader.construct_pairs(node))
    rv._annotate(node)
    return rv

yaml.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, _load_YamlOrderedDict)
Now when you read a yaml file, any mapping objects will be read as a YamlOrderedDict, which allows looking up the file location of keys in the mapping object. You can also add an iterator method like:
def iter_with_lines(self):
    for key, val in self.items():
        yield (key, val, self.key_loc(key))
...and now you can write a loop like:
for key, value, location in obj.iter_with_lines():
    # iterate through the key/value pairs in a YamlOrderedDict, with
    # the source file location
    print(key, value, location)
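The answers above are Python, but since the rest of this page is Ruby, note that Ruby's bundled YAML parser, Psych, keeps the same kind of position data at the AST level. A minimal sketch, assuming Psych >= 2.1 (bundled with Ruby 2.3+), where Psych::Nodes::Node#start_line and #start_column are 0-based:

require 'psych'

doc = Psych.parse(<<~YAML)
  name: Nathan Sweet
  age: 28
  address: 4011 16th Ave S
YAML
# A mapping node's children alternate key node, value node, key node, ...
doc.root.children.each_slice(2) do |key_node, value_node|
  puts "#{key_node.value}: value at line #{value_node.start_line + 1}, " \
       "column #{value_node.start_column + 1}"
end
#=> name: value at line 1, column 7
#   age: value at line 2, column 6
#   address: value at line 3, column 10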

Match Multiple Patterns in a String and Return Matches as Hash

I'm working with some log files, trying to extract pieces of data.
Here's an example of a file which, for the purposes of testing, I'm loading into a variable named sample. NOTE: The column layout of the log files is not guaranteed to be consistent from one file to the next.
sample = "test script result
Load for five secs: 70%/50%; one minute: 53%; five minutes: 49%
Time source is NTP, 23:25:12.829 UTC Wed Jun 11 2014
D
MAC Address IP Address MAC RxPwr Timing I
State (dBmv) Offset P
0000.955c.5a50 192.168.0.1 online(pt) 0.00 5522 N
338c.4f90.2794 10.10.0.1 online(pt) 0.00 3661 N
990a.cb24.71dc 127.0.0.1 online(pt) -0.50 4645 N
778c.4fc8.7307 192.168.1.1 online(pt) 0.00 3960 N
"
Right now, I'm just looking for IPv4 and MAC address; eventually the search will need to include more patterns. To accomplish this, I'm using two regular expressions and passing them to Regexp.union
patterns = Regexp.union(/(?<mac_address>\h{4}\.\h{4}\.\h{4})/, /(?<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/)
As you can see, I'm using named groups to identify the matches.
The result I'm trying to achieve is a Hash. The key should equal the capture group name, and the value should equal what was matched by the regular expression.
Example:
{"mac_address"=>"0000.955c.5a50", "ip_address"=>"192.168.0.1"}
{"mac_address"=>"338c.4f90.2794", "ip_address"=>"10.10.0.1"}
{"mac_address"=>"990a.cb24.71dc", "ip_address"=>"127.0.0.1"}
{"mac_address"=>"778c.4fc8.7307", "ip_address"=>"192.168.1.1"}
Here's what I've come up with so far:
sample.split(/\r?\n/).each do |line|
  hashes = []
  line.split(/\s+/).each do |val|
    match = val.match(patterns)
    if match
      hashes << Hash[match.names.zip(match.captures)].delete_if { |k, v| v.nil? }
    end
  end
  results = hashes.reduce({}) { |r, h| h.each { |k, v| r[k] = v }; r }
  puts results if results.length > 0
end
I feel like there should be a more "elegant" way to do this. My chief concern, though, is performance.
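No answer is attached here, but one hedged alternative (assuming Ruby 2.7+ for Enumerable#filter_map) is to scan each line once with the union and merge the named captures, which also tolerates an inconsistent column order:

results = sample.each_line.filter_map do |line|
  h = {}
  line.scan(patterns) do
    m = Regexp.last_match  # set for each match inside scan's block
    m.names.each { |name| h[name] = m[name] if m[name] }
  end
  h unless h.empty?
end
results.each { |r| p r }
#=> {"mac_address"=>"0000.955c.5a50", "ip_address"=>"192.168.0.1"}
#   ...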

Splitting a single string of hashes into an array of hashes

I can't get regex to split the string to give the desired result.
http://rubular.com/r/ytFwP3ivAv - according to rubular this expression should work.
str = "{"DATE"=>"11/26/2013 11:15", "DESC"=>"Accident (minor)", "LOCATION"=>"12 S THORNTON AV", "DISTRICT"=>"C5", "INCIDENT"=>"2013-00496193"}, {"DATE"=>"11/26/2013 11:10", "DESC"=>"Hold-up alarm", "LOCATION"=>"4725 S KIRKMAN RD", "DISTRICT"=>"E5", "INCIDENT"=>"2013-00496235"}"
sub_str_array = str.split(/({"[\w"=>\/ :,()-]*})/)
# the desired result - each hash is an element in an array
puts sub_str_array[0] #=> {"DATE"=>"11/26/2013 11:15", "DESC"=>"Accident (minor)", "LOCATION"=>"12 S THORNTON AV", "DISTRICT"=>"C5", "INCIDENT"=>"2013-00496193"}
Is there another way (an easier way) to convert these string hashes into an array of hashes?
You can use this:
require 'json'
yourstr = '[' + '{"DATE"=>"11/26/2013 11:15", "DESC"=>"Accident (minor)", "LOCATION"=>"12 S THORNTON AV", "DISTRICT"=>"C5", "INCIDENT"=>"2013-00496193"}, {"DATE"=>"11/26/2013 11:10", "DESC"=>"Hold-up alarm", "LOCATION"=>"4725 S KIRKMAN RD", "DISTRICT"=>"E5", "INCIDENT"=>"2013-00496235"}, {"DATE"=>"11/26/2013 11:08", "DESC"=>"Missing person - adult", "LOCATION"=>"4818 S SEMORAN BV 503", "DISTRICT"=>"K1", "INCIDENT"=>"2013-00496198"}, {"DATE"=>"11/26/2013 11:07", "DESC"=>"911 hang up", "LOCATION"=>"311 W PRINCETON ST", "DISTRICT"=>"C2", "INCIDENT"=>"2013-00496231"}' + ']'
my_hash = JSON.parse(yourstr.gsub("=>", ":"))
puts my_hash[0]
You've set str as a bare literal rather than a string: the unescaped inner double quotes terminate it immediately. Wrap the whole thing in quotes that don't collide with the inner ones and it should work.
It may be better to use %Q(string goes here) rather than double quotes.
You can use eval "[#{str}]", if str is hardcoded and nobody can change it.
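Combining the last two suggestions, a small sketch (the literal is shortened here; eval is only safe when the string is trusted):

str = '{"DATE"=>"11/26/2013 11:15", "DESC"=>"Accident (minor)"}, ' \
      '{"DATE"=>"11/26/2013 11:10", "DESC"=>"Hold-up alarm"}'
hashes = eval("[#{str}]")   # builds an array of the two hashes
hashes[0]["DESC"]           #=> "Accident (minor)"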

Resources