manipulating array: adding number of occurrences to the duplicate elements - ruby

(You are welcome to change the title to a more appropriate one!)
I have another Ruby/ERB question. I have this file:
ec2-23-22-59-32, mongoc, i-b8b44, instnum=1, Running
ec2-54-27-11-46, mongod, i-43f9f, instnum=2, Running
ec2-78-62-192-20, mongod, i-02fa4, instnum=3, Running
ec2-24-47-51-23, mongos, i-546c4, instnum=4, Running
ec2-72-95-64-22, mongos, i-5d634, instnum=5, Running
ec2-27-22-219-75, mongoc, i-02fa6, instnum=6, Running
And I can process the file to create an array like this:
irb(main):007:0> open(inFile).each { |ln| puts ln.split(',').map(&:strip)[0..1] }
ec2-23-22-59-32
mongoc
ec2-54-27-11-46
mongod
....
....
But what I really want is the occurrence number concatenated to the "mongo-type" so that it becomes:
ec2-23-22-59-32
mongoc1
ec2-54-27-11-46
mongod1
ec2-78-62-192-20
mongod2
ec2-24-47-51-23
mongos1
ec2-72-95-64-22
mongos2
ec2-27-22-219-75
mongoc2
The number of each mongo-type is not fixed and changes over time. How can I do that? Thanks in advance. Cheers!!

Quick answer (maybe could be optimized):
data = 'ec2-23-22-59-32, mongoc, i-b8b44, instnum=1, Running
ec2-54-27-11-46, mongod, i-43f9f, instnum=2, Running
ec2-78-62-192-20, mongod, i-02fa4, instnum=3, Running
ec2-24-47-51-23, mongos, i-546c4, instnum=4, Running
ec2-72-95-64-22, mongos, i-5d634, instnum=5, Running
ec2-27-22-219-75, mongoc, i-02fa6, instnum=6, Running'
# a hash with mongo type strings as keys
# and their number of occurrences as values
mtypes = {}
data.lines.each do |ln|
  # take the first and second fields of the line as inst and mtype
  inst, mtype = ln.split(',').map(&:strip)[0..1]
  # if the mtypes hash already has a key equal to the current mtype,
  # add 1 to its number of occurrences;
  # otherwise create the key and assign 1 to it
  # (condition ? a : b is the ternary operator)
  mtypes[mtype] = mtypes.has_key?(mtype) ? mtypes[mtype] + 1 : 1
  # build the output string (everything inside #{ } is interpolated),
  # so #{mtype}#{mtypes[mtype]} means the current value of mtype followed by
  # its current number of occurrences stored in the mtypes hash
  p "#{inst} : #{mtype}#{mtypes[mtype]}"
end
Output:
# "ec2-23-22-59-32 : mongoc1"
# "ec2-54-27-11-46 : mongod1"
# "ec2-78-62-192-20 : mongod2"
# "ec2-24-47-51-23 : mongos1"
# "ec2-72-95-64-22 : mongos2"
# "ec2-27-22-219-75 : mongoc2"
Quite straightforward, I think. If you don't understand something, let me know.
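A slightly more compact variant of the same counting idea, as a sketch: a hash created with Hash.new(0) returns 0 for keys it has not seen yet, so the has_key? check can be dropped. The method name label_instances is made up for illustration and is not part of the original answer.

```ruby
# Hypothetical helper: returns [instance, numbered_type] pairs.
def label_instances(lines)
  counts = Hash.new(0) # unseen mongo types start at 0
  lines.map do |ln|
    inst, mtype = ln.split(',').map(&:strip)
    counts[mtype] += 1
    [inst, "#{mtype}#{counts[mtype]}"]
  end
end

data = <<~DATA
  ec2-23-22-59-32, mongoc, i-b8b44, instnum=1, Running
  ec2-54-27-11-46, mongod, i-43f9f, instnum=2, Running
  ec2-78-62-192-20, mongod, i-02fa4, instnum=3, Running
  ec2-24-47-51-23, mongos, i-546c4, instnum=4, Running
  ec2-72-95-64-22, mongos, i-5d634, instnum=5, Running
  ec2-27-22-219-75, mongoc, i-02fa6, instnum=6, Running
DATA

# Print the instance and its numbered type on separate lines,
# in the layout asked for in the question.
label_instances(data.lines).each { |inst, name| puts inst, name }
```

Returning pairs instead of printing inside the loop keeps the counting separate from the output, which may be handy when the result is fed into an ERB template.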

Related

Ruby Zlib compression gives different outputs for the same input

I have this ruby method for compressing a string -
require 'zlib'
require 'stringio'

def compress_data(data)
  output = StringIO.new
  gz = Zlib::GzipWriter.new(output)
  gz.write(data)
  gz.close
  compressed_data = output.string
  compressed_data
end
When I call this method with the same input, I get different outputs at different times. I am trying to get the byte array for the compressed outputs and compare them.
The output is Different when I run the below -
input = "hello world"
output1 = (compress_data input).bytes.to_a
sleep 1
output2 = (compress_data input).bytes.to_a
if output1 == output2
  puts 'Same'
else
  puts 'Different'
end
The output is Same when I remove the sleep. Does the compression algorithm have something to do with the current time?
Option 1 - fixed mtime:
Yes. The compression time is stored in the header. You can use the mtime method to set the time to a fixed value, which will resolve your problem:
gz = Zlib::GzipWriter.new(output)
gz.mtime = 1
gz.write(data)
gz.close
Note that the Ruby documentation says that setting mtime to zero will disable the timestamp. I tried it, and it does not work. I also looked at the source code, and it appears this functionality is missing. It seems like a bug, so you have to set it to something other than 0 (but see comments below - it will be fixed in future releases).
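Put together with the method from the question, Option 1 could look like the sketch below. The optional mtime parameter is an addition for illustration; the value 1 is arbitrary, and anything fixed and non-zero should behave the same way.

```ruby
require 'zlib'
require 'stringio'

# Same method as in the question, with a fixed, non-zero mtime.
# The mtime parameter is hypothetical and only here for illustration.
def compress_data(data, mtime = 1)
  output = StringIO.new
  gz = Zlib::GzipWriter.new(output)
  gz.mtime = mtime # set before any data is written
  gz.write(data)
  gz.close
  output.string
end

first = compress_data('hello world')
sleep 1
second = compress_data('hello world')
puts first == second ? 'Same' : 'Different'
```

The timestamp occupies bytes 4 to 7 of the gzip header (little-endian), so pinning it leaves nothing time-dependent in the output.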
Option 2 - skip the header:
Another option is to just skip the header when checking for similar data. The header is 10 bytes long, so to only check the data:
data = compress_data(input).bytes[10..-1]
Note that you do not need to call to_a on bytes. It is already an Array:
String.bytes -> an_array
Returns an array of bytes in str. This is a shorthand for str.each_byte.to_a.
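A minimal sketch of Option 2, comparing only what follows the header. The helper name gzip_with_mtime and the constant GZIP_HEADER_SIZE are made up here, and the 10-byte figure assumes no original file name or comment is set on the writer, since those fields would lengthen the header.

```ruby
require 'zlib'
require 'stringio'

# Fixed part of the gzip header; assumes no orig_name or comment is set.
GZIP_HEADER_SIZE = 10

# Hypothetical helper: compresses data with an explicit timestamp so the
# header difference can be reproduced without sleeping.
def gzip_with_mtime(data, mtime)
  output = StringIO.new
  gz = Zlib::GzipWriter.new(output)
  gz.mtime = mtime
  gz.write(data)
  gz.close
  output.string
end

a = gzip_with_mtime('hello world', 1)
b = gzip_with_mtime('hello world', 2)

puts a == b                                                         # whole streams
puts a.bytes[GZIP_HEADER_SIZE..-1] == b.bytes[GZIP_HEADER_SIZE..-1] # header skipped
```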

PySpark - Sort RDD by Second Column

I have this RDD:
[[u''], [u'E01', u'Lokesh'], [u'E10', u'Venkat'], [u'EO2', u'Bhupesh'], [u'EO3', u'Amit'], [u'EO4', u'Ratan'], [u'EO5', u'Dinesh'], [u'EO6', u'Pavan'], [u'EO7', u'Tejas'], [u'EO8', u'Sheela']]
I want to sort it by the second column (name), so that the result looks like this:
[u'EO3', u'Amit'],
[u'EO2', u'Bhupesh'],
[u'EO5', u'Dinesh'],
[u'E01', u'Lokesh'],
[u'EO6', u'Pavan'],
[u'EO8', u'Sheela'],
[u'EO7', u'Tejas'],
[u'E10', u'Venkat']
I tried this:
sorted = employee_rows.sortBy(lambda line: line[1])
But it gives me this:
IndexError: list index out of range
How can I sort by the second column?
Thanks!
In general, you should make all of your higher-order RDD functions robust to bad inputs. In this case, the error occurs because at least one record does not have a second column.
One way is to check the length of line inside the lambda (this relies on Python 2 ordering, where None sorts before any string):
employee_rows.sortBy(lambda line: line[1] if len(line) > 1 else None).collect()
#[[u''],
# [u'EO3', u'Amit'],
# [u'EO2', u'Bhupesh'],
# [u'EO5', u'Dinesh'],
# [u'E01', u'Lokesh'],
# [u'EO6', u'Pavan'],
# [u'EO4', u'Ratan'],
# [u'EO8', u'Sheela'],
# [u'EO7', u'Tejas'],
# [u'E10', u'Venkat']]
Or you could define a custom sort function with try/except. Here's a way to make the "bad" rows sort last:
def mysort(line):
    try:
        return line[1]
    except IndexError:
        # since you're sorting alphabetically
        return 'Z'
employee_rows.sortBy(mysort).collect()
#[[u'EO3', u'Amit'],
# [u'EO2', u'Bhupesh'],
# [u'EO5', u'Dinesh'],
# [u'E01', u'Lokesh'],
# [u'EO6', u'Pavan'],
# [u'EO4', u'Ratan'],
# [u'EO8', u'Sheela'],
# [u'EO7', u'Tejas'],
# [u'E10', u'Venkat'],
# [u'']]

Improving an algorithm for substring search when reading ZIP files

So I have a ZIP reader library, and I read ZIP files by first figuring out where the EOCD record is (the standard way "from the tail"). I have to look for a pattern that is roughly this:
4byte_magic_number, fixed_n_bytes, 2_bytes_of_comment_size, comment
The bytesize of comment is provided in the 2_bytes_of_comment_size. Just scanning for the magic number is insufficient, because I eager-read a substantial portion at the tail of the file - basically the maximum size the ZIP EOCD record can be, and then look for this pattern in there.
So far, I came up with this
def locate_eocd_signature(in_str)
  # We have to scan from the _very_ tail. We read the very minimum size
  # the EOCD record can have (up to and including the comment size), using
  # a sliding window. Once our end offset matches the comment size we found our
  # EOCD marker.
  eocd_signature_int = 0x06054b50
  unpack_pattern = 'VvvvvVVv'
  minimum_record_size = 22
  end_location = minimum_record_size * -1
  loop do
    # If the window is nil, we have rolled off the start of the string, nothing to do here.
    # We use negative values because if we used positive slice indices
    # we would have to detect the rollover ourselves
    break unless window = in_str[end_location, minimum_record_size]
    window_location = in_str.bytesize + end_location
    unpacked = window.unpack(unpack_pattern)
    # If we found the signature, pick up the comment size, and check if the size of the window
    # plus that comment size is where we are in the string. If we are - bingo.
    if unpacked[0] == eocd_signature_int && comment_size = unpacked[-1]
      assumed_eocd_location = in_str.bytesize - comment_size - minimum_record_size
      # if the comment size is where we should be at - we found our EOCD
      return assumed_eocd_location if assumed_eocd_location == window_location
    end
    end_location -= 1 # Shift the window back, by one byte, and try again.
  end
end
but it just screams ugly at me. Is there a better way to do something like this? Is there a pack specifier that says "all the bytes in binary until the end of the string" that I do not know of? Then I could tack that onto the end of the pack specifier, for example... A bit at a loss here.
In the end I opted for the following optimization. First, I made a method for finding all the indices of a given substring in a string - there is no stdlib builtin for this.
def all_indices_of_substr_in_str(of_substring, in_string)
  last_i = 0
  found_at_indices = []
  while last_i = in_string.index(of_substring, last_i)
    found_at_indices << last_i
    last_i += of_substring.bytesize
  end
  found_at_indices
end
Then, we use it to "latch" onto the offsets in our buffer where our signature was found.
def locate_eocd_signature(in_str)
  eocd_signature = 0x06054b50
  eocd_signature_str = [eocd_signature].pack('V')
  unpack_pattern = 'VvvvvVVv'
  minimum_record_size = 22
  str_size = in_str.bytesize
  indices = all_indices_of_substr_in_str(eocd_signature_str, in_str)
  indices.each do |check_at|
    maybe_record = in_str[check_at..str_size]
    # If the record is smaller than the minimum - we will never recover anything
    break if maybe_record.bytesize < minimum_record_size
    # Now we check if the record ends with the combination
    # of the comment size and an arbitrary byte string of that size.
    # If it does - we found our match
    *_unused, comment_size = maybe_record.unpack(unpack_pattern)
    if (maybe_record.bytesize - minimum_record_size) == comment_size
      return check_at # Found the EOCD marker location
    end
  end
  # If we haven't caught anything, return nil deliberately instead of returning the last statement
  nil
end
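Since the EOCD record sits at the very end of the file, the same check can also be run from the tail with String#rindex, which avoids collecting every signature offset first. This is only a sketch under that assumption, not the approach the answer settled on; locate_eocd_from_tail and the two constants are names invented for this example.

```ruby
EOCD_SIGNATURE = [0x06054b50].pack('V') # 4-byte magic number, little-endian
MIN_EOCD_SIZE = 22                      # record size up to and including the comment size

# Hypothetical variant: walks signature candidates from the end of the buffer
# and returns the offset of the first one whose comment size fits exactly.
def locate_eocd_from_tail(in_str)
  buf = in_str.b # binary copy, so indices are byte offsets
  search_from = buf.bytesize - MIN_EOCD_SIZE
  while search_from >= 0 && (at = buf.rindex(EOCD_SIGNATURE, search_from))
    # The comment size is the last 2 bytes of the fixed-size part of the record
    comment_size = buf[at + 20, 2].unpack1('v')
    return at if at + MIN_EOCD_SIZE + comment_size == buf.bytesize
    search_from = at - 1 # keep looking further towards the head
  end
  nil
end

# Synthetic buffer: padding, a stray signature, more padding, then a real
# EOCD record with a 5-byte comment. The field values are arbitrary.
comment = 'hello'.b
eocd = [0x06054b50, 0, 0, 1, 1, 46, 100, comment.bytesize].pack('VvvvvVVv') + comment
buffer = ('x'.b * 30) + EOCD_SIGNATURE + ('y'.b * 40) + eocd

puts locate_eocd_from_tail(buffer)
```

The guard on search_from matters because rindex treats a negative start position as counting from the end of the string.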

Redis Sorted Set: Bulk ZSCORE

How to get a list of members based on their ID from a sorted set instead of just one member?
I would like to build a subset with a set of IDs from the actual sorted set.
I am using a Ruby client for Redis and do not want to iterate one by one, because there could be more than 3000 members that I want to look up.
There is an issue on the issue tracker for a new command, ZMSCORE, to do bulk ZSCORE.
There is no variadic form for ZSCORE, yet - see the discussion at: https://github.com/antirez/redis/issues/2344
That said, and for the time being, what you could do is use a Lua script for that. For example:
local scores = {}
while #ARGV > 0 do
  scores[#scores+1] = redis.call('ZSCORE', KEYS[1], table.remove(ARGV, 1))
end
return scores
Running this from the command line would look like:
$ redis-cli ZADD foo 1 a 2 b 3 c 4 d
(integer) 4
$ redis-cli --eval mzscore.lua foo , b d
1) "2"
2) "4"
EDIT: In Ruby, it would probably be something like the following, although you'd be better off using SCRIPT LOAD and EVALSHA and loading the script from an external file (instead of hardcoding it in the app):
require 'redis'
script = <<LUA
  local scores = {}
  while #ARGV > 0 do
    scores[#scores+1] = redis.call('ZSCORE', KEYS[1], table.remove(ARGV, 1))
  end
  return scores
LUA
redis = ::Redis.new()
reply = redis.eval(script, ["foo"], ["b", "d"])
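The reply comes back in the same order as the member IDs passed in, so it can be paired up with them on the Ruby side. The sketch below assumes the reply is an array of score strings, with nil for members that are not in the sorted set; scores_by_member is a made-up helper name, and the sample reply is hardcoded so the snippet runs without a Redis server.

```ruby
# Hypothetical helper: pairs each requested member ID with its score.
# Assumes reply is ordered like member_ids and holds score strings or nil.
def scores_by_member(member_ids, reply)
  member_ids.zip(reply).map { |id, score| [id, score && Float(score)] }.to_h
end

member_ids = %w[b d missing]
reply = ['2', '4', nil] # stand-in for redis.eval(script, ["foo"], member_ids)

p scores_by_member(member_ids, reply)
```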
Lua script to get scores with member IDs:
local scores = {}
while #ARGV > 0 do
  local member_id = table.remove(ARGV, 1)
  local member_score = {}
  member_score[1] = member_id
  member_score[2] = redis.call('ZSCORE', KEYS[1], member_id)
  scores[#scores + 1] = member_score
end
return scores

In Ruby how do I search a string line by line and only show lines that begin with 1

What I want to do is SSH into a Ubiquiti device, run brctl showmacs br0, and retrieve only the MAC addresses on the local port (1). For instance:
1 d4:ca:6d:ec:aa:fe no 0.05
would be printed/put/written-to-file because it begins with a 1 while:
2 4c:5e:0c:d5:ba:95 no 38.62
will not.
Strings respond to [], so you could take your collection @collection and select the elements where x[0] == '1'.
only_ones = @collection.select { |x| x[0] == '1' }
You can use SSHKit to run a remote command:
require 'sshkit'
require 'sshkit/dsl'
include SSHKit::DSL
on 'ubiquiti.yourdomain.com' do
  output = capture(:brctl, 'showmacs br0')
  puts output.lines.select { |line| line.start_with? "1" }
end
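Checking only the first character would also let through ports such as 10 or 11, and it misses lines that start with whitespace. A sketch that compares the whole first field instead is shown below; macs_on_port is a hypothetical helper, and the sample text (including the port 10 line) is made up to resemble the output in the question.

```ruby
# Hypothetical helper: returns the MAC addresses listed for one bridge port.
# Assumes whitespace-separated columns with the port number first and the
# MAC address second.
def macs_on_port(output, port)
  output.each_line
        .map(&:split)
        .select { |fields| fields[0] == port.to_s }
        .map { |fields| fields[1] }
end

sample = <<~OUT
  port no mac addr is local? ageing timer
  1 d4:ca:6d:ec:aa:fe no 0.05
  2 4c:5e:0c:d5:ba:95 no 38.62
  10 00:11:22:33:44:55 no 1.20
OUT

puts macs_on_port(sample, 1)
```

The same method can be applied to the string returned by capture, or to the contents of a file the output was saved to.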
