Does anyone know of a tool, in Ruby or even through a web service, that will take a bunch of IP addresses (currently about 2 million of them) in a file and convert them to IP ranges like 192.168.0.1 - 192.168.0.10?
Convert the IP addresses to 32-bit integers (I assume you're dealing with IPv4 addresses, based on your post), remove the duplicates, sort them, and merge consecutive runs. After that, convert the integers back to IP strings:
require 'ipaddr'

def to_ranges(ips)
  # Convert to integers, drop duplicates, sort.
  ips = ips.map { |ip| IPAddr.new(ip).to_i }.uniq.sort
  prev = ips[0]
  ips
    .slice_before { |e|          # start a new slice whenever the current
      prev2, prev = prev, e      # address is not exactly one greater
      prev2 + 1 != e             # than the previous one
    }
    .map { |addrs| addrs.length > 1 ? [addrs[0], addrs[-1]] : addrs }
    .map { |addrs| addrs.map { |ip| IPAddr.new(ip, Socket::AF_INET) }.join("-") }
end
# some ip samples
ips = (0..255).map{|i| ["192.168.0.#{i}", "192.168.1.#{i}", "192.168.2.#{i}"] }.reduce(:+)
ips += ["192.168.3.0", "192.168.3.1"]
ips += ["192.168.3.5", "192.168.3.6"]
ips += ["192.168.5.1"]
ips += ["192.168.6.255", "192.168.7.0", "192.168.7.1"]
p to_ranges(ips)
# => ["192.168.0.0-192.168.3.1", "192.168.3.5-192.168.3.6", "192.168.5.1", "192.168.6.255-192.168.7.1"]
Reading the IP addresses from a file and storing them in an array should be relatively easy. 2 million IP addresses is a small set, so you don't need to worry too much about memory usage. (If it really matters, you may need to implement an algorithm that incrementally converts and merges the addresses.)
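If you ever do need the incremental version, here is a minimal sketch, assuming the file ("ips.txt" here, one IPv4 address per line) has already been sorted numerically and de-duplicated externally (e.g. with GNU sort -u -t. -k1,1n -k2,2n -k3,3n -k4,4n):

require 'ipaddr'

ranges = []
start_ip = prev_ip = nil

# Stream the file line by line; only the current run is kept in memory.
File.foreach("ips.txt") do |line|
  ip = IPAddr.new(line.strip).to_i
  next if ip == prev_ip                  # skip duplicates
  if prev_ip && ip == prev_ip + 1
    prev_ip = ip                         # still contiguous: extend the run
  else
    ranges << [start_ip, prev_ip] if start_ip
    start_ip = prev_ip = ip              # gap found: start a new run
  end
end
ranges << [start_ip, prev_ip] if start_ip

ranges.each do |lo, hi|
  lo_s = IPAddr.new(lo, Socket::AF_INET).to_s
  puts lo == hi ? lo_s : "#{lo_s}-#{IPAddr.new(hi, Socket::AF_INET)}"
end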
BTW, I found the handy method Enumerable#slice_before when solving your problem.
Deploying a number of instances in AWS using Terraform. The code is written to make use of specific private IP ranges, and it iterates through a range to provide the last two digits of each IP.
IP_AA_MGMT_Windows = [for i in range(1, var.Number_of_AA_Candidates +1 ) : format("%s%02d", "10.10.8.1", i)]
For information, the subnet this belongs to has the following CIDR allocation:
cidr_block = "10.10.8.0/22"
This gives an IP range of 10.10.8.0 - 10.10.11.255.
The instance is created with no real problems; its expected private IP is allocated in an identical manner to the network interface's.
resource "aws_instance" "Windows" {
instance_type = "t2.large"
subnet_id = aws_subnet.windows.id
vpc_security_group_ids = [aws_security_group.AA_Eng_Windows[count.index].id]
key_name = aws_key_pair.ENG-DEV.id
count = var.Number_of_AA_Candidates
private_ip = local.IP_AA_WINLAN_Windows[count.index]
associate_public_ip_address = false
An additional network interface is created and attached to the instance.
resource "aws_network_interface" "Windows_Access_Interface" {
subnet_id = aws_subnet.management.id
private_ip = local.IP_AA_MGMT_Windows[count.index]
security_groups = [aws_security_group.Windows.id]
count = var.Number_of_AA_Candidates
attachment {
instance = aws_instance.Windows[count.index].id
device_index = 1
}
Everything deploys correctly according to Terraform. It's not until you check the private IPs in AWS, or via terraform state show, that you realise the network interface resource is created, but with an incorrect private IP rather than the one provisioned in the code. NOTE: terraform plan provides output suggesting no problems with the IP allocation.
Below is some of the output from the terraform show command.
# aws_network_interface.Windows_Access_Interface[0]:
resource "aws_network_interface" "Windows_Access_Interface" {
    interface_type          = "interface"
    private_ip              = "10.10.10.72"
    private_ip_list         = [
        "10.10.10.72",
    ]
    private_ip_list_enabled = false
    private_ips             = [
        "10.10.10.72",
    ]
NOTE: Some of the details in the show output have intentionally been removed for security.
The question now is, what is causing this?
There are some important points to take into account here. The first is that there are five reserved addresses in each subnet [1]:
The first four IP addresses and the last IP address in each subnet CIDR block are not available for your use, and they cannot be assigned to a resource, such as an EC2 instance.
So that means you would have to start counting the assignable addresses from 10.10.8.4, which further means that in the range function the counting would have to start after 3:
IP_AA_MGMT_Windows = [for i in range(4, var.Number_of_AA_Candidates + 1 ) : format("%s%02d", "10.10.8.1", i)]
Since IP addresses are not really strings, format with %02d will only append a zero-padded decimal number to the end of the string, its width depending on Number_of_AA_Candidates. So, for example, if Number_of_AA_Candidates were equal to 2, that would yield the following IP addresses:
> local.IP_AA_MGMT_Windows
[
"10.10.8.101",
"10.10.8.102",
]
Note that this is for the original range starting from 1. This looks like it is fine, but consider what happens once you add a double-digit number (or even a triple-digit number, to drive the point home). Additionally, the second part of the range is fine only until you set Number_of_AA_Candidates to a value greater than or equal to the maximum number of available IP addresses; if you were somehow to miscalculate, the range would overshoot and the "IP addresses" created would not be valid addresses. To make sure you do not overstep the maximum number of available IPs in the CIDR range, you can calculate that number with:
2^10 - 5
10 is the number of bits that remain after the subnet bits are deducted from the maximum number of bits, which is 32; the 5 is the number of reserved IP addresses that cannot be used. This leaves you with 1019 possible host addresses. To make sure the range does not overshoot, you could introduce a ternary conditional for the second part of the range function:
IP_AA_MGMT_Windows = [for i in range(4, (var.Number_of_AA_Candidates > 1019 ? 1019 : var.Number_of_AA_Candidates + 1) ) : format("%s%02d", "10.10.8.1", i)]
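As a side note, if you would rather not hard-code 1019, you can compute the bound with Terraform's built-in pow function (a small sketch; the local name here is made up):

locals {
  # 2^(32 - 22) addresses in a /22, minus the five AWS reserves per subnet
  max_assignable = pow(2, 32 - 22) - 5   # 1019
}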
Now, those are two issues resolved. The third and final issue is the format function. To enable the usage of only available IP addresses and avoid using format, I suggest trying the cidrhost built-in function [2]. The cidrhost syntax is:
cidrhost(prefix, hostnum)
The hostnum part represents the wanted host IP address in the CIDR range. So for example, if you were to do:
cidrhost("10.10.8.0/22", 1)
This would return 10.10.8.1, the first address after the network address. For hostnum equal to 2 it would return the 2nd, and so on.
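For a quick sanity check, this is what that looks like in terraform console (the offsets are chosen for illustration):

> cidrhost("10.10.8.0/22", 0)
"10.10.8.0"
> cidrhost("10.10.8.0/22", 4)
"10.10.8.4"
> cidrhost("10.10.8.0/22", 1023)
"10.10.11.255"

Note that hostnum 0 is the network address itself and 1023 is the last (reserved) address of the /22.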
To use this properly, you would have to modify the local variable to look like this:
IP_AA_MGMT_Windows = [for i in range(4, (var.Number_of_AA_Candidates > 1019 ? 1019 : var.Number_of_AA_Candidates + 1)) : cidrhost("10.10.8.0/22", i)]
This works well with any number up to the maximum number of host IP addresses. Finally, even though we know there are five IP addresses we cannot use, cidrhost does not know anything about that and always counts from the first to the last address in a CIDR range, so the upper bound in the last expression has to be 1023, as we don't want to include the broadcast address (the start of the range is covered because we start from 4):
IP_AA_MGMT_Windows = [for i in range(4, (var.Number_of_AA_Candidates > 1023 ? 1023 : var.Number_of_AA_Candidates + 1)) : cidrhost("10.10.8.0/22", i)]
EDIT: After a discussion in chat, we identified an issue with an argument of aws_network_interface (even though Terraform did not complain about it). The argument used in the question is private_ip, while the provider defines private_ips, which is a list of strings [3]. After changing that to:
private_ips = [ local.IP_AA_MGMT_Windows[count.index] ]
The apply worked as expected.
[1] https://docs.aws.amazon.com/vpc/latest/userguide/configure-subnets.html#subnet-sizing
[2] https://www.terraform.io/language/functions/cidrhost
[3] https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/network_interface#private_ips
Summary
Looking at the other questions that are somewhat in line with this is not helping, because I'm already reading the file line by line, so I'm not running out of memory on the large file. In fact my memory usage is pretty low, but it is taking a really long time to create the smaller file so that I can search it and concatenate the other CSV into the file.
Question
It has been 5 days and I'm not sure how much further it has to go, but it hasn't finished the foreach over the main file; there are 17.8 million records in the CSV file. Is there a faster way to handle this processing in Ruby? Anything I can do on macOS to optimize it? Any advice would be great.
# # -------------------------------------------------------------------------------------
# # USED TO GET ID NUMBERS OF THE SPECIFIC ITEMS THAT ARE NEEDED
# # -------------------------------------------------------------------------------------
etas_title_file = './HathiTrust ETAS Titles.csv'
oclc_id_array = []
angies_csv = []
CSV.foreach(etas_title_file, 'r', { :headers => true, :header_converters => :symbol }) do |row|
  oclc_id_array << row[:oclc]
  angies_csv << row.to_h
end
oclc_id_array.uniq!
# -------------------------------------------------------------------------------------
# RUN ONCE IF DATABASE IS NOT POPULATED
# -------------------------------------------------------------------------------------
headers = %i[htid access rights ht_bib_key description source source_bib_num oclc_num isbn issn lccn title imprint rights_reason_code rights_timestamp us_gov_doc_flag rights_date_used pub_place lang bib_fmt collection_code content_provider_code responsible_entity_code digitization_agent_code access_profile_code author]
remove_keys = %i[access rights description source source_bib_num isbn issn lccn title imprint rights_reason_code rights_timestamp us_gov_doc_flag rights_date_used pub_place lang bib_fmt collection_code content_provider_code responsible_entity_code digitization_agent_code access_profile_code author]
new_hathi_csv = []
processed_keys = []
CSV.foreach('./hathi_full_20200401.txt', 'r', { :headers => headers, :col_sep => "\t", quote_char: "\0" }) do |row|
  next unless oclc_id_array.include? row[:oclc_num]
  next if processed_keys.include? row[:oclc_num]
  puts "#{row[:oclc_num]} included? #{oclc_id_array.include? row[:oclc_num]}"
  new_hathi_csv << row.to_h.except(*remove_keys)
  processed_keys << row[:oclc_num]
end
As far as I was able to determine, OCLC IDs are alphanumeric. This means we want to use a Hash to store these IDs. A Hash has a general lookup complexity of O(1), while your unsorted Array has a lookup complexity of O(n).
If you use an Array, your worst-case lookup is 18 million comparisons (to find a single element, Ruby has to go through all 18 million IDs), while with a Hash it is effectively one comparison. To put it simply: using a Hash will be millions of times faster than your current implementation.
The pseudocode below will give you an idea of how to proceed. We will use a Set, which is backed by a Hash and handy when all you need to do is check for inclusion:
require 'set'

oclc_ids = Set.new

CSV.foreach(...) do |row|
  oclc_ids.add(row[:oclc]) # Add ID to the Set
  ...
end

# No need to call uniq on a Set:
# the elements in a Set are always unique.

processed_keys = Set.new

CSV.foreach(...) do |row|
  next unless oclc_ids.include?(row[:oclc_num])   # Extremely fast lookup
  next if processed_keys.include?(row[:oclc_num]) # Extremely fast lookup
  ...
  processed_keys.add(row[:oclc_num])
end
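If you want to see the difference concretely, here is a small self-contained benchmark (the IDs are synthetic and the timings will vary by machine):

require 'set'
require 'benchmark'

ids     = (1..1_000_000).map(&:to_s) # synthetic IDs
ids_set = ids.to_set

Benchmark.bm(6) do |x|
  # Worst case for the Array: the element searched for is the last one.
  x.report("Array") { 100.times { ids.include?("1000000") } }
  x.report("Set")   { 100.times { ids_set.include?("1000000") } }
end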
I am building a tool to help me reverse engineer database files. I am targeting my tool towards fixed record length flat files.
What I know:
1) Each record has an index(ID).
2) Each record is separated by a delimiter.
3) Each record is fixed width.
4) Each column in each record is separated by at least one x00 byte.
5) The file header is at the beginning (I say this because the header does not contain the delimiter..)
Delimiters I have found in other files are (xFAxFA, xFExFE, xFDxFD), but this is kind of irrelevant considering that I may use the tool on a different database in the future. So I will need something that will be able to pick out a 'pattern' regardless of how many bytes it is made of. Probably no more than 6 bytes? It would probably eat up too much data if it were more. But my experience doing this is limited.
So I guess my question is: how would I find UNKNOWN delimiters in a large file? I feel that, given 'what I know', I should be able to program something; I just don't know where to begin...
# Really loose pseudocode
def begin_some_how
  # THIS IS THE PART I NEED HELP WITH...
  # Find all non-zero, non-ASCII sets of 2 or more bytes that repeat more than twice.
end

def check_possible_record_lengths
  possible_delimiter = begin_some_how
  # Test whether the candidates are always the same number of bytes apart
  # from each other (except one instance, the header...).
  possible_records = file.split(possible_delimiter)
  rec_length_count = possible_records.map { |record| record.length }.uniq.count
  if rec_length_count == 2 # The header will most likely not be the same size.
    puts "Success! We found the fixed record delimiter: #{possible_delimiter}"
  else
    puts "Wrong delimiter found"
  end
end
possible = [",", "."]
result = [0, ""]
possible.each do |delimiter|
sizes = file.split( delimiter ).map{ |record| record.size }
next if sizes.size < 2
average = 0.0 + sizes.inject{|sum,x| sum + x }
average /= sizes.size #This should be the record length if this is the right delimiter
deviation = 0.0 + sizes.inject{|sum,x| sum + (x-average)**2 }
matching_value = average / (deviation**2)
if matching_value > result[0] then
result[0] = matching_value
result[1] = delimiter
end
end
Take advantage of the fact that the records have a constant size: take every possible delimiter and check how much each record length deviates from the average. If the header is small enough compared to the rest of the file, this should work.
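For the part marked "THIS IS THE PART I NEED HELP WITH", here is a rough sketch of begin_some_how, assuming the file fits in memory and that the delimiter is a run of 2-6 high (non-zero, non-ASCII) bytes; the file name is made up:

# Count every run of 2-6 bytes in the 0x80-0xFF range and keep the
# runs that repeat more than twice, most frequent first.
data = File.binread("unknown.db")

counts = Hash.new(0)
data.scan(/[\x80-\xFF]{2,6}/n) { |run| counts[run] += 1 }

candidates = counts.select { |_, n| n > 2 }
                   .sort_by { |_, n| -n }
                   .map { |bytes, _| bytes }

The resulting candidates array can then be fed into the length-deviation check above as possible.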
I am receiving some data which is parsed in a Ruby script; a sample of the parsed data looks like this:
{"address":"00","data":"FF"}
{"address":"01","data":"00"}
That data relates to the status (on/off) of plant items (fans, coolers, heaters, etc.). The address is a hex number that tells you which set of bits the data refers to; both the address and the data are received as hex, as in this example. So for the example above, the lookup table would be:
Bit1 Bit2 Bit3 Bit4 Bit5 Bit6 Bit7 Bit8
Address 00: Fan1 Fan2 Fan3 Fan4 Cool1 Cool2 Cool3 Heat1
Address 01: Hum1 Hum2 Fan5 Fan6 Heat2 Heat3 Cool4 Cool5
16 Addresses per block (This example is 00-0F)
Data: FF tells me that all items in address 00 are set on (high/1). I then need to output the result of the lookup for each individual bit, e.g.:
{"element":"FAN1","data":{"type":"STAT","state":"1"}}
{"element":"FAN2","data":{"type":"STAT","state":"1"}}
{"element":"FAN3","data":{"type":"STAT","state":"1"}}
{"element":"FAN4","data":{"type":"STAT","state":"1"}}
{"element":"COOL1","data":{"type":"STAT","state":"1"}}
{"element":"COOL2","data":{"type":"STAT","state":"1"}}
{"element":"COOL3","data":{"type":"STAT","state":"1"}}
{"element":"HEAT1","data":{"type":"STAT","state":"1"}}
A lookup table could be anything up to 2048 bits (though I don't have anything that size in use at the moment; that's the maximum I'd need to scale to).
The data field is the status of all 8 bits per address; some may be on, some may be off, and this updates every time my source pushes new data at me.
I'm looking for a way to do this in code, ideally explained for the lay-person, as I'm still very new to doing much with Ruby. There was a code example here, but it was not used in the end and has been removed from the question so as not to confuse.
Based on the example below, I've used the following code to make some progress. (Note this integrates with an existing script, not all of which is shown here; nor is the lookup table shown, as it's quite big now.)
data = [feeder]
data.each do |str|
  hash = JSON.parse(str)
  address = hash["address"]
  number = hash["data"].to_i(16)
  binary_str = sprintf("%0.8b", number)
  binary_str.reverse.each_char.with_index do |char, i|
    break if i + 1 > max_binary_digits
    mouse = { "element" => table[address][i], "data" => { "type" => 'STAT', "state" => char } }
    mousetrap = JSON.generate(mouse)
    puts mousetrap
  end
end
This gives me an output of {"element":"COOL1","data":{"type":"STAT","state":"0"}} etc... which in turn gives the correct output via my node.js script.
Having got this to work and captured a whole bunch of data from last night and this morning, I have a new problem/query. It appears that now I've built my lookup table, I need some of the results to be modified based on the result of the lookup. I have other sensors which need to generate a different output to feed my SVG, for example:
FAN objects need to output {"element":"FAN1","data":{"type":"STAT","state":"1"}}
DOOR objects need to output {"element":"DOOR1","data":{"type":"LAT","state":"1"}}
SWIPE objects need to output {"element":"SWIPE6","data":{"type":"ROUTE","state":"1"}}
ALARM objects need to output {"element":"PIR1","data":{"type":"PIR","state":"0"}}
This is due to the way the SVG deals with updating - I'm not in a position to modify the DOM stuff so would need to fix this in my Ruby script.
So to address this, what I ended up doing was making an exact copy of my existing lookup table, and rather than listing the devices I listed the type of output, like so:
Address 00: STAT STAT STAT ROUTE ROUTE LAT LAT PIR
Address 01: PIR PIR STAT ROUTE ROUTE LAT LAT PIR
This might be very dirty (and it also means I have to duplicate my lookup table), but it actually might be better for my specific needs, as devices within the dataset could have any name (I have no control over the received data). Having built the new lookup table, I modified the code I had been provided with below and already used for the original lookup, but I had to remove these two lines; without removing them, I was getting the result of the lookup output 8 times!
binary_str.reverse.each_char.with_index do |char, i|
break if i+1 > max_binary_digits
The final array was built using the following:
mouse = {"element"=>+table[address][i], "data"=>{"type"=>typetable[address][i], "state"=>char}}
mousetrap = JSON.generate(mouse)
puts mousetrap
This gave me exactly the output I require, and I was able to integrate it with the existing script, the node.js websocket, and the MongoDB 'state' database (which is read on initial load).
There is one last thing I'd like to try to do with this code: when certain element states are set to 1, I'd like to be able to look something else up (and then use that result). I'm thinking this may be best done with a find query to my MongoDB, then just using the result. Doing that would hit the DB for every query, but there would only ever be a handful of results, so most queries would return null, which is fine. Is this the right line of thinking?
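Something like this minimal sketch with the Ruby mongo gem is what I have in mind (the database, collection, and field names here are invented):

require 'mongo'

client = Mongo::Client.new(['127.0.0.1:27017'], database: 'plant')
extras = client[:element_extras] # hypothetical collection

# Returns the extra record for an element, or nil when there is none;
# intended to be called from the loop only when state == "1".
def lookup_extra(extras, element)
  extras.find(element: element).first
end

doc = lookup_extra(extras, "FAN1")
puts doc.inspect if doc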
require 'json'

table = {
  "00" => ["Fan1", "Fan2", "Fan3"],
  "01" => ["Hum1", "Hum2", "Fan5"],
}

max_binary_digits = table.first[1].size

data = [
  %Q[{"address": "00","data":"FF"}],
  %Q[{"address": "01","data":"00"}],
  %Q[{"address": "01","data":"03"}],
]

data.each do |str|
  hash = JSON.parse(str)
  address = hash["address"]
  number = hash["data"].to_i(16)
  binary_str = sprintf("%0.8b", number)
  p binary_str

  binary_str.reverse.each_char.with_index do |char, i|
    break if i + 1 > max_binary_digits
    puts %Q[{"element":#{table[address][i]},"data":{"type":"STAT","state":"#{char}"}}}]
  end

  puts "-" * 20
end
--output:--
"11111111"
{"element":Fan1,"data":{"type":"STAT","state":"1"}}}
{"element":Fan2,"data":{"type":"STAT","state":"1"}}}
{"element":Fan3,"data":{"type":"STAT","state":"1"}}}
--------------------
"00000000"
{"element":Hum1,"data":{"type":"STAT","state":"0"}}}
{"element":Hum2,"data":{"type":"STAT","state":"0"}}}
{"element":Fan5,"data":{"type":"STAT","state":"0"}}}
--------------------
"00000011"
{"element":Hum1,"data":{"type":"STAT","state":"1"}}}
{"element":Hum2,"data":{"type":"STAT","state":"1"}}}
{"element":Fan5,"data":{"type":"STAT","state":"0"}}}
--------------------
My answer assumes Bit1 in your table is the least significant bit; if that is not the case, remove .reverse from the code.
You can ask me anything you want about the code.
I wrote a Secret Santa program (à la Ruby Quiz... ish), but occasionally when the program runs, I get an error.
Stats: if there are 10 names in the pot, the error comes up about 5% of the time. If there are 100 names in the pot, it's less than 1%. This is over trials of 1000 runs in bash. I've determined that the gift arrays are coming up nil at some point, but I'm not sure why or how to avoid it.
Providing code...
0.upto($lname.length - 1).each do |i|
  j = rand($giftlname.length) # should be less each time.
  while $giftlname[j] == $lname[i] # redo random if it picks same person
    if $lname[i] == $lname.last # if random gives same output again, person is left with himself; needs to switch with someone
      $giftfname[j], $fname[i] = $giftfname[i], $fname[j]
      $giftlname[j], $lname[i] = $giftlname[i], $lname[j]
      $giftemail[j], $email[i] = $giftemail[i], $email[j]
    else
      j = rand($giftlname.length)
    end
  end
  $santas.push('Santa ' + $fname[i] + ' ' + $lname[i] + ' sends gift to ' + $giftfname[j] + ' ' + $giftlname[j] + ' at ' + '<' + $giftemail[j] + '>.') # Error here, something is sometimes nil
  $giftfname.delete_at(j)
  $giftlname.delete_at(j)
  $giftemail.delete_at(j)
end
Thanks SO!
I think your problem is right here:
$giftfname[j], $fname[i] = $giftfname[i], $fname[j]
Your i values range from zero to the last index of $fname (inclusive) and, presumably, your $giftfname starts off as a clone of $fname (or at least another array of the same length). But as you spin through the each, you're shrinking $giftfname, so $giftfname[i] will be nil and the swap operation above will put nil into $giftfname[j] (which is supposed to be a useful entry of $giftfname). Similar issues apply to $giftlname and $giftemail.
I'd recommend using one array of three-element objects (first name, last name, email) instead of your three parallel arrays. There's also a shuffle method on Array that might be of use to you:
Start with an array of people.
Make a copy of that array.
Shuffle the copy until it is different at every index from that original array.
Then zip the two together to get your final list of giver/receiver pairs.
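A minimal sketch of that approach (the people here are illustrative, and it assumes at least two of them, otherwise the reshuffle never terminates):

people = [
  { fname: "Alice", lname: "Anders", email: "alice@example.com" },
  { fname: "Bob",   lname: "Burke",  email: "bob@example.com"   },
  { fname: "Carol", lname: "Cruz",   email: "carol@example.com" },
]

# Reshuffle until nobody is paired with themselves.
receivers = people.shuffle
receivers = people.shuffle until people.zip(receivers).none? { |g, r| g.equal?(r) }

people.zip(receivers).each do |giver, receiver|
  puts "Santa #{giver[:fname]} #{giver[:lname]} sends gift to " \
       "#{receiver[:fname]} #{receiver[:lname]} at <#{receiver[:email]}>."
end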
Figured it out and used the retry statement. The if statement now looks like this (all other variables have been edited to be non-global as well):
if lname[i] == lname.last
  santas = Array.new
  giftfname = fname.clone
  giftlname = lname.clone
  giftemail = email.clone
  retry
end
That, aside from a few other edits, created the solution I needed without breaking apart the code too much again. I will definitely try out mu's solution as well, but I'm just glad I have this running error-free for now.