Pythonic way to generate an IPv4 report - Windows

I am looking for the most Pythonic way to print a very generic report based on input data, sorted by the number of instances per IP address.
The incoming data looks like this:
{'file3.txt': ['notice of IP abuse', '67.60.2.222', 'sender Bill Foobar'], 'file2.txt': ['DMCA notice 987654321', '101.143.22.22', 'Copyright Owner'], 'file1.txt': ['6.22.123.222', 'spam notification', 'Esmeralda Hopkins III'], 'file0.txt': ['DMCA notice 123456789', '67.60.2.222', 'Ian Flemming']}
I want to generate output that looks like this:
(2) 67.60.2.222
Bill Foobar | notice of IP abuse
Ian Flemming | DMCA notice 123456789
(1) 101.143.22.22
Copyright Owner | DMCA notice 987654321
(1) 6.22.123.222
Esmeralda Hopkins III | spam notification
Unfortunately I am stuck with Python 2 on Windows for this task, and I am unable to install any modules like Pandas because I don't have pip.
I have tried all of the following:
print dict(sorted(matrix.items(), key=lambda item: item[1]))
print matrix
print dict(sorted(matrix.items(), key=lambda item: ip))
print dict(sorted(matrix.items))
print sorted(matrix)
sorted_x = sorted(matrix.items(), key=operator.itemgetter(0))
print sorted_x
The result is always this, or something very similar:
{'file3.txt': ['notice of IP abuse', '67.60.2.222', 'sender Bill Foobar'], 'file2.txt': ['DMCA notice 987654321', '101.143.22.22', 'Copyright Owner'], 'file1.txt': ['6.22.123.222', 'spam notification', 'Esmeralda Hopkins III'], 'file0.txt': ['DMCA notice 123456789', '67.60.2.222', 'Ian Flemming']}
edit: I am able to sort the lists using this:
sorted_x = sorted(matrix.items(), key=operator.itemgetter(1))
print sorted_x
Now I am looking for a way to print in a report format, without using any specialized modules since I don't have pip. I am thinking the naive method is probably the best way, so my question is: what is the most Pythonic way to pretty-print it?
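For reference, here is a minimal standard-library-only sketch for Python 2 (the looks_like_ip helper and its regex are illustrative, not from the original post): invert the dict so the IP is the key, then sort the groups by descending count. Reordering the remaining fields (e.g. putting the sender first) is left out, since the field order varies between records.
import re

matrix = {'file3.txt': ['notice of IP abuse', '67.60.2.222', 'sender Bill Foobar'],
          'file2.txt': ['DMCA notice 987654321', '101.143.22.22', 'Copyright Owner'],
          'file1.txt': ['6.22.123.222', 'spam notification', 'Esmeralda Hopkins III'],
          'file0.txt': ['DMCA notice 123456789', '67.60.2.222', 'Ian Flemming']}

def looks_like_ip(field):
    # Crude IPv4 test; enough to pick the IP out of each record.
    return re.match(r'^\d{1,3}(\.\d{1,3}){3}$', field) is not None

groups = {}
for fields in matrix.values():
    ip = [f for f in fields if looks_like_ip(f)][0]    # the one IP-like field
    rest = [f for f in fields if f is not ip]          # everything else, original order
    groups.setdefault(ip, []).append(' | '.join(rest))

# Most frequent IP first; ties broken by the IP string.
for ip, entries in sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0])):
    print '(%d) %s' % (len(entries), ip)
    for entry in sorted(entries):
        print '    ' + entry
Run against the sample matrix above, this prints the (2)/(1) groupings in the same order as the desired output.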

Related

Need to filter output in a range between identifiers, i.e. grep between 1: this pattern and 2: that pattern and only show the text in between

I need to filter the output of a dictionary lookup. I've managed to get the output down to the following example, and I want to output only the text between the identifying patterns 1: and 2:.
I know of a solution with Python but would like to accomplish this in the shell. I think it is possible with sed but am not sure how.
n 1: feline mammal usually having thick soft fur and no ability
to roar: domestic cats; wildcats [syn: {cat}, {true cat}]
2: an informal term for a youth or man; "a nice guy"; "the guy's
only doing it for some doll" [syn: {guy}, {cat}, {hombre},
{bozo}]
3: a spiteful woman gossip; "what a cat she is!"
4: the leaves of the shrub Catha edulis which are chewed like
tobacco or used to make tea; has the effect of a euphoric
stimulant; "in Yemen kat is used daily by 85% of adults"
[syn: {kat}, {khat}, {qat}, {quat}, {cat}, {Arabian tea},
{African tea}]
5: a whip with nine knotted cords; "British sailors feared the
cat" [syn: {cat-o'-nine-tails}, {cat}]
6: a large tracked vehicle that is propelled by two endless
metal belts; frequently used for moving earth in construction
and farm work [syn: {Caterpillar}, {cat}]
v 1: beat with a cat-o'-nine-tails
2: eject the contents of the stomach through the mouth; "After
drinking too much, the students vomited"; "He purged
continuously"; "The patient regurgitated the food we gave him
last night" [syn: {vomit}, {vomit up}, {purge}, {cast},
{sick}, {cat}, {be sick}, {disgorge}, {regorge}, {retch},
{puke}, {barf}, {spew}, {spue}, {chuck}, {upchuck}, {honk},
{regurgitate}, {throw up}] [ant: {keep down}]
There is probably a better solution, since I needed an extra step after sed, using grep to remove unwanted lines. I still feel like I can accomplish this with sed alone, but here is my solution thus far:
dict -h dict.tu-chemnitz.de cat | sed -n '/1:/,/2:/p' | grep -v 2:
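For comparison, a minimal Python sketch of the same range filter (only to show the logic, since the asker prefers a shell solution): start printing at the line matching 1: and stop before the first line matching 2:, as the sed/grep pipeline above does.
import sys

printing = False
for line in sys.stdin:
    if '1:' in line:
        printing = True         # keep the "1:" line itself
    if '2:' in line:
        break                   # stop before the "2:" line, as grep -v 2: does
    if printing:
        sys.stdout.write(line)
It would be used the same way, e.g. dict -h dict.tu-chemnitz.de cat | python between.py (the script name is illustrative).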

How to pass a SQL call into an AWK script to dictate text replacement on a file?

I am working on a script to do some personal accounting and budgeting. I'm sure there are easier ways to do this, but I love UNIX-like CLI applications, so this is how I've chosen to go about it.
Currently, the pipeline starts with an AWK script that converts my CSV-formatted credit card statement into the plain-text, double-entry accounting format that the CLI accounting program Ledger can read. I can then do whatever reporting I want via Ledger.
Here is my AWK script in its current state:
#!/bin/bash
awk -F "," 'NR > 1 {
    gsub(/[0-9]*\.[0-9]$/, "&0", $7)   # pad one-decimal amounts: 25.0 -> 25.00
    gsub(",,", ",", $0)
    print substr($2,7,4) "-" substr($2,1,2) "-" substr($2,4,2) " * " $5
    print "    Expenses:"$6"  -"$7
    print "    Liabilities  "$7"\n"
}' /path/to/my/file.txt
Here is a simulated example of the original file (data is made up, format is correct):
POSTED,08/22/2018,08/23/2018,1234,RALPH'S COFFEE SHOP,Dining,4.33,
POSTED,08/22/2018,08/24/2018,1234,THE STUFF STORE,Merchandise,4.71,
POSTED,08/22/2018,08/22/2018,1234,PAST DUE FEE,Fee,25.0,
POSTED,08/21/2018,08/22/2018,5678,RALPH'S PAGODA,Dining,35.0,
POSTED,08/21/2018,08/23/2018,5678,GASLAND,Gas/Automotive,42.38,
POSTED,08/20/2018,08/21/2018,1234,CLASSY WALLMART,Grocery,34.67,
Here are the same entries after being converted to the Ledger format with the AWK script:
2018-08-22 * RALPH'S COFFEE SHOP
    Expenses:Dining  -4.33
    Liabilities  4.33

2018-08-22 * THE STUFF STORE
    Expenses:Merchandise  -4.71
    Liabilities  4.71

2018-08-22 * PAST DUE FEE
    Expenses:Fee  -25.00
    Liabilities  25.00

2018-08-21 * RALPH'S PAGODA
    Expenses:Dining  -35.00
    Liabilities  35.00

2018-08-21 * GASLAND
    Expenses:Gas/Automotive  -42.38
    Liabilities  42.38

2018-08-20 * CLASSY WALLMART
    Expenses:Grocery  -34.67
    Liabilities  34.67
Ledger can do all sorts of cool reporting on the different categories of spending and earning. My credit card automatically assigns categories to things (e.g. Expenses:Gas/Automotive, Expenses:Dining, etc.), but they are not always categorized in a way that reflects what was spent. I also want to be able to put in subcategories, such as Expenses:Dining:Coffee.
To do this, I created a SQLite database that contains the mappings I want. A query like:
SELECT v.name, tlc.name, sc.name
FROM vender AS v
JOIN top_level_category AS tlc ON v.top_level_category_id = tlc.id
JOIN sub_category AS sc ON v.sub_category_id = sc.id;
will output data like this:
RALPH'S COFFEE SHOP, Dining, Coffee
I want to figure out a way to pass the above SQL query into my AWK script in such a way that when AWK finds a vendor name in a line, it will replace the category assigned by the credit card with the category and subcategory from my database.
Any advice or thoughts would be greatly appreciated.
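One possible sketch in Python (rather than pure AWK): run the query once through the standard-library sqlite3 module, build a vendor-to-category dict, and rewrite the category field of each CSV line. The database file name budget.db is an assumption; the table and column names come from the query above.
import csv
import sqlite3
import sys

conn = sqlite3.connect('budget.db')   # assumed database file name
mapping = {}
for vendor, top_level, sub in conn.execute(
        "SELECT v.name, tlc.name, sc.name "
        "FROM vender AS v "
        "JOIN top_level_category AS tlc ON v.top_level_category_id = tlc.id "
        "JOIN sub_category AS sc ON v.sub_category_id = sc.id"):
    mapping[vendor] = '%s:%s' % (top_level, sub)   # e.g. Dining:Coffee

reader = csv.reader(sys.stdin)
writer = csv.writer(sys.stdout)
next(reader)   # skip the first line, as the awk script's NR > 1 does
for row in reader:
    # Field 5 is the vendor name, field 6 the card-assigned category (1-based).
    row[5] = mapping.get(row[4], row[5])
    writer.writerow(row)
The rewritten CSV could then be piped into the existing AWK script unchanged; alternatively, the query's output could be exported as CSV and loaded into an awk associative array with getline.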

Extracting plain text output from binary file

I am working with Graphchi's pagerank example: https://github.com/GraphChi/graphchi-cpp/wiki/Example-Apps#pagerank-easy
The example app writes a binary file with vertex information that I would like to read/convert to a plain text file (to later call into R or some other language).
The documentation states that:
"GraphChi will write the values of the edges in a binary file, which is easy to handle in other programs. Name of the file containing vertex values is GRAPH-NAME.4B.vout. Here "4B" refers to the vertex-value being a 4-byte type (float)."
The 'easy to handle' part is what I'm struggling with - I have experience with high level languages but not C++ or dealing with binary files. I have found a few things through searching stackoverflow but no luck yet in reading this file. Ideally this would be done through bash or python.
Thanks very much for your help on this.
Update: hexdump graph-name.4B.vout | head -5 gives:
0000000 999a 3e19 7468 3e7f 7d2a 3e93 d8e0 3ec4
0000010 cec6 3fe4 d551 3f08 eff2 3e54 999a 3e19
0000020 999a 3e19 3690 3e8c 0080 3f38 9ea3 3ef5
0000030 b7d6 3f66 999a 3e19 10e3 3ee1 400c 400d
0000040 a3df 3e7c 999a 3e19 979c 3e91 5230 3f18
Here is example code showing how you can use GraphChi to write the output out as a string:
https://github.com/GraphChi/graphchi-cpp/wiki/Vertex-Aggregators
But the array is a simple byte array. Here is an example of how to read it in Python:
import struct
from array import array as binarray
import sys

inputfile = sys.argv[1]
data = open(inputfile, 'rb').read()   # open in binary mode
a = binarray('c')
a.fromstring(data)

s = struct.Struct("f")                # one 4-byte float per vertex value
l = len(a)
print "%d bytes" % l

n = l / 4                             # number of floats in the file
for i in xrange(0, n):
    x = s.unpack_from(a, i * 4)[0]    # unpack the i-th float
    print "%d %f" % (i, x)
I was having the same trouble. Luckily I work with a bunch of network engineers who helped me out! On Mac or Linux, the following command works to print the 4B.vout data one line per node, with the integer values the same as given in the summary file. If your file is called e.g. filename.4B.vout, then some command-line perl gets you:
cat filename.4B.vout | LANG= perl -0777 -e '$,="\n"; print unpack("L*",<>),"";'
Edited to add: this is for the assignments of connected component ID and community ID, written implicitly: the 1st line is the ID of the node labeled 0, the 2nd line is the node labeled 1, etc. I am copy-pasting here, so I'm not sure how it would need to change for floats; it works great for the integer values per node.
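For floats, a compact Python sketch of the same whole-file unpack (assuming the file is nothing but native-endian 4-byte floats, as the GraphChi docs describe):
import struct
import sys

with open(sys.argv[1], 'rb') as f:
    data = f.read()

# Treat the whole file as a flat run of 4-byte floats, one per vertex,
# printed one value per line like the perl one-liner above.
for i, x in enumerate(struct.unpack('%df' % (len(data) // 4), data)):
    print('%d %f' % (i, x))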

Error in writing to a file

I have written a Python script that calls Unix sort using the subprocess module. I am trying to sort a table based on two columns (2 and 6). Here is what I have done:
sort_bt=open("sort_blast.txt",'w+')
sort_file_cmd="sort -k2,2 -k6,6n {0}".format(tab.name)
subprocess.call(sort_file_cmd,stdout=sort_bt,shell=True)
The output file, however, contains an incomplete line, which produces an error when I parse the table; yet when I checked the entry in the input file given to sort, the line looks perfect. I guess there is some problem when sort tries to write the result to the specified file, but I am not sure how to solve it.
The line looks like this in the input file:
gi|191252805|ref|NM_001128633.1| Homo sapiens RIMS binding protein 3C (RIMBP3C), mRNA gnl|BL_ORD_ID|4614 gi|124487059|ref|NP_001074857.1| RIMS-binding protein 2 [Mus musculus] 103 2877 3176 846 941 1.0102e-07 138.0
In the output file, however, only gi|19125 is printed. How do I solve this?
Any help will be appreciated.
Ram
Using subprocess to call an external sorting tool seems quite silly, considering that Python has a built-in method for sorting items.
Looking at your sample data, it appears to be structured data with a | delimiter. Here's how you could open that file and iterate over the results in Python in a sorted manner:
def custom_sorter(first, second):
    """ A custom sort function which compares items
    based on the values in the 2nd and 6th columns. """
    # First, we break each line into a list
    first_items, second_items = first.split(u'|'), second.split(u'|')  # Split on the pipe character.
    if len(first_items) >= 6 and len(second_items) >= 6:
        # We have enough items to compare
        if (first_items[1], first_items[5]) > (second_items[1], second_items[5]):
            return 1
        elif (first_items[1], first_items[5]) < (second_items[1], second_items[5]):
            return -1
        else:  # They are the same
            return 0  # Order doesn't matter then
    else:
        return 0

with open(src_file_path, 'r') as src_file:
    data = src_file.read()  # Read in the src file all at once. Hope the file isn't too big!

with open(dst_sorted_file_path, 'w+') as dst_sorted_file:
    for line in sorted(data.splitlines(), cmp=custom_sorter):  # Sort the data on the fly
        dst_sorted_file.write(line + '\n')  # splitlines() strips newlines, so add one back
FYI, this code may need some jiggling. I didn't test it too well.
What you see is probably the result of trying to write to the file from multiple processes simultaneously.
To emulate the sort -k2,2 -k6,6n ${tabname} > sort_blast.txt command in Python:
from subprocess import check_call

with open("sort_blast.txt", 'wb') as output_file:
    check_call("sort -k2,2 -k6,6n".split() + [tab.name], stdout=output_file)
You can write it in pure Python, e.g., for a small input file:
def custom_key(line):
    fields = line.split()               # split line on any whitespace
    return fields[1], float(fields[5])  # Python uses zero-based indexing

with open(tab.name) as input_file, open("sort_blast.txt", 'w') as output_file:
    L = input_file.read().splitlines()  # read from the input file
    L.sort(key=custom_key)              # sort it
    output_file.write("\n".join(L))     # write to the output file
If you need to sort a file that does not fit in memory, see Sorting text file by using Python.

Text processing for IPv4 dotted decimal notation conversion to /8 or /16 format

I have an input file that contains a list of IP addresses and ip_counts (a parameter that I use internally). The file looks somewhat like this:
202.124.127.26 2135869
202.124.127.25 2111217
202.124.127.17 2058082
202.124.127.16 2014958
202.124.127.20 1949323
202.124.127.24 1933773
202.124.127.27 1932076
202.124.127.22 1886466
202.124.127.18 1882955
202.124.127.21 1803528
202.124.127.23 1786348
119.224.129.200 1776592
119.224.129.211 1639325
202.124.127.19 1479198
119.224.129.201 1145426
202.49.175.110 1133354
119.224.129.210 1119525
68.232.45.132 1085491
119.224.129.209 1015078
131.203.3.8 857951
202.162.73.4 817197
207.123.58.125 785326
202.7.6.18 762603
117.121.253.254 718022
74.125.237.120 710448
68.232.44.219 693002
202.162.73.2 671559
205.128.75.126 611301
119.161.91.17 604393
119.224.129.202 559930
8.27.241.126 528862
74.125.237.152 517516
8.254.9.254 514341
As you can see, the IP addresses themselves are unsorted, so I use the sort command on the file to sort the IP addresses, as below:
cat address_count.txt | sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n > sorted_address.txt
This gives me an output with the IP addresses in sorted order. Partial output of that file is shown below:
4.23.63.126 15731
4.26.254.254 320705
4.27.8.254 25174
8.12.129.50 176141
8.12.223.125 11800
8.19.32.65 15854
8.19.240.53 11013
8.19.240.70 11915
8.19.240.72 31541
8.19.240.73 23304
8.20.213.28 96434
8.20.213.32 108191
8.20.213.34 170058
8.20.213.39 23512
8.20.213.41 10420
8.20.213.61 24809
8.26.195.253 28568
8.27.152.253 104446
8.27.233.125 115856
8.27.235.126 16102
8.27.235.254 25628
8.27.238.254 108485
8.27.240.125 169262
8.27.241.126 528862
8.27.241.252 197302
8.27.248.125 14926
8.254.9.254 514341
12.129.210.71 89663
15.192.45.21 20139
15.192.45.26 35265
15.193.0.148 10313
15.193.113.29 40318
15.201.49.136 14243
15.240.238.52 57163
17.250.248.95 28166
23.33.125.13 19179
23.33.125.37 17953
31.151.163.60 72709
38.99.42.37 192356
38.99.68.180 41251
38.99.68.181 10272
38.104.237.74 74012
38.108.112.103 37034
38.108.112.115 69698
38.108.112.121 92173
38.108.112.122 99230
38.112.63.238 39958
38.119.130.62 42159
46.4.28.22 19769
Now I want to parse the file given above and convert it to aaa.bbb.ccc.0/8 format and aaa.bbb.0.0/16 format, and I also want to count the number of occurrences of the IPs in each subnet. I want to do this using bash; I am open to using sed or awk. How do I achieve this?
For example
8.19.240.53 11013
8.19.240.70 11915
8.19.240.72 31541
8.19.240.73 23304
8.20.213.28 96434
8.20.213.32 108191
8.20.213.34 170058
8.20.213.39 23512
8.20.213.41 10420
8.20.213.61 24809
The above input portion should produce 8.19.240.0/8 and 8.20.213.0/8, and similarly for the /16 domains. I also want to count the occurrences of machines in each subnet.
For example, in the above output the first subnet should have the count 4 in the next column beside it. It should also add the already-displayed counts, i.e. (11013 + 11915 + 31541 + 23304), in another column.
8.19.240.0/8 4 (11013 + 11915 + 31541 + 23304)
8.20.213.0/8 6 (96434 + 108191 + 170058 + 23512 + 10420 + 24809)
It would be great if someone could suggest some way to achieve this.
The main problem here is that, without the routing table from the individual moments the packets arrived, you have no idea what netblock they were originally in. Sure, you can put them in the classful blocks they would be in under classful routing, but all that will give you is a nice presentation (and, admittedly, a shorter file).
Furthermore, your example looks a bit broken. You have a bunch of IP addresses in 8.0.0.0/8 and you are aggregating them into what looks like /24 routes and presenting them with a /8 at the end.
Nonetheless, in awk you can use sub() to do text replacement (or you can use index to find occurrences of ".", or you can use split to split at the dots). It should be relatively easy to go from that to "drop the last octet, append the string "0/24", and use that as a key to update an IP-count and a hit-count dictionary; then drop the last two octets and the slash, replace with "0.0/16", and do the same" (all arrays in awk are associative arrays, so essentially dicts). No need to sort in advance: when you loop through the result you'll get the keys in a random order, but on average there will be fewer of them, so sorting afterwards will be cheaper.
I seem to not have an awk at hand, so I cannot give you a code example.
This might work for you:
awk '{a=$1;sub(/\.[^.]*$/,"",a);ac[a]++;at[a]+=$2};END{for(x in ac)print x".0/8",ac[x],at[x]}' file
This prints the .0/8 addresses; to get the .0/16s, duplicate the code, i.e. b=a;sub(/\.[^.]*$/,"",b);ba[b]++ etc., etc.
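For completeness, a Python sketch of the aggregation described above. Note that it labels the buckets /24 and /16, since, as the first answer points out, dropping one octet yields a /24, not a /8; the input file name is the sorted file from the question.
from collections import defaultdict

hits = {24: defaultdict(int), 16: defaultdict(int)}    # addresses seen per bucket
totals = {24: defaultdict(int), 16: defaultdict(int)}  # summed ip_counts per bucket

with open('sorted_address.txt') as f:
    for line in f:
        ip, count = line.split()
        octets = ip.split('.')
        net24 = '%s.%s.%s.0/24' % tuple(octets[:3])
        net16 = '%s.%s.0.0/16' % tuple(octets[:2])
        for size, net in ((24, net24), (16, net16)):
            hits[size][net] += 1
            totals[size][net] += int(count)

for size in (24, 16):
    for net in sorted(hits[size]):
        print('%s %d %d' % (net, hits[size][net], totals[size][net]))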
