Alphanumeric sort in Vim
Suppose I have a list in a text file as follows:
TaskB_115
TaskB_19
TaskB_105
TaskB_13
TaskB_10
TaskB_0_A_1
TaskB_17
TaskB_114
TaskB_110
TaskB_0_A_5
TaskB_16
TaskB_12
TaskB_113
TaskB_15
TaskB_103
TaskB_2
TaskB_18
TaskB_106
TaskB_11
TaskB_14
TaskB_104
TaskB_112
TaskB_107
TaskB_0_A_4
TaskB_102
TaskB_100
TaskB_109
TaskB_101
TaskB_0_A_2
TaskB_0_A_3
TaskB_116
TaskB_1_A_0
TaskB_111
TaskB_108
If I sort in Vim with the command :%sort, it gives me this output:
TaskB_0_A_1
TaskB_0_A_2
TaskB_0_A_3
TaskB_0_A_4
TaskB_0_A_5
TaskB_10
TaskB_100
TaskB_101
TaskB_102
TaskB_103
TaskB_104
TaskB_105
TaskB_106
TaskB_107
TaskB_108
TaskB_109
TaskB_11
TaskB_110
TaskB_111
TaskB_112
TaskB_113
TaskB_114
TaskB_115
TaskB_116
TaskB_12
TaskB_13
TaskB_14
TaskB_15
TaskB_16
TaskB_17
TaskB_18
TaskB_19
TaskB_1_A_0
TaskB_2
But I would like to have the output as follows:
TaskB_0_A_1
TaskB_0_A_2
TaskB_0_A_3
TaskB_0_A_4
TaskB_0_A_5
TaskB_1_A_0
TaskB_2
TaskB_10
TaskB_11
TaskB_12
TaskB_13
TaskB_14
TaskB_15
TaskB_16
TaskB_17
TaskB_18
TaskB_19
TaskB_100
TaskB_101
TaskB_102
TaskB_103
TaskB_104
TaskB_105
TaskB_106
TaskB_107
TaskB_108
TaskB_109
TaskB_110
TaskB_111
TaskB_112
TaskB_113
TaskB_114
TaskB_115
TaskB_116
Note: I just wrote this list to demonstrate the problem; I could have generated it in order in the first place. But I want a sort that works for other data in Vim as well.
From the Vim help (:help :sort):

    With [n] sorting is done on the first decimal number
    in the line (after or inside a {pattern} match).
    One leading '-' is included in the number.

So try this command:

    :sort n

You don't need the %: when no range is given, :sort sorts all lines.
EDIT

As commented by the OP, if you have:
TaskB_0_A_1
TaskB_0_A_2
TaskB_0_A_4
TaskB_0_A_3
TaskB_0_A_5
TaskB_1_A_0
You could try:

    :sort n /.*_\ze\d*/

or

    :sort nr /\d*$/
EDIT 2

For the newly edited question, this command line may give you the expected output based on your example data:

    :sort nr /\d*$/ | sort n

This first sorts numerically on the trailing number, then sorts numerically on the first number in each line; lines that tie in the second sort keep their relative order from the first one (this relies on :sort being stable, i.e. equal lines staying in their original order).
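For comparison, the same "natural" ordering can be produced outside Vim with a few lines of Python. This is only a sketch: lines is assumed to be a list holding the lines to sort.

    import re

    def natural_key(line):
        # Split into alternating text/number chunks, e.g.
        # "TaskB_0_A_1" -> ["TaskB_", 0, "_A_", 1, ""]
        return [int(part) if part.isdigit() else part
                for part in re.split(r"(\d+)", line)]

    lines.sort(key=natural_key)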
Related
Python: Can I grab specific lines from a large file faster?
I have two large files. One of them is an info file (about 270 MB and 16,000,000 lines) like this:

    1101:10003:17729
    1101:10003:19979
    1101:10003:23319
    1101:10003:24972
    1101:10003:2539
    1101:10003:28242
    1101:10003:28804

The other is in standard FASTQ format (about 27 GB and 280,000,000 lines) like this:

    #ST-E00126:65:H3VJ2CCXX:7:1101:1416:1801 1:N:0:5
    NTGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGG
    +
    AAAFFKKKKKKKKKFKKKKKKKFKKKKAFKKKKKAF7AAFFKFAAFFFKKF7FF<FKK
    #ST-E00126:65:H3VJ2CCXX:7:1101:10003:75641 1:N:0:5
    TAAGATAGATAGCCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGG
    +
    AAAFFKKKKKKKKKFKKKKKKKFKKKKAFKKKKKAF7AAFFKFAAFFFKKF7FF<FKK

The FASTQ file uses four lines per sequence. Line 1 begins with a '#' character and is followed by a sequence identifier. For each sequence, this part of Line 1 is unique: 1101:1416:1801 and 1101:10003:75641.

And I want to grab Line 1 and the next three lines from the FASTQ file according to the info file. Here is my code:

    import gzip
    import re

    count = 0
    with open('info_path') as info, open('grab_path', 'w') as grab:
        for i in info:
            sample = i.strip()
            with gzip.open('fq_path') as fq:
                for j in fq:
                    count += 1
                    if count % 4 == 1:
                        line = j.strip()
                        m = re.search(sample, j)
                        if m != None:
                            grab.writelines(line + '\n' + fq.next() + fq.next() + fq.next())
                            count = 0
                            break

And it works, but because both of these files have millions of lines it's inefficient (running for one day only produced 20,000 lines).

UPDATE at July 6th:

I found that the info file can be read into memory (thanks to @tobias_k for reminding me), so I create a dictionary whose keys are the info lines and whose values are all 0. After that, I read the FASTQ file four lines at a time, use the identifier part of the header as the key, and if the value is 0, output the four lines. Here is my code:

    import gzip

    dic = {}
    with open('info_path') as info:
        for i in info:
            sample = i.strip()
            dic[sample] = 0

    with gzip.open('fq_path') as fq, open('grap_path', "w") as grab:
        for j in fq:
            if j[:10] == '#ST-E00126':
                line = j.split(':')
                match = line[4] + ':' + line[5] + ':' + line[6][:-2]
                if dic.get(match) == 0:
                    grab.writelines(j + fq.next() + fq.next() + fq.next())

This way is much faster; it takes 20 minutes to get all the matched lines (about 64,000,000 lines). I have also thought about sorting the FASTQ file first by external sort. Splitting the file into pieces that can be read into memory is OK; my trouble is how to keep the next three lines together with the identifier line while sorting. The answer I found by googling is to linearize these four lines first (join each record into one line), but that alone takes 40 minutes. Anyway, thanks for your help.
You can sort both files by the identifier (the 1101:1416:1801 part). Even if the files do not fit into memory, you can use external sorting. After this, you can apply a simple merge-like strategy: read both files together and do the matching along the way. Something like this (pseudocode):

    entry1 = readFromFile1()
    entry2 = readFromFile2()
    while (none of the files ended)
        if (entry1.id == entry2.id)
            record match
        else if (entry1.id < entry2.id)
            entry1 = readFromFile1()
        else
            entry2 = readFromFile2()

This way entry1.id and entry2.id are always close to each other and you will not miss any matches. At the same time, this approach requires iterating over each file only once.
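A minimal Python rendering of that pseudocode, as a sketch: it assumes each input is an iterator of (id, record) tuples already sorted by id (for FASTQ, linearizing each four-line record into one line first makes a line-based external sort applicable).

    def merge_matches(sorted1, sorted2):
        """Yield matching pairs from two id-sorted iterators of (id, record) tuples."""
        entry1 = next(sorted1, None)
        entry2 = next(sorted2, None)
        while entry1 is not None and entry2 is not None:
            if entry1[0] == entry2[0]:
                yield entry1, entry2              # record the match
                entry2 = next(sorted2, None)
            elif entry1[0] < entry2[0]:
                entry1 = next(sorted1, None)
            else:
                entry2 = next(sorted2, None)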
Extracting plain text output from binary file
I am working with GraphChi's PageRank example: https://github.com/GraphChi/graphchi-cpp/wiki/Example-Apps#pagerank-easy

The example app writes a binary file with vertex information that I would like to read/convert to a plain text file (to later call into R or some other language). The documentation states that:

"GraphChi will write the values of the edges in a binary file, which is easy to handle in other programs. Name of the file containing vertex values is GRAPH-NAME.4B.vout. Here "4B" refers to the vertex-value being a 4-byte type (float)."

The 'easy to handle' part is what I'm struggling with: I have experience with high-level languages but not with C++ or with binary files. I have found a few things through searching Stack Overflow but no luck yet in reading this file. Ideally this would be done through bash or Python. Thanks very much for your help on this.

Update: hexdump graph-name.4B.vout | head -5 gives:

    0000000 999a 3e19 7468 3e7f 7d2a 3e93 d8e0 3ec4
    0000010 cec6 3fe4 d551 3f08 eff2 3e54 999a 3e19
    0000020 999a 3e19 3690 3e8c 0080 3f38 9ea3 3ef5
    0000030 b7d6 3f66 999a 3e19 10e3 3ee1 400c 400d
    0000040 a3df 3e7c 999a 3e19 979c 3e91 5230 3f18
Here is example code showing how you can use GraphChi to write the output out as a string: https://github.com/GraphChi/graphchi-cpp/wiki/Vertex-Aggregators

But the file itself is a simple byte array. Here is how to read it in Python (this is Python 2 code):

    import struct
    import sys
    from array import array as binarray

    inputfile = sys.argv[1]
    data = open(inputfile).read()

    # load the raw bytes into an array of chars
    a = binarray('c')
    a.fromstring(data)

    s = struct.Struct("f")   # one native 4-byte float
    l = len(a)
    print "%d bytes" % l

    n = l / 4                # number of floats in the file
    for i in xrange(0, n):
        x = s.unpack_from(a, i * 4)[0]
        print "%d %f" % (i, x)
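For reference, a Python 3 sketch of the same idea, assuming the file is nothing but a raw sequence of native 4-byte floats (and that its length is a multiple of 4 bytes), as the documentation describes:

    import struct
    import sys

    with open(sys.argv[1], "rb") as f:
        data = f.read()

    # iter_unpack steps through the buffer one 4-byte float at a time
    for i, (value,) in enumerate(struct.iter_unpack("f", data)):
        print(i, value)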
I was having the same trouble. Luckily I work with a bunch of network engineers who helped me out! On Mac/Linux, the following command works to print the 4B.vout data one line per node, with the integer values the same as given in the summary file. If your file is called e.g. filename.4B.vout, then some command-line Perl gets you:

    cat filename.4B.vout | LANG= perl -0777 -e '$,="\n"; print unpack("L*",<>),"";'

Edited to add: this is for the assignments of connected-component ID and community ID, written implicitly: the 1st line is the ID of the node labeled 0, the 2nd line is the node labeled 1, etc. But I am copy-pasting here, so I'm not sure how it would need to change for floats. It works great for the integer values per node.
error in writing to a file
I have written a Python script that calls Unix sort using the subprocess module. I am trying to sort a table based on two columns (2 and 6). Here is what I have done:

    sort_bt = open("sort_blast.txt", 'w+')
    sort_file_cmd = "sort -k2,2 -k6,6n {0}".format(tab.name)
    subprocess.call(sort_file_cmd, stdout=sort_bt, shell=True)

The output file, however, contains an incomplete line, which produces an error when I parse the table; but when I check the entry in the input file given to sort, the line looks perfect. I guess there is some problem when sort tries to write the result to the specified file, but I am not sure how to solve it. The line looks like this in the input file:

    gi|191252805|ref|NM_001128633.1| Homo sapiens RIMS binding protein 3C (RIMBP3C), mRNA gnl|BL_ORD_ID|4614 gi|124487059|ref|NP_001074857.1| RIMS-binding protein 2 [Mus musculus] 103 2877 3176 846 941 1.0102e-07 138.0

In the output file, however, only gi|19125 is printed. How do I solve this? Any help will be appreciated.

Ram
Using subprocess to call an external sorting tool seems quite silly considering that Python has a built-in method for sorting items.

Looking at your sample data, it appears to be structured data with a | delimiter. Here's how you could open that file and iterate over the results in Python in a sorted manner (this is Python 2 code; the cmp argument to sorted() no longer exists in Python 3):

    def custom_sorter(first, second):
        """
        A custom sort function which compares items based on
        the value in the 2nd and 6th columns.
        """
        # First, we break each line into a list, splitting on the pipe character.
        first_items, second_items = first.split(u'|'), second.split(u'|')
        if len(first_items) >= 6 and len(second_items) >= 6:
            # We have enough items to compare
            if (first_items[1], first_items[5]) > (second_items[1], second_items[5]):
                return 1
            elif (first_items[1], first_items[5]) < (second_items[1], second_items[5]):
                return -1
            else:
                return 0  # They are the same
        else:
            return 0  # Order doesn't matter then

    with open(src_file_path, 'r') as src_file:
        data = src_file.read()  # Read in the src file all at once. Hope the file isn't too big!

    with open(dst_sorted_file_path, 'w+') as dst_sorted_file:
        for line in sorted(data.splitlines(), cmp=custom_sorter):  # Sort the data on the fly
            dst_sorted_file.write(line + '\n')  # Write the line (and its newline) to the dst_file.

FYI, this code may need some jiggling. I didn't test it too well.
What you see is probably the result of trying to write to the file from multiple processes simultaneously.

To emulate the shell command

    sort -k2,2 -k6,6n ${tabname} > sort_blast.txt

in Python:

    from subprocess import check_call

    with open("sort_blast.txt", 'wb') as output_file:
        check_call("sort -k2,2 -k6,6n".split() + [tab.name], stdout=output_file)

You can also write it in pure Python, e.g., for a small input file:

    def custom_key(line):
        fields = line.split()               # split line on any whitespace
        return fields[1], float(fields[5])  # Python uses zero-based indexing

    with open(tab.name) as input_file, open("sort_blast.txt", 'w') as output_file:
        L = input_file.read().splitlines()  # read from the input file
        L.sort(key=custom_key)              # sort it
        output_file.write("\n".join(L))     # write to the output file

If you need to sort a file that does not fit in memory, see Sorting text file by using Python.
how to replace last comma in a line with a string in unix
I am trying to insert a string into every line except the first and last lines in a file, but I am not able to get it done. Can anyone give some clue how to achieve this? Thanks in advance.

How to replace the last comma in a line with a string xxxxx (except for the first and last rows) using Unix?

Original file:

    00,SRI,BOM,FF,000004,20120808030100,20120907094412,"GTEXPR","SRIVIM","8894-7577","SRIVIM#GTEXPR."
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,02BLYPO
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,TCM,231040,231100,01
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,UPW,231240,231300,01
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,UFG,231700,231900,01
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,FTG,232140,232200,01
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,BOR,232340,232400,01
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,BAY,232640,232700,01
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,01
    10,SRI,FF,BUN,0800,NMJWJB,U,NM,CCL,,101400,02CHLSU
    10,SRI,FF,BUN,0800,NMJWJB,U,NM,PAR,101540,101700,01
    10,SRI,FF,BUN,0800,NMJWJB,U,NM,MCE,101840,101900,01
    10,SRI,FF,BUN,0800,NMJWJB,U,NM,SSS,102140,102200,09
    10,SRI,FF,BUN,0800,NMJWJB,U,NM,FSS,102600,,01
    10,SRI,FF,BUN,0802,NMJWJB,U,NM,CCL,,103700,01CHLSU
    10,SRI,FF,BUN,0802,NMJWJB,U,NM,PAR,103940,104000,01
    10,SRI,FF,BUN,0802,NMJWJB,U,NM,MCE,104140,104200,01
    10,SRI,FF,BUN,0802,NMJWJB,U,NM,SSS,104440,104500,09
    10,SRI,FF,BUN,0802,NMJWJB,U,NM,FSS,105000,,01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,GEB,,230900,02BLYSU
    10,SRI,FF,BUN,3112,NMNSME,U,NM,TCM,231040,231100,01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,UPW,231240,231300,01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,UFG,231700,231900,01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,FTG,232140,232200,01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,BOR,232340,232400,01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,BAY,232640,232700,01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,RWD,233400,,01
    99,SRI,FF,28

Expected file:

    00,SRI,BOM,FF,000004,20120808030100,20120907094412,"GTEXPR","SRIVIM","8894-7577","SRIVIM#GTEXPR."
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYPO
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,TCM,231040,xxxxx231100,01
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,UPW,231240,xxxxx231300,01
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,UFG,231700,xxxxx231900,01
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,FTG,232140,xxxxx232200,01
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,BOR,232340,xxxxx232400,01
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,BAY,232640,xxxxx232700,01
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
    10,SRI,FF,BUN,0800,NMJWJB,U,NM,CCL,,101400,xxxxx02CHLSU
    10,SRI,FF,BUN,0800,NMJWJB,U,NM,PAR,101540,101700,xxxxx01
    10,SRI,FF,BUN,0800,NMJWJB,U,NM,MCE,101840,101900,xxxxx01
    10,SRI,FF,BUN,0800,NMJWJB,U,NM,SSS,102140,102200,xxxxx09
    10,SRI,FF,BUN,0800,NMJWJB,U,NM,FSS,102600,,xxxxx01
    10,SRI,FF,BUN,0802,NMJWJB,U,NM,CCL,,103700,xxxxx01CHLSU
    10,SRI,FF,BUN,0802,NMJWJB,U,NM,PAR,103940,104000,xxxxx01
    10,SRI,FF,BUN,0802,NMJWJB,U,NM,MCE,104140,104200,xxxxx01
    10,SRI,FF,BUN,0802,NMJWJB,U,NM,SSS,104440,104500,xxxxx09
    10,SRI,FF,BUN,0802,NMJWJB,U,NM,FSS,105000,,xxxxx01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYSU
    10,SRI,FF,BUN,3112,NMNSME,U,NM,TCM,231040,231100,xxxxx01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,UPW,231240,231300,xxxxx01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,UFG,231700,231900,xxxxx01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,FTG,232140,232200,xxxxx01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,BOR,232340,232400,xxxxx01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,BAY,232640,232700,xxxxx01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
    99,SRI,FF,28
awk can be quite useful for manipulating data files like this one. Here's a one-liner that does more or less what you want: it prepends the string "xxxxx" to the twelfth field of each input line that has at least twelve fields.

    $ awk 'BEGIN{FS=OFS=","}NF>11{$12="xxxxx"$12}{print}' 16006747.txt
    00,SRI,BOM,FF,000004,20120808030100,20120907094412,"GTEXPR","SRIVIM","8894-7577","SRIVIM#GTEXPR."
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYPO
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,TCM,231040,231100,xxxxx01
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,UPW,231240,231300,xxxxx01
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,UFG,231700,231900,xxxxx01
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,FTG,232140,232200,xxxxx01
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,BOR,232340,232400,xxxxx01
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,BAY,232640,232700,xxxxx01
    10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
    10,SRI,FF,BUN,0800,NMJWJB,U,NM,CCL,,101400,xxxxx02CHLSU
    10,SRI,FF,BUN,0800,NMJWJB,U,NM,PAR,101540,101700,xxxxx01
    10,SRI,FF,BUN,0800,NMJWJB,U,NM,MCE,101840,101900,xxxxx01
    10,SRI,FF,BUN,0800,NMJWJB,U,NM,SSS,102140,102200,xxxxx09
    10,SRI,FF,BUN,0800,NMJWJB,U,NM,FSS,102600,,xxxxx01
    10,SRI,FF,BUN,0802,NMJWJB,U,NM,CCL,,103700,xxxxx01CHLSU
    10,SRI,FF,BUN,0802,NMJWJB,U,NM,PAR,103940,104000,xxxxx01
    10,SRI,FF,BUN,0802,NMJWJB,U,NM,MCE,104140,104200,xxxxx01
    10,SRI,FF,BUN,0802,NMJWJB,U,NM,SSS,104440,104500,xxxxx09
    10,SRI,FF,BUN,0802,NMJWJB,U,NM,FSS,105000,,xxxxx01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYSU
    10,SRI,FF,BUN,3112,NMNSME,U,NM,TCM,231040,231100,xxxxx01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,UPW,231240,231300,xxxxx01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,UFG,231700,231900,xxxxx01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,FTG,232140,232200,xxxxx01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,BOR,232340,232400,xxxxx01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,BAY,232640,232700,xxxxx01
    10,SRI,FF,BUN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
    99,SRI,FF,28
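For comparison, the same transformation in Python (a sketch; the input and output file names are placeholders):

    # Prepend "xxxxx" to the twelfth comma-separated field;
    # lines with fewer than twelve fields pass through unchanged.
    with open("16006747.txt") as src, open("16006747.out.txt", "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split(",")
            if len(fields) > 11:
                fields[11] = "xxxxx" + fields[11]
            dst.write(",".join(fields) + "\n")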
text processing for IPv4 dotted decimal notation conversion to /8 or /16 format
I have an input file that contains a list of IP addresses and ip_counts (some parameter that I use internally). The file looks somewhat like this:

    202.124.127.26 2135869
    202.124.127.25 2111217
    202.124.127.17 2058082
    202.124.127.16 2014958
    202.124.127.20 1949323
    202.124.127.24 1933773
    202.124.127.27 1932076
    202.124.127.22 1886466
    202.124.127.18 1882955
    202.124.127.21 1803528
    202.124.127.23 1786348
    119.224.129.200 1776592
    119.224.129.211 1639325
    202.124.127.19 1479198
    119.224.129.201 1145426
    202.49.175.110 1133354
    119.224.129.210 1119525
    68.232.45.132 1085491
    119.224.129.209 1015078
    131.203.3.8 857951
    202.162.73.4 817197
    207.123.58.125 785326
    202.7.6.18 762603
    117.121.253.254 718022
    74.125.237.120 710448
    68.232.44.219 693002
    202.162.73.2 671559
    205.128.75.126 611301
    119.161.91.17 604393
    119.224.129.202 559930
    8.27.241.126 528862
    74.125.237.152 517516
    8.254.9.254 514341

As you can see, the IP addresses themselves are unsorted. So I use the sort command on the file to sort the IP addresses, as below:

    cat address_count.txt | sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n > sorted_address.txt

which gives me an output with the IP addresses in sorted order. The partial output of that file is shown below:

    4.23.63.126 15731
    4.26.254.254 320705
    4.27.8.254 25174
    8.12.129.50 176141
    8.12.223.125 11800
    8.19.32.65 15854
    8.19.240.53 11013
    8.19.240.70 11915
    8.19.240.72 31541
    8.19.240.73 23304
    8.20.213.28 96434
    8.20.213.32 108191
    8.20.213.34 170058
    8.20.213.39 23512
    8.20.213.41 10420
    8.20.213.61 24809
    8.26.195.253 28568
    8.27.152.253 104446
    8.27.233.125 115856
    8.27.235.126 16102
    8.27.235.254 25628
    8.27.238.254 108485
    8.27.240.125 169262
    8.27.241.126 528862
    8.27.241.252 197302
    8.27.248.125 14926
    8.254.9.254 514341
    12.129.210.71 89663
    15.192.45.21 20139
    15.192.45.26 35265
    15.193.0.148 10313
    15.193.113.29 40318
    15.201.49.136 14243
    15.240.238.52 57163
    17.250.248.95 28166
    23.33.125.13 19179
    23.33.125.37 17953
    31.151.163.60 72709
    38.99.42.37 192356
    38.99.68.180 41251
    38.99.68.181 10272
    38.104.237.74 74012
    38.108.112.103 37034
    38.108.112.115 69698
    38.108.112.121 92173
    38.108.112.122 99230
    38.112.63.238 39958
    38.119.130.62 42159
    46.4.28.22 19769

Now I want to parse the file given above, convert it to aaa.bbb.ccc.0/8 format and aaa.bbb.0.0/16 format, and also count the number of occurrences of the IPs in each subnet. I want to do this using bash; I am open to using sed or awk. How do I achieve this?

For example, this input portion:

    8.19.240.53 11013
    8.19.240.70 11915
    8.19.240.72 31541
    8.19.240.73 23304
    8.20.213.28 96434
    8.20.213.32 108191
    8.20.213.34 170058
    8.20.213.39 23512
    8.20.213.41 10420
    8.20.213.61 24809

should produce 8.19.240.0/8 and 8.20.213.0/8, and similarly for the /16 domains. I also want to count the occurrences of machines in each subnet: in the above output, the first subnet should have the count 4 in the next column beside it, and it should also sum the already displayed counts, i.e. (11013 + 11915 + 31541 + 23304), in another column:

    8.19.240.0/8 4 (11013 + 11915 + 31541 + 23304)
    8.20.213.0/8 6 (96434 + 108191 + 170058 + 23512 + 10420 + 24809)

It would be great if someone could suggest some way to achieve this.
The main problem here is that without the routing table from the individual moments the packets arrived, you have no idea what netblock they were originally in. Sure, you can put them in the classful blocks they would belong to under classful routing, but all that will give you is a nice presentation (and, admittedly, a shorter file).

Furthermore, your example looks a bit broken. You have a bunch of IP addresses in 8.0.0.0/8, and you are aggregating them into what look like /24 routes while presenting them with a /8 at the end.

Nonetheless, in awk you can use sub() to do text replacement (or you can use index() to find occurrences of ".", or split() to split at the dots). From there it should be relatively easy to: drop the last octet, append the string "0/24", and use that as a key to update an IP-count and a hit-count dictionary; then drop the last two octets and the slash, replace with "0.0/16", and do the same. (All arrays in awk are associative arrays, so they are essentially dicts.) There is no need to sort in advance: when you loop through the result you'll get the keys in a random order, but on average there will be fewer of them, so sorting afterwards will be cheaper.

I seem to not have an awk at hand, so I cannot give you a code example.
This might work for you:

    awk '{a=$1;sub(/\.[^.]*$/,"",a);ac[a]++;at[a]+=$2};END{for(x in ac)print x".0/8",ac[x],at[x]}' file

This prints the .0/8 addresses. To get the 0/16s, duplicate the code, i.e. b=a;sub(/\.[^.]*$/,"",b);ba[b]++ etc.
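A Python sketch of the same aggregation, doing both levels in one pass; it keeps the question's "/8" and "/16" labels even though, as noted above, dropping one octet is really a /24 prefix. The file name is taken from the question.

    from collections import defaultdict

    counts, totals = defaultdict(int), defaultdict(int)

    with open("sorted_address.txt") as f:
        for line in f:
            ip, hits = line.split()
            octets = ip.split(".")
            key24 = ".".join(octets[:3]) + ".0/8"     # the question's "/8" label
            key16 = ".".join(octets[:2]) + ".0.0/16"
            for key in (key24, key16):
                counts[key] += 1                       # machines seen in this block
                totals[key] += int(hits)               # summed ip_counts

    for key in sorted(counts):
        print(key, counts[key], totals[key])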