help with LibSVM input data - windows

I am using the LibSVM tool for my support vector classification implementation.
The first line of my input data file looks like this:
+1 15752:47 6279:45 475:40 5231:30 515:29 7529:28 11623:24 274:24 15431:21 7342:20 4819:20 7598:18 8853:17 11134:16 501:16 911:15 4656:15 5875:14 10725:13 7334:13 13762:13 8295:12 9314:12 317:12 10641:12 2690:12 8771:12 4698:11 11519:10 10069:9 10019:8 1120:8 15017:8 254:8 7900:8 5395:8 486:8 1763:8 11183:7 9163:7 9219:7 1827:7 11901:7 4068:6 15592:6 9925:6 3464:5 8408:5 15348:5 8432:5 10064:5 6319:4 5729:4 8334:4 11817:4 6238:4 4521:4 11761:4 328:4 15876:4 6494:4 280:4 14628:4 5514:4 6383:4 9149:4 2456:4 6741:4 482:4 2773:4 10873:3 8715:3 8802:3 11478:3 11848:3 12269:3 10592:3 12911:3 11051:3 10798:3 8412:3 232:3 7654:3 1210:3 502:3 12687:3 14459:2 2725:2 9851:2 5799:2 16046:2 3612:2 1440:2 8503:2 245:2 9780:2 322:2 11902:2 8977:2 14949:2 5710:2 6423:2 9896:2 5507:2 10646:2 9932:2 14894:2 3997:2 13429:2 9845:2 8547:2 2720:2 861:2 2830:2 5703:2 6994:2 13973:2 3086:2 262:2 7793:2 208:2 3221:2 13229:2 13350:2 372:2 10384:2 3970:2 13506:2 9720:2 8981:2 9296:1 10276:1 15098:1 6631:1 383:1 6510:1 13304:1 9646:1 8233:1 1080:1 8537:1 12129:1 10711:1 14569:1 2969:1 1215:1 12435:1 7689:1 12626:1 14609:1 13474:1 4488:1 103:1 621:1 12430:1 617:1 514:1 11673:1 215:1 8817:1 10968:1 4717:1 1807:1 5737:1 3156:1 14320:1 13457:1 12411:1 9596:1 15028:1 10531:1 4301:1 4799:1 6013:1 7619:1 6717:1 9344:1 1817:1 15868:1 11307:1 9632:1 6945:1 9916:1 11899:1 883:1 11696:1 14503:1 316:1 4012:1 9994:1 8501:1 1847:1 12534:1 14966:1 11800:1 8093:1 13403:1 7309:1 5957:1 6538:1 2535:1 7042:1 13792:1 15001:1 4894:1 4921:1 13739:1 15875:1 15802:1 14253:1 10376:1 974:1 1882:1 2397:1 8105:1 4725:1 7707:1 7506:1 9749:1 8640:1 12566:1
The name of my input data file is a1a.
I tried to run the program from the Windows command prompt as
svm-train a1a
I get the following error
Wrong input format at line 1
Could somebody help me out here? I can't seem to figure out what's wrong.
Thanks.

The feature numbers (14253, 10376, etc) have to be listed in increasing order. Once you do that, svm-train will take that data. So, for example, your file needs to begin:
+1 103:1 208:2 215:1 232:3 245:2 254:8 262:2 274:24 280:4 316:1 317:12 322:2 328:4 372:2 383:1 475:40 482:4 486:8 501:16 502:3 514:1 515:29 617:1 621:1 861:2 883:1 911:15 974:1 1080:1 1120:8 1210:3 1215:1 1440:2 1763:8 1807:1 1817:1 1827:7 1847:1 1882:1 2397:1 2456:4 2535:1 2690:12 2720:2 2725:2 2773:4 2830:2 2969:1 3086:2 3156:1 3221:2 3464:5 3612:2 3970:2 3997:2 4012:1 4068:6 4301:1 4488:1 4521:4 4656:15 4698:11 4717:1 4725:1 4799:1 4819:20 4894:1 4921:1 5231:30 5395:8 5507:2 5514:4 5703:2 5710:2 5729:4 5737:1 5799:2 5875:14 5957:1 6013:1 6238:4 6279:45 6319:4 6383:4 6423:2 6494:4 6510:1 6538:1 6631:1 6717:1 6741:4 6945:1 6994:2 7042:1 7309:1 7334:13 7342:20 7506:1 7529:28 7598:18 7619:1 7654:3 7689:1 7707:1 7793:2 7900:8 8093:1 8105:1 8233:1 8295:12 8334:4 8408:5 8412:3 8432:5 8501:1 8503:2 8537:1 8547:2 8640:1 8715:3 8771:12 8802:3 8817:1 8853:17 8977:2 8981:2 9149:4 9163:7 9219:7 9296:1 9314:12 9344:1 9596:1 9632:1 9646:1 9720:2 9749:1 9780:2 9845:2 9851:2 9896:2 9916:1 9925:6 9932:2 9994:1 10019:8 10064:5 10069:9 10276:1 10376:1 10384:2 10531:1 10592:3 10641:12 10646:2 10711:1 10725:13 10798:3 10873:3 10968:1 11051:3 11134:16 11183:7 11307:1 11478:3 11519:10 11623:24 11673:1 11696:1 11761:4 11800:1 11817:4 11848:3 11899:1 11901:7 11902:2 12129:1 12269:3 12411:1 12430:1 12435:1 12534:1 12566:1 12626:1 12687:3 12911:3 13229:2 13304:1 13350:2 13403:1 13429:2 13457:1 13474:1 13506:2 13739:1 13762:13 13792:1 13973:2 14253:1 14320:1 14459:2 14503:1 14569:1 14609:1 14628:4 14894:2 14949:2 14966:1 15001:1 15017:8 15028:1 15098:1 15348:5 15431:21 15592:6 15752:47 15802:1 15868:1 15875:1 15876:4 16046:2
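If the file is large, reordering every line by hand is not practical. Here is a minimal sketch of how the reordering could be scripted; the function name sort_libsvm_line is made up for illustration, and it assumes each line is a label followed by whitespace-separated index:value pairs.

```python
def sort_libsvm_line(line):
    """Return the line with its index:value pairs in ascending index order."""
    tokens = line.split()
    if not tokens:
        return line  # leave blank lines untouched
    label, pairs = tokens[0], tokens[1:]
    # Sort numerically on the feature index (the part before the colon).
    pairs.sort(key=lambda pair: int(pair.split(":", 1)[0]))
    return " ".join([label] + pairs)


# A shortened version of the first line from the question.
sample = "+1 15752:47 6279:45 475:40 5231:30"
print(sort_libsvm_line(sample))
```

Applying this to every line of a1a and writing the result to a new file should give svm-train input with ascending indices.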


dask histogram from zarr file (a big zarr file)

So here's my question: I have a big 3-dimensional array stored as a Zarr file that is 100GB in size (uncompressed, the array is more than twice that size). I have tried using the histogram function from Dask to calculate it, but I get an error saying that it can't do it because the file has tuples within tuples. I'm guessing that's the Zarr file format rather than anything else?
Any thoughts?
edit:
Yes, the bigger computer thing wouldn't actually work...
I'm running a Dask client on a single machine; it runs the calculation but just gets stuck somewhere.
I just tried the dask.map function across the file, but when I plot it out I get something like this:
ValueError: setting an array element with a sequence.
Here's a version of the script:
def histo(img):
    return da.histogram(img, bins=255, range=[0, 255])

histo_1 = da.map_blocks(histo, fimg)
I am actually going to try to use it outside of the map function. I wonder whether the windowing from map_blocks, rather than the map function itself, actually causes the issue. Well, I'll let you know if it does or not...
edit 2
So I tried to remove the map blocks function as suggested and this was my result:
[in] h, bins =da.histogram(fused_crop, bins=255, range=[0, 255])
[in] bins
[out] array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.,
11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21.,
22., 23., 24., 25., 26., 27., 28., 29., 30., 31., 32.,
33., 34., 35., 36., 37., 38., 39., 40., 41., 42., 43.,
44., 45., 46., 47., 48., 49., 50., 51., 52., 53., 54.,
55., 56., 57., 58., 59., 60., 61., 62., 63., 64., 65.,
66., 67., 68., 69., 70., 71., 72., 73., 74., 75., 76.,
77., 78., 79., 80., 81., 82., 83., 84., 85., 86., 87.,
88., 89., 90., 91., 92., 93., 94., 95., 96., 97., 98.,
99., 100., 101., 102., 103., 104., 105., 106., 107., 108., 109.,
110., 111., 112., 113., 114., 115., 116., 117., 118., 119., 120.,
121., 122., 123., 124., 125., 126., 127., 128., 129., 130., 131.,
132., 133., 134., 135., 136., 137., 138., 139., 140., 141., 142.,
143., 144., 145., 146., 147., 148., 149., 150., 151., 152., 153.,
154., 155., 156., 157., 158., 159., 160., 161., 162., 163., 164.,
165., 166., 167., 168., 169., 170., 171., 172., 173., 174., 175.,
176., 177., 178., 179., 180., 181., 182., 183., 184., 185., 186.,
187., 188., 189., 190., 191., 192., 193., 194., 195., 196., 197.,
198., 199., 200., 201., 202., 203., 204., 205., 206., 207., 208.,
209., 210., 211., 212., 213., 214., 215., 216., 217., 218., 219.,
220., 221., 222., 223., 224., 225., 226., 227., 228., 229., 230.,
231., 232., 233., 234., 235., 236., 237., 238., 239., 240., 241.,
242., 243., 244., 245., 246., 247., 248., 249., 250., 251., 252.,
253., 254., 255.])
[in] h.compute
[out] <bound method DaskMethodsMixin.compute of dask.array<sum-aggregate, shape=(255,), dtype=int64, chunksize=(255,), chunktype=numpy.ndarray>>
I'm going to try in another notebook and see if it still occurs.
edit 3
It's the strangest thing, but if I just declare the variable h, it comes out as one small element from the dask array?
edit
Strange: if I call the xarray.hist or the da.hist function, they both fall over. If I use skimage.exposure.histogram it works, but it appears that the Zarr file is unpacked before the histogram is calculated, which is a bit of a problem...
Update 7th June 2020 (with a solution for not big but annoyingly medium data) see below for answer.
You probably want to use dask's function for this rather than map_blocks. For the latter, Dask expects the output of each call to be the same size as the input block, or a shape derived from the input block, instead of the one-dimensional fixed-size output of histogram.
h, bins =da.histogram(fused_crop, bins=255, range=[0, 255])
h.compute()
Update 7th June 2020 (with a solution for not big but annoyingly medium data):
Unfortunately I got a bit ill around this time and it took a while for me to feel better. Then the pandemic happened and I was on full childcare duty. I tried lots of different options, and what it ultimately came down to was the following:
1) If just using x.compute, the memory would very quickly fill up.
2) Using distributed would fill the hard drive with spill-to-disk data and take hours, but would then hang and crash without producing anything. I'm guessing here (based on the graph and the Dask API), but it seemed to create a sub-histogram array for every chunk, all of which would need to be merged at some point.
3) The chunking of my data was suboptimal, so the number of tasks was massive, but even after I improved the chunking I couldn't compute a histogram.
In the end I looked for a dynamic way of updating the histogram data. So I used Zarr to do it, by computing to it, since it allows concurrent reads and writes. As a reminder: my data is a Zarr array in 3 dimensions (x, y, z), 300GB uncompressed but about 100GB compressed. On my 4-year-old laptop with 16GB of RAM, the following worked (I should have said my data was 16-bit unsigned):
import dask.array as da

imgs = da.from_zarr(.....)
imgs2 = imgs.rechunk((a,b,c)) ## individual chunk size per dimension
h, bins = da.histogram(imgs2, bins = 255, range=[0, 65535]) # 255 bins over the 16-bit range
h_out = da.to_zarr(h, "histogram.zarr")
I ran the progress bar alongside the process, and getting a histogram from the file took:
[########################################] | 100% Completed | 18min 47.3s
Which I don't think is too bad for a 300GB array. Hopefully this helps someone else as well; thanks for the help earlier in the year, @mdurant.
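For anyone wondering why per-chunk sub-histograms can be merged at all: as long as every chunk uses the same bin edges, the counts simply add up. The following is a plain-Python sketch of that idea only. It is not Dask's implementation, and the names chunk_histogram and merge_histograms are made up for illustration.

```python
def chunk_histogram(values, bins, lo, hi):
    """Count how many values fall into each of `bins` equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v <= hi:
            # A value exactly on the top edge is counted in the last bin.
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
    return counts


def merge_histograms(partials):
    """Add per-chunk counts element-wise to get the histogram of all chunks."""
    return [sum(column) for column in zip(*partials)]


# Three small "chunks" standing in for blocks of a large array; 300 is out of range.
chunks = [[0, 10, 200], [255, 128, 10], [64, 64, 300]]
partials = [chunk_histogram(c, bins=4, lo=0, hi=256) for c in chunks]
total = merge_histograms(partials)
print(total)
```

Because only one small count list per chunk has to be kept, the full array never needs to be in memory at once.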

error in writing to a file

I have written a Python script that calls Unix sort using the subprocess module. I am trying to sort a table based on two columns (2 and 6). Here is what I have done:
sort_bt=open("sort_blast.txt",'w+')
sort_file_cmd="sort -k2,2 -k6,6n {0}".format(tab.name)
subprocess.call(sort_file_cmd,stdout=sort_bt,shell=True)
The output file, however, contains an incomplete line, which produces an error when I parse the table. When I checked the entry in the input file given to sort, the line looks perfect. I guess there is some problem when sort tries to write the result to the specified file, but I am not sure how to solve it.
The line looks like this in the input file
gi|191252805|ref|NM_001128633.1| Homo sapiens RIMS binding protein 3C (RIMBP3C), mRNA gnl|BL_ORD_ID|4614 gi|124487059|ref|NP_001074857.1| RIMS-binding protein 2 [Mus musculus] 103 2877 3176 846 941 1.0102e-07 138.0
In output file however only gi|19125 is printed. How do I solve this?
Any help will be appreciated.
Ram
Using subprocess to call an external sorting tool seems quite silly considering that python has a built in method for sorting items.
Looking at your sample data, it appears to be structured data, with a | delimiter. Here's how you could open that file, and iterate over the results in python in a sorted manner:
from functools import cmp_to_key

def custom_sorter(first, second):
    """ A Custom Sort function which compares items
    based on the value in the 2nd and 6th columns. """
    # First, we break the line into a list
    first_items, second_items = first.split(u'|'), second.split(u'|') # Split on the pipe character.
    if len(first_items) >= 6 and len(second_items) >= 6:
        # We have enough items to compare
        if (first_items[1], first_items[5]) > (second_items[1], second_items[5]):
            return 1
        elif (first_items[1], first_items[5]) < (second_items[1], second_items[5]):
            return -1
        else: # They are the same
            return 0 # Order doesn't matter then
    else:
        return 0

with open(src_file_path, 'r') as src_file:
    data = src_file.read() # Read in the src file all at once. Hope the file isn't too big!

with open(dst_sorted_file_path, 'w+') as dst_sorted_file:
    # sorted() has no cmp argument in Python 3, so wrap the comparator with cmp_to_key
    for line in sorted(data.splitlines(), key=cmp_to_key(custom_sorter)): # Sort the data on the fly
        dst_sorted_file.write(line + '\n') # splitlines() strips the newline, so add it back
FYI, this code may need some jiggling. I didn't test it too well.
What you see is probably the result of trying to write to the file from multiple processes simultaneously.
To emulate: sort -k2,2 -k6,6n ${tabname} > sort_blast.txt command in Python:
from subprocess import check_call

with open("sort_blast.txt",'wb') as output_file:
    check_call("sort -k2,2 -k6,6n".split() + [tab.name], stdout=output_file)
You can write it in pure Python e.g., for a small input file:
def custom_key(line):
    fields = line.split() # split line on any whitespace
    return fields[1], float(fields[5]) # Python uses zero-based indexing

with open(tab.name) as input_file, open("sort_blast.txt", 'w') as output_file:
    L = input_file.read().splitlines() # read from the input file
    L.sort(key=custom_key) # sort it
    output_file.write("\n".join(L)) # write to the output file
If you need to sort a file that does not fit in memory, see "Sorting text file by using Python".

alphanumeric sort in VIM

Suppose I have a list in a text file which is as follows -
TaskB_115
TaskB_19
TaskB_105
TaskB_13
TaskB_10
TaskB_0_A_1
TaskB_17
TaskB_114
TaskB_110
TaskB_0_A_5
TaskB_16
TaskB_12
TaskB_113
TaskB_15
TaskB_103
TaskB_2
TaskB_18
TaskB_106
TaskB_11
TaskB_14
TaskB_104
TaskB_112
TaskB_107
TaskB_0_A_4
TaskB_102
TaskB_100
TaskB_109
TaskB_101
TaskB_0_A_2
TaskB_0_A_3
TaskB_116
TaskB_1_A_0
TaskB_111
TaskB_108
If I sort in vim with command %sort, it gives me output as -
TaskB_0_A_1
TaskB_0_A_2
TaskB_0_A_3
TaskB_0_A_4
TaskB_0_A_5
TaskB_10
TaskB_100
TaskB_101
TaskB_102
TaskB_103
TaskB_104
TaskB_105
TaskB_106
TaskB_107
TaskB_108
TaskB_109
TaskB_11
TaskB_110
TaskB_111
TaskB_112
TaskB_113
TaskB_114
TaskB_115
TaskB_116
TaskB_12
TaskB_13
TaskB_14
TaskB_15
TaskB_16
TaskB_17
TaskB_18
TaskB_19
TaskB_1_A_0
TaskB_2
But I would like to have the output as follows -
TaskB_0_A_1
TaskB_0_A_2
TaskB_0_A_3
TaskB_0_A_4
TaskB_0_A_5
TaskB_1_A_0
TaskB_2
TaskB_10
TaskB_11
TaskB_12
TaskB_13
TaskB_14
TaskB_15
TaskB_16
TaskB_17
TaskB_18
TaskB_19
TaskB_100
TaskB_101
TaskB_102
TaskB_103
TaskB_104
TaskB_105
TaskB_106
TaskB_107
TaskB_108
TaskB_109
TaskB_110
TaskB_111
TaskB_112
TaskB_113
TaskB_114
TaskB_115
TaskB_116
Note I just wrote this list to demonstrate the problem. I could generate the list in VIM. But I want to do it for other things as well in VIM.
With [n] sorting is done on the first decimal number
in the line (after or inside a {pattern} match).
One leading '-' is included in the number.
try this command:
sor n
and you don't need the %, sort sorts all lines if no range was given.
EDIT
as commented by OP, if you have:
TaskB_0_A_1
TaskB_0_A_2
TaskB_0_A_4
TaskB_0_A_3
TaskB_0_A_5
TaskB_1_A_0
you could try:
sor n /.*_\ze\d*/
or
sor nr /\d*$/
EDIT2
for newly edited question, this line may give you expected output based on your example data:
sor nr /\d*$/|sor n
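For comparison, outside of Vim the same ordering is usually called a "natural sort". Below is a small Python sketch of the idea (the helper name natural_key is made up for illustration): split each line into digit and non-digit runs and compare the digit runs as numbers.

```python
import re


def natural_key(text):
    """Split text into digit and non-digit runs so numbers compare numerically."""
    parts = re.split(r"(\d+)", text)
    # With a capturing group, odd positions hold the digit runs; convert those to int.
    return [int(part) if i % 2 else part for i, part in enumerate(parts)]


names = ["TaskB_10", "TaskB_2", "TaskB_1_A_0", "TaskB_100", "TaskB_0_A_1"]
ordered = sorted(names, key=natural_key)
print("\n".join(ordered))
```

On this subset of the example list, the key puts TaskB_0_A_1 first, then TaskB_1_A_0, TaskB_2, TaskB_10 and TaskB_100, which is the order asked for in the question.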

how to replace last comma in a line with a string in unix

I am trying to insert a string in every line except the first and last lines of a file, but I am not able to get it done. Can anyone give me a clue how to achieve this? Thanks in advance.
How do I replace the last comma in a line with a string xxxxx (except for the first and last rows) using Unix tools?
Original File
00,SRI,BOM,FF,000004,20120808030100,20120907094412,"GTEXPR","SRIVIM","8894-7577","SRIVIM#GTEXPR."
10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,02BLYPO
10,SRI,FF,NMNN,3112,NMNSME,U,NM,TCM,231040,231100,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UPW,231240,231300,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UFG,231700,231900,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,FTG,232140,232200,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BOR,232340,232400,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BAY,232640,232700,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,CCL,,101400,02CHLSU
10,SRI,FF,BUN,0800,NMJWJB,U,NM,PAR,101540,101700,01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,MCE,101840,101900,01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,SSS,102140,102200,09
10,SRI,FF,BUN,0800,NMJWJB,U,NM,FSS,102600,,01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,CCL,,103700,01CHLSU
10,SRI,FF,BUN,0802,NMJWJB,U,NM,PAR,103940,104000,01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,MCE,104140,104200,01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,SSS,104440,104500,09
10,SRI,FF,BUN,0802,NMJWJB,U,NM,FSS,105000,,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,GEB,,230900,02BLYSU
10,SRI,FF,BUN,3112,NMNSME,U,NM,TCM,231040,231100,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UPW,231240,231300,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UFG,231700,231900,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,FTG,232140,232200,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BOR,232340,232400,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BAY,232640,232700,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,RWD,233400,,01
99,SRI,FF,28
Expected File
00,SRI,BOM,FF,000004,20120808030100,20120907094412,"GTEXPR","SRIVIM","8894-7577","SRIVIM#GTEXPR."
10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYPO
10,SRI,FF,NMNN,3112,NMNSME,U,NM,TCM,231040,xxxxx231100,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UPW,231240,xxxxx231300,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UFG,231700,xxxxx231900,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,FTG,232140,xxxxx232200,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BOR,232340,xxxxx232400,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BAY,232640,xxxxx232700,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,CCL,,101400,xxxxx02CHLSU
10,SRI,FF,BUN,0800,NMJWJB,U,NM,PAR,101540,101700,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,MCE,101840,101900,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,SSS,102140,102200,xxxxx09
10,SRI,FF,BUN,0800,NMJWJB,U,NM,FSS,102600,,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,CCL,,103700,xxxxx01CHLSU
10,SRI,FF,BUN,0802,NMJWJB,U,NM,PAR,103940,104000,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,MCE,104140,104200,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,SSS,104440,104500,xxxxx09
10,SRI,FF,BUN,0802,NMJWJB,U,NM,FSS,105000,,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYSU
10,SRI,FF,BUN,3112,NMNSME,U,NM,TCM,231040,231100,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UPW,231240,231300,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UFG,231700,231900,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,FTG,232140,232200,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BOR,232340,232400,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BAY,232640,232700,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
99,SRI,FF,28
awk can be quite useful for manipulating data files like this one. Here's a one-liner that does more-or-less what you want. It prepends the string "xxxxx" to the twelfth field of each input line that has at least twelve fields.
$ awk 'BEGIN{FS=OFS=","}NF>11{$12="xxxxx"$12}{print}' 16006747.txt
00,SRI,BOM,FF,000004,20120808030100,20120907094412,"GTEXPR","SRIVIM","8894-7577","SRIVIM#GTEXPR."
10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYPO
10,SRI,FF,NMNN,3112,NMNSME,U,NM,TCM,231040,231100,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UPW,231240,231300,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UFG,231700,231900,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,FTG,232140,232200,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BOR,232340,232400,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BAY,232640,232700,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,CCL,,101400,xxxxx02CHLSU
10,SRI,FF,BUN,0800,NMJWJB,U,NM,PAR,101540,101700,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,MCE,101840,101900,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,SSS,102140,102200,xxxxx09
10,SRI,FF,BUN,0800,NMJWJB,U,NM,FSS,102600,,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,CCL,,103700,xxxxx01CHLSU
10,SRI,FF,BUN,0802,NMJWJB,U,NM,PAR,103940,104000,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,MCE,104140,104200,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,SSS,104440,104500,xxxxx09
10,SRI,FF,BUN,0802,NMJWJB,U,NM,FSS,105000,,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYSU
10,SRI,FF,BUN,3112,NMNSME,U,NM,TCM,231040,231100,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UPW,231240,231300,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UFG,231700,231900,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,FTG,232140,232200,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BOR,232340,232400,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BAY,232640,232700,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
99,SRI,FF,28
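If the number of fields is not fixed, an alternative is to work from the last comma instead of the twelfth field. This is a rough Python sketch under that assumption; the function name tag_last_field and its marker parameter are made up for illustration. It skips the first and last lines by position rather than by field count.

```python
def tag_last_field(lines, marker="xxxxx"):
    """Insert `marker` after the last comma of every line except the first and last."""
    result = []
    last_index = len(lines) - 1
    for i, line in enumerate(lines):
        if 0 < i < last_index and "," in line:
            # rpartition splits on the final comma only.
            head, comma, tail = line.rpartition(",")
            line = head + comma + marker + tail
        result.append(line)
    return result


rows = [
    "00,SRI,BOM,FF,000004",
    "10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,02BLYPO",
    "10,SRI,FF,BUN,0802,NMJWJB,U,NM,FSS,105000,,01",
    "99,SRI,FF,28",
]
print("\n".join(tag_last_field(rows)))
```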

text processing for IPv4 dotted decimal notation conversion to /8 or /16 format

I have an input file that contains a list of IP addresses and the ip_counts (some parameter that I use internally). The file looks somewhat like this:
202.124.127.26 2135869
202.124.127.25 2111217
202.124.127.17 2058082
202.124.127.16 2014958
202.124.127.20 1949323
202.124.127.24 1933773
202.124.127.27 1932076
202.124.127.22 1886466
202.124.127.18 1882955
202.124.127.21 1803528
202.124.127.23 1786348
119.224.129.200 1776592
119.224.129.211 1639325
202.124.127.19 1479198
119.224.129.201 1145426
202.49.175.110 1133354
119.224.129.210 1119525
68.232.45.132 1085491
119.224.129.209 1015078
131.203.3.8 857951
202.162.73.4 817197
207.123.58.125 785326
202.7.6.18 762603
117.121.253.254 718022
74.125.237.120 710448
68.232.44.219 693002
202.162.73.2 671559
205.128.75.126 611301
119.161.91.17 604393
119.224.129.202 559930
8.27.241.126 528862
74.125.237.152 517516
8.254.9.254 514341
As you can see, the IP addresses themselves are unsorted. So I use the sort command on the file to sort the IP addresses, as below:
cat address_count.txt | sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n > sorted_address.txt
This gives me an output with the IP addresses in sorted order. Partial output of that file is shown below.
4.23.63.126 15731
4.26.254.254 320705
4.27.8.254 25174
8.12.129.50 176141
8.12.223.125 11800
8.19.32.65 15854
8.19.240.53 11013
8.19.240.70 11915
8.19.240.72 31541
8.19.240.73 23304
8.20.213.28 96434
8.20.213.32 108191
8.20.213.34 170058
8.20.213.39 23512
8.20.213.41 10420
8.20.213.61 24809
8.26.195.253 28568
8.27.152.253 104446
8.27.233.125 115856
8.27.235.126 16102
8.27.235.254 25628
8.27.238.254 108485
8.27.240.125 169262
8.27.241.126 528862
8.27.241.252 197302
8.27.248.125 14926
8.254.9.254 514341
12.129.210.71 89663
15.192.45.21 20139
15.192.45.26 35265
15.193.0.148 10313
15.193.113.29 40318
15.201.49.136 14243
15.240.238.52 57163
17.250.248.95 28166
23.33.125.13 19179
23.33.125.37 17953
31.151.163.60 72709
38.99.42.37 192356
38.99.68.180 41251
38.99.68.181 10272
38.104.237.74 74012
38.108.112.103 37034
38.108.112.115 69698
38.108.112.121 92173
38.108.112.122 99230
38.112.63.238 39958
38.119.130.62 42159
46.4.28.22 19769
Now I want to parse the file given above and convert it to aaa.bbb.ccc.0/8 format and aaa.bbb.0.0/16 format, and I also want to count the number of occurrences of the IPs in each subnet. I want to do this using bash. I am open to using sed or awk. How do I achieve this?
For example
8.19.240.53 11013
8.19.240.70 11915
8.19.240.72 31541
8.19.240.73 23304
8.20.213.28 96434
8.20.213.32 108191
8.20.213.34 170058
8.20.213.39 23512
8.20.213.41 10420
8.20.213.61 24809
The above input portion should produce 8.19.240.0/8 and 8.20.213.0/8, and similarly for /16 domains. I also want to count the occurrences of machines in the subnet.
For example, in the above output this subnet should have the count 4 in the next column beside it. It should also add up the already displayed counts, i.e. (11013 + 11915 + 31541 + 23304), in another column.
8.19.240.0/8 4 (11013 + 11915 + 31541 + 23304)
8.20.213.0/8 6 (96434 + 108191 + 170058 + 23512 + 10420 + 24809)
It would be great if someone could suggest some way to achieve this.
The main problem here is that without having the routing table from the individual moments the packets arrived, you have no idea what netblock they were originally in. Sure, you can put them in the classful blocks they would be in, in a classful routing situation, but all that will give you is a nice presentation (and, admittedly, a shorter file).
Furthermore, your example looks a bit broken. You have a bunch of IP addresses in 8.0.0.0/8 and you are aggregating them into what looks like /24 routes and presenting them with a /8 at the end.
Nonetheless, in awk you can use sub() to do text replacement (or you can use index to find occurrences of ., or you can use split to split at dots). It should be relatively easy to go from that to "drop the last octet, append "0/24" and use that as a key to update an IP-count and a hit-count dictionary, then drop the last two octets and the slash, replace them with "0.0/16" and do the same" (all arrays in awk are associative arrays, so essentially dicts). There is no need to sort in advance: when you loop through the result, you'll get the keys in a random order, but on average there will be fewer of them, so sorting afterwards will be cheaper.
I seem to not have an awk at hand, so I cannot give you a code example.
This might work for you:
awk '{a=$1;sub(/\.[^.]*$/,"",a);ac[a]++;at[a]+=$2};END{for(x in ac)print x".0/8",ac[x],at[x]}' file
This prints the "0/8" addresses; to get the "0/16" ones, duplicate the code, i.e. b=a;sub(/\.[^.]*$/,"",b);ba[b]++ etc.
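If awk is not a hard requirement, the same grouping can be sketched in Python with the standard ipaddress module. This is only an illustration: the function name aggregate is made up, and it labels the groups with their real prefix lengths (/24 and /16) rather than the "/8" suffix used in the question, as the other answer points out.

```python
import ipaddress
from collections import defaultdict


def aggregate(lines, prefix_len):
    """Group 'address count' lines by network; return {network: [hosts, total]}."""
    stats = defaultdict(lambda: [0, 0])
    for line in lines:
        if not line.strip():
            continue
        address, count = line.split()
        # strict=False masks off the host bits, e.g. 8.19.240.53/24 -> 8.19.240.0/24
        network = ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)
        stats[str(network)][0] += 1
        stats[str(network)][1] += int(count)
    return dict(stats)


sample = [
    "8.19.240.53 11013",
    "8.19.240.70 11915",
    "8.19.240.72 31541",
    "8.19.240.73 23304",
    "8.20.213.28 96434",
    "8.20.213.32 108191",
]
for prefix_len in (24, 16):
    for network, (hosts, total) in sorted(aggregate(sample, prefix_len).items()):
        print(network, hosts, total)
```

Like the awk version, this does not need the input to be sorted first.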
