How to substitute the first line and split the first string on the following line: - bash

I am just new to scripting and I need some help. I have something like a bazillion files that look like this.
Assign F2 Height
3IleN 2.34025e+07
4PheN 2.05028e+07
6LysN 1.43672e+07
7ThrN 1.49120e+07
8LeuN 1.30838e+07
9ThrN 1.44298e+07
And i want it to look like this + save it in another file with the same name as the previous file however, with a "MOD" written at the beginning.
Number AA Height
3 IleN 6.20756e+07
4 PheN 5.26499e+07
7 ThrN 3.00216e+07
8 LeuN 3.26377e+07
9 ThrN 4.03901e+07
10 GlyN 2.73659e+07
12 ThrN 3.16319e+07
13 IleN 5.94604e+07
If you could please describe and explain the parameters used, that would be of great help.
Thanks!

The following should work for you:
sed 's/^\([0-9]*\)/\1 /' filename

Related

python regex specific blocks of text from large text file

I'm new to python and this site so thank-you in advance for your... understanding. This is my first attempt at a python script.
I'm having what I think is a performance issue trying to solve this problem which is causing me to not get any data back.
This code works on a small text file of a couple pages but when I try to use it on my 35MB real data text file it just hits the CPU and hasn't returned any data (>24 hours now).
Here's a snippet of the real data from the 35MB text file:
D)dddld
d00d90d
dd
ddd
vsddfgsdfgsf
dfsdfdsf
aAAAAAa
221546
29806916295
Meowing
fs:/mod/umbapp/umb/sentbox/221546.pdu
2013:10:4:22:11:31:4
sadfsdfsdf
sdfff
ff
f
29806916295
What's your cat doing?
fs:/mod/umbapp/umb/sentbox/10955.pdu
2013:10:4:22:10:15:4
aaa
aaa
aaaaa
What I'm trying to copy into a new file:
29806916295
Meowing
fs:/mod/umbapp/umb/sentbox/221546.pdu
2013:10:4:22:11:31:4
29806916295
What's your cat doing?
fs:/mod/umbapp/umb/sentbox/10955.pdu
2013:10:4:22:10:15:4
My Python code is:
import re
with open('testdata.txt') as myfile:
content = myfile.read()
text = re.search(r'\d{11}.*\n.*\n.*(\d{4})\D+(\d{2})\D+(\d{1})\D+(\d{2})\D+(\d{2})\D+\d{2}\D+\d{1}', content, re.DOTALL).group()
with open("result.txt", "w") as myfile2:
myfile2.write(text)
Regex isn't the fastest way to search a string. You also compounded the problem by having a very big string (35MB). Reading an entire file into memory is generally not recommended because you may run into memory issues.
Judging from your regex pattern, it seems like you want to capture 4-line groups that start with an 11-digit string and end with some time-line string. Try this code:
import re
start_pattern = re.compile(r'^\d{11}$')
end_pattern = re.compile(r'^\d{4}\D+\d{2}\D+\d{1}\D+\d{2}\D+\d{2}\D+\d{2}\D+\d{1}$')
capturing = 0
capture = ''
with open('output.txt', 'w') as output_file:
with open('input.txt', 'r') as input_file:
for line in input_file:
if capturing > 0 and capturing <= 4:
capturing += 1
capture += line
elif start_pattern.match(line):
capturing = 1
capture = line
if capturing == 4:
if end_pattern.match(line):
output_file.write(capture + '\n')
else:
capturing = 0
It iterates over the input file, line by line. If it finds a line matching the start_pattern, it will read in 3 more. If the 4th line matches the end_pattern, it will write the whole group to the output file.

Extracting plain text output from binary file

I am working with Graphchi's pagerank example: https://github.com/GraphChi/graphchi-cpp/wiki/Example-Apps#pagerank-easy
The example app writes a binary file with vertex information that I would like to read/convert to a plan text file (to later call into R or some other language).
The documentation states that:
"GraphChi will write the values of the edges in a binary file, which is easy to handle in other programs. Name of the file containing vertex values is GRAPH-NAME.4B.vout. Here "4B" refers to the vertex-value being a 4-byte type (float)."
The 'easy to handle' part is what I'm struggling with - I have experience with high level languages but not C++ or dealing with binary files. I have found a few things through searching stackoverflow but no luck yet in reading this file. Ideally this would be done through bash or python.
thanks very much for your help on this.
Update: hexdump graph-name.4B.vout | head -5 gives:
0000000 999a 3e19 7468 3e7f 7d2a 3e93 d8e0 3ec4
0000010 cec6 3fe4 d551 3f08 eff2 3e54 999a 3e19
0000020 999a 3e19 3690 3e8c 0080 3f38 9ea3 3ef5
0000030 b7d6 3f66 999a 3e19 10e3 3ee1 400c 400d
0000040 a3df 3e7c 999a 3e19 979c 3e91 5230 3f18
Here is example code how you can use GraphCHi to write the output out as a string:
https://github.com/GraphChi/graphchi-cpp/wiki/Vertex-Aggregators
But the array is simple byte array. Here is example how to read it in python:
import struct
from array import array as binarray
import sys
inputfile = sys.argv[1]
data = open(inputfile).read()
a = binarray('c')
a.fromstring(data)
s = struct.Struct("f")
l = len(a)
print "%d bytes" %l
n = l / 4
for i in xrange(0, n):
x = s.unpack_from(a, i * 4)[0]
print ("%d %f" % (i, x))
I was having the same trouble. Luckily I work with a bunch of network engineers who helped me out! On Mac Linux, the following command works to print the 4B.vout data one line per node, with the integer values the same as is given in the summary file. If your file is called eg, filename.4B.vout, then some command line perl gets you:
cat filename.4B.vout | LANG= perl -0777 -e '$,=\"\n\"; print unpack(\"L*\",<>),\"\";'
Edited to add: this is for the assignments of connected component ID and community ID, written implicitly the 1st line is the ID of the node labeled 0, the 2nd line is the node labeled 1 etc. But I am copypasting here so I'm not sure how it would need to change for floats. It works great for the integer values per node.

Using regex to find something between string and switch them

I have the following pattern ##/##/#### within a string "##/##/#### ##:## ## ###". As an example "11/22/3333 11:22 AM EST" would like to switch the 11 and 22 to result in 22/11/3333. I am new to understanding regex. Thank you.
You could do with:
'11/22/3333'.gsub(%r{(.*)/(.*)/(.*)}, '\2/\1/\3')
Something like this?
input_string="11/22/3333"
output_array=input_string.match(/(\d{2})\/(\d{2})\/(\d{4})/)
p "#{output_array[2]}/#{output_array[1]#{output_array[3]}}"

text processing for IPv4 dotted decimal notation conversion to /8 or /16 format

I have an input file that contains a list of ip addresses and the ip_counts(some parameter that I use internally.)The file looks somewhat like this.
202.124.127.26 2135869
202.124.127.25 2111217
202.124.127.17 2058082
202.124.127.16 2014958
202.124.127.20 1949323
202.124.127.24 1933773
202.124.127.27 1932076
202.124.127.22 1886466
202.124.127.18 1882955
202.124.127.21 1803528
202.124.127.23 1786348
119.224.129.200 1776592
119.224.129.211 1639325
202.124.127.19 1479198
119.224.129.201 1145426
202.49.175.110 1133354
119.224.129.210 1119525
68.232.45.132 1085491
119.224.129.209 1015078
131.203.3.8 857951
202.162.73.4 817197
207.123.58.125 785326
202.7.6.18 762603
117.121.253.254 718022
74.125.237.120 710448
68.232.44.219 693002
202.162.73.2 671559
205.128.75.126 611301
119.161.91.17 604393
119.224.129.202 559930
8.27.241.126 528862
74.125.237.152 517516
8.254.9.254 514341
As you can see the ip addresses themselves are unsorted.So I use the sort command on the file to sort the ip addresses as below
cat address_count.txt | sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n > sorted_address.txt
Which gives me an output with ip addresses in the sorted order.The partial output of that file is shown below.
4.23.63.126 15731
4.26.254.254 320705
4.27.8.254 25174
8.12.129.50 176141
8.12.223.125 11800
8.19.32.65 15854
8.19.240.53 11013
8.19.240.70 11915
8.19.240.72 31541
8.19.240.73 23304
8.20.213.28 96434
8.20.213.32 108191
8.20.213.34 170058
8.20.213.39 23512
8.20.213.41 10420
8.20.213.61 24809
8.26.195.253 28568
8.27.152.253 104446
8.27.233.125 115856
8.27.235.126 16102
8.27.235.254 25628
8.27.238.254 108485
8.27.240.125 169262
8.27.241.126 528862
8.27.241.252 197302
8.27.248.125 14926
8.254.9.254 514341
12.129.210.71 89663
15.192.45.21 20139
15.192.45.26 35265
15.193.0.148 10313
15.193.113.29 40318
15.201.49.136 14243
15.240.238.52 57163
17.250.248.95 28166
23.33.125.13 19179
23.33.125.37 17953
31.151.163.60 72709
38.99.42.37 192356
38.99.68.180 41251
38.99.68.181 10272
38.104.237.74 74012
38.108.112.103 37034
38.108.112.115 69698
38.108.112.121 92173
38.108.112.122 99230
38.112.63.238 39958
38.119.130.62 42159
46.4.28.22 19769
Now I want to parse the file given above and convert it to aaa.bbb.ccc.0/8 format and
aaa.bbb.0.0/16 format and I also want to count the number of occurences of the ip's in each subnet.I want to do this using bash.I am open to using sed or awk.How do I achieve this.
For example
8.19.240.53 11013
8.19.240.70 11915
8.19.240.72 31541
8.19.240.73 23304
8.20.213.28 96434
8.20.213.32 108191
8.20.213.34 170058
8.20.213.39 23512
8.20.213.41 10420
8.20.213.61 24809
The about input portion should produce 8.19.240.0/8 and 8.20.213.0/8 and similarly for /16 domains.I also want to count the occurences of machines in the subnet.
For example In the above output this subnet should have the count 4 in the next column beside it.It should also add the already displayed count.i.e (11013 + 11915 + 31541 + 23304) in another column.
8.19.240.0/8 4 (11013 + 11915 + 31541 + 23304)
8.20.213.0/8 6 (96434 + 108191 + 170058 + 23512 + 10420 + 24809)
It would be great if someone could suggest some way to achieve this.
The main problem here is that without having the routing table from the individual moments the packets arrived, you have no idea what netblock they were originally in. Sure, you can put them in the class-full blocks they would be in, in a class-full routing situation, but all that will give you is a nice presentation (and, admittedly, a shorter file).
Furthermore, your example looks a bit broken. You have a bunch of IP addresses in 8.0.0.0/8 and you are aggregating them into what looks like /24 routes and presenting them with a /8 at the end.
Nonetheless, in awk you can use sub() to do text replacement (or you can use index to find occurrences of ., or you can use split to split at dots). It should be relatively easy to go from that to "drop last digit, add the string "0/24" and use that as a key to update an IP-count and a hit-count dictionary, then drop the last two octets and the slash, replace with "0.0/16" and do the same" (all arrays in awk are associative arrays, so essentially dicts). No need to sort in advance, when you loop through the result, you'll get the keys in a random order, but on average there will be fewer of them, so sorting afterwards will be cheaper.
I seem to not have an awk at hand, so I cannot give you a code example.
This might work for you:
awk '{a=$1;sub(/\.[^.]*$/,"",a);ac[a]++;at[a]+=$2};END{for(x in ac)print x".0/8",ac[x],at[x]}' file
This prints the '0/8 addresses to get the 0/16 duplicate the code i.e. b=a;sub(/\.[^.]*$/,"",b);ba[b]++ etc, etc.

Running R Code from Command Line (Windows)

I have some R code inside a file called analyse.r. I would like to be able to, from the command line (CMD), run the code in that file without having to pass through the R terminal and I would also like to be able to pass parameters and use those parameters in my code, something like the following pseudocode:
C:\>(execute r script) analyse.r C:\file.txt
and this would execute the script and pass "C:\file.txt" as a parameter to the script and then it could use it to do some further processing on it.
How do I accomplish this?
You want Rscript.exe.
You can control the output from within the script -- see sink() and its documentation.
You can access command-arguments via commandArgs().
You can control command-line arguments more finely via the getopt and optparse packages.
If everything else fails, consider reading the manuals or contributed documentation
Identify where R is install. For window 7 the path could be
1.C:\Program Files\R\R-3.2.2\bin\x64>
2.Call the R code
3.C:\Program Files\R\R-3.2.2\bin\x64>\Rscript Rcode.r
There are two ways to run a R script from command line (windows or linux shell.)
1) R CMD way
R CMD BATCH followed by R script name. The output from this can also be piped to other files as needed.
This way however is a bit old and using Rscript is getting more popular.
2) Rscript way
(This is supported in all platforms. The following example however is tested only for Linux)
This example involves passing path of csv file, the function name and the attribute(row or column) index of the csv file on which this function should work.
Contents of test.csv file
x1,x2
1,2
3,4
5,6
7,8
Compose an R file “a.R” whose contents are
#!/usr/bin/env Rscript
cols <- function(y){
cat("This function will print sum of the column whose index is passed from commandline\n")
cat("processing...column sums\n")
su<-sum(data[,y])
cat(su)
cat("\n")
}
rows <- function(y){
cat("This function will print sum of the row whose index is passed from commandline\n")
cat("processing...row sums\n")
su<-sum(data[y,])
cat(su)
cat("\n")
}
#calling a function based on its name from commandline … y is the row or column index
FUN <- function(run_func,y){
switch(run_func,
rows=rows(as.numeric(y)),
cols=cols(as.numeric(y)),
stop("Enter something that switches me!")
)
}
args <- commandArgs(TRUE)
cat("you passed the following at the command line\n")
cat(args);cat("\n")
filename<-args[1]
func_name<-args[2]
attr_index<-args[3]
data<-read.csv(filename,header=T)
cat("Matrix is:\n")
print(data)
cat("Dimensions of the matrix are\n")
cat(dim(data))
cat("\n")
FUN(func_name,attr_index)
Runing the following on the linux shell
Rscript a.R /home/impadmin/test.csv cols 1
gives
you passed the following at the command line
/home/impadmin/test.csv cols 1
Matrix is:
x1 x2
1 1 2
2 3 4
3 5 6
4 7 8
Dimensions of the matrix are
4 2
This function will print sum of the column whose index is passed from commandline
processing...column sums
16
Runing the following on the linux shell
Rscript a.R /home/impadmin/test.csv rows 2
gives
you passed the following at the command line
/home/impadmin/test.csv rows 2
Matrix is:
x1 x2
1 1 2
2 3 4
3 5 6
4 7 8
Dimensions of the matrix are
4 2
This function will print sum of the row whose index is passed from commandline
processing...row sums
7
We can also make the R script executable as follows (on linux)
chmod a+x a.R
and run the second example again as
./a.R /home/impadmin/test.csv rows 2
This should also work for windows command prompt..
save the following in a text file
f1 <- function(x,y){
print (x)
print (y)
}
args = commandArgs(trailingOnly=TRUE)
f1(args[1], args[2])
No run the following command in windows cmd
Rscript.exe path_to_file "hello" "world"
This will print the following
[1] "hello"
[1] "world"

Resources