line replace given contexts on two sides of a split - bash

I am trying to do a line replace given contexts on two sides of a split. This seems much easier to do in python but my entire pipeline is in bash so I would love to stick to tools like sed, awk, grep, etc.
For example:
split_0 = split('\t')[0]
split_1 = split('\t')[1]
if (a b c in split_0 AND w x y z in split_1):
split_1 = split_1.replace('w x y z', 'w x_y z')
I can use awk to do splits like this:
awk -F '\t' '{print$1}'
But I don't know how to do this on both sides simultaneously in order to satisfy both conditions. Any help would be greatly appreciated.
Example input/output:
This is just one example; I have many rules like it. Given a line with "ex" on the left side and "ih g z" on the right side, I want to substitute ih g_z for ih g z.
input: exam ih g z ae m
output: exam ih g_z ae m
I could do a brutal sed like:
sed 's/\(.*ex.*\t.*\)ih g z\(.*\)/\1ih g_z\2/g'
but this seems ugly and I am sure there is a much better way to do this. *I am not totally sure if the "\t" works that way in sed.

awk to the rescue!
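awk -F'\t' -v OFS='\t' '$1~/ex/ && $2~/ih g z/{sub(/ih g z/,"ih g_z",$2)}1' file
This tests fields 1 and 2 (tab-delimited) and, when both conditions hold, replaces the first occurrence of the string in field 2 only, so a stray "g z" on the left side is never touched.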
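A quick way to check a rule like this is to run it on a couple of hand-made lines. The sketch below assumes tab-separated input; the function name fix_right and the second sample line are made up for illustration.

```shell
# Hypothetical helper: apply the rule only when both sides match,
# and substitute inside field 2 only.
fix_right() {
    awk -F'\t' -v OFS='\t' '
        $1 ~ /ex/ && $2 ~ /ih g z/ { sub(/ih g z/, "ih g_z", $2) }
        { print }
    '
}

# Two sample lines: the first satisfies both conditions,
# the second has no "ex" on the left and must stay unchanged.
printf 'exam\tih g z ae m\nspam\tih g z ae m\n' | fix_right
```

Only the first line should come back with ih g_z; the second is printed as it went in.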
If you have a bunch of these replacement rules, it's better not to hard-code them in the script:
$ awk -F'\t' -v OFS='\t' 'NR==FNR{lr[NR]=$1; rr[NR]=$2;
ls[NR]=$3; rs[NR]=$4; next}
{for(i=1; i<=length(lr); i++)
if($1~lr[i] && $2~rr[i])
{gsub(ls[i],rs[i],$2);
print;
next}}1' rules file
111 2b2b2b
222 333u33u
4 bbb5az
9 nochange
where
$ head rules file
==> rules <==
1 2 a b
2 3 z u
4 5 e b
==> file <==
111 2a2a2a
222 333z33z
4 eee5az
9 nochange
Note that only the first applicable rule is applied to each line, that the replacement is made in the second field only, and that every match in that field is replaced (gsub). Both files need to be tab-delimited.
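Calling length() on an array is not available in every awk, so here is a sketch of the same rule loop that keeps its own rule counter instead. The function name apply_rules and the counter n are my own names; the rules file layout (left regex, right regex, search, replacement, tab-separated) is the one used above.

```shell
# Hypothetical helper: apply_rules RULES < data
# RULES is tab-delimited: left-regex, right-regex, search, replacement.
apply_rules() {
    awk -F'\t' -v OFS='\t' '
        # First file (the rules): remember each rule, counting them in n.
        NR == FNR { lr[++n] = $1; rr[n] = $2; ls[n] = $3; rs[n] = $4; next }
        {
            # Data lines: apply the first rule whose two conditions match.
            for (i = 1; i <= n; i++)
                if ($1 ~ lr[i] && $2 ~ rr[i]) { gsub(ls[i], rs[i], $2); break }
            print
        }
    ' "$1" -
}

# Same sample rules and data as above, written to a temporary directory.
tmp=$(mktemp -d)
printf '1\t2\ta\tb\n2\t3\tz\tu\n4\t5\te\tb\n' > "$tmp/rules"
printf '111\t2a2a2a\n222\t333z33z\n4\teee5az\n9\tnochange\n' | apply_rules "$tmp/rules"
```

The data is read from standard input (the trailing - operand), which fits more naturally into a pipeline than a second file name.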

Related

Faster way to extract data from large file

I have file containing about 40000 frames of Cartesian coordinates of 28 atoms. I need to extract coordinates of atom 21 to 27 from each frame.
I tried using bash script with for-loop.
for i in {0..39999}
do
cat $1 | grep -A 27 "frame $i " | tail -n 6 | awk '{print $2, $3, $4}' >> new_coors.xyz
done
Data have following form:
28
-1373.82296 frame 0 xyz file generated by terachem
Re 1.6345663991 0.9571586961 0.3920887712
N 0.7107677071 -1.0248027788 0.5007181135
N -0.3626961076 1.1948218124 -0.4621264246
C -1.1299268126 0.0792071086 -0.5595954110
C -0.5157993503 -1.1509115191 -0.0469223696
C 1.3354467762 -2.1017253883 1.0125736017
C 0.7611763218 -3.3742177216 0.9821756556
C -1.1378354025 -2.4089069492 -0.1199253156
C -0.4944655989 -3.5108477831 0.4043826684
C -0.8597552614 2.3604180994 -0.9043060625
C -2.1340008843 2.4846545826 -1.4451933224
C -2.4023114639 0.1449111237 -1.0888703147
C -2.9292779079 1.3528434658 -1.5302429615
H 2.3226814021 -1.9233467458 1.4602019023
H 1.3128699342 -4.2076373780 1.3768411246
H -2.1105470176 -2.5059031902 -0.5582958817
H -0.9564415355 -4.4988963635 0.3544299401
H -0.1913951275 3.2219343258 -0.8231465989
H -2.4436044324 3.4620639189 -1.7693069306
H -3.0306593902 -0.7362803011 -1.1626515622
H -3.9523215784 1.4136948699 -1.9142814745
C 3.3621999538 0.4972227756 1.1031860016
O 4.3763020637 0.2022266109 1.5735343064
C 2.2906331057 2.7428149541 0.0483795630
O 2.6669163864 3.8206298898 -0.1683800650
C 1.0351398442 1.4995168190 2.1137684156
O 0.6510904387 1.8559680025 3.1601927094
Cl 2.2433490373 0.2064711824 -1.9226174036
It works, but it takes an enormous amount of time.
In the future I will be working with larger files. Is there a faster way to do this?
The reason why your program is slow is that you keep on re-reading your input file over and over in your for-loop. You can do everything with reading your file a single time and use awk instead:
awk '/frame/{c=0;next}{c++}(c>20 && c<27){ print $2,$3,$4 }' input > output
This answer assumes the following form of data:
frame ???
??? x y z ???
??? x y z ???
...
frame ???
??? x y z ???
??? x y z ???
...
The solution checks whether a line contains the word frame. If so, it resets the atom counter c to zero and skips to the next line. From that point on, it increases the counter for every line it reads. If the counter is between 20 and 27 (exclusive), i.e. for atoms 21 to 26, it prints the coordinates; change the test to c<=27 if atom 27 should be included as well.
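You can now easily expand on this. Assume you want the same atoms but only from frame 1000 to 1500. You can do this by introducing a frame counter fc. Note that fc counts frames from 1 in file order, so if the headers are numbered from 0, as in the question, fc is one more than the number in the header: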
awk '/frame/{fc++;c=0;next}{c++}(fc>=1000 && fc <=1500) && (c>20 && c<27){ print $2,$3,$4 }' input > output
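To see the counter technique at work without a 40000-frame file, here is a small sketch on made-up data. The function name extract_atoms, its two arguments and the toy coordinates are assumptions for illustration; it also assumes that atom lines have exactly four fields, which lets it ignore the atom-count line.

```shell
# Hypothetical helper: extract_atoms FIRST LAST
# Prints x y z for atoms FIRST..LAST (inclusive) of every frame on stdin.
extract_atoms() {
    awk -v first="$1" -v last="$2" '
        /frame/ { c = 0; next }     # header line: restart the atom counter
        # Atom lines (assumed to have 4 fields) advance the counter;
        # print the coordinates when the counter is inside the range.
        NF == 4 && ++c >= first && c <= last { print $2, $3, $4 }
    '
}

# Toy input: two frames of three atoms each.
printf '%s\n' \
    '3' 'E0 frame 0' 'A 1 2 3' 'B 4 5 6' 'C 7 8 9' \
    '3' 'E1 frame 1' 'A 10 20 30' 'B 40 50 60' 'C 70 80 90' |
    extract_atoms 2 3
```

For the file in the question this would be called as extract_atoms 21 27 < input > output, still reading the input only once.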
If the frame numbers in the file are already sorted, e.g. 0 to 39999 in that order, then something like this could do the job (not tested, since we don't have a sample input file, as Jepessen suggested):
cat $1 | grep -A 27 -E "frame [0-9]+ " | \
awk '{if ($2 == "frame") n = 0; if (n++ > 20 && NF == 4) print $2, $3, $4}' > new_coors.xyz
(The code above is deliberately verbose so that it is easier to understand and closer to your existing script. It tests $2 because "frame" is the second word of the header line, and NF == 4 skips the -- separator lines that grep prints between groups. If you need a more compact solution, check kvantour's answer.)
You could perhaps use 2 passes of grep, rather than thousands?
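Assuming you want the lines 21-27 after every frame, and you don't want to record the frame number itself, the following phrase should get the lines you want, which you can then 'tidy' with awk:
grep -A27 ' frame ' file | grep -B6 -e '^--$'
grep separates the groups of the first pass with a line containing only --, and the second pattern has to be passed with -e because it starts with a dash. The last group is not followed by a separator, so the final frame needs separate handling.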
If you also wanted the frame numbers (I see no evidence), or you really want to restrict the range of frame numbers, you could do that with tee and >( grep 'frame') to generate a second file that you would then need to re-merge. If you added -n to grep then you could easily merge sort the files on line number.
Another way to restrict the frame number without doing multiple passes would be a more complex grep expression that describes the range of numbers (-E because life is too short for backslashes):
-E ' frame ([0-9]{1,4}|[0-3][0-9]{4}) '
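A range expression like this is easy to get wrong, so it is worth trying on a few header lines first. This sketch assumes the range 0 to 39999 from the question; the function name frames_in_range and the sample headers are made up.

```shell
# Hypothetical helper: keep only header lines whose frame number is 0..39999
# (one to four digits, or five digits starting with 0-3).
frames_in_range() {
    grep -E ' frame ([0-9]{1,4}|[0-3][0-9]{4}) '
}

# The first two headers are inside the range, the third is not.
printf '%s\n' 'E frame 7 xyz' 'E frame 39999 xyz' 'E frame 40000 xyz' |
    frames_in_range
```

The spaces on both sides of the number are what stop a four-digit prefix of 40000 from matching.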

Joining lines, modulo the number of records

Say my stream is x*N lines long, where x is the number of records and N is the number of columns per record, and is output column-wise. For example, x=2, N=3:
1
2
Alice
Bob
London
New York
How can I join every line, modulo the number of records, back into columns:
1 Alice London
2 Bob New York
If I use paste, with N -s, I get the transposed output. I could use split, with the -l option equal to N, then recombine the pieces afterwards with paste, but I'd like to do it within the stream without spitting out temporary files all over the place.
Is there an "easy" solution (i.e., rather than invoking something like awk)? I'm thinking there may be some magic join solution, but I can't see it...
EDIT Another example, when x=5 and N=3:
1
2
3
4
5
a
b
c
d
e
alpha
beta
gamma
delta
epsilon
Expected output:
1 a alpha
2 b beta
3 c gamma
4 d delta
5 e epsilon
You are looking for pr to "columnate" the stream:
pr -T -s$'\t' -3 <<'END_STREAM'
1
2
Alice
Bob
London
New York
END_STREAM
1 Alice London
2 Bob New York
pr is in coreutils.
Most systems should include a tool called pr, intended to print files. It's part of POSIX.1 so it's almost certainly on any system you'll use.
$ pr -3 -t < inp1
1 a alpha
2 b beta
3 c gamma
4 d delta
5 e epsilon
Or if you prefer,
$ pr -3 -t -s, < inp1
1,a,alpha
2,b,beta
3,c,gamma
4,d,delta
5,e,epsilon
or
$ pr -3 -t -w 20 < inp1
1 a alpha
2 b beta
3 c gamma
4 d delta
5 e epsilo
Check the link above for standard usage information, or man pr for specific options in your operating system.
In order to reliably process the input you need to either know the number of columns in the output file or the number of lines in the output file. If you just know the number of columns, you'd need to read the input file twice.
Hackish coreutils solution
# If you know the number of output columns in advance but not the
# number of output lines, you can calculate the latter using wc -l
# (ncols is just an illustrative name)
ncols=3
olines=$(( $(wc -l < file) / ncols ))
# Split the file by the number of output lines
split -l "${olines}" file FOO # FOO is a prefix. Choose a better one
paste FOO*
AWK solutions
If you know the number of output columns in advance you can use this awk script:
convert.awk:
BEGIN {
# Split the file into one big record where fields are separated
# by newlines
RS=""
FS="\n"
}
FNR==NR {
# We are reading the file twice (see invocation below)
# When reading it the first time we store the number
# of fields (lines) in the variable n because we need it
# when processing the file, then skip to the second pass
# so that nothing is printed twice.
n=NF
next
}
{
# n / c is the number of output lines
# For every output line ...
for(i=0;i<n/c;i++) {
# ... print the columns belonging to it
for(ii=1+i;ii<=NF;ii+=n/c) {
printf "%s ", $ii
}
print "" # Adds a newline
}
}
and call it like this:
awk -vc=3 -f convert.awk file file # Twice the same file
If you know the number of output lines in advance you can use the following awk script:
convert.awk:
BEGIN {
# Split the file into one big record where fields are separated
# by newlines
RS=""
FS="\n"
}
{
# x is the number of output lines and has been passed to the
# script. For each line in output
for(i=0;i<x;i++){
# ... print the columns belonging to it
for(ii=i+1;ii<=NF;ii+=x){
printf "%s ",$ii
}
print "" # Adds a newline
}
}
And call it like this:
awk -vx=2 -f convert.awk file
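If the data arrives on a pipe and cannot be read twice, the lines can also be buffered and rearranged at the end. This is only a sketch under the assumption that the number of output columns is known and divides the line count evenly; the function name columnate and the tab separator are my choices.

```shell
# Hypothetical helper: columnate N   (N = number of output columns)
# Reads column-wise lines on stdin and prints them row-wise, tab-separated.
columnate() {
    awk -v c="$1" '
        { line[NR] = $0 }                   # buffer the whole stream
        END {
            rows = NR / c                   # assumes NR is a multiple of c
            for (i = 1; i <= rows; i++) {
                out = line[i]
                # The j-th column of row i sits j*rows lines further down.
                for (j = 1; j < c; j++)
                    out = out "\t" line[i + j * rows]
                print out
            }
        }
    '
}

printf '%s\n' 1 2 Alice Bob London 'New York' | columnate 3
```

Because each input line is kept whole, values containing spaces such as New York stay in one column.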

Print lines after a pattern in several files

I have a file with the same pattern several times.
Something like:
time
12:00
12:32
23:22
time
10:32
1:32
15:45
I want to print the lines after the pattern (time in the example) into several files. The number of lines after the pattern is constant.
I found I can get the first part of my question with awk '/time/ {x=NR+3;next}(NR<=x){print}' filename
But I have no idea how to output each chunk into different files.
EDIT
My files are a bit more complex than my original question.
They have the following format.
4
gen
C -4.141000 -0.098000 0.773000
H -4.528000 -0.437000 -0.197000
H -4.267000 0.997000 0.808000
H -4.777000 -0.521000 1.563000
4
gen
C -4.414000 -0.398000 4.773000
H -4.382000 -0.455000 -4.197000
H -4.267000 0.973000 2.808000
H -4.333000 -0.000000 1.636000
I want to print the lines after
4
gen
EDIT 2
My expected output is x files, where x is the number of times the pattern occurs.
From my second example, I want two files:
C -4.141000 -0.098000 0.773000
H -4.528000 -0.437000 -0.197000
H -4.267000 0.997000 0.808000
H -4.777000 -0.521000 1.563000
and
C -4.414000 -0.398000 4.773000
H -4.382000 -0.455000 -4.197000
H -4.267000 0.973000 2.808000
H -4.333000 -0.000000 1.636000
You can use this awk command:
awk '/time/{close(out); out="output" ++i; next} {print > out}' file
This awk command builds a variable out from the fixed prefix output and a counter i, which is incremented every time a line matching time is found. All subsequent lines are redirected to that output file. It is good practice to close each file once you are done with it, so that open file handles do not pile up (some awk implementations limit how many files can be open at once).
PS: If you want time line also in output then remove next in above command.
The revised "4/gen" requirements are somewhat ambiguous, but the following script (which is just a variant of @anubhava's) conforms with those that are given and can easily be modified to deal with various edge cases:
awk '
/^ *4 *$/ {close(out); out=0; next}
/^ *gen *$/ && out==0 {out = "output." ++i; next}
out {print > out} ' file
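Here is a sketch of the same splitting idea packaged so it can be tried on a tiny file. It assumes every chunk is introduced by a line holding only a number followed by a line holding only gen; the function name split_chunks, the prefix argument and the sample coordinates are made up.

```shell
# Hypothetical helper: split_chunks FILE PREFIX
# Writes the lines following each "gen" header to PREFIX1, PREFIX2, ...
split_chunks() {
    awk -v prefix="$2" '
        /^ *[0-9]+ *$/ { next }             # atom-count line: never copied
        /^ *gen *$/ {                       # header: switch to the next output file
            if (out != "") close(out)
            out = prefix (++i)
            next
        }
        out != "" { print > out }           # data line: append to the current file
    ' "$1"
}

# Two chunks of two atoms each, written to a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 2 gen 'C 1 2 3' 'H 4 5 6' 2 gen 'C 7 8 9' 'H 0 0 0' > "$tmp/in"
split_chunks "$tmp/in" "$tmp/chunk."
cat "$tmp/chunk.2"
```

Matching any number on the count line, rather than a literal 4, is a design choice so that chunks of other sizes split the same way.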
I found another answer from anubhava here: "How to print 5 consecutive lines after a pattern in file using awk", and with head and a for loop I solved my problem:
for i in {1..23}; do grep -A25 "gen" allTemp | tail -n 25 > xyz$i; head -n -27 allTemp > allTemp2; cp allTemp2 allTemp; done
I know that the file allTemp has 23 occurrences of gen.
head removes the lines I printed to xyz$i, as well as the two lines I don't want, and writes the result to the new file allTemp2.

Grouping the data from a column using awk or any other shell command

I could format the data using a perl script (hash). I am wondering if it can be done with a shell one-liner, so that I don't need to write a perl script every time there is some change in the input format.
Example Input:
rinku a
rinku b
rinku c
rrs d
rrs e
abc f
abc g
abc h
abc i
xyz j
example Output:
rinku a,b,c
rrs d,e
abc f,g,h,i
xyz j
Please help me with a command using shell/awk/sed to format the input.
Thanks,
Rinku
How about
$ awk '{arr[$1]=arr[$1]?arr[$1]","$2:$2} END{for (i in arr) print i, arr[i]}' input
rinku a,b,c
abc f,g,h,i
rrs d,e
xyz j
The awk program also has associative arrays, similar to Perl:
awk '{v[$1]=v[$1]","$2}END{for(k in v)print k" "substr(v[k],2)}' inputFile
For each line X Y (key of X, value of Y), it basically just appends ,Y to the array element indexed by X, taking advantage of the fact that all elements start as empty strings.
Then, since your values are then of the form ,x,y,z, you just strip off the first character when outputting.
This generates, for your input data (in inputFile):
rinku a,b,c
abc f,g,h,i
rrs d,e
xyz j
As an aside, if you want it as nicely formatted as the original, you can create a program.awk file:
{
val[$1] = val[$1]","$2
if (length ($1) > maxlen) {
maxlen = length ($1)
}
}
END {
for (key in val) {
printf "%-*s %s\n", maxlen, key, substr(val[key],2)
}
}
and run that with:
awk -f program.awk inputFile
and you'll get:
rinku a,b,c
abc f,g,h,i
rrs d,e
xyz j
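The order of a for (k in v) loop is unspecified in awk, which is why the groups above do not come out in input order. If the original order matters, a second array can record the keys as they first appear. This is a sketch; the function name group_by_first and the array names are my own.

```shell
# Hypothetical helper: group values by key, keeping keys in first-seen order.
group_by_first() {
    awk '
        # First time a key is seen: remember its position and start its list.
        !($1 in val) { order[++n] = $1; val[$1] = $2; next }
        # Key already known: append the value to its list.
        { val[$1] = val[$1] "," $2 }
        END { for (i = 1; i <= n; i++) print order[i], val[order[i]] }
    '
}

printf '%s\n' 'rinku a' 'rinku b' 'rrs d' 'abc f' 'abc g' 'xyz j' | group_by_first
```

Unlike the sed approach, this also groups lines with the same key that are not adjacent in the input.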
sed -n ':cycle
$!N
s/^\([^[:blank:]]*\)\([[:blank:]]\{1,\}.*\)\n\1[[:blank:]]\{1,\}/\1\2,/;t cycle
P
s/.*\n//;t cycle' YourFile
This tries not to use the hold buffer (and does not load the full file into memory):
- append the next line to the pattern space
- if the first word is the same as the first word after the newline, replace the newline and that repeated word with a ,
- if that substitution happened, restart at the line-loading step
- if not, print the first line
- delete everything in the pattern space up to and including the first \n
- if that deletion happened, restart at the line-loading step
This is a POSIX version, so use --posix with GNU sed.

Bash - fill empty cell with following value in the column

I have a long tab-delimited CSV file and I am trying to paste in a cell a value that comes later on the column.
For instance, input.txt:
0
1
1.345 B
2
2.86 A
3
4
I would like an output such as:
0 B
1 B
1.345 B
2 A
2.86 A
3 B
4 B
I've been tinkering with code from other threads like this awk solution, but the problem is that the value I want is not before the empty cell, but after, kind of a .FillUp in Excel.
Additional information:
input file may have different number of lines
"A" and "B" in input file may be at different rows and not evenly separated
second column may have only two values
last cell in second column may not have value
[EDIT] for the last two rows in input.txt, B is known to be in the second column, as all rows after 2.86 are not A.
Thanks in advance.
$ tac input.txt | awk -v V=B '{if ($2) V=$2; else $2=V; print}' | tac
0 B
1 B
1.345 B
2 A
2.86 A
3 B
4 B
tac (cat backwards) prints a file in reverse. Reverse the file, fill in the missing values, and then reverse it again.
The following awk command processes the file in a single pass, as long as you know the first value to fill. It should be quite a bit faster than reversing the file twice.
awk 'BEGIN {fillvalue="B"} $2 {fillvalue=$2=="A"?"B":"A"} !$2 {$2=fillvalue} 1' input.txt
Note that this assumes knowledge about the nature of that second column being only 'A' or 'B' or blank.
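If the second column cannot be assumed to alternate between two values, a forward pass can instead hold back the rows with an empty cell until the next value shows up. This is a sketch assuming tab-delimited input; the function name fill_up and its default-value argument (used for trailing rows with no later value) are my own.

```shell
# Hypothetical helper: fill_up DEFAULT
# Fills each empty second field with the next non-empty value below it;
# rows after the last value get DEFAULT.
fill_up() {
    awk -F'\t' -v OFS='\t' -v def="$1" '
        $2 == "" { pending[++n] = $1; next }    # no value yet: hold the row back
        {
            # A value arrived: flush the held rows with it, then the row itself.
            for (i = 1; i <= n; i++) print pending[i], $2
            n = 0
            print $1, $2
        }
        END { for (i = 1; i <= n; i++) print pending[i], def }
    '
}

printf '0\n1\n1.345\tB\n2\n2.86\tA\n3\n4\n' | fill_up B
```

Only the rows between two values are buffered, so memory use stays small even for long files.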
