For and If in Awk scripts (bash)

My bash command is
awk -f code.txt input.txt
This is my code
{z=0
for(i=2;i<17;i++)
if ($i=="y")
z++
print $1 " " z}
This is my input
AaA y n y n y n n n y n n n n n y
BbB n y y n n n n n n n n n n n n
My output should be
AaA 5
BbB 2
Yet it is
AaA 4
BbB 2
After messing around with the code, it seems it doesn't register the last symbol of a line.
{z=0
for(i=2;i<18;i++)
print $i
print $1 " " z}
When I run this it outputs all y/n, so the problem must be somewhere in the if-statement.

It may be the case that your input file has MS-DOS line endings (CRLF). The last symbol will be read as y<CR>. To check whether this is true, on a Linux system you can run
hexdump -C input.txt

The problem turned out to be the CRLF line endings. You can fix that inside the awk script with
sub(/\r/,"",$NF)
i.e. by removing the CR (replacing it with nothing) from the last field. You could also use sub or gsub itself to count the y occurrences:
$ awk '$0=$1 OFS gsub(FS "y","&")' file
AaA 5
BbB 2
This way the \r does not matter, since we just replace each match with itself (&) and use gsub's return value, the number of matches, as the count.
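A minimal end-to-end sketch of the CRLF fix (the sample file is created inline; the filename input.txt is just for illustration):

```shell
# Build a sample input with MS-DOS (CRLF) line endings.
printf 'AaA y n y n y n n n y n n n n n y\r\n'  > input.txt
printf 'BbB n y y n n n n n n n n n n n n\r\n' >> input.txt

# Strip a trailing CR from each record before comparing fields,
# so the last field equals "y" even on CRLF files.
awk '{ sub(/\r$/, ""); z = 0
       for (i = 2; i <= NF; i++) if ($i == "y") z++
       print $1, z }' input.txt
# -> AaA 5
#    BbB 2
```

With gawk you could instead set RS = "\r?\n" in a BEGIN block, which accepts both line-ending styles without touching the fields.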

Related

Faster way to extract data from large file

I have a file containing about 40000 frames of Cartesian coordinates for 28 atoms. I need to extract the coordinates of atoms 21 to 27 from each frame.
I tried using bash script with for-loop.
for i in {0..39999}
do
cat $1 | grep -A 27 "frame $i " | tail -n 6 | awk '{print $2, $3, $4}' >> new_coors.xyz
done
Data have following form:
28
-1373.82296 frame 0 xyz file generated by terachem
Re 1.6345663991 0.9571586961 0.3920887712
N 0.7107677071 -1.0248027788 0.5007181135
N -0.3626961076 1.1948218124 -0.4621264246
C -1.1299268126 0.0792071086 -0.5595954110
C -0.5157993503 -1.1509115191 -0.0469223696
C 1.3354467762 -2.1017253883 1.0125736017
C 0.7611763218 -3.3742177216 0.9821756556
C -1.1378354025 -2.4089069492 -0.1199253156
C -0.4944655989 -3.5108477831 0.4043826684
C -0.8597552614 2.3604180994 -0.9043060625
C -2.1340008843 2.4846545826 -1.4451933224
C -2.4023114639 0.1449111237 -1.0888703147
C -2.9292779079 1.3528434658 -1.5302429615
H 2.3226814021 -1.9233467458 1.4602019023
H 1.3128699342 -4.2076373780 1.3768411246
H -2.1105470176 -2.5059031902 -0.5582958817
H -0.9564415355 -4.4988963635 0.3544299401
H -0.1913951275 3.2219343258 -0.8231465989
H -2.4436044324 3.4620639189 -1.7693069306
H -3.0306593902 -0.7362803011 -1.1626515622
H -3.9523215784 1.4136948699 -1.9142814745
C 3.3621999538 0.4972227756 1.1031860016
O 4.3763020637 0.2022266109 1.5735343064
C 2.2906331057 2.7428149541 0.0483795630
O 2.6669163864 3.8206298898 -0.1683800650
C 1.0351398442 1.4995168190 2.1137684156
O 0.6510904387 1.8559680025 3.1601927094
Cl 2.2433490373 0.2064711824 -1.9226174036
It works, but it takes an enormous amount of time, and in the future I will be working with larger files. Is there a faster way to do this?
The reason your program is slow is that you keep re-reading your input file over and over in your for-loop. You can do everything in a single pass by using awk instead:
awk '/frame/{c=0;next}{c++}(c>20 && c<27){ print $2,$3,$4 }' input > output
This answer assumes the following form of data:
frame ???
??? x y z ???
??? x y z ???
...
frame ???
??? x y z ???
??? x y z ???
...
The solution checks whether it finds the word frame in a line. If so, it resets the atom counter c to zero and skips to the next line. From that point forward, it increments the counter on every line it reads. If the counter is between 20 and 27 (exclusive), it prints the coordinates.
You can now easily expand on this. Assume you want the same atoms, but only from frame 1000 to 1500. You can do this by introducing a frame counter fc:
awk '/frame/{fc++;c=0;next}{c++}(fc>=1000 && fc <=1500) && (c>20 && c<27){ print $2,$3,$4 }' input > output
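To see the single-pass approach in action, here is a scaled-down sketch: two frames of 4 atoms each, extracting atoms 2-3. The file contents and counter bounds are illustrative; widen them to c>20 && c<27 for the real 28-atom frames.

```shell
# Create a tiny two-frame sample in the same layout as the real data.
cat > sample.xyz <<'EOF'
-100.0 frame 0 xyz
H 0.0 0.0 0.0
H 1.0 0.0 0.0
H 0.0 1.0 0.0
H 0.0 0.0 1.0
-100.1 frame 1 xyz
O 2.0 0.0 0.0
O 0.0 2.0 0.0
O 0.0 0.0 2.0
O 2.0 2.0 0.0
EOF

# Reset the counter at each frame header, then print atoms 2-3.
awk '/frame/{c=0; next} {c++} (c>1 && c<4){print $2, $3, $4}' sample.xyz
# -> 1.0 0.0 0.0
#    0.0 1.0 0.0
#    0.0 2.0 0.0
#    0.0 0.0 2.0
```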
If the frame numbers in the file are already in sorted order, e.g. they run from 0 to 39999 in that order, then maybe something like this could do the job (not tested, since we don't have a sample input file, as Jepessen suggested):
cat $1 | grep -A 27 -E "frame [0-9]+ " | \
awk '{if ($1 == "frame") n = 0; if (n++ > 20) print $2, $3, $4}' > new_coors.xyz
(The code above is made deliberately verbose, to be easier to understand and closer to your existing script. If you need a more compact solution, check kvantour's answer.)
You could perhaps use 2 passes of grep, rather than thousands?
Assuming you want the lines 21-27 after every frame, and you don't want to record the frame number itself, the following should get the lines you want, which you can then tidy up with awk:
grep -A27 ' frame ' | grep -B6 '-----'
If you also wanted the frame numbers (I see no evidence), or you really want to restrict the range of frame numbers, you could do that with tee and >( grep 'frame') to generate a second file that you would then need to re-merge. If you added -n to grep then you could easily merge sort the files on line number.
Another way to restrict the frame number without doing multiple passes would be a more complex grep expression that describes the range of numbers (-E because life is too short for backslashes):
-E ' frame ([0-9]{1,4}|[0-3][0-9]{1,4}) '
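A quick sanity check of that range expression (the trailing space matters, and the alternation covers 0-39999 as one- to five-digit numbers):

```shell
# Only "frame 39999" falls inside the 0-39999 range; 40000 does not.
printf ' frame 39999 \n frame 40000 \n' |
  grep -E ' frame ([0-9]{1,4}|[0-3][0-9]{1,4}) '
# ->  frame 39999
```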

line replace given contexts on two sides of a split

I am trying to do a line replace given contexts on two sides of a split. This seems much easier to do in python but my entire pipeline is in bash so I would love to stick to tools like sed, awk, grep, etc.
For example:
split_0 = split('\t')[0]
split_1 = split('\t')[1]
if (a b c in split_0 AND w x y z in split_1):
    split_1 = split_1.replace('w x y z', 'w x_y z')
I can use awk to do splits like this:
awk -F '\t' '{print$1}'
But I don't know how to do this on both sides simultaneously in order to satisfy both conditions. Any help would be greatly appreciated.
Example input/output:
This is an example, and I have many rules like this, but basically what I want to do here is: given a line where I have "ex" on the left side and "ih g z" on the right side, I want to substitute ih g z with ih g_z.
input: exam ih g z ae m
output: exam ih g_z ae m
I could do a brutal sed like:
sed 's/\(.*ex.*\t.*\)ih g z\(.*\)/\1ih g_z\2/g'
but this seems ugly and I am sure there is a much better way to do this. *I am not totally sure if the "\t" works that way in sed.
awk to the rescue!
awk -F'\t' '$1~/ex/ && $2~/ih g z/{sub("g z","g_z")}1' file
Conditions on fields 1 and 2 (tab-delimited); the string is replaced once per matching line.
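A quick check of that one-liner on the question's example (note the literal tab between the two halves):

```shell
# The sub() fires only when field 1 matches /ex/ and field 2
# matches /ih g z/; the trailing 1 prints every line.
printf 'exam\tih g z ae m\n' |
  awk -F'\t' '$1~/ex/ && $2~/ih g z/{sub("g z","g_z")}1'
# -> exam<TAB>ih g_z ae m
```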
If you have a bunch of these replacement rules, it's better not to hard-code them in the script:
$ awk -F'\t' -v OFS='\t' 'NR==FNR{lr[NR]=$1; rr[NR]=$2;
ls[NR]=$3; rs[NR]=$4; next}
{for(i=1; i<=length(lr); i++)
if($1~lr[i] && $2~rr[i])
{gsub(ls[i],rs[i],$2);
print;
next}}1' rules file
111 2b2b2b
222 333u33u
4 bbb5az
9 nochange
where
$ head rules file
==> rules <==
1 2 a b
2 3 z u
4 5 e b
==> file <==
111 2a2a2a
222 333z33z
4 eee5az
9 nochange
Note that replacement happens for the first applicable rule only, on the second field only, and replaces all occurrences (gsub). Both files need to be tab-delimited.

Bash/shell script: create four random-length strings with fixed total length

I would like to create four strings, each with a random length, but their total length should be 10. So possible length combinations could be:
3 3 3 1
or
4 0 2 2
Which would then (respectively) result in strings like this:
111 222 333 4
or
1111 33 44
How could I do this?
$RANDOM will give you a random integer in range 0..32767.
Using some arithmetic expansion you can do:
remaining=10
for i in {1..3}; do
next=$((RANDOM % (remaining + 1))) # a number in range 0..remaining; the +1 also avoids a division by zero once remaining reaches 0
echo -n "$next "
((remaining -= next))
done
echo $remaining
Update: to print the number N repeated N times, you can use a function like this:
repeat() {
for ((i=0; i<$1; i++)); do
echo -n $1
done
echo
}
repeat 3
333
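Putting the two pieces together, a minimal sketch (the digit i stands in for the content of string i; any other string could be repeated the same way):

```shell
#!/bin/bash
# repeat N S: print the string S, N times, without a newline.
repeat() {
  local i
  for ((i = 0; i < $1; i++)); do printf '%s' "$2"; done
}

remaining=10
for n in 1 2 3; do
  len=$((RANDOM % (remaining + 1)))   # 0..remaining inclusive
  repeat "$len" "$n"
  printf ' '
  remaining=$((remaining - len))
done
repeat "$remaining" 4   # the last string gets whatever is left
echo
```

One possible run is `111 22  4444` (an empty string shows up as adjacent spaces); the four lengths always total 10.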
Here is an algorithm:
Make the first 3 strings with random lengths, each no greater than the remaining total length (subtract from it each time). The rest of the length is your last string.
Consider this:
sumlen=10
for i in {1..3}
do
strlen=$((RANDOM % (sumlen + 1))); sumlen=$((sumlen - strlen)); echo $strlen
done
echo $sumlen
This will output your lengths; now you can create the strings (I assume you know how).
An alternative awk solution:
awk 'function r(n) {return int(n*rand())}
BEGIN{srand(); s=10;
for(i=1;i<=3;i++) {a=r(s); s-=a; print a}
print s}'
3
5
1
1
srand() sets a randomized seed; otherwise awk would generate the same random numbers on each run.
Here you can combine the next task, generating the strings, into the same awk script:
$ awk 'function r(n) {return int(n*rand())};
function rep(n,t) {c="";for(i=1;i<=n;i++) c=c t; return c}
BEGIN{srand(); s=10;
for(j=1;j<=3;j++) {a=r(s); s-=a; printf("%s ", rep(a,j))}
printf("%s\n", rep(s,j))}'
Generated output:
1111 2 3 4444

Grouping the data from a column using awk or any other shell command

I could format the data using a Perl script (hash). I am wondering if it can be done with some shell one-liner, so that I don't need to write a new Perl script every time the input format changes.
Example Input:
rinku a
rinku b
rinku c
rrs d
rrs e
abc f
abc g
abc h
abc i
xyz j
Example output:
rinku a,b,c
rrs d,e
abc f,g,h,i
xyz j
Please help me with a command using shell/awk/sed to format the input.
Thanks,
Rinku
How about
$ awk '{arr[$1]=arr[$1]?arr[$1]","$2:$2} END{for (i in arr) print i, arr[i]}' input
rinku a,b,c
abc f,g,h,i
rrs d,e
xyz j
The awk program also has associative arrays, similar to Perl:
awk '{v[$1]=v[$1]","$2}END{for(k in v)print k" "substr(v[k],2)}' inputFile
For each line X Y (key X, value Y), it just appends ,Y to the array element indexed by X, taking advantage of the fact that all elements start as empty strings.
Then, since the values end up in the form ,x,y,z, you simply strip off the first character when outputting.
This generates, for your input data (in inputFile):
rinku a,b,c
abc f,g,h,i
rrs d,e
xyz j
As an aside, if you want it as nicely formatted as the original, you can create a program.awk file:
{
val[$1] = val[$1]","$2
if (length ($1) > maxlen) {
maxlen = length ($1)
}
}
END {
for (key in val) {
printf "%-*s %s\n", maxlen, key, substr(val[key],2)
}
}
and run that with:
awk -f program.awk inputFile
and you'll get:
rinku a,b,c
abc f,g,h,i
rrs d,e
xyz j
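One caveat with the awk answers above: for (k in v) iterates in an unspecified order, which is why the groups come out shuffled. If the output must preserve the order in which keys first appear, one sketch is to record that order explicitly:

```shell
# Remember each key the first time it is seen, then emit the
# groups in that original order.
printf 'rinku a\nrinku b\nrinku c\nrrs d\nrrs e\nabc f\nabc g\nabc h\nabc i\nxyz j\n' |
  awk '!($1 in v){order[++n]=$1}
       {v[$1]=v[$1] "," $2}
       END{for(i=1;i<=n;i++){k=order[i]; print k, substr(v[k],2)}}'
# -> rinku a,b,c
#    rrs d,e
#    abc f,g,h,i
#    xyz j
```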
sed -n ':cycle
$!N
s/^\([^[:blank:]]*\)\([[:blank:]]\{1,\}.*\)\n\1[[:blank:]]\{1,\}/\1\2,/;t cycle
P
s/.*\n//;t cycle' YourFile
Trying not to use the hold buffer (and not loading the full file into memory):
- load a line
- if the first word is the same as the first word after the newline, replace the newline and that repeated word with a ,
- if that substitution succeeded, restart at line loading
- if not, print the first line
- delete everything in the buffer up to and including the first \n
- if that deleted anything, restart at line loading
POSIX version, so use --posix on GNU sed

Moving columns from the back of a file to the front

I have a file with a very large number of columns (basically several thousand sets of threes) with three special columns (Chr and Position and Name) at the end.
I want to move these final three columns to the front of the file, so that the columns become Name Chr Position, and then the file continues with the trios.
I think this might be possible with awk, but I don't know enough about how awk works to do it!
Sample input:
Gene1.GType Gene1.X Gene1.Y ....ending in GeneN.Y Chr Position Name
Desired Output:
Name Chr Position (Gene1.GType Gene1.X Gene1.Y ) x n samples
I think the example below does more or less what you want.
$ cat file
A B C D E F G Chr Position Name
1 2 3 4 5 6 7 8 9 10
$ cat process.awk
{
printf "%s %s %s", $(NF-2), $(NF-1), $NF
for( i=1; i<NF-2; i++)
{
printf " %s", $i
}
print ""
}
$ awk -f process.awk file
Chr Position Name A B C D E F G
8 9 10 1 2 3 4 5 6 7
NF in awk denotes the number of fields on a row.
One-liner:
awk '{ Chr=$(NF-2) ; Position=$(NF-1) ; Name=$NF ; $(NF-2)=$(NF-1)=$NF="" ; print Name, Chr, Position, $0 }' file
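The one-liner above leaves a couple of extra blanks where the emptied fields used to be. A variant sketch that avoids them (works in gawk and most modern awks, where shrinking NF rebuilds the record):

```shell
# Save the last three fields, drop them by shrinking NF, then
# print them in front of the remaining record.
echo 'A B C D E F G Chr Position Name' |
  awk '{Name=$NF; Pos=$(NF-1); Chr=$(NF-2); NF-=3; print Name, Chr, Pos, $0}'
# -> Name Chr Position A B C D E F G
```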
