Split column into multiple based on match/delimiter using bash awk

I have a dataset in a single column that I would like to split into any number of new columns each time a certain string is found (in this case 'male_position').
$ cat test.file
male_position
0.00
0.00
1.05
1.05
1.05
1.05
3.1
5.11
12.74
30.33
40.37
40.37
male_position
0.00
1.05
2.2
4.0
4.0
8.2
25.2
30.1
male_position
1.0
5.0
I would like the script to start a new tab-separated column each time 'male_position' is encountered and print each line/data point below it into that column, until the next occurrence of 'male_position':
script.awk test.file > output
0.00 0.00 1.0
0.00 1.05 5.0
1.05 2.2
1.05 4.0
1.05 4.0
1.05 8.2
3.1 25.2
5.11 30.1
12.74
30.33
40.37
40.37
Any ideas?
update -
I have tried to adapt code based on this post (Linux split a column into two different columns in a same CSV file):
cat script.awk
BEGIN {
    line = 0;                  # initialize at zero
}
/male_position/ {              # every time we hit the delimiter
    line = 0;                  # reset line to zero
}
!/male_position/ {             # otherwise
    a[line] = a[line]" "$0;    # add the new input line to the output line
    line++;                    # increase the counter by one
}
END {
    for (i in a)
        print a[i]             # print the output
}
Results....
$ awk -f script.awk test.file
1.05 2.2
1.05 4.0
1.05 4.0
1.05 8.2
3.1 25.2
5.11 30.1
12.74
30.33
40.37
40.37
0.00 0.00 1.0
0.00 1.05 5.0
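The shuffled row order is expected: for (i in a) visits array indices in an unspecified order. A minimal sketch of an ordered END block, assuming the main block also tracks the largest row count seen (maxline is a name introduced here for illustration):
!/male_position/ {             # otherwise
    a[line] = a[line]" "$0     # append the input line to that output row
    if (++line > maxline)
        maxline = line         # remember the longest group
}
END {
    for (i = 0; i < maxline; i++)   # a numeric loop guarantees row order
        print a[i]
}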
UPDATE 2
I can recreate the expected output with the test.file case: running script.awk (see above) on Linux with that test file seemed to work. However, that simple example happens to have a decreasing number of data points between occurrences of the delimiter (male_position). When a later group contains more data points than the one before it, the output fails...
cat test.file2
male_position
0.00
0.00
1.05
1.05
1.05
1.05
3.1
5.11
12.74
male_position
0
5
10
male_position
0
1
2
3
5
awk -f script.awk test.file2
0.00 0 0
0.00 5 1
1.05 10 2
1.05 3
1.05 5
1.05
3.1
5.11
12.74
there is no 'padding' of the lines after the last observation for a given column, so a column with more values than the preceding column has its values fall in line with the previous column (the 3 and the 5 end up in column 2, when they should be in column 3). The script appends each value to a flat per-row string without recording which column it belongs to, so once a short group ends, the next group's extra values slide left into the previous column.

Here's a csplit+paste solution
$ csplit --suppress-matched -zs test.file2 /male_position/ {*}
$ ls
test.file2 xx00 xx01 xx02
$ paste xx*
0.00 0 0
0.00 5 1
1.05 10 2
1.05 3
1.05 5
1.05
3.1
5.11
12.74
From man csplit
csplit - split a file into sections determined by context lines
-z, --elide-empty-files
remove empty output files
-s, --quiet, --silent
do not print counts of output file sizes
--suppress-matched
suppress the lines matching PATTERN
/male_position/ is the regex used to split the input file
{*} specifies to create as many splits as possible
use -f and -n options to change the default output file names
paste xx* to paste the files column wise, TAB is default separator
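For example (assuming GNU csplit), -f and -n give friendlier names; with -n 1 the suffix is a single digit starting at 0:
$ csplit --suppress-matched -zs -f part_ -n 1 test.file2 /male_position/ {*}
$ ls
part_0 part_1 part_2 test.file2
$ paste part_*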

The following awk may help you with the same.
awk '/male_position/{count++;max=(val>max?val:max);val=0;next} {array[++val,count]=$0} END{max=(val>max?val:max);for(i=1;i<=max;i++){for(j=1;j<=count;j++){printf("%s%s",array[i,j],j==count?ORS:OFS)}}}' OFS="\t" Input_file
Here is the same solution in expanded form:
awk '
/male_position/ {                    # at each delimiter:
    count++                          #   start a new column
    max = (val > max ? val : max)    #   remember the longest column so far
    val = 0                          #   reset the row counter
    next
}
{
    array[++val, count] = $0         # store the value under (row, column)
}
END {
    max = (val > max ? val : max)    # include the last column as well
    for (i = 1; i <= max; i++) {
        for (j = 1; j <= count; j++) {
            printf("%s%s", array[i,j], (j == count ? ORS : OFS))
        }
    }
}
' OFS="\t" Input_file
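On test.file2 this should pad the short groups with empty fields so every value stays in its own column (tabs shown as spaces here):
0.00    0     0
0.00    5     1
1.05    10    2
1.05          3
1.05          5
1.05
3.1
5.11
12.74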

Related

Using for-loop variable in awk print

I'm trying to print a for-loop variable as the first column of the output:
for i in 0.80 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40; do
awk -v a="$i" '{printf "%10.2f %10.2f\n", a, ($8*627.509)}' e1.txt > e2.txt
done
But when I open the output file, it contains only:
1.40 -12111939.85
1.40 -12112479.17
1.40 -12112817.98
1.40 -12112997.55
1.40 -12113047.39
1.40 -12112998.93
1.40 -12112873.57
1.40 -12112695.74
1.40 -12112504.02
1.40 -12112346.74
1.40 -12112316.49
1.40 -12112204.51
1.40 -12112149.56
Ignore the second column, as its values are read and computed from the other file, e1.txt. As shown, only the last for-loop value is used. But I want the for-loop values 0.80 through 1.40 printed accordingly, line by line.
For efficiency, I would avoid processing the same file 13 times. (Note also that the redirection > e2.txt inside the loop truncates the file on every iteration, which is why only the 1.40 lines survive; see the sketch after this answer.) The BEGIN block looks awkward because awk can't declare an array literal.
awk '
BEGIN {
    a = "0.80 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40"
    n = split(a, as)          # split on whitespace into as[1..n]
}
{
    for (i = 1; i <= n; i++)
        printf "%10.2f %10.2f\n", as[i], ($8 * 627.509)
}
' e1.txt > e2.txt
If you want all the 0.80 first and all the 1.40 last, you can:
awk '...' e1.txt | sort -g > e2.txt
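If you do want to keep the 13-pass loop, the minimal fix is to stop truncating e2.txt on every iteration; a sketch, emptying the file once up front and appending with >>:
> e2.txt   # truncate once, before the loop
for i in 0.80 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40; do
    awk -v a="$i" '{printf "%10.2f %10.2f\n", a, ($8*627.509)}' e1.txt >> e2.txt
done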

How to format a string and a floating number simultaneously in awk?

I have a column as follows:
ifile.txt
1.25
2.78
?
?
5.6
3.4
I would like to round the floating-point values to whole numbers while passing the strings through as they are.
ofile.txt
1
3
?
?
6
3
Walter A, F. Knorr and Janez Kuhar suggested nice scripts for this, as per my question, which asked for a command along the lines of
awk '{printf "%d%s\n", $1}' ifile.txt
Then I found that I have a number of columns, while the other columns don't need any formatting, so I have to use the above command in a form something like:
awk '{printf "%5s %d%s %5s %5s %5s\n", $1, $2, $3, $4, $5}' ifile.txt
for example:
ifile.txt
1 1.25 23.2 34 3.4
2 2.78 22.0 23 1.2
3 ? ? ? 4.3
4 ? ? ? 6.5
5 5.6 45.0 5 2.4
6 3.4 43.0 23 5.6
I used the following command, as again suggested by F. Knorr in an answer:
awk '$2~/^[0-9]+\.?[0-9]*$/{$2=int($2+0.5)}1' ifile.txt > ofile.txt
ofile.txt
1 1 23.2 34 3.4
2 3 22.0 23 1.2
3 ? ? ? 4.3
4 ? ? ? 6.5
5 6 45.0 5 2.4
6 3 43.0 23 5.6
It works fine, but I need the output aligned in fixed-width columns, like:
ofile.txt
    1     1  23.2    34   3.4
    2     3  22.0    23   1.2
    3     ?     ?     ?   4.3
    4     ?     ?     ?   6.5
    5     6  45.0     5   2.4
    6     3  43.0    23   5.6
You could first check whether the column contains a number (via regex) and then handle the printing accordingly:
awk '$1~/^[0-9]+\.?[0-9]*$/{printf "%i\n",$1+0.5; next}1' test.txt
Update: If it is the n-th column that needs to be formatted as described above (and no other formatting in other columns), then replace all $1 by $n in the following command:
awk '$1~/^[0-9]+\.?[0-9]*$/{$1=int($1+0.5)}1' test.txt
Just adding a half can be done with:
awk ' $1 ~ /^[0-9]+$|^[0-9]+\.[0-9]+$/ { printf("%d\n", $1 + 0.5); next }
{ print $1 } ' file
or slightly shorter:
awk ' $1 ~ /^[0-9]+$|^[0-9]+\.[0-9]+$/ { printf("%d\n", $1 + 0.5); next } 1' file
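To also get the fixed-width layout the asker wants, the rounding guard can be combined with a padded printf over all the columns; a sketch assuming the five-column ifile.txt above and a field width of 5:
awk '{ if ($2 ~ /^[0-9]+\.?[0-9]*$/) $2 = int($2 + 0.5)   # round column 2 only when numeric
       printf "%5s %5s %5s %5s %5s\n", $1, $2, $3, $4, $5 }' ifile.txt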

Merging similar columns from different files into a matrix

I have files in the following format, with the first column common to all of them:
File1.txt
ID Score
ABCD 0.9
BCBS 0.2
NBNC 0.67
TCGS 0.8
File2.txt
ID Score
ABCD 0.3
BCBS 0.9
NBNC 0.73
TCGS 0.12
File3.txt
ID Score
ABCD 0.23
BCBS 0.65
NBNC 0.94
TCGS 0.56
I want to merge the second column (the Score column) of all the files, keyed on the common first column, and display each file name minus its extension as the header, to identify where each score came from, such that the matrix would look something like:
ID File1 File2 File3
ABCD 0.9 0.3 0.23
BCBS 0.2 0.9 0.65
NBNC 0.67 0.73 0.94
TCGS 0.8 0.12 0.56
$ cat tst.awk
BEGIN { OFS="\t" }
FNR>1 { id[FNR] = $1; score[FNR,ARGIND] = $2 }
END {
    printf "%s%s", "ID", OFS
    for (colNr=1; colNr<=ARGIND; colNr++) {
        sub(/\..*/,"",ARGV[colNr])
        printf "%s%s", ARGV[colNr], (colNr<ARGIND?OFS:ORS)
    }
    for (rowNr=2; rowNr<=FNR; rowNr++) {
        printf "%s%s", id[rowNr], OFS
        for (colNr=1; colNr<=ARGIND; colNr++) {
            printf "%s%s", score[rowNr,colNr], (colNr<ARGIND?OFS:ORS)
        }
    }
}
$ awk -f tst.awk File1.txt File2.txt File3.txt
ID File1 File2 File3
ABCD 0.9 0.3 0.23
BCBS 0.2 0.9 0.65
NBNC 0.67 0.73 0.94
TCGS 0.8 0.12 0.56
Pick some string that can't occur in your input as the OFS, I used tab.
If you don't have GNU awk add FNR==1{ ARGIND++ } at the start of the script.
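A sketch of the first lines under that change, for awks without GNU's built-in ARGIND (the rest of tst.awk stays as above):
BEGIN { OFS="\t" }
FNR==1 { ARGIND++ }                              # maintain the file counter manually
FNR>1  { id[FNR] = $1; score[FNR,ARGIND] = $2 }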
Another alternative, using a helper function p that prints the file name as a header followed by the score column:
p() { awk 'NR==1{print FILENAME; next} {print $2}' "$1"; }
awk 'NR==1{$0=$1"\t"FILENAME}1' File1 > all
for f in File{2..6}; do
    paste all <(p "$f") > temp && cp temp all
done
I copied your data to 6 identical files, File1..File6, and the script produced the output below. Most of the work is setting up the column names:
ID File1 File2 File3 File4 File5 File6
ABCD 0.9 0.9 0.9 0.9 0.9 0.9
BCBS 0.2 0.2 0.2 0.2 0.2 0.2
NBNC 0.67 0.67 0.67 0.67 0.67 0.67
TCGS 0.8 0.8 0.8 0.8 0.8 0.8

Manipulation of data with BASH

I have a file full of lines like the following:
8408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
The first two digits in the first column represent the year, and I need to substitute them with the full year while retaining the rest of the line. I tried various methods with awk and sed but couldn't get them working.
My latest attempt is the following:
while read line
a=20
b=19
do
awk '{ if ("$1 | cut -c1-2" == "0"); then {print $a$1}; }' > test.txt
done < catalog.txt
The final output should be:
198408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
Any ideas on how I can do this? Any help would be greatly appreciated.
You have a conceptual problem: how would you know whether 15 refers to 2015 or 1915? Otherwise it's quite easy:
#!/bin/bash
for num in 8408292236 8508292236 0408292236 1508292236
do
prefix=${num:0:2} # get first two digits of line
if [ $prefix -ge 20 ]; then # assuming that 00-19 refers to the 2000s
echo "19$num"
else
echo "20$num"
fi
done
This will prefix 19 if the first field starts with 16 or more, and will prefix 20 otherwise. I think this is what you need.
awk ' { if ($1 > 1600000000 ) print "19" $0 ; else print "20" $0 ; }' catalog.txt
A solution using just bash:
while read A; do
PREFIX=20
if [ ${A:0:2} -gt 15 ] ; then
PREFIX=19
fi
echo "${PREFIX}${A}"
done
sed -e 's/^[0-3]/20&/' -e 's/^[4-9]/19&/'
This will prepend 20 to each line starting with 0 to 3 and 19 to each line starting with 4 to 9, so we consider 40-99 to correspond to the 1900s and 00-39 to the 2000s.
Adapt to your needs.
Use the awk ternary operator.
$ cat catalog.txt
8408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
0408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
$ awk '$1=substr($1,1,2)<15 ? "20"$1 : "19"$1' catalog.txt
198408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
200408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
Explanation
Let’s use the ternary operator.
expr ? action1 : action2
It's pretty straightforward: if expr is true then action1 is performed/evaluated; if not, action2.
Field one must be prefixed with 19 or 20 depending on the value of its first two characters, substr($1,1,2).
NOTE: this only works if your data does not contain years below 1915.
At this point, we just need to change the first field $1 to our needs: $1=substr($1,1,2)<15 ? "20"$1 : "19"$1
Check: http://klashxx.github.io/ternary-operator/
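A tiny standalone example of the same operator:
awk 'BEGIN { x = 7; print (x > 5 ? "big" : "small") }'    # prints big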
Simply in bash (10# forces base 10, so prefixes like 08 or 09 are not parsed as invalid octal numbers):
while read i; do echo "$((10#${i:0:2}<20?20:19))$i"; done <file.txt
shorter in perl:
perl -pe'print substr($_,0,2)<20?20:19' file.txt
even shorter in sed:
sed 's/^[01]/20&/;t;s/^/19/' file.txt

bash - check for word in specific column, check value in other column of this line, cut and paste the line to new text file

My text files contain ~20k lines and look like this:
file_A:
ATOM 624 SC1 SER 288 54.730 23.870 56.950 1.00 0.00
ATOM 3199 NC3 POP 487 50.780 27.750 27.500 1.00 3.18
ATOM 3910 C2B POP 541 96.340 99.070 39.500 1.00 7.00
ATOM 4125 W PW 559 55.550 64.300 16.880 1.00 0.00
Now I need to check for POP in column 4 (lines 2 and 3) and check if the values in the last column (10) exceed a specific threshold (e.g. 5.00). These lines - in this case just line 3 - need to be removed from file_A and copied to a new file_B. Meaning:
file_A:
ATOM 624 SC1 SER 288 54.730 23.870 56.950 1.00 0.00
ATOM 3199 NC3 POP 487 50.780 27.750 27.500 1.00 3.18
ATOM 4125 W PW 559 55.550 64.300 16.880 1.00 0.00
file_B:
ATOM 3910 C2B POP 541 96.340 99.070 39.500 1.00 7.00
I'm not sure whether to use sed, grep or awk, or some combination of them :/
So far I could only delete the lines and create a new file without them...
awk '!/POP/' file_A > file_B
EDIT:
Does the following work for removing lines that match any of several different words?
for (( i=0 ; i<$numberoflipids ; i++ ))
do
awk '$4~/"${nol[$i]}"/&&$NF>"$pr"{print >"patch_rmlipids.pdb";next}{print > "tmp"}' bilayer_CG_ordered.pdb && mv tmp patch.pdb
done
where nol is an array containing the words to be removed, pr is the given threshold, and the .pdb files are the files used. (See the note after the answer below.)
Using awk:
awk '$4~/POP/&&$NF>5{print >"fileb";next}{print > "tmp"}' filea && mv tmp filea
$4~/POP/&&$NF>5       - checks that the fourth field contains POP and the last field is more than five
{print >"fileb";next} - if so, writes the line to fileb and skips the remaining statements
{print > "tmp"}       - only executed if the first test fails; writes the line to tmp
filea && mv tmp filea - the input file; if the awk command succeeds, overwrite it with tmp
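As for the EDIT above: it cannot work as written, because shell variables such as ${nol[$i]} and $pr never expand inside a single-quoted awk program. A sketch of one way to handle several words at once, passing an alternation regex and the threshold in with -v (variable and file names follow the asker's; the words are assumed to be plain strings with no regex metacharacters):
re=$(IFS='|'; echo "${nol[*]}")          # e.g. turns (POP PW) into POP|PW
awk -v re="$re" -v pr="$pr" '
    $4 ~ re && $NF > pr { print > "patch_rmlipids.pdb"; next }
    { print > "tmp" }
' bilayer_CG_ordered.pdb && mv tmp patch.pdb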
