How to format simultaneously a string and a floating-point number in awk? - shell

I have a column as follows:
ifile.txt
1.25
2.78
?
?
5.6
3.4
I would like to round the floating-point values to integers while keeping the strings as they are.
ofile.txt
1
3
?
?
6
3
Walter A, F. Knorr and Janez Kuhar suggested nice scripts answering my original question, which asked for a command like
awk '{printf "%d%s\n", $1}' ifile.txt
I then found that my file has a number of columns, but the other columns don't need any formatting. So I have to use the above command in a form something like:
awk '{printf "%5s %d%s %5s %5s %5s\n", $1, $2, $3, $4, $5}' ifile.txt
for example:
ifile.txt
1 1.25 23.2 34 3.4
2 2.78 22.0 23 1.2
3 ? ? ? 4.3
4 ? ? ? 6.5
5 5.6 45.0 5 2.4
6 3.4 43.0 23 5.6
I used the following command, as suggested by F. Knorr in an answer:
awk '$2~/^[0-9]+\.?[0-9]*$/{$2=int($2+0.5)}1' ifile.txt > ofile.txt
ofile.txt
1 1 23.2 34 3.4
2 3 22.0 23 1.2
3 ? ? ? 4.3
4 ? ? ? 6.5
5 6 45.0 5 2.4
6 3 43.0 23 5.6
It works fine, but I still need the columns aligned, like:
ofile.txt
    1     1  23.2    34   3.4
    2     3  22.0    23   1.2
    3     ?     ?     ?   4.3
    4     ?     ?     ?   6.5
    5     6  45.0     5   2.4
    6     3  43.0    23   5.6

You could first check whether the column contains a number (via regex) and then handle the printing accordingly:
awk '$1~/^[0-9]+\.?[0-9]*$/{printf "%i\n",$1+0.5; next}1' test.txt
Update: If it is the n-th column that needs to be formatted as described above (and no other formatting in other columns), then replace all $1 by $n in the following command:
awk '$1~/^[0-9]+\.?[0-9]*$/{$1=int($1+0.5)}1' test.txt
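To get the rounding and a fixed-width layout in one pass, the numeric test can be combined with a padded printf. A minimal sketch for the five-column example above (assuming space-separated input and 5-character fields; adjust the widths to taste):
awk '$2 ~ /^[0-9]+\.?[0-9]*$/ { $2 = int($2 + 0.5) }      # round column 2 only when it is numeric
     { printf "%5s %5s %5s %5s %5s\n", $1, $2, $3, $4, $5 }' ifile.txt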

Just adding a half can be done with:
awk ' $1 ~ /^[0-9]+$|^[0-9]+\.[0-9]+$/ { printf("%d\n", $1 + 0.5); next }
{ print $1 } ' file
or slightly shorter:
awk ' $1 ~ /^[0-9]+$|^[0-9]+\.[0-9]+$/ { printf("%d\n", $1 + 0.5); next } 1' file
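One caveat: %d truncates toward zero, so adding half only rounds non-negative values correctly. If the column could also hold negative numbers, a variant like this would handle both signs (a sketch, extending the command above):
awk '$1 ~ /^-?[0-9]+\.?[0-9]*$/ { printf("%d\n", ($1 < 0 ? $1 - 0.5 : $1 + 0.5)); next } 1' file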

Related

Insert rows using awk

How can I insert a row using awk?
My file looks as:
1 43
2 34
3 65
4 75
I would like to insert three rows with "?", so my desired file looks as:
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
I am trying with the below script.
awk '{if(NR<=3){print "NR ?"}} {printf" " NR $2}' file.txt
Here's one way to do it:
$ awk 'BEGIN{s=" "; for(c=1; c<4; c++) print c s "?"}
{print c s $2; c++}' ip.txt
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
$ awk 'BEGIN {printf "1 ?\n2 ?\n3 ?\n"} {printf "%d", $1 + 3; printf " %s\n", $2}' file.txt
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
You could also add the 3 lines before awk, e.g.:
{ seq 3; cat file.txt; } | awk 'NR <= 3 { $2 = "?" } $1 = NR' OFS='\t'
Output:
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
I would do it the following way using GNU AWK. Let file.txt content be
1 43
2 34
3 65
4 75
then
awk 'BEGIN{OFS=" "}NR==1{print 1,"?";print 2,"?";print 3,"?"}{print NR+3,$2}' file.txt
output
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
Explanation: I set the output field separator (OFS) to a space. For the 1st row I print three lines, each consisting of the subsequent number and ? separated by the output field separator. You might elect to do this using a for loop, especially if you expect that the requirement might change here (see the sketch below). For every input line I print the row number plus 3 (to keep the order) and the 2nd column ($2). Thanks to the use of OFS, you would need to make only one change if the required separator is ever altered. Note that a construct like
{if(condition){dosomething}}
might be written in GNU AWK in a more concise manner as
(condition){dosomething}
(tested in gawk 4.2.1)
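For reference, the for-loop variant mentioned in the explanation might look like this (a sketch; the number of inserted rows sits in one variable, so only one place changes if the requirement does):
awk 'BEGIN{OFS=" "; n=3; for(i=1;i<=n;i++) print i,"?"} {print NR+n,$2}' file.txt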

Combining multiple awk output statements into one line

I have some ASCII files I'm processing, each with 35 columns and a variable number of rows. I need to take the difference between two columns (N+1) and place the results into a duplicate ASCII file as column number 36. Then, I need to take another column and divide it (row by row) by column 36, placing that result into the same duplicate ASCII file as column 37.
I've done similar processing in the past, but by outputting temp files for each awk command and reading each successive temp file back in to eventually create a final ASCII file, deleting the temp files afterwards. I'm hoping there is an easier/faster method than creating a bunch of temp files.
Below is an initial working processing step that the awk processing described above would need to follow and fit into. This step gets the data from foo.txt, removes the header, and processes only the rows containing a particular, but varying, string.
cat foo.txt | tail -n +2 | awk '$17 ~ /^[F][0-9][0-9][0-9]$/' >> foo_new.txt
There's another processing step for different data files that would also need the 2 new columns discussed earlier. This simply appends a unique file name, taken from what's being catted, as the last column of every row in a new ASCII file. This command is actually in a loop with varying input files, but I've simplified it here.
cat foo.txt | tail -n +2 | awk -v fname="$fname" '{print $0 OFS fname;}' >> foo_new.txt
An example of one of the foo.txt files.
20 0 5 F001
4 2 3 F002
12 4 8 F003
100 10 29 O001
Below is the desired example foo_new.txt, with the requested 2 columns of awk output as the last 2 columns. In this example, column 5 is the difference between columns 3 and 2, plus 1. Column 6 is the result of column 1 divided by column 5.
20 0 5 F001 6 3.3
4 2 3 F002 2 2.0
12 4 8 F003 5 2.4
For the second example foo_new.txt, the last column is an example of fname; these are computed in the shell script and passed to awk. I don't care whether the fname results are at the end (column 7) or placed between columns 4 and 5, so long as it works with the other awk statements.
20 0 5 F001 6 3.3 C1
4 2 3 F002 2 2.0 C2
12 4 8 F003 5 2.4 C3
My best luck so far is below, but unfortunately it produces a file with the original output first and the added output below it. I'd like to have the added output appended as columns (#5 and #6).
cat foo.txt | tail -n +2 | awk '$17 ~ /^[F][0-9][0-9][0-9]$/' >> foo_new.txt
cat foo_new.txt | awk '{print $4=$3-$2+1, $5=$1/($3-$2+1)}' >> foo_new.txt
Consider an input file data with header line like this (based closely on your minimal example):
Col1 Col2 Col3 Col4
20 0 5 F001
4 2 3 F002
12 4 8 F003
100 10 29 O001
You want the output to contain a column 5 that is the value of $3 - $2 + 1 (column 3 minus column 2 plus 1), and a column 6 that is the value of column 1 divided by column 5 (with 1 decimal place in the output), and a file name that is based on a variable fname passed to the script but that has a unique value for each line. And you only want lines where column 4 matches F and 3 digits, and you want to skip the first line. That can all be written directly in awk:
awk -v fname=C '
NR == 1 { next }
$4 ~ /^F[0-9][0-9][0-9]$/ { c5 = $3 - $2 + 1
c6 = sprintf("%.1f", $1 / c5)
print $0, c5, c6, fname NR
}' data
You could write that on one line too:
awk -v fname=C 'NR==1{next} $4~/^F[0-9][0-9][0-9]$/ { c5=$3-$2+1; print $0,c5,sprintf("%.1f",$1/c5), fname NR }' data
The output is:
20 0 5 F001 6 3.3 C2
4 2 3 F002 2 2.0 C3
12 4 8 F003 5 2.4 C4
Clearly, you could change the file name so that the counter starts from 0 or 1 by using counter++ or ++counter respectively in place of the NR in the print statement, and you could format it with leading zeros or whatever else you want with sprintf() again. If you want to drop the first line of each file, rather than just the first file, change the NR == 1 condition to FNR == 1 instead.
Note that this does not need the preprocessing provided by cat foo.txt | tail -n +2.
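Putting those suggestions together, a multi-file run with a per-line counter might look like this (a sketch; data1 and data2 are hypothetical file names):
awk -v fname=C 'FNR==1{next} $4~/^F[0-9][0-9][0-9]$/ { c5=$3-$2+1; n++; print $0, c5, sprintf("%.1f",$1/c5), fname n }' data1 data2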
I need to take the difference between two columns (N+1), and place the results into a duplicate ascii file on column number 36. Then, I need to take another column, and divide it (row by row) by column 36, and place that result into the same duplicate ascii file in column 37.
That's just:
awk -vN=9 -vanother_column=10 '{ v36 = $N - $(N+1); print $0, v36, $another_column / v36 }' input_file.tsv
I guess your file has some "header"/special "first line", so if it's the first line, then preserve it:
awk ... 'NR==1{print $0, "36_header", "37_header"} NR>1{ ... the script above ... }'
Taking the first 3 columns from the example you presented, and setting N to 2 and another_column to 1, we get the following script:
# recreate input file
cat <<EOF |
20 0 5
4 2 3
12 4 8
100 10 29
EOF
tr -s ' ' |
tr ' ' '\t' > input_file.tsv
awk -vOFS=$'\t' -vFS=$'\t' -vN=2 -vanother_column=1 '{ tmp = $(N + 1) - $N; print $0, tmp, $another_column / tmp }' input_file.tsv
and it will output:
20 0 5 5 4
4 2 3 1 4
12 4 8 4 3
100 10 29 19 5.26316
Such a script:
awk -vOFS=$'\t' -vFS=$'\t' -vN=2 -vanother_column=1 '{ tmp = $(N + 1) - $N + 1; print $0, tmp, sprintf("%.1f", $another_column / tmp) }' input_file.tsv
I think gets output closer to what you want:
20 0 5 6 3.3
4 2 3 2 2.0
12 4 8 5 2.4
100 10 29 20 5.0
And I guess that by that (N+1) you meant "the difference between two columns with 1 added".

Split column into multiple based on match/delimiter using bash awk

I have a dataset in a single column that I would like to split into any number of new columns when a certain string is found (in this case 'male_position').
>cat test.file
male_position
0.00
0.00
1.05
1.05
1.05
1.05
3.1
5.11
12.74
30.33
40.37
40.37
male_position
0.00
1.05
2.2
4.0
4.0
8.2
25.2
30.1
male_position
1.0
5.0
I would like the script to produce new tab-separated columns each time 'male_position' is encountered, but just print each line/data point below it (added to that column) until the next occurrence of 'male_position':
script.awk test.file > output
0.00 0.00 1.0
0.00 1.05 5.0
1.05 2.2
1.05 4.0
1.05 4.0
1.05 8.2
3.1 25.2
5.11 30.1
12.74
30.33
40.37
40.37
Any ideas?
update -
I have tried to adapt code based on this post (Linux split a column into two different columns in a same CSV file)
cat script.awk
BEGIN {
line = 0; #Initialize at zero
}
/male_position/ { #every time we hit the delimiter
line = 0; #reset line to zero
}
!/male_position/{ #otherwise
a[line] = a[line]" "$0; # Add the new input line to the output line
line++; # increase the counter by one
}
END {
for (i in a )
print a[i] # print the output
}
Results....
$ awk -f script.awk test.file
1.05 2.2
1.05 4.0
1.05 4.0
1.05 8.2
3.1 25.2
5.11 30.1
12.74
30.33
40.37
40.37
0.00 0.00 1.0
0.00 1.05 5.0
UPDATE 2 #######
I can recreate the expected output with the test.file case. Running the script (script.awk, see above) on Linux with that test file seemed to work. However, that simple example file has only decreasing numbers of data points between occurrences of the delimiter (male_position). When a later column has more values than the one before it, the output seems to fail...
cat test.file2
male_position
0.00
0.00
1.05
1.05
1.05
1.05
3.1
5.11
12.74
male_position
0
5
10
male_position
0
1
2
3
5
awk -f script.awk test.file2
0.00 0 0
0.00 5 1
1.05 10 2
1.05 3
1.05 5
1.05
3.1
5.11
12.74
there is no 'padding' of the lines after the last observation for a given column, so a column with more values than the preceding column has its values fall in line with the previous column (the 3 and the 5 are in column 2, when they should be in column 3).
Here's a csplit+paste solution
$ csplit --suppress-matched -zs test.file2 /male_position/ {*}
$ ls
test.file2 xx00 xx01 xx02
$ paste xx*
0.00    0       0
0.00    5       1
1.05    10      2
1.05            3
1.05            5
1.05
3.1
5.11
12.74
From man csplit
csplit - split a file into sections determined by context lines
-z, --elide-empty-files
remove empty output files
-s, --quiet, --silent
do not print counts of output file sizes
--suppress-matched
suppress the lines matching PATTERN
/male_position/ is the regex used to split the input file
{*} specifies to create as many splits as possible
use -f and -n options to change the default output file names (see the sketch after this list)
paste xx* to paste the files column wise, TAB is default separator
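For example, a run with custom output names might look like this (a sketch; the part_ prefix and the 2-digit suffix width are arbitrary choices):
csplit --suppress-matched -zs -f part_ -n 2 test.file2 /male_position/ {*}
paste part_*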
The following awk may help you with the same.
awk '/male_position/{count++;val=0;next} {array[++val,count]=$0;max=val>max?val:max} END{for(i=1;i<=max;i++){for(j=1;j<=count;j++){printf("%s%s",array[i,j],j==count?ORS:OFS)}}}' OFS="\t" Input_file
Adding a non-one liner form of solution too now.
awk '
/male_position/{
  count++;                 # start a new column
  val=0;                   # reset the row index for this column
  next
}
{
  array[++val,count]=$0;   # store the value at (row, column)
  max=val>max?val:max      # track the longest column seen so far
}
END{
  for(i=1;i<=max;i++){
    for(j=1;j<=count;j++){ printf("%s%s",array[i,j],j==count?ORS:OFS) }}
}
' OFS="\t" Input_file

Divide column values of different files by a constant then output one minus the other

I have two files of the form
file1:
#fileheader1
0 123
1 456
2 789
3 999
4 112
5 131
6 415
etc.
file2:
#fileheader2
0 442
1 232
2 542
3 559
4 888
5 231
6 322
etc.
How can I take the second column of each file, divide each by a constant, subtract one result from the other, and then output a new third file with the resulting values?
I want the output file to have the form
#outputheader
0 123/c-422/k
1 456/c-232/k
2 789/c-542/k
etc.
where c and k are numbers I can plug into the script
I have seen this question: subtract columns from different files with awk
But I don't know how to use awk to do this by myself. Does anyone know how to do this, or could you explain what is going on in the linked question so I can try to modify it?
I'd write:
awk -v c=10 -v k=20 ' ;# pass values to awk variables
/^#/ {next} ;# skip headers
FNR==NR {val[$1]=$2; next} ;# store values from file1
$1 in val {print $1, (val[$1]/c - $2/k)} ;# perform the calc and print
' file1 file2
output
0 -9.8
1 34
2 51.8
3 71.95
4 -33.2
5 1.55
6 25.4
etc. 0
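If you also want the #outputheader line and a separate output file, as the question asks, a BEGIN block plus a redirect covers it (a sketch built on the script above; file3 is a hypothetical name):
awk -v c=10 -v k=20 '
BEGIN {print "#outputheader"} ;# write the new header first
/^#/ {next} ;# skip the input headers
FNR==NR {val[$1]=$2; next} ;# store values from file1
$1 in val {print $1, (val[$1]/c - $2/k)} ;# perform the calc and print
' file1 file2 > file3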

split file into multiple files (by columns)

I have a file data.txt in which there are 200 columns and rows (a square matrix). So, I have been trying to split my file into 200 files, each of them with one of the columns from the big data file. These were my two attempts, employing cut and awk; however, I don't understand why they are not working.
NM=`awk 'NR==1{print NF-2}' < file.txt`
echo $NM
for (( i=1; i = $NM; i++ ))
do
echo $i
cut -f ${i} file.txt > tmpgrid_0${i}.dat
#awk '{print '$i'}' file.txt > tmpgrid_0${i}.dat
done
Any suggestions?
EDIT: Thank you very much to all of you. All answers were valid, but I cannot vote for all of them.
awk '{for(i=1;i<=5;i++){name=FILENAME"_"i;print $i> name}}' your_file
Tested with 5 columns:
> cat temp
PHE 5 2 4 6
PHE 5 4 6 4
PHE 5 4 2 8
TRP 7 5 5 9
TRP 7 5 7 1
TRP 7 5 7 3
TYR 2 4 4 4
TYR 2 4 4 0
TYR 2 4 5 3
> nawk '{for(i=1;i<=5;i++){name=FILENAME"_"i;print $i> name}}' temp
> ls -1 temp_*
temp_1
temp_2
temp_3
temp_4
temp_5
> cat temp_1
PHE
PHE
PHE
TRP
TRP
TRP
TYR
TYR
TYR
>
To summarise my comments, I suggest something like this (untested as I have no sample file):
NM=$(awk 'NR==1{print NF-2}' file.txt)
echo $NM
for (( i=1; i <= $NM; i++ ))
do
echo $i
awk '{print $'$i'}' file.txt > tmpgrid_0${i}.dat
done
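A slightly more robust variant of the loop body passes the column index to awk with -v instead of splicing the shell variable into the program (a sketch):
awk -v col="$i" '{ print $col }' file.txt > "tmpgrid_0${i}.dat"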
An alternative solution using tr and split
< file.txt tr ' ' '\n' | split -nr/200
This assumes that the file is space delimited, but the tr command could be tweaked as appropriate for any delimiter. Essentially this puts each entry on its own line, and then uses split's round robin version to write each 200th line to the same file.
paste -d' ' x* | cmp - file.txt
verifies that it worked if split is writing files with an x prefix.
I got this solution from Reuti on the coreutils mailing list.
