How can I use shell awk to get this dataset? - shell

Thank you for your interest.
The original data
land_cover_classes rows columns LandCoverDist
"1 of 18" "1 of 720" "1 of 1440" 20
"1 of 18" "1 of 720" "2 of 1440" 0
"1 of 18" "1 of 720" "3 of 1440" 0
"10 of 18" "1 of 720" "4 of 1440" 1
"9 of 18" "110 of 720" "500 of 1440" 0
"1 of 18" "1 of 720" "6 of 1440" 354
"1 of 18" "1 of 720" "7 of 1440" 0
"1 of 18" "1 of 720" "8 of 1440" 0
"1 of 18" "720 of 720" "1440 of 1440" 0
And the expected output should be
land_cover_classes rows columns LandCoverDist
1 1 1 20
......
9 110 500 0
1 1 6 354
......
1 720 1440 0

$ awk -F'["[:space:]]+' 'NR>1{$0 = $2 OFS $5 OFS $8 OFS $11} 1' file
land_cover_classes rows columns LandCoverDist
1 1 1 20
1 1 2 0
1 1 3 0
10 1 4 1
9 110 500 0
1 1 6 354
1 1 7 0
1 1 8 0
1 720 1440 0
$ awk -F'["[:space:]]+' 'NR>1{$0 = $2 OFS $5 OFS $8 OFS $11} 1' file | column -t
land_cover_classes rows columns LandCoverDist
1 1 1 20
1 1 2 0
1 1 3 0
10 1 4 1
9 110 500 0
1 1 6 354
1 1 7 0
1 1 8 0
1 720 1440 0
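To see why the fields picked above are 2, 5, 8 and 11, it can help to dump every field that this separator produces. The following is a small sketch; the function name show_fields is made up for illustration and is not part of the answer.

```shell
# show_fields is a hypothetical helper name used only for this sketch.
# It prints every field of each input line as index=[value], using the
# same field separator as the command above (runs of quotes and whitespace).
show_fields() {
    awk -F'["[:space:]]+' '{
        for (i = 1; i <= NF; i++)
            printf "%d=[%s]%s", i, $i, (i < NF ? " " : "\n")
    }'
}

# One data row from the question.
printf '%s\n' '"9 of 18" "110 of 720" "500 of 1440" 0' | show_fields
```

Because each line starts with a quote, field 1 is empty, and the words "of" and the totals occupy the fields in between, so the wanted numbers land in fields 2, 5, 8 and 11.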

Awk solution:
awk 'BEGIN{ FS="\"[[:space:]]+"; OFS="\t" }
function get_num(n){
gsub(/^"| of.*/,"",n);
return n
}
NR==1; NR>1{ print get_num($1), get_num($2), get_num($3), $4 }' file
The output:
land_cover_classes rows columns LandCoverDist
1 1 1 20
1 1 2 0
1 1 3 0
10 1 4 1
9 110 500 0
1 1 6 354
1 1 7 0
1 1 8 0
1 720 1440 0

$ awk '
BEGIN { FS="\" *\"?" }
NR==1 # print header
{
for(i=2;i<=NF;i++) { # starting from the second field
split($i,a," of ") # split at _of_
printf "%s%s", a[1], (i==NF?ORS:OFS) # print the first part and separator
}
}' file
land_cover_classes rows columns LandCoverDist
1 1 1 20
1 1 2 0
1 1 3 0
10 1 4 1
9 110 500 0
1 1 6 354
1 1 7 0
1 1 8 0
1 720 1440 0

Related

Print values from both files using awk as vlookup

I have two files, file1:
1 I0626_all 0 0 1 1
2 I0627_all 0 0 2 1
3 I1137_all_published 0 0 1 1
4 I1859_all 0 0 2 1
5 I2497_all 0 0 2 1
6 I2731_all 0 0 1 1
7 I4451_all 0 0 1 1
8 I0626 0 0 1 1
9 I0627 0 0 2 1
10 I0944 0 0 2 1
and file 2:
I0626_all 1 138
I0627_all 1 139
I1137_all_published 1 364
I4089 1 365
AfontovaGora2.SG 1 377
AfontovaGora3_d 1 378
At the end I want
1 I0626_all 138
2 I0627_all 139
3 I1137_all_published 364
I tried using:
awk 'NR==FNR{a[$1]=$2;next} {b[$3]} {print $1,$2,b[$3]}' file2 file1
But it doesn't work.
You may use this awk:
awk 'NR == FNR {map[$1] = $NF; next} $2 in map {print $1, $2, map[$2]}' file2 file1
1 I0626_all 138
2 I0627_all 139
3 I1137_all_published 364
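As a rough explanation of that command: while the first file (file2) is being read, NR == FNR is true, so the last column is stored under the name in column 1. For the second file (file1), a line is printed only if its second column is a known name. The attempt in the question stored $2 instead of $3 and then looked values up in an array b that was never filled. Below is a commented sketch of the same idea; the wrapper name lookup and the temporary files are assumptions made for the demo.

```shell
# lookup is a hypothetical wrapper name used only for this sketch.
# $1: the lookup table (file2 in the question), $2: the file to annotate (file1).
lookup() {
    awk '
        NR == FNR { map[$1] = $NF; next }      # first file: name -> last column
        $2 in map { print $1, $2, map[$2] }    # second file: keep only known names
    ' "$1" "$2"
}

# Small demo with a few rows from the question, written to temporary files.
tmp=$(mktemp -d)
printf '%s\n' 'I0626_all 1 138' 'I4089 1 365' > "$tmp/file2"
printf '%s\n' '1 I0626_all 0 0 1 1' '8 I0626 0 0 1 1' > "$tmp/file1"
lookup "$tmp/file2" "$tmp/file1"
rm -rf "$tmp"
```

Rows of file1 whose name does not appear in file2 (such as I0626 here) are simply skipped.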

How to sort data based on the value of a column for part (multiple lines) of a file?

My data in the file file1 look like
3
0
2 0.5
1 0.8
3 0.2
3
1
2 0.1
3 0.8
1 0.4
3
2
1 0.8
2 0.4
3 0.3
Each block has the same number of rows (here 3 + 2 = 5). In each block, the first two lines are a header and the next 3 rows have two columns; the first column is a label, one of the numbers from 1 to 3. I want to sort the rows in each block by the value of the first column, leaving the first two rows in place. So the expected result is
3
0
1 0.8
2 0.5
3 0.2
3
1
1 0.4
2 0.1
3 0.8
3
2
1 0.8
2 0.4
3 0.3
I thought sort -k 1 -n file1 would work, but it sorts the whole file at once.
It gives me the wrong result:
0
1
2
3
3
3
2 0.1
3 0.2
3 0.3
1 0.4
2 0.4
2 0.5
1 0.8
1 0.8
3 0.8
This is not the expected result.
I still don't know how to sort each block separately. I think this can be done with AWK. Please give some suggestions.
Apply the DSU (Decorate/Sort/Undecorate) idiom using any awk+sort+cut, regardless of how many lines are in each block:
$ awk -v OFS='\t' '
NF<pNF || NR==1 { blockNr++ }
{ print blockNr, NF, NR, (NF>1 ? $1 : NR), $0; pNF=NF }
' file |
sort -n -k1,1 -k2,2 -k4,4 -k3,3 |
cut -f5-
3
0
1 0.8
2 0.5
3 0.2
3
1
1 0.4
2 0.1
3 0.8
3
2
1 0.8
2 0.4
3 0.3
To understand what that's doing, just look at the first 2 steps:
$ awk -v OFS='\t' 'NF<pNF || NR==1{ blockNr++ } { print blockNr, NF, NR, (NF>1 ? $1 : NR), $0; pNF=NF }' file
1 1 1 1 3
1 1 2 2 0
1 2 3 2 2 0.5
1 2 4 1 1 0.8
1 2 5 3 3 0.2
2 1 6 6 3
2 1 7 7 1
2 2 8 2 2 0.1
2 2 9 3 3 0.8
2 2 10 1 1 0.4
3 1 11 11 3
3 1 12 12 2
3 2 13 1 1 0.8
3 2 14 2 2 0.4
3 2 15 3 3 0.3
$ awk -v OFS='\t' 'NF<pNF || NR==1{ blockNr++ } { print blockNr, NF, NR, (NF>1 ? $1 : NR), $0; pNF=NF }' file |
sort -n -k1,1 -k2,2 -k4,4 -k3,3
1 1 1 1 3
1 1 2 2 0
1 2 4 1 1 0.8
1 2 3 2 2 0.5
1 2 5 3 3 0.2
2 1 6 6 3
2 1 7 7 1
2 2 10 1 1 0.4
2 2 8 2 2 0.1
2 2 9 3 3 0.8
3 1 11 11 3
3 1 12 12 2
3 2 13 1 1 0.8
3 2 14 2 2 0.4
3 2 15 3 3 0.3
and notice that the awk command is just creating the key values that you need for sort to sort on by block number, line number or $1, etc. So awk Decorates the input, sort Sorts it, and cut Undecorates it by removing the decoration values that the awk script added.
You can use sort and arrays in gawk
awk 'NF==1 && a[1]{
n=asort(a);
for(k=1; k<=n; k++){print a[k]};
delete a; i=1
}NF==1{print}
NF==2{a[i]=$0;++i}
END{n=asort(a); for(k=1; k<=n; k++){print a[k]}}
' file1
you get
3
0
1 0.8
2 0.5
3 0.2
3
1
1 0.4
2 0.1
3 0.8
3
2
1 0.8
2 0.4
3 0.3
This is similar to Ed Morton's solution but without variable assignment, it uses only built-in variables instead:
λ cat input.txt
3
0
2 0.5
1 0.8
3 0.2
3
1
2 0.1
3 0.8
1 0.4
3
2
1 0.8
2 0.4
3 0.3
awk '{ print int((NR-1)/5), ((NR-1)%5<2) ? 0 : 1, (NF>1 ? $1 : NR), NR, $0 }' input.txt |
sort -n -k1,1 -k2,2 -k3,3 -k4,4 | cut -d ' ' -f5-
3
0
1 0.8
2 0.5
3 0.2
3
1
1 0.4
2 0.1
3 0.8
3
2
1 0.8
2 0.4
3 0.3
How it works:
awk '{ print int((NR-1)/5), ((NR-1)%5<2) ? 0 : 1, (NF>1 ? $1 : NR), NR, $0 }' input.txt
0 0 1 1 3
0 0 2 2 0
0 1 2 3 2 0.5
0 1 1 4 1 0.8
0 1 3 5 3 0.2
1 0 6 6 3
1 0 7 7 1
1 1 2 8 2 0.1
1 1 3 9 3 0.8
1 1 1 10 1 0.4
2 0 11 11 3
2 0 12 12 2
2 1 1 13 1 0.8
2 1 2 14 2 0.4
2 1 3 15 3 0.3
A ruby:
ruby -e '$<.read.split(/\n/).map(&:split).
slice_when { |a, b| b.length == 1 && b.length < a.length }.
map{|e| e.sort_by{|sl| sl.length()>1 ? -sl[-1].to_f : -1.0/0}}.
each{|e| e.each{|x| puts "#{x.join(" ")}"}}' file
Or, a DSU form ruby:
ruby -lane 'BEGIN{lines=[]; block=0; lnf=0}
block+=1 if $F.length()>1 && lnf==1
lnf=$F.length()
lines << [block, -($F.length()>1 ? $F[-1].to_f : (-1.0/0)), $.] + $F
END{lines.sort().each{|sl| puts "#{sl[3..].join(" ")}"}}
' file

Print each two column together from a matrix

I have a matrix:
$cat ifile.txt
2 3 4 5 10 0 2 2 0 1 0 0 0 1
0 3 4 6 2 0 2 0 0 0 0 1 2 3
0 0 0 2 3 0 3 0 3 1 2 3 1 0
It has 14 columns in total, e.g. A1 B1 A2 B2 A3 B3 A4 B4 A5 B5 A6 B6 A7 B7. The odd-numbered columns correspond to A and the even-numbered columns correspond to B.
I would like to print all A values in one column and all B values in another. So my desired file looks like:
$cat ofile.txt
2 3
0 3
0 0
4 5
4 6
0 2
10 0
2 0
3 0
2 0
0 0
0 3
....
I can do it manually in the following way, but I am looking for an easier way to do it.
for c in 1 3 5 7 9 11 13; do
awk -v c="$c" '{printf "%5s %5s\n", $c, $(c+1)}' ifile.txt > "A$c.txt"
done
cat A1.txt A3.txt A5.txt A7.txt A9.txt A11.txt A13.txt > ofile.txt
$ cat tst.awk
{
for ( i=1; i<=NF; i++ ) {
a[NR,i] = $i
}
}
END {
for ( i=1; i<=NF; i+=2 ) {
for (j=1; j<=NR; j++ ) {
print a[j,i], a[j,i+1]
}
}
}
$ awk -f tst.awk file
2 3
0 3
0 0
4 5
4 6
0 2
10 0
2 0
3 0
2 2
2 0
3 0
0 1
0 0
3 1
0 0
0 1
2 3
0 1
2 3
1 0
If you want to generalize for more than 2 output columns:
$ cat tst.awk
BEGIN { n=(n ? n : 2) }
{
for (i=1; i<=NF; i++) {
a[NR,i] = $i
}
}
END {
for ( i=1; i<=NF; i+=n ) {
for (j=1; j<=NR; j++) {
for ( k=1; k<=n; k++ ) {
printf "%s%s", a[j,i+k-1], (k<n ? OFS : ORS)
}
}
}
}
$ awk -v n=2 -f tst.awk file
2 3
0 3
0 0
4 5
4 6
0 2
10 0
2 0
3 0
2 2
2 0
3 0
0 1
0 0
3 1
0 0
0 1
2 3
0 1
2 3
1 0
$ awk -v n=7 -f tst.awk file
2 3 4 5 10 0 2
0 3 4 6 2 0 2
0 0 0 2 3 0 3
2 0 1 0 0 0 1
0 0 0 0 1 2 3
0 3 1 2 3 1 0

How to subtract every nth from (n+3)th line in awk?

I have 4-column data files of approximately 100 lines each. I'd like to subtract every nth line from the (n+3)th line and print the values in a new column ($5). The data does not follow a regular pattern in any column.
My sample file:
cat input
1 2 3 20
1 2 3 10
1 2 3 5
1 2 3 20
1 2 3 30
1 2 3 40
1 2 3 .
1 2 3 .
1 2 3 . (and so on)
Output should be:
1 2 3 20 0 #(20-20)
1 2 3 10 20 #(30-10)
1 2 3 5 35 #(40-5)
1 2 3 20 ? #(. - 20)
1 2 3 30 ? #(. - 30)
1 2 3 40 ? #(. - 40)
1 2 3 .
1 2 3 .
1 2 3 . (and so on)
How can I do this in awk?
Thank you
For this I think the easiest thing is to read through the file twice. The first time (the NR==FNR block) we save all the 4th column values in an array indexed by the line number. The next block is executed for the second pass and creates a 5th column with the desired calculation (checking first to make sure that we wouldn't go past the end of the file).
$ cat input
1 2 3 20
1 2 3 10
1 2 3 5
1 2 3 20
1 2 3 30
1 2 3 40
$ awk 'NR==FNR{a[NR]=$4; last=NR; next} {$5 = (FNR+3 <= last ? a[FNR+3] - $4 : "")}1' input input
1 2 3 20 0
1 2 3 10 20
1 2 3 5 35
1 2 3 20
1 2 3 30
1 2 3 40
You can do this using tac + awk + tac:
tac input |
awk '{a[NR]=$4} NR>3 { $5 = (a[NR-3] ~ /^[0-9]+$/ ? a[NR-3] - $4 : "?") } 1' |
tac | column -t
1 2 3 20 0
1 2 3 10 20
1 2 3 5 35
1 2 3 20 ?
1 2 3 30 ?
1 2 3 40 ?
1 2 3 .
1 2 3 .
1 2 3 .
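Since the files are only about 100 lines long, another option is to buffer everything in a single pass and do the subtraction in an END block. This is a different technique from the two answers above, shown only as a sketch: the function name diff_ahead and the variable k are made up for illustration, and it assumes every 4th column is numeric (a literal "." would be treated as 0).

```shell
# diff_ahead is a hypothetical name; the offset k=3 comes from the question.
# Reads the named files (or standard input) and appends column 4 of the
# line k rows further down minus column 4 of the current line.
diff_ahead() {
    awk -v k=3 '
        { line[NR] = $0; val[NR] = $4 }    # buffer every line and its 4th column
        END {
            for (i = 1; i <= NR; i++)
                print line[i], (i + k <= NR ? val[i + k] - val[i] : "?")
        }
    ' "$@"
}

# The six numeric rows from the question.
printf '%s\n' '1 2 3 20' '1 2 3 10' '1 2 3 5' '1 2 3 20' '1 2 3 30' '1 2 3 40' |
    diff_ahead
```

With this version the last three lines of the input always get "?", because there is no line three rows below them.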

Maths in a while loop causing random negative numbers

I have done this in both Python and bash. The code I am about to post probably has plenty wrong with it, but it is very basic and I cannot see why it would cause the 'bug' I describe below. My Python version is much cleaner, yet it shows the same problem: at some point the maths generates a negative number, which makes no sense.
#!/bin/bash
while [ 1 ];
do
zero=0
ARRAY=()
ARRAY2=()
first=`command to generate a list of numbers`
sleep 1
second=`command to generate a list of numbers`
# so now we have two data sets, 1 second between the capture of each.
for i in $first;
do
ARRAY+=($i)
done
for i in $second;
do
ARRAY2+=($i)
done
for (( c=$zero; c<=${#ARRAY2[@]}; c++ ))
do
expr ${ARRAY2[$c]} - ${ARRAY[$c]}
done
ARRAY=()
ARRAY2=()
zero=0
c=0
first=``
second=``
math=''
done
So the script grabs a set of data, waits 1 second, grabs it again, subtracts the first set from the second, and prints the differences. It's very simple, and I have done it elegantly in Python too. However I do it, every now and then (anywhere from 3 to 30 loops in) we get negative numbers, like so:
START 0 0 0 0 0 19 10 563 0
-34 19 14 2 0
-1302 1198
-532 639
-1078 1119 1 0 0
-843 33 880 0 5
-8
-13508 8773 4541 988 181
-12
-205 217
-9 7 1
-360 303 60 1 0 0
-12
-96 98 3
-870 904
-130
-2105 2264 6
-3084 1576 1650
-939 971
-2249 1150 1281
-693 9 513 142 76 expr: syntax error
Please help, I simply can't find anything about this.
Sample OUTPUT as requested:
ARRAY1 OUTPUT
1 15 1 25 25 1 2 1 3541 853 94567 42 5 1 351 51 1 11 1 13 7 14 12 3999 983 5 1938 3 8287 40 1 1 1 5253 706 1 1 1 1 5717 3 50 1 85 100376 17334 4655 1 1345 2 1 16 1777 1 3 38 23 8 32 47 781 947 1 1 206 9 1 3 2 81 2602 7 158 1 1 43 91 1 120 6589 6 2534 1092 1 6014 7 2 2 37 1 1 1 80 2 1 1270 15448 66 1 10238 1 10794 16061 4 1 1 1 9754 5617 1123 926 3 24 10 16
ARRAY2 OUTPUT
1 15 1 25 25 1 2 1 3555 859 95043 42 5 1 355 55 1 11 1 13 7 14 12 4015 987 5 1938 3 8335 40 1 1 1 5280 706 1 1 1 1 5733 3 50 1 85 100877 17396 4691 1 1353 2 1 16 1782 1 3 38 23 8 32 47 787 947 1 1 206 9 1 3 2 81 2602 7 159 1 1 43 91 1 120 6869 6 2534 1092 1 6044 7 2 2 37 1 1 1 80 2 1 1270 15563 66 1 10293 1 10804 16134 4 1 1 1 9755 5633 1135 928 3 24 10 16
START
The answer lies in Russell Uhl's comment above. Your loop runs one time too many (this is your code):
for (( c=$zero; c<=${#ARRAY2[@]}; c++ ))
do
expr ${ARRAY2[$c]} - ${ARRAY[$c]}
done
To fix it, change the test condition from c <= ${#ARRAY2[@]} to c < ${#ARRAY2[@]}:
for (( c=$zero; c < ${#ARRAY2[@]}; c++ ))
do
echo $((${ARRAY2[$c]} - ${ARRAY[$c]}))
done
I've also replaced expr with the shell's built-in arithmetic expansion, $((...)).
The test script (sum.sh):
#!/bin/bash
zero=0
ARRAY=()
ARRAY2=()
first="1 15 1 25 25 1 2 1 3541 853 94567 42 5 1 351 51 1 11 1 13 7 14 12 3999 983 5 1938 3 8287 40 1 1 1 5253 706 1 1 1 1 5717 3 50 1 85 100376 17334 4655 1 1345 2 1 16 1777 1 3 38 23 8 32 47 7
second="1 15 1 25 25 1 2 1 3555 859 95043 42 5 1 355 55 1 11 1 13 7 14 12 4015 987 5 1938 3 8335 40 1 1 1 5280 706 1 1 1 1 5733 3 50 1 85 100877 17396 4691 1 1353 2 1 16 1782 1 3 38 23 8 32 47
for i in $first; do
ARRAY+=($i)
done
# Alternately as chepner suggested:
ARRAY2=($second)
for (( c=$zero; c < ${#ARRAY2[@]}; c++ )); do
echo -n $((${ARRAY2[$c]} - ${ARRAY[$c]})) " "
done
Running it:
samveen@precise:/tmp$ echo $BASH_VERSION
4.2.25(1)-release
samveen@precise:/tmp$ bash sum.sh
0 0 0 0 0 0 0 0 14 6 476 0 0 0 4 4 0 0 0 0 0 0 0 16 4 0 0 0 48 0 0 0 0 27 0 0 0 0 0 16 0 0 0 0 501 62 36 0 8 0 0 0 5 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 280 0 0 0 0 30 0 0 0 0 0 0 0 0 0 0 0 115 0 0 55 0 10 73 0 0 0 0 1 16 12 2 0 0 0 0
EDIT:
* Added improvements from suggestions in comments.
I think the problem has to be when the two arrays don't have the same size. It's easy to reproduce that syntax error -- one of the operands for the minus operator is an empty string:
$ a=5; b=3; expr $a - $b
2
$ a=""; b=3; expr $a - $b
expr: syntax error
$ a=5; b=""; expr $a - $b
expr: syntax error
$ a=""; b=""; expr $a - $b
-
Try
ARRAY=( $(command to generate a list of numbers) )
sleep 1
ARRAY2=( $(command to generate a list of numbers) )
if (( ${#ARRAY[@]} != ${#ARRAY2[@]} )); then
echo "error: different size arrays!"
echo "ARRAY: ${#ARRAY[@]} (${ARRAY[*]})"
echo "ARRAY2: ${#ARRAY2[@]} (${ARRAY2[*]})"
fi
"The error occurs whenever the first array is smaller than the second" -- of course. You're looping from 0 to the array size of ARRAY2. When ARRAY has fewer elements, you'll eventually try to access an index that does not exist in the array. When you try to reference an unset variable, bash gives you the empty string.
$ a=(1 2 3)
$ b=(4 5 6 7)
$ i=2; expr ${a[i]} - ${b[i]}
-3
$ i=3; expr ${a[i]} - ${b[i]}
expr: syntax error
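One way to avoid reading past the end of the shorter array is to loop only over the indexes that exist in both. The following is a minimal bash sketch of that idea, not a drop-in replacement for the original script: the function name diff_lists is made up, and it assumes both inputs are whitespace-separated lists of plain decimal integers.

```shell
# diff_lists is a hypothetical helper name used only for this sketch (bash only).
# $1 and $2 are whitespace-separated lists of integers (first and second sample).
diff_lists() {
    local -a a=($1) b=($2) out=()    # unquoted on purpose: split lists into elements
    local i n=${#a[@]}
    # Use the smaller of the two sizes as the loop bound.
    (( ${#b[@]} < n )) && n=${#b[@]}
    for (( i = 0; i < n; i++ )); do
        out+=( $(( b[i] - a[i] )) )
    done
    echo "${out[*]}"
    # Report a size mismatch on stderr instead of failing inside the loop.
    if (( ${#a[@]} != ${#b[@]} )); then
        echo "warning: sizes differ (${#a[@]} vs ${#b[@]})" >&2
    fi
}

diff_lists "1 15 3541" "1 15 3555 859"
```

Here the second list has one extra element, so only the first three pairs are subtracted and a warning goes to stderr. When the sizes differ, the elements may also no longer line up pair by pair, so the warning is worth checking before trusting the differences.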
