Grouping data based on two patterns - sorting

I have a huge text file with 4 columns. The first column is a serial number, the second and third columns are coordinates, and the 4th column is a value. These are the values of a variable at cell nodes. I would like to average the 4 nodal values to get the cell value to be read by my code. For example, consider a 3 by 3 Cartesian grid of cells (a 4 by 4 grid of nodes) with the following data:
1 0. 0. 5e-4
2 0.1 0. 5e-3
3 0.2 0. 5e-4
4 0.3 0. 5e-3
5 0. 0.1 5e-5
6 0.1 0.1 5e-7
7 0.2 0.1 5e-5
8 0.3 0.1 5e-2
9 0. 0.2 5e-4
10 0.1 0.2 5e-3
11 0.2 0.2 5e-4
12 0.3 0.2 5e-3
13 0. 0.3 5e-5
14 0.1 0.3 5e-7
15 0.2 0.3 5e-5
16 0.3 0.3 5e-2
I would like to group lines in the following order:
1 0. 0. 5e-4
2 0.1 0. 5e-3
5 0. 0.1 5e-5
6 0.1 0.1 5e-7
2 0.1 0. 5e-3
3 0.2 0. 5e-4
6 0.1 0.1 5e-7
7 0.2 0.1 5e-5
3 0.2 0. 5e-4
4 0.3 0. 5e-3
7 0.2 0.1 5e-5
8 0.3 0.1 5e-2
5 0. 0.1 5e-5
6 0.1 0.1 5e-7
9 0. 0.2 5e-4
10 0.1 0.2 5e-3
6 0.1 0.1 5e-7
7 0.2 0.1 5e-5
10 0.1 0.2 5e-3
11 0.2 0.2 5e-4
and so on ...
There are two patterns in the above example. First, the lines (1,2,5,6), (2,3,6,7) and (3,4,7,8) form the groups for the first row of cells in my mesh. This is followed by the group (5,6,9,10), where we move on to the next row. Then the first pattern continues again ((6,7,10,11), (7,8,11,12) and so on).
I used the following sed command to extract groups of lines, but doing this individually is cumbersome considering the size of the data I have to handle:
sed -n -e 1,2p -e 5,6p fileName
How can I create a loop that covers both of the patterns I mentioned above?

This might work for you (GNU sed):
sed -n ':a;N;s/\n/&/5;Ta;P;s/[^\n]*\n//;h;P;s/.*\n\(.*\n.*\)/\1/p;g;ba' file |
sed '13~12,+3d'
This follows the pattern uniformly, i.e. lines 1,2 followed by lines 5,6, then lines 2,3 followed by lines 6,7, etc. The result is passed to a second invocation of sed that removes 4 lines every 12 lines, starting at line 13.
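Alternatively, here is a minimal awk sketch of the same grouping (a sketch under assumptions, not the answer's method: it assumes, as in the example, nx=4 nodes per mesh row and input already sorted by serial number; adjust nx for the real grid):
awk -v nx=4 '
{ line[NR] = $0 }                        # slurp all node lines
END {
    for (base = 1; base + nx + 1 <= NR; base++) {
        if (base % nx == 0) continue     # the last node of a row starts no cell
        print line[base]                 # lower-left node of the cell
        print line[base + 1]             # lower-right node
        print line[base + nx]            # upper-left node
        print line[base + nx + 1]        # upper-right node
    }
}' fileName
The same loop could instead sum the 4th fields of the four nodes and print sum/4, which is the averaging step the question ultimately needs.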

Related

How to sort data based on the value of a column for part (multiple lines) of a file?

My data in the file file1 looks like this:
3
0
2 0.5
1 0.8
3 0.2
3
1
2 0.1
3 0.8
1 0.4
3
2
1 0.8
2 0.4
3 0.3
Each block has the same number of rows (here 2 header lines + 3 data rows = 5). In each block, the first two lines are headers; the next 3 rows have two columns, where the first column is the label, a number from 1 to 3. I want to sort the rows in each block based on the value of the first column (except for the first two rows). So the expected result is
3
0
1 0.8
2 0.5
3 0.2
3
1
1 0.4
2 0.1
3 0.8
3
2
1 0.8
2 0.4
3 0.3
I thought sort -k 1 -n file1 would work for the whole file, but it gives me the wrong result:
0
1
2
3
3
3
2 0.1
3 0.2
3 0.3
1 0.4
2 0.4
2 0.5
1 0.8
1 0.8
3 0.8
This is not the expected result. How to sort each block is still a problem for me. I think awk can handle this. Please give some suggestions.
Apply the DSU (Decorate/Sort/Undecorate) idiom using any awk+sort+cut, regardless of how many lines are in each block:
$ awk -v OFS='\t' '
NF<pNF || NR==1 { blockNr++ }
{ print blockNr, NF, NR, (NF>1 ? $1 : NR), $0; pNF=NF }
' file |
sort -n -k1,1 -k2,2 -k4,4 -k3,3 |
cut -f5-
3
0
1 0.8
2 0.5
3 0.2
3
1
1 0.4
2 0.1
3 0.8
3
2
1 0.8
2 0.4
3 0.3
To understand what that's doing, just look at the first 2 steps:
$ awk -v OFS='\t' 'NF<pNF || NR==1{ blockNr++ } { print blockNr, NF, NR, (NF>1 ? $1 : NR), $0; pNF=NF }' file
1 1 1 1 3
1 1 2 2 0
1 2 3 2 2 0.5
1 2 4 1 1 0.8
1 2 5 3 3 0.2
2 1 6 6 3
2 1 7 7 1
2 2 8 2 2 0.1
2 2 9 3 3 0.8
2 2 10 1 1 0.4
3 1 11 11 3
3 1 12 12 2
3 2 13 1 1 0.8
3 2 14 2 2 0.4
3 2 15 3 3 0.3
$ awk -v OFS='\t' 'NF<pNF || NR==1{ blockNr++ } { print blockNr, NF, NR, (NF>1 ? $1 : NR), $0; pNF=NF }' file |
sort -n -k1,1 -k2,2 -k4,4 -k3,3
1 1 1 1 3
1 1 2 2 0
1 2 4 1 1 0.8
1 2 3 2 2 0.5
1 2 5 3 3 0.2
2 1 6 6 3
2 1 7 7 1
2 2 10 1 1 0.4
2 2 8 2 2 0.1
2 2 9 3 3 0.8
3 1 11 11 3
3 1 12 12 2
3 2 13 1 1 0.8
3 2 14 2 2 0.4
3 2 15 3 3 0.3
and notice that the awk command is just creating the key values that sort needs: block number, line number or $1, and so on. So awk Decorates the input, sort Sorts it, and cut Undecorates it by removing the decoration values that the awk script added.
You can use sort and arrays in gawk:
awk 'NF==1 && a[1]{
         n=asort(a)
         for(k=1; k<=n; k++) print a[k]
         delete a; i=1
     }
     NF==1{ print }
     NF==2{ a[i]=$0; ++i }
     END{ n=asort(a); for(k=1; k<=n; k++) print a[k] }
' file1
you get
3
0
1 0.8
2 0.5
3 0.2
3
1
1 0.4
2 0.1
3 0.8
3
2
1 0.8
2 0.4
3 0.3
This is similar to Ed Morton's solution, but without variable assignment; it uses only built-in variables instead:
λ cat input.txt
3
0
2 0.5
1 0.8
3 0.2
3
1
2 0.1
3 0.8
1 0.4
3
2
1 0.8
2 0.4
3 0.3
awk '{ print int((NR-1)/5), ((NR-1)%5<2) ? 0 : 1, (NF>1 ? $1 : NR), NR, $0 }' input.txt |
sort -n -k1,1 -k2,2 -k3,3 -k4,4 | cut -d ' ' -f5-
3
0
1 0.8
2 0.5
3 0.2
3
1
1 0.4
2 0.1
3 0.8
3
2
1 0.8
2 0.4
3 0.3
How it works
awk '{ print int((NR-1)/5), ((NR-1)%5<2) ? 0 : 1, (NF>1 ? $1 : NR), NR, $0 }' input.txt
0 0 1 1 3
0 0 2 2 0
0 1 2 3 2 0.5
0 1 1 4 1 0.8
0 1 3 5 3 0.2
1 0 6 6 3
1 0 7 7 1
1 1 2 8 2 0.1
1 1 3 9 3 0.8
1 1 1 10 1 0.4
2 0 11 11 3
2 0 12 12 2
2 1 1 13 1 0.8
2 1 2 14 2 0.4
2 1 3 15 3 0.3
A Ruby one-liner:
ruby -e '$<.read.split(/\n/).map(&:split).
    slice_when { |a, b| b.length == 1 && b.length < a.length }.
    map { |e| e.sort_by.with_index { |sl, i| [sl.length > 1 ? sl[0].to_i : -1, i] } }.
    each { |e| e.each { |x| puts x.join(" ") } }' file
Or, a DSU-style Ruby version:
ruby -lane 'BEGIN{ lines=[]; block=0; lnf=0 }
block += 1 if $F.length > 1 && lnf == 1
lnf = $F.length
lines << [block, ($F.length > 1 ? $F[0].to_i : 1.0/0), $.] + $F
END{ lines.sort.each { |sl| puts sl[3..].join(" ") } }
' file

How to find a sequence of numbers

I have a data file formatted like this:
0.00 0.00 0.00
 1 10 1.0
 2 12 1.0
 3 15 1.0
 4 20 0.0
 5 23 0.0
0.20 0.15 0.6
 1 12 1.0
 2 15 1.0
 3 20 0.0
 4 18 0.0
 5 20 0.0
0.001 0.33 0.15
 1 8 1.0
 2 14 1.0
 3 17 0.0
 4 25 0.0
 5 15 0.0
I need to remove some data and reorder the lines like this:
1 10
1 12
1 8

2 12
2 15
2 14

3 15
3 20
3 17

4 20
4 18
4 25

5 23
5 20
5 15
My code does not output anything. The problem might be in the grep command. Could you please help me out?
touch extract_file.txt
for (( i=1; i<=band; i++ ))
do
    sed -e '1, 7d' data_file | grep -w " '$(echo $i)' " | awk '{print $2}' > extract$(echo $i).txt
    paste -s extract_file.txt extract$(echo $i).txt > data
done
#rm eigen*.txt
The following code with comments:
cat <<EOF |
0.00 0.00 0.00
 1 10 1.0
 2 12 1.0
 3 15 1.0
 4 20 0.0
 5 23 0.0
0.20 0.15 0.6
 1 12 1.0
 2 15 1.0
 3 20 0.0
 4 18 0.0
 5 20 0.0
0.001 0.33 0.15
 1 8 1.0
 2 14 1.0
 3 17 0.0
 4 25 0.0
 5 15 0.0
EOF
# remove lines not starting with a space
grep -v '^[^ ]' |
# remove leading space
sed 's/^[[:space:]]*//' |
# remove third arg
sed 's/[[:space:]]*[^[:space:]]*$//' |
# stable sort on first number
sort -s -n -k1 |
# each time first number changes, print additional newline
awk '{ if(length(last) != 0 && last != $1) printf "\n"; print; last=$1}'
outputs:
1 10
1 12
1 8

2 12
2 15
2 14

3 15
3 20
3 17

4 20
4 18
4 25

5 23
5 20
5 15
Tested on repl.
Perl one-liner:
$ perl -lane 'push @{$nums{$F[0]}}, "@F[0,1]" if /^ /;
    END { for $n (sort { $a <=> $b } keys %nums) {
        print for @{$nums{$n}};
        print "" }}' input.txt
1 10
1 12
1 8

2 12
2 15
2 14

3 15
3 20
3 17

4 20
4 18
4 25

5 23
5 20
5 15
Basically, for each line starting with a space, use the first number as a key to a hash table that stores lists of the first two numbers, and print them out sorted by first number.
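The same idea also fits in a short awk script; this is a sketch under the assumption (true for this data) that the labels are consecutive integers starting at 1:
awk '/^ / { rows[$1] = rows[$1] $1 " " $2 "\n" }   # collect "label value" lines per label
     END { for (k = 1; k in rows; k++) printf "%s\n", rows[k] }' input.txt
Each rows[k] already ends in a newline, so the extra newline from printf produces the blank separator line between groups.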

Bash loop over multiple variables on the same data

I am trying to create models using multiple variables in a bash loop. I need to run several predictions using different r2 and p-value cutoffs for the same data. The r2 and p-value parameters are
cat parameters
0.2 1
0.2 5e-1
0.2 5e-2
0.2 5e-4
0.2 5e-6
0.2 5e-8
0.4 1
0.4 5e-1
0.4 5e-2
0.4 5e-4
0.4 5e-6
0.4 5e-8
0.6 1
0.6 5e-1
0.6 5e-2
0.6 5e-4
0.6 5e-6
0.6 5e-8
0.8 1
0.8 5e-1
0.8 5e-2
0.8 5e-4
0.8 5e-6
0.8 5e-8
The bash loop script I am using, test.sh, is
RSQ=$(cat parameters | awk '{print $1}')
PVAL=$(cat parameters | awk '{print $2}')
season=("spring summer fall winter")
for i in $season;
do
    echo prediction_${i}_${RSQ}_${PVAL}
done
The present output is
prediction_spring_0.2 0.2 0.2 0.2 0.2 0.2 0.4 0.4 0.4 0.4 0.4 0.4 0.6 0.6 0.6 0.6 0.6 0.6 0.8 0.8 0.8 0.8 0.8 0.8_1 5e-1 5e-2 5e-4 5e-6 5e-8 1 5e-1 5e-2 5e-4 5e-6 5e-8 1 5e-1 5e-2 5e-4 5e-6 5e-8 1 5e-1 5e-2 5e-4 5e-6 5e-8
prediction_summer_0.2 0.2 0.2 0.2 0.2 0.2 0.4 0.4 0.4 0.4 0.4 0.4 0.6 0.6 0.6 0.6 0.6 0.6 0.8 0.8 0.8 0.8 0.8 0.8_1 5e-1 5e-2 5e-4 5e-6 5e-8 1 5e-1 5e-2 5e-4 5e-6 5e-8 1 5e-1 5e-2 5e-4 5e-6 5e-8 1 5e-1 5e-2 5e-4 5e-6 5e-8
prediction_fall_0.2 0.2 0.2 0.2 0.2 0.2 0.4 0.4 0.4 0.4 0.4 0.4 0.6 0.6 0.6 0.6 0.6 0.6 0.8 0.8 0.8 0.8 0.8 0.8_1 5e-1 5e-2 5e-4 5e-6 5e-8 1 5e-1 5e-2 5e-4 5e-6 5e-8 1 5e-1 5e-2 5e-4 5e-6 5e-8 1 5e-1 5e-2 5e-4 5e-6 5e-8
prediction_winter_0.2 0.2 0.2 0.2 0.2 0.2 0.4 0.4 0.4 0.4 0.4 0.4 0.6 0.6 0.6 0.6 0.6 0.6 0.8 0.8 0.8 0.8 0.8 0.8_1 5e-1 5e-2 5e-4 5e-6 5e-8 1 5e-1 5e-2 5e-4 5e-6 5e-8 1 5e-1 5e-2 5e-4 5e-6 5e-8 1 5e-1 5e-2 5e-4 5e-6 5e-8
The desired output is
prediction_spring_0.2_1
prediction_spring_0.2_5e-1
prediction_spring_0.2_5e-2
prediction_spring_0.2_5e-4
prediction_spring_0.2_5e-6
prediction_spring_0.2_5e-8
prediction_spring_0.4_1
.......
prediction_winter_0.2_1
prediction_winter_0.2_5e-1
prediction_winter_0.2_5e-2
prediction_winter_0.2_5e-4
prediction_winter_0.2_5e-6
prediction_winter_0.2_5e-8
prediction_winter_0.4_1
..........
Your sample output is not complete enough, so I can imagine two solutions: 1) you intend every season to be paired with every RSQ value and every PVAL value; or 2) you want the listed (RSQ, PVAL) pairs, in file order, combined with each season (which is what your desired output appears to show).
Solution for #1: loop over the R and P lists (note that, read straight from the file, these lists contain repeats, so deduplicate them first, e.g. with sort -u):
for i in $season; do
    for r in $RSQ; do
        for p in $PVAL; do
            echo prediction_${i}_${r}_${p}
        done
    done
done
Solution for #2: read the file line by line:
for i in $season; do
    while read r p; do
        echo prediction_${i}_${r}_${p}
    done < parameters
done

Finding zeros and replacing them with another number in a matrix file by awk

I have a matrix where I want to replace every 0 with 0.1; the max score in each such line is then deducted by the total of the 0.1s added. No line will contain only zeros, since this is a probability matrix where each line adds up to 1. If the highest number occurs more than once (0.5 in this case), then either one can be changed, and the first line will always be the only one with letters in it. So the matrix below should go from
>ACTTT ASB 0.098
0 0 1 0
0.75 0 0.25 0
0 0 0 1
0 1 0 0
1 0 0 0
1 0 0 0
0 1 0 0
0 1 0 0
to
>ACTTT ASB 0.098
0.1 0.1 0.7 0.1
0.55 0.1 0.25 0.1
0.1 0.1 0.1 0.7
0.1 0.7 0.1 0.1
0.7 0.1 0.1 0.1
0.7 0.1 0.1 0.1
0.1 0.7 0.1 0.1
0.1 0.7 0.1 0.1
I tried to use something like this in a loop, based on previous answers here:
while read line ; do echo $line | awk 'NR>1{print gsub(/(^|[[:space:]])0([[:space:]]|$)/,"&")}'; echo $line | awk '{max=$2;for(i=3;i<=NF;i++)if($i>max)max=$i}END{print max}'; done < matrix_file
awk to the rescue!
$ awk -v eps=0.01 '
    function maxIx() {
        mI=1
        for(i=1;i<=NF;i++)
            if($mI<$i) mI=i
        return mI
    }
    NR>1 {
        mX=maxIx()
        for(i=1;i<=NF;i++)
            if($i==0) { $i=eps; $mX-=eps }
    }1' file
>ACTTT ASB 0.098
0.01 0.01 0.97 0.01
0.73 0.01 0.25 0.01
0.01 0.01 0.01 0.97
0.01 0.97 0.01 0.01
0.97 0.01 0.01 0.01
0.97 0.01 0.01 0.01
0.01 0.97 0.01 0.01
0.01 0.97 0.01 0.01
eps is configurable; as long as you pick a sensible value it should work fine, but note that it doesn't check for the max going below zero.
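A minimal guarded variant (a sketch, not part of the original answer; it assumes a row should be left unchanged when the deduction would push its maximum below zero, and uses eps=0.1 as in the question):
awk -v eps=0.1 'NR==1 { print; next }            # letter line passes through
{
    mI = 1; z = 0
    for (i = 1; i <= NF; i++) {                  # find the max index and count zeros
        if ($mI < $i) mI = i
        if ($i == 0) z++
    }
    if ($mI - z * eps >= 0) {                    # rewrite only if the max stays non-negative
        for (i = 1; i <= NF; i++)
            if ($i == 0) $i = eps
        $mI -= z * eps
    }
    print
}' file
With the question's data this reproduces the desired matrix (e.g. the row 0 0 1 0 becomes 0.1 0.1 0.7 0.1).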

Merge files with scientific notation data in the first column and how to use uniq

Two questions concerning the uniq command; please help.
First question
Say I have two files:
$ cat 1.dat
0.1 1.23
0.2 1.45
0.3 1.67
$ cat 2.dat
0.3 1.67
0.4 1.78
0.5 1.89
Using cat 1.dat 2.dat | sort -n | uniq > 3.dat, I am able to merge the two files into one. The result is:
0.1 1.23
0.2 1.45
0.3 1.67
0.4 1.78
0.5 1.89
But if 1.dat contains scientific notation,
$ cat 1.dat
1e-1 1.23
0.2 1.45
0.3 1.67
the result would be:
0.2 1.45
0.3 1.67
0.4 1.78
0.5 1.89
1e-1 1.23
which is not what I want. How can I make uniq understand that 1e-1 is a number, not a string?
Second question
Same as above, but this time let the first row of the second file 2.dat be slightly different (from 0.3 1.67 to 0.3 1.57):
$ cat 2.dat
0.3 1.57
0.4 1.78
0.5 1.89
Then the result would be:
0.1 1.23
0.2 1.45
0.3 1.67
0.3 1.57
0.4 1.78
0.5 1.89
My question is this: how can I make uniq detect repetition based only on the first column, keeping the value from the first file, so that the result is still:
0.1 1.23
0.2 1.45
0.3 1.67
0.4 1.78
0.5 1.89
Thanks
A more complex test case:
$ cat 1.dat
1e-6 -1.23
0.2 -1.45
110.7 1.55
0.3 1.67e-3
One awk (GNU awk) one-liner solves both of your problems:
awk '{a[$1*1];b[$1*1]=$0}END{asorti(a);for(i=1;i<=length(a);i++)print b[a[i]];}' file2 file1
Test with data (note that I made file1 unsorted and used 1.57 in file2, as you wanted):
kent$ head *
==> file1 <==
0.3 1.67
0.2 1.45
1e-1 1.23
==> file2 <==
0.3 1.57
0.4 1.78
0.5 1.89
kent$ awk '{a[$1*1];b[$1*1]=$0}END{asorti(a);for(i=1;i<=length(a);i++)print b[a[i]];}' file2 file1
1e-1 1.23
0.2 1.45
0.3 1.67
0.4 1.78
0.5 1.89
edit
display 0.1 instead of 1e-1:
kent$ awk '{a[$1*1];b[$1*1]=$2}END{asorti(a);for(i=1;i<=length(a);i++)print a[i],b[a[i]];}' file2 file1
0.1 1.23
0.2 1.45
0.3 1.67
0.4 1.78
0.5 1.89
edit 2
For the precision: awk's default (OFMT) is %.6g, which you can change. But if you want to display different precision on different lines, you have to use a bit of a trick:
(I added 1e-9 in file1)
kent$ awk '{id=sprintf("%.9f",$1*1);sub(/0*$/,"",id);a[id];b[id]=$2}END{asorti(a);for(i=1;i<=length(a);i++)print a[i],b[a[i]];}' file2 file1
0.000000001 1.23
0.2 1.45
0.3 1.67
0.4 1.78
0.5 1.89
If you want to display the same precision for all lines:
kent$ awk '{id=sprintf("%.9f",$1*1);a[id];b[id]=$2}END{asorti(a);for(i=1;i<=length(a);i++)print a[i],b[a[i]];}' file2 file1
0.000000001 1.23
0.200000000 1.45
0.300000000 1.67
0.400000000 1.78
0.500000000 1.89
For the first question only:
cat 1.dat 2.dat | sort -g -u
1e-1 1.23
0.2 1.45
0.3 1.67
0.4 1.78
0.5 1.89
man sort
-g, --general-numeric-sort
       compare according to general numerical value
-u, --unique
       with -c, check for strict ordering; without -c, output only the first of an equal run
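For the second question, a hedged variant of the same approach: restrict the sort key to the first column so that -u deduplicates on that column alone. With GNU sort, -s makes the ordering stable, so listing 1.dat first should keep its line when a key is duplicated:
# dedupe on column 1 only; with a stable sort the first input line of each key should win
cat 1.dat 2.dat | sort -s -g -k1,1 -u
which should keep 0.3 1.67 from 1.dat and drop 0.3 1.57 from 2.dat.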
To change the scientific notation to decimal, I resorted to Python:
#!/usr/bin/env python
import sys
import glob

infiles = []
for a in sys.argv[1:]:
    infiles.extend(glob.glob(a))

for f in infiles:
    with open(f) as fd:
        for line in fd:
            x, y = map(float, line.strip().split())
            print(x, y)
output:
$ ./sn.py 1.dat 2.dat
0.1 1.23
0.2 1.45
0.3 1.67
0.3 1.67
0.4 1.78
0.5 1.89
