My data looks like this:
Adelaide Crows 5 2 3 0 450 455 460.67 8
Essendon 5 5 0 0 622 352 955.88 20
Fremantle 5 3 2 0 439 428 598.50 12
As you can tell, there is a mixture of spaces and tabs. I need to be able to sort the last column descending, so the output looks like this:
Essendon 5 5 0 0 622 352 955.88 20
Fremantle 5 3 2 0 439 428 598.50 12
Adelaide Crows 5 2 3 0 450 455 460.67 8
The whole data consists of all AFL teams.
Using sort, how can I achieve this? Am I right in attempting to use the $ character to start from the end of the line? I also need to sort by the second-to-last column after sorting the last column, so any duplicate numbers in the last column will be ordered by the second-to-last column. Code so far:
sort -n -t$'\t' -k 9,9 -k 8,8 tmp
How do I take into account that the football team names themselves contain whitespace?
A sample of the file being sorted (filename: 'tmp') is shown above.
You could copy the last field into the first position using awk first, then sort by the first field, then get rid of the first field using cut.
awk '{print($NF" "$0)}' sample.txt | sort -k1,1 -n -r -t' ' | cut -f2- -d' '
Port Adelaide 5 5 0 0 573 386 916.05 20
Essendon 5 5 0 0 622 352 955.88 20
Sydney Swans 5 4 1 0 533 428 681.68 16
Hawthorn 5 4 1 0 596 453 620.64 16
Richmond 5 3 2 0 499 445 579.68 12
..
..
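Note that the pipeline above does not break ties in the last column on the second-to-last column, which you also asked for. A sketch that copies the last two fields to the front (assuming the data file is named tmp, as in your own command) could be:
awk '{print $NF, $(NF-1), $0}' tmp | sort -k1,1nr -k2,2nr | cut -d' ' -f3-
awk's default field splitting handles the mix of spaces and tabs, and cut only strips the two copied fields, so team names keep their internal spaces.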
How do I change the code if I have more than two columns? Let's say the data is like this:
ifile.dat
1 10 15
3 34 20
1 4 22
3 32 33
5 3 46
2 2 98
4 20 100
3 13 23
4 50 65
1 40 76
2 20 22
How do I achieve this?
ofile.dat
1 40 76
2 20 98
3 34 33
4 50 100
5 3 46
I mean: take the maximum of each column, grouped by the value of the first column. Thanks.
Here is what I have tried (on a sample file with 13 columns), but the highest value is not coming up this way.
cat input.txt | sort -k1,1 -k2,2nr -k3,3nr -k4,4nr -k5,5nr -k6,6nr -k7,7nr -k8,8nr -k9,9nr -k10,10nr -nrk11,11 -nrk12,12 -nrk13,13 | sort -k1,1 -u
Instead of relying on sort, you could switch over to something more robust like awk:
awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_asc"} {for(i=2;i<=NF;++i) if (a[$1][i]<$i){a[$1][i]=$i}} END{n=asorti(a, asorted); for(col1 in asorted){print col1, a[col1][2], a[col1][3]}}' input.txt
That's a mouthful. It breaks down like:
Before processing the file, set the PROCINFO setting of sorted_in to @val_num_asc, since we will be sorting the contents of an array by its index, which will be numeric: (BEGIN{PROCINFO["sorted_in"] = "@val_num_asc"})
Loop through each field in the record for(i=2;i<=NF;++i)
Test whether the current value is greater than the value stored in the array, which is indexed by the first field's value. If it is greater, store it in place of whatever is currently held in the array for that index at that field's position: (if (a[$1][i]<$i){a[$1][i]=$i})
When done processing all of the records, sort the array into new array asorted: (END{n=asorti(a, asorted);)
Iterate through the array and print each element (for(col1 in asorted){print col1, a[col1][2], a[col1][3]})
There may be a more elegant way to do this in awk, but this will do the trick.
Example:
:~$ cat test
1 10 15
3 34 20
1 4 22
3 32 33
5 3 46
2 2 98
4 20 100
3 13 23
4 50 65
1 40 76
2 20 22
:~$ awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_asc"} {for(i=2;i<=NF;++i) if (a[$1][i]<$i){a[$1][i]=$i}} END{n=asorti(a, asorted); for(col1 in asorted){print col1, a[col1][2], a[col1][3]}}' test
1 40 76
2 20 98
3 34 33
4 50 100
5 3 46
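Note that arrays of arrays and PROCINFO["sorted_in"] are gawk extensions, and the hard-coded a[col1][2], a[col1][3] only prints two value columns. As a rough, portable sketch for an arbitrary number of columns (such as the 13-column file), you could keep the maxima in a flat array keyed by (group, field) and order the result afterwards:
awk '{seen[$1]=1; if (NF>maxnf) maxnf=NF; for (i=2; i<=NF; ++i) if (!(($1,i) in max) || max[$1,i]+0 < $i+0) max[$1,i]=$i} END{for (k in seen) {line=k; for (i=2; i<=maxnf; ++i) line=line OFS max[k,i]; print line}}' input.txt | sort -n
This assumes the first column is numeric (for the final sort -n) and prints an empty field wherever a group never had a value in some column.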
I have a very bulky file, about 1M lines, like this:
4001 168991 11191 74554 60123 37667 125750 28474
8 145 25 101 83 51 124 43
2985 136287 4424 62832 50788 26847 89132 19184
3 129 14 101 88 61 83 32 1 14 10 12 7 13 4
6136 158525 14054 100072 134506 78254 146543 41638
1 40 4 14 19 10 35 4
2981 112734 7708 54280 50701 33795 75774 19046
7762 339477 26805 148550 155464 119060 254938 59592
1 22 2 12 10 6 17 2
6 136 16 118 184 85 112 56 1 28 1 5 18 25 40 2
1 26 2 19 28 6 18 3
4071 122584 14031 69911 75930 52394 89733 30088
1 9 1 3 4 3 11 2 14 314 32 206 253 105 284 66
I want to remove rows that have a value less than 100 in the second column.
How to do this with sed?
I would use awk to do this. Example:
awk ' $2 >= 100 ' file.txt
this will only display the rows from file.txt whose second column ($2) is greater than or equal to 100.
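If you want to replace the file rather than just display the filtered rows, plain awk has no standard in-place option; a common pattern (a sketch, assuming your data is in file.txt) is to write to a temporary file and move it back:
awk '$2 >= 100' file.txt > file.txt.tmp && mv file.txt.tmp file.txt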
Use the following approach:
sed -E '/^\w+\s+([0-9]{1,2}|[0][0-9]+)\b/d' /tmp/test.txt
(replace /tmp/test.txt with your current file path)
([0-9]{1,2}|[0][0-9]+) - will match either a number from 0 to 99 OR digits with a leading zero (e.g. 012, 00982)
d - delete the pattern space;
-E(--regexp-extended) - Use extended regular expressions rather than basic regular expressions
To remove matched lines in place use -i option:
sed -i -E '/^\w+\s+([0-9]{1,2}|[0][0-9]+)\b/d' /tmp/test.txt
I have a set of files containing tab-separated values; the third-to-last line of each file holds my desired values. I have extracted that line with:
cat result1.tsv | tail -3 | head -1 > final1.tsv
cat result2.tsv | tail -3 | head -1 > final2.tsv
..... so on (I have almost 30-40 files)
I want the content of each final tsv file on its own line in a single new file.
I tried
cat final1.tsv final2.tsv > final.tsv
but this only works for a limited number of files; it is difficult to write out the names of all of them.
I tried to put the file names in a loop as variables, but that did not work.
final1.tsv contains:
270 96 284 139 271 331 915 719 591 1679 1751 1490 968 1363 1513 1184 1525 490 839 425 967 855 356
final2.tsv contains:
1 1 0 2 6 5 1 1 11 7 1 3 4 1 0 3 2 1 0 3 2 1 28
All the files (final1.tsv, final2.tsv, final3.tsv, final5.....) contain the same number of columns but different values.
I want the rows of each file merged into a new file like this:
final.tsv
final1 270 96 284 139 271 331 915 719 591 1679 1751 1490 968 1363 1513 1184 1525 490 839 425 967 855 356
final2 1 1 0 2 6 5 1 1 11 7 1 3 4 1 0 3 2 1 0 3 2 1 28
final3 270 96 284 139 271 331 915 719 591 1679 1751 1490 968 1363 1513 1184 1525 490 839 425 967 855 356
final4 1 1 0 2 6 5 1 1 11 7 1 3 4 1 0 3 2 1 0 3 2 1 28
here you go...
for f in final{1..4}.tsv
do
    printf '%s\t' "${f%.tsv}" >> final.tsv
    cat "$f" >> final.tsv
done
Try this:
rm -f final.tsv
for FILE in result*.tsv
do
    tail -3 "$FILE" | head -1 >> final.tsv
done
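If you also want each row prefixed with a name, as in your desired final.tsv, the two ideas can be combined into something like the following sketch (the prefix will be the result* base name, since those are the input files):
for f in result*.tsv
do
    printf '%s\t%s\n' "${f%.tsv}" "$(tail -n 3 "$f" | head -n 1)"
done > final.tsv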
As long as the files aren't enormous, it's simplest to read each file into an array and select the third record from the end.
This solves your problem for you. It looks for all files in the current directory that match result*.tsv and writes the required line from each of them to final.tsv
use strict;
use warnings 'all';

# Sort the result files by the number embedded in their names
my @results = sort {
    my ($aa, $bb) = map /(\d+)/, ($a, $b);
    $aa <=> $bb;
} glob 'result*.tsv';

open my $out_fh, '>', 'final.tsv' or die qq{Unable to open "final.tsv" for output: $!};

for my $result_file ( @results ) {

    open my $fh, '<', $result_file or die qq{Unable to open "$result_file" for input: $!};
    my @data = <$fh>;

    next unless @data >= 3;

    # Label the third-from-last line with the file's base name
    my ($name) = $result_file =~ /([^.]+)/;
    print { $out_fh } "$name\t$data[-3]";
}
I have a raw data file and I want to generate an output file, both of which are shown below. The rule is that column 1 of the output is column 2 of the raw file, and column 2 of the output is the mean of the last-column values over all raw-data lines whose second column matches. For example, value1 in the output is (25+24.3846+13.8972+1.33333+1)/5.
#Raw data
4 compiler-compiler 100000 99975 1 25
4 compiler-compiler 100000 99683 13 24.3846
4 compiler-compiler 100000 93649 457 13.8972
4 compiler-compiler 100000 99764 177 1.33333
4 compiler-compiler 100000 99999 1 1
4 compiler-sunflow 100000 99999 1 1
4 compiler-sunflow 100000 99674 11 29.6364
4 compiler-sunflow 100000 93467 423 15.4444
4 compiler-sunflow 100000 99694 159 1.92453
4 compiler-sunflow 100000 99938 4 15.5
4 compress 100000 99997 1 3
4 compress 100000 99653 10 34.7
4 compress 100000 93639 454 14.011
4 compress 100000 99666 173 1.93064
4 compress 100000 99978 4 5.5
4 serial 100000 99998 1 2
4 serial 100000 99932 6 11.3333
4 serial 100000 93068 460 15.0696
4 serial 100000 99264 206 3.57282
4 serial 100000 99997 3 1
4 sunflow 100000 99998 1 2
4 sunflow 100000 99546 18 25.2222
4 sunflow 100000 93387 481 13.7484
4 sunflow 100000 99752 189 1.31217
4 sunflow 100000 99974 4 6.5
4 xml-transfomer 100000 99994 1 6
4 xml-transfomer 100000 99964 3 12
4 xml-transfomer 100000 93621 463 13.7775
4 xml-transfomer 100000 99540 199 2.31156
4 xml-transfomer 100000 99986 2 7
4 xml-validation 100000 99996 1 4
4 xml-validation 100000 99563 16 27.3125
4 xml-validation 100000 93748 451 13.8625
4 xml-validation 100000 99716 190 1.49474
4 xml-validation 100000 99979 3 7
#Output data
compiler-compiler value1
....
xml-transfomer value2
xml-validation value3
I think awk and sed could work for this, but I do not know how to do it.
sed cannot be used here since it does not support math operations. It's a job for awk:
awk 'NR>1{c[$2]++;s[$2]+=$(NF)}END{for(i in c){print i,s[i]/c[i]}}' input.txt
Explanation:
NR>1 { c[$2]++; s[$2]+=$NF }
NR>1 means that the following block gets executed on all lines except the first line. $2 is the value of the second column. NF is the number of fields per line, so $NF contains the value of the last column. c and s are assoc arrays: c counts the occurrences of $2, and s keeps a running total of the numeric values in the last column, grouped by $2.
END {for(i in c){print i,s[i]/c[i]}}
END means the following action will take place after the last line of input has been processed. The for loop iterates through c and outputs the name and the mean for all indexes in c.
Output:
xml-validation 10.7339
compiler-compiler 13.123
serial 6.59514
sunflow 9.75655
xml-transfomer 8.21781
compiler-sunflow 12.7011
compress 11.8283
Note that you have no influence on the output order when using an assoc array. If you care about the output order you might use the following command:
awk 'NR>1 && $2!=n && c{print n, t/c; c=t=0} NR>1{n=$2; c++; t+=$NF} END{if (c) print n, t/c}' input.txt
This command does not use assoc arrays; it prints the stats for a group as soon as $2 changes. Note that this requires the input to be grouped (sorted) by $2.
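If the input were not already grouped by the second column, one sketch is to drop the header line first (so that sorting does not move it into the middle of the data) and sort before feeding the same kind of awk program:
tail -n +2 input.txt | sort -k2,2 | awk '$2!=n && c{print n, t/c; c=t=0} {n=$2; c++; t+=$NF} END{if (c) print n, t/c}'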
So I'm really new to using Linux and shell commands; help would really be appreciated!
I have a file of 1050 rows and 8 columns. Example:
anger 1 0 5 101 13 2 somesentenceofwords
anger 2 0 5 101 23 3 somesentenceofwords
anger 3 0 3 101 35 3 somesentenceofwords
anger 4 0 2 101 23 3 somesentenceofwords
arch 5 0 3 101 34 12 somesentenceofwords
arch 6 0 2 101 45 23 somesentenceofwords
arch 7 0 2 101 23 12 somesentenceofwords
hand 8 9 0 101 32 21 somesentenceofwords
hand 9 0 2 101 23 12 somesentenceofwords
What I want to do is: if the first column is the same for x rows, output the sum of the 6th column for those rows divided by the number of rows (essentially an average).
So in the example, since the first 4 rows are all anger, I want the average of the column-6 values for all rows with anger in column 1. It would compute (13 + 23 + 35 + 23) / 4. It would then do the same for arch, then hand, and so on.
Example output:
anger 23.5
arch 34
hand 27.5
I tried this just to see if I could do it for a single case, where the first column equals a specific string, but couldn't even get that to work.
$ awk '{if($1="anger"){sum+=$6} {print sum}}' filename
Is this possible?
Using awk:
awk '!($1 in s){b[++i]=$1; s[$1]=0} {c[$1]++; s[$1]+=$6}
END{for (k=1; k<=i; k++) printf "%s %.1f\n", b[k], s[b[k]]/c[b[k]]}' file
anger 23.5
arch 34.0
hand 27.5
Pretty straightforward with awk:
$ awk '{a[$1]+=$6;b[$1]++}END{for (i in a) print i,a[i]/b[i]}' file
hand 27.5
arch 34
anger 23.5
How does this work?
The block {a[$1]+=$6;b[$1]++} is executed for every line that is read. We create two maps, one storing the sum for each key and one storing the count for each key.
The block END{for (i in a) print i,a[i]/b[i]} is executed after all lines are read. We iterate over the keys of the first map and print each key together with the sum divided by the count (i.e. the mean).
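Note that for (i in a) visits the keys in an unspecified order, which is why the output above comes out as hand, arch, anger. If you want a fixed (alphabetical) order, a simple option is to pipe the result through sort:
awk '{a[$1]+=$6;b[$1]++}END{for (i in a) print i,a[i]/b[i]}' file | sort -k1,1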