find smallest $2 for every unique $1 [duplicate] - bash

This question already has answers here:
Sort and keep a unique duplicate which has the highest value
(3 answers)
Closed 7 years ago.
I am trying to obtain the smallest $2 value for every $1 value. My data looks as follows:
0 0
23.9901 13.604
23.9901 13.604
23.9901 3.364
23.9901 3.364
24.054 18.5279
25.0981 17.4839
42.582 0
45.79 0
45.79 15.36
45.7902 12.1518
51.034 12.028
54.11 14.072
54.1102 14.0718
The output must look like:
0 0
23.9901 3.364
24.054 18.5279
25.0981 17.4839
42.582 0
45.79 0
45.7902 12.1518
51.034 12.028
54.11 14.072
54.1102 14.0718
I can manage this by creating multiple files for each $1 value and finding the min in each file. But I am wondering if there might be a more elegant solution for this?
Thanks.

With GNU or FreeBSD sort, you can do it as follows:
sort -k1,1 -k2,2g file | sort -k1,1g -su
The first sort orders the file by first and then second column value. The second sort uniquifies the file (-u) using only the first column to determine uniqueness; it also uses the -s flag to guarantee that, within each group of equal first columns, the line kept is the first from the previous pass, i.e. the one with the smallest second column. In both cases, the sort uses the -g flag where it matters (see below), which does general numeric comparison, unlike the POSIX-standard -n flag, which only compares leading integers.
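As a quick sanity check, the pipeline reproduces the expected output on the sample data from the question (a sketch; `file` is recreated here with a here-document):

```shell
# Recreate the sample input from the question.
cat > file <<'EOF'
0 0
23.9901 13.604
23.9901 13.604
23.9901 3.364
23.9901 3.364
24.054 18.5279
25.0981 17.4839
42.582 0
45.79 0
45.79 15.36
45.7902 12.1518
51.034 12.028
54.11 14.072
54.1102 14.0718
EOF

# Pass 1: order by $1 (lexicographic), then $2 (general numeric).
# Pass 2: stable (-s) unique (-u) on $1 keeps the first, i.e. the
# smallest-$2, line of each group.
sort -k1,1 -k2,2g file | sort -k1,1g -su
```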
Performance note (and thanks to the OP for spurring me to do the measurements):
Leaving the g off of -k1,1 in the first sort is not a typo; it actually speeds the sort up considerably (on large files, with GNU sort). Standard lexicographic or integer (-n) sorts are much faster than general numeric sorts, perhaps 10 times as fast. However, all key types are about twice as fast on files which are "mostly sorted". For more-or-less uniformly sampled random numbers, a lexicographic sort is a close approximation to a general numeric sort; close enough that the result shows the "mostly sorted" speed-up.
It would have been possible to sort only by the second field in the first pass: sort -k2,2g file | sort -k1,1g -su, but this is much slower, both because the primary sort of the first pass is then general numeric instead of lexicographic and because the file is no longer mostly sorted for the second pass.
Here's just one sample point, although I did a few tests with similar results. The input file consists of 299,902 lines, each containing two numbers in the range 0 to 1,000,000, with three decimal digits. There are precisely 100,000 distinct numbers in the first column; each appears from one to five times with different numbers in the second column. (All numbers in the second column are distinct, as it happens.)
All timings were collected with bash's time builtin, taking the real (wall-clock) time. (sort multithreads nicely, so the user time was always greater.)
With the first column correctly sorted and the second column randomised:
sort -k1,1 -k2,2g sorted | sort -k1,1g -su 1.24s
sort -k1,1g -k2,2g sorted | sort -k1,1g -su 1.78s
sort -k2,2g sorted | sort -k1,1g -su 3.00s
With the first column randomised:
sort -k1,1 -k2,2g unsorted | sort -k1,1g -su 1.42s
sort -k1,1g -k2,2g unsorted | sort -k1,1g -su 2.19s
sort -k2,2g unsorted | sort -k1,1g -su 3.01s

You can use this GNU awk command:
awk '!($1 in m) || m[$1]>$2{m[$1]=$2} END{for (i in m) print i, m[i]}' file
Or to get the order same as the input file:
awk 'BEGIN{PROCINFO["sorted_in"]="#ind_num_asc"} !($1 in m) || m[$1] > $2 {m[$1] = $2}
END{for (i in m) print i, m[i]}' file
BEGIN{PROCINFO["sorted_in"]="#ind_num_asc"} is used to order the associative array by numerical index.
Output:
0 0
23.9901 3.364
24.054 18.5279
25.0981 17.4839
42.582 0
45.79 0
45.7902 12.1518
51.034 12.028
54.11 14.072
54.1102 14.0718

You can do that:
awk 'NR==1{k=$1;v=$2;next} k==$1 { if (v>$2) v=$2; next} {print k,v; k=$1;v=$2}END{print k,v}'
indented:
awk '
# for the first record, store the two fields
NR==1 {
    k=$1
    v=$2
    next
}
# while the first field does not change
k==$1 {
    # keep the lower second field
    if (v>$2)
        v=$2
    next
}
# otherwise print the stored fields and reinitialize them
{
    print k,v
    k=$1
    v=$2
}
END {
    print k,v
}'
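Running the one-liner on the sample data from the question (note that it relies on the input being grouped by $1, as it is here):

```shell
# Sample data piped straight into the one-liner.
printf '%s\n' '0 0' '23.9901 13.604' '23.9901 13.604' '23.9901 3.364' \
    '23.9901 3.364' '24.054 18.5279' '25.0981 17.4839' '42.582 0' \
    '45.79 0' '45.79 15.36' '45.7902 12.1518' '51.034 12.028' \
    '54.11 14.072' '54.1102 14.0718' |
awk 'NR==1{k=$1;v=$2;next} k==$1 { if (v>$2) v=$2; next} {print k,v; k=$1;v=$2} END{print k,v}'
```

This prints the same ten lines as the expected output above.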

In Perl:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my %min;
while (<>) {
chomp;
my ($key, $value) = split;
if (!exists $min{$key} or $value < $min{$key}) {
$min{$key} = $value;
}
}
for (sort { $a <=> $b } keys %min) {
say "$_ $min{$_}";
}
It's written as a Unix filter, so it reads from STDIN and writes to STDOUT. Call it as:
$ ./get_min < input_file > output_file

When you want to use sort, you first have to work around the ordering problem: sort -n will not understand the decimal point, so temporarily change it to an x.
Then sort numerically on the resulting fields and put the decimal point back.
The resulting list is sorted correctly; take the first value of each key.
sed 's/\./ x /g' inputfile | sort -n -k1,3 -k4,6 | sed 's/ x /./g' | sort -u -k1,1
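On the sample data from the question, the round trip looks like this (a sketch; it relies on GNU sort keeping the first of equal-keyed lines in the final -u pass):

```shell
# Recreate the sample input, mask the decimal points, sort, restore them,
# then keep the first line per key.
cat > input <<'EOF'
0 0
23.9901 13.604
23.9901 13.604
23.9901 3.364
23.9901 3.364
24.054 18.5279
25.0981 17.4839
42.582 0
45.79 0
45.79 15.36
45.7902 12.1518
51.034 12.028
54.11 14.072
54.1102 14.0718
EOF

sed 's/\./ x /g' input | sort -n -k1,3 -k4,6 | sed 's/ x /./g' | sort -u -k1,1
```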

Related

Trying to find a maximum from a file in shellscript

In shell script, I'm trying to get the maximum value from different lines. There are 5 fields in a line, and the fifth holds the value that I need to compare against the other lines. Once I have found the maximum, I also have to write out the rest of that line.
Any advice on how I could do it?
Sort numerically, by field 5, then print only the line containing the highest value:
sort -nk5,5 data.txt | tail -n 1
Try
< MYFILE sort -k5nr | head -1
< redirects MYFILE into sort; -k5 says to sort on the fifth field, n selects numeric order, and r sorts in reverse order so the largest number comes first. Then head -1 outputs only the first line. The end result is
69.4206662, 12.3216747, 2021.08.21., 14:44, 20

How do I sort a "MON_YYYY_day_NUM" time with UNIX tools?

I'm wondering how to sort this example based on time. I have already sorted it based on everything else, but I just cannot figure out how to sort it using the time (the 07:30 part, for example).
My current code:
sort -t"_" -k3n -k2M -k5n (still need to implement the time sort for the last sort)
What still needs to be sorted is the time:
Dunaj_Dec_2000_day_1_13:00.jpg
Rim_Jan_2001_day_1_13:00.jpg
Ljubljana_Nov_2002_day_2_07:10.jpg
Rim_Jan_2003_day_3_08:40.jpg
Rim_Jan_2003_day_3_08:30.jpg
Any help or just a point in the right direction is greatly appreciated!
Alphabetically: a 24-hour time with a fixed number of digits sorts correctly with a plain alphabetic sort.
sort -t"_" -k3n -k2M -k5n -k6 # default sorting
sort -t"_" -k3n -k2M -k5n -k6V # version-number sort.
There's also a version sort (V) which would work fine here.
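Applying the first command to the sample names confirms the ordering; the two 2003 files end up sorted by time. (The -M month-name sort requires GNU sort and an English-style locale.)

```shell
printf '%s\n' \
    Dunaj_Dec_2000_day_1_13:00.jpg \
    Rim_Jan_2001_day_1_13:00.jpg \
    Ljubljana_Nov_2002_day_2_07:10.jpg \
    Rim_Jan_2003_day_3_08:40.jpg \
    Rim_Jan_2003_day_3_08:30.jpg |
sort -t"_" -k3n -k2M -k5n -k6    # year, month, day, then time
```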
I have to admit to shamelessly stealing from this answer on SO:
How to split log file in bash based on time condition
awk -F'[_:.]' '
BEGIN {
months["Jan"] = 1
months["Feb"] = 2
months["Mar"] = 3
months["Apr"] = 4
months["May"] = 5
months["Jun"] = 6
months["Jul"] = 7
months["Aug"] = 8
months["Sep"] = 9
months["Oct"] = 10
months["Nov"] = 11
months["Dec"] = 12
}
{ print mktime($3" "months[$2]" "$5" "$6" "$7" 00"), $0 }
' input | sort -n | cut -d\ -f2-
Use _, :, and . as field-separator characters to parse each file name.
Initialize an associative array so we can map month names to numerical values (1-12).
Use the awk function mktime(); it takes a string in the format "YYYY MM DD HH MM SS [ DST ]", as per https://www.gnu.org/software/gawk/manual/html_node/Time-Functions.html. Each line of input is printed with a prepended column containing the time in epoch seconds.
The results are piped to sort -n, which sorts numerically on the first column.
Now that the results are sorted, we can remove the first column with cut.
I'm on a Mac, so I had to use gawk to get the mktime() function (it's not normally available in macOS awk). I've read that mawk is another option.

How to select a specific percentage of lines?

Good morning!
I have a file.csv with 140 lines and 26 columns. I need to sort the lines according to the values in column 23. This is an example:
Controller1,NA,ASHEBORO,ASH,B,,3674,4572,1814,3674,4572,1814,1859,#NAME?,0,124.45%,49.39%,19%,1,,"Big Risk, No Spare disk",45.04%,4.35%,12.63%,160,464,,,,,,0,1,1,1,0,410,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller2,EU,FR,URG,D,,0,0,0,0,0,0,0,#NAME?,0,#DIV/0!,#DIV/0!,#DIV/0!,1,,#N/A,0.00%,0.00%,#DIV/0!,NO STATS,-1088,,,,,,#N/A,#N/A,#N/A,#N/A,0,#N/A,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller3,EU,FR,URG,D,,0,0,0,0,0,0,0,#NAME?,0,#DIV/0!,#DIV/0!,#DIV/0!,1,,#N/A,0.00%,0.00%,#DIV/0!,NO STATS,-2159,,,,,,#N/A,#N/A,#N/A,#N/A,0,#N/A,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller4,NA,STARR,STA,D,,4430,6440,3736,4430,6440,3736,693,#NAME?,0,145.38%,84.35%,18%,1,,No more Data disk,65.17%,19.18%,-2.18%,849,-96,,,,,,0,2,1,2,2,547,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
To select the lines according to the values of column 23, I do this:
awk -F "%*," '$23 > 4' myfikle.csv
The result :
Controller1,NA,ASHEBORO,ASH,B,,3674,4572,1814,3674,4572,1814,1859,#NAME?,0,124.45%,49.39%,19%,1,,"Big Risk, No Spare disk",45.04%,4.35%,12.63%,160,464,,,,,,0,1,1,1,0,410,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller4,NA,STARR,STA,D,,4430,6440,3736,4430,6440,3736,693,#NAME?,0,145.38%,84.35%,18%,1,,No more Data disk,65.17%,19.18%,-2.18%,849,-96,,,,,,0,2,1,2,2,547,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
In my example I use the value of 4% in column 23, the goal being to retrieve all the rows whose percentage in column 23 increases significantly. The problem is that I can't rely on the 4% value, because it is only representative of the current table. So I have to find another way to retrieve the rows that have a high value in column 23.
I have to sort the controllers in descending order according to the percentage in column 23; I would rather process the first 10% of the sorted lines, to make sure I get the controllers with a large percentage.
The goal is to be able to vary the percentage according to the number of lines in the table.
Do you have any tips for that?
Thanks! :)
I could have sworn that this question was a duplicate, but so far I couldn't find a similar question.
Whether your file is sorted or not does not really matter. From any file you can extract the first NUMBER lines with head -n NUMBER. There is no built-in way to specify the count as a percentage, but you can compute how many lines PERCENT% of your file's lines amounts to.
percentualHead() {
    percent="$1"
    file="$2"
    linesTotal="$(wc -l < "$file")"
    (( lines = linesTotal * percent / 100 ))
    head -n "$lines" "$file"
}
or, shorter but less readable:
percentualHead() {
    head -n "$(( "$(wc -l < "$2")" * "$1" / 100 ))" "$2"
}
Calling percentualHead 10 yourFile will print the first 10% of lines from yourFile to stdout.
Note that percentualHead only works with regular files, because the file has to be read twice. It does not work with FIFOs, process substitutions like <(), or pipes.
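A quick sanity check of the function, using seq to build a throwaway 20-line file (twenty.txt is just a scratch name):

```shell
percentualHead() {
    percent="$1"
    file="$2"
    linesTotal="$(wc -l < "$file")"
    (( lines = linesTotal * percent / 100 ))
    head -n "$lines" "$file"
}

# Demo: a 20-line file; 10 % of 20 lines is 2 lines.
seq 1 20 > twenty.txt
percentualHead 10 twenty.txt    # prints lines "1" and "2"
```

Note that the arithmetic truncates: percentualHead 10 on a 25-line file prints 2 lines, not 2.5.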
If you want to use standard tools, you'll need to read the file twice. But if you're content to use perl, you can simply do:
perl -e 'my @sorted = sort <>; print @sorted[0..$#sorted * .10]' input-file
Here is one for GNU awk that gets the top p% from the file, although the lines are output in order of appearance:
$ awk -F, -v p=0.5 '        # top 50 % of records by $23
NR==FNR {                   # first run
    a[NR]=$23               # hash percentages to a, NR as key
    next
}
FNR==1 {                    # second run, at the beginning
    n=asorti(a,a,"#val_num_desc")  # sort percentages to descending order
    for(i=1;i<=n*p;i++)     # get only the top p %
        b[a[i]]             # hash their NRs to b
}
(FNR in b)                  # top p %, but in input order, not sorted
' file file | cut -d, -f 23 # file is processed twice; cut 23rd field for demo
45.04%
19.18%
The inline comments explain each step.

multiple field and numeric sort

List of files:
sysbench-size-256M-mode-rndrd-threads-1
sysbench-size-256M-mode-rndrd-threads-16
sysbench-size-256M-mode-rndrd-threads-4
sysbench-size-256M-mode-rndrd-threads-8
sysbench-size-256M-mode-rndrw-threads-1
sysbench-size-256M-mode-rndrw-threads-16
sysbench-size-256M-mode-rndrw-threads-4
sysbench-size-256M-mode-rndrw-threads-8
sysbench-size-256M-mode-rndwr-threads-1
sysbench-size-256M-mode-rndwr-threads-16
sysbench-size-256M-mode-rndwr-threads-4
sysbench-size-256M-mode-rndwr-threads-8
sysbench-size-256M-mode-seqrd-threads-1
sysbench-size-256M-mode-seqrd-threads-16
sysbench-size-256M-mode-seqrd-threads-4
sysbench-size-256M-mode-seqrd-threads-8
sysbench-size-256M-mode-seqwr-threads-1
sysbench-size-256M-mode-seqwr-threads-16
sysbench-size-256M-mode-seqwr-threads-4
sysbench-size-256M-mode-seqwr-threads-8
I would like to sort them by mode (rndrd, rndwr etc.) and then number:
sysbench-size-256M-mode-rndrd-threads-1
sysbench-size-256M-mode-rndrd-threads-4
sysbench-size-256M-mode-rndrd-threads-8
sysbench-size-256M-mode-rndrd-threads-16
sysbench-size-256M-mode-rndrw-threads-1
sysbench-size-256M-mode-rndrw-threads-4
sysbench-size-256M-mode-rndrw-threads-8
sysbench-size-256M-mode-rndrw-threads-16
....
I've tried the following loop but it's sorting by number but I need sequence like 1,4,8,16:
$ for f in $(ls -1A); do echo $f; done | sort -t '-' -k 7n
EDIT:
Please note that a plain numeric sort (-n) orders by number across all lines (1,1,1,1,4,4,4,4...), but I need the sequence 1,4,8,16,1,4,8,16... within each mode.
Sort by more columns:
sort -t- -k5,5 -k7n
The primary sort is by the 5th column (and only that column, which is why it's 5,5); the secondary sort is numeric on the 7th column.
The for loop is completely unnecessary as is the -1 argument to ls when piping its output. This yields
ls -A | sort -t- -k 5,5 -k 7,7n
where the first key begins and ends at column 5 and the second key begins and ends at column 7 and is numeric.
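Applied to the first two mode groups of the file list, the two-key sort yields the requested 1,4,8,16 sequence within each mode:

```shell
printf '%s\n' \
    sysbench-size-256M-mode-rndrd-threads-1 \
    sysbench-size-256M-mode-rndrd-threads-16 \
    sysbench-size-256M-mode-rndrd-threads-4 \
    sysbench-size-256M-mode-rndrd-threads-8 \
    sysbench-size-256M-mode-rndrw-threads-1 \
    sysbench-size-256M-mode-rndrw-threads-16 \
    sysbench-size-256M-mode-rndrw-threads-4 \
    sysbench-size-256M-mode-rndrw-threads-8 |
sort -t- -k5,5 -k7,7n    # group by mode, then numeric thread count
```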

Removing repeated pairs from a very big text file

I have a very big text file (few GB) that has the following format:
1 2
3 4
3 5
3 6
3 7
3 8
3 9
The file is already sorted and duplicate lines have been removed. There are repeated pairs like '2 1' and '4 3' in reverse order that I want to remove. Does anybody have a solution to do this in a very resource-limited environment, in bash, awk, Perl or a similar language? I cannot load the whole file and loop over the values.
You want to remove lines where the second number is less than the first?
perl -i~ -lane'print if $F[0] < $F[1]' file
Possible solution:
Scan the file
For any pair where the second value is less than the first, swap the two numbers
Sort the pairs again by first then second number
Remove duplicates
I'm still thinking about a more efficient solution in terms of disk sweeps, but this is a basic naive approach
For each value, perform a binary search on the file on the hard drive, without loading it into memory. Delete the duplicate if you see it. Then do a final pass that removes the blank lines left behind (runs of two or more \n).
Not exactly sure if this works / if it's any good...
awk '{ if ($2 > $1) print; else print $2, $1 }' hugetext | sort -k1,1n -k2,2n -u -o hugetext
(Both columns have to be named as keys: with -u only the keys are compared, and a plain -n keys on the leading number, which would make 3 4 and 3 5 compare as duplicates. -o writes the result back over hugetext; sort reads its input fully before writing.)
You want to remove duplicates, considering 1 2 and 2 1 to be the same?
perl -lane'print "@F[ $F[0] < $F[1] ? (0,1,0,1) : (1,0,0,1) ]"' file.in \
| sort -n \
| perl -lane'$t = "@F[0,1]"; print "@F[2,3]" if $t ne $p; $p = $t;' \
> file.out
This can handle arbitrarily large files.
Here's a general O(n) algorithm to do this in one pass (no sorting required):
Start with an empty hashset as your blacklist (a set is a map with just keys)
Read file one line at a time.
For each line:
Check whether this pair is in your blacklist already.
If so, ignore it.
If not, append it to your result file, and also add the swapped pair to the blacklist (e.g., if you just read "3 4", add "4 3" to the blacklist).
This takes O(n) time to run, and O(n) storage for the blacklist. (No additional storage for the result if you manipulate the file as r/w to remove lines as you check them against the blacklist)
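The steps above can be sketched in awk (one pass over the file; memory grows with the number of kept pairs; 'pairs.txt' is a placeholder name):

```shell
# Sketch of the blacklist approach on whitespace-separated pairs.
awk '{
    pair = $1 " " $2
    if (pair in blacklist)      # reverse of an earlier pair: ignore it
        next
    print                       # keep this pair in the result
    blacklist[$2 " " $1]        # blacklist its swapped form
}' pairs.txt
```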
perl -lane '
END{
print for sort {$a<=>$b} keys %h;
}
$key = $F[0] < $F[1] ? "$F[0] $F[1]" : "$F[1] $F[0]";
$h{$key} = "";
' file.txt
Explanations:
I order the two values of the current line numerically
I build the hash key $key by concatenating the first and second values with a space
I set $h{$key} to the empty string
At the end, I print all the keys sorted in numeric order.
A hash key is unique by nature, so there are no duplicates.
You just need to use Unix redirection to create a new file.
