Trying to find a maximum from a file in shellscript - shell

In a shell script, I'm trying to get the maximum value across several lines. Each line has 5 fields, and the fifth is the value that I need to compare with the other lines. Once I have found the maximum, I have to print the rest of that line too.
Any advice on how I could do it?

Sort numerically, by field 5, then print only the line containing the highest value:
sort -nk5,5 data.txt | tail -n 1

Try
< MYFILE sort -k5,5nr | head -n 1
< redirects MYFILE to the standard input of sort; -k5,5 says to sort on the fifth field only; n is for numeric order; r sorts in reverse order so the largest number comes first. Then head -n 1 outputs only the first line. The end result is
69.4206662, 12.3216747, 2021.08.21., 14:44, 20
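If sorting the whole file is not wanted, the same result can be sketched as a single awk pass that keeps the line with the largest fifth field. This is only an illustration: two of the sample lines below are made up, and the fields are assumed to be whitespace-separated as in the output above.

```shell
# Sample data: the first and last lines are invented for illustration;
# the middle one is the line from the answer above.
data=$(mktemp)
cat > "$data" <<'EOF'
47.1000000, 19.2000000, 2021.08.20., 09:15, 7
69.4206662, 12.3216747, 2021.08.21., 14:44, 20
46.5000000, 18.9000000, 2021.08.22., 11:02, 13
EOF

# Remember the whole line whenever field 5 beats the current maximum.
max_line=$(awk 'NR == 1 || $5 + 0 > max { max = $5 + 0; line = $0 }
                END { print line }' "$data")
echo "$max_line"
rm -f "$data"
```

Unlike the sort-based answers, this reads the file once and keeps only one line in memory.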

Related

How can I find the missing integers in a unique and sequential list (one per line) in a unix terminal?

Suppose I have a file as follows (a sorted, unique list of integers, one per line):
1
3
4
5
8
9
10
I would like the following output (i.e. the missing integers in the list):
2
6
7
How can I accomplish this within a bash terminal (using awk or a similar solution, preferably a one-liner)?
Using awk you can do this:
awk '{for(i=p+1; i<$1; i++) print i} {p=$1}' file
2
6
7
Explanation:
{p = $1}: Variable p contains value from previous record
{for ...}: We loop from p+1 to the current row's value (excluding current value) and print each value which is basically the missing values
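One detail worth noting: on the first record p is still unset, so the loop starts at 1, and a list that begins at, say, 5 would also report 1 to 4. If only the gaps inside the list are wanted, a possible variant (a sketch, not part of the original answer) skips the loop on the first record:

```shell
# Report only the gaps between consecutive values; the NR > 1 guard
# keeps the first record from being compared against an unset p.
missing=$(printf '%s\n' 5 6 9 |
    awk 'NR > 1 { for (i = p + 1; i < $1; i++) print i } { p = $1 }')
echo "$missing"
```

For the sample list in the question, which starts at 1, both versions give the same result.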
Using seq and grep:
seq $(head -n1 file) $(tail -n1 file) | grep -vwFf file -
seq creates the full sequence; grep removes from it the lines that exist in the file.
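Run against the sample list from the question, the pipeline can be tried as below. The temporary file is arbitrary, and this sketch uses -x (match whole lines) in place of -w, on the assumption that each line holds exactly one number.

```shell
list=$(mktemp)
printf '%s\n' 1 3 4 5 8 9 10 > "$list"

# seq prints 1..10; grep -v drops every line that also appears in the list.
missing=$(seq "$(head -n 1 "$list")" "$(tail -n 1 "$list")" | grep -vxFf "$list")
echo "$missing"
rm -f "$list"
```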
Using Perl:
perl -nE 'say for $a+1 .. $_-1; $a=$_' file
Calling no external program (if filein contains the list of numbers):
#!/bin/bash
i=0
while read -r num; do
    while (( ++i < num )); do
        echo "$i"
    done
done <filein
To adapt choroba's clever answer for my own use case, I needed my sequence to deal with zero-padded numbers.
The -w switch to seq is the magic here - it automatically pads the first number with the necessary number of zeroes to keep it aligned with the second number:
-w, --equal-width equalize width by padding with leading zeroes
My integers go from 0 to 9999, so I used the following:
seq -w 0 9999 | grep -vwFf "file.txt"
...which finds the missing integers in a sequence from 0000 to 9999. Or to put it back into the more universal solution in choroba's answer:
seq -w $(head -n1 "file.txt") $(tail -n1 "file.txt") | grep -vwFf "file.txt"
Personally, I didn't find the - in his answer to be necessary, but there may be use cases that require it.
Using Raku (formerly known as Perl_6)
raku -e 'my @a = lines.map: *.Int; say @a.Set (^) @a.minmax.Set;'
Sample Input:
1
3
4
5
8
9
10
Sample Output:
Set(2 6 7)
I'm sure there's a Raku solution similar to @JJoao's clever Perl5 answer, but in thinking about this problem my mind naturally turned to Set operations.
The code above reads lines into the @a array, mapping each line so that elements in the @a array are Ints, not strings. In the second statement, @a.Set converts the array to a Set on the left-hand side of the (^) operator. Also in the second statement, @a.minmax.Set converts the array to a second Set, on the right-hand side of the (^) operator, but this time because the minmax operator is used, all Int elements from the min to max are included. Finally, the (^) symbol is the symmetric set-difference (infix) operator, which finds the difference.
To get an unordered whitespace-separated list of missing integers, replace the above say with put. To get a sequentially-ordered list of missing integers, add the explicit sort below:
~$ raku -e 'my @a = lines.map: *.Int; .put for (@a.Set (^) @a.minmax.Set).sort.map: *.key;' file
2
6
7
The advantage of all Raku code above is that finding "missing integers" doesn't require a "sequential list" as input, nor is the input required to be unique. So hopefully this code will be useful for a wide variety of problems in addition to the explicit problem stated in the Question.
OTOH, Raku is a Perl-family language, so TMTOWTDI. Below, a @a.minmax array is created, and grepped so that none of the elements of @a are returned (none junction):
~$ raku -e 'my @a = lines.map: *.Int; .put for @a.minmax.grep: none @a;' file
2
6
7
https://docs.raku.org/language/setbagmix
https://docs.raku.org/type/Junction
https://raku.org

find smallest $2 for every unique $1 [duplicate]

I am trying to obtain the smallest $2 value for every $1 value. My data looks like follows:
0 0
23.9901 13.604
23.9901 13.604
23.9901 3.364
23.9901 3.364
24.054 18.5279
25.0981 17.4839
42.582 0
45.79 0
45.79 15.36
45.7902 12.1518
51.034 12.028
54.11 14.072
54.1102 14.0718
The output must look like:
0 0
23.9901 3.364
24.054 18.5279
25.0981 17.4839
42.582 0
45.79 0
45.7902 12.1518
51.034 12.028
54.11 14.072
54.1102 14.0718
I can manage this by creating multiple files for each $1 value and finding the min in each file. But I am wondering if there might be a more elegant solution for this?
Thanks.
With GNU or FreeBSD sort, you can do it as follows:
sort -k1,1 -k2,2g file | sort -k1,1g -su
The first sort orders the file by the first and then the second column. The second sort removes duplicates (-u), using only the first column to determine uniqueness. It also uses the -s flag to guarantee that the second column is still in order. In both cases, the sort uses the -g flag where it matters (see below); -g does general numeric comparison, including scientific notation, whereas the POSIX-standard -n flag only handles plain decimal numbers.
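As a small worked example of the two passes, here is the pipeline applied to a few rows taken from the question's data. It assumes a sort that supports -g (GNU or FreeBSD), and LC_ALL=C is set only to make the lexical first key predictable.

```shell
export LC_ALL=C
rows=$(mktemp)
cat > "$rows" <<'EOF'
23.9901 13.604
23.9901 3.364
45.79 0
45.79 15.36
45.7902 12.1518
EOF

# Pass 1 orders by column 1, then numerically by column 2.
# Pass 2 keeps the first line for each distinct column 1 value.
smallest=$(sort -k1,1 -k2,2g "$rows" | sort -k1,1g -su)
echo "$smallest"
rm -f "$rows"
```

Because the second pass is stable, the line it keeps for each key is the one the first pass placed first, that is, the one with the smallest second column.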
Performance note: (And thanks to OP for spurring me to do the measurements):
Leaving the g off of -k1,1 in the first sort is not a typo; it actually considerably speeds the sort up (on large files, with Gnu sort). Standard or integer (-n) sorts are much faster than general numeric sorts, perhaps 10 times as fast. However, all key types are about twice as fast for files which are "mostly sorted". For more-or-less uniformly sampled random numbers, a lexicographic sort is a close approximation to a general numeric sort; close enough that the result shows the "mostly sorted" speed-up.
It would have been possible to only sort by the second field in the first sort: sort -k2,2g file | sort -k1,1g -su but this is much slower, both because the primary sort in the first pass is general numeric instead of lexicographic and because the file is no longer mostly sorted for the second pass.
Here's just one sample point, although I did a few tests with similar results. The input file consists of 299,902 lines, each containing two numbers in the range 0 to 1,000,000, with three decimal digits. There are precisely 100,000 distinct numbers in the first column; each appears from one to five times with different numbers in the second column. (All numbers in the second column are distinct, as it happens.)
All timings were collected with bash's time verb, taking the real (wallclock) time. (Sort multithreads nicely so the user time was always greater).
With the first column correctly sorted and the second column randomised:
sort -k1,1 -k2,2g sorted | sort -k1,1g -su        1.24s
sort -k1,1g -k2,2g sorted | sort -k1,1g -su       1.78s
sort -k2,2g sorted | sort -k1,1g -su              3.00s
With the first column randomised:
sort -k1,1 -k2,2g unsorted | sort -k1,1g -su      1.42s
sort -k1,1g -k2,2g unsorted | sort -k1,1g -su     2.19s
sort -k2,2g unsorted | sort -k1,1g -su            3.01s
You can use this awk command (the output order is unspecified):
awk '!($1 in m) || m[$1]>$2{m[$1]=$2} END{for (i in m) print i, m[i]}' file
Or, with GNU awk, to get the output in ascending numeric order of the first column (the same order as the input file):
awk 'BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"} !($1 in m) || m[$1] > $2 {m[$1] = $2}
END{for (i in m) print i, m[i]}' file
BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"} makes the for loop traverse the associative array in ascending numeric order of its indices.
Output:
0 0
23.9901 3.364
24.054 18.5279
25.0981 17.4839
42.582 0
45.79 0
45.7902 12.1518
51.034 12.028
54.11 14.072
54.1102 14.0718
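If gawk is not available, a portable sketch (an alternative offered under that assumption, not the answer's own code) can record the order in which keys first appear in a second array and print in that order. The array names and the sample rows are illustrative.

```shell
input=$(mktemp)
cat > "$input" <<'EOF'
23.9901 13.604
23.9901 3.364
45.79 0
45.79 15.36
45.7902 12.1518
EOF

# order[] records each key the first time it is seen; m[] holds its minimum.
mins=$(awk '!($1 in m)         { order[++n] = $1; m[$1] = $2; next }
            $2 + 0 < m[$1] + 0 { m[$1] = $2 }
            END { for (i = 1; i <= n; i++) print order[i], m[order[i]] }' "$input")
echo "$mins"
rm -f "$input"
```

This keeps first-appearance order, so it matches the numeric order only when the input is already sorted by the first column, as it is in the question.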
You can do this (it relies on equal values of the first field being adjacent, as in the sample):
awk 'NR==1{k=$1;v=$2;next} k==$1 { if (v>$2) v=$2; next} {print k,v; k=$1;v=$2}END{print k,v}' file
Indented:
awk '
# for the first record, store the two fields
NR==1 {
    k=$1
    v=$2
    next
}
# when the first field does not change
k==$1 {
    # check if the second field is lower
    if (v>$2)
        v=$2
    next
}
{
    # otherwise print the stored fields and reinitialize them
    print k,v
    k=$1
    v=$2
}
END {
    print k,v
}' file
In Perl:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

my %min;
while (<>) {
    chomp;
    my ($key, $value) = split;
    if (!exists $min{$key} or $value < $min{$key}) {
        $min{$key} = $value;
    }
}
for (sort { $a <=> $b } keys %min) {
    say "$_ $min{$_}";
}
It's written as a Unix filter, so it reads from STDIN and writes to STDOUT. Call it as:
$ ./get_min < input_file > output_file
If you want to use sort, you first have to fix the ordering. If your sort does not handle the decimal point, temporarily replace it with an x.
Then sort numerically on the numeric fields and put the decimal point back.
The resulting list is sorted correctly; take the first line for each key.
sed 's/\./ x /g' inputfile | sort -n -k1,3 -k4,6 | sed 's/ x /./g' | sort -u -k1,1

multiple field and numeric sort

List of files:
sysbench-size-256M-mode-rndrd-threads-1
sysbench-size-256M-mode-rndrd-threads-16
sysbench-size-256M-mode-rndrd-threads-4
sysbench-size-256M-mode-rndrd-threads-8
sysbench-size-256M-mode-rndrw-threads-1
sysbench-size-256M-mode-rndrw-threads-16
sysbench-size-256M-mode-rndrw-threads-4
sysbench-size-256M-mode-rndrw-threads-8
sysbench-size-256M-mode-rndwr-threads-1
sysbench-size-256M-mode-rndwr-threads-16
sysbench-size-256M-mode-rndwr-threads-4
sysbench-size-256M-mode-rndwr-threads-8
sysbench-size-256M-mode-seqrd-threads-1
sysbench-size-256M-mode-seqrd-threads-16
sysbench-size-256M-mode-seqrd-threads-4
sysbench-size-256M-mode-seqrd-threads-8
sysbench-size-256M-mode-seqwr-threads-1
sysbench-size-256M-mode-seqwr-threads-16
sysbench-size-256M-mode-seqwr-threads-4
sysbench-size-256M-mode-seqwr-threads-8
I would like to sort them by mode (rndrd, rndwr etc.) and then number:
sysbench-size-256M-mode-rndrd-threads-1
sysbench-size-256M-mode-rndrd-threads-4
sysbench-size-256M-mode-rndrd-threads-8
sysbench-size-256M-mode-rndrd-threads-16
sysbench-size-256M-mode-rndrw-threads-1
sysbench-size-256M-mode-rndrw-threads-4
sysbench-size-256M-mode-rndrw-threads-8
sysbench-size-256M-mode-rndrw-threads-16
....
I've tried the following loop, but it sorts by number only, whereas I need a sequence like 1,4,8,16 within each mode:
$ for f in $(ls -1A); do echo $f; done | sort -t '-' -k 7n
EDIT:
Please note that numeric sort (-n) sorts by number (1,1,1,1,4,4,4,4...), but I need a sequence like 1,4,8,16,1,4,8,16...
Sort by more columns:
sort -t- -k5,5 -k7n
Primary sort is by 5th column (and not the rest, that's why 5,5), secondary sorting by number in the 7th column.
The for loop is completely unnecessary as is the -1 argument to ls when piping its output. This yields
ls -A | sort -t- -k 5,5 -k 7,7n
where the first key begins and ends at column 5 and the second key begins and ends at column 7 and is numeric.
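A quick way to check the two-key sort without touching the file system is to feed it a handful of the names with printf (a sketch using four names from the list above, deliberately shuffled):

```shell
# Key 1: field 5 (the mode), compared as text.
# Key 2: field 7 (the thread count), compared numerically.
sorted=$(printf '%s\n' \
    sysbench-size-256M-mode-rndrw-threads-16 \
    sysbench-size-256M-mode-rndrd-threads-8 \
    sysbench-size-256M-mode-rndrw-threads-4 \
    sysbench-size-256M-mode-rndrd-threads-16 |
    sort -t- -k5,5 -k7,7n)
echo "$sorted"
```

Within each mode the thread counts come out as 8 before 16 and 4 before 16, which a plain text sort would not give.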

Removing repeated pairs from a very big text file

I have a very big text file (few GB) that has the following format:
1 2
3 4
3 5
3 6
3 7
3 8
3 9
The file is already sorted and duplicate lines have been removed. There are repeated pairs in reverse order, like '2 1' and '4 3', that I want to remove. Does anybody have a solution that works in a very resource-limited environment, in Bash, AWK, Perl, or any similar language? I cannot load the whole file into memory and loop over the values.
You want to remove lines where the second number is less than the first?
perl -i~ -lane'print if $F[0] < $F[1]' file
Possible solution:
Scan the file
For any pair where the second value is less than the first, swap the two numbers
Sort the pairs again by first then second number
Remove duplicates
I'm still thinking about a more efficient solution in terms of disk sweeps, but this is a basic naive approach.
For each value, perform a binary search on the file on the hard drive, without loading it into memory. Delete the duplicate if you see it. Then do a final pass that removes all instances of two or more \n.
Not exactly sure if this works / if it's any good...
awk '{ if ($2 > $1) print; else print $2, $1 }' hugetext | sort -k1,1n -k2,2n -u -o hugetext
You want to remove duplicates, considering 1 2 and 2 1 to be the same?
< file.in \
| perl -lane'print "@F[ $F[0] < $F[1] ? (0,1,0,1) : (1,0,0,1) ]"' \
| sort -n \
| perl -lane'$t="@F[0,1]"; print "@F[2,3]" if $t ne $p; $p=$t;' \
> file.out
This can handle arbitrarily large files.
Here's a general O(n) algorithm to do this in 1 pass (no loops or sorting required):
Start with an empty hashset as your blacklist (a set is a map with just keys)
Read file one line at a time.
For each line:
Check to see this pair is in your blacklist already.
If so, ignore it.
If not, append it to your result file, and also add the swapped pair to the blacklist (e.g., if you just read "3 4", add "4 3" to the blacklist).
This takes O(n) time to run, and O(n) storage for the blacklist. (No additional storage for the result if you manipulate the file as r/w to remove lines as you check them against the blacklist)
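The steps above could be sketched in awk as follows, with an associative array standing in for the hash set. The array name and the sample pairs are illustrative, and as noted the array grows with the input, so this only fits when the set of pairs fits in memory.

```shell
# For each pair not yet blacklisted, print it and blacklist its reverse.
kept=$(printf '%s\n' '1 2' '3 4' '2 1' '4 3' '3 5' |
    awk '!(($1 " " $2) in blacklist) { print; blacklist[$2 " " $1] = 1 }')
echo "$kept"
```

The output keeps the original line order, so no second sort is needed.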
perl -lane '
    END {
        print for sort { $a <=> $b } keys %h;
    }
    $key = $F[0] < $F[1] ? "$F[0] $F[1]" : "$F[1] $F[0]";
    $h{$key} = "";
' file.txt
Explanations :
I put the two numbers of the current line in numeric order
I build the hash key $key by joining the first and second value with a space
I set $h{$key} to an empty string
At the end, I print all the keys sorted in numeric order.
Hash keys are unique by nature, so there are no duplicates.
You just need to use Unix redirections to create a new file.

Print only values smaller than certain threshold in bash

I have a file with more than 10000 lines like this, mostly numbers and some strings;
-40
-50
stringA
100
20
-200
...
I would like to write a bash (or other) script that reading this file only outputs numbers (no strings) and only those values smaller than zero (or some other predefined number). How can this be done?
In this case the output (sorted) would be
-40
-50
-200
...
cat filename | awk '{if($1==$1+0 && $1<THRESHOLD_VALUE)print $1}' | sort -n
The $1==$1+0 test ensures that the field is a number; the command then checks that it is less than THRESHOLD_VALUE (change this to whatever number you wish), prints the value if it passes, and sorts the result.
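To take the threshold from a shell variable rather than editing the awk program, it can be passed with -v. The sketch below also swaps the $1==$1+0 test for an explicit regular expression, a different technique chosen here only because it makes the accepted number format visible; the variable names are illustrative.

```shell
threshold=0

# Keep lines that look like a (possibly negative, possibly decimal) number
# and are below the threshold, then sort them numerically.
below=$(printf '%s\n' -40 -50 stringA 100 20 -200 |
    awk -v t="$threshold" '$1 ~ /^-?[0-9]+(\.[0-9]+)?$/ && $1 + 0 < t + 0' |
    sort -n)
echo "$below"
```

With no action block, awk prints every line for which the pattern is true.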
awk '$1 < NUMBER { print }' FILENAME | sort -n
where NUMBER is the number that you want to use as an upper bound and FILENAME is your file with 10000+ lines of numbers. You can drop the | sort -n if you don't want to sort the numbers.
edit: One small caveat. If your string starts with a number, it will treat it as that number. Otherwise it should ignore it.
Another alternative is as follows:
function compare() {
    if test "$1" -lt "$MAX_VALUE"; then
        echo "$1"
    fi
} 2> /dev/null
Have a look at help test and man bash for further help on this. The 2> /dev/null redirects errors thrown by test when you try to compare something other than two integers. Call the function like:
compare 1
compare -1
compare string A
Only the middle line will give output (for example, with MAX_VALUE=0).
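To apply the function to every line of the file, it can be called from a read loop. In this sketch MAX_VALUE=0 and the inline sample data are assumptions for illustration:

```shell
MAX_VALUE=0

compare() {
    if test "$1" -lt "$MAX_VALUE"; then
        echo "$1"
    fi
} 2> /dev/null

# Feed each line to compare; non-integer lines fail the test silently.
negatives=$(printf '%s\n' -40 -50 stringA 100 20 -200 |
    while IFS= read -r line; do compare "$line"; done)
echo "$negatives"
```

Note that test -lt only handles integers, so decimal values would be skipped just like strings.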
