Sorting strings with numbers in Bash [duplicate]

This question already has answers here:
How to sort strings that contain a common prefix and suffix numerically from Bash?
(5 answers)
Closed 6 years ago.
I've often wanted to sort strings with numbers in them so that, when sorting e.g. abc_2, abc_1, abc_10 the result is abc_1, abc_2, abc_10. Every sort mechanism I've seen sorts as abc_1, abc_10, abc_2, that is character by character from the left.
Is there any efficient way to sort to get the result I want? The idea of looking at every character, determining if it's a numeral, building a substring out of subsequent numerals and sorting on that as a number is too appalling to contemplate in bash.
Has no bearded *nix guru implemented an alternative version of sort with a --sensible_numerical option?

Execute this:
sort -t _ -k 2 -g data.file
-t sets the field separator
-k selects the key (column) to sort on
-g requests a general numeric sort
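For instance, with the strings from the question saved in data.file (the file name the command above assumes):
$ cat data.file
abc_2
abc_1
abc_10
$ sort -t _ -k 2 -g data.file
abc_1
abc_2
abc_10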

I think this is a GNU extension to sort, but you're looking for the --version-sort (or -V) option:
$ printf "prefix%d\n" $(seq 10 -3 1)
prefix10
prefix7
prefix4
prefix1
$ printf "prefix%d\n" $(seq 10 -3 1) | sort
prefix1
prefix10
prefix4
prefix7
$ printf "prefix%d\n" $(seq 10 -3 1) | sort --version-sort
prefix1
prefix4
prefix7
prefix10
https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html

You can sort using version-sort
Just pass the following arg
-V or --version-sort
# without (version-sort)
$ cat a.txt
abc_1
abc_4
abc_2
abc_10
abc_5
# with (version-sort)
$ sort -V a.txt
abc_1
abc_2
abc_4
abc_5
abc_10

Related

How to perform arithmetic on every row of a file via piping [duplicate]

This question already has answers here:
Bash Script- adding numbers to each line in file
(4 answers)
Closed 12 months ago.
Let's say we have a simple file data.txt with the following content:
1
2
3
Is there a command to perform arithmetic on every row value of the file via piping? I'm looking for something like cat data.txt | arit "+10" which would output:
11
12
13
The best I know is to perform arithmetic with bc on individual values, for example echo "1+10" | bc, but I wasn't able to apply that to my example, which contains many newline-separated values.
You could use awk:
awk '{print $1+10}' data.txt
My first impulse is to use sed. (Hey, it's the tool I always reach for.)
cat data.txt | sed 's/$/+10/' | bc
EDIT: or like so
sed 's/$/+10/' data.txt | bc
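If you want the exact pipe-friendly interface from the question, the sed-to-bc idea is easy to wrap in a small function. This is only a sketch, and arit is just the asker's hypothetical command name:
# hypothetical "arit" helper: appends the given expression to every
# input line and lets bc evaluate the result
arit() {
    sed "s/\$/$1/" | bc
}

cat data.txt | arit "+10"   # prints 11, 12, 13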

Bash - how to copy latest files by filename to another folder?

Let's say I have these files in folder Test1
AAAA-12_21_2020.txt
AAAA-12_20_2020.txt
AAAA-12_19_2020.txt
BBB-12_21_2020.txt
BBB-12_20_2020.txt
BBB-12_19_2020.txt
I want to copy the latest file for each prefix to folder Test2:
AAAA-12_21_2020.txt
BBB-12_21_2020.txt
This code would work:
ls -U "$1" | sort | cut -f 1 -d "-" | uniq | while read -r prefix; do
    # after the version sort, the last entry is the newest date for this prefix
    ls "$1/$prefix"-* | sort -t '_' -k3,3V -k1,1V -k2,2V | tail -n 1
done
We first iterate over every prefix in the directory given as the first argument; the prefixes are obtained by sorting the file list, cutting everything before -, and deleting duplicates. Then we sort each prefix's filenames on the three _-separated fields with the -k option of sort (primarily by year in the third field, then by month, which sits in the first field after the prefix, and lastly by day in the second field) and take the last entry, which is the newest. We use version sort so the surrounding text is ignored and the numbers are compared as numbers rather than lexicographically.
I'm not sure whether this is the best way to do this, as I used only basic bash functions. Because of the date format and the fact that you have to differentiate prefixes, you have to parse the string fully, which is a job better suited for AWK or Perl.
Nonetheless, I would suggest using day-month-year or year-month-day format for machine-readable filenames.
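Putting the copy step in as well, a full script might look like the sketch below (Test1 and Test2 are the folders from the question, and the PREFIX-MM_DD_YYYY.txt filename layout is assumed):
#!/bin/bash
# Sketch: copy the newest file for each prefix from Test1 to Test2.
src=Test1
dst=Test2

ls "$src" | cut -d '-' -f 1 | sort -u | while read -r prefix; do
    # version sort by year, then month, then day; the last entry is the newest
    newest=$(ls "$src/$prefix"-* | sort -t '_' -k3,3V -k1,1V -k2,2V | tail -n 1)
    cp "$newest" "$dst/"
done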
Using awk:
ls -1 Test1/ | awk -v src_dir="Test1" -v target_dir="Test2" -F '(-|_)' '{p=$4""$2""$3; if (!($1 in b) || b[$1] < p) {a[$1]=$0; b[$1]=p}} END {for (i in a) {system ("mv "src_dir"/"a[i]" "target_dir"/")}}'

`sort -t` doesn't work properly with string input

I want to sort some space separated numbers in bash. The following doesn't work, however:
sort -dt' ' <<< "3 5 1 4"
Output is:
3 5 1 4
Expected output is:
1
3
4
5
As I understand it, the -t option should use its argument as a delimiter. Why isn't my code working? I know I can tr the spaces to newlines, but I'm working on a code golf thing and want to be able to do it without any other utility.
EDIT: everybody is answering by splitting the spaces to lines. I do not want to do this. I already know how to do this with other utilities. I am specifically asking how to do this with sort, and sort only. If -t doesn't delimit input, what does it do?
Use process substitution with printf to put each input number on a separate line; otherwise sort gets only one line to sort:
sort <(printf "%s\n" 3 5 1 4)
1
3
4
5
With this approach, the -d and -t' ' options are not needed.
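If the constraint really is sort and nothing else, a here-string with ANSI-C quoting also works, since the newlines are embedded in the string itself:
$ sort <<< $'3\n5\n1\n4'
1
3
4
5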
After searching around, I have discovered what -t is for. It delimits fields within each line when you want to sort by a certain part of the line. For example, if you have
Hello,56
Cat,81
Book,14
Nope,62
and you want to sort by the number, you would use -t',' to delimit on the comma and then -k to select which field to sort by. It is for field delimiting, not record delimiting as I had thought.
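For example, assuming those four lines are saved in a file called data.csv (a name used here purely for illustration):
$ sort -t ',' -k2 -n data.csv
Book,14
Hello,56
Nope,62
Cat,81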
Since sort only separates fields within a single line (records are always whole lines), you have no choice but to split the data into lines before piping it into sort, for example by letting word splitting (IFS) hand each number to printf as a separate argument:
#!/bin/bash
var="8 3 5 1 4 7 2 9 6"

main() {
    IFS=" "
    # the unquoted $var is word-split on IFS, so printf emits one number per line
    printf "%s\n" $var | sort -d
}

main
This will give an obvious output of:
1
2
3
4
5
6
7
8
9
If this is not the way you want to use sort, then you have already answered your own question by digging into the issue; doing that digging before posting would have saved time both for you and for the people answering.

egrep AND operator [duplicate]

This question already has answers here:
Check if all of multiple strings or regexes exist in a file
(21 answers)
Closed 4 years ago.
I know egrep has a very useful way of anding two expressions together by using:
egrep "pattern1.*pattern2"|egrep "pattern2.*pattern1" filename.txt|wc -l
However, is there an easy way to use egrep's AND operator when searching for three expressions? The number of permutations grows factorially as you add expressions.
I know another way of going about it using sort|uniq -d, but I am looking for a simpler solution.
EDIT:
My current way of search will yield five total results:
#!/bin/bash
pid=$$
grep -i "angio" rtrans.txt|sort|uniq|egrep -o "^[0-9]+ [0-9]+ " > /tmp/$pid.1.tmp
grep -i "cardio" rtrans.txt|sort|uniq|egrep -o "^[0-9]+ [0-9]+ " > /tmp/$pid.2.tmp
grep -i "pulmonary" rtrans.txt|sort|uniq|egrep -o "^[0-9]+ [0-9]+ " > /tmp/$pid.3.tmp
cat /tmp/$pid.1.tmp /tmp/$pid.2.tmp|sort|uniq -d > /tmp/$pid.4.tmp
cat /tmp/$pid.4.tmp /tmp/$pid.3.tmp|sort|uniq -d > /tmp/$pid.5.tmp
egrep -o "^[0-9]+ [0-9]+ " /tmp/$pid.5.tmp|getDoc.mps > /tmp/$pid.6.tmp
head -10 /tmp/$pid.6.tmp
mumps#debianMumpsISR:~/Medline2012$ AngioAndCardioAndPulmonary.script
1514 Structural composition of central pulmonary arteries. Growth potential after surgical shunts.
1517 Patterns of pulmonary arterial anatomy and blood supply in complex congenital heart disease with pulmonary atresia
3034 Controlled reperfusion following regional ischemia.
3481 Anaesthetic management for oophorectomy in pulmonary lymphangiomyomatosis.
3547 A comparison of methods for limiting myocardial infarct expansion during acute reperfusion-- primary role of unload
While:
mumps#debianMumpsISR:~/Medline2012$ grep "angio" rtrans.txt|grep "cardio" rtrans.txt|grep "pulmonary" rtrans.txt|wc -l
185
yields 185 lines of text because each grep after the first is given rtrans.txt as an argument, so it reads the file instead of the piped input; in effect only the pulmonary search is counted, not the combination of all three.
how about
grep "pattern1" file|grep "pattern2"|grep "pattern3"
This will give the lines that contain p1, p2 and p3, though in arbitrary order within the line.
The approach of Kent with
grep "pattern1" file|grep "pattern2"|grep "pattern3"
is correct and should be faster; just for the record, I wanted to post an alternative that uses egrep to do the same without piping:
egrep "pattern1.*pattern2|pattern2.*pattern1"
which looks for p1 followed by p2 or p2 followed by p1.
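For three patterns the same trick needs all six orderings, which is exactly the blow-up the question worries about (p1, p2, p3 stand in for the patterns; shown to illustrate the growth, not to recommend it):
egrep "p1.*p2.*p3|p1.*p3.*p2|p2.*p1.*p3|p2.*p3.*p1|p3.*p1.*p2|p3.*p2.*p1" filename.txt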
The original question is about why his egrep command didn't work.
egrep "pattern1.*pattern2"|egrep "pattern2.*pattern1" filename.txt|wc -l
Kent and Stanislav are correct in pointing out the syntax error: filename.txt needs to go on the first egrep, not the second. But this doesn't address the original problem.
Bob's "current way" (4 years ago) was a multi-command approach to grep out different keywords on different lines. In other words, his script was looking for a set of lines containing any of his search terms. The other proposed solutions would only result in lines containing all of his search terms, which does not appear to be his intent.
Instead, he could use a single line egrep to look for any of the terms, like this:
egrep -e 'pattern1|pattern2' filename.txt
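If, on the other hand, the goal is lines that contain all of the terms in any order, a single awk call avoids both the permutation blow-up and the chain of greps; this is a sketch rather than one of the original answers:
awk '/pattern1/ && /pattern2/ && /pattern3/' filename.txt | wc -l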

Remove duplicate lines without sorting [duplicate]

This question already has answers here:
How to delete duplicate lines in a file without sorting it in Unix
(9 answers)
Closed 4 years ago.
I have a utility script in Python:
#!/usr/bin/env python
import sys

unique_lines = []
duplicate_lines = []

for line in sys.stdin:
    if line in unique_lines:
        duplicate_lines.append(line)
    else:
        unique_lines.append(line)
        sys.stdout.write(line)

# optionally do something with duplicate_lines
This simple functionality (uniq without needing to sort first, stable ordering) must be available as a simple UNIX utility, mustn't it? Maybe a combination of filters in a pipe?
Reason for asking: needing this functionality on a system on which I cannot execute Python from anywhere.
The UNIX Bash Scripting blog suggests:
awk '!x[$0]++'
This command tells awk which lines to print. The variable $0 holds the entire contents of a line, and square brackets are array access. So, for each line of the file, the element of array x indexed by that line's content is incremented, and the line is printed if that element was not (!) already set.
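A quick illustration with made-up input:
$ printf '%s\n' b a b c a | awk '!x[$0]++'
b
a
c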
A late answer - I just ran into a duplicate of this - but perhaps worth adding...
The principle behind #1_CR's answer can be written more concisely, using cat -n instead of awk to add line numbers:
cat -n file_name | sort -uk2 | sort -n | cut -f2-
Use cat -n to prepend line numbers
Use sort -u to remove duplicate data (-k2 says 'start at field 2 for the sort key')
Use sort -n to sort by prepended number
Use cut to remove the line numbering (-f2- says 'select field 2 till end')
To remove duplicates across 2 files:
awk '!a[$0]++' file1.csv file2.csv
Michael Hoffman's solution above is short and sweet. For larger files, a Schwartzian transform approach, adding an index field with awk followed by multiple rounds of sort and uniq, involves less memory overhead. The following snippet works in bash:
awk '{print(NR"\t"$0)}' file_name | sort -t$'\t' -k2,2 | uniq --skip-fields 1 | sort -t$'\t' -k1,1n | cut -f2 -d$'\t'
Now you can check out this small tool written in Rust: uq.
It performs uniqueness filtering without having to sort the input first, so it can be applied to a continuous stream.
There are two advantages of this tool over the top-voted awk solution and other shell-based solutions:
uq remembers the occurrence of lines using their hash values, so it doesn't use as much memory when the lines are long.
uq can keep memory usage constant by setting a limit on the number of entries to store (when the limit is reached, a flag controls whether to override or to die), while the awk solution could run into OOM when there are too many lines.
Thanks 1_CR! I needed a "uniq -u" (remove duplicates entirely) rather than uniq (leave 1 copy of duplicates). The awk and perl solutions can't really be modified to do this, but yours can! I may have also needed the lower memory use since I will be uniq'ing something like 100,000,000 lines 8-). Just in case anyone else needs it, I put a "-u" in the uniq portion of the command:
awk '{print(NR"\t"$0)}' file_name | sort -t$'\t' -k2,2 | uniq -u --skip-fields 1 | sort -t$'\t' -k1,1n | cut -f2 -d$'\t'
I just wanted to remove duplicates on consecutive lines, not everywhere in the file. So I used:
awk '{
    if ($0 != PREVLINE) print $0;
    PREVLINE = $0;
}'
Plain uniq handles this case (consecutive duplicates) on its own, and it even works in an alias: http://man7.org/linux/man-pages/man1/uniq.1.html
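For instance, with made-up input containing consecutive duplicates:
$ printf '%s\n' a a b b a | uniq
a
b
a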
