Sorting file content using String value in a certain sequence - bash

I had a jumbled up file content as follows:
13,13,GAME_FINISH,
1,1,GAME_START,
1,1,GROUP_FINISH,
17,17,WAGER,200.00
2,2,GAME_FINISH,
2,2,GAME_START,
22,22,GAME_WIN,290.00
2,2,GROUP_FINISH,
32,32,WAGER,200.00
3,3,GAME_FINISH,
3,3,GAME_START,
.... more lines
I sorted it and currently hold the file content in following format:
1,1,GAME_FINISH,
1,1,GAME_START,
1,1,GROUP_FINISH,
1,1,WAGER,200.00
2,2,GAME_FINISH,
2,2,GAME_START,
2,2,GAME_WIN,290.00
2,2,GROUP_FINISH,
2,2,WAGER,200.00
3,3,GAME_FINISH,
3,3,GAME_START,
3,3,GROUP_FINISH,
3,3,WAGER,200.00
... more lines
But how can I sort it better to obtain following format? 3rd and 4th line may not always exist.
1,1,WAGER,200.00
1,1,GAME_START,
1,1,GAME_WIN,500.00
1,1,BONUS_WIN_1,1100.00
1,1,GAME_FINISH,
1,1,GROUP_FINISH,
2,2, more lines...
For the initial sort, I used
sort -t, -g -k2 nameofunsortedfile.csv >> sortedfile.csv
Added Information:
I want to sort it in this order - Wager, game start, game win, bonus win, game finish, group finish. My current sorted is not in this order. Game win and bonus win may not always be present.
The order I am expecting is not alphabetical, but it is not random either. Every number always has a wager, game_start, game_finish, group_finish sequence; game_win and bonus_win are optional. I am looking for a way to take, for example, the 1,1 records, sort them into the expected sequence above, then move on to 2,2, do the same, and so on.

The most straightforward way to do this with standard UNIX utilities is probably to add an additional field to each line, which encodes the type of record in a way that sorts into the order you want.
declare -A mapping=( ["WAGER"]=1 ["GAME_START"]=2 ["GAME_WIN"]=3 ["BONUS_WIN"]=4 ["GAME_FINISH"]=5 ["GROUP_FINISH"]=6 )
cut -d, -f3 filename.txt | while read; do echo ${mapping["$REPLY"]}; done | paste -d, - filename.txt | sort | sort -s -t, -n -k 2,3 | cut -d, -f 2-
The declare statement declares a mapping that allows you to look up the ordering of each record type. The specific values (1, 2, etc.) don't matter as long as they sort into the order you want; you could use letters or words if you prefer.
Then the next line consists of the following commands:
cut -d, -f3 filename.txt extracts the thing you want to sort by (WAGER or whatever)
while read; do echo ${mapping["$REPLY"]}; done takes each value (WAGER etc.) and replaces it with its corresponding sortable value from the associative array mapping
paste -d, - filename.txt sticks those values back on to the start of each line from filename.txt
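sort | sort -s -t, -n -k 2,3 has the effect of sorting by field 2, then field 3, then field 1 (the one we added): the first sort orders the lines by the added field, and the second, stable (-s) sort keeps that order among lines whose key is equal. sort does accept several -k options, so the same result is also possible in a single command such as sort -t, -k2,2n -k3,3n -k1,1n.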
cut -d, -f 2- strips off the added field, leaving you with your original records, but in sorted order
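The decorate, sort, strip steps above can also be sketched with a single sort call. This is only a sketch under assumptions: the function name sort_events is made up for illustration, event names with a numeric suffix (such as BONUS_WIN_1 in the question) are treated like their base name, and unknown event names are sorted last.

```shell
# Decorate each record with a rank for its event type, sort once on
# (original field 1, original field 2, rank), then strip the rank again.
sort_events() {
  awk -F, -v OFS=, '
    BEGIN {
      # Rank table in the order requested in the question.
      n = split("WAGER GAME_START GAME_WIN BONUS_WIN GAME_FINISH GROUP_FINISH", names, " ")
      for (i = 1; i <= n; i++) rank[names[i]] = i
    }
    {
      key = $3
      sub(/_[0-9]+$/, "", key)             # assumption: BONUS_WIN_1 ranks like BONUS_WIN
      r = (key in rank) ? rank[key] : 99   # assumption: unknown types go last
      print r, $0                          # the rank becomes the new first field
    }' "$@" |
  LC_ALL=C sort -t, -k2,2n -k3,3n -k1,1n |
  cut -d, -f2-
}
```

It reads file arguments or standard input, for example: sort_events nameofunsortedfile.csv > sortedfile.csv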

Perl to the rescue:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $i = 1;
my %order = map { $_ => $i++ }
qw( WAGER GAME_START GAME_WIN BONUS_WIN GAME_FINISH GROUP_FINISH );
chomp( my @lines = <> );
say join ',', @$_ for sort {
$a->[0] <=> $b->[0]
|| $order{ $a->[2] } <=> $order{ $b->[2] }
} map [ split /,/ ], @lines;
The sort block tells Perl to first sort by the first column, and if the values are the same, use the "order" corresponding to the third one.

Related

Sort by multiple conditions ascending and descending in bash

I have the following issue. I have a file containing name,surname,age,mood. I need to sort this file by age (descending). If the age is the same, then sort it by surname (ascending).
I use this: cat $FILE |sort -r -n -t"," -k3,3 -k2,2 > "$HOME"/people.txt But -r sorts both descending. How can I sort by surname ascending, please?
By default sort will perform the sort in ascending order, the -r flag will perform the sort in descending order; the -r flag can be applied to individual -k directives when you need to use a mix of ascending and descending, eg:
$ cat raw.dat
1,2,4,5
1,2,7,5
1,2,9,5
1,2,3,5
1,3,7,5
1,1,7,5
Sort by column #3 (descending) and then column #2 (ascending):
$ sort -t"," -k3nr -k2n raw.dat
1,2,9,5
1,1,7,5
1,2,7,5
1,3,7,5
1,2,4,5
1,2,3,5
NOTES:
thanks to Ted Lyngmo for adding the n flag to properly handle numerics
if data could contain a mix of characters and numerics the n may need to be replaced depending on desired sort method (eg, V)
key takeaway is that quite a few of the sort flags can be applied at the -key level
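Applied to the file from the question (name,surname,age,mood), the per-key flags might look like the sketch below. The function name sort_people is hypothetical, and the surname is compared as plain text in the C locale.

```shell
# Age (field 3) numeric and descending, then surname (field 2) ascending.
# The r flag is attached to the first key only, so it does not affect the second.
sort_people() {
  LC_ALL=C sort -t, -k3,3nr -k2,2 "$@"
}
```

Usage in the spirit of the original command: sort_people "$FILE" > "$HOME"/people.txt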

How to sort array of strings by function in shell script

I have the following list of strings in shell script:
something-7-5-2020.dump
another-7-5-2020.dump
anoter2-6-5-2020.dump
another-4-5-2020.dump
another2-4-5-2020.dump
something-2-5-2020.dump
another-2-5-2020.dump
8-1-2021
26-1-2021
20-1-2021
19-1-2021
3-9-2020
29-9-2020
28-9-2020
24-9-2020
1-9-2020
6-8-2020
20-8-2020
18-8-2020
12-8-2020
10-8-2020
7-7-2020
5-7-2020
27-7-2020
7-6-2020
5-6-2020
23-6-2020
18-6-2020
28-5-2020
26-5-2020
9-12-2020
28-12-2020
15-12-2020
1-12-2020
27-11-2020
20-11-2020
19-11-2020
18-11-2020
1-11-2020
11-11-2020
31-10-2020
29-10-2020
27-10-2020
23-10-2020
21-10-2020
15-10-2020
23-09-2020
So my goal is to sort them by date, but it's in dd-mm-yyyy and d-m-yyyy format and sometimes there's a word before like word-dd-mm-yyyy. I would like to create a function to sort the values like any other language so it ignores the first word, casts the date to a common format and compares that format. In javascript it would be something like:
arrayOfStrings.sort((a, b) => functionToOrderStrings())
My code to obtain the array is the following:
dumps=$(gsutil ls gs://organization-dumps/ambient | sed "s:gs\://organization-dumps/ambient/::" | sed '/^$/d' | sed 's:/$::' | sort --reverse --key=3 --key=2 --key=1 --field-separator=-)
echo "$dumps"
I would like to say that I've already searched for this on Stack Overflow and none of the answers helped me, because all of them are oriented towards sorting dates that are already in a proper format, and that's not my case.
If you have the results in a pipeline, involving an array seems completely superfluous here.
You can apply a technique called a Schwartzian transform: add a prefix to each line with a normalized version of the data so it can be easily sorted, then sort, then discard the prefix.
I'm guessing something like the following:
gsutil ls gs://organization-dumps/ambient |
awk '{ sub("gs:\/\/organization-dumps/ambient/", "");
if (! $0) next;
sub("/$", "");
d = $0;
sub(/^[^0-9][^-]*-/, "", d);
sub(/[^0-9]*$/, "", d);
split(d, w, "-");
printf "%04i-%02i-%02i\t%s\n", w[3], w[2], w[1], $0 }' |
sort | cut -f2-
In so many words, we are adding a tab-delimited field in front of every line, then sorting on that, then discarding the first field with cut -f2-. The field extraction contains some assumptions which seem to be valid for your test data, but may need additional tweaking if you have real data with corner cases like if the label before the date could sometimes contain a number with dashes around it, too.
If you want to capture the result in a variable, like in your original code, that's easy to do; but usually, you should just run everything in a pipeline.
Notice that I factored your multiple sed scripts into the Awk script, too, some of that with a fair amount of guessing as to what the input looks like and what the sed scripts were supposed to accomplish. (Perhaps also note that sed, like Awk, is a scripting language; to run several sed commands on the same input, just put them after each other in the same sed script.)
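The same prefix, sort, discard idea can be wrapped in a small function that works on names that are already cleaned up. This is a minimal sketch, assuming the date is always the last three dash-separated parts of the name (before an optional extension such as .dump) and that names contain no tab characters; sort_by_date is a made-up name.

```shell
# Prefix each line with a zero-padded YYYY-MM-DD key, sort on it, drop the key.
sort_by_date() {
  awk '{
      d = $0
      sub(/\.[^.]*$/, "", d)      # drop a trailing extension such as .dump
      n = split(d, w, "-")        # the date is the last three dash-separated parts
      printf "%04d-%02d-%02d\t%s\n", w[n], w[n-1], w[n-2], $0
  }' | LC_ALL=C sort | cut -f2-
}
```

Because the key is zero-padded and ordered year, month, day, a plain text sort is enough. Lines with the same date keep both entries, ordered by name.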
1. Preprocess the input into the format you want for sorting.
2. Sort.
3. Remove the artifacts from step 1.
The following:
sed -E '
# extract the date and put it in first column separated by tab
# this could be better, its just an example
s/(.*-)?([0-9]?[0-9]-[0-9]?[0-9]-[0-9]{4})/\2\t&/;
# If day is a single digit, add a zero in front
s/^([0-9]-)/0\1/;
# If month is a single digit, add a zero in front
s/^([0-9][0-9]-)([0-9]-)/\10\2/
# year in front? no idea - shuffle the way you want
s/([0-9]{2})-([0-9]{2})-([0-9]{4})/\3-\2-\1/
' input.txt | sort | cut -f2-
outputs:
another-2-5-2020.dump
something-2-5-2020.dump
another-4-5-2020.dump
another2-4-5-2020.dump
anoter2-6-5-2020.dump
another-7-5-2020.dump
something-7-5-2020.dump
26-5-2020
28-5-2020
5-6-2020
7-6-2020
18-6-2020
23-6-2020
5-7-2020
7-7-2020
27-7-2020
6-8-2020
10-8-2020
12-8-2020
18-8-2020
20-8-2020
1-9-2020
3-9-2020
23-09-2020
24-9-2020
28-9-2020
29-9-2020
15-10-2020
21-10-2020
23-10-2020
27-10-2020
29-10-2020
31-10-2020
1-11-2020
11-11-2020
18-11-2020
19-11-2020
20-11-2020
27-11-2020
1-12-2020
9-12-2020
15-12-2020
28-12-2020
8-1-2021
19-1-2021
20-1-2021
26-1-2021
Using GNU awk:
gsutil ls gs://organization-dumps/ambient | awk '{ match($0,/[[:digit:]]{1,2}-[[:digit:]]{1,2}-[[:digit:]]{4}/);dayt=substr($0,RSTART,RLENGTH);split(dayt,map,"-");length(map[1])==1?map[1]="0"map[1]:map[1]=map[1];length(map[2])==1?map[2]="0"map[2]:map[2]=map[2];map1[mktime(map[3]" "map[2]" "map[1]" 00 00 00")]=$0 } END { PROCINFO["sorted_in"]="@ind_num_asc";for (i in map1) { print map1[i] } }'
Explanation:
gsutil ls gs://organization-dumps/ambient | awk '{
match($0,/[[:digit:]]{1,2}-[[:digit:]]{1,2}-[[:digit:]]{4}/); # Check that lines contain a date
dayt=substr($0,RSTART,RLENGTH); # Extract the date
split(dayt,map,"-"); # Split the date in the array map based on "-" as the delimiter
length(map[1])==1? map[1]="0"map[1]:map[1]=map[1];length(map[2])==1?map[2]="0"map[2]:map[2]=map[2]; # Pad the month and day with "0" if required
map1[mktime(map[3]" "map[2]" "map[1]" 00 00 00")]=$0 # Get the epoch format date based on the values in the map array and use this for the index of the array map1 with the line as the value
}
END {
PROCINFO["sorted_in"]="@ind_num_asc"; # Set the ordering of the array
for (i in map1) {
print map1[i] # Loop through map1 and print the values (lines)
}
}'
Using GNU awk, you can do this fairly easily:
awk 'BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"; FS="."}
{n=split($1,t,"-"); a[t[n]*10000 + t[n-1]*100 + t[n-2]]=$0}
END {for(i in a) print a[i]}' file
Essentially, we are asking GNU awk to traverse an array by index in ascending numeric order. For each line read, we extract the date. The date is always located before the dot and is thus always in field 1 when the dot is the field separator (FS="."). We split the first field on the hyphen and use the number of parts to pick out the date from the end. We then convert the date to a single number (YYYY*10000 + MM*100 + DD, which orders correctly because DD < 100 and MM*100 < 10000) and ask awk to traverse the array by that number. Note that lines sharing the same date map to the same index, so only the last of them is kept.
It is now possible to combine the full pipe-line in a single awk:
$ gsutil ls gs://organization-dumps/ambient \
| awk 'BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"; FS="."}
{sub("gs://organization-dumps/ambient/",""); sub("/$","")}
(NF==0){next}
{n=split($1,t,"-"); a[t[n]*10000 + t[n-1]*100 + t[n-2]]=$0}
END {for(i in a) print a[i]}'

How to sort release version string in descending order with Bash

I have a list of release version strings that looks something like this:
releases=( "1.3.1243" "2.0.1231" "0.8.4454" "1.2.4124" "1.2.3231" "0.9.5231" )
How can I use bash to sort my releases array in descending order (so that the value on the left has the highest precedence)?
So after sorting, the example above would be in the following order:
"2.0.1231", "1.3.1243", "1.2.4124", "1.2.3231", "0.9.5231", "0.8.4454"
You can actually do it quite easily with command substitution and the version sort option to sort, e.g.
releases=($(printf "%s\n" "${releases[@]}" | sort -rV))
(note: the printf-trick simply separates the elements on separate lines so they can be piped to sort for sorting. printf "%s\n", despite having only one "%s" conversion specifier, will process all input)
Now releases contains:
releases=("2.0.1231" "1.3.1243" "1.2.4124" "1.2.3231" "0.9.5231" "0.8.4454")
releases=( "1.3.1243" "2.0.1231" "0.8.4454" "1.2.4124" "1.2.3231" "0.9.5231" )
sorted=( $(echo ${releases[*]} | sed 's/ /\n/g' | sort -t. -k1,1rn -k2,2rn -k3,3rn) )
echo ${sorted[*]}
This uses sed and sort to reverse sort the items, using . as the field separator, and treating each field as numeric:
2.0.1231 1.3.1243 1.2.4124 1.2.3231 0.9.5231 0.8.4454
releases=( "1.3.1243" "2.0.1231" "0.8.4454" "1.2.4124" "1.2.3231" "0.9.5231" )
readarray -t sorted < <(printf '%s\n' "${releases[@]}" | sort -Vr)
declare -p sorted
declare -a sorted=([0]="2.0.1231" [1]="1.3.1243" [2]="1.2.4124" [3]="1.2.3231" [4]="0.9.5231" [5]="0.8.4454")
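The printf trick and the per-field numeric sort from the answers above can be combined into one helper. This is a sketch only: sort_versions_desc is a hypothetical name, and it assumes every version has exactly three numeric components. It avoids -V, which is not part of POSIX sort.

```shell
# Print the arguments one per line (printf reuses its format for every
# argument), then sort each dot-separated field numerically, descending.
sort_versions_desc() {
  printf '%s\n' "$@" | LC_ALL=C sort -t. -k1,1nr -k2,2nr -k3,3nr
}
```

With a bash array this would be called as sort_versions_desc "${releases[@]}", and the output can be read back into an array with readarray as shown above.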

bash sort quoted csv files by numeric key

I have the following input csv file:
"aaa","1","xxx"
"ccc, Inc.","6100","yyy"
"bbb","609","zzz"
I wish to sort by the second column as numbers.
I tried
sort --field-separator=',' --key=2n
The problem is that, since all values are quoted, they don't get sorted correctly by the -n (numeric) option. Is there a solution?
A little trick, which uses a double quote as the separator:
sort --field-separator='"' --key=4 -n
For a quoted csv use a language that has a proper csv parser. Here is an example using perl.
perl -MText::ParseWords -lne '
chomp;
push @line, [ parse_line(",", 0, $_) ];
}{
@line = sort { $a->[1] <=> $b->[1] } @line;
for (@line) {
local $" = qw(",");
print qq("@$_");
}
' file
Output:
"aaa","1","xxx"
"bbb","609","zzz"
"ccc, Inc.","6100","yyy"
Explanation:
Remove the newline from the input using the chomp function.
Using the Text::ParseWords module, parse the quoted line and store the fields, without the quotes, in an array of arrays.
In the END block, sort the array of arrays on the second column and assign the result back to the original array of arrays.
For every item in the array of arrays, set the output list separator to "," and print it with a leading and trailing " to recreate the lines in the original format.
Dropping your example into a file called sort2.txt I found the following to work well.
sort -t'"' -k4n sort2.txt
Using sort with the following options (thank you for the refinements, Jonathan):
-t'"' sets the field separator to a single character, here the double quote, written inside single quotes.
-k4 selects the sort key: with " as the separator, the number is in the fourth field.
-n sorts numerically.
The file name is passed to sort directly, so there is no need to chain commands.
Hope this helps!
There isn't going to be a really simple solution. If you make some reasonable assumptions, then you could consider:
sed 's/","/^A/g' input.csv |
sort -t'^A' -k 2n |
sed 's/^A/","/g'
This replaces the "," sequence with Control-A (shown as ^A in the code), then uses that as the field delimiter in sort (the numeric sort on column 2), and then replaces the Control-A characters with "," again.
If you use bash, you can use the ANSI C quoting mechanism $'\1' to embed the control characters visibly into the script; you just have to finish the single-quoted string before the escape, and restart it afterwards:
sed 's/","/'$'\1''/g' input.csv |
sort -t$'\1' -k 2n |
sed 's/'$'\1''/","/g'
Or play with double quotes instead of single quotes, but that gets messy because of the double quotes that you are replacing. But you can simply type the characters verbatim and editors like vim will be happy to show them to you.
Sometimes the values in the CSV file are optionally quoted, only when necessary. In this case, using " as a separator is not reliable.
Example:
"Forest fruits",198
Apples,456
bananas,67
Using awk, sort and cut, you can sort the original file, here by the first column :
awk -F',' '{
a = $1; # or the column index you want
gsub(/(^"|"$)/, "", a);
print a","$0
}' file.csv | sort -k1 | cut -d',' -f1 --complement
This will bring the column you want to sort on in front without quotes, then sort it the way you want, and remove this column at the end.
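For the fully quoted file from the question, the same "copy the key to the front, sort, remove it" approach might look like the sketch below. It assumes every field is quoted, that the key column is neither the first nor the last one, and that the data contains no tab characters; sort_quoted_csv_by_col2 is a made-up name.

```shell
# Split on the three-character sequence "," so commas inside quoted values
# (as in "ccc, Inc.") do not count as separators, then sort on the copied key.
sort_quoted_csv_by_col2() {
  awk -F'","' '{ print $2 "\t" $0 }' "$@" |
  LC_ALL=C sort -t "$(printf '\t')" -k1,1n |
  cut -f2-
}
```

The original lines pass through unchanged, quotes included, because only the temporary first column is added and removed.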

How can I sort after using a delimiter on the last field in bash scripting

for example
suppose that from a command let's call it "previous" we get a result, this result contains lines of text
now before printing out this text, I want to use the sort command in order to sort it using a delimiter.
in this case the delimiter is "*"
the thing is, I always want to sort on the last field for example if a line is like that
text*text***text*********text..*numberstext
I want my sort to sort using the last field, in this case on numberstext
if all lines were as the line I just posted, then it would be easy
I can just count the fields that are being created when using a delimiter(suppose we have N fields) and then apply this command
previous command | sort -t * -k N -n
but not all lines are in the same form, some line can be like that:
text:::***:*numberstext
as you can see, I always want to sort using the last field
basically I'm looking for a method to find the last field when using as a delimiter the character *
I was thinking that it might be like that
previous command | sort -t * -k $some_variable_denoting_the_ammount_of_fields -n
but I'm not sure if there's anything like that..
thanks :)
Use sed to duplicate the final field at the start of the line, sort, then use sed to remove the duplicate. Probably simpler to use your favourite programming language though.
Here is a perl script for it:
#!/usr/bin/perl
use strict;
use warnings;
my $regex = qr/\*([^*]*)$/o;
sub bylast
{
my $ak = ($a =~ $regex, $1) || "";
my $bk = ($b =~ $regex, $1) || "";
$ak cmp $bk;
}
print for sort bylast (<>);
This might work:
sed -r 's/.*\*([^*]+$)/\1###&/' source | sort | sed 's/^.*###//'
Add the last field to the front, sort, then delete the sort key. N.B. ### can be anything you like as long as it does not occur in the source file.
Credit should go to @Havenless; this is just his idea put into code.
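The same idea can be sketched with awk, which exposes the last field directly as $NF regardless of how many fields a line has. Assumptions: the input contains no tab characters, the last field starts with the number to sort on, and sort_by_last_field is a hypothetical name.

```shell
# Copy the last *-separated field to the front, sort numerically on that
# copy, then remove it again. [*] is a bracket expression matching a literal *.
sort_by_last_field() {
  awk -F'[*]' '{ print $NF "\t" $0 }' "$@" |
  LC_ALL=C sort -t "$(printf '\t')" -k1,1n |
  cut -f2-
}
```

In the question's terms this would be used as: previous command | sort_by_last_field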

Resources