How can I sort on the last field after splitting on a delimiter in bash?

For example, suppose that a command, let's call it "previous", produces a result consisting of lines of text.
Before printing this text, I want to sort it using a delimiter, in this case "*".
The thing is, I always want to sort on the last field. For example, if a line looks like this:
text*text***text*********text..*numberstext
I want to sort on the last field, in this case numberstext.
If all lines looked like the one I just posted, it would be easy: I could count the fields created by the delimiter (suppose we have N fields) and apply this command:
previous command | sort -t '*' -k N -n
But not all lines have the same form; a line can also look like this:
text:::***:*numberstext
As you can see, I always want to sort on the last field. Basically, I'm looking for a way to find the last field when using the character * as a delimiter.
I was thinking it might look like this:
previous command | sort -t '*' -k $some_variable_denoting_the_amount_of_fields -n
but I'm not sure anything like that exists.
Thanks :)

Use sed to duplicate the final field at the start of each line, sort, then use sed to remove the duplicate. It's probably simpler to use your favourite scripting language, though.
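A minimal sketch of that decorate-sort-undecorate idea, using awk instead of sed (the sample lines are made up): awk's $NF is always the last field regardless of how many delimiters a line has, so we prepend it as a tab-separated sort key, sort on the key, then cut the key off.

```shell
printf '%s\n' 'text*text*9xyz' 'a*1abc' 'b**5def' |
  awk -F'*' '{ print $NF "\t" $0 }' |  # $NF = last '*'-delimited field
  sort -k1,1 |                         # sort on the prepended key only
  cut -f2-                             # strip the key again
```

Add -n to the sort key (-k1,1n) if the last fields should compare as numbers rather than strings.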

Here is a perl script for it:
#!/usr/bin/perl
use strict;
use warnings;

my $regex = qr/\*([^*]*)$/;

sub bylast
{
    # capture the text after the last '*'; fall back to "" when the line has none
    # (a list-context match avoids comparing against a stale $1)
    my ($ak) = $a =~ $regex;
    my ($bk) = $b =~ $regex;
    ($ak // "") cmp ($bk // "");
}

print for sort bylast (<>);

This might work:
sed -r 's/.*\*([^*]+$)/\1###&/' source | sort | sed 's/^.*###//'
Add the last field to the front, sort, then delete the sort key. N.B. ### can be anything you like, as long as it does not occur in the source file.
Credit should go to @Havenless; this is just his idea put into code.

Related

How to sort array of strings by function in shell script

I have the following list of strings in shell script:
something-7-5-2020.dump
another-7-5-2020.dump
anoter2-6-5-2020.dump
another-4-5-2020.dump
another2-4-5-2020.dump
something-2-5-2020.dump
another-2-5-2020.dump
8-1-2021
26-1-2021
20-1-2021
19-1-2021
3-9-2020
29-9-2020
28-9-2020
24-9-2020
1-9-2020
6-8-2020
20-8-2020
18-8-2020
12-8-2020
10-8-2020
7-7-2020
5-7-2020
27-7-2020
7-6-2020
5-6-2020
23-6-2020
18-6-2020
28-5-2020
26-5-2020
9-12-2020
28-12-2020
15-12-2020
1-12-2020
27-11-2020
20-11-2020
19-11-2020
18-11-2020
1-11-2020
11-11-2020
31-10-2020
29-10-2020
27-10-2020
23-10-2020
21-10-2020
15-10-2020
23-09-2020
So my goal is to sort them by date, but the dates are in dd-mm-yyyy and d-m-yyyy format, and sometimes there's a word in front, like word-dd-mm-yyyy. I would like to create a function to sort the values like in any other language, so that it ignores the first word, casts the date to a common format and compares that format. In JavaScript it would be something like:
arrayOfStrings.sort((a, b) => functionToOrderStrings())
My code to obtain the array is the following:
dumps=$(gsutil ls gs://organization-dumps/ambient | sed "s:gs\://organization-dumps/ambient/::" | sed '/^$/d' | sed 's:/$::' | sort --reverse --key=3 --key=2 --key=1 --field-separator=-)
echo "$dumps"
I'd like to mention that I've already searched Stack Overflow and none of the answers helped, because all of them are oriented toward sorting dates that are already in a consistent format, which is not my case.
If you have the results in a pipeline, involving an array seems completely superfluous here.
You can apply a technique called a Schwartzian transform: add a prefix to each line with a normalized version of the data so it can be easily sorted, then sort, then discard the prefix.
I'm guessing something like the following:
gsutil ls gs://organization-dumps/ambient |
awk '{ sub("gs://organization-dumps/ambient/", "");
       if (! $0) next;
       sub("/$", "");
       d = $0;
       sub(/^[^0-9][^-]*-/, "", d);
       sub(/[^0-9]*$/, "", d);
       split(d, w, "-");
       printf "%04i-%02i-%02i\t%s\n", w[3], w[2], w[1], $0 }' |
sort -n | cut -f2-
In so many words, we are adding a tab-delimited field in front of every line, then sorting on that, then discarding the first field with cut -f2-. The field extraction contains some assumptions which seem to be valid for your test data, but may need additional tweaking if you have real data with corner cases like if the label before the date could sometimes contain a number with dashes around it, too.
If you want to capture the result in a variable, like in your original code, that's easy to do; but usually, you should just run everything in a pipeline.
Notice that I factored your multiple sed scripts into the Awk script, too, some of that with a fair amount of guessing as to what the input looks like and what the sed scripts were supposed to accomplish. (Perhaps also note that sed, like Awk, is a scripting language; to run several sed commands on the same input, just put them after each other in the same sed script.)
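For instance, the three sed invocations from the original pipeline can be collapsed into a single sed process by passing several -e expressions (the sample input line is made up; expressions run in order on each line):

```shell
echo 'gs://organization-dumps/ambient/some-dump/' |
  sed -e 's:gs\://organization-dumps/ambient/::' \
      -e '/^$/d' \
      -e 's:/$::'
# strips the bucket prefix, drops empty lines, removes the trailing slash
```

This spawns one process instead of three and keeps the whole cleanup in one place.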
1. Preprocess the input into the format you want it to be in for sorting.
2. Sort.
3. Remove the artifacts from step 1.
The following:
sed -E '
# extract the date and put it in first column separated by tab
# this could be better, it's just an example
s/(.*-)?([0-9]?[0-9]-[0-9]?[0-9]-[0-9]{4})/\2\t&/;
# If day is a single digit, add a zero in front
s/^([0-9]-)/0\1/;
# If month is a single digit, add a zero in front
s/^([0-9][0-9]-)([0-9]-)/\10\2/
# year in front? no idea - shuffle the way you want
s/([0-9]{2})-([0-9]{2})-([0-9]{4})/\3-\2-\1/
' input.txt | sort | cut -f2-
outputs:
another-2-5-2020.dump
something-2-5-2020.dump
another-4-5-2020.dump
another2-4-5-2020.dump
anoter2-6-5-2020.dump
another-7-5-2020.dump
something-7-5-2020.dump
26-5-2020
28-5-2020
5-6-2020
7-6-2020
18-6-2020
23-6-2020
5-7-2020
7-7-2020
27-7-2020
6-8-2020
10-8-2020
12-8-2020
18-8-2020
20-8-2020
1-9-2020
3-9-2020
23-09-2020
24-9-2020
28-9-2020
29-9-2020
15-10-2020
21-10-2020
23-10-2020
27-10-2020
29-10-2020
31-10-2020
1-11-2020
11-11-2020
18-11-2020
19-11-2020
20-11-2020
27-11-2020
1-12-2020
9-12-2020
15-12-2020
28-12-2020
8-1-2021
19-1-2021
20-1-2021
26-1-2021
Using GNU awk:
gsutil ls gs://organization-dumps/ambient | awk '{ match($0,/[[:digit:]]{1,2}-[[:digit:]]{1,2}-[[:digit:]]{4}/);dayt=substr($0,RSTART,RLENGTH);split(dayt,map,"-");length(map[1])==1?map[1]="0"map[1]:map[1]=map[1];length(map[2])==1?map[2]="0"map[2]:map[2]=map[2];map1[mktime(map[3]" "map[2]" "map[1]" 00 00 00")]=$0 } END { PROCINFO["sorted_in"]="#ind_num_asc";for (i in map1) { print map1[i] } }'
Explanation:
gsutil ls gs://organization-dumps/ambient | awk '{
    match($0,/[[:digit:]]{1,2}-[[:digit:]]{1,2}-[[:digit:]]{4}/)   # Check that the line contains a date
    dayt=substr($0,RSTART,RLENGTH)                                 # Extract the date
    split(dayt,map,"-")                                            # Split the date into the array map on "-"
    if (length(map[1])==1) { map[1]="0"map[1] }                    # Pad the day with "0" if required
    if (length(map[2])==1) { map[2]="0"map[2] }                    # Pad the month with "0" if required
    map1[mktime(map[3]" "map[2]" "map[1]" 00 00 00")]=$0           # Use the epoch timestamp built from map as the
                                                                   # index of map1, with the whole line as the value
}
END {
    PROCINFO["sorted_in"]="#ind_num_asc"                           # Set the traversal order of the array
    for (i in map1) {
        print map1[i]                                              # Loop through map1 and print the values (lines)
    }
}'
Note that two lines carrying the same date would share an index and overwrite each other in map1.
Using GNU awk, you can do this fairly easy:
awk 'BEGIN{PROCINFO["sorted_in"]="#ind_num_asc"; FS="."}
{n=split($1,t,"-"); a[t[n]*10000 + t[n-1]*100 + t[n-2]]=$0}
END {for(i in a) print a[i]}' file
Essentially, we are asking GNU awk to traverse an array by index in ascending numeric order. Per line read, we extract the date. The date is always located before the <dot>-character and thus always in field 1 if the dot is the field separator (FS="."). We split the first field by the hyphen and use the total number of fields to extract the date. We convert the date simplistically to some number (YYYY*10000+MM*100+DD; DD<100 && MM*100 < 10000) and ask awk to sort it by that number.
It is now possible to combine the full pipe-line in a single awk:
$ gsutil ls gs://organization-dumps/ambient \
| awk 'BEGIN{PROCINFO["sorted_in"]="#ind_num_asc"; FS="."}
{sub("gs://organization-dumps/ambient/",""); sub("/$","")}
(NF==0){next}
{n=split($1,t,"-"); a[t[n]*10000 + t[n-1]*100 + t[n-2]]=$0}
END {for(i in a) print a[i]}'

how to sort a file in bash whose lines are in a particular format?

all lines in file.txt are in the following format:
player16:level8|2200 Points
player99:level8|19000 Points
player23:level8|260 Points
how can I sort this file based on points? looking for the following output
player99:level8|19000 Points
player16:level8|2200 Points
player23:level8|260 Points
Any help would be greatly appreciated. Thank you.
sort is designed for this task
sort -t'|' -k2nr file
set the delimiter to | and sort by the second field numerical reverse order
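Fed the sample data, the one-liner looks like this (using printf here just to stand in for the file):

```shell
printf '%s\n' \
  'player16:level8|2200 Points' \
  'player99:level8|19000 Points' \
  'player23:level8|260 Points' |
  sort -t'|' -k2nr
# prints player99 (19000), then player16 (2200), then player23 (260)
```

Writing the key as -k2,2nr would restrict it to field 2 exactly rather than "field 2 to end of line"; with this data the two are equivalent because field 2 is the last field.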
You've tagged it as perl, so I'll add a Perl-ish answer.
Perl's sort function lets you specify an arbitrary comparison criterion, provided you return positive/negative/zero depending on relative order. By default the <=> operator does this numerically, and the cmp operator alphabetically.
sort works by setting $a and $b to each pair of elements of the list in turn and calling the comparison function for each pair.
So for your scenario - we craft a function that regex matches each line, and extracts the points value:
sub sort_by_points {
    # $a and $b are provided by sort.
    # We regex-capture one or more digits followed by "Points".
    my ($a_points) = $a =~ m/(\d+) Points/;
    my ($b_points) = $b =~ m/(\d+) Points/;
    # Compare the points - note $b comes first, because we're sorting descending.
    return $b_points <=> $a_points;
}
And then you use that function by calling sort with it.
#!/usr/bin/env perl
use strict;
use warnings;
sub sort_by_points {
my ($a_points)= $a =~ m/(\d+) Points/;
my ($b_points) = $b =~ m/(\d+) Points/;
return $b_points <=> $a_points;
}
# reads from the special __DATA__ filehandle.
chomp( my @list = <DATA> );
my @sorted_list = sort { sort_by_points() } @list;
print join "\n", @sorted_list;
__DATA__
player16:level8|2200 Points
player99:level8|19000 Points
player23:level8|260 Points
For your purposes, you can use <> as your input, because that's the magic file handle - arguments on command line, or data piped through STDIN (might sound weird, but it's the same thing as sed/grep/awk do)
If you want a Perl solution:
perl -F'\|' -ane'push@a,[$_,$F[1]]}{print$_->[0]for sort{$b->[1]<=>$a->[1]}@a' file.txt
but using sort is far simpler
sort -t'|' -k2nr file.txt

bash sort quoted csv files by numeric key

I have the following input csv file:
"aaa","1","xxx"
"ccc, Inc.","6100","yyy"
"bbb","609","zzz"
I wish to sort by the second column as numbers,
I tried
sort --field-separator=',' --key=2n
the problem is that since all values are quoted, they don't get sorted correctly by -n (numeric) option. is there a solution?
A little trick, which uses a double quote as the separator:
sort --field-separator='"' --key=4 -n
For a quoted csv use a language that has a proper csv parser. Here is an example using perl.
perl -MText::ParseWords -lne '
    chomp;
    push @line, [ parse_line(",", 0, $_) ];
    }{
    @line = sort { $a->[1] <=> $b->[1] } @line;
    for (@line) {
        local $" = q(",");
        print qq("@$_");
    }
' file
Output:
"aaa","1","xxx"
"bbb","609","zzz"
"ccc, Inc.","6100","yyy"
Explanation:
Remove the newline from the input using the chomp function.
Using the module Text::ParseWords, parse the quoted line and store it in an array of arrays, without the quotes.
In the END block, sort the array of arrays on the second column and assign the result back.
For every item in our array of arrays, set the output list separator to "," and print the row with a leading and trailing " to recreate the lines in their original format.
Dropping your example into a file called sort2.txt, I found the following to work well:
sort -t'"' -k4n sort2.txt
The options (thank you for the refinements, Jonathan):
-t'"' use the double quote as the field separator instead of the default blank-to-nonblank transition.
-k4 sort on the fourth "-delimited field, i.e. the number between the quotes.
-n numeric sort.
Passing the file name directly to sort avoids an unnecessary cat at the front of the pipeline.
Hope this helps!
There isn't going to be a really simple solution. If you make some reasonable assumptions, then you could consider:
sed 's/","/^A/g' input.csv |
sort -t'^A' -k 2n |
sed 's/^A/","/g'
This replaces the "," sequence with Control-A (shown as ^A in the code), then uses that as the field delimiter in sort (the numeric sort on column 2), and then replace the Control-A characters with "," again.
If you use bash, you can use the ANSI C quoting mechanism $'\1' to embed the control characters visibly into the script; you just have to finish the single-quoted string before the escape, and restart it afterwards:
sed 's/","/'$'\1''/g' input.csv |
sort -t$'\1' -k 2n |
sed 's/'$'\1''/","/g'
Or play with double quotes instead of single quotes, but that gets messy because of the double quotes you are replacing. Alternatively, you can simply type the control characters verbatim; editors like vim will happily show them to you.
Sometimes the values in a CSV file are quoted only when necessary. In that case, using " as a separator is not reliable.
Example:
"Forest fruits",198
Apples,456
bananas,67
Using awk, sort and cut, you can sort the original file, here by the first column :
awk -F',' '{
a = $1; # or the column index you want
gsub(/(^"|"$)/, "", a);
print a","$0
}' file.csv | sort -k1 | cut -d',' -f1 --complement
This will bring the column you want to sort on in front without quotes, then sort it the way you want, and remove this column at the end.

awk split on a different token

I am trying to initialize an array from a string split using awk.
I am expecting the tokens be delimited by ",", but somehow they don't.
The input is a string returned by curl from the address http://www.omdbapi.com/?i=&t=the+campaign
I've tried to remove any extra carriage returns or anything else that could cause confusion, and in all the clients I have checked it looks like a single-line string.
{"Title":"The Campaign","Year":"2012","Rated":"R", ...
and this is the output (note the misplaced -metadata in the middle):
-metadata {"Title":"The -metadata Campaign","Year":"2012","Rated":"R",...
It should have been
-metadata {"Title":"The Campaign"
Here's my piece of code:
__tokens=($(echo $omd_response | awk -F ',' '{print}'))
for i in "${__tokens[@]}"
do
    echo "-metadata" "$i"
done
Any help is welcome
I would take seriously the comment by @cbuckley: use a JSON-aware tool rather than trying to parse the line with simple string tools. Otherwise, your script will break if a quoted string has a comma inside it, for example.
At any event, you don't need awk for this exercise, and it isn't helping you because the way awk breaks the string up is only of interest to awk. Once the string is printed to stdout, it is still the same string as always. If you want the shell to use , as a field delimiter, you have to tell the shell to do so.
Here's one way to do it:
(
    OLDIFS=$IFS
    IFS=,
    tokens=($omd_response)
    IFS=$OLDIFS
    for token in "${tokens[@]}"; do
        # something with token
    done
)
The ( and ) are just to execute all that in a subshell, making the shell variables temporaries. You can do it without.
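A subshell-free variant (the sample string here is made up for illustration): in bash, read -ra with a per-command IFS splits the string into an array without ever touching the global IFS.

```shell
omd_response='{"Title":"The Campaign","Year":"2012","Rated":"R"}'

# IFS=',' applies only to the read command itself;
# -r disables backslash escapes, -a fills the array.
IFS=',' read -ra tokens <<< "$omd_response"

for token in "${tokens[@]}"; do
    printf -- '-metadata %s\n' "$token"
done
```

Because the comma splitting happens inside read rather than during word splitting, tokens containing spaces survive intact.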
First, please accept my apologies: I don't have a recent bash at hand, so I can't try the code below (no arrays!).
But it should work; if not, you should be able to tweak it (or ask below, providing a little context on what you see, and I'll help fix it).
nb_fields=$(echo "${omd_response}" | tr ',' '\n' | wc -l | awk '{ print $1 }')
# nb_fields will be correct UNLESS ${omd_response} contains a trailing ",",
# in which case it would be 1 too big, and the loop below would create an empty
# __tokens[last_one], giving an extra `-metadata ""`. Easily corrected if it happens.
# The code below assumes there is at least 1 field... You should maybe check that.

# 1) create the __tokens[] array
for field in $( seq 1 $nb_fields )
do
    # optional: if field is 1 or $nb_fields, add processing to get rid of the { or } ?
    __tokens[$field]=$(echo "${omd_response}" | cut -d ',' -f ${field})
done

# 2) use the array to output what we want
for i in $( seq 1 $nb_fields )
do
    printf '-metadata "%s" ' "${__tokens[$i]}"
    # outputs everything on one line.
    # You could add a \n just before the last ' so each token goes on its own line.
done
so I loop on field numbers, instead of on what could be some space-or-tab separated values

Shell script to extract data from file between two date ranges

I have a huge file, with each line starting with a timestamp as shown below. I need a way to grep lines between two dates. Is there an easy way to do this using sed or awk, instead of extracting the date fields from each line and comparing day/month/year?
For example, I need to extract data between 2013-06-01 and 2013-06-15 by checking the timestamp in the first field.
File contents:
2013-06-02T19:44:59;(3305,3308,2338,102116);aaaa;xxxx
2013-06-14T20:01:58;(2338);aaaa;xxxx
2013-06-12T20:01:58;(3305,3308,2338);bbbb;xxxx
2013-06-13T20:01:59;(3305,3308,2338,102116);bbbb;xxxx
2013-06-13T20:02:53;(2338);bbbb;xxxx
2013-06-13T20:02:53;(3305,3308,2338);aaaa2;xxxx
2013-06-13T20:02:54;(3305,3308,2338,102116);aaaa2;xxxx
2013-06-14T20:31:58;(2338);aaaa2;xxxx
2013-06-14T20:31:58;(3305,3308,2338);aaaa;xxxx
2013-06-15T20:31:59;(3305,3308,2338,102116);bbbb;xxxx
2013-06-16T20:32:53;(2338);aaaa;xxxx
2013-06-16T20:32:53;(3305,3308,2338);aaaa2;xxxx
2013-06-16T20:32:54;(3305,3308,2338,102116);bbbb;xxxx
It may not have been your first choice but Perl is great for this task.
perl -ne "print if ( m/2013-06-02/ .. m/2013-06-15/ )" myfile.txt
The way this works is that once the first trigger matches (i.e. m/2013-06-02/), the condition (print) is executed on each line until the second trigger matches (i.e. m/2013-06-15/).
However this trick won't work if you specify m/2013-06-01/ as a trigger because this is never matched in your file.
A less exciting technique is to extract some text from each line and test that:
perl -ne 'if ( m/^([0-9-]+)/ ) { $date = $1; print if ( $date ge "2013-06-01" and $date le "2013-06-15" ) }' myfile.txt
(Both expressions tested and working.)
You can try something like:
awk -F'-|T' '$1==2013 && $2==06 && $3>=01 && $3<=15' hugefile
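Since ISO-8601 timestamps compare lexically in chronological order, a plain string comparison on the date part also works, without splitting out year/month/day at all (shown here on made-up sample lines; splitting on "T" leaves the date in $1):

```shell
printf '%s\n' \
  '2013-05-30T01:00:00;(1);aaaa;xxxx' \
  '2013-06-02T19:44:59;(2);aaaa;xxxx' \
  '2013-06-16T20:32:53;(3);aaaa;xxxx' |
  awk -F'T' '$1 >= "2013-06-01" && $1 <= "2013-06-15"'
# keeps only the 2013-06-02 line
```

Both bounds can be changed freely, and unlike the sed range trick, neither bound needs to actually occur in the file.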
You can use sed to print all lines between two patterns. In this case, you will have to sort the file first because the dates are interleaved:
$ sort file | sed -n '/2013-06-12/,/2013-06-15/p'
2013-06-12T20:01:58;(3305,3308,2338);bbbb;xxxx
2013-06-13T20:01:59;(3305,3308,2338,102116);bbbb;xxxx
2013-06-13T20:02:53;(2338);bbbb;xxxx
2013-06-13T20:02:53;(3305,3308,2338);aaaa2;xxxx
2013-06-13T20:02:54;(3305,3308,2338,102116);aaaa2;xxxx
2013-06-14T20:01:58;(2338);aaaa;xxxx
2013-06-14T20:31:58;(2338);aaaa2;xxxx
2013-06-14T20:31:58;(3305,3308,2338);aaaa;xxxx
2013-06-15T20:31:59;(3305,3308,2338,102116);bbbb;xxxx
