How to sort array of strings by function in shell script - bash

I have the following list of strings in shell script:
something-7-5-2020.dump
another-7-5-2020.dump
anoter2-6-5-2020.dump
another-4-5-2020.dump
another2-4-5-2020.dump
something-2-5-2020.dump
another-2-5-2020.dump
8-1-2021
26-1-2021
20-1-2021
19-1-2021
3-9-2020
29-9-2020
28-9-2020
24-9-2020
1-9-2020
6-8-2020
20-8-2020
18-8-2020
12-8-2020
10-8-2020
7-7-2020
5-7-2020
27-7-2020
7-6-2020
5-6-2020
23-6-2020
18-6-2020
28-5-2020
26-5-2020
9-12-2020
28-12-2020
15-12-2020
1-12-2020
27-11-2020
20-11-2020
19-11-2020
18-11-2020
1-11-2020
11-11-2020
31-10-2020
29-10-2020
27-10-2020
23-10-2020
21-10-2020
15-10-2020
23-09-2020
So my goal is to sort them by date, but the dates are in dd-mm-yyyy or d-m-yyyy format, and sometimes there's a word in front, like word-dd-mm-yyyy. I would like to write a function to sort the values, as in any other language, so that it ignores the leading word, casts the date to a common format, and compares that. In JavaScript it would be something like:
arrayOfStrings.sort((a, b) => functionToOrderStrings())
My code to obtain the array is the following:
dumps=$(gsutil ls gs://organization-dumps/ambient | sed "s:gs\://organization-dumps/ambient/::" | sed '/^$/d' | sed 's:/$::' | sort --reverse --key=3 --key=2 --key=1 --field-separator=-)
echo "$dumps"
I would like to say that I've already searched Stack Overflow and none of the answers helped me, because all of them are about sorting dates that are already in a consistent format, and that's not my case.

If you have the results in a pipeline, involving an array seems completely superfluous here.
You can apply a technique called a Schwartzian transform: add a prefix to each line with a normalized version of the data so it can be easily sorted, then sort, then discard the prefix.
I'm guessing something like the following:
gsutil ls gs://organization-dumps/ambient |
awk '{ sub("gs:\/\/organization-dumps/ambient/", "");
if (! $0) next;
sub("/$", "");
d = $0;
sub(/^[^0-9][^-]*-/, "", d);
sub(/[^0-9]*$/, "", d);
split(d, w, "-");
printf "%04i-%02i-%02i\t%s\n", w[3], w[2], w[1], $0 }' |
sort -n | cut -f2-
In so many words, we are adding a tab-delimited field in front of every line, then sorting on that, then discarding the first field with cut -f2-. The field extraction makes some assumptions which seem to hold for your test data, but it may need additional tweaking if your real data has corner cases, e.g. if the label before the date can itself contain a number surrounded by dashes.
If you want to capture the result in a variable, like in your original code, that's easy to do; but usually, you should just run everything in a pipeline.
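For reference, a minimal sketch of the capture, assuming the Awk script above is saved in a (hypothetical) file normalize.awk:
dumps=$(
  gsutil ls gs://organization-dumps/ambient |
  awk -f normalize.awk |   # the normalization script shown above
  sort -n | cut -f2-
)
echo "$dumps"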
Notice that I factored your multiple sed scripts into the Awk script, too, some of that with a fair amount of guessing as to what the input looks like and what the sed scripts were supposed to accomplish. (Perhaps also note that sed, like Awk, is a scripting language; to run several sed commands on the same input, just put them after each other in the same sed script.)
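For example, here is a sketch of your three original sed invocations collapsed into a single sed process:
sed -e 's:gs\://organization-dumps/ambient/::' -e '/^$/d' -e 's:/$::'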

1. Preprocess the input into the format you want it in for sorting.
2. Sort.
3. Remove the artifacts from step 1.
The following:
sed -E '
# extract the date and put it in first column separated by tab
# this could be better, it's just an example
s/(.*-)?([0-9]?[0-9]-[0-9]?[0-9]-[0-9]{4})/\2\t&/;
# If day is a single digit, add a zero in front
s/^([0-9]-)/0\1/;
# If month is a single digit, add a zero in front
s/^([0-9][0-9]-)([0-9]-)/\10\2/
# year in front? no idea - shuffle the way you want
s/([0-9]{2})-([0-9]{2})-([0-9]{4})/\3-\2-\1/
' input.txt | sort | cut -f2-
outputs:
another-2-5-2020.dump
something-2-5-2020.dump
another-4-5-2020.dump
another2-4-5-2020.dump
anoter2-6-5-2020.dump
another-7-5-2020.dump
something-7-5-2020.dump
26-5-2020
28-5-2020
5-6-2020
7-6-2020
18-6-2020
23-6-2020
5-7-2020
7-7-2020
27-7-2020
6-8-2020
10-8-2020
12-8-2020
18-8-2020
20-8-2020
1-9-2020
3-9-2020
23-09-2020
24-9-2020
28-9-2020
29-9-2020
15-10-2020
21-10-2020
23-10-2020
27-10-2020
29-10-2020
31-10-2020
1-11-2020
11-11-2020
18-11-2020
19-11-2020
20-11-2020
27-11-2020
1-12-2020
9-12-2020
15-12-2020
28-12-2020
8-1-2021
19-1-2021
20-1-2021
26-1-2021

Using GNU awk:
gsutil ls gs://organization-dumps/ambient | awk '{ match($0,/[[:digit:]]{1,2}-[[:digit:]]{1,2}-[[:digit:]]{4}/);dayt=substr($0,RSTART,RLENGTH);split(dayt,map,"-");length(map[1])==1?map[1]="0"map[1]:map[1]=map[1];length(map[2])==1?map[2]="0"map[2]:map[2]=map[2];map1[mktime(map[3]" "map[2]" "map[1]" 00 00 00")]=$0 } END { PROCINFO["sorted_in"]="@ind_num_asc";for (i in map1) { print map1[i] } }'
Explanation:
gsutil ls gs://organization-dumps/ambient | awk '{
match($0,/[[:digit:]]{1,2}-[[:digit:]]{1,2}-[[:digit:]]{4}/); # Locate the date within the line
dayt=substr($0,RSTART,RLENGTH); # Extract the date
split(dayt,map,"-"); # Split the date into the array map, using "-" as the delimiter
length(map[1])==1? map[1]="0"map[1]:map[1]=map[1];length(map[2])==1?map[2]="0"map[2]:map[2]=map[2]; # Pad the day and month with "0" if required
map1[mktime(map[3]" "map[2]" "map[1]" 00 00 00")]=$0 # Get the epoch format date based on the values in the map array and use this for the index of the array map1 with the line as the value
}
END {
PROCINFO["sorted_in"]="#ind_num_asc"; # Set the ordering of the array
for (i in map1) {
print map1[i] # Loop through map1 and print the values (lines)
}
}'

Using GNU awk, you can do this fairly easily:
awk 'BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"; FS="."}
{n=split($1,t,"-"); a[t[n]*10000 + t[n-1]*100 + t[n-2]]=$0}
END {for(i in a) print a[i]}' file
Essentially, we are asking GNU awk to traverse an array by index in ascending numeric order. For each line read, we extract the date. The date is always located before the <dot>-character and thus always in field 1 if the dot is the field separator (FS="."). We split the first field on hyphens and use the last three resulting fields as day, month and year. We convert the date simplistically to a single number (YYYY*10000 + MM*100 + DD, which orders correctly since DD < 100 and MM*100 < 10000) and ask awk to sort by that number.
It is now possible to combine the full pipeline into a single awk invocation:
$ gsutil ls gs://organization-dumps/ambient \
| awk 'BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"; FS="."}
{sub("gs://organization-dumps/ambient/",""); sub("/$","")}
(NF==0){next}
{n=split($1,t,"-"); a[t[n]*10000 + t[n-1]*100 + t[n-2]]=$0}
END {for(i in a) print a[i]}'

Related

Sorting the contents within a column using Shell Script Line by Line in a File

I am sorting a file by a column using the command:
cat myFile | sort -u -k3
Now I want to sort the data within a column of the file. Can anyone please help and tell me how I can achieve it?
My data looks like this in the file named Student.csv:
Name,Age,Marks,Grades
Sam,21,"34,56,21,67","C,B,D,A"
Josh,25,"90,89,78,45","A,A,B,C"
Output-
Name,Age,Marks,Grades
Sam,21,"21,34,56,67","A,B,C,D"
Josh,25,"45,78,89,90","A,A,B,C"
I will appreciate the help, thanks.
You should export your CSV with a field separator that does not exist within the texts. Otherwise it becomes hugely cumbersome to deal with this.
Afterwards you can easily sort by specifying the separator and the field.
Example if you would use | as separator:
Name|Age|Marks|Grades
Sam|21|"34,56,21,67"|"C,B,D,A"
Josh|25|"90,89,78,45"|"A,A,B,C"
Then execute:
cat myFile | sort -u -k3 -t\|
or:
sort -u -k3 -t\| <myFile
Afterwards you can put your commas back:
sort -u -k3 -t\| <myFile | sed 's/|/,/g'
Did it. There's a lot to unpack here, so I'll write up all the steps below.
cat Student.csv | head -n1 && cat Student.csv | tail -n+2 | awk -F \" '{split($2,a,",");asort(a);b="";for(i in a)b=b a[i] ",";split($4,c,",");asort(c);d="";for(i in c)d=d c[i] ",";printf "%s\"%s\",\"%s\"\n",$1,substr(b,1,length(b)-1),substr(d,1,length(d)-1)}'
Alternatively:
cat Student.csv | tee >(head -n1) >(tail -n+2 | awk -F \" '{split($2,a,",");asort(a);b="";for(i in a)b=b a[i] ",";split($4,c,",");asort(c);d="";for(i in c)d=d c[i] ",";printf "%s\"%s\",\"%s\"\n",$1,substr(b,1,length(b)-1),substr(d,1,length(d)-1)}') >/dev/null ; sleep 0.1
Output:
Name,Age,Marks,Grades
Sam,21,"21,34,56,67","A,B,C,D"
Josh,25,"45,78,89,90","A,A,B,C"
https://www.tutorialspoint.com/awk/index.htm
Edit -- 'kay, the explanation:
cat concatenates (glues) files together, but when you give it just one argument, that's simply what it prints out.
You can do the next part in one or two steps; I'll explain the first method. The | pipe directs the output to another command. We all know this, or we wouldn't be here right now... however, someday someone will come across this post and wonder what it does.
head prints out the first few lines of what you give it. Here, I specified -n1 (number of lines = one), so it prints out the header:
Name,Age,Marks,Grades
&& continues to the next command, so long as that initial instruction was a success.
cat Student.csv again, but this time piped into tail, which prints the last few lines, of whatever you give it. -n+2 specifies to spit out everything from line number 2, and beyond.
We then pipe those contents into AWK https://en.wikipedia.org/wiki/AWK ...I'm sure you could do it with sed https://en.wikipedia.org/wiki/Sed, and I started with that, but sed tends to be more simple than awk, so you'd need to do far more chained-commands to achieve the same thing. Lisp might be able to do it more concicely, but it sounded like you were asking for shell builtins. Python's also decent with strings, but again, sh.
-F \" delegates a literal " as the field separator, so that we can group the contents into 3 categories:
Sam,21, " 34,56,21,67 " , "C,B,D,A"
$1 = Sam,21,
$2 = 34,56,21,67
$3 = ,
$4 = C,B,D,A
You actually get 4, but I'm throwing out that comma in the third position. It's easy enough to put it back in.
We now need to sort those numbers, so split($2,a,",") builds an array, here named a, from the contents of $2, split on the , symbol.
a = [ 34, 56, 21, 67 ]
; separates AWK commands, you can mostly ignore those. If there were simply a space, awk would try to concatenate items together, and we don't want that yet.
Next, array sort asort( a ), the contents of a -- https://www.tutorialspoint.com/awk/awk_string_functions.htm
a = [ 21, 34, 56, 67 ]
Here would be a perfect time for Python's string .join() method https://www.w3schools.com/python/ref_string_join.asp
However, we don't have that available to us, and AWK doesn't seem to have it as far as I know, so we have to roll our own here. So we construct a string, b, appending each item of a in turn. Single quotes often won't do on the command line, so you'll see double quotes.
b=""
for( i in a ) b=b a[i] ","
b begins empty. Iterating over a's contents with a for-loop, we append each item followed by a comma. (Strictly speaking, for (i in a) doesn't guarantee traversal order in awk; it happens to work in gawk here, but a counted loop like for (i=1; i<=n; i++) is safer.) Leave the trailing comma for now, it'll get trimmed off in a bit.
21,34,56,67,
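As an aside, if you do this a lot you could wrap the loop in a helper; here's a hypothetical join() function in awk (not part of the one-liner above, just a sketch):
function join(arr, n, sep,    s, i) {
    s = arr[1]                 # start with the first element
    for (i = 2; i <= n; i++)
        s = s sep arr[i]       # append separator plus the next element
    return s
}
# usage: n = split($2, a, ","); asort(a); b = join(a, n, ",")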
Exact same procedure for $4, but this time we name the array c, and the string in which those contents are concatenated with commas, d -- split( $4, c, "," ) ; asort( c ) ; d="" ; for( i in c ) d=d c[i] "," You can name them anything you like; I just happened to have ABCD staring me in the face from those grade listings, so that's what I went with.
OK, now we have everything we need.
$1 = Sam,21,
b = 21,34,56,67,
d = A,B,C,D,
Let's format a string so they're all together.
printf "%s\"%s\",\"%s\"\n"
This will print $1 in the first %s string position, then a literal double-quote,
b into the second %s string position, next ",",
followed by d in the third %s position,
all wrapped up with a final double-quote and a newline.
However, b and d both have trailing commas, so we trim those off with AWK's substr() command. -- https://www.tutorialspoint.com/awk/awk_string_functions.htm Knowing where to begin is easy enough, but we need to chop those at one-from-the-end.
substr( b, 1, length(b) -1 )
substr( d, 1, length(d) -1 )
It'd be nice if you could just specify -2 and have it count backwards, like you can in Lua, Python, et al... but that doesn't seem to work in AWK, so whatevs. Ya live, ya learn. And there you have it, all your ducks in a row.
Sam,21,"21,34,56,67","A,B,C,D"
This does the job, maybe not elegantly, but within the required guidelines. I'm sure there are code-golfing opportunities in there somewhere, but it's solid logic you can follow.

Matching pairs using Linux terminal

I have a file named list.txt containing (supplier, product) pairs, and I must show the number of products from every supplier along with their names, using the Linux terminal.
Sample input:
stationery:paper
grocery:apples
grocery:pears
dairy:milk
stationery:pen
dairy:cheese
stationery:rubber
And the result should be something like:
stationery: 3
stationery: paper pen rubber
grocery: 2
grocery: apples pears
dairy: 2
dairy: milk cheese
Save the input to a file, and remove the empty lines. Then use GNU datamash:
datamash -s -t ':' groupby 1 count 2 unique 2 < file
Output:
dairy:2:cheese,milk
grocery:2:apples,pears
stationery:3:paper,pen,rubber
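If you need the exact two-lines-per-supplier layout from the question, one possible post-processing step (a sketch; it assumes no extra colons in the data) is to reshape the datamash output with awk:
datamash -s -t ':' groupby 1 count 2 unique 2 < file |
awk -F: '{ gsub(/,/, " ", $3); print $1 ": " $2; print $1 ": " $3 }'
which prints, for example, "dairy: 2" followed by "dairy: cheese milk".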
The following pipeline should do the job:
< your_input_file sort -t: -k1,1r | sed -E -n ':a;$p;N;s/([^:]*): *(.*)\n\1:/\1: \2 /;ta;P;D' | awk -F' ' '{ print $1, NF-1; print $0 }'
where
sort sorts the lines according to what's before the colon, to ease the subsequent processing
the cryptic sed joins the lines that share a supplier
awk counts the items per supplier and prints everything appropriately.
Doing it with awk only, as suggested by KamilCuk in a comment, would be a much easier job; doing it with sed only would be (for me) a nightmare. Using both is maybe silly, but I enjoyed doing it.
If you need a detailed explanation, please comment, and I'll find time to provide one.
Here's the sed script written one command per line:
:a
$p
N
s/([^:]*): *(.*)\n\1:/\1: \2 /
ta
P
D
and here's how it works:
:a is just a label that we can jump back to with a test or branch command;
$p is the print command applied only to the address $ (the last line); note that all other commands are applied to every line, since no address is specified;
N reads one more line and appends it to the current pattern space, putting a \newline in between; this creates a multiline in the pattern space;
s/([^:]*): *(.*)\n\1:/\1: \2 / captures what's before the first colon, ([^:]*), and what follows it, (.*), discarding excessive spaces ( *); the match succeeds only if the appended line starts with the same supplier (\1:), in which case the two lines are merged;
ta tests if the previous s command was successful, and, if this is the case, transfers the control to the line labelled by a (i.e. go to step 1);
P prints the leading part of the multiline, up to the first embedded \newline;
D deletes the leading part of the multiline up to and including the embedded \newline.
This should be close to the awk-only code I was referring to:
< list.txt awk -F: '{ count[$1] += 1; items[$1] = items[$1] " " $2 } END { for (supp in items) print supp": " count[supp], "\n"supp":" items[supp]}'
The awk script is more readable if written on several lines:
awk -F: '{ # for each line
# we use the word before the : as the key of an associative array
count[$1] += 1 # increment the count for the given supplier
items[$1] = items[$1] " " $2 # concatenate the current item to the previous ones
}
END { # after processing the whole file
for (supp in items) # iterate on the suppliers and print the result
print supp": " count[supp], "\n"supp":" items[supp]
}'
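One caveat: plain awk doesn't guarantee any particular order for for (supp in items). If you are on GNU awk and want deterministic output, a possible tweak to the END block (a gawk-only sketch, assuming gawk 4.0+):
END { # after processing the whole file
    PROCINFO["sorted_in"] = "@ind_str_asc" # gawk extension: visit keys alphabetically
    for (supp in items)
        print supp": " count[supp], "\n"supp":" items[supp]
}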

How do I do a for loop with 2 arrays in shell script?

I have to first declare two arrays which I also need help with.
Originally, it's two single variables.
day=$(hadoop fs -ls -R /user/hive/* |
awk '/filename.txt.gz/' |
tail -1 |
date -d $(echo `awk '{print $6}'`) '+%b %-d' |
tr -d ' ')
time_stamp=$(hadoop fs -ls -R /user/hive/* |
awk '/filename.txt.gz/' |
tail -1 |
awk '{ print $7 }')
Now instead of tail -1, I need tail -5. So first, how do I make these two arrays?
Second question, how do I make a for loop with each value from the paired values of $day and $time_stamp? I can't use array_combine because I need to perform actions on each array separately. Thanks
You are collecting the data into strings, not arrays. But additionally, your code should probably be refactored significantly -- as a general rule of thumb, if something happens in Awk, most of the rest should also happen in Awk.
You assign to an array with variable=(values of array) and to get the values from a subprocess, it's variable=($(command to produce values)).
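A minimal sketch of both forms (note that the command substitution is word-split on whitespace, so this is only safe for values without embedded spaces):
days=(Mon Tue Wed)                # literal array assignment
parts=($(printf '%s\n' 6 7))      # array from a subprocess, word-split
echo "${days[0]}" "${#parts[@]}"  # prints: Mon 2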
Here's a first attempt at refactoring your code.
# Avoid repeated code -- break this out into a function
extract_field () {
hadoop fs -ls -R /user/hive/* |
# Get rid of the tail and the repeated Awk
# Notice backslashes in regex
# Pass in the field to extract as a parameter
awk -v field="$1" '/filename\.txt\.gz/ { d[++i]=$field }
END { for(j=i-5; j<=i; ++j) print d[j] }'
)
day=($(extract_field 6 |
# Refactor accordingly
# And if you don't want a space in the format string, don't put a space in the format string in the first place
xargs -I {} date -d {} '+%b%-d'))
time_stamp=($(extract_field 7))
I'm highly skeptical of the arrangement to call the Hadoop command twice, though. Perhaps just extract fields 6 and 7 in a single go and then post-process the results to get them into two separate arrays. Something like this instead then?
combined=($(hadoop fs -ls -R /user/hive/* |
awk '/filename\.txt\.gz/ { d[++i]=$6 " " $7 }
END { for(j=i-4; j<=i; ++j) print d[j] }'))
for ((i=0; i<"${#combined[#]}"; ++i)); do
day[$i]="$(date -d "${combined[i]% *}" +'%b%-d')"
time_stamp[$i]="${combined[i]#* }"
done
unset combined
The statement that you need to handle the dates and times independently from each other sounds suspicious; if you can find a way to avoid doing that, perhaps after all don't split combined into two separate arrays. The code above reveals how to extract the date and the time from a value in combined (the mechanism is called parameter substitution). It also obviously demonstrates how to loop over the indices in an array.
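For a quick illustration of the two expansions used above, with a hypothetical value:
value='26-1-2021 14:30'
echo "${value% *}"   # -> 26-1-2021  (strip the shortest trailing ' *', i.e. the time)
echo "${value#* }"   # -> 14:30      (strip the shortest leading '* ', i.e. the date)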

Want to sort a file based on another file in unix shell

I have 2 files refer.txt and parse.txt
refer.txt contains the following
julie,remo,rob,whitney,james
parse.txt contains
remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,whitney/hello/1.0,julie/hello/2.0,julie/hello/3.0,rob/hello/4.0,james/hello/6.0
Now my output.txt should list the files in parse.txt based on the order specified in refer.txt
ex of output.txt should be:
julie/hello/2.0,julie/hello/3.0,remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,rob/hello/4.0,whitney/hello/1.0,james/hello/6.0
I have tried the following code:
sort -nru refer.txt parse.txt
but no luck.
Please assist me. TIA.
You can do that using gnu-awk:
awk -F/ -v RS=',|\n' 'FNR==NR{a[$1] = (a[$1])? a[$1] "," $0 : $0 ; next}
{s = (s)? s "," a[$1] : a[$1]} END{print s}' parse.txt refer.txt
Output:
julie/hello/2.0,julie/hello/3.0,remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,rob/hello/4.0,whitney/hello/1.0,james/hello/6.0
Explanation:
-F/ # Use field separator as /
-v RS=',|\n' # Use record separator as comma or newline
NR == FNR { # While processing parse.txt
a[$1]=(a[$1])?a[$1] ","$0:$0 # create an array with 1st field as key and value as all the
# records with keys julie, remo, rob etc.
}
{ # while processing the second file refer.txt
s = (s)?s "," a[$1]:a[$1] # aggregate all values by reading key from 2nd file
}
END {print s } # print all the values
In pure native bash (4.x):
# read each file into an array
IFS=, read -r -a values <parse.txt
IFS=, read -r -a ordering <refer.txt
# create a map from content before "/" to comma-separated full values in preserved order
declare -A kv=( )
for value in "${values[#]}"; do
key=${value%%/*}
if [[ ${kv[$key]} ]]; then
kv[$key]+=",$value" # already exists, comma-separate
else
kv[$key]="$value"
fi
done
# go through refer list, putting full value into "out" array for each entry
out=( )
for value in "${ordering[#]}"; do
out+=( "${kv[$value]}" )
done
# print "out" array in comma-separated form
IFS=,
printf '%s\n' "${out[*]}" >output.txt
If you're getting more output fields than you have input fields, you're probably trying to run this with bash 3.x. Since associative array support is mandatory for correct operation, this won't work.
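If you'd rather fail early than debug garbled output, a possible guard at the top of the script (a sketch):
# associative arrays require bash 4.0+; bail out early on older shells
if (( BASH_VERSINFO[0] < 4 )); then
    echo "this script requires bash >= 4" >&2
    exit 1
fi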
tr , "\n" refer.txt | cat -n >person_id.txt # 'cut -n' not posix, use sed and paste
cat person_id.txt | while read person_id person_key
do
print "$person_id" > $person_key
done
tr , "\n" parse.txt | sed 's/(^[^\/]*)(\/.*)$/\1 \1\2/' >person_data.txt
cat person_data.txt | while read foreign_key person_data
do
person_id="$(<$foreign_key)"
print "$person_id" " " "$person_data" >>merge.txt
done
sort merge.txt >output.txt
A text book data processing approach, a person id table, a person data table, merged on a common key field, which is the first name of the person:
[person_key] [person_id]
- person id table, a unique sortable 'id' for each person (line number in this instance, since that is the desired sort order), and key for each person (their first name)
[person_key] [person_data]
- person data table, the data for each person indexed by 'person_key'
[person_id] [person_data]
- a merge of the 'person_id' table and 'person_data' table on 'person_key', which can then be sorted on person_id, giving the output as requested
The trick is to implement an associative array using files, the file name being the key (in this instance 'person_key'), the content being the value. [Essentially a random access file implemented using the filesystem.]
This actually adds a step to the otherwise simple, but not very efficient, task of grepping parse.txt with each value in refer.txt -- which of the two is more efficient, I'm not sure.
NB: The above code is very unlikely to work out of the box.
NBB: On reflection, a better way of doing this would probably be to use the file system to build a random-access file (essentially an index) from parse.txt, and then treat refer.txt as a batch job, printing out the data for each of the names it reads from that random-access index in turn:
# 1) index data file on required field
cat person_data.txt | while read data
do
key="$(print "$data" | sed 's/(^[^\/]*)/\1/')" # alt. `cut -d'/' -f1` ??
print "$data" >>./person_data/"$key"
done
# 2) run batch job
cat refer.txt | while read key
do
cat ./person_data/"$key"
done
However, having said that, using egrep is probably just as rigorous a solution, at least for small datasets, and I would most certainly use that approach given the specific question posed. (Or maybe not! The above could well prove faster as well as more robust.)
Command
while read line; do
grep -w "^$line" <(tr , "\n" < parse.txt)
done < <(tr , "\n" < refer.txt) | paste -s -d , -
Key points
For both files, commas are translated to newlines using the tr command (without actually changing the files themselves). This is useful because while read and grep work under the assumption that your records are separated by newlines instead of commas.
while read will read in every name from refer.txt, (i.e julie, remo, etc.) and then use grep to retrieve lines from parse.txt containing that name.
The ^ in the regex ensures matching is only performed from the start of the string and not in the middle (thanks to @CharlesDuffy's comment below), and the -w option for grep allows whole-word matching only. For example, this ensures that "rob" only matches "rob/..." and not "robby/..." or "throb/...".
The paste command at the end will comma-separate the results. Removing this command will print each result on its own line.
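For instance, a tiny standalone illustration of that last step:
printf '%s\n' julie remo rob | paste -s -d , -   # prints: julie,remo,rob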

bash sort quoted csv files by numeric key

I have the following input csv file:
"aaa","1","xxx"
"ccc, Inc.","6100","yyy"
"bbb","609","zzz"
I wish to sort by the second column as numbers,
I tried
sort --field-separator=',' --key=2n
The problem is that since all the values are quoted, they don't get sorted correctly by the -n (numeric) option. Is there a solution?
A little trick, which uses a double quote as the separator:
sort --field-separator='"' --key=4 -n
For a quoted CSV, use a language that has a proper CSV parser. Here is an example using Perl.
perl -MText::ParseWords -lne '
chomp;
push @line, [ parse_line(",", 0, $_) ];
}{
@line = sort { $a->[1] <=> $b->[1] } @line;
for (@line) {
local $" = qw(",");
print qq("@$_");
}
' file
Output:
"aaa","1","xxx"
"bbb","609","zzz"
"ccc, Inc.","6100","yyy"
Explanation:
Remove the newline from the input using the chomp function.
Using the core module Text::ParseWords, parse the quoted line and store it, without the quotes, in an array of arrays.
In the END block (the }{ above), sort the array of arrays on the second column and assign the result back to the original array.
For every item in our array of arrays, set the output list separator $" to "," and print the item with a leading and trailing " to recreate the lines in the original format.
Dropping your example into a file called sort2.txt, I found the following to work well.
sort -t'"' -k4n sort2.txt
Using sort with the following options (thank you for the refinements, Jonathan):
-t'"' defines an optional single-character field separator other than the default tab; here the literal double quote, protected by the single quotes.
-k4 sorts on the fourth key, i.e. the fourth field as delimited by the " characters, which holds the number.
-n requests a numeric sort.
Passing the file name directly avoids an unnecessary use of cat.
Hope this helps!
There isn't going to be a really simple solution. If you make some reasonable assumptions, then you could consider:
sed 's/","/^A/g' input.csv |
sort -t'^A' -k 2n |
sed 's/^A/","/g'
This replaces the "," sequence with Control-A (shown as ^A in the code), then uses that as the field delimiter in sort (the numeric sort on column 2), and then replace the Control-A characters with "," again.
If you use bash, you can use the ANSI C quoting mechanism $'\1' to embed the control characters visibly into the script; you just have to finish the single-quoted string before the escape, and restart it afterwards:
sed 's/","/'$'\1''/g' input.csv |
sort -t$'\1' -k 2n |
sed 's/'$'\1''/","/g'
Or play with double quotes instead of single quotes, but that gets messy because of the double quotes that you are replacing. But you can simply type the characters verbatim and editors like vim will be happy to show them to you.
Sometimes the values in the CSV file are optionally quoted, only when necessary. In this case, using " as a separator is not reliable.
Example:
"Forest fruits",198
Apples,456
bananas,67
Using awk, sort and cut, you can sort the original file, here by the first column:
awk -F',' '{
a = $1; # or the column index you want
gsub(/(^"|"$)/, "", a);
print a","$0
}' file.csv | sort -k1 | cut -d',' -f1 --complement
This will bring the column you want to sort on in front without quotes, then sort it the way you want, and remove this column at the end.
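To make the mechanics concrete, for the sample input above the intermediate decorated stream (after the awk step, before sort and cut) would look like this:
Forest fruits,"Forest fruits",198
Apples,Apples,456
bananas,bananas,67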
