Using bash to sort data horizontally - bash

I have a file full of data in columns
sarah mark john
10 20 5
x y z
I want to sort the data so the columns stay intact but the second row is in increasing order so it looks like this:
john sarah mark
5 10 20
z x y
I've been looking at the sort command but have only been able to find vertical sorting, not horizontal. I'm happy to use any tool, any help is appreciated.
Thank you!

Let's create a function to transpose a file (make rows become columns, and columns become rows):
transpose () {
awk '{for (i=1; i<=NF; i++) a[i,NR]=$i; max=(max<NF?NF:max)}
END {for (i=1; i<=max; i++)
{for (j=1; j<=NR; j++)
printf "%s%s", a[i,j], (j<NR?OFS:ORS)
}
}'
}
This just loads all the data into a bidimensional array a[line,column] and then prints it back as a[column,line], so that it transposes the given input. The wrapper transpose () { } is used to store it as a bash function. You just need to copy paste it in your shell (or in ~/.bashrc if you want it to be a permanent function, available any time you open a session).
Then, by using it, we can easily solve the problem by using sort -n -k2: sort numerically based on column 2. Then, transpose back.
$ cat a | transpose | sort -n -k2 | transpose
john sarah mark
5 10 20
z x y
In case you want to have a nice format as final output, just pipe to column like this:
$ cat a | transpose | sort -n -k2 | transpose | column -t
john sarah mark
5 10 20
z x y
Step by step:
$ cat a | transpose
sarah 10 x
mark 20 y
john 5 z
$ cat a | transpose | sort -n -k2
john 5 z
sarah 10 x
mark 20 y
$ cat a | transpose | sort -n -k2 | transpose
john sarah mark
5 10 20
z x y

Coming from a duplicate question, this would sort the columns by the first row:
#!/bin/bash
input="$1"
order=$((for i in $(head -1 $input); do echo $i; done) | nl | sort -k2 | cut -f1)
grep ^ $input | (while read line
do
read -a columns <<< "${line%"${line##*[![:space:]]}"}"
orderedline=()
for i in ${order[#]}
do
orderedline+=("${columns[$i - 1]}")
done
line=$(printf "\t%s" "${orderedline[#]}")
echo ${line:1}
done)
To sort by second row, replace head -1 $input with head -2 $input | tail -1. If the sort should be numeric, put in sort -n -k2 instead of sort -k2.

Good one-liner gets the job done:
perl -ane '$,=" "; print sort #F; print "\n";' file
I found it here: http://www.unix.com/unix-for-advanced-and-expert-users/36039-horizontal-sorting-lines-file-sed-implementation.html

Related

Bash - Removing empty columns from .csv file

I have a large .csv file in which I have to remove columns which are empty. By empty, I mean that they have a header, but the rest of the column contains no data.
I've written a Bash script to try and do this, but am running into a few issues.
Here's the code:
#!/bin/bash
total="$(head -n 1 Reddit-cleaner.csv | grep -o ',' | wc -l)"
i=1
count=0
while [ $i -le $total ]; do
cat Reddit-cleaner.csv | cut -d "," -f$i | while read CMD; do if [ -n CMD ]; then count=$count+1; fi; done
if [ $count -eq 1 ]; then
cut -d "," -f$i --complement <Reddit-cleaner.csv >Reddit-cleanerer.csv
fi
count=0
i=$i+1
done
Firstly I find the number of columns, and store it in total. Then while the program has not reached the last column, I loop through the columns individually. The nested while loop checks if each row in the column is empty, and if there is more than one row that is not empty, it writes all other columns to another file.
I recognise that there are a few problems with this script. Firstly, the count modification occurs in a subshell, so count is never modified in the parent shell. Secondly, the file I am writing to will be overwritten every time the script finds an empty column.
So my question then is how can I fix this. I initially wanted to have it so that it wrote to a new file column by column, based on count, but couldn't figure out how to get that done either.
Edit: People have asked for a sample input and output.
Sample input:
User, Date, Email, Administrator, Posts, Comments
a, 20201719, a#a.com, Yes, , 3
b, 20182817, b#b.com, No, , 4
c, 20191618, , No, , 4
d, 20190126, , No, , 2
Sample output:
User, Data, Email, Administrator, Comments
a, 20201719, a#a.com, Yes, 3
b, 20182817, b#b.com, No, 4
c, 20191618, , No, 4
d, 20190126, , No, 2
In the sample output, the column which has no data in it except for the header (Posts) has been removed, while the columns which are either entirely or partially filled remain.
I may be misinterpreting the question (due to its lack of example input and expected output), but this should be as simple as:
$ x="1,2,3,,4,field 5,,,six,7"
$ echo "${x//,+(,)/,}"
1,2,3,4,field 5,six,7
This requires bash with extglob enabled. Otherwise, you can use an external call to sed:
$ echo "1,2,3,,4,field 5,,,six,7" |sed 's/,,,*/,/g'
1,2,3,4,field 5,six,7
There's a lot of redundancy in your sample code. You should really consider awk since it already tracks the current field count (as NF) and the number of lines (as NR), so you could add that up with a simple total+=NF on each line. With the empty fields collapsed, awk can just run on the field number you want.
$ echo "1,2,3,,4,field 5,,,six,7" |awk -F ',+' '
{ printf "line %d has %d fields, the 6th of which is <%s>\n", NR, NF, $6 }'
line 1 has 7 fields, the 6th of which is <six>
This uses printf to denote the number of records (NR, the current line number), the number of fields (NF) and the value of the sixth field ($6, can also be as a variable, e.g. $NF is the value of the final field since awk is one-indexed).
It is actually job of a CSV parser but you may use this awk script to get the job done:
cat removeEmptyCellsCsv.awk
BEGIN {
FS = OFS = ", "
}
NR == 1 {
for (i=1; i<=NF; i++)
e[i] = 1 # initially all cols are marked empty
next
}
FNR == NR {
for (i=1; i<=NF; i++)
e[i] = e[i] && ($i == "")
next
}
{
s = ""
for (i=1; i<=NF; i++)
s = s (i==1 || e[i-1] ? "" : OFS) (e[i] ? "" : $i)
print s
}
Then run it as:
awk -f removeEmptyCellsCsv.awk file.csv{,}
Using sample data provided in question, it will produce following output:
1, User, Date, Email, Administrator, Comments
2, a, 20201719, a#a.com, Yes, 3
3, b, 20182817, b#b.com, No, 4
4, c, 20191618, , No, 4
5, d, 20190126, , No, 2
Note that Posts columns has been removed because it is empty in every record.
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
if ( NR > 1 ) {
for (i=1; i<=NF; i++) {
if ( $i ~ /[^[:space:]]/ ) {
gotValues[i]
}
}
}
next
}
{
c=0
for (i=1; i<=NF; i++) {
if (i in gotValues) {
printf "%s%s", (c++ ? OFS : ""), $i
}
}
print ""
}
$ awk -f tst.awk file file
User, Date, Email, Administrator, Comments
a, 20201719, a#a.com, Yes, 3
b, 20182817, b#b.com, No, 4
c, 20191618, , No, 4
d, 20190126, , No, 2
See also What's the most robust way to efficiently parse CSV using awk? if you need to work with any more complicated CSVs than the one in your question.
You can use Miller (https://github.com/johnkerl/miller) and its remove-empty-columns verb.
Starting from
+------+----------+---------+---------------+-------+----------+
| User | Date | Email | Administrator | Posts | Comments |
+------+----------+---------+---------------+-------+----------+
| a | 20201719 | a#a.com | Yes | - | 3 |
| b | 20182817 | b#b.com | No | - | 4 |
| c | 20191618 | - | No | - | 4 |
| d | 20190126 | - | No | - | 2 |
+------+----------+---------+---------------+-------+----------+
and running
mlr --csv remove-empty-columns input.csv >output.csv
you will have
+------+----------+---------+---------------+----------+
| User | Date | Email | Administrator | Comments |
+------+----------+---------+---------------+----------+
| a | 20201719 | a#a.com | Yes | 3 |
| b | 20182817 | b#b.com | No | 4 |
| c | 20191618 | - | No | 4 |
| d | 20190126 | - | No | 2 |
+------+----------+---------+---------------+----------+

unix sort groups by their associated maximum value?

Let's say I have this input file 49142202.txt:
A 5
B 6
C 3
A 4
B 2
C 1
Is it possible to sort the groups in column 1 by the value in column 2? The desired output is as follows:
B 6 <-- B group at the top, because 6 is larger than 5 and 3
B 2 <-- 2 less than 6
A 5 <-- A group in the middle, because 5 is smaller than 6 and larger than 3
A 4 <-- 4 less than 5
C 3 <-- C group at the bottom, because 3 is smaller than 6 and 5
C 1 <-- 1 less than 3
Here is my solution:
join -t$'\t' -1 2 -2 1 \
<(cat 49142202.txt | sort -k2nr,2 | sort --stable -k1,1 -u | sort -k2nr,2 \
| cut -f1 | nl | tr -d " " | sort -k2,2) \
<(cat 49142202.txt | sort -k1,1 -k2nr,2) \
| sort --stable -k2n,2 | cut -f1,3
The first input to join sorted by column 2 is this:
2 A
1 B
3 C
The second input to join sorted by column 1 is this:
A 5
A 4
B 6
B 2
C 3
C 1
The output of join is:
A 2 5
A 2 4
B 1 6
B 1 2
C 3 3
C 3 1
Which is then sorted by the nl line number in column 2 and then the original input columns 1 and 3 are kept with cut.
I know it can be done a lot easier with for example groupby of pandas of Python, but is there a more elegant way of doing it, while sticking to the use of GNU Coreutils such as sort, join, cut, tr and nl? Preferably I want to avoid a memory inefficient awk solution, but please share those as well. Thanks!
As explained in the comment my solution tries to reduce the number of pipes, unnecessary cat commands and more especially the number of pipeline sort operations since sorting is a complex/time consuming operation:
I reached the following solution where f_grp_sort is the input file:
for elem in $(sort -k2nr f_grp_sort | awk '!seen[$1]++{print $1}')
do
grep $elem <(sort -k2nr f_grp_sort)
done
OUTPUT:
B 6
B 2
A 5
A 4
C 3
C 1
Explanations:
sort -k2nr f_grp_sort will generate the following output:
B 6
A 5
A 4
C 3
B 2
C 1
and sort -k2nr f_grp_sort | awk '!seen[$1]++{print $1}' will generate the output:
B
A
C
the awk will just generate in the same order 1 unique element of the first column of the temporary output.
Then the for elem in $(...)do grep $elem <(sort -k2nr f_grp_sort); done
will grep for lines containing B then A, then C what will provide the required output.
Now as enhancement, you can use a temporary file to avoid doing sort -k2nr f_grp_sort operation twice:
$ sort -k2nr f_grp_sort > tmp_sorted_file && for elem in $(awk '!seen[$1]++{print $1}' tmp_sorted_file); do grep $elem tmp_sorted_file; done && rm tmp_sorted_file
So, this won't work for all cases, but if the values in your first column can be turned into bash variables, we can use dynamically named arrays to do this instead of a bunch of joins. It should be pretty fast.
The first while block reads in the contents of the file, getting the first two space separated strings and putting them into col1 and col2. We then create a series of arrays named like ARR_A and ARR_B where A and B are the values from column 1 (but only if $col1 only contains characters that can be used in bash variable names). The array contains the column 2 values associated with these column 1 values.
I use your fancy sort chain to get the order we want column 1 values to print out in, we just loop through them, then for each column 1 array we sort the values and echo out column 1 and column 2.
The dynamc variable bits can be hard to follow, but for the right values in column 1 it will work. Again, if there's any characters that can't be part of a bash variable name in column 1, this solution will not work.
file=./49142202.txt
while read col1 col2 extra
do
if [[ "$col1" =~ ^[a-zA-Z0-9_]+$ ]]
then
eval 'ARR_'${col1}'+=("'${col2}'")'
else
echo "Bad character detected in Column 1: '$col1'"
exit 1
fi
done < "$file"
sort -k2nr,2 "$file" | sort --stable -k1,1 -u | sort -k2nr,2 | while read col1 extra
do
for col2 in $(eval 'printf "%s\n" "${ARR_'${col1}'[#]}"' | sort -r)
do
echo $col1 $col2
done
done
This was my test, a little more complex than your provided example:
$ cat 49142202.txt
A 4
B 6
C 3
A 5
B 2
C 1
C 0
$ ./run
B 6
B 2
A 5
A 4
C 3
C 1
C 0
Thanks a lot #JeffBreadner and #Allan! I came up with yet another solution, which is very similar to my first one, but gives a bit more control, because it allows for easier nesting with for loops:
for x in $(sort -k2nr,2 $file | sort --stable -k1,1 -u | sort -k2nr,2 | cut -f1); do
awk -v x=$x '$1==x' $file | sort -k2nr,2
done
Do you mind, if I don't accept either of your answers, until I have time to evaluate the time and memory performance of your solutions? Otherwise I would probably just go for the awk solution by #Allan.

Print out the value with the highest number of occurrences in a file

In a bash shell script, I want to go through a list of numbers and then print out the number that occurs most often. If there are several different numbers appearing an equal amount of times, I want to print the highest number. For example, in a file like this:
10
10
10
15
15
20
20
20
20
I want to print the value 20.
How can I achieve this?
If the numbers are in a file, one per line:
sort < myfile | uniq -c | sort -r | head -1
without the count:
A=$(sort < myfile | uniq -c | sort -r | head -1)
set $A
echo $2
You can use this command -
echo 10 10 10 15 15 20 20 20 20 | sed 's/ /\n/g' | sort | uniq -c | sort -V | tail -n 1 | awk '{print $2}'
It will print the number you want.

Custom Sort Multiple Files

I have 10 files (1Gb each). The contents of the files are as follows:
head -10 part-r-00000
a a a c b 1
a a a dumbbell 1
a a a f a 1
a a a general i 2
a a a glory 2
a a a h d 1
a a a h o 4
a a a h z 1
a a a hem hem 1
a a a k 3
I need to sort the file based on the last column of each line (descending order), which is of variable length. If there is a match on the numerical value then sort alphabetically by the 2nd last column. The following BASH command works on small datasets (not complete files) and takes 3 second to sort only 10 lines from one file.
cat part-r-00000 | awk '{print $NF,$0}' | sort -nr | cut -f2- -d' ' > FILE
I want the output in a separate FILE. Can someone help me out to speed up the process?
No, once you get rid of the UUOC that's as fast as it's going to get. Obviously you need to add the 2nd-last field to everything too, e.g. something like:
awk '{print $NF,$(NF-1),$0}' part-r-00000 | sort -k1,1nr -k2,2 | cut -f3- -d' '
Check the sort args, I always get mixed up with those..
Reverse order, sort and reverse order:
awk '{for (i=NF;i>0;i--){printf "%s ",$i};printf "\n"}' file | sort -nr | awk '{for (i=NF;i>0;i--){printf "%s ",$i};printf "\n"}'
Output:
a a a h o 4
a a a k 3
a a a general i 2
a a a glory 2
a a a h z 1
a a a hem hem 1
a a a dumbbell 1
a a a h d 1
a a a c b 1
a a a f a 1
You can use a Schwartzian transform to accomplish your task,
awk '{print -$NF, $(NF-1), $0}' input_file | sort -n | cut -d' ' -f3-
The awk command prepends each record with the negative of the last field and the second last field.
The sort -n command sorts the record stream in the required order because we used the negative of the last field.
The cut command splits on spaces and cuts the first two fields, i.e., the ones we used to normalize the sort
Example
$ echo 'a a a c b 1
a a a dumbbell 1
a a a f a 1
a a a general i 2
a a a glory 2
a a a h d 1
a a a h o 4
a a a h z 1
a a a hem hem 1
a a a k 3' | awk '{print -$NF, $(NF-1), $0}' | sort -n | cut -d' ' -f3-
a a a h o 4
a a a k 3
a a a glory 2
a a a general i 2
a a a f a 1
a a a c b 1
a a a h d 1
a a a dumbbell 1
a a a hem hem 1
a a a h z 1
$

How to sort columns using bash script? [duplicate]

I have a file full of data in columns
sarah mark john
10 20 5
x y z
I want to sort the data so the columns stay intact but the second row is in increasing order so it looks like this:
john sarah mark
5 10 20
z x y
I've been looking at the sort command but have only been able to find vertical sorting, not horizontal. I'm happy to use any tool, any help is appreciated.
Thank you!
Let's create a function to transpose a file (make rows become columns, and columns become rows):
transpose () {
awk '{for (i=1; i<=NF; i++) a[i,NR]=$i; max=(max<NF?NF:max)}
END {for (i=1; i<=max; i++)
{for (j=1; j<=NR; j++)
printf "%s%s", a[i,j], (j<NR?OFS:ORS)
}
}'
}
This just loads all the data into a bidimensional array a[line,column] and then prints it back as a[column,line], so that it transposes the given input. The wrapper transpose () { } is used to store it as a bash function. You just need to copy paste it in your shell (or in ~/.bashrc if you want it to be a permanent function, available any time you open a session).
Then, by using it, we can easily solve the problem by using sort -n -k2: sort numerically based on column 2. Then, transpose back.
$ cat a | transpose | sort -n -k2 | transpose
john sarah mark
5 10 20
z x y
In case you want to have a nice format as final output, just pipe to column like this:
$ cat a | transpose | sort -n -k2 | transpose | column -t
john sarah mark
5 10 20
z x y
Step by step:
$ cat a | transpose
sarah 10 x
mark 20 y
john 5 z
$ cat a | transpose | sort -n -k2
john 5 z
sarah 10 x
mark 20 y
$ cat a | transpose | sort -n -k2 | transpose
john sarah mark
5 10 20
z x y
Coming from a duplicate question, this would sort the columns by the first row:
#!/bin/bash
input="$1"
order=$((for i in $(head -1 $input); do echo $i; done) | nl | sort -k2 | cut -f1)
grep ^ $input | (while read line
do
read -a columns <<< "${line%"${line##*[![:space:]]}"}"
orderedline=()
for i in ${order[#]}
do
orderedline+=("${columns[$i - 1]}")
done
line=$(printf "\t%s" "${orderedline[#]}")
echo ${line:1}
done)
To sort by second row, replace head -1 $input with head -2 $input | tail -1. If the sort should be numeric, put in sort -n -k2 instead of sort -k2.
Good one-liner gets the job done:
perl -ane '$,=" "; print sort #F; print "\n";' file
I found it here: http://www.unix.com/unix-for-advanced-and-expert-users/36039-horizontal-sorting-lines-file-sed-implementation.html

Resources