How to do specific sorting in unix - shell

How can I sort following two lines
ABCTz.T.3a.B Student 1 1.4345
ABCTz.T.3.B Student 1 1.5465
to print them like below:
ABCTz.T.3.B Student 1 1.5465
ABCTz.T.3a.B Student 1 1.4345
It can definitely be done using a mixture of the sed and sort commands, but that's not a generic solution. Here is the sample code:
cat 1 | sed "s/\./ ./g" | sort -k3,3 | sed "s/ \././g"
This solution requires customization if the length of the string changes or the number of characters between two dots changes, e.g.:
ABCTz.T.SC.D.3a.B Student 1 1.4345
ABCTz.T.SC.D.3.B Student 1 1.5465
Again, I would need to modify the sort expression to account for the length in this case. Looking forward to having something very generic.
Regards, Divesh

You can use version sort, available with GNU sort, on the first field:
sort -V -rk1 file
ABCTz.T.3.B Student 1 1.5465
ABCTz.T.3a.B Student 1 1.4345

If the fields are tab-separated, it's easy:
sort -t$'\t' -n -r -k4 1
But if the number of spaces is variable, I sort with awk.
This pipeline puts the 4th field at the beginning, followed by |, then sorts on that field, and finally strips it out:
awk '{print $4 "|" $0}' 1 | sort -t"|" -n -r -k1 | cut -d"|" -f2-
Example:
[osboxes@osboxes Desktop]$ cat 1
asdfa safadf 1.2
asldfkañ sdlfsld 1.3
[osboxes@osboxes Desktop]$ cat 1 | awk '{print $3 "|" $0}' | sort -t"|" -n -r -k1 | cut -d"|" -f2-
asldfkañ sdlfsld 1.3
asdfa safadf 1.2
Enjoy!

check if column has more than one value in unix [duplicate]

I have a text file with a large amount of data which is tab delimited. I want to have a look at the data such that I can see the unique values in a column. For example,
Red Ball 1 Sold
Blue Bat 5 OnSale
...............
So, it's like the first column has colors; I want to know how many different unique values there are in that column, and I want to be able to do that for each column.
I need to do this in a Linux command line, so probably using some bash script, sed, awk or something.
What if I wanted a count of these unique values as well?
Update: I guess I didn't put the second part clearly enough. What I want is a count of "each" of these unique values, not just how many unique values there are. For instance, in the first column I want to know how many Red, Blue, Green etc. coloured objects there are.
You can make use of cut, sort and uniq commands as follows:
cat input_file | cut -f 1 | sort | uniq
gets unique values in field 1; replacing 1 with 2 will give you unique values in field 2.
Avoiding UUOC :)
cut -f 1 input_file | sort | uniq
EDIT:
To count the number of unique occurrences, you can add the wc command to the chain:
cut -f 1 input_file | sort | uniq | wc -l
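As a concrete check, here is the chain run against a small fabricated tab-separated file in the question's format (the file name input_file and the extra Red row are made up for the demo):

```shell
# Fabricated sample in the question's tab-delimited format.
printf 'Red\tBall\t1\tSold\nBlue\tBat\t5\tOnSale\nRed\tCube\t2\tSold\n' > input_file

cut -f 1 input_file | sort | uniq          # unique colours: Blue, Red
cut -f 1 input_file | sort | uniq -c       # count of each colour
cut -f 1 input_file | sort | uniq | wc -l  # number of unique colours: 2
```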
Alternatively, a single awk pass can count each value in the first column directly:
awk -F '\t' '{ a[$1]++ } END { for (n in a) print n, a[n] } ' test.csv
You can use awk, sort & uniq to do this; for example, to list all the unique values in the first column:
awk < test.txt '{print $1}' | sort | uniq
As posted elsewhere, if you want to count the number of instances of something you can pipe the unique list into wc -l
Assuming the data file is actually Tab separated, not space aligned:
<test.tsv awk '{print $4}' | sort | uniq
Where $4 will be:
$1 - Red
$2 - Ball
$3 - 1
$4 - Sold
# COLUMN is integer column number
# INPUT_FILE is input file name
cut -f ${COLUMN} < ${INPUT_FILE} | sort -u | wc -l
Here is a bash script that fully answers the (revised) original question. That is, given any .tsv file, it provides the synopsis for each of the columns in turn. Apart from bash itself, it only uses standard *nix/Mac tools: sed, tr, wc, cut, sort and uniq.
#!/bin/bash
# Syntax: $0 filename
# The input is assumed to be a .tsv file
FILE="$1"
cols=$(sed -n 1p "$FILE" | tr -cd '\t' | wc -c)
cols=$((cols + 2))
for ((i = 1; i < cols; i++))
do
    echo Column $i ::
    cut -f $i < "$FILE" | sort | uniq -c
    echo
done
This script outputs each unique value in every column of a given file, together with its count. It assumes that the first line of the file is a header line. There is no need to define the number of fields. Simply save the script in a bash file (.sh) and provide the tab-delimited file as a parameter.
Code
#!/bin/bash
# Note: the multidimensional arrays below require GNU awk (gawk) 4 or later.
awk '
(NR==1){
    for(fi=1; fi<=NF; fi++)
        fname[fi]=$fi;
}
(NR!=1){
    for(fi=1; fi<=NF; fi++)
        arr[fname[fi]][$fi]++;
}
END{
    for(fi=1; fi<=NF; fi++){
        out=fname[fi];
        for (item in arr[fname[fi]])
            out=out"\t"item"_"arr[fname[fi]][item];
        print(out);
    }
}
' "$1"
Execution Example:
bash> ./script.sh <path to tab-delimited file>
Output Example
isRef A_15 C_42 G_24 T_18
isCar YEA_10 NO_40 NA_50
isTv FALSE_33 TRUE_66

Bash - How to count occurrences in a column of a .csv file (without awk)

Recently I've started to learn bash scripting, and I'm wondering how I can count occurrences in a column of a .csv file. The file is structured like this:
DAYS,SOMEVALUE,SOMEVALUE
sunday,something,something
monday,something,something
wednesday,something,something
sunday,something,something
monday,something,something
So my question is: how can I count how many times each value of the first column (days) appears? In this case the output must be:
Sunday : 2
Monday : 2
Wednesday: 1
The first column is named DAYS, so the script must not count the header value DAYS itself; DAYS is just a way to identify the column.
If possible I'd like to see a solution without the awk command and without Python etc.
Thanks, and sorry for my bad English.
Edit: I thought of doing this:
count="$( cat "${FILE}" | grep -c "OCCURRENCE")"
echo "OCCURRENCE": ${count}
where OCCURRENCE is one of the single values (sunday, monday, ...).
But this solution is not automatic: I would need to build a list of the unique values in the first column of the .csv file, put each one in an array, and then count each one with the code above. I need some help to do this, thanks.
cut -f1 -d, test.csv | tail -n +2 | sort | uniq -c
This gets you this far:
2 monday
2 sunday
1 wednesday
To get your format (Sunday : 2), I think awk would be an easy and clear way (something like awk '{print $2 " : " $1}'), but if you really, really must, here's a complete non-awk version:
cut -f1 -d, test.csv | tail -n +2 | sort | uniq -c | while read line; do words=($line); echo ${words[1]} : ${words[0]}; done
A variation of @sneep's answer that uses sed to format the result:
cut -f1 -d, /tmp/data | tail -n +2 | sort | uniq -c | sed 's|^ *\([0-9]*\) \(.*\)|\u\2: \1|g'
Output:
Monday: 2
Sunday: 2
Wednesday: 1
The sed is matching:
^ *: Beginning of line and then any number of spaces
\([0-9]*\): Any number of numbers (storing them in a group \1)
: A single space
\(.*\): Any character until the end, storing it in group \2
And replaces the match with:
\u\2: Second group, capitalizing first character
: \1: Colon, space and the first group
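Note that \u is a GNU sed extension. A quick isolated check of the substitution on one line of typical uniq -c output:

```shell
# GNU sed's \u uppercases the next character of the replacement,
# turning "      2 monday" into "Monday: 2".
echo '      2 monday' | sed 's|^ *\([0-9]*\) \(.*\)|\u\2: \1|'
# → Monday: 2
```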

Bash - Count number of occurrences in text file and display in descending order

I want to count the amount of the same words in a text file and display them in descending order.
So far I have :
cat sample.txt | tr ' ' '\n' | sort | uniq -c | sort -nr
This mostly gives me satisfying output, except that it includes special characters like commas, full stops, exclamation marks and hyphens.
How can I modify the existing command to exclude the special characters mentioned above?
You can use tr with a composite string of the characters you wish to delete.
Example:
$ echo "abc, def. ghi! boss-man" | tr -d ',.!'
abc def ghi boss-man
Or, use a POSIX character class (quoted, so the shell doesn't expand it), knowing that boss-man, for example, would become bossman:
$ echo "abc, def. ghi! boss-man" | tr -d '[:punct:]'
abc def ghi bossman
Side note: You can have a lot more control and speed by using awk for this:
$ echo "one two one! one. oneone
two two three two-one three" |
awk 'BEGIN{RS="[^[:alpha:]]"}
/[[:alpha:]]/ {seen[$1]++}
END{for (e in seen) print seen[e], e}' |
sort -k1,1nr -k2,2
4 one
4 two
2 three
1 oneone
How about first extracting the words with grep:
grep -o "\w\+" sample.txt | sort | uniq -c | sort -nr

One-liner command to extract # of ID occurrences in a very long file

I have the following really huge file (millions of lines) in the following format:
Timestamp, ID, GUID
Example:
2014-04-14 23:59:59,754 2294 123B24C6452231DC1770FE37E6F3D51168
2014-04-14 23:59:59,757 102254 B9E0CE6C9F67745326F9FD07C5B31B4E1D65
ID is a number which can be anything from a single digit up to 6 digits.
GUID has a constant length (as above).
I would like to get the number of occurrences of each ID in the file.
The output should look something like:
Count, ID
8 2294
15 102254
...
I am trying to get this with a single grep combined with uniq and sort, without much success.
Any help is appreciated.
If there are single spaces in between the fields (as in your example) rather than commas (as in your format), then you could use:
cut -d' ' -f3 hugefile | sort | uniq -c
Another alternative, if the separator might be several spaces:
awk '{print $3}' hugefile | sort | uniq -c
You could also do all the work inside the awk program (untested):
awk '{c[$3]++} END { for (n in c) print c[n], n }' hugefile
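To sanity-check that untested one-liner, here it is run on a couple of fabricated lines (the IDs and GUIDs are made up; sort is appended only to make the output order deterministic):

```shell
# Count occurrences of the 3rd whitespace-separated field (the ID).
printf '%s\n' \
  '2014-04-14 23:59:59,754 2294 AAAA11112222' \
  '2014-04-14 23:59:59,757 102254 BBBB33334444' \
  '2014-04-15 00:00:01,003 2294 CCCC55556666' |
awk '{c[$3]++} END { for (n in c) print c[n], n }' | sort
# → 1 102254
#   2 2294
```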
You can use this:
grep -Po '(?<= )[0-9]+ ' yourfile | sort | uniq -c

Unix - Sorting file name with a key but not knowing its position

I would like to sort those files using Unix commands:
MyFile_fdfdsf_20140326.txt
MyFile_4fg5d6_20100301.csv
MyFile_dfgfdklm_19990101.tar.gz
The result I am expecting here is MyFile_fdfdsf_20140326.txt
So I'd like to get the file with the newest date.
I can't use 'sort -k', as the position of the key (the date) may vary.
But in my file names there are always two "_" delimiters and a dot '.' before the file extension.
Any help would be appreciated :)
Use -t to indicate the field separator and set it to _:
sort -t'_' -k3
See an example of sorting the file names when they are in a file. I used -n for numeric sort and -r for reverse order:
$ sort -t'_' -nk3 file
MyFile_dfgfdklm_19990101.tar.gz
MyFile_4fg5d6_20100301.csv
MyFile_fdfdsf_20140326.txt
$ sort -t'_' -rnk3 file
MyFile_fdfdsf_20140326.txt
MyFile_4fg5d6_20100301.csv
MyFile_dfgfdklm_19990101.tar.gz
From man sort:
-t, --field-separator=SEP
use SEP instead of non-blank to blank transition
-n, --numeric-sort
compare according to string numerical value
-r, --reverse
reverse the result of comparisons
Update (comment from user3464809):
Thank you for your answer. It's perfect. But out of curiosity, what if I had an unknown number of delimiters, but the date was always after the last "_" delimiter? MyFile_abc_def_...20140326.txt (sort -t'_' -nk??? file)
You can trick it a little bit: prepend the last field, sort on it, and then strip it off again.
awk -F_ '{print $NF, $0}' a | sort | cut -d' ' -f2-
See an example:
$ cat a
MyFile_fdfdsf_20140326.txt
MyFile_4fg5d6_20100301.csv
MyFile_dfgfdklm_19990101.tar.gz
MyFile_dfgfdklm_asdf_asdfsadfas_19940101.tar.gz
MyFile_dfgfdklm_asdf_asdfsadfas_29990101.tar.gz
$ awk -F_ '{print $NF, $0}' a | sort | cut -d' ' -f2-
MyFile_dfgfdklm_asdf_asdfsadfas_19940101.tar.gz
MyFile_dfgfdklm_19990101.tar.gz
MyFile_4fg5d6_20100301.csv
MyFile_fdfdsf_20140326.txt
MyFile_dfgfdklm_asdf_asdfsadfas_29990101.tar.gz
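If only the newest file is wanted (as in the original question), the same decorate-sort-undecorate trick can finish with tail. A self-contained sketch using the question's file names fed in via printf instead of a file:

```shell
# Decorate each name with its last "_" field (the date), sort,
# strip the decoration, and keep the last (newest) line.
printf '%s\n' \
  'MyFile_fdfdsf_20140326.txt' \
  'MyFile_4fg5d6_20100301.csv' \
  'MyFile_dfgfdklm_19990101.tar.gz' |
awk -F_ '{print $NF, $0}' | sort | cut -d' ' -f2- | tail -n 1
# → MyFile_fdfdsf_20140326.txt
```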
