Unix - Find and Sort by part of filename - sorting

I have this:
D-T4-0.txt
A-2.txt
C-3-1.txt
B-X1-3.txt
E-2-4.txt
and I wish to order as follows:
D, C, A, B, E
I need to sort by the last number in each filename (the one just before .txt): D-0, C-1, A-2, B-3, E-4.
Is it possible?

for i in `awk -F- '{print $NF}' file_name | sort`; do grep -- -$i file_name; done
Here I extract the last -delimited field with awk and sort those keys,
then use a loop to grep the file for each sorted key with a - added in front.
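Assuming file_name holds the five sample names above, the sorted keys come out as 0.txt, 1.txt, 2.txt, 3.txt, 4.txt, so the loop should print the files already in the requested order:
D-T4-0.txt
C-3-1.txt
A-2.txt
B-X1-3.txt
E-2-4.txt
A slightly stricter variant would anchor each pattern to the end of the line, e.g. grep -- "-$i$" file_name, so a key such as 2.txt cannot accidentally match in the middle of another name.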

You can do it in a pipeline like this:
# List files
ls |
# Include the sorting key in the front as well
sed -E 's/^(.*)([0-9]+)\.txt$/\2\t\1\2.txt/' |
# Sort on the sorting key
sort -n |
# Remove the sorting key
cut -f2- |
# Grab the first letter
cut -c1
Output:
D
C
A
B
E
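To see why this works with the sample files, here are the decorated lines that sort -n receives (in ls's alphabetical order, with a literal tab between the key and the name):
2	A-2.txt
3	B-X1-3.txt
1	C-3-1.txt
0	D-T4-0.txt
4	E-2-4.txt
sort -n reorders these by the leading key, cut -f2- strips the key again, and the final cut -c1 keeps only the first letter, giving D, C, A, B, E.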

Related

check if column has more than one value in unix [duplicate]

I have a text file with a large amount of data which is tab delimited. I want to have a look at the data such that I can see the unique values in a column. For example,
Red Ball 1 Sold
Blue Bat 5 OnSale
...............
So it's like the first column has colors, and I want to know how many different unique values there are in that column, and I want to be able to do that for each column.
I need to do this in a Linux command line, so probably using some bash script, sed, awk or something.
What if I wanted a count of these unique values as well?
Update: I guess I didn't put the second part clearly enough. What I want is a count of "each" of these unique values, not just how many unique values there are. For instance, in the first column I want to know how many Red, Blue, Green, etc. coloured objects there are.
You can make use of cut, sort and uniq commands as follows:
cat input_file | cut -f 1 | sort | uniq
This gets the unique values in field 1; replacing 1 with 2 will give you the unique values in field 2.
Avoiding UUOC :)
cut -f 1 input_file | sort | uniq
EDIT:
To count the number of unique occurrences, you can add the wc command to the chain:
cut -f 1 input_file | sort | uniq | wc -l
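For the per-value counts asked about in the update, use uniq -c instead and drop the wc -l; it prefixes every distinct value with the number of times it occurs:
cut -f 1 input_file | sort | uniq -c
For the two sample rows this would print something like 1 Blue and 1 Red, one line per distinct color.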
awk -F '\t' '{ a[$1]++ } END { for (n in a) print n, a[n] } ' test.csv
You can use awk, sort & uniq to do this; for example, to list all the unique values in the first column:
awk < test.txt '{print $1}' | sort | uniq
As posted elsewhere, if you want to count the number of instances of something, you can pipe the unique list into wc -l.
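For example, counting the distinct values in the first column of test.txt:
awk < test.txt '{print $1}' | sort | uniq | wc -l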
Assuming the data file is actually Tab separated, not space aligned:
<test.tsv awk '{print $4}' | sort | uniq
Where $4 will be:
$1 - Red
$2 - Ball
$3 - 1
$4 - Sold
# COLUMN is integer column number
# INPUT_FILE is input file name
cut -f ${COLUMN} < ${INPUT_FILE} | sort -u | wc -l
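As a small usage sketch, with the sample rows saved in a hypothetical tab-separated file colors.tsv, counting the distinct colors in column 1 would look like this (it prints 2, for Blue and Red):
COLUMN=1
INPUT_FILE=colors.tsv
cut -f ${COLUMN} < ${INPUT_FILE} | sort -u | wc -l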
Here is a bash script that fully answers the (revised) original question. That is, given any .tsv file, it provides a synopsis of each of the columns in turn. Apart from bash itself, it only uses standard *ix/Mac tools: sed tr wc cut sort uniq.
#!/bin/bash
# Syntax: $0 filename
# The input is assumed to be a .tsv file
FILE="$1"
cols=$(sed -n 1p "$FILE" | tr -cd '\t' | wc -c)
cols=$((cols + 2 ))
i=0
for ((i=1; i < $cols; i++))
do
echo Column $i ::
cut -f $i < "$FILE" | sort | uniq -c
echo
done
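As a rough illustration, if the two sample rows from the question were saved tab-separated and passed to the script, the first column's section of the output would look like this (uniq -c left-pads its counts), and the remaining columns follow the same pattern:
Column 1 ::
      1 Blue
      1 Red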
This script outputs the count of each unique value in every column of a given file. It assumes that the first line of the given file is a header line. There is no need to define the number of fields. Simply save the script in a bash file (.sh) and provide the tab-delimited file as a parameter to this script.
Code
#!/bin/bash
awk '
(NR==1){
for(fi=1; fi<=NF; fi++)
fname[fi]=$fi;
}
(NR!=1){
for(fi=1; fi<=NF; fi++)
arr[fname[fi]][$fi]++;
}
END{
for(fi=1; fi<=NF; fi++){
out=fname[fi];
for (item in arr[fname[fi]])
out=out"\t"item"_"arr[fname[fi]][item];
print(out);
}
}
' "$1"
Execution Example:
bash> ./script.sh <path to tab-delimited file>
Output Example
isRef A_15 C_42 G_24 T_18
isCar YEA_10 NO_40 NA_50
isTv FALSE_33 TRUE_66

How to use the cut command -f flag in reverse

This is a text file called a.txt
ok.google.com
abc.google.com
I want to select every subdomain separately
cat a.txt | cut -d "." -f1 (this selects ok from the left side)
cat a.txt | cut -d "." -f2 (this selects google from the left side)
Is there any way to get the result from the right side?
cat a.txt | cut (so it selects com from the right side)
There could be a few ways to do this; one I can think of right now is a rev + cut + rev solution. It reverses the input with rev, then sets the field separator to . and prints the fields as if counting from the left (though they are actually reversed because of rev), then passes the output through rev again to restore the original order.
rev Input_file | cut -d'.' -f 1 | rev
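The same trick selects any field counted from the right; for example, the second field from the right for the sample a.txt (google for both lines):
rev a.txt | cut -d'.' -f2 | rev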
You can use awk to print the last field:
awk -F. '{print $NF}' a.txt
-F. sets the field separator to "."
$NF is the last field
And you can give your file directly as an argument, so you can avoid the famous "Useless use of cat"
For other fields, counting from the last, you can use expressions as suggested in the comment by @sundeep or described in the user's guide under
4.3 Nonconstant Field Numbers. For example, to get the domain just before the TLD, you can subtract 1 from the number of fields, NF:
awk -F. '{ print $(NF-1) }' a.txt
You might use sed with a quantifier for the grouped value repeated till the end of the string.
( Start group
\.[^[:space:].]+ Match 1 dot and 1+ occurrences of any char except a space or dot
){1} Close the group followed by a quantifier
$ End of string
Example
sed -E 's/(\.[^[:space:].]+){1}$//' file
Output
ok.google
abc.google
If the quantifier is {2} the output will be
ok
abc
Depending on what you want to do after getting the values, you could use bash to split your domain into an array of its components:
#!/bin/bash
IFS=. read -ra comps <<< "ok.google.com"
echo "${comps[-2]}"
# or for bash < 4.2
echo "${comps[${#comps[#]}-2]}"
google

shell sort command : How to sort by the last column (the number of columns is uncertain)?

If the data is like the following:
a,b,3
c,d,e,f,2
g,1
I want to sort by the last column. The result should be:
g,1
c,d,e,f,2
a,b,3
If the last field is a single digit:
$ rev file | sort | rev
You may need to add -t, -n to sort for numerical ordering, but with single digits it doesn't matter.
Or, for the general case, with awk:
$ awk -F, '{a[$NF]=$0} END{n=asorti(a,d); for(k=1;k<=n;k++) print a[d[k]]}' file
g,1
c,d,e,f,2
a,b,3
This will fail if the last field is not unique. Using the decorate/sort/undecorate idiom, you can instead write (as you found yourself):
$ awk -F, '{print $NF FS $0}' file | sort -n | cut -d, -f2-
It's safer to use the field delimiter between the key and the record, since you want to ensure the FS doesn't appear in the key itself.
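For the sample data, the decorated lines that sort -n sees are shown below; sort orders them by the duplicated last field and cut -d, -f2- then removes it:
3,a,b,3
2,c,d,e,f,2
1,g,1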
I have a stupid but simple way to do it :)
# if the original data is in the file ~/Desktop/1.log
$ awk -F, '{print $NF, $0}' ~/Desktop/1.log | sort -n | awk '{print $2}'
g,1
c,d,e,f,2
a,b,3
Here is my solution using a bash script -- I named it uncertain.sh.
# Set here the size of the largest item to sort.
# In our case it is c,d,e,f,2 which is size 5.
max_n=5
# This function 'pads' array with P's before last element
# to force it to grow to max_n size.
# For example, (a b 3) will be transformed into (a b P P 3).
pad () {
local arr=("$@")
local l=${#arr[@]}
local diff_l=$((max_n-l))
local padding=""
# construct padding
for i in `seq 1 $diff_l`; do
padding+="P "
done
local l_minus=$((l-1))
arr=(${arr[@]:0:$l_minus} "$padding"${arr[@]:$l_minus})
echo "${arr[@]}"
}
################################################
# Provide A,B,C here to sort by last item
################################################
A="a,b,3"
B="c,d,e,f,2"
C="g,1"
A=$(echo "$A" | tr ',' ' ')
B=$(echo "$B" | tr ',' ' ')
C=$(echo "$C" | tr ',' ' ')
a=(`echo "$A"`)
b=(`echo "$B"`)
c=(`echo "$C"`)
# Get padded arrays.
a=$(pad "${a[@]}")
b=$(pad "${b[@]}")
c=$(pad "${c[@]}")
# Here, we sort by the last field (we can do this since
# padded arrays are all same size 5).
# Then we remove 'P's from strings.
feed=$(printf "%s\n" "$a" "$b" "$c" | sort -k5,5n | tr -d 'P')
# Lastly, we change spaces with commas ','.
while read line; do
echo "$line" | tr -s ' ' | tr ' ' ','
done < <(echo "$feed")
Here's the output
$ ./uncertain.sh
g,1
c,d,e,f,2
a,b,3
Here's how I did it:
We start with
a,b,3
c,d,e,f,2
g,1
We convert this to
a,b,P,P,3
c,d,e,f,2
g,P,P,P,1
Then we can sort by the 5th column, since the rows are now all the same size, 5.
So this becomes
g,P,P,P,1
c,d,e,f,2
a,b,P,P,3
We can now remove P's.
g,1
c,d,e,f,2
a,b,3
Hope you found this useful.

how to compare total in unix

I have a file simple.txt with contents as below:
a b
c d
c d
I want to check which pair, 'a b' or 'c d', has the maximum occurrence. I have written this code, which gives me the individual occurrence count of each word:
cat simple.txt | tr -cs '[:alnum:]' '[\n*]' | sort | uniq -c |
grep -E -i "\<a\>|\<b\>|\<c\>|\<d\>"
1 a
1 b
2 c
2 d
How can I total the result of this output? Or can I write different code?
If we can assume that each pair of letters is a complete line, one way to handle this would be to sort the lines, use the uniq utility to get a count of each unique line, and then reverse sort numerically so the highest count comes first:
sort simple.txt | uniq -c | sort -rn
You may want to get rid of the empty lines using egrep:
egrep '\w' simple.txt | sort | uniq -c | sort -rn
Which should give you:
2 c d
1 a b
$ sort file |
uniq -c |
sort -nr > >(read -r count pair; echo "max count $count is for pair $pair")
Sort, count the duplicates, sort numerically in descending order, then read the first line and print the result.
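With the simple.txt shown in the question (and no blank lines), this should print:
max count 2 is for pair c d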
Or, all of the above in one awk script, keeping the pair with the highest count:
$ awk '{c[$0]++}
      END{for (k in c) if (c[k] > max+0) {max = c[k]; pair = k}
          print "max count is " max " for pair " pair}' file
With a single GNU awk command:
awk 'BEGIN{ PROCINFO["sorted_in"] = "@val_num_desc" }
NF{ a[$0]++ }
END{ for (i in a) { print "The pair with max occurrence is:", i; break } }' file
The output:
The pair with max occurrence is: c d
To get the pair that occurs most frequently:
$ sort <simple.txt | uniq -c | sort -nr | awk '{print "The pair with max occurrence is",$2,$3; exit}'
The pair with max occurrence is c d
This can be done entirely in awk, without any need for pipelines:
$ awk '{a[$0]++} END{for (x in a) if (a[x]>(max+0)) {max=a[x]; line=x}; print "The pair with max occurrence is",line}' simple.txt
The pair with max occurrence is c d

Bash - Count number of occurrences in text file and display in descending order

I want to count the number of occurrences of each word in a text file and display the counts in descending order.
So far I have :
cat sample.txt | tr ' ' '\n' | sort | uniq -c | sort -nr
This mostly gives me satisfying output, except that it includes special characters like commas, full stops, exclamation marks and hyphens.
How can I modify existing command to not include special characters mentioned above?
You can use tr with a composite string of the characters you wish to delete.
Example:
$ echo "abc, def. ghi! boss-man" | tr -d ',.!'
abc def ghi boss-man
Or, use a POSIX character class knowing that boss-man for example would become bossman:
$ echo "abc, def. ghi! boss-man" | tr -d [:punct:]
abc def ghi bossman
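One way to fold this into the original pipeline, as a sketch (tr -s squeezes the runs of spaces that deleting punctuation can leave behind):
tr -d '[:punct:]' < sample.txt | tr -s ' ' '\n' | sort | uniq -c | sort -nr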
Side note: You can have a lot more control and speed by using awk for this:
$ echo "one two one! one. oneone
two two three two-one three" |
awk 'BEGIN{RS="[^[:alpha:]]"}
/[[:alpha:]]/ {seen[$1]++}
END{for (e in seen) print seen[e], e}' |
sort -k1,1nr -k2,2
4 one
4 two
2 three
1 oneone
How about first extracting words with grep:
grep -o "\w\+" sample.txt | sort | uniq -c | sort -nr

Resources