Scripts for listing all the distinct characters in a text file - bash

Given a file input.txt, which has the following content:
He likes cats, really?
the output would be like:
Note the order of characters in output does not matter.

One way using grep -o . to put each character on a newline and sort -u to remove duplicates:
$ grep -o . file | sort -u
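Or a solution written purely in awk that doesn't require sort -u or multiple commands (an empty FS splits each line into single characters in gawk; POSIX leaves the behavior of an empty FS unspecified):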
$ awk '{for(i=1;i<=NF;i++)if(!a[$i]++)print $i}' FS="" file

How about:
echo "He likes cats, really?" | fold -w1 | sort -u

An awk way:
awk '{$1=$1}1' FS="" OFS="\n" file | sort -u

You can use sed as follows (GNU sed, since \n in the replacement is an extension; & stands for the matched character):
sed 's/./&\n/g' input.txt | sort -u
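As a quick check of the pipeline approach, here is a minimal sketch that wraps grep -o . and sort -u in a function and runs it on the sample sentence. The function name distinct_chars is made up for illustration, and LC_ALL=C is an assumption added only to keep the sort order predictable for ASCII input.

```shell
# distinct_chars: print each distinct character of stdin on its own line.
# (Hypothetical helper name; LC_ALL=C is only there for a stable byte order.)
distinct_chars() {
    LC_ALL=C grep -o . | LC_ALL=C sort -u
}

# The sample sentence from the question.
printf '%s\n' 'He likes cats, really?' | distinct_chars
```

The space character counts as a character too, so one of the output lines contains a single space.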


How do I remove the header in the df command?

I'm trying to write a bash command that will sort all volumes by the amount of data they have used and tried using
df | awk '{print $1 | "sort -r -k3 -n"}'
But this also shows the header called Filesystem.
How do I remove that?
For your specific case, i.e. using awk, @codeforester's answer (using awk's NR variable, the Number of Records) is the best.
In a more general case, in order to remove the first line of any output, you can use the tail -n +N option in order to output starting with line N:
df | tail -n +2 | other_command
This will remove the first line in df output.
Skip the first line, like this:
df | awk 'NR>1 {print $1 | "sort -r -k3 -n"}'
I normally use one of these options, if I have no reason to use awk:
df | sed 1d
The 1d command tells sed to delete the first line and print everything else.
df | tail -n+2
The -n+2 option tells tail to start at line 2 and print everything until end of input.
I suspect sed is faster than awk or tail, but I can't prove it.
If you want to use awk, this will print every line except the first:
df | awk '{if (FNR>1) print}'
FNR is the record number within the current input file, which here is the line number of the input. If it is greater than 1, print the input line.
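The tail, sed and awk variants above can be compared on a fixed input instead of live df output. This is a small sketch; the sample() helper and its rows are invented stand-ins for df output, not real data.

```shell
# Sample text standing in for df output (made-up values, not real data).
sample() {
    printf '%s\n' \
        'Filesystem 1K-blocks Used Available Use% Mounted' \
        '/dev/sda1 100 40 60 40% /' \
        'tmpfs 50 5 45 10% /run'
}

sample | tail -n +2      # start output at line 2
sample | sed 1d          # delete line 1, print the rest
sample | awk 'NR > 1'    # print records whose number exceeds 1
```

All three commands print the same two data lines, so the choice is mostly a matter of taste and of what the rest of the pipeline already uses.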
Count the lines in the output of df with wc, subtract one, and pass the result to tail to get a headerless df:
LINES=$(( $(df | wc -l) - 1 ))
df | tail -n "${LINES}"
OK, I see one-liners are wanted. Here is mine:
DF_HEADERLESS=$(LINES=$(df|wc -l); LINES=$((${LINES}-1));df | tail -n ${LINES})
And for formatted output, let's have printf loop over it:
printf "%s\t%s\t%s\t%s\t%s\t%s\n" ${DF_HEADERLESS} | awk '{print $1 | "sort -r -k3 -n"}'
This might help with GNU df and GNU sort:
df -P | awk 'NR>1{$1=$1; print}' | sort -r -k3 -n | awk '{print $1}'
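One thing worth noting about the pipeline in the question: awk sends only $1 to sort, so sort never sees the third column that -k3 refers to. The command above avoids that by sorting whole lines and extracting the name last. A variant of the same idea, as a sketch: the function name volumes_by_used and the sample rows are made up, and df -P is assumed so that every filesystem stays on one line.

```shell
# volumes_by_used: read df -P style text on stdin, drop the header,
# sort by the third column (used blocks) descending, print the first column.
volumes_by_used() {
    tail -n +2 | sort -k3,3nr | awk '{print $1}'
}

# Made-up rows in df -P layout, for illustration only.
printf '%s\n' \
    'Filesystem 1024-blocks Used Available Capacity Mounted on' \
    '/dev/sda1 100 40 60 40% /' \
    'tmpfs 50 5 45 10% /run' \
    '/dev/sdb1 900 700 200 78% /data' | volumes_by_used
```

On a real system this would be used as df -P | volumes_by_used.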
With GNU df and GNU awk:
df -P | awk 'NR>1{array[$3]=$1} END{PROCINFO["sorted_in"]="@ind_num_desc"; for(i in array){print array[i]}}'
Documentation: 8.1.6 Using Predefined Array Scanning Orders with gawk
Removing something from a command output can be done very simply, using grep -v, so in your case:
df | grep -v "Filesystem" | ...
(You can do your awk at the ...)
If you're not sure about upper and lower case, you can add -i:
df | grep -i -v "FiLeSyStEm" | ...
(The switching caps/small caps are meant as a clarification joke :-) )
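Keep in mind that grep -v "Filesystem" drops every line containing that word, not only the header. A hedged variant anchors the pattern to the start of the line, assuming the header begins with Filesystem as in the df output discussed here. The helper name and the sample rows are invented.

```shell
# drop_df_header: remove only lines that begin with "Filesystem".
drop_df_header() {
    grep -v '^Filesystem'
}

# Invented rows; the last one contains the word but is not the header.
printf '%s\n' \
    'Filesystem 1K-blocks Used' \
    '/dev/sda1 100 40' \
    '/mnt/FilesystemBackup 50 5' | drop_df_header
```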

remove whitespace from piped output

In a text file I have some tags in the notation :foo. To get an overview of the tags in the file, I want a listing of all of them.
This is done via
grep -o -e ":[a-z]*\( \|$\)" file.txt | sort | uniq
Now I get duplicates because of the whitespace or newline character at the end.
:movie <-- only newline
:movie <-- whitespace and newline
I want to avoid the duplicates. But I could not figure out how. I tried with | tr -d '[:space:]', but this leads only to a concatenation of all pipe output...
Example of the file.txt
Avengers: Infinity War :movie
Yojimbo 1961 :movie nippon
Some test lines (there is a space after the first :space, you can see it if you highlight the data with your mouse):
$ cat file
with :space
with :space too
without :space
test: this
With grep, sort and uniq:
$ grep -o ":[a-z]\+" file | sort | uniq
With awk (well, gawk and mawk at least):
$ awk 'BEGIN{RS="[" FS "|" RS "]+"}/:[a-z]/&&!a[$0]++' file
Each word is its own record and we pick the first instance of every colon-starting word. RS="[" FS "|" RS "]+" could be written otherwise but it is in this form to emphasize any combination of FS and RS.
You can use Perl regexp and word matching:
grep -oP ':\w+' file.txt | sort | uniq
or, just match non-space characters:
grep -o ':[^ ]*' file.txt | sort | uniq
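The common idea in the grep answers is to match only the tag itself, so trailing whitespace never becomes part of the match. A minimal sketch of that, assuming a grep that supports -o (GNU and BSD grep do); the function name list_tags is made up, and trailing spaces were added to the first sample line to reproduce the problem.

```shell
# list_tags: print each distinct :tag found on stdin, one per line.
# Requiring at least one letter after the colon skips things like "Avengers:".
list_tags() {
    grep -oE ':[a-z]+' | sort -u
}

# The two sample lines from the question; the first has trailing spaces added.
printf '%s\n' \
    'Avengers: Infinity War :movie  ' \
    'Yojimbo 1961 :movie nippon' | list_tags
```

This prints :movie once, since both occurrences are now identical strings.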
Since you haven't provided a sample Input_file, I couldn't test this, and I don't have zsh available either. Try the following and let me know if it helps.
awk '/:[a-z]*/{sub(/ +$/,"");} !a[$0]++' Input_file | sort
You can try with sed
sed 's/.*\(:[a-z]*\).*/\1/' file.txt | sort | uniq

Find unique URLs in a file

I have many URLs in a file, and I need to find out how many unique URLs exist.
I would like to run either a bash script or a command.
Desired Result:
Unique Urls: 4
cut -d "|" -f 2 file | cut -d "," -f 1 | sort -u | wc -l
See: man cut, man sort
An awk solution would be
awk '{sub(/^[^|]*\|/,"");gsub(/,[^,]*/,"");i+=a[$0]++?0:1}END{print i}' file
If you happen to use GNU awk then below would also give you the same result
awk '{i+=a[gensub(/.*(http[^,]*).*/,"\\1",1)]++?0:1}END{print i}' file
Or even shorter, as pointed out in a comment by @cyrus:
awk -F '[|,]' '{i+=!a[$2]++} END{print i}' file
which uses awk multiple field separator functionality with more idiomatic awk.
Note: See the awk manual for more info.
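To make the field-separator version concrete, here is a small sketch. The question's sample data is not shown, so the line layout (something|URL,something) is an assumption inferred from the commands above, and both the function name and the sample lines are invented.

```shell
# count_unique_urls: count distinct values of the field between the first
# "|" and the following ",". The layout is an assumption (see lead-in).
count_unique_urls() {
    awk -F '[|,]' '!seen[$2]++ { n++ } END { print n + 0 }'
}

# Invented sample lines in the assumed layout.
printf '%s\n' \
    '1|http://example.com/a,x' \
    '2|http://example.com/b,y' \
    '3|http://example.com/a,z' | count_unique_urls    # prints 2
```

The n + 0 in the END block makes the function print 0 rather than an empty line when there is no input.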
Parse with sed and, since the file appears to be already sorted with respect to URLs, just run uniq and count the result:
echo Unique URLs: $(sed 's/^.*|\([^,]*\),.*$/\1/' file | uniq | wc -l)
Use GNU grep to extract URLs:
echo Unique URLs: $(grep -o 'ht[^|,]*' file | uniq | wc -l)
Output (either method):
Unique URLs: 4
tr , '|' < myfile.log | sort -u -t '|' -k 2,2 | wc -l
tr , '|' < myfile.log translates all commas into pipe characters
sort -u -t '|' -k 2,2 sorts unique (-u), pipe delimited (-t '|'), in the second field only (-k 2,2)
wc -l counts the unique lines

Find unique words

Suppose there is a file, file.txt, containing the text below:
I need to write a shell script that can extract the first unique word before the first /.
So as output, I want ABC and EFG to be written in one file.
You can extract the first word with cut (slash as delimiter), then pipe to sort with the -u (for "unique") option:
$ cut -d '/' -f 1 file.txt | sort -u
To get the output into a file, just redirect by appending > filename to the command. (Or pipe to tee filename to see the output and get it in a file.)
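A short sketch of the cut, sort -u and tee combination. The question's file content is not shown, so the input lines here are invented; only the names ABC and EFG come from the question. The function name first_words is hypothetical.

```shell
# first_words: print the distinct text before the first "/" on each line.
first_words() {
    cut -d '/' -f 1 | sort -u
}

# Invented input; tee shows the result and also saves it to a file.
outfile=$(mktemp)
printf '%s\n' 'ABC/one/two' 'EFG/three' 'ABC/four' | first_words | tee "$outfile"
rm -f "$outfile"    # clean up the temporary file used for the demo
```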
Try this :
cat file.txt | tr -s "/" ' ' | awk -F " " '{print $1}' | sort | uniq > outfile.txt
Another interesting variation:
awk -F'/' '{print $1 |" sort -u" }' file.txt > outfile.txt
Not that it matters here, but being able to pipe and redirect within awk can be very handy.
Another easy way:
cut -d"/" -f1 file.txt|uniq > out.txt
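Note that uniq only collapses duplicate lines that are adjacent, so the command above gives the expected result only when equal words already sit next to each other in the file. A small demonstration with invented input:

```shell
# Same three values, with the duplicates not next to each other.
printf '%s\n' ABC EFG ABC | uniq           # prints ABC, EFG, ABC
printf '%s\n' ABC EFG ABC | sort | uniq    # prints ABC, EFG
printf '%s\n' ABC EFG ABC | sort -u        # same result as sort | uniq
```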
You can use a mix of cut and sort like so:
cut -d '/' -f 1 file.txt | sort -u > newfile.txt
The cut part grabs everything before the first slash / on each line.
The sort -u part sorts that text and removes any duplicate strings, and the result is written to newfile.txt.

multi character separated sort

How can I sort !! delimited records using sort command?
for File1
expected output
sort -t \!\! -k 3 file1
sort: multi-character tab ‘!!’
why isn't it working?
Multi-character delimiters are not allowed in sort -t but you can just use:
sort -t '!' -k1 file
EDIT: If ! can be there in data itself you can use this trick:
sed 's/!!/\x06/g' file | sort -t $'\x06' -k1 | sed 's/\x06/!!/g'
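The placeholder trick can be wrapped in a function that takes the field number as an argument. This is only a sketch: the function name and sample records are invented, and it assumes the control character \006 never occurs in the data. The separator is produced with printf rather than \x06 inside sed, so it does not depend on GNU sed escapes.

```shell
# sort_bang_delimited FIELD: sort stdin by the given "!!"-delimited field.
# Assumes the control character \006 never occurs in the data.
sort_bang_delimited() {
    sep=$(printf '\006')
    sed 's/!!/'"$sep"'/g' |
        sort -t "$sep" -k "$1,$1" |
        sed 's/'"$sep"'/!!/g'
}

# Invented records; sort on the third field.
printf '%s\n' 'b!!2!!zeta' 'a!!1!!beta' 'c!!3!!alpha' | sort_bang_delimited 3
```

Here the records come out ordered by alpha, beta, zeta, with the !! delimiters restored.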
EDIT2: To do this in a single command, use awk (GNU awk, since asort is a gawk extension):
awk -F '!!' -v k=1 '{a[$k,$0]=$0}
END{asort(a, b, "@ind_num_asc"); for (i in b) print b[i]}' file
