Scripts for listing all the distinct characters in a text file - bash

For example, given a file input.txt with the following content:
He likes cats, really?
the output should be something like:
H
e
l
i
k
s
c
a
t
,
r
y
?
Note that the order of the characters in the output does not matter.

One way is to use grep -o . to put each character on its own line, and sort -u to remove duplicates:
$ grep -o . file | sort -u
Or, a solution written purely in awk that doesn't require sort -u or multiple commands:
$ awk '{for(i=1;i<=NF;i++)if(!a[$i]++)print $i}' FS="" file
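Note that splitting into one character per field with an empty FS is a gawk/mawk behavior and is not guaranteed by POSIX. If your awk doesn't support it, a roughly equivalent portable sketch uses substr():
$ awk '{for(i=1;i<=length($0);i++){c=substr($0,i,1); if(!a[c]++) print c}}' file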

How about:
echo "He likes cats, really?" | fold -w1 | sort -u

An awk way:
awk '{$1=$1}1' FS="" OFS="\n" file | sort -u

You can use sed (GNU sed, for the \n in the replacement) as follows:
sed 's/./&\n/g' input.txt | sort -u

Related

How do I remove the header in the df command?

I'm trying to write a bash command that sorts all volumes by the amount of data they have used. I tried:
df | awk '{print $1 | "sort -r -k3 -n"}'
Output:
map
devfs
Filesystem
/dev/disk1s5
/dev/disk1s2
/dev/disk1s1
But this also shows the header called Filesystem.
How do I remove that?
For your specific case, i.e. using awk, @codeforester's answer (using awk's NR, the Number of Records, variable) is the best.
In the more general case, to remove the first line of any output, you can use tail -n +N, which starts output at line N:
df | tail -n +2 | other_command
This will remove the first line in df output.
Skip the first line, like this:
df | awk 'NR>1 {print $1 | "sort -r -k3 -n"}'
I normally use one of these options, if I have no reason to use awk:
df | sed 1d
The 1d command tells sed to delete the first line and then print everything else.
df | tail -n+2
The -n +2 option tells tail to start at line 2 and print everything until end of input.
I suspect sed is faster than awk or tail, but I can't prove it.
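A rough, unscientific way to compare them yourself (the test file name is made up):
seq 1000000 > /tmp/big.txt
time sed 1d /tmp/big.txt > /dev/null
time tail -n +2 /tmp/big.txt > /dev/null
time awk 'NR>1' /tmp/big.txt > /dev/null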
EDIT
If you want to use awk, this will print every line except the first:
df | awk '{if (FNR>1) print}'
FNR is the record number within the current input file, i.e. the line number of the input. If it is greater than 1, print the input line.
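For a single input (as here) FNR and NR are the same; the difference only shows up with multiple files. A quick illustration (a.txt and b.txt are hypothetical files):
awk '{print FILENAME, FNR, NR}' a.txt b.txt
FNR restarts at 1 for b.txt, while NR keeps counting across both files.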
Count the lines in the output of df with wc, then subtract one line to output a headerless df with tail:
LINES=$(df|wc -l)
LINES=$((${LINES}-1))
df | tail -n ${LINES}
OK - I see you want a one-liner - here is mine:
DF_HEADERLESS=$(LINES=$(df|wc -l); LINES=$((${LINES}-1));df | tail -n ${LINES})
And for formatted output, let printf loop over it:
printf "%s\t%s\t%s\t%s\t%s\t%s\n" ${DF_HEADERLESS} | awk '{print $1 | "sort -r -k3 -n"}'
This might help with GNU df and GNU sort:
df -P | awk 'NR>1{$1=$1; print}' | sort -r -k3 -n | awk '{print $1}'
With GNU df and GNU awk:
df -P | awk 'NR>1{array[$3]=$1} END{PROCINFO["sorted_in"]="@ind_num_desc"; for(i in array){print array[i]}}'
Documentation: 8.1.6 Using Predefined Array Scanning Orders with gawk
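A tiny standalone illustration of that scanning order (gawk only; the data is made up):
gawk 'BEGIN{a[3]="c"; a[10]="j"; a[1]="a"; PROCINFO["sorted_in"]="@ind_num_desc"; for(i in a) print i, a[i]}'
This prints the indices 10, 3, 1 in numerically descending order.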
Removing a line from a command's output can be done very simply with grep -v, so in your case:
df | grep -v "Filesystem" | ...
(You can do your awk at the ...)
When you're not sure about upper/lower case, you can add -i:
df | grep -i -v "FiLeSyStEm" | ...
(The mixed case here is only meant to illustrate that -i ignores case :-) )

remove whitespace from piped output

In a text file I have some tags with the notation :foo. To get an overview of my tags in the file, I want to get a listing of all these tags.
This is done via
grep -o -e ":[a-z]*\( \|$\)" file.txt | sort | uniq
Now I get duplicates because of the whitespace or newline character at the end.
:movie <-- only newline
:movie <-- whitespace and newline
:read
:read
I want to avoid the duplicates, but I could not figure out how. I tried | tr -d '[:space:]', but that only concatenates the entire piped output...
Example of the file.txt
Avengers: Infinity War :movie
Yojimbo 1961 :movie nippon
Some test lines (there is a space after the first :space, you can see it if you highlight the data with your mouse):
$ cat file
with :space
with :space too
without :space
test: this
With grep, sort and uniq:
$ grep -o ":[a-z]\+" file | sort | uniq
:space
With awk (well, gawk and mawk at least):
$ awk 'BEGIN{RS="[" FS "|" RS "]+"}/:[a-z]/&&!a[$0]++' file
:space
Each word is its own record, and we pick the first instance of every word that starts with a colon. RS="[" FS "|" RS "]+" could be written differently, but this form emphasizes that records are separated by any combination of FS and RS.
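With the default FS (a single space) and RS (a newline), the constructed RS is the literal regex "[ |\n]+", so records are separated by runs of spaces and newlines (the "|" inside the brackets is a literal pipe, not alternation). Written out, the command above is roughly equivalent to:
$ awk 'BEGIN{RS="[ |\n]+"}/:[a-z]/&&!a[$0]++' file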
You can use Perl regexp and word matching:
grep -oP ':\w+' file.txt | sort | uniq
or, just match non-space characters:
grep -o ':[^ ]*' file.txt | sort | uniq
I couldn't test this fully (I don't have zsh here), but try the following and let me know if it helps:
awk '/:[a-z]*/{sub(/ +$/,"");} !a[$0]++' Input_file | sort
You can try with sed
sed 's/.*\(:[a-z]*\).*/\1/' file.txt | sort | uniq

Find unique URLs in a file

Situation
I have many URLs in a file, and I need to find out how many unique URLs exist.
I would like to run either a bash script or a command.
myfile.log
/home/myfiles/www/wp-content/als/xm-sf0ab5df9c1262f2130a9b313192deca4-f0ab5df9c1262f2130a9b313192deca4-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,18,17
/home/myfiles/www/wp-content/als/xm-s4bf050d47df5bfaf0486a50a8528cb16-4bf050d47df5bfaf0486a50a8528cb16-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,15,14
/home/myfiles/www/wp-content/als/xm-sad122bf22152ba4823a520cc2fe59f40-ad122bf22152ba4823a520cc2fe59f40-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,17,16
/home/myfiles/www/wp-content/als/xm-s3c0f031eebceb0fd5c4334ecef15292d-3c0f031eebceb0fd5c4334ecef15292d-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,12,11
/home/myfiles/www/wp-content/als/xm-sff661e8c3b4f94957926d5434d0ad549-ff661e8c3b4f94957926d5434d0ad549-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,17,16
/home/myfiles/www/wp-content/als/xm-s32c41ec2a5440ad220008b9abfe9add2-32c41ec2a5440ad220008b9abfe9add2-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s28787ca2f4372ddb3616d3fd53c161ab-28787ca2f4372ddb3616d3fd53c161ab-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,22,21
/home/myfiles/www/wp-content/als/xm-s89a7b68158e38391da9f0de1e636c0d5-89a7b68158e38391da9f0de1e636c0d5-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,13,12
/home/myfiles/www/wp-content/als/xm-sc4b14e10f6151995f21334061ff1d139-c4b14e10f6151995f21334061ff1d139-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,13,12
/home/myfiles/www/wp-content/als/xm-se589d47d163e43fa0c0d68e824e2c286-e589d47d163e43fa0c0d68e824e2c286-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s52f897a623c539d09bfb988bfb153888-52f897a623c539d09bfb988bfb153888-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,14,13
/home/myfiles/www/wp-content/als/xm-sccf27a904c5b88e96a3522b2e1180fed-ccf27a904c5b88e96a3522b2e1180fed-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,18,17
/home/myfiles/www/wp-content/als/xm-s6874bf9d589708764dab754e5af06ddf-6874bf9d589708764dab754e5af06ddf-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,17,16
/home/myfiles/www/wp-content/als/xm-s46c55ec8387dbdedd7a83b3ad541cdc1-46c55ec8387dbdedd7a83b3ad541cdc1-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s08cfdc15f5935b947bbaa93c7193d496-08cfdc15f5935b947bbaa93c7193d496-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,9,8
/home/myfiles/www/wp-content/als/xm-s86e267bd359c12de262c0279cee0c941-86e267bd359c12de262c0279cee0c941-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,15,14
/home/myfiles/www/wp-content/als/xm-s5aa60354d134b87842918d760ec8bc30-5aa60354d134b87842918d760ec8bc30-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,14,13
Desired Result:
Unique Urls: 4
cut -d "|" -f 2 file | cut -d "," -f 1 | sort -u | wc -l
Output:
4
See: man cut, man sort
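Broken down, each stage of the pipeline does one thing (illustrative; file here is the question's myfile.log):
cut -d "|" -f 2 file                                        # keep everything after the first |
cut -d "|" -f 2 file | cut -d "," -f 1                      # keep only the URL, dropping the trailing ,N,N
cut -d "|" -f 2 file | cut -d "," -f 1 | sort -u | wc -l    # deduplicate and count -> 4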
An awk solution would be
awk '{sub(/^[^|]*\|/,"");gsub(/,[^,]*/,"");i+=a[$0]++?0:1}END{print i}' file
4
If you happen to use GNU awk, then the following would also give you the same result:
awk '{i+=a[gensub(/.*(http[^,]*).*/,"\\1",1)]++?0:1}END{print i}' file
4
Or even shorter, as pointed out in this excellent comment by @Cyrus:
awk -F '[|,]' '{i+=!a[$2]++} END{print i}' file
4
which uses awk's multiple-field-separator functionality and is more idiomatic awk.
Note: See the awk manual for more info.
Parse with sed, and since the file appears to be already sorted (with respect to URLs), just run uniq and count:
echo Unique URLs: $(sed 's/^.*|\([^,]*\),.*$/\1/' file | uniq | wc -l)
Use GNU grep to extract URLs:
echo Unique URLs: $(grep -o 'ht[^|,]*' file | uniq | wc -l)
Output (either method):
Unique URLs: 4
tr , '|' < myfile.log | sort -u -t '|' -k 2,2 | wc -l
tr , '|' < myfile.log translates all commas into pipe characters
sort -u -t '|' -k 2,2 sorts unique (-u), pipe delimited (-t '|'), in the second field only (-k 2,2)
wc -l counts the unique lines

Find unique words

Suppose there is a file file.txt with the following content:
ABC/xyz
ABC/xyz/rst
EFG/ghi
I need to write a shell script that extracts the unique words that appear before the first /.
So as output, I want ABC and EFG written to one file.
You can extract the first word with cut (slash as delimiter), then pipe to sort with the -u (for "unique") option:
$ cut -d '/' -f 1 file.txt | sort -u
ABC
EFG
To get the output into a file, just redirect by appending > filename to the command. (Or pipe to tee filename to see the output and get it in a file.)
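For example, with tee (the output file name is just an example):
cut -d '/' -f 1 file.txt | sort -u | tee outfile.txt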
Try this :
cat file.txt | tr -s "/" ' ' | awk -F " " '{print $1}' | sort | uniq > outfile.txt
Another interesting variation:
awk -F'/' '{print $1 | "sort -u"}' file.txt > outfile.txt
Not that it matters here, but being able to pipe and redirect within awk can be very handy.
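For instance, a hypothetical sketch that uses redirection inside awk to write each prefix's lines to its own file (for the sample input this would create ABC.txt and EFG.txt):
awk -F'/' '{print $0 > ($1 ".txt")}' file.txt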
Another easy way:
cut -d "/" -f1 file.txt | uniq > out.txt
(Note that uniq only removes adjacent duplicates, so this relies on equal prefixes being grouped together, as in the sample input; otherwise sort first.)
You can use a mix of cut and sort like so:
cut -d '/' -f 1 file.txt | sort -u > newfile.txt
cut grabs everything before the first slash /; sort -u then sorts the result, removing any duplicates, and the output is redirected into newfile.txt.

multi character separated sort

How can I sort !!-delimited records using the sort command?
for File1
1!!2!!3
2!3!!3!!1
3!!2!!2
expected output
2!3!!3!!1
3!!2!!2
1!!2!!3
I then tried:
sort -t \!\! -k 3 file1
RESULT:
sort: multi-character tab ‘!!’
why isn't it working?
Multi-character delimiters are not allowed with sort -t, but you can just use:
sort -t '!' -k1 file
1!!b!!c
2!!f!!w
4!!e!!e
EDIT: If a single ! can appear in the data itself, you can use this trick:
sed 's/!!/\x06/g' file | sort -t $'\x06' -k1 | sed 's/\x06/!!/g'
1!!b!!c
2!!f!!w
4!!e!!e
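An untested sketch applying the same substitution trick to the question's file1, sorting on the third !!-delimited field:
sed 's/!!/\x06/g' file1 | sort -t $'\x06' -k3 -n | sed 's/\x06/!!/g'
which should give the expected output shown in the question.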
EDIT2: To do this in a single command, use awk (gawk, for asort):
awk -F '!!' -v k=1 '{a[$k,$0]=$0}
END{asort(a, b, "@ind_num_asc"); for (i in b) print b[i]}' file
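An untested sketch adapting the same idea to the question's file1 and key field 3, collecting asort()'s return value so the iteration order is explicit:
awk -F '!!' -v k=3 '{a[$k,$0]=$0}
END{n=asort(a, b, "@ind_num_asc"); for (i=1; i<=n; i++) print b[i]}' file1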
