Counting letters in a file in a shell script

I need a shell script or PowerShell script that counts how many times each letter occurs in a file.
Input:
this is the sample of this script.
This script counts similar letters.
Output:
t 9
h 4
i 8
s 10
e 4
a 2
...

In PowerShell, you can do it with the Group-Object cmdlet:
function Count-Letter {
    param(
        [String]$Path,
        [Switch]$IncludeWhitespace,
        [Switch]$CaseSensitive
    )

    # Read the file, convert it to a char array, and pipe it to Group-Object
    # Convert the input string to lowercase if CaseSensitive is not specified
    $CharacterGroups = if($CaseSensitive){
        (Get-Content $Path -Raw).ToCharArray() | Group-Object -NoElement
    } else {
        (Get-Content $Path -Raw).ToLower().ToCharArray() | Group-Object -NoElement
    }

    # Remove any whitespace character group if the IncludeWhitespace parameter is not bound
    if(-not $IncludeWhitespace){
        $CharacterGroups = $CharacterGroups | Where-Object { "$($_.Name)" -match "\S" }
    }

    # Return the groups, letter first and count second, in a default format-table
    $CharacterGroups | Select-Object @{Name="Letter";Expression={$_.Name}},Count
}
This is what the output looks like on my machine with your sample input plus a line break: a two-column Letter/Count table with the same counts as in your expected output (t 9, h 4, i 8, s 10, e 4, a 2, ...).

This one-liner should do it:
awk 'BEGIN{FS=""}{for(i=1;i<=NF;i++)if(tolower($i)~/[a-z]/)a[tolower($i)]++}
END{for(x in a)print x, a[x]}' file
output for your example:
u 1
h 4
i 8
l 3
m 2
n 1
a 2
o 2
c 3
p 3
r 4
e 4
f 1
s 10
t 9
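The order of the for (x in a) loop is unspecified, so the letters come out in arbitrary order. To list the most frequent letters first, pipe the same command through a numeric sort on the second column, for example:
awk 'BEGIN{FS=""}{for(i=1;i<=NF;i++)if(tolower($i)~/[a-z]/)a[tolower($i)]++}
END{for(x in a)print x, a[x]}' file | sort -k2,2nr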

A PowerShell one-liner:
"this is the sample of this script".ToCharArray() | group -NoElement | sort Count -Descending | where Name -NE ' '

echo "this is the sample of this script. \
This script counts similar letters." | \
grep -o '.' | sort | uniq -c | sort -rg
Output, sorted, most frequent letters first:
10 s
10
8 t
8 i
4 r
4 h
4 e
3 p
3 l
3 c
2 o
2 m
2 a
2 .
1 u
1 T
1 n
1 f
Notes: no sed or awk needed; a simple grep -o '.' does all the heavy lifting. To avoid counting spaces and punctuation, replace '.' with '[[:alpha:]]':
echo "this is the sample of this script. \
This script counts similar letters." | \
grep -o '[[:alpha:]]' | sort | uniq -c | sort -rg
To count capital and lower case letters as one, use the --ignore-case option of sort and uniq:
echo "this is the sample of this script. \
This script counts similar letters." | \
grep -o '[[:alpha:]]' | sort -i | uniq -ic | sort -rg
Output:
10 s
9 t
8 i
4 r
4 h
4 e
3 p
3 l
3 c
2 o
2 m
2 a
1 u
1 n
1 f

echo "this is the sample of this script" | \
sed -e 's/ //g' -e 's/\([A-Za-z]\)/\1|/g' | tr '|' '\n' | \
sort | grep -v "^$" | uniq -c | \
awk '{printf "%s %s\n",$2,$1}'

Related

bash looping and extracting of the fragment of txt file

I am dealing with the analysis of a large number of dlg text files located in the working directory. Each file has a table (usually located at a different position in the log) in the following format:
File 1:
CLUSTERING HISTOGRAM
____________________
________________________________________________________________________________
| | | | |
Clus | Lowest | Run | Mean | Num | Histogram
-ter | Binding | | Binding | in |
Rank | Energy | | Energy | Clus| 5 10 15 20 25 30 35
_____|___________|_____|___________|_____|____:____|____:____|____:____|____:___
1 | -5.78 | 11 | -5.78 | 1 |#
2 | -5.53 | 13 | -5.53 | 1 |#
3 | -5.47 | 17 | -5.44 | 2 |##
4 | -5.43 | 20 | -5.43 | 1 |#
5 | -5.26 | 19 | -5.26 | 1 |#
6 | -5.24 | 3 | -5.24 | 1 |#
7 | -5.19 | 4 | -5.19 | 1 |#
8 | -5.14 | 16 | -5.14 | 1 |#
9 | -5.11 | 9 | -5.11 | 1 |#
10 | -5.07 | 1 | -5.07 | 1 |#
11 | -5.05 | 14 | -5.05 | 1 |#
12 | -4.99 | 12 | -4.99 | 1 |#
13 | -4.95 | 8 | -4.95 | 1 |#
14 | -4.93 | 2 | -4.93 | 1 |#
15 | -4.90 | 10 | -4.90 | 1 |#
16 | -4.83 | 15 | -4.83 | 1 |#
17 | -4.82 | 6 | -4.82 | 1 |#
18 | -4.43 | 5 | -4.43 | 1 |#
19 | -4.26 | 7 | -4.26 | 1 |#
_____|___________|_____|___________|_____|______________________________________
The aim is to loop over all the dlg files and take, from each table, the single line corresponding to the widest cluster (the one with the largest number of # characters in the Histogram column). In the above example this is the third line:
3 | -5.47 | 17 | -5.44 | 2 |##
Then I need to add this line to final_log.txt together with the name of the log file (which should be placed before the line). So in the end I should have something in the following format (for 3 different log files):
"Name of the file 1": 3 | -5.47 | 17 | -5.44 | 2 |##
"Name_of_the_file_2": 1 | -5.99 | 13 | -5.98 | 16 |################
"Name_of_the_file_3": 2 | -4.78 | 19 | -4.44 | 3 |###
A possible model of my BASH workflow would be:
#!/bin/bash
for f in ./*.dlg
do
    file_name2=$(basename "$f")
    file_name="${file_name2/.dlg}"
    echo "Processing of $f..."
    # take the name of the file and save it in the log
    echo "$file_name" >> $PWD/final_results.log
    # search for the beginning of the table inside each file and save it after the name
    cat $f | grep 'CLUSTERING HISTOGRAM' >> $PWD/final_results.log
    # check whether it works
    gedit $PWD/final_results.log
done
Here I need to replace the combination of echo and grep with something that grabs the selected line of the table.
You can use this one; it should be fast enough, and extra lines in your files besides the tables should not be a problem:
grep "#$" *.dlg | sort -rk11 | awk '!seen[$1]++'
grep fetches all the histogram lines, sort orders them in reverse by the last field so that the lines with the most # characters come first, and finally awk keeps only the first line seen per file, removing the duplicates. Note that when grep parses more than one file it prints the filename at the beginning of each line by default (the -H behaviour), so if you test this on a single file, add -H explicitly, as in the example after the sample result below.
Result should be like this:
file1.dlg: 3 | -5.47 | 17 | -5.44 | 2 |##########
file2.dlg: 3 | -5.47 | 17 | -5.44 | 2 |####
file3.dlg: 3 | -5.47 | 17 | -5.44 | 2 |#######
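To test this on a single file, add -H explicitly so the filename prefix is still printed, for example (file1.dlg stands for whichever file you pick):
grep -H "#$" file1.dlg | sort -rk11 | awk '!seen[$1]++'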
Here is a modification to get the first appearance in case there are several equally wide max lines in a file:
grep "#$" *.dlg | sort -k11 | tac | awk '!seen[$1]++'
We replaced the -r (reverse) flag of sort with the tac command, which reverses the stream, so now for any equal lines the initial order is preserved.
Second solution
Here using only awk:
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
END {for (i in row) print i ":" row[i]}' *.dlg
Update: if you execute it from a different directory and want to keep only the basename of every file, remove the path prefix like this:
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
END {for (i in row) {sub(".*/","",i); print i ":" row[i]}}'
Probably makes more sense as an Awk script.
This picks the first line with the widest histogram in the case of a tie within an input file.
#!/bin/bash
awk 'FNR == 1 { if(sel) print sel; sel = ""; max = 0 }
FNR < 9 { next }
length($10) > max { max = length($10); sel = FILENAME ":" $0 }
END { if (sel) print sel }' ./"$prot"/*.dlg
This assumes the histograms are always the tenth field; if your input format is even messier than the lump you show, maybe adapt to taste.
In some more detail, the first rule triggers on the first line of each input file. If we have collected a line from a previous file (meaning this is not the first input file), print it and start over; either way, reset sel to nothing and max to zero.
The second rule skips lines 1-8, which contain the header.
The third rule checks whether the current line's histogram is longer than max. If it is, update max to this histogram's length and remember the current line in sel.
The END rule is spillover for when we have processed all the files: we never printed the sel from the last file, so print that too, if it is set.
If you mean to say we should find the lines between CLUSTERING HISTOGRAM and the end of the table, we should probably have more information about what the surrounding lines look like. Maybe something like this, though:
awk '/CLUSTERING HISTOGRAM/ { if (sel) print sel; looking = 1; sel = ""; max = 0 }
!looking { next }
looking > 1 && $1 != looking { looking = 0; nextfile }
$1 == looking && length($10) > max { max = length($10); sel = FILENAME ":" $0 }
$1 == looking { ++looking }
END { if (sel) print sel }' ./"$prot"/*.dlg
This sets looking to 1 when we see CLUSTERING HISTOGRAM, then counts it up through the cluster ranks, and resets it at the first line whose first field no longer matches the expected rank (i.e. the end of the table).
I would suggest processing using awk:
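# FILES is assumed to hold the list of dlg files to process, e.g. FILES="./*.dlg" (left unquoted in the loop so the glob expands)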
for i in $FILES
do
    echo -n "\"$i\": "
    awk 'BEGIN {
        output="";
        outputlength=0
    }
    /(^ *[0-9]+)/ {     # process only lines that start with a number
        if (length(substr($10, 2)) > outputlength) {    # if the line has more hashes, store it
            output=$0;
            outputlength=length(substr($10, 2))
        }
    }
    END {
        print output    # output the resulting line
    }' "$i"
done

Finding all punctuation in a text file & print count

I have come close to counting all occurrences of punctuation; however, punctuation characters that are right next to each other get counted as a single token.
Like so:
cat filename.txt |
tr -sc '[:punct:]' '\n' |
sort |
uniq -c |
sort -bnr
Which prints something like this:
15 ,
9 !
5 .
2 ;
2 !"
2 '
1 -
1 --
1 :
1 ?
It is clearly only counting punctuation, but how would I separate those that are right next to each other?
This:
tr -sc '[:punct:]' '\n'
Basically, what this does is replace every run of non-punctuation characters with a newline. So when there is no such character between two punctuation characters, they stay next to each other on the same line.
You want something like this:
cat filename.txt | tr -cd '[:punct:]' | fold -w 1 | sort | uniq -c | sort -bnr
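Alternatively, the grep -o approach shown earlier for the letter counts works for punctuation as well; a sketch along the same lines:
grep -o '[[:punct:]]' filename.txt | sort | uniq -c | sort -bnr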

Custom Sort Multiple Files

I have 10 files (1Gb each). The contents of the files are as follows:
head -10 part-r-00000
a a a c b 1
a a a dumbbell 1
a a a f a 1
a a a general i 2
a a a glory 2
a a a h d 1
a a a h o 4
a a a h z 1
a a a hem hem 1
a a a k 3
I need to sort the file based on the last column of each line (in descending numerical order); the last column sits at a variable position because the lines have a variable number of fields. If there is a tie on the numerical value, then sort alphabetically by the second-to-last column. The following Bash command works on small datasets (not complete files) and takes 3 seconds to sort only 10 lines from one file.
cat part-r-00000 | awk '{print $NF,$0}' | sort -nr | cut -f2- -d' ' > FILE
I want the output in a separate FILE. Can someone help me out to speed up the process?
No, once you get rid of the UUOC (useless use of cat), that's as fast as it's going to get. Obviously you need to prepend the second-to-last field too, e.g. something like:
awk '{print $NF,$(NF-1),$0}' part-r-00000 | sort -k1,1nr -k2,2 | cut -f3- -d' '
Check the sort args: -k1,1nr sorts numerically and in descending order on the prepended last field, and -k2,2 breaks ties alphabetically on the prepended second-to-last field (I always get mixed up with those).
Reverse the field order, sort, then reverse the field order back:
awk '{for (i=NF;i>0;i--){printf "%s ",$i};printf "\n"}' file | sort -nr | awk '{for (i=NF;i>0;i--){printf "%s ",$i};printf "\n"}'
Output:
a a a h o 4
a a a k 3
a a a general i 2
a a a glory 2
a a a h z 1
a a a hem hem 1
a a a dumbbell 1
a a a h d 1
a a a c b 1
a a a f a 1
You can use a Schwartzian transform to accomplish your task,
awk '{print -$NF, $(NF-1), $0}' input_file | sort -n | cut -d' ' -f3-
The awk command prepends each record with the negative of the last field and with the second-to-last field.
The sort -n command sorts the record stream into the required order, because we used the negative of the last field.
The cut command splits on spaces and drops the first two fields, i.e., the ones we prepended to drive the sort.
Example
$ echo 'a a a c b 1
a a a dumbbell 1
a a a f a 1
a a a general i 2
a a a glory 2
a a a h d 1
a a a h o 4
a a a h z 1
a a a hem hem 1
a a a k 3' | awk '{print -$NF, $(NF-1), $0}' | sort -n | cut -d' ' -f3-
a a a h o 4
a a a k 3
a a a glory 2
a a a general i 2
a a a f a 1
a a a c b 1
a a a h d 1
a a a dumbbell 1
a a a hem hem 1
a a a h z 1
$

Sum of Columns for multiple variables

Using a shell script (Bash), I am trying to sum the columns for all the different names in a list. Suppose I have the following input in a Test.tsv file:
Win Lost
Anna 1 1
Charlotte 3 1
Lauren 5 5
Lauren 6 3
Charlotte 3 2
Charlotte 4 5
Charlotte 2 5
Anna 6 4
Charlotte 2 3
Lauren 3 6
Anna 1 2
Anna 6 2
Lauren 2 1
Lauren 5 5
Lauren 6 6
Charlotte 1 3
Anna 1 4
And I want to sum up how much each of the participants has won and lost, so I want to get this as a result:
Sum Win Sum Lost
Anna 57 58
Charlotte 56 57
Lauren 53 56
What I would usually do is take the sum per person and per column and repeat that process over and over. See below how I would do it for the example mentioned:
cat Test.tsv | grep -Pi '\bAnna\b' | cut -f2 -d$'\t' |paste -sd+ | bc > Output.tsv
cat Test.tsv | grep -Pi '\bCharlotte\b' | cut -f2 -d$'\t' |paste -sd+ | bc >> Output.tsv
cat Test.tsv | grep -Pi '\bLauren\b' | cut -f2 -d$'\t' |paste -sd+ | bc >> Output.tsv
cat Test.tsv | grep -Pi '\bAnna\b' | cut -f3 -d$'\t' |paste -sd+ | bc > Output.tsv
cat Test.tsv | grep -Pi '\bCharlotte\b' | cut -f3 -d$'\t' |paste -sd+ | bc >> Output.tsv
cat Test.tsv | grep -Pi '\bLauren\b' | cut -f3 -d$'\t' |paste -sd+ | bc >> Output.tsv
However, I would need to repeat these lines for every participant, which becomes a pain when there are too many names to sum up.
What would be a better way to write this script?
Thanks!
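As an aside, if you want to keep the grep/paste/bc approach from the question, a loop over the names avoids the copy-and-paste. This is only a sketch, reusing the question's own commands and file names (Test.tsv and Output.tsv):
for name in $(tail -n +2 Test.tsv | cut -f1 | sort -u); do
    win=$(grep -Pi "\b$name\b" Test.tsv | cut -f2 -d$'\t' | paste -sd+ | bc)
    lost=$(grep -Pi "\b$name\b" Test.tsv | cut -f3 -d$'\t' | paste -sd+ | bc)
    printf '%s\t%s\t%s\n' "$name" "$win" "$lost"
done > Output.tsv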
This is pretty straightforward with awk. Using GNU awk:
awk -F '\t' 'BEGIN { OFS = FS } NR > 1 { won[$1] += $2; lost[$1] += $3 } END { PROCINFO["sorted_in"] = "#ind_str_asc"; print "", "Sum Win", "Sum Lost"; for(p in won) print p, won[p], lost[p] }' filename
-F '\t' makes awk split lines at tabs, then:
BEGIN { OFS = FS } # the output should be separated the same way as the input
NR > 1 { # From the second line forward (skip header)
won[$1] += $2 # tally up totals
lost[$1] += $3
}
END { # When done, print the lot.
# GNU-specific: sorted traversal of player names
PROCINFO["sorted_in"] = "#ind_str_asc"
print "", "Sum Win", "Sum Lost"
for(p in won) print p, won[p], lost[p]
}
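If you are not on GNU awk, the PROCINFO["sorted_in"] trick is not available. A portable sketch under that assumption is to print the header separately and sort the data rows with sort(1) (summary.tsv is just a hypothetical output name):
{
  printf '\tSum Win\tSum Lost\n'
  awk -F '\t' -v OFS='\t' 'NR > 1 { won[$1] += $2; lost[$1] += $3 }
      END { for (p in won) print p, won[p], lost[p] }' Test.tsv | sort
} > summary.tsv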

Merge multiple lines of table with same name

I have a tab-delimited table whose format I want to change as shown below.
Initially the file was like this:
Species Column1 Column2 Column3
A 3
B 1
C 7
D 1
A 8
D 4
B 2
C 5
A 9
What I want is:
Species Column1 Column2 Column3
A 3 8 9
B 1 2
C 7 5
D 1 4
Currently I have this:
Species Column1 Column2 Column3
A 3
A 8
A 9
B 1
B 2
C 7
C 5
D 1
D 4
I used sort to get the bottom table but am unsure how to then combine the rows. Does anyone know how?
use this script:
#!/bin/bash
cols=4
nums=$(seq $cols)
files=$(printf "f%s " $nums)
for i in $nums
do
    if [ $i = 1 ]; then
        tail -n +2 $1 | cut -f"$i" | grep '^.' | cut -d' ' -f1 | sort -u > f"$i"
    else
        tail -n +2 $1 | cut -f"$i" | grep '^.' | cut -d' ' -f1 > f"$i"
    fi
done
head -n1 $1
paste $files
rm -rf $files
The output is:
$ ./script file
Species Column1 Column2 Column3
A 3 8 9
B 1 4
C 7 2
D 1 5
Assuming columns are separated by tabs and there are no headers, here's the script:
awk -F "\t" '$2' file.txt | sort > col2.txt
awk -F "\t" '$3' file.txt | sort > col3.txt
awk -F "\t" '$4' file.txt | sort > col4.txt
join -a1 -a2 col2.txt col3.txt | join -a1 -a2 - col4.txt
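Another option is a single awk pass over the original (unsorted) file. This is only a sketch, under the assumption that each data row has the species in the first tab-separated column and its value in whichever of the remaining columns is non-empty (merged.tsv is a hypothetical output name):
{
  head -n 1 file
  tail -n +2 file | awk -F '\t' -v OFS='\t' '
      { for (i = 2; i <= NF; i++) if ($i != "") row[$1] = row[$1] OFS $i }
      END { for (s in row) print s row[s] }' | sort
} > merged.tsv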
