Bash script frequency analysis of unique letters and repeating letter pairs how should i build this script? - bash

OK, first post.
I have an assignment to decrypt cryptograms by hand, but I also wanted to automate at least parts of the process, so I browsed around and found some sed and awk one-liners that do some of what I want, but not everything I need.
There are some websites that sort of do what I want, but I really want to do it in bash, just because I want to understand it better :)
The script would take a filename as a parameter and write its output to another file such as solution$1 when done.
if [ -e "$PWD/$1" ]; then
echo "$1 exists"
else
echo "$1 does not exist"
fi
That would start the script by checking whether the file given as parameter exists.
Then I found this one-liner:
sed -e "s/./\0\n/g" "$1" | while read c; do echo -n "$c"; done
It works fine, but I need the number of occurrences of each letter, and I really don't see how to do that.
Here is more or less what I'm trying to achieve, for counting unique letter occurrences and such: http://25yearsofprogramming.com/fun/ciphers.htm
I then need to convert all letters to lowercase.
After this, I see the script doing these things:
- A subscript that scans a dictionary file for words of a certain pattern and size; the bigger the words, the better.
For example: let's say the solution is the word "apparel" and the encrypted word is "zxxzgvk".
Is there a regex way to express the pattern that compares those two words and lists the word "apparel" from a dictionary file, because "appa" and "zxxz" have the same pattern and "zxxzgvk" has the same length as "apparel"?
Can this part be done, and is it realistic to view the problem like this, or is this just far-fetched?
- Another subscript that takes the letters found in the previous step's output word and swaps them into the cryptogram.
The swapped letters will be in uppercase to tell them apart over time.
I'll then have to figure out how to proceed, maybe rescanning the newly found words to see whether they appear in a dictionary file, partly or fully, and then swapping more letters or not.
Has anyone seen this problem before and tried to solve it with word patterns like I described, or is this just too complex?
Should I log any of the swaps?
Maybe I could just scan through all the encrypted words and swap as I go along, then do another sweep, with the constraint in the first sweep of not changing uppercase letters (actually using them as more precise patterns!).
Has anyone written a similar script or program in another language? If so, which one? Maybe I can relate somehow :)
Maybe we can use your insight into how you thought out your code.
I will happily include the cryptograms I have decoded and the ones I have yet to decode :)
Again, the focus of my assignment is not to write this script but just to solve the cryptograms. But writing scripts, or at least trying to see how I would write this one, helps me understand a little more how to think in terms of code. Feel free to point me in the right direction!
The cryptogram itself is based on simple alphabetic substitution.
I have done a pastebin here with the code to be :) http://pastebin.com/UEQDsbPk
In pseudocode, the way I see it is:
call the program with an input filename as parameter and optionally a second filename (dictionary)
verify the input file exists and isn't empty
read the file's content and echo it on screen
transform to lowercase
scan through the text and count the occurrences of each letter to do a frequency analysis
ask the user what language the text is supposed to be in (English by default)
use the response to choose which letter frequencies to use as a baseline
swap in, in uppercase, the letters suggested by the frequency analysis
print the changed document on screen
ask the user to swap letters in the encrypted text
if the user gave a dictionary file as the second argument
    then scan the cipher for words and find the bigger words
    find words with a similar pattern (some repeating letters) in the dictionary file
    list the results on screen, if any
    offer to swap the corresponding letters in the cipher
    print the modified cipher on screen
    ask again to swap letters or find more similar words
That is more or less the way I see the script structured.
Do you see anything that I should add? Did I miss something?
I hope this revised version is clearer for everyone!
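The dictionary lookup by letter pattern asked about above ("zxxzgvk" against "apparel") can be sketched in bash without a regex, by reducing every word to the order in which its letters first appear. This is only a sketch under assumptions: the function names `word_pattern` and `match_pattern` are made up for illustration, words are assumed to be plain lowercase letters, and the dictionary is assumed to hold one word per line.

```bash
#!/usr/bin/env bash

# word_pattern WORD (hypothetical helper): print WORD with each distinct
# letter replaced by a, b, c, ... in order of first appearance,
# so "apparel" and "zxxzgvk" both become "abbacde".
word_pattern() {
    local word=$1 seen="" out="" ch i prefix
    local letters=abcdefghijklmnopqrstuvwxyz
    for ((i = 0; i < ${#word}; i++)); do
        ch=${word:i:1}
        # Remember each letter the first time it shows up.
        if [[ $seen != *"$ch"* ]]; then
            seen+=$ch
        fi
        # The position of $ch inside $seen is the length of what precedes it.
        prefix=${seen%%"$ch"*}
        out+=${letters:${#prefix}:1}
    done
    printf '%s\n' "$out"
}

# match_pattern CIPHERWORD DICTFILE (hypothetical helper): list the words
# in DICTFILE whose letter pattern equals that of CIPHERWORD.
match_pattern() {
    local target w
    target=$(word_pattern "$1")
    while IFS= read -r w; do
        # The cheap length test comes first so that most words never
        # reach the slower pattern computation.
        if [[ ${#w} -eq ${#1} && $(word_pattern "$w") == "$target" ]]; then
            printf '%s\n' "$w"
        fi
    done < "$2"
    return 0
}
```

Calling `match_pattern zxxzgvk dictionary.txt` would list "apparel" if the file contains it. Each comparison starts a subshell, so this will be slow on a large dictionary; precomputing the patterns once would be the obvious next step.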

TL;DR, to be frank. To the only question I found, the answer is yes :) Please split it into smaller tasks and we'll be happy to assist you, if you don't find the answers to those smaller questions first.
If you can put it in pseudocode, it would be easier. There's all kinds of text-manipulating stuff in Unix. The means to employ depend on how big your texts are. I believe they are not so big, or you would have used some compiled language.
For example, the easy but costly gawk way to count frequencies:
awk -F "" '{for(i=1;i<=NF;i++) freq[$i]++;}END{for(i in freq) printf("%c %d\n", i, freq[i]);}'
As for transliterating, there is the tr utility. You can build the actual strings for each case and then pass them to it (that holds for Caesar-like ciphers).
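As a minimal sketch of the tr idea, the function below replaces guessed cipher letters with uppercase plaintext letters, which matches the question's plan of marking solved letters in uppercase. The name `apply_guess` is hypothetical, and both letter strings are assumed to contain plain letters only (no `-` or `[`, which tr treats specially).

```bash
#!/usr/bin/env bash

# apply_guess TEXT FROM TO (hypothetical helper): replace every cipher
# letter listed in FROM with the letter at the same position in TO.
apply_guess() {
    printf '%s\n' "$1" | tr "$2" "$3"
}
```

For example, `apply_guess zxxzgvk zx AP` prints `APPAgvk`: the two guessed letters are now uppercase and the unsolved ones stay lowercase.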

grep -o . inputfile | sort | uniq -c | sort -rn
Example:
$ echo 'aAAbbbBBBB123AB' | grep -o . | sort | uniq -c | sort -rn
5 B
3 b
3 A
1 a
1 3
1 2
1 1
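The same pipeline can be adapted to what the question asks for: letters only, case folded, plus a count of doubled letters. This is a sketch, not a tuned tool; the function names are made up, input is read from standard input, and the doubled-letter search relies on `grep -o` with a back-reference, which GNU and BSD grep accept but POSIX does not require.

```bash
#!/usr/bin/env bash

# letter_freq (hypothetical helper): count each letter, ignoring case and
# skipping digits, spaces and punctuation.
letter_freq() {
    tr '[:upper:]' '[:lower:]' |
        grep -o '[[:alpha:]]' |
        sort | uniq -c | sort -rn
}

# double_letters (hypothetical helper): count pairs of identical adjacent
# letters such as "pp" in "apparel". Matches do not overlap, so "bbb"
# counts as one "bb".
double_letters() {
    tr '[:upper:]' '[:lower:]' |
        grep -o '\([[:alpha:]]\)\1' |
        sort | uniq -c | sort -rn
}
```

With the example string above, `echo 'aAAbbbBBBB123AB' | letter_freq` reports 8 for b and 4 for a, and `double_letters` reports 3 for bb and 1 for aa.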

Related

Bash: count how many times a word is contained in all the files of a given folder

I'm just trying to count the occurrences of a word without writing an iteration file by file. I don't mind which kind of file it is. The closest I got is:
COUNT=$(grep -r -n -i "theWordImSearchingFor" .)
echo $COUNT
I thought about splitting that by spaces, but the problem is the output does not contain just the filename and the line but also the content (and that may have tons of spaces), e.g. I got:
./doc1.txt:29: This is the content containing theWordImSearchingFor but also other stuff
./doc1.txt:43: This is another line containing theWordImSearchingFor
./dir123/doc2.txt:339: .This is another...file...theWordImSearchingFor....
Any idea on how to keep it simple? TIA
To count the number of occurrences of a specific word, you can use the same layout of code, but simpler. There are many ways to do this, but there are two much simpler versions of the word count than the one you have listed here.
The two much simpler versions:
1st way
2nd way
They should both work, unless there is a problem with a package installation.
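The two versions referred to above did not survive in this copy. Independently of them, one common way to get a single number is to print one match per line and count the lines. The sketch below assumes a grep that supports `-r` and `-o` (GNU and BSD grep do; POSIX grep does not require them), and `count_word` is a made-up name.

```bash
#!/usr/bin/env bash

# count_word WORD [DIR] (hypothetical helper): count every occurrence of
# WORD, ignoring case, in all files under DIR (default: current directory).
count_word() {
    # -o prints each match on its own line, so wc -l counts occurrences
    # rather than matching lines; tr strips the padding some wc versions add.
    grep -r -o -i -- "$1" "${2:-.}" | wc -l | tr -d ' '
}
```

`count_word theWordImSearchingFor .` prints just the total. Adding `-w` to the grep call would restrict it to whole-word matches.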

sed delete unmatched lines between two lines with bash variable

I need help understanding a weird problem with sed, bash and a while loop.
My data looks like this:
-File 1- CSV
account,hostnames,status,ipaddress,port,user,pass
-File 2- XML - This is a sample record set for two entries under one account
<accountname="account">
<cname="fqdn or simple name goes here">
<field="hostname">ahostname or ipv4 goes here</field>
<protocol>aprotocol</protocol>
<field="port">aportnumber</field>
<field="username">ausername</field>
<field="password">apassword</field>
</cname>
<cname="fqdn or simple name goes here">
<field="hostname">ahostname or ipv4 goes here</field>
<protocol>aprotocol</protocol>
<field="port">aportnumber</field>
<field="username">ausername</field>
<field="password">apassword</field>
</cname>
</accountname>
So far, I can add records in between the respective account holder from File1 to File2. But if I need to remove records that no longer exist, it does not work properly, since it wipes other records from different accounts, i.e. it does not restrict the delete to the matched accountname.
I import from File 1 into File 2 with a while loop in my bash program:
-Bash Program excerpts-
# Read File2 line by line into F
cat File2 | while read F
do
# Extract fields from F into variables
_vmname="$(echo $F |grep 'cname'| sed 's/<cname="//g' |sed 's/.\{2\}$//g')"
_account="$(echo $F | grep 'accountname' | sed 's/accountname="//g' |sed 's/.\{2\}$//g')"
# Then compare against File1 and look for stale records that are still in File2
if grep "$_vmname" File1 ;then
continue
else
# If not matched, delete between the respective accountname
sed -i '/'"$_account"'/,/<\/accountname>/ {/'"$_vmname"'/,/<\/cname>/d}' File2
fi
done
If I manually declare _vmname and _account and run
sed -i '/'"$_account"'/,/<\/accountname>/ {/'"$_vmname"'/,/<\/cname>/d}' File2
It removes the stale records from File2. When I let my bash script run, it does not.
I think I have three problems:
Reading the variables for _vmname and _account name inside a loop makes it read numerous times. Any better way to do is appreciated.
I do not think the sed statement for matching these two patterns and then delete works like I want inside a while loop.
I may have a logic problem with my thought chain.
Any pointers, and please no awk, perl, lxml or python for this one.
Thanks!
> and please no awk
I appreciate that you want to keep things simple, and I suppose awk seems more complicated than what you're doing. But I'd like to point out you have so far 3 grep and 4 sed invocations per line in the file, to process another file N times, once per line. That's O(mn) using the slowest method on the planet to read the file (a while loop). And it doesn't work.
> I may have a logic problem with my thought chain.
I'm afraid we must allow for that possibility!
The right advice is to tackle XML with an XML parser, because XML is not a regular language and so can't be parsed with regular expressions. And that's really what you need here, because your program processes the whole XML document. You're not just plucking out bits and depending on incidental formatting artifacts; you want to add records that aren't there and remove those that "no longer exist". Apparently there is information in the XML document you need to preserve, else you would just produce it from the CSV. A parser would spoon-feed it to you.
The second-best advice is to use awk. I suppose you might try an approach like:
1. Process the CSV and produce the XML to be inserted.
2. In awk, first read the new input XML into an array keyed by cname. Then process the XML target once. For every cname, consult your array; if you find a match, insert your pre-constructed XML replacement (or modify the "paragraph" accordingly).
3. I'm not sure what the delete criteria are, so I don't know if it can be done in the same pass as step #2. If not, extract the salient information somehow. Maybe print a list of keys from each of the two files, and use comm(1) to produce a list of to-be-deleted. Then, similar to step #2, read in that list, and process the XML file one more time. Write anything you delete to stderr so you can keep track of what went missing, from what lines.
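The comm(1) idea from the last step can be sketched as follows. Several things here are assumptions rather than facts from the question: that the second CSV column (hostnames) is what matches the cname attribute, that each `<cname="...">` tag sits on its own line as in the sample, and the function name `stale_cnames` itself. Process substitution makes this bash-specific.

```bash
#!/usr/bin/env bash

# stale_cnames CSV XML (hypothetical helper): print the cname values that
# appear in the XML file but not in column 2 of the CSV file.
stale_cnames() {
    local csv=$1 xml=$2
    # comm needs sorted input. -1 drops keys found only in the CSV and
    # -3 drops keys found in both, leaving keys found only in the XML.
    comm -13 \
        <(cut -d, -f2 "$csv" | sort -u) \
        <(sed -n 's/.*<cname="\([^"]*\)">.*/\1/p' "$xml" | sort -u)
}
```

Each file is read once, and the output is the to-be-deleted list that a single later pass over the XML can consume.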
> Any pointers
Whenever you find yourself processing the same file N times for N inputs, you know you're headed for trouble. One of the two inputs is always smaller, and that one can be put in some kind of array. cat file | while read is another warning signal, telling you use awk or any of a dozen obvious utilities that understand lines of text.
You posted your question on SO two weeks ago. I suspect no one answered it because you warned them away: preemptively saying, in effect, don't tell me to use good tools. I'm only here to suggest that you'll be more comfortable after you take off that straightjacket. Better tools, in this case, are the only right answer.

check for permutation in bash

I have a script that takes as input a string whose length must be at least 1 and at most 26.
If that's not the case I want to return an error. That's the part I have figured out:
lengthAlphabetInput=${#1}
if [ $lengthAlphabetInput -lt 1 ] || [ $lengthAlphabetInput -gt 26 ]
then
echo "error: key needs to be between 1 and 26 characters"
exit 1
fi
Other than that I would like to check if the input the user gave is a permutation of (a part of) the alphabet.
For example if the user inputs "abc" I want to return an error "abc is
not a permutation of the alphabet"
if the user inputs "xxxgsdnoip" I again want to return the same error
because I don't want the user to use the same letter more than once.
But the input "xyz" or "jhcwslaedmviotrgzxkbynpuqf" would be correct
because these are permutations of the alphabet. (x instead of a, y
instead of b and z instead of c).
Can anyone help me transform this idea into code?
I realized that this is a question raised by a student, so I did not write down a detailed answer, since the experience of reading the manual and figuring it out yourself will really help you learn how to use bash (actually the GNU/BSD core utilities), as said by @binaryzebra. What you should do is:
Learn to read manual in bash, with command man, such as man sort for the manual of sort utility. Hit Up/Down arrow key or PageUp/PageDown key to scroll; hit q to exit. Reading manual is your first step into Unix world. Sure you can skip this and find all the information from Google, but learning to read manual will do you more good in the long run.
Read the manual of sed and learn substitution with regular expression. The manual is a little too long for a newcomer, but luckily you do not need to read it all; just scan the manual and find the part about substitution; read the examples as well, if there is any. Practice with some test file. Now you know how to check whether input contains only letters (instead of whitespace, symbols, etc.), as well as how to split each character in its own line.
Read the manual of uniq. It has a much shorter manual; reading the whole manual won't take long.
Now learn the pipeline feature in bash. I cannot find a short and focused manual entry, so you may as well just read the online manual from GNU. With the help of pipeline, you can combine sed and uniq to detect duplicated characters.
By "permutation", it seems that you do not want the characters in their original order. If so, read the manual of the sort utility and think how it can help you.
You do not seem to care about whether all 26 letters are there. If this is the case, you probably do not need the wc (word count) utility, unless you require the subset of letters be continuous (such as "cdefg" instead of "cdhjk").
That's all the hints; good luck with your homework.
#!/usr/bin/perl
$_=shift;
print "not ok:repeated: $1\n" if/(.).*\1/;
my $i=0;
my @s= ( map { ord($_)-97 != $i++ ? ():($_)} split(''));
print "not ok:samePlace: @s\n" if @s;
usage:
$ perl ex.pl rty
$ perl ex.pl abc
not ok:samePlace: a b c
$ perl ex.pl ddss
not ok:repeated: d
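For comparison, here is a bash sketch of the same checks. It makes one assumption that differs from the Perl version: it rejects a key only when the whole key leaves the alphabet unchanged (as "abc" does), because the question's own 26-letter example has "c" in third position and would fail the stricter any-letter-in-its-own-place test. The function name `check_key` and the messages are made up for illustration.

```bash
#!/usr/bin/env bash

# check_key KEY (hypothetical helper): print "ok" and return 0 if KEY is
# usable, otherwise print a reason and return 1.
check_key() {
    local key=$1
    local alphabet=abcdefghijklmnopqrstuvwxyz
    # Spelling out the letters avoids locale surprises with [a-z].
    local re="^[$alphabet]{1,26}\$"
    if [[ ! $key =~ $re ]]; then
        echo "not ok: key must be 1 to 26 lowercase letters"
        return 1
    fi
    # One character per line, sorted; uniq -d prints only the duplicates.
    if [[ -n $(grep -o . <<<"$key" | sort | uniq -d) ]]; then
        echo "not ok: repeated letter"
        return 1
    fi
    # Assumed rule: a key equal to the start of the alphabet changes nothing.
    if [[ $key == "${alphabet:0:${#key}}" ]]; then
        echo "not ok: key leaves the alphabet unchanged"
        return 1
    fi
    echo "ok"
}
```

`check_key xyz` prints `ok`, while `check_key abc` and `check_key xxxgsdnoip` each print a reason and return 1. For the stricter rule, the last test would become a loop comparing `${key:i:1}` with `${alphabet:i:1}` position by position.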

Shell script to show frequency of each word in file and in a directory

I came across a question in my interview
Shell script to show frequency of each word in file and in a directory
A
- A1
- File1.txt
- File2.txt
-A2
- FileA21.txt
-A3
- FileA31.txt
- FileA32.txt
B
-B1
- FileB11.txt
- FileB12.txt
- FileB13.txt
-B2
-FileB21.txt
I believe that I understood the question by understanding that Directories A and B are two separate directories with A1, A2 & A3 being sub-directories of A, and B1 and B2 being sub-directories of B. So I answered like this.
find . '\(-name "A" -and -name "B"\)' -type f -exec cat '{}' \; | awk '{c[$1]++} END {for (i in c) print i, c[i]}'
But I still got feedback that the above script was not good enough. What's wrong with the given script?
The major limitation is that the script assumes there is exactly one word per line. c[$1]++ just increments the occurrence of the first field of each line.
The question didn't mention anything about the number of words in a line, so I'd assume this wasn't the intention - you need to go through each word in a line. Also, what about empty lines? With an empty line, $1 will be the empty string, so your script will end up counting "empty" words (which it will happily show as part of the output).
In awk, the number of fields in a line is stored in the built-in variable NF; thus it is easy to write code to loop through the words and increment the corresponding count (and it has the nice side effect of implicitly ignoring lines without words).
So, I would do something like this instead:
find . -type f -exec cat '{}' \; | awk '{ for (i = 1; i <= NF; i++) w[$i]++ } END { for (i in w) printf("%-10s %10d\n", i, w[i]) }'
I removed the directory names constraints in the argument to find(1) for the sake of conciseness, and to make it more general.
This is (probably) the main issue with your solution, but the question is (intentionally) vague and there are many things left to discuss:
Is it case-sensitive? This solution treats World and world as different words. Is this desired?
What about punctuation? Should hello and hello! be treated as the same word? What about commas? That is, do we need to parse and ignore punctuation?
Speaking of which - what about things like what's vs. what? Do we consider them different words? And it's vs. its? English is tricky!
Most important of all (and related to the points above), what exactly defines a word? We assumed a word is a sequence of non-blanks (the default in awk). Is this accurate?
If there are no words in the input, what do we do? This solution prints nothing - maybe we should print a warning message?
Is there a fixed number of words in a line? Or is it arbitrary? (E.g. if there's exactly one word per line, your solution would be enough)
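One possible set of answers to the questions above, written as a sketch: treat a word as a run of letters, fold case, and drop punctuation. These choices are assumptions, not requirements from the interview question; in particular, "what's" is split into "what" and "s". The function name `word_freq` is made up.

```bash
#!/usr/bin/env bash

# word_freq [DIR] (hypothetical helper): print "count word" for every word
# in every regular file under DIR (default: current directory).
word_freq() {
    find "${1:-.}" -type f -exec cat {} + |
        tr '[:upper:]' '[:lower:]' |    # World and world are the same word
        tr -cs '[:alpha:]' '\n' |       # every run of non-letters ends a word
        grep . |                        # drop the empty line tr may leave first
        sort | uniq -c | sort -rn
}
```

Running `word_freq A` lists the most frequent words first. Making each of these assumptions explicit is part of the point, since the question leaves them open.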
FWIW, always remember that your success in an interview is not a binary yes/no. It's not like: Oops, you can't do X, so I'm going to reject you. Or: Oops, wrong answer, you're out. More important than the answer is the process that gets you there, and whether or not you are aware of (a) the assumptions you made; and (b) your solution's limitations. The questions above show ability to consider edge cases, ability to clarify assumptions and requirements, etc, which is way more important than getting the "right" script (and probably there's no such thing as The Right Script).

Swap characters in specific positions in strings of varying lengths

I've been trying to learn sed and the examples I've found here are for swapping dates from 05082012 to 20120805 and I'm having trouble adapting them to my current need.
I need to convert an IP address 10.4.13.22 to a reverse lookup of 22.13.4.10 for a nsupdate script. My biggest problem is the fact that sometimes each octet can change lengths e.g. 10.4.13.2 and 10.19.8.126
Thanks for any help!
echo 10.0.2.99 | sed 's/\(....\)\(....\)/\2\1/'
This is what I've tried so far, based on another question here, but since the examples don't explain much about what .... means, I'm having trouble understanding what it does.
The output of that command is .2.910.09 and I am expecting 99.2.0.10.
Put simply, I want to rearrange each "section" that is separated by a ".".
A "bruteforce" method to "reverse" an IPv4 address would be:
sed 's/\([0-9]\+\)\.\([0-9]\+\)\.\([0-9]\+\)\.\([0-9]\+\)/\4.\3.\2.\1/g'
or, using extended regular expressions with GNU sed's -r (note that \+ in the form above is also a GNU extension),
sed -r 's/([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)/\4.\3.\2.\1/g'
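Since the target is an nsupdate script anyway, the same reversal can also be done without sed by letting the shell split on the dots. This is a sketch assuming bash and a well-formed dotted quad; it does no validation, and `reverse_ip` is a made-up name.

```bash
#!/usr/bin/env bash

# reverse_ip ADDRESS (hypothetical helper): print the four dot-separated
# fields of ADDRESS in reverse order.
reverse_ip() {
    local a b c d
    # IFS=. applies to this read only, so each octet lands in its own
    # variable whatever its length.
    IFS=. read -r a b c d <<<"$1"
    printf '%s.%s.%s.%s\n' "$d" "$c" "$b" "$a"
}
```

`reverse_ip 10.0.2.99` prints `99.2.0.10`, and octets of different lengths such as 10.19.8.126 need no special handling.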
