How to turn a .txt document to all capitalized letters? - coding-style

We have a coding convention for an obscure, proprietary language called PowerOn (think of a scripty PL/1 language) that requires all code to be written in uppercase. Some previous developers carried over their camel-case habits from other languages like Java, and I am now cleaning up after them. Are there any tools that would convert all the text in a file to uppercase?
Worst case scenario, I could make something in .NET that could accomplish this. I am just trying to avoid reinventing the wheel.

tr "[a-z]" "[A-Z]" <code.poweron >newcode.poweron

If you're on Linux, you have a number of options. Two of these are:
1. dd if=input.txt of=output.txt conv=ucase
2. tr '[:lower:]' '[:upper:]' < input.txt > output.txt
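If neither tool is at hand (for example on a Windows machine without a Unix toolset), the same conversion can be sketched in a few lines of Python. This is only a minimal sketch: the function name `uppercase_stream`, the script name in the comment, and the UTF-8 encoding are assumptions, not part of the answers above. Like `tr`, it uppercases everything, including string literals and comments.

```python
import sys

def uppercase_stream(src, dst):
    """Copy text from src to dst line by line, uppercasing each line."""
    for line in src:
        dst.write(line.upper())

if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical script name): python upcase.py code.poweron newcode.poweron
    # Assumption: the source files are UTF-8 (plain ASCII is a subset of that).
    with open(sys.argv[1], encoding="utf-8") as src, \
         open(sys.argv[2], "w", encoding="utf-8") as dst:
        uppercase_stream(src, dst)
```

Writing to a new file rather than in place keeps the original around until the result has been checked.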

Related

How to create a script that takes a string and converts it to another managed string?

My intent is to capture the values of a string that I type and have those values be shifted to other letters. Essentially it would be a fake translation program or custom cipher generation script. Example of function:
I would type the sentence:
Who are you?
and the output would be shifted by, let's say, 1 to the next consonant or vowel. The script would also need to know how to skip vowels or consonants as needed, and for the sake of argument y would always be considered a vowel. So the output would be:
Xju eso auy?
This is something I wanted to attempt for a creative writing project as a means of making another language. Ideally the shift variable could be an input as well to work with to find the best outcome. Possibly even variable shifts for vowels and consonants at the same time?
If you truly are doing this for a creative writing project, then I submit that diving deep into programming is not warranted. None of the input transformations you described requires the program to make decisions. That is, once an encoding is chosen, each incoming letter is firmly associated with one outgoing letter. This greatly expands your options for how to achieve this, and greatly simplifies the task.
Since you tagged Terminal, here are a couple commands you could use in action:
echo "Who are you?" | perl -pe 'tr/N-ZA-Mn-za-m/A-Za-z/'
outputs: Jub ner lbh?
This is the famous Rot13 "encoding" (all it does is substitute the letter that is 13 later in the alphabet). It's particularly handy as 13 is half the alphabet's 26, so putting some "encoded" text in will give you back the original text:
echo "Jub ner lbh?" | perl -pe 'tr/N-ZA-Mn-za-m/A-Za-z/'
outputs: Who are you?
echo just sends text to the screen or to other commands. Here we echo our text "Who are you?" into a pipe | to pass it to the next command, perl, which is a very powerful and flexible text-manipulation and reporting program. The rest of the line is just instructions telling perl to replace each letter with the one 13 places later in the alphabet, wrapping around at the end.
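For comparison, the same Rot13 round trip can be reproduced in Python, whose standard library ships a `rot_13` text codec. The wrapper name `rot13` is just an illustrative choice.

```python
import codecs

def rot13(text):
    """Rotate ASCII letters by 13 places; other characters pass through."""
    return codecs.encode(text, "rot_13")

print(rot13("Who are you?"))         # Jub ner lbh?
print(rot13(rot13("Who are you?")))  # Who are you?
```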
Quick note; normally hitting return runs the command in terminal. You can put a backslash \ at the end of a line though and hit return, it will then let you keep typing on the next line but treat it all as one command. Handy for lining things up.
echo "How are you?" | tr \
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' \
'DFVBTXEUWZOSHCJMAQYRINKLPGdfvbtxeuwzoshcjmaqyrinklpg'
outputs: Ujk dqt pji?
There's another command, tr. This example demonstrates an arbitrary substitution—in this case, random. It looks through that first long set of letters, and swaps in instead the letter in the second long set that is in the matching position. Since this substitution example is random, you could use this kind of mapping to create "Cryptogram" puzzles.
The great thing about the tr command is that you can tell it to use whatever input-to-output "mapping" you'd like. Sure, it's a bit manual, but hey—no programming needed!
Here's the mapping to achieve your requested "consonants and vowels" example shift:
echo "Who are you?" | tr \
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' \
'ECDFIGHJOKLMNPUQRSTVYWXZABecdfighjoklmnpuqrstvywxzab'
Outputs: Xju esi auy? Not doing it by hand has its advantages—you missed a vowel in there.
So if you need to rapidly try different mappings, consider learning a bit more about perl (or simpler: sed. or more complex: awk. Or or or…). If, instead, you don't mind a bit of careful command-construction, just lining up each incoming letter with your desired output letter, I think tr would serve nicely.
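If you do want the shift amounts to be inputs, as the question suggests, here is a minimal Python sketch of the same idea. It builds the substitution table instead of typing it by hand. The names `make_table` and `shift_text` are made up for this example, and it assumes plain ASCII letters with y counted as a vowel, as in the question.

```python
VOWELS = "aeiouy"                    # y is treated as a vowel, per the question
CONSONANTS = "bcdfghjklmnpqrstvwxz"

def _rotate(letters, shift):
    """Rotate a string of letters left by `shift` places (negative shifts go back)."""
    shift %= len(letters)
    return letters[shift:] + letters[:shift]

def make_table(vowel_shift=1, consonant_shift=1):
    """Build a translation table that shifts vowels and consonants separately."""
    src = VOWELS + CONSONANTS
    dst = _rotate(VOWELS, vowel_shift) + _rotate(CONSONANTS, consonant_shift)
    return str.maketrans(src + src.upper(), dst + dst.upper())

def shift_text(text, vowel_shift=1, consonant_shift=1):
    """Apply the shift; punctuation, digits and spaces are left alone."""
    return text.translate(make_table(vowel_shift, consonant_shift))

print(shift_text("Who are you?"))             # Xju esi auy?
print(shift_text("Xju esi auy?", -1, -1))     # Who are you?
```

Negative shifts undo positive ones, so the same function decodes what it encoded.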

Case Sensitive Sort Unix Bash

Here is a screenshot of an issue I'm having with sort:
http://i.imgur.com/cIvAF.png
My objective is to put all equal strings on consecutive lines. It works for 99% of the list I'm sorting, but there are a few hitches, such as those in the screenshot.
So all the yahoo.coms should be next to each other, then all the Yahoo.coms, then the YAHOO.coms, yahoo.cmos, yhoo.c, etc. (with even the typos getting their own group of lines).
Not entirely sure how to handle this with sort, but I'm certainly trying.
I print all the domains unsorted to a file and then sort it with just vanilla sort filename
Would love some advice/input.
You probably need to override the locale; most Linux systems default to a UTF-8 locale, which specifies both case-insensitive sorting and ignoring punctuation. LC_ALL takes precedence over LANG, so setting it is the more reliable override:
LC_ALL=C sort filename
Normalize your input a bit (quote the ranges so the shell does not expand them as glob patterns):
tr 'A-Z' 'a-z'
Try reading "Unix for poets"
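To see what ordering the C locale gives, here is a small Python sketch. Python compares strings by code point, which for ASCII input matches the byte order used by the C locale. The helper names and the sample list are invented for illustration.

```python
from collections import Counter

def group_identical(lines):
    """Sort by code point, so strings differing only in case stay in separate runs."""
    return sorted(lines)

def count_groups(lines):
    """Return (string, count) pairs in the same code-point order."""
    return sorted(Counter(lines).items())

domains = ["yahoo.com", "Yahoo.com", "YAHOO.com", "yahoo.com", "yhoo.c"]
print(group_identical(domains))
# ['YAHOO.com', 'Yahoo.com', 'yahoo.com', 'yahoo.com', 'yhoo.c']
```

Uppercase letters sort before lowercase ones in this order, so each spelling variant forms its own contiguous block.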

How to enumerate unique characters in a UTF-8 document? With sed?

I'm converting some Polish<->English dictionaries from RTF to HTML. The Polish special characters are coming out fine. But IPA (International Phonetic Alphabet) glyphs get changed to funny things, depending on what program I use for conversion. For example, /ˈbiːrɪ/ comes out as /ÈbiùrI/ or /∪βιρΙ/.
I'd like to correct these documents with a search & replace, but I want to make sure I don't miss any characters and don't want to manually pore over dictionary entries. I'd like to output a list of all unique, NON-ascii characters in a document.
I found this thread:
Find Unique Characters in a File
... and I tried the following two proposals:
sed -e "s/./\0\n/g" inputfile | sort -u
sed -e "s/\(.\)/\1\n/g" inputfile | sort -u
They both work nicely, and seem to both generate the same output. My problem is that they only output standard ASCII characters, and what I'm looking for is exactly the opposite.
The sed tool looks awesome, but I don't have time to learn it right now (though I intend to later). I'm hoping the solution will be clear to someone who's already mastered this tool, and they can save me a lot of time. [-:
Thanks in advance!
This is not a sed solution but a Python 3 one. It reads the contents of the file as UTF-8, turns them into a set (thus throwing away duplicates), removes the ASCII characters (0-127), sorts what is left, and joins it back together with a newline between each character:
'\n'.join(sorted(set(open(inputfile, encoding='utf-8').read()) - set(map(chr, range(128)))))
As something you'd run from the command line if you felt so inclined:
python3 -c "print('\n'.join(sorted(set(open('inputfile', encoding='utf-8').read()) - set(map(chr, range(128))))))"
(You could also use ''.join instead of '\n'.join which would list the characters without a newline in between.)
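For a search-and-replace job it can help to see each character's code point and Unicode name next to the glyph. The following is a sketch along the same lines; the function names are hypothetical, and it assumes the file really is valid UTF-8.

```python
import unicodedata

def non_ascii_chars(text):
    """Return the distinct characters above U+007F, sorted by code point."""
    return sorted({ch for ch in text if ord(ch) > 127})

def describe(ch):
    """Format one character as 'U+XXXX  char  NAME'."""
    return f"U+{ord(ch):04X}  {ch}  {unicodedata.name(ch, '<unnamed>')}"

def report(path):
    """Print one line per distinct non-ASCII character found in the file."""
    with open(path, encoding="utf-8") as f:
        for ch in non_ascii_chars(f.read()):
            print(describe(ch))
```

Calling `report('inputfile')` lists every non-ASCII character once, which gives a checklist of the replacements that are needed.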

Bash script frequency analysis of unique letters and repeating letter pairs how should i build this script?

OK, first post.
So I have this assignment to decrypt cryptograms by hand, but I also wanted to automate the process a little, if not all of it then at least a few parts. So I browsed around and found some sed and awk one-liners that do some of the things I wanted, but not everything I wanted or needed.
There are some websites that sort of do what I want, but I really want to do it in bash, just because I want to understand it better :)
The script would take a filename as a parameter and write another file, such as solution$1, when done.
if [ -e "$PWD/$1" ]; then
    echo "$1 exists"
else
    echo "$1 does not exist"
fi
That would start the script by checking whether the file given as a parameter exists.
Then I found this one-liner:
sed -e "s/./\0\n/g" "$1" | while read c; do echo -n "$c"; done
It works fine, but I need the number of occurrences per letter, and I really don't see how to do that.
Here is more or less what I'm trying to achieve, for counting unique letter occurrences and such: http://25yearsofprogramming.com/fun/ciphers.htm
I then need to put all letters in lowercase.
After this I see the script doing these things:
- A subscript that scans a dictionary file for words of a certain pattern and size; the bigger the words, the better.
For example, let's say the solution is the word "apparel" and the encrypted word is "zxxzgvk".
Is there a regex way to express the pattern shared by those two words, so that "apparel" gets listed from a dictionary file because "appa" and "zxxz" follow the same pattern and "zxxzgvk" has the same length as "apparel"?
Can this part be done, and is it realistic to view the problem like this, or is it just far-fetched?
- Another subscript that takes the letters found via the previous output word and swaps them into the cryptogram.
The swapped letters will be in uppercase to tell them apart over time.
I'll then have to figure out how to proceed, maybe rescanning the newly formed words to see whether they appear in a dictionary file, partly or fully, and then swapping more letters or not.
Has anyone seen this problem before and tried to solve it with word patterns like I described, or is this just too complex?
Should I log any of the swaps?
Maybe just scan through all the encrypted words and swap as I go along, then do another sweep, with the constraint from the first sweep that uppercase letters are not changed (actually, using them as more precise patterns!).
Has anyone written a similar script or program in another language? If so, which one? Maybe I can relate somehow :)
Maybe we can use your insight into how you thought out your code.
I will happily include the cryptograms I have decoded and the one I have yet to decode :)
Again, the focus of my assignment is not to write this script but just to solve the cryptograms. But writing scripts, or at least trying to see how I would write this one, does help me understand a little more how to think in terms of code. Feel free to point me in the right direction!
The cryptogram itself is based on simple alphabetic substitution.
I have put the code-to-be in a pastebin here :) http://pastebin.com/UEQDsbPk
In pseudocode, the way I see it is:
call program with an input filename as parameter and optionally a second filename (dictionary)
verify the input file exists and isn't empty
read the file's content and echo it on screen
transform to lowercase
scan through the text and count each letter to do a frequency analysis
ask the user what language the text is supposed to be in (English by default)
use the response to choose which letter frequencies to use as a baseline
swap in the letters suggested by the frequency analysis, in uppercase
print the changed document on screen
ask the user to swap letters in the encrypted text
if the user gave a dictionary file as the second argument
    then scan the cipher for words and find the bigger words
    find words with a similar pattern (some repeating letters) in the dictionary file
    list the results on screen, if any
    offer to swap the corresponding letters in the cipher
    print the modified cipher on screen
    ask again to swap letters or find more similar words
That is more or less how I see the script structured.
Do you see anything I should add? Did I miss something?
I hope this revised version is clearer for everyone!
TL;DR, to be frank. To the only question I found, the answer is yes :) Please split it into smaller tasks and we'll be happy to assist you, if you don't find the answers to those smaller questions first.
If you can put it in pseudocode, it will be easier. There are all kinds of text-manipulation tools in Unix. Which ones to use depends on how big your texts are. I believe they are not that big, or you would have used a compiled language.
For example, the easy but costly gawk way to count frequencies:
awk -F "" '{for(i=1;i<=NF;i++) freq[$i]++;}END{for(i in freq) printf("%c %d\n", i, freq[i]);}'
As for transliterating, there is the tr utility. You can build the actual strings for each case and then pass them to it (this holds for Caesar-like ciphers).
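The per-letter counting step can also be sketched in Python, with the lowercasing that the question asks for folded in. The function name `letter_frequencies` is an assumption for this example; it counts letters only and ignores digits, spaces and punctuation.

```python
from collections import Counter

def letter_frequencies(text):
    """Count letters case-insensitively, most frequent first."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    return counts.most_common()

print(letter_frequencies("aAAbbbBBBB123AB"))  # [('b', 8), ('a', 4)]
```

Comparing this list with a table of typical letter frequencies for the target language gives the first guesses for a substitution cipher.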
grep -o . inputfile | sort | uniq -c | sort -rn
Example:
$ echo 'aAAbbbBBBB123AB' | grep -o . | sort | uniq -c | sort -rn
5 B
3 b
3 A
1 a
1 3
1 2
1 1
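The word-pattern idea from the question ("apparel" against "zxxzgvk") can be handled without a regex by reducing each word to the positions at which its letters first appear; two words can only correspond under a simple substitution if those patterns are equal. A rough sketch, with hypothetical helper names and an invented word list:

```python
def word_pattern(word):
    """Map each letter to the index of its first appearance, e.g. 'noon' -> (0, 1, 1, 0)."""
    first_seen = {}
    return tuple(first_seen.setdefault(ch, len(first_seen)) for ch in word.lower())

def candidates(cipher_word, dictionary_words):
    """Return dictionary words whose repetition pattern matches the cipher word."""
    target = word_pattern(cipher_word)
    return [w for w in dictionary_words if word_pattern(w) == target]

words = ["apparel", "letters", "pattern", "zebra"]  # stand-in for a dictionary file
print(candidates("zxxzgvk", words))  # ['apparel']
```

Equal patterns imply equal length, so no separate length check is needed. Longer words with repeated letters tend to have rarer patterns, which is why they make good starting points.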

Replacing huge blocks with sed

I have 2 files that are generated elsewhere. First one is "what to search", and second one is the replacement. Both files are huge, about 2-3mb each.
I need to write a bash script that takes an even bigger file (about 200-300mb) and replaces all occurrences of file1 contents to file2 contents.
Problem is, file1 and file2 can contain any possible characters, including regexp special symbols.
How can I solve this problem using sed?
Thanks in advance.
Maybe have a look at chgrep:
http://www.bmk-it.com/projects/chgrep/
Cheers,
gregx
Since you don't actually need regular expressions, just direct string matching, sed is overkill. What you're really looking for is a fixed-string (maybe even binary) stream editor. Unfortunately, I don't know of one... I hate to suggest possibly reinventing a wheel, but you could write something fairly quickly in C that'd do what you want. A rough draft outline:
read search-file into memory
create a buffer of the same size as search-file
read from stdin (or input-file) into buffer.
For each character, if it does not match the parallel character from search-file, shift the buffer. To find out how much to shift it by, read until you find a match to the first character of search-file, then check to see if the rest matches, repeating until you've found a partial match to search-file (or gotten to the end of the buffer). When you shift, print all the non-matching characters to stdout (or output-file).
If the buffer ever fills up, i.e. totally matches search-file, print replacement-file to stdout (or output-file). Depending on memory vs. speed, you can keep replacement-file in memory or read it from disk each time.
You could also attempt to automatically escape all regex characters from your input file. This could be done with a horribly ugly list of sed substitutions, like
sed -e 's/\\/\\\\/g' -e 's#/#\\/#g' -e 's/\[/\\[/g' ...
(make sure you do the \ one first!)
I don't know about sed, but in Perl you could do (off the top of my head, untested):
perl -0777 -pe 'BEGIN{local $/ = undef; open FROM, "<", shift @ARGV; $from = <FROM>; open TO, "<", shift @ARGV; $to = <TO>} s/\Q$from\E/$to/sog' file1 file2 bigger-file > new-bigger-file
If you're interested in trying Perl, I could try testing it for you tomorrow.
But it sucks the entire bigger-file into memory, because it ignores line breaks so that your search text can span multiple lines. This means that it uses quite a lot of memory!
This answer assumes that the search file is one long search string over multiple lines which must be matched in its entirety rather than a number of separate search strings, any of which can be matched.
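Under the same assumption (one long literal search block), a fixed-string replacement can be sketched in Python using bytes, so no character needs escaping. The function names are hypothetical, and like the Perl version this reads everything into memory, which should be workable for files of a few hundred megabytes but is not a streaming solution.

```python
def replace_literal(data, search, replacement):
    """Replace every occurrence of the byte string `search`; no regex is involved."""
    if not search:
        raise ValueError("search block must not be empty")
    return data.replace(search, replacement)

def replace_in_file(big_path, search_path, replacement_path, out_path):
    """Read all three inputs fully into memory, then write the result to out_path."""
    with open(search_path, "rb") as f:
        search = f.read()
    with open(replacement_path, "rb") as f:
        replacement = f.read()
    with open(big_path, "rb") as f:
        data = f.read()
    with open(out_path, "wb") as f:
        f.write(replace_literal(data, search, replacement))
```

One thing to watch: if the search file ends with a newline that the matching block in the big file lacks, the literal comparison will not match, so the generated files may need trimming first.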
