replace a pattern with n number of spaces - shell

I am new to shell scripting and would appreciate any help with the problem below. I have tried sed and awk but was unable to find a solution.
Problem: I have a fixed-width file with amount fields that need to be replaced with spaces or a special character such as $, and the record length has to be maintained. The length of the amount fields can vary.
For example, if sample_file.txt has a record length of 10 and there are two amount fields starting at positions 2 and 6, of length 3 and 5, as below:
a234b67890
It has to be modified as:
a$$$b$$$$$
This is for unix server.
Edit:
Also the records can have numeric characters at other positions which shouldn't be updated. So considering the previous example, the updated input is:
a234b678901234567890
And new output should be:
a$$$b$$$$$1234567890

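Try using
inp=a234b67890
echo "$inp" | sed 's/[0-9]/$/g'
# gives a$$$b$$$$$
Because sed replaces each digit with exactly one special character, the record length is preserved. Note that this replaces every digit in the record, so it only works when the amount fields are the only numeric characters, as in the first example.
Hope this helps.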
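For the edited case, where digits outside the amount fields must stay untouched, masking by position rather than by character class is one option. The following is a minimal sketch, assuming the field positions from the question's example (start 2, length 3 and start 6, length 5). The function name mask_fields and the start:length spec format are made up for illustration.

```shell
# mask_fields: overwrite fixed-position fields with '$', keeping record length.
# The spec "2:3,6:5" (1-based start:length pairs) mirrors the question's example
# and is an assumption; adjust it to the real record layout.
mask_fields() {
  awk -v spec="2:3,6:5" -v ch='$' '
    BEGIN { n = split(spec, f, ",") }
    {
      for (i = 1; i <= n; i++) {
        split(f[i], p, ":")
        start = p[1]; len = p[2]
        pad = ""
        for (j = 0; j < len; j++) pad = pad ch
        # Rebuild the record: text before the field, the mask, text after it
        $0 = substr($0, 1, start - 1) pad substr($0, start + len)
      }
      print
    }' "$@"
}

printf '%s\n' 'a234b678901234567890' | mask_fields
# prints a$$$b$$$$$1234567890
```

Called with a file name (mask_fields sample_file.txt) it reads that file instead of standard input. Records shorter than a field's start position are not handled specially in this sketch.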

Related

serialized numbers in text files with for loop and sed

I want to put serialized numbers on defined positions in a text file.
My idea is to use character patterns in the file, count up a variable and put them by using sed in the file. I tried this:
for number in 1 2 3 4 ; do
sed -ibak "s/var/$number" file.txt > file2.txt
done
(Listing the arguments 1 2 3 ... by hand is not the best solution, but I think it should work.)
With this code and tiny variations of it, I get different results, but no success.
I can cut/paste the pattern in the text, but it is always the last argument that gets inserted. Why doesn't sed take the iterated variable? (It is counted up; I tested that with echo.)
The first iteration replaces var by 1, the next iteration replaces exactly the same var by 2, etc. - because you operate on the same input every time, and the pattern isn't dynamic.
It's not clear what you want to achieve, so it's hard to provide a working solution.
It might be easier to reach for Perl:
perl -pe 's/picvar/"pic" . ++$i/e'
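If Perl is not available, the same idea (one counter that increases with every replaced occurrence, in a single pass over the file) can be sketched in awk. This is an illustration only: the placeholder text var comes from the question's sed command, and the function name number_placeholders is hypothetical.

```shell
# number_placeholders: replace each occurrence of the literal text "var"
# with 1, 2, 3, ... counting across the whole input.
number_placeholders() {
  awk '{
    out = ""
    # Consume the line left to right, one placeholder at a time
    while ((pos = index($0, "var")) > 0) {
      n++
      out = out substr($0, 1, pos - 1) n
      $0 = substr($0, pos + 3)   # 3 = length of "var"
    }
    print out $0
  }' "$@"
}

printf 'a var b var\nvar\n' | number_placeholders
# prints:
# a 1 b 2
# 3
```

Unlike the loop in the question, the input is read only once, so each placeholder is seen exactly once and gets its own number.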

Replace specific commas in a csv file

I have a file like this:
gene_id,transcript_id(s),length,effective_length,expected_count,TPM,FPKM,id
ENSG00000000003.14,ENST00000373020.8,ENST00000494424.1,ENST00000496771.5,ENST00000612152.4,ENST00000614008.4,2.23231E3,2.05961E3,2493,2.112E1,1.788E1,00065a62-5e18-4223-a884-12fca053a109
ENSG00000001084.10,ENST00000229416.10,ENST00000504353.1,ENST00000504525.1,ENST00000505197.1,ENST00000505294.5,ENST00000509541.5,ENST00000510837.5,ENST00000513939.5,ENST00000514004.5,ENST00000514373.2,ENST00000514933.1,ENST00000515580.1,ENST00000616923.4,3.09456E3,2.92186E3,3111,1.858E1,1.573E1,00065a62-5e18-4223-a884-12fca053a109
The problem is that the file should have been tab-delimited rather than comma-delimited, because the values starting with ENST (i.e. transcript_id(s)) belong together in one column.
The number of ENST IDs is different in each line.
Each ENST ID has the same pattern: it starts with ENST, followed by 11 digits, a period, and then 1-3 digits: ^ENST[0-9]{11}[.][0-9]{1,3}.
I want to convert all the commas between ENST IDs to a : or any other character so that I can read this as a CSV file. Any help would be much appreciated. Thanks!
I imagine something as simple as
sed 's|,ENST|:ENST|g;s|:|,|' < /path/to/your/file
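should work: the first substitution turns every comma in front of an ENST ID into a colon, and the second turns the first colon on each line back into a comma, restoring the separator after gene_id. No reason to over-complicate.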
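To see the effect, here is that command wrapped in a function and applied to a shortened version of the first data line from the question (only two ENST IDs and two trailing fields are kept, for readability). The function name join_transcripts is hypothetical.

```shell
# join_transcripts: group the ENST IDs of each line into one colon-separated field.
join_transcripts() {
  sed 's|,ENST|:ENST|g;s|:|,|' "$@"
}

printf '%s\n' 'ENSG00000000003.14,ENST00000373020.8,ENST00000494424.1,2.23231E3,2493' |
  join_transcripts
# prints ENSG00000000003.14,ENST00000373020.8:ENST00000494424.1,2.23231E3,2493
```

The header line contains neither ,ENST nor a colon, so it passes through unchanged. The sketch assumes no other field contains a colon before the first ENST ID.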

AWK - I need to write a one line shell command that will count all lines that

I need to write this solution as an AWK command. I am stuck on the last question:
Write a one line shell command that will count all lines in a file called "file.txt" that begin with a decimal number in parenthesis, containing a mix of both upper and lower case letters, and end with a period.
Example(s):
This is the format of lines we want to print. Lines that do not match this format should be skipped:
(10) This is a sample line from file.txt that your script should
count.
(117) And this is another line your script should count.
Lines like this, as well as other non-matching lines, should be skipped:
15 this line should not be printed
and this line should not be printed
Thanks in advance, I'm not really sure how to tackle this in one line.
This is not a homework solution service. But I think I can give a few pointers.
One idea would be to create a counter, and then print the result at the end:
awk '<COND> {c++} END {print c}'
I'm getting a bit confused by the terminology. First you claim that the lines should be counted, but in the examples, it says that those lines should be printed.
Now of course you could do something like this:
awk '<COND>' file.txt | wc -l
The first part prints all lines that satisfy the condition, and the output is piped to wc -l, a separate program that counts the number of lines.
Now as to what the condition <COND> should be, I leave to you. I strongly suggest that you google regular expressions and awk, it shouldn't be too hard.
I think the requirement is very clear
Write a one line shell command that will count all lines in a file called "file.txt" that begin with a decimal number in parenthesis, containing a mix of both upper and lower case letters, and end with a period.
1. begin with a decimal number in parenthesis
2. containing a mix of both upper and lower case letters
3. end with a period
Check all three conditions. Note that 2. doesn't say "only", so the line can contain other classes of characters, but it must have at least one uppercase and one lowercase letter.
The example mixes the concepts of printing and counting. If that is part of the exercise, it is very poorly worded, or perhaps it assumes that the counting will be done by wc on the piped output of a filtering script. Either way, more attention should have been paid to the wording, especially for a student exercise.
Please comment if anything is not clear and I'll add more details...
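As an illustration of the three conditions listed above, they can be written as separate regular expressions joined with && in the counter form shown earlier. This is one possible reading of the requirement, not the only one, and the function name count_matching is made up for the example.

```shell
# count_matching: count lines that
#   1. begin with a decimal number in parentheses,
#   2. contain at least one uppercase and one lowercase letter,
#   3. end with a period.
count_matching() {
  awk '/^\([0-9]+\)/ && /[A-Z]/ && /[a-z]/ && /\.$/ { c++ }
       END { print c + 0 }' "$@"   # c + 0 prints 0 when nothing matched
}

printf '%s\n' \
  '(10) This is a sample line.' \
  '(117) And this is another line your script should count.' \
  '15 this line should not be printed' \
  'and this line should not be printed' | count_matching
# prints 2
```

For the exercise itself, count_matching file.txt would read the file directly.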

How to format a US currency string using python or sed

I have numerous invoices that I sent to clients with this string at the bottom:
Total: 1,000.00
or whatever the amount is. Some are 2 figures, some 5 figures, plus the decimal part.
The thing is that the number format is inconsistent across the invoices. Sometimes it's 1.000,00; it keeps switching the dot and the comma.
With grep, awk and sed I am able to extract just the amount from all invoices, without the dollar sign, in order to sum them up to a grand total. But the dot and comma switching confuses Python, obviously.
So in Python (it could be in sed as well), I am looking to convert the third character from the right to a dot and then, from there on, convert every fourth character it finds to a comma.
In other words, it has to separate the digits into groups of 3 from the right and put a comma between each of them, except for the first group at the far right, which is 2 digits separated by a dot.
Hope that is clear enough...
Try this:
yourstring = yourstring[:(len(yourstring)-3)].replace(".",",") + "." + yourstring[-2:]
I tried this on python and I think that works.
sed 's/$/ /
:coma
s/\([0-9]\)[.]\([0-9]\{3\}\)/\1,\2/g;t coma
:dot
s/\([0-9]\),\([0-9][0-9][^0-9]\)/\1.\2/g;t dot
s/ $//
' YourFile
This applies a general, repeated substitution to every number on each line.
It first changes every dot used as a thousands separator into a comma, then changes the last comma (the one before the two decimal digits) into a dot.
A trick is needed to handle a number at the end of the line: a space is appended to the line first and removed at the end (this could be optimized with a preliminary test).
The script is POSIX compliant.
Well, the simplest way I've found to handle this is a bit of sed, some bash and, for the final print, printf, which allows easy currency formatting with "%'.2f" (note the ' character, it is mandatory for the thousands grouping):
# Get rid of every character that is not a digit (each amount becomes a whole number of cents)
totals=$( echo "$totals" | sed 's/[^0-9]*//g' )
# Sum up the amounts
sum=0
for n in $totals; do
sum=$((sum + n))
done
# Put back the dot at the decimals, then the commas at each thousand and the $ sign
sansdec=$(( ${#sum} - 2 ))
sum="${sum:0:$sansdec}.${sum: -2}"
printf "%s" "\$"
# The ' flag groups thousands according to the current locale
printf "%'.2f\n" "$sum"
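The core of the approach above (drop all separators so each amount becomes a whole number of cents, then sum) can also be sketched without bash-specific substring expansion. This is only an illustration: it assumes every amount has exactly two decimal digits and one amount per line, and the function name sum_totals is hypothetical.

```shell
# sum_totals: sum amounts written as 1,000.00 or 1.000,00 (one per line).
# Assumes every amount ends in exactly two decimal digits.
sum_totals() {
  sed 's/[^0-9]//g' "$@" |                       # 1,000.00 -> 100000 (cents)
    awk '{ cents += $1 } END { printf "%.2f\n", cents / 100 }'
}

printf '%s\n' '1,000.00' '1.000,00' '25,50' | sum_totals
```

With the three sample amounts the total is 202550 cents, i.e. 2025.50. The result is printed without thousands separators; the printf "%'.2f" call from the answer above can be applied to it if grouping is wanted.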

Bash script frequency analysis of unique letters and repeating letter pairs how should i build this script?

OK, first post.
So I have this assignment to decrypt cryptograms by hand, but I also wanted to automate the process a little, if not all of it then at least a few parts. I browsed around and found some sed and awk one-liners that do some of the things I wanted, but not everything I need.
There are some websites that sort of do what I want, but I really want to do it in bash, just because I want to understand it better :)
The script would take a filename as a parameter and output another file such as solution$1 when done.
if [ -e "$PWD/$1" ]; then
echo "$1 exists"
else
echo "$1 does not exist"
fi
This would start the script, checking whether the file given as a parameter exists.
Then I found this one-liner:
sed -e "s/./\0\n/g" $1 | while read c;do echo -n "$c" ; done
It works fine, but I need the number of occurrences per letter, and I really don't see how to do that.
Here is what I'm trying to achieve, more or less: http://25yearsofprogramming.com/fun/ciphers.htm for counting unique letter occurrences and such.
I then need to put all letters in lowercase.
After this I see the script doing these things:
- A subscript that scans a dictionary file for words of a certain pattern and size; the bigger the words, the better.
For example, let's say the solution is the word "apparel" and the encrypted word is "zxxzgvk".
Is there a regex way to express the pattern shared by those two words and list the word "apparel" from a dictionary file, because "appa" and "zxxz" have the same pattern and "zxxzgvk" has the same length as "apparel"?
Can this part be done, and is it realistic to view the problem like this, or is it just far-fetched?
- Another subscript that takes the letters found in the previous output word and swaps those letters in the cryptogram.
The swapped letters will be in uppercase to differentiate them over time.
I'll then have to figure out how to proceed, maybe rescanning the newly found words to see whether they appear in a dictionary file partly or fully, and then swapping more letters or not.
Has anyone seen this problem before and tried to solve it with word patterns as I described, or is this just too complex?
Should I log any of the swaps?
Maybe just scan through all the encrypted words and swap as I go along, then do another sweep, with the constraint in the first sweep of not changing uppercase letters (actually, using them as more precise patterns!).
Has anyone written a similar script or program in another language? If so, which one? Maybe I can relate somehow :)
Maybe we can use your insight as to how you thought out your code.
I will happily include the cryptograms I have decoded and the one I have yet to decode :)
Again, the focus of my assignment is not to do this script but just to resolve the cryptograms. But doing scripts or at least trying to see how I would do this script does help me understand a little more how to think in terms of code. Feel free to point me in the right directions!
The cryptogram itself is based on simple alphabetic substitution.
I have done a pastebin here with the code to be :) http://pastebin.com/UEQDsbPk
In pseudocode the way I see it is :
call program with an input filename in param and optionally a second filename(dictionary)
verify the input file exists and isnt empty
read the file's content and echo it on screen
transform to lowercase
scan through the text and count the amount of each letter to do a frequency analysis
ask the user what language the text is supposed to be in (english default)
use the response to specify which letter frequencies to use as a baseline
swap letters corresponding to the frequency analysis in uppercase..
print the changed document on screen
ask the user to swap letters in the crypted text
if user had given a dictionary file as the second argument
then scan the cipher for words and find the bigger words
find words with a similar pattern (some letters repeating letters) in the dictionary file
list on screen the results if any
offer to swap the letters corresponding in the cipher
print modified cipher on screen
ask again to swap letters or find more similar words
That is more or less the way I see the script structured.
Do you see anything that I should add? Did I miss something?
I hope this revised version is more clear for everyone!
TL;DR, to be frank. To the only question I've found, the answer is yes :) Please split it into smaller tasks and we'll be happy to assist you, if you don't find the answers to those smaller questions yourself first.
If you can put it in pseudocode, it would be easier. There are all kinds of text-manipulating tools in Unix. Which ones to use depends on how big your texts are. I believe they are not that big, or you would have used a compiled language.
For example, the easy but costly gawk way to count frequencies:
awk -F "" '{for(i=1;i<=NF;i++) freq[$i]++;}END{for(i in freq) printf("%c %d\n", i, freq[i]);}'
As for transliterating, there is the tr utility. You can build the source and target strings for each case and pass them to it (this holds for Caesar-like ciphers).
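As a rough sketch of how tr could cover two steps from the question (lowercasing the text, then writing guessed letters in uppercase), consider the following. The function names and the partial key (z maps to A, x maps to P, taken from the "zxxzgvk"/"apparel" example) are assumptions for illustration.

```shell
# to_lower: fold the cryptogram to lowercase.
to_lower() {
  tr 'A-Z' 'a-z'
}

# apply_key: replace each cipher letter in $1 with the letter at the same
# position in $2. Passing $2 in uppercase marks the letters already solved.
apply_key() {
  tr "$1" "$2"
}

printf '%s\n' 'ZXXZGVK' | to_lower | apply_key 'zx' 'AP'
# prints APPAgvk
```

Because solved letters are uppercase and unsolved ones lowercase, a later call to apply_key with more lowercase cipher letters leaves the earlier guesses untouched.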
grep -o . inputfile | sort | uniq -c | sort -rn
Example:
$ echo 'aAAbbbBBBB123AB' | grep -o . | sort | uniq -c | sort -rn
5 B
3 b
3 A
1 a
1 3
1 2
1 1
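On the word-pattern part of the question ("zxxzgvk" versus "apparel"), back-references in a grep basic regular expression can express "the same letter again". The sketch below hard-codes the pattern for that one example; the function name match_pattern is hypothetical and the word list is made up for the demonstration.

```shell
# match_pattern: list dictionary words shaped like "zxxzgvk":
# letter 1, letter 2, letter 2 again, letter 1 again, then three more letters.
# -x makes the pattern match the whole line, which also fixes the length at 7.
match_pattern() {
  grep -x '\(.\)\(.\)\2\1...' "$@"
}

printf '%s\n' apparel letters zebra | match_pattern
# prints apparel
```

This only enforces the repeats. It does not require the remaining letters to differ from each other or from the first two, so some candidates may still need to be filtered out by hand.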
