how to use values with sed in shell scripting? - bash

I am trying to write a shell script for an alphametic puzzle.
I call it with 5 parameters like this:
$ alphametic 5790813 BEAR RARE ERE RHYME
to get
ABEHMRY -> 5790813
I tried this:
#!/bin/bash
echo "$2 $3 $4 $5" | sed 's/ //g ' | sed 's/./&\n/g' | sort -n | sed '/^$/d' | uniq -i > testing
paste -sd '' testing > testing2
sed "s|^\(.*\)$|\1 -> ${1}|" testing2
but I get an error from the last sed command, and I don't know where the problem is.

Another approach:
chars=$(printf '%s' "${@:2}" | fold -w1 | sort -u | tr -d '\n')
echo "$chars -> $1"
(printf '%s' "${@:2}" concatenates the words with no separators.) Note that sort's -n doesn't make sense here: these are letters, not numbers.
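Applying the same fixes to the question's original pipeline, a minimal sketch (using grep -o . to split into characters, which avoids the GNU-sed-only \n replacement, and plain sort -u instead of sort -n plus uniq):

```shell
#!/bin/bash
# Usage: ./alphametic 5790813 BEAR RARE ERE RHYME
# grep -o . prints one character per line; grep -v ' ' drops the spaces;
# sort -u de-duplicates; tr -d '\n' joins the letters back into one word.
letters=$(echo "$2 $3 $4 $5" | grep -o . | grep -v ' ' | sort -u | tr -d '\n')
printf '%s -> %s\n' "$letters" "$1"
```

Called with the arguments above, this prints ABEHMRY -> 5790813.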

One idea using awk for the whole thing:
arg1="$1"
shift
others="$*"
awk -v arg1="${arg1}" -v others="${others}" '
BEGIN { n=split(others,arr,"")                # split into array of single characters
        for (i=1;i<=n;i++)                    # loop through indices of arr[] array
            letters[arr[i]]                   # assign characters as indices of letters[] array; eliminates duplicates
        delete letters[" "]                   # delete array index for "<space>"
        PROCINFO["sorted_in"]="@ind_str_asc"  # sort array by index
        for (i in letters)                    # loop through indices
            printf "%s", i                    # print index to stdout
        printf " -> %s\n", arg1               # finish off line with final string
      }
'
NOTE: requires GNU awk for the PROCINFO["sorted_in"] (to sort the indices of the letters[] array)
This generates:
ABEHMRY -> 5790813

Related

shell sort command : How to sort by the last column (the number of columns is uncertain)?

If the data is like the following:
a,b,3
c,d,e,f,2
g,1
I want to sort by the last column. The result should be:
g,1
c,d,e,f,2
a,b,3
If the last field is a single digit:
$ rev file | sort | rev
You may need to add -t, -n to sort for numerical ordering, but for single digits it doesn't matter.
or, for the general case, with GNU awk (asorti):
$ awk -F, '{a[$NF]=$0} END{n=asorti(a,d); for(k=1;k<=n;k++) print a[d[k]]}' file
g,1
c,d,e,f,2
a,b,3
This will fail if the last field is not unique. Using the decorate/sort/undecorate idiom, you can write instead (as you found yourself):
$ awk -F, '{print $NF FS $0}' file | sort -n | cut -d, -f2-
It's safer to use the field delimiter between the key and the record, since you want to ensure the FS doesn't appear in the key itself.
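A quick self-contained illustration of the decorate/sort/undecorate idiom, with the sample data inlined via printf for the demo:

```shell
# Decorate: prefix each record with its own last field as the sort key,
# sort numerically on that key, then undecorate by cutting the key off.
printf '%s\n' 'a,b,3' 'c,d,e,f,2' 'g,1' \
    | awk -F, '{print $NF FS $0}' \
    | sort -t, -k1,1n \
    | cut -d, -f2-
```

This prints g,1 then c,d,e,f,2 then a,b,3, one per line.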
I have a stupid but simple way to do it :)
# if the original data is in the file ~/Desktop/1.log
$ awk -F, '{print $NF, $0}' ~/Desktop/1.log | sort -n | awk '{print $2}'
g,1
c,d,e,f,2
a,b,3
Here is my solution using bash script -- i named it uncertain.sh.
# Set here the size of the largest item to sort.
# In our case it is c,d,e,f,2 which is size 5.
max_n=5
# This function 'pads' array with P's before last element
# to force it to grow to max_n size.
# For example, (a b 3) will be transformed into (a b P P 3).
pad () {
    local arr=("$@")
    local l=${#arr[@]}
    local diff_l=$((max_n-l))
    local padding=""
    # construct padding
    for i in $(seq 1 $diff_l); do
        padding+="P "
    done
    local l_minus=$((l-1))
    arr=(${arr[@]:0:$l_minus} "$padding"${arr[@]:$l_minus})
    echo "${arr[@]}"
}
################################################
# Provide A,B,C here to sort by last item
################################################
A="a,b,3"
B="c,d,e,f,2"
C="g,1"
A=$(echo "$A" | tr ',' ' ')
B=$(echo "$B" | tr ',' ' ')
C=$(echo "$C" | tr ',' ' ')
a=( $A )
b=( $B )
c=( $C )
# Get padded arrays.
a=$(pad "${a[@]}")
b=$(pad "${b[@]}")
c=$(pad "${c[@]}")
# Here, we sort by the last field (we can do this since
# padded arrays are all same size 5).
# Then we remove 'P's from strings.
feed=$(printf "%s\n" "$a" "$b" "$c" | sort -k5,5n | tr -d 'P')
# Lastly, we change spaces with commas ','.
while read line; do
    echo "$line" | tr -s ' ' | tr ' ' ','
done < <(echo "$feed")
Here's the output
$ ./uncertain.sh
g,1
c,d,e,f,2
a,b,3
Here's how I did it:
We start with
a,b,3
c,d,e,f,2
g,1
We convert this to
a,b,P,P,3
c,d,e,f,2
g,P,P,P,1
Then we can sort by the 5th column since they are all of same size 5.
So this becomes
g,P,P,P,1
c,d,e,f,2
a,b,P,P,3
We can now remove P's.
g,1
c,d,e,f,2
a,b,3
Hope you found this useful.

Error in bash script: arithmetic error

I wrote a simple script to extract text from a bunch of files (*.out) and add two lines at the beginning and one line at the end. Then I combine the extracted text with another file to create a new file. The script is here:
#!/usr/bin/env bash
#A simple bash script to extract text from *.out and create another file
for f in *.out; do
    # In the following line, n is a number which is extracted from the file name
    n=$(echo $f | cut -d_ -f6)
    t=$((2 * $n))
    # To extract the necessary text/data
    grep " B " $f | tail -${t} | awk 'BEGIN {OFS=" ";} {print $1, $4, $5, $6}' | rev | column -t | rev > xyz.xyz
    # To add some text as the first, second and last lines.
    sed -i '1i -1 2' xyz.xyz
    sed -i '1i $molecule' xyz.xyz
    echo '$end' >> xyz.xyz
    # To combine the extracted info with another file (ea_input.in)
    cat xyz.xyz ./input_ea.in > "${f/abc.out/pqr.in}"
done
./script.sh: line 4: (ls file*.out | cut -d_ -f6: syntax error: invalid arithmetic operator (error token is ".out) | cut -d_ -f6")
How can I correct this error?
In bash, when you use:
$(( ... ))
it treats the contents of the brackets as an arithmetic expression, returning the result of the calculation, and when you use:
$( ... )
it executes the contents of the brackets and returns the output.
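A side-by-side sketch of the two expansions:

```shell
n=$(( 2 * 3 ))                 # arithmetic expansion: evaluates the expression
w=$(echo hello | tr a-z A-Z)   # command substitution: captures the command's output
echo "$n $w"                   # → 6 HELLO
```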
So, to fix your issue, it should be as simple as to replace line 4 with:
n=$(ls $f | cut -d_ -f6)
This replaces the outer double brackets with single, and removes the additional brackets around ls $f which should be unnecessary.
The arithmetic error can be avoided by adding spaces between parentheses. You are already using var=$((arithmetic expression)) correctly elsewhere in your script, so it should be easy to see why $( ((ls "$f") | cut -d_ -f6)) needs a space. But the subshells are completely superfluous too; you want $(ls "$f" | cut -d_ -f6). Except ls isn't doing anything useful here, either; use $(echo "$f" | cut -d_ -f6). Except the shell can easily, albeit somewhat clumsily, extract a substring with parameter substitution; "${f#*_*_*_*_*_}". Except if you're using Awk in your script anyway, it makes more sense to do this - and much more - in Awk as well.
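To illustrate that parameter-substitution step (the filename below is made up for the demo; the script's real names follow the same sixth-field pattern):

```shell
f=calc_run_A_B_C_12_abc.out
rest=${f#*_*_*_*_*_}   # strip the shortest prefix matching five _-terminated fields
n=${rest%%_*}          # then strip everything from the next _ onward
echo "$n"              # → 12
```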
Here is an attempt at refactoring most of the processing into Awk.
for f in *.out; do
    awk 'BEGIN { OFS=" " }
    # Extract 6th _-separated field from input filename
    FNR==1 { split(FILENAME, f, "_"); t=2*f[6] }
    # If input matches regex, add to array b
    / B / { b[++i] = $1 OFS $4 OFS $5 OFS $6 }
    # If array size reaches t, start overwriting old values
    i==t { i=0; m=t }
    END {
        # Print two prefix lines
        print "$molecule"; print -1, 2;
        # Handle array smaller than t
        if (!m) m=i
        # Print starting from oldest values (index i + 1)
        for(j=1; j<=m; j++) {
            # Wrap to beginning of array at end
            if(i+j > t) i-=t
            print b[i+j]
        }
        print "$end" }' "$f" |
    rev | column -t | rev |
    cat - ./input_ea.in > "${f/foo.out/bar.in}"
done
Notice also how we avoid using a temporary file (this would certainly have been avoidable without the Awk refactoring, too) and how we take care to quote all filename variables in double quotes.
The array b contains (up to) the latest t values from matching lines; we collect these into an array which is constrained to never contain more than t values by wrapping the index i back to the beginning of the array when we reach index t. This "circular array" avoids keeping too many values in memory, which would make the script slow if the input file contains many matches.

Using awk to extract two separate strings

MacOS, Unix
So I have a file in the following stockholm format:
# STOCKHOLM 1.0
#=GS WP_002855993.1/5-168 DE [subseq from] MULTISPECIES: AAC(3) family N-acetyltransferase [Campylobacter]
#=GS WP_002856586.1/5-166 DE [subseq from] MULTISPECIES: aminoglycoside N(3)-acetyltransferase [Campylobacter]
WP_002855993.1/5-168 ------LEHNGKKYSDKDLIDAFYQLGIKRGDILCVHTELmkfgKALLT.K...NDFLKTLLECFFKVLGKEGTLLMP-TF---TYSF------CKNE------VYDKVHSKG--KVGVLNEFFRTSGgGVRRTSDPIFSFAVKGAKADIFLKEN--SSCFGKDSVYEILTREGGKFMLLGLNYG-HALTHYAEE-----
#=GR WP_002855993.1/5-168 PP ......6788899999***********************9333344455.6...8999********************.33...3544......4555......799999975..68********98626999****************999865..689*********************9875.456799996.....
WP_002856586.1/5-166 ------LEFENKKYSTYDFIETFYKLGLQKGDTLCVHTEL....FNFGFpLlsrNEFLQTILDCFFEVIGKEGTLIMP-TF---TYSF------CKNE------VYDKINSKT--KMGALNEYFRKQT.GVKRTNDPIFSFAIKGAKEELFLKDT--TSCFGENCVYEVLTKENGKYMTFGGQG--HTLTHYAEE-----
#=GR WP_002856586.1/5-166 PP ......5566677788889999******************....**9953422246679*******************.33...3544......4455......799998876..589**********.******************99999886..689******************999765..5666***96.....
#=GC PP_cons ......6677788899999999*****************9....77675.5...68889*******************.33...3544......4455......799999976..689*******998.8999**************99999876..689******************9998765.466699996.....
#=GC RF xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx....xxxxx.x...xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
WP_002855993.1/5-168 -----------------------------------------------------------------------------------------------------
#=GR WP_002855993.1/5-168 PP .....................................................................................................
WP_002856586.1/5-166 -----------------------------------------------------------------------------------------------------
#=GR WP_002856586.1/5-166 PP .....................................................................................................
#=GC PP_cons .....................................................................................................
#=GC RF xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
//
And I've created a script to extract the IDs I want, in this case, WP_002855993.1 and WP_002856586.1, and search through another file to extract DNA sequences with the appropriate IDs. The script is as follows:
#!/bin/bash
for fileName in *.sto; do
    protID=$(grep -o "WP_.\{0,11\}" $fileName | sort | uniq)
    echo $protID
    file=$(echo $fileName | cut -d '_' -f 1,2,3)
    file=$(echo $file'_protein.faa')
    echo $file
    if [ -n "$protID" ]; then
        gawk "/^>/{N=0}/^.*$protID/{N=1} {if(N)print}" $file >> sequence_protein.file
    fi
done
And here's an example of the type of file I'm looking through:
>WP_002855993.1 MULTISPECIES: AAC(3) family N-acetyltransferase [Campylobacter]
MKYFLEHNGKKYSDKDLIDAFYQLGIKRGDILCVHTELMKFGKALLTKNDFLKTLLECFFKVLGKEGTLLMPTFT
>WP_002856586.1 MULTISPECIES: aminoglycoside N(3)-acetyltransferase [Campylobacter]
MKYLLEFENKKYSTYDFIETFYKLGLQKGDTLCVHTELFNFGFPLLSRNEFLQTILDCFFEVIGKEGTLIMPTFT
YSFCKNEVYDKINSKTKMGALNEYFRKQTGVKRTNDPIFSFAIKGAKEELFLKDTTSCFGENCVYEVLTKENGKY
>WP_002856595.1 MULTISPECIES: acetyl-CoA carboxylase biotin carboxylase subunit [Campylobacter]
MNQIHKILIANRAEIAVRVIRACRDLHIKSVAVFTEPDRECLHVKIADEAYRIGTDAIRGYLDVARIVEIAKACG
This script works if I have one ID, but in some cases I get two IDs, and I get an error, because I think it's looking for an ID like "WP_002855993.1 WP_002856586.1". Is there a way to modify this script so it looks for two separate occurrences? I guess it's something with the gawk command, but I'm not sure what exactly. Thanks in advance!
An update to the original script:
#!/usr/bin/env bash
for file_sto in *.sto; do
    file_faa=$(echo $file_sto | cut -d '_' -f 1,2,3)
    file_faa=${file_faa}"_protein.faa"
    awk '(NR==FNR) { match($0,/WP_.{0,11}/);
                     if (RSTART > 0) a[substr($0,RSTART,RLENGTH)]++
                     next; }
         ($1 in a) { print RS $0 }' $file_sto RS=">" $file_faa >> sequence_protein.file
done
The awk part can probably even be reduced to :
awk '(NR==FNR) { if ($0 ~ /^WP_/) a[$1]++; next }
($1 in a) { print RS $0 }' FS='/' $file_sto FS=" " RS=">" $file_faa
This awk script does the following:
Set the field separator FS to / and read file $file_sto.
When reading $file_sto the record number NR is the same as the file record number FNR.
(NR==FNR) { if ($0 ~ /^WP_/) a[$1]++; next }: this line works only on $file_sto due to the condition in front. It checks if the line starts with WP_. If it does, it stores the first field $1 (separated by FS, which is a /) in an array a; it then skips to the next record in the file (next).
If we finished reading file $file_sto, we set the field separator back to a single space FS=" " (see section Regular expression) and the record separator RS to > and start reading file $file_faa. The latter implies that $0 will contain all lines between two > characters, and the first field $1 is the protID.
Reading $file_faa, the file record number FNR is restarted from 1 while NR is not reset. Hence the first awk line is skipped.
($1 in a){ print RS $0 } if the first field is in the array a, print the record with the record separator in front of it.
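A stripped-down demo of the same NR==FNR two-file idiom (file names and contents here are made up for the demo):

```shell
printf '%s\n' A C > ids.txt
printf '%s\n' 'A 1' 'B 2' 'C 3' > data.txt
# First file: remember the IDs as array indices. Second file: a row is
# printed (awk's default action) when its first field was remembered.
awk '(NR==FNR){ids[$1]; next} ($1 in ids)' ids.txt data.txt
rm ids.txt data.txt
```

This prints the rows "A 1" and "C 3", skipping "B 2".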
Fixing the original script:
If you want to keep your original script, you could store the protIDs in a list and then loop over the list:
#!/bin/bash
for fileName in *.sto; do
    protID_list=( $(grep -o "WP_.\{0,11\}" $fileName | sort | uniq) )
    echo ${protID_list[@]}
    file=$(echo $fileName | cut -d '_' -f 1,2,3)
    file=$(echo $file'_protein.faa')
    echo $file
    for protID in ${protID_list[@]}; do
        if [ -n "$protID" ]; then
            gawk "/^>/{N=0}/^.*$protID/{N=1} {if(N)print}" $file >> sequence_protein.file
        fi
    done
done
Assuming your input is in a file named test, the following command gives you only the IDs:
$ awk '{print $1}' test | grep 'WP_'
gives me output:
WP_002855993.1/5-168
WP_002856586.1/5-166
WP_002855993.1/5-168
WP_002856586.1/5-166
Do you want output like that?

Counting palindromes in a text file

Having followed this thread BASH Finding palindromes in a .txt file I can't figure out what am I doing wrong with my script.
#!/bin/bash
search() {
tr -d '[[:punct:][:digit:]#]' \
| sed -E -e '/^(.)\1+$/d' \
| tr -s '[[:space:]]' \
| tr '[[:space:]]' '\n'
}
search "$1"
paste <(search <"$1") <(search < "$1" | rev) \
| awk '$1 == $2 && (length($1) >=3) { print $1 }' \
| sort | uniq -c
All I'm getting from this script is the output of the whole text file. I want to output only palindromes of length >= 3 and count them, such as
425 did
120 non
etc. My text file is called sample.txt, and every time I run the script with cat sample.txt | source palindrome I get the message 'bash: : No such file or directory'.
Using awk and sed
awk 'function palindrome(str) {len=length(str); for(k=1; k<=len/2+len%2; k++) { if(substr(str,k,1)!=substr(str,len+1-k,1)) return 0 } return 1 } {for(i=1; i<=NF; i++) {if(length($i)>=3){ gsub(/[^a-zA-Z]/,"",$i); if(length($i)>=3) {$i=tolower($i); if(palindrome($i)) arr[$i]++ }} } } END{for(i in arr) print arr[i],i}' file | sed -E '/^[0-9]+ (.)\1+$/d'
Tested on a 1.2GB file; execution time was ~4m 40s (i5-6440HQ @ 2.60GHz/4 cores/16GB)
Explanation :
awk '
function palindrome(str)                      # Function to check palindrome
{
    len=length(str);
    for(k=1; k<=len/2+len%2; k++)
    {
        if(substr(str,k,1)!=substr(str,len+1-k,1))
            return 0
    }
    return 1
}
{
    for(i=1; i<=NF; i++)                      # For each field in a record
    {
        if(length($i)>=3)                     # if length>=3
        {
            gsub(/[^a-zA-Z]/,"",$i);          # remove non-alpha characters from it
            if(length($i)>=3)                 # check length again after removal
            {
                $i=tolower($i);               # convert to lowercase
                if(palindrome($i))            # check if it is a palindrome
                    arr[$i]++                 # and store it in an array
            }
        }
    }
}
END{for(i in arr) print arr[i],i}' file | sed -E '/^[0-9]+ (.)\1+$/d'
sed -E '/^[0-9]+ (.)\1+$/d': from the final result, remove strings composed of just one repeated character, like AAA, BBB, etc.
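A tiny demo of that filter on counted output (the input lines are made up):

```shell
# Lines whose word part is a single repeated character are deleted;
# genuine palindromes like "bob" survive.
printf '%s\n' '3 aaa' '2 bob' '5 AAA' | sed -E '/^[0-9]+ (.)\1+$/d'
# → 2 bob
```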
Old Answer (Before EDIT)
You can try below steps if you want to :
Step 1 : Pre-processing
Remove all unnecessary chars and store the result in temp file
tr -dc 'a-zA-Z\n\t ' <file | tr ' ' '\n' > temp
tr -dc 'a-zA-Z\n\t ' This will remove all except letters,\n,\t, space
tr ' ' '\n' This will convert space to \n to separate each word in newlines
Step-2: Processing
grep -wof temp <(rev temp) | sed -E -e '/^(.)\1+$/d' | awk 'length>=3 {a[$1]++} END{ for(i in a) print a[i],i; }'
grep -wof temp <(rev temp) This will give you all palindromes
-w : Select only those lines containing matches that form whole words.
For example : level won't match with levelAAA
-o : Print only the matched group
-f : To use each string in temp file as pattern to search in <(rev temp)
sed -E -e '/^(.)\1+$/d': This will remove words formed of same letters like AAA, BBBBB
awk 'length>=3 {a[$1]++} END{ for(i in a) print a[i],i; }' : This will filter words having length>=3 and counts their frequency and finally prints the result
Example :
Input File :
$ cat file
kayak nalayak bob dad , pikachu. meow !! bhow !! 121 545 ding dong AAA BBB done
kayak nalayak bob dad , pikachu. meow !! bhow !! 121 545 ding dong AAA BBB done
kayak nalayak bob dad , pikachu. meow !! bhow !! 121 545 ding dong AAA BBB done
Output:
$ tr -dc 'a-zA-Z\n\t ' <file | tr ' ' '\n' > temp
$ grep -wof temp <(rev temp) | sed -E -e '/^(.)\1+$/d' | awk 'length>=3 {a[$1]++} END{ for(i in a) print a[i],i; }'
3 dad
3 kayak
3 bob
Just a quick Perl alternative:
perl -0nE 'for( /(\w{3,})/g ){ $a{$_}++ if $_ eq reverse($_)}
END {say "$_ $a{$_}" for keys %a}'
in Perl, $_ should be read as "it".
for( /(\w{3,})/g ) ... for all relevant words (may need some work to reject false positives like "12a21")
if $_ eq reverse($_) ... if it is palindrome
END {say "$_ $a{$_}" for...} ... tell us all the its and its number
Thanks to sokowi and batMan.
Running the Script
The script expects that the file is given as an argument. The script does not read stdin.
Remove the line search "$1" in the middle of the script. It is not part of the linked answer.
Make the script executable using chmod u+x path/to/palindrome.
Call the script using path/to/palindrome path/to/sample.txt. If all the files are in the current working directory, then the command is
./palindrome sample.txt
Alternative Script
Sometimes the linked script works and sometimes it doesn't. I haven't found out why. However, I wrote an alternative script which does the same and is also a bit cleaner:
#! /bin/bash
grep -Po '\w{3,}' "$1" | grep -Evw '(.)\1*' | sort > tmp-words
grep -Fwf <(rev tmp-words) tmp-words | uniq -c
rm tmp-words
Save the script, make it executable, and call it with a file as its first argument.

Count number of names starts with particular character in file

I have the following file:
FirstName, FamilyName, Address, PhoneNo
The file is sorted by family name. How can I count the number of family names starting with a particular character?
The output should look like this:
A: 2
B: 1
...
With awk:
awk '{print substr($2, 1, 1)}' file|
uniq -c|
awk '{print $2 ": " $1}'
OK, no awk. Here's with sed:
sed 's/[^,]*, \(.\).*/\1/' file|
uniq -c|
sed 's/ *\([0-9]*\) \(.*\)/\2: \1/'
OK, no sed. Here's with python:
import csv

r = csv.reader(open(file_name, 'r'))
d = {}
for i in r:
    # i[1] is " FamilyName" (note the leading space), so i[1][1] is the initial
    d[i[1][1]] = d.get(i[1][1], 0) + 1
for (k, v) in d.items():
    print("%s: %s" % (k, v))
while read -r f l r; do echo "$l"; done < inputfile | cut -c 1 | sort | uniq -c
Just the Shell
#! /bin/bash
##### Count occurrences of family-name initials
#FirstName, FamilyName, Address, PhoneNo
exec <<EOF
Isusara, Ali, Someplace, 022-222
Rat, Fink, Some Hole, 111-5555
Louis, Frayser, whaterver, 123-1144
Janet, Hayes, whoever St, 111-5555
Mary, Holt, Henrico VA, 222-9999
Phillis, Hughs, Some Town, 711-5525
Howard, Kingsley, ahahaha, 222-2222
EOF
while read first family rest
do
    init=${family:0:1}
    [ -n "$oinit" -a "$init" != "$oinit" ] && {
        echo $oinit : $count
        count=0
    }
    oinit=$init
    let count++
done
echo $oinit : $count
Running
frayser@gentoo ~/doc/Answers/src/SH/names $ sh names.sh
A : 1
F : 2
H : 3
K : 1
frayser@gentoo ~/doc/Answers/src/SH/names $
To read from a file, remove the here document, and run:
chmod +x names.sh
./names.sh <file
The "hard way" — no use of awk or sed, exactly as asked for. If you're not sure what any of these commands mean, you should definitely look at the man page for each one.
INTERMED=`mktemp` # Creates a temporary file
COUNTS_L=`mktemp` # A second...
COUNTS_R=`mktemp` # A third...
cut -d , -f 2 | # Extracts the FamilyName field only
tr -d '\t ' | # Deletes spaces/tabs
cut -c 1 | # Keeps only the first character
# on each line
tr '[:lower:]' '[:upper:]' | # Capitalizes all letters
sort | # Sorts the list
uniq -c > $INTERMED # Counts how many of each letter
# there are
cut -c1-7 $INTERMED | # Cuts out the LHS of the temp file
tr -d ' ' > $COUNTS_R # Must delete the padding spaces though
cut -c9- $INTERMED > $COUNTS_L # Cut out the RHS of the temp file
# Combines the two halves into the final output in reverse order
paste -d ' ' /dev/null $COUNTS_R | paste -d ':' $COUNTS_L -
rm $INTERMED $COUNTS_L $COUNTS_R # Cleans up the temp files
awk one-liner:
awk '
{count[substr($2,1,1)]++}
END {for (init in count) print init ": " count[init]}
' filename
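For instance, with a few of the sample rows from the here-document above inlined via printf (for...in gives no fixed order, so the output is piped through sort for the demo):

```shell
# Count family-name initials: $2 is "Fink,", "Frayser,", "Hayes," etc.,
# so substr($2,1,1) is the initial of the family name.
printf '%s\n' 'Rat, Fink, Some Hole, 111-5555' \
              'Louis, Frayser, whatever, 123-1144' \
              'Janet, Hayes, whoever St, 111-5555' \
    | awk '{count[substr($2,1,1)]++} END {for (init in count) print init ": " count[init]}' \
    | sort
```

This prints "F: 2" and "H: 1".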
Prints how many words start with each letter:
for i in {a..z}; do echo -n "$i: "; find path/to/folder -type f -exec sed "s/ /\n/g" {} \; | grep -c "^$i"; done
