I want a loop that can find the letter that ends words most frequently in multiple languages and output the data in columns.
So far I have
count="./wordlist/french/fr.txt ./wordlist/spanish/es.txt ./wordlist/german/de.$
lang="French Spanish German Portuguese Italian"
(
echo -e "Language Letter Count"
for i in $count
do
(for j in {a..z}
do
echo -e "LANG" $j $(grep -c $j\> $i)
done
) | sort -k3 -rn | head -1
done
) | column -t
I want it to output as shown:
Language Letter Count
French e 196195
Spanish a 357193
German e 251892
Portuguese a 217178
Italian a 216125
Instead I get:
Language Letter Count
LANG z 0
LANG z 0
LANG z 0
LANG z 0
LANG z 0
The words files have the format:
Word Freq(#) where the word and its frequency are delimited by a space.
This means I have 2 problems;
First, the grep command is not handling the argument $j\> to find a character at the end of a word. I have tried using grep -E $j\> and grep '$j\>' and neither worked.
The second problem is that I don't know how to output the name of the language (in the variable lang). Nesting another for loop did not work when I tried it like this (or with i and k in the opposite order):
(
for i in $count
do
for k in $lang
do
for j in {a..z}
do
echo -e $k $j $(grep -c $j\> $i)
done
) | sort -k3 -rn | head -1
done
done
) | column -t
Since this outputs multiples of the name of the language "$k" in places where it does not belong.
I know that I can just copy and paste the loop for each language, but I would like to extend this to every language.
Thanks in advance!
grep word boundaries
To make special delimiters (e.g. \> for word-end) work with egrep when being called from the shell, you should put them into "quotes".
count=$(egrep -c "${char}\>" "${file}")
Btw, you really should use double quote ("), because single quotes will prevent variable-expansion. (e.g. in j="foo"; k='$j\>', the first character of k's value will be $ rather than f)
Language name display
Getting the right language string is a bit more tricky; here's a few suggestions:
Derive the displayed language from the path of the wordlist:
lang=${file%/*}
lang=${lang##*/}
With bash (though not with dash and some other shells) you might even do lang=${lang^} to capitalize the string.
Lookup the proper language name in a dictionary. Bash-4 has dictionaries built in, but you can also use filebased dicts:
$ cat languages.txt
./wordlist/french/fr.txt Français
./wordlist/english/en.txt English
./wordlist/german/de.txt Deutsch
$ file=./wordlist/french/fr.txt
$ lang=$(egrep "^${file}\>" languages.txt | awk '{print $2}')
You can also iterate over file,lang pairs, e.g.
languages="french/fr,French spanish/es,Español german/de,Deutsch"
for l in $languages; do
file=./wordlist/${l%,*}.txt
lang=${l#*,}
# ...
done
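The built-in dictionary mentioned above could look roughly like this. It is only a sketch, assuming bash 4 or newer; the array name `lang_of` and the helper `lookup_lang` are made up for illustration, and the paths are the ones from the file-based example.

```shell
#!/usr/bin/env bash
# Requires bash 4+ for associative arrays (declare -A).
# lang_of and lookup_lang are hypothetical names used only for this sketch.
declare -A lang_of=(
  [./wordlist/french/fr.txt]="Français"
  [./wordlist/english/en.txt]="English"
  [./wordlist/german/de.txt]="Deutsch"
)

# Print the display name for a wordlist path,
# falling back to the path itself when there is no entry.
lookup_lang() {
  local file=$1
  printf '%s\n' "${lang_of[$file]:-$file}"
}

lookup_lang ./wordlist/french/fr.txt
```

The fallback to the raw path is a design choice for the sketch; you may prefer to print an error instead.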
Taking word frequencies into account
The third problem I see (though I might misunderstand the problem) is that you are not taking the word frequency into account. For example, a word A that is used 1000 times more often than the word B will only get counted once (just like B).
You can use awk to sum up the word frequencies of matching words:
count=$(egrep "${char}\>" "${file}" | awk '{s+=$2} END {print s}')
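To see the difference between counting matching lines and summing their frequencies, here is a small sketch. The sample data is made up, and `freq_sum` is a hypothetical helper name; it assumes a grep that understands `\>` (GNU grep does).

```shell
#!/usr/bin/env bash
# Made-up sample in the "word frequency" format from the question.
sample='the 1000
axe 3
tea 2
dog 7'

# Sum the frequency column of all words ending in the given letter.
# Usage: freq_sum LETTER < wordlist
freq_sum() {
  # "s + 0" prints 0 instead of an empty string when nothing matched.
  grep -E "$1\>" | awk '{s += $2} END {print s + 0}'
}

printf '%s\n' "$sample" | grep -Ec 'e\>'   # how many words end in e
printf '%s\n' "$sample" | freq_sum e       # total frequency of those words
```

With `grep -c` the words "the" and "axe" weigh the same; with the awk sum, "the" dominates because of its frequency.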
All Together Now
So a full solution to the problem could look like:
languages="french/fr,French spanish/es,Español german/de,Deutsch"
(
echo -e "Language Letter Count"
for l in ${languages}; do
file=./wordlist/${l%,*}.txt
lang=${l#*,}
for char in {a..z}; do
#count=$(egrep -c "${char}\>" "${file}")
count=$(egrep "${char}\>" "${file}" | awk '{s+=$2} END {print s}')
echo ${lang} ${char} ${count}
done | sort -k3 -rn | head -1
done
) | column -t
Related
I need to extract from the file the words that contain certain letters in a certain amount.
I apologize if this question has been resolved in the past, I just did not find anything that fits what I am looking for.
File:
wab 12aaabbb abababx ab ttttt baaabb zabcabc
baab baaabb cbaab ab ccabab zzz
For example
1. If I chose the letters a and the number is 1 the output should be:
wab
ab
ab
//only the words that contains a and the char appear in the word 1 time
2. If I chose the letters a,b and the number is 3, the output should be:
12aaabbb
abababx
baaabb
//only the word contains a,b, and both chars appear in the word 3 times
3. If I chose the letters a,b,c and the number 2, the output should be:
ccabab
zabcabc
//only the words that contain a,b,c and each char appears in the word 2 times
Is it possible to find 2 letters in the same script?
I was able to find in a single letter but I get only the words where the letters appear in sequence and I do not want to find only these words, that's what I did:
egrep '([a])\1{N-1}' file
And another problem I can not get only the specific words, I get all file and the letter I am looking for "a" in red.
I tried using -w but it does not display anything.
::: EDIT :::
try to edit what you did to a for
i=$1
fileName=$2
letters=${@:3}
tr -s '[:space:]' '\n' < $fileName* |
for letter in $letters; do
grep -E "^[^$letter]*($letter[^$letter]*){$i}$"
done | uniq
There are various ways to split input so that grep sees a single word per line. tr is most common. For example:
tr -s '[:space:]' '\n' < file | ...
We can build a function to find a specific number of a particular letter:
NofL(){
num=$1
letter=$2
regex="^[^$letter]*($letter[^$letter]*){$num}$"
grep -E "$regex"
}
Then:
# letter=a number=1
tr -s '[:space:]' '\n' < file | NofL 1 a
# letters=a,b number=3
tr -s '[:space:]' '\n' < file | NofL 3 a | NofL 3 b
# letters=a,b,c number=2
tr -s '[:space:]' '\n' < file | NofL 2 a | NofL 2 b | NofL 2 c
Regexes are not really suited for that job as there are more efficient ways, but it is possible using repeated matching. We first select all words, from those we select words with n as, and from those we select words with n bs and so on.
Example for n=3 and a, b:
grep -Eo '[[:alnum:]]+' |
grep -Ex '[^a]*a[^a]*a[^a]*a[^a]*' |
grep -Ex '[^b]*b[^b]*b[^b]*b[^b]*'
To auto-generate such a command from an input like 3 a b, you need to dynamically create a pipeline, which is possible, but also a hassle:
exactly_n_times_char() {
(( $# >= 2 )) || { cat; return; }
local n="$1" char="$2" regex
regex="[^$char]*($char[^$char]*){$n}"
shift 2
grep -Ex "$regex" | exactly_n_times_char "$n" "$@"
}
grep -Eo '[[:alnum:]]+' file.txt | exactly_n_times_char 3 a b
With PCREs (requires GNU grep or pcregrep) the check can be done in a single regex:
exactly_n_times_char() {
local n="$1" regex=""
shift
for char; do # could be done without a loop using sed on $*
regex+="(?=[^$char\\W]*($char[^$char\\W]*){$n}\\b)"
done
regex+='\w+'
grep -Pow "$regex"
}
exactly_n_times_char 3 a b < file.txt
If a matching word appears multiple times (like baaabb in your example) it is printed multiple times too. You can filter out duplicates by piping through sort -u but that will change the order.
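If the original order matters, one option is the common awk "seen" idiom, which is not part of the answer above. This is a sketch; `dedupe_keep_order` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Drop repeated lines while keeping the first occurrence in its original position.
# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++ prints it once.
dedupe_keep_order() {
  awk '!seen[$0]++'
}

printf '%s\n' baaabb abababx baaabb | dedupe_keep_order
```

You would append it to the end of either pipeline in place of `sort -u`.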
A method using sed and bash would be:
#!/bin/bash
file=$1
n=$2
chars=$3
for ((i = 0; i < ${#chars}; ++i)); do
c=${chars:i:1}
args+=(-e)
args+=("/^\([^$c]*[$c]\)\{$n\}[^$c]*\$/!d")
done
sed "${args[@]}" <(tr -s '[:blank:]' '\n' < "$file")
Notice that filename, count, and characters are parameterized. Use it as
./script filename 2 abc
which should print out
zabcabc
ccabab
given the file content in the question.
An implementation in pure bash, without calling an external program, could be:
#!/bin/bash
readonly file=$1
readonly n=$2
readonly chars=$3
while read -ra words; do
for word in "${words[@]}"; do
for ((i = 0; i < ${#chars}; ++i)); do
c=${word//[^${chars:i:1}]}
(( ${#c} == n )) || continue 2
done
printf '%s\n' "$word"
done
done < "$file"
You can match a string containing exactly N occurrences of character X with the (POSIX-extended) regexp [^X]*(X[^X]*){N}. To do this for multiple characters you could chain them, and the traditional way to process one 'word' at a time, simplistically defined as a sequence of non-whitespace chars, is like this
<infile tr -s ' \t\n' '\n' | grep -Ex '[^a]*(a[^a]*){3}' | \grep -Ex '[^b]*(b[^b]*){3}'
# may need to add \r on Windows-ish systems or for Windows-derived data
If you get colorized output from egrep and grep and maybe some other utilities it's usually because in a GNU-ish environment you -- often via a profile that was automatically provided and you didn't look at or modify -- set aliases to turn them into e.g. egrep --color=auto or possibly/rarely =always; using \grep or command grep or the pathname such as /usr/bin/grep disables the alias, or you could just un-set it/them. Another possibility is you may have envvar(s) set in which case you need to remove or suppress it/them, or explicitly say --color=never, or (somewhat hackily) pipe the output through ... | cat which has the effect of making [e]grep's stdout a pipe not a tty and thus turning off =auto.
However, GNU awk (not necessarily others) can also do this more directly:
<infile awk -vRS='[ \t\n]+' -F '' '{delete f;for(i=1;i<=NF;i++)f[$i]++}
f["a"]==3&&f["b"]==3'
or to parameterize the criteria:
<infile awk -vRS='[ \t\n]+' -F '' 'BEGIN{split("ab",w,"");n=3}
{delete f;for(i=1;i<=NF;i++)f[$i]++;s=1;for(t in w)if(f[w[t]]!=n)s=0} s'
perl can do pretty much everything awk can do, and so can some other general-purpose tools, but I leave those as exercises.
I am listing the AWS region names.
us-east-1
ap-southeast-1
I want to split the string to print specific first characters delimited by - i.e. 'two characters'-'one character'-'one character'. So us-east-1 should be printed as use1 and ap-southeast-1 should be printed as aps1
I have tried this and it's giving me expected results. I was thinking if there is a shorter way to achieve this.
region=us-east-1
regionlen=$(echo -n $region | wc -m)
echo $region | sed 's/-//' | cut -c 1-3,`expr $regionlen - 2`-`expr $regionlen - 1`
How about using sed:
echo "$region" | sed -E 's/^(.[^-]?)[^-]*-(.)[^-]*-(.).*$/\1\2\3/'
Explanation: the s/pattern/replacement/ command picks out the relevant parts of the region name, replacing the entire name with just the relevant bits. The pattern is:
^ - the beginning of the string
(.[^-]?) - the first character, and another (if it's not a dash)
[^-]* - any more things up to a dash
- - a dash (the first one)
(.) - The first character of the second word
[^-]*- - the rest of the second word, then the dash
(.) - The first character of the third word
.*$ - Anything remaining through the end
The bits in parentheses get captured, so \1\2\3 pulls them out and replaces the whole thing with just those.
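As a worked example, the same substitution can be wrapped in a small function and run on both region names from the question. This is just a sketch; `short_region` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Shorten an AWS-style region name using the sed expression explained above:
# up to two chars of the first word, then the first char of the second and third words.
short_region() {
  printf '%s\n' "$1" | sed -E 's/^(.[^-]?)[^-]*-(.)[^-]*-(.).*$/\1\2\3/'
}

short_region us-east-1
short_region ap-southeast-1
```

For `us-east-1` the three captures are `us`, `e` and `1`; for `ap-southeast-1` they are `ap`, `s` and `1`.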
IFS influencing field splitting step of parameter expansion:
$ str=us-east-2
$ IFS=- eval 'set -- $str'
$ echo $#
3
$ echo $1
us
$ echo $2
east
$ echo $3
2
No external utilities; just processing in the language.
This is how smartly written build configuration scripts parse version numbers like 1.13.4 and architecture strings like i386-gnu-linux.
The eval can be avoided, if we save and restore IFS.
$ save_ifs=$IFS; IFS=-; set -- $str; IFS=$save_ifs
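Another way to keep the IFS change contained is to do the splitting inside a function. This is a sketch that assumes a shell with `local` (bash, dash and most others have it, though it is not in POSIX); `split_on_dash` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Split the first argument on "-" and print one field per line.
split_on_dash() {
  local IFS=-   # the change is undone automatically when the function returns
  set -f        # assumption: turn off globbing so a "*" in the input is not expanded
  set -- $1     # unquoted on purpose: this is where field splitting happens
  set +f        # note: this re-enables globbing even if the caller had it off
  printf '%s\n' "$@"
}

split_on_dash us-east-2
```

Because `set --` here only replaces the function's own positional parameters, the caller's `$1`, `$2`, ... are left alone.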
Using bash, and assuming that you need to distinguish between things like southwest and southeast:
s=ap-southwest-1
a=${s:0:2}
b=${s#*-}
b=${b%-*}
c=${s##*-}
bb=
case "$b" in
south*) bb+=s ;;&
north*) bb+=n ;;&
*east*) bb+=e ;;
*west*) bb+=w ;;
esac
echo "$a$bb$c"
How about:
region="us-east-1"
echo "$region" | (IFS=- read -r a b c; echo "$a${b:0:1}${c:0:1}")
use1
A simple sed -
$: printf "us-east-1\nap-southeast-1\n" |
sed -E 's/-(.)[^-]*/\1/g'
To keep noncardinal specifications like southeast distinct from south at the cost of adding an optional additional character -
$: printf "us-east-1\nap-southeast-1\n" |
sed -E '
s/north/n/;
s/south/s/;
s/east/e/;
s/west/w/;
s/-//g;'
If you could have south-southwest, add g to those directional reductions.
If you MUST have exactly 4 characters of output, I recommend mapping the eight or 16 map directions to specific characters, so that north is N, northeast is maybe O and northwest M... that sort of thing.
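A fixed-width mapping along those lines might be sketched like this. Only N, O and M come from the suggestion above; every other letter, and the names `dir_code` and `short_region4`, are made up for illustration. It assumes bash (here-strings and substring expansion).

```shell
#!/usr/bin/env bash
# Map a direction word to one character. The letter choices are arbitrary.
dir_code() {
  case $1 in
    north)     echo N ;;
    northeast) echo O ;;
    northwest) echo M ;;
    south)     echo S ;;
    southeast) echo T ;;
    southwest) echo R ;;
    east)      echo E ;;
    west)      echo W ;;
    *)         echo "${1:0:1}" ;;  # fallback: first letter of anything else
  esac
}

# Build a 4-character code: 2-letter prefix + direction code + first char of the number.
short_region4() {
  local a b c
  IFS=- read -r a b c <<<"$1"
  printf '%s%s%s\n' "$a" "$(dir_code "$b")" "${c:0:1}"
}

short_region4 us-east-1
short_region4 ap-southeast-1
```

The result is only 4 characters when the first word has two letters and the direction is a single word, which holds for the two examples in the question.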
Hi, I have a file with following contents
> 1234 alphabet /vag/one/arun
> 1454 bigdata /home/two/ogra
> 5684 apple /vinay/three/dire
but i want the output to be like
> 1234 alphabet one
> 1454 bigdata two
> 5684 apple three
awk '{
split($NF,ar,"/");
$NF=ar[3]
for (i=1;i<=NF;i++) {
printf "%s ",$i
}
printf "\n"
}' filename
This takes the last space-delimited field and splits it on "/" into the array ar, sets the last field to the third element of ar (the first element is empty because the path starts with "/"), and then loops through the fields, printing each one.
Cut each input line into pieces, throw the parts you don't need into a wastebasket, and reassemble what is left. For instance:
grep -Eo '(^|/[^/]+/|/[^/]+$|[^/]+)' <INPUTFILE| grep -Fv /|xargs -L 2 -d '\n' echo >OUTPUTFILE
You can do it simply by controlling IFS (the Internal Field Separator): when it includes the '/' character, field splitting also happens at each '/', allowing you to read a/b/c into separate variables. Then it's just a matter of printing the variables you want, e.g. with your original contents in file,
$ while IFS="${IFS}/" read -r n l a b c; do echo "$n $l $b"; done < file
1234 alphabet one
1454 bigdata two
5684 apple three
I am trying to write a shell script which will take an SQL file as input. Example SQL file:
SELECT *
FROM %%DB.TBL_%%TBLEXT
WHERE CITY = '%%CITY'
Now the script should extract all variables, which in this case everything starting with %%. So the output file will be something as below:
%%DB
%%TBLEXT
%%CITY
Now I should be able to extract the matching values from the user's .profile file for these variables and create the SQL file with the proper values.
SELECT *
FROM tempdb.TBL_abc
WHERE CITY = 'Chicago'
As of now I am trying to generate the file1 which will contain all the variables. Below code sample -
sed "s/[(),']//g" "T:/work/shell/sqlfile1.sql" | awk '/%%/{print $NF}' | awk '/%%/{print $NF}' > sqltemp2.sql
takes me till
%%DB.TBL_%%TBLEXT
%%CITY
Can someone help me in getting to file1 listing the variables?
You can use grep and sort to get a list of unique variables, as per the following transcript:
$ echo "SELECT *
FROM %%DB.TBL_%%TBLEXT
WHERE CITY = '%%CITY'" | grep -o '%%[A-Za-z0-9_]*' | sort -u
%%CITY
%%DB
%%TBLEXT
The -o flag to grep instructs it to only print the matching parts of lines rather than the entire line, and also outputs each matching part on a distinct line. Then sort -u just makes sure there are no duplicates.
In terms of the full process, here's a slight modification to a bash script I've used for similar purposes:
# Define all translations.
declare -A xlat
xlat['%%DB']='tempdb'
xlat['%%TBLEXT']='abc'
xlat['%%CITY']='Chicago'
# Check all variables in input file.
okay=1
for key in $(grep -o '%%[A-Za-z0-9_]*' input.sql | sort -u) ; do
if [[ "${xlat[$key]}" == "" ]] ; then
echo "Bad key ($key) in file:"
grep -n "${key}" input.sql | sed 's/^/ /'
okay=0
fi
done
if [[ ${okay} -eq 0 ]] ; then
exit 1
fi
# Process input file doing substitutions. Fairly
# primitive use of sed, must change to use sed -i
# at some point.
# Note we sort keys based on descending length so we
# correctly handle extensions like "NAME" and "NAMESPACE",
# doing the longer ones first makes it work properly.
cp input.sql output.sql
for key in $( (
for key in ${!xlat[@]} ; do
echo ${key}
done
) | awk '{print length($0)":"$0}' | sort -rn | cut -d':' -f2) ; do
sed "s/${key}/${xlat[$key]}/g" output.sql >output2.sql
mv output2.sql output.sql
done
cat output.sql
It first checks that the input file doesn't contain any keys not found in the translation array. Then it applies sed substitutions to the input file, one per translation, to ensure all keys are substituted with their respective values.
This should be a good start, though there may be some edge cases such as if your keys or values contain characters sed would consider important (like / for example). If that is the case, you'll probably need to escape them such as changing:
xlat['%%UNDEFINED']='0/0'
into:
xlat['%%UNDEFINED']='0\/0'
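Instead of escaping by hand, you could escape keys and values automatically before building the sed command. This is only a sketch: the helper names are made up, it assumes `/` is the `s` command delimiter and basic regular expressions, and it does not handle newlines inside values.

```shell
#!/usr/bin/env bash
# Escape a string so sed treats it as a literal pattern (BRE) in s/.../.../.
# The inner sed uses | as its own delimiter so that / can sit in the bracket list.
sed_escape_pattern() {
  printf '%s' "$1" | sed 's|[][\\/.^$*]|\\&|g'
}

# Escape a string for the replacement side: backslash, the / delimiter, and &.
sed_escape_replacement() {
  printf '%s' "$1" | sed 's|[\\/&]|\\&|g'
}

key='%%UNDEFINED'
val='0/0'
printf '%s\n' 'x = %%UNDEFINED' |
  sed "s/$(sed_escape_pattern "$key")/$(sed_escape_replacement "$val")/g"
```

In the substitution loop, that would mean passing both `${key}` and `${xlat[$key]}` through these helpers.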
I'm trying to write a bash script that increments the version number which is given in
{major}.{minor}.{revision}
For example.
1.2.13
Is there a good way to easily extract those 3 numbers using something like sed or awk such that I could increment the {revision} number and output the full version number string.
$ v=1.2.13
$ echo "${v%.*}.$((${v##*.}+1))"
1.2.14
$ v=11.1.2.3.0
$ echo "${v%.*}.$((${v##*.}+1))"
11.1.2.3.1
Here is how it works:
The string is split in two parts.
the first one contains everything but the last dot and next characters: ${v%.*}
the second one contains everything but all characters up to the last dot: ${v##*.}
The first part is printed as is, followed by a plain dot and the last part incremented using shell arithmetic expansion: $((x+1))
Pure Bash using an array:
version='1.2.33'
a=( ${version//./ } ) # replace points, split into array
((a[2]++)) # increment revision (or other part)
version="${a[0]}.${a[1]}.${a[2]}" # compose new version
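The same array idea can be wrapped in a function that bumps whichever part you ask for. This is a sketch with a hypothetical name, `bump_version`; it assumes bash and purely numeric components without leading zeros (a part like 08 would be read as invalid octal).

```shell
#!/usr/bin/env bash
# Usage: bump_version VERSION [INDEX]
# INDEX is 0 for major, 1 for minor, 2 for revision (the default).
bump_version() {
  local IFS=.                 # used both for splitting and for re-joining below
  local -a parts
  read -r -a parts <<<"$1"    # split on dots into an array
  local i=${2:-2}
  parts[i]=$(( parts[i] + 1 ))
  echo "${parts[*]}"          # "${parts[*]}" joins with the first character of IFS
}

bump_version 1.2.13
bump_version 1.2.13 1
```

It does not reset the lower parts when bumping major or minor; whether it should depends on your versioning scheme.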
I prefer "cut" command for this kind of things
major=`echo $version | cut -d. -f1`
minor=`echo $version | cut -d. -f2`
revision=`echo $version | cut -d. -f3`
revision=`expr $revision + 1`
echo "$major.$minor.$revision"
I know this is not the shortest way, but for me it's simplest to understand and to read...
Yet another shell way (showing there's always more than one way to bugger around with this stuff...):
$ echo 1.2.3 | ( IFS=".$IFS" ; read a b c && echo $a.$b.$((c + 1)) )
1.2.4
So, we can do:
$ x=1.2.3
$ y=`echo $x | ( IFS=".$IFS" ; read a b c && echo $a.$b.$((c + 1)) )`
$ echo $y
1.2.4
Awk makes it quite simple:
echo "1.2.14" | awk -F \. {'print $1,$2, $3'} will print out 1 2 14.
flag -F specifies separator.
If you wish to save one of the values:
firstVariable=$(echo "1.2.14" | awk -F \. {'print $1'})
I use the shell's own word splitting; something like
oIFS="$IFS"
IFS=.
set -- $version
IFS="$oIFS"
although you need to be careful with version numbers in general due to alphabetic or date suffixes and other annoyingly inconsistent bits. After this, the positional parameters will be set to the components of $version:
$1 = 1
$2 = 2
$3 = 13
($IFS is a set of single characters, not a string, so this won't work with a multicharacter field separator, although you can use IFS=.- to split on either . or -.)
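For instance, splitting on either character could be sketched as follows, here with `read` instead of `set --` so the IFS change applies to that one command only. `parse_version` and the sample version string are made up; bash is assumed for the here-string.

```shell
#!/usr/bin/env bash
# Split a version such as 1.13.4-rc1 on "." or "-".
parse_version() {
  local major minor revision suffix
  # IFS is set only for this read; each of . and - acts as a separator.
  IFS=.- read -r major minor revision suffix <<<"$1"
  echo "major=$major minor=$minor revision=$revision suffix=$suffix"
}

parse_version 1.13.4-rc1
```

The last variable receives whatever is left over, so a longer suffix such as `rc1-build.7` would end up in `suffix` unsplit.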
Inspired by the answer of jlliagre I made my own version which supports version numbers just having a major version given. jlliagre's version will make 1 -> 1.2 instead of 2.
This one is appropriate to both styles of version numbers:
function increment_version() {
local VERSION="$1"
local INCREMENTED_VERSION=
if [[ "$VERSION" =~ .*\..* ]]; then
INCREMENTED_VERSION="${VERSION%.*}.$((${VERSION##*.}+1))"
else
INCREMENTED_VERSION="$((${VERSION##*.}+1))"
fi
echo "$INCREMENTED_VERSION"
}
This will produce the following outputs:
increment_version 1 -> 2
increment_version 1.2 -> 1.3
increment_version 1.2.9 -> 1.2.10
increment_version 1.2.9.101 -> 1.2.9.102
Small variation on fgm's solution using the builtin read command to split the string into an array. Note that the scope of the IFS variable is limited to the read command (so no need to store & restore the current IFS variable).
version='1.2.33'
IFS='.' read -r -a a <<<"$version"
((a[2]++))
printf '%s\n' "${a[@]}" | nl
version="${a[0]}.${a[1]}.${a[2]}"
echo "$version"
See: How do I split a string on a delimiter in Bash?
I'm surprised no one suggested grep yet.
Here's how to get the full version (not limited to the length of x.y.z...) from a file name:
filename="openshift-install-linux-4.12.0-ec.3.tar.gz"
find -name "$filename" | grep -Eo '([0-9]+)(\.?[0-9]+)*' | head -1
# 4.12.0