Check for duplicate word in comma-separated string in bash

Check for duplicate word in comma-separated string in bash - bash

I need to check that a variable does not contain a duplicate entry in a comma-separated string.
For example, inside of $animals, if I have:
,dog,cat,bird,goat,fish,
That would be considered valid since every word is unique.
The string:
,dog,cat,dog,bird,fish,
would be invalid since dog is entered twice.
,dog,cat,dogs,bird,fish,
Would be valid since there is only one instance of dog (dogs is there but allowed since it's not the same exact word)
The string:
,dog,cat,DOG,bird,fish
Would also be invalid since dog is the same as DOG only in uppercase.
Is there any way I can do this? I would put some code I've tried but I don't know what to use to even experiment.
Using bash 3.2.57(1)-release on 10.11.6 El Capitan

Case sensitive:
echo ",dog,cat,dog,bird,fish," | tr ',' '\n' | grep -v '^$' | sort | uniq -c | sort -k 1,1nr
Case insensitive:
echo ",dog,DOG,cat,dog,bird,fish," | tr ',' '\n' | grep -v '^$' | sort -rf | uniq -ci | sort -k 1,1nr
Perform a reverse sort (-r) and do it case insensitive to get the lower-case letters after upper ones. Then uniq them with -i. (You might have to ensure the defined collation LC_COLLATE and maybe locales like LANG and LC_ALL aren't affecting sort behavior).
Then check if the number in the first row > 1

Simple script-based solution
Usage
$ .\script.sh ,dog,dog,cat,
Actual Script
#!/bin/sh
num_duplicated() {
echo $1 |
tr ',' '\n' | # Split each items into its own line
tr '[:upper:]' '[:lower:]' | # Convert everything to lowercase
sort | # Sorts the lines (required for the call to `uniq`
uniq -d | # Passing the `-d` flag to show only duplicated lines
grep -v '^$' | # Passing `-v` on the pattern `^$` to remove empty lines
wc -l # Count the number of duplicate lines
}
main() {
num_duplicates=$(num_duplicated "$1")
if [[ $num_duplicates -eq '0' ]]
then
echo "No duplicates"
else
echo "Contains duplicate(s)"
fi
}
main $1

Related

Argument in bash script

I have the following bash script called countscript.sh
1 #!/bin/bash
2 echo "Running" $0
3 tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed $1 q
But I don't understand how to pass the argument correctly: ( "3" should be the argument $1 of sed).
$ echo " one two two three three three" | ./countscript.sh 3
Running ./countscript.sh
sed: -e expression #1, char 1: missing command
This works fine:
$ echo "one two three four one one four" | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 3q
3 one
2 four
1 two
Thanks.
PS: Anybody else noticed the
bug in this script on page 10, https://www.cs.tufts.edu/~nr/cs257/archive/don-knuth/pearls-2.pdf ?

In the quoted paper, I think you are misreading
sed ${1}q
as
sed ${1} q
and sed does not consider 3 by itself a valid command. The separate argument q is treated as an input file name. If the value of $1 did result in a single valid sed script, you would have likely gotten an error for the missing input file q.
Proper shell programming would dictate this be written as
sed "${1}q"
or
sed "${1} q"
instead; with the space as part of the script, sed correctly outputs the first $1 lines of input and exits.
It's somewhat curious that the authors used sed instead of head - "$1" to output the first few lines, as one of them (McIlroy) essentially invented the idea of the Unix pipeline as a series of special-purpose, narrowly focused tools. Not having read the full paper, I don't know what Knuth and McIlroy's contributions to the paper were; perhaps Bentley just likes sed. :)

When running the following command:
$ echo " one two two three three three" | ./countscript.sh 3
the special variable $1 will be replaced by 3, your first argument. Hence, the script runs:
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 3 q
Notice the space between the 3 and the q. sed does not know what to do, because you give it no command (3 is not a command).
Remove the space, and you should be fine.
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed "${1}q"

Count of matching word, pattern or value from unix korn shell scripting is returning just 1 as count

I'm trying to get the count of a matching pattern from a variable to check the count of it, but it's only returning 1 as the results, here is what I'm trying to do:
x="HELLO|THIS|IS|TEST"
echo $x | grep -c "|"
Expected result: 3
Actual Result: 1
Do you know why is returning 1 instead of 3?
Thanks.

grep -c counts lines not matches within a line.
You can use awk to get a count:
x="HELLO|THIS|IS|TEST"
echo "$x" | awk -F '|' '{print NF-1}'
3
Alternatively you can use tr and wc:
echo "$x" | tr -dc '|' | wc -c
3

$ echo "$x" | grep -o '|' | grep -c .
3
grep -c does not count the number of matches. It counts the number of lines that match. By using grep -o, we put the matches on separate lines.
This approach works just as well with multiple lines:
$ cat file
hello|this|is
a|test
$ grep -o '|' file | grep -c .
3

The grep manual says:
grep, egrep, fgrep - print lines matching a pattern
and for the -c flag:
instead print a count of matching lines for each input file
and there is just one line that match

You don't need grep for this.
pipe_only=${x//[^|]} # remove everything except | from the value of x
echo "${#pipe_only}" # output the length of pipe_only

Try this :
$ x="HELLO|THIS|IS|TEST"; echo -n "$x" | sed 's/[^|]//g' | wc -c
3
With only one pipe with perl:
echo "$x" |
perl -lne 'print scalar(() = /\|/g)'

Bash - stdir words to file

I am trying to store whole user input in a bash variable (appending variable).
Then to sort them etc.
The problem is that for input f.e.:
sdsd fff sss
asdasds
It creates this output:
fff
sdsd
sssasdasds
Expected output is:
asdasds
fff
sdsd
sss
Code follows:
content=''
while read line
do
content+=$(echo "$line")
done
result=`echo "$content" | sed -r 's/[^a-zA-Z ]+/ /g' | tr '[:upper:]' '[:lower:]' | tr ' ' '\n' | sort -u | sed '/^$/d' | sed 's/[^[:alpha:]]/\n/g'`
echo "$result" >> "$dictionary"

You aren't providing a space when you are appending.
content+=$(echo "$line")
You need to make sure there is a space between the end of the old value and the new value.
content+=" $line"
(There's no need for echo for this either as #gniourf_gniourf correctly pointed out.)

Something that will achieve what you're showing in your example:
words_ary=()
while read -r -a line_ary; do
(( ${#line_ary[#]} )) || continue # skip empty lines
words_ary+=( "${line_ary[#],,}" ) # The ,, is to convert to lower-case
done
printf '%s\n' "${words_ary[#]}" | sort -u >> "$dictionary"
We're splitting input into words at spaces and put these words in array line_ary
We're checking that we have a non-empty input
we append each word, converted to lowercase, from input to the array words_ary
finally we sort each word from words_ary and append the sorted words to file $dictionary.

shell - Characters contained in both strings - edited

I want to compare two string variables and print the characters that are the same for both. I'm not really sure how to do this, I was thinking of using comm or diff but I'm not really sure the right parameters to print only matching characters. also they say they take in files and these are strings. Can anyone help?
Input:
a=$(echo "abghrsy")
b=$(echo "cgmnorstuvz")
Output:
"grs"

You don't need to do that much work to assign $a and $b shell variables, you can just...
a=abghrsy
b=cdgmrstuvz
Now, there is a classic computer science problem called the longest common subsequence1 that is similar to yours.
However, if you just want the common characters, one way would let Ruby do the work...
$ ruby -e "puts ('$a'.chars.to_a & '$b'.chars.to_a).join"
1. Not to be confused with the different longest common substring problem.

Use Character Classes with GNU Grep
The isn't a widely-applicable solution, but it fits your particular use case quite well. The idea is to use the first variable as a character class to match against the second string. For example:
a='abghrsy'
b='cgmnorstuvz'
echo "$b" | grep --only-matching "[$a]" | xargs | tr --delete ' '
This produces grs as you expect. Note that the use of xargs and tr is simply to remove the newlines and spaces from the output; you can certainly handle this some other way if you prefer.
Set Intersection
What you're really looking for is a set intersection, though. While you can "wing it" in the shell, you'd be better off using a language like Ruby, Python, or Perl to do this.
A Ruby One-Liner
If you need to integrate with an existing shell script, a simple Ruby one-liner that uses Bash variables could be called like this inside your current script:
a='abghrsy'
b='cgmnorstuvz'
ruby -e "puts ('$a'.split(//) & '$b'.split(//)).join"
A Ruby Script
You could certainly make things more elegant by doing the whole thing in Ruby instead.
string1_chars = 'abghrsy'.split //
string2_chars = 'cgmnorstuvz'.split //
intersection = string1_chars & string2_chars
puts intersection.join
This certainly seems more readable and robust to me, but your mileage may vary. At least now you have some options to choose from.

Nice question +1.
You can use an awk trick to get this done.
a=abghrsy
b=cdgmrstuvz
comm -12 <(echo $a|awk -F"\0" '{for (i=1; i<=NF; i++) print $i}') <(echo $b|awk -F"\0" '{for (i=1; i<=NF; i++) print $i}')|tr -d '\n'
OUTPUT:
grs
Note use of awk -F"\0" that breaks input string character by character into different awk fiedls. Rest is pretty straightforward use of comm and tr.
PS: If you input string is not sorted then you need to pipe awk's output to sort or do sort of an array inside awk.
UPDATE: awk only solution (without comm):
echo "$a;$b" | awk -F"\0" '{scnd=0; for (i=1; i<=NF; i++) {if ($i!=";") {if (!scnd) arr1[$i]=$i; else if ($i in arr1) arr2[$i]=$i} else scnd=1}} END { for (a in arr2) printf("%s", a)}'
This assumes semicolon doesn't appear in your string (you can use any other character if that's not the case).
UPDATE 2: I think simplest solution is using grep -o
(thanks to answer from #CodeGnome)
echo "$b" | grep -o "[$a]" | tr -d '\n'

Using gnu coreutils(inspired by #DigitalRoss)..
a="abghrsy"
b="cgmnorstuvz"
echo "$(comm -12 <(echo "$a" | fold -w1 | sort | uniq) <(echo "$b" | fold -w1 | sort | uniq) | tr -d '\n')"
will print grs. I assumed you only want uniq characters.
UPDATE:
Modified for dash..
#!/bin/dash
string1=$(printf "$1" | fold -w1 | sort | uniq | tr -d '\n');
string2=$(printf "$2" | fold -w1 | sort | uniq | tr -d '\n');
while [ "$string1" != "" ]; do
c1=$(printf '%s\n' "$string1" | cut -c 1-1 )
string2=$(printf "$2" | fold -w1 | sort | uniq | tr -d '\n');
while [ "$string2" != "" ]; do
c2=$(printf '%s\n' "$string2" | cut -c 1-1 )
if [ "$c1" = "$c2" ]; then
echo "$c1\c"
fi
string2=$(printf '%s\n' "$string2" | cut -c 2- )
done
string1=$(printf '%s\n' "$string1" | cut -c 2- )
done
echo;
Note: I am just a beginner. There might be a better way of doing this.

How to split a string in shell and get the last field

Suppose I have the string 1:2:3:4:5 and I want to get its last field (5 in this case). How do I do that using Bash? I tried cut, but I don't know how to specify the last field with -f.

You can use string operators:
$ foo=1:2:3:4:5
$ echo ${foo##*:}
5
This trims everything from the front until a ':', greedily.
${foo <-- from variable foo
## <-- greedy front trim
* <-- matches anything
: <-- until the last ':'
}

Another way is to reverse before and after cut:
$ echo ab:cd:ef | rev | cut -d: -f1 | rev
ef
This makes it very easy to get the last but one field, or any range of fields numbered from the end.

It's difficult to get the last field using cut, but here are some solutions in awk and perl
echo 1:2:3:4:5 | awk -F: '{print $NF}'
echo 1:2:3:4:5 | perl -F: -wane 'print $F[-1]'

Assuming fairly simple usage (no escaping of the delimiter, for example), you can use grep:
$ echo "1:2:3:4:5" | grep -oE "[^:]+$"
5
Breakdown - find all the characters not the delimiter ([^:]) at the end of the line ($). -o only prints the matching part.

You could try something like this if you want to use cut:
echo "1:2:3:4:5" | cut -d ":" -f5
You can also use grep try like this :
echo " 1:2:3:4:5" | grep -o '[^:]*$'

One way:
var1="1:2:3:4:5"
var2=${var1##*:}
Another, using an array:
var1="1:2:3:4:5"
saveIFS=$IFS
IFS=":"
var2=($var1)
IFS=$saveIFS
var2=${var2[#]: -1}
Yet another with an array:
var1="1:2:3:4:5"
saveIFS=$IFS
IFS=":"
var2=($var1)
IFS=$saveIFS
count=${#var2[#]}
var2=${var2[$count-1]}
Using Bash (version >= 3.2) regular expressions:
var1="1:2:3:4:5"
[[ $var1 =~ :([^:]*)$ ]]
var2=${BASH_REMATCH[1]}

$ echo "a b c d e" | tr ' ' '\n' | tail -1
e
Simply translate the delimiter into a newline and choose the last entry with tail -1.

Using sed:
$ echo '1:2:3:4:5' | sed 's/.*://' # => 5
$ echo '' | sed 's/.*://' # => (empty)
$ echo ':' | sed 's/.*://' # => (empty)
$ echo ':b' | sed 's/.*://' # => b
$ echo '::c' | sed 's/.*://' # => c
$ echo 'a' | sed 's/.*://' # => a
$ echo 'a:' | sed 's/.*://' # => (empty)
$ echo 'a:b' | sed 's/.*://' # => b
$ echo 'a::c' | sed 's/.*://' # => c

There are many good answers here, but still I want to share this one using basename :
basename $(echo "a:b:c:d:e" | tr ':' '/')
However it will fail if there are already some '/' in your string.
If slash / is your delimiter then you just have to (and should) use basename.
It's not the best answer but it just shows how you can be creative using bash commands.

If your last field is a single character, you could do this:
a="1:2:3:4:5"
echo ${a: -1}
echo ${a:(-1)}
Check string manipulation in bash.

Using Bash.
$ var1="1:2:3:4:0"
$ IFS=":"
$ set -- $var1
$ eval echo \$${#}
0

echo "a:b:c:d:e"|xargs -d : -n1|tail -1
First use xargs split it using ":",-n1 means every line only have one part.Then,pring the last part.

Regex matching in sed is greedy (always goes to the last occurrence), which you can use to your advantage here:
$ foo=1:2:3:4:5
$ echo ${foo} | sed "s/.*://"
5

A solution using the read builtin:
IFS=':' read -a fields <<< "1:2:3:4:5"
echo "${fields[4]}"
Or, to make it more generic:
echo "${fields[-1]}" # prints the last item

for x in `echo $str | tr ";" "\n"`; do echo $x; done

improving from #mateusz-piotrowski and #user3133260 answer,
echo "a:b:c:d::e:: ::" | tr ':' ' ' | xargs | tr ' ' '\n' | tail -1
first, tr ':' ' ' -> replace ':' with whitespace
then, trim with xargs
after that, tr ' ' '\n' -> replace remained whitespace to newline
lastly, tail -1 -> get the last string

For those that comfortable with Python, https://github.com/Russell91/pythonpy is a nice choice to solve this problem.
$ echo "a:b:c:d:e" | py -x 'x.split(":")[-1]'
From the pythonpy help: -x treat each row of stdin as x.
With that tool, it is easy to write python code that gets applied to the input.
Edit (Dec 2020):
Pythonpy is no longer online.
Here is an alternative:
$ echo "a:b:c:d:e" | python -c 'import sys; sys.stdout.write(sys.stdin.read().split(":")[-1])'
it contains more boilerplate code (i.e. sys.stdout.read/write) but requires only std libraries from python.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Check for duplicate word in comma-separated string in bash - bash

Related

Argument in bash script

Count of matching word, pattern or value from unix korn shell scripting is returning just 1 as count

Bash - stdir words to file

shell - Characters contained in both strings - edited

How to split a string in shell and get the last field

Categories

Resources