How to count occurrences of a phrase in Bash?

I have an array:
ABC
GHI
XYZ
ABC
GHI
DEF
MNO
XYZ
How can I count the occurrences of each phrase in this array?
(Can I use a for loop?)
Expected output:
2 ABC
1 DEF
2 GHI
1 MNO
2 XYZ
Thank you so much!

sort file.txt | uniq -c should do the job.
If you mean an array in bash, echo them:
array=(ABC GHI XYZ ABC GHI DEF MNO XYZ)
for i in "${array[@]}"; do echo "$i"; done | sort | uniq -c
Output:
2 ABC
1 DEF
2 GHI
1 MNO
2 XYZ
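The same pipeline can be wrapped in a small function, as a sketch. The name count_phrases is made up for this example; the sed step is an addition that strips the left padding uniq -c puts in front of each count.

```bash
# count_phrases: print "COUNT PHRASE" for each distinct argument,
# sorted by phrase. (Hypothetical helper name, not from the question.)
count_phrases() {
    printf '%s\n' "$@" |
        sort |
        uniq -c |
        sed 's/^ *//'    # strip the left padding that uniq -c adds
}

count_phrases ABC GHI XYZ ABC GHI DEF MNO XYZ
```

With the array from the answer above, the call would be count_phrases "${array[@]}". Because each argument is printed on its own line, phrases that contain spaces are counted as one item.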

Using pure bash and an associative array to hold the counts:
#!/usr/bin/env bash
declare -a words=(ABC GHI XYZ ABC GHI DEF MNO XYZ) # regular array
declare -A counts # associative array
# Count how many times each element of words appears
for word in "${words[@]}"; do
counts[$word]=$(( ${counts[$word]:-0} + 1 ))
done
# Order of output will vary
for word in "${!counts[@]}"; do
printf "%d %s\n" "${counts[$word]}" "$word"
done
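Since the key order of an associative array is unspecified, one way to get reproducible output is to pipe the final loop through sort. This is only a sketch: it assumes bash 4.0 or newer, and the report variable is introduced here for illustration.

```bash
#!/usr/bin/env bash
# Assumes bash 4.0 or newer (needed for declare -A).
words=(ABC GHI XYZ ABC GHI DEF MNO XYZ)
declare -A counts=()

for word in "${words[@]}"; do
    counts[$word]=$(( ${counts[$word]:-0} + 1 ))
done

# Sort on the second field (the phrase) so the report is reproducible.
report=$(
    for word in "${!counts[@]}"; do
        printf '%d %s\n' "${counts[$word]}" "$word"
    done | sort -k2
)
printf '%s\n' "$report"
```

Using sort -k1,1nr instead would list the most frequent phrases first.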


How to concatenate and loop through two columns using just Bash variables, i.e. without temporary files

I have two Bash variables that each contain two columns of data. I'd like to concatenate them to create two longer columns, and then loop over the resulting rows, reading each column into its own temporary variable.
I'll explain what I need with a minimal working example. Suppose I have a tmp file with the following sample content:
for i in `seq 1 10`; do echo foo $i; done > tmp
for i in `seq 1 10`; do echo bar $i; done >> tmp
for i in `seq 1 10`; do echo baz $i; done >> tmp
What I need is effectively equivalent to the following code, which relies on external temporary files:
grep foo tmp > file1
grep bar tmp > file2
cat file1 file2 > file_tmp
while read word number
do
if [ $word = "foo" ]
then
echo word $word number $number
fi
done < file_tmp
rm file1 file2 file_tmp
My question then is: how can I achieve this result, i.e. concatenate the two columns and then loop across the rows, without having to write out the temporary files file1, file2 and file_tmp?
Bash's read can take input from a file descriptor other than stdin (read -u), and Bash has process substitution, so the two can be combined:
while
read -u3 foo1 foo2 &&
read -u4 bar1 bar2
do
echo "$foo1 $foo2 - $bar1 $bar2"
done 3< <(grep ^foo tmp) 4< <(grep ^bar tmp)
The code above is a kind of zip function. Note that it doesn't address ensuring that the ordering of the two sequences is correct.
It's not clear why your code in the question creates and then ignores bar lines. If you are doing that, the code is even simpler:
while read word number; do
echo "word $word number $number"
done < <(grep ^foo tmp)
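If the goal is the concatenation from the question (all foo rows followed by all bar rows), a sketch along the same lines is to run both grep commands inside one process substitution. The sample_data function below is a hypothetical stand-in for the tmp file, shortened to three rows per word, and the output and rows variables exist only to show that the loop runs in the current shell.

```bash
#!/usr/bin/env bash
# sample_data is a hypothetical stand-in for the `tmp` file in the question,
# shortened to three rows per word.
sample_data() {
    local word i
    for word in foo bar baz; do
        for i in 1 2 3; do
            echo "$word $i"
        done
    done
}

data=$(sample_data)
rows=0
output=""

# The two grep commands run one after the other inside the process
# substitution, so their output arrives as one concatenated stream.
while read -r word number; do
    output+="word $word number $number"$'\n'
    rows=$(( rows + 1 ))
done < <(grep '^foo' <<< "$data"; grep '^bar' <<< "$data")

printf '%s' "$output"
echo "rows read: $rows"
```

Because the loop reads from a process substitution rather than a pipe, rows and output are still set after the loop ends.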
I may have misunderstood, but if you want to do this without temp files, perhaps this would work for your use-case:
# Gather the output from the 3 'seq' commands and pipe into AWK
{
for i in $(seq 1 10); do echo foo "$i"; done ;
for i in $(seq 1 10); do echo bar "$i"; done ;
for i in $(seq 1 10); do echo baz "$i"; done ;
} |\
awk '{
if ($1=="foo" || $1=="bar") {a[NR]=$1; b[NR]=$2}}
END{for (i=1; i<=NR; i++) if (i in a) {print "word " a[i] " number " b[i]}
}'
# For the AWK command: if a line contains "foo" or "bar",
# create an array "a" for the word, indexed using the row number ("NR")
# and an array "b" for the number, indexed using the row number ("NR")
# Then print the arrays with the words "word" and "number" and the correct spacing
Result:
word foo number 1
word foo number 2
word foo number 3
word foo number 4
word foo number 5
word foo number 6
word foo number 7
word foo number 8
word foo number 9
word foo number 10
word bar number 1
word bar number 2
word bar number 3
word bar number 4
word bar number 5
word bar number 6
word bar number 7
word bar number 8
word bar number 9
word bar number 10
Do you mean like this?
paste <( jot - 1 9 2 ) <( jot - 2 10 2 )
1 2
3 4
5 6
7 8
9 10
You can use awk to achieve this.
awk '{if($1=="foo") {print "word "$1" number "$2}}' file_tmp
Splitting then merging standard input in one operation
Of course, this works on standard input, such as the output of any command, as well as on a file.
This demonstration uses command output directly, without requiring a temporary file.
First, the sample lines:
I've condensed the creation of your first tmp file into this one-line command:
. <(printf 'printf "%s %%d\n" {1..10};' foo bar baz)
To keep the output short here, this is a sample with 3 lines per word (the rest of this post still uses 10 values per word):
. <(printf 'printf "%s %%d\n" {1..3};' foo bar baz)
foo 1
foo 2
foo 3
bar 1
bar 2
bar 3
baz 1
baz 2
baz 3
You will need a fifo for the split:
mkfifo $HOME/myfifo
Note: this could be done with an unnamed fifo (i.e. without a temporary fifo), but then your script has to manage opening and closing the file descriptor itself.
tee for splitting, then paste for merging output:
Quick run:
. <(printf 'printf "%s %%d\n" {1..10};' foo bar baz) |
tee >(grep foo >$HOME/myfifo ) | grep ba |
paste -d $'\1' $HOME/myfifo - - | sed 's/\o1/ and /g'
(The last sed is purely cosmetic.) This should produce:
foo 1 and bar 1 and bar 2
foo 2 and bar 3 and bar 4
foo 3 and bar 5 and bar 6
foo 4 and bar 7 and bar 8
foo 5 and bar 9 and bar 10
foo 6 and baz 1 and baz 2
foo 7 and baz 3 and baz 4
foo 8 and baz 5 and baz 6
foo 9 and baz 7 and baz 8
foo 10 and baz 9 and baz 10
With some bash script in between:
. <(printf 'printf "%s %%d\n" {1..10};' foo bar baz) | (
tee >(
while read -r word num;do
case $word in
foo ) echo Word: foo num: $num ;;
* ) ;;
esac
done >$HOME/myfifo
) |
while read -r word num;do
case $word in
ba* ) ((num%2))&& echo word: $word num: $num ;;
* ) ;;
esac
done
) | paste $HOME/myfifo -
Should produce:
Word: foo num: 1 word: bar num: 1
Word: foo num: 2 word: bar num: 3
Word: foo num: 3 word: bar num: 5
Word: foo num: 4 word: bar num: 7
Word: foo num: 5 word: bar num: 9
Word: foo num: 6 word: baz num: 1
Word: foo num: 7 word: baz num: 3
Word: foo num: 8 word: baz num: 5
Word: foo num: 9 word: baz num: 7
Word: foo num: 10 word: baz num: 9
Other syntax, same job:
paste $HOME/myfifo <(
. <(printf 'printf "%s %%d\n" {1..10};' foo bar baz) |
tee >(
while read -r word num;do
case $word in
foo ) echo Word: foo num: $num ;;
* ) ;;
esac
done >$HOME/myfifo
) |
while read -r word num;do
case $word in
ba* ) ((num%2))&& echo word: $word num: $num ;;
* ) ;;
esac
done
)
Removing fifo
rm $HOME/myfifo

while loop to echo variable until empty in Bash

Here is what we have in the $foo variable:
abc bcd cde def
We need to echo the first part of the variable ONLY, and do this repeatedly until there's nothing left.
Example:
$ magic_while_code_here
I am on abc
I am on bcd
I am on cde
I am on def
It would use the first word, then remove it from the variable, and repeat with the new first word until the variable is empty, at which point it quits.
So the variable would be abc bcd cde def, then bcd cde def, then cde def, etc.
We would show what we have tried but we are not sure where to start.
If you need to use the while loop and cut the parts from the beginning of the string, you can use the cut command.
foo="abc bcd cde def"
while :
do
p1=`cut -f1 -d" " <<<"$foo"`
echo "I am on $p1"
foo=`cut -f2- -d" " <<<"$foo"`
if [ "$p1" == "$foo" ]; then
break
fi
done
This will output:
I am on abc
I am on bcd
I am on cde
I am on def
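The loop above stops when the remainder equals the word just printed, so an input such as "abc abc" would end after the first word. As a sketch that avoids both this and the external cut calls, parameter expansion can strip the first word. The function name walk_words is made up for this example, and it assumes words are separated by single spaces with no leading space.

```bash
# walk_words: print "I am on WORD" for each space-separated word in $1,
# removing each word from the front of the string as it goes.
# (Hypothetical helper; assumes single spaces between words.)
walk_words() {
    local rest="$1" first
    while [ -n "$rest" ]; do
        first=${rest%% *}               # text before the first space
        echo "I am on $first"
        case $rest in
            *' '*) rest=${rest#* } ;;   # drop the first word and its space
            *)     rest= ;;             # that was the last word
        esac
    done
}

walk_words "abc bcd cde def"
```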
Assuming the variable consists of sequences of only alphabetic characters separated by spaces, tabs or newlines, we can (ab-)use the word splitting expansion and just do printf:
foo="abc bcd cde def"
printf "I am on %s\n" $foo
will output:
I am on abc
I am on bcd
I am on cde
I am on def
I would use read -a to read the string into an array, then print it:
$ foo='abc bcd cde def'
$ read -ra arr <<< "$foo"
$ printf 'I am on %s\n' "${arr[@]}"
I am on abc
I am on bcd
I am on cde
I am on def
The -r option makes sure backslashes in $foo aren't interpreted; read -a allows you to have any characters you want in $foo and split on whitespace.
Alternatively, if you can use awk, you could loop over all fields like this:
awk '{for (i=1; i<=NF; ++i) {print "I am on", $i}}' <<< "$foo"

how to find the last grouped digit in a string in bash

This is a follow-up question to this question, regarding how to know the number of grouped digits in string.
In bash,
How can I find the last occurrence of a group of digits in a string?
So, if I have
string="123 abc 456"
I would get
456
And if I had
string="123 123 456"
I would still get
456
Without external utilities (such as sed, awk, ...):
$ s="123 abc 456"
$ [[ $s =~ ([0-9]+)[^0-9]*$ ]] && echo "${BASH_REMATCH[1]}"
456
BASH_REMATCH is a special array to which the matches from [[ ... =~ ... ]] are assigned.
Test code:
str=("123 abc 456" "123 123 456" "123 456 abc def" "123 abc" "abc 123" "123abc456def")
for s in "${str[@]}"; do
[[ $s =~ ([0-9]+)[^0-9]*$ ]] && echo "$s -> ${BASH_REMATCH[1]}"
done
Output:
123 abc 456 -> 456
123 123 456 -> 456
123 456 abc def -> 456
123 abc -> 123
abc 123 -> 123
123abc456def -> 456
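For reuse, the same regex can be wrapped in a function that also reports whether anything matched. This is a sketch; the name last_digits is made up here, and the pattern is kept in a variable so it is passed to =~ unquoted.

```bash
#!/usr/bin/env bash
# last_digits: print the last run of digits in $1.
# Returns 1 (and prints nothing) when $1 contains no digit.
# (Hypothetical helper name, not part of the answer above.)
last_digits() {
    local re='([0-9]+)[^0-9]*$'
    [[ $1 =~ $re ]] || return 1
    printf '%s\n' "${BASH_REMATCH[1]}"
}

last_digits "123 abc 456"        # prints 456
last_digits "123abc456def"       # prints 456
last_digits "no digits" || echo "no match"
```

The return status makes it usable in conditions, e.g. if n=$(last_digits "$string"); then ...; fi.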
You can use a regex in Bash:
$ echo "$string"
123 abc 456
$ [[ $string =~ (^.*[ ]+|^)([[:digit:]]+) ]] && echo "${BASH_REMATCH[2]}"
456
If you want to capture undelimited strings like 456 or abc123def456 you can do:
$ echo "$string"
test456text
$ [[ $string =~ ([[:digit:]]+)[^[:digit:]]*$ ]] && echo "${BASH_REMATCH[1]}"
456
But if you are going to use an external tool, use awk.
Here is a demo of Bash vs Awk to get the last field of digits in a string. These are for digits with ' ' delimiters or at the end or start of a string.
Given:
$ cat file
456
123 abc 456
123 123 456
abc 456
456 abc
123 456 foo bar
abc123def456
Here is a test script:
while IFS= read -r line || [[ -n $line ]]; do
bv=""
av=""
[[ $line =~ (^.*[ ]+|^)([[:digit:]]+) ]] && bv="${BASH_REMATCH[2]}"
av=$(awk '{for (i=1;i<=NF;i++) if (match($i, /^[[:digit:]]+$/)) last=$i; print last}' <<< "$line")
printf "line=%22s bash=\"%s\" awk=\"%s\"\n" "\"$line\"" "$bv" "$av"
done <file
Prints:
line=                 "456" bash="456" awk="456"
line=         "123 abc 456" bash="456" awk="456"
line=         "123 123 456" bash="456" awk="456"
line=             "abc 456" bash="456" awk="456"
line=             "456 abc" bash="456" awk="456"
line=     "123 456 foo bar" bash="456" awk="456"
line=        "abc123def456" bash="" awk=""
grep -o '[0-9]\+' file|tail -1
grep -o lists the matched text only.
tail -1 outputs only the last match.
If you have a string rather than a file:
grep -o '[0-9]\+' <<< '123 foo 456 bar' |tail -1
You may use this sed to extract last number in a line:
sed -E 's/(.*[^0-9]|^)([0-9]+).*/\2/'
Examples:
sed -E 's/(.*[^0-9]|^)([0-9]+).*/\2/' <<< '123 abc 456'
456
sed -E 's/(.*[^0-9]|^)([0-9]+).*/\2/' <<< '123 456 foo bar'
456
sed -E 's/(.*[^0-9]|^)([0-9]+).*/\2/' <<< '123 123 456'
456
sed -E 's/(.*[^0-9]|^)([0-9]+).*/\2/' <<< '123 x'
123
RegEx Details:
(.*[^0-9]|^): Match 0 or more characters at start followed by a non-digit OR line start.
([0-9]+): Match 1+ digits and capture in group #2
.*: Match remaining characters till end of line
\2: Replace it with back-reference #2 (what we captured in group #2)
Another way to do it with pure Bash:
shopt -s extglob # enable extended globbing - for *(...)
tmp=${string%%*([^0-9])} # remove non-digits at the end
last_digits=${tmp##*[^0-9]} # remove everything up to the last non-digit
printf '%s\n' "$last_digits"
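If enabling extglob is not an option, a similar result seems achievable with plain parameter expansion alone. The sketch below is an alternative to the snippet above, not the same code; last_number and suffix are names introduced for this example.

```bash
# last_number: print the last run of digits in $1 using only
# standard parameter expansion (no extglob, no regex).
# Prints an empty line when $1 contains no digit.
last_number() {
    local s="$1" suffix
    suffix=${s##*[0-9]}             # whatever follows the final digit
    s=${s%"$suffix"}                # cut that off; s now ends with a digit
    printf '%s\n' "${s##*[!0-9]}"   # keep what follows the last non-digit
}

last_number "123 456 abc def"
```

The suffix is quoted inside ${s%...} so that any glob characters in it are matched literally.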
This is a good job for parameter expansion, provided the digits are the last space-separated word in the string:
$ string="123 abc 456"
$ echo "${string##* }"
456
A simple answer with gawk:
echo "$string" | gawk -v RS=" " '/^[[:digit:]]+$/ { N = $0 } ; END { print N }'
With RS=" ", we read each field as a separate record.
Then we keep the last number found and print it.
$ string="123 abc 456 abc"
$ echo "$string" | gawk -v RS=" " '/^[[:digit:]]+$/ { N = $0 } ; END { print N }'
456

bash - find the differed string in column of file

I have a file input.txt. In bash, using sed, awk or a shell script, how can I get only the strings in a column that differ from all the others?
For example:
# cat input.txt
878933fa4965c31c88ee8696a1a5838f abc xyz
878933fa4965c31c88ee8696a1a5838f abc xyz
878933fa4965c31c88ee8696a1a5838f abc xyz
878933fa4965c31c88ee8696a1a5838f abc xyz
878933fa4965c31c88ee8696a1a5838f abc xyz
878933fa4965c31c88ee8696a1a5838f abc xyz
878933fa4965c31c88ee8696a1axxxxx abc xyz
878933fa4965c31c88ee8696a1a5838f abc xyz
878933fa4965c31c88ee8696a1a5838f abc xyz
878933fayyyyyy1c88ee8696a1a5838f abc xyz
878933fa4965c31c88ee8696a1a5838f abc xyz
878933fa4965c31c88ee8696a1a5838f abc xyz
I want to pick and display only "878933fa4965c31c88ee8696a1axxxxx" and "878933fayyyyyy1c88ee8696a1a5838f"
In pure Bash:
declare -A lines
while read -r col1 line ; do lines["$col1"]="$col1 $line" ; done < input.txt
for i in "${!lines[@]}" ; do echo "$i" ; done
First we declare the lines variable as an associative array. Then we read all the lines in a while loop, keyed by the first column. Finally we print each key, i.e. each distinct value of the first column.
Your question is kinda vague but maybe you're trying to print the $1 values that appear only once and if so this would do that:
$ awk '{cnt[$1]++} END{for (i in cnt) if (cnt[i]==1) print i}' file
878933fayyyyyy1c88ee8696a1a5838f
878933fa4965c31c88ee8696a1axxxxx
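The order of for (i in cnt) is unspecified in awk, which is why the two values above come out in a different order than in the file. If file order matters, one option is to read the file twice, as in this sketch. The function name singletons and the shortened demo data are made up for the example.

```bash
# singletons: print, in file order, each first-column value that occurs
# exactly once in the file named by $1. (Hypothetical helper name.)
singletons() {
    # First pass (NR == FNR) counts the values; the second pass prints
    # those whose count is 1. The file is read twice, so it must be a
    # regular file rather than a pipe.
    awk 'NR == FNR { cnt[$1]++; next } cnt[$1] == 1 { print $1 }' "$1" "$1"
}

# Demo with shortened, made-up keys in place of the long hashes.
demo=$(mktemp)
printf '%s\n' 'aaaa abc xyz' 'aaaa abc xyz' 'aaxx abc xyz' \
              'aaaa abc xyz' 'yyaa abc xyz' > "$demo"
singletons "$demo"
rm -f "$demo"
```

For the question's data the call would be singletons input.txt.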
awk '{print $1}' <file> |uniq -u
awk '{print $4}' <file> |uniq -u
uniq -c will give you a count, so if you want only the entries that occur exactly once, you can do:
cut -d " " -f 1 file | sort | uniq -c | awk '$1==1{print $2}'
Or in perl:
perl -lane '$seen{$F[0]}++; END{for (keys %seen){ print if $seen{$_}==1}}' file
Try this:
cat input.txt | uniq -u | awk '{print $1}'

Cannot recognize word if spaced by tab

My program cannot read each word when the words in the text file are separated by a tab instead of a space.
For example, here is the file.
part_Q.txt:
NWLR35MQ 649
HCDA93OW 526
abc 1
def 2
ghi 3
Note that between "abc" and "1" there is a tab, not a space.
Also note that between "NWLR35MQ" and "649" there is no tab, only spaces. The same goes for the 2nd line.
Output:
NWLR35MQ
649
HCDA93OW
526
def
2
ghi
3
However, if I replace the tab between "abc" and "1" with a space in the file, then the output is correct, as below:
Expected output:
NWLR35MQ
649
HCDA93OW
526
abc
1
def
2
ghi
3
It correctly displays all words in the file. How can I display all words regardless of whether they are separated by a tab or a space? It should display all words in both cases. It seems that the program treats the tab as an ordinary character.
Below is source code:
#!/bin/sh
tempCtr=0
realCtr=0
copyCtr=0
while IFS= read -r line || [[ -n $line ]]; do
IFS=' '
tempCtr=0
for word in $line; do
temp[$tempCtr]="$word"
let "tempCtr++"
done
# if there are exactly 2 fields in each line, store ID and quantity
if [ $tempCtr -eq 2 ]
then
part_Q[$realCtr]=${temp[$copyCtr]}
let "realCtr++"
let "copyCtr++"
part_Q[$realCtr]=${temp[$copyCtr]}
let "realCtr++"
copyCtr=0
fi
done < part_Q.txt
for value in "${part_Q[@]}"; do
echo $value
done
What are you trying to do? If outputting is your only goal, this can be achieved very easily:
$ cat <<EOF | sed -E 's/[[:blank:]]+/\n/'
NWLR35MQ 649
HCDA93OW 526
abc 1
def 2
ghi 3
EOF
NWLR35MQ
649
HCDA93OW
526
abc
1
def
2
ghi
3
Awk is faster than a loop, but here is how you can implement this with a loop:
realCtr=0
while read -r x1 x2 x3; do
if [ -n "${x2}" ] && [ -z "${x3}" ]; then
echo 2=$x2
part_Q[realCtr]="${x1}"
(( realCtr++ ))
part_Q[realCtr]="${x2}"
(( realCtr++ ))
fi
done < part_Q.txt
echo "Array (2 items each line):"
echo "${part_Q[@]}" | sed 's/[^ ]* [^ ]* /&\n/g'
You might solve this (as in your example) by a single line of code
cat part_Q.txt | tr $'\t' $'\n' | tr -s ' ' $'\n'
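This first translates each tab into a newline, and then translates each run of spaces into a single newline as well (-s squeezes the repeats).
Note: the $'...' quoting makes bash itself expand \t and \n before tr sees them. tr also understands these escapes on its own, so tr '\t' '\n' works too.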
Since it has been mentioned, awk can help, too:
awk 'NF==2{print $1"\n"$2}' part_Q.txt
Here NF==2 also takes care of using only the lines that have exactly 2 'words'.
Changing IFS=' ' to IFS=$'\t ' solved the problem.
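Another option is to leave IFS alone and let read do the splitting, since the default IFS already contains both space and tab. The sketch below assumes IFS has not been changed elsewhere in the script; print_pairs is a name made up for this example.

```bash
# print_pairs: read lines on stdin and print both fields of every line
# that has exactly two whitespace-separated fields.
# (Hypothetical helper; relies on the default IFS of space, tab, newline.)
print_pairs() {
    local id qty extra
    # With IFS left at its default, read splits on spaces and tabs alike.
    # The "|| [ -n "$id" ]" part keeps a final line that lacks a newline.
    while read -r id qty extra || [ -n "$id" ]; do
        if [ -n "$qty" ] && [ -z "$extra" ]; then
            printf '%s\n%s\n' "$id" "$qty"
        fi
    done
}

printf 'NWLR35MQ   649\nabc\t1\ndef 2\nonly-one-field\n' | print_pairs
```

For the file in the question the call would be print_pairs < part_Q.txt.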
