Command "grep | cut" in shell programming - bash

I have a problem with the grep command.
I have a file, called dictionary.txt, containing 2 columns of words, like
abc def
apple orange
hour minute
In my Bash script, given a word from the left column as an argument, I have to output the corresponding word from the right column using the grep command.
A requirement is to use a loop.
I created this script:
#!/bin/bash
parola=$1
for traduzione in $( sort dictionary.txt )
do
    if [ $parola == $traduzione ]
    then
        grep $traduzione | cut -f 2 dictionary.txt
    fi
done
This does not work as described above.

I'd suggest replacing the whole for loop with
awk -v word="$parola" '$1 == word{print $2;exit}' dictionary.txt
where
-v word="$parola" passes the parola variable to the awk script
$1 == word checks if the Column 1 value equals parola
{print $2;exit} prints the Column 2 value and exits (remove exit if you need all matches on subsequent lines).
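For instance, the all-matches variant would simply be the same command without the exit:
awk -v word="$parola" '$1 == word{print $2}' dictionary.txt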
With dictionary.txt as
abc def
apple orange
hour minute
and script.sh as
#!/bin/bash
parola=$1
awk -v word="$parola" '$1 == word{print $2; exit}' dictionary.txt
running bash script.sh apple prints orange.
If you need an explicit loop you can use
#!/bin/bash
parola=$1
while IFS= read -r line; do
    read -r left right <<< "$line"
    if [ "$left" == "$parola" ]; then
        echo "$right"
    fi
done < dictionary.txt
That is:
Read dictionary.txt line by line, assigning the current line to the line variable
Read the values on a line into the left and right variables
If left is equal to parola, print right.
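For instance, with the sample dictionary above and this loop saved as script.sh as before:
$ bash script.sh hour
minute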

Why are you using a for-loop?
grep -w "word1" dictionary.txt
This shows you the line containing that word, so the for loop is not even needed. For your information, -w means "match only whole words".
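If you only want the right-hand word, grep -w could be combined with cut, e.g. (a sketch assuming the two columns are separated by a single space and the argument contains no regex metacharacters):
grep -w "^$parola" dictionary.txt | cut -d' ' -f2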

Related

How to filter text data in bash more efficiently

I have a data file which I need to filter with a bash script; see the data example:
name=pencils
name=apples
value=10
name=rocks
value=3
name=tables
value=6
name=beds
name=cups
value=89
I need to group name/value pairs like so: apples=10. If the current line starts with name and the next line also starts with name, the first line should be omitted entirely. So the result file should look like this:
apples=10
rocks=3
tables=6
cups=89
I came up with this simple solution which works but is very slow; it takes 5 min to complete for a file with 2000 lines.
VALUES=$(cat input.txt)
for x in $VALUES; do
    if [[ -n $(echo $x | grep 'name=') ]]; then
        name=$(echo $x | sed "s/name=//")
    elif [[ -n $(echo $x | grep 'value=') ]]; then
        value=$(echo $x | sed "s/value=//")
        echo "${name}=${value}" >> output.txt
    fi
done
I'm aware that this kind of task is not very suitable for bash, but the script is already written and this is just a small part of it.
How can I optimize this task in bash?
Do not run any commands in subshells, it slows your script a lot. You can do everything in the current shell.
#! /bin/bash
while IFS== read -r k v ; do
    if [[ $k == name ]] ; then
        name=$v
    elif [[ $k == value ]] ; then
        printf '%s=%s\n' "$name" "$v"
    fi
done
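Note that the loop reads standard input, so presumably you would run it with redirections, e.g.:
./script.sh < input.txt > output.txt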
There are three easy optimizations you can make that will greatly speed up the script without requiring a major rethink.
1. Replace for with while read
Loading input.txt into a string, and then looping over that string with for x in $VALUES is slow. It requires the whole file to be read into memory even though this task could be done in a streaming fashion, reading a line at a time.
A common replacement for for line in $(cat file) is while read line; do ... done < file. It turns out that loops are compound commands, and like the normal one-line commands we're used to, compound commands can have < and > redirections. Redirecting a file into a loop means that for the duration of the loop, stdin comes from the file. So if you call read line inside the loop then it will read one line each iteration.
while IFS= read -r x; do
    if [[ -n $(echo $x | grep 'name=') ]]; then
        name=$(echo $x | sed "s/name=//")
    elif [[ -n $(echo $x | grep 'value=') ]]; then
        value=$(echo $x | sed "s/value=//")
        echo "${name}=${value}" >> output.txt
    fi
done < input.txt
2. Redirect output outside loop
It's not just input that can be redirected. We can do the same thing for the >> output.txt redirection. Here's where you'll see the biggest speedup. When >> output.txt is inside the loop output.txt must be opened and closed every iteration, which is crazy slow. Moving it to the outside means it only needs to be opened once. Much, much faster.
while IFS= read -r x; do
    if [[ -n $(echo $x | grep 'name=') ]]; then
        name=$(echo $x | sed "s/name=//")
    elif [[ -n $(echo $x | grep 'value=') ]]; then
        value=$(echo $x | sed "s/value=//")
        echo "${name}=${value}"
    fi
done < input.txt > output.txt
3. Shell string processing
One final improvement is to use faster string processing. Calling grep requires forking a subprocess every time just to do a simple string split. It'd be a lot faster if we could do the string splitting using just shell constructs. Well, as it happens that's easy now that we've switched to read. read can do more than read whole lines; it can also split on the delimiter in the variable $IFS (internal field separator).
while IFS='=' read -r key value; do
    case "$key" in
        name) name="$value";;
        value) echo "$name=$value";;
    esac
done < input.txt > output.txt
Further reading
BashFAQ/001 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
This explains why I have IFS= read -r in the first two snippets.
BashFAQ/024 - I set variables in a loop that's in a pipeline. Why do they disappear after the loop terminates? Or, why can't I pipe data to read?
cmd | while read; do ... done is another popular use of while read, but it has unique pitfalls.
BashFAQ/100 - How do I do string manipulations in bash?
More in-shell string processing options.
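As a quick taste of those in-shell options, the sed calls from the original script could be replaced by parameter expansion, which strips a prefix without forking a subprocess:
$ x='name=apples'
$ echo "${x#name=}"
apples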
If you have performance issues do not use bash at all. Use a text processing tool like, for instance, awk:
$ awk -F= '$1 == "value" {print name "=" $2} {name = $2}' data.txt
apples=10
rocks=3
tables=6
cups=89
Explanation: -F= defines the field separator as the character =. The first block is executed only if the first field of a line ($1) is equal to the string value; it prints variable name followed by the character = and the second field ($2). The second block is executed on each line and stores the second field ($2) in the variable name.
Normally, if your input resembles what you show, this automatically skips the first line. Otherwise, we can exclude it explicitly using a test on the NR variable, whose value is the line number, starting at 1:
awk -F= 'NR != 1 && $1 == "value" {print name "=" $2}
NR != 1 {name = $2}' data.txt
All this works on inputs like the one you show, but not on inputs with other types of lines or several consecutive value=... lines. If you really want to test that the name/value pair is on two consecutive lines, we need something more: for instance, test whether the first field is name and use another variable n to store the line number of the last encountered name=... line. With all these tests we can now put the two blocks in a slightly more intuitive order (but the opposite order would work the same):
awk -F= 'NR != 1 && $1 == "name" {name = $2; n = NR}
NR != 1 && NR == n+1 && $1 == "value" {print name "=" $2}' data.txt
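For instance, a quick check with a hypothetical input containing two consecutive value=... lines shows the second one being ignored, since NR == n+1 no longer holds:
$ printf 'name=pencils\nname=rocks\nvalue=3\nvalue=99\n' |
awk -F= 'NR != 1 && $1 == "name" {name = $2; n = NR}
NR != 1 && NR == n+1 && $1 == "value" {print name "=" $2}'
rocks=3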
With awk there might be a more elegant solution but you can have:
awk 'BEGIN{RS="\n?name=";FS="\nvalue="} {if($2) printf "%s=%s\n",$1,$2}' inputs.txt
RS="\n?name=" says that the record separator is name=
FS="\nvalue=" says that the field separator for each record is value=
if($2) says to only proceed the printf is the second field exists

Read each line of a column of a file and execute grep

I have file.txt exemplary here:
This line contains ABC
This line contains DEF
This line contains GHI
and here the following list.txt:
contains ABC<TAB>ABC
contains DEF<TAB>DEF
Now I am writing a script that executes the following commands for each line of this external file list.txt:
take the string from column 1 of list.txt and search in a third file file.txt
if the first command is positive, return the string from column 2 of list.txt
So my output.txt is:
ABC
DEF
This is my code for grep/echo with putting the query/return strings manually:
if grep -i -q 'contains abc' file.txt
then
    echo ABC >output.txt
else
    echo -n
fi
if grep -i -q 'contains def' file.txt
then
    echo DEF >>output.txt
else
    echo -n
fi
I have about 100 search terms, which makes the task laborious if done manually. So how do I combine while read line; do [commands]; done < list.txt with commands that handle column 1 and column 2 inside that script?
I would like to use simple grep/echo/awk commands if possible.
Something like this?
$ awk -F'\t' 'FNR==NR { a[$1] = $2; next } {for (x in a) if (index($0, x)) {print a[x]}} ' list.txt file.txt
ABC
DEF
For the lines of the first file (FNR==NR), read the key-value pairs into array a. Then for the lines of the second file, loop through the array, check if the key is found on the line, and if so, print the stored value. index($0, x) tries to find the contents of x in (the current line) $0. $0 ~ x would instead take x as a regex to match with.
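If the left column of list.txt held regular expressions rather than fixed strings, the same structure with ~ might look like this (a sketch of that variation):
$ awk -F'\t' 'FNR==NR { a[$1] = $2; next } {for (x in a) if ($0 ~ x) {print a[x]}} ' list.txt file.txt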
If you want to do it in the shell, starting a separate grep for each and every line of list.txt, something like this:
while IFS=$'\t' read k v ; do
    grep -qFe "$k" file.txt && echo "$v"
done < list.txt
read k v reads a line of input and splits it (based on IFS) into k and v.
grep -F takes the pattern as a fixed string, not a regex, and -q prevents it from outputting the matching line. grep returns true if any matching lines are found, so $v is printed if $k is found in file.txt.
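The manual commands in the question used grep -i; to keep that case-insensitive behaviour you would presumably add -i here as well:
grep -qiFe "$k" file.txt && echo "$v"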
Using awk and grep:
for text in `awk '{print $4}' file.txt`
do
    grep "contains $text" list.txt | awk -F $'\t' '{print $2}'
done

Bash Shell: Infinite Loop

The problem is the following: I have a file where each line has this form:
id|lastName|firstName|gender|birthday|joinDate|IP|browser
I want to sort all the first names in that file alphabetically and print them one per line, but each name only once.
I have created the following program, but for some reason it creates an infinite loop:
array1=()
while read LINE
do
    if [ ${LINE:0:1} != '#' ]
    then
        IFS="|"
        array=($LINE)
        if [[ "${array1[@]}" != "${array[2]}" ]]
        then
            array1+=("${array[2]}")
        fi
    fi
done < $3
echo ${array1[@]} | awk 'BEGIN{RS=" ";} {print $1}' | sort
NOTES
if [ ${LINE:0:1} != '#' ] : this test is used because there are comments in the file that I don't want to print
$3 : filename
array1 : is used for all the separate names
Wow, there's a MUCH simpler and cleaner way to achieve this, without having to mess with the IFS variable or using arrays. You can use "for" to do this:
First I created a file with the same structure as yours:
$ cat file
id|lastName|Douglas|gender|birthday|joinDate|IP|browser
id|lastName|Tim|gender|birthday|joinDate|IP|browser
id|lastName|Andrew|gender|birthday|joinDate|IP|browser
id|lastName|Sasha|gender|birthday|joinDate|IP|browser
#id|lastName|Carly|gender|birthday|joinDate|IP|browser
id|lastName|Madson|gender|birthday|joinDate|IP|browser
Here's the script I wrote using "for":
#!/bin/bash
for LINE in `cat file | grep -v "^#" | awk -F'|' '{print$3}' | sort -u`
do
    echo $LINE
done
And here's the output of this script:
$ ./script.sh
Andrew
Douglas
Madson
Sasha
Tim
Explanation:
for LINE in `cat file`
Creates a loop that reads each line of "file". The commands between backticks are run by the shell; for example, if you wanted to store the date in a variable you could use VARDATE=`date`.
grep -v "^#"
The option -v is used to exclude results matching the pattern, in this case the pattern is "^#". The "^" character means "line begins with". So grep -v "^#" means "exclude lines beginning with #".
awk -F'|' '{print$3}'
The -F option switches the column delimiter from the default (whitespace) to whatever you put after it, in this case the "|" character.
The '{print$3}' prints the 3rd column.
sort -u
And the sort -u command sorts the names alphabetically, with -u (unique) removing the duplicates.

Bash : How to check in a file if there are any word duplicates

I have a file with 6 character words in every line and I want to check if there are any duplicate words. I did the following but something isn't right:
#!/bin/bash
while read line
do
    name=$line
    d=$( grep '$name' chain.txt | wc -w )
    if [ $d -gt '1' ]; then
        echo $d $name
    fi
done <$1
Assuming each word is on a new line, you can achieve this without looping:
$ cat chain.txt | sort | uniq -c | grep -v " 1 " | cut -c9-
You can use awk for that:
awk -F'\n' 'found[$1] {print}; {found[$1]++}' chain.txt
Set the field separator to newline, so that we look at the whole line. Then, if the line already exists in the array found, print the line. Finally, add the line to the found array.
Note: a line is only suppressed once, so if the same line appears, say, 6 times, it will be printed 5 times.
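If you would rather see each duplicated word exactly once, a common variation (a sketch along the same lines) is to print only on the second occurrence:
awk 'found[$0]++ == 1' chain.txt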

adding numbers without grep -c option

I have a txt file like
Peugeot:406:1999:Silver:1
Ford:Fiesta:1995:Red:2
Peugeot:206:2000:Black:1
Ford:Fiesta:1995:Red:2
I am looking for a command that counts the number of red Ford Fiesta cars.
The last number in each line is the amount of that particular car.
The command I am looking for CANNOT use the -c option of grep.
So this command should just output the number 4.
Any help would be welcome, thank you.
A simple bit of awk would do the trick:
awk -F: '$1=="Ford" && $4=="Red" { c+=$5 } END { print c }' file
Output:
4
Explanation:
The -F: switch means that the input field separator is a colon, so the car manufacturer is $1 (the 1st field), the model is $2, etc.
If the 1st field is "Ford" and the 4th field is "Red", then add the value of the 5th (last) field to the variable c. Once the whole file has been processed, print out the value of c.
For a native bash solution:
c=0
while IFS=":" read -ra col; do
    [[ ${col[0]} == Ford ]] && [[ ${col[3]} == Red ]] && (( c += col[4] ))
done < file && echo $c
Effectively applies the same logic as the awk one above, without any additional dependencies.
Methods:
1.) Use a scripting language for counting, like awk or perl. An awk solution is already posted; here is a perl solution.
perl -F: -lane '$s+=$F[4] if m/Ford:.*:Red/}{print $s' < carfile
#or
perl -F: -lane '$s+=$F[4] if ($F[0]=~m/Ford/ && $F[3]=~/Red/)}{print $s' < carfile
Both examples print
4
2.) The second method is based on shell pipelining:
filter out the right rows
extract the column with the count
sum the numbers
For example:
grep 'Ford:.*:Red:' carfile | cut -d: -f5 | paste -sd+ | bc
grep filters out the right rows
cut gets the last column
paste creates a line like 2+2
bc evaluates that expression and prints the sum
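With the sample file saved as carfile, the intermediate stages would look roughly like this:
$ grep 'Ford:.*:Red:' carfile | cut -d: -f5
2
2
$ grep 'Ford:.*:Red:' carfile | cut -d: -f5 | paste -sd+
2+2
$ grep 'Ford:.*:Red:' carfile | cut -d: -f5 | paste -sd+ | bc
4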
Another example:
sed -n 's/\(Ford:.*:Red\):\(.*\)/\2/p' carfile | paste -sd+ | bc
here the sed filters and extracts in one step
Another example - a different way of counting:
(echo 0 ; sed -n 's/\(Ford:.*:Red\):\(.*\)/\2+/p' carfile ;echo p )| dc
the numbers are summed by the RPN calculator called dc; it works like 0 2 +: first come the values, and last the operation.
the first echo pushes 0 onto the stack
the sed creates a stream of numbers like 2+ 2+
the last echo p prints the top of the stack
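For the sample data, dc therefore receives the little program below, which leaves 4 on the stack and prints it:
0
2+
2+
p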
Many other possibilities exist for summing a stream of numbers,
e.g. summing in bash:
sum=0
while read -r num
do
    sum=$(( sum + num ))
done < <(sed -n 's/\(Ford:.*:Red\):\(.*\)/\2/p' carfile)
echo $sum
and pure bash:
sum=0
while IFS=: read -r maker model year color count
do
    if [[ "$maker" == "Ford" && "$color" == "Red" ]]
    then
        (( sum += count ))
    fi
done < carfile
echo $sum
