finding contents of one file in another in Unix - bash

I am trying to search for the contents of one file (f1) in another file (f2) and print the successful matches.
I have tried various posted answers, as shown below, but none of them helps.
1.
awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' f1 f2
2.
while read name
do
awk '$1 ~ '$name'' f2| awk '{print $NF, $4}' >> f3
done < f1
3.
grep -F -f f1 f2 > f3
All of the above solutions also print non-matching entries from f2. Is there another way to do this?
I am looking for an exact match in my scenario.
Say for example
$cat f1
abc
def
ghi
$cat f2
this line has abc
bc
abc
de
this line has ghi
i
ghi
Expected output :
abc
ghi
Thank you for your help.

Try the command below; the -i flag makes the search case-insensitive:
grep -i -Fx -f search_this.txt search_in.txt
Demo session is below
$ cat search_this.txt
xxxx yyyy
kkkkkk
zzzzzzzz
$ cat search_in.txt
line does not contain any name
This person is xxxx yyyy good
xxxx yyyy
Another line which does not contain any name
Is kkkkkk a good name ?
kkkkkk
This name itself is sleeping ...zzzzzzzz
I can't find any other name
Let's try the command now
$ grep -i -Fx -f search_this.txt search_in.txt
xxxx yyyy
kkkkkk
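For reference, the flags doing the work here: -F treats the patterns as fixed strings (no regex), -x requires the whole line to match, and -f reads the patterns from a file. A minimal check of the whole-line behaviour:

```shell
# only the exact whole-line match survives; the substring match is filtered out by -x
printf 'abc\nthis line has abc\n' | grep -Fx 'abc'
# → abc
```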

This works for me, though I'm unsure whether it is safe from a variable-expansion point of view:
PATTERN=`cat f1`; pcregrep -M "$PATTERN" f2
For treating f2 as a set of patterns, each of which should be found, a solution seems to be here: finding contents of one file into another file in unix shell script
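If shell expansion of the pattern is the worry, an awk sketch (an alternative, not part of the pcregrep answer) reads f1 into a lookup table and never passes its contents through the shell:

```shell
# recreate the question's sample files for the demo
printf 'abc\ndef\nghi\n' > f1
printf 'this line has abc\nbc\nabc\nde\nthis line has ghi\ni\nghi\n' > f2
# while reading the first file (NR==FNR), remember each whole line;
# for the second file, print lines that are exact members of the table
awk 'NR==FNR{seen[$0]; next} $0 in seen' f1 f2
```

This prints abc and ghi, matching the expected output in the question.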


Delete values in line based on column index using shell script

I want to be able to delete the values to the RIGHT of a given column index in test.txt, for a given length N.
Column index refers to the character position when you open the file in the VIM editor in LINUX.
If my test.txt contains 1234 5678 and I call my delete_var function with column number 2 (where deletion starts) and length N = 2 (how much to delete), test.txt should become 14 5678: the values from column 2 through column 3 are deleted because the length to delete was 2.
I have the following code so far, but I am unable to work out what goes in the sed command.
delete_var() {
    sed -i -r 's/not sure what goes here' test.txt
}
clmn_index=$1
_N=$2
delete_var "$clmn_index" "$_N" # call the function with the column index and length to delete
#sample test.txt (before call to fn)
1234 5678
#sample test.txt (after call to fn)
14 5678
Can someone guide me?
You should avoid using regex for this task. It is easier to get this done in awk with simple substr function calls:
awk -v i=2 -v n=2 'i>0{$0 = substr($0, 1, i-1) substr($0, i+n)} 1' file
14 5678
Assuming OP must use sed (otherwise other options could include cut and awk, but those would require some extra file I/O to replace the original file with the modified results) ...
Starting with the sed command to remove the 2 characters starting in column 2:
$ echo '1234 5678' > test.txt
$ sed -i -r "s/(.{1}).{2}(.*$)/\1\2/g" test.txt
$ cat test.txt
14 5678
Where:
(.{1}) - match first character in line and store in buffer #1
.{2} - match next 2 characters but don't store in buffer
(.*$) - match rest of line and store in buffer #2
\1\2 - output contents of buffers #1 and #2
Now, how to get variables for start and length into the sed command?
Assume we have the following variables:
$ s=2 # start
$ n=2 # length
To map these variables into our sed command we can break the sed search-replace pattern into parts, replacing the first 1 and 2 with our variables like such:
replace {1} with {$((s-1))}
replace {2} with {${n}}
Bringing this all together gives us:
$ s=2
$ n=2
$ echo '1234 5678' > test.txt
$ set -x # echo what sed sees to verify the correct mappings:
$ sed -i -r "s/(.{"$((s-1))"}).{${n}}(.*$)/\1\2/g" test.txt
+ sed -i -r 's/(.{1}).{2}(.*$)/\1\2/g' test.txt
$ set +x
$ cat test.txt
14 5678
Alternatively, do the subtraction (s-1) before the sed call and just pass in the new variable, eg:
$ x=$((s-1))
$ sed -i -r "s/(.{${x}}).{${n}}(.*$)/\1\2/g" test.txt
$ cat test.txt
14 5678
One idea using cut, keeping in mind that storing the results back into the original file will require an intermediate file (eg, tmp.txt) ...
Assume our variables:
$ s=2 # start position
$ n=2 # length of string to remove
$ x=$((s-1)) # last column to keep before the deleted characters (1 in this case)
$ y=$((s+n)) # start of first column to keep after the deleted characters (4 in this case)
At this point we can use cut -c to designate the columns to keep:
$ echo '1234 5678' > test.txt
$ set -x # display the cut command with variables expanded
$ cut -c1-${x},${y}- test.txt
+ cut -c1-1,4- test.txt
14 5678
Where:
1-${x} - keep range of characters from position 1 to position ${x} (1-1 in this case)
${y}- - keep range of characters from position ${y} to end of line (4-EOL in this case)
NOTE: You could also use cut's ability to work with the complement (ie, explicitly tell what characters to remove ... as opposed to above which says what characters to keep). See KamilCuk's answer for an example.
Obviously (?) the above does not overwrite test.txt so you'd need an extra step, eg:
$ echo '1234 5678' > test.txt
$ cut -c1-${x},${y}- test.txt > tmp.txt # store result in intermediate file
$ cat tmp.txt > test.txt # copy intermediate file over original file
$ cat test.txt
14 5678
Looks like:
cut --complement -c $1-$(($1 + $2 - 1))
should just work, deleting the $2 columns starting at column $1.
Please provide code showing how to change test.txt.
cut can't modify in place. So either pipe to a temporary file or use sponge.
tmp=$(mktemp)
cut --complement -c $1-$(($1 + $2 - 1)) test.txt > "$tmp"
mv "$tmp" test.txt
The command below removes the 2nd character of each line. Try using it in a loop:
sed s/.//2 test.txt
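A sketch of that loop idea (s and n are hypothetical names for the start column and length): deleting the character at column s shifts the rest of the line left, so repeating the same deletion n times removes n characters.

```shell
s=2; n=2
echo '1234 5678' > test.txt
for ((i = 0; i < n; i++)); do
    sed -i "s/.//$s" test.txt    # delete the s-th character of each line
done
cat test.txt    # → 14 5678
```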

bash for loop only returns the last value x times (x = length of array)

I have a file with IDs such as below:
A
D
E
And I have a second file with the same IDs and extra info that I need:
A 50 G25T1 7.24 298
B 20 G234T2 8.3 80
C 5 G1I1 5.2 909
D 500 G458T3 0.4 79
E 321 G46I2 45.8 901
I want to output the third column of the second file, selecting rows by matching the first column of the second file against the IDs from the first file:
G25T1
G458T3
G46I2
The issue I have is that while the for loop runs, the output is as follows:
G46I2
G46I2
G46I2
Here is my code:
a=0
IFS=$'\r\n' command eval 'ids=($(awk '{print$1}' shared_single_copies.txt | sed -e 's/[[:space:]]//g'))'
for id in "${ids[#]}"; do
    a=$(($a+1))
    echo $a' '"$id"
    awk '{$1=="${id}"} END {print $3}' run_Busco_A1/A1_single_copy_ids.txt >> A1_genes_sc_Buscos.txt
done
Your code is way too complicated. Try one of these solutions: "file1" contains the ids, "file2" contains the extra info:
$ join -o 2.3 file1 file2
G25T1
G458T3
G46I2
$ awk 'NR==FNR {id[$1]; next} $1 in id {print $3}' file1 file2
G25T1
G458T3
G46I2
For more help about join, check the man page.
For more help about awk, start with the awk info page.
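One caveat the answer doesn't mention: join expects both inputs to be sorted on the join field. The sample files happen to be sorted already; for unsorted data a defensive sketch could be:

```shell
# recreate the sample inputs
printf 'A\nD\nE\n' > file1
printf '%s\n' 'A 50 G25T1 7.24 298' 'B 20 G234T2 8.3 80' \
    'C 5 G1I1 5.2 909' 'D 500 G458T3 0.4 79' 'E 321 G46I2 45.8 901' > file2
# sort both inputs on the fly before joining on the default first field
join -o 2.3 <(sort file1) <(sort file2)
```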
@glenn jackman's answer was by far the most succinct and elegant, imo. If you want to use loops, though, then this can work:
#!/bin/bash
# if output file already exists, clear it so we don't
# inadvertently duplicate data:
> A1_genes_sc_Buscos.txt
while read -r selector
do
while read -r c1 c2 c3 garbage
do
[[ "$c1" = "$selector" ]] && echo "$c3" >> A1_genes_sc_Buscos.txt
done < run_Busco_A1/A1_single_copy_ids.txt
done < shared_single_copies.txt
That should work for your use-case, provided your real files match the formatting you showed.

Replace a word of a line if matched

I am given a file. If a line has "xxx" as its third word then I need to replace it with "yyy". My final output must have all the original lines with the modified lines.
The input file is-
abc xyz mno
xxx xyz abc
abc xyz xxx
abc xxx xxx xxx
The required output file should be-
abc xyz mno
xxx xyz abc
abc xyz yyy
abc xxx yyy xxx
I have tried-
grep "\bxxx\b" file.txt | awk '{if ($3=="xxx") print $0;}' | sed -e 's/[^ ]*[^ ]/yyy/3'
but this gives the output as-
abc xyz yyy
abc xxx yyy xxx
The following simple awk may help here:
awk '$3=="xxx"{$3="yyy"} 1' Input_file
Output will be as follows.
abc xyz mno
xxx xyz abc
abc xyz yyy
abc xxx yyy xxx
Explanation: check whether the 3rd field ($3) equals the string xxx; if so, set $3 to the string yyy. The trailing 1 is there because awk works on the pattern { action } model: 1 is an always-true pattern with no action, so the default action, printing the current line (with its 3rd field changed or not), takes place.
sed solution:
sed -E 's/^(([^[:space:]]+[[:space:]]+){2})xxx\>/\1yyy/' file
The output:
abc xyz mno
xxx xyz abc
abc xyz yyy
abc xxx yyy xxx
To modify the file in place, add the -i option: sed -Ei ....
In general the awk command may look like
awk '{command set 1}condition{command set 2}' file
The command set 1 would be executed for every line while command set 2 will be executed if the condition preceding that is true.
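A toy illustration of that structure (made-up data): the first block runs for every line, while the second runs only when its condition is true.

```shell
printf 'a 1\nb 2\n' | awk '{print "saw:", $1} $2==2 {print "matched:", $1}'
# saw: a
# saw: b
# matched: b
```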
My final output must have all the original lines with the modified lines
In your case
awk 'BEGIN{print "Original File";i=1}
{print}
$3=="xxx"{$3="yyy"}
{rec[i++]=$0}
END{print "Modified File";for(i=1;i<=NR;i++)print rec[i]}' file
should solve that.
Explanation
$3 is the third space-delimited field in awk. If it matches "xxx", it is replaced. The unmodified lines are printed first while the modified lines are stored in an array; at the end, the modified lines are printed. The BEGIN and END blocks are executed only at the beginning and the end respectively. NR is the awk built-in variable holding the number of records processed so far; since it is used in the END block, it gives the total number of records.
All good :-)
Ravinder has already provided you with the shortest awk solution possible.
In sed, the following would work:
sed -E 's/(([^ ]+ ){2})xxx/\1yyy/'
Or if your sed doesn't include -E, you can use the more painful BRE notation:
sed 's/\(\([^ ][^ ]* \)\{2\}\)xxx/\1yyy/'
And if you're in the mood to handle this in bash alone, something like this might work:
while read -r line; do
read -r -a a <<<"$line"
[[ "${a[2]}" == "xxx" ]] && a[2]="yyy"
printf '%s ' "${a[@]}"
printf '\n'
done < input.txt

grep word from column and remove row

I want to grep a word in a particular column of a file, then remove the matching rows and put all remaining rows into another file.
Could anyone help me with a shell command to get the following output?
I have a file with this format:
1234 8976 897561234 1234 678901234
5678 5678 123456789 4567 123456790
1234 1234 087664566 4567 678990000
1223 6586 212134344 8906 123456789
I want to match the word "1234" in the second column alone, remove those rows, and put the remaining rows in another file. So the output should be in this format:
1234 8976 897561234 1234 678901234
5678 5678 123456789 4567 123456790
1223 6586 212134344 8906 123456789
The output should have 3 of the 4 rows; only the 3rd row is removed.
while read value ;do
grep -v ${value:0:10} /tmp/lakshmi.txt > /tmp/output.txt
cp /tmp/output.txt /tmp/no_post1.txt
done < /tmp/priya.txt
Could you please help me to modify this script?
You can use awk for this, if that's good for you:
awk '$2==1234' <file-name>
$2 represents the second column, and the command returns the line:
1234 1234 087664566 4567 678990000
Then you can use sed, grep -v, or even awk for further processing: either delete this line from the current file, or print only the lines that do not match into another file. awk will be much easier and more powerful.
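For instance, a minimal awk sketch of that "print only the non-matching lines into another file" step (file names are assumptions):

```shell
# recreate the sample input
printf '%s\n' '1234 8976 897561234 1234 678901234' \
    '5678 5678 123456789 4567 123456790' \
    '1234 1234 087664566 4567 678990000' \
    '1223 6586 212134344 8906 123456789' > input.txt
# keep only rows whose second column is not 1234
awk '$2 != 1234' input.txt > remaining.txt
```

remaining.txt then holds the three expected rows.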
Try the following regular expression.
egrep -v "^[[:space:]]*[^[:space:]]+[[:space:]]+1234[[:space:]]+.*$"
Not sure what your intention is, but my best guess is that you want to do the following.
while read value ;do
egrep -v "^[[:space:]]*[^[:space:]]+[[:space:]]+${value:0:10}[[:space:]]+.*$" /tmp/lakshmi.txt > /tmp/output.txt
cp /tmp/output.txt /tmp/no_post1.txt
done < /tmp/priya.txt
For columnar data, awk is often the best tool to use.
Superficially, if your input data is in priya.txt and you want the output in lakshmi.txt, then this would do the job:
awk '$2==1234 { next } { print }' priya.txt > lakshmi.txt
The first pattern detects 1234 (and also 01234 and 0001234) in column 2 and executes a next which skips the rest of the script. The rest of the script prints the input data; people often use 1 in place of { print }, which achieves the same effect less verbosely (or less clearly).
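Using the terser 1 idiom mentioned above, the same filter would read (equivalent behaviour, with an abbreviated sample standing in for priya.txt):

```shell
printf '1234 8976 x\n5678 5678 y\n1234 1234 z\n' > priya.txt
# skip rows whose second column is 1234; 1 prints everything else
awk '$2==1234 {next} 1' priya.txt > lakshmi.txt
```

lakshmi.txt keeps the rows whose second column is not 1234.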
If you want the line(s) with 1234 in another file (filtered.out, say), then you'd use:
awk '$2==1234 { print > "filtered.out"; next } { print }' priya.txt > lakshmi.txt
If the column must be exactly 1234, rather than just numerically equal to 1234, then you would use a regex match instead:
awk '$2 ~ /^1234$/ { next } { print }' priya.txt > lakshmi.txt
The great thing about awk is that it splits the data into fields automatically, and that usually makes it easy to process columnar data with awk. You can also use Perl or Python or other similar scripting languages to do much the same job.
You did not specify the record layout exactly. If an empty first field is represented by 4 spaces, the clever field-based solutions will fail. Can a field contain a space?
When your fields have fixed offsets, you might want to check the offset:
grep -v "^.\{9\}1234"
When /tmp/priya.txt has more than 1 line, your while loop becomes ugly:
cp /tmp/lakshmi.txt /tmp/output.txt
while read value ;do
grep -v "^.\{9\}${value}" /tmp/output.txt > /tmp/output2.txt
mv /tmp/output2.txt /tmp/output.txt
done < /tmp/priya.txt
You can also use the -f option of grep:
echo "1234 8976 897561234 1234 678901234
5678 5678 123456789 4567 123456790
1234 1234 087664566 4567 678990000
1223 6586 212134344 8906 123456789" |grep -vf <(sed 's/^/^.\\{9\\}/' /tmp/priya.txt )
or in your case
grep -vf <(sed 's/^/^.\\{9\\}/' /tmp/priya.txt ) /tmp/lakshmi.txt

Cut first appearing pattern from line

I have a file, say xyz, containing records like:
$cat xyz
ABC
ABCABC
ABCABCABC
I want to cut the first occurrence of the pattern, so the result should be:
AC
ACABC
ACABCABC
I am trying to cut the pattern using awk like this:
$ cat xyz | awk -F 'B' '{print $1,$2}'
A C
A CA
A CA
Of course, B is the delimiter, so I am getting the above result. How could I do it?
Thanks
I understand you want to delete the first B in each line. If so, this will work:
sed 's/B//' xyz
Output:
AC
ACABC
ACABCABC
If you want the file to be modified in place, add -i:
sed -i 's/B//' xyz
I see you tried to edit my answer to add a new question - note that you should instead update your question or write in the comments.
Thanks, and I have one more case: I want to delete the first pattern only if the pattern occurs more than once in the line, like:
$cat xyz
ABC
ABCABC
ABCABCABC
Output should be:
ABC
ACABC
ACABCABC
This can be a way to do it:
while read -r line
do
    if [ "$(echo "$line" | grep -o "B" | wc -l)" -ge 2 ]
    then
        echo "$line" | sed 's/B//'
    else
        echo "$line"
    fi
done < xyz
Output:
ABC
ACABC
ACABCABC
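The same conditional can also be handled in awk alone, avoiding the per-line grep and wc processes; a sketch (not from the original answer): gsub(/B/, "&") replaces every B with itself, leaving the line unchanged but returning the count of matches.

```shell
# delete the first B only on lines that contain at least two Bs
printf 'ABC\nABCABC\nABCABCABC\n' |
    awk '{ if (gsub(/B/, "&") >= 2) sub(/B/, "") } 1'
# ABC
# ACABC
# ACABCABC
```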
