grep remove exact matches from line without removing the whole line - bash

I have a patterns.txt file and I would like to remove all exact matches of its patterns from FILE.txt. FILE.txt contains the following:
word1 word2
word3 word4
word5 word6
The pattern file contains:
word1
word6
The expected output is:
word2
word3 word4
word5
The command below removes the whole row where there is an exact match. How can I only remove the exact match from a line without removing the whole line? I don't want to use for-loops to achieve this.
cat FILE.txt | grep -wvf patterns.txt

With sed:
re=$(tr '\n' '|' < patterns.txt)
sed -r "s/$re//; s/^[[:space:]]*//" file
word2
word3 word4
word5
Note: make sure patterns.txt has no trailing newline or blank lines, since each one adds an extra | to the regex.
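If you would rather not worry about that, a rough variant (my own sketch, assuming GNU sed and that the patterns contain no regex metacharacters) builds the alternation with paste and deletes whole-word matches only:
# build word1|word6 with no trailing |, then remove whole-word matches and tidy the spacing
re=$(paste -sd'|' patterns.txt)
sed -E "s/\<($re)\>//g; s/^ +//; s/ +$//; s/ +/ /g" FILE.txt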

You may try this awk:
awk 'FNR == NR {pats[$1]; next} {more=0; for (i=1; i<=NF; ++i) if (!($i in pats)) printf "%s", (more++ ? OFS : "") $i; print ""}' patterns.txt file
word2
word3 word4
word5
A more readable version:
awk '
FNR == NR {
    pats[$1]
    next
}
{
    more = 0
    for (i=1; i<=NF; ++i)
        if (!($i in pats))
            printf "%s", (more++ ? OFS : "") $i
    print ""
}' patterns.txt file

In order to split the text with a word (string) as a delimiter, you can use awk.
awk -F 'word' '{print $1;print $2}' file.txt
If you want to display only what comes after the delimiter, it would be:
awk -F 'word' '{print $2}' file.txt
If you need to change the pattern repeatedly, you may have to create a loop, as sketched below.
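A rough sketch of such a loop (purely illustrative; note the original question explicitly wants to avoid loops):
# hypothetical: run the awk split once per pattern read from patterns.txt
while IFS= read -r pat; do
    awk -F "$pat" '{print $2}' file.txt
done < patterns.txt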

My first thought was to do just what @anubhava did. Then I thought that perl might be good for this: perl has good capabilities for filtering lists. The problem is that perl doesn't have an FNR variable. But I played around with a2p and came up with this:
perl -lane '
    $FNR = $. - $FNRbase;
    if ($. == $FNR) {
        $ignore{$F[0]} = 1;
    } else {
        print join " ", grep {not exists $ignore{$_}} @F;
    }
} continue {
    $FNRbase = $. if eof
' patterns.txt FILE.txt
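For the sample patterns.txt and FILE.txt above, this should print:
word2
word3 word4
word5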

Related

Bash Unix search for a list of words in multiple files

I have a list of words I need to check in more than one hundred text files.
My list of words is in a file named word2search.txt.
This text file contains N words:
Word1
Word2
Word3
Word4
Word5
Word6
Wordn
So far I've written this bash script:
#!/bin/bash
listOfWord2Find=/home/mobaxterm/MyDocuments/word2search.txt
while IFS= read -r listOfWord2Find
do
echo "$listOfWord2Find"
grep -l -R "$listOfWord2Find" /home/mobaxterm/MyDocuments/txt/*.txt
echo "================================================================="
done <"$listOfWord2Find"
The result does not satisfy me; I can hardly make use of it:
Word1
/home/mobaxterm/MyDocuments/txt/new 6.txt
/home/mobaxterm/MyDocuments/txt/file1.txt
/home/mobaxterm/MyDocuments/txt/file2.txt
/home/mobaxterm/MyDocuments/txt/file3.txt
=================================================================
Word2
/home/mobaxterm/MyDocuments/txt/new 6.txt
/home/mobaxterm/MyDocuments/txt/file1.txt
=================================================================
Word3
/home/mobaxterm/MyDocuments/txt/new 6.txt
/home/mobaxterm/MyDocuments/txt/file4.txt
/home/mobaxterm/MyDocuments/txt/file5.txt
/home/mobaxterm/MyDocuments/txt/file1.txt
=================================================================
Word4
/home/mobaxterm/MyDocuments/txt/new 6.txt
/home/mobaxterm/MyDocuments/txt/file1.txt
=================================================================
Word5
/home/mobaxterm/MyDocuments/txt/new 6.txt
=================================================================
This is what i want to see :
/home/mobaxterm/MyDocuments/txt/file1.txt : Word1, Word2, Word3, Word4
/home/mobaxterm/MyDocuments/txt/file2.txt : Word1
/home/mobaxterm/MyDocuments/txt/file3.txt : Word1
/home/mobaxterm/MyDocuments/txt/file4.txt : Word3
/home/mobaxterm/MyDocuments/txt/file5.txt : Word3
/home/mobaxterm/MyDocuments/txt/new 6.txt : Word1, Word2, Word3, Word4, Word5, Word6
I do not understand why my script doesn't show me Word6 (there are files which contain this word); it stops at Word5. To work around this, I've added an extra line "blablabla" (a word I'm sure not to find).
If you can help me on this subject :)
Thank you.
A much more elegant approach is to search for all the words in each file, one file at a time.
Use grep's multi-pattern option -f, --file=FILE, and print only the matching words with -o, --only-matching.
Then pipe the resulting words through a few filters to massage them into a CSV list.
Like this:
script.sh
#!/bin/bash
for currFile in "$@"; do
    matched_words_list=$(grep --only-matching --file="$WORDS_LIST" "$currFile" | sort | uniq | awk -v ORS=', ' 1 | sed 's/, $//')
    printf "%s : %s\n" "$currFile" "$matched_words_list"
done
script.sh output
Pass the words list file in the environment variable WORDS_LIST, and the inspected files as the argument list input.*.txt:
export WORDS_LIST=./words.txt; ./script.sh input.*.txt
input.1.txt : word1, word2
input.2.txt : word4
input.3.txt :
Explanation:
using words.txt:
word2
word1
word5
word4
using input.1.txt:
word1
word2
word3
word3
word1
word3
And the pipeline that massages the grep output:
grep --file=words.txt -o input.1.txt |sort|uniq|awk -vORS=, 1|sed s/,$//
word1,word2
output 1
List all matched words from words.txt in inspected file input.1.txt
grep --file=words.txt -o input.1.txt
word1
word2
word1
output 2
List all matched words from words.txt in inspected file input.1.txt
Then sort the output words list
grep --file=words.txt -o input.1.txt|sort
word1
word1
word2
output 3
List all matched words from words.txt in inspected file input.1.txt
Then sort the output words list
Then remove duplicate words
grep --file=words.txt -o input.1.txt|sort|uniq
word1
word2
output 4
List all matched words from words.txt in inspected file input.1.txt
Then sort the output words list
Then remove duplicate words
Then create a CSV list from the unique words
grep --file=words.txt -o input.1.txt|sort|uniq|awk -vORS=, 1
word1,word2,
output 5
List all matched words from words.txt in inspected file input.1.txt
Then sort the output words list
Then remove duplicate words
Then create a CSV list from the unique words
Then remove the trailing , from the CSV list
grep --file=words.txt -o input.1.txt|sort|uniq|awk -vORS=, 1|sed s/,$//
word1,word2
The suggested strategy is to scan each line once against all the words.
I suggest writing a gawk script (gawk is the default awk on most Linux distributions; note that ENDFILE below is a gawk extension):
script.awk
FNR == NR {                              # only in the first file, which holds the match-words list
    matchWordsArr[++wordsCount] = $0;    # read match words into an ordered array
    matchedWordInFile[wordsCount] = 0;   # reset matchedWordInFile array
}
FNR != NR {                              # read a line of an inspected file
    for (i in matchWordsArr) {           # scan the line for all match words
        if ($0 ~ matchWordsArr[i]) matchedWordInFile[i]++;  # if the word is matched, increment its counter
    }
}
ENDFILE {                                # on each file read completion
    if (FNR != NR) {                     # if not the first file
        outputLine = sprintf("%s: ", FILENAME);     # start outputLine with the current file name
        for (i in matchWordsArr) {                  # iterate over the match words
            if (matchedWordInFile[i] == 0) continue;    # skip unmatched words
            outputLine = sprintf("%s%s%s", outputLine, separator, matchWordsArr[i]);  # append matched word
            matchedWordInFile[i] = 0;               # reset the matched-words counter
            separator = ",";                        # set the word-list separator ","
        }
        print outputLine;
    }
    outputLine = separator = "";         # reset separator and outputLine
}
input.1.txt:
word1
word2
word3
input.2.txt:
word3
word4
word5
input.3.txt:
word3
word7
word8
words.txt
word2
word1
word5
word4
running:
$ awk -f script.awk words.txt input.*.txt
input.1.txt: word2,word1
input.2.txt: word5,word4
input.3.txt:
Just grep:
grep -f list.txt input.*.txt
-f FILENAME allows you to give grep a file of patterns to search for.
If you want to display the filename along with the match, pass -H in addition to that:
grep -Hf list.txt input.*.txt
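To get the exact "file : word1, word2, ..." layout asked for, a rough one-pass sketch (my own, assuming GNU grep and that the file paths contain no ':' characters) could be:
grep -oHwFf word2search.txt /home/mobaxterm/MyDocuments/txt/*.txt |
sort -u |
awk -F: '{words[$1] = (words[$1] ? words[$1] ", " : "") $2}
         END {for (f in words) printf "%s : %s\n", f, words[f]}'
The order in which the files are printed is not guaranteed; pipe the result through sort if that matters.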

How to build a tab-delimited text file with many calculated values?

Background
From a set of similar reference files (sample1.txt, sample2.txt, etc.), I am calculating or retrieving data for many different parameters (pc.genes, pc.transcripts, pc.genes.antisense, etc.).
A simplified example of a single ref.file (e.g., sample1.txt):
word1 word2 word3 405438 409170 . Y . word4; word5
word1 word2 word3 405438 409170 . N . word4; word5
word1 word2 word3 409006 409170 . N . word4; word5
word1 word2 word3 405438 408401 . Y . word4; word5
word1 word2 word3 407099 408361 . N 0 word4; word5
A calculation for “avg.exons” parameter might look like:
$ awk '$3 == "word3"' sample1.txt | sed -n 's/.*word4 \([^;]*\).*word5 \([^;]*\).*/\1;\2/p' | awk -F';' '{a[$1]++}END{for (i in a) {count++; sum += a[i]} print sum/count}'
5.96732
A retrieval for “pc.genes” parameter might look like:
$ awk '$3 == "word3"' sample1.txt | grep -c "word4"
19062
These are just examples in case the solution requires the commands to be piped to a function that transfers/adds them to a table. The output value of these commands is always a single number.
Desired Output
I would like to put these calculated/retrieved values into an organized table format (preferably a tab-delimited text file) so that I can generate plots from the data:
ref.file pc.genes pc.transcripts pc.genes.antisense pc.genes.sense avg.exons avg.genelength
sample1.txt 19062 116573 2585576 1318321 5.96732 3732.57
sample2.txt 19753 138563 5834759 1433785 5.84654 4023.89
sample3.txt 19376 124576 2871235 1983263 6.78929 3890.32
Is this possible? And if so, how can I achieve this?
Attempt
for file in sample*.txt
do
printf "%s\n" ref.file pc.genes pc.transcripts pc.genes.antisense pc.genes.sense avg.exons avg.genelength | paste -sd $'\t'
pc.genes=$(awk '$3 == "word3"' ${file} | grep -c "word4")
avg.exons=$(awk '$3 == "word3"' ${file} | sed -n 's/.*word4 \([^;]*\).*word5 \([^;]*\).*/\1;\2/p' | awk -F';' '{a[$1]++}END{for (i in a) {count++; sum += a[i]} print sum/count}')
... # get rest of desired values
done > table.txt
Resulting Errors
-bash: pc.genes=19062: command not found
... # other errors with corresponding CORRECT value outputs
-bash: avg.exons=5.96732: command not found
... # the errors even continue into the other sample*.txt files, which is good
-bash: pc.genes=19753: command not found
...
All of the values corresponding to a given parameter (i.e., "=###") are correct, but the error is preventing them from being put in the table.
Based solely on the details provided by the OP, and assuming a looping construct is being used to process a single file at a time, something like:
# print header
printf "ref.file\tpc.genes\tpc.transcripts\tpc.genes.antisense\tpc.genes.sense\tavg.exons\tavg.genelength\n"
while read -r fn
do
aexons=$(awk '$3 == "word1"' ${fn} | sed -n 's/.*word2 \([^;]*\).*word3 \([^;]*\).*/\1;\2/p' | awk -F';' '{a[$1]++}END{for (i in a) {count++; sum += a[i]} print sum/count}')
pgenes=$(awk '$3 == "word1"' ${fn} | grep -c "word2")
... # get rest of desired values
# print tab-delimited output to stdout; adjust formats as needed
printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" ${fn} ${pgenes} .... ${axeons} ...
done < <('ls' sample*.txt) # replace with whatever logic OP is using to find desired files
While the above should work, it's not very efficient, what with all of the subprocess calls ($(...); piped commands) and the need to process each input file (${fn}) six times (once per value).
A more efficient method would look at processing each input file (${fn}) just once.
An additional step might be to eliminate the loop in favor of a single program to process all files in one pass.
Since awk is capable of parsing data (from multiple files), calculating sums/averages, and generating (tab-delimited) output, I'd probably be leaning towards a single awk command/invocation as a more efficient solution ... but can't tell for sure without sample data and more details on the desired calculations.
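As a rough illustration only (not a drop-in replacement, since the real calculations are not fully specified), a single gawk pass over all the files could compute, say, pc.genes and avg.exons like this; the regex capture only approximates the OP's sed extraction, and ENDFILE is a gawk extension:
gawk '
BEGIN { OFS = "\t"; print "ref.file", "pc.genes", "avg.exons" }    # header row
$3 == "word3" {
    if ($0 ~ /word4/) pc_genes++                   # what grep -c "word4" counted
    if (match($0, /word4 ([^;]*)/, m)) a[m[1]]++   # group lines by their word4 value
}
ENDFILE {                                          # runs after each input file (gawk only)
    n = 0; sum = 0
    for (k in a) { n++; sum += a[k] }
    print FILENAME, pc_genes + 0, (n ? sum / n : "NA")
    pc_genes = 0; delete a                         # reset per-file state
}' sample*.txt > table.txt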
The following answer works perfectly and comes from the combined suggestions of markp-fuso and KamilCuk. Thank you both! (The key change is dropping the dots from the variable names: bash identifiers cannot contain ., which is what produced the pc.genes=19062: command not found errors.)
# add the table headers
printf "%s\n" ref.file pc.genes pc.transcripts pc.genes.antisense pc.genes.sense avg.exons avg.genelength | paste -sd $'\t'
for file in sample*.txt
do
# create variables holding the results of all parameter calculations/retrievals
pcgenes=$(awk '$3 == "word3"' ${file} | grep -c "word4")
pctranscripts=$(...)
pcgenesantisense=$(...)
pcgenessense=$(...)
avgexons=$(awk '$3 == "word3"' ${file} | sed -n 's/.*word4 \([^;]*\).*word5 \([^;]*\).*/\1;\2/p' | awk -F';' '{a[$1]++}END{for (i in a) {count++; sum += a[i]} print sum/count}')
avggenelength=$(...)
# print all resulting values in a single tab separated row of the table
printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" ${file} ${pcgenes} ${pctranscripts} ${pcgenesantisense} ${pcgenessense} ${avgexons} ${avggenelength}
done > table.txt

How to find lines of a file where the first 2 words differ from the previous and next line

Consider the following file:
word1 word2 word3
word1 word2 word3
word6 word7 word8
word6 word7 word9
word9 word10 word4
word1 word2 word5
word1 word2 word5
I'm looking for a shell command line that outputs the lines whose first 2 words differ from those of both the previous and the next line.
Expected output:
word9 word10 word4
Any idea?
case 1: each line has the same number of words (fields)
uniq can skip initial fields but not trailing fields
rev reverses the characters on a line
Since each line has the same number of fields (one trailing field after the first two), we can do:
<file rev | uniq -u -f1 | rev
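On the sample file this should print only:
word9 word10 word4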
case 2: arbitrary number of words on each line
We can write an awk script that keeps track of the current and the previous two lines and prints the previous one when appropriate:
awk <file '
{
    # does current line match previous line?
    diff = !( $1==p1 && $2==p2 )
    # print stashed line if not duplicate
    if (diff && pdiff) print p0
    # stash current line data
    pdiff=diff; p0=$0; p1=$1; p2=$2
}
END {
    # print the final line if appropriate
    if (pdiff) print p0
}
'
I guess there is some redundancy here, but it works:
$ awk '{k=$1 FS $2}
k!=p && p!=pp {print p0}
{p0=$0; pp=p; p=k}
END {if(p!=pp) print}' file
word9 word10 word4

copy the last column to the first position with awk

I want to copy the value of the last column to the first position and comment out the old value.
For example:
word1 word2 1233425 -----> 1233425 word1 word2 #1233425
word1 word2 word3 49586 -----> 49586 word1 word2 word3 #49586
I don't know the number of words preceding the number.
I tried with an awk script:
awk '{$1="";score=$NF;$NF="";print $score $0 #$score}' file
But it does not work.
What about this? It is pretty similar to yours.
$ awk '{score=$NF; $NF="#"$NF; print score, $0}' file
1233425 word1 word2 #1233425
49586 word1 word2 word3 #49586
Note that in your case you are emptying $1, which is not necessary. Just store score as you did and then add # to the beginning of $NF.
Using awk
awk '{f=$NF;$NF="#" $NF;print f,$0}' file
Since we posted the same answer, here is a shorter variation :)
awk '{$0=$NF FS$0;$NF="#"$NF}1' file
$0=$NF FS$0 prepends the last field to the line
$NF="#"$NF adds # to the (new) last field
1 prints the line
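For the sample input this should print the same result:
1233425 word1 word2 #1233425
49586 word1 word2 word3 #49586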
A perl way to do it:
perl -pe 's/^(.+ )(\d+)/$2 $1 #$2/' infile
sed 's/\(.*\) \([^[:blank:]]\{1,\}\)/\2 \1 #\2/' YourFile
with GNU sed, add the --posix option

Removing line numbers (not entire line) from a file in unix

How can I remove line numbers from a file if the line numbers have been added by 'nl'?
example:
file1:
word1 word2
word3
word4
After command: nl file1 > file2
This is the exact command used.
file2:
1 word1 word2
2 word3
3 word4
Here is the part the question revolves around:
I want to remove the line numbers from file2 and store the result in file3 (or, if possible, remove the numbers from file2 in place).
file3:
word1 word2
word3
word4
sed 's/ *[0-9]*.//' file2 > file3
Yep, a couple of options.
You can use awk (with a tab as the field separator, since nl puts a tab between the number and the text):
cat file2 | awk -F'\t' '{print $2}' > newfile
You can use cut:
cat file2 | cut -f2 > newfile
The cut utility will work: nl separates the number from the text with a single tab, and tab is cut's default delimiter, so cut -f2 already gives you the whole remainder of the line; cut -f2- would also cover the case of more tab-separated columns.
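For example, to strip the numbers that nl added (relying on the tab separator nl inserts):
cut -f2- file2 > file3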
Something like this should solve it:
cut -f2- < file2
This will remove only the first field (the number) from each line of file2 and put the rest in file3; the sub() strips the leading space awk leaves behind when it rebuilds the line:
awk '{$1 = ""; sub(/^ /, ""); print}' file2 > file3
file3:
word1 word2
word3
word4
Assuming you can't just cp file1 file3....
nl file1 > file2
sed 's/^[[:space:]]*[0-9][0-9]*[[:space:]]*//' file2 > file3
