grep -Ff producing invalid output - bash

I'm using
grep -Ff list.txt C:/data/*.txt > found.txt
but it keeps producing invalid output; the matched lines don't contain the emails I put in the list.
list.txt contains -
email#email.com
customer#email.com
imadmin#gmail.com
newcustomer#email.com
helloworld#yes.com
and so on; one email to match per line.
search files contain -
user1:phonenumber1:email#email.com:last-active:recent
user2:phonennumber2:customer#email.com:last-active:inactive
user3:phonenumber3:blablarandom#bla.com:last-active:never
then another may contain -
blublublu email#email.com phonenumber subscribed
nanananana customer#email.com phonenumber unsubscribed
useruser noemailinput#noemail.com phonenumber pending
So what I'm trying to do is give grep a list of emails/strings (list.txt), have it search the directory provided for matches of each string, and output the entire line that contains each match.
example of output in this case would be -
user1:phonenumber1:email#email.com:last-active:recent
user2:phonennumber2:customer#email.com:last-active:inactive
blublublu email#email.com phonenumber subscribed
nanananana customer#email.com phonenumber unsubscribed
yet it wouldn't output the other two lines -
user3:phonenumber3:blablarandom#bla.com:last-active:never
useruser noemailinput#noemail.com phonenumber pending
because no string is within that line.

The file list.txt probably contains empty lines or some of the separator characters used in the data files. When I added : to list.txt, all the lines from the first sample started to match. Similarly, adding a space made all the lines from the second sample match, and adding # causes the same symptoms.
Try running grep -oFf ... (if your grep supports -o) to see the exact matching parts. If there are empty lines in list.txt, there will be fewer -o matches than matches without -o. Search the -o output for extremely short strings to spot anything suspicious. You can also examine the shortest lines in list.txt:
while read line ; do echo ${#line} "$line" ; done < list.txt | sort -nk1,1
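As a concrete sketch of the -o check above (assuming your grep supports -o and -h, as GNU grep does), listing the matched fragments by frequency makes a stray separator or a very short pattern easy to spot:
grep -hoFf list.txt C:/data/*.txt | sort | uniq -c | sort -rn | head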

I think your file list.txt may have blank lines in it, causing it to match every line in the files specified with C:/data/*.txt. To fix this, either delete every empty line manually or run sed -i '/^$/d' list.txt, where the -i flag edits the file in place.
The issue may also be related to DOS carriage returns. Try running cat -v list.txt and check whether the lines end in ^M:
email#email.com^M
customer#email.com^M
If this is the case, you will need to fix the file using either dos2unix or tr -d '\r' < list.txt > output.txt.
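Putting both fixes together, one way to clean the list before searching (the cleaned file name here is just a placeholder) would be:
tr -d '\r' < list.txt | sed '/^$/d' > list.clean.txt
grep -Ff list.clean.txt C:/data/*.txt > found.txt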

Related

Extracting a value from the same file in multiple directories

Directory name F1 F2 F3……F120
Inside each directory, a file with a common name ‘xyz.txt’
File xyz.txt has a value
Example:
F1
Xyz.txt
3.345e-2
F2
Xyz.txt
2.345e-2
F3
Xyz.txt
1.345e-2
--
F120
Xyz.txt
0.345e-2
I want to extract these values and put them into a single file, say ‘new.txt’, in a column like
New.txt
3.345e-2
2.345e-2
1.345e-2
---
0.345e-2
Any help please? Thank you so much.
If your files look very similar then you can use grep. For example:
cat F{1..120}/xyz.txt | grep -E '^[0-9][.][0-9]{3}e-[0-9]$' > new.txt
This is a general pattern, since any of the digits can be anything. The regular expression says that the whole line must consist of: any digit [0-9], a dot character [.], three digits [0-9]{3}, the characters 'e-' and any digit [0-9].
If your data is more regular, you can also try a simpler solution:
cat F{1..120}/xyz.txt | grep -E '^[0-9][.]345e-2$' > new.txt
In this solution only the first digit can be anything.
If your files might contain more than just that one line, but the line you want can be unambiguously extracted with a regex, you can use
sed -n '/^[0-9]\.[0-9]*e-*[0-9]*$/p' F*/Xyz.txt >new.txt
The same can be done with grep, but you have to separately tell it to not print the file name. The -x option can be used as a convenience to simplify the regex.
grep -h -x '[0-9]\.[0-9]*e-*[0-9]*' F*/Xyz.txt >new.txt
If you have some files which match the wildcard which should be excluded, try a more complex wildcard, or multiple wildcards which only match exactly the files you want, like maybe F[1-9]/Xyz.txt F[1-9][0-9]/Xyz.txt F1[0-9][0-9]/Xyz.txt
This might work for you (GNU parallel and grep):
parallel -k grep -hE '^[0-9][.][0-9]{3}e-[0-9]$' F{}/xyz.txt ::: {1..120}
Process files in parallel but output results in order.
If the files contain just one line, and you want the whole thing, you can use bash brace expansion:
cat /path/to/F{1..120}/Xyz.txt > output.txt
(this keeps the order too).
If the files have more lines, and you need to actually extract the value, use grep -o (-o is not POSIX, but your grep probably has it), with -h to suppress the file names:
grep -ho '[0-9]\.345e-2' /path/to/F{1..120}/Xyz.txt > output.txt

grep of 50000 strings in a big file performance improvement

I have a file of about 200 MB, with about 1.2 M lines in it; call it reading.txt. I have another file, input.txt,
in which there are about 50000 lines. I want to take the string on each line of input.txt and grep for it in reading.txt. For each matched line
in reading.txt, I want to write that complete line to another file, output.txt.
As of now, I am looping through every string of input.txt and grepping reading.txt. This approach takes more than an hour.
Is there any way to improve performance so that this process takes less time?
while read line
do
    LC_ALL=C grep ${line} reading.txt 2>/dev/null
done < input.txt >> output.txt
man grep yields (among others):
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. If this option is used
multiple times or is combined with the -e (--regexp) option,
search for all patterns given. The empty file contains zero
patterns, and therefore matches nothing.
grep -f input.txt reading.txt > output.txt
...will print, to 'output.txt', all lines of 'reading.txt' that contain a substring matching a line in 'input.txt', in the order of 'reading.txt'.
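Since the lines in input.txt appear to be fixed strings rather than regular expressions, adding -F (fixed-string matching) on top of -f, together with LC_ALL=C as in your loop, usually speeds this up considerably; a sketch, assuming fixed strings:
LC_ALL=C grep -F -f input.txt reading.txt > output.txt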
You don't specify this, but it may be relevant (you said 1.2 M lines in 'reading.txt') - if you want a separate output file for every matching line:
#!/bin/sh
nl='
'
IFS=$nl
c=0
for i in $(grep -f input.txt reading.txt); do
    c=$((c+1))
    echo "$i" > output$c.txt
done
There are tidier methods of setting IFS to a newline, for example in bash: IFS=$'\n' (you can also use > output$((++c)).txt in bash).
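For illustration, a minimal bash sketch of that tidier variant, reading the matches line by line instead of word-splitting a command substitution:
#!/bin/bash
c=0
while IFS= read -r line; do
    printf '%s\n' "$line" > "output$((++c)).txt"   # one file per matching line
done < <(grep -f input.txt reading.txt)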

Copying first lines of multiple text files into single file

Using a single bash command (pipes, stdio allowed),
copy the first line of each file whose name begins with ABC to a file named DEF.
Example:
Input:
ABC0:
qwe\n
rty\n
uio\n
ABC1:
asd\n
fgh\n
jkl\n
ABC2:
zxc\n
bvn\n
m,.\n
Result:
DEF:
qwe\n
asd\n
zxc\n
I already tried cat ABC* | head -n1 but it takes only the first line of the first file; the others are omitted.
You would want head -n1 ABC* to let head take the first line from each file. Reading from standard input, head knows nothing about where its input comes from.
head, though, adds its own header to identify which file each line comes from, so use awk instead:
awk 'FNR == 1 {print}' ./ABC* > DEF
FNR is the variable containing the line number of the current line of the input, reset to 0 each time a new file is opened. Using ./ABC* instead of ABC* guards against filenames containing an = (which awk handles specially if the part before = is a valid awk variable name. HT William Pursell.)
Assuming that the file names don't contain spaces or newlines, and that there are no directories with names starting with ABC:
ls ABC* | xargs -n 1 head -n 1
The -n 1 ensures that head receives only one name at a time.
If the aforementioned conditions are not met, use a loop like chepner suggested, but explicitly guard against directory entries which are not plain files, to avoid error messages issued by head.
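A minimal sketch of such a loop, skipping anything that is not a regular file:
for f in ABC*; do
    [ -f "$f" ] && head -n 1 "$f"
done > DEF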

Looping and grep writes output for the last line only

I am looping through the lines in a text file and performing a grep for each line across directories, like below
while IFS="" read -r p || [ -n "$p" ]
do
echo "This is the field: $p"
grep -ilr $p * >> Result.txt
done < fields.txt
But the above writes results only for the last line in the file, and not for the other lines.
If I manually execute the command with the other lines, it works (which means the matches were found). Is there anything I am missing here? Thanks
The fields.txt looks like this
annual_of_measure__c
attached_lobs__c
apple
When the file fields.txt
has the DOS/Windows line-ending convention, consisting of two characters (carriage return AND line feed), and
that file is processed by Unix tools expecting Unix line endings, consisting of only one character (line feed),
then the line read by the read command and stored in the variable $p is, for the first line, annual_of_measure__c\r (note the additional \r for the carriage return). grep will then not find a match.
From your description in the question and the confirmation in the comments, it seems that the last line in fields.txt has no line ending at all, so the variable $p is the plain string apple and grep can find matches for that last line.
There are tools for converting line endings, e.g. see this answer or even more options in this answer.
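For example, converting the file once up front (the cleaned file name is just a placeholder), or stripping a trailing carriage return from each line inside the existing loop:
tr -d '\r' < fields.txt > fields.clean.txt
p=${p%$'\r'}    # bash: inside the while loop, remove a trailing CR if present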

Finding a particular string from a file

I have a file that contains one particular string many times. How can I print all occurrences of the string on the screen, along with the characters following that word until the next space is encountered?
Suppose a line contained:
example="123asdfadf" foo=bar
I want to print example="123asdfadf".
I tried using less filename | grep -i "example=*", but it printed the complete lines in which example appeared.
$ grep -o "example[^ ]*" foo
example="abc"
example="123asdfadf"
Since -o is not specified by POSIX, a portable solution would be to use sed:
sed -n 's/.*\(example=[^[:space:]]*\).*/\1/p' file
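Checking that sed command against the sample line, as a quick usage example:
printf 'example="123asdfadf" foo=bar\n' | sed -n 's/.*\(example=[^[:space:]]*\).*/\1/p'
example="123asdfadf"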
