grep from file doesn't write result to an output file - bash

I'm trying to grep strings from file2.csv using existing strings from file1.csv and write the matched lines to result.csv. I have the following bash script:
cat file1.csv | while read line; do
grep $line ./file2.csv > result.csv
done
But in the end result.csv is always empty. When I grep file2.csv manually, everything works fine. What am I doing wrong?
file1.csv:
15098662745072
15098662745508
file2.csv:
";"0";"15098662745072";"4590";"4590";"
";"0";"15098662745508";"6400";"6400";"
";"0";"15098662745515";"6110";"6110";"
";"0";"15098662745812";"7970";"7970";"
expected result (result.csv):
";"0";"15098662745072";"4590";"4590";"
";"0";"15098662745508";"6400";"6400";"

> keeps overwriting the file. Use >> to append to it.
Instead of using a loop, you can simply use the -f option in grep to make grep read patterns from the file.
grep -f file1.csv file2.csv > result.csv
If you have to use a loop, use the following approach:
while read line; do
grep "$line" ./file2.csv
done < file1.csv > result.csv
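To make the difference visible, here is a minimal sketch that rebuilds the two sample files from the question in a temporary directory and runs the loop both ways. The temporary directory, the file name overwritten.csv and the variable names are made up for the demo; -F and read -r are additions that are not part of the answer above.

```shell
#!/usr/bin/env bash
# Sample data copied from the question; the temp directory is only for this demo.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' 15098662745072 15098662745508 > file1.csv
cat > file2.csv <<'EOF'
";"0";"15098662745072";"4590";"4590";"
";"0";"15098662745508";"6400";"6400";"
";"0";"15098662745515";"6110";"6110";"
";"0";"15098662745812";"7970";"7970";"
EOF

# Redirecting inside the loop truncates the file on every iteration,
# so only the output of the last grep survives.
while read -r line; do
  grep -F -- "$line" file2.csv > overwritten.csv
done < file1.csv

# Redirecting the whole loop opens the file once and collects every match.
while read -r line; do
  grep -F -- "$line" file2.csv
done < file1.csv > result.csv

overwritten_count=$(grep -c '' overwritten.csv)
result_count=$(grep -c '' result.csv)
echo "inside loop: $overwritten_count line(s), around loop: $result_count line(s)"
```

If the last pattern in file1.csv matches nothing (for example because of a trailing blank or malformed line), the file written inside the loop ends up empty.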

You should be using awk for this, not grep, because:
a) grep does not by default look for strings, it looks for regular expressions. You need to use fgrep or grep -F or awk instead of grep to search for strings.
b) You really only want to match the numbers from file1.csv when they appear as a full specific field in file2.csv, not wherever they occur on the line.
awk -F'";"' 'NR==FNR{a[$0];next} $3 in a' file1.csv file2.csv > result.csv
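A small sketch of point b), using made-up data in which the key 4590 is an ID in one row and an ordinary value in another. The file names keys.csv and data.csv are hypothetical.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
cd "$tmp" || exit 1
# Hypothetical data: the key 4590 is the ID (third field) in the second row only,
# but also appears as an ordinary value in the first row.
printf '%s\n' 4590 > keys.csv
cat > data.csv <<'EOF'
";"0";"15098662745072";"4590";"4590";"
";"0";"4590";"6400";"6400";"
EOF

# Substring search: counts every line that contains the key anywhere.
grep_hits=$(grep -c -F -f keys.csv data.csv)

# Field search: keeps a line only when its third field equals the key.
awk_hits=$(awk -F'";"' 'NR==FNR{a[$0];next} $3 in a' keys.csv data.csv | grep -c '')

echo "grep -F: $grep_hits line(s), awk on field 3: $awk_hits line(s)"
```

Here grep reports both rows, while the awk version reports only the row whose third field is 4590.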


Compare 2 csv files and delete rows - Shell

I have 2 csv files. One has several columns, the other is just one column with domains. Simplified data of these files would be
file1.csv:
John,example.org,MyCompany,Australia
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
file2.csv:
example.org
google.es
mysite.uk
The output should be
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
I have tried this solution
grep -v -f file2.csv file1.csv >output-file
Found here
http://www.unix.com/shell-programming-and-scripting/177207-removing-duplicate-records-comparing-2-csv-files.html
But since there is no explanation whatsoever about how the script works, and I suck at shell, I cannot tweak it to make it work for me
A solution for this would be highly appreciated, a solution with some explanation would be awesome! :)
EDIT:
I have tried the line that was supposed to work, but for some reason it does not. Here is the output from my terminal. What's wrong with this?
Desktop $ cat file1.csv ; echo
John,example.org,MyCompany,Australia
Lenny ,domain.com,OtherCompany,US
Martha,mysite.com,ThirCompany,US
Desktop $ cat file2.csv ; echo
example.org
google.es
mysite.uk
Desktop $ grep -v -f file2.csv file1.csv
John,example.org,MyCompany,Australia
Lenny ,domain.com,OtherCompany,US
Martha,mysite.com,ThirCompany,US
Why doesn't grep remove the line
John,example.org,MyCompany,Australia
The line you posted works just fine.
$ grep -v -f file2.csv file1.csv
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
And here's an explanation. grep will search for a given pattern in a given file and print all lines that match. The simplest example of usage is:
$ grep John file1.csv
John,example.org,MyCompany,Australia
Here we used a simple pattern that is matched literally, character by character, but you can also use regular expressions (basic, extended, and even Perl-compatible ones).
To invert the logic, and print only the lines that do not match, we use the -v switch, like this:
$ grep -v John file1.csv
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
To specify more than one pattern, you can use the option -e pattern multiple times, like this:
$ grep -v -e John -e Lenny file1.csv
Martha,site.com,ThirdCompany,US
However, if there is a larger number of patterns to check for, we might use the -f file option that will read all patterns from a file specified.
So, when we combine all of those; reading patterns from a file with -f and inverting the matching logic with -v, we get the line you need.
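Putting the pieces together on the sample data from the question, as a sketch. The temporary directory and the variable kept are made up for the demo, and -F is an addition that is not in the line above: it makes grep treat each domain as a literal string, so the dot in example.org cannot match an arbitrary character.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
cd "$tmp" || exit 1
cat > file1.csv <<'EOF'
John,example.org,MyCompany,Australia
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
EOF
printf '%s\n' example.org google.es mysite.uk > file2.csv

# -f file2.csv : read one pattern per line from file2.csv
# -v           : print only the lines that match none of the patterns
# -F           : (added here) treat patterns as fixed strings, not regexes
kept=$(grep -v -F -f file2.csv file1.csv)
printf '%s\n' "$kept"
```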
One in awk:
$ awk -F, 'NR==FNR{a[$1];next}($2 in a==0)' file2 file1
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
Explained:
$ awk -F, '          # using awk, comma-separated records
NR==FNR {            # process the first file, file2
    a[$1]            # hash the domain to a
    next             # proceed to next record
}
($2 in a==0)         # process file1, if domain in $2 not in a, print the record
' file2 file1        # file order is important

Creating a script that checks to see if each word in a file

I am pretty new to Bash and scripting in general and could use some help. Each word in the first file is separated by \n while the second file could contain anything. If the string in the first file is not found in the second file, I want to output it. Pretty much "check if these words are in these words and tell me the ones that are not"
File1.txt contains something like:
dog
cat
fish
rat
file2.txt contains something like:
dog
bear
catfish
magic ->rat
I know I want to use grep (or do I?) and the command would be (to my best understanding):
$ foo.sh file1.txt file2.txt
Now for the script...
I have no idea...
grep -iv $1 $2
Give this a try. This is straightforward and not optimized, but it does the trick (I think).
while read line ; do
fgrep -q "$line" file2.txt || echo "$line"
done < file1.txt
There is a fun version below, with 4 parallel fgrep processes and an additional result.txt file.
> result.txt
nb_parallel=4
while read -r line ; do
# wait until fewer than nb_parallel background jobs are running
while [ "$(jobs -r | wc -l)" -ge "$nb_parallel" ]; do sleep 1; done
fgrep -q "$line" file2.txt || echo "$line" >> result.txt &
done < file1.txt
wait
cat result.txt
You can increase the value 4 to run more fgrep processes in parallel, depending on the number of CPUs and cores and the IOPS available.
With the -f flag you can tell grep to use a file.
grep -vf file2.txt file1.txt
To get a good match on complete lines, use
grep -vFxf file2.txt file1.txt
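As @anubhava commented, this only compares complete lines, so a word that occurs inside a longer line of file2.txt (such as cat in catfish) is still reported as missing. To fix that, we will use the result of grep -Fof file1.txt file2.txt (all the relevant keywords).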
Combining these will give
grep -vFxf <(grep -Fof file1.txt file2.txt) file1.txt
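The same idea split into two steps, so the intermediate keyword list can be inspected. This is only a sketch: it uses a temporary file (found.txt, a made-up name) instead of process substitution, and the sample data is illustrative rather than copied from the question.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
cd "$tmp" || exit 1
# Illustrative data, loosely based on the question.
printf '%s\n' dog cat fish rat > file1.txt
printf '%s\n' dog bear catfish > file2.txt

# Step 1: every word from file1.txt that occurs somewhere in file2.txt,
# each printed on its own line (-o prints only the matching part).
grep -Fof file1.txt file2.txt > found.txt

# Step 2: drop those words from file1.txt; -x compares whole lines.
missing=$(grep -vFxf found.txt file1.txt)
echo "$missing"
```

With this data found.txt lists dog, cat and fish, so the only word left over is rat.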
Using awk you can do:
awk 'FNR==NR{a[$0]; next} {for (i in a) if (index(i, $0)) next} 1' file2 file1
rat
If both files are sorted and the words only need to match whole lines, you can simply do the following:
comm -2 -3 file1.txt file2.txt
and also:
diff -u file1.txt file2.txt
I know you were looking for a script, but I don't think there is any reason to write one; if you still want a script, you can just run these commands from it.
similar awk
$ awk 'NR==FNR{a[$0];next} {for(k in a) if(k~$0) next}1' file2 file1
rat

Reading numbers from a text line in bash shell

I'm trying to write a bash shell script, that opens a certain file CATALOG.dat, containing the following lines, made of both characters and numbers:
event_0133_pk.gz
event_0291_pk.gz
event_0298_pk.gz
event_0356_pk.gz
event_0501_pk.gz
What I wanna do is print the numbers (only the numbers) inside a new file NUMBERS.dat, using something like > ./NUMBERS.dat, to get:
0133
0291
0298
0356
0501
My problem is: how do I extract the numbers from the text lines? Is there something to make the script read just the number as a variable, like event_0%d_pk.gz in C/C++?
A grep solution:
grep -oP '[0-9]+' CATALOG.dat >NUMBERS.dat
A sed solution:
sed 's/[^0-9]//g' CATALOG.dat >NUMBERS.dat
And an awk solution:
awk -F"[^0-9]+" '{print $2}' CATALOG.dat >NUMBERS.dat
There are many ways that you can achieve your result. One way would be to use awk:
awk -F_ '{print $2}' CATALOG.dat > NUMBERS.dat
This sets the field separator to an underscore, then prints the second field which contains the numbers.
Awk
awk 'gsub(/[^[:digit:]]/,"")' infile
Bash
while read -r line; do echo "${line//[!0-9]/}"; done < infile
tr
tr -cd '[:digit:]\n' <infile
You can use grep command to extract the number part.
grep -oP '(?<=_)\d+(?=_)' CATALOG.dat
gives output as
0133
0291
0298
0356
0501
Or, more simply:
grep -oP '\d+' CATALOG.dat
You don't need perl mode in grep for this. BREs can do this.
grep -o '[[:digit:]]\+' CATALOG.dat > NUMBERS.dat
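On the part of the question about reading the number into a variable, scanf-style: in bash this can be sketched with a regular expression match and BASH_REMATCH. The function name extract_number is made up, and the pattern assumes every name has exactly the form event_<digits>_pk.gz.

```shell
#!/usr/bin/env bash
# Sketch only: assumes names of the form event_<digits>_pk.gz.
extract_number() {
  local name=$1
  local re='^event_([0-9]+)_pk\.gz$'
  if [[ $name =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"   # text captured by the parentheses
  else
    return 1                             # name does not follow the pattern
  fi
}

number=$(extract_number event_0133_pk.gz)
echo "$number"
```

To process the whole catalog, the function could be called in a loop such as while read -r line; do extract_number "$line"; done < CATALOG.dat > NUMBERS.dat.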

awk execute same command on different files one by one

Hi, I have 30 txt files in a directory, each containing 4 columns.
How can I execute the same command on each file, one by one, and direct the output to a different file each time?
The command I am using is below, but it is applied to all the files at once and gives a single output. All I want is to process each file one by one and direct each output to a new file.
start=$1
patterns=''
for i in $(seq -43 -14); do
patterns="$patterns /cygdrive/c/test/kpi/SIGTRAN_Load_$(exec date '+%Y%m%d' --date="-${i} days ${start}")*"; done
cat /cygdrive/c/test/kpi/*$patterns | sed -e "s/\t/,/g" -e "s/ /,/g"| awk -F, 'a[$3]<$4{a[$3]=$4} END {for (i in a){print i FS a[i]}}'| sed -e "s/ /0/g"| sort -t, -k1,2> /cygdrive/c/test/kpi/SIGTRAN_Load.csv
Something like this
for fileName in /path/to/files/foo*.txt
do
mangleFile "$fileName"
done
will mangle a list of files you give via globbing. If you want to generate the file name patterns as in your example, you can do it like this:
for i in $(seq -43 -14)
do
for fileName in /cygdrive/c/test/kpi/SIGTRAN_Load_"$(exec date '+%Y%m%d' --date="-${i} days ${start}")"*
do
mangleFile "$fileName"
done
done
This way the code stays much more readable, even if shorter solutions may exist.
The mangleFile of course then will be the awk call or whatever you would like to do with each file.
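As a rough sketch of what such a mangleFile could look like: the function below keeps the largest fourth-column value for each third-column key, in the spirit of the awk call in the question, and writes the result next to the input file. The output naming (<input>.csv), the whitespace-separated input and the sample files are assumptions made for the demo.

```shell
#!/usr/bin/env bash
# Hypothetical mangleFile: keeps the largest 4th-column value for each
# 3rd-column key and writes the result to <input>.csv.
mangleFile() {
  local src=$1
  awk '!($3 in max) || $4+0 > max[$3]+0 { max[$3] = $4 }
       END { for (k in max) print k "," max[k] }' "$src" |
    sort -t, -k1,1 > "$src.csv"
}

# Demo with two made-up input files in a temporary directory.
tmp=$(mktemp -d)
printf 'a b k1 5\na b k1 9\na b k2 3\n' > "$tmp/foo1.txt"
printf 'a b k1 2\n' > "$tmp/foo2.txt"

for fileName in "$tmp"/foo*.txt; do
  mangleFile "$fileName"
done

cat "$tmp/foo1.txt.csv"
```

Each input file gets its own output file, which is the point of looping over the files instead of concatenating them with cat.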
Use the following idiom:
for file in *
do
./your_shell_script_containing_the_above.sh "$file" > some_unique_id
done
You need to run a loop on all the matching files:
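# $patterns is left unquoted on purpose so the shell splits it and expands each glob
for f in $patterns; do
[ -e "$f" ] || continue # skip globs that matched nothing
tr '\t ' ',,' < "$f" | awk -F, 'a[$3]<$4{a[$3]=$4} END {for (i in a){print i FS a[i]}}' | sed -e "s/ /0/g" | sort -t, -k1,2 > "/cygdrive/c/test/kpi/SIGTRAN_Load-$(basename "$f").csv"
done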
PS: I haven't tried much to refactor your piped commands that can probably be shortened too.

Searching for Strings

I would like to have a shell script that searches two files and returns a list of strings:
File A contains just a list of unique alphanumeric strings, one per line, like this:
accc_34343
GH_HF_223232
cwww_34343
jej_222
File B contains a list of SOME of those strings (sometimes more than once), and a second column of information, like this:
accc_34343 dog
accc_34343 cat
jej_222 cat
jej_222 horse
I would like to create a third file that contains a list of the strings from File A that are NOT in File B.
I've tried using some loops with grep -v, but that doesn't work. So, in the above example, the new file would have this as its contents:
GH_HF_223232
cwww_34343
Any help is greatly appreciated!
Here's what you can do:
grep -v -f <(awk '{print $1}' file_b) file_a > file_c
Explanation:
grep -v : Use -v option to grep to invert the matching
-f : Use -f option to grep to specify that the patterns are from file
<(awk '{print $1}' file_b) : Extract the first-column values from file_b without using a temp file; the <( ... ) syntax is process substitution.
file_a : Tell grep that the file to be searched is file_a
> file_c : Output to be written to file_c
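One caveat, shown as a sketch: with plain -f each key is treated as a regular expression that may match anywhere in a line. Adding -F and -x (an addition to the command above, not part of it) compares the keys as exact, whole-line strings. The data is taken from the question, plus one made-up entry, jej_2223, to show the difference.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
cd "$tmp" || exit 1
# Data from the question, plus one made-up entry (jej_2223) to show the pitfall.
printf '%s\n' accc_34343 GH_HF_223232 cwww_34343 jej_222 jej_2223 > file_a
cat > file_b <<'EOF'
accc_34343 dog
accc_34343 cat
jej_222 cat
jej_222 horse
EOF

# Plain -f: the key jej_222 matches inside jej_2223, so that line is dropped too.
loose=$(grep -v -f <(awk '{print $1}' file_b) file_a)

# -F (fixed strings) and -x (whole-line match) compare the keys exactly.
exact=$(grep -v -F -x -f <(awk '{print $1}' file_b) file_a)

printf 'loose:\n%s\nexact:\n%s\n' "$loose" "$exact"
```

The exact variant keeps jej_2223, which never appears in file_b, while the loose variant silently removes it.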
comm is used to find intersections and differences between files:
comm -23 <(sort fileA) <(cut -d' ' -f1 fileB | sort -u)
result:
GH_HF_223232
cwww_34343
I assume your shell is bash/zsh/ksh
awk 'FNR==NR{a[$1];next}!($0 in a)' fileB fileA
