Parallelise grep - use file rows as input for grep - parallel-processing

I have File1 and File2 as below. I found similar questions, but none quite the same.
I want to use File1 rows as input for grep and extract the 1st column of File2. In the toy example below, if column 2 of File2 equals a or b, then column 1 should be written to File_ab.
So far I am using a double loop, and the estimated time is 4 days. I was hoping to get something like: cat File1 | xargs -P 12 -exec grep "$1\|$2" File2 > File_$1$2.txt
but I failed to get the syntax right. I am trying to run 12 greps in parallel with an OR condition.
File1
a b
c d
File2
1 a
2 b
3 c
1 d
4 a
5 e
6 d
Desired output is 2 files, File_ab and File_cd:
File_ab
1
2
4
File_cd
1
3
6
Note: my File1 has 25K rows, and File2 has 10 million rows.
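For reference, here is one way the intended xargs-style command could be written with GNU parallel (a sketch only, not tested at your scale; {1} and {2} are the two columns of one File1 row, and up to 12 jobs run at once):
parallel -j 12 --colsep ' ' \
  "awk -v a={1} -v b={2} '\$2==a || \$2==b {print \$1}' File2 > File_{1}{2}" \
  :::: File1
Note that this still scans File2 once per File1 row (25K passes over 10 million lines), so a single-pass approach such as the Perl script below should be far faster.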

Use Perl:
#!/usr/bin/perl
use FileCache;
@a = `cat File1`;                       # each line of File1, e.g. "a b"
chomp(@a);
for $a (@a) {
    @parts = split / +/, $a;            # the keys on this row, e.g. ("a","b")
    push @re, @parts;                   # collect every key for one big regex
    for $p (@parts) {
        $file{$p} = "File_" . join "", @parts;   # key -> output file name, e.g. a -> File_ab
    }
}
$re = join("|", @re);
while (<>) {                            # read File2 line by line
    if (/(\d+).*($re)/o and $file{$2}) {
        $fh = cacheout $file{$2};       # FileCache keeps many output files open
        print $fh $1, "\n";
    }
}
Then:
chmod 755 myscript
./myscript File2

Related

Bash: compare 2 files and show the unique content of one file with 'hierarchy'

So basically, these are two files I need to compare
file1.txt
1 a
2 b
3 c
44 d
file2.txt
11 a
123 a
3 b
445 d
To show the unique lines in file 1, I use the 'comm -23' command after running 'sort -u' on these 2 files. Additionally, I would like '11 a' and '123 a' in file 2 to be treated as subsets of '1 a' in file 1; similarly, '445 d' is a subset of '44 d'. These subsets are considered the same as their superset. So the desired output is
2 b
3 c
I'm a beginner and my loop is way too slow... So here is my code:
comm -23 <( awk '{print $1,$2}' file1.txt | sort -u ) <( awk '{print $1,$2}' file2.txt | sort -u ) > output.txt
array=($( awk -F ',' '{print $1}' file1.txt ))
for i in "${array[@]}"; do
    awk -v pattern="$i" 'match($0, "^" pattern)' output.txt > repeat.txt
done
comm -23 <( cat output.txt | sort -u ) <( cat repeat.txt | sort -u )
Anyone got any good ideas?
Another question: is there any way I could show the row numbers from the original file in the output? For example,
(row num from file 1)
2 2 b
3 3 c
With GNU awk for arrays of arrays:
$ cat tst.awk
NR==FNR {                  # first file (file2): index its rows by column 2
    vals[$2][$1]
    next
}
$2 in vals {               # second file (file1): column 2 also appears in file2
    for (i in vals[$2]) {
        if ( index(i,$1) == 1 ) {   # a file2 col1 starts with this col1 => subset exists, skip
            next
        }
    }
}
{ print FNR, $0 }          # otherwise print the line with its row number
$ awk -f tst.awk file2 file1
2 2 b
3 3 c

Finding items that are common to all the input files

I have a series of files like this:
f1.txt  f2.txt  f3.txt
A       B       A
B       G       B
C       H       C
D       I       E
E       L       G
F       M       J
I want to find out the entries that are common to all three files. In this case the expected output would be B, since that is the only letter that occurs in all three files.
If I had just two files, I could find out the common entries using comm -1 -2 f1.txt f2.txt.
But that doesn't work with multiple files. I thought about something like
sort -u f*.txt > index #to give me the total unique entries
while read i ; do *test if entry is present in all the files* ; done < index
I thought of iteratively doing the comm -12 f1.txt f2.txt | comm -12 - f3.txt but I have 100+ files so that's not practical. Performance does matter.
EDIT
I implemented the following-
sort -u f* > index
while read i
do
    echo -n "$i "
    grep -c "$i" f*.txt > temp
    awk -F ":" '{a+=$2} END {print a}' temp
done < index | sort -rnk2
This gives the output-
B 3
G 2
E 2
C 2
A 2
M 1
L 1
J 1
I 1
H 1
F 1
D 1
From here I can see that the number of files is 3 and the occurrence of B is 3. Hence it occurs in all the files. I'm still looking for a better solution though.
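To turn that observation into a filter directly, here is a sketch built on the same pipeline (it assumes, as in the example, that each value occurs at most once per file):
nfiles=$(ls f*.txt | wc -l)
sort -u f*.txt | while read -r i; do
    # sum the per-file match counts; -x forces whole-line matches
    n=$(grep -cFx "$i" f*.txt | awk -F: '{ s += $2 } END { print s }')
    [ "$n" -eq "$nfiles" ] && echo "$i"
done
This is still one grep pass per unique entry, though, so the awk one-liners below remain the better choice for performance.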
awk '{cnt[$0]++} END{for (i in cnt) if (cnt[i]==(ARGC-1)) print i}' *.txt
The above assumes each value occurs no more than once in a given file, like in your example. If a value CAN occur multiple times in one file then:
awk '!seen[FILENAME,$0]++{cnt[$0]++} END{for (i in cnt) if (cnt[i]==(ARGC-1)) print i}' *.txt
or with GNU awk for true multi-dimensional arrays and ARGIND:
awk '{cnt[$0][ARGIND]} END{for (i in cnt) if (length(cnt[i])==ARGIND) print i}' *.txt
Using Python
This Python script will find the common lines among a large number of files:
#!/usr/bin/python
from glob import glob

fnames = glob('f*.txt')
with open(fnames[0]) as f:
    lines = set(f.readlines())
for fname in fnames[1:]:
    with open(fname) as f:
        lines = lines.intersection(f.readlines())
print(''.join(lines))
Sample run:
$ python script.py
B
How it works:
fnames = glob('f*.txt')
This collects the names of files of interest.
with open(fnames[0]) as f:
    lines = set(f.readlines())
This reads the first file and creates a set from its lines. This set is called lines.
for fname in fnames[1:]:
    with open(fname) as f:
        lines = lines.intersection(f.readlines())
For each subsequent file, this takes the intersection of lines with the lines of this file.
print(''.join(lines))
This prints out the resulting set of common lines.
Using grep and shell
Try:
$ grep -Ff f1.txt f2.txt | grep -Ff f3.txt
B
This works in two steps:
grep -Ff f1.txt f2.txt selects those lines from f2.txt that also occur in f1.txt. In other words, the output from this command consists of lines that f1.txt and f2.txt have in common.
grep -Ff f3.txt selects from its input all lines that are also in f3.txt.
Notes:
The -F option tells grep to treat its input as fixed strings, not regular expressions.
The -f option tells grep to get the patterns it is looking for from the file whose name follows.
The commands above match the patterns anywhere within a line, not only as complete lines; to require whole-line matches, add the -x option. Note also that leading or trailing white space in the pattern file is significant.
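For strict whole-line matching, the same pipeline can be written with -x (a sketch):
grep -Fxf f1.txt f2.txt | grep -Fxf f3.txt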
Use join:
$ join f1.txt <(join f2.txt f3.txt)
B
join expects the files to be sorted, though. This seems to work too:
$ join <(sort f1.txt) <(join <(sort f2.txt) <(sort f3.txt))
B
Note that Ed's answer is considerably faster than my suggestion, but I'll leave it for posterity :-)
I used GNU Parallel to apply comm to the files in pairs in parallel (so it should be fast) and do that repeatedly, passing the output of each iteration as the input to the next.
It converges when there is only one file left to process. If there are an odd number of files at any stage, the odd file is promoted forward to the next round and processed later.
#!/bin/bash
shopt -s nullglob
# Get list of files
files=(f*.txt)
iter=0
while : ; do
    # Get number of files
    n=${#files[@]}
    echo DEBUG: Iter: $iter, Files: $n
    # If only one file left, we have converged, cat it and exit
    [ $n -eq 1 ] && { cat ${files[0]}; break; }
    # Check if odd number of files, and promote and delete one if odd
    if (( n % 2 )); then
        mv ${files[0]} s-$iter-odd
        files=( "${files[@]:1}" )
    fi
    parallel -n2 comm -1 -2 {1} {2} \> s-$iter-{#} ::: "${files[@]}"
    files=(s-$iter-*)
    (( iter=iter+1 ))
done
Sample Output
DEBUG: Iter: 0, Files: 110
DEBUG: Iter: 1, Files: 55
DEBUG: Iter: 2, Files: 28
DEBUG: Iter: 3, Files: 14
DEBUG: Iter: 4, Files: 7
DEBUG: Iter: 5, Files: 4
DEBUG: Iter: 6, Files: 2
DEBUG: Iter: 7, Files: 1
Basically, s-0-* is the output of the first pass, s-1-* is the output of the second pass...
If you would like to see the commands parallel would run, without it actually running any of them, use:
parallel --dry-run ...
If (but only if) all of your files have unique entries this should work too:
sort f*.txt | uniq -c \
| grep "^\s*$(ls f*.txt | wc -w)\s" \
| while read n content; do echo $content; done

Subtracting row-values from two different text files

I have two text files, and each file has one column with several rows:
FILE1
a
b
c
FILE2
d
e
f
I want to create a file that has the following output:
a - d
b - e
c - f
All the entries are meant to be numbers (decimals). I am completely stuck and do not know how to proceed.
Using paste seems like the obvious choice but unfortunately you can't specify a multiple character delimiter. To get around this, you can pipe the output to sed:
$ paste -d- file1 file2 | sed 's/-/ - /'
a - d
b - e
c - f
Paste joins the two files together and sed adds the spaces around the -.
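An alternative sketch that skips the single-character delimiter entirely and lets awk insert the multi-character separator:
paste file1 file2 | awk '{ print $1 " - " $2 }'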
If your desired output is the result of the subtraction, then you could use awk:
paste file1 file2 | awk '{ print $1 - $2 }'
given:
$ cat /tmp/a.txt
1
2
3
$ cat /tmp/b.txt
4
5
6
awk is a good bet to process the two files and do arithmetic:
$ awk 'FNR==NR { a[FNR""] = $0; next } { print a[FNR""]+$1 }' /tmp/a.txt /tmp/b.txt
5
7
9
Or, if you want the strings rather than arithmetic:
$ awk 'FNR==NR { a[FNR""] = $0; next } { print a[FNR""] " - "$0 }' /tmp/a.txt /tmp/b.txt
1 - 4
2 - 5
3 - 6
Another solution, using while and file descriptors:
while read -r line1 <&3 && read -r line2 <&4
do
    #printf '%s - %s\n' "$line1" "$line2"
    printf '%s\n' $(($line1 - $line2))
done 3<f1.txt 4<f2.txt

Get n lines from file which are equal spaced

I have a big file with 1000 lines. I want to get 110 lines from it.
The lines should be evenly spread across the input file.
For example, here is how I would read 4 lines from a file with 10 lines:
Input File
1
2
3
4
5
6
7
8
9
10
outFile:
1
4
7
10
Use:
sed -n '1~9p' < file
The -n option stops sed from printing anything by default. '1~9p' tells sed to print line 1 and then every 9th line after it (the 'first~step' address form is a GNU sed extension; the p at the end orders sed to print).
To get closer to 110 lines you have to print every 9th line (1000/110 ~ 9).
Update: This answer will print 112 lines, if you need exactly 110 lines, you can limit the output just using head like this:
sed -n '1~9p' < file | head -n 110
$ cat tst.awk
NR==FNR { next }                    # 1st pass of the (same) file: just let NR count its lines
FNR==1 { mod = int((NR-1)/tgt) }    # 2nd pass starts: NR-1 is now the total line count
!( (FNR-1)%mod ) { print; cnt++ }   # print every mod-th line
cnt == tgt { exit }                 # stop once tgt lines have been printed
$ wc -l file1
1000 file1
$ awk -v tgt=110 -f tst.awk file1 file1 > file2
$ wc -l file2
110 file2
$ head -5 file2
1
10
19
28
37
$ tail -5 file2
946
955
964
973
982
Note that this will not produce the exact output you posted for your sample input, because that would require an algorithm that does not always use the same interval between output lines. You could dynamically calculate mod and adjust it as you parse your input file if you like, but the above may be good enough.
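For completeness, here is a sketch of that dynamic approach (it assumes the total line count is known up front, e.g. from wc -l): precompute the tgt target line numbers so they are spread as evenly as possible, which also reproduces the 1, 4, 7, 10 example from the question.
awk -v tgt=4 -v total=10 '
    # choose tgt line numbers evenly spaced between 1 and total
    BEGIN { for (i = 0; i < tgt; i++) want[1 + int(i * (total - 1) / (tgt - 1))] }
    FNR in want
' file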
With awk you can do:
awk -v interval=3 '(NR-1)%interval==0' file
where interval is the difference in line number between consecutive printed lines. The value is essentially the total number of lines in the file divided by the number of lines to be printed.
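For example, the interval can be computed inline (a sketch; integer division, so the output may contain a few more lines than requested, as noted above):
awk -v interval=$(( $(wc -l < file) / 110 )) '(NR-1)%interval==0' file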
I often like to use a combination of shell and awk for these sorts of things
#!/bin/bash
filename=$1
toprint=$2
awk -v tot=$(expr $(wc -l < $filename)) -v toprint=$toprint '
    BEGIN { interval = int((tot-1)/(toprint-1)) }
    (NR-1) % interval == 0 {
        print
        nbr++
    }
    nbr == toprint { exit }
' $filename
Some examples:
$./spread.sh 1001lines 5
1
251
501
751
1001
$ ./spread.sh 1000lines 110 |head -n 3
1
10
19
$ ./spread.sh 1000lines 110 |tail -n 3
964
973
982

Counting equal lines in two files

Say, I have two files and want to find out how many equal lines they have. For example, file1 is
1
3
2
4
5
0
10
and file2 contains
3
10
5
64
15
In this case the answer should be 3 (common lines are '3', '10' and '5').
This, of course, is done quite simply with python, for example, but I got curious about doing it from bash (with some standard utils or extra things like awk or whatever). This is what I came up with:
cat file1 file2 | sort | uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
It does seem too complicated for the task, so I'm wondering is there a simpler or more elegant way to achieve the same result.
P.S. Outputting the percentage of common part to the number of lines in each file would also be nice, though is not necessary.
UPD: Files do not have duplicate lines
To find the lines common to your 2 files, using awk:
awk 'a[$0]++' file1 file2
This will output 3, 10 and 5.
Now, just pipe this to wc to get the number of common lines :
awk 'a[$0]++' file1 file2 | wc -l
Will output 3.
Explanation:
Here, a works like a dictionary with default value of 0. When you write a[$0]++, you will add 1 to a[$0], but this instruction returns the previous value of a[$0] (see difference between a++ and ++a). So you will have 0 ( = false) the first time you encounter a certain string and 1 ( or more, still = true) the next times.
By default, awk 'condition' file is a syntax for outputting all the lines where condition is true.
Also be aware that the a[] array grows every time a new key is encountered. At the end of the script, the size of the array will be the number of unique values across all your input files (in the OP's example, it would be 9).
Note: this solution counts duplicates, i.e if you have:
file1 | file2
  1   |   3
  2   |   3
  3   |   3
awk 'a[$0]++' file1 file2 will output 3 3 3 and awk 'a[$0]++' file1 file2 | wc -l will output 3
If this is a behaviour you don't want, you can use the following code to filter out duplicates :
awk '++a[$0] == 2' file1 file2 | wc -l
With your input example, this works too, but if the files are huge, I prefer the awk solutions by others:
grep -cFwf file2 file1
with your input files, the above line outputs
3
Here's one without awk that instead uses comm:
comm -12 <(sort file1.txt) <(sort file2.txt) | wc -l
comm compares two sorted files. The -1 and -2 options suppress the lines that are unique to file1 and file2 respectively, so only the lines the two files have in common are printed, one per line. wc -l then counts them.
Output without wc -l:
10
3
5
And when counting (obviously):
3
You can also use comm command. Remember that you will have to first sort the files that you need to compare:
[gc@slave ~]$ sort a > sorted_1
[gc@slave ~]$ sort b > sorted_2
[gc@slave ~]$ comm -1 -2 sorted_1 sorted_2
10
3
5
From man pages for comm command:
comm - compare two sorted files line by line
Options:
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
You can do all with awk:
awk '{ a[$0] += 1} END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print c}' file1 file2
To get the percentage, something like this works:
awk '{ a[$0] += 1; if (NR == FNR) { b = FILENAME; n = NR} } END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print b, c/n; print FILENAME, c/FNR;}' file1 file2
and outputs
file1 0.428571
file2 0.6
In your solution, you can get rid of one cat:
sort file1 file2| uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
How about keeping it nice and simple...
This is all that's needed:
cat file1 file2 | sort -n | uniq -d | wc -l
3
man sort:
-n, --numeric-sort -- compare according to string numerical value
man uniq:
-d, --repeated -- only print duplicate lines
man wc:
-l, --lines -- print the newline counts
Hope this helps.
EDIT - one fewer process (credit martin):
sort file1 file2 | uniq -d | wc -l
One way using awk:
awk 'NR==FNR{a[$0]; next}$0 in a{n++}END{print n}' file1 file2
Output:
3
The first answer by Aserre using awk is good but may have the undesirable effect of counting duplicates - even if the duplicates exist in only ONE of the files, which is not quite what the OP asked for.
I believe this edit will return only the unique lines that exist in BOTH files.
awk 'NR==FNR{a[$0]=1;next}a[$0]==1{a[$0]++;print $0}' file1 file2
If duplicates are desired, but only if they exist in both files, I believe this next version will work. However, it will only report duplicates in the second file that also exist in the first file. (If duplicates exist in the first file, only those that also exist in file2 will be reported, so file order matters.)
awk 'NR==FNR{a[$0]=1;next}a[$0]' file1 file2
Btw, I tried using grep, but it was painfully slow on files with a few thousand lines each. Awk is very fast!
UPDATE 1: the new version ensures intra-file duplicates are excluded from the count, so only cross-file duplicates show up in the final stats:
mawk '
BEGIN { _*= FS = "^$"
} FNR == NF { split("",___)
} ___[$_]++<NF { __[$_]++
} END { split("",___)
for (_ in __) {
___[__[_]]++ } printf(RS)
for (_ in ___) {
printf(" %\04715.f %s\n",_,___[_]) }
printf(RS) }' \
<( jot - 1 999 3 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 2 1024 7 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 7 1295 17 | mawk '1;1;1;1;1' | shuf )
3 3
2 67
1 413
===========================================
This is probably waaay overkill, but I wrote something similar to this to supplement uniq -c:
measuring the frequency of frequencies
It's like uniq -c | uniq -c without wasting time sorting. The summation and % parts are trivial from here, with 47 overlapping lines in this example. It avoids per-row processing, since the current setup only shows the summarized stats.
If you need the actual duplicated rows, they're also available right there, serving as the hash keys of the first array.
gcat <( jot - 1 999 3 ) <( jot - 2 1024 7 ) |
mawk '
BEGIN { _*= FS = "^$"
} { __[$_]++
} END { printf(RS)
for (_ in __) { ___[__[_]]++ }
for (_ in ___) {
printf(" %\04715.f %s\n",
_,___[_]) } printf(RS) }'
2 47
1 386
add another file, and the results reflect the changes (I added <( jot - 5 1295 5 ) ):
3 9
2 115
1 482
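For readers who find the mawk block above hard to follow, the sorted equivalent of the same frequency-of-frequencies statistic is roughly:
sort file1 file2 | uniq -c | awk '{ print $1 }' | sort -n | uniq -c
which is exactly the uniq -c | uniq -c idea, at the cost of two sorts.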
