Closest value across different files, with different numbers of lines and other conditions (bash / awk / other)

I have to revive an old question with a modification for long files.
I have the ages of two stars in two files (File1 and File2). The age of the star is in column $1, and the rest of the columns, up to $13, are information that I need to print at the end.
I am trying to find an age at which the stars have the same age or the closest age. Since the files are long (~25,000 lines), I don't want to search the whole array, for speed reasons.
Also, they could have a big difference in number of lines (say ~10,000 lines in some cases).
I am not sure if this is the best way to solve the problem, but lacking a better one, this is my idea. (If you have a faster and more efficient method, please share it.)
All the values have 12 decimals of precision. And for now I am only concerned with the first column (where the age is).
And I think I need several nested loops.
Let's use this value from file 1:
2.326062371284e+05
First, the routine should search in file2 for all the lines that contain
2.3260e+05
(This loop will probably scan the whole array, but if there is a way to stop the search as soon as it reaches 2.3261e+05, that would save some time.)
If it finds just one, then the output should be that value.
Usually, it will find several lines, maybe even up to 1000. If this is the case, it should search again against
2.32606e+05
among the lines found before (it is a nested loop, I think).
Then the number of matches will decrease to ~200.
At that moment, the routine should find the smallest difference, within a certain tolerance X, between
2.326062371284e+05
and all these 200 lines.
So, given these files:
File1
1.833800650355e+05 col2f1 col3f1 col4f1
1.959443501406e+05 col2f1 col3f1 col4f1
2.085086352458e+05 col2f1 col3f1 col4f1
2.210729203510e+05 col2f1 col3f1 col4f1
2.326062371284e+05 col2f1 col3f1 col4f1
2.441395539059e+05 col2f1 col3f1 col4f1
2.556728706833e+05 col2f1 col3f1 col4f1
File2
2.210729203510e+05 col2f2 col3f2 col4f2
2.354895663228e+05 col2f2 col3f2 col4f2
2.499062122946e+05 col2f2 col3f2 col4f2
2.643228582664e+05 col2f2 col3f2 col4f2
2.787395042382e+05 col2f2 col3f2 col4f2
2.921130362004e+05 col2f2 col3f2 col4f2
3.054865681626e+05 col2f2 col3f2 col4f2
Output File3 (with tolerance 3000)
2.210729203510e+05 2.210729203510e+05 col2f1 col2f2 col4f1 col3f2
2.326062371284e+05 2.354895663228e+05 col2f1 col2f2 col4f1 col3f2
Important condition:
The output shouldn't contain repeated lines (for a fixed age of star 1 there can't be several ages of star 2, just the closest one).
How would you solve this?
super thanks!
PS: I've completely changed the question, since it was shown to me that my reasoning had some errors. Thanks!
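(For reference, here is a minimal awk sketch of the idea described above; it is an assumption-laden sketch rather than a drop-in answer. It assumes both files are sorted by age in ascending order, as in the samples, and since ~25,000 lines fit comfortably in memory it simply loads file2 once and binary-searches it for each file1 age, which replaces the three prefix-narrowing passes.)
awk -v tol=3000 '
    NR == FNR { age2[++n] = $1; line2[n] = $0; next }      # pass 1: load file2
    {
        lo = 1; hi = n
        while (hi - lo > 1) {                               # binary search on the age column
            mid = int((lo + hi) / 2)
            if (age2[mid] + 0 < $1 + 0) lo = mid; else hi = mid
        }
        best = ($1 - age2[lo] <= age2[hi] - $1) ? lo : hi   # closer of the two neighbours
        d = $1 - age2[best]; if (d < 0) d = -d
        if (d <= tol) print $0, line2[best]                 # reorder/trim the columns to taste
    }
' file2 file1 > File3
On the sample files with tol=3000 this keeps exactly the two expected pairs, and each file1 line produces at most one output line, so the "no repeated lines" condition holds.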

Not an awk solution; there comes a time when other tools are great too, so here is an answer using R.
New answer with different data, not reading from files this time, to bake an example:
# Sample data for the code; use fread to read from a file and setnames to name the columns accordingly
library(data.table)
set.seed(123)
data  <- data.table(age=runif(20)*1e6, name=sample(state.name,20), sat=sample(mtcars$cyl,20), dens=sample(DNase$density,20))
data2 <- data.table(age=runif(10)*1e6, name=sample(state.name,10), sat=sample(mtcars$cyl,10), dens=sample(DNase$density,10))
setkey(data, 'age')   # Set the key for joining to the age column
setkey(data2, 'age')  # Set the key for joining to the age column

# Get the result
result = data[                   # to get the whole data from file 1 and file 2 at the end
  data2[
    data,                        # search for each star of list 1
    .SD,                         # return the columns of file 2
    roll='nearest', by=.EACHI,   # join on each line (left join) and find the nearest value
    .SDcols=c('age','name','dens')]
][!duplicated(age) & abs(i.age - age) < 1e3,
  .SD, .SDcols=c('age','i.age','name','i.name','dens','i.dens')]  # filter duplicates from the first file and on the difference

# Write the results to a file (change the separator as you wish):
write.table(format(result, digits=15, scientific=TRUE), "c:/test.txt", sep=" ")
Code:
# A nice package to have; install.packages('data.table') if it's not present
library(data.table)
# Read the data (the text can be file names)
stars1 <- fread("1.833800650355e+05
1.959443501406e+05
2.085086352458e+05
2.210729203510e+05
2.326062371284e+05
2.441395539059e+05
2.556728706833e+05")
stars2 <- fread("2.210729203510e+05
2.354895663228e+05
2.499062122946e+05
2.643228582664e+05
2.787395042382e+05
2.921130362004e+05
3.054865681626e+05")
# Name the columns (not needed if the file has a header)
colnames(stars1) <- "age"
colnames(stars2) <- "age"
# Key the data tables (for a fast join with binary search later)
setkey(stars1,'age')
setkey(stars2,'age')
# Get the result (more details below on what is happening here :))
result=stars2[ stars1, age, roll="nearest", by=.EACHI]
# Rename the columns so we can filter the whole result
setnames(result,make.unique(names(result)))
# Final filter on difference
result[abs(age.1 - age) < 3e3]
So the interesting part is the first 'join' on the two star age lists, searching for each age in stars1 the nearest one in stars2.
This gives (after column renaming):
> result
age age.1
1: 183380.1 221072.9
2: 195944.4 221072.9
3: 208508.6 221072.9
4: 221072.9 221072.9
5: 232606.2 235489.6
6: 244139.6 249906.2
7: 255672.9 249906.2
Now we have the nearest for each; filter those close enough (on an absolute difference below 3,000 here):
> result[abs(age.1 - age) < 3e3]
age age.1
1: 221072.9 221072.9
2: 232606.2 235489.6

Perl to the rescue. This should be very fast, as it does a binary search in the given range.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use List::Util qw{ max min };
use constant { SIZE      => 100,
               TOLERANCE => 3000,
             };

my @times2;
open my $F2, '<', 'file2' or die $!;
while (<$F2>) {
    chomp;
    push @times2, $_;
}

my $num = 0;
open my $F1, '<', 'file1' or die $!;
while (my $time = <$F1>) {
    chomp $time;
    my $from = max(0, $num - SIZE);
    my $to   = min($#times2, $num + SIZE);
    my $between;
    while (1) {
        $between = int(($from + $to) / 2);
        if ($time < $times2[$between] && $to != $between) {
            $to = $between;
        } elsif ($time > $times2[$between] && $from != $between) {
            $from = $between;
        } else {
            last;
        }
    }
    $num++;
    if ($from != $to) {
        my $f = $time - $times2[$from];
        my $t = $times2[$to] - $time;
        $between = ($f > $t) ? $to : $from;
    }
    say "$time $times2[$between]" if TOLERANCE >= abs $times2[$between] - $time;
}

Related

script to loop through and combine two text files

I have two .csv files which I am trying to 'multiply' out via a script. The first file is person information and looks basically like this:
First Name, Last Name, Email, Phone
Sally,Davis,sdavis@nobody.com,555-555-5555
Tom,Smith,tsmith@nobody.com,555-555-1212
The second file is account numbers and looks like this:
AccountID
1001
1002
Basically I want to get every name with every account Id. So if I had 10 names in the first file and 10 account IDs in the second file, I should end up with 100 rows in the resulting file and have it look like this:
First Name, Last Name, Email, Phone, AccountID
Sally,Davis,sdavis@nobody.com,555-555-5555, 1001
Tom,Smith,tsmith@nobody.com,555-555-1212, 1001
Sally,Davis,sdavis@nobody.com,555-555-5555, 1002
Tom,Smith,tsmith@nobody.com,555-555-1212, 1002
Any help would be greatly appreciated
You could simply write a for loop for each value to be repeated by its ID count and append the description, but just in the reverse order.
Has that not worked or have you not tried that?
If python works for you, here's a script which does that:
def main():
    f1 = open("accounts.txt", "r")
    f1_total_lines = sum(1 for line in open('accounts.txt'))
    f2_total_lines = sum(1 for line in open('info.txt'))
    f1_line_counter = 1
    f2_line_counter = 1
    f3 = open("result.txt", "w")
    f3.write('First Name, Last Name, Email, Phone, AccountID\n')
    for line_account in f1.readlines():
        f2 = open("info.txt", "r")
        for line_info in f2.readlines():
            parsed_line_account = line_account
            parsed_line_info = line_info.rstrip()  # we have to trim the newline character from every line of the 'info' file
            if f2_line_counter == f2_total_lines:  # ...for every line but the last one in the file (because it doesn't have a newline character)
                parsed_line_info = line_info
            f3.write(parsed_line_info + ',' + parsed_line_account)
            if f1_line_counter == f1_total_lines:
                f3.write('\n')
            f2_line_counter = f2_line_counter + 1
        f1_line_counter = f1_line_counter + 1
        f2_line_counter = 1  # reset the line counter to the first line
    f1.close()
    f2.close()
    f3.close()

if __name__ == '__main__':
    main()
And the files I used are as follows:
info.txt:
Sally,Davis,sdavis@nobody.com,555-555-555
Tom,Smith,tsmith@nobody.com,555-555-1212
John,Doe,jdoe@nobody.com,555-555-3333
accounts.txt:
1001
1002
1003
If You Intended to Duplicate Account_ID
If you intended to add each Account_ID to every record in your information file then a short awk solution will do, e.g.
$ awk -F, '
    FNR==NR { a[i++] = $0 }
    FNR!=NR { b[j++] = $0 }
    END {
        print a[0] ", " b[0]
        for (k = 1; k < j; k++)
            for (m = 1; m < i; m++)
                print a[m] ", " b[k]
    }
' info id
First Name, Last Name, Email, Phone, AccountID
Sally,Davis,sdavis@nobody.com,555-555-5555, 1001
Tom,Smith,tsmith@nobody.com,555-555-1212, 1001
Sally,Davis,sdavis@nobody.com,555-555-5555, 1002
Tom,Smith,tsmith@nobody.com,555-555-1212, 1002
Above, the lines in the first file (when the file record number equals the overall record number, i.e. FNR==NR) are stored in array a, the lines from the second file (when FNR!=NR) are stored in array b, and then they are combined and output in the END rule in the desired order.
Without Duplicating Account_ID
Since Account_ID is usually a unique bit of information, if you did not intend to duplicate every ID at the end of each record, then there is no need to loop. The paste command does that for you. In your case, with your information file as info and your account ID file as id, it is as simple as:
$ paste -d, info id
First Name, Last Name, Email, Phone,AccountID
Sally,Davis,sdavis@nobody.com,555-555-5555,1001
Tom,Smith,tsmith@nobody.com,555-555-1212,1002
(note: the -d, option just sets the delimiter to a comma)
Seems a lot easier than trying to reinvent the wheel.
Can be easily done with arrays
OLD=$IFS; IFS=$'\n'
ar1=( $(cat file1) )
ar2=( $(cat file2) )
IFS=$OLD
ind=${!ar1[@]}
for i in $ind; { echo "${ar1[$i]}, ${ar2[$i]}"; }

`awk`-ing plain text tables produced by pandoc with fields/cells/values that contain spaces

I've run into this problem a number of times, and maybe it's just my unsophisticated technique as I'm still a bit of a novice with the finer points of text processing, but using pandoc to go from html to plain yields pretty tables in the form of:
# IP Address Device Name MAC Address
--- ------------- -------------------------- -------------------
1 192.168.1.3 ANDROID-FFFFFFFFFFFFFFFF FF:FF:FF:FF:FF:FF
2 192.168.1.4 XXXXXXX FF:FF:FF:FF:FF:FF
3 192.168.1.5 -- FF:FF:FF:FF:FF:FF
4 192.168.1.6 -- FF:FF:FF:FF:FF:FF
--- ------------- -------------------------- -------------------
The column headings in this example (and the fields/cells/whatever in others) aren't especially awk-friendly since they contain spaces. There must be some utility (or pandoc option) to add delimiters or otherwise process it in a smart and simple way to make it easier to use with awk (since the dash ruling hints at the max column width), but I'm fast approaching the limits of my knowledge and have been unable to find any good solutions on my own. I'd appreciate any help, and I'm open to alternate approaches (I just use pandoc since that's what I know).
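(As an aside before the answers: one gawk-specific route, sketched under the assumption that the dash ruling is always the second line of the table, is to let gawk split records on fixed widths via its FIELDWIDTHS variable and to derive the widths from the dashes themselves, so nothing is hardcoded. The header row is simply skipped here.)
#!/usr/bin/gawk -f
# gawk only: derive FIELDWIDTHS from the dash ruling (assumed to be line 2);
# every record read after the assignment is split on those fixed widths.
NR == 2 {
    n = split($0, dash, " ")
    fw = ""
    for (i = 1; i <= n; i++)
        fw = fw (i > 1 ? " " : "") (length(dash[i]) + 1)   # +1 for the separating space
    FIELDWIDTHS = fw
    next
}
NR > 2 && !/^-/ {                                          # skip the closing dash line
    for (i = 1; i <= NF; i++) {
        val = $i
        gsub(/^[ \t]+|[ \t]+$/, "", val)                   # trim the cell padding
        printf("column %d: [%s]\n", i, val)
    }
}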
I've got a solution for you which parses the dash line to get column lengths, then uses that info to divide each line into columns (similar to what @shellter proposed in the comment, but without the need to hardcode values).
First, within the BEGIN block we read the headers line and the dashes line. Then we will grab the column lengths by splitting the dashline and processing it.
BEGIN {
    getline headers
    getline dashline
    col_count = split(dashline, columns, " ")
    for (i = 1; i <= col_count; i++)
        col_lens[i] = length(columns[i])
}
Now we have the lengths of each column and you can use that inside the main body.
{
    start = 1
    for (i = start; i <= col_count; i++) {
        col_n = substr($0, start, col_lens[i])
        start = start + col_lens[i] + 1
        printf("column %i: [%s]\n", i, col_n)
    }
}
That seems a little onerous, but it works. I believe this answers your question. To make things a little nicer, I factored out the line parsing into a user defined function. That's convenient because you can now use it on the headers you stored (if you want).
Here's the complete solution:
function parse_line(line, col_lens, col_count) {
    start = 1
    for (i = start; i <= col_count; i++) {
        col_i = substr(line, start, col_lens[i])
        start = start + col_lens[i] + 1
        printf("column %i: [%s]\n", i, col_i)
    }
}
BEGIN {
    getline headers
    getline dashline
    col_count = split(dashline, columns, " ")
    for (i = 1; i <= col_count; i++) {
        col_lens[i] = length(columns[i])
    }
    parse_line(headers, col_lens, col_count)
}
{
    parse_line($0, col_lens, col_count)
}
If you put your example table into a file called table and this program into a file called dashes.awk, here's the output (using head -n -1 to drop the final row of dashes):
$ head -n -1 table | awk -f dashes.awk
column 1: [ # ]
column 2: [ IP Address ]
column 3: [ Device Name ]
column 4: [ MAC Address]
column 1: [ 1 ]
column 2: [ 192.168.1.3 ]
column 3: [ ANDROID-FFFFFFFFFFFFFFFF ]
column 4: [ FF:FF:FF:FF:FF:FF]
column 1: [ 2 ]
column 2: [ 192.168.1.4 ]
column 3: [ XXXXXXX ]
column 4: [ FF:FF:FF:FF:FF:FF]
column 1: [ 3 ]
column 2: [ 192.168.1.5 ]
column 3: [ -- ]
column 4: [ FF:FF:FF:FF:FF:FF]
column 1: [ 4 ]
column 2: [ 192.168.1.6 ]
column 3: [ -- ]
column 4: [ FF:FF:FF:FF:FF:FF]
Have a look at pandoc's filter functionality: it allows you to programmatically alter the document without having to parse the table yourself. Probably the simplest option is to use lua filters, as those require no external program and are fully platform-independent.
Here is a filter which acts on each cell of the table body, ignoring the table header:
function Table (table)
  for i, row in ipairs(table.rows) do
    for j, cell in ipairs(row) do
      local cell_text = pandoc.utils.stringify(pandoc.Div(cell))
      local text_val = changed_cell(cell_text)
      row[j] = pandoc.read(text_val).blocks
    end
  end
  return table
end
where changed_cell could be either a lua function (lua has good built-in support for patterns) or a function which pipes the output through awk:
function changed_cell (raw_text)
  return pandoc.pipe('awk', {'YOUR AWK SCRIPT'}, raw_text)
end
The above is a slightly unidiomatic pandoc filter, as filters usually don't act on raw strings but on pandoc AST elements. However, the above should work fine in your case.
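Just to make that pipe call concrete, the 'YOUR AWK SCRIPT' placeholder could be filled with something as small as a whitespace normalizer; this is purely a hypothetical example of a per-cell transformation:
# hypothetical per-cell awk program for the pandoc.pipe call above:
# collapse internal runs of whitespace and trim both ends of the cell text
{
    gsub(/[[:space:]]+/, " ")
    sub(/^ /, ""); sub(/ $/, "")
    print
}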

Python: Can I grab the specific lines from a large file faster?

I have two large files. One of them is an info file (about 270MB and 16,000,000 lines) like this:
1101:10003:17729
1101:10003:19979
1101:10003:23319
1101:10003:24972
1101:10003:2539
1101:10003:28242
1101:10003:28804
The other is a standard FASTQ format(about 27G and 280,000,000 lines) like this:
@ST-E00126:65:H3VJ2CCXX:7:1101:1416:1801 1:N:0:5
NTGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGG
+
AAAFFKKKKKKKKKFKKKKKKKFKKKKAFKKKKKAF7AAFFKFAAFFFKKF7FF<FKK
@ST-E00126:65:H3VJ2CCXX:7:1101:10003:75641:N:0:5
TAAGATAGATAGCCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGG
+
AAAFFKKKKKKKKKFKKKKKKKFKKKKAFKKKKKAF7AAFFKFAAFFFKKF7FF<FKK
The FASTQ file uses four lines per sequence. Line 1 begins with a '@' character and is followed by a sequence identifier. For each sequence, this part of Line 1 is unique:
1101:1416:1801 and 1101:10003:75641
And I want to grab the Line 1 and the next three lines from the FASTQ file according to the info file. Here is my code:
import gzip
import re

count = 0
with open('info_path') as info, open('grab_path', 'w') as grab:
    for i in info:
        sample = i.strip()
        with gzip.open('fq_path') as fq:
            for j in fq:
                count += 1
                if count % 4 == 1:
                    line = j.strip()
                    m = re.search(sample, j)
                    if m != None:
                        grab.writelines(line + '\n' + fq.next() + fq.next() + fq.next())
                        count = 0
                        break
And it works, but because both of these files have millions of lines, it's inefficient (running it for one day only got 20,000 lines).
UPDATE on July 6th:
I found that the info file can be read into memory (thanks @tobias_k for reminding me), so I created a dictionary whose keys are the info lines and whose values are all 0. After that, I read the FASTQ file four lines at a time, use the identifier part as the key, and if the value is 0 I return the 4 lines. Here is my code:
import gzip

dic = {}
with open('info_path') as info:
    for i in info:
        sample = i.strip()
        dic[sample] = 0

with gzip.open('fq_path') as fq, open('grap_path', "w") as grab:
    for j in fq:
        if j[:10] == '@ST-E00126':
            line = j.split(':')
            match = line[4] + ':' + line[5] + ':' + line[6][:-2]
            if dic.get(match) == 0:
                grab.writelines(j + fq.next() + fq.next() + fq.next())
This way is much faster; it takes 20 minutes to get all the matched lines (about 64,000,000 lines). I have also thought about sorting the FASTQ file first by external sort. Splitting the file into pieces that fit in memory is OK; my trouble is how to keep the next three lines attached to the identifier line while sorting. Google's answer is to linearize these four lines first, but it will take 40 minutes to do so (see the awk sketch just below).
Anyway thanks for your help.
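(On the linearizing step mentioned above, an awk one-liner can glue every 4-line FASTQ record onto a single tab-separated line before sorting and split it back afterwards. This is only a sketch: reads.fastq, reads.flat and reads.flat.sorted are placeholder names, and a gzipped file would first need to go through zcat.)
# glue each 4-line record onto one tab-separated line, so sorting keeps records intact
awk '{ ORS = (NR % 4 ? "\t" : "\n"); print }' reads.fastq > reads.flat
# after sorting reads.flat on its identifier, split the records back into 4 lines each
awk -F'\t' '{ print $1; print $2; print $3; print $4 }' reads.flat.sorted > reads.sorted.fastq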
You can sort both files by the identifier (the 1101:1416:1801) part. Even if files do not fit into memory, you can use external sorting.
After this, you can apply a simple merge-like strategy: read both files together and do the matching as you go. Something like this (pseudocode):
entry1 = readFromFile1()
entry2 = readFromFile2()
while (none of the files ended)
    if (entry1.id == entry2.id)
        record match
    else if (entry1.id < entry2.id)
        entry1 = readFromFile1()
    else
        entry2 = readFromFile2()
This way entry1.id and entry2.id are always close to each other and you will not miss any matches. At the same time, this approach requires iterating over each file once.
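A rough awk rendering of that merge, under a couple of assumptions: ids1.txt and ids2.txt are placeholder names holding one identifier per line, and both were sorted the same way (plain sort, so the string comparisons below agree with the sort order).
awk '
    function advance2() { if ((getline id2 < "ids2.txt") <= 0) eof2 = 1 }
    BEGIN { advance2() }                        # prime the second file
    !eof2 {
        while (!eof2 && id2 < $0) advance2()    # catch file 2 up to the current file 1 id
        if (!eof2 && id2 == $0) print $0        # record the match
    }
' ids1.txt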

Awk Calc Avg Rows Below Certain Line

I'm having trouble calculating an average of specific numbers in a column BELOW a specific text identifier using awk. I have two columns of data and I'm trying to start the average keying on a common identifier that repeats, which is 01/1991. So awk should calculate the average of all lines beginning with 01/1991, using that line plus the next 21 lines, for a total of 22 rows covering the years 1991-2012. The desired output is an average for each TextID/Name entry over all the Januarys (01) of 1991-2012, as shown below:
TextID/Name 1
Avg: 50.34
TextID/Name 2
Avg: 45.67
TextID/Name 3
Avg: 39.97
...
sample data:
TextID/Name 1
01/1991, 57.67
01/1992, 56.43
01/1993, 49.41
..
01/2012, 39.88
TextID/Name 2
01/1991, 45.66
01/1992, 34.77
01/1993, 56.21
..
01/2012, 42.11
TextID/Name 3
01/1991, 32.22
01/1992, 23.71
01/1993, 29.55
..
01/2012, 35.10
continues with the same data for TextID/Name 4
I'm getting an answer using this code shown below but the average is starting to calculate BEFORE the specific identifier line and not on and below that line (01/1991).
awk '$1="01/1991" {sum+=$2} (NR%22==0){avg=sum/22;print"Average: "avg;sum=0;next}' myfile
Thanks, and explanations of the solution are greatly appreciated! I have edited the original post with more description - thank you again.
If you look at your file, the first field is "01/1991," with a comma at the end, not "01/1991". Also, NR%22==0 will look at line numbers divisible by 22, not 22 lines after the point it thinks you care about.
You can do something like this instead:
awk '
    BEGIN { l = -1 }
    $1 == "01/1991," {
        l = 22
        s = 0
    }
    l > 0  { s += $2; l-- }
    l == 0 { print s/22; l-- }'
It has a counter l that it sets to the number of lines to count, then it sums up that number of lines.
You may want to consider simply summing all lines from one 01/1991 to the next though, which might be more robust.
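A sketch of that variant, assuming every data line looks like "MM/YYYY, value" and every block starts at 01/1991 (the pattern below sums every month present, so tighten it to ^01\/ if only January should count):
awk '
    $1 == "01/1991," { if (n) printf("Avg: %.2f\n", sum / n); sum = 0; n = 0 }
    $1 ~ /^[0-9][0-9]\/[0-9][0-9][0-9][0-9],$/ { sum += $2; n++ }
    END { if (n) printf("Avg: %.2f\n", sum / n) }
' myfile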
If you're allowed to use Perl instead of Awk, you could do:
#!/usr/bin/env perl
$have_started = 0;
$count = 0;
$sum = 0;

while (<>) {
    $line = $_;
    # Grab the value after the date and comma
    $val = undef;
    if ($line =~ /\d+\/\d+,\s+([\d\.]+)/) {
        $val = $1;
    }
    # Start summing values once 01/1991 is seen
    if ($line =~ /01\/1991,/) {
        $have_started = 1;
    }
    # If we have started counting, accumulate the matched values
    if ($have_started && defined $val) {
        $count++;
        $sum += $val;
    }
}
print "Average of all values = " . $sum/$count . "\n";
print "Average of all values = " . $sum/$count;
Run it like so:
$ cat your-text-file.txt | above-perl-script.pl

Comparing many files in Bash

I'm trying to automate a task at work that I normally do by hand: taking database output of the permissions of multiple users and comparing them to see what they have in common. I have a script right now that uses comm and paste, but it's not giving me all the output I'd like.
Part of the problem comes in comm only dealing with two files at once, and I need to compare at least three to find a trend. I also need to determine if two out of the three have something in common, but the third one doesn't have it (so comparing the output of two comm commands doesn't work). I need these in comma separated values so it can be imported into Excel. Each user has a column, and at the end is a listing of everything they have in common. comm would work perfectly if it could compare more than two files (and show two-out-of-three comparisons).
In addition to the code I have to clean all the extra cruft off the raw csv file, here's what I have so far in comparing four users. It's highly inefficient, but it's what I know.
cat foo1 | sort > foo5
cat foo2 | sort > foo6
cat foo3 | sort > foo7
cat foo4 | sort > foo8
comm foo5 foo6 > foomp
comm foo7 foo8 > foomp2
paste foomp foomp2 > output2
sed 's/[\t]/,/g' output2 > output4.csv
cat output4.csv
Right now this outputs two users, their similarities and differences, then does the same for another two users and pastes it together. This works better than doing it by hand, but I know I could be doing more.
An example input file would be something like:
User1
Active Directory
Internet
S: Drive
Sales Records
User2
Active Directory
Internet
Pricing Lookup
S: Drive
User3
Active Directory
Internet
Novell
Sales Records
where they have AD and Internet in common, two out of three have sales records access and S: drive permission, only one of each has Novell and Pricing access.
Can someone give me a hand in what I'm missing?
Using GNU AWK (gawk) you can print a table that shows how multiple users' permissions correlate. You could also do the same thing in any language that supports associative arrays (hashes), such as Bash 4, Python, Perl, etc.
#!/usr/bin/awk -f
{
    array[FILENAME, $0] = $0
    perms[$0] = $0
    if (length($0) > maxplen) {
        maxplen = length($0)
    }
    users[FILENAME] = FILENAME
}
END {
    pcount = asort(perms)
    ucount = asort(users)
    maxplen += 2
    colwidth = 8
    printf("%*s", maxplen, "")
    for (u = 1; u <= ucount; u++) {
        printf("%-*s", colwidth, users[u])
    }
    printf("\n")
    for (p = 1; p <= pcount; p++) {
        printf("%-*s", maxplen, perms[p])
        for (u = 1; u <= ucount; u++) {
            if (array[users[u], perms[p]]) {
                printf("%-*s", colwidth, " X")
            } else {
                printf("%-*s", colwidth, "")
            }
        }
        printf("\n")
    }
}
Save this file, perhaps calling it "correlate", then set it to be executable:
$ chmod u+x correlate
Then, assuming that the filenames correspond to the usernames or are otherwise meaningful (your examples are "user1" through "user3" so that works well), you can run it like this:
$ ./correlate user*
and you would get the following output based on your sample input:
                  user1   user2   user3
Active Directory   X       X       X
Internet           X       X       X
Novell                             X
Pricing Lookup             X
S: Drive           X       X
Sales Records      X               X
Edit:
This version doesn't use asort() and so it should work on non-GNU versions of AWK. The disadvantage is that the order of rows and columns is unpredictable.
#!/usr/bin/awk -f
{
    array[FILENAME, $0] = $0
    perms[$0] = $0
    if (length($0) > maxplen) {
        maxplen = length($0)
    }
    users[FILENAME] = FILENAME
}
END {
    maxplen += 2
    colwidth = 8
    printf("%*s", maxplen, "")
    for (u in users) {
        printf("%-*s", colwidth, u)
    }
    printf("\n")
    for (p in perms) {
        printf("%-*s", maxplen, p)
        for (u in users) {
            if (array[u, p]) {
                printf("%-*s", colwidth, " X")
            } else {
                printf("%-*s", colwidth, "")
            }
        }
        printf("\n")
    }
}
You can use the diff3 program. From the man page:
diff3 - compare three files line by line
Given your sample inputs, above, running diff3 results in:
====
1:3,4c
S: Drive
Sales Records
2:3,4c
Pricing Lookup
S: Drive
3:3,4c
Novell
Sales Records
Does this get you any closer to what you're looking for?
I would use the strings command to remove any binary from the files, cat them together, sort, and then use uniq -c on the result to get a count of occurrences of each string.
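In awk terms (leaving out the strings step, which only matters if the exports contain binary junk), that counting idea looks roughly like this; a count of 3 means all three users have the permission, 2 means two of the three, and so on:
# tally every non-empty line across all the user files, then list the counts
# (the User1/User2/User3 header lines will simply show up with a count of 1 each)
awk '
    /./ { seen[$0]++ }
    END { for (p in seen) print seen[p], p }
' user1 user2 user3 | sort -rn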
