Bash/Awk: Reformat uneven columns with multiple delimiters

I have a CSV where I need to reformat the contents of a single column.
The problem is that the cells to reformat all have different lengths.
The current column looks like this (these are two lines of a single column):
Foo*foo*foo*1970,1980+Bar*bar*bar*1970
Foobar*Foobar*foobarbar*1970,1975,1980
The result should look like this (still two lines, one column):
Foo*foo*foo*1970+Foo*foo*foo*1980+Bar*bar*bar*1970
Foobar*Foobar*foobarbar*1970+Foobar*Foobar*foobarbar*1975+Foobar*Foobar*foobarbar*1980
This is what I'm trying to do:
#!/bin/bash
cat foocol | \
awk -F'+' \
'{for i in NF print $i}' \
| awk -F'*' \
'{$Foo=$1"*"$2"*"$3"*" print $4}' \
\
| awk -v Foo=$Foo -F',' \
'{for j in NF do \
print Foo""$j"+" }' \
> newcol
The idea is to iterate over the '+'-delimited records, and for each one pair the first three '*'-delimited values with every ','-delimited year, joining the copies with '+'.
But I'm just getting syntax errors everywhere.
Thanks

$ awk --re-interval -F, -v OFS=+ '{match($1,/([^*]*\*){3}/);
prefix=substr($0,RSTART,RLENGTH);
for(i=2;i<=NF;i++) $i=prefix $i }1' file
Foo*foo*foo*1970+Foo*foo*foo*1980+Bar*bar*bar*1970
Foobar*Foobar*foobarbar*1970+Foobar*Foobar*foobarbar*1975+Foobar*Foobar*foobarbar*1980
Perhaps add validation with if(match(...)) so that lines without the expected prefix are left alone.
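For the record, here is a sketch of that validated form (same GNU awk assumptions as above); lines lacking the expected three-star prefix pass through unchanged:
awk --re-interval -F, -v OFS=+ '{
    if (match($1, /([^*]*\*){3}/)) {          # does the line start with "a*b*c*"?
        prefix = substr($0, RSTART, RLENGTH)  # e.g. "Foo*foo*foo*"
        for (i = 2; i <= NF; i++)             # prepend that prefix to every
            $i = prefix $i                    #   comma-separated field after the first
    }
} 1' file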

Solution in TXR:
$ txr reformat.txr data
Foo*foo*foo*1970+Foo*foo*foo*1980+Bar*bar*bar*1970
Foobar*Foobar*foobarbar*1970+Foobar*Foobar*foobarbar*1975+Foobar*Foobar*foobarbar*1980
Code in reformat.txr:
#(repeat)
# (coll)#/\+?/#a*#b*#c*#(coll)#{x /[^,+]+/}#(until)+#(end)#(end)
# (output :into items)
# (repeat)
# (repeat)
#a*#b*#c*#x
# (end)
# (end)
# (end)
# (output)
# {items "+"}
# (end)
#(end)
This solution is based on regarding the data to have nested syntax: groups of records are delimited by newlines. Records within groups are separated by + and within records there are four fields separated by *. The last field contains comma-separated items. The data is to be normalized by expanding copies of the records such that the comma-separated items are distributed across the copies.
The outer #(repeat) handles walking over the lines. The outer #(coll) iterates over records, collecting the first three fields into variables a, b and c. Then an inner #(coll) gets each comma separated item into the variable x. The inner #(coll) collects the x-s into a list, and the outer #(coll) also collects all the variables into lists, so a, b, c become lists of strings, and x is a list of lists of strings.
The :into items keyword parameter in the output causes the lines which would normally go to the standard output device to be collected into a list of strings and bound to a variable. For instance:
#(output :into lines)
a
b
cd
#(end)
establishes a variable lines which contains the list ("a" "b" "cd").
So here we are getting the output of the doubly-nested repeat as a bunch of lines, where each line represents a record, stored in a variable called items. Then we output these using #{items "+"}, a syntax which outputs the contents of a list variable with the given separator.
The doubly nested repeat handles the expansion of records over each comma separated item from the fourth field. The outer repeat implicitly iterates over the lists a, b, c and x. Inside the repeat, these variables denote the items of their respective lists. Variable x is a list of lists, and so the inner repeat iterates over that. Inside the outer repeat, variables a, b, c are already scalar, and stay that way in the scope of the inner repeat: only x varies, which is exactly what we want.
In the data collection across each line, there are some subtleties:
# (coll)#/\+?/#a*#b*#c*#(coll)#{x /[^,+]+/}#(until)+#(end)#(end)
Firstly, we match an optional leading plus with the /\+?/ regex, thereby consuming it. Without this, the a field of every record, except for the first one, would include that separating + and we would get double +-s in the final output. The a, b, c variables are matched simply. TXR is non-greedy with regard to the separating material: #a* means match some characters up to the nearest * and bind them to a variable a. Collecting the x list is trickier. Here we use a positive-regex-match variable: #{x /[^,+]+/} to extract the sub-field. Each x is a sequence of one or more characters which are not pluses or commas, extracted positively without regard for whatever follows, much like a tokenizer extracts a token. This inner collect terminates when it encounters a +, which is what the #(until)+ clause ensures. It will also implicitly terminate if it hits the end of the line; the #(until) match isn't mandatory (by default). That terminating + stays in the input stream, which is why we have to recognize it and discard it in front of the #a.
It should be noted that #(coll), by default, scans for matches and skips regions of text that do not match, just like its cousin #(collect) does with lines. For instance if we have #(coll)#{foo /[a-z]+/}#(end), which collects sequences of lower-case letters into foo, turning foo into a list of such strings, and if the input is 1234abcd-efgh.... ijk, then foo ends up with the list ("abcd" "efgh" "ijk"). This is why there is no explicit logic in the inner #(coll) to consume the separating commas: they are implicitly skipped.
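As an aside for readers more at home with POSIX tools: the way #(coll) skips non-matching text is loosely analogous to grep -o, as in this illustration (not part of the TXR solution):
$ echo '1234abcd-efgh.... ijk' | grep -o '[a-z]\+'
abcd
efgh
ijk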

Related

Bash: compare if a value is contained within an interval

I got a text file which is tab separated and contains 2 columns like this:
1227637 1298347
1347879 1356788
1389993 1399847
... ...
Now I got some values from an analysis and I'd like to check if these values are contained in my text file intervals.
For example if I have 1227659, which is contained in the first interval, I'd like the bash-script to print to std out something like:
1227659 is contained between 1227637 and 1298347
Thanks.
How about:
awk -v x=1227659 '
$1<x && x<$2 {print x, "is contained between", $1, "and", $2}
' intervals.txt
1227659 is contained between 1227637 and 1298347
If you want any end of the interval to be interpreted as inclusive, change < to <= accordingly. If you want to stop after the first match (makes only sense if the intervals can overlap), add ; exit before the closing curly brace }.
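If there are many values to test, a variation on the same idea avoids invoking awk once per value; here the query values are assumed (hypothetically) to sit one per line in values.txt:
awk 'NR==FNR { vals[$1]; next }     # first file: remember every query value
     { for (v in vals)              # then, for each interval line,
         if ($1 < v+0 && v+0 < $2)  # test every query against it
             print v, "is contained between", $1, "and", $2
     }' values.txt intervals.txt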

Pad Independently Missing Columns per Row in CSV with Bash (based off expected values)

I have a CSV file in which the ideal format for a row is this:
taxID#, scientific name, kingdom, k, phylum, p, class, c, order, o, family, f, genus, g
...where kingdom, phylum, etc. are identifiers, literals ("kingdom", ... "phylum"), and the values that follow the identifiers (k, p, etc.) are the actual values for those kingdoms, phyla, etc.
Example:
240395,Rugosa emeljanovi,kingdom,Metazoa,phylum,Chordata,class,Amphibia,order,Anura,family,Ranidae,genus,Rugosa
However, not all rows possess all levels of taxonomy, i.e. any one row might be missing the columns for an identifier/value pair, say, "class, c," and any 2-column PAIR can be missing independently of the other pairs missing or not. Also, if fields are missing, they will always be missing with their identifier field, so I'd never get "kingdom, phylum" together without the value for "k" between them. Thus much of my file is missing random fields:
...
135487,Nocardia cyriacigeorgica,class,Actinobacteria,order,Corynebacteriales,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,phylum,Actinobacteria,class,Actinobacteria
171953,uncultured Acidobacteria bacterium,phylum,Acidobacteria
77133,uncultured bacterium
...
Question: How can I write a bash shell script that can "pad" every row in a file so that every field pair that may be missing from my ideal format is inserted, and its value column that follows is just blank. Desired output:
...
135487,Nocardia cyriacigeorgica,kingdom,,phylum,,class,Actinobacteria,order,Corynebacteriales,family,,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,kingdom,,phylum,Actinobacteria,class,Actinobacteria,order,,family,,genus,
171953,uncultured Acidobacteria bacterium,kingdom,,phylum,Acidobacteria,class,,order,,family,,genus,
77133,uncultured bacterium,kingdom,,phylum,,class,,order,,family,,genus,
...
Notes:
Notice that if genus is missing, the padded output should still end with a comma, to denote that the value for genus doesn't exist.
taxID# and scientific name (the first two fields) will ALWAYS be present.
I don't care for time/resource efficiency if your solution is brute-forcey.
What I've tried:
I wrote a simple if/then script that checks sequentially whether an expected field is gone. Pseudocode:
if "$f3" is not "kingdom", pad
but the problem is that if kingdom is truly missing, it gets padded in the output while the remaining field variables are shifted, so I can't just follow that with
if "$f5" is not "phylum", pad
because if kingdom were missing, phylum would now be in field 3 ($f3), not $f5 (that is, if phylum itself weren't missing too). (I did this by concatenating the expected output onto a string variable based on the absence of each field, concatenating the original value when the field wasn't missing, and then echoing the finished, supposedly padded row to output.)
I'd like to be able to execute my script like this
bash pad.sh prePadding.csv postPadding.csv
but I would accept answers using Mac Excel 2011 if needed.
Thank you!!
Although it should be possible in bash, I would use Perl for this. I tried to make the code as simple to understand as I could.
#!/usr/bin/perl
while (<>){
    chomp;
    my @fields = split ',';
    my $kingdom = '';
    my $phylum = '';
    my $class = '';
    my $order = '';
    my $family = '';
    my $genus = '';
    for (my $i = 2; $i < $#fields; $i += 2){
        if ($fields[$i] eq 'kingdom'){ $kingdom = $fields[$i+1]; }
        if ($fields[$i] eq 'phylum'){ $phylum = $fields[$i+1]; }
        if ($fields[$i] eq 'class'){ $class = $fields[$i+1]; }
        if ($fields[$i] eq 'order'){ $order = $fields[$i+1]; }
        if ($fields[$i] eq 'family'){ $family = $fields[$i+1]; }
        if ($fields[$i] eq 'genus'){ $genus = $fields[$i+1]; }
    }
    print "$fields[0],$fields[1],kingdom,$kingdom,phylum,$phylum,class,$class,order,$order,family,$family,genus,$genus\n";
}
Which gives me:
perl pad.pl input
135487,Nocardia cyriacigeorgica,kingdom,,phylum,,class,Actinobacteria,order,Corynebacteriales,family,,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,kingdom,,phylum,Actinobacteria,class,Actinobacteria,order,,family,,genus,
171953,uncultured Acidobacteria bacterium,kingdom,,phylum,Acidobacteria,class,,order,,family,,genus,
(or for better reading:)
perl pad.pl input | tableize -t | sed 's/^/ /'
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|135487|Nocardia cyriacigeorgica |kingdom| |phylum| |class|Actinobacteria|order|Corynebacteriales|family| |genus|Nocardia|
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|10090 |Mus musculus |kingdom|Metazoa|phylum|Chordata |class|Mammalia |order|Rodentia |family|Muridae|genus|Mus |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|152507|uncultured actinobacterium |kingdom| |phylum|Actinobacteria|class|Actinobacteria|order| |family| |genus| |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|171953|uncultured Acidobacteria bacterium|kingdom| |phylum|Acidobacteria |class| |order| |family| |genus| |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
This would be the answer in bash using associative arrays:
#!/bin/bash
declare -A THIS
while IFS=, read -a LINE; do
    # we always get the #ID and name
    if (( ${#LINE[@]} < 2 || ${#LINE[@]} % 2 )); then
        echo Invalid CSV line: "${LINE[@]}" >&2
        continue
    fi
    echo -n "${LINE[0]},${LINE[1]},"
    THIS=()
    for (( INDEX=2; INDEX < ${#LINE[@]}; INDEX+=2 )); do
        THIS[${LINE[INDEX]}]=${LINE[INDEX+1]}
    done
    for KEY in kingdom phylum class order family; do
        echo -n $KEY,${THIS[$KEY]},
    done
    echo genus,${THIS[genus]}
done <$1 >$2
It also validates CSV lines so that they contain at least 2 columns (ID and name) and that they have an even number of columns.
The script can be extended to do more error checking (e.g. that both arguments were passed and that the input file exists), but it should work as expected when invoked just the way you posted.
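If you prefer a single awk pass over the Perl or bash versions, a roughly equivalent sketch (assuming GNU awk, for delete of a whole array) could look like this:
awk 'BEGIN {
    FS = OFS = ","
    n = split("kingdom phylum class order family genus", rank, " ")
}
{
    delete val                                  # reset the rank->value map per row
    for (i = 3; i < NF; i += 2)                 # identifier/value pairs start at field 3
        val[$i] = $(i + 1)
    out = $1 OFS $2                             # taxID and name are always present
    for (r = 1; r <= n; r++)
        out = out OFS rank[r] OFS val[rank[r]]  # missing ranks yield an empty value
    print out
}' prePadding.csv > postPadding.csv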

bash: identifying the first value in a list that also exists in another list

I have been trying to come up with a nice way in BASH to find the first entry in list A that also exists in list B, where A and B are in separate files.
A B
1024dbeb 8e450d71
7e474d46 8e450d71
1126daeb 1124dae9
7e474d46 7e474d46
1124dae9 3217a53b
In the example above, 7e474d46 is the first entry in A that also appears in B, so I would return 7e474d46.
Note: A can be millions of entries, and B can be around 300.
awk is your friend.
awk 'NR==FNR{a[$1]++;next}{if(a[$1]>=1){print $1;exit}}' file2 file1
7e474d46
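The same one-liner expanded with comments (logic unchanged):
awk 'NR==FNR { a[$1]++; next }      # first pass, over file2 (B): count every value
     a[$1] >= 1 { print $1; exit }  # second pass, over file1 (A): print the first
    ' file2 file1                   #   value also seen in B, then stop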
Note: check the [ previous version ] of this answer too, which assumed that the values were listed in a single file as two columns. This one was written after you clarified in [ this ] comment that the values are fed in as two files.
A few points are still unclear, though, like what happens if a number in list A appears 2 times or more (in your example itself, 7e474d46 appears twice). Assuming you need all the line numbers of list A that are also present in list B, the following will do that:
awk '{col1[$1]=col1[$1]?col1[$1]","FNR:FNR;col2[$2];} END{for(i in col1){if(i in col2){print col1[i],i}}}' Input_file
OR(NON-one liner form of above solution)
awk '{
    col1[$1] = col1[$1] ? col1[$1] "," FNR : FNR;
    col2[$2];
}
END{
    for (i in col1) {
        if (i in col2) {
            print col1[i], i
        }
    }
}
' Input_file
Above code will provide following output.
3,5 7e474d46
6 1124dae9
Here we create an array col1 whose index is the first field, and an array col2 whose index is $2. col1's value is the current line number, concatenated onto whatever value it held before. In the END section we then traverse the col1 array and check whether each of its indices is also present in array col2; if yes, we print col1's value and that index.
If you have GNU grep, you can try this:
grep -m 1 -f B A
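Note that -f B treats each line of B as a pattern, so substring and regex effects can cause false positives. If the entries should match only as whole, fixed strings, this stricter variant (still GNU grep, for -m) may be safer:
grep -m 1 -Fxf B A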

Using awk or sed to print column of CSV file enclosed in double quotes

I'm working on a CSV file like the one below: comma delimited, each cell enclosed in double quotes, but some cells contain a double quote and/or comma inside the double-quote enclosure. The actual file contains around 300 columns and 200,000 rows.
"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with "comma" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, "cde" here","cde","cde","cde"
I need to remove some unneeded columns and merge the last few columns, with </br> between them instead of ",", and move the second column to the end. Anything within the cells should stay the same as in the original file, with double quotes and commas preserved. Below is an example of the output that I need.
"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, "cde" here","cde</br>cde</br>cde","cde"
In this example I want to remove column 3 and merge columns 5, 6, and 7.
Below is the code that I tried to use, but it reads the double quotes and/or commas (and therefore the cell and row boundaries) differently than I expected.
awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print $1,$4,$5"</br>"$6"</br>"$7",$2}' inputfile.csv
sed -i 's#"</br>"#</br>#g' inputfile.csv
sed is used to remove the beginning and ending double quotes of a cell.
In the output file that I'm getting right now, if the previous field contains a double quote, that quote is treated as the beginning of a cell, so the following values are often pushed over a column.
Other code that I have used treats every comma as the beginning of a cell, so that won't work either.
awk -F',' 'BEGIN{OFS=",";} {print $1,$4,$5"</br>"$6"</br>"$7",$2}' inputfile.csv
sed -i 's#"</br>"#</br>#g' inputfile.csv
Any help is greatly appreciated. thanks!
CSV is a loose format. There may be subtle variations in formatting. Your particular format may or may not be expressible with a regular grammar/regular expression. (See this question for a discussion about this.) Even if your particular formatting can be expressed with regular expressions, it may be easier to just whip out a parser from an existing library.
It is not a bash/awk/sed solution as you may have wanted or needed, but Python has a csv module for parsing CSV files. There are a number of options to tweak the formatting. Try something like this:
#!/usr/bin/python
import csv

with open('infile.csv', 'r') as infile, open('outfile.csv', 'wb') as outfile:
    inreader = csv.reader(infile)
    outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for row in inreader:
        # Merge fields 5,6,7 (indexes 4,5,6) into one
        row[4] = "</br>".join(row[4:7])
        del row[5:7]
        # Copy second field to the end
        row.append(row[1])
        # Remove second and third fields
        del row[1:3]
        # Write manipulated row
        outwriter.writerow(row)
Note that in Python, indexes start with 0 (e.g. row[1] is the second field). The first index of a slice is inclusive, the last is exclusive (row[1:3] is row[1] and row[2] only). Your formatting seems to require quotes around every field, hence the quoting=csv.QUOTE_ALL. There are more options at Dialects and Formatting Parameters.
The above code produces the following output:
"Column1","Column4","Column5</br>Column6</br>Column7","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, cde"" here""","cde</br>cde</br>cde","cde"
There are two issues with this:
It doesn't treat the first row any differently, so the headers of columns 5, 6, and 7 are merged like the other rows.
Your input CSV contains "some other, "cde" here" (third row, fourth column) with unescaped quotes around the cde. There is another case of this on line two, but it was removed since it is in column 3. The result contains incorrect quotes.
If these quotes are properly escaped, your sample input CSV file becomes
infile.csv (escaped quotes):
"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with ""comma"" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, ""cde"" here","cde","cde","cde"
Now consider this modified Python script that doesn't merge columns on the first row:
#!/usr/bin/python
import csv
with open('infile.csv', 'r') as infile, open('outfile.csv', 'wb') as outfile:
inreader = csv.reader(infile)
outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
first_row = True
for row in inreader:
if first_row:
first_row = False
else:
# Merge fields 5,6,7 (indexes 4,5,6) into one
row[4] = "</br>".join(row[4:7])
del row[5:7]
# Copy second field (index 1) to the end
row.append(row[1])
# Remove second and third fields
del row[1:3]
# Write manipulated row
outwriter.writerow(row)
The output outfile.csv is
"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, ""cde"" here","cde</br>cde</br>cde","cde"
This is your sample output, but with properly escaped "some other, ""cde"" here".
This may not be precisely what you wanted, not being a sed or awk solution, but I hope it is still useful. Processing more complicated formats may justify more complicated tools. Using an existing library also removes a few opportunities to make mistakes.
This might be an oversimplification of the problem but this has worked for me with your test data:
cat /tmp/inputfile.csv | sed 's#\"\,\"#|#g' | sed 's#"</br>"#</br>#g' | awk 'BEGIN {FS="|"} {print $1 "," $4 "," $5 "</br>" $6 "</br>" $7 "," $2}'
Please note that I am on a Mac; that's probably why I had to wrap the commas in the awk script in quotation marks.

Get common lines, for only specific fields, from multiple files

I am trying to understand the following code used to pull out overlapping lines over multiple files using BASH.
awk 'END {
    # the END block is executed after
    # all the input has been read
    # loop over the rec array
    # and build the dup array, indexed by the number of
    # filenames containing a given record
    for (R in rec) {
        n = split(rec[R], t, "/")
        if (n > 1)
            dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s", rec[R], R) : \
                sprintf("\t%-20s -->\t%s", rec[R], R)
    }
    # loop over the dup array
    # and report the number and the names of the files
    # containing the record
    for (D in dup) {
        printf "records found in %d files:\n\n", D
        printf "%s\n\n", dup[D]
    }
}
{
    # build an array named rec (short for record), indexed by
    # the content of the current record ($0), concatenating
    # the filenames separated by / as values
    rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
}' file[a-d]
After understanding what each sub-block of code is doing, I would like to extend this code to find specific fields with overlap, rather than the entire line. For example, I have tried changing the line:
n = split(rec[R], t, "/")
to
n = split(rec[R$1], t, "/")
to find the lines where the first field is the same across all files but this did not work. Eventually I would like to extend this to check that a line has fields 1, 2, and 4 the same, and then print the line.
Specifically, for the files mentioned in the example in the link:
if file 1 is:
chr1 31237964 NP_055491.1 PUM1 M340L
chr1 33251518 NP_037543.1 AK2 H191D
and file 2 is:
chr1 116944164 NP_001533.2 IGSF3 R671W
chr1 33251518 NP_001616.1 AK2 H191D
chr1 57027345 NP_001004303.2 C1orf168 P270S
I would like to pull out:
file1/file2 --> chr1 33251518 AK2 H191D
I found this code at the following link:
http://www.unix.com/shell-programming-and-scripting/140390-get-common-lines-multiple-files.html#post302437738. Specifically, I would like to understand what R, rec, n, dup, and D represent in terms of the files themselves. It is unclear from the comments provided, and the printf statements I've added within the sub-loops fail.
Thank you very much for any insight on this!
The script works by building an auxiliary array, the indices of which are the lines in the input files (denoted by $0 in rec[$0]), and the values are filename1/filename3/... for those filenames in which the given line $0 is present. You can hack it up to just work with $1,$2 and $4 like so:
awk 'END {
    # the END block is executed after
    # all the input has been read
    # loop over the rec array
    # and build the dup array, indexed by the number of
    # filenames containing a given record
    for (R in rec) {
        n = split(rec[R], t, "/")
        if (n > 1) {
            split(R, R1R2R4, SUBSEP)
            dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1], R1R2R4[2], R1R2R4[3]) : \
                sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1], R1R2R4[2], R1R2R4[3])
        }
    }
    # loop over the dup array
    # and report the number and the names of the files
    # containing the record
    for (D in dup) {
        printf "records found in %d files:\n\n", D
        printf "%s\n\n", dup[D]
    }
}
{
    # build an array named rec (short for record), indexed by
    # the partial content of the current record
    # (special concatenation of $1, $2 and $4),
    # concatenating the filenames separated by / as values
    rec[$1,$2,$4] = rec[$1,$2,$4] ? rec[$1,$2,$4] "/" FILENAME : FILENAME
}' file[a-d]
This solution makes use of multidimensional arrays: we create rec[$1,$2,$4] instead of rec[$0]. This special syntax of awk concatenates the indices with the SUBSEP character, which is non-printable by default ("\034" to be precise), and so is unlikely to be part of any of the fields. In effect it does rec[$1 SUBSEP $2 SUBSEP $4]=.... Otherwise this part of the code is the same. Note that it would be more logical to move the second block to the beginning of the script, and finish with the END block.
The first part of the code also has to be changed: now for (R in rec) loops over these tricky concatenated indices, $1 SUBSEP $2 SUBSEP $4. This is fine for indexing, but you need to split R at the SUBSEP characters to recover the printable fields $1, $2, $4. These are put into the array R1R2R4, which can be used to print the necessary output: instead of %s,...,R we now have %s\t%s\t%s,...,R1R2R4[1],R1R2R4[2],R1R2R4[3]. In effect we're doing sprintf ...%s,...,$1,$2,$4 with pre-saved fields $1, $2, $4. For your input example this will print
records found in 2 files:
foo11.inp1/foo11.inp2 --> chr1 33251518 AK2
Note that the output is missing H191D, but rightly so: that is not in field 1, 2 or 4 (but rather in field 5), so there's no guarantee that it is the same in the matched files! You probably don't want to print it, or in any case you'd have to specify how to treat the columns which are not checked between files (and so may differ).
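To see the SUBSEP mechanics in isolation (an illustrative one-liner only, not part of the solution):
$ echo 'chr1 33251518 NP_001616.1 AK2 H191D' | \
  awk '{ rec[$1,$2,$4]; for (k in rec) { split(k, f, SUBSEP); print f[1], f[2], f[3] } }'
chr1 33251518 AK2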
A bit of explanation for the original code:
rec is an array, the indices of which are full lines of input, and the values are the slash-separated list of files in which those lines appear. For instance, if file1 contains a line "foo bar", then rec["foo bar"]=="file1" initially. If then file2 also contains this line, then rec["foo bar"]=="file1/file2". Note that there are no checks for multiplicity, so if file1 contains this line twice, then eventually you'll get rec["foo bar"]=file1/file1/file2 and obtain 3 for the number of files containing this line.
R goes over the indices of the array rec after it has been fully built. This means that R will eventually assume each unique line of every input file, allowing us to loop over rec[R], containing the filenames in which that specific line R was present.
n is a return value from split, which splits the value of rec[R] --- that is the filename list corresponding to line R --- at each slash. Eventually the array t is filled with the list of files, but we don't make use of this, we only use the length of the array t, i.e. the number of files in which line R is present (this is saved in the variable n). If n==1, we don't do anything, only if there are multiplicities.
the loop over R creates classes according to the multiplicity of a given line: n==2 applies to lines that are present in exactly 2 files, n==3 to those which appear thrice, and so on. What this loop does is build an array dup, which for every multiplicity class (i.e. for every n) gathers the output strings "filename1/filename2/... --> R", separated by RS, for each value of R that appears in n files total. So eventually dup[n] for a given n contains a number of strings in the form "filename1/filename2/... --> R", concatenated with the RS character (by default a newline).
The loop over D in dup will then go through multiplicity classes (i.e. valid values of n larger than 1), and print the gathered output lines which are in dup[D] for each D. Since we only defined dup[n] for n>1, D starts from 2 if there are multiplicities (or, if there aren't any, then dup is empty, and the loop over D will not do anything).
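To watch rec and dup being built on a toy example (hypothetical files f1 and f2):
$ printf 'x\ny\n' > f1; printf 'y\nz\n' > f2
$ awk '{ rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME }
       END { for (R in rec) {
                 n = split(rec[R], t, "/")
                 if (n > 1) print n " files:", rec[R], "-->", R } }' f1 f2
2 files: f1/f2 --> y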
First you'll need to understand the 3 blocks in an AWK script:
BEGIN{
    # code that is executed once before the data processing starts
}
{
    # block without a name (the default/main block)
    # executed per line of input
    # $0 contains the whole line / all columns
    # $1 first column
    # $2 second column, and so on..
}
END{
    # code that is executed once after all data processing has finished
}
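A minimal demonstration of the three blocks (illustrative only):
$ printf 'a\nb\n' | awk 'BEGIN { print "start" } { print "line", NR ":", $0 } END { print "done" }'
start
line 1: a
line 2: b
done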
So you'll probably need to edit this part of the script:
{
    # build an array named rec (short for record), indexed by
    # the content of the current record ($0), concatenating
    # the filenames separated by / as values
    rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
}
