I have a directed graph with about 2000 nodes stored in a file. Each line represents an edge from the node in the first column to the node in the second column; the data is even easy to visualize, for example with dot(1). Columns are separated by tabs, rows by newlines, and node names use only the characters a-zA-Z0-9_. The graph can have multiple roots and it may have cycles, which should be ignored. I don't care about cycles; they are redundant, but they can happen in the input. Below is an example of the graph, using tr to turn the spaces into tabs and a here-document, so the input file is easy to reproduce:
tr ' ' '\t' <<EOF >connections.txt
str1 str2
str2 str3
str3 str4
str100 str2
str100 str101
str101 str102
EOF
I also have a list of some of the nodes in the graph, called heads. These will be the starting nodes, i.e. heads:
tr ' ' '\t' <<EOF >heads.txt
str1
str100
EOF
And I also have a list of the "cost" associated with each node. An example with some random data:
tr ' ' '\t' <<EOF >cost.txt
str1 1
str2 5
str3 10
str4 548
str100 57
str101 39
str102 23
EOF
I want to sum the "cost" of each node while traversing the graph from the nodes stored in heads.txt, and print the sum along with some traversal information for each leaf.
I want to:
for each node in heads.txt
sum the cost from cost.txt of the node into some variable
find that node in connections.txt
find what does this node connect to
and repeat the algorithm for each of the nodes the node connects to
when the node is connected with nothing, print the sum of costs
Ideally the script would look like:
$ script.sh heads.txt connections.txt cost.txt
str1->str2->str3->str4 1+5+10+548 564
str100->str2->str3->str4 57+5+10+548 620
str100->str101->str102 57+39+23 119
I have even written this, and it works:
#!/bin/bash
set -euo pipefail
headsf=$1
connectionsf=$2
costf=$3
get_cost() {
    grep "^$1"$'\t' "$costf" | cut -f2 || echo 0
}
get_conn() {
    grep "^$1"$'\t' "$connectionsf" | cut -f2
}
check_conns() {
    grep -q "^$1"$'\t' "$connectionsf"
}
f_output() {
    printf "%s\t%s\n" "$1" "$2"
}
f() {
    local func cost
    func="$1"
    cost=$(get_cost "$func")
    if ! check_conns "$func"; then
        f_output "${2:+$2->}$func" "${3:+$3+}$cost"
        return
    fi
    get_conn "$func" |
    while IFS=$'\t' read -r calls; do
        if [ "$func" = "$calls" ]; then
            echo "$func is recursive" >&2
            continue
        fi
        if grep -q -w "$calls" <<<"$2"; then
            printf '%s calls recursive %s\n' "$2" "$calls" >&2
            continue
        fi
        f "$calls" "${2:+$2->}$func" "${3:+$3+}$cost"
    done
}
while IFS= read -r head; do
    f "$head" "" ""
done < "$headsf" |
while IFS=$'\t' read -r func calc; do
    tmp=$(bc <<<"$calc")
    printf "%s\t%s\t%s\n" "$func" "$calc" "$tmp"
done |
column -t -s $'\t'
However, it is impossibly slow on bigger inputs. Even with the sample files here (only 6 lines) the script takes 200 ms on my machine. How can I speed it up? Can the inputs be sorted or joined somehow to speed things up (grep doesn't care whether the input is sorted)? Can this be done faster in awk or other unix tools?
I would like to limit myself to the bash shell and standard *unix tools: coreutils, moreutils, datamash and such. I tried doing it in awk, but failed; I have no idea how to search for something recursively in the input in awk. Yet this feels to me "doable" really fast in a shell script.
Since no one has posted an answer yet, here is an awk solution as a starting point:
#!/usr/bin/awk -f
BEGIN {
    FS=OFS="\t"
}
FILENAME=="connections.txt" {
    edges[$1,++count[$1]]=$2
    next
}
FILENAME=="cost.txt" {
    costs[$1]=$2
    next
}
FILENAME=="heads.txt" {
    f($1)
}
function f(node,
           path,cost,sum,prev,sep1,sep2,i) {
    if(node in prev)
        # cycle detected
        return
    path=path sep1 node
    cost=cost sep2 costs[node]
    sum+=costs[node]
    if(!count[node]) {
        print path,cost,sum
    }
    else {
        prev[node] # mark the node as being on the current path
        for(i=1;i<=count[node];++i)
            f(edges[node,i],path,cost,sum,prev,"->","+")
        delete prev[node]
    }
}
Make it read connections.txt and cost.txt before heads.txt.
Its output (padded):
$ awk -f tst.awk connections.txt cost.txt heads.txt
str1->str2->str3->str4 1+5+10+548 564
str100->str2->str3->str4 57+5+10+548 620
str100->str101->str102 57+39+23 119
You say you want only standard tools, but you also mention using dot on your data, so I'm assuming you have the other graphviz utilities available... in particular, gvpr, which is like awk for graphs:
#!/usr/bin/env bash
graph=$(mktemp)
join -t$'\t' -j1 -o 0,1.2,2.2 -a2 \
    <(sort -k1,1 connections.txt) \
    <(sort -k1,1 cost.txt) |
    awk -F$'\t' 'BEGIN { print "digraph g {" }
        { printf "%s [cost = %d ]\n", $1, $3
          if ($2 != "") printf "%s -> %s\n", $1, $2 }
        END { print "}" }' > "$graph"
while IFS= read -r root; do
    gvpr -a "$root" '
    BEGIN {
        int depth;
        int seen[string];
        string path[int];
        int costs[int];
    }
    BEG_G {
        $tvtype = TV_prepostfwd;
        $tvroot = node($, ARGV[0]);
    }
    N {
        if ($.name in seen) {
            depth--;
        } else {
            seen[$.name] = 1;
            path[depth] = $.name;
            costs[depth] = $.cost;
            depth++;
            if (!fstout($) && path[0] == ARGV[0]) {
                int i, c = 0;
                for (i = 0; i < depth - 1; i++) {
                    printf("%s->", path[i]);
                }
                printf("%s\t", $.name);
                for (i = 0; i < depth - 1; i++) {
                    c += costs[i];
                    printf("%d+", costs[i]);
                }
                c += $.cost;
                printf("%d\t%d\n", $.cost, c);
            }
        }
    }' "$graph"
done < heads.txt
rm -f "$graph"
Running this after creating your data files:
$ ./paths.sh
str1->str2->str3->str4 1+5+10+548 564
str100->str2->str3->str4 57+5+10+548 620
str100->str101->str102 57+39+23 119
Or, since it's so ubiquitous it might as well be standard, a sqlite-based solution. This one doesn't even require bash/zsh/ksh93, unlike the above.
$ sqlite3 -batch -noheader -list <<EOF
.separator "\t"
CREATE TABLE heads(node TEXT);
.import heads.txt heads
CREATE TABLE costs(node TEXT PRIMARY KEY, cost INTEGER) WITHOUT ROWID;
.import cost.txt costs
CREATE TABLE connections(from_node TEXT, to_node TEXT
, PRIMARY KEY(from_node, to_node)) WITHOUT ROWID;
.import connections.txt connections
WITH RECURSIVE paths(tail, path, costs, cost) AS
(SELECT h.node, h.node, c.cost, c.cost
FROM heads AS h
JOIN costs AS c ON h.node = c.node
UNION ALL
SELECT conn.to_node, p.path || '->' || conn.to_node
, p.costs || '+' || c.cost, p.cost + c.cost
FROM paths AS p
JOIN connections AS conn ON conn.from_node = p.tail
JOIN costs AS c ON c.node = conn.to_node
)
SELECT path, costs, cost FROM paths AS p
WHERE tail NOT IN (SELECT from_node FROM connections)
ORDER BY path;
EOF
str1->str2->str3->str4 1+5+10+548 564
str100->str101->str102 57+39+23 119
str100->str2->str3->str4 57+5+10+548 620
Related
I have a CSV file with multiple lines. Each line has the same number of columns. What I need to do is to group those lines by a few specified columns and aggregate data from the other columns. Example input file:
proces1,pathA,5-May-2011,10-Sep-2017,5
proces2,pathB,6-Jun-2014,7-Jun-2015,2
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces1,pathA,11-Sep-2017,15-Oct-2017,2
For the above example I need to group lines by the first two columns. From the 3rd column I need to choose the min value, for the 4th column the max value, and the 5th column should have the sum. So, for such an input file I need this output:
proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2
I need to process it in bash (I can use awk or sed as well).
With bash and sort:
#!/bin/bash
# create associative arrays
declare -A month2num=([Jan]=1 [Feb]=2 [Mar]=3 [Apr]=4 [May]=5 [Jun]=6 [Jul]=7 [Aug]=8 [Sep]=9 [Oct]=10 [Nov]=11 [Dec]=12)
declare -A p ds de # date start and date end
declare -A -i sum  # set integer attribute
# function to convert 5-Jun-2011 to 20110605
date2num() { local d m y; IFS="-" read -r d m y <<< "$1"; printf "%d%.2d%.2d\n" $y ${month2num[$m]} $d; }
# read all columns to variables p1 p2 d1 d2 s
while IFS="," read -r p1 p2 d1 d2 s; do
    # if the associative array is still empty for this entry,
    # fill it with the current strings/value
    if [[ -z ${p[$p1,$p2]} ]]; then
        p[$p1,$p2]="$p1,$p2"
        ds[$p1,$p2]="$d1"
        de[$p1,$p2]="$d2"
        sum[$p1,$p2]="$s"
        continue
    fi
    # compare strings, set new strings and sum value
    if [[ ${p[$p1,$p2]} == "$p1,$p2" ]]; then
        [[ $(date2num "$d1") < $(date2num ${ds[$p1,$p2]}) ]] && ds[$p1,$p2]="$d1"
        [[ $(date2num "$d2") > $(date2num ${de[$p1,$p2]}) ]] && de[$p1,$p2]="$d2"
        sum[$p1,$p2]=sum[$p1,$p2]+s
    fi
done < file
# print the content of all associative arrays, keyed by associative array p
for i in "${!p[@]}"; do echo "${p[$i]},${ds[$i]},${de[$i]},${sum[$i]}"; done
Usage: ./script.sh | sort
Output to stdout:
proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2
See: help declare, help read and of course man bash
With awk + sort
awk -F',|-' '
BEGIN{
    A["Jan"]="01"
    A["Feb"]="02"
    A["Mar"]="03"
    A["Apr"]="04"
    A["May"]="05"
    A["Jun"]="06"
    A["Jul"]="07"
    A["Aug"]="08"
    A["Sep"]="09"
    A["Oct"]="10"
    A["Nov"]="11"
    A["Dec"]="12"
}
{
    B[$1","$2]=B[$1","$2]+$9
    z=sprintf("%.2d",$3)
    y=sprintf("%s",$5 A[$4] z)
    if(!start[$1$2])
    {
        end[$1$2]=0
        start[$1$2]=99999999
    }
    if (y < start[$1$2])
    {
        start[$1$2]=y
        C[$1","$2]=$3"-"$4"-"$5
    }
    x=sprintf("%.2d",$6)
    w=sprintf("%s",$8 A[$7] x)
    if(w > end[$1$2])
    {
        end[$1$2]=w
        D[$1","$2]=$6"-"$7"-"$8
    }
}
END{
    for (i in B) print i "," C[i] "," D[i] "," B[i]
}
' infile | sort
Extended GNU awk solution:
awk -F, '
function parse_date(d_str){
    split(d_str, d, "-");
    t = mktime(sprintf("%d %d %d 00 00 00", d[3], m[d[2]], d[1]));
    return t
}
BEGIN{
    m["Jan"]=1; m["Feb"]=2; m["Mar"]=3; m["Apr"]=4; m["May"]=5; m["Jun"]=6;
    m["Jul"]=7; m["Aug"]=8; m["Sep"]=9; m["Oct"]=10; m["Nov"]=11; m["Dec"]=12;
}
{
    k=$1 SUBSEP $2;
    if (k in a){
        if (parse_date(a[k]["min"]) > parse_date($3)) { a[k]["min"]=$3 }
        if (parse_date(a[k]["max"]) < parse_date($4)) { a[k]["max"]=$4 }
    } else {
        a[k]["min"]=$3; a[k]["max"]=$4
    }
    a[k]["sum"]+= $5
}
END{
    for (i in a) {
        split(i, j, SUBSEP);
        print j[1], j[2], a[i]["min"], a[i]["max"], a[i]["sum"]
    }
}' OFS=',' file
The output:
proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2
gawk -v ff=${fileB} '
/^1017/ { print $0 >> ff; next; }
!(/^#/||/^1016/||/^1018/||/^1013/||/^1014/||/^1013/||/^1014/) {
    f=substr($0,11,2)".csv"; print $0 >> "../../" f;
}
' ${csvfiles}
The big file contains around 20 million lines, and we have to read each line. If a line starts with 1017, it is printed to fileB regardless of the rest of the line content.
If a line does not start with one of the numbers in the skip list above (1016, 1013, ...), it is written to a file whose name is taken from the line content. For example the line
1010,abcdefg,123453,343,3434, is written to fg.csv: we take the substring fg from the second column.
The problem is that the performance is about 35k lines per second. Is it possible to make it faster?
sample input
# Exclusion List 1016 1013 ..
# Include line number 1010,1017...
1016,abcdefg,123453,343,3434,
1010,abcdefg,123453,343,3434,
1017,sdfghhj,123453,343,3434,
1034,zxczcvf,123453,343,3434,
1055,zxczcfg,123453,343,3434,
sample output
fileB.csv
1017,sdfghhj,123453,343,3434,
fg.csv
1055,zxczcfg,123453,343,3434,
vf.csv
1034,zxczcvf,123453,343,3434,
Try this:
gawk -v ff="$fileB" '
!/^(#|10(1[6834]|24|55))/{ print > (/^1017/ ? ff : "../../" substr($0,20,2) ".csv") }
' "$csvfiles"
This MAY speed things up if all the time is being spent on file opens/closes:
awk '!/^(#|10(1[6834]|24|55))/{print substr($0,20,2), $0}' "$csvfiles" |
sort -t ' ' |
awk -v ff="$fileB" '
{
    curr = substr($0,1,2)
    str = substr($0,4)
    if ( index(str,"1017") == 1 ) {
        print str > ff
    }
    else {
        if ( curr != prev ) {
            close(out)
            out = "../../" curr ".csv"
            prev = curr
        }
        print str > out
    }
}
'
I'm really not sure if it'll be any faster, but it might be, due to the simpler regexp; at least it's more concise.
Reading a text file into an array, extracting elements and sorting them is taking a very long time.
The text file is ffmpeg console output for R128 audio analysis. I need to get the highest M and S values. Example:
[Parsed_ebur128_0 @ 0x7fd32a60caa0] t: 4.49998 M: -22.2 S: -29.9 I: -27.0 LUFS LRA: 9.8 LU FTPK: -12.4 dBFS TPK: -9.7 dBFS
[Parsed_ebur128_0 @ 0x7fd32a60caa0] t: 4.69998 M: -22.5 S: -28.6 I: -25.9 LUFS LRA: 11.3 LU FTPK: -12.7 dBFS TPK: -9.7 dBFS
The text file can be hundreds or thousands of lines long, depending on the duration of the audio file being analysed.
I want to find the highest M (-22.2) and S (-28.6) values and assign them to the variables M and S.
This is what I am using currently:
ARRAY=()
while read LINE
do
    ARRAY+=("$LINE")
done < $tempDir/text.txt
for LINE in "${ARRAY[@]}"
do
    echo "$LINE" | sed -n '/M:/p' | sed 's/S:.*//' | sed -n -e 's/^.*M://p' | sed -n -e 's/-//p' >>$tempDir/R128M.txt
done
for LINE in "${ARRAY[@]}"
do
    echo "$LINE" | sed -n '/M:/p' | sed 's/I:.*//' | sed -n -e 's/^.*S://p' | sed -n -e 's/-//p' >>$tempDir/R128S.txt
done
cat $tempDir/R128M.txt
M=( $(sort $tempDir/R128M.txt) )
cat $tempDir/R128S.txt
S=( $(sort $tempDir/R128S.txt) )
Is there a faster way of doing this?
Rather than reading the whole file into memory, writing bits of it out to separate files, and reading those in again, just parse it and pick out the largest values:
$ awk '$7 > m || m == "" { m = $7 } $9 > s || s == "" { s = $9 } END { print m, s }' data
-22.2 -28.6
In your data, fields 7 and 9 contain the values of M and S. The awk script will update its m and s variables if it finds larger values in these fields and then print the largest found at the end. The m == "" and s == "" tests are needed to trigger initialization of the values if no value has been read yet.
Another way with awk, which may look cleaner:
$ awk 'FNR == 1 { m = $7; s = $9; next } $7 > m { m = $7 } $9 > s { s = $9 } END { print m, s }' data
To assign them to M and S in the shell:
$ declare $( awk 'FNR == 1 { m = $7; s = $9; next } $7 > m { m = $7 } $9 > s { s = $9 } END { printf("M=%f S=%f\n", m, s) }' data )
$ echo $M $S
-22.200000 -28.600000
Adjust the printf() format to use %s instead of %f if you want the original strings instead of float values, or set the number of decimals you might want with, e.g., %.2f in place of %f.
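For example, a %s variant of the same command (a sketch against the sample data above) keeps the original strings:
$ declare $( awk 'FNR == 1 { m = $7; s = $9; next } $7 > m { m = $7 } $9 > s { s = $9 } END { printf("M=%s S=%s\n", m, s) }' data )
$ echo $M $S
-22.2 -28.6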
First of all, a three-process pipe is a bit redundant for extracting a single value, especially considering that you re-instantiate it anew for every line.
Next, you save all the values into a file and then sort that file, while all you need is the maximum value. You can easily find it during the very first (value-extraction) loop, at an additional O(N) running time, instead of all the I/O overhead and the O(N log N) expense of sorting. See ARITHMETIC EXPANSION and conditional expressions in the bash manual.
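As an illustration, here is a minimal sketch of that idea. It assumes the fixed ebur128 line layout shown above (M: in the 7th whitespace-separated field, S: in the 9th) and values with exactly one decimal place; $tempDir/text.txt is the file from the question.
#!/bin/bash
# Track the maxima while reading; no temp files, no sort.
maxM= maxS=
while read -r -a F; do
    m=${F[6]} s=${F[8]}   # 7th and 9th fields: the M and S values
    # bash arithmetic is integer-only, so compare the values scaled by 10
    # by dropping the decimal point (-22.2 -> -222); valid for one-decimal values
    if [[ -z $maxM ]] || (( ${m/./} > ${maxM/./} )); then maxM=$m; fi
    if [[ -z $maxS ]] || (( ${s/./} > ${maxS/./} )); then maxS=$s; fi
done < "$tempDir/text.txt"
M=$maxM S=$maxS
echo "M=$M S=$S"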
I have a protein sequence file in the following format
uniprotID\space\sequence
sequence is a string of any length, but with only 20 allowed letters, i.e.
ARNDCQEGHILKMFPSTWYV
Example of 1 record
Q5768D AKCCACAKCCAC
I want to create a csv file in the following format
Q5768D
12
ACA 1
AKC 2
CAC 2
CAK 1
CCA 2
KCC 2
This is what I'm currently trying:
#!/bin/sh
while read ID SEQ # uniprot along with sequences
do
    echo $SEQ | tr -d '[[:space:]]' | sed 's/./& /g' > TEST_FILE
    declare -a SSA=(`cat TEST_FILE`)
    SQL=$(echo ${#SSA[@]})
    for (( X=0; X <= "$SQL"; X++ ))
    do
        Y=$(expr $X + 1)
        Z=$(expr $X + 2)
        echo ${SSA[X]} ${SSA[Y]} ${SSA[Z]}
    done | awk '{if (NF == 3) print}' | tr -d ' ' > TEMPTRIMER
    rm TEST_FILE # removing temporary sequence file
    sort TEMPTRIMER | uniq -c > $ID.$SQL
done < $1
In this code I am storing each individual record in a different file, which is not good. Also the program is very slow: in 12 hours only 12000 records out of 0.5 million have been processed.
If this is what you want:
$ cat file
Q5768D AKCCACAKCCAC
OTHER FOOBARFOOBAR
$
$ awk -f tst.awk file
Q5768D OTHER
12 12
AKC 2 FOO 2
KCC 2 OOB 2
CCA 2 OBA 2
CAC 2 BAR 2
ACA 1 ARF 1
CAK 1 RFO 1
This will do it:
$ cat tst.awk
BEGIN { OFS="\t" }
{
    colNr = NR
    rowNr = 0
    name[colNr] = $1
    lgth[colNr] = length($2)
    delete name2nr
    for (i=1; i<=(length($2)-2); i++) {
        trimer = substr($2,i,3)
        if ( !(trimer in name2nr) ) {
            name2nr[trimer] = ++rowNr
            nr2name[colNr,rowNr] = trimer
        }
        cnt[colNr,name2nr[trimer]]++
    }
    numCols = colNr
    numRows = (rowNr > numRows ? rowNr : numRows)
}
END {
    for (colNr=1; colNr<=numCols; colNr++) {
        printf "%s%s", name[colNr], (colNr<numCols?OFS:ORS)
    }
    for (colNr=1; colNr<=numCols; colNr++) {
        printf "%s%s", lgth[colNr], (colNr<numCols?OFS:ORS)
    }
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s %s%s", nr2name[colNr,rowNr], cnt[colNr,rowNr], (colNr<numCols?OFS:ORS)
        }
    }
}
If instead you want output like in @rogerovo's perl answer, that'd be much simpler than the above, more efficient, and use far less memory:
$ cat tst2.awk
{
    delete cnt
    for (i=1; i<=(length($2)-2); i++) {
        cnt[substr($2,i,3)]++
    }
    printf "%s;%s", $1, length($2)
    for (trimer in cnt) {
        printf ";%s=%s", trimer, cnt[trimer]
    }
    print ""
}
$ awk -f tst2.awk file
Q5768D;12;ACA=1;KCC=2;CAK=1;CAC=2;CCA=2;AKC=2
OTHER;12;RFO=1;FOO=2;OBA=2;OOB=2;ARF=1;BAR=2
This perl script processes circa 550,000 "trimers"/sec (random valid test sequences 0-8000 chars long; 100k records (~400 MB) produce a 2 GB output csv).
output:
Q1024A;421;AAF=1;AAK=1;AFC=1;AFE=2;AGP=1;AHC=1;AHE=1;AIV=1;AKN=1;AMC=1;AQD=1;AQY=1;...
Q1074F;6753;AAA=1;AAD=1;AAE=1;AAF=2;AAN=2;AAP=2;AAT=1;ACA=1;ACC=1;ACD=1;ACE=3;ACF=2;...
code:
#!/usr/bin/perl
use strict;
$|=1;
my $c;
# process each line on input
while (readline STDIN) {
    $c++; chomp;
    # is it a valid line? has the format and a sequence to process
    if (m~^(\w+)\s+([ARNDCQEGHILKMFPSTWYV]+)\r?$~ and $2) {
        print join ";",($1,length($2));
        my %trimdb;
        my $seq=$2;
        # split the sequence into chars
        my @a=split //,$seq;
        my @trimmer;
        # while there are unprocessed chars in the sequence...
        while (scalar @a) {
            # fill up the buffer with a char from the top of the sequence
            push @trimmer, shift @a;
            # if the buffer is full (has 3 chars), increase the trimer frequency
            if (scalar @trimmer == 3) {
                $trimdb{(join "",@trimmer)}++;
                # drop the first letter from buffer, for next loop
                shift @trimmer;
            }
        }
        # we're done with the sequence - print the sorted list of trimers
        foreach (sort keys %trimdb) {
            # print in a csv (;) line
            print ";$_=$trimdb{$_}";
        }
        print "\n";
    }
    else {
        # the input line was not valid
        print STDERR "input error: $_\n";
    }
    # just a progress counter
    printf STDERR "%8i\r",$c if not $c%100;
}
print STDERR "\n";
If you have perl installed (most Linuxes do; check the path /usr/bin/perl or replace it with yours), just run: ./count_trimers.pl < your_input_file.txt > output.csv
I have a large file where each line consists of 24 small integers separated by whitespace. I would like to find, for each line, the longest segment that repeats, allowing the line to wrap around. For example, given the line
0 10 4 2 7 9 11 8 6 5 0 10 4 2 7 11 9 3 8 3 1 1 6 5
the sequence 6 5 0 10 4 2 7 is longest; it has length 7 and the two occurrences are separated by 10 positions (or 14).
Could someone show me how to cobble a script together to return, for each line, the length of the longest sequence and the interval between its two beginnings?
The way the file is constructed, it will be impossible for any segment to be repeated more than once (i.e. to have more than two appearances), because each number from 0 to 11 is constrained to appear exactly twice.
Much appreciated. --Lloyd
Here is a rather obfuscated solution that works on a single line of input. Wrap the whole thing in a loop that reads the line from your input rather than setting it explicitly, and you should have a viable (albeit terribly slow and ugly) solution.
#!/bin/sh
input='0 10 4 2 7 9 11 8 6 5 0 10 4 2 7 11 9 3 8 3 1 1 6 5'
trap 'rm -f $TMP1 $TMP2' 0
TMP1=$(mktemp $(basename $0.XXXX))
TMP2=$(mktemp $(basename $0.XXXX))
input="$input $input" # handle wrap-around
seq 0 11 | while read start_value; do
    echo $input | tr \ \\n | grep -w -n $start_value | sed 's/:.*//' | {
        read i
        read j
        delta=$( expr $j - $i )
        echo $input | tr \ \\n | sed -n "$i,${j}p" > $TMP1
        echo $input | tr \ \\n | sed -n "$j,\$p" > $TMP2
        diff $TMP1 $TMP2 | { IFS=a read length junk
            echo $length $delta $start_value
        }
    }
done | sort -rn | sed 1q | { read length delta start;
    printf "%s " "The sequence"
    echo $input | tr \ \\n | awk '$0==k{t=1}t' k=$start | sed "${length}q"
    echo ' is the longest sequence.'
    /bin/echo -n The difference between starting positions is $delta '(or '
    expr 24 - $delta
    echo ')'
} | tr \\n ' '
echo
There are a lot of languages that would make this easier than awk (including gawk), but here's an all-awk answer.
Try putting this into an executable awk file:
#!/usr/bin/awk -f
BEGIN { DELIM=":" }
function reorder(start) {
    head = ""
    tail = ""
    for( i=1; i<=NF; i++ ) {
        if( i<start ) tail = sprintf( "%s%s%s", tail, $i, FS )
        else head = sprintf( "%s%s%s", head, $i, FS )
    }
    # last field is the starting index
    return( head tail start )
}
function longest(pair) {
    split( pair, a, DELIM )
    split( a[1], one, FS )
    split( a[2], two, FS )
    long = ""
    for( i=1; i<=NF; i++ ) {
        if( one[i] != two[i] ) break
        long = sprintf( "%s%s%s", long, one[i], FS )
    }
    return( i-1 DELIM two[NF+1]-one[NF+1] DELIM long )
}
{
    for( k=1; k<=NF; k++ ) {
        pairs[$k] = (pairs[$k]=="" ? "" : pairs[$k] DELIM) reorder( k )
    }
    for( p in pairs ) {
        tmp = longest( pairs[p] )
        out = tmp>out ? tmp : out
    }
    print out
}
If I call this awko, then running awko data yields output in the form:
# of matched fields:index separation:longest match
which for the input data is:
7:14:6 5 0 10 4 2 7
Notice that I haven't bothered to clean up the extra space at the end of the matched data. With more input data, I'd have a better idea whether this has bugs or not.
I wanted to see how fast I could do this:
#!/usr/bin/awk -f
BEGIN { OFS=":" }
function longest(first, second) {
    long = ""
    s = second
    flds = 0
    for( f=first; f<=NF; f++ ) {
        if( $f != $s ) break
        long = sprintf( "%s%s%s", long, $f, " " )
        if( s==NF ) s=0
        s++
        flds++
    }
    return( flds OFS second-first OFS long )
}
{
    for( k=1; k<=NF; k++ ) {
        val = pos[$k]
        if( val!="" ) {
            tmp = longest( val, k )
            delete pos[$k] # need an awk/gawk that can remove elements, or "delete pos" later
        }
        else pos[$k] = k
        out = tmp>out ? tmp : out
    }
    print out
}
It's ~200% faster than the first attempt. It uses only a single outer field loop and processes each matching number as soon as its second occurrence is found, working on the original parsed fields. Running the same data over and over (2400 lines' worth) gave me a system time of 0.33s instead of the 71.10s I got from the first script on the same data.