How to merge two .txt files into one, matching each line by timestamp, using awk - bash

Summary:
I currently have two .txt files imported from a survey system I am testing. Column 1 of each data file is a timestamp of the format "HHMMSS.SSSSSS". In file1, there is a second column of field intensity readings. In file2 there are two additional columns of positional information. I'm attempting to write a script that matches data points between these files by lining the timestamps up. The issue is that at no point are any of the timestamps the exact same value. The script must be able to match data points (lines in each .txt file) based the timestamp of its closest counterpart in the other file (i.e. the time 125051.354948 from file1 should "match" the nearest timestamp in file2, which is 125051.112784).
If anyone with a little bit more awk/sed/join/regex/Unix knowledge could point me in the right direction, I would be very appreciative.
What I have so far:
(Please note that the exact syntax shown here may not make sense for the sample .txt files attached in this question, there are more extensive versions of these files with more columns that were being used for testing scripts.)
I'm new to awk/Unix/shell scripting so please bear with me if some of these trial solutions don't work or don't make a whole lot of sense.
I have already attempted some solutions posted here on stack overflow using join, but it doesn't seem to want to properly sort or join either of these files:
${
join -o 1.1,2.2 -1 2 -2 1 <(sort -k 2 file1) <(sort -k 1 file2)
join -v 1 -o 1.1,1.2 -1 2 -2 1 <(sort -k 2 file1) <(sort -k 1
file2)
} | sort -k 1
Result: only outputs a similar version of the original file2
I attempted to reconfigure existing awk solutions that were posted here as well:
awk 'BEGIN {FS=OFS="\t"} NR==FNR {v[$3]=$2; next} {print $1, (v[$3] ?
v[$3] : 0)}' file1 file2 > file3
awk 'BEGIN {FS=OFS="\t"} NR==FNR {v[$1]=$2; next} {print $1, (v[$1] ?
v[$1] : 0)}' file1 file2 > file3
Result: both of these awk commands result in the output of file2's
data with nothing from file1 included (or so it seems).
awk -F '
FNR == NR {
time[$3]
next
}
{ for(i in time)
if(index($3, i) == 1) {
print
next
}
}' file1 file2 > file3
Result: keeps returning a syntax error regarding the "." of ".txt"
I looked into integrating some sort of regex or split command to the script... but was confused as to how to proceed and didn't come up with anything of substance.
Sample Data
$ cat file1.txt
125051.354948 058712.429
125052.352475 058959.934
125054.354322 058842.619
125055.352671 058772.045
125057.351794 058707.281
125058.352678 058758.959
$ cat file2.txt
125050.105886 4413.34358 07629.87620
125051.112784 4413.34369 07629.87606
125052.100811 4413.34371 07629.87605
125053.097826 4413.34373 07629.87603
125054.107361 4413.34373 07629.87605
125055.107038 4413.34375 07629.87604
125056.093783 4413.34377 07629.87602
125057.097928 4413.34378 07629.87603
125058.098475 4413.34378 07629.87606
125059.095787 4413.34376 07629.87602
Expected Result:
(Format: Column1File1 Column1File2 Column2File1 Column2File2 Column3File2)
$ cat file3.txt
125051.354948 125051.112784 058712.429 4413.34358 07629.87620
125052.352475 125052.100811 058959.934 4413.34371 07629.87605
125054.354322 125054.107361 058842.619 4413.34373 07629.87605
125055.352671 125055.107038 058772.045 4413.34375 07629.87604
125057.351794 125057.097928 058707.281 4413.34378 07629.87603
125058.352678 125058.098475 058758.959 4413.34378 07629.87606
As shown, not every data point from each file will find a match. Only pairs of lines that have the most proximal timestamps to one another will be written over to the new file
As previously mentioned, current solutions result in file3 being entirely blank, or just containing information from one of the two files (but not both)

Please try the following:
awk '
# find the closest element in "a" to val and return the index
function binsearch(a, val, len,
low, high, mid) {
if (val < a[1])
return 1
if (val > a[len])
return len
low = 1
high = len
while (low <= high) {
mid = int((low + high) / 2)
if (val < a[mid])
high = mid - 1
else if (val > a[mid])
low = mid + 1
else
return mid
}
return (val - a[low]) < (a[high] - val) ? high : low
}
NR == FNR {
time[FNR] = $1
position[FNR] = $2
intensity[FNR] = $3
len++
next
}
{
i = binsearch(time, $1, len)
print $1 " " time[i] " " $2 " " position[i] " " intensity[i]
}
' file2.txt file1.txt
Result:
125051.354948 125051.112784 058712.429 4413.34369 07629.87606
125052.352475 125052.100811 058959.934 4413.34371 07629.87605
125054.354322 125054.107361 058842.619 4413.34373 07629.87605
125055.352671 125055.107038 058772.045 4413.34375 07629.87604
125057.351794 125057.097928 058707.281 4413.34378 07629.87603
125058.352678 125058.098475 058758.959 4413.34378 07629.87606
Note that the 4th and 5th values in your expected result may be wrongly copy-and-pasted.
[How it works]
The key is the binsearch function which finds the closest value in the
array and returns the index to the array. I would not mention about
the algorithm in detail because it is a common "binary search" technique.

#!/bin/bash
if [[ $# -lt 2 ]]; then
echo "wrong args, it should be $0 file1 file2"
exit 0
fi
# clear blanks, add an extra column 'm' to file1, merge file1, file2, sort
{ awk 'NF{print $0, "m"}' "$1" ; awk 'NF' "$2"; } | sort -nk1,1 | \
\
awk '# record lines and fields in to a
{a[NR] = $0; a[NR,1] = $1; a[NR,2] = $2; a[NR,3] = $3}
END{
for(i=1; i<= NR; ++i){
# 3rd filed of file1 is "m"
if(a[i, 3] == "m"){
# get difference of column1 between current record ,previous record, next record
prevDiff = (i-1) in a && a[i-1,3] == "m" ? -1 : a[i,1] - a[i-1,1]
nextDiff = (i+1) in a && a[i+1,3] == "m" ? -1 : a[i+1,1] - a[i,1]
# compare differences, choose the close one and print.
if(prevDiff !=-1 && (nextVal == -1 || prevDiff < nextDiff))
print a[i,1], a[i-1, 1], a[i, 2], a[i-1, 2], a[i-1, 3]
else if(nextDiff !=-1 && (prevDiff == -1 || nextDiff < prevDiff))
print a[i,1], a[i+1, 1], a[i, 2], a[i+1, 2], a[i+1, 3]
else
print a[i]
}
}
}'
Out put of { awk 'NF{print $0, "m"}' "$1" ; awk 'NF' "$2"; } | sort -nk1,1 is:
125050.105886 4413.34358 07629.87620
125051.112784 4413.34369 07629.87606
125051.354948 058712.429 m
125052.100811 4413.34371 07629.87605
125052.352475 058959.934 m
125053.097826 4413.34373 07629.87603
125054.107361 4413.34373 07629.87605
125054.354322 058842.619 m
125055.107038 4413.34375 07629.87604
125055.352671 058772.045 m
125056.093783 4413.34377 07629.87602
125057.097928 4413.34378 07629.87603
125057.351794 058707.281 m
125058.098475 4413.34378 07629.87606
125058.352678 058758.959 m
125059.095787 4413.34376 07629.87602

Related

UNIX group by two values

I have a file with the following lines (values are separated by ";"):
dev_name;dev_type;soft
name1;ASR1;11.1
name2;ASR1;12.2
name3;ASR1;11.1
name4;ASR3;15.1
I know how to group them by one value, like count of all ASRx, but how can I group it by two values, as for example:
ASR1
*11.1 - 2
*12.2 - 1
ASR3
*15.1 - 1
another awk
$ awk -F';' 'NR>1 {a[$2]; b[$3]; c[$2,$3]++}
END {for(k in a) {print k;
for(p in b)
if(c[k,p]) print "\t*"p,"-",c[k,p]}}' file
ASR1
*11.1 - 2
*12.2 - 1
ASR3
*15.1 - 1
$ cat tst.awk
BEGIN { FS=";"; OFS=" - " }
NR==1 { next }
$2 != prev { prt(); prev=$2 }
{ cnt[$3]++ }
END { prt() }
function prt( soft) {
if ( prev != "" ) {
print prev
for (soft in cnt) {
print " *" soft, cnt[soft]
}
delete cnt
}
}
$ awk -f tst.awk file
ASR1
*11.1 - 2
*12.2 - 1
ASR3
*15.1 - 1
Or if you like pipes....
$ tail +2 file | cut -d';' -f2- | sort | uniq -c |
awk -F'[ ;]+' '{print ($3!=prev ? $3 ORS : "") " *" $4 " - " $2; prev=$3}'
ASR1
*11.1 - 2
*12.2 - 1
ASR3
*15.1 - 1
try something like
awk -F ';' '
NR==1{next}
{aRaw[$2"-"$3]++}
END {
asorti( aRaw, aVal)
for( Val in aVal) {
split( aVal [Val], aTmp, /-/ )
if ( aTmp[1] != Last ) { Last = aTmp[1]; print Last }
print " " aTmp[2] " " aRaw[ aVal[ Val] ]
}
}
' YourFile
key here is to use 2 field in a array. The END part is more difficult to present the value than the content itself
Using Perl
$ cat bykub.txt
dev_name;dev_type;soft
name1;ASR1;11.1
name2;ASR1;12.2
name3;ASR1;11.1
name4;ASR3;15.1
$ perl -F";" -lane ' $kv{$F[1]}{$F[2]}++ if $.>1;END { while(($x,$y) = each(%kv)) { print $x;while(($p,$q) = each(%$y)){ print "\t\*$p - $q" }}}' bykub.txt
ASR1
*11.1 - 2
*12.2 - 1
ASR3
*15.1 - 1
$
Yet Another Solution, this one using the always useful GNU datamash to count the groups:
$ datamash -t ';' --header-in -sg 2,3 count 3 < input.txt |
awk -F';' '$1 != curr { curr = $1; print $1 } { print "\t*" $2 " - " $3 }'
ASR1
*11.1 - 2
*12.2 - 1
ASR3
*15.1 - 1
I don't want to encourage lazy questions, but I wrote a solution, and I'm sure someone can point out improvements. I love posting answers on this site because I learn so much. :)
One binary subcall to sort, otherwise all built-in processing. That means using read, which is slow. If your file is large, I'd recommend rewriting the loop in awk or perl, but this will get the job done.
sed 1d groups | # strip the header
sort -t';' -k2,3 > group.srt # pre-sort to collect groupings
declare -i ctr=0 # initialize integer record counter
IFS=';' read x lastA lastB < group.srt # priming read for comparators
printf "$lastA\n\t*$lastB - " # priming print (assumes at least one record)
while IFS=';' read x a b # loop through the file
do if [[ "$lastA" < "$a" ]] # on every MAJOR change
then printf "$ctr\n$a\n\t*$b - " # print total, new MAJOR header and MINOR header
lastA="$a" # update the MAJOR comparator
lastB="$b" # update the MINOR comparator
ctr=1 # reset the counter
elif [[ "$lastB" < "$b" ]] # on every MINOR change
then printf "$ctr\n\t*$b - " # print total and MINOR header
ctr=1 # reset the counter
else (( ctr++ )) # otherwise increment
fi
done < group.srt # feed read from sorted file
printf "$ctr\n" # print final group total at EOF

How to grep for multiple word occurrences from multiple files and list them grouped as rows and columns

Hello: Need your help to count word occurrences from multiple files and output them as row and columns. I searched the site for a similar reference but could not locate, hence posting it here.
Setup:
I have 2 files with the following
[a.log]
id,status
1,new
2,old
3,old
4,old
5,old
[b.log]
id,status
1,new
2,old
3,new
4,old
5,new
Results required
The result i require using the command line only is (preferably):
file count(new) count(old)
a.log 1 4
b.log 3 2
Script
The script below provides me the count for a single word across multiple.
I am stuck trying to get results for multiple words. Please help.
grep -cw "old" *.log
You can get this output using gnu-awk that accepts comma separated word to be searched in a command line argument:
awk -v OFS='\t' -F, -v wrds='new,old' 'BEGIN{n=split(wrds, a, /,/); for(i=1; i<=n; i++) b[a[i]]=a[i]} FNR==1{next} $2 in b{freq[FILENAME][$2]++} END{printf "%s", "file" OFS; for(i=1; i<=n; i++) printf "count(%s)%s", a[i], (i==n?ORS:OFS); for(f in freq) {printf "%s", f OFS; for(i=1; i<=n; i++) printf "%s%s", freq[f][a[i]], (i==n?ORS:OFS)}}' a.log b.log | column -t
Output:
file count(new) count(old)
a.log 1 4
b.log 3 2
PS: column -t was only used for formatting the output in tabular format.
Readable awk:
awk -v OFS='\t' -F, -v wrds='new,old' 'BEGIN {
n = split(wrds, a, /,/) # split input words list by comma with int index
for(i=1; i<=n; i++) # store words in another array with key as words
b[a[i]]=a[i]
}
FNR==1 {
next # skip first row from all the files
}
$2 in b {
freq[FILENAME][$2]++ # store filename and word frequency in 2-dimesional array
}
END { # print formatted result
printf "%s", "file" OFS
for(i=1; i<=n; i++)
printf "count(%s)%s", a[i], (i==n?ORS:OFS)
for(f in freq) {
printf "%s", f OFS
for(i=1; i<=n; i++)
printf "%s%s", freq[f][a[i]], (i==n?ORS:OFS)
}
}' a.log b.log
I think you're looking for something like this, but it's not overly clear what your objectives are (if you're going for efficiency for example, this isn't overly efficient)...
for file in *.log; do
echo -n "${file}\t"
for word in "new" "old"; do
grep -cw $word $file;
echo -n "\t";
done
echo;
done
(for readability, I simplified the first line, but this doesn't work if there's spaces in the filenames -- the proper solution is to change the first line to read find . -iname "*.log" -maxdepth=1 | while read file; do)
for c in a b ; do egrep -o "new|old" $c.log | sort | uniq -c > $c.luc; done
Get rid of the headlines with grep, then sort and count.
join -1 2 -2 2 a.luc b.luc
> new 1 3
> old 4 2
Placing a new header is left as an exercise for the reader. Is there a flip command for unix/linux/bash to flip a table, or how would you say?
Handling empty cells is left as an exercise too, but possible with join.
without real multi-dim array support, this will count all values in field 2, not just "new/old". The header and number of columns are dynamic with number of distinct values as well.
$ awk -F, 'NR==1 {fs["file"]}
FNR>1 {c[FILENAME,$2]++; fs[FILENAME]; ks[$2];
c["file",$2]="count("$2")"}
END {for(f in fs)
{printf "%s", f;
for(k in ks) printf "%s", OFS c[f,k];
printf "\n"}}' file{1,2} | column -t
file count(new) count(old)
file1 1 4
file2 3 2
Awk solution:
awk 'BEGIN{
FS=","; OFS="\t"; print "file","count(new)","count(old)";
f1=ARGV[1]; f2=ARGV[2] # get filenames
}
FNR==1{ next } # skip the 1st header line
NR==FNR{ c1[$2]++; next } # accumulate occurrences of the 2nd field in 1st file
{ c2[$2]++ } # accumulate occurrences of the 2nd field in 2nd file
END{
print f1, c1["new"], c1["old"];
print f2, c2["new"], c2["old"]
}' a.log b.log
The output:
file count(new) count(old)
a.log 1 4
b.log 3 2

Bash: extract columns with cut and filter one column further

I have a tab-separated file and want to extract a few columns with cut.
Two example line
(...)
0 0 1 0 AB=1,2,3;CD=4,5,6;EF=7,8,9 0 0
1 1 0 0 AB=2,1,3;CD=1,1,2;EF=5,3,4 0 1
(...)
What I want to achieve is to select columns 2,3,5 and 7, however from column 5 only CD=4,5,6.
So my expected result is
0 1 CD=4,5,6; 0
1 0 CD=1,1,2; 1
How can I use cut for this problem and run grep on one of the extracted columns? Any other one-liner is of course also fine.
here is another awk
$ awk -F'\t|;' -v OFS='\t' '{print $2,$3,$6,$NF}' file
0 1 CD=4,5,6 0
1 0 CD=1,1,2 1
or with cut/paste
$ paste <(cut -f2,3 file) <(cut -d';' -f2 file) <(cut -f7 file)
0 1 CD=4,5,6 0
1 0 CD=1,1,2 1
Easier done with awk. Split the 5th field using ; as the separator, and then print the second subfield.
awk 'BEGIN {FS="\t"; OFS="\t"}
{split($5, a, ";"); print $2, $3, a[2]";", $7 }' inputfile > outputfile
If you want to print whichever subfield begins with CD=, use a loop:
awk 'BEGIN {FS="\t"; OFS="\t"}
{n = split($5, a, ";");
for (i = 1; i <= n; i++) {
if (a[i] ~ /^CD=/) subfield = a[i];
}
print $2, $3, subfield";", $7}' < inputfile > outputfile
I think awk is the best tool for this kind of task and the other two answers give you good short solutions.
I want to point out that you can use awk's built-in splitting facility to gain more flexibility when parsing input. Here is an example script that uses implicit splitting:
parse.awk
# Remember second, third and seventh columns
{
a = $2
b = $3
d = $7
}
# Split the fifth column on ";". After this the positional variables
# (e.g. $1, # $2, ..., $NF) contain the fields from the previous
# fifth column
{
oldFS = FS
FS = ";"
$0 = $5
}
# For example to test if the second elemnt starts with "CD", do
# something like this
$2 ~ /^CD/ {
c = $2
}
# Print the selected elements
{
print a, b, c, d
}
# Restore FS
{
FS = oldFS
}
Run it like this:
awk -f parse.awk FS='\t' OFS='\t' infile
Output:
0 1 CD=4,5,6 0
1 0 CD=1,1,2 1

Find a number of a file in a range of numbers of another file

I have this two input files:
file1
1 982444
1 46658343
3 15498261
2 238295146
21 47423507
X 110961739
17 7490379
13 31850803
13 31850989
file2
1 982400 982480
1 46658345 46658350
2 14 109
2 5000 9000
2 238295000 238295560
X 110961739 120000000
17 7490200 8900005
And this is my desired output:
Desired output:
1 982444
2 238295146
X 110961739
17 7490379
This is what I want: Find the column 1 element of file1 in column 1 of file2. If the number is the same, take the number of column 2 of file1 and check if it is included in the range of numbers of column2 and 3 of file2. If it is included, print the line of file1 in the output.
Maybe is a little confusing to understand, but I'm doing my best. I have tried some things but I'm far away from the solution and any help will be really appreciated. In bash, awk or perl please.
Thanks in advance,
Just using awk. The solution doesn't loop through file1 repeatedly.
#!/usr/bin/awk -f
NR == FNR {
# I'm processing file2 since NR still matches FNR
# I'd store the ranges from it on a[] and b[]
# x[] acts as a counter to the number of range pairs stored that's specific to $1
i = ++x[$1]
a[$1, i] = $2
b[$1, i] = $3
# Skip to next record; Do not allow the next block to process a record from file2.
next
}
{
# I'm processing file1 since NR is already greater than FNR
# Let's get the index for the last range first then go down until we reach 0.
# Nothing would happen as well if i evaluates to nothing i.e. $1 doesn't have a range for it.
for (i = x[$1]; i; --i) {
if ($2 >= a[$1, i] && $2 <= b[$1, i]) {
# I find that $2 is within range. Now print it.
print
# We're done so let's skip to the next record.
next
}
}
}
Usage:
awk -f script.awk file2 file1
Output:
1 982444
2 238295146
X 110961739
17 7490379
A similar approach using Bash (version 4.0 or newer):
#!/bin/bash
FILE1=$1 FILE2=$2
declare -A A B X
while read F1 F2 F3; do
(( I = ++X[$F1] ))
A["$F1|$I"]=$F2
B["$F1|$I"]=$F3
done < "$FILE2"
while read -r LINE; do
read F1 F2 <<< "$LINE"
for (( I = X[$F1]; I; --I )); do
if (( F2 >= A["$F1|$I"] && F2 <= B["$F1|$I"] )); then
echo "$LINE"
continue
fi
done
done < "$FILE1"
Usage:
bash script.sh file1 file2
Let's mix bash and awk:
while read col min max
do
awk -v col=$col -v min=$min -v max=$max '$1==col && min<=$2 && $2<=max' f1
done < f2
Explanation
For each line of file2, read the min and the max, together with the value of the first column.
Given these values, check in file1 for those lines having same first column and being 2nd column in the range specified by file 2.
Test
$ while read col min max; do awk -v col=$col -v min=$min -v max=$max '$1==col && min<=$2 && $2<=max' f1; done < f2
1 982444
2 238295146
X 110961739
17 7490379
Pure bash , based on Fedorqui solution:
#!/bin/bash
while read col_2 min max
do
while read col_1 val
do
(( col_1 == col_2 && ( min <= val && val <= max ) )) && echo $col_1 $val
done < file1
done < file2
cut -d' ' -f1 input2 | sed 's/^/^/;s/$/\\s/' | \
grep -f - <(cat input2 input1) | sort -n -k1 -k3 | \
awk 'NF==3 {
split(a,b,",");
for (v in b)
if ($2 <= b[v] && $3 >= b[v])
print $1, b[v];
if ($1 != p) a=""}
NF==2 {p=$1;a=a","$2}'
Produces:
X 110961739
1 982444
2 238295146
17 7490379
Here's a Perl solution. It could be much faster but less concise if I built a hash out of file2, but this should be fine.
use strict;
use warnings;
use autodie;
my #bounds = do {
open my $fh, '<', 'file2';
map [ split ], <$fh>;
};
open my $fh, '<', 'file1';
while (my $line = <$fh>) {
my ($key, $val) = split ' ', $line;
for my $bound (#bounds) {
next unless $key eq $bound->[0] and $val >= $bound->[1] and $val <= $bound->[2];
print $line;
last;
}
}
output
1 982444
2 238295146
X 110961739
17 7490379

Comparison shell script for large text/csv files - improvement needed

My task is the following - I have two CSV files:
File 1 (~9.000.000 records):
type(text),number,status(text),serial(number),data1,data2,data3
File 2 (~6000 records):
serial_range_start(number),serial_range_end(number),info1(text),info2(text)
The goal is to add to each entry in File 1 the corresponding info1 and info2 from File 2:
type(text),number,status(text),serial(number),data1,data2,data3,info1(text),info2(text)
I use the following script:
#!/bin/bash
USER="file1.csv"
RANGE="file2.csv"
for SN in `cat $USER | awk -F , '{print $4}'`
do
#echo \n "$SN"
for LINE in `cat $RANGE`
do
i=`grep $LINE $RANGE| awk -F, '{print $1}'`
#echo \n "i= " "$i"
j=`grep $LINE $RANGE| awk -F, '{print $2}'`
#echo \n "j= " "$j"
k=`echo $SN`
#echo \n "k= " "$k"
if [ $k -ge $i -a $k -le $j ]
then
echo `grep $SN $USER`,`grep $i $RANGE| cut -d',' -f3-4` >> result.csv
break
#else
#echo `grep $SN $USER`,`echo 'N/A','N/A'` >> result.csv
fi
done
done
The script works rather good on small files but I'm sure there is a way to optimize it because I am running it on an i5 laptop with 4GB of RAM.
I am a newbie in shell scripting and I came up with this script after hours and hours of research, trial and error but now I am out of ideas.
note: not all the info in file 1 can be found in file.
Thank you!
Adrian.
FILE EXAMPLES and additional info:
File 1 example:
prep,28620026059,Active,123452010988759,No,No,No
post,28619823474,Active,123453458466109,Yes,No,No
post,28619823474,Inactive,123453395270941,Yes,Yes,Yes
File 2 example:
123452010988750,123452010988759,promo32,1.11
123453458466100,123453458466199,promo64,2.22
123450000000000,123450000000010,standard128,3.333
Result example (currently):
prep,28620026059,Active,123452010988759,No,No,No,promo32,1.11
post,28619823474,Active,123453458466109,Yes,No,No,promo64,2.22
Result example (nice to have):
prep,28620026059,Active,123452010988759,No,No,No,promo32,1.11
post,28619823474,Active,123453458466109,Yes,No,No,promo64,2.22
post,28619823474,Inactive,123453395270941,Yes,Yes,Yes,NA,NA
File 1 is sorted after the 4th column
File 2 is sorted after the first column.
File 2 does not have ranges that overlap
Not all the info in file 1 can be found in a range in file 2
Thanks again!
LE:
The script provided by Jonathan seems to have an issue on some records, as follows:
file 2:
123456780737000,123456780737012,ONE 32,1.11
123456780016000,123456780025999,ONE 64,2.22
file 1:
Postpaid,24987326427,Active,123456780737009,Yes,Yes,Yes
Postpaid,54234564719,Active,123456780017674,Yes,Yes,Yes
The output is the following:
Postpaid,24987326427,Active,123456780737009,Yes,Yes,Yes,ONE 32,1.11
Postpaid,54234564719,Active,123456780017674,Yes,Yes,Yes,ONE 32,1.11
and it should be:
Postpaid,24987326427,Active,123456780737009,Yes,Yes,Yes,ONE 32,1.11
Postpaid,54234564719,Active,123456780017674,Yes,Yes,Yes,ONE 64,2.22
It seems that it returns 0 and writes the info on first record from file2...
I think this will work reasonably well:
awk -F, 'BEGIN { n = 0; OFS = ","; }
NR==FNR { lo[n] = $1; hi[n] = $2; i1[n] = $3; i2[n] = $4; n++ }
NR!=FNR { for (i = 0; i < n; i++)
{
if ($4 >= lo[i] && $4 <= hi[i])
{
print $1, $2, $3, $4, $5, $6, $7, i1[i], i2[i];
break;
}
}
}' file2 file1
Given file2 containing:
1,10,xyz,pqr
11,20,abc,def
21,30,ambidextrous,warthog
and file1 containing:
A,123,X2,1,data01_1,data01_2,data01_3
A,123,X2,2,data02_1,data02_2,data02_3
A,123,X2,3,data03_1,data03_2,data03_3
A,123,X2,4,data04_1,data04_2,data04_3
A,123,X2,5,data05_1,data05_2,data05_3
A,123,X2,6,data06_1,data06_2,data06_3
A,123,X2,7,data07_1,data07_2,data07_3
A,123,X2,8,data08_1,data08_2,data08_3
A,123,X2,9,data09_1,data09_2,data09_3
A,123,X2,10,data10_1,data10_2,data10_3
A,123,X2,11,data11_1,data11_2,data11_3
A,123,X2,12,data12_1,data12_2,data12_3
A,123,X2,13,data13_1,data13_2,data13_3
A,123,X2,14,data14_1,data14_2,data14_3
A,123,X2,15,data15_1,data15_2,data15_3
A,123,X2,16,data16_1,data16_2,data16_3
A,123,X2,17,data17_1,data17_2,data17_3
A,123,X2,18,data18_1,data18_2,data18_3
A,123,X2,19,data19_1,data19_2,data19_3
A,223,X2,20,data20_1,data20_2,data20_3
A,223,X2,21,data21_1,data21_2,data21_3
A,223,X2,22,data22_1,data22_2,data22_3
A,223,X2,23,data23_1,data23_2,data23_3
A,223,X2,24,data24_1,data24_2,data24_3
A,223,X2,25,data25_1,data25_2,data25_3
A,223,X2,26,data26_1,data26_2,data26_3
A,223,X2,27,data27_1,data27_2,data27_3
A,223,X2,28,data28_1,data28_2,data28_3
A,223,X2,29,data29_1,data29_2,data29_3
the output of the command is:
A,123,X2,1,data01_1,data01_2,data01_3,xyz,pqr
A,123,X2,2,data02_1,data02_2,data02_3,xyz,pqr
A,123,X2,3,data03_1,data03_2,data03_3,xyz,pqr
A,123,X2,4,data04_1,data04_2,data04_3,xyz,pqr
A,123,X2,5,data05_1,data05_2,data05_3,xyz,pqr
A,123,X2,6,data06_1,data06_2,data06_3,xyz,pqr
A,123,X2,7,data07_1,data07_2,data07_3,xyz,pqr
A,123,X2,8,data08_1,data08_2,data08_3,xyz,pqr
A,123,X2,9,data09_1,data09_2,data09_3,xyz,pqr
A,123,X2,10,data10_1,data10_2,data10_3,xyz,pqr
A,123,X2,11,data11_1,data11_2,data11_3,abc,def
A,123,X2,12,data12_1,data12_2,data12_3,abc,def
A,123,X2,13,data13_1,data13_2,data13_3,abc,def
A,123,X2,14,data14_1,data14_2,data14_3,abc,def
A,123,X2,15,data15_1,data15_2,data15_3,abc,def
A,123,X2,16,data16_1,data16_2,data16_3,abc,def
A,123,X2,17,data17_1,data17_2,data17_3,abc,def
A,123,X2,18,data18_1,data18_2,data18_3,abc,def
A,123,X2,19,data19_1,data19_2,data19_3,abc,def
A,223,X2,20,data20_1,data20_2,data20_3,abc,def
A,223,X2,21,data21_1,data21_2,data21_3,ambidextrous,warthog
A,223,X2,22,data22_1,data22_2,data22_3,ambidextrous,warthog
A,223,X2,23,data23_1,data23_2,data23_3,ambidextrous,warthog
A,223,X2,24,data24_1,data24_2,data24_3,ambidextrous,warthog
A,223,X2,25,data25_1,data25_2,data25_3,ambidextrous,warthog
A,223,X2,26,data26_1,data26_2,data26_3,ambidextrous,warthog
A,223,X2,27,data27_1,data27_2,data27_3,ambidextrous,warthog
A,223,X2,28,data28_1,data28_2,data28_3,ambidextrous,warthog
A,223,X2,29,data29_1,data29_2,data29_3,ambidextrous,warthog
This uses a linear search on the list of ranges; you can write functions in awk and a binary search looking for the correct range would perform better on 6,000 entries. That part, though, is an optimization — exercise for the reader. Remember that the first rule of optimization is: don't. The second rule of optimization (for experts only) is: don't do it yet. Demonstrate that it is a problem. This code shouldn't take all that much longer than the time it takes to copy the 9,000,000 record file (somewhat longer, but not disastrously so). Note, though, that if the file1 data is sorted, the tail of the processing will take longer than the start because of the linear search. If the serial numbers are in a random order, then it will all take about the same time on average.
If your CSV data has commas embedded in the text fields, then awk is no longer suitable; you need a tool with explicit support for CSV format — Perl and Python both have suitable modules.
Answer to Exercise for the Reader
awk -F, 'BEGIN { n = 0; OFS = ","; }
NR==FNR { lo[n] = $1; hi[n] = $2; i1[n] = $3; i2[n] = $4; n++ }
NR!=FNR { i = search($4)
print $1, $2, $3, $4, $5, $6, $7, i1[i], i2[i];
}
function search(i, l, h, m)
{
l = 0; h = n - 1;
while (l <= h)
{
m = int((l + h)/2);
if (i >= lo[m] && i <= hi[m])
return m;
else if (i < lo[m])
h = m - 1;
else
l = m + 1;
}
return 0; # Should not get here
}' file2 file1
Not all that hard to write the binary search. This gives the same result as the original script on the sample data. It has not been exhaustively tested, but appears to work.
Note that the code does not really handle missing ranges in file2; it assumes that the ranges are contiguous but non-overlapping and in sorted order and cover all the values that can appear in the serial column of file1. If those assumptions are not valid, you get erratic behaviour until you fix either the code or the data.
In Unix you may use the join command (type 'man join' for more information) which can be configured to work similarly to join operation in databases. That may help you to add the information from File2 to File1.

Resources