Find nearest point from file1 in file2, shell script - bash

I have 2 files:
file1
-3241.42 633.261 1210.53
-1110.89 735.349 836.635
(these are the points I am looking for, with coordinates x,y,z)
file2
2014124 -2277.576 742.75 962.5816 0 0
2036599 -3236.882 638.748 1207.804 0 0
2036600 -3242.417 635.2612 1212.527 0 0
2036601 -3248.006 631.6553 1217.297 0 0
2095885 -1141.905 737.7666 843.3465 0 0
2095886 -1111.889 738.3486 833.6354 0 0
2095887 -1172.227 737.4004 853.9965 0 0
2477149 -3060.679 488.6802 1367.816 0 0
2477150 -3068.369 489.6621 1365.769 0 0
and so on
(these are the points from my model, with ID, x, y, z, 0, 0)
I am looking for such a result: (find the point IDs with nearest coordinates)
Output
2036600 , xyz= -3242.42, 635.261, 1212.53, dist= 3.00
2095886 , xyz= -1111.89, 738.349, 833.635, dist= 4.36
My algorithm would look like this:
For each line in file1, catch x1,y1,z1
Search file2 for the nearest point, i.e. the one for which dist = sqrt((x1-x2)**2+(y1-y2)**2+(z1-z2)**2) is minimal
Display the result with pointID, xyz = x2, y2, z2, dist= dist
I tried to adapt a script found here, but it prints too many lines
#!/bin/bash
(($#!=2))&& { echo "Usage $0 1st_file 2nd_file"; exit 1; }
awk '
BEGIN {p=fx=0; fn=""; maxd=1.1e11;}
$0~"[^0-9. \t]" || NF!=4 && NF!=3 {next;} # skip no data lines
fn!=FILENAME {fx++; fn=FILENAME;} # fx: which file
fx==1 { if(NF!=3){printf("Change the series of two input files\n"); exit 1;}
x1[p]=$1; y1[p]=$2; z1[p]=$3;next;} # save the columns of first file
fx==2 { mv=maxd; mp=0; # search minimal distance
for(i=0; i<p; i++){
dx=x1[i]-$2; dy=y1[i]-$3; dz=z1[i]-$4; dist=sqrt(dx*dx+dy*dy+dz*dz);
if(dd<mv){mv=dd; mp=i;} # min value & min place
}
printf("%3d %6.2f %6.2f %3d\n", $1, x1[mp], y1[mp], z1[mp], dist);
}
' file1.dat file2.dat
Thank you very much!

$ cat tst.awk
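BEGIN { OFS=", " }
NR==FNR {
    points[++n] = $0            # first file (file2): ID x y z ...
    next
}
{
    # Loop by index: "for (i in points)" visits keys in an unspecified
    # order, so the (i == 1) initialisation below could run too late.
    for (i=1; i<=n; i++) {
        split(points[i],coords)
        dist = ($1 - coords[2])^2 + \
               ($2 - coords[3])^2 + \
               ($3 - coords[4])^2
        if ( (i == 1) || (dist <= min) ) {
            min = dist
            point = points[i]
        }
    }
    split(point,p)
    print p[1] " ", "xyz= " p[2], p[3], p[4], "dist= " sqrt(min)
}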
$ awk -f tst.awk file2 file1
2036600 , xyz= -3242.417, 635.2612, 1212.527, dist= 2.99713
2095886 , xyz= -1111.889, 738.3486, 833.6354, dist= 4.35812
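To get the rounding shown in the question's desired output, the same search could be wrapped in a shell function that formats with printf. This is only a sketch: the function name nearest_points and the %.6g / %.2f formats are my own choices, not part of the answer above.

```shell
# Hypothetical helper: nearest_points MODEL_FILE QUERY_FILE
# MODEL_FILE lines: ID x y z ...    QUERY_FILE lines: x y z
nearest_points() {
    awk '
    NR == FNR {                       # first file: remember every model point
        if (NF >= 4) { id[++n] = $1; x[n] = $2; y[n] = $3; z[n] = $4 }
        next
    }
    NF >= 3 && n > 0 {                # second file: one query point per line
        best = 1; min = -1            # min < 0 means "nothing compared yet"
        for (i = 1; i <= n; i++) {
            d = ($1 - x[i])^2 + ($2 - y[i])^2 + ($3 - z[i])^2
            if (min < 0 || d < min) { min = d; best = i }
        }
        printf "%s , xyz= %.6g, %.6g, %.6g, dist= %.2f\n",
               id[best], x[best], y[best], z[best], sqrt(min)
    }' "$1" "$2"
}
```

It would be called as nearest_points file2 file1, with the model file first, just like tst.awk.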

Related

How to write a script that searches for numeric pattern in huge file?

I have 200000 integers written in a file like this
0
1
2
3
.
98
99
.
.
100
101
102
.
I want to write a script with awk or join that would tell me how many times this pattern (from 0 to 99) repeats itself.
Not battle tested:
awk 'i++!=$0{i=$0==0?1:0}i==100{c++;i=0}END{print c}' p.txt
Breakdown:
i++ != $0 {        # Use a cursor (i) which is compared to the input
    i=$0==0?1:0;   # If not matched, reset the cursor: if the current line is zero, set it
                   # to 1, because we have then already matched the first line. Otherwise set it to 0
}
i == 100 {         # If the full pattern was found:
    c++;           # add to count
    i=0;           # reset cursor
}
END {print c}      # Print matched count
You can do this using a state variable which is reset anytime the pattern is incomplete. For example:
#!/usr/bin/awk -f
BEGIN {
state = -1;
count = 0;
}
/^[0-9]+$/ {
if ( $0 == ( state + 1 ) || $0 == 0 ) {
state = $0;
if ( state == 99 ) {
count++;
}
} else {
state = -1;
}
next;
}
{ state = -1; next; }
END {
print count;
}
This script assumes awk is in /usr/bin (the usual case). You would put the script in a file, e.g., "patterns", and run it like
./patterns < p.txt
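Either approach can be sanity-checked on a small generated input before running it on the real file. The sketch below wraps the cursor logic of the one-liner in a shell function (count_runs is a made-up name) and prints 0 rather than an empty line when nothing matches.

```shell
# Hypothetical wrapper around the cursor idea from the one-liner above.
# Reads one value per line on stdin and prints the number of complete
# 0..99 runs.
count_runs() {
    awk '
    i++ != $0 { i = ($0 == 0) ? 1 : 0 }   # mismatch: restart; a 0 already counts as the first match
    i == 100  { c++; i = 0 }              # saw 0..99 in order
    END       { print c + 0 }             # c + 0 forces a number even if c was never set
    '
}
```

For example, { seq 0 99; echo .; seq 0 50; seq 0 99; } | count_runs feeds it two complete runs and one broken one.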

{awk} How to read a line and compare a $ with its next/previous line?

The command below is used to read an input file containing 7682 lines:
I use --field-separator, then convert some fields into what I need, and the grep gets rid of the first 2 lines, which I do not need.
awk --field-separator=";" '($1<15) {print int(a=(($1-1)/480)+1) " " ($1-((int(a)-1)*480)) " " (20*log($6)/log(10))}' 218_DW.txt | grep -v "0 480 -inf"
I used ($1<15) so that I only print 14 lines, better for testing. The output I get is exactly what I want, but, there is more I need to do on that:
1 1 48.2872
1 2 48.3021
1 3 48.1691
1 4 48.1502
1 5 48.1564
1 6 48.1237
1 7 48.1048
1 8 48.015
1 9 48.0646
1 10 47.9472
1 11 47.8469
1 12 47.8212
1 13 47.8616
1 14 47.8047
From above, $1 will increment from 1-16, $2 from 1-480, it's always continuous,
so when it gets to 16 480 47.8616 it restarts from 2 1 47.8616 until last line is 16 480 10.2156
So I get 16*480=7680 lines
What I want to do is simple, but, I don't get it :)
I want to compare the current line with the next one. But not all fields, only $3, it's a value in dB that decreases when $2 increases.
For example:
The current line is 1 1 48.2872=a
Next line is 1 2 48.3021=b
If [ (a - b) > 6 ] then print $1 $2 $3
Of course (a - b) has got to be an absolute value, always > 0.
The best would be to be able to compare the current line (the $3 only) with both its next and its previous line ($3).
Something like this:
1 3 48.1691=a
1 4 48.1502=b
1 5 48.1564=c
If [ ABS(b - a) > 6 ] OR If [ ABS(b - c) > 6 ] then print $1 $2 $3
But of course first line can only be compared with its next one and the last one with its previous one. Is it possible?
Try this:
#!/usr/bin/awk -f
function abs(x) {
if (x >= 0)
return x;
else
return -1 * x;
}
function compare(a,b) {
return abs(a - b) > 6;
}
function update() {
before_value = current_value;
current_line = $0;
current_value = $3;
}
BEGIN {
line_n = 1;
}
#Edit: added to skip blank lines and differently formatted lines in
# general. You could add some error message and/or exit function
# here to detect badly formatted data.
NF != 3 {
next;
}
line_n == 1 {
update();
line_n += 1;
next;
}
line_n == 2 {
if (compare(current_value, $3))
print current_line;
update();
line_n += 1;
next;
}
{
# The question asks for a jump against the previous OR the next line
if (compare(current_value, before_value) || compare(current_value, $3))
print current_line;
update();
}
END {
if (compare(current_value, before_value)) {
print current_line;
}
}
The funny thing is that I had this code lying around from an old project where I had to do basically the same thing. I adapted it a little for you. I think it solves your problem (as I understood it, at least). If it doesn't, it should point you in the right direction.
Instructions to run the awk script:
Supposing you saved the code with the name "awkscript", the data file is named "datafile" and they are both in the current folder, you should first mark the script as executable with chmod +x awkscript and then execute it passing the data file as parameter with ./awkscript datafile or use it as part of a sequence of pipes as in cat datafile | ./awkscript.
Comparing the current line to the previous one is trivial, so I think the problem you're having is that you can't figure out how to compare the current line to the next one. Just keep 2 previous lines instead of 1 and always operate on the line before the one that's actually being read as $0, i.e. the line stored in the array p1 in this example (p2 is the line before it and $0 is the line after it):
function abs(val) { return (val > 0 ? val : -val) }
NR==2 {
if ( abs(p1[3] - $3) > 6 ) {
print p1[1], p1[2], p1[3]
}
}
NR>2 {
if ( ( abs(p1[3] - p2[3]) > 6 ) || ( abs(p1[3] - $3) > 6 ) ) {
print p1[1], p1[2], p1[3]
}
}
{ prev2=prev1; prev1=$0; split(prev2,p2); split(prev1,p1) }
END {
if ( ( abs(p1[3] - p2[3]) > 6 ) ) {
print p1[1], p1[2], p1[3]
}
}
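The same look-behind idea can also be written with plain scalars instead of split arrays. This is a rough sketch, not the answer's code: flag_jumps and limit are names I made up, and the threshold defaults to the 6 dB from the question.

```shell
# Hypothetical helper: reads "a b dB" lines on stdin and prints every line
# whose third field differs from the previous or the next line's third
# field by more than the limit (first argument, default 6).
flag_jumps() {
    awk -v limit="${1:-6}" '
    function abs(v) { return v < 0 ? -v : v }
    NF != 3 { next }                      # skip blank or malformed lines
    n > 0 {
        # $0 is the successor of the stored line, so the stored line
        # can be judged now.
        if ((n > 1 && abs(cur - prev) > limit) || abs(cur - $3) > limit)
            print line
    }
    { prev = cur; cur = $3; line = $0; n++ }
    END {
        # The last line only has a predecessor.
        if (n > 1 && abs(cur - prev) > limit) print line
    }
    '
}
```

It could be appended to the pipeline from the question, e.g. ... | grep -v "0 480 -inf" | flag_jumps, or given another threshold as flag_jumps 3.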

How to use awk or anything else to count the number of shared x values of 2 different y values in a csv file consisting of columns a and b?

Let me be specific. We have a csv file consisting of 2 columns, x and y, like this:
x,y
1h,a2
2e,a2
4f,a2
7v,a2
1h,b6
4f,b6
4f,c9
7v,c9
...
And we want to count how many shared x values two y values have, which means we want to get this:
y1,y2,share
a2,b6,2
a2,c9,2
b6,c9,1
And b6,a2,2 should not show up. Does anyone know how to do this by awk? Or anything else?
Thx ahead!
Try this executable awk script:
#!/usr/bin/awk -f
BEGIN {FS=OFS=","}
NR==1 { print "y1" OFS "y2" OFS "share" }
NR>1 {last=a[$1]; a[$1]=(last!=""?last",":"")$2}
END {
for(x in a) {
cnt = split(a[x], arr, FS)
if( cnt>1 ) {
for(k=1;k<cnt;k++) {
for(j=k+1;j<=cnt;j++) { # start after k so a pair is never emitted in reverse
if( arr[k] != arr[j] ) {
key=arr[k] OFS arr[j]
if(out[key]=="") {order[++ocnt]=key}
out[key]++
}
}
}
}
}
for(i=1;i<=ocnt;i++) {
print order[i] OFS out[order[i]]
}
}
When put into a file called awko and made executable, running it like awko data yields:
y1,y2,share
a2,b6,2
a2,c9,2
b6,c9,1
I'm assuming the file is sorted by the y values in the second column, as in the question (after the header). If it works for you, I'll add some explanations tomorrow.
Additionally for anyone who wants more test data, here's a silly executable awk script for generating some data similar to what's in the question. Makes about 10K lines when run like gen.awk.
#!/usr/bin/awk -f
function randInt(max) {
return( int(rand()*max)+1 )
}
BEGIN {
a[1]="a"; a[2]="b"; a[3]="c"; a[4]="d"; a[5]="e"; a[6]="f"
a[7]="g"; a[8]="h"; a[9]="i"; a[10]="j"; a[11]="k"; a[12]="l"
a[13]="m"; a[14]="n"; a[15]="o"; a[16]="p"; a[17]="q"; a[18]="r"
a[19]="s"; a[20]="t"; a[21]="u"; a[22]="v"; a[23]="w"; a[24]="x"
a[25]="y"; a[26]="z"
print "x,y"
for(i=1;i<=26;i++) {
amultiplier = randInt(1000) # vary this to change the output size
r = randInt(amultiplier)
anum = 1
for(j=1;j<=amultiplier;j++) {
if( j == r ) { anum++; r = randInt(amultiplier) }
print a[randInt(26)] randInt(5) "," a[i] anum
}
}
}
I think if you can get the input into a form like this, it's easy:
1h a2 b6
2e a2
4f a2 b6 c9
7v a2 c9
In fact, you don't even need the x value. You can convert this:
a2 b6
a2
a2 b6 c9
a2 c9
Into this:
a2,b6
a2,b6
a2,c9
b6,c9
a2,c9
That output can be sorted and piped to uniq -c to get approximately the output you want, so the only real work is getting from your input to the first and second states. Once we have those, the final step is easy.
Step one:
sort /tmp/values.csv \
| awk '
BEGIN { FS="," }
{
if (x != $1) {
if (x) print values
x = $1
values = $2
} else {
values = values " " $2
}
}
END { print values }
'
Step two:
| awk '
{
for (i = 1; i < NF; ++i) {
for (j = i+1; j <= NF; ++j) {
print $i "," $j
}
}
}
'
Step three:
| sort | awk '
BEGIN {
combination = $0
print "y1,y2,share"
}
{
if (combination == $0) {
count = count + 1
} else {
if (count) print combination "," count
count = 1
combination = $0
}
}
END { print combination "," count }
'
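Put together, and using uniq -c for the counting as mentioned above, the three steps could look like the following sketch. The function name shared_counts is hypothetical, and the sketch assumes that the first line of the file is the x,y header and that the values contain no spaces.

```shell
# Hypothetical helper: shared_counts FILE.csv
# Prints "y1,y2,share" for every pair of y values sharing at least one x.
shared_counts() {
    echo "y1,y2,share"
    tail -n +2 "$1" |                          # drop the header line
    LC_ALL=C sort -t, -k1,1 -k2,2 |            # group by x, y values in order
    awk -F, '
        $1 != x { if (NR > 1) print values; x = $1; values = $2; next }
        { values = values " " $2 }             # same x: collect its y values
        END { if (NR > 0) print values }
    ' |
    awk '{                                     # every pair of y values per x
        for (i = 1; i < NF; i++)
            for (j = i + 1; j <= NF; j++)
                print $i "," $j
    }' |
    LC_ALL=C sort | uniq -c |                  # count identical pairs
    awk '{ print $2 "," $1 }'                  # "count pair" -> "pair,count"
}
```

Because the y values of each x are sorted before pairing, a pair always comes out as a2,b6 and never as b6,a2.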
This awk script does the job:
BEGIN { FS=OFS="," }
NR==1 { print "y1","y2","share" }
NR>1 { ++seen[$1,$2]; ++x[$1]; ++y[$2] }
END {
for (y1 in y) {
for (y2 in y) {
if (y1 != y2 && !(y2 SUBSEP y1 in c)) {
for (i in x) {
if (seen[i,y1] && seen[i,y2]) {
++c[y1,y2]
}
}
}
}
}
for (key in c) {
split(key, a, SUBSEP)
print a[1],a[2],c[key]
}
}
Loop through the input, recording both the original elements and the combinations. Once the file has been processed, look at each pair of y values. The if statement does two things: it prevents equal y values from being compared and it saves looping through the x values twice for every pair. Shared values are stored in c.
Once the shared values have been aggregated, the final output is printed.
This bash script, which uses sed to pick out lines and fields, does the trick:
#!/bin/bash
echo y1,y2,share
x=$(wc -l < file)
b=$(echo "$x -2" | bc)
index=0
for i in $(eval echo "{2..$b}")
do
var_x_1=$(sed -n ''"$i"p'' file | sed 's/,.*//')
var_y_1=$(sed -n ''"$i"p'' file | sed 's/.*,//')
a=$(echo "$i + 1" | bc)
for j in $(eval echo "{$a..$x}")
do
var_x_2=$(sed -n ''"$j"p'' file | sed 's/,.*//')
var_y_2=$(sed -n ''"$j"p'' file | sed 's/.*,//')
if [ "$var_x_1" = "$var_x_2" ] ; then
array[$index]=$var_y_1,$var_y_2
index=$(echo "$index + 1" | bc)
fi
done
done
counter=1
for (( k=1; k<$index; k++ ))
do
if [ ${array[k]} = ${array[k-1]} ] ; then
counter=$(echo "$counter + 1" | bc)
else
echo ${array[k-1]},$counter
counter=1
fi
if [ "$k" = $(echo "$index-1"|bc) ] && [ $counter = 1 ]; then
echo ${array[k]},$counter
fi
done

Adding a loop in awk

I had a problem that was resolved in a previous post:
But because I had too many files it was not practical to do an awk on every file and then use a second script to get the output I wanted.
Here are some examples of my files:
3
10
23
.
.
.
720
810
980
And the script was used to see where the numbers from the first file fell in this other file:
2 0.004
4 0.003
6 0.034
.
.
.
996 0.01
998 0.02
1000 0.23
After that range was located, the mean of the second column of the second file over that range was computed.
Here are the scripts:
awk -v start=$(head -n 1 file1) -v end=$(tail -n 1 file1) -f script file2
and
BEGIN {
sum = 0;
count = 0;
range_start = -1;
range_end = -1;
}
{
irow = int($1)
ival = $2 + 0.0
if (irow >= start && end >= irow) {
if (range_start == -1) {
range_start = NR;
}
sum = sum + ival;
count++;
}
else if (irow > end) {
if (range_end == -1) {
range_end = NR - 1;
}
}
}
END {
print "start =", range_start, "end =", range_end, "mean =", sum / count
}
How could I make a loop so that the mean is computed for every file? My desired output would be something like this:
Name_of_file
start = number , end = number , mean = number
Thanks in advance.
.. wrap it in a loop?
for f in <files>; do
echo "$f";
awk -v start=$(head -n 1 "$f") -v end=$(tail -n 1 "$f") -f script file2;
done
Personally I would suggest combining them on one line, so that each file name sits on the same line as its result instead of on a separate one. In that case replace echo "$f" with echo -n "$f " (to not add the newline).
EDIT: Since I suppose you're new to the syntax, <files> can be a list of files (file1 file2 file3), a list of files generated by a glob (file*, files/data_*.txt, whatever), or a list of files generated by a command ($(find files/ -name 'data' -type f), etc.).
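For completeness, here is a self-contained sketch of such a loop that does not depend on a separate script file. The function name mean_for_each is made up, the awk body is a simplified stand-in for the script from the question rather than a copy of it, and it prints a message instead of dividing by zero when no row falls inside the range.

```shell
# Hypothetical helper: mean_for_each DATA_FILE RANGE_FILE...
# For every range file, takes its first and last line as start/end and
# averages column 2 of DATA_FILE over the rows whose column 1 is in range.
mean_for_each() {
    data=$1
    shift
    for f in "$@"; do
        printf '%s\n' "$f"
        awk -v start="$(head -n 1 "$f")" -v end="$(tail -n 1 "$f")" '
            $1 + 0 >= start && $1 + 0 <= end {
                if (!first) first = NR       # row number of the first hit
                last = NR                    # row number of the latest hit
                sum += $2; n++
            }
            END {
                if (n) print "start =", first, ", end =", last, ", mean =", sum / n
                else   print "no rows in range"
            }' "$data"
    done
}
```

With the files from the question this would be run as mean_for_each file2 file1, or with a glob in place of file1.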

unix shell: replace by dictionary

I have a file which contains some data, like this
2011-01-02 100100 1
2011-01-02 100200 0
2011-01-02 100199 3
2011-01-02 100235 4
and I have a "dictionary" in a separate file
100100 Event1
100200 Event2
100199 Event3
100235 Event4
and I know that
0 - warning
1 - error
2 - critical
etc...
I need some script with sed/awk/grep or something else which helps me receive data like this
100100 Event1 Error
100200 Event2 Warning
100199 Event3 Critical
etc
I will be grateful for ideas on the best way to do this, or for a working example
update
sometimes I have data like this
2011-01-02 100100 1
2011-01-02 sometext 100200 0
2011-01-02 100199 3
2011-01-02 sometext 100235 4
where sometext = any 6 characters (maybe this is helpful info)
in this case I need whole data:
2011-01-02 sometext EventNameFromDictionary Error
or without "sometext"
awk 'BEGIN {
lvl[0] = "warning"
lvl[1] = "error"
lvl[2] = "critical"
}
NR == FNR {
evt[$1] = $2; next
}
{
print $2, evt[$2], lvl[$3]
}' dictionary infile
Adding a new answer for the new requirement and because of the limited formatting options inside a comment:
awk 'BEGIN {
lvl[0] = "warning"
lvl[1] = "error"
lvl[2] = "critical"
}
NR == FNR {
evt[$1] = $2; next
}
{
if (NF > 3) {
idx = 3; $1 = $1 OFS $2
}
else idx = 2
print $1, $idx in evt ? \
evt[$idx] : $idx, $++idx in lvl ? \
lvl[$idx] : $idx
}' dictionary infile
You won't need to escape the newlines inside the ternary operator if you're using GNU awk.
Some awk implementations may have problems with this part:
$++idx in lvl ? lvl[$idx] : $idx
If you're using one of those,
change it to:
$(idx + 1) in lvl ? lvl[$(idx + 1)] : $(idx + 1)
OK, comments added:
awk 'BEGIN {
  lvl[0] = "warning"      # map the error levels
  lvl[1] = "error"
  lvl[2] = "critical"
}
NR == FNR {               # while reading the first non-empty input file
  evt[$1] = $2            # build the associative array evt, keyed by the
                          # first column, with the second column as value
  next                    # skip the rest of the program
}
{                         # now reading the rest of the input
  if (NF > 3) {           # if the number of columns is greater than 3
    idx = 3               # set idx to 3 (the key in evt)
    $1 = $1 OFS $2        # and merge $1 and $2
  }
  else idx = 2            # else set idx to 2
  # Print the first column. Then, if the second (or the third, depending
  # on idx) column is an existing key in evt, print its value, otherwise
  # print the column as it is. The same goes for the lvl array, after
  # incrementing idx first. (A comment cannot follow a line-continuation
  # backslash, and an apostrophe would end the quoted program.)
  print $1, $idx in evt ? \
    evt[$idx] : $idx, $++idx in lvl ? \
    lvl[$idx] : $idx
}' dictionary infile
I hope perl is ok too:
#!/usr/bin/perl
use strict;
use warnings;
open(DICT, 'dict.txt') or die;
my %dict = %{{ map { my ($id, $name) = split; $id => $name } (<DICT>) }};
close(DICT);
my %level = ( 0 => "warning",
1 => "error",
2 => "critical" );
open(EVTS, 'events.txt') or die;
while (<EVTS>)
{
my ($d, $i, $l) = split;
$i = $dict{$i} || $i; # lookup
$l = $level{$l} || $l; # lookup
print "$d\t$i\t$l\n";
}
Output:
$ ./script.pl
2011-01-02 Event1 error
2011-01-02 Event2 warning
2011-01-02 Event3 3
2011-01-02 Event4 4
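Another way to cope with the optional "sometext" column is to count fields from the right: the level is always the last field and the event ID the one before it. This is just a sketch under that assumption; translate_events is a hypothetical name, and unknown IDs or levels are passed through unchanged.

```shell
# Hypothetical helper: translate_events DICTIONARY INFILE
translate_events() {
    awk '
    BEGIN { lvl[0] = "warning"; lvl[1] = "error"; lvl[2] = "critical" }
    NR == FNR { evt[$1] = $2; next }      # first file: ID -> event name
    NF >= 3 {
        id = NF - 1                       # event ID sits next to last
        if ($id in evt) $id = evt[$id]    # replace known IDs
        if ($NF in lvl) $NF = lvl[$NF]    # replace known levels
        print
    }' "$1" "$2"
}
```

Called as translate_events dictionary infile, it keeps any extra leading columns as they are.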
