Clean list of points depending on close range (±5) - bash

How can I clean a list of points held in a variable, dropping any point that is the same as, or close to (±5), a point already kept?
Example: each line is one point with two coordinates:
points="808,112\n807,113\n809,113\n155,183\n832,572\n196,652"
echo -e "$points"
#808,112
#807,113
#809,113
#155,183
#832,572
#196,652
I would like to ignore points within a range of ±5 counts. The result should be:
echo "$points_clean"
#808,112
#155,183
#832,572
#196,652
I thought about looping through the list, but I need help with how to check whether a point's coordinates already exist in the new list:
points_clean=$(for point in $points; do
x=$(echo "$point" | cut -d, -f1)
y=$(echo "$point" | cut -d, -f2)
# check if same or similar point coordinates already in $points_clean
echo "$x,$y"
done)

This seems to work with Bash 4.x (support for process substitution is needed):
#!/bin/bash
close=100
points="808,112\n807,113\n809,113\n155,183\n832,572"
echo -e "$points"
clean=()
distance()
{
echo $(( ($1 - $3) * ($1 - $3) + ($2 - $4) * ($2 - $4) ))
}
while read x1 y1
do
ok=1
for point in "${clean[@]}"
do
echo "compare $x1 $y1 with $point"
set -- $point
if [[ $(distance $x1 $y1 $1 $2) -le $close ]]
then
ok=0
break
fi
done
if [ $ok = 1 ]
then clean+=("$x1 $y1")
fi
done < <( echo -e "$points" | tr ',' ' ' | sort -u )
echo "Clean:"
printf "%s\n" "${clean[@]}" | tr ' ' ','
The sort is optional and may slow things down. Identical points will be too close together, so the second instance of a given coordinate will be eliminated even if the first wasn't.
Sample output:
808,112
807,113
809,113
155,183
832,572
compare 807 113 with 155 183
compare 808 112 with 155 183
compare 808 112 with 807 113
compare 809 113 with 155 183
compare 809 113 with 807 113
compare 832 572 with 155 183
compare 832 572 with 807 113
Clean:
155,183
807,113
832,572
The workaround for Bash 3.x (as found on Mac OS X 10.10.4, for example) is a tad painful; you need to send the output of the echo | tr | sort command to a file, then redirect the input of the pair of loops from that file (and clean up afterwards). Or you can put the pair of loops and the code that follows (the echo of the clean array) inside the scope of { …; } command grouping.
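A sketch of that command-grouping variant, re-using the distance function and loop from above (the wrapper function name dedup is my own, added for convenience):

```shell
#!/bin/bash
# Bash 3.x-friendly variant: instead of  done < <(...)  process substitution,
# pipe into a { ...; } command group so the while loop and the final printing
# run in the same subshell -- an array built inside a plain pipeline would
# otherwise be lost when the loop's subshell exits.
close=100
points="808,112\n807,113\n809,113\n155,183\n832,572"

distance()
{
    echo $(( ($1 - $3) * ($1 - $3) + ($2 - $4) * ($2 - $4) ))
}

dedup()
{
    echo -e "$points" | tr ',' ' ' | sort -u |
    {
        clean=()
        while read -r x1 y1
        do
            ok=1
            for point in "${clean[@]}"
            do
                set -- $point
                if [ "$(distance "$x1" "$y1" "$1" "$2")" -le "$close" ]
                then ok=0; break
                fi
            done
            [ "$ok" = 1 ] && clean+=("$x1 $y1")
        done
        # print inside the group; outside it, clean is empty again
        printf "%s\n" "${clean[@]}" | tr ' ' ','
    }
}

dedup
```

Because sort -u reorders the input, this keeps 155,183, 807,113 and 832,572 from the sample, just as the process-substitution version does.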
In response to the question 'what defines close?', wittich commented:
Let's say ±5 counts, e.g. 808(±5), 112(±5). That's why the second and third points would be "cleaned".
OK. One way of looking at that would be to adjust the close value to 50 in my script (allowing a difference of 5² + 5²), though that also rejects points connected by a line of length just over 7. You could revise the distance function to do ±5; it takes a bit more work and maybe an auxiliary abs function, or you could return the square of the larger delta and compare that with 25 (5², of course). You can play with the criterion to your heart's content.
Note that Bash shell arithmetic is integer arithmetic (only); you need Korn shell (ksh) or Z shell (zsh) to get real arithmetic in the shell, or you need to use bc or some other calculator.
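The "square of the larger delta" idea from above could be sketched like this (the function name maxdelta2 is my own). Two points are within ±5 of each other when max(|dx|, |dy|) ≤ 5; squaring each delta avoids needing an abs helper, so the test becomes: larger squared delta ≤ 25.

```shell
# Return the larger of the two squared deltas between points ($1,$2) and ($3,$4).
maxdelta2()
{
    local dx2=$(( ($1 - $3) * ($1 - $3) ))
    local dy2=$(( ($2 - $4) * ($2 - $4) ))
    if [ "$dx2" -gt "$dy2" ]
    then echo "$dx2"
    else echo "$dy2"
    fi
}

# Drop-in replacement for the distance test in the loop above:
#   if [[ $(maxdelta2 $x1 $y1 $1 $2) -le 25 ]]
maxdelta2 808 112 807 113    # 1      (close: both deltas within 5)
maxdelta2 808 112 155 183    # 426409 (not close)
```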

Related

Shell script with grep and sed to extract individuals from a pair after comparing the numerical values of a variable

I want to compare a group of words (individuals) in pairs and extract the one with the lowest numeric variable. My files and script look like this.
Relatedness_3rdDegree.txt (example):
Individual1 Individual2
Individual5 Individual23
Individual50 Individual65
filename.imiss
INDV N_DATA N_GENOTYPES_FILTERED N_MISS F_MISS
Individual1 375029 0 782 0.00208517
Individual2 375029 0 341 0.000909263
Individual3 375029 0 341 0.000909263
Main script:
numlines=$(wc -l Relatedness_3rdDegree.txt|awk '{print $1}')
for line in `seq 1 $numlines`
do
ind1=$(sed -n "${line}p" Relatedness_3rdDegree.txt|awk '{print $1}')
ind2=$(sed -n "${line}p" Relatedness_3rdDegree.txt|awk '{print $2}')
miss1=$(grep $ind1 filename.imiss|awk '{print $5}')
miss2=$(grep $ind2 filename.imiss|awk '{print $5}')
if echo "$miss1 > $miss2" | bc -l | grep -q 1
then
echo $ind1 >> miss.txt
else
echo $ind2 >> miss.txt
fi
echo "$line / $numlines"
done
This last script will echo a series of lines like this:
1 / 208
2 / 208
3 / 208
and so on, until it hits this error:
91 / 208
(standard_in) 1: syntax error
92 / 208
(standard_in) 1: syntax error
93 / 208
If I go to my output (miss.txt), the printed individuals are not correct.
It should print the individuals, within the pairs contained in the file "Relatedness_3rdDegree.txt", that have the lowest value of F_MISS (column $5 of the "filename.imiss").
For instance, in the pair "Individual1 Individual2", it should compare their values of F_MISS and print only the individual with the lowest value, which in this example would be Individual2.
I have manually checked the values against the printed individuals, and it looks like random individuals were printed for each pair.
What is wrong in this script?
Bash version:
#!/bin/bash
declare -A imiss
while read -r ind nd ngf nm fm # we'll ignore most of these
do
imiss[$ind]=$fm
done < filename.imiss
while read -r i1 i2
do
if (( $(echo "${imiss[$i1]} > ${imiss[$i2]}" | bc -l) ))
then
echo "$i1"
else
echo "$i2"
fi
done < Relatedness_3rdDegree.txt
Run* it like:
bash-imiss
AWK version:
#!/usr/bin/awk -f
NR == FNR {imiss[$1] = $5; next}
{
if (imiss[$1] > imiss[$2]) {
print $1
} else {
print $2
}
}
Run* it like:
awk-imiss filename.imiss Relatedness_3rdDegree.txt
These two scripts do exactly the same thing in exactly the same way using associative arrays.
* This assumes that you have set the script file executable using chmod and that it's in your PATH and that the data files are in your current directory.
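As for what actually went wrong — an educated guess, not confirmed in the answer above: plain grep $ind1 matches substrings, so e.g. Individual5 also matches Individual50, handing bc several numbers at once (hence the (standard_in) 1: syntax error) and making the comparison unreliable. A small illustration (the file imiss.demo and its second line are made-up fixtures):

```shell
# Hypothetical two-line fixture: Individual2 is a prefix of Individual23.
printf 'Individual2 375029 0 341 0.000909263\nIndividual23 375029 0 500 0.00133320\n' > imiss.demo

# Substring match: grep returns BOTH lines, so the variable would hold two
# numbers and bc would be fed an expression like "0.000909263 0.00133320 > x".
grep Individual2 imiss.demo | awk '{print $5}'

# Exact match on field 1 (or grep -w) picks out exactly one value:
awk '$1 == "Individual2" {print $5}' imiss.demo    # 0.000909263

rm -f imiss.demo
```

The associative-array solutions above avoid the problem entirely, since array lookup by key is always an exact match.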

How can a "grep | sed | awk" script merging line pairs be more cleanly implemented?

I have a little script to extract specific data and clean up the output a little. It seems overly messy, and I'm wondering if the script can be trimmed down a bit.
The input file consists of pairs of lines -- names, followed by numbers.
Line pairs where the numeric value is not between 80 and 199 should be discarded.
Pairs may sometimes, but will not always, be preceded or followed by blank lines, which should be ignored.
Example input file:
al12t5682-heapmemusage-latest.log
38
al12t5683-heapmemusage-latest.log
88
al12t5684-heapmemusage-latest.log
100
al12t5685-heapmemusage-latest.log
0
al12t5686-heapmemusage-latest.log
91
Example/wanted output:
al12t5683 88
al12t5684 100
al12t5686 91
Current script:
grep --no-group-separator -PxB1 '([8,9][0-9]|[1][0-9][0-9])' inputfile.txt \
| sed 's/-heapmemusage-latest.log//' \
| awk '{$1=$1;printf("%s ",$0)};NR%2==0{print ""}'
Extra input example
al14672-heapmemusage-latest.log
38
al14671-heapmemusage-latest.log
5
g4t5534-heapmemusage-latest.log
100
al1t0000-heapmemusage-latest.log
0
al1t5535-heapmemusage-latest.log
al1t4676-heapmemusage-latest.log
127
al1t4674-heapmemusage-latest.log
53
A1t5540-heapmemusage-latest.log
54
G4t9981-heapmemusage-latest.log
45
al1c4678-heapmemusage-latest.log
81
B4t8830-heapmemusage-latest.log
76
a1t0091-heapmemusage-latest.log
88
al1t4684-heapmemusage-latest.log
91
Extra Example expected output:
g4t5534 100
al1t4676 127
al1c4678 81
a1t0091 88
al1t4684 91
another awk
$ awk -F- 'NR%2{p=$1; next} 80<=$1 && $1<=199 {print p,$1}' file
al12t5683 88
al12t5684 100
al12t5686 91
UPDATE
for the empty line record delimiter
$ awk -v RS= '80<=$2 && $2<=199{sub(/-.*/,"",$1); print}' file
al12t5683 88
al12t5684 100
al12t5686 91
Consider implementing this in native bash, as in the following (which can be seen running with your sample input -- including sporadically-present blank lines -- at http://ideone.com/Qtfmrr):
#!/bin/bash
name=; number=
while IFS= read -r line; do
[[ $line ]] || continue # skip blank lines
[[ -z $name ]] && { name=$line; continue; } # first non-blank line becomes name
number=$line # second one becomes number
if (( number >= 80 && number < 200 )); then
name=${name%%-*} # prune everything after first "-"
printf '%s %s\n' "$name" "$number" # emit our output
fi
name=; number= # clear the variables
done <inputfile.txt
The above uses no external commands whatsoever -- so whereas it might be slower to run over large input than a well-implemented awk or perl script, it also has far shorter startup time since no interpreter other than the already-running shell is required.
See:
BashFAQ #1 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?, describing the while read idiom.
BashFAQ #100 - How do I do string manipulations in bash?; or The Bash-Hackers' Wiki on parameter expansion, describing how name=${name%%-*} works.
The Bash-Hackers' Wiki on arithmetic expressions, describing the (( ... )) syntax used for numeric comparisons.
perl -nle's/-.*//; $n=<>; print "$_ $n" if 80<=$n && $n<=199' inputfile.txt
With gnu sed
sed -E '
N
/\n[8-9][0-9]$/bA
/\n1[0-9]{2}$/!d
:A
s/([^-]*).*\n([0-9]+$)/\1 \2/
' infile

Convert decimal to Base-4 in bash

I have been using a pretty basic, and for the most part straightforward, method of converting base-10 numbers {1..256} to base-4 (quaternary) numbers. I use simple division $(($NUM/4)) to get the quotients, take the remainders $(($NUM%4)), and then print the remainders in reverse to arrive at the result. I use the following bash script to do this:
#!/bin/bash
NUM="$1"
main() {
local EXP1=$(($NUM/4))
local REM1=$(($NUM%4))
local EXP2=$(($EXP1/4))
local REM2=$(($EXP1%4))
local EXP3=$(($EXP2/4))
local REM3=$(($EXP2%4))
local EXP4=$(($EXP3/4))
local REM4=$(($EXP3%4))
echo "
$EXP1 remainder $REM1
$EXP2 remainder $REM2
$EXP3 remainder $REM3
$EXP4 remainder $REM4
Answer: $REM4$REM3$REM2$REM1
"
}
main
This script works fine for numbers 0-255 (or 1-256). Beyond those ranges the results become mixed, and are often repeated or inaccurate. This isn't much of a problem, as I don't intend to convert numbers beyond 256 or less than 0 (negative numbers [yet]).
My question is: "Is there a more simplified method to do this, possibly using expr or bc?"
Base 4 conversion in bash
int2b4() {
local val out num ret=\\n;
for ((val=$1;val;val/=4)){
out=$((val%4))$out;
}
printf ${2+-v} $2 %s${ret[${2+1}]} $out
}
Invoked with only one argument, this will convert to base 4 and print the result followed by a newline. If a second argument is present, a variable of that name will be populated instead, with no printing.
int2b4 135
2013
int2b4 12345678
233012011032
int2b4 5432 var
echo $var
1110320
Detailed explanation:
The main part could be written:
out=""
for (( val=$1 ; val > 0 ; val = val / 4 )) ;do
out="$((val%4))$out"
done
where the conversion loop could be easily understood (I hope).
local ensures out, val and num are empty local variables, and initialises ret='\n' locally.
The printf line uses some bashisms:
${2+-v} is empty if $2 is unset and expands to -v if it is set.
${ret[${2+1}]} becomes ${ret[]} (i.e. ${ret[0]}, the newline) if $2 is unset, and ${ret[1]} (empty) if it is set.
So this line become
printf "%s\n" $out
if no second argument ($2) and
printf -v var "%s" $out
if the second argument is var. (Note that no newline is appended when populating a variable; one is only added when printing to the terminal.)
Conversion back to decimal:
There is a bashism letting you compute with arbitrary base, under bash:
echo $((4#$var))
5432
echo $((4#1110320))
5432
In a script:
for integer in {1234..1248};do
int2b4 $integer quaternary
backint=$((4#$quaternary))
echo $integer $quaternary $backint
done
1234 103102 1234
1235 103103 1235
1236 103110 1236
1237 103111 1237
1238 103112 1238
1239 103113 1239
1240 103120 1240
1241 103121 1241
1242 103122 1242
1243 103123 1243
1244 103130 1244
1245 103131 1245
1246 103132 1246
1247 103133 1247
1248 103200 1248
Create a look-up table taking advantage of brace expansion
$ echo {a..c}
a b c
$ echo {a..c}{r..s}
ar as br bs cr cs
$ echo {0..3}{0..3}
00 01 02 03 10 11 12 13 20 21 22 23 30 31 32 33
and so, for 0-255 in decimal to base-4
$ base4=({0..3}{0..3}{0..3}{0..3})
$ echo "${base4[34]}"
0202
$ echo "${base4[255]}"
3333

check continuity of a number series using if-else

I have a file which contains numbers, say 1 to 300, but the numbers are not continuous. A sample file looks like this:
042
043
044
045
078
198
199
200
201
202
203
212
213
214
215
238
239
240
241
242
256
257
258
Now I need to check the continuity of the number series and write out the output accordingly. For example, the first 4 numbers are in series, so the output should be
042-045
Next, 078 is a lone number, so the output should be
078
for convenience it can be made to look like
078-078
Then 198 to 203 are continuous. So, next output should be
198-203
and so on. The final output should be like
042-045
078-078
198-203
212-215
238-242
256-258
I just need to know the first and last member of each continuous series, and to jump to the next series when a discontinuity is encountered; the output format can be adjusted. I am inclined to use an if statement and can think of a complicated thing like this:
num=`cat file | wc -l`
out1=`head -1 file`
for ((i=2;i<=$num;i++))
do
j=`echo $i-1 | bc`
var1=`cat file | awk 'NR='$j'{print}'`
var2=`cat file | awk 'NR='$i'{print}'`
var3=`echo $var2 - $var1 | bc`
if [ $var3 -gt 1 ]
then
out2=$var1
echo $out1-$out2
out1=$var2
fi
done
which works but is too lengthy. I am sure there is a shorter way of doing this.
I am also open to other straightforward commands (or a few commands) in shell or awk, or a few lines of Fortran code that can do it.
Thanking you in anticipation.
This awk one-liner works for given example:
awk 'p+1!=$1{printf "%s%s--",NR==1?"":p"\n",$1}{p=$1}END{print $1}' file
It gives the output for your data as input:
042--045
078--078
198--203
212--215
238--242
256--258
Here is a simple program in Fortran:
program test
implicit none
integer :: first, last, uFile, i, stat
open( file='numbers.txt', newunit=uFile, action='read', status='old' )
read(uFile,*,iostat=stat) i
if ( stat /= 0 ) stop
first = i ; last = i
do
read(uFile,*,iostat=stat) i
if ( stat /= 0 ) exit
if ( i == last+1 ) then
last = i
else
write(*,'(i3.3,a,i3.3)') first,'-',last
first = i ; last = i
endif
enddo
write(*,'(i3.3,a,i3.3)') first,'-',last
end program
The output is
042-045
078-078
198-203
212-215
238-242
256-258
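For comparison, the same logic also fits in a few lines of plain bash (a sketch; the function name ranges is mine, and 10#$n forces base-10 arithmetic so a number with a leading zero, such as 078, is not parsed as invalid octal):

```shell
# Sketch: collapse sorted zero-padded numbers (one per line) into ranges.
ranges()
{
    local first= last= n
    while read -r n
    do
        [ -n "$n" ] || continue                      # skip blank lines
        if [ -z "$first" ]
        then first=$n; last=$n                       # start first series
        elif [ $((10#$n)) -eq $((10#$last + 1)) ]
        then last=$n                                 # still continuous
        else
            printf '%s-%s\n' "$first" "$last"        # series broken: emit it
            first=$n; last=$n
        fi
    done
    if [ -n "$first" ]
    then printf '%s-%s\n' "$first" "$last"           # flush the final series
    fi
}

printf '%s\n' 042 043 044 045 078 198 199 200 | ranges
```

For the sample data above this prints 042-045, 078-078 and 198-200, one range per line; feed it the whole file with ranges < file.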

Sed - convert negative to positive numbers

I am trying to convert all negative numbers to positive numbers and have so far come up with this
echo "-32 45 -45 -72" | sed -re 's/\-([0-9])([0-9])\ /\1\2/p'
but it is not working as it outputs:
3245 -45 -72
I thought by using \1\2 I would have got the positive number back ?
Where am I going wrong ?
Why not just remove the -'s?
[root@vm ~]# echo "-32 45 -45 -72" | sed 's/-//g'
32 45 45 72
My first thought is not to use sed, if you don't have to. awk understands that they're numbers and can convert them thusly:
echo "-32 45 -45 -72" | awk -vRS=" " -vORS=" " '{ print ($1 < 0) ? ($1 * -1) : $1 }'
-vRS sets the "record separator" to a space, and -vORS sets the "output record separator" to a space. Then it simply checks each value, sees if it's less than 0, and multiplies it by -1 if it is, and if it's not, just prints the number.
In my opinion, if you don't have to use sed, this is more "correct," since it treats numbers like numbers.
This might work for you:
echo "-32 45 -45 -72" | sed 's/-\([0-9]\+\)/\1/g'
The reasons why your regex is failing:
You're only doing a single substitution (no g flag).
Your replacement has no space at the end.
The last number has no space following it, so it will always fail to match.
This would work too but less elegantly (and only for 2 digit numbers):
echo "-32 45 -45 -72" | sed -rn 's/-([0-9])([0-9])(\s?)/\1\2\3/gp'
Of course for this example only:
echo "-32 45 -45 -72" | tr -d '-'
You are treating the numbers as a string of characters. It would be more appropriate to store them in an array and use built-in Shell Parameter Expansion to remove the minus sign:
[~] $ # Creating an array with an arbitrary name:
[~] $ array17=(-32 45 -45 -72)
[~] $ # Calling all elements of the array and removing the first minus sign:
[~] $ echo ${array17[*]/-}
32 45 45 72
[~] $
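The same expansion also works on a plain string variable, without the array: the // form deletes every occurrence of the pattern in one go.

```shell
# ${var//-/} removes all "-" characters from the string in a single expansion.
nums="-32 45 -45 -72"
echo "${nums//-/}"    # 32 45 45 72
```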
