Evaluating overlap of number ranges in bash

Assume a text file file which contains multiple lines of number ranges. The lower and upper bound of each range are separated by a dash, and the individual ranges are sorted (i.e., range 101-297 comes before 1299-1314).
$ cat file
101-297
1299-1314
1301-5266
6898-14503
How can I confirm in bash if one or more of these number ranges are overlapping?
In my opinion, all that is needed is to iteratively perform integer comparisons across adjacent lines. An individual comparison could look something like this:
if [ "$upperbound_range1" -gt "$lowerbound_range2" ]; then
    echo "Overlap!"
    exit 1
fi
I suspect, however, that this comparison can also be done via awk.
Note: Ideally, the code could not only determine if any of the ranges is overlapping with its immediate successor range, but also which range is the overlapping one.
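For reference, here is a minimal sketch of how that comparison might sit in a loop over the file (assuming the input is sorted as shown above):
prev_upper=
prev_range=
while IFS=- read -r lower upper; do
    # compare the previous range's upper bound against the current lower bound
    if [ -n "$prev_upper" ] && [ "$prev_upper" -ge "$lower" ]; then
        echo "Overlap: $prev_range overlaps $lower-$upper"
    fi
    prev_upper=$upper
    prev_range="$lower-$upper"
done < file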

Try this in awk:
awk -F"-" 'Q>=$1 && Q{print}{Q=$NF}' Input_file
Here the -(dash) is used as the field separator. For each line we check whether a variable named Q is non-empty and its value is greater than or equal to the current line's first field ($1); if so, that line is printed (printing the previous line instead is also possible). Then Q is (re-)assigned the value of the current line's last field.
EDIT: As the OP wants to get the previous line, here is a version that prints it instead:
awk -F"-" 'Q>=$1 && Q{print val}{Q=$NF;val=$0}' Input_file

You could do:
$ awk -F"-" '$1<last_2 && NR>1 {printf "%s: %s: Overlap\n", last_line, $0}
{last_line=$0; last_2=$2}' file
1299-1314: 1301-5266: Overlap

If the ranges are sorted by lower bound, then any range that overlaps another will also overlap its immediate successor, so comparing adjacent pairs is enough.
ranges=( $(<file) )
# or ranges=(101-297 1299-1314 1301-5266 6898-14503)
for ((i=1;i<${#ranges[@]};i+=1)); do
    range=${ranges[i-1]}
    successorRange=${ranges[i]}
    if ((${range#*-}>=${successorRange%-*})); then
        echo "overlap $i $range $successorRange"
    fi
done
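For the sample file this reports the array index of the successor range together with both ranges involved:
overlap 2 1299-1314 1301-5266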

How can I compare floats in bash

Working on a script and I am currently stuck. (Still pretty new at this)
First off I have my data file, the file I am searching inside.
The first field is the name, the second is the money paid, and the third is the money owed.
customerData.txt
name1,500.00,1000
name2,2000,100
name3,100,100.00
Here is my bash file. Basically, if the owed amount is greater than the paid amount, then print the name. It works fine for anything that's not a float. I also understand that bash doesn't handle floats and that the only way to handle them is with the bc utility, but I have had no luck.
#!/bin/bash
while IFS="," read name paid owe; do
#due=$(echo "$owe - $paid" |bc -1)
#echo $due
if [ $owe -gt $paid ]; then
echo $name
fi
done < customerData.txt
To print all lines for which the third column is larger than the second:
$ awk -F, '$3>$2' customerData.txt
name1,500.00,1000
How it works
-F, tells awk that the columns are comma-separated.
$3>$2 tells awk to print any line for which the third column is larger than the second.
In more detail, $3>$2 is a condition: it evaluates to true or false. If it evaluates to true, then the action is performed. Since we didn't specify any action, awk performs the default action which is to print the line.
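If you do want to stay in the bash loop, the comparison itself can be delegated to bc, which understands decimals; bc prints 1 when the comparison is true and 0 otherwise. A minimal sketch of that approach (not part of the awk answer above):
#!/bin/bash
while IFS="," read -r name paid owe; do
    # bc evaluates the float comparison and prints 1 (true) or 0 (false)
    if [ "$(echo "$owe > $paid" | bc -l)" -eq 1 ]; then
        echo "$name"
    fi
done < customerData.txt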

Efficient substring parsing of large fixed length text file in Bash

I have a large text file (millions of records) of fixed-length data and need to extract unique substrings and create a number of arrays with those values. I have a working version; however, I'm wondering if performance can be improved, since I need to run the script iteratively.
$_file5 looks like:
138000010065011417865201710152017102122
138000010067710416865201710152017102133
138000010131490417865201710152017102124
138000010142349413865201710152017102154
138400010142356417865201710152017102165
130000101694334417865201710152017102176
Here is what I have so far:
while IFS='' read -r line || [[ -n "$line" ]]; do
    _in=0
    _set=${line:15:6}
    _startDate=${line:21:8}
    _id="$_account-$_set-$_startDate"
    for element in "${_subsets[@]}"; do
        if [[ $element == "$_set" ]]; then
            _in=1
            break
        fi
    done
    # If we find a new one and it's not 504721
    if [ $_in -eq 0 ] && [ $_set != "504721" ]; then
        _subsets=("${_subsets[@]}" "$_set")
        _ids=("${_ids[@]}" "$_id")
    fi
done < $_file5
And this yields:
_subsets=("417865","416865","413865")
_ids=("9899-417865-20171015", "9899-416865-20171015", "9899-413865-20171015")
I'm not sure if sed or awk would be better here and can't find a way to implement either. Thanks.
EDIT: Benchmark Tests
So I benchmarked my original solution against the two provided. I ran this over 10 times and all results were similar to the ones below.
# Bash read
real 0m8.423s
user 0m8.115s
sys 0m0.307s
# Using sort -u (@randomir)
real 0m0.719s
user 0m0.693s
sys 0m0.041s
# Using awk (@shellter)
real 0m0.159s
user 0m0.152s
sys 0m0.007s
Looks like awk wins this one. Regardless, the performance improvement from my original code is substantial. Thank you both for your contributions.
I don't think you can beat the performance of sort -u with bash loops (except in corner cases, as this one turned out to be, see footnote✻).
To reduce the list of strings you have in file to a list of unique strings (set), based on a substring:
sort -k1.16,1.21 -u file >set
Then, to filter out the unwanted id, 504721, starting at position 16, you can use grep -v:
grep -vE '.{15}504721' set
Finally, reformat the remaining lines and store them in arrays with cut/sed/awk/bash.
So, to populate the _subsets array, for example:
$ _subsets=($(sort -k1.16,1.21 -u file | grep -vE '.{15}504721' | cut -c16-21))
$ printf "%s\n" "${_subsets[#]}"
413865
416865
417865
or, to populate the _ids array:
$ _ids=($(sort -k1.16,1.21 -u file | grep -vE '.{15}504721' | sed -E 's/^.{15}(.{6})(.{8}).*/9899-\1-\2/'))
$ printf "%s\n" "${_ids[#]}"
9899-413865-20171015
9899-416865-20171015
9899-417865-20171015
✻ If the input file is huge, but it contains only a small number (~40) of unique elements (for the relevant field), then it makes perfect sense for the awk solution to be faster. sort needs to sort a huge file (O(N*logN)), then filter the dupes (O(N)), all for a large N. On the other hand, awk needs to pass through the large input only once, checking for dupes along the way via set membership testing. Since the set of uniques is small, membership testing takes only O(1) (on average, but for such a small set, practically constant even in worst case), making the overall time O(N).
In case there were fewer dupes, awk would have O(N*log(N)) amortized, and O(N^2) worst case. Not to mention the higher constant per-instruction overhead.
In short: you have to know what your data looks like before choosing the right tool for the job.
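As an aside, the set-membership idea described in the footnote is the usual awk dedup pattern; applied to the field positions from the question (purely illustrative, the complete solution follows below), it could look like:
awk '
    { key = substr($0,16,6) }          # the 6-character subset field
    key != "504721" && !seen[key]++ {  # keep first occurrence, skip 504721
        print key
    }
' file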
Here's an awk solution embedded in a bash script:
#!/bin/bash
fn_parser() {
    awk '
    BEGIN { _account="9899" }
    {
        _set=substr($0,16,6)
        _startDate=substr($0,22,8)
        #dbg print "#dbg:_set=" _set "\t_startDate=" _startDate
        if (_set != "504721") {
            _id= _account "-" _set "-" _startDate
            ids[_id] = _id
            sets[_set] = _set
        }
    }
    END {
        printf "_subsets=("
        for (s in sets) { printf("%s\"%s\"", (commaCtr++ ? "," : ""), sets[s]) }
        print ");"
        printf "_ids=("
        for (i in ids) { printf("%s\"%s\"", (commaCtr2++ ? "," : ""), ids[i]) }
        print ")"
    }
    ' "${@}"
}
#dbg set -vx
eval $( echo $(fn_parser *.txt) )
echo "_subsets="$_subsets
echo "_ids="$_ids
output
_subsets=413865,417865,416865
_ids=9899-416865-20171015,9899-413865-20171015,9899-417865-20171015
Which I believe would be the same output your script would get if you did an echo on your variable names.
I didn't see that _account was being extracted from your file, and assume it is passed in from a previous step in your batch. But until I know if that is a critical piece, I'll have to come back to figuring out how to pass in var to a function that calls awk.
People won't like using eval, but hopefully no one will embed /bin/rm -rf / into your data set ;-)
I use the eval so that the data extracted is available via the shell variables. You can uncomment the #dbg before the eval line to see how the code is executing in the "layers" of function, eval, var=value assignments.
Hopefully, you see how the awk script is a transcription of your code into awk.
It does rely on the fact that arrays can contain only 1 copy of a key/value pair.
I'd really appreciate if you post timings for all solutions submitted. (You could reduce the file size by 1/2 and still have a good test). Be sure to run each version several times, and discard the first run.
IHTH

Shell Bash Replace or remove part of a number or string

Good day.
Every day I receive a list of numbers like the example below:
11986542586
34988745236
2274563215
4532146587
11987455478
3652147859
As you can see, some of them have a 9 as the third digit (11 digits total) and some don't (10 digits total). That's because the ones with the extra 9 are in the new Brazilian mobile number format and the ones without it are in the old format.
The thing is that I have to use the numbers in both formats as a parameter for another script, and I usually do this by hand.
I am trying to create a script that reads the length of a mobile number and adds or removes the "9" in the third position depending on which length condition is met, saving the result in a separate file.
So far I am only able to check the length, but I don't know how to add or remove the "9" in the third digit.
#!/bin/bash
Numbers_file="/FILES/dir/dir2/Numbers_File.txt"
while read Numbers
do
    LEN=${#Numbers}
    if [ $LEN -eq "11" ]; then
        echo "length = "$LEN
    elif [ $LEN -eq "10" ]; then
        echo "length = "$LEN
    else
        echo "error"
    fi
done < $Numbers_file
You can delete the third character of any string with sed as follows:
sed 's/.//3'
Example:
echo "11986542586" | sed 's/.//3'
1186542586
To add a 9 in the third character:
echo "2274563215" | sed 's/./&9/3'
22794563215
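Combining these sed commands with the length check from the question gives a sketch like the following (the output file names are just placeholders):
#!/bin/bash
Numbers_file="/FILES/dir/dir2/Numbers_File.txt"
while read -r Number; do
    if [ ${#Number} -eq 11 ]; then
        # new format: keep it and also derive the old 10-digit format
        echo "$Number" >> new_format.txt
        echo "$Number" | sed 's/.//3' >> old_format.txt
    elif [ ${#Number} -eq 10 ]; then
        # old format: keep it and also derive the new 11-digit format
        echo "$Number" >> old_format.txt
        echo "$Number" | sed 's/./&9/3' >> new_format.txt
    else
        echo "unexpected length: $Number" >&2
    fi
done < "$Numbers_file"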
If you are absolutely sure about the occurrence happening only at the third position, you can use an awk statement as below,
awk 'substr($0,3,1)=="9"{$0=substr($0,1,2)substr($0,4,length($0))}1' file
1186542586
3488745236
2274563215
4532146587
1187455478
3652147859
Using the POSIX-compliant substr() function, this processes only the lines having a 9 at the 3rd position and rebuilds the record without that one digit.
substr(s, m[, n ])
Return the at most n-character substring of s that begins at position m, numbering from 1. If n is omitted, or if n specifies more characters than are left in the string, the length of the substring shall be limited by the length of the string s.
There are lots of text manipulation tools that will do this, but the lightest weight is probably cut because this is all it does.
cut only supports a single range, but it does have an invert option: cut -c4 would give you just the 4th character, while adding --complement gives you everything but character 4.
echo 1234567890 | cut -c4 --complement
12356789
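Applied to the question's numbers, the position to drop is 3 rather than 4:
echo 11986542586 | cut -c3 --complement
1186542586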

extract each line followed by a line with a different value in column two

Given the following file structure,
9.975 1.49000000 0.295 0 0.4880 0.4929 0.5113 0.5245 2.016726 1.0472 -30.7449 1
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1
9.975 1.50000000 0.295 0 0.5145 0.4984 0.4873 0.5019 2.002143 1.0854 -30.3044 2
is there a way to extract each line in which the value in column two is not equal to the value in column two in the following line?
I.e. from these three lines I would like to extract the second one, since 1.49 is not equal to 1.50.
Maybe with sed or awk?
This is how I do this in MATLAB:
myline = 1;
mynewline = 1;
while myline < length(myfile)
    if myfile(myline,2) ~= myfile(myline+1,2)
        mynewfile(mynewline,:) = myfile(myline,:);
        mynewline = mynewline+1;
        myline = myline+1;
    else
        myline = myline+1;
    end
end
However, my files are so large now that I would prefer to carry out this extraction in terminal before transferring them to my laptop.
Awk should do.
<data awk '($2 != prev) {print line} {line = $0; prev = $2}'
A brief intro to awk: an awk program consists of a set of condition {code} blocks. It operates line by line. When no condition is given, the block is executed for each line. A BEGIN condition is executed before the first line. Each line is split into fields, which are accessible as $number (e.g. $1, $2, ...). The full line is in $0.
Here I compare the second field to the previous value, if it does not match I print the whole previous line. In all cases I store the current line into line and the second field into prev.
And if you really want it right, be careful with the float comparisons: use something like abs($2 - prev) < eps (there is no abs in awk, you need to define it yourself, and eps is some small enough number). I'm actually not sure whether awk converts to numbers for equality testing; if not, you're safe with the string comparisons.
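A sketch of that epsilon-based variant (the eps value here is arbitrary):
<data awk -v eps=1e-6 '
    function abs(x) { return x < 0 ? -x : x }    # awk has no built-in abs()
    NR > 1 && abs($2 - prev) > eps { print line }
    { line = $0; prev = $2 }
'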
This might work for you (GNU sed):
sed -r 'N;/^((\S+)\s+){2}.*\n\S+\s+\2/!P;D' file
Read two lines at a time. Pattern match on the first two columns and only print the first line when the second column does not match.
Try the following command:
awk '$2 != field && field { print line } { field = $2; line = $0 }' infile
It saves the previous line and second field, comparing them on the next iteration with the current line's values. The && field check is useful to avoid printing a blank line for the first input line, when $2 != field would otherwise match because the variable is still empty.
It yields:
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1

count the max number of _ and add additional semi-colon if some are missing

I have several files with fields like below
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme_Fort_Email_am;04/02/2015;Deme_Fort_Postal
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme_faible_Email_am;18/02/2015;deme_Faible_Email_Relance_am
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;equi_Fort_Email_am;23/02/2015;trav_Fort_Email_am
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav_Faible_Email_pm;18/02/2015;trav_Faible_Email_Relance_pm
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav_Fort_Email_am;12/02/2015;Trav_Fort_Postal
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;voya_Faible_Email_am;29/01/2015;voya_Faible_Email_Relance_am
The aim is to have this:
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme;Fort;Email;am;04/02/2015;Deme;Fort;Postal;;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme;faible;Email;am;18/02/2015;deme;Faible;Email;Relance;am
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;equi;Fort;Email;am;23/02/2015;trav;Fort;Email;am;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav;Faible;Email;pm;18/02/2015;trav;Faible;Email;Relance;pm
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav;Fort;Email;am;12/02/2015;Trav;Fort;Postal;;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;voya;Faible;Email;am;29/01/2015;voya;Faible;Email;Relance;am
I want to count the maximum number of underscores after the 7th field on any line, change each underscore to a semicolon, and then add additional semicolons depending on the maximum underscore count found across all the lines.
I thought about using awk for that, but with the command line below I will only change everything after the first field. My aim is also to add the additional semicolons.
awk 'BEGIN{FS=OFS=";"} {for (i=7;i<=NF;i++) gsub(/_/,";", $i) } 1' file
Thanks.
Awk way
awk -F';' -vOFS=';' '{y=0;for(i=8;i<=NF;i++)y+=gsub(/_/,";",$i)
x=x<y?y:x;NF=NF+(x-y)}NR!=FNR' file{,}
Output
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme;Fort;Email;am;04/02/2015;Deme;Fort;Postal;;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme;faible;Email;am;18/02/2015;deme;Faible;Email;Relance;am
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;equi;Fort;Email;am;23/02/2015;trav;Fort;Email;am;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav;Faible;Email;pm;18/02/2015;trav;Faible;Email;Relance;pm
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav;Fort;Email;am;12/02/2015;Trav;Fort;Postal;;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;voya;Faible;Email;am;29/01/2015;voya;Faible;Email;Relance;am
Explanation
awk -F';' -vOFS=';'
This sets the Field Separator and Output Field separator to ;.
y=0;
Initialise y to 0 on each line.
for(i=8;i<=NF;i++)y+=gsub(/_/,";",$i)
For each field from field 8 to the number of fields on the line (NF), substitute _ with ;, and increment y by the number of substitutions.
x=x<y?y:x
Check if x is less than y; if it is, set x to y, else leave it the same.
NF=NF+(x-y)
Set the number of fields to the current number of fields plus the difference between x and y (this pads the line with empty fields, i.e. the extra trailing semicolons).
NR!=FNR
This means that if the total record number (NR) is not equal to the file's record number (FNR), then print. Effectively, nothing is printed during the first pass through the file and every modified line is printed during the second pass.
file{,}
Expands to file file so the file is read twice.
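This is ordinary bash brace expansion:
$ echo file{,}
file file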
Resources
https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html
