Connecting markers with PolyLine does not work - geopandas

I have a geopandas table that looks like this:
+----+---------+------------+-------------+--------+--------+--------------------------------+
| | Score | LATITUDE | LONGITUDE | Name | From | geometry |
|----+---------+------------+-------------+--------+--------+--------------------------------|
| 0 | 9 | 47.0989 | 6.81907 | a | a | POINT (6.81906939 47.09885642) |
| 1 | 4 | 47.0993 | 6.82133 | b | a | POINT (6.82133392 47.09932583) |
| 2 | 12 | 47.1006 | 6.82169 | c | a | POINT (6.821687 47.10058986) |
| 3 | 12 | 47.1006 | 6.82169 | d | f | POINT (6.821687 47.10058986) |
| 4 | 2 | 47.0985 | 6.81926 | e | b | POINT (6.81926344 47.09847967) |
| 5 | 4 | 47.0998 | 6.82126 | f | b | POINT (6.82126031 47.09980364) |
| 6 | 2 | 47.0993 | 6.82197 | g | b | POINT (6.82197033 47.09929947) |
+----+---------+------------+-------------+--------+--------+--------------------------------+
One column holds the Name of each data point and another column holds the point it comes From.
Now I want to connect the markers/data points with their sources. I tried the following code, but it doesn't work:
df2 = pd.merge(df, df[["Name", "LATITUDE", "LONGITUDE"]],
               on="Name", how="left", suffixes=["_to", "_from"])
latlong_from = zip(df2["LATITUDE_from"], df2["LONGITUDE_from"])
latlong_to = zip(df2["LATITUDE_to"], df2["LONGITUDE_to"])
for _from, _to in zip(latlong_from, latlong_to):
    folium.PolyLine([[_from[0], _from[1]],
                     [_to[0], _to[1]]]).add_to(m)
m
It still looks like:
Does somebody see what I'm doing wrong?
Update
Now I tried to create a new geopandas-df with only the lines. But they also do not appear.
from shapely.geometry import LineString
df2 = pd.merge(df, df[["Name", "LATITUDE", "LONGITUDE"]],
on="Name", how="left", suffixes=["_to", "_from"])
df2["from_geometry"] = geopandas.points_from_xy(df2["LONGITUDE_from"], df2["LATITUDE_from"])
df2['line'] = df2.apply(lambda row: LineString([row['from_geometry'], row['geometry']]), axis=1) #Create a linestring column
gdf2 = geopandas.GeoDataFrame(
df2,
geometry=df2["line"])
gdf2.drop(["line", "from_geometry"], axis=1, inplace=True)
gdf2.crs = "EPSG:4326"
gdf2.explore(color="red")
For a better overview, I'm not pasting another screenshot. The map zooms to the right area, so the coordinates should be fine, but no lines appear. I'm lost; can somebody give me a hint? I don't need a properly coded solution, just an idea of what I could try or what I'm doing wrong.
Goal
Here is an example of the expected output:
MWE
import geopandas
import pandas as pd
dict_data = {'Score': {0: 9, 1: 4, 2: 12, 3: 12, 4: 2, 5: 4, 6: 2},
'LATITUDE': {0: 47.09885642,
1: 47.09932583,
2: 47.10058986,
3: 47.10058986,
4: 47.09847967,
5: 47.09980364,
6: 47.09929947},
'LONGITUDE': {0: 6.81906939,
1: 6.82133392,
2: 6.821687,
3: 6.821687,
4: 6.81926344,
5: 6.82126031,
6: 6.82197033},
'Name': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f', 6: 'g'},
'From': {0: 'a', 1: 'a', 2: 'a', 3: 'f', 4: 'b', 5: 'b', 6: 'b'}}
df = pd.DataFrame(dict_data)
gdf = geopandas.GeoDataFrame(
df,
geometry=geopandas.points_from_xy(df["LONGITUDE"], df["LATITUDE"]))
gdf.crs = "EPSG:4326"
m = gdf.explore(
column="Name",
tiles="CartoDB positron",
cmap="Set1"
)
Note:
I created a new question, as the old one was too general.

I can't work out why your code does not generate the graph you want. The approach below generates a LineString for each pair of points defined by the Name and From columns:
from shapely.geometry import LineString
gdf2 = gdf.assign(From_point=gdf.set_index("Name").loc[gdf["From"],"geometry"].values)
gdf2["line"] = gdf2[["geometry","From_point"]].apply(LineString, axis=1)
gdf2.set_crs(gdf.crs).set_geometry("line").explore(m=m)

It might be possible to add lines with geopandas, but I don't have the knowledge to do so, so I achieved this with folium. The line needs to be specified from point e to point a and from point a to point b, so I manually list the latitude and longitude. That list is then set to a polyline in a loop process.
import folium
# line coords
# e -> a, a -> b, b -> f, f -> d, b -> g
line_points = [[[47.098480,6.819263],[47.098856,6.819069]],
               [[47.098856,6.819069],[47.099326,6.821334]],
               [[47.099326,6.821334],[47.099804,6.821260]],
               [[47.099804,6.821260],[47.100590,6.821687]],
               [[47.099326,6.821334],[47.099299,6.821970]]]
m = gdf.explore(
column="Name",
tiles="CartoDB positron",
cmap="Set1"
)
for line in line_points:
    #print(line[0][0],line[0][1],line[1][0],line[1][1])
    folium.PolyLine([[line[0][0],line[0][1]],[line[1][0],line[1][1]]], weight=2, color="black").add_to(m)
m
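The hand-written list above could also be derived from the Name and From columns. As far as I can tell, the merge in the question joins on Name on both sides, so every row is matched with itself and each line starts and ends at the same point. Below is a minimal sketch in plain Python, using the values from the question's MWE, that looks up the coordinates of each row's From point instead. The helper name build_segments is my own, and skipping zero-length segments is an assumption about what is wanted.

```python
# Columns from the question's MWE, in row order.
names = ["a", "b", "c", "d", "e", "f", "g"]
sources = ["a", "a", "a", "f", "b", "b", "b"]
lats = [47.09885642, 47.09932583, 47.10058986, 47.10058986,
        47.09847967, 47.09980364, 47.09929947]
lons = [6.81906939, 6.82133392, 6.821687, 6.821687,
        6.81926344, 6.82126031, 6.82197033]

# Look up each point's (lat, lon) by its Name.
coords = {n: (lat, lon) for n, lat, lon in zip(names, lats, lons)}


def build_segments(names, sources, coords):
    """Return one [[lat, lon], [lat, lon]] pair per row, source point first."""
    segments = []
    for name, src in zip(names, sources):
        start, end = coords[src], coords[name]
        if start == end:
            continue  # assumption: a zero-length line (e.g. a -> a) is not wanted
        segments.append([list(start), list(end)])
    return segments


line_points = build_segments(names, sources, coords)
for segment in line_points:
    print(segment)
```

Each entry of line_points has the same shape as the manual list, so it can be passed to folium.PolyLine in the same loop as above.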


How do I generate age category? My PATIENT_YOB is given as 01jan1956 and I want to get exact age

My PATIENT_YOB values look like this:
01jan1986
05jan2001
07mar1983
and so on. I need to get the exact age for each of them. I'm trying to use the following code, but it gives an error:
gen agecat=1
if age 0-20==1
if age 21-40==2
if age 41-60==3
if age 61-64==4
Here's one way:
gen age_cat = cond(age <= 20, 1, cond(age <= 40, 2, cond(age <= 60, 3, cond(age <= 64, 4, .))))
You might also want to look into egen, cut, see help egen.
To build off of Wouter's answer, you could do something like this to calculate the age to the tenth of a year:
clear
set obs 12
set seed 12352
global today = date("18Jun2021", "DMY")
* Sample Data
gen dob = runiformint(0,17000) // random Dates
format dob %td
* Create Age
gen age = round((ym(year(${today}),month(${today})) - ym(year(dob), month(dob)))/ 12,0.1)
* Correct age if dob in current month, but after today's date
replace age = age - 0.1 if (month(${today}) == month(dob)) & (day(dob) > day(${today}))
* age category
gen age_cat = cond(age <= 20, 1, cond(age <= 40, 2, cond(age <= 60, 3, cond(age <= 64, 4, .))))
The penultimate step is important as it decrements the age if their DOB is in the same month as the comparison date but has yet to be realised.
+----------------------------+
| dob age age_cat |
|----------------------------|
1. | 30jan2004 17.4 1 |
2. | 14aug1998 22.8 2 |
3. | 06aug1998 22.8 2 |
4. | 31aug1994 26.8 2 |
5. | 27mar1990 31.3 2 |
|----------------------------|
6. | 12jun1968 53 3 |
7. | 05may1964 57.1 3 |
8. | 06aug1994 26.8 2 |
9. | 21jun1989 31.9 2 |
10. | 10aug1984 36.8 2 |
|----------------------------|
11. | 22oct2001 19.7 1 |
12. | 03may1972 49.1 3 |
+----------------------------+
Note that the decimal is just approximate as it uses the month of the birthday and not the actual date.
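If an exact age in completed years is what's needed, the usual rule is to subtract the birth year and take one off when the birthday has not yet happened in the comparison year. Here is a small Python sketch of that rule (Python rather than Stata, purely for illustration), using the dates from the question, the 18Jun2021 comparison date from the example above, and the same cut points as the cond() expression. The function names are my own, and parsing the month abbreviation assumes an English locale.

```python
from datetime import date, datetime


def parse_dob(text):
    """Parse a date written like '01jan1956' (day, month abbreviation, year)."""
    return datetime.strptime(text, "%d%b%Y").date()


def exact_age(dob, today):
    """Completed years between dob and today."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)


def age_category(age):
    """Same cut points as the cond() expression: 20, 40, 60 and 64."""
    for upper, category in ((20, 1), (40, 2), (60, 3), (64, 4)):
        if age <= upper:
            return category
    return None  # older than 64: no category, like Stata's missing value


today = date(2021, 6, 18)  # comparison date, taken from the example above
for text in ("01jan1986", "05jan2001", "07mar1983"):
    age = exact_age(parse_dob(text), today)
    print(text, age, age_category(age))
```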
You got some good advice in other answers, but this can be as simple as you want.
Consider this example, noting that presenting data as code we can run is a really helpful detail.
* Example generated by -dataex-. For more info, type help dataex
clear
input str9 sdate float dob
"01jan1986" 9497
"05jan2001" 14980
"07mar1983" 8466
end
format %td dob
The age at end 2020 is just 2020 minus the year people were born. Use any other year if it makes more sense.
. gen age = 2020 - year(dob)
. l
+-----------------------------+
| sdate dob age |
|-----------------------------|
1. | 01jan1986 01jan1986 34 |
2. | 05jan2001 05jan2001 19 |
3. | 07mar1983 07mar1983 37 |
+-----------------------------+
For 20 year bins, why not make them self-describing. Thus with this code, 20, 40 etc. are the upper limit of each bin. (You might need to tweak that if you have children under 1 year old in your data.)
. gen age2 = 20 * ceil(age/20)
. l
+------------------------------------+
| sdate dob age age2 |
|------------------------------------|
1. | 01jan1986 01jan1986 34 40 |
2. | 05jan2001 05jan2001 19 20 |
3. | 07mar1983 07mar1983 37 40 |
+------------------------------------+
This paper is a review of rounding and binning using Stata.

Continuous Parent and Discrete Child Conditional with Observed Data in PyMC

I am new to PyMC and trying to set up the simple conditional probability model: P(has_diabetes|bmi, race). Race can take on 5 discrete values encoded as 0-4 and BMI can take on a non-zero positive real number. So far I have the parent variables setup like this:
p_race = [0.009149232914923292,
0.15656903765690378,
0.019637377963737795,
0.013947001394700141,
0.800697350069735]
race = pymc.Categorical('race', p_race)
bmi_alpha = pymc.Exponential('bmi_alpha', 1)
bmi_beta = pymc.Exponential('bmi_beta', 1)
bmi = pymc.Gamma('bmi', bmi_alpha, bmi_beta, value=bmis, observed=True)
I have observed data that looks like:
| bmi | race | has_diabetes |
| 21.7 | 1 | 0 |
| 45.3 | 4 | 1 |
| 18.9 | 2 | 0 |
| 26.6 | 0 | 0 |
| 35.1 | 4 | 0 |
I am attempting to model has_diabetes as:
has_diabetes = pymc.Bernoulli('has_diabetes', p_diabetes, value=data, observed=True)
My problem is that I am not sure how to construct the p_diabetes function, since it depends on the value of race and on the continuous value of bmi.
You need to construct a deterministic function that generates p_diabetes as a function of your predictors. The safest way to do this is via a logit-linear transformation. For example:
intercept = pymc.Normal('intercept', 0, 0.01, value=0)
beta_race = pymc.Normal('beta_race', 0, 0.01, value=np.zeros(4))
beta_bmi = pymc.Normal('beta_bmi', 0, 0.01, value=0)
@pymc.deterministic
def p_diabetes(b0=intercept, b1=beta_race, b2=beta_bmi):
    # Prepend a zero for baseline
    b1 = np.append(0, b1)
    # Logit-linear model
    return pymc.invlogit(b0 + b1[race] + b2*bmi)
I would make the baseline race be the largest group (it is assumed to be index 0 in this example).
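To make the logit-linear idea concrete outside of PyMC, here is a plain-Python sketch of the same arithmetic: a linear predictor with a zero effect for the baseline race, pushed through the inverse logit. The coefficient values are made up purely for illustration (in the model they would be sampled), and the function signature is my own.

```python
import math


def invlogit(x):
    """Inverse logit (logistic) function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def p_diabetes(race, bmi, intercept, beta_race, beta_bmi):
    """Probability of diabetes for one observation."""
    # Prepend a zero so that race 0 is the baseline category
    race_effects = [0.0] + list(beta_race)
    return invlogit(intercept + race_effects[race] + beta_bmi * bmi)


# Hypothetical coefficients, for illustration only
intercept, beta_race, beta_bmi = -4.0, [0.3, -0.2, 0.1, 0.5], 0.1

# (bmi, race) pairs from the observed data in the question
rows = [(21.7, 1), (45.3, 4), (18.9, 2), (26.6, 0), (35.1, 4)]
probs = [p_diabetes(race, bmi, intercept, beta_race, beta_bmi) for bmi, race in rows]
for (bmi, race), p in zip(rows, probs):
    print(bmi, race, round(p, 3))
```

Whatever the coefficients are, the result always lies strictly between 0 and 1, which is why this transformation is a safe choice for a Bernoulli probability.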
Actually, it's not clear what the first part of the model above is for, specifically why you are building models for the predictors, but perhaps I am missing something.

Understanding the Recursion of mergesort

Most of the mergesort implementations I see are similar to this one, both in the intro to algorithms book and in the online implementations I have searched for. My recursion chops don't go much further than messing with Fibonacci generation (which was simple enough), so maybe it's the multiple recursions blowing my mind, but I can't step through the code and understand what's going on even before I hit the merge function.
How is it stepping through this? Is there some strategy or reading I should undergo to better understand the process here?
void mergesort(int *a, int*b, int low, int high)
{
    int pivot;
    if(low<high)
    {
        pivot=(low+high)/2;
        mergesort(a,b,low,pivot);
        mergesort(a,b,pivot+1,high);
        merge(a,b,low,pivot,high);
    }
}
and the merge(although frankly I'm mentally stuck before I even get to this part)
void merge(int *a, int *b, int low, int pivot, int high)
{
    int h,i,j,k;
    h=low;
    i=low;
    j=pivot+1;
    while((h<=pivot)&&(j<=high))
    {
        if(a[h]<=a[j])
        {
            b[i]=a[h];
            h++;
        }
        else
        {
            b[i]=a[j];
            j++;
        }
        i++;
    }
    if(h>pivot)
    {
        for(k=j; k<=high; k++)
        {
            b[i]=a[k];
            i++;
        }
    }
    else
    {
        for(k=h; k<=pivot; k++)
        {
            b[i]=a[k];
            i++;
        }
    }
    for(k=low; k<=high; k++) a[k]=b[k];
}
MERGE SORT:
1) Split the array in half
2) Sort the left half
3) Sort the right half
4) Merge the two halves together
I think the "sort" function name in MergeSort is a bit of a misnomer, it should really be called "divide".
Here is a visualization of the algorithm in process.
Each time the function recurses, it's working on a smaller and smaller subdivision of the input array, starting with the left half of it. Each time the function returns from recursion, it will continue on and either start working on the right half, or recurse up again and work on a larger half.
Like this
[************************]mergesort
[************]mergesort(lo,mid)
[******]mergesort(lo,mid)
[***]mergesort(lo,mid)
[**]mergesort(lo,mid)
[**]mergesort(mid+1,hi)
[***]merge
[***]mergesort(mid+1,hi)
[**]mergesort(lo,mid)
[**]mergesort(mid+1,hi)
[***]merge
[******]merge
[******]mergesort(mid+1,hi)
[***]mergesort(lo,mid)
[**]mergesort(lo,mid)
[**]mergesort(mid+1,hi)
[***]merge
[***]mergesort(mid+1,hi)
[**]mergesort(lo,mid)
[**]mergesort(mid+1,hi)
[***]merge
[******]merge
[************]merge
[************]mergesort(mid+1,hi)
[******]mergesort(lo,mid)
[***]mergesort(lo,mid)
[**]mergesort(lo,mid)
[**]mergesort(mid+1,hi)
[***]merge
[***]mergesort(mid+1,hi)
[**]mergesort(lo,mid)
[**]mergesort(mid+1,hi)
[***]merge
[******]merge
[******]mergesort(mid+1,hi)
[***]mergesort(lo,mid)
[**]mergesort(lo,mid)
[**]mergesort(mid+1,hi)
[***]merge
[***]mergesort(mid+1,hi)
[**]mergesort(lo,mid)
[**]mergesort(mid+1,hi)
[***]merge
[******]merge
[************]merge
[************************]merge
An obvious thing to do would be to try this merge sort on a small array, say size 8 (power of 2 is convenient here), on paper. Pretend you are a computer executing the code, and see if it starts to become a bit clearer.
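If doing it on paper gets tedious, the same exercise can be done by letting the code narrate itself. Below is a rough Python sketch (not the C code from the question) that follows the same low/pivot/high structure and records every call and every merge, indented by recursion depth. The depth and trace parameters and the sample array are my own additions for illustration.

```python
def mergesort(a, low, high, depth=0, trace=None):
    """Sort a[low..high] in place and return a list describing each step."""
    if trace is None:
        trace = []
    trace.append("  " * depth + f"mergesort({low}, {high})")
    if low < high:
        pivot = (low + high) // 2
        mergesort(a, low, pivot, depth + 1, trace)       # sort the left half
        mergesort(a, pivot + 1, high, depth + 1, trace)  # sort the right half
        # Merge the two sorted runs a[low..pivot] and a[pivot+1..high]
        left, right = a[low:pivot + 1], a[pivot + 1:high + 1]
        i = j = 0
        k = low
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                a[k] = left[i]
                i += 1
            else:
                a[k] = right[j]
                j += 1
            k += 1
        a[k:high + 1] = left[i:] + right[j:]  # copy whatever is left over
        trace.append("  " * depth + f"merge({low}, {pivot}, {high}) -> {a[low:high + 1]}")
    return trace


data = [9, 7, 2, 5, 6, 3, 4, 1]  # arbitrary array of size 8
trace = mergesort(data, 0, len(data) - 1)
print("\n".join(trace))
```

Reading the printed lines top to bottom shows the order in which calls start and merges finish, which is the same order as in the diagram above.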
Your question is a bit ambiguous because you don't explain what you find confusing, but it sounds like you are trying to unroll the recursive calls in your head. Which may or may not be a good thing, but I think it can easily lead to having too much in your head at once. Instead of trying to trace the code from start to end, see if you can understand the concept abstractly. Merge sort:
1) Splits the array in half
2) Sorts the left half
3) Sorts the right half
4) Merges the two halves together
(1) should be fairly obvious and intuitive to you. For step (2) the key insight is this, the left half of an array... is an array. Assuming your merge sort works, it should be able to sort the left half of the array. Right? Step (4) is actually a pretty intuitive part of the algorithm. An example should make it trivial:
at the start
left: [1, 3, 5], right: [2, 4, 6, 7], out: []
after step 1
left: [3, 5], right: [2, 4, 6, 7], out: [1]
after step 2
left: [3, 5], right: [4, 6, 7], out: [1, 2]
after step 3
left: [5], right: [4, 6, 7], out: [1, 2, 3]
after step 4
left: [5], right: [6, 7], out: [1, 2, 3, 4]
after step 5
left: [], right: [6, 7], out: [1, 2, 3, 4, 5]
after step 6
left: [], right: [7], out: [1, 2, 3, 4, 5, 6]
at the end
left: [], right: [], out: [1, 2, 3, 4, 5, 6, 7]
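The walkthrough above corresponds to a loop that repeatedly takes the smaller of the two front elements. Here is a short Python sketch of that loop; it returns a new list instead of writing into a scratch array as the C version does, and it prints the remaining left and right parts after each comparison so the output can be compared with the steps above.

```python
def merge(left, right):
    """Merge two sorted lists by repeatedly taking the smaller front item."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
        print(f"left: {left[i:]}, right: {right[j:]}, out: {out}")
    # One side is exhausted; whatever remains on the other is already sorted
    out.extend(left[i:])
    out.extend(right[j:])
    return out


result = merge([1, 3, 5], [2, 4, 6, 7])
print(result)
```

Note that the loop stops as soon as one side runs out, so the last two steps of the walkthrough happen in one go when the leftovers are appended.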
So assuming that you understand (1) and (4), another way to think of merge sort would be this. Imagine someone else wrote mergesort() and you're confident that it works. Then you could use that implementation of mergesort() to write:
sort(myArray)
{
leftHalf = myArray.subArray(0, myArray.Length/2);
rightHalf = myArray.subArray(myArray.Length/2 + 1, myArray.Length - 1);
sortedLeftHalf = mergesort(leftHalf);
sortedRightHalf = mergesort(rightHalf);
sortedArray = merge(sortedLeftHalf, sortedRightHalf);
}
Note that sort doesn't use recursion. It just says "sort both halves and then merge them". If you understood the merge example above then hopefully you see intuitively that this sort function seems to do what it says... sort.
Now, if you look at it more carefully... sort() looks pretty much exactly like mergesort()! That's because it is mergesort() (except it doesn't have base cases because it's not recursive!).
But that's how I like thinking of recursive functions--assume that the function works when you call it. Treat it as a black box that does what you need it to. When you make that assumption, figuring out how to fill in that black box is often easy. For a given input, can you break it down into smaller inputs to feed to your black box? After you solve that, the only thing that's left is handling the base cases at the start of your function (which are the cases where you don't need to make any recursive calls. For example, mergesort([]) just returns an empty array; it doesn't make a recursive call to mergesort()).
Finally, this is a bit abstract, but a good way to understand recursion is actually to write mathematical proofs using induction. The same strategy used to write a proof by induction is used to write a recursive function:
Math proof:
Show the claim is true for the base cases
Assume it is true for inputs smaller than some n
Use that assumption to show that it is still true for an input of size n
Recursive function:
Handle the base cases
Assume that your recursive function works on inputs smaller than some n
Use that assumption to handle an input of size n
Concerning the recursion part of the merge sort, I've found this page to be very very helpful. You can follow the code as it's being executed. It shows you what gets executed first, and what follows next.
Tom
mergesort() simply keeps dividing the array into two halves until the if condition, low < high, fails. Because you call mergesort() twice, once with low to pivot and once with pivot+1 to high, the sub-arrays are divided further and further.
Let's take an example:
a[] = {9,7,2,5,6,3,4}
pivot = (0+6)/2 (which will be 3)
=> the first mergesort call recurses on a[0..3] = {9,7,2,5} : Left Array
=> the second call is passed a[4..6] = {6,3,4} : Right Array
This repeats until each left and right array has 1 element.
In the end you'll have something similar to this:
L : {9} {7} {2} {5} R : {6} {3} {4} (each L and R will have further sub L and R)
=> which on the calls to merge will become
L(L{7,9} R{2,5}) : R(L{3,6} R{4})
As you can see, each sub-array gets sorted in the merge function.
=> on the next calls to merge, the next L and R sub-arrays are put in order
L{2,5,7,9} : R{3,4,6}
Now both the L and R sub-arrays are sorted internally.
On the last call to merge they are merged in order.
The final array is sorted => {2,3,4,5,6,7,9}
See the merging steps in the answer given by @roliu
My apologies if this has been answered this way. I acknowledge that this is just a sketch, rather than a deep explanation.
While it is not obvious to see how the actual code maps to the recursion, I was able to understand the recursion in a general sense this way.
Take the example unsorted set {2,9,7,5} as input. The merge_sort algorithm is denoted by "ms" for brevity below. Then we can sketch the operation as:
step 1: ms( ms( ms(2),ms(9) ), ms( ms(7),ms(5) ) )
step 2: ms( ms({2},{9}), ms({7},{5}) )
step 3: ms( {2,9}, {5,7} )
step 4: {2,5,7,9}
It is important to note that merge_sort of a singlet (like {2}) is simply the singlet (ms(2) = {2}), so that at the deepest level of recursion we get our first answer. The remaining answers then tumble like dominoes as the interior recursions finish and are merged together.
Part of the genius of the algorithm is the way it builds the recursive formula of step 1 automatically through its construction. What helped me was the exercise of thinking how to turn step 1 above from a static formula to a general recursion.
Trying to work out each and every step of a recursion is often not an ideal approach, but for beginners, it definitely helps to understand the basic idea behind recursion, and also to get better at writing recursive functions.
Here's a C solution to Merge Sort :-
#include <stdio.h>
#include <stdlib.h>
void merge_sort(int *, unsigned);
void merge(int *, int *, int *, unsigned, unsigned);
int main(void)
{
    unsigned size;
    printf("Enter the no. of integers to be sorted: ");
    scanf("%u", &size);
    int * arr = (int *) malloc(size * sizeof(int));
    if (arr == NULL)
        exit(EXIT_FAILURE);
    printf("Enter %u integers: ", size);
    for (unsigned i = 0; i < size; i++)
        scanf("%d", &arr[i]);
    merge_sort(arr, size);
    printf("\nSorted array: ");
    for (unsigned i = 0; i < size; i++)
        printf("%d ", arr[i]);
    printf("\n");
    free(arr);
    return EXIT_SUCCESS;
}
void merge_sort(int * arr, unsigned size)
{
    if (size > 1)
    {
        unsigned left_size = size / 2;
        int * left = (int *) malloc(left_size * sizeof(int));
        if (left == NULL)
            exit(EXIT_FAILURE);
        for (unsigned i = 0; i < left_size; i++)
            left[i] = arr[i];
        unsigned right_size = size - left_size;
        int * right = (int *) malloc(right_size * sizeof(int));
        if (right == NULL)
            exit(EXIT_FAILURE);
        for (unsigned i = 0; i < right_size; i++)
            right[i] = arr[i + left_size];
        merge_sort(left, left_size);
        merge_sort(right, right_size);
        merge(arr, left, right, left_size, right_size);
        free(left);
        free(right);
    }
}
/*
This merge() function takes a target array (arr) and two sorted arrays (left and right),
all three of them allocated beforehand in some other function(s).
It then merges the two sorted arrays (left and right) into a single sorted array (arr).
It should be ensured that the size of arr is equal to the size of left plus the size of right.
*/
void merge(int * arr, int * left, int * right, unsigned left_size, unsigned right_size)
{
    unsigned i = 0, j = 0, k = 0;
    while ((i < left_size) && (j < right_size))
    {
        if (left[i] <= right[j])
            arr[k++] = left[i++];
        else
            arr[k++] = right[j++];
    }
    while (i < left_size)
        arr[k++] = left[i++];
    while (j < right_size)
        arr[k++] = right[j++];
}
Here's the step-by-step explanation of the recursion :-
Let arr be [1,4,0,3,7,9,8], having the address 0x0000.
In main(), merge_sort(arr, 7) is called, which is the same as merge_sort(0x0000, 7).
After all of the recursions are completed, arr (0x0000) becomes [0,1,3,4,7,8,9].
| | |
| | |
| | |
| | |
| | |
arr - 0x0000 - [1,4,0,3,7,9,8] | | |
size - 7 | | |
| | |
left = malloc() - 0x1000a (say) - [1,4,0] | | |
left_size - 3 | | |
| | |
right = malloc() - 0x1000b (say) - [3,7,9,8] | | |
right_size - 4 | | |
| | |
merge_sort(left, left_size) -------------------> | arr - 0x1000a - [1,4,0] | |
| size - 3 | |
| | |
| left = malloc() - 0x2000a (say) - [1] | |
| left_size = 1 | |
| | |
| right = malloc() - 0x2000b (say) - [4,0] | |
| right_size = 2 | |
| | |
| merge_sort(left, left_size) -------------------> | arr - 0x2000a - [1] |
| | size - 1 |
| left - 0x2000a - [1] <-------------------------- | (0x2000a has only 1 element) |
| | |
| | |
| merge_sort(right, right_size) -----------------> | arr - 0x2000b - [4,0] |
| | size - 2 |
| | |
| | left = malloc() - 0x3000a (say) - [4] |
| | left_size = 1 |
| | |
| | right = malloc() - 0x3000b (say) - [0] |
| | right_size = 1 |
| | |
| | merge_sort(left, left_size) -------------------> | arr - 0x3000a - [4]
| | | size - 1
| | left - 0x3000a - [4] <-------------------------- | (0x3000a has only 1 element)
| | |
| | |
| | merge_sort(right, right_size) -----------------> | arr - 0x3000b - [0]
| | | size - 1
| | right - 0x3000b - [0] <------------------------- | (0x3000b has only 1 element)
| | |
| | |
| | merge(arr, left, right, left_size, right_size) |
| | i.e. merge(0x2000b, 0x3000a, 0x3000b, 1, 1) |
| right - 0x2000b - [0,4] <----------------------- | (0x2000b is now sorted) |
| | |
| | free(left) (0x3000a is now freed) |
| | free(right) (0x3000b is now freed) |
| | |
| | |
| merge(arr, left, right, left_size, right_size) | |
| i.e. merge(0x1000a, 0x2000a, 0x2000b, 1, 2) | |
left - 0x1000a - [0,1,4] <---------------------- | (0x1000a is now sorted) | |
| | |
| free(left) (0x2000a is now freed) | |
| free(right) (0x2000b is now freed) | |
| | |
| | |
merge_sort(right, right_size) -----------------> | arr - 0x1000b - [3,7,9,8] | |
| size - 4 | |
| | |
| left = malloc() - 0x2000c (say) - [3,7] | |
| left_size = 2 | |
| | |
| right = malloc() - 0x2000d (say) - [9,8] | |
| right_size = 2 | |
| | |
| merge_sort(left, left_size) -------------------> | arr - 0x2000c - [3,7] |
| | size - 2 |
| | |
| | left = malloc() - 0x3000c (say) - [3] |
| | left_size = 1 |
| | |
| | right = malloc() - 0x3000d (say) - [7] |
| | right_size = 1 |
| | |
| | merge_sort(left, left_size) -------------------> | arr - 0x3000c - [3]
| left - [3,7] was already sorted, but | | size - 1
| that doesn't matter to this program. | left - 0x3000c - [3] <-------------------------- | (0x3000c has only 1 element)
| | |
| | |
| | merge_sort(right, right_size) -----------------> | arr - 0x3000d - [7]
| | | size - 1
| | right - 0x3000d - [7] <------------------------- | (0x3000d has only 1 element)
| | |
| | |
| | merge(arr, left, right, left_size, right_size) |
| | i.e. merge(0x2000c, 0x3000c, 0x3000d, 1, 1) |
| left - 0x2000c - [3,7] <------------------------ | (0x2000c is now sorted) |
| | |
| | free(left) (0x3000c is now freed) |
| | free(right) (0x3000d is now freed) |
| | |
| | |
| merge_sort(right, right_size) -----------------> | arr - 0x2000d - [9,8] |
| | size - 2 |
| | |
| | left = malloc() - 0x3000e (say) - [9] |
| | left_size = 1 |
| | |
| | right = malloc() - 0x3000f (say) - [8] |
| | right_size = 1 |
| | |
| | merge_sort(left, left_size) -------------------> | arr - 0x3000e - [9]
| | | size - 1
| | left - 0x3000e - [9] <-------------------------- | (0x3000e has only 1 element)
| | |
| | |
| | merge_sort(right, right_size) -----------------> | arr - 0x3000f - [8]
| | | size - 1
| | right - 0x3000f - [8] <------------------------- | (0x3000f has only 1 element)
| | |
| | |
| | merge(arr, left, right, left_size, right_size) |
| | i.e. merge(0x2000d, 0x3000e, 0x3000f, 1, 1) |
| right - 0x2000d - [8,9] <----------------------- | (0x2000d is now sorted) |
| | |
| | free(left) (0x3000e is now freed) |
| | free(right) (0x3000f is now freed) |
| | |
| | |
| merge(arr, left, right, left_size, right_size) | |
| i.e. merge(0x1000b, 0x2000c, 0x2000d, 2, 2) | |
right - 0x1000b - [3,7,8,9] <------------------- | (0x1000b is now sorted) | |
| | |
| free(left) (0x2000c is now freed) | |
| free(right) (0x2000d is now freed) | |
| | |
| | |
merge(arr, left, right, left_size, right_size) | | |
i.e. merge(0x0000, 0x1000a, 0x1000b, 3, 4) | | |
(0x0000 is now sorted) | | |
| | |
free(left) (0x1000a is now freed) | | |
free(right) (0x1000b is now freed) | | |
| | |
| | |
| | |
I know this is an old question but wanted to throw my thoughts of what helped me understand merge sort.
There are two big parts to merge sort
Splitting of the array into smaller chunks (dividing)
Merging the array together (conquering)
The role of the recursion is simply the dividing portion.
I think what confuses most people is that they think there is a lot of logic in the splitting and determining what to split, but most of the actual logic of sorting happens on the merge. The recursion is simply there to divide and do the first half and then the second half is really just looping, copying things over.
I see some answers that mention pivots but I would recommend not associating the word "pivot" with merge sort because that's an easy way to confuse merge sort with quicksort (which is heavily reliant on choosing a "pivot"). They are both "divide and conquer" algorithms. For merge sort the division always happens in the middle whereas for quicksort you can be clever with the division when choosing an optimal pivot.
The process divides the problem into subproblems.
The example below should help you understand the recursion. int A[] = {the elements to be sorted}; int p = 0 (lower index); int r = A.length - 1 (higher index).
class DivideConqure1 {
    void devide(int A[], int p, int r) {
        if (p < r) {
            int q = (p + r) / 2; // divide problem into sub problems.
            devide(A, p, q); //divide left problem into sub problems
            devide(A, q + 1, r); //divide right problem into sub problems
            merger(A, p, q, r); //merge the sub problems
        }
    }
    void merger(int A[], int p, int q, int r) {
        int L[] = new int[q - p + 1];
        int R[] = new int[r - q];
        int a1 = 0;
        int b1 = 0;
        for (int i = p; i <= q; i++) { //store left sub problem in left temp
            L[a1] = A[i];
            a1++;
        }
        for (int i = q + 1; i <= r; i++) { //store right sub problem in right temp
            R[b1] = A[i];
            b1++;
        }
        int a = 0;
        int b = 0;
        int c = 0;
        for (int i = p; i < r; i++) {
            if (a < L.length && b < R.length) {
                c = i + 1;
                if (L[a] <= R[b]) { //compare left element<= right element
                    A[i] = L[a];
                    a++;
                } else {
                    A[i] = R[b];
                    b++;
                }
            }
        }
        if (a < L.length)
            for (int i = a; i < L.length; i++) {
                A[c] = L[i]; //store remaining elements of left temp into main problem
                c++;
            }
        if (b < R.length)
            for (int i = b; i < R.length; i++) {
                A[c] = R[i]; //store remaining elements of right temp into main problem
                c++;
            }
    }
}
When you call a recursive method, the call does not run to completion right away: it is pushed onto the call stack, and the caller waits until it returns. When the condition low < high is not satisfied, the call returns and execution moves on to the next line of the caller.
Consider that this is your array:
int a[] = {10,12,9,13,8,7,11,5};
So your merge sort method will work like this:
mergeSort(arr a, arr empty, 0, 7);
mergeSort(arr a, arr empty, 0, 3);
mergeSort(arr a, arr empty, 0, 1);
mergeSort(arr a, arr empty, 0, 0);
here low == high, so it returns from this call and goes on to the next one:
mergeSort(arr a, arr empty, 0+1, 1);
here too low == high, so it returns from this call as well and calls:
merger(arr a, arr empty, 0, 0, 1);
mergeSort(arr a, arr empty, 2, 3);
.
.
And so on.
The sorted values are written into the empty arr along the way.
It might help to understand how the recursive function works.

Algorithm for finding record with most matching attributes

I'm looking for some algorithm that for a given record with n properties with n possible values each (int, string etc) searches a number of existing records and gives back the one that matches the most properties.
Example:
A = 1
B = 1
C = 1
D = f
A | B | C | D
----+-----+-----+----
1 | 1 | 9 | f <
2 | 3 | 1 | g
3 | 4 | 2 | h
2 | 5 | 8 | j
3 | 6 | 5 | h
The first row would be the one I'm looking for, as it has the most matching values. I think it doesn't need to calculate any closeness to the values, because then row 2 might be more matching.
Loop through each row and add one to the row's score for every field that matches (the first row in the example scores 3). When that's done, you have a result set of scores which you can sort.
The basic algorithm could look like (in java pseudo code):
int bestMatchIdx = -1;
int currMatches = 0;
int bestMatches = 0;
for ( int row = 0 ; row < numRows ; row++ ) {
    currMatches = 0;
    for ( int col = 0 ; col < numCols ; col++ ) {
        // count every column where the candidate row equals the search row
        if ( search[col].equals( rows[ row ][ col ] ))
            currMatches++;
    }
    if ( currMatches > bestMatches ) {
        bestMatchIdx = row;
        bestMatches = currMatches;
    }
}
This assumes that you have an equals function for the comparison and that the data is stored in a 2D array. 'search' is the reference row that all other rows are compared against.

Quickest way to determine range overlap in Perl

I have two sets of ranges. Each range is a pair of integers (start and end) representing some sub-range of a single larger range. The two sets of ranges are in a structure similar to this (of course the ...s would be replaced with actual numbers).
$a_ranges =
{
    a_1 =>
    {
        start => ...,
        end => ...,
    },
    a_2 =>
    {
        start => ...,
        end => ...,
    },
    a_3 =>
    {
        start => ...,
        end => ...,
    },
    # and so on
};
$b_ranges =
{
    b_1 =>
    {
        start => ...,
        end => ...,
    },
    b_2 =>
    {
        start => ...,
        end => ...,
    },
    b_3 =>
    {
        start => ...,
        end => ...,
    },
    # and so on
};
I need to determine which ranges from set A overlap with which ranges from set B. Given two ranges, it's easy to determine whether they overlap. I've simply been using a double loop to do this--loop through all elements in set A in the outer loop, loop through all elements of set B in the inner loop, and keep track of which ones overlap.
I'm having two problems with this approach. First is that the overlap space is extremely sparse--even if there are thousands of ranges in each set, I expect each range from set A to overlap with maybe 1 or 2 ranges from set B. My approach enumerates every single possibility, which is overkill. This leads to my second problem--the fact that it scales very poorly. The code finishes very quickly (sub-minute) when there are hundreds of ranges in each set, but takes a very long time (+/- 30 minutes) when there are thousands of ranges in each set.
Is there a better way I can index these ranges so that I'm not doing so many unnecessary checks for overlap?
Update: The output I'm looking for is two hashes (one for each set of ranges) where the keys are range IDs and the values are the IDs of the ranges from the other set that overlap with the given range in this set.
This sounds like the perfect use case for an interval tree, which is a data structure specifically designed to support this operation. If you have two sets of intervals of sizes m and n, then you can build one of them into an interval tree in time O(m lg m) and then do n intersection queries in time O(n lg m + k), where k is the total number of intersections you find. This gives a net runtime of O((m + n) lg m + k). Remember that in the worst case k = O(nm), so this isn't any better than what you have, but for cases where the number of intersections is sparse this can be substantially better than the O(mn) runtime you have right now.
I don't have much experience working with interval trees (and zero experience in Perl, sorry!), but from the description it seems like they shouldn't be that hard to build. I'd be pretty surprised if one didn't exist already.
Hope this helps!
In case you are not exclusively tied to Perl: the IRanges package in R deals with interval arithmetic. It has very powerful primitives, and it would probably be easy to code a solution with them.
A second remark is that the problem could possibly become very easy if the intervals have additional structure; for example, if within each set of ranges there is no overlap (in that case a linear approach sifting through the two ordered sets simultaneously is possible). Even in the absence of such structure, the least you can do is to sort one set of ranges by start point, and the other set by end point, then break out of the inner loop once a match is no longer possible. Of course, existing and general algorithms and data structures such as the interval tree mentioned earlier are the most powerful.
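As a rough sketch of the linear approach for the special case where the ranges within each set do not overlap one another, here is a two-pointer sweep in Python (not Perl). The function name `sweep_overlaps` and the `(id, start, end)` tuple layout are assumptions made for this example, and range ends are treated as inclusive.

```python
def sweep_overlaps(a, b):
    """Find overlapping pairs between two sets of ranges.

    a, b: lists of (id, start, end) tuples with inclusive ends.
    Assumes the ranges inside each list do not overlap one another.
    """
    a = sorted(a, key=lambda r: r[1])
    b = sorted(b, key=lambda r: r[1])
    pairs = []
    i = j = 0
    while i < len(a) and j < len(b):
        a_id, a_start, a_end = a[i]
        b_id, b_start, b_end = b[j]
        if a_start <= b_end and b_start <= a_end:
            pairs.append((a_id, b_id))
        # Advance whichever range ends first; it cannot overlap anything later.
        if a_end < b_end:
            i += 1
        else:
            j += 1
    return pairs
```

After the two sorts, each step advances one of the two pointers, so the sweep itself takes time linear in the total number of ranges.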
There are several existing CPAN modules that solve this issue. I have developed two of them: Data::Range::Compare and Data::Range::Compare::Stream.
Data::Range::Compare only works with arrays in memory, but supports generic range types.
Data::Range::Compare::Stream works with streams of data via iterators, but it requires understanding OO Perl to extend it to generic data types. Data::Range::Compare::Stream is recommended if you are processing very large sets of data.
Here is an excerpt from the Examples folder of Data::Range::Compare::Stream.
Given these 3 sets of data:
Numeric Range set: A contained in file: source_a.src
+----------+
| 1 - 11 |
| 13 - 44 |
| 17 - 23 |
| 55 - 66 |
+----------+
Numeric Range set: B contained in file: source_b.src
+----------+
| 0 - 1 |
| 2 - 29 |
| 88 - 133 |
+----------+
Numeric Range set: C contained in file: source_c.src
+-----------+
| 17 - 29 |
| 220 - 240 |
| 241 - 250 |
+-----------+
The expected output would be:
+--------------------------------------------------------------------+
| Common Range | Numeric Range A | Numeric Range B | Numeric Range C |
+--------------------------------------------------------------------+
| 0 - 0 | No Data | 0 - 1 | No Data |
| 1 - 1 | 1 - 11 | 0 - 1 | No Data |
| 2 - 11 | 1 - 11 | 2 - 29 | No Data |
| 12 - 12 | No Data | 2 - 29 | No Data |
| 13 - 16 | 13 - 44 | 2 - 29 | No Data |
| 17 - 29 | 13 - 44 | 2 - 29 | 17 - 29 |
| 30 - 44 | 13 - 44 | No Data | No Data |
| 55 - 66 | 55 - 66 | No Data | No Data |
| 88 - 133 | No Data | 88 - 133 | No Data |
| 220 - 240 | No Data | No Data | 220 - 240 |
| 241 - 250 | No Data | No Data | 241 - 250 |
+--------------------------------------------------------------------+
The source code can be found here.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use lib qw(./ ../lib);
# custom package from FILE_EXAMPLE.pl
use Data::Range::Compare::Stream::Iterator::File;
use Data::Range::Compare::Stream;
use Data::Range::Compare::Stream::Iterator::Consolidate;
use Data::Range::Compare::Stream::Iterator::Compare::Asc;
my $source_a=Data::Range::Compare::Stream::Iterator::File->new(filename=>'source_a.src');
my $source_b=Data::Range::Compare::Stream::Iterator::File->new(filename=>'source_b.src');
my $source_c=Data::Range::Compare::Stream::Iterator::File->new(filename=>'source_c.src');
my $consolidator_a=new Data::Range::Compare::Stream::Iterator::Consolidate($source_a);
my $consolidator_b=new Data::Range::Compare::Stream::Iterator::Consolidate($source_b);
my $consolidator_c=new Data::Range::Compare::Stream::Iterator::Consolidate($source_c);
my $compare=new Data::Range::Compare::Stream::Iterator::Compare::Asc();
my $src_id_a=$compare->add_consolidator($consolidator_a);
my $src_id_b=$compare->add_consolidator($consolidator_b);
my $src_id_c=$compare->add_consolidator($consolidator_c);
print " +--------------------------------------------------------------------+
| Common Range | Numeric Range A | Numeric Range B | Numeric Range C |
+--------------------------------------------------------------------+\n";
my $format=' | %-12s | %-13s | %-13s | %-13s |'."\n";
while($compare->has_next) {
    my $result=$compare->get_next;
    my $string=$result->to_string;
    my @data=($result->get_common);
    next if $result->is_empty;
    for(0 .. 2) {
        my $column=$result->get_column_by_id($_);
        unless(defined($column)) {
            $column="No Data";
        } else {
            $column=$column->get_common->to_string;
        }
        push @data,$column;
    }
    printf $format,@data;
}
print " +--------------------------------------------------------------------+\n";
You could try Tree::RB, although my own use case was finding mutually exclusive ranges, with no overlaps.
The performance is not bad. I had about 10000 segments and had to find the segment for each discrete number. My input had 300 million records, and I needed to put them into separate buckets, like partitioning the data. Tree::RB worked out great.
$var = [
[0,90],
[91,2930],
[2950,8293]
.
.
.
]
My lookup values were 10, 99, 991 ...
and basically I needed the position of the range for the given number.
My comparison function looks something like this:
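my $cmp = sub
{
    my ($a1, $b1) = @_;
    # Two ranges: order them by comparing the end of one with the start of the other.
    if(ref($b1) && ref($a1))
    {
        return ($$a1[1]) <=> ($$b1[0]);
    }
    my $ret = 0;   # 0 means the number falls inside the range
    if(ref($a1) eq 'ARRAY')
    {
        # $a1 is a range, $b1 is a number
        if($$a1[1] < $b1)
        {
            $ret = -1;   # range lies entirely below the number
        }
        if($$a1[0] > $b1)
        {
            $ret = 1;    # range lies entirely above the number
        }
    }
    else
    {
        # $a1 is a number, $b1 is a range
        if($$b1[0] > $a1)
        {
            $ret = -1;   # number lies below the range
        }
        if($$b1[1] < $a1)
        {
            $ret = 1;    # number lies above the range
        }
    }
    return $ret;
};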
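For comparison, the same kind of bucket lookup can be sketched in Python with a binary search over the sorted range starts instead of a red-black tree. This is a different technique from Tree::RB, shown only to illustrate the lookup itself. The helper name `make_lookup` is hypothetical, and the ranges are assumed to be non-overlapping with inclusive ends.

```python
from bisect import bisect_right


def make_lookup(ranges):
    """Return a function mapping a number to the position of its range.

    ranges: list of [start, end] pairs, non-overlapping, inclusive ends.
    """
    ranges = sorted(ranges)
    starts = [start for start, _ in ranges]

    def lookup(value):
        # Index of the last range whose start is <= value, if any.
        i = bisect_right(starts, value) - 1
        if i >= 0 and value <= ranges[i][1]:
            return i
        return None  # value falls in a gap or outside all ranges

    return lookup
```

With the three ranges shown above, `make_lookup([[0, 90], [91, 2930], [2950, 8293]])` gives a function that maps 10 to position 0 and both 99 and 991 to position 1.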
I haven't timed this, so I don't know whether it's the fastest way, but given the structure of your data you could try this:
use strict;
my $fromA = 12;
my $toA = 15;
my $fromB = 7;
my $toB = 35;
my @common_range = get_common_range($fromA, $toA, $fromB, $toB);
my $common_range = $common_range[0]."-".$common_range[-1];
sub get_common_range {
    my @A = $_[0]..$_[1];
    my %B = map {$_ => 1} $_[2]..$_[3];
    my @common = ();
    foreach my $i (@A) {
        if (defined $B{$i}) {
            push (@common, $i);
        }
    }
    return sort {$a <=> $b} @common;
}
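For integer ranges with inclusive ends, the common range can also be computed directly from the end points, without enumerating every number in between. A minimal Python sketch (the function name `common_range` is just for this example):

```python
def common_range(from_a, to_a, from_b, to_b):
    """Return the (start, end) shared by two inclusive ranges, or None."""
    lo = max(from_a, from_b)   # the later of the two starts
    hi = min(to_a, to_b)       # the earlier of the two ends
    return (lo, hi) if lo <= hi else None
```

For the values used above, `common_range(12, 15, 7, 35)` gives `(12, 15)`, and two ranges that do not touch give `None`.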
