Using nearest neighbour to match new postcodes to the nearest existing postcode - nearest-neighbor

I have a list of new postcodes, and for each one I'm trying to find the nearest postcode in an existing postcode file so that I can attach it to the new postcode. I am using the code below, but it seems to have duplicated some rows. Could I have some help resolving this?
My 2 dataframes are:
new_postcode_df which contains 92,590 rows, and columns:
Postcode e.g. "AB101BJ"
Latitude e.g. 57.146051
Longitude e.g. -2.107375
current_postcode_df which contains 1,738,339 rows, and columns:
Postcode e.g. "AB101AB"
Latitude e.g. 57.149606
Longitude e.g. -2.096916
My desired output is output_df, with columns:
new_postcode e.g. "AB101BJ"
current_postcode e.g. "AB101AB"
My code is below:
new_postcode_df_gps = new_postcode_df[["lat", "long"]].values
current_postcode_df_gps = current_postcode_df[["Latitude", "Longitude"]].values
new_postcode_df_radians = np.radians(new_postcode_df_gps)
current_postcode_df_radians = np.radians(current_postcode_df_gps)
tree = BallTree(current_postcode_df_radians , leaf_size=15, metric='haversine')
distance, index = tree.query(new_postcode_df_radians, k=1)
earth_radius = 6371000
distance_in_meters = distance * earth_radius
current_postcode_df.Postcode_NS[index[:,0]]
My output is shown in the attached image, where you can see that postcodes beginning with "GY" have been added near the top, which should not be the case: postcodes starting with "AB" should all be at the top.
The new dataframe has increased from 92,590 rows to 92,848 rows.
[Image: final output dataframe]
Libraries I'm using are:
import pandas as pd
import numpy as np
from sklearn.neighbors import BallTree
new_postcode_df = pd.DataFrame({"Postcode": ["AB101BJ", "AB101BL", "AB107FU"],
                                "Latitude": [57.146051, 57.148655, 57.119636],
                                "Longitude": [-2.107375, -2.097433, -2.147906]})
current_postcode_df = pd.DataFrame({"Postcode": ["AB101AB", "AB101AF", "AB101AG"],
                                    "Latitude": [57.149606, 57.148707, 57.149051],
                                    "Longitude": [-2.096916, -2.097806, -2.097004]})
output_df = pd.DataFrame({"Postcode": ["AB101RS", "AB129TS", "GY35HG"]})
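One possible cause of the extra rows, offered as a guess rather than a diagnosis: tree.query returns row positions, but current_postcode_df.Postcode_NS[index[:,0]] looks rows up by index label. If the existing file has a non-unique index, each repeated label returns more than one row. The sketch below reuses the sample frames from the question and gives current_postcode_df a duplicated index on purpose (that index is an assumption for illustration); the column names distance_m, new_postcode and current_postcode are also just illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

new_postcode_df = pd.DataFrame({
    "Postcode": ["AB101BJ", "AB101BL", "AB107FU"],
    "Latitude": [57.146051, 57.148655, 57.119636],
    "Longitude": [-2.107375, -2.097433, -2.147906],
})
# Assumption for illustration: the existing file carries a non-unique index,
# e.g. after concatenating several files without ignore_index=True.
current_postcode_df = pd.DataFrame({
    "Postcode": ["AB101AB", "AB101AF", "AB101AG"],
    "Latitude": [57.149606, 57.148707, 57.149051],
    "Longitude": [-2.096916, -2.097806, -2.097004],
}, index=[0, 1, 1])

new_rad = np.radians(new_postcode_df[["Latitude", "Longitude"]].to_numpy())
cur_rad = np.radians(current_postcode_df[["Latitude", "Longitude"]].to_numpy())

tree = BallTree(cur_rad, leaf_size=15, metric="haversine")
distance, index = tree.query(new_rad, k=1)  # both arrays have shape (n_new, 1)
positions = index[:, 0]                     # row positions, not index labels

# Label-based lookup: a label that occurs twice returns two rows per query.
by_label = current_postcode_df["Postcode"].loc[positions]

# Position-based lookup: exactly one row per new postcode.
nearest = current_postcode_df["Postcode"].to_numpy()[positions]

output_df = pd.DataFrame({
    "new_postcode": new_postcode_df["Postcode"].to_numpy(),
    "current_postcode": nearest,
    "distance_m": distance[:, 0] * 6371000,  # radians to metres
})
print(len(by_label), len(output_df))
```

If this is what is happening in the real data, current_postcode_df.index.is_unique would be False, and either the positional lookup above or a reset_index(drop=True) before building the tree should keep the row count at 92,590.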

Related

insert the same value twice in a column called 'mer_value' of a shapefile

Hi there. I want to insert each value between 0 and the output of the function total_polygons twice into the 'mer_value' column of a shapefile (starting from the bottom left), so that the printed output looks something like this:
mer_value
0.0
0.0
1.0
1.0
2.0
2.0
....
Right now it only inserts the values from 0.0 all the way to the end, each appearing only once, as in the example below:
mer_value
0.0
1.0
2.0
.....
Any help is appreciated.
import geopandas as gpd

number_of_geocells = 2

# Read the total number of polygons in a given shapefile, divide it by the
# number of geocells, and return the rounded result.
def total_polygons(shapefile):
    total_polygons = gpd.read_file(shapefile).shape[0] / number_of_geocells
    total_geocells_per_zone_rounded = round(total_polygons)
    return total_geocells_per_zone_rounded

# Duplicate the shapefile and save it as a new shapefile called "zones".
def duplicate_shapefile(shapefile):
    zones = gpd.read_file(shapefile)
    zones.to_file("C:/tmp/shapefiletotestautomation/zones.shp")
    return zones

# Create a column called 'mer_value' in the shapefile called "zones".
def create_merge_value(shapefile):
    zones = gpd.read_file(shapefile)
    zones['mer_value'] = zones.index
    zones.to_file("C:/tmp/shapefiletotestautomation/zones.shp")
    return zones

# Loop to insert the same number twice into column "mer_value" of "zones".
def insert_merge_value(shapefile):
    zones = gpd.read_file(shapefile)
    for i in range(total_polygons(shapefile)):
        zones.loc[i, 'mer_value'] = i
        zones.loc[i + number_of_geocells, 'mer_value'] = i
    zones.to_file("C:/tmp/shapefiletotestautomation/zones.shp")
    return zones

print(insert_merge_value("C:/tmp/shapefiletotestautomation/m/GEOCELLS.shp"))
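A minimal sketch of one way to get each value twice, assuming the rows are already in the desired order (bottom left first) and the index runs from 0 to n-1. In the loop above, the value written to row i + number_of_geocells appears to be overwritten two iterations later by the assignment to row i. Integer division of the row position avoids the loop altogether. A plain pandas DataFrame with a made-up column stands in for the GeoDataFrame so the example runs without a shapefile.

```python
import numpy as np
import pandas as pd

number_of_geocells = 2

# Stand-in for gpd.read_file(...): five rows with a default 0..n-1 index.
zones = pd.DataFrame({"cell": ["a", "b", "c", "d", "e"]})

# Row positions 0,1 -> 0.0; 2,3 -> 1.0; 4 -> 2.0; and so on.
zones["mer_value"] = (np.arange(len(zones)) // number_of_geocells).astype(float)

print(zones["mer_value"].tolist())
```

The same assignment should work on a GeoDataFrame, since it is a DataFrame subclass; the result can then be written with zones.to_file(...) as in the question.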

geopandas loop does not give me name of layers

I'm trying to build a dataframe from all the layers of a KML file. The code below gives me a dataframe, but I also want the name of each KML layer in a column of data. Any idea what I'm doing wrong?
import fiona
import geopandas as gpd
import pandas as pd

gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
fp = "file.kml"
data = gpd.GeoDataFrame()
layers_list = pd.Series(fiona.listlayers(fp))
list(layers_list)
for l in layers_list:
    s = gpd.read_file(fp, driver='KML', layer=l)
    data = data.append(s, ignore_index=True)
    data['layers'] = l
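In the loop above, data['layers'] = l sets the whole column on every pass, so every row ends up with the last layer name. A sketch of one fix is to tag each layer's rows before combining them. To keep the example runnable without a KML file, read_layer and fake_layers are hypothetical stand-ins for gpd.read_file(fp, driver='KML', layer=l) and fiona.listlayers(fp), and the layer names are made up.

```python
import pandas as pd

# Hypothetical stand-in data: one small frame per KML layer.
fake_layers = {
    "roads": pd.DataFrame({"Name": ["r1", "r2"]}),
    "parks": pd.DataFrame({"Name": ["p1"]}),
}

def read_layer(name):
    """Stand-in for gpd.read_file(fp, driver='KML', layer=name)."""
    return fake_layers[name].copy()

frames = []
for name in fake_layers:          # with fiona: for name in fiona.listlayers(fp)
    s = read_layer(name)
    s["layers"] = name            # tag this layer's rows before combining
    frames.append(s)

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
data = pd.concat(frames, ignore_index=True)
print(data)
```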

Query REST API latitude and longitude

I want my users to query with two values (latitude, longitude). Those values are then compared against the stored safe places to find the ones within a 1.5 km radius, and the API displays the nearest safe places.
For example, when a user adds a latitude and longitude to the query,
www.example.com/safeplace/?find=-37.8770,145.0442
this will show the nearest safe places within 1.5 km.
Here is my function:
import math

def distance(lat1, long1, lat2, long2):
    R = 6371  # Earth radius in km
    dLat = math.radians(lat2 - lat1)  # convert degrees to radians
    dLong = math.radians(long2 - long1)
    lat1 = math.radians(lat1)
    lat2 = math.radians(lat2)
    # haversine formula; parentheses allow the expression to span two lines
    a = (math.sin(dLat/2) * math.sin(dLat/2) + math.sin(dLong/2) *
         math.sin(dLong/2) * math.cos(lat1) * math.cos(lat2))
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = R * c
    return d
Here is my model
class Safeplace(models.Model):
    establishment = models.CharField(max_length=250)
    address = models.CharField(max_length=250)
    suburb = models.CharField(max_length=250)
    postcode = models.IntegerField()
    state = models.CharField(max_length=250)
    type = models.CharField(max_length=250)
    latitude = models.DecimalField(decimal_places=6,max_digits=10)
    longtitude = models.DecimalField(decimal_places=6,max_digits=10)
Is there a way to run a for loop over my database? I am currently working with Django and SQLite. In views.py, how can I apply the distance function to the user input from my REST API URL to find the nearest safe places and return them through the REST API?
What you need is a comparison loop in your views.py. It is fairly involved, so I will try to explain it step by step.
Assume you are using that distance(lat, lng, lat2, lng2) function and want to find the places within 2 km, for example.
In views.py:
import pandas as pd
from rest_framework.generics import ListAPIView

class someapiview(ListAPIView):
    serializer_class = SafeplaceSerializer

    ### Read the lat and lng query parameters ###
    def get_queryset(self):
        queryset = Safeplace.objects.all()
        lat = float(self.request.query_params.get('lat'))
        lng = float(self.request.query_params.get('lng'))
        ### Now, we are reading your api using pandas ###
        df = pd.read_json('yourapi')  ## yourapi is a url to your api
        obj = []
        for x in range(0, len(df)):
            latx = float(df['latitude'][x])
            lngx = float(df['longtitude'][x])  ## spelled as in the Safeplace model
            ### Calculating distance ###
            km = distance(lat, lng, latx, lngx)
            if km <= 2:
                obj.append(df['id'][x])
        ### Django auto-generates a primary key, usually called id ###
        ### Now we look up those pks as a queryset ###
        return Safeplace.objects.filter(pk__in=obj)
I used pandas as a workaround; the load time might be slow if you have lots of data, but I think this does the job. GeoDjango usually provides an efficient way to deal with longitude and latitude, but I am not very familiar with GeoDjango, so I cannot really say. Still, I believe this is a reasonable workaround.
UPDATE:
You can query with "www.yourapi.com/safeplace?lat=x&lng=y".
I believe you know how to set up the URLs.
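The filtering step can also be sketched without pandas or Django, which makes it easy to test on its own. The example below is only an illustration under stated assumptions: parse_find and nearest_within are hypothetical helper names, the three places are made-up rows, and the dictionary keys follow the model's field names (including the longtitude spelling).

```python
import math

def distance(lat1, long1, lat2, long2):
    """Haversine distance in km, same formula as in the question."""
    R = 6371  # Earth radius in km
    d_lat = math.radians(lat2 - lat1)
    d_long = math.radians(long2 - long1)
    a = (math.sin(d_lat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(d_long / 2) ** 2)
    return 2 * R * math.atan2(math.sqrt(a), math.sqrt(1 - a))

def parse_find(value):
    """Split a 'lat,lng' string such as '-37.8770,145.0442' into two floats."""
    lat, lng = (float(part) for part in value.split(","))
    return lat, lng

def nearest_within(places, lat, lng, radius_km=1.5):
    """Return the ids of places inside the radius, nearest first."""
    hits = []
    for place in places:
        km = distance(lat, lng, place["latitude"], place["longtitude"])
        if km <= radius_km:
            hits.append((km, place["id"]))
    return [pk for km, pk in sorted(hits)]

# Made-up rows for illustration only.
places = [
    {"id": 1, "latitude": -37.8770, "longtitude": 145.0442},  # same point
    {"id": 2, "latitude": -37.8670, "longtitude": 145.0442},  # about 1.1 km north
    {"id": 3, "latitude": -37.8570, "longtitude": 145.0442},  # about 2.2 km north
]

lat, lng = parse_find("-37.8770,145.0442")
ids = nearest_within(places, lat, lng)
print(ids)
```

Inside get_queryset, the resulting ids could be passed to Safeplace.objects.filter(pk__in=ids), with the rows coming from Safeplace.objects.values("id", "latitude", "longtitude") instead of the made-up list. Note that filter(pk__in=...) does not keep the nearest-first order.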

groupby.sum() sparse matrix in pandas or scipy: looking for performance

I have the following dataset df:
import numpy.random
import pandas
# randint excludes the upper bound, so 401 gives values 0..400
# (random_integers, which included it, is deprecated)
cat = pandas.Series(numpy.random.randint(0, 401, 1000000))
ids = pandas.Series(numpy.random.randint(0, 10001, 1000000))
team = pandas.Series(numpy.random.randint(0, 2, 1000000))
df = pandas.concat([ids,cat,team],axis=1)
df.columns = ['ids','cat','team']
Note that there are only 401 distinct categories (0 to 400) in the cat column. Consequently, I want to prepare the dataset for machine learning classification, i.e., create one column for each distinct category value from 0 to 400 and, for each row, write 1 if the id has the corresponding category and 0 otherwise. My goal is then to group by ids and sum the 1s in every category column, as follows:
df2 = pandas.get_dummies(df['cat'], sparse=True)
df2['ids'] = df['ids']
df3 = df2.groupby('ids').sum()
My problem is that the groupby.sum() is far too slow (more than 30 minutes), so I need a different strategy for this calculation. Here is a second attempt:
from sklearn import preprocessing
import numpy
# numpy.int was removed from NumPy; the builtin int is equivalent here
text_encoder = preprocessing.OneHotEncoder(dtype=int)
X = text_encoder.fit_transform(df.drop(['team','ids'],axis=1).values).astype(int)
But then X is a sparse scipy matrix. Here I have two choices: either I find a way to do the groupby.sum() efficiently on this sparse scipy matrix, or I convert it to a dense numpy array with .toarray(), as follows:
X = X.toarray()
df2 = pandas.DataFrame(X)
df2['ids'] = df['ids']
df3 = df2.groupby('ids').sum()
The problem now is that .toarray() uses a lot of memory, and the groupby.sum() surely takes a lot of memory as well.
So my question is: is there a smart way to solve my problem using a SPARSE MATRIX, with an EFFICIENT running time for the groupby.sum()?
EDIT: In fact this is a job for pivot_table(), so once your df is created:
df_final = df.pivot_table(index='ids', columns='cat', aggfunc='count')
df_final.fillna(0, inplace=True)
For the record, though no longer needed, here is the approach from my comments on the question:
import numpy.random
import pandas
from sklearn import preprocessing
cat = pandas.Series(numpy.random.randint(0, 401, 1000000))
ids = pandas.Series(numpy.random.randint(0, 10001, 1000000))
team = pandas.Series(numpy.random.randint(0, 2, 1000000))
df = pandas.concat([ids,cat,team],axis=1)
df.columns = ['ids','cat','team']
# sort so that the rows of each id are contiguous
df.sort_values('ids', inplace=True)
text_encoder = preprocessing.OneHotEncoder(dtype=int)
X = text_encoder.fit_transform(df.drop(['team','ids'],axis=1).values).astype(int)
se_size = df.groupby('ids').size()
ls_rows = []
row_ind = 0
# sum the one-hot rows of each id block
for name, nb_lines in se_size.items():
    ls_rows.append(X[row_ind : row_ind + nb_lines,:].sum(0).tolist()[0])
    row_ind += nb_lines
df_final = pandas.DataFrame(ls_rows,
                            index = se_size.index,
                            columns = text_encoder.categories_[0])
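For the sparse route the question asks about, one option is to skip the one-hot step and build the ids-by-category count matrix directly with scipy.sparse. This is a sketch on a tiny made-up sample, not a benchmark: it relies on the fact that duplicate (row, column) entries of a COO matrix are added together when it is converted to CSR, which gives the same result as summing the one-hot columns per id.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Tiny made-up sample in place of the 1,000,000-row frame.
ids = np.array([10, 10, 20, 10])
cat = np.array([0, 2, 2, 0])

# Map raw ids and categories to consecutive row and column numbers.
id_values, row = np.unique(ids, return_inverse=True)
cat_values, col = np.unique(cat, return_inverse=True)

# One entry of 1 per original row; converting COO to CSR adds up duplicates,
# which is the per-id sum of the one-hot columns.
counts = coo_matrix(
    (np.ones(len(ids), dtype=np.int64), (row, col)),
    shape=(len(id_values), len(cat_values)),
).tocsr()

print(id_values, cat_values)
print(counts.toarray())  # dense view, only sensible for a small sample
```

With the question's frame, ids and cat would be df['ids'].to_numpy() and df['cat'].to_numpy(); id_values and cat_values then label the rows and columns of counts.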

Labeled LDA learn in Stanford Topic Modeling Toolbox

The example-6-llda-learn.scala script runs fine as follows:
val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);

val tokenizer = {
  SimpleEnglishTokenizer() ~>            // tokenize on space and punctuation
  CaseFolder() ~>                        // lowercase everything
  WordsAndNumbersOnlyFilter() ~>         // ignore non-words and non-numbers
  MinimumLengthFilter(3)                 // take terms with >=3 characters
}

val text = {
  source ~>                              // read from the source file
  Column(4) ~>                           // select column containing text
  TokenizeWith(tokenizer) ~>             // tokenize with tokenizer above
  TermCounter() ~>                       // collect counts (needed below)
  TermMinimumDocumentCountFilter(4) ~>   // filter terms in <4 docs
  TermDynamicStopListFilter(30) ~>       // filter out 30 most common terms
  DocumentMinimumLengthFilter(5)         // take only docs with >=5 terms
}

// define fields from the dataset we are going to slice against
val labels = {
  source ~>                              // read from the source file
  Column(2) ~>                           // take column two, the year
  TokenizeWith(WhitespaceTokenizer()) ~> // turns label field into an array
  TermCounter() ~>                       // collect label counts
  TermMinimumDocumentCountFilter(10)     // filter labels in < 10 docs
}

val dataset = LabeledLDADataset(text, labels);

// define the model parameters
val modelParams = LabeledLDAModelParams(dataset);

// Name of the output model folder to generate
val modelPath = file("llda-cvb0-"+dataset.signature+"-"+modelParams.signature);

// Trains the model, writing to the given output path
TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
// or could use TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);
But it no longer works when I change the last line from:
TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
to:
TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);
Also, the CVB0 method uses a lot of memory: training on a corpus of 10,000 documents with about 10 labels per document takes 30 GB of memory.
I've encountered the same situation, and I believe it is indeed a bug. Check GibbsLabeledLDA.scala in edu.stanford.nlp.tmt.model.llda under the src/main/scala folder, from line 204:
val z = doc.labels(zI);
val pZ = (doc.theta(z)+topicSmoothing(z)) *
         (countTopicTerm(z)(term)+termSmooth) /
         (countTopic(z)+termSmoothDenom);
doc.labels is self-explanatory, and doc.theta records the distribution (counts, actually) of the document's labels, so it has the same size as doc.labels.
zI is the index variable iterating over doc.labels, while z holds the actual label number. Here comes the problem: a document may have only one label, say 1000. Then zI is 0 and z is 1000, so doc.theta(z) is out of range.
I suppose the solution would be to change doc.theta(z) to doc.theta(zI).
(I'm still trying to check whether the results are meaningful; in any case, this bug has made me less confident in this toolbox.)
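The index mismatch described above can be reproduced in a few lines. This is a Python illustration with made-up values, not the toolbox's code: labels and theta only play the roles of doc.labels and doc.theta.

```python
# Made-up values: a document whose only label has the global label number 1000.
labels = [1000]   # plays the role of doc.labels
theta = [3.0]     # plays the role of doc.theta, same length as labels

for zI in range(len(labels)):
    z = labels[zI]              # global label number, here 1000
    try:
        theta[z]                # indexing by label number, as in the quoted line
        failed = False
    except IndexError:
        failed = True           # 1000 is outside a list of length 1
    by_position = theta[zI]     # indexing by position, the suggested change

print(failed, by_position)
```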
