Geopandas, what is the unit used for `sjoin_nearest` distance column?

I was using the sjoin_nearest function to join some data frames based on their closest values. It works perfectly; however, I couldn't find any reference on the unit used to express the distance value, or on how I can translate it into a metric value.
I found this thread, which discusses this implementation in terms of metric units. However, inside the code documentation referenced earlier, there is no sign of the unit.
As an example if we have the following points in two data frames:
POINT (6.58053 53.19697)
POINT (6.58583 53.17099)
And use the following call:
gpd.sjoin_nearest(df1, df2, distance_col='distance')
The distance column will contain a value of 0.026842, while I want to have it in kilometres, which should be 2.93.
PS: I did translate it by multiplying it by 110.486; however, I'm more interested in knowing what the original unit used here is.
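For reference, a minimal sketch of getting the distance in metres by reprojecting to a projected CRS before the join (hedged: the EPSG codes below are only illustrative; pick a projected CRS that suits your data, e.g. the local UTM zone):
import geopandas as gpd
from shapely.geometry import Point

# Illustrative frames built from the two points in the question (EPSG:4326, i.e. degrees)
df1 = gpd.GeoDataFrame(geometry=[Point(6.58053, 53.19697)], crs="EPSG:4326")
df2 = gpd.GeoDataFrame(geometry=[Point(6.58583, 53.17099)], crs="EPSG:4326")

# Reproject both frames to a metric CRS (UTM zone 32N here), then join;
# the distance column is then expressed in metres.
joined = gpd.sjoin_nearest(df1.to_crs(epsg=32632), df2.to_crs(epsg=32632),
                           distance_col="distance")
print(joined["distance"] / 1000)  # kilometres, roughly 2.9 for these points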

Related

How to get immediate next word probability using GPT2 model?

I was trying the Hugging Face GPT-2 model. I have seen the run_generation.py script, which generates a sequence of tokens given a prompt. I am aware that we can use GPT-2 for NLG.
In my use case, I wish to determine the probability distribution for (only) the immediate next word following the given prompt. Ideally this distribution would be over the entire vocab.
For example, given the prompt "How are", it should give a probability distribution where "you" or "they" have some high floating-point values and other vocab words have very low floating-point values.
How can I do this using Hugging Face transformers? If it is not possible in Hugging Face, is there any other transformer model that does this?
You can have a look at how the generation script works with the probabilities.
GPT2LMHeadModel (as well as the other "LMHead" models) returns a tensor that contains, for each input position, the unnormalized score (logit) of what the next token might be. I.e., the output at the last position of the sequence scores the token that follows the input (assuming input_ids is a tensor with token indices from the tokenizer):
outputs = model(input_ids)
next_token_logits = outputs[0][:, -1, :]
You get the distribution by normalizing the logits with a softmax. The indices in the last dimension of next_token_logits correspond to indices in the vocabulary that you get from the tokenizer object.
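Putting it together, a minimal self-contained sketch (hedged: the "gpt2" checkpoint name and the top-5 printout are just illustrative):
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("How are", return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids)

next_token_logits = outputs[0][:, -1, :]          # shape: (batch, vocab_size)
probs = torch.softmax(next_token_logits, dim=-1)  # distribution over the whole vocab

# Print the few most likely next tokens
top_probs, top_ids = probs.topk(5, dim=-1)
for p, i in zip(top_probs[0], top_ids[0]):
    print(tokenizer.decode([i.item()]), p.item())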
Selecting the last logits becomes tricky when you use a batch size bigger than 1 and sequences of different lengths. In that case, you would need to specify attention_mask in the model call to mask out padding tokens and then select the last logits using torch.index_select. It is much easier to either use a batch size of 1 or batches of equally long sequences.
You can use any autoregressive model in Transformers: there is DistilGPT-2 (a distilled version of GPT-2), CTRL (which is basically GPT-2 trained with some additional "commands"), the original GPT (under the name openai-gpt), and XLNet (designed for contextual embeddings, but usable for generation in arbitrary order). There are probably more; you can check the Hugging Face Model Hub.

Clustsig with modified method.distance

I am attempting to perform a SIMPROF test using a Pearson correlation as a distance method. I am aware that it is designed for typical distance methods such as Euclidean or Bray-Curtis, but it supposedly allows any function that returns a dist object.
My issue lies with the creation of that function. My original data consists of 35 rows and 2146 columns, and I wish to correlate the columns. A small subset of that data is included in the code below.
I need a function that takes the absolute value of the Pearson correlation coefficient to be used as the method.distance function. I can calculate those pieces individually, as in the CorrelationSmall/CCsmallAbs lines below, but I have no idea how to combine all of that into a single function. My attempt is the dist3 function below, but I know that as.dist needs the matrix of correlation coefficients, which you can only get from CorrelationSmall$r. I'm assuming it needs to be nested, but I'm at a loss. I apologize if I am asking something ridiculous. I have combed the forums and don't know who else to ask. Many thanks!
library(clustsig)
library(Hmisc)
library(readr)  # read_csv() comes from readr
NetworkAnalysisSmall <- read_csv("C:/Users/WilhelmLab/Desktop/Lena/NetworkAnalysisSmall.csv")
NetworkAnalysisSmallMatrix<-as.matrix(NetworkAnalysisSmall)
#subset of NetworkAnalysisSmall
a<-c(0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000001505,0.0000000000685,0.0000000009909,0.0000000001543,0.0000000000000,0.0000000000000,0.0000000000000)
b<-c(0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000002228,0.0000000000000,0.0000000001375,0.0000000000000,0.0000000000000)
c<-c(0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000546,0.0000000000000,0.0000000000000,0.0000000002293,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000540,0.0000000002085,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000)
subset<-data.frame(a,b,c)
CorrelationSmall<-rcorr(as.matrix(NetworkAnalysisSmall),type=c("pearson"))
CCsmall<-CorrelationSmall$r
CCsmallAbs<-abs(CCsmall)
dist3 = function(x) {
  as.dist(rcorr(as.matrix(x), type = c("pearson")))
}
NetworkSimprof<-simprof(NetworkAnalysisSmall,num.expected=1000,num.simulated=1000,method.cluster=c("ward"),method.distance=c("dist3"),method.transform=c("log"),alpha=0.05,sample.orientation="column")
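For what it's worth, a sketch of one way such a function might look (hedged: this just follows the question's own reasoning; whether the distance should be |r| itself or 1 - |r| depends on what you want, and clustsig's expectations for method.distance should be checked against its documentation):
library(Hmisc)

# Sketch only: correlate the columns, take |r|, and convert the similarity
# into a dist object (here distance = 1 - |r|, so strongly (anti)correlated
# columns end up close together).
dist3 <- function(x) {
  cc <- rcorr(as.matrix(x), type = "pearson")$r  # matrix of Pearson correlations
  as.dist(1 - abs(cc))                           # similarity -> distance
}

# simprof would then typically be given the function itself rather than its
# quoted name, e.g. method.distance = dist3.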

Similarity measure using vectors in gensim

I have a pair of words and the semantic types of those words. I am trying to compute the relatedness measure between these two words using the semantic types, for example: word1=king, type1=man, word2=queen, type2=woman.
We can use gensim's word_vectors.most_similar to get 'queen' from 'king-man+woman'. However, I am looking for a similarity measure between the vector represented by 'king-man+woman' and 'queen'.
I am looking for a solution to the above, or:
a way to calculate the vector that represents 'king-man+woman', and
a way to calculate the similarity between two vectors using the vector values in gensim, or
a way to calculate the simple mean of the projection weight vectors (i.e. king-man+woman)
You should look at the source code for the gensim most_similar() method, which is used to propose answers to such analogy questions. Specifically, when you try...
sims = wv_model.most_similar(positive=['king', 'woman'], negative=['man'])
...the top result will (in a sufficiently trained model) often be 'queen' or similar. So you can look at the source code to see exactly how it calculates the target combination of wv('king') - wv('man') + wv('woman') before searching all known vectors for the vectors closest to that target. See...
https://github.com/RaRe-Technologies/gensim/blob/5f6b28c538d7509138eb090c41917cb59e4709af/gensim/models/keyedvectors.py#L486
...and note that the local variable mean is the combination of the positive and negative values provided.
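If you want to do the combination and the comparison yourself, a minimal sketch (hedged: the vector set loaded via gensim's downloader is only an example; any KeyedVectors instance will do):
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # returns a KeyedVectors instance

def unit(v):
    return v / np.linalg.norm(v)

# Combine the vectors roughly the way most_similar() does internally:
# mean of the unit-normalized positive vectors and the negated negative vectors.
target = np.mean([unit(wv["king"]), unit(wv["woman"]), -unit(wv["man"])], axis=0)

# Cosine similarity between the combined vector and the vector for 'queen'
cos_sim = np.dot(unit(target), unit(wv["queen"]))
print(cos_sim)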
You might also find other methods there useful, either directly or as models for your own code, such as distances()...
https://github.com/RaRe-Technologies/gensim/blob/5f6b28c538d7509138eb090c41917cb59e4709af/gensim/models/keyedvectors.py#L934
...or n_similarity()...
https://github.com/RaRe-Technologies/gensim/blob/5f6b28c538d7509138eb090c41917cb59e4709af/gensim/models/keyedvectors.py#L1005

Own fast Gamma Index implementation

My friends and I are writing our own implementation of the Gamma Index algorithm. It should compute the index within 1 s for standard-size 2D pictures (512 x 512), though it should also be able to handle 3D pictures, and it should be portable and easy to install and maintain.
The Gamma Index, in case you haven't come across this topic, is a method for comparing pictures. As input we provide two pictures (reference and target); every picture consists of points distributed over a regular fine grid, and every point has a location and a value. As output we receive a picture of Gamma Index values. For each point of the target picture we calculate some function (called gamma) against every point of the reference picture (in the original version) or against the points of the reference picture that are closest to the target point (in the version usually used in Gamma Index calculation software). The Gamma Index for a given target point is the minimum of the gamma values calculated for it.
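For concreteness, a brute-force NumPy sketch of the computation just described (hedged: the criteria values and the all-pairs loop are purely illustrative of the definition, not a fast implementation):
import numpy as np

def gamma_index(ref_vals, tgt_vals, spacing=1.0, dta=3.0, dd=0.03):
    # ref_vals, tgt_vals: 2D arrays of the same shape (reference / target picture)
    # spacing: grid spacing, in the same unit as the distance criterion dta
    # dd: value-difference criterion as a fraction of the maximum reference value
    ys, xs = np.indices(ref_vals.shape)
    ref_pts = np.stack([ys.ravel(), xs.ravel()], axis=1) * spacing
    dd_abs = dd * ref_vals.max()
    gamma = np.empty(tgt_vals.shape)
    for (iy, ix), tval in np.ndenumerate(tgt_vals):
        dist2 = np.sum((ref_pts - np.array([iy, ix]) * spacing) ** 2, axis=1)
        diff2 = (ref_vals.ravel() - tval) ** 2
        gamma[iy, ix] = np.sqrt(np.min(dist2 / dta**2 + diff2 / dd_abs**2))
    return gamma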
So far we have tried the following ideas, with these results:
use a GPU - the calculation time decreased 10 times. The problem is that it's fairly difficult to install on machines without an nVidia graphics card
use a supercomputer or cluster - the problem is the maintenance of this solution. Plus, every picture has to be encrypted for transfer over the network due to data sensitivity
iterate points ordered by their distance to the target point, with some extra stop criterion - this way we got 15 seconds at best (which is actually not ideally precise)
Currently we are writing in Python because of NumPy's excellent optimizations for matrix calculations, but we are open to other languages too.
Do you have any ideas how we could accelerate our algorithm(s) in order to meet the objectives? Do you think this level of performance is achievable?
Some more information about GI for anyone interested:
http://lcr.uerj.br/Manual_ABFM/A%20technique%20for%20the%20quantitative%20evaluation%20of%20dose%20distributions.pdf

Using the roiRoads parameter in Veins

I have a mobility model created by SUMO, covering an area of around 2 km x 2 km from a real map.
I want to compute the results for only part of this model. I read that I can use roiRoads or roiRects.
roiRects takes (x1,y1-x2,y2) in TraCI coordinates; however, I want to use roiRoads to capture exactly the cars on a specific road.
My question is: if the roiRoads parameter takes a string of road names, where in SUMO can I get this value?
Should I construct the map again with netconvert using --output-street-names?
Edges in SUMO always have an ID. It is stored in the id="..." attribute of the <edge> tag. If you convert a network from some other data format (say, OpenStreetMap) to SUMO's XML representation, you have the option to try and use an ID that closely resembles the road name the edge represents (this is the option you mentioned). The default is to allocate a numeric ID.
Other than by opening the road network XML file in a text editor, you can also find the edge ID by opening the network in the SUMO GUI and right-clicking on the edge (or by enabling the rendering of edge IDs in the GUI).
Note that, depending on the application you simulate, you will need to make sure that you have no "gaps" in the Regions Of Interest (ROIs) you specify. When a vehicle is no longer in the ROI its corresponding node is removed from the network simulation. Even if the same vehicle later enters another (or the same) ROI, a brand new node will be created. This is particularly important when specifying edges as ROI (via the roiRoads parameter). Keep in mind that SUMO uses edges not just to represent streets, but also to represent lanes crossing intersections. If you do not specify these internal edges, your ROIs will have small gaps at every intersection.
Note also that up until OMNeT++ 5.0, syntax highlighting of the .ini file in the IDE will (mistakenly) display a string containing a # character as if it were a comment. This is just a problem with the syntax highlighting. The simulation will behave as expected. For example, setting the roiRoads parameter to "-5445204#1 :252726232_7 -5445204#2" in the Veins 4.4 example as follows...
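(roughly like the following .ini line; note that the *.manager module path here is an assumption matching the Veins example scenarios, so adjust it to whatever your configuration names the TraCI manager)
*.manager.roiRoads = "-5445204#1 :252726232_7 -5445204#2"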
...will result in a Veins simulation where only cars on one of the following three edges are simulated:
on the edge leading to the below intersection; or
on the edge crossing the below intersection; or
on the edge leaving the below intersection.
