I'm trying to build a single dataframe from all the layers of a KML file. The code below gives a dataframe, but I also want the name of each KML layer stored in a column of data. Any idea what I'm doing wrong?
import fiona
import geopandas as gpd
import pandas as pd
gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
fp = "file.kml"
data = gpd.GeoDataFrame()
layers_list = pd.Series(fiona.listlayers(fp))
list(layers_list)
for l in layers_list:
    s = gpd.read_file(fp, driver='KML', layer=l)
    data = data.append(s, ignore_index=True)
    data['layers'] = l
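One likely cause, sketched under assumptions: `data['layers'] = l` assigns to the whole accumulated frame, so every row ends up tagged with a single layer name. Tagging each layer's frame before combining avoids that. The names `combine_layers` and `read_layer` are hypothetical, and plain pandas frames stand in for the KML layers so the sketch runs without a file.

```python
import pandas as pd

def combine_layers(layer_names, read_layer):
    """Read every layer, tag its rows with the layer name, then concatenate."""
    frames = []
    for name in layer_names:
        frame = read_layer(name).copy()
        frame["layers"] = name  # tag only this layer's rows
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

# Stand-in reader (made-up data) so the example runs without a KML file.
fake_layers = {
    "roads": pd.DataFrame({"Name": ["A1", "A2"]}),
    "parks": pd.DataFrame({"Name": ["Central"]}),
}
data = combine_layers(fake_layers.keys(), lambda name: fake_layers[name])
print(data)
```

With geopandas, the reader would presumably be something like `lambda name: gpd.read_file(fp, driver='KML', layer=name)`. Collecting the frames and calling `pd.concat` once also avoids `DataFrame.append`, which was removed in pandas 2.0.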
I have a list of new postcodes, and for each one I'm trying to find the nearest postcode in an existing postcode file. I am using the code below, but it seems to have duplicated some rows. Could I have some help resolving this?
My 2 dataframes are:
new_postcode_df which contains 92,590 rows, and columns:
Postcode e.g. "AB101BJ"
Latitude e.g. 57.146051
Longitude e.g. -2.107375
current_postcode_df which contains 1,738,339 rows, and columns:
Postcode e.g. "AB101AB"
Latitude e.g. 57.149606
Longitude e.g. -2.096916
my desired output is output_df
new_postcode e.g. "AB101BJ"
current_postcode e.g. "AB101AB"
My code is below:
new_postcode_df_gps = new_postcode_df[["Latitude", "Longitude"]].values
current_postcode_df_gps = current_postcode_df[["Latitude", "Longitude"]].values
new_postcode_df_radians = np.radians(new_postcode_df_gps)
current_postcode_df_radians = np.radians(current_postcode_df_gps)
tree = BallTree(current_postcode_df_radians , leaf_size=15, metric='haversine')
distance, index = tree.query(new_postcode_df_radians, k=1)
earth_radius = 6371000
distance_in_meters = distance * earth_radius
current_postcode_df.Postcode_NS[index[:,0]]
My output is shown in the attached image, where postcodes beginning with "GY" appear near the top, which should not be the case. Postcodes starting with "AB" should all be at the top.
The new dataframe has also increased from 92,590 rows to 92,848 rows.
Image of final output dataframe
Libraries I'm using are:
import pandas as pd
import numpy as np
from sklearn.neighbors import BallTree
new_postcode_df = pd.DataFrame({"Postcode":["AB101BJ", "AB101BL", "AB107FU"],
"Latitude":[57.146051, 57.148655, 57.119636],
"Longitude":[-2.107375, -2.097433, -2.147906]})
current_postcode_df = pd.DataFrame({"Postcode":["AB101AB", "AB101AF", "AB101AG"],
"Latitude":[57.149606, 57.148707, 57.149051],
"Longitude":[-2.096916, -2.097806, -2.097004]})
output_df = pd.DataFrame({"Postcode":["AB101RS", "AB129TS", "GY35HG"]})
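One possible cause of the extra rows, offered as an assumption since the index of `current_postcode_df` is not shown: `current_postcode_df.Postcode_NS[index[:,0]]` looks rows up by index label, while `tree.query` returns row positions. If the frame's index has repeated labels (for example after concatenating several files without resetting the index), a label lookup returns every matching row. The sketch below uses made-up data and a hard-coded `positions` array in place of `index[:, 0]`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index labels repeat, as happens when two files
# are concatenated without ignore_index=True.
current_postcode_df = pd.DataFrame(
    {"Postcode": ["AB101AB", "AB101AF", "GY35HG"]},
    index=[0, 1, 0],  # label 0 appears twice
)

# Stand-in for index[:, 0] from tree.query: the nearest-row position
# for each of three new postcodes.
positions = np.array([0, 1, 0])

# Label-based lookup: each 0 matches two rows, so the result grows.
by_label = current_postcode_df["Postcode"].loc[positions]

# Positional lookup: exactly one result per query point.
by_position = current_postcode_df["Postcode"].to_numpy()[positions]

output_df = pd.DataFrame({
    "new_postcode": ["AB101BJ", "AB101BL", "AB107FU"],
    "current_postcode": by_position,
})
print(len(by_label), len(by_position))
```

If this is what is happening, indexing positionally (via `.to_numpy()` or `.iloc`) or calling `reset_index(drop=True)` on the frame before building the tree should keep the output at one row per new postcode.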
I want to arrange many objects of a certain class as a graph in MATLAB. The goal is that a newly created object is automatically added to the graph. However, as far as I can see, graphs only accept numbers when adding a new node. How is this typically handled? Should I have a GroupClass that holds all the objects, plus a graph with the relations? What I would like to have is something like
G = graph()
O1 = createObject(G)
O2 = createObject(G)
and in createObject something like
...
G.addnode(O1)
G.addedge(O1,O2)
...
Afterwards I want to be able to plot the relations, print out groups or all nodes, etc.
You can do this by adding nodes as a "node properties" table. Here's a very simple example:
G = graph();
for idx = 1:10
% make a single-row table containing the name and data
% associated with this node
nodeProps = table({['Idx ', num2str(idx)]}, ...
MException('msg:id', sprintf('Message %d', idx)), ...
'VariableNames', {'Name', 'Data'});
G = addnode(G, nodeProps);
end
for idx = 2:10
% add edges based on the node names
G = addedge(G, 'Idx 1', sprintf('Idx %d', idx));
end
plot(G)
Hi, I have two LOAD statements, A and B. I want to pass the values of a particular column from A to B. I tried the following code.
A = LOAD '/user/bangalore/part-m-00000-bangalore' using PigStorage ('\t') as (generatekey:chararray,PropertyID:chararray,ssk:chararray,ptsk:chararray,ptid:chararray,BuiltUpArea:chararray,Price:chararray,pn:chararray,NoOfBedRooms:chararray,NoOfBathRooms:chararray,balconies:chararray,Furnished:chararray,TowerNo:chararray,NoOfTowers:chararray,UnitsOntheFloor:chararray,FloorNoOfProperty:chararray,TotalFloors:chararray,NumberOfLifts:chararray,Facing:chararray,Description:chararray,NewResale:chararray,Possession:chararray,Age:chararray,Ownership:chararray,Type:chararray,PropertyAddress:chararray,Property_Address2:chararray,city:chararray,state:chararray,Property_PinCode:chararray,Locality:chararray,Landmark:chararray,PropertyFeatures:chararray,NearByFacilities:chararray,ReferenceURL:chararray,Flooring:chararray,OverLooking:chararray,ListedOn:chararray,Sellerinfo:chararray,CompanyAddress:chararray,Agency_Address2:chararray,city2:chararray,state2:chararray,Agency_Pincode:chararray,Agency_Phone1:chararray,Agency_Phone2:chararray,ContactName:chararray,Agency_Email:chararray,Agency_WebSite:chararray);
B = foreach A generate Locality;
C = LOAD '/user/april_data/bangalore' using PigStorage ('\t') as (SourceWebSite:chararray,PropertyID:chararray,ListedOn:chararray,ContactName:chararray,TotalViews:int,Price:chararray,PriceperArea:chararray,NoOfBedRooms:int,NoOfBathRooms:int,FloorNoOfProperty:chararray,TotalFloors:int,Possession:chararray,BuiltUpArea:chararray,Furnished:chararray,Ownership:chararray,NewResale:chararray,Facing:chararray,title:chararray,PropertyAddress:chararray,NearByFacilities:chararray,PropertyFeatures:chararray,Sellerinfo:chararray,Description:chararray,emp:chararray);
D = FOREACH C GENERATE title;
E = join B by Locality,D by title;
The Locality column is empty. I want to pass the values from the title column to the Locality column. The above code prints only nulls. Any help will be appreciated.
I have the following dataset df:
import numpy.random
import pandas
# randint's upper bound is exclusive (random_integers is deprecated)
cat = pandas.Series(numpy.random.randint(0, 401, 1000000))
ids = pandas.Series(numpy.random.randint(0, 10001, 1000000))
team = pandas.Series(numpy.random.randint(0, 2, 1000000))
df = pandas.concat([ids,cat,team],axis=1)
df.columns = ['ids','cat','team']
Note that there are at most 401 distinct categories in the cat column (0 to 400). Consequently, I want to prepare the dataset for a machine learning classification, i.e., create one column for each distinct category value from 0 to 400, and for each row, write 1 if the id has the corresponding category, and 0 otherwise. My goal is then to group by ids and sum the 1s for every category column, as follows:
df2 = pandas.get_dummies(df['cat'], sparse=True)
df2['ids'] = df['ids']
df3 = df2.groupby('ids').sum()
My problem is that the groupby.sum() is far too slow (more than 30 minutes), so I need a different strategy for this calculation. Here is a second attempt.
from sklearn import preprocessing
import numpy
text_encoder = preprocessing.OneHotEncoder(dtype=int)  # numpy.int was removed in NumPy 1.24
X = text_encoder.fit_transform(df.drop(['team','ids'],axis=1).values).astype(int)
But then, X is a sparse scipy matrix. Here I have two choices: either I find a way to groupby.sum() efficiently on this sparse scipy matrix, or I convert it to a dense numpy array with .toarray() as follows:
X = X.toarray()
df2 = pandas.DataFrame(X)
df2['ids'] = df['ids']
df3 = df2.groupby('ids').sum()
The problem now is that .toarray() consumes a lot of memory, and the groupby.sum() surely takes a lot of memory as well.
So my question is: is there a smart way to solve my problem using SPARSE MATRIX with EFFICIENT TIME for the groupby.sum()?
EDIT: In fact this is a job for pivot_table(), so once your df is created:
df_final = df.pivot_table(index='ids', columns='cat', aggfunc='count')
df_final.fillna(0, inplace = True)
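The same per-id category counts can also be sketched without building dummy columns at all: counting `(ids, cat)` pairs gives a long, sparse-style result, which can be widened only if a dense table is really needed. This is a small illustration on made-up data, not a benchmark.

```python
import pandas as pd

# Tiny made-up stand-in for the million-row frame in the question.
df = pd.DataFrame({
    "ids":  [1, 1, 1, 2, 2],
    "cat":  [0, 0, 3, 3, 5],
    "team": [0, 1, 0, 1, 1],
})

# Long format: one entry per (ids, cat) pair that actually occurs.
pair_counts = df.groupby(["ids", "cat"]).size()

# Wide format: one column per category, 0 where a pair never occurs.
wide = pair_counts.unstack(fill_value=0)

# pandas.crosstab builds the same table in one call.
same = pd.crosstab(df["ids"], df["cat"])
print(wide)
```

The long form stores only the pairs that occur, so it stays small when most ids have only a few categories.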
For the record (though no longer needed), here is the approach from my comments on the question:
import numpy.random
import pandas
from sklearn import preprocessing
# randint's upper bound is exclusive (random_integers is deprecated)
cat = pandas.Series(numpy.random.randint(0, 401, 1000000))
ids = pandas.Series(numpy.random.randint(0, 10001, 1000000))
team = pandas.Series(numpy.random.randint(0, 2, 1000000))
df = pandas.concat([ids,cat,team],axis=1)
df.columns = ['ids','cat','team']
# sort so that the rows belonging to each id are contiguous
df.sort_values('ids', inplace = True)
text_encoder = preprocessing.OneHotEncoder(dtype=int)
X = text_encoder.fit_transform(df.drop(['team','ids'],axis=1).values).astype(int)
se_size = df.groupby('ids').size()
ls_rows = []
row_ind = 0
for name, nb_lines in se_size.items():
    # sum the one-hot rows of this id into a single row of counts
    ls_rows.append(X[row_ind : row_ind + nb_lines,:].sum(0).tolist()[0])
    row_ind += nb_lines
df_final = pandas.DataFrame(ls_rows,
                            index = se_size.index,
                            columns = text_encoder.categories_[0])
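For completeness, a sparse alternative to the row-slicing loop, sketched under the assumption that SciPy is available: build a `scipy.sparse.coo_matrix` whose rows are id codes and whose columns are category codes, with a 1 per input row. Converting to CSR sums duplicate entries, which amounts to the group-by-and-sum. The data below is made up.

```python
import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame({
    "ids": [10, 10, 10, 20, 20],
    "cat": [0, 0, 3, 3, 5],
})

# Map ids and categories to consecutive integer codes.
id_codes, id_labels = pd.factorize(df["ids"], sort=True)
cat_codes, cat_labels = pd.factorize(df["cat"], sort=True)

# One entry of value 1 per input row; duplicates are summed by tocsr().
ones = np.ones(len(df), dtype=np.int64)
counts = sparse.coo_matrix(
    (ones, (id_codes, cat_codes)),
    shape=(len(id_labels), len(cat_labels)),
).tocsr()

print(counts.toarray())  # dense view only for this tiny example
```

Row `i` of `counts` corresponds to `id_labels[i]` and column `j` to `cat_labels[j]`, so the result never has to be densified unless a downstream step requires it.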
Please help me out; this is urgent. A deadline is approaching, and I have been stuck on this for two weeks with no result. I am new to Pig Latin.
I have a scenario where I have to filter data from a CSV file.
The CSV is on HDFS and has two columns.
grunt>> fl = load '/user/hduser/file.csv' USING PigStorage(',') AS (conv:chararray, clnt:chararray);
grunt>> dump fl;
("first~584544fddf~dssfdf","2001")
("first~4332990~fgdfs4s","2001")
("second~232434334~fgvfd4","1000")
("second~786765~dgbhgdf","1000")
("second~345643~gfdgd43","1000")
I need to extract only the first word before the first '~' sign and concatenate it with the second column's value from the CSV file. I also need to group the concatenated results, count the number of matching rows, and write a new CSV file as output, again with two columns: the first column would be the concatenated value and the second would be the row count.
i.e.
("first 2001","2")
("second 1000","3")
and so on.
I have written the code below, but it is not working. I used STRSPLIT, which splits the values of the first column of the input CSV file, but I don't know how to extract the first split value.
The code is given below:
convData = LOAD '/user/hduser/file.csv' USING PigStorage(',') AS (conv:chararray, clnt:chararray);
fil = FILTER convData BY conv != '"-1"'; --im using this to filter out the rows that has 1st column as "-1".
data = FOREACH fil GENERATE STRSPLIT($0, '~');
X = FOREACH data GENERATE CONCAT(data.$0,' ',convData.clnt);
Y = FOREACH X GROUP BY X;
Z = FOREACH Y GENERATE COUNT(Y);
var = FOREACH Z GENERATE CONCAT(Y,',',Z);
STORE var INTO '/user/hduser/output.csv' USING PigStorage(',');
STRSPLIT returns a tuple, the individual elements of which you can access using the numbered syntax. This is what you need:
data = FOREACH fil GENERATE STRSPLIT($0, '~') AS a, clnt;
X = FOREACH data GENERATE CONCAT(a.$0,' ', clnt);
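To make the intended end-to-end result concrete (split, concatenate, group, count), here is the same logic as a plain-Python sketch rather than Pig. It is only meant to show the expected result for the sample rows; the variable names are illustrative, and the `csv` module strips the surrounding quotes, which Pig's PigStorage would not do.

```python
import csv
from collections import Counter

# Sample rows as they would appear in the CSV file.
lines = [
    '"first~584544fddf~dssfdf","2001"',
    '"first~4332990~fgdfs4s","2001"',
    '"second~232434334~fgvfd4","1000"',
    '"second~786765~dgbhgdf","1000"',
    '"second~345643~gfdgd43","1000"',
]

counts = Counter()
for conv, clnt in csv.reader(lines):
    if conv == "-1":
        continue  # mirrors the FILTER step in the question
    first_word = conv.split("~", 1)[0]  # text before the first '~'
    counts[f"{first_word} {clnt}"] += 1

# Emit one quoted "key","count" line per group.
for key, n in counts.items():
    print(f'"{key}","{n}"')
```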