PageRank for weighted graph issue - pagerank

I have a question about how PageRank can reflect the impact of "weight". I want to calculate the PageRank of trading countries using the trade value as the edge weight; my code is shown below. But the results I get are the same as the unweighted results, and I don't know why.
Could someone help me understand how to make the "weight" count in the PageRank calculation?
import networkx as nx
import pandas as pd

data = pd.read_excel('f-e-2016-intermediate-use.xlsx')
G = nx.DiGraph()
teams = data.groupby(['reportercode', 'partnercode'])
team_names = [name for name, group in teams]
G.add_edges_from(team_names)
a_node = data.groupby(['reportercode'])
source_nodes = [name for name, group in a_node]
b_node = data.groupby(['partnercode'])
target_nodes = [name for name, group in b_node]
nodes = set(source_nodes + target_nodes)
G.add_nodes_from(nodes)
page_rank = nx.pagerank(G, weight='tradevalueus')

I've just come across this after looking for the answer myself. What worked for me was making sure the edge weights are stored under the attribute name that pagerank reads. Note that the weight parameter expects an edge-attribute name and defaults to 'weight' (passing weight=True is not meaningful), e.g. for building a PageRank score for all nodes in a network:
pagerank_dict = dict(nx.pagerank(G, weight='weight'))
The only problem might be that you're using a different method than me to read in your edge list. I suggest using the nx.read_weighted_edgelist function to load the node and edge data for your graph; it stores each weight under the 'weight' attribute. Your Excel file should contain three adjacent columns with values for source node, target node, and edge weight (don't include headers, and save in .csv format). You can then use the following commands to load your data so it's guaranteed to work correctly with pagerank:
G = nx.read_weighted_edgelist('f-e-2016-intermediate-use.csv', delimiter=',', create_using=nx.DiGraph(), nodetype=str)
pagerank_dict = dict(nx.pagerank(G, weight='weight'))
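For what it's worth, the likely reason the original code ignores the weights: G.add_edges_from(team_names) only adds (reporter, partner) pairs, so no edge ever carries a tradevalueus attribute, and pagerank silently falls back to a weight of 1 for every edge. A minimal sketch of attaching the trade values directly from the DataFrame (assuming the reportercode, partnercode and tradevalueus columns from the question):
import networkx as nx
import pandas as pd

data = pd.read_excel('f-e-2016-intermediate-use.xlsx')
# Sum the trade value over duplicate (reporter, partner) rows,
# then store it under the 'weight' edge attribute.
edges = data.groupby(['reportercode', 'partnercode'])['tradevalueus'].sum().reset_index()
G = nx.DiGraph()
G.add_weighted_edges_from(edges.itertuples(index=False))
page_rank = nx.pagerank(G)  # weight='weight' is the default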

Related

Why does the decision tree algorithm in Python change every run?

I am following a course on Udemy about data science with Python.
The course is focused on the output of the algorithm and less on the algorithm itself.
In particular I am building a decision tree. Every time I run the algorithm in Python, even with the same samples, it gives me a slightly different decision tree. I asked the tutors and they told me "The decision trees does not guarantee the same results each run because of its nature." Can someone explain why in more detail, or maybe recommend a good book about it?
I built the decision tree from my data by importing:
import numpy as np
import pandas as pd
from sklearn import tree
and doing this command:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X,y)
where X is my feature data and y is my target data.
Thank you
The DecisionTreeClassifier class is documented here:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
So this function has many arguments. But in Python, function arguments may have default values. Here, all arguments have default values, so you can even call the function with an empty argument list, like this:
clf = tree.DecisionTreeClassifier()
The parameter of interest, random_state is documented like this:
random_state: int, RandomState instance or None, default=None
So your call is equivalent to, among many other things:
clf = tree.DecisionTreeClassifier(random_state=None)
The None value tells the library that you don't want to bother with providing a seed (that is, an initial state) to the underlying pseudo-random number generator. Hence, the library has to come up with some seed.
Typically, it will be seeded from some volatile source such as the operating system's entropy pool or the current time (with microsecond precision if possible, run through some hash function). So at every call you get a different initial state, hence a different sequence of pseudo-random numbers, and hence a different tree.
You might want to try forcing the seed. For example:
clf = tree.DecisionTreeClassifier(random_state=42)
and see if your problem persists.
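A quick sanity check (a minimal sketch on synthetic data, not the asker's dataset): with a fixed random_state, two fits produce identical trees.
import numpy as np
from sklearn import tree

rng = np.random.RandomState(0)
X = rng.rand(100, 4)           # synthetic features
y = rng.randint(0, 2, 100)     # synthetic binary target

a = tree.DecisionTreeClassifier(random_state=42).fit(X, y)
b = tree.DecisionTreeClassifier(random_state=42).fit(X, y)
print(tree.export_text(a) == tree.export_text(b))  # True: identical trees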
Now, regarding why the decision tree requires pseudo-random numbers in the first place, this is discussed for example here:
According to scikit-learn’s “best” and “random” implementation [4], both the “best” splitter and the “random” splitter uses Fisher-Yates-based algorithm to compute a permutation of the features array.
The Fisher-Yates algorithm is the most common way to compute a random permutation. Also, if stopped before completion, it can be used to extract a random subset of the data sample, for example if you need a random 10% of the sample to be excluded from the data fitting and set aside for a later cross-validation step.
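For reference, a minimal sketch of the Fisher-Yates shuffle (an illustration, not scikit-learn's internal implementation):
import random

def fisher_yates(items, rng=random):
    # Walk from the end, swapping each position with a random earlier
    # (or same) position; stopping early leaves a random subset at the tail.
    a = list(items)
    for i in range(len(a) - 1, 0, -1):
        j = rng.randint(0, i)  # random index in [0, i], inclusive
        a[i], a[j] = a[j], a[i]
    return a

print(fisher_yates(range(10)))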
Side note: in some circumstances, non-reproducibility can become a pain point, for example if you want to study the influence of an external parameter, say some global Y values bias. In that case, you don't want uncontrolled changes in the random numbers to blur the effects of your parameter changes. Hence the need for the API to provide some way to control the seed value.

Exogenous variables in hmmlearn's GaussianHMM

I am trying to use hmmlearn's GaussianHMM to fit a Hidden Markov Model with 2 main states, while allowing for multiple exogenous variables. My goal is to determine two states of GDP growth (one with low variance and the other with high variance); these states would then depend on lagged unemployment, lagged commercial confidence levels, etc. I have a couple of questions:
Using hmmlearn's GaussianHMM, I have read through the documentation but I cannot find any mention of exogenous variables. Using the method fit(X, lengths=None), I see that X can have n_features columns. Do I understand correctly that I should pass in an array whose first column is the endogenous variable (GDP growth in my case) and whose remaining columns are the exogenous variables?
Is hmmlearn's GaussianHMM equivalent to statsmodels.tsa.regime_switching.markov_regression.MarkovRegression? That model allows for exog_tvtp, which means that exogenous variables are used to calculate a time-varying transition probability matrix.
An example of fitting the monthly returns of the S&P 500, with no exogenous variables:
import numpy as np
import pandas as pd
from hmmlearn.hmm import GaussianHMM
import yfinance as yf
sp500 = yf.download("^GSPC")["Adj Close"]
# Fit on log returns because we only care about volatility
rets = np.log(sp500/sp500.shift(1)).dropna()
rets.index = pd.to_datetime(rets.index)
rets = rets.resample("M").sum()
model = GaussianHMM(n_components=2)
model.fit(rets.to_frame())
state_sequence = model.predict(rets.to_frame())
Imagine I want to add a dependency on exogenous variables to the returns of the S&P 500, for example on economic growth or past volatilities. Is there a way to do this?
Thanks for any help.
In hmmlearn, the rows of X are the time steps and the columns (n_features) are the dimensions of the emission observed at each step; n_features should not be conflated with the regressors of, e.g., a regression model.
If your hidden states are the two states of GDP growth, then the observed variables (emissions) from which you infer the hidden states make up the feature space (a.k.a. n_features).
GaussianHMM models all columns of X jointly as a single emission distribution collected over time; it has no notion of exogenous regressors driving the emissions or the transition probabilities, so simply appending your "exogenous variables" as extra columns will not give you the conditional structure you describe.
Suggestions
If I understand your question correctly, perhaps what you are looking for is a Kalman filter. A KF produces estimates of unknowns based on multiple measurements (i.e. all of your exogenous variables), ultimately producing a model more accurate than one based on a single measurement.
If you wish each hidden state to have multiple independent emissions, then what you might be looking for is a structured perceptron. This is discussed here: Hidden Markov Model for multiple observed variables
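Also, regarding your second question: statsmodels' MarkovRegression (mentioned in the question) does support exogenous variables in the transition matrix via exog_tvtp, which GaussianHMM does not. A minimal sketch on synthetic data (the variable names are illustrative, not from the question):
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
gdp_growth = rng.normal(size=200)   # endogenous series (synthetic)
Z = rng.normal(size=(200, 2))       # lagged exogenous variables (synthetic)

# Two regimes with switching variance; exog_tvtp makes the transition
# probabilities a function of the exogenous variables at each step.
mod = sm.tsa.MarkovRegression(
    gdp_growth,
    k_regimes=2,
    switching_variance=True,
    exog_tvtp=sm.add_constant(Z),
)
res = mod.fit()
print(res.summary())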

Questions about feature selection and data engineering when using H2O autoencoder for anomaly detection

I am using the H2O autoencoder in R for anomaly detection. I don't have a training dataset, so I am using data.hex to train the model, and then the same data.hex to calculate the reconstruction errors. The rows in data.hex with the largest reconstruction errors are considered anomalous. The mean squared error (MSE) of the model, which is calculated by the model itself, is the sum of the squared reconstruction errors divided by the number of rows (i.e. examples). Below is some pseudo-code of the model.
# Deeplearning Model
model.dl <- h2o.deeplearning(x = x, training_frame = data.hex,
                             autoencoder = TRUE, activation = "Tanh",
                             hidden = c(25, 25, 25), variable_importances = TRUE)
# Anomaly Detection Algorithm
errors <- h2o.anomaly(model.dl, data.hex, per_feature = FALSE)
Currently there are about 10 features (factors) in my data.hex, and they are all categorical features. I have two questions below:
(1) Do I need to perform feature selection to pick a subset of the 10 features before the data go into the deep learning model (with autoencoder = TRUE), in case some features are strongly associated with each other? Or do I not need to, since the data go into an autoencoder, which compresses the data and keeps only the most important information anyway, so feature selection would be redundant?
(2) The purpose of using the H2O autoencoder here is to identify the senders in data.hex whose actions are anomalous. Here are two examples of data.hex. Example B is a transformed version of Example A, created by concatenating all the actions for each sender-receiver pair in Example A.
After running the model on data.hex in Example A and in Example B separately, what I got is
(a) MSE from Example A (~0.005) is 20+ times larger than MSE from Example B;
(b) When I put the reconstruction errors in ascending order and plot them (so errors increase from left to right in the plot), the reconstruction error curve from Example A is steeper (e.g. skyrocketing) on the right end, while the reconstruction error curve from Example B increases more gradually.
My question is, which example of data.hex works better for my purpose to identify anomalies?
Thanks for your insights!
Question 1
You shouldn't need to decrease the number of features fed into the model. I can't say exactly what would happen during training, but collinear/associated features could be eliminated in the hidden layers, as you said. You could also consider adjusting your hidden nodes and seeing how the model behaves: hidden = c(25,25,25) -> hidden = c(25,10,25), or hidden = c(15,15), or even hidden = c(7, 5, 7) for your few features.
Question 2
What is the purpose of your model? Are you trying to determine which "Sender/Receiver combinations" are anomalies or are you trying to determine which "Sender/Receiver + specific Action combo" are anomalies? If it's the former ("Sender/Receiver combinations") I would guess Example B is better.
If you want to know "Sender/Receiver combinations" and use Example A, then how would you aggregate all the actions for one Sender-Receiver combo? Will you average their error?
But it sounds like Example A gives more of a response for anomalies in the ascending-order list (where only a few rows have high error). I would sample different rows and check whether the errors make sense (as a domain expert), i.e. whether the higher-error rows tend to look like anomalies.

Algorithm or command line tool to decimate point cloud of terrain points?

I need to take a denser-than-needed list of lidar survey points (longitude, latitude, and elevation) for terrain definition and decimate it based on a 2-dimensional grid. The idea is to end up with points based on an N×N (e.g. 1 meter × 1 meter) grid using the longitude, latitude (x, y) values, thereby eliminating the points that are not needed. The goal is to determine what the elevation is at each point of the grid after the decimation, not to use elevation as part of the decimation rule itself.
An actual, precisely structured grid is not necessary and is not the goal here; I only use the grid terminology to best approximate what I envision as the remainder of the cloud of points after reducing it so that we always have a point within a certain radius (e.g. 1 meter). There may be a better term than "grid".
I would like to either code/script this myself in a scripting or programming language, if I can start from a decimation algorithm, or use a command-line tool from an existing project that runs on Ubuntu and can be called from our application as a system call. The approach should not require a GUI-based tool; it needs to be part of an automated set of steps.
The data currently lives in a tab-separated-values file, but I could load it into a SQLite database file if a database/SQL-query-driven algorithm would be better/faster. The ideal scripting language would be Ruby or Python, but it can be anything really, and if there are existing C/C++/C# libraries for this we could wrap them for our needs.
Ideas?
Update
Clarifying the use of the result of this decimated list: given a user's location (known by latitude and longitude), what is the closest point in the list, and in turn its elevation? We can do this now, of course, but we have more data than necessary, so we just want to relax the density of the data, provided we can still find the closest point within a tolerance distance (e.g. 1 meter) using the decimated list instead of the full list. The latitude, longitude values in the list are in decimal GPS format (e.g. 38.68616190027656, -121.11013105991036).
PART 1: decimated version
Load data
Load the data from the tabular file (change sep according to the separator you are using):
# numpy and pandas are installed as dependencies of pyntcloud
import numpy as np
import pandas as pd
# https://github.com/daavoo/pyntcloud
from pyntcloud import PyntCloud

dense = PyntCloud(pd.read_csv("example.tsv",
                              sep='\t',
                              names=["x", "y", "z"]))
This is how the example I created looks:
Build VoxelGrid
Assuming that the latitude and longitude in your file are in meters, you can generate a grid as follows:
grid_id = dense.add_structure("voxelgrid",
                              sizes=[1, 1, None],
                              bb_cuboid=False)
voxelgrid = dense.voxelgrids[grid_id]
This voxelgrid has a size of 1 along the x (latitude) and y (longitude) dimensions.
Build decimated version
decimated = dense.get_sample("voxelgrid_centroids", voxelgrid=grid_id)
decimated is a numpy (N,3) array. You can store it for later use in a SQL database, etc.
PART 2: Query
Option A: query voxelgrid
Get mean altitudes for each grid cell
You can now get a vector with the mean z (altitude) value for each cell in the grid:
z_mean = voxelgrid.get_feature_vector(mode="z_mean")
Query the grid with the users' locations:
users_location = np.random.rand(100000, 2)
Add a column of zeros because query requires 3D (This doesn't affect the results):
users_location = np.c_[ users_location, np.zeros(users_location.shape[0]) ]
Get which cell each user is in:
users_cell = voxelgrid.query(users_location)
And finally, get the altitude corresponding to each user:
users_altitude = z_mean[users_cell]
Option B: Use decimated version for query
Build a KDTree of decimated:
from scipy.spatial import cKDTree
kdt = cKDTree(decimated)
Query the KDTree with user locations:
users_location = np.random.rand(100000, 2)
users_location = np.c_[ users_location, np.zeros(users_location.shape[0]) ]
distances, indices = kdt.query(users_location, k=1, n_jobs=-1)
Extra: you can save and load the voxelgrid with pickle:
import pickle

pickle.dump(voxelgrid, open("voxelgrid.pkl", "wb"))
voxelgrid = pickle.load(open("voxelgrid.pkl", "rb"))
If you have a point cloud as a text file (.xyz), a simple and fast solution is to take a random sample of the file using shuf.
10 million points in an .xyz file equals 10 million lines of text. You can run:
shuf -n 5000000 input.xyz -o out.xyz
This decimates the file to half its original size.
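Note that pure random sampling does not guarantee a surviving point in every grid cell the way the question asks. A minimal numpy sketch of grid-based decimation, keeping one point per cell (illustrative only; the file name and 1-unit cell size are assumptions):
import numpy as np

pts = np.loadtxt("input.xyz")   # columns: x, y, z
cell = 1.0                      # cell size, in the units of x and y
# Integer cell index of each point, from x and y only (elevation is kept, not binned)
keys = np.floor(pts[:, :2] / cell).astype(np.int64)
# np.unique over rows returns the index of one point per occupied cell
_, keep = np.unique(keys, axis=0, return_index=True)
np.savetxt("out.xyz", pts[np.sort(keep)])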

Number of neighbours KNN algorithm

I applied the KNN algorithm in MATLAB for classifying handwritten digits. The digits are initially 8×8 images, stretched to form 1×64 vectors. So each time I compare the first digit with all the rest of the data set (which is quite huge), then the second one with the rest of the set, etc. Now my question is: isn't 1 neighbour always the best choice? Since I am using Euclidean distance (I pick the one that is closest), why should I also consider 2 or 3 more neighbours once I have found the closest digit?
Thanks
You have to take noise into consideration. Assume that some of your labelled examples were classified wrongly, or that one of them sits oddly close to examples of a different class even though it is actually only a "glitch". In these cases, classifying according to this off-the-track example could lead to a mistake.
From personal experience, usually the best results are achieved for k=3/5/7, but it is instance dependent.
If you want to achieve the best performance, you should use cross-validation to choose the optimal k for your specific instance.
Also, it is common to use only odd numbers for k in KNN, to avoid "draws" (ties).
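A minimal sketch of choosing k by cross-validation (using scikit-learn's bundled 8×8 digits rather than the asker's MATLAB data):
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 8x8 digit images flattened to 64 features, as in the question
X, y = load_digits(return_X_y=True)

for k in (1, 3, 5, 7):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print("k={}: mean accuracy {:.3f}".format(k, scores.mean()))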
A simple program to demonstrate the ML KNN algorithm
The KNN algorithm works by training a computer on a set of labelled data and then predicting the expected output for new inputs. For example, consider a parent who wants to teach a child to identify pictures of a rabbit: the parent shows n photos, saying "rabbit" whenever the photo is of a rabbit and moving on otherwise. In the same way, in this supervised approach the computer is fed a set of labelled data in order to produce the expected output.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Load the dataset and split it into features, target, and target names
df = pd.read_csv("D:\\heart.csv")
new_data = {"data": np.array(df[["age", "gende", "cp", "trestbps", "chol", "fbs",
                                 "restecg", "thalach", "exang", "oldpeak",
                                 "slope", "ca", "thal"]], ndmin=2),
            "target": np.array(df["target"]),
            "target_names": np.array(["No_problem", "Problem"])}

X_train, X_test, Y_train, Y_test = train_test_split(new_data["data"], new_data["target"], random_state=0)
kn = KNeighborsClassifier(n_neighbors=3)
kn.fit(X_train, Y_train)

# Predict the class of one new patient record
x_new = np.array([[71, 0, 0, 112, 149, 0, 1, 125, 0, 1.6, 1, 0, 2]])
res = kn.predict(x_new)
print("The predicted class is : {}\n".format(res))
print("The predicted name is : {}\n".format(new_data["target_names"][res]))
print("Score is : {:.2f}".format(kn.score(X_train, Y_train)))
