Implement the PageRank algorithm with PySpark

Can anyone kindly help me finish the remaining code for the Google PageRank algorithm using PySpark? Thanks a lot.
I have done some parts:
def computeContribs(neighbors, rank):
    for neighbor in neighbors:
        yield (neighbor, rank / len(neighbors))

rdd = sc.textFile('network.txt').persist()
linksRDD = rdd.map(lambda x: tuple(x.split(" "))) \
              .map(lambda x: (x[0], [x[1]])) \
              .reduceByKey(lambda x, y: x + y) \
              .collect()
linksRDD2 = sc.parallelize(linksRDD)
ranksRDD = linksRDD2.map(lambda x: (x[0], 1.0)).collect()
but I'm confused about how to calculate each page's contribution to its outgoing links:
contribs = ...
and how to update each page's rank by summing up all incoming links' contributions:
ranksRDD = ...
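For reference, a minimal sketch of one way those two placeholders are usually filled in, assuming the RDDs are kept distributed (no .collect()) and reusing the computeContribs function defined above; the 10 iterations and the 0.15/0.85 damping values are assumptions, not something stated in the question:

links = linksRDD2                              # (page, [neighbors])
ranks = links.map(lambda x: (x[0], 1.0))       # start every page at rank 1.0

for _ in range(10):                            # iteration count is an assumption
    # contribution of each page to the pages it links to
    contribs = links.join(ranks).flatMap(
        lambda kv: computeContribs(kv[1][0], kv[1][1]))
    # update each page's rank by summing all incoming contributions
    ranks = contribs.reduceByKey(lambda x, y: x + y) \
                    .mapValues(lambda rank: 0.15 + 0.85 * rank)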
notebook link: https://colab.research.google.com/drive/1g5E-tqGN8u8cioUSqPNhCM5SIdxO0cLB
Thanks

The Spark GraphFrames package has two PageRank implementations. You can use those out-of-the-box implementations rather than writing your own.
https://graphframes.github.io/graphframes/docs/_site/user-guide.html#pagerank
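A minimal sketch of that out-of-the-box approach, assuming a SparkSession named spark, the graphframes package installed, and that network.txt holds one "src dst" pair per line as in the question:

from graphframes import GraphFrame

# Build edge and vertex DataFrames from the "src dst" edge list.
edges = (spark.read.text('network.txt')
              .selectExpr("split(value, ' ')[0] AS src",
                          "split(value, ' ')[1] AS dst"))
vertices = edges.selectExpr("src AS id").union(edges.selectExpr("dst AS id")).distinct()

g = GraphFrame(vertices, edges)
result = g.pageRank(resetProbability=0.15, maxIter=10)  # or tol=0.01 to run to convergence
result.vertices.select("id", "pagerank").show()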

Related

For loop for a regression model with increasing number of predictors

How can I create a loop to fit models with an increasing number of predictors? The first iteration should
use one predictor, then two, and so on until all predictors are included. I have to compute the RMSE
on both the training and test data for each model, and store these values in a list/array.
predictors = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
              'waterfront', 'view', 'condition', 'grade', 'sqft_above',
              'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat',
              'long', 'sqft_living15', 'sqft_lot15']

models = []
formula = 'price ~ bedrooms'
for p in predictors[0:19]:
    formula = formula + p
    print(formula)
    model_linear_kc_5 = smf.ols(formula=formula, data=df_train_kc)
    models.append(model_linear_kc_5.fit())
This is my code so far, but I know it isn't right and I'm stuck on how to do it.
I put print(formula) inside the loop so I can keep adjusting the formula = ... line until it builds the formula I want.
I would really appreciate help in this regard. Thank you.
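For illustration, a minimal sketch of one way to do this, assuming statsmodels.formula.api is available as smf and that df_train_kc and df_test_kc are pandas DataFrames containing a price column; the RMSE here is computed directly with NumPy, which is an assumption rather than something specified in the question:

import numpy as np
import statsmodels.formula.api as smf

models, train_rmse, test_rmse = [], [], []
for k in range(1, len(predictors) + 1):
    # Build "price ~ p1 + p2 + ... + pk" from the first k predictors.
    formula = 'price ~ ' + ' + '.join(predictors[:k])
    model = smf.ols(formula=formula, data=df_train_kc).fit()
    models.append(model)

    # RMSE on the training and test data.
    resid_train = df_train_kc['price'] - model.predict(df_train_kc)
    resid_test = df_test_kc['price'] - model.predict(df_test_kc)
    train_rmse.append(np.sqrt(np.mean(resid_train ** 2)))
    test_rmse.append(np.sqrt(np.mean(resid_test ** 2)))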

How to compute range from loss factor

I am trying to understand the following code from the MediumLimitCache.cc file, in order to understand the formula for finding the communication range from the loss factor. I know that for an isotropic antenna, FSPL = Pt/Pr = (4*pi*d*f/c)^2, but I could not understand how this formula is implemented in the code below.
Would anyone please explain the functions in the code below to me? Thank you.
m MediumLimitCache::computeMaxRange(W maxTransmissionPower, W minReceptionPower) const
{
    // TODO: this is NaN by default
    Hz centerFrequency = Hz(par("centerFrequency"));
    double loss = unit(minReceptionPower / maxTransmissionPower).get() / maxAntennaGain / maxAntennaGain;
    return radioMedium->getPathLoss()->computeRange(radioMedium->getPropagation()->getPropagationSpeed(), centerFrequency, loss);
}
The formula is NOT implemented in this code. This is MediumLimitCache, i.e. just an optimization component that needs a maximum-range estimate.
You might find more info inside the computeRange() method.
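For reference, a minimal sketch of the inversion that a free-space path-loss model's computeRange() would effectively perform with the loss value computed above; this only illustrates the FSPL formula quoted in the question, it is not INET's actual implementation:

import math

def free_space_max_range(max_tx_power_w, min_rx_power_w, antenna_gain, frequency_hz,
                         c=299792458.0):
    # Mirror of the ratio in computeMaxRange(): loss = (Pr / Pt) / (G * G)
    loss = (min_rx_power_w / max_tx_power_w) / (antenna_gain * antenna_gain)
    # Free space: loss = (c / (4*pi*d*f))^2, so d = c / (4*pi*f*sqrt(loss)).
    return c / (4 * math.pi * frequency_hz * math.sqrt(loss))

# Example (hypothetical numbers): 1 W transmit power, 1e-12 W receiver sensitivity,
# unity antenna gain, 2.4 GHz carrier -> roughly 9.9 km
print(free_space_max_range(1.0, 1e-12, 1.0, 2.4e9))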

0-dimension ndarray created with xarray and enumerate: bug or feature?

Please find below a minimal example of how I iterate through time in xarray.
ds = xr.Dataset({'time': pd.date_range(start='1/1/2018', periods=8)})
for ii, date in enumerate(ds.time):
    nd = date.data
nd is a numpy.ndarray, but with size = 1, no shape (shape = ()), and 0 dimensions (ndim = 0).
I can access the element through nd[()] (it took me a while, thanks Clive), but I wonder if this is something we should expect or if it is a bug.
If there is a better way to enumerate through my dates, please let me know or point me to where I can find it.
Returning nd as a 0-d array is a feature; it is explained here: https://stackoverflow.com/a/49621796/3064736.
There is a small bug, due to a recent pandas change, such that nd.item() returns an int rather than a date on the most recent versions of xarray & pandas. That's being tracked here: https://github.com/pydata/xarray/pull/4292.
Generally we would want nd = date.item().
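A minimal sketch putting the two access patterns side by side, using the same example as the question (per the bug linked above, the exact type returned by .item() can differ across xarray/pandas versions):

import pandas as pd
import xarray as xr

ds = xr.Dataset({'time': pd.date_range(start='1/1/2018', periods=8)})
for ii, date in enumerate(ds.time):
    nd = date.data              # 0-d numpy array: nd.shape == (), nd.ndim == 0
    via_empty_index = nd[()]    # index with an empty tuple to pull the scalar out
    via_item = date.item()      # DataArray.item() also unwraps the 0-d value
    print(ii, via_empty_index, via_item)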

TensorFlow: how to dump the result of the placement algorithm

I'm curious about model parallelism, and I've read the code from Yaroslav Bulatov. In that example, we have to partition our model (or, in TensorFlow terms, the graph) manually into different partitions (left_network & right_network).
So I was wondering: if I have to make the partitions manually, what do simple_placer.cc and graph_partition.cc do to the whole graph? I'm still not clear on this at all.
My understanding (let me know if anything is wrong): if the graph has 8 partitions (subgraphs), which can be seen as 8 jobs, and 4 workers, then distributing the partitions to the workers can be done through:
explicit annotations via tf.device(), or
distributed training with tf.train.replica_device_setter(), which shares the variables across parameter servers and otherwise puts all ops on the worker device (see the sketch below)
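For illustration, a minimal sketch of those two placement mechanisms; the device and parameter-server names used here are hypothetical, not taken from the question:

import tensorflow as tf

# 1) Explicit annotation: pin ops to a named device.
with tf.device('/job:worker/task:0'):
    x = tf.placeholder(tf.float32, [None, 784])

# 2) replica_device_setter: variables go to parameter-server tasks,
#    all other ops stay on the worker device.
with tf.device(tf.train.replica_device_setter(ps_tasks=2)):
    W = tf.Variable(tf.zeros([784, 10]))  # placed on /job:ps
    y = tf.matmul(x, W)                   # placed on the worker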
But how does the graph get partitioned?
I want to trace what the subgraphs (sets of op-nodes) look like.
Can I dump the result, or do I need to trace/modify some source file?
Please let me know if any of my concepts are wrong or vague.
I'm a rookie at this; any opinion is appreciated.
In the code below, is matmul an op-node, and would it be partitioned into different jobs?
y_ = tf.placeholder(tf.float32, [None, 10])
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.matmul(x, W) + b
You can get the result of the placement algorithm by passing additional options when you call tf.Session.run():
# ...
y = tf.matmul(x, W) + b

sess = tf.Session()
options = tf.RunOptions(output_partition_graphs=True)
metadata = tf.RunMetadata()

sess.run(y, options=options, run_metadata=metadata)

# `metadata` now contains information about what happened during the `run()` call.
for partition in metadata.partition_graphs:
    # `partition` is a `tf.GraphDef` representing all the nodes that ran on a single
    # device. All nodes in `partition` have the same `device` value.
    device = partition.node[0].device
    for node in partition.node:
        # e.g. print each node, or store it in a dictionary for further analysis.
        print(device, node.name, node.op)

MATLAB parfor slicing a 3D array

I'm trying to speed up my code using parfor. The purpose of the code is to slide a 3D square window over a 3D image and, for each m×m×m block, apply a function.
I wrote this code:
function [ o_image ] = SlidingWindow( i_image, i_padSize, i_fun, i_options )
%SLIDINGWINDOW Summary of this function goes here
%   Detailed explanation goes here
o_image = zeros(size(i_image,1), size(i_image,2), size(i_image,3));
i_image = padarray(i_image, i_padSize, 'symmetric');
i_padSize = num2cell(i_padSize);
[m, n, p] = deal(i_padSize{:});
[row, col, depth] = size(i_image);
windowShape = i_options.windowShape;
mask = i_options.mask;

parfor (i = m+1:row-m, i_options.cores)
    temp = i_image(i-m:i+m, :, :);
    for j = n+1:col-n
        for h = p+1:depth-p
            ii = i - m;
            jj = j - n;
            hh = h - p;
            temp = temp(:, j-n:j+n, h-p:h+p);
            o_image(ii, jj, hh) = parfeval(i_fun, temp, windowShape, mask);
        end
    end
end
end
I get one warning and one error that I don't understand how to solve.
The warning says:
the entire array or structure 'i_image' is a broadcast variable.
The error says:
the PARFOR loop can not run due to the way variable 'o_image' is used.
I don't understand how to fix these two things. Any help is greatly appreciated!
As far as I understand, parfeval takes care of running your function on the available workers, which is why it doesn't need to be surrounded by parfor. Assuming you already have an active parpool, changing the outer parfor into a plain for eliminates both problems.
Unfortunately, I can't support my answer with a benchmark or suggest a more fitting solution because your inputs are unknown.
It seems to me that the code can be optimized in other ways, mainly by vectorization. I would suggest you look into the following resources:
This question, for additional info on parfeval.
Examples of how to use bsxfun and permute, and benchmarks thereof: ex1, ex2, ex3.
P.S.: The 2nd part of (i = m+1:row-m,i_options.cores) seems out of place...
