How to run PageRank in Blazegraph on a dataset?

I want to run the PageRank algorithm in Blazegraph on a dataset downloaded from SNAP, the Stanford Network Analysis Project. As far as I can see, there is a PageRank implementation in Blazegraph, but I cannot find a way to run it. Is it possible to run it? If yes, how?

You can use the Blazegraph GAS API to execute graph analytics on data loaded in Blazegraph. The example below runs PageRank over all of the data loaded in a namespace. If there is a particular SNAP data set you'd like to see converted to RDF, feel free to post a link.
PREFIX gas: <http://www.bigdata.com/rdf/gas#>
SELECT ?node ?rank {
  SERVICE gas:service {
    gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.PR" .
    gas:program gas:out ?node .  # exactly once - will be bound to the visited vertices
    gas:program gas:out1 ?rank . # the computed PageRank value for the node
  }
  FILTER (?rank < 100)
} ORDER BY DESC(?rank)
PageRank example output over the connectivity of Autonomous System (AS) Links:
node          rank
<as:1120>     0.4546700227713777
<as:11492>    0.42358562655858023
<as:12644>    0.41794183515852634
<as:12143>    0.39695587975476715
<as:10217>    0.37759985273202806
<as:13092>    0.3668006144247455
<as:11139>    0.33221277719235737
<as:12722>    0.3256365110406788
<as:10913>    0.32270313230429504
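If you'd rather drive this from code than from the Blazegraph Workbench, below is a minimal Python sketch that submits the same query to Blazegraph's SPARQL-over-HTTP endpoint. The endpoint URL and the kb namespace are assumptions for a default local installation; adjust them to your setup.

import requests

# Assumed endpoint of a default local Blazegraph installation; adjust
# host, port, and namespace ("kb") to match your installation.
ENDPOINT = "http://localhost:9999/blazegraph/namespace/kb/sparql"

QUERY = """
PREFIX gas: <http://www.bigdata.com/rdf/gas#>
SELECT ?node ?rank {
  SERVICE gas:service {
    gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.PR" .
    gas:program gas:out ?node .   # visited vertex
    gas:program gas:out1 ?rank .  # its PageRank value
  }
} ORDER BY DESC(?rank) LIMIT 10
"""

# Standard SPARQL protocol: POST the query and ask for JSON results.
response = requests.post(
    ENDPOINT,
    data={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()

for binding in response.json()["results"]["bindings"]:
    print(binding["node"]["value"], binding["rank"]["value"])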

Related

Units of an elasticsearch query to get distance from arbitrary point to Geopoint

I have a django project which uses elasticsearch 6.5.3 to index products in a store, with locations as GeoPoints. I am trying to query this index and also calculate the distance between an arbitrary point, say the user's location, and each of these results.
I am using elasticsearch_dsl and my code looks something like this:
search_query = search_query.script_fields(distance={
    'script': {
        'inline': "doc['location'].arcDistance(params.lat, params.lon)",
        'params': {
            'lat': user_loc.lat,
            'lon': user_loc.lon
        }
    }
})
for result in search_query.execute():
    print(result.distance)
This gives me values that look like:
[123456.456879123]
But I'm not sure about the units.
An online distance calculator at https://www.nhc.noaa.gov/gccalc.shtml gives the distance as ~123 km, so it looks like the value is in meters.
So:
1. Where can I find a definitive answer about the units?
Please point me to the relevant documentation for these methods.
I am also interested to know whether there is a way to specify the desired units for the results in the method call.
2. Is there a better way to do this in python?
The units are those returned by the arcDistance method that produces the value in your script:
The arc distance (in meters) of this geo point field from the provided lat/lon
The painless docs leave a lot to be desired (there appear to be no docs for this method in 6.5). The quote above comes from: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/modules-scripting.html
Additionally, they mention that arcDistance calculates meters here: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/breaking_50_scripting.html
I'm not sure about the exact python API, but elasticsearch has the Geo Distance Query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-geo-distance-query.html
In https://github.com/elastic/elasticsearch-dsl-py/issues/398 there's an example of python usage of the ES API:
MyDocType.search().filter(
    'geo_distance', distance='1000m', location={"lat": "40", "lon": "-74"}
)
The 'geo_distance' query is the easiest way to query by the distance between two geo points indexed in elasticsearch. I think you don't need to use scripting in order to achieve that.
Regarding the distance unit: as you suspected, the default is meters. From
https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#distance-units :
Wherever distances need to be specified, such as the distance parameter in the Geo Distance Query, the default unit if none is specified is the meter.
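If you need the computed distance returned in an explicit unit, rather than just filtering by it, another option is to sort by _geo_distance and read the distance back from each hit's sort values. Here is a hedged sketch with elasticsearch_dsl; the products index name is an assumption standing in for your own, and user_loc is the object from the question.

from elasticsearch_dsl import Search

# Hypothetical index name; the "location" geo_point field matches the
# question's mapping.
search_query = Search(index="products")

# Sorting by _geo_distance computes the distance to every hit; the
# "unit" option controls the unit of the value that comes back.
search_query = search_query.sort({
    "_geo_distance": {
        "location": {"lat": user_loc.lat, "lon": user_loc.lon},
        "order": "asc",
        "unit": "km",
    }
})

for result in search_query.execute():
    # The computed distance is exposed in the hit's sort values.
    print(result.meta.sort[0])  # distance in km

This way you never have to guess: whatever you pass as "unit" is the unit you get back.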

Clustering using Representatives (CURE)

I need a numerical example which demonstrates the working of the CURE clustering algorithm.
https://www.cs.ucsb.edu/~veronika/MAE/summary_CURE_01guha.pdf
The pyclustering library has a number of clustering algorithms with examples, and example code on its Github. Here is a link to the CURE example.
Googling "CURE algorithm example" also turns up a fair bit.
Hopefully that helps!
Using the pyclustering library you can extract information about representative points and means using the corresponding methods (link to the CURE pyclustering generated documentation):
# create an instance of the algorithm
cure_instance = cure(<algorithm parameters>)
# start processing
cure_instance.process()
# get allocated clusters
clusters = cure_instance.get_clusters()
# get representative points
representative = cure_instance.get_representors()
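For a concrete, runnable starting point, here is a small sketch on made-up 2-D data. The constructor parameter names (number_represent_points, compression) follow the pyclustering documentation; verify them against the version you have installed.

from pyclustering.cluster.cure import cure

# Two small, well-separated 2-D blobs (made-up sample data).
data = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3],
        [5.0, 5.2], [5.3, 4.9], [4.8, 5.1], [5.1, 5.3]]

# Ask for 2 clusters, 5 representative points per cluster, and a
# shrink factor (compression) of 0.5 towards the cluster mean.
cure_instance = cure(data, 2, number_represent_points=5, compression=0.5)
cure_instance.process()

print(cure_instance.get_clusters())      # index lists, one per cluster
print(cure_instance.get_representors())  # representative points per cluster
print(cure_instance.get_means())         # mean point of each cluster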
You can also modify the source code of the CURE algorithm to display the changes after each step, for example print them to the console or even visualize them. Here is an example of how to modify the code to display the changes at each clustering step (after line 219), where a star marks a representative point, small points are the data points themselves, and big points are the means:
# The new cluster and the updated clusters should be relocated in the queue
self.__insert_cluster(merged_cluster)
for item in cluster_relocation_requests:
    self.__relocate_cluster(item)

# ADD THE FOLLOWING PIECE OF CODE TO DISPLAY CHANGES ON EACH STEP
temp_clusters = [cure_cluster_unit.indexes for cure_cluster_unit in self.__queue]
temp_representors = [cure_cluster_unit.rep for cure_cluster_unit in self.__queue]
temp_means = [cure_cluster_unit.mean for cure_cluster_unit in self.__queue]

visualizer = cluster_visualizer()
visualizer.append_clusters(temp_clusters, self.__pointer_data)
for cluster_index in range(len(temp_clusters)):
    visualizer.append_cluster_attribute(0, cluster_index, temp_representors[cluster_index], '*', 7)
    visualizer.append_cluster_attribute(0, cluster_index, [temp_means[cluster_index]], 'o')
visualizer.show()
You will see a sequence of images showing how the clusters change at each step. Thus, you can display any information that you need.
I would also like to add that you can use the C++ implementation of the algorithm for visualization (it is also part of pyclustering): https://github.com/annoviko/pyclustering/blob/master/ccore/src/cluster/cure.cpp

Execute query lazily in Orient-DB

In our current project we need to find the cheapest paths in an almost fully connected graph which can contain many edges per vertex pair.
We developed a plugin containing functions
for a special traversal of this graph that reduces recurrences of similar paths during TRAVERSE execution. We will refer to it as search()
for efficiently extracting the desired information from the results of such traversals. We will refer to it as extract()
for extracting the best N records according to a target parameter without a costly ORDER BY. We will refer to it as best()
But the resulting query still has unsatisfactory performance on the full data.
So we decided to modify the search() function so that it visits the best edges first and prunes paths leading to definitely undesired results by using the current state of the best() function.
The overall solution is effectively a flexible implementation of the Branch and Bound method.
The resulting query (omitting the extract() step) should look like
SELECT best(path, <limit>) FROM (
  TRAVERSE search(<params>) FROM #<starting_point>
  WHILE <conditions on intermediate vertices>
) WHERE <conditions on result elements>
This form is highly desirable so that we can adapt the conditions under WHILE and WHERE to the task at hand. The path field is generated by search() and contains all the information best() needs to proceed.
The trouble is that the best() function is executed strictly after the search() function, so search() cannot prune non-optimal branches according to the results already evaluated by best().
So the question is:
Is there a way to pipeline results from the TRAVERSE step to the SELECT step, so that later paths are TRAVERSEd with search() only after earlier paths have been handled by SELECT with best()?
The query execution in this case will be streamed. If you add a
System.out.println()
or put a breakpoint in your functions, you'll see that the invocation sequence will be
search
best
search
best
search
...
You can use a ThreadLocal object (http://docs.oracle.com/javase/7/docs/api/java/lang/ThreadLocal.html)
to store some context data and share it between the two functions, or you can use the OCommandContext (the last parameter of the OSQLFunction.execute() method) to store context information.
You can use context.getVariable() and context.setVariable() for this.
The contexts of the two queries (the parent and the inner query) are different, but they should be linked by a parent/child relationship, so you should be able to retrieve one from the other using OCommandContext.getParent().

How to handle the Nominal Data by Weka J48

When I ran weka's J48 with the binary split option, the following decision tree was built:
http://www.fastpic.jp/viewer.php?file=2693704973.jpg
The input explanatory variable is a single nominal attribute formed from a question id plus an answer id: one nominal value per transaction.
I'm wondering why the tree branches on only one side.
Is this caused by my data set, the table definition, or the binary split method itself?
I'd like the tree to have nodes on both sides.
If you know of such an option, please show me.
Sample data:
usr,qa,class
A,11,1
A,21,1
A,31,1
B,12,2
B,22,2
B,32,2
C,13,3
C,23,3
C,33,3
D,11,4
D,22,4
D,31,4
E,11,1
E,23,1
E,31,1
F,12,2
F,22,2
F,33,2
G,13,3
G,22,3
G,32,3
H,12,4
H,21,4
H,33,4
There's no error in the tree built, and no option would really modify it. If your question is related to your Akinator project, reformat your data so that all questions (i.e. 11, 21, 31) are on the same instance/line, with the answer as the target class; see the sketch below.
PS: if you import those data as CSV, Weka will treat them as numerical (not as nominal). You should then add a non-digit character (e.g. #1, #2, #3...) so that Weka treats them as nominal.
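A hedged sketch of that reformatting in Python: pivot the long usr/qa rows into one instance per user, with one column per question and the class as the target. The file name answers.csv and the q1..q3 column names are made up, and the qa code is assumed to be a question id followed by an answer id.

import csv
from collections import defaultdict

# Pivot rows like (usr, qa, class) into one wide row per user.
# qa is assumed to be question id + answer id, e.g. "21" = question 2, answer 1.
rows = defaultdict(dict)
with open("answers.csv") as f:  # hypothetical input file
    for rec in csv.DictReader(f):
        question, answer = rec["qa"][0], rec["qa"][1:]
        # "#" prefix keeps Weka from reading the answer as numeric.
        rows[rec["usr"]]["q" + question] = "#" + answer
        rows[rec["usr"]]["class"] = rec["class"]

with open("answers_wide.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["usr", "q1", "q2", "q3", "class"])
    writer.writeheader()
    for usr, fields in rows.items():
        writer.writerow({"usr": usr, **fields})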

Find item by arbitrary query

Problem: I have an item in the database called "AABGng-LS 4х4 0.66kV". AABG is the vendor, ng-LS is the type, 4*4 is the cable cross-section, and 0.66 kV is the voltage. The user must be able to find this item with these queries:
AABG ng LS 4х4 660 V
AABGng-LS-660 4х4
AABG ng-LS 0.66 4*4
How can this be solved (what algorithm)? I prefer the ruby language, but an algorithm in any language can be suggested.
The problem that you are describing is one for a search index. Getting this working yourself involves a lot of steps, like normalizing, stemming, matching, etc.
I would advise you to have a look at Lucene-based search indexes like elasticsearch, solr, etc.
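If a full search engine is overkill, the normalize-and-match idea can be sketched in a few lines (in Python rather than ruby, since any language was allowed). The normalization rules below are assumptions tailored to the example item; a real system would load such synonym rules from configuration.

import re

def canonical(text):
    # Lowercase, map Cyrillic 'х' and '*' to Latin 'x', drop separators,
    # and unify voltage spellings; these rules are assumptions for the example.
    s = text.lower().replace("х", "x").replace("*", "x")
    s = re.sub(r"[\s\-]+", "", s)             # "4 x 4" and "4-4" fuse the same way
    s = s.replace("0.66kv", "660v").replace("0.66", "660v")
    s = re.sub(r"660(?!v)", "660v", s)        # bare "660" means 660 V here
    return s

def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(query, item):
    # Jaccard similarity of character trigrams: tolerant of token
    # splits/joins such as "AABG ng" vs "AABGng".
    a, b = trigrams(canonical(query)), trigrams(canonical(item))
    return len(a & b) / len(a | b)

item = "AABGng-LS 4х4 0.66kV"
for q in ["AABG ng LS 4х4 660 V", "AABGng-LS-660 4х4", "AABG ng-LS 0.66 4*4"]:
    print(q, "->", round(similarity(q, item), 2))

In practice you would compute this score for the query against every item in the catalog and return the highest-ranked matches; a Lucene-based engine does the same kind of work, only faster and far more robustly.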
