Difficulty updating Neo4j LPA community detection to GDS version with graph projections

I'm trying to migrate and upgrade my graph to the latest version of Neo4j and make use of new features and GDS algorithms. The old LPA community detection query was as follows:
CALL algo.labelPropagation.stream(
  'MATCH (p:Publication) RETURN id(p) AS id',
  'MATCH (p1:Publication)-[r1:HAS_WORD]->(w)<-[r2:HAS_WORD]-(p2:Publication)
   WHERE r1.occurrence > 5 AND r2.occurrence > 5
   RETURN id(p1) AS source, id(p2) AS target, count(w) AS weight',
  {graph: 'cypher', write: false, weightProperty: 'weight'})
YIELD nodeId, label
WITH label, collect(algo.asNode(nodeId)) AS nodes
WHERE size(nodes) > 2
MERGE (c:PublicationLPACommunity {id: label})
FOREACH (n IN nodes |
  MERGE (n)-[:IN_LPA_COMMUNITY]->(c)
)
RETURN label, nodes
I've been trying to understand the documentation for projecting a graph then performing community detection - and I think I've been close - but I just don't fully understand what is happening and how to project it correctly first in order to perform the LPA.
Here is my code so far:
CALL gds.graph.project.cypher(
  'testProjection',
  'MATCH (p:Publication) RETURN id(p) AS id',
  'MATCH (p1:Publication)-[r1:HAS_WORD]->(w)<-[r2:HAS_WORD]-(p2:Publication)
   WHERE r1.occurrence > 5 AND r2.occurrence > 5
   RETURN id(p1) AS source, id(p2) AS target, count(w) AS weight'
)
YIELD graphName AS graph, nodeCount AS nodes, relationshipCount AS rels, weightProperty AS weight
I think I'm mixing up elements of the projection and elements of the algorithm - I can't figure out what should happen and why. I've managed to make simple graph projections with the Publication nodes in the past - but it seems like that isn't enough information to perform the LPA algorithm.
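My best guess for the algorithm step itself, adapted from the old query, is something like this (untested; I'm assuming relationshipWeightProperty is what replaces the old weightProperty):
CALL gds.labelPropagation.stream('testProjection', {
  relationshipWeightProperty: 'weight'
})
YIELD nodeId, communityId
WITH communityId, collect(gds.util.asNode(nodeId)) AS nodes
WHERE size(nodes) > 2
MERGE (c:PublicationLPACommunity {id: communityId})
FOREACH (n IN nodes | MERGE (n)-[:IN_LPA_COMMUNITY]->(c))
RETURN communityId, nodes
But I don't know whether the projection above actually gives the algorithm the weighted relationships it needs.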
Any help very much appreciated.

Related

Cypher query trying to get all nodes with 3rd-degree connections, while all nodes follow the same constraints

I'm struggling with writing this query.
I want to find all of the workers who have 1st-degree, 2nd-degree, AND 3rd-degree friends, all of whom satisfy two conditions.
If any one of those connections does not satisfy the conditions, I do not want the worker in the output.
For example:
The relationship is friend.
The conditions are:
City: "New Delhi"
age >= 29
They apply to friends, friends of friends, and friends of friends of friends; direction is irrelevant.
If a node has even one friend within three hops that does not satisfy the conditions, I do not want it, even though it may have another path that does satisfy them.
Graph DB (picture of the graph omitted)
MATCH (u1)-[:friend]-(u2)-[:friend]-(u3)-[:friend]-(u4)
WHERE(u1.address = "New Delhi" AND u1.age >= 29)
AND (u2.address = "New Delhi" AND u2.age >= 29)
AND (u3.address = "New Delhi" AND u3.age >= 29)
AND (u4.address = "New Delhi" AND u4.age >= 29)
RETURN DISTINCT u3.name
This query gives me all the nodes that I wanted, and more:
it does not filter out the ones that have a relationship with someone who does not satisfy the conditions.
To be more clear:
a-b-c-d is a path where everyone satisfies the conditions: mark it P.
a-q-w-v is a path where q OR w OR v does not satisfy the conditions: mark it N.
Because of P, a will be in the output, but because of N it should not be.
Who meets the requirements, and the expected output:
I added a picture as a graphic explanation (not reproduced here).
The nodes marked with a black circle are the expected output: everyone in their chain satisfies the conditions.
Green means a node satisfies the conditions; red means it does not.
Hillel, Tor, and Dror satisfy the conditions themselves, but they have friends, friends of friends, or friends of friends of friends who do not, so they should not be in the output.
The output of my query is the black circles plus those three names. How do I do it without those three?
Sorry for all the extra details. I tried to find an answer in the Neo4j manual, but with no luck.
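One way to express "every node within three hops satisfies the conditions" is to require that no node within three hops violates them. A sketch along those lines, using the property names above (untested; you may still want the original MATCH as well, to require that a three-hop chain actually exists):
MATCH (u1)
WHERE u1.address = "New Delhi" AND u1.age >= 29
  AND NONE(x IN [(u1)-[:friend*1..3]-(other) | other]
           WHERE x.address <> "New Delhi" OR x.age < 29)
RETURN DISTINCT u1.name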

Neo4j cypher query improvement (performance)

I have the following cypher query:
CALL apoc.index.nodes('node_auto_index','pref_label:(Foo)')
YIELD node, weight
WHERE node.corpus = 'my_corpus'
WITH node, weight
MATCH (selected:ontoterm{corpus:'my_corpus'})-[:spotted_in]->(:WEBSITE)<-[:spotted_in]-(node:ontoterm{corpus:'my_corpus'})
WHERE selected.uri = 'http://uri1'
OR selected.uri = 'http://uri2'
OR selected.uri = 'http://uri3'
RETURN DISTINCT node, weight
ORDER BY weight DESC LIMIT 10
The first part (up to the WITH) runs very fast (Lucene legacy index) and returns ~100 nodes. The uri property is also unique (selected matches 3 nodes).
I have ~300 WEBSITE nodes. The execution time is 48749 ms.
Profile (screenshot not reproduced here):
How can I restructure the query to improve performance? And why are there ~13.8 million rows in the profile?
Update: I think the problem was the WITH clause, which expanded the results enormously. InverseFalcon's answer makes the query faster (49 s -> 18 s), but that is still not fast enough. To avoid the enormous expansion I collected the websites first. The following query takes 60 ms:
MATCH (selected:ontoterm)-[:spotted_in]->(w:WEBSITE)
WHERE selected.uri IN ['http://avgl.net/carbon_terms/Faser', 'http://avgl.net/carbon_terms/Carbon', 'http://avgl.net/carbon_terms/Leichtbau']
AND selected.corpus = 'carbon_terms'
WITH collect(DISTINCT w) AS websites
CALL apoc.index.nodes('node_auto_index','pref_label:(Fas OR Fas*)^10 OR pref_label_deco:(Fas OR Fas*)^3 OR alt_label:(Fa)^5') YIELD node, weight
WHERE node.corpus = 'carbon_terms' AND node:ontoterm
WITH websites, node, weight
MATCH (node)-[:spotted_in]->(w:WEBSITE)
WHERE w IN websites
RETURN node, weight
ORDER BY weight DESC
LIMIT 10
I don't see any occurrence of NodeUniqueIndexSeek in your plan, so the selected node isn't being looked up efficiently.
Make sure you have a unique constraint on :ontoterm(uri).
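On the Neo4j 3.x line this question targets, that constraint looks like this:
CREATE CONSTRAINT ON (o:ontoterm) ASSERT o.uri IS UNIQUE;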
After the unique constraint is up, give this a try:
PROFILE CALL apoc.index.nodes('node_auto_index','pref_label:(Foo)')
YIELD node, weight
WHERE node.corpus = 'my_corpus' AND node:ontoterm
WITH node, weight
MATCH (selected:ontoterm)
WHERE selected.uri in ['http://uri1', 'http://uri2', 'http://uri3']
AND selected.corpus = 'my_corpus'
WITH node, weight, selected
MATCH (selected)-[:spotted_in]->(:WEBSITE)<-[:spotted_in]-(node)
RETURN DISTINCT node, weight
ORDER BY weight DESC LIMIT 10
Take a look at the query plan. You should see a NodeUniqueIndexSeek somewhere in there, and hopefully you should see a drop in db hits.

Neo4j performance with cycles

I have a relatively large neo4j graph with 7 million vertices and 5 million relationships.
When I try to find the subtree size for one node, neo4j gets stuck traversing 600,000 nodes, only 130 of which are unique.
It does this because of cycles.
It looks like it applies DISTINCT only after it has traversed the whole graph to maximum depth.
Is it possible to change this behaviour somehow?
The query is:
match (a1)-[o1*1..]->(a2) WHERE a1.id = '123' RETURN distinct a2
You can iteratively step through the subgraph a "layer" at a time, while avoiding reprocessing the same node multiple times, by using the APOC procedure apoc.periodic.commit. That procedure iteratively processes a query until it returns 0.
Here is an example of this technique. It:
Uses a temporary TempNode node to keep track of a couple of important values between iterations, one of which will eventually contain the distinct ids of the nodes in the subgraph (except for the "root" node's id, since your question's query also leaves that out).
Assumes that all the nodes you care about share the same label, Foo, and that you have an index on Foo(id). This is for speeding up the MATCH operations, and is not strictly necessary.
Step 1: Create TempNode (using MERGE, to reuse existing node, if any)
WITH '123' AS rootId
MERGE (temp:TempNode)
SET temp.allIds = [rootId], temp.layerIds = [rootId];
Step 2: Perform iterations (to get all subgraph nodes)
CALL apoc.periodic.commit("
MATCH (temp:TempNode)
UNWIND temp.layerIds AS id
MATCH (n:Foo) WHERE n.id = id
OPTIONAL MATCH (n)-->(next)
WHERE NOT next.id IN temp.allIds
WITH temp, COLLECT(DISTINCT next.id) AS layerIds
SET temp.allIds = temp.allIds + layerIds, temp.layerIds = layerIds
RETURN SIZE(layerIds);
");
Step 3: Use subgraph ids
MATCH (temp:TempNode)
// ... use temp.allIds, which contains the distinct ids in the subgraph ...
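For example, to get the subtree size the question asked about:
MATCH (temp:TempNode)
RETURN SIZE(temp.allIds) AS subtreeSize;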

CK metrics from a C# project with NDepend

I have a project for school. I now need to produce a report of all the CK metrics (Chidamber & Kemerer metrics) from it. The report has to be a table of all those metrics. The question is how to get this out of NDepend; the report it generates is not what I am looking for.
Please help and say how to do it... maybe some tips, documents, or something. This is very important...
OK, so if we are talking about these Chidamber & Kemerer metrics, then NDepend's ability to write Code Queries and Rules over LINQ queries (CQLinq) will answer all your needs. For example:
WMC Weighted Methods Per Class
warnif count > 0
from t in Application.Types
let methods = t.Methods
.Where(m => !m.IsPropertyGetter &&
!m.IsPropertySetter &&
!m.IsConstructor)
where methods.Count() > 20
orderby methods.Count() descending
select new { t, methods }
DIT Depth of Inheritance Tree
warnif count > 0
from t in JustMyCode.Types
where t.IsClass
let baseClasses = t.BaseClasses.ExceptThirdParty()
where baseClasses.Count() >= 5
select new { t, baseClasses,
// The metric value DepthOfInheritance takes account
// of third-party base classes
t.DepthOfInheritance
}
NOC Number of Children
from t in Types
where t.IsClass
let childClasses = t.DerivedTypes
where childClasses.Count() > 0
orderby childClasses.Count() descending
select new { t, childClasses }
CBO Coupling between Object Classes
from t in Application.Types
let typesUsed = t.TypesUsed.ExceptThirdParty()
orderby typesUsed.Count() descending
select new { t, typesUsed }
and so on...
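As one illustration of the "and so on", LCOM could be queried along these lines. This is a sketch that assumes NDepend's LCOM type metric; the thresholds are arbitrary:
// LCOM Lack of Cohesion Of Methods (sketch, arbitrary thresholds)
warnif count > 0
from t in Application.Types
where t.LCOM > 0.8 && t.NbMethods > 10
orderby t.LCOM descending
select new { t, t.LCOM, t.NbMethods }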
Does NDepend have a direct way in CQLinq to measure RFC (RFT)? Or do we have to write a CQLinq query ourselves that recursively counts invoked methods in used classes (types)? If so, what would it look like?
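A rough sketch of the classic RFC definition (the type's methods plus the methods they invoke directly), assuming NDepend's MethodsCalled property on methods; treat it as a starting point rather than a verified rule:
// RFC Response For a Class (sketch: direct calls only, no recursion)
from t in Application.Types
let responseSet = t.Methods.Union(t.Methods.SelectMany(m => m.MethodsCalled))
orderby responseSet.Count() descending
select new { t, responseSet }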

Best clustering algorithm? (simply explained)

Imagine the following problem:
You have a database containing about 20,000 texts in a table called "articles"
You want to connect the related ones using a clustering algorithm in order to display related articles together
The algorithm should do flat clustering (not hierarchical)
The related articles should be inserted into the table "related"
The clustering algorithm should decide whether two or more articles are related or not based on the texts
I want to code in PHP but examples with pseudo code or other programming languages are ok, too
I've coded a first draft with a function check() which returns TRUE if the two input articles are related and FALSE if not. The rest of the code (selecting the articles from the database, selecting articles to compare with, inserting the related ones) is complete, too. Maybe you can improve that as well, but the main point which is important to me is the function check(). So it would be great if you could post some improvements or completely different approaches.
APPROACH 1
<?php
$zeit = time();

function check($str1, $str2) {
    $minprozent = 60;
    similar_text($str1, $str2, $prozent);
    $prozent = sprintf("%01.2f", $prozent);
    if ($prozent > $minprozent) {
        return TRUE;
    } else {
        return FALSE;
    }
}

$sql1 = "SELECT id, text FROM articles ORDER BY RAND() LIMIT 0, 20";
$sql2 = mysql_query($sql1);
while ($sql3 = mysql_fetch_assoc($sql2)) {
    $rel1 = "SELECT id, text, MATCH (text) AGAINST ('".$sql3['text']."') AS score FROM articles WHERE MATCH (text) AGAINST ('".$sql3['text']."') AND id NOT LIKE ".$sql3['id']." LIMIT 0, 20";
    $rel2 = mysql_query($rel1);
    $rel2a = mysql_num_rows($rel2);
    if ($rel2a > 0) {
        while ($rel3 = mysql_fetch_assoc($rel2)) {
            if (check($sql3['text'], $rel3['text']) == TRUE) {
                $id_a = $sql3['id'];
                $id_b = $rel3['id'];
                $rein1 = "INSERT INTO related (article1, article2) VALUES ('".$id_a."', '".$id_b."')";
                $rein2 = mysql_query($rein1);
                $rein3 = "INSERT INTO related (article1, article2) VALUES ('".$id_b."', '".$id_a."')";
                $rein4 = mysql_query($rein3);
            }
        }
    }
}
?>
APPROACH 2 [only check()]
<?php
function square($number) {
    return pow($number, 2);
}

function check($text1, $text2) {
    $words_sub = text_splitter($text2); // splits the text into single words
    $words = text_splitter($text1);     // splits the text into single words

    // document 1: word-frequency histogram of text 1
    $document1 = array();
    foreach ($words as $word) {
        if (isset($document1[$word])) { $document1[$word]++; } else { $document1[$word] = 1; }
    }
    $rating1 = 0;
    foreach ($document1 as $temp) {
        $rating1 = $rating1 + square($temp);
    }
    $rating1 = sqrt($rating1);
    // document 1 end

    // document 2: frequencies of the words text 2 shares with text 1
    $document2 = array();
    foreach ($words_sub as $word_sub) {
        if (in_array($word_sub, $words)) {
            if (isset($document2[$word_sub])) { $document2[$word_sub]++; } else { $document2[$word_sub] = 1; }
        }
    }
    $rating2 = 0;
    foreach ($document2 as $temp) {
        $rating2 = $rating2 + square($temp);
    }
    $rating2 = sqrt($rating2);
    // document 2 end

    // dot product: match counts by word, not by array position
    $skalarprodukt = 0;
    foreach ($document1 as $word => $count) {
        if (isset($document2[$word])) {
            $skalarprodukt = $skalarprodukt + ($count * $document2[$word]);
        }
    }

    // avoid division by zero for empty documents
    if (($rating1 * $rating2) == 0) { return FALSE; }
    $kosinusmass = $skalarprodukt / ($rating1 * $rating2);
    return ($kosinusmass >= 0.7);
}
?>
?>
I would also like to say that I know there are lots of algorithms for clustering, but on every site there is only the mathematical description, which is a bit difficult for me to understand. So coding examples in (pseudo) code would be great.
I hope you can help me. Thanks in advance!
The most standard way I know of to do this on text data like yours is to use the 'bag of words' technique.
First, create a 'histogram' of words for each article. Let's say that across all your articles there are only 500 unique words. Then this histogram is going to be a vector (array, list, whatever) of size 500, where the data is the number of times each word appears in the article. So if the first spot in the vector represented the word 'asked', and that word appeared 5 times in the article, vector[0] would be 5:
for word in article.text.split():
    article.histogram[indexLookup[word]] += 1
Now, to compare any two articles, it is pretty straightforward. We simply multiply the two vectors:
def check(articleA, articleB):
    rtn = 0
    for a, b in zip(articleA.histogram, articleB.histogram):
        rtn += a * b
    return rtn > threshold
(Sorry for using python instead of PHP, my PHP is rusty and the use of zip makes that bit easier)
This is the basic idea. Notice the threshold value is semi-arbitrary; you'll probably want to find a good way to normalize the dot product of your histograms (this will almost have to factor in the article length somewhere) and decide what you consider 'related'.
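One standard way to do that normalization is cosine similarity: divide the dot product by the lengths of the two vectors. (This is also what Approach 2 above computes.) A minimal Python sketch:
import math

def cosine_similarity(hist_a, hist_b):
    # dot product of the two word-count vectors
    dot = sum(a * b for a, b in zip(hist_a, hist_b))
    # Euclidean lengths of each vector
    norm_a = math.sqrt(sum(a * a for a in hist_a))
    norm_b = math.sqrt(sum(b * b for b in hist_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty histogram is similar to nothing
    return dot / (norm_a * norm_b)

For non-negative counts this yields a value between 0 and 1, so the threshold no longer depends on article length.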
Also, you should not just put every word into your histogram. You'll, in general, want to include the ones that are used semi-frequently: Not in every article nor in only one article. This saves you a bit of overhead on your histogram, and increases the value of your relations.
By the way, this technique is described in more detail here
Maybe clustering is the wrong strategy here?
If you want to display similar articles, use similarity search instead.
For text articles, this is well understood. Just insert your articles in a text search database like Lucene, and use your current article as search query. In Lucene, there exists a query called MoreLikeThis that performs exactly this: find similar articles.
Clustering is the wrong tool, because (in particular with your requirements) every article must be put into some cluster, and the related items would be the same for every object in the cluster. If there are outliers in the database - a very likely case - they can ruin your clustering. Furthermore, clusters may be very big. There is no size constraint; the clustering algorithm may decide to put half of your data set into the same cluster, leaving you with 10,000 related articles for each article in your database. With similarity search, you can just get the top 10 similar items for each document!
Last but not least: forget PHP for clustering. It's not designed for this, and not performant enough. But you can probably access a lucene index from PHP well enough.
I believe you need to make some design decisions about clustering, and continue from there:
Why are you clustering texts? Do you want to display related documents together? Do you want to explore your document corpus via clusters?
As a result, do you want flat or hierarchical clustering?
Now we have the complexity issue, in two dimensions: first, the number and type of features you create from the text - individual words may number in the tens of thousands. You may want to try some feature selection - such as taking the N most informative words, or the N words appearing the most times, after ignoring stop words.
Second, you want to minimize the number of times you measure similarity between documents. As bubaker correctly points out, checking similarity between all pairs of documents may be too much. If clustering into a small number of clusters is enough, you may consider K-means clustering, which is basically: choose an initial K documents as cluster centers, assign every document to the closest cluster, recalculate cluster centers by finding document vector means, and iterate. This only costs K * (number of documents) per iteration. I believe there are also heuristics for reducing the needed number of computations for hierarchical clustering as well.
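As a rough sketch of that K-means loop (plain Python, assuming each document has already been turned into a numeric feature vector; not tuned for 20,000 articles):
import random

def kmeans(vectors, k, iterations=10):
    # choose an initial K documents as cluster centers
    centers = random.sample(vectors, k)
    for _ in range(iterations):
        # assign every document to the closest cluster center
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centers[i])))
            clusters[nearest].append(v)
        # recalculate each center as the mean of its cluster's vectors
        for i, members in enumerate(clusters):
            if members:
                centers[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return clusters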
What does the similar_text function called in Approach #1 look like? I think what you're referring to isn't clustering, but a similarity metric. I can't really improve on White Walloun's :-) histogram approach - an interesting problem to do some reading on.
However you implement check(), you've got to use it to make at least 200M comparisons (half of 20000^2). The cutoff for "related" articles may limit what you store in the database, but seems too arbitrary to catch all useful clustering of texts.
My approach would be to modify check() to return the "similarity" metric ($prozent or rtn). Write the 20K x 20K matrix to a file and use an external program to perform the clustering, identifying nearest neighbors for each article, which you could load into the related table. I would do the clustering in R - there's a nice tutorial for clustering data in a file, running R from PHP.
