I've got the following setup:
About 1.5M nodes with the label IpRangeBlock, consisting of start and end properties, both of type Long. There's an index on the start property.
What I then do is find the range containing a given IP. So, e.g., for the IP 0.0.0.2, I convert it to a long and then perform the comparison n.start <= 2 && n.end >= 2 on all nodes.
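For reference, the conversion from a dotted-quad address to a long can be done with the Python standard library; a minimal sketch:

import ipaddress

def ip_to_long(ip: str) -> int:
    # IPv4Address packs the four octets into one integer.
    return int(ipaddress.IPv4Address(ip))

assert ip_to_long("0.0.0.2") == 2
assert ip_to_long("1.0.0.0") == 256 ** 3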
The cypher query I run looks like this:
MATCH (n:IpRangeBlock) WHERE n.start <= {ip} AND n.end >= {ip} RETURN n LIMIT 1
All is fine, though as I mentioned, with the 1.5M nodes I have it can take up to 20s for Neo4j to find the matching range. My question is: is there a way to speed up this operation, or is the fault in my DB design?
OK, I tried caching node references and performing the comparison on the app side. As you might expect, pulling that many nodes takes time.
So I tried another approach: I examined our data set, and it turned out that for every IP range, the start and end addresses begin with the same first octet. I used those octets as grouping nodes to quickly narrow down the subset of candidate IP ranges. This worked well, as our dataset is well distributed across all IP ranges; now, instead of comparing 100k nodes' properties, each query has to do it for 'only' around 8-10k.
I know it's not a perfect approach, but it worked for me. There's a Neo4j article I got this idea from.
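Roughly, the lookup ends up like this; a minimal Python sketch, where the :Octet label, the :CONTAINS relationship, and the connection details are illustrative rather than our exact schema:

from neo4j import GraphDatabase  # any Neo4j client works the same way

# Illustrative schema: one (:Octet {value}) grouping node per first octet,
# linked via [:CONTAINS] to its (:IpRangeBlock {start, end}) nodes.
LOOKUP = """
MATCH (o:Octet {value: $octet})-[:CONTAINS]->(n:IpRangeBlock)
WHERE n.start <= $ip AND n.end >= $ip
RETURN n
LIMIT 1
"""

def find_range(session, ip_long):
    octet = ip_long >> 24  # the first octet selects the grouping node
    record = session.run(LOOKUP, octet=octet, ip=ip_long).single()
    return record["n"] if record else None

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    print(find_range(session, 2))  # 0.0.0.2

This way the grouping node prunes the candidate set before the range comparison ever runs.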
I'm creating an application using MarkLogic 8 and the Search API. I need to create facets based on MarkLogic collections, but instead of the facet count giving a tally of the number of fragments (documents) that contain occurrences of the searched keyword, I need the facet count to reflect the total number of times the keyword appears across all documents in the collection.
Right now, I'm using search:search() to process the query and return a search:response element with the facet option enabled.
In the MarkLogic documentation, I've been looking at cts:frequency(), which says:
"If you want the total frequency instead of the fragment-based frequency (that is, the total number of occurrences of the value in the items specified in the cts:query option of the lexicon API), you must specify the item-frequency option to the lexicon API value input to cts:frequency."
But, I can't get that to work.
I've tried running a query like this in query console, but it times out.
cts:element-values(
  QName("http://www.tei-c.org/ns/1.0", "TEI"),
  "", "item-frequency",
  cts:and-query((
    fn:collection("KirchlicheDogmatik/volume4/part3"),
    cts:word-query("lehre"))))
The issue is probably that you have a range index on <TEI>, which contains the entire document. Range indexes are memory-mapped, so you have essentially forced the complete text contents of your database into memory. It's hard to say exactly what's going on, but it's probably struggling to inspect the values (range indexes are designed for smaller atomic values) and possibly swapping to disk.
MarkLogic has great documentation on its indexing, so I'd recommend starting there for a better understanding on how to use them: https://docs.marklogic.com/guide/concepts/indexing#id_51573
Note that even using the item-frequency option, the results (or counts) are not guaranteed to be one-to-one with the "total number of times the keyword appears." It will report the number of "items" matching; in your example, it would report the number of <TEI> elements matching.
The problem of getting an exact count of terms matching a query across the whole database is actually quite hard. To get exact matching values within a document, you would need to use cts:highlight or cts:walk, which requires loading the whole document into memory. That typically works fine for a subset of documents, but ultimately to get an accurate value for the entire database, you would need to load the entire database into memory and process every document.
Nearly any approach to getting a term match count requires some kind of approximation and depends heavily on your markup. For example, if you index <p> (or even better <s>) elements, it would be possible to construct a query that uses indexes to count the number of matching paragraphs (or sentences), but that would still load an incredibly large amount of data into memory and keep it there. This is technically feasible if you are willing to allocate enough memory (and/or enough servers), but it hardly seems worth it.
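To make the distinction between these counts concrete, here is a toy plain-Python illustration (nothing MarkLogic-specific; the corpus is made up):

import re

# Toy corpus: each string stands in for one document/fragment.
docs = [
    "die lehre von der lehre",  # 2 occurrences
    "eine andere lehre",        # 1 occurrence
    "kein treffer hier",        # 0 occurrences
]

term = "lehre"
per_doc = [len(re.findall(rf"\b{term}\b", d)) for d in docs]

fragment_frequency = sum(1 for c in per_doc if c > 0)  # docs containing the term: 2
total_occurrences = sum(per_doc)                       # every match counted: 3

print(fragment_frequency, total_occurrences)

Getting total_occurrences exactly is what forces the document-scanning loop described above; fragment_frequency (and item frequency, which counts matching values rather than matching words) is what the indexes can answer cheaply.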
I am not very experienced with Neo4j, and I have the requirement of finding all paths from a selection A of nodes to a selection B of nodes.
There are around 600 nodes in the DB, with some relationships per node.
Node properties:
riskId
de_DE_description
en_GB_description
en_US_description
impact
Selection:
Selection A is determined by a property match (property: 'riskId')
Selection B is a known constant list of nodes (label: 'Core')
The following query returns the result I want, but it seems a bit slow to me:
match p=(node)-[*]->(:Core)
where node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN extract (n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description] )
as `risks`, length(p)
This query results in 7 rows with between 1 and 4 nodes per row, so not much.
I get around 270ms or more response time in my local environment.
I have not created any indices or done any other performance attempts.
Any hints on how I can craft the query in a more intelligent way, or any performance-tuning tricks I could apply?
Thank you very much,
Manuel
If there is not yet a single label that is shared by all the nodes that have the riskId property, you should add such a label (say, :Risk) to all those nodes. For example:
MATCH (n)
WHERE EXISTS(n.riskId)
SET n:Risk;
A node can have multiple labels. This alone can make your query faster, as long as you specify that node label in your query, since it would restrict scanning to only Risk nodes instead of all nodes.
However, you can do much better by first creating an index, like this:
CREATE INDEX ON :Risk(riskId);
After that, this slightly altered version of your query should be much faster, as it would use the index to quickly get the desired Risk nodes instead of scanning:
MATCH p=(node:Risk)-[*]->(:Core)
WHERE node.riskId IN ["R47","R48","R49","R50","R51","R14","R3"]
RETURN
EXTRACT(n IN nodes(p)| [n.riskId, n.impact, n.en_GB_description]) AS risks,
LENGTH(p);
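If you run this from application code, pass the riskId list as a parameter so the query stays generic; a sketch with the official Neo4j Python driver, where the connection details are placeholders:

from neo4j import GraphDatabase

# EXTRACT() matches the Cypher above; on newer Neo4j versions use a list
# comprehension instead: [n IN nodes(p) | [n.riskId, n.impact, n.en_GB_description]]
QUERY = """
MATCH p = (node:Risk)-[*]->(:Core)
WHERE node.riskId IN $riskIds
RETURN
  EXTRACT(n IN nodes(p) | [n.riskId, n.impact, n.en_GB_description]) AS risks,
  LENGTH(p) AS len
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    for record in session.run(QUERY, riskIds=["R47", "R48", "R49", "R50", "R51", "R14", "R3"]):
        print(record["risks"], record["len"])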
I'm looking for a fast Cypher statement that returns all relationships between a known set of nodes (I have their Neo4j ID's), so that I can assemble the subgraph for that particular set of nodes. I'm working within a label called label which has around 50K nodes and 800K edges between these nodes.
I have several working approaches for this, but none are fast enough for my application, even at small set sizes (less than 1000 nodes).
For example, the following statement does the trick:
MATCH (u:label)-[r]->(v:label)
WHERE (ID(u) IN {ids}) AND (ID(v) IN {ids})
RETURN collect(r)
Where {ids} is a list of numeric Neo4j ids given as parameter to the Py2Neo cypher.execute(statement, parameters) method. The problem is that it takes around 34 seconds for a set of 838 nodes, which returns all 19K relationships between them. I realize the graph is kind of dense, but it takes 1.76 seconds for every 1000 edges returned. I just don't think that's acceptable.
If I use the START clause instead (shown below), the time is actually a little worse.
START u=node({ids}), v=node({ids})
MATCH (u:label)-[r]->(v:label)
RETURN collect(r)
I've found many similar questions/answers, however they all fall short in some aspect. Is there a better statement for doing this, or even a better graph schema, so that it can scale to sets of thousands of nodes?
UPDATE
Thanks for the fast replies. First: running my current query with 528 nodes as input (len(ids) = 528) takes 32.1 seconds, and the query plan is below.
NodeByIdSeek: 528 hits
Filter: 528 hits
Expand(All): 73,773 hits
Filter: 73,245 hits
Projection: 0 hits
Filter: 0 hits
Brian Underwood's query, with the same input, takes 27.8 seconds. The query plan is identical, except for the last 2 steps (Projection and Filter), which don't exist for his query. However, the sum of db hits is the same.
Michael Hunger's query takes 26.9 seconds and the query plan is identical to Brian's query.
I've restarted the server between experiments to avoid cache effects (there's probably a smarter way to do it). I'm also querying straight from the web interface to bypass possible bottlenecks in my code and the libs I'm using.
Bottom line: Neo4j seems smart enough to optimize my query, but it's still pretty slow even with fairly small sets. Any suggestions?
I think the problem is that the query is doing a Cartesian product to get all combinations of the 838 nodes, so you end up searching 838 * 838 = 702,244 combinations.
I'm curious how this would perform:
MATCH (u:label)-[r]->(v:label)
WHERE (ID(u) IN {ids})
WITH r, v
WHERE (ID(v) IN {ids})
RETURN collect(r)
Also, why do the collect at the end?
How big are your id-lists?
Try this:
MATCH (u) WHERE ID(u) IN {ids}
WITH u
MATCH (u)-[r]->(v)
WHERE ID(v) IN {ids}
RETURN count(*)

If the count comes back fast enough, return the relationships themselves:

MATCH (u) WHERE ID(u) IN {ids}
WITH u
MATCH (u)-[r]->(v)
WHERE ID(v) IN {ids}
RETURN r
Also try creating a query plan by prefixing your query with PROFILE; then you can see where the cost is.
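As a sketch of how one might compare the variants from application code (Python with the official driver; the connection details are placeholders, $ids is the newer parameter syntax, and py2neo's cypher.execute takes parameters the same way):

import time
from neo4j import GraphDatabase

QUERIES = {
    "original": """
        PROFILE MATCH (u:label)-[r]->(v:label)
        WHERE ID(u) IN $ids AND ID(v) IN $ids
        RETURN collect(r)
    """,
    "two-step": """
        PROFILE MATCH (u) WHERE ID(u) IN $ids
        WITH u
        MATCH (u)-[r]->(v)
        WHERE ID(v) IN $ids
        RETURN count(*)
    """,
}

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    ids = list(range(1000))  # stand-in for the real node ID list
    for name, query in QUERIES.items():
        start = time.perf_counter()
        summary = session.run(query, ids=ids).consume()  # discard rows, keep stats
        print(name, f"{time.perf_counter() - start:.2f}s")
        print(summary.profile)  # the executed plan, with db hits per operator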
Source: Google Interview Question
Given a large network of computers, each keeping log files of visited urls, find the top ten most visited URLs.
Have many large <string (url) -> int (visits)> maps.
Calculate <string (url) -> int (sum of visits across all distributed maps)> and get the top ten of the combined map.
Main constraint: the maps are too large to transmit over the network. Also, you can't use MapReduce directly.
I have now come across quite a few questions of this type, where processing needs to be done over large distributed systems, and I can't think of or find a suitable answer.
All I could think of is brute force, which in one way or another violates the given constraint.
It says you can't use MapReduce directly, which is a hint that the author of the question wants you to think about how MapReduce works, so we will just mimic its actions:
1. Pre-processing: let R be the number of servers in the cluster, and give each server a unique ID from 0, 1, 2, ..., R-1.
2. (Map) For each (string, count) entry, send the tuple to the server whose ID is hash(string) % R.
3. (Reduce) Once step 2 is done (a simple control message), each server produces the (string, count) pairs for its top 10 strings. Note that each server only holds the tuples that were sent to it in step 2, so its totals are exact.
4. (Map) Each server sends its top 10 to one server (say, server 0). This is fine: there are only 10*R such records.
5. (Reduce) Server 0 yields the top 10 across the network — see the sketch below.
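A single-process Python simulation of these steps, where hash partitioning stands in for the network sends and the data is made up:

from collections import Counter
from hashlib import md5

R = 4  # number of servers

# Each server's local url -> visits map (toy data).
local_maps = [
    Counter({"a.com": 5, "b.com": 3, "c.com": 1}),
    Counter({"a.com": 2, "c.com": 9}),
    Counter({"b.com": 4, "d.com": 7}),
    Counter({"d.com": 1, "a.com": 6}),
]

def owner(url: str) -> int:
    # A stable hash, so every server routes a URL to the same owner
    # (Python's built-in hash() is salted per process, hence md5).
    return int(md5(url.encode()).hexdigest(), 16) % R

# Step 2 (map): route each (url, count) tuple to the owning server.
partitions = [Counter() for _ in range(R)]
for local in local_maps:
    for url, count in local.items():
        partitions[owner(url)][url] += count

# Step 3 (reduce): each server now holds exact totals for its own URLs
# and reports its local top 10.
per_server_top10 = [p.most_common(10) for p in partitions]

# Steps 4-5: server 0 merges the at most 10*R candidates.
merged = Counter(dict(kv for top in per_server_top10 for kv in top))
print(merged.most_common(10))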
Notes:
The problem with this algorithm, as with most big-data algorithms that don't use a framework, is handling failing servers. MapReduce takes care of that for you.
The above algorithm can be translated into a two-phase MapReduce job pretty straightforwardly.
In the worst case, any algorithm that does not transmit the whole frequency table is going to fail: we can construct a trivial case where the global top 10 all sit at the bottom of every individual machine's list.
If we assume that the frequencies of URIs follow Zipf's law, we can come up with effective solutions. One such solution follows.
Each machine sends its top K elements, where K depends solely on the available bandwidth. One master machine aggregates the frequencies and finds the value of the 10th-highest frequency, "V10". Note that this is only a lower bound: since the global top 10 may not be in the top K of every machine, the sums are incomplete.
In the next step, every machine sends back a list of the URIs whose local frequency is at least V10/M (where M is the number of machines); any URI with global frequency at least V10 must pass this filter on at least one machine. The union of all such lists is sent to every machine, each machine replies with its exact frequency for each URI on the list, and the master aggregates these into the final top-10 list.
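A toy Python sketch of the two rounds; K, M, the data, and the use of "top 2" instead of "top 10" are all illustrative:

from collections import Counter

M, K, TOP_N = 3, 2, 2  # machines, top-K sent in round 1, final list size

machines = [
    Counter({"a": 50, "b": 40, "c": 30, "d": 5}),
    Counter({"b": 45, "c": 35, "a": 4, "e": 2}),
    Counter({"c": 60, "e": 20, "a": 3}),
]

# Round 1: each machine sends its top K; the master forms (incomplete) sums
# and takes the Nth-highest partial sum as the lower bound V_N.
partial = Counter()
for m in machines:
    partial.update(dict(m.most_common(K)))
v_n = partial.most_common(TOP_N)[-1][1]

# Round 2: every machine nominates URIs with local frequency >= V_N / M ...
threshold = v_n / M
candidates = set()
for m in machines:
    candidates |= {uri for uri, c in m.items() if c >= threshold}

# ... and then reports exact counts for the whole candidate list.
exact = Counter()
for m in machines:
    for uri in candidates:
        exact[uri] += m.get(uri, 0)

print(exact.most_common(TOP_N))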
We recently did some research on how to speed things up a bit in sphinxsearch.
We found a great way to speed things up is to use a distributed index.
We ran real-life tests, and found that queries execute somewhere between 35-40% faster when a distributed index is used.
What I mean by distributed is basically our regular index split up into 4 parts (the box hosting this index has 4 cores), by adding AND id % 4 = 0 (respectively 1, 2, 3) to the source query of each part of the index.
FYI, id is our primary key / auto increment.
So instead of having one huge index, this splits it up into 4.
And then we just use type = distributed plus local = ... entries, one per part, for a 'put all the parts together' index (see the sketch below).
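Roughly, the relevant sphinx.conf looks like this; a sketch where the source/index names, the table, and the paths are made up, and the MySQL connection settings are omitted:

# one source per part; only the modulus differs
source part0_src
{
    type      = mysql
    sql_query = SELECT id, title, body FROM documents WHERE id % 4 = 0
}
# part1_src .. part3_src are identical except for "= 1", "= 2", "= 3"

index part0
{
    source = part0_src
    path   = /var/data/sphinx/part0
}
# index part1 .. part3 likewise

# the 'put all the parts together' index
index dist_all
{
    type  = distributed
    local = part0
    local = part1
    local = part2
    local = part3
}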
We did some quick testing, the same results come back... Only 35-40% faster :)
So, before we implement this site wide, we would like to know:
Does switching to a distributed index like the one mentioned above impact sorting in any way?
We ask this because we use sphinx for a number of SEO related items, and we NEED to keep the order of the results the same.
I should also mention that the queries, all query options, etc. stay the same. Any and all changes were done on the daemon end.
Thanks!
Sorting should be unaffected. You suffer a bigger performance hit when using distributed indexes with high offsets, but the first few pages will be fine.
As far as I know, the gotchas are grouping/clustering and kill-lists. But if you're not using them, there should be nothing to worry about.