JanusGraph sometimes returns empty results when an indexed property is used in the has() clause - janusgraph

We are using JanusGraph (with Cassandra) as our backend database and facing some issues with the indexed property when using it in the has() clause.
The following query returns a null response:
g.V().has('fooName', 'fooValue').has('barName', 'barValue')
But, the following query returns the correct response:
g.V().has('barName', 'barValue').values('fooName')
==> fooValue
Both the properties are indexed (Composite). The data set is around 20k vertices with the value of 'fooName' property as 'fooValue', which is also confirmed with the following query which works:
g.V().has('fooName', 'fooValue').count()
==> 20000
This happens intermittently and not for all the vertices. Out of the 20k vertices, around 6k vertices show the above issue. The method of adding the property value is the same for all the vertices.
Is it the case that if we add a Composite Index for Vertices where the domain of values will be small and the range of sets of Vertices will be few with one set having a very large share of the universe that the resulting index queries will fail by falsely claiming that present Vertices that match the predicate are absent?
Rather than triggering a full scan, the traversals are reporting that the present Vertices are absent. If this is the case, at what threshold does the Composite Index begin to fail and where can this be tuned?
We are aware that there are recent changes to index caching and transactions. For the failing indexed query, we also observe that the "g.V().has( 'fooName', 'fooValue' ).count()" is limited to a query-limit value of 4000 (as shown in a .profile() report) until the transaction is committed. After the commit, the count() jumps back to the expected value. (20,000). Is this related?

#Anya let's look at https://github.com/JanusGraph/janusgraph/issues/735 and
This looks equal to [his] post on gitter on nov 15th. The query optimizer
inserted a limit(4000), perhaps here too. Turning off smart-limit (
query.smart-limit=false ) solved the issue.
https://docs.janusgraph.org/basics/configuration-reference/

Related

Neo4j building initial graph is slow

I am trying to build out a social graph between 100k users. Users can sync other social media platforms or upload their own contacts. Building each relationship takes about 200ms. Currently, I have everything uploaded on a queue so it can run in the background, but ideally, I can complete it within the HTTP request window. I've tried a few things and received a few warnings.
Added an index to the field pn
Getting a warning This query builds a cartesian product between disconnected patterns. - I understand why I am getting this warning, but no relationship exists and that's what I am building in this initial call.
MATCH (p1:Person {userId: "....."}), (p2:Person) WHERE p2.pn = "....." MERGE (p1)-[:REL]->(p2) RETURN p1, p2
Any advice on how to make it faster? Ideally, each relationship creation is around 1-2ms.
You may want to EXPLAIN the query and make sure that NodeIndexSeeks are being used, and not NodeByLabelScan. You also mentioned an index on :Person(pn), but you have a lookup on :Person(userId), so you might be missing an index there, unless that was a typo.
Regarding the cartesian product warning, disregard it, the cartesian product is necessary in order to get the nodes to create the relationship, this should be a 1 x 1 = 1 row operation so it's only going to be costly if multiple nodes are being matched per side, or if index lookups aren't being used.
If these are part of some batch load operation, then you may want to make your query apply in batches. So if 100 contacts are being loaded by a user, you do NOT want to execute 100 queries each, with each query adding a single contact. Instead, pass as a parameter the list of contacts, then UNWIND the list and apply the query once to process the entire batch.
Something like:
UNWIND $batch as row
MATCH (p1:Person {pn: row.p1}), (p2:Person {pn: row.p2)
MERGE (p1)-[:REL]->(p2)
RETURN p1, p2
It's usually okay to batch 10k or so entries at a time, though you can adjust that depending on the complexity of the query
Check out this blog entry for how to apply this approach.
https://dzone.com/articles/tips-for-fast-batch-updates-of-graph-structures-wi
You can use the index you created on Person by suggesting a planner hint.
Reference: https://neo4j.com/docs/cypher-manual/current/query-tuning/using/#query-using-index-hint
CREATE INDEX ON :Person(pn);
MATCH (p1:Person {userId: "....."})
WITH p1
MATCH (p2:Person) using index p2:Person(pn)
WHERE p2.pn = "....."
MERGE (p1)-[:REL]->(p2)
RETURN p1, p2

Repeated uses of TOP in query

Finding TOP items in query
I have an Access query with fields (in simplified form) name, type, value. I need to extract the top x records (according to value) FOR EVERY PAIR (name, type) with x depending on the pair. The query already has the column "value" sorted for each pair.
Solution 1. Do separate queries per pair, take the top x in each and build the union of the queries. Wrong! the number of pairs is large, Access can't handle the resulting query.
Solution 2. Add an extra column to the query, call it "Valid" and set it to True in all records. Then use VBA to traverse the recordset items of the query one by one and set Valid to False for the non-top items. Then do an additional query dropping the false records. Wrong again, the recordset is not editable in VBA (even though "Valid" has nothing to do with any tables used in the query). Yes, I opened the recordset in VBA with dbOpenDynaset -- no dice.
Any ideas? Thanks

Duplicate results when sorting using a Spring-Data pageable object on a JPA repository

I have a rest-api that returns a list of users when called. The API uses the org.springframework.data.domain.Pageable to paginate and sort the results. This works by simply passing the pageable to the JPA repository, which then returns the desired page.
For some reason, when sorting by the first name, there is a chance that a duplicate entry will appear, but only if multiple entries have the same first name. However, this never happens when sorting by lastName. Both are simply strings in the entity, there is no discernable difference besides the property name.
Have any of you ever encountered this and if so, how did you fix it?
EDIT: To clarify, there is basically no logic of mine between the controller and the repository. I just pass the pageable through and return the results.
EDIT 2: Solved! Interesting titbit: The reason why the issue was only occurring when sorting by the first name was that only there were there always records that appeared on page 1 and 2, regardless which way the entries were sorted. Lucky me that our testers used that specific test data, or I might have never noticed.
Yes, I have seen that: a record can appear on, say, page 1 and then again on page 2.
This is an issue at the database level. For example 10 items per page and items at positions 10 and 11 have the same value for the property then it can be random which one appears in which position in each resultset.
Therefore apply a secondary sort on a unique property - the database ID for example - in order to ensure consistent ordering across requests.

Why is a paginated query slower than a plain one with Spring Data?

Given I have a simple query:
List<Customer> findByEntity(String entity);
This query returns 7k records in 700ms.
Page<Customer> findByEntity(String entity, Pageable pageable);
this query returns 10 records in 1080ms. I am aware of the additional count query for pagination, but still something seems off. Also one strange thing I've noticed is that if I increase page size from 10 to 1900, response time is exactly the same around 1080 ms.
Any suggestions?
It might indeed be the count query that's expensive here. If you insist on knowing about the total number of elements matching in the collection there's unfortunately no way around that additional query. However there are two possibilities to avoid more of the overhead if you're able to sacrifice on information returned:
Using Slice as return type — Slice doesn't expose a method to find out about the total number of elements but it allows you to find out about whether a next slice is available. We avoid the count query here by reading one more element than requested and using its (non-)presence as indicator of the availability of a next slice.
Using List as return type — That will simply apply the pagination parameters to the query and return the window of elements selected. However it leaves you with no information about whether subsequent data is available.
Method with pagination runs two query:
1) select count(e.id) from Entity e //to get number of total records
2) select e from Entity e limit 10 [offset 10] //'offset 10' is used for next pages
The first query runs slow on 7k records, IMHO.
Upcoming release Ingalis of Spring Data will use improved algorithm for paginated queries (more info).
Any suggestions?
I think using a paginated query with 7k records it's useless. You should limit it.

how to improve Neo4J performance in creating edges?

i'm building a traffic schedule application using Neo4J, NodeJS and GTFS-data; currently, i'm trying to get
things working for the traffic on a single day on the Berlin subway network. these are the grand totals
i've collected so far:
10 routes
211 stops
4096 trips
83322 stoptimes
to put it simply, GTFS (General Transit Feed Specification) has the concept of a stoptime which denotes the
event of a given train or bus stopping for passengers to board and alight. stoptimes happen on a trip,
which is a series of stoptimes, they happen on a specific date and time, and they happen on a given
stop for a given route (or 'line') in a transit network. so there's a lot of references here.
the problem i'm running into is the amount of data and the time it takes to build the database. in order
to speed up things, i've already (1) cut down the data to a single day, (2) deleted the database files
and have the server create a fresh one (very effective!), (3) searched a lot to get better queries. alas,
with the figures as given above, it still takes 30~50 minutes to get all the edges of the graph.
these are the indexes i'm building:
CREATE CONSTRAINT ON (n:trip) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:stop) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:route) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:stoptime) ASSERT n.id IS UNIQUE;
CREATE INDEX ON :trip(`route-id`);
CREATE INDEX ON :stop(`name`);
CREATE INDEX ON :stoptime(`trip-id`);
CREATE INDEX ON :stoptime(`stop-id`);
CREATE INDEX ON :route(`name`);
i'd guess the unique primary keys should be most important.
and here are the queries that take up like 80% of the running time (with 10% that are unrelated to Neo4J,
and 10% needed to feed the node data using plain HTTP post requests):
MATCH (trip:`trip`), (route:`route`)
WHERE trip.`route-id` = route.id
CREATE UNIQUE (trip)-[:`trip/route` {`~label`: 'trip/route'}]-(route);
MATCH (stoptime:`stoptime`), (trip:`trip`)
WHERE stoptime.`trip-id` = trip.id
CREATE UNIQUE (trip)-[:`trip/stoptime` {`~label`: 'trip/stoptime'}]-(stoptime);
MATCH (stoptime:`stoptime`), (stop:`stop`)
WHERE stoptime.`stop-id` = stop.id
CREATE UNIQUE (stop)-[:`stop/stoptime` {`~label`: 'stop/stoptime'}]-(stoptime);
MATCH (a:stoptime), (b:stoptime)
WHERE a.`trip-id` = b.`trip-id`
AND ( a.idx + 1 = b.idx OR a.idx - 1 = b.idx )
CREATE UNIQUE (a)-[:linked]-(b);
MATCH (stop1:stop)-->(a:stoptime)-[:next]->(b:stoptime)-->(stop2:stop)
CREATE UNIQUE (stop1)-[:distance {`~label`: 'distance', value: 0}]-(stop2);
the first query is still in the range of some minutes which i find longish given that there are only
thousands (not hundreds of thousands or millions) of trips in the database. the subsequent queries that
involve stoptimes take several ten minutes each on my desktop machine.
(i've also calculated whether the schedule really contains 83322 stoptimes each day, and yes, it's plausible:
in Berlin, subway trains run on 10 lines for 20 hours a day with 6 or 12 trips per hour, and there are 173
subway stations: 10 lines x 2 directions x 17.3 stops per line x 20 hours x 9 trips per hour gives 62280,
close enough. there are some faulty? / double / extra stop nodes in the data (211
stops instead of 173), but those are few.)
frankly, if i don't find a way to speed up things at least tenfold (rather more), it'll make little sense to use Neo4J
for this project. just in order to cover the single city of Berlin many, many more stoptimes have to be added,
as the subway is just a tiny fraction of the overall public transport here (e.g. bus and tramway have like
170 routes with 7,000 stops, so expect around 7,000,000 stoptimes each day).
Update the above edge creation queries, which i perform one by one, have now been running for over an hour and not yet finished, meaning that—if things scale in a linear fashion—the time needed to feed the Berlin public transport data for a single day would consume something like a week. therefore, the code currently performs several orders of magnitude too slow to be viable.
Update #MichaelHunger's solution did work; see my response below.
I just imported 12M nodes and 12M rels into Neo4j in 10 minutes using LOAD CSV.
You should see your issues when you run profiling on your queries in the shell.
Prefix your query with profile and look a the profile output if it mentions to use the index or rather just label-scan.
Do you use parameters for your insert queries? So that Neo4j can re-use built queries?
For queries like this:
MATCH (trip:`trip`), (route:`route`)
WHERE trip.`route-id` = route.id
CREATE UNIQUE (trip)-[:`trip/route` {`~label`: 'trip/route'}]-(route);
It will very probably not use your index.
Can you perhaps point to your datasource? We can convert it into CSV if it isn't and then import even more quickly.
Perhaps we can create a graph gist for your model?
I would rather use:
MATCH (route:`route`)
MATCH (trip:`trip` {`route-id` = route.id)
CREATE (trip)-[:`trip/route` {`~label`: 'trip/route'}]-(route);
For your initial import you also don't need create unique as you match every trip only once.
And I'm not sure what your "~label" is good for?
Similar for your other queries.
As the data is public it would be cool to work together on this.
Something I'd love to hear more about is how you plan do express your query use-cases.
I had a really great discussion about timetables for public transport with training attendees last time in Leipzig. You can also email me on michael at neo4j.org
Also perhaps you want to check out these links:
Tramchester
http://www.thoughtworks.com/de/insights/blog/transforming-travel-and-transport-industry-one-graph-time
http://de.slideshare.net/neo4j/graph-connect-v5
https://www.youtube.com/watch?v=AhvECxOhEX0
London Tube Graph
http://blog.bruggen.com/2013/11/meet-this-tubular-graph.html
http://www.markhneedham.com/blog/2014/03/03/neo4j-2-1-0-m01-load-csv-with-rik-van-bruggens-tube-graph/
http://www.markhneedham.com/blog/2014/02/13/neo4j-value-in-relationships-but-value-in-nodes-too/
detailed solution
i'm happy to report that #MichaelHunger's solution works like a charm. i modified the edge-building queries
from the question with the below shapes that keep to the suggested query outline:
MATCH (route:`route`)
MATCH (trip:`trip` {`route-id`: route.id})
CREATE (trip)-[:`trip/route` {`~label`: 'trip/route'}]->(route)
MATCH (trip:`trip`)
MATCH (stoptime:`stoptime` {`trip-id`: trip.id})
CREATE (trip)-[:`trip/stoptime` {`~label`: 'trip/stoptime'}]->(stoptime)
MATCH (stop:`stop`)
MATCH (stoptime:`stoptime` {`stop-id`: stop.id})
CREATE (stop)-[:`stop/stoptime` {`~label`: 'stop/stoptime'}]->(stoptime)
MATCH (a:stoptime)
MATCH (b:stoptime {`trip-id`: a.`trip-id`, `idx`: a.idx + 1})
CREATE (a)-[:linked {`~label`: 'linked'}]->(b)
MATCH (stop1:stop)--(a:stoptime)-[:linked]-(b:stoptime)--(stop2:stop)
CREATE (stop1)-[:distance {`~label`: 'distance', value: 0}]->(stop2)
as can be seen, the trick here is to give each participating node a MATCH statement of its own and to
move the WHERE clause inside the second match condition; presumably, as mentioned above, Neo4J can only
then take advantage of its indexes.
with these queries in place, the process of reading in nodes and building edges takes roughly 13 minutes;
of these 13 minutes, fetching the data from an external source, building the node representations and issuing CREATE queries
takes about 10 minutes, and building almost a half million edges between them is done in about 3 minutes.
right now none of my queries (especially the node CREATE statements and updates for stop distances) use
parametrized queries, which is another potential source for performance gains.
as for the ~label field and also the question why i use dahes in names where underscores would be more
convenient, well, that's a long story about what i perceive good and practical naming that sometimes clashes
with the syntax of some languages (of most languages, should i say). but that's boring detail. maybe more
intersting is the question: why is there a ~label attribute that repeats what the element label says (what
you write after the colon)? well, it's an attempt to comply with Neo4J conventions (we use labels here), take
advantage of the 'identifier, colon, label' syntax of cypher queries, AND to make it so the labels do
appear in the returned values.
mind you, labels are so central to graph thinking the Neo4J way, but *in query results, labels are
conspicuously absent. when you include a relationship that is marked with nothing but a label in your result set,
then that edge will arrive as an empty
object, telling you only that there is something but not what. so i decided i to duplicate the
label on each single node and each single edge. not an optimal solution but at least now i get an informative
graph display in the Neo4J browser.
as for how to express query use-cases, that's an active field of reserach for me right now. i guess it will
all start with a 'field of interest', like 'show all Berlin subway stops', or 'all busses departing within
the next 15 minutes from a bus stop near me'. the data already allows to see which stops are directly connected
by a subway line, their geographical distance, what services are present and what routes they take. the idea
is to grab the data and present them in novel, usable and beatiful ways. 9292 is quite
close to what i imagine; what's missing are graphical representations of spatial and temporal relationships.

Resources