Neo4j Cypher: performance of matching multiple properties and creating relationships

A little context: I'm experimenting with Neo4j (as a newbie, but experienced in other database technologies) for possible use as a master data management system within our business of identity intelligence. In particular I'm looking at building up a graph of places and identity attributes (e.g. email addresses, telephone numbers, electoral roll data), with relationships between these nodes that express something meaningful, for example where an email address has been used, or where a telephone number is registered.
Desired system properties: I would like this system to have some specific properties that are valuable to us:
Fast ingestion of information from a significant number of providers (100+); this precludes lengthy (multi-hour) ETL processes, though short ones are fine!
Online at all times; this precludes use of the batch importer. We are most likely to use a fault-tolerant cluster; sharding would be good :)
Capacity to eventually ingest ~30G records / year (~1000/second) and retain them, plus creation and retention of ~100G relationships / year; right now we are ingesting ~1/10 of this load.
Where I'm stuck: I have been experimenting with a single node in Azure (32GB RAM, 4 cores, non-local disk) running Debian 8 and Neo4j 3.1.1. This happily ingests and relates back together the UK postal address file (PAF), around 29M records, in a few tens of minutes using either LOAD CSV or home-brew Java and Bolt. I have also ingested, but not yet related, a test set of email address data, around 20M records, and now need to build relationships based on matching postcodes, building numbers, and possibly other fields between the two data sets. This is where things get much slower when using Cypher; here's the fastest query I have been able to create thus far:
UNWIND {list} AS i
MATCH(e:DDSEMAIL) WHERE ID(e) = i WITH e
MATCH(s:SUBBNAME) USING INDEX s:SUBBNAME(SBNA)
WHERE upper(e.Building) = s.SBNA WITH e,s
MATCH(m:MAINFILE)
WHERE trim(split(e.Postcode,' ')[0]) = m.OUTC AND
trim(split(e.Postcode,' ')[1]) = m.INCO AND
right('0000'+e.HouseNo,4) = m.BNUM AND
(m)-[:IS_SUBBNAME]->(s)
CREATE (e)-[r:USED_AT]->(m)
RETURN COUNT(r);
Indexes are:
ON :DDSEMAIL(HouseNo) ONLINE
ON :DDSEMAIL(Postcode) ONLINE
ON :DDSEMAIL(Building) ONLINE
ON :MAINFILE(OUTC) ONLINE
ON :MAINFILE(INCO) ONLINE
ON :MAINFILE(BNUM) ONLINE
ON :SUBBNAME(SBNA) ONLINE
Please note that the {list} parameter is supplied through Bolt from a Java client that has already enumerated all ~20M DDSEMAIL nodes, and is batching the IDs into transactions (typically 1000 IDs at a time).
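For reference, a minimal sketch of this kind of batching client, assuming the Neo4j Java driver 1.x over Bolt (connection details, credentials, and the batch handling here are illustrative, not the exact code in use):
import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Session;
import org.neo4j.driver.v1.Transaction;
import org.neo4j.driver.v1.Values;

import java.util.List;

public class RelateEmails {

    // Same Cypher as shown above, kept as one parameterised statement.
    private static final String RELATE_QUERY =
            "UNWIND {list} AS i "
          + "MATCH (e:DDSEMAIL) WHERE ID(e) = i WITH e "
          + "MATCH (s:SUBBNAME) USING INDEX s:SUBBNAME(SBNA) "
          + "WHERE upper(e.Building) = s.SBNA WITH e, s "
          + "MATCH (m:MAINFILE) "
          + "WHERE trim(split(e.Postcode,' ')[0]) = m.OUTC "
          + "AND trim(split(e.Postcode,' ')[1]) = m.INCO "
          + "AND right('0000'+e.HouseNo,4) = m.BNUM "
          + "AND (m)-[:IS_SUBBNAME]->(s) "
          + "CREATE (e)-[r:USED_AT]->(m) "
          + "RETURN count(r)";

    /** One transaction per batch of ~1000 DDSEMAIL node IDs, as described above. */
    static void relate(Driver driver, List<List<Long>> idBatches) {
        try (Session session = driver.session()) {
            for (List<Long> batch : idBatches) {
                try (Transaction tx = session.beginTransaction()) {
                    tx.run(RELATE_QUERY, Values.parameters("list", batch));
                    tx.success(); // mark for commit; the 1.x driver commits on close()
                }
            }
        }
    }

    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "secret"))) {
            // Enumerating the ~20M DDSEMAIL node IDs into idBatches is elided here.
        }
    }
}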
This is taking between 100-200 ms per ID; over a test run of 157,000 IDs it took 7.3 hours, indicating a full execution time of ~760 hours, or more than a month. The underlying machine appears CPU-bound (no significant IO wait time).
Looking at the EXPLAIN for this query, there are no full scans; it's all schema index matching (once I had included the explicit USING INDEX hint), so I'm not sure where to look for more speed.
(edited to add this PROFILE output):
[Screenshot: PROFILE output, part 1]
[Screenshot: PROFILE output, part 2]
This shows that the match on both parts of the postcode is filtering a lot of rows (56k); it may be better to re-order these fields to reduce the filter's input size.
(end of edit)
As a (very unfair) comparison, I pushed both sets of data from CSV files into a custom Bloom filter written in C#/.NET, which performs similar field reformatting to the above, then concatenates the fields to generate textual keys and matches those keys together. This completed matching all 20M email records against all 29M PAF records in under 5 minutes on a single core of my laptop, and was largely IO-bound.
Right now I'm considering using an external application or a user-defined procedure to perform the record matching, and just creating the relationships using Cypher, but it feels wrong to bypass a well-written query engine that ought to be able to do this much, much quicker than it currently is.
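To make that concrete, here is a rough sketch of what such an external matcher could look like, mirroring the key-concatenation idea from the Bloom-filter comparison above (all class and field names are hypothetical, and it assumes the MAINFILE-to-SUBBNAME join has been flattened into one record per address); the resulting ID pairs could then be fed back in batches through a plain CREATE of the USED_AT relationships:
import java.util.*;

/** Hypothetical record holders; in practice these would be streamed from Neo4j or CSV. */
class PafRecord  { long nodeId; String outc, inco, bnum, sbna; }
class MailRecord { long nodeId; String postcode, houseNo, building; }

public class ExternalMatcher {

    /** Same normalisation as the Cypher query: OUTC|INCO|BNUM|BUILDING. */
    static String keyFromPaf(PafRecord p) {
        return p.outc + "|" + p.inco + "|" + p.bnum + "|" + p.sbna;
    }

    static String keyFromMail(MailRecord e) {
        String[] pc = e.postcode.split(" ");
        String bnum = "0000" + e.houseNo;
        bnum = bnum.substring(bnum.length() - 4);            // right('0000'+HouseNo, 4)
        return pc[0].trim() + "|" + (pc.length > 1 ? pc[1].trim() : "") + "|"
             + bnum + "|" + e.building.toUpperCase();
    }

    /** Builds key -> PAF node IDs in memory, then streams the email records against it. */
    static List<long[]> match(Iterable<PafRecord> paf, Iterable<MailRecord> mail) {
        Map<String, List<Long>> index = new HashMap<>();
        for (PafRecord p : paf) {
            index.computeIfAbsent(keyFromPaf(p), k -> new ArrayList<>()).add(p.nodeId);
        }
        List<long[]> pairs = new ArrayList<>();              // {emailNodeId, mainfileNodeId}
        for (MailRecord e : mail) {
            for (Long mId : index.getOrDefault(keyFromMail(e), Collections.emptyList())) {
                pairs.add(new long[]{e.nodeId, mId});
            }
        }
        return pairs;                                        // create USED_AT rels from these in batches
    }
}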
What should I be looking at to improve performance please?

If I recall correctly, the index won't be utilized when there are transformations (such as UPPER(), LOWER() or TRIM()) applied to comparison values sourced from another node's properties. You may need to perform these operations first and alias the results, then do the match.
Providing the index hint gets around this, I think, so your match on s.SBNA should be using the index correctly, but if there's an index on any of the matched properties of m:MAINFILE, that one may not be used.
Test to see if this makes a difference, comparing this query to the older query on a smaller data set:
UNWIND {list} AS i
MATCH(e:DDSEMAIL) WHERE ID(e) = i
WITH e, upper(e.Building) as SBNA
MATCH(s:SUBBNAME)
WHERE s.SBNA = SBNA
WITH e,s, trim(split(e.Postcode,' ')[0]) as OUTC,
trim(split(e.Postcode,' ')[1]) as INCO,
right('0000'+e.HouseNo,4) as BNUM
MATCH(m:MAINFILE)
WHERE OUTC = m.OUTC AND
INCO = m.INCO AND
BNUM = m.BNUM AND
(m)-[:IS_SUBBNAME]->(s)
CREATE (e)-[r:USED_AT]->(m)
RETURN COUNT(r);
Also, if you could add a screenshot of a PROFILE or EXPLAIN of the query to your description (after expanding all plan nodes) that may help to see where things could improve.
EDIT
As you mentioned in your description, batching these may be a good idea. APOC Procedures has apoc.periodic.iterate(), which may help here.
Let's see if we can apply that to your query. Try this out:
WITH {list} AS list
CALL apoc.periodic.iterate('
UNWIND {list} as list
RETURN list
', '
WITH {list} as i
MATCH(e:DDSEMAIL) WHERE ID(e) = i
WITH e, upper(e.Building) as SBNA
MATCH(s:SUBBNAME)
WHERE s.SBNA = SBNA
WITH e,s, trim(split(e.Postcode,' ')[0]) as OUTC,
trim(split(e.Postcode,' ')[1]) as INCO,
right('0000'+e.HouseNo,4) as BNUM
MATCH(m:MAINFILE)
WHERE OUTC = m.OUTC AND
INCO = m.INCO AND
BNUM = m.BNUM AND
(m)-[:IS_SUBBNAME]->(s)
MERGE (e)-[:USED_AT]->(m)
', {batchSize:1000, iterateList:true, params:{list:list}}) YIELD batches, total, committedOperations, failedOperations, failedBatches, errorMessages
RETURN batches, total, committedOperations, failedOperations, failedBatches, errorMessages
We have to sacrifice returning the total number of relationships created, however, as we can't return values from the batched query.

Related

Dynamodb total record count for Pagination

I am planning to leverage AWS DynamoDB for one of our legacy applications. I have done the data modelling to persist the data in DDB and have come up with a single table, as it is proving effective in my use case.
But there is one requirement where I need to show the total count of qualifying records for a Query, for pagination.
Apart from scanning the whole table, is there anything out of the box to get the total count of qualifying records?
Thanks
You can use the DescribeTable API for that. It will return several JSON values, including the ItemCount you need.
This might not be 100% up to date because of DynamoDB's NoSQL nature: the count is updated roughly every 6 hours. If you need a live count, you have to scan the entire table, but Scan is also an eventually consistent operation.
If your question is about a count based on some condition, then no: you have to use Scan or Query, depending on how you want to implement the conditions.
More details:
https://docs.aws.amazon.com/cli/latest/reference/dynamodb/describe-table.html
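For illustration, a minimal sketch of reading that approximate count with the AWS SDK for Java v2 (the table name is a placeholder):
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.DescribeTableRequest;
import software.amazon.awssdk.services.dynamodb.model.DescribeTableResponse;

public class ApproximateCount {
    public static void main(String[] args) {
        try (DynamoDbClient ddb = DynamoDbClient.create()) {
            DescribeTableResponse resp = ddb.describeTable(
                    DescribeTableRequest.builder().tableName("my-legacy-table").build());
            // ItemCount is refreshed roughly every six hours, so treat it as approximate.
            System.out.println("Approximate item count: " + resp.table().itemCount());
        }
    }
}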

Elasticsearch index design

I am maintaining a year's worth of user activity, including browse and purchase data. Each browse/purchase entry is a JSON object: {item_id: id1, item_name: name1, category: c1, brand: b1, event_time: t1}.
I would like to compose different queries, such as getting all customers who browsed item A and/or purchased item B within the time range t1 to t2. There are tens of millions of customers.
My current design is to use nested object for each customer:
customer1:
customer_id,id1,
name: name1,
country: US,
browse: [{browseentry1_json},{browseentry2_json},...],
purchase: [{purchase entry1_json},{purchase entry2_json},...]
With this design, I can easily compose all kinds of queries using nested queries. The only problem is that it is hard to expire older browse/purchase data: I only want to keep, for example, a year of browse/purchase data. With this design, I would at some point have to read the entire index out, delete the expired browse/purchase entries, and write them back.
Another design is to use a parent/child structure:
type user is the parent of types browse and purchase.
type browse will contain each browse entry.
Although deleting old data seems easier with delete-by-query, for the above queries I would have to do multiple and/or has_child queries, which would be much less performant. In fact, I initially used the parent/child structure, but the query times seemed really long, so I gave it up and switched to nested objects.
I am also thinking about keeping nested objects but breaking the data into separate indices (such as monthly indices) so that I can easily expire old data. The problem with this approach is that I would have to query across those multiple indices and aggregate the results to get the distinct users, which I assume will be much slower (haven't tried yet). One requirement of this project is to be able to return the count for these queries in an acceptable time frame (seconds), and I am afraid this approach may not meet it.
The ES cluster is 7 machines, each 8 cores and 32G memory.
Any suggestions?
Thanks in advance!
Chen
Instead of creating a customers index, I would create "browsing" and "purchasing" indices, separated by a timespan (e.g. monthly, as you mentioned in your last paragraph).
In each document I would add the customer fields. Now you are facing two different approaches:
1. You can add only a reference to the customer (such as an id) and make another query to get their details.
2. If you don't have any storage problem, you can keep all of the customer's data in each document.
If this isn't enough for performance, you can combine it with routing and keep all of a specific user's data on the same shard, so Elasticsearch won't need to fetch data between shards (you can watch this video where Shay Banon explains the "user data flow").
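For illustration, a rough sketch of indexing a browse event into a monthly index with customer-based routing, using the Elasticsearch 7.x high-level REST client (the index name, field values, and connection details are placeholders):
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

import java.io.IOException;

public class BrowseIndexer {
    public static void main(String[] args) throws IOException {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            String customerId = "customer1";
            // Monthly index so old data can be expired by dropping whole indices.
            IndexRequest request = new IndexRequest("browse-2017-06")
                    .routing(customerId)   // keep one customer's events on the same shard
                    .source("{\"customer_id\":\"" + customerId + "\","
                          + "\"item_id\":\"id1\",\"item_name\":\"name1\","
                          + "\"category\":\"c1\",\"brand\":\"b1\","
                          + "\"event_time\":\"2017-06-01T12:00:00Z\"}",
                          XContentType.JSON);
            client.index(request, RequestOptions.DEFAULT);
        }
    }
}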
Niv

What is the most efficient way to filter a search?

I am working with node.js and mongodb.
I am going to have a database set up and use socket.io for real-time updates, which will either query the db again or push the new update to the client.
I am trying to figure out the best way to filter the database.
Some more information in regards to what is being queried and what the real time updates are:
A document in the database will include information such as an address, city, time, number of packages, name, price.
Filters include city/price/name/time (meaning only to see addresses within the same city, or within the same time period)
Real-time info: includes adding a new document to the database which will essentially update the admin on the website with a notification of a new address added.
Method 1: Query the db with the filters being searched?
Method 2: Query the db for everything and then filter it on the client side (JavaScript)?
Method 3: Query the db for everything, store it in localStorage, and then query localStorage with the filters?
I'm trying to figure out what the fastest way is for the user to filter.
Also, if that is different from the most cost-effective way, then the most cost-effective way as well (which I am assuming means fewer db queries)...
It's hard to say, because we don't see the exact conditions of the filter, but in general:
Mongo can use only one index in a query condition. Thus whatever fields are covered by this index can be used for efficient filtering; otherwise it might do a full collection scan, which is slow. If you are using an index, then you are probably doing the most efficient query. (Mongo can still use another index for sorting, though.)
Sometimes you will be forced to do processing on the client side because Mongo can't do what you want, or it would take too many queries.
The least efficient option is to store results somewhere else, simply because IO is slow. This would only benefit you if you use them as a cache and do not recalculate.
Also consider the overhead and latency of networking. If you have to send lots of data back to the client, it will be slower. In general, Mongo will do a better job filtering than you would on the client.
From what you describe, if you can filter addresses by city or time period, then an index could cut out a large share of documents. You will most likely need a compound index over multiple fields; a sketch follows below.
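The question is about Node.js, but the idea is driver-agnostic; here is a rough sketch of method 1 with a compound index, written with the MongoDB Java driver (database, collection, and field values are placeholders):
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.util.Date;

public class FilteredSearch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> packages =
                    client.getDatabase("deliveries").getCollection("packages");

            // Compound index covering the most selective filters (city + time).
            packages.createIndex(Indexes.compoundIndex(
                    Indexes.ascending("city"), Indexes.ascending("time")));

            // Method 1: push the filter to the database so only matching docs cross the network.
            Bson filter = Filters.and(
                    Filters.eq("city", "Toronto"),
                    Filters.gte("time", new Date(0)));   // placeholder time range
            for (Document doc : packages.find(filter)) {
                System.out.println(doc.toJson());
            }
        }
    }
}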

Full-text indexing sluggish. Looking for alternatives

I have a table that I've created a Full Text Catalog on. The table has just over 6000 rows. I've added two columns to the index. The first could be considered a unique identifier of sorts and the second could be considered the content for that item (there are 11 other columns in my table that aren't part of the Full Text Catalog). Here is an example of a couple of rows:
TABLE: data_variables
ROW  unique_id  label
1    A100d1     Personal preference of online shopping sites
2    A100d2     Shopping behaviors for adults in household
In my web application on the front end, I have a text box that the user can type into to get a list of items that match whatever terms they're searching for in the UNIQUE ID or LABEL columns. So, for example, if the user typed in sho or a100 then a list would be populated with both of the rows above. If they typed in behav then a list would be populated with only row 2 above.
This is done via an Ajax request on each keyup. PHP calls a Stored Procedure on the SQL server that looks like:
SELECT TOP 50 dv.id, dv.id + ': ' + dv.label,
dv.type_id, dv.grouping, dv.friendly_label
FROM data_variables dv
WHERE (CONTAINS((dv.unique_id, dv.label), @search))
(@search is the text from the user that is passed into the stored procedure.)
I've noticed that this gets pretty sluggish, especially when I wasn't using TOP 50 in the query.
What I'm looking for is a way to speed this up either directly on the SQL Server or by abandoning the full-text indexing idea and using jQuery to search through an array of the searchable items on the client-side. I've looked a bit into the jQuery AutoComplete stuff and some other jQuery plugins for AutoComplete, but haven't yet tried to mock up anything. That would be my next step, but I wanted to check here first to see what advice I would get.
Thanks in advance.
Several suggestions, based on the fact that you have only 6000 rows, so the database should eat this alive.
A. Try using the LIKE operator, just in case it helps. I'm not expecting it to, but it's pretty trivial to try. Something else is going on here overall for you to be seeing this as slow at such small volumes.
B. Can you cache queries in advance? With 6000 rows, there are probably only 36*36 combinations of 2-character queries, which should take virtually no memory and save the database any work.
C. Moving the selection out to the client is a good idea; it depends on how big the 6000 rows are overall versus the network latency of individual lookups.
D. Combining B and C will give you really good performance, I suspect, but with some coding effort required. If the server maintains a list of all single-character results in cache, and clients download that letter's cache set after the first keystroke, then they potentially have a subset of all rows and won't need any more network IO for additional keystrokes (see the sketch after this list).
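A rough sketch of that precomputed cache from points B and D, assuming all 6000 rows fit comfortably in memory (class and method names are hypothetical, and the word-prefix matching only approximates what CONTAINS does):
import java.util.*;
import java.util.stream.Collectors;

public class PrefixCache {
    record Row(String uniqueId, String label) {}       // minimal stand-in for a data_variables row

    private final Map<String, Set<Row>> byPrefix = new HashMap<>();

    /** Precompute the result set for every 1- and 2-character query. */
    PrefixCache(List<Row> rows) {
        for (Row r : rows) {
            List<String> tokens = new ArrayList<>();
            tokens.add(r.uniqueId().toLowerCase());
            tokens.addAll(Arrays.asList(r.label().toLowerCase().split("\\W+")));
            for (String tok : tokens) {
                for (int len = 1; len <= Math.min(2, tok.length()); len++) {
                    byPrefix.computeIfAbsent(tok.substring(0, len), k -> new LinkedHashSet<>()).add(r);
                }
            }
        }
    }

    /** 1-2 character terms come straight from the cache; longer terms only refine that small candidate set. */
    List<Row> search(String term) {
        String t = term.toLowerCase();
        Set<Row> candidates = byPrefix.getOrDefault(
                t.substring(0, Math.min(2, t.length())), Collections.emptySet());
        return candidates.stream()
                .filter(r -> r.uniqueId().toLowerCase().startsWith(t)
                          || Arrays.stream(r.label().toLowerCase().split("\\W+"))
                                   .anyMatch(w -> w.startsWith(t)))
                .limit(50)
                .collect(Collectors.toList());
    }
}
The same precomputed sets could equally be shipped to the browser after the first keystroke, as point D suggests.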
I would advise against LIKE unless you're relying on a regular (left-to-right) index and doing left-anchored queries like LIKE 'word%'. If you're doing something like LIKE '%word%', a regular index isn't going to help you. You typically want to use a full-text index when you want to search for words inside a paragraph.
With a lot of data, the built-in full-text engines in databases typically aren't very stellar. For the best performance you typically have to go with an external solution built specifically for full-text search.
Some options are Sphinx, Solr, and Elasticsearch, just to name a few. I wouldn't say that any one of these options is better than the others. There are definitely pros and cons to consider:
What kind of data do you have?
What language support do these solutions have?
What database engines do these solutions support?
The best thing you can do is benchmark these solutions against your existing data. Testing each and every individual component (unit testing) can help you identify the real problems and help you find good solutions.
I had the same problem and went for the LIKE solution. I also found the OR operator to be too taxing, so I divided the query into two SELECTs combined with a UNION ALL (fastest, and in my scenario it was impossible for the same text to appear in both the identifier column and the data).
Yours would be something like:
SELECT TOP 50 * FROM (
    SELECT dv.id, dv.id + ': ' + dv.label AS display_label,
           dv.type_id, dv.grouping, dv.friendly_label
    FROM data_variables dv
    WHERE dv.unique_id LIKE '%' + @search + '%'
    UNION ALL
    SELECT dv.id, dv.id + ': ' + dv.label AS display_label,
           dv.type_id, dv.grouping, dv.friendly_label
    FROM data_variables dv
    WHERE dv.label LIKE '%' + @search + '%'
) AS results
Oh!! And test the performance in SQL Server, not the web!
If you plan to increase the amount of data, the best approach is an inverted index for full-text searching.
Look at Apache Solr, one of the best full-text search engines at the moment.
You can simply index your database data periodically and use Solr as the search engine;
it provides a simple AJAX API and can be queried directly from the frontend.
If you really need performance, you may want to look at SQLite's FTS3 and FTS4 modules.
A snippet from another forum:
For example, if each of the 517430 documents in the "Enron E-Mail Dataset" is inserted into both an FTS table and an ordinary SQLite table created using the following SQL script:
Code:
CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT); /* FTS3 table */
CREATE TABLE enrondata2(content TEXT);                    /* Ordinary table */
Then either of the two queries below may be executed to find the number of documents in the database that contain the word "linux" (351). Using one desktop PC hardware configuration, the query on the FTS3 table returns in approximately 0.03 seconds, versus 22.5 seconds for querying the ordinary table.
see...
http://www.sqlite.org/fts3.html

lucene.net paging using query?

I'm using Lucene.NET to produce an index and search it. I'm actually using the API indirectly through the Examine project on CodePlex. I currently have everything working and the paging logic in place; however, the current logic pages the results after the search has completed. I don't like this because it means the search will possibly return thousands of records, and only then does my code take the 10-20 records it needs and discard the rest, which is a major waste of resources. Even if each SearchResult item is just a tiny 3KB, the amount of memory needed to execute these searches will grow with time and become a huge memory hog. My shared host only guarantees 1GB of dedicated memory, so this is a big concern for my website.
So the question is: how do I limit the results in a paged manner using the Lucene query language alone? I looked at the Apache Lucene project, which Lucene.NET is ported from, and I don't see any syntax that lets me do what I'm looking for. Basically I want the equivalent of what SQL Server has to limit rows at the query-language level.
E.g. (this is how we would do paging in SQL, and it only returns 20 records, not every record that matches the WHERE clause):
Select * from (select Row_Number() OVER (ORDER BY OrderDate) as RowNum,
OrderID,
OrderDate
FROM SalesOrders
WHERE OrderCustomerName like 'Davis%') O
WHERE RowNum BETWEEN 1 and 20
I don't think there is a major waste of resources, since search is (simplifying) nothing more than calculating bit vectors and scores. What is costly is reading the documents from the index. Search results (except with the deprecated Hits class) don't read the documents; they just return the doc ids, so there isn't much overhead in skipping the first N results.
The exception to this is when you want to sort the results by some field. Then all documents in the search result list must be read from the index to be able to return them in the correct order.
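To make the cheap skipping concrete, here is a sketch of offset paging against the Lucene Java API, which Lucene.NET mirrors closely (the field name and index path are assumptions):
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.Paths;

public class PagedSearch {

    /** Reads only the documents on the requested page; the hits themselves are just doc ids and scores. */
    static void printPage(IndexSearcher searcher, Query query, int page, int pageSize) throws IOException {
        TopDocs top = searcher.search(query, (page + 1) * pageSize);   // collect just enough hits
        ScoreDoc[] hits = top.scoreDocs;
        for (int i = page * pageSize; i < Math.min(hits.length, (page + 1) * pageSize); i++) {
            Document doc = searcher.doc(hits[i].doc);                  // the expensive part
            System.out.println(doc.get("OrderCustomerName"));
        }
    }

    public static void main(String[] args) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new TermQuery(new Term("OrderCustomerName", "davis"));  // hypothetical field
            printPage(searcher, query, 0, 20);
        }
    }
}
For deep pages, IndexSearcher.searchAfter(lastScoreDocOfPreviousPage, query, pageSize) avoids re-collecting the earlier hits.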
