Neo4j query with condition on concatenated string is very slow - performance

I have Person nodes with basic string fields such as firstName, lastName, fatherName, and motherName, and I am trying to link nodes based on those fields.
A simple query where I compare motherName to the concatenation of first name and last name, such as
match(p1:Person) match (p2:Person) where p1.motherName=p2.firstName+' '+ p2.lastName return p1,p2 limit 500
takes around 1 hour (removing the ' ' from the concatenation makes no difference). Using match(p1:Person),(p2:Person) also makes no difference.
By contrast, comparing exact fields, such as
match(p1:Person) match (p2:Person) where p1.motherName=p2.firstName return p1,p2 limit 500
only takes a few seconds.
I have noticed something peculiar regarding transaction memory: in the first query, estimatedUsedHeapMemory is always 2097152 and currentQueryAllocatedBytes is 64,
but I see the database is consuming around 7.5 GB of memory.
When running the 2nd query, the numbers for heap and query memory are much bigger. Could something be preventing the first query from using as much memory as it needs, and thus making it slow?
I had successfully run a query on all the data to link persons and fathers, matching on exact fields, which took 2.5 hours, while the query for the mothers, which needs to compare concatenated strings, was still running after 9 hours with no result.
The query for father linking, which was successful:
CALL apoc.periodic.iterate(
"match (p1:Person) match(p2:Person) where p1.fatherName=p2.firstName and p1.lastName=p2.lastName and p1.dateOfBirth>p2.dateOfBirth return p1,p2",
"MERGE (p1)-[:CHILD_OF {parentRelationship:'FATHER'}]->(p2)",
{batchSize:5000})
I have 4 million nodes and my database size is 3.14 GB. These are my memory settings:
NEO4J_server_memory_heap_max__size=5G
NEO4J_server_memory_heap_initial__size=5G
NEO4J_server_memory_pagecache_size=7G
I tried running the fast query first, so that it would load the data into memory.
I tried concatenating without the ' '; nothing helps.
I previously had a range index on firstName, which made the father's query similarly slow, with the same cap on used memory; I had to drop it in order to get that query to work.

Below are my suggestions:
Index the dateOfBirth field on the Person node.
String comparison slows down when there is a large set of data. Instead of comparing the strings directly, try apoc.util.md5() (https://neo4j.com/labs/apoc/4.0/overview/apoc.util/apoc.util.md5/).
This produces a hash of the string passed in, which makes the comparison fast. So your query would be:
CALL apoc.periodic.iterate(
"match (p1:Person) match(p2:Person) where apoc.util.md5([p1.fatherName]) = apoc.util.md5([p2.firstName]) and apoc.util.md5([p1.lastName]) = apoc.util.md5([p2.lastName]) and p1.dateOfBirth > p2.dateOfBirth return p1,p2",
"MERGE (p1)-[:CHILD_OF {parentRelationship:'FATHER'}]->(p2)",
{batchSize:5000})
Hope this helps!
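For intuition about why the concatenated predicate is so much slower: p2.firstName + ' ' + p2.lastName has to be rebuilt and compared for every (p1, p2) pair, which is quadratic work. A minimal sketch of the alternative idea of precomputing the full name once per person (for example as a stored, indexed property) so each lookup becomes cheap; the data and the precomputed-key approach are illustrative assumptions, not the asker's schema:

```python
# Illustrative Python, not Cypher; made-up data.
people = [
    {"firstName": "Ada", "lastName": "Byron", "motherName": "Anne Milbanke"},
    {"firstName": "Anne", "lastName": "Milbanke", "motherName": ""},
]

# Build the concatenated key once per node (O(n)) instead of once per pair.
by_full_name = {}
for p in people:
    by_full_name.setdefault(p["firstName"] + " " + p["lastName"], []).append(p)

# Each motherName lookup is now a hash lookup, not a scan of all nodes.
links = [(child, mother)
         for child in people
         for mother in by_full_name.get(child["motherName"], [])]
```

In Cypher terms this would correspond to storing a fullName property on Person and indexing it, so the planner can do index seeks instead of a full Cartesian product.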

Related

Why is a paginated query slower than a plain one with Spring Data?

Given I have a simple query:
List<Customer> findByEntity(String entity);
This query returns 7k records in 700ms.
Page<Customer> findByEntity(String entity, Pageable pageable);
This query returns 10 records in 1080ms. I am aware of the additional count query for pagination, but still something seems off. Also, one strange thing I've noticed is that if I increase the page size from 10 to 1900, the response time stays exactly the same, around 1080ms.
Any suggestions?
It might indeed be the count query that's expensive here. If you insist on knowing the total number of matching elements in the collection, there's unfortunately no way around that additional query. However, there are two ways to avoid more of the overhead if you're able to sacrifice some of the information returned:
Using Slice as return type — Slice doesn't expose a method to find out about the total number of elements but it allows you to find out about whether a next slice is available. We avoid the count query here by reading one more element than requested and using its (non-)presence as indicator of the availability of a next slice.
Using List as return type — That will simply apply the pagination parameters to the query and return the window of elements selected. However it leaves you with no information about whether subsequent data is available.
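The Slice trick of reading one extra element can be sketched in a few lines (illustrative Python, not Spring Data; fetch_slice and the in-memory rows are made up):

```python
def fetch_slice(rows, page, size):
    """Read size + 1 elements; the extra one only tells us whether a
    next slice exists, so no separate count query is needed."""
    start = page * size
    window = rows[start:start + size + 1]
    return window[:size], len(window) > size   # (content, has_next)

content, has_next = fetch_slice(list(range(25)), page=2, size=10)
# content is [20, 21, 22, 23, 24] and has_next is False
```

In a real repository the same effect comes from declaring Slice<Customer> as the return type instead of Page<Customer>.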
A method with pagination runs two queries:
1) select count(e.id) from Entity e //to get number of total records
2) select e from Entity e limit 10 [offset 10] //'offset 10' is used for next pages
The first query runs slow on 7k records, IMHO.
The upcoming Ingalls release of Spring Data will use an improved algorithm for paginated queries (more info).
I think using a paginated query with 7k records is pointless. You should limit it.

Speed up data comparison in powershell acquired via Import-CSV

Simple question, but tough problem.
I have 2 CSV files exported from Excel, one with 65k rows and one with about 50k. I need to merge the data from those 2 files based on this condition:
where File1.Username -eq File2.Username
Note that the datatype for the username property in both files is this :
IsPublic IsSerial Name BaseType
-------- -------- ---- --------
True True String System.Object
And obviously looping through 65k x 50k object properties to compare takes... well, 1 day and 23 hours, as I estimated when I measured a script run on only 10 rows.
I am considering several solutions at this point, like splitting the CSV files and running different portions in different PowerShell sessions simultaneously while giving real-time priority to powershell.exe, but that's cumbersome, and I haven't tested that option, so I can't report on the real performance gain.
I wondered if I should change the datatype instead and use, for instance, .ToString().GetHashCode(), but I tried that option too and, oddly enough, execution was quicker when comparing string vs string than hash-code integer vs hash-code integer.
So long story short, I am looking for a superfast way to compare 65k x 50k string variables.
Any help would be greatly appreciated :)
Thanks!
Elaborating example:
Ok here's a metaphorical example. Suppose you have a database containing the names and equipment of astronauts (SPACE), and another one containing the names and equipment of marine explorers (OCEAN).
So in the SPACE dataset you'll have for instance:
First Name,Last name, Username, space gear,environment.
And then the first row of data would be like :
Neil,Armstrong,Stretch,spacesuit,moon
In the OCEAN Dataset you'd have :
First Name,Last name, Username, birthdate, diving gear,environment
with the following data:
Jacques,Cousteau,Jyc,1910-06-11,diving suit,ocean
Now suppose that at some point Neil Armstrong registered himself for a diving course and was added to the OCEAN dataset.
In the OCEAN Dataset you'd now have :
First Name,Last name, Username, birthdate, diving gear,environment
with the following data:
Jacques,Cousteau,Jyc,1910-06-11,diving suit,ocean
Neil,Armstrong,Stretch,1930-08-05,diving suit,ocean
The person who handed the data over to me also gave me a third dataset, which was a "mix" of the other 2:
In the MIXED Dataset you'd now have :
Dataset,First Name,Last name, Username, birthdate, diving gear, space gear,environment
with the following data:
ocean,Jacques,Cousteau,Jyc,1910-06-11,diving suit,,ocean
space,Neil,Armstrong,Stretch,1930-08-05,,space suit,moon
ocean,Neil,Armstrong,Stretch,1930-08-05,diving suit,,ocean
So my task is to make the dataset MIXED looking like this:
First Name,Last name, Username, birthdate, diving gear, space gear,environment
Jacques,Cousteau,Jyc,1910-06-11,diving suit,,ocean
Neil,Armstrong,Stretch,1930-08-05,diving suit,space suit,(moon,ocean)
And to top it all off, there's a couple of profoundly stupid scenarios that can happen:
1) The same guy could appear in either the SPACE or the OCEAN dataset more than once, but with different usernames.
2) Two completely different users could share the same username in the SPACE dataset, but NOT in the OCEAN dataset; usernames there are unique. Yes, you read that correctly: both Cousteau and Armstrong could potentially have the same username.
I've indeed already looked at the possibility of having the data cleaned up a little bit before getting my teeth stuck in that task, but that's not possible.
I have to take the context as it is, can't change anything.
So the first thing I did was group the records by the username field (Group-Object -Property Username), and my work was focused on the cases where a given user was, like Neil Armstrong, in both datasets.
When there is only 1 record, like Cousteau, it's straightforward: I leave it as it is. When there's one record in each dataset, I need to merge the data, and when there are more than 2 records for one username, it's fair to say it is a complete mess, although I don't mind leaving those as they are for now (especially because thousands of records have [string]::IsNullOrEmpty($Username) = $true, so they count as more than 2 records...)
I hope it makes more sense?
At the moment I want to focus on the cases where a given username shows up once in both the SPACE and OCEAN datasets. I know it's not complicated, but the algorithm I am using makes the whole process super slow:
0 - Create an empty array
1 - Get rows from SPACE dataset
2 - Get rows from OCEAN dataset
3 - Create a hashtable containing the properties of both datasets where properties aren't empty
4 - Create a psobject to encapsulate the hashtable
5 - Add that object to the array
And that is taking ages because I am talking about 65k records in SPACE and about 50k records in OCEAN.
So I was wondering if there's a better way of doing this?
Thanks !
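For what it's worth, the 65k x 50k nested loop can become two linear passes if one file is indexed by Username first. A hedged sketch in Python (the field names come from the example datasets above; a real script would use a PowerShell [hashtable] the same way):

```python
space = [{"Username": "Stretch", "space gear": "spacesuit", "environment": "moon"}]
ocean = [
    {"Username": "Jyc", "diving gear": "diving suit", "environment": "ocean"},
    {"Username": "Stretch", "diving gear": "diving suit", "environment": "ocean"},
]

by_user = {}                                  # one pass over SPACE (65k rows)
for rec in space:
    by_user.setdefault(rec["Username"], []).append(rec)

merged = []
for rec in ocean:                             # one pass over OCEAN (50k rows)
    for match in by_user.get(rec["Username"], []):
        merged.append({**match, **rec})       # OCEAN values win on collisions
```

The hashtable gives O(1) lookups, so the merge is roughly 115k dictionary operations instead of about 3.25 billion pairwise comparisons.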

Is it a good idea to store and access an active query resultset in ColdFusion vs re-querying the database?

I have a product search engine using ColdFusion 8 and MySQL 5.0.88.
The product search has two display modes: Multiple View and Single View.
Multiple displays basic record info; Single requires additional data to be polled from the database.
Right now a user does a search and I'm polling the database for
(a) total records and
(b) records FROM to TO.
The user always goes to Single view from his current resultset, so my idea was to store the current resultset for each user and not have to query the database again, wasting (a) a query for the overall number of records and (b) a query for the single record I already fetched before, and then fetch only the detail information I still need for the Single view.
However, I'm getting nowhere with this.
I cannot cache the current resultset-query, because it's unique to each user(session).
The queries are running inside a CFINVOKED method inside a CFC I'm calling through AJAX, so the whole query runs and afterwards the CFC and CFINVOKE method are discarded, so I can't use query of query or variables.cfc_storage.
So my idea was to store the current resultset in the Session scope, which will be updated with every new search, the user runs (either pagination or completely new search). The maximum results stored will be the number of results displayed.
I can store the query allright, using:
<cfset Session.resultset = query_name>
This stores the whole query with results, like so:
query
CACHED: false
EXECUTIONTIME: 2031
SQL: SELECT a.*, p.ek, p.vk, p.x, p.y
FROM arts a
LEFT JOIN p ON
...
LEFT JOIN f ON
...
WHERE a.aktiv = "ja"
AND
... 20 conditions ...
SQLPARAMETERS: [array]
1) ... 20+ parameters
RESULTSET:
[Record # 1]
a: true
style: 402
price: 2.3
currency: CHF
...
[Record # 2]
a: true
style: 402abc
...
This would be overwritten every time a user does a new search. However, if a user wants to see the details of one of these items, I don't need to query the database again (for the total number of records and for the one record) if I can access the record I need from my temp storage. This way I would save two database trips worth 2031ms of execution time each, to fetch data I already pulled before.
The tradeoff would be every user having a resultset of up to 48 results (the maximum number of items per page) in the Session scope.
My questions:
1. Is this feasible or should I re-query the database?
2. If I have a structure/array/object like the above, how do I pick the record I need out of it by style number, i.e. how do I access the resultset? I can't just loop over the stored query (I've tried this for a while now...).
Thanks for help!
KISS rule. Just re-query the database unless you find that performance is really an issue. With the correct index, it should scale pretty well. When it is an issue, you can simply add query caching there.
QoQ would introduce overhead (on the CF side: memory & computation), and might return stale data (where the query in Session is older than the one in the DB). I only use QoQ when the same query is used on the same view, but not throughout a Session's time span.
Feasible? Yes, depending on how many users and how much data this stores in memory, it's probably much better than going to the DB again.
It seems like the best way to get the single record you want is a query of query. In CF you can create another query that uses an existing query as its data source. It would look like this:
<cfquery name="subQuery" dbtype="query">
SELECT *
FROM Session.resultset
WHERE style = #SelectedStyleVariable#
</cfquery>
Note that if you are using CFBuilder, it will probably scream Error at you for not having a datasource. This is a bug in CFBuilder; you are not required to have a datasource if your DBType is "query".
Depending on how many records there are, what I would do is store the detail data in application scope as a structure where the ID is the key. Something like:
APPLICATION.products[product_id].product_name
.product_price
.product_attribute
Then you would really only need to query for the ID of the item on demand.
And to improve the "on demand" query, you have at least two "in code" options:
1. A query of query, where you query the entire collection of items once, and then query from that for the data you need.
2. Verity or SOLR to index everything and then you'd only have to query for everything when refreshing your search collection. That would be tons faster than doing all the joins for every single query.
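The struct-keyed-by-ID idea can be sketched like this (illustrative Python rather than CFML; the style values come from the resultset dump above, and the second price is made up since the dump truncates it):

```python
resultset = [
    {"style": "402", "price": 2.3, "currency": "CHF"},
    {"style": "402abc", "price": 3.1, "currency": "CHF"},
]

# Index the cached rows by the lookup field once per search...
by_style = {row["style"]: row for row in resultset}

# ...so each Single view is a constant-time lookup, not a loop over the query.
detail = by_style.get("402abc")
```

This is the same pattern whether the structure lives in Session scope (per-user resultset) or Application scope (shared product catalog): build the keyed structure once, then every detail view is a direct key access.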

Salesforce SOQL query length and efficiency

I am trying to solve the problem of deleting only rows matching two criteria, each being a list of ids. These ids come in pairs: if the item to be deleted has one, it must have the matching second one in the pair, so just using two IN clauses will not work. I have come up with two solutions.
1) Use the two IN clauses, but then loop over the items and check that the two ids in question appear in the correct pairing, i.e.:
for(Object__c obj : [SELECT Id FROM Object__c WHERE Relation1__c in :idlist1 AND Relation2__c in:idlist2]){
if(preConstructedPairingsAsString.contains(''+obj.Relation1__c+obj.Relation2__c)){
listToDelete.add(obj);
}
}
2) Loop over the ids and build an admittedly long query.
I like the second choice because I only get the items I need and can just throw the list into delete, but I know that Salesforce has hangups with SOQL queries. Is there a penalty to the second option? Is it better to build and query off a long string, or to get more objects than necessary and filter?
In general you want to put as much logic as you can into SOQL queries, because that won't use any script statements and they execute faster than your code. However, there is a 10k character limit on SOQL queries (which can be raised to 20k), so based on my back-of-the-envelope calculations you'd only be able to put in about 250 id pairs before hitting that limit.
I would go with option 1 or if you really care about efficiency you can create a formula field on the object that pairs the ids and filter on that.
formula: relation1__c + '-' + relation2__c
for(list<Object__c> objs : [SELECT Id FROM Object__c WHERE formula__c in :idpairs]){
delete objs;
}
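The pair-key idea behind both option 1 and the formula field can be sketched like so (illustrative Python; the ids are invented). Note the '-' separator: without it, pairs like ('a1', '1b') and ('a11', 'b') would produce the same concatenated key.

```python
pairs_to_delete = {"a1-b1", "a3-b3"}          # precomputed valid id pairs

rows = [("r1", "a1", "b1"),                   # (record id, Relation1, Relation2)
        ("r2", "a1", "b2"),
        ("r3", "a3", "b3")]

# Keep only records whose two ids appear as a matched pair.
to_delete = [rid for rid, rel1, rel2 in rows
             if f"{rel1}-{rel2}" in pairs_to_delete]
# to_delete is ['r1', 'r3']
```

Membership tests against a set are O(1), so filtering stays cheap even when the broad two-IN-clause query returns many unwanted rows.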

Improve SQL Server 2005 Query Performance

I have a course search engine, and when I try to do a search it takes too long to show the search results. You can try a search here:
http://76.12.87.164/cpd/testperformance.cfm
At that page you can also see the database tables and indexes, if any.
I'm not using Stored Procedures - the queries are inline using Coldfusion.
I think I need to create some indexes but I'm not sure what kind (clustered, non-clustered) and on what columns.
Thanks
You need to create indexes on columns that appear in your WHERE clauses. There are a few exceptions to that rule:
If the column only has one or two unique values (the canonical example of this is "gender" - with only "Male" and "Female" the possible values, there is no point to an index here). Generally, you want an index that will be able to restrict the rows that need to be processed by a significant number (for example, an index that only reduces the search space by 50% is not worth it, but one that reduces it by 99% is).
If you are searching for x LIKE '%something', then there is no point in an index. If you think of an index as specifying a particular order for rows, then sorting by x when you're searching for '%something' is useless: you're going to have to scan all rows anyway.
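That point is easy to demonstrate. The sketch below uses SQLite (via Python's sqlite3) rather than SQL Server, but the planner behaviour is the same in principle: a leading-wildcard LIKE forces a full table scan, while an equality on the indexed column can seek:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE courses (name TEXT)")
con.execute("CREATE INDEX idx_name ON courses(name)")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the plan detail text.
    return " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

scan = plan("SELECT * FROM courses WHERE name LIKE '%accounting%'")
seek = plan("SELECT * FROM courses WHERE name = 'accounting'")
# 'scan' reports a full table scan; 'seek' reports a search using idx_name
```

On SQL Server you would see the same contrast in the actual execution plan: Index Seek for the equality, Scan for the leading-wildcard LIKE.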
So let's take a look at the case where you're searching for "keyword 'accounting'". According to your result page, the SQL that this generates is:
SELECT
*
FROM (
SELECT TOP 10
ROW_NUMBER() OVER (ORDER BY sq.name) AS Row,
sq.*
FROM (
SELECT
c.*,
p.providername,
p.school,
p.website,
p.type
FROM
cpd_COURSES c, cpd_PROVIDERS p
WHERE
c.providerid = p.providerid AND
c.activatedYN = 'Y' AND
(
c.name like '%accounting%' OR
c.title like '%accounting%' OR
c.keywords like '%accounting%'
)
) sq
) AS temp
WHERE
Row >= 1 AND Row <= 10
In this case, I will assume that cpd_COURSES.providerid is a foreign key to cpd_PROVIDERS.providerid in which case you don't need an index, because it'll already have one.
Additionally, the activatedYN column is a T/F column and (according to my rule above about restricting the possible values by only 50%) a T/F column should not be indexed, either.
Finally, because you're searching with an x LIKE '%accounting%' query, you don't need an index on name, title, or keywords either, because it would never be used.
So the main thing you need to do in this case is make sure that cpd_COURSES.providerid actually is a foreign key for cpd_PROVIDERS.providerid.
SQL Server Specific
Because you're using SQL Server, Management Studio has a number of tools to help you decide where to put indexes. The "Index Tuning Wizard" is actually usually pretty good at telling you what will give you good performance improvements. You just cut'n'paste your query into it, and it'll come back with recommendations for indexes to add.
You still need to be a little bit careful with the indexes that you add, because the more indexes you have, the slower INSERTs and UPDATEs will be. So sometimes you'll need to consolidate indexes, or just ignore them altogether if they don't give enough of a performance benefit. Some judgement is required.
Is this the real live database data? 52,000 records is a very small table, relatively speaking, for what SQL 2005 can deal with.
I wonder how much RAM is allocated to the SQL server, or what sort of disk the database is on. An IDE or even SATA hard disk can't give the same performance as a 15K RPM SAS disk, and it would be nice if there was sufficient RAM to cache the bulk of the frequently accessed data.
Having said all that, I feel the " (c.name like '%accounting%' OR c.title like '%accounting%' OR c.keywords like '%accounting%') " clause is problematic.
Could you create a separate Course_Keywords table, with two columns, courseid and keyword (varchar(24) should be sufficient for the longest keyword?), and a composite clustered index on courseid+keyword?
Then, to make the UI even more friendly, use AJAX to apply keyword validation & auto-completion when people type words into the keywords input field. This gives you the behind-the-scenes benefit of having an exact keyword to search for, removing the need for pattern-matching with the LIKE operator...
Using CF9? Try using Solr full text search instead of %xxx%?
You'll want to create indexes on the fields you search by. An index is a secondary list of your records presorted by the indexed fields.
Think of an old-fashioned printed yellow pages - if you want to look up a person by their last name, the phonebook is already sorted in that way - Last Name is the clustered index field. If you wanted to find phone numbers for people named Jennifer or the person with the phone number 867-5309, you'd have to search through every entry and it would take a long time. If there were an index in the back with all the phone numbers or first names listed in order along with the page in the phonebook that the person is listed, it would be a lot faster. These would be the unclustered indexes.
I would try changing your IN statements to an EXISTS query to see if you get better performance on the zip code lookup. My experience is that IN statements work great for small lists, but as the lists get larger you get better performance out of EXISTS, as the query engine will stop searching for a specific value at the first instance it runs into.
<CFIF zipcodes is not "">
EXISTS (
SELECT zipcode
FROM cpd_CODES_ZIPCODES
WHERE zipcode = p.zipcode
AND 3963 * (ACOS((SIN(#getzipcodeinfo.latitude#/57.2958) * SIN(latitude/57.2958)) +
(COS(#getzipcodeinfo.latitude#/57.2958) * COS(latitude/57.2958) *
COS(longitude/57.2958 - #getzipcodeinfo.longitude#/57.2958)))) <= #radius#
)
</CFIF>
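For reference, the arithmetic in that distance filter is the spherical law of cosines: 3963 is the Earth's radius in miles and 57.2958 is the degrees-to-radians divisor. A quick Python check of the same formula (the coordinates are just sample points, not from the question):

```python
import math

def miles_between(lat1, lon1, lat2, lon2):
    # Convert degrees to radians, as the SQL does with /57.2958.
    lat1, lon1, lat2, lon2 = (v / 57.2958 for v in (lat1, lon1, lat2, lon2))
    return 3963 * math.acos(
        math.sin(lat1) * math.sin(lat2)
        + math.cos(lat1) * math.cos(lat2) * math.cos(lon2 - lon1)
    )

# New York (40.71, -74.01) to Philadelphia (39.95, -75.17): roughly 80 miles
```

Precomputing the sender's SIN/COS terms once per request (as the query does with #getzipcodeinfo.latitude#) keeps the per-row cost down to a few multiplications.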
