Neo4j cypher query for updating nodes taking a long time - performance

I have a graph with the following:
(:Customer)-[:MATCHES]->(:Customer)
This is not limited to two customers but the depth does not exceed 5. The following also exists:
(:Customer)-[:MATCHES]->(:AuditCustomer)
There is a label :Master that needs to be added to one customer node in a matching set depending on some criteria.
My query therefore needs to find all sets of matching customers where none have the label master and add that label to one node in the set. There are a large number of customer nodes and doing it all in one go causes the database to become very slow.
I have attempted to do it using apoc.periodic.commit:
CALL apoc.periodic.commit("MATCH (c:Customer)
WHERE NOT c:Master AND NOT (c)-[:MATCHES*0..5]-(:Master) WITH c limit {limit}
CALL apoc.path.expand(c, 'MATCHES', '+Customer', 0, -1)
YIELD path UNWIND NODES(path) AS nodes WITH c,nodes
order by nodes:Searchable desc, nodes.createdTimestamp asc
with c, head(collect(distinct nodes)) as collectedNodes
set collectedNodes:Master
return count(collectedNodes)", {limit:100})
However this still causes the database to become very slow even if the limit parameter is set to 1. I read that apoc.periodic.commit is a blocking process so this may be causing a problem. Is there a way to do this that is not so resource intensive and the DB can continue to process other transactions while this is running?
The slowest part of the query is the initial match:
MATCH (c:Customer)
WHERE NOT c:Master AND NOT (c)-[:MATCHES*0..5]-(:Master) WITH c limit {limit}
This takes around 3.5s and the entire query takes about 4s. There is very little difference between limit 1 and limit 20. Maybe if there is a way to rewrite this so it is faster, it might be a better approach?
Also, if it is of any use, the following returns ~70K
MATCH (c:Customer)
WHERE NOT c:Master AND NOT (c)-[:MATCHES*0..5]-(:Master)
RETURN COUNT(c)

Related

How to measure execution time of Vulkan pipeline

Summary
I wish to be able to measure time elapsed in milliseconds, on the GPU, of running the entire graphics pipeline. The goal: To be able to save benchmarks before/after optimizing the code (next step would be mipmapping textures) to see improvements. This was really simple in OpenGL, but I'm new to Vulkan, and could use some help.
I have browsed related existing answers (here and here), but they aren't really of much help. And I cannot find code samples anywhere, so I dare ask here.
Through documentation pages I have found a couple of functions that I think I should be using, so I have in place something like this:
1: Creating query pool
void CreateQueryPool()
{
VkQueryPoolCreateInfo createInfo{};
createInfo.sType = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO;
createInfo.pNext = nullptr; // Optional
createInfo.flags = 0; // Reserved for future use, must be 0!
createInfo.queryType = VK_QUERY_TYPE_TIMESTAMP;
createInfo.queryCount = mCommandBuffers.size() * 2; // REVIEW
VkResult result = vkCreateQueryPool(mDevice, &createInfo, nullptr, &mTimeQueryPool);
if (result != VK_SUCCESS)
{
throw std::runtime_error("Failed to create time query pool!");
}
}
I had the idea of queryCount = mCommandBuffers.size() * 2 to have space for a separate query timestamp before and after rendering, but I have no clue whether this assumption is correct or not.
2: Recording command buffers
// recording command buffer i:
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, mTimeQueryPool, i);
// render pass ...
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, mTimeQueryPool, i);
vkCmdCopyQueryPoolResults(/* many parameters here */);
I'm looking for a couple of clarifications:
What is the concequence of writing to the same query index? Do I need two separate query pools - one for before render time and one for after render time?
How should I handle synchronization? I assume having a separate query for each command buffer.
For the destination buffer containing the query result, is it good enough to store somewhere with "host visible bit", or do I need staging memory for "device visible only"? I'm a bit lost on this one as well.
I have not been able to find any online examples of how to measure render time, but I just assume it's such a common task that surely there must be an example out there somewhere.
So, thanks to #karlschultz, I managed to get something working. So in case other people will be looking for the same answer, I decided to post my findings here. For the Vulkan experts out there: Please let me know if I make obvious mistakes, and I will correct them here!
Query Pool Creation
I fill out a VkQueryPoolCreateInfo struct as described in my question, and let its queryCount field equal twice the number of command buffers, to store space for a query before and after rendering.
Important here is to reset all entries in the query pool before using the queries, and to reset a query after writing to it. This necessitates a few changes:
1) Asking graphics queue if timestamps are supported
When picking the graphics queue family, the struct VkQueueFamilyProperties has a field timestampValidBits which must be greater than 0, otherwise the queue family cannot be used for timestamp queries!
2) Determining the timestamp period
The physical device contains a special value which indicates the number of nanoseconds it takes for a timestamp query to be incremented by 1. This is necessary to interpret the query result as e.g. nanoseconds or milliseconds. That value is a float, and can be retrieved by calling vkGetPhysicalDeviceProperties and looking at the field VkPhysicalDeviceProperties.limits.timestampPeriod.
3) Asking for query reset support
During logical device creation, one must fill out a struct and add it to the pNext chain to enable the host query reset feature:
VkDeviceCreateInfo createInfo{};
VkPhysicalDeviceHostQueryResetFeatures resetFeatures;
resetFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_HOST_QUERY_RESET_FEATURES;
resetFeatures.pNext = nullptr;
resetFeatures.hostQueryReset = VK_TRUE;
createInfo.pNext = &resetFeatures;
4) Recording command buffers
Timestamp queries should be outside the scope of the render pass, as seen below. It is not possible to measure running time of a single shader (e.g. fragment shader), only the entire pipeline or whatever is outside the scope of the render pass, due to (potential) temporal overlap of pipeline stages.
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, mTimeQueryPool, i * 2);
vkCmdBeginRenderPass(/* ... */);
// render here...
vkCmdEndRenderPass(mCommandBuffers[i]);
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, mTimeQueryPool, i * 2 + 1);
5) Retrieving query result
We have two methods for this: vkCmdCopyQueryPoolResults and vkGetQueryPoolResults. I chose to go with the latter since is greatly simplifies the setup and does not require synchronization with GPU buffers.
Given that I have a swapchain index (in my scenario same is command buffer index!), I have a setup like this:
void FetchRenderTimeResults(uint32_t swapchainIndex)
{
uint64_t buffer[2];
VkResult result = vkGetQueryPoolResults(mDevice, mTimeQueryPool, swapchainIndex * 2, 2, sizeof(uint64_t) * 2, buffer, sizeof(uint64_t),
VK_QUERY_RESULT_64_BIT);
if (result == VK_NOT_READY)
{
return;
}
else if (result == VK_SUCCESS)
{
mTimeQueryResults[swapchainIndex] = buffer[1] - buffer[0];
}
else
{
throw std::runtime_error("Failed to receive query results!");
}
// Queries must be reset after each individual use.
vkResetQueryPool(mDevice, mTimeQueryPool, swapchainIndex * 2, 2);
}
The variable mTimeQueryResults refers to an std::vector<uint64_t> which contains a result for each swapchain. I use it to calculate an average rendering time each second by using the timestamp period determined in step 2).
And one must not forget to cleanup to query pool by calling vkDestroyQueryPool.
There are a lot of details omitted, and for a total Vulkan noob like me this setup was frightening and took several days to figure out. Hopefully this will spare someone else the headache.
More info in documentation.
Writing to the same query index is bad because you are overwriting your "before" timestamp with the "after" timestamp at the same query index. You might want to change the last parameter in your write timestamp calls to i * 2 for the "before" call and to i * 2 + 1 for the "after". You are already allocating 2 timestamps for each command buffer, but only using half of them. This scheme ends up producing a pair of before/after timestamps for each command buffer i.
I don't have any experience using vkCmdCopyQueryPoolResults(). If you can idle your queue, then after idle, call vkGetQueryPoolResults() which will probably be much easier for what you are doing here. It copies the query results back into host memory and you don't have to mess with synchronizing writes to another buffer and then mapping/reading it back.

ElasticSearch: Result window is too large

My friend stored 65000 documents on the Elastic Search cloud and I would like to retrieve all of them (using python). However, when I am running my current script, there is an error noticing that :
RequestError(400, 'search_phase_execution_exception', 'Result window is too large, from + size must be less than or equal to: [10000] but was [30000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.')
My script
es = Elasticsearch(cloud_id=cloud_id, http_auth=(username, password))
docs = es.search(body={"query": {"match_all": {}}, '_source': ["_id"], 'size': 65000})
What would be the easiest way to retrieve all those document and not limit it to 10000 docs? thanks
The limit has been set so that the result set does not overwhelm your nodes. Results will occupy memory in the elastic node. So bigger the result set, bigger the memory footprint and impact on the nodes.
Depending on what you want to do with the retrieved documents,
try to use the scroll api (as suggested in your error message) if its a batch job. Be mindful of the lifetime of scroll context in that case.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-scroll
or, use the Search After
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-search-after
You should use the scroll API and get the results in different calls. The scroll API will return to you the results 10000 by 10000 as maximum (that will be available to consult during the amount of time you indicate in the call) and you will be able then to paginate the results and obtain them thanks to a scroll_id.
The error message itself is mentioning that how can you solve the issue, look carefully this part of the error message.
This limit can be set by changing the [index.max_result_window] index
level setting.
Please refer update indices level setting on how to change that.
So for your setting it would look like:
PUT /<your-index-name>/_settings
{
"index" : {
"index.max_result_window" : 65000 -> note its equal to your all the docs in your index
}
}

Getting duplicates with NiFi HBase_1_1_2_ClientMapCacheService

I need to remove duplicates from a flow I've developed, it can receive the same ${filename} multiple times. I tried using HBase_1_1_2_ClientMapCacheService with DetectDuplicate (I am using NiFi v1.4), but found that it lets a few duplicates through. If I use DistributedMapCache (ClientService and Server), I do not get any duplicates. Why would I receive some duplicates with the HBase Cache?
As a test, I listed a directory (ListSFTP) with 20,000 files on all cluster nodes (4 nodes) and passed to DetectDuplicate (using the HBase Cache service). It routed 20,020 to "non-duplicate", and interestingly the table actually has 20,000 rows.
Unfortunately I think this is due to a limitation in the operations that are offered by HBase.
The DetectDuplicate processor relies on an operation "getAndPutIfAbsent" which is expected to return the original value, and then set the new value if it wasn't there. For example, first time through it would return null and set the new value, indicating it wasn't a duplicate.
HBase doesn't natively support this operation, so the implementation of this method in the HBase map cache client does this:
V got = get(key, keySerializer, valueDeserializer);
boolean wasAbsent = putIfAbsent(key, value, keySerializer, valueSerializer);
if (! wasAbsent) return got;
else return null;
So because it is two separate calls there is a possible race condition...
Imagine node 1 calls the first line and gets null, but then node 2 performs the get and the putIfAbsent, now when node 1 calls putIfAbsent it gets false because node 2 just populated the cache, so now node 1 returns the null value from the original get... both of these look like non-duplicates to DetectDuplicate.
In the DistributedMapCacheServer, it locks the entire cache per operation so it can provide an atomic getAndPutIfAbsent.

Solr performance with commitWithin does not make sense

I am running a very simple performance experiment where I post 2000 documents to my application.
Who in tern persists them to a relational DB and sends them to Solr for indexing (Synchronously, in the same request).
I am testing 3 use cases:
No indexing at all - ~45 sec to post 2000 documents
Indexing included - commit after each add. ~8 minutes (!) to post and index 2000 documents
Indexing included - commitWithin 1ms ~55 seconds (!) to post and index 2000 documents
The 3rd result does not make any sense, I would expect the behavior to be similar to the one in point 2. At first I thought that the documents were not really committed but I could actually see them being added by executing some queries during the experiment (via the solr web UI).
I am worried that I am missing something very big. Is it possible that committing after each add will degrade performance by a factor of 400?!
The code I use for point 2:
SolrInputDocument = // get doc
SolrServer solrConnection = // get connection
solrConnection.add(doc);
solrConnection.commit();
Where as the code for point 3:
SolrInputDocument = // get doc
SolrServer solrConnection = // get connection
solrConnection.add(doc, 1); // According to API documentation I understand there is no need to call an explicit commit after this
According to this wiki:
https://wiki.apache.org/solr/NearRealtimeSearch
the commitWithin is a soft-commit by default. Soft-commits are very efficient in terms of making the added documents immediately searchable. But! They are not on the disk yet. That means the documents are being committed into RAM. In this setup you would use updateLog to be solr instance crash tolerant.
What you do in point 2 is hard-commit, i.e. flush the added documents to disk. Doing this after each document add is very expensive. So instead, post a bunch of documents and issue a hard commit or even have you autoCommit set to some reasonable value, like 10 min or 1 hour (depends on your user expectations).

python slow to check if mongodb record found

I have a python (3.2) request that goes to MongoDB and the request itself is running fast enough. When I then perform an if statement check to see if any records were found it takes 50 times as long:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
58 27623 6475988 234.4 1.7 itemInDB = db.mainData.find({"x":item[x]}).limit(1)
59
60 #existing item in db
61 27623 293419802 10622.3 77.6 if itemInDB.count():
What on earth is the cause for that if statement taking so long?! I presume there must be a better way to check if a record was found but google has come up empty.
Thanks for the help.
Perhaps a Better Way
If you're only interested in returning one value, you might want to use find_one instead of find. It will stop looking for values after one has been found, as opposed to find, which has to run through the collection:
itemInDB = db.mainData.find_one({"x":item[x]})
if itemInDB:
print("Item found")
else:
print("Item not found")
For Your Example
According to the PyMongo docs, when querying the count of a cursor, you can pass in a parameter (True or False) to take into account any skip or limit calls previously made to the cursor. The default for that parameter is False (namely, not taking those calls into account). That may be affecting the performance of your count query.
Gauging Query Performance
If you want to see how your query will be carried out by mongo, you can call explain on your cursor:
db.coll.find({"x":4}).explain()
The explain function is also implemented in PyMongo.
Turns out it was due to the find() function and not the if statement. I created an index on "x" (as I should have anyway). Changed the find to find_one and removed the .count() from the if statement. Overall 75% faster.

Resources