I have a Neo4j database of about 70 GB. It holds 8 datasets that share the same structure but contain different nodes. The simple Cypher query below, which retrieves some data from one dataset, takes forever to run. The dataset does not contain many nodes, only several thousand. Here is the query:
MATCH (c:Cell)-[ex:EXPRESSES]->(g:Gene)
WHERE c.DATASET = "cd1_e165" AND g.geneName = "1010001B22Rik"
RETURN c.tsneX, c.tsneY, ex.expr, c.cellId
There is a huge number of :EXPRESSES relationships in total, but if we limit the match to the c.DATASET I am sure it should run much faster. Maybe the issue is related to the fact that I store c.DATASET as a property on each :Cell and do not have it indexed. What could be done to speed up the query?
First of all, you should create indexes on both properties.
CREATE INDEX ON :Cell(DATASET);
CREATE INDEX ON :Gene(geneName);
Next, I would rewrite the query like this. I am not sure whether it will help, but it makes more sense to me: Cypher often behaves the way you would expect, and in this form it seems rather clear that it should use the indexes instead of searching all possible paths:
MATCH (c:Cell{DATASET:'cd1_e165'})-[ex:EXPRESSES]->(g:Gene{geneName:'1010001B22Rik'})
RETURN c.tsneX, c.tsneY, ex.expr, c.cellId
As InverseFalcon mentioned: PROFILE and EXPLAIN can always help you understand what your query does and whether it matches your expectations. Take a look at the docs.
I was looking through Elasticsearch and noticed that you can create an index and bulk add items. I currently have a series of flat files with 220 million entries. I am working on Logstash to parse them and add them to Elasticsearch, but I feel that keeping everything under one index would make it rough to query. Each row is nothing more than 1-3 properties at most.
How does Elasticsearch function in this case? In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
I have been walking through the documentation, and it is explaining what to do, but not necessarily all the time explaining why it does what it does.
In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
That is exactly what you need to do. Typically it's an iterative process:
start by putting a subset of the data in. You can also put in all the data, if time and cost permit.
put some search load on it that is as close as possible to production conditions, e.g. by turning on whatever search integration you're planning to use. If you're planning to only issue queries manually, now's the time to try them and gauge their speed and the relevance of the results.
see if the queries are particularly slow and if their results are relevant enough. You can then change the index mappings or the queries you're using to get faster results, and indeed add more nodes to your cluster.
Since you mention Logstash, there are a few things that may help further:
check out Filebeat for indexing the data on an ongoing basis. You may not need to do the work of reading the files and bulk indexing yourself.
if it's log or log-like data and you're mostly interested in more recent results, it could be a lot faster to split up the data by date & time (e.g. index-2019-08-11, index-2019-08-12, index-2019-08-13). See the Index Lifecycle Management feature for automating this.
try using the Keyword field type where appropriate in your mappings. It stops analysis on the field, preventing you from doing full-text searches inside the field and only allowing exact string matches. Useful for fields like a "tags" field or a "status" field with something like ["draft", "review", "published"] values.
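To make the last two suggestions concrete, here is a minimal sketch written as plain JavaScript, so it can be read without a running cluster. The `index` prefix follows the example names above. The helper `dailyIndexName` and the field names `status`, `tags` and `message` are hypothetical, and the mapping body assumes a reasonably recent Elasticsearch version with typeless mappings.

```javascript
// Build a daily index name such as "index-2019-08-11" from an event timestamp.
// Uses the UTC date so the same event always lands in the same index.
function dailyIndexName(prefix, date) {
  const day = date.toISOString().slice(0, 10); // "YYYY-MM-DD"
  return `${prefix}-${day}`;
}

// Mapping body (hypothetical fields): exact-match fields use "keyword",
// free text uses "text" so it is still analyzed for full-text search.
const mappings = {
  properties: {
    status: { type: "keyword" },
    tags: { type: "keyword" },
    message: { type: "text" },
    "@timestamp": { type: "date" },
  },
};

console.log(dailyIndexName("index", new Date("2019-08-11T23:15:00Z")));
```

The mapping object would be sent as the `mappings` part of an index-creation request. If you let Index Lifecycle Management roll the indices over for you, the naming helper is not needed.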
Good luck!
Is it possible to use Slice via solrTemplate ?
Actually, I am struggling to see if it will even make a difference, because even without using Spring there doesn't appear to be any way of telling Solr to exclude its "numFound" (total results) from a query.
And when I use a normal Spring Data Page<..> query and look under the hood, I only see one query issued to Solr, i.e. no extra one for the count. Or is the count simply done inside Solr somehow in an extra step?
confused
Total document count is part of the Solr query. No additional query is required. Therefore, there is no advantage to Slice vs. Page.
The only related concern arises when somebody wants to export a significant amount of data, in which case built-in paging becomes slower the deeper into the result set the requested page is. For that, Solr has exporting functionality.
For a property sale/rent website, a search function should be provided. At the same time, users can use the filters to get the result they want most.
Normally, a property has many attributes, like the price, address, year built, area, and many amenities such as a balcony, a washing machine and so on. There may be over 100 of them.
So how should I design the database (MySQL or some NoSQL store) and the architecture to make search as efficient as possible?
Sounds like your application requires a lot more search queries than update queries, and that the search queries are quite diverse.
In this case, try ElasticSearch: You choose some database where you store and modify your data. Then, you should propagate any update to an ElasticSearch index, where you upload a denormalized view of the data, which is closer to what users will expect to get when searching.
https://www.quora.com/Whats-the-best-way-to-setup-MySQL-to-Elasticsearch-replication
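As a rough illustration of what a "denormalized view" could look like, the sketch below flattens one property row and its amenity rows into a single search document. It is plain JavaScript with no Elasticsearch client involved, and every field name in it (`price`, `yearBuilt`, `amenities`, `propertyId` and so on) is a hypothetical stand-in for whatever your schema actually uses.

```javascript
// Flatten a normalized property record plus its amenity rows into one
// search document. All field names here are hypothetical.
function toSearchDocument(property, amenityRows) {
  return {
    id: property.id,
    price: property.price,
    address: property.address,
    yearBuilt: property.yearBuilt,
    area: property.area,
    // Amenities become a flat list of exact-match values to filter on.
    amenities: amenityRows
      .filter((row) => row.propertyId === property.id)
      .map((row) => row.name),
  };
}

const doc = toSearchDocument(
  { id: 7, price: 250000, address: "1 Example St", yearBuilt: 1998, area: 84 },
  [
    { propertyId: 7, name: "balcony" },
    { propertyId: 7, name: "washing-machine" },
    { propertyId: 8, name: "garage" },
  ]
);
console.log(doc);
```

Each time a property or one of its amenities changes in the primary database, you would rebuild this document and index it again under the same id.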
I have serious trouble finding anything useful in Mongo documentation about dealing with embedded documents. Let's say I have a following schema:
{
_id: ObjectId,
...
data: [
{
_childId: ObjectId // let's use custom name so we can distinguish them
...
}
]
}
What's the most efficient way to remove everything inside data for
particular _id?
What's the most efficient way to remove embedded document with
particular _childId inside given _id? What's the performance
here, can _childId be indexed in order to achieve logarithmic (or
similar) complexity instead of linear lookup? If so, how?
What's the most efficient way to insert a lot of (let's say a 1000)
documents into data for given _id? And like above, can we get
O(n log n) or similar complexity with proper indexing?
What's the most efficient way to get the count of documents inside data for given _id?
The other two answers give sensible advice on your questions 1-4, but I want to address your question by interrogating the basis for asking it in the first place. The terminology of "embedded document" in the context of MongoDB storing "documents" confuses people. You should not think of an embedded document as another document in MongoDB that you search for, index, or update as its own document, because that's not what it is. It's a grouped collection of fields inside a document; it's a BSON field of type Object. To quote the embedded document docs,
Embedded data models allow applications to store related pieces of information in the same database record. As a result, applications may need to issue fewer queries and updates to complete common operations.
Starting from knowledge about your use case, you should pick your documents and document structure to make your common operations easier. If you are so concerned about 1-4, you probably want to unwind your data array of childIds into separate documents. A concrete example of this common "antipattern" is a blog with many authors - you could have a user document with a large, changing array of posts embedded inside, or a post document with user information replicated in each. I can't say for sure what is or isn't wrong with your data model as you've given no specific details about it, but struggling to understand why 1-4 seem hard or undocumented or slow in MongoDB is a good sign that you should rethink the data model so the equivalent of 1-4 are fun and easy! Or at least easier and more fun.
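A minimal sketch of that "unwind into separate documents" idea, in plain JavaScript and without a database connection. The `parentId` field is a hypothetical name, and reusing `_childId` as the child's own `_id` is an assumption made for illustration, not something the original schema prescribes.

```javascript
// Turn one parent document with an embedded `data` array into separate
// child documents that each point back to the parent.
// `parentId` is a hypothetical field name, not part of the original schema.
function splitIntoChildDocuments(parent) {
  return parent.data.map(({ _childId, ...rest }) => ({
    _id: _childId, // assumption: the child id becomes the document's own _id
    parentId: parent._id,
    ...rest,
  }));
}

const children = splitIntoChildDocuments({
  _id: "p1",
  data: [
    { _childId: "c1", value: 1 },
    { _childId: "c2", value: 2 },
  ],
});
console.log(children);
```

With a layout like this, removing one child is a delete by `_id`, and counting or removing all children of a parent is a query on `parentId`, which can be indexed like any other top-level field.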
I can't find anything on speed, so I will go with the approaches found in the documentation, in the hope that the documented ways are also the most efficient ones:
If you want to remove all subdocuments in data you can just update data to []
The official way to remove a document with a specific _childId from data would be $pull:
db.collection.update(
{ },
{ $pull: { data: { _childId: id } } },
)
might need to add { multi: true } if _childId is not unique (multipart subdocuments)
On indexing subdocuments, I would refer you to this question. Short answer: yes, you can index fields in subdocuments for faster lookup, just as you would index normal fields, with
db.collection.createIndex({"data._childId" : 1})
If you want to search for a subdocument in only one specific document, you can use aggregation, i.e.
db.collection.aggregate([
{$match: {_id: _id}},
{$unwind: '$data'},
{$match: {'data._childId': _childId}}
])
which will first match for _id and only then for _childId. It will return the parent document with data only containing the subdocument(s) with _childId.
There is $push for that, although for 1000 subdocuments you might not want to do it in one query anyway.
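One possible way to split such an insert into several smaller updates is sketched below. It only builds the filter and update documents in plain JavaScript and does not talk to MongoDB. `$push` with the `$each` modifier is a real operator combination, while the helper names `chunk` and `buildPushBatches`, the batch size of 250 and the id `"someParentId"` are made up for the example.

```javascript
// Split an array into chunks of at most `size` elements.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Build one { filter, update } pair per batch. Each update appends a whole
// batch to `data` using $push with the $each modifier.
function buildPushBatches(parentId, subdocuments, batchSize) {
  return chunk(subdocuments, batchSize).map((batch) => ({
    filter: { _id: parentId },
    update: { $push: { data: { $each: batch } } },
  }));
}

const newDocs = Array.from({ length: 1000 }, (_, i) => ({ _childId: i }));
const batches = buildPushBatches("someParentId", newDocs, 250);
console.log(batches.length); // 4
```

Each pair could then be passed to db.collection.update(filter, update). Whether batching actually beats a single large $push depends on document size and your setup, so it is worth measuring.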
Trudbert is right: db.collection.update({_id:yourId},{$set:{data:[]}})
Two points for Trudbert. However, I would like to add that if you have the whole document available in your app, it might be reasonable to simply replace the contents of the whole document if suitable for your use case.
I have had good experiences with bulk updates performance-wise. You might want to try them.
I don't know how you came to the idea that an aggregate wouldn't use indices, but since _id is unique, it would make much more sense to use db.collection.findOne({_id:yourId},{"data._childId":1,_id:0}).data.length or its equivalent as a raw command in the driver of your choice. Since the connection is already established, unless the array is very big, it should be faster to simply return the data instead of having the calculation done on a possibly (over)loaded server.
As per your comments on Trudbert's answer: _id is unique. So exactly one doc will need to be modified for a known _id: db.collection.update({_id:theId},{$pull..... It does not get more efficient than that. For an unknown id, create an index on childId and do the same pull operation with a match on childId instead of id, with the multi option set, to remove all references to a specific childId.
I strongly second Trudbert's suggestion of using the aggregation framework to create documents when needed out of optimized data. Currently, I have an aggregation pipeline which analyses 5M records with more than 7 million relations to each other in some 6 seconds. On a non-sharded standalone instance. With spinning disks, crappy IO and not even optimized. With careful planning of the aggregations (an early match limiting the documents passed to the ones not processed so far) and merging them with earlier results (adapting the _id in the group phase can achieve that), you can even get this down to mere fractions of a second, if absolutely necessary.
Is there a guide to writing ES queries: what to do, what to avoid, that sort of stuff? The official site describes all the various ways to search, but provides little guidance as to when to choose which.
In my particular case I have a list of providers; each one has a name, an address and a number of IDs. I want to give the user a box where he can type in anything he knows about the provider and run a search based on whatever is provided. Essentially I would like to match every word from the box against the records (documents) in the index.
For the end user this should look like a simple keyword search.
Matching should cover exact matches, wildcard matches, phonetic matches and synonyms (for names). Some fuzziness should be included too.
The official site describes various ways to do that, but how do I combine them? For instance, to support wildcard search, do I use a wildcard query, or do I index with NGram and just do a text query?
With SQL queries, a sure way to get this sort of information is to check the execution plan for the query. If the SQL optimizer tells you that it will use a table scan against a table of considerable size, you know you should change your query or maybe add an index. AFAIK there is no equivalent for this powerful feature in ES, and I am not even sure if it is possible to build it.
But at least some generic considerations...? Pretty please...
There is not a best way to go about doing things, because a lot of times it depends on what you are indexing, and how you map your data into variables within Elasticsearch.
Some rules of thumb that you should look out for:
a. Faceted Queries in Elasticsearch work in sequences:
{
"query": {
// data will be searched from this block first //
}, "facets": {
// after the data is received, it will be processed into facets //
}
}
Hence if your query size is huge, you are going to slow down your query further by faceting. Monitor the results of your query.
b. Filters vs Queries
Filters work on a subset of your query: a filter takes the entire result of your query and then keeps or drops the documents you do or do not want.
Queries are usually direct searches for data.
Hence, if you can make your query as specific as possible before you do a filter, it should yield faster results.
c. Queries are cached; running them again and again will generally yield faster responses. The Warmers API should be able to make your queries even quicker if you are always going to use the same set of queries.
Again, all these are rules of thumb and cannot be followed strictly, because what you index into specific variables will affect processing times. A string is different from a long, and strings with analyzers are different from strings without. What you probably need to do is experiment with your queries to get a better feel for them.
One correction to the above: filters are cacheable by ES, queries are not. Queries do the extra work of relevance scoring and full-text search. So wherever full-text search is not needed, using a filter is advised.
Also, design your mappings with correct index values (not_analyzed, no, analyzed)
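To tie the query-versus-filter advice back to the provider search box, here is a hedged sketch of one way to combine scored, fuzzy matching with a non-scored filter. It only builds the request body as a plain JavaScript object. It assumes an Elasticsearch version in which the `bool` query accepts a `filter` clause, which is newer than the facet-era syntax shown above. The field names `name`, `address`, `ids` and `active` and the helper `buildProviderSearch` are hypothetical.

```javascript
// Build a search body: scored full-text matching goes in `must`,
// exact-match restrictions go in `filter` (no scoring, cacheable).
// Field names are hypothetical placeholders.
function buildProviderSearch(userInput) {
  return {
    query: {
      bool: {
        must: [
          {
            multi_match: {
              query: userInput,
              fields: ["name", "address", "ids"],
              fuzziness: "AUTO", // tolerate small typos in the user's input
            },
          },
        ],
        filter: [{ term: { active: true } }],
      },
    },
  };
}

const body = buildProviderSearch("jon smith main street");
console.log(JSON.stringify(body, null, 2));
```

Phonetic and synonym matching are usually handled at index time through analyzers in the mapping rather than in the query itself, so they do not appear in this body.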