I want to use this method in a script that sets up a sync between Elasticsearch and Firebase. I am using it to skip children I already have and index only new ones. Is it an efficient method when I have millions of records in my Firebase?
It is efficient if you have an index defined on the field you're using to sort.
Data duplication prevention is handled at the index level with the field "_id".
However, to avoid having huge indices, I work with several small indices linked under an alias. Is there a mechanism in place to check existing _ids at the alias level (over multiple indices) when a document is inserted, or should it be handled at the application level?
Not natively, no. You'd need to handle this in your own code.
Before inserting your document, you need to first find out which real index contains your document via the alias using
GET alias/_search?q=_id:123456&filter_path=hits.hits._index
In the response you'll get the concrete index name that you can then use to index/update your new document version.
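The two-step lookup described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the helper names `build_lookup_path` and `extract_index` are hypothetical, no HTTP client is shown, and the sample response is hand-written to match the shape that `filter_path=hits.hits._index` is expected to leave behind.

```python
from typing import Optional
from urllib.parse import quote


def build_lookup_path(alias: str, doc_id: str) -> str:
    """Build the search path that asks which concrete index holds doc_id."""
    return f"/{quote(alias)}/_search?q=_id:{quote(doc_id)}&filter_path=hits.hits._index"


def extract_index(response: dict) -> Optional[str]:
    """Return the concrete index name, or None when no index holds the id."""
    hits = response.get("hits", {}).get("hits", [])
    return hits[0]["_index"] if hits else None


# Hand-written sample response (assumption, not real server output).
sample_response = {"hits": {"hits": [{"_index": "items-2024-01"}]}}

# Fall back to a hypothetical write index when the id is not found anywhere.
target_index = extract_index(sample_response) or "items-current"
```

The application would then send its index/update request to `target_index` instead of the alias, so an existing document is overwritten in place rather than duplicated in another index.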
Since I don't want to call the API every time I need certain data (like an array of 1000 rows), I would like to store that array in Elasticsearch so I can easily get it without calling the API. I'm using FOSElasticaBundle. Is this even possible, and if so, how?
What I would do:
- I have a function that gets this data from the database.
- I would like to save this data in ES after calling php bin/console fos:elastica:populate.
- I would then use this array in the controller to return it to the view and use it there.
I would suggest that you define a type with a mapping that can cover a single row in your database. After that, when you have fetched the 1000 rows from the database, you can index them in a single bulk index call in the form of 1000 documents: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
You can then fetch these 1000 documents for use in the controller.
Alternatively, you can define a mapping with a nested property. This nested property should be identical to a row in your database. Using this, you can create a single document with 1000 rows worth of your data inside the nested property like an array. After that, you can fetch this single document.
Which of these strategies is better will depend on your requirements. The second is a heavier indexing process, while the first is a relatively heavy fetch process. In my experience with Elasticsearch, it is better to have lighter indexing requests to ensure data consistency. Depending on your data, you can create the 1000 documents with IDs that follow a known pattern; with the IDs known, fetching these documents becomes very efficient.
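As a rough sketch of the first strategy, the snippet below builds the newline-delimited body that the bulk endpoint expects, one action line followed by one source line per row. The function names, the index name `rows`, and the `row-<n>` ID pattern are assumptions made for illustration; sending the body to `_bulk` is left to whatever client you use.

```python
import json
from typing import Iterable, List


def build_bulk_body(index: str, rows: Iterable[dict], id_prefix: str = "row-") -> str:
    """Serialize rows into a bulk request body with predictable document IDs."""
    lines = []
    for position, row in enumerate(rows):
        # Action line: index this document under a patterned, known ID.
        action = {"index": {"_index": index, "_id": f"{id_prefix}{position}"}}
        lines.append(json.dumps(action))
        # Source line: the row itself.
        lines.append(json.dumps(row))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def known_ids(count: int, id_prefix: str = "row-") -> List[str]:
    """Rebuild the same ID list later, e.g. for a multi-get of all rows."""
    return [f"{id_prefix}{position}" for position in range(count)]


body = build_bulk_body("rows", [{"name": "a"}, {"name": "b"}])
```

Because the IDs follow a pattern, the controller can regenerate them with `known_ids(1000)` and fetch the documents directly instead of searching for them.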
I am trying to use Elasticsearch to filter millions of records. All data are in one index and I want to access them in a 'direct' way.
What do I mean by a direct way?
Direct way means for example accessing the 700000th element of this index (not by id). Is this possible somehow?
What I tried already:
from + size works, but seems not to be fast if number of elements > 10000
I didn't try scrolling, but it doesn't seem like the right thing for my use case.
So any other ideas?
Scrolling will not work. That will fetch all the data.
I think Elasticsearch is not the right tool for what you want to do.
It would be better to keep a linked list of the ids. That will let you fetch the id by position, and then you can query Elasticsearch to get the data.
If your data is never modified or deleted, you can add an extra field to the mapping that acts like an auto-increment field in a database. You can then fetch the data using that field.
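To make the auto-increment idea concrete, here is a small sketch of the query bodies it would allow. The field name `seq` is an assumption (any numeric field assigned by your application at index time would do), and the helper names are hypothetical.

```python
def nth_document_query(position: int, field: str = "seq") -> dict:
    """Query body that fetches the single document stored at a given position."""
    return {"size": 1, "query": {"term": {field: position}}}


def window_query(start: int, count: int, field: str = "seq") -> dict:
    """Query body that fetches `count` consecutive documents starting at `start`."""
    return {
        "size": count,
        "sort": [{field: "asc"}],
        "query": {"range": {field: {"gte": start, "lt": start + count}}},
    }


# Body for the 700000th element, to be sent to the index's _search endpoint.
query = nth_document_query(700000)
```

Unlike from + size, these queries select documents by value, so the cost should not grow with how deep the position is. The catch is the one stated above: the numbering only stays meaningful if documents are never deleted or reordered.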
The previous setting was all documents of one type were in the same index. But due to different forms (conceptually) of types, and for backing up purposes, I need multiple indices of a single type.
They will all be in the form _feed. While this setting is great in some circumstances, for
client.prepareGet(index, typename, ids).execute().actionGet(); // works great if you know in which index to search
it is useless, since no wildcards may be used. What I can do is use multiple multigets and interleave the results. This gives me what I want, but increases the number of queries significantly.
Assuming I know for sure that only one document exists with a given id, is there a better way to query this than calling a multiget on all _uids for each possible index?
The best way would be to develop a mechanism in your application that would allow you to deduce the index name from the id. But assuming that this is not possible or practical, you have pretty much only two choices. If you need realtime get, then your approach is the only way to do it. If realtime get is not a requirement, you can perform a search across all indices using an ids filter. If the id list is small, you can benefit from using routing on your search query. This way the search request will only be dispatched to the shards that might contain any of the ids listed in the query. However, if the list of ids is big enough to span most of the shards, it will not provide any benefit.
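The non-realtime option could look roughly like the sketch below, which builds one `ids` query for all indices and then maps each returned id to the index it was found in. The wildcard pattern `*_feed` and both helper names are assumptions; the sample response is hand-written, not real output.

```python
from typing import Dict, Iterable, Tuple


def build_ids_search(ids: Iterable[str], index_pattern: str = "*_feed") -> Tuple[str, dict]:
    """Return the search path and body for one ids query over all matching indices."""
    id_list = list(ids)
    path = f"/{index_pattern}/_search"
    body = {"size": len(id_list), "query": {"ids": {"values": id_list}}}
    return path, body


def index_by_id(response: dict) -> Dict[str, str]:
    """Map each returned document id to the concrete index that holds it."""
    hits = response.get("hits", {}).get("hits", [])
    return {hit["_id"]: hit["_index"] for hit in hits}


path, body = build_ids_search(["a1", "b2"])

# Hand-written sample response (assumption) with one hit per id.
sample = {"hits": {"hits": [{"_id": "a1", "_index": "news_feed"},
                            {"_id": "b2", "_index": "blog_feed"}]}}
locations = index_by_id(sample)
```

This replaces one multiget per index with a single request, at the price of search being near-realtime rather than realtime, as noted above.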
I have serious trouble finding anything useful in the Mongo documentation about dealing with embedded documents. Let's say I have the following schema:
{
    _id: ObjectId,
    ...
    data: [
        {
            _childId: ObjectId // let's use custom name so we can distinguish them
            ...
        }
    ]
}
1. What's the most efficient way to remove everything inside data for a particular _id?
2. What's the most efficient way to remove an embedded document with a particular _childId inside a given _id? What's the performance here? Can _childId be indexed in order to achieve logarithmic (or similar) complexity instead of a linear lookup? If so, how?
3. What's the most efficient way to insert a lot of (let's say 1000) documents into data for a given _id? And like above, can we get O(n log n) or similar complexity with proper indexing?
4. What's the most efficient way to get the count of documents inside data for a given _id?
The other two answers give sensible advice on your questions 1-4, but I want to address your question by interrogating the basis for asking it in the first place. The terminology of "embedded document" in the context of MongoDB storing "documents" confuses people. You should not think of an embedded document as another document in MongoDB that you search for, index, or update as its own document, because that's not what it is. It's a grouped collection of fields inside a document; it's a BSON field of type Object. To quote the embedded document docs,
Embedded data models allow applications to store related pieces of information in the same database record. As a result, applications may need to issue fewer queries and updates to complete common operations.
Starting from knowledge about your use case, you should pick your documents and document structure to make your common operations easier. If you are so concerned about 1-4, you probably want to unwind your data array of childIds into separate documents. A concrete example of this common "antipattern" is a blog with many authors - you could have a user document with a large, changing array of posts embedded inside, or a post document with user information replicated in each. I can't say for sure what is or isn't wrong with your data model as you've given no specific details about it, but struggling to understand why 1-4 seem hard or undocumented or slow in MongoDB is a good sign that you should rethink the data model so the equivalent of 1-4 are fun and easy! Or at least easier and more fun.
I can't find anything on speed, so I will go with the approaches found in the documentation, in the hope that the documented ways are also the most efficient ones:
If you want to remove all subdocuments in data you can just update data to []
The official way to remove a document with a specific _childId from data would be $pull:
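db.collection.update(
    { "data._childId": id },               // select the document that holds the subdocument
    { $pull: { data: { _childId: id } } }  // remove every matching element from data
)
You might need to add { multi: true } if _childId is not unique across documents (multipart subdocuments).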
On indexing subdocuments, I would refer you to this question. Short answer: yes, you can index fields in subdocuments for faster lookup, just like you would index normal fields (createIndex replaces the deprecated ensureIndex):
db.collection.createIndex({"data._childId" : 1})
If you want to search for a subdocument in only one specific document you can use aggregation i.e.
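db.collection.aggregate([
    { $match: { _id: _id } },                   // select the parent document first
    { $unwind: "$data" },                       // one output document per array element
    { $match: { "data._childId": _childId } }   // keep only the wanted subdocument(s)
])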
which will first match for _id and only then for _childId. It will return the parent document with data only containing the subdocument(s) with _childId.
There is $push for that, although for 1000 subdocuments you might not want to do it in one query anyway.
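One way to avoid pushing all 1000 subdocuments in a single query is to split them into batches and append each batch with `$push` and `$each`. The sketch below only builds the filter and update documents; the helper name `push_batches` and the batch size of 250 are assumptions, and the actual call (for example pymongo's `collection.update_one(filter, update)`) is left out so the snippet runs without a server.

```python
from typing import Iterator, List, Tuple


def push_batches(parent_id, children: List[dict],
                 batch_size: int = 250) -> Iterator[Tuple[dict, dict]]:
    """Yield (filter, update) pairs that append children to data in batches."""
    for start in range(0, len(children), batch_size):
        chunk = children[start:start + batch_size]
        # $each appends every element of chunk, not the chunk as one nested array.
        yield {"_id": parent_id}, {"$push": {"data": {"$each": chunk}}}


# Hypothetical parent id and children, for illustration only.
children = [{"_childId": n} for n in range(1000)]
batches = list(push_batches("parent-1", children))
```

Each pair targets exactly one parent through the unique _id, so the lookup itself is cheap; the batch size mainly trades the number of round trips against the size of each update.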
Trudbert is right: db.collection.update({_id:yourId},{$set:{data:[]}})
Two points for Trudbert. However, I would like to add that if you have the whole document available in your app, it might be reasonable to simply replace the contents of the whole document if suitable for your use case.
I have had good experience with bulk updates, performance-wise. You might want to try them.
I don't know how you came to the idea that an aggregation wouldn't use indices, but since _id is unique, it would make much more sense to use db.collection.findOne({_id:yourId},{"data._childId":1,_id:0}).data.length or its equivalent as a raw command in the driver of your choice. Since the connection is already established, unless the array is very big, it should be faster to simply return the data instead of having the calculations done on a possibly (over)loaded server.
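For the case where the array is very big and shipping it to the client is not attractive, a server-side count is the alternative. The sketch below builds an aggregation pipeline that uses the `$size` operator; this is a different technique from the findOne approach above, and the helper name `count_pipeline` is hypothetical. The pipeline is plain data that could be passed to a driver's aggregate call.

```python
from typing import List


def count_pipeline(parent_id) -> List[dict]:
    """Build a pipeline that returns only the number of elements in data."""
    return [
        # The unique _id narrows the work to a single document.
        {"$match": {"_id": parent_id}},
        # $ifNull guards against documents that have no data field at all.
        {"$project": {"_id": 0,
                      "count": {"$size": {"$ifNull": ["$data", []]}}}},
    ]


pipeline = count_pipeline("parent-1")  # hypothetical id
```

Which variant is faster depends on the array size and the server load, as discussed above, so it is worth measuring both on your own data.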
As per your comments on Trudbert's answer: _id is unique, so exactly one doc will need to be modified for a known _id: db.collection.update({_id:theId},{$pull..... It does not get more efficient. For an unknown id, create an index on childId and do the same pull operation with a match on childId instead of id, with the multi option set to remove all references to a specific childId.
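The two cases described above differ only in the filter, which the following sketch makes explicit. Both helper names are hypothetical, and the functions return plain filter/update documents rather than calling a driver; with pymongo, for instance, the first pair would go to `update_one` and the second to `update_many`, the counterpart of the multi option.

```python
from typing import Tuple


def pull_from_known_parent(parent_id, child_id) -> Tuple[dict, dict]:
    """Known _id: exactly one document is matched and modified."""
    return {"_id": parent_id}, {"$pull": {"data": {"_childId": child_id}}}


def pull_everywhere(child_id) -> Tuple[dict, dict]:
    """Unknown _id: match by child id, which an index on data._childId can serve."""
    return ({"data._childId": child_id},
            {"$pull": {"data": {"_childId": child_id}}})


known_filter, known_update = pull_from_known_parent("parent-1", "child-9")
any_filter, any_update = pull_everywhere("child-9")
```

The update document is identical in both cases; only the way the parent is located changes.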
I strongly second Trudbert's suggestion of using the aggregation framework to create documents when needed out of optimized data. Currently, I have an aggregation pipeline which analyses 5M records with more than 7 million relations to each other in some 6 seconds. On a non-sharded standalone instance. With spinning disks, crappy IO and not even optimized. With careful planning of the aggregations (an early match limiting the documents passed to the ones not processed so far) and merging them with earlier results (adapting the _id in the group phase can achieve that), you can even bring this down to mere fractions of a second, if absolutely necessary.