Elasticsearch question, should I have duplicate data along 2 different indices? Not sure how to set up the data - elasticsearch

Edit: 3 different indices. Sorry about the title :c
I am trying to grasp elasticsearch as fast as I can but I think I've confused myself majorly here. How should I set this data up?
I have 3 major searches:
1: Search by pokemon name. Eg: Show all Charizard in the system.
2: Search by trainer name Eg: Show all of John Doe's pokemon/checkins at the pokecenter.
3: Search by checkins at the pokecenter.
Should each of these be in its own separate index? I come primarily from an SQL background, so my instinct is to have separate tables for all of these. But that isn't how Elasticsearch works... so I am really confused here.
Should I have a separate index for each pokemon?
And then another separate index for each trainer?
And then another separate index for each checkin at the pokecenter?
Query return examples
1: Search by pokemon name.
{
  "1": {
    "id": 9239329,
    "pokeId": 6,
    "name": "Charizard",
    "trainerId": 2932
  }
}
2: Search by trainer name
{
  "1": {
    "id": 2932,
    "name": "John Doe",
    "pokemon": [
      9239329
    ]
  }
}
3: Search by checkins at the pokecenter.
{
  "1": {
    "id": 3232,
    "date": "11/11/1111",
    "pokemon": [
      9239329
    ],
    "trainerId": 2932
  }
}
But if I have a separate index for EACH of these... while that would be fast, wouldn't that just be crazy, horrendous data duplication?

It depends on the scope of the project:
The ideal way is to have each one as its own separate index. This allows you to scale them differently if needed, move them to another cluster, and give each one different replica settings.
The quick way is to have the checkins as an index, with the trainer as a nested object and, under that, the pokemon as a nested object.
Note: nested queries are slower, and writing the queries to return exactly what you want is a little trickier.
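To make the "quick way" a bit more concrete, here is a rough sketch of what the mapping and a search body could look like. This is only an illustration under assumptions: the field names are taken from the question's examples, the typeless mapping layout assumes a recent Elasticsearch version, and `checkinMapping` and `checkinsByPokemonName` are made-up names. It only builds the request bodies as plain objects and does not talk to a cluster.

```javascript
// Hypothetical mapping: one check-in document, with the trainer nested inside
// it and the pokemon nested inside the trainer. The shape is an assumption.
const checkinMapping = {
  mappings: {
    properties: {
      date: { type: "keyword" }, // kept as a plain string; the date format is not specified
      trainer: {
        type: "nested",
        properties: {
          id: { type: "long" },
          name: { type: "text" },
          pokemon: {
            type: "nested",
            properties: {
              id: { type: "long" },
              pokeId: { type: "integer" },
              name: { type: "text" }
            }
          }
        }
      }
    }
  }
};

// Build a search body that finds check-ins containing a pokemon with the given
// name. Each nesting level gets its own nested query.
function checkinsByPokemonName(name) {
  return {
    query: {
      nested: {
        path: "trainer",
        query: {
          nested: {
            path: "trainer.pokemon",
            query: { match: { "trainer.pokemon.name": name } }
          }
        }
      }
    }
  };
}

console.log(JSON.stringify(checkinsByPokemonName("Charizard"), null, 2));
```

The two-level nested query is the "a little trickier" part mentioned above: a search on the pokemon name has to walk through the trainer level first.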

Related

Elasticsearch 6.0 Removal of mapping types - Alternatives

Background
I am migrating my ES index to ES version 6. I am currently stuck because ES6 removed the use of the "_type" field.
Old Implementation (ES2)
My software has many users (>100K). Each user has at least one document in ES. So, the hierarchy looks like this:
INDEX -> TYPE -> Document
myindex-> user-123 -> document-1
The key point here is that with this structure I can easily remove all the documents of a specific user.
DELETE /myindex/user-123
(Deletes all the documents of a specific user with a single command)
The problem
"_type" is no longer supported by ES6.
Possible solution
Instead of using _type, use the user ID as the index name. So my index will look like:
"user-123" -> "static-name" -> document
Deleting a user is done by deleting the index (instead of deleting the type, as in the previous implementation).
Questions:
My first worry is about the number of indices and performance: is having something like 1M indices acceptable in terms of performance? Don't forget I have to search on them frequently.
Most of my users have a small number of documents stored in ES. Does it make sense to hold a shard, which should be expensive, for < 10 documents?
Does my data architecture sound reasonable to you?
Any other tip will be welcome!
Thanks.
I would not have one index per user; it's a waste of resources, especially if there are only 10 docs per user.
What I would do instead is to use filtered aliases, one per user.
So the index would be named users and the type would be a static name, e.g. doc. For user 123, the documents of that user would all be stored in users/doc/xyz and in each document you need to add the user id, e.g.
PUT users/doc/xyz
{
...
"userId": 123,
...
}
Then you can define a filtered alias for all documents of user 123, like this:
POST /_aliases
{
  "actions" : [
    {
      "add" : {
        "index" : "users",
        "alias" : "user-123",
        "filter" : { "term" : { "userId" : "123" } }
      }
    }
  ]
}
If you need to delete all documents of user 123, then you can simply do it like this:
POST user-123/_delete_by_query?q=*
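If you create these aliases from application code, the two requests above can be generated per user. The following is just a sketch: `filteredAliasRequest` and `deleteUserRequest` are hypothetical helper names, the index name `users` and the `userId` field come from the example above, and the functions only build request descriptions that you would pass to whatever HTTP client you use. The delete body uses `match_all`, which should be equivalent to the `?q=*` form.

```javascript
// Build the alias-creation request for one user (hypothetical helper).
function filteredAliasRequest(userId) {
  return {
    method: "POST",
    path: "/_aliases",
    body: {
      actions: [
        {
          add: {
            index: "users",
            alias: `user-${userId}`,
            // Same term filter as in the example above.
            filter: { term: { userId: String(userId) } }
          }
        }
      ]
    }
  };
}

// Build the request that deletes everything visible through the user's alias.
function deleteUserRequest(userId) {
  return {
    method: "POST",
    path: `/user-${userId}/_delete_by_query`,
    body: { query: { match_all: {} } }
  };
}

console.log(JSON.stringify(filteredAliasRequest(123), null, 2));
console.log(JSON.stringify(deleteUserRequest(123), null, 2));
```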
Having this many indices is definitely not a good approach. If your only concern is deleting multiple documents with a single command, you can use the Delete By Query API provided by Elasticsearch.
You can introduce a "subtype" attribute in all your documents, holding a per-user value such as "user-123". In your case, a document would look like this:
{
"attribute1":"value",
"subtype":"user-123"
}
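With that attribute in place, the delete for one user could look roughly like the sketch below. It assumes that `subtype` is indexed as an exact-match (keyword) field so that a `term` query matches the whole value, and `deleteBySubtypeRequest` is a made-up helper that only builds the request description.

```javascript
// Build a Delete By Query request that removes all documents of one user,
// identified by the "subtype" attribute (hypothetical helper).
function deleteBySubtypeRequest(index, userId) {
  return {
    method: "POST",
    path: `/${index}/_delete_by_query`,
    body: { query: { term: { subtype: `user-${userId}` } } }
  };
}

console.log(JSON.stringify(deleteBySubtypeRequest("myindex", 123), null, 2));
```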

Search by filters using views in CouchDB

I have a CouchDB database where I store models like this:
{
  "_id": "id",
  "_rev": "rev",
  "field_1": "test",
  "field_2": 45,
  "field_3": 15,
  "object_1": {
    "field_1_1": 123,
    "field_1_2": 125
  }
}
And I want to search for models by specific parameters in different ranges (filters).
For example, in one situation I need to find all the models with
field_2 from 10 to 50
field_3 from 10 to 20
object_1.field_1_1 from 100 to 150, object_1.field_1_2 from 120 to 130
In another case I need to find just all the models with field_2 from 10 to 50.
At the moment I have written a view like this:
function (doc) {
  emit([doc.field_2, doc.field_3, doc.object_1.field_1_1, doc.object_1.field_1_2], 1);
}
So it generates this result:
{"id":"id","key":[45,15,123, 125],"value":1}
I can use this array-key to fetch necessary models and I can use "startkey" and "endkey" to generate ranges.
But is there a more efficient way to search by different filters in CouchDB (some filters can be skipped; the user selects the filters they want to search by)? How can I combine different parameters?
And how can I skip parameters that were not chosen for the search (as in the second case)?
Thank you.
In CouchDB 2.x you can use the /db/_find endpoint with Mango expressions to query the database.
Please check the selector syntax to see whether it covers your needs.
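As a rough idea of how optional filters could map onto a Mango selector, here is a small sketch. `buildFindRequest` is a hypothetical helper, and the `{ min, max }` shape for a filter is my own assumption. It only builds the JSON body you would POST to /db/_find. Filters the user did not choose are left out of the selector.

```javascript
// Build a /db/_find request body from a set of optional range filters.
// `filters` maps a field name (dot notation for sub-fields) to { min, max },
// or to null/undefined when the user did not choose that filter.
function buildFindRequest(filters) {
  const selector = {};
  for (const [field, range] of Object.entries(filters)) {
    if (!range) continue; // filter not chosen: skip it entirely
    const condition = {};
    if (range.min !== undefined) condition.$gte = range.min;
    if (range.max !== undefined) condition.$lte = range.max;
    if (Object.keys(condition).length > 0) selector[field] = condition;
  }
  return { selector };
}

// First case from the question, with field_3 left unselected.
const body = buildFindRequest({
  field_2: { min: 10, max: 50 },
  field_3: null,
  "object_1.field_1_1": { min: 100, max: 150 }
});
console.log(JSON.stringify(body, null, 2));
```

For larger databases you would probably also want a Mango index on the fields you filter by most often, otherwise the query may fall back to scanning all documents.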

Rethinkdb multiple level grouping

Let's say I have a table with documents like:
{
  "country": 1,
  "merchant": 2,
  "product": 123,
  ...
}
Is it possible to group all the documents into a final json structure like:
[
  {
    <country_id>: {
      <merchant_id>: {
        <product_id>: <# docs with this product_id/merchant_id/country_id>,
        ... (other product_id and so on)
      },
      ... (other merchant_id and so on)
    },
    ... (other country_id and so on)
  }
]
And if yes, what would be the best and most efficient way?
I have more than a million of these documents, on 4 shards with powerful servers (22 GB cache each).
I have tried this (in the data explorer, in JS, for the moment):
r.db('foo')
.table('bar')
.indexCreate('test1', function(d){
return [d('country'), d('merchant'), d('product')]
})
and then
r.db('foo')
.table('bar')
.group({index: 'test1'})
But the data explorer seems to hang, still working on it as you can see...
.group({index: 'test1'}).count() will do something pretty similar to what you want, except it won't produce the nested document structure. To produce the nested document structure it would probably be easiest to ungroup, then map over the ungrouped values to produce objects of the form you want, then merge all of them.
The problem with group queries on the whole table though is that they won't stream, you'll need to traverse the whole table to get the end result back. The data explorer is meant for small queries, and I think it times out if your query takes more than 5 minutes to return, so if you're traversing a giant table then it would probably be better to run that query from one of the clients.
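If you run the grouped count from a client, one simple option is to build the nested structure on the client side rather than with a ReQL map and merge. The sketch below assumes the rows have the `{ group, reduction }` shape that ungroup() produces, with `group` being the `[country, merchant, product]` index key. `nestCounts` is a made-up helper and the sample rows are invented for illustration.

```javascript
// Turn ungrouped rows of the form { group: [country, merchant, product], reduction: count }
// into a nested object: country -> merchant -> product -> count.
function nestCounts(rows) {
  const result = {};
  for (const { group: [country, merchant, product], reduction } of rows) {
    const byMerchant = (result[country] = result[country] || {});
    const byProduct = (byMerchant[merchant] = byMerchant[merchant] || {});
    byProduct[product] = reduction;
  }
  return result;
}

// Invented sample rows, only to show the shape of the input and output.
const sampleRows = [
  { group: [1, 2, 123], reduction: 4 },
  { group: [1, 2, 124], reduction: 1 },
  { group: [1, 3, 123], reduction: 2 }
];
console.log(JSON.stringify(nestCounts(sampleRows), null, 2));
```

This still requires the server to traverse the whole table once for the count; it only moves the reshaping step out of the query.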

For 1 billion documents, Populate data from one field to another fields in the same collection using MongoDB

I need to copy data from one field into multiple fields in the same collection. For example:
Currently I have documents like the one below:
{ _id: 1, temp_data: {temp1: [1,2,3], temp2: "foo bar"} }
I want to split it into two different fields in the same collection, like below:
{ _id: 1, temp1: [1,2,3], temp2: "foo bar" }
I have one billion documents to migrate. Please suggest an efficient way to update all one billion documents.
In your favorite language, write a tool that runs through all documents, migrates them, and stores them in a new database.
Some hints:
When iterating the results, make sure they are sorted (e.g. on the _id) so you can implement resume should your migration code crash at 90%...
Do batch inserts: read, say, 1000 items, migrate them, then write 1000 items in a single batch to the new database. Reads are automatically batched.
Create indexes after the migration, not before. That will be faster and lead to less fragmentation
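The hints above could be sketched roughly like this. To keep the example self-contained it uses in-memory arrays instead of a real database: `readBatchAfter` and `writeBatch` are hypothetical stand-ins for "read the next N documents with _id greater than the last one, sorted by _id" and "bulk insert into the new database", which you would implement with your driver of choice.

```javascript
// Flatten one document: move the contents of temp_data to the top level.
function migrateDoc(doc) {
  const { temp_data, ...rest } = doc;
  return { ...rest, ...(temp_data || {}) };
}

// Read, migrate and write in batches, remembering the last _id so the
// migration can be resumed from that point after a crash.
function migrateInBatches(readBatchAfter, writeBatch, batchSize, resumeAfterId = null) {
  let lastId = resumeAfterId;
  for (;;) {
    const batch = readBatchAfter(lastId, batchSize);
    if (batch.length === 0) return lastId;
    writeBatch(batch.map(migrateDoc));
    lastId = batch[batch.length - 1]._id; // persist this somewhere durable in a real tool
  }
}

// In-memory stand-ins for the source and target databases (sorted by _id).
const source = [
  { _id: 1, temp_data: { temp1: [1, 2, 3], temp2: "foo bar" } },
  { _id: 2, temp_data: { temp1: [4], temp2: "baz" } },
  { _id: 3, temp_data: { temp1: [], temp2: "qux" } }
];
const target = [];
const readBatchAfter = (lastId, n) =>
  source.filter((d) => lastId === null || d._id > lastId).slice(0, n);
const writeBatch = (docs) => target.push(...docs);

const lastMigratedId = migrateInBatches(readBatchAfter, writeBatch, 2);
console.log(lastMigratedId, JSON.stringify(target));
```

Paging by "greater than the last _id" rather than by skipping N documents is what makes the resume cheap, since the read can use the _id index directly.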
Here is a query you can use to migrate your data:
db.collection.find().forEach(function (myDoc) {
  // Upsert the flattened document into the new collection, keyed by the same _id.
  db.collection_new.update(
    { _id: myDoc._id },
    {
      $unset: { temp_data: 1 },
      $set: {
        temp1: myDoc.temp_data.temp1,
        temp2: myDoc.temp_data.temp2
      }
    },
    { upsert: true }
  );
});
To learn more about forEach on a cursor, see the MongoDB documentation.
You will need $limit and $skip to migrate the data in batches. The update query uses upsert so that a document that already exists is updated, and otherwise a new entry is inserted.
Thanks

one-to-many relationships in Elastic Search

Suppose I have 2 tables called "twitter_users" and "twitter_comments".
twitter_users has the fields: username and bio
twitter_comments has the fields: username and comment
Obviously, a user has 1 entry in twitter_users and potentially many in twitter_comments.
I want to model both twitter_users and twitter_comments in Elastic Search, have ES search both models when I query, knowing that a comment counts towards the overall relevancy score for a twitter user.
I know I can mimic this with just 1 model, by creating a single extra field (in addition to username and bio) with all the comments concatenated. But is there another "cleaner" way?
It depends.
If you just want to be able to search a user's comments (full-text and over all fields), simply store all comments within the user object (no need to concatenate anything):
{
  "user" : {
    "username" : "TestUser",
    "bio" : "whatever",
    "comments" : [
      {
        "title" : "First comment",
        "text" : "My 1st comment"
      },
      {
        "title" : "Second comment",
        "text" : "My 2nd comment"
      }
    ]
  }
}
If you need per-comment queries, you need to map the comments as nested (before indexing any data), so that every comment gets treated as a separate item.
For your scoring, simply add another field "comment_count" and use this for your boost/scoring.
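A rough sketch of what the nested variant could look like follows. It is an illustration under assumptions: the typeless mapping layout assumes a recent Elasticsearch version (so the "user" wrapper from the example document is dropped), `userMapping` and `usersByComment` are made-up names, and `score_mode: "sum"` is my choice so that several matching comments add up in the user's score. The code only builds the request bodies.

```javascript
// Hypothetical mapping: comments are mapped as nested so that each comment
// is matched as its own item.
const userMapping = {
  mappings: {
    properties: {
      username: { type: "keyword" },
      bio: { type: "text" },
      comment_count: { type: "integer" },
      comments: {
        type: "nested",
        properties: {
          title: { type: "text" },
          text: { type: "text" }
        }
      }
    }
  }
};

// Build a search body that returns users having comments that match `text`.
function usersByComment(text) {
  return {
    query: {
      nested: {
        path: "comments",
        query: { match: { "comments.text": text } },
        score_mode: "sum" // matching comments add up towards the user's score
      }
    }
  };
}

console.log(JSON.stringify(usersByComment("1st comment"), null, 2));
```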
As Thorsten already suggested, you can use a nested query, and it's a good approach.
Alternatively, you can index comments as children of users. Then you can search users as you do now, search comments using the top_children query to find all comments relevant to your search, and finally combine the scores from both using bool or dis_max queries.
The nested approach would be more efficient during search, but you will have to reindex the user and all comments every time an additional comment is added. With the child/parent approach you will need to index only new comments, but search will be slower and will require more memory.

Resources