Elasticsearch: return unique matched values

I recently started looking at Elasticsearch. I'm still learning what it can do and deciding how to use it in my projects.
For one project I used a CouchDB (NoSQL) database. The client can search it using CouchDB views. Easy, but limited in functionality.
I'd like to use Elasticsearch to open up the data in a much richer way.
Searching for composers and titles of musical pieces is now handled by Elasticsearch with amazingly fast 'query_string' queries. And it's fuzzy!
There is one thing, however, that I have not managed to accomplish with Elasticsearch. I'm pretty sure it's possible and I'm just missing it.
It's about the autocomplete functionality when entering instrument names.
For example:
I have 2 documents (musical pieces) with different instruments needed to play them:
{
title: 'Awesome Piece',
authors: [{
name: 'John Doe',
role: 'composer'
}, {
name: 'Shakespeare',
role: 'lyricist'
}],
instruments: [
'soprano',
'alto',
'tenor',
'bass',
'trumpet',
'trumpet',
'piano'
]
}
{
title: 'Not so Awesome Piece',
authors: [{
name: 'Another J. Doe',
role: 'composer'
}, {
name: 'Little John',
role: 'arranger'
}],
instruments: [
'trombone',
'organ'
]
}
To enter a new musical piece there is a field for instrument names. I'd like to offer an autocomplete.
So if the user types 't', I want a list of all instruments matching 't*': ['tenor', 'trumpet', 'trombone']. If they type 'tr', I need ['trumpet', 'trombone'].
The best I could find was a query with an aggregation, but it searches for documents and aggregates them as a whole, returning all instruments of the document(s) found by the query.
And of course, I want the autocomplete to be fuzzy in the end.
Can anybody point me in a direction?
Thanks in advance!
(I'm running elasticsearch 2.3, but I don't mind upgrading!)
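One direction that might work is a terms aggregation whose `include` option filters the buckets themselves, so only instrument values matching the typed prefix come back, each one once. The sketch below only builds the request body and reads the response. It assumes the `instruments` field is indexed unanalyzed (not_analyzed in 2.x) and stored in lowercase; the aggregation name `matching_instruments` and both helper functions are made up for illustration. It does not cover the fuzzy part.

```python
def instrument_autocomplete_body(prefix, size=10):
    """Build a search body returning unique instrument names that start with prefix."""
    # Simplification: keep only letters and digits so the prefix is safe to
    # embed in the regular expression used by "include".
    safe = "".join(ch for ch in prefix.lower() if ch.isalnum())
    return {
        "size": 0,  # no document hits needed, only the aggregation
        # Optional: only look at documents that contain a matching instrument.
        "query": {"prefix": {"instruments": safe}},
        "aggs": {
            "matching_instruments": {
                "terms": {
                    "field": "instruments",
                    # Keep only the buckets (values) that match the prefix.
                    "include": safe + ".*",
                    "size": size,
                }
            }
        },
    }


def instrument_names(response):
    """Extract the bucket keys (the unique instrument names) from a search response."""
    buckets = response["aggregations"]["matching_instruments"]["buckets"]
    return [bucket["key"] for bucket in buckets]
```

Without `include`, the aggregation returns every instrument of every matching document, which is the behaviour described in the question.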

Related

ElasticSearch Chaining queries based on result from first query

I have some data and want to implement a search feature that probably requires chaining multiple queries. For example, a few people are part of a group, but each member is a separate document in the database. None of the data is nested.
For example
data = [
{
id: '1',
name: 'abc',
familyId: '3'
},
{
id: '2',
name: 'def',
familyId: '3'
},
{
id: '3',
name: 'ghi',
familyId: null
},
]
I am trying to implement a search feature where people can search by name, and if the name matches I want to show that result along with that person's family members. Each document is separate and there is no connection between them apart from the familyId.
Currently my solution is to search by name first, then check whether the result has a familyId, and if so make another ES query to get all the members and show the result.
Is there a way to do this in one query that gives me the desired output?
Any suggestion is very much appreciated.
There's no native Elasticsearch functionality for this, unfortunately. Your approach is currently the best way to do it.
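The two-step approach from the question could be sketched roughly like this. The `search` argument is a hypothetical callable standing in for whatever client call runs a query body and returns the matching documents; `fake_search` is an in-memory stand-in over the sample data so the flow can be tried without a cluster.

```python
def search_with_family(name, search):
    """Find people by name, then fetch everyone sharing their familyId."""
    # First query: match on the name.
    matches = search({"query": {"match": {"name": name}}})
    family_ids = sorted({m["familyId"] for m in matches if m.get("familyId")})
    if not family_ids:
        return matches
    # Second query: all members of the families found in step one.
    relatives = search({"query": {"terms": {"familyId": family_ids}}})
    # Merge both result lists, dropping duplicates by id and keeping order.
    merged = {doc["id"]: doc for doc in matches + relatives}
    return list(merged.values())


DATA = [
    {"id": "1", "name": "abc", "familyId": "3"},
    {"id": "2", "name": "def", "familyId": "3"},
    {"id": "3", "name": "ghi", "familyId": None},
]


def fake_search(body):
    """In-memory stand-in for Elasticsearch, supporting only the two queries above."""
    query = body["query"]
    if "match" in query:
        return [d for d in DATA if d["name"] == query["match"]["name"]]
    wanted = query["terms"]["familyId"]
    return [d for d in DATA if d["familyId"] in wanted]
```

Collecting all family IDs first and sending them in one `terms` query keeps the total at two round trips, however many people match the name.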

Elasticsearch question, should I have duplicate data along 2 different indices? Not sure how to set up the data

Edit: 3 different indices. Sorry about the title :c
I am trying to grasp elasticsearch as fast as I can but I think I've confused myself majorly here. How should I set this data up?
I have 3 major searches:
1: Search by pokemon name. Eg: Show all Charizard in the system.
2: Search by trainer name Eg: Show all of John Doe's pokemon/checkins at the pokecenter.
3: Search by checkins at the pokecenter.
Should each of these be in its own separate index? I come primarily from an SQL background, so my instinct is to have separate tables for all of these. But that isn't how Elasticsearch works, so I am really confused here.
Should I have a separate index for each pokemon?
And then another separate index for each trainer?
And then another separate index for each checkin at the pokecenter?
Query return examples
1: Search by pokemon name.
{
1 : {
id: 9239329,
pokeId: 6,
name: Charizard,
trainerId: 2932
}
}
2: Search by trainer name
{
1 : {
id: 2932,
name: John Doe,
pokemon: [
9239329
]
}
}
3: Search by checkins at the pokecenter.
{
1 : {
id: 3232,
date: 11/11/1111,
pokemon: [
9239329
],
trainerId: 2932
}
}
But if I have a separate index for EACH of these, while that would be fast, wouldn't that be horrendous data duplication?
It depends on the scope of the project:
The ideal way is to give each one its own separate index. This lets you scale them differently if needed, move them to another cluster, and give each one different replica settings.
The quick way is to have the check-ins as an index, with the trainer as a nested object and, under that, the pokemon as a nested object.
Note: nested queries are slower, and writing queries that return exactly what you want is a little trickier.
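As a rough illustration of the "quick way", the mapping properties and one nested query could look like the sketch below. The field names, the `text`/`keyword` field types (which assume Elasticsearch 5 or later) and the helper `checkins_by_trainer` are all assumptions made for the example, not something the answer specifies.

```python
# Hypothetical properties for a "checkins" index: each check-in embeds its
# trainer, and the trainer embeds the pokemon, both as nested objects.
CHECKIN_PROPERTIES = {
    "date": {"type": "date"},
    "trainer": {
        "type": "nested",
        "properties": {
            "id": {"type": "long"},
            "name": {"type": "text"},
            "pokemon": {
                "type": "nested",
                "properties": {
                    "id": {"type": "long"},
                    "pokeId": {"type": "integer"},
                    "name": {"type": "keyword"},
                },
            },
        },
    },
}


def checkins_by_trainer(name):
    """Search body: check-ins whose nested trainer matches the given name."""
    return {
        "query": {
            "nested": {
                "path": "trainer",
                "query": {"match": {"trainer.name": name}},
            }
        }
    }
```

Every search on trainer or pokemon fields has to be wrapped in a `nested` query like this one, which is where the extra query-writing effort comes from.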

How do I model my document in MongoDB to make it paginable for nested attributes?

I'm trying to cache my tweets and show them based on my saved keyword. However, as the tweets grow over time I need to paginate them.
I'm using Ruby and Mongoid, and this is what I have come up with so far.
class SavedTweet
include Mongoid::Document
field :saved_id, :type => String
field :slug, :type => String
field :tweets, :type => Array
end
And the tweets array would be like this
{id: "id", text: "text", created_at: "created_at"}
So it's like a bucket for each keyword that you can save. My first problem is that MongoDB cannot sort the second level of a document, which in this case is tweets, and that makes pagination much harder because I cannot use skip and limit. I would have to load all the tweets, put them in the cache, and paginate from there.
The question is how I should model my data so that pagination happens in MongoDB and not in memory. I'm assuming that doing it in MongoDB would be faster. Right now I'm at an early stage of my application, so it's easier to change the model now than later. If you have any suggestions or opinions, I'd really appreciate them.
An option could be to save tweets in a different collection and link them with your SavedTweet class. It will be easy to query and you could use skip and limit without problems.
{id: "id", text: "text", created_at: "created_at", saved_tweet:"_id"}
EDIT: a better explanation, with two additional options
As far as I can see, you have three options, if I understand your requirements correctly:
Use the same schema that you are already using. You would have two problems: you cannot use skip and limit with a usual query, and documents are limited to 16 MB. I think the first one could be resolved with an Aggregation Framework query ($unwind, $skip and $limit could be helpful). The second one could be a problem if you have a lot of tweet documents in the array, because one document cannot be larger than 16 MB.
Use two collections to store your tweets. One collection would have the same structure that you already have. For example:
{
save_id:"23232",
slug:"adkj"
}
And the other collection would have one document per tweet.
{
id: "id",
text: "text",
created_at: "created_at",
saved_tweet:"_id"
}
With the saved_tweet field you link saved_tweets to tweets in a 1-to-N relation. This way, you can run queries over the tweet collection and still use the limit and skip operators.
Save all info in the same document. If your saved_tweet collection only has those fields, you can save all the info in a single document (one document for each tweet). Something like this:
{
save_id:"23232",
slug:"adkj",
tweet:
{
tweet_id: "id",
text: "text",
created_at: "created_at"
}
}
With this solution you are duplicating fields, because save_id and slug would be the same in other documents of the same saved_tweet, but it could be an option if you have a small number of fields and those fields are not subdocuments or arrays.
I hope it is clear now.
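For the first option (keeping the tweets array and paginating with the Aggregation Framework), the pipeline might look roughly like this. It is a sketch that only builds the pipeline stages as plain data. The function name and the choice to sort newest first by `created_at` are assumptions.

```python
def tweets_page_pipeline(slug, page, per_page=20):
    """Build an aggregation pipeline returning one page of tweets for a saved keyword."""
    return [
        {"$match": {"slug": slug}},             # pick the bucket for this keyword
        {"$unwind": "$tweets"},                 # one output document per tweet
        {"$sort": {"tweets.created_at": -1}},   # newest first (assumed ordering)
        {"$skip": (page - 1) * per_page},       # pages are numbered from 1
        {"$limit": per_page},
    ]
```

The resulting list is what a driver's aggregate call expects. The 16 MB document limit still applies to this schema, so it only helps while the array stays small enough.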

Elasticsearch: field "title" was indexed without position data; cannot run PhraseQuery

I have an index in ElasticSearch with the following mapping:
mappings: {
feed: {
properties: {
html_url: {
index: not_analyzed
omit_norms: true
index_options: docs
type: string
}
title: {
index_options: offsets
type: string
}
created: {
store: true
format: yyyy-MM-dd HH:mm:ss
type: date
}
description: {
type: string
}
}
}
}
I get the following error when performing a phrase search ("video games"):
IllegalStateException[field \"title\" was indexed without position data; cannot run PhraseQuery (term=video)];
Single-word searches work fine. I tried "index_options: positions" as well, but with no luck. The title field contains text in multiple languages and is sometimes empty. Interestingly, it seems to fail randomly: for example, it would fail with 200K documents or with 800K using the same dataset. Is there a reason some titles wouldn't get indexed with positions?
Elasticsearch version 0.90.5
Just in case someone else has the same issue. There was another type/table (feed2) in the same index with the same "title" field that was set to "not_analyzed".
For some reason even if you specify the type: http://elasticsearchhost.com:9200/index_name/feed/_search the other type is still being searched as well. Changing the mapping for feed2 type fixed the problem.
You probably have another field named 'title' with a different mapping in another type but in the same index.
Basically if you have 2 fields with the same name in the same index - even if they are in different types - they cannot have different mappings: to be more precise, even if they have the same type (eg: "string") but one of them is "analyzed" and the other is "not analyzed", problems will arise.
I mean, yeah, you can try to set up 2 different mappings, and ElasticSearch will not complain, but when searching you get strange results and everything will go bananas.
You can read more about this issue here where they say:
[...] In the end, we opted to enforce the rule that all fields with the same name in the same index must have the same mapping [...]
And yeah, considering how the promise of ElasticSearch has always been "it just works" this little detail took a lot of people by surprise.
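A quick way to spot this situation is to compare the definitions of same-named fields across all types of an index. The helper below is a hypothetical sketch that works on a plain dictionary shaped like the mappings shown in the question. It does not talk to a cluster, and the feed2 mapping in the example is assumed.

```python
def find_conflicting_fields(mappings):
    """Return {field: [type names]} for fields defined differently in several types."""
    seen = {}
    for type_name, type_mapping in mappings.items():
        for field, settings in type_mapping.get("properties", {}).items():
            seen.setdefault(field, []).append((type_name, settings))
    conflicts = {}
    for field, definitions in seen.items():
        first_settings = definitions[0][1]
        # A conflict is any later definition that differs from the first one.
        if any(settings != first_settings for _, settings in definitions[1:]):
            conflicts[field] = [type_name for type_name, _ in definitions]
    return conflicts


example = {
    "feed": {"properties": {
        "title": {"type": "string", "index_options": "offsets"},
        "description": {"type": "string"},
    }},
    # Assumed shape of the second type described in the answer.
    "feed2": {"properties": {
        "title": {"type": "string", "index": "not_analyzed"},
    }},
}
print(find_conflicting_fields(example))
```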

Multiple atomic updates using MongoDB?

I am using Codeigniter and Alex Bilbie's MongoDB library.
In my API that I am developing users can upload images and other users can comment on them.
I have chosen to include the comments as sub documents to the images.
Each comment contains:
Fullname (of author)
Comment
Created_at
In other words, the user's full name is "hard coded" into each comment, so if they later decide to change their name I have a problem.
I read that I can use atomic updates to update all occurrences of the name (like in comments), but how can I do this using Alex's library? Can I update all places where the name is wrong?
UPDATE
This is what the image document looks like with the comments.
I think it is pretty strange that MongoDB encourages the use of subdocuments but then does not include a way to update multiple items in an array.
{
"_id": ObjectId("4e9ead773dc793dc01020000"),
"description": "An image",
"category": "accident",
"comments": [
{
"id": ObjectId("4e96bd063dc7937202000000"),
"fullname": "James Bond",
"comment": "This is a comment.",
"created_at": "2011-10-19 13:02:40"
}
],
"created_at": "2011-10-19 12:59:03"
}
Thankful for all help!
I am not familiar with CodeIgniter, but maybe the MongoDB shell syntax will help you:
db.comments.update( {"Fullname":"Andrew Orsich"},
{ $set : { Fullname: "New name"} }, false, true )
The last true flag indicates that you want to update multiple documents. So it is possible to update all comments in one update operation.
BTW: denormalizing (not 'hard coding') data in MongoDB and NoSQL in general is a usual operation. Also, operations that require updating a lot of documents usually run asynchronously. But it is up to you.
Update:
db.comments.update( {"comments.Fullname":"Andrew Orsich"},
{ $set : { "comments.$.Fullname": "New name"} }, false, true )
However, the above query will update the full name only in the first matching comment of each nested array. If you need to change more than one array element, you will need to use multiple update statements.
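Running "multiple update statements" can be as simple as repeating the positional update until nothing is left to change. The sketch below assumes a pymongo-style collection object (an `update_many(filter, update)` method whose result has a `modified_count` attribute) and the lowercase `fullname` field from the question's document. The function name is made up.

```python
def rename_comment_author(images, old_name, new_name):
    """Repeat the positional update until no comment carries old_name any more.

    Returns the number of passes that actually modified something.
    """
    passes = 0
    while True:
        # Each pass rewrites only the FIRST matching comment in every image.
        result = images.update_many(
            {"comments.fullname": old_name},
            {"$set": {"comments.$.fullname": new_name}},
        )
        if result.modified_count == 0:
            return passes
        passes += 1
```

The number of passes equals the largest number of comments by that author on any single image, so the loop stays short unless one user commented many times on the same picture.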
