I was going through RethinkDB docs and found out about geospatial queries. So I thought, why not give it a try to build an UBER kind of database to get Drivers near a User.
So here is how I approached:
Creating the Customer and Driver
r.table('customer').insert({
name: "John",
currentLocation: [77.627108, 12.927923]
})
r.table('driver').insert({
name: "Carl",
currentLocation: [77.612319, 12.934784]
})
Creating a geospatial index on Driver table as Customer will be searching for Drivers nearest to them.
r.table('driver').indexCreate('currentLocation', {geo: true})
According to their docs, we can find the nearest point using getNearest api
r.table('driver').getNearest(r.point(77.627108, 12.927923),
{index: 'currentLocation', maxDist: 2000, unit: 'm'}
)
(r.point(77.627108, 12.927923) is on Customer. Right now I am not concerned with querying the Customer table and turning that into ReQL geometry object)
Theoretically, the above query should work but it isn't. It returns an empty array. Am I missing something?
ANSWER
Just found this - I missed this important line in docs "A geospatial index field should contain only geometry objects. It will work with geometry ReQL terms (getIntersecting and getNearest) as well as index-specific terms (indexStatus, indexWait, indexDrop and indexList)."
All the queries are fine except I have to make a small modification in Driver query:
r.table('driver').insert({
name: "Carl",
currentLocation: r.point(77.612319, 12.934784)
})
currentLocation attribute needs to have geometry object and not an array. After making this change, everything worked fine.
Related
Note: it will be very appreciate if you tell me why you think this is a shit question by comment. Please do not just down vote and not telling why..
We know there is the concept called type under index. But I do not know why we need it.
Firstly I thought we use it to organize data. Like we have index like below:
curl -XPOST 'localhost:9200/customer/USA/_bulk?pretty' -d '
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }
'
But in the above situation, we can always eliminate the type, move it to the json body like :
curl -XPOST 'localhost:9200/customer/_bulk?pretty' -d '
{"index":{"_id":"1"}}
{"name": "John Doe","country":"USA" }
{"index":{"_id":"2"}}
{"name": "Jane Doe","country":"USA" }
'
In this way we can always add a field to replace the type.
Then I thought it may be performance related. I thought If you split the data into different type, then there is less data under each type. So the performance to query each type should be better. But it is also not like that.
The performance of elasticsearch index is related to the shard. So even you split the data into different type, it still stored under the same sets of shards.
Then why we need type?
First of all, although elastic search determine types of fields on runtime, but once it has assigned a particular type to a field it would always expect same type of value for that field. So you need multiple types if you need to store different type of data. Secondly it allows for storing multiple types with difference mappings in single index. Besides it makes querying on a particular type easier if you are sure about its schema.
From my understanding of ES , type is something we can relate to table concept in a relational database. In which a database can be said as group of related tables. Likewise in ES,index is a group of related types each type in index holds documents that share some common property or fields.
In your example,for a index say Customer we can have different employees from different countries like USA,india,UK etc. Customer records from each country can be grouped under different types so that it will be organized. And when we run a search query for customers in a particular country we will need to run that query on type USA only. We don't need to lookup in the whole index to get the data of customers from USA.
Another example : Let’s assume you run a blogging platform and store all your data in a single index. In this index, you may define a type for user data, another type for blog data, and yet another type for comments data. So we are logically organizing the data to different types and looking up to the required type whenever we do a search.
So in general,type is a logical category/partition of your index whose semantics is completely up to you. It can be defined as documents that have a set of common fields.
You may refer to this post for better understanding https://www.elastic.co/blog/index-vs-type
I need to store data that can be represented in JSON as follows:
Article{
Id: 1,
Category: History,
Title: War stories,
//Comments could be pretty long and also be changed frequently
Comments: "Nice narration, Reminds me of the difficult Times, Tough Decisions"
Tags: "truth, reality, history", //Might change frequently
UserSpecifiedNotes:[
//The array may contain different users for different articles
{
userid: 20,
note: "Good for work"
},
{
userid: 22,
note: "Homework is due for work"
}
]
}
After having gone through different articles, denormalization of data is one of the ways to handle this data. But since common fields could be pretty long and even be changed frequently, I would like to not repeat it. What could be the other ways better ways to represent and search this data? Parent-child? Inner object?
Currently, I would be dealing with a lot of inserts, updates and few searches. But whenever search is to be done, it has to be very fast. I am using NEST (.net client) for using elastic search. The search query to be used is expected to work as follows:
Input: searchString and a userID
Behavior: The Articles containing searchString in either Title, comments, tags or the note for the given userIDsort in the order of relevance
In a normal scenario the main contents of the article will be changed very rarely whereas the "UserSpecifiedNotes"/comments against an article will be generated/added more frequently. This is an ideal use case for implementing parent-child relation.
With inner object you still have to reindex all of the "man article" and "UserSpecifiedNotes"/comments every time a new note comes in. With the use of parent-child relation you will be just adding a new note.
With the details you have specified you can take the approach of 4 indices
Main Article (id, category, title, description etc)
Comments (commented by, comment text etc)
Tags (tags, any other meta tag)
UserSpecifiedNotes (userId, notes)
Having said that what need to be kept in mind is your actual requirement. Having parent-child relation will need more memory, and ma slow down search performance a tiny bit. But indexing will be faster.
On the other hand a nested object will increase your indexing time significantly as you need to collect all the data related to an article before indexing. You can of course store everything and just add as an update. As a simpler maintenance and ease of implementation I would suggest use parent-child.
I'm working on a simple side project, and have a tech stack that involves both a SQL database and ElasticSearch. I only have ElasticSearch because I assumed that as my project grows, my full text searching would be most efficiently performed by ES. My ES schema is very simple - documents that I insert into ES have 2 fields, one being the id and the other being the field with the body of text to search. The id being inserted into ES corresponds to that document's primary key id from the SQL database.
insert record into SQL -> insert record into ES using PK from SQL
Searching would be the reverse of that. Query ES and grab all the matching ids, and then turn around and use those ids to get records from SQL.
search ES can get all PK ids -> use those ids to get documents from SQL
The problem that I am facing though, is that ES can only return documents in a paginated manner. This is a problem because I also have a WHERE clause on my SQL query, beyond just the ids. My SQL query might look like this ...
SELECT * FROM foo WHERE id IN (1,2,3,4,5) AND bar != 'baz'
Well, with ES paginating the results, my WHERE clause will always only be querying a subset of the full results from ES. Even if I utilize ES' skip and take, I'm still only querying SQL using a subset of document ids.
Is there a way to get Elastic Search to only return the entire list of matching document ids? I realize this is here to not allow me to shoot myself in the foot, because doing this across all shards and many many documents is not efficient. Is there no way, though?
After putting in some hours on this project, I've only now realized that I've poorly engineered this, unless I can get all of these ids from ES. Some alternative implementations that I've thought of would be to store the things that I'm filtering on, in SQL, in ES as well. A problem there is that I'd have to update the ES document every time I update the document in SQL. This would require a pretty big rewrite to some of my data access code. I could scrap ElasticSearch all together and just perform searching in Postgres, for now, until I can think of a better way to structure this.
The elasticsearch not support return each and every doc match to you queries. Because it Ll overload the system. Instead of this.. Use scroll concept in elasticsearch.. It's lik cursor concept in db's..
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scan-scroll.html
For more examples refer the Github repo. https://github.com/sidharthancr/elasticsearch-java-client
Hope it helps..
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-fields.html
please have a look into the elastic search document where you can specify only particular fields that return from the match documents
hope this resolves your problem
{
"fields" : ["user", "postDate"],
"query" : {
"term" : { "user" : "kimchy" }
}
}
Referring to this question here:
I am working on a similar site using mongodb as my main database. As you can imagine, each user object has a lot of fields that need to be serchable, say for example mood, city, age, sex, smoker, drinker, etc.
Now, apart from the problem that there cannot be more than 64 indexes per collection, is it wise to assign index to all of my fields?
There might be another viable way of doing it: tags (refer to this other question) If i set the index on an array of predetermined tags and then text-search over them, would it be better? as I am using only ONE index. What do you think? E.g.:
{
name: "john",
tags: ["happy", "new-york", "smoke0", "drink1"]
}
MongoDB doesn't (yet) support index intersection, so the rule is: one index per query. Some of your query parameters have extremely low selectivity, the extreme example being the boolean ones, and indexing those will usually slow things down rather than speed them up.
As a simple approximation, you could create a compound index that starts with the highest-selectivity fields, for instance {"city", "age", "mood", ... }. However, then you will always have to use a city constraint. If you query for {age, mood}, the above index wouldn't be used.
If you can narrow down your result set to a reasonable size using indexes, a scan within that set won't be a performance hog. More precisely, if you say limit(100) and MongoDB has to scan 200 items to fill up those 100, it won't be critical.
The danger lies is very narrow searches across the database - if you have to perform a scan on the entire dataset to find the only unhappy, drinking non-smoker older than 95, things get ugly.
If you want to allow very fine grained searches, a dedicated search database such as SolR might be a better option.
EDIT: The tags suggestion looks a bit like using the crowbar to me -- maybe the key/value multikey index recommended by in the MongoDB FAQ is a cleaner solution:
{ _id : ObjectId(...),
attrib : [
{ k: "mood", v: "happy" },
{ k: "city": v: "new york" },
{ k: "smoker": v: false },
{ k: "drinker": v: true }
]
}
However, YMMV and 'clean' and 'fast' often don't point in the same direction, so the tags approach might not be bad at all.
I'm looking to search for a particular JSON document in a bucket and I don't know its document ID, all I know is the value of one of the sub-keys. I've looked through the API documentation but still confused when it comes to my particular use case:
In mongo I can do a dynamic query like:
bucket.get({ "name" : "some-arbritrary-name-here" })
With couchbase I'm under the impression that you need to create an index (for example on the name property) and use startKey / endKey but this feels wrong - could you still end up with multiple documents being returned? Would be nice to be able to pass a parameter to the view that an exact match could be performed on. Also how would we handle multi-dimensional searches? i.e. name and category.
I'd like to do as much of the filtering as possible on the couchbase instance and ideally narrow it down to one record rather than having to filter when it comes back to the App Tier. Something like passing a dynamic value to the mapping function and only emitting documents that match.
I know you can use LINQ with couchbase to filter but if I've read the docs correctly this filtering is still done client-side but at least if we could narrow down the returned dataset to a sensible subset, client-side filtering wouldn't be such a big deal.
Cheers
So you are correct on one point, you need to create a view (an index indeed) to be able to query on on the content of the JSON document.
So in you case you have to create a view with this kind of code:
function (doc, meta) {
if (doc.type == "youtype") { // just a good practice to type the doc
emit(doc.name);
}
}
So this will create a index - distributed on all the nodes of your cluster - that you can now use in your application. You can point to a specific value using the "key" parameter