Sorry for the ambiguous title, couldn't thing of anything better fitting.
I 'm exploring Elastic Search and it looks very cool. My question is conceptual since I 'm used to sql.
In Sql, you have different databases and you store the data for each application there. Does the same concept exist in ES? Or is all data from all my application going to end up in the same place? In that case, what are the best practices to avoid unwanted results from unfitting data?
Schemaless doesn't mean structureless:
In elastic search you can organize your data into document collections
A top-level document collection is roughly equivalent to a database
You can also hierarchically create new document collections inside top-level collections, which is a very rough equivalent of a database table
When you search documents, you search for documents inside specific document collections (such as search for all posts inside blog1)
Individual documents can be viewed as equivalent to rows in a database table
Also please note that I say roughly equivalent -- data in SQL is often normalized into tables by relations, while documents (in ES) often hold large entities of data. For instance, it generally makes sense to embed all comments inside a blog post document, whereas in SQL you would normalize comments and blogposts into individual tables.
For a nice tutorial, I recommend taking look at "ElasticSearch in 5 minutes" tutorial.
Switching from SQL to a search engine can be challenging at times. Elasticsearch has a concept of index, that can be roughly mapped to a database and type that can, again very roughly, mapped to a table. Elasticsearch has very powerful mechanism of selecting records (rows) of a single type and combining results from different types and indices (union). However, there is no support for joins at the moment. The only relationship that elasticsearch supports is has_child, but it's not suitable for modeling many-to-many relationships. So, in most cases, you need to be prepared to denormalize your data, so it can be stored in a single table.
Related
I have 2 indexes and they both have one common field (basically relationship).
Now as elastic search is not giving filters from multiple indexes, should we store them in memory in variable and filter them in node.js (which basically means that my application itself is working as a database server now).
We previously were using MongoDB which is also a NoSQL DB but we were able to manage it through aggregate queries but seems the elastic search is not providing that.
So even if we use both databases combined, we have to store results of them somewhere to further filter data from them as we are giving users advanced search functionality where they are able to filter data from multiple collections.
So should we store results in memory to filter data further? We are currently giving advanced search in 100 million records to customers but that was not having the advanced text search that elastic search provides, now we are planning to provide elastic search text search to customers.
What do you suggest should we use the approach here to make MongoDB and elastic search together? We are using node.js to serve data.
Or which option to choose from
Denormalizing: Flatten your data
Application-side joins: Run multiple queries on normalized data
Nested objects: Store arrays of objects
Parent-child relationships: Store multiple documents through joins
https://blog.mimacom.com/parent-child-elasticsearch/
https://spoon-elastic.com/all-elastic-search-post/simple-elastic-usage/denormalize-index-elasticsearch/
Storing things client side in memory is not the solution.
First of all the simplest way to solve this problem is to simply make one combined index. Its very trivial to do this. Just insert all the documents from index 2 into index 1. Prefix all fields coming from index-2 by some prefix like "idx2". That way you won't overwrite any similar fields. You can use an ingestion pipeline to do this, or just do it client side. You only will ever do this once.
After that you can perform aggregations on the single index, since you have all the data in one-index.
If you are using somehting other than ES as your primary data-store you need to reconfigure the indexing operation to redirect everything that was earlier going into index-2 to go into index-1 as well(with the prefixed terms).
100 million records is trivial for something like ELasticsearch. Doing anykind of "joins" client side is NOT RECOMMENDED, as this will obviate the entire value of using ES.
If you need any further help on executing this, feel free to contact me. I have 11 years exp in ES. And I have seen people struggle with "joins" for 99% of the time. :)
The first thing to do when coming from MySQL/PostGres or even Mongodb is to restructure the indices to suit the needs of data-querying. Never try to work with multiple indices, ES is not built for that.
HTH.
I'm creating a microservice to handle the contacts that are created in the software. I'll need to create contacts and also search if a contact exists based on some information (name, last name, email, phone number). The idea is the following:
A customer calls, if it doesn't exist we create the contact asking all his personal information. The second time he calls, we will search coincidences by name, last name, email, to detect that the contact already exists in our DB.
What I thought is to use a MongoDB as primary storage and use ElasticSearch to perform the query, but I don't know if there is really a big difference between this and querying in a common relational database.
EDIT: Imagine a call center that is getting calls all the time from mostly different people, and we want to search fast (by name, email, last name) if that person it's in our DB, wouldn't ElasticSearch be good for this?
A relational database can store data and also index it.
A search engine can index data but also store it.
Relational databases are better in read-what-was-just-written performance. Search engines are better at really quick search with additional tricks like all kinds of normalization: lowercase, รค->a or ae, prefix matches, ngram matches (if indexed respectively). Whether its 1 million or 10 million entries in the store is not the big deal nowadays, but what is your query load? Well, there are only this many service center workers, so your query load is likely far less than 1qps. No problem for a relational DB at all. The search engine would start to make sense if you want some normalization, as described above, or you start indexing free text comments, descriptions of customers.
If you don't have a problem with performance, then keep it simple and use 1 single datastore (maybe with some caching in your application).
Elasticsearch is not meant to be a primary datastore so my advice is to use a simple relational database like Postgres and use simple SQL queries / a ORM mapper. If the dataset is not really large it should be fast enough.
When you have performance issues on searches you can use a combination of relation db and Elasticsearch. You can use Elasticsearch feeders to update ES with your data in you relational db.
Indexed RDBMS works well for search
If your data is structured i.e. columns are clearly defined, searching 1 million records will also not be a problem in RDBMS.
When to use Elastic
Text Search: Searching words across multiple properties (e.g. description, name etc.)
JSON Store and search: If data being stored is in json format and later needs to be searched
Auto Suggestions: Elastic is better at providing autocomplete suggestions
Elastic as an application data provider
Elastic should not be seen as data store, even if you storing data in it. It is about how you perceive elastic. Elastic should be used to store and setup data for the application. It is the application which decides how and when to use elastic (search and suggestions). Elastic is not a nosql storage alternative if compared to RDBMS, you should use a nosql database instead.
This perception puts elastic in line with redis and kafka. These tools are key components of an application design and they are used to serve as events stores, search engines and cache etc. to the applications.
Database with Elastic
Your design should use both. For storing the contacts use the database, index the contacts for querying. Also make the data available in elastic for searching, autocomplete and related matches.
As always, it depends on your specific use case. You briefly described it, but how are you acually going to use the data?
If it's just something simple like checking if a customer exists and then creating a new customer, then use the RDMS option. Moreover, if you don't expect a large dataset, so that scaling isn't an issue (hence the designation that Elasticsearch is for BigData), but you have transactions and data integrity is important, then a RDMS will be the right fit. Some examples could be for tax, leasing, or financial reporting systems.
However, if you have a large dataset, you need a wide range of query capabilities, such as a fuzzy search or searches where the user
can select multiple filters on the data or you want to do some predictive analysis on the data, then Elasticsearch is the clear choice.
For example, I worked on an web based app with a large customer base: 11 million, with 200+ hits per second at peak time for a find a doctor application. The customer could check some checkboxes to determine, specialty, spoken languages, ratings, hospitals, etc. all sorted by the distance from the users location with a 2 second or less response time. It would be very difficult for a RDMS to match that.
I want to store employee hierarchy in elastic search. where CFO, CTO, COO etc report to CEO. And each employee can have their own reportees.
I think above can be done using elastic search parent-child relationship. Can we write a query to get the all reportees(direct reportees and sub-reportees) in a single call.
For example if we query for CEO we should get all employees and for CFO we should get employees in finance dept.
Something similar exists in RDMS like SQL server's CTE.
Parent-child relations in ES is:
Parent knows nothing about children
Children must provide _parent to connect with it and to be routed accodringly.
Parent-child mapping is handled by ES via mapping in memory.
Parent/child documents is independent in any other aspect.
So, there is no easy way to do it (there's no way to actually store normal form of any relational data as well, because ES in non-relational DB). Workarounds about this:
query documents with has_parent/has_child queries (only 1 level of relation works for this)
store documents as nested objects (pay attention, that this model reindexes whole document if any of members changes)
denormalize data (most natural way for non-relational storages, IMO)
First and foremost, avoid thinking about ES in a relational database way. ES isn't so suited for joins/relations, though it can achieve similar effect via the parent/child relations. Don't even think about joins that might involve a undetermined number of depths. CTE can handle without much difficulty but not all relational databases support CTE AFAIK (MySQL being one).
The parent-child relations is more trouble than its worth IMMO. Child docs are routed to shards where their parents reside. In your case of a tree, all documents will eventually trace back to the root document, which will result all your documents to reside in a single shard. The depth of your tree could be quite large (more than 4 or 5 in a not-so-small organization). Also, if you go with this solution, it is quite inconvenient to retrieve (via the GET API) a particular child doc from ES based on its ID, because you have to specify its parent IDs all the way up to its root.
I think it's best to store the PATH from root up to but not including the current employee as a list of IDs. So each employee has a field like:
"superiors": [CEO_ID, CTO_ID, ... , HER_DIRECT_MANAGER_ID],
So it is completely denormalized and your application has to prepare for this list.
With this setup, to get all subordinates of an employee:
filtering out IDs in this employee's own superiors field plus her own ID, either using a filter agg or a filtered query.
do a terms agg on the superiors field and you will have all subordinates of this employee.
I must admit that at least two queries are needed. The first one is a GET request to retrieve the superiors field of this employee and then the second query to do what I described above.
Also, don't worry about the duplications due to denormalization. ES can handle way more data than you can save here.
I have a collection of OWL ontologies. Each ontology is stored in a dataset of a triple store database (e.g OWLIM, Stardog, AllegroGraph ). Now I need to develop an application which supposes searching these ontologies based on keywords, i.e., given a keyword, the application should return ontologies that contains this keyword.
I have checked OWLIM-SE and Stardag, they only provide full text search over one dataset but not the whole database. I also have considered Solr(Lucene). But in this case the ontologies will be indexed twice (once by Lucene, another one by triple store database.)
Is there any other solution for this problem?
Thanks in advance.
Stardog's full text indexing works over an entire database and can be done transparently with SPARQL which will allow you to easily access other properties of the concepts matching your search criteria in a single query. This will get you precisely what you're describing.
For some information on administering the search indexes, and Stardog in general, check out these docs
The overhead of adding indexes is well-documented, but I have not been able to find good information on when to use multiple indexes with regards to the various document types being indexed.
Here is a generic example to illustrate the question:
Say we have the following entities
Products (Name, ProductID, ProductCategoryID, List-of-Stores)
Product Categories (Name, ProductCategoryID)
Stores (Name, StoreID)
Should I dump these three different types of documents into a single index, each with the appropriate elasticsearch type?
I am having difficulty establishing where to the draw the line on one vs. multiple indexes.
What if we add an unrelated entity, "Webpages". Definitely a separate index?
A very interesting video explaining elasticsearch "Data Design Patterns" by Shay Banon:
http://vimeo.com/44716955
This exact question is answered at 13:40 where examining different data flows, by looking at the concepts of Type, Filter and Routing
Regards
I was recently modeling a ElasticSearch backend from scratch and from my point of view, the best option is putting all related documents types in the same index.
I read that some people had problems with too many concurrent indexes (1 index per type). It's better for performance and robustness to unify related types in the same index.
Besides, if the types are in the same index you can use "_parent" field to create hierarquical models that allow to you interesting features for search as "has_child" and "has_parent" and of course you have not to duplicate data in your model.