Why SOLR has a schema and ElasticSearch does not? - elasticsearch

We were comparing those search solutions and started to wonder why one needs a schema and the other does not. What are the tradeoffs? Is it because one is like SQL and the other is like NoSQL in the sense of schema configuration?

ES does have a schema, defined as templates and mappings. You don't have to use it, but in practice you will. A schema is actually a good thing, and if you notice a database claiming to be purely schemaless, there will be performance implications.
A schema is a tradeoff between ease of development and adoption on one side and performance on the other. It is easy to read from and write to a schemaless database, but it will be less performant, particularly for any non-trivial query.

Elasticsearch definitely has a schema. If you think it does not, try indexing a date into a field and then an int into the same field. Or even into different types with the same name (I think ES 2.0 disallows that now).
What Elasticsearch does is simplify auto-creation of a schema. That has tradeoffs, such as possibly incorrect type detection, fields that appear single-valued or multivalued in the result output depending on the number of elements they contain (they are always multivalued under the covers), and so on. Elasticsearch has some ways to work around that, mostly by defining some of the schema elements and an explicit schema mapping, as Oleksii wrote.
Solr also has a schemaless mode that closely matches the Elasticsearch mode, down to storing all JSON as a single field. When you enable it, you get both the benefits and the disadvantages that Elasticsearch has. Except that in Solr you can change things like the order of auto-type strategies and the mapping to field types. In Elasticsearch (1.x at least) it was hard coded. You can see a slightly dated comparison in my presentation from 2014.
As Slomo said, they both use Lucene underneath for storing and most of the search. So, the core engine approach cannot change.
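To make the "it definitely has a schema" point concrete, here is a toy model of dynamic mapping in Python. It is only a sketch under simplifying assumptions: `detect_type` and `ToyIndex` are invented names, the detection rules are far cruder than what Elasticsearch does, and real Elasticsearch also coerces some values, so the exact set of rejected documents differs. The idea it illustrates is just that the first value seen fixes the field type, and later values have to fit it.

```python
from datetime import datetime


def detect_type(value):
    """Guess a field type from a JSON value (toy rules, not Elasticsearch's)."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "text"
    raise TypeError(f"unsupported value: {value!r}")


class ToyIndex:
    """Remembers the type each field was first seen with."""

    def __init__(self):
        self.mapping = {}  # field name -> detected type
        self.docs = []

    def index(self, doc):
        # Validate the whole document before changing any state.
        new_fields = {}
        for field, value in doc.items():
            seen = detect_type(value)
            expected = self.mapping.get(field, seen)
            if seen != expected:
                raise ValueError(
                    f"field {field!r} is mapped as {expected}, got {seen}"
                )
            new_fields[field] = expected
        self.mapping.update(new_fields)
        self.docs.append(doc)


idx = ToyIndex()
idx.index({"created": "2015-10-01"})  # the first document fixes the type
try:
    idx.index({"created": "sometime last week"})  # no longer fits the field
except ValueError as err:
    print(err)
```

So even without declaring anything up front, the index ends up with a schema. You just did not get to choose it.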

Related

Is there any performance benefit to creating an index mapping for Elasticsearch

I was wondering, for people who have used Elasticsearch at scale, whether there is a performance benefit while searching if I create an index mapping and then put documents in it, compared to not creating a mapping and just putting documents in directly.
It is usually preferable to create the explicit mapping for an index, where possible.
For a search case, this is crucial in order to index data with the analysis chains needed to service the search strategy.
For a log use case, it may not be possible to know what the explicit mapping should be for log records that will be ingested, as there may be dynamic fields in the data that are not known ahead of time. Dynamic templates can help here, as can adopting a unified logging structure like Elastic Common Schema (ECS), either converting data to ECS format whilst logging, or converting whilst ingesting into Elasticsearch with ingest pipelines.
Yes, it is always better to define an explicit mapping before putting the documents in, rather than depending on dynamic mapping. If you depend on dynamic mapping, you may not be able to visualize on a few data types like text. Also, when you maintain a mapping, your index will always have the same kind of data. Please refer to this blog:
[https://qbox.io/blog/maximize-guide-elasticsearch-indexing-performance-part-1/][1]
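For reference, an explicit mapping combined with a dynamic template might look like the sketch below, written as a Python dict that would be sent as the body of a create-index request. The field names (`@timestamp`, `message`, `status_code`) and the template name `strings_as_keywords` are made up for illustration, and the layout assumes the typeless mapping format of Elasticsearch 7 and later.

```python
import json


def build_index_body():
    """Return a create-index body: known fields are explicit, the rest follow a template."""
    return {
        "mappings": {
            # Any string field not listed under "properties" becomes a keyword
            # instead of the default text-plus-keyword pair.
            "dynamic_templates": [
                {
                    "strings_as_keywords": {
                        "match_mapping_type": "string",
                        "mapping": {"type": "keyword"},
                    }
                }
            ],
            # Fields known ahead of time get an explicit type.
            "properties": {
                "@timestamp": {"type": "date"},
                "message": {"type": "text"},
                "status_code": {"type": "short"},
            },
        }
    }


print(json.dumps(build_index_body(), indent=2))
```

The explicit part covers the search case (you pick the type and analysis per field), and the template covers the log case where some fields only show up at ingest time.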

Why do mappings exist in Elasticsearch?

From what I read, Elasticsearch is dropping support for types.
So, as the examples say indexes are similar to databases and documents are similar to rows of a relational database.
So now, everything is a top-level document right?
Then what is the need for a mapping, if we can store all sorts of documents in an index with whatever schema we want it to have.
I want to understand if my concepts are incorrect anywhere.
Elasticsearch is not dropping support for mapping types, they are dropping support for multiple mapping types within a single index. That's a slight, yet very important, difference.
Having a proper index mapping in ES is as important as having a proper schema in any RDBMS, i.e. the main idea is to clearly define the type of each field and how you want your data to be analyzed, sliced and diced, etc.
Without an explicit mapping, it wouldn't be possible to do all of the above (and much more). ES would guess the type of your fields, and even though it gets it right most of the time, there are plenty of times where the guess is not exactly what you want or need.
For instance, some people store floating point values in string fields (see below). ES would detect that field as being text/keyword even though you want it to be double.
{
"myRatio": "0.3526472"
}
This is just one reason out of many why it is important to define your own mapping and not rely on ES guessing it for you.
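A small Python sketch of the `myRatio` case: the value arrives as a JSON string, so anything that guesses from the JSON type sees a string. The mapping dict and the `normalize` helper are illustrative assumptions, not part of any client library; they just show what declaring the field as `double` expresses.

```python
import json

raw = '{"myRatio": "0.3526472"}'
doc = json.loads(raw)
print(type(doc["myRatio"]).__name__)  # the JSON value is a string, not a number

# Explicit mapping that declares the intended type for the field.
explicit_mapping = {"mappings": {"properties": {"myRatio": {"type": "double"}}}}


def normalize(document, mapping=explicit_mapping):
    """Convert fields mapped as double into floats before sending (client-side sketch)."""
    props = mapping["mappings"]["properties"]
    out = dict(document)
    for field, spec in props.items():
        if spec["type"] == "double" and field in out:
            out[field] = float(out[field])
    return out


print(normalize(doc))
```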

What are the advantages of mapping a field to a type in Elasticsearch?

I have about 10 million very flat (like an RDBMS row) documents stored in ES. There are say 10 fields to each document, and 5 of the fields are actually enumerations.
I have created a mapping that maps the Enum's ordinal to a Short, and pass the ordinal in when I index the document.
Does Elasticsearch actually store these values as a Short in its index? Or do they get .toString()'ed? What is actually happening "under the hood" when I map a field to a data type?
Since ES is built on top of Lucene, that is the place to look to see how fields are actually stored and used "under the hood".
As far as I understand, Lucene does in fact store data in more than just String format. So to answer one of your questions, I believe the answer is no - everything does not get .toString()'ed. In fact, if you look at the documentation for Lucene's document package, you'll see it has many numeric types (e.g. IntField, LongField, etc).
The Elasticsearch documentation on Core Types also alludes to this fact:
"It uses specific constructs within Lucene in order to support numeric
values. The number types have the same ranges as corresponding Java
types."
Furthermore, Lucene offers queries (which ES takes advantage of) designed specifically for searching fields with known numeric terms, such as the NumericRangeQuery which is discussed in Lucene's search package. The same numeric types in Lucene allow for efficient sorting as well.
One other benefit is data integrity. Just like any database, if you only expect a field to contain numeric data and your application attempts to insert non-numeric data, in most cases you would want that insert to fail. This is the default behavior of ES when you try to index a document whose field values do not match the type mapping. (Though, you can disable this behavior on numeric fields using ignore_malformed, if you wish)
Hope this helps...
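To illustrate the data integrity point, here is a toy validator for fields mapped as `short`. It is only a sketch: `check_short` and `index_fields` are hypothetical helpers, the range is the Java `short` range mentioned in the quote, and the `ignore_malformed` flag here merely imitates the idea of dropping a bad field while keeping the rest of the document. Real Elasticsearch is more lenient in places (for example, it can coerce numeric strings).

```python
SHORT_MIN, SHORT_MAX = -2**15, 2**15 - 1  # same range as a Java short


def check_short(value):
    """Return value if it is an integer that fits in a short, else raise ValueError."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"not an integer: {value!r}")
    if not SHORT_MIN <= value <= SHORT_MAX:
        raise ValueError(f"out of range for short: {value}")
    return value


def index_fields(doc, short_fields, ignore_malformed=False):
    """Return the fields that would be indexed for doc under this toy mapping."""
    indexed = {}
    for field, value in doc.items():
        if field not in short_fields:
            indexed[field] = value
            continue
        try:
            indexed[field] = check_short(value)
        except ValueError:
            if not ignore_malformed:
                raise  # default: the whole document is rejected
            # ignore_malformed: skip only the bad field
    return indexed


print(index_fields({"color": 3, "name": "widget"}, {"color"}))
print(index_fields({"color": "red", "name": "widget"}, {"color"}, ignore_malformed=True))
```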

Elasticsearch vs Cassandra vs Elasticsearch with Cassandra

I am learning NoSQL and looking at different options for one of my client's requirements. I have gone through various resources before putting up this question (a person with little knowledge in NoSQL)
I need to store data at faster rate and read data.
Fully fail-safe and easily scalable.
Able to search through data for Analytics.
I ended up with a short list of: Cassandra and Elasticsearch
What I do understand is that Cassandra is a perfect NoSQL storage solution for me, as I can write data and read data using indexes. Where it fails, or could fail, is on analytics. In the future I may want to get data from from_date to to_date, or query it in other ways for analytics, and that will be hard if I don't design the data model properly with a long-term view, which is difficult in an ever-changing world.
Elasticsearch, on the other hand, is best at indexing (backed by Lucene) and can search the data by arbitrary text. But does it work the same if I want to retrieve data from from_date to to_date (I expect it might)? The real question is: is it a search engine, or a perfect NoSQL data store like Cassandra? If it is, why do we still need Cassandra?
If both of these are in different world, please explain that! How do we combine them to get a more effective solution?
One of our applications uses data that is stored into both Cassandra and ElasticSearch. We use Cassandra to access those records whenever we can, and have data duplicated into query tables designed to adhere to specific application-side requests. For a more liberal search than our query tables can allow, ElasticSearch performs that functionality nicely.
We have asked that same question (of ourselves)..."Why don't we just get everything from ElasticSearch?"
The answer is that ElasticSearch was designed to be a search engine, and not a persistent data store. Sometimes ElasticSearch loses writes. Schema changes are difficult to do in ElasticSearch without blowing everything away and reloading. For that purpose, I have written jobs that are designed to keep ElasticSearch in-sync with our Cassandra cluster. There was also a fairly recent discussion on Quora about this topic, that yielded similar points.
That being said, ElasticSearch works great as a search engine. And Cassandra works great as a scalable, high-performance datastore. But querying data is different from searching for data. There are times that we need one or the other, and a combination of the two works well for our application. It may (or it may not) work well for yours.
As for analytics, I have had some success in using the Cassandra Spark connector, to serve more complex OLAP queries.
Edit 20200421
I've written a newer answer to a similar question:
ElasticSearch vs. ElasticSearch+Cassandra
Cassandra + Lucene is a great option. There are different initiatives for this issue, for example:
Stratio’s Cassandra Lucene Index - Derived from Stratio Cassandra, is a plugin for Apache Cassandra that extends its index functionality. (https://github.com/Stratio/cassandra-lucene-index)
Stratio Cassandra, it's a native integration with Apache Lucene, it is very interesting. (https://github.com/Stratio/stratio-cassandra) - THIS PROJECT HAS BEEN DISCONTINUED IN FAVOUR OF Stratio’s Cassandra Lucene Index
Tuplejump Calliope, it's like Stratio Cassandra, but it's less active. (https://github.com/tuplejump/stargate-core)
DSE Search by Datastax. It allows using Cassandra with Apache Solr, but it's a proprietary option.(http://www.datastax.com/what-we-offer/products-services/datastax-enterprise)
After working on this problem myself, I have realized that NoSQL databases like Cassandra are good when you want to make sure you preserve your data schema with reliable write operations and don't need the indexing operations that Elasticsearch offers. If you want to keep indexed data, then Elasticsearch is good, provided you trust your schema and are going to do far more reads than writes.
My case was data analytics. So I preserved a lot of my Latices in Elasticsearch, since later I wanted to traverse the data a lot to decide what my next step should be. I would have used Cassandra if I expected a lot of changes in the schema of the data in my analytics pipelines.
Also, there are many nice visualization tools like Kibana that you can use to present your data with some good graphics. Maybe I am lazy, but they are very good looking and they helped me.
Storing data in a combination of Cassandra and ElasticSearch gives you most functionality. It allows you to lookup key-value tables, and also allows you to search data in indexes.
The combination gives you a lot of flexibility, ideal for your application.
Elassandra is a combined solution of Cassandra and Elasticsearch. It uses Elasticsearch to index the data and Cassandra as the data store. I'm not sure about the performance, but as per this article, its performance is good.
If your application needs a search feature, then Elassandra is the best open source option. DSE Search is available, but it's expensive.
We had developed an application where we used Elasticsearch and Cassandra.
Similar data was stored into Cassandra and indexed into Elasticsearch.
Our application's UI was having features like searches, aggregations, data export, etc.
The back-end microservices were continuously getting huge data (on Kafka topics) and storing it into Cassandra. Once the data is stored into Cassandra, the services would make sure the data is indexed into Elasticsearch.
Cassandra was acting as "Source of truth" for Elasticsearch. In the cases, where reindexing of the ES index was required, we queried Cassandra and reindexed the data into ES.
This solution helped us, as this was very easy to scale and the searches and aggregations were much faster.
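The "source of truth plus search index" arrangement described above can be sketched in a few lines of Python. Everything here is an in-memory stand-in: `RecordStore` plays the role of Cassandra, `SearchIndex` plays the role of Elasticsearch, and `ingest` and `rebuild` are hypothetical names for the write path and the reindex job. No real client API is shown.

```python
class RecordStore:
    """Stand-in for the primary store: authoritative rows keyed by id."""

    def __init__(self):
        self.rows = {}

    def put(self, key, row):
        self.rows[key] = dict(row)

    def scan(self):
        return list(self.rows.items())


class SearchIndex:
    """Stand-in for the search engine: token -> set of row keys."""

    def __init__(self):
        self.postings = {}

    def add(self, key, row):
        # Note: this toy never removes stale tokens; a rebuild takes care of that.
        for value in row.values():
            for token in str(value).lower().split():
                self.postings.setdefault(token, set()).add(key)

    def search(self, token):
        return sorted(self.postings.get(token.lower(), set()))


def ingest(store, index, key, row):
    """Write to the source of truth first, then make the row searchable."""
    store.put(key, row)
    index.add(key, row)


def rebuild(store):
    """Recreate the index from scratch using only the source of truth."""
    fresh = SearchIndex()
    for key, row in store.scan():
        fresh.add(key, row)
    return fresh


store, index = RecordStore(), SearchIndex()
ingest(store, index, "e1", {"topic": "kafka", "msg": "disk full"})
ingest(store, index, "e2", {"topic": "billing", "msg": "invoice sent"})
print(index.search("kafka"))
index = rebuild(store)  # e.g. after a mapping change or a lost index
print(index.search("invoice"))
```

The design choice is that the index is always disposable: if it is lost or its mapping has to change, it can be regenerated from the store.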
Cassandra is great at retrieving data by ID. I don't know much about secondary index performance, but I doubt it's as fast as Elasticsearch. Certainly Elasticsearch wins when it comes to full text search functionality (text analysis, relevancy scoring, etc).
Cassandra wins on update performance, too. Elasticsearch supports updates, but an update is really a reindex + soft delete in an atomic operation.
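As a rough picture of what "reindex + soft delete" means, here is a toy append-only segment in Python. `AppendOnlySegment` is an invented name and the structure is heavily simplified compared to Lucene, where segments are immutable and deleted documents are only purged when segments are merged. The sketch shows why an update costs a full write of the new version plus bookkeeping for the old one.

```python
class AppendOnlySegment:
    """Toy store where documents are never modified in place."""

    def __init__(self):
        self.entries = []     # (doc_id, source) tuples, append-only
        self.deleted = set()  # positions of superseded versions (soft deletes)
        self.live = {}        # doc_id -> position of the current version

    def index(self, doc_id, source):
        old = self.live.get(doc_id)
        if old is not None:
            self.deleted.add(old)  # soft delete: mark, do not remove
        self.entries.append((doc_id, dict(source)))  # write the whole new version
        self.live[doc_id] = len(self.entries) - 1

    def get(self, doc_id):
        pos = self.live.get(doc_id)
        return None if pos is None else self.entries[pos][1]

    def merge(self):
        """Drop soft-deleted versions, the way a segment merge reclaims space."""
        kept = [e for i, e in enumerate(self.entries) if i not in self.deleted]
        self.entries = kept
        self.deleted = set()
        self.live = {doc_id: i for i, (doc_id, _) in enumerate(kept)}


seg = AppendOnlySegment()
seg.index("1", {"status": "new"})
seg.index("1", {"status": "shipped"})  # an "update"
print(len(seg.entries), seg.get("1"))
seg.merge()
print(len(seg.entries), seg.get("1"))
```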
Cassandra has a very nice replication model (if you need to be extra-fail-safe). Elasticsearch is OK, too, I'm not in the camp that says ES is particularly unreliable (it has issues sometimes, like all software).
Elasticsearch also has aggregations for real-time analytics. And because searches are so fast, analytics on a subset of data will be fast, too.
If your requirements are satisfied well enough by one of them (like here it seems like ES would work well), I would just use one. If you have requirements from both worlds, then you can either:
use one of them and work around the downsides. For example, you may be able to handle many updates with Elasticsearch, but with more shards and more hardware
use both and make sure they're in sync
Elasticsearch is built on the Lucene index, so if you keep your index in Elasticsearch it performs better for retrieving data than indexing in Cassandra itself.
If your requirements are not related to real-time retrieval, then you can use Elasticsearch as a NoSQL database as well. There are concerns that Elasticsearch loses writes and that schema changes are difficult, but if your volume of data is not too big, you can easily use Elasticsearch as a search engine with the best indexing and as a NoSQL database at the same time. There are several ways to prevent those problems. I have worked on schema changes in Elasticsearch: if your data structure is consistent, they will not create any issues.
I am a supporter of both Elasticsearch and Solr. I have worked with both search engines, and in my experience both can be used smoothly if you configure them correctly.
The only con I can think of is when you are targeting real-time results and can't compromise on a milliseconds delay in your response. Then it's better to use another NoSQL database like Cassandra or Couchbase.
Cassandra with Solr works better than Cassandra with Elasticsearch.

Internal data storage mechanism of elasticsearch

I have been working with Elasticsearch for the past 2 months. I have used both the REST approach and API support in different languages to index, get and search data. I also read a lot about Elasticsearch and found out it is not a good option to use it as a data store. Why is this? I'm also curious about how Elasticsearch internally stores the indexed data. Any good link or explanation?
Elasticsearch is built on top of Apache Lucene. Here's a reference doc on the Lucene index file structure:
http://lucene.apache.org/core/4_7_2/core/org/apache/lucene/codecs/lucene46/package-summary.html#package_description
Regarding whether or not it's a good option as a data store I think that's more individual opinion and specific use cases than a fact that can be proved. It does not have the transaction support that something like MySQL does if that's what you are looking for. In that case it's somewhat on a par with other NoSQL solutions. This is a pretty decent writeup on the trade-offs and issues: https://www.found.no/foundation/elasticsearch-as-nosql/
In the end it depends on what you are doing with your data and what level of robustness you require.
