In Elasticsearch, should I create a single index with multiple types or multiple indices with a single type? - elasticsearch

I am new to Elasticsearch and am using it for big data.
My application has no join queries, so which structure is best for it?

I have been working with Elasticsearch for the past few days and would like to share what I have learned.
1) If you are moving from a relational DB such as MySQL or another SQL database to ES, you need to maintain the relations among your data yourself. Declare the primary key in the different types or indexes so that you can run Query DSL queries on it.
2) If you are dealing with millions of records every day, you need to design accordingly. Some people prefer a duration-based structure, such as one index per day, week, or month. It depends entirely on your use case. For a large data set (~1 TB) you need to distribute your data across multiple indexes and shards.
3) If you have a small data set, the default settings (5 shards, 1 replica) will work too. Performance is better when the data set in each shard is small.
4) JOIN queries can be expensive in Elasticsearch, and running them frequently can put pressure on your heap. So I would suggest preparing pre-cooked data (the result you would get from a join query in a relational DB) and indexing each document with a unique ID. You can refer to this. Check here to see how JOINs can be performed.
5) Here are some points to take care of while designing your index:
Don't treat Elasticsearch like a database
Know your use case BEFORE you jump in
Organize your data wisely
Make smart use of replicas
Base your capacity plans on experiment
6) A wrong architecture can force a reindex, which is costly and can involve downtime. Check out this article to learn about index design and best practices.
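Point 4 above (pre-joining data before indexing) can be sketched roughly as follows. This is a minimal illustration in plain Python, not Elasticsearch client code; the `users` and `orders` records and the `build_documents` helper are hypothetical names invented for the example.

```python
# Hypothetical relational rows: names and fields are assumptions for illustration.
users = {
    1: {"name": "alice", "country": "IN"},
    2: {"name": "bob", "country": "US"},
}
orders = [
    {"order_id": 100, "user_id": 1, "total": 25.0},
    {"order_id": 101, "user_id": 2, "total": 40.0},
    {"order_id": 102, "user_id": 1, "total": 10.0},
]


def build_documents(orders, users):
    """Pre-join each order with its user so no join is needed at query time."""
    docs = {}
    for order in orders:
        user = users[order["user_id"]]
        # The order's primary key doubles as the unique document ID.
        docs[order["order_id"]] = {
            "total": order["total"],
            "user_name": user["name"],
            "user_country": user["country"],
        }
    return docs


documents = build_documents(orders, users)
```

Each entry in `documents` already carries the user fields, so a search on `user_country` needs no second lookup. The trade-off is duplicated user data that has to be re-indexed when a user changes.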

Related

Do I need to split order data into multiple time based index in Elasticsearch?

I am planning to use Elasticsearch to store user order data. There could be 20 million orders per year in my system, which would probably take about 10GB.
My question is whether I should create one index to hold all the order data. I have read in the ES docs that it is best to keep about 20GB of data in one primary shard. If I create one index with 5 primary shards, does that mean I can safely store 100GB (200 million orders) in this index?
Another approach is to create an index per year, for example order-2020, order-2021, order-2022, etc., with fewer primary shards for each index. I understand this pattern helps if I want to add a retention period to my order data. But apart from that, what other benefits does this pattern have?
From query performance perspective, which approach is better?
In terms of search speed and aggregation accuracy, querying across multiple indices and shards inevitably incurs some loss. For data health, however, splitting the data by year is recommended. You can use an alias to tie the indices together, and the loss in query performance is much smaller than the loss in aggregation.
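As a rough sketch of the per-year pattern, the application only needs to derive the index name from the order date when writing, and the list of yearly indices (or a single alias covering them) when reading. The helper names `index_for` and `indices_for_range` are hypothetical; the `order-YYYY` naming follows the question.

```python
from datetime import date


def index_for(order_date, prefix="order"):
    """Return the yearly index an order should be written to, e.g. order-2021."""
    return f"{prefix}-{order_date.year}"


def indices_for_range(start, end, prefix="order"):
    """List the yearly indices a query over [start, end] has to touch."""
    return [f"{prefix}-{year}" for year in range(start.year, end.year + 1)]


write_index = index_for(date(2021, 3, 14))
read_indices = indices_for_range(date(2020, 11, 1), date(2022, 2, 1))
```

A query restricted to one year touches only that year's index, and retention becomes a matter of dropping a whole index.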

cassandra vs elastic search vs any other design suggestions

We need to run analytics queries on data stored in RDS, and that is becoming very slow because of GROUP BY queries and the ever-increasing size of the tables.
For example, we have the following 3 tables in RDS:
alm(id,name,cli, group_id, con_id ...)
group(id, type,timestamp ...)
con(id,ip,port ...)
Each of the tables holds a very large amount of data and is updated several times a minute as new data comes in.
Now we want to run aggregation queries like:
select name from alm, group, con where alm.group_id=group.id and alm.con_id=con.id group by name, group.type, con.ip
We also want users to be able to run custom aggregation queries in the future, as opposed to the fixed query provided by us.
So far the options we are considering are moving to Cassandra, Elasticsearch or DynamoDB so that aggregation would be faster. Can someone guide us on how to approach this problem, or share any crumbs of experience? Does anybody know of technologies that have a significant advantage over the others?
Cassandra and DynamoDB are quite different from ElasticSearch. And all three are very different from relational database offerings.
For ad-hoc analytics, relational databases, with a well designed schema, can be pretty good up to the point where you need to split your data across multiple servers (then replication issues start to dominate the benefits). And that's really the primary motivation for non-relational databases. But the catch is that in order to solve the horizontal scaling problem, they generally trade some features such as joining and aggregating.
Elasticsearch is really great at answering search queries, but not particularly good at aggregations (other than very basic counts, sums and their estimates). It's amazing at indexing copious amounts of data, but it can't answer queries that require complex cross-index operations. It is also not as robust (rebuilding indexes may be needed from time to time).
If you have high volumes of data and you need aggregation, you pretty much have two options:
if you can get away with offline analytics, then distributed data processing frameworks such as Spark can get you the answers you need very efficiently
if you need online analytics, the most common approach is to pre-compute the aggregations and update as you get more data, so that answers to queries can be very fast without having to process a lot of data for each query
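The pre-computation approach in the second option can be sketched as a running aggregate that is updated on every write instead of being recomputed per query. This is a plain-Python illustration under assumed names (`RunningAggregate`, the sample keys and values), not a specific product's API.

```python
from collections import defaultdict


class RunningAggregate:
    """Keeps per-key count and sum up to date as events arrive."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def add(self, key, value):
        # Update the stored aggregate instead of rescanning raw data later.
        self.count[key] += 1
        self.total[key] += value

    def mean(self, key):
        return self.total[key] / self.count[key]


agg = RunningAggregate()
events = [
    (("alarm", "10.0.0.1"), 2.0),
    (("alarm", "10.0.0.1"), 4.0),
    (("info", "10.0.0.2"), 1.0),
]
for key, value in events:
    agg.add(key, value)
```

Reading `agg.count[key]` or `agg.mean(key)` is then a dictionary lookup, whatever the size of the raw data. The cost is that every grouping you want to query has to be decided in advance.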
Don't be afraid to mix and match, though. Relational databases have their purpose, as do non-relational ones. There is no silver bullet.
One more option is a column-oriented database. This kind of DB is better suited to analytics cases where you have many data fields and want to perform aggregations or extract a subset of fields from a large amount of data.
Yandex ClickHouse has recently become very popular, and Amazon offers a column-oriented service, Redshift. There are several other solutions as well.
Store the data in Parquet, use Spark, and partition efficiently.

MongoDB Index definition strategy

I have a MongoDB database with roughly 100K to 500K text documents, and the collection keeps growing. The system should support queries on different fields of the documents, e.g. title, category, importance etc.
The system is near real-time and receives new documents every 5-10 minutes.
Is it a good idea, in order to boost query performance, to define a separate index for each frequently queried field (field types: small text, numeric, date) of the document? Or are there other best practices for boosting query performance in MongoDB?
You should create indexes based on the results you are trying to find.
It's a very good idea to have different indexes for the different fields you query at different times.
But keep in mind that indexes occupy RAM: the more indexes you create, the more RAM they use. Also consider the order of the fields in an index to get better search performance.
When developing your indexing strategy you should have a deep understanding of your application’s queries. Before you build indexes, map out the types of queries you will run so that you can build indexes that reference those fields. Indexes come with a performance cost, but are more than worth the cost for frequent queries on large data set. Consider the relative frequency of each query in the application and whether the query justifies an index.
The best overall strategy for designing indexes is to profile a variety of index configurations with data sets similar to the ones you’ll be running in production to see which configurations perform best. Inspect the current indexes created for your collections to ensure they are supporting your current and planned queries. If an index is no longer used, drop the index.
Some strategies to choose from when creating indexes:
Create Indexes to Support Your Queries
An index supports a query when the index contains all the fields scanned by the query. Creating indexes that support queries results in greatly increased query performance.
Use Indexes to Sort Query Results
To support efficient queries, use the strategies here when you specify the sequential order and sort order of index fields.
Ensure Indexes Fit in RAM
When your index fits in RAM, the system can avoid reading the index from disk and you get the fastest processing.
Create Queries that Ensure Selectivity
Selectivity is the ability of a query to narrow results using the index. Selectivity allows MongoDB to use the index for a larger portion of the work associated with fulfilling the query.
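The advice to map out your queries and weigh their relative frequency before building indexes can be sketched in plain Python. The `query_log` data and both helpers are hypothetical, and `index_supports` is a deliberately simplified reading of "the index contains all the fields scanned by the query": it ignores field order and everything else a real query planner considers.

```python
from collections import Counter

# Hypothetical query log: each entry lists the fields one query filtered on.
query_log = [
    ("title",),
    ("category", "importance"),
    ("category",),
    ("title",),
    ("category", "importance"),
]


def field_frequencies(log):
    """Count how many logged queries filter on each field."""
    counts = Counter()
    for fields in log:
        counts.update(set(fields))
    return counts


def index_supports(index_fields, query_fields):
    """Simplified check: the index contains every field the query scans."""
    return set(query_fields) <= set(index_fields)


frequencies = field_frequencies(query_log)
```

Fields that appear in many queries are the first candidates for an index. Rarely queried fields may not justify the RAM and write cost.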

Data Store with multiple shard keys

I've been researching several different data store technologies that can be used for storing huge amounts of semi-structured logs for people to search through later. I've looked at cassandra, riak, and elastic search so far, and it seems like elastic search offers the closest fit for what I'm interested in (largely because it indexes everything transparently). However, there's one feature that I'm interested in that seems to escape them all, and I was wondering if there is a data store with this feature.
What I'm thinking about is the ability to transparently shard on multiple keys. To be clear, I'm not talking about using a composite key for sharding. I mean that if you had a table that was sharded by user_id, time_of_creation, and ip_address, and you inserted a row, three copies of that row would be created, each one in a different cluster that's sharded by a different key (or maybe they could all somehow actually be in the same cluster; the important part is that the data would be duplicated). And when you wanted to query this table later, the data store would transparently choose which cluster to use.
In the articles I've read about Cassandra, people often recommend doing something like this, but it's definitely a manual process in at least three ways:
For insertion, you have to insert into each table yourself.
When it comes to querying, you have to figure out which table you want to query (you need to pick the one that uses the right cluster key).
And, if you ever want to add another key to shard on, you have to write a routine to copy the existing data into the new table.
Although I was using cassandra as an example, I believe that the situation with riak and elastic search is similar. I understand that a data store that offered this ability would probably have to make huge trades to do so. Updating/deleting might no longer be possible (or it would have extremely poor performance), and consistency would suffer. But, it's a set of trades that I find acceptable when dealing with logs, so I was wondering if anyone is familiar with a technology that offers this feature.
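To make the three manual steps concrete, here is a toy in-memory sketch of what an application-side wrapper would have to do. It is not Cassandra, Riak or Elasticsearch code; `MultiKeyStore`, its methods and the sample rows are hypothetical names for illustration only.

```python
from collections import defaultdict

SHARD_KEYS = ("user_id", "ip_address")  # hypothetical keys for the example


class MultiKeyStore:
    """Stores one copy of every row per shard key and routes queries by key."""

    def __init__(self, keys):
        self.tables = {key: defaultdict(list) for key in keys}

    def insert(self, row):
        # Manual step 1: write the row into every keyed table.
        for key, table in self.tables.items():
            table[row[key]].append(row)

    def query(self, key, value):
        # Manual step 2: pick the table organised by the requested key.
        return list(self.tables[key].get(value, []))

    def add_key(self, new_key):
        # Manual step 3: backfill a new table from the existing rows.
        source = next(iter(self.tables.values()))
        table = defaultdict(list)
        for rows in source.values():
            for row in rows:
                table[row[new_key]].append(row)
        self.tables[new_key] = table


store = MultiKeyStore(SHARD_KEYS)
store.insert({"user_id": 1, "ip_address": "10.0.0.1", "level": "warn"})
store.insert({"user_id": 2, "ip_address": "10.0.0.1", "level": "info"})
store.add_key("level")
```

The sketch also shows why updates and deletes get expensive: every change has to be applied to each copy, and the copies can drift apart if one write fails.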

How to handle data growth in each table within the same index in Elasticsearch

An index was created for multiple tables from a database. Over time, the record counts of a couple of tables grew, and a couple of tables now have millions of records, although initially they had only hundreds. So as the record counts increase, how can I get better performance?
1) Do we need to move the table from the old index to a newly created index,
or
2) increase the nodes and shards of the existing index to get better performance?
I am looking for the better solution. Please let me know if my requirement is not clear.
Could anybody answer, please?
It sounds like perhaps you should consider using timeseries-based indices, so you would create an index for every day or month (or whatever time period you wanted) and then you could use a tool like curator to manage them. This way you have more flexibility with what to do with older indices, like closing them, deleting them, or force-merging them using the optimize API.
If you already have performance issues, it's going to be harder. Since the number of shards is fixed at index creation time, you will have to reindex the data to a new index with more shards.
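As a hedged sketch, the two request bodies involved in such a migration might look like the following, assuming an Elasticsearch version that provides the `_reindex` endpoint. The index names and the shard count are hypothetical, and the code only builds and prints the JSON; it does not contact a cluster.

```python
import json

OLD_INDEX = "records-v1"  # hypothetical index names
NEW_INDEX = "records-v2"

# Body for creating the new index (PUT /records-v2) with more primary shards.
create_body = {"settings": {"number_of_shards": 10, "number_of_replicas": 1}}

# Body for copying the documents across (POST /_reindex).
reindex_body = {"source": {"index": OLD_INDEX}, "dest": {"index": NEW_INDEX}}

print(json.dumps(create_body))
print(json.dumps(reindex_body))
```

If the application reads through an alias, the alias can be moved to the new index once the copy completes.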
If you have tables that grow indefinitely (e.g. logs), plan ahead with time-based indexes. If your data is not time-based, you can use the same trick. Use templates to automatically create indexes and aliases so you can query them as if they were one index.
There is no golden rule here, but once you know, for example, how your index scales for your use case at 1M records, you can do some automatic indexing by id from your primary storage (DB). All you have to do when indexing is pick the right index to write to (since you can't use the alias for indexing); querying is transparent through the alias. This is a minor change for most apps.
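Picking the write index by id can be as small as the helper below. It is a sketch under assumptions: `RECORDS_PER_INDEX` stands in for whatever capacity your own experiments suggest, and the `records-NNNN` naming is invented for the example.

```python
RECORDS_PER_INDEX = 1_000_000  # assumed capacity found by experiment


def index_for_id(record_id, prefix="records"):
    """Map a numeric primary-key id to the index that should hold it."""
    bucket = record_id // RECORDS_PER_INDEX
    return f"{prefix}-{bucket:04d}"


first = index_for_id(0)
later = index_for_id(2_500_000)
```

Writes go to the computed index, while reads can keep using one alias that covers every `records-*` index.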
