elasticsearch vs hbase/hadoop for realtime statistics

I'm logging millions of small log documents weekly, and I need to:
run ad hoc queries for data mining
join, compare, filter and calculate values
run many, many full-text searches with Python
run these operations over all of the millions of docs, sometimes every day
My first thought was to put all the docs in HBase/HDFS and run Hadoop jobs to generate the stats results.
The problem is that some of the results must be near real-time.
So, after some research, I discovered Elasticsearch, and now I'm thinking about transferring all those millions of documents and using DSL queries to generate the stats results.
Is this a good idea? Elasticsearch seems to make it easy to work with millions/billions of documents.

For real-time search analytics, Elasticsearch is a good choice.
It is definitely easier to set up and operate than Hadoop/HBase/HDFS.
A good Elasticsearch vs HBase comparison: http://db-engines.com/en/system/Elasticsearch%3BHBase
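To give a feel for the DSL queries mentioned in the question, here is a minimal sketch of a near-real-time stats query with the official Python client. The index name ("logs"), the field names ("timestamp", "level") and the local cluster URL are assumptions for illustration, not from the original post, and a recent (8.x-style) client is assumed.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")   # hypothetical local cluster

    # size=0: we only want aggregation results, not the matching documents themselves
    resp = es.search(
        index="logs",
        size=0,
        query={"range": {"timestamp": {"gte": "now-1d/d"}}},    # restrict to the last day
        aggs={"docs_per_level": {"terms": {"field": "level"}}},  # count docs per log level
    )

    for bucket in resp["aggregations"]["docs_per_level"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])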

Related

Is Elastic Search a good data store for a Read Only Api?

We are planning to create a reporting database exposed via a read-only API. It'll contain reporting-related read APIs for both our customers and internal processes like invoicing.
We also thought it would be useful to have Kibana on top of it, to provide analytics for our internal teams.
Is Elasticsearch good for this use case?
Yeah, why not: Elasticsearch will be a very good choice for your use case, for the following reasons:
You can de-normalize your data and store it in a single index, which makes fetching and searching very fast; this is the typical NoSQL pattern, and ES works well that way (see the sketch after this list).
Basic X-Pack security is available for free in ES, which would provide read-only access for your users without much effort or cost.
Apart from search, Elasticsearch is also very popular for analytics use cases: you can easily run aggregations and visualise them in Kibana dashboards, which integrate nicely with ES since both are products of the same company (Elastic).
And most importantly, ES is a horizontally scalable, distributed system that can easily be scaled to hundreds of nodes to support anyone's growing needs.
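To make the de-normalization point concrete, here is a rough sketch of what a single denormalized reporting document could look like when indexed with the Python client. The index name, field names and customer structure are made up for illustration, and a recent client version is assumed.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")   # hypothetical cluster

    # Everything the reporting API needs lives on one document,
    # so no join is required at query time.
    doc = {
        "invoice_id": "INV-1001",
        "amount": 250.0,
        "status": "paid",
        "created_at": "2021-06-01T10:15:00Z",
        "customer": {"id": "C-42", "name": "Acme Corp", "country": "DE"},
    }

    es.index(index="reporting", id=doc["invoice_id"], document=doc)

The trade-off is that customer details are duplicated across documents and have to be re-indexed when they change.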
In addition to opster's answer, there are 2 things I want to mention that might help you make a decision:
How E.S is serving us for a real-time reporting use case in production with an extensive data set
Performance of reporting in E.S vs Mongo (that we measured)
How E.S is serving us for a real-time reporting use case in production with an extensive data set
E.S provides real-time results (under 1 sec) for the following cases of ours:
Reports generated by running multiple sets of filters (date, etc.) and aggregations on millions of data points
Time-based reports (grouping data by day, week, month, quarter, year), powered by the date_histogram aggregation (see the sketch below)
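As an illustration of those date-histogram-powered reports, here is a minimal sketch with the Python client. The "orders" index and the "created_at"/"amount" fields are assumptions, not the poster's actual schema, and a recent client is assumed as in the earlier sketches.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")   # hypothetical cluster

    resp = es.search(
        index="orders",
        size=0,
        aggs={
            "per_month": {
                "date_histogram": {"field": "created_at", "calendar_interval": "month"},
                "aggs": {"revenue": {"sum": {"field": "amount"}}},  # sub-aggregation per bucket
            }
        },
    )

    for bucket in resp["aggregations"]["per_month"]["buckets"]:
        print(bucket["key_as_string"], bucket["revenue"]["value"])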
Performance of reporting in E.S vs Mongo (that we measured)
Aggregating 5 million data points in E.S took < 1 sec while it took Mongo > 10 sec, on similar instances.
In addition to the above: support for scripting is also available, which gives a lot of flexibility (see the sketch below).
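For that scripting flexibility, a typical pattern is a small Painless script inside an aggregation, computing a value that was never stored as its own field. The index and field names below are again just placeholders.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")   # hypothetical cluster

    # Sum of price * quantity, computed on the fly by a Painless script.
    resp = es.search(
        index="orders",
        size=0,
        aggs={
            "total_value": {
                "sum": {
                    "script": {
                        "lang": "painless",
                        "source": "doc['price'].value * doc['quantity'].value",
                    }
                }
            }
        },
    )

    print(resp["aggregations"]["total_value"]["value"])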

cassandra vs elastic search vs any other design suggestions

We need to run analytics queries on the data stored in RDS, and that's becoming very slow because of the GROUP BY queries and the ever-increasing size of the tables.
For example, we have the following 3 tables in RDS:
alm(id,name,cli, group_id, con_id ...)
group(id, type,timestamp ...)
con(id,ip,port ...)
Each of the tables has a very large amount of data and is updated several times a minute as new data comes in.
Now we want to run aggregation queries like :
select name from alm, group, con where alm.group_id=group.id and alm.con_id=con.id group by name, group.type, con.ip
We also want users to be able to run custom aggregation queries in the future, as opposed to the fixed queries provided by us.
So far the options we are considering are moving to either Cassandra, Elasticsearch or DynamoDB so that aggregation would be faster. Can someone offer guidance on how to go about this problem, or any crumbs of experience? Does any of these technologies have a significant advantage over the others?
Cassandra and DynamoDB are quite different from ElasticSearch. And all three are very different from relational database offerings.
For ad-hoc analytics, relational databases, with a well designed schema, can be pretty good up to the point where you need to split your data across multiple servers (then replication issues start to dominate the benefits). And that's really the primary motivation for non-relational databases. But the catch is that in order to solve the horizontal scaling problem, they generally trade some features such as joining and aggregating.
Elasticsearch is really great at answering search queries, but not particularly good at aggregations (other than very basic counts, sums and their estimates). It's amazing at indexing copious amounts of data, but it can't answer queries that require complex cross-index operations. It is also not as robust (rebuilding indexes may be needed from time to time).
If you have high volumes of data and you need aggregation, you pretty much have two options:
if you can get away with offline analytics, then distributed data processing frameworks such as Spark can get you the answers you need very efficiently (see the sketch after this list)
if you need online analytics, the most common approach is to pre-compute the aggregations and update as you get more data, so that answers to queries can be very fast without having to process a lot of data for each query
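For the offline-analytics option, a PySpark job that mirrors the GROUP BY query from the question might look roughly like the sketch below. The Parquet paths (an assumed export of the RDS tables) and the SparkSession setup are illustrative only.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("offline-analytics").getOrCreate()

    # Hypothetical exports of the three RDS tables
    alm = spark.read.parquet("s3://my-bucket/alm/")
    group = spark.read.parquet("s3://my-bucket/group/")
    con = spark.read.parquet("s3://my-bucket/con/")

    # Same join + group-by as the SQL in the question
    result = (
        alm.join(group, alm["group_id"] == group["id"])
           .join(con, alm["con_id"] == con["id"])
           .groupBy(alm["name"], group["type"], con["ip"])
           .count()
    )
    result.show()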
Don't be afraid to mix and match, though. Relational databases have their purpose, as do non-relational ones. There is no silver bullet.
One more option is column-oriented databases; this kind of DB is better suited to analytics cases where you have many data fields and want to perform aggregations or extract a subset of fields over a large amount of data.
Recently Yandex's ClickHouse has become very popular, and Amazon offers a column-oriented service, Redshift. There are also several other solutions.
Store the data in Parquet and use Spark, partitioning it efficiently.
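A rough sketch of that suggestion, assuming the raw events carry a timestamp field and live under a hypothetical S3 prefix: write the data partitioned by date so later queries only scan the partitions they need.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("parquet-layout").getOrCreate()

    events = spark.read.json("s3://my-bucket/raw-events/")   # hypothetical raw input

    (events
        .withColumn("event_date", F.to_date("timestamp"))    # derive the partition column
        .write
        .partitionBy("event_date")
        .mode("append")
        .parquet("s3://my-bucket/events-parquet/"))

    # Queries filtering on event_date only read the matching partitions
    stored = spark.read.parquet("s3://my-bucket/events-parquet/")
    stored.filter(F.col("event_date") == "2021-06-01").groupBy("ip").count().show()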

How do websites do fulltext search and sort?

How do websites implement search and sort? (example: ecommerce search for a product and sort by price)
I've been wrestling with this for a while. I'm using MySQL, and after long discussions here it seems that MySQL can't handle this. I've also asked here and here whether Postgres can do this, and again it seems like the answer is no.
So how do websites do it?
EDIT: To be clear, I'm asking how websites do it in a way that uses both a full-text index and some sort of BTREE index for the sorting. Doing full-text search and sorting without one of the indexes would be easy (albeit slow).
I worked for a large ecommerce site that used SQL Server full-text search to accomplish this. Conceptually, the full-text search engine would produce a list of ids, which would be joined against the b-tree indexes to return sorted results. Performance was acceptable, but we pushed it as far as we could go with the largest hardware available at the time (80 CPUs, 512 GB RAM, etc.). With 20-25 million documents, a simple full-text query (2-3 terms) would have response times in the 3-5 second range. That was for the historical data. The live data set (around 1 million documents) would average 200 ms with a wide distribution. We were able to handle 150-200 queries per second.
We eventually ended up moving away from SQL Server for search because we wanted additional full-text functionality that SQL Server didn't offer, specifically highly tunable relevance sorting for results. We researched various options and settled on Elasticsearch hosted on AWS.
Elasticsearch offered substantial improvements in features. Performance was great. We went live with 4 xlarge instances on AWS. Query response times were right around 150-175 ms, and very, very consistent. We could easily scale the number of nodes up or down to keep performance consistent under varying amounts of load.
SQL Server was still the system of record. We had to develop several services to push changes from SQL Server to ES (incremental loading, bulk loading, etc.). Translating the SQL search logic to ES was straightforward.
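For context, such incremental-load services can be fairly small. A very rough sketch in Python, using the elasticsearch bulk helper and a hypothetical ODBC connection to the system of record (the table name, columns and the updated_at watermark are made up for illustration):

    import pyodbc                      # assumed ODBC access to SQL Server
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")     # hypothetical search cluster
    conn = pyodbc.connect("DSN=products_db")        # hypothetical DSN

    def changed_rows(since):
        """Yield bulk actions for every product changed since the given watermark."""
        cur = conn.cursor()
        cur.execute(
            "SELECT id, title, description, price FROM products WHERE updated_at > ?",
            since,
        )
        for row in cur:
            yield {
                "_index": "products",
                "_id": row.id,
                "_source": {
                    "title": row.title,
                    "description": row.description,
                    "price": float(row.price),
                },
            }

    helpers.bulk(es, changed_rows(since="2021-06-01T00:00:00"))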
In conclusion, if your database can't meet your search needs, then use a tool (elasticsearch) that does.

Why people often use a database like Redshift together with an ElasticSearch purely for analytics / reporting queries?

Per the title - I have seen that many companies, especially in ad tech, use a data warehouse solution like Redshift, where they store all the transactional data to do aggregations and analytics, and also pump their data into Elasticsearch, possibly for the same reason (not for search anyway).
Apologies if this questions looks daft but wanted to understand the reasons behind this.
Is it to get real-time queries out of one and do historical data analysis on the other?
Thanks
Indeed, I've worked with a few companies (as a consultant) who were considering a combination of these two for reasons very similar to what you described:
Redshift: for historical analysis, large complex queries, joins, trends, pre-aggregations
ElasticSearch (usually with Kibana): for near real-time operational monitoring and analytics, leveraging its indexing capabilities and free-form searches, dashboards, JSON indexing, real-time metric queries
Redshift is great for handling massive amounts of time-series data (billions of rows in seconds). But it's not ideal for frequent queries on real-time streamed data, and that's where ElasticSearch comes in.

ElasticSearch Performance : Continuous read/write vs Bulk write

I am new to Elastic Search. I need to implement a system where I will be getting data feed continuously throughout the day. I would like to make this data feed searchable so I am using ElasticSearch.
Now, I have two ways to go about this:
1) Store data from the feed in Mongo, and feed this data to Elasticsearch at regular intervals, let's say twice a day.
2) Directly feed data to Elasticsearch as a continuous process, while Elasticsearch simultaneously serves search queries.
I am expecting a volume of around 20 entries per second coming from the data feed and around 2-3 queries per second being run against Elasticsearch.
Please advise.
Can you tell us more about your cluster architecture? How many nodes? Do all nodes hold data, or are there also gateway nodes?
Usually I would say feeding directly to elasticsearch shouldn't be a problem. 2-3 queries per second is not much at all for elasticsearch.
You should optimize your index structure and application code for it:
Create a separate index for each day (see the sketch after this list)
Increase the number of shards (you should experiment, based on your hardware configuration)
Close old daily indexes or aggregate them into larger periods (e.g. monthly indexes) using some batch processing
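A minimal sketch of the index-per-day idea, using a recent (8.x-style) Python client: an index template so every daily index gets the same shard settings, and a writer that simply targets today's index. The template name, index pattern and shard count are assumptions you would tune to your own hardware and naming scheme.

    from datetime import date
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")   # hypothetical cluster

    # Every index matching feed-* is created with the same settings.
    es.indices.put_index_template(
        name="feed-template",
        index_patterns=["feed-*"],
        template={"settings": {"number_of_shards": 3, "number_of_replicas": 1}},
    )

    # The feed writer targets today's index; Elasticsearch creates it on first write.
    today_index = f"feed-{date.today():%Y.%m.%d}"
    es.index(index=today_index, document={"message": "example entry", "ts": "2021-06-01T10:00:00Z"})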
From my tests, 20 inserts/second is not a big load for Elasticsearch.
