Using ElasticSearch and Kibana for Business Intelligence - elasticsearch

We are using ElasticSearch for search capability in our product. This works fine.
Now we want to provide self-service business intelligence to our customers. Reporting directly on the operational database is not viable because of the performance impact: calculating the average 'order resolution time' over 10 million records at run time would not return results in time. The traditional way is to create a data mart by loading the operational data through ETL and summarizing it, then use any reporting engine to offer metrics and reports to customers. This approach works but increases the total cost of ownership for our customers.
I am wondering if anybody has used ElasticSearch as the intermediate data layer for reporting. Can Kibana serve the data exploration and visualization needs?

We have the same needs.
Tools like Qlik, Power BI, and Tableau enlarge the overall infrastructure stack. When you are designing a solution to deploy abroad, with no possibility of sharing anything, they may not be the best option in terms of both cost and complexity.

I have used DevExtreme by DevExpress. Its server-side approach using a custom store handles operations on large amounts of data very efficiently. With MySQL and MSSQL databases, I have performed grouping, sorting, filtering, and summaries on 10 million records using DevExtreme.

Apache Superset seems to be an answer. https://superset.apache.org/docs/intro

Related

Is Elastic Search a good data store for a Read Only Api?

We are planning to create a reporting database exposed via read only api. It'll contain reporting related read apis for both our customers and internal processes like invoicing.
We also thought it would be useful to have Kibana on top of it, to provide analytics for our internal teams.
Is Elasticsearch good for this use case?
Yes, Elasticsearch will be a very good choice for your use case, for the following reasons:
You can de-normalize your data and store it in a single index, which makes fetching and searching very fast. This is a prime use case for NoSQL stores, and ES can work that way.
Basic X-Pack security is available for free in ES and can give your users read-only access without much effort or cost.
Apart from search, Elasticsearch is also very popular for analytics use cases: you can run aggregations very easily and use Kibana dashboards for visualisation, which integrate well with ES since both are products of the same company (Elastic).
Most importantly, ES is a horizontally scalable, distributed system and can easily be scaled to hundreds of nodes to support growing needs.
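As an illustration of the de-normalization point above, here is a minimal Python sketch. The `customers` and `orders` records and every field name are made up for the example. The idea is only that each order document carries the customer attributes it would otherwise need a join for, so a single index can serve the read API.

```python
# Hypothetical normalized rows, as they might come out of an operational database.
customers = {1: {"name": "Acme", "region": "EU"}}
orders = [
    {"id": 100, "customer_id": 1, "status": "resolved", "resolution_hours": 5.5},
    {"id": 101, "customer_id": 1, "status": "open", "resolution_hours": None},
]


def denormalize(order, customers_by_id):
    """Copy the customer's attributes into the order so one document is self-contained."""
    customer = customers_by_id[order["customer_id"]]
    return {
        "order_id": order["id"],
        "status": order["status"],
        "resolution_hours": order["resolution_hours"],
        "customer_name": customer["name"],
        "customer_region": customer["region"],
    }


# Each entry is a flat document that could be indexed as-is into a single index.
documents = [denormalize(order, customers) for order in orders]
```

The trade-off is that a change to a customer has to be propagated to every document that embeds it.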
In addition to opster's answer, there are 2 things I want to mention that might help you make a decision:
How E.S is serving us for a real-time reporting use case in production with an extensive data set
Performance of reporting in E.S vs Mongo (that we measured)
How E.S is serving us for a real-time reporting use case in production with an extensive data set
E.S provides real-time results (under 1 sec) for the following cases of ours:
Reports generated by running multiple sets of filters (date, etc.) and aggregations on millions of data points
Time-based reports (grouping data by day, week, month, quarter, year) - Powered by DateHistogram
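For reference, a time-based report like the ones above could be expressed roughly as follows. This is only a sketch of a request body built in Python, not the queries actually used in production: the function name and the field names `created_at` and `resolution_hours` are assumptions, and `calendar_interval` is the parameter name used by Elasticsearch 7 and later (older versions used `interval`).

```python
import json


def time_report_query(date_field, metric_field, start, end, interval="month"):
    """Build a search body: filter by date range, bucket by period, average a metric."""
    return {
        "size": 0,  # only the aggregation results are needed, not the matching documents
        "query": {
            "bool": {"filter": [{"range": {date_field: {"gte": start, "lt": end}}}]}
        },
        "aggs": {
            "per_period": {
                "date_histogram": {"field": date_field, "calendar_interval": interval},
                "aggs": {"avg_metric": {"avg": {"field": metric_field}}},
            }
        },
    }


# Hypothetical field names; the body would be sent to the index's _search endpoint.
body = time_report_query("created_at", "resolution_hours", "2023-01-01", "2024-01-01")
print(json.dumps(body, indent=2))
```

Switching `interval` to `day`, `week`, `quarter`, or `year` gives the other groupings listed above.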
Performance of reporting in E.S vs Mongo (that we measured)
Aggregating 5 million data points in E.S took < 1 sec while it took Mongo > 10 sec, on similar instances.
In addition to the above: support for scripting is also available, which provides a lot of flexibility.

cassandra vs elastic search vs any other design suggestions

We need to run analytics queries on data stored in RDS, and they are becoming very slow because of GROUP BY queries and the ever-increasing size of the tables.
For example, we have the following 3 tables in RDS:
alm(id,name,cli, group_id, con_id ...)
group(id, type,timestamp ...)
con(id,ip,port ...)
Each of the tables holds a very large amount of data and is updated several times a minute as new data comes in.
Now we want to run aggregation queries like:
select name from alm, group, con where alm.group_id=group.id and alm.con_id=con.id group by name, group.type, con.ip
We also want users to be able to run custom aggregation queries in the future, as opposed to the fixed query provided by us.
So far the options we are considering are moving to Cassandra, Elasticsearch, or DynamoDB so that aggregation would be faster. Can someone offer guidance on how to approach this problem, or any crumbs of experience? Does any of these technologies have a clear advantage over the others?
Cassandra and DynamoDB are quite different from ElasticSearch. And all three are very different from relational database offerings.
For ad-hoc analytics, relational databases, with a well designed schema, can be pretty good up to the point where you need to split your data across multiple servers (then replication issues start to dominate the benefits). And that's really the primary motivation for non-relational databases. But the catch is that in order to solve the horizontal scaling problem, they generally trade some features such as joining and aggregating.
Elasticsearch is really great at answering search queries, but not particularly good at aggregations (other than very basic counts, sums and their estimates). It's amazing at indexing copious amounts of data, but it can't answer queries that require complex cross-index operations. It is also not as robust (rebuilding indexes may be needed from time to time).
If you have high volumes of data and you need aggregation, you pretty much have two options:
if you can get away with offline analytics, then distributed data processing frameworks such as Spark can get you the answers you need very efficiently
if you need online analytics, the most common approach is to pre-compute the aggregations and update as you get more data, so that answers to queries can be very fast without having to process a lot of data for each query
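To make the pre-computation idea concrete, here is a small Python sketch that keeps running counts and sums per group and updates them as each record arrives, so a query reads one small entry instead of scanning the raw rows. The class name, the sample values, and the grouping key (name, group type, connection IP, loosely following the question's GROUP BY) are illustrative assumptions, and a real system would persist this state rather than hold it in memory.

```python
from collections import defaultdict


class RunningAggregates:
    """Keeps per-group count and sum so averages can be answered without a rescan."""

    def __init__(self):
        self._count = defaultdict(int)
        self._total = defaultdict(float)

    def add(self, key, value):
        # Called once per incoming record; cost does not depend on table size.
        self._count[key] += 1
        self._total[key] += value

    def count(self, key):
        return self._count.get(key, 0)

    def average(self, key):
        n = self._count.get(key, 0)
        if n == 0:
            return None  # no records seen for this group yet
        return self._total[key] / n


aggregates = RunningAggregates()
# Hypothetical records keyed by (name, group type, connection IP).
aggregates.add(("alarm-a", "critical", "10.0.0.1"), 2.0)
aggregates.add(("alarm-a", "critical", "10.0.0.1"), 4.0)
aggregates.add(("alarm-b", "minor", "10.0.0.2"), 7.0)
```

This works for counts, sums, and averages because they can be updated one record at a time. Metrics such as exact medians or distinct counts need a different representation.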
Don't be afraid to mix and match, though. Relational databases have their purpose, as do non-relational ones. There is no silver bullet.
One more option is column-oriented databases. This kind of DB is more suitable for analytics cases where you have many data fields and want to perform aggregations or extract a subset of fields from a large amount of data.
Recently Yandex ClickHouse has become very popular, and Amazon offers a column-oriented service, Redshift. There are also several other solutions.
Store the data in Parquet and use Spark; partition efficiently.

Why do people often use a database like Redshift together with ElasticSearch purely for analytics / reporting queries?

Per the title: I have seen that many companies, especially in ad tech, use a data warehouse solution like Redshift, where they store all the transactional data for aggregations and analytics, and also pump their data into Elasticsearch, possibly for the same reason (not for search, anyway).
Apologies if this question looks daft, but I wanted to understand the reasons behind this.
Is it to get real-time queries out of one and do historical data analysis on the other?
Thanks
Indeed, I've worked with a few companies (as a consultant) who were considering a combination of these 2 for reasons similar to what you described:
Redshift: for historical analysis, large complex queries, joins, trends, pre-aggregations
ElasticSearch (usually with Kibana): for near real-time operational monitoring and analytics, leveraging its indexing capabilities and free-form searches, dashboards, JSON indexing, real-time metric queries
Redshift is great for handling massive amounts of time-series data (billions of rows in seconds). But it's not ideal for frequent queries on real-time streamed data, and that's where ElasticSearch comes in.

Reasons against using Elasticsearch as an OLAP cube

At first glance, it seems that with Elasticsearch as a backend it is easy and fast to build reports with pivot-like functionality as used in traditional business intelligence environments.
By "pivot-like" I mean that in SQL-terms, data is grouped by one to two dimensions, filtered, ordered by one or two dimensions and aggregated by several metrics e.g. with sum or count.
By "easy" I mean that with a sufficiently large cluster, no pre-aggregation of the data is required, which saves ETLs and data engineering time.
By "fast" I mean that due to Elasticsearch's near real time capability report latency can be reduced in many instances, when compared to traditional business intelligence systems.
Are there any reasons not to use Elasticsearch for the above purpose?
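To make "pivot-like" concrete, a request of that shape might look like the sketch below, built as a Python dictionary. The function name and the field names in the usage line are hypothetical. The nested `terms` aggregations play the role of the two grouping dimensions, the `bool` filter does the filtering, and `sum` is the metric (each bucket's count comes back as `doc_count` without being asked for).

```python
def pivot_query(row_dim, col_dim, metric_field, filters, size=10):
    """Build a search body that groups by two dimensions and sums one metric."""
    metric = {"sum": {"field": metric_field}}
    return {
        "size": 0,  # return aggregation buckets only
        "query": {
            "bool": {"filter": [{"term": {field: value}} for field, value in filters.items()]}
        },
        "aggs": {
            "rows": {
                # Order the first dimension by its summed metric, largest first.
                "terms": {"field": row_dim, "size": size, "order": {"total": "desc"}},
                "aggs": {
                    "total": metric,
                    "cols": {
                        "terms": {"field": col_dim, "size": size},
                        "aggs": {"total": metric},
                    },
                },
            }
        },
    }


# Hypothetical fields: revenue by country and product, restricted to one year.
pivot_body = pivot_query("country", "product", "revenue", {"year": 2023})
```

Note that `size` caps the number of buckets per dimension, so a pivot over high-cardinality fields returns only the top groups unless it is paged.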
ElasticSearch is a great alternative to a cube; we use it for that same purpose today. One huge benefit is that with a cube you need to know in advance which dimensions you want to report on. With ES you just shove in more and more data and figure out later how you want to report on it.
At our company we regularly have data go through the following life cycle.
The record is written to SQL
The primary key from SQL is written to RabbitMQ
We respond back to the customer very quickly
When Rabbit has time, it uses the primary key to gather up all the data we want to report on
That data is written to ElasticSearch
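The life cycle above can be sketched in a few lines of Python. Everything here is a stand-in chosen for illustration: a dict plays the SQL table, `queue.Queue` plays RabbitMQ, and another dict plays the ElasticSearch index. None of it is the actual code behind this answer.

```python
import queue

sql_table = {}           # stand-in for the SQL database
pending = queue.Queue()  # stand-in for a RabbitMQ queue
search_index = {}        # stand-in for an ElasticSearch index


def handle_request(primary_key, row):
    """Write the record, enqueue only its primary key, and answer right away."""
    sql_table[primary_key] = row
    pending.put(primary_key)
    return "accepted"


def drain_queue():
    """Later, off the request path: load each record and write the reporting document."""
    while not pending.empty():
        primary_key = pending.get()
        row = sql_table[primary_key]  # gather the data to report on
        search_index[primary_key] = {"id": primary_key, **row}


handle_request(1, {"customer": "Acme", "amount": 120})
drain_queue()
```

Because only the key travels through the queue, the consumer always reads the latest state of the record when it builds the reporting document.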
A word of advice: if you think you might want to report on something, capture it from the beginning. Inserting 1M rows into ES is very easy; updating 1M rows is a bigger pain.

Datameer for Real Time Querying

We are currently interested in evaluating Datameer and have a few questions. Are there any Datameer users who can answer them?
1. Since Datameer works off HDFS, are the querying speeds similar to those of Hive? How does the querying speed compare with columnar databases?
2. Since Hadoop is known for high latency, is it advisable to use Datameer for real-time querying?
Thank you.
Ravi
Regarding 1:
Query speeds are comparable to Hive.
But Datameer is a lot faster in the design phase of your "query". Datameer provides a real-time preview of how the results of your "query" would look, which happens in memory and not on the cluster. The preview is based on a representative sample of your data. It's only a preview, not the final results, but it gives you constant feedback on whether your analytics make sense while you are designing.
To test a Hive query you would have to execute it, which makes the design process very slow.
Datameer's big advantages over Hive are:
Loading data into Hadoop is much easier. No static schema creation, no ETL, etc. Just use a wizard to download data from your database, log files, social media, etc.
Designing analytics or making changes is a lot faster and can even be done by non technical users.
No need to install anything else since Datameer includes all you need for importing, analytics, scheduling, security, visualization etc. in one product
If you have real-time requirements, you should not pull data directly out of Datameer, Hive, Impala, etc. Columnar storage makes some processing faster but will still not be low latency. But you can use those tools together with a low-latency database: use Datameer/Hive/Impala for the heavy lifting to filter and pre-aggregate big data into smaller data, and then export that into a database. In Datameer you could set this up very easily using one of Datameer's wizards.
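As a rough sketch of that pattern, the Python below pre-aggregates a batch of raw events into a small summary and exports it to a database that can answer lookups quickly. SQLite in memory stands in for the low-latency database, a `Counter` stands in for the heavy batch job, and the event data, table name, and function name are invented for the example.

```python
import sqlite3
from collections import Counter

# Hypothetical raw events (day, event type); in practice this is the "big data" side.
raw_events = [
    ("2024-01-01", "click"),
    ("2024-01-01", "click"),
    ("2024-01-01", "view"),
    ("2024-01-02", "click"),
]

# Heavy lifting: reduce many raw rows to one row per (day, event type).
daily_counts = Counter(raw_events)

# Export the small summary into the serving database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_counts (day TEXT, event TEXT, n INTEGER, PRIMARY KEY (day, event))"
)
conn.executemany(
    "INSERT INTO daily_counts VALUES (?, ?, ?)",
    [(day, event, n) for (day, event), n in daily_counts.items()],
)
conn.commit()


def lookup(day, event):
    """Serve a request from the pre-aggregated table with a primary-key lookup."""
    row = conn.execute(
        "SELECT n FROM daily_counts WHERE day = ? AND event = ?", (day, event)
    ).fetchone()
    return row[0] if row else 0
```

The serving side never touches the raw events, so its latency depends only on the size of the summary table.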
Hope this helps,
Peter Voß (Datameer)
