Using ElasticSearch as a permanent storage - hadoop

I am currently working on a project that produces a huge amount of data every day. The project has two functions: one stores the data in HBase for future analysis, and the other pushes the data into Elasticsearch for monitoring.
As things stand, this huge volume of data has to be stored on two platforms (HBase and Elasticsearch)!
I have no experience with either of them. I want to know: is it possible to use Elasticsearch instead of HBase as persistent storage for future analytics?

I recommend reading this old but still valid article: https://www.elastic.co/blog/found-elasticsearch-as-nosql
Keep in mind that Elasticsearch is first and foremost a search engine. Whether it can be your only store depends on how critical your data is, and on whether you can accept losing some of it, as you might with non-critical logs.
If you don't want to run an additional database for such a large volume of data, you can probably store it as files in something like HDFS.

You should also check out Phoenix (https://phoenix.apache.org/), which may provide the monitoring features that you are looking for.

Related

Is Elasticsearch optimized for inserts?

I develop for a relatively large online store with a PHP backend, and it uses elasticsearch for some things (like text search, logging... etc).
Now, I'd like to start storing all kinds of information about user activity in ES, for instance every page view (a user entering a product page, a category page, etc.).
Is ES optimized for such a heavy load of continuous inserts, or should I consider some alternatives, like for instance having some sort of a buffer layer where I store all of my immediate inserts in memory, and then every minute or so, insert them into ES in bulk?
What is the industry standard? Or am I worrying in vain and ES is optimized for that?
Thanks.
Elasticsearch, when properly sized to handle your load, is definitely a valid alternative for such a use case.
You might decide, however, to store that streaming data into another cluster which is different from your production cluster, so as to not impact the health of the production cluster too much.
There are a lot of variables that go into making the correct decision, and we don't have enough information here, but it's definitely a valid approach.
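The buffer layer mentioned in the question could look roughly like the sketch below. It is only an illustration under stated assumptions: `BulkBuffer`, `send_bulk`, and the thresholds are hypothetical names and values, and `send_bulk` stands in for whatever actually posts the payload to the cluster. The payload follows the newline-delimited format of the Elasticsearch `_bulk` endpoint (an action line followed by the document). Note that the age check only runs when a new event arrives, and that anything still in memory is lost if the process dies.

```python
import json
import time


class BulkBuffer:
    """Collects events in memory and hands them over in batches."""

    def __init__(self, send_bulk, index, max_docs=500,
                 max_age_seconds=60.0, clock=time.monotonic):
        self.send_bulk = send_bulk          # hypothetical callable that ships the payload
        self.index = index                  # target index name (assumption)
        self.max_docs = max_docs
        self.max_age_seconds = max_age_seconds
        self.clock = clock
        self.docs = []
        self.first_added_at = None

    def add(self, doc):
        # Remember when the current batch started so it can be flushed by age.
        if not self.docs:
            self.first_added_at = self.clock()
        self.docs.append(doc)
        too_many = len(self.docs) >= self.max_docs
        too_old = self.clock() - self.first_added_at >= self.max_age_seconds
        if too_many or too_old:
            self.flush()

    def flush(self):
        if not self.docs:
            return
        lines = []
        for doc in self.docs:
            # One action line, then the document itself, as the bulk format expects.
            lines.append(json.dumps({"index": {"_index": self.index}}))
            lines.append(json.dumps(doc))
        payload = "\n".join(lines) + "\n"   # the bulk body must end with a newline
        self.send_bulk(payload)
        # Only clear the batch once the send did not raise.
        self.docs = []
        self.first_added_at = None
```

In a PHP shop the same idea would usually live in a queue or log shipper rather than in the web process, but the batching logic is the same.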

Avoid data replication when using Elasticsearch + MySQL backend?

I'm working on a project where we have some legacy data in MySQL and now we want to deploy ES for better full text search.
We still want to use MySQL as the backend data storage because the current system is closely coupled with that.
It seems that most of the available solutions suggest syncing the data between the two, but this would result in storing all the documents twice in both ES and MySQL. Since some of the documents can be rather large, I'm wondering if there's a way to have only a single copy of the documents?
Thanks!
Impossible. This is analogous to asking the following: if you have legacy data in an Excel spreadsheet, can you use a MySQL database to query the data without also storing it in MySQL?
Elasticsearch is not just an application layer that interprets userland queries and turns them into database queries. It is itself a database system (in fact, it can be used as your primary data store, though that is not recommended due to various drawbacks). Its search functionality fundamentally depends on how its own backing storage is organized. Elasticsearch cannot query other databases.
You should consider what portions of your data actually need to be stored in Elasticsearch, i.e. what fields need text completion. You will need to build a component which syncs that view of the data between Elasticsearch and your MySQL database.
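As a rough sketch of that sync component, the function below projects a MySQL row onto only the fields that need to be searchable, keyed by the row's primary key. The field names (`id`, `title`, `body`) and the function name are hypothetical; the idea is that a search hit returns the id, and the full record is then fetched from MySQL.

```python
def to_search_doc(row, id_field="id", search_fields=("title", "body")):
    """Return (doc_id, doc) holding only the fields that need full-text search."""
    # Skip missing or NULL columns so they are not indexed as empty values.
    doc = {name: row[name] for name in search_fields if row.get(name) is not None}
    return str(row[id_field]), doc


# Example row as it might come back from a MySQL cursor (made-up data).
row = {"id": 7, "title": "Manual", "body": "How to reset", "attachment": b"\x00\x01"}
doc_id, doc = to_search_doc(row)
```

Large columns that are never searched (the `attachment` field here) stay in MySQL only, which limits the duplication to the searchable text.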

making elasticsearch and bigquery work together

I have a web app that displays the analysis data in browser with elasticsearch as backend data store.
Everything was cool as elasticsearch was handling about 1TB data and search queries were blazing fast.
Then came the decision to add data from all services into the app, close to a petabyte, and we switched to BigQuery. (Yes, we abandoned Elasticsearch and started querying BigQuery directly.)
Now users of my app are complaining that their queries are slow: they take several seconds (4, 10, sometimes 15), where results used to display in under a second.
Naturally the huge amount of data is to blame, but I am wondering if there is a way to bring Elasticsearch back into the game and make Elasticsearch and BigQuery play together nicely, so that I get the petabytes of storage from BigQuery but still retain the lightning-fast search of Elasticsearch.
I am sure I am not the first one to face this issue. If anything, I am a bit late to the BigQuery party, so I should be able to reap the benefits of delayed entry, with the problems already solved by others.
Thanks in advance if you can point me to the right direction.
This is a common pattern I see deployed by customers:
Use Elasticsearch to display results from the latest day/week - whatever fits within Elasticsearch's RAM.
Use BigQuery for everything else.
In this way your users will get sub-second results for 90% of their queries, and they will also be able to go wherever they want to go if Elasticsearch can't find an answer within its resources.
I'm not sure what interfaces your users have for getting data, but that's where this logic would need to be deployed.
(Of course, expect improvements in the connections and speed as tech progresses.)
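A minimal sketch of that routing logic, assuming queries carry a start date and that Elasticsearch holds only the most recent days. The function name, the 7-day window, and the backend labels are illustrative assumptions, not part of either product:

```python
from datetime import date, timedelta


def choose_backend(query_start, today, hot_window_days=7):
    """Pick the store that can answer a query starting at `query_start`."""
    # Oldest day still assumed to be held in Elasticsearch.
    hot_start = today - timedelta(days=hot_window_days)
    # Only queries that lie entirely inside the hot window go to Elasticsearch.
    return "elasticsearch" if query_start >= hot_start else "bigquery"


backend = choose_backend(date(2024, 1, 10), today=date(2024, 1, 12))
```

The window size would have to match whatever retention is actually configured on the Elasticsearch side.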

Storing and processing timeseries with Hadoop

I would like to store a large amount of time series from devices. These time series have to be validated, can be modified by an operator, and have to be exported to other systems. Holes in the time series must be found. Time series must be shown in the UI, filtered by serial number and date range.
We have thought about using Hadoop, HBase, OpenTSDB and Spark for this scenario.
What do you think about it? Can Spark connect to opentsdb easily?
Thanks
OpenTSDB is really great for storing large amounts of time series data. Internally, it is underpinned by HBase, which means that it had to find a way around HBase's limitations in order to perform well. As a result, the representation of time series is highly optimized and not easy to decode. AFAIK, there is no out-of-the-box connector that would let you fetch data from OpenTSDB into Spark.
The following GitHub project might provide you with some guidance:
Achak1987's connector
If you are looking for libs that would help you with time series, have a look at spark-ts - it contains useful functions for missing data imputation as well.
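For the "holes must be found" requirement, the underlying check is simple enough to sketch in plain Python, independent of any of the libraries above. This is not the spark-ts API; `find_holes` and the fixed expected interval are assumptions made for illustration:

```python
from datetime import datetime, timedelta


def find_holes(timestamps, expected_interval):
    """Return (gap_start, gap_end) pairs where samples are further apart than expected."""
    ordered = sorted(timestamps)
    holes = []
    # Compare each sample with the one that follows it.
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > expected_interval:
            holes.append((previous, current))
    return holes


samples = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 0, 1),
    datetime(2024, 1, 1, 0, 4),  # two samples are missing before this one
]
holes = find_holes(samples, timedelta(minutes=1))
```

At scale the same comparison would run per device (per serial number) inside whichever engine ends up holding the data.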
Warp 10 offers the WarpScript language which can be used from Spark/Pig/Flink to manipulate time series and access data stored in Warp 10 via a Warp10InputFormat.
Warp 10 is Open Source and available at www.warp10.io
Disclaimer: I'm CTO of Cityzen Data, maker of Warp 10.
Take a look at Axibase Time Series Database which has a rather unique versioning feature to maintain a history of value changes for the same timestamp. Once enabled with per-metric granularity, the database keeps track of source, status and times of value modifications for audit trail or data reconciliation.
We have customers streaming data from Spark apps using the Network API, typically once the data is enriched with additional metadata (aka series tags) for downstream reporting.
You can query data from ATSD with REST API or SQL.
Disclaimer: I work for Axibase.

Storing data in Elasticsearch - OLTP

I have a transactional application where the reps want to enter tickets and I have to store them immediately. The reason I picked ES is that the techs may enter some unstructured data and they want to search on it later.
Is it ok to store the data directly in ES instead of RDBMS?
I think probably 5-10 users will be using this application concurrently.
I have already built it using Django/ES but just want to make sure I don't have any issues later.
It is certainly 'ok' to store data in Elasticsearch instead of a traditional relational model, but that doesn't mean it's the right choice. Your use case sounds fairly simple, and more 'document' based than tabular. For this a NoSQL document store can be a good fit. Elasticsearch also offers replica shards, which keep copies of your data for higher availability and resilience (for instance, if one of your concerns is protecting your data against the loss of a node).
On the other hand, simply having some longer text fields is not a strong argument for choosing ES over a database system (RDBMS or otherwise) that you are more familiar with or that has more built-in support for administrative functions.
If you have truly unstructured data (i.e. different tickets can have different fields), or you have a high volume of tickets, such that the full-text indexing and searching in ES provides a real performance gain, then it could be worth the learning curve.
The basic concepts page for ES is a good place to start. See the sections on Shards & Replicas.
https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html
This might also be useful: https://www.elastic.co/blog/found-uses-of-elasticsearch

Resources