Elasticsearch reading data from Apache Hadoop - elasticsearch

We are trying to introduce Elasticsearch into our big data environment. We are currently running Apache Hadoop 2.7, including Hive and Spark. Data is stored in Parquet format in Hadoop.
When we implement ELK in our environment, can we keep storing data only in Hadoop HDFS? Or must we extract data from Hadoop and import it into Elasticsearch so that it can create indexes, leaving us with a duplicate dataset in the system (Hadoop HDFS and Elasticsearch)?
Thank you.

Sorry if this is a bit long, but I hope it helps!
What is Elasticsearch?
Elasticsearch is a search engine. Period. A search solution; or rather a type of database or datastore which helps you organise your data in a way that lets you perform activities like data discovery, or build search applications for your organisation.
Although you can also run a lot of analytical queries and build analytical solutions around it, there are certain limitations.
The nature of Elasticsearch and the data structures used in it are so different that you need to push (ingest) the data into it in order to perform all search/data-discovery/analytical activities. It has its own file system and data structures which manage/store the data specifically for efficient searching.
So yes, there will be duplication of data.
What is Elasticsearch not?
It is not to be used as an analytical solution; although it comes with a lot of aggregation queries, it is not as expressive as a processing engine like Apache Spark or data virtualisation tools like Denodo or Presto.
It is not a storage solution like HDFS or S3, nor is it used as a data lake for the organisation.
It is not to be used as a transactional database to replace RDBMS solutions.
Most of the time, organisations ingest data into ES from various sources such as OLAP systems, RDBMSs, NoSQL databases, CMSs and messaging queues, so that they can search the data more efficiently.
In other words, most of the time, ES is not the primary datasource.
How do organisations use it?
Build search solutions: for example, if you wish to build any e-commerce solution, you can have its search implementation managed by Elasticsearch.
Enterprise search solutions (internal and external) that make IT people more productive and allow their customers to find the required documentation, knowledge base articles, downloadable PDFs, text etc. for their products, e.g. installation docs, configuration docs, release docs, new-product docs. All the content is assembled from various sources across a company and pushed into ES so that it becomes searchable.
Ingest data, e.g. logs from application servers and messaging queues, in order to perform logging and monitoring activities (alerts, fraud analysis).
So the two most common uses of ES are searching, and logging and monitoring. Basically, real-time activities.
How is it different from Hadoop?
Organisations are increasingly leveraging Hadoop for its file system, HDFS, as a data store, while utilising Spark or Hive for data processing, mostly for heavy analytical workloads for which ES has limitations.
Hadoop has the capability to store any file format (of course, you need to make use of Parquet or other formats for efficient storage), whereas Elasticsearch only accepts JSON. This makes Hadoop, along with S3 and other file systems, a de facto industry standard for data-lake or data-archival storage.
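Since Elasticsearch only accepts JSON, each Parquet row would have to be serialized into a JSON document before it can be indexed. A minimal sketch (the field names are made up for illustration):

```python
import json

# A row as it might come out of a Parquet file via Spark or Hive
# (field names are illustrative, not from the question).
row = {"transaction_id": 1001, "user_name": "alice", "amount": 49.95}

# Elasticsearch only accepts JSON documents, so each row must be
# serialized before it can be sent for indexing.
doc = json.dumps(row)
print(doc)
```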
If you are storing data in Hadoop, you will probably have to use other frameworks such as Spark, Giraph or Hive to transform data and do the complex analytical processing for which ES has limitations. ES at its core is a full-text retrieval/search engine.
Hadoop for search
You would need to run Map-Reduce or Spark jobs and write tons of pattern-matching code to find the documents or folders containing any keyword you want to search. Every search would result in one such job, which would not be practical.
Even if you transform and organise the data in such a way that you can leverage Hive, it still would not be as efficient as Elasticsearch for text processing.
Here is a link that can help you understand the core data structure used in Elasticsearch and why text search is faster/different/more efficient.
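The data structure in question is the inverted index: instead of scanning every document for a keyword, it maps each term to the documents containing it. A toy sketch of the idea (the sample documents are made up):

```python
from collections import defaultdict

# A toy inverted index: the core data structure behind Elasticsearch
# (via Lucene). Instead of scanning every document for a keyword,
# we map each term to the set of documents that contain it.
docs = {
    1: "hadoop stores raw data",
    2: "elasticsearch indexes data for search",
    3: "spark processes data on hadoop",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# A keyword lookup is now a dictionary access, not a full scan.
print(sorted(index["hadoop"]))  # [1, 3]
print(sorted(index["data"]))    # [1, 2, 3]
```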
How can we make use of Hadoop and Elasticsearch?
Perhaps the diagram mentioned in this link could be useful.
Basically you can set up ingestion pipelines: process raw data from Hadoop, transform it, and index it in Elasticsearch so that you can make use of its search capabilities.
Take a look at this link to understand how you can use Spark with Elasticsearch and achieve two-way communication between them.
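As a rough illustration of the transform-and-index step, here is how rows read from Hadoop could be turned into an NDJSON body for Elasticsearch's _bulk endpoint (in practice you would likely use the elasticsearch-hadoop connector; the index name and fields here are made up):

```python
import json

# Raw rows as they might be read from HDFS (illustrative data).
rows = [
    {"transaction_id": 1, "user_id": "u1", "amount": 10.0},
    {"transaction_id": 2, "user_id": "u2", "amount": 25.5},
]

def to_bulk_body(rows, index_name):
    """Build an NDJSON body for Elasticsearch's _bulk endpoint:
    an action line followed by the document line, one pair per row."""
    lines = []
    for row in rows:
        lines.append(json.dumps(
            {"index": {"_index": index_name, "_id": row["transaction_id"]}}))
        lines.append(json.dumps(row))
    return "\n".join(lines) + "\n"

body = to_bulk_body(rows, "transactions")
print(body)
```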
Hope that helps!

Related

Hadoop data visualization

I am a new Hadoop developer and I have been able to install and run Hadoop services in a single-node cluster. The problem comes with data visualization. What purpose does the MapReduce jar file serve when I need to use a data visualization tool like Tableau? I have a structured data source to which I need to add a layer of logic so that the data makes sense during visualization. Do I need to write MapReduce programs if I am going to visualize with other tools? Please shed some light on how I could go about this.
This probably depends on which distribution of Hadoop you are using and which tools are present. It also depends on the actual data preparation task.
If you don't want to write map-reduce or Spark code yourself, you could try SQL-like queries using Hive (which translate to map-reduce) or the even faster Impala. Using SQL you can create tabular data (Hive tables) which can easily be consumed. Tableau has connectors for both of them that automatically translate your Tableau configurations/requests to Hive/Impala. I would recommend connecting with Impala because of its speed.
If you need to do work that requires more programming, or where SQL just isn't enough, you could try Pig. Pig is a high-level scripting language that compiles to map-reduce code. You can try all of the above in their respective editors in Hue or from the CLI.
If you feel that all of the above still don't fit your use case, I would suggest writing map-reduce or Spark code. Spark does not need to be written only in Java and has the advantage of being generally faster.
Most tools can integrate with Hive tables, meaning you don't need to rewrite code. If a tool does not provide this, you can make CSV extracts from the Hive tables, or you can keep the tables stored as CSV/TSV. You can then import these files into your visualization tool.
The existing answer already touches on this but is a bit broad, so I decided to focus on the key part:
Typical steps for data visualisation
Do the complex calculations using any Hadoop tool that you like
Offer the output in a (Hive) table
Pull the data into the memory of the visualisation tool (e.g. Tableau), for instance using JDBC
If the data is too big to be pulled into memory, you could pull it into a normal SQL database instead and work on that directly from your visualisation tool. (If you work directly on hive, you will go crazy as the simplest queries take 30+ seconds.)
In case it is not possible/desirable to connect your visualisation tool for some reason, the workaround would be to dump output files, for instance as CSV, and then load these into the visualisation tool.
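The dump-to-CSV workaround from the last step can be sketched like this (the data and field names are invented; the aggregation stands in for whatever Hadoop job produced the output):

```python
import csv
import io
from collections import defaultdict

# Raw output rows (month, category, amount), as if produced by a
# Hadoop job; the data is made up for illustration.
raw = [
    ("2024-01", "books", 120.0),
    ("2024-01", "music", 80.0),
    ("2024-02", "books", 95.0),
]

# Aggregate so the visualisation tool gets a small, pre-computed table.
totals = defaultdict(float)
for month, category, amount in raw:
    totals[(month, category)] += amount

# Dump as CSV, the lowest-common-denominator format most
# visualisation tools can import directly.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["month", "category", "total"])
for (month, category), total in sorted(totals.items()):
    writer.writerow([month, category, total])
print(out.getvalue())
```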
Check out some end-to-end solutions for data visualization.
For example Metatron Discovery, which uses Druid as its OLAP engine. You just link your Hadoop with Druid and can then manage and visualize your Hadoop data accordingly. It is open source, so you can also see the code inside it.

Suggestions for noSQL selection for mass data export

We have billions of records in a relational format (e.g. transaction id, user name, user id and some other fields). My requirement is to create a system where a user can request a data export from this data store (the user will provide some filters like user id, date and so on). A typical exported file will have thousands to hundreds of thousands to millions of records, depending on the selected filters (the output file will be CSV or a similar format).
Besides the raw data, I am also looking for some dynamic aggregation on a few of the fields during the data export.
The typical time between the user submitting a request and the exported data file becoming available should be 2-3 minutes (4-5 minutes max).
I am seeking suggestions on backend NoSQL stores for this use case. I've used Hadoop map-reduce so far, but a typical Hadoop batch map-reduce job over HDFS data might not meet the expected SLA, in my opinion.
Another option is Spark, which I've never used, but it should be much faster than a typical Hadoop map-reduce batch job.
We've already tried a production-grade RDBMS/OLTP instance, but it clearly isn't the right option, given the size of the data we are exporting and the dynamic aggregation.
Any suggestion on using Spark here? Or any other, better NoSQL?
In summary: the SLA, dynamic aggregation and raw data volume (millions of records) are the requirement considerations here.
If the system only needs to export data after doing some ETL (aggregations, filtering and transformations), then the answer is very straightforward: Apache Spark is the best. You would have to fine-tune the system and decide whether you want to use memory only, memory + disk, serialization, etc. However, most of the time one needs to think about other aspects too; I am considering them as well.
This is a wide topic of discussion and it involves many aspects, such as the aggregations involved, search-related queries (if any), and development time. From the description, it seems to be an interactive/near-real-time system. Another aspect is whether any analysis is involved. And another important point is the type of system (OLTP/OLAP, reporting only, etc.).
I see there are two questions involved -
Which computing/data processing engine to use?
Which data storage/NoSQL?
- Data processing -
Apache Spark would be the best choice for computing. We are using it for the same purpose; along with the filtering, we also have XML transformations to perform, which are also done in Spark. It is super fast compared to Hadoop MapReduce. Spark can run standalone and it can also run on top of Hadoop.
- Storage -
There are many NoSQL solutions available. The selection depends on many factors such as volume, the aggregations involved, search-related queries, etc.
Hadoop - You can go with Hadoop, with HDFS as the storage system. It has many benefits, as you get the entire Hadoop ecosystem. If you have analysts/data scientists who need to get insights from the data or play with the data, this would be the better choice, as you get different tools such as Hive/Impala. Also, resource management would be easy. But for some applications it can be too much.
Cassandra - Cassandra is a storage engine that has solved the problems of distribution and availability while maintaining scale and performance. It works wonders when used with Spark, for example when performing complex aggregations. By the way, we are using it. For visualization (to view the data for analysis), options include Apache Zeppelin and Tableau (lots of options).
Elasticsearch - Elasticsearch is also a suitable option if your storage is in the range of a few TB up to 10 TB. It comes with Kibana (a UI) which provides limited analytics capabilities, including aggregations. Development time is minimal; it's very quick to implement.
So, depending on your requirements, I would suggest Apache Spark for data processing (transformations/filtering/aggregations), and you may also need to consider other technologies for storage and data visualization.
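As a plain-Python stand-in for what such a Spark export job would do (the records and field names are invented; real code would use Spark's DataFrame API over billions of rows), the filter, dynamic-aggregation and CSV-export shape is roughly:

```python
import csv
import io

# Tiny in-memory stand-in for the billions of rows that would live
# in Spark; records and field names are made up for illustration.
records = [
    {"transaction_id": 1, "user_id": "u1", "amount": 10.0},
    {"transaction_id": 2, "user_id": "u1", "amount": 30.0},
    {"transaction_id": 3, "user_id": "u2", "amount": 5.0},
]

def export(records, user_id):
    # 1. Filter by the user-supplied criteria.
    selected = [r for r in records if r["user_id"] == user_id]
    # 2. Dynamic aggregation over one of the fields.
    total = sum(r["amount"] for r in selected)
    # 3. Write the matching raw rows out as CSV.
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["transaction_id", "user_id", "amount"])
    writer.writeheader()
    writer.writerows(selected)
    return out.getvalue(), total

csv_data, total = export(records, "u1")
print(total)  # 40.0
```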

Hadoop vs. NoSQL-Databases

As I am new to big data and the related technologies, my question is, as the title implies:
When would you use Hadoop and when would you use some kind of NoSQL database to store and analyse massive amounts of data?
I know that Hadoop is a framework and that Hadoop and NoSQL differ.
But you can store lots of data with Hadoop on HDFS and also with NoSQL DBs like MongoDB, Neo4j...
So maybe the choice between Hadoop and a NoSQL database depends on whether you just want to analyse data or just want to store data?
Or is it that HDFS can store, let's say, RAW data and a NoSQL DB is more structured (more structured than raw data and less structured than an RDBMS)?
Hadoop is an entire framework, of which one of the components can be NoSQL.
Hadoop generally refers to a cluster of systems working together to analyze data. You can take data from NoSQL and parallel-process it using Hadoop.
HBase is a NoSQL database that is part of the Hadoop ecosystem. You can use other NoSQL databases too.
Your question is misleading: you are comparing Hadoop, which is a framework, to a database...
Hadoop contains a lot of features (including a NoSQL database named HBase) in order to provide you with a big data environment. If you have a massive quantity of data, you will probably use Hadoop (for the MapReduce functionality or the data warehouse capabilities), but it's not certain, depending on what you're processing and how you want to process it. If you're just storing a lot of data and don't need other features (batch data processing, data transformations, ...), a simple NoSQL database is enough.

XML data via API to Land in Hadoop

We are receiving huge amounts of XML data via an API. In order to handle this large data set, we were planning to do it in Hadoop.
I need your help understanding how to efficiently bring the data into Hadoop. What tools are available? Is there a possibility of bringing this data in real time?
Please provide your inputs.
Thanks for your help.
Since you are receiving huge amounts of data, the appropriate way, IMHO, would be to use some aggregation tool like Flume. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of data into your Hadoop cluster from different types of sources.
You can easily write custom sources based on your needs to collect the data. You might find this link helpful to get started; it presents a custom Flume source designed to connect to the Twitter Streaming API and ingest tweets in raw JSON format into HDFS. You could try something similar for your XML data.
You might also wanna have a look at Apache Chukwa which does the same thing.
HTH
Flume, Scribe and Chukwa are tools that can accomplish the above task. However, Flume is the most popularly used of the three, with strong reliability and failover techniques available. Flume also has commercial support available from Cloudera, while the other two do not.
If your only objective is for the data to land in HDFS, you can keep writing the XML responses to disk following some convention such as data-2013-08-05-01.xml and write a daily (or hourly) cron job to import the XML data into HDFS. Running Flume would be overkill if you don't need streaming capabilities. From your question, it is not immediately obvious why you need Hadoop: do you need to run MR jobs?
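A tiny sketch of that date-based naming convention (the helper function is hypothetical, matching the data-2013-08-05-01.xml example above):

```python
from datetime import datetime

# Hypothetical helper mirroring the naming convention from the answer:
# one XML dump per hour, imported into HDFS by a cron job later.
def xml_filename(ts):
    return ts.strftime("data-%Y-%m-%d-%H.xml")

name = xml_filename(datetime(2013, 8, 5, 1))
print(name)  # data-2013-08-05-01.xml

# The cron job would then run something like:
#   hdfs dfs -put data-2013-08-05-01.xml /landing/xml/
```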
You want to put the data into Avro or your choice of protocol buffer for processing. Once you have a buffer matching the format of the text, the Hadoop ecosystem is of much better help in processing the structured data.
Hadoop was originally found most useful for taking one-line entries of log files and structuring/processing the data from there. XML is already structured and requires more processing power to get it into a Hadoop-friendly format.
A more basic solution would be to chunk the XML data and process it using Wukong (Ruby streaming) or a Python alternative. Since you are network-bound by the 3rd-party API, a streaming solution might be more flexible and just as fast in the end for your needs.
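One way to chunk XML without loading the whole payload into memory is a streaming parser; here is a Python sketch using the standard library (the <records>/<record> structure is invented for illustration):

```python
import io
import xml.etree.ElementTree as ET

# A small XML payload standing in for a large API response;
# the element names are made up.
xml_data = b"""<records>
  <record><user>alice</user></record>
  <record><user>bob</user></record>
</records>"""

# iterparse streams over the document element by element, so the
# whole payload never has to fit in memory at once.
users = []
for event, elem in ET.iterparse(io.BytesIO(xml_data), events=("end",)):
    if elem.tag == "record":
        users.append(elem.findtext("user"))
        elem.clear()  # free the chunk we just processed

print(users)  # ['alice', 'bob']
```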

Hadoop Ecosystem - What technological tool combination to use in my scenario? (Details inside)

This might be an interesting question to some:
Given: 2-3 terabytes of data stored in SQL Server (RDBMS); consider it similar to Amazon's data, i.e., users -> what things they saw/clicked to see -> what they bought
Task: Make a recommendation engine (like Amazon's), which displays to the user 'customers who bought this also bought this' -> 'if you liked this, then you might like this' -> and (also) some data mining to predict future buying habits. So on and so forth, basically a reco engine.
Issue: Because of the sheer volume of data (5-6 years' worth of user habit data), I see Hadoop as the ultimate solution. Now the question is, what combination of technological tools to use, i.e.:
HDFS: Underlying file system
HBASE/HIVE/PIG: ?
Mahout: For running some algorithms, which I assume use Map-Reduce (genetic, clustering, data mining, etc.)
- What am I missing? What about loading the RDBMS data for all this processing? (Sqoop for Hadoop?)
- At the end of all this, do I get a list of results (recos), or is there a way to query it directly and report it to the front-end I build in .NET?
I think the answer to this question might just be a good discussion for many people like me who want to kick-start their Hadoop experimentation in the future.
For loading data from the RDBMS, I'd recommend looking into BCP (to export from SQL Server to flat files), then the Hadoop command line for loading into HDFS. Sqoop is good for ongoing imports, but it will be intolerably slow for your initial load.
To query results from Hadoop you can use HBase (assuming you want low-latency queries), which can be queried from C# via its Thrift API.
HBase can fit your scenario.
HDFS is the underlying file system. Nevertheless, you cannot load data into HDFS (in an arbitrary format) and query it in HBase, unless you use the HBase file format (HFile).
HBase has integration with MR.
Pig and Hive also integrate with HBase.
As Chris mentioned, you can use Thrift to perform your queries (get, scan); since this will extract specific user info and not a massive data set, it is more suitable than using MR.
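As an aside, the core "customers who bought this also bought" computation that Mahout's item-based recommenders distribute over Map-Reduce can be sketched in miniature (the baskets are made up):

```python
from collections import defaultdict
from itertools import combinations

# Toy purchase baskets; at the question's scale this co-occurrence
# counting is what Mahout's item-based recommenders distribute.
baskets = [
    {"book", "lamp"},
    {"book", "lamp", "pen"},
    {"book", "pen"},
]

# Count how often each pair of items is bought together.
cooc = defaultdict(int)
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

def also_bought(item):
    # Items most often purchased together with `item`.
    pairs = [(other, n) for (i, other), n in cooc.items() if i == item]
    return sorted(pairs, key=lambda p: -p[1])

print(also_bought("book"))  # [('lamp', 2), ('pen', 2)]
```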
