XML data via API to land in Hadoop

We are receiving huge amounts of XML data via an API. In order to handle this large data set, we were planning to do it in Hadoop.
I need your help in understanding how to efficiently bring the data into Hadoop. What tools are available? Is there a possibility of bringing this data in real time?
Please provide your inputs.
Thanks for your help.

Since you are receiving huge amounts of data, the appropriate way, IMHO, would be to use an aggregation tool like Flume. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of data into your Hadoop cluster from different types of sources.
You can easily write custom sources based on your needs to collect the data. You might find this link helpful to get started. It presents a custom Flume source designed to connect to the Twitter Streaming API and ingest tweets in raw JSON format into HDFS. You could try something similar for your XML data.
You might also want to have a look at Apache Chukwa, which does the same thing.
HTH

Flume, Scribe & Chukwa are tools that can accomplish the above task. However, Flume is the most popularly used of the three. Flume has strong reliability and failover techniques available. Flume also has commercial support available from Cloudera, while the other two do not.

If your only objective is for the data to land in HDFS, you can keep writing the XML responses to disk following some convention such as data-2013-08-05-01.xml and write a daily (or hourly) cron job to import the XML data into HDFS. Running Flume would be overkill if you don't need streaming capabilities. From your question, it is not immediately obvious why you need Hadoop. Do you need to run MR jobs?
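A minimal sketch of that approach (the data-YYYY-MM-DD-HH.xml naming pattern comes from the answer above; the HDFS target directory is an assumption, not a fixed convention):

```python
from datetime import datetime

def xml_landing_filename(ts: datetime) -> str:
    """Build an hourly landing-file name like data-2013-08-05-01.xml."""
    return ts.strftime("data-%Y-%m-%d-%H.xml")

def hdfs_put_command(local_file: str, hdfs_dir: str = "/data/raw/xml") -> str:
    """The command an hourly cron job could run to push a file into HDFS.
    The /data/raw/xml directory is hypothetical."""
    return f"hadoop fs -put {local_file} {hdfs_dir}/{local_file}"

print(xml_landing_filename(datetime(2013, 8, 5, 1)))   # data-2013-08-05-01.xml
print(hdfs_put_command("data-2013-08-05-01.xml"))
```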

You want to put the data into Avro, or your choice of protocol buffer, for processing. Once you have a buffer matching the format of the text, the Hadoop ecosystem is of much better help in processing the structured data.
Hadoop originally was found most useful for taking one-line entries of log files and structuring/processing the data from there. XML is already structured and requires more processing power to get it into a Hadoop-friendly format.
A more basic solution would be to chunk the XML data and process it using Wukong (Ruby streaming) or a Python alternative. Since you are network-bound by the third-party API, a streaming solution might be more flexible and just as fast in the end for your needs.
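The chunking step can be sketched in Python with the standard library; the <record> element name and fields here are hypothetical, so substitute whatever your API actually returns:

```python
import xml.etree.ElementTree as ET

def xml_to_flat_records(xml_text: str):
    """Parse an XML payload and flatten each record element into a dict,
    ready to be emitted one line per record for a streaming job."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in rec} for rec in root.iter("record")]

payload = """
<response>
  <record><id>1</id><value>foo</value></record>
  <record><id>2</id><value>bar</value></record>
</response>
"""
print(xml_to_flat_records(payload))
```

Each flattened dict can then be serialized as one line (JSON, TSV, Avro) so downstream streaming tools see the simple record-per-line shape Hadoop handles best.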


Elasticsearch read data from Apache Hadoop

We are trying to implement Elasticsearch in our big data environment. Currently we are running Apache Hadoop 2.7, including Hive and Spark. Data is stored in Parquet format in Hadoop.
When we implement ELK in our environment, can we keep storing data only in Hadoop HDFS? Or must we extract data from Hadoop and import it into Elasticsearch so it can create the indexes, leaving us with a duplicated dataset in the system (Hadoop HDFS and Elasticsearch)?
Thank you.
Sorry if this is a bit long, but I hope that would help!
What Elasticsearch is?
Elasticsearch is a search engine. Period. A search solution. Or rather, a type of database or datastore which helps you organize your data in a way that can help you perform activities like data discovery, or build search applications for your organisation.
Although you can also run a lot of analytical queries and build analytical solutions around it, there are certain limitations.
The nature of Elasticsearch and the data structures used in it are so different that you need to push (ingest) the data into it in order to perform all search/data-discovery/analytical activities. It has its own file system and data structures which manage/store the data specifically for efficient searching.
So yes, there will be duplication of data.
What Elasticsearch is not?
It is not to be used as an analytical solution; although it does come with a lot of aggregation queries, it is not as expressive as a processing engine like Apache Spark or data virtualisation tools like Denodo or Presto.
It is not a storage solution like HDFS or S3, to be used as a data lake for the organization.
It is not to be used as a transactional database to replace RDBMS solutions.
Most of the time, organisations ingest data into ES from various different sources, like OLAP systems, RDBMSs, NoSQL databases, CMSs and messaging queues, so that they can search the data more efficiently.
In a way, for most use cases, ES is never the primary datasource.
How organisations use it?
Build search solutions: for example, if you wish to provide or build any e-commerce solution, you can have its search implementation managed by Elasticsearch.
Enterprise search solutions (internal and external) that let IT people be more productive and let customers find the required documentation, knowledge base articles, downloadable PDFs, text, etc. for their products, e.g. installation docs, configuration docs, release docs, new product docs. All the content would be assembled from various different sources in a company and pushed into ES so that it becomes searchable.
Ingest data, e.g. logs from application servers and messaging queues, in order to perform logging and monitoring activities (alerts, fraud analysis).
So the two most common usages of ES are searching, and logging and monitoring activities. Basically, real-time activities.
How it is different from Hadoop?
Most organisations are increasingly leveraging Hadoop for its file system, i.e. HDFS, to be used as a data store, while utilising Spark or Hive for data processing. Mostly to build heavy data-analytical solutions, for which ES has limitations.
Hadoop has the capability to store all file formats (of course you need to make use of Parquet or other formats for efficient storage), whereas Elasticsearch only makes use of JSON. This makes Hadoop, along with S3 and other file systems, a default industry standard as a data-lake or data-archival storage tool.
If you are storing data in Hadoop, you would probably have to use other frameworks such as Spark, Giraph or Hive to do efficient data processing, transform data and do complex analytical processing, for which ES has limitations. ES at its core is a full-text retrieval/search engine.
Hadoop for search
You would probably need to run MapReduce or Spark jobs and write tons of pattern-matching code to find the documents or folders containing any keyword you want to search. Every search would result in one such job, which would not be practical.
Even if you transform and organise the data in such a way that you can leverage Hive, it would still not be as efficient as Elasticsearch for text processing.
Here is a link that can help you understand core data structure used in Elasticsearch and why text search is faster/different/efficient.
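In short, Elasticsearch's core data structure is the inverted index, which maps each term to the documents that contain it, so a keyword lookup is a dictionary access rather than a full scan. A toy version (not Lucene's actual implementation, just the idea):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "hadoop stores raw data", 2: "elasticsearch indexes data for search"}
index = build_inverted_index(docs)
print(sorted(index["data"]))   # documents containing "data" -> [1, 2]
```

A MapReduce keyword search has to read every document on every query; the inverted index pays the scan cost once at ingest time, which is why search in ES is fast.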
How can we make use of Hadoop and Elasticsearch?
Perhaps the diagram mentioned in this link could be useful.
Basically, you can set up ingestion pipelines, process raw data from Hadoop, transform it and thereby index it in Elasticsearch so that you can make use of its search capabilities.
Take a look at this link to understand how you can make use of Spark with Elasticsearch and achieve two-way communication between them.
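To give a flavour of the indexing side of such a pipeline: the Elasticsearch _bulk endpoint accepts newline-delimited JSON, with an action line preceding each document. A sketch of rendering processed Hadoop records into that format (the "events" index name is hypothetical):

```python
import json

def to_bulk_ndjson(records, index_name="events"):
    """Render records in the newline-delimited format the Elasticsearch
    _bulk endpoint expects: one action line, then the document line."""
    lines = []
    for rec in records:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(rec))
    return "\n".join(lines) + "\n"   # _bulk requires a trailing newline

body = to_bulk_ndjson([{"user": "a", "ts": 1}, {"user": "b", "ts": 2}])
print(body)
```

In practice the es-hadoop/es-spark connector does this marshalling for you; the sketch only shows what crosses the wire.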
Hope that helps!

Big Data - Lambda Architecture and Storing Raw Data

Currently I am using Cassandra for storing data for my functional use cases (displaying time-series and consolidated data to users). Cassandra is very good at it, if you design your data model correctly (query-driven).
Basically, data is ingested from RabbitMQ by Storm and saved to Cassandra.
Lambda architecture is just a design pattern for big-data architectures and is technology independent; the layers can be combined:
Cassandra is a database that can be used as the serving layer & batch layer: I'm using it for my analytics purposes with Spark too (because the data is already well formatted, like time series, in Cassandra).
As far as I know, one huge thing to consider is STORING your raw data before any processing. You need to do this in order to recover from any problem, human-caused (algorithm problems, a DROP TABLE in PROD, stuff like that can happen...), for future use, or mainly for batch aggregation.
And here I'm facing a choice :
Currently I'm storing it in Cassandra, but I'm considering switching to storing the raw data in HDFS for different reasons: the raw data is "dead", it occupies Cassandra tokens, and it uses resources (mainly disk space) in the Cassandra cluster.
Can someone help me in that choice ?
HDFS makes perfect sense. Some considerations:
Serialization of data - use ORC/Parquet, or Avro if the format is variable
Compression of data - always compress
HDFS does not like too many small files - in the streaming case, have a job which aggregates & writes a single large file at a regular interval
Have a good partitioning scheme so you can get to the data you want on HDFS without wasting resources
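The last two points can be sketched quickly; the year/month/day layout below is one common Hive-style convention, not the only option:

```python
from datetime import date

def partition_path(base: str, d: date) -> str:
    """Hive-style partition directory so queries can prune by date."""
    return f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}"

def aggregate_small_files(chunks):
    """Concatenate many small payloads into one large file body, as a
    periodic compaction job would do before writing a single HDFS file."""
    return "".join(chunks)

print(partition_path("/data/raw", date(2017, 3, 9)))  # /data/raw/year=2017/month=03/day=09
print(len(aggregate_small_files(["a" * 10] * 5)))     # 50
```

With such a layout, a query over one day touches one directory instead of scanning the whole dataset, and the compaction step keeps the NameNode's file count manageable.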
HDFS is the better idea for binary files. Cassandra is OK for storing the locations where the files are, etc., but pure files need to be modelled really, really well, so most people just give up on Cassandra and complain that it sucks. It can still be done; if you want to do it, there are some examples like:
https://academy.datastax.com/resources/datastax-reference-application-killrvideo
that might help you to get started.
Also, the question is more material for Quora or even http://www.mail-archive.com/user@cassandra.apache.org/ - this question has been asked there a lot of times.

Suggestions for noSQL selection for mass data export

We have billions of records in a relational data format (e.g. transaction id, user name, user id and some other fields). My requirement is to create a system where users can request data exports from this data store (the user will provide some filters like user id, date and so on). A typical exported file will have thousands to hundreds of thousands to millions of records, based on the selected filters (the output file will be CSV or a similar format).
Other than the raw data, I am also looking for some dynamic aggregation on a few of the fields during the data export.
The typical time between the user submitting a request and the exported data file becoming available should be within 2-3 minutes (max 4-5 minutes).
I am seeking suggestions on backend NoSQL stores for this use case. I've used Hadoop MapReduce so far, but a Hadoop batch job running MapReduce over typical HDFS data might not meet the expected SLA, in my opinion.
Another option is Spark, which I've never used, but it should be way faster than a typical Hadoop MapReduce batch job.
We've already tried a production-grade RDBMS/OLTP instance, but it clearly seems not to be the correct option, due to the size of the data we are exporting and the dynamic aggregation.
Any suggestion on using Spark here? Or any other better NoSQL?
In summary: the SLA, dynamic aggregation and raw data volume (millions of records) are the requirement considerations here.
If the system only needs to export data after doing some ETL (aggregations, filtering and transformations), then the answer is very straightforward: Apache Spark is the best fit. You would have to fine-tune the system and decide whether you want to use only memory, or memory + disk, or serialization, etc. However, most of the time one needs to think about other aspects too; I am considering them as well.
This is a wide topic of discussion and it involves many aspects, such as the aggregations involved, search-related queries (if any), and development time. From the description, it seems to be an interactive/near-real-time system. Another aspect is whether any analysis is involved. And another important point is the type of system (OLTP/OLAP, reporting only, etc.).
I see there are two questions involved -
Which computing/data processing engine to use?
Which data storage/NoSQL?
- Data processing -
Apache Spark would be the best choice for computing. We are using it for the same purpose; along with the filtering, we also have XML transformations to perform, which are also done in Spark. It is super fast compared to Hadoop MapReduce. Spark can run standalone, and it can also run on top of Hadoop.
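The filter-then-aggregate shape of such an export looks like this (plain Python standing in for the equivalent Spark filter/groupBy/agg pipeline; the user_id and amount field names are made up for illustration):

```python
records = [
    {"user_id": 1, "amount": 10},
    {"user_id": 2, "amount": 5},
    {"user_id": 1, "amount": 7},
]

# Apply the user-supplied filter, then compute the dynamic aggregation
# in the same pass, which is what a Spark job would distribute.
selected = [r for r in records if r["user_id"] == 1]
count = len(selected)
total = sum(r["amount"] for r in selected)
print(count, total)   # 2 17
```

In Spark the same logic runs partition-by-partition across the cluster, which is where the SLA headroom over a MapReduce batch job comes from.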
- Storage -
There are many NoSQL solutions available. Selection depends upon many factors such as volume, the aggregations involved, search-related queries, etc.
Hadoop - You can go with Hadoop, with HDFS as the storage system. It has many benefits, as you get the entire Hadoop ecosystem. If you have analysts/data scientists who need to get insights from the data or play with the data, then this would be the better choice, as you get different tools such as Hive/Impala. Also, resource management would be easy. But for some applications it can be too much.
Cassandra - Cassandra is a storage engine that has solved the problems of distribution and availability while maintaining scale and performance. It works wonders when used with Spark, for example for performing complex aggregations. By the way, we are using it. For visualization (to view data for analysis), options are Apache Zeppelin and Tableau (lots of options).
Elasticsearch - Elasticsearch is also a suitable option if your storage is in the range of a few TBs, up to 10 TB. It comes with Kibana (UI), which provides limited analytics capabilities, including aggregations. Development time is minimal; it is very quick to implement.
So, depending upon your requirements, I would suggest Apache Spark for data processing (transformations/filtering/aggregations), and you may also need to consider another technology for storage and data visualization.

Processing HDFS files

Let me begin by saying I am a complete newbie to Hadoop. My requirement is to analyse server log files using Hadoop infrastructure. The first step I took in this direction was to stream the log files and dump them raw into my single node Hadoop cluster using Flume HDFS sink. Now I have a bunch of files with records which look something like this:
timestamp req-id level module-name message
My next step is to parse the files (separate out the fields) and store them back so that they are ready for searching.
What approach should I use for this? Can I do this using Hive? (Sorry if the question is naive.) The information available on the internet is overwhelming.
You can use HCatalog or Impala for faster querying.
From your explanation, you have time-series data. Hadoop with HDFS itself is not meant for random access or querying. You can use HBase, a database for Hadoop that uses HDFS as its backend filesystem. It is good for random access.
Also, for your need to parse and rearrange the data, you can make use of Hadoop's MapReduce. HBase has built-in support for this: HBase can be used as the input/output of a MapReduce job.
You can get basic information here. For a better understanding, try the HBase: The Definitive Guide / HBase in Action books.
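Whichever engine you choose, the parsing step itself is a simple field split on the record format shown in the question. A streaming mapper could look like this (field names are taken from the question; the single-space delimiter is an assumption about your log format):

```python
def parse_log_line(line: str) -> dict:
    """Split 'timestamp req-id level module-name message' into fields.
    The message may contain spaces, so split at most four times."""
    timestamp, req_id, level, module, message = line.split(" ", 4)
    return {"timestamp": timestamp, "req_id": req_id, "level": level,
            "module": module, "message": message}

rec = parse_log_line("1385000000 42 ERROR auth login failed for user")
print(rec)
```

Once each line is split into fields like this, the output can be written back as delimited text and exposed to Hive as an external table, or loaded into HBase for random access.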

Hadoop Ecosystem - What technological tool combination to use in my scenario? (Details Inside)

This might be an interesting question to some:
Given: 2-3 terabytes of data stored in SQL Server (RDBMS); consider it similar to Amazon's data, i.e., users -> what things they saw/clicked to see -> what they bought
Task: Make a recommendation engine (like Amazon's), which shows the user "customers who bought this also bought this" -> "if you liked this, then you might like this" -> (also) some data mining to predict future buying habits. So on and so forth; basically a reco engine.
Issue: Because of the sheer volume of data (5-6 years' worth of user habit data), I see Hadoop as the ultimate solution. Now the question is, what combination of technological tools should I use? i.e.,
HDFS: Underlying FIle system
HBASE/HIVE/PIG: ?
Mahout: For running some algorithms, which I assume uses Map-Reduce (genetic, cluster, data mining etc.)
- What am I missing? What about loading the RDBMS data for all this processing? (Sqoop for Hadoop?)
- At the end of all this, do I get a list of results (recos), or is there a way to query it directly and report it to the front-end I build in .NET?
I think the answer to this question just might be a good discussion for many people like me who, in the future, want to kick-start their Hadoop experimentation.
For loading data from the RDBMS, I'd recommend looking into BCP (to export from SQL Server to a flat file) and then the Hadoop command line for loading into HDFS. Sqoop is good for ongoing data, but it's going to be intolerably slow for your initial load.
To query results from Hadoop you can use HBase (assuming you want low-latency queries), which can be queried from C# via its Thrift API.
HBase can fit your scenario.
HDFS is the underlying file system. Nevertheless, you cannot load the data into HDFS (in an arbitrary format) and query it in HBase, unless you use the HBase file format (HFile).
HBase has integration with MR.
Pig and Hive also integrate with HBase.
As Chris mentioned, you can use Thrift to perform your queries (get, scan); since this will extract specific user info and not a massive data set, it is more suitable than using MR.
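For those low-latency gets and scans, the row-key design matters most. One common pattern (hypothetical, to be adapted to your actual access paths) is a composite key of user id plus a reversed timestamp, so the newest events for a user sort first in a scan:

```python
MAX_TS = 10**13  # hypothetical upper bound for millisecond timestamps

def row_key(user_id: str, ts_millis: int) -> str:
    """Composite HBase row key: a scan over one user's prefix returns the
    newest rows first, because the reversed timestamp sorts descending."""
    return f"{user_id}:{MAX_TS - ts_millis:013d}"

keys = sorted(row_key("u1", t) for t in [1000, 3000, 2000])
print(keys[0])  # the key for the newest event (ts=3000) sorts first
```

Since HBase stores rows sorted by key, a prefix scan on "u1:" is a short sequential read, which is exactly the low-latency access pattern the Thrift get/scan API serves well.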
