Streaming data [Hadoop/MapReduce] - What are the challenges?

I have read about streaming data in many places, but I am just trying to understand the challenges faced when processing it with the MapReduce technique,
i.e. the reason behind the existence of frameworks like Apache Flume, Apache Storm, etc.
Please share your advice and thoughts.
Thanks,
Ranit

There are many technologies out there, and many of them run on the Hadoop framework.
The older Hadoop services like Hive tend to be slow, and are usually used for batch jobs, not for streaming.
As streaming becomes more and more a necessity, other services have surfaced like Storm or Spark that are designed for faster execution and integration with messaging queues like Kafka for streaming.
In data analytics, though, most processing is not all real time: historical data may be processed in batch mode to extract models that are then used for real-time analytics, so a 'streaming' system is usually based on a Lambda Architecture (http://lambda-architecture.net/).
A service like Spark tries to integrate all of the components, with Spark Streaming for the speed layer, Spark SQL for the serving layer, and Spark MLlib for the modeling, all based on the Hadoop Distributed File System (HDFS) for replicated, large-volume storage.
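As a rough illustration of the batch/serving side of that architecture, here is a minimal Spark SQL sketch in Java; the HDFS paths, column names, and aggregation are hypothetical, not taken from any particular system.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BatchLayerSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("lambda-batch-layer")
                .getOrCreate();

        // Historical events that have landed on HDFS (e.g. via Flume) as Parquet.
        Dataset<Row> events = spark.read().parquet("hdfs:///data/events/");
        events.createOrReplaceTempView("events");

        // Pre-compute an aggregate view for the serving layer to query.
        Dataset<Row> dailyCounts = spark.sql(
                "SELECT event_date, COUNT(*) AS cnt FROM events GROUP BY event_date");
        dailyCounts.write().mode("overwrite").parquet("hdfs:///views/daily_counts/");

        spark.stop();
    }
}
```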
Flume helps in directing the data from a source to HDFS for raw storage, but in order to process it, Storm or Spark is used.
Hope that helps.

Your question is open ended, but I assume you want to understand the challenges of processing streaming data in a MapReduce environment.
1) MapReduce is primarily designed for batch processing: it processes high volumes of data at rest on disk.
2) Streaming data is high-velocity data coming from various sources such as web application click streams, social media logs, Twitter tags, and application logs.
3) A stream of events might be processed either in a stateless manner (assuming every event is unique) or in a stateful manner (e.g. collect the data for 2 seconds and then process it; see the sketch after this list), but batch applications have no such requirement.
4) Streaming applications want delivery/processing guarantees. For example, the framework must provide an "exactly once" delivery/processing mechanism so that it processes every stream event without fail. This is not a challenge in batch processing, since all the data is available locally.
5) External connectors: streaming frameworks must support external connectivity to read data in real time from the various sources discussed in (2). This is not a challenge in batch, since the data is locally available.
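To make point 3 concrete, here is a minimal sketch of that kind of stateful, windowed processing using Kafka Streams; the topic name, application id, and broker address are made up for illustration, and newer clients would use TimeWindows.ofSizeWithNoGrace instead of TimeWindows.of.

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class WindowedCountSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "clickstream-window-count");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> clicks = builder.stream("click-events");

        // Stateful step: count events per key within 2-second windows.
        clicks.groupByKey()
              .windowedBy(TimeWindows.of(Duration.ofSeconds(2)))
              .count()
              .toStream()
              .foreach((windowedKey, count) ->
                      System.out.println(windowedKey + " -> " + count));

        new KafkaStreams(builder.build(), props).start();
    }
}
```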
Hope this helps.


Relevance of Hadoop & Streaming solutions when Spark exists

I am starting a big data initiative for my startup. In 2018, is there any reason to use Hadoop at all, since Spark is touted to be way faster because it does not write intermediate data to disk the way Hadoop's MapReduce does?
I realize Spark has a higher need for RAM, but that would just be a one-time CAPEX cost that would pay for itself?
In general, unless there are legacy projects, why should one pick up Hadoop at all, since Spark is available?
Would appreciate real-world comparisons of the two, gotchas, etc.
Alternatively, are there use cases that Hadoop can solve but Spark cannot?
---------- comment below for actual problem ----------
I would use YARN as the resource manager with HDFS as the file system for Spark.
Also, I realize that Spark intersects quite a bit with the Hadoop ecosystem.
Comparisons are:
Mapreduce vs Spark code
SparkSQL vs Hive
People mention Pig too, but not a whole lot of people want to learn its custom querying. And if I had to use Pig as a data scientist, why wouldn't I use, say, Apache NiFi with Hadoop?
Also, I am not sure how Spark handles the following:
If data does not fit in RAM, then what? Back to a disk-based paradigm (not talking about streaming use cases here), so no better than MapReduce? How does Tez make MR2 better?
Hadoop 3 has support for Erasure coding to reduce data replication. What does Spark do?
Where I am unclear is the plethora of overlapping choices. For example, streaming alone has:
Spark streaming
Apache storm
Apache Samza
Kafka streams
Commercial CEP tools (Oracle CEP, TIBCO, etc.)
A lot of them use DAGs similar to Spark's core engine, so it is hard to pick one over the other.
Use case:
The app sends data to middleware until the end of an event. An event can end based on a specified periodicity or on a business condition being met.
The middleware must show the real-time addition of a value (simplifying here) sent by users from their app instances. It is accepted that the middleware value is a floor of the actual sum of values and the real value can be higher. I plan to use Kafka Streams here, with a consumer that adds all the inputs with minimal latency; the consumer posts to a cache which is polled by the apps to show the current additive value.
The middleware logs all input.
After the event ends, a big data job scans through the log data and database records to get an accurate count by comparing all DB values and log entries (an audit), and compares them to the value Kafka showed. The value calculated by this scheme is the final value.
Design choices:
I like Kafka because it decouples the application from the middleware and provides low-latency, high-throughput messaging. Streams code is easy to write. Happy for someone to counter-argue for Spark Streaming, Apache Storm, or Apache Samza instead.
The application itself is Java code on a Tomcat server with REST endpoints for iOS/Android clients. Not doing client caching due to the explicit liveness requirement on the additive value.
You're confusing Hadoop with just MapReduce. Hadoop is an ecosystem of MapReduce, HDFS, and YARN.
First of all, Spark doesn't have a filesystem. That's primarily why Hadoop is nice, in my book. Sure, you can use S3, or many other cloud storages, or bare metal data stores like Ceph, or GlusterFS, but from what I've researched, HDFS is by far the fastest when processing data.
Maybe you're not familiar with the concept of rack locality that YARN offers. If you use Spark standalone mode with any file system not mounted under the Spark executors, then all your data requests will need to be pulled over a network connection, saturating the network and causing a bottleneck, regardless of memory. Compare that to the Spark executors running on the YARN NodeManagers; HDFS DataNodes are ideally also NodeManagers.
A similar problem: people say Hive is slow and SparkSQL is faster. Well, that's true if you run Hive with MapReduce instead of the Tez or Spark execution modes.
Now, if you want streaming and real-time events rather than the batch world commonly associated with Hadoop, you might want to research the SMACK stack.
Update
Pig as a data scientist why wouldn’t I use say an Apache NiFi with Hadoop
Pig is not comparable to NiFi.
You can use NiFi; nothing is stopping you. It would run closer to real-time than Spark micro batches. And it is a good tool to pair with Kafka.
plethora of overlapping choices
Yes, and you didn't even list them all... It's up to some big data architect in your company to come up with a solution. You'll find that vendor support from Confluent is mostly for Kafka; I haven't seen them talk about Samza much. Hortonworks will support Storm, NiFi, and Spark, but they aren't running the latest version of Kafka if you want fancy features like KSQL. StreamSets is a similar company offering a tool that competes with NiFi, with employees who have backgrounds in other batch/streaming Apache projects.
Storm and Samza are two ways to do the same thing, as far as I know. I think Flink is more programmer-friendly than Storm. I don't have experience with Samza, though I work closely with people who primarily use Kafka Streams rather than it. And Kafka Streams isn't DAG-based; it's just a high-level Kafka library, embeddable in any JVM application.
If data does not fit in RAM then what ?
By default, it spills to disk... Spark has parameters to configure this if you don't want disk to be touched, in which case your jobs die of OOM more quickly, obviously.
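As one concrete illustration of that memory-versus-disk choice, here is a minimal sketch of the storage level used when caching a Dataset; the path is hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class PersistSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("persist-sketch").getOrCreate();
        Dataset<Row> logs = spark.read().parquet("hdfs:///data/logs/");

        // Spill cached partitions that do not fit in executor memory to local disk.
        logs.persist(StorageLevel.MEMORY_AND_DISK());

        // Or keep everything in memory and recompute whatever does not fit:
        // logs.persist(StorageLevel.MEMORY_ONLY());

        System.out.println(logs.count());
        spark.stop();
    }
}
```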
How does Tez make MR2 better?
Tez isn't MR. It creates more optimized DAGs like Spark does. Go read about it.
Hadoop 3 has support for Erasure coding to reduce data replication. What does Spark do?
Spark has no filesystem. We already covered this. Erasure encoding is primarily for data at-rest, not during processing. I actually don't know if Spark supports Hadoop 3, yet.
Application itself is Java code on Tomcat server with REST end points for iOS/ Android clients
Personally, I would use Kafka Streams here because 1) you are using Java already and 2) it's a standalone thread in your code that lets you read/publish data from Kafka without Hadoop/YARN or Spark clusters. It's not clear what your question has to do with Hadoop from the client-server architecture you listed, but feel free to string an additional line from a Kafka topic to a database/analytics engine of your choice. The Kafka Connect framework has many connectors for you to choose from.
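As a rough sketch of that idea applied to your additive-value use case, a Kafka Streams app could keep a running total per event and publish every update to a topic feeding the cache the apps poll; the topic names and serdes below are assumptions, not from your question.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class RunningTotalSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-running-total");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Key = event id, value = amount reported by one app instance.
        KTable<String, Long> totals = builder
                .stream("event-values", Consumed.with(Serdes.String(), Serdes.Long()))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
                .reduce(Long::sum);

        // Publish every updated total; a lightweight consumer can maintain
        // the cache that the mobile apps poll.
        totals.toStream().to("event-running-totals",
                Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```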
You could also use NiFi as your mobile REST API to just ExposeHTTP and send requests to it, then route flows based on attributes in the data. Then, manipulate and publish to Kafka as well as other systems.
Spark and Hadoop work in a pretty similar way when it comes to solving MapReduce problems.
Hadoop is quite relevant from the HDFS point of view: HDFS is a well-known solution for big data storage. But your question is about MapReduce.
Spark is the best option if you are talking about good machines with a really good configuration of memory and network throughput. But we know that kind of machine is expensive, and sometimes your best option is to use Hadoop to process your data. Spark is great and fast, but you can go crazy with memory issues if you don't have a good cluster and try to fit too much data in memory. In that case Hadoop can be better. But year after year this problem becomes less relevant.
So Hadoop is here to complement Spark; Hadoop is not only MapReduce, Hadoop is an ecosystem. Spark doesn't have a distributed file system, and to work well it needs one; Spark doesn't have a resource manager either, while Hadoop has one, called YARN, and Spark in cluster mode needs a resource manager.
Conclusion
Hadoop is still relevant as an ecosystem, but MapReduce on its own, I can say, is not really used anymore.

Apache NiFi for ETL

How effective is it to use Apache NiFi for an ETL process with HDFS as the source and an Oracle DB as the destination? What are the limitations of Apache NiFi compared to other ETL tools such as Pentaho, DataStage, etc.?
Main advantages of NiFi
The main advantages of NiFi:
An intuitive GUI, which allows for easy inspection of the data
Strong delivery guarantees
Low latency; it supports both batch and streaming use cases
It can handle any format; it is not limited to SQL tables and can also move log files, etc.
It is schema-aware and can share schemas with solutions like Kafka, Flink, and Spark
Main limitation of NiFi
NiFi is really a tool for moving data around. You can do enrichment of individual records, but it is typically meant to do 'EtL' with a small t. A typical thing that you would not want to do in NiFi is joining two dynamic data sources.
For joining tables, tools like Spark, Hive, or classical ETL alternatives are often used.
For joining streams, tools like Flink and Spark Streaming are often used.
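As an illustration of the kind of join you would hand off to Spark rather than NiFi, here is a minimal Spark SQL sketch in Java; the paths and column names are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JoinSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("join-sketch").getOrCreate();

        Dataset<Row> orders = spark.read().parquet("hdfs:///warehouse/orders/");
        Dataset<Row> customers = spark.read().parquet("hdfs:///warehouse/customers/");

        // Joining two dynamic data sources: the kind of heavy transformation
        // NiFi is not designed for.
        Dataset<Row> enriched = orders.join(customers,
                orders.col("customer_id").equalTo(customers.col("id")));

        enriched.write().mode("overwrite").parquet("hdfs:///warehouse/orders_enriched/");
        spark.stop();
    }
}
```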
Conclusion
NiFi is a great tool; you just need to make sure you use it for the right use case. Where needed, you can use other tools to complement it.
Extra strong full disclosure: I am an employee of Cloudera, the company that supports NiFi and other projects such as Spark and Flink. I have used other ETL tools before, but not to the same extent as NiFi.
I am not sure about Sqoop, but I can explain the benefits of using Apache NiFi. In your case the data in HDFS could be of any format (unstructured); NiFi has the capability to process it and bring it into a format of your choice, so that you can save it directly to any RDBMS.
NiFi handles back-pressure in a very effective way to provide lossless transmission.
One of the critical features that NiFi provides that our competitors generally don't is the ability to stop jobs and examine the flow and downstream systems while it's running. For you, this means you can test the flow against a test HDFS folder and a test Oracle DB, let some data go through, pause the flow and poke around Oracle to make sure it's to your liking after a matter of seconds or minutes instead of waiting for a "job to complete." It makes the process extremely agile.
Actually, NiFi is a very good tool. You can easily manipulate processors, and you can migrate huge amounts of data in a short time.
But for destinations such as an RDBMS, there are always problems. I used to have a lot of problems with "non-killing" threads; you have to be very careful about stopping processes and about the configuration of processors. Some processors, like QueryDatabaseTable, consume huge amounts of memory and can bring the server down.

How can I send data from Node-RED to Hadoop?

I need a mechanism to send data from Node-RED to be stored in HDFS (Hadoop).
I prefer the data to be streamed. I am thinking about using the 'websocket out' node to write the data and a Flume agent to read it.
I am new to Node-RED.
Could you please let me know whether I am heading in the right direction, and clarify with some details if I am not? Any alternative approach would also be fine.
Update: Node-RED offers a 'bluemixhdfs' node, which is exclusively tied to IBM Bluemix, whereas I am using only vanilla Hadoop.
I recently had a similar issue in a small project of mine, so I will try to explain my approach.
A little background: in the application, I had to do some processing on real-time streaming data from different data sources. At the same time, I also needed to store the streaming data for future processing.
I used the Apache Kafka message broker as an integration agent between Node-RED and HDFS (and also as the source for the Apache Spark Streaming processing engine).
In Node-RED, I used the Kafka node to publish streaming data from the different data sources to separate topics in Kafka.
[Figure: Node-RED flow with streaming data sources and Apache Kafka]
The HDFS Sink Connector, a Kafka Connect component, is then used to store the streaming data in HDFS.
[Figure: Flow architecture for Node-RED to HDFS and Spark Streaming using the Kafka message broker]
This approach can also be adopted when many streaming data sources, such as IoT sensors, stock market data, social media data, weather APIs, etc., are to be connected in a single flow using Node-RED, and HDFS is then used to store these data for further processing.
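For illustration, the Kafka side of such a flow boils down to publishing each reading to a topic, which the HDFS Sink Connector then drains to files on HDFS. Below is a minimal Java producer showing the shape of what the Node-RED Kafka node publishes; the broker address, topic name, and payload are made up.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class StreamPublisherSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One JSON message per reading; the HDFS Sink Connector consumes
            // this topic and lands the records as files on HDFS.
            producer.send(new ProducerRecord<>("sensor-readings",
                    "sensor-42", "{\"temperature\": 21.5, \"ts\": 1690000000}"));
        }
    }
}
```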
I'm afraid I'm not a Hadoop expert, so I probably can't answer directly. However, it looks like Kafka supports WebSockets, and this should be reasonably performant.
Depending on your architecture, though, you should pay some attention to WebSocket security. Unless NR and Hadoop are both on a private, secured network, WebSockets may be tricky to secure properly.
I think WebSocket performance would be reasonable as long as the data size per transaction isn't too large (KB rather than GB). You will need to do some testing, though, as there are too many factors influencing Node-RED's performance to easily predict whether it will give you the performance you require.
Node-RED supports a great many types of connectivity, so if WebSockets don't work in your architecture, there are plenty of others, such as UNIX pipes or TCP or UDP connections.

Architecture for syncing logs to Hadoop

I have different environments across a few cloud providers: Windows servers and Linux servers on Rackspace, AWS, etc. There is a firewall between those and the internal network.
I need to build a real-time setup in which all newly generated IIS logs and Apache logs are synced to an internal big data environment.
I know there are tools like Splunk or Sumo Logic that might help, but we are required to implement this logic with open-source technologies. Because of the firewall, I am assuming I can only pull the logs rather than have the cloud providers push them.
Can anyone share the rule of thumb or common architecture for syncing tons of logs in NRT (near real time)? I have heard of Apache Flume and Kafka and am wondering whether those are required, or whether it is just a matter of using something like rsync.
You can use rsync to get the logs, but you can't analyze them the way Spark Streaming or Apache Storm does.
You can go ahead with one of these two options (a sketch of the first follows below).
Apache Spark Streaming + Kafka
OR
Apache Storm + Kafka
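As a rough sketch of the first option, the Spark Streaming job below consumes log lines from Kafka with the direct stream API and counts them per micro-batch; the broker address, topic name, group id, and checkpoint path are assumptions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class LogStreamSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("nrt-log-sync");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));
        // Checkpointing to HDFS is part of the reliability trade-off discussed below.
        jssc.checkpoint("hdfs:///checkpoints/nrt-log-sync");

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "log-sync");

        JavaInputDStream<ConsumerRecord<String, String>> logs = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                        Collections.singletonList("iis-apache-logs"), kafkaParams));

        // Trivial per-batch processing; a real job would parse and store the lines.
        logs.map(ConsumerRecord::value).count().print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```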
Have a look at this article about the integration approaches for these two options.
Have a look at this presentation, which covers an in-depth analysis of Spark Streaming and Apache Storm.
Performance depends on your use case. Spark Streaming is said to be 40x faster than Storm processing. But if you add "reliability" as a key criterion, then data should be moved into HDFS first before being processed by Spark Streaming, which will reduce the final throughput.
Reliability Limitations: Apache Storm
Exactly once processing requires a durable data source.
At least once processing requires a reliable data source.
An unreliable data source can be wrapped to provide additional guarantees.
With durable and reliable sources, Storm will not drop data.
Common pattern: Back unreliable data sources with Apache Kafka (minor latency hit traded for 100% durability).
Reliability Limitations: Spark Streaming
Fault tolerance and reliability guarantees require HDFS-backed data source.
Moving data to HDFS prior to stream processing introduces additional latency.
Network data sources (Kafka, etc.) are vulnerable to data loss in the event of a worker node failure.

Which technologies are available for streaming data from social media to Hadoop?

I am searching for technologies that I can use to stream data from social media to Hadoop.
I searched and found these technologies:
Flume.
Storm.
Kafka.
Which tool is the best, and why? Is anyone familiar with any other tools?
Most likely, you will want to use Flume, as it is built to work with HDFS. However, as with all things, it depends.
Kafka is basically a queuing system that is usually used to persist data in the event of a failure in your analytics architecture. If this sounds like what you need, it might also be worth looking into RabbitMQ, ZeroMQ, or maybe Kestrel.
Storm is used for complex event processing. If you use Storm, you will be using ZeroMQ under the hood and will likely have to set up a spout that is hooked up to Kafka or RabbitMQ. If you need to do complicated munging of the data before storage, this might be the right option. There are other options you can use too, like Spark. I'm inclined to suggest Storm purely out of personal preference. I heard that LinkedIn was releasing a real-time complex event processing framework as well, but I can't remember its name. I'll update the post when I can find it.
On a different note, if you're asking this question, it might be because you haven't built this thing yet. If that is the case, you might want to look into something other than Hadoop if you need streaming. The ecosystem is rapidly expanding, and there are probably many ways to do what you want to do.
Apache Kafka is a distributed messaging system. Very briefly: you push (publish) messages into a Kafka queue using a Kafka producer, and on the other end you consume them using a Kafka consumer (subscriber). The messages/feeds can be divided into categories called topics. You can run Kafka as a cluster, which makes it very scalable and allows it to be expanded without any downtime.
It could be a nice choice for holding your social media streams. Kafka retains the messages pushed to it for a configurable time, and the best part is that their documentation says:
Kafka's performance is effectively constant with respect to data size so retaining lots of data is not a problem.
Check out the documentation for better visibility.
Now, Storm is a very scalable, fault-tolerant, distributed computation system which can easily be integrated with any queueing system (like Kafka) or database (HDFS/Cassandra, etc.). So you can feed your messages to a Storm cluster for further processing based on your requirements. There is something called KafkaSpout which does a seamless integration between Storm and Kafka.
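As a rough sketch of that Storm + Kafka integration using the storm-kafka-client module, the topology below wires a KafkaSpout to a trivial bolt; the broker address, topic, and component names are assumptions.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class SocialStreamTopology {

    // Minimal bolt that just logs each record coming off the Kafka topic.
    public static class LogBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println(tuple.getStringByField("value"));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder("localhost:9092", "social-feed").build();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig), 1);
        builder.setBolt("log-bolt", new LogBolt(), 2).shuffleGrouping("kafka-spout");

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("social-stream", new Config(), builder.createTopology());
        Thread.sleep(60_000);   // let the local topology run briefly, then shut down
        cluster.shutdown();
    }
}
```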
You should also look at the Kafka-hadoop loader on GitHub, which creates Hadoop jobs for incrementally loading messages from Kafka topics onto HDFS with multiple-file output semantics.
Also, as Peter Klipfel said:
you might want to look into something other than hadoop if you need streaming
You can also check other available alternatives like Apache Cassandra, which works great with streaming data at very low latency.
I think it depends on where you are pulling the data from and what you are trying to do with the data.
An alternative is to use IBM Streams, where you can pull directly from social media streams and store the data in many different data stores of your choice.
For example, you can use the streamsx.social toolkit from here: https://github.com/IBMStreams/streamsx.social which allows you to pull tweets directly from an HTTP stream.
Once you get data into Streams, the product also provides many adapters that allow you to store the streaming data into a datastore (e.g. HDFS using streamsx.hdfs, HBase using streamsx.hbase).
I think another consideration is what kind of analytics you are doing with the social media data. If you would like to analyze the social data in-stream before it is stored, IBM Streams also provides a text toolkit that allows you to extract insight from the unstructured text of the social data. You can analyze the data without really having to store it anywhere.
Hope it helps!
