Why is the Spark Streaming No Receiver approach still experimental? - spark-streaming

Is the No Receiver (Direct) approach in Spark Streaming + Kafka still experimental [as of Spark 2.0], or can I use it in production systems?

Although "No Receiver approach in Spark Streaming + Kafka" is still in experimental (officially based on the Spark Docs), I believe it is safe to use in Production Systems (It was first introduced in Spark 1.3 for Scala/Java & 1.4 for Python).
The feature was introduced long back. I know people who are using this in Production Environment.
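For reference, here is a minimal sketch of the direct (no-receiver) approach, assuming the spark-streaming-kafka-0-10 integration; the broker address, topic, and group id are placeholders:

import java.util.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;

public class DirectKafkaExample {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("DirectKafkaExample");
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "broker1:9092");   // placeholder
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "example-group");           // placeholder
    kafkaParams.put("auto.offset.reset", "latest");

    // No receivers: each RDD partition maps 1:1 to a Kafka partition,
    // and offsets are tracked by Spark rather than by a receiver + WAL.
    JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
            ssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(
                Collections.singletonList("example-topic"), kafkaParams));

    stream.map(ConsumerRecord::value).print();

    ssc.start();
    ssc.awaitTermination();
  }
}

Compared to the receiver-based approach, there is no write-ahead log to maintain and offsets can be handled explicitly, which is largely why people run it in production despite the "experimental" label.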

Related

KafkaIO Connector/Apache Beam Transform "go" SDK Available?

I am working on building a data ingestion pipeline using Apache Beam "go" SDK.
My pipeline is to consume data from Kafka queue and persist the data to Google Cloud Bigtable (and/or to another Kafka topic).
So far, I have not been able to find a Kafka IO Connector (also known as Apache I/O Transform) written in "go" (I was able to find a java version, however).
Here's link to supported Apache Beam built-in I/O transforms:
https://beam.apache.org/documentation/io/built-in/
I am looking for the "go" equivalent of the following Java code:
pipeline.apply("kafka_deserialization", KafkaIO.<String, String>read()
.withBootstrapServers(KAFKA_BROKER)
.withTopic(KAFKA_TOPIC)
.withConsumerConfigUpdates(CONSUMER_CONFIG)
.withKeyDeserializer(StringDeserializer.class)
.withValueDeserializer(StringDeserializer.class))
Do you have any information on the availability of KafkaIO Connector "go" SDK/library?
@cricket_007 In case you are also curious, I received the following update from Robert Burke (rebo@google.com), who is on the Apache Beam team:
There presently isn't a Kafka transform for Go.
The Go SDK is still experimental, largely due to scalable IO support, which is why the Go SDK isn't represented in the built-in io page.
There's presently no way for an SDK user to write a Streaming source in the Go SDK, since there's no mechanism for a DoFn to "self terminate" bundles, such as to allow for scalability and windowing from streaming sources.
However, SplittableDoFns are on their way, and will eventually be the solution for writing these.
At present, the Beam Go SDK IOs haven't been tested and vetted for production use. Until the initial SplittableDoFn support is added to the Go SDK, Batch transforms cannot split, and can't scale beyond a single worker thread. This batch version should land in the next few months, and the streaming version land a few months after that, after which a Kafka IO can be developed.
I wish I had better news for you, but I can say progress is being made.
Robert Burke

Relevance of Hadoop & Streaming solutions when Spark exists

I am starting a big data initiative for my startup. In 2018, is there any reason to use Hadoop at all, since Spark is touted to be way faster because it mostly avoids writing intermediate data to disk, unlike Hadoop's MapReduce?
I realize Spark has a higher need for RAM, but wouldn't that be a one-time CAPEX cost that pays for itself?
In general, unless there are legacy projects, why should one pick up Hadoop at all when Spark is available?
Would appreciate real-world comparisons of the two, gotchas, etc.
Alternately, are there use cases that Hadoop can solve but Spark cannot?
—————-comment below for actual problem————
I would use YARN as the resource manager with HDFS as the file system for Spark.
Also, realize that Spark intersects quite a bit with the Hadoop ecosystem.
Comparisons are:
Mapreduce vs Spark code
SparkSQL vs Hive
People mention Pig too, but not a whole lot of people want to learn its custom query language. And if I had to use Pig as a data scientist, why wouldn't I use, say, Apache NiFi with Hadoop?
Also not sure how Spark handles the following:
If data does not fit in RAM, then what? Back to a disk-based paradigm (not talking about streaming use cases here), so no better than MapReduce? How does Tez make MR2 better?
Hadoop 3 has support for Erasure coding to reduce data replication. What does Spark do?
Where I am unclear is the plethora of overlapping choices. For e.g. streaming alone has:
Spark streaming
Apache storm
Apache Samza
Kafka streams
CEP commercial tools.(ORacle CEP, TIBCO etc.)
A lot of them use DAGs similar to Spark's core engine, so it is hard to pick one over the other.
Use case:
App sends data to middleware until end of event. An event can end either on a specified periodicity or when a business condition is met.
Middleware must show, in real time, the addition of a value (simplifying) sent by users from their app instances. It is accepted that the middleware value is a floor of the actual sum and that the real value can be higher. The plan is to use Kafka Streams here: a consumer adds all the inputs with minimal latency and posts the result to a cache, which is polled by apps to show the current additive value.
Middleware logs all input.
After the event ends, a big data job scans through the log data and database records to get an accurate count, comparing all DB values and log entries (audit) against the value shown by Kafka. The value calculated by this scheme is the final value.
Design choices:
I like Kafka because it decouples the application from the middleware and is low-latency, high-throughput messaging. Streams code is easy to write. Happy for someone to counter-argue for Spark Streaming, Apache Storm, or Apache Samza instead?
Application itself is Java code on Tomcat server with REST end points for iOS/ Android clients. Not doing client caching due to explicit liveliness of additive value.
You're confusing Hadoop with just MapReduce. Hadoop is an ecosystem of MapReduce, HDFS, and YARN.
First of all, Spark doesn't have a filesystem. That's primarily why Hadoop is nice, in my book. Sure, you can use S3, or many other cloud storages, or bare metal data stores like Ceph, or GlusterFS, but from what I've researched, HDFS is by far the fastest when processing data.
Maybe you're not familiar with the concept of rack locality that YARN offers. If you use Spark Standalone mode with any file system not mounted under the Spark executors, then all your data requests will need to be pulled over a network connection, therefore saturating the network, and causing a bottleneck, regardless of memory. Compare that to the Spark executors running on the YARN NodeManagers, HDFS datanodes are ideally also NodeManagers.
A similar problem - people say Hive is slow, SparkSQL is faster. Well, that's true if you run Hive with MapReduce instead of Tez or Spark execution modes.
Now, if you want streaming and real-time events rather than the batch world commonly associated with Hadoop, you might want to research the SMACK stack.
Update
Pig as a data scientist why wouldn’t I use say an Apache NiFi with Hadoop
Pig is not comparable to NiFi.
You can use NiFi; nothing is stopping you. It would run closer to real-time than Spark micro batches. And it is a good tool to pair with Kafka.
plethora of overlapping choices
Yes, and you didn't even list them all... It's up to some big data architect in your company to come up with a solution. You'll find that vendor support from Confluent is mostly for Kafka; I haven't seen them talk about Samza much. Hortonworks will support Storm, NiFi, and Spark, but they aren't running the latest version of Kafka if you want fancy features like KSQL. StreamSets is a similar company, staffed by people with backgrounds in other batch/streaming Apache projects, offering a tool that competes with NiFi.
Storm and Samza are two ways to do the same thing, as far as I know. I think Flink is more programmer friendly than Storm. I don't have experience with Samza, though I work closely with people who primarily are using Kafka Streams rather than it. And Kafka Streams isn't DAG based - it's just a high level Kafka library, embeddable in any JVM application.
If data does not fit in RAM then what ?
By default, it spills to disk... Spark has parameters to configure if you don't want disk to be touched. In which case, your jobs die of OOM more quickly, obviously.
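To illustrate the point (a minimal sketch; the input path is a placeholder), the storage level chosen when caching controls whether Spark may spill cached partitions to local disk:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class SpillExample {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SpillExample"));
    JavaRDD<String> lines = sc.textFile("hdfs:///data/big-input");   // placeholder path

    // MEMORY_AND_DISK: partitions that don't fit in RAM are spilled to
    // local disk instead of failing the job.
    lines.persist(StorageLevel.MEMORY_AND_DISK());

    // MEMORY_ONLY: nothing touches disk; partitions that don't fit are
    // simply not cached and get recomputed when they are needed again.
    // lines.persist(StorageLevel.MEMORY_ONLY());

    System.out.println(lines.count());
    sc.close();
  }
}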
How does Tez make MR2 better?
Tez isn't MR. It creates more optimized DAGs like Spark does. Go read about it.
Hadoop 3 has support for Erasure coding to reduce data replication. What does Spark do?
Spark has no filesystem. We already covered this. Erasure encoding is primarily for data at-rest, not during processing. I actually don't know if Spark supports Hadoop 3, yet.
Application itself is Java code on Tomcat server with REST end points for iOS/ Android clients
Personally, I would use Kafka Streams here because 1) you are using Java already, and 2) it's a standalone thread in your code that lets you read/publish data from Kafka without Hadoop/YARN or Spark clusters. It's not clear what your question has to do with Hadoop given your listed client-server architecture, but feel free to string an additional line from a Kafka topic to a database/analytics engine of your choice. The Kafka Connect framework has many connectors for you to choose from.
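As a rough sketch of that Kafka Streams suggestion (the topic names, serdes, and running-sum logic are assumptions based on your described use case, not a prescribed design):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.*;

public class RunningSum {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "running-sum");      // placeholder
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // placeholder

    StreamsBuilder builder = new StreamsBuilder();
    // Input records: key = event id, value = amount reported by an app instance.
    KStream<String, Long> values =
        builder.stream("user-values", Consumed.with(Serdes.String(), Serdes.Long()));

    // Keep a running sum per event; each update flows downstream with low latency.
    KTable<String, Long> totals =
        values.groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
              .reduce(Long::sum);

    totals.toStream().to("running-totals", Produced.with(Serdes.String(), Serdes.Long()));

    new KafkaStreams(builder.build(), props).start();
  }
}

The Streams app runs inside your existing Java service; the cache layer would consume the running-totals topic, and no separate cluster is needed.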
You could also use NiFi as your mobile REST API to just ExposeHTTP and send requests to it, then route flows based on attributes in the data. Then, manipulate and publish to Kafka as well as other systems.
Spark and Hadoop work in pretty similar ways when it comes to solving MapReduce-style problems.
Hadoop is still quite relevant from the HDFS point of view; HDFS is a well-known, widely used solution for big data storage. But your question is about MapReduce.
Spark is the best option if you have good machines with really good memory and network throughput. But that kind of machine is expensive, and sometimes your best option is to use Hadoop to process your data. Spark is great and fast, but you can run into painful memory issues if you don't have a good cluster and try to fit too much data into memory; Hadoop can be better in that case. Year after year, though, this problem becomes less relevant.
So Hadoop is here to complement Spark. Hadoop is not only MapReduce; Hadoop is an ecosystem. Spark doesn't have a distributed file system, and for Spark to work well you need one. Spark also doesn't have a resource manager; Hadoop has one, called YARN, and Spark in cluster mode needs a resource manager.
Conclusion
Hadoop is still relevant as an ecosystem, but MapReduce on its own is hardly used anymore.

Spark Streaming - Calling REST API vs Building Functionality Natively for Spark Streaming

We have specific functionality for managing time-series data. The functionality is already offered as a REST API and runs on Cloud Foundry. We want to offer support for ingesting time-series data using Spark Streaming and Kafka so that the solution is more scalable and robust.
What are the disadvantages of calling the REST API from Spark Streaming instead of building the functionality natively in Spark?
I would argue that if your REST API can support the throughput from Spark Streaming, the REST API can support the throughput directly. In which case, you don't actually need Spark Streaming at all. If what you need is a buffer for unexpected spikes, there are simpler ways to achieve that than Spark Streaming.
To address your question more directly, calling the REST API adds latency and an additional failure case to the Spark Streaming pipeline. Implementing your logic in Spark Streaming directly adds code complexity and possible duplication. And both options add operational complexity.
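For concreteness, here is a hedged sketch of what "calling the REST API from Spark Streaming" typically looks like; the socket source and endpoint URL are placeholders (in your case the source would be Kafka). Every record becomes an HTTP round trip, so the REST service ends up handling the full streaming throughput anyway, plus the new timeout/retry failure modes:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class RestForwarder {
  public static void main(String[] args) throws InterruptedException {
    JavaStreamingContext ssc =
        new JavaStreamingContext(new SparkConf().setAppName("RestForwarder"),
                                 Durations.seconds(5));

    // Placeholder source; in the scenario above this would be a Kafka direct stream.
    JavaDStream<String> records = ssc.socketTextStream("localhost", 9999);

    records.foreachRDD(rdd ->
        rdd.foreachPartition(partition -> {
          // One HTTP client per partition; one POST per record.
          HttpClient client = HttpClient.newHttpClient();
          while (partition.hasNext()) {
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://timeseries.example.com/ingest")) // placeholder
                .POST(HttpRequest.BodyPublishers.ofString(partition.next()))
                .build();
            client.send(request, HttpResponse.BodyHandlers.discarding());
          }
        }));

    ssc.start();
    ssc.awaitTermination();
  }
}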

Architecture of syncing logs to hadoop

I have different environments across a few cloud providers, e.g. Windows servers and Linux servers in Rackspace, AWS, etc., and there is a firewall between them and the internal network.
I need to build a real-time pipeline where all newly generated IIS logs and Apache logs are synced to an internal big data environment.
I know there are tools like Splunk or Sumo Logic that might help, but we are required to implement this logic with open source technologies. Due to the firewall, I am assuming I can only pull the logs rather than have the cloud providers push them.
Can anyone share the rule of thumb or a common architecture for syncing large volumes of logs in NRT (near real time)? I have heard of Apache Flume and Kafka and am wondering whether those are required, or whether it is just a matter of using something like rsync.
You can use rsync to get the logs but you can't analyze them in the way Spark Streaming or Apache Storm does.
You can go ahead with one of these two options.
Apache Spark Streaming + Kafka
OR
Apache Storm + Kakfa
Have a look at this article about integration approaches of these two options.
Also have a look at this presentation, which covers an in-depth analysis of Spark Streaming and Apache Storm.
Performance depends on your use case. Spark Streaming is 40x faster than Storm at processing. But if you add "reliability" as a key criterion, then data should be moved into HDFS before being processed by Spark Streaming, which will reduce the final throughput.
Reliability Limitations: Apache Storm
Exactly once processing requires a durable data source.
At least once processing requires a reliable data source.
An unreliable data source can be wrapped to provide additional guarantees.
With durable and reliable sources, Storm will not drop data.
Common pattern: Back unreliable data sources with Apache Kafka (minor latency hit traded for 100% durability).
Reliability Limitations: Spark Streaming
Fault tolerance and reliability guarantees require an HDFS-backed data source.
Moving data to HDFS prior to stream processing introduces additional latency.
Network data sources (Kafka, etc.) are vulnerable to data loss in the event of a worker node failure.

Are there circumstances where an Akka-based application can replace a Hadoop setup?

From reading about Akka and my own early use of it, it seems that Akka could be used, more simply than a Hadoop setup, for some applications. You wouldn't have HDFS available, but you could write an application that sends pieces of work to different "mappers" and has the results sent to a "reducer", and it would be easier to set up than Hadoop in VMs or on hardware, with fewer services to configure.
Is this reasonable or are the two technologies used for totally different things?
Yes, totally reasonable. We have built a large scale (1000+ workers) map-reduce system using Akka 2.0. Akka 2.2+ is even better because you can use the clustering and remote deathwatch features instead of having to write that functionality yourself.
See this post to get a feel for how it might work.
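To give a flavour of the mapper/reducer idea with plain actors, here is a minimal sketch using the Akka classic Java API; the actor names and the word-count payload are illustrative assumptions, not the design from the linked post:

import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

public class MiniMapReduce {

  // Reducer: accumulates partial counts sent by the mappers.
  static class Reducer extends AbstractActor {
    private long total = 0;
    @Override
    public Receive createReceive() {
      return receiveBuilder()
          .match(Long.class, partial -> {
            total += partial;
            System.out.println("running total = " + total);
          })
          .build();
    }
  }

  // Mapper: turns a chunk of text into a partial count and forwards it.
  static class Mapper extends AbstractActor {
    private final ActorRef reducer;
    Mapper(ActorRef reducer) { this.reducer = reducer; }
    @Override
    public Receive createReceive() {
      return receiveBuilder()
          .match(String.class, chunk ->
              reducer.tell((long) chunk.split("\\s+").length, getSelf()))
          .build();
    }
  }

  public static void main(String[] args) {
    ActorSystem system = ActorSystem.create("mini-map-reduce");
    ActorRef reducer = system.actorOf(Props.create(Reducer.class), "reducer");
    ActorRef mapper = system.actorOf(Props.create(Mapper.class, reducer), "mapper");

    mapper.tell("to be or not to be", ActorRef.noSender());
    mapper.tell("that is the question", ActorRef.noSender());
  }
}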
Akka Cluster is currently marked experimental, but the Akka team says it's more or less ready for prime time and people are using it in production. I would be very cautious about going in this direction; you may instead want to consider Hadoop, or use ZooKeeper with Akka and ZeroMQ, or a message queue, for horizontal scaling as well.
