I am trying to migrate our organization's hadoop jobs to GCP...I am confused between GCP Data Flow and Data Proc...
I want to re-use Hadoop jobs we already have created and minimize the management of the cluster as much as possible. We also want to be able to persist data beyond the life of the cluster...
Can anyone suggest
I would just start with DataProc as it is very close to what you have.
Check out DataProc initialization actions, https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions, create a simple cluster and get a feel for it.
DataFlow is completely managed and you don't operate any cluster resources, but at the same time you cannot migrate an onsite cluster to DataFlow as is, you need to migrate (some times rewrite) your Hive/Pig/Oozie etc.
Cost for DataFlow is also calculated differently, though there is no upfront cost vs DataProc, everytime you run a job you incur some cost associated with it on DataFlow.
A lot depends on the nature of your Hadoop jobs and the activities you are performing in regards to the selection of Cloud Dataproc (managed big data platform - orientation of Hadoop/Spark) and/or Cloud Dataflow (managed big data platform - orientation of Apache Beam for streaming use cases).
In regards to ensuring persistence of data beyond the operation, you may want to consider storing your data on GCS or on PD's if that's an option basis the need of your use case.
Related
I looked at some posts but they are a bit older on this topic. I have read the AWS and other blogs as well, but ...
My simple non-programming question for AWS in today's environment is:
If we have a DWH of say, 20+TB and growing, that we want to off-load to the Cloud as many are doing, then
if we have a regular daily DWH feed with some mutations, then
should we in the case of AWS, use EMR or EC2?
Moreover, it is a complete batch environment, no Streaming or KAFKA requirements. Usage of SPARK for sure.
EMR seems great, but I have the impression it is for Data Scientists to do whatever they want whenever they want. For more regular ETL I am wondering if this is suited. The appeal of less management is certainly a boon.
In the docs on AWS I cannot find a definitive answer, hence this question.
My impression is that with AMI and bootstrapping own services, that EMR is certainly one way to go, and, that EC2 would be more for a KAFKA Cluster or if you really want to control your own environment and tooling completely based on say Cloudera's distribution per se.
So, the answer here is for others that may need to assess which options apply for off-loading, whatever. It is actually not so hard in hindsight. Note that AZURE and non-AWS vendors not considered here. In a nutshell, then:
EMR is an (PaaS) AWS Managed Hadoop Service
EMR provides tools that AMAZON feel will do the job for Data Science, Analytics, etc. But you can "bootstrap" your own requirements / software, if needed.
EMR-clusters comprise short-running EC2 instances and provisioning happens under water as it were. You get patches effected easily this way. You can up- and downscale very easily as well. Compute and storage are divorced allowing this scaling to occur easily.
Elasticity applies obviously more so to compute, data needs to be there as long as you need it. EMR relies on S3 to save results to, longer term. After saving, one terminates the EMR-cluster, and when required, start a new EMR-cluster and attach your saved S3 results - if applicable - to this new cluster. EMRFS allows S3 to look like part of HDFS and provides easy access. EBS-backed storaged exists that allows saving of results to storage tied to the EC2-instance for the duration of that instance.
It's a new way of doing things. One has access to "spot" instances with obviously spot prices. Billing is less predictable as it depends what you do, but could well overall be cheaper - provided governed correctly. An example of this is expedia's management of EMR-clusters.
Ad-hoc querying is not well served with S3, so you will need another AWS Managed Services such as Presto / Athena or Redshift (Spectrum) which is an additional set of services and cost. Just mentioning this due to slower S3 performance.
EC2 (IaaS) is more "traditional"
You elect to take this path if you want to provision EC2 instances yourself a syou want control of the software and what you want on your Hadoop environment.
EC2 instances - VMs - have compute power, memory, EBS-backed temporal storage, and use EFS for file systems for HDFS or, say, KUDU, and S3. S3 access is not as easy to access as under EMRFS with EMR.
You install and maintain the Hadoop software yourself and apply patches, etc. Management of Hadoop on these EC2 instances is of course less of a big deal with Cloudera and Cloudbreak.
Billing is more predictable one could argue, on the basis of up-time of an EC2 instance, and billing applies continuously for any persisted storage.
Important point, one can combine an EC2 approach for, say, DWH Loading on Hadoop - if "off-loading", and EMR Clusters for Data Science.
MR Data Locality
This not adhered to in both approaches unless bare metal options used, but then the elasticity - E - is harder for both parties, which allows cost savings.
Data locality seems to be assumed by most, but actually it has gone with Cloud computing as expected, and seems quite OK in terms of performance for Data Science etc.
For ad hoc querying AMAZON say they are not so sure on S3, and from experience, using EFS fof HDFS/PARQUET or KUDU works pretty quick, to say the least, from my experience at least.
I am starting a big data initiative for my startup. In 2018 is there any reason to use Hadoop at all since Spark is touted to be way faster due to it primarily not writing the intermediate data to disk as Hadoop’s MR.
I realize Spark has a higher need for RAM But that would be just one time CAPEX costs that would pay for itself?
In general unless there are legacy projects why should one pick up Hadoop at all since Spark is available?
Would appreciate real world comparisons of the two, gotchas etc.?
Alternately are there use cases that Hadoop can solve but Spark cannot?
—————-comment below for actual problem————
I would use YARN as the resource manager with HDFS as the file system for Spark.
Also realize that as Spark intersects quiet a bit with Hadoop ecosystem.
Comparos are :
Mapreduce vs Spark code
SparkSQL vs Hive
People mention Pig too but not a whole lot of people want to learn custom querying. And if I had to use Pig as a data scientist why wouldn’t I use say an Apache NiFi with Hadoop?
Also not sure how Spark handles the following:
If data does not fit in RAM then what ? Back to a disk based paradigm (not talking of streaming use cases here..) so no better than Mapreduce? How does Tez make MR2 better?
Hadoop 3 has support for Erasure coding to reduce data replication. What does Spark do?
Where I am unclear is the plethora of overlapping choices. For e.g. streaming alone has:
Spark streaming
Apache storm
Apache Samza
Kafka streams
CEP commercial tools.(ORacle CEP, TIBCO etc.)
A lot of them use DAG similar to Spark’s core engine so hard to pick one from the other.
Use case:
App sends data to middleware until end of event. Event can end specified on periodicity or due to a business condition being met.
Middleware must show real time addition of a value (simplifying) sent by users from their app instances. Accepted that middleware is the floor of the actual sum of values and real value can be higher. Plan to use Kafka streams here to have a consumer that adds all the inputs with minimal latency the consumer posts to a cache which is polled by apps to show current additive value.
Middleware logs all input
After event ends a big data paradigm scans through log data and database records to get accurate count by comparing all dB values and log entries (audit) and compare them to the Kafka shown value. Value calculated by this scheme is the final value.
Design choices:
I like Kafka because it decouples application middleware and is low latency high throughput messaging. Streams code is easy to write . Happy for someone to counter argue using Spark Streams Or Apache Storm or Apache Samza instead?
Application itself is Java code on Tomcat server with REST end points for iOS/ Android clients. Not doing client caching due to explicit liveliness of additive value.
You're confusing Hadoop with just MapReduce. Hadoop is an ecosystem of MapReduce, HDFS, and YARN.
First of all, Spark doesn't have a filesystem. That's primarily why Hadoop is nice, in my book. Sure, you can use S3, or many other cloud storages, or bare metal data stores like Ceph, or GlusterFS, but from what I've researched, HDFS is by far the fastest when processing data.
Maybe you're not familiar with the concept of rack locality that YARN offers. If you use Spark Standalone mode with any file system not mounted under the Spark executors, then all your data requests will need to be pulled over a network connection, therefore saturating the network, and causing a bottleneck, regardless of memory. Compare that to the Spark executors running on the YARN NodeManagers, HDFS datanodes are ideally also NodeManagers.
A similar problem - people say Hive is slow, SparkSQL is faster. Well, that's true if you run Hive with MapReduce instead of Tez or Spark execution modes.
Now, if you're wanting streaming and real-time events rather than the batch world commonly associated with Hadoop. You might want to research the SMACK stack.
Update
Pig as a data scientist why wouldn’t I use say an Apache NiFi with Hadoop
Pig is not comparable to NiFi.
You can use NiFi; nothing is stopping you. It would run closer to real-time than Spark micro batches. And it is a good tool to pair with Kafka.
plethora of overlapping choices
Yes, and you didn't even list them all... It's up to some BigData architect in your company to come up with a solution. You'll find that vendor support from Confluent is mostly for Kafka. I haven't seen them talking about Samza much. Hortonworks will support Storm, Nifi, and Spark, but they aren't running the latest version of Kafka if you want fancy features like KSQL. Streamsets is a similar company offering a tool competing with NiFi which consists of employees with backgrounds in other batch/streaming Apache projects.
Storm and Samza are two ways to do the same thing, as far as I know. I think Flink is more programmer friendly than Storm. I don't have experience with Samza, though I work closely with people who primarily are using Kafka Streams rather than it. And Kafka Streams isn't DAG based - it's just a high level Kafka library, embeddable in any JVM application.
If data does not fit in RAM then what ?
By default, it spills to disk... Spark has parameters to configure if you don't want disk to be touched. In which case, your jobs die of OOM more quickly, obviously.
How does Tez make MR2 better?
Tez isn't MR. It creates more optimized DAGs like Spark does. Go read about it.
Hadoop 3 has support for Erasure coding to reduce data replication. What does Spark do?
Spark has no filesystem. We already covered this. Erasure encoding is primarily for data at-rest, not during processing. I actually don't know if Spark supports Hadoop 3, yet.
Application itself is Java code on Tomcat server with REST end points for iOS/ Android clients
Personally, I would use Kafka Streams here because 1) You are using Java already 2) it's a standalone thread in your code that offers you to read/publish data from Kafka without Hadoop/YARN or Spark Clusters. It's not clear what your question has to do with Hadoop from your listed client-server archictecture, but feel free to string an additional line from a Kafka topic to a database/analytics engine of your choice. The Kafka Connect framework has many connectors for you to choose from.
You could also use NiFi as your mobile REST API to just ExposeHTTP and send requests to it, then route flows based on attributes in the data. Then, manipulate and publish to Kafka as well as other systems.
Spark and Hadoop works pretty similar in the way of solving MapReduce problems.
Hadoop is pretty relevant if you talk about HDFS point of view. The HDFS is a well known used solution for big data storage. But your question is about MapReduce.
Spark is the best option if you are talking about good machines with real good configuration of memory and network throughput. But we know that kind of machines are expensive and sometimes you best option is to use Hadoop to process your data. Spark is great and fast but sometimes you get crazy with Memory issues if you don't have a good cluster in case of fit too much data in the memory. Hadoop in this case can be better. But this problem year after year are less relevant.
So hadoop is here com complement Spark, Hadoop is not only MapReduce Hadoop is an ecosystem. Spark doesn't have a distributed file system, to Spark works well you need one, Spark doesn't have a resource manager, Hadoop has called Yarn. And Spark in a cluster mode need a resource manager.
Conclusion
Hadoop still relevant as an ecosystem but as only mapReduce I can say that is not been used anymore.
Having run through configuration of both the Hadoop Big Insights and Apache Spark services on Bluemix, I noticed that Hadoop is very configurable.I have a choice of how many nodes there will be in the cluster and the RAM and CPU cores of those nodes as well as hard disk space
But the Spark service seems less configurable. The only choice I have is to choose between 2 and 30 Spark executors.
I am working with Bluemix as part of an IBM IC4 project to evaluate these services, so I have a few questions about this.
Is it possible to configure the Spark service in a similar way to the Hadoop service? i.e. choose nodes, RAM of nodes, CPU cores etc.
What are Spark executors in this context? Are they nodes? If so, what are their specifications?
Is there a plan to improve the options for Spark's configuration in the future?
Apologies for the questions but I need to know these specifications in order to carry out my work.
The Big Insights service is what some would call a hosted service. Which is to say, when you provision on instance of this service you get your own cluster with nodes configured as specified in the chosen plan. Consequently, you'll want to know exactly what each node you're paying for gives you. On the other hand, the Apache Spark service is a shared compute service, wherein you pay for compute to run your spark programs. Running spark is about in-memory compute, and creating RDDs over sources of data hosted by other data services. So in this context, what matters is how many concurrent jobs can I run and how many parallel tasks can I run with how much memory, and so on. In the Spark service plan, these executors seem to be an abstraction on this compute horsepower; unfortunately, hard for you to map that to physical hardware if you care about that. The plan description needs more elaboration and details about how one translates this abstraction to how you map to your workload needs.
However, I understand that this should be improved considerably at some point in the near future. There have been rumors about moving to only a single spark service plan where you can dial in, whenever you want, how much compute you need and that would take effect when you click "go", for all spark jobs from that point forward; it seems like you can twiddle the dials until you get what you want, see what that would cost, then lock it in until next time you need to change it. I can image something even more dynamic than that on a per-job basis. But anyway, seems like the direction things may be going for this compute service.
I'm just beginning to learn about Big Data and I'm interested in Hadoop. I'm planning on building a simple analytics system to make sense of certain events that occurs in my site.
So I'm planning to have code (both front and back end) to trigger some events that would queue messages (most likely with RabbitMQ). These messages will then be processed by a consumer that would write the data continuously to HDFS. Then, I can run a map reduce job at any time to analyze the current data set.
I'm leaning towards Amazon EMR for the Hadoop functionality. So my question is this, from my server running the consumer, how do I save the data to HDFS? I know there's a command like "hadoop dfs -copyFromLocal", but how do I use this across servers? Is there a tool available?
Has anyone tried a similar thing? I would love to hear about your implementations. Details and examples would be very much helpful. Thanks!
If you mention EMR, it's takes input from a folder in s3 storage, so you can use your preffered language library to push data to s3 to analyse it later with EMR jobs. For example, in python one can use boto.
There are even drivers allowing you to mount s3 storage as a device, but some time ago all of them were too buggy to use them in production systems. May be thing have changed with time.
EMR FAQ:
Q: How do I get my data into Amazon S3? You can use Amazon S3 APIs to
upload data to Amazon S3. Alternatively, you can use many open source
or commercial clients to easily upload data to Amazon S3.
Note that emr (as well as s3) implies additional costs, and that it's usage is justified for really big data. Also note that it is always benefical to have relatively large files both in terms of Hadoop performance and storage costs.
We currently have a very write-heavy web analytics application which collects a large number of real time events from a large number of websites and stores for subsequent analytics and reporting.
Our initial planned architecture involved a cluster of web servers handling requests, and writing all of the data into a Cassandra cluster, while simultaneously updating a large number of counters for real-time aggregated reports. We also plan to utilize hadoop directly on CassandraFS (as a replacement of HDFS - offered by datastax) to natively run Map Reduce jobs on the data residing in Cassandra for more involved analytics. The output of the MapR jobs would be written back onto ColumnFamilies in Cassandra natively.
Hadoop map reduce runs on a read-only replica of the main cassandra cluster which is write-heavy. The idea was to avoid multiple data hops and have all data for the analytics in one repository.
More recently we hear about, and have faced first hand issues managing and growing a cassandra cluster with frequent node outages and bad response times. Couchbase seems to be much better with response times and dynamically growing and managing the cluster. So we are considering replacing Cassandra with Couchbase.
However this brings up a few questions.
Does Couchbase scale well in a mostly sequential write-heavy scenario? I don't see our scenario making much use of the in-memory caching, as the raw data being written is rarely read back, only aggregated metrics are. Plus I haven't been able to read much about what happens when Couchbase needs to hit the disk to write back data very frequently (or all the time?). Will it end up performing poorly than Cassandra?
What happens to the Hadoop interface? Couchbase has its own map reduce capabilities, but I understand that they are limited in scope. Would I need to transfer data back and forth between CouchbaseDB and HDFS to be able to support all my analytics and reporting out of a single database?
I have recently evaluated Cassandra and Couchbase among other options for a client requirement, so I can shed some light on both datastores.
Couchbase is incredibly easy to manage and once you have installed the server on a node, you can manage the cluster completely from the dashboard. However, couchbase does not scale as well as Cassandra, when as the data size grows. I also did not find a way to integrate Couchbase and HDFS/Hadoop seemlessly.
Cassandra performs very well for super fast write throughput, but it does not have any server side aggregation capabilities. Cluster management is slightly more difficult than Couchbase, as you have to re-balance the cluster every time you add or remove a node. Apart from it, from performance standpoint, Cassandra is runs pretty much seamlessly, as long as you have designed the schema properly.
If you can afford Datastax Enterprise solutions for Hive to do map-reduce for sophisticated analytics, I'd recommend you to stay with Cassandra, as couchbase map-reduce support is not all that good, and benchmarks show Couchbase performance starts to detoriate as the cluster size grows.