Designing an Analytics System with Hadoop - hadoop

I'm just beginning to learn about Big Data and I'm interested in Hadoop. I'm planning on building a simple analytics system to make sense of certain events that occurs in my site.
So I'm planning to have code (both front and back end) to trigger some events that would queue messages (most likely with RabbitMQ). These messages will then be processed by a consumer that would write the data continuously to HDFS. Then, I can run a map reduce job at any time to analyze the current data set.
I'm leaning towards Amazon EMR for the Hadoop functionality. So my question is this, from my server running the consumer, how do I save the data to HDFS? I know there's a command like "hadoop dfs -copyFromLocal", but how do I use this across servers? Is there a tool available?
Has anyone tried a similar thing? I would love to hear about your implementations. Details and examples would be very much helpful. Thanks!

If you mention EMR, it's takes input from a folder in s3 storage, so you can use your preffered language library to push data to s3 to analyse it later with EMR jobs. For example, in python one can use boto.
There are even drivers allowing you to mount s3 storage as a device, but some time ago all of them were too buggy to use them in production systems. May be thing have changed with time.
EMR FAQ:
Q: How do I get my data into Amazon S3? You can use Amazon S3 APIs to
upload data to Amazon S3. Alternatively, you can use many open source
or commercial clients to easily upload data to Amazon S3.
Note that emr (as well as s3) implies additional costs, and that it's usage is justified for really big data. Also note that it is always benefical to have relatively large files both in terms of Hadoop performance and storage costs.

Related

How to migrate On Prem Hadoop to GCP

I am trying to migrate our organization's hadoop jobs to GCP...I am confused between GCP Data Flow and Data Proc...
I want to re-use Hadoop jobs we already have created and minimize the management of the cluster as much as possible. We also want to be able to persist data beyond the life of the cluster...
Can anyone suggest
I would just start with DataProc as it is very close to what you have.
Check out DataProc initialization actions, https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions, create a simple cluster and get a feel for it.
DataFlow is completely managed and you don't operate any cluster resources, but at the same time you cannot migrate an onsite cluster to DataFlow as is, you need to migrate (some times rewrite) your Hive/Pig/Oozie etc.
Cost for DataFlow is also calculated differently, though there is no upfront cost vs DataProc, everytime you run a job you incur some cost associated with it on DataFlow.
A lot depends on the nature of your Hadoop jobs and the activities you are performing in regards to the selection of Cloud Dataproc (managed big data platform - orientation of Hadoop/Spark) and/or Cloud Dataflow (managed big data platform - orientation of Apache Beam for streaming use cases).
In regards to ensuring persistence of data beyond the operation, you may want to consider storing your data on GCS or on PD's if that's an option basis the need of your use case.

Amazon EMR vs EC2 for Off loading BI & Analytics anno 2018

I looked at some posts but they are a bit older on this topic. I have read the AWS and other blogs as well, but ...
My simple non-programming question for AWS in today's environment is:
If we have a DWH of say, 20+TB and growing, that we want to off-load to the Cloud as many are doing, then
if we have a regular daily DWH feed with some mutations, then
should we in the case of AWS, use EMR or EC2?
Moreover, it is a complete batch environment, no Streaming or KAFKA requirements. Usage of SPARK for sure.
EMR seems great, but I have the impression it is for Data Scientists to do whatever they want whenever they want. For more regular ETL I am wondering if this is suited. The appeal of less management is certainly a boon.
In the docs on AWS I cannot find a definitive answer, hence this question.
My impression is that with AMI and bootstrapping own services, that EMR is certainly one way to go, and, that EC2 would be more for a KAFKA Cluster or if you really want to control your own environment and tooling completely based on say Cloudera's distribution per se.
So, the answer here is for others that may need to assess which options apply for off-loading, whatever. It is actually not so hard in hindsight. Note that AZURE and non-AWS vendors not considered here. In a nutshell, then:
EMR is an (PaaS) AWS Managed Hadoop Service
EMR provides tools that AMAZON feel will do the job for Data Science, Analytics, etc. But you can "bootstrap" your own requirements / software, if needed.
EMR-clusters comprise short-running EC2 instances and provisioning happens under water as it were. You get patches effected easily this way. You can up- and downscale very easily as well. Compute and storage are divorced allowing this scaling to occur easily.
Elasticity applies obviously more so to compute, data needs to be there as long as you need it. EMR relies on S3 to save results to, longer term. After saving, one terminates the EMR-cluster, and when required, start a new EMR-cluster and attach your saved S3 results - if applicable - to this new cluster. EMRFS allows S3 to look like part of HDFS and provides easy access. EBS-backed storaged exists that allows saving of results to storage tied to the EC2-instance for the duration of that instance.
It's a new way of doing things. One has access to "spot" instances with obviously spot prices. Billing is less predictable as it depends what you do, but could well overall be cheaper - provided governed correctly. An example of this is expedia's management of EMR-clusters.
Ad-hoc querying is not well served with S3, so you will need another AWS Managed Services such as Presto / Athena or Redshift (Spectrum) which is an additional set of services and cost. Just mentioning this due to slower S3 performance.
EC2 (IaaS) is more "traditional"
You elect to take this path if you want to provision EC2 instances yourself a syou want control of the software and what you want on your Hadoop environment.
EC2 instances - VMs - have compute power, memory, EBS-backed temporal storage, and use EFS for file systems for HDFS or, say, KUDU, and S3. S3 access is not as easy to access as under EMRFS with EMR.
You install and maintain the Hadoop software yourself and apply patches, etc. Management of Hadoop on these EC2 instances is of course less of a big deal with Cloudera and Cloudbreak.
Billing is more predictable one could argue, on the basis of up-time of an EC2 instance, and billing applies continuously for any persisted storage.
Important point, one can combine an EC2 approach for, say, DWH Loading on Hadoop - if "off-loading", and EMR Clusters for Data Science.
MR Data Locality
This not adhered to in both approaches unless bare metal options used, but then the elasticity - E - is harder for both parties, which allows cost savings.
Data locality seems to be assumed by most, but actually it has gone with Cloud computing as expected, and seems quite OK in terms of performance for Data Science etc.
For ad hoc querying AMAZON say they are not so sure on S3, and from experience, using EFS fof HDFS/PARQUET or KUDU works pretty quick, to say the least, from my experience at least.

Relevance of Hadoop & Streaming solutions when Spark exists

I am starting a big data initiative for my startup. In 2018 is there any reason to use Hadoop at all since Spark is touted to be way faster due to it primarily not writing the intermediate data to disk as Hadoop’s MR.
I realize Spark has a higher need for RAM But that would be just one time CAPEX costs that would pay for itself?
In general unless there are legacy projects why should one pick up Hadoop at all since Spark is available?
Would appreciate real world comparisons of the two, gotchas etc.?
Alternately are there use cases that Hadoop can solve but Spark cannot?
—————-comment below for actual problem————
I would use YARN as the resource manager with HDFS as the file system for Spark.
Also realize that as Spark intersects quiet a bit with Hadoop ecosystem.
Comparos are :
Mapreduce vs Spark code
SparkSQL vs Hive
People mention Pig too but not a whole lot of people want to learn custom querying. And if I had to use Pig as a data scientist why wouldn’t I use say an Apache NiFi with Hadoop?
Also not sure how Spark handles the following:
If data does not fit in RAM then what ? Back to a disk based paradigm (not talking of streaming use cases here..) so no better than Mapreduce? How does Tez make MR2 better?
Hadoop 3 has support for Erasure coding to reduce data replication. What does Spark do?
Where I am unclear is the plethora of overlapping choices. For e.g. streaming alone has:
Spark streaming
Apache storm
Apache Samza
Kafka streams
CEP commercial tools.(ORacle CEP, TIBCO etc.)
A lot of them use DAG similar to Spark’s core engine so hard to pick one from the other.
Use case:
App sends data to middleware until end of event. Event can end specified on periodicity or due to a business condition being met.
Middleware must show real time addition of a value (simplifying) sent by users from their app instances. Accepted that middleware is the floor of the actual sum of values and real value can be higher. Plan to use Kafka streams here to have a consumer that adds all the inputs with minimal latency the consumer posts to a cache which is polled by apps to show current additive value.
Middleware logs all input
After event ends a big data paradigm scans through log data and database records to get accurate count by comparing all dB values and log entries (audit) and compare them to the Kafka shown value. Value calculated by this scheme is the final value.
Design choices:
I like Kafka because it decouples application middleware and is low latency high throughput messaging. Streams code is easy to write . Happy for someone to counter argue using Spark Streams Or Apache Storm or Apache Samza instead?
Application itself is Java code on Tomcat server with REST end points for iOS/ Android clients. Not doing client caching due to explicit liveliness of additive value.
You're confusing Hadoop with just MapReduce. Hadoop is an ecosystem of MapReduce, HDFS, and YARN.
First of all, Spark doesn't have a filesystem. That's primarily why Hadoop is nice, in my book. Sure, you can use S3, or many other cloud storages, or bare metal data stores like Ceph, or GlusterFS, but from what I've researched, HDFS is by far the fastest when processing data.
Maybe you're not familiar with the concept of rack locality that YARN offers. If you use Spark Standalone mode with any file system not mounted under the Spark executors, then all your data requests will need to be pulled over a network connection, therefore saturating the network, and causing a bottleneck, regardless of memory. Compare that to the Spark executors running on the YARN NodeManagers, HDFS datanodes are ideally also NodeManagers.
A similar problem - people say Hive is slow, SparkSQL is faster. Well, that's true if you run Hive with MapReduce instead of Tez or Spark execution modes.
Now, if you're wanting streaming and real-time events rather than the batch world commonly associated with Hadoop. You might want to research the SMACK stack.
Update
Pig as a data scientist why wouldn’t I use say an Apache NiFi with Hadoop
Pig is not comparable to NiFi.
You can use NiFi; nothing is stopping you. It would run closer to real-time than Spark micro batches. And it is a good tool to pair with Kafka.
plethora of overlapping choices
Yes, and you didn't even list them all... It's up to some BigData architect in your company to come up with a solution. You'll find that vendor support from Confluent is mostly for Kafka. I haven't seen them talking about Samza much. Hortonworks will support Storm, Nifi, and Spark, but they aren't running the latest version of Kafka if you want fancy features like KSQL. Streamsets is a similar company offering a tool competing with NiFi which consists of employees with backgrounds in other batch/streaming Apache projects.
Storm and Samza are two ways to do the same thing, as far as I know. I think Flink is more programmer friendly than Storm. I don't have experience with Samza, though I work closely with people who primarily are using Kafka Streams rather than it. And Kafka Streams isn't DAG based - it's just a high level Kafka library, embeddable in any JVM application.
If data does not fit in RAM then what ?
By default, it spills to disk... Spark has parameters to configure if you don't want disk to be touched. In which case, your jobs die of OOM more quickly, obviously.
How does Tez make MR2 better?
Tez isn't MR. It creates more optimized DAGs like Spark does. Go read about it.
Hadoop 3 has support for Erasure coding to reduce data replication. What does Spark do?
Spark has no filesystem. We already covered this. Erasure encoding is primarily for data at-rest, not during processing. I actually don't know if Spark supports Hadoop 3, yet.
Application itself is Java code on Tomcat server with REST end points for iOS/ Android clients
Personally, I would use Kafka Streams here because 1) You are using Java already 2) it's a standalone thread in your code that offers you to read/publish data from Kafka without Hadoop/YARN or Spark Clusters. It's not clear what your question has to do with Hadoop from your listed client-server archictecture, but feel free to string an additional line from a Kafka topic to a database/analytics engine of your choice. The Kafka Connect framework has many connectors for you to choose from.
You could also use NiFi as your mobile REST API to just ExposeHTTP and send requests to it, then route flows based on attributes in the data. Then, manipulate and publish to Kafka as well as other systems.
Spark and Hadoop works pretty similar in the way of solving MapReduce problems.
Hadoop is pretty relevant if you talk about HDFS point of view. The HDFS is a well known used solution for big data storage. But your question is about MapReduce.
Spark is the best option if you are talking about good machines with real good configuration of memory and network throughput. But we know that kind of machines are expensive and sometimes you best option is to use Hadoop to process your data. Spark is great and fast but sometimes you get crazy with Memory issues if you don't have a good cluster in case of fit too much data in the memory. Hadoop in this case can be better. But this problem year after year are less relevant.
So hadoop is here com complement Spark, Hadoop is not only MapReduce Hadoop is an ecosystem. Spark doesn't have a distributed file system, to Spark works well you need one, Spark doesn't have a resource manager, Hadoop has called Yarn. And Spark in a cluster mode need a resource manager.
Conclusion
Hadoop still relevant as an ecosystem but as only mapReduce I can say that is not been used anymore.

Streaming data [Hadoop/MapReduce] - What are the challenges?

I have read in many places about Streaming data, but just trying to understand the challenges which are faced while processing it using Map Reduce technique?
i.e. the reason behind the existence of frameworks like Apache Flume, Apache Storm, etc.
Please share your advise & thoughts.
Thanks,
Ranit
There are many technologies out there, and many of them run on the Hadoop framework.
The older Hadoop services like Hive tend to be slow, and are usually used for batch jobs, not for streaming.
As streaming becomes more and more a necessity, other services have surfaced like Storm or Spark that are designed for faster execution and integration with messaging queues like Kafka for streaming.
In data analytics though, most of the time processing is not al real time: historical data may be processed in batch mode to extract models that are then used for real-time analytics, so a 'streaming' system is usually based on a Lambda Architecture http://lambda-architecture.net/
A service like Spark tries to integrate all of the components, with Spark Streaming for the speed layer, Spark SQL for the Serving layer, Spark MLLib for the modeling, all based on Hadoop Distributed File system (hdfs) for replicated large volume storage.
Flume helps in directing the data from source to hdfs for raw storage, but in order to process it, Storm or Spark are used.
Hope that helps.
Your question is open eneded. But I assume you want to understand the challenges of processing streaming data in Map Reduce environment.
1) Map Reduce is primarily designed for batch processing. It is for processing high volume of data which is at rest in disk.
2) The streaming data is a high velocity of data, which are coming from various sources like Web Application Click Stream, Social Media Logs, Twitter Tags, Application logs.
3) The stream of events might be processed either stateless manner ( assuming every event is unique) or in a stateful manner (collect the data for 2 seconds and processes them) but batch applications does not have any such requirement.
4) Streaming applications wants delivery / process guarantee. For example, the frameworks must provide "exactly once" delivery/process mechanism, so that it processes all the stream events without fail. It is not a challenge in batch processing since all the data is available locally.
5) External Connectors : Streaming frameworks must support external connectivity to read data in realtime from various sources as we discussed in (2). This is not a challenge in batch, since the data is locally available.
Hope this helps.

HadoopFS (HDFS) as distributive file storage

I'm consider to use HDFS as horizontal scaling file storage system for our client video hosting service. My main concern that HDFS wasn't developed for this needs this is more "an open source system currently being used in situations where massive amounts of data need to be processed".
We don't want to process data just store them, create on a base of HDFS something like small internal Amazon S3 analog.
Probably important moment is that stored file size will be quite git from 100Mb to 10Gb.
Did anyone use HDFS in such purposes?
If you are using an S3 equivalient then it should already provide a distributed, mountable file-system no? Perhaps you can check out OpenStack at http://openstack.org/projects/storage/.
The main disadvantage would be the lack of POSIX semantics. You can't mount the drive, and you need special APIs to read and write from it. The Java API is the main one. There is a project called libhdfs that makes a C API over JNI, but I've never used it. Thriftfs is another option.
I'm also not sure about the read performance compared to other alternatives. Maybe someone else knows. Have you checked out other distributed filesystems like Lustre?
You may want to consider MongoDB for this. They have GridFS which will allow you to use it as a storage. You can then horizontally scale your storage through shards and provide fault tolerance with replication.
http://docs.mongodb.org/manual/core/gridfs/
http://docs.mongodb.org/manual/replication/
http://docs.mongodb.org/manual/sharding/

Resources