What is the difference between HUE, YARN and OOZIE - hadoop

I understand the concepts of HDFS and Map Reduce and how it is important to move the processing logic to the data to increase efficiency. I was even able to run a couple of map reduce jobs on my basic Hadoop cluster. Surrounding these concepts there are a lot of different technologies like YARN, HUE and OOZIE, all of which seem to do the same thing (at least from a very high level): provide operational visibility and CRUD abilities for jobs (which can be map-reduce or something else).
Am I correct in making this assumption or is there a much more fundamental difference between them?
Thanks
Kay

YARN - Map Reduce is an API in which you implement your data processing logic. Once the code is compiled, you submit the job with the hadoop jar command. YARN is the framework that keeps track of the cluster's resources, schedules and runs the submitted job on the cluster, and shows/logs its progress.
OOZIE - Take a data integration example. You might have to get one data set from one database and another data set from a second database, then join and process the data and load the result into a cache or a third database. That involves two Sqoop jobs to pull the data from the databases, a Hive/map reduce job to join and process it, and a final step to push it into the cache/database. All these jobs depend on each other, e.g. the data can only be processed after it has been pulled from the source databases. Hence we need a workflow that executes the complete data integration process, and OOZIE facilitates that. It is a map reduce based workflow tool: the workflow itself is executed as one or more map reduce jobs.
HUE - There are many tools in Hadoop: HDFS (file system), Sqoop, Hive/Pig to process the data, Impala, HBase and many more. When running POCs, it can get tedious to connect to the cluster for each of them, and it also requires some Linux skills. To overcome those challenges, the Hadoop ecosystem tools are consolidated under one umbrella, a web interface called Hue.
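The dependency ordering in the OOZIE example above can be illustrated with a small sketch. This is plain Python, not Oozie's workflow format or API; the step names are hypothetical, and the only point is that a workflow tool runs a step once everything it depends on has finished.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical step names for the data-integration example. Each step maps
# to the set of steps that must finish before it may start.
workflow = {
    "sqoop_import_db1": set(),
    "sqoop_import_db2": set(),
    "hive_join": {"sqoop_import_db1", "sqoop_import_db2"},
    "export_to_cache": {"hive_join"},
}

def execution_waves(deps):
    """Group steps into waves; all steps in one wave have their dependencies met."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the wave as finished to unlock the next one
    return waves

waves = execution_waves(workflow)
for number, steps in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(steps)}")
```

The two Sqoop imports land in the same wave because they do not depend on each other, while the join and the export each wait for the previous wave.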

Related

Why does one action produce two jobs?

I use Spark 2.1.0.
Why does the following one action produce 2 identical jobs (same DAG in each one)? Shouldn't it produce just 1? Here you have the code:
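// Assumed definition: the question does not show the Person case class.
case class Person(name: String, age: Int)
// Needed for toDF outside spark-shell (spark-shell imports it automatically).
import spark.implicits._

val path = "/usr/lib/spark/examples/src/main/resources/people.txt"
val peopleDF = spark.
  sparkContext.
  textFile(path, 4).
  map(_.split(",")).
  map(attr => Person(attr(0), attr(1).trim.toInt)).
  toDF
peopleDF.show()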
I see this in the graphical interface (the Spark UI) when checking what is going on. I suppose it has something to do with all the DataFrame transformations.
Although in general a single SQL query may lead to more than one Spark job, in this particular case Spark 2.3.0-SNAPSHOT gives only one (contrary to what you see).
Job 12 is also pretty nice, i.e. just a single-stage, no-shuffle Spark job.
The reason you may see more than one Spark job per Spark SQL structured query (using SQL or the Dataset API) is that Spark SQL offers a high-level API atop RDDs and uses RDDs and actions freely to make your life as a Spark developer and a Spark performance tuning expert easier. In most cases (esp. when you want to build abstractions), you would have to fire up the Spark jobs yourself to achieve comparable performance.

Can we use cached RDD across batches on an executor

I have a case where I want to download some data from a remote store every hour and store it as key-value pairs in an RDD on an executor/worker. I want to cache this RDD so that all future jobs/tasks/batches running on this executor/worker can use the cached RDD to do a lookup. Is this possible in Spark Streaming?
Some relevant code or pointers to relevant code will be helpful.
Alluxio is a memory-centric distributed storage system. Alluxio can be used to cache Spark RDDs in memory, for multiple and future Spark applications and jobs to access.
Spark can store RDDs in Alluxio memory, and future Spark jobs can read them from Alluxio memory. That blog post has more details on how that works. Here is information on how to set up and configure Alluxio with Spark.
Given your requirements, here is what I would propose:
Run a Spark application every hour that gets the data from the external data source and appends it to a Hive table.
Use the Spark Thrift Server to access the data.
Note: Your notion of "caching within an executor to use across applications" is not correct. Executors belong to a single Spark application, and so does any RDD within that application.
If you really need to invest in caching data on distributed nodes, you may want to consider off-heap in-memory storage systems such as Alluxio (formerly called Tachyon).
If you just need a giant, distributed map, and you want to use Spark, write a standalone job that downloads the data every hour and caches the resulting RDD (you can unpersist the old RDD). Let us call this job DataRefresher.
You can then expose a REST API (if you are on Scala, consider using Scalatra) that wraps the DataRefresher and returns the value for a given key. Something like http://localhost:9191/lookup/key, which other jobs can use to do a relatively fast lookup.
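The refresh-and-lookup idea behind the DataRefresher can be sketched independently of Spark. The following is a minimal illustration in plain Python, not a Spark job or a REST service: the class layout, the injectable clock and the fake loader are all assumptions made for the example. The same swap-the-snapshot logic would sit behind the lookup endpoint.

```python
import time

class DataRefresher:
    """Holds a key-value snapshot and reloads it once it is older than max_age."""

    def __init__(self, loader, max_age_seconds=3600, clock=time.monotonic):
        self._loader = loader            # callable returning a dict (the remote fetch)
        self._max_age = max_age_seconds
        self._clock = clock              # injectable so the refresh logic can be tested
        self._snapshot = {}
        self._loaded_at = None

    def _refresh_if_stale(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._max_age:
            # Build the new snapshot first, then swap the reference, so a
            # reader never sees a half-loaded map (the old one is dropped).
            self._snapshot = dict(self._loader())
            self._loaded_at = now

    def lookup(self, key, default=None):
        self._refresh_if_stale()
        return self._snapshot.get(key, default)

# Demo with a fake clock and a fake remote store (both hypothetical).
fake_now = [0.0]
versions = iter([{"k": "v1"}, {"k": "v2"}])
refresher = DataRefresher(lambda: next(versions), clock=lambda: fake_now[0])

first = refresher.lookup("k")    # first call loads the initial snapshot
fake_now[0] = 1800.0
midway = refresher.lookup("k")   # half an hour later: still the cached snapshot
fake_now[0] = 3600.0
later = refresher.lookup("k")    # an hour later: the snapshot is reloaded
print(first, midway, later)
```

Refreshing lazily on lookup keeps the sketch single-threaded; a real service would more likely refresh on a timer so that no caller pays for the download.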

Please clarify my understanding of Hadoop/HBase

I have been reading white papers and watching YouTube videos for half the day now and believe I have a proper understanding of the technology, but before I start my project I want to make sure it's right.
So with that, here's what I think I know.
As I understand the architecture of Hadoop and HBase, they pretty much model out like this:
-----------------------------------------
|               Mapreduce               |
-----------------------------------------
| Hadoop | <-- hbase export -- | HBase  |
|        | -- apache pig -->   |        |
-----------------------------------------
|                 HDFS                  |
-----------------------------------------
In a nutshell HBase is a completely different DB engine tuned for real time updates and queries that happens to run on the HDFS and is compatible with Mapreduce.
Now, assuming the above is correct, here is what else I think I know.
1. Hadoop is designed for big data from start to finish. The engine uses a distributed append-only system, which means you cannot delete data once it's inserted. To access the data you can use Mapreduce, or the HDFS shell and HDFS API.
2. Hadoop does not like small chunks and it was never intended to be a real time system. You would not want to store a single person and address per file; you would in fact store a million people and addresses per file and insert the large file.
3. HBase on the other hand is a pretty typical NoSql database engine that in spirit compares to CouchDB, RavenDB, etc. The notable difference is that it's built using the HDFS from Hadoop, allowing it to scale reliably to sizes only limited by your wallet.
4. Hadoop is a collection of File System (HDFS) and Java APIs to perform computation on HDFS. HBase is a NoSql database engine that uses HDFS to efficiently store data across a cluster.
5. To build a Mapreduce job to access data from both Hadoop and HBase, one would be best off using HBase export to push the HBase data into Hadoop and writing the job to process the data there, but Mapreduce can access both systems one at a time.
6. You must be very careful when designing your HBase files, as HBase does not natively support indexing fields within that file; HBase only indexes the primary key. Many tips and tricks help work around this fact.
Ok, so if I'm still accurate to this point, this would be a valid use case.
You build the site with HBase. You use HBase the same as you would any other NoSql or RDBMS to build out your functionality. Once that's done, you put your metrics logging points in the code to record your metrics in, say, log4j. You create a new appender in log4j with rules that say when the log file reaches 1 gig in size, push it to the Hadoop cluster, delete it, create a new file, go on with life.
Later, a Mapreduce developer can write a routine that uses HBase export to grab a data set from HBase, say a list of user IDs, then go to the logs that are stored in Hadoop and find the bread crumb trail for each user through the system for a given timespan.
Ok, with that all said, now for the specific question. Are statements 1 - 6 accurate?
Edit one: I have updated my beliefs above based on the answers received.
You can access the file in HDFS directly via HDFS shell or HDFS API.
Correct.
I am not familiar with CouchDB or RavenDB, but HBase has no built-in secondary indexes, so you must carefully design your row key to speed up your queries. There are a lot of HBase schema design tips on the internet you can google for.
I think it is more appropriate to say Hadoop is a computing engine rather than a database engine. If you want to import HDFS data to HBase, you can use Apache Pig as stated in this post. If you want to export HBase data to HDFS, you can use the export utility.
MapReduce is a component of the Hadoop framework and it does not sit on top of HBase. You can access HBase data in a MapReduce job because HBase uses HDFS for its storage. I don't think you want to access the HFile directly from a MapReduce job, because the raw file is encoded in a special format that is not easy to parse and might change in future releases.
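To make the row key advice above concrete, here is a small sketch of one common pattern: a composite key that puts the field you query by first. It is plain Python over a sorted list, not the HBase client API; the key layout, the `#` separator and the timestamp bound are assumptions chosen for illustration. Because rows are sorted by key, "all rows for one user, newest first" becomes a contiguous range.

```python
import bisect

MAX_TS = 10**13 - 1  # assumed upper bound for millisecond timestamps (13 digits)

def make_row_key(user_id, timestamp_ms):
    """Composite key: user id, then a reversed timestamp so newer rows sort first."""
    return f"{user_id}#{MAX_TS - timestamp_ms:013d}"

def prefix_scan(sorted_keys, prefix):
    """Return the keys starting with prefix, located by binary search."""
    start = bisect.bisect_left(sorted_keys, prefix)
    end = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[start:end]

# Hypothetical events: (user id, timestamp in milliseconds).
events = [("alice", 1000), ("bob", 1500), ("alice", 3000), ("alice", 2000)]
table = sorted(make_row_key(user, ts) for user, ts in events)

alice_rows = prefix_scan(table, "alice#")
print(alice_rows)
```

A query on any field that is not part of the key (for example "all events in a time window across all users") would still need a full scan, which is why the key has to be designed around the queries you expect.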
"Since HBase and Hadoop are different database engines, one can not access the data in the other directly. For HBase to get something out of Hadoop, it must go thru Mapreduce and vice versa."
This is not true, since Hadoop is not a database engine. Hadoop is a collection of File System (HDFS) and Java APIs to perform computation on HDFS.
Furthermore, Map Reduce is not a technology; it is a model that lets you work in parallel on HDFS data.

Cassandra and MapReduce - minimal setup requirements

I need to execute MapReduce on my Cassandra cluster, including data locality, i.e. each job queries only rows which belong to the local Cassandra node where the job runs.
Tutorials exist on how to set up Hadoop for MR on an older Cassandra version (0.7). I cannot find one for the current release.
What has changed since 0.7 in this regard?
What software modules are required for a minimal setup (Hadoop+HDFS+...)?
Do I need Cassandra Enterprise?
Cassandra contains a few classes which are sufficient to integrate with Hadoop:
ColumnFamilyInputFormat - This is the input for the Map function. It can read all rows from a single column family (CF) when using Cassandra's random partitioner, or it can read a row range when used with Cassandra's ordered partitioner. A Cassandra cluster forms a ring, where each part of the ring is responsible for a concrete key range. The main task of an InputFormat is to divide the Map input into parts that can be processed in parallel; those are called InputSplits. In Cassandra's case this is simple: each ring range has one master node, which means that the InputFormat creates one InputSplit for each ring element, and each split results in one Map task. Now we would like to execute our Map task on the same host where the data is stored. Each InputSplit remembers the IP address of its ring part, i.e. the IP address of the Cassandra node responsible for that particular key range. The JobTracker creates Map tasks from InputSplits and assigns them to TaskTrackers for execution. The JobTracker tries to find a TaskTracker which has the same IP address as the InputSplit, so basically we have to start a TaskTracker on each Cassandra host, and this will guarantee data locality.
ColumnFamilyOutputFormat - This configures the context for the Reduce function so that the results can be stored in Cassandra.
Results from all Map functions have to be combined before they can be passed to the Reduce function; this is called the shuffle. It uses the local file system, so from Cassandra's perspective nothing has to be done here; we just need to configure the path to a local temp directory. There is also no need to replace this solution with something else (like persisting in Cassandra): this data does not have to be replicated, because Map tasks are idempotent.
Basically, using the provided Hadoop integration gives us the possibility to execute Map jobs on the hosts where the data resides, and the Reduce function can store results back into Cassandra. That is all I need.
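The locality matching described above (one split per ring range, each split remembering its owning node, the scheduler preferring a tracker on that node) can be sketched as follows. This is a simplified illustration in Python, not the JobTracker's actual scheduling code: the function name, the fallback rule and the IP addresses are assumptions, and a real scheduler also considers free task slots.

```python
def assign_splits(splits, task_trackers):
    """Prefer a tracker on the host that owns the split's key range.

    splits: list of (split_id, owner_host); task_trackers: set of hosts.
    Returns {split_id: (host, is_local)}.
    """
    fallback = sorted(task_trackers)
    assignments = {}
    for index, (split_id, owner_host) in enumerate(splits):
        if owner_host in task_trackers:
            # A tracker runs on the node that stores this range: local read.
            assignments[split_id] = (owner_host, True)
        else:
            # No tracker on the owning node: run elsewhere, read over the network.
            assignments[split_id] = (fallback[index % len(fallback)], False)
    return assignments

# Hypothetical three-node ring with task trackers on only two of the nodes.
ring = [("range-0", "10.0.0.1"), ("range-1", "10.0.0.2"), ("range-2", "10.0.0.3")]
trackers = {"10.0.0.1", "10.0.0.2"}
plan = assign_splits(ring, trackers)
for split_id, (host, is_local) in plan.items():
    print(split_id, host, "local" if is_local else "remote")
```

In this toy setup the third range has no tracker on its own node, so its map task runs remotely; starting a tracker on every Cassandra host removes that case.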
There are two possibilities to execute Map-Reduce:
org.apache.hadoop.mapreduce.Job - this class simulates Hadoop in one process. It executes the Map-Reduce task and does not require any additional services/dependencies; it only needs access to a temp directory to store the results of the map job for the shuffle. Basically we have to call a few setters on the Job class, which take things like the class names for the Map task and the Reduce task, the input format and the Cassandra connection. When setup is done, job.waitForCompletion(true) has to be called; it starts the Map-Reduce task and waits for the results. This solution can be used to quickly get into the Hadoop world, and for testing. It will not scale (single process), and it will fetch data over the network, but it will be fine for a beginning.
Real Hadoop cluster - I have not set it up yet, but as I understand it, the Map-Reduce jobs from the previous example will work just fine. We additionally need HDFS, which will be used to distribute the jars containing the Map-Reduce classes in the Hadoop cluster.
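What "map, shuffle, reduce in one process" amounts to can be shown with a toy runner. This is a conceptual sketch in Python, not the Hadoop Job class; the function names and the word-count example are assumptions for illustration. An in-memory dictionary plays the role of the temp directory used for the shuffle.

```python
from collections import defaultdict

def run_map_reduce(records, map_fn, reduce_fn):
    """Run map, shuffle (group by key) and reduce in a single process."""
    shuffled = defaultdict(list)
    for record in records:
        # Map: each input record may emit any number of (key, value) pairs.
        for key, value in map_fn(record):
            shuffled[key].append(value)  # shuffle: group the values by key
    # Reduce: one call per distinct key, with all values for that key.
    return {key: reduce_fn(key, values) for key, values in sorted(shuffled.items())}

def count_words(line):
    return [(word, 1) for word in line.split()]

def total(key, values):
    return sum(values)

result = run_map_reduce(["a b a", "b c"], count_words, total)
print(result)
```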
Yes, I was looking for the same thing. It seems DataStax Enterprise has a simplified Hadoop integration.
Read this: http://wiki.apache.org/cassandra/HadoopSupport

Can Hadoop MapReduce run over other filesystems?

I heard that for mapreduce jobs the input need not be in HDFS; it can be on another file system. Can someone please give me more information on this?
I am a little confused about this. In standalone mode, data can be on the local file system. But in cluster mode, how can we point mapreduce jobs to some other file system?
No, it does not need to be in HDFS. For instance, jobs which target HBase using its TableInputFormat pull records over the network from HBase nodes as inputs to their map tasks. The DBInputFormat can be used to pull data from a SQL database into a job. You could build an input format that does something like read data off an NFS mount.
In practice you want to avoid pulling data over the network if you can. MR performance is much better if you can have your data locally on the nodes where the job is being run since Disk Throughput > Network Throughput.
Based on the InputFormat set on the job, Hadoop can read from any source. Hadoop provides a couple of InputFormats. It is also not difficult to write a custom InputFormat, say to provide a proprietary format as input to a job.
Along the same lines, Hadoop provides a couple of OutputFormats, and it shouldn't be difficult to write a custom OutputFormat either.
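The two responsibilities of an input format mentioned above, dividing the source into splits and reading records from one split, can be sketched like this. It is a toy in Python and not Hadoop's InputFormat interface; the class name, method names and the list standing in for a database or NFS file are assumptions for illustration.

```python
class RangeInputFormat:
    """Toy input format over any indexable source (a list stands in for a DB table or file)."""

    def __init__(self, source, num_splits):
        self.source = source
        self.num_splits = num_splits  # must be at least 1

    def get_splits(self):
        """Divide the source into contiguous (start, end) index ranges."""
        size = len(self.source)
        step = max(1, -(-size // self.num_splits))  # ceiling division, never zero
        return [(start, min(start + step, size)) for start in range(0, size, step)]

    def read_records(self, split):
        """Yield (key, value) records for one split, as a record reader would."""
        start, end = split
        for index in range(start, end):
            yield index, self.source[index]

fmt = RangeInputFormat(["r0", "r1", "r2", "r3", "r4"], num_splits=2)
splits = fmt.get_splits()
second_split_records = list(fmt.read_records(splits[1]))
print(splits, second_split_records)
```

Each split can then be handed to a separate map task, which is what makes the source readable in parallel regardless of where it lives.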
Here is a nice article on the DBInputFormat.
Another way to achieve this is to put files into HDFS that contain information about where the real data is. The mapper gets this information and pulls the real data for processing.
For example, we can have several files with URLs of the data to be processed.
What we lose in this case is data locality; otherwise it is fine.
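The pointer-file approach above can be sketched as a mapper that treats each input line as a location and fetches the real data itself. This is a minimal Python illustration, not Hadoop Mapper code; the function name, the example URLs and the dictionary standing in for remote storage are hypothetical.

```python
def pointer_mapper(pointer_lines, fetch):
    """For each location listed in the input, fetch the real data and emit it."""
    for line in pointer_lines:
        location = line.strip()
        if not location:
            continue  # skip blank lines in the pointer file
        yield location, fetch(location)

# Stand-in for remote storage; a real job might fetch over HTTP instead.
fake_store = {
    "http://example.com/part-1": "alpha",
    "http://example.com/part-2": "beta",
}
pointer_file = ["http://example.com/part-1\n", "\n", "http://example.com/part-2\n"]
fetched = list(pointer_mapper(pointer_file, fake_store.__getitem__))
print(fetched)
```

Only the small pointer files are stored on the cluster, so every fetch goes over the network, which is the loss of data locality noted above.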

Resources