I need to run MapReduce on my Cassandra cluster with data locality, i.e. each job should query only the rows that belong to the local Cassandra node where the job runs.
Tutorials exist on how to set up Hadoop for MR on an older Cassandra version (0.7), but I cannot find one for the current release.
What has changed since 0.7 in this regard?
What software modules are required for a minimal setup (Hadoop+HDFS+...)?
Do I need Cassandra Enterprise?

Cassandra contains a few classes which are sufficient to integrate with Hadoop:
ColumnFamilyInputFormat - This is the input for a Map function. It can read all rows from a single CF when Cassandra's random partitioner is used, or a row range when Cassandra's ordered partitioner is used. A Cassandra cluster forms a ring, where each part of the ring is responsible for a concrete key range. The main task of an InputFormat is to divide the Map input into parts that can be processed in parallel; these are called InputSplits. In Cassandra's case this is simple: each ring range has one master node, so the InputFormat creates one InputSplit for each ring element, and each InputSplit results in one Map task. Now we would like to execute each Map task on the same host where its data is stored. Each InputSplit remembers the IP address of its ring part, which is the IP address of the Cassandra node responsible for that key range. The JobTracker creates Map tasks from the InputSplits and assigns them to TaskTrackers for execution. It tries to find a TaskTracker with the same IP address as the InputSplit, so we have to start a TaskTracker on each Cassandra host to get data locality.
ColumnFamilyOutputFormat - This configures the context for the Reduce function so that the results can be stored in Cassandra.
Results from all Map functions have to be combined before they can be passed to the Reduce function; this is called the shuffle. It uses the local file system, so from Cassandra's perspective nothing has to be done here: we just need to configure the path to a local temp directory. There is also no need to replace this with something else (like persisting in Cassandra), because this data does not have to be replicated and Map tasks are idempotent.
Basically, the provided Hadoop integration gives us the possibility to execute Map tasks on the hosts where the data resides, and the Reduce function can store results back into Cassandra. That is all I need.
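The split-and-locality idea described above can be sketched in plain Java. This is only an illustration under simplifying assumptions, not the actual ColumnFamilyInputFormat or JobTracker code: the class `RingLocalitySketch`, its `Split` type, the tokens and the IP addresses are all hypothetical, and one split per ring range is assumed, as in the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these names come from Cassandra or Hadoop.
class RingLocalitySketch {
    /** One ring range plus the address of the node that owns it. */
    static final class Split {
        final String startToken;
        final String endToken;
        final String host;

        Split(String startToken, String endToken, String host) {
            this.startToken = startToken;
            this.endToken = endToken;
            this.host = host;
        }

        @Override
        public String toString() {
            return "(" + startToken + ", " + endToken + "] on " + host;
        }
    }

    /**
     * Builds one split per ring range. The map must iterate in ring order
     * (token -> owning host); each range ends at its owner's token.
     */
    static List<Split> splitsFor(Map<String, String> tokenToHost) {
        List<String> tokens = new ArrayList<>(tokenToHost.keySet());
        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            // The range (previous token, this token] wraps around the ring.
            String start = tokens.get((i + tokens.size() - 1) % tokens.size());
            String end = tokens.get(i);
            splits.add(new Split(start, end, tokenToHost.get(end)));
        }
        return splits;
    }

    /** Prefers a tracker on the split's host; otherwise falls back to the first tracker. */
    static String assign(Split split, List<String> trackers) {
        return trackers.contains(split.host) ? split.host : trackers.get(0);
    }

    public static void main(String[] args) {
        Map<String, String> ring = new LinkedHashMap<>();
        ring.put("0", "10.0.0.1");
        ring.put("100", "10.0.0.2");
        ring.put("200", "10.0.0.3");
        List<String> trackers = List.of("10.0.0.1", "10.0.0.2", "10.0.0.3");
        for (Split s : splitsFor(ring)) {
            System.out.println(s + " -> task tracker " + assign(s, trackers));
        }
    }
}
```

When a tracker runs on every Cassandra host, each split is assigned to its own host. When one is missing, the split falls back to a remote tracker and the data is read over the network.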
There are two ways to execute Map-Reduce:
org.apache.hadoop.mapreduce.Job - this class simulates Hadoop in one process. It executes a Map-Reduce task and does not require any additional services or dependencies; it only needs access to a temp directory to store the results of the map phase for the shuffle. Basically we have to call a few setters on the Job class for things like the class names of the Map task and Reduce task, the input format, and the Cassandra connection. When setup is done, job.waitForCompletion(true) has to be called; it starts the Map-Reduce task and waits for the results. This solution can be used to get into the Hadoop world quickly, and for testing. It will not scale (single process) and it will fetch data over the network, but it is fine for a beginning.
Real Hadoop cluster - I have not set it up yet, but as I understand it, the Map-Reduce jobs from the previous example will work just fine. We additionally need HDFS, which is used to distribute the jars containing the Map-Reduce classes across the Hadoop cluster.
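To make the map, shuffle and reduce phases mentioned above concrete, here is a minimal single-process sketch in plain Java. It deliberately does not use the Hadoop API; the class `InProcessMapReduceSketch` and the word-count task are assumptions chosen only for illustration, and the in-memory grouping stands in for the shuffle that Hadoop performs through the local temp directory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a map/shuffle/reduce pipeline; not Hadoop code.
class InProcessMapReduceSketch {
    /** Map phase: emit one (word, 1) pair per word in a row. */
    static List<Map.Entry<String, Integer>> map(String row) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : row.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    /** Shuffle phase: group all mapped values by key. */
    static Map<String, List<Integer>> shuffle(List<String> rows) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String row : rows) {
            for (Map.Entry<String, Integer> kv : map(row)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        return grouped;
    }

    /** Reduce phase: sum the grouped values for each key. */
    static Map<String, Integer> run(List<String> rows) {
        Map<String, Integer> result = new TreeMap<>();
        shuffle(rows).forEach((key, values) ->
                result.put(key, values.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c")));
    }
}
```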

Yes, I was looking for the same thing. It seems DataStax Enterprise has a simplified Hadoop integration,
read this


What is the difference between HUE, YARN and OOZIE?

I understand the concepts of HDFS and MapReduce and why it is important to move the processing logic to the data to increase efficiency. I was even able to run a couple of MapReduce jobs on my basic Hadoop cluster. Surrounding these concepts there are a lot of different technologies like YARN, HUE and OOZIE, all of which seem to do the same thing (at least from a very high level), which is operational visibility and CRUD abilities for jobs (which can be MapReduce or something else).
Am I correct in making this assumption, or is there a much more fundamental difference between them?
YARN - MapReduce is an API in which you implement the data processing logic. Once the code is compiled, you submit the job using the hadoop jar command. YARN is the framework that keeps track of resources, submits the job to the cluster, executes it, and shows/logs its progress.
OOZIE - Take a data integration example. You might have to get one data set from one database and another data set from a second database, then join and process the data and reload it into a cache or a third database. This involves 2 Sqoop jobs to pull data from the databases, a Hive/MapReduce job to join and process the data, and then a push into the cache/database. All these jobs depend on each other, e.g. we are supposed to process the data only after it has been pulled from the source databases. Hence we need to create a workflow to execute the complete data integration process. OOZIE can facilitate that. It is a MapReduce-based workflow tool; the workflow itself is executed as one or more MapReduce jobs.
HUE - There are many tools in Hadoop: HDFS (file system), Sqoop, Hive/Pig to process the data, Impala, HBase and many more. When executing POCs, it can get tedious to connect to the cluster, and it also needs some Linux skills. To overcome those challenges, all the Hadoop ecosystem tools are consolidated under one umbrella called Hue.
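The dependency ordering in the OOZIE example can be sketched as follows. This is not how Oozie is configured (Oozie workflows are defined in XML); it is only a plain-Java illustration of running actions after the actions they depend on. The class `WorkflowOrderSketch` and the action names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of dependency-ordered execution; not the Oozie API.
class WorkflowOrderSketch {
    /** Returns an order in which every action comes after all actions it depends on. */
    static List<String> executionOrder(Map<String, List<String>> dependsOn) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String action : dependsOn.keySet()) {
            visit(action, dependsOn, done, inProgress, order);
        }
        return order;
    }

    /** Depth-first visit: schedule all dependencies first, then the action itself. */
    private static void visit(String action, Map<String, List<String>> dependsOn,
                              Set<String> done, Set<String> inProgress, List<String> order) {
        if (done.contains(action)) {
            return;
        }
        if (!inProgress.add(action)) {
            throw new IllegalStateException("Cyclic dependency at " + action);
        }
        for (String dependency : dependsOn.getOrDefault(action, List.of())) {
            visit(dependency, dependsOn, done, inProgress, order);
        }
        inProgress.remove(action);
        done.add(action);
        order.add(action);
    }

    public static void main(String[] args) {
        Map<String, List<String>> workflow = new LinkedHashMap<>();
        workflow.put("load-cache", List.of("join"));
        workflow.put("join", List.of("sqoop-import-a", "sqoop-import-b"));
        workflow.put("sqoop-import-a", List.of());
        workflow.put("sqoop-import-b", List.of());
        System.out.println(executionOrder(workflow));
    }
}
```

In this sketch the two imports are scheduled before the join, and the join before the load, which is the ordering the data integration example requires.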

How to use map reduce output as an input for another map reduce job?

In the first MapReduce job I am processing an HBase table and outputting a smaller list of rowkeys. I need to use this list of strings in a second MapReduce job, which pulls from a different HBase table and outputs to another HBase table. What is the proper way to store and access the output of the first MapReduce job?
Hadoop doesn't support streaming the output of one MR job to another, so the output of the first MR job has to be stored in HDFS (or some other persistent storage) and then read by the second MR job. Create a DAG of jobs using Oozie or Azkaban. For a simple workflow, use Hadoop's JobControl API.
Apache Tez, which is still in the incubator phase, allows streaming of data across MR tasks, so use it with a bit of caution.
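The store-then-read pattern can be sketched in plain Java, with a local temp file standing in for HDFS. Everything here is an assumption made for illustration: the class `ChainedJobsSketch`, the prefix filter used as the "first job", and the string tagging used as the "second job". A real pipeline would use Hadoop input and output formats instead.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration; a local file stands in for HDFS.
class ChainedJobsSketch {
    /** "Job 1": keeps the row keys matching a prefix and persists them, one per line. */
    static void firstJob(List<String> rowKeys, String prefix, Path output) {
        List<String> selected = rowKeys.stream()
                .filter(key -> key.startsWith(prefix))
                .collect(Collectors.toList());
        try {
            Files.write(output, selected);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** "Job 2": reads the persisted row keys and processes each one. */
    static List<String> secondJob(Path input) {
        try {
            return Files.readAllLines(input).stream()
                    .map(key -> "processed:" + key)
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs job 1, then job 2 on its output, only after job 1 has finished. */
    static List<String> runChain(List<String> rowKeys, String prefix) {
        try {
            Path intermediate = Files.createTempFile("job1-output", ".txt");
            try {
                firstJob(rowKeys, prefix, intermediate);
                return secondJob(intermediate);
            } finally {
                Files.deleteIfExists(intermediate);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runChain(List.of("user-1", "order-7", "user-2"), "user-"));
    }
}
```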

Reduce job pending in HFileOutputFormat

I am using
HBase: 0.92.1-cdh4.1.2, and
I have a MapReduce program that loads data from HDFS into HBase using HFileOutputFormat in cluster mode.
In that program I'm using HFileOutputFormat.configureIncrementalLoad() to bulk load an 800000-record
data set of 7.3GB, and it runs fine, but it does not run for a 900000-record data set of 8.3GB.
In the 8.3GB case my MapReduce program has 133 maps and one reducer. All maps complete successfully, but the reducer stays in Pending status for a long time. There is nothing wrong with the cluster, since other jobs run fine and this job also runs fine with up to 7.3GB of data.
What could I be doing wrong?
How do I fix this issue?
I ran into the same problem. Looking at the JobTracker logs, I noticed there was not enough free space for the single reducer to run on any of my nodes:
2013-09-15 16:55:19,385 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node has 503,777,017,856 bytes free; but we expect reduce input to take 978136413988
This 503 GB refers to the free space available on one of the hard drives of that particular slave, so the reducer apparently needs to copy all the data to a single drive.
The reason this happens is that a brand-new table has only one region, and the job gets one reducer per region. As data is inserted into that region, it will eventually split on its own.
A solution is to pre-create your regions when creating the table. The Bulk Loading chapter in the HBase book discusses this and presents two options for doing it. It can also be done via the HBase shell (see create's SPLITS argument, I think). The challenge, though, is defining your splits so that the regions get an even distribution of keys. I've yet to solve this problem perfectly, but here's what I'm doing currently:
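// Assumes `admin` is an existing HBaseAdmin; "my_table" is a placeholder table name.
HTableDescriptor desc = new HTableDescriptor("my_table");
desc.addFamily(new HColumnDescriptor("my_col_fam"));
// Pre-split into 100 regions with boundaries between the int keys 0 and Integer.MAX_VALUE.
admin.createTable(desc, Bytes.toBytes(0), Bytes.toBytes(2147483647), 100);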
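To show what "evenly distributed" split points look like for integer row keys, here is a small plain-Java sketch. It is not HBase's own splitting code; the class `EvenSplitKeysSketch` is hypothetical, and it assumes row keys are 4-byte big-endian non-negative integers (the encoding Bytes.toBytes(int) produces) that are roughly uniformly distributed over the range.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; does not call or replicate HBase internals.
class EvenSplitKeysSketch {
    /** Returns numRegions - 1 boundaries evenly spaced between start and end. */
    static List<byte[]> splitKeys(int start, int end, int numRegions) {
        List<byte[]> keys = new ArrayList<>();
        long span = (long) end - start; // long arithmetic avoids int overflow
        for (int i = 1; i < numRegions; i++) {
            int boundary = (int) (start + span * i / numRegions);
            // ByteBuffer writes big-endian by default, so byte order matches numeric order
            // for non-negative values.
            keys.add(ByteBuffer.allocate(4).putInt(boundary).array());
        }
        return keys;
    }

    public static void main(String[] args) {
        for (byte[] key : splitKeys(0, Integer.MAX_VALUE, 4)) {
            System.out.println(ByteBuffer.wrap(key).getInt());
        }
    }
}
```

If the real keys are skewed, for example string keys or hashed keys with a common prefix, evenly spaced boundaries like these will not give evenly sized regions, and the boundaries would have to be derived from a sample of the data instead.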
An alternative solution would be to not use configureIncrementalLoad, and instead: 1) just generate your HFiles via MapReduce with no reducers; 2) use the completebulkload feature in hbase.jar to import your records into HBase. Of course, I think this runs into the same problem with regions, so you'll want to create the regions ahead of time too (I think).
Your job is running with a single reducer, which means 7GB of data is being processed by a single task.
The main reason for this is that HFileOutputFormat starts a reducer that sorts and merges the data to be loaded into the HBase table.
Here, number of reducers = number of regions in the HBase table.
Increase the number of regions and you will achieve parallelism in the reducers. :)
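The one-reducer-per-region relationship can be illustrated with a small partitioning sketch in plain Java. It is not the partitioner Hadoop or HBase actually uses; the class `RegionPartitionerSketch` is hypothetical, keys are compared as Java strings rather than as raw bytes, and the first region's start key is assumed to be the empty string.

```java
import java.util.Arrays;

// Hypothetical illustration of routing a row key to the reducer for its region.
class RegionPartitionerSketch {
    /**
     * Returns the index of the region whose start key is the greatest one
     * that is less than or equal to rowKey. The array must be sorted and
     * begin with "" so that every row key falls into some region.
     */
    static int partition(String[] sortedRegionStartKeys, String rowKey) {
        int pos = Arrays.binarySearch(sortedRegionStartKeys, rowKey);
        // When the key is not found, binarySearch returns -(insertionPoint) - 1;
        // the owning region is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    public static void main(String[] args) {
        String[] regionStartKeys = {"", "g", "p"};
        for (String rowKey : new String[] {"apple", "g", "mango", "zebra"}) {
            System.out.println(rowKey + " -> reducer " + partition(regionStartKeys, rowKey));
        }
    }
}
```

With a single region the array holds only the empty start key, so every row maps to reducer 0, which is the situation described above.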
You can get more details here:

Is there any way to control in Hadoop MapReduce framework on which node reducer will be started?

In short, I need a way to give the Hadoop MapReduce API a hint about which host I'd like a certain reducer to run on, based on its partition. Is there any way?
Somewhat longer story:
I have a few mapper tasks which generate (or import from another source) records for a certain HBase table. Emitted records have ImmutableBytesWritable keys. The number of reducers for this job exactly matches the number of table regions, and a custom partitioner is used to distribute records so that the records of every region go to the appropriate reducer.
The reducers are intended to generate HFile images, one image per region, so that bulk load can later be used on them. The only serious problem is that I'd like the reducers to at least try to run on the same hosts where the corresponding region servers are running. This is to get a good probability that the generated HFiles are local (in terms of HDFS) to the corresponding HBase region servers.
Any idea how to get this behavior?
An alternative could be a way to request that an HDFS file become local. With this I could start another MR job with mappers bound to region servers (through splits) and request that the corresponding HFile become local.
There is no out-of-the-box way to do this yet, short of writing a custom scheduler, which would be overkill.
An upstream ticket does track this feature request at

Running pig on a multi node Cassandra cluster

I am working on a BI process that will read data from Cassandra, create summaries using MapReduce, and write back to a different keyspace.
Starting with a single node, everything worked as I expected, but after moving to multiple nodes I am not sure I fully understand the topology and configuration.
I have a setup with 3 nodes. Each has a Cassandra node (version 1.1.9), a data node and a task tracker (version 0.20.2+923.421-CDH3U5). The NameNode and job tracker are on a different server. At this point I am trying to run a Pig script from the DataNode server.
The thing I am not sure about is the Pig argument PIG_INITIAL_ADDRESS. I assumed the query would run on all Cassandra nodes, each task tracker would only query the local Cassandra node, and the reducer would handle any duplicates. Based on that assumption I thought PIG_INITIAL_ADDRESS should be localhost. But the Pig script fails with: Unable to connect to server localhost:9160
My questions are: should the initial address be any one of the Cassandra nodes, and is the map split across the cluster according to Cassandra's key partitions (will I get the distribution I need)?
If I were to use Java MapReduce, would I still need to supply the initial address?
Does the current implementation assume Pig is running from a Cassandra node?
PIG_INITIAL_ADDRESS is the address of one of the Cassandra nodes in your ring. For the Hadoop job to read data from or write data to Cassandra, it just needs to have some properties set. Those properties can also be set in the job properties or in the default Hadoop configuration on the server that you're running the job from. Other than that, it's just like submitting a job to a job tracker.
For more information, I would look at the readme in the Cassandra source download under examples/pig. There is a lot of explanation in there as well.
