Adding partitions to Hive from a MapReduce Job - hadoop

I am new to Hive and MapReduce and would really appreciate your answer; please also suggest the right approach.
I have defined an external table logs in Hive, partitioned on date and origin server, with an external location on HDFS at /data/logs/. I have a MapReduce job which fetches these log files, splits them, and stores them under the folder mentioned above, like:
"/data/logs/dt=2012-10-01/server01/"
"/data/logs/dt=2012-10-01/server02/"
...
...
From the MapReduce job I would like to add partitions to the table logs in Hive. I know of two approaches:
alter table commands -- too many alter table commands
adding dynamic partitions
For approach two I only see examples of INSERT OVERWRITE, which is not an option for me. Is there a way to add these new partitions to the table after the end of the job?

To do this from within a MapReduce job I would recommend using Apache HCatalog, a newer project under the Hadoop umbrella.
HCatalog really is an abstraction layer on top of HDFS, so you can write your outputs in a standardized way, be it from Hive, Pig or MapReduce. Where this comes into the picture for you is that you can load data directly into Hive from your MapReduce job using the output format HCatOutputFormat. Below is an example taken from the official website.
A current code example for writing out a specific partition for (a=1,b=1) would go something like this:
Map<String, String> partitionValues = new HashMap<String, String>();
partitionValues.put("a", "1");
partitionValues.put("b", "1");
HCatTableInfo info = HCatTableInfo.getOutputTableInfo(dbName, tblName, partitionValues);
HCatOutputFormat.setOutput(job, info);
And to write to multiple partitions, separate jobs will have to be kicked off with each of the above.
You can also use dynamic partitions with HCatalog, in which case you could load as many partitions as you want in the same job!
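As a rough sketch of the dynamic-partition case (using the newer OutputJobInfo-based API rather than the HCatTableInfo call above; the database and table names are placeholders), passing null partition values tells HCatalog to take the partition columns from each record:
// Hypothetical job setup for dynamic partitioning: a null partition-value map
// means the partition columns are read from the records themselves.
OutputJobInfo jobInfo = OutputJobInfo.create("mydb", "logs", null);
HCatOutputFormat.setOutput(job, jobInfo);
HCatOutputFormat.setSchema(job, HCatOutputFormat.getTableSchema(job.getConfiguration()));
job.setOutputFormatClass(HCatOutputFormat.class);
Note that, as the next answer points out, the schema returned by getTableSchema may not include the partition columns, which then have to be appended manually.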
I recommend reading further on HCatalog on the website provided above, which should give you more details if needed.

In reality, things are a little more complicated than that, which is unfortunate because it is undocumented in official sources (as of now), and it takes a few days of frustration to figure out.
I've found that I need to do the following to get HCatalog MapReduce jobs to work with writing to dynamic partitions:
In the record-writing phase of the job (usually the reducer), I have to manually add my dynamic partition fields (HCatFieldSchema) to my HCatSchema objects.
The trouble is that HCatOutputFormat.getTableSchema(config) does not actually return the partition fields. They need to be added manually:
HCatFieldSchema hfs1 = new HCatFieldSchema("date", Type.STRING, null);
HCatFieldSchema hfs2 = new HCatFieldSchema("some_partition", Type.STRING, null);
schema.append(hfs1);
schema.append(hfs2);

Here's the code for writing into multiple tables with dynamic partitioning in one job using HCatalog; the code has been tested on Hadoop 2.5.0 and Hive 0.13.1:
// ... Job setup, InputFormatClass, etc ...
String dbName = null; // null means the default database
String[] tables = {"table0", "table1"};

job.setOutputFormatClass(MultiOutputFormat.class);
MultiOutputFormat.JobConfigurer configurer = MultiOutputFormat.createConfigurer(job);

// Names of the dynamic partition columns
List<String> partitions = new ArrayList<String>();
partitions.add(0, "partition0");
partitions.add(1, "partition1");
HCatFieldSchema partition0 = new HCatFieldSchema("partition0", TypeInfoFactory.stringTypeInfo, null);
HCatFieldSchema partition1 = new HCatFieldSchema("partition1", TypeInfoFactory.stringTypeInfo, null);

for (String table : tables) {
    configurer.addOutputFormat(table, HCatOutputFormat.class, BytesWritable.class, HCatRecord.class);
    // null partition values => dynamic partitioning
    OutputJobInfo outputJobInfo = OutputJobInfo.create(dbName, table, null);
    outputJobInfo.setDynamicPartitioningKeys(partitions);
    HCatOutputFormat.setOutput(configurer.getJob(table), outputJobInfo);
    // getTableSchema() does not include the partition columns, so append them manually
    HCatSchema schema = HCatOutputFormat.getTableSchema(configurer.getJob(table).getConfiguration());
    schema.append(partition0);
    schema.append(partition1);
    HCatOutputFormat.setSchema(configurer.getJob(table), schema);
}
configurer.configure();

return job.waitForCompletion(true) ? 0 : 1;
Mapper:
public static class MyMapper extends Mapper<LongWritable, Text, BytesWritable, HCatRecord> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        HCatRecord record = new DefaultHCatRecord(3); // including the partition columns
        record.set(0, value.toString());
        // partition values must be set after the non-partition fields
        record.set(1, "0"); // partition0=0
        record.set(2, "1"); // partition1=1
        MultiOutputFormat.write("table0", null, record, context);
        MultiOutputFormat.write("table1", null, record, context);
    }
}

Related

Spark not able to retrieve all HBase data in a specific column

My HBase table has 30 million records; each record has the column raw:sample, where raw is the column family and sample is the column qualifier. This column is very big, ranging in size from a few KB to 50 MB. When I run the following Spark code, it can only get 40 thousand records, but I should get 30 million:
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "10.1.1.15:2181")
conf.set(TableInputFormat.INPUT_TABLE, "sampleData")
conf.set(TableInputFormat.SCAN_COLUMNS, "raw:sample")
conf.set("hbase.client.keyvalue.maxsize","0")
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],classOf[org.apache.hadoop.hbase.client.Result])
var arrRdd:RDD[Map[String,Object]] = hBaseRDD.map(tuple => tuple._2).map(...)
Right now I work around this by getting the id list first, then iterating over the id list to fetch the column raw:sample with the plain HBase Java client inside a Spark foreach.
Any ideas why I cannot get all of the column raw:sample through Spark? Is it because the column is too big?
A few days ago one of my ZooKeeper nodes and one of my datanodes went down, but I fixed it soon since the replication factor is 3. Is this the reason? Do you think running hbck -repair would help? Thanks a lot!
Internally, TableInputFormat creates a Scan object in order to retrieve the data from HBase.
Try to create a Scan object (without using Spark), configured to retrieve the same column from HBase, and see if the error repeats:
// Instantiating Configuration class
Configuration config = HBaseConfiguration.create();
// Instantiating HTable class
HTable table = new HTable(config, "emp");
// Instantiating the Scan class
Scan scan = new Scan();
// Scanning the required columns
scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"));
scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"));
// Getting the scan result
ResultScanner scanner = table.getScanner(scan);
// Reading values from scan result
for (Result result = scanner.next(); result != null; result = scanner.next()) {
    System.out.println("Found row : " + result);
}
// closing the scanner
scanner.close();
In addition, by default, TableInputFormat is configured to request a very small chunk of data from the HBase server (which is bad and causes a large overhead). Set the following to increase the chunk size:
scan.setCacheBlocks(false); // don't fill the block cache with scanned data
scan.setCaching(2000);      // fetch 2000 rows per RPC instead of the default
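In the Spark code from the question there is no direct Scan object; the same settings can be applied through the Configuration handed to newAPIHadoopRDD, since TableInputFormat builds its Scan from there. A minimal sketch in Java (the property constants come from org.apache.hadoop.hbase.mapreduce.TableInputFormat; the values are only illustrative):
// Equivalent tuning through the Configuration used with TableInputFormat.
Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "sampleData");
conf.set(TableInputFormat.SCAN_COLUMNS, "raw:sample");
conf.set(TableInputFormat.SCAN_CACHEDROWS, "2000");   // rows fetched per RPC
conf.set(TableInputFormat.SCAN_CACHEBLOCKS, "false"); // don't pollute the block cache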
For a high throughput like yours, Apache Kafka is the best solution for integrating the data flow and keeping the data pipeline alive. Please refer to http://kafka.apache.org/08/uses.html for some use cases of Kafka.
One more reference:
http://sites.computer.org/debull/A12june/pipeline.pdf

Selectively loading iis log files into Hive

I am just getting started with Hadoop/Pig/Hive on the cloudera platform and have questions on how to effectively load data for querying.
I currently have ~50GB of iis logs loaded into hdfs with the following directory structure:
/user/oi/raw_iis/Webserver1/Org/SubOrg/W3SVC1056242793/
/user/oi/raw_iis/Webserver2/Org/SubOrg/W3SVC1888303555/
/user/oi/raw_iis/Webserver3/Org/SubOrg/W3SVC1056245683/
etc
I would like to load all the logs into a Hive table.
I have two issues/questions:
1.
My first issue is that some of the webservers may not have been configured correctly and will have IIS logs without all columns. These incorrect logs need additional processing to map the available columns in the log to the schema that contains all columns.
The data is space delimited; the issue is that when not all columns are enabled, the log only includes the columns that are enabled. Hive can't automatically insert nulls, since the data does not include the columns that are empty. I need to be able to map the available columns in the log to the full schema.
Example good log:
#Fields: date time s-ip cs-method cs-uri-stem useragent
2013-07-16 00:00:00 10.1.15.8 GET /common/viewFile/1232 Mozilla/5.0+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/27.0.1453.116+Safari/537.36
Example log with missing columns (cs-method and useragent):
#Fields: date time s-ip cs-uri-stem
2013-07-16 00:00:00 10.1.15.8 /common/viewFile/1232
The log with missing columns needs to be mapped to the full schema like this:
#Fields: date time s-ip cs-method cs-uri-stem useragent
2013-07-16 00:00:00 10.1.15.8 null /common/viewFile/1232 null
How can I map these enabled fields to a schema that includes all possible columns, inserting a blank/null/- token for the fields that are missing? Is this something I could handle with a Pig script?
2.
How can I define my Hive tables to include information from the HDFS path, namely Org and SubOrg in my directory structure example, so that it is queryable in Hive? I am also unsure how to properly import data from the many directories into a single Hive table.
First, provide sample data for better help.
How can I map these enabled fields to a schema that includes all possible columns, inserting blank/null/- token for fields that were missing?
If you have a delimiter in the file you can use Hive, and Hive automatically inserts nulls wherever data is missing, provided the delimiter does not appear as part of your data.
Is this something I could handle with a Pig script?
If you have a delimiter among the fields then you can use Hive; otherwise you can go for MapReduce/Pig.
How can I include information from the hdfs path, namely Org and SubOrg in my dir structure example so that it is query-able in Hive?
It seems you are new to Hive: before querying, you have to create a table which includes information like path, delimiter and schema.
Is this a good candidate for partitioning?
You can apply partitioning on date if you wish.
I was able to solve both of my issues with Pig UDFs (user-defined functions).
Mapping columns to proper schema: See this answer and this one.
All I really had to do was add some logic to handle the IIS headers that start with #. Below are the snippets from getNext() that I used; everything else is the same as mr2ert's example code.
See the values[0].equals("#Fields:") parts.
@Override
public Tuple getNext() throws IOException {
    ...
    Tuple t = mTupleFactory.newTuple(1);

    // ignore header lines except the field definitions
    if (values[0].startsWith("#") && !values[0].equals("#Fields:")) {
        return t;
    }

    ArrayList<String> tf = new ArrayList<String>();
    int pos = 0;

    for (int i = 0; i < values.length; i++) {
        if (fieldHeaders == null || values[0].equals("#Fields:")) {
            // grab field headers, ignoring the #Fields: token at values[0]
            if (i > 0) {
                tf.add(values[i]);
            }
            fieldHeaders = tf;
        } else {
            readField(values[i], pos);
            pos = pos + 1;
        }
    }
    ...
}
To include information from the file path, I added the following to the LoadFunc UDF that I used to solve issue 1. In the prepareToRead override, grab the file path and store it in a member variable.
public class IISLoader extends LoadFunc {
    ...

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        in = reader;
        filePath = ((FileSplit) split.getWrappedSplit()).getPath().toString();
    }
Then within getNext() I could add the path to the output tuple.
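As an illustration (not from the original answer), the stored filePath can be split to recover Org and SubOrg before appending them to the tuple. The index positions below assume exactly the /user/oi/raw_iis/Webserver/Org/SubOrg/... layout from the question and will shift if the path carries an hdfs:// scheme prefix:
// Hypothetical helper called from getNext(): pull Org and SubOrg out of a path like
// /user/oi/raw_iis/Webserver1/Org/SubOrg/W3SVC1056242793/logfile.log
private String[] orgAndSubOrg(String path) {
    String[] parts = path.split("/");
    // parts[0] is empty because the path starts with "/";
    // parts[4] = Webserver, parts[5] = Org, parts[6] = SubOrg
    return new String[] { parts[5], parts[6] };
}

// ... and inside getNext(), after the data fields have been added:
// String[] orgInfo = orgAndSubOrg(filePath);
// t.append(orgInfo[0]); // Org
// t.append(orgInfo[1]); // SubOrg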

Performing bulk load in cassandra with map reduce

I haven't got much experience working with Cassandra, so please excuse me if I have taken the wrong approach.
I am trying to do a bulk load into Cassandra with MapReduce,
basically the word-count example.
Reference: http://henning.kropponline.de/2012/11/15/using-cassandra-hadoopbulkoutputformat/
I have taken the simple Hadoop word-count mapper example and slightly modified the driver code and the reducer as per the above example.
I have successfully generated the output file as well. Now my doubt is: how do I perform the loading-into-Cassandra part? Is there any difference in my approach?
Please advise.
This is a part of the driver code:
Job job = new Job();
job.setJobName(getClass().getName());
job.setJarByClass(CassaWordCountJob.class);
Configuration conf = job.getConfiguration();
conf.set("cassandra.output.keyspace", "test");
conf.set("cassandra.output.columnfamily", "words");
conf.set("cassandra.output.partitioner.class", "org.apache.cassandra.dht.RandomPartitioner");
conf.set("cassandra.output.thrift.port","9160"); // default
conf.set("cassandra.output.thrift.address", "localhost");
conf.set("mapreduce.output.bulkoutputformat.streamthrottlembits", "400");
job.setMapperClass(CassaWordCountMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
job.setReducerClass(CassaWordCountReducer.class);
FileOutputFormat.setOutputPath(job, new Path("/home/user/Desktop/test/cassandra"));
MultipleOutputs.addNamedOutput(job, "reducer", BulkOutputFormat.class, ByteBuffer.class, List.class);
return job.waitForCompletion(true) ? 0 : 1;
The mapper is the same as the normal word-count mapper that just tokenizes and emits (word, 1).
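For completeness, a word-count mapper of that shape looks roughly like this (a sketch, not the poster's exact code):
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CassaWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE); // emit (word, 1)
        }
    }
}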
The reducer class is of the form:
public class CassaWordCountReducer extends
        Reducer<Text, IntWritable, ByteBuffer, List<Mutation>> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        List<Mutation> columnsToAdd = new ArrayList<Mutation>();
        Integer wordCount = 0;
        for (IntWritable value : values) {
            wordCount += value.get();
        }
        Column countCol = new Column(ByteBuffer.wrap("count".getBytes()));
        countCol.setValue(ByteBuffer.wrap(wordCount.toString().getBytes()));
        countCol.setTimestamp(new Date().getTime());
        ColumnOrSuperColumn wordCosc = new ColumnOrSuperColumn();
        wordCosc.setColumn(countCol);
        Mutation countMut = new Mutation();
        countMut.column_or_supercolumn = wordCosc;
        columnsToAdd.add(countMut);
        context.write(ByteBuffer.wrap(key.toString().getBytes()), columnsToAdd);
    }
}
To do bulk loads into Cassandra, I would advise looking at this article from DataStax. Basically you need to do 2 things for bulk loading:
Your output data won't natively fit into Cassandra; you need to transform it into SSTables.
Once you have your SSTables, you need to be able to stream them into Cassandra. Of course you don't simply want to copy each SSTable to every node; you want to copy only the relevant part of the data to each node.
In your case when using the BulkOutputFormat, it should do all that as it's using the sstableloader behind the scenes. I've never used it with MultipleOutputs, but it should work fine.
I think the error in your case is that you're not using MultipleOutputs correctly: you're still doing a context.write, when you should really be writing to your MultipleOutputs object. The way you're doing it right now, since you're writing to the regular Context, it will get picked up by the default output format of TextOutputFormat and not the one you defined in your MultipleOutputs. More information on how to use the MultipleOutputs in your reducer here.
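As a rough sketch (not the poster's code) of what that change looks like, the reducer would hold a MultipleOutputs instance and write through the named output "reducer" registered in the driver, instead of calling context.write directly; buildMutations below is a stand-in for the Column/Mutation code shown in the question:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

import org.apache.cassandra.thrift.Mutation;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CassaWordCountReducer
        extends Reducer<Text, IntWritable, ByteBuffer, List<Mutation>> {

    private MultipleOutputs<ByteBuffer, List<Mutation>> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<ByteBuffer, List<Mutation>>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        List<Mutation> columnsToAdd = buildMutations(values);
        // Named output "reducer" matches MultipleOutputs.addNamedOutput(job, "reducer", ...)
        mos.write("reducer", ByteBuffer.wrap(key.toString().getBytes()), columnsToAdd);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }

    // Placeholder: build the List<Mutation> exactly as in the reducer from the question.
    private List<Mutation> buildMutations(Iterable<IntWritable> values) {
        return new ArrayList<Mutation>();
    }
}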
Once you write to the correct output format, the BulkOutputFormat you defined, your SSTables should get created and streamed to Cassandra from each node in your cluster - you shouldn't need any extra step; the output format will take care of it for you.
Also I would advise looking at this post, where they also explain how to use BulkOutputFormat, but they're using a ConfigHelper which you might want to take a look at to more easily configure your Cassandra endpoint.
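For reference, a rough equivalent of the raw conf.set(...) calls in the driver above, expressed with ConfigHelper from cassandra-hadoop, would look something like this (a sketch; exact method availability depends on your Cassandra version):
// Hypothetical equivalent of the driver's raw property settings via ConfigHelper.
Configuration conf = job.getConfiguration();
ConfigHelper.setOutputColumnFamily(conf, "test", "words");
ConfigHelper.setOutputInitialAddress(conf, "localhost");
ConfigHelper.setOutputRpcPort(conf, "9160");
ConfigHelper.setOutputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");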

Hbase - Hadoop : TableInputFormat extension

I am using an HBase table as my input, whose keys I have pre-processed so that they consist of a number concatenated with the respective row ID. I want to be sure that all rows whose key starts with the same number are processed by the same mapper in an M/R job. I am aware that this could be achieved by extending TableInputFormat, and I have seen one or two posts about extending this class, but I am looking for the most efficient way to do this in particular.
If anyone has any ideas, please let me know.
You can use a PrefixFilter in your scan.
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PrefixFilter.html
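A minimal sketch of that idea (the table name, prefix value and mapper class are placeholders) configures the per-job scan so that one job only sees rows whose keys start with a given number:
// Hypothetical per-prefix job setup: only rows whose key starts with "42" reach the mapper.
Scan scan = new Scan();
scan.setFilter(new PrefixFilter(Bytes.toBytes("42")));
scan.setCaching(500);
scan.setCacheBlocks(false); // recommended for MapReduce scans

TableMapReduceUtil.initTableMapperJob(
        "my_table",                   // input table
        scan,
        MyMapper.class,               // your TableMapper implementation
        ImmutableBytesWritable.class, // mapper output key
        Result.class,                 // mapper output value
        job);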
And parallelize the launch of your different jobs (one per prefix) using a Future:
final Future<Boolean> newJobFuture = executor.submit(new Callable<Boolean>() {
    @Override
    public Boolean call() throws Exception {
        Job mapReduceJob = MyJobBuilder.createJob(args, thePrefix,
                ...);
        return mapReduceJob.waitForCompletion(true);
    }
});
But I believe what you are looking for here is really more the role of a reducer.

Hadoop: Map Reduce: read from HBase, but filter rows by content of one column

I am really new to Hadoop and I am not able to find an answer to my question. I want to write a MapReduce job where I read from HBase and then write to a simple text file.
In HBase, I've got a column representing an ID. Now I don't want to work on all the rows in my HBase table, but only on those between a maxId and a minId.
I found out that I could possibly use filters (scan.setFilter), so that I can filter out rows which don't match my request.
This is my first MapReduce job, so please be patient :-)
I've got a starter class, where I configure the job and the Scan object and then start the job.
Now, my first try looks like this:
private Scan getScan()
{
    final Scan scan = new Scan();

    // ** FILTER **
    List<Filter> filters = new ArrayList<Filter>();

    Filter filter1 = new ValueFilter(CompareFilter.CompareOp.GREATER_OR_EQUAL,
            new BinaryComparator(Bytes.toBytes(Integer.parseInt(minId))));
    filters.add(filter1);

    Filter filter2 = new ValueFilter(CompareFilter.CompareOp.LESS_OR_EQUAL,
            new BinaryComparator(Bytes.toBytes(Integer.parseInt(maxId))));
    filters.add(filter2);

    FilterList filterList = new FilterList(filters);
    scan.setFilter(filterList);

    scan.setCaching(500);
    scan.setCacheBlocks(false);

    // id
    scan.addColumn("columnfamily".getBytes(), "id".getBytes());

    return scan;
}
Well, I'm not sure if this is the right way to do it. I also read that I could maybe pass my minId and maxId to the map job with the Configuration object, but I'm not sure how.
Besides, what do I have to do afterwards? I would normally just initiate the job with initTableMapperJob and pass the Scan object to it. I've read something about ResultScanner and so on; do I need them? I thought the MapReduce framework would automatically pass the correct rows to my map job, is that correct?
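For reference, a minimal sketch (not part of the original thread; the key names are arbitrary) of the usual pattern for passing such values through the Configuration: set them in the driver, read them back in the mapper's setup():
// Driver side: store the bounds in the job configuration before submitting.
Configuration conf = job.getConfiguration();
conf.set("my.scan.minId", minId);
conf.set("my.scan.maxId", maxId);

// Mapper side: read them back before the first map() call.
public static class MyTableMapper extends TableMapper<Text, Text> {
    private int minId;
    private int maxId;

    @Override
    protected void setup(Context context) {
        minId = Integer.parseInt(context.getConfiguration().get("my.scan.minId"));
        maxId = Integer.parseInt(context.getConfiguration().get("my.scan.maxId"));
    }
}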

Resources