org.apache.hadoop.mapred.lib.MultipleOutputs.addNamedOutput() in oozie - hadoop

I am trying to use MultipleOutputs to change the output filename in my reducer. I am using an Oozie workflow to run the MapReduce job.
I cannot find a way to set the equivalent of the following call in the Oozie workflow:
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class, Text.class, Text.class);
Since it is an Oozie map-reduce action, I have no driver class in which to put the code above.

The answer lies in the source code for the method.
From the hadoop-core 1.2.1 jar:
public static void addNamedOutput(Job job, String namedOutput,
        Class<? extends OutputFormat> outputFormatClass,
        Class<?> keyClass, Class<?> valueClass) {
    checkNamedOutputName(job, namedOutput, true);
    Configuration conf = job.getConfiguration();
    conf.set("mapreduce.multipleoutputs",
            conf.get("mapreduce.multipleoutputs", "") + " " + namedOutput);
    conf.setClass("mapreduce.multipleoutputs.namedOutput." + namedOutput + ".format",
            outputFormatClass, OutputFormat.class);
    conf.setClass("mapreduce.multipleoutputs.namedOutput." + namedOutput + ".key",
            keyClass, Object.class);
    conf.setClass("mapreduce.multipleoutputs.namedOutput." + namedOutput + ".value",
            valueClass, Object.class);
}
So it points towards appending the named output to the space-separated "mapreduce.multipleoutputs" property, and setting the format, key and value classes via the properties below (a sketch of the corresponding Oozie workflow configuration follows the list):
"mapreduce.multipleoutputs.namedOutput." + namedOutput + ".format"
"mapreduce.multipleoutputs.namedOutput." + namedOutput + ".key"
"mapreduce.multipleoutputs.namedOutput." + namedOutput + ".value"
Hope it helps.

Related

Kafka: why is the partition always 0, whether I set a key or not?

public void testKafka() throws InterruptedException {
    for (int i = 0; i < 1000; i++) {
        kafkaTemplate.send("topic1", "zxcvb", String.valueOf(i));
    }
    Thread.sleep(1000 * 60);
}

@Component
class KafkaConsumer {

    @KafkaListener(groupId = "01", topics = "topic1")
    public void onMessage1(ConsumerRecord<?, ?> record) {
        System.out.println("1============" + record.topic() + "->" + record.partition() + "->" + record.value() + "============");
    }

    @KafkaListener(groupId = "02", topicPartitions = {
            @TopicPartition(topic = "topic1", partitions = {"2"})
    })
    public void onMessage2(ConsumerRecord<?, ?> record) {
        System.out.println("2============" + record.topic() + "->" + record.partition() + "->" + record.value() + "============");
    }

    @KafkaListener(groupId = "03", topicPartitions = {
            @TopicPartition(topic = "topic1", partitions = {"3"})
    })
    public void onMessage3(ConsumerRecord<?, ?> record) {
        System.out.println("3============" + record.topic() + "->" + record.partition() + "->" + record.value() + "============");
    }
}
Above is my code; I don't know why the partition is always 0.
If I set the key, it is still 0.
If I use send(String topic, Integer partition, K key, @Nullable V data), I get:
Error: Topic topic1 not present in metadata after 60000 ms.
It will always send records with the same key to the same partition - that is by design. See the DefaultPartitioner:
/**
* The default partitioning strategy:
* <ul>
* <li>If a partition is specified in the record, use it
* <li>If no partition is specified but a key is present choose a partition based on a hash of the key
* <li>If no partition or key is present choose the sticky partition that changes when the batch is full.
*
* See KIP-480 for details about sticky partitioning.
*/
public class DefaultPartitioner implements Partitioner {
The error probably means you are trying to send to a partition that doesn't exist.
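A likely root cause here (an assumption, since the topic setup is not shown) is that topic1 was auto-created with a single partition: everything then lands in partition 0, and partitions 2 and 3, which the other listeners subscribe to, simply do not exist. A minimal sketch of declaring the topic with more partitions in a Spring Kafka application:
import org.apache.kafka.clients.admin.NewTopic;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
class TopicConfig {

    // With Spring Boot's auto-configured KafkaAdmin, this bean creates
    // topic1 with 5 partitions (replication factor 1) if it does not exist yet.
    @Bean
    public NewTopic topic1() {
        return new NewTopic("topic1", 5, (short) 1);
    }
}
Even then, sending every record with the same key ("zxcvb") keeps them in one partition by design; vary the key or pass an explicit, existing partition to spread the records.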

FileNet exception "An exception occurred during a read of the RenditionEngineConnection"

I am trying to publish a document using IBM FileNet.
I followed this manual:
https://www.ibm.com/support/knowledgecenter/SSNW2F_5.2.0/com.ibm.p8.ce.dev.ce.doc/publish_procedures.htm
But I got the exception "An exception occurred during a read of the RenditionEngineConnection".
What is my mistake?
How should the "FileNet P8 Rendition Engine" be set up?
https://www.ibm.com/support/knowledgecenter/it/SSNW2F_5.1.0/com.ibm.p8.installingre.doc/p8pic003.htm
My source code:
/**
 * Create {@link PublishStyleTemplate}
 *
 * @param objectStore {@link ObjectStore}
 * @param description description
 */
public void createPublishStyleTemplate(final ObjectStore objectStore, final String description) {
    PublishStyleTemplate pst = Factory.PublishStyleTemplate.createInstance(objectStore);
    pst.set_Title(description);
    pst.set_Description(description);
    StringList formats = Factory.StringList.createList();
    formats.add("text/plain");
    formats.add("application/msword");
    formats.add("application/vnd.ms-excel");
    formats.add("application/vnd.ms-powerpoint");
    formats.add("application/vnd.openxmlformat");
    formats.add("application/vnd.openxmlformats-officedocument.wordprocessingml.document");
    formats.add("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
    formats.add("application/vnd.openxmlformats-officedocument.presentationml.presentation");
    pst.set_InputFormats(formats);
    // ProviderID must use the well-known handler name.
    String PDF_HANDLER = "PublishRequestPDFHandler";
    pst.set_ProviderID(PDF_HANDLER);
    pst.set_OutputFormat("application/pdf"); // PDF transformation
    pst.save(RefreshMode.REFRESH);
}
/**
 * Create {@link PublishTemplate}
 *
 * @param objectStore {@link ObjectStore}
 * @param publishStyleTemplate {@link PublishStyleTemplate}
 * @param description description
 */
public void createPublishTemplate(final ObjectStore objectStore, final PublishStyleTemplate publishStyleTemplate, final String description) {
    // Create a publish template object.
    PublishTemplate pt = Factory.PublishTemplate.createInstance(objectStore);
    pt.set_StyleTemplate(publishStyleTemplate);
    // Set document title for the publish template
    pt.getProperties().putValue("DocumentTitle", description);
    pt.set_Description("test_PublishTemplate");
    // Is there a cascade delete dependency between source document and publication?
    boolean isSourceDependency = true;
    // isSourceDependency is a boolean variable that specifies whether the user wants
    // to delete the publication automatically when the source is deleted. It is whichever value
    // (true or false) the user chooses.
    String VALUE_ISSOURCEDEPENDENCY = isSourceDependency ? "true" : "false";
    // Publish template content.
    String PT_CONTENT =
        "<?xml version='1.0'?>" +
        "<publishtemplatecontent>" +
        "<version>2.0.1</version>" +
        "<newinstructions>" +
        "<issourcedependent>" + VALUE_ISSOURCEDEPENDENCY + "</issourcedependent>" +
        "<outputfolderid>" + "B0FA3471-0000-CD1D-9D5E-B1E6E5E82135" + "</outputfolderid>" +
        "<applyproperties><from>source</from></applyproperties>" +
        "<applysecurity><from>default</from></applysecurity>" +
        "</newinstructions>" +
        "<republishinstructions>" +
        "<versionablerepublishtype>versionandkeep</versionablerepublishtype>" +
        "<nonversionablerepublishtype>addandkeep</nonversionablerepublishtype>" +
        "<applypropertiesfrom>destination</applypropertiesfrom>" +
        "<applysecurityfrom>destination</applysecurityfrom>" +
        "</republishinstructions>" +
        "</publishtemplatecontent>";
    String[] PT_DATA = {"myNewPublishTemplate.xml", "application/x-filenet-publishtemplate", PT_CONTENT};
    // Create content elements.
    ContentElementList cel = Factory.ContentElement.createList();
    ContentTransfer ctNew = Factory.ContentTransfer.createInstance();
    ByteArrayInputStream is = new ByteArrayInputStream(PT_CONTENT.getBytes());
    ctNew.setCaptureSource(is);
    ctNew.set_RetrievalName(PT_DATA[0]);
    ctNew.set_ContentType(PT_DATA[1]);
    cel.add(ctNew);
    pt.set_ContentElements(cel);
    // Check in publish template as major version.
    pt.checkin(AutoClassify.DO_NOT_AUTO_CLASSIFY, CheckinType.MAJOR_VERSION);
    pt.save(RefreshMode.REFRESH);
}
/**
 * Create {@link PublishRequest}
 *
 * @param objectStore
 * @param document
 * @param publishTemplate
 * @return
 */
public PublishRequest createPublishRequest(
        final ObjectStore objectStore,
        final Document document,
        final PublishTemplate publishTemplate) {
    System.out.println(String.format("Document Id = %s", document.get_Id().toString()));
    System.out.println(String.format("Document MimeType = %s", document.get_MimeType()));
    System.out.println(String.format("Document Name = %s", document.get_Name()));
    System.out.println(String.format("PublishTemplate Id = %s", publishTemplate.get_Id().toString()));
    PublishStyleTemplate publishStyleTemplate = publishTemplate.get_StyleTemplate();
    System.out.println(String.format("PublishStyleTemplate ProviderID = %s", publishStyleTemplate.get_ProviderID()));
    StringList stringList = publishStyleTemplate.get_InputFormats();
    Iterator iterator = stringList.iterator();
    while (iterator.hasNext()) {
        String inputFormat = (String) iterator.next();
        System.out.println(String.format("PublishStyleTemplate InputFormat = %s", inputFormat));
    }
    String publishOpts = new String(
        "<publishoptions><publicationname>"
        + document.get_Name()
        + "</publicationname></publishoptions>");
    PublishRequest publishRequest = Factory.PublishRequest.createInstance(objectStore);
    publishRequest.set_InputDocument(document);
    publishRequest.set_PublishTemplate(publishTemplate);
    publishRequest.setPublishOptions(publishOpts);
    publishRequest.save(RefreshMode.REFRESH);
    return publishRequest;
}
Log:
2020-04-14T12:47:57.492 9F0B4BDE PUBL FNRCE0000E - ERROR ERROR: Reading RenditionEngineConnection threw: An unexpected exception occurred.
2020-04-14T12:47:57.493 9F0B4BDE PUBL FNRCE0000I - INFO InvokeVista exception: An exception occurred during a read of the RenditionEngineConnection.
2020-04-14T12:47:57.493 E0476562 PUBL FNRCE0066E - ERROR Failed dispatching PublishRequest row {B0AB7771-0000-CA39-BFFD-BF7A84D30A96}
com.filenet.api.exception.EngineRuntimeException: FNRCE0066E: E_UNEXPECTED_EXCEPTION: An unexpected exception occurred.
    at com.filenet.engine.publish.PublishRequestPDFHandler.publishPDF(PublishRequestPDFHandler.java:435)
    at com.filenet.engine.publish.PublishRequestPDFHandler.execute(PublishRequestPDFHandler.java:169)
    at com.filenet.engine.publish.PublishRequestHandlerBase$1.run(PublishRequestHandlerBase.java:226)
    at com.filenet.engine.context.CallState.doAs(CallState.java:236)
    at com.filenet.engine.context.CallState.doAs(CallState.java:153)
    at com.filenet.engine.publish.PublishRequestHandlerBase.executeAs(PublishRequestHandlerBase.java:215)
    at com.filenet.engine.publish.PublishRequestExecutor.loadAndExecuteQueuedRow(PublishRequestExecutor.java:214)
    at com.filenet.engine.queueitem.QueueExecutor.dispatchQueuedRow(QueueExecutor.java:389)
    at com.filenet.engine.queueitem.QueueExecutor.dispatchEvent(QueueExecutor.java:209)
    at com.filenet.engine.queueitem.QueueExecutor.execute(QueueExecutor.java:133)
    at com.filenet.engine.tasks.BackgroundTask.safeExecute(BackgroundTask.java:275)
    at com.filenet.engine.tasks.BackgroundTask$BackgroundTaskPriviledgedExceptionAction.run(BackgroundTask.java:1110)
    at com.filenet.engine.context.CallState.doAsSystem(CallState.java:575)
    at com.filenet.engine.tasks.BackgroundTask.run(BackgroundTask.java:209)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.lang.Thread.run(Thread.java:785)
Caused by: com.filenet.api.exception.EngineRuntimeException: FNRCU0005E: PUBLISH_READING_REC_THREW: An exception occurred during a read of the RenditionEngineConnection.
    at com.filenet.engine.publish.PublishRequestHandlerUtil.readRenditionEngineConnection(PublishRequestHandlerUtil.java:199)
    at com.filenet.engine.publish.PublishRequestPDFHandler$InvokeVistaPDF$1.run(PublishRequestPDFHandler.java:128)
    at com.filenet.engine.context.CallState.doAs(CallState.java:236)
    at com.filenet.engine.context.CallState.doAs(CallState.java:153)
    at com.filenet.engine.publish.PublishRequestPDFHandler$InvokeVistaPDF.run(PublishRequestPDFHandler.java:118)
    ... 3 more
Caused by: com.filenet.api.exception.EngineRuntimeException: FNRCE0066E: E_UNEXPECTED_EXCEPTION: An unexpected exception occurred.
    at com.filenet.engine.publish.PublishRequestHandlerUtil.readRenditionEngineConnection(PublishRequestHandlerUtil.java:120)
    ... 7 more
2020-04-14T12:47:57.494 E0476562 PUBL FNRCE0000I - INFO dispatchFailed: marked queue item: {B0AB7771-0000-CA39-BFFD-BF7A84D30A96} as "poisoned" and will not retry further.

Last reducer has been running for 24 hours on a 200 GB data set

Hi, I have a MapReduce application that bulk loads data into HBase.
I have 142 text files, about 200 GB in total.
My mappers complete within 5 minutes, and so do all the reducers except the last one, which is stuck at 100%.
It has been running for the past 24 hours.
I have one column family.
My row keys look like this:
48433197315|1972-03-31T00:00:00Z|4
48433197315|1972-03-31T00:00:00Z|38
48433197315|1972-03-31T00:00:00Z|41
48433197315|1972-03-31T00:00:00Z|23
48433197315|1972-03-31T00:00:00Z|7
48433336118|1972-03-31T00:00:00Z|17
48433197319|1972-03-31T00:00:00Z|64
48433197319|1972-03-31T00:00:00Z|58
48433197319|1972-03-31T00:00:00Z|61
48433197319|1972-03-31T00:00:00Z|73
48433197319|1972-03-31T00:00:00Z|97
48433336119|1972-03-31T00:00:00Z|7
I created my table like this:
private static Configuration getHbaseConfiguration() {
    try {
        if (hbaseConf == null) {
            System.out.println(
                    "UserId= " + USERID + " \t keytab file =" + KEYTAB_FILE + " \t conf =" + KRB5_CONF_FILE);
            HBaseConfiguration.create();
            hbaseConf = HBaseConfiguration.create();
            hbaseConf.set("mapreduce.job.queuename", "root.fricadev");
            hbaseConf.set("mapreduce.child.java.opts", "-Xmx6553m");
            hbaseConf.set("mapreduce.map.memory.mb", "8192");
            hbaseConf.setInt(MAX_FILES_PER_REGION_PER_FAMILY, 1024);
            System.setProperty("java.security.krb5.conf", KRB5_CONF_FILE);
            UserGroupInformation.loginUserFromKeytab(USERID, KEYTAB_FILE);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return hbaseConf;
}

/**
 * HBase bulk import example Data preparation MapReduce job driver
 *
 * args[0]: HDFS input path args[1]: HDFS output path
 *
 * @throws Exception
 *
 */
public static void main(String[] args) throws Exception {
    if (hbaseConf == null)
        hbaseConf = getHbaseConfiguration();
    String outputPath = args[2];
    hbaseConf.set("data.seperator", DATA_SEPERATOR);
    hbaseConf.set("hbase.table.name", args[0]);
    hbaseConf.setInt(MAX_FILES_PER_REGION_PER_FAMILY, 1024);
    Job job = new Job(hbaseConf);
    job.setJarByClass(HBaseBulkLoadDriver.class);
    job.setJobName("Bulk Loading HBase Table::" + args[0]);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapperClass(HBaseBulkLoadMapperUnzipped.class);
    // job.getConfiguration().set("mapreduce.job.acl-view-job",
    // "bigdata-app-fricadev-sdw-u6034690");
    if (HbaseBulkLoadMapperConstants.FUNDAMENTAL_ANALYTIC.equals(args[0])) {
        HTableDescriptor descriptor = new HTableDescriptor(Bytes.toBytes(args[0]));
        descriptor.addFamily(new HColumnDescriptor(COLUMN_FAMILY));
        HBaseAdmin admin = new HBaseAdmin(hbaseConf);
        byte[] startKey = new byte[16];
        Arrays.fill(startKey, (byte) 0);
        byte[] endKey = new byte[16];
        Arrays.fill(endKey, (byte) 255);
        admin.createTable(descriptor, startKey, endKey, REGIONS_COUNT);
        admin.close();
        // HColumnDescriptor hcd = new
        // HColumnDescriptor(COLUMN_FAMILY).setMaxVersions(1);
        // createPreSplitLoadTestTable(hbaseConf, descriptor, hcd);
    }
    job.getConfiguration().setBoolean("mapreduce.compress.map.output", true);
    job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
    job.getConfiguration().setBoolean("mapreduce.output.fileoutputformat.compress", true);
    job.getConfiguration().setClass("mapreduce.map.output.compression.codec",
            org.apache.hadoop.io.compress.GzipCodec.class, org.apache.hadoop.io.compress.CompressionCodec.class);
    job.getConfiguration().set("hfile.compression", Compression.Algorithm.LZO.getName());
    // Connection connection =
    // ConnectionFactory.createConnection(hbaseConf);
    // Table table = connection.getTable(TableName.valueOf(args[0]));
    FileInputFormat.setInputPaths(job, args[1]);
    FileOutputFormat.setOutputPath(job, new Path(outputPath));
    job.setMapOutputValueClass(Put.class);
    HFileOutputFormat.configureIncrementalLoad(job, new HTable(hbaseConf, args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : -1);
    System.out.println("job is successfull..........");
    // LoadIncrementalHFiles loader = new LoadIncrementalHFiles(hbaseConf);
    // loader.doBulkLoad(new Path(outputPath), (HTable) table);
    HBaseBulkLoad.doBulkLoad(outputPath, args[0]);
}

/**
 * Enum of counters.
 * It is used to collect statistics.
 */
public static enum Counters {
    /**
     * Counts data format errors.
     */
    WRONG_DATA_FORMAT_COUNTER
}
}
There is no reducer in my code, only a mapper.
My mapper code looks like this:
public class FundamentalAnalyticLoader implements TableLoader {

    private ImmutableBytesWritable hbaseTableName;
    private Text value;
    private Mapper<LongWritable, Text, ImmutableBytesWritable, Put>.Context context;
    private String strFileLocationAndDate;

    @SuppressWarnings("unchecked")
    public FundamentalAnalyticLoader(ImmutableBytesWritable hbaseTableName, Text value, Context context,
            String strFileLocationAndDate) {
        //System.out.println("Constructing Fundalmental Analytic Load");
        this.hbaseTableName = hbaseTableName;
        this.value = value;
        this.context = context;
        this.strFileLocationAndDate = strFileLocationAndDate;
    }

    @SuppressWarnings("deprecation")
    public void load() {
        if (!HbaseBulkLoadMapperConstants.FF_ACTION.contains(value.toString())) {
            String[] values = value.toString().split(HbaseBulkLoadMapperConstants.DATA_SEPERATOR);
            String[] strArrFileLocationAndDate = strFileLocationAndDate
                    .split(HbaseBulkLoadMapperConstants.FIELD_SEPERATOR);
            if (17 == values.length) {
                String strKey = values[5].trim() + "|" + values[0].trim() + "|" + values[3].trim() + "|"
                        + values[4].trim() + "|" + values[14].trim() + "|" + strArrFileLocationAndDate[0].trim() + "|"
                        + strArrFileLocationAndDate[2].trim();
                //String strRowKey=StringUtils.leftPad(Integer.toString(Math.abs(strKey.hashCode() % 470)), 3, "0") + "|" + strKey;
                byte[] hashedRowKey = HbaseBulkImportUtil.getHash(strKey);
                Put put = new Put((hashedRowKey));
                put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                        Bytes.toBytes(HbaseBulkLoadMapperConstants.FUNDAMENTAL_SERIES_ID),
                        Bytes.toBytes(values[0].trim()));
                put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                        Bytes.toBytes(HbaseBulkLoadMapperConstants.FUNDAMENTAL_SERIES_ID_OBJECT_TYPE_ID),
                        Bytes.toBytes(values[1].trim()));
                put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                        Bytes.toBytes(HbaseBulkLoadMapperConstants.FUNDAMENTAL_SERIES_ID_OBJECT_TYPE),
                        Bytes.toBytes(values[2]));
                put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                        Bytes.toBytes(HbaseBulkLoadMapperConstants.FINANCIAL_PERIOD_END_DATE),
                        Bytes.toBytes(values[3].trim()));
                put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                        Bytes.toBytes(HbaseBulkLoadMapperConstants.FINANCIAL_PERIOD_TYPE),
                        Bytes.toBytes(values[4].trim()));
                put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                        Bytes.toBytes(HbaseBulkLoadMapperConstants.LINE_ITEM_ID), Bytes.toBytes(values[5].trim()));
                put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                        Bytes.toBytes(HbaseBulkLoadMapperConstants.ANALYTIC_ITEM_INSTANCE_KEY),
                        Bytes.toBytes(values[6].trim()));
                put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                        Bytes.toBytes(HbaseBulkLoadMapperConstants.ANALYTIC_VALUE), Bytes.toBytes(values[7].trim()));
                put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                        Bytes.toBytes(HbaseBulkLoadMapperConstants.ANALYTIC_CONCEPT_CODE),
                        Bytes.toBytes(values[8].trim()));
                put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                        Bytes.toBytes(HbaseBulkLoadMapperConstants.ANALYTIC_VALUE_CURRENCY_ID),
                        Bytes.toBytes(values[9].trim()));
                put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                        Bytes.toBytes(HbaseBulkLoadMapperConstants.ANALYTIC_IS_ESTIMATED),
                        Bytes.toBytes(values[10].trim()));
                put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                        Bytes.toBytes(HbaseBulkLoadMapperConstants.ANALYTIC_AUDITABILITY_EQUATION),
                        Bytes.toBytes(values[11].trim()));
                put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                        Bytes.toBytes(HbaseBulkLoadMapperConstants.FINANCIAL_PERIOD_TYPE_ID),
                        Bytes.toBytes(values[12].trim()));
                put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                        Bytes.toBytes(HbaseBulkLoadMapperConstants.ANALYTIC_CONCEPT_ID),
                        Bytes.toBytes(values[13].trim()));
                put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                        Bytes.toBytes(HbaseBulkLoadMapperConstants.ANALYTIC_LINE_ITEM_IS_YEAR_TO_DATE),
                        Bytes.toBytes(values[14].trim()));
                put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                        Bytes.toBytes(HbaseBulkLoadMapperConstants.IS_ANNUAL), Bytes.toBytes(values[15].trim()));
                // put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                // Bytes.toBytes(HbaseBulkLoadMapperConstants.TAXONOMY_ID),
                // Bytes.toBytes(values[16].trim()));
                //
                // put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                // Bytes.toBytes(HbaseBulkLoadMapperConstants.INSTRUMENT_ID),
                // Bytes.toBytes(values[17].trim()));
                put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                        Bytes.toBytes(HbaseBulkLoadMapperConstants.FF_ACTION),
                        Bytes.toBytes(values[16].substring(0, values[16].length() - 3)));
                put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                        Bytes.toBytes(HbaseBulkLoadMapperConstants.FILE_PARTITION),
                        Bytes.toBytes(strArrFileLocationAndDate[0].trim()));
                put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                        Bytes.toBytes(HbaseBulkLoadMapperConstants.FILE_PARTITION_DATE),
                        Bytes.toBytes(strArrFileLocationAndDate[2].trim()));
                try {
                    context.write(hbaseTableName, put);
                } catch (IOException e) {
                    context.getCounter(Counters.WRONG_DATA_FORMAT_COUNTER).increment(1);
                } catch (InterruptedException e) {
                    // TODO Auto-generated catch block
                    e.printStackTrace();
                }
            } else {
                System.out.println("Values length is less 15 and value is " + value.toString());
            }
        }
    }
Any help to improve the speed is highly appreciated.
I suspect that all records go into a single region.
When you created the empty table, HBase split the key address space into even ranges. But because all of your actual keys share the same prefix, they all go into a single region. That means a single region/reduce task does all the work while the other regions/reduce tasks do nothing useful. You can check this hypothesis by looking at the Hadoop counters: how many bytes the slow reduce task read/wrote compared to the other reduce tasks.
If this is the problem, then you need to manually prepare split keys and create the table using createTable(HTableDescriptor desc, byte[][] splitKeys). The split keys should evenly divide your actual dataset for optimal performance.
Example #1. If your keys were ordinary English words, it would be easy to split the table into 26 regions by first character (split keys 'a', 'b', ..., 'z'), or into 26*26 regions by the first two characters ('aa', 'ab', ..., 'zz'). The regions would not necessarily be even, but that would still be better than having only a single region.
Example #2. If your keys were 4-byte hashes, it would be easy to split the table into 256 regions by the first byte (0x00, 0x01, ..., 0xff) or into 2^16 regions by the first two bytes.
In your particular case, I see two options (a sketch of the second one follows this list):
Search for the smallest key (in sorted order) and the largest key in your dataset, and use them as startKey and endKey for Admin.createTable(). This will only work well if the keys are uniformly distributed between startKey and endKey.
Prefix your keys with hash(key) and use the method from Example #2. This should work well, but you won't be able to make semantic queries like (KEY >= ${first} and KEY <= ${last}).
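For illustration, here is a minimal sketch of that second option, reusing the pre-1.0 HBaseAdmin API already used in the question; the table name, column family and region count are assumptions for the example, not values taken from the question:
// Pre-split the table into 256 regions on the first byte of the hashed row key.
// Assumes row keys are prefixed with (or replaced by) a hash, as in option 2.
HTableDescriptor descriptor = new HTableDescriptor(TableName.valueOf("FUNDAMENTAL_ANALYTIC"));
descriptor.addFamily(new HColumnDescriptor("cf"));

byte[][] splitKeys = new byte[255][];
for (int i = 1; i <= 255; i++) {
    splitKeys[i - 1] = new byte[] { (byte) i }; // region boundaries 0x01 .. 0xff -> 256 regions
}

HBaseAdmin admin = new HBaseAdmin(hbaseConf);
admin.createTable(descriptor, splitKeys);
admin.close();
Since HFileOutputFormat.configureIncrementalLoad() creates one reduce task per region, evenly distributed hashed keys spread the bulk-load work over all 256 reducers instead of piling it onto one.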
Mostly, if a job hangs in the last minute or second, the issue is a particular node or resource having concurrency problems, etc.
A small checklist:
1. Try again with a smaller data set. This will rule out basic functioning of the code.
2. Since most of the job is done, the mapper and reducer are probably fine. You can try running the job with the same volume a few times. The logs can help you identify whether the same node has issues across repeated runs.
3. Verify that the output is being generated as expected.
4. You can also reduce the number of columns you are trying to add to HBase. This will relieve the load for the same volume.
Jobs getting hung can be caused by a variety of issues, but troubleshooting mostly consists of the above steps - verifying whether the cause is data related, resource related, specific to a node, memory related, etc.

Apache Spark with custom InputFormat for HadoopRDD

I am currently working on Apache Spark. I have implemented a Custom InputFormat for Apache Hadoop that reads key-value records through TCP Sockets. I wanted to port this code to Apache Spark and use it with the hadoopRDD() function. My Apache Spark code is as follows:
public final class SparkParallelDataLoad {

    public static void main(String[] args) {
        int iterations = 100;
        String dbNodesLocations = "";
        if (args.length < 3) {
            System.err.printf("Usage ParallelLoad <coordinator-IP> <coordinator-port> <numberOfSplits>\n");
            System.exit(1);
        }
        JobConf jobConf = new JobConf();
        jobConf.set(CustomConf.confCoordinatorIP, args[0]);
        jobConf.set(CustomConf.confCoordinatorPort, args[1]);
        jobConf.set(CustomConf.confDBNodesLocations, dbNodesLocations);
        int numOfSplits = Integer.parseInt(args[2]);
        CustomInputFormat.setCoordinatorIp(args[0]);
        CustomInputFormat.setCoordinatorPort(Integer.parseInt(args[1]));
        SparkConf sparkConf = new SparkConf().setAppName("SparkParallelDataLoad");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        JavaPairRDD<LongWritable, Text> records = sc.hadoopRDD(jobConf,
                CustomInputFormat.class, LongWritable.class, Text.class,
                numOfSplits);
        JavaRDD<LabeledPoint> points = records.map(new Function<Tuple2<LongWritable, Text>, LabeledPoint>() {

            private final Log log = LogFactory.getLog(Function.class);

            private static final long serialVersionUID = -1771348263117622186L;

            private final Pattern SPACE = Pattern.compile(" ");

            @Override
            public LabeledPoint call(Tuple2<LongWritable, Text> tuple)
                    throws Exception {
                if (tuple == null || tuple._1() == null || tuple._2() == null)
                    return null;
                double y = Double.parseDouble(Long.toString(tuple._1.get()));
                String[] tok = SPACE.split(tuple._2.toString());
                double[] x = new double[tok.length];
                for (int i = 0; i < tok.length; ++i) {
                    if (tok[i].isEmpty() == false)
                        x[i] = Double.parseDouble(tok[i]);
                }
                return new LabeledPoint(y, Vectors.dense(x));
            }
        });
        System.out.println("Number of records: " + points.count());
        LinearRegressionModel model = LinearRegressionWithSGD.train(points.rdd(), iterations);
        System.out.println("Model weights: " + model.weights());
        sc.stop();
    }
}
In my project I also have to decide which Spark worker is going to connect to which data source (something like a "matchmaking" process with a 1:1 relation). Therefore, I create a number of InputSplits equal to the number of data sources so that my data are sent in parallel to the SparkContext. My questions are the following:
Does the result of the InputSplit.getLength() method affect how many records a RecordReader returns? In detail, I have seen in my test runs that a job ends after returning only one record, only because the CustomInputSplit.getLength() function returns 0.
In the Apache Spark context, is the number of workers equal to the number of InputSplits produced by my InputFormat, at least for the execution of the records.map() function call?
The answer to question 2 above is really important for my project.
Thank you,
Nick
Yes. Spark's sc.hadoopRDD will create an RDD with as many partitions as reported by InputFormat.getSplits.
The last argument to hadoopRDD, called minPartitions (numOfSplits in your code), is used as a hint to InputFormat.getSplits. But the number returned by getSplits will be respected no matter whether it is greater or smaller.
See the code at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L168
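For a quick sanity check (a sketch only, reusing the RDD from the code above), you can print the partition count of the returned RDD and compare it with the number of splits your CustomInputFormat reports:
// One partition is created per InputSplit returned by CustomInputFormat.getSplits();
// the minPartitions hint (numOfSplits) only influences getSplits, it does not override it.
JavaPairRDD<LongWritable, Text> records = sc.hadoopRDD(jobConf,
        CustomInputFormat.class, LongWritable.class, Text.class, numOfSplits);
System.out.println("Number of partitions: " + records.partitions().size());
Note that this gives you the number of tasks for records.map(), not the number of workers: a single worker (executor) may run several of these tasks.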

Splitting a tuple into multiple tuples in Pig

I'd like to generate multiple tuples from a single tuple. What I mean is:
I have a file with the following data in it.
>> cat data
ID | ColumnName1:Value1 | ColumnName2:Value2
so I load it by the following command
grunt >> A = load '$data' using PigStorage('|');
grunt >> dump A;
(ID,ColumnName1:Value1,ColumnName2:Value2)
Now I want to split this tuple into two tuples.
(ID, ColumnName1, Value1)
(ID, ColumnName2, Value2)
Can I use a UDF along with foreach and generate? Something like the following?
grunt >> foreach A generate SOMEUDF(A)
EDIT:
input tuple : (id1,column1,column2)
output : two tuples (id1,column1) and (id1,column2), so should I return a List or a Bag?
public class SPLITTUPPLE extends EvalFunc<List<Tuple>> {

    public List<Tuple> exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            // not sure whether I can create tuples on my own. Looks like I should use TupleFactory.
            // return list of tuples.
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}
Is this approach correct?
You could write a UDF or use a PIG script with built-in functions.
For example:
-- data should be chararray; PigStorage('|') returns bytearray, which will not work for this example
inpt = load '/pig_fun/input/single_tuple_to_multiple.txt' as (line:chararray);
-- split by | and create a row so we can dereference it later
splt = foreach inpt generate FLATTEN(STRSPLIT($0, '\\|')) ;
-- first column is id, rest is converted into a bag and flatten it to make rows
id_vals = foreach splt generate $0 as id, FLATTEN(TOBAG(*)) as value;
-- there will be records with (id, id), but id should not have ':'
id_vals = foreach id_vals generate id, INDEXOF(value, ':') as p, STRSPLIT(value, ':', 2) as vals;
final = foreach (filter id_vals by p != -1) generate id, FLATTEN(vals) as (col, val);
dump final;
Test INPUT:
1|c1:11:33|c2:12
234|c1:21|c2:22
33|c1:31|c2:32
345|c1:41|c2:42
OUTPUT
(1,c1,11:33)
(1,c2,12)
(234,c1,21)
(234,c2,22)
(33,c1,31)
(33,c2,32)
(345,c1,41)
(345,c2,42)
I hope it helps.
Cheers.
Here is the UDF version. I prefer to return a BAG:
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;
/**
 * Converts input chararray "ID|ColumnName1:Value1|ColumnName2:Value2|.." into a bag
 * {(ID, ColumnName1, Value1), (ID, ColumnName2, Value2), ...}
 *
 * Default rows separator is '|' and key value separator is ':'.
 * In this implementation white spaces around separator characters are not removed.
 * ID can be made of any character (including sequence of white spaces).
 * @author
 *
 */
public class TupleToBagColumnValuePairs extends EvalFunc<DataBag> {

    private static final TupleFactory tupleFactory = TupleFactory.getInstance();
    private static final BagFactory bagFactory = BagFactory.getInstance();

    // Row separator character. Default is '|'.
    private String rowsSeparator;
    // Column value separator character. Default is ':'.
    private String columnValueSeparator;

    public TupleToBagColumnValuePairs() {
        this.rowsSeparator = "\\|";
        this.columnValueSeparator = ":";
    }

    public TupleToBagColumnValuePairs(String rowsSeparator, String keyValueSeparator) {
        this.rowsSeparator = rowsSeparator;
        this.columnValueSeparator = keyValueSeparator;
    }
    /**
     * Creates a tuple with 3 fields (id:chararray, column:chararray, value:chararray)
     * @param outputBag Output tuples (id, column, value) are added to this bag
     * @param id
     * @param column
     * @param value
     * @throws ExecException
     */
    protected void addTuple(DataBag outputBag, String id, String column, String value) throws ExecException {
        Tuple outputTuple = tupleFactory.newTuple();
        outputTuple.append(id);
        outputTuple.append(column);
        outputTuple.append(value);
        outputBag.add(outputTuple);
    }

    /**
     * Takes column{separator}value entries from splitInputLine, splits each into column and value and adds them to the outputBag as (id, column, value)
     * @param outputBag Output tuples (id, column, value) should be added to this bag
     * @param id
     * @param splitInputLine format column{separator}value, which starts from index 1
     * @throws ExecException
     */
    protected void parseColumnValues(DataBag outputBag, String id,
            String[] splitInputLine) throws ExecException {
        for (int i = 1; i < splitInputLine.length; i++) {
            if (splitInputLine[i] != null) {
                int columnValueSplitIndex = splitInputLine[i].indexOf(this.columnValueSeparator);
                if (columnValueSplitIndex != -1) {
                    String column = splitInputLine[i].substring(0, columnValueSplitIndex);
                    String value = null;
                    if (columnValueSplitIndex + 1 < splitInputLine[i].length()) {
                        value = splitInputLine[i].substring(columnValueSplitIndex + 1);
                    }
                    this.addTuple(outputBag, id, column, value);
                } else {
                    String column = splitInputLine[i];
                    this.addTuple(outputBag, id, column, null);
                }
            }
        }
    }
    /**
     * input - contains only one field of type chararray, which will be split by '|'
     * All inputs that are: null or of length 0 are ignored.
     */
    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() != 1 || input.isNull(0)) {
            return null;
        }
        String inputLine = (String) input.get(0);
        String[] splitInputLine = inputLine.split(this.rowsSeparator, -1);
        if (splitInputLine.length > 1 && splitInputLine[0].length() > 0) {
            String id = splitInputLine[0];
            DataBag outputBag = bagFactory.newDefaultBag();
            if (splitInputLine.length == 1) { // there is just an id in the line
                this.addTuple(outputBag, id, null, null);
            } else {
                this.parseColumnValues(outputBag, id, splitInputLine);
            }
            return outputBag;
        }
        return null;
    }
    @Override
    public Schema outputSchema(Schema input) {
        try {
            if (input.size() != 1) {
                throw new RuntimeException("Expected input to have only one field");
            }
            Schema.FieldSchema inputFieldSchema = input.getField(0);
            if (inputFieldSchema.type != DataType.CHARARRAY) {
                throw new RuntimeException("Expected a CHARARRAY as input");
            }
            Schema tupleSchema = new Schema();
            tupleSchema.add(new Schema.FieldSchema("id", DataType.CHARARRAY));
            tupleSchema.add(new Schema.FieldSchema("column", DataType.CHARARRAY));
            tupleSchema.add(new Schema.FieldSchema("value", DataType.CHARARRAY));
            return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), tupleSchema, DataType.BAG));
        } catch (FrontendException exx) {
            throw new RuntimeException(exx);
        }
    }
}
Here is how it is used in PIG:
register 'path to the jar';
define IdColumnValue myPackage.TupleToBagColumnValuePairs();
inpt = load '/pig_fun/input/single_tuple_to_multiple.txt' as (line:chararray);
result = foreach inpt generate FLATTEN(IdColumnValue($0)) as (id1, c2, v2);
dump result;
For good inspiration on writing UDFs with bags, see the DataFu source code by LinkedIn.
You could use TransposeTupleToBag (a UDF from the DataFu library) on the output of STRSPLIT to get the bag, and then FLATTEN the bag to create a separate row per original column.
