ArrayWritable as key in Hadoop MapReduce - hadoop

I am attempting to create a dynamic MapReduce application that takes its dimensions from an external properties file. The main problem is that the key will be composite and may contain any number of fields, e.g. 3 keys, 4 keys, etc.
My Mapper:
public void map(AvroKey<flumeLogs> key, NullWritable value, Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    int dimensionCount = Integer.parseInt(conf.get("dimensionCount"));
    String[] dimensions = conf.get("dimensions").split(","); // this gets the dimensions from the run method in main
    Text[] values = new Text[dimensionCount]; // this is supposed to be my composite key

    for (int i = 0; i < dimensionCount; i++) {
        switch (dimensions[i]) {
            case "region": values[i] = new Text("-");
                break;
            case "event": values[i] = new Text("-");
                break;
            case "eventCode": values[i] = new Text("-");
                break;
            case "mobile": values[i] = new Text("-");
        }
    }
    context.write(new StringArrayWritable(values), new IntWritable(1));
}
The values will have good logic later.
My StringArrayWritable:
public class StringArrayWritable extends ArrayWritable {

    public StringArrayWritable() {
        super(Text.class);
    }

    public StringArrayWritable(Text[] values) {
        super(Text.class, values);
        Text[] texts = new Text[values.length];
        for (int i = 0; i < values.length; i++) {
            texts[i] = new Text(values[i]);
        }
        set(texts);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (String s : super.toStrings()) {
            sb.append(s).append("\t");
        }
        return sb.toString();
    }
}
The error I am getting:
Error: java.io.IOException: Initialization of all the collectors failed. Error in last collector was :class StringArrayWritable
at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:414)
at org.apache.hadoop.mapred.MapTask.access$100(MapTask.java:81)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassCastException: class StringArrayWritable
at java.lang.Class.asSubclass(Class.java:3165)
at org.apache.hadoop.mapred.JobConf.getOutputKeyComparator(JobConf.java:892)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:1005)
at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:402)
... 9 more
Any help would be greatly appreciated.
Thanks a lot.

You're trying to use a Writable object as the key. In MapReduce the key must implement the WritableComparable interface, while ArrayWritable only implements the Writable interface.
The difference between the two is that the comparable interface requires you to implement a compareTo method so that MapReduce is able to sort and group the keys correctly.
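For illustration only (this sketch is not from the original answer), a composite key that Hadoop can sort might look roughly like the following; the TextArrayKey name and the field-by-field ordering are assumptions:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: a fixed-order array of Text fields that
// Hadoop can serialize, compare, and group on.
public class TextArrayKey implements WritableComparable<TextArrayKey> {

    private Text[] fields = new Text[0];

    public TextArrayKey() {
    }

    public TextArrayKey(Text[] fields) {
        this.fields = fields;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(fields.length);
        for (Text field : fields) {
            field.write(out);
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        int length = in.readInt();
        fields = new Text[length];
        for (int i = 0; i < length; i++) {
            fields[i] = new Text();
            fields[i].readFields(in);
        }
    }

    @Override
    public int compareTo(TextArrayKey other) {
        // Compare field by field; shorter keys sort first when all shared fields are equal.
        int length = Math.min(fields.length, other.fields.length);
        for (int i = 0; i < length; i++) {
            int cmp = fields[i].compareTo(other.fields[i]);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(fields.length, other.fields.length);
    }

    @Override
    public int hashCode() {
        // Used by the default HashPartitioner; must be consistent with compareTo/equals.
        int result = 1;
        for (Text field : fields) {
            result = 31 * result + field.hashCode();
        }
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof TextArrayKey && compareTo((TextArrayKey) obj) == 0;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Text field : fields) {
            sb.append(field).append("\t");
        }
        return sb.toString();
    }
}

The mapper would then emit context.write(new TextArrayKey(values), new IntWritable(1)), and the driver would register the class via job.setMapOutputKeyClass(TextArrayKey.class).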

Related

How to resolve ClassCastException in MultiResourceItemReader Spring Batch

I'm reading multiple files from an S3 bucket using MultiResourceItemReader. I'm getting a ClassCastException before the myReader() method executes; something is wrong with the MultiResourceItemReader and I'm not sure what.
Please find my code below:
@Bean
public MultiResourceItemReader<String> multiResourceReader() {
    String bucket = "mybucket";
    String key = "/myfiles";
    List<InputStream> resourceList = s3Client.getFiles(bucket, key);
    List<InputStreamResource> inputStreamResourceList = new ArrayList<>();
    for (InputStream s : resourceList) {
        inputStreamResourceList.add(new InputStreamResource(s));
    }
    Resource[] resources = inputStreamResourceList.toArray(new InputStreamResource[inputStreamResourceList.size()]);
    //InputStreamResource[] resources = inputStreamResourceList.toArray(new InputStreamResource[inputStreamResourceList.size()]);

    // I'm getting all the stream content - I verified my stream is not null
    for (int i = 0; i < resources.length; i++) {
        try {
            InputStream s = resources[i].getInputStream();
            String result = IOUtils.toString(s, StandardCharsets.UTF_8);
            System.out.println(result);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    MultiResourceItemReader<String> resourceItemReader = new MultiResourceItemReader<>();
    resourceItemReader.setResources(resources);
    resourceItemReader.setDelegate(myReader());
    resourceItemReader.setDelegate((ResourceAwareItemReaderItemStream<? extends String>) new CustomComparator());
    return resourceItemReader;
}
Exception:
Caused by: java.lang.ClassCastException: class CustomComparator cannot be cast to class org.springframework.batch.item.file.ResourceAwareItemReaderItemStream (CustomComparator and org.springframework.batch.item.file.ResourceAwareItemReaderItemStream are in unnamed module of loader org.springframework.boot.loader.LaunchedURLClassLoader @cc285f4)
at org.springframework.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:244)
at org.springframework.context.annotation.ConfigurationClassEnhancer$BeanMethodInterceptor.intercept(ConfigurationClassEnhancer.java:331)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:154)
... 65 common frames omitted
Can someone please help me resolve this issue? I appreciate your help in advance. Thanks.
The reason you see the NullPointerException is the default comparator used by the MultiResourceItemReader to sort the resources after loading them.
The default compare behavior calls the getFilename() method of the InputStreamResource.
Refer - https://github.com/spring-projects/spring-batch/blob/115c3022147692155d45e23cdd5cef84895bf9f5/spring-batch-infrastructure/src/main/java/org/springframework/batch/item/file/MultiResourceItemReader.java#L82
But InputStreamResource just inherits the getFilename() method from its parent, AbstractResource, which simply returns null.
https://github.com/spring-projects/spring-framework/blob/316e84f04f3dbec3ea5ab8563cc920fb21f49749/spring-core/src/main/java/org/springframework/core/io/AbstractResource.java#L220
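For illustration (a minimal sketch, assuming Spring 5.x where AbstractResource.getFilename() returns null rather than throwing), you can see the missing filename directly:

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.springframework.core.io.InputStreamResource;
import org.springframework.core.io.Resource;

public class FilenameDemo {
    public static void main(String[] args) {
        // InputStreamResource does not override getFilename(), so the
        // AbstractResource implementation returns null. The default
        // MultiResourceItemReader comparator sorts by filename and
        // therefore fails on these resources.
        InputStream stream = new ByteArrayInputStream("data".getBytes(StandardCharsets.UTF_8));
        Resource resource = new InputStreamResource(stream);
        System.out.println(resource.getFilename()); // prints: null
    }
}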
The solution is to provide a custom comparator for the MultiResourceItemReader. Here is a simple example, assuming you do not want to sort the resources in a specific way before processing:
public class CustomComparator implements Comparator<Resource> {

    @Override
    public int compare(Resource r1, Resource r2) {
        // InputStreamResource exposes no filename or last-modified metadata to sort on,
        // so fall back to an arbitrary but consistent ordering by hash code.
        return Long.compare(r1.hashCode(), r2.hashCode());
    }
}
MultiResourceItemReader<String> resourceItemReader = new MultiResourceItemReader<>();
resourceItemReader.setResources(resources);
resourceItemReader.setDelegate(myReader());
//UPDATED with correction - set custom Comparator
resourceItemReader.setComparator(new CustomComparator());
Refer to this answer for how a Comparator is used by Spring Batch's MultiResourceItemReader:
File processing order with Spring Batch

Spring Batch partitioning of db not working properly

I have configured a job as follows, which reads from a db and writes into files, partitioning the data on the basis of a sequence.
//Job Config
@Bean
public Job job(JobBuilderFactory jobBuilderFactory) throws Exception {
    Flow masterFlow1 = (Flow) new FlowBuilder<Object>("masterFlow1").start(masterStep()).build();
    return (jobBuilderFactory.get("Partition-Job")
            .incrementer(new RunIdIncrementer())
            .start(masterFlow1)
            .build()).build();
}

@Bean
public Step masterStep() throws Exception {
    return stepBuilderFactory.get(MASTERPPREPAREDATA)
            //.listener(customSEL)
            .partitioner(STEPPREPAREDATA, new DBPartitioner())
            .step(prepareDataForS1())
            .gridSize(gridSize)
            .taskExecutor(new SimpleAsyncTaskExecutor("Thread"))
            .build();
}

@Bean
public Step prepareDataForS1() throws Exception {
    return stepBuilderFactory.get(STEPPREPAREDATA)
            //.listener(customSEL)
            .<InputData, InputData>chunk(chunkSize)
            .reader(JDBCItemReader(0, 0))
            .writer(writer(null))
            .build();
}

@Bean(destroyMethod = "")
@StepScope
public JdbcCursorItemReader<InputData> JDBCItemReader(@Value("#{stepExecutionContext[startingIndex]}") int startingIndex,
        @Value("#{stepExecutionContext[endingIndex]}") int endingIndex) {
    JdbcCursorItemReader<InputData> ir = new JdbcCursorItemReader<>();
    ir.setDataSource(batchDataSource);
    ir.setMaxItemCount(DBPartitioner.partitionSize);
    ir.setSaveState(false);
    ir.setRowMapper(new InputDataRowMapper());
    ir.setSql("SELECT * FROM FIF_INPUT fi WHERE fi.SEQ > ? AND fi.SEQ < ?");
    ir.setPreparedStatementSetter(new PreparedStatementSetter() {
        @Override
        public void setValues(PreparedStatement ps) throws SQLException {
            ps.setInt(1, startingIndex);
            ps.setInt(2, endingIndex);
        }
    });
    return ir;
}

@Bean
@StepScope
public FlatFileItemWriter<InputData> writer(@Value("#{stepExecutionContext[index]}") String index) {
    System.out.println("writer initialized!!!!!!!!!!!!!" + index);
    //Create writer instance
    FlatFileItemWriter<InputData> writer = new FlatFileItemWriter<>();
    //Set output file location
    writer.setResource(new FileSystemResource(batchDirectory + relativeInputDirectory + index + inputFileForS1));
    //All job repetitions should "append" to same output file
    writer.setAppendAllowed(false);
    //Name field values sequence based on object properties
    writer.setLineAggregator(customLineAggregator);
    return writer;
}
The partitioner used for partitioning the db is written separately in another file, as follows:
//PartitionDb.java
public class DBPartitioner implements Partitioner {

    public static int partitionSize;
    private static Log log = LogFactory.getLog(DBPartitioner.class);

    @SuppressWarnings("unchecked")
    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        log.debug("START: Partition" + "grid size:" + gridSize);
        @SuppressWarnings("rawtypes")
        Map partitionMap = new HashMap<>();
        int startingIndex = -1;
        int endSize = partitionSize + 1;
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext ctxMap = new ExecutionContext();
            ctxMap.putInt("startingIndex", startingIndex);
            ctxMap.putInt("endingIndex", endSize);
            ctxMap.put("index", i);
            startingIndex = endSize - 1;
            endSize += partitionSize;
            partitionMap.put("Thread:-" + i, ctxMap);
        }
        log.debug("END: Created Partitions of size: " + partitionMap.size());
        return partitionMap;
    }
}
This executes properly, but the problem is that even after partitioning on the basis of sequence, I am getting the same rows in multiple files, which is not right since I am providing a different set of data for each partition. Can anyone tell me what's wrong? I am using HikariCP for db connection pooling and Spring Batch 4.
This executes properly, but the problem is that even after partitioning on the basis of sequence, I am getting the same rows in multiple files, which is not right since I am providing a different set of data for each partition.
I'm not sure your partitioner is working properly. A quick test shows that it is not providing different sets of data as you are claiming:
DBPartitioner dbPartitioner = new DBPartitioner();
Map<String, ExecutionContext> partition = dbPartitioner.partition(5);
for (String s : partition.keySet()) {
    System.out.println(s + " : " + partition.get(s));
}
This prints:
Thread:-0 : {endingIndex=1, index=0, startingIndex=-1}
Thread:-1 : {endingIndex=1, index=1, startingIndex=0}
Thread:-2 : {endingIndex=1, index=2, startingIndex=0}
Thread:-3 : {endingIndex=1, index=3, startingIndex=0}
Thread:-4 : {endingIndex=1, index=4, startingIndex=0}
As you can see, almost all partitions will have the same startingIndex and endingIndex.
I recommend you unit test your partitioner before using it in a partitioned step.
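For illustration only (not part of the original answer), a partitioner that derives non-overlapping ranges from the grid size could look like the following sketch; the totalRows value is a placeholder you would obtain yourself, e.g. from a COUNT query on FIF_INPUT:

import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Hypothetical partitioner sketch: splits [1, totalRows] into gridSize
// non-overlapping ranges so each JdbcCursorItemReader sees different rows.
public class RangeDBPartitioner implements Partitioner {

    private final int totalRows; // placeholder: e.g. SELECT COUNT(*) FROM FIF_INPUT

    public RangeDBPartitioner(int totalRows) {
        this.totalRows = totalRows;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitionMap = new HashMap<>();
        int partitionSize = (int) Math.ceil((double) totalRows / gridSize);
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext ctx = new ExecutionContext();
            // The reader's SQL uses "SEQ > ? AND SEQ < ?", so the bounds are exclusive.
            ctx.putInt("startingIndex", i * partitionSize);
            ctx.putInt("endingIndex", (i + 1) * partitionSize + 1);
            ctx.put("index", String.valueOf(i));
            partitionMap.put("partition-" + i, ctx);
        }
        return partitionMap;
    }
}

With ranges computed from the grid size rather than from a static, externally set field, each step execution context receives a distinct, non-overlapping window of SEQ values.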

Gson: How do I deserialize an inner JSON object to a map if the property name is not fixed?

My client retrieves JSON content as below:
{
  "table": "tablename",
  "update": 1495104575669,
  "rows": [
    {"column5": 11, "column6": "yyy"},
    {"column3": 22, "column4": "zzz"}
  ]
}
In the rows array, the keys are not fixed. I want to retrieve the keys and values and save them into a Map using Gson 2.8.x.
How can I configure Gson to deserialize this?
Here is my idea:
public class Dataset {
    private String table;
    private long update;
    private List<Rows> lists; // <-- little confused here,
    // or private List<HashMap<String, Object>> lists
    // Setter/Getter
}

public class Rows {
    private HashMap<String, Object> map;
    ....
}

Dataset k = gson.fromJson(jsonStr, Dataset.class);
log.info(k.getRows().size()); // <-- I got two null objects
Thanks.
Gson does not support such a thing out of the box. It would be nice if you could make the property name fixed. If not, you have a few options that will probably help:
1. Just rename the Dataset.lists field to Dataset.rows, if the property name is in fact fixed to rows.
2. If the possible name set is known in advance, tell Gson to accept alternative names using @SerializedName.
3. If the possible name set is really unknown and may change in the future, you might want to make it fully dynamic using a custom TypeAdapter (streaming mode; requires less memory, but is harder to use) or a custom JsonDeserializer (object mode; requires more memory to store intermediate tree views, but is easy to use) registered with GsonBuilder.
For option #2, you can simply declare the alternative names:
@SerializedName(value = "lists", alternate = "rows")
final List<Map<String, Object>> lists;
For option #3, bind a downstream List<Map<String, Object>> type adapter and try to detect the name dynamically. Note that I omit the Rows class deserialization strategy for simplicity; I believe you might want to remove the Rows class in favor of a simple Map<String, Object>. Another note: prefer Map over concrete collection implementations -- hash maps are unordered, but declaring the field as Map lets Gson pick an ordered map like LinkedTreeMap (a Gson internal) or LinkedHashMap, which may be important for datasets.
// Type tokens are immutable and can be declared constants
private static final TypeToken<String> stringTypeToken = new TypeToken<String>() {
};
private static final TypeToken<Long> longTypeToken = new TypeToken<Long>() {
};
private static final TypeToken<List<Map<String, Object>>> stringToObjectMapListTypeToken = new TypeToken<List<Map<String, Object>>>() {
};

private static final Gson gson = new GsonBuilder()
        .registerTypeAdapterFactory(new TypeAdapterFactory() {
            @Override
            public <T> TypeAdapter<T> create(final Gson gson, final TypeToken<T> typeToken) {
                if ( typeToken.getRawType() != Dataset.class ) {
                    return null;
                }
                // If the actual type token represents the Dataset class, then pick the bunch of downstream type adapters
                final TypeAdapter<String> stringTypeAdapter = gson.getDelegateAdapter(this, stringTypeToken);
                final TypeAdapter<Long> primitiveLongTypeAdapter = gson.getDelegateAdapter(this, longTypeToken);
                final TypeAdapter<List<Map<String, Object>>> stringToObjectMapListTypeAdapter = gson.getDelegateAdapter(this, stringToObjectMapListTypeToken);
                // And compose the bunch into a single dataset type adapter
                final TypeAdapter<Dataset> datasetTypeAdapter = new TypeAdapter<Dataset>() {
                    @Override
                    public void write(final JsonWriter out, final Dataset dataset) {
                        // Omitted for brevity
                        throw new UnsupportedOperationException();
                    }

                    @Override
                    public Dataset read(final JsonReader in)
                            throws IOException {
                        in.beginObject();
                        String table = null;
                        long update = 0;
                        List<Map<String, Object>> lists = null;
                        while ( in.hasNext() ) {
                            final String name = in.nextName();
                            switch ( name ) {
                            case "table":
                                table = stringTypeAdapter.read(in);
                                break;
                            case "update":
                                update = primitiveLongTypeAdapter.read(in);
                                break;
                            default:
                                // Any unknown property name is assumed to hold the rows array
                                lists = stringToObjectMapListTypeAdapter.read(in);
                                break;
                            }
                        }
                        in.endObject();
                        return new Dataset(table, update, lists);
                    }
                }.nullSafe(); // Making the type adapter null-safe
                @SuppressWarnings("unchecked")
                final TypeAdapter<T> typeAdapter = (TypeAdapter<T>) datasetTypeAdapter;
                return typeAdapter;
            }
        })
        .create();
final Dataset dataset = gson.fromJson(jsonReader, Dataset.class);
System.out.println(dataset.lists);
The code above would then print:
[{column5=11.0, column6=yyy}, {column3=22.0, column4=zzz}]
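A side note, not from the original answer: the numbers print as 11.0 and 22.0 because Gson binds JSON numbers to Double when the declared type is Object. If you can use Gson 2.8.9 or later, a ToNumberPolicy can change that; a minimal sketch:

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.ToNumberPolicy;

// Sketch: make Gson deserialize integral JSON numbers bound to Object as Long
// instead of Double (requires Gson 2.8.9+, where ToNumberPolicy was introduced).
public class NumberPolicyDemo {
    public static void main(String[] args) {
        Gson gson = new GsonBuilder()
                .setObjectToNumberStrategy(ToNumberPolicy.LONG_OR_DOUBLE)
                .create();
        Object value = gson.fromJson("{\"column5\": 11}", java.util.Map.class).get("column5");
        System.out.println(value); // prints: 11 (a Long), not 11.0
    }
}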

Getting Hbase Exception No regions passed

Hi, I am new to HBase and I'm trying to learn how to bulk load data into an HBase table using MapReduce,
but I am getting the exception below:
Exception in thread "main" java.lang.IllegalArgumentException: No regions passed
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2.writePartitions(HFileOutputFormat2.java:307)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2.configurePartitioner(HFileOutputFormat2.java:527)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2.configureIncrementalLoad(HFileOutputFormat2.java:391)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2.configureIncrementalLoad(HFileOutputFormat2.java:356)
at JobDriver.run(JobDriver.java:108)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at JobDriver.main(JobDriver.java:34)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
This is my Mapper code:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    System.out.println("Value in Mapper" + value.toString());
    String[] values = value.toString().split(",");
    byte[] row = Bytes.toBytes(values[0]);
    ImmutableBytesWritable k = new ImmutableBytesWritable(row);
    KeyValue kvProtocol = new KeyValue(row, "PROTOCOLID".getBytes(), "PROTOCOLID".getBytes(), values[1].getBytes());
    context.write(k, kvProtocol);
}
This is my Job Configuration
public class JobDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        // TODO Auto-generated method stub
        ToolRunner.run(new JobDriver(), args);
        System.exit(0);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        // TODO Auto-generated method stub
        // HBase Configuration
        System.out.println("**********Starting Hbase*************");
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "TestHFileToHBase");
        job.setJarByClass(JobDriver.class);
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(KeyValue.class);
        job.setMapperClass(LoadMapper.class);
        job.setOutputFormatClass(HFileOutputFormat2.class);
        HTable table = new HTable(conf, "kiran");
        FileInputFormat.addInputPath(job, new Path("hdfs://192.168.61.62:9001/sampledata.csv"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.61.62:9001/deletions_6.csv"));
        HFileOutputFormat2.configureIncrementalLoad(job, table);
        //System.exit(job.waitForCompletion(true) ? 0 : 1);
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
Can anyone please help me in resolving the exception?
You have to create the table first. You can do it with the code below:
//Create table and do pre-split
HTableDescriptor descriptor = new HTableDescriptor(
        Bytes.toBytes(tableName)
);
descriptor.addFamily(
        new HColumnDescriptor(Constants.COLUMN_FAMILY_NAME)
);

HBaseAdmin admin = new HBaseAdmin(config);

byte[] startKey = new byte[16];
Arrays.fill(startKey, (byte) 0);

byte[] endKey = new byte[16];
Arrays.fill(endKey, (byte) 255);

admin.createTable(descriptor, startKey, endKey, REGIONS_COUNT);
admin.close();
or directly from the hbase shell with the command:
create 'kiran', 'colfam1'
The exception is thrown because the startKeys list is empty: line 306.
More info can be found here.
Note that the table name must be the same as the one you use in your code (kiran).

Any suggestions for reading two different dataset into Hadoop at the same time?

Dear hadooper:
I'm new to Hadoop, and I've recently been trying to implement an algorithm.
This algorithm needs to calculate a matrix that represents the different ratings of every pair of songs. I have already done this, and the output is a 600000*600000 sparse matrix which I stored in HDFS. Let's call this dataset A (size = 160G).
Now, I need to read the users' profiles to predict their rating for a specific song. So I need to read the users' profiles first (which are 5G in size); let's call this dataset B, and then do the calculation using dataset A.
But now I don't know how to read the two datasets from a single Hadoop program. Or can I read dataset B into RAM and then do the calculation? (I guess I can't, because HDFS is a distributed system, and I can't read dataset B into a single machine's memory.)
Any suggestions?
You can use two map functions; each map function can process one dataset if you want to implement different processing. You need to register each mapper with your job conf. For example:
public static class FullOuterJoinStdDetMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    private String person_name, book_title, file_tag = "person_book#";
    private String emit_value = new String();
    //emit_value = "";

    public void map(LongWritable key, Text values, OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String line = values.toString();
        try {
            String[] person_detail = line.split(",");
            person_name = person_detail[0].trim();
            book_title = person_detail[1].trim();
        } catch (ArrayIndexOutOfBoundsException e) {
            person_name = "student name missing";
        }
        emit_value = file_tag + person_name;
        output.collect(new Text(book_title), new Text(emit_value));
    }
}

public static class FullOuterJoinResultDetMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    private String author_name, book_title, file_tag = "auth_book#";
    private String emit_value = new String();
    // emit_value = "";

    public void map(LongWritable key, Text values, OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String line = values.toString();
        try {
            String[] author_detail = line.split(",");
            author_name = author_detail[1].trim();
            book_title = author_detail[0].trim();
        } catch (ArrayIndexOutOfBoundsException e) {
            author_name = "Not Appeared in Exam";
        }
        emit_value = file_tag + author_name;
        output.collect(new Text(book_title), new Text(emit_value));
    }
}

public static void main(String args[]) throws Exception {
    if (args.length != 3) {
        System.out.println("Input output file missing");
        System.exit(-1);
    }
    Configuration conf = new Configuration();
    String[] argum = new GenericOptionsParser(conf, args).getRemainingArgs();
    conf.set("mapred.textoutputformat.separator", ",");
    JobConf mrjob = new JobConf();
    mrjob.setJobName("Inner_Join");
    mrjob.setJarByClass(FullOuterJoin.class);
    MultipleInputs.addInputPath(mrjob, new Path(argum[0]), TextInputFormat.class, FullOuterJoinStdDetMapper.class);
    MultipleInputs.addInputPath(mrjob, new Path(argum[1]), TextInputFormat.class, FullOuterJoinResultDetMapper.class);
    FileOutputFormat.setOutputPath(mrjob, new Path(args[2]));
    mrjob.setReducerClass(FullOuterJoinReducer.class);
    mrjob.setOutputKeyClass(Text.class);
    mrjob.setOutputValueClass(Text.class);
    JobClient.runJob(mrjob);
}
Hadoop allows you to use different input formats for different folders. So you can read from several data sources and then cast to a specific type in the Map function, i.e. in one case you get (String, User), in the other (String, SongSongRating), and your Map signature is (String, Object).
The second step is selecting the recommendation algorithm and joining the data in some way so that the aggregator has at least enough information to calculate a recommendation.
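As a sketch only (the job configuration above references a FullOuterJoinReducer but does not show it), a reducer in the same old-API style could separate the two tagged streams by their prefixes before joining them; the join logic here is an assumption:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical reducer sketch: values arrive tagged with "person_book#" or
// "auth_book#", so the two datasets can be told apart and joined per key.
public class FullOuterJoinReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        List<String> persons = new ArrayList<>();
        List<String> authors = new ArrayList<>();
        while (values.hasNext()) {
            String value = values.next().toString();
            if (value.startsWith("person_book#")) {
                persons.add(value.substring("person_book#".length()));
            } else if (value.startsWith("auth_book#")) {
                authors.add(value.substring("auth_book#".length()));
            }
        }
        // Emit the cross product of both sides for this key (a simple join).
        for (String person : persons) {
            for (String author : authors) {
                output.collect(key, new Text(person + "," + author));
            }
        }
    }
}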
