Getting camus-example to work with Kafka and Hadoop

My use case is to push Avro data from Kafka to HDFS. Camus seems to be the right tool; however, I am not able to make it work. I am new to Camus and have been trying to get camus-example working:
https://github.com/linkedin/camus
However, I am still facing issues. Code snippet for DummyLogKafkaProducerClient:
package com.linkedin.camus.example.schemaregistry;

import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.Random;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

import com.linkedin.camus.etl.kafka.coders.KafkaAvroMessageEncoder;
import com.linkedin.camus.example.records.DummyLog;

public class DummyLogKafkaProducerClient {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "localhost:6667");
        // props.put("serializer.class", "kafka.serializer.StringEncoder");
        // props.put("partitioner.class", "example.producer.SimplePartitioner");
        // props.put("request.required.acks", "1");

        ProducerConfig config = new ProducerConfig(props);
        Producer<String, byte[]> producer = new Producer<String, byte[]>(config);
        KafkaAvroMessageEncoder encoder = get_DUMMY_LOG_Encoder();

        for (int i = 0; i < 500; i++) {
            KeyedMessage<String, byte[]> data =
                    new KeyedMessage<String, byte[]>("DUMMY_LOG", encoder.toBytes(getDummyLog()));
            producer.send(data);
        }
    }

    public static DummyLog getDummyLog() {
        Random random = new Random();
        DummyLog dummyLog = DummyLog.newBuilder().build();
        dummyLog.setId(random.nextLong());
        dummyLog.setLogTime(new Date().getTime());
        Map<CharSequence, CharSequence> machoStuff = new HashMap<CharSequence, CharSequence>();
        machoStuff.put("macho1", "abcd");
        machoStuff.put("macho2", "xyz");
        dummyLog.setMuchoStuff(machoStuff);
        return dummyLog;
    }

    public static KafkaAvroMessageEncoder get_DUMMY_LOG_Encoder() {
        KafkaAvroMessageEncoder encoder = new KafkaAvroMessageEncoder("DUMMY_LOG", null);
        Properties props = new Properties();
        props.put(KafkaAvroMessageEncoder.KAFKA_MESSAGE_CODER_SCHEMA_REGISTRY_CLASS,
                "com.linkedin.camus.example.schemaregistry.DummySchemaRegistry");
        encoder.init(props, "DUMMY_LOG");
        return encoder;
    }
}
I have also added a default no-arg constructor to DummySchemaRegistry, as it was throwing an InstantiationException:
package com.linkedin.camus.example.schemaregistry;

import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;

import com.linkedin.camus.example.records.DummyLog;
import com.linkedin.camus.example.records.DummyLog2;
import com.linkedin.camus.schemaregistry.MemorySchemaRegistry;

/**
 * This is a little dummy registry that just uses a memory-backed schema registry
 * to store two dummy Avro schemas. You can use this with camus.properties.
 */
public class DummySchemaRegistry extends MemorySchemaRegistry<Schema> {

    public DummySchemaRegistry(Configuration conf) {
        super();
        super.register("DUMMY_LOG", DummyLog.newBuilder().build().getSchema());
        super.register("DUMMY_LOG_2", DummyLog2.newBuilder().build().getSchema());
    }

    public DummySchemaRegistry() {
        super();
        super.register("DUMMY_LOG", DummyLog.newBuilder().build().getSchema());
        super.register("DUMMY_LOG_2", DummyLog2.newBuilder().build().getSchema());
    }
}
Below is the exception trace I get after running the program:
Exception in thread "main" com.linkedin.camus.coders.MessageEncoderException:
org.apache.avro.AvroRuntimeException: org.apache.avro.AvroRuntimeException:
Field id type:LONG pos:0 not set and has no default value
    at com.linkedin.camus.etl.kafka.coders.KafkaAvroMessageEncoder.init(KafkaAvroMessageEncoder.java:55)
    at com.linkedin.camus.example.schemaregistry.DummyLogKafkaProducerClient.get_DUMMY_LOG_Encoder(DummyLogKafkaProducerClient.java:57)
    at com.linkedin.camus.example.schemaregistry.DummyLogKafkaProducerClient.main(DummyLogKafkaProducerClient.java:32)
Caused by: org.apache.avro.AvroRuntimeException: org.apache.avro.AvroRuntimeException:
Field id type:LONG pos:0 not set and has no default value
    at com.linkedin.camus.example.records.DummyLog$Builder.build(DummyLog.java:214)
    at com.linkedin.camus.example.schemaregistry.DummySchemaRegistry.<init>(DummySchemaRegistry.java:16)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
    at java.lang.Class.newInstance(Class.java:438)
    at com.linkedin.camus.etl.kafka.coders.KafkaAvroMessageEncoder.init(KafkaAvroMessageEncoder.java:52)
    ... 2 more
Caused by: org.apache.avro.AvroRuntimeException: Field id type:LONG pos:0 not set and has no default value
    at org.apache.avro.data.RecordBuilderBase.defaultValue(RecordBuilderBase.java:151)
    at com.linkedin.camus.example.records.DummyLog$Builder.build(DummyLog.java:209)
    ... 9 more

I suppose Camus expects the Avro schema to have default values. I changed my dummyLog.avsc to the following and recompiled:
{
  "namespace": "com.linkedin.camus.example.records",
  "type": "record",
  "name": "DummyLog",
  "doc": "Logs for not so important stuff.",
  "fields": [
    { "name": "id",      "type": "int", "default": 0 },
    { "name": "logTime", "type": "int", "default": 0 }
  ]
}
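One caveat worth noting: the producer above fills these fields with long values (random.nextLong() and new Date().getTime()), so regenerating the record class with int fields may break the setId/setLogTime calls. A variant schema that keeps the defaults but uses long types (a sketch, assuming the generated setters take long):
{
  "namespace": "com.linkedin.camus.example.records",
  "type": "record",
  "name": "DummyLog",
  "doc": "Logs for not so important stuff.",
  "fields": [
    { "name": "id",      "type": "long", "default": 0 },
    { "name": "logTime", "type": "long", "default": 0 }
  ]
}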
Let me know if it works for you.
Thanks,
Ambarish

You can default any String or Long field as follows:
{
  "type": "record",
  "name": "CounterData",
  "namespace": "org.avro.usage.tutorial",
  "fields": [
    { "name": "word",  "type": ["string", "null"] },
    { "name": "count", "type": ["long", "null"] }
  ]
}

Camus doesn't assume the schema will have default values. I recently used Camus and ran into the same issue. Actually, the way the schema registry is used in the default example is not correct. I have made some modifications to the Camus code; you can check out https://github.com/chandanbansal/camus, where there are minor changes to make it work.
They don't have a decoder for Avro records; I have written that as well.

I was getting this issue because I was initializing the registry like so:
super.register("DUMMY_LOG_2", LogEvent.newBuilder().build().getSchema());
When I changed it to:
super.register("logEventAvro", LogEvent.SCHEMA$);
that got me past the exception.
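Applied to the DummySchemaRegistry from the question, the same fix would look something like this (a sketch; SCHEMA$ is the static schema constant Avro generates for every specific record class, so no builder has to run):
public DummySchemaRegistry() {
    super();
    // Reference the generated schema constants directly instead of building a
    // record instance, which fails when fields lack defaults.
    super.register("DUMMY_LOG", DummyLog.SCHEMA$);
    super.register("DUMMY_LOG_2", DummyLog2.SCHEMA$);
}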
I also used Garry's com.linkedin.camus.etl.kafka.coders.AvroMessageDecoder.
I also found this blog (Alvin Jin's Notebook) very useful. It pinpoints every issue you could have with the Camus example and solves them!

Related

Flink consuming S3 parquet file Kryo serialization error

We want to consume parquet files from S3. My code snippet is below. My input files are protobuf-encoded parquet files; the protobuf class is Pageview.class.
import com.adshonor.proto.StandardLog;
import com.twitter.chill.protobuf.ProtobufSerializer;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.scala.hadoop.mapreduce.HadoopInputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.proto.ProtoParquetInputFormat;
import scala.Tuple2;

public class ParquetReadJob {
    public static void main(String... args) throws Exception {
        ExecutionEnvironment ee = ExecutionEnvironment.getExecutionEnvironment();
        ee.getConfig().registerTypeWithKryoSerializer(StandardLog.Pageview.class, ProtobufSerializer.class);

        String path = args[0];
        Job job = Job.getInstance();
        job.setInputFormatClass(ProtoParquetInputFormat.class);

        HadoopInputFormat<Void, StandardLog.Pageview> hadoopIF =
                new HadoopInputFormat<>(new ProtoParquetInputFormat<>(), Void.class, StandardLog.Pageview.class, job);
        ProtoParquetInputFormat.addInputPath(job, new Path(path));

        DataSource<Tuple2<Void, StandardLog.Pageview>> dataSet = ee.createInput(hadoopIF).setParallelism(10);
        dataSet.print();
    }
}
I always get this error:
com.esotericsoftware.kryo.KryoException: java.lang.UnsupportedOperationException
Serialization trace:
supportCrtSize_ (access.Access$AdPositionInfo)
adPositionInfo_ (access.Access$AccessRequest)
accessRequest_ (com.adshonor.proto.StandardLog$Pageview$Builder)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:730)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:22)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:679)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:730)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:113)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:315)
at org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:55)
at org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:106)
at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:72)
at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47)
at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73)
at org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:216)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsupportedOperationException
at java.util.Collections$UnmodifiableCollection.add(Collections.java:1055)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:22)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:679)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
... 23 more
Can anyone advise me on how to write a batch processing program that can consume this kind of file?
I also encountered this issue.
I found this and this in a pending PR for flink-protobuf, which solved it.
You would need to add the NonLazyProtobufSerializer and ProtobufKryoSerializer classes to your project and register NonLazyProtobufSerializer as a default Kryo serializer for the Message type:
env.getConfig().addDefaultKryoSerializer(Message.class, NonLazyProtobufSerializer.class);
From the author's JavaDocs:
This is a workaround for an issue that surfaces when consuming a DataSource from Kafka in a Flink TableEnvironment. For fields declared with type 'string' in .proto, the corresponding field on the Java class has declared type 'Object'. The actual type of these fields on objects returned by Message.parseFrom(byte[]) is 'ByteArray'. But the getter methods for these fields return 'String', lazily replacing the underlying ByteArray field with a String, when necessary.
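Putting it together, a sketch of where that registration slots into the question's job (an assumption on my part: NonLazyProtobufSerializer has been copied from the PR into your own project, so no import is shown for it):
import com.google.protobuf.Message;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ParquetReadJobWithWorkaround {
    public static void main(String... args) throws Exception {
        ExecutionEnvironment ee = ExecutionEnvironment.getExecutionEnvironment();
        // Register the PR's serializer for all protobuf Message subtypes before
        // any input is created (NonLazyProtobufSerializer comes from the pending
        // flink-protobuf PR, not from Flink itself).
        ee.getConfig().addDefaultKryoSerializer(Message.class, NonLazyProtobufSerializer.class);
        // ... build the HadoopInputFormat and DataSource exactly as in the question ...
    }
}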
Hope this helps.

Apache Beam - Unable to infer a Coder on a DoFn with multiple output tags

I am trying to execute a pipeline using Apache Beam, but I get an error when trying to use some output tags:
import com.google.cloud.Tuple;
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;
import org.joda.time.Duration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.lang.reflect.Type;
import java.util.Map;
import java.util.stream.Collectors;

/**
 * The Transformer.
 */
class Transformer {

    // The original snippet references LOG without declaring it; a standard SLF4J logger is assumed here.
    private static final Logger LOG = LoggerFactory.getLogger(Transformer.class);

    final static TupleTag<Map<String, String>> successfulTransformation = new TupleTag<>();
    final static TupleTag<Tuple<String, String>> failedTransformation = new TupleTag<>();

    // Assumed shape of the options interface, reconstructed from its usage in main().
    public interface TransformerOptions extends PipelineOptions {
        String getTopicName();
        void setTopicName(String value);
    }

    /**
     * The entry point of the application.
     *
     * @param args the input arguments
     */
    public static void main(String... args) {
        TransformerOptions options = PipelineOptionsFactory.fromArgs(args)
                .withValidation()
                .as(TransformerOptions.class);

        Pipeline p = Pipeline.create(options);
        p.apply("Input", PubsubIO
                .readMessagesWithAttributes()
                .withIdAttribute("id")
                .fromTopic(options.getTopicName()))
         .apply(Window.<PubsubMessage>into(FixedWindows
                .of(Duration.standardSeconds(60))))
         .apply("Transform",
                ParDo.of(new JsonTransformer())
                     .withOutputTags(successfulTransformation,
                             TupleTagList.of(failedTransformation)));
        p.run().waitUntilFinish();
    }

    /**
     * Deserialize the input and convert it to a key-value pairs map.
     */
    static class JsonTransformer extends DoFn<PubsubMessage, Map<String, String>> {

        /**
         * Process each element.
         *
         * @param c the processing context
         */
        @ProcessElement
        public void processElement(ProcessContext c) {
            String messagePayload = new String(c.element().getPayload());
            try {
                Type type = new TypeToken<Map<String, String>>() {}.getType();
                Gson gson = new Gson();
                Map<String, String> map = gson.fromJson(messagePayload, type);
                c.output(map);
            } catch (Exception e) {
                LOG.error("Failed to process input {} -- adding to dead letter file", c.element(), e);
                String attributes = c.element()
                        .getAttributeMap()
                        .entrySet().stream().map((entry) ->
                                String.format("%s -> %s\n", entry.getKey(), entry.getValue()))
                        .collect(Collectors.joining());
                c.output(failedTransformation, Tuple.of(attributes, messagePayload));
            }
        }
    }
}
The error shown is:
Exception in thread "main" java.lang.IllegalStateException: Unable to return a default Coder for Transform.out1 [PCollection]. Correct one of the following root causes: No Coder has been manually specified; you may do so using .setCoder(). Inferring a Coder from the CoderRegistry failed: Unable to provide a Coder for V. Building a Coder using a registered CoderProvider failed. See suppressed exceptions for detailed failures. Using the default output Coder from the producing PTransform failed: Unable to provide a Coder for V. Building a Coder using a registered CoderProvider failed.
I tried different ways to fix the issue, but I think I just do not understand what the problem is. I know that this line causes the error:
.withOutputTags(successfulTransformation, TupleTagList.of(failedTransformation))
but I do not get which part of it needs a specific Coder, or what "V" in the error is (from "Unable to provide a Coder for V").
Why is the error happening? I also tried to look at Apache Beam's docs, but they do not seem to explain such a usage, nor do I understand much from the section discussing coders.
Thanks
First, I would suggest the following -- change:
final static TupleTag<Map<String, String>> successfulTransformation =
new TupleTag<>();
final static TupleTag<Tuple<String, String>> failedTransformation =
new TupleTag<>();
into this:
final static TupleTag<Map<String, String>> successfulTransformation =
new TupleTag<Map<String, String>>() {};
final static TupleTag<Tuple<String, String>> failedTransformation =
new TupleTag<Tuple<String, String>>() {};
That should help the coder inference determine the type of the side output. Also, have you properly registered a CoderProvider for Tuple?
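If coder inference still fails for the Tuple side output even with the anonymous-subclass tags, one fallback (my own sketch, not part of the answer above; it assumes you are willing to swap com.google.cloud.Tuple for Beam's own KV, which has a stock coder) is to pin the coder on the side output explicitly:
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;

class CoderFix {
    // Hypothetical re-declaration of the dead-letter tag over KV<String, String>
    // instead of com.google.cloud.Tuple<String, String>.
    static final TupleTag<KV<String, String>> failedTransformation =
            new TupleTag<KV<String, String>>() {};

    // Given the PCollectionTuple returned by withOutputTags(...), set the coder
    // on the side output directly instead of relying on inference.
    static PCollection<KV<String, String>> pinCoder(PCollectionTuple outputs) {
        return outputs.get(failedTransformation)
                .setCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()));
    }
}
The DoFn's catch block would then emit KV.of(attributes, messagePayload) under that tag.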
Thanks to @Ben Chambers' answer, the Kotlin equivalent is:
val successTag = object : TupleTag<MyObj>() {}
val deadLetterTag = object : TupleTag<String>() {}

Spring Data REST returns EmptyCollectionEmbeddedWrapper instead of empty collection

I am developing a service based on Spring Data REST. Because we are creating the frontend code using Swagger (generated via SpringFox), I had to deactivate the return of the HAL format, which works fine with one exception.
If the result of a request is an empty list, the response looks like this:
{
  "links": [
    { "rel": "self",    "href": "http://localhost:9999/users" },
    { "rel": "profile", "href": "http://localhost:9999/profile/users" }
  ],
  "content": [
    {
      "rel": null,
      "collectionValue": true,
      "relTargetType": "com.example.User",
      "value": []
    }
  ]
}
How can I get an empty List as content?
I had to adapt the solution provided in the previous answer to use the types introduced by Spring HATEOAS 1.x. This is the code I'm using:
import java.util.Collections;

import org.springframework.hateoas.CollectionModel;
import org.springframework.hateoas.server.RepresentationModelProcessor;
import org.springframework.stereotype.Component;

@Component
public class ResourceProcessorEmpty implements RepresentationModelProcessor<CollectionModel<Object>> {

    @Override
    public CollectionModel<Object> process(CollectionModel<Object> resourceToThrowAway) {
        if (resourceToThrowAway.getContent().size() != 1) {
            return resourceToThrowAway;
        }
        if (!resourceToThrowAway.getContent().iterator().next().getClass()
                .getCanonicalName().contains("EmptyCollectionEmbeddedWrapper")) {
            return resourceToThrowAway;
        }
        CollectionModel<Object> newResource = new CollectionModel<>(Collections.emptyList());
        newResource.add(resourceToThrowAway.getLinks());
        return newResource;
    }
}
The best and easiest solution I finally found for this is the following. Implement a custom ResourceProcessor, which is automatically picked up and used by Spring (because of the @Component annotation). Override the process method and, in it, return a new Resources object initialized with an empty list instead of the old Resources you got as an argument; add the links and whatever else you want, and that's it. Like this:
import java.util.Collections;

import javax.servlet.http.HttpServletRequest;

import org.springframework.hateoas.Link;
import org.springframework.hateoas.ResourceProcessor;
import org.springframework.hateoas.Resources;
import org.springframework.stereotype.Component;
import org.springframework.web.context.request.RequestContextHolder;
import org.springframework.web.context.request.ServletRequestAttributes;

@Component
public class ResourceProcessorEmpty implements ResourceProcessor<Resources<Object>> {

    @Override
    public Resources<Object> process(final Resources<Object> resourceToThrowAway) {
        HttpServletRequest request =
                ((ServletRequestAttributes) RequestContextHolder.getRequestAttributes()).getRequest();
        // In my case I needed the request link with parameters, and the empty content[] array.
        Link link = new Link(request.getRequestURL().toString() + "?" + request.getQueryString());
        Resources<Object> newResource = new Resources<>(Collections.emptyList());
        newResource.add(link);
        return newResource;
    }
}
For clarification: if you use Resources<Object>, that will handle empty collections (the case where the "EmptyCollectionEmbeddedWrapper" dummy object would be returned), whereas Resources<Resource<Object>> will handle non-empty collections. In this case the former is needed.

Flink throwing serialization error when reading from HBase

When I read from HBase using a RichFlatMapFunction inside a map I am getting a serialization error. What I am trying to do is: if a datastream element equals a particular string, read from HBase; otherwise, ignore it. Below are the sample program and the error I am getting.
package com.abb.Flinktest

import java.text.SimpleDateFormat
import java.util.Properties

import scala.collection.concurrent.TrieMap

import org.apache.flink.addons.hbase.TableInputFormat
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.io.OutputFormat
import org.apache.flink.api.java.tuple.Tuple2
import org.apache.flink.streaming.api.scala.DataStream
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala.createTypeInformation
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.flink.util.Collector
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.BinaryComparator
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.util.Bytes
import org.apache.log4j.Level

object Flinktesthbaseread {

  def main(args: Array[String]) {
    val env = StreamExecutionEnvironment.createLocalEnvironment()
    val kafkaStream = env.fromElements("hello")
    val c = kafkaStream.map(x => if (x.equals("hello")) kafkaStream.flatMap(new ReadHbase()))
    env.execute()
  }

  class ReadHbase extends RichFlatMapFunction[String, Tuple11[String, String, String, String, String, String, String, String, String, String, String]] with Serializable {

    var conf: org.apache.hadoop.conf.Configuration = null
    var table: org.apache.hadoop.hbase.client.HTable = null
    var hbaseconnection: org.apache.hadoop.hbase.client.Connection = null
    var taskNumber: String = null
    var rowNumber = 0
    val serialVersionUID = 1L

    override def open(parameters: org.apache.flink.configuration.Configuration) {
      println("getting table")
      conf = HBaseConfiguration.create()
      val in = getClass().getResourceAsStream("/hbase-site.xml")
      conf.addResource(in)
      hbaseconnection = ConnectionFactory.createConnection(conf)
      table = new HTable(conf, "testtable")
      // this.taskNumber = String.valueOf(taskNumber)
    }

    override def flatMap(msg: String, out: Collector[Tuple11[String, String, String, String, String, String, String, String, String, String, String]]) {
      // flatmap operation here
    }

    override def close() {
      table.flushCommits()
      table.close()
    }
  }
}
Error:
log4j:WARN No appenders could be found for logger (org.apache.flink.api.scala.ClosureCleaner$).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" org.apache.flink.api.common.InvalidProgramException: Task not serializable
at org.apache.flink.api.scala.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:172)
at org.apache.flink.api.scala.ClosureCleaner$.clean(ClosureCleaner.scala:164)
at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.scalaClean(StreamExecutionEnvironment.scala:617)
at org.apache.flink.streaming.api.scala.DataStream.clean(DataStream.scala:959)
at org.apache.flink.streaming.api.scala.DataStream.map(DataStream.scala:484)
at com.abb.Flinktest.Flinktesthbaseread$.main(Flinktesthbaseread.scala:45)
at com.abb.Flinktest.Flinktesthbaseread.main(Flinktesthbaseread.scala)
Caused by: java.io.NotSerializableException: org.apache.flink.streaming.api.scala.DataStream
- field (class "com.abb.Flinktest.Flinktesthbaseread$$anonfun$1", name: "kafkaStream$1", type: "class org.apache.flink.streaming.api.scala.DataStream")
- root object (class "com.abb.Flinktest.Flinktesthbaseread$$anonfun$1", <function1>)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1182)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:301)
at org.apache.flink.api.scala.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:170)
... 6 more
I tried wrapping the field inside a method and a class, and making the class serializable as well, but no luck. Could someone shed some light on this or suggest a workaround?
The problem is that you're trying to access the Kafka stream variable inside the map function, and a DataStream is simply not serializable. It is just an abstract representation of the data; it doesn't contain anything, which invalidates your function in the first place.
Instead, do something like this:
kafkaStream.filter(x => x.equals("hello")).flatMap(new ReadHbase())
The filter function will only retain the elements for which the condition is true, and those will be passed to your flatMap function.
I would highly recommend you read the basic API concepts documentation, as there appears to be some misunderstanding about what actually happens when you specify a transformation.

HBase map-side join - one of the tables is not getting read? Read from HBase and write the result into HBase

I am trying to do a map-side join of two tables located in HBase. My aim is to keep the records of the small table in a HashMap, compare them with the big table, and, once matched, write the record into a table in HBase again. I wrote similar code for the join operation using both a Mapper and a Reducer; it worked well and both tables were scanned in the mapper class. But since a reduce-side join is not efficient at all, I want to join the tables on the map side only. In the following code, the commented "if" block is just to show that it always returns false, i.e. the first table (the small one) is not getting read. Any hints or help are appreciated. I am using the HDP sandbox.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapred.TableOutputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.util.Tool;

public class JoinDriver extends Configured implements Tool {

    static int row_index = 0;

    public static class JoinJobMapper extends TableMapper<ImmutableBytesWritable, Put> {

        private static byte[] big_table_bytarr = Bytes.toBytes("big_table");
        private static byte[] small_table_bytarr = Bytes.toBytes("small_table");

        HashMap<String, String> myHashMap = new HashMap<String, String>();

        byte[] c1_value;
        byte[] c2_value;
        String big_table;
        String small_table;
        String big_table_c1;
        String big_table_c2;
        String small_table_c1;
        String small_table_c2;
        Text mapperKeyS;
        Text mapperValueS;
        Text mapperKeyB;
        Text mapperValueB;

        public void map(ImmutableBytesWritable rowKey, Result columns, Context context) {
            TableSplit currentSplit = (TableSplit) context.getInputSplit();
            byte[] tableName = currentSplit.getTableName();
            try {
                Put put = new Put(Bytes.toBytes(++row_index));

                // put small table into hashmap - myHashMap
                if (Arrays.equals(tableName, small_table_bytarr)) {
                    c1_value = columns.getValue(Bytes.toBytes("s_cf"), Bytes.toBytes("s_cf_c1"));
                    c2_value = columns.getValue(Bytes.toBytes("s_cf"), Bytes.toBytes("s_cf_c2"));
                    small_table_c1 = new String(c1_value);
                    small_table_c2 = new String(c2_value);
                    mapperKeyS = new Text(small_table_c1);
                    mapperValueS = new Text(small_table_c2);
                    myHashMap.put(small_table_c1, small_table_c2);
                } else if (Arrays.equals(tableName, big_table_bytarr)) {
                    c1_value = columns.getValue(Bytes.toBytes("b_cf"), Bytes.toBytes("b_cf_c1"));
                    c2_value = columns.getValue(Bytes.toBytes("b_cf"), Bytes.toBytes("b_cf_c2"));
                    big_table_c1 = new String(c1_value);
                    big_table_c2 = new String(c2_value);
                    mapperKeyB = new Text(big_table_c1);
                    mapperValueB = new Text(big_table_c2);

                    // if (set.containsKey(big_table_c1)) {
                    put.addColumn(Bytes.toBytes("join"), Bytes.toBytes("join_c1"), Bytes.toBytes(big_table_c1));
                    context.write(new ImmutableBytesWritable(mapperKeyB.getBytes()), put);
                    put.addColumn(Bytes.toBytes("join"), Bytes.toBytes("join_c2"), Bytes.toBytes(big_table_c2));
                    context.write(new ImmutableBytesWritable(mapperKeyB.getBytes()), put);
                    put.addColumn(Bytes.toBytes("join"), Bytes.toBytes("join_c3"), Bytes.toBytes(myHashMap.get(big_table_c1)));
                    context.write(new ImmutableBytesWritable(mapperKeyB.getBytes()), put);
                    // }
                }
            } catch (Exception e) {
                // TODO : exception handling logic
                e.printStackTrace();
            }
        }
    }

    public int run(String[] args) throws Exception {
        List<Scan> scans = new ArrayList<Scan>();

        Scan scan1 = new Scan();
        scan1.setAttribute("scan.attributes.table.name", Bytes.toBytes("small_table"));
        System.out.println(scan1.getAttribute("scan.attributes.table.name"));
        scans.add(scan1);

        Scan scan2 = new Scan();
        scan2.setAttribute("scan.attributes.table.name", Bytes.toBytes("big_table"));
        System.out.println(scan2.getAttribute("scan.attributes.table.name"));
        scans.add(scan2);

        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJar("MSJJ.jar");
        job.setJarByClass(JoinDriver.class);

        TableMapReduceUtil.initTableMapperJob(scans, JoinJobMapper.class, ImmutableBytesWritable.class, Put.class, job);
        TableMapReduceUtil.initTableReducerJob("joined_table", null, job);
        job.setNumReduceTasks(0);
        job.waitForCompletion(true);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        JoinDriver runJob = new JoinDriver();
        runJob.run(args);
    }
}
By reading your problem statement I believe you have got the wrong idea about the use of multiple HBase table input.
I suggest you load the small table into a HashMap in the setup method of the mapper class, then use a map-only job on the big table; in the map method you can fetch the corresponding values from the HashMap you loaded earlier, as in the sketch below.
Let me know how this works out.
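Edit: here is a minimal sketch of what I mean (untested, my own illustration; it reuses the table, column family, and qualifier names from your code):
import java.io.IOException;
import java.util.HashMap;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

public class MapSideJoinMapper extends TableMapper<ImmutableBytesWritable, Put> {

    private final HashMap<String, String> smallTable = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Load the whole small table once per mapper, before any map() calls.
        Connection connection = ConnectionFactory.createConnection(
                HBaseConfiguration.create(context.getConfiguration()));
        try (Table table = connection.getTable(TableName.valueOf("small_table"));
             ResultScanner scanner = table.getScanner(new Scan())) {
            for (Result r : scanner) {
                smallTable.put(
                        Bytes.toString(r.getValue(Bytes.toBytes("s_cf"), Bytes.toBytes("s_cf_c1"))),
                        Bytes.toString(r.getValue(Bytes.toBytes("s_cf"), Bytes.toBytes("s_cf_c2"))));
            }
        } finally {
            connection.close();
        }
    }

    @Override
    public void map(ImmutableBytesWritable rowKey, Result columns, Context context)
            throws IOException, InterruptedException {
        // The job itself scans only big_table; join against the in-memory map.
        String c1 = Bytes.toString(columns.getValue(Bytes.toBytes("b_cf"), Bytes.toBytes("b_cf_c1")));
        String joined = smallTable.get(c1);
        if (joined != null) {
            Put put = new Put(rowKey.get());
            put.addColumn(Bytes.toBytes("join"), Bytes.toBytes("join_c3"), Bytes.toBytes(joined));
            context.write(rowKey, put);
        }
    }
}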
