HBase issue | Google protobuf tag mismatch error while deserialising SCAN string - hadoop

Context: I am in the process of migrating my MR jobs on HBase from CDH 2.0.0-cdh4.5.0 (Hadoop 1) to HDP 2.2.0.0-2041 (YARN). After minor changes the code was compiled against HDP 2.2.0.0-2041.
Problem: I am trying to run an Oozie workflow that executes a series of MR jobs after creating a scan on HBase. The scan is created programmatically and then serialised and deserialised before being handed to the mapper to fetch batches from HBase.
Issue: When TableInputFormat internally tries to deserialise the scan string, it throws an error indicating that the underlying Google protobuf library was not able to deserialise the string. The stack trace looks as follows:
Exception in thread "main" java.io.IOException: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.
    at com.flipkart.yarn.test.TestScanSerialiseDeserialise.convertStringToScan(TestScanSerialiseDeserialise.java:37)
    at com.flipkart.yarn.test.TestScanSerialiseDeserialise.main(TestScanSerialiseDeserialise.java:25)
Caused by: ......
Reproducible: I am able to reproduce this with the sample code pasted below.
Sample code:
Scan scan1 = constructScanObjectForUsers("A");
String json = scan1.toJSON();
Scan scan2 = convertStringToScan(Base64.encodeBytes(json.getBytes()));
.......
private static Scan convertStringToScan(String base64) throws IOException {
    byte[] decoded = Base64.decode(base64);
    // System.out.println(new String(decoded));
    ClientProtos.Scan scan;
    try {
        scan = ClientProtos.Scan.parseFrom(decoded);
    } catch (InvalidProtocolBufferException ipbe) {
        throw new IOException(ipbe);
    }
    return ProtobufUtil.toScan(scan);
}
Possible causes: I suspect that I missed supplying some dependency, or that there is a dependency mismatch in the underlying JARs.
I would appreciate any help in solving this.

Scan scan1 = constructScanObjectForUsers("A");
String json = scan1.toJSON();
Scan scan2 = convertStringToScan(Base64.encodeBytes(json.getBytes()));
Here you appear to be encoding the message as JSON and then applying base64 to the JSON text. Base64 is usually applied to binary data; JSON is already text.
byte[] decoded = Base64.decode(base64);
// System.out.println(new String(decoded));
ClientProtos.Scan scan;
try {
scan = ClientProtos.Scan.parseFrom(decoded);
Here you are base64-decoding that text and then parsing it as a protobuf. Is this the same data from above? If so, this won't work: JSON and protobuf are different formats. If you want to decode as protobuf, you need to encode as protobuf, not JSON.
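For comparison, the protobuf-based round trip looks like the sketch below. This mirrors what HBase's TableMapReduceUtil.convertScanToString does internally in 0.96+ clients; convertScanToString here is just an illustrative helper name, not something from your code.
// Sketch: serialise the Scan as protobuf bytes and then base64-encode them,
// which is the inverse of convertStringToScan above.
private static String convertScanToString(Scan scan) throws IOException {
    ClientProtos.Scan proto = ProtobufUtil.toScan(scan); // Scan -> protobuf message
    return Base64.encodeBytes(proto.toByteArray());      // protobuf bytes -> base64 text
}
// Usage: both sides now speak protobuf, so parseFrom succeeds.
Scan scan1 = constructScanObjectForUsers("A");
Scan scan2 = convertStringToScan(convertScanToString(scan1));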

Related

How to get a generic message as JSON from a MassTransit Fault event

I have a microservices-based application and wish to create a service that captures all Fault events with their message payloads (as JSON) and stores them in a database for later analysis and potential resubmission. I have created a Fault consumer and can capture the Fault, but I am unable to generically extract the message payload as JSON.
public Task Consume(ConsumeContext<Fault> context)
{
    if (context is PipeContext pipeContext)
    {
        var result = pipeContext.TryGetPayload(out ConsumeContext<Fault> payload2);
        var serCont = context.SerializerContext;
    }
    Console.WriteLine($"A message faulted:{context.Message.FaultedMessageId} " +
        $"{context.Message.Exceptions} " +
        $"{context.ConversationId}"
    );
    return Task.CompletedTask;
}
I can see the full details I want in context.SerializerContext._message, but this is inaccessible.
I saw your comment on a similar question:
If you did want to later get the actual fault message, you could use
consumeContext.TryGetMessage<Fault>(out var faultContext) and if it
was present you'd get it back.
I don't have the "T" from every service and therefore want to handle all Faults as JSON.
Is there a way I can capture the full Fault with the message, ideally as json, without having access to every T across my system?
I am on MassTransit 8.
Thanks
If you have the message type (T), you can use TryGetMessage<Fault<T>> and it will return the message type deserialized.
If you don't, or if you want to deal with the JSON in a message directly, using V8 you can get the actual JsonElement from the deserializer and navigate the JSON yourself:
var jsonElement = context.TryGetMessage<JsonElement>();
Previous answer, but for Newtonsoft: https://stackoverflow.com/a/46779547/1882

Facing issue with OpenCSV: Number of data fields does not match number of headers

I am using OpenCSV 4.2 in a Spring Boot project and trying to parse a CSV file with one data row.
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17
"1234","VSHRT","TTRYE","PLRTY","1165","NOW","","Collection","store/WEZXB6Z2CC_1.jpg","500","ABC","false","0","[{""name"":""fdtty"",""id"":""242541"",""value"":10}]","400","ABC","dummycol"
There is no newline character after the last data column.
This is my function which returns an Iterator over the data:
public static <T> Iterator<T> csvToBeanIterator(String csv, Class<T> clazz) {
    CsvToBean<T> cb = new CsvToBeanBuilder<T>(new StringReader(csv))
            .withType(clazz)
            .withSeparator(',')
            .build();
    return cb.iterator();
}
I am getting the following error:
Caused by: com.opencsv.exceptions.CsvRequiredFieldEmptyException: Number of data fields does not match number of headers.
at com.opencsv.bean.HeaderColumnNameMappingStrategy.verifyLineLength(HeaderColumnNameMappingStrategy.java:105) ~[opencsv-4.2.jar:?]
at com.opencsv.bean.AbstractMappingStrategy.populateNewBean(AbstractMappingStrategy.java:313) ~[opencsv-4.2.jar:?]
at com.opencsv.bean.concurrent.ProcessCsvLine.processLine(ProcessCsvLine.java:116) ~[opencsv-4.2.jar:?]
at com.opencsv.bean.concurrent.ProcessCsvLine.run(ProcessCsvLine.java:77) ~[opencsv-4.2.jar:?]
I have tried multiple suggestions from posts on the internet, but with no luck.
Could someone please point out the issue here?
I had the same issue and deleting some of the columns sorted it for me.
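One way to narrow this down is to run the raw lines through OpenCSV's CSVReader and compare the number of parsed fields against the header count; a minimal debugging sketch, assuming the same OpenCSV 4.2 dependency and that csv holds the header line plus the data row:
// Debugging sketch: count how many fields OpenCSV actually parses per line.
// A mismatch between the two counts points at the data row (most likely the
// embedded JSON field with doubled quotes) being split differently than expected.
try (CSVReader reader = new CSVReader(new StringReader(csv))) {
    String[] header = reader.readNext();
    String[] data = reader.readNext();
    System.out.println("header fields: " + header.length);
    System.out.println("data fields: " + (data == null ? 0 : data.length));
}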

Can Kafka Streams consume messages in one format and produce another format, such as Avro messages?

I am using Kafka Streams to consume JSON strings from one topic, process them, and generate a response to be stored in another topic. However, the message produced to the response topic needs to be in Avro format.
I have tried using a String serde for the key and SpecificAvroSerde for the value.
Following is my code to create the topology:
StreamsBuilder builder = new StreamsBuilder();
KStream<Object, Object> consumerStream = builder.stream(kafkaConfiguration.getConsumerTopic());
consumerStream = consumerStream.map(getKeyValueMapper(keyValueMapperClassName));
consumerStream.to(kafkaConfiguration.getProducerTopic());
Following is my config:
if (schemaRegistry != null && schemaRegistry.length > 0) {
    streamsConfig.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, String.join(",", schemaRegistry));
}
streamsConfig.put(this.keySerializerKeyName, StringSerde.class);
streamsConfig.put(this.valueSerialzerKeyName, SpecificAvroSerde.class);
streamsConfig.put(StreamsConfig.APPLICATION_ID_CONFIG, applicationId);
streamsConfig.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, autoOffsetReset);
streamsConfig.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, batchSize);
streamsConfig.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, FailOnInvalidTimestamp.class);
streamsConfig.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, processingGuarantee);
streamsConfig.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, Integer.parseInt(commitIntervalMs));
streamsConfig.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, numberOfThreads);
streamsConfig.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, replicationFactor);
streamsConfig.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, DeserializationExceptionHandler.class);
streamsConfig.put(StreamsConfig.DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG, ProductionExceptionHandler.class);
streamsConfig.put(StreamsConfig.TOPOLOGY_OPTIMIZATION,StreamsConfig.OPTIMIZE);
streamsConfig.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, compressionMode);
streamsConfig.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, maxPollRecords);
I am seeing the following error when I try with the example:
org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
The problem is with the key/value serdes. You should use the correct serdes while consuming the stream, and likewise while publishing it.
If your input is JSON and you want to publish it as Avro, you can do it as follows:
Properties streamsConfig = new Properties();
streamsConfig.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
streamsConfig.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, SpecificAvroSerde.class);
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> consumerStream = builder.stream(kafkaConfiguration.getConsumerTopic(), Consumed.with(Serdes.String(), Serdes.String()));
// Replace AvroObjectClass with your Avro object type
KStream<String, AvroObjectClass> consumerAvroStream = consumerStream.map(getKeyValueMapper(keyValueMapperClassName));
consumerAvroStream.to(kafkaConfiguration.getProducerTopic());
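If you prefer not to rely on the default value serde, you can also configure the Avro serde explicitly and pass it to the sink via Produced.with; a sketch assuming Confluent's kafka-streams-avro-serde is on the classpath and AvroObjectClass is your generated Avro type:
// Sketch: explicitly configure SpecificAvroSerde with the schema registry URL
// and attach it only to the output topic.
Map<String, Object> serdeConfig = Collections.singletonMap(
        KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, String.join(",", schemaRegistry));
SpecificAvroSerde<AvroObjectClass> avroSerde = new SpecificAvroSerde<>();
avroSerde.configure(serdeConfig, false); // false = value serde, not key serde
consumerAvroStream.to(kafkaConfiguration.getProducerTopic(), Produced.with(Serdes.String(), avroSerde));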

CRFClassifier: loading model from a stream gives exception "invalid stream header: 1F8B0800"

I am trying to load a CRFClassifier model from a file. This way works:
// this works
classifier = CRFClassifier.getClassifier("../res/stanford-ner-2018-02-27/classifiers/english.all.3class.distsim.crf.ser.gz");
When I want to use a stream, however, I get an invalid stream header: 1F8B0800 exception:
// this throws an exception
String modelResourcePath = "../res/stanford-ner-2018-02-27/classifiers/english.all.3class.distsim.crf.ser.gz";
BufferedInputStream stream = new BufferedInputStream(new FileInputStream(modelResourcePath));
classifier = CRFClassifier.getClassifier(stream);
Exception:
Exception in thread "main" java.io.StreamCorruptedException: invalid stream header: 1F8B0800
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:866)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1473)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1456)
at edu.stanford.nlp.ie.crf.CRFClassifier.getClassifier(CRFClassifier.java:2890)
at com.sv.research.ner.stanford.StanfordEntityExtractor.<init>(StanfordEntityExtractor.java:34)
at com.sv.research.ner.stanford.StanfordEntityExtractor.main(StanfordEntityExtractor.java:59)
I would expect both ways to be equivalent. My reason for loading through a stream is that ultimately I want to load the model from the JAR resources using:
stream = ClassLoader.getSystemClassLoader().getResourceAsStream(modelResourcePath);
As far as I could see from their sources, the classifier you are trying to load was serialized through a gzip stream (1F 8B 08 00 is the gzip magic header), which the filename-based getClassifier apparently handles for you, while the raw stream overload does not.
So try deserializing the way that they serialize, like this:
BufferedInputStream stream = new BufferedInputStream(new GZIPInputStream(new FileInputStream(modelResourcePath)));
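Since you ultimately want to read the model from the JAR, the same wrapping should work for a classpath resource too; a sketch, assuming modelResourcePath points at the .crf.ser.gz resource on the classpath:
// Sketch: load the gzipped model from the classpath instead of the file system.
InputStream raw = ClassLoader.getSystemClassLoader().getResourceAsStream(modelResourcePath);
classifier = CRFClassifier.getClassifier(new BufferedInputStream(new GZIPInputStream(raw)));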
Cheers

Need to extract attributes directly from Avro using NiFi

I have found no way in NiFi to extract attributes directly from Avro so I am using ConvertAvroToJson -> EvaluateJsonPath -> ConvertJsonToAvro as the workaround.
But I would like to write a script to extract the attributes from the Avro flow file for use in an ExecuteScript processor to determine if it is a better approach.
Does anyone have a script to do this? Otherwise, I may end up using the original approach.
Thanks,
Kevin
Here's a Groovy script (which needs the Avro JAR in its Module Directory property) where I let the user specify dynamic properties with JSONPath expressions to be evaluated against the Avro file. Ironically it does a GenericData.toString() which converts the record to JSON anyway, but perhaps there is some code in here you could reuse:
import org.apache.avro.*
import org.apache.avro.generic.*
import org.apache.avro.file.*
import groovy.json.*
import org.apache.commons.io.IOUtils
import java.nio.charset.*

flowFile = session.get()
if (!flowFile) return

final GenericData genericData = GenericData.get();
slurper = new JsonSlurper().setType(JsonParserType.INDEX_OVERLAY)
pathAttrs = this.binding?.variables?.findAll { attr -> attr.key.startsWith('avro.path') }
newAttrs = [:]
try {
    session.read(flowFile, { inputStream ->
        def reader = new DataFileStream<>(inputStream, new GenericDatumReader<GenericRecord>())
        GenericRecord currRecord = null;
        if (reader.hasNext()) {
            currRecord = reader.next();
            log.info(genericData.toString(currRecord))
            record = slurper.parseText(genericData.toString(currRecord))
            pathAttrs?.each { k, v ->
                object = record
                v.value.tokenize('.').each {
                    object = object[it]
                }
                newAttrs[k - "avro.path."] = String.valueOf(object)
            }
            reader.close()
        }
    } as InputStreamCallback)
    newAttrs.each { k, v ->
        flowFile = session.putAttribute(flowFile, k, v)
    }
    session.transfer(flowFile, REL_SUCCESS)
} catch (e) {
    log.error("Error during Avro Path: {}", [e.message] as Object[], e)
    session.transfer(flowFile, REL_FAILURE)
}
If you meant to extract Avro metadata rather than fields (not totally sure what you meant by "attributes"), also check MergeContent's AvroMerge, as there is some code in there to pull Avro metadata.
If you are extracting simple patterns from a single Avro record per flowfile, ExtractText may be sufficient for you. If you want to take advantage of the new record processing available in Apache NiFi 1.3.0, AvroReader is where you should start, and there are a series of blogs describing this process in detail. You can also extract Avro metadata with ExtractAvroMetadata.
