CRFClassifier: loading model from a stream gives exception "invalid stream header: 1F8B0800" - stanford-nlp

I am trying to load a CRFClassifier model from a file. This way works:
// this works
classifier = CRFClassifier.getClassifier("../res/stanford-ner-2018-02-27/classifiers/english.all.3class.distsim.crf.ser.gz");
When I try to use a stream, however, I get an invalid stream header: 1F8B0800 exception:
// this throws an exception
String modelResourcePath = "../res/stanford-ner-2018-02-27/classifiers/english.all.3class.distsim.crf.ser.gz";
BufferedInputStream stream = new BufferedInputStream(new FileInputStream(modelResourcePath));
classifier = CRFClassifier.getClassifier(stream);
Exception:
Exception in thread "main" java.io.StreamCorruptedException: invalid stream header: 1F8B0800
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:866)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1473)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1456)
at edu.stanford.nlp.ie.crf.CRFClassifier.getClassifier(CRFClassifier.java:2890)
at com.sv.research.ner.stanford.StanfordEntityExtractor.<init>(StanfordEntityExtractor.java:34)
at com.sv.research.ner.stanford.StanfordEntityExtractor.main(StanfordEntityExtractor.java:59)
I would expect both ways to be equivalent. My reason for loading through a stream is that ultimately I want to load the model from JAR resources using:
stream = ClassLoader.getSystemClassLoader().getResourceAsStream(modelResourcePath);

As far as I could see from their sources, the classifier you are trying to use was serialized with gzip, and the path-based loader wraps the file in a GZIPInputStream for you (the 1F 8B 08 00 in the error is simply the gzip magic header).
So try deserializing the same way that they serialize, like this:
BufferedInputStream stream = new BufferedInputStream(new GZIPInputStream(new FileInputStream(modelResourcePath)));
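And since the end goal is loading from JAR resources, the same GZIPInputStream wrapping should work with a classpath stream too. A minimal sketch, assuming the model is packaged on the classpath under the path shown (the class name and resource path are placeholders, adjust them to your layout):
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import edu.stanford.nlp.ie.crf.CRFClassifier;

public class ClasspathModelLoader {
    // Hypothetical resource path -- it must match where the .gz model sits inside your JAR.
    private static final String MODEL_RESOURCE =
            "classifiers/english.all.3class.distsim.crf.ser.gz";

    public static CRFClassifier<?> load() throws Exception {
        InputStream raw = ClassLoader.getSystemClassLoader().getResourceAsStream(MODEL_RESOURCE);
        // Unzip explicitly: the stream overload of getClassifier() does not gunzip for you.
        try (BufferedInputStream stream = new BufferedInputStream(new GZIPInputStream(raw))) {
            return CRFClassifier.getClassifier(stream);
        }
    }
}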
Cheers

Related

Apache Camel with Spring Boot - Zip process

I am trying to process a zip file with Apache Camel.
After making a call, I get a zip file and try to prepare the next call with this zip file as the body.
That call requires form data with one name and the zip file as the value.
I handle it this way:
process(e -> {
    Object zip = e.getIn().getBody();
    MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
    body.add("file", zip);
    e.getIn().setBody(body);
})
But I receive the exception:
org.apache.camel.NoTypeConversionAvailableException: No type converter available to convert from type: org.springframework.util.LinkedMultiValueMap to the required type: java.io.InputStream with value {file=[[B#2b02c691]}
Any Ideas?
Cheers!
I tried to get the response as byte[] but it still does not work.
As Jeremy said, the error says Camel is expecting (further in the process) a body of type InputStream, whereas you are preparing a body of type MultiValueMap (by the way: why use a map if you have a single object to handle?).
I do not know what the concrete type of your 'zip' object is, but (if needed) you may have to replace the current body with its InputStream equivalent:
process(e -> {
    // Print the concrete type
    Object zip = e.getMessage().getBody();
    System.out.println("Type is " + zip.getClass());
    // Convert the body
    InputStream is = e.getMessage().getBody(InputStream.class);
    // Replace the body
    e.getMessage().setBody(is);
})
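For context, a minimal sketch of where such a processor could sit inside a RouteBuilder; the endpoint URIs below are placeholders, not your actual endpoints:
import java.io.InputStream;
import org.apache.camel.builder.RouteBuilder;

public class ZipRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("direct:zipIn")                        // placeholder input endpoint
            .process(e -> {
                // Ask Camel's type converters for an InputStream view of the body
                // (works for byte[], File, etc.) and make it the new body.
                InputStream is = e.getMessage().getBody(InputStream.class);
                e.getMessage().setBody(is);
            })
            .to("direct:nextCall");                 // placeholder next step
    }
}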

Can Kafka Streams consume messages in one format and produce another format, such as Avro messages?

I am using Kafka Streams to consume a JSON string from one topic, process it, and generate a response to be stored in another topic. However, the message produced to the response topic needs to be in Avro format.
I have tried using a String serde for the key and SpecificAvroSerde for the value.
Following is my code to create the topology:
StreamsBuilder builder = new StreamsBuilder();
KStream<Object, Object> consumerStream =builder.stream(kafkaConfiguration.getConsumerTopic());
consumerStream = consumerStream.map(getKeyValueMapper(keyValueMapperClassName));
consumerStream.to(kafkaConfiguration.getProducerTopic());
Following is my config
if (schemaRegistry != null && schemaRegistry.length > 0) {
streamsConfig.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, String.join(",", schemaRegistry));
}
streamsConfig.put(this.keySerializerKeyName, StringSerde.class);
streamsConfig.put(this.valueSerialzerKeyName, SpecificAvroSerde.class);
streamsConfig.put(StreamsConfig.APPLICATION_ID_CONFIG, applicationId);
streamsConfig.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, autoOffsetReset);
streamsConfig.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, batchSize);
streamsConfig.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, FailOnInvalidTimestamp.class);
streamsConfig.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, processingGuarantee);
streamsConfig.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, Integer.parseInt(commitIntervalMs));
streamsConfig.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, numberOfThreads);
streamsConfig.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, replicationFactor);
streamsConfig.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, DeserializationExceptionHandler.class);
streamsConfig.put(StreamsConfig.DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG, ProductionExceptionHandler.class);
streamsConfig.put(StreamsConfig.TOPOLOGY_OPTIMIZATION,StreamsConfig.OPTIMIZE);
streamsConfig.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, compressionMode);
streamsConfig.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, maxPollRecords);
I am seeing the following error when I try with the example:
org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
The problem is with the key/value serdes. You should use the correct serdes while consuming the stream, and likewise while publishing it.
If your input is JSON and you want to publish it as Avro, you can do it as follows:
Properties streamsConfig = new Properties();
streamsConfig.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
streamsConfig.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, SpecificAvroSerde.class);

StreamsBuilder builder = new StreamsBuilder();
// Consume the input topic as plain JSON strings
KStream<String, String> consumerStream = builder.stream(kafkaConfiguration.getConsumerTopic(), Consumed.with(Serdes.String(), Serdes.String()));
// Replace AvroObjectClass with your Avro object type
KStream<String, AvroObjectClass> consumerAvroStream = consumerStream.map(getKeyValueMapper(keyValueMapperClassName));
consumerAvroStream.to(kafkaConfiguration.getProducerTopic());
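If you prefer not to rely on the default serdes, you can also pass the Avro serde explicitly when writing to the output topic. A minimal sketch, assuming Confluent's SpecificAvroSerde and the hypothetical AvroObjectClass from the snippet above (the schema registry URL is a placeholder):
// Needs: java.util.Collections, java.util.Map, org.apache.kafka.common.serialization.Serdes,
// org.apache.kafka.streams.kstream.Produced, io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde,
// io.confluent.kafka.serializers.KafkaAvroSerializerConfig
Map<String, String> serdeConfig = Collections.singletonMap(
        KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081"); // placeholder URL
SpecificAvroSerde<AvroObjectClass> avroValueSerde = new SpecificAvroSerde<>();
avroValueSerde.configure(serdeConfig, false); // false = configure as a value serde, not a key serde

consumerAvroStream.to(kafkaConfiguration.getProducerTopic(),
        Produced.with(Serdes.String(), avroValueSerde));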

HBase issue | Google protobuf tag mismatch error while deserialising SCAN string

Context: I am in the process of migrating my MR jobs on HBase from CDH 2.0.0-cdh4.5.0 (Hadoop 1) to HDP 2.2.0.0-2041 (YARN). After minor changes, the code was compiled against HDP 2.2.0.0-2041.
Problem: I am trying to run an Oozie workflow that executes a series of MR jobs after creating a scan on HBase. The scan is created programmatically and then serialised and deserialised before being handed to the mapper to fetch batches from HBase.
Issue: When TableInputFormat internally tries to deserialise the scan string, it throws an error indicating that, under the hood, Google protobuf was not able to deserialise the string. The stack trace looks as follows:
Exception in thread "main" java.io.IOException: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.
at com.flipkart.yarn.test.TestScanSerialiseDeserialise.convertStringToScan(TestScanSerialiseDeserialise.java:37)
at com.flipkart.yarn.test.TestScanSerialiseDeserialise.main(TestScanSerialiseDeserialise.java:25)
Caused by: ......
Reproducible: I am able to reproduce this in the sample code below.
Sample code:
Scan scan1 = constructScanObjectForUsers("A");
String json = scan1.toJSON();
Scan scan2 = convertStringToScan(Base64.encodeBytes(json.getBytes()));
.......
private static Scan convertStringToScan(String base64) throws IOException {
    byte[] decoded = Base64.decode(base64);
    // System.out.println(new String(decoded));
    ClientProtos.Scan scan;
    try {
        scan = ClientProtos.Scan.parseFrom(decoded);
    } catch (InvalidProtocolBufferException ipbe) {
        throw new IOException(ipbe);
    }
    return ProtobufUtil.toScan(scan);
}
Possible causes: I suspect that I missed supplying some dependency, or that there is a dependency mismatch in the underlying jars.
I would appreciate any help in solving this.
Scan scan1 = constructScanObjectForUsers("A");
String json = scan1.toJSON();
Scan scan2 = convertStringToScan(Base64.encodeBytes(json.getBytes()));
Here you appear to be encoding the message as JSON and then applying base64 to the JSON text. Base64 is usually applied to binary data, but JSON is text.
byte[] decoded = Base64.decode(base64);
// System.out.println(new String(decoded));
ClientProtos.Scan scan;
try {
scan = ClientProtos.Scan.parseFrom(decoded);
Here you are un-base64'ing some text and then decoding it as a protobuf. Is this the same data from above? Because if so, this won't work: JSON and Protobuf are different formats. If you want to decode as Protobuf, you need to encode as Protobuf, not JSON.
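For completeness, a minimal sketch of the symmetric round trip, using the ProtobufUtil and Base64 helpers that already appear in the question instead of toJSON():
// Serialize the Scan as protobuf bytes and base64 them, so that
// ClientProtos.Scan.parseFrom() on the other side can decode them again.
String serialized = Base64.encodeBytes(ProtobufUtil.toScan(scan1).toByteArray());
Scan scan2 = convertStringToScan(serialized);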

Conversion exceptions while using docx4j (From Docx to PDF)

I would like to know why this code:
String inputfilepath = "D:\\DFADFADSF";
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath + ".docx"));
MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();
wordMLPackage.setFontMapper(new IdentityPlusMapper());
FOSettings foSettings = Docx4J.createFOSettings();
foSettings.setWmlPackage(wordMLPackage);
String outputfilepath = "D:\\OUT_FontContent.pdf";
OutputStream os = new java.io.FileOutputStream(outputfilepath);
Docx4J.toPDF(wordMLPackage,os);
Throws this exception:
org.docx4j.openpackaging.exceptions.Docx4JException: Exception exporting package
org.docx4j.openpackaging.exceptions.Docx4JException: Exception executing transformer: org.apache.fop.fo.ValidationException: "fo:flow" is missing child elements. Required content model: marker* (%block;)+
Although there are similar posts, I haven't seen one about this exception...
Maybe I should add additional code to configure the conversion...

Receiving an Ajax Request in a Servlet using ObjectInputStream?

I am sending my Ajax Request in the following format
xmlhttp.open("POST","http://172.16.xx.xx:8080/ajax/validate",true);
xmlhttp.setRequestHeader("Content-type","application/x-www-form-urlencoded");
xmlhttp.send(send); //where send is a string retrieved from textarea
This is my Servlet code
ObjectInputStream in =new ObjectInputStream(request.getInputStream());
String inputxmlstring=(String) in.readObject();
I am getting the following exception
java.io.StreamCorruptedException: invalid stream header: 3C3F786D
What is the problem with the code? Is there anything wrong with my request header content type?
EDIT 1
BufferedInputStream in = new BufferedInputStream(req.getInputStream());
byte[] buf = new byte[req.getContentLength()];
while (in.available() > 0) {
    in.read(buf);
}
String inputxmlstring = new String(buf);
System.out.println(inputxmlstring);
If I use this code in the servlet, I get the following error:
14:13:27,828 INFO [STDOUT] [Fatal Error] :1:1: Content is not allowed in prolog.
14:13:27,843 INFO [STDOUT] org.xml.sax.SAXParseException: Content is not allowed in prolog.
EDIT 2
I use this code to parse the XML. The String inputxmlstring is the one from EDIT 1.
DocumentBuilderFactory fty1 = DocumentBuilderFactory.newInstance();
fty1.setNamespaceAware(true);
DocumentBuilder builder1 = fty1.newDocumentBuilder();
ByteArrayInputStream bais1 = new ByteArrayInputStream(inputxmlstring.getBytes());
Document xmldoc1=builder1.parse(bais1);
You should use ObjectInputStream only if you know the other end wrote the data using ObjectOutputStream.
When the client uses ObjectOutputStream, it writes special header bytes indicating that it is an object stream. If these bytes are not present, ObjectInputStream throws StreamCorruptedException (the 3C3F786D in your error is just the ASCII for "<?xm", the start of your XML).
In your case you should read from request.getInputStream() (or request.getReader()) directly, because the XMLHttpRequest is not sending data written by an ObjectOutputStream.
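For example, a minimal sketch that reads the POST body as plain text using only the servlet API (the variable name mirrors the one in the question):
// Read the raw POST body as text; no ObjectInputStream involved.
StringBuilder sb = new StringBuilder();
try (BufferedReader reader = request.getReader()) {
    String line;
    while ((line = reader.readLine()) != null) {
        sb.append(line);
    }
}
String inputxmlstring = sb.toString();
// inputxmlstring can now be handed to the DocumentBuilder as in EDIT 2.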
