One field in Protocol Buffers is always missing when reading from SequenceFile - hadoop

Something mysterious is happening here:
What I wanted to do:
1. Save a Protocol Buffers object in SequenceFile format.
2. Read this SequenceFile back and extract the field that I need.
The mystery:
One field that I want to retrieve is always null.
product_perf is the field I want to extract from the SequenceFiles, and it is always missing.
Here's my protocol buffers schema:
message ProductJoin {
Signals signals = 1;
int64 id = 2;
}
message Signals {
ProductPerf product_perf = 1;
}
message ProductPerf {
int64 impressions = 1;
}
Here's how I save the protocol buffers as SequenceFiles:
JavaPairRDD<BytesWritable, BytesWritable> bytesWritableJavaPairRdd =
flattenedPjPairRdd.mapToPair(
new PairFunction<Tuple2<Long, ProductJoin>, BytesWritable, BytesWritable>() {
@Override
public Tuple2<BytesWritable, BytesWritable> call(Tuple2<Long, ProductJoin> longProductJoinTuple2) throws Exception {
return new Tuple2<>(
new BytesWritable(longProductJoinTuple2._2().getId().getBytes()),
new BytesWritable(longProductJoinTuple2._2().toByteArray()));
}
});
//dump SequenceFiles
bytesWritableJavaPairRdd.saveAsHadoopFile(
"/tmp/path/",
BytesWritable.class,
BytesWritable.class,
SequenceFileOutputFormat.class
);
Below is the code showing how I read the SequenceFile:
sparkSession.sparkContext()
.sequenceFile("tmp/path", BytesWritable.class, BytesWritable.class)
.toJavaRDD()
.mapToPair(
bytesWritableBytesWritableTuple2 -> {
Method parserMethod = clazz.getDeclaredMethod("parser");
Parser<T> parser = (Parser<T>) parserMethod.invoke(null);
return new Tuple2<>(
Text.decode(bytesWritableBytesWritableTuple2._1().getBytes()),
parser.parseFrom(bytesWritableBytesWritableTuple2._2().getBytes()));
}
);
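One thing worth ruling out here (an assumption on my part, not something confirmed in the question): Hadoop's BytesWritable.getBytes() returns the whole backing buffer, which can be longer than the record and may contain stale or zero padding, so handing it straight to the protobuf parser can corrupt the decoded message. A minimal sketch that parses only the valid range:

// Sketch only: parse just the getLength() bytes of the value, not the padded buffer.
BytesWritable value = bytesWritableBytesWritableTuple2._2();
byte[] exact = value.copyBytes();   // copies exactly getLength() bytes
// equivalent to: Arrays.copyOfRange(value.getBytes(), 0, value.getLength())
ProductJoin pj = ProductJoin.parseFrom(exact);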

Related

Read packed decimal and convert to numeric in spring boot

All,
I am using a Spring Boot application to store data in a DB. I am getting this data from IBM MQ through a Kafka topic.
I am getting messages in EBCDIC format, so I used a COBOL copybook and the JRecord and cb2xml jars to convert them to a readable format and store them in the DB.
Now I am getting another file in the same manner, but after conversion the data looks like this:
10020REFUNDONE
10021REFUNDTWO ·" ÷/
10022REFUNDTHREE oú^ "
10023REFUNDFOUR ¨jÄ ò≈
Here is how I am converting from EBCDIC to a readable format:
AbstractLineReader reader = null;
StringBuffer finalBuffer = new StringBuffer();
try {
String copybook = "/ds_header.cbl";
reader = CustomCobolProvider.getInstance().getLineReader(copybook, Convert.FMT_MAINFRAME, new BufferedInputStream(new ByteArrayInputStream(salesData)));
AbstractLine line;
while ((line = reader.read()) != null) {
if (null != line.getFieldValue(REC_TYPE)){
finalBuffer.append(line.getFullLine());
}
}
}
and this is my getLineReader method:
public AbstractLineReader getLineReader(String copybook, int numericType, InputStream fileStream) throws Exception {
String font = "";
if (numericType == 1) {
font = "cp037";
}
InputStream stream = CustomCobolProvider.class.getResourceAsStream(copybook);
if(stream == null ) throw new RuntimeException("Can't Load the Copybook Metadata file from Resource....");
LayoutDetail copyBook = ((ExternalRecord)this.copybookInt.loadCopyBook(stream, copybook, CopybookLoader.SPLIT_REDEFINE, 0, font, CommonBits.getDefaultCobolTextFormat(), Convert.FMT_MAINFRAME, 0, (AbsSSLogger)null).setFileStructure(Constants.IO_FIXED_LENGTH)).asLayoutDetail();
AbstractLineReader ret = LineIOProvider.getInstance().getLineReader(copyBook, (LineProvider)null);
ret.open(fileStream, copyBook);
return ret;
}
I am stuck on the numeric conversion; I found out that the data is coming in packed decimal.
I have no knowledge of COBOL or mainframes. I referred to a few sites and learned how to convert EBCDIC to a readable format. Please help!
The problem is that the getFullLine() method does not do any field translation; you need to access the individual fields. You can use line.getFieldIterator(0) to get a field iterator for the line.
Also, unless you are using an ancient version of JRecord, you are better off using the JRecordInterface1 class.
Something like the following should work:
StringBuffer finalBuffer = new StringBuffer();
try {
ICobolIOBuilder iob = JRecordInterface1.COBOL.newIOBuilder(copybookName)
.setFont("cp037")
.setFileOrganization(Constants.IO_FIXED_LENGTH);
AbstractLineReader reader = iob.newReader(dataFile);
AbstractLine line;
while ((line = reader.read()) != null) {
String sep = "";
for (AbstractFieldValue fv : line.getFieldIterator(0)) {
finalBuffer.append(sep).append(fv);
sep = "\t";
}
finalBuffer.append("\n");
}
reader.close();
} catch (Exception e) {
// whatever ....
}
Other points
With an MQ data source you do not need to create line readers; you can create lines directly from a byte array:
ICobolIOBuilder iob = JRecordInterface1.COBOL.newIOBuilder(copybookName)
.setFont("cp037")
.setFileOrganization(Constants.IO_FIXED_LENGTH);
AbstractLine line = iob.newLine(byteArrayFromMq);
for (AbstractFieldValue fv : line.getFieldIterator(0)) {
// what ever
}
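For the packed-decimal question specifically, JRecord's field values can also be read as numbers rather than text, so a COMP-3 field comes back as a normal numeric value. A rough sketch, assuming the JRecord accessors I remember correctly and using a made-up field name ("TXN-AMOUNT") that is not in the copybook shown above:

AbstractLine line = iob.newLine(byteArrayFromMq);
// "TXN-AMOUNT" is hypothetical; use the real packed-decimal field name from your copybook.
AbstractFieldValue amount = line.getFieldValue("TXN-AMOUNT");
BigDecimal numericValue = amount.asBigDecimal();   // packed decimal decoded to a number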

gRPC slow serialization on large dataset

I know that Google says protobufs don't handle large messages well (i.e. greater than 1 MB), but I'm trying to stream a dataset of tens of megabytes using gRPC, and some people seem to say that's OK, at least with some splitting...
However, when I try to send an array this way (repeated uint32), it takes like 20 seconds on the same local machine.
// proto
service PAS {
// analyze single file
rpc getPhotonRecords (PhotonRecordsRequest) returns (PhotonRecordsReply) {}
}
message PhotonRecordsRequest {
string fileName = 1;
}
message PhotonRecordsReply {
repeated uint32 PhotonRecords = 1;
}
where PhotonRecordsReply needs to be ~10 million uint32 in length...
Does anyone have an idea on how to speed this up? Or what technology would be more appropriate?
So I think I've implemented streaming based on comments and answers given, but it still takes the same amount of time:
// proto
service PAS {
// analyze single file
rpc getPhotonRecords (PhotonRecordsRequest) returns (stream PhotonRecordsReply) {}
}
class PAS_GRPC(pas_pb2_grpc.PASServicer):

    def getPhotonRecords(self, request: pas_pb2.PhotonRecordsRequest, _context):
        raw_data_bytes = flb_tools.read_data_bytes(request.fileName)
        data = flb_tools.reshape_flb_data(raw_data_bytes)
        index = 0
        chunk_size = 1024
        len_data = len(data)
        while index < len_data:
            # last chunk
            if index + chunk_size > len_data:
                yield pas_pb2.PhotonRecordsReply(PhotonRecords=data[index:])
            # all other chunks
            else:
                yield pas_pb2.PhotonRecordsReply(PhotonRecords=data[index:index + chunk_size])
            index += chunk_size
Min repro
Github example
If you change it over to use streams, that should help. It took less than 2 seconds to transfer for me. Note this was without SSL and on localhost. This is code I threw together; I did run it and it worked. I'm not sure what might happen if the file is not a multiple of 4 bytes, for example. Also, the byte order used when reading is Java's default (big-endian).
I made my 10 MB file like this:
dd if=/dev/random of=my_10mb_file bs=1024 count=10240
Here's the service definition. The only thing I added here was the stream keyword on the response.
service PAS {
// analyze single file
rpc getPhotonRecords (PhotonRecordsRequest) returns (stream PhotonRecordsReply) {}
}
Here's the server implementation.
public class PhotonsServerImpl extends PASImplBase {
@Override
public void getPhotonRecords(PhotonRecordsRequest request, StreamObserver<PhotonRecordsReply> responseObserver) {
log.info("inside getPhotonRecords");
// open the file, I suggest using java.nio API for the fastest read times.
Path file = Paths.get(request.getFileName());
try (FileChannel fileChannel = FileChannel.open(file, StandardOpenOption.READ)) {
int blockSize = 1024 * 4;
ByteBuffer byteBuffer = ByteBuffer.allocate(blockSize);
boolean done = false;
while (!done) {
PhotonRecordsReply.Builder response = PhotonRecordsReply.newBuilder();
// read up to 1024 ints (4 KB) from the file.
byteBuffer.clear();
int read = fileChannel.read(byteBuffer);
if (read < blockSize) {
done = true;
}
// write to the response.
byteBuffer.flip();
for (int index = 0; index < read / 4; index++) {
response.addPhotonRecords(byteBuffer.getInt());
}
// send the response
responseObserver.onNext(response.build());
}
} catch (Exception e) {
log.error("", e);
responseObserver.onError(
Status.INTERNAL.withDescription(e.getMessage()).asRuntimeException());
}
responseObserver.onCompleted();
log.info("exit getPhotonRecords");
}
}
The client just logs the size of the array received.
public long getPhotonRecords(ManagedChannel channel) {
if (log.isInfoEnabled())
log.info("Enter - getPhotonRecords ");
PASGrpc.PASBlockingStub photonClient = PASGrpc.newBlockingStub(channel);
PhotonRecordsRequest request = PhotonRecordsRequest.newBuilder().setFileName("/udata/jdrummond/logs/my_10mb_file").build();
photonClient.getPhotonRecords(request).forEachRemaining(photonRecordsReply -> {
log.info("got this many photons: {}", photonRecordsReply.getPhotonRecordsCount());
});
return 0;
}
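If the client needs the actual values rather than just the chunk sizes, the streamed replies can be concatenated back into one list. A minimal sketch along the lines of the logging client above (the method and variable names here are mine, not from the original answer):

public List<Integer> getAllPhotonRecords(ManagedChannel channel, String fileName) {
    PASGrpc.PASBlockingStub stub = PASGrpc.newBlockingStub(channel);
    PhotonRecordsRequest request = PhotonRecordsRequest.newBuilder().setFileName(fileName).build();
    List<Integer> allRecords = new ArrayList<>();
    // Each PhotonRecordsReply carries one chunk; append the chunks in arrival order.
    stub.getPhotonRecords(request).forEachRemaining(
            reply -> allRecords.addAll(reply.getPhotonRecordsList()));
    return allRecords;
}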

Protobuf 3.0 Any Type pack/unpack

I would like to know how to transform a Protobuf Any type back to the original Protobuf message type and vice versa. In Java, going from a Message to an Any is easy:
Any.Builder anyBuilder = Any.newBuilder().mergeFrom(protoMess.build());
But how can I parse that Any back into the original message (e.g. into the type of "protoMess")? I could probably parse everything on a stream just to read it back in, but that's not what I want. I want to have some transformation like this:
ProtoMess.MessData.Builder protoMessBuilder = (ProtoMess.MessData.Builder) transformToMessageBuilder(anyBuilder)
How can I achieve that? Is it already implemented for Java? The Protobuf Language Guide says there are pack and unpack methods, but there are none in Java.
Thank you in advance :)
The answer might be a bit late but maybe this still helps someone.
In the current version of Protocol Buffers 3 pack and unpack are available in Java.
In your example packing can be done like:
Any anyMessage = Any.pack(protoMess.build());
And unpacking like:
ProtoMess protoMess = anyMessage.unpack(ProtoMess.class);
Here is also a full example for handling Protocol Buffers messages with nested Any messages:
ProtocolBuffers Files
A simple Protocol Buffers file with a nested Any message could look like:
syntax = "proto3";
import "google/protobuf/any.proto";
message ParentMessage {
string text = 1;
google.protobuf.Any childMessage = 2;
}
A possible nested message could then be:
syntax = "proto3";
message ChildMessage {
string text = 1;
}
Packing
To build the full message the following function can be used:
public ParentMessage createMessage() {
// Create child message
ChildMessage.Builder childMessageBuilder = ChildMessage.newBuilder();
childMessageBuilder.setText("Child Text");
// Create parent message
ParentMessage.Builder parentMessageBuilder = ParentMessage.newBuilder();
parentMessageBuilder.setText("Parent Text");
parentMessageBuilder.setChildMessage(Any.pack(childMessageBuilder.build()));
// Return message
return parentMessageBuilder.build();
}
Unpacking
To read the child message from the parent message the following function can be used:
public ChildMessage readChildMessage(ParentMessage parentMessage) {
try {
return parentMessage.getChildMessage().unpack(ChildMessage.class);
} catch (InvalidProtocolBufferException e) {
e.printStackTrace();
return null;
}
}
EDIT:
If your packed messages can have different types, you can read the typeUrl and use reflection to unpack the message. Assuming you have the child messages ChildMessage1 and ChildMessage2, you can do the following:
@SuppressWarnings("unchecked")
public Message readChildMessage(ParentMessage parentMessage) {
try {
Any childMessage = parentMessage.getChildMessage();
String clazzName = childMessage.getTypeUrl().split("/")[1];
String clazzPackage = String.format("package.%s", clazzName);
Class<Message> clazz = (Class<Message>) Class.forName(clazzPackage);
return childMessage.unpack(clazz);
} catch (ClassNotFoundException | InvalidProtocolBufferException e) {
e.printStackTrace();
return null;
}
}
For further processing, you could determine the type of the message with instanceof, which is not very efficient. If you want to get a message of a certain type, you should compare the typeUrl directly:
public ChildMessage1 readChildMessage(ParentMessage parentMessage) {
try {
Any childMessage = parentMessage.getChildMessage();
String clazzName = childMessage.getTypeUrl().split("/")[1];
if (clazzName.equals("ChildMessage1")) {
return childMessage.unpack(ChildMessage1.class);
}
return null;
} catch (InvalidProtocolBufferException e) {
e.printStackTrace();
return null;
}
}
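A side note that is not part of the original answer: when you only need to check for one known type, the Java Any API also has an is() method, which saves parsing the typeUrl by hand. A small sketch:

Any childMessage = parentMessage.getChildMessage();
if (childMessage.is(ChildMessage1.class)) {
    try {
        ChildMessage1 msg = childMessage.unpack(ChildMessage1.class);
        // ... work with msg
    } catch (InvalidProtocolBufferException e) {
        e.printStackTrace();
    }
}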
Just to add information in case someone has the same problem: currently, to unpack in C# (.NET Core 3.1, Google.Protobuf 3.11.4) you have to do:
Foo myobject = anyMessage.Unpack<Foo>();
I know this question is very old, but it still came up when I was looking for the answer. Using @sundance's answer, I had to do this a little differently. The problem is that the actual message class is a nested class of the generated outer class, so the name requires a $.
for(Any x : in.getDetailsList()){
try{
String clazzName = x.getTypeUrl().split("/")[1];
String[] split_name = clazzName.split("\\.");
String nameClass = String.join(".", Arrays.copyOfRange(split_name, 0, split_name.length - 1)) + "$" + split_name[split_name.length-1];
Class<Message> clazz = (Class<Message>) Class.forName(nameClass);
System.out.println(x.unpack(clazz));
} catch (Exception e){
e.printStackTrace();
}
}
Here are the definitions of my proto messages:
syntax = "proto3";
package cb_grpc.msg.Main;
service QueryService {
rpc anyService (AnyID) returns (QueryResponse) {}
}
enum Buckets {
main = 0;
txn = 1;
hxn = 2;
}
message QueryResponse{
string content = 1;
string code = 2;
}
message AnyID {
Buckets bucket = 1;
string docID = 2;
repeated google.protobuf.Any details = 3;
}
and
syntax = "proto3";
package org.querc.cb_grpc.msg.database;
option java_package = "org.querc.cb_grpc.msg";
option java_outer_classname = "database";
message TxnLog {
string doc_id = 1;
repeated string changes = 2;
}
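To make the nested-class trick concrete, here is how the names line up for these definitions (a worked example of the snippet above, not additional API):

// Any type URL (default prefix) built from the proto package + message name:
//   type.googleapis.com/org.querc.cb_grpc.msg.database.TxnLog
// clazzName after getTypeUrl().split("/")[1]:
//   org.querc.cb_grpc.msg.database.TxnLog
// nameClass after re-joining with "$" before the last segment
// (java_package "org.querc.cb_grpc.msg" + outer class "database" + nested message "TxnLog"):
//   org.querc.cb_grpc.msg.database$TxnLog
Class<Message> clazz = (Class<Message>) Class.forName("org.querc.cb_grpc.msg.database$TxnLog");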

hbase InternalScanner and filter in coprocessor

All:
Recently, I wrote a coprocessor in HBase (0.94.17): a class extending BaseEndpointCoprocessor with a rowCount method that counts one table's rows.
And I ran into a problem.
If I do not set a filter on the scan, my code works fine for two tables. One table has 1,000,000 rows, the other has 160,000,000 rows; it took about 2 minutes to count the bigger table.
However, if I set a filter on the scan, it only works on the small table. It throws an exception on the bigger table:
org.apache.hadoop.hbase.ipc.ExecRPCInvoker$1#2c88652b, java.io.IOException: java.io.IOException: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
Trust me, I checked my code over and over again.
So, to count my table with a filter, I had to write the following clumsy code: first, I do not set a filter on the scan, and then, after I get one row record, I call a method to filter it.
And it works on both tables.
But I do not know why.
I tried to read the scanner source code in HRegion.java; however, I did not get it.
So, if you know the answer, please help me. Thank you.
@Override
public long rowCount(Configuration conf) throws IOException {
// TODO Auto-generated method stub
Scan scan = new Scan();
parseConfiguration(conf);
Filter filter = null;
if (this.mFilterString != null && !mFilterString.equals("")) {
ParseFilter parse = new ParseFilter();
filter = parse.parseFilterString(mFilterString);
// scan.setFilter(filter);
}
scan.setCaching(this.mScanCaching);
InternalScanner scanner = ((RegionCoprocessorEnvironment) getEnvironment()).getRegion().getScanner(scan);
long sum = 0;
try {
List<KeyValue> curVals = new ArrayList<KeyValue>();
boolean hasMore = false;
do {
curVals.clear();
hasMore = scanner.next(curVals);
if (filter != null) {
filter.reset();
if (HbaseUtil.filterOneResult(curVals, filter)) {
continue;
}
}
sum++;
} while (hasMore);
} finally {
scanner.close();
}
return sum;
}
The following is my HBase util code:
public static boolean filterOneResult(List<KeyValue> kvList, Filter filter) {
if (kvList.size() == 0)
return true;
KeyValue kv = kvList.get(0);
if (filter.filterRowKey(kv.getBuffer(), kv.getRowOffset(), kv.getRowLength())) {
return true;
}
for (KeyValue kv2 : kvList) {
if (filter.filterKeyValue(kv2) == Filter.ReturnCode.NEXT_ROW) {
return true;
}
}
filter.filterRow(kvList);
if (filter.filterRow())
return true;
else
return false;
}
OK, it was my mistake. After I used jdb to debug my code, I got the following exception:
"org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:635)
at java.util.ArrayList.get(ArrayList.java:411)
It is obvious: my result list is empty.
hasMore = scanner.next(curVals);
It means that if I use a Filter in the scan, this curVals list might be empty, but hasMore is still true.
I had thought that if a record was filtered out, the scanner would jump to the next row and this list would never be empty. I was wrong.
And my client did not print any remote error message to my console; it just caught this remote exception and retried.
After retrying 10 times, it printed another exception, which was meaningless.
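Based on that finding, the original loop can keep the filter on the scan and simply skip the empty batches. This is just a sketch restating the conclusion above in code, not something tested against 0.94:

scan.setFilter(filter);
InternalScanner scanner = ((RegionCoprocessorEnvironment) getEnvironment()).getRegion().getScanner(scan);
long sum = 0;
try {
    List<KeyValue> curVals = new ArrayList<KeyValue>();
    boolean hasMore = false;
    do {
        curVals.clear();
        hasMore = scanner.next(curVals);
        // With a filter set, next() can return an empty batch while hasMore is still true.
        if (!curVals.isEmpty()) {
            sum++;
        }
    } while (hasMore);
} finally {
    scanner.close();
}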

How to decode protobuf binary response

I have created a test app that can recognize some images using Google Goggles. It works for me, but I receive a binary protobuf response. I have no proto files, just the binary response. How can I get data from it? (I sent an image with a bottle of beer and got the following response):
A
TuborgLogo9 HoaniText���;�)b���2d8e991bff16229f6"�
+TR=T=AQBd6Cl4Kd8:X=OqSEi:S=_rSozFBgfKt5d9b0
+TR=T=6rLQxKE2xdA:X=OqSEi:S=gd6Aqb28X0ltBU9V
+TR=T=uGPf9zJDWe0:X=OqSEi:S=32zTfdIOdI6kuUTa
+TR=T=RLkVoGVd92I:X=OqSEi:S=P7yOhvSAOQW6SRHN
+TR=T=J1FMvNmcyMk:X=OqSEi:S=5Z631_rd2ijo_iuf�
I need to get the string "Tuborg" and, if possible, the type "Logo".
You can decode with protoc:
protoc --decode_raw < msg.bin
UnknownFieldSet.parseFrom(msg).toString()
This will show you the top-level fields. Unfortunately it can't know the exact field types: long/int/bool/enum etc. are all encoded as varints and look the same, and strings, byte arrays and sub-messages are all length-delimited and likewise indistinguishable.
Some useful details here: https://github.com/dcodeIO/protobuf.js/wiki/How-to-reverse-engineer-a-buffer-by-hand
If you follow the code in UnknownFieldSet.mergeFrom(), you'll see how you could try to decode sub-messages and fall back to strings when that fails, but it's not going to be very reliable.
There are two spare values for the wire type in the protocol; it would have been really helpful if Google had used one of these to denote sub-messages (and the other for null values, perhaps).
Here's some very crude, rushed code which attempts to produce something useful for diagnostics. It guesses at the data types, and in the case of strings and sub-messages it will print both alternatives in some cases. Please don't trust any values it prints:
public static String decodeProto(byte[] data, boolean singleLine) throws IOException {
return decodeProto(ByteString.copyFrom(data), 0, singleLine);
}
public static String decodeProto(ByteString data, int depth, boolean singleLine) throws IOException {
final CodedInputStream input = CodedInputStream.newInstance(data.asReadOnlyByteBuffer());
return decodeProtoInput(input, depth, singleLine);
}
private static String decodeProtoInput(CodedInputStream input, int depth, boolean singleLine) throws IOException {
StringBuilder s = new StringBuilder("{ ");
boolean foundFields = false;
while (true) {
final int tag = input.readTag();
int type = WireFormat.getTagWireType(tag);
if (tag == 0 || type == WireFormat.WIRETYPE_END_GROUP) {
break;
}
foundFields = true;
protoNewline(depth, s, singleLine);
final int number = WireFormat.getTagFieldNumber(tag);
s.append(number).append(": ");
switch (type) {
case WireFormat.WIRETYPE_VARINT:
s.append(input.readInt64());
break;
case WireFormat.WIRETYPE_FIXED64:
s.append(Double.longBitsToDouble(input.readFixed64()));
break;
case WireFormat.WIRETYPE_LENGTH_DELIMITED:
ByteString data = input.readBytes();
try {
String submessage = decodeProto(data, depth + 1, singleLine);
if (data.size() < 30) {
boolean probablyString = true;
String str = new String(data.toByteArray(), Charsets.UTF_8);
for (char c : str.toCharArray()) {
if (c < '\n') {
probablyString = false;
break;
}
}
if (probablyString) {
s.append("\"").append(str).append("\" ");
}
}
s.append(submessage);
} catch (IOException e) {
s.append('"').append(new String(data.toByteArray())).append('"');
}
break;
case WireFormat.WIRETYPE_START_GROUP:
s.append(decodeProtoInput(input, depth + 1, singleLine));
break;
case WireFormat.WIRETYPE_FIXED32:
s.append(Float.intBitsToFloat(input.readFixed32()));
break;
default:
throw new InvalidProtocolBufferException("Invalid wire type");
}
}
if (foundFields) {
protoNewline(depth - 1, s, singleLine);
}
return s.append('}').toString();
}
private static final String INDENT = "    ";
private static void protoNewline(int depth, StringBuilder s, boolean noNewline) {
if (noNewline) {
s.append(" ");
return;
}
s.append('\n');
for (int i = 0; i <= depth; i++) {
s.append(INDENT);
}
}
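Called on raw bytes, the helper above can be used like this (just a usage sketch; the output is the ad-hoc diagnostic format built above, not canonical protobuf text format):

byte[] raw = Files.readAllBytes(Paths.get("msg.bin"));
System.out.println(decodeProto(raw, false));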
I'm going to assume the real question is how to decode protobufs and not how to read binary from the wire using Java.
The answer to your question can be found here
Briefly, on the wire, protobufs are encoded as 3-tuples of <key,type,value>, where:
the key is the field number assigned to the field in the .proto schema
the type is one of: varint, 64-bit, length-delimited, start-group, end-group, or 32-bit. It contains just enough information to decode the value of the 3-tuple; in particular, it tells you how long the value is.
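As a concrete worked example (the standard one from the protobuf encoding docs), the three bytes 08 96 01 decode to field number 1, wire type 0 (varint), value 150:

0x08      -> tag: field_number = 0x08 >> 3 = 1, wire_type = 0x08 & 0x7 = 0 (varint)
0x96 0x01 -> varint payload: 0x96 has its continuation bit set, low 7 bits = 0010110
             0x01 is the last byte, low 7 bits = 0000001
             value = (0000001 << 7) | 0010110 = 10010110 binary = 150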
