Writing a protobuf object in Parquet using Apache Beam

I fetch protobuf data from Google Pub/Sub and deserialize it into a Message object, so I get a PCollection<Message>. Here is sample code:
public class ProcessPubsubMessage extends DoFn<PubsubMessage, Message> {
    @ProcessElement
    public void processElement(@Element PubsubMessage element, OutputReceiver<Message> receiver) {
        byte[] payload = element.getPayload();
        try {
            Message message = Message.parseFrom(payload);
            receiver.output(message);
        } catch (InvalidProtocolBufferException e) {
            LOG.error("Got exception while parsing message from pubsub. Exception => " + e.getMessage());
        }
    }
}
PCollection<Message> event = psMessage.apply("Parsing data from pubsub message",
        ParDo.of(new ProcessPubsubMessage()));
I want to apply a transformation on PCollection<Message> event to write it in Parquet format. I know Apache Beam provides ParquetIO, but it works on PCollection<GenericRecord>, so converting Message to GenericRecord may solve the problem (yet I don't know how to do that). Is there an easy way to write in Parquet format?

It can be solved by using the following library:
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-protobuf</artifactId>
    <version>1.7.7</version>
</dependency>
private GenericRecord getGenericRecord(Event event) throws IOException {
    // Serialize the protobuf object with Avro's protobuf support...
    ProtobufDatumWriter<Event> datumWriter = new ProtobufDatumWriter<Event>(Event.class);
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    Encoder e = EncoderFactory.get().binaryEncoder(os, null);
    datumWriter.write(event, e);
    e.flush();
    // ...then read the bytes back as a GenericRecord using the schema derived from the protobuf class.
    ProtobufDatumReader<Event> datumReader = new ProtobufDatumReader<Event>(Event.class);
    GenericDatumReader<GenericRecord> genericDatumReader = new GenericDatumReader<GenericRecord>(datumReader.getSchema());
    GenericRecord record = genericDatumReader.read(null,
            DecoderFactory.get().binaryDecoder(new ByteArrayInputStream(os.toByteArray()), null));
    return record;
}
For details: https://gist.github.com/alexvictoor/1d3937f502c60318071f
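Building on that helper, here is a minimal sketch of how the rest of the pipeline could look. It is an assumption-laden illustration, not code from the original post: it assumes the helper above is adapted to the question's Message class (the answer uses Event) and made a static method; the transform names and output path are placeholders.
// Sketch: derive the Avro schema from the protobuf class, convert, then write Parquet.
Schema avroSchema = new ProtobufDatumReader<Message>(Message.class).getSchema();

PCollection<GenericRecord> records = event
        .apply("Message to GenericRecord", ParDo.of(new DoFn<Message, GenericRecord>() {
            @ProcessElement
            public void processElement(@Element Message msg, OutputReceiver<GenericRecord> out) throws IOException {
                out.output(getGenericRecord(msg)); // static helper from the answer above
            }
        }))
        .setCoder(AvroCoder.of(GenericRecord.class, avroSchema));

// For a streaming Pub/Sub source, apply a Window transform (and withNumShards) before writing files.
records.apply("Write Parquet", FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(avroSchema))
        .to("/path/to/output")    // placeholder output location
        .withSuffix(".parquet"));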

Related

Nifi Custom Processor errors with a "ControllerService found was not a WebSocket ControllerService but a com.sun.proxy.$Proxy75"

*** Update: I have changed my approach, as described in my answer to this question, which makes the originally reported issue moot. ***
I'm trying to develop a NiFi application that provides a WebSocket interface to Kafka. I could not accomplish this using the standard NiFi components as I have tried below (it may not make sense, but intuitively this is what I want to accomplish):
I have now created a custom processor, "ReadFromKafka", that I intend to use as shown in the image below. "ReadFromKafka" would use the same implementation as the standard "PutWebSocket" component but would read messages from a Kafka topic and send them as a response to the WebSocket client.
I have provided a code snippet of the implementation below:
@SystemResourceConsideration(resource = SystemResource.MEMORY)
public class ReadFromKafka extends AbstractProcessor {
public static final PropertyDescriptor PROP_WS_SESSION_ID = new PropertyDescriptor.Builder()
.name("websocket-session-id")
.displayName("WebSocket Session Id")
.description("A NiFi Expression to retrieve the session id. If not specified, a message will be " +
"sent to all connected WebSocket peers for the WebSocket controller service endpoint.")
.required(true)
.addValidator(StandardValidators.NON_BLANK_VALIDATOR)
.expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
.defaultValue("${" + ATTR_WS_SESSION_ID + "}")
.build();
public static final PropertyDescriptor PROP_WS_CONTROLLER_SERVICE_ID = new PropertyDescriptor.Builder()
.name("websocket-controller-service-id")
.displayName("WebSocket ControllerService Id")
.description("A NiFi Expression to retrieve the id of a WebSocket ControllerService.")
.required(true)
.addValidator(StandardValidators.NON_BLANK_VALIDATOR)
.expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
.defaultValue("${" + ATTR_WS_CS_ID + "}")
.build();
public static final PropertyDescriptor PROP_WS_CONTROLLER_SERVICE_ENDPOINT = new PropertyDescriptor.Builder()
.name("websocket-endpoint-id")
.displayName("WebSocket Endpoint Id")
.description("A NiFi Expression to retrieve the endpoint id of a WebSocket ControllerService.")
.required(true)
.addValidator(StandardValidators.NON_BLANK_VALIDATOR)
.expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
.defaultValue("${" + ATTR_WS_ENDPOINT_ID + "}")
.build();
public static final PropertyDescriptor PROP_WS_MESSAGE_TYPE = new PropertyDescriptor.Builder()
.name("websocket-message-type")
.displayName("WebSocket Message Type")
.description("The type of message content: TEXT or BINARY")
.required(true)
.addValidator(StandardValidators.NON_BLANK_VALIDATOR)
.defaultValue(WebSocketMessage.Type.TEXT.toString())
.expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
.build();
public static final Relationship REL_SUCCESS = new Relationship.Builder()
.name("success")
.description("FlowFiles that are sent successfully to the destination are transferred to this relationship.")
.build();
public static final Relationship REL_FAILURE = new Relationship.Builder()
.name("failure")
.description("FlowFiles that failed to send to the destination are transferred to this relationship.")
.build();
private static final List<PropertyDescriptor> descriptors;
private static final Set<Relationship> relationships;
static{
final List<PropertyDescriptor> innerDescriptorsList = new ArrayList<>();
innerDescriptorsList.add(PROP_WS_SESSION_ID);
innerDescriptorsList.add(PROP_WS_CONTROLLER_SERVICE_ID);
innerDescriptorsList.add(PROP_WS_CONTROLLER_SERVICE_ENDPOINT);
innerDescriptorsList.add(PROP_WS_MESSAGE_TYPE);
descriptors = Collections.unmodifiableList(innerDescriptorsList);
final Set<Relationship> innerRelationshipsSet = new HashSet<>();
innerRelationshipsSet.add(REL_SUCCESS);
innerRelationshipsSet.add(REL_FAILURE);
relationships = Collections.unmodifiableSet(innerRelationshipsSet);
}
@Override
public Set<Relationship> getRelationships() {
return relationships;
}
@Override
public final List<PropertyDescriptor> getSupportedPropertyDescriptors() {
return descriptors;
}
@Override
public void onTrigger(final ProcessContext context, final ProcessSession processSession) throws ProcessException {
final FlowFile flowfile = processSession.get();
if (flowfile == null) {
return;
}
final String sessionId = context.getProperty(PROP_WS_SESSION_ID)
.evaluateAttributeExpressions(flowfile).getValue();
final String webSocketServiceId = context.getProperty(PROP_WS_CONTROLLER_SERVICE_ID)
.evaluateAttributeExpressions(flowfile).getValue();
final String webSocketServiceEndpoint = context.getProperty(PROP_WS_CONTROLLER_SERVICE_ENDPOINT)
.evaluateAttributeExpressions(flowfile).getValue();
final String messageTypeStr = context.getProperty(PROP_WS_MESSAGE_TYPE)
.evaluateAttributeExpressions(flowfile).getValue();
final WebSocketMessage.Type messageType = WebSocketMessage.Type.valueOf(messageTypeStr);
if (StringUtils.isEmpty(sessionId)) {
getLogger().debug("Specific SessionID not specified. Message will be broadcast to all connected clients.");
}
if (StringUtils.isEmpty(webSocketServiceId)
|| StringUtils.isEmpty(webSocketServiceEndpoint)) {
transferToFailure(processSession, flowfile, "Required WebSocket attribute was not found.");
return;
}
final ControllerService controllerService = context.getControllerServiceLookup().getControllerService(webSocketServiceId);
if (controllerService == null) {
getLogger().debug("ControllerService is NULL");
transferToFailure(processSession, flowfile, "WebSocket ControllerService was not found.");
return;
} else if (!(controllerService instanceof WebSocketService)) {
getLogger().debug("ControllerService is not instance of WebSocketService");
transferToFailure(processSession, flowfile, "The ControllerService found was not a WebSocket ControllerService but a "
+ controllerService.getClass().getName());
return;
}
...
processSession.getProvenanceReporter().send(updatedFlowFile, transitUri.get(), transmissionMillis);
processSession.transfer(updatedFlowFile, REL_SUCCESS);
processSession.commit();
} catch (WebSocketConfigurationException|IllegalStateException|IOException e) {
// WebSocketConfigurationException: If the corresponding WebSocketGatewayProcessor has been stopped.
// IllegalStateException: Session is already closed or not found.
// IOException: other IO error.
getLogger().error("Failed to send message via WebSocket due to " + e, e);
transferToFailure(processSession, flowfile, e.toString());
}
}
private FlowFile transferToFailure(final ProcessSession processSession, FlowFile flowfile, final String value) {
flowfile = processSession.putAttribute(flowfile, ATTR_WS_FAILURE_DETAIL, value);
processSession.transfer(flowfile, REL_FAILURE);
return flowfile;
}
}
I have deployed the custom processor and when I connect to it using the Chrome "Simple Web Socket Client" I can see the following message in the logs:
ControllerService found was not a WebSocket ControllerService but a com.sun.proxy.$Proxy75
I'm using the exact same code as in PutWebSocket and can't figure out why it behaves differently in my custom processor. I have configured "JettyWebSocketServer" as the ControllerService under "ListenWebSocket" as shown in the image below.
Additional exception details seen in the log are provided below:
java.lang.ClassCastException: class com.sun.proxy.$Proxy75 cannot be cast to class org.apache.nifi.websocket.WebSocketService (com.sun.proxy.$Proxy75 is in unnamed module of loader org.apache.nifi.nar.InstanceClassLoader @35c646b5; org.apache.nifi.websocket.WebSocketService is in unnamed module of loader org.apache.nifi.nar.NarClassLoader @361abd01)
I ended up modifying my flow to use the out-of-the-box ListenWebSocket and PutWebSocket processors plus a custom "FetchFromKafka" processor that is a modified version of ConsumeKafkaRecord. With this I'm able to provide a WebSocket interface to Kafka. I have provided a screenshot of the updated flow below. More work needs to be done on the custom processor to support multiple sessions.
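For reference, this kind of ClassCastException against a com.sun.proxy instance usually means the WebSocketService interface is being loaded by two different class loaders, as the log above shows (InstanceClassLoader vs. NarClassLoader); that typically happens when the custom processor's NAR bundles its own copy of the websocket services API instead of depending on the NiFi-provided API NAR. Below is a minimal sketch of the idiomatic way to reference the service through a property, assuming the NAR dependency is set up correctly; it is an illustration, not the approach ultimately taken above.
// Sketch: declare the WebSocketService as a ControllerService property instead of looking it up by id.
public static final PropertyDescriptor PROP_WS_SERVICE = new PropertyDescriptor.Builder()
        .name("websocket-controller-service")
        .displayName("WebSocket ControllerService")
        .description("The WebSocket ControllerService used to send messages.")
        .required(true)
        .identifiesControllerService(WebSocketService.class)
        .build();

// In onTrigger(), the framework returns a proxy that implements the shared API interface:
WebSocketService webSocketService = context.getProperty(PROP_WS_SERVICE)
        .asControllerService(WebSocketService.class);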

"For an upload InputStream with no MD5 digest metadata, the markSupported() method must evaluate to true." in Spring Integration AWS

UPDATE: There is a bug in spring-integration-aws-2.3.4.
I am integrating SFTP (SftpStreamingMessageSource) as source with S3 as destination.
I have a Spring Integration configuration similar to this:
@Bean
public S3MessageHandler.UploadMetadataProvider uploadMetadataProvider() {
    return (metadata, message) -> {
        if (message.getPayload() instanceof DigestInputStream) {
            metadata.setContentType(MediaType.APPLICATION_JSON_VALUE);
            // can not read stream to manually compute MD5
            // metadata.setContentMD5("BLABLA==");
            // this is wrong approach: metadata.setContentMD5(BinaryUtils.toBase64((((DigestInputStream) message.getPayload()).getMessageDigest().digest()));
        }
    };
}
@Bean
@InboundChannelAdapter(channel = "ftpStream")
public MessageSource<InputStream> ftpSource(SftpRemoteFileTemplate template) {
    SftpStreamingMessageSource messageSource = new SftpStreamingMessageSource(template);
    messageSource.setRemoteDirectory("foo");
    messageSource.setFilter(new AcceptAllFileListFilter<>());
    messageSource.setMaxFetchSize(1);
    messageSource.setLoggingEnabled(true);
    messageSource.setCountsEnabled(true);
    return messageSource;
}
...
@Bean
@ServiceActivator(inputChannel = "ftpStream")
public MessageHandler s3MessageHandler(AmazonS3 amazonS3, S3MessageHandler.UploadMetadataProvider uploadMetadataProvider) {
    S3MessageHandler messageHandler = new S3MessageHandler(amazonS3, "bucketName");
    messageHandler.setLoggingEnabled(true);
    messageHandler.setCountsEnabled(true);
    messageHandler.setCommand(S3MessageHandler.Command.UPLOAD);
    messageHandler.setUploadMetadataProvider(uploadMetadataProvider);
    messageHandler.setKeyExpression(new ValueExpression<>("key"));
    return messageHandler;
}
After starting, I am getting the following error:
"For an upload InputStream with no MD5 digest metadata, the markSupported() method must evaluate to true."
This is because ftpSource produces an InputStream payload without mark/reset support. I even tried to wrap the InputStream in a BufferedInputStream using a @Transformer, e.g. the following:
return new BufferedInputStream((InputStream) message.getPayload());
and had no success, because then I get "java.io.IOException: Stream closed": S3MessageHandler:338 calls Md5Utils.md5AsBase64(inputStream), which closes the stream too early.
How can I generate MD5 for all messages in Spring Integration AWS without pain?
I am using spring-integration-aws-2.3.4.RELEASE
The S3MessageHandler does this:
if (payload instanceof InputStream) {
    InputStream inputStream = (InputStream) payload;
    if (metadata.getContentMD5() == null) {
        Assert.state(inputStream.markSupported(),
                "For an upload InputStream with no MD5 digest metadata, "
                        + "the markSupported() method must evaluate to true.");
        String contentMd5 = Md5Utils.md5AsBase64(inputStream);
        metadata.setContentMD5(contentMd5);
        inputStream.reset();
    }
    putObjectRequest = new PutObjectRequest(bucketName, key, inputStream, metadata);
}
And that Md5Utils.md5AsBase64() closes the InputStream at the end, which is bad for us.
This is an omission on our side. Please raise a GH issue and we will fix it ASAP, or feel free to provide a contribution.
As a workaround, I would suggest adding a transformer upstream of this S3MessageHandler with code like:
return org.springframework.util.StreamUtils.copyToByteArray(inputStream);
This way you will already have a byte[] payload for the S3MessageHandler, which will use a different branch for processing:
else if (payload instanceof byte[]) {
    byte[] payloadBytes = (byte[]) payload;
    InputStream inputStream = new ByteArrayInputStream(payloadBytes);
    if (metadata.getContentMD5() == null) {
        String contentMd5 = Md5Utils.md5AsBase64(inputStream);
        metadata.setContentMD5(contentMd5);
        inputStream.reset();
    }
    if (metadata.getContentLength() == 0) {
        metadata.setContentLength(payloadBytes.length);
    }
    putObjectRequest = new PutObjectRequest(bucketName, key, inputStream, metadata);
}
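A minimal sketch of wiring that workaround into the configuration from the question could look like the POJO transformer below. The output channel name, and moving the s3MessageHandler's @ServiceActivator onto that channel, are assumptions for illustration.
// Sketch: copy the remote stream into memory so the S3MessageHandler takes the byte[] branch
// and computes the MD5 itself. The s3MessageHandler @ServiceActivator would then listen on
// "s3Channel" instead of "ftpStream". Cleaning up the remote SFTP resource (e.g. via the
// closeable-resource header) is not shown here.
@Transformer(inputChannel = "ftpStream", outputChannel = "s3Channel")
public byte[] toByteArray(InputStream payload) throws IOException {
    return StreamUtils.copyToByteArray(payload);
}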

JMS Message Persistence on ActiveMQ

I need to ensure redelivery of JMS messages when the consumer fails
The way the producer is set up now: DefaultJmsListenerContainerFactory and Session.AUTO_ACKNOWLEDGE.
I'm trying to build a jar that saves the message onto the server; once the app is able to consume again, the producer in the jar will produce the message to the app.
Is that a good approach? Is there any other way or recommendation to improve this?
public void handleMessagePersistence(final Object bean) throws JsonProcessingException {
    ObjectMapper mapper = new ObjectMapper();
    final String beanJson = mapper.writeValueAsString(bean); // I might need to convert to xml instead
    // parameterize location of persistence folder
    writeToDriver(beanJson);
    try {
        Producer.produceMessage(beanJson, beanJson, null, null, null);
    } catch (final Exception e) {
        LOG.error("Error producing message", e);
    }
}
Here is what I have to write out the message:
private void writeToDriver(String beanJson) {
    File filename = new File(JMS_LOCATION +
            LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")) + ".xml");
    try (final FileWriter fileOut = new FileWriter(filename)) {
        try (final BufferedWriter out = new BufferedWriter(fileOut)) {
            out.write(beanJson);
            out.flush();
        }
    } catch (Exception e) {
        LOG.error("Unable to write out : " + beanJson, e);
    }
}
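For reference, broker-driven redelivery is usually enabled on the listener container rather than by persisting copies of messages by hand. Below is a minimal sketch of a transacted DefaultJmsListenerContainerFactory; the bean wiring is an illustrative assumption, not code from the question.
@Bean
public DefaultJmsListenerContainerFactory jmsListenerContainerFactory(ConnectionFactory connectionFactory) {
    DefaultJmsListenerContainerFactory factory = new DefaultJmsListenerContainerFactory();
    factory.setConnectionFactory(connectionFactory);
    // With a transacted session, a listener that throws an exception rolls the receive back,
    // and ActiveMQ redelivers the message according to its redelivery policy.
    factory.setSessionTransacted(true);
    return factory;
}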

Storm-jms spout collecting Avro messages and sending them downstream?

I am new to the Avro format. I am trying to collect Avro messages from a JMS queue using the storm-jms spout and send them to HDFS using the HDFS bolt.
The queue is sending Avro, but I am not able to get the messages in Avro format using the HDFS bolt.
How do I properly collect the Avro messages and send them downstream to HDFS without encoding errors?
The existing HDFS bolt does not support writing Avro files; we need to overcome this by making the following changes. In this sample code I take the JMS messages from my spout, convert those JMS bytes messages to Avro, and emit them to HDFS.
This code can serve as a sample for modifying the methods in AbstractHdfsBolt.
public void execute(Tuple tuple) {
try {
long length = bytesMessage.getBodyLength();
byte[] bytes = new byte[(int)length];
///////////////////////////////////////
bytesMessage.readBytes(bytes);
String replyMessage = new String(bytes, "UTF-8");
datumReader = new SpecificDatumReader<IndexedRecord>(schema);
decoder = DecoderFactory.get().binaryDecoder(bytes, null);
result = datumReader.read(null, decoder);
synchronized (this.writeLock) {
dataFileWriter.append(result);
dataFileWriter.sync();
this.offset += bytes.length;
if (this.syncPolicy.mark(tuple, this.offset)) {
if (this.out instanceof HdfsDataOutputStream) {
((HdfsDataOutputStream) this.out).hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
} else {
this.out.hsync();
this.out.flush();
}
this.syncPolicy.reset();
}
dataFileWriter.flush();
}
if(this.rotationPolicy.mark(tuple, this.offset)){
rotateOutputFile(); // synchronized
this.offset = 0;
this.rotationPolicy.reset();
}
} catch (IOException | JMSException e) {
LOG.warn("write/sync failed.", e);
this.collector.fail(tuple);
}
}
@Override
void closeOutputFile() throws IOException {
this.out.close();
}
@Override
Path createOutputFile() throws IOException {
Path path = new Path(this.fileNameFormat.getPath(), this.fileNameFormat.getName(this.rotation, System.currentTimeMillis()));
this.out = this.fs.create(path);
dataFileWriter.create(schema, out);
return path;
}
@Override
void doPrepare(Map conf, TopologyContext topologyContext,OutputCollector collector) throws IOException {
// TODO Auto-generated method stub
LOG.info("Preparing HDFS Bolt...");
try {
schema = new Schema.Parser().parse(new File("/home/*******/********SchemafileName.avsc"));
} catch (IOException e1) {
e1.printStackTrace();
}
this.fs = FileSystem.get(URI.create(this.fsUrl), hdfsConfig);
datumWriter = new SpecificDatumWriter<IndexedRecord>(schema);
dataFileWriter = new DataFileWriter<IndexedRecord>(datumWriter);
JMSAvroUtils JASV = new JMSAvroUtils();
}
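For completeness, the snippet above references several fields that it never declares. The declarations it appears to assume are sketched below; the names mirror the code, but this is an inference, not the author's actual class.
private Schema schema;                                   // parsed from the .avsc file in doPrepare()
private SpecificDatumReader<IndexedRecord> datumReader;  // decodes the JMS bytes into an Avro record
private SpecificDatumWriter<IndexedRecord> datumWriter;  // used by the DataFileWriter
private DataFileWriter<IndexedRecord> dataFileWriter;    // writes Avro container files to this.out
private BinaryDecoder decoder;
private IndexedRecord result;
private BytesMessage bytesMessage;                       // the JMS message carried by the tuple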

Broadcasting using the Zab protocol in ZooKeeper

Good morning,
I am new to ZooKeeper and its protocols and I am interested in its broadcast protocol Zab.
Could you provide me with simple Java code that uses the Zab protocol of ZooKeeper? I have been searching for this but did not succeed in finding code that shows how I can use Zab.
In fact what I need is simple: I have a MapReduce job and I want all the mappers to update a variable (let's say X) whenever they find a better value of X (i.e. a bigger value). In this case, the leader has to compare the old value and the new value and then broadcast the current best value to all mappers. How can I do such a thing in Java?
Thanks in advance,
Regards
You don't need to use the Zab protocol directly. Instead you may follow the steps below:
You have a znode, say /bigvalue, on ZooKeeper. All the mappers read the value stored in it when they start, and they also put a watch for data changes on the znode. Whenever a mapper gets a better value, it updates the znode with that value. All the mappers then get a notification for the data-change event, read the new best value, and re-establish the watch for data changes. That way they stay in sync with the latest best value and can update it whenever they find a better one.
zkclient is a very good library for working with ZooKeeper and it hides a lot of the complexity (https://github.com/sgroschupf/zkclient). Below is an example that demonstrates how you may watch the znode "/bigvalue" for any data change.
package geet.org;
import java.io.UnsupportedEncodingException;
import org.I0Itec.zkclient.IZkDataListener;
import org.I0Itec.zkclient.ZkClient;
import org.I0Itec.zkclient.exception.ZkMarshallingError;
import org.I0Itec.zkclient.exception.ZkNodeExistsException;
import org.I0Itec.zkclient.serialize.ZkSerializer;
import org.apache.zookeeper.data.Stat;
public class ZkExample implements IZkDataListener, ZkSerializer {
public static void main(String[] args) {
String znode = "/bigvalue";
ZkExample ins = new ZkExample();
ZkClient cl = new ZkClient("127.0.0.1", 30000, 30000,
ins);
try {
cl.createPersistent(znode);
} catch (ZkNodeExistsException e) {
System.out.println(e.getMessage());
}
// Change the data for fun
Stat stat = new Stat();
String data = cl.readData(znode, stat);
System.out.println("Current data " + data + "version = " + stat.getVersion());
cl.writeData(znode, "My new data ", stat.getVersion());
cl.subscribeDataChanges(znode, ins);
try {
Thread.sleep(36000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
@Override
public void handleDataChange(String dataPath, Object data) throws Exception {
System.out.println("Detected data change");
System.out.println("New data for " + dataPath + " " + (String)data);
}
@Override
public void handleDataDeleted(String dataPath) throws Exception {
System.out.println("Data deleted " + dataPath);
}
@Override
public byte[] serialize(Object data) throws ZkMarshallingError {
if (data instanceof String){
try {
return ((String) data).getBytes("UTF-8");
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
}
return null;
}
@Override
public Object deserialize(byte[] bytes) throws ZkMarshallingError {
try {
return new String(bytes, "UTF-8");
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
return null;
}
}
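Since several mappers may try to update /bigvalue at once, the versioned write shown above can be turned into a small compare-and-set loop so a larger value is never overwritten by a smaller one. Here is a sketch using the same zkclient API; encoding the value as a decimal string is an assumption.
// Update /bigvalue only if our candidate is larger; retry if another mapper wrote first.
// Needs org.I0Itec.zkclient.exception.ZkBadVersionException in addition to the imports above.
void offerBetterValue(ZkClient cl, String znode, long candidate) {
    while (true) {
        Stat stat = new Stat();
        String current = cl.readData(znode, stat);
        long best = (current == null || current.trim().isEmpty()) ? Long.MIN_VALUE : Long.parseLong(current.trim());
        if (candidate <= best) {
            return; // an equal or better value is already published
        }
        try {
            // Conditional write: fails with ZkBadVersionException if the znode changed since we read it.
            cl.writeData(znode, Long.toString(candidate), stat.getVersion());
            return;
        } catch (ZkBadVersionException e) {
            // lost the race; re-read and try again
        }
    }
}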
