Is it possible to write in ORC format into HDFS using the HDFS2FileSink operator in IBM InfoSphere Streams, or any other way? - ibm-streams

Is it possible to write in ORC format using the HDFS2FileSink operator in IBM InfoSphere Streams?

No, it isn't possible at this time using the HDFS2FileSink operator. It only supports text or binary.
The streamsx.parquet toolkit has support for writing to Parquet.
Otherwise, you would have to create your own Java operator that receives the data and uses the ORC API to write it.
It is fairly straightforward to create a Java operator, as shown in this video.
The Java Operator Development guide can walk you through the process. Specifically see writing a sink operator.
After creating a new Java operator, add code that uses the ORC API in the process method:
@Override
public void process(StreamingInput<Tuple> stream, Tuple tuple)
        throws Exception {
    // TODO Insert code here to process the incoming tuple,
    // typically sending tuple data to an external system or data store.
    // String value = tuple.getString("AttributeName");
}
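As a rough illustration, here is a minimal sketch of such an operator using the ORC Core Java API. The class name, output path, schema, and attribute name are all assumptions for this example (and the operator model annotations are omitted); the ORC writer is created once in initialize() and closed in shutdown():
import java.nio.charset.StandardCharsets;
import com.ibm.streams.operator.AbstractOperator;
import com.ibm.streams.operator.OperatorContext;
import com.ibm.streams.operator.StreamingInput;
import com.ibm.streams.operator.Tuple;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcSink extends AbstractOperator {

    private Writer orcWriter;
    private TypeDescription schema;
    private VectorizedRowBatch batch;

    @Override
    public synchronized void initialize(OperatorContext context) throws Exception {
        super.initialize(context);
        // Single string column for illustration; adjust the schema to match your stream type.
        schema = TypeDescription.fromString("struct<value:string>");
        orcWriter = OrcFile.createWriter(
            new Path("hdfs://namenode:8020/user/streams/output.orc"),   // assumed path
            OrcFile.writerOptions(new Configuration()).setSchema(schema));
        batch = schema.createRowBatch();
    }

    @Override
    public void process(StreamingInput<Tuple> stream, Tuple tuple) throws Exception {
        // "AttributeName" is an assumption; use your tuple's attribute.
        String value = tuple.getString("AttributeName");
        int row = batch.size++;
        ((BytesColumnVector) batch.cols[0]).setVal(row, value.getBytes(StandardCharsets.UTF_8));
        if (batch.size == batch.getMaxSize()) {
            orcWriter.addRowBatch(batch);   // write a full batch and start a new one
            batch.reset();
        }
    }

    @Override
    public synchronized void shutdown() throws Exception {
        if (batch.size > 0) {
            orcWriter.addRowBatch(batch);   // flush the last partial batch
        }
        orcWriter.close();
        super.shutdown();
    }
}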
I would start with the ORC home page and choose the appropriate link for Hive or Hadoop.

Related

How can I return an array of integers from Solana rust program function to front end?

I don't know how to write a smart contract in Solana that, after executing its logic, returns an array of integers, strings, etc. to the client, or how to fetch that data using Web3.
There's a syscall available to on-chain programs called set_return_data, which puts data into a buffer that can be read by higher-level programs using get_return_data. This is all mediated through opaque byte buffers, so you'll need to know how to decode the response.
If you want to fetch the data from the client side, you can simulate the transaction and read the data back from the return_data field in the response: https://edge.docs.solana.com/developing/clients/jsonrpc-api#results-50
The RPC support in simulated transactions is very new in version 1.11, but the return data is available in earlier versions.
Source code for set_return_data at https://github.com/solana-labs/solana/blob/658752cda710cb358d7ccbbc2cee06bf8009c2d4/sdk/program/src/program.rs#L102
Source code for get_return_data at https://github.com/solana-labs/solana/blob/658752cda710cb358d7ccbbc2cee06bf8009c2d4/sdk/program/src/program.rs#L117
So, programs do not return data (other than success or failure).
However, most programs write data to a program-owned account's data field, and this can be read from client apps (Rust, Python, TS/JS, etc.).
If using the Solana web3 library, you can call getAccountInfo on the Connection object. This will return the byte array of the account. You will then need to deserialize that data. You have to know how the program serializes the data to reverse it successfully.
Check the Solana Cookbook for an overview of deserializing account data with Borsh: https://solanacookbook.com/guides/serialization.html#how-to-deserialize-account-data-on-the-client

Kafka Streams: How to use persistentKeyValueStore to reload existing messages from disk?

My code is currently using an InMemoryKeyValueStore, which avoids any persistence to disk or to Kafka.
I want to use rocksdb (Stores.persistentKeyValueStore) so that the app will reload state from disk. I'm trying to implement this, and I'm very new to Kafka and the streams API. Would appreciate help on how I might make changes, while I still try to understand stuff as I go.
I tried to create the state store here:
StoreBuilder<KeyValueStore<String, LinkedList<StoreItem>>> store =
    Stores.<String, LinkedList<StoreItem>>keyValueStoreBuilder(
        Stores.persistentKeyValueStore(storeKey), Serdes.String(), valueSerde);
How do I register it with the streams builder?
Existing code which uses the inMemoryKeyValueStore:
static StoreBuilder<KeyValueStore<String, LinkedList<StoreItem>>> makeStoreBuilder(
        final String storeKey,
        final Serde<LinkedList<StoreItem>> valueSerde,
        final boolean loggingDisabled) {
    final StoreBuilder<KeyValueStore<String, LinkedList<StoreItem>>> storeBuilder =
        Stores.keyValueStoreBuilder(Stores.inMemoryKeyValueStore(storeKey), Serdes.String(), valueSerde);
    return storeBuilder;
}
I need to ensure that the streams app will not end up missing existing messages in the log topic each time it restarts.
How do I register it with the streams builder?
By calling StreamsBuilder#addStateStore().
https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/StreamsBuilder.html#addStateStore-org.apache.kafka.streams.state.StoreBuilder-
See StateStoresInTheDSLIntegrationTest at https://github.com/confluentinc/kafka-streams-examples for an end-to-end demo application.
You use a persistent store the exact same way as an in-memory store. The store takes care of the rest, and you don't need to worry about loading data; you just use it.
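Putting it together, a minimal sketch based on the code in the question (the input topic name and MyProcessor are hypothetical placeholders; inside the processor you look the store up by name via ProcessorContext#getStateStore(storeKey)):
import java.util.LinkedList;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

static StreamsBuilder buildTopology(
        final String storeKey,
        final Serde<LinkedList<StoreItem>> valueSerde) {

    final StreamsBuilder builder = new StreamsBuilder();

    // RocksDB-backed store; its changelog topic is enabled by default, so state
    // is restored from local disk (or from the changelog) after a restart.
    final StoreBuilder<KeyValueStore<String, LinkedList<StoreItem>>> storeBuilder =
        Stores.keyValueStoreBuilder(
            Stores.persistentKeyValueStore(storeKey),
            Serdes.String(),
            valueSerde);

    builder.addStateStore(storeBuilder);

    // Attach the store to a processor by name; "input-topic" and MyProcessor
    // are placeholders for your own topic and processor implementation.
    builder.stream("input-topic", Consumed.with(Serdes.String(), valueSerde))
           .process(() -> new MyProcessor(storeKey), storeKey);

    return builder;
}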

Kafka Streams API: KStream to KTable

I have a Kafka topic where I send location events (key=user_id, value=user_location). I am able to read and process it as a KStream:
KStreamBuilder builder = new KStreamBuilder();
KStream<String, Location> locations = builder
    .stream("location_topic")
    .map((k, v) -> {
        // some processing here, omitted for clarity
        Location location = new Location(lat, lon);
        return new KeyValue<>(k, location);
    });
That works well, but I'd like to have a KTable with the last known position of each user. How could I do it?
I am able to do it writing to and reading from an intermediate topic:
// write to intermediate topic
locations.to(Serdes.String(), new LocationSerde(), "location_topic_aux");
// build KTable from intermediate topic
KTable<String, Location> table = builder.table("location_topic_aux", "store");
Is there a simple way to obtain a KTable from a KStream? This is my first app using Kafka Streams, so I'm probably missing something obvious.
Update:
In Kafka 2.5, a new method KStream#toTable() will be added that provides a convenient way to transform a KStream into a KTable. For details see: https://cwiki.apache.org/confluence/display/KAFKA/KIP-523%3A+Add+KStream%23toTable+to+the+Streams+DSL
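With that method the conversion is a one-liner. A sketch, assuming the topology is built with the newer StreamsBuilder API and reusing the question's Location type and LocationSerde (the store name below is an assumption):
// Kafka 2.5+ only: convert the stream directly; the KTable keeps the latest value per key.
// (Materialized is org.apache.kafka.streams.kstream.Materialized, Bytes is org.apache.kafka.common.utils.Bytes.)
KTable<String, Location> table = locations.toTable(
    Materialized.<String, Location, KeyValueStore<Bytes, byte[]>>as("location-store")
        .withKeySerde(Serdes.String())
        .withValueSerde(new LocationSerde()));
// Or simply: KTable<String, Location> table = locations.toTable();  // default serdes and store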
Original Answer:
There is no straightforward way at the moment to do this. Your approach is absolutely valid, as discussed in the Confluent FAQs: http://docs.confluent.io/current/streams/faq.html#how-can-i-convert-a-kstream-to-a-ktable-without-an-aggregation-step
This is the simplest approach with regard to the code. However, it has the disadvantages that (a) you need to manage an additional topic and that (b) it results in additional network traffic because data is written to and re-read from Kafka.
There is one alternative, using a "dummy-reduce":
KStreamBuilder builder = new KStreamBuilder();
KStream<String, Long> stream = ...; // some computation that creates the derived KStream
KTable<String, Long> table = stream.groupByKey().reduce(
    new Reducer<Long>() {
        @Override
        public Long apply(Long aggValue, Long newValue) {
            return newValue;
        }
    },
    "dummy-aggregation-store");
This approach is somewhat more complex with regard to the code compared to option 1 but has the advantage that (a) no manual topic management is required and (b) re-reading the data from Kafka is not necessary.
Overall, you need to decide for yourself which approach you like better:
In option 2, Kafka Streams will create an internal changelog topic to back up the KTable for fault tolerance. Thus, both approaches require some additional storage in Kafka and result in additional network traffic. Overall, it’s a trade-off between slightly more complex code in option 2 versus manual topic management in option 1.

How to split an event to multiple events to send them to multiplexed fan out flow

We are planning to use the Kafka flume-ng integration (Flafka), where Flume is the consumer for Kafka queues. Flume agents will receive files listing commands and their output, as shown below:
root@host> [Command1]
[Output1]
root@host> [Command2]
[Output2]
The file may contain multiple commands, and a command's output may be huge. We need to intercept the event (which is the file data) and split it into multiple events based on the commands. The source will then fan out the flow to multiple channels, sending each sub-event to a channel (using multiplexing), and each sink will store the command info in its respective Hive table.
Is it possible to use a fan-out flow to split an event into multiple events? Or, to ask it another way, can we split an event into multiple events in an interceptor?
I have read about the regex extractor interceptor and serializer, but I am not sure whether they can help in this scenario.
If I've understood well, you need the original event taken from a Kafka queue to be split into several, let's say, sub-events, and you want to know which piece of Flume could do that.
I think interceptors are not suitable for that purpose, since interceptors are "placed" between the source and the channel and are designed to add, delete or modify the headers of a Flume event before putting it into the channel; at most, they can drop the entire event. They are not able to generate several events based on an existing one.
I think you are looking for something like a handler attached to the source, able to interpret the events taken from Kafka and generate several Flume events at the source output. This concept is similar to the handlers you can attach to an HTTPSource (more details here). If such a thing is possible with your source, most probably you will have to develop your own custom handler, since the functionality you require is very specific.
Thanks for the reply frb.
I want to split the incoming event at a Flume source into multiple sub-events and send them to their respective channels. The first Flume node in the topology will then route each sub-event (using multiplexing) to a specific hop that can handle that piece of information.
As per your reply, I understand that it cannot be done using an interceptor. Can you please share any example or documentation of handlers?
Yes, Flume cannot split one event into multiple events out of the box. Here is my alternative solution for this, taking the Kafka source as an example.
First, implement a source class that extends the Kafka source and replaces the default ChannelProcessor object.
public class XXXSplitSource extends KafkaSource {

    @Override
    public synchronized ChannelProcessor getChannelProcessor() {
        return new XXXYourChannelProcessorProxy(super.getChannelProcessor());
    }
}
Then, in the ChannelProcessor proxy implementation, you can split the events with your custom function.
public class XXXYourChannelProcessorProxy extends ChannelProcessor {

    public ChannelProcessor m_downstreamChannelProcessor = null;

    public XXXYourChannelProcessorProxy(ChannelSelector selector) {
        super(selector);
    }

    public XXXYourChannelProcessorProxy(ChannelProcessor processor) {
        super(null);
        m_downstreamChannelProcessor = processor;
    }

    @Override
    public void processEventBatch(List<Event> events) {
        List<Event> generatedEvents = YOUR_SPLIT_FUNCTION_HERE(events);
        m_downstreamChannelProcessor.processEventBatch(generatedEvents);
    }
}
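For illustration, a sketch of what YOUR_SPLIT_FUNCTION_HERE could look like, assuming the event body is the plain-text command/output listing from the question and that each sub-event carries its command in a header for the multiplexing channel selector (the header name and the prompt pattern are assumptions):
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public static List<Event> splitByCommand(List<Event> events) {
    List<Event> result = new ArrayList<>();
    for (Event event : events) {
        String body = new String(event.getBody(), StandardCharsets.UTF_8);
        // Each "root@host>" prompt starts a new command block; keep the prompt line in the block.
        for (String block : body.split("(?m)(?=^root@host> )")) {
            if (block.trim().isEmpty()) {
                continue;
            }
            // First line is the command itself, the rest is its output.
            String command = block.split("\n", 2)[0].replaceFirst("^root@host> ", "").trim();
            Map<String, String> headers = new HashMap<>(event.getHeaders());
            headers.put("command", command);   // read by the multiplexing channel selector
            result.add(EventBuilder.withBody(block.getBytes(StandardCharsets.UTF_8), headers));
        }
    }
    return result;
}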

What does the sync and syncFs of SequenceFile.Writer means?

Environment: Hadoop 0.20.2-cdh3u5
I am trying to upload log data (10G) to HDFS with a customized tool which uses SequenceFile.Writer.
SequenceFile.Writer w = SequenceFile.createWriter(
hdfs,
conf,
p,
LongWritable.class,
Text.class,
4096,
hdfs.getDefaultReplication(),
hdfs.getDefaultBlockSize(),
compressionType,
codec,
null,
new Metadata());
During the uploading process, if the tool crashes (without invoking the close() method explicitly), will the log data that has already been uploaded be lost?
Should I invoke sync() or syncFs() periodically, and what do the two methods mean?
Yes, probably.
sync() creates a sync point. As stated in the book "Hadoop: The Definitive Guide" by Tom White (Cloudera):
a sync point is a point in the stream which can be used to resynchronize with a record boundary if the reader is "lost" - for example, after seeking to an arbitrary position in the stream.
Now the implementation of syncFs() is pretty simple:
public void syncFs() throws IOException {
    if (out != null) {
        out.sync(); // flush contents to file system
    }
}
where out is an FSDataOutputStream. Again, the same book states:
HDFS provides a method for forcing all buffers to be synchronized to the datanodes via the sync() method on FSDataOutputStream. After a successful return from sync(), HDFS guarantees that the data written up to that point in the file is persisted and visible to all readers. In the event of a crash (of the client or HDFS), the data will not be lost.
But a footnote warns to look at bug HDFS-200, since the visibility mentioned above was not always honored.
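In practice, that means calling syncFs() periodically while writing, so a crash loses at most the records written since the last sync. A sketch against the writer from the question (the record source and the sync interval are assumptions):
// Append records and flush them to the datanodes every SYNC_INTERVAL records.
final int SYNC_INTERVAL = 10000;   // arbitrary; tune for your durability vs. throughput needs
long count = 0;
for (LogRecord record : records) {                  // LogRecord/records are hypothetical
    w.append(new LongWritable(record.getOffset()),  // key
             new Text(record.getLine()));           // value
    if (++count % SYNC_INTERVAL == 0) {
        w.syncFs();   // force buffered data to the datanodes (see the HDFS-200 caveat above)
    }
}
w.close();   // close() flushes remaining data and finalizes the file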
