SpringXD: Can partitionPath and hdfs-dataset coexist? - spring-xd

I defined several streams, using the new partitionPath option so that files end up in per-day directories in Hadoop:
stream create --name XXXX --definition "http --port=8300|hdfs-dataset --format=avro --idleTimeout=100000 --partitionPath=dateFormat('yyyy/MM/dd/')" --deploy
stream create --name YYYY --definition "http --port=8301|hdfs --idleTimeout=100000 --partitionPath=dateFormat('yyyy/MM/dd/')" --deploy
All of the streams were created and deployed, except for XXXX up there:
17:42:49,102 INFO Deployer server.StreamDeploymentListener - Deploying stream Stream{name='XXXX'}
17:42:50,948 INFO Deployer server.StreamDeploymentListener - Deployment status for stream 'XXXX': DeploymentStatus{state=failed,error(s)=java.lang.IllegalArgumentException: Cannot instantiate 'IntegrationConfigurationInitializer': org.springframework.integration.jmx.config.JmxIntegrationConfigurationInitializer}
17:42:50,951 INFO Deployer server.StreamDeploymentListener - Stream Stream{name='XXXX'} deployment attempt complete
Note that its data does get processed and deposited in Avro format. And FWIW, whereas the other streams' files end up under /xd/<NAME>/<rest of path>, the hdfs-dataset --format=avro combo results in files going to /xd/<NAME>/string
I re-defined it without the partitionPath option, and the stream deployed.
Do we have a bug here, or am I doing something wrong?

The hdfs-dataset sink is intended for writing serialized POJOs to HDFS. We use the Kite SDK kite-data functionality for this, so take a look at that project for some additional info.
The partitioning expressions for hdfs and hdfs-dataset are different. The hdfs-dataset sink follows the Kite SDK syntax, where you specify the field of the POJO that holds the partition value. For a timestamp (long) field the expression would look like this: dateFormat('timestamp', 'YM', 'yyyyMM'), where timestamp is the name of the field, 'YM' is the prefix that gets added to the partition directory (e.g. YM201411), and 'yyyyMM' is the format you want for the partition value. If you want a year/month/day directory structure for the partition, you could use year('timestamp')/month('timestamp')/day('timestamp'). There is some more coverage in the Kite SDK Partitioned Datasets docs.
For your example it doesn't make much sense to add partitioning, since you are persisting a simple String value. If you add a processor to transform the data into a POJO, then partitioning makes more sense; we have some examples in the XD docs.
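For instance, a sketch of what that could look like, where json-to-pojo stands in for a hypothetical custom processor that converts the incoming payload into a POJO with a long timestamp field:
stream create --name XXXX --definition "http --port=8300 | json-to-pojo | hdfs-dataset --format=avro --idleTimeout=100000 --partitionPath=year('timestamp')/month('timestamp')/day('timestamp')" --deploy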

Related

Adding dynamic records in parquet format

I'm working on building a data lake and I'm stuck on a fairly trivial thing. I'll be using Hadoop/HDFS as the data lake infrastructure and storing records in Parquet format. The data comes from a Kafka queue that sends a JSON record each time. The keys in the JSON record can vary from message to message. For example, in the first message the keys could be 'a' and 'b', and in the second message they could be 'c' and 'd'.
I was using pyarrow to store files in Parquet format, but as far as I understand we have to predefine the schema. So when I try to write the second message, it throws an error saying that keys 'c' and 'd' are not defined in the schema.
Could someone guide me on how to proceed with this? Libraries other than pyarrow are also fine, as long as they support this.
Parquet supports Map types for cases where fields are unknown ahead of time. Alternatively, if some of the fields are known, define more concrete types for those, possibly making them nullable; however, you cannot mix named fields with a map at the same level of the record structure.
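A minimal sketch of the map-type approach with pyarrow (the column name, sample values, and output path are made up), storing every incoming key/value pair in a single map column:
import pyarrow as pa
import pyarrow.parquet as pq

# A single map<string, string> column can hold whatever keys each message carries.
map_type = pa.map_(pa.string(), pa.string())
schema = pa.schema([pa.field("fields", map_type)])

# Each row is a list of (key, value) tuples; the keys are free to differ per record.
rows = [
    [("a", "1"), ("b", "2")],  # first message
    [("c", "3"), ("d", "4")],  # second message
]

table = pa.Table.from_arrays([pa.array(rows, type=map_type)], schema=schema)
pq.write_table(table, "events.parquet")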
I've not used Pyarrow, but I'd suggest using Spark Structured Streaming and defining the schema there, especially when consuming from Kafka. Spark's default output writer to HDFS uses Parquet.
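Since that route also needs an explicit schema, here is a minimal PySpark sketch (the broker address, topic, paths, and the declared keys are all hypothetical); note that fields missing from a given JSON message simply come out as nulls:
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

# Declare the union of known keys as nullable fields; absent keys become null.
schema = StructType([
    StructField("a", StringType(), True),
    StructField("b", StringType(), True),
    StructField("c", StringType(), True),
    StructField("d", StringType(), True),
])

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
      .option("subscribe", "events")                     # hypothetical topic
      .load()
      .select(from_json(col("value").cast("string"), schema).alias("record"))
      .select("record.*"))

# Write the stream out as Parquet files on HDFS.
(df.writeStream
   .format("parquet")
   .option("path", "hdfs:///datalake/events")
   .option("checkpointLocation", "hdfs:///datalake/events_checkpoint")
   .start()
   .awaitTermination())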

Kafka Connect - Modifying records before writing to the sink

I have installed Kafka Connect using confluent-4.0.0.
Using the HDFS connector I am able to save Avro records received from a Kafka topic to Hive.
I would like to know if there is any way to modify the records before writing them to the HDFS sink.
My requirement is to make small modifications to the values of the records, for example performing arithmetic operations on integers or manipulating strings.
Please suggest if there is any way to achieve this.
You have several options.
Single Message Transforms, which you can see in action here. Great for lightweight changes as messages pass through Connect. Configuration-file based, and extensible using the provided API if there isn't an existing transform that does what you want.
See the discussion here on when SMTs are suitable for a given requirement.
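As a sketch, enabling a couple of the built-in transforms in the HDFS sink connector's properties might look like this (the amount and user_id field names are made up; arithmetic on values is not covered by the built-in transforms, so that part would need a custom transform or KSQL):
transforms=castAmount,maskUser
transforms.castAmount.type=org.apache.kafka.connect.transforms.Cast$Value
transforms.castAmount.spec=amount:float64
transforms.maskUser.type=org.apache.kafka.connect.transforms.MaskField$Value
transforms.maskUser.fields=user_id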
KSQL is a streaming SQL engine for Kafka. You can use it to modify your streams of data before sending them to HDFS. See the example here.
KSQL is built on the Kafka Streams API, which is a Java library and gives you the power to transform your data as much as you'd like. Here's an example.
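A rough sketch of the kind of KSQL statement that could do the arithmetic and string manipulation you describe (the stream and column names are hypothetical):
CREATE STREAM orders_modified AS
  SELECT amount * 100 AS amount_cents,
         UCASE(customer_name) AS customer_name
  FROM orders;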
Take a look at the Kafka Connect transforms [1] & [2]. You can build a custom transform library and use it in your connector.
[1] http://kafka.apache.org/documentation.html#connect_transforms
[2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect

How can I remove the number from the file name?

I use Spring XD and I have created the following stream:
stream create --name test --definition "time | hdfs --rollover=1B --directory=/xd/test --fileName=test --overwrite=true" --deploy
The stream generates many files. Each file name contains the name plus an additional number, e.g. test-0.txt, test-1.txt, test-2.txt, etc.
Because I use Spring XD and Hadoop for educational purposes, I want to save free space on my hard drive, so I would like to overwrite the data. Is it possible to remove the above number from the file name?
The rollover size of 1B is too small, which piles up the number of files being created. You can set an optimal size based on the data you process to control the number of files created.
For more options to control these properties you can refer here.
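For example, a sketch of the same stream with a much larger rollover threshold (10,000,000 bytes is just an illustrative value):
stream create --name test --definition "time | hdfs --rollover=10000000 --directory=/xd/test --fileName=test --overwrite=true" --deploy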

How to efficiently move data from Kafka to an Impala table?

Here are the steps to the current process:
Flafka writes logs to a 'landing zone' on HDFS.
A job, scheduled by Oozie, copies complete files from the landing zone to a staging area.
The staging data is 'schema-ified' by a Hive table that uses the staging area as its location.
Records from the staging table are added to a permanent Hive table (e.g. insert into permanent_table select * from staging_table).
The data, from the Hive table, is available in Impala by executing refresh permanent_table in Impala.
I look at the process I've built and it "smells" bad: there are too many intermediate steps that impair the flow of data.
About 20 months ago, I saw a demo where data was being streamed from an Amazon Kinesis pipe and was queryable, in near real-time, by Impala. I don't suppose they did something quite so ugly/convoluted. Is there a more efficient way to stream data from Kafka to Impala (possibly a Kafka consumer that can serialize to Parquet)?
I imagine that "streaming data to low-latency SQL" must be a fairly common use case, and so I'm interested to know how other people have solved this problem.
If you need to dump your Kafka data as-is to HDFS, the best option is Kafka Connect with the Confluent HDFS connector.
You can dump the data to Parquet files on HDFS, which you can then load into Impala.
I think you'll want to use the TimeBasedPartitioner to produce a new Parquet file every X milliseconds (tuning the partition.duration.ms configuration parameter).
Adding something like this to your Kafka Connect configuration might do the trick:
# Don't flush less than 1000 messages to HDFS
flush.size = 1000
# Dump to parquet files
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner
# One file every hour. If you change this, remember to change the path format to reflect it
partition.duration.ms=3600000
# Directory layout for the partitions
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm
# The time-based partitioner also requires a locale and timezone
locale=en
timezone=UTC
Answering this question in 2022, I would say the solution would be to stream messages from Kafka to Kudu and integrate Impala with Kudu, as the two already have tight integration.
Here is example of Impala schema for Kudu:
CREATE EXTERNAL TABLE my_table
STORED AS KUDU
TBLPROPERTIES (
'kudu.table_name' = 'my_kudu_table'
);
Apache Kudu supports SQL inserts and uses its own storage format under the hood. Alternatively, you could use Apache Phoenix, which supports inserts and upserts (if you need exactly-once semantics) and uses HBase under the hood.
As long as Impala is your final way of accessing the data, you shouldn't care about the underlying formats.
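For example, once the Kudu table is mapped in Impala as above, ingestion and queries are plain SQL (the id and payload columns are made up):
INSERT INTO my_table (id, payload) VALUES (1, 'example');
SELECT COUNT(*) FROM my_table;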

Lookup using spring-xd

I am looking for a way to perform a lookup operation using Spring XD.
My problem statement goes like this:
I have a stream of JSON events coming in, and I want the values in those events looked up against threshold values kept in a file in HDFS or directly in an RDBMS.
Please suggest a way to perform this.
Thanks in advance.
If I understand this correctly, you have different thresholds for different values in your messages.
Something like
value 'A' -> 100
value 'B' -> 200
...
This information is stored in a file or in a relational database. Now you want to filter the events based on their values and the corresponding thresholds.
I guess you would have to write a custom processor that holds a connection to the database where these values are stored and queries them. If the mapping is small enough, you should consider caching it, or at least cache the most frequently used values, so that this does not slow down your stream.
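A sketch of what such a stream might look like, with threshold-filter standing in for a hypothetical custom processor module that loads and caches the thresholds:
stream create --name thresholdCheck --definition "http --port=9000 | threshold-filter | hdfs" --deploy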
If I understand your question, you can write a Groovy processor to receive the payload, filter it, and then pass it on to wherever you want, such as hdfs.
stream create --name mystream --definition "jdbc | groovyprocessor | hdfs" --deploy
In the case of batch processing, you will need to write a custom module.
