The docs say that stateful operations like mapGroupsWithState in Structured Streaming are supported only in Scala and Java, but I need stateful capabilities in Python. What should I do?
If you insist on using PySpark, there are two workarounds:
Perform the preprocessing in one Spark job, then write the necessary "state" stream to a file sink. In another job, read that stream and perform the output action. This involves extra memory/disk/latency overhead.
Use the updateStateByKey API instead. This requires the DStreams approach rather than Structured Streaming (a minimal sketch follows below).
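For the updateStateByKey approach, here is a minimal PySpark DStreams sketch of a stateful running count; the socket source, host/port, and checkpoint path are placeholders you would replace with your own:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StatefulCounts")
ssc = StreamingContext(sc, 10)               # 10-second micro-batches
ssc.checkpoint("/tmp/state-checkpoint")      # checkpointing is required for stateful ops

def update_count(new_values, running_count):
    # new_values: values seen for this key in the current batch
    # running_count: state carried over from previous batches (None on first sight)
    return sum(new_values) + (running_count or 0)

counts = (ssc.socketTextStream("localhost", 9999)
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .updateStateByKey(update_count))

counts.pprint()
ssc.start()
ssc.awaitTermination()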
Neither approach is great. If you need the latest and greatest API features, I'd recommend transitioning to Scala now. As your project progresses, you will run into this problem repeatedly; since Spark is written in Scala, the Python API always lags behind.
I am working on building a data ingestion pipeline using the Apache Beam Go SDK.
My pipeline needs to consume data from a Kafka topic and persist it to Google Cloud Bigtable (and/or to another Kafka topic).
So far, I have not been able to find a Kafka I/O connector (i.e., an Apache Beam I/O transform) written in Go (I was able to find a Java version, however).
Here's a link to the supported Apache Beam built-in I/O transforms:
https://beam.apache.org/documentation/io/built-in/
I am looking for the Go equivalent of the following Java code:
pipeline.apply("kafka_deserialization", KafkaIO.<String, String>read()
.withBootstrapServers(KAFKA_BROKER)
.withTopic(KAFKA_TOPIC)
.withConsumerConfigUpdates(CONSUMER_CONFIG)
.withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializer(StringDeserializer.class));
Do you have any information on the availability of KafkaIO Connector "go" SDK/library?
@cricket_007 In case you are also curious, I received the following update from Robert Burke (rebo@google.com), who is on the Apache Beam team:
There presently isn't a Kafka transform for Go.
The Go SDK is still experimental, largely due to scalable IO support, which is why the Go SDK isn't represented in the built-in io page.
There's presently no way for an SDK user to write a Streaming source in the Go SDK, since there's no mechanism for a DoFn to "self terminate" bundles, such as to allow for scalability and windowing from streaming sources.
However, SplittableDoFns are on their way, and will eventually be the solution for writing these.
At present, the Beam Go SDK IOs haven't been tested and vetted for production use. Until the initial SplittableDoFn support is added to the Go SDK, batch transforms cannot split and can't scale beyond a single worker thread. The batch version should land in the next few months, and the streaming version a few months after that, after which a Kafka IO can be developed.
I wish I had better news for you, but I can say progress is being made.
Robert Burke
As a beginner with the Apache Beam programming model, I would like to know the difference between plain JDBC and JdbcIO. I have developed a simple Dataflow pipeline using a normal JDBC connection, and it works as expected.
Is it mandatory to use JdbcIO over plain JDBC? If so, what issues will we face if we go with normal JDBC code?
Within a Beam pipeline there are various options for reading and writing out to external sources of data. The most common method is to make use of inbuilt sinks and sources that have been built by the Beam community (Built-in I/O Transforms). These connectors will often have had considerable development effort spent on them and will have been production hardened. For example the BigQueryIO has been used in production for many years, with continuous development throughout that period. The general advice will therefore be to make use of the standard Sinks and Sources whenever possible.
However, not all interactions with external data sources should be via Sources and Sinks; there are use cases where hand-built communication from a DoFn to the external source is the correct path. A few examples below (there are more, of course!):
There is no Sink / Source for the data source, or there is a Source but it does not yet support all the switches / modes etc. that you need. Of course you can always enhance the existing Sink / Source, or, if one does not exist, build a new I/O connector from scratch; if possible it would be great to contribute this back to the community :)
You are enriching elements flowing through your streaming pipeline with a small subset of data from a large data set. For example, let's say you're processing events coming from sales orders and you would like to add information for each item. The information for the items lives in a large multi-TB store, but on average you will only access a small percentage of the data as lookup keys. In this example it makes sense to enrich each element by making an external call to the data store within a DoFn, rather than reading all of the data in as a Source and doing the join operation within the pipeline.
Extra notes / hints:
When calling external systems, keep in mind that Apache Beam is designed to distribute work across many threads, which can place significant load on your external datasource. You can often reduce this load by making use of the start and finish bundle annotations (see the sketch after the list below):
Java (SDK 2.9.0)
DoFn.StartBundle
DoFn.FinishBundle
Python (SDK 2.9.0)
start_bundle()
finish_bundle()
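As an illustration of both points, here is a minimal Beam Python sketch of the enrichment DoFn described earlier, opening one connection per bundle; ItemStoreClient is a hypothetical stand-in for whatever client library your external datastore provides:

import apache_beam as beam

class ItemStoreClient:
    # Hypothetical client for the external item store; replace with your real client library.
    def __init__(self, host):
        self.host = host
    def lookup(self, item_id):
        return {"item_id": item_id}  # placeholder response
    def close(self):
        pass

class EnrichWithItemInfo(beam.DoFn):
    def start_bundle(self):
        # One connection per bundle (rather than per element) reduces load on the external store.
        self.client = ItemStoreClient(host="items.example.internal")

    def process(self, event):
        # Look up only the keys that actually flow through the pipeline.
        yield {**event, "item_info": self.client.lookup(event["item_id"])}

    def finish_bundle(self):
        self.client.close()

# Usage in a pipeline: events | beam.ParDo(EnrichWithItemInfo())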
Has something like this been done before? If not, what would be involved in getting NiFi to ingest a stream arriving over a WebSocket with Google FlatBuffers?
(would a simple TCP stream make it easier or harder?)
UPDATE
I have a C++ program running on a node, which collects data and publishes it via a nanomsg pub/sub channel over a websocket. The data in C++ looks like structs, and I am serializing it with Google FlatBuffers. It is a very simple struct; think of CSV records. We have a team member who wants to capture this data with NiFi and put it into a database.
Personally, since FlatBuffers supports conversion of binary to JSON, I think it would be almost easier to just write a short C#, Python, Java, or JavaScript program to receive the FlatBuffers, open a DB connection, and dump the data (maybe converting to JSON first, if needed).
To my knowledge, NiFi does not have an out-of-the-box integration with the nanomsg library/protocol. This would likely require writing a custom processor that is capable of consuming nanomsg packets using the nanomsg PUB/SUB pattern / socket types.
One could use existing processors, such as the Consume* processors (ConsumeKafka, ConsumeJMS) as an example / guide for how to write a processor that consumes messages from a topic/queue that follows the pub/sub pattern.
You would then want to transform the payload from Flatbuffers binary to a format insertable into the desired database. Again, a custom processor using code generated from your Flatbuffer schema would probably be the right approach for that.
As you mention, this could also be accomplished with a simple program. If you wrote that program in Java (using the Java nanomsg and flatbuffers libs) as a prototype/proof-of-concept, then it could be refactored into one or more NiFi custom processors in the future if you wish to move to NiFi.
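If Python is preferred for the prototype, a minimal sketch of that "simple program" approach could look like the following; the transport is left as a placeholder, and the generated telemetry.Record module with its Timestamp()/Value() accessors is a hypothetical stand-in for your actual FlatBuffers schema:

import sqlite3

# Generated with `flatc --python` from your schema; the telemetry.Record module
# and its Timestamp()/Value() fields are hypothetical stand-ins.
from telemetry.Record import Record

def receive_messages():
    # Placeholder transport: in practice this would be a nanomsg SUB socket
    # (via a nanomsg binding) connected to the publisher's websocket endpoint,
    # yielding one serialized FlatBuffer per message.
    return iter(())

def main():
    db = sqlite3.connect("telemetry.db")
    db.execute("CREATE TABLE IF NOT EXISTS records (ts INTEGER, value REAL)")
    for payload in receive_messages():
        rec = Record.GetRootAsRecord(bytearray(payload), 0)
        db.execute("INSERT INTO records VALUES (?, ?)", (rec.Timestamp(), rec.Value()))
    db.commit()
    db.close()

if __name__ == "__main__":
    main()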
We have specific functionality for managing time series data. The functionality is already offered as a REST API and runs on Cloud Foundry. We want to offer support for ingesting time series data using Spark Streaming and Kafka so that the solution is more scalable and robust.
What are the disadvantages of calling the REST API from Spark Streaming instead of building the functionality natively in Spark?
I would argue that if your REST API can support the throughput from Spark Streaming, it can support the throughput directly, in which case you don't actually need Spark Streaming at all. If what you need is a buffer for unexpected spikes, there are simpler ways to achieve that than Spark Streaming.
To address your question more directly, calling the REST API adds latency and an additional failure case to the Spark Streaming pipeline. Implementing your logic in Spark Streaming directly adds code complexity and possible duplication. And both options add operational complexity.
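For illustration, here is a minimal PySpark Structured Streaming sketch that forwards each micro-batch to the REST API; the Kafka settings and endpoint are placeholders, and it assumes the spark-sql-kafka connector and the requests library are available on the cluster:

import requests
from pyspark.sql import SparkSession

REST_ENDPOINT = "https://timeseries.example.com/ingest"  # placeholder endpoint

spark = SparkSession.builder.appName("TimeseriesToRest").getOrCreate()

# Requires the spark-sql-kafka connector package on the classpath.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "timeseries")
          .load()
          .selectExpr("CAST(value AS STRING) AS payload"))

def post_partition(rows):
    # One HTTP session per partition; each call still adds latency and a
    # failure mode the pipeline must handle (retries, dead-lettering, ...).
    session = requests.Session()
    for row in rows:
        session.post(REST_ENDPOINT, data=row.payload, timeout=5)

def post_batch(batch_df, batch_id):
    batch_df.foreachPartition(post_partition)

query = events.writeStream.foreachBatch(post_batch).start()
query.awaitTermination()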
I am searching for technologies that I can use in order to stream data from social media to Hadoop.
I searched and found these technologies:
Flume.
Storm.
Kafka.
Which tool is the best, and why? Is anyone familiar with other tools?
Most likely, you will want to use Flume, as it is built to work with HDFS. However, as with all things, it depends.
Kafka is basically a queuing system that is usually used to persist data in the event of a failure in your analytics architecture. If this sounds like what you need, it might be worth looking into RabbitMQ, ZeroMQ, or maybe Kestrel.
Storm is used for complex event processing. If you use Storm, you will be using ZeroMQ under the hood, and will likely have to set up a spout that is hooked up to Kafka or RabbitMQ. If you need to do complicated munging of the data before storage, this might be the right option. There are other options you can use too, like Spark. I'm inclined to suggest Storm purely out of personal preference. I heard that LinkedIn was releasing a real-time complex event processing framework as well, but I can't remember the name of it. I'll update the post when I can find it.
On a different note, if you're asking this question, it might be because you haven't built this thing yet. If that is the case, you might want to look into something other than hadoop if you need streaming. The ecosystem is rapidly expanding, and there are probably many ways to do what you want to do.
Apache Kafka is a distributed messaging system. Very briefly, it's like you push (publish) messages into a Kafka queue using a Kafka producer, and on the other end you consume them using a Kafka consumer (subscriber). The messages/feeds can be divided into categories called topics. You can run Kafka as a cluster, which makes it very scalable and allows it to be expanded without any downtime (a minimal producer/consumer sketch follows below).
It could be a nice choice for holding your social media streams. Kafka retains the messages pushed to it for a configurable time, and the best part, from their documentation:
Kafka's performance is effectively constant with respect to data size so retaining lots of data is not a problem.
Check out the docs for better visibility.
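As a minimal illustration of the publish/subscribe flow described above, assuming the kafka-python client (broker address and topic are placeholders):

from kafka import KafkaProducer, KafkaConsumer

# Producer side: publish a social media event to a topic.
producer = KafkaProducer(bootstrap_servers="broker:9092")
producer.send("social-media", b'{"user": "alice", "text": "hello"}')
producer.flush()

# Consumer side: subscribe to the topic and read events as they arrive.
consumer = KafkaConsumer("social-media",
                         bootstrap_servers="broker:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)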
Now, Storm is a very scalable, fault-tolerant distributed computation system which can easily be integrated with any queueing system (like Kafka) or database (HDFS, Cassandra, etc.). So you can feed your messages to a Storm cluster for further processing based on your requirements. There is something called KafkaSpout which does seamless integration between Storm and Kafka.
You should also look at the Kafka-hadoop loader on GitHub, which creates a Hadoop job for incrementally loading messages from Kafka topics onto HDFS with multiple-file output semantics.
Also, as @Peter Klipfel said:
you might want to look into something other than hadoop if you need streaming
You can also check other available alternatives like Apache Cassandra, which works great with streaming data at very low latency.
I think it depends on where you are pulling the data and what you are trying to do with the data.
An alternative is to use IBM Streams, where you can pull directly from social media streams and store to many different data stores of your choice.
For example, you can use the streamsx.social toolkit from here: https://github.com/IBMStreams/streamsx.social which allows you to pull tweets directly from an HTTP stream.
Once you get data into Streams, the product also provides many adapters that allow you to store the streaming data into a datastore (e.g., HDFS using streamsx.hdfs, HBase using streamsx.hbase).
I think another consideration is what kind of analytics you are doing with the social media data. If you would like to analyze the social data in-stream before it is stored, IBM Streams also provides a text toolkit that allows you to extract insight from the unstructured text of the social data. You can analyze the data without really having to store it anywhere.
Hope it helps!