KafkaIO Connector / Apache Beam Transform "go" SDK Available?

I am working on building a data ingestion pipeline using the Apache Beam "go" SDK.
My pipeline needs to consume data from a Kafka queue and persist the data to Google Cloud Bigtable (and/or to another Kafka topic).
So far, I have not been able to find a Kafka I/O connector (also known as an Apache Beam I/O transform) written in "go" (I was able to find a Java version, however).
Here's a link to the supported Apache Beam built-in I/O transforms:
https://beam.apache.org/documentation/io/built-in/
I am looking for the "go" equivalent of the following Java code:
pipeline.apply("kafka_deserialization", KafkaIO.<String, String>read()
    .withBootstrapServers(KAFKA_BROKER)
    .withTopic(KAFKA_TOPIC)
    .withConsumerConfigUpdates(CONSUMER_CONFIG)
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializer(StringDeserializer.class));
Do you have any information on the availability of KafkaIO Connector "go" SDK/library?

@cricket_007 In case you are also curious, I received the following update from Robert Burke (rebo@google.com), who is on the Apache Beam team:
There presently isn't a Kafka transform for Go.
The Go SDK is still experimental, largely due to scalable IO support, which is why the Go SDK isn't represented in the built-in io page.
There's presently no way for an SDK user to write a Streaming source in the Go SDK, since there's no mechanism for a DoFn to "self terminate" bundles, such as to allow for scalability and windowing from streaming sources.
However, SplittableDoFns are on their way, and will eventually be the solution for writing these.
At present, the Beam Go SDK IOs haven't been tested and vetted for production use. Until the initial SplittableDoFn support is added to the Go SDK, Batch transforms cannot split, and can't scale beyond a single worker thread. This batch version should land in the next few months, and the streaming version land a few months after that, after which a Kafka IO can be developed.
I wish I had better news for you, but I can say progress is being made.
Robert Burke

Related

Difference between normal JDBC and JDBCIO connector in Apache Beam?

Being a beginner with the Apache Beam programming model, I would like to know the difference between plain JDBC and JdbcIO. I have developed a simple dataflow that uses a normal JDBC connection, and it is working as expected.
Is it mandatory to use JdbcIO instead of JDBC? If so, what issues will we face if we go with normal JDBC code?
Within a Beam pipeline there are various options for reading from and writing to external sources of data. The most common method is to make use of the built-in sinks and sources that have been built by the Beam community (Built-in I/O Transforms). These connectors have often had considerable development effort spent on them and have been production hardened. For example, BigQueryIO has been used in production for many years, with continuous development throughout that period. The general advice is therefore to make use of the standard Sinks and Sources whenever possible.
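For instance, since this question is about JdbcIO, a minimal, hedged sketch of a JdbcIO read in the Java SDK might look like the following (the driver class, connection URL, credentials, and query are placeholders you would adapt to your own database):

PCollection<KV<String, Integer>> rows = pipeline.apply("read_orders",
    JdbcIO.<KV<String, Integer>>read()
        .withDataSourceConfiguration(
            JdbcIO.DataSourceConfiguration.create(
                "org.postgresql.Driver",                      // placeholder JDBC driver
                "jdbc:postgresql://localhost:5432/mydb")      // placeholder connection URL
                .withUsername("user")
                .withPassword("password"))
        .withQuery("SELECT name, amount FROM orders")          // placeholder query
        .withRowMapper((JdbcIO.RowMapper<KV<String, Integer>>) resultSet ->
            KV.of(resultSet.getString(1), resultSet.getInt(2)))
        .withCoder(KvCoder.of(StringUtf8Coder.of(), VarIntCoder.of())));

The built-in connector takes care of connection configuration and fits naturally into the Beam model, which is the main practical difference from hand-rolling JDBC calls inside a DoFn.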
However, not all interactions with external data sources should be via Sources and Sinks; there are use cases where hand-built communication from a DoFn to the external source is the correct path. A few examples below (there are more, of course!):
1) There is no Sink / Source for the data source, or there is a source but it does not yet support all the switches / modes etc. that you need. Of course you can always enhance the existing Sink / Source, or if it does not exist build a new I/O connector from scratch, and if possible it would be great to contribute this back to the community :)
2) You are enriching elements flowing through your streaming pipeline with a small subset of data from a large data set. For example, let's say you're processing events coming from a sales order and you would like to add information for each item. The information for the items lives in a large multi-TB store, but on average you will only access a small percentage of the data as lookup keys. In this case it makes sense to enrich each element by making an external call to the data store within a DoFn, rather than reading all of the data in as a Source and doing the join operation within the pipeline (a rough sketch of this pattern follows the list).
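As referenced above, a rough sketch of that enrichment pattern in the Java SDK (imports omitted; SalesOrder, EnrichedOrder, ItemInfo, and ItemStoreClient are hypothetical placeholder types standing in for your own model and the client library of your data store):

static class EnrichWithItemInfoFn extends DoFn<SalesOrder, EnrichedOrder> {
  // Hypothetical client for the external multi-TB item store.
  private transient ItemStoreClient client;

  @Setup
  public void setup() {
    client = ItemStoreClient.create();   // open one client per DoFn instance
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    SalesOrder order = c.element();
    // One small external lookup per element instead of joining against the whole store.
    ItemInfo info = client.lookup(order.getItemId());
    c.output(EnrichedOrder.of(order, info));
  }

  @Teardown
  public void teardown() {
    client.close();
  }
}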
Extra notes / hints:
When calling external systems, keep in mind that Apache Beam is designed to distribute work across many threads. This can place significant load on your external datasource; you can often reduce this load by making use of the start & end bundle annotations:
Java (SDK 2.9.0)
DoFn.StartBundle
DoFn.FinishBundle
Python (SDK 2.9.0)
start_bundle()
finish_bundle()
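For example, a hedged sketch of buffering writes per bundle with those Java annotations (imports omitted; ExternalSystemClient is a hypothetical placeholder for whatever client your datasource provides):

static class BatchedWriteFn extends DoFn<String, Void> {
  private transient List<String> buffer;

  @StartBundle
  public void startBundle() {
    buffer = new ArrayList<>();          // fresh buffer for each bundle
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    buffer.add(c.element());             // no external call per element
  }

  @FinishBundle
  public void finishBundle() {
    // One call to the external system per bundle instead of one per element,
    // which greatly reduces the load on the external datasource.
    ExternalSystemClient.writeAll(buffer);
    buffer = null;
  }
}

The Python start_bundle() / finish_bundle() hooks can be used in the same way.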

Difference between Apache NiFi and StreamSets

I am planning to do a class project and was going through a few technologies where I can automate or set up the flow of data between systems, and found that there are a couple of them, i.e. Apache NiFi and StreamSets (to my knowledge). What I couldn't understand is the difference between them and the use-cases where each can be used. I am new to this, and if anyone can explain a bit it would be highly appreciated. Thanks
Suraj,
Great question.
My response is as a member of the open source Apache NiFi project management committee and as someone who is passionate about the dataflow management domain.
I've been involved in the NiFi project since it was started in 2006. My knowledge of Streamsets is relatively limited so I'll let them speak for it as they have.
The key thing to understand is that NiFi was built to do one really important thing really well, and that is 'Dataflow Management'. Its design is based on a concept called Flow-Based Programming, which you may want to read about and reference for your project: https://en.wikipedia.org/wiki/Flow-based_programming
There are already many systems which produce data such as sensors and others. There are many systems which focus on data processing like Apache Storm, Spark, Flink, and others. And finally there are many systems which store data like HDFS, relational databases, and so on. NiFi purely focuses on the task of connecting those systems and providing the user experience and core functions necessary to do that well.
What are some of those key functions and design choices made to make that effective:
1) Interactive command and control
The job of someone trying to connect systems is to be able to rapidly and efficiently interact with the constant streams of data they see. NiFi's UI allows you to do just that: as the data is flowing you can add features to operate on it, fork off copies of data to try new approaches, adjust current settings, see recent and historical stats, view helpful in-line documentation, and more. Almost all other systems by comparison have a model that is design-and-deploy oriented, meaning you make a series of changes and then deploy them. That model is fine and can be intuitive, but for the dataflow management job it means you don't get the interactive change-by-change feedback that is so vital to quickly building new flows or to safely and efficiently correcting or improving the handling of existing data streams.
2) Data Provenance
A very unique capability of NiFi is its ability to generate fine-grained and powerful traceability details for where your data comes from, what is done to it, where it's sent, and when it is done in the flow. This is essential to effective dataflow management for a number of reasons, but for someone in the early exploration phases and working on a project the most important thing this gives you is awesome debugging flexibility. You can set up your flows and let things run, then use provenance to actually prove that it did exactly what you wanted. If something didn't happen as you expected, you can fix the flow and replay the object, then repeat. Really helpful.
3) Purpose built data repositories
NiFi's out of the box experience offers very powerful performance even on really modest hardware or virtual environments. This is because of the flowfile and content repository design, which gives us the high performance but transactional semantics we want as data works its way through the flow. The flowfile repository is a simple write-ahead log implementation and the content repository provides an immutable versioned content store. That in turn means we can 'copy' data by only ever adding a new pointer (not actually copying bytes), or we can transform data by simply reading from the original and writing out a new version. Again, very efficient. Couple that with the provenance stuff I mentioned a moment ago and it just provides a really powerful platform.
Another really key thing to understand here is that in the business of connecting systems you don't always get to dictate things like the size of the data involved. The NiFi API was built to honor that fact, and so our API lets processors do things like receive, transform, and send data without ever having to load the full objects in memory. These repositories also mean that in most flows the majority of processors do not even touch the content at all. However, you can easily see from the NiFi UI precisely how many bytes are actually being read or written, so again you get really helpful information in establishing and observing your flows.
This design also means NiFi can support back-pressure and pressure-release naturally, and these are really critical features for a dataflow management system.
It was mentioned previously by the folks from the Streamsets company that NiFi is file oriented. I'm not really sure what the difference is between a file or a record or a tuple or an object or a message in generic terms but the reality is when data is in the flow then it is 'a thing that needs to be managed and delivered'. That is what NiFi does. Whether you have lots of really high speed tiny things or you have large things and whether they came from a live audio stream off the Internet or they come from a file sitting on your harddrive it doesn't matter. Once it is in the flow it is time to manage and deliver it. That is what NiFi does.
It was also mentioned by the Streamsets company that NiFi is schemaless. It is accurate that NiFi does not force conversion of data from whatever it is originally to some special NiFi format nor do we have to reconvert it back to some format for follow-on delivery. It would be pretty unfortunate if we did that because what this means is that even the most trivial of cases would have problematic performance implications and luckily NiFi does not have that problem. Further had we gone that route then it would mean handling diverse datasets like media (images, video, audio, and more) would be difficult but we're on the right track and NiFi is used for things like that all the time.
Finally, as you continue with your project and if you find there are things you'd like to see improved or that you'd like to contribute code we'd love to have your help. From https://nifi.apache.org you can quickly find information on how to file tickets, submit patches, email the mailing list, and more.
Here are a couple of fun recent NiFi projects to checkout:
https://www.linkedin.com/pulse/nifi-ocr-using-apache-read-childrens-books-jeremy-dyer
https://twitter.com/KayLerch/status/721455415456882689
Good luck on the class project! If you have any questions, the users@nifi.apache.org mailing list would love to help.
Thanks
Joe
Both Apache NiFi and StreamSets Data Collector are Apache-licensed open source tools.
Hortonworks does have a commercially supported variant called Hortonworks DataFlow (HDF).
While both have a lot of similarities, such as a web-based UI, and both are used for ingesting data, there are a few key differences. They also both consist of processors linked together to perform transformations, serialization, etc.
NiFi processors are file-oriented and schemaless. This means that a piece of data is represented by a FlowFile (this could be an actual file on disk, or some blob of data acquired elsewhere). Each processor is responsible for understanding the content of the data in order to operate on it. Thus if one processor understands format A and another only understands format B, you may need to perform a data format conversion in between those two processors.
NiFi can be run standalone, or as a cluster using its own built-in clustering system.
StreamSets Data Collector (SDC), however, takes a record-based approach. What this means is that as data enters your pipeline (whether it's JSON, CSV, etc.) it is parsed into a common format, so that the responsibility of understanding the data format is no longer placed on each individual processor and any processor can be connected to any other processor.
SDC also runs standalone, and in a clustered mode as well, but in that case it runs atop Spark on YARN/Mesos, leveraging existing cluster resources you may have.
NiFi has been around for about the last 10 years (but less than 2 years in the open source community).
StreamSets was released to the open source community a little bit later in 2015. It is vendor agnostic, and as far as Hadoop goes Hortonworks, Cloudera, and MapR are all supported.
Full Disclosure: I am an engineer who works on StreamSets.
They are very similar for data ingest scenarios.
Apache NiFi (HDF) is more mature and StreamSets is more lightweight.
Both are easy to use and both have strong capabilities.
Both have companies behind them, Hortonworks and Cloudera.
Obviously there are more contributors working on NiFi than on StreamSets, and of course NiFi has more enterprise deployments in production.
Two of the key differentiators between the two, IMHO, are:
1) Apache NiFi is a Top Level Apache project, meaning it has gone through the incubation process described here, http://incubator.apache.org/policy/process.html, and can accept contributions from developers around the world who follow the standard Apache process, which ensures software quality. StreamSets is Apache LICENSED, meaning anyone can reuse the code, etc., but the project is not managed as an Apache project. In fact, in order to even contribute to StreamSets, you are REQUIRED to sign a contract: https://streamsets.com/contributing/ . Contrast this with the Apache NiFi contributor guide, which wasn't written by a lawyer: https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide#ContributorGuide-HowtocontributetoApacheNiFi
2) StreamSets "runs atop Spark on YARN/Mesos instead, leveraging existing cluster resources you may have," which imposes a bit of a restriction if you want to deploy your dataflows further toward the edge, where the devices that are generating the data live. Apache MiNiFi, a sub-project of NiFi, can run on a single Raspberry Pi, while I am fairly confident that StreamSets cannot, as YARN or Mesos require more resources than a Raspberry Pi provides.
Disclosure: I am a Hortonworks employee

replacing socket.io with telepat for real time updates

How does telepat-io differ from socket-io and other socket-based real-time systems? What is the underlying technology - is it a wrapper on top of socket-io?
Reading through their website, you can see references to socket.io...
Reading through their code, for example, their client code, you can also find references to socket.io...
It seems to me that the word wrapper doesn't fit, as they focus on creating an optimized design for meshing different technologies to create a real-time application backend... I would go with the word framework if I had to put a name on it. If you like their approach, you'll probably enjoy simplified scaling as this is one of their main concerns.
As Myst pointed out, Telepat is more of a framework, a full stack software. This framework uses socket.io for the notification part of the system: clients manipulate application resources -> API -> workers -> subscribed clients get notified of the changes through various means (Apple Push Notifications, Google Cloud Messaging and web sockets for any other client).
So in short: Telepat uses socket.io for client notifications.

Which tech is available to stream data from social media to Hadoop?

I am searching for technologies that I can use to stream data from social media to Hadoop.
I searched and found these technologies:
Flume.
Storm.
Kafka.
Which tool is the best, and why? Is anyone familiar with some other tools?
Most likely, you will want to use Flume, as it is built to work with HDFS. However, as with all things, it depends.
Kafka is basically a queuing system that is usually used to persist data in the event of a failure in your analytics architecture. If this sounds like what you need, it might be worth looking into RabbitMQ, ZeroMQ, or maybe Kestrel.
Storm is used for complex event processing. If you use Storm, you will be using ZeroMQ under the hood, and will likely have to set up a spout that is hooked up to Kafka or RabbitMQ. If you need to do complicated munging of the data before storage, this might be the right option. There are other options you can use too, like Spark. I'm inclined to suggest Storm purely out of personal preference. I heard that LinkedIn was releasing a realtime complex event processing framework as well, but I can't remember the name of it. I'll update the post when I can find it.
On a different note, if you're asking this question, it might be because you haven't built this thing yet. If that is the case, you might want to look into something other than hadoop if you need streaming. The ecosystem is rapidly expanding, and there are probably many ways to do what you want to do.
Apache Kafka is a distributed messaging system. In brief, it's like you push (publish) messages into a Kafka queue using a Kafka producer, and on the other end you consume them using a Kafka consumer (subscriber). The messages/feeds can be divided into categories called topics. You can run Kafka as a cluster, which makes it very scalable and allows it to be expanded without any downtime.
It could be a nice choice for holding your social media streams. Kafka retains the messages pushed to it for a configurable time, and the best part, from their documentation, is that they say:
Kafka's performance is effectively constant with respect to data size so retaining lots of data is not a problem.
Check out the docs for better visibility.
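To make the publish/consume idea concrete, here is a rough, hedged sketch using the standard Java Kafka client (the broker address, topic name, group id, and sample message are placeholders, and a reasonably recent client version with Duration-based poll is assumed):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SocialMediaKafkaDemo {
  public static void main(String[] args) {
    // Producer: publish one message to a (placeholder) topic.
    Properties producerProps = new Properties();
    producerProps.put("bootstrap.servers", "localhost:9092");  // placeholder broker
    producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
      producer.send(new ProducerRecord<>("social-media", "post-1", "{\"text\":\"hello\"}"));
    }

    // Consumer: subscribe to the same topic and poll for messages.
    Properties consumerProps = new Properties();
    consumerProps.put("bootstrap.servers", "localhost:9092");
    consumerProps.put("group.id", "hadoop-ingest");            // placeholder consumer group
    consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
      consumer.subscribe(Collections.singletonList("social-media"));
      ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
      for (ConsumerRecord<String, String> record : records) {
        System.out.println(record.key() + " -> " + record.value());
      }
    }
  }
}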
Now Storm is a very scalable, fault-tolerant, distributed computation system which can easily be integrated with any queueing system (like Kafka) or database (HDFS/Cassandra, etc.). So you can feed your messages to a Storm cluster for further processing based on your requirements. There is something called KafkaSpout which does a seamless integration between Storm and Kafka.
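A hedged sketch of wiring such a KafkaSpout into a Storm topology, using the older storm-kafka module (package names differ across Storm versions, and the ZooKeeper address, topic name, and ids below are placeholders):

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class KafkaStormTopology {
  public static void main(String[] args) {
    ZkHosts zkHosts = new ZkHosts("localhost:2181");                    // placeholder ZooKeeper
    SpoutConfig spoutConfig = new SpoutConfig(zkHosts, "social-media",  // placeholder topic
        "/kafka-spout", "social-media-spout");
    spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());   // read messages as strings

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
    // Downstream bolts (e.g. an HDFS writer) would be attached here with
    // builder.setBolt(...).shuffleGrouping("kafka-spout");

    new LocalCluster().submitTopology("kafka-demo", new Config(), builder.createTopology());
  }
}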
You should also look at the Kafka-hadoop loader on GitHub, which creates a Hadoop job for incrementally loading messages from Kafka topics onto HDFS with multiple file output semantics.
Also, as @Peter Klipfel said:
you might want to look into something other than hadoop if you need streaming
You can also check other available alternatives like Apache Cassandra, which works great with streaming data at very low latency.
I think it depends on where you are pulling the data from and what you are trying to do with it.
An alternative is to use IBM Streams, where you can pull directly from social media streams and store to many different data stores of your choice.
For example, you can use the streamsx.social toolkit from here: https://github.com/IBMStreams/streamsx.social which allows you to pull tweets directly from an HTTP stream.
Once you get data into Streams, the product also provides many adapters that allow you to store the streaming data into a datastore (e.g. HDFS using streamsx.hdfs, HBase using streamsx.hbase).
I think another consideration is what kind of analytics you are doing with the social media data. If you would like to analyze the social data in-stream before it is stored, IBM Streams also provides a text toolkit that allows you to extract insight from the social data's unstructured text. You can analyze the data without really having to store it anywhere.
Hope it helps!

Which CEP product to start with?

I want to learn more about how to build CEP-based applications. So I looked around and found several products (overview found here: http://rulecore.com/CEPblog/?page_id=47).
But as there are quite a few at the moment, I don't know which is the best to start with. Overall I would only consider the ones available for free; the rest are a bit too expensive for just private use ;)
Esper is free, but without Esper Studio it seems quite tedious to develop a CEP app. StreamBase offers a free trial, but I couldn't find out how long you can use it (if only for a month, that is not very helpful for longer research). The Oracle CEP suite seems quite complete, but in the CEP scene - as far as I can see - it is the least recognized compared to Esper or StreamBase.
So do you have any hints on the best way to start with CEP development? Is it worth spending time working through the Oracle documentation, or is it better to start with Esper or StreamBase?
Cheers,
Andreas
Microsoft's CEP offering is StreamInsight, which closely resembles the reactive programming model of the Rx Framework and LINQ.
A Hitchhiker's Guide to StreamInsight Queries is a good place to start.
Some Code Examples
I would recommend using LINQPad which can connect to Stream Insight as a canvas for your queries.
The current CEP tools do not solve identical problems! So depending on what you would like to do, you would use different tools. In short, my personal choices would be:
For building data-driven algorithms, coding in a type of SQL with extensions: the Coral8 engine from Aleri. Free for test and development (it was, anyway, before it was bought by Aleri).
For detecting event patterns (situations), with no coding (declarative style) but configuration using XML: RuleCore, with a free test subscription to the (web) service.
For a mix of both, with low-level control and coding in Java: Esper, GPL (see the sketch after this list).
For creating data-driven computation logic using a graphical boxes-and-arrows style of GUI: StreamBase.
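To give a feel for the Esper option mentioned above, here is a rough, hedged sketch using the older (pre-8.x) Esper client API; the StockTick event class and the 30-second average-price query are made up purely for illustration:

import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;

public class EsperDemo {
  // Simple event POJO; Esper reads its properties through the getters.
  public static class StockTick {
    private final String symbol;
    private final double price;
    public StockTick(String symbol, double price) { this.symbol = symbol; this.price = price; }
    public String getSymbol() { return symbol; }
    public double getPrice() { return price; }
  }

  public static void main(String[] args) {
    EPServiceProvider epService = EPServiceProviderManager.getDefaultProvider();
    epService.getEPAdministrator().getConfiguration().addEventType("StockTick", StockTick.class);

    // Continuous query: average price per symbol over a sliding 30-second window.
    EPStatement stmt = epService.getEPAdministrator().createEPL(
        "select symbol, avg(price) as avgPrice from StockTick.win:time(30 sec) group by symbol");
    stmt.addListener((newEvents, oldEvents) -> {
      if (newEvents != null) {
        System.out.println(newEvents[0].get("symbol") + " -> " + newEvents[0].get("avgPrice"));
      }
    });

    // Push a couple of events into the engine.
    epService.getEPRuntime().sendEvent(new StockTick("IBM", 75.0));
    epService.getEPRuntime().sendEvent(new StockTick("IBM", 76.0));
  }
}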
I think the best choice is to compare the solutions that are freely available and then make something with them.
I'm not sure what your end goals are, if it's to learn a technology that you use at work or just to play around with something cool, but for me on a project like this, the deciding factor would be which tool can I use to make something I could share with the world.
In this case, my options would probably be Esper or OpenESB. That way, I could put the project on a resume (especially if I was applying for a job that used CEP tools) and share it with the world.
You could read the blog of Curt Monash (http://www.dbms2.com); he writes about things like CEP.
Would there be any interest in a free subscription to the ruleCore (cloud, SaaS, or whatever these are called today) service? It would be running on smaller and less reliable (no cluster) hardware and would probably only be usable for testing out small, low-performance kinds of things. If support@rulecore.com gets a couple of requests of this kind, I'm sure it will be put onto the todo list...
For detecting event patterns I found that ruleCore is pretty easy to use. I have only tried to detect patterns of low and medium complexity, and that did work fine. It takes some time to get used to the concepts, but it is actually a very small system so it was not that bad. And you need to like XML, as everything is done using XML.
If you are trying to create a trading application then StreamBase would be better. But for monitoring stuff rulecore feels better.
If you have continuous streams (market feeds, IoT sensors, Twitter, news, etc), then stream processing technology is the right choice for you. Stream processing / streaming analytics is only a part of different CEP solutions (streams, rules, patterns, etc.).
There are several open source options for stream processing in the meantime, e.g. Apache Storm, Apache Spark or Apache Samza, but also powerful proprietary products such as IBM InfoSphere Streams, TIBCO StreamBase or Software AG's Apama.
Take a look at my blog post and article for more details about the different stream processing and streaming analytics solutions (open source and proprietary):
Comparison of Stream Processing and Streaming Analytics Alternatives (Apache Storm, Spark, IBM InfoSphere Streams, TIBCO StreamBase, Software AG Apama)
I would start with the free trial of Aleri Coral8 (currently Sybase).
