Elasticsearch / Storm integration methods - hadoop

Looking for a simple integration path between Elasticsearch and Apache Storm. Support for this is included in the elasticsearch-hadoop library, but it pulls in a ton of dependencies from the Hadoop stack, from Hive to Cascading, that I simply don't need. Has anyone out there succeeded in this integration without bringing in elasticsearch-hadoop? Thanks.

In my project we're using the rabbitmq river for indexing the Storm output. It's a very efficient and convenient way to write to Elasticsearch: you basically put the messages on the queue and the river does the rest. If something gets stuck, the data is simply buffered on the queue.
So I would say: use the river approach for writing, and the Elasticsearch Java API for reading, as Kit Menke suggests (or the Jest client, which we've found quite good; it offers an async API based on ApacheHttpAsyncClient, though we read from Elasticsearch in separate services rather than in the Storm topology).
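
To make the writing side concrete, here is a minimal sketch of a Storm bolt that publishes messages to the queue the rabbitmq river watches. It assumes Storm's org.apache.storm packages and the river's default "elasticsearch" queue; the bulkJson tuple field (already formatted for the Elasticsearch bulk API, which is what the river expects) is a hypothetical name:

    import java.util.Map;

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Tuple;

    public class RabbitMqIndexBolt extends BaseBasicBolt {

        private transient Connection connection;
        private transient Channel channel;

        @Override
        public void prepare(Map stormConf, TopologyContext context) {
            try {
                ConnectionFactory factory = new ConnectionFactory();
                factory.setHost("localhost"); // assumed broker address
                connection = factory.newConnection();
                channel = connection.createChannel();
                // Default queue the rabbitmq river consumes from.
                channel.queueDeclare("elasticsearch", true, false, false, null);
            } catch (Exception e) {
                throw new RuntimeException("Could not connect to RabbitMQ", e);
            }
        }

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            try {
                // The river expects bulk-API formatted message bodies.
                byte[] body = input.getStringByField("bulkJson").getBytes("UTF-8");
                channel.basicPublish("", "elasticsearch", null, body);
            } catch (Exception e) {
                throw new RuntimeException("Publish failed", e);
            }
        }

        @Override
        public void cleanup() {
            try {
                connection.close();
            } catch (Exception ignored) {
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt: nothing to declare.
        }
    }

With this in place the topology never talks to Elasticsearch directly; if the cluster is down, messages simply accumulate in RabbitMQ until the river catches up.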

Related

Versions for integration of Apache Flink, Elasticsearch and Kafka

I have problems with the different versions of Flink, Kafka and Elasticsearch. I'm using Flink 1.8.1, but I don't know what version to use for Kafka. On the other hand, I want to use version 6 of Elasticsearch. Which versions do you think are suitable for Flink, Kafka and Elasticsearch?
I found a link showing a Kafka connector version, but in the comments section it is introduced as a beta.
As listed in the table, Kafka 0.11 (and higher) will work fine. The beta label applies to the Flink connector, not to Kafka itself.
Plus, Kafka Connect for Elasticsearch, should you choose to use it, works with Elasticsearch 6.
As @cricket_007 said, it's safe to use the Kafka connector, even though it is labeled beta (a label which should be removed, as this connector has now been battle-tested in production for over a year).
The setup Kafka -> Flink -> ES6 is quite common, so you can and should use recent versions of all the components involved.
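
For reference, a minimal sketch of that pipeline with Flink 1.8 and the universal flink-connector-kafka plus flink-connector-elasticsearch6 artifacts; the broker/cluster addresses and the "events" topic and index names are placeholders:

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
    import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
    import org.apache.http.HttpHost;
    import org.elasticsearch.client.Requests;

    public class KafkaToEs6Job {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.setProperty("group.id", "flink-demo");

            FlinkKafkaConsumer<String> source =
                    new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props);

            ElasticsearchSink.Builder<String> sink = new ElasticsearchSink.Builder<>(
                    Collections.singletonList(new HttpHost("localhost", 9200, "http")),
                    (ElasticsearchSinkFunction<String>) (element, ctx, indexer) ->
                            indexer.add(Requests.indexRequest()
                                    .index("events")
                                    .type("_doc") // ES6 still requires a type
                                    .source(Collections.singletonMap("message", element))));

            env.addSource(source).addSink(sink.build());
            env.execute("kafka-to-es6");
        }
    }

The matching Maven artifacts for Flink 1.8.1 would be flink-connector-kafka_2.11 and flink-connector-elasticsearch6_2.11, both in the same version as Flink itself.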

What options do I have regarding indexing PDFs while running on Elasticsearch 1.x and Spring Data 1.x, especially if I want to upgrade?

We have a new requirement on our Elasticsearch - to index PDFs. We are still running on Elasticsearch version 1.x (and Spring Data 1.3.4).
I looked at the documentation for Elasticsearch 5, which has new ways of supporting PDFs (and I would like to upgrade).
So, given all this, the way I see it I have the following options:
Sit tight and wait for Spring Data to support Elasticsearch 5. This is viable if it is not too far away (please let us know, Spring Data and Elasticsearch devs), although given the business urgency of this feature I don't think I have much leeway.
Move off Spring Data altogether. This is not as crazy as it sounds, because given the complexity of my queries I don't use the Spring Data repositories a great deal; I do, however, use them for inserting data. I would have to provide my own implementations of the current repository interfaces. It would be work, but I wouldn't need to wait for anyone and wouldn't need to use any outdated plugins.
Somehow run Elasticsearch 5 with Spring Data 2.x/3.x. Will this work at all? Chances are it probably won't even start up.
Upgrade my Elasticsearch/Spring Data to 2.x and use the "old" way of indexing PDFs.
Which option is the best way to go?
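
For context, the "new way" in Elasticsearch 5 is the ingest attachment processor, which replaces the old mapper-attachments plugin. A minimal sketch using the 5.x low-level REST client, assuming the ingest-attachment plugin is installed and treating the index, pipeline and file names as placeholders:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Base64;
    import java.util.Collections;

    import org.apache.http.HttpHost;
    import org.apache.http.entity.ContentType;
    import org.apache.http.nio.entity.NStringEntity;
    import org.elasticsearch.client.RestClient;

    public class PdfIngestSketch {
        public static void main(String[] args) throws Exception {
            try (RestClient client =
                         RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {

                // 1. Define a pipeline with the attachment processor
                //    (requires: bin/elasticsearch-plugin install ingest-attachment).
                String pipeline = "{\"processors\":[{\"attachment\":{\"field\":\"data\"}}]}";
                client.performRequest("PUT", "/_ingest/pipeline/pdf",
                        Collections.<String, String>emptyMap(),
                        new NStringEntity(pipeline, ContentType.APPLICATION_JSON));

                // 2. Index a base64-encoded PDF through the pipeline; the processor
                //    extracts the text into the attachment.content field.
                byte[] pdf = Files.readAllBytes(Paths.get("doc.pdf")); // placeholder file
                String doc = "{\"data\":\"" + Base64.getEncoder().encodeToString(pdf) + "\"}";
                client.performRequest("PUT", "/docs/doc/1",
                        Collections.singletonMap("pipeline", "pdf"),
                        new NStringEntity(doc, ContentType.APPLICATION_JSON));
            }
        }
    }

Note that this path goes straight through the REST client, so it would also fit the second option of bypassing Spring Data for this feature.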

Spring XD on YARN: ver 1.2.1 direct binding support for kafka source

1. I know this is not supported yet (as of ver 1.3.0); any definite date/version would help our project schedule.
2. Direct binding support for the kafka source is very critical for our project. We are in a situation where we would totally abandon Spring XD on YARN just because of this.
What I'm trying to do:
stream create --name directkafkatohdfs --definition "kafka | hdfs"
stream deploy directkafkatohdfs --properties "module.*.count=0"
Hitting the exception "must be a positive number. 0-count kafka sources are not currently supported"
I just want to eliminate the use of the message bus/transport (redis/kafka/rabbitMQ) and have a direct binding of the source (kafka) and the sink (hdfs) in the same YARN container.
Thanks
Satish Srinivasan
satsrinister@gmail.com
Thanks for the interest in Spring XD :).
For Spring XD 1.x, we suggest using composition instead of direct binding with the Kafka bus - or, in your case, the Kafka source. However, apart from that, in Spring XD 1.x it is not possible to create an entire stream without at least one hop over the bus (regardless of the type of bus or modules being used).
We are addressing direct binding (including support for entire directly bound streams) as part of Spring Cloud Data Flow (http://cloud.spring.io/spring-cloud-dataflow/) - which is the next evolution of Spring XD. We are intending to support it as a specific configuration option, rather than as a side-effect of zero-count modules. From an end-user perspective, SCDF supports the same DSL as Spring XD (with minor variations) and has the same administration UI, and definitely supports YARN, so it should be a fairly seamless transition. I would suggest starting to take a look at that. The upcoming 1.0.0.M2 release of Spring Cloud Data Flow will not support direct binding via DSL yet, but the intent is to support it in the final release which is currently planned for Q1 2016.

Flink for embedded stream processing in OSGi

I would like to use Apache Flink to process events inside an application.
My tests on a standalone JVM worked reasonably well, though Flink is a really big dependency.
I also tried to get it running in OSGi, but gave up for now because of the many dependencies.
So my question is:
How small can I make Flink? I currently tried with the Maven dependency on flink-streaming-java.
Unfortunately this depends on or embeds (only listing the questionable ones):
flink-shaded-hadoop2
kryo
zookeeper
netty
jetty
apache http client
apache http core
scala
akka
jackson
It also looks like several jars embed the same libs again and again, like some Google libs and ASM.
So is there some way to get a slimmer version of Flink for local usage that does not depend on so many libs?
Many of the dependencies are required for Apache Flink's primary use cases, namely distributed stream and batch processing:
ZooKeeper for high availability in case of (process) failures
Netty for data network transfer
Jetty for monitoring via the REST API and web dashboard
Akka (and, transitively, Scala) for coordination of distributed processes
Most of these libraries are tightly coupled with the system and cannot be easily switched off or excluded.
I am sorry, but there is no stripped-down version for local stream processing.
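
For what it's worth, embedding Flink in a plain JVM application stays simple even with the heavy classpath; a minimal sketch against any Flink 1.x:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class EmbeddedFlinkDemo {
        public static void main(String[] args) throws Exception {
            // Runs the whole pipeline inside the local JVM; no cluster needed.
            StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

            env.fromElements(1, 2, 3)
               .map(new MapFunction<Integer, Integer>() {
                   @Override
                   public Integer map(Integer value) {
                       return value * 2;
                   }
               })
               .print();

            env.execute("embedded-demo");
        }
    }

This runs without any external processes, but the dependency tree discussed above still comes along unchanged.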

Spring Integration as embedded alternative to standalone ESB

Does anybody have experience with the Spring Integration project as an embedded ESB?
I'm highly interested in use cases such as:
Reading files from a directory on a schedule
Getting data from a JDBC data source
Modularity and the possibility to start/stop/redeploy modules on the fly (e.g. one module scans a directory on a schedule, another queries a JDBC data source, etc.)
Repeat/retry policy
UPDATE:
I found answers to all my questions except "Getting data from a JDBC data source". Is it technically possible?
Remember, "ESB" is just a marketing term designed to sell more expensive software, it's not a magic bullet. You need to consider the specific jobs you need your software to do, and pick accordingly. If Spring Integration seems to fit the bill, I wouldn't be too concerned if it doesn't look much like an uber-expensive server installation.
The Spring Integration JDBC adapters are available in 2.0, and we just released GA last week. Here's the relevant section from the reference manual: http://static.springsource.org/spring-integration/docs/latest-ga/reference/htmlsingle/#jdbc
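To give an idea of the reading side, here is a minimal sketch of the JDBC inbound adapter, assuming a recent Spring Integration version (in the 2.0-era API the Message type lives in a different package); the table, columns and the createDataSource() helper are hypothetical:

    import javax.sql.DataSource;

    import org.springframework.integration.jdbc.JdbcPollingChannelAdapter;
    import org.springframework.messaging.Message;

    public class JdbcPollingSketch {
        public static void main(String[] args) {
            DataSource dataSource = createDataSource(); // hypothetical helper: wire in your pool

            // Each poll runs the query and emits the rows as the message payload.
            JdbcPollingChannelAdapter adapter = new JdbcPollingChannelAdapter(
                    dataSource, "SELECT * FROM orders WHERE processed = 0");
            // Mark rows as consumed so the next poll does not re-read them.
            adapter.setUpdateSql("UPDATE orders SET processed = 1 WHERE id IN (:id)");

            // Normally driven by a poller; calling receive() directly shows the mechanics.
            Message<Object> message = adapter.receive(); // null when no rows match
            if (message != null) {
                System.out.println(message.getPayload()); // a List of row Maps
            }
        }

        private static DataSource createDataSource() {
            throw new UnsupportedOperationException("replace with a real connection pool");
        }
    }

In a real deployment this adapter would be registered as an inbound channel adapter with a poller rather than called directly.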
This link describes the FileSucker with Spring Integration. Read up on the Enterprise Integration Patterns for more info.
I kind of think you need to do a bit more investigation yourself, or try out a couple of your use cases. Then we can discuss what's good and bad.
JDBC Adapters appear to be a work in progress.
Even if there is no specific adapter available, remember that Spring Integration is a thin wrapper around POJOs. You'll be able to access JDBC from any component, e.g. your service activators.
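A sketch of that POJO route, using Spring's JdbcTemplate inside a class you would wire up as a service activator; the table and column names are hypothetical:

    import java.util.List;
    import java.util.Map;

    import javax.sql.DataSource;

    import org.springframework.jdbc.core.JdbcTemplate;

    public class OrderLookupService {

        private final JdbcTemplate jdbc;

        public OrderLookupService(DataSource dataSource) {
            this.jdbc = new JdbcTemplate(dataSource);
        }

        // Invoked by the framework with the message payload (a customer id);
        // the returned rows become the payload of the reply message.
        public List<Map<String, Object>> findOrders(long customerId) {
            return jdbc.queryForList("SELECT * FROM orders WHERE customer_id = ?", customerId);
        }
    }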
See here for a solution based on a polling inbound channel adapter too.
