I would like to use Apache Flink to process events inside an application.
My tests on a standalone JVM worked reasonably well, though Flink is a really big dependency.
I also tried to get it running in OSGi but gave up for now because of the many dependencies.
So my question is:
How small can I make Flink? I currently tried it with the Maven dependency on flink-streaming-java.
Unfortunately, this depends on or embeds (listing only the questionable ones):
flink-shaded-hadoop2
kryo
zookeeper
netty
jetty
apache http client
apache http core
scala
akka
jackson
It also looks like several jars embed the same libraries again and again, like some Google libraries and ASM.
So is there some way to get a slimmer version of Flink for local usage that does not depend on so many libraries?
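For reference, the kind of usage I have in mind is just a small local pipeline along these lines (a minimal sketch; the job runs entirely inside the current JVM):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LocalPipeline {
    public static void main(String[] args) throws Exception {
        // Executes the pipeline in-process, no cluster or external services involved
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

        env.fromElements("a", "b", "c")
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String value) {
                   return value.toUpperCase();
               }
           })
           .print();

        env.execute("local-pipeline");
    }
}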
Many of the dependencies are required for Apache Flink's primary use cases, namely distributed stream and batch processing:
ZooKeeper for high availability in case of (process) failures
Netty for data network transfer
Jetty for monitoring via REST API and web dashboard
Akka (and transitively Scala) for coordination of distributed processes
Most of these libraries are tightly coupled with the system and cannot be easily switched off or excluded.
I am sorry, but there is no stripped-down version for local stream processing.
Related
I have been trying for a while now and I'm sure the solution is simple enough, I'm just struggling to find it. I'm pretty new, so be easy on me!
It's a requirement to do this using a premade init script, which is then selected in the UI when configuring the cluster.
I am trying to install com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.18 to a cluster on Azure Databricks. Following the documentation's example (which installs a PostgreSQL driver), they produce an init script using the following command:
dbutils.fs.put("/databricks/scripts/postgresql-install.sh","""
#!/bin/bash
wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar""", True)
My question is, what is the /mnt/driver-daemon/jars/postgresql-42.2.2.jar section of this code? And what would I have to do to make this work for my situation?
Many thanks in advance.
/mnt/driver-daemon/jars/postgresql-42.2.2.jar here is the output path where the jar file will be put. But it makes no sense, as this jar won't be put onto the classpath and won't be found by Spark. Jars need to be put into the /databricks/jars/ directory, where they will be picked up by Spark automatically.
But this method of downloading jars works only for jars without dependencies, and for libraries like the EventHubs connector this is not the case: they won't work unless their dependencies are downloaded as well. Instead, it's better to use the Cluster UI or the Libraries API (or the Jobs API for jobs); with these methods, all dependencies will be fetched as well.
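For illustration, here is a rough sketch of attaching the Maven coordinates to a cluster via the Libraries API install endpoint (/api/2.0/libraries/install), using plain Java's built-in HTTP client; the workspace URL, token and cluster ID are placeholders you would need to fill in, and using curl or the Databricks CLI works just as well:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class InstallLibrary {
    public static void main(String[] args) throws Exception {
        // Placeholders: workspace URL, personal access token and cluster ID
        String workspace = "https://<your-workspace>.azuredatabricks.net";
        String token = "<personal-access-token>";
        String body = "{\"cluster_id\": \"<cluster-id>\","
                + " \"libraries\": [{\"maven\": {"
                + "\"coordinates\": \"com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.18\"}}]}";

        // POST the install request; Databricks resolves the transitive dependencies itself
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(workspace + "/api/2.0/libraries/install"))
                .header("Authorization", "Bearer " + token)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}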
P.S. But really, instead of using the EventHubs connector, it's better to use the Kafka protocol, which EventHubs supports as well. There are several reasons for that:
It's better from a performance standpoint
It's better from a stability standpoint
The Kafka connector is included in DBR, so you don't need to install anything extra
You can read how to use Spark + EventHubs + Kafka connector in the EventHubs documentation.
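As a rough sketch (using Spark's built-in Kafka source; the namespace, event hub name and connection string are placeholders), reading from Event Hubs over the Kafka protocol looks roughly like this:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EventHubsViaKafka {
    public static void main(String[] args) throws Exception {
        // On Databricks a SparkSession already exists; getOrCreate() reuses it
        SparkSession spark = SparkSession.builder().getOrCreate();

        // Event Hubs exposes a Kafka-compatible endpoint on port 9093; authenticate
        // with the connection string via SASL PLAIN. All values here are placeholders.
        // Note: on Databricks runtimes the login module class is the shaded variant
        // (kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule).
        String bootstrap = "<namespace>.servicebus.windows.net:9093";
        String jaas = "org.apache.kafka.common.security.plain.PlainLoginModule required "
                + "username=\"$ConnectionString\" password=\"<event-hubs-connection-string>\";";

        Dataset<Row> stream = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", bootstrap)
                .option("subscribe", "<event-hub-name>")
                .option("kafka.security.protocol", "SASL_SSL")
                .option("kafka.sasl.mechanism", "PLAIN")
                .option("kafka.sasl.jaas.config", jaas)
                .load();

        // Dump the message payloads to the console, just to show the plumbing works
        stream.selectExpr("CAST(value AS STRING)")
              .writeStream()
              .format("console")
              .start()
              .awaitTermination();
    }
}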
For my latest project I used Spring Boot, and I am preparing to deploy it to the production environment. I want to know which way of running the application has better performance, or whether they perform the same:
generate a war package and put it in a standalone Tomcat
generate a jar package and use the embedded Tomcat
In addition, when publishing to the production environment, should I remove the devtools dependency?
This is a broad question. The answer is it depends on your requirements.
Personally, I prefer standalone applications with Spring Boot today. One app, one JVM. It gives you more flexibility and reliability in regard to deployments and runtime behaviour. Spring Boot 1.3.0.RELEASE comes with init scripts which allow you to run your Spring Boot application as a daemon on a Linux server. For instance, you can integrate the rpm-maven-plugin into your build pipeline in order to package and publish your application as an RPM for deployment, or you can dockerize your application easily.
With a classic deployment into a servlet container like Tomcat you will be facing various memory leaks after redeployment for example with logging frameworks, badly managed thread local objects, JDBC drivers and a lot more.
Either you spend time fixing all of those memory leaks inside your application and the frameworks you use, or you just restart the servlet container after a deployment. Running your application as a standalone version, you don't care about those memory leaks, because you are forced to restart anyway in order to bring your new version up.
In the past, several webapps ran inside one servlet container. This could lead to performance degradation for all webapps, because every webapp has its own memory, CPU and GC characteristics which may interfere with each other. Furthermore, resources like thread pools were shared among all webapps.
In fact, a standalone application is not safe from performance degradation due to high load on the server, but it does not interfere with others in respect of memory utilization or GC. Keep in mind that performance or GC tuning is much simpler if you can focus on the characteristics of just one application. It gets complicated as soon as you need to find a common denominator for several webapps in one servlet container.
In the end, your decision may depend on your work environment. If you are building an application in a corporation where software is running and maintained by operations, it is more likely that you are forced to build a war. If you have the freedom to choose your deployment target, then I recommend a standalone application.
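If you do end up having to support both deployment models, one common approach is a single application class that works for the executable jar and for a war deployment alike. A sketch (class and package names are made up; the SpringBootServletInitializer import shown is the Spring Boot 2.x location, it lived in a different package in 1.x):

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.builder.SpringApplicationBuilder;
import org.springframework.boot.web.servlet.support.SpringBootServletInitializer;

@SpringBootApplication
public class MyApplication extends SpringBootServletInitializer {

    // Entry point when running as an executable jar with the embedded Tomcat
    public static void main(String[] args) {
        SpringApplication.run(MyApplication.class, args);
    }

    // Entry point when deployed as a war into a standalone Tomcat
    @Override
    protected SpringApplicationBuilder configure(SpringApplicationBuilder builder) {
        return builder.sources(MyApplication.class);
    }
}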
In order to remove devtools from a production build, you can set the excludeDevtools build property to completely remove the JAR. The property is supported by both the Maven and Gradle plugins.
See Spring Boot documentation.
We are developing an open source trading platform based on the Spring Framework and Hibernate: http://code.google.com/p/algo-trader/ and http://www.algotrader.ch. The application consists of a trading framework and several strategies that can be started independently. So far, these different parts have been running in separate JVMs communicating through RMI and JMS.
To avoid unnecessary serialization and network overhead we would like to run the entire application within some sort of container (potentially an application server). We do however have the requirement, that the individual parts of the application can be deployed, started and stopped independently.
We have looked into OSGi, but a lot of the libraries that we use are not OSGi ready yet, so this is not currently an option. Also please note, there is no web-GUI in our application.
Any suggestions on this?
Thanks
Andy
If OSGi is not an option, then the functionality can be broken into smaller units and deployed as utility jars; deployed as utility jars, they can be managed independently.
For an application server, I feel either GlassFish or JBoss would be a good option, considering they are open source and free.
Though at a later point in time you could look at WebLogic (free for development use).
So in your case you would want to break out the static data configuration (counterparty, currencies) and dealing (pricing, quoting, booking) as two separate features.
For your choice of application server, I advise JBoss, especially version 7.1, which is faster and more stable!
We're working on an OSGi-based infrastructure for processing stream-based data flows. Specific processing tasks are executed by individual OSGi components. We now need the possibility to distribute those components over different machines, which means, we need some kind of communication mechanism between OSGi components/containers.
During my research I came across different potential solutions: R-OSGi, Apache CXF for Distributed OSGi, Eclipse Communication Framework.
ECF seems particularly interesting, as it supports different transport formats and provides support for things like service discovery.
My central questions:
Are there any detailed tutorials/walk-throughs for setting up an ECF infrastructure within Felix? (From my research, I found that Felix support has been added recently.)
Are there any solutions besides the three listed above which I might have missed?
Is there a reason for taking Apache CXF instead of ECF?
The first question, whether there is a detailed walk-through for setting up ECF with Felix, I don't know the answer to, though one might use a search engine to look for combinations of those terms.
The problem is that ECF uses the Equinox infrastructure, and has at times inadvertently relied on packages that are non-public through transitive dependencies (particularly the Runtime API, which uses Equinox for non-public debugging). This, in turn, means that ECF relies on a whole host of other components to be available, and it's this set which typically isn't well defined on a Felix runtime.
You have missed out Paremus' Service Fabric, which is a commercial OSGi cloud solution. I'm not sure if you were specifically focussing on open-source or not; but if you are including commercial licenses then they have a very robust architecture for remote services.
Finally, the Apache CXF over ECF question -- if you're using Felix, I'd argue that going with Apache CXF is probably easier than going with ECF. This is mainly due to the dependency set and getting it working, combined with the fact that ECF may not be tested on Felix and so may assume particular aspects of the Equinox runtime (which includes, for example, the runtime's parent classloader delegation to pick up things on the boot classpath). This isn't really the fault of ECF per se, but rather an artefact of how the Eclipse ecosystem works.
If you want to communicate with non-OSGi runtimes, there's an advantage to Apache CXF in that it can generate WSDL for interaction with other languages. I believe that you can do the same thing in ECF with a bit more work. The CXF solution is likely to be more verbose than a corresponding ECF one (WSDL always is), but if you're not using high volumes of requests this isn't likely to make a significant difference.
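For what it's worth, both CXF DOSGi and ECF implement the OSGi Remote Services specification, so the service side mostly boils down to registering your service with the standard export properties. A sketch (GreeterService and its implementation are hypothetical, and the CXF-specific config property names follow CXF DOSGi's conventions, so check them against its documentation):

import java.util.Dictionary;
import java.util.Hashtable;

import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;

public class Activator implements BundleActivator {

    @Override
    public void start(BundleContext context) {
        Dictionary<String, Object> props = new Hashtable<>();
        // Standard OSGi Remote Services export property
        props.put("service.exported.interfaces", "*");
        // Provider-specific hints (CXF DOSGi conventions; treat these as assumptions)
        props.put("service.exported.configs", "org.apache.cxf.ws");
        props.put("org.apache.cxf.ws.address", "http://0.0.0.0:9090/greeter");

        // GreeterService / GreeterServiceImpl are hypothetical placeholders
        context.registerService(GreeterService.class, new GreeterServiceImpl(), props);
    }

    @Override
    public void stop(BundleContext context) {
        // Registrations made via this context are cleaned up automatically on stop
    }
}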
Is anyone using Karaf instead of Servicemix? If so, how did you come to this decision? I'm aware that Servicemix adds a layer of functionality around Karaf, just curious if Karaf is being used on its own and why...
We're using Karaf for a number of our applications. We were already using Camel (JMS and Esper) for integration between several different platforms (a JBoss 4.2 instance, a Tomcat and several Felix instances) and as this was working well there was little justification in migrating this too (which would have been cause to consider ServiceMix).
The only reason we have some Felix nodes, is that they're limited in use (on client desktops), rarely need/get updated and I wanted the smallest footprint for these nodes. For anything OSGi on the serverside we're using Karaf.
Karaf provides all of the features you'd expect and need for a production environment (see the apache-karaf tag's info). We do our development and testing against a standard minimal framework (using pax-exam) but deploy to Karaf.
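For anyone curious what that looks like, a minimal pax-exam test is roughly the following (a sketch against a recent pax-exam 4.x; the annotations and packages have moved around between versions, and the bundle coordinates are placeholders):

import javax.inject.Inject;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.ops4j.pax.exam.Configuration;
import org.ops4j.pax.exam.Option;
import org.ops4j.pax.exam.junit.PaxExam;
import org.osgi.framework.BundleContext;

import static org.junit.Assert.assertNotNull;
import static org.ops4j.pax.exam.CoreOptions.junitBundles;
import static org.ops4j.pax.exam.CoreOptions.mavenBundle;
import static org.ops4j.pax.exam.CoreOptions.options;

@RunWith(PaxExam.class)
public class MyBundleIT {

    // Injected by pax-exam once the framework is up
    @Inject
    private BundleContext bundleContext;

    @Configuration
    public Option[] config() {
        return options(
                // the bundle under test; coordinates are placeholders
                mavenBundle("com.example", "my-bundle", "1.0.0"),
                junitBundles());
    }

    @Test
    public void bundleContextIsAvailable() {
        assertNotNull(bundleContext);
    }
}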
If you don't need an ESB, JCA, BPEL, etc. but want a solid, tunable OSGi container, then Karaf on its own is more than adequate. (And if you find yourself needing a limited subset of ServiceMix's functionality, you can always install those features in a Karaf instance.)
You can also customise the Karaf distribution as part of a Maven build. Personally I like having the container as part of the application's build, as I can check out, build and run the entire setup from the command line in minimal time.
Recently, a clustering subproject of Karaf called Cellar has appeared, based on Hazelcast; I'm not sure if this applies to ServiceMix too.
Karaf's life started as the ServiceMix core. Currently, ServiceMix is really a set of bundles that are deployed into a Karaf container. ServiceMix has a number of very handy bundles which do a lot of cool stuff that Karaf doesn't. That said, the two primary reasons for using ServiceMix are if you want:
1) an ESB,
2) NMR (a feature that allows you to communicate between bundles AND instances of Karaf).
This all said, the ServiceMix group is currently planning version 5, which will remove the ESB and NMR features and will be focused on being a management container for Camel. In ESBs, a great deal of effort went into creating components that could be described using BPEL (Business Process Execution Language). However, the folks that wrote ServiceMix began to focus on the implementation of EIPs (Enterprise Integration Patterns), which largely do the same stuff as BPEL, but in a more standardized and accepted manner. This work was done under the Camel project.
So, in short: if you are using ServiceMix 4+, you're also using Karaf. If you want a more robust integration environment, the environment of choice today (in the Apache/Felix world at least) is Karaf, Camel, and a few bundles from ServiceMix.
Here's a little comparative illustration I made. Going from the simplest case (JVM with OSGi functions provided by Apache Felix at the bottom), to more complete/manageable OSGi functions (Apache Karaf in the middle), to enough functions to implement complete ESB instances (Apache ServiceMix at the top) (note that "an ESB" is not a product but a set of endpoints, routers, databases, ETL functions and whatnot configured together in a particular task-specific way).
Karaf does NOT come with CXF.
It's the pure, extracted kernel of ServiceMix. However, you can install CXF on Karaf as below (recent Karaf shell syntax).
karaf@root()> feature:repo-add cxf
Once the feature repository is added, we can see the provided features by using the following command.
karaf@root()> feature:list | grep cxf
To install CXF, fire the command below.
karaf@root()> feature:install cxf