We have existing XQuery files in Oracle Service Bus projects. We want to reuse those XQuery files in IIB projects, which would save a lot of development effort since so many XQuery files are already available across different projects. So we have created a POC that uses the Saxon processor library to execute the XQuery files in a JavaCompute node, passing the input message available in the MbMessageAssembly to an XQPreparedExpression; this is working fine and we are getting the response. My question is: does this approach have any performance impact? Could you please suggest whether it causes any issues or whether it is okay to follow this approach?
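For context, here is a minimal sketch of that kind of setup, assuming Saxon's XQJ implementation (net.sf.saxon.xqj.SaxonXQDataSource); the helper class name, the script location, and the way the input XML string is extracted from the MbMessage are assumptions to adapt to your flow:

```java
import java.io.InputStream;
import java.io.StringReader;
import java.util.Properties;

import javax.xml.transform.stream.StreamSource;
import javax.xml.xquery.XQConnection;
import javax.xml.xquery.XQConstants;
import javax.xml.xquery.XQPreparedExpression;
import javax.xml.xquery.XQResultSequence;

import net.sf.saxon.xqj.SaxonXQDataSource;

// Hypothetical helper a JavaCompute node could delegate to. The XQuery is
// prepared once and reused, since re-compiling it per message is usually
// the dominant cost.
public class XQueryTransformer {

    private final XQConnection connection;
    private final XQPreparedExpression expression;

    public XQueryTransformer(InputStream xqueryScript) throws Exception {
        connection = new SaxonXQDataSource().getConnection();
        expression = connection.prepareExpression(xqueryScript);
    }

    /** Runs the prepared XQuery against one XML payload taken from the MbMessage. */
    public String transform(String inputXml) throws Exception {
        expression.bindDocument(XQConstants.CONTEXT_ITEM,
                new StreamSource(new StringReader(inputXml)), null);
        XQResultSequence result = expression.executeQuery();
        return result.getSequenceAsString(new Properties());
    }

    public void close() throws Exception {
        connection.close();
    }
}
```

Preparing the expression once (for example when the node is first constructed) and reusing it per message, rather than recompiling the XQuery every time, tends to be the biggest single factor in keeping the overhead down.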
I've never used XQuery, so my answer might not be fully accurate, but there are some recommendations from IBM on how to use external classes such as JAXB inside IIB, so they are not especially against such a scenario.
Of course, it will most likely have worse performance than the same code written in optimized ESQL, but in your case I think a proof of concept is the right call. Measure the average execution time on Oracle Service Bus, compare it to the IIB one, and then you should be able to decide whether to go with that solution or whether you need to rewrite everything.
I see broad adoption of DataWeave, which I feel is more of a transformation library, much like FreeMarker or Velocity.
In the case of DW, a change in transformation logic requires a change in code; the very reason template engines became popular in the first place was to separate logic from code, so that we can change transformation logic without needing to rebuild/repackage our code (less deployment hassle).
Can anyone point out a few reasons why one would prefer DW?
TLDR: If you're looking for a template engine for things like static websites, DataWeave definitely isn't the right choice. Use the right tool for the job. Also, while you can use DataWeave outside of Mule, I don't think I've seen anyone adopt DataWeave who hasn't adopted MuleSoft.
A few things to consider (and most of these I'm stating in the context of developing Mule applications):
These template engines are, typically, for outputting static text. If you're using one to output structured data rather than something like an HTML page, you're probably doing it wrong. They aren't going to return structured data - they are going to return text. If you're at the very end of your flow and you're going to output that back out of the API or to a file, you're fine, I suppose - but if you want to actually be able to work with that output, you're going to have to convert the plain text to an actual object, introducing a lot of extra steps when you could have just used DataWeave in the first place. DataWeave is especially beneficial when you want to do things like streaming because you're processing large payloads. DataWeave can understand JSON, XML, and CSV (the three most common data types I see) in a streamed format without any additional work, making it very easy to create efficient applications. The big difference between a template engine and a data transformation language is that one is for outputting text using structured data as input, and the other is for working with structured data on the input and outputting structured data that you can continue to work with. There is a reason that almost all of the template engine docs talk about building websites and not things like integrations.
The DataWeave engine is, as Aled indicated, built into the Mule runtime. Deeply so. You can use DataWeave in any field in any connector by default, even fields that don't have the f(x) button, because it's built into the runtime. This makes DataWeave what you could consider a first-class citizen within Mule, unlike something you would only be able to utilize either via connectors or by invoking Java bridges/libraries - which you do via DataWeave or a long series of connector operations.
The benefits you listed are also not things you can't do with DataWeave. You can VERY easily templatize and externalize DataWeave - for example, I have several DataWeave libraries in my Maven repo that I can include as dependencies. I've built several transformation services that use databases with DataWeave in order to do transformation, allowing me to change those transformations without modifying the app. You can also use dynamic DataWeave, where you use a template system to load specific parts of the script before running it. I've even taken it a step further and written a generic DataWeave script that I can use to do basic mappings without writing DataWeave - this allowed me to wrap a web UI around things pretty easily.
I wouldn't use DataWeave outside of MuleSoft unless you're a MuleSoft shop. If you are a MuleSoft shop, using the CLI to run your scripts, the same way you do with most interpreted languages, works fairly nicely - especially since you likely already have in-house expertise in DataWeave. The language is still niche enough that unless you've already adopted it for use in Mule applications I don't see any advantage in using it.
Docs / basic examples:
https://github.com/mulesoft-labs/data-weave-native
https://docs.mulesoft.com/mule-runtime/4.3/parse-template-reference
https://docs.mulesoft.com/mule-runtime/4.3/dataweave-create-module
https://github.com/mikeacjones/transform-system-api
Because it is the expression and transformation language embedded in the Mule runtime. If you are using Mule, it is also integrated with the Anypoint Studio IDE.
Outside Mule applications I don't think you can use DataWeave easily. You might want to go with the alternatives.
Being a beginner with the Apache Beam programming model, I would like to know what the difference is between JDBC and JdbcIO. I have developed a simple Dataflow pipeline that uses a normal JDBC connection and it is working as expected.
Is it mandatory to use JdbcIO over JDBC? If so, what issues do we face when we go with normal JDBC code?
Within a Beam pipeline there are various options for reading from and writing out to external sources of data. The most common method is to make use of the built-in sinks and sources that have been built by the Beam community (Built-in I/O Transforms). These connectors will often have had considerable development effort spent on them and will have been production hardened. For example, BigQueryIO has been used in production for many years, with continuous development throughout that period. The general advice is therefore to make use of the standard sinks and sources whenever possible.
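For comparison with hand-written JDBC code, here is a minimal sketch of reading with the built-in JdbcIO connector; the driver, URL, credentials, and table/column names are placeholders:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class JdbcIoReadExample {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Read rows from a relational table into a PCollection of key/value pairs.
    PCollection<KV<Integer, String>> rows =
        pipeline.apply(JdbcIO.<KV<Integer, String>>read()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                    "org.postgresql.Driver", "jdbc:postgresql://dbhost:5432/mydb")
                .withUsername("user")
                .withPassword("password"))
            .withQuery("SELECT id, name FROM person")
            .withCoder(KvCoder.of(VarIntCoder.of(), StringUtf8Coder.of()))
            .withRowMapper((JdbcIO.RowMapper<KV<Integer, String>>) resultSet ->
                KV.of(resultSet.getInt("id"), resultSet.getString("name"))));

    pipeline.run().waitUntilFinish();
  }
}
```

The connector takes care of connection handling and of turning the mapped rows into a PCollection you can keep processing, which is the part you would otherwise have to hand-roll and harden yourself.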
However, not all interactions with external data sources should be via sources and sinks; there are use cases where hand-built communication from a DoFn to the external source is the correct path. A few examples below (there are more, of course!):
- There is no Sink/Source for the data source, or there is a source but it does not yet support all the switches/modes etc. that you need. Of course you can always enhance the existing Sink/Source or, if it does not exist, build a new I/O connector from scratch - and if possible it would be great to contribute this back to the community :)
- You are enriching elements flowing through your streaming pipeline with a small subset of data from a large data set. For example, let's say you're processing events coming from a sales order and you would like to add information for each item. The information for the items lives in a large multi-TB store, but on average you will only access a small percentage of the data as lookup keys. In this case it makes sense to enrich each element by making an external call to the data store within a DoFn, rather than reading all of the data in as a source and doing the join operation within the pipeline (a sketch of such a DoFn follows the notes below).
Extra notes / hints:
When calling external systems, keep in mind that Apache Beam is designed to distribute work across many threads, which can place significant load on your external datasource. You can often reduce this load by making use of the start and finish bundle annotations:
Java (SDK 2.9.0)
DoFn.StartBundle
DoFn.FinishBundle
Python (SDK 2.9.0)
start_bundle()
finish_bundle()
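Putting the two ideas together, here is a hedged sketch of the enrichment pattern above with a bundle-scoped JDBC connection (the URL, credentials, and the item table/column names are assumptions for illustration):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Hypothetical enrichment DoFn: item IDs come in, (id, description) pairs go out.
public class EnrichWithItemInfoFn extends DoFn<String, KV<String, String>> {

  private transient Connection connection;

  @StartBundle
  public void startBundle() throws Exception {
    // One connection per bundle instead of one per element,
    // which greatly reduces the load on the external datasource.
    connection = DriverManager.getConnection(
        "jdbc:postgresql://dbhost:5432/items", "user", "password");
  }

  @ProcessElement
  public void processElement(ProcessContext context) throws Exception {
    String itemId = context.element();
    try (PreparedStatement statement = connection.prepareStatement(
        "SELECT description FROM item WHERE id = ?")) {
      statement.setString(1, itemId);
      try (ResultSet resultSet = statement.executeQuery()) {
        if (resultSet.next()) {
          context.output(KV.of(itemId, resultSet.getString("description")));
        }
      }
    }
  }

  @FinishBundle
  public void finishBundle() throws Exception {
    if (connection != null) {
      connection.close();
    }
  }
}
```

Opening the connection in @StartBundle and closing it in @FinishBundle means every element in a bundle shares one connection rather than each element opening its own; a @Setup/@Teardown pair would hold it even longer, at the cost of keeping connections open for the life of the DoFn instance.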
Working on a pet project (Cassandra, Spark, Hadoop, Kafka) I need a data serialization framework. Checking out the three common frameworks - namely Thrift, Avro, and Protocol Buffers - I noticed most of them seem barely alive, having at most two minor releases a year.
This leaves me with two assumptions:
They are as complete as such a framework needs to be and just rest in maintenance mode as long as no new features are needed.
There is no longer a reason for such frameworks to exist - though it is not obvious to me why that would be. If so, what alternatives are out there?
If anyone could give me a hint about my assumptions, any input is welcome.
Protocol Buffers is a very mature framework, having been first introduced nearly 15 years ago at Google. It's certainly not dead: Nearly every service inside Google uses it. But after so much usage, there probably isn't much that needs to change at this point. In fact, they did a major release (3.0) this year, but the release was as much about removing features as adding them.
Protobuf's associated RPC system, gRPC, is relatively new and has had much more activity recently. (However, it is based on Google's internal RPC system which has seen some 12 years of development.)
I don't know as much about Thrift or Avro but they have been around a while too.
The advantage of Thrift compared to Protobuf is that Thrift offers a complete RPC and serialization framework. Plus, Thrift supports 20+ target languages, and that number is still growing. We are about to include .NET Core, and there will be Rust support in the not-so-distant future.
The fact that there have not been that many Thrift releases in recent months is surely something that needs to be addressed, and we are fully aware of it. On the other hand, the overall stability of the codebase is quite good, so one may do a GitHub fork and cut a branch of their own from current master as well - of course with the usual quality measures.
The main difference between Avro and Thrift is that Thrift is statically typed, while Avro uses a more dynamic approach. In most cases a static approach fits the needs quite well; in that case Thrift lets you benefit from the better performance of generated code. If that is not the case, Avro might be more suitable.
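To make the static-versus-dynamic distinction concrete, here is a minimal Avro sketch using GenericRecord, where the schema is parsed at runtime rather than compiled into generated classes; the schema and field names are made up for illustration:

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroDynamicExample {
  public static void main(String[] args) throws Exception {
    // The schema is supplied at runtime - no code generation step required.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"long\"},"
      + "{\"name\":\"name\",\"type\":\"string\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("id", 42L);
    user.put("name", "Ada");

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
    encoder.flush();

    System.out.println("Serialized " + out.size() + " bytes");
    // With Thrift or Protobuf the equivalent record would be a statically
    // typed class generated from the IDL at build time (hypothetical):
    //   new User().setId(42L).setName("Ada");
  }
}
```

The generated-code route gives you compile-time checking and faster field access, while the GenericRecord route lets you handle schemas that are only known at runtime.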
Also, it is worth mentioning that besides Thrift, Protobuf, and Avro there are some more solutions on the market, such as Cap'n Proto or BOLT.
Concerning Thrift: as far as I am aware, it is alive and kicking. We use it for serialization and internal APIs where I work, and it works fine for that.
Missing things like connection multiplexing and more user-friendly clients have been added through projects such as Twitter's Finagle.
Though I would characterize our use of it as semi-intensive only (i.e., we don't look at performance first: it should be easy to use and bug-free before anything else), we have not run into any issues so far.
So, regarding Thrift, I'd say it falls into your first category.[*]
Protocol Buffers is an alternative to Thrift's serialization part, but it does not provide the RPC toolbox Thrift offers.
I'm not aware of any other project that blends RPC and serialization into such a simple to use and complete single package.
[*] Anyway, once you start using it and see all the benefits, it's hard to put it into your second category :)
They are all very much in use at plenty of places, so I'd say your first assumption. I don't know what your expectation of a release schedule is, but they seem normal to me for a library of that size and maturity. Heck, Avro 1.8.0 came out at the start of 2016, and most things still use Avro 1.7.7 (e.g. Spark, Hadoop). https://avro.apache.org/releases.html
I am planning to do a class project and was going through a few technologies that let me automate or set up the flow of data between systems, and found that there are a couple of them, i.e. Apache NiFi and StreamSets (to my knowledge). What I couldn't understand is the difference between them and the use cases where each can be used. I am new to this, and if anyone can explain a bit, it would be highly appreciated. Thanks
Suraj,
Great question.
My response is as a member of the open source Apache NiFi project management committee and as someone who is passionate about the dataflow management domain.
I've been involved in the NiFi project since it was started in 2006. My knowledge of Streamsets is relatively limited so I'll let them speak for it as they have.
The key thing to understand is that NiFi was built to do one really important thing really well, and that is 'Dataflow Management'. Its design is based on a concept called Flow Based Programming, which you may want to read about and reference for your project: https://en.wikipedia.org/wiki/Flow-based_programming
There are already many systems which produce data such as sensors and others. There are many systems which focus on data processing like Apache Storm, Spark, Flink, and others. And finally there are many systems which store data like HDFS, relational databases, and so on. NiFi purely focuses on the task of connecting those systems and providing the user experience and core functions necessary to do that well.
Here are some of the key functions and design choices made to make that effective:
1) Interactive command and control
The job of someone trying to connect systems is to be able to rapidly and efficiently interact with the constant streams of data they see. NiFi's UI allows you to do just that: as the data is flowing, you can add features to operate on it, fork off copies of data to try new approaches, adjust current settings, see recent and historical stats, get helpful in-line documentation, and more. Almost all other systems, by comparison, have a model that is design-and-deploy oriented, meaning you make a series of changes and then deploy them. That model is fine and can be intuitive, but for the dataflow management job it means you don't get the interactive change-by-change feedback that is so vital to quickly building new flows or to safely and efficiently correcting or improving handling of existing data streams.
2) Data Provenance
A very unique capability of NiFi is its ability to generate fine-grained and powerful traceability details for where your data comes from, what is done to it, where it's sent, and when it is done in the flow. This is essential to effective dataflow management for a number of reasons, but for someone in the early exploration phases and working on a project, the most important thing this gives you is awesome debugging flexibility. You can set up your flows, let things run, and then use provenance to actually prove that it did exactly what you wanted. If something didn't happen as you expected, you can fix the flow, replay the object, and repeat. Really helpful.
3) Purpose built data repositories
NiFi's out of the box experience offers very powerful performance even on really modest hardware or virtual environments. This is because of the flowfile and content repository design which gives us the high performance but transactional semantics we want as data works its way through the flow. The flowfile repository is a simple write ahead log implementation and the content repository provides an immutable versioned content store. That in turn means we can 'copy' data by only ever adding a new pointer (not actually copying bytes) or we can transform data by simply reading from the original and writing out a new version. Again very efficient. Couple that with the provenance stuff I mentioned a moment ago and it just provides a really powerful platform. Another really key thing to understand here is that in the business of connecting systems you don't always get to dictate things like size of data involved. The NiFi API was built to honor that fact and so our API lets processors do things like receive, transform, and send data without ever having to load the full objects in memory. These repositories also mean that in most flows the majority of processors do not even touch the content at all. However, you can easily see from the NiFi UI precisely how many bytes are actually being read or written so again you get really helpful information in establishing and observing your flows. This design also means NiFi can support back-pressure and pressure-release naturally and these are really critical features for a dataflow management system.
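To illustrate the streaming point, here is a hedged sketch of a custom processor (the processor and relationship names are made up) that rewrites a FlowFile's content through the ProcessSession callback API, so the content repository streams the bytes in and out rather than loading the full payload into memory:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.StreamCallback;

// Hypothetical processor that upper-cases text content line by line.
public class UpperCaseContentProcessor extends AbstractProcessor {

  static final Relationship SUCCESS = new Relationship.Builder().name("success").build();

  @Override
  public Set<Relationship> getRelationships() {
    return Collections.singleton(SUCCESS);
  }

  @Override
  public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
    FlowFile flowFile = session.get();
    if (flowFile == null) {
      return;
    }
    // The original content is streamed in and the new version is streamed out;
    // the full payload is never held in memory at once.
    flowFile = session.write(flowFile, new StreamCallback() {
      @Override
      public void process(InputStream in, OutputStream out) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
          writer.write(line.toUpperCase());
          writer.newLine();
        }
        writer.flush();
      }
    });
    session.transfer(flowFile, SUCCESS);
  }
}
```

Because the write callback produces a new immutable version of the content, the original bytes stay untouched and provenance can point back to exactly what went in and what came out.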
It was mentioned previously by the folks from the StreamSets company that NiFi is file-oriented. I'm not really sure what the difference is between a file or a record or a tuple or an object or a message in generic terms, but the reality is that when data is in the flow, it is 'a thing that needs to be managed and delivered'. That is what NiFi does. Whether you have lots of really high-speed tiny things or you have large things, and whether they came from a live audio stream off the Internet or from a file sitting on your hard drive, it doesn't matter. Once it is in the flow it is time to manage and deliver it. That is what NiFi does.
It was also mentioned by the StreamSets company that NiFi is schemaless. It is accurate that NiFi does not force conversion of data from whatever it is originally into some special NiFi format, nor do we have to convert it back to some format for follow-on delivery. It would be pretty unfortunate if we did that, because it would mean that even the most trivial of cases would have problematic performance implications, and luckily NiFi does not have that problem. Further, had we gone that route, it would mean handling diverse datasets like media (images, video, audio, and more) would be difficult, but we're on the right track and NiFi is used for things like that all the time.
Finally, as you continue with your project and if you find there are things you'd like to see improved or that you'd like to contribute code we'd love to have your help. From https://nifi.apache.org you can quickly find information on how to file tickets, submit patches, email the mailing list, and more.
Here are a couple of fun recent NiFi projects to checkout:
https://www.linkedin.com/pulse/nifi-ocr-using-apache-read-childrens-books-jeremy-dyer
https://twitter.com/KayLerch/status/721455415456882689
Good luck on the class project! If you have any questions, the users@nifi.apache.org mailing list would love to help.
Thanks
Joe
Both Apache NiFi and StreamSets Data Collector are Apache-licensed open source tools.
Hortonworks does have a commercially supported variant called Hortonworks DataFlow (HDF).
While both have a lot of similarities, such as a web-based UI, and both are used for ingesting data, there are a few key differences. They also both consist of processors linked together to perform transformations, serialization, etc.
NiFi processors are file-oriented and schemaless. This means that a piece of data is represented by a FlowFile (this could be an actual file on disk, or some blob of data acquired elsewhere). Each processor is responsible for understanding the content of the data in order to operate on it. Thus if one processor understands format A and another only understands format B, you may need to perform a data format conversion in between those two processors.
NiFi can be run standalone, or as a cluster using its own built-in clustering system.
StreamSets Data Collector (SDC), however, takes a record-based approach. What this means is that as data enters your pipeline (whether it's JSON, CSV, etc.) it is parsed into a common format, so that the responsibility of understanding the data format is no longer placed on each individual processor, and any processor can be connected to any other processor.
SDC also runs standalone, and also in a clustered mode, but it runs atop Spark on YARN/Mesos instead, leveraging existing cluster resources you may have.
NiFi has been around for about the last 10 years (but less than 2 years in the open source community).
StreamSets was released to the open source community a little bit later, in 2015. It is vendor-agnostic, and as far as Hadoop goes, Hortonworks, Cloudera, and MapR are all supported.
Full Disclosure: I am an engineer who works on StreamSets.
They are very similar for data ingest scenarios.
Apache NiFi (HDF) is more mature and StreamSets is more lightweight.
Both are easy to use and both have strong capabilities. And StreamSets could easily
They both have companies behind them: Hortonworks and Cloudera.
Obviously there are more contributors working on NiFi than on StreamSets, and, of course, NiFi has more enterprise deployments in production.
Two of the key differentiators between the two, IMHO, are:
Apache NiFi is a top-level Apache project, meaning it has gone through the incubation process described here, http://incubator.apache.org/policy/process.html, and can accept contributions from developers around the world who follow the standard Apache process, which ensures software quality. StreamSets is Apache LICENSED, meaning anyone can reuse the code, etc. But the project is not managed as an Apache project. In fact, in order to even contribute to StreamSets, you are REQUIRED to sign a contract. https://streamsets.com/contributing/ . Contrast this with the Apache NiFi contributor guide, which wasn't written by a lawyer. https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide#ContributorGuide-HowtocontributetoApacheNiFi
StreamSets "runs atop Spark on YARN/Mesos instead, leveraging existing cluster resources you may have." which imposes a bit of restriction if you want to deploy your dataflows further toward the Edge where the Devices that are generating the data live. Apache MiniFi, a sub-project of NiFi can run on a single Raspberry Pi, while I am fairly confident that StreamSets cannot, as YARN or Mesos require more resources than a Raspberry Pi provides.
Disclosure: I am a Hortonworks employee
We are implementing a remote partitioning job using Spring Batch and Spring Integration. For this job we receive a big file, we split that file using a Unix command, and we run the batch using those files. Is there a way I can write custom partitioning logic for the large file without splitting it? Please help, someone.
Thanks in advance.
-MK
There is a Jira issue for Spring Batch (BATCH-1613 and a related pull request) to support multi-threaded file reading. However, the issue we've found is that the benefit of having multiple threads is very environmentally specific. In typical environments, you end up not being able to get the data off the disk fast enough from a single file to keep all your partitions busy. The pull request linked above saw no benefit when I was testing it compared with reading a single file with a single thread, which is why it wasn't merged (even though the author was able to present stats that showed increased speed in his environment).
If the linked code benefits you, please feel free to use it. However, I'd want to be sure that the benefits are achievable in more environments (or at least have a more concrete understanding of the requirements for it to gain the benefits so they could be documented) before merging it into the framework itself.
If you work out something that works for you (either based on the code linked or something else), we'd love a pull request!
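If you do want to partition a single file without physically splitting it, one common approach is a custom Partitioner that hands each worker step a line range; here is a minimal sketch (the class name, line counts, and context keys are assumptions):

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Hypothetical partitioner: carves one large flat file into line ranges
// without physically splitting it on disk.
public class LineRangePartitioner implements Partitioner {

    private final long totalLines;        // e.g. counted up front with `wc -l`
    private final long linesPerPartition; // tuning knob

    public LineRangePartitioner(long totalLines, long linesPerPartition) {
        this.totalLines = totalLines;
        this.linesPerPartition = linesPerPartition;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int index = 0;
        for (long start = 0; start < totalLines; start += linesPerPartition) {
            ExecutionContext context = new ExecutionContext();
            context.putLong("startLine", start);
            context.putLong("endLine", Math.min(start + linesPerPartition, totalLines));
            partitions.put("partition" + index++, context);
        }
        return partitions;
    }
}
```

Each worker step would then use a step-scoped FlatFileItemReader configured from its step execution context, for example via setCurrentItemCount(startLine) and setMaxItemCount(endLine), so it only emits its own slice. Note that every reader still has to scan the file up to its starting point, which is exactly why, as described above, the disk often remains the bottleneck.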