Difference between parallel engine and server engine - etl

Here is the info from the official IBM web site
> WebSphere DataStage and WebSphere QualityStage server engine
The server engine runs WebSphere DataStage server jobs and performs
some tasks for parallel jobs and WebSphere QualityStage jobs.
> Parallel engine
The parallel engine runs parallel jobs and WebSphere QualityStage
jobs.
Could anyone please explain the actual difference between these two? Is it hardware, software, or architecture? I don't get the difference.

In short they are two completely different products that are made to look mostly alike in the common client.
The server engine is the old single-threaded(?) engine with a limited set of out-of-the-box stages, insufficient performance and very limited updatability. You should encounter server jobs only in legacy installations that are not performance-critical. Everyone else has already upgraded either to the Enterprise edition with parallel jobs or to another product.
The Enterprise edition of DataStage also contains the parallel engine, which gets all the cool new toys such as the SCD stage, Web Service integration and the ability to scale out performance by adding new processing nodes.
You use the same client to develop jobs for each engine, and some of the stages may look the same, but the similarity is superficial.
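To make the scale-out idea concrete, here is a minimal Python sketch of partition parallelism, the general technique behind adding processing nodes. It is not DataStage code: the function names, the toy transform and the use of threads as stand-ins for nodes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_nodes):
    """Round-robin the rows into num_nodes partitions."""
    return [rows[i::num_nodes] for i in range(num_nodes)]


def transform(rows):
    """Toy transform stage (hypothetical): double every value."""
    return [r * 2 for r in rows]


def run_job(rows, num_nodes):
    """Run the same transform on every partition, then collect the results."""
    parts = partition(rows, num_nodes)
    # Threads stand in for processing nodes here; a real engine would
    # place each partition on a separate process or machine.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        results = list(pool.map(transform, parts))
    return sorted(x for part in results for x in part)
```

The point of the sketch is that `run_job(rows, 1)` and `run_job(rows, 4)` produce the same result, so the degree of parallelism becomes a configuration value rather than part of the job design.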

Related

Long Running ETL Process - Background Jobs, Spark, Hadoop

I have the following scenario in an application:
I have to load data from multiple sources (more than 10)
Most sources are HTTP/JSON web services, and some are FTP
I have to process that data and put it into a central database (PostgreSQL)
The current implementation is done in Ruby using background jobs, but I see the following issues with it:
Very high memory usage
Jobs sometimes get stuck without any error report
Horizontal scaling is tricky to set up
Can Spark or Hadoop help in this scenario, or is there a better option?
Please elaborate with some good reasoning.
Update:
As per the comment, I need to elaborate further. Here are the reasons why I thought of Spark or Hadoop.
If we scale up the concurrency of running jobs, that also puts a heavy load on the DB server. I have read, though, that Spark and Hadoop are built to handle such heavy load even on the DB side.
We can't run more background processes than the number of physical CPU cores (as recommended by the Ruby and Sidekiq communities).
Concurrency in Ruby depends on the GIL, so it is not real concurrency. Each job fetches from a single central data source, and if it gets stuck in an I/O call, the source stays locked for it.
As I understand it, all of the above points are handled by the built-in architecture of Hadoop and Spark, so I was thinking of looking into these tools.
In my opinion, I would give Pentaho Data Integrator (PDI) (or Talend) a try.
They are visual tools designed to solve problems like yours, and they have a free version downloadable from SourceForge (just unzip and run spoon.bat).
They can acquire data from FTP and HTTP (among others), decode JSON, and write to databases like Postgres. PDI has a free plug-in that can run Ruby code out of the box, which can save you some start-up development.
PDI also has ready-made Spark and Hadoop interfaces, so you can bring in Hadoop/Spark servers transparently at a later stage if you need a heavier-duty solution.
PDI was built for heavy data loads and gives you control over concurrency and remote servers.
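Independent of the tool chosen, the two concerns raised in the question (fetch concurrency and load on the database) can be decoupled. The following is a minimal Python sketch under stated assumptions: `sources` maps a name to a hypothetical zero-argument fetch function, and `write_batch` is a hypothetical callable that persists one list of rows (for example a bulk INSERT). Neither name comes from the original post.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_all(sources, write_batch, max_workers=4, batch_size=100):
    """Fetch every source concurrently, but write to the database in batches.

    sources     -- mapping of name -> zero-argument callable returning a list of rows
    write_batch -- callable that persists one list of rows (hypothetical)
    Returns a dict of source name -> exception for every source that failed.
    """
    buffer, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch): name for name, fetch in sources.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                buffer.extend(future.result())
            except Exception as exc:
                # Record the failure instead of letting the job vanish silently.
                failed[name] = exc
                continue
            # Writes happen on this one thread, so the database sees a single
            # writer no matter how many fetches run in parallel.
            while len(buffer) >= batch_size:
                write_batch(buffer[:batch_size])
                del buffer[:batch_size]
    if buffer:
        write_batch(buffer)  # flush the remainder
    return failed
```

Because the fetches are I/O-bound, `max_workers` can exceed the number of CPU cores. Each fetch function should set its own network timeout so that a stuck source surfaces as an entry in the returned `failed` dict rather than as a hung job.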

How can a Phoenix application tailored only to use channels scale on multiple machines? Using HAProxy? How to broadcast messages to all nodes?

I use the node application purely for socket.io channels with Redis PubSub, and at the moment I have it spread across 3 machines, backed by nginx load balancing on one of the machines.
I want to replace this node application with a Phoenix application, and I'm still new to the Erlang/Elixir world, so I haven't figured out how a single Phoenix application can span more than one machine. Googling all possible scaling and load balancing terms yielded nothing.
The 1.0 release notes mention this regarding channels:
Even on a cluster of machines, your messages are broadcasted across the nodes automatically
1) So I basically deploy my application to N servers, starting the Cowboy server on each of them, similarly to how I do with node, and then tie them together with nginx/HAProxy?
2) If that is the case, how are channel messages broadcast across all nodes, as mentioned in the release notes?
EDIT 3: Taking Theston's answer, which clarifies that there is no such thing as a Phoenix application, only Elixir/Erlang applications, I updated my search terms and found some interesting results regarding scaling and load balancing.
A free extensive book: Stuff Goes Bad: Erlang in Anger
Erlang pooling libraries recommendations
EDIT 2: Found this from Elixir's creator:
Elixir provides conveniences for process grouping and global processes (shared between nodes) but you can still use external libraries like Consul or Zookeeper for service discovery or rely on HAProxy for load balancing for the HTTP based frontends.
EDITED: Connecting Elixir nodes on the same LAN is the first result that mentions inter-node Elixir communication, but it isn't related to Phoenix itself, and it is not clear how it relates to load balancing and to each Phoenix node communicating with the others.
Phoenix isn't the application. When you generate a Phoenix project, you create an Elixir application with Phoenix being just a dependency (effectively a bunch of things that make building the web part of your application easier).
Therefore any node distribution you need can still happen within your Elixir application.
You could just use Phoenix for the web routing and then pass the data on to your underlying Elixir app to handle the distribution across nodes.
It's worth reading http://www.phoenixframework.org/v1.0.0/docs/channels (if you haven't already) where it explains how Phoenix channels are able to use PubSub to distribute (which can be configured to use different adapters).
Also, are you spinning up Cowboy on your deployment servers by running mix phoenix.server?
If so, then I'd recommend looking at EXRM: https://github.com/bitwalker/exrm
This will bundle your Elixir application into a self-contained file that you can simply deploy to your production servers (with Capistrano if you like) and then start your application.
It also means you don't need any Erlang/Elixir dependencies installed on the production machines either.
In short, Phoenix is not like Rails, Phoenix is not the application, not the stack. It's just a dependency that provides useful functionality to your Elixir application.
Unless I am misunderstanding your use case, you can still use the exact scaling technique that your node version of the application uses. Simply deploy the Phoenix application to more than one machine and use an Nginx load balancer configured to forward requests to one of the application machines.
The built-in node communication features of Erlang are used for applications that scale in a different way than a web app, for instance distributed databases or queues.
Look at Phoenix.PubSub
It's where Phoenix internally has the Channel communication bits.
It currently has two adapters:
Phoenix.PubSub.PG2 - uses Distributed Elixir, directly exchanging notifications between servers. (This requires that you deploy your application as a distributed Elixir/Erlang cluster.)
Phoenix.PubSub.Redis - uses Redis to exchange data between servers. (This should be similar to solutions found in socket.io and others)
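Both adapters follow the same fan-out pattern: a broadcast goes through a shared layer that every node listens to, and each node then delivers the message to its own connected clients. The Python sketch below simulates that pattern in a single process. It is not the Phoenix API; `Broker` and `Node` are hypothetical names, with `Broker` standing in for Redis or the distributed Erlang process group.

```python
class Broker:
    """Stand-in for the shared layer (Redis, or a distributed process group)."""

    def __init__(self):
        self.nodes = []

    def publish(self, topic, message):
        # Every node sees every message, whichever node published it.
        for node in self.nodes:
            node.deliver_local(topic, message)


class Node:
    """One application server holding its own client connections."""

    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.subscribers = {}  # topic -> list of client inboxes
        broker.nodes.append(self)

    def subscribe(self, topic):
        """Register a client on this node and return its inbox."""
        inbox = []
        self.subscribers.setdefault(topic, []).append(inbox)
        return inbox

    def broadcast(self, topic, message):
        # The node never talks to other nodes directly; it only publishes.
        self.broker.publish(topic, message)

    def deliver_local(self, topic, message):
        for inbox in self.subscribers.get(topic, []):
            inbox.append(message)
```

With this layout the load balancer can send a client to any node: a client subscribed on node B still receives a message broadcast from node A, because the message travels through the shared layer rather than through the HTTP front end.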

Mesos real world use-cases

I'm trying to figure out what would be the reasons for using Mesos. Can you come up with other ones?
Running all of your services in the same cluster instead of dedicated clusters (your end-applications + DevOps such as Jenkins)
Running applications of different maturity in the same cluster (dev, test, production). Is this viable? Kubernetes has a similar approach with labels
Mesos simplifies the use of traditional distributed applications such as Hadoop through easier deployment, a unified API and bin-packing of resources
Full-disclosure: I currently work at Twitter and I'm involved in both Apache Mesos and Aurora.
Mesos use cases can vary based upon a few dimensions: scale (10 servers vs 10s of thousands), available hardware (dedicated/static or in the public cloud/scalable), and workloads (primarily services, batch, or both).
Your list is a great start. Here are a few additional use cases / features to add.
Container Orchestration
As container runtimes like Docker have become popular, lots of potential users are looking at Mesos + a scheduler to manage orchestration once container images are created. Mesos is already quite mature and has been proven at scale, which I think has given it a leg up over some emergent solutions.
Increased Resource Utilization
For companies running >50 servers, a common motivation for adopting Mesos is to increase resource utilization to reduce CapEx. There are a number of examples of this in both the public and private cloud. In the case of eBay, they have been running Jenkins on Mesos and were able to reduce their VM footprint. Mesosphere has also published a case study of HubSpot (running on AWS), and how they've been able to replace hundreds of smaller servers with dozens of larger ones by using their available hardware more efficiently.
Preemption
At Twitter we're running Mesos via one scheduler: Apache Aurora. One of the ways we can improve utilization relates to your use case: running different maturity applications in the same cluster. Aurora has a concept of environments, so you can run applications that are production, development, or test. Additionally, Aurora has a built-in preemption feature that lets it prioritize production over non-production tasks, killing non-production tasks when their resources are needed to run production ones. There is also a priority system within each environment.
Long-term, functionality related to preemption will also be located in the Mesos core itself -- it's a killer feature related to both increased resource utilization and running different maturity applications (dev, test, prod). There are a few Mesos tickets to follow if you're interested in keeping up to date, including MESOS-155 for preemption, and MESOS-1474 for inverse offers.
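The preemption rule described above (production work may evict non-production work, never the reverse) can be sketched in a few lines of Python. This is an illustration of the idea only, not Aurora's implementation: the `Task` fields, the single CPU dimension and the "largest victim first" order are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    cpus: int
    production: bool


def schedule(running, capacity, new_task):
    """Try to place new_task; return (running_after, preempted).

    Only a production task may preempt, and only non-production tasks
    are killed. If the task cannot be placed, running is returned unchanged.
    """
    free = capacity - sum(t.cpus for t in running)
    if new_task.cpus <= free:
        return running + [new_task], []
    if not new_task.production:
        return running, []  # non-production work simply waits
    preempted = []
    # Assumption: evict the largest non-production tasks first.
    victims = sorted((t for t in running if not t.production),
                     key=lambda t: t.cpus, reverse=True)
    for victim in victims:
        preempted.append(victim)
        free += victim.cpus
        if new_task.cpus <= free:
            survivors = [t for t in running
                         if not any(t is p for p in preempted)]
            return survivors + [new_task], preempted
    return running, []  # even evicting everything would not make room
```

For example, on a 10-CPU cluster that already runs a 4-CPU production service plus 3-CPU and 2-CPU non-production tasks, placing another 3-CPU production task evicts only the 3-CPU non-production task.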
Colocating Batch and Services
Running batch and services in a shared Mesos cluster will be key to driving up utilization even further as js84 points out. Check out Project Myriad, an effort to colocate Mesos and YARN workloads in the same cluster. At this time I'm not aware of any large deployments running both batch and services, but it's certainly the direction the community is moving in as it becomes easier for multiple frameworks to run in a shared cluster.
At least one additional use case comes to mind: a development SDK for building distributed applications. If you have a look at Mesos Frameworks you will find a number of frameworks that have been developed on top of Mesos. Also interesting is the framework Apple built to power Siri.
Regarding your 1): One additional angle you should keep in mind here is scaling your applications within the same cluster. For example, at peak load of your website, you can easily shift resources towards the web servers while scaling down the Hadoop analytical processing.

YARN as a SOA framework

We are considering building a service-oriented architecture on top of YARN. We have different application types: some would work in a Storm-like streaming mode (where we connect to the running service), some in batch processing mode (where the app is started on every request).
Moreover, applications might need to communicate with each other often, which would require a lot of internal traffic between different applications within YARN. We also want to use caching in the different applications, so that whenever a request with the same data goes to the same app we can return a cached response.
Is YARN a good or bad solution as a basis for an SOA framework? Is YARN just an autoscaling/deployment-like tool, or would it be a good fit for SOA? Would it be fast enough to do this with YARN?
The way I see it, YARN is pushing Hadoop from being a distributed file system to being a distributed OS. There are a lot of SOA-ish infrastructures that are being built on or migrating to YARN (Storm, Samza) that are compelling service hosts. You can also look at Weave from Continuuity, which will help you host additional types of services.
To specifically address your question: YARN is a good basis for an SOA framework. It is more than an autoscaling tool; it is a resource management and hosting framework, and it is fast enough (especially if you use one of the already developed infrastructures that are built on top of it).

AppHarbor basic questions on architecture and reliability

AppHarbor looks very appealing for our .NET solution, but I have some questions whose answers I could not find on the internet.
Our major concern is the reliability of the dedicated SQL Server:
1) Is it clustered / mirrored / replicated?
2) What happens when they upgrade / patch / maintain the server or the hosted server, and when hardware fails?
3) Are upgrades scheduled?
4) Can we set the time interval when they do upgrades?
5) Which version and edition of SQL Server is used?
6) Can I use full text search?
7) Can I use Reporting Services?
8) Is communication with the SQL database reliable? For example, in Azure SQL it is recommended to build in retry logic: if a command does not succeed, retry.
9) Is AppHarbor reliable? Every cloud provider occasionally has some blackouts (Amazon, MS Azure ...). Is AppHarbor any less reliable compared to them? I know AppHarbor runs on top of Amazon.
10) Are there a lot of hidden issues you run into? What are the most common?
11) Did anybody decide to leave AppHarbor for a good reason?
12) As far as I can see, Azure is a real cloud system with all the downsides and upsides: more scalable, but with modified infrastructure like a customized SQL Server. AppHarbor mimics more of an on-premises solution. Is my understanding correct?
13) How is the documentation?
14) How is the support?
Thank you for your help.
1) Yes, AppHarbor offers redundant/replicated dedicated SQL Server databases. These plans are available upon request.
2) This depends on the type of maintenance/update and your SQL Server database plan. If the database server is replicated, downtime can be minimized by failing over to the replica while performing maintenance. In the event of a server failure the database will be attached to a new instance and the application's configuration will be automatically updated. Should a hard drive fail and lead to corrupted or lost data, AppHarbor makes daily backups that will be used to restore your database. It should be noted that hard drive failures are very rare.
3) We generally coordinate planned maintenance that requires downtime with customers whenever possible. Dedicated SQL Server customers can also select their own maintenance window.
4) Not really, but AppHarbor will reach out and coordinate with you when it is necessary.
5) Different SQL Server versions and editions are used depending on the plan. For single-instance dedicated SQL Servers we generally use SQL Server 2008 R2 Web Edition. Dedicated SQL Server 2012 instances are available upon request. Replicated setups require other, more expensive SQL Server editions. You may also want to consider our dedicated MySQL service if you'd like to reduce costs and don't rely on SQL Server specific features. Since AppHarbor doesn't have to pay license costs, these are less expensive, particularly for a replicated setup.
6) Yes.
7) Not by default, but we can work with you to support Reporting Services on your dedicated SQL Server instance.
8) Yes. In fact, the primary reason customers upgrade from shared to dedicated SQL Server is for consistent, reliable performance.
9) I'd say so. The last major outage occurred on July 29th, 2012 due to an electrical storm that affected multiple availability zones in AWS's Northern Virginia region. As an example, our blog has been available 99.997% of the time since then. In the event of an application instance failure, applications are rapidly moved to healthy instances. We recommend running with at least two workers to ensure redundancy in those cases.
10) I'm admittedly not the best person to answer this question. The most common request/limitation we hear about is that you can't currently trigger a backup yourself. This will be available at a later time, but we do keep daily backups of your databases.
11) -
12) AppHarbor's cloud application platform is relatively similar to Azure in terms of scalability. We support rapid "elastic scaling" of application workers both vertically and horizontally. With regard to the dedicated SQL Server service, your understanding is correct: it is very similar to an on-premises solution. While the scaling story is different compared to SQL Azure, this allows for much greater flexibility. We can tailor a database plan and server that suits your requirements, whether you need high CPU, RAM and/or I/O performance. Similarly, we can offer database sizes that are 10x larger than SQL Azure's current 150GB database size limitation.
13) Most documentation is available in the knowledge base. We try to keep this as up-to-date and comprehensive as possible, but if you find yourself missing some information you're of course more than welcome to let us know and we'll add it. Third party add-on providers typically maintain their own AppHarbor-specific documentation.
14) This is another question where I may be a little biased, but I can tell you a little about our goals: our goal is to always answer non-critical support requests related to apps on both free and paid plans within the day. Critical support requests and support requests related to applications or databases on paid plans take priority. Support is included in the plans, but we're working on offering premium support options as well. We generally try to exceed your expectations and are always happy to help out and give advice on issues you experience, whether they're related to the AppHarbor platform or not.
Disclaimer: I'm a co-founder of AppHarbor.
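The retry logic that the question mentions for Azure SQL is a general pattern that works against any remote database. A minimal Python sketch follows, assuming the caller wraps its database command in a zero-argument function and that transient failures surface as `ConnectionError` or `TimeoutError`. The helper name `with_retry` and its parameters are hypothetical, and a real driver would raise its own exception types.

```python
import time


def with_retry(operation, attempts=3, base_delay=0.5,
               retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(); on a transient error wait and try again.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The last error is re-raised once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Only idempotent commands should be wrapped this way, since a command that timed out may still have been applied on the server.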
