Creating Apache Spark-powered "as a service" applications (client-server)

The question is about ways to create a Windows desktop-based and/or web-based client application that can connect and talk, at run time, to a server hosting a Spark application (either local or an on-premise cloud distribution).
Any language/architecture may work. So far I've seen two things that might help, but I'm not yet sure whether they are the best alternatives or exactly how they work:
Spark Job Server - https://github.com/spark-jobserver/spark-jobserver - defines a REST API for Spark
Hue - http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/ - uses item 1)
Any advice would be appreciated. A simple toy example program (or steps) showing, e.g., how to build such a client that creates a SparkContext on a local machine, reads a text file, and returns basic stats would be the ideal answer!

You may want to have a look at how the team at Adobe Research built their Spindle platform. Personally I haven't investigated it in detail, but they also provide "Spark query results as a service".
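As for item 1), below is a minimal sketch of what a client of Spark Job Server's REST API could look like, assuming a server running on localhost:8090 (the project's default port) and a job jar you have built yourself. The app name, class path, and config key are hypothetical placeholders, and the exact endpoints and response shapes vary by version, so check the spark-jobserver README against your install.

    # Minimal sketch of a Spark Job Server REST client (Python).
    # Assumptions: server at localhost:8090; "stats-app" and
    # com.example.TextFileStatsJob are placeholder names for a job
    # that reads a text file and returns basic stats.
    import time
    import requests

    BASE = "http://localhost:8090"

    # 1) Upload the application jar (once per deployment).
    with open("my-stats-job.jar", "rb") as f:
        requests.post(f"{BASE}/jars/stats-app", data=f)

    # 2) Submit a job, passing the input path as Typesafe-config text.
    resp = requests.post(
        f"{BASE}/jobs",
        params={"appName": "stats-app",
                "classPath": "com.example.TextFileStatsJob"},
        data='input.file = "/data/sample.txt"',
    )
    job = resp.json()
    job_id = job.get("jobId") or job["result"]["jobId"]  # shape varies by version

    # 3) Poll until the job finishes, then print the result.
    while True:
        status = requests.get(f"{BASE}/jobs/{job_id}").json()
        if status.get("status") in ("FINISHED", "ERROR"):
            print(status.get("result"))
            break
        time.sleep(1)

A desktop or web client would wrap exactly these three calls; everything Spark-specific stays on the server.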

Related

What would be the advantages of using ELK for log management over a simple Python logging + existing database log table combo?

Assuming I have many Python processes running on an automation server such as Jenkins, let's say I want to use Python's native logging module and, other than writing to the Jenkins console or to a log file, I want to store & centralize the logs somewhere.
I thought of using ELK for that, but then I realized that I could just as well create a dedicated log table in an existing database (I'm using Redshift), use something like Grafana for log dashboards/visualization, and save myself the trouble of deploying a new system (most of the people on my team are familiar with Redshift but not with Elasticsearch).
Although it sounds straightforward, I feel like I'm not looking at the big picture and that I would be missing some powerful capabilities that components like Logstash were written for in the first place. What would these capabilities be, and how would it be advantageous to use ELK instead of my solution?
Thank you!
I have implemented a full ELK stack in my company in the past year.
The project was huge and took a lot of time to properly implement. The advantages of using ELK and not implementing our own centralized logging solution would be:
Not needing to reinvent the wheel - there is already a product that does just that (and the installation part is extremely easy).
It is battle-tested and can withstand a huge volume of logs in a short time.
As your business and product grow and shift, you will need to parse more logs with different structures, which would mean DB changes in a self-built system. Logstash gives you almost endless possibilities for filtering and parsing those newly formatted logs.
It has clustering and HA capabilities, and you can scale your logging system both vertically and horizontally.
It is very easy to maintain and change over time.
It can send the needed output to a variety of products, including Zabbix, Grafana, Elasticsearch, and many more.
Kibana gives you the ability to view the logs, build graphs and dashboards, set up alerts, and more...
The options with ELK are really endless, and the more I work with it, the more ways I find it can help me - not just for viewing logs on distributed remote server systems, but also for security alerts, SLA graphs, and many other insights.
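To make the shipping side concrete: getting structured logs from Python into ELK can be as small as the sketch below, assuming a Logstash pipeline with a tcp input and json_lines codec listening on port 5000 (the host, port, and field names here are placeholders, not anyone's production config).

    # Minimal sketch: ship Python log records as JSON lines to a
    # Logstash TCP input (input { tcp { port => 5000 codec => json_lines } }).
    import json
    import logging
    import socket

    class LogstashTCPHandler(logging.Handler):
        def __init__(self, host="localhost", port=5000):
            super().__init__()
            self.addr = (host, port)

        def emit(self, record):
            doc = {
                "message": record.getMessage(),
                "level": record.levelname,
                "logger": record.name,
                "timestamp": record.created,
            }
            try:
                # One connection per record keeps the sketch simple;
                # a real handler would reuse or buffer connections.
                with socket.create_connection(self.addr, timeout=2) as sock:
                    sock.sendall((json.dumps(doc) + "\n").encode("utf-8"))
            except OSError:
                self.handleError(record)

    logger = logging.getLogger("jenkins-job")
    logger.setLevel(logging.INFO)
    logger.addHandler(LogstashTCPHandler())
    logger.info("build finished")

The point of the comparison: once records arrive as JSON, Logstash filters and Kibana take over, whereas with a Redshift log table every new field or format means schema work on your side.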

Example application using HDFS + MapReduce

I am taking an academic course, "Middleware", which covers different aspects of distributed software systems, including an introduction to topics like distributed file systems. It also includes an introduction to HBase, Hadoop, MapReduce, HiveQL, and Pig Latin.
I want to know whether I can do a small project that integrates the above technologies. For starters, I am aware of the VM provided by Cloudera for getting a feel for Hadoop and playing around with it in Eclipse.
I was thinking along the lines of implementing an application that accepts a stream of events as input, analyzes it, and produces an output.
I have both Windows and Linux on my machine, with an i7 processor and 4 GB of RAM.
Please let me know how to get started, and any suggestions for a simple example application are welcome.
Here is a blog post on analyzing Tweets using Hive/HDFS. And here is a blog post on performing Clickstream analytics using Pig and Hive.
Check some of the Big Data use cases here and try to solve an interesting problem.
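If you want something runnable on day one, a classic first exercise on the Cloudera VM is a word count over a file in HDFS via Hadoop Streaming, which lets you write the mapper and reducer as plain Python scripts. A minimal sketch, where the streaming jar path and HDFS paths are assumptions that vary by distribution:

    # mapper.py - emit "word<TAB>1" for every word on stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py - sum counts per word; Hadoop Streaming sorts by key
    # before the reducer sees the data, so equal words arrive together.
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(n)
    if current_word is not None:
        print(f"{current_word}\t{count}")

    # Run (jar path varies by distribution):
    # hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    #   -input /user/cloudera/input -output /user/cloudera/out \
    #   -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py

From there you can point Hive or Pig at the same HDFS data to cover the other technologies from the course.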

What is a good framework for deploying a portable HTML/JavaScript Windows application?

I need to deploy an application onto some Windows machines for purposes of data collection from a group of people (i.e. the application will be used to gather responses to a series of survey questions). The process is interactive, alternating between displays of text and images with specific timing requirements. I have put together a prototype application using HTML and JavaScript that implements the survey. However, there are some unique constraints on the deployment environment that have me stuck:
While the machine is Internet-connected, the client requires that the survey application run fully locally on the PC. Therefore, sending the survey results to a remote server is not permissible. Obviously, saving to a local file from a web browser is typically not permitted for security reasons.
Installation of applications onto the machines that will run the survey is not permitted.
The configuration of the machines is not known specifically a priori, but I can assume some recent version of Windows with IE8+.
The "no remote access" requirement was a late comer, and has thrown a wrench into the plan of just writing a simple Web application that could post results to an HTTP server. I'm now looking for the easiest way forward. Two main approaches come to mind:
Use a GUI framework that provides a control that can display HTML/JavaScript; running a full-blown application on the PC would allow me to save the results to the filesystem. I've never done this, but it seems like in this day and age it shouldn't be too difficult. This would allow me to reuse much of my existing prototype implementation, but I would need some way of transferring the results (which would be stored in a JavaScript data structure) outside of the Web control to where the rest of the application could access it.
Reimplement the entire application using some GUI framework (I've used PyQt successfully before, although not on Windows). This approach is obviously less desirable than #1 due to the lack of reuse. However, it may be necessary if #1 isn't feasible.
Any recommendations for the best way to go? Ideally, I'm looking for a solution that can be run in a "portable" manner from a USB thumbdrive or similar.
Have you looked at HTML Applications (HTA)? They work in IE5+ and can use Windows Scripting Host to write to local drives and UNC shares...
Maybe you can use a portable web server with a scripting language on the server side, e.g. Mongoose (http://code.google.com/p/mongoose/), which can run PHP, CGI, and similar scripts. Then simply create a script that saves a file to your hard drive, and leave the rest of the application as it is.
Use a script to start the web server, and perhaps a portable web browser like K-Meleon (http://kmeleon.sourceforge.net/), which is highly configurable, to open the application. Or point the system browser at your localhost URL.
The only problem may be that the user has to approve a firewall prompt the first time the server runs.
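In the same spirit, here is a minimal sketch of the portable-server idea using Python's standard http.server instead of Mongoose + PHP (a portable Python on the USB stick is an assumption, and the results file name is a placeholder). It serves the existing survey HTML/JS from its own directory and appends each POSTed result to a local file:

    # Minimal sketch: serve the survey's static files and save POSTed
    # results locally. Run from the USB stick with a portable Python.
    from http.server import HTTPServer, SimpleHTTPRequestHandler

    class SurveyHandler(SimpleHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            with open("results.jsonl", "ab") as f:   # one JSON line per submission
                f.write(body + b"\n")
            self.send_response(204)                  # no content; page stays put
            self.end_headers()

    if __name__ == "__main__":
        # Bind to localhost only, so nothing is exposed to the network;
        # this may also avoid the firewall prompt on many configurations.
        HTTPServer(("127.0.0.1", 8000), SurveyHandler).serve_forever()

The survey page would submit its JavaScript data structure with a single XMLHttpRequest POST to http://127.0.0.1:8000/, which keeps the existing prototype intact.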

Performance monitoring all layers of a system

I use several load-testing tools (LoadRunner, JMeter, NeoLoad) to performance-test different applications. I'm wondering if it is possible to monitor all layers of an application stack. For example, say I have the following data chain:
Loadbalancer <-x-> Application Server <-x-> RMI <-x-> Java Application <-x-> MQ <-x-> Legacy application <-x-> Database
At each point where I have marked an x in the chain, I am interested in monitoring, for example, average response times.
Obviously, we could simply create a wrapper around all endpoints to gather the statistics for us, then perhaps import them into LoadRunner or other load-testing tools and place them alongside the tools' built-in performance statistics, but maybe there are tools/applications that already do this?
If not, how should we proceed in order to gather these statistics?
The standard for this was supposed to be Application Response Measurement (ARM). It was a cross-language set of APIs that did just what you are looking for. The issue is that the products implementing this spec all tend to be big, expensive, "enterprise"-level monitoring tools. Think multi-week installs, consultants, more infrastructure, and lots of buzzwords.
Still, if this is a mission critical app with a mission critical budget, this may be what you need. But you may be able to build your own that does just enough without too much effort. A quick search turns up at least one open source ARM implementation if you still want to use that API.
Another option is to simply to have transactions you can run against each tier of the system to check general responsiveness. For example you can have a static web page on the LB, a no-op tx on the app server, a "hello" servlet on the Java app, put a message directly on the queue, etc. During a performance / load test, these could be hit directly by the load testing tool or you could write a wrapper servlet / application call that does this as a single HTTP (RMI?) call. Running these a few times a minute won't add too much load to the system, but it should help you pinpoint which tier is slower. The nice thing about this approach is that it also works in production, just watch out for security issues.
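A minimal sketch of such a prober, assuming each tier exposes some cheap transaction over HTTP (all URLs below are placeholders for whatever no-op each tier provides): it times one request per tier each minute and prints timestamped lines that can later be imported into the load-testing tool's results.

    # Minimal sketch: time one cheap "health" transaction per tier.
    import time
    import urllib.request

    TIERS = {
        "loadbalancer": "http://lb.example.com/static.html",
        "appserver":    "http://app.example.com/noop",
        "java-app":     "http://app.example.com/hello-servlet",
    }

    while True:
        for tier, url in TIERS.items():
            start = time.monotonic()
            try:
                urllib.request.urlopen(url, timeout=10).read()
                elapsed_ms = (time.monotonic() - start) * 1000
                print(f"{time.time():.0f} {tier} OK {elapsed_ms:.1f} ms")
            except OSError as exc:
                print(f"{time.time():.0f} {tier} FAIL {exc}")
        time.sleep(60)  # one probe per minute adds negligible load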
For single user kind of test, where you know you have problem (e.g. this tx is "slow"), I have also had pretty good luck with network tracing. It's very tedious, but when you aren't sure what tier is slow, starting up a network trace on a few machines and running a single tx usually gives a good idea of what the system is doing.
I have handled this decomposition a number of ways in the past. The first is at a very low level, using protocol-analyzer dumps to find the time points where a conversation leaves tier X and enters tier Y. The second method is log examination across the various tiers. Something that can make your examination quite useful in this case is a common log server for all of your components (syslog, rsyslog, etc.) and a nice log-parsing tool, such as the freely available Microsoft Log Parser. The third method is use of an application's audit trail stored in the database. You may find this when working on enterprise-service-bus-style applications, which have a consumer/producer model and a bus to pass information rather than a direct connection. The audit trails I have seen are typically stored in a database and allow the tracking of an individual transaction through the entire application infrastructure. Your load balancer, as a network device, may be out of the hunt on this one.
Note: if you go the protocol-analyzer or log route, be sure to synchronize all of your source information devices to a common time server. Having one of your collectors (analyzer, app log) off on a timestamp basis can be a real hair-pulling experience when you get to the analysis phase.
As to how you move from your collected data into LoadRunner, that part is very mechanical. The Analysis program supports an interface to import external datapoints. The format is very specific and is documented in both help and the online docs. This import process works very well, as I often have to use it for collection of statistics from hosts which I do not have direct monitoring access to, but which need to be included as a part of the monitored test infrastructure.
James Pulley
Moderator (YahooGroups LoadRunner, Advanced-Loadrunner; GoogleGroups lr-LoadRunner; Linkedin LoadRunner, LoadRunnerByTheHour; SQAForums LoadRunner, WinRunner)

Is there any easy-to-use cluster-building software?

Assume there are several computers distributed across the same network.
I install a program on all of them, and so there is a cluster,
and I can log in to it and run my applications (web server, DB server, and so on).
I don't need to configure the IPs and don't need to balance the load.
Is there any software like this now?
edit:
OK, I want to build a cluster that can provide an enterprise web server (and a DB server to store data). We have lots of PCs that currently run only a small program (for shop-floor workflow control). I want to use the additional CPU and disk resources to build a service.
What purpose are you planning to serve with your cluster? That will determine the tool you want to use.
That being said, you will have to do some configuration - IPs, authentication mechanism, et cetera. If you do not tell it what you want, how will it know?
In general, if the application is not designed to be clustered, you will have more pain than advantages.
Is the current load too high for the current single-box hardware?
