Measuring Apache Performance

As our servers get busier I'm increasingly interested in monitoring what's going on over time. Our host offers some crappy graphs that show CPU and memory usage over time, but they're not really telling me much.
What sort of high performance tools are available to accurately monitor Apache?

You can set up RRDtool to store data and create graphs from the data you send it. I suggest parsing data from the Apache logs and from the ps command.
You can also use the Cacti package as a front end to RRDtool.
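For example, a rough sketch of that collector in Python (the RRD file is assumed to already exist with two GAUGE data sources, and the paths are placeholders):

    import subprocess

    ACCESS_LOG = "/var/log/apache2/access.log"    # placeholder log path
    RRD_FILE = "/var/lib/rrd/apache.rrd"          # assumed to exist with two GAUGE DSes

    def count_requests(path):
        # Lines currently in the access log; a real collector would remember
        # its offset and only count lines added since the last run.
        with open(path) as f:
            return sum(1 for _ in f)

    def apache_memory_kb():
        # Sum resident memory (RSS, in KB) of all apache2 workers, taken from ps.
        out = subprocess.check_output(["ps", "-C", "apache2", "-o", "rss="], text=True)
        return sum(int(v) for v in out.split())

    def push_to_rrd(requests, rss_kb):
        # "N" means "now"; the values must match the order of the RRD's data sources.
        subprocess.check_call(["rrdtool", "update", RRD_FILE, f"N:{requests}:{rss_kb}"])

    if __name__ == "__main__":
        push_to_rrd(count_requests(ACCESS_LOG), apache_memory_kb())

Run it from cron every minute or so, then point rrdtool graph or Cacti at the resulting RRD file.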

Related

Spark Performance Monitoring

I have a requirement to show management/the client that the executor memory, number of cores, default parallelism, number of shuffle partitions, and other configuration properties used to run the Spark job are not excessive or more than required. I need a monitoring (and visualization) tool with which I can justify the memory usage of the Spark job. It should also report things like memory not being used properly, or a certain job requiring more memory.
Please suggest some application or tool.
LinkedIn has created a tool that sounds very similar to what you're looking for.
See this presentation for an overview of the product:
https://youtu.be/7KjnjwgZN7A?t=480
The LinkedIn team has open-sourced Dr. Elephant here:
https://github.com/linkedin/dr-elephant
Give it a try.
Note that this setup may require some manual tweaking of the Spark History Server as part of the initial integration to get the information that Dr. Elephant requires.
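In particular, Dr. Elephant works from job history, so the jobs themselves must write event logs that the History Server can serve. A minimal sketch of enabling that from PySpark (the HDFS path is a placeholder; the same properties can also go in spark-defaults.conf):

    from pyspark.sql import SparkSession

    # Event logging must be enabled for the History Server (and therefore
    # Dr. Elephant) to see the job; the log directory below is a placeholder.
    spark = (
        SparkSession.builder
        .appName("monitored-job")
        .config("spark.eventLog.enabled", "true")
        .config("spark.eventLog.dir", "hdfs:///shared/spark-event-logs")
        .getOrCreate()
    )

The History Server itself then needs spark.history.fs.logDirectory pointed at the same location.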

Long Running ETL Process - Background Jobs, Spark, Hadoop

I have a scenario in an application where:
I have to load data from multiple sources (more than 10)
Most sources are HTTP/JSON web services, and some are FTP
I have to process that data and put it into a central database (PostgreSQL)
The current implementation is done in Ruby using background jobs, but I see the following issues with it:
Very high memory usage
Jobs sometimes getting stuck without any error report
Horizontal scaling is tricky to set up
In this scenario, can Spark or Hadoop help in any way, or would they be a better option?
Please elaborate with some good reasoning.
Update:
As requested in the comments, I will elaborate further. Here are the reasons why I am considering Spark or Hadoop.
If we scale up the concurrency of the running jobs, that also puts heavy load on the DB server. I had read, though, that Spark and Hadoop are built to handle such heavy load, even on the DB side.
We can't run more background processes than the number of physical CPU cores (as recommended by the Ruby and Sidekiq communities).
Concurrency in Ruby is constrained by the GIL, so it is not real concurrency. Each job fetches from a single central data source, and if it gets stuck in an IO call, that source stays locked for it.
All of the above points are supposed to be handled by the built-in architecture of Hadoop and Spark, which is why I was thinking of looking into these tools.
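To make it concrete, here is a rough PySpark sketch of the kind of job I have in mind; the source URLs, table name, and connection details are placeholders, and the PostgreSQL JDBC driver would have to be on the Spark classpath:

    import json
    import urllib.request

    from pyspark.sql import Row, SparkSession

    # Placeholder HTTP/JSON sources; each is assumed to return a JSON array
    # of flat records with consistent fields.
    SOURCES = [
        "https://example.com/api/orders",
        "https://example.com/api/customers",
    ]

    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

    def fetch(url):
        # Each task fetches one source, so a slow IO call on one source
        # does not block the others.
        with urllib.request.urlopen(url, timeout=30) as resp:
            return json.load(resp)

    records = (
        spark.sparkContext
        .parallelize(SOURCES, len(SOURCES))
        .flatMap(fetch)
        .map(lambda rec: Row(**rec))
    )
    df = spark.createDataFrame(records)

    # Append into the central PostgreSQL database over JDBC (placeholder details).
    (df.write
       .format("jdbc")
       .option("url", "jdbc:postgresql://db-host:5432/warehouse")
       .option("dbtable", "staging.raw_records")
       .option("user", "etl")
       .option("password", "secret")
       .mode("append")
       .save())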
In my opinion, I would give Pentaho Data Integration (PDI) (or Talend) a try.
They are visual tools designed to solve problems like yours, and they have a free version downloadable from SourceForge (just unzip and run spoon.bat).
They can acquire data from FTP and HTTP (among others), decode JSON, and write to databases like Postgres. PDI has a free plug-in that can run Ruby code out of the box, so you can save on start-up development.
PDI also has ready-made Spark and Hadoop interfaces, so you can move to Hadoop/Spark servers transparently at a later stage if you need a closer-to-the-metal solution.
PDI was built for heavy data loads and gives you control over concurrency and remote servers.

Remote Execution in Ruby (Capistrano or MCollective) to collect cloud server performance metrics

I am looking for a way to collect data remotely from various cloud instances (EC2, Rackspace). The Rackspace API provides no way to collect server performance metrics (i.e. load average, CPU usage, memory), otherwise this would never have been asked.
I started looking at solutions like Capistrano or MCollective (I have also considered collectd), but I am unsure which one would best suit my application. I am trying to avoid using SSH keys for trending purposes (I don't want to have to keep logging in to collect these metrics). The script I am writing is a Ruby script which reboots a cloud server if its load average is over a certain number. Because these providers don't expose these metrics via their APIs, I am looking for a way to gather them myself. I am new to the Ruby community, and after going over the documentation for all of these tools I still haven't been able to get a sense of which framework would work best, or whether there are other alternatives.
It sounds like Capistrano is more suited to being a deployment tool; although it can perform remote tasks, after reading its documentation it was pretty much out for the purposes of my script.
MCollective looks really attractive for what I am trying to do, but it seems I would have to write my own RPC-style plugin for this purpose.
I've also considered plugging into some larger monitoring system such as Nagios, Munin, Zenoss, or Hyperic, but I'd rather not install a big, general-purpose monitoring system when all I want to collect is a few simple metrics.
If your intention is to trigger certain actions based on system performance (like restarting when CPU usage is too high), you should check out god.
I'm not sure whether it is also useful when you want to generate performance statistics over a longer period. Personally, I'm using Munin for this, but if you don't like it maybe you can find something on Ruby Toolbox | Server Monitoring.

Is SQLite suitable for use as a read only cache on a web server?

I am currently building a high-traffic GIS system which uses Python on the web front end. The system is 99% read-only. In the interest of performance, I am considering using an externally generated cache of pre-generated, read-optimised GIS information, stored in an SQLite database on each individual web server. In short, it's going to be used as a distributed read-only cache which doesn't have to hop over the network. The back-end OLTP store will be PostgreSQL, but that will handle less than 1% of the requests.
I have considered using Redis, but the dataset is quite large, so it would push up the administrative and memory cost of the virtual machines this is being hosted on. Memcached is not suitable, as it cannot do range queries.
Am I going to hit read-concurrency problems with SQLite doing this?
Is this a sensible approach?
OK, after much research and performance testing, SQLite is suitable for this. It has good request concurrency on static data. SQLite only becomes an issue if you are doing writes as well as heavy reads.
More information here:
http://www.sqlite.org/lockingv3.html
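For reference, this is roughly how each web process opens the cache read-only in Python (path and schema are placeholders); read-only connections only ever take shared locks, so concurrent readers don't block one another:

    import sqlite3

    # Placeholder path to the pre-generated, read-optimised cache file.
    CACHE_DB = "/var/cache/gis/tiles.db"

    def open_cache():
        # The URI form with mode=ro opens the database strictly read-only.
        return sqlite3.connect(f"file:{CACHE_DB}?mode=ro", uri=True)

    def features_in_bbox(conn, min_x, min_y, max_x, max_y):
        # The kind of range query memcached can't do (placeholder schema).
        cur = conn.execute(
            "SELECT id, geom FROM features "
            "WHERE x BETWEEN ? AND ? AND y BETWEEN ? AND ?",
            (min_x, max_x, min_y, max_y),
        )
        return cur.fetchall()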
If the use case is just a cache, why don't you use something like http://memcached.org/?
You can find memcached bindings for Python in the PyPI repository.
Another option is to use materialized views in Postgres; this way you keep things simple and have everything in one place.
http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views
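A minimal sketch of that approach with psycopg2 and PostgreSQL's native materialized views (9.3+); the view name, query, and connection string are placeholders, and the view itself is assumed to have been created once with CREATE MATERIALIZED VIEW:

    import psycopg2

    # Placeholder connection string for the central PostgreSQL instance.
    DSN = "dbname=gis user=web host=db-host"

    def refresh_cache_view():
        # Recompute the read-optimised view; run from cron or after bulk loads.
        with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
            cur.execute("REFRESH MATERIALIZED VIEW gis_read_cache")

    def features_in_bbox(min_x, min_y, max_x, max_y):
        # The web tier only ever reads from the precomputed view.
        with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT id, geom FROM gis_read_cache "
                "WHERE x BETWEEN %s AND %s AND y BETWEEN %s AND %s",
                (min_x, max_x, min_y, max_y),
            )
            return cur.fetchall()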

Measuring Application Performance

I was wondering if there is a tool to keep track of application performance. What I have in mind is a tool that will listen for updates and register performance metrics published by an application, i.e. the time to serve a request, or the time a certain operation took to finish. The tool would then aggregate the data and measure performance trends.
If you want to measure your application from outside, then you can use RRDtool to collect the data.
You can use SLAMD for web apps written in Java.
For Django, use hotshot.
Search for a profiler for your language or framework.
Take a look at HP SiteScope. Its ability to drive the system with a web user script, to monitor the metrics on the back end (even to the extent of custom shell scripts and database queries), plus the ability to add logic for reporting/alerting against these combined data sets, appears to be what you need.
Other mechanisms you might consider would be rolling your own service that uses cURL to push information in, queries against the systems involved to pull metrics or database information, and then your own interface for alerting and reporting; a rough sketch of the push half is below.
Then it becomes a cost question: can you build that level of functionality for less than the price of an existing solution on the open market?
Ref:
HP SiteScope Wiki Page
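As a rough sketch of the roll-your-own push approach (the collector endpoint and metric names are placeholders, and the collector service that stores and graphs what it receives is assumed to exist):

    import time
    import requests

    # Placeholder endpoint of your own metrics collector service.
    COLLECTOR_URL = "http://metrics.internal/api/metrics"

    def report(metric, value):
        # Push one measurement; the collector is expected to timestamp,
        # store, and aggregate whatever it receives.
        requests.post(COLLECTOR_URL, json={"metric": metric, "value": value}, timeout=5)

    def timed(metric):
        # Decorator: publish how long the wrapped operation took.
        def wrap(fn):
            def inner(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return fn(*args, **kwargs)
                finally:
                    report(metric, time.perf_counter() - start)
            return inner
        return wrap

    @timed("serve_request_seconds")
    def serve_request(request):
        ...  # the operation whose duration is being measured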
