How to speed up OpenGrok indexing - performance

Lately I was asked by my boss to explore OpenGrok's possibilities for the company I'm working for. I started with a few projects on my VirtualBox Lubuntu VM; it worked OK, but rather slowly, which I blamed on my laptop's mediocre specs.
Now I have a much bigger virtual machine and I'm also indexing a larger volume of data (an SVN repository with 100 different projects, some of them with multiple branches, tags and trunk, about 100,000 files in total, a few GB in size). All files are checked out directly into SRC_ROOT.
I was hoping for reasonably fast indexing, but it's been running for more than five days now. I can see multiple threads running via htop, but CPU usage is 0.5-2.5% and memory usage 0.9%, so I guess it's not an issue of computing power. Unless the HDDs are terribly slow, I don't know what the problem is.
Furthermore, the indexing process seems to be slowing down: at the beginning it was roughly 1 sec/file, now it is about 5 sec/file. Unfortunately I haven't enabled the progress option, so I have no idea how long it's still going to run.
Any ideas on how to make indexing faster, or how to use resources more effectively? The current speed is simply unusable.

I think an easy way to improve performance is to run the OpenGrok indexer with JAVA_OPTS set appropriately and with 64-bit Java.
Also, using Derby for storing the generated index data can increase performance.
More info about how to use and set up OpenGrok:
https://github.com/OpenGrok/OpenGrok/blob/master/README.txt#L862
https://java.net/projects/opengrok/lists/discuss/archive/2013-03/thread/1#00000
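For illustration, the kind of invocation this amounts to is roughly the following. This is only a sketch: it assumes the OpenGrok wrapper script honours the JAVA_OPTS variable as the README above describes, and the heap sizes and source path are placeholders to adjust for your machine.

# give the 64-bit JVM a generous heap so the indexer is not starved for memory
export JAVA_OPTS="-server -Xms2g -Xmx8g"
./OpenGrok index /path/to/SRC_ROOT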

I think the problem is SVN. Try to debug and improve the speed of SVN access from your VM, or disable SVN (temporarily) altogether to get a fast index; you can then add history to the index later, gradually, per project, even if it takes a few days (see the options on how to run the indexer per project).
Or, if you can mirror the SVN repo and make local SVN calls, that should give you a boost too.
So to conclude: OpenGrok can detect SVN, skip history creation (and let you enable it later), and just index the checkout; the history can then be added locally afterwards, to avoid long waits while history is generated on the fly.
That said, Git and Mercurial seem to work well with OpenGrok in terms of history indexing.

I've been running into this myself, and I've found that the indexer is spending most (>90%) of its time querying the source control systems.
That said, some of the projects I work with do use Perforce and SVN, so I don't want to disable those entirely; what I've done instead is index twice -- first with all the options that involve source control disabled, and then again with everything enabled.
That way, it still takes a long time (several days, in my case), but at least I have a usable index up and running in a few hours, and then it can spend days working out all the history.
Subsequent indexes should be faster, as I would expect that the historycache is only updated for files that are newer than the cached history.
(That said, it would be nice if I could update the historycache externally so it's all ready to go before I start the indexer at all, and have the indexer configured to not look up history information at all, but instead to just index what's cached)

Related

Solr Caching Update on Writes

I've been looking at potential ways to speed up Solr queries for an application I'm working on. I've read about Solr caching (https://wiki.apache.org/solr/SolrCaching), and I think the filter and query caches may be of some help. The application's config does set up these caches, but apparently with default settings that were never experimented with, and our cache hit rate is relatively low.
One detail I've not been able to determine is how the caches deal with updates. If I update records that would result in removing or adding that record from the query or filter cache, do the caches update in a performant way? The application is fairly write-heavy, so whether the caches update in a conducive manner or not will probably determine whether trying to tune the caches will help much.
The short answer is that an update (add, edit, or delete) on your index followed by a commit operation rebuilds the index and replaces the current index. Since caches are associated with a specific index version, they are discarded when the index is replaced. If autowarming is enabled, then the caches in the new index will be primed with recent queries or queries that you specify.
However, this is Solr that we're talking about and there are usually multiple ways to handle any situation. That is definitely the case here. The commit operation mentioned above is known as a hard commit and may or may not be happening depending on your Solr configuration and how your applications interact with it. There's another option known as a soft commit that I believe would be a good choice for your index. Here's the difference...
A hard commit means that the index is rebuilt and then persisted to disk. This ensures that changes are not lost, but is an expensive operation.
A soft commit means that the index is updated in memory and not persisted to disk. This is a far less expensive operation, but data could conceivably be lost if Solr is halted unexpectedly.
Going a step further, Solr has two nifty settings known as autoCommit and autoSoftCommit which I highly recommend. You should disable all hard commit operations in your application code if you enable auto commit. The autoCommit setting can specify a period of time to queue up document changes (maxTime) and/or the number of changes to allow in the queue (maxDocs). When either of these limits is reached, a hard commit is performed. The autoSoftCommit setting works the same way, but results in (you guessed it) a soft commit. Solr's documentation on UpdateHandlers is a good starting point to learn about this.
These settings effectively make it possible to do batch updates instead of one at a time. In a write-heavy application such as yours, this is definitely a good idea. The optimal settings will depend upon the frequency of reads vs writes and, of course, the business requirements of the application. If near-real-time (NRT) search is a requirement, you may want autoSoftCommit set to a few seconds. If it's acceptable for search results to be a bit stale, then you should consider setting autoSoftCommit to a minute or even a few minutes. The autoCommit setting is usually set much higher as its primary function is data integrity and persistence.
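As a rough sketch, the corresponding section of solrconfig.xml (inside the update handler) looks something like the following; the maxTime values here are only illustrative starting points, not recommendations tuned for your workload:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- hard commit: flush to disk every 60s, without opening a new searcher -->
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <!-- soft commit: make new documents visible to searches within ~5s -->
    <maxTime>5000</maxTime>
  </autoSoftCommit>
</updateHandler>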
I recommend a lot of testing in a non-production environment to decide upon reasonable caching and commit settings for your application. Given that your application is write-heavy, I would lean toward conservative cache settings and you may want to disable autowarming completely. You should also monitor cache statistics in production and reduce the size of caches with low hit rates. And, of course, keep in mind that your optimal settings will be a moving target, so you should review them periodically and make adjustments when needed.
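For reference, conservative cache settings along those lines would look something like this in the <query> section of solrconfig.xml (the sizes and autowarmCount=0 are illustrative of the "small caches, no autowarming" idea, not tuned values):

<!-- small caches, no autowarming; grow them only if production hit rates justify it -->
<filterCache class="solr.FastLRUCache" size="256" initialSize="256" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="256" initialSize="256" autowarmCount="0"/>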
On a related note, the Seven Deadly Sins of Solr is a great read and relevant to the topic at hand. Best of luck and have fun with Solr!

When 777ms are still not good enough (MariaDB tuning)

Over the past couple of months, I've been on a rampage optimising a Joomla website that I'm managing. When I first started, the homepage used to open in around 30-40 seconds, in spite of repeatedly upgrading my dedicated server, as suggested by the hosting firm.
I was able to bring the pagespeed down to around 800ms by religiously following all the recommendations of the likes of GTmetrix and Pingdom Tools (such as using JCH Optimize, .htaccess caching and compression settings, and MaxCDN), but now I'm stuck optimising my my.cnf settings, trying various values suggested in a number of related articles. The fastest I'm getting the homepage to open with the current settings is 777ms after a refresh, which might not sound too bad, but look at the configuration of my dedicated server:
2 quad-core CPUs, 128GB RAM, 2x480GB SSD in RAID
CloudLinux/Cpanel/WHM
Apache/suEXEC/PHP5/FastCGI
MariaDB 10.0.17 (all tables converted to XtraDB/InnoDB)
The site traffic is moderate, between 10,000 and 20,000 visitors per day, with around 200,000 pageviews.
These are the current my.cnf settings. My goal is to bring the pagespeed down to under 600ms, which should be possible with this kind of hardware, provided it is tuned the right way.
[mysqld]
local-infile=0
max_connections=10000
max_user_connections=1000
max_connect_errors=20
key_buffer_size=1G
join_buffer_size=1G
bulk_insert_buffer_size=1G
max_allowed_packet=1G
slow_query_log=1
slow_query_log_file="diskar/mysql-slow.log"
long_query_time=40
connect_timeout=120
wait_timeout=20
interfactive_timeout=25
back_log=500
query_cache_type=1
query_cache_size=512M
query_cache_limit=512K
query_cache_min_res_unit=2K
sort_buffer_size=1G
thread_cache_size=16
open_files_limit=10000
tmp_table_size=8G
thread_handling=pool-of-threads
thread_stack=512M
thread_pool_size=12
thread_pool_idle_timeout=500
thread_cache_size=1000
table_open_cache=52428
table_definition_cache=8192
default-storage-engine=InnoDB
[innodb]
memlock
innodb_buffer_pool_size=96G
innodb_buffer_pool_instances=12
innodb_additional_mem_pool_size=4G
innodb_log_bugger_size=1G
innodb_open_files=300
innodb_data_file_path=ibdata1:400M:autoextend
innodb_use_native_aio=1
innodb_doublewrite=0
innodb_user_atomic_writes=1
innodb_flus_log_at_trx_commit=2
innodb_compression_level=6
innodb_compression_algorithm=2
innodb_flus_method=O_DIRECT
innodb_log_file_size=4G
innodb_log_files_in_group=3
innodb_buffer_pool_instances=16
innodb_adaptive_hash_index_partitions=16
innodb_thread_concurrency
innodb_thread_concurrency=24
innodb_write_io_threads=24
innodb_read_io_threads=32
innodb_adaptive_flushing=1
innodb_flush_neighbors=0
innodb_io_capacity=20000
innodb_io_capacity_max=40000
innodb_lru_scan_depth=20000
innodb_purge_threads=1
innodb_randmon_read_ahead=1
innodb_read_io_threads=64
innodb_write_io_threads=64
innodb_use_fallocate=1
innodb_use_atomic_writes=1
inndb_use_trim=1
innodb_mtflush_threads=16
innodb_use_mfflush=1
innodb_file_per_table=1
innodb_file_format=Barracuda
innodb_fast_shutdown=1
I tried Memcached and APCu, but that didn't work; the site actually runs 2-3 times faster with 'Files' as the caching handler in Joomla's Global Configuration. And yes, I ran mysqltuner, but that was of no help.
I am a newbie as far as Linux is concerned and suspect that the above settings could be improved. Any comments and/or suggestions?
long_query_time=40
Set that to 1 so you can find out what the slow queries are.
max_connections=10000
That is unreasonably high. If you come anywhere near it, you will have more problems than failure to connect. Let's say only 3000.
query_cache_type=1
query_cache_size=512M
The Query cache is hurting performance by being so large. This is because any write causes all QC entries for the table to be purged. Recommend no more than 50M. If you have heavy writes, it might be better to change the type to DEMAND and pepper your SELECTs with SQL_CACHE (for relatively static tables) or SQL_NO_CACHE (for busy tables).
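With query_cache_type=DEMAND that looks roughly like this (the table names are made up for illustration; in DEMAND mode only statements marked SQL_CACHE go into the cache):

SELECT SQL_CACHE name FROM country_lookup WHERE id = 3;    -- relatively static table: cache it
SELECT SQL_NO_CACHE * FROM orders WHERE user_id = 42;      -- busy table: bypass the cache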
What OS?
Are the entries in [innodb] making it into the system? I thought these needed to be in [mysqld]. Check by doing SHOW VARIABLES LIKE 'innodb%';.
Ah, buggers; spelling errors:
innodb_log_bugger_size=1G
innodb_flus_log_at_trx_commit=2
inndb_use_trim=1
and more??
After you get some data in the slowlog, run pt-query-digest, and let's discuss the top couple of queries.
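Pulling the above together, here is a sketch of just the lines discussed (with the misspelled variable names written as presumably intended; everything else stays as it is):

[mysqld]
long_query_time=1                   # log anything over 1s so pt-query-digest has data to work with
max_connections=3000                # down from 10000
query_cache_size=50M                # down from 512M; or use query_cache_type=DEMAND with SQL_CACHE/SQL_NO_CACHE
# corrected spellings of the mistyped variables spotted so far:
innodb_log_buffer_size=1G           # was innodb_log_bugger_size
innodb_flush_log_at_trx_commit=2    # was innodb_flus_log_at_trx_commit
innodb_flush_method=O_DIRECT        # was innodb_flus_method
innodb_use_trim=1                   # was inndb_use_trim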

XPages performance - 2 apps on same server, 1 runs and 1 doesn't

We have been having a bit of a nightmare this last week with a business-critical XPages application. All of a sudden it has started crawling really badly, to the point where I have to reboot the server daily, and even then some pages can take 30 seconds to open.
The server has 12GB of RAM and 2 CPUs; I am waiting for another 2 to be added to see if that helps.
The database has around 100,000 documents in it, with no more than 50,000 displayed in any one view.
The same database, set up as a training application with far fewer documents on the same server, always responds, even when the main copy is crawling.
There are a number of view panels in this application - I have read these are really slow. Should I get rid of them and replace them with Repeat controls?
There are also Readers fields on the documents containing roles, and Authors fields, as it's a workflow application.
I removed quite a few unnecessary views from the back end over the weekend to help speed it up but that has done very little.
Any ideas where I can check to see what's causing this massive performance hit? It's only really become unworkable in the last week but as far as I know nothing in the design has changed, apart from me deleting some old views.
Try to get more info about the state of your server and application.
Hardware troubleshooting is summarized here: http://www-10.lotus.com/ldd/dominowiki.nsf/dx/Domino_Server_performance_troubleshooting_best_practices
Since, in your experience, only one of the two applications has slowed down, it is more likely a code problem. The best thing is to profile your code: http://www.openntf.org/main.nsf/blog.xsp?permaLink=NHEF-84X8MU
To go deeper, you can start to look for semaphore locks: http://www-01.ibm.com/support/docview.wss?uid=swg21094630, look at Java dumps: http://lazynotesguy.net/blog/2013/10/04/peeking-inside-jvms-heap-part-2-usage/ and NSDs: http://www-10.lotus.com/ldd/dominowiki.nsf/dx/Using_NSD_A_Practical_Guide/$file/HND202%20-%20LAB.pdf, and check the garbage collector settings (see "Best setting for HTTPJVMMaxHeapSize in Domino 8.5.3 64 Bit").
This presentation gives a good overview of Domino troubleshooting (among many others on the web).
OK, so we resolved the performance issues by doing a number of things. I'll list the changes we made in order of the improvement gained, starting with the simple tweaks that weren't really noticeable.
Defragged the Domino drive - it was showing as 32% fragmented and I thought I was on to a winner, but it was really no better after the defrag, even though IBM docs say even 1% fragmentation can cause performance issues.
Reviewed all the main code in the application and took out a number of needless lookups where they could be replaced with applicationScope variables. For instance, on the search page one of the drop-downs gets its choices by doing an @Unique lookup on all documents in the database. Changed it to a keyword and put that in the applicationScope (see the sketch at the end of this answer).
Removed multiple checks on database.queryAccessRoles and put the user's roles in a sessionScope variable.
The DB had 103,000 documents - 70,000 of them were tiny little docs with about 5 fields on them. They don't need to be full-text indexed, so we moved them into a separate database and pointed the data source at that DB when these docs were needed. The FT index went from 500MB to 200MB, which meant faster indexing and searches, but the overall performance of the app was still rubbish.
The big one - I finally got around to checking the application properties, Advanced tab. I set the following options:
Optimize document table map (ran a copy-style compact)
Don't overwrite free space
Don't support specialized response hierarchy
Use LZ1 compression (ran a copy-style compact with the option to convert existing attachments, -ZU)
Don't allow headline monitoring
Limit entries in $UpdatedBy and $Revisions to 10 (as per the Domino documentation)
And also don't allow the use of stored forms.
Now I don't know which one of these options was the biggest gain, and not all of them will be applicable to your own apps, but after doing this the application flies! It's running as if there were no documents in there at all: views load super fast, documents open like they should - quickly - and everyone is happy.
Until the HTTP threads get locked out, that is - that's another question of mine that I am about to post, so please take a look if you have any idea of what's going on :-)
Thanks to all who have suggested things to try.
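For reference, the applicationScope caching mentioned above was along these lines - an SSJS sketch with a hypothetical view name, not the exact code from the app:

// compute the drop-down choices once and cache them in applicationScope,
// instead of running the @Unique lookup on every page load
if (applicationScope.searchChoices == null) {
    applicationScope.searchChoices = @Unique(@DbColumn(@DbName(), "luSearchChoices", 1));
}
return applicationScope.searchChoices;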

Rails, how to migrate large amount of data?

I have a Rails 3 app running an older version of Spree (an open source shopping cart), and I am in the process of updating it to the latest version. This requires me to run numerous migrations on the database to make it compatible with the latest version. However, the app's current database is roughly 300MB, and running the migrations on my local machine (Mac OS X 10.7, 4GB RAM, 2.4GHz Core 2 Duo) takes over three days to complete.
I was able to decrease this time to only 16 hours using an Amazon EC2 instance (High-I/O On-Demand Instances, Quadruple Extra Large). But 16 hours is still too long as I will have to take down the site to perform this update.
Does anyone have any other suggestions to lower this time? Or any tips to increase the performance of the migrations?
FYI: using Ruby 1.9.2, and Ubuntu on the Amazon instance.
Dropping indices beforehand and adding them again afterwards is a good idea.
Also replacing .where(...).each with .find_each and perhaps adding transactions could help, as already mentioned.
Replace .save! with .save(:validate => false): during the migrations you are not getting random input from users, you should be making known-good updates, and validations account for much of the execution time. Using .update_attribute would also skip validations where you're only updating one field.
Where possible, use fewer AR objects in a loop. Instantiating and later garbage collecting them takes CPU time and uses more memory.
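As a sketch of what that adds up to in practice (the model and column names here are hypothetical, not taken from Spree):

# hypothetical data-migration body: batch the reads, wrap them in one transaction, skip validations
Order.transaction do
  Order.where(:legacy_total => nil).find_each do |order|  # find_each loads records in batches
    order.legacy_total = order.line_items.sum(:price)
    order.save(:validate => false)                        # known-good data, so skip validations
  end
end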
Maybe you have already considered this:
Tell the DB not to bother making sure everything is on disk (no WAL, no fsync, etc.); you then effectively have an in-memory DB, which should make a very big difference. (Since you have taken the DB offline, you can just restore from a backup in the unlikely event of power loss or similar.) Turn fsync/WAL back on when you are done.
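If the database happens to be PostgreSQL, the knobs in question would be something like the following in postgresql.conf (an assumption on my part - MySQL has equivalent settings); remember to turn them back on afterwards:

# temporarily trade durability for speed while the site is offline
fsync = off
synchronous_commit = off
full_page_writes = off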
It is likely that you can do some of the migrations before you take the DB offline. Test this in a staging environment, of course. That big user migration might very well be possible to do live. Make sure you don't run it in a transaction; you might need to modify the migrations a bit.
I'm not familiar with your exact situation, but I'm sure there are even more things you can do if this isn't enough.
This answer is more about approach than a specific technical solution. If your main criterion is minimum downtime (and data integrity, of course), then the best strategy is not to use Rails at all!
Instead, you can do all the heavy work up front and leave just the critical "real time" data migration (I'm using "migration" in the non-Rails sense here) as a step during the switchover.
So you have your current app with its db schema and the production data. You also (presumably) have a development version of the app based on the upgraded Spree gems with the new db schema but no data. All you have to do is figure out a way of transforming the data between the two. This can be done in a number of ways, for example using pure SQL and temporary tables where necessary, or using SQL and Ruby to generate insert statements. These steps can be split up so that data that is fairly "static" (reference tables, products, etc.) can be loaded into the db ahead of time, and the data that changes more frequently (users, sessions, orders, etc.) can be done during the migration step.
You should be able to script this export-transform-import procedure so that it is repeatable and have tests/checks after it's complete to ensure data integrity. If you can arrange access to the new production database during the switchover then it should be easy to run the script against it. If you're restricted to a release process (eg webistrano) then you might have to shoe-horn it into a rails migration but you can run raw SQL using execute.
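For the raw-SQL route, that last part looks roughly like this (a hypothetical migration; the table and column names are made up):

class BackfillOrderTotals < ActiveRecord::Migration
  def up
    # one set-based UPDATE instead of instantiating an ActiveRecord object per row
    execute <<-SQL
      UPDATE orders
      SET total = (SELECT SUM(price) FROM line_items WHERE line_items.order_id = orders.id)
    SQL
  end
end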
Take a look at this gem.
https://github.com/zdennis/activerecord-import/
# build the records in memory without saving them one by one
data = []
data << Order.new(:order_info => 'test order')
# import writes them in a single bulk INSERT instead of one INSERT per record
Order.import data
Unfortunately, the downvoted solution is the only one. What is really slow in Rails is the ActiveRecord models; they are not suited for tasks like this.
If you want a fast migration, you will have to do it in SQL.
There is another approach, but you will always have to rewrite most of the migrations...

Optimizing Build in ClearCase Dynamic View

I'm trying to optimize my workflow as I still spend quite some time waiting for the computer when it should be the other way 'round IMO.
I'm supposed to hand in topical branches implementing a single feature or fixing a single bug, along with a full build log and regression test report. The project is huge, it takes about 30 minutes to compile on a fairly modern machine when compiling in a snapshot view.
My current workflow thus is to do all development work in a single snapshot view, and when a feature is ready for submission, I create a new dynamic view, merge the relevant changes from the snapshot and start the build/testing procedure overnight.
In a dynamic view, a full build takes about six hours, which is a major PITA, so I'm looking for a way to improve these figures. I've toyed with the cache settings, but that doesn't seem to make much difference. I'm currently pondering writing a script that will create a snapshot view with the same spec as the dynamic view, fetch the files into it and build there, but before I do that I wonder if there is a better way of improving my build times.
Can I somehow make MVFS cache all retrieved objects locally (I have plenty of both hard disk space and RAM), ideally sharing the cache between multiple dynamic views (as I build feature branches, most files are bound to be identical between two different branches)?
Is there any other setting I could tune to speed up local builds?
Am I doing it wrong (i.e. is there a better workflow for me, considering that snapshot views take about one hour to create)?
Considering that you can have a dynamic view and a snapshot view with the same config spec, I would really recommend:
having a dynamic view ready for merge operation
then, once the merge is done, updating your snapshot view (no need to recreate it from scratch, which takes too much time. Just launch an update)
That way, you get the best of both worlds:
easy and quick merges within the dynamic view
"fast"(er) compilation within the snapshot view dedicated for that step.
Even if the config spec might have to change in your case (if you really have to use one view per branch), you still can change the config spec of an existing snapshot view (and still benefit from an incremental update), rather than recreating a snapshot view for each branch you need to compile on.
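In cleartool terms, that per-branch cycle boils down to something like the following (a sketch; the config spec file path is hypothetical):

# run from within the existing snapshot view's root directory
cleartool update                                     # same config spec: incremental update only
cleartool setcs /path/to/feature_branch.configspec   # new branch: change the config spec (this also triggers an update)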
