Rebuilding Indexes for Embedded Derby - derby

I'm using an embedded Java database (Derby) to hold bus schedules. When a new schedule is made available I automatically load the new schedule into database tables and then delete old schedules from the database. This happens automatically without user intervention.
I have noticed that the database slows down over time. I have a script which drops and rebuilds the indexes (there are 10 of them), and after running it performance improves significantly. Currently I manually stop the system, run the script and then restart the system.
The question is: is there a way to rebuild all 10 indexes from within the Java code? If there is, I would do it immediately after deleting the old schedules.

SYSCS_UTIL.SYSCS_COMPRESS_TABLE will rebuild indexes. You can call this procedure on the important tables during off-hours; it should not need to be done frequently. It is documented in the Derby reference manual.
However, before doing this I would make sure that the slow queries aren't the result of full-table scans, i.e. check that you're not missing an index.
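If you want to check for full-table scans, Derby can report the query plan of the last statement executed on a connection. The sketch below is only illustrative; the table and column names (STOP_TIMES, TRIP_ID) are hypothetical placeholders, not from the original post.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class QueryPlanCheck {

    /** Prints the runtime statistics (query plan) for one suspect query. */
    public static void printPlan(Connection db) throws SQLException {
        try (Statement s = db.createStatement()) {
            // turn on statement statistics for this connection
            s.execute("CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1)");
        }
        try (PreparedStatement ps =
                 db.prepareStatement("SELECT * FROM STOP_TIMES WHERE TRIP_ID = ?")) {
            ps.setString(1, "some-trip-id");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) { /* drain the result set */ }
            }
        }
        try (Statement s = db.createStatement();
             ResultSet rs = s.executeQuery(
                 "VALUES SYSCS_UTIL.SYSCS_GET_RUNTIMESTATISTICS()")) {
            if (rs.next()) {
                // a "Table Scan ResultSet" entry here suggests a missing index
                System.out.println(rs.getString(1));
            }
        }
    }
}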

Based on the above suggestion and the documentation, the following method works. db is the database connection (type java.sql.Connection). Note that the table names need to be upper-cased to be found.
public void rebuildIndexes(String[] feedTables) throws SQLException {
    logger.log(Level.INFO, "Reclaiming unused database table space and rebuilding indexes");
    for (int i = feedTables.length - 1; i >= 0; i--) {
        // Derby stores unquoted identifiers in upper case, so the table name
        // must be upper-cased to be found.
        String feedTable = feedTables[i].toUpperCase();
        logger.log(Level.INFO, String.format("  Rebuilding table %s", feedTable));
        CallableStatement cs = db.prepareCall("CALL SYSCS_UTIL.SYSCS_COMPRESS_TABLE(?, ?, ?)");
        cs.setString(1, "APP");          // schema
        cs.setString(2, feedTable);      // table name
        cs.setShort(3, (short) 1);       // non-zero = sequential compress (less temporary space)
        cs.execute();
        cs.close();
    }
    logger.log(Level.INFO, "Reclaim and rebuild finished");
}
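As a usage note, a hedged sketch of how this might be wired into the automatic load cycle, in the same class as rebuildIndexes above; deleteOldSchedules and FEED_TABLES are hypothetical names, not from the original code.
// Hypothetical feed table names; adjust to the real schedule tables.
private static final String[] FEED_TABLES = { "trips", "stop_times", "calendar" };

// Hypothetical caller: compress the feed tables right after old schedules are
// deleted, so space is reclaimed and the indexes are rebuilt without a restart.
public void purgeAndCompact() throws SQLException {
    deleteOldSchedules();          // hypothetical: delete the superseded schedules
    rebuildIndexes(FEED_TABLES);   // the method above: compress tables, rebuild indexes
}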

Related

RowCallbackHandler loads rows into memory

I need to query a big dataset from the DB. I'm going to use pagination parameters (limit and offset) to avoid loading the whole dataset into the heap. For that purpose I'm trying to fetch rows with the RowCallbackHandler interface, because the docs say "An interface used by JdbcTemplate for processing rows of a ResultSet on a per-row basis." and I've also read advice to use that interface to process rows one by one.
But something goes wrong every time I try to fetch data. My code is below, together with a screenshot from VisualVM showing the heap space graph, which indicates that all rows were loaded into memory. The query I'm trying to execute returns around 1.5 million rows.
// here just the sql query, a map with parameters for the query, and a pretty simple RowCallbackHandler
jdbcTemplate.query(queryForExecute, params, new RowCallbackHandler() {
    @Override
    public void processRow(ResultSet rs) throws SQLException {
        // note: calling rs.next() here is the mistake mentioned in the update below
        while (rs.next()) {
            System.out.println("test");
        }
    }
});
Heap via VisualVM: (screenshot not reproduced here; it shows the heap filling up as all rows are loaded)
Update: I made a mistake by calling rs.next(), but removing that line didn't change the situation: the rows are still all loaded into memory.
The main problem was with my understanding of the documentation. The doc says:
An interface used by JdbcTemplate for processing rows of a ResultSet on a per-row basis.
My code actually does things the right way: it gives me a ResultSet which contains all rows (because no limit is defined). I wasn't confident that adding LIMIT to an arbitrary SQL query would work well, so I decided to implement the limiting via the RowCallbackHandler, and that was a bad idea, because LIMIT works fine with all kinds of SQL queries, complex and simple.
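A minimal sketch of the LIMIT/OFFSET approach described above, assuming a NamedParameterJdbcTemplate; the table and column names (big_table, id, payload) and the page-size handling are illustrative, not from the original post.
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import org.springframework.jdbc.core.RowCallbackHandler;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

public class PagedReader {

    private final NamedParameterJdbcTemplate jdbcTemplate;

    public PagedReader(NamedParameterJdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    /** Reads big_table page by page, so only one page sits in the heap at a time. */
    public void readAll(int pageSize) {
        // LIMIT/OFFSET as understood by MySQL/PostgreSQL; adjust the syntax for other databases
        String sql = "SELECT id, payload FROM big_table ORDER BY id LIMIT :limit OFFSET :offset";
        int offset = 0;
        while (true) {
            Map<String, Object> params = new HashMap<>();
            params.put("limit", pageSize);
            params.put("offset", offset);

            final AtomicInteger rowsInPage = new AtomicInteger();
            jdbcTemplate.query(sql, params, new RowCallbackHandler() {
                @Override
                public void processRow(ResultSet rs) throws SQLException {
                    rowsInPage.incrementAndGet();
                    // process one row here, e.g. rs.getString("payload")
                }
            });

            if (rowsInPage.get() < pageSize) {
                break; // last (possibly short) page reached
            }
            offset += pageSize;
        }
    }
}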

In memory database, with hibernate and periodically persisting to an actual db

I would like to use an in-memory DB with Hibernate, so my queries are super quick.
Moreover, I would like to periodically persist that in-memory state to a real MySQL DB.
Of course the in-memory database should load its initial content on startup from that MySQL DB.
Are there any good frameworks/practices for that purpose? (I'm using Spring.) Any tutorials or pointers would help.
I'll be honest with you: most decent databases can be considered in-memory to an extent, given that they cache data and try not to hit the disk any more often than they have to. In my experience the best in-memory databases are either caches, or amalgamations of other data sources that are already persisted in some other form and are then updated live for time-critical information, or refreshed periodically for non-time-critical information.
Loading data into memory from a cold start is potentially a lengthy process, but subsequent queries will be super quick.
If you are trying to cache what's already persisted you can look at memcache, but in essence in-memory databases always rely on a more persistent source, be it MySQL, SQL Server, Cassandra, MongoDB, you name it.
So it's a little unclear what you're trying to achieve; suffice to say it is possible to bring data in from persistent databases and keep a massive in-memory cache, but you need to design around how stale certain data can get, and how often you need to hit the real source for up-to-the-second results.
Actually the simplest approach would be to use some core Hibernate features for this: use the Hibernate Session itself and combine it with the second-level cache.
Declare the entities you want to cache as @Cacheable:
@Entity
@Cacheable
@Cache(usage = CacheConcurrencyStrategy.NONSTRICT_READ_WRITE)
public class SomeReferenceData { ... }
Then implement the periodic flushing like this, supposing you are using JPA:
Open an EntityManager.
Load the entities you want to cache using that entity manager and no other.
Keep the entity manager open until the next periodic flush; Hibernate keeps track of which instances of SomeReferenceData were modified in memory via its dirty-checking mechanism, but no modification queries are issued yet.
Reads are served from the second-level cache instead of hitting the database.
When the moment comes to flush the session, just begin a transaction and commit immediately.
Hibernate will update the modified entities in the database, update the second-level cache and resume execution.
Eventually close the entity manager and replace it with a new one if you want to reload everything from the database;
otherwise keep the same entity manager open.
Code example; try this to see the overall idea:
public class PeriodicDBSynchronizeTest {

    @Test
    public void testSynch() {
        // create the entity manager, and keep it
        EntityManagerFactory factory = Persistence.createEntityManagerFactory("testModel");
        EntityManager em = factory.createEntityManager();

        // kept in memory due to @Cacheable
        SomeReferenceData ref1 = em.find(SomeReferenceData.class, 1L);
        SomeReferenceData ref2 = em.find(SomeReferenceData.class, 2L);
        SomeReferenceData ref3 = em.find(SomeReferenceData.class, 3L);
        ...

        // modifications are tracked but not committed
        ref1.setCode("005");

        // these two lines flush the modifications to the database
        em.getTransaction().begin();
        em.getTransaction().commit();

        // continue using the ref data, tracking modifications until the next flush
        ...
    }
}
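One caveat worth adding: @Cacheable only takes effect if a second-level cache provider is actually configured. Below is a hedged sketch of enabling it programmatically, assuming Hibernate 4/5 with the hibernate-ehcache module; the property values may differ for other versions or providers, and "testModel" is the persistence unit name from the example above.
import java.util.HashMap;
import java.util.Map;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class CacheBootstrap {

    public static EntityManagerFactory createFactory() {
        Map<String, Object> props = new HashMap<>();
        // enable the second-level cache
        props.put("hibernate.cache.use_second_level_cache", "true");
        // cache provider; this class name assumes the hibernate-ehcache module is on the classpath
        props.put("hibernate.cache.region.factory_class",
                  "org.hibernate.cache.ehcache.EhCacheRegionFactory");
        return Persistence.createEntityManagerFactory("testModel", props);
    }
}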

Updating Solr Index when product data has changed

We are working on implementing Solr on an e-commerce site. The site is continuously updated with new data, either through updates to existing product information or by adding new products altogether.
We are using it in an ASP.NET MVC3 application with SolrNet.
We are facing an issue with indexing. We currently commit using the following:
private static ISolrOperations<ProductSolr> solrWorker;

public void ProductIndex()
{
    // check whether the connection instance has been initialised or not
    if (solrWorker == null)
    {
        Startup.Init<ProductSolr>("http://localhost:8983/solr/");
        solrWorker = ServiceLocator.Current.GetInstance<ISolrOperations<ProductSolr>>();
    }
    var products = GetProductIdandName();
    solrWorker.Add(products);
    solrWorker.Commit();
}
Although this is just a simple test application where we insert only the product name and id into the Solr index, every time it runs the new products all get added at once and are available when we search. I think this creates the data index in Solr anew every time it runs? Correct me if I'm wrong.
My questions are:
Does this recreate the Solr index data as a whole, or does it just update the data that is changed/new? How? Even if it only updates changed/new data, how does it know which data has changed? With a large data set, this must cause some issues.
What is the alternative way to track what has changed since the last commit, and is there any way to add only those products to the Solr index that have changed?
What happens when we update an existing record in Solr? Does it delete the old data, insert the new data and recreate the whole index? Is this resource intensive?
How do big e-commerce retailers do this with millions of products?
What is the best strategy to solve this problem?
When you do an update, only that record is deleted and re-inserted; Solr does not update records in place, and the other records are untouched. When you commit the data, new segments are created containing the new data. On optimize, the data is merged into a single segment.
You can use an incremental build technique to add/update records after the last build. The DataImportHandler (DIH) provides this out of the box; if you are handling it manually through jobs, you can maintain a timestamp and run delta builds, as sketched below.
Solr does not have an update operation; it performs a delete and an add, so you have to send the complete document again and not just the updated fields. This is not particularly resource intensive; usually only commit and optimize are.
Solr can handle any amount of data. You can use sharding if your data grows beyond the capacity of a single machine.
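A rough sketch of the timestamp-based delta approach, shown here with SolrJ purely for illustration (the question uses SolrNet, where the same pattern applies); the core URL, field names and the Product holder are hypothetical.
import java.io.IOException;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DeltaIndexer {

    /** Minimal product holder for the sketch. */
    public static class Product {
        public final String id;
        public final String name;
        public Product(String id, String name) { this.id = id; this.name = name; }
    }

    private final SolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();

    /**
     * Pushes only the products that changed since the last build. The caller selects
     * them from the database with something like "WHERE last_modified > :lastBuildTime"
     * (column name hypothetical) and records the new watermark after a successful commit.
     */
    public void indexChangedProducts(List<Product> changedSinceLastBuild)
            throws SolrServerException, IOException {
        for (Product p : changedSinceLastBuild) {
            SolrInputDocument doc = new SolrInputDocument();
            // re-using the same unique key makes Solr delete the old document and add this one
            doc.addField("id", p.id);
            doc.addField("name", p.name);
            solr.add(doc);
        }
        solr.commit();
    }
}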

nhibernate - archiving records

This is a simplified view of our domain model (we are in healthcare):
Account
{
    List<Registration> Registrations {...}
    DateTime CreatedDate {...}
    Type1 Property1 {...}
    Type2 Property2 {...}
    ...
}

Registration
{
    InsuranceInformation {...}
    PatientVisit {...}
    Type1 Property1 {...}
    Type2 Property2 {...}
    ...
}
Setup
We use NHibernate/FluentNH to set up and configure the SessionFactory,
and let's assume that we have set up all the required table indices.
Usage
We get ~10,000 new accounts a day.
We have about 500K accounts in total.
We have several LINQ queries that operate over these accounts.
All our queries use LINQ; most queries are built dynamically using the predicate builder pattern (we don't use HQL).
The problem is that as the number of accounts increases, the execution time of these queries increases.
Note: only accounts that are within a 48-hour window are relevant for our queries / application. Older accounts, however, need to be preserved (so they cannot be deleted); even though these accounts are not needed by the application, they may be consumed later by an analytics application.
To solve this performance issue, we are considering archiving accounts that are older than 48 hours:
Creating an archive database with the same schema as the main DB.
Adding a Windows service, scheduled to run on a nightly basis, that moves "old" accounts from the main DB to the archive DB.
The Windows service will use NHibernate to read old accounts from the main DB, save them (again using NHibernate) to the archive database, and then delete them from the main DB. Right now we think this service will move one account at a time until all the old accounts have been moved to the archive database.
Occasionally, when we do get a request to restore an account from the archive DB, we will reverse the above steps.
Questions:
Is this archival approach any good? If not, why not? Can you suggest some alternative implementations?
Can I use the same SessionFactory to connect to the main DB and the archive DB during the copy process? How can I change the connection string dynamically? Can I have two simultaneous open sessions that work with the two databases?
Can I copy more than one account at a time using this approach, i.e. batch copies and batch deletes?
Any help appreciated, thank you for your input.
I think your issue is more database-related than NHibernate-related. A database with 500K records is not that much. To optimize access you should think about how you query and how to optimize for those queries:
Query only the data you need.
Optimize your tables by adding indexes.
Use the 20/80 rule: find the 20% most expensive queries and optimize that code and those queries; your program will be 80% faster.
NHibernate: optimize your mappings.
NHibernate: use batching if you do multiple updates.
Add stored procedures if something is hard to do in code.
If your DB grows, hire a DB expert to advise on database optimization (they can improve your performance by 10% to 90%). You need them for a few days at first and then once a week/month depending on how much work there is.

Entity framework and performance

I am trying to develop my first web project using Entity Framework. While I love the way you can use LINQ instead of writing SQL, I do have some severe performance issues. I have a lot of unprocessed data in a table which I would like to run a few transformations on and then insert into another table. I run through all the objects and then insert them into my new table. I need to do some small comparisons (which is why I need to insert the data into another table), but for the performance tests I have removed them. The following code (which sets approximately 12-15 properties) took 21 seconds, which is quite a long time. Is it usually this slow, and what might I be doing wrong?
DataLayer.MotorExtractionEntities mee = new DataLayer.MotorExtractionEntities();
List<DataLayer.CarsBulk> carsBulkAll = ((from c in mee.CarsBulk select c).Take(100)).ToList();

foreach (DataLayer.CarsBulk carBulk in carsBulkAll)
{
    DataLayer.Car car = new DataLayer.Car();
    car.URL = carBulk.URL;
    car.color = carBulk.SellerCity.ToString();
    car.year = // ... more properties are set this way
    mee.AddToCar(car);
}
mee.SaveChanges();
You cannot create batch updates using Entity Framework.
Imagine you need to update rows in a table with a SQL statement like this:
UPDATE table SET col1 = @a WHERE col2 = @b
Using SQL, this is just one round trip to the server. Using Entity Framework you have (at least) one round trip to the server to load all the data, then you modify the rows on the client, and then it sends them back row by row.
This slows things down, especially if your network connection is limited and you have more than just a couple of rows.
So for this kind of update a stored procedure is still a lot more efficient.
I have been experimenting with the entity framework quite a lot and I haven't seen any real performance issues.
Which part of your code is causing the big delay? Have you tried debugging it and measuring which method takes the most time?
Also, the complexity of your database structure could slow Entity Framework down a bit, but not to the extent you are describing. Are there some 'infinite loops' in your DB structure? Without seeing the DB structure it is really hard to say what's wrong.
Can you try the same in straight SQL?
The problem might be related to your database and not the Entity Framework. For example, if you have massive indexes and lots of check constraints, inserting can become slow.
I've also seen problems with inserts on databases which had never been backed up. The transaction log could not be reclaimed and was growing insanely, causing a single insert to take a few seconds.
Trying this in SQL directly would tell you if the problem is indeed with EF.
I think I solved the problem. I have been running the app locally while the database is in another country (a neighbouring one, but nevertheless). I tried deploying the application to the server and running it from there, and it then took only 2 seconds to run instead of 20. I then tried to transfer 1,000 records, which took 26 seconds; that is quite an improvement, though I don't know whether this is the "regular" speed for saving 1,000 records to the database?
