Scalability of open source XML Databases - exist-db

We are looking to develop a reporting application that reports on data stored in a large number of XML files: ~3,000,000 files ranging in size from 7KB to 5MB (each file conforms to the same schema). I’m guessing that there will be around 200GB of XML. I’m looking at a number of open source XML databases (Sedna, BaseX and eXist-db) and I’m not sure how well these systems will scale. I read a comparison of these three databases here, which is where my concerns about scalability originated.
Some details regarding what we want to do: we won’t be changing the data in any of the XML files, and new files will be added daily. Since we are concerned with reporting, query performance is important to us; the time it takes to add and index new files isn’t a high priority.
I’m wondering if anyone has experience using these systems at similar scales? I’ve looked at the BaseX statistics page and see some fairly large XML instances but no mention of performance.
We don’t require an open source product and the MarkLogic system looks like it can fit the bill nicely, but I’m curious what’s been done with open source products.

I think it is impossible to answer your question with a simple yes or no. It is really impossible to state anything about performance from the few details that you have given.
Performance is typically based on the queries that you want to perform and the distribution of your data. Not to mention, what you consider to be "acceptable".
In the paper you referenced, it is interesting to note that they state that they could not get the new range indexes in eXist 2.2 preview to work. Certainly without those, they would have seen much worse performance. Also, at the end they state that they will select Sedna as they can overcome the problems with Sedna; it was not clear to me why that was, i.e. do they have C++ devs that can work with Sedna but no Java devs that could work with eXist or BaseX? Finally, the version of Java they used for testing eXist and BaseX is rather old; the next release of eXist (3.0) will only support Java 8 and newer.
I would be surprised if you could not store 200GB of data into BaseX, eXist or Sedna, but without knowing your data and the sort of queries you want to execute, I cannot comment on query performance.
I think you would be best to do a small trial of either one or all, in a manner not dissimilar to that linked article.

Just want to share my experience on this topic. My experience is limited to much smaller data sets - that is roughly about 50k documents of about 1GB total size. We use Sedna XML DB for this purpose. We do not change documents but rather overwrite existing documents when changes occur and have a lot of read-only XQueries including big reports.
In short, my opinion is that Sedna won't work for you unless you find a way to replicate it to another server to be used for reading. I have experienced major performance issues related to collection locks under a rather moderate load on the database when performing some long-running reporting XQueries. As far as I know, Sedna does not offer replication capabilities, but you can probably adopt some solution on top of Sedna. For example, quick googling revealed some research in this area. You can try asking on the Sedna mailing list. Other disadvantages are the lack of XQuery 3.0 support and seemingly frozen further development. However, the support is still quite active on the mailing list.
I also have some experience with eXist-db, but I use it more as an XML processing and pipelining platform than as XML storage. Still, it looks a bit more promising in relation to scaling. Although I have not used its replication capabilities, they are mentioned in the docs. I suggest you try searching/asking on the mailing list as well.


Temporary Dashboard/Reporting Solution while Building a Data Warehouse

Our situation is that we are going to start to build a data warehouse. The data warehouse is going to take some time, if we are going to do it right. It will be built looking at individual processes and growing from there.
We only have three databases that we will be pulling data from. All three databases hold distinct information (financial info, scheduling and patient information - visits, diagnosis,etc).
I am thinking of using a dashboard/reporting tool to display the information to the business. It will slowly start incorporating the DW as it is being designed and pushed to production.
My question after all this is: What is the best way to present the data to the application (dashboard/reporter) in the backend that would be efficient, yet not so time-consuming that I'd rather build the data warehouse? I.e. views, materialized views, a small separate DB containing subset data from the main DBs, etc.?
This may not be answering your question directly, but rather than find a temporary solution I would just build your warehouse faster.
First, if you can build it quickly then you don't need a temporary one; if you can't build it quickly then you won't be able to build a temporary solution quickly either. You even mentioned developing a "small separate DB containing subset data"; that's exactly what a reporting database is!
Second, any temporary solution will have to be maintained and supported too: if it's too useful then your temporary solution will become your permanent one anyway. That might actually be a good thing because if the 'temporary' solution meets your requirements then why not keep it?
Anyway, I would start by identifying one or two key reports that have high value for your users and commit to delivering them in 2 months (1 month would be even better). Develop the most basic, minimal database and ETL/reporting processes possible to deliver those reports, even if it seems like a horrible, hacked-together mess. Make sure the reports are internal ones that no one will send to an outside customer; that means you can avoid spending time on making them pretty.
After you've delivered those reports, you can now step back and look at what you did. Hopefully you will find yourself in a position where:
1. Your users got some useful reports very quickly
2. The reports are ugly but the numbers are correct
3. You've learned a lot about the users' needs and how they interpret and use the data
4. Your technical implementation is a mess, but you know that and you also know how to improve it
If #1 and #2 are true then you'll have delivered a lot of business value quickly while also setting the user expectation that correct is often more valuable than pretty (that's really helpful on a reporting project). If #3 and #4 are true then your second iteration will be a big improvement on the first one and even if you find yourself in the worst case scenario of having to re-develop the whole thing from scratch, you'll do it faster and better because you've learned so much.
This is simply agile development, of course: there's no reason you can't use rapid prototyping and incremental delivery in a data warehouse project. Like any IT solution the warehouse will continuously grow and be maintained over time so there's absolutely no reason to try to get everything complete and correct in the first version. It's highly likely that your users don't even really know what they want (in detail) so this approach helps to clarify their expectations and requirements more quickly too.

Tracing ORM performance

This isn't a question of "which is the fastest ORM", nor is it a question on "how to write good code with ORMs". This is the other side: the code's been written, it's gone live, several thousand users are hitting the application, but there's a perceived overall performance problem. A SQL Profiler trace can only be run for a short amount of time: 5 mins gives several hundred thousand results.
The question is simply this: having used SQL Profiler to narrow down a number of slow queries (duration greater than a given amount of time), what techniques and solutions exist for tracing these SQL queries back to the problematic component? A related question is: if a specific area is slow, how can we identify the SQL that this area is executing so it can be suitably filtered in SQL Profiler?
The background to this is that we have a rather large application with a fairly complex table structure, which is currently based around data access via stored procedures. If a SQL performance problem arises, it's usually a case of pulling out SQL Profiler, finding out if there's anything slow (filter by duration) or if the area being complained about is slow (filter by stored procedure), and tuning the stored procedures (or the schema, through indexing).
Now there's a push to move our code over from a mostly-sproc solution to a mostly-ORM solution, however the big push against the move is how performance problems, if they arise, can be traced back to problematic code. I've read around and it seems that more often than not, it may involve third-party tools (ORM tracing utilities like NHProf or .NET tracing utils like dottrace) that we'd need to install on the server. Now whether additional tools can be installed on a live environment is another question, so if things like this can be performed without additional tools, then that may be a bonus.
I'm mostly interested in solutions with SQL Server 2008, but it's probably generic enough for any RDBMS. As far as the ORM tech, on this I have no specific focus as nothing's currently in use, so be interested to hear how techniques differ (or are common) twixt nHibernate, fluent-nhibernate and Entity Framework. Other ORMs are welcome though if they offer something else :-)
I've read through How to find and fix performance problems (...), and I think the issue is simply the section on there that says "isolate". A problem that is easily reproducible only on a live system is going to be difficult to isolate. The figures I quoted in para 2 are also the kinds of volumes that we can get from a profile...
If you have real-world experience of ORM tracing on live, so much the better :-)
Update, 2016-10-21: Just for completeness, we eventually solved this for NHibernate by writing code, and overriding NHibernate methods. Full details in this other SO question I asked: NHibernate and Interceptors - measuring SQL round trip times. I expect this will be a similar approach for many different ORMs.
There exist profilers for ORM tools, like UberProf. They find out which SQL statements generated by the ORM can be problematic,
like the select n+1 problem, for instance. These kinds of tools might give you an indication of which ORM query statements result in poor SQL code, and perhaps even how you could improve them.
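To make the select n+1 problem concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than a real ORM. The schema (`orders`, `order_lines`) and all function names are invented for illustration; the `log` list merely stands in for what a profiler trace would show.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a tiny in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
        CREATE TABLE order_lines (order_id INTEGER, item TEXT);
        INSERT INTO orders VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
        INSERT INTO order_lines VALUES (1, 'pen'), (1, 'ink'), (2, 'pad'), (3, 'cap');
        """
    )
    return conn


def run(conn, log, sql, params=()):
    """Execute a statement and record it, standing in for a profiler trace."""
    log.append(sql)
    return conn.execute(sql, params).fetchall()


def load_n_plus_one(conn, log):
    """One query for the parent rows, then one extra query per parent."""
    result = {}
    for order_id, customer in run(conn, log, "SELECT id, customer FROM orders ORDER BY id"):
        lines = run(
            conn, log,
            "SELECT item FROM order_lines WHERE order_id = ? ORDER BY item",
            (order_id,),
        )
        result[customer] = [item for (item,) in lines]
    return result


def load_with_join(conn, log):
    """The same data fetched with a single joined query."""
    result = {}
    rows = run(
        conn, log,
        "SELECT o.customer, l.item FROM orders o "
        "JOIN order_lines l ON l.order_id = o.id ORDER BY o.id, l.item",
    )
    for customer, item in rows:
        result.setdefault(customer, []).append(item)
    return result
```

With three orders, the first function issues four statements and the second issues one; the gap grows linearly with the number of parent rows, which is why this pattern tends to show up as many fast queries rather than one slow one.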
We had a Java/Hibernate app with issues, so we used SET CONTEXT_INFO with a different value for each module. If we saw, say, 0x14 on the same SPID just before a WTF query, we could narrow it to module x.
Not being a Java guy, I don't know exactly what they did, and of course it may not apply to .net. IIRC you have to be careful about when connections are opened/closed
We could also control the client load at this time so we didn't have too much superfluous traffic.
YMMV of course, but it may be useful
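The answer above does not show any code, so the following is only a rough sketch of how a data-access layer might tag a session before running a module's queries. `SET CONTEXT_INFO` is a real T-SQL statement that accepts a binary value; the module names, marker values, and helper functions are all hypothetical, and a recording stub replaces a real DB-API cursor so the sketch runs without a server.

```python
# Hypothetical mapping of application modules to one-byte markers.
MODULE_MARKERS = {"invoicing": 0x14, "scheduling": 0x15}


def context_info_sql(module: str) -> str:
    """Return the T-SQL statement that tags the current session."""
    marker = MODULE_MARKERS[module]
    return f"SET CONTEXT_INFO 0x{marker:02X}"


def run_tagged(cursor, module: str, sql: str, params=()):
    """Tag the session, then run the query on the same connection (SPID)."""
    cursor.execute(context_info_sql(module))
    cursor.execute(sql, params)


class RecordingCursor:
    """Stand-in for a DB-API cursor so the sketch runs without a server."""

    def __init__(self):
        self.statements = []

    def execute(self, sql, params=()):
        self.statements.append(sql)


cursor = RecordingCursor()
run_tagged(cursor, "invoicing", "SELECT 1")
print(cursor.statements)
```

In a trace, the marker statement would then appear on the same SPID immediately before the tagged query. As noted above, care is needed around when connections are opened and closed, since the marker belongs to the session rather than to the calling code.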
I just found these which could be useful too
Temporary tables, sessions and logging in SQL Server?
Why is my CONTEXT_INFO() empty?

How to represent data in an efficient way ? (Graphically Talking)

Before you read further, just to let you know: this question is vague and does not need one precise answer. On the contrary, the more answers I get, the better it will be for me.
The question is: how to represent data in an efficient way?
I am not talking about representing data in a database or in any language.
I am talking about when a program, a report, or a page needs to be shown to a user (static, such as reports, and dynamic, such as web pages): how should one represent the data so that the user catches as much information as possible from (almost) the first look? Are there any best practices, pitfalls to avoid, and so on?
Edit: Any books/links that can help or that deal with this subject are welcome.
"how one should represent the data in order to the user to catch as many information as possible from - almost - the first look."
To me, this screams that you need to be speaking to your end-users more. My suggestion would be to mock up the initial layout using something like Balsamiq Mockups (This can be done even if it's a public facing site). Using the mockups will help you visualise the design of the overall page.
"First-look" type views indicate a dashboard which provides overall, high-level results.
Now, just to be clear, this is the design and layout of the page; don't confuse this with any web UI tools, e.g. jQuery UI, that bring fancy effects to the page.
In terms of links, my suggestion would be to thoroughly read through Designing User Interfaces For Business Web Applications from Smashing Magazine (incl. the related links). The one that is probably most relevant is 12 Standard Screen Patterns.
It is a brilliant read and should be, IMO, added to your saved bookmarks.
Effectiveness always matters more than efficiency. Before I express my opinions, I will assume that your question is already based on a solution that is effective from the user's perspective.
First, data retrieval is about the storage of the computer system. If your data can reside entirely in the fastest storage (such as main memory), keeping it there is a better strategy than any other. But performance problems mostly arise because there is not enough main memory, so data has to be retrieved from (slower) secondary storage, replacing other data in main memory, to produce what you want. So you have to deal with multi-level storage systems.
Second, when you are dealing with multi-level storage systems (as in most computer systems), efficiency depends on how much you can reduce access to secondary storage. It's not only about the gain from loading data from slower storage into faster storage; there is also a cost when other data gets evicted.
In XML, DOM and SAX are the two extremes of dealing with multi-level storage systems. In database systems, fully cached indexes are a good solution for performance (when the indexes are small enough). In operating systems, the file cache has always been one of the most challenging things in computer science.
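As a small illustration of the DOM/SAX contrast, here is a sketch using only Python's standard library. The sample document and the helper names are made up for the example: the DOM version builds the whole tree in memory before querying it, while the SAX version sees one element at a time and keeps only a running total.

```python
import io
import xml.dom.minidom
import xml.sax

# Hypothetical sample data; a real report source would be far larger.
SAMPLE = b"<report><row v='1'/><row v='2'/><row v='3'/></report>"


def sum_with_dom(data: bytes) -> int:
    """DOM: the entire tree is held in memory before we can query it."""
    doc = xml.dom.minidom.parseString(data)
    return sum(int(row.getAttribute("v")) for row in doc.getElementsByTagName("row"))


class RowSummer(xml.sax.ContentHandler):
    """SAX: each element is handled as it streams past, then forgotten."""

    def __init__(self):
        super().__init__()
        self.total = 0

    def startElement(self, name, attrs):
        if name == "row":
            self.total += int(attrs.getValue("v"))


def sum_with_sax(data: bytes) -> int:
    """Stream the document through the handler and return the running total."""
    handler = RowSummer()
    xml.sax.parse(io.BytesIO(data), handler)
    return handler.total


print(sum_with_dom(SAMPLE), sum_with_sax(SAMPLE))
```

Both functions return the same sum; the difference is that memory use in the DOM version grows with document size, whereas the SAX version stays roughly constant at the price of a less convenient programming model.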
You can pre-calculate some data before it is required. You can use more efficient data structures to improve data retrieval. You can simply allocate more main memory to your application. You can... well, buy more memory modules or an SSD. Whatever solution you choose, it's definitely an art of fusion in computer science.
Algorithms, data structures, database systems, operating systems, even compiler theory: these hard metals can help you build a sword which kicks the dragon's ass.

Choosing a strategy for BI module

The company I work for produces a content management system (CMS) with various add-ons for publishing, e-commerce, online printing, etc. We are now in the process of adding a "reporting module" and I need to investigate which strategy should be followed. The "reporting module" is otherwise known as Business Intelligence, or BI.
The module is supposed to be able to track item downloads, executed searches and produce various reports out of it. Actually, it is not that important what kind of data is being churned as in the long term we might want to be able to push whatever we think is needed and get a report out of it.
Roughly speaking, we have two options.
Option 1 is to write a solution based on Apache Solr. Pros of this approach:
free / open source / good quality
we use Solr/Lucene elsewhere so we know the domain quite well
total flexibility over what is being indexed as we could take incoming data (in XML format), push it through XSLT and feed it to Solr
total flexibility of how to show search results. Similar to step above, we could have custom XSLT search template and show results back in any format we think is necessary
our frontend developers are proficient in XSLT so fitting this mechanism for a different customer should be relatively easy
Solr offers realtime / full text / faceted search which are absolutely necessary for us. A quick prototype (based on Solr, 1M records) was able to deliver search results in 55ms. Our estimated maximum is about 1bn rows (this isn't a lot for a typical BI app) and if worst comes to worst, we can always look at SolrCloud, etc.
there are companies doing very similar things using Solr (Honeycomb Lexicon, for example)
Cons of this approach:
SOLR-236 might or might not be stable, moreover, it's not yet clear when/if it will be released as a part of official release
there would possibly be some stuff we'd have to write to get some BI-specific features working. This sounds a bit like reinventing the wheel
the biggest problem is that we don't know what we might need in the future (such as integration with some piece of BI software, export to Excel, etc.)
Option 2 is to do an integration with some free or commercial piece of BI software. So far I have looked at Wabit and will have a look at QlikView, possibly others. Pros of this approach:
no need to reinvent the wheel, software is (hopefully) tried and tested
would save us time we could spend solving problems we specialize in
Cons of this approach:
as we are a Java shop and our solution is cross-platform, we'd have to eliminate a lot of options which are in the market
I am not sure how flexible BI software can be. It would take time to go through some BI offerings to see if they can do flexible indexing, real time / full text search, fully customizable results, etc.
I was told that open source BI offers are not mature enough whereas commercial BIs (SAP, others) cost fortunes, their licenses start from tens of thousands of pounds/dollars. While I am not against commercial choice per se, it will add up to the overall price which can easily become just too big
not sure how well BI is made to work with schema-less data
I am definitely not the best candidate to find the most appropriate integration option in the market (mainly because of a lack of knowledge in the BI area); however, a decision needs to be made fast.
Has anybody been in a similar situation and could advise on which route to take, or even better - advise on possible pros/cons of the option #2? The biggest problem here is that I don't know what I don't know ;)
I have spent some time playing with both QlikView and Wabit, and, have to say, I am quite disappointed.
I had an expectation that the whole BI industry actually has some science under it but from what I found this is just a mere buzzword. This MSDN article was actually an eye opener. The whole business of BI consists of taking data from well-normalized schemas (they call it OLTP), putting it into less-normalized schemas (OLAP, snowflake- or star-type) and creating indices for every aspect you want (industry jargon for this is data cube). The rest is just some scripting to get the pretty graphs.
OK, I know I am oversimplifying things here. I know I might have missed many different aspects (nice reports? export to Excel? predictions?), but from a computer science point of view I simply cannot see anything beyond a database index here.
I was told that some BI tools support compression. Lucene supports that, too. I was told that some BI tools are capable of keeping all index in the memory. For that there is a Lucene cache.
Speaking of the two candidates (Wabit and QlikView): the first is simply immature (I got dozens of exceptions when trying to step outside of what was suggested in their demo) whereas the other only works under Windows (not very nice, but I could live with that) and the integration would likely require me to write some VBScript (yuck!). I had to spend a couple of hours on QlikView forums just to get a simple date range control working, and failed because the Personal Edition I had did not support the downloadable demo projects available on their site. Don't get me wrong, they're both good tools for what they have been built for, but I simply don't see any point in integrating with them as I wouldn't gain much.
To address the (arguable) immaturity of Solr I will define an abstract API so I can move all the data to a database which supports full text queries if anything goes wrong. And if worst comes to worst, I can always write stuff on top of Solr/Lucene if I need to.
If you're truly in a scenario where you're not sure what you don't know, I think it's best to explore an open-source tool and evaluate its usefulness before diving into your own implementation. It could very well be that using the open-source solution will help you further crystallise your own understanding and required features.
I had worked previously w/ an open-source solution called Pentaho. I seriously felt that I understood a whole lot more by learning to use Pentaho's features for my end. Of course, as is the case of working w/ most of the open-source solutions, Pentaho seemed to be a bit intimidating at first, but I managed to get a good grip of it in a month's time. We also worked with Kettle ETL tool and Mondrian cubes - which I think most of the serious BI tools these days build on top of.
Earlier, all these components were independent, but of late I believe Pentaho took ownership of all these projects.
But once you're confident w/ what you need and what you don't, I'd suggest building some basic reporting tool of your own on top of a mondrian implementation. Customising a sophisticated open-source tool can indeed be a big issue. Besides, there are licenses to be wary of. I believe Pentaho is GPL, though you might want to check on that.
First you should make clear what your reports should show. Which reporting features do you need? Which output formats do you want? Do you want to show it in the browser (HTML), as PDF, or with an interactive viewer (Java/Flash)? Where is the data (database, Java, etc.)? Do you need ad-hoc reporting or only some hard-coded reports? These are only some of the questions.
Without answers to these questions it is difficult to give a real recommendation, but my general recommendation would be i-net Clear Reports (used to be called i-net Crystal-Clear). It is a Java tool. It is a commercial tool, but the cost is lower than SAP and co.

How to approach performance issues?

We are developing a client-server desktop application (WinForms with SQL Server 2008, using LINQ to SQL). We are now finding many issues related to performance. These relate to querying too much data with LINQ, bad database design, not much caching, etc. What do you suggest we should do? How should we go about solving these performance issues? One thing I am doing is SQL profiling and trying to fix some queries. As far as caching is concerned, we have static lists. But how do we keep them updated? We don't have any server-side implementation, so these lists can become stale if someone changes data.
Performance analysis without tools is fruitless, with the wrong tools frustrating. SQL Profiler is the wrong tool to rely on for what you are looking at. I think it is at best giving you a hint of what is wrong.
You need to use a code profiler to determine why/when these queries are being executed. You should be able to find one by Googling and run it on an x-day trial.
The key questions are:
1. Are queries being run multiple times when there is no reason to at all? Is the data already in memory (even if not stored statically)? This happens a lot where data has already been retrieved but some action in the code loads it again. Class properties are a big culprit here.
2. Should certain data be stored statically across the application? How volatile is that data? Can you afford to show stale data?
The only way to decide on #2 is to have hard data to examine the cost of a particular transaction. For example, if I know it takes me 1983 ms to create a new invoice, what will it be after I start caching data? After caching, are the savings significant? But recognize you can't answer that question until you know it takes 1983 ms to create an invoice.
When I profile an application transaction I focus on the big contributor and try to determine why it is so big. I look for individual methods that are slow and for any code that is executed frequently. It is often the latter, the death of a thousand cuts, that gets you.
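To show the kind of thing meant by class properties reloading data, here is a small Python sketch (the original application is .NET, so this is only an analogy). `FakeDb` and the two screen classes are hypothetical; the counter stands in for what a code profiler would report as repeated calls.

```python
from functools import cached_property


class FakeDb:
    """Hypothetical stand-in for a data layer; counts round trips."""

    def __init__(self):
        self.queries = 0

    def load_customers(self):
        self.queries += 1
        return ["Acme", "Globex"]


class InvoiceScreenNaive:
    def __init__(self, db):
        self.db = db

    @property
    def customers(self):
        # Every access to the property hits the database again.
        return self.db.load_customers()


class InvoiceScreenCached:
    def __init__(self, db):
        self.db = db

    @cached_property
    def customers(self):
        # Loaded on first access, then kept on the instance.
        return self.db.load_customers()


naive_db, cached_db = FakeDb(), FakeDb()
naive, cached = InvoiceScreenNaive(naive_db), InvoiceScreenCached(cached_db)
for _ in range(3):
    naive.customers
    cached.customers
print(naive_db.queries, cached_db.queries)
```

Whether the cached variant is acceptable depends on the answer to the second question above, i.e. how volatile the data is and whether stale values can be tolerated.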
And I wanted to add this, it is also very important to know when to stop working on a performance issue.
I found Jeff Atwood's articles on this quite interesting:
Compiled Or Bust
All Abstractions Are Failed Abstractions
For updating, you can create a Table. I called it ListVersions.
Just store list id, name and version.
When you make changes to a list, just increment its version. In your application, you'll just need to compare versions and update only if the version has changed. Update only the lists whose version has been incremented, not all of them.
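A minimal sketch of this versioning idea, using Python and an in-memory SQLite database in place of SQL Server. The `ListVersions` table follows the description above; the `ListItems` table, the sample data, and the class and function names are assumptions made for the example.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create the version table plus a hypothetical table of list contents."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ListVersions (list_id INTEGER PRIMARY KEY, name TEXT, version INTEGER);
        CREATE TABLE ListItems (list_id INTEGER, value TEXT);
        INSERT INTO ListVersions VALUES (1, 'countries', 1);
        INSERT INTO ListItems VALUES (1, 'France'), (1, 'Spain');
        """
    )
    return conn


def add_item(conn, list_id, value):
    """Writer side: change the list and bump its version in one transaction."""
    with conn:
        conn.execute("INSERT INTO ListItems VALUES (?, ?)", (list_id, value))
        conn.execute(
            "UPDATE ListVersions SET version = version + 1 WHERE list_id = ?",
            (list_id,),
        )


class ListCache:
    """Client side: reload a list only when its stored version has changed."""

    def __init__(self, conn):
        self.conn = conn
        self.versions = {}
        self.items = {}
        self.reloads = 0

    def refresh(self):
        rows = self.conn.execute("SELECT list_id, version FROM ListVersions").fetchall()
        for list_id, version in rows:
            if self.versions.get(list_id) != version:
                values = self.conn.execute(
                    "SELECT value FROM ListItems WHERE list_id = ? ORDER BY value",
                    (list_id,),
                ).fetchall()
                self.items[list_id] = [v for (v,) in values]
                self.versions[list_id] = version
                self.reloads += 1
```

Calling `refresh()` costs one cheap query against `ListVersions` when nothing has changed; the full list is re-read only after `add_item` (or any other writer) has bumped its version.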
I've described it in my answer to this question
What is the preferred method of refreshing a combo box when the data changes?
Good Luck!
A general recipe for performance issues:
Measure (wall clock time, CPU time, memory consumption etc.)
Design & implement an algorithm that you think could be faster than current code.
Measure again to assess the impact of your fix.
Many times the biggest bottlenecks aren't exactly where you thought they were. So, base your actions on measured data.
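The measure / change / measure-again loop can be sketched in a few lines of Python. The `measure` helper and the two `*_unique` functions are invented for illustration; they only show the shape of the comparison, not any code from the application in question.

```python
import time


def measure(func, *args, repeats: int = 5):
    """Return (result, best wall-clock seconds, best CPU seconds) over several runs."""
    best_wall = best_cpu = float("inf")
    result = None
    for _ in range(repeats):
        wall0, cpu0 = time.perf_counter(), time.process_time()
        result = func(*args)
        best_wall = min(best_wall, time.perf_counter() - wall0)
        best_cpu = min(best_cpu, time.process_time() - cpu0)
    return result, best_wall, best_cpu


def slow_unique(values):
    """'Current' code: quadratic duplicate removal."""
    out = []
    for v in values:
        if v not in out:
            out.append(v)
    return out


def fast_unique(values):
    """Candidate fix: same result, using a dict to preserve order."""
    return list(dict.fromkeys(values))


data = list(range(500)) * 2
before, wall_before, _ = measure(slow_unique, data)
after, wall_after, _ = measure(fast_unique, data)
assert before == after  # the fix must not change the result
print(f"before: {wall_before:.6f}s  after: {wall_after:.6f}s")
```

Checking that both versions return the same result before comparing timings keeps a "faster" fix from quietly being a wrong one.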
Try to keep the number of SQL queries small. You're more likely to get performance improvements by lowering the number of queries than by restructuring the SQL syntax of an individual query.
I recommend adding some server-side logic instead of firing the SQL queries directly from the client. You could implement caching on the server side, shared by all clients.
