Temporary Dashboard/Reporting Solution while Building a Data Warehouse

Our situation is that we are going to start to build a data warehouse. The data warehouse is going to take some time, if we are going to do it right. It will be built looking at individual processes and growing from there.
We only have three databases that we will be pulling data from. All three databases hold distinct information (financial info, scheduling, and patient information: visits, diagnoses, etc.).
I am thinking of using a dashboard/reporting tool such as http://www.jedox.com/en/ or http://www.board.com/us/ (as examples) to display the information to the business. It will gradually start incorporating the DW as it is being designed and pushed to production.
My question after all this is: what is the best way to present the data to the application (dashboard/reporting tool) on the back end that is efficient, yet not so time-consuming that I would be better off just building the data warehouse? For example: views, materialized views, a small separate DB containing a subset of data from the main DBs, etc.?

This may not be answering your question directly, but rather than find a temporary solution I would just build your warehouse faster.
First, if you can build it quickly then you don't need a temporary one; if you can't build it quickly then you won't be able to build a temporary solution quickly either. You even mentioned developing a "small separate DB containing subset data"; that's exactly what a reporting database is!
Second, any temporary solution will have to be maintained and supported too: if it's too useful then your temporary solution will become your permanent one anyway. That might actually be a good thing because if the 'temporary' solution meets your requirements then why not keep it?
Anyway, I would start by identifying one or two key reports that have high value for your users and commit to delivering them in 2 months (1 month would be even better). Develop the most basic, minimal database and ETL/reporting processes possible to deliver those reports, even if it seems like a horrible, hacked-together mess. Make sure the reports are internal ones that no one will send to an outside customer; that means you can avoid spending time on making them pretty.
After you've delivered those reports, you can now step back and look at what you did. Hopefully you will find yourself in a position where:
1. Your users got some useful reports very quickly
2. The reports are ugly but the numbers are correct
3. You've learned a lot about the users' needs and how they interpret and use the data
4. Your technical implementation is a mess, but you know that and you also know how to improve it
If #1 and #2 are true then you'll have delivered a lot of business value quickly while also setting the user expectation that correct is often more valuable than pretty (that's really helpful on a reporting project). If #3 and #4 are true then your second iteration will be a big improvement on the first one and even if you find yourself in the worst case scenario of having to re-develop the whole thing from scratch, you'll do it faster and better because you've learned so much.
This is simply agile development, of course: there's no reason you can't use rapid prototyping and incremental delivery in a data warehouse project. Like any IT solution the warehouse will continuously grow and be maintained over time so there's absolutely no reason to try to get everything complete and correct in the first version. It's highly likely that your users don't even really know what they want (in detail) so this approach helps to clarify their expectations and requirements more quickly too.
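As a rough illustration of the "most basic, minimal database and ETL" idea in this answer, here is a sketch in Python using the standard sqlite3 module. The table names (visits, daily_visits), the columns and the sample rows are all hypothetical stand-ins; in practice the source would be one of the three operational databases and the target a small reporting database on whatever platform is in use.

```python
import sqlite3


def build_source() -> sqlite3.Connection:
    """Stand-in for one operational database (hypothetical schema and data)."""
    src = sqlite3.connect(":memory:")
    src.execute(
        "CREATE TABLE visits ("
        "visit_id INTEGER PRIMARY KEY, visit_date TEXT, department TEXT)"
    )
    src.executemany(
        "INSERT INTO visits (visit_date, department) VALUES (?, ?)",
        [
            ("2024-01-03", "cardiology"),
            ("2024-01-03", "cardiology"),
            ("2024-01-04", "oncology"),
        ],
    )
    src.commit()
    return src


def refresh_report(src: sqlite3.Connection, rpt: sqlite3.Connection) -> None:
    """Full-refresh ETL: rebuild the summary table from the source on every run."""
    rows = src.execute(
        "SELECT visit_date, department, COUNT(*) "
        "FROM visits GROUP BY visit_date, department"
    ).fetchall()
    rpt.execute(
        "CREATE TABLE IF NOT EXISTS daily_visits ("
        "visit_date TEXT, department TEXT, visit_count INTEGER)"
    )
    with rpt:  # the delete and the reload are committed together
        rpt.execute("DELETE FROM daily_visits")
        rpt.executemany("INSERT INTO daily_visits VALUES (?, ?, ?)", rows)


source = build_source()
reporting = sqlite3.connect(":memory:")
refresh_report(source, reporting)
refresh_report(source, reporting)  # re-running does not duplicate rows

report_rows = reporting.execute(
    "SELECT visit_date, department, visit_count "
    "FROM daily_visits ORDER BY visit_date, department"
).fetchall()
```

A full refresh like this is crude, but it is quick to write and easy to reason about, which fits a first iteration; an incremental load could replace it later if the data volume demands it.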

Related

Scalability of open source XML Databases

We are looking to develop a reporting application that reports on data stored in a large number of XML files: ~3,000,000 files ranging in size from 7KB to 5MB (each file conforms to the same schema). I’m guessing that there will be around 200GB of XML. I’m looking at a number of open source XML databases (Sedna, BaseX and eXist-db) and I’m not sure how well these systems will scale. I read a comparison of these three databases here, which is where my concerns about scalability originated.
Some details regarding what we want to do are: We won’t be changing the data in any of the XML files and new files will be added daily. Since we are concerned with reporting, query performance is important to us, and the time it takes to add and index new files isn’t a high priority.
I’m wondering if anyone has experience using these systems at similar scales? I’ve looked at the BaseX statistics page and see some fairly large XML instances but no mention of performance.
We don’t require an open source product and the MarkLogic system looks like it can fit the bill nicely, but I’m curious what’s been done with open source products.
I think it is impossible to answer your question with either a yes or no. It is really impossible to state anything about performance from the little details that you have given.
Performance is typically based on the queries that you want to perform and the distribution of your data. Not to mention, what you consider to be "acceptable".
In the paper you referenced, it is interesting to note that they state that they could not get the new range indexes in eXist 2.2 preview to work. Certainly without those, they would have seen much worse performance. Also, at the end they state that they will select Sedna as they can overcome the problems with Sedna. It was not clear to me why that was: do they have C++ devs that can work with Sedna but no Java devs that could work with eXist or BaseX? Finally, the version of Java they used for testing eXist and BaseX is rather old; the next release of eXist (3.0) will only support Java 8 and newer.
I would be surprised if you could not store 200GB of data into BaseX, eXist or Sedna, but without knowing your data and the sort of queries you want to execute, I cannot comment on query performance.
I think you would be best to do a small trial of either one or all, in a manner not dissimilar to that linked article.
Just want to share my experience on this topic. My experience is limited to much smaller data sets - that is roughly about 50k documents of about 1GB total size. We use Sedna XML DB for this purpose. We do not change documents but rather overwrite existing documents when changes occur and have a lot of read-only XQueries including big reports.
In short, my opinion is that Sedna won't work for you unless you find a way to replicate it to another server to be used for reading. I have experienced major performance issues related to collection locks with a rather moderate load on the database when performing some long-running reporting XQueries. As far as I know, Sedna does not offer replication capabilities, but you can probably adopt some solution on top of Sedna. For example, quick googling revealed some research in this area. You can try asking on the Sedna mailing list. Other disadvantages are the lack of XQuery 3.0 support and seemingly frozen further development. However, the support is still quite active on the mailing list.
I also have some experience with eXist-db, but I use it more as an XML processing and pipelining platform than as XML storage. Still, it looks a bit more promising in relation to scaling. Although I have not used its replication capabilities, they are mentioned in the docs. I suggest you try searching/asking on the mailing list as well.

When is it too late to optimize for performance?

I know that you shouldn't optimize too early, and you should instead aim for maintainability. My question is, at what point is it too late?
I'm working on a website, similar to yahoo answers, and my database structure is exactly what I feel it should be. Table for users, questions, answers, question_comments, answer_comments, etc.
My question is, IF the site were to grow, how would this architecture scale? I'm thinking of putting both questions and answers in a single table (posts), separating them by type, and then putting both question_comments and answer_comments in the same table (comments). I believe this is similar to stackoverflow's DB scheme.
I know what you guys are gonna say: "Don't worry about it until it becomes an actual problem". But wouldn't it be a little too late to worry about it then?
Thanks
The reason why it's a bad practice to optimize early is you don't know where your bottlenecks will be until your website sees a significant amount of traffic. How your users access and interact with your site is an unknown at this point.
It's almost always best to start with a 'good' architecture (normalized database, MVC architecture, DRY, well-written frontend code, etc) and go from there. It will be much easier to scale a clean, organized architecture than one that was prematurely optimized.
At best right now you can do some load testing via ab or another load testing tool to see where your current bottlenecks are. It certainly won't find all of them, but it will find some.
If you're really worried about this (and you shouldn't be yet), install Nagios or Munin on your server to monitor performance. Use a third party tool to measure page load time daily. Once you start seeing issues then you can profile and tune.
You absolutely should optimize if a fast service is a fundamental requirement of the application.
If sub-second responses are not a requirement, then you can write clean code and optimize later.
A good example of this was JavaScript before the latest generation of browsers: people who wrote nice, clean, extensible JS for their pages had terrible performance and had to start from scratch.
One huge table is generally harder to maintain. People usually cut their tables into partitions and even their databases into shards.
I don't see how putting all comments into the same table would save you a join. Really, putting questions and answers into the same table won't save you a join either; you'll just be joining the table to itself.
If you want to save on joins, I'd expect you to use a document-oriented NoSQL database, such as MongoDB. That's where you can store a question with all related answers and comments in a single 'record', fetchable with one operation.
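To make the point about the single posts table concrete, here is a small sketch using Python's built-in sqlite3 module. The posts schema (type, parent_id, body) is an assumption made for illustration, not the asker's actual design, and the nested dictionary at the end only mimics the shape of a document-store record without using any MongoDB API.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        type TEXT NOT NULL CHECK (type IN ('question', 'answer')),
        parent_id INTEGER REFERENCES posts(id),  -- NULL for questions
        body TEXT NOT NULL
    )
""")
db.executemany(
    "INSERT INTO posts (id, type, parent_id, body) VALUES (?, ?, ?, ?)",
    [
        (1, "question", None, "When is it too late to optimize?"),
        (2, "answer", 1, "Measure first."),
        (3, "answer", 1, "Design with performance in mind."),
    ],
)

# Fetching a question with its answers still needs a join: posts joined to posts.
pairs = db.execute("""
    SELECT q.body, a.body
    FROM posts AS q
    JOIN posts AS a ON a.parent_id = q.id AND a.type = 'answer'
    WHERE q.type = 'question'
    ORDER BY a.id
""").fetchall()

# Document-style alternative: the answers are nested inside the question record,
# so one lookup of the question returns everything.
question_doc = {
    "_id": 1,
    "body": "When is it too late to optimize?",
    "answers": [
        {"body": "Measure first."},
        {"body": "Design with performance in mind."},
    ],
}
answer_bodies = [answer["body"] for answer in question_doc["answers"]]
```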
Databases need to be designed with performance in mind, rather than waiting until you have a problem later. Avoiding premature optimization doesn't mean ignoring performance in the design; it means not getting ridiculously excessive about it. However, there are known performance killers for every database backend, and it is foolish to design around one of those when a different technique will be faster and take the same amount of time to code if you are familiar with it. So before designing any database, read up on performance tuning and you will never write database code the same way again.

How to approach performance issues?

We are developing a client-server desktop application (WinForms with SQL Server 2008, using LINQ to SQL). We are now finding many issues related to performance. These relate to querying too much data with LINQ, bad database design, not much caching, etc. What do you suggest we should do, and how should we go about solving these performance issues? One thing I am doing is SQL profiling and trying to fix some queries. As far as caching is concerned, we have static lists. But how do we keep them updated? We don't have any server-side implementation, so these lists can be stale if someone changes data.
regards
Performance analysis without tools is fruitless, with the wrong tools frustrating. SQL Profiler is the wrong tool to rely on for what you are looking at. I think it is at best giving you a hint of what is wrong.
You need to use a code profiler to determine why and when these queries are being executed. You should be able to find one by Googling and run it on an x-day trial.
The key questions are:
1. Are queries being run multiple times when there is no reason to at all? Is the data already in memory (even if not stored statically)? This happens a lot where data has already been retrieved but some action in the code loads it again. Class properties are a big culprit here.
2. Should certain data be stored statically across the application? How volatile is that data? Can you afford to show stale data?
The only way to decide on #2 is to have hard data on the cost of a particular transaction. For example, if I know it takes me 1983 ms to create a new invoice, what will it be after I start caching data? After caching, is the saving significant? But recognize you can't answer that question until you know it takes 1983 ms to create an invoice.
When I profile an application transaction I focus on the big contributor and try to determine why it is so big. I look for individual methods that are slow and for any code that is executed frequently. It is often the latter, the death of a thousand cuts, that gets you.
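A real code profiler gives far more detail, but the kind of data described here (call counts and time per method within one transaction) can be sketched in a few lines of Python. The timed decorator, call_stats, and the load_customer and create_invoice functions are all hypothetical; the sleep stands in for a database round trip.

```python
import time
from collections import defaultdict
from functools import wraps

# function name -> number of calls and total wall-clock seconds
call_stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


def timed(func):
    """Record how often func is called and how long it takes in total."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            stats = call_stats[func.__name__]
            stats["calls"] += 1
            stats["seconds"] += time.perf_counter() - start
    return wrapper


@timed
def load_customer(customer_id):
    time.sleep(0.001)  # stand-in for a database query
    return {"id": customer_id}


@timed
def create_invoice(customer_id, line_count):
    # The pattern to look for: the same data is reloaded for every invoice line.
    for _ in range(line_count):
        load_customer(customer_id)
    return line_count


create_invoice(42, 5)

for name, stats in sorted(call_stats.items()):
    print(f"{name}: {stats['calls']} calls, {stats['seconds'] * 1000:.1f} ms")
```

Here the repeated load_customer calls are the "thousand cuts": each one is cheap, but the call count shows they dominate the invoice transaction.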
And I wanted to add this: it is also very important to know when to stop working on a performance issue.
I found Jeff Atwood's articles on this quite interesting:
Compiled Or Bust
All Abstractions Are Failed Abstractions
For updating, you can create a Table. I called it ListVersions.
Just store list id, name and version.
When you do some changes to a list, just increment its version. In your application, you'll just need to compare version and update only if it has changed. Update lists that have version incremented, not all.
I've described it in my answer to this question
What is the preferred method of refreshing a combo box when the data changes?
Good Luck!
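The ListVersions idea from this answer could look roughly like the following Python sketch, using sqlite3 as a stand-in for SQL Server. Apart from the ListVersions table name and its list id, name and version columns, everything here is an assumption for illustration: the Countries list, the LIST_QUERIES mapping, the ListCache class and the add_country helper.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE ListVersions (list_id INTEGER PRIMARY KEY, name TEXT, version INTEGER)"
)
db.execute("CREATE TABLE Countries (name TEXT)")
db.execute("INSERT INTO ListVersions VALUES (1, 'Countries', 1)")
db.execute("INSERT INTO Countries VALUES ('Norway')")
db.commit()

# Hypothetical mapping from list name to the query that loads it.
LIST_QUERIES = {"Countries": "SELECT name FROM Countries ORDER BY name"}


class ListCache:
    """Client-side cache that reloads a list only when its version has changed."""

    def __init__(self, conn):
        self.conn = conn
        self.items = {}     # list name -> cached values
        self.versions = {}  # list name -> version the cached values belong to
        self.reloads = 0

    def get(self, name):
        (current,) = self.conn.execute(
            "SELECT version FROM ListVersions WHERE name = ?", (name,)
        ).fetchone()
        if self.versions.get(name) != current:
            rows = self.conn.execute(LIST_QUERIES[name]).fetchall()
            self.items[name] = [row[0] for row in rows]
            self.versions[name] = current
            self.reloads += 1
        return self.items[name]


def add_country(conn, name):
    """Change the list and bump its version in the same transaction."""
    with conn:
        conn.execute("INSERT INTO Countries VALUES (?)", (name,))
        conn.execute(
            "UPDATE ListVersions SET version = version + 1 WHERE name = 'Countries'"
        )


cache = ListCache(db)
first = list(cache.get("Countries"))   # first access loads the list
second = list(cache.get("Countries"))  # version unchanged: served from cache
add_country(db, "Sweden")
third = list(cache.get("Countries"))   # version changed: list is reloaded
```

The version check is still a query, but it is a single cheap lookup; the full list is fetched only when something actually changed.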
A general recipe for performance issues:
Measure (wall clock time, CPU time, memory consumption etc.)
Design & implement an algorithm that you think could be faster than current code.
Measure again to assess the impact of your fix.
Many times the biggest bottlenecks aren't exactly where you thought they were. So, base your actions on measured data.
Try to keep the number of SQL queries small. You're more likely to get performance improvements by lowering the number of queries than by restructuring the SQL syntax of an individual query.
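As a small illustration of lowering the number of queries, the sketch below fetches the same rows once with one query per id and once with a single batched query. It uses Python's sqlite3 module; the customers table and the CountingConnection wrapper are hypothetical, and against a networked database server each saved query would also save a round trip.

```python
import sqlite3


class CountingConnection:
    """Thin wrapper (hypothetical helper) that counts queries sent to the database."""

    def __init__(self, conn):
        self.conn = conn
        self.query_count = 0

    def query(self, sql, params=()):
        self.query_count += 1
        return self.conn.execute(sql, params).fetchall()


raw = sqlite3.connect(":memory:")
raw.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
raw.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace"), (3, "Edsger")],
)
raw.commit()

wanted_ids = [1, 2, 3]

# Variant A: one query per id.
one_by_one = CountingConnection(raw)
names_a = {}
for customer_id in wanted_ids:
    rows = one_by_one.query(
        "SELECT name FROM customers WHERE id = ?", (customer_id,)
    )
    names_a[customer_id] = rows[0][0]

# Variant B: a single query for all ids, with one placeholder per id.
batched = CountingConnection(raw)
placeholders = ", ".join("?" for _ in wanted_ids)
rows = batched.query(
    f"SELECT id, name FROM customers WHERE id IN ({placeholders})", wanted_ids
)
names_b = dict(rows)
```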
I recommend adding some server-side logic instead of firing the SQL queries directly from the client. You could implement caching on the server side, shared by all clients.

Which are the advantages of splitting the developer's time between two projects? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 3 years ago.
I have two projects, with identical priorities and work hours demand, and a single developer. Two possible approaches:
1. Deliver one project first.
2. Split the developer's time and deliver both later.
I can't see any reason why people would choose the second approach. But they do. Can you explain why?
It seems to me that this decision often comes down to office politics. One business group doesn't want to feel any less important than another, especially with identical priorities set at the top. Regardless as to how many different ways you explain why doing both at the same time is a bad idea, it seems as though the politics get in the way.
To get the best product to the users, you need to prevent developer thrashing. When the developers are thrashing, the risk of defects and length of delivery times begin to increase exponentially.
Also, if you can put your business hat on, you can try to explain to them that right now, nobody is getting any value from what the completed products will deliver. It makes more sense for the business to get the best ROI product out the door first to begin recouping the investment ASAP, while the other project will start as soon as the first is finished.
Sometimes you need to just step away from the code you have been writing for 11 hours in order to stay maximally productive. After you have been staring at the minutiae of a system you have been implementing for a long time it can become difficult to see the forest for the trees, and that is when you start to make mistakes that are hard to un-make.
I think it is best to have 2-3 current projects; one main one and 1-2 other projects that aren't on such a strict timeline.
If both projects have the same priority for the company, one obvious reason is for project managers to give higher management the illusion that both of the projects are taken care of.
Consider that the two projects could belong to different customers (or be requested by different people from higher management).
No customer wants to be told to wait while a different customer's project is given priority.
"We'll leave the other one for later" is, a lot of times, not an acceptable answer, even though this leads to delays for both projects.
I believe this is related to the notion of "Perceived Responsiveness" in a software program. Even if something takes more time to do, it looks faster when it appears to be doing something, instead of idly waiting for some other stuff to complete.
It depends on the dependencies involved. If you have another dependency upon the project that can be fulfilled when the project is not 100% complete, then it may make sense to split the developer's time. For example, if your task is large, it may make sense to have the primary developer do a design, then move on to a second task while a team member reviews the design the primary developer came up with.
Furthermore, deserializing developers from a single task can help to alleviate boredom. Yes, there is potentially significant loss in the context switch, but if it helps keep the dev sharp, it's worth it.
If you go by what's in the great and holy book 'Peopleware', you should keep your programmer on one project at a time.
The main reason for this is that divided attention will reduce productivity.
Unfortunately, because so many operational managers are good businessmen rather than good managers, they may think that multitasking or working on both projects somehow means more things are getting done (which is impossible; a person can only physically exist in one stream of the space-time continuum at a time).
Hope that helps :)
LM
I think the number 1 reason from a management standpoint is for perceived progress. If you work on more than one project at the same time stakeholders are able to see progress immediately. If you hold one project off then the stakeholders of that project may not like that nothing is being worked on.
Working on more than 1 project also minimizes risks somewhat. For example, if you work on one project first and that project takes longer than expected, you could run into issues with the second project. Stakeholders also most likely want their project done now. Holding one off due to another project can make them reconsider going ahead with the project.
Depending on what the projects are you might be able to leverage work done in one for the other. If they are similar then doing both at the same time could be of benefit. If you do them in sequence only the subsequent projects can benefit from the previous ones.
Most often projects are not a constant stream of work. Sometimes developers are busy and sometimes not. If you only work on 1 project at a time a developer and other team members would likely be doing nothing while the more 'administrative' tasks are taking place. Managing the time over more than one project allows teams to get more done in a shorter timeframe.
As a developer I prefer working on multiple projects as long as the timelines are reasonable. As long as I'm not being asked to do both at the same time with no change in the schedule I am fine. Often if I'm stuck on one project I can work on the other. It depends on the projects though.
I'd personally prefer the former but management might want to see progress in both projects. You might also recognise inaccurate estimates earlier if you are doing some work on both, enabling you to inform the customer earlier.
So from a development perspective 1 is the best option but from a customer service point of view 2 is probably better.
It's about managing your clients' expectations: if you can tell both clients you are working on their projects but that it will take a little longer due to other projects, that goes down better than saying "we are putting your project off until we finish this other project", in which case the client is going to jump ship and find someone who can start working on their project now.
It's a placebo effect: splitting a developer between two projects in the manner you've described gives people/"the business" the impression that work is being completed on both projects (at the same rate/cost/time), whilst in reality it's probably a lot more inefficient, since context switching and other considerations carry a cost (in time and effort).
On one hand, it can get the ball rolling on things like requirement clarifications and similar tasks (so the developer can switch to the alternate project when they are blocked) and it can also lead to early input from other business units/stakeholders etc.
Ultimately though, if you have one resource then you have a natural bottleneck.
The best thing you can do for that lone developer is to intercept people (to stop them from distracting that person), and try to carry some of the burden around requirements, chasing clarifications, handling user feedback, etc.
The only time I'd ever purposely pull a developer off their main project is if they would be an asset to the second project, and the second project was stalled for some reason. If allowing a developer to split a portion of their time could help jump-start a stalled project, I'd do that. This has happened to me with "expert" developers - the ones who have a lot more experience/specialized skills/etc.
That being said, I would try to keep the developer on two projects for as little time as possible, and bring them back to their "main" project. I prefer to allow people to focus on one task at a time. I feel that my job as a manager is to balance and shift people's priorities and focus - and developers should just develop as much as possible.
There are three real-life advantages of splitting developers' time between projects that cannot be ignored:
Specialisation: doing or consulting on work that requires similar specialised knowledge in both projects.
Consistency and knowledge sharing: bringing consistency into the way two separate products are built and work, spreading knowledge across the company.
Better team utilisation: on a rare occasion when one of the projects is temporarily on hold waiting for some further input.
Splitting time between several projects is beneficial when it does not involve a significant change in context.
Having a developer work single-handedly on multiple software development projects negates the benefits of specialisation (there isn't any in that case), consistency and knowledge sharing.
It leaves just the advantage of time utilisation, however if contexts differ significantly and there is no considerable overlap between projects the overhead of switching will very likely exceed any time saved.
Context switching is a very interesting beast: contrary to its name implying a discrete change, the process is always gradual. There are various degrees of having context information in one’s head: 10% context (shallow), 90% (deep). It takes less time to shallow-switch than to fully switch; however, there is a direct correlation between the amount of context loaded (concentration on the task) and output quality.
It’s possible to fill your time entirely working on multiple distinct projects relying on shallow-switching (to reduce the lead time), but the output quality will inevitably suffer. At some point it’s not only “non-functional” aspects of quality (system security, usability, performance) that will degrade, but also functional (system failing to accomplish its job, functional failures).
By splitting the time between two projects, you can reduce the risk of delaying one project because of another.
Let's assume the estimate for both projects is 3 months each. By doing it serially, one after the other, you should be able to deliver the first project after 3 months, the second project 3 months later (i.e. after 6 months). But, as things go in software development, chances are that the first project encounters some problems so it takes 12 months instead. Or, even worse, goes into the "in use, but never quite finished" purgatory. The second project starts late or even never!
By splitting resources, you avoid this problem. If everything goes well with the second project, you are able to deliver it after 6 months, no matter how well the first project does.
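The arithmetic behind this example can be checked with a short Python sketch. It rests on simplifying assumptions that are not stated in the answer: the split is an even 50/50, context switching costs nothing, and once one project is finished the developer works full time on the other. The function names are made up for illustration.

```python
def serial_delivery(effort_a, effort_b):
    """Delivery month of each project when A is finished before B starts."""
    return effort_a, effort_a + effort_b


def split_delivery(effort_a, effort_b):
    """Delivery months with a 50/50 split, then full time on whatever remains."""
    smaller = min(effort_a, effort_b)
    # At half speed, the smaller project needs twice its effort in calendar time.
    first_done = 2 * smaller
    # The larger project has received `smaller` months of work by then.
    remaining = max(effort_a, effort_b) - smaller
    last_done = first_done + remaining
    finish_a = first_done if effort_a <= effort_b else last_done
    finish_b = first_done if effort_b <= effort_a else last_done
    return finish_a, finish_b


# Both projects take the estimated 3 months of effort.
on_plan_serial = serial_delivery(3, 3)  # (3, 6)
on_plan_split = split_delivery(3, 3)    # (6, 6)

# The first project turns out to need 12 months instead of 3.
overrun_serial = serial_delivery(12, 3)  # (12, 15)
overrun_split = split_delivery(12, 3)    # (15, 6)
```

Under these assumptions the second project ships at month 6 either way when split, while the first project's delivery moves from month 12 to month 15; any real context-switching overhead would push both dates later.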
The real-life situation where working on multiple projects can be an advantage is when the spec is unclear (every time) and the customer is often unavailable for clarification. In those cases you can just switch to the other project.
This will cause some task switching and should be avoided in a perfect world, but then again...
This is basically my professional life in a nutshell :-)

Software Rewrite-vs-Running Cost Analysis

The IT department I work in as a programmer revolves around a 30+ year old code base (Fortran and C). The code is in a poor condition partially as a result of 30+ years of ad-hoc poorly thought out changes but I also suspect a lot of it has to do with the capabilities of the programmers who made the changes (and who incidentally are still around).
The business that depends on the software operates 363 days a year and 20 hours a day. Unfortunately there are numerous outages. This is the first place I have worked where there are developers on call to apply operational code fixes to production systems. When I first started, there was actually a copy of the source code and development tools on the production servers so that on-the-fly changes could be applied; thankfully that practice has now been stopped.
I have hinted a couple of times to management that the downtime, having developers on call, extra operational staff, unsatisfied customers etc. are costing the business a lot more in the medium term, and possibly even the short term, than it would cost to launch a wholehearted effort to rewrite/refactor/replace the whole thing (the code base is about 300k lines).
Ideally there'd be some external consultancy that could come in and run the rule over the quality of the code and the costs involved in keeping it running vs rewriting/refactoring/replacing it. The question I have is how should a business go about doing that kind of cost analysis on software AND be able to have confidence in that analysis? The first IT consultants down the street may claim to be able to do the analysis, but how could management be made to feel comfortable with it over what they are being told by internal staff?
We recently decided to completely rewrite large portions of our business code from scratch, and it has not gone as well as we had hoped. I've seen a lot of quotes saying you should never try to rewrite anything from scratch, and now I see why. I would recommend starting small - don't try to rewrite the whole thing at once. Identify the large problem areas and focus on refactoring small portions of the system at a time. Since there is 30+ years worth of work in the system, it will take a long time to get it back to a reasonable state. We had about 5-8 years worth of work to rewrite, and it has been difficult. I can't imagine 30+ years of work!
First, the profile of the consultant you need is very specific. Unless you can find someone who worked in a similar domain with the same languages, don't hire him.
Second, there's a 99% probability (I like dramatic numbers) the analysis will go as follows:
Consultant explores the application
Consultant understands 10% of the application
Time's up, time for the report
Consultant advises a complete rewrite (no refactoring, plain rewrite)
So you may as well save yourself the cost of the consultant.
You have only two solutions here:
Keep the current source code but establish proper methods for fixing problems, so that you have a very long-running refactoring carried out progressively by those who know the application
Get a secondary team to make a new application to replace the old one
If I talk about a secondary team, it's because you cannot bring just one architect to make the new application and have the old team working with him:
They're too busy on the old application
There will be frictions because the newcomer will undoubtedly underestimate the task at hand
I talk from experience, believe me.
If you go the "new application" way, don't set your hopes too high. You'll end up with an application that has less than half the functionality of the current one, simply because you cannot cram 30+ years of special-case and exceptional-situation fixes into freshly designed software.
Oh, also, if your developers happen to tell you they have a plan, by all means, hear them out. They most probably know what they are talking about.
The first thing that comes to mind is that you are prematurely addressing the rewrite/refactor/replace argument. The first two steps I would recommend would be:
Unit tests
QA
It's well within engineering scope to implement these. Unit tests are an essential preliminary step before any reasonable refactor or rewrite could possibly take place. By 'unit test' I mean wrap each function call with corresponding code that proves the code works for all known conditions. In complex retrofits this may not actually happen at the most granular level but any automated tests will help immensely.
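A minimal sketch of what such tests can look like, written in Python with the standard unittest module rather than the Fortran and C of the code base in question. The legacy_fare function is an invented stand-in for an untouched legacy routine; the tests pin down what it does today, including an odd special case, so that a later refactor can be checked against the same behaviour.

```python
import unittest


def legacy_fare(distance_km, is_child):
    """Stand-in for an existing legacy routine (hypothetical)."""
    fare = 2.0 + 0.5 * distance_km
    if is_child:
        fare = fare / 2
    if distance_km > 100:
        fare = fare - 5  # odd special case: pinned as-is, not "fixed"
    return round(fare, 2)


class LegacyFareBehaviour(unittest.TestCase):
    """Each test records current behaviour for one known condition."""

    def test_adult_short_trip(self):
        self.assertEqual(legacy_fare(10, False), 7.0)

    def test_child_short_trip(self):
        self.assertEqual(legacy_fare(10, True), 3.5)

    def test_long_trip_special_case(self):
        self.assertEqual(legacy_fare(200, False), 97.0)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(LegacyFareBehaviour)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```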
And QA - have an independent (and aggressive) quality assurance team that rigorously tests beta releases before production. Their test plans and test procedures become essential for any kind of replacement effort.
Once you've got the code under control, then you are in a position where the business can reasonably consider massive changes.
Just a note about your comment about external consultants: no consultancy will ever care enough about the code to provide realistic quality assurance. QA ends up being joined at the hip with the business, defending the company's bottom line. It's ultimately an internal function, and an external consultant can't really provide much more than getting you started.
I think that your description provides all of the necessary information on code quality (lack thereof). The fact that so many support resources are required also indicates the high costs involved with maintaining the existing system.
As I answered here, a good approach to consider is refactoring one piece of the system at a time until everything works at an acceptable level. I agree with Joel re not throwing away existing code (see Things You Should Never Do). Parts of your code work, so you should leave those in place whenever possible, and focus on the sections that lead to downtime.
Andy also makes a great point about starting small as well.
Another thing to try is reviewing the processes around the system. When you do this, you should try to determine which failure situations are caused directly or indirectly by user action, and whether there are configuration or environment problems. If you are having trouble fixing the code directly, then you can still prop it up by dealing with external issues more effectively.
Read the book Working Effectively with Legacy Code (also see the short PDF version) and surround the code with automated tests, as instructed in that book.
Refactor the system little by little. If you rewrite some parts of the code, do it a small subsystem at a time. Don't try to make a Grand Redesign.
The code has been around for 30 years?
Development paradigms have shifted substantially in the last three decades in many ways, and most relevant to your predicament, I feel, is in terms of the amount of time (in man days) required to create something to input->process->output something.
300,000 lines of code 30 years ago could probably fit into 100,000 lines or less today, while expending fewer man-hours. This could seem optimistic/ridiculous to some, but on the other hand is achievable, depending on the type of application in question. You have given no indication as to the classification of system: is it a real-time manufacturing process control system of sorts with sensors and actuators tied to it? An airline booking system? Does it post-process some backlog of data? In other words, could it be rebuilt in something like Java, and quickly, with an aggressive, smallish team? Have the requirements been documented, and if so do they need updating or redeveloping from scratch? Is human safety a factor?
Just a quick sanity check, I think whether or not you should rebuild depends on (any order means the same thing):
Number of code dudes required.
Level of expertise of said dudes.
Which languages do not fit.
Which languages do fit.
How much it costs to use the chosen language(s) in terms of hardware and software.
How much does the business depend on this to stay alive?
Is it really too much downtime, or are you just nitpicking? (maybe they really don't care, but pretend to).
Good luck with that!

Resources