I'm working on a project where I am collecting data from different sources, and afterwards I need to do some reporting on that data. All the reports are predefined.
I'm thinking of using RavenDB for this, as its indexes and map/reduce support seem like a good fit: I would create an index for each report.
Is one index per report the way to go, or does that come with pitfalls? And what about index starvation?
One index per report will lead to lots of extra indexes. Think instead in terms of one index per dataset, then build multiple reports around each dataset.
There might be more than one dataset per collection if you are doing map/reduce. For example, you might have the following indexes:
OrderDetailsIndex
OrderTotalsByCustomer
OrderTotalsByMonth
OrderTotalsByDay
OrderTotalsByProduct
You could build many reports from these indexes.
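As an illustration, OrderTotalsByCustomer could be a map/reduce index along these lines (just a sketch; the Order document shape here is assumed, so adjust it to your own documents):

```csharp
// Sketch of a per-dataset map/reduce index (assumed Order shape).
using System;
using System.Linq;
using Raven.Client.Indexes;

public class Order
{
    public string Id { get; set; }
    public string CustomerId { get; set; }
    public DateTime OrderedAt { get; set; }
    public string ZipCode { get; set; }
    public OrderLine[] Lines { get; set; }
}

public class OrderLine
{
    public string Product { get; set; }
    public int Quantity { get; set; }
    public decimal PricePerUnit { get; set; }
}

public class OrderTotalsByCustomer : AbstractIndexCreationTask<Order, OrderTotalsByCustomer.Result>
{
    public class Result
    {
        public string Customer { get; set; }
        public decimal Total { get; set; }
    }

    public OrderTotalsByCustomer()
    {
        // Map: emit one entry per order, keyed by customer.
        Map = orders => from order in orders
                        select new
                        {
                            Customer = order.CustomerId,
                            Total = order.Lines.Sum(l => l.Quantity * l.PricePerUnit)
                        };

        // Reduce: sum the per-order totals for each customer.
        Reduce = results => from r in results
                            group r by r.Customer into g
                            select new
                            {
                                Customer = g.Key,
                                Total = g.Sum(x => x.Total)
                            };
    }
}
```

Register it once with IndexCreation.CreateIndexes(typeof(OrderTotalsByCustomer).Assembly, store), and every report that needs per-customer totals can query the same index.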
What you don't want to do is to have multiple indexes like:
OrdersByCustomer
OrdersByDate
OrdersByZipCode
Those are just multiple maps that can be condensed into the same index, so it would be redundant to split them apart.
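A single map-only index can expose all three of those fields instead; something like this sketch (reusing the assumed Order class from the earlier example):

```csharp
// Sketch: one map index covering the customer, date, and zip-code queries,
// rather than three separate OrdersBy* indexes. Field names are assumed.
using System.Linq;
using Raven.Client.Indexes;

public class Orders_Search : AbstractIndexCreationTask<Order>
{
    public Orders_Search()
    {
        Map = orders => from order in orders
                        select new
                        {
                            order.CustomerId,  // covers OrdersByCustomer
                            order.OrderedAt,   // covers OrdersByDate
                            order.ZipCode      // covers OrdersByZipCode
                        };
    }
}
```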
If you had one index per report, it would get out of control quickly:
OrderDetailsIndex_ForReportA
OrderDetailsIndex_ForReportB
The only difference between A and B might be the layout of the fields.
And finally, you might want to consider upgrading to RavenDB 2.5. There is a new feature called "Streaming Unbounded Results", which you can read about on Ayende's blog. This is probably the best way to feed data from an index into a report. If your reporting engine requires an IEnumerable data source (most do), then you might want to use a handy extension method I wrote; the idea is sketched below.
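Roughly, it looks like this (a minimal sketch against the 2.5 session streaming API; the result type and index name are placeholders):

```csharp
// Sketch: wrap session.Advanced.Stream() in an IEnumerable<T> so a reporting
// engine can consume the whole (unbounded) result set lazily, without paging.
using System.Collections.Generic;
using System.Linq;
using Raven.Client;

public static class StreamingExtensions
{
    public static IEnumerable<T> AsStreamedEnumerable<T>(
        this IDocumentSession session, IQueryable<T> query)
    {
        using (var enumerator = session.Advanced.Stream(query))
        {
            while (enumerator.MoveNext())
                yield return enumerator.Current.Document;
        }
    }
}

// Usage (index and result type are placeholders):
// using (var session = store.OpenSession())
// {
//     var totals = session.Query<OrderTotalsByCustomer.Result>("OrderTotalsByCustomer");
//     RunReport(session.AsStreamedEnumerable(totals));
// }
```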
I have data from multiple sources - a combination of Excel (table and non-table), CSV and, sometimes, even TSV.
I create queries for each data source and then bring them together one source at a time, or rather in two steps per source: merge, then expand to bring in the fields I want from that data source.
This doesn't feel very efficient, and I think maybe I should just join everything together in the Data Model. The problem when I tried that was that I couldn't find a way to write a single query that accesses all the different fields spread across the different data sources.
If this were Access, I'd have no trouble creating a single query once I'd created all the relationships between my tables.
I feel as though I'm missing something: How can I build a single query out of the data model?
Hoping my question is clear. It feels like something that should be easy to do but I can't home in on it with a Google search.
It is never a good idea to push the heavy lifting downstream into Power Query; do as much as possible at the source. If you can, work with database views rather than full tables, use a modular approach (several smaller queries that you then connect in the data model), filter early, remove unneeded columns, and so on.
The more work that has to be performed on data you don't really need, the slower the query will be. Please take a look at this article and this one, the latter having a comprehensive list of best practices (you can also just search for that term; there are plenty).
In terms of creating a query from the data model, conceptually that makes little sense, as you could conceivably create circular references galore.
Say I have a collection with 100 million records/documents in it.
I want to create a series of reports that involve summing of values in certain columns and grouping by various columns.
What references for XQuery and/or MarkLogic can anyone point me to that will allow me to do this quickly?
I saw cts:avg-aggregate, which looks fine. But then I need to group as well.
Also, since I am dealing with a large amount of data and it will take some time to go through it all, I am thinking about setting this up as a job that runs at night to update the report.
I thought of using corb to run through the records and then do something with the output from that. Is this the right approach with MarkLogic and reporting?
Perhaps this guide would help:
http://developer.marklogic.com/blog/group-by-the-marklogic-way
You have several options, which are discussed there:
cts:estimate
cts:element-value-co-occurrences
cts:value-tuples + cts:frequency
The idea is to redesign the data structure and/or change the DB.
I've just started reviewing this project and plan to start optimization from this point.
Currently I have CouchDB with about 80 GB of document data, around 30M records.
For most of the documents, properties like id, group_id, location, and type can be considered generic, but unfortunately they are currently stored under different property names across the set. There is also a lot of deep nesting.
The structure is hardly defined at all, which is why a NoSQL DB was chosen long before the full picture was clear.
Data is calculated and written to the DB by a separate job on a powerful cluster, and this isn't done very often. From that perspective, general write/update performance isn't very important. A size reduction would also be great, but it isn't the top priority. There are only about 1-10 active customers at a time.
Read performance with various filtering/grouping etc. is what matters most.
No heavy summary calculations need to happen at read time; those are already done during population.
This is a data-analysis tool for displaying comparison and other reports to quality engineers and data analysts, so they can browse the results, group them, or filter them from the Web UI.
Right now, tasks like searching a subset of document properties for a piece of text aren't feasible for performance reasons.
I've done some initial investigation (e.g. http://www.datastax.com/wp-content/themes/datastax-2014-08/files/NoSQL_Benchmarks_EndPoint.pdf), and Cassandra looks like a good choice among the NoSQL options.
It would also be quite interesting to try porting this data into the new PostgreSQL.
Any ideas would be highly appreciated :-)
Hello, please check the following articles:
http://www.enterprisedb.com/nosql-for-enterprise
For me, PostgreSQL's json (and jsonb!) capabilities let you start schema-less while still having transactions, indexes, grouping, and aggregate functions with very good performance, right from the start. And when you're ready (and if needed), you can move to a schema, with an internal data migration.
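For example, starting from documents like the ones you describe, a first cut can be as simple as one jsonb column plus a GIN index. A minimal sketch (table name and connection string are made up; group_id and type are the generic properties you mention):

```csharp
// Sketch: migrated documents in a single jsonb column, with a GIN index
// so containment filters (@>) and property-based grouping stay fast.
using System;
using Npgsql;

class JsonbReportSketch
{
    static void Main()
    {
        using (var conn = new NpgsqlConnection("Host=localhost;Database=analytics;Username=app"))
        {
            conn.Open();

            // Start schema-less: one jsonb column per document.
            using (var ddl = new NpgsqlCommand(
                "CREATE TABLE IF NOT EXISTS measurements (id bigserial PRIMARY KEY, doc jsonb NOT NULL);" +
                "CREATE INDEX IF NOT EXISTS measurements_doc_gin ON measurements USING gin (doc);", conn))
            {
                ddl.ExecuteNonQuery();
            }

            // Filtering and grouping on properties inside the documents.
            using (var cmd = new NpgsqlCommand(
                "SELECT doc->>'group_id' AS group_id, count(*) " +
                "FROM measurements " +
                "WHERE doc @> '{\"type\": \"compare\"}' " +
                "GROUP BY doc->>'group_id'", conn))
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine("{0}: {1}", reader.GetValue(0), reader.GetInt64(1));
            }
        }
    }
}
```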
Also check:
https://www.compose.io/articles/is-postgresql-your-next-json-database/
Good luck
I will be processing batches of 10,000-50,000 records with roughly 200-400 characters in each record. I expect the number of search terms I could have would be no more than 1500 (all related to local businesses).
I want to create a function that compares the structured tags with a list of terms to tag the data.
These terms are based on business descriptions. So, for example, a [Jazz Bar], [Nightclub], [Sports Bar], or [Wine Bar] would all correspond to queries for [Bar].
Usually this data has some kind of existing tag, so I can also create a strict hierarchy for the first pass and then do a second pass if there is no definitive existing tag.
What is the most efficient way to implement this? I could have a table with all the keywords and try to match them against each piece of data. That is straightforward when I am matching against the existing tag, and less straightforward when processing free text.
I'm using Heroku/PostgreSQL.
It's a pretty safe bet to use the Sphinx search engine and the ThinkingSphinx Ruby gem. Yes, there is some configuration overhead, but I have yet to find a scenario where Sphinx has failed me. :-)
If you have 30-60 minutes to tinker with setting this up, give it a try. I have been using Sphinx to search a DB table with 600,000+ records using complex queries (3 separate search criteria + 2 separate field groupings/sortings) and I was getting results in 0.625 seconds, which is not bad at all and, I am sure, a lot better than anything you could accomplish yourself with pure Ruby code.
I'm creating a MySQL-centric site for our client, who wants to do custom reporting on their data (via HTTP), something like doing SELECTs in phpMyAdmin with read-only access (*). Perhaps like reporting in Pentaho, though I've never used it.
My boss is saying Sphinx is the answer, but reading through the Sphinx docs, I don't see that it will make user-driven reporting any easier. Does Sphinx help with user-driven reporting? If so, what do they call it?
(*) the system has about 25 tables that would likely be used in reporting, with anywhere between 2 and 50 fields. Reports would likely need up to maybe 5 joins per report.
update: I found http://code.google.com/p/sphinx-report/ so I guess Sphinx doesn't natively do it.
I can only answer for Sphinx the search engine - I don't really know much about the other Sphinxes.
It doesn't itself contain features specifically for writing reports; it's a general-purpose search backend. In the same way, MySQL isn't specifically designed for reports, but it can be used for them.
In general, think of Sphinx as providing a very flexible, fast, and scalable 'index' over a database table, almost like creating a materialized view in MySQL.
Because you can think very carefully about what data to include in this index, and because the index can be 'denormalized' to include all the data (via complex joins if required), you can then run very fast queries against it.
Sphinx also supports GROUP BY, which makes it very useful for creating reports - and because the attribute data is held entirely in memory, it's generally very fast.
Basically, Sphinx is very good at providing the backend for a 'dynamic' and 'interactive' reporting system where speed is required - particularly when combined with letting the user filter reports (via keywords). That is where Sphinx shines.
Because of the upfront work required to design this index, it's less suited to 'flexible' reports. Building the index will probably involve a number of compromises, so it might be limiting in terms of what reports are possible (at least without creating lots of different indexes).
Short version: lots of upfront work to build the system, in exchange for very fast queries down the line.
Sphinx won't really do anything that MySQL can't, but using Sphinx as part of the system will improve performance over a pure MySQL solution.