MarkLogic 8 - Reporting and Aggregation from a Large Collection

Say I have a collection with 100 million records/documents in it.
I want to create a series of reports that involve summing of values in certain columns and grouping by various columns.
What references for XQuery and/or MarkLogic can anyone point me to that will allow me to do this quickly?
I saw cts:avg-aggregate, which looks fine, but then I need to group as well.
Also, since I am dealing with a large amount of data and it will take some time to go through it all, I am thinking about setting this up as a job that runs at night to update the report.
I thought of using CoRB to run through the records and then do something with its output. Is this the right approach to reporting with MarkLogic?

Perhaps this guide would help:
http://developer.marklogic.com/blog/group-by-the-marklogic-way
You have several options, which are discussed in the post above (a short sketch follows the list):
cts:estimate
cts:element-value-co-occurrences
cts:value-tuples + cts:frequency
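To make this concrete, here is a minimal sketch of an index-driven grouped sum, in the same spirit as the cts:avg-aggregate you already found. It assumes element range indexes on <region> and <amount> and a "sales" collection; all of those names are hypothetical, so adjust them to your schema. Because everything resolves from range indexes, it never has to open the 100 million documents themselves:
xquery version "1.0-ml";
(: hypothetical element names, range indexes and collection :)
for $region in cts:values(
                 cts:element-reference(xs:QName("region")),
                 (), (), cts:collection-query("sales"))
let $total :=
  cts:sum-aggregate(
    cts:element-reference(xs:QName("amount")),
    (),
    cts:and-query((
      cts:collection-query("sales"),
      cts:element-range-query(xs:QName("region"), "=", $region))))
return <group region="{$region}" total="{$total}"/>
If you only need counts per group, cts:value-tuples with cts:frequency avoids the per-group aggregate calls entirely.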

Related

ORACLE - Which is better for generating a large result set of records: a view, SP, or function?

I have recently been working with an Oracle database to generate some reports. What I need is to get result sets of specific records (SELECT statements only), sometimes large ones, to be used for generating the reports in Excel files.
At first the reports were queried through views, but some of them are slow (they have some complex subqueries). I was asked to improve the performance and also fix some field mappings. I also want to tidy things up, because when I query against a view I must specifically call the right column names. I want to keep the data work in the database, with the web app just passing parameters and calling the right result set.
I'm new to Oracle, so which is better for this kind of task: an SP or a function? Or are there conditions where a view is still the better choice?
Makes no difference whether you compile your SQL in a view, SP or function. It is the SQL itself that matters.
As long as you are able to meet your requirements with views, they are a good option. If you intend to break your queries up into multiple steps to achieve better performance, go for stored procedures; in that case it is advisable to create a package and bundle all the stored procedures together in it. If your problem is performance, there may not be a silver-bullet solution: you will have to work on the queries and the design themselves.
If the problem is performance due to complex SELECT query (queries), you can consider tuning the queries. Often you will find queries written 15-20 years ago, which do not use functionality and techniques that were introduced by Oracle in more recent versions (even if the organization spent the big bucks to buy the more recent versions - making it into a waste of money). Honestly, that may be too much of a task for you if you are new at Oracle; also, some slow queries may have been written by people just like you, many years ago - before they had a chance to learn a lot about Oracle and have experience with it.
Another thing: if the reports don't need the absolute current state of the underlying tables (for example, if "what was in the tables at the end of business yesterday" is acceptable), you can create a materialized view. Refreshing it takes just as long as running the underlying query, but the refresh can run overnight (say), or every six hours, or whatever, so the reporting steps further downstream don't have to wait for the queries to complete. This is one of the main uses of materialized views.
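For example, a nightly-refreshed materialized view might look like the following sketch (the orders table, its columns, and the 02:00 schedule are purely illustrative):
CREATE MATERIALIZED VIEW sales_report_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE
  START WITH TRUNC(SYSDATE) + 1 + 2/24   -- first refresh tomorrow at 02:00
  NEXT  TRUNC(SYSDATE) + 1 + 2/24        -- then roughly nightly
AS
SELECT region,
       TRUNC(order_date) AS order_day,
       SUM(amount)       AS total_amount,
       COUNT(*)          AS order_count
FROM   orders
GROUP  BY region, TRUNC(order_date);
The reports then select from sales_report_mv instead of running the complex query at request time.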
Good luck!

DB candidate as CouchDB/Schema replacement

The idea is to redesign the data structure and/or change the DB.
I have just started reviewing this project and plan to begin the optimization here.
Currently I have CouchDB with about 80 GB of document data, around 30M records.
For most of the documents, properties like id, group_id, location and type can be considered generic, but unfortunately they are currently stored under different property names across the set. There is also a lot of deep nesting.
The structure isn't strictly defined, which is why a NoSQL DB was selected well before the overall picture was clear.
Data is calculated and populated into the DB by a separate job on a powerful cluster, and this isn't done very often. From that perspective I conclude that general write/update performance isn't very important. A reduction in size would be nice, but it isn't the most important thing either. There are only around 1-10 active customers at a time.
What matters most is read performance with various filtering, grouping, etc.
No heavy summary calculations need to be done at read time; those are already done during population.
This is a data-analysis tool that presents comparison and other reports to quality engineers and data analysts, so they can browse, group and filter the results from the web UI.
Right now, tasks like searching a subset of document properties for text aren't feasible because of performance.
I've done some initial investigation (e.g. http://www.datastax.com/wp-content/themes/datastax-2014-08/files/NoSQL_Benchmarks_EndPoint.pdf) and Cassandra looks like a good choice among the NoSQL options.
It would also be quite interesting to try porting this data to a newer PostgreSQL.
Any ideas would be highly appreciated :-)
Hello, please check the following article:
http://www.enterprisedb.com/nosql-for-enterprise
For me, PostgreSQL's json (and jsonb!) capabilities let you start schema-less and still have transactions, indexes, grouping and aggregate functions with very good performance, right from the start. And when you're ready (and if needed), you can move to a schema, with an internal data migration.
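A rough sketch of that schema-less-first approach (the documents table and the JSON keys group_id, type and score are assumptions for illustration):
CREATE TABLE documents (
    id   bigserial PRIMARY KEY,
    body jsonb NOT NULL
);
-- a GIN index covers containment (@>) and key-existence filters
CREATE INDEX documents_body_gin ON documents USING gin (body);
SELECT body->>'group_id'               AS group_id,
       count(*)                        AS docs,
       avg((body->>'score')::numeric)  AS avg_score
FROM   documents
WHERE  body @> '{"type": "measurement"}'
GROUP  BY body->>'group_id';
Note that jsonb (and its GIN operator classes) requires PostgreSQL 9.4 or later.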
Also check:
https://www.compose.io/articles/is-postgresql-your-next-json-database/
Good luck

MongoID where queries map_reduce association

I have an application that aggregates data from different social network sites. The back-end processing is done in Java and works great.
Its front end is a Rails application; the deadline was 3 weeks for some analytics filtering and reporting tasks, and with a few days still left the work is almost complete.
When I started, I implemented map/reduce for different states and it worked great over 100,000 records on my local machine.
Then my colleague gave me the current, updated database with 2.7 million records. My expectation was that it would still run fine, since I apply a date range and filter before the map_reduce execution; I believed map/reduce would only see the result set of that filter, but that's not the case.
Example
I have a query that just shows stats for records loaded in the last 24 hours.
The result is 0 records found, but it now takes 200 seconds with 2.7 million records, whereas before it came back in milliseconds.
CODE EXAMPLE BELOW
filter is a hash of conditions expected to be applied before map_reduce
map function
reduce function
SocialContent.where(filter).map_reduce(map, reduce).out(inline: true).entries
Suggestions please: what would be the ideal solution in the remaining time frame, given that the database is growing quickly day by day?
I would suggest you look at a few different things:
Does all your data still fit in memory? You have a lot more records now, which could mean that MongoDB needs to go to disk a lot more often.
M/R can not make use of indexes. You have not shown your Map and Reduce functions so it's not possible to point out mistakes. Update the question with those functions, and what they are supposed to do and I'll update the answer.
Look at using the Aggregation Framework instead, it can make use of indexes, and also run concurrently. It's also a lot easier to understand and debug. There is information about it at http://docs.mongodb.org/manual/reference/aggregation/
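As a rough sketch of that last point, the same "last 24 hours, grouped by state" report could be expressed as an aggregation pipeline. The field names created_at and state are assumptions, and the exact driver call varies by Mongoid/driver version:
pipeline = [
  { "$match" => { "created_at" => { "$gte" => 24.hours.ago.utc } } },  # can use an index on created_at
  { "$group" => { "_id" => "$state", "count" => { "$sum" => 1 } } },
  { "$sort"  => { "count" => -1 } }
]
SocialContent.collection.aggregate(pipeline).to_a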

Postgres tsvector_update_trigger sometimes takes minutes

I have configured free text search on a table in my postgres database. Pretty simple stuff, with firstname, lastname and email. This works well and is fast.
I do however sometimes experience looong delays when inserting a new entry into the table, where the insert keeps running for minutes and also generates huge WAL files. (We use the WAL files for replication).
Is there anything I need to be aware of with my free text index? Like Postgres maybe randomly restructuring it for performance reasons? My index is currently around 400 MB big.
Thanks in advance!
Christian
Given the size of the WAL files, I suspect you are right that it is an index update/rebalancing that is causing the issue. However I have to wonder what else is going on.
I would recommend against storing tsvectors in separate columns. A better way is to run an index on to_tsvector()'s output. You can have multiple indexes for multiple languages if you need. So instead of a trigger that takes, say, a field called description and stores the tsvector in desc_tsvector, I would recommend just doing:
CREATE INDEX mytable_description_tsvector_idx
    ON mytable USING gin (to_tsvector('english', description));  -- the two-argument to_tsvector() is immutable, as index expressions must be; GIN is the index type to use for tsvectors
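Queries then have to use exactly the same expression so the planner can match the index, e.g. (a sketch against the same hypothetical column):
SELECT * FROM mytable
WHERE to_tsvector('english', description) @@ plainto_tsquery('english', 'some search terms');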
Now, if you need a consistent search interface across a whole table, there are more elegant ways of doing this using "table methods."
In general the functional index approach has fewer issues associated with it than anything else.
Now, a second thing you should be aware of is partial indexes. If you need to, you can index only the records of interest. For example, if most of my queries only check the last year, I can:
CREATE INDEX mytable_description_tsvector_idx ON mytable
    USING gin (to_tsvector('english', description))
    WHERE created_at > '2014-01-01';  -- index predicates must be immutable too, so use a fixed cutoff rather than now() - '1 year'::interval and recreate the index periodically

Design questions for using RavenDB for reporting

I'm working on a project where I am collecting data from different sources, and afterwards I need to do some reporting on that data. All the reports are predefined.
I'm thinking of using RavenDB for this, as I think the indexes and map/reduce part could be a good fit for this, so that I create an index for each report.
Is the one-index-per-report the way to go, or does that come with any pitfalls? And how about index starvation?
One-index-per-report will lead to lots of extra indexes. Think instead of "one-index-per-dataset". Then build multiple reports around each dataset.
There might be more than one dataset per collection if you are doing map/reduce. For example, you might have the following indexes:
OrderDetailsIndex
OrderTotalsByCustomer
OrderTotalsByMonth
OrderTotalsByDay
OrderTotalsByProduct
You could build many reports from these indexes.
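As an illustration, a dataset index such as OrderTotalsByCustomer is just a map/reduce index definition. This is a sketch only: the Order document shape (Customer, Total) below is assumed for the example, not taken from your model.
using System.Linq;
using Raven.Client.Indexes;

// hypothetical document shape, for illustration only
public class Order
{
    public string Customer { get; set; }
    public decimal Total { get; set; }
}

public class OrderTotalsByCustomer : AbstractIndexCreationTask<Order, OrderTotalsByCustomer.Result>
{
    public class Result
    {
        public string Customer { get; set; }
        public decimal Total { get; set; }
    }

    public OrderTotalsByCustomer()
    {
        // map: project the fields the dataset needs
        Map = orders => from order in orders
                        select new { order.Customer, order.Total };

        // reduce: sum the totals per customer
        Reduce = results => from result in results
                            group result by result.Customer into g
                            select new { Customer = g.Key, Total = g.Sum(x => x.Total) };
    }
}
Several different reports (top customers, customer statements, and so on) can then be driven off this one index with different filters and sorts.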
What you don't want to do is to have multiple indexes like:
OrdersByCustomer
OrdersByDate
OrdersByZipCode
Those are just multiple maps that can be condensed into the same index, so it would be redundant to split them apart.
If you had one index per report, it would get out of control quickly:
OrderDetailsIndex_ForReportA
OrderDetailsIndex_ForReportB
The only difference between A and B might be the layout of the fields.
And finally, you might want to consider upgrading to RavenDB 2.5. There is a new feature called "Streaming Unbounded Results", that you can read about in Ayende's blog. This is probably the best way to feed data from an index into a report. If your reporting engine requires an IEnumerable data source (most do), then you might want to use this handy extension method I wrote.
