Is there a limit on the number of measures in Mondrian or more specifically Saiku? - mondrian

I have a couple of different data sets for which I am attempting to automate cube generation for Saiku 2.6. For data sets with a limited number of dimensions and measures it works pretty well. However, for schemas with many measures (CalculatedMembers specifically), Saiku is not showing all of the measures defined in the schema. In fact, it seems that the maximum number of measures (CalculatedMembers) Saiku will show at any given time is 115.
I know this sounds like a lot (it is), but in our case it's necessary. There doesn't seem to be anything wrong with the schema definition. For example, if I create a schema with 230 measures, only the first 115 will show. If I then delete those first 115 and refresh the schema, the next 115, which previously were not showing, become visible.
This seems to me to be a bug in Saiku but I have not been able to pin it down yet. Has anyone else experienced this? Any advice?
Thanks!

I've finally been able to figure this out and I hope it helps someone else. Even though the XML is well formed and can even be opened in Pentaho's Schema Designer, Mondrian will not pick up any measures beyond the initial list of Measure elements. For example, this:
<Measure name="Cnt - A" column="r_a" aggregator="count" visible="true"></Measure>
<Measure name="Cnt - B" column="r_b" aggregator="count" visible="true"></Measure>
<CalculatedMember name="Sum - A_Rolling_12" dimension="Measures" hierarchy="[A]">
<Formula>sum(parallelperiod([Business date.Time Hierarchy].[Year],1,[Business date.Time Hierarchy].CurrentMember):[Business date.Time Hierarchy].CurrentMember,[Measures].[Sum - A])</Formula>
</CalculatedMember>
<CalculatedMember name="Sum - B_Rolling_12" dimension="Measures" hierarchy="[B]">
<Formula>sum(parallelperiod([Business date.Time Hierarchy].[Year],1,[Business date.Time Hierarchy].CurrentMember):[Business date.Time Hierarchy].CurrentMember,[Measures].[Sum - B])</Formula>
</CalculatedMember>
works fine. The following, however, does not; in this case, B does not show up as a calculated member:
<Measure name="Cnt - A" column="r_a" aggregator="count" visible="true"></Measure>
<CalculatedMember name="Sum - A_Rolling_12" dimension="Measures" hierarchy="[A]">
<Formula>sum(parallelperiod([Business date.Time Hierarchy].[Year],1,[Business date.Time Hierarchy].CurrentMember):[Business date.Time Hierarchy].CurrentMember,[Measures].[Sum - A])</Formula>
</CalculatedMember>
<Measure name="Cnt - B" column="r_b" aggregator="count" visible="true"></Measure>
<CalculatedMember name="Sum - B_Rolling_12" dimension="Measures" hierarchy="[B]">
<Formula>sum(parallelperiod([Business date.Time Hierarchy].[Year],1,[Business date.Time Hierarchy].CurrentMember):[Business date.Time Hierarchy].CurrentMember,[Measures].[Sum - B])</Formula>
</CalculatedMember>
This seems to me like a bug in Mondrian's parser: grouping measures like this is pretty logical and even validates against their schema, but it does not work. The workaround is simply to declare all of the Measure elements first, as in the first snippet, and put every CalculatedMember after them. Hope this saves someone some frustration.
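Since a generated schema can validate and still hit this, one thing that may help when automating cube generation is a quick sanity check that, within each Cube, no Measure element appears after a CalculatedMember. Below is a minimal sketch of such a check using Python's standard library; it is not part of the original workaround, and the function name and the schema.xml path are illustrative:
import xml.etree.ElementTree as ET

def measures_after_calculated_members(path):
    """Return (cube, measure) pairs where a Measure follows a CalculatedMember."""
    problems = []
    for cube in ET.parse(path).getroot().iter("Cube"):
        seen_calculated_member = False
        for child in cube:
            if child.tag == "CalculatedMember":
                seen_calculated_member = True
            elif child.tag == "Measure" and seen_calculated_member:
                # This Measure (and anything depending on it) may be dropped.
                problems.append((cube.get("name"), child.get("name")))
    return problems

print(measures_after_calculated_members("schema.xml"))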

Related

How to use Kettle to handle the linefeed in a field

I want to use Kettle to handle a data set.
The format of the data set is as below
product/productId: B00006HAXW
review/userId: A1RSDE90N6RSZF
review/profileName: Joseph M. Kotow
review/helpfulness: 9/9
review/score: 5.0
review/time: 1042502400
review/summary: Pittsburgh - Home of the OLDIES
review/text: I have all of the doo wop DVD's and this one is as good or better than the
1st ones. Remember once these performers are gone, we'll never get to see them again.
Rhino did an excellent job and if you like or love doo wop and Rock n Roll you'll LOVE
this DVD !!
I read every line of the data first and then transform every eight rows into a record.
However, there is also data like this:
review/profileName: nancy "crzyfnyfrog
I love my purple pigtails."
This field contains a linefeed and I don't know how to handle it.
Right now I use script code to achieve what I want, but I would still like to know how to solve it with the basic built-in components.
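This is not a Kettle-step answer, but for reference, here is a minimal sketch in Python of the grouping logic described above. It assumes the field keys shown in the sample, that every record starts with product/productId, and that any line not starting with a known key is a continuation of the previous field (which covers the embedded linefeed in review/profileName). The reviews.txt file name is illustrative.
KEYS = (
    "product/productId", "review/userId", "review/profileName",
    "review/helpfulness", "review/score", "review/time",
    "review/summary", "review/text",
)

def parse_records(path):
    records = []
    record, last_key = {}, None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            key, sep, value = line.partition(": ")
            if sep and key in KEYS:
                if key == "product/productId" and record:
                    records.append(record)  # previous record is complete
                    record = {}
                record[key] = value
                last_key = key
            elif last_key is not None:
                # No known key prefix: treat the line as a continuation of the
                # previous field, i.e. a value that contains a linefeed.
                record[last_key] += "\n" + line
    if record:
        records.append(record)
    return records

records = parse_records("reviews.txt")
print(len(records), records[0]["review/profileName"])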

Reporting Multiple Values & Sorting

Having a bit of an issue and unsure if it's actually possible to do.
I'm working on a file in which I will enter target progression vs. actual targets, reporting the % outcome.
PAGE 1
¦NAME ¦TAR 1 %¦TAR 2 %¦TAR 3 %¦TAR 4 %¦OVERALL¦SUB 1¦SUB 2¦SUB 3¦
¦NAME1¦ 114%¦ 121%¦ 100%¦ 250%¦ 146%¦ 2¦ 0¦ 0%¦
¦NAME2¦ 88%¦ 100%¦ 90%¦ 50%¦ 82%¦ 0¦ 1¦ 0%¦
¦NAME3¦ 82%¦ 54%¦ 64%¦ 100%¦ 75%¦ 6¦ 6¦ 15%¦
¦NAME4¦ 103%¦ 64%¦ 56%¦ 43%¦ 67%¦ 4¦ 4¦ 24%¦
¦NAME5¦ 87%¦ 63%¦ 89%¦ 0%¦ 60%¦ 3¦ 2¦ 16%¦
Now I already have it sorting all rows by the Overall % column so I can see everything at a glance, but I am creating a second page on which I need to reference specific values.
So on the second page I would like to somehow sort and reference different columns, for example:
PAGE 2
TOP TAR 1¦Name of top %¦Top %¦
TOP TAR 2¦Name of top %¦Top %¦
Is something like this possible to do?
Essentially I'm creating an Employee of the Month form that automatically works out who has topped what.
I'm willing to drop a paypal donation for whoever can figure this out for me as I've been doing it manually every month and would appreciate the time saved
I don't think a complicated array formula is necessary for this - I am suggesting a fairly standard Index/Match approach.
First set up the row titles - you can just copy and transpose them from Page 1, or use a formula in A2 of Page 2 like
=transpose('Page 1'!B1:E1)
Then use them in an INDEX/MATCH to get the data in the corresponding column of the main sheet and find its maximum (in C2):
=max(index('Page 1'!A:E,0,match(A2,'Page 1'!A$1:E$1,0)))
Finally look up the maximum in the main sheet to find the corresponding name:
=index('Page 1'!A:A,match(C2,index('Page 1'!A:E,0,match(A2,'Page 1'!A$1:E$1,0)),0))
If you think there could be a tie for first place with two or more people getting the same score, you could use a filter to get the different names:
So if the max score is in B8 this time (same formula)
=max(index('Page 1'!A:E,0,match(A8,'Page 1'!A$1:E$1,0)))
the different names could be spread across the corresponding row using transpose (in C8)
=ArrayFormula(TRANSPOSE(filter('Page 1'!A:A,index('Page 1'!A:E,0,match(A8,'Page 1'!A$1:E$1,0))=B8)))
I have changed the test data slightly to show these different scenarios
Results

How to show the name of the data source of the line in Timelion

I have 20+ data sources; some are updated regularly and others are not.
I have made a gruesome concatenate copy/paste thing in Excel, so a "for loop" would be nice, but for now I am primarily missing the names.
.es(index=US, timefield=timestamp), .es(index=UK, timefield=base_date)
Both show up as:
q:* > count
q:* > count
I thought I could fix it as follows:
.es(index=US, timefield=timestamp, q=US), .es(index=UK, timefield=base_date, q=UK)
And then I had my coffee...
Primary question: How to add labels to datasources?
Secondary question: Is there an easy way to see what amount of time each data source spans?
Thanks in advance for your comments!

How to select data in scrapy from html file having both class and id?

<div class="section-body" id="section-2"><p>Most people with aortic stenosis do not develop symptoms until the disease is advanced. The diagnosis may have been made when the health care provider heard a heart murmur and performed tests.</p><p>Symptoms of aortic stenosis include:</p><ul><li>Chest discomfort: The chest pain may get worse with activity and reach into the arm, neck, or jaw. The chest may also feel tight or squeezed.</li><li>Cough, possibly bloody.</li><li>Breathing problems when exercising.</li><li>Becoming easily tired.</li><li>Feeling the heartbeat (palpitations).</li><li>Fainting, weakness, or dizziness with activity.</li></ul><p>In infants and children, symptoms include:</p><ul><li>Becoming easily tired with exertion (in mild cases)</li><li>Failure to gain weight</li><li>Poor feeding</li><li>Serious breathing problems that develop within days or weeks of birth (in severe cases)</li></ul><p>Children with mild or moderate aortic stenosis may get worse as they get older. They are also at risk for a heart infection called bacterial endocarditis.</p></div></div></section>
I have the above HTML and I want to scrape the data in the list, i.e. in the <li> elements.
I have tried the following commands in Scrapy, but they are not working; each returns '[]' as the output.
response.css("article div.section-body p").extract() <-- this is giving all info under section body but I want only under section-2
response.css("article div.section-body.section-2 p::text").extract()
response.xpath("//article/*[contains(#id, 'setion-2')]").extract()
Please help me extract this. Thanks.
Try
response.css("article div.section-body#section-2 p::text").extract()
div.section-body#section-2 means: select the DIV that has both the class section-body and the ID section-2.
Note that an ID is selected with # and a class with ., so the CSS selector posted in your question was wrong.
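As a quick, self-contained way to try that selector outside a spider, the snippet below runs it against a shortened copy of the posted fragment. The article ancestor is left out because it is not part of the HTML shown in the question, and li::text is added since the question asks for the list items:
from scrapy.selector import Selector

# The HTML fragment from the question, shortened here for readability.
html = (
    '<div class="section-body" id="section-2">'
    '<p>Symptoms of aortic stenosis include:</p>'
    '<ul><li>Chest discomfort.</li><li>Cough, possibly bloody.</li></ul>'
    '</div>'
)

sel = Selector(text=html)

# Class and ID combined: "." narrows by class, "#" narrows by id.
print(sel.css("div.section-body#section-2 p::text").extract())
print(sel.css("div.section-body#section-2 li::text").extract())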

Firebase many to many performance

I'm wondering about the performance of Firebase when making n + 1 queries. Let's consider the example in this article https://www.firebase.com/blog/2013-04-12-denormalizing-is-normal.html where a link has many comments. If I want to get all of the comments for a link I have to:
Make 1 query to get the index of comments under the link
For each comment ID, make a query to get that comment.
Here's the sample code from that article that fetches all comments belonging to a link:
var commentsRef = new Firebase("https://awesome.firebaseio-demo.com/comments");
var linkRef = new Firebase("https://awesome.firebaseio-demo.com/links");
var linkCommentsRef = linkRef.child(LINK_ID).child("comments");
linkCommentsRef.on("child_added", function(snap) {
  commentsRef.child(snap.key()).once("value", function(commentSnap) {
    // Render the comment (commentSnap.val()) on the link page.
  });
});
I'm wondering whether this is a performance concern compared to the equivalent in a SQL database, where I could fetch everything with a single query: SELECT * FROM comments WHERE link_id = LINK_ID.
Imagine I have a link with 1000 comments. In SQL this would be a single query, but in Firebase this would be 1001 queries. Should I be worried about the performance of this?
One thing to keep in mind is that Firebase works over web sockets (where available), so while there may be 1001 round trips, there is only one connection that needs to be established. Also, a lot of the round trips will be happening in parallel. So you might be surprised at how little time this takes.
Should I worry about this?
In general people over-estimate the amount of use they'll get. So (again: in general) I recommend that you don't worry about it until you actually have that many comments. But from day 1, ensure that nothing you do today precludes optimizing later.
One way to optimize is to further denormalize your data. If you already know that you need all comments every time you render an article, you can also consider duplicating the comments into the article.
A fairly common scenario:
/users
  twitter:4784
    name: "Frank van Puffelen"
    otherData: ....
/messages
  -J4377684
    text: "Hello world"
    uid: "twitter:4784"
    name: "Frank van Puffelen"
  -J4377964
    text: "Welcome to StackOverflow"
    uid: "twitter:4784"
    name: "Frank van Puffelen"
So in the above data snippet I store both the user's uid and their name for every message. While I could look up the name from the uid, having the name in the messages means I can display the messages without the lookup. I'm also keeping the uid so that I can link to the user's profile page (or their other messages).
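Applied to the link/comments case from the question, the same duplication idea would mean storing the comment bodies under the link as well; the IDs and field values below are illustrative, not from the article:
/links
  LINK_ID
    comments
      -Jcomment1
        text: "First comment"
        uid: "twitter:4784"
        name: "Frank van Puffelen"
      -Jcomment2
        text: "Second comment"
        uid: "twitter:5123"
        name: "Another user"
Rendering the link page then no longer needs the 1000 extra reads, at the cost of keeping the duplicated comments up to date.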
We recently had a good question about this, where I wrote more about the approaches I consider for keeping the derived data up to date: How to write denormalized data in Firebase
