Based on GoodData's excellent suggestion for implementing Fact tables, I have been able to design a model that meets our client's requirements for joining different attributes across different tables. The issue I have now is that the data behind the metrics is highly denormalized, with values repeating themselves. I am currently trying to figure out a way to dedupe results.
For example, I have two tables—the first is a NAMES table and the second is my fact table:
NAMES
VAL2   NAME
35     John
36     Bill
37     Sally
FACT
VAL1   VAL2   SCORE   COURSEGRADE
1      35     50      90%
2      35     50      80%
3      35     50      60%
4      36     10      75%
5      37     40      95%
What I am trying to do is write a metric in such a way that we can get an average of SCORE that eliminates the duplicate values. GoodData is excellent in that it can actually give me back the unique results using the COUNT(VARIABLE1,RECORD) metric, but I can't seem to get the average score to stick when eliminating the breakout information. If I keep all fields (including VAL2), it shows me everything:
VAL2 SCORE(AVG)
35 50
36 10
37 40
AVG: 33.33
But when I remove VAL2, I suddenly lose the "uniqueness" of the record.
SCORE(AVG)
40
What I want is the score of 33.33 we got above.
I’ve tried using a BY statement in my SELECT AVG(SCORE), but this doesn’t seem to work. It’s almost like I need some kind of DISTINCT clause. Any thoughts on how to get that rollup value shown in my first example above?
Happy to help here. I would try the following:
Create an intermediate metric (let's call it Score by Employee):
SELECT MIN( SCORE ) BY ID ALL IN ALL OTHER DIMENSIONS
Then, once you have this metric defined you should be able to create a metric for the average score as follows:
SELECT AVG( Score by Employee )
The reason we create the first metric is to force the table to normalize SCORE around the ID attribute (VAL2 in your example), which gets rid of the duplicates when we use it in the next metric (we could have used MAX or AVG instead of MIN; since the duplicated scores are identical, it doesn't matter which).
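With your sample data, Score by Employee evaluates to 50, 10, and 40 for VAL2 = 35, 36, and 37, so the second metric returns (50 + 10 + 40) / 3 = 33.33 even when VAL2 is removed from the report.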
Hopefully this solves your issue, let me know if it doesn't work and I'll be happy to help out more. Also feel free to check out GoodData's Developer Portal for more information about reporting:
https://developer.gooddata.com/docs/reporting
Best,
JT
You should definitely check out the "How to build a metric in a metric" presentation by Petr Olmer (http://www.slideshare.net/petrolmer/in10-how-to-build-a-metric-in-a-metric).
It can help you understand this better.
Cheers,
Peter
When comparing two objects of the same size, JaVers compares 1-to-1. However, if a new change is introduced, such as a new row added to one of the objects, the comparison reports changes that are NOT actually changes. Is it possible to have JaVers ignore the addition/deletion for the sake of just comparing like objects?
Basically the indices get out of sync.
Row Name Age Phone(Cell/Work)
1 Jo 20 123
2 Sam 25 133
3 Rick 30 152
4 Rick 30 145
New List
Row Name Age Phone(Cell/Work)
1 Jo 20 123
2 Sam 25 133
3 Bill 30 170
4 Rick 30 152
5 Rick 30 145
Because Bill was added, the new comparison result will say that rows 4 and 5 have changed when they actually didn't.
Thanks.
I'm guessing that your 'rows' are objects representing rows in an Excel table and that you have mapped them as ValueObjects and put them into some list.
Since ValueObjects don't have their own identity, it's unclear, even for a human, what the actual change was. Take a look at your row 4:
Row Name Age Phone(Cell/Work)
before:
4 Rick 30 145
after:
4 Rick 30 152
Did you change the Phone at row 4 from 145 to 152? Or did you insert new data at row 4? How can we know?
We can't. By default, JaVers chooses the simplest answer, so it reports a value change at index 4.
If you don't care about the indices, you can change the list comparison algorithm from Simple to Levenshtein distance. See https://javers.org/documentation/diff-configuration/#list-algorithms
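For what it's worth, here is a minimal sketch of that configuration (the Row class below is a hypothetical ValueObject standing in for your Excel rows, and javers-core is assumed to be on the classpath):

import org.javers.core.Javers;
import org.javers.core.JaversBuilder;
import org.javers.core.diff.Diff;
import org.javers.core.diff.ListCompareAlgorithm;

import java.util.List;

public class RowDiffExample {

    // Hypothetical ValueObject standing in for your Excel rows (no @Id, so no identity).
    static class Row {
        String name;
        int age;
        int phone;

        Row(String name, int age, int phone) {
            this.name = name;
            this.age = age;
            this.phone = phone;
        }
    }

    public static void main(String[] args) {
        // Switch the list comparison algorithm from SIMPLE to LEVENSHTEIN_DISTANCE.
        Javers javers = JaversBuilder.javers()
                .withListCompareAlgorithm(ListCompareAlgorithm.LEVENSHTEIN_DISTANCE)
                .build();

        List<Row> before = List.of(
                new Row("Jo", 20, 123),
                new Row("Sam", 25, 133),
                new Row("Rick", 30, 152),
                new Row("Rick", 30, 145));
        List<Row> after = List.of(
                new Row("Jo", 20, 123),
                new Row("Sam", 25, 133),
                new Row("Bill", 30, 170),
                new Row("Rick", 30, 152),
                new Row("Rick", 30, 145));

        // With Levenshtein, the insertion of Bill should ideally show up as a single
        // added element rather than value changes on the shifted rows 4 and 5.
        Diff diff = javers.compareCollections(before, after, Row.class);
        System.out.println(diff.prettyPrint());
    }
}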
The SIMPLE algorithm generates changes for shifted elements (when elements are inserted or removed in the middle of a list). The Levenshtein algorithm, on the contrary, calculates a short and clear change list even when elements are shifted; it doesn't care about index changes for shifted elements.
But I'm not sure whether Levenshtein is implemented for ValueObjects; if it isn't yet, it's a feature request for javers-core.
I want to visualize the number of correct auto-responses my system sent relative to the percentage of questions it has already learned.
So my idea was to filter all my test results where a boolean field didSendCorrectAutoResponse is true, bucket the x axis over a field called learnPercentage, and on the y axis simply take the count as the metric.
The only problem is that the values on the y axis are absolute and only count the number of responses sent, but I want them shown as a percentage of the total number of tests per percentage learned.
Here is how I defined my chart:
I can calculate the total number of test cases for each percentage learned with this query: learnPercentage: 100 && strategy.keyword: "sum" (it only counts them for 100% of questions learned, but the number of tests is the same for each percentage).
So what I want on the y-axis is not the plain count but count / totalNumberOfTestCases
edit:
To help you better understand what I need, here is what I do with my system:
Let's say I have 100 known questions my system can learn, and I have 2500 test questions. Now I do the following:
Let my system learn none of the known questions
Ask the 2500 test questions
Save how many questions have been correctly answered (let's say 600)
Save this test result in elastic
Repeat with 10 questions learned:
Let my system learn 10% of the known questions
Ask the 2500 test questions
Save how many questions have been correctly answered (let's say 590)
Save this result in elastic
Repeat with 20 questions learned...
Now I want to plot how many questions have been correctly answered in each learning step:
600 at 0%
590 at 10%
900 at 20%
...
But instead of showing these absolute numbers I want 600/2500, 590/2500 etc on the y-axis.
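That is, 600 / 2500 = 24% at 0%, 590 / 2500 = 23.6% at 10%, 900 / 2500 = 36% at 20%, and so on.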
To visualize your Y axis as a percentage, if it is not already one, you should first create a scripted field for the relevant column and then visualize that scripted field in Kibana.
Check the screenshots; in the scripted field code, the removed part is your column name.
For example, I have the table below, which is simply a coarse distribution of 20 people by age:
age count of person
2 1
5 5
8 2
10 3
15 1
16 2
17 1
20 4
21 1
Then, using the same dataset, I could build another 'better' table:
age count of person
10- 8
10s 7
20+ 5
In fact, I could make more tables containing different age-range combinations from the same dataset.
Now I wonder how I could find the best combination. A possible "goodness function" for measuring whether a combination is good or not might follow three principles:
There should not be too many or too few classes.
Ranges of classes should not vary too much.
The distribution should be smooth enough; that is, the number of items covered by each class should not vary too much.
Since this question describes a situation general enough to cover a whole family of specific problems, I expect sophisticated solutions already exist, but I have failed to find them. Could anyone give some suggestions?
I have gone through some algorithms like PCA, k-means, and maximum-entropy-based approaches, but they seem too general to cover this specific problem while following all three of the principles above.
I would do the following:
Construct an evaluation function:
double goodness(double firstThreshold, double bucketWidth, int numBuckets)
which returns a goodness score based on your principles. I would then brute-force a number of combinations of parameters and pick the combination with the best goodness score. If we try 4-10 values for each parameter, brute force will work and will probably give you nice round numbers for the cutoffs. If you want to get more sophisticated or have it run faster, you can try other search methods like hill climbing, beam search, or simulated annealing, but I think that might be overkill for your situation.
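To make that concrete, here is a rough sketch using your age data. The penalty terms inside goodness are only my guess at how to encode your three principles (fixed-width buckets already keep the ranges uniform, apart from the open-ended end buckets), and the parameter grid is arbitrary:

import java.util.Arrays;

public class BucketSearch {

    // Raw ages from the example table (each age repeated by its count, 20 people total).
    static final int[] AGES = {2, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 15, 16, 16, 17, 20, 20, 20, 20, 21};

    // Higher is better. Principle 2 (ranges should not vary too much) is satisfied by
    // construction, since every inner bucket has the same width; the first and last
    // buckets are open-ended ("below" and "above").
    static double goodness(double firstThreshold, double bucketWidth, int numBuckets) {
        int[] counts = new int[numBuckets];
        for (int age : AGES) {
            int idx;
            if (age < firstThreshold) {
                idx = 0;                                    // open-ended "below" bucket
            } else {
                idx = 1 + (int) ((age - firstThreshold) / bucketWidth);
                idx = Math.min(idx, numBuckets - 1);        // open-ended "above" bucket
            }
            counts[idx]++;
        }
        // Principle 1: not too many or too few classes (soft preference for about 4).
        double classPenalty = Math.abs(numBuckets - 4);
        // Principle 3: bucket counts should not vary too much (penalize their std deviation).
        double mean = Arrays.stream(counts).average().orElse(0);
        double variance = Arrays.stream(counts)
                .mapToDouble(c -> (c - mean) * (c - mean))
                .average().orElse(0);
        return -classPenalty - Math.sqrt(variance);
    }

    public static void main(String[] args) {
        double bestScore = Double.NEGATIVE_INFINITY;
        String bestParams = null;
        // Brute force a small grid of "round" parameter values and keep the best.
        for (double first = 5; first <= 25; first += 5) {
            for (double width = 5; width <= 20; width += 5) {
                for (int n = 2; n <= 8; n++) {
                    double score = goodness(first, width, n);
                    if (score > bestScore) {
                        bestScore = score;
                        bestParams = String.format(
                                "firstThreshold=%.0f bucketWidth=%.0f numBuckets=%d (score %.2f)",
                                first, width, n, score);
                    }
                }
            }
        }
        System.out.println("Best combination found: " + bestParams);
    }
}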
I am talking about something like movie/item recommendation, but it seems that real estate is trickier. When visiting a website and searching for RE, the user should be presented with some suggestions. Let's separate the task into two sub-tasks:
a) the user has not yet entered any personal info - item-based recommendation
b) the user has already entered his/her details such as income, location, etc. - item/user-based recommendation
The first thing that comes to my mind for task a) is to start modeling RE features, but using some ranges instead of exact values. For example:
Area in m2
40 - 50 we can mark as "1"
50 - 70 is "2"
etc ...
Price:
20 - 30 thousand € will be marked as 1
30 - 40 will be 2
etc ...
Proximity to city center:
1 for the RE being within the city center
2 for Zone 2 or up to 2-3 kilometers from the center
3 for Zone 3 or up to 7 kilometers from the center
So having ranges lets us assign a vector to each RE property, which allows us to use Euclidean distance, Pearson correlation, and nearest-neighbor algorithms.
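To make that concrete, here is a rough sketch of the encoding and distance I have in mind (the cut-offs below are just placeholders):

import java.util.Arrays;

public class PropertyVector {

    // Placeholder cut-offs, just to illustrate the range-to-code idea.
    static int areaCode(double m2)     { return m2 < 40 ? 0 : m2 < 50 ? 1 : m2 < 70 ? 2 : 3; }
    static int priceCode(double kEuro) { return kEuro < 20 ? 0 : kEuro < 30 ? 1 : kEuro < 40 ? 2 : 3; }
    static int zoneCode(double km)     { return km < 1 ? 1 : km < 3 ? 2 : km < 7 ? 3 : 4; }

    // Encode a property as a small integer vector.
    static double[] encode(double areaM2, double priceKEuro, double kmFromCenter) {
        return new double[] { areaCode(areaM2), priceCode(priceKEuro), zoneCode(kmFromCenter) };
    }

    // Euclidean distance between two encoded properties.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] re1 = encode(45, 25, 0.5);   // 45 m2, 25 thousand EUR, city center
        double[] re2 = encode(65, 35, 5.0);   // 65 m2, 35 thousand EUR, zone 3
        System.out.println(Arrays.toString(re1) + " vs " + Arrays.toString(re2)
                + " -> distance " + distance(re1, re2));
    }
}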
Please comment on my approach or suggest a new one.
If you already have a website with enough traffic, you can try a pure collaborative filtering approach, i.e. people who viewed this property also viewed these other properties. You could use the Pearson correlation there for good results.
Similarity between two properties, RE1 and RE2, can be defined as:

        number of people who viewed both RE1 and RE2
sim =  -----------------------------------------------
        number of people who viewed either one or both
When a user is viewing a property, you can sort all the other RE properties by their similarity score with the one being shown and display the top few.
You could add some obvious filters on top of this like the location of the property, the price range etc.
You can also define the similarity as you have suggested and mix the results from both, to give good representation to new RE entries, which would not have a high chance of getting in if a pure collaborative filtering algorithm were used.
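As an illustration only (the data structures are placeholders: a map from each user to the set of property IDs they have viewed), the similarity and top-N selection could be sketched like this:

import java.util.*;
import java.util.stream.Collectors;

public class PropertySimilarity {

    // sim(a, b) = |users who viewed both| / |users who viewed at least one|
    static double similarity(String a, String b, Map<String, Set<String>> viewsByUser) {
        long both = 0, either = 0;
        for (Set<String> viewed : viewsByUser.values()) {
            boolean sawA = viewed.contains(a), sawB = viewed.contains(b);
            if (sawA && sawB) both++;
            if (sawA || sawB) either++;
        }
        return either == 0 ? 0.0 : (double) both / either;
    }

    // Return the k properties most similar to the one currently being shown.
    static List<String> recommend(String shown, Set<String> allProperties,
                                  Map<String, Set<String>> viewsByUser, int k) {
        return allProperties.stream()
                .filter(p -> !p.equals(shown))
                .sorted(Comparator.comparingDouble(
                        (String p) -> similarity(shown, p, viewsByUser)).reversed())
                .limit(k)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Set<String>> viewsByUser = Map.of(
                "u1", Set.of("re1", "re2"),
                "u2", Set.of("re1", "re2", "re3"),
                "u3", Set.of("re2", "re3"));
        Set<String> all = Set.of("re1", "re2", "re3");
        // Prints [re2, re3]: re2 shares more viewers with re1 than re3 does.
        System.out.println(recommend("re1", all, viewsByUser, 2));
    }
}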
Suppose you were able to keep track of the news mentions of different entities, like say "Steve Jobs" and "Steve Ballmer".
What are some ways you could tell whether the number of mentions per entity in a given time period was unusual relative to their normal frequency of appearance?
I imagine that for a more popular person like Steve Jobs an increase of 50% might be unusual (an increase from 1000 to 1500), while for a relatively unknown CEO an increase of 1000% for a given day could be possible (an increase from 2 to 200). If you didn't have a way of scaling that, your unusualness index could be dominated by unheard-ofs getting their 15 minutes of fame.
update: To make it clearer, it's assumed that you are already able to get a continuous news stream and identify entities in each news item and store all of this in a relational data store.
You could use a rolling average. This is how a lot of stock trackers work. By tracking the last n data points, you could see if this change was a substantial change outside of their usual variance.
You could also try some normalization -- one very simple approach would be that each category has a total number of mentions (m), a percent change from the last time period (δ, computed as (m - m0) / m0), and then a normalized value (z) where z = m * δ. Let's look at the table below (m0 is the previous value of m):
Name            m      m0     δ      z
Steve Jobs      4950   4500   .10    495
Steve Ballmer   400    300    .33    132
Larry Ellison   50     10     4.0    200
Andy Nobody     50     40     .25    12.5
Here, a 400% change for the relatively unknown Larry Ellison results in a z value of 200, a 10% change for the much better known Steve Jobs gives 495, and my spike of 25% is still a low 12.5. You could tweak this algorithm depending on what you feel are good weights, or use the standard deviation or the rolling average to see whether the latest value is far away from the "expected" results.
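A minimal sketch of that normalization (the numbers match the table above, with δ kept at full precision rather than rounded):

public class MentionNormalization {

    // z = m * delta, where delta = (m - m0) / m0 is the percent change from the last period.
    static double normalizedChange(double m, double m0) {
        double delta = (m - m0) / m0;
        return m * delta;
    }

    public static void main(String[] args) {
        System.out.println(normalizedChange(4950, 4500)); // Steve Jobs    -> 495.0
        System.out.println(normalizedChange(400, 300));   // Steve Ballmer -> ~133.3
        System.out.println(normalizedChange(50, 10));     // Larry Ellison -> 200.0
        System.out.println(normalizedChange(50, 40));     // Andy Nobody   -> 12.5
    }
}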
Create a database and keep a history of stories with a time stamp. You then have a history of stories over time of each category of news item you're monitoring.
Periodically calculate the number of stories per unit of time (you choose the unit).
Test if the current value is more than X standard deviations away from the historical data.
Some data will be more volatile than others, so you may need to adjust X appropriately. X = 1 is a reasonable starting point.
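A minimal sketch of that check (the window contents, window size, and X are up to you):

import java.util.Arrays;

public class MentionSpikeDetector {

    // Flag today's count as unusual if it is more than x standard deviations
    // above the mean of the recent counts.
    static boolean isUnusual(int[] recentCounts, int todayCount, double x) {
        double mean = Arrays.stream(recentCounts).average().orElse(0);
        double variance = Arrays.stream(recentCounts)
                .mapToDouble(c -> (c - mean) * (c - mean))
                .average().orElse(0);
        double stdDev = Math.sqrt(variance);
        return todayCount > mean + x * stdDev;
    }

    public static void main(String[] args) {
        int[] lastWeek = {1000, 1100, 950, 1050, 980, 1020, 1000};
        System.out.println(isUnusual(lastWeek, 1500, 1.0)); // true: well above one std dev
        System.out.println(isUnusual(lastWeek, 1040, 1.0)); // false: within the usual variance
    }
}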
Way oversimplified:
Store people's names and the number of articles created in the past 24 hours that involve their name. Compare to historical data.
Real life:
If you're trying to dynamically pick out people's names, how would you go about doing that? Searching through articles, how do you grab names? Once you grab a new name, do you search all articles for it? How do you separate Steve Jobs of Apple from Steve Jobs the new star running back who is generating a lot of articles?
If you're looking for simplicity, create a table with 50 people's names that you insert yourself. Every day at midnight, have your program run a quick Google query for the past 24 hours and store the number of results. There are a lot of variables in this, though, that we're not accounting for.
The method you use is going to depend on the distribution of the counts for each person. My hunch is that they are not going to be normally distributed, which means that some of the standard approaches to longitudinal data might not be appropriate - especially for the small-fry, unknown CEOs you mention, who will have data that are very much non-continuous.
I'm really not well-versed enough in longitudinal methods to give you a solid answer here, but here's what I'd probably do if you locked me in a room to implement this right now:
Dig up a bunch of past data. Hard to say how much you'd need, but I would basically go until it gets computationally insane or the timeline gets unrealistic (not expecting Steve Jobs references from the 1930s).
In preparation for creating a simulated "probability distribution" of sorts (I'm using terms loosely here), more recent data needs to be weighted more than past data - e.g., a thousand years from now, hearing one mention of (this) Steve Jobs might be considered a noteworthy event, so you wouldn't want to be using expected counts from today (Andy's rolling mean is using this same principle). For each count (day) in your database, create a sampling probability that decays over time. Yesterday is the most relevant datum and should be sampled frequently; 30 years ago should not.
Sample out of that dataset using the weights and with replacement (i.e., same datum can be sampled more than once). How many draws you make depends on the data, how many people you're tracking, how good your hardware is, etc. More is better.
Compare your actual count of stories for the day in question to that distribution. What percent of the simulated counts lie above your real count? That's roughly (god don't let any economists look at this) the probability of your real count or a larger one happening on that day. Now you decide what's relevant - 5% is the norm, but it's an arbitrary, stupid norm. Just browse your results for a while and see what seems relevant to you. The end.
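If it helps, here is a rough sketch of the weighting, sampling, and comparison steps (the decay factor and number of draws are arbitrary placeholders):

import java.util.Random;

public class MentionBootstrap {

    // Given daily mention counts for one person (oldest first), sample past counts
    // with replacement, weighting recent days more heavily, and return the fraction
    // of simulated counts >= today's count (a rough "how unusual is today" score).
    static double unusualness(int[] dailyCounts, int todayCount, int draws, double decay, Random rng) {
        int n = dailyCounts.length;
        // Weights that decay going back in time: yesterday gets 1, the day before decay, etc.
        double[] weights = new double[n];
        double total = 0;
        for (int i = 0; i < n; i++) {
            weights[i] = Math.pow(decay, n - 1 - i);
            total += weights[i];
        }
        int atLeastAsLarge = 0;
        for (int d = 0; d < draws; d++) {
            // One weighted draw with replacement.
            double r = rng.nextDouble() * total;
            int idx = 0;
            while (idx < n - 1 && r > weights[idx]) {
                r -= weights[idx];
                idx++;
            }
            if (dailyCounts[idx] >= todayCount) atLeastAsLarge++;
        }
        return (double) atLeastAsLarge / draws;
    }

    public static void main(String[] args) {
        int[] history = {1200, 900, 1000, 1100, 950, 1050, 1000};
        Random rng = new Random(42);
        // Roughly 0.4: a day of 1050 mentions is not unusual for this history.
        System.out.println(unusualness(history, 1050, 100_000, 0.9, rng));
        // Exactly 0.0: no past day reached 1500, so that would look very unusual.
        System.out.println(unusualness(history, 1500, 100_000, 0.9, rng));
    }
}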
Here's what sucks about this method: there's no trend in it. If Steve Jobs had 15,000 a week ago, 2000 three days ago, and 300 yesterday, there's a clear downward trend. But the method outlined above can only account for that by reducing the weights for the older data; it has no way to project that trend forward. It assumes that the process is basically stationary - that there's no real change going on over time, just more and less probable events from the same random process.
Anyway, if you have the patience and willpower, check into some real statistics. You could look into multilevel models (each day is a repeated measure nested within an individual), for example. Just beware of your parametric assumptions... mention counts, especially on the small end, are not going to be normal. If they fit a parametric distribution at all, it would be in the Poisson family: the Poisson itself (good luck), the overdispersed Poisson (aka negative binomial), or the zero-inflated Poisson (quite likely for your small-fry, no chance for Steve).
Awesome question, at any rate. Lend your support to the statistics StackExchange site, and once it's up you'll be able to get a much better answer than this.