In order to populate a SCD2 dimension table, a marker to note the lastest active row is always beneficial.
There are two ways I can think of
1) valid_from/valid_to
2) active_status: active/deleted
It is clear that valid_from/valid_to keeps more information, but would that complicate the ETL process a lot?
what are the prons and crons of these two methods?
There are mainly two way to implement SCD2
1 Keep versioning .
2 Keep start date and end date for a dimension.
In most of times we use second approach with having a active inactive flag.
https://en.wikipedia.org/wiki/Slowly_changing_dimension
You'll need the from/to dates if you ever want to load historic data.
The current/active flag is just a query helper.
Related
I don't know if I even worded the question correctly, but I'm trying to create a measure that depends on what is showing in the pivot table (using PowerPivot). In the image I posted, "DealMonth" is an expression in the PowerQuery table itself that simply takes the start date of the employee and subtracts it from the month a deal was closed in. That will show how long it took for that salesperson to close the deal. "TenureMonths" is also an expression in the PowerQuery table that calculates the tenure of the person. The values populating this screenshot are coming from a total headcount measure created. What I'm trying to do is create a separate measure that will show when the "TenureMonths" is less than the "DealMonth." So if the TenureMonths is 5, then after DealMonth of 5, the value would be 0. Is this possible?
Screenshot
I should add the following information.
"DealMonth" - Comes from the FactData table
"TenureMonths" - Comes from the DimSalesStart table
These two tables are joined by name. I feel like I'm so close because I can see what I want. The second image below is a copy/paste of the pivot table result but with my edits to show what I'd want to have shown. Basically, if(TenureMonths >= DealMonth,1,0). The trouble seems to be that since they're in two different tables, I can't make it work. The rows in the fact table are transactions, but the rows in the dim table are just the people with their start and end dates.
Desired Result
This is possible with some IF([measure1]<[measure2],blank(),[measure1]), however without seeing more of the data it will be hard to guide you specifically.
However you need to create two separate measures, one for TenureMonths and one for DealMonth, depending on the data this can be done with an aggregator forumla such as sum, min, max, etc (depends if there will be more than one value).
Then reference those two measures in the formula pattern I mentioned above, and that should give you want you want.
I figured out a solution. I added a dimension table for DealMonth itself and joined to my fact table. That allowed me to do the formulas that I needed.
In Cassandra, a row can be very long and store units of time relevant data. For example, one row could look like the following:
RowKey: "weather"
name=2013-01-02:temperature, value=90,
name=2013-01-02:humidity, value=23,
name=2013-01-02:rain, value=false",
name=2013-01-03:temperature, value=91,
name=2013-01-03:humidity, value=24,
name=2013-01-03:rain, value=false",
name=2013-01-04:temperature, value=90,
name=2013-01-04:humidity, value=23,
name=2013-01-04:rain, value=false".
9 columns of 3 days' weather info.
time is a primary key in this row. So the order of this row would be time based.
My question is, is there any way for me to do a query like: what is the last/first day's humidity value in this row? I know I could use a Order By statement in CQL but since this row is already sorted by time, there should be some way to just get the first/last one directly, instead of doing another sort. Or is cassandra optimizing it already with Order By statement under the hood?
Another way I could think of is, store another column in this row called "last_time_stamp" that always updates itself as new data is inserted in. But that would require one more update every time I insert new weather data.
Thanks for any suggestion!:)
Without seeing more of your actual table, I suggest using a timestamp (or timeuuid if there is a possibility for collisions) as the second component in a compound primary key. Using this, you can get the last "row" by selecting ORDER BY t DESC LIMIT 1.
You could also change the clustering order in your schema to order it naturally for "last N" queries.
Please see examples and linked resource in this answer.
I have been battling this issue for days without success. I have a very tricky format of a report i need to achieve but the main thing is that all the datasets will need to be grouped by 1 parent. I'll attempt to explain...
Say we have dataset1, dataset2. Both have AccountNumber as common field(parent).
I need both datasets to be used in the format/layout of the report but grouped together by AccountNumber, something like this.
[Report Header Data]
[AccountNumber Group]
Dataset1
Dataset2
[end AccountNumber Group]
What is the best way to achieve this? The format of the report has been a major road block on grouping thus making me split the data into multiple datasets, group all them together by accountnumber and then create a custom format per dataset in the report. The flow of the report may be something like this
[Report Header Data]
[AccountNumber Group]
[tablix1]
Dataset1
[tablix1]
[tablix2]
Dataset2
[tablix2]
[end AccountNumber Group]
Looking forward to the discussion on this!
There are multiple ways to achieve this effect, and the best for your situation depends on the details of your report. So I'll just give some of the techniques I've used in the past:
Join the two datasets into one
Joining the datasets into one in your query is one of the simplest answers, and works across all versions of SSRS. It can make the SQL queries large, but it makes report layout simple.
Use the Lookup(...) function
SSRS 2008R2 added the Lookup(...) function, which can be used to access items in a second dataset. It's a little bit awkward to use, and requires a separate formula for every field to be accessed, but it is very powerful for retrieving a few fields from a different dataset.
Sub reports
Similar to the approach descibed in the original question, This lets you create a parent project with one tablix, and then place a subreport within. The subreport will be called multiple times, with the Grouping item as a parameter. Each run of the report should only return the report for that instance of the group. This can be very powerful, but maintenance is difficult: you have two places to change some thingss, and it can require manual tweaking to make sure columns line up correctly. The subreport will often be the fastest report to run, since it is getting called many times.
[NB: StackOverflow.com isn't the best place for discussions. The design of the site is set up to avoid discussion and aim towards question & answers, not discussion.]
I don't know if there's a perfect solution here.
Based on your description, (and it sounds like you're leaning in this direction already) you'll need a Dataset for each distinct AccountNumber, and create a new list or table based on this.
Once you have this set up you need to embed the different Dataset objects (i.e. tablix1, tablix2) in each row.
The main issue here is that you can't use multiple Datasets when embedding tablixes within tablixes, so this makes me think that you may need a subreport solution - this way the subreports can take an AccountNumber parameter and each use a different Dataset.
So something like:
[Report Header Data]
[AccountNumber Group]
[subreport1]
[tablix1]
Dataset1
[tablix1]
[subreport1]
[subreport2]
[tablix2]
Dataset2
[tablix2]
[subreport2]
[end AccountNumber Group]
This will repeat for each AccountNumber as required.
It's tough to say without knowing exactly what your data looks like, but in 2008R2 and above you can use Lookup and LookupSet to join Datasets, but that will be cumbersome for multiple values, even if you are running the correct edition.
Again, depending on your data, another option is adjacent groups, if you can manage to get the data in one Dataset... This would allow to have different groupings next to another under the AccountName group, but it's a long shot.
It would be great if we know the report data e.g Payslip, payslip with loan balance (ie Dataset 1 for payslip and Dataset 2 for loan).
Anyway, the format will depend on the required output of the report. i.e If your planning on produce calculation like sum in the report and if the result output will be per Dataset or for both dataset.
Assuming you will need sum calculations, if the calculation result will be per dataset, then option 2 is good, if the calculation result is for total (Dataset 1 + Dataset 2) then option 1 is better.
If no calculations or total result is required, either will do.
To improve my skills on Hector and cassandra I'm trying diffrent methods to query data out of cassandra.
Currently I'm trying to make a simple message system. I would like to get the posted messages in chronological order with the last posted message first.
In plain sql it is possible to use 'order by'. I know it is possible if you use the OrderPreservingPartitioner but this partioner is deprecated and less-efficient than the RandomPartioner. I thought of creating an index on a secondary column with a timestamp als value, but I can't figure out how to obtain the data. I'm sure that I have to use at least two queries.
My column Family looks like this:
create column family messages
with comparator = UTF8Type
and key_validation_class=LongType
and compression_options =
{sstable_compression:SnappyCompressor, chunk_length_kb:64}
and column_metadata = [
{column_name: message, validation_class: UTF8Type}
{column_name: index, validation_class: DateType, index_type: KEYS}
];
I'm not sure if I should use DataType or long for the index column, but I think that's not important for this question.
So how can I get the data sorted? If possible I like to know hows its done white the CQL syntax and whitout.
Thanks in advance.
I don't think there's a completely simple way to do this when using RandomPartitioner.
The columns within each row are stored in sorted order automatically, so you could store each message as a column, keyed on timestamp.
Pretty soon, of course, your row would grow large. So you would need to divide up the messages into rows (by day, hour or minute, etc) and your client would need to work out which rows (time periods) to access.
See also Cassandra time series data
and http://rubyscale.com/2011/basic-time-series-with-cassandra/
and https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
and http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/
Basically I have a 'thread line' where new threads are made and a TimeUUID is used as a key. Which obviously provides sorting of a new thread quite easily, espically when say making a query of the latest 20 threads etc.
My problem is that when a new 'post' is made to a thread I want to be able to 'bump' that thread to the front of the 'thread line' which is where the problem comes in, how do I basically make this happen so I can still make queries that can still be selected in the right order without providing any kind of duplicates etc.
The only way I can see this working is if rather than a column family sorting via a TimeUUID I need the column family to sort via the insertion Timestamp, therefore I can use the unique thread IDs for column keys and retrieve these in the order they are inserted or reinserted rather than by TimeUUID? Is this possible or am I missing a simple trick that allows for this? As far as I know you have to set a particular comparitor or otherwise it defaults to bytes?
Columns within a row are always sorted by name with the given comparator. You cannot sort by timestamp or value or anything else, or Cassandra would not be able to merge multiple updates to the same column correctly.
As to your use case, I can think of two options.
The most similar to what you are doing now would be to create a second columnfamily, ThreadMostRecentPosts, with timeuuid columns (you said "keys" but it sounds like you mean "columns"). When a new post arrives, delete the old most-recent column and add a new one.
This has two problems:
The unit of replication is the row, so having this grow indefinitely could be problematic. (Using expiring columns to age out no-longer-relevant thread information might help.)
You need a lock manager so that multiple posts to the same thread don't race and possibly leave multiple entries in this row.
I would suggest instead creating a row per day (for instance), whose columns are the thread IDs and whose values are the most recent post. Adding a new post just updates the value in that column; no delete/re-add is done, so the race is not a problem anymore. You don't get sorting for free anymore but that's okay because you're limiting it to a small enough set that you can do that sort in memory (say, yesterday's threads and today's).
(Finally, I would add that I can say from experience that having a cutoff past which old threads don't get bumped to the front by a new reply is a Good Thing.)