I have two fact tables: one depends on a date dimension (day, month, year), and the other depends only on month and year.
So my question is: do I need to create two dimensions, one with (day, month, year) and another that has only year and month?
Thank you.
A touch late here; sorry about that. Yes, you should build two dimension tables. I'd also recommend a relationship between them (i.e. each month has multiple days). Finally, and some consider this controversial, you might want to take more of a snowflake approach here and have the day-level table contain no information about months (e.g. month name, month number, etc.) beyond a link to the month table. The downside is that you'll almost always have to join the month table to the day table when you use the day table. Some feel this join is cheap and worth it for the benefit of reduced data redundancy; others feel that any unnecessary join in a star schema is to be avoided.
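For illustration, a minimal sketch of that snowflaked layout (the table and column names here are my own, not from the question):

CREATE TABLE dim_month (
    month_key    INT PRIMARY KEY,   -- surrogate key
    year_number  INT,
    month_number INT,               -- 1..12
    month_name   VARCHAR(10)
);

CREATE TABLE dim_day (
    day_key      INT PRIMARY KEY,   -- surrogate key
    full_date    DATE,
    day_of_month INT,
    month_key    INT REFERENCES dim_month (month_key)   -- snowflake link; no month attributes repeated here
);

The daily fact table would reference dim_day, and the monthly fact table would reference dim_month directly.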
Can someone please help explain why the table is taking so much time to write when the table is very small?
As advised here, you shouldn't partition on a column that has high cardinality (number of unique values). As can be seen in the screenshot, the orderDate column has 753 unique values. Under the covers that means 753 folders have to be created, and each folder would have on average ~1.2 records in a parquet file (assuming equal date distribution).
You should consider extracting the month and year, or just the year, from the orderDate column and partitioning on that.
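For example, a hedged sketch in Hive-style SQL (the table names orders and orders_by_year are assumptions; if the table is written from Spark, the same idea applies via partitionBy on the derived column):

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE orders_by_year (
    order_id  INT,
    orderDate DATE
)
PARTITIONED BY (order_year INT)
STORED AS PARQUET;

-- a handful of year folders instead of 753 date folders
INSERT OVERWRITE TABLE orders_by_year PARTITION (order_year)
SELECT order_id, orderDate, year(orderDate) AS order_year
FROM orders;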
I was creating some analysis of revenue for past years. One thing I noticed is that the revenue measures for each month of a year are the same for every year's corresponding month; that is, revenue for April 2015 is the same as revenue for April 2016.
I did some searching to solve this problem. I found that our measure column 'Revenue' is aggregated over the time dimension as 'Last(sum(revenue))'. So the actual revenue value of April 2019 is treated by OBIEE as the last value and copied to every other year's April revenue.
I can understand that the keyword 'last' may be the reason for this, but shouldn't the year, quarter, and month columns pick exactly the numbers that correspond to that date? Can someone explain how this works and suggest solutions, please?
Very simply put: The "LAST" is the reason. It doesn't "copy" the value though. It aggregates the values to the last existing value along the dimensional hierarchy specified.
The question is: What SHOULD that Saldo show? What is the real business rule?
Also, lastly: using technical column names and ALL UPPER CASE COLUMN NAMES in the BMM layer shouldn't be done. The names should be user-focused, readable, and pretty. Otherwise everybody has to go and change them 50 times over and over in the front end.
It's been a year since I posted this question, but a fix for this incorrect representation of the data was added today. In the previous version of the RPD we used an alternative workaround: we created two measure columns for saldo (saldo_year and saldo_month), set their levels to year and month respectively, and used both in an analysis. This was a temporary solution until we built the second version of our RPD; we realized the structure of the old one wasn't completely correct, and it was easier and less time-consuming to build a new one from the ground up than to fix the old one.
So, as #Chris mentioned, it was all about a correct time dimension and hierarchies. We thought we had created it with all requirements met, but recently we got the same problem in our analyses. Then we figured out that we hadn't set the id columns as primary keys in the month and quarter logical levels. After that, we got the data we wanted. If anybody faces this kind of problem, the first thing to check in the RPD is how the time dimension and hierarchy are defined, and how the logical levels, primary keys, and chronological keys are set in the hierarchy.
I'm a bit new to BI development/data warehousing, but am facing the old Slowly Changing Dimensions dilemma. I've read a lot about the types and the theory, but have found little about what I view as the most common SELECT queries against these implementations.
I'll keep my example simple. Say you have four sales regions: East, West, North, and South. You have a group of salespeople who make daily sales and (maybe once a year) get reassigned to a new region.
So you'll have raw data like the following:
name; sales; revenue; date
John Smith; 10; 5400; 2015-02-17
You have data like this every day.
You may also have a dimensional table like the following, initially:
name; region
John Smith; East
Nancy Ray; West
Claire Faust; North
So the sales director wants to know the monthly sales revenue for the East region for May 2015. You would execute a query:
SELECT region, month(date), sum(revenue)
from Fact_Table inner join Dim_Table on Fact_Table.name = Dim_Table.name
where region = 'East' and date between ....
group by region, month(date)
You get the idea. Let's ignore that I'm using natural keys instead of surrogate integer keys; I'd clearly use surrogate keys.
Now, obviously, salespeople may move regions mid-year. Or mid-month. So you have to implement an SCD type in order to run this query correctly. To me, Type 2 makes the most sense. So say you implement that. Say John Smith changed from the East region to the West region on May 15, 2015. You implement the following table:
name; region; start_date; end_date
John Smith; East; 2015-01-01; 2015-05-15
John Smith; West; 2015-05-15; 9999-12-31
Now the sales director asks the same question. What is the total sales revenue for the East for May 2015? Or moreover, show me the totals by region by month for the whole year. How would you structure the query?
SELECT region, month(date), sum(revenue)
from Fact_Table inner join Dim_Table
on Fact_Table.name = Dim_Table.name
and date between start_date and end_date
group by region, month(date)
Would that give the correct results? I guess it might. My question may be more along the lines of: okay, now assume you have 1 million records in the Fact table. Would this inner join be grossly inefficient, or is there a faster way to achieve this result?
Would it make more sense to write the SCD attribute (like region) directly into a 'denormalized' Fact table, and when the dimension changes, perhaps update a week or two's worth of Fact records' regions retroactively?
Your concept is correct if your business requirement has a hierarchy of Region->Seller, as shown in your example.
The performance of your current query may be challenging, but it will be improved by the use of appropriate dimension keys and attributes.
Use a date dimension hierarchy that includes date->Month, and you'll be able to avoid the range query.
Use integer surrogate keys in both dimensions and your indexing performance will improve.
One million rows is tiny, you won't have performance problems on any competent DBMS :)
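To make that concrete, here's a hedged sketch (all table and column names below are invented): if each fact row stores the surrogate key of the salesperson dimension row that was in effect on the sale date, plus a date dimension key, the monthly rollup needs no range join at all:

SELECT sp.region, d.month_number, sum(f.revenue)
from fact_sales f
inner join dim_salesperson sp on f.salesperson_key = sp.salesperson_key   -- SCD2 row current on the sale date
inner join dim_date d on f.date_key = d.date_key
where d.year_number = 2015
group by sp.region, d.month_number

The between logic runs once, at ETL time, when the fact row is assigned its salesperson_key; the reporting query is then a plain equi-join.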
I'm building a table that contains about 400k rows of a messaging app's data.
The current table's columns look something like this:
message_id (int)| sender_userid (int)| other_col (string)| other_col2 (int)| create_dt (timestamp)
A lot of the queries I'll be running in the future will rely on a where clause involving the create_dt column. Since I expect this table to grow, I would like to try to optimize it right now. I'm aware that partitioning is one way, but when I partition based on create_dt the result is too many partitions, since I have every single date going back to Nov 2013.
Is there a way to partition by a range of dates instead? How about a partition for every 3 months? Or even every month? If this is possible, could I end up with too many partitions in the future, making it inefficient? What are some other possible partitioning methods?
I've also read about bucketing, but as far as I'm aware that's only useful if you're doing joins on the column the buckets are based on. I would most likely be doing joins only on the sender_userid (int) column.
Thanks!
I think this might be a case of premature optimization. I'm not sure what your definition of "too many partitions" is, but we have a similar use case. Our tables are partitioned by date and customer column. We have data that spans back to Mar 2013. This created approximately 160k+ partitions. We also use a filter on date and we haven't seen any performance problems with this schema.
On a side note, Hive is getting better at scaling up to 100s of thousands of partitions and tables.
On another side note, I'm curious as to why you're using Hive in the first place for this. 400k rows is a tiny amount of data and is not really suited for Hive.
Check out Hive's built-in UDFs. With the right combination of them you can achieve what you want. Here's an example that partitions by month (it produces a "YEAR-MONTH" string you can use as the partition column value):
select concat(cast(year(to_date(create_dt)) as string),'-',cast(month(to_date(create_dt)) as string))
But when partitioning on dates it is usually useful to have multiple levels of the date dimension, so in this case you should have two partition columns, the first for year and the second for month:
select year(to_date(create_dt)),month(to_date(create_dt))
Keep in mind that timestamps and dates are often stored as strings in Hive, and that functions like month() and year() return integers. You can use simple arithmetic on those values to figure out the right partition.
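Putting that together, a sketch of a year/month-partitioned table loaded with dynamic partitioning (the table names messages and messages_part are assumptions; only message_id, sender_userid, and create_dt come from the question):

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE messages_part (
    message_id    INT,
    sender_userid INT,
    create_dt     TIMESTAMP
)
PARTITIONED BY (yr INT, mth INT);

INSERT OVERWRITE TABLE messages_part PARTITION (yr, mth)
SELECT message_id, sender_userid, create_dt,
       year(to_date(create_dt))  AS yr,
       month(to_date(create_dt)) AS mth
FROM messages;

-- queries that filter on create_dt can also filter on yr/mth so Hive prunes partitions
SELECT count(*) FROM messages_part WHERE yr = 2014 AND mth = 11;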
I will try to explain what I want to accomplish. I am looking for an algorithm or approach, not the actual implementation in my specific system.
I have a table with actuals (incoming customer requests) on a daily basis. These actuals need to be "copied" into the next year, where they will be used as a basis for planning the amount of requests in the future.
The smallest timespan for planning, on a technical basis, is a "period", which consists of at least one day. A period always breaks at the end of a week or at the end of a month. This means that if a week falls partly in May and partly in June, it is split into two periods.
Here's an example:
2010-05-24 - 2010-05-30 Week 21 | Period_Id 123
2010-05-31 - 2010-05-31 Week 22 | Period_Id 124
2010-06-01 - 2010-06-06 Week 22 | Period_Id 125
We did this to reduce the amount of data, because we have a few thousand items that each have 365 daily values. For planning, this is reduced to "a few thousand x 65" (or whatever the period count is per year). I can aggregate to a month, or a week, by combining all the periods that belong to that month or week. The important thing is that I could still work with daily values, find the corresponding period, and add them there if necessary.
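To illustrate the roll-up (with simplified, made-up names rather than our real schema), aggregating periods to months is just a grouped sum over a period-to-month mapping:

select p.year, p.month, sum(a.requests)
from actuals_by_period a
inner join period p on a.period_id = p.period_id
group by p.year, p.month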
What I need is an approach for aggregating the actuals for every (working) day, week, or month into next year's equivalent period. My requirements are not fixed here. The actuals have a certain distribution, because there are certain deadlines and habits that are reflected in the data. I would like to preserve this as far as possible, but planning is never completely accurate, so I can make a compromise here.
Don't know if this is what you're looking for, but this is a strategy for calculating the forecasts using flexible periods:
First define a mapping for each day in next year to the corresponding day in this year. Then when you need a forecast for period x you take all days in that period and sum the actuals for the matching days.
With this you can precalculate every week/month but create new forecasts if the contents of periods change.
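A hedged sketch of that strategy in SQL (all table and column names here are assumptions, not from the question): day_map holds one row per future date together with its matching date from this year, and the forecast for a period is the sum of actuals over the mapped days.

select p.period_id, sum(a.requests) as forecast_requests
from period p
inner join day_map m on m.next_year_date between p.start_date and p.end_date
inner join actuals a on a.actual_date = m.this_year_date
group by p.period_id

If the period definitions change, only this query needs to be rerun; the day-level mapping stays the same.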
Map weeks to weeks. The first full week of this year to the first full week of the next. Don't worry about "periods" and aggregation; they are irrelevant.
Where a missing holiday leaves a hole in the data, just take the values for the same day of the previous week or the next week, and do the same at the beginning/end of the year.
Now, for each day of the week, combine the results across the year and look for events more than, say, two standard deviations from the mean (if you don't know what that means, skip this step), and look for correlations with known events like holidays. If a holiday doesn't show an effect in this test, ignore it. If you find an effect, shift it to compensate for the holiday falling on a different date next year. Don't worry about higher-order effects; you don't have enough data to pin them down.
Now draw in periods wherever you like and aggregate all you want.
Don't make any promises about the accuracy of these predictions; there's no way to know it. Don't worry about whether this is the best possible way; it isn't, but it's as good as any you're likely to find. You can spend as much additional time and effort fine-tuning this as you wish; it might raise expectations, but it's not likely to make the results much more accurate, and it's about as likely to make them worse.
There is no a priori way to answer that question. You have to look at your data and decide which parameters are important (day of week, week number, month, season, temperature outside?) based on the results.
For example, if many of your customers are Jewish or Muslim, then the Gregorian calendar, ISO week numbers, and all that won't help you much, because Jewish and Muslim holidays (and therefore user behaviour) are determined by other calendars.
Another example: trying to predict iPhone search volume from last year's searches doesn't sound like a good idea. It seems that the important timescales are both much longer than a year (the technology becoming mainstream over the years) and much shorter than a year (specific events that affect us for days or weeks).