Unable to use the autoregressor function in Vertica with my own data

I am new to both data science and Vertica. I am following this example on the autoregressor from the Vertica documentation:
https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/AnalyzingData/MachineLearning/TimeSeries/AutoregressorExample.htm?tocpath=Analyzing%20Data%7CMachine%20Learning%20for%20Predictive%20Analytics%7CRegression%20Algorithms%7C_____1
If I understood correctly, I need to provide training data to the model and then use the model to make predictions.
The training data looks like this (the day of the year and the temperature on that day):
select * from temp_data limit 10;
time | Temperature
---------------------+-------------
1981-01-01 00:00:00 | 20.7
1981-01-02 00:00:00 | 17.9
1981-01-03 00:00:00 | 18.8
1981-01-04 00:00:00 | 14.6
1981-01-05 00:00:00 | 15.8
1981-01-06 00:00:00 | 15.8
1981-01-07 00:00:00 | 15.8
1981-01-08 00:00:00 | 17.4
1981-01-09 00:00:00 | 21.8
1981-01-10 00:00:00 | 20
(10 rows)
I create the model with:
SELECT AUTOREGRESSOR('AR_temperature', 'temp_data', 'Temperature', 'time' USING PARAMETERS p=3);
Question 1 - The example uses the temp_data table for predictions as well. Why? Isn't temp_data used for training, and shouldn't I use test data which doesn't have a Temperature column?
SELECT PREDICT_AUTOREGRESSOR(Temperature USING PARAMETERS model_name='AR_temperature', npredictions=10) OVER(ORDER BY time) FROM temp_data; <-- why does the example use temp_data?
Question 2 - I created my own table with a single day. When I use it to make a prediction, I get an error:
select * from my_temperature_data;
time | temperature
---------------------+-------------
2021-12-12 00:00:00 |
select predict_autoregressor(temperature using parameters model_name='ar_temperature') over(order by time) from my_temperature_data;
ERROR 5861: Error calling processPartition() in User Function predict_autoregressor at [src/Autoregression/PredictAR.cpp:149], error code: 0, message: One or more elements in the input data is invalid.
Question 3 - When I made my own table, I had to create it with both time and temperature columns. Having just time didn't work (I got an error). Why?

Please find the answers below.
1. You are right; test data should have been used.
2. As mentioned in the documentation, the OVER clause with the timestamp column is mandatory, because the previous timestamps are taken into consideration. Since your query doesn't have an OVER() clause, it failed with that error.
https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/MachineLearning/PREDICT_AUTOREGRESSOR.htm
3. As mentioned in 2, you need to pass both temperature and time for PREDICT_AUTOREGRESSOR to work.
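For example, one way to keep training and prediction data separate might look like this (a minimal sketch; the cutoff date and the temp_train/temp_test table names are my own assumptions, not from the docs):
-- Hold out the most recent rows as test data (the cutoff date here is arbitrary).
CREATE TABLE temp_train AS SELECT * FROM temp_data WHERE time < '1990-01-01';
CREATE TABLE temp_test AS SELECT * FROM temp_data WHERE time >= '1990-01-01';
-- Train only on the earlier slice.
SELECT AUTOREGRESSOR('AR_temperature', 'temp_train', 'Temperature', 'time' USING PARAMETERS p=3);
-- Predict on the held-out slice; the function still reads both the timestamp and the
-- observed Temperature values, because the autoregression needs the p previous
-- observations to compute each prediction.
SELECT PREDICT_AUTOREGRESSOR(Temperature USING PARAMETERS model_name='AR_temperature', npredictions=10) OVER(ORDER BY time) FROM temp_test;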

Related

Cannot filter null Datetime values

I have a problem that is driving me crazy. I have to query an Oracle view that returns some DATETIME values.
The incredible problem is that even if I put "IS NOT NULL" in the WHERE clause, and even if I use NVL(FECHA_HASTA, FECHA_DESDE), I'm still getting null values! How is that possible?
This is the query:
SELECT CUIL as Cuil,
COD_TIPO_CAUSAL as CodTipoCausal,
COD_CONVENIO as CodConvenio,
FECHA_DESDE as FechaDesde,
NVL(FECHA_HASTA, FECHA_DESDE) as FechaHasta
FROM ORGANISMO.VCAUSAL_AUSENCIA
WHERE FECHA_HASTA IS NOT NULL
AND FECHA_HASTA > (SELECT SYSDATE - 180 FROM SYS.DUAL)
AND CUIL IN (SELECT CUIL FROM ORGANISMO.VEMPLEADO WHERE FECHA_EGRESO IS NULL OR FECHA_EGRESO > (SELECT SYSDATE FROM SYS.DUAL))
EDIT:
Here is the output with dump(fecha_hasta, 1016) added:
The dumped values show that the data is corrupt. The internal date format is well-known:
byte 1 - century (excess 100)
byte 2 - year (excess 100)
byte 3 - month
byte 4 - day
byte 5 - hour (excess 1)
byte 6 - minute (excess 1)
byte 7 - seconds (excess 1)
so the fourth byte in the two values that SQL Developer is reporting as null (even though they clearly are not actually null) should not be zero, as there is no day zero.
Based on those rules, 79,9d,2,0,18,3c,3c in hex, which is 121,157,2,0,24,60,60 in decimal, should convert as:
century: 121 - 100 = 21
year: 157 - 100 = 57
month: 2
day: 0
hour: 24 - 1 = 23
minute: 60 - 1 = 59
second: 60 - 1 = 59
or 2157-02-00 23:59:59. Similarly 78,b8,1,0,18,3c,3c converts to 2084-01-00 23:59:59.
Version 18.3 of SQL Developer displays those values, in both the script output and query results windows, as the previous day:
DT DUMPED
------------------- -----------------------------------
01-07-2020 23:59:59 Typ=12 Len=7: 78,78,7,1,18,3c,3c
31-01-2157 23:59:59 Typ=12 Len=7: 79,9d,2,0,18,3c,3c
31-12-2083 23:59:59 Typ=12 Len=7: 78,b8,1,0,18,3c,3c
01-07-2018 00:00:00 Typ=12 Len=7: 78,76,7,1,1,1,1
whereas db<>fiddle shows the zero-day values.
So, since they are not actually null, it's reasonable that IS NOT NULL and NVL() didn't affect them, and then it's up to the client or application how to present them.
The real issue is that you seem to have corrupted data in the tables underlying the view you're querying, so that needs to be investigated and fixed - assuming the invalid values can be safely identified and you can find out what they should have been in the first place, which might be a struggle. Just filtering them out, either as part of the view or in your query, won't be simple, though - unless you can filter out dates in the future. And that assumes all the corruption is this obvious and pushes dates into the future; on some level you have to question the validity of all of those dates... there could be much subtler corruption that looks OK.
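If it helps, here is a sketch of one way to locate the suspect rows by checking the internal day byte (the fourth byte) for zero; the DUMP-based predicate is my own assumption, not something from the original answer:
SELECT CUIL, FECHA_HASTA, DUMP(FECHA_HASTA, 1016) AS DUMPED
FROM ORGANISMO.VCAUSAL_AUSENCIA
WHERE FECHA_HASTA IS NOT NULL
-- DUMP(..., 10) lists the 7 internal bytes in decimal; the 4th token is the day byte,
-- which should never be zero for a valid DATE
AND TO_NUMBER(REGEXP_SUBSTR(DUMP(FECHA_HASTA, 10), '[^,]+', 1, 4)) = 0;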
And then whatever process or tool caused the corruption needs to be tracked down and fixed so it doesn't happen again. Lots of things can cause corruption of course, but I believe imp used to have a bug that could corrupt dates and numbers, and OCI programs can too.

How to complete report derived from another query with zeros or nulls

So, I'm really having a hard time with a report.
I need a report grouped by year. For example, we want to show how many cars are sold per year. So the report has a Year column, the type/model of the car, and the quantity. However, I also want to show a row with a null/zero value, so that even when no car of a specific type was sold, the row still shows up, but with 0.
The problem is that this query is based on a lot of views which show each transaction. So my actual query works fine, except that it doesn't show a type when none of that type was sold in a year.
When I pivot this report using Oracle APEX, it almost works. It shows all the types, but if I filter by year, then they are gone.
I have all the years I need, but I don't have data for every year. I take all the data from multiple views with the specifics of the sales. Some of the models/types were not sold in some years, so they don't show up in the report, which is expected. For example:
What I get is
//YEAR - MODEL - QUANTITY //
2018 - MODEL 1 - 300
2018 - MODEL 2 - 12
2017 - MODEL 1 - 12
2017 - MODEL 2 - 33
2017 - MODEL 3 - 22
What I want
//YEAR - MODEL - QUANTITY //
2018 - MODEL 1 - 300
2018 - MODEL 2 - 12
2018 - MODEL 3 - 0
2017 - MODEL 1 - 12
2017 - MODEL 2 - 33
2017 - MODEL 3 - 22
Any ideas?
You can conjure rows, and outer join to them.
with years as (
select add_months(date '1980-1-1', (rownum-1)*12) dt
from dual
connect by level < 5
)
select y.dt, count(e.hiredate)
from scott.emp e
right outer join years y
on y.dt = trunc(e.hiredate,'yy')
group by y.dt
DT COUNT(E.HIREDATE)
------------------- -----------------
01-01-1982 00:00:00 1
01-01-1983 00:00:00 0
01-01-1981 00:00:00 10
01-01-1980 00:00:00 1
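Applied to the year/model report in the question, the same idea might look something like this (a sketch only; the car_sales table and its column names are assumptions, since the actual views aren't shown):
with years as (
-- conjure the reporting years (2017 and 2018 here)
select 2016 + level as yr from dual connect by level <= 2
),
models as (
select distinct car_model from car_sales
)
select y.yr, m.car_model, count(s.sale_id) as quantity
from years y
cross join models m
left outer join car_sales s
on s.car_model = m.car_model
and extract(year from s.sale_date) = y.yr
group by y.yr, m.car_model
order by y.yr desc, m.car_model;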

In HiveQL, what is the most elegant/performant way of calculating an average value if some of the data is implicitly not present?

In HiveQL, what is the most elegant and performant way of calculating an average value when there are 'gaps' in the data, with implicit repeated values between them? i.e. consider a table with the following data:
+----------+----------+----------+
| Employee | Date | Balance |
+----------+----------+----------+
| John | 20181029 | 1800.2 |
| John | 20181105 | 2937.74 |
| John | 20181106 | 3000 |
| John | 20181110 | 1500 |
| John | 20181119 | -755.5 |
| John | 20181120 | -800 |
| John | 20181121 | 1200 |
| John | 20181122 | -400 |
| John | 20181123 | -900 |
| John | 20181202 | -1300 |
+----------+----------+----------+
If I try to calculate a simple average of the November rows, it returns ~722.78, but the average should take into account that the days that are not shown have the same balance as the previous record. In the above data, John had 1800.2 between 20181101 and 20181104, for example.
Assuming that the table always has exactly one row for each date/balance, and given that I cannot change how this data is stored (and probably shouldn't, since it would be a waste of storage to write rows for days with unchanged balances), I've been tinkering with getting the average from a select with subqueries for all the days in the queried month, returning NULL for the absent days, and then using CASE to pull the balance from the previous available date in reverse order. All of this just to avoid writing temporary tables.
Step 1: Original Data
The 1st step is to recreate a table with the original data. Let's say the original table is called daily_employee_balance.
daily_employee_balance
use default;
drop table if exists daily_employee_balance;
create table if not exists daily_employee_balance (
employee_id string,
employee string,
iso_date date,
balance double
);
Insert Sample Data in original table daily_employee_balance
insert into table daily_employee_balance values
('103','John','2018-10-25',1800.2),
('103','John','2018-10-29',1125.7),
('103','John','2018-11-05',2937.74),
('103','John','2018-11-06',3000),
('103','John','2018-11-10',1500),
('103','John','2018-11-19',-755.5),
('103','John','2018-11-20',-800),
('103','John','2018-11-21',1200),
('103','John','2018-11-22',-400),
('103','John','2018-11-23',-900),
('103','John','2018-12-02',-1300);
Step 2: Dimension Table
You will need a dimension table containing a calendar (a table with all the possible dates); call it dimension_date. Having a calendar table is a normal industry standard, and you can probably download sample data for one from the internet.
use default;
drop table if exists dimension_date;
create external table dimension_date(
date_id int,
iso_date string,
year string,
month string,
month_desc string,
end_of_month_flg string
);
Insert some sample data for entire month of Nov 2018:
insert into table dimension_date values
(6880,'2018-11-01','2018','2018-11','November','N'),
(6881,'2018-11-02','2018','2018-11','November','N'),
(6882,'2018-11-03','2018','2018-11','November','N'),
(6883,'2018-11-04','2018','2018-11','November','N'),
(6884,'2018-11-05','2018','2018-11','November','N'),
(6885,'2018-11-06','2018','2018-11','November','N'),
(6886,'2018-11-07','2018','2018-11','November','N'),
(6887,'2018-11-08','2018','2018-11','November','N'),
(6888,'2018-11-09','2018','2018-11','November','N'),
(6889,'2018-11-10','2018','2018-11','November','N'),
(6890,'2018-11-11','2018','2018-11','November','N'),
(6891,'2018-11-12','2018','2018-11','November','N'),
(6892,'2018-11-13','2018','2018-11','November','N'),
(6893,'2018-11-14','2018','2018-11','November','N'),
(6894,'2018-11-15','2018','2018-11','November','N'),
(6895,'2018-11-16','2018','2018-11','November','N'),
(6896,'2018-11-17','2018','2018-11','November','N'),
(6897,'2018-11-18','2018','2018-11','November','N'),
(6898,'2018-11-19','2018','2018-11','November','N'),
(6899,'2018-11-20','2018','2018-11','November','N'),
(6900,'2018-11-21','2018','2018-11','November','N'),
(6901,'2018-11-22','2018','2018-11','November','N'),
(6902,'2018-11-23','2018','2018-11','November','N'),
(6903,'2018-11-24','2018','2018-11','November','N'),
(6904,'2018-11-25','2018','2018-11','November','N'),
(6905,'2018-11-26','2018','2018-11','November','N'),
(6906,'2018-11-27','2018','2018-11','November','N'),
(6907,'2018-11-28','2018','2018-11','November','N'),
(6908,'2018-11-29','2018','2018-11','November','N'),
(6909,'2018-11-30','2018','2018-11','November','Y');
Step 3: Fact Table
Create a fact table from the original table. In normal practice, you ingest the data into HDFS/Hive, then process the raw data and create a table of historical data into which you keep inserting incrementally. You can look into data warehousing for the proper definition, but I call this a fact table - f_employee_balance.
This will re-create the original table with the missing dates filled in and the missing balances populated with the earlier known balance.
--inner query to get all the possible dates
--outer self join query will populate the missing dates and balance
drop table if exists f_employee_balance;
create table f_employee_balance
stored as orc tblproperties ("orc.compress"="SNAPPY") as
select q1.employee_id, q1.iso_date,
nvl(last_value(r.balance, true) --initial dates to be populated with 0 balance
over (partition by q1.employee_id order by q1.iso_date rows between unbounded preceding and current row),0) as balance,
month, year from (
select distinct
r.employee_id,
d.iso_date as iso_date,
d.month, d.year
from daily_employee_balance r, dimension_date d )q1
left outer join daily_employee_balance r on
(q1.employee_id = r.employee_id) and (q1.iso_date = r.iso_date);
Step 4: Analytics
The query below will give you the true average by month:
select employee_id, monthly_avg, month, year from (
select employee_id,
row_number() over (partition by employee_id,year,month) as row_num,
avg(balance) over (partition by employee_id,year,month) as monthly_avg, month, year from
f_employee_balance)q1
where row_num = 1
order by year, month;
Step 5: Conclusion
You could have just combined steps 3 and 4 together; this would save you from creating the extra table. When you are in the big data world, you don't worry much about wasting extra disk space or development time; you can easily add another disk or node and automate the process using workflows. For more information, look into data warehousing concepts and Hive analytical queries.
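For what it's worth, here is a rough sketch of steps 3 and 4 folded into one query, under the same assumed schema as above (untested):
select employee_id, avg(balance) as monthly_avg, month, year
from (
select q1.employee_id, q1.iso_date, q1.month, q1.year,
-- carry the last known balance forward; initial dates fall back to 0
nvl(last_value(r.balance, true)
over (partition by q1.employee_id order by q1.iso_date
rows between unbounded preceding and current row), 0) as balance
from (
select distinct r.employee_id, d.iso_date as iso_date, d.month, d.year
from daily_employee_balance r, dimension_date d
) q1
left outer join daily_employee_balance r
on (q1.employee_id = r.employee_id) and (q1.iso_date = r.iso_date)
) filled
group by employee_id, month, year
order by year, month;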

Populating future dates in oracle table

I have attached the tables, product and date.
Let's say my product table has data up to yesterday, i.e. 05/31/2018.
I am trying to populate a season table where I can do the calculation up to 5/31/2018, where value = (value on the same day last year / value on the previous day last year) with Ch(a) and P(pen); however, the data set only goes up to 5/31/2018. My aim is to get the data/calculation for 06/01/2018 through 12/31/2018 as well. How do I get the data for these future dates, given that the data needed to calculate them is already in the product table?
I'd appreciate it if you can help.
Thank you!
You can generate a series of dates using CONNECT BY subquery like this one:
SELECT Start_date + level - 1 as my_date
FROM (
SELECT date '2018-01-01' as Start_date FROM dual
)
CONNECT BY Start_date + level - 1 <= date '2018-01-05'
Demo: http://www.sqlfiddle.com/#!4/072359/1
| MY_DATE |
|----------------------|
| 2018-01-01T00:00:00Z |
| 2018-01-02T00:00:00Z |
| 2018-01-03T00:00:00Z |
| 2018-01-04T00:00:00Z |
| 2018-01-05T00:00:00Z |
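To extend this to the question, one could outer join the generated calendar back to the product table to compute the season values for the future dates. The following is only a sketch; the product table name, its prod_date/value columns, and the NULLIF guard are assumptions, since the attached tables aren't reproduced here:
WITH future_dates AS (
SELECT date '2018-06-01' + level - 1 AS my_date
FROM dual
CONNECT BY date '2018-06-01' + level - 1 <= date '2018-12-31'
)
SELECT f.my_date,
-- value on the same day last year / value on the previous day last year
ly.value / NULLIF(lyp.value, 0) AS season_value
FROM future_dates f
LEFT JOIN product ly ON ly.prod_date = ADD_MONTHS(f.my_date, -12)
LEFT JOIN product lyp ON lyp.prod_date = ADD_MONTHS(f.my_date, -12) - 1
ORDER BY f.my_date;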

Convert 5 digit date in oracle

I have an Oracle database table with 5 digit Julian dates that I need to convert to date time format.
Sample data
Source -> Actual date
40786 -> 2015-09-01
40785 -> 2015-08-31
First I tried the following
SELECT to_char(to_date(to_char(40786), 'J'),'DD-MM-YYYY'),
to_char(to_date(to_char(40785), 'J'),'DD-MM-YYYY')
FROM dual;
40786 -> 4601-09-01
40785 -> 4601-08-31
Since it is wrong I calculated the difference in days (2416481) and formulated the following query
SELECT to_char(to_date(to_char(40786 + 2416481 ), 'J'),'DD-MM-YYYY'),
to_char(to_date(to_char(40785 + 2416481), 'J'),'DD-MM-YYYY')
FROM dual;
40786 -> 2015-09-01
40785 -> 2015-08-31
It is correct for the above two days, but the table has history going back to 2010. Will the above adjustment hold correct for the full history, i.e. across weekends, leap years, etc.?
Many thanks.
V
Your problem is that the column is not stored as a Julian date, so asking whether the Julian conversion will work is moot.
It seems that the dates are based on 1904-01-01 (= day zero).
So the conversion is as follows:
select to_date('1904-01-01','yyyy-mm-dd') + 40786 as dt from dual;
DT
----------
01.09.2015
select to_date('1904-01-01','yyyy-mm-dd') + 40785 as dt from dual;
DT
----------
31.08.2015
Whether it will really work can only be answered by the code in your GUI's conversion routine.
And yes, if you trust in rational software development, you can expect it to work (for dates within, say, a +/- 100 year range).
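If that epoch assumption holds, converting the stored column is just the same addition (a sketch; the table and column names here are made up):
-- add the serial day count to the 1904-01-01 epoch for every row
select serial_day, date '1904-01-01' + serial_day as actual_date
from my_history_table
order by serial_day;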
