Read in the following table format: no tablehead, [date] colum1=xy colum2=abs colum4=iy - oracle

I have a file with data in the following format:
no table head
[date] colum1=xy colum2=abc colum4=xyz
[date] colum1=zz colum3=234 colum4=abc
The problem is that not every dataset has all of the variables, and in that case they're not separated by something like two tabs. Therefore I need to read the file using the column name that precedes each data point. I'm using an Oracle database, but I can also use SAS.
Thanks in advance.

Just use named input mode: the (=) modifier applied to the variable list tells SAS to read each value from its name=value pair, so it does not matter which variables are present on a given line or in which order they appear.
data want;
  length date $10 column1-column4 $20;
  input date (column1-column4) (=);
cards;
[date] column1=xy column2=abc column4=xyz
[date] column1=zz column3=234 column4=abc
;
Results:

Obs    date      column1    column2    column3    column4
  1    [date]    xy         abc                   xyz
  2    [date]    zz                    234        abc
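For the Oracle side, one possible approach (a sketch, not from the original thread) is to load each raw line into a single-column staging table, e.g. via an external table or SQL*Loader, and then pick the values out by name with REGEXP_SUBSTR. The raw_lines table and its line column below are assumed names.
-- Assumed staging table: one raw file line per row.
CREATE TABLE raw_lines (line VARCHAR2(4000));

-- Pull each value out by the name that precedes it; rows that lack a given
-- name=value pair simply get NULL for that column.
SELECT REGEXP_SUBSTR(line, '^\S+')                        AS file_date,
       REGEXP_SUBSTR(line, 'colum1=(\S+)', 1, 1, NULL, 1) AS colum1,
       REGEXP_SUBSTR(line, 'colum2=(\S+)', 1, 1, NULL, 1) AS colum2,
       REGEXP_SUBSTR(line, 'colum3=(\S+)', 1, 1, NULL, 1) AS colum3,
       REGEXP_SUBSTR(line, 'colum4=(\S+)', 1, 1, NULL, 1) AS colum4
  FROM raw_lines;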

Related

LATERAL VIEW explode function in Hive

I am trying to export data from Excel into a Hive table. While doing so, I have a column 'ABC' which has values like '1,2,3'.
I used the LATERAL VIEW explode function, but it does not do anything to my data.
Following is my code snippet:
CREATE TABLE table_name
(
id string,
brand string,
data_name string,
name string,
address string,
country string,
flag string,
sample_list array<string> )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
;
LOAD DATA LOCAL INPATH 'location' INTO TABLE table_name;
output sample:
id brand data_name name address country flag sample_list
19 1 ABC SQL ABC Cornstarch IN 1 ["[1,2,3]"]
Then I do:
select * from franchise_unsupress LATERAL VIEW explode(SEslist) SEslist as final_SE;
output sample:
id brand data_name name address country flag sample_list
19 1 ABC SQL ABC Cornstarch IN 1 [1,2,3]
I also tried:
select * from franchise_unsupress lateral view explode(split(SEslist,',')) SEslist AS final_SE ;
but got an error:
FAILED: ClassCastException org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
whereas what I need is:
id brand data_name name address country flag sample_list
19 1 ABC SQL ABC Cornstarch IN 1 1
19 1 ABC SQL ABC Cornstarch IN 1 2
19 1 ABC SQL ABC Cornstarch IN 1 3
Any help will be greatly appreciated! Thank you.
The problem is that the array is being parsed incorrectly and loaded as a single-element array, ["[1,2,3]"]. It should be [1,2,3] or ["1","2","3"] (if it is array<string>).
When creating the table, specify a delimiter for collections:
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
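Applied to the CREATE TABLE from the question, that would look roughly like this (a sketch; it assumes the column really is array<string> and that the list items in the file are comma-separated):
CREATE TABLE table_name
(
  id string,
  brand string,
  data_name string,
  name string,
  address string,
  country string,
  flag string,
  sample_list array<string>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','   -- splits 1,2,3 into separate array elements
STORED AS TEXTFILE;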
I wanted to provide my own answer.
The issue was with the input that was being provided: my input txt file had [] around the values. Once those were removed, it worked.

How to FLATTEN and get the expected output shown below from Pig after GROUP BY

Sample Data:
ID       marks    date
12345    12       20210204
12345    13       20210204
12345    2        20210204
Input:
(12345,{(12345,12,20210204),(12345,13,20210204),(12345,2,20210204)})
Output needed:
(12345,27,20210204)
The second element is the aggregated value.
Help is appreciated.
output = FOREACH input GENERATE
group AS ID,
SUM(sample.marks) AS mark_sum,
MIN(sample.date) AS first_date;
You may need to tweak this based on your relation and field names. You might also want to group by the date field as well, if the dates are all the same.

Hive - Remove duplicates, keeping newest record - all of it [duplicate]

There have been a few questions like this, with no answer, like this one here.
I thought I would post another in hopes of getting one.
I have a hive table with duplicate rows. Consider the following example:
ID      Date        value1    value2
1001    20160101    alpha     beta
1001    20160201    delta     gamma
1001    20160115    rho       omega
1002    20160101    able      charlie
When complete, I only want two records. Specifically, these two:
ID      Date        value1    value2
1001    20160201    delta     gamma
1002    20160101    able      charlie
Why those two? For the ID=1001, I want the latest date and the data that is in that row with it. For the ID=1002, really the same answer, but the two records with that ID are complete duplicates, and I only want one.
So, any suggestions on how to do this? A simple GROUP BY on the ID with MAX(Date) won't work, as that ignores the other columns. I can't wrap MAX() around those columns either, because it takes the maximum of each column independently across all of that ID's records (it would pull 'rho' from an older record), which is not what I want.
I hope my explanation is clear, and I appreciate any insight.
Thank you
Try this:
WITH temp_cte AS (
SELECT mt.ID AS ID
, mt.Date AS Date
, mt.value1 AS value1
, mt.value2 AS value2
, ROW_NUMBER() OVER (PARTITION BY mt.ID ORDER BY mt.Date DESC) AS row_num
FROM my_table mt
)
SELECT tc.ID AS ID
, tc.Date AS Date
, tc.value1 AS value1
, tc.value2 AS value2
FROM temp_cte tc
WHERE tc.row_num = 1
;
Or you can take the MAX(Date) per ID and join the table back to itself on ID and that max date. HTH.
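A minimal sketch of that MAX()-and-self-join variant, reusing the my_table name from the CTE above (DISTINCT collapses fully duplicated rows such as the two 1002 records; a genuine tie on the max date would still return more than one row per ID):
SELECT DISTINCT mt.ID, mt.Date, mt.value1, mt.value2
FROM my_table mt
JOIN (
    SELECT ID, MAX(Date) AS max_date
    FROM my_table
    GROUP BY ID
) mx
  ON  mx.ID = mt.ID
  AND mx.max_date = mt.Date;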
Edit March 2022:
Since ROW_NUMBER numbers every row and we only care about the one row per ID with the max date, there is a better way to do this that I discovered: take the MAX() of a NAMED_STRUCT whose first field is the date. Hive compares structs field by field in order, so the maximum struct is the one with the latest Date, and it carries the other columns along with it.
WITH temp_cte AS (
SELECT mt.ID AS ID
, MAX(NAMED_STRUCT('Date', mt.Date, 'Value1', mt.value1, 'Value2', mt.Value2)) AS my_struct
FROM my_table mt
GROUP BY mt.ID
)
SELECT tt.ID AS ID
, tt.my_struct.Date AS Date
, tt.my_struct.Value1 AS Value1
, tt.my_struct.Value2 AS Value2
FROM temp_cte tt
;

oracle sqlldr time format

I'm using Oracle sqlldr (for bulk load operations), but I can't convert the datetime format in the first column:
File contents:
Jan 1 1900 11:36:56:000PM|968|409|198|33|30|45|19
Jan 1 1900 11:36:57:000PM|967|415|198|34|33|43|21
Jan 1 1900 11:36:59:000PM|966|427|197|34|33|40|19
Control file contents:
load data
infile '/home/bim/oraload/data/AERO.SONDAJ.samsun.txt'
append
into table AERO.SONDAJ
fields terminated by "|"
TRAILING NULLCOLS
(
refsaat date 'MON DD YYYY HH24:mi:ss', --not running
bsnsvy,
yuks,
sck,
nem,
isba,
rzgyon,
rzghiz
)
Try something like this. In order for this to work, the refsaat column should be a TIMESTAMP type and not the DATE data type; DATE does not store anything beyond seconds.
refsaat TIMESTAMP 'Mon DD YYYY HH:mi:ss:ff3PM'
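To sanity-check the mask before rerunning the load, you can try it against one of the sample values in plain SQL (this assumes an English NLS date language so that 'Jan' is recognized):
-- Quick check that the mask parses a sample value from the data file
SELECT TO_TIMESTAMP('Jan 1 1900 11:36:56:000PM', 'Mon DD YYYY HH:mi:ss:ff3PM') AS parsed
FROM dual;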

Grouping data by date ranges

How do I select a range of data depending on the date range?
I have this data in my payment table, with dates in dd/mm/yyyy format:
Id    Date         Amount
1     4/1/2011     300
2     10/1/2011    200
3     27/1/2011    100
4     4/2/2011     300
5     22/2/2011    400
6     1/3/2011     500
7     1/1/2012     600
The closing date is the 27th of every month, so I would like to group all the data from the 27th through the 26th of the next month together.
In other words, I would like the output to look like this:
Group 1
1 4/1/2011 300
2 10/1/2011 200
Group 2
1 27/1/2011 100
2 4/2/2011 300
3 22/2/2011 400
Group 3
1 1/3/2011 500
Group 4
1 1/1/2012 600
The context of your question isn't clear. Are you querying a database?
If so, you are asking about datetimes, but it seems you have the column in string format.
First of all, convert your data to a datetime data type (or some equivalent; which DB engine are you using?), and then use a grouping criterion like this:
GROUP BY datepart(month, dateadd(day, -26, [datefield])), DATEPART(year, dateadd(day, -26, [datefield]))
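If an aggregate per period is what you are after, the complete query might look like this (a sketch in SQL Server syntax, matching the datepart/dateadd functions above; the payment table and the datefield and Amount column names are assumptions):
SELECT DATEPART(year,  DATEADD(day, -26, [datefield])) AS period_year,
       DATEPART(month, DATEADD(day, -26, [datefield])) AS period_month,
       SUM(Amount)                                     AS total_amount
FROM payment
GROUP BY DATEPART(month, DATEADD(day, -26, [datefield])),
         DATEPART(year,  DATEADD(day, -26, [datefield]))
ORDER BY period_year, period_month;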
EDIT:
So, you are in Linq?
Different language, same logic:
.GroupBy(x => DateTime
.ParseExact(x.Date, "dd/MM/yyyy", CultureInfo.InvariantCulture) // assuming the date field is a string; note MM (months), not mm (minutes)
.AddDays(-26)
.ToString("yyyyMM"));
If you are going to do this frequently, it would be worth investing in a table that assigns a unique identifier to each month and the start and end dates:
CREATE TABLE MonthEndings
(
MonthID INTEGER NOT NULL PRIMARY KEY,
StartDate DATE NOT NULL,
EndDate DATE NOT NULL
);
INSERT INTO MonthEndings VALUES(201101, '27/12/2010', '26/01/2011');
INSERT INTO MonthEndings VALUES(201102, '27/01/2011', '26/02/2011');
INSERT INTO MonthEndings VALUES(201103, '27/02/2011', '26/03/2011');
INSERT INTO MonthEndings VALUES(201112, '27/11/2011', '26/12/2011');
INSERT INTO MonthEndings VALUES(201201, '27/12/2011', '26/01/2012');
You can then group accurately using:
SELECT M.MonthID, P.Id, P.Date, P.Amount
FROM Payments AS P
JOIN MonthEndings AS M ON P.Date BETWEEN M.StartDate and M.EndDate
ORDER BY M.MonthID, P.Date;
Any group headings etc are best handled out of the DBMS - the SQL gets you the data in the correct sequence, and the software retrieving the data presents it to the user.
If you can't translate SQL to LINQ, that makes two of us. Sorry, I have never used LINQ, so I've no idea what is involved.
SELECT *,
       CASE WHEN datepart(day, date) < 27 THEN datepart(month, date)
            ELSE datepart(month, date) % 12 + 1
       END AS group_name
FROM payment
