How to create a Spark RDD for a tree structure - hadoop

I have a file with multiple types of records, as below:
File header
Group header 01
Subgroup 01 s1 s2
Detail record 1 v1,v2,v3,v4
Detail record 2 v1,v2,v3,v4
Detail record 3 v1,v2,v3,v4
Subgroup 02
Detail record 21
Detail record 22
Subgroup 02 end
Group header 01 end
File header end
The file can have multiple groups and records. Each group header also has additional info.
Is there a way to create an RDD from this without preprocessing?
The objective is to be able to analyse/query the data in the file, e.g. the count of all v1 values in group 01.
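As a rough illustration only (assuming the layout shown above and a hypothetical HDFS path), one possible PySpark sketch reads each file whole with wholeTextFiles, so line order is preserved, and flattens the hierarchy while parsing:

from pyspark import SparkContext

sc = SparkContext(appName="tree-file")

def parse(path_and_text):
    # Walk the file top to bottom, remembering the current group and
    # subgroup, and emit one flat tuple per "Detail record" line.
    _, text = path_and_text
    group, subgroup = None, None
    for line in text.splitlines():
        if line.startswith("Group header") and not line.endswith("end"):
            group = line
        elif line.startswith("Subgroup") and not line.endswith("end"):
            subgroup = line
        elif line.startswith("Detail record"):
            yield (group, subgroup, line)

# wholeTextFiles keeps each file as a single string, so the header/detail
# nesting is not split across partitions. The path below is hypothetical.
records = sc.wholeTextFiles("hdfs:///data/tree_file.txt").flatMap(parse)

# Example query: count detail records under group 01.
print(records.filter(lambda r: r[0] is not None and "01" in r[0]).count())

This still does the parsing inside the Spark job rather than as a separate pass over the file; whether that counts as "without preprocessing" is a judgement call.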

Related

How to filter rows in a table based on values in another table in Power Query

I have two tables in Power Query.
Price table
Date Company Price
01/01/2000 A 10
01/02/2000 A 12
01/03/2000 A 15
01/01/2000 B 15
01/02/2000 B 85
01/03/2000 B 98
Size table
Date Company Size
01/06/2000 A 10
01/06/2001 A 12
01/06/2002 A 15
01/06/2000 B 15
01/06/2001 B 85
01/06/2002 B 98
In the Price table I only want to keep companies that are also in the Size table. In other words, if company C is not in the Size table, I do not need that company's data points in the Price table. The date does not need to be considered here.
In Power Query you can use the Merge Queries function to achieve that (in the Home --> Combine section of the ribbon).
Select the Join Kind to determine which rows to keep.
In your example, create a query from the 2nd table and apply the following steps:
Remove the date and the size column
Remove duplicates
Afterwards you can join the first table with the newly created query and do an inner join (only keep matching entries).
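Those are UI steps rather than code; purely to make the logic concrete, here is a small pandas sketch (toy data based on the tables above, plus one extra company C row, which the question mentions, so the filter actually removes something):

import pandas as pd

# Toy frames mirroring the Price and Size tables above, with an extra company C.
price = pd.DataFrame({
    "Date": ["01/01/2000", "01/02/2000", "01/01/2000"],
    "Company": ["A", "A", "C"],
    "Price": [10, 12, 99],
})
size = pd.DataFrame({
    "Date": ["01/06/2000", "01/06/2001"],
    "Company": ["A", "B"],
    "Size": [10, 12],
})

# "Remove the date and the size column" + "Remove duplicates":
companies = size[["Company"]].drop_duplicates()

# Inner join: only keep Price rows whose Company also appears in Size.
filtered_price = price.merge(companies, on="Company", how="inner")
print(filtered_price)   # company C is dropped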

Talend - ignore row if all columns except first have no value

I have the following table:
date c1 c2 ... cn
01/01 2 3 ... 4
01/02 ...
01/03 ...
What is the easiest way to filter out the rows where all columns except the date column have no value? (In this example, the rows with dates 01/02 and 01/03.)
The easiest way is to set up an input component and change its schema a bit: mark the value columns as mandatory (not nullable) in the schema definition, so that records missing those values are ignored.
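The schema trick above is Talend configuration rather than code; just to spell out the condition being applied, a pandas sketch of the same filter (column names taken from the example table) might look like:

import pandas as pd

# NaN marks the empty cells from the example table.
df = pd.DataFrame({
    "date": ["01/01", "01/02", "01/03"],
    "c1": [2, None, None],
    "c2": [3, None, None],
    "cn": [4, None, None],
})

# Drop a row only when every column except "date" is empty.
value_cols = [c for c in df.columns if c != "date"]
filtered = df.dropna(subset=value_cols, how="all")
print(filtered)   # keeps only the 01/01 row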

How to group by multiple columns and then transpose in Hive

I have some data that I want to group by on multiple columns, perform an aggregation function on, and then transpose into different columns using Hive.
For example, given this input
Input:
hr type value
01 a 10
01 b 20
01 c 50
01 a 30
02 c 10
02 b 90
02 a 80
I want to produce this output:
Output:
hr a_avg b_avg c_avg
01 20 20 50
02 80 90 10
There is one column for each distinct type in my input; a_avg corresponds to the average a value for each hour.
How can I do this in Hive? I am guessing I might need to make use of https://github.com/klout/brickhouse/wiki/Collect-UDFs
So far the best I can think of is to use multiple group-by clauses, but that won't transpose the data into multiple columns.
Any ideas?
You don't necessarily need to use Brickhouse, but it will definitely make it easier. Here is what I'm thinking, something like
select hr
     , type_map['a'] a_avg
     , type_map['b'] b_avg
     , type_map['c'] c_avg
  from (
       select hr
            , collect(type, avg_value) type_map -- Brickhouse collect; creates a map
         from (
              select hr
                   , type
                   , avg(value) avg_value
                from db.table
               group by hr, type ) x
        group by hr ) y

Event Study (Extracting Dates in SAS)

I need to analyse abnormal returns for an event study on mergers and acquisitions.
I would like to analyse abnormal returns to acquirers by using event windows. Basically I would like to extract the prices for the acquirers using -1 (the day before the announcement date), announcement date, and +1 (the day after the announcement date).
I have two different datasets to extract information from.
The first is a dataset with all the merger and acquisition information that has the information in the following format:
DealNO AcquirerNO TargetNO AnnouncementDate
123 abcd Cfgg 22/12/2010
222 qwert cddfgf 26/12/1998
In addition, I have a 2nd dataset which has all the prices.
ISINnumber Date Price
abcd 21/12/2010 10
abcd 22/12/2010 11
abcd 23/12/2010 11
abcd 24/12/2010 12
qwert 20/12/1998 20
qwert 21/12/1998 20
qwert 22/12/1998 21
qwert 23/12/1998 21
qwert 24/12/1998 21
qwert 25/12/1998 22
qwert 26/12/1998 21
qwert 27/12/1998 23
ISIN number is the same as acquirer no, and that is the matching code.
In the end I would like to have a database something like this:
DealNO AcquirerNO TargetNO AnnouncementDate Acquirerprice(-1day) Acquirerprice(0day) Acquirerprice(+1day)
123 abcd Cfgg 22/12/2010 10 11 11
222 qwert cddfgf 26/12/1998 22 21 23
Do you know how I can get this?
I'd prefer to use SAS to run the code, but if you are familiar with any other programs that can get the data like this, please let me know.
Thank you in advance ^_^.
This can be done quite easily with PROC SQL by joining the PRICE dataset three times. Try this (assuming data set names of ANNOUNCE and PRICE):
Warning: untested code
proc sql;
  create table RESULT as
  select a.dealno,
         a.acquirerno,
         a.targetno,
         a.announcementdate,
         p.price as acquirerprice_prev,
         c.price as acquirerprice_cur,
         n.price as acquirerprice_next
  from ANNOUNCE a
  /* join PRICE once per day offset, using each deal's own announcement date */
  left join PRICE p on a.acquirerno = p.isinnumber and p.date = a.announcementdate - 1
  left join PRICE c on a.acquirerno = c.isinnumber and c.date = a.announcementdate
  left join PRICE n on a.acquirerno = n.isinnumber and n.date = a.announcementdate + 1
  ;
quit;

National Language Shift Table and Concatenated SMS

Can we combine User Data Header for National Language Locking Shift Table and Concatenated SMS?
Concatenated SMS header:
00 IEI for concatenated sms
03 Information element data length
a1 A reference number
03 This message has 3 parts
01 This is part 1
How can I add the language locking shift table header to this header?
Example of language shift table header:
03 length of udh
24 single shift table
01 Information element data length
02 language of shift table
Thank you.
Yes, you can; you just need to create a UDH that contains both the shift table and the concatenation information elements (a byte-level sketch follows the list below). So for your example above you need to have:
08 - length of the whole UDH header (5 for concat + 3 for shift table)
00 - IEI for concatenation with 8 bit reference
03 - length of concat data
a1 - reference number
03 - 3 parts in total
01 - part 1
24 - IEI for national language single shift table (if you want locking shift, use 25)
01 - length of shift table data
02 - Spanish national language
Jim
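Not part of Jim's answer, but as a sanity check on the byte layout, the same UDH can be assembled with a small, hypothetical Python helper:

def build_udh(ref_no, total_parts, part_no, shift_iei=0x24, language=0x02):
    # 00 = IEI for concatenation with 8-bit reference, 03 = IE data length
    concat_ie = bytes([0x00, 0x03, ref_no, total_parts, part_no])
    # 0x24 = national language single shift, 0x25 = locking shift; 01 = IE data length
    shift_ie = bytes([shift_iei, 0x01, language])
    body = concat_ie + shift_ie
    # First byte is the total UDH length (UDHL), here 0x08
    return bytes([len(body)]) + body

print(build_udh(0xA1, 3, 1).hex())   # -> 080003a10301240102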
