Power Query - appending the same table to itself using differing columns

So I have a list of properties and a list of the next four servicing dates, e.g.:
Property| Last | Next1 | Next2 | Next3 | Next4 |
123 Road| 01-2019 |03-2019| 05-2019| 07-2019| 09-2019|
444 Str | 01-2019 |07-2019| 01-2020| 07-2020| 01-2021|
etc.
I want to see:
Property | Date
123 Road | 01-2019
444 Str | 01-2019
123 Road | 03-2019
123 Road | 05-2019
123 Road | 07-2019
444 Str | 07-2019
etc.
In SQL this would be a UNION; in Power Query I think it's an append, but I'm not sure how to go about it, i.e. how to select columns from a table and then append a table with a different selection. I can append the full table easily, but not just certain columns.

Select the date columns and do Transform > Unpivot Columns.
Then you can rename the Value column to Date, remove the Attribute column if you want, and sort as desired.
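For comparison with the SQL framing in the question, the unpivot produces the same result shape as a UNION ALL over the date columns would. A sketch only, assuming a table named Servicing with the columns shown above (ServiceDate stands in for Date, which is a reserved word in some SQL dialects):
-- Sketch: SQL equivalent of the unpivot; Servicing is a hypothetical table name.
select Property, Last  as ServiceDate from Servicing
union all
select Property, Next1 as ServiceDate from Servicing
union all
select Property, Next2 as ServiceDate from Servicing
union all
select Property, Next3 as ServiceDate from Servicing
union all
select Property, Next4 as ServiceDate from Servicing;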

Related

How to write huge table data to file | Informatica 10.x

I have created an Informatica flow
where I need to read data from a table, specifically one column that contains emp ids.
The column might contain duplicates, and I need to write only the distinct values to a file, using the query below.
Query:
select distinct emp_id
from employee
where emp_id not in
    (select distinct custid from customer);
I have added the above query in the Source Qualifier.
The employee table contains 5 million records and the customer table contains 20 billion records.
My Informatica job is still running and has not completed - 6 hours so far - and nothing has been written to the file because of the huge volume of data in both tables.
Following is my query plan
--------------------------------------------------------------------
 Id | Operation                  | Name       |
--------------------------------------------------------------------
  0 | SELECT STATEMENT           |            |
  1 | PX COORDINATOR             |            |
  2 | PX SEND QC (RANDOM)        | :TQ10002   |
  3 | HASH UNIQUE                |            |
  4 | PX RECEIVE                 |            |
  5 | PX SEND HASH               | :TQ10001   |
  6 | HASH UNIQUE                |            |
  7 | HASH JOIN ANTI             |            |
  8 | PX RECEIVE                 |            |
  9 | PX SEND PARTITION (KEY)    | :TQ10000   |
 10 | PX SELECTOR                |            |
 11 | INDEX FAST FULL SCAN       | PK_EMP_ID  |
 12 | PX PARTITION RANGE ALL     |            |
 13 | INDEX FAST FULL SCAN       | PK_CUST_ID |
--------------------------------------------------------------------
Sample table data :
employee
111
123
145
1345
111
123
145
678
....
customer
111
111
111
1345
111
145
145
145
145
145
145
....
Expected output :
123
678
Any solution is much appreciated !!!
It seems to me the SQL is the problem. If you don't have anything like a Sorter/Aggregator transformation, you don't have to worry about the Informatica side.
The SQL has some expensive operations. You can try the following:
select emp_id
from employee
where not exists
    (select 1 from customer where custid = emp_id);
This should be a little faster because:
you aren't running a subquery to get distinct values from a 20-billion-row customer table.
you don't need distinct in the first select, because you are selecting from the employee table where emp_id is unique, and NOT EXISTS will make sure no duplicates come out of the first select.
You can also use a left join + where, but I think it will be more expensive because of join-induced duplicates.
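For reference, a sketch of that left join + where variant, using the same table and column names as the question:
-- Left-join anti-join sketch: employee rows that match customer are multiplied
-- by the join and then discarded by the IS NULL filter, which is why it tends
-- to be more expensive here.
select e.emp_id
from employee e
left join customer c
       on c.custid = e.emp_id
where c.custid is null;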
I would start with partitioning the customer table by hash or by range on custid (or on an insert date); this would speed up your inline select substantially.
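A rough sketch of that hash partitioning in Oracle syntax (customer_part and the partition count are hypothetical choices, not from the question):
-- Hypothetical: rebuild customer hash-partitioned on custid so the
-- anti-join can run partition-wise.
create table customer_part
partition by hash (custid)
partitions 32
as select * from customer;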
Also try this:
select emp_id from employee
minus
select emp_id from employee e, customer c
where e.emp_id=c.custid;

How to count the number of words in each column delimited by the "|" separator using Hive?

Input data is:
+----------------------+--------------------------------+
| movie_name | Genres |
+----------------------+--------------------------------+
| digimon | Adventure|Animation|Children's |
| Slumber_Party_Massac | Horror |
+----------------------+--------------------------------+
I need output like:
+----------------------+--------------------------------+-----------------+
| movie_name | Genres | count_of_genres |
+----------------------+--------------------------------+-----------------+
| digimon | Adventure|Animation|Children's | 3 |
| Slumber_Party_Massac | Horror | 1 |
+----------------------+--------------------------------+-----------------+
select *
      -- the pattern matches the genre tokens themselves, so split() returns one
      -- more element than there are tokens; size()-1 is therefore the genre count
      ,size(split(coalesce(Genres,''),'[^|\\s]+'))-1 as count_of_genres
from mytable
This solution covers various use cases, including:
NULL values
Empty strings
Empty tokens (e.g. Adventure||Animation or Adventure| |Animation)
This is a really, really bad way to store data. You should have a separate MovieGenres table with one row per movie and per genre.
One method is to use length() and replace():
select t.*,
(1 + length(genres) - length(replace(genres, '|', ''))) as num_genres
from t;
This assumes that each movie has at least one genre. If not, you need to test for that as well.
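If you do normalize the data as the earlier comment suggests, a minimal Hive sketch for exploding the pipe-delimited genres into one row per movie/genre pair (MovieGenres is a hypothetical table name; movie_name, Genres and mytable come from the question):
-- Hypothetical normalization: one row per (movie_name, genre).
create table MovieGenres as
select m.movie_name
      ,trim(g.genre) as genre
from mytable m
lateral view explode(split(m.Genres,'\\|')) g as genre
where trim(g.genre) <> '';   -- drop empty tokens from doubled delimiters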

How to create Hive table with user specified number of records?

Is it possible to create a hive table with user-specified number of records?
For example, I want to create a table with x number of rows (where x is defined by the user). The table would have two columns: 1. a unique row id [could be auto-incremented], 2. a randomly generated string.
Is this possible using Hive?
set N=7;  -- desired number of rows

select pe.i+1 as n  -- 0-based position + 1 = incremental row id
      ,java_method('org.apache.commons.lang.RandomStringUtils','randomAlphabetic',10) as str  -- random 10-character string
from (select 1) x
      -- space(N-1) is a string of N-1 blanks; splitting on ' ' gives N tokens,
      -- and posexplode turns them into N rows with their positions
      lateral view posexplode(split(space(${hiveconf:N}-1),' ')) pe as i,x
;
+---+------------+
| n | str |
+---+------------+
| 1 | udttBCmtxT |
| 2 | kkrMQmirSG |
| 3 | iYDABgXOvW |
| 4 | DKHKgtXKPS |
| 5 | ylebKcdcGj |
| 6 | DaujBCkCtz |
| 7 | VMaWfbtzFY |
+---+------------+
Documentation: posexplode, java_method, RandomStringUtils.
Specifying a limit on the number of rows at the time of creating the table may not be possible, but it is possible to limit the number of rows inserted into the table using a LIMIT clause:
-- <filename: dbloader.sql>
create table ${hiveconf:TABLENAME} (id int, string1 string);
insert into table ${hiveconf:TABLENAME}
select id, string1 from oldtable limit ${hiveconf:ROWLIMIT};
and while submitting the Hive script:
hive --hiveconf TABLENAME='XYZ' --hiveconf ROWLIMIT=1000 -f dbloader.sql
As far as creating a unique incremental id goes, you will have to write a UDF for it.
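If a built-in is acceptable instead of a custom UDF, a hedged alternative is to generate the id with row_number() in the insert's select (same hypothetical oldtable/string1 names as above):
-- Sketch only: row_number() as the incremental id instead of a UDF.
insert into table ${hiveconf:TABLENAME}
select row_number() over (order by string1) as id
      ,string1
from oldtable
limit ${hiveconf:ROWLIMIT};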

Creating sql statements to return information from a table

I am creating SQL queries to return information from a table, but I am having issues with one in particular. I want to return all of the urban areas that are in the state of Colorado.
The actual definition of the query is
Return the names (name10) of all urban areas (in alphabetical order) that are entirely contained
within Colorado. Return the results in alphabetical order. (64 records)
The first table that I am using is tl_2010_us_state10 (it stores information for the states). I think I am going to use the name10 column in this table because it has all of the names of the states.
Table "public.tl_2010_us_state10"
Column | Type | Modifiers
------------+-----------------------------+-------------------------------------
gid | integer | not null default
region10 | character varying(2) |
division10 | character varying(2) |
statefp10 | character varying(2) |
statens10 | character varying(8) |
geoid10 | character varying(2) |
stusps10 | character varying(2) |
name10 | character varying(100) |
Then I have a table that holds all of the urban area information. Once again I think I am going to use the name10 column because it stores the names of all the urban areas.
Table "public.tl_2010_us_uac10"
Column | Type | Modifiers
------------+-----------------------------+-------------------------------------
gid | integer | not null default
uace10 | character varying(5) |
geoid10 | character varying(5) |
name10 | character varying(100) |
The code that I wrote in SQL was:
select a.name10 from tl_2010_us_uac10 as a join tl_2010_us_state10 as b where (b.name10 = 'colorado');
but I get this error
LINE 1: ...l_2010_us_uac10 as a join tl_2010_us_state10 as b where (b.n...
gid is a primary key
You must have a join condition for an inner join. Then add an ORDER BY to meet your sorting requirement.
select a.name10 as urban_area
from tl_2010_us_uac10 as a
join tl_2010_us_state10 as b
on b.gid = a.gid
where b.name10 = 'colorado'
order by a.name10;
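If the stored state name is capitalized (TIGER data usually has 'Colorado' rather than 'colorado'), a case-insensitive comparison avoids an empty result; a sketch keeping the answer's join as-is:
select a.name10 as urban_area
from tl_2010_us_uac10 as a
join tl_2010_us_state10 as b
  on b.gid = a.gid
where lower(b.name10) = 'colorado'
order by a.name10;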

Column to comma separated value in Hive

It's been asked and answered for SQL (Convert multiple rows into one with comma as separator); would any of the approaches mentioned there work in Hive, e.g. to go from this:
+------+------+
| Col1 | Col2 |
+------+------+
| a | 1 |
| a | 5 |
| a | 6 |
| b | 2 |
| b | 6 |
+------+------+
to this:
+------+-------+
| Col1 | Col2 |
+------+-------+
| a | 1,5,6 |
| b | 2,6 |
+------+-------+
The aggregate function collect_set can achieve what you are trying to get. Here is the documentation. So you can write a query like:
SELECT Col1, collect_set(Col2)
FROM your_table
GROUP BY Col1;
However, there is one striking difference between MySQL's GROUP_CONCAT and Hive's collect_set: while GROUP_CONCAT retains duplicates in the result, collect_set removes any duplicates occurring in the array. In the example you show there are no repeating values of Col2 within a group, so you can go ahead and use it.
And there is collect_list, which keeps the full list (with duplicates).
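If duplicates should be preserved, a variant of the same pattern with collect_list might look like this (the cast is needed if Col2 is numeric, since concat_ws expects strings):
SELECT Col1, concat_ws(',', collect_list(cast(Col2 as string))) as Col2
FROM your_table
GROUP BY Col1;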
Try this
SELECT Col1, concat_ws(',', collect_set(Col2)) as col2
FROM your_table
GROUP BY Col1;
See the apache.org documentation.
