How to load data to same Hive table if file has different number of columns - hadoop

I have a main table (Employee) which has 10 columns, and I can load data into it using load data inpath '/file1.txt' into table Employee.
My question is: how do I handle the same table (Employee) if my file file2.txt has the same columns but columns 3 and 5 are missing? If I load it directly, the last two columns will be NULL; instead, the 3rd column and the 5th column should be loaded as NULL.
Suppose I have a table Employee and I want to load file1.txt and file2.txt into it.
file1.txt
==========
id name sal deptid state country
1 aaa 1000 01 TS india
2 bbb 2000 02 AP india
3 ccc 3000 03 BGL india
file2.txt
id name deptid country
1 second 001 US
2 third 002 ENG
3 forth 003 AUS
In file2.txt we are missing 2 columns, i.e. sal and state.
We need to use the same Employee table; how do we handle this?

I'm not aware of any way to create a table backed by data files with a non-homogeneous structure. What you can do, however, is define separate tables for the different column configurations and then define a view that queries both.
I think it's easier if I provide an example. I will use two tables of people, both have a column for name, but one stores height as well, while the other stores weight instead:
> create table table1(name string, height int);
> insert into table1 values ('Alice', 178), ('Charlie', 185);
> create table table2(name string, weight int);
> insert into table2 values ('Bob', 98), ('Denise', 52);
> create view people as
> select name, height, NULL as weight from table1
> union all
> select name, NULL as height, weight from table2;
> select * from people order by name;
+---------+--------+--------+
| name    | height | weight |
+---------+--------+--------+
| Alice   |    178 |   NULL |
| Bob     |   NULL |     98 |
| Charlie |    185 |   NULL |
| Denise  |   NULL |     52 |
+---------+--------+--------+
Or, as a closer example to your problem, let's say that one table has name, height and weight, while the other has only name and weight, so height is "missing from the middle":
> create table table1(name string, height int, weight int);
> insert into table1 values ('Alice', 178, 55), ('Charlie', 185, 78);
> create table table2(name string, weight int);
> insert into table2 values ('Bob', 98), ('Denise', 52);
> create view people as
> select name, height, weight from table1
> union all
> select name, NULL as height, weight from table2;
> select * from people order by name;
+---------+--------+--------+
| name    | height | weight |
+---------+--------+--------+
| Alice   |    178 |     55 |
| Bob     |   NULL |     98 |
| Charlie |    185 |     78 |
| Denise  |   NULL |     52 |
+---------+--------+--------+
Be sure to use union all and not just union, because the latter tries to remove duplicate rows, which makes it very expensive.

It seems there is no way to load directly into specified columns.
As such, this is what you probably need to do:
Load data inpath into a (temporary?) staging table that matches the file.
Insert into the relevant columns of the final table by selecting the contents of that staging table.
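The staging-table pattern above can be sketched end to end. This is a SQLite-via-Python sketch, not Hive (the staging table name employee_stage is made up for illustration); in Hive, LOAD DATA INPATH would fill the staging table and the INSERT ... SELECT would take the same shape, with NULL literals in the missing positions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Final table has all six columns.
cur.execute("""CREATE TABLE employee
               (id INT, name TEXT, sal INT, deptid TEXT, state TEXT, country TEXT)""")

# Staging table matches file2.txt exactly (sal and state are absent).
cur.execute("CREATE TABLE employee_stage (id INT, name TEXT, deptid TEXT, country TEXT)")
cur.executemany("INSERT INTO employee_stage VALUES (?,?,?,?)",
                [(1, "second", "001", "US"), (2, "third", "002", "ENG")])

# The insert-select supplies NULL in the right positions for the missing columns.
cur.execute("""INSERT INTO employee
               SELECT id, name, NULL, deptid, NULL, country FROM employee_stage""")

print(cur.execute("SELECT * FROM employee ORDER BY id").fetchall())
# [(1, 'second', None, '001', None, 'US'), (2, 'third', None, '002', None, 'ENG')]
```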
The situation is very similar to this question which covers the opposite scenario (you only want to load a few columns).

Related

How many ROS containers would be created if I am loading a table with 100 rows in a 3 node cluster

I have a 3-node cluster with 1 database and 1 table. I have not created a projection. If I load 100 rows into the table using the COPY command, then:
How many projections would be created? I suspect only 1 superprojection; am I correct?
If I am using segmentation, would that distribute the data evenly (~33 rows) per node? Does that mean I now have 3 Read Optimized Storage (ROS) containers, one per node, and the projection has 3 ROSes?
If I use a K-safety value of 1, would a copy of each ROS (a buddy) be stored on another node? Do I then have 6 ROSes, each containing ~33 rows?
Well, let's play out the scenario.
You will see that you get a projection and its identical buddy projection,
and you can query the catalog to count the rows and identify the projections.
-- load a file with 100 random generated rows into table example;
-- generate the rows from within Vertica, and export to file
-- then create a new table and see what the projections look like
CREATE TABLE rows100 AS
SELECT
    (ARRAY['Ann','Lucy','Mary','Bob','Matt'])[RANDOMINT(5)] AS fname,
    (ARRAY['Lee','Ross','Smith','Davis'])[RANDOMINT(4)]     AS lname,
    '2001-01-01'::DATE + RANDOMINT(365*10)                  AS hdate,
    (10000 + RANDOM()*9000)::NUMERIC(7,2)                   AS salary
FROM (
    SELECT tm FROM (
        SELECT now() + INTERVAL '  1 second' AS t UNION ALL
        SELECT now() + INTERVAL '100 seconds' AS t -- creates 100 rows
    ) x TIMESERIES tm AS '1 second' OVER(ORDER BY t)
) y
;
-- set field separator to vertical bar (the default, actually...)
\pset fieldsep '|'
-- toggle to tuples only .. no column names and no row count
\tuples_only
-- spool to example.bsv - in bar-separated-value format
\o example.bsv
SELECT * FROM rows100;
-- spool to file off - closes output file
\o
-- create a table without bothering with projections matching the test data
DROP TABLE IF EXISTS example;
CREATE TABLE example LIKE rows100;
-- load the new table ...
COPY example FROM LOCAL 'example.bsv';
-- check the nodes ..
SELECT node_name FROM nodes;
-- out node_name
-- out ----------------
-- out v_sbx_node0001
-- out v_sbx_node0002
-- out v_sbx_node0003
SELECT
    node_name
  , projection_schema
  , anchor_table_name
  , projection_name
  , row_count
FROM v_monitor.projection_storage
WHERE anchor_table_name = 'example'
ORDER BY projection_name, node_name
;
-- out node_name | projection_schema | anchor_table_name | projection_name | row_count
-- out ----------------+-------------------+-------------------+-----------------+-----------
-- out v_sbx_node0001 | public | example | example_b0 | 38
-- out v_sbx_node0002 | public | example | example_b0 | 32
-- out v_sbx_node0003 | public | example | example_b0 | 30
-- out v_sbx_node0001 | public | example | example_b1 | 30
-- out v_sbx_node0002 | public | example | example_b1 | 38
-- out v_sbx_node0003 | public | example | example_b1 | 32

Left Outer Join via a link table, using min() to restrict join to one row

I am trying to write an Oracle SQL query to join two tables that are linked via a link table (by that I mean a table with 2 columns, each a foreign key to one of the primary tables). A MIN() function is to be used to limit the results from the left outer join to a single row.
My model consists of "parents" and "nephews". Parents can have 0 or more nephews, and can be enabled or disabled. Each nephew has a birthday date. The goal of my query is:
Print a single row for each enabled parent, listing that parent's oldest nephew (i.e. the one with the MIN(birthday)).
My problem is illustrated here at sqlfiddle: http://sqlfiddle.com/#!4/9a3be0d/1
I can form a query that lists all of the nephews for the enabled parents, but that is not good enough: I want just one row per parent, containing only the oldest nephew. Forming the WHERE clause against the outer table seems to be my stumbling block.
My tables and sample data:
create table parent (parent_id number primary key, parent_name varchar2(50), enabled int);
create table nephew (nephew_id number primary key, birthday date, nephew_name varchar2(50));
create table parent_nephew_link (parent_id number not null, nephew_id number not null);
parent table:
+-----------+-------------+---------+
| parent_id | parent_name | enabled |
+-----------+-------------+---------+
| 1         | Donald      | 1       |
+-----------+-------------+---------+
| 2         | Minnie      | 0       |
+-----------+-------------+---------+
| 3         | Mickey      | 1       |
+-----------+-------------+---------+
nephew table:
+-----------+------------+-------------+
| nephew_id | birthday   | nephew_name |
+-----------+------------+-------------+
| 100       | 01/01/2017 | Huey        |
+-----------+------------+-------------+
| 101       | 01/01/2016 | Dewey       |
+-----------+------------+-------------+
| 102       | 01/01/2015 | Louie       |
+-----------+------------+-------------+
| 103       | 01/01/2014 | Morty       |
+-----------+------------+-------------+
| 104       | 01/01/2013 | Ferdie      |
+-----------+------------+-------------+
parent_nephew_link table:
+-----------+-----------+
| parent_id | nephew_id |
+-----------+-----------+
| 1         | 100       |
+-----------+-----------+
| 1         | 101       |
+-----------+-----------+
| 1         | 102       |
+-----------+-----------+
| 3         | 103       |
+-----------+-----------+
| 3         | 104       |
+-----------+-----------+
My (not correct) query:
-- This query is not right, it returns a row for each nephew
select parent_name, nephew_name
from parent p
left outer join parent_nephew_link pnl
on p.parent_id = pnl.parent_id
left outer join nephew n
on n.nephew_id = pnl.nephew_id
where enabled = 1
-- I wish I could add this clause to restrict the result to the oldest
-- nephew but p.parent_id is not available in sub-selects.
-- You get an ORA-00904 error if you try this:
-- and n.birthday = (select min(birthday) from nephew nested where nested.parent_id = p.parent_id)
My desired output would be:
+-------------+-------------+
| parent_name | nephew_name |
+-------------+-------------+
| Donald      | Louie       |
+-------------+-------------+
| Mickey      | Ferdie      |
+-------------+-------------+
Thanks for any advice!
John
markaaronky's suggestion
I tried using markaaronky's suggestion, but this SQL is also flawed.
-- This query is not right either, it returns the correct data but only for one parent
select * from (
select parent_name, n.nephew_name, n.birthday
from parent p
left outer join parent_nephew_link pnl
on p.parent_id = pnl.parent_id
left outer join nephew n
on n.nephew_id = pnl.nephew_id
where enabled = 1
order by parent_name, n.birthday asc
) where rownum <= 1
Why not:
(1) include the n.birthday from the nephews table in your SELECT statement
(2) add an ORDER BY n.birthday ASC to your query
(3) also modify your select so that it only takes the top row?
I tried to write this out in sqlfiddle for you but it doesn't seem to like table aliases (e.g. it throws an error when I write n.birthday), but I'm sure that's legal in Oracle, even though I'm a SQL Server guy.
Also, if I recall correctly, Oracle doesn't have a SELECT TOP like SQL Server does... you have to do something like "WHERE ROWNUM = 1" instead? Same concept... you're just ordering your results so the oldest nephew is the first row, and you're only taking the first row.
Perhaps an undesired side effect is you WOULD get the birthday along with the names in your results. If that's unacceptable, my apologies. It looked like your question has been sitting unanswered for a while and this solution should at least give you a start.
Lastly, since you don't have a NOT NULL constraint on your birthday column and are doing left outer joins, you might make the query safer by adding AND n.birthday IS NOT NULL.
Use:
select parent_name, nephew_name
from parent p
left outer join
(
SELECT pnl.parent_id, n.nephew_name
FROM parent_nephew_link pnl
join nephew n
on n.nephew_id = pnl.nephew_id
AND n.BIRTHDAY = (
SELECT min( BIRTHDAY )
FROM nephew n1
JOIN parent_nephew_link pnl1
ON pnl1.NEPHEW_ID = n1.NEPHEW_ID
WHERE pnl1.PARENT_ID = pnl.PARENT_ID
)
) ppp
on p.parent_id = ppp.parent_id
where p.enabled = 1
Demo: http://sqlfiddle.com/#!4/98758/23
| PARENT_NAME | NEPHEW_NAME |
|-------------|-------------|
| Mickey      | Ferdie      |
| Donald      | Louie       |
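For reference, the shape of that answer (a correlated MIN() subquery reached through the link table) can be sketched in SQLite via Python with the question's sample data. Note this sketch uses inner joins for brevity, so enabled parents with no nephews would drop out; keep the outer join from the answer above if you need them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE parent (parent_id INT PRIMARY KEY, parent_name TEXT, enabled INT);
CREATE TABLE nephew (nephew_id INT PRIMARY KEY, birthday TEXT, nephew_name TEXT);
CREATE TABLE parent_nephew_link (parent_id INT, nephew_id INT);
INSERT INTO parent VALUES (1,'Donald',1),(2,'Minnie',0),(3,'Mickey',1);
INSERT INTO nephew VALUES
  (100,'2017-01-01','Huey'),(101,'2016-01-01','Dewey'),(102,'2015-01-01','Louie'),
  (103,'2014-01-01','Morty'),(104,'2013-01-01','Ferdie');
INSERT INTO parent_nephew_link VALUES (1,100),(1,101),(1,102),(3,103),(3,104);
""")

# For each enabled parent, keep only the nephew whose birthday equals the
# minimum birthday among that parent's nephews (found via the link table).
rows = cur.execute("""
SELECT p.parent_name, n.nephew_name
FROM parent p
JOIN parent_nephew_link pnl ON pnl.parent_id = p.parent_id
JOIN nephew n ON n.nephew_id = pnl.nephew_id
WHERE p.enabled = 1
  AND n.birthday = (SELECT MIN(n1.birthday)
                    FROM nephew n1
                    JOIN parent_nephew_link pnl1 ON pnl1.nephew_id = n1.nephew_id
                    WHERE pnl1.parent_id = p.parent_id)
ORDER BY p.parent_name
""").fetchall()
print(rows)  # [('Donald', 'Louie'), ('Mickey', 'Ferdie')]
```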

Insert value based on min value greater than value in another row

It's difficult to explain the question well in the title.
I am inserting 6 values from (or based on values in) one row.
I also need to insert a value from a second row where:
The values in one column (ID) must be equal.
The value in column CODE in the main source row must be IN (100,200), whereas the other row must have a value of 300 or 400.
The value in another column (OBJID) in the secondary row must be the lowest value greater than that in the primary row.
Source table looks like:
OBJID | CODE | ENTRY_TIME  | INFO | ID | USER
---------------------------------------------
1     | 100  | x timestamp | .... | 10 | X
2     | 100  | y timestamp | .... | 11 | Y
3     | 300  | z timestamp | .... | 10 | F
4     | 100  | h timestamp | .... | 10 | X
5     | 300  | g timestamp | .... | 10 | G
So, to provide an example:
In my second table I want to insert OBJID, OBJID2, CODE, ENTRY_TIME, substr(INFO(...)), ID, USER.
i.e. from my example, a line inserted into the second table would look like:
OBJID | OBJID2 | CODE | ENTRY_TIME  | INFO       | ID | USER
------------------------------------------------------------
1     | 3      | 100  | x timestamp | substring  | 10 | X
4     | 5      | 100  | h timestamp | substring2 | 10 | X
My insert for everything that just comes from one row works fine.
INSERT INTO TABLE2
  (ID, OBJID, INFO, USER, ENTRY_TIME)
SELECT ID, OBJID,
       DECODE(CODE,
              100, SUBSTR(INFO, 12, LENGTH(INFO) - 27),
              600, 'CREATE') AS INFO,
       USER, ENTRY_TIME
FROM TABLE1
WHERE CODE IN (100, 200);
I'm aware that I'll need to use an alias on TABLE1, but I don't know how to get the rest to work, particularly in an efficient way. There are 2 million rows right now, but there will be closer to 20 million once I start using production data.
You could try this:
select primary.* ,
(select min(objid)
from table1 secondary
where primary.objid < secondary.objid
and secondary.code in (300,400)
and primary.id = secondary.id
) objid2
from table1 primary
where primary.code in (100,200);
Ok, I've come up with:
select OBJID,
min(case when code in (300,400) then objid end)
over (partition by id order by objid
range between 1 following and unbounded following
) objid2,
CODE, ENTRY_TIME, INFO, ID, USER1
from table1;
So you need an INSERT ... SELECT of the above query, with WHERE objid2 IS NOT NULL AND code IN (100,200).
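The correlated-subquery variant suggested first can be sketched in SQLite via Python, with the source table trimmed to the three columns that matter (OBJID, CODE, ID). Rows with no matching 300/400 partner come back with a NULL objid2, which the final insert would filter out:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table1 (objid INT, code INT, id INT)")
cur.executemany("INSERT INTO table1 VALUES (?,?,?)",
                [(1, 100, 10), (2, 100, 11), (3, 300, 10),
                 (4, 100, 10), (5, 300, 10)])

# For each primary row (code 100/200), find the lowest objid above it
# among secondary rows (code 300/400) sharing the same id.
rows = cur.execute("""
SELECT p.objid,
       (SELECT MIN(s.objid) FROM table1 s
         WHERE s.objid > p.objid
           AND s.code IN (300, 400)
           AND s.id = p.id) AS objid2
FROM table1 p
WHERE p.code IN (100, 200)
ORDER BY p.objid
""").fetchall()
print(rows)  # [(1, 3), (2, None), (4, 5)]
```

Row (2, None) is objid 2 with id 11, which has no 300/400 partner; adding WHERE objid2 IS NOT NULL drops it, leaving exactly the (1, 3) and (4, 5) pairings from the question.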

How to select nth row in CockroachDB?

If I use something like a SERIAL (which is a random number) for my table's primary key, how can I select a numbered row from my table? In MySQL, I just use the auto incremented ID to select a specific row, but not sure how to approach the problem with an arbitrary numbering sequence.
For reference, here is the table I'm working with:
+--------------------+------+-------+
| id | name | score |
+--------------------+------+-------+
| 235451721728983041 | ABC | 1000 |
| 235451721729015809 | EDF | 1100 |
| 235451721729048577 | GHI | 1200 |
| 235451721729081345 | JKL | 900 |
+--------------------+------+-------+
Using the LIMIT and OFFSET clauses will return the nth row. For example SELECT * FROM tbl ORDER BY col1 LIMIT 1 OFFSET 9 returns the 10th row.
Note that it’s important to include the ORDER BY clause here because you care about the order of the results (if you don’t include ORDER BY, it’s possible that the results are arbitrarily ordered).
If you care about the order in which things were inserted, you could ORDER BY the SERIAL column (id in your case), though that isn't fully reliable: transaction contention and other factors can cause the generated SERIAL values not to be strictly ordered.
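The same LIMIT/OFFSET idea in a quick SQLite-via-Python sketch, using the question's sample rows (OFFSET n-1 selects the nth row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tbl (id INT, name TEXT, score INT)")
cur.executemany("INSERT INTO tbl VALUES (?,?,?)",
                [(235451721728983041, 'ABC', 1000),
                 (235451721729015809, 'EDF', 1100),
                 (235451721729048577, 'GHI', 1200),
                 (235451721729081345, 'JKL', 900)])

# 3rd row in id order: LIMIT 1 OFFSET 2 (offset is zero-based).
row = cur.execute(
    "SELECT name, score FROM tbl ORDER BY id LIMIT 1 OFFSET 2").fetchone()
print(row)  # ('GHI', 1200)
```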

Oracle Insert Into Child & Parent Tables

I have a table - let's call it MASTER - with a lot of rows in it. Now I had to create another table called MASTER_DETAILS, which will be populated with data from another system. Such data will be accessed via a DB link.
MASTER has a FK to MASTER_DETAILS (a 1-to-1 relationship).
I created a SQL to populate the MASTER_DETAILS table:
INSERT INTO MASTER_DETAILS(ID, DETAIL1, DETAILS2, BLAH)
WITH QUERY_FROM_EXTERNAL_SYSTEM AS (
SELECT IDENTIFIER,
FIELD1,
FIELD2,
FIELD3
FROM TABLE#DB_LINK
--- DOZENS OF INNERS AND OUTER JOINS HERE
) SELECT MASTER_DETAILS_SEQ.NEXTVAL,
QES.FIELD1,
QES.FIELD2,
QES.FIELD3
FROM MASTER M
INNER JOIN QUERY_FROM_EXTERNAL_SYSTEM QES ON QES.IDENTIFIER = M.ID
--- DOZENS OF JOINS HERE
The approach above works fine to insert all the values into MASTER_DETAILS.
The problem is:
With this approach, I cannot insert the value of MASTER_DETAILS_SEQ.CURRVAL into the MASTER table, so I create all the entries in the DETAILS table but never link them to the MASTER table.
Does anyone see a way out of this problem using only an INSERT statement? I'd like to avoid writing a complex script with loops and so on to handle it.
Ideally I want to do something like this:
INSERT INTO MASTER_DETAILS(ID, DETAIL1, DETAILS2, BLAH) AND MASTER(MASTER_DETAILS_ID)
WITH QUERY_FROM_EXTERNAL_SYSTEM AS (
SELECT IDENTIFIER,
FIELD1,
FIELD2,
FIELD3
FROM TABLE#DB_LINK
--- DOZENS OF INNERS AND OUTER JOINS HERE
) SELECT MASTER_DETAILS_SEQ.NEXTVAL,
QES.FIELD1,
QES.FIELD2,
QES.FIELD3
FROM MASTER M
INNER JOIN QUERY_FROM_EXTERNAL_SYSTEM QES ON QES.IDENTIFIER = M.ID
--- DOZENS OF JOINS HERE,
SELECT MASTER_DETAILS_SEQ.CURRVAL FROM DUAL;
I know this approach does not work in Oracle; I am showing this SQL only to demonstrate what I want to do.
Thanks.
If there is really a 1-to-1 relationship between the two tables, then they could arguably be a single table. Presumably you have a reason to want to keep them separate. Perhaps the master is a vendor-supplied table you shouldn't touch and the detail is extra data; but then you're changing the master anyway by adding the foreign key field. Or perhaps the detail will be reloaded periodically and you don't want to update the master table; but then you have to update the foreign key field anyway. I'll assume you're required to have a separate table, for whatever reason.
If you put a foreign key on the master table that refers to the primary key on the detail table, you're restricted to it only ever being a 1-to-1 relationship. If that really is the case then conceptually it shouldn't matter which way the relationship is built - which table has the primary key and which has the foreign key. And if it isn't, then your model will break when your detail table (or the remote query) comes back with two rows related to the same master - even if you're sure that won't happen today, will it always be true? The pluralisation of the name master_details suggests that might be expected. Maybe. Having the relationship the other way around would prevent that being an issue.
I'm guessing you decided to put the relationship that way round so you can join the tables using the detail's key:
select m.column, md.column
from master m
join master_details md on md.id = m.detail_id
... because you expect that to be the quickest way, since md.id will be indexed (implicitly, as a primary key). But you could achieve the same effect by adding the master ID to the details table as a foreign key:
select m.column, md.column
from master m
join master_details md on md.master_id = m.id
It is good practice to index foreign keys anyway, and as long as you have an index on master_details.master_id the performance should be the same (more or less; other factors may come into play, but I'd expect this to generally be the case). This would also allow multiple detail records in the future, without needing to modify the schema.
So as a simple example, let's say you have a master table created and populated with some dummy data:
create table master(id number, data varchar2(10),
constraint pk_master primary key (id));
create sequence seq_master start with 42;
insert into master (id, data)
values (seq_master.nextval, 'Foo ' || seq_master.nextval);
insert into master (id, data)
values (seq_master.nextval, 'Foo ' || seq_master.nextval);
insert into master (id, data)
values (seq_master.nextval, 'Foo ' || seq_master.nextval);
select * from master;
ID DATA
---------- ----------
42 Foo 42
43 Foo 43
44 Foo 44
The changes you've proposed might look like this:
create table detail (id number, other_data varchar2(10),
constraint pk_detail primary key(id));
create sequence seq_detail;
alter table master add (detail_id number,
constraint fk_master_detail foreign key (detail_id)
references detail (id));
insert into detail (id, other_data)
select seq_detail.nextval, 'Bar ' || seq_detail.nextval
from master m
-- joins etc
;
... plus the update of the master's foreign key, which is what you're struggling with, so let's do that manually for now:
update master set detail_id = 1 where id = 42;
update master set detail_id = 2 where id = 43;
update master set detail_id = 3 where id = 44;
And then you'd query as:
select m.data, d.other_data
from master m
join detail d on d.id = m.detail_id
where m.id = 42;
DATA OTHER_DATA
---------- ----------
Foo 42 Bar 1
Plan hash value: 2192253142
------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 22 | 2 (0)| 00:00:01 |
| 1 | NESTED LOOPS | | 1 | 22 | 2 (0)| 00:00:01 |
| 2 | TABLE ACCESS BY INDEX ROWID| MASTER | 1 | 13 | 1 (0)| 00:00:01 |
|* 3 | INDEX UNIQUE SCAN | PK_MASTER | 1 | | 0 (0)| 00:00:01 |
| 4 | TABLE ACCESS BY INDEX ROWID| DETAIL | 3 | 27 | 1 (0)| 00:00:01 |
|* 5 | INDEX UNIQUE SCAN | PK_DETAIL | 1 | | 0 (0)| 00:00:01 |
------------------------------------------------------------------------------------------
If you swap the relationship around the changes become:
create table detail (id number, master_id number, other_data varchar2(10),
constraint pk_detail primary key(id),
constraint fk_detail_master foreign key (master_id)
references master (id));
create index ix_detail_master_id on detail (master_id);
create sequence seq_detail;
insert into detail (id, master_id, other_data)
select seq_detail.nextval, m.id, 'Bar ' || seq_detail.nextval
from master m
-- joins etc.
;
No update of the master table is needed, and the query becomes:
select m.data, d.other_data
from master m
join detail d on d.master_id = m.id
where m.id = 42;
DATA OTHER_DATA
---------- ----------
Foo 42 Bar 1
Plan hash value: 4273661231
----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 19 | 2 (0)| 00:00:01 |
| 1 | NESTED LOOPS | | 1 | 19 | 2 (0)| 00:00:01 |
| 2 | TABLE ACCESS BY INDEX ROWID| MASTER | 1 | 10 | 1 (0)| 00:00:01 |
|* 3 | INDEX UNIQUE SCAN | PK_MASTER | 1 | | 0 (0)| 00:00:01 |
| 4 | TABLE ACCESS BY INDEX ROWID| DETAIL | 1 | 9 | 1 (0)| 00:00:01 |
|* 5 | INDEX RANGE SCAN | IX_DETAIL_MASTER_ID | 1 | | 0 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------
The only real difference in the plan is that you now have a range scan instead of a unique scan; if you're really sure it's 1-to-1 you could make the index unique but there's not much benefit.
SQL Fiddle of this approach.
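The recommended direction (foreign key on the detail table, so a single INSERT ... SELECT carries the master key and no follow-up UPDATE of master is needed) can be sketched in SQLite via Python. Table names mirror the answer's example, simplified so other_data is derived from the master id rather than a sequence:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE master (id INTEGER PRIMARY KEY, data TEXT);
CREATE TABLE detail (id INTEGER PRIMARY KEY,
                     master_id INTEGER REFERENCES master(id),
                     other_data TEXT);
CREATE INDEX ix_detail_master_id ON detail(master_id);
INSERT INTO master VALUES (42,'Foo 42'),(43,'Foo 43'),(44,'Foo 44');

-- One insert-select populates detail AND records the link to master,
-- because the foreign key lives on the detail side.
INSERT INTO detail (master_id, other_data)
SELECT id, 'Bar ' || id FROM master;
""")

rows = cur.execute("""
SELECT m.data, d.other_data
FROM master m JOIN detail d ON d.master_id = m.id
WHERE m.id = 42
""").fetchall()
print(rows)  # [('Foo 42', 'Bar 42')]
```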
