This is my table's create script:
CREATE TABLE IF NOT EXISTS replacing_test (
addr String,
ver UInt64,
stt String,
time DateTime,
) engine=ReplacingMergeTree(ver)
PARTITION BY toYYYYMM(time)
PRIMARY KEY addr
ORDER BY addr
I have 2 rows as follows:
ABC | 0 | NEW | 2020-04-17 12:39:52
ABC | 2 | DONE | 2020-04-17 12:40:52
When I insert the 2 rows above at separate times, in the order shown, I get this after merging:
ABC | 2 | DONE | 2020-04-17 12:40:52
That is what I expected.
But when I insert these 2 rows at the same time by reading them from a backup, in random order, the result is:
ABC | 0 | DONE | 2020-04-17 12:39:52
Is there any behavior that I don't know about here?
I have created an Informatica flow
where I need to read a single column, containing employee IDs, from a table.
The column might contain duplicates, and I need to write the distinct values returned by the query below to a file.
Query:
select distinct
emp_id
from
employee
where
emp_id not in
(
select distinct
custid
from
customer
);
I have added the above query in the Source Qualifier.
The employee table contains 5 million records and the customer table contains 20 billion records.
My Informatica session is still running: more than 6 hours so far, and nothing has been written to the file because of the huge amount of data in both tables.
Following is my query plan:
--------------------------------------------------
| Id | Operation                  | Name       |
--------------------------------------------------
|  0 | SELECT STATEMENT           |            |
|  1 | PX COORDINATOR             |            |
|  2 | PX SEND QC (RANDOM)        | :TQ10002   |
|  3 | HASH UNIQUE                |            |
|  4 | PX RECEIVE                 |            |
|  5 | PX SEND HASH               | :TQ10001   |
|  6 | HASH UNIQUE                |            |
|  7 | HASH JOIN ANTI             |            |
|  8 | PX RECEIVE                 |            |
|  9 | PX SEND PARTITION (KEY)    | :TQ10000   |
| 10 | PX SELECTOR                |            |
| 11 | INDEX FAST FULL SCAN       | PK_EMP_ID  |
| 12 | PX PARTITION RANGE ALL     |            |
| 13 | INDEX FAST FULL SCAN       | PK_CUST_ID |
--------------------------------------------------
Sample table data :
employee
111
123
145
1345
111
123
145
678
....
customer
111
111
111
1345
111
145
145
145
145
145
145
....
Expected output :
123
678
Any solution is much appreciated !!!
It seems to me the SQL is the problem. If you don't have anything like a Sorter or Aggregator in the mapping, you don't have to worry about Informatica.
The SQL contains expensive operations. You can try the query below:
select emp_id
from employee
where not exists
(select 1 from customer where custid =emp_id)
This should be a little faster because:
you aren't running a subquery to get the distinct values from the 20-billion-row customer table;
you don't need DISTINCT in the outer select, provided emp_id is unique in the employee table, and NOT EXISTS never introduces duplicates of its own.
You can also use a left join plus a WHERE filter, but I think it will be expensive because of join-induced duplicates.
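To make the difference between the two forms concrete, here is a minimal sketch in Python that runs both queries against an in-memory SQLite database loaded with the sample data from the question. SQLite is only a stand-in for the real database, so it says nothing about performance at 20 billion rows. DISTINCT is kept because the sample employee data contains repeated IDs. The last step illustrates a well-known semantic difference: a single NULL custid makes NOT IN return no rows at all, while NOT EXISTS is unaffected.

```python
import sqlite3

# In-memory SQLite stands in for the real database (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER)")
conn.execute("CREATE TABLE customer (custid INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?)",
                 [(v,) for v in (111, 123, 145, 1345, 111, 123, 145, 678)])
conn.executemany("INSERT INTO customer VALUES (?)",
                 [(v,) for v in (111, 111, 111, 1345, 111, 145, 145)])

NOT_EXISTS_SQL = """
    SELECT DISTINCT e.emp_id
    FROM employee e
    WHERE NOT EXISTS (SELECT 1 FROM customer c WHERE c.custid = e.emp_id)
    ORDER BY e.emp_id
"""
NOT_IN_SQL = """
    SELECT DISTINCT emp_id
    FROM employee
    WHERE emp_id NOT IN (SELECT custid FROM customer)
    ORDER BY emp_id
"""

def ids(sql):
    """Run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql)]

not_exists_ids = ids(NOT_EXISTS_SQL)
not_in_ids = ids(NOT_IN_SQL)
print(not_exists_ids)  # [123, 678]

# One NULL in customer.custid: NOT IN now filters out every row,
# while NOT EXISTS still returns the same answer.
conn.execute("INSERT INTO customer VALUES (NULL)")
not_in_with_null = ids(NOT_IN_SQL)
not_exists_with_null = ids(NOT_EXISTS_SQL)
print(not_in_with_null, not_exists_with_null)
```

If custid can never be NULL (for example because it is a primary key), both forms return the same rows and the choice is only about the execution plan.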
I would start by partitioning the customer table by hash or range(customer_id), or by insert_date; this should speed up your inline select substantially.
Also try this:
select emp_id from employee
minus
select emp_id from employee e, customer c
where e.emp_id=c.custid;
I am running a Docker image of Vertica on Windows. I have created a table in Vertica with this schema (student_id is the primary key):
dbadmin@d1f942c8c1e0(*)=> \d testschema.student;
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
------------+---------+------------+-------------+------+---------+----------+-------------+-------------
testschema | student | student_id | int | 8 | | t | t |
testschema | student | name | varchar(20) | 20 | | f | f |
testschema | student | major | varchar(20) | 20 | | f | f |
(3 rows)
student_id is the primary key. I am testing loading data from a CSV file using the COPY command.
First I used an insert: insert into testschema.student values (1,'Jack','Biology');
Then I created a CSV file in the /home/dbadmin/vertica_test directory:
vi student.csv
2,Kate,Sociology
3,Claire,English
4,Jack,Biology
5,Mike,Comp. Sci
Then I ran this command:
copy testschema.student from '/home/dbadmin/vertica_test/student.csv' delimiter ',' rejected data as table students_rejected;
I tested the result:
select * from testschema.student - shows 5 rows
select * from students_rejected; - no rows
Then I created another CSV file, with bad data, in the /home/dbadmin/vertica_test directory:
vi student_bad.csv
bad_data_type_for_student_id,UnaddedStudent, UnaddedSubject
6,Cassey,Physical Education
I loaded the data from the bad CSV file:
copy testschema.student from '/home/dbadmin/vertica_test/student_bad.csv' delimiter ',' rejected data as table students_rejected;
Then I tested the output:
select * from testschema.student - shows 6 rows <-- only one row got added. all ok
select * from students_rejected; - shows 1 row <-- bad row's entry is here. all ok
All looks good.
Then I loaded the bad data again, without the rejected data option:
copy testschema.student from '/home/dbadmin/vertica_test/student_bad.csv' delimiter ',' ;
But now the entry with student id 6 got added again!
student_id | name | major
------------+--------+--------------------
1 | Jack | Biology
2 | Kate | Sociology
3 | Claire | English
4 | Jack | Biology
5 | Mike | Comp. Sci
6 | Cassey | Physical Education <--
6 | Cassey | Physical Education <--
Shouldn't this have got rejected?
If you created your students table with a command of this type:
DROP TABLE IF EXISTS students;
CREATE TABLE students (
student_id int
, name varchar(20)
, major varchar(20)
, CONSTRAINT pk_students PRIMARY KEY(student_id)
);
that is, without the explicit keyword ENABLED, then the primary key constraint is disabled by default. In other words, you can happily insert duplicates, but you will run into an error if you later join to the students table via the primary key column.
With the primary key constraint enabled ...
[...]
, CONSTRAINT pk_students PRIMARY KEY(student_id) ENABLED
[...]
I think you get the desired effect.
The whole scenario:
DROP TABLE IF EXISTS students;
CREATE TABLE students (
student_id int
, name varchar(20)
, major varchar(20)
, CONSTRAINT pk_students PRIMARY KEY(student_id) ENABLED
);
INSERT INTO students
SELECT 1,'Jack' ,'Biology'
UNION ALL SELECT 2,'Kate' ,'Sociology'
UNION ALL SELECT 3,'Claire','English'
UNION ALL SELECT 4,'Jack' ,'Biology'
UNION ALL SELECT 5,'Mike' ,'Comp. Sci'
UNION ALL SELECT 6,'Cassey','Physical Education'
;
-- out OUTPUT
-- out --------
-- out 6
COMMIT;
COPY students FROM STDIN DELIMITER ','
REJECTED DATA AS TABLE students_rejected;
6,Cassey,Physical Education
\.
-- out vsql:/home/gessnerm/._vfv.sql:4: ERROR 6745:
-- out Duplicate key values: 'student_id=6'
-- out -- violates constraint 'dbadmin.students.pk_students'
SELECT * FROM students;
-- out student_id | name | major
-- out ------------+--------+--------------------
-- out 1 | Jack | Biology
-- out 2 | Kate | Sociology
-- out 3 | Claire | English
-- out 4 | Jack | Biology
-- out 5 | Mike | Comp. Sci
-- out 6 | Cassey | Physical Education
SELECT * FROM students_rejected;
-- out node_name | file_name | session_id | transaction_id | statement_id | batch_number | row_number | rejected_data | rejected_data_orig_length | rejected_reason
-- out -----------+-----------+------------+----------------+--------------+--------------+------------+---------------+---------------------------+-----------------
-- out (0 rows)
With the constraint disabled, the only reliable check seems to be the ANALYZE_CONSTRAINTS() call:
ALTER TABLE students ALTER CONSTRAINT pk_students DISABLED;
-- out Time: First fetch (0 rows): 7.618 ms. All rows formatted: 7.632 ms
COPY students FROM STDIN DELIMITER ','
REJECTED DATA AS TABLE students_rejected;
6,Cassey,Physical Education
\.
-- out Time: First fetch (0 rows): 31.790 ms. All rows formatted: 31.791 ms
SELECT * FROM students;
-- out student_id | name | major
-- out ------------+--------+--------------------
-- out 1 | Jack | Biology
-- out 2 | Kate | Sociology
-- out 3 | Claire | English
-- out 4 | Jack | Biology
-- out 5 | Mike | Comp. Sci
-- out 6 | Cassey | Physical Education
-- out 6 | Cassey | Physical Education
SELECT * FROM students_rejected;
-- out node_name | file_name | session_id | transaction_id | statement_id | batch_number | row_number | rejected_data | rejected_data_orig_length | rejected_reason
-- out -----------+-----------+------------+----------------+--------------+--------------+------------+---------------+---------------------------+-----------------
-- out (0 rows)
SELECT ANALYZE_CONSTRAINTS('students');
-- out Schema Name | Table Name | Column Names | Constraint Name | Constraint Type | Column Values
-- out -------------+------------+--------------+-----------------+-----------------+---------------
-- out dbadmin | students | student_id | pk_students | PRIMARY | ('6')
-- out (1 row)
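For readers without a Vertica instance at hand, the kind of check that ANALYZE_CONSTRAINTS() reports for a primary key can be approximated with a plain GROUP BY ... HAVING query. The following is only a sketch in Python with an in-memory SQLite table that has no declared key, which mimics the disabled constraint; it is not how Vertica implements the function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No PRIMARY KEY clause: this mimics a disabled (unenforced) constraint.
conn.execute("CREATE TABLE students (student_id INTEGER, name TEXT, major TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [
        (1, "Jack", "Biology"),
        (2, "Kate", "Sociology"),
        (3, "Claire", "English"),
        (4, "Jack", "Biology"),
        (5, "Mike", "Comp. Sci"),
        (6, "Cassey", "Physical Education"),
        (6, "Cassey", "Physical Education"),  # the duplicate load
    ],
)

# Every key value that occurs more than once violates the intended primary key.
duplicates = conn.execute(
    """
    SELECT student_id, COUNT(*) AS copies
    FROM students
    GROUP BY student_id
    HAVING COUNT(*) > 1
    ORDER BY student_id
    """
).fetchall()
print(duplicates)  # [(6, 2)]
```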
I have 4 tables:
"Cars" table, where every car has an ID.
"Operations" table that holds the operations that have been done on a car:
| ID | CarID | Operation | User | JournalID |
| --- | ----- | --------- | ---- | --------- |
"Transactions" table that records the costs of the operations and other daily expenses, where every operation has 2 transactions, one is > 0 and the other is < 0 (for example: +100 and -100):
| ID | Account | JournalID | Amount | Date |
| --- | ------- | --------- | ------ | ---- |
"Journal" table that records the daily finance:
| ID | Amount | Date |
| --- | ------ | ---- |
What I want is the sum of the operation costs for a specific car. I was looping through all operations of that car and then, for each one, looping through every journal row to sum the amounts, which of course performed badly.
What can I do to get the result as fast as possible?
NOTE: ALL THE NAMES OF THE COLUMNS ARE IN LOWER CASE
You need to join the three tables on their foreign keys, like this; with a select and a raw expression you can get the sum you need:
DB::table('cars')->join('operations','cars.id','operations.car_id')
->join('journal','journal.id','operations.journal_id')
->select(DB::raw('SUM(amount) as total_cost'),'cars.*')
->groupBy('cars.id')
->get();
I ended up solving it with a modified version of @Segar's answer:
$result = DB::table('cars')
->join('operations','cars.id','operations.car_id')
->join('journal','journal.id','operations.journal')
->join('transactions','transactions.journal','operations.journal')
->where('transactions.type', 0)
->select(DB::raw('SUM(amount) as total_cost'),'cars.*')
->groupBy('cars.id')
->get();
print_r($result);
Thanks.
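The idea behind both answers is to let the database do the summing in a single joined, grouped query instead of looping in application code. Below is a small sketch of that idea in Python with an in-memory SQLite database. The column names (carid, journalid, amount) follow the tables described in the question, the rows are made up for illustration, and counting only the positive transaction of each pair is an assumption about how a "cost" should be read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE operations (id INTEGER PRIMARY KEY, carid INTEGER,
                             operation TEXT, journalid INTEGER);
    CREATE TABLE transactions (id INTEGER PRIMARY KEY, account TEXT,
                               journalid INTEGER, amount INTEGER);
    -- Hypothetical sample rows: two operations on car 10, one on car 11.
    INSERT INTO operations VALUES
        (1, 10, 'oil change', 100),
        (2, 10, 'new tyres', 101),
        (3, 11, 'wash', 102);
    -- Each journal entry has a +amount and a -amount transaction.
    -- Journal 103 is a daily expense that belongs to no operation.
    INSERT INTO transactions VALUES
        (1, 'cash', 100, 50),  (2, 'service', 100, -50),
        (3, 'cash', 101, 200), (4, 'service', 101, -200),
        (5, 'cash', 102, 15),  (6, 'service', 102, -15),
        (7, 'cash', 103, 30),  (8, 'office', 103, -30);
    """
)

# One grouped query replaces the nested loops: join each operation to its
# transactions and keep only the positive side of every pair.
totals = conn.execute(
    """
    SELECT o.carid, SUM(t.amount) AS total_cost
    FROM operations o
    JOIN transactions t ON t.journalid = o.journalid
    WHERE t.amount > 0
    GROUP BY o.carid
    ORDER BY o.carid
    """
).fetchall()
print(totals)  # [(10, 250), (11, 15)]
```

For a single car, adding a filter such as `AND o.carid = ?` to the WHERE clause avoids computing totals for every other car.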
I have a table that I load with an SCD Type 2 job. The first run loaded the data below:
| no | name | loc |
| --- | --- | --- |
| 1 | abc | hyd |
| 2 | def | bang |
| 3 | ghi | chennai |
Then the second run loaded the data given below:
| no | name | loc |
| --- | --- | --- |
| 1 | abc | hyd |
| 2 | def | bang |
| 3 | ghi | chennai |
| 1 | abc | bang |
The table has no dates, flags, or run IDs.
How can I find the updated (second) record in this situation?
Thanks
I don't think you'll be able to distinguish between the updated record and the original record.
A dimension table using Type 2 SCD requires additional columns that describe the period in which the record is valid (or current), exactly for this reason.
The solution is to ensure your dimension table has these columns (Typically ValidFrom and ValidTo dates or date/times, and sometimes an IsCurrent flag for good measure). Your ETL process would then populate these columns as part of making the Type 2 updates.
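As an illustration of how those columns make the updated record identifiable, here is a minimal sketch in Python with an in-memory SQLite table. The table name dim_person, the column names valid_from, valid_to and is_current, the helper apply_scd2, and the load dates are all hypothetical; a real ETL tool would generate the equivalent logic for you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dim_person (
        no INTEGER, name TEXT, loc TEXT,
        valid_from TEXT, valid_to TEXT, is_current INTEGER
    )
    """
)

def apply_scd2(conn, no, name, loc, load_date):
    """Insert a new version of a row, closing the current version if it changed."""
    current = conn.execute(
        "SELECT name, loc FROM dim_person WHERE no = ? AND is_current = 1",
        (no,),
    ).fetchone()
    if current == (name, loc):
        return  # unchanged: keep the current version as it is
    if current is not None:
        # Close the old version instead of overwriting it.
        conn.execute(
            "UPDATE dim_person SET valid_to = ?, is_current = 0 "
            "WHERE no = ? AND is_current = 1",
            (load_date, no),
        )
    conn.execute(
        "INSERT INTO dim_person VALUES (?, ?, ?, ?, NULL, 1)",
        (no, name, loc, load_date),
    )

# First run (hypothetical load date).
for row in [(1, "abc", "hyd"), (2, "def", "bang"), (3, "ghi", "chennai")]:
    apply_scd2(conn, *row, "2020-01-01")
# Second run: only record 1 moved from hyd to bang.
for row in [(1, "abc", "bang"), (2, "def", "bang"), (3, "ghi", "chennai")]:
    apply_scd2(conn, *row, "2020-02-01")

history = conn.execute(
    "SELECT no, loc, valid_from, valid_to, is_current "
    "FROM dim_person WHERE no = 1 ORDER BY valid_from"
).fetchall()
total_rows = conn.execute("SELECT COUNT(*) FROM dim_person").fetchone()[0]
print(history)
```

With these columns in place, the updated record is simply the row with is_current = 1 (or a NULL valid_to) for a key that has more than one version.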
Is it possible to create a hive table with user-specified number of records?
For example, I want to create a table with x number of rows (where x is defined by the user). The table would have two columns 1. unique row id [could be auto-incremented] 2. Randomly generated String.
Is this possible using Hive?
set N=7;
select pe.i+1 as n
,java_method ('org.apache.commons.lang.RandomStringUtils','randomAlphabetic',10) as str
from (select 1) x
lateral view posexplode(split(space(${hiveconf:N}-1),' ')) pe as i,x
;
+---+------------+
| n | str |
+---+------------+
| 1 | udttBCmtxT |
| 2 | kkrMQmirSG |
| 3 | iYDABgXOvW |
| 4 | DKHKgtXKPS |
| 5 | ylebKcdcGj |
| 6 | DaujBCkCtz |
| 7 | VMaWfbtzFY |
+---+------------+
See: posexplode, java_method, RandomStringUtils
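The row-generation trick in the query above may be easier to follow when spelled out step by step. This Python sketch mirrors the same steps outside Hive: a string of N-1 blanks split on a blank yields N empty pieces, and enumerating them plays the role of posexplode. The function name generate_rows and its parameters are made up for the illustration, and it assumes n is at least 1.

```python
import random
import string

def generate_rows(n, length=10, rng=random):
    """Return n rows of (row id, random alphabetic string), for n >= 1."""
    # space(n - 1) in Hive is a string of n - 1 blanks; splitting it on ' '
    # gives n empty strings, one per output row.
    pieces = (" " * (n - 1)).split(" ")
    # posexplode yields (position, value) pairs starting at 0, hence the + 1.
    return [
        (i + 1, "".join(rng.choice(string.ascii_letters) for _ in range(length)))
        for i, _ in enumerate(pieces)
    ]

rows = generate_rows(7)
for row_id, text in rows:
    print(row_id, text)
```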
Specifying a limit on the number of rows at table creation time may not be possible, but it is possible to limit the number of rows inserted into the table by using a LIMIT clause:
-- <filename:dbloader.sql>
create table ${hiveconf:TABLENAME} (id int, string1 string);
insert into table ${hiveconf:TABLENAME}
select id, string1 from oldtable limit ${hiveconf:ROWLIMIT};
and when submitting the Hive script:
hive --hiveconf TABLENAME='XYZ' --hiveconf ROWLIMIT=1000 -f dbloader.sql
As for creating a unique incremental id, you will have to write a UDF for it.