number of rows affected in HiveQL - hadoop

Is there a way to get the rows-affected count after running a CTAS in Hive?
I am running a
create table t1 as select * from t2 where ... ;
Basically, I would like to print the number of rows in the new table for logging purposes.
Thanks!

Hive does report the number of rows as part of a CTAS; see the stats line in this example:
Table default.errors2 stats: [num_partitions: 0, num_files: 1, num_rows: 860, total_size: 17752, raw_data_size: 16892]
More details of the output:
hive> create table errors2 as select * from errors;
..
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://localhost:9000/tmp/hive-steve/hive_2014-12-13_06-00-40_553_7396982929134959624/-ext-10001
Moving data to: hdfs://localhost:9000/user/hive/warehouse/errors2
Table default.errors2 stats: [num_partitions: 0, num_files: 1, num_rows: 860, total_size: 17752, raw_data_size: 16892]
OK
dayandhour dowandhour cnt
Time taken: 7.348 seconds
UPDATE: The OP asked about saving the row count in a variable. There is no built-in Hive command for that, AFAIK. You could, however, run the query from the command line and parse the output:
hive -e "<hivesql>" | grep "num_rows" | <regex command to isolate the num_rows>
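For example, a minimal sketch of such a pipeline (the CTAS statement is illustrative, and it assumes the stats line is printed in exactly the format shown above; Hive writes these log lines to stderr, hence the redirect):
# print only the row count of the newly created table
hive -e "create table t1 as select * from t2 where 1=1;" 2>&1 \
  | grep -o 'num_rows: [0-9]*' \
  | awk '{print $2}'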

Related

Joining two kafka engine tables in clickhouse

I am trying to join two tables as follows:
CREATE MATERIALIZED VIEW db.data_v ON CLUSTER shard1 TO db.table
AS
SELECT JSON_VALUE(db.table2_queue.message, '$.after.id') bid,
JSON_VALUE(message, '$.after.brand_id') AS brand_id,
JSON_VALUE(message, '$.after.id') AS id
FROM
db.table1_queue lq
Join db.table2_queue bq
on JSON_VALUE(bq.message, '$.after.id') = JSON_VALUE(lq.message, '$.after.brand_id')
However, I got an empty result:
0 rows in set. Elapsed: 0.006 sec.

MonetDB: jdbc based queries return -1 as affected rows often even though the query was successful

e.g., the queries below:
create table t1 ( age int, name varchar(10) );
insert into t1 values (1, 'name'), (2, 'surname');
copy select * from t1 into 't1.dat' DELIMITERS '|','\n','"' null as '';
The COPY SELECT command returns -1 as the affected-row count, although it should return 2. I am not sure why this is so; at many other times I have seen the same query return the affected-row count correctly.
If I run the same query in the DBeaver tool I am using, I see this:
Updated Rows: -1
Query: copy select * from t1 into 't1.dat' DELIMITERS '|','\n','"' null as ''
Finish time: Sat Apr 30 16:53:28 IST 2022
I think you have to use an absolute path to the export file, e.g.
copy select * from t1 into '/home/user/t1.dat' DELIMITERS '|','\n','"' null as '';
When you see a message like "2 affected rows", it may actually be the result summary of the successfully completed INSERT INTO statement; that doesn't mean the COPY INTO <file> statement succeeded.

Optimize table does not run right after table mutation

I'm updating a table with a table mutation like this:
ALTER TABLE T1
UPDATE column1 = replaceAll(column1, 'X', 'Y') WHERE 1
After that, I'm sending an optimize-final command with clickhouse-client, like this:
OPTIMIZE TABLE T1 FINAL
Ok.
0 rows in set. Elapsed: 0.002 sec.
But it returns instantly (0.002 sec.) and I can see the rows are not updated yet.
After a while (10-50 seconds) I run the optimize-final command again, and this time it hangs until the table is optimized.
Is this the expected behavior of optimize-final?
I can see the rows are not updated yet.
ALTER TABLE T1 UPDATE -- asynchronous
You should check that your mutation has finished: select count() from system.mutations where not is_done;
In newer versions you can run mutations synchronously:
ALTER TABLE T1 UPDATE column1 = replaceAll(column1, 'X', 'Y') WHERE 1
SETTINGS mutations_sync = 2
mutations_sync (default 0): "Wait for synchronous execution of ALTER TABLE UPDATE/DELETE queries (mutations). 0 - execute asynchronously. 1 - wait current server. 2 - wait all replicas if they exist."
OPTIMIZE TABLE T1 FINAL
OPTIMIZE triggers a merge; a merge has no relation to mutations.
0 rows in set. Elapsed: 0.002 sec.
In some cases OPTIMIZE cannot start and returns immediately.
Use optimize_throw_if_noop to find out the reason:
set optimize_throw_if_noop = 1;
OPTIMIZE TABLE T1 FINAL;
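Putting the pieces together, a minimal end-to-end sketch of the flow above (it assumes column1 is a String column and a ClickHouse version that supports mutations_sync):
-- run the mutation and wait for all replicas to apply it
ALTER TABLE T1 UPDATE column1 = replaceAll(column1, 'X', 'Y') WHERE 1
SETTINGS mutations_sync = 2;
-- verify that no mutation is still pending (should return 0)
SELECT count() FROM system.mutations WHERE NOT is_done;
-- merge the parts, failing loudly if the merge cannot start
SET optimize_throw_if_noop = 1;
OPTIMIZE TABLE T1 FINAL;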

Oracle - Column Histograms Showing NONE even after GATHER_TABLE_STATS

I am trying to do performance tuning on a SQL query in Oracle 12c that uses a window function. There is an index on HUB_POL_KEY, PIT_EFF_START_DT on the table PIT. When running the query with the /*+ gather_plan_statistics */ hint, I observed a Window Sort step in the plan with an estimated row count of 5000K and an actual row count of 1100. I executed DBMS_STATS.GATHER_TABLE_STATS on the table, but when I check USER_TAB_COLUMNS I see no histogram for HUB_POL_KEY or PIT_EFF_START_DT, while histograms exist for all other columns.
SQL Query
SELECT
PIT.HUB_POL_KEY,
NVL(LEAD(PIT.PIT_EFF_START_DT) OVER (PARTITION BY PIT.HUB_POL_KEY ORDER BY PIT.PIT_EFF_START_DT) ,TO_DATE('31.12.9999', 'DD.MM.YYYY')) EFF_END_DT
FROM PIT
1st Try:
EXEC DBMS_STATS.GATHER_TABLE_STATS('stg','PIT');
2nd Try:
EXEC DBMS_STATS.GATHER_TABLE_STATS('stg','PIT', method_opt=>('FOR COLUMNS SIZE 254 (HUB_POL_KEY,PIT_EFF_START_DT)'));
Checking Histogram:
SELECT HISTOGRAM FROM USER_TAB_COLUMNS
WHERE TABLE_NAME = 'PIT'
AND COLUMN_NAME IN ('HUB_POL_KEY','PIT_EFF_START_DT') --NONE
Table Statistics:
SELECT COUNT(*) FROM PIT --5570253
SELECT COLUMN_NAME,NUM_DISTINCT,NUM_BUCKETS,HISTOGRAM FROM USER_TAB_COL_STATISTICS
WHERE TABLE_NAME = 'PIT'
AND COLUMN_NAME IN ('HUB_POL_KEY','PIT_EFF_START_DT')
+------------------+--------------+-------------+-----------+
| COLUMN_NAME | NUM_DISTINCT | NUM_BUCKETS | HISTOGRAM |
+------------------+--------------+-------------+-----------+
| HUB_POL_KEY | 4703744 | 1 | NONE |
| PIT_EFF_START_DT | 154416 | 1 | NONE |
+------------------+--------------+-------------+-----------+
What am I missing here? Why is NUM_BUCKETS 1 even though I ran the GATHER_TABLE_STATS procedure with a method_opt that specifies a size?
The correct syntax as per the Oracle documentation should be method_opt=>('FOR COLUMNS (HUB_POL_KEY,PIT_EFF_START_DT) SIZE 254'). Trying it did not create the histogram stats as expected though; a likely reason is that a parenthesized column list is treated as a column group for extended statistics rather than as two individual columns (or maybe it is a bug ¯\_(ツ)_/¯).
On the other hand, method_opt=>('FOR ALL COLUMNS SIZE 254') and method_opt=>('FOR COLUMNS <column_name> SIZE 254') both work fine.
A workaround, then, would be to gather stats for the columns separately:
EXEC DBMS_STATS.GATHER_TABLE_STATS('stg','PIT', method_opt=>('FOR COLUMNS HUB_POL_KEY SIZE 254'));
EXEC DBMS_STATS.GATHER_TABLE_STATS('stg','PIT', method_opt=>('FOR COLUMNS PIT_EFF_START_DT SIZE 254'));
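If the workaround succeeds, re-running the USER_TAB_COL_STATISTICS check from above should now show more than one bucket and a histogram type (e.g. HYBRID or HEIGHT BALANCED) for both columns; this expectation assumes the per-column gathers completed without error:
SELECT COLUMN_NAME, NUM_DISTINCT, NUM_BUCKETS, HISTOGRAM
FROM USER_TAB_COL_STATISTICS
WHERE TABLE_NAME = 'PIT'
AND COLUMN_NAME IN ('HUB_POL_KEY','PIT_EFF_START_DT');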

Performance issue in hive version 0.13.1

I use AWS EMR to run my Hive queries and I have a performance issue while running Hive version 0.13.1.
On this newer Hive version, the query takes around 5 minutes for 10 rows of data, but the same script against 230,804 rows has been running for 2 days and is still going. What should I do to analyze and fix the problem?
Sample Data:
Table 1:
hive> describe foo;
OK
orderno string
Time taken: 0.101 seconds, Fetched: 1 row(s)
Sample data for table1:
hive> select * from foo;
OK
1826203307
1826207803
1826179498
1826179657
Table 2:
hive> describe bar;
OK
id bigint
startorderno bigint
endorderno bigint
itemcode int
Time taken: 0.047 seconds, Fetched: 4 row(s)
Sample data for Table 2:
hive> select * from bar;
127698025 417880320 417880575 306
127698025 3038626048 3038626303 584
127698025 3038626304 3038626431 269
127698025 3038626560 3038626815 163
My Query:
SELECT b.itemcode
FROM foo a, bar b
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;
At the very top of your Hive log output, it states: "Warning: Shuffle Join JOIN[4][Tables a, b] in Stage 'Stage-1 Mapred' is a cross product."
EDIT:
A 'cross product' or Cartesian product is a join without conditions, which returns every row of table 'b' for every row of table 'a'. If 'a' has 5 rows and 'b' has 10 rows, you get the product, 5 multiplied by 10 = 50 rows returned, most of them pairing rows that have nothing to do with each other.
Now, if you have a table 'a' of 20,000 rows and join it to a table 'b' of 500,000 rows, you are asking the SQL engine to produce a data set 'a, b' of 10,000,000,000 rows, and then perform the BETWEEN operation on those 10 billion rows.
So reducing the number of 'b' rows buys you more than reducing 'a'. In your example, if you can filter table 2 (the IP-logs table, bar), which I am guessing has more rows than your order-number table, that will cut down the execution time.
END EDIT
You're forcing the execution engine to work through a Cartesian product by not specifying a join condition, so it has to scan all of table 'a' over and over. With 10 rows you will not have a problem; with 20k you are running into dozens of map/reduce waves.
Try this query:
SELECT b.itemcode
FROM foo a JOIN bar b on <SomeKey>
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;
But I'm having trouble figuring out which column your data model would allow joining on; maybe the data model could be improved, or it may just be me not reading the sample clearly.
Either way, you need to cut down the number of comparisons BEFORE the WHERE clause is applied. Another way I have done this in Hive is to create a view over a smaller set of data and join against the view instead of the original table, as in the sketch below.
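A minimal HiveQL sketch of that view approach (the view name and the range bounds are hypothetical, and the CAST makes the string-to-bigint comparison explicit):
-- hypothetical pre-filtered view over the large range table
CREATE VIEW bar_small AS
SELECT startorderno, endorderno, itemcode
FROM bar
WHERE endorderno >= 1826000000     -- assumed lower bound of foo's order numbers
AND startorderno <= 1827000000;    -- assumed upper bound

SELECT b.itemcode
FROM foo a, bar_small b
WHERE CAST(a.orderno AS BIGINT) BETWEEN b.startorderno AND b.endorderno;
This is still a cross product, but against a much smaller 'b', which is where the time goes.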
