I have Kafka Connect set up to dump data from my Topic into HDFS with Hive enabled:
"confluent.topic.bootstrap.servers": "kafka-1:19092,kafka-2:29092,kafka-3:39092",
"connector.class": "io.confluent.connect.hdfs3.Hdfs3SinkConnector",
"flush.size": "3",
"hdfs.url": "hdfs://namenode:9000",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"logs.dir": "logs",
"name": "kafka to hdfs - repos",
"topics": "repos",
"topics.dir": "topics",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081",
"hive.integration": "true",
"hive.metastore.uris": "thrift://hive-metastore:9083",
"schema.compatibility": "BACKWARD"
When running the task, data is sent to HDFS:
# hdfs dfs -ls /topics/repos/partition=0
Found 3 items
-rw-r--r-- 3 appuser supergroup 281 2022-12-02 13:42 /topics/repos/partition=0/repos+0+0000000000+0000000002.avro
-rw-r--r-- 3 appuser supergroup 294 2022-12-02 13:42 /topics/repos/partition=0/repos+0+0000000003+0000000005.avro
-rw-r--r-- 3 appuser supergroup 283 2022-12-02 13:42 /topics/repos/partition=0/repos+0+0000000006+0000000008.avro
And the table is created in the Hive metastore:
0: jdbc:hive2://localhost:10000> show tables;
+-----------+
| tab_name |
+-----------+
| repos |
+-----------+
But for some reason the table is empty / no data loaded from the HDFS files:
0: jdbc:hive2://localhost:10000> select * from repos;
+------------------+------------------+
| repos.repo_name | repos.partition |
+------------------+------------------+
+------------------+------------------+
When looking at the configuration of the generated table, it looks like the hookup is fine:
0: jdbc:hive2://localhost:10000> DESCRIBE FORMATTED repos;
+-------------------------------+----------------------------------------------------+-----------------------------+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+-----------------------------+
| # col_name | data_type | comment |
| | NULL | NULL |
| repo_name | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| partition | string | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| Owner: | null | NULL |
| CreateTime: | Sat Dec 03 11:46:56 UTC 2022 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://namenode:9000/topics/repos | NULL |
| Table Type: | EXTERNAL_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\"} |
| | EXTERNAL | TRUE |
| | bucketing_version | 2 |
| | numFiles | 0 |
| | numPartitions | 0 |
| | numRows | 0 |
| | rawDataSize | 0 |
| | totalSize | 0 |
| | transient_lastDdlTime | 1670068016 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | serialization.format | 1 |
+-------------------------------+----------------------------------------------------+-----------------------------+
I can't seem to find any errors being written anywhere in the logs. I think it could be related to the sub-folder "partition=0", but I am not quite sure how Hive deals with this.
Maybe I am lacking something in the configuration? Or there is something special that has to be done for the data to be loaded?
Related
I have problem with Apache NiFi 1.12.1. For some unknown for me reason CaptureChangeMySQL returns many nulls. Basically, only columns which are int, return correct values. I'm new in a matter of using NiFi so I might miss some obvious thing in configuration.
I have following table:
create table inventory.abc
(
id int auto_increment
primary key,
first_name varchar(100) not null,
last_name varchar(100) not null,
age int not null
);
Processor config:
Bin logs settings:
mysql> show variables like '%bin%';
+--------------------------------------------+--------------------------------+
| Variable_name | Value |
+--------------------------------------------+--------------------------------+
| bind_address | * |
| binlog_cache_size | 32768 |
| binlog_checksum | CRC32 |
| binlog_direct_non_transactional_updates | OFF |
| binlog_error_action | ABORT_SERVER |
| binlog_format | ROW |
| binlog_group_commit_sync_delay | 0 |
| binlog_group_commit_sync_no_delay_count | 0 |
| binlog_gtid_simple_recovery | ON |
| binlog_max_flush_queue_time | 0 |
| binlog_order_commits | ON |
| binlog_row_image | FULL |
| binlog_rows_query_log_events | OFF |
| binlog_stmt_cache_size | 32768 |
| binlog_transaction_dependency_history_size | 25000 |
| binlog_transaction_dependency_tracking | COMMIT_ORDER |
| innodb_api_enable_binlog | OFF |
| innodb_locks_unsafe_for_binlog | OFF |
| log_bin | ON |
| log_bin_basename | /var/lib/mysql/mysql-bin |
| log_bin_index | /var/lib/mysql/mysql-bin.index |
| log_bin_trust_function_creators | OFF |
| log_bin_use_v1_row_events | OFF |
| log_statements_unsafe_for_binlog | ON |
| max_binlog_cache_size | 18446744073709547520 |
| max_binlog_size | 1073741824 |
| max_binlog_stmt_cache_size | 18446744073709547520 |
| sql_log_bin | ON |
| sync_binlog | 1 |
+--------------------------------------------+--------------------------------+
29 rows in set (0.00 sec)
And I get results like this:
Any idea why I get so many nulls in output? I thought it might be related to Distributed Map Cache Client but since this option is not mandatory I don't think that's a problem.
Yesterday, I was create my first datasource Druid from Hive. Today, I'm not sure that works...
First, I ran the following code for create my Db :
SET hive.druid.broker.address.default = 10.20.173.30:8082;
SET hive.druid.metadata.username = druid;
SET hive.druid.metadata.password = druid_password;
SET hive.druid.metadata.db.type = postgresql;
SET hive.druid.metadata.uri = jdbc:postgresql://10.20.173.31:5432/druid;
CREATE EXTERNAL TABLE test (
`__time` TIMESTAMP,
`userId` STRING,
`lang` STRING,
`location` STRING,
`name` STRING
)
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
I can see this datasource on my Hive architecture. How can I know that this datasource is a Druid Datasource and not a Hive table.
I tested this but I don't know if it's a Druid datasource.
DESCRIBE FORMATTED test;
Result
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name | data_type | comment |
| __time | timestamp | from deserializer |
| userid | string | from deserializer |
| lang | string | from deserializer |
| location | string | from deserializer |
| name | string | from deserializer |
| # Detailed Table Information | NULL | NULL |
| Database: | druid_datasources | NULL |
| OwnerType: | USER | NULL |
| Owner: | hive | NULL |
| CreateTime: | Tue Oct 15 12:42:22 CEST 2019 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://10.20.173.30:8020/warehouse/tablespace/external/hive/druid_datasources.db/test | NULL |
| Table Type: | EXTERNAL_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"__time\":\"true\",\"lang\":\"true\",\"location\":\"true\",\"name\":\"true\",\"userid\":\"true\"}} |
| | EXTERNAL | TRUE |
| | bucketing_version | 2 |
| | druid.datasource | druid_datasources.test ||
| | numFiles | 0 |
| | numRows | 0 |
| | rawDataSize | 0 |
| | storage_handler | org.apache.hadoop.hive.druid.DruidStorageHandler |
| | totalSize | 0 |
| | transient_lastDdlTime | 1571136142 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.druid.serde.DruidSerDe | NULL |
| InputFormat: | null | NULL |
| OutputFormat: | null | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | serialization.format | 1 |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
I did well or it's a Hive table with Druid parameters ?
Someone can explain me more about Hive/Druid interactions ?
Thanks :D
I think you registered your druid datasource in hive. Now you can run your queries using hive server on top of this table.
Your table definition look correct to me I think you managed to integrate druid datasoruce with hive. You can see druid related properties in table.
Now when you query the table it will use processing engine depending on the query it will use hive server along with druid. It can use combination of both or one of them on standalone basis to execute query. It depends whether that query can be converted to druid query or not.
You can refer to this doc for more info on Hive/Druid interactions : https://cwiki.apache.org/confluence/display/Hive/Druid+Integration (refer:Querying Druid from Hive)
I am learning hive and read an article about when to use HIVE external table and mentioned the statement below.
To query data stored in external system such as amazon s3
- Avoid brining in that data into HDFS
Can anyone elaborate above statement. "Avoid brining in that data into HDFS"? Load data local command will help to load local file into HDFS and HIVE is applying the format on the top.
Is it possible to access the data which is out of HDFS?
is it possible to access the data which is out of HDFS?
HIve can read data on any Hadoop Compatible filesystem, not only HDFS.
Can someone elaborate above statement. "Avoid brining in that data into HDFS "?
With the example of S3, you can create an external table with a location of s3a://bucket/path, there's no need to bring it to HDFS unless you really needed the speed of reading HDFS compared to S3. However, to persist a dataset in an ephemeral cloud cluster, results should be written back to whatever long-term storage is provided.
It is possible. You can try this yourself. On CDH, I have a file extn\t.txt
[cloudera#quickstart ~]$ pwd
/home/cloudera
[cloudera#quickstart ~]$ cat extn/t.txt
something
[cloudera#quickstart ~]$
I can now create an external table to access this file as follows
create external table tbl(line string)
location 'file:///home/cloudera/extn'
Describe table
INFO : OK
+-----------+------------+----------+--+
| col_name | data_type | comment |
+-----------+------------+----------+--+
| line | string | |
+-----------+------------+----------+--+
1 row selected (0.152 seconds)
0: jdbc:hive2://localhost:10000>
Select
INFO : OK
+------------+--+
| tbl.line |
+------------+--+
| something |
+------------+--+
1 row selected (0.134 seconds)
0: jdbc:hive2://localhost:10000>
Describe formatted
+-------------------------------+----------------------------------------------------+-----------------------+--+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+-----------------------+--+
| # col_name | data_type | comment |
| | NULL | NULL |
| line | string | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| Owner: | cloudera | NULL |
| CreateTime: | Tue Feb 20 12:49:25 PST 2018 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Location: | file:/home/cloudera/extn | NULL |
| Table Type: | EXTERNAL_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | false |
| | EXTERNAL | TRUE |
| | numFiles | 0 |
| | numRows | -1 |
| | rawDataSize | -1 |
| | totalSize | 0 |
| | transient_lastDdlTime | 1519159765 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | serialization.format | 1 |
+-------------------------------+----------------------------------------------------+-----------------------+
Load data is different. Please check this External Table vs Load Data
I run same query in 2 environnements with huge performance différence : 0.015 sec vs 25sec.
Exlain plan :
+------+-------------+---------------+--------+------------------------------------+---------+---------+---------------------------------------------------------------------------------------------------------------------------+------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+------+-------------+---------------+--------+------------------------------------+---------+---------+---------------------------------------------------------------------------------------------------------------------------+------+----------+---------------------------------+
| 1 | SIMPLE | company1_ | const | PRIMARY | PRIMARY | 152 | const | 1 | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | user2_ | ref | PRIMARY | PRIMARY | 152 | const | 1032 | 100.00 | Using where |
| 1 | SIMPLE | vacationpr5_ | eq_ref | PRIMARY | PRIMARY | 304 | user2_.ID_COMPANY_VACATION_PROFILE,.user2_.ID_VACATION_PROFILE | 1 | 100.00 | Using index |
| 1 | SIMPLE | vacationac0_ | ref | PRIMARY,I_VACATION_ACCUMULATION_EA | PRIMARY | 304 | const,.user2_.ID_USER | 4 | 100.00 | Using where |
| 1 | SIMPLE | vacationty3_ | eq_ref | PRIMARY | PRIMARY | 304 | const,.vacationac0_.ID_VACATION_TYPE | 1 | 100.00 | Using where |
| 1 | SIMPLE | vacationst6_ | eq_ref | PRIMARY | PRIMARY | 608 | user2_.ID_COMPANY_VACATION_PROFILE,.user2_.ID_VACATION_PROFILE,const,.vacationac0_.ID_VACATION_TYPE | 1 | 100.00 | Using where |
| 1 | SIMPLE | translatio9_ | eq_ref | PRIMARY | PRIMARY | 919 | vacationty3_.ID_COMPANY_TRANSLATION,.vacationty3_.ID_TRANSLATION | 1 | 100.00 | Using index |
| 1 | SIMPLE | descriptio10_ | eq_ref | PRIMARY, | PRIMARY | 951 | vacationty3_.ID_COMPANY_TRANSLATION,.vacationty3_.ID_TRANSLATION,const | 1 | 100.00 | Using where |
| 1 | SIMPLE | listvalue4_ | ALL | NULL | NULL | NULL | NULL | 5284 | 100.00 | Using where |
| 1 | SIMPLE | translatio7_ | eq_ref | PRIMARY | PRIMARY | 919 | listvalue4_.ID_COMPANY_TRANSLATION,.listvalue4_.ID_TRANSLATION | 1 | 100.00 | Using index |
| 1 | SIMPLE | descriptio8_ | eq_ref | PRIMARY | PRIMARY | 951 | listvalue4_.ID_COMPANY_TRANSLATION,.listvalue4_.ID_TRANSLATION,const | 1 | 100.00 | Using where |
+------+-------------+---------------+--------+------------------------------------+---------+---------+---------------------------------------------------------------------------------------------------------------------------+------+----------+---------------------------------+
next explain plan :
+------+-------------+---------------+--------+------------------------------------+---------+---------+---------------------------------------------------------------------------------------------------------------------------------------+------+----------+-------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+------+-------------+---------------+--------+------------------------------------+---------+---------+---------------------------------------------------------------------------------------------------------------------------------------+------+----------+-------------------------------------------------+
| 1 | SIMPLE | company1_ | const | PRIMARY | PRIMARY | 152 | const | 1 | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | user2_ | ref | PRIMARY, | PRIMARY | 152 | const | 1050 | 100.00 | Using where |
| 1 | SIMPLE | vacationpr5_ | eq_ref | PRIMARY | PRIMARY | 304 | validation2.user2_.ID_COMPANY_VACATION_PROFILE,validation2.user2_.ID_VACATION_PROFILE | 1 | 100.00 | Using index |
| 1 | SIMPLE | vacationac0_ | ref | PRIMARY,I_VACATION_ACCUMULATION_EA | PRIMARY | 304 | const,validation2.user2_.ID_USER | 5 | 100.00 | Using where |
| 1 | SIMPLE | vacationty3_ | eq_ref | PRIMARY | PRIMARY | 304 | const,validation2.vacationac0_.ID_VACATION_TYPE | 1 | 100.00 | Using where |
| 1 | SIMPLE | vacationst6_ | eq_ref | PRIMARY | PRIMARY | 608 | validation2.user2_.ID_COMPANY_VACATION_PROFILE,validation2.user2_.ID_VACATION_PROFILE,const,validation2.vacationac0_.ID_VACATION_TYPE | 1 | 100.00 | Using where |
| 1 | SIMPLE | translatio9_ | eq_ref | PRIMARY | PRIMARY | 919 | validation2.vacationty3_.ID_COMPANY_TRANSLATION,validation2.vacationty3_.ID_TRANSLATION | 1 | 100.00 | Using index |
| 1 | SIMPLE | descriptio10_ | eq_ref | PRIMARY, | PRIMARY | 951 | validation2.vacationty3_.ID_COMPANY_TRANSLATION,validation2.vacationty3_.ID_TRANSLATION,const | 1 | 100.00 | Using where |
| 1 | SIMPLE | listvalue4_ | ALL | NULL | NULL | NULL | NULL | 5282 | 100.00 | Using where; Using join buffer (flat, BNL join) |
| 1 | SIMPLE | translatio7_ | eq_ref | PRIMARY | PRIMARY | 919 | validation2.listvalue4_.ID_COMPANY_TRANSLATION,validation2.listvalue4_.ID_TRANSLATION | 1 | 100.00 | Using index |
| 1 | SIMPLE | descriptio8_ | eq_ref | PRIMARY, | PRIMARY | 951 | validation2.listvalue4_.ID_COMPANY_TRANSLATION,validation2.listvalue4_.ID_TRANSLATION,const | 1 | 100.00 | Using where |
+------+-------------+---------------+--------+------------------------------------+---------+---------+---------------------------------------------------------------------------------------------------------------------------------------+------+----------+-------------------------------------------------+
How I can force to use join buffer (flat, BNL join) the first environment is the production one and has more memory and CPU.
In first environment :
join_buffer_size............ 16777216
join_buffer_space_limit..... 2097152
In second environment :
join_buffer_size............ 262144
join_buffer_space_limit..... 2097152
Is there any link/ratio between join_buffer_size and join_buffer_space_limit?
We configure 16Mo on join_buffer_size because it is a mysqlTuner hint.
I set join_buffer_space_limit at 128Mo and it resolves performance issue.
So mysqlTuner doesn't give hint for this configuration key.
SET GLOBAL join_buffer_space_limit = 1024 * 1024 * 128;
It takes time (hour) to improve performances.
https://mariadb.com/kb/en/library/multi-range-read-optimization/
sqoop import
--connect jdbc:mysql://localhost/classicmodels
--username root --password cloudera
--query ' select c.customernumber, c.customername, o.orderdate, o.ordernumber from customers AS c JOIN orders As o ON c.customernumber = o.customernumber WHERE $CONDITIONS '
--boundary-query 'select min(customernumber), max(customernumber) from customers '
--target-dir /data/info/customerdata/join
--split-by customernumber ;
mysql> describe customers ;
+------------------------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------------+---------------+------+-----+---------+-------+
| customerNumber | int(11) | NO | PRI | NULL | |
| customerName | varchar(50) | NO | | NULL | |
| contactLastName | varchar(50) | NO | | NULL | |
| contactFirstName | varchar(50) | NO | | NULL | |
| phone | varchar(50) | NO | | NULL | |
| addressLine1 | varchar(50) | NO | | NULL | |
| addressLine2 | varchar(50) | YES | | NULL | |
| city | varchar(50) | NO | | NULL | |
| state | varchar(50) | YES | | NULL | |
| postalCode | varchar(15) | YES | | NULL | |
| country | varchar(50) | NO | | NULL | |
| salesRepEmployeeNumber | int(11) | YES | MUL | NULL | |
| creditLimit | decimal(10,2) | YES | | NULL | |
+------------------------+---------------+------+-----+---------+-------+
mysql> describe orders ;
+----------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------+-------------+------+-----+---------+-------+
| orderNumber | int(11) | NO | PRI | NULL | |
| orderDate | date | NO | | NULL | |
| requiredDate | date | NO | | NULL | |
| shippedDate | date | YES | | NULL | |
| status | varchar(15) | NO | | NULL | |
| comments | text | YES | | NULL | |
| customerNumber | int(11) | NO | MUL | NULL | |
+----------------+-------------+------+-----+---------+-------+
Try using tablename.column name syntax for the sql in your query's where clause.
sqoop import
--connect jdbc:mysql://localhost/classicmodels
--username root --password cloudera
--query 'select customers.customernumber, customers.customername,
orders.orderdate, orders.ordernumber FROM customers, orders WHERE
customers.customernumber = orders.customernumber AND $CONDITIONS'
--boundary-query 'select min(customernumber), max(customernumber) from customers'
--target-dir /data/info/customerdata/join
--split-by customers.customernumber ;
For --boundary-query, make sure customernumber should be numeric columns and should not be null.