MySQL blob dump to tab delimited files

I am migrating a MySQL 5.1 database in Amazon's EC2, and I am having issues with tables that use the longblob datatype for image storage. After the migration, the data in the longblob column is a different size, apparently because the character encoding is different.
First of all, here is an example of before and after the migration:
Old:
x??]]??}?_ѕ??d??i|w?%?????q$??+?
New:
x��]]����_ѕ��d��i|w�%�����q$��+�
I checked the character set variables on both machines and they are identical. The 'show create table' output is identical as well. The clients both connect the same way (no SET NAMES and no explicit character sets).
Here is the mysqldump command I used (I tried it without --hex-blob as well):
mysqldump --hex-blob --default-character-set=utf8 --tab=. DB_NAME
Here is how I loaded the data:
mysql DB_NAME --default-character-set=utf8 -e "LOAD DATA INFILE 'EXAMPLE.txt' INTO TABLE EXAMPLE;"
Here are the MySQL character set variables (identical):
Old:
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
New:
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
I'm not sure what else to try to be able to run mysqldump and have the blob data be identical on both machines. Any tips would be greatly appreciated.

The issue seems to be a bug in MySQL (http://bugs.mysql.com/bug.php?id=27724). The workaround is not to use mysqldump, but to write your own SELECT ... INTO OUTFILE statement for the tables that have blob data, HEX-encoding the blob column on the way out. Here is an example (column1 through column5 are placeholders; column3 is the blob column):
SELECT
  COALESCE(column1, @nullval),
  COALESCE(column2, @nullval),
  COALESCE(HEX(column3), @nullval),
  COALESCE(column4, @nullval),
  COALESCE(column5, @nullval)
FROM `table`
INTO OUTFILE '/mnt/dump/table.txt'
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
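Note (an assumption on my part; the original answer never defines @nullval): for NULLs to round-trip, @nullval can be set beforehand to the two-character sequence \N, which is what LOAD DATA interprets as NULL in the file:
SET @nullval = '\\N';  -- double backslash so the variable holds the literal characters \N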
To load the data:
SET NAMES utf8;
LOAD DATA INFILE '/mnt/dump/table.txt'
INTO TABLE `table`
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
(column1, column2, @hex_column3, column4, column5)
SET column3 = UNHEX(@hex_column3);
This loads the blob data correctly.
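A quick way to confirm the round trip, sketched here as an extra step that is not part of the original answer: compute per-row checksums of the blob column on the old and the new server and diff the two result sets (column1 is assumed to be a key you can order by).
-- run on both servers and compare the output
SELECT column1, MD5(column3) AS blob_md5
FROM `table`
ORDER BY column1;
Identical checksums mean the longblob data migrated byte-for-byte.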

Related

How to drop hive partitions with hivevar passed as partition variable?

I have been trying to run this piece of code to drop the current day's partition from a Hive table, and for some reason it does not drop the partition. Not sure what's wrong.
Table Name : prod_db.products
desc:
+----------------------------+-----------------------+-----------------------+--+
| col_name | data_type | comment |
+----------------------------+-----------------------+-----------------------+--+
| name | string | |
| cost | double | |
| load_date | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| load_date | string | |
+----------------------------+-----------------------+-----------------------+--+
I am using the following code:
SET hivevar:current_date=current_date();
ALTER TABLE prod_db.products DROP PARTITION(load_date='${current_date}');
Before and After picture of partitions:
+-----------------------+--+
| partition |
+-----------------------+--+
| load_date=2022-04-07 |
| load_date=2022-04-11 |
| load_date=2022-04-18 |
| load_date=2022-04-25 |
+-----------------------+--+
It runs without any error but won't drop the partition. The table is internal/managed.
I tried different ways mentioned on Stack Overflow, but it is just not working for me. Help.
You don't need to set a variable; you can drop the partition directly in SQL:
ALTER TABLE prod_db.products
DROP PARTITION (load_date = current_date());
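Worth noting (my reading, not stated in the answer above): SET hivevar:current_date=current_date(); stores the literal text current_date() rather than an evaluated date, because hivevar substitution is plain text replacement, so the original statement looks for a partition literally named 'current_date()'. If you do want to drive the drop from a variable, pass a concrete value in from the shell, for example (drop_date and the script name are hypothetical):
-- invoked as: hive --hivevar drop_date=2022-04-25 -f drop_partition.hql
ALTER TABLE prod_db.products DROP IF EXISTS PARTITION (load_date='${hivevar:drop_date}');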

Timestamp is different for the same table in hive-cli & presto-cli

I am getting different timestamps for the same table in hive-cli & presto-cli.
table structure for the table:
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE EXTERNAL TABLE `mea_req_201`( |
| `mer_id` bigint, |
| `mer_from_dttm` timestamp, |
| `mer_to_dttm` timestamp, |
| `_c0` bigint, |
| `a_number` string, |
| `b_number` string, |
| `time_stamp` timestamp, |
| `duration` bigint) |
| PARTITIONED BY ( |
| `partition_col` bigint) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://hadoop5:8020/warehouse/tablespace/external/hive/mea_req_201' |
| TBLPROPERTIES ( |
| 'TRANSLATED_TO_EXTERNAL'='TRUE', |
| 'bucketing_version'='2', |
| 'external.table.purge'='TRUE', |
| 'spark.sql.create.version'='2.4.0.7.1.4.0-203', |
| 'spark.sql.sources.schema.numPartCols'='1', |
| 'spark.sql.sources.schema.numParts'='1', |
| 'transient_lastDdlTime'='1625496239') |
+----------------------------------------------------+
While running from hive-cli the output is:
While from presto-cli:
In the mer_from_dttm column there is a time difference, but for the other timestamp columns the dates are exactly the same. Note that this behaviour is the same when going through presto-jdbc as well. I believe this has nothing to do with the timezone, because if it were the timezone, the difference should appear across all timestamp columns, not just one. Please provide some resolution.
Some Facts:
Presto server version: 0.180
Presto Jdbc version: 0.180
hive.time-zone=Asia/Calcutta
In Presto jvm.config: -Duser.timezone=Asia/Calcutta
Client TimeZone: Asia/Calcutta
Edit 1:
Sorted the query by mer_id to ensure both clients return the same set of rows; however, the erroneous behaviour remains the same.
While Running from hive-cli:
While Running from presto-cli:
Presto 0.180 is really old. It was released in 2017, and many bugs have been fixed along the way.
I would suggest you try a recent version. In particular, recent versions of Trino (formerly known as PrestoSQL) have had a lot of work done around handling of timestamp data.
Use ORDER BY so that you are comparing the exact same rows in each client:
SELECT `mer_id`,`mer_from_dttm`, `mer_to_dttm`, `time_stamp` FROM mea_req_201 ORDER BY `mer_id`;
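As an additional hedged check (not part of either answer above): compare the epoch value of the suspect column in both engines; if the epochs match, the difference is in rendering/zone handling, and if they differ, the engines are reading the ORC value differently.
-- hive-cli
SELECT mer_id, unix_timestamp(mer_from_dttm) AS mer_from_epoch FROM mea_req_201 ORDER BY mer_id;
-- presto-cli
SELECT mer_id, to_unixtime(mer_from_dttm) AS mer_from_epoch FROM mea_req_201 ORDER BY mer_id;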

Unable to query/select data that was inserted through Spark SQL

I am trying to insert data into a Hive Managed table that has a partition.
Show create table output for reference.
+--------------------------------------------------------------------------------------------------+--+
| createtab_stmt |
+--------------------------------------------------------------------------------------------------+--+
| CREATE TABLE `part_test08`( |
| `id` string, |
| `name` string, |
| `baseamount` double, |
| `billtoaccid` string, |
| `extendedamount` double, |
| `netamount` decimal(19,5), |
| `netunitamount` decimal(19,5), |
| `pricingdate` timestamp, |
| `quantity` int, |
| `invoiceid` string, |
| `shiptoaccid` string, |
| `soldtoaccid` string, |
| `ingested_on` timestamp, |
| `external_id` string) |
| PARTITIONED BY ( |
| `productid` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'wasb://blobrootpath/hive/warehouse/db_103.db/part_test08' |
| TBLPROPERTIES ( |
| 'bucketing_version'='2', |
| 'transactional'='true', |
| 'transactional_properties'='default', |
| 'transient_lastDdlTime'='1549962363') |
+--------------------------------------------------------------------------------------------------+--+
I am trying to execute an SQL statement to insert records into the partitioned table, like below.
sparkSession.sql("INSERT INTO TABLE db_103.part_test08 PARTITION(ProductId) SELECT reflect('java.util.UUID', 'randomUUID'),stg_name,stg_baseamount,stg_billtoaccid,stg_extendedamount,stg_netamount,stg_netunitamount,stg_pricingdate,stg_quantity,stg_invoiceid,stg_shiptoaccid,stg_soldtoaccid,'2019-02-12 09:06:07.566',stg_id,stg_ProductId FROM tmp_table WHERE part_id IS NULL");
Without the insert statement, if we run the select query, we get the data below.
+-----------------------------------+--------+--------------+--------------------+------------------+-------------+-----------------+-------------------+------------+-------------+--------------------+--------------------+-----------------------+------+-------------+
|reflect(java.util.UUID, randomUUID)|stg_name|stg_baseamount| stg_billtoaccid|stg_extendedamount|stg_netamount|stg_netunitamount| stg_pricingdate|stg_quantity|stg_invoiceid| stg_shiptoaccid| stg_soldtoaccid|2019-02-12 09:06:07.566|stg_id|stg_ProductId|
+-----------------------------------+--------+--------------+--------------------+------------------+-------------+-----------------+-------------------+------------+-------------+--------------------+--------------------+-----------------------+------+-------------+
| 4e0b4331-b551-42d...| OLI6| 16.0|2DD4E682-6B4F-E81...| 34.567| 1166.74380| 916.78000|2018-10-18 05:06:22| 13| I1|2DD4E682-6B4F-E81...|2DD4E682-6B4F-E81...| 2019-02-12 09:06:...| 6| P3|
| 8b327a8e-dd3c-445...| OLI7| 16.0|2DD4E682-6B4F-E81...| 34.567| 766.74380| 1016.78000|2018-10-18 05:06:22| 13| I6|2DD4E682-6B4F-E81...|2DD4E682-6B4F-E81...| 2019-02-12 09:06:...| 7| P4|
| c0e14b9a-8d1a-426...| OLI5| 14.6555| null| 34.56| 500.87000| 814.65000|2018-10-11 05:06:22| 45| I4|29B73C4E-846B-E71...|29B73C4E-846B-E71...| 2019-02-12 09:06:...| 5| P1|
+-----------------------------------+--------+--------------+--------------------+------------------+-------------+-----------------+-------------------+------------+-------------+--------------------+--------------------+-----------------------+------+-------------+
Earlier I was getting an error while inserting into the managed table, but after restarting the Hive & Thrift services there is no error in the execution of the job. However, I am not able to see the inserted data when running a select query through beeline or a program. I can see that the partition with delta files got written into hive/warehouse; see the screenshot below.
Also, I can see some warnings as below; I am not sure whether they are related to the issue or not.
Cannot get ACID state for db_103.part_test08 from null
One more note: if I use an external table then it works fine and I am able to view the data as well.
We are using an Azure HDInsight Spark 2.3 (HDI 4.0 Preview) cluster with the service stack below.
HDFS: 3.1.1
Hive: 3.1.0
Spark2: 2.3.1
Have you added the set commands below while trying to insert the data?
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
I have faced a similar issue where I was not allowed to do any read/write operation; after adding the above properties I was able to query the table.
Since you have not faced any issue with the external table, I am not sure whether this is going to solve your problem.
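One more hedged suggestion building on the answer above (an assumption, not something the answer states explicitly): apply the same transactional settings in the beeline session you read from, since the answer reports the table only became queryable after setting them.
-- run in the beeline session before selecting from the managed (ACID) table
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SELECT COUNT(*) FROM db_103.part_test08;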

Character encoding in HP Vertica

I installed the latest version of the free HP Vertica server on Linux CentOS release 6.6 (Final). Next, I set up the server and created a database IM_0609. Then I created a table with the command:
CREATE TABLE MARKS (SERIAL_NUM varchar(30),PERIOD smallint,MARK_NUM decimal(20,0), END_MARK_NUM decimal(20,0),OLD_MARK_NUM decimal(20,0),DEVICE_NAME varchar(256),DEVICE_MARK varchar(256),CALIBRATION_DATE date);
Next, I exported the data from the DB2 database to a txt file:
5465465|12|+5211.|+5211.||Комплексы компьютеризированные самостоятельного предрейсового экспресс-обследования функционального состояния машиниста, водителя и оператора|ЭкОЗ-01|2004-12-09
5465465|12|+5211.|+5211.||Спектрометры эмиссионные|Metal Lab|2004-12-09
б/н|12|+5207.|+5207.|+5205.|Спектрометры эмиссионные|Metal Lab|2004-12-09
б/н|12|+5207.|+5207.|+5205.|Спектрометры эмиссионные|Metal Test|2004-12-09
....
and I changed the file encoding to UTF-8.
I then imported the data from the text file into the database table using this HP Vertica COPY command:
copy MARKS from '/home/dbadmin/result.txt' delimiter '|' null as '' exceptions '/home/dbadmin/copy-error.log' ABORT ON ERROR;
All the data loaded, but the Russian text is displayed as strange characters; apparently this is due to a character-encoding problem with the COPY command.
5465465 12 5211 5211 (null) Êîìïëåêñû êîìïüşòåğèçèğîâàííûå ñàìîñòîÿòåëüíîãî ïğåäğåéñîâîãî ıêñïğåññ-îáñëåäîâàíèÿ ôóíêöèîíàëüíîãî ñîñòîÿíèÿ ìàøèíèñòà, âîäèòåëÿ è îï İêÎÇ-01 2004-12-09
5465465 12 5211 5211 (null) Ñïåêòğîìåòğû ıìèññèîííûå Metal Lab 2004-12-09
Question: How can I fix this problem?
Make sure your file encoding is UTF-8:
[dbadmin@DCG023 ~]$ file rus
rus: UTF-8 Unicode text
[dbadmin@DCG023 ~]$ cat rus
5465465|12|+5211.|+5211.||Комплексы компьютеризированные самостоятельного предрейсового экспресс-обследования функционального состояния машиниста, водителя и оператора|ЭкОЗ-01|2004-12-09
5465465|12|+5211.|+5211.||Спектрометры эмиссионные|Metal Lab|2004-12-09
б/н|12|+5207.|+5207.|+5205.|Спектрометры эмиссионные|Metal Lab|2004-12-09
б/н|12|+5207.|+5207.|+5205.|Спектрометры эмиссионные|Metal Test|2004-12-09
Load the data
[dbadmin@DCG023 ~]$ vsql
Password:
Welcome to vsql, the Vertica Analytic Database interactive terminal.
Type: \h or \? for help with vsql commands
\g or terminate with semicolon to execute query
\q to quit
(dbadmin@:5433) [dbadmin] > copy MARKS from '/home/dbadmin/rus' delimiter '|' null as '' ABORT ON ERROR;
Rows Loaded
-------------
4
(1 row)
Query the data
(dbadmin@:5433) [dbadmin] > select * from Marks;
SERIAL_NUM | PERIOD | MARK_NUM | END_MARK_NUM | OLD_MARK_NUM | DEVICE_NAME | DEVICE_MARK | CALIBRATION_DATE
------------+--------+----------+--------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------
5465465 | 12 | 5211 | 5211 | | Комплексы компьютеризированные самостоятельного предрейсового экспресс-обследования функционального состояния машиниста, водителя и оп | ЭкОЗ-01 | 2004-12-09
5465465 | 12 | 5211 | 5211 | | Спектрометры эмиссионные | Metal Lab | 2004-12-09
б/н | 12 | 5207 | 5207 | 5205 | Спектрометры эмиссионные | Metal Lab | 2004-12-09
б/н | 12 | 5207 | 5207 | 5205 | Спектрометры эмиссионные | Metal Test | 2004-12-09
(4 rows)
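If file reports anything other than UTF-8 for your export, convert it before COPY. The garbled output in the question looks like Windows-1251 bytes rendered in a single-byte Western code page, but that source encoding is an assumption on my part; a minimal sketch (result_utf8.txt is a hypothetical file name):
-- shell:  file /home/dbadmin/result.txt
-- shell:  iconv -f CP1251 -t UTF-8 /home/dbadmin/result.txt > /home/dbadmin/result_utf8.txt
TRUNCATE TABLE MARKS;
COPY MARKS FROM '/home/dbadmin/result_utf8.txt' DELIMITER '|' NULL AS '' EXCEPTIONS '/home/dbadmin/copy-error.log' ABORT ON ERROR;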

Action Controller: incompatible character encodings: ASCII-8BIT and UTF-8?

I have a text file that has some ® (Registered Trade Mark) symbols in it. The file is in UTF-8.
I'm trying to import this file and populate a MySQL database using Rails 3. The DB appears to be set up fine to take UTF-8:
+-------------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
| Field       | Type         | Collation       | Null | Key | Default | Extra          | Privileges                      | Comment |
+-------------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
| id          | int(11)      | NULL            | NO   | PRI | NULL    | auto_increment | select,insert,update,references |         |
| user_id     | int(11)      | NULL            | YES  | MUL | NULL    |                | select,insert,update,references |         |
| title       | varchar(255) | utf8_general_ci | YES  |     | NULL    |                | select,insert,update,references |         |
| translation | text         | utf8_general_ci | YES  |     | NULL    |                | select,insert,update,references |         |
| created_at  | datetime     | NULL            | NO   |     | NULL    |                | select,insert,update,references |         |
| updated_at  | datetime     | NULL            | NO   |     | NULL    |                | select,insert,update,references |         |
+-------------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
Yet, when I try to do this:
trans_file = params[:descriptions] # coming from file_field_tag
trans = trans_file.read.split("\r\n")
trans.each do |tran|
  ttl = ''
  desc = ''
  tran.split(']=').each do |title|
    if title =~ /\[/ # it's the title
      ttl = title.sub('[', '')
    else
      desc = title.gsub('FFF', "\r\n")
    end
  end
  # current_user.cd_translations.build(title: ttl, translation: desc).save
end
I'm getting the error "Action Controller: incompatible character encodings: ASCII-8BIT and UTF-8".
I'm using the utf-8 encoding in my application.rb file, and I'm using the mysql2 gem.
If I remove the Registered Trade Mark character, the error goes away. But it's not really an option to strip it out of the incoming text.
I tried the solution here: https://stackoverflow.com/a/5215676/102372, but that didn't make any difference.
Stack trace:
app/controllers/users_controller.rb:28:in `block in update_cd_translations'
app/controllers/users_controller.rb:15:in `each'
app/controllers/users_controller.rb:15:in `update_cd_translations'
config/initializers/quiet_assets.rb:7:in `call_with_quiet_assets'
How can I resolve this?
It appears that Ruby thinks the encoding of the uploaded file is ASCII-8BIT (that is to say, binary).
If you know the encoding of the file, you can use force_encoding to change the encoding of the string (without transcoding). If you're not always going to be sure of the encoding of the file, the charguess gem can be used to guess it.
Try adding
# -*- encoding: utf-8 -*-
at the beginning of each file that is involved in the process.
