Sqoop: using octal value (\0) as delimiter

Because one of the fields contains special characters, I want to use a low-value character as the delimiter. Hive works fine with the delimiter \0, but Sqoop fails with a NoSuchElementException. It looks like Sqoop is not recognizing \0 as the delimiter.
This is how my Hive and Sqoop scripts look. Any help would be appreciated.
CREATE TABLE SCHEMA.test
(
name CHAR(20),
id int,
dte_report date
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\0'
LOCATION '/user/$USER/test';
sqoop-export \
-Dmapred.job.name="TEST" \
-Dorg.apache.sqoop.export.text.dump_data_on_error=true \
--options-file ${OPTION_FILE_LOCATION}\conn_mysql \
--export-dir /user/$USER/test \
--input-fields-terminated-by '\0' \
--input-lines-terminated-by '\n' \
--input-null-string '\\N' \
--input-null-non-string '\\N' \
--table MYSQL_TEST \
--validate \
--outdir /export/home/$USER/javalib
In the vi editor the delimiter appears as '^#', and od -c shows it as \0.

Setting the character set to UTF-8 in the MySQL connection string may resolve this issue:
mysql.url=jdbc:mysql://localhost:3306/nbs?useJvmCharsetConverters=false&useDynamicCharsetInfo=false&useUnicode=true&characterEncoding=UTF-8&characterSetResults=UTF-8&useEncoding=true

You should use \000 as the delimiter; Sqoop will then generate the NUL character as the delimiter.
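Before changing the Sqoop options it can help to confirm which byte really separates the fields. The following is only a sketch on a local sample, not on the HDFS files themselves; `make_record` and `count_nul` are hypothetical helper names and the sample values are made up.

```shell
# Hypothetical helper: write one record whose three fields are separated by a NUL byte.
# printf interprets \000 in its format string as the byte with octal value 0.
make_record() {
  printf '%s\000%s\000%s\n' "$1" "$2" "$3"
}

# Hypothetical helper: count the NUL bytes arriving on stdin.
# tr -cd '\000' deletes everything except NUL; wc -c counts what is left.
count_nul() {
  tr -cd '\000' | wc -c | tr -d ' '
}

make_record 'john' '1' '2020-01-01' | od -c      # NUL is shown as \0 in the dump
make_record 'john' '1' '2020-01-01' | count_nul  # two delimiters for three fields
```

For real data the same check could be applied to the first line of the export directory, for example by piping `hdfs dfs -cat /user/$USER/test/* | head -n 1` into `count_nul`.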

Related

Export Sqoop fields with enclosure and/or delimiter character inside

I'm trying to export this type of data to PostgreSQL:
"WIFI:S:FIBRA-3;T:WPA;P:YOdfdgg4677;;";"2021-05-18 14:31:34"
"'":.56#!:&7:&":8";"2021-05-19 15:56:22"
but the first field is not parsed correctly, I think because of the double quotes.
The command that I'm using is:
export \
--connect $DB_JDBC_URL_MAIN \
--username=$DB_USER \
--password="$DB_PASSWORD" \
--table "$DB_SCHEMA.$DB_TABLE" \
--export-dir $EXPORT_DIR \
--input-lines-terminated-by '\n' \
--input-fields-terminated-by ';' \
--input-null-string 'N/A' \
--optionally-enclosed-by '\"' \
--escaped-by \\ \
I hope you can help me.

Remove space between new line output

I am trying to write the output to a file using:
cat <<EOF> /var/log/awsmetadata.log
timestamp= $TIME, \
region= $REGION, \
instanceIp= $INSTANCE_IP, \
availabilityZone= $INSTANCE_AZ, \
instanceType= $INSTANCE_TYPE, \
EOF
The output is created in the following format:
cat /var/log/awsmetadata.log
timestamp= 2020-11-04 18:51:17, region= us-west-2, instanceIp= 1.2.3.4, availabilityZone= us-west-2a,
How can I eliminate the wide spaces between the entries in the output?
If you don't want redundant whitespace, simply don't add it:
$ cat <<EOF> /var/log/awsmetadata.log
> timestamp= $TIME, \
> region= $REGION, \
> instanceIp= $INSTANCE_IP, \
> availabilityZone= $INSTANCE_AZ, \
> instanceType= $INSTANCE_TYPE
> EOF
I often use sed or tr instead of cat for this sort of thing:
tr -s ' ' <<EOF > /var/log/awsmetadata.log
timestamp= $TIME, \
region= $REGION, \
instanceIp= $INSTANCE_IP, \
availabilityZone= $INSTANCE_AZ, \
instanceType= $INSTANCE_TYPE,
EOF
But it seems cleaner to not escape the newlines at all and do something like:
{ tr -d \\n <<-EOF; echo; } > /var/log/awsmetadata.log
timestamp= $TIME,
region= $REGION,
instanceIp= $INSTANCE_IP,
availabilityZone= $INSTANCE_AZ,
instanceType= $INSTANCE_TYPE,
EOF
(That solution uses the <<- form of the heredoc, which strips leading hard-tab indentation. It will not remove leading spaces.)
OTOH, it seems weird to be using a here doc when you're just wanting to generate one line of output. Why not just use echo?
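A minimal sketch of the single-command approach, using printf rather than echo so the format is explicit. The variable names follow the question; `format_metadata` is a hypothetical helper and the sample values (including the instance type) are made up.

```shell
# Hypothetical helper: emit all metadata fields on one line, with no padding between them.
format_metadata() {
  printf 'timestamp= %s, region= %s, instanceIp= %s, availabilityZone= %s, instanceType= %s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Sample values for illustration only.
TIME='2020-11-04 18:51:17'
REGION='us-west-2'
INSTANCE_IP='1.2.3.4'
INSTANCE_AZ='us-west-2a'
INSTANCE_TYPE='t3.micro'

format_metadata "$TIME" "$REGION" "$INSTANCE_IP" "$INSTANCE_AZ" "$INSTANCE_TYPE"
```

To write the line to the log file from the question, redirect the call with `> /var/log/awsmetadata.log`.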

Sqoop --escaped-by --optionally-enclosed-by

I need to import data into a .csv file with a comma (,) as the delimiter.
I am using the Sqoop options below.
--optionally-enclosed-by '\"'
--escaped-by '\\'
Below are the input data and the output I want:
input: "foo    desired output: ""foo
But I am getting:
input: "foo    actual output: "foo
Another example:
input: foo"    desired output: foo""
But I am getting:
input: foo"    actual output: foo"
How can I achieve the desired output?
Refer to the Sqoop User Guide, section 7.2.11 (Large Objects), for a better understanding of --enclosed-by, --escaped-by and --optionally-enclosed-by, with examples.
Based on the question, this is my understanding of the options:
--fields-terminated-by , is needed because you want a file with a comma as the delimiter.
--optionally-enclosed-by '\"' encloses only the fields whose data contains the comma delimiter.
--escaped-by \\ escapes the enclosing character (a double quote in this case) when it appears in a data field that requires enclosing.
Example:
Input: Suppose the data in the source table looks like this, with the respective columns. For illustration, I use a pipe (|) as the delimiter.
Some string, with a comma.|1|2|3...
Another "string with quotes"|4|5|6...
Output: sqoop import --fields-terminated-by , --enclosed-by '\"' --escaped-by \\ ...
"Some string, with a comma.","1","2","3"...
"Another \"string with quotes\"","4","5","6"...
Explanation: All fields are terminated by a comma and all fields are enclosed in double quotes. If a field contains double quotes, those quotes are escaped by a backslash (\), as in the second line.
Output: sqoop import --fields-terminated-by , --optionally-enclosed-by '\"' --escaped-by \\ ...
"Some string, with a comma.",1,2,3...
"Another \"string with quotes\"",4,5,6...
Explanation: All fields are terminated by a comma, and only fields containing a comma in the data are enclosed in double quotes. If a field contains double quotes, those quotes are escaped by a backslash (\), and that column is also enclosed, as in the second line.
For your scenario:
Input: Suppose the data in the source table looks like this, with the respective columns. For illustration, I use a pipe (|) as the delimiter.
"foo|bar"|1|2
foo"|3|4|"bar
Possible Output: sqoop import --fields-terminated-by , --enclosed-by '\"' --escaped-by \\ ...
"\"foo","bar\"","1","2"
"foo\"","3","4","\"bar"
Possible Output: sqoop import --fields-terminated-by , --optionally-enclosed-by '\"' --escaped-by \\ ...
"\"foo","bar\"",1,2
"foo\"",3,4,"\"bar"
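In the outputs above, Sqoop escapes an embedded quote with the escape character rather than doubling it. If a downstream consumer expects doubled quotes, one option is to post-process the imported files. This is a minimal sketch and not a Sqoop feature: `double_quotes` is a hypothetical helper, and it assumes the import used --escaped-by \\ and that the data contains no literal backslashes other than those escaping a quote.

```shell
# Hypothetical helper: turn every backslash-escaped double quote (\") into a doubled quote ("").
double_quotes() {
  sed 's/\\"/""/g'
}

# Sample row taken from the "Possible Output" above.
printf '%s\n' '"\"foo","bar\"",1,2' | double_quotes
# prints: """foo","bar""",1,2
```

Fields that were enclosed stay enclosed, so a value such as "foo comes out as """foo", which is the usual CSV convention and not exactly the bare ""foo asked for in the question.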

How do I replace template variables in text file with data in bash script

I have a template file like the one shown below. It contains a number of placeholders that I want to replace with values I pull from a JSON doc. I can do it with sed for the few simple ones, but I have problems with <ARN> and others like it.
#test "Test <SCENARIO_NAME>--<EXPECTED_ACTION>" {
<SKIP_BOOLEAN>
testfile="data/<FILE_NAME>"
assert_file_exist $testfile
IBP_JSON=$(cat $testfile)
run aws iam simulate-custom-policy \
--resource-arns \
"<ARN>"
--action-names \
"<ACTION_NAMES>"
--context-entries \
"ContextKeyName='aws:PrincipalTag/Service', \
ContextKeyValues='svc1', \
ContextKeyType=string" \
"ContextKeyName='aws:PrincipalTag/Department', \
ContextKeyValues='shipping', \
ContextKeyType=string" \
<EXTRA_CONTEXT_KEYS>
--policy-input-list "${IBP_JSON}"
assert_success
<TEST_EXPRESSION>
}
I want the <ARN> placeholder to be replaced with the following text:
"arn:aws:ecs:*:588068252125:cluster/${aws:PrincipalTag/Service}-*" \
"arn:aws:ecs:*:588068252125:task/${aws:PrincipalTag/Service}-*" \
"arn:aws:ecs:*:588068252125:container-instance/${aws:PrincipalTag/Service}-*" \
"arn:aws:ecs:*:588068252125:task-definition/${aws:PrincipalTag/Service}-*:*" \
"arn:aws:ecs:*:588068252125:service/${aws:PrincipalTag/Service}-*" \
How can I do that replacement while also preserving the formatting (\ and /r at line ends)?
The easiest way is to use bash itself:
original=$(cat file.txt)
read -r -d '' replacement <<'EOF'
"arn:aws:ecs:*:588068252125:cluster/${aws:PrincipalTag/Service}-*" \
"arn:aws:ecs:*:588068252125:task/${aws:PrincipalTag/Service}-*" \
"arn:aws:ecs:*:588068252125:container-instance/${aws:PrincipalTag/Service}-*" \
"arn:aws:ecs:*:588068252125:task-definition/${aws:PrincipalTag/Service}-*:*" \
"arn:aws:ecs:*:588068252125:service/${aws:PrincipalTag/Service}-*" \
EOF
placeholder='"<ARN>"'
modified=${original/$placeholder/$replacement}
echo "$modified"
Look for ${parameter/pattern/string} in man bash.
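The same expansion can be wrapped in a small function so it can be reused for the other placeholders. This is a sketch that assumes bash (the expansion is not POSIX sh); `replace_placeholder` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace the first occurrence of a literal placeholder in a string.
replace_placeholder() {
  local text=$1 placeholder=$2 replacement=$3
  local result
  # Quoting "$placeholder" makes bash match it literally instead of as a glob pattern.
  result=${text/"$placeholder"/$replacement}
  printf '%s\n' "$result"
}

# Single quotes keep ${aws:PrincipalTag/Service} from being expanded by the shell.
replace_placeholder '--resource-arns "<ARN>" --action-names' '"<ARN>"' \
  '"arn:aws:ecs:*:588068252125:cluster/${aws:PrincipalTag/Service}-*"'
```

Using `${text//"$placeholder"/$replacement}` (two slashes) would replace every occurrence instead of only the first.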
If input.txt is the input file and replace.txt contains the replacement text:
$ cat input.txt
run aws iam simulate-custom-policy \
--resource-arns \
"<ARN>"
--action-names \
"<ACTION_NAMES>"
$ cat replace.txt
"arn:aws:ecs:*:588068252125:cluster/${aws:PrincipalTag/Service}-*" \\\
"arn:aws:ecs:*:588068252125:task/${aws:PrincipalTag/Service}-*" \\\
"arn:aws:ecs:*:588068252125:container-instance/${aws:PrincipalTag/Service}-*" \\\
"arn:aws:ecs:*:588068252125:task-definition/${aws:PrincipalTag/Service}-*:*" \\\
"arn:aws:ecs:*:588068252125:service/${aws:PrincipalTag/Service}-*"
then you can use sed with # delimiters to make the replacement:
$ sed "s#\"<ARN>\"#$(< replace.txt)#g" input.txt
run aws iam simulate-custom-policy \
--resource-arns \
"arn:aws:ecs:*:588068252125:cluster/${aws:PrincipalTag/Service}-*" \
"arn:aws:ecs:*:588068252125:task/${aws:PrincipalTag/Service}-*" \
"arn:aws:ecs:*:588068252125:container-instance/${aws:PrincipalTag/Service}-*" \
"arn:aws:ecs:*:588068252125:task-definition/${aws:PrincipalTag/Service}-*:*" \
"arn:aws:ecs:*:588068252125:service/${aws:PrincipalTag/Service}-*"
--action-names \
"<ACTION_NAMES>"
Here $(< replace.txt) is equivalent to $(cat replace.txt)
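The extra backslashes in replace.txt are there because sed treats backslashes and newlines specially in the replacement. A possible way to avoid hand-escaping is to let sed prepare the replacement text itself. The sketch below assumes the `#` delimiter used above; `escape_sed_replacement` is a hypothetical helper and the sample lines are shortened stand-ins for the real ARNs.

```shell
# Hypothetical helper: escape text from stdin for use as the replacement in s#...#...#.
escape_sed_replacement() {
  # First expression: put a backslash before every backslash, ampersand and # delimiter.
  # Second expression: end every line except the last with a backslash so sed keeps the newline.
  sed -e 's/[\\&#]/\\&/g' -e '$!s/$/\\/'
}

# Sample replacement text: two lines, the first ending in a single backslash.
replacement=$(printf '%s\n' '"arn:one" \' '"arn:two"' | escape_sed_replacement)

# Sample template lines with the placeholder on its own line.
printf '%s\n' '--resource-arns \' '"<ARN>"' '--action-names \' |
  sed "s#\"<ARN>\"#${replacement}#g"
```

With this helper the replacement file can hold the text exactly as it should appear in the output, with one backslash at each line end.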

Sqoop import Null string

NULL values are displayed as '\N' when a Hive external table is queried.
Below is the sqoop import script:
sqoop import -libjars /usr/lib/sqoop/lib/tdgssconfig.jar,/usr/lib/sqoop/lib/terajdbc4.jar -Dmapred.job.queue.name=xxxxxx \
--connect jdbc:teradata://xxx.xx.xxx.xx/DATABASE=$db,LOGMECH=LDAP --connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--username $user --password $pwd --query "
select col1,col2,col3 from $db.xxx
where \$CONDITIONS" \
--null-string '\N' --null-non-string '\N' \
--fields-terminated-by '\t' --num-mappers 6 \
--split-by job_number \
--delete-target-dir \
--target-dir $hdfs_loc
Please advise what changes should be made to the script so that NULLs are displayed as NULL when the external Hive table is queried.
Sathiyan, below are my findings after many trials:
1. If the null-string property is not included during the Sqoop import, NULLs are stored as [blank for integer columns] and [blank for string columns] in HDFS.
2. If the Hive table on top of HDFS is then queried, we see [NULL for integer columns] and [blank for string columns].
3. If the (--null-string '\N') property is included during the Sqoop import, NULLs are stored as ['\N' for both integer and string columns].
4. If the Hive table on top of HDFS is then queried, we see [NULL for both integer and string columns, not '\N'].
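To see what was actually written before Hive is involved, the imported files can be inspected directly. A minimal sketch, assuming tab-delimited text output as in the question's script; `count_null_markers` is a hypothetical helper and the sample rows are made up.

```shell
# Hypothetical helper: count the fields on stdin that consist of exactly the two characters \N.
count_null_markers() {
  awk -F'\t' '{ for (i = 1; i <= NF; i++) if ($i == "\\N") n++ } END { print n + 0 }'
}

# Two sample rows with three \N markers in total.
printf 'a\t\\N\t3\n\\N\tb\t\\N\n' | count_null_markers
# prints: 3
```

On a real import the input could come from something like `hdfs dfs -cat "$hdfs_loc"/* | count_null_markers`. A count of zero would suggest that the null marker in the files is not the `\N` that Hive expects by default.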
In your Sqoop script you specified --null-string '\N' --null-non-string '\N', which means:
--null-string '\N' = the string to be written for a null value in string columns
--null-non-string '\N' = the string to be written for a null value in non-string columns
If a value in the table is NULL and we import that table with Sqoop, Sqoop writes the NULL value as the string "null" in HDFS. That makes it difficult to use NULL conditions in Hive queries.
For example, let's insert a NULL value into the MySQL table "cities":
mysql> insert into cities values(6,7,NULL);
By default, Sqoop imports a NULL value as the string "null" in HDFS.
Let's run the import and see what happens:
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username sqoop -P --table cities --hive-import --hive-overwrite --hive-table vikas.cities -m 1
http://deltafrog.com/how-to-handle-null-value-during-sqoop-import-export/
In the Sqoop import command, remove the --null-string '\N' and --null-non-string '\N' options.
By default the system will assign null for both string and non-string values.
I have tried --null-string '\N', --null-string '' and other options, but I got blanks and other issues.
