Spring Cloud Dataflow - http | kafka and kafka | hdfs - Getting Raw message in HDFS

I am creating a basic setup in SCDF (Local Server 1.7.3) in which I am configuring two streams:
1. HTTP -> Kafka topic
2. Kafka topic -> HDFS
Streams:
stream create --name ingest_from_http --definition "http --port=8000 --path-pattern=/test > :streamtest1"
stream deploy --name ingest_from_http --properties "app.http.spring.cloud.stream.bindings.output.producer.headerMode=raw"
stream create --name ingest_to_hdfs --definition ":streamtest1 > hdfs --fs-uri=hdfs://<host>:8020 --directory=/tmp/hive/sensedev/streamdemo/ --file-extension=xml --spring.cloud.stream.bindings.input.consumer.headerMode=raw"
I have created a Hive managed table at location /tmp/hive/sensedev/streamdemo/:
DROP TABLE IF EXISTS gwdemo.xml_test;
CREATE TABLE gwdemo.xml_test(
id int,
name string
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.id"="/body/id/text()",
"column.xpath.name"="/body/name/text()"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/tmp/hive/sensedev/streamdemo'
TBLPROPERTIES (
"xmlinput.start"="<body>",
"xmlinput.end"="</body>")
;
Testing:
To check whether Hive is able to read the XML, I put an XML file in the location /tmp/hive/sensedev/streamdemo.
File content: <body><id>1</id><name>Test1</name></body>
On running a SELECT command on the table, it showed the above record properly.
When posting a record to SCDF with http post, I am getting proper data in the Kafka consumer, but when I check HDFS, the XML files are being created, yet they contain raw messages instead of the XML.
Example:
dataflow>http post --target http:///test --data "<body><id>2</id><name>Test2</name></body>" --contentType application/xml
In Kafka Console Consumer, I am able to read proper XML message: <body><id>2</id><name>Test2</name></body>
$ hdfs dfs -cat /tmp/hive/sensedev/streamdemo/hdfs-sink-2.xml
[B#31d94539
Questions:
1. What am I missing? How can I get proper XML records in the newly created XML files in HDFS?

The HDFS sink expects a Java-serialized object; what ends up in the file ([B#31d94539) is the string representation of the raw byte[] payload, not the XML content itself.
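One way to get the XML text into the files is to convert the byte[] payload to a String before it reaches the sink. A minimal sketch, assuming the standard transform processor is registered and keeping the header settings consistent with the first stream (not verified against Local Server 1.7.3):

stream create --name ingest_to_hdfs --definition ":streamtest1 > transform --expression='new String(payload)' | hdfs --fs-uri=hdfs://<host>:8020 --directory=/tmp/hive/sensedev/streamdemo/ --file-extension=xml"
stream deploy --name ingest_to_hdfs --properties "app.transform.spring.cloud.stream.bindings.input.consumer.headerMode=raw"

Alternatively, setting the hdfs sink's input binding contentType to text/plain may trigger the byte[]-to-String conversion without an extra processor.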

Related

Unable to run SerDe

We have an EBCDIC sample file.
It is stored in /user/hive/warehouse/ebcdic_test_file.txt.
The COBOL layout of the file is stored in /user/hive/Warehouse/CobolSerde.cob.
We are running the query in the Hue query editor; we also tried the CLI, but the same error occurs.
We have added CobolSerde.jar via
ADD JAR /home/cloudera/Desktop/CobolSerde.jar
It was added successfully, as verified with LIST JARS.
Query:
CREATE EXTERNAL TABLE cobol2Hve
ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde2.cobol.CobolSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.FixedLengthInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/hive/warehouse/ebcdic_test_file.txt'
TBLPROPERTIES ('cobol.layout.url'='/user/hive/warehouse/CobolSerDe.cob','fb.length'='159');
Error while processing statement:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
Cannot validate serde: com.savy3.hadoop.hive.serde2.cobol.CobolSerDe
Why does this error occur?
What is fb.length?

FileAlreadyExistsException occurred when I was exporting data from Presto with Hive engine on Amazon EMR

I tried to export data from one S3 bucket to another S3 bucket using Presto with the Hive engine on Amazon EMR (an ETL-like job), but a FileAlreadyExistsException occurred while exporting the data.
How can I export data using Presto?
Environments
emr-4.3.0
Hive 1.0.0
Presto-Sandbox 0.130
Error
I tried the following operation:
$ hive
hive> CREATE EXTERNAL TABLE logs(log string)
-> LOCATION 's3://foo-bucket/logs/';
hive> CREATE EXTERNAL TABLE s3_export(log string)
-> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
-> LOCATION 's3://foo-bucket/export/';
hive> exit;
$ presto-cli --catalog hive --schema default
presto:default> INSERT INTO s3_export SELECT log FROM logs;
Query 20160203_125741_00018_ba5sw, FAILED, 3 nodes
Splits: 7 total, 1 done (14.29%)
0:01 [39 rows, 4KB] [49 rows/s, 5.13KB/s]
Query 20160203_125741_00018_ba5sw failed: java.nio.file.FileAlreadyExistsException: /tmp
This is caused by the Presto Hive connector not liking the symlink /tmp that EMR (4.2 and 4.3) uses for hive.s3.staging-directory. You can use the Configuration API to override hive.s3.staging-directory and set it to /mnt/tmp/, like this:
classification=presto-connector-hive,properties=[hive.s3.staging-directory=/mnt/tmp/]
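For example, when creating the cluster with the AWS CLI, that classification can be supplied as a configurations JSON (the file name myconfig.json is made up for illustration; only the classification and property above are taken from this answer):

[
  {
    "Classification": "presto-connector-hive",
    "Properties": {
      "hive.s3.staging-directory": "/mnt/tmp/"
    }
  }
]

and passed with something like: aws emr create-cluster ... --configurations file://./myconfig.json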
I've resolved the problem with the following command:
presto-cli --catalog hive --schema default --execute 'select log from logs' | aws s3 cp - s3://foo-bucket/export/data.txt

Spring XD missing modules

I installed spring-xd-1.2.1.RELEASE and started Spring XD in single-node mode (xd-singlenode). When I type the following command
xd:>stream create --definition "time | log" --name ticktock --deploy
I get the following result:
Command failed org.springframework.xd.rest.client.impl.SpringXDException: Could not find module with name 'log' and type 'sink'
When I type the following command:
xd:> module list
I get the following result:
Source          Processor    Sink                  Job
gemfire                      gemfire-json-server   filejdbc
gemfire-cq                   gemfire-server        hdfsjdbc
jdbc                         jdbc                  jdbchdfs
kafka                        rabbit                sqoop
rabbit                       redis
twittersearch
twitterstream
Are some default modules missing? What happened? Is there any other configuration to set before starting Spring XD?
Check XD_HOME/modules/sink/log - does this folder exist?
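A quick way to verify, assuming XD_HOME points at the xd directory of the unpacked distribution (a sketch, not tied to a particular install path):

ls $XD_HOME/modules/sink/log
ls $XD_HOME/modules/source/time

If those directories are missing, the modules were not unpacked (or XD_HOME points at the wrong folder); re-extracting the spring-xd-1.2.1.RELEASE archive or correcting XD_HOME should bring the default modules back.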

Spring XD 1.1.0 - JDBC Source connection issues

I have installed Spring XD version 1.1.0 on a CentOS machine. Using xd-singlenode, I want to connect it to a SQL Server database via the jdbc source and put the data into a file.
I created some streams as follows:
1)xd:>stream create connectiontest --definition "jdbc --url=jdbc:sqlserver://sqlserverhost:1433/SampleDatabase --username=sample --password=***** --query= 'SELECT * FROM schema.tablename' |file" --deploy
2)xd:>stream create connectiontest --definition "jdbc --connectionProperties=jdbc:sqlserver://sqlserverhost:1433/SampleDatabase --username=sample --password=***** --initSQL= 'SELECT * FROM schema.tablename' |file" --deploy
Every time I deploy the stream, it gives the following error:
Command failed org.springframework.xd.rest.client.impl.SpringXDException: Multiple top level module resources found :file [/opt/pivotal/spring-xd-1.1.0.RELEASE/xd/config/jms-hornetq.properties],file [/opt/pivotal/spring-xd-1.1.0.RELEASE/xd/config/hadoop.properties],file [/opt/pivotal/spring-xd-1.1.0.RELEASE/xd/config/xd-admin-logger.properties],file [/opt/pivotal/spring-xd-1.1.0.RELEASE/xd/config/xd-singlenode-logger.properties],file [/opt/pivotal/spring-xd-1.1.0.RELEASE/xd/config/xd-container-logger.properties],file [/opt/pivotal/spring-xd-1.1.0.RELEASE/xd/config/jms-activemq.properties],file [/opt/pivotal/spring-xd-1.1.0.RELEASE/xd/config/httpSSL.properties]
Earlier I had set springxd_home pointing to my Spring XD directory. After removing that path, it is working fine now.
Thanks for the support.
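For anyone hitting the same error, the shape of the fix is roughly the following, assuming the stray setting was an exported XD_HOME-style environment variable (a sketch; the install path is taken from the error message above):

unset XD_HOME
/opt/pivotal/spring-xd-1.1.0.RELEASE/xd/bin/xd-singlenode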

Amazon EMR and Hive: Getting a "java.io.IOException: Not a file" exception when loading subdirectories to an external table

I'm using Amazon EMR.
I have some log data in S3, all in the same bucket but under different subdirectories, like:
"s3://bucketname/2014/08/01/abc/file1.bz"
"s3://bucketname/2014/08/01/abc/file2.bz"
"s3://bucketname/2014/08/01/xyz/file1.bz"
"s3://bucketname/2014/08/01/xyz/file3.bz"
I'm using:
Set hive.mapred.supports.subdirectories=true;
Set mapred.input.dir.recursive=true;
When trying to load all data from "s3://bucketname/2014/08/":
CREATE EXTERNAL TABLE table1(id string, at string,
custom struct<param1:string, param2:string>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucketname/2014/08/';
In return I get:
OK
Time taken: 0.169 seconds
When trying to query the table:
SELECT * FROM table1 LIMIT 10;
I get:
Failed with exception java.io.IOException:java.io.IOException: Not a file: s3://bucketname/2014/08/01
Does anyone have an idea on how to solve this?
It's an EMR-specific problem; here is what I got from Amazon support:
Unfortunately Hadoop does not recursively check the subdirectories of Amazon S3 buckets. The input files must be directly in the input directory or Amazon S3 bucket that you specify, not in sub-directories.
According to this document ("Are you trying to recursively traverse input directories?"), it looks like EMR does not support recursive directories at the moment. We are sorry about the inconvenience.
This works now (May 2018).
A global, EMR-wide fix is to set the following in the /etc/spark/conf/spark-defaults.conf file:
spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive true
hive.mapred.supports.subdirectories true
Or it can be fixed locally, as in the following pyspark code:
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .enableHiveSupport() \
    .config("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", "true") \
    .config("hive.mapred.supports.subdirectories", "true") \
    .getOrCreate()

spark.sql("<YourQueryHere>").show()
The problem is the way you have specified the location:
s3://bucketname/2014/08/
The Hive external table expects files to be present at this location, but it contains folders.
Try putting paths like
"s3://bucketname/2014/08/01/abc/,s3://bucketname/2014/08/01/xyz/"
You need to provide the path down to the files; one way to do that with a single table is sketched below.
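Since a Hive external table takes a single LOCATION, one common way to cover several leaf directories with one table is to make it partitioned and add each directory as a partition. A sketch based on the table above (the partition column name src is made up for illustration):

CREATE EXTERNAL TABLE table1(id string, at string,
  custom struct<param1:string, param2:string>)
PARTITIONED BY (src string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

ALTER TABLE table1 ADD PARTITION (src='abc') LOCATION 's3://bucketname/2014/08/01/abc/';
ALTER TABLE table1 ADD PARTITION (src='xyz') LOCATION 's3://bucketname/2014/08/01/xyz/';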
