How to Sqoop Import as JSON?

I am aware that Sqoop supports importing data as Avro, Parquet, text, etc. Is there a way to import data as JSON?
Using Spark is not an option for me at the moment.

Sqoop does not support importing data as JSON. You can import it as text files into HDFS and then parse it into JSON using Python or Scala.
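A minimal sketch of that approach, assuming a MySQL source; the host, database, credentials, table, and target directory below are placeholders:

# Import the table as plain text files (one delimited record per line) into HDFS.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --as-textfile \
  --fields-terminated-by '\t' \
  --target-dir /data/raw/orders

The resulting delimited files under /data/raw/orders can then be read by a short Python or Scala job that maps each row to a JSON object.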

Related

Storing data in HDFS as Parquet from Teradata with the TDCH jar (version 1.6)

I am trying to store data in HDFS as a Parquet file from Teradata with the help of the TDCH jar, but I am getting a connection exception: plugin "hdfs-parquet" not found.
How can I resolve this issue?
You won't be able to do that, since TDCH does not have that feature. Its Parquet support is for Hive, so you need a Hive table stored as Parquet and then use TDCH with the hive job type and the parquet file format.
If you want to store the data from Teradata to HDFS as Parquet, you would need to use Sqoop with a JDBC connection and the --as-parquetfile option, rather than the Cloudera (CLDR) or Hortonworks (HWX) Teradata wrapper.
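A rough sketch of that Sqoop route, assuming the Teradata JDBC driver jar is already on Sqoop's classpath; the host, database, credentials, table, and paths are placeholders:

# Import a Teradata table as Parquet over plain JDBC (no TDCH involved).
sqoop import \
  --connect jdbc:teradata://td-host/DATABASE=sales_db \
  --driver com.teradata.jdbc.TeraDriver \
  --username etl_user -P \
  --table ORDERS \
  --as-parquetfile \
  --target-dir /data/parquet/orders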

Import data to HDFS from AWS S3 using Sqoop

I am using distcp (for batch data) to get data from S3.
According to the Sqoop documentation, it should be possible to import from S3 to HDFS:
https://sqoop.apache.org/docs/1.99.7/user/examples/S3Import.html
I tried, but every time I get a connection error.
Can anyone tell me how to do this properly? Also, what can I do to automatically sync incremental data?
You may want to take a look at s3distcp instead. See https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/
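For example, on an EMR cluster an s3-dist-cp run might look roughly like this (the bucket, prefixes, and pattern are placeholders):

# Copy a batch of objects from S3 into HDFS using s3-dist-cp.
s3-dist-cp \
  --src s3://my-bucket/raw/events/ \
  --dest hdfs:///data/events/ \
  --srcPattern '.*\.gz'

For incremental data, one common approach is to schedule a run like this against date-partitioned prefixes (e.g. a new s3://my-bucket/raw/events/dt=YYYY-MM-DD/ each day); s3distcp itself does not do change tracking.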

Sqoop syntax to import to a Kudu table

We'd like to test Kudu and need to import data. Sqoop seems like the correct choice. I have found references saying that you can import to Kudu, but no specifics. Is there any way to import to Kudu using Sqoop?
Not at this time. See:
https://issues.apache.org/jira/browse/SQOOP-2903 - Add Kudu connector for Sqoop

How to convert a Parquet file to an Avro file?

I am new to Hadoop and big data technologies. I would like to convert a Parquet file to an Avro file and read that data. I searched a few forums, and they suggested using AvroParquetReader.
AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(file);
GenericRecord nextRecord = reader.read();
But I am not sure how to include AvroParquetReader; I am not able to import it at all. I can read this file using spark-shell and maybe convert it to some JSON, and then that JSON could be converted to Avro, but I am looking for a simpler solution.
If you are able to use Spark DataFrames, you will be able to read the parquet files natively in Apache Spark, e.g. (in Python pseudo-code):
df = spark.read.parquet(...)
To save the files, you can use the spark-avro Spark package. To write the DataFrame out as Avro, it would be something like:
df.write.format("com.databricks.spark.avro").save("...")
Don't forget that you will need to include the right version of the spark-avro Spark package for the version of your Spark cluster (e.g. 3.1.0-s2.11 corresponds to spark-avro package 3.1 built with Scala 2.11, which matches the default Spark 2.0 cluster). For more information on how to use the package, please refer to https://spark-packages.org/package/databricks/spark-avro.
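For instance, one way to pull the package in is via --packages when launching spark-shell, spark-submit, or pyspark; the coordinates below assume a Spark 2.0 / Scala 2.11 cluster and may differ for your setup:

# Launch spark-shell with the spark-avro package resolved onto the classpath.
spark-shell --packages com.databricks:spark-avro_2.11:3.1.0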
Some handy references include:
Spark SQL Programming Guide
spark-avro Spark Package.

Direct import from Oracle to Hadoop using Sqoop

I want to use the --direct parameter when I import data from Oracle. Is it possible to use the Data Pump utility with the --direct option? Do I need to install any Oracle utility on my machine? If so, please suggest what I need to install.
Unfortunately, there is no Sqoop connector that uses the Data Pump utility.
Oracle does have its own (closed-source) big data connectors; I believe SQL Loader for Hadoop uses the Data Pump format.
Oracle's Big Data Connector (Loader) is used to import data from Hadoop into Oracle, not from Oracle to Hadoop.
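For reference, a plain Sqoop import from Oracle looks roughly like the sketch below (connection details, schema, and table are placeholders). Adding --direct switches Sqoop to its bundled Data Connector for Oracle and Hadoop, which speeds up the transfer but still does not use Data Pump:

# Import an Oracle table over JDBC; --direct enables the bundled Oracle connector.
sqoop import \
  --connect jdbc:oracle:thin:@//orahost:1521/ORCL \
  --username scott -P \
  --table SCOTT.ORDERS \
  --target-dir /data/oracle/orders \
  --direct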
