Import data to HDFS from AWS S3 using Sqoop - hadoop

I am using distcp (for batch data) to get data from S3.
But according to the Sqoop website we can import from S3 to HDFS. I tried, but I get a connection build error every time:
https://sqoop.apache.org/docs/1.99.7/user/examples/S3Import.html
So, can anyone tell me how to do this properly?
Also, what can I do to get automatic syncing of incremental data?

You may want to take a look at s3distcp instead. See https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/
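For example, on an EMR cluster an S3DistCp invocation might look roughly like this (the bucket and HDFS path are placeholders, not from the original question):
s3-dist-cp --src s3://my-bucket/input/ --dest hdfs:///data/input/
Running a copy like this on a schedule against a prefix that only receives new files is one common way to approximate incremental syncing.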

Related

Migrate data from Snowflake to Elasticsearch

We are using the Snowflake data warehouse in my project, and we would like to replace Snowflake with Elasticsearch as part of a project enhancement POC.
I haven't found any solutions for moving data from Snowflake to Elasticsearch.
Can anyone help me resolve the above?
Please share sufficient information, steps, etc.
Thanks in advance.
I haven't found any clues on data migration.
You can try to do it in 2 steps:
export data from Snowflake to an AWS S3 bucket
load data from the AWS S3 bucket into Elasticsearch.
You need to implement the migration at the schema level. Also, if you describe the specific issues you are hitting, it will be easier to answer and guide you.
You can use the COPY command to export data from Snowflake to files that can then be loaded into another system. However, I am curious why you are trying to replace Snowflake with Elasticsearch, as these are two different technologies serving very different functions.
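As a rough sketch, unloading a table to S3 with COPY might look like this (the table name, bucket path, credentials, and file format options are placeholders):
COPY INTO 's3://my-bucket/snowflake-unload/'
  FROM my_table
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
  FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
  MAX_FILE_SIZE = 50000000;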
You can export your data from Snowflake to S3 with the COPY command.
Export in multiple parts so your S3 bucket has small files.
Then you can hook a Lambda onto the S3 PUT Object event, so a Lambda will trigger on each file upload.
You can write code in your Lambda to make REST calls to Elasticsearch.
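A minimal Python sketch of such a Lambda, assuming each exported file contains one JSON document per line and an Elasticsearch endpoint reachable over HTTPS (the URL and index name below are placeholders):
import json
import urllib.request

import boto3

# Placeholder Elasticsearch bulk endpoint (cluster URL and index are assumptions)
ES_BULK_URL = "https://my-es-host.example.com/my-index/_bulk"

s3 = boto3.client("s3")

def handler(event, context):
    # Triggered by the S3 PUT event; fetch the newly uploaded export file
    record = event["Records"][0]["s3"]
    obj = s3.get_object(Bucket=record["bucket"]["name"], Key=record["object"]["key"])
    lines = obj["Body"].read().decode("utf-8").splitlines()

    # Build an ndjson bulk body: one index action per exported row
    body = ""
    for line in lines:
        if line.strip():
            body += json.dumps({"index": {}}) + "\n" + line + "\n"

    req = urllib.request.Request(
        ES_BULK_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return {"status": resp.status}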

Pushing 500 million+ records to Oracle DB from EMR using Spark

We have an EMR cluster where the data resides in Parquet format in the Hive warehouse (HDFS). We are trying to push this data (>500M records) to Oracle RDS. The job times out after running for 3+ hours, and at times we also see a bad DataNode exception. We are using the df.write.jdbc() command to export the data to the Oracle DB, which is in the same/a different VPC. We need some help troubleshooting this issue and deciding how to take it forward. We have tried to observe the SQL trace at the Oracle end but couldn't find anything wrong.
Since the data resides in Parquet format, the export doesn't work with the Sqoop command.
However, when we export the data to CSV (format: com.databricks.spark.csv) using df.save() and then run the s3-distcp command, we are able to export the entire volume of data in ~20 minutes.
Any pointers in this regard would be highly appreciated.
Thanks in advance!
Regards,
Manish
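For comparison, a hedged PySpark sketch of the JDBC write with explicit repartitioning and batch size, which are the usual knobs for this path (the connection URL, table, credentials, and tuning values are placeholders, not a verified fix):
# Assumes an existing SparkSession `spark` and the Oracle JDBC driver on the classpath
df = spark.read.parquet("hdfs:///warehouse/my_table")  # hypothetical source path

(df.repartition(64)                     # caps the number of concurrent Oracle connections
   .write
   .format("jdbc")
   .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL")  # placeholder URL
   .option("dbtable", "TARGET_SCHEMA.TARGET_TABLE")
   .option("user", "...")
   .option("password", "...")
   .option("driver", "oracle.jdbc.OracleDriver")
   .option("batchsize", 10000)          # rows per JDBC batch insert
   .mode("append")
   .save())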

Hive Create table over S3 in RIAK CS

I have the Hive service running on a Hadoop cluster. I'm trying to create a Hive table over Eucalyptus (Riak CS) S3 data. I have configured the AccessKeyID and SecretAccessKey in core-site.xml and hive-site.xml. When I execute the CREATE TABLE command and specify the S3 location using the s3n scheme, I get the error below:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:org.apache.http.conn.ConnectTimeoutException: Connect to my-bucket.s3.amazonaws.com:443 timed out)
If I try using the s3a scheme, I get the error below:
FAILED: AmazonClientException Unable to load AWS credentials from any provider in the chain
I could change the endpoint URL for the distcp command using JetS3t, but the same didn't work for Hive. Any suggestions for pointing Hive to the Eucalyptus S3 endpoint are welcome.
I'm not familiar with Hive, but as far as I know it uses MapReduce as its backend processing system. MapReduce uses JetS3t as its S3 connector, and changing its configuration worked for me in both MapReduce and Spark. Hope this helps: http://qiita.com/kuenishi/items/71b3cda9bbd1a0bc4f9e
Configuration like
s3service.https-only=false
s3service.s3-endpoint=yourdomain.com
s3service.s3-endpoint-http-port=8080
s3service.s3-endpoint-https-port=8080
might work for you.
I have upgraded to HDP 2.3 (Hadoop 2.7) and I am now able to configure the s3a scheme for Hive-to-S3 access.
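For reference, a rough sketch of the core-site.xml settings involved for s3a against a non-AWS endpoint; the property names follow the Hadoop 2.7 s3a documentation, while the endpoint host, port, and SSL setting are placeholders for a Riak CS/Eucalyptus gateway:
fs.s3a.access.key=YOUR_ACCESS_KEY
fs.s3a.secret.key=YOUR_SECRET_KEY
fs.s3a.endpoint=riak-cs.example.com:8080
fs.s3a.connection.ssl.enabled=false
Each of these goes into core-site.xml (or hive-site.xml) as a <property> entry.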

MIT StarCluster and S3

I am trying to run a MapReduce job on spot instances.
I launch my instances using StarCluster and its Hadoop plugin. I have no problem uploading the data, putting it into HDFS, and copying the result back from HDFS.
My question is: is there a way to load the data directly from S3 and push the result back to S3? (I don't want to manually download the data from S3 to HDFS and push the result from HDFS to S3; is there a way to do it in the background?)
I am using the standard MIT StarCluster AMI.
You cannot do it directly, but you can write a script to do that.
For example, you can use:
hadoop distcp s3n://ID:key@mybucket/file /user/root/file
to put the file into HDFS directly from S3.
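If you'd rather not embed the keys in the URI, the same copy can usually be done by passing them as Hadoop properties instead (bucket and paths are placeholders):
hadoop distcp -Dfs.s3n.awsAccessKeyId=YOUR_KEY -Dfs.s3n.awsSecretAccessKey=YOUR_SECRET s3n://mybucket/file /user/root/file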

S3 to local hdfs data transfer using Sqoop

I have installed Apache Hadoop on my local system and want to import data from Amazon S3 using Sqoop.
Is there any way to achieve this?
If yes, kindly help me understand how I can achieve it.
Examples would be much appreciated.
Please help me as soon as possible.
Note: I am not using Amazon EMR.
Sqoop is for getting data from relational databases only at the moment.
Try using distcp to get data from S3.
The usage is documented here: http://wiki.apache.org/hadoop/AmazonS3 in the section "Running bulk copies in and out of S3".
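For example, after setting fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey in core-site.xml, a copy into your local HDFS might look like this (bucket and destination path are placeholders):
hadoop distcp s3n://mybucket/input /user/hadoop/input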
