I have installed Apache Hadoop on my local system and want to import data from Amazon S3 using Sqoop.
Is there any way to achieve this?
If yes, kindly help me understand how I can achieve it.
Examples would be much appreciated.
Please help me as soon as possible.
Note: I am not using Amazon EMR.
Sqoop is for getting data from relational databases only at the moment.
Try using "distcp" for getting data from S3.
The usage is documented here: http://wiki.apache.org/hadoop/AmazonS3, in the section "Running bulk copies in and out of S3".
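For concreteness, here is a minimal sketch of driving that kind of distcp copy from Python; the bucket name, paths, and credentials are placeholders, and newer Hadoop releases use the s3a:// connector while the older wiki examples show s3n://.

    import subprocess

    # Minimal sketch: pull an S3 prefix into HDFS with distcp.
    # The bucket name, paths and keys are placeholders; newer Hadoop releases
    # use the s3a:// connector (older docs show s3n://).
    cmd = [
        "hadoop", "distcp",
        "-Dfs.s3a.access.key=YOUR_ACCESS_KEY",
        "-Dfs.s3a.secret.key=YOUR_SECRET_KEY",
        "s3a://my-bucket/input/",       # source: S3 prefix to copy
        "hdfs:///user/hadoop/input/",   # destination: HDFS directory
    ]
    subprocess.run(cmd, check=True)     # raises CalledProcessError if distcp fails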
We are using the Snowflake data warehouse in my project, and we would like to replace Snowflake with Elasticsearch as part of a project enhancement POC.
I haven't found any solutions for moving data from Snowflake to Elasticsearch.
Can anyone help me resolve the above concern?
Please share sufficient information, steps, etc.
Thanks in advance
I haven't found any clues on the data migration.
You can try to do it in 2 steps:
export the data from Snowflake to an AWS S3 bucket
load the data from the S3 bucket into Elasticsearch.
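For the first step, here is a minimal sketch using the Snowflake Python connector to unload a table to an external S3 stage as newline-delimited JSON; the account, credentials, stage (@my_s3_stage), and table names are placeholders, and the stage is assumed to already point at your S3 bucket.

    import snowflake.connector

    # Sketch only: account, credentials, stage and table names are placeholders.
    conn = snowflake.connector.connect(
        account="my_account",
        user="my_user",
        password="my_password",
        warehouse="my_wh",
        database="my_db",
        schema="PUBLIC",
    )
    try:
        # Unload each row as a JSON object; by default the unload is split into
        # multiple files, which keeps the objects in the bucket small.
        conn.cursor().execute("""
            COPY INTO @my_s3_stage/export/
            FROM (SELECT OBJECT_CONSTRUCT(*) FROM my_table)
            FILE_FORMAT = (TYPE = JSON COMPRESSION = NONE)
            MAX_FILE_SIZE = 16000000
        """)
    finally:
        conn.close()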
You need to implement the migration at the schema level. Also, if you make the question more specific about the issues you are hitting, it will be easier to answer and guide you.
You can use the COPY command to export data from Snowflake to a file that can then be loaded into another system. However, I am curious to know why you are trying to replace Snowflake with Elasticsearch, as these are two different technologies serving very different functions.
You can export your data from Snowflake to S3 with the COPY command.
Export in multiple parts so your S3 bucket has small files.
Then you can hook a Lambda to the S3 PutObject event, so a Lambda will trigger on each file upload.
You can write code in your Lambda to make REST calls to Elasticsearch.
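A rough sketch of such a Lambda handler, assuming the exported files are newline-delimited JSON and that the Elasticsearch endpoint and index name (placeholders below) are reachable from the function:

    import json
    import urllib.parse
    import urllib.request

    import boto3

    s3 = boto3.client("s3")

    # Placeholder endpoint and index; point this at your own cluster.
    ES_BULK_URL = "https://my-es-endpoint:9200/my_index/_bulk"


    def handler(event, context):
        """Triggered by S3 PutObject; bulk-indexes each line of the new file."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

            # Build an Elasticsearch _bulk payload: one action line per document.
            lines = []
            for doc in body.splitlines():
                if doc.strip():
                    lines.append(json.dumps({"index": {}}))
                    lines.append(doc)
            payload = ("\n".join(lines) + "\n").encode("utf-8")

            req = urllib.request.Request(
                ES_BULK_URL,
                data=payload,
                headers={"Content-Type": "application/x-ndjson"},
                method="POST",
            )
            urllib.request.urlopen(req)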
I am using distcp (for batch data) to get data from S3.
But according to the Sqoop website, we can import from S3 to HDFS. I tried it, but I get a connection build error every time:
https://sqoop.apache.org/docs/1.99.7/user/examples/S3Import.html
So, can anyone tell me how I can do this properly?
Also, what can I do to get automatic syncing of incremental data?
You may want to take a look at s3distcp instead. See https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/
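For reference, a minimal sketch of driving it from a script; the paths and pattern are placeholders, and s3-dist-cp ships with EMR (on a plain Hadoop install you would fall back to hadoop distcp with an s3a:// source).

    import subprocess

    # Sketch only: paths and the file pattern are placeholders. s3-dist-cp is
    # available on EMR nodes; schedule this (e.g. from cron or a workflow tool)
    # to pick up newly arrived objects as a simple form of incremental sync.
    subprocess.run(
        [
            "s3-dist-cp",
            "--src", "s3://my-bucket/incoming/",
            "--dest", "hdfs:///user/hadoop/incoming/",
            "--srcPattern", ".*\\.parquet",   # optional: copy only matching files
        ],
        check=True,
    )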
We have an EMR cluster where the data resides in Parquet format in the Hive warehouse (HDFS). We are trying to push this data (>500m records) to Oracle RDS. The job times out after running for 3+ hours. At times we also see a bad datanode exception. We are using the df.write.jdbc() command to export the data to an Oracle DB that is in the same/different VPC. We need some help troubleshooting this issue and deciding how to take it forward. We have tried to observe the SQL trace at the Oracle end but couldn't find anything wrong.
Since the data resides in Parquet format, the export doesn't work with the Sqoop command.
However, when we export the data to CSV files (csv format -> com.databricks.spark.csv) using df.save() and then run the s3-distcp command, we are able to move this entire volume of data in ~20 minutes.
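Schematically, the kind of df.write.jdbc() export being described looks like the sketch below; the URL, credentials, table names, and the repartition/batchsize values are placeholder assumptions rather than the actual job settings, but those two knobs are usually the ones that matter at this volume.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hive-to-oracle").getOrCreate()

    # Placeholder Hive table backed by Parquet files in the warehouse.
    df = spark.table("warehouse_db.my_parquet_table")

    # Each partition opens its own JDBC connection, and batchsize controls how
    # many rows go into each batched INSERT. Values below are illustrative.
    (df.repartition(64)
       .write
       .format("jdbc")
       .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB")
       .option("dbtable", "TARGET_SCHEMA.TARGET_TABLE")
       .option("user", "app_user")
       .option("password", "app_password")
       .option("driver", "oracle.jdbc.OracleDriver")
       .option("batchsize", 10000)
       .mode("append")
       .save())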
Any pointers in this regard would be highly appreciated.
Thanks in advance!
Regards
Manish
I want a way other than Sqoop to transfer data from an RDBMS to HDFS. Please give me a clue.
Also, can anyone please explain to me what the relation between Hive and Sqoop is?
Adding to dev ツ's answers, you have one more tool called StreamSets Data Collector, which helps you get data from MySQL to HDFS by creating a JDBC connection.
I was reading the integration below for using Hive to query data on DynamoDB.
http://aws.typepad.com/aws/2012/01/aws-howto-using-amazon-elastic-mapreduce-with-dynamodb.html
But as per that link, Hive needs to be set up on top of EMR. I wanted to know if I can use this integration with the standalone Hadoop cluster I already have, instead of using EMR. Has anyone done this? Will there be sync issues between the data in DynamoDB and HDFS compared to using EMR?
To be able to use it on your own cluster, you would need the custom StorageHandler for DynamoDB (it probably involves a custom SerDe as well).
It does not seem to be available at the moment, at least not on the AWS website.
What you can do is use the JDBC interface provided by Amazon to issue the queries from your cluster, but they would still be executed on top of EMR.