The output of Hive on EMR is a file named 000000_0 (perhaps a different number if there is more than 1 reducer).
How do I get this file to be named differently? I see two options:
1) Get Hive to write it differently
2) Rename the file(s) in S3 after they are written. This could be a problem: from what I've read, S3 doesn't really have a "rename". You have to copy the object and delete the original. When dealing with a file that is 1 TB in size, for example, could this cause performance problems or increase usage cost?
The AWS Command Line Interface (CLI) has a convenient mv command, which performs the copy and the delete for you, that you could add to a script:
aws s3 mv s3://my-bucket/000000_0 s3://my-bucket/data1
Or, you could do it programmatically via the Amazon S3 CopyObject API call, followed by a delete of the original object.
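As a rough sketch of the programmatic route in Python: this assumes boto3 and uses its managed copy method, which, as far as I know, switches to a multipart copy for large objects (a single CopyObject request is limited to 5 GB). The function name rename_object is made up for illustration, and the client is passed in as a parameter so the logic can be tried against a stub.

```python
def rename_object(s3_client, bucket, old_key, new_key):
    """Emulate a rename in S3: copy the object to the new key, then delete the original.

    `s3_client` is assumed to be a boto3 S3 client, i.e. boto3.client("s3").
    """
    if old_key == new_key:
        return  # nothing to do; deleting here would destroy the object
    # Server-side copy: the data is not downloaded to the machine running this.
    s3_client.copy(
        CopySource={"Bucket": bucket, "Key": old_key},
        Bucket=bucket,
        Key=new_key,
    )
    # Reached only if the copy call returned without raising.
    s3_client.delete_object(Bucket=bucket, Key=old_key)
```

With a real client this would be called as rename_object(boto3.client("s3"), "my-bucket", "000000_0", "data1"), reusing the placeholder names from the CLI example.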
I have a bash script file where each line is a curl command to download a file. This bash file is in a Google bucket.
I would like to execute the file either directly from the storage and copy its downloaded contents there or execute it locally and directly copy its content to the bucket.
Basically, I do not want to have these files on my local machine. I have tried things along these lines, but it either failed or simply downloaded everything locally.
gsutil cp gs://bucket/my_file.sh - | bash gs://bucket/folder_to_copy_to/
Thank you!
To do so, the bucket needs to be mounted on the pod (the pod would see it as a directory).
If the bucket supports NFS, you would be able to mount it as shown here.
Also, there is another way as shown in this question.
Otherwise, you would need to copy the script to the pod, run it, then upload the generated files to the bucket, and lastly clean everything up.
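The copy, run, upload, clean-up sequence could be sketched in Python roughly as follows. It assumes bash and gsutil are available wherever this runs, that the script writes its downloads into the current working directory, and that gsutil rsync is an acceptable way to upload; run_and_upload is a made-up name.

```python
import os
import subprocess
import tempfile


def run_and_upload(script_path, dest_uri, run=subprocess.run):
    """Run a download script in a scratch directory, upload the results, clean up.

    `run` defaults to subprocess.run; it is a parameter only so the flow can
    be exercised without bash or gsutil being installed.
    """
    script_path = os.path.abspath(script_path)  # cwd changes below
    # The directory and everything in it is removed when the block exits.
    with tempfile.TemporaryDirectory() as workdir:
        # Downloads land in the scratch directory, not next to the script.
        run(["bash", script_path], cwd=workdir, check=True)
        # Mirror the scratch directory into the bucket folder.
        run(["gsutil", "-m", "rsync", "-r", workdir, dest_uri], check=True)
```

The files still touch the local disk of the pod for the duration of the run, but nothing is left behind afterwards.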
The better option is to use a filestore which can be easily mounted using CSI drivers as mentioned here.
I have one s3 bucket which contains zip files.
I have a script which downloads the zip files to my local server, unzips them, and uploads them to another S3 bucket.
How can I set a watermark so that I know which file I downloaded last, so I won't need to keep all the files locally or download all the files each time the script runs?
I'm using the aws s3 sync command which, as far as I understand, should copy only new files. Am I right?
aws s3 sync $gcs3$gcRegion/$gcTech/$gcPrinterFamily/$gcPrinterType/$gcPrinterName/ $dir
The AWS Command-Line Interface (CLI) aws s3 sync command copies any files that are missing from the destination, or that differ from the destination copy in size or have a newer modification time at the source.
So, you either need to keep all previously-downloaded files, or you need another way to keep track of the files that were downloaded.
Instead, I would recommend writing your own program that:
1) Downloads all files from the S3 bucket with a LastModified timestamp after a stored timestamp
2) Stores the current time
3) Unzips the files and copies them to the other S3 bucket
4) Deletes the zip files and unzipped files
So, the program will need to remember the last time it downloaded files, but it will not need to remember which files it downloaded. Be careful — S3 stores time in UTC, so you'll need to convert your timezones. Or, simply remember the highest LastModified value of the files you downloaded.
To obtain a list of files since a certain LastModified date, you could use the AWS CLI:
aws s3api list-objects --bucket jstack-a --query "Contents[?LastModified>='2019-04-11'].[Key]" --output text
However, I would recommend writing a Python program for the above activities, since it would be easier than writing command-line scripts.
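A minimal sketch of such a program, assuming boto3 and using the "highest LastModified seen" variant of the watermark. All function names are illustrative, the watermark is kept in a small local text file, and each archive is unzipped in memory, which is only reasonable for modestly sized zip files. The client is a parameter, so it would be boto3.client("s3") in real use.

```python
import io
import zipfile
from datetime import datetime, timezone


def load_watermark(path):
    """Read the stored timestamp, or return the epoch (UTC) on the first run."""
    try:
        with open(path) as fh:
            return datetime.fromisoformat(fh.read().strip())
    except FileNotFoundError:
        return datetime(1970, 1, 1, tzinfo=timezone.utc)


def save_watermark(path, value):
    """Persist the timestamp in ISO 8601 form, including its UTC offset."""
    with open(path, "w") as fh:
        fh.write(value.isoformat())


def objects_after(s3_client, bucket, prefix, watermark):
    """Yield (key, last_modified) for objects modified strictly after `watermark`."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["LastModified"] > watermark:
                yield obj["Key"], obj["LastModified"]


def process_new_zips(s3_client, src_bucket, prefix, dst_bucket, watermark):
    """Unzip every new archive into dst_bucket and return the new watermark."""
    newest = watermark
    for key, modified in objects_after(s3_client, src_bucket, prefix, watermark):
        body = s3_client.get_object(Bucket=src_bucket, Key=key)["Body"].read()
        with zipfile.ZipFile(io.BytesIO(body)) as archive:
            for name in archive.namelist():
                if name.endswith("/"):
                    continue  # skip directory entries inside the archive
                s3_client.put_object(
                    Bucket=dst_bucket, Key=name, Body=archive.read(name)
                )
        newest = max(newest, modified)
    return newest
```

A run would then load the watermark, call process_new_zips, and save the returned value only after everything succeeded, so a failed run is simply retried from the old watermark.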
I need some help. I am downloading a file from a webpage using Python code, placing it in the local file system, transferring it into HDFS using the put command, and then performing operations on it.
But there might be situations where the file is very large, and downloading it to the local file system is not the right procedure. So I want the file to be downloaded directly into HDFS without using the local file system at all.
Can anyone suggest some methods, and which one would be the best way to proceed?
If there are any errors in my question, please correct me.
You can pipe it directly from a download to avoid writing it to disk, e.g.:
curl server.com/my/file | hdfs dfs -put - destination/file
The - parameter to -put tells it to read from stdin (see the documentation).
This will still route the download through your local machine, though, just not through your local file system. If you want to download the file without using your local machine at all, you can write a map-only MapReduce job whose tasks accept, e.g., an input file containing a list of files to be downloaded, and then download them and stream out the results. Note that this will require your cluster to have open access to the internet, which is generally not desirable.
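Since the question mentions Python, the same pipe can be set up from a script. This is only a sketch: it assumes the hdfs command is on the PATH, the function names are made up, and the data still passes through the memory of the machine running it, just as with the shell pipe.

```python
import shutil
import subprocess
import urllib.request


def stream_to_command(source, command):
    """Copy a binary file-like object into the stdin of `command`; return its exit code."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE)
    try:
        # Copies in fixed-size chunks, so the whole file is never held in memory.
        shutil.copyfileobj(source, proc.stdin)
    finally:
        proc.stdin.close()  # signals end of input to the command
    return proc.wait()


def download_to_hdfs(url, hdfs_path):
    """Stream `url` into HDFS via `hdfs dfs -put -` without a local temp file."""
    with urllib.request.urlopen(url) as response:
        return stream_to_command(response, ["hdfs", "dfs", "-put", "-", hdfs_path])
```

A non-zero return value means the put command failed, so the caller should check it before working with the file.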
I'm running dataFrame.rdd.saveAsTextFile("/home/hadoop/test") in an attempt to write a data frame to disk. This executes with no errors, but the folder is not created. Furthermore, when I run the same command again (in the shell) an Exception is thrown:
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://ip-xxx-xx-xx-xx.ec2.internal:8020/home/hadoop/feet already exists
Any idea why this is? Is there a nuance of the submission mode (client, cluster) that affects this?
EDIT:
I have permission to create directories in /home/hadoop but I cannot create directories inside any of the dirs/sub-dirs created by rdd.saveAsTextFile("file:/home/hadoop/test"). The structure looks like this:
/home/hadoop/test/_temporary/0
How are _temporary and 0 being created if I do not have permission to create directories inside test from the command line? Is there a way to change the permission of these created directories?
Edit2:
In the end I wrote to s3 instead using rdd.coalesce(1).saveAsTextFile("s3://..."). This is only viable if you have a very small output - because coalesce(n) will cause the RDD to exist and be processed further on only n workers. In my case, I chose 1 worker so that the file would be generated by one worker. This gave me a folder containing one part-00000 file which had all of my data.
Since SPARK-1100 (https://spark-project.atlassian.net/browse/SPARK-1100), saveAsTextFile should never be able to silently overwrite an already existing folder.
If you receive a java.io.IOException: Mkdirs failed to create file:... it probably means you have permission problems when trying to write to the output path.
If you give more context info the answers could be more helpful.
Like: are you running on local shell? cluster shell? which type of cluster?
EDIT: I think you are facing that error because all executors are trying to write to the same path, which isn't available on all executors.
saveAsTextFile works. It writes to the default file system (configured by fs.default.name in your core-site.xml; newer Hadoop versions call this property fs.defaultFS). In this case the default file system is hdfs://ip-xxx-xx-xx-xx.ec2.internal:8020/.
If you want to write to local disk, use saveAsTextFile("file:/home/hadoop/test"). If you have more than one node in the Spark cluster, the results will be mostly unusable: each node will write some parts of the RDD to local disk. But for testing this may be okay.
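To make the scheme rule concrete, here is a toy model in Python of where a path ends up. It is not Hadoop's actual resolution code, it only handles absolute paths, and the function name qualify and the s3://bucket/out value are invented for illustration.

```python
from urllib.parse import urlparse


def qualify(path, default_fs):
    """Return the location an absolute path refers to, given the default file system.

    Simplified model: a path that already carries a scheme (file:, hdfs:, s3:)
    is left alone; a bare path is resolved against `default_fs`.
    """
    if urlparse(path).scheme:
        return path
    return default_fs.rstrip("/") + "/" + path.lstrip("/")


default_fs = "hdfs://ip-xxx-xx-xx-xx.ec2.internal:8020/"
print(qualify("/home/hadoop/test", default_fs))       # resolved against HDFS
print(qualify("file:/home/hadoop/test", default_fs))  # stays on local disk
```

This is why the folder never shows up under /home/hadoop on the local disk: the bare path was resolved against HDFS.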
I have a Pig script (using a slightly modified MultiStorage) that transforms some data. Once the script runs, I have data in the following format on HDFS:
/tmp/data/identifier1/indentifier1-0,0001
/tmp/data/identifier1/indentifier1-0,0002
/tmp/data/identifier2/indentifier2-0,0001
/tmp/data/identifier3/indentifier3-0,0001
I'm attempting to use S3DistCp to copy these files to S3. I am using the --groupBy .*(identifier[0-9]).* option to combine files based on the identifier. The combination works, but when copying to S3, the folders are also copied. The end output is:
/s3bucket/identifier1/identifier1
/s3bucket/identifier2/identifier2
/s3bucket/identifier3/identifier3
Is there a way to copy these files without that first folder? Ideally, my output in S3 would look like:
/s3bucket/identifier1
/s3bucket/identifier2
/s3bucket/identifier3
Another solution I've considered is to use HDFS commands to pull those files out of their directories before copying to S3. Is that a reasonable solution?
Thanks!
The solution I've arrived at is to use distcp to bring these files out of the directories before using s3distcp:
hadoop distcp -update /tmp/data/** /tmp/grouped
Then, I changed the s3distcp script to move data from /tmp/grouped into my S3 bucket.
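For reference, the grouping that the --groupBy pattern from the question performs can be mimicked in a few lines of Python. This is only a model of the behaviour described above (files with the same capture-group value are combined), not S3DistCp's own code, and group_paths is a made-up helper. The paths are copied from the question.

```python
import re
from collections import defaultdict


def group_paths(paths, pattern):
    """Group paths by the first capture group of `pattern`, skipping non-matches."""
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for path in paths:
        match = regex.fullmatch(path)
        if match:
            groups[match.group(1)].append(path)
    return dict(groups)


paths = [
    "/tmp/data/identifier1/indentifier1-0,0001",
    "/tmp/data/identifier1/indentifier1-0,0002",
    "/tmp/data/identifier2/indentifier2-0,0001",
    "/tmp/data/identifier3/indentifier3-0,0001",
]
grouped = group_paths(paths, r".*(identifier[0-9]).*")
for name in sorted(grouped):
    print(name, len(grouped[name]))
```

Here the two identifier1 files fall into one group and the other two identifiers get one file each, which matches the combined outputs listed in the question.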
Using distcp before s3distcp is really expensive. One other option you have is to create a manifest file with all your files in it and give its path to s3distcp. In this manifest you can define the "base name" of each file. If you need an example of a manifest file just run s3distcp on any folder with argument --outputManifest.
More information can be found here.