How to make Hadoop Distcp copy custom list of folders? - hadoop

I'm looking for an efficient way to sync a list of directories from one Hadoop filesystem to another with the same directory structure.
For example, let's say HDFS1 is the official source where data is created, and once a week we need to copy newly created data under all data-2 directories to HDFS2:
**HDFS1**
hdfs://namenode1:port/repo/area-1/data-1
hdfs://namenode1:port/repo/area-1/data-2
hdfs://namenode1:port/repo/area-1/data-3
hdfs://namenode1:port/repo/area-2/data-1
hdfs://namenode1:port/repo/area-2/data-2
hdfs://namenode1:port/repo/area-3/data-1
**HDFS2** (subset of HDFS1 - only data-2)
hdfs://namenode2:port/repo/area-1/data-2
hdfs://namenode2:port/repo/area-2/data-2
In this case we have 2 directories to sync:
/repo/area-1/data-2
/repo/area-2/data-2
This can be done by:
hadoop distcp hdfs://namenode1:port/repo/area-1/data-2 hdfs://namenode2:port/repo/area-1
hadoop distcp hdfs://namenode1:port/repo/area-2/data-2 hdfs://namenode2:port/repo/area-2
This will run 2 Hadoop jobs. If the number of directories is large, say 500 different non-overlapping directories under hdfs://namenode1:port/, this will create 500 Hadoop jobs, which is obvious overkill.
Is there a way to inject a custom directory list into distcp?
How can I make distcp create one job that copies all paths in a custom list of directories?

Not sure if this answers the question, but I noticed you haven't used the "-update" option. With "-update", distcp copies only the files that are missing on the target or that differ from the source, instead of copying everything again.
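A minimal sketch of how the -update suggestion could be applied to a directory list. It only prints the commands (a dry run) and still produces one job per directory, so it does not solve the "single job" part of the question. The function name print_sync_commands is made up for illustration, and the namenode URIs are the placeholders from the question.

```shell
# Placeholders copied from the question; replace with real namenode addresses.
SRC_NN="hdfs://namenode1:port"
DST_NN="hdfs://namenode2:port"

# Hypothetical helper: read one absolute directory path per line on stdin
# and print (not run) a "distcp -update" command for each of them.
print_sync_commands() {
  while IFS= read -r dir; do
    [ -n "$dir" ] || continue   # skip blank lines
    # With -update the contents of the source directory are synced into the
    # target directory, so the same path is used on both sides.
    printf 'hadoop distcp -update %s%s %s%s\n' "$SRC_NN" "$dir" "$DST_NN" "$dir"
  done
}

printf '%s\n' /repo/area-1/data-2 /repo/area-2/data-2 | print_sync_commands
```

DistCp also documents a -f <urilist_uri> option that reads the source list from a file. Whether it fits here depends on how the listed sources map onto a single target directory, which this sketch does not address.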

Related

DistCP - Even simple copies result in CRC Exceptions

I'm running into an issue using distcp to copy files - every copy fails with an IO Exception (Checksum mismatch), even if performing a simple copy within the cluster (i.e. hadoop distcp -pbugctrx /foo/bar /foo/baz).
If forced to complete the copy using -skipcrccheck, I can see that the checksum is different ( hdfs dfs -checksum ), but that this isn't being caused by a difference in the actual source data (hdfs dfs -cat | md5sum returns matching checksums for source and destination).
I'm leery of disabling a data integrity check if I don't need to. Is there a better way to address this failing check than just ignoring it?
The source and target may be in different encryption zones. In that case the checksum comparison will also fail.
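The content comparison described in the question (hdfs dfs -cat piped to md5sum) can be wrapped in a small script. This is only a sketch: content_md5 and same_content are hypothetical helper names, and the READER variable is an assumption added so the logic can be tried on local files without a cluster.

```shell
# Hypothetical helper: print the MD5 digest of whatever arrives on stdin.
content_md5() {
  md5sum | awk '{ print $1 }'
}

# Hypothetical helper: succeed (exit 0) when both paths have identical content.
# READER defaults to "hdfs dfs -cat"; set READER=cat to compare local files.
same_content() {
  reader=${READER:-hdfs dfs -cat}
  # $reader is intentionally unquoted so "hdfs dfs -cat" splits into words.
  [ "$($reader "$1" | content_md5)" = "$($reader "$2" | content_md5)" ]
}
```

For example, same_content /foo/bar/part-0 /foo/baz/part-0 && echo "content matches" would confirm that the bytes agree even when hdfs dfs -checksum reports different values (the file names here are illustrative).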

Merging small files into single file in hdfs

In an HDFS cluster, I receive multiple files on a daily basis, which can be of 3 types:
1) product_info_timestamp
2) user_info_timestamp
3) user_activity_timestamp
Any number of files can be received, but each will belong to one of these 3 categories.
I want to merge all the files belonging to one category (after checking that each is smaller than 100 MB) into a single file.
For example, 3 files named product_info_* should be merged into one file named product_info.
How do I achieve this?
You can use getmerge to achieve this, but the result will be stored on your local node (edge node), so you need to be sure you have enough space there.
hadoop fs -getmerge /hdfs_path/product_info_* /local_path/product_inf
You can then move the merged file back to HDFS with put:
hadoop fs -put /local_path/product_inf /hdfs_path
You can use hadoop archive (.har file) or sequence file. It is very simple to use - just google "hadoop archive" or "sequence file".
Another set of commands along similar lines to those suggested by @SCouto:
hdfs dfs -cat /hdfs_path/product_info_* > /local_path/product_info_combined.txt
hdfs dfs -put /local_path/product_info_combined.txt /hdfs_path/
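None of the commands above performs the "less than 100 MB" check from the question. A minimal sketch of that filter follows, assuming the usual eight-column output of hdfs dfs -ls (permissions, replication, owner, group, size, date, time, path) and paths without spaces. The helper name small_files and the LIMIT variable are made up for illustration.

```shell
# Hypothetical helper: read "hdfs dfs -ls" output on stdin and print the paths
# of regular files whose size is below LIMIT bytes (default 100 MB).
small_files() {
  # $1 ~ /^-/ keeps regular files only; it also skips the "Found N items"
  # header line and directory entries (which start with "d").
  awk -v limit="${LIMIT:-104857600}" '$1 ~ /^-/ && $5 < limit { print $8 }'
}
```

It could then feed the merge without using local disk, for example:
hdfs dfs -ls '/hdfs_path/product_info_*' | small_files | xargs hdfs dfs -cat | hdfs dfs -put - /hdfs_path/product_info
The data still streams through the client node, so this avoids the local disk requirement of getmerge but not the network cost.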

Writing Spark dataframe as parquet to S3 without creating a _temporary folder

Using pyspark I'm reading a dataframe from parquet files on Amazon S3 like
dataS3 = sql.read.parquet("s3a://" + s3_bucket_in)
This works without problems. But then I try to write the data
dataS3.write.parquet("s3a://" + s3_bucket_out)
I do get the following exception
py4j.protocol.Py4JJavaError: An error occurred while calling o39.parquet.
: java.lang.IllegalArgumentException: java.net.URISyntaxException:
Relative path in absolute URI: s3a://<s3_bucket_out>_temporary
It seems to me that Spark is trying to create a _temporary folder first, before writing into the given bucket. Can this be prevented somehow, so that Spark writes directly to the given output bucket?
You can't eliminate the _temporary directory, as that's used to keep the intermediate work of a query hidden until it's complete.
But that's OK, as this isn't the problem. The problem is that the output committer gets a bit confused trying to write to the root directory (it can't delete it, see).
You need to write to a subdirectory under a bucket, with a full prefix, e.g. s3a://mybucket/work/out.
I should add that trying to commit data to S3A is not reliable, precisely because of the way it mimics rename() with something like ls -rlf src | xargs -p8 -I% "cp % dst/% && rm %". Because ls has delayed consistency on S3, it can miss newly created files and so fail to copy them.
See: Improving Apache Spark for the details.
Right now, you can only reliably commit to s3a by writing to HDFS and then copying. EMR's S3 support works around this by using DynamoDB to offer a consistent listing.
I had the same issue when writing to the root of an S3 bucket:
df.save("s3://bucketname")
I resolved it by adding a / after the bucket name:
df.save("s3://bucketname/")

Concatenating multiple text files into one very large file in HDFS

I have multiple text files.
The total size of them exceeds the largest disk size available to me (~1.5TB)
A spark program reads a single input text file from HDFS. So I need to combine those files into one. (I cannot re-write the program code. I am given only the *.jar file for execution)
Does HDFS have such a capability? How can I achieve this?
As I understand your question, you want to concatenate multiple files into one. Here is a solution that might not be the most efficient, but it works. Suppose you have two files, file1 and file2, and you want to get a combined file named ConcatenateFile.
Here is the script for that:
hadoop fs -cat /hadoop/path/to/file/file1.txt /hadoop/path/to/file/file2.txt | hadoop fs -put - /hadoop/path/to/file/Concatenate_file_Folder/ConcatenateFile.txt
Hope this helps.
HDFS by itself does not provide such capabilities. All out-of-the-box features (like hdfs dfs -text * with pipes or FileUtil's copy methods) use your client server to transfer all data.
In my experience we always used our own written MapReduce jobs to merge many small files in HDFS in distributed way.
So you have two solutions:
1) Write your own simple MapReduce/Spark job to combine text files in your format.
2) Find an existing solution built for this purpose.
About solution #2: there is the simple project FileCrush for combining text or sequence files in HDFS. It might be suitable for you, check it.
Example of usage:
hadoop jar filecrush-2.0-SNAPSHOT.jar crush.Crush -Ddfs.block.size=134217728 \
--input-format=text \
--output-format=text \
--compress=none \
/input/dir /output/dir 20161228161647
I had problems running it without these options (especially -Ddfs.block.size and the output file date prefix 20161228161647), so make sure you run it with them.
You can do a pig job:
A = LOAD '/path/to/inputFiles' as (SCHEMA);
STORE A into '/path/to/outputFile';
Doing an hdfs cat and then putting the result back into HDFS means that all of this data passes through the client node, which will degrade your network.

hadoop copy preserving the ownership/permissions

Is there any way to retain the ownership/permissions while copying files in hadoop?
Tried hadoop fs -cp -p <src> <dest> . Didn't work.
Yes, of course you can. But I recommend using distcp: it is an advanced tool for copying data between clusters or within the same cluster, and it has many options for optimizing the execution. The command runs a MapReduce job, so long copies take less time, and you can preserve all attributes.
Example:
hadoop distcp /source_dir/data \
/target_dir/data
hadoop distcp /source_dir/dataA \
/source_dir/dataB \
/target_dir/
The attributes that the -p option can preserve are:
r: replication number
b: block size
u: user
g: group
p: permission
c: checksum-type
a: ACL
x: XAttr
t: timestamp
Another example, but preserving all attributes:
hadoop distcp -prbugpcaxt \
/source_dir/data \
/target_dir/data
You can read more about this command on hadoop-distcp
The owner, group, and permissions are not the most important attributes, since you can easily change them after the copy. The most important ones are ACLs, block size, replication number, and sometimes the timestamp: these are extra properties that you cannot change so easily after a simple copy (hdfs dfs -cp).
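To avoid mistyping the flag letters, the list above can be turned into a small lookup. This is a sketch only: preserve_flag and the attribute names it accepts are hypothetical, and just the letters themselves come from the list in this answer.

```shell
# Hypothetical helper: build a distcp -p option from readable attribute names.
# The letters follow the list above (r, b, u, g, p, c, a, x, t).
preserve_flag() {
  flags=""
  for attr in "$@"; do
    case "$attr" in
      replication) flags="${flags}r" ;;
      blocksize)   flags="${flags}b" ;;
      user)        flags="${flags}u" ;;
      group)       flags="${flags}g" ;;
      permission)  flags="${flags}p" ;;
      checksum)    flags="${flags}c" ;;
      acl)         flags="${flags}a" ;;
      xattr)       flags="${flags}x" ;;
      timestamp)   flags="${flags}t" ;;
      *) echo "unknown attribute: $attr" >&2; return 1 ;;
    esac
  done
  # The letters are attached directly to -p, with no space in between.
  printf -- '-p%s\n' "$flags"
}
```

For example, hadoop distcp "$(preserve_flag user group permission)" /source_dir/data /target_dir/data would expand to hadoop distcp -pugp /source_dir/data /target_dir/data.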
There is not, but you can (assuming you have the appropriate permissions) change the ownership after you copy the files.
It is currently not possible to create two copies of the file while copying permissions -- Depending on your use case, however, an option may be to move the files instead. For instance, I have had to change the location of a file and its permissions, and also wanted to keep a backup (permissions didn't matter) so I moved with permissions to the new location and copied back to the original without. I know that's not very helpful, but that's the best we have in Hadoop at the moment.
