How to gzip compress a directory in hdfs without changing the name of the files - hadoop

I need to gzip compress a directory that will have many files. Since I can't modify the names of the files inside the directory, I can't use MapReduce. Is there any way, using the Java interface, to compress a directory without changing the names of the files inside it?
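For what it's worth, there is no single call that gzips a whole HDFS directory in place, but a small client built on the HDFS FileSystem API and GzipCodec can compress each file one by one while keeping its name. A minimal sketch (the class name, argument handling, and example paths are placeholders; note that without a .gz suffix most tools will not auto-detect the compression):
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class GzipHdfsDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        Path srcDir = new Path(args[0]);                 // e.g. /data/input (hypothetical)
        Path dstDir = new Path(args[1]);                 // e.g. /data/input-gz (hypothetical)
        fs.mkdirs(dstDir);

        for (FileStatus status : fs.listStatus(srcDir)) {
            if (status.isDirectory()) continue;          // only compress plain files
            Path src = status.getPath();
            Path dst = new Path(dstDir, src.getName());  // same file name, gzip-compressed bytes
            try (InputStream in = fs.open(src);
                 OutputStream out = codec.createOutputStream(fs.create(dst))) {
                IOUtils.copyBytes(in, out, conf);
            }
        }
        fs.close();
    }
}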

Related

Chilkat unzip files only from root directory

zip.UnzipMatching("qa_output","*.xml",true)
With this syntax I can unzip every XML file in every directory from my zip file and create the same directory structure.
But how can I unzip only the xml in the root directory?
I cannot understand how to write the filter.
I tried with "/*.xml" but nothing is extracted.
If I write "*/*.xml" I only extract XML files from the subdirectories (and I skip the XML files in the root directory!).
Can anyone help me?
Example of a zip file's contents:
a1.xml
b1.xml
c1.xml
dir1\a2.xml
dir1\c2.xml
dir2\dir3\c3.xml
With UnzipMatching("qa_output","*.xml", true) I extract all these files with the original directory structure, but I want to extract only a1.xml, b1.xml and c1.xml.
Is there a way to write a filter to achieve this result, or a different command, or a different approach?
I think what you want is to call UnzipMatchingInto: All files (matching the pattern) in the Zip are unzipped into the specified dirPath regardless of the directory path information contained in the Zip. This has the effect of collapsing all files into a single directory. If several files in the Zip have the same name, the files unzipped last will overwrite the files already unzipped.
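A rough sketch of that call in Chilkat's Java binding (the archive name is a placeholder, and the exact UnzipMatchingInto signature is assumed to mirror UnzipMatching, so check the Chilkat documentation before relying on it):
import com.chilkatsoft.CkZip;

public class UnzipRootXml {
    static { System.loadLibrary("chilkat"); }            // load Chilkat's native library (name may vary by version)

    public static void main(String[] args) {
        CkZip zip = new CkZip();
        if (!zip.OpenZip("qa_output.zip")) {             // hypothetical archive name
            System.out.println(zip.lastErrorText());
            return;
        }
        // Unzip every *.xml entry into "qa_output", ignoring the directory
        // paths stored inside the archive, so all matches land in one directory.
        zip.UnzipMatchingInto("qa_output", "*.xml", true);
        zip.CloseZip();
    }
}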

How to restore a folder structure 7Zip'd with split volume option?

I 7-Zipped a multi-gigabyte folder, which contained many folders each with many files, using the split-to-volumes (9 MB) option. 7-Zip created files of type .zip.001, .zip.002, etc. When I extract .001 it appears to work correctly, but I get an 'unexpected end of data' error, and 7-Zip does not automatically continue to .002. When I extract .002, it gives the same error and does not continue the original folder/file structure; instead it extracts a zip file into the same folder as the previously extracted files. How do I properly extract split files to obtain the original folder/file structure? Thank you.

Multiple source files for s3distcp

Is there a way to copy a list of files from S3 to HDFS instead of a complete folder using s3distcp? This is for cases when srcPattern cannot work.
I have multiple files in an S3 folder, all having different names. I want to copy only specific files to an HDFS directory. I did not find any way to specify multiple source file paths to s3distcp.
The workaround I am currently using is to list all the file names in srcPattern:
hadoop jar s3distcp.jar \
--src s3n://bucket/src_folder/ \
--dest hdfs:///test/output/ \
--srcPattern '.*somefile.*|.*anotherone.*'
Can this approach work when the number of files is very large, say around 10,000?
hadoop distcp should solve your problem.
We can use distcp to copy data from S3 to HDFS, and it also supports wildcards and multiple source paths in a single command.
http://hadoop.apache.org/docs/r1.2.1/distcp.html
Go through the usage section at that URL.
Example:
Consider that you have the following files in the S3 bucket (test-bucket) inside the test1 folder:
abc.txt
abd.txt
defg.txt
And inside the test2 folder you have:
hijk.txt
hjikl.txt
xyz.txt
And your HDFS path is hdfs://localhost.localdomain:9000/user/test/
Then the distcp command for a particular pattern is as follows:
hadoop distcp s3n://test-bucket/test1/ab*.txt \
s3n://test-bucket/test2/hi*.txt \
hdfs://localhost.localdomain:9000/user/test/
Yes you can. Create a manifest file with all the files you need and use the --copyFromManifest option, as mentioned here.
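Roughly, assuming a manifest previously produced with s3distcp's --outputManifest option (the bucket, folder, and manifest names below are placeholders), the invocation would look something like:
hadoop jar s3distcp.jar \
--src s3n://bucket/src_folder/ \
--dest hdfs:///test/output/ \
--copyFromManifest \
--previousManifest=s3n://bucket/manifests/files-to-copy.gz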

Unix/Mac OS X: Use file list to copy files and folder keeping the directory structure

I have a plain-text file containing the names of hundreds of files, with paths relative to a home directory (they can be made absolute, if needed), in various sub-directories. The home directory contains multiple directories and thousands of files. I need to create another directory, copying the files in the list while maintaining their directory structure in the destination.
Example:
Source folder:
/home/a/
file1.jpg
file2.jpg
file3.jpg
/home/b/
file4.jpg
file5.jpg
file6.jpg
File List: (plain text, in /home/)
./a/file2.jpg
./b/file5.jpg
Expected Result:
/home/dest/a/
file2.jpg
/home/dest/b/
file5.jpg
I tried cp with various modifications from various questions on Stack Overflow, but got a flat folder structure in the result every time.
I am using bash in the OS X Terminal.
Please tell me how this can be done.
You can use rsync:
rsync --relative --files-from file-list.txt /home /home/dest
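If you want to preview what will be copied before running it for real, a dry run helps (assuming the list lives at /home/file-list.txt; note that --files-from already implies --relative):
cd /home
rsync --dry-run --itemize-changes --files-from=file-list.txt /home /home/dest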

Why are the contents of the subfolders transferred when MQFTE is used for transfer?

Hi, when I tried to transfer the contents of a folder (the folder has several subfolders and a few files) using the MQFTE fteCreateTransfer command, not only the few files in the folder but also the contents of the subfolders were transferred to the destination. The same subfolders were created at the destination and their contents transferred. Is there a way to avoid the files from the subfolders being transferred?
As per this page in the Infocenter:
When a directory is specified as a source file specification, the
contents of the directory are copied. More precisely, all files in the
directory and in all its subdirectories, including hidden files, are
copied.
However, it looks like they anticipated your question because the page recently added this clarification:
For example, to copy the contents of DIR1 to DIR2 only, specify
fteCreateTransfer ... -dd DIR2 DIR1/*
So instead of specifying the folder, add the wildcard to the end and you get just the files in the top level of that folder. (Assuming of course that you do not also use the -r option!)
