Hadoop DistCp: handle same file name by renaming

Is there any way to run DistCp, but with an option to rename on file name collisions? Maybe it's easiest to explain with an example.
Let's say I'm copying hdfs:///foo to hdfs:///bar, and foo contains these files:
hdfs:///foo/a
hdfs:///foo/b
hdfs:///foo/c
and bar contains these:
hdfs:///bar/a
hdfs:///bar/b
Then after the copy, I'd like bar to contain something like:
hdfs:///bar/a
hdfs:///bar/a-copy1
hdfs:///bar/b
hdfs:///bar/b-copy1
hdfs:///bar/c
If there is no such option, what might be the most reliable/efficient way to do this? My own home-grown version of distcp could certainly get it done, but that seems like it could be a lot of work and pretty error-prone. Basically, I don't care at all about the file names, just their directory, and I want to periodically copy large amounts of data into a "consolidation" directory.

DistCp does not have that option. If you are using the Java API for it, though, this can easily be handled by checking whether the destination path exists and changing the path if it does. You can check that with a FileSystem object, using the method exists(Path p).
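The answer above refers to the Java API; purely as an illustration of the same check-and-rename idea, here is a rough Ruby sketch that shells out to the hdfs command-line client instead (the directory names and the "-copyN" suffix are only placeholders):
require "shellwords"

src_dir  = "hdfs:///foo"   # placeholder source
dest_dir = "hdfs:///bar"   # placeholder "consolidation" directory

# exit status 0 from `hdfs dfs -test -e` means the path exists
def hdfs_exists?(path)
  system("hdfs", "dfs", "-test", "-e", path)
end

# -C prints bare paths only (on older Hadoop versions, parse the long listing instead)
`hdfs dfs -ls -C #{Shellwords.escape(src_dir)}`.each_line do |line|
  src  = line.chomp
  name = File.basename(src)
  dest = "#{dest_dir}/#{name}"
  n = 0
  # bump a -copyN suffix until the destination name is free
  while hdfs_exists?(dest)
    n += 1
    dest = "#{dest_dir}/#{name}-copy#{n}"
  end
  system("hdfs", "dfs", "-cp", src, dest)
end
This copies serially, of course, so for large volumes the Java API route (or a custom DistCp) is still the better fit.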

Related

naming convention of part files in HDFS

When we do an INSERT INTO command in Hive, the result of the execution creates multiple part files in HDFS.
e.g. part-*-***** or 000000_0, 000001_0, etc., or something else.
Is there a configuration/setting that controls the naming of these part files?
The cluster I work on creates 000000_0, 000001_0, 000000_1, etc. I would like to change this to part- or text- etc. so that it's easier for me to pick these files up and merge them if needed.
If there is a setting that can be set in Hive right before executing the HQL, that would be ideal.
Thanks in advance.
I think you should be able to do it with:
set mapreduce.output.basename = part-;
This won't work. The only way I have found is with a custom file writer.

Handle single files while extracting tar.gz

I have a huge .tgz file which is further structured inside like this:
./RandomFoldername1/file1
./RandomFoldername1/file2
./RandomFoldername2/file1
./RandomFoldername2/file2
etc
What I want to do is have each individual file extracted to standard output so that I can pipe it afterwards to another command. While doing this, I also need to get the RandomFoldername and the file name so that I can deal with them properly from within the second command.
So far the options I have are:
either extract the whole tarball and deal with the resulting files, which is not an option since the extracted contents don't fit on the hard drive, or
make a loop that pattern-matches each file and extracts one file at a time. This solves the problem, but it is too slow because the tarball is swept once for each file.
While searching for how to solve this, I've started to fear that there is no better alternative.
Using the tar tool itself, I don't believe you have any other options.
Using a tar library for some language of your choice should allow you to do what you want though as it should let you iterate over the entries in the tarball one-by-one and allow you to extract/pipe/etc. each file one-by-one as necessary.
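For example, Ruby's standard library (chosen here only because the other answers on this page use Ruby) can stream a .tgz entry by entry without ever extracting the whole archive; the archive name and the downstream command are placeholders:
require "rubygems/package"   # Gem::Package::TarReader
require "zlib"

Zlib::GzipReader.open("archive.tgz") do |gz|
  tar = Gem::Package::TarReader.new(gz)
  tar.each do |entry|
    next unless entry.file?
    folder = File.dirname(entry.full_name)    # e.g. ./RandomFoldername1
    name   = File.basename(entry.full_name)   # e.g. file1
    # stream this one file's contents into the next command
    IO.popen(["wc", "-c"], "w") do |cmd|      # "wc -c" stands in for the real command
      cmd.write(entry.read)
    end
    puts "handled #{folder}/#{name}"
  end
  tar.close
end
The tarball is read only once, front to back, and each entry's folder and file name are available as you go; for very large members you would read in chunks rather than with a single entry.read.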

ruby - get a file from directory without listing all contents

I'm using the split Linux command to split huge XML files into node-sized ones. The problem is that I now have a directory with hundreds of thousands of files.
I want a way to get a file from the directory (to pass to another process for import into our database) without needing to list everything in it. Is this how Dir.foreach already works? Any other ideas?
You can use Dir.glob to find the files you need. More details here, but basically, you pass it a pattern like Dir.glob 'dir/*.rb' and get back the filenames matching that pattern. I assume it's done in a reasonably efficient way, but it will depend on your platform and implementation.
As for Dir.foreach, this should be efficient too - the concern would be if it had to process the entire directory for every pass around the loop. But that would be an awful implementation, and it is not the case.
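A small sketch of grabbing a single file lazily, if that helps (the directory path is a placeholder):
# Dir.foreach without a block returns an Enumerator, so this stops at the
# first regular file instead of building a list of every entry.
dir = "/path/to/huge_dir"
first_file = Dir.foreach(dir).find do |name|
  name != "." && name != ".." && File.file?(File.join(dir, name))
end
puts first_file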

Arbitrary sort key in filesystem

I have a pet project where I build a text-to-HTML translator. I keep the content and the converted output in a directory tree, mirroring the structure via the filesystem hierarchy. Chapters go into directories and subchapters go into subdirectories. I get the chapter headings from the directory and file names. I want to keep all data in files, with no database or anything like that.
Kind of a keep-it-simple approach, no need to deal with meta-data.
All works well, except for the sort order of the directories and files to be included. I need sort of an arbitrary key for sorting directories and files in my application. That would determine the order the content goes into the output.
I have two solutions, both not really good:
1) Prepend directories and files with a sort key (e.g. "01_") and strip it from the output file names in order not to pollute them. That works badly for directories, since they must keep the key in their names so as not to break the directory structure, which ends with an ugly "01_Introduction"...
2) Put a config file into each directory with information on how to sort the directory's content, to be used by my application. That is error-prone and breaks the keep-it-simple, no-metadata approach.
Do you have an idea? What would you do?
If your goal is to effectively avoid metadata, then I'd go with some variation of option 1.
I really do not find 01_Introduction to be ugly, at all.
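Assuming Ruby only because the rest of this page uses it, here is a tiny sketch of option 1: keep the "NN_" prefix on disk for sorting and strip it when producing output names (display_name is a made-up helper):
# "01_Introduction" sorts first on disk but is shown as "Introduction"
def display_name(entry)
  entry.sub(/\A\d+_/, "")
end

Dir.children("content").sort.each do |entry|
  puts display_name(entry)
end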

Ruby - How to prevent wiping your hard drive when using delete file and directory commands in your code

I'm writing some code that at run time may create or delete directories within the project path. I haven't really used Ruby for file processing, so I'm really uneasy about having code that, with a few mistypes weeks down the line, could result in wiping other directories outside of my project path.
Is there any way to make it impossible for the program to delete files outside of its own path, regardless of what's typed into destructive calls?
Pathname is a wrapper class for almost all file operations.
require "pathname"
path= Pathname.new("/home/johannes")
path.directory? # => true
path.children # => [#<Pathname:.bash_history>, #<Pathname:Documents>, #<Pathname:Desktop>]
path.children.each do |p|
p.delete if p.file?
end
Pathname#children does not contain . or .., so you don't accidentally walk up the tree instead of down. If you still don't trust the code, you can even check whether one path is contained in another:
Pathname.new("test") <=> Pathname.new("test/123") # => -1
You might want to create a wrapper method around your favourite delete method (or perhaps around the whole class, since deleting files is not the only potentially destructive file operation), which would expand all the submitted paths and check whether they begin with your "sandbox" path. You could also try to redefine the delete method itself, if you are willing to cripple it throughout the whole application.
And maybe the cleanest solution of them all would be to create a new user on your system and run your program as that user.
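Returning to the wrapper idea above, a minimal sketch (SANDBOX and safe_delete are made-up names, and the current working directory stands in for the project path):
require "pathname"

SANDBOX = Pathname.new(Dir.pwd)   # the only tree we allow deletes in

# refuse to delete anything whose expanded path is not under SANDBOX
def safe_delete(path)
  target = Pathname.new(path).expand_path
  unless target.to_s.start_with?(SANDBOX.to_s + File::SEPARATOR)
    raise ArgumentError, "refusing to delete #{target}: outside #{SANDBOX}"
  end
  target.delete
end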
On a POSIX system, you can use Dir.chroot to change the root that your application sees. Then ALL actions, not just delete ones, will be limited to the project directory. This does mean that external commands will be unavailable unless you make them part of your project directory as well.
This is the standard 'sandboxing' method used in Unix-based systems. It can be difficult to set up (eliminating all external dependencies is sometimes hard), but it affords significant protection when configured properly.
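A bare-bones sketch of that approach (the path is a placeholder, and Dir.chroot needs superuser privileges):
# after this call "/" is the project directory, so even a badly built
# path cannot reach anything outside it
Dir.chroot("/home/me/project")
Dir.chdir("/")   # make sure the working directory is inside the new root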
You could generate an Array of filenames in your project directory using
my_files = Dir["/bla/bla/your/directory/**/*"]
and then simply check whether the filename passed to your "delete" function exists in your my_files array.
I'm sure there is a more elegant solution, but this could work ^_^
You could use File.expand_path and File.dirname on the input, and check that against __FILE__. So something like this might work:
File.delete(path) if File.dirname(File.expand_path(path)).start_with? File.dirname(File.expand_path(__FILE__))
I've got automated tests that routinely create and wipe out directories. I've taken two approaches (both sketched below):
Use /tmp as much as possible. The 'tmpdir' standard library module will create temporary directories for you and, when used with a block, remove them when the block exits. Or,
When the code creates a directory that it will later be deleting, it drops a marker file into that directory. When it comes time to delete the directory, if the marker file is not found, the code refuses to delete it. A marker file might be called ".ok_to_delete", for example.
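A short sketch of both approaches (remove_work_dir is a made-up helper; the marker name matches the example above):
require "tmpdir"
require "fileutils"

# approach 1: let the standard library manage the directory's lifetime
Dir.mktmpdir("myproject-") do |dir|
  # ... create and delete files under dir ...
end   # dir is removed here, when the block exits

# approach 2: only remove directories explicitly marked as disposable
MARKER = ".ok_to_delete"

def remove_work_dir(dir)
  unless File.exist?(File.join(dir, MARKER))
    raise "refusing to remove #{dir}: no #{MARKER} marker found"
  end
  FileUtils.remove_entry_secure(dir)
end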
