Load multiple files from multiple directories into Pig - hadoop

Hello, I have a directory with sub-directories a1, a2, ..., a8, and each of these sub-directories contains multiple files like:
bat-a1-0-0
bat-a1-0-1
bat-a1-1-0
bat-a1-1-1
...
bat-a1-31-0
bat-a1-31-1
and sub-directory a2 is similar:
bat-a2-0-0
bat-a2-0-1
bat-a2-1-0
bat-a2-1-1
...
bat-a2-31-0
bat-a2-31-1
To keep things simple, I decided to use multiple LOAD statements, one per directory, and then UNION them to get everything. But I do not know how to load the files in each of the directories using Apache Pig version 0.10.0-cdh4.2.1, since they do not seem to follow a simple pattern. Any help is appreciated, thanks.

In fact, this may be simpler than you think. When you LOAD files in Pig, you can simply point at a directory, and Pig will recursively load all files under it, even deeply nested ones.
So the solution is: make sure all your data sits under one (or a few) parent directories, and load those.
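For example, a minimal sketch in Pig Latin, assuming the sub-directories live under a hypothetical /data/bats and the files are tab-separated (the field names and types here are assumptions, adjust them to your schema):

    -- Pointing LOAD at the parent directory picks up every file
    -- in a1 ... a8 recursively, including all the bat-* files.
    raw = LOAD '/data/bats' USING PigStorage('\t')
          AS (f1:chararray, f2:int, f3:int);

    -- Alternatively, a Hadoop glob restricts the load to matching files:
    -- raw = LOAD '/data/bats/a*/bat-*' USING PigStorage('\t');

With this, a UNION across eight separate LOAD statements becomes unnecessary.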

Related

naming convention of part files in HDFS

When we run an INSERT INTO command in Hive, the execution creates multiple part files in HDFS,
e.g. part-*-***** or 000000_0, 000001_0, etc., or something else.
Is there a configuration/setting that controls the naming of these part files?
The cluster I work on creates 000000_0, 000001_0, 000000_1, etc. I would like to change this to part- or text- etc. so that it's easier for me to pick these files up and merge them if needed.
If there is a setting that can be set in Hive right before executing the HQL, that would be ideal.
Thanks in advance.
I think you should be able to use:
set mapreduce.output.basename = part-;
This won't work. The only way I have found is with a custom file writer.
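If renaming after the fact is acceptable, one workaround (my own suggestion, not from the answers above) is a shell loop over hdfs dfs -mv; a sketch, assuming a hypothetical output directory /warehouse/mytable:

    # Rename 000000_0-style part files to part-00000, part-00001, ...
    # /warehouse/mytable is a hypothetical path; adjust to your table's location.
    i=0
    for f in $(hdfs dfs -ls /warehouse/mytable | awk '{print $8}'); do
      hdfs dfs -mv "$f" "/warehouse/mytable/part-$(printf '%05d' "$i")"
      i=$((i+1))
    done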

S3DistCp Grouping by Folder

I'm trying to use S3DistCp to get around the small-files problem in Hadoop. It's working, but the output is a bit annoying to work with. The file paths I'm dealing with look like:
s3://test-bucket/test/0000eb6e-4460-4b99-b93a-469d20543bf3/201402.csv
and there can be multiple files within that folder. I want to group by the folder name, so I use the following groupBy argument in S3DistCp:
--groupBy '.*(........-.........-....-............).*'
and it does group the files, but it still results in multiple output folders, with one file in each folder. Is there any way to output the grouped files into one folder, instead of multiple?
Thanks!
As of 2015-11-20, this is the behavior of S3DistCp: it creates multiple directories based on the source directories and does not combine across directories.
I think you can try this:
--groupBy ".*/(........-.........-....-............)/.*"
In your example you should use something like: --src "s3://test-bucket/test/"
This way you will still have multiple folders, but all the files inside each folder will be merged together.
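Put together, a full invocation might look like the sketch below; the destination path is a hypothetical example:

    # --src points at the common parent so the groupBy regex sees the
    # UUID folder in each path; --dest is a hypothetical target prefix.
    s3-dist-cp \
      --src 's3://test-bucket/test/' \
      --dest 's3://test-bucket/merged/' \
      --groupBy '.*/(........-.........-....-............)/.*'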

Hadoop DistCp: handle same file name by renaming

Is there any way to run DistCp, but with an option to rename on file name collisions? Maybe it's easiest to explain with an example.
Let's say I'm copying hdfs:///foo to hdfs:///bar, and foo contains these files:
hdfs:///foo/a
hdfs:///foo/b
hdfs:///foo/c
and bar contains these:
hdfs:///bar/a
hdfs:///bar/b
Then after the copy, I'd like bar to contain something like:
hdfs:///bar/a
hdfs:///bar/a-copy1
hdfs:///bar/b
hdfs:///bar/b-copy1
hdfs:///bar/c
If there is no such option, what might be the most reliable/efficient way to do this? My own home-grown version of distcp could certainly get it done, but that seems like it could be a lot of work and pretty error-prone. Basically, I don't care at all about the file names, just their directory, and I want to periodically copy large amounts of data into a "consolidation" directory.
DistCp does not have that option. If you are using the Java API, this can be handled easily by checking whether the destination path exists and changing the path if it does. You can check that with a FileSystem object, using its exists(Path p) method.
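A minimal sketch of that approach; the class and helper names are illustrative, and only FileSystem.exists() is the actual Hadoop API being pointed to:

    // Derive a collision-free destination name before copying each file.
    // The copy itself is left to whatever mechanism you already use.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RenameOnCollision {
        // Returns dst if it is free, otherwise dst-copy1, dst-copy2, ...
        static Path freeDestination(FileSystem fs, Path dst) throws IOException {
            if (!fs.exists(dst)) {
                return dst;
            }
            for (int i = 1; ; i++) {
                Path candidate = new Path(dst.getParent(), dst.getName() + "-copy" + i);
                if (!fs.exists(candidate)) {
                    return candidate;
                }
            }
        }

        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            System.out.println(freeDestination(fs, new Path("/bar/a")));
        }
    }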

ruby - get a file from directory without listing all contents

I'm using the Linux split command to split huge XML files into node-sized ones. The problem is that I now have a directory with hundreds of thousands of files.
I want a way to get a file from the directory (to pass to another process for import into our database) without needing to list everything in it. Is this how Dir.foreach already works? Any other ideas?
You can use Dir.glob to find the files you need. Basically, you pass it a pattern like Dir.glob 'dir/*.rb' and get back the filenames matching that pattern; more details are in the Dir.glob documentation. I assume the matching is done in a reasonably efficient way, but that will depend on your platform and implementation.
As for Dir.foreach, this should be efficient too; the concern would be if it had to rescan the entire directory on every pass around the loop, but that would be an awful implementation, and it is not the case.
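A small sketch of both options; the directory path and the import step are hypothetical placeholders:

    # Dir.foreach enumerates entries one at a time instead of building
    # the whole listing up front, so we can hand off a file and stop early.
    dir = '/path/to/split_output'                 # hypothetical directory
    Dir.foreach(dir) do |name|
      next if name == '.' || name == '..'
      import_into_database(File.join(dir, name))  # hypothetical downstream step
      break                                       # stop after one file
    end

    # Dir.glob, by contrast, returns the full array of matches at once:
    # parts = Dir.glob("#{dir}/x*")               # split's default names begin with "x"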

Arbitrary sort key in filesystem

I have a pet project where I am building a text-to-HTML translator. I keep the content and the converted output in a directory tree, mirroring the structure via the filesystem hierarchy. Chapters go into directories and subchapters go into subdirectories. I get the chapter headings from the directory and file names. I want to keep all data in files, with no database or the like.
Kind of a keep-it-simple approach, no need to deal with metadata.
All works well, except for the sort order of the directories and files to be included. I need some sort of arbitrary key for sorting directories and files in my application, which would determine the order in which the content goes into the output.
I have two solutions, both not really good:
1) Prepend directory and file names with a sort key (e.g. "01_") and strip it from the output file names so as not to pollute them. That works badly for directories, since they must keep the key data in order not to break the directory structure; it ends with an ugly "01_Introduction"...
2) Put a config file into each directory with information on how to sort the directory's content, to be used by my application. That is error-prone and breaks the keep-it-simple, no-metadata approach.
Do you have an idea? What would you do?
If your goal is to effectively avoid metadata, then I'd go with some variation of option 1.
I really do not find 01_Introduction to be ugly at all.
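If it helps, a small sketch of option 1 in Ruby (the content/ root and the "NN_" prefix convention are assumptions):

    # Sort by the numeric prefix, then strip it for display.
    Dir.glob('content/*').sort.each do |path|        # "01_Introduction" sorts first
      title = File.basename(path).sub(/\A\d+_/, '')  # -> "Introduction"
      puts title
    end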
