I'm trying to use S3DistCp to get around the small files problem in Hadoop. It's working, but the output is a bit annoying to work with. The file paths I'm dealing with look like:
s3://test-bucket/test/0000eb6e-4460-4b99-b93a-469d20543bf3/201402.csv
and there can be multiple files within that folder. I want to group by the folder name, so I use the following group by argument in s3distcp:
--groupBy '.*(........-.........-....-............).*'
and it does group the files, but it still results in multiple output folders, with one file in each. Is there any way to output the grouped files into one folder instead of multiple?
Thanks!
As of 2015-11-20, this is the behavior of S3DistCp. It will create multiple directories based on the source directories. It will not combine across directories.
I think you can try this:
--groupBy ".*/(........-.........-....-............)/.*"
In your example you should use something like: --src "s3://test-bucket/test/"
This way you will have multiple folders with all files inside those folders merged together.
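To see why moving the `/` delimiters into the pattern changes the result, it helps to look at what the capture group produces: S3DistCp concatenates all files whose paths yield the same captured text. Here is a small Python sketch that simulates that grouping step; the paths are hypothetical, and I've used an explicit UUID pattern instead of the dot-counting one, which should be equivalent for paths shaped like the question's:

```python
import re
from collections import defaultdict

# Hypothetical paths following the layout in the question.
paths = [
    "test/0000eb6e-4460-4b99-b93a-469d20543bf3/201402.csv",
    "test/0000eb6e-4460-4b99-b93a-469d20543bf3/201403.csv",
    "test/1111aaaa-2222-3333-4444-555566667777/201402.csv",
]

# An explicit UUID pattern rather than the dot-counting one; the
# text of the first capture group becomes the merged output name.
group_by = re.compile(
    r".*/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})/.*"
)

groups = defaultdict(list)
for path in paths:
    match = group_by.match(path)
    if match:
        groups[match.group(1)].append(path)

for name, members in groups.items():
    print(name, "->", len(members), "file(s) merged")
```

Files sharing a folder UUID end up in one group (one merged output file), while distinct UUIDs stay separate.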
I'm trying to figure out how to perform the following steps within NiFi.
Obtain a listing of directories from a specific location, e.g. /my_src (note: the folders appearing here will be dated, e.g. 20211125)
Based on the listing obtained, I need to sort the folders by date
For each folder then I need to GetFile from that directory
Then sort those files by their names
I am stuck at step 1 on finding a processor that pulls the directory names. I only see GetFile and ListFile.
Reason for this is that I need to process the folders based on the oldest to newest.
I would expect to be using a regex pattern to locate the valid folders that match the date format and ignore the other folders. Then with those values found pass them along sorted to another process that would get files from that path location, which GetFile does not seem to allow me to set dynamically.
Am I to approach this process differently within NiFi?
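Outside of NiFi, the selection-and-ordering logic described in steps 1–2 is small; a Python sketch of it (folder names are made up) may clarify what a matching flow needs to do. Because `YYYYMMDD` is fixed-width, a plain lexicographic sort is already oldest-to-newest:

```python
import re

def sorted_date_folders(names):
    """Keep only folder names that are YYYYMMDD dates; since the
    format is fixed-width, a plain sort is oldest-to-newest."""
    return sorted(n for n in names if re.fullmatch(r"\d{8}", n))

# Hypothetical directory listing of /my_src.
listing = ["20211125", "archive", "20211101", "tmp", "20211130"]
print(sorted_date_folders(listing))
# ['20211101', '20211125', '20211130']
```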
I have a folder that has around 400 subfolders, each with ONE .jpeg file in them. I need to get all the pictures into 1 new folder using SSIS. Everything is on my local machine (no connecting through different servers or DBs), just subfolders to one folder, so that I can pull out those images without going one by one into each subfolder.
I would create 3 variables, all of type String. CurrentFile, FolderBase, FolderOutput.
FolderBase is going to be where we start searching i.e. C:\ssisdata
FolderOutput is where we are going to move any .jpg files that we find rooted under FolderBase.
Use a Foreach File Enumerator (sample How to import text files with the same name and schema but different directories into database?) configured to process subfolders looking for *.jpg. Map the first element on the Variable tab to be our CurrentFile. Map the Enumerator to start in FolderBase. For extra flexibility, create an additional variable to hold the file mask *.jpg.
Run the package. It should quickly zip through all the folders, finding the files and doing nothing with them yet.
Drag and drop a File System Task into the Foreach Enumerator. Make it a Move File (or maybe it's Rename) type. Use a variable source and destination. The source will be CurrentFile and the destination will be FolderOutput.
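For reference, the same traversal-and-move logic those SSIS components implement can be sketched as a short script; this is only an illustration of the behavior, not part of the package, and the function and folder names are hypothetical:

```python
import os
import shutil

def collect_jpgs(folder_base, folder_output):
    """Walk folder_base recursively and move every .jpg/.jpeg file
    into folder_output (assumes the file names are unique)."""
    os.makedirs(folder_output, exist_ok=True)
    for root, _dirs, files in os.walk(folder_base):
        for name in files:
            if name.lower().endswith((".jpg", ".jpeg")):
                shutil.move(os.path.join(root, name),
                            os.path.join(folder_output, name))

# Example: collect_jpgs(r"C:\ssisdata", r"C:\ssisdata\all_images")
```

Like the SSIS version, this assumes no two subfolders contain a file with the same name; a clash would overwrite.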
Hello, I have a directory with sub-directories similar to this: a1, a2, ..., a8, and each of these directories has multiple files like
bat-a1-0-0
bat-a1-0-1
bat-a1-1-0
bat-a1-1-1
...
bat-a1-31-0
bat-a1-31-1
and for sub-directory a2 it's similar:
bat-a2-0-0
bat-a2-0-1
bat-a2-1-0
bat-a2-1-1
...
bat-a2-31-0
bat-a2-31-1
What I decided to do, in order not to complicate things, is to have multiple LOAD statements to load each directory, and then find a way to UNION them all. But I do not know how to load the files in each of the directories using Apache Pig version 0.10.0-cdh4.2.1, since they seem not to follow a simple pattern. Need help, thanks.
In fact this may be simpler than you think. If you load files in Pig, you can simply point at a directory, and Pig will recursively load all files, even those which may be deeply nested.
So the solution is: Make sure all your data is under 1 (or a few) directories, and load them in.
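To illustrate what "point at a directory" picks up, here is a Python stand-in (not Pig) that gathers the same recursive set of files a directory-pointed LOAD would see; the directory layout in the test mirrors the question's `a1`/`a2` structure:

```python
from pathlib import Path

def all_files_under(root):
    """Everything under root, however deeply nested - the same set a
    directory-pointed LOAD would pick up."""
    return sorted(str(p) for p in Path(root).rglob("*") if p.is_file())
```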
I created a C# snippet that calls 7zip (7za) to add a list of files to a zip archive. The problem is that multiple files in different directories have the same name, so 7zip either complains about duplicate names or replaces the first file with the second, only storing the last one added. I cannot recursively scan a directory, which would allow duplicates.
Is there a way to force 7zip to store the directory, or in ASP.NET MVC 3 C# to create zip files with duplicate file names when not considering the full path?
The path to our image is the GTIN number broken up every five digits. The last five digits are the name of the image.
G:\1234\56789\01234.jpg
G:\4321\09876\01234.jpg
G:\5531\33355\01234.jpg
These would fail to all store in a 7zip archive correctly.
You can use SevenZipSharp: http://sevenzipsharp.codeplex.com/ a wrapper around 7zip. You will have full control from code.
We managed to get multiple same-named files into the same archive by creating a file list that doesn't contain leading backslashes, then running the application from the directory containing them:
1234\56789\01234.jpg
4321\09876\01234.jpg
5531\33355\01234.jpg
It solves it for now. Anyone with a better idea?
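Generating that file list amounts to making every path relative to the drive root, so 7-Zip keeps each file's directory inside the archive instead of flattening the names. A sketch of that step (the helper name is my own, and `ntpath` is used so the Windows-style paths resolve the same way on any OS):

```python
import ntpath  # Windows path rules, usable from any OS

def relative_file_list(full_paths, root):
    """Strip the root (here the drive) so 7-Zip stores each file's
    directory inside the archive instead of flattening the names."""
    return [ntpath.relpath(p, root) for p in full_paths]

paths = [
    r"G:\1234\56789\01234.jpg",
    r"G:\4321\09876\01234.jpg",
    r"G:\5531\33355\01234.jpg",
]
for line in relative_file_list(paths, "G:\\"):
    print(line)
```

Writing those lines to a list file and invoking 7za from `G:\` then stores three distinct `01234.jpg` entries under their own directories.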
I have a pet project where I build a text-to-HTML translator. I keep the content and the converted output in a directory tree, mirroring the structure via the filesystem hierarchy. Chapters go into directories and subchapters go into subdirectories. I get the chapter headings from the directory and file names. I want to keep all data in files, no database or the like.
Kind of a keep-it-simple approach, no need to deal with meta-data.
All works well, except for the sort order of the directories and files to be included. I need sort of an arbitrary key for sorting directories and files in my application. That would determine the order the content goes into the output.
I have two solutions, both not really good:
1) Prepend directories and files with a sort key (e.g. "01_") and strip it from the output file names in order not to pollute them. That works badly for directories, since they must keep the key data in order not to break the directory structure. That ends with an ugly "01_Introduction"...
2) Put a config file into each directory with information on how to sort the directory content, to be used by my application. That is error-prone and breaks the keep-it-simple, no-metadata approach.
Do you have an idea? What would you do?
If your goal is to effectively avoid metadata, then I'd go with some variation of option 1.
I really do not find 01_Introduction to be ugly, at all.
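And the stripping half of option 1 is only a couple of lines: sort on the prefix, then drop it before the heading reaches the generated HTML, so the key never appears in the output even though it lives in the file names. A sketch (function name is my own; it assumes zero-padded prefixes so lexicographic order is numeric order):

```python
import re

SORT_KEY = re.compile(r"^\d+_")

def ordered_titles(names):
    """Sort by the zero-padded numeric prefix, then strip it so the
    key never reaches the generated output."""
    return [SORT_KEY.sub("", n) for n in sorted(names)]

print(ordered_titles(["02_Basics", "01_Introduction", "10_Appendix"]))
# ['Introduction', 'Basics', 'Appendix']
```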