Pig load UDF for loading files from several subdirectories - Hadoop

I want to write a custom load UDF in Pig for loading files from a directory structure.
The directory structure is like an email directory. It has a root directory called maildir. Inside this are the sub-directories of the individual mail account holders. Inside every mail account holder's directory are several sub-directories like inbox, sent, trash, etc.
e.g.: maildir/mailholdername1/inbox/1.txt
maildir/mailholdername2/sent/1.txt
I want to read only the inbox files from all the mailholdername sub-directories.
I am not able to understand:
what should be passed to the load UDF as a parameter
how the entire directory structure should be parsed so that only the respective inbox files are read.
I want to process one file, perform some data extraction on it, and load it as one record. Hence if there are 10 files, I get a relation having 10 records.
Further, I want to do some operations on these inbox files and extract some data.

Because you have a defined folder structure that doesn't have variable depth, I think it's as simple as passing the following pattern as your input path:
A = LOAD 'maildir/*/inbox/1.txt' USING PigStorage('\t') AS (f1, f2, f3);
You probably don't need to create your own UDF for this; the standard PigStorage loader should be able to handle the files, assuming they are in some delimited format (the above example assumes 3 fields, tab delimited).
If there are multiple txt files in each inbox, use *.txt rather than 1.txt. Finally, if the maildir root directory is not in your user's home directory, use the absolute path to the folder, say /data/maildir/*/inbox/*.txt
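If you want to sanity-check which files that glob pattern will pick up before running the Pig job, a quick local preview (assuming you have a local copy of the maildir tree; this is only an illustration, not part of the Pig script) could look like:
# preview_inbox_files.py - list what 'maildir/*/inbox/*.txt' would match
# (assumes a local copy of the maildir tree; purely illustrative)
import glob

matches = sorted(glob.glob('maildir/*/inbox/*.txt'))
for path in matches:
    print(path)
print(len(matches), 'inbox files matched')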

Related

NiFi - How to get listing of directories and organize them by name then obtain files from each directory?

I'm trying to figure out how to perform the following steps within NiFi.
1. Obtain a listing of directories from a specific location, e.g. /my_src (note: the folders appearing in here will be dated, e.g. 20211125)
2. Based on the listing obtained, sort the folders by date
3. For each folder, GetFile from that directory
4. Sort those files by their names
I am stuck at step 1, finding a processor that pulls the directory names. I only see GetFile and ListFile.
The reason for this is that I need to process the folders from oldest to newest.
I would expect to use a regex pattern to locate the valid folders that match the date format and ignore the other folders, then pass the matches along, sorted, to another processor that gets the files from that path location, which GetFile does not seem to allow me to set dynamically.
Should I be approaching this process differently within NiFi?
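(For what it's worth, the folder-selection and ordering logic described above is easy to prototype outside of NiFi; here is a rough Python sketch of that logic. The /my_src path and the YYYYMMDD folder names come from the question, everything else is illustrative.)
# sort_dated_dirs.py - pick out folders named like 20211125 and process them oldest to newest
# (illustrative sketch of the logic described in the question, not a NiFi flow)
import os
import re

SRC = '/my_src'                 # location from the question
DATED = re.compile(r'^\d{8}$')  # only YYYYMMDD-style folder names

dated_dirs = sorted(
    d for d in os.listdir(SRC)
    if DATED.match(d) and os.path.isdir(os.path.join(SRC, d))
)  # lexical sort of YYYYMMDD names is also chronological

for d in dated_dirs:
    folder = os.path.join(SRC, d)
    for name in sorted(os.listdir(folder)):  # then sort the files by name
        print(os.path.join(folder, name))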

extracting .jpeg files from subfolders and putting them in another folder using SSIS

I have a folder that has around 400 subfolders, each with ONE .jpeg file in them. I need to get all the pictures into 1 new folder using SSIS. Everything is on my local machine (no connecting through different servers or DBs), just subfolders to one folder, so that I can pull out those images without going one by one into each subfolder.
I would create 3 variables, all of type String: CurrentFile, FolderBase, FolderOutput.
FolderBase is where we start searching, i.e. C:\ssisdata.
FolderOutput is where we are going to move any .jpg files that we find rooted under FolderBase.
Use a Foreach File Enumerator (see the sample in How to import text files with the same name and schema but different directories into database?) configured to process subfolders looking for *.jpg. On the Variable Mappings tab, map the first element to our CurrentFile. Point the enumerator at FolderBase. For extra flexibility, create an additional variable to hold the file mask, *.jpg.
Run the package. It should quickly zip through all the folders, finding the files and doing nothing with them yet.
Drag and drop a File System Task into the Foreach Enumerator. Make it a Move File (or maybe it's Rename File) operation. Use a variable for both source and destination: the source will be CurrentFile and the destination will be FolderOutput.
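To make the intent of that loop concrete, here is the same move logic sketched in Python (the C:\ssisdata path mirrors FolderBase above; the output folder name is illustrative, and the sketch assumes the 400 file names don't collide):
# flatten_jpgs.py - move every .jpg/.jpeg found under a base folder into one output folder
# (illustrative Python equivalent of the Foreach loop + File System Task described above)
import os
import shutil

folder_base = r'C:\ssisdata'         # where we start searching (FolderBase)
folder_output = r'C:\ssisdata_flat'  # where the files end up (FolderOutput, illustrative)
os.makedirs(folder_output, exist_ok=True)

for root, _dirs, files in os.walk(folder_base):
    for name in files:
        if name.lower().endswith(('.jpg', '.jpeg')):
            shutil.move(os.path.join(root, name),          # CurrentFile
                        os.path.join(folder_output, name))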

vbscript to read text files and extract data

I am in need of a script to count the occurrences of note-ref_ and #ref_ in all HTML files.
My folder structure will be:
D:\Count_Test
It contains many folders and sub-folders, and each sub-folder will have a ref.html and a text.html file containing the note-ref_ and #ref_ text (apart from these files, the sub-folders also contain other files such as XML, TXT and images, and a css sub-folder).
I need to count, for every single file, how many times note-ref_ and #ref_ appear, and the results need to be captured in a .csv file.
Can anybody help me by providing a solution to extract this into a CSV file?
Suggestions:
Use the Scripting.FileSystemObject (FSO) to walk through the files and sub-folders to identify the scope of your actions. Alternatively, you could capture the output of DIR /s /b D:\Count_Test\*.html.
Once you know the list of files you'll need to open, read each of them using the FSO's OpenTextFile method and loop through each line. When you find what you're looking for, increment some sort of counter - perhaps in an array.
Finally, once you've finished collecting the data, output your results by once again calling OpenTextFile, but this time opening your CSV file location and writing the data you've collected in the appropriate format.
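If a VBScript solution isn't mandatory, the same walk, count and write-to-CSV approach can be sketched in Python (the D:\Count_Test path and the two search strings come from the question; the output file name is illustrative):
# count_refs.py - count 'note-ref_' and '#ref_' occurrences in every .html file
# under D:\Count_Test and write the totals to a CSV (illustrative sketch, not VBScript)
import csv
import os

BASE = r'D:\Count_Test'
TOKENS = ('note-ref_', '#ref_')

with open('ref_counts.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    writer.writerow(['file'] + list(TOKENS))
    for root, _dirs, files in os.walk(BASE):
        for name in files:
            if name.lower().endswith('.html'):
                path = os.path.join(root, name)
                with open(path, encoding='utf-8', errors='ignore') as f:
                    text = f.read()
                writer.writerow([path] + [text.count(t) for t in TOKENS])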

7zip: In C#, how to add multiple files of the same name in different directories to the same zip file?

I created a C# snippet that calls 7zip (7za) to add a list of files to a zip archive. The problem is that multiple files in different directories have the same name, so 7zip either complains about duplicate names or replaces the first file with the second, only storing the last one added. I cannot recursively scan a directory, which would allow duplicates.
Is there a way to force 7zip to store the directory, or, in ASP.NET MVC 3 / C#, to create zip files with duplicate file names when the full path is not considered?
The path to each image is the GTIN number broken up every five digits; the last five digits are the name of the image.
G:\1234\56789\01234.jpg
G:\4321\09876\01234.jpg
G:\5531\33355\01234.jpg
These would all fail to store correctly in a 7zip archive.
You can use SevenZipSharp: http://sevenzipsharp.codeplex.com/, a wrapper around 7zip. You will have full control from code.
We managed to get multiple files with the same name into the same archive by creating a file list that doesn't contain leading backslashes, then running the application from the directory containing them:
1234\56789\01234.jpg
4321\09876\01234.jpg
5531\33355\01234.jpg
That solves it for now. Anyone with a better idea?
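For illustration, the same idea - keying each archive entry by its relative path so duplicate base names don't collide - looks like this with Python's zipfile module (paths are taken from the question; this is a sketch of the principle, not the C#/7za setup itself):
# zip_with_paths.py - store each file under its relative path so duplicate base names
# (01234.jpg) do not collide inside the archive (illustrative sketch)
import os
import zipfile

files = [
    r'1234\56789\01234.jpg',
    r'4321\09876\01234.jpg',
    r'5531\33355\01234.jpg',
]

# Run from the directory that contains these relative paths (G:\ in the question).
with zipfile.ZipFile('images.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    for path in files:
        # keeping the folder structure as the arcname makes each entry unique
        zf.write(path, arcname=path.replace(os.sep, '/'))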

Arbitrary sort key in filesystem

I have a pet project in which I am building a text-to-HTML translator. I keep the content and the converted output in a directory tree, mirroring the structure via the filesystem hierarchy. Chapters go into directories and subchapters go into subdirectories. I get the chapter headings from the directory and file names. I want to keep all data in files, with no database or the like.
It is a keep-it-simple approach: no need to deal with metadata.
All works well except for the sort order of the directories and files to be included. I need some sort of arbitrary key for sorting directories and files in my application. That key would determine the order in which the content goes into the output.
I have two solutions, neither of them really good:
1) Prepend directories and files with a sort key (e.g. "01_") and strip it from the output file names in order not to pollute them. That works badly for directories, since they must keep the key data in order not to break the directory structure. That ends up with an ugly "01_Introduction"...
2) Put a config file into each directory with information on how to sort the directory's content, to be used by my application. That is error-prone and breaks the keep-it-simple, no-metadata approach.
Do you have an idea? What would you do?
If your goal is to effectively avoid metadata, then I'd go with some variation of option 1.
I really do not find 01_Introduction to be ugly, at all.
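If the worry is only the prefix leaking into the generated output, one possible sketch (in Python; the directory name and prefix format are illustrative, not from the answer) is to sort on the prefixed names but strip the key when deriving headings and output names:
# sort_key_strip.py - sort entries by their "01_" style prefix but drop the prefix
# when deriving headings and output names (illustrative sketch of option 1)
import os
import re

PREFIX = re.compile(r'^\d+_')  # e.g. "01_Introduction" -> "Introduction"

def display_name(name):
    # chapter/subchapter title without the sort key
    return PREFIX.sub('', name)

def ordered_entries(directory):
    # entries in sort-key order, paired with their cleaned display names
    return [(name, display_name(name)) for name in sorted(os.listdir(directory))]

# Example: 'content' is a hypothetical source root
for raw, clean in ordered_entries('content'):
    print(raw, '->', clean)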
