Hadoop - Managing multiple input/output files - hadoop

I'm facing problems managing multiple input files.
I have many in .../input/ folder and a mapreduce job which I want to be executed for each of my input files, so that every input file has its own output (in .../output/).
Now, I tried searching on the Net but many pages are very old and I can't get a working method. Any methods/classes which I can use to make this work?
Thanks in advance.

Related

Google Cloud Logs Export Names

Is there a way to configure the names of the files exported from Logging?
Currently the file exported includes colons. This are invalid characters as a path element in hadoop, so PySpark for instance cannot read these files. Obviously the easy solution is to rename the files, but this interferes with syncing.
Is there a way to configure the names or change them to no include colons? Any other solutions are appreciated. Thanks!
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/introduction.md
At this time, there is no way to change the naming convention when exporting log files as this process is automated on the backend.
If you would like to request to have this feature available in GCP, I would suggest creating a PIT. This page allows you to report bugs and request new features to be implemented within GCP.

Nifi: How to sync two directories in nifi

I have to write my response flowfiles in one directory than get data from it change it and then put it inside other dierctory i want to make this two direcotry sync(i mean that whenever i delet, or change flowfile in one directory it should change in other directories too ) I have ore than 10000 flowfiles so chechlist wouldn't be good solution. Can you reccomend me:
any contreoller service which can help me make this?
any better way i can make this task without controller service
You can use a combination of ListFile, FetchFile, and PutFile processors to detect individual file write changes within a file system directory and copy their contents to another directory. This will not detect file deletions however, so I believe a better solution is to use rsync within an ExecuteProcess processor.
To the best of my knowledge, rsync does not work on HDFS file systems, so in that case I would recommend using a tool like Helix or DistCp (I have not evaluated these tools in particular). You can either invoke them from the "command line" via ExecuteProcess or wrapping a client library in an ExecuteScript or custom processor.

Program solution reading through share folders

I have a quick project I am working on for one of our VPs.
We have a few thousand CAD jobs stored on a network file share. The file structure is such that there is a parent folder for the CAD job. Part of the folder name contains the job number. Inside the folder, there are 1 to many .ini text files that contain the connection information I need.
What I need is a programatic way to search through all the folders and extract the job number from the folder name, and all the connection values from the ini files.
For example for a folder named CM8252390-3, the job number is 8252390-3. Inside this folder are 3 ini files. Inside the ini files are that look like this:
[Connection]
Name=IMP_Acme_3.5
[Origin]
X=-15.044784
Y=19.620095
Z=44.621395
So my program needs to give me the following result
Job Connection
8252390-3 IMP_Acme1_3.5
8252390-3 IMP_Acme2_3.5
8252390-3 IMP_Acme3_3.5
8254260-1 IMP_Acme3_2.4
8254260-1 IMP_Acme3_4.1
...continued for all folders in the network share
Any suggestion on the best way to do this. I am primarily an Oracle PL/SQL developer, but have some basic Windows batch and Unix shell experience. If I can get the data loaded into Oracle tables, I can search using PL/SQL tools, but is there a better way using shell, batch, or other tools?
Thank you.
I think this is a job for Powershell or vbScript. It would be easy to use these tools to write the information you need to one file.
This file should be written to an Oracle directory.
grant read permission to a database user on this directory
use utl_file to read the file or treat the file as an external table and expose it as a view
schedule a regular OS job to refresh or rebuild the list

Skipping bad input files in hadoop

I'm using Amazon Elastic MapReduce to process some log files uploaded to S3.
The log files are uploaded daily from servers using S3, but it seems that a some get corrupted during the transfer. This results in a java.io.IOException: IO error in map input file exception.
Is there any way to have hadoop skip over the bad files?
There's a who bunch of record skipping configuration properties you can use to do this - see the mapred.skip. prefixed properties on http://hadoop.apache.org/docs/r1.2.1/mapred-default.html
There's also a nice blog post entry about this subject and these config properties:
http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code
That said, if you file is completely corrupt (i.e. broken before the first record), you might still have issues even with these properties.
Chris White's comment suggesting writing your own RecordReader and InputFormat is exactly right. I recently faced this issue and was able to solve it by catching the file exceptions in those classes, logging them, and then moving on to the next file.
I've written up some details (including full Java source code) here: http://daynebatten.com/2016/03/dealing-with-corrupt-or-blank-files-in-hadoop/

check directory of oracle logs

I'm using the check_logfiles nagios plugin to monitor Oracle alert logs. It works wonderfully for that purpose.
However I also need to monitor and entire directory of oracle trace logs for errors. This is because the oracle database is always creating new log files with different names.
What I need to know is the best way to scan an entire directory of oracle trace logs to find out which ones match patterns that specify oracle alerts.
Using check logfiles I tried specifying these options -
--criticalpattern='ORA-00600|ORA-00060|ORA-07445|ORA-04031|Shutting
down instance'
and to specify the directory of logs -
--logfile='/global/cms/u01/app/orahb/admin/opbhb/udump/'
and
--logfile="/global/cms/u01/app/orahb/admin/opbhb/udump/*"
Neither of which have any effect. The check runs but returns ok. Does anyone know if this nagios plugin called check_logfiles can monitor a directory of files rather than just a single file? Or perhaps there is another, better way to achieve the same goal of monitoring a bunch of files that can't be specified ahead of time?
Use a script which:
Opens each file
Copies entries which match the pattern
Outputs the matches to a file

Resources