Does SQL*Loader have any functionality that allows for customizing the log file? - oracle

I have been asked to create a system for allowing third party companies to dump data into several of our tables. These third parties provide csv files on a periodic basis, and after doing some research it seemed like Oracle themselves had a standard tool for doing so, "sqlldr". I've since gotten it working to an acceptable degree, and we have a job scheduled to run that script once a day.
But one of the third parties supplies really dirty data, of the sort where I can't expect it to always load every row/record (looking like up to about 8% will fail). My boss asked me to forward "all output" from the first few tests to him, and like a moron I also sent the log file.
He has asked that this "report" be modified to include those exceptions that aren't unique constraints along with the line in the input file that caused the exception.
This means that I need data from the log file, but also from the (I believe) reject file in a single document. Rather than write a convoluted shell script to combine those two, does SQL*Loader itself allow any customization that might achieve the same thing? I've read through the Oracle documentation and haven't found anything that suggests this, but I've also learned not to trust it entirely either.
Is this possible? Ideally, the solution would allow me to add values to the reject file that don't exist in the original input file, but I'm also interested in any customization of the log file or reject file.

No.
I was going to stop there, but you can define the name of the log file, which might help with issue. Most automation with SQL*Loader involves wrapping it within shell scripts; aka "roll your own."

Related

Nifi: how to avoid copying file that are partially written

I am trying to use Nifi to get a file from SFTP server. Potentially the file can be big , so my question is how to avoid getting the file while it is being written. I am planning to use ListSFTP+FetchSFTP but also okay with GetSFTP if it can avoid copying partially written files.
thank you
In addition to Andy's solid answer you can also be a bit more flexible by using the ListSFTP/FetchSFTP processor pair by doing some metadata based routing.
After ListSFTP each flowfile will have attributes such as 'file.lastModifiedTime' and others. You can read about them here https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.ListSFTP/index.html
You can put a RouteOnAttribute process in between the List and Fetch to detect objects that at least based on the reported last modified time are 'too new'. You could route those to a processor that is just a slow pass through to intentionally wait a bit. You can then run those back through the first router until they are 'old enough'. Now, this is admittedly a power user approach but it does give you a lot of flexibility and control. The approach I'm mentioning here is not fool proof as the source system may not report the last mod time correctly, it may not mean the source file is doing being written, etc.. But it gives you additional options IF you cannot do the definitely correct thing above that Andy talks about.
If you have control over the process which writes the file in, a common pattern to solve this is to initially write the file with a specific naming structure, such as beginning with .. After the successful write operation, the file is renamed without the . and it is picked up by the processor. Both GetSFTP and ListSFTP have a processor property called Ignore Dotted Files which is set to true by default and means those processors will not operate on or return files beginning with the dot character.
There is a minimum file age property you can use. The last modification time gets updated as the file is being written. Setting this value to something other than 0 will help fix the problem:

Parameterise Parm File name In Informatatica

I want to know how to (or can I) parameterize the parm file name in informatica?
little bit of background. I am building a standard map in informatica. Which business users can call directly after selecting the standard filters they want to apply in the map using a GUI.
The parm file name will be given by business users and all the filters that he/she selected will be in parm. The file will be dropped in the parm folder in informatica server.
This is a good case scenario, when only 1 users is using it at 1 point of time.
Also, I want to find out what should I do when multiple users are working on GUI and generating the parm files and invoking the informatica map. How do I get multiple instences of the same map running at the same time?
I hope I am making sense here....
Thanks!!!
You can achieve this by using concurrent execution of the workflow. Read about it and understand how can you implement it.
Once you know how to implement it, use a backend script/code by the gui to assign an instance name to each call through GUI. For each instance name, you can have an individual parameter file. (I believe that there would be a finite set of combination of variable values in your case). You can use below command to call individual instances, (either through you GUI or by any other backend code.
pmcmd %workflow_name% %informatica_folder_name%
-paramfile %paramfilepathandname% -rin %instance_name%
It might sound a bit confusing, but once you understand how concurrent workflows work, you can build on it based on the above input.
It'll be only possible if you call the Informatica from external tool, not the Client tools. One way is described by #Utsav, the other is when you use Informatica WSH to call a Workflow - you can indicate the parameterfile you want to be used with the workflow, as well as desired instance name.
I Think this guide to concurrent workflows May be what you are looking for:
https://kb.informatica.com/howto/6/Pages/17/301264.aspx

Monitor A File For Additions And Get Last Added Line

I'm having trouble monitoring a file for changes. I need to be able to know when a file changes, and when it does, I need the new line that was added. I intend to parse each line and find ones that match certain criteria, and act on information in those lines. I know the expected number of matching lines ahead of time, but I do not know how many lines in total will be added to the file, or where the matching lines will be.
I've tried 2 packages so far, with no avail.
fsnotify/fsnotify
As fas as I can tell, fsnotify can only tell me when a file is modified, not what the details of the modification was. Since I need to know what exactly was added to the file, this is no good for me.
(As a side-question, can this be run in a loop? The example that I tried exited after just one modification. I need to monitor for multiple modifications.)
hpcloud/tail
This package tries to mimic the Unix tail command, but it seems to have its own issues. The output that I get includes timestamps and other data - I just want the added line, nothing else. Also, it seems to think a file has been modified multiple times, even when it's just one edit. Further, the deal breaker here is that it does not output the last line if the line was not followed by a newline character.
Delegating to tail
I came across this answer, which suggests to delegate this work to the tail command itself, but I need this to work cross-platform (specifically, macOS, Linux and Windows). I don't believe that an equivalent command exists on Windows.
How do I go about tackling this?
#user2515526,
Usually changed diff is out of scope of file watchers' functionality, because, you know, you could change an image, and a watcher would need to keep a track several Mb of a diff in memory, and what if we have thousands of files?
However, as bad as it sounds, this may be exactly the way you want to implement this (sure, depends on your app, etc. - could be fine for text files), i.e. - keeping a map of diffs (1 diff per file) since last modification. Cannot say I like it, but sounds like fsnotify has no support for changes/diffs that you need.
Also, regarding your question about running in a loop, maybe you can get some hints here: https://github.com/kataras/iris/blob/8370d76910cdd8de043753ed81ae080eae8dc798/utils/file.go
Its a framework that allows to build a server that watches for TypeScript file changes. So sounds similar to your case/question.
Cheers,
-D

Store parsable backup of all printed documents

What I'm trying to accomplish is to always keep a parsable duplicate of all printed documents, and execute a secondary process for each print.
(i.e.: Be able to parse all text, account for pages, vectors, images, etc).
Processing the document can either be done immediately or deferred (immediately is desirable).
As formats go, any PDL might be suitable, my best guess is XPS would probably be the best bet for a parsable format, any recommendations for other formats are appreciated.
Ideally, I'd like to not mess with the user interaction with the printing (e.g.: print settings page; or create a virtual printer, which could save a XPS and then forward the print job to the physical printer).
Since users might not be tech savvy to either set up/use it properly and/or mess up the process at a later date.
What I'm looking for at this time:
Documentation on the print process and flow (WDK, PDL, what else?)
How this could be accomplished, if at all possible; are there any existing solutions?
Any directions into what I should be looking at.
It's only part of an answer, but rumor has it you can tell Windows to keep spooled documents (right-click the printer, choose "Printer Properties", Advanced, "Keep Printed Documents").
You could enable this, and then create a scheduled task (or system service, etc.) that watches the spool directory and moves all files older than a certain threshold to a more appropriate location for further processing. (The age threshold would be a reasonable way to avoid trying to move files that are currently being written.)
Then you'd have to find a program to convert the .spl files to whatever format you like, or try interpreting it yourself. It looks pretty low-level but Microsoft does offer some documentation about the MS-EMF and MS-EMFSPOOL formats that might be a start.

Verify whether ftp is complete or not?

I got an application which is polling on a folder continuously. Once any file is ftp to the folder, the application has to move this file to some other folder for processing.
Here, we don't have any option to verify whether ftp is complete or not.
One command "lsof" is suggested in the technical forums. It got a file description column which gives the file status.
Since, this is a free bsd command and not present in old versions of linux, I want to clarify the usage of this command.
Can you guys tell us your experience in file verification and is there any other alternative solution available?
Also, is there any risk in using this utility?
Appreciate your help in advance.
Thanks,
Mathew Liju
We've done this before in a number of different ways.
Method one:
If you can control the process sending the files, have it send the file itself followed by a sentinel file. For example, send the real file "contracts.doc" followed by a one-byte "contracts.doc.sentinel".
Then have your listener process watch out for the sentinel files. When one of them is created, you should process the equivalent data file, then delete both.
Any data file that's more than a day old and doesn't have a corresponding sentinel file, get rid of it - it was a failed transmission.
Method two:
Keep an eye on the files themselves (specifically the last modification date/time). Only process files whose modification time is more than N minutes in the past. That increases the latency of processing the files but you can usually be certain that, if a file hasn't been written to in five minutes (for example), it's done.
Conclusion:
Both those methods have been used by us successfully in the past. I prefer the first but we had to use the second one once when we were not allowed to change the process sending the files.
The advantage of the first one is that you know the file is ready when the sentinel file appears. With both lsof (I'm assuming you're treating files that aren't open by any process as ready for processing) and the timestamps, it's possible that the FTP crashed in the middle and you may be processing half a file.
There are normally three approaches to this sort of problem.
providing a signal file so that when your file is transferred, an additional file is sent to mark that transfer is complete
add an entry to a log file within that directory to indicate a transfer is complete (this really only works if you have a single peer updating the directory, to avoid concurrency issues)
parsing the file to determine completeness. e.g. does the file start with a length field, or is it obviously incomplete ? e.g. parsing an incomplete XML file will result in a parse error due to the lack of an end element. Depending on your file's size and format, this can be trivial, or can be very time-consuming.
lsof would possibly be an option, although you've identified your Linux portability issue. If you use this, note the -F option, which formats the output suitable for processing by other programs, rather than being human-readable.
EDIT: Pax identified a fourth (!) method I'd forgotten - using the fact that the timestamp of the file hasn't updated in some time.
There is a fifth method. You can also check if the FTP Session is still active. This will work if every peer has it's own ftp user account. As long as the user is not logged off from FTP, assume the files are not complete.

Resources