I'm writing a large file > 7MB from an Oracle stored procedure and the requirements are to have no line termination characters (no carriage return/line feed) at the end of each record.
I've written a stored procedure using UTL_FILE.PUT, and I'm following each call to UTL_FILE.PUT with a UTL_FILE.FFLUSH. The procedure fails with a write error once I've written more than the buffer size (set to the maximum of 32767), even though I'm making the FFLUSH calls. The procedure works fine if I replace the PUT calls with PUT_LINE calls.
Is it not possible to write more than the buffer size without a newline character? If so, is there a work around?
Dustin,
The Oracle documentation here:
http://download.oracle.com/docs/cd/B19306_01/appdev.102/b14258/u_file.htm#i1003404
states that:
FFLUSH physically writes pending data to the file identified by the file handle. Normally, data being written to a file is buffered. The FFLUSH procedure forces the buffered data to be written to the file. The data must be terminated with a newline character.
The last sentence is the most pertinent.
Could you not write the data using UTL_FILE.PUT_LINE, and then search the resulting file for the line terminators and remove them?
Just a thought....
deleted quote from docs, see Ollie's answer
Another possible way to do this is a Java stored procedure, where you can use the more fully featured Java APIs for creating and writing to files.
Although it is less than desirable, you could always PUT until you have detected that you are nearing the buffer size. When this occurs, you can FCLOSE the file handle (flushing the buffer) and re-open that same file with FOPEN using 'a' (append) as the mode. Again, this technique should generally be avoided, especially if other processes are also trying to access the file (for example: closing a file usually revokes any locks the process had placed upon it, freeing up any other processes that were trying to acquire a lock).
Thanks for all the great responses, they have been very helpful. The java stored procedure looked like the way to go, but since we don't have a lot of java expertise in-house, it would be frowned upon by management. But, I was able to find a way to do this from the stored procedure. I had to open the file in write byte mode 'WB'. Then, for each record I'm writing to the file, I convert it to the RAW datatype with UTL_RAW.CAST_TO_RAW. Then use UTL_FILE.PUT_RAW to write to the file followed by any necessary FFLUSH calls to flush the buffers. The receiving system has been able to read the files; so far so good.
Related
I am writing an application to read from a list of files, line by line and do some processing. I want to use as little RAM as I can.
I came across this question: https://stackoverflow.com/a/41741702/3531263
where the poster says ReadString uses more RAM than ReadLine, and they have posted some code.
What I don't understand is how one uses more RAM than the other. Ultimately, the way their code is written, they are still writing an entire line to their buffer. So wouldn't that mean that if they had just used ReadString, it would have been the same thing?
the way their code is written, they are still writing an entire line to their buffer
Their code, yes. Your code might not need the whole line to be in memory at the same time. For example, say your program is filtering a log file by request id, which is at the beginning of the line. It doesn't need to read the whole line, which may be a few megabytes or more, only to reject it due to a wrong request id. But with ReadString you don't have the luxury of choice.
I agree with Sergio. Also, have a look at the current implementation in the standard library. ReadLine calls ReadSlice('\n') once, then runs through a few branches to make sure the appropriate sentinel values or errors are returned with the converted data. On the other hand, ReadBytes and ReadString both loop over repeated calls to ReadSlice(delim), so it follows that they would necessarily be copying at least as much data into memory as ReadLine, and potentially much more if the delimiter wasn't found in the first call.
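To make the difference concrete, here is a minimal Go sketch (my own illustration, not code from the linked answer): the ReadLine version looks at just the first buffered chunk of each line and discards the rest, while the ReadString version must hold every full line in memory before it can check anything. The log file name and the "req=42" prefix are made-up examples.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// matchesRequestID checks only the start of a line; with ReadLine we never
// need the full line in memory to make this decision.
func matchesRequestID(start []byte) bool {
	return strings.HasPrefix(string(start), "req=42 ")
}

// filterWithReadLine inspects at most one buffer-sized chunk per line and
// discards the remainder of over-long lines without keeping it around.
func filterWithReadLine(r *bufio.Reader) {
	for {
		chunk, isPrefix, err := r.ReadLine()
		if err != nil {
			return // io.EOF or a real error; simplified for this sketch
		}
		matched := matchesRequestID(chunk)
		for isPrefix { // skip the rest of a line longer than the buffer
			_, isPrefix, err = r.ReadLine()
			if err != nil {
				return
			}
		}
		if matched {
			fmt.Println("matched a line")
		}
	}
}

// filterWithReadString must accumulate each entire line, however long,
// before it can even look at the request id.
func filterWithReadString(r *bufio.Reader) {
	for {
		line, err := r.ReadString('\n')
		if err != nil {
			return
		}
		if strings.HasPrefix(line, "req=42 ") {
			fmt.Println("matched a line")
		}
	}
}

func main() {
	f, err := os.Open("app.log") // hypothetical log file
	if err != nil {
		panic(err)
	}
	defer f.Close()
	filterWithReadLine(bufio.NewReader(f))
	_ = filterWithReadString // kept only for comparison
}
```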
I have a situation where I need to concurrently read/write from/to the file, but the scope of operations is limited:
append only, no random offset writes
read from a random position, where I know for sure the content has been written before (writes happen via append; access is serialized internally via a golang channel so that a random read happens only after the content has been appended)
there is only one process running
This is a heavily loaded application and I would like to avoid locking the file for each read/write I do
I was going to open the file twice: one descriptor for reading, another for append only
would doing so create some potential issues/bugs?
what is the recommended practice if I would like to avoid file locking for each read/write I do?
p.s. golang, linux, ext4
I'll assume by "random read" you actually mean "arbitrary read".
If I understand your use case correctly, you don't need to seek or lock or do anything manual. UNIX has this covered via O_APPEND. Here is what you can do:
Open the file with os.O_APPEND. This way every write, regardless of any preceding operations, will go to the end of the file.
When reading, use File.ReadAt. This lets you specify arbitrary offsets for your reads.
Using this scheme you can avoid any sort of locking: the OS will do it for you. Because of the buffer cache this scheme is not even inefficient: appends and reads are pretty much independent.
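A minimal sketch of that scheme in Go (my own illustration; the file name and record are made up, and in a real program the record's start offset would be handed to readers over the channel the asker already uses for serialization):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Every Write on this handle goes to the current end of file,
	// regardless of any other operations on the file in between.
	f, err := os.OpenFile("data.log", os.O_APPEND|os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Remember where this record starts so it can be read back later.
	info, err := f.Stat()
	if err != nil {
		panic(err)
	}
	offset := info.Size()

	record := []byte("hello, appended record\n")
	if _, err := f.Write(record); err != nil {
		panic(err)
	}

	// ReadAt takes an explicit offset and never touches the file pointer,
	// so it can run concurrently with the appends above without locking.
	buf := make([]byte, len(record))
	if _, err := f.ReadAt(buf, offset); err != nil {
		panic(err)
	}
	fmt.Printf("read back: %q\n", buf)
}
```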
I have been using dbms_output.put_line to write the output line by line; the output is around 500,000 lines of more than 500 characters each.
I want the output on a single line, and I tried dbms_output.put, but it does not work because dbms_output has a line-length limit of around 32,000 bytes. Please suggest a solution.
DBMS_OUTPUT is a mechanism for displaying messages to standard output (i.e. a screen). The nature of PL/SQL programs is they tend to be batch oriented and frequently run as background jobs, so DBMS_OUTPUT has narrow applications. Writing out huge amounts of data is not one of them.
You haven't explained your use case, but the need to render millions of characters in a single stream of output, without carriage returns, suggests you want to write to a file. Oracle provides a built-in for this, UTL_FILE. We can iterate calls to UTL_FILE.PUT() to write as much data as we like to a single line in a file. Find out more.
UTL_FILE has one major constraint, which is it can only write to a file on the database server. Perhaps this is why you are trying to use DBMS_OUTPUT?
Something else? Oracle 12c supports Streaming in PLSQL via DBMS_XMLDOM. Check it out.
This is a pretty vague answer, because your question is pretty vague. If you edit your question and provide actual details regarding your issue, I can make my response more concrete.
I'm developing a program that needs to write a large amount of data to disk, then read back a much smaller amount of data later on. It needs to "bin" related data together, then once it figures out what to do with it, it can process the data further. It's basically acting like a database, but with temp files on disk. Portions of the temp files get reused fairly frequently, as I don't care about the data on disk after I read it back out, so those portions of the file can be recycled. I'm using I/O completion ports to implement this because sequential I/O is simply too slow.
The problem is that sometimes when I read the data, I don't get all of it back. For example, I will zero out my read buffer, do a read operation of, say, 20 bytes, and when the corresponding completion event triggers, some or even none of my read buffer will match what should be on disk, but it won't all still be zeroed out either. Occasionally I can detect this, sleep 5 seconds, read the same portion again, and it then matches what I expected the first time. This is taking place on a top-of-the-line SSD, so 5 seconds should be plenty to flush to disk. However, when I stop my application and look at the contents of the file, it's correct on disk. It's as if the previous write hadn't been flushed to disk and the read returned old data.
To test that theory, I tried writing 0xFF on entire sections as I read them. When this error happened again, my read buffer did not contain 0xFFs as I would have expected. So presumably, I'm not reading old data.
I also checked to make sure that the number of bytes returned from the completion event matched the number of bytes that I passed to ReadFile, and they do match. There is no error returned by the completion event or by ReadFile (other than ERROR_IO_PENDING). I am creating my temp files with FILE_ATTRIBUTE_NORMAL, FILE_FLAG_OVERLAPPED, and FILE_FLAG_RANDOM_ACCESS.
I also tried waiting for all pending writes for a given portion of the file to complete before trying to read, but to no avail. I would hope that Windows would do that for me, but it isn't covered in any documentation that I've read.
I'm really at a loss as to why I'm getting what look to be partial or corrupted reads. I'm really just looking for some ideas that might cause this behavior because I'm all out.
From the sound of things you're firing off writes and reads to the same portions of the same file and sometimes the data that the read returns isn't what you think you have previously written.
I assume you are waiting for the write completion for a piece of data before issuing a read request for the same area of the file? If not, the read could be occurring before the write completes. When lots of data is being written to the same disk, the write completions may begin to slow down and writes may spend more time pending (watch out for the resources that this consumes!).
Personally I'd include my own memory cache layer which knows about the data block until the write completion occurs - you can then satisfy reads for this part of the file from your cache if the write has not yet completed.
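A rough sketch of that cache layer, written in Go for brevity even though the question is about Win32 overlapped I/O (the structure and names are my own illustration): a block stays in the map from the moment its write is issued until the completion is dequeued, and reads for those offsets are served from memory instead of from disk.

```go
package main

import "sync"

// pendingWrites keeps each block in memory from the moment its asynchronous
// write is issued until the corresponding completion arrives.
type pendingWrites struct {
	mu     sync.Mutex
	blocks map[int64][]byte // file offset -> data still in flight
}

func newPendingWrites() *pendingWrites {
	return &pendingWrites{blocks: map[int64][]byte{}}
}

// add is called just before issuing the asynchronous write.
func (p *pendingWrites) add(off int64, data []byte) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.blocks[off] = append([]byte(nil), data...) // keep our own copy
}

// complete is called from the write-completion handler.
func (p *pendingWrites) complete(off int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.blocks, off)
}

// lookup is called before issuing a read; if the block is still in flight,
// the caller uses the cached copy and skips the disk read entirely.
func (p *pendingWrites) lookup(off int64) ([]byte, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	data, ok := p.blocks[off]
	return data, ok
}

func main() {
	p := newPendingWrites()
	p.add(4096, []byte("block that is still being written"))
	if data, ok := p.lookup(4096); ok {
		_ = data // serve the read from memory instead of going to disk
	}
	p.complete(4096) // when the write completion arrives
}
```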
If I'm writing a simple text log file from multiple processes, can they overwrite/corrupt each other's entries?
(Basically, this question Is file append atomic in UNIX? but for Windows/NTFS.)
You can get atomic append on local files. Open the file with FILE_APPEND_DATA access (documented in the WDK). When you omit FILE_WRITE_DATA access, all writes will ignore the current file pointer and be done at the end of file. Alternatively, you may use FILE_WRITE_DATA access and, for append writes, specify it in the OVERLAPPED structure (Offset = FILE_WRITE_TO_END_OF_FILE and OffsetHigh = -1, documented in the WDK).
The append behavior is properly synchronized between writes via different handles. I use that regularly for logging by multiple processes. I write a BOM at offset 0 on every open, and all other writes are appended. The timestamps are not a problem; they can be sorted when needed.
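For what it's worth, here is a sketch of the same open mode from Go using golang.org/x/sys/windows (my own illustration, not code from the answer; it builds only on Windows, requires the x/sys module, and the file name is made up, so check the flag combination against the Win32 docs for your case):

```go
//go:build windows

package main

import "golang.org/x/sys/windows"

func appendLine(path, line string) error {
	name, err := windows.UTF16PtrFromString(path)
	if err != nil {
		return err
	}
	// Ask for FILE_APPEND_DATA without FILE_WRITE_DATA / GENERIC_WRITE:
	// every write then ignores the file pointer and lands at end-of-file,
	// synchronized with other handles appending to the same file.
	h, err := windows.CreateFile(
		name,
		windows.FILE_APPEND_DATA|windows.SYNCHRONIZE,
		windows.FILE_SHARE_READ|windows.FILE_SHARE_WRITE,
		nil,
		windows.OPEN_ALWAYS,
		windows.FILE_ATTRIBUTE_NORMAL,
		0,
	)
	if err != nil {
		return err
	}
	defer windows.CloseHandle(h)

	var written uint32
	return windows.WriteFile(h, []byte(line+"\r\n"), &written, nil)
}

func main() {
	if err := appendLine("shared.log", "hello from this process"); err != nil {
		panic(err)
	}
}
```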
Even if append is atomic (which I don't believe it is), it may not give you the results you want. For example, assuming a log includes a timestamp, it seems reasonable to expect more recent logs to be appended after older logs. With concurrency, this guarantee doesn't hold - if multiple processes are waiting to write to the same file, any one of them might get the write lock - not just the oldest one waiting. Thus, logs can be written out of sequence.
If this is not desirable behaviour, you can avoid it by publishing log entries from all processes to a shared queue, such as a named pipe. You then have a single process that writes from this queue to the log file. This avoids the concurrency issues, ensures that logs are written in order, and works even when file appends are not atomic, since the file is only written to directly by one process.
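Here is a minimal sketch of that single-writer pattern in Go (my own illustration): a buffered channel stands in for the shared queue, and exactly one goroutine ever touches the file. Across separate processes you would replace the channel with a named pipe or socket, as suggested above; the file name and message format are made up.

```go
package main

import (
	"fmt"
	"os"
	"sync"
	"time"
)

func main() {
	entries := make(chan string, 1024)

	// The single writer: the only goroutine that ever touches the file,
	// so entries are written strictly in the order they were dequeued.
	done := make(chan struct{})
	go func() {
		defer close(done)
		f, err := os.OpenFile("app.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
		if err != nil {
			panic(err)
		}
		defer f.Close()
		for e := range entries {
			fmt.Fprintln(f, e)
		}
	}()

	// Several producers, standing in for the separate processes.
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for n := 0; n < 5; n++ {
				entries <- fmt.Sprintf("%s producer=%d msg=%d",
					time.Now().Format(time.RFC3339Nano), id, n)
			}
		}(i)
	}
	wg.Wait()
	close(entries)
	<-done
}
```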
From this MSDN page on creating and opening Files:
An application also uses CreateFile to specify whether it wants to share the file for reading, writing, both, or neither. This is known as the sharing mode. An open file that is not shared (dwShareMode set to zero) cannot be opened again, either by the application that opened it or by another application, until its handle has been closed. This is also referred to as exclusive access.
and:
If you specify an access or sharing mode that conflicts with the modes specified in the previous call, CreateFile fails.
So if you use CreateFile rather than, say, File.Open, which doesn't give the same level of control over file access, you should be able to open a file in such a way that it can't get corrupted by other processes.
You'll obviously have to add code to your processes to cope with the case where they can't get exclusive access to the log file.
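A sketch of what that could look like from Go via golang.org/x/sys/windows (my own illustration; Windows-only, and the file name, retry interval, and helper are made up): dwShareMode is set to zero so only one process can hold the log at a time, and the caller retries while another process has it open.

```go
//go:build windows

package main

import (
	"time"

	"golang.org/x/sys/windows"
)

func openLogExclusive(path string) (windows.Handle, error) {
	name, err := windows.UTF16PtrFromString(path)
	if err != nil {
		return windows.InvalidHandle, err
	}
	for {
		h, err := windows.CreateFile(
			name,
			windows.GENERIC_WRITE,
			0, // dwShareMode = 0: no other handle may be open at the same time
			nil,
			windows.OPEN_ALWAYS,
			windows.FILE_ATTRIBUTE_NORMAL,
			0,
		)
		if err == nil {
			return h, nil
		}
		if err != windows.ERROR_SHARING_VIOLATION {
			return windows.InvalidHandle, err
		}
		// Another process currently has the log open; back off and retry.
		time.Sleep(50 * time.Millisecond)
	}
}

func main() {
	h, err := openLogExclusive("shared.log")
	if err != nil {
		panic(err)
	}
	defer windows.CloseHandle(h)

	// Exclusive access does not by itself append, so move to end of file
	// before writing the entry.
	if _, err := windows.SetFilePointer(h, 0, nil, windows.FILE_END); err != nil {
		panic(err)
	}
	var written uint32
	if err := windows.WriteFile(h, []byte("entry\r\n"), &written, nil); err != nil {
		panic(err)
	}
}
```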
No, it isn't. If you need this, there is Transactional NTFS in Windows Vista/7.