Is it possible to reopen ParquetWriter after close() is called? - parquet

I'm currently using ParquetWriter to write Avro records to Parquet files. I can use the write() and close() methods to write and close files as needed. Now I have a use case where I need to reopen a closed file. I don't see such a method in ParquetWriter. Based on the Parquet specification, it seems possible to reopen a file. I'm wondering whether there is a reason why a reopen() method has not been implemented, or whether there are any other classes I could use to reopen a closed Parquet file.
If I wanted to implement such a method, could someone provide me with a guideline? Thanks.
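For reference, the write/close cycle mentioned in the question looks roughly like the sketch below, using the parquet-avro AvroParquetWriter (the schema, record contents and output path are illustrative placeholders, not from the original question). Note that close() is where the writer finalizes the file by writing the footer, which is why reopening is not a simple addition.

    // Minimal sketch of writing Avro records with parquet-avro and closing the file.
    // Schema, record contents and output path are illustrative placeholders.
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class WriteExample {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");

            try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                    .<GenericRecord>builder(new Path("/tmp/example.parquet"))
                    .withSchema(schema)
                    .build()) {
                GenericRecord record = new GenericData.Record(schema);
                record.put("id", 1L);
                writer.write(record);
            } // close() runs here and writes the footer; the writer cannot be reused afterwards
        }
    }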

Related

Writing in the same files using Spring Batch

I am trying to build a batch application in which I pick files from some folders, filter them by name, and pass them to the batch operation using MultiResourceItemReader. Then I implement my own ItemProcessor to change a few rows based on some condition.
My requirement is to write the updated data back to the same files I am taking the input from; I don't know if this is really possible with Spring Batch.
So basically I can't figure out how to implement the ItemWriter here, because I need to write the data to the same file and, at the same time, to multiple files.
I guess ClassifierCompositeItemWriter or MultiResourceItemWriter could be used here; I have tried to read about them in different Stack Overflow answers, but couldn't find anything related to my requirement.
Can anyone help me implement this?
A code example would be really helpful.
Thanks
While this should be technically possible, I'm not sure it is a good idea for restartability. What would you do if something goes wrong? The input file would have been overwritten with new data and you would lose the original items.
I'm probably missing something here, but I see no reason not to keep the original file: write the output to a new file, then, in a final step at the end of the job (after all processing has completed successfully), delete the original and rename the new file to the original name, as sketched below.
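A rough sketch of such a final step as a Spring Batch Tasklet (the file paths are hypothetical placeholders, and the step would only run after all processing steps have completed):

    // Hypothetical final Tasklet: delete the original input file and move the
    // temporary output file into its place once all processing has succeeded.
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    import org.springframework.batch.core.StepContribution;
    import org.springframework.batch.core.scope.context.ChunkContext;
    import org.springframework.batch.core.step.tasklet.Tasklet;
    import org.springframework.batch.repeat.RepeatStatus;

    public class ReplaceOriginalFileTasklet implements Tasklet {

        @Override
        public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
            Path original = Paths.get("/data/in/input.csv");      // placeholder input path
            Path temporary = Paths.get("/data/in/input.csv.tmp"); // placeholder output path

            Files.delete(original);                                // the original is kept until this point
            Files.move(temporary, original, StandardCopyOption.ATOMIC_MOVE);
            return RepeatStatus.FINISHED;
        }
    }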
Partitioning can be used here. I don't know if this is the right approach, but it is working very well in my case.
First I created 3 partitioners for the 3 types of files, using MultiResourcePartitioner. I filtered the files on the file system using the java.nio.file.Files class and fed the collection of files to the MultiResourcePartitioner. It automatically creates a partition for every file.
Then I wrote the Reader, Writer and Processor for every partitioner. In the Reader I dynamically picked the filename from the stepExecutionContext via the @StepScope annotation, and in the Writer I used a temporary filename to store the output.
Finally I created a Tasklet to delete the original files and rename the temporary files (a rough sketch of this setup is shown below).
Read the documentation about partitioning and parallel processing to see whether it works in your case: https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html#partitioning
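A rough sketch of the partitioned setup described above, assuming a Java-config job; the bean names, file pattern and pass-through processing are illustrative, not the poster's actual code:

    import java.io.IOException;

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.core.configuration.annotation.StepScope;
    import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
    import org.springframework.batch.item.file.FlatFileItemReader;
    import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
    import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.core.io.Resource;
    import org.springframework.core.io.support.PathMatchingResourcePatternResolver;

    @Configuration
    public class PartitionedJobConfig {

        @Bean
        public MultiResourcePartitioner partitioner() throws IOException {
            // one partition per matching file (the files could equally be collected with java.nio.file.Files)
            MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
            Resource[] files = new PathMatchingResourcePatternResolver()
                    .getResources("file:/data/in/type-a-*.csv");   // placeholder pattern
            partitioner.setResources(files);
            return partitioner;
        }

        @Bean
        @StepScope
        public FlatFileItemReader<String> reader(
                @Value("#{stepExecutionContext['fileName']}") Resource file) {
            // MultiResourcePartitioner puts each resource URL into the step execution context under "fileName"
            return new FlatFileItemReaderBuilder<String>()
                    .name("fileReader")
                    .resource(file)
                    .lineMapper(new PassThroughLineMapper())
                    .build();
        }

        @Bean
        public Step workerStep(StepBuilderFactory steps) {
            return steps.get("workerStep")
                    .<String, String>chunk(100)
                    .reader(reader(null))                          // actual resource injected at step scope
                    .processor(item -> item)                       // row-changing logic goes here
                    .writer(items -> { /* write to a temporary file next to the original */ })
                    .build();
        }

        @Bean
        public Step partitionedStep(StepBuilderFactory steps) throws IOException {
            return steps.get("partitionedStep")
                    .partitioner("workerStep", partitioner())
                    .step(workerStep(steps))
                    .build();
        }
    }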

Is Close() needed for a file opened by os.Open()? [duplicate]

This question already has answers here:
Is it necessary to close files after reading (only) in any programming language?
(3 answers)
Does the file need to be closed?
(1 answer)
Closed 2 years ago.
It seems that os.Open() opens files read-only, so I think there is no need to Close() them. The doc is not clear on this. Is my understanding correct?
https://golang.org/pkg/os/#Open
In general, you should always close the files you open. In a long-running program, you may exhaust all available file handles if you do not close your files. That said, the Go garbage collector closes open files, so depending on your exact situation leaving files open may not be a big deal.
There is a limit to how many file handles a process can have open at once; the limit is determined by your environment, so it's important to close them.
In addition, Windows file locking is complicated; if you hold a file open, it may not be possible to write to or delete it.
Unless you're returning the open file handle, I'd advise always matching an Open with a defer file.Close().
Close releases resources that are independent of the read/write status of the file. Close the file when you are done with it.
Your best bet is to always use defer file.Close(). This function is invoked for cleanup purposes, and also releases resources that are indirectly related to the I/O operation itself.
This also holds true for HTTP(S) response bodies and any type that implements the io.Closer interface (such as io.ReadCloser).

Stop a currently running writeToFile:

I have some code which writes some PDFDocument objects to a user-chosen destination. This works fine.
Some of these files (chosen by the user) may be pretty large (maybe hundreds of megabytes), and I wonder whether there is a way to cancel an in-progress writeToFile:withOptions: call (e.g. the user changed their mind and wants to stop it).
I doubt you can do it with that method, since it provides no canceling functionality.
I suggest you use the dataRepresentation method of PDFDocument to first get the PDF data. You can then split up the data using NSData’s subdataWithRange:. And then you can successively write out the data to a file using NSFileHandle’s fileHandleForWritingToURL:error:, writeData:, and closeFile methods.
If you write it out in chunks like this from a non-main thread, in a for loop say, you can cancel it at any time you wish.

How to process an open file using MapReduce framework

I have a file that gets aggregated and written into HDFS. This file stays open for an hour before it is closed. Is it possible to process this file with the MapReduce framework while it is open? I tried, but MapReduce is not picking up all of the appended data. I can query the data in HDFS and it is available, but not when it is read by MapReduce. Is there any way I could force MapReduce to read an open file? Perhaps by customizing the FileInputFormat class?
You can read what was physically flushed. Since close() makes the final flush of the data, your reads may miss some of the most recent data regardless of how you access it (MapReduce or command line).
As a solution, I would recommend periodically closing the current file and then opening a new one (with an incremented index suffix). You can run your MapReduce job over multiple files. You would still end up with some data missing from the most recent file, but at least you can control that by the frequency of your file "rotation".
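A minimal sketch of that kind of rotation on the writer side, using the HDFS FileSystem API (the paths, rotation interval and payload are placeholders):

    // Illustrative "rotation": close the current HDFS file periodically and start a
    // new one with an incremented index, so MapReduce can reliably read the closed files.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RotatingHdfsWriter {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            long rotationMillis = 10 * 60 * 1000L;                   // e.g. rotate every 10 minutes

            for (int index = 0; ; index++) {
                Path part = new Path("/data/agg/events-" + index + ".log");
                long started = System.currentTimeMillis();
                try (FSDataOutputStream out = fs.create(part)) {
                    while (System.currentTimeMillis() - started < rotationMillis) {
                        out.writeBytes("some aggregated record\n");  // placeholder payload
                        out.hflush();                                // visible to readers, but the file is still open
                    }
                } // close() makes this file complete; a MapReduce job can now read it without missing data
            }
        }
    }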

Linux - Programmatically write to a proc file

I have found several examples online where we can create a proc file and assign read and write methods that are called every time the proc file is read from or written to.
However, I can't seem to find any documentation on how to programmatically write to a proc file. Ideally, I would like to add a timestamp with other user details every time the proc file is opened for reading or writing. Again, I've found how to add the read and write functions that are triggered when the proc file is accessed, but I can't find documentation on how to actually write to a proc file programmatically. This would be different from a regular I/O read/write, correct?
Fixed the issue -- I didn't fully understand proc files, but now I understand how they work and that there isn't really any file to write to, just variables. Got it working. Thanks!
