I want to be able to read and write files, etc. How can I do this?
Check out Chapter 28: Data Files of Learning J.
I found several other learning resources on J Software's Getting Started page.
I am relatively new as an active user of the forum, but first I have to thank you all for your contributions, because I have been looking for answers here for years...
Today, I have a question that nobody has solved, or at least I am not able to find the answer...
I am trying to read files in parallel from S3 (AWS) into Spark (local computer) as part of a test system. I have used mclapply, but when I set more than 1 core, it fails...
Example: (the same code works when using one core, but fails when using 2)
new_rdd_global <- mclapply(seq(file_paths), function(i){spark_read_parquet(sc, name=paste0("rdd_",i), path=file_paths[i])}, mc.cores = 1)
new_rdd_global <- mclapply(seq(file_paths), function(i){spark_read_parquet(sc, name=paste0("rdd_",i), path=file_paths[i])}, mc.cores = 2)
Warning message:
In mclapply(seq(file_paths), function(i) { :
all scheduled cores encountered errors in user code
Any suggestion???
Thanks in advance.
Just read everything into one table via a single spark_read_parquet() call; this way Spark handles the parallelization for you. If you need separate tables, you can split them afterwards, assuming there's a column that tells you which file the data came from. In general you shouldn't need to use mclapply() when using Spark with R.
I am trying to use NiFi to get a file from an SFTP server. The file can potentially be big, so my question is how to avoid getting the file while it is still being written. I am planning to use ListSFTP+FetchSFTP, but I am also okay with GetSFTP if it can avoid copying partially written files.
thank you
In addition to Andy's solid answer you can also be a bit more flexible by using the ListSFTP/FetchSFTP processor pair by doing some metadata based routing.
After ListSFTP each flowfile will have attributes such as 'file.lastModifiedTime' and others. You can read about them here https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.ListSFTP/index.html
You can put a RouteOnAttribute processor in between the List and Fetch to detect objects that, at least based on the reported last modified time, are 'too new'. You could route those to a processor that is just a slow pass-through to intentionally wait a bit. You can then run those back through the first router until they are 'old enough'. Now, this is admittedly a power user approach, but it does give you a lot of flexibility and control. The approach I'm mentioning here is not foolproof, as the source system may not report the last modified time correctly, an old enough timestamp may not mean the source file is done being written, etc. But it gives you additional options IF you cannot do the definitely correct thing above that Andy talks about.
If you have control over the process which writes the file, a common pattern to solve this is to initially write the file with a specific naming structure, such as a name beginning with a dot (.). After the write operation succeeds, the file is renamed without the leading dot and is then picked up by the processor. Both GetSFTP and ListSFTP have a processor property called Ignore Dotted Files, which is set to true by default and means those processors will not operate on or return files beginning with the dot character.
There is a Minimum File Age property you can use. The last modification time gets updated as the file is being written, so setting this value to something other than 0 helps avoid picking up files that are still being written.
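The idea behind a minimum file age can be illustrated outside NiFi. The following is a minimal sketch in Python, not NiFi's implementation: the helper name `is_old_enough` and the 60-second threshold in the usage comment are hypothetical, and it assumes the file system reports the last modified time reliably.

```python
import os
import time


def is_old_enough(path, min_age_seconds, now=None):
    """Return True if the file at `path` was last modified at least
    `min_age_seconds` ago (hypothetical helper, for illustration only)."""
    # Allow the caller to inject "now" so the check is easy to test.
    now = time.time() if now is None else now
    age = now - os.path.getmtime(path)
    return age >= min_age_seconds


# Usage idea: only fetch files that have been quiet for a while, e.g.
#   if is_old_enough("incoming/data.csv", 60): fetch it
# Files that are too new are simply skipped and re-checked on the next pass.
```

As with the NiFi property, this is a heuristic: a writer that pauses for longer than the threshold can still leave a partially written file that looks old enough.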
I'm looking for a quick and dirty way to get a graph showing the rate of lines added to a log file, such as php.log, or any text file for that matter.
Know of any open source projects out there?
I know there are a lot of charting tools out there and am familiar with rrd and such, but the answer I'm interested in is the actual nuts and bolts of implementing the solution for any of those charting systems.
If you are familiar with rrd, then you can make a simple database which stores the actual line count of the log file, then visualize with the standard rrd tools.
You can update your database with the following snippet:
rrdtool update test.rrd "$(date +%s):$(wc -l < logfile.txt | tr -d ' ')"
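If you would rather compute the rate yourself before handing it to a charting tool, the arithmetic is just the difference between two line-count samples divided by the elapsed time. A minimal sketch in Python; `count_lines` and `lines_per_second` are hypothetical helper names, and treating a shrinking file as a rate of zero is an assumption about how you want to handle log rotation.

```python
def count_lines(path):
    """Count the lines in a file, including a final line that has no
    trailing newline."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def lines_per_second(prev_sample, cur_sample):
    """Compute the rate of added lines between two (timestamp, line_count)
    samples. A shrinking count (e.g. after log rotation) is reported as 0."""
    (t0, n0), (t1, n1) = prev_sample, cur_sample
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return max(n1 - n0, 0) / (t1 - t0)


# Usage idea: sample (time.time(), count_lines("php.log")) periodically
# and feed consecutive samples to lines_per_second().
```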
For a commercial project I'm working on, I need to be able to read the actual data stored in the $MFT file.
I found a GPL library that could help, but since it's GPL I can't integrate it into my code.
Could someone please point me to a project that I could use, or point me at the relevant Windows API (something that doesn't require 1000 lines of code to implement)?
BTW, why doesn't Windows simply allow me to read the MFT file directly anyway (through CreateFile and ReadFile)? If I want to ruin my drive, it's my business, not MS's.
thanks.
You just have to open a handle to the volume using CreateFile() on \\.\X: where X is the drive letter (check the MSDN documentation on CreateFile(), it mentions this in the Remarks section).
Read the first sector into an NTFS boot record structure (you can find it online, search for Richard "Flatcap" Russon; edit: I found it, http://www.flatcap.org/ntfs/ntfs/files/boot.html ). One of the fields in the boot sector structure gives the start location of the MFT in clusters (the LCN of VCN 0 of the $MFT). Convert that to a byte offset (LCN × sectors per cluster × bytes per sector), do a SetFilePointer() to that location, and read in multiples of sectors. The first 1024 bytes from that location are the file record of the $MFT itself; again, you can parse this structure to find the data attribute, which is always non-resident, and its size is the actual size of the MFT file at that time.
The basic structures for $Boot, File Record and basic attributes (Standard Information, File Name and Data) along with the parsing code should run you less than 1000 lines of code.
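To make the boot sector step concrete, here is a minimal sketch in Python (chosen only for brevity; the same offsets apply when parsing in C). It assumes the commonly documented NTFS boot sector layout: bytes per sector at offset 0x0B, sectors per cluster at 0x0D, the $MFT start LCN at 0x30, and the file record size field at 0x40. The function name `parse_ntfs_boot_sector` is hypothetical, and the sketch does not parse the file record or its attributes.

```python
import struct


def parse_ntfs_boot_sector(sector):
    """Extract the fields needed to locate the $MFT from the first
    sector of an NTFS volume (hypothetical helper, illustration only)."""
    if len(sector) < 512:
        raise ValueError("need at least one 512-byte sector")
    if sector[3:11] != b"NTFS    ":
        raise ValueError("not an NTFS boot sector (OEM ID mismatch)")

    # Little-endian: 2-byte bytes-per-sector, then 1-byte sectors-per-cluster.
    bytes_per_sector, sectors_per_cluster = struct.unpack_from("<HB", sector, 0x0B)
    # 8-byte logical cluster number where the $MFT starts.
    (mft_lcn,) = struct.unpack_from("<q", sector, 0x30)
    # Signed byte: positive means clusters per file record,
    # negative means the record size is 2**abs(value) bytes.
    (raw_record_size,) = struct.unpack_from("<b", sector, 0x40)

    cluster_size = bytes_per_sector * sectors_per_cluster
    if raw_record_size > 0:
        file_record_size = raw_record_size * cluster_size
    else:
        file_record_size = 1 << (-raw_record_size)

    return {
        "bytes_per_sector": bytes_per_sector,
        "sectors_per_cluster": sectors_per_cluster,
        "cluster_size": cluster_size,
        "mft_lcn": mft_lcn,
        "mft_offset": mft_lcn * cluster_size,  # byte offset to seek to
        "file_record_size": file_record_size,
    }
```

The returned `mft_offset` is the position to seek to on the volume handle, and `file_record_size` is how many bytes to read for the first file record. Reading a raw volume such as `\\.\C:` typically requires administrator rights, and reads must be sector-aligned.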
This is not going to be a trivial proposition. You'll likely have to roll your own code solution to accomplish this. You can get some info about the details of the $MFT by checking out http://www.ntfs.com/ntfs-mft.htm
Another option is to spend some time looking through the source code of the open source project NTFS-3G. You can download the source from http://www.tuxera.com/community/ntfs-3g-download/
Another good project is the NTFSProgs http://en.wikipedia.org/wiki/Ntfsprogs
Good luck.
In order not to block the reactor I would like to read files asynchronously, but I've found no obvious way of doing it using EventMachine. I've tried a few different approaches, but none of them feels right:
Just read the file, it'll block the reactor, but what the hell, it's not that slow (unless it's a big file, and then it definitely is).
Open the file for reading and read a chunk on each tick (but how much to read? too much and it'll block the reactor, too little and reading will get slower than necessary).
EM.popen('cat some/file', FileReader) feels really weird, but works better than the alternatives above. In combination with the LineAndTextProtocol it reads lines pretty swiftly.
EM.attach, but I haven't found any examples of how to use it, and the only thing I've found on the mailing list is that it's deprecated in favour of…
EM.watch, which I've found no examples of how to use for reading files.
How do you read files within an EventMachine reactor loop?
EM.attach/watch cannot be used on files, as select/epoll on a disk-based file descriptor will always return readable.
Ultimately, it depends on what you're trying to do. If it's a small file, just File.read it. If it is larger, you can read small chunks over time. For example, EM::FileStreamer does this to send large files over the network.
Another common use-case is to tail a file and read in new contents when it changes. This can be achieved using EM.watch_file: http://github.com/jordansissel/eventmachine-tail