Spark textFileStream [duplicate] - spark-streaming

Should the file name contain a number for textFileStream to pick it up? My program picks up new files only if the file name contains a number, and ignores all other files even if they are new. Is there any setting I need to change to pick up all the files? Please help.

No. It scans the directory for new files which appear within the window. If you are writing to S3, do a direct write with your code: the file doesn't appear until the final close(), so there is no need to rename. In contrast, if you are working with file streaming sources against normal filesystems, you should create the file outside the scanned directory and rename it in at the end; otherwise work-in-progress files may get read. And once a file has been read, it is never re-read.
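The create-elsewhere-then-rename advice for normal filesystems can be sketched as follows. This is a minimal illustration in Python, not Spark code; the function name `publish_atomically` and the directory names are hypothetical, and it assumes the staging directory and the scanned directory live on the same filesystem so that the rename is atomic.

```python
import os
import tempfile


def publish_atomically(data: bytes, staging_dir: str, scanned_dir: str, name: str) -> str:
    """Write data outside the scanned directory, then rename it in."""
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(scanned_dir, exist_ok=True)
    # Create the work-in-progress file where the stream is not looking.
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    final_path = os.path.join(scanned_dir, name)
    # os.replace is atomic only when both paths are on the same filesystem.
    os.replace(tmp_path, final_path)
    return final_path
```

With this shape, a reader scanning `scanned_dir` sees either no file or the complete file, never a partial one.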

After spending hours analyzing the stack trace, I figured out that the problem was the S3 address. I was providing "s3://mybucket", which worked with Spark 1.6 and Scala 2.10.5. On Spark 2.0 (and Scala 2.11), it must be provided as "s3://mybucket/". Maybe it is something regex-related. Working fine now. Thanks for all the help.

Related

How Windows differentiates between Copied files and Created files

I am looking for a bit of advice on how the Windows file system differentiates between files that are copied (copied and pasted from another location) and files that are created (a new file created in a folder).
A bit of background so this makes more sense: I have an application that is used to move files. The application monitors a directory, and when a file is placed in the directory it moves it elsewhere. However, I am having issues where the application will not pick up a file that is created within the monitored directory, but will pick up files that have been created elsewhere and copied into the monitored directory.
Any advice on how Windows differentiates, or if it does at all, would be greatly appreciated.
This is running on Microsoft Windows Server 2008 R2 Standard. I can't dig into the code and see what is going on under the hood unfortunately, so need to get an idea of the difference if any there would be.
Filesystems have no notion of a "copy" operation. Any copy is a sequence of file open/read/write/close operations. The same applies to moving a file to a different filesystem. Moving within the same filesystem, though, is a native filesystem operation and can be done with a single command to the filesystem.
Now about your problem. Most likely you catch the creation of the file (before the data is written), and when your application reacts, the file is still open for writing. So you need to wait until the file is closed.
Depending on how you do the monitoring, such waiting is done in different ways. In filesystem filters you wait for the file close operation. With .NET FileSystemWatcher there's no way to track the file close operation, but I saw a couple of tricks here on StackOverflow (don't have a link though, sorry).
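One commonly used workaround when no close notification is available is to poll the file until it stops changing. The sketch below is in Python rather than .NET, the helper name `wait_until_stable` and its parameters are hypothetical, and it is only a heuristic: an unchanged size and modification time suggest that the writer has finished, but do not prove the file has been closed.

```python
import os
import time


def wait_until_stable(path: str, interval: float = 1.0,
                      checks: int = 2, timeout: float = 60.0) -> bool:
    """Return True once size and mtime are unchanged for `checks` consecutive polls."""
    deadline = time.monotonic() + timeout
    last = None
    stable = 0
    while time.monotonic() < deadline:
        st = os.stat(path)  # raises FileNotFoundError if the file is gone
        current = (st.st_size, st.st_mtime_ns)
        if current == last:
            stable += 1
            if stable >= checks:
                return True
        else:
            # The file changed since the previous poll: start counting again.
            stable = 0
            last = current
        time.sleep(interval)
    return False
```

The interval and number of checks have to be tuned to how the writer behaves; a writer that pauses for longer than the polling window would be misjudged as finished.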
A file existing in D: drive, from creation
The same file which was copied to E: drive
As you can see, the file that was copied to the E: drive has its creation time set to the moment it was copied, while its modification time is still the last modification time of the file in its previous location.
So I guess this illustrates how Windows differentiates between copied files and created files.

Add and get CHM file version

I need to add version info and retrieve it from the CHM file to make it possible to make an alert message to a user about a new CHM-file version available and downloaded.
It means, that I should compare versions of CHM on different machines (Server and clients).
The other way is to check the date and time, but then I have to account for time zones and for differences between file systems, which complicates things (see the Windows SDK help on the SetFileTime function and the FILETIME structure).
If someone knows the trick - please share.
.chm files don't have version information. You will need to find some alternative way to mark the file version.
One way that occurs to me is to include a topic in the help file that contains the version. I imagine that this topic would be hidden, that is not linked by any other topic, not in the table of contents, etc.
All that remains is for your application to be able to read that topic from the help file. I'm sure that can be done with a .chm file parser, if you can obtain one. Perhaps more easily you can get the platform browser to read the topic for you. You can use IHTMLDocument2 to read a URI like this:
mk:@MSITStore:C:\somedir\somefile.chm::/html/SomeTopic.html

I am having trouble understanding IBM procedures

I am reading this file "readme_5202_P8CPE_IF001.htm" (Content Platform Engine 5.2.0.2 Interim Fix 1), and I encountered this procedure:
Copy the new Engine-xx.ear file into the following directory. This directory is a staging location. Do not delete or overwrite the currently deployed Engine-xx.ear file.
Windows
installation_location\ContentEngine\lib
UNIX
installation_location/ContentEngine/lib
What I don't understand is, how am I supposed to copy the Engine-xx.ear file to the specified directory without deleting or overwriting the existing Engine-xx.ear file there? Please bear with me if my understanding in English is not good.
Thank you very much!
That's a note (poorly explained, though) not to overwrite the original file (Engine-xx.ear), but to rename it, as e.g. Engine-xx-orig.ear, and store it aside (to have it if something goes bad with your upgrade) before copying/building the version from a fix pack to that same location.
That location is called a staging environment because it is used to build a WAR/EAR for deployment.
Judging from the rest of the document, I believe that they are trying to tell you that you should follow their instructions carefully, and not try to be clever and overwrite the deployed application (at some other location).
I would give it a shot and overwrite whatever is in installation_location\ContentEngine\lib, the staging location. Make a backup of the files you overwrite in case it doesn't work out.

Storing temporary files

I would like to generate some temporary files in the course of my application. Specifically, I'm using AVAudioRecorder to record a file that I, upon stopping the recording, would like to load and edit/process. My question is:
What is the appropriate standard place to create temporary files. Is there some generally accepted approach to this for Mac or for iPad programming in general? I don't want to simply create a directory and write files into it if there is a proper protocol to this.
The answer to this question is actually a lot more complicated than one might assume. One cannot necessarily just use NSTemporaryDirectory and be done. cocoadev.com has some good pages on this topic, and I would suggest that you study them yourself and determine what will work best for your circumstance.
http://www.cocoadev.com/index.pl?NSTemporaryDirectory
http://www.cocoadev.com/index.pl?GettingTemporaryFolderOnSpecificVolume
The usual place for applications to store temporary data is /var/tmp. You could also use /tmp but this directory is for system-generated temporary files and anything in /tmp is deleted when the machine reboots.
What I found was that according to the iOS Application Programming Guide, I am supposed to query for the appropriate temporary folder for my application via NSTemporaryDirectory(). I tried this and it returned a folder within the /var directory, in my case '/var/folders/pQ/pQ+ZqZCSHWSIHftcbIo57U+++TI/-Tmp-/'.
/tmp or /usr/tmp are the usual places to store temporary files in Unix (which Mac OS X and iOS are).

How to know in Ruby if a file is completely downloaded

Our issue is that our project has files being downloaded to the file system using wget. We are using Ruby to read the downloaded files for data.
How is it possible to tell if a file is completely downloaded, so we don't read a half-complete file?
I asked a very similar question and got some good answers... in summary, use some combination of one or more of the following:
download the file to a holding area and finally copy to your input directory;
use a marker file, created once the download completes, to signal readiness;
poll the file twice and see if its size has stopped increasing;
check the file's permissions (some download processes block reads during download);
use another method like in-thread download by the Ruby process.
To quote Martin Cowie, "This is a middleware problem as old as the hills"...
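The marker-file option from the list above can be sketched like this. The example is in Python for illustration only (the same idea carries over to Ruby); the `.done` suffix and the function names `mark_complete` and `ready_files` are hypothetical conventions, not part of wget or any library.

```python
import os
from typing import List

MARKER_SUFFIX = ".done"  # hypothetical naming convention for marker files


def mark_complete(path: str) -> None:
    """Downloader side: create an empty marker once the download has finished."""
    with open(path + MARKER_SUFFIX, "w"):
        pass


def ready_files(directory: str) -> List[str]:
    """Reader side: return only the data files that have a matching marker."""
    names = set(os.listdir(directory))
    return sorted(
        os.path.join(directory, name)
        for name in names
        if not name.endswith(MARKER_SUFFIX) and name + MARKER_SUFFIX in names
    )
```

The downloader would call `mark_complete` only after wget exits successfully, and the reader would process only what `ready_files` returns.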
The typical approach to this is to download the file to a temporary location and when finished 'move' it to the final destination for processing.
