I need to run a bunch of simulations using a tool called ngspice, and since I want to run a million simulations, I am distributing them across a cluster of machines (a master plus a slave to start with, each with 12 cores).
This is the command:
ngspice deck_1.sp; ngspice deck_2.sp; and so on.
Step 1: A python script is used to generate these sp files.
Step 2: Python invokes GNU parallel to distribute the sp files across the master/slave and run the simulations using ngspice
Step 3: I post-process the results (python script).
I generate and process only 1000 files at a time to save disk space, so Steps 1 to 3 above are repeated in a loop until a million files have been simulated (roughly as in the sketch below).
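Roughly, the loop is wired together like this (a simplified sketch: the nodes.txt sshlogin file, the exact parallel/ngspice flags, and the placeholder functions stand in for the real scripts):

```python
import glob
import subprocess

SSHLOGIN_FILE = "nodes.txt"  # hypothetical: one "12/hostname" entry per line for master and slave

def generate_decks(batch):
    """Step 1 placeholder: the real generator writes deck_*.sp for this batch."""

def post_process_and_clean(batch):
    """Step 3 placeholder: the real script harvests results and deletes the batch."""

def run_batch():
    """Step 2: hand the current decks to GNU parallel, which spreads the
    ngspice runs across the hosts listed in the sshlogin file."""
    decks = sorted(glob.glob("deck_*.sp"))
    subprocess.run(
        ["parallel",
         "--sshloginfile", SSHLOGIN_FILE,  # master + slave, 12 job slots each
         "--workdir", ".",   # assumes the deck directory is the same NFS path on both hosts
         "ngspice", "-b", "{}",  # -b runs ngspice in batch (non-interactive) mode
         ":::", *decks],
        check=True)

for batch in range(1000):  # 1000 batches of 1000 decks = one million simulations
    generate_decks(batch)
    run_batch()
    post_process_and_clean(batch)
```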
Now, my problem is:
When I execute the loop for the first time, I have no problem: the files are distributed across the master/slave until the 1000 simulations are complete. When the loop starts the second time, I clear out the existing sp files and regenerate them (step 1). Now, when I execute step 2, for some strange reason some files are not being detected. After some debugging, the errors I get are "Stale NFS file handle" and "No such file or directory deck_21.sp" for certain sp files that were created in step 1.
I paused my Python script and did an 'ls' in the directory, and I see that the files actually exist; but as the error points out, the problem is the stale NFS file handle. This link recommends remounting the client, but I am logged into a machine on which I have no admin privileges to mount anything.
Is there a way I can resolve this?
Thanks!
No. You need admin privileges to fix this.
I need to write a script to be called by Windows Task Scheduler every minute. The script is supposed to look for a file on a network drive and move it somewhere else.
If a user copies a large file (something like 300 MB) to a Windows network drive, depending on the user's location and bandwidth it could take a couple of minutes for the file to get uploaded completely.
When Task Scheduler calls my script and the script finds a file, I will create a temporary lock file, so the next time the script is called it will simply exit because of that lock file: it knows the instance that was called a minute ago is still trying to move the file.
What is the best way for my PowerShell script to know when the file copy to the network drive has completed, so it can then start transferring the file somewhere else?
Have some kind of while loop that checks the size of the file, pauses 30 seconds, and checks the size again; if they are equal, I know the transfer has completed? (Sketched below.)
Is there a PowerShell API to check whether the file is in use or not? If it isn't, does that mean the upload has completed?
Any better, more efficient way of doing this?
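To make option 1 concrete, here is a rough sketch of the size-stability check (shown in Python purely to illustrate the logic; the same loop translates directly to PowerShell with Get-Item and Start-Sleep):

```python
import os
import time

def copy_finished(path, poll_seconds=30):
    """Option 1: treat the copy as done once the file size stops changing
    across one polling interval."""
    size_before = os.path.getsize(path)
    time.sleep(poll_seconds)
    return os.path.getsize(path) == size_before
```

For option 2 there is no single "is this file in use" cmdlet, but a common trick is to try opening the file with FileShare.None (for example via [System.IO.File]::Open) and treat an IOException as "still being written".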
I have a Server 2008 scheduled job that does the following:
Tests to see if the active/passive mscs service is running on this node.
If so then maps a shared folder.
Moves a bunch of files from the shared folder to the clustered drive.
Unmaps shared folder.
The job runs every 5 minutes. The files are produced by another system and, though not completely time critical, delaying the process much more than this is not acceptable to the business.
So far so good.
What I'm seeing is that although the script works correctly, the job history says that the second time it attempts to run the job, 'there is a previous copy of the job still running'.
Does anyone have any thoughts on:
Why this is happening?
And how to go about debugging it?
If this were Unix/Linux this would not be a problem but this is a complete mystery to me.
I need to back up some large files that are being written to disk by a process. The process is perpetually running, and occasionally dumps large files that need to be moved over the network. Having the process do this itself is not an option, as the process locks out users whilst it is doing file dumps.
So, this runs on a Windows machine, and as a primarily Linux user, I am not entirely certain how to do this...
Under Linux I would simply use a cron job in the folder (I know the glob that will match the output files), then check lsof to ensure that the file is not being written to, so that I don't try to copy a partially complete file. Data integrity is critical, so I would normally md5 the files before and after the copy.
So I guess my question is -- how does one do this sort of thing under Windows? I feel like I am kneecapped from the start -- I can use Python, but I can't emulate lsof, nor cron to do the task scheduling.
I tried looking at "handle" -- but it needs admin privileges at execution time, which is also not an option. I can't run the backup process as an admin; it has to run with user privs.
Thanks..
Edit: I just realised I could keep the python instance running, with a sleep, so task scheduling is not a problem :)
For replacing cron you can use the "Task Scheduler" in Windows to start your script every few minutes (or at specific times).
For lsof, the question was discussed here: How can I determine whether a specific file is open in Windows?
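Putting those two pieces together, here is one rough sketch of the resident-Python approach (the watch glob, destination share, and polling interval are made up, and the in-use check is only a heuristic, not a real lsof):

```python
import glob
import hashlib
import os
import shutil
import time

WATCH_GLOB = r"D:\dumps\*.dat"        # hypothetical: glob matching the dumped files
DEST_DIR = r"\\backup-server\dumps"   # hypothetical destination on the network
POLL_SECONDS = 300

def md5sum(path, chunk=1 << 20):
    """Hash in chunks so large dump files never sit in memory whole."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def in_use(path):
    """Heuristic lsof substitute: if the dumping process still holds the file
    open without write sharing, opening it for write access fails on Windows."""
    try:
        with open(path, "rb+"):
            return False
    except PermissionError:
        return True

while True:                            # the script just stays resident instead of using cron
    for src in glob.glob(WATCH_GLOB):
        if in_use(src):
            continue                   # still being written; try again on the next pass
        before = md5sum(src)
        dst = shutil.copy2(src, DEST_DIR)
        if md5sum(dst) == before:      # md5 before and after the copy, as described
            os.remove(src)             # only drop the original once the copy verifies
    time.sleep(POLL_SECONDS)
```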
We have a small utility, written in jRuby, that finds unused items on our server, zips them up, and then moves them. When we run this on the actual servers that need cleaning up, they run out of memory before they can complete the clean-up. The Java heap is as high as we can get it to run stably on 32-bit, around 1800m max heap size, and we can't move to 64-bit at this time. Our main application is running there as well, and we would like to avoid shutting it down. The zips the system is creating are 800 MB plus; is there any way to do this and not have the entire zip file open in memory?
Can you execute zip via the command line?
You may also want to look at pbzip2, you will still need tar to do the archival of multiple files though.
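If the command-line route is open, the idea is roughly this (sketched with Python's subprocess just for illustration; from jRuby the equivalent is a system/backtick or Open3 call). The external zip process writes the archive to disk incrementally, so the JVM heap never has to hold the 800 MB result:

```python
import subprocess

def zip_on_disk(archive_path, paths):
    # Info-ZIP's zip streams the archive to disk as it builds it, so memory use
    # stays flat regardless of archive size; -r recurses into directories, -q is quiet.
    subprocess.run(["zip", "-r", "-q", archive_path, *paths], check=True)

# Example (paths are hypothetical):
# zip_on_disk("unused_items.zip", ["old_reports/", "stale_uploads/"])
```

The tar + pbzip2 route works the same way, e.g. piping "tar -cf - <dirs>" into pbzip2 and redirecting to the output file.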
I have a VB6 program running on Windows 7. It is copying a large number of files, and sometimes FileCopy fails with an access violation (roughly once every 60 to 500 files).
I cannot reproduce it with a single file; the problem only happens during such mass-copying operations.
It makes no difference whether source/target are on hard disks, network shares, or CD-ROMs.
What could trigger this problem?
EDIT: My question might be a little bit convoluted, so here's some more data:
Run 1:
Start copying 5,000 files
Access violation on file #983
Access violation on file #1437
Access violation on file #1499
Access violation on file #2132
Access violation on file #3456
Access violation on file #4320
Done
Run 2:
Start copying 5,000 files
Access violation on file #60
Access violation on file #3745
Done
Observations
The affected files are always different
The number of affected files tends to decrease if the same file batch is copied multiple times in succession.
Running as Administrator makes no difference
The application has read/write access to all necessary file system objects
This problem happens on Windows 7 workstations only!
Best guess: Is it possible that another user/application is using the specified file at the time the process is running (anti-virus scanner, Win7 search indexing tool, Windows Defender, etc.)? You might try booting the machine in safe mode to eliminate any of the background services/apps and run the process again to see.
Is there any consistency in the file types or size of the files causing the issue?
Is the machine low on resources? RAM/Disk Space
You said it occurs on Win7 – is it multiple Win7 machines or just one? (Helps to rule out system resources vs. software/OS.)
Any hints from the event viewer (control panel > admin tools) – doubtful
Does the process take a long time to complete? If you can take the performance hit you might look at destroying and recreating the FSO object after every copy or every X files to make sure there isn’t some odd memory leak issue with Win7/VB6.
Not necessarily a recommended solution, but if all else fails you could handle that error, save the files that trigger it in a dictionary/collection, and re-loop through the process with any of those files when done. No guarantee it wouldn't happen again.
Not enough information (as you probably know). Do you log the activity? If not, it's a good place to start. Knowing whether certain files are the problem, and if the issue is repeatable, can help narrow it down.
In your case I would also trap (and log) all errors and retry N times after waiting N seconds. You could be trying to copy in-use files locked by another process, and a retry may allow time for that lock to go away.
Really, more data is the key, and logging is the way to get it.
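As an illustration of that trap/log/retry pattern (sketched here in Python rather than VB6 purely for brevity -- the structure carries over directly; the log file name, retry count, and wait time are arbitrary):

```python
import logging
import shutil
import time

logging.basicConfig(filename="copy.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def copy_with_retry(src, dst, attempts=3, wait_seconds=5):
    """Log every failure and retry a few times; a transient lock held by a
    scanner or indexer will usually have cleared by the next attempt."""
    for attempt in range(1, attempts + 1):
        try:
            shutil.copy2(src, dst)
            return True
        except OSError as err:
            logging.warning("attempt %d failed for %s: %s", attempt, src, err)
            time.sleep(wait_seconds)
    logging.error("giving up on %s after %d attempts", src, attempts)
    return False
```

The log then tells you which files fail, whether the failures repeat for the same files, and whether a short wait is enough for the lock to go away.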
Is there any chance your antivirus program or some indexer is getting in the way?
Try creating a procmon trace while reproducing the error and see what is actually failing. With the trace you can see if there is another program causing the issue or if your app is trying to write somewhere it shouldn't (incorrect permissions) or can't (a temp/scratch directory without enough space).
Check out the presentations linked to on the procmon page or Mark Russinovich's blog for some cool examples of using this tool to solve various Windows/application mysteries.
Is there a hidden/system file in the directory that is potentially blocking it?
Does running the VB6 App with right-click "Run As Administrator" make a difference?
Is the point where it dies at the max # of files in the directory? e.g. Are you sure the upper limit on whatever loop structure you are using in VB6 is correct (Count vs count -1)?