Why is the read-only attribute set (sometimes) for files created by my service? - windows

NOTE: This is a complete re-write of this question. I'd previously conflated some ACL issues with the problem I'm hunting, which is probably why there were no answers.
I have a windows service that uses the standard open/close/write routines to write a log file (it reads stuff from a pipe and stuffs it into the log). A new log file is opened each day at midnight. The system is Windows XP Embedded.
The service runs as the Local System service (CreateService with NULL for the user).
When the service initially starts up, it creates a log file and writes to it with no problems. At this point everything is OK, and you can restart the service (or the computer) with no issues.
However, at midnight (when the day changes), the service creates a new log file and writes to it. The funny thing is, this new log file has the 'read only' flag set. That's a problem because if the service (or the computer) restarts, the service can no longer open the file for writing.
Here's the relevant information from the system with the problem having already happened:
Directory of C:\bbbaudit
09/16/2009 12:00 AM <DIR> .
09/16/2009 12:00 AM <DIR> ..
09/16/2009 12:00 AM 437 AU090915.ADX
09/16/2009 12:00 AM 62 AU090916.ADX
attrib c:\bbbaudit\*
A C:\bbbaudit\AU090915.ADX <-- old log file (before midnight)
A R C:\bbbaudit\AU090916.ADX <-- new log file (after midnight)
cacls output:
C:\ BUILTIN\Administrators:(OI)(CI)F
NT AUTHORITY\SYSTEM:(OI)(CI)F
CREATOR OWNER:(OI)(CI)(IO)F
BUILTIN\Users:(OI)(CI)R
BUILTIN\Users:(CI)(special access:)
FILE_APPEND_DATA
BUILTIN\Users:(CI)(IO)(special access:)
FILE_WRITE_DATA
Everyone:R
C:\bbbaudit BUILTIN\Administrators:(OI)(CI)F
NT AUTHORITY\SYSTEM:(OI)(CI)F
CFN3\Administrator:F
CREATOR OWNER:(OI)(CI)(IO)F
Here's the code I use to open/create the log files:
static int open_or_create_file(char *fname, bool &alreadyExists)
{
int fdes;
// try to create new file, fail if it already exists
alreadyExists = false;
fdes = open(fname, O_WRONLY | O_APPEND | O_CREAT | O_EXCL);
if (fdes < 0)
{
// try to open existing, don't create new file
alreadyExists = true;
fdes = open(fname, O_WRONLY | O_APPEND);
}
return fdes;
}
I'm having real trouble figuring out how the file is getting that read-only flag on it. Anyone who can give me a clue or a direction, I'd greatly appreciate it.
Compiler is VC 6 (Yea, I know, it's so far out of date it isn't funny. Until you realize that we're just now upgraded to XPE from NT 3.51).

The Microsoft implementation of open() has an optional third argument 'pmode', which is required to be present when the second argument 'oflag' includes the O_CREAT flag. The pmode argument specifies the file permission settings, which are set when the new file is closed for the first time. Typically you would pass S_IREAD | S_IWRITE for pmode, resulting in an ordinary read/write file.
In your case you have specified O_CREAT but omitted the third argument, so open() has used whatever value happened to be on the stack at the third argument position. The value of S_IWRITE is 0x0080, so if the value in the third argument position happened to have bit 7 clear, it would result in a read-only file. The fact that you got a read-only file only some of the time, is consistent with stack junk being passed as the third argument.
Below is the link for the Visual Studio 2010 documentation for open(). This aspect of the function's behaviour has not changed since VC 6.
http://msdn.microsoft.com/en-us/library/z0kc8e3z.aspx

Well, I have no idea what the underlying problem is with the 'open' APIs in this case. In order to 'fix' the problem, I ended up switching to using the Win32 APIs for file management (CreateFile, WriteFile, CloseHandle).

Related

what is the file ORA_DUMMY_FILE.f in oracle?

oracle version: 12.2.0.1
As you know, these are then unix processes for the parallel servers in oracle:
ora_p000_ora12c
ora_p001_ora12c
....
ora_p???_ora12c
They can be seen also with the view: gv$px_process.
The spid for each parallel server can be obtained from there.
Then I look for the open files associated with te parallel server here:
ls -l /proc/<spid>/fd
And I'm obtaining around 500-10000 file descriptors for several parallel servers equal to this one:
991 -> /u01/app/oracle/admin/ora12c/dpdump/676185682F2D4EA0E0530100007FFF5E/ORA_DUMMY_FILE.f (deleted)
I've deleted them using:(actually I've create a small script for doing it because there are thousands of them)
gdb -p <spid>
gdb> p close(<fd_id>)
But after some hours the file descriptors start being created again (hundreds every day)
If they are not deleted then eventually the linux limit is reached and any parallel query throws an error like this:
ORA-12801: error signaled in parallel query server P001
ORA-01116: error in opening database file 132
ORA-01110: data file 132: '/u02/oradata/ora12c/pdbname/tablespacenaname_ts_1.dbf'
ORA-27077: too many files open
Does anyone have any idea of how and why this file descriptors are being created, and how to avoid it?.
Edited: Added some more information that could be useful.
I've tested that when a new PDB is created a directory DATA_PUMP_DIR is created in it (select * from all_directories) that is pointing to:
/u01/app/oracle/admin/ora12c/dpdump/<xxxxxxxxxxxxx>
The linux directory is also created.
Also one file descriptor is created pointing to ORA_DUMMY_FILE.f in the new dpdump subdirectory like the ones described initially
lsof | grep "ORA_DUMMY_FILE.f (deleted)"
/u01/app/oracle/admin/ora12c/dpdump/<xxxxxxxxxxxxx>/ORA_DUMMY_FILE.f (deleted)
This may be ok, the problem I face is the continuos growing of the file descriptors pointing to ORA_DUMMY_FILE that reach the linux limits.

Meaning of ST_INO (os.stat() output) in Windows OS

Can anyone tell me what the meaning of the value for st_ino is when running os.stat() on Windows (Python 3.5.3)?
In earlier Python releases, this contained dummy values, but this has recently changed and I couldn't find how it is calculated/generated. I suspect that it differs based on the file system (NTFS, FAT, ...)
Example
import os
stat = os.stat(r'C:\temp\dummy.pdf')
for attr in dir(stat):
if attr.startswith('st_'):
print('{}: {}'.format(attr,
stat.__getattribute__(attr)))
Result
st_atime: 1495113452.7421005
st_atime_ns: 1495113452742100400
st_ctime: 1495113452.7421005
st_ctime_ns: 1495113452742100400
st_dev: 2387022088
st_file_attributes: 33
st_gid: 0
st_ino: 10414574138828642
st_mode: 33060
st_mtime: 1494487966.9528062
st_mtime_ns: 1494487966952806300
st_nlink: 1
st_size: 34538
st_uid: 0
Background
I used the shutil.copyfile() function an ran into an SameFileError. After looking through the code (and despite what it says in the comments of shutil.py) the shutil._samefile() function does not compare pathnames in Windows. Instead, it uses os.path.samefile() which compares the st_ino and st_dev values.
Both source and target file resided on the same device (volume), which would explain why the value for st_dev was the identical. But I'm still puzzled as to why st_ino had the same value for both files.
Remark: both files were on a Sharepoint volume mounted using webDAV, so their st_ino value may have been 0 (dummy), which would explain why they were equal. I'm still curious though ;-)
Update
As I suspected, the value for st_ino returned for the files residing on the Sharepoint volume (WebDAV), was 0, as was the value for st_dev. This is the reason for (incorrect) SameFileError. Example output:
\\sharepoint#SSL\AUT.pdf os.stat_result(st_mode=33206, st_ino=0, st_dev=0, st_nlink=1, st_uid=0, st_gid=0, st_size=4717, st_atime=1495031011, st_mtime=1495031011, st_ctime=1495031570)
\\sharepoint#SSL\ING.pdf os.stat_result(st_mode=33206, st_ino=0, st_dev=0, st_nlink=1, st_uid=0, st_gid=0, st_size=4722, st_atime=1495031203, st_mtime=1495031203, st_ctime=1495031733)
\\sharepoint#SSL\WAG.pdf os.stat_result(st_mode=33206, st_ino=0, st_dev=0, st_nlink=1, st_uid=0, st_gid=0, st_size=4710, st_atime=1495031511, st_mtime=1495031511, st_ctime=1495031912)

Strange pause when python download file from ftp

I have some code, where i check some directories on ftp server and download new files on my server. There are above 3 million files on server (zip archives). I am doing many not optimize things in this code, but all of them works fast, except part with downloading. Here is this part:
lf = open(local_filename, "wb") //here i create blank file
print ("opened")
try:
ftp.retrbinary("RETR "+name, lf.write) //here i write data
print ("wrote")
except ftplib.error_perm:
pass
lf.close() //here i close file with data
print ("closed")
my problem in the part between print ("opened") and print ("wrote"). My python console (2.7) keep silence for 10-20 second on this fase, but size of downloading files is very tiny. Its below 2-3 Kb.
Strange thing in next: when i start script from my own PC (windows 7), it works great and fast, but when i start it on windows server 2012 R2 (VDS), i got this sadly pause. Guys, i need your help. What should i do for configuration my server and fast downloading?
i got the answer. just need to run next command:
netsh int tcp set global ecncapability=disabled
and everything will be excellent!

file's date changes after zip in and out again, according to XCOPY

So, here's the problem: I have files which are regular files, and they are put into a ZIP file (see below for details on ZIP). Then I unzip them (see below for details on the tool used), and the files are restored. The date of the file is restored, as in standard in the ZIP/UNZIP tools used. When querying using DIR, or in Windows Explorer, the files involved have the same date as they had, before being handled by the ZIP/UNZIP process.
So, all OK.
But then, I'm using the XCOPY /D command, to further manipulate different copies of those files on the disk ... and, XCOPY says : one file is newer than the other one. Given the fact the date, hour, up until minutes is the same .. the difference would be regarding a smaller entity, like seconds ?
All involved disks have NTFS file system.
Example:
C:\my>dir C:\windows\Background_mycomputer.cmd C:\my\directory\Background_mycomputer.cmd
Volume in drive C is mycomputerC
Volume Serial Number is 1234-5678
Directory of C:\windows
31/12/2014 19:50 51 Background_mycomputer.cmd
1 File(s) 51 bytes
Directory of C:\my\directory
31/12/2014 19:50 51 Background_mycomputer.cmd
1 File(s) 51 bytes
0 Dir(s) 33.655.316.480 bytes free
C:\my>xcopy C:\windows\Background_mycomputer.cmd C:\my\directory\Background_mycomputer.cmd /D
Overwrite C:\my\directory\Background_mycomputer.cmd (Yes/No/All)? y
C:\windows\Background_mycomputer.cmd
1 File(s) copied
C:\my>xcopy C:\my\directory\Background_mycomputer.cmd C:\windows\Background_mycomputer.cmd /D
0 File(s) copied
C:\my>xcopy C:\windows\Background_mycomputer.cmd C:\my\directory\Background_mycomputer.cmd /D
0 File(s) copied
C:\my>unzip -v
UnZip 6.00 of 20 April 2009, by Info-ZIP. Maintained by C. Spieler. Send
bug reports using http://www.info-zip.org/zip-bug.html; see README for details.
Latest sources and executables are at ftp://ftp.info-zip.org/pub/infozip/ ;
see ftp://ftp.info-zip.org/pub/infozip/UnZip.html for other sites.
Compiled with Microsoft C 13.10 (Visual C++ 7.1) for
Windows 9x / Windows NT/2K/XP/2K3 (32-bit) on Apr 20 2009.
UnZip special compilation options:
ASM_CRC
COPYRIGHT_CLEAN (PKZIP 0.9x unreducing method not supported)
NTSD_EAS
SET_DIR_ATTRIB
TIMESTAMP
UNIXBACKUP
USE_EF_UT_TIME
USE_UNSHRINK (PKZIP/Zip 1.x unshrinking method supported)
USE_DEFLATE64 (PKZIP 4.x Deflate64(tm) supported)
UNICODE_SUPPORT [wide-chars] (handle UTF-8 paths)
MBCS-support (multibyte character support, MB_CUR_MAX = 1)
LARGE_FILE_SUPPORT (large files over 2 GiB supported)
ZIP64_SUPPORT (archives using Zip64 for large files supported)
USE_BZIP2 (PKZIP 4.6+, using bzip2 lib version 1.0.5, 10-Dec-2007)
VMS_TEXT_CONV
[decryption, version 2.11 of 05 Jan 2007]
UnZip and ZipInfo environment options:
UNZIP: [none]
UNZIPOPT: [none]
ZIPINFO: [none]
ZIPINFOOPT: [none]
C:\my>ver
Microsoft Windows [Version 6.1.7601]
C:\my>zip -?
Copyright (c) 1990-2006 Info-ZIP - Type 'zip "-L"' for software license.
Zip 2.32 (June 19th 2006). Usage:
zip [-options] [-b path] [-t mmddyyyy] [-n suffixes] [zipfile list] [-xi list]
The default action is to add or replace zipfile entries from list, which
can include the special name - to compress standard input.
If zipfile and list are omitted, zip compresses stdin to stdout.
-f freshen: only changed files -u update: only changed or new files
-d delete entries in zipfile -m move into zipfile (delete files)
-r recurse into directories -j junk (don't record) directory names
-0 store only -l convert LF to CR LF (-ll CR LF to LF)
-1 compress faster -9 compress better
-q quiet operation -v verbose operation/print version info
-c add one-line comments -z add zipfile comment
-# read names from stdin -o make zipfile as old as latest entry
-x exclude the following names -i include only the following names
-F fix zipfile (-FF try harder) -D do not add directory entries
-A adjust self-extracting exe -J junk zipfile prefix (unzipsfx)
-T test zipfile integrity -X eXclude eXtra file attributes
-! use privileges (if granted) to obtain all aspects of WinNT security
-R PKZIP recursion (see manual)
-$ include volume label -S include system and hidden files
-e encrypt -n don't compress these suffixes
C:\my>
Question: I do not want XCOPY to make updates where I know they are invalid cause the time format is doing something wrong. How do I prevent that ?
From how I see, there's different things involved, being XCOPY, very specific ZIP and UNZIP, and NTFS file system. Which one is doing something wrong ?
I must stress that apart from ZIP and UNZIP, there are no other changes done to the file, like changing 1 file, then making a change to another one, in less than 60 seconds time.
At moment of test, the time shown was NOT the current time, and not close to it either. No file is adjusting to the current time, the times refer to last changes of the file in question, which may be any time in the past. In this case, it's one day later, but it can be anything.
I noticed the peculiar behavior Raymond Chen describes when writing a Powershell script (GitHub link) to freshen a zip archive using the System.IO.Compression and System.IO.Compression.FileSystem libraries.
Interestingly, Zip archives can store multiple copies of the same file with identical metadata (name, relative path, modification dates). Extracting the second copy of the file will fail in Windows Explorer because the file already exists.
When trying to prevent re-zipping a file was already archived, I checked the relative path and date, and noticed that there was a discrepancy of up to two seconds in the LastWriteTime. This workaround compensates for the loss of precision:
$AlreadyArchivedFile = ($WriteArchive.Entries | Where-Object {#zip will store multiple copies of the exact same file - prevent this by checking if already archived.
(($_.FullName -eq $RelativePath) -and ($_.Length -eq $File.Length) ) -and
([math]::Abs(($_.LastWriteTime.UtcDateTime - $File.LastWriteTimeUtc).Seconds) -le 2) #ZipFileExtensions timestamps are only precise within 2 seconds.
})
Also, the IsDaylightSavingTime flag is not stored in the Zip archive. As a result I was surprised when extracted files became an hour newer than the original archived file. I tried this several times and saw the extracted file's timestamp incremented by an hour every time it was compressed and extracted.
Here's a very ugly workaround that decreases the archived file time by one hour to make the original source file and extracted file timestamps consistent:
If($File.LastWriteTime.IsDaylightSavingTime() -and $ArchivedFile){#HACK: fix for buggy date - adds an hour inside archive when the zipped file was created during PDT (files created during PST are not affected).
$entry = $WriteArchive.GetEntry($RelativePath)
$entry.LastWriteTime = ($File.LastWriteTime.ToLocalTime() - (New-TimeSpan -Hours 1))
}
There's probably a better way to handle this. Unfortunately I'm not aware of any way to store a Daylight Savings indicator for a file in a .Zip archive, and that information is lost.

Random corruption in file created/updated from shell script on a singular client to NFS mount

We have bash script (job wrapper) that writes to a file, launches a job, then at job completion it appends to the file information about the job. The wrapper is run on one of several thousand batch nodes, but has only cropped up with several batch machines (I believe RHEL6) accessing one NFS server, and at least one known instance of a different batch job on a different batch node using a different NFS server. In all cases, only one client host is writing to the files in question. Some jobs take hours to run, others take minutes.
In the same time period that this has occurred, there seems to be 10-50 issues out of 100,000+ jobs.
Here is what I believe to effectively be the (distilled) version of the job wrapper:
#!/bin/bash
## cwd is /nfs/path/to/jobwd
## This file is /nfs/path/to/jobwd/job_wrapper
gotEXIT()
{
## end of script, however gotEXIT is called because we trap EXIT
END="EndTime: `date`\nStatus: Ended”
echo -e "${END}" >> job_info
cat job_info | sendmail jobtracker#example.com
}
trap gotEXIT EXIT
function jobSetVar { echo "job.$1: $2" >> job_info; }
export -f jobSetVar
MSG=“${email_metadata}\n${job_metadata}”
echo -e "${MSG}\nStatus: Started" | sendmail jobtracker#example.com
echo -e "${MSG}" > job_info
## At the job’s end, the output from `time` command is the first non-corrupt data in job_info
/usr/bin/time -f "Elapsed: %e\nUser: %U\nSystem: %S" -a -o job_info job_command
## 10-360 minutes later…
RC=$?
echo -e "ExitCode: ${RC}" >> job_info
So I think there are two possibilities:
echo -e "${MSG}" > job_info
This command throws out corrupt data.
/usr/bin/time -f "Elapsed: %e\nUser: %U\nSystem: %S" -a -o job_info job_command
This corrupts the existing data, then outputs it’s data correctly.
However, some job, but not all, call jobSetVar, which doesn't end up being corrupt.
So, I dig into time.c (from GNU time 1.7) to see when the file is open. To summarize, time.c is effectively this:
FILE *outfp;
void main (int argc, char** argv) {
const char **command_line;
RESUSE res;
/* internally, getargs opens “job_info”, so outfp = fopen ("job_info", "a”) */
command_line = getargs (argc, argv);
/* run_command doesn't care about outfp */
run_command (command_line, &res);
/* internally, summarize calls fprintf and putc on outfp FILE pointer */
summarize (outfp, output_format, command_line, &res); /
fflush (outfp);
}
So, time has FILE *outfp (job_info handle) open the entire time of the job. It then writes the summary at the end of the job, and then doesn’t actually appear to close the file (not sure if this is necessary with fflush?) I've no clue if bash also has the file handle open concurrently as well.
EDIT:
A corrupted file will typically end consist of the corrupted part, followed with the non-corrupted part, which may look like this:
The corrupted section, which would occur before the non-corrupted section, is typically largely a bunch of 0x0000, with maybe some cyclic garbage mixed in:
Here's an example hexdump:
40000000 00000000 00000000 00000000
00000000 00000000 C8B450AC 772B0000
01000000 00000000 C8B450AC 772B0000
[ 361 x 0x00]
Then, at the 409th byte, it continues with the non-corrupted section:
Elapsed: 879.07
User: 0.71
System: 31.49
ExitCode: 0
EndTime: Fri Dec 6 15:29:27 PST 2013
Status: Ended
Another file looks like this:
01000000 04000000 805443FC 9D2B0000 E04144FC 9D2B0000 E04144FC 9D2B0000
[96 x 0x00]
[Repeat above 3 times ]
01000000 04000000 805443FC 9D2B0000 E04144FC 9D2B0000 E04144FC 9D2B0000
Followed by the non corrupted section:
Elapsed: 12621.27
User: 12472.32
System: 40.37
ExitCode: 0
EndTime: Thu Nov 14 08:01:14 PST 2013
Status: Ended
There are other files that have much more random corruption sections, but more than a few were cyclical similar to above.
EDIT 2: The first email, sent from the echo -e statement goes through fine. The last email is never sent due to no email metadata from corruption. So MSG isn't corrupted at that point. It's assumed that job_info probably isn't corrupt at that point either, but we haven't been able to verify that yet. This is a production system which hasn't had major code modifications and I have verified through audit that no jobs have been ran concurrently which would touch this file. The problem seems to be somewhat recent (last 2 months), but it's possible it's happened before and slipped through. This error does prevent reporting which means jobs are considered failed, so they are typically resubmitted, but one user in specific has ~9 hour jobs in which this error is particularly frustrating. I wish I could come up with more info or a way of reproducing this at will, but I was hoping somebody has maybe seen a similar problem, especially recently. I don't manage the NFS servers, but I'll try to talk to the admins to see what updates the NFS servers at the time of these issues (RHEL6 I believe) were running.
Well, the emails corresponding to the corrupt job_info files should tell you what was in MSG (which will probably be business as usual). You may want to check how NFS is being run: there's a remote possibility that you are running NFS over UDP without checksums. That could explain some corrupt data. I also hear that UDP/TCP checksums are not strong enough and the data can still end up corrupt -- maybe you are hitting such a problem (I have seen corrupt packets slipping through a network stack at least once before, and I'm quite sure some checksumming was going on). Presumably the MSG goes out as a single packet and there might be something about it that makes checksum conflicts with the garbage you see more likely. Of course it could also be an NFS bug (client or server), a server-side filesystem bug, busted piece of RAM... possibilities are almost endless here (although I see how the fact that it's always MSG that gets corrupted makes some of those quite unlikely). The problem might be related to seeking (which happens during the append). You could also have a bug somewhere else in the system, causing multiple clients to open the same job_info file, making it a jumble.
You can also try using different file for 'time' output and then merge them together with job_info at the end of script. That may help to isolate problem further.
Shell opens 'job_info' file for writing, outputs MSG and then shall close its file descriptor before launching main job. 'time' program opens same file for append as stream and I suspect the seek over NFS is not done correctly which may cause that garbage. Can't explain why, but normally this shall not happen (and is not happening). Such rare occasions may point to some race condition somewhere, can be caused by out of sequence packet delivery (due to network latency spike) or retransmits which causes duplicate data, or a bug somewhere. At first look I would suspect some bug, but that bug may be triggered by some network behavior, e.g. unusually large delay or spike of packet loss.
File access between different processes are serialized by kernel, but for additional safeguard may be worth adding some artificial delays - sleep timers between outputs for example.
Network is not transparent, especially a large one. There can be WAN optimization devices which are known to cause application issues sometimes. CIFS and NFS are good candidates for optimization over WAN with local caching of filesystem operations. Might be worth looking for recent changes with network admins..
Another thing to try, although can be difficult due to rare occurrences is capture of interesting NFS sessions via tcpdump or wireshark. In really tough cases we do simultaneous capturing on both client and server side and then compare the protocol logic to prove that network is or is not working correctly. That's a whole topic in itself, requires thorough preparation and luck but usually a last resort of desperate troubleshooting :)
It turns out this was actually another issue altogether, apparently to do with an out-of-date page being written to disk.
A bug fix was supplied to the linux-nfs implementation:
http://www.spinics.net/lists/linux-nfs/msg41357.html

Resources