How does std::string keep track of NUL char - c++11

C++11 guarantees that std::string stores the NUL terminator internally. How can that be achieved without an explicit terminate method?
Example:
std::string foo("Baz");
printf("foo contains %s\n",&foo[0]); //Completely safe

The standard requires that the array you get from data() is contiguous and NUL terminated, and that operator[](size()) returns a reference to a NUL character. That's all the specification says about it.
It is the job of the implementation to do that. Typically, this happens by just storing a NUL terminator at the end of the string, making the actual buffer size one larger than it appears. If you add characters, it still stores a NUL terminator at the end of the newly-appended-to sequence of characters. If you remove characters, it moves the NUL terminator appropriately.
This is what encapsulated types can do. They can establish and maintain invariants.
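As an illustration, here is a minimal sketch of how an encapsulated class can maintain the "always NUL terminated" invariant (this is not how real implementations look; they add capacity management, full copy control, small-string optimization, and so on). The point is just that every mutating operation re-establishes the terminator before returning:

#include <cstddef>
#include <cstring>

class tiny_string {
    char*       buf_;   // size_ characters plus a trailing '\0'
    std::size_t size_;
public:
    explicit tiny_string(const char* s)
        : buf_(new char[std::strlen(s) + 1]), size_(std::strlen(s))
    {
        std::memcpy(buf_, s, size_ + 1);      // copies the terminator too
    }
    void push_back(char c)
    {
        char* bigger = new char[size_ + 2];   // room for c and the terminator
        std::memcpy(bigger, buf_, size_);
        bigger[size_]     = c;
        bigger[size_ + 1] = '\0';             // invariant re-established
        delete[] buf_;
        buf_ = bigger;
        ++size_;
    }
    const char* data() const { return buf_; } // always NUL terminated
    ~tiny_string() { delete[] buf_; }
    tiny_string(const tiny_string&) = delete; // copy control omitted for brevity
    tiny_string& operator=(const tiny_string&) = delete;
};

Because no outside code can touch buf_ directly, the class alone decides when the terminator is written, and it writes it on every path that modifies the contents.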

Related

ReadFile truncating console input data containing multibyte characters, how to get correct input?

I was trying to implement a unified input interface using the Windows API function ReadFile for my application, which should be able to handle both console input and redirection. It didn't work as expected with console input containing multibyte (like CJK) characters.
According to the Microsoft documentation, for console input handles, ReadFile just behaves like ReadConsoleA. (FYI, results are encoded in the console's current code page, so the A family of console functions is acceptable. And there's no ReadFileW, as ReadFile works on bytes.) The third and fourth arguments of ReadFile are nNumberOfBytesToRead and lpNumberOfBytesRead respectively, but they are nNumberOfCharsToRead and lpNumberOfCharsRead in ReadConsole. To find out the exact mechanism, I did the following test:
#include <windows.h>

HANDLE in = GetStdHandle(STD_INPUT_HANDLE); // console input handle
BYTE buf[8];
DWORD len;
BOOL f = ReadFile(in, buf, 4, &len, NULL);
if (f) {
    // Print buf, len
    ReadConsoleW(in, buf, 4, &len, NULL); // check count of remaining characters
    // Print len
}
For input like 字, len is set to 4 first (character plus CRLF), indicating the arguments are counting bytes.
For 文字 or a字, len is again 4 and only the first 4 bytes of buf are filled on the first read, but the second read does not get the CRLF. Only when more than 3 characters are input does the second read get the unread LF, then CR. This means ReadFile actually consumes up to 4 logical characters and discards the part of the input after the first 4 bytes.
The behavior of ReadConsoleA is identical to ReadFile.
Obviously, this is more likely a bug than a design decision. I did some searching and found a related feedback report dating back to 2009. It seems that ReadConsoleA and ReadFile used to read data fully from console input, but since that was inconsistent with the ReadFile specification and could cause severe buffer overflows that threatened system processes, Microsoft made a makeshift repair by simply discarding the excess bytes, ignoring support for multibyte charsets. (This is an issue about the behavior after that fix, limiting the buffer to 1 byte.)
Currently the only practical solution I have come up with to make input correct is to check whether the input handle is a console, and process it differently using ReadConsoleW if so, which adds complexity to the implementation. Are there other ways to get it correct?
Maybe I could still keep ReadFile by providing a buffer large enough to hold any input in one read. However, I don't have any ideas on how to check or set the input buffer size. (I can only enter 256 characters (254 plus CRLF) in my application on my computer, but cmd.exe allows entering 8,192 characters, so this is really a problem.) It would also be helpful if more information about this could be provided.
P.S.: Maybe _getws could also help, but this question is about the Windows API, and my application needs to use some low-level console functions.
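For reference, here is a minimal sketch of the console-detection approach described above (readInput and its parameters are invented for this illustration; converting the UTF-16 result to the desired encoding, e.g. with WideCharToMultiByte, is left out):

#include <windows.h>

// Read from 'in' using ReadConsoleW when it is a console handle
// (GetConsoleMode succeeds only for console handles), ReadFile otherwise.
BOOL readInput(HANDLE in, void* buf, DWORD bytesToRead, DWORD* bytesRead)
{
    DWORD mode;
    if (GetConsoleMode(in, &mode)) {
        // Console: read UTF-16 characters directly, avoiding the
        // byte-counted ReadFile/ReadConsoleA path that drops input.
        DWORD chars = 0;
        BOOL ok = ReadConsoleW(in, buf, bytesToRead / sizeof(WCHAR), &chars, NULL);
        *bytesRead = chars * sizeof(WCHAR);
        return ok;
    }
    // Redirected input (file or pipe): byte semantics are well defined.
    return ReadFile(in, buf, bytesToRead, bytesRead, NULL);
}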

SetNamedSecurityInfo takes a writeable path; how big should the buffer be?

SetNamedSecurityInfo is defined as taking an LPTSTR, not an LPCTSTR. Now, a standard Win32 API that takes an LPTSTR usually also has some way of indicating the necessary buffer length: sometimes that's explicit in the signature, sometimes it's documented as MAX_PATH or otherwise. Not so for SetNamedSecurityInfo.
To be honest, I have no idea why SetNamedSecurityInfo would want to write to that buffer, but perhaps it tries to canonicalize a path in-place. But then I might need to support 32768 characters?
As you can see in the documentation for SetNamedSecurityInfo:
pObjectName
A pointer to a null-terminated string that specifies the name of the
object for which to set security information.
That means the length of the buffer passed to the function is determined by the string's own length: the function reads up to the null terminator, so the buffer only needs to be big enough to hold the string itself.
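In practice (a sketch, based on the assumption drawn from the quote above that the function only reads a null-terminated name and never writes to the buffer), it is enough to cast away constness; setFileOwner is an illustrative wrapper, not part of any API:

#include <windows.h>
#include <aclapi.h>

DWORD setFileOwner(const wchar_t* path, PSID owner)
{
    // pObjectName is declared LPWSTR, but only the null-terminated name
    // is read from it, so no extra buffer space needs to be allocated.
    return SetNamedSecurityInfoW(const_cast<LPWSTR>(path),
                                 SE_FILE_OBJECT,
                                 OWNER_SECURITY_INFORMATION,
                                 owner, NULL, NULL, NULL);
}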

Is writing to a unix file through a shell script synchronized?

I have a requirement where many threads will call the same shell script to perform some work, and will then write their output (data as a single text line) to a common text file.
As many threads will try to write data to the same file, my question is whether unix provides a default locking mechanism so that they cannot all write at the same time.
Performing a single short write to a file opened for append is mostly atomic; you can get away with it most of the time (depending on your filesystem). But if you want to be guaranteed that your writes won't interrupt each other, or to write arbitrarily long strings, or to be able to perform multiple writes, or to perform a block of writes and be assured that their contents will be next to each other in the resulting file, then you'll want to lock.
While not part of POSIX, the flock tool (named after the C library call of the same name) provides the ability to perform advisory locking ("advisory" -- as opposed to "mandatory" -- meaning that other potential writers need to voluntarily participate):
(
flock -x 99 || exit # lock the file descriptor
echo "content" >&99 # write content to that locked FD
) 99>>/path/to/shared-file
The use of file descriptor #99 is completely arbitrary -- any unused FD number can be chosen. Similarly, one can safely put the lock on a different file than the one to which content is written while the lock is held.
The advantage of this approach over several conventional mechanisms (such as using exclusive creation of a file or directory) is automatic unlock: If the subshell holding the file descriptor on which the lock is held exits for any reason, including a power failure or unexpected reboot, the lock will be automatically released.
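For comparison, a minimal sketch of the same pattern in C++ using the flock() call the tool is named after (assuming a system that provides BSD flock(2), such as Linux; the path is arbitrary):

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

int main()
{
    // Open (or create) the shared file in append mode.
    int fd = open("/path/to/shared-file", O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd < 0)
        return 1;
    if (flock(fd, LOCK_EX) == 0) {         // advisory exclusive lock
        const char msg[] = "content\n";
        write(fd, msg, sizeof msg - 1);    // write while holding the lock
    }
    close(fd);  // the lock is released automatically when the descriptor closes
    return 0;
}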
my question is whether unix provides a default locking mechanism so
that they cannot all write at the same time.
In general, no. At least not something that's guaranteed to work. But there are other ways to solve your problem, such as lockfile, if you have it available:
Examples
Suppose you want to make sure that access to the file "important" is
serialised, i.e., no more than one program or shell script should be
allowed to access it. For simplicity's sake, let's suppose that it is
a shell script. In this case you could solve it like this:
...
lockfile important.lock
...
access_"important"_to_your_hearts_content
...
rm -f important.lock
...
Now if all the scripts that access "important" follow this guideline,
you will be assured that at most one script will be executing between
the 'lockfile' and the 'rm' commands.
But there's actually a better way if you can use C or C++: use the low-level open() call to open the file in append mode and call write() to write your data, with no locking necessary. Per the write() man page:
If the O_APPEND flag of the file status flags is set, the file offset
shall be set to the end of the file prior to each write and no
intervening file modification operation shall occur between changing
the file offset and the write operation.
Like this:
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

// process-wide global file descriptor
// (O_CREAT added so the 0600 permission argument takes effect when the
// file doesn't exist yet)
int outputFD = open( fileName, O_WRONLY | O_APPEND | O_CREAT, 0600 );
.
.
.
// write a string to the file
ssize_t writeToFile( const char *data )
{
    return( write( outputFD, data, strlen( data ) ) );
}
In practice, you can write anything to the file - it doesn't have to be a NUL-terminated character string.
That's supposed to be atomic for writes up to PIPE_BUF bytes, which is usually something like 512, 4096, or 5120 bytes. Some Linux filesystems apparently don't implement that properly, so in practice you may be limited to about 1K on those filesystems.

Who is responsible for putting the null terminator when handling TB_GETBUTTONTEXT?

The documentation for TB_GETBUTTONTEXT says that the handler has to return the number of characters and optionally (if lParam is not null) copy the string into the supplied buffer.
The caveat is that the length doesn't include the terminating character. I see the following problem. Say the handler stores the string precomputed (so its length doesn't change). First the caller sends the message with lParam set to null - to find the number of characters - and the handler returns the number of characters without the terminating null. Then the caller allocates memory and sends the message again - this time passing the buffer address as lParam.
Should the handler copy the terminating null? If the handler returned N the first time and the caller allocated space for only N characters, then appending a terminating null causes a buffer overrun. But if the caller really expected the string to be null-terminated and allocated space for N+1 characters, and the handler doesn't append the null terminator, the caller ends up with a string that is not null-terminated, and again a buffer overrun can occur (if the caller isn't careful enough).
So what should the handler do? Should it copy the null terminator or not?
MFC uses the sane approach in its CMFCToolBar::OnGetButtonText() implementation: it assumes the caller knows it should allocate N+1 characters and uses lstrcpy() to copy the text.
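A minimal sketch of a handler following that convention (GetButtonTextById is a hypothetical lookup used only for illustration):

// Inside the owner's window procedure.
case TB_GETBUTTONTEXT:
{
    // Hypothetical helper returning the precomputed, null-terminated text.
    const TCHAR* text = GetButtonTextById((int)wParam);
    if (text == NULL)
        return -1;                      // button not found
    if (lParam != 0)
        lstrcpy((LPTSTR)lParam, text);  // copies the terminator; assumes the
                                        // caller allocated N+1 characters
    return lstrlen(text);               // N, terminator not counted
}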

Using named pipe to communicate between unicode and non-unicode processes on windows

If a process with unicode enabled creates a named pipe, it must pass an LPCTSTR for the pipe name, in this case an LPCWSTR. Will a second process without unicode be able to open that pipe by passing an LPCSTR for the pipe name?
Also, can I call CreateNamedPipeW or CreateNamedPipeA and ignore whether unicode is enabled, or do I have to call the appropriate one?
Processes aren't Unicode or non-Unicode, they're just processes. The Unicode/non-Unicode distinction applies only to windows and window-related objects.
You can call either of the two functions. The A version merely converts the string to UTF-16 and passes it to the W function.
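For instance (a sketch; "example_pipe" and the flag choices are arbitrary), the two processes could look like this:

#include <windows.h>

// Process A: create the pipe with the wide-character function.
HANDLE server = CreateNamedPipeW(
    L"\\\\.\\pipe\\example_pipe",
    PIPE_ACCESS_DUPLEX,
    PIPE_TYPE_BYTE | PIPE_WAIT,
    1, 4096, 4096, 0, NULL);

// Process B: open the same pipe with the ANSI function; the name is
// converted to UTF-16 internally, so both refer to the same object.
HANDLE client = CreateFileA(
    "\\\\.\\pipe\\example_pipe",
    GENERIC_READ | GENERIC_WRITE,
    0, NULL, OPEN_EXISTING, 0, NULL);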
