How Should I read a File line-by-line in C? - macos

I'd like to read a file line-by-line. I have fgets() working okay, but am not sure what to do if a line is longer than the buffer sizes I've passed to fgets()? And furthermore, since fgets() doesn't seem to be Unicode-aware, and I want to allow UTF-8 files, it might miss line endings and read the whole file, no?
Then I thought I'd use getline(). However, I'm on Mac OS X, and while getline() is specified in /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/include/stdio.h, it's not in /usr/include/stdio, so gcc doesn't find it in the shell. And it's not particularly portable, obviously, and I'd like the library I'm developing to be generally useful.
So what's the best way to read a file line-by-line in C?

First of all, it's very unlikely that you need to worry about non-standard line terminators like U+2028. Normal text files are not expected to contain them, and the very overwhelming majority of all existing software that reads normal text files doesn't support them. You mention getline() which is available in glibc but not in MacOS's libc, and it would surprise me if getline() did support such fancy line terminators. It's almost a certainly that you can get away with just supporting LF (U+000A) and maybe also CR+LF (U+000D U+000A). To do that, you don't need to care about UTF-8. That's the beauty of UTF-8's ASCII compatibility and is by design.
As for supporting lines that are longer than the buffer you pass to fgets(), you can do this with a little extra logic around fgets. In pseudocode:
while true {
fgets(buffer, size, stream);
dynamically_allocated_string = strdup(buffer);
while the last char (before the terminating NUL) in the buffer is not '\n' {
concatenate the contents of buffer to the dynamically allocated string
/* the current line is not finished. read more of it */
fgets(buffer, size, stream);
}
process the whole line, as found in the dynamically allocated string
}
But again, I think you will find that there's really quite a lot of software out there that simply doesn't bother with that, from software that parses system config files like /etc/passwd to (some) scripting languages. Depending on your use case, it may very well be good enough to use a "big enough" buffer (e.g. 4096 bytes) and declare that you don't support lines longer than that. You can even call it a security feature (a line length limit is protection against resource exhaustion attacks from a crafted input file).

Based on this answer, here's what I've come up with:
#define LINE_BUF_SIZE 1024
char * getline_from(FILE *fp) {
char * line = malloc(LINE_BUF_SIZE), * linep = line;
size_t lenmax = LINE_BUF_SIZE, len = lenmax;
int c;
if(line == NULL)
return NULL;
for(;;) {
c = fgetc(fp);
if(c == EOF)
break;
if(--len == 0) {
len = lenmax;
char * linen = realloc(linep, lenmax *= 2);
if(linen == NULL) {
// Fail.
free(linep);
return NULL;
}
line = linen + (line - linep);
linep = linen;
}
if((*line++ = c) == '\n')
break;
}
*line = '\0';
return linep;
}
To read stdin:
char *line;
while ( line = getline_from(stdin) ) {
// do stuff
free(line);
}
To read some other file, I first open it with fopen():
FILE *fp;
fp = fopen ( filename, "rb" );
if (!fp) {
fprintf(stderr, "Cannot open %s: ", argv[1]);
perror(NULL);
exit(1);
}
char *line;
while ( line = getline_from(fp) ) {
// do stuff
free(line);
}
This works very nicely for me. I'd love to see an alternative that uses fgets() as suggested by #paul-tomblin, but I don't have the energy to figure it out tonight.

Related

only read last line of text file (C++ Builder)

Is there an efficient way to read the last line of a text file? Right now i'm simply reading each line with code like below. Then S holds the last line read. Is there a good way to grab that last line without looping through entire text file?
TStreamReader* Reader;
Reader = new TStreamReader(myfile);
while (!Reader->EndOfStream)
{
String S = Reader->ReadLine();
}
Exactly as Remy Lebeau commented:
Use file access functions FileOpen,FileSeek,FileRead
look here for example of usage:
Convert the Linux open, read, write, close functions to work on Windows
load your file by chunks from end into memory
so make a static buffer and load file into it from end by chunks ...
stop on eol (end of line) usually CR,LF
just scan for 13,10 ASCII codes or their combinations from end of chunk. Beware some files have last line also terminated so you should skip that the first time ...
known eols are:
13
10
13,10
10,13
construct line
if no eol found add whole chunk to string, if found add just the part after it ...
Here small example:
int hnd,siz,i,n;
const int bufsz=256; // buffer size
char buf[bufsz+1];
AnsiString lin; // last line output
buf[bufsz]=0; // string terminator
hnd=FileOpen("in.txt",fmOpenRead); // open file
siz=FileSeek(hnd,0,2); // obtain size and point to its end
for (i=-1,lin="";siz;)
{
n=bufsz; // n = chunk size to load
if (n>siz) n=siz; siz-=n;
FileSeek(hnd,siz,0); // point to its location (from start)
FileRead(hnd,buf,n); // load it to buf[]
if (i<0) // first time pass (skip last eol)
{
i=n-1; if (i>0) if ((buf[i]==10)||(buf[i]==13)) n--;
i--; if (i>0) if ((buf[i]==10)||(buf[i]==13)) if (buf[i]!=buf[i+1]) n--;
}
for (i=n-1;i>=0;i--) // scan for eol (CR,LF)
if ((buf[i]==10)||(buf[i]==13))
{ siz=0; break; } i++; // i points to start of line and siz is zero so no chunks are readed after...
lin=AnsiString(buf+i)+lin; // add new chunk to line
}
FileClose(hnd); // close file
// here lin is your last line

C Program Strange Characters retrieved due to language setting on Windows

If the below code is compiled with UNICODE as compiler option, the GetComputerNameEx API returns junk characters.
Whereas if compiled without UNICODE option, the API returns truncated value of the hostname.
This issue is mostly seen with Asia-Pacific languages like Chinese, Japanese, Korean to name a few (i.e., non-English).
Can anyone throw some light on how this issue can be resolved.
# define INFO_SIZE 30
int main()
{
int ret;
TCHAR infoBuf[INFO_SIZE+1];
DWORD bufSize = (INFO_SIZE+1);
char *buf;
buf = (char *) malloc(INFO_SIZE+1);
if (!GetComputerNameEx((COMPUTER_NAME_FORMAT)1,
(LPTSTR)infoBuf, &bufSize))
{
printf("GetComputerNameEx failed (%d)\n", GetLastError());
return -1;
}
ret = wcstombs(buf, infoBuf, (INFO_SIZE+1));
buf[INFO_SIZE] = '\0';
return 0;
}
In the languages you mentioned, most characters are represented by more than one byte. This is because these languages have alphabets of much more than 256 characters. So you may need more than 30 bytes to encode 30 characters.
The usual pattern for calling a function like wcstombs goes like this: first get the amount of bytes required, then allocate a buffer, then convert the string.
(edit: that actually relies on a POSIX extension, which also got implemented on Windows)
size_t size = wcstombs(NULL, infoBuf, 0);
if (size == (size_t) -1) {
// some character can't be converted
}
char *buf = new char[size + 1];
size = wcstombs(buf, infoBuf, size + 1);

Same .txt files, different sizes?

I have a program that reads from a .txt file
I use the cmd prompt to execute the program with the name of the text file to read from.
ex: program.exe myfile.txt
The problem is that sometimes it works, sometimes it doesn't.
The original file is 130KB and doesn't work.
If I copy/paste the contents, the file is 65KB and works.
If I copy/paste the file and rename it, it's 130KB and doesn't work.
Any ideas?
After more testing it shows that this is what makes it not work:
int main(int argc, char *argv[])
{
char *infile1
char tmp[1024] = { 0x0 };
FILE *in;
for (i = 1; i < argc; i++) /* Skip argv[0] (program name). */
{
if (strcmp(argv[i], "-sec") == 0) /* Process optional arguments. */
{
opt = 1; /* This is used as a boolean value. */
/*
* The last argument is argv[argc-1]. Make sure there are
* enough arguments.
*/
if (i + 1 <= argc - 1) /* There are enough arguments in argv. */
{
/*
* Increment 'i' twice so that you don't check these
* arguments the next time through the loop.
*/
i++;
optarg1 = atoi(argv[i]); /* Convert string to int. */
}
}
else /* not -sec */
{
if (infile1 == NULL) {
infile1 = argv[i];
}
else {
if (outfile == NULL) {
outfile = argv[i];
}
}
}
}
in = fopen(infile1, "r");
if (in == NULL)
{
fprintf(stderr, "Unable to open file %s: %s\n", infile1, strerror(errno));
exit(1);
}
while (fgets(tmp, sizeof(tmp), in) != 0)
{
fprintf(stderr, "string is %s.", tmp);
//Rest of code
}
}
Whether it works or not, the code inside the while loop gets executed.
When it works tmp actually has a value.
When it doesn't work tmp has no value.
EDIT:
Thanks to sneftel, we know what the problem is,
For me to use fgetws() instead of fgets(), I need tmp to be a wchar_t* instead of a char*.
Type casting seems to not work.
I tried changing the declaration of tmp to
wchar_t tmp[1024] = { 0x0 };
but I realized that tmp is a parameter in strtok() used elsewhere in my code.
I here is what I tried in that function:
//tmp is passed as the first parameter in parse()
void parse(wchar_t *record, char *delim, char arr[][MAXFLDSIZE], int *fldcnt)
{
if (*record != NULL)
{
char*p = strtok((char*)record, delim);
int fld = 0;
while (p) {
strcpy(arr[fld], p);
fld++;
p = strtok('\0', delim);
}
*fldcnt = fld;
}
else
{
fprintf(stderr, "string is null");
}
}
But typecasting to char* in strtok doesn't work either.
Now I'm looking for a way to just convert the file from UTF-16 to UTF-8 so tmp can be of type char*
I found this which looks like it can be useful but in the example it uses input from the user as UTF-16, how can that input be taken from the file instead?
http://www.cplusplus.com/reference/locale/codecvt/out/
It sounds an awful lot like the original file is UTF-16 encoded. When you copy/paste it in your text editor, you then save the result out as a new (default encoding) (ASCII or UTF-8) text file. Since a single character takes 2 bytes in a UTF-16-encode file but only 1 byte in a UTF-8-encoded file, that results in the file size being roughly halved when you save it out.
UTF-16 is fine, but you'll need to use Unicode-aware functions (that is, not fgets) to work with it. If you don't want to deal with all that Unicode jazz right now, and you don't actually have any non-ASCII characters to deal with in the file, just do the manual conversion (either with your copy/paste or with a command-line utility) before running your program.

Count lines in pipe stream without consume

I open a pipe stream with given cmd command:
FILE* fp = popen(cmd.c_str(), "r");
How to count its lines without consume?
I tried:
char* line = NULL;
size_t len = 0;
unsigned int lines = 0;
while(getline(&line, &len, fp) != -1){
++lines;
}
But it consumes fp pipe stream.
I guess you are on Linux or some other POSIX system.
You basically cannot process the data from a pipe(7) (internally used by popen(3) ...) without consuming it, since pipes are non-seekable (lseek(2) would fail with ESPIPE, mmap(2) would fail with EACCESS)
You could either redirect the command to some temporary file (using lower level fork,dup2,execve syscalls(2), as explained in Advanced Linux Programming) then process that file and rewind it (and/or resend it elsewhere) or read all the data from the pipe into memory (so the available memory is a limiting factor).

Alternative to fgets()?

Description:
Obtain output from an executable
Note:
Will not compile, due to fgets() declaration
Question:
What is the best alternative to fgets, as fgets requires char *?
Is there a better alternative?
Illustration:
void Q_analysis (const char *data)
{
string buffer;
size_t found;
found = buffer.find_first_of (*data);
FILE *condorData = _popen ("condor_q", "r");
while (fgets (buffer.c_str(), buffer.max_size(), condorData) != NULL)
{
if (found == string::npos)
{
Sleep(2000);
} else {
break;
}
}
return;
}
You should be using the string.getline function for strings
cppreference
however in your case, you should be using a char[] to read into.
eg
string s;
char buffer[ 4096 ];
fgets(buffer, sizeof( buffer ), condorData);
s.assign( buffer, strlen( buffer ));
or your code:
void Q_analysis( const char *data )
{
char buffer[ 4096 ];
FILE *condorData = _popen ("condor_q", "r");
while( fgets( buffer, sizeof( buffer ), condorData ) != NULL )
{
if( strstr( buffer, data ) == NULL )
{
Sleep(2000);
}
else
{
break;
}
}
}
Instead of declaring you buffer as a string declare it as something like:
char buffer[MY_MAX_SIZE]
call fgets with that, and then build the string from the buffer if you need in that form instead of going the other way.
The reason what you're doing doesn't work is that you're getting a copy of the buffer contents as a c-style string, not a pointer into the gut of the buffer. It is, by design, read only.
-- MarkusQ
You're right that you can't read directly into a std::string because its c_str and data methods both return const pointers. You could read into a std::vector<char> instead.
You could also use the getline function. But it requires an iostream object, not a C FILE pointer. You can get from one to the other, though, in a vendor-specific way. See "A Handy Guide To Handling Handles" for a diagram and some suggestions on how to get from one file type to another. Call fileno on your FILE* to get a numeric file descriptor, and then use fstream::attach to associate it with an fstream object. Then you can use getline.
Try the boost library - I believe it has a function to create an fstream from a FILE*
or you could use fileno() to get a standard C file handle from the FILE, then use fstream::attach to attach a stream to that file. From there you can use getline(), etc. Something like this:
FILE *condorData = _popen ("condor_q", "r");
std::ifstream &stream = new std::ifstream();
stream.attach(_fileno(condorData));
I haven't tested it all too well, but the below appears to do the job:
//! read a line of text from a FILE* to a std::string, returns false on 'no data'
bool stringfgets(FILE* fp, std::string& line)
{
char buffer[1024];
line.clear();
do {
if(!fgets(buffer, sizeof(buffer), fp))
return !line.empty();
line.append(buffer);
} while(!strchr(buffer, '\n'));
return true;
}
Be aware however that this will happily read a 100G line of text, so care must be taken that this is not a DoS-vector from untrusted source files or sockets.

Resources