Why should I have to bother putting a linefeed at the end of every file? - text-files

I've occasionally encountered software - including compilers - that refuse to accept or properly handle text files that aren't properly terminated with a newline. I've even encountered explicit errors of the form,
no newline at the end of the file
...which would seem to indicate that they're explicitly checking for this case and then rejecting it just to be stubborn.
Am I missing something here? Why would - or should - anything care whether or not a file ends with a seemingly-superfluous bit of whitespace?

Historically, at least in the Unix world, "newline" or rather U+000A Line Feed was a line terminator. This stands in stark contrast to the practice in Windows for example, where CR+LF is a line separator.
A naïve way to read every line in a file would be to append characters to a buffer until an LF is encountered. Done carelessly, this would ignore the last line of a file if it wasn't terminated by an LF.
Another thing to consider is macro systems that allow including files. A line such as
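A minimal Python sketch of that naïve reader, to show where the last line gets lost. The function names are made up for illustration and do not come from any particular tool:

```python
def naive_lines(data: str) -> list[str]:
    """Collect characters in a buffer and emit a line only when an LF is seen."""
    lines = []
    buffer = []
    for ch in data:
        if ch == "\n":
            lines.append("".join(buffer))
            buffer.clear()
        else:
            buffer.append(ch)
    # Bug on purpose: whatever is still in `buffer` here is silently dropped.
    return lines


def careful_lines(data: str) -> list[str]:
    """Same result, plus a trailing line that has no terminating LF."""
    lines = naive_lines(data)
    tail = data.rsplit("\n", 1)[-1]  # text after the final LF, if any
    if tail:
        lines.append(tail)
    return lines


print(naive_lines("first\nsecond"))    # the unterminated last line is lost
print(careful_lines("first\nsecond"))  # both lines survive
```

With a properly terminated file both functions return the same list, which is why the bug can go unnoticed for a long time.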
%include "foo.inc"
might be replaced by the contents of the mentioned file where, if the last line wasn't ended with an LF, it would get merged with the next line. And yes, I've seen this behavior with a particular macro assembler for an embedded platform.
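The merging effect can be reproduced with a purely textual include expander. This is only a sketch in Python: the `%include` handling, the function name, and the assembler-like sample lines are assumptions for illustration, not the behavior of any specific assembler.

```python
def expand_includes(source: str, files: dict[str, str]) -> str:
    """Replace each %include line with the raw contents of the named file."""
    out = []
    for line in source.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("%include"):
            name = stripped.split('"')[1]
            # The contents are pasted verbatim; no newline is added.
            out.append(files[name])
        else:
            out.append(line)
    return "".join(out)


main = '%include "foo.inc"\nmov a, b\n'
print(repr(expand_includes(main, {"foo.inc": "nop\n"})))  # lines stay separate
print(repr(expand_includes(main, {"foo.inc": "nop"})))    # "nop" fuses with "mov a, b"
```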
Nowadays I firmly believe that (a) it's a relic of ancient times and (b) I haven't seen modern software that can't handle it, yet we still carry around numerous editors on Unix-like systems that helpfully put one byte more than needed at the end of a file.

Generally I would say that a lack of a newline at the end of a source file would mean that something went wrong in the editor or source code control client and not all of the code in the buffer got flushed. While it's likely that this would result in other errors, knowing that something likely went wrong in the editor/SCM and code may be missing is a pretty useful bit of knowledge. Certainly something that I would want to check.

Related

how should I interpret the MT940 specifications

I'm building my own MT940 parser and I'm running into something that seems to be an unspecified issue.
The specification of a :61: tag, states that it ends with a variable amount of characters (34x). From an example file I see that they can continue on the next line.
For example:
:61:1510151015C54,01NTRFNONREF//15288910043499
/TRCD/00100/
How do I determine if the next line is a new tag or a continuation of the content of the preceding tag? It seems that looking for an :xx: pattern at the beginning of the line is naive, as it could cause a bug in the exceptional situation where the content actually contains that specific pattern.
Every line that starts with a tag such as :61: is a new line of information in the format. If it doesn't start with such a tag, then it's a continuation.
A small word of warning, though. MT940 is a standard, but there are subtle differences per bank, so what works for one bank might not work for another. For instance, some specifications have a header that defines the start of a transaction, but others don't.
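The rule from this answer (a line starting with a tag opens a new field, anything else continues the previous one) could be sketched in Python as follows. The tag pattern of two digits plus an optional letter is an assumption based on common tags such as :61: and :60F:, and the :86: line in the example is invented; adjust both to your bank's variant.

```python
import re

# Assumed tag shape: colon, two digits, optional uppercase letter, colon.
TAG_RE = re.compile(r"^:(\d{2}[A-Z]?):(.*)$")


def group_fields(lines):
    """Group raw lines into (tag, [content lines]) pairs."""
    fields = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        match = TAG_RE.match(line)
        if match:
            # A tag at the start of the line opens a new field.
            fields.append((match.group(1), [match.group(2)]))
        elif fields:
            # Anything else continues the most recent field.
            fields[-1][1].append(line)
        # Lines before the first tag (for example a bank header) are skipped.
    return fields


sample = [
    ":61:1510151015C54,01NTRFNONREF//15288910043499",
    "/TRCD/00100/",
    ":86:hypothetical description line",
]
for tag, content in group_fields(sample):
    print(tag, content)
```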

How to find foreign language used in "C comments"

I have a large code base where most of the documentation and source code comments are in English. But one of the minor contributors wrote comments in a different language, spread across various places.
Is there a simple trick that will let me find them? I imagine first a way to extract all comments from the code and generate a single text file (with possible source file / line number info), then pipe this through some language detection app.
If that matters, I'm on Linux and the current compiler on this project is Clang.
The only thing that comes to mind is to go through all of the code manually and check it yourself. If it's a similar language that doesn't contain foreign letters, consider using something with a spellchecker. This way, the text that isn't recognized will get underlined and be easy to spot.
Other than that, I don't see an easy way to go through with this.
You could make a program, that reads the files and only prints the comments out to another output file, where you then spell check that file, but this would seem to be a waste of time, as you would easily be able to spot the comments yourself.
If you do make a program for that, however, keep in mind that there are three things to check for:
If comment starts with /*, make sure it stops reading when encountering */
If comment starts with //, only read one line - unless:
If line starting with // ends with \, read next line as well
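Those three rules could be sketched in Python roughly like this. It is a simplified scanner, not a real C lexer: it skips string and character literals so that a `//` inside a string is not mistaken for a comment, but it does not try to cover every corner of the C grammar. The function name is hypothetical.

```python
def extract_comments(source: str) -> list[tuple[int, str]]:
    """Return (line_number, text) for each /* ... */ and // comment."""
    comments = []
    i, n = 0, len(source)
    while i < n:
        two = source[i:i + 2]
        if two == "/*":
            # Block comment: read until the closing */ (or end of input).
            end = source.find("*/", i + 2)
            if end == -1:
                end = n
            comments.append((source.count("\n", 0, i) + 1, source[i + 2:end]))
            i = end + 2
        elif two == "//":
            # Line comment: stop at the first newline that is not
            # preceded by a backslash (line continuation).
            j = i + 2
            while j < n and not (source[j] == "\n" and source[j - 1] != "\\"):
                j += 1
            comments.append((source.count("\n", 0, i) + 1, source[i + 2:j]))
            i = j
        elif source[i] in "\"'":
            # Skip string and character literals, honoring backslash escapes.
            quote = source[i]
            i += 1
            while i < n and source[i] != quote:
                i += 2 if source[i] == "\\" else 1
            i += 1
        else:
            i += 1
    return comments


code = 'int a; // first \\\n still comment\nchar *s = "// not a comment"; /* block\n two */\n'
for line_number, text in extract_comments(code):
    print(line_number, repr(text))
```

The collected comments, together with their line numbers, can then be written to a single file and run through a spellchecker.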
While it is possible to detect a language from a string automatically, you need way more words than fit in a usual comment to do so.
Solution: Use your own eyes and your own brain...

New line at the end of source code

Anytime I open up the code editor in Visual Studio, there is always an empty new line at the end of generated code. I usually delete it since it seems irrelevant to me. However, I recently read code on GitHub that said:
\ No newline at end of file
This was the last line. Now I'm thinking those empty new lines at the end of source code do have some relevance. But what do they mean? Do they provide any performance boost?
Two things make me prefer having a newline at the end of files:
Code reviews are slightly easier when looking at diffs that occur at the end of the file (i.e., if a line is added at the end of the file, it appears that the previous line changed, when it only gained a newline)
Going to the end of the file (Ctrl+End in Windows) always puts me at the same column and not in some unexpected position out to the right
Pretty much the only difference it makes is that if you have a file with no newline - like this:
blah\n
bleh (no newline)
When you modify it to be:
blah\n
bleh\n
foo (no newline)
Then according to the diff, you modified 2 lines - one with content, the other one with newline... which is not what you wanted probably. Then again, in reality it doesn't matter that much which way you choose. If you include newlines, your diffs will be a little bit cleaner.
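The effect is easy to reproduce with Python's standard `difflib` module. This is just an illustration of the example above; `changed_lines` is a helper name made up for the sketch.

```python
import difflib


def changed_lines(before: str, after: str) -> list[str]:
    """Return only the added (+) and removed (-) lines of a unified diff."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        lineterm="",
    )
    body = list(diff)[2:]  # drop the two "---" / "+++" header lines
    return [line for line in body if line.startswith(("+", "-"))]


# No newline at the end of the file: the old last line shows up as changed.
print(changed_lines("blah\nbleh", "blah\nbleh\nfoo"))

# Newline at the end of the file: only the new line shows up.
print(changed_lines("blah\nbleh\n", "blah\nbleh\nfoo\n"))
```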
It also makes a difference for some preprocessors, as mentioned in another answer, but that depends on what language you use.
Of course it makes no performance difference at all.
No, it makes no difference whatsoever.
Some coding conventions say it's good to have a final newline, some say it's good not to.
Read more about new line in C++ here: "No newline at end of file" compiler warning
I suppose both Visual Studio and Git do it mostly to be consistent with the convention.

strange characters at beginning of file

There are strange characters at the beginning of a file I'm editing (using TextMate).
I don't know when they appeared. They're invisible in TextMate, but my script that reads the file goes crazy.
These are the first few characters in the file (as seen with the od command):
0000000 177377 000120 000105 000117 000120 000114 000105 000072
The first 2 shouldn't be there, I think. Maybe they were caused by some strange Dropbox sync, or something else, but they tend to reappear (I don't yet know when).
My question: what is that 177377, and what is a simple way to remove it in my Ruby script?
Thanks
The 177377 is the octal form of the 16-bit value 0xFEFF, which is a byte-order mark (BOM); the leading 0000000 is only od's offset column, not file data. A BOM followed by 16-bit units indicates to consumers that the file is UTF-16 encoded, and the remaining words (000120, 000105, 000117, ...) are indeed the UTF-16 code units for "PEOPLE:". Because od prints 16-bit words in host byte order, on a little-endian machine the bytes on disk are FF FE, i.e. little-endian UTF-16.
What to do with it is a little tricky. In general, the BOM does accurately represent the encoding of the following data, and it appears to do so here. Detecting and skipping it and then treating the subsequent content as if it were in your local default charset is usually going to be the wrong thing to do. Instead, I'd either decode the file as UTF-16 in the script or figure out why your editor is saving the file with this encoding and whether there's a way to change that.
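A minimal sketch of honoring a BOM instead of stripping it blindly, written in Python rather than Ruby. The function name and the UTF-8 fallback are assumptions, and the sample bytes assume the od dump is read as 16-bit little-endian words (bytes FF FE 50 00 45 00 ...).

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(raw: bytes, default: str = "utf-8") -> str:
    """Decode bytes using the encoding announced by a leading BOM, if any."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)
    return raw.decode(default)  # assumed fallback when no BOM is present


raw = b"\xff\xfeP\x00E\x00O\x00P\x00L\x00E\x00:\x00"
print(decode_with_bom(raw))
```

The returned string no longer contains the BOM, and the rest of the data has been decoded with the encoding the BOM announced rather than with the platform default.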

"descript.ion" file spec?

There appears to be a somewhat standard "descript.ion" file in the world of Windows programs which provides metadata for all or some of the files in a given directory.
I know there are various programs which write this file (example: NewsBin, UseNet downloader) and read it (Example: "FAR", a file manager mimicking old Norton Commander).
I'm writing my own file indexer, and would like to add the ability to parse and use the info from "descript.ion" files.
The problem I have is that I have not been able to find an actual spec for the file, despite much googling.
I reverse engineered it as best I could, but I'm not certain whether I captured 100% of the possible details, so I figured I'd ask SO.
Here are example lines from the file:
"Rus Song1.mp3" SovietMus 1/2, rus_song#gmail.com, Fri Aug 08 00:46:27 2008
RusSong2.mp3 SovietMus 2/2, rus_song#gmail.com, Fri Aug 08 01:46:22 2008
As it seems the structure is:
First "token" is a file name.
If the token starts with any character other than a double quote, the token ends at the first space character.
If the token starts with a double quote, the end of the token is the following double quote.
(Not sure what happens if the filename contains a double quote. IIRC it's illegal in Windows filesystems, so escaping the quote may be a moot question.)
Last token (end of line to the very last comma moving backwards) is a timestamp.
Second to last token (the very last comma to second-to-last comma moving backwards) is the name of the poster from the Usenet newsgroup. I'm not quite sure what happens in generic format since the only descript.ion files I saw were from NewsBin that is obviously Usenet centric.
Everything in between is a description, in NewsBin's case coming from post's subject.
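The reverse-engineered structure above could be sketched in Python like this. It only encodes the guesses listed here for NewsBin-style lines (file name token, then description, poster, and timestamp separated by the last two commas); it is not an official format, and the function name is made up.

```python
def parse_newsbin_line(line: str) -> dict[str, str]:
    """Split one descript.ion line according to the guessed NewsBin layout."""
    line = line.rstrip("\r\n")
    if line.startswith('"'):
        # Quoted file name: the token ends at the next double quote.
        end = line.index('"', 1)
        filename, rest = line[1:end], line[end + 1:]
    else:
        # Unquoted file name: the token ends at the first space.
        filename, _, rest = line.partition(" ")
    # Last two comma-separated parts are poster and timestamp; the
    # description keeps any commas of its own. Raises ValueError if the
    # line has fewer than two commas.
    description, poster, timestamp = (part.strip() for part in rest.rsplit(",", 2))
    return {
        "filename": filename,
        "description": description,
        "poster": poster,
        "timestamp": timestamp,
    }


print(parse_newsbin_line(
    '"Rus Song1.mp3" SovietMus 1/2, rus_song#gmail.com, Fri Aug 08 00:46:27 2008'
))
```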
Questions:
Does anyone know of a bit more official "descript.ion" file spec/documentation?
(or, at least, have your own knowledge of those files and can verify my spec)
Does anyone know of any other programs that read or write this file?
Thanks!
The description files on my system are from Total Commander as well. They follow the basic spec mentioned in the other answers:
Filename Text I typed to describe the file
"Long filename" Some text
Each line ends in a normal Windows line break.
In addition, the program stores multi-line comments as follows:
Filename This is the first line\\nSecond line\\nLast line\x04\xc2
Here, I mean that the descript.ion file contains a backslash and a letter 'n' where I typed a line break, and two special characters 04 C2 at the end of the comment. In addition, the line is ended by a Windows line break 0D 0A.
Apparently, the two extra characters at the end of the line signal the end of a multiline comment. If I remove them, the comment is rendered as a single line in the GUI, and the '\n' sequences are displayed literally.
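Based only on the observation above, decoding such a comment could look like the following Python sketch. It assumes the file was read with a single-byte encoding such as Latin-1, so that the bytes 04 C2 show up as the two characters `\x04` and `\xc2`; the function name is made up for illustration.

```python
# Marker observed at the end of multi-line comments (bytes 04 C2).
MULTILINE_MARKER = "\x04\xc2"


def decode_comment(text: str) -> str:
    """Turn a stored comment into display text.

    Only comments that end with the marker have their literal backslash-n
    sequences converted to real line breaks; others are returned unchanged.
    """
    if text.endswith(MULTILINE_MARKER):
        body = text[:-len(MULTILINE_MARKER)]
        return body.replace("\\n", "\n")
    return text


stored = "This is the first line\\nSecond line\\nLast line\x04\xc2"
print(decode_comment(stored))
```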
The original usage of DESCRIPT.ION was to provide longer more descriptive names to 8.3 filenames; all it had was the shortname and a longer description. As you've found, others have co-opted the name with varying formats and usages. Frankly speaking, I don't think you'll find any specific commonality among the various usages.
Format is simple: FileName remainder of the line is a description of the file
https://jpsoft.com/ascii/descfile.txt
(Wayback Machine)
The descript.ion file is extensively used in the file management utility Total Commander, a shareware program available at www.ghisler.com. From version 7.5 of TC, it can have a length of 4096 bytes. I have been using it extensively to annotate my files without any issues. You may look up other users' experiences on the Total Commander users forum.
The answer above looks correct to me; just one addition:
from http://filext.com/file-extension/ION
The ION file type is primarily associated with '4DOS'. Note: Norton Utilities also uses 4DOS.
http://www.optimasc.com/products/fileid/4dos-descext.pdf
Collected links to 4DOS description-aware programs of all kind and 4DOS tools.
http://www.4dos.info/4tools.htm
http://drupal.org/node/289988

Resources