Why do line ending differ from platform to platform? Even why is there term like line ending in programming?
I prefer saving my codes in Unix/Linux format, even if I'm on Windows. Am I missing anything by not saving it in Windows or MacOS format? How does line ending effect in coding.
In the early days, when Typewriters were nearly the only way of getting output from a computer, CR and LF did different things. Unix started the tradition of using a single character to mark the end of a line, probably because it made their pipelining easier; their drivers could easily convert a single LF to CR/LF if need be. Linux is mostly a Unix clone so it keeps that convention. The others hold on to the CR/LF convention for historical reasons, even though it's not strictly necessary.
Some languages such as C, C++, and Python will let you specify the type of file when you open it, either binary or text. For text files a translation is performed so that a single LF is translated into the line ending convention required by the OS.
Basically everyone wanted to be different when creating OS's - Un*x's started with LF, then VMS and DOS wanted CR/LF (like a typewriter) and of course MAC wanted to be different so they went for CR only.
They just wanted to make it harder to transfer between OS's so that you 'bought' into one
Added because of comment
Up to the programmer - if you need to support different line endings then you must code for them. eg you could create a #define for the line ending and then have this change depending on compile options
Related
It is my understanding that by default, Character is Latin_1, Wide_Character is UCS-2, and Wide_Wide_Character is UCS-4, but that GNAT can have specified pragma Wide_Character_Encoding(UTF8); or -gnatW8 and that those characters and their strings will be UTF-8 encoded instead.
At least on Linux and FreeBSD, the results fit with my expectations. But on Windows the results are odd.
For either Wide or Wide_Wide variants, once a character moves beyond the ASCII set, I get a garbled mess. I beleive this is called emojibake by some. So I figured it was a codepage issue. After all, the default codepage in Windows, and therefore what the Console Host would load with, is 437 which isn't the UTF-8 codepage. chcp 65001 and now instead of the mess of extra characters, there's an immediate exception raised ADA.IO_EXCEPTIONS.DEVICE_ERROR : a-ztexio.adb:1295. Looking at where the exception occurred, it seems to be in the putc binding of fputc(). But this is Standard_Output, shouldn't an EOF never happen?
Is there some kind of special consideration Windows needs? How can I get UTF-8 output?
edit:
I tried piping the output into a text file. The supposed UTF-8 encoded program still generates emojibake in the file. Not sure why this would immediately throw an exception in the console though.
So then I tried directly opening and writing to a file instead of the console/pipe. Oddly this works exactly as it should. The text is completely correct.
I've never seen this kind of behavior with any other language, so it should still be possible to get proper UTF-8 at the console, right?
The deficiency so many others, not just here, describe in the Windows Console Host has either been fixed or never existed in the first place. Based on this document, I feel it was probably always very misunderstood. Windows doesn't treat the console like files, and it's easy to fall into that trap.
Using this very straight forward code, along with what Windows needs and expects behind the scenes...
It correctly produces the following, as long as either pragma Wide_Character_Encoding(UTF8); or -gnatW8 is used.
Piping the output of this test program into a file works as it should. Similarly, piping the output of this test program into another program works as it should. And also similarly, taking the file from piped output, and piping it into another program works as it should.
Full UTF-8 behavior as one would expect under Linux, on Windows.
What needs to be done is twofold. In the package initializer, the Console Host needs to be told what it's working with, which can be done like this.
Character output is then done through fputwc. According to MS Docs fputc should never be used for UNICODE on Windows, which is part of the problem GNAT has. String output and character/string input is all similar.
Based on others comments and some further research to confirm, I'm pretty sure this is a deficiency of the Windows Console Host.
edit: don't listen to this
Why documents writted in differents OSs have differents line endings characters? Are any technical reason behind this or the creators put distintinct characters just because they want.
As usual with issues of this kind, it's historical. In early days of computing when the concept of a device driver was not fully formed yet, ASCII text files were basically instructions for teleprinters (TTYs). Since the device needed to perform two movements (feed an extra line to move down and then return the carriage to the left side) to add an extra line, two characters were included in files (sometimes more characters were added to give the devices extra time to position the carriage).
Early standards by both ISO and ASA/ANSI enshrined CR+LF combination.
When Multics was written, it incorporated device drivers which could easily handle the translation of characters into instructions for a device and it was decided that LF is enough. Device drivers were mapping newline to CR followed by LF when sending instructions to the device and so users would only store LF in the text files. This was later adopted by most modern operating systems (Linux, UNIX, Mac OS X) with the exception of Windows which kept the old convention.
One fact worth noting is that these were not the only two competing conventions in the early days. For example, EBCDIC-based systems used character NEL (0x15) to indicate newline. Also, some of the ASCII based systems, used CR alone. See this wikipedia article for more details.
I understand the difference between ASCII mode vs binary when it comes to FTP, but what I don't understand is why there is even a need for ASCII mode at all? Is this just a legacy thing that used to save time by eliminating the most significant bit, therefore causing the overall speed of the transfer to increase by 1/8th? Or is there some hidden use for it that I don't know about?
I've encountered many problems because I would forget to switch the mode to bin when transferring text between different OS's. I don't understand why "bin" isn't just the default for everything, especially with today's much faster internet speeds.
Knowwutimean, Vern?
ASCII mode exists so you can get the right answer when you upload a text file to a remote system without having to know what the line termination or character set conventions are for that system. It was more important when transferring text files was more often done via FTP than, say, email.
To address your practical problem: check the documentation for both your FTP client and server(s) to see if there's a way to set ASCII mode by default. Often this is as simple as some kind of "profile" that sends some FTP commands every time you connect.
To address your philosophical problem: FTP is a 40 year old protocol that has its fair share of historical baggage. One day you'll be very glad that some protocol you depend on was standardized long ago and you can still access some old data.
I, for one, vote to eliminate ascii mode from ftp servers. Any EOL translation can be done by applications consuming the files, and many apps today understand both EOL types anyway. At a minimum, I'd like to see servers switch to using binary by default, and only use ascii if requested.
One scenario of practical use of ASCII mode is to upload PHP or Perl or similar scripts from Windows development machine to Unix server. Use of Binary mode would require separate conversion of line ending sequences, while with ASCII mode conversion is performed "automatically".
Update: there's one more scenario that we have come across - when transferring data to/from mainframes that use EBCDIC encoding, ASCII mode tells the server to perform conversion between encodings.
Here's a practical example of a problem that comes from using a binary FTP connection. In php there are two types of comments:
// a single line comment like this
/* a block comment like this */
The block comment has a start and an end. But the single line comment just ends at the end of the line.
If you upload a php file with single line comments using a binary connection, the php will stop running as soon as it hits the single line comment. It doesn't recognise the end of the line as the end of the comment, so it effectively comments out the rest of your php script.
If however you use FTP in ASCII mode, it will correctly read the end of the line and will run your php code as expected.
I am creating a utility that will store data on flat file in a specific binary format.
I want the filename extension to be specific to my application. Is there any reason other than the old 8.3 filename limit for restricting the extension to 3 characters, and if not, what is the limit? Can I have myfilename.MyExtensionSoHandsOffEverybodyElse ?
This is a hold over from the old windows 3.x/MSDOS days. Today, there are plenty of file names that have more than 3 character extensions.
If I remember correctly, Windows XP had a maximum character limit for path names (including the file name) of 255 characters.
In my experience, having seen a few non-3-character extensions I'd say that it's a matter of tradition, and you're perfectly welcome to use myfilename.MyExtensionSoHandsOffEverybodyElse.
The only good reason for doing this is if you plan to support Windows 9x. If you're only targeting XP and later, as with most projects nowdays, the 8.3 thing is irrelevant.
In fact, Windows itself stores things in long-extension filenames in Vista and later, for example, .search-ms for saved searches.
No, there isn't a good reason to limit the extension to 3 characters. However, a shorter, descriptive name is better if a user has to remember it. For example, most people know what a .html or .doc file would contain.
As long as you make a reasonable attempt to avoid naming collisions with major software there shouldn't be an issue. A corollary to that is the fact that unless you create some insanely long extension that will only ever be unique to your software (and even then, it's not guaranteed), the extension you choose will always be subject to name collision by other people's software when they choose their program's data file extension as you are doing here.
I am using a FTPS connection to send a text file
[this file will contain EDI(Electronic Data Interchange) information]to a mailbox INOVIS.I have configured the system to open a FTPS connection and using the PUT command I write the file to a folder on the FTP server.
The problem is: what mode of file transfer should I use? How do I switch between modes?
Moreover which mode is the 'best-practice' to use when transferring file over FTPS connection.
If some one can provide me a small ftp script it would be helpful.
Many of the other answers to this question are a collection of nearly correct to outright wrong information.
ASCII mode means that the file should be converted to canonical text form on the wire. Among other things this means:
NVT-ASCII character set. Even if the original file is in some other character set, such as ASCII, EBCDIC or UTF-8. Technically this disallows characters with the 8th bit set, but most implementations won't enforce this.
CRLF line endings.
EBCDIC mode means a similar set of rules, except that the data on the wire should be in EBCDIC.
LOCAL mode allows sending data with a size other than 8 bits per byte.
IMAGE (or BINARY) mode means that the data should be send without any changes. It is up to the user to ensure that the target system can understand the data once it arrives.
Among other things, this means that the recommendation to use BINARY mode to send text data will fail if one of the systems involved doesn't use a ASCII based character set.
ASCII mode changes new line characters between unix and DOS formats. \n to \r\n and viceversa.
Actually, ASCII/BINARY has nothing to do with the 8th bit. It's a convention for translating line endings.
When you are on a windows machine talking to a Unix FTP server (FTPS or FTP - doesn't matter - the protocol is the same), the server will replace any <CR><LF>-Combination with <LF> before storing the file and consequently do the translation in reverse in case you get the file from the unix server.
The idea behind ASCII mode is to convert the line endings to the respective endings of the target platform.
As todays world seems to be converging to the unix convention (<LF>) and as nearly all of todays editors (aside of notepad) can easily handle Unix-Line-Endings, the days of ASCII mode are, indeed, numbered and I would by all means recommend to always use BINARY transfer mode.
The prospect of having data altered in mid-transfer is somewhat frightening anyways.
ASCII mode also makes file sharing of text files across different platforms more straightforward for end users. They won't have to worry about the default line ending (cr/lf versus just lf for example) since the ASCII mode will do that translation for them on the fly.
For most file types you will ALWAYS want to use BINARY mode though.
ACSII mode converts text files between UNIX and Windows formats based on the server and client platform (CR/LF vs LF), Binary doesn't. Of course, if you transfer nearly anything in ASCII mode that isn't text, it will probably be corrupted for that reason.
If you want an exact copy the data use binary mode - using ascii mode will assume the data is 7bit text (chars 0-127) and truncate any data outside of this range. Dates back to arcane 7bit networking days where ascii mode could save you time.
In a globalized environment that we live in - such that it is quite common to find non-ascii characters e.g. foreign languages, currency symbols etc. - you should always use BINARY mode.
For the FTP protocol, the ASCII transfer mode will consider the 8th bit of each of your character as insignificant and will use it for error checking. As for binary transfer mode, your data will be sent as is. Note that sending binary data in ASCII mode will (almost) always end up in data corruption. However, transferring ASCII data in binary mode will work as long as the sending and receiving systems use the 8th bit in the same way (in modern system the 8th bit should stay at 0 to prevent collision with extended ASCII charsets).