File Size with Tabs vs Spaces - coding-style

Is is true that file size with Tab Indentation is much smaller than the one's with spaces?
In the article below it shows Space indented file size to be about 18% more than Tab indented files.
Source:
http://madskristensen.net/post/performance-of-tabs-vs-spaces-in-html-files

Well it makes sense. The tab is a single ascii character whereas 4 spaces are, well, 4 ascii characters. Over many indents, that can definitely add up

The analysis is flawed in that it doesn't consider the ways files are stored. The first example on the source article shows 1403 (tabs) and 1703 (spaces), but both would take up the same storage space on modern drives (4096 block size), for instance.
Of course, space indentation takes more space, but the question of "how much" is more complicated than comparing byte counts.

Related

Clarifications for FHIR R4 string element

There are a couple of things that I am having trouble with regarding HL7 FHIR R4 strings (https://www.hl7.org/fhir/datatypes.html#string):
The specification mentions: Note that strings SHALL NOT exceed 1MB (1024*1024 characters) in size. The trouble I am having with this is that 1024x1024 Unicode characters are not always 1MB in size. Besides that it is unclear to me what Unicode encoding is meant here, and I will assume the reasonable UTF-8 since that is the default for both XML and JSON. For example the character '🦁' needs 4 bytes to encode, therefore 1024x1024 of such characters would be 4MB in size. The Regex-es in the notes, though not normative, make this a bit clearer, but not much. It states that codes up to FFFF are ok, which means a max. byte use of 3 per characters which would still exceed the 1MB limit by a factor of 3. My interpretation is that we would like a reasonable limit that doesn't open up any denial-of-service attacks. Therefore I would like to suggest keeping the meaningful 1MB limit but drop the number of characters requirement OR add it as a separate requirement.
The specification mentions: Therefore strings SHOULD always contain non-whitespace content. It does not mention what it considers whitespace. Is this just the three codes mentioned earlier representing horizontal tab, carriage return and line feed or are more exotic whitespace characters also prohibited, like next line or no-break space?
Ok, that about sums up my questions about the string specifications. Hope that someone can help me out.
Best,
Dirk
The rule is clearly expressed in characters explicitly because Unicode characters have variable length. There is no maximum in bytes, only in characters (though given Unicode rules, you could calculate what the maximum possible length in bytes might be). If you feel this isn't sufficiently clear, feel free to submit a change request.
The expectation is a string SHOULD always have textual content. If you have nothing to say, omit the element. Trying to work around the "no empty string" limitation by transmitting a non-breaking space or some other non-visible character to meet the non-empty requirement while not actually conveying any human-readable information would be contrary to the intent of the specification. We don't demand that systems enforce this because trying to figure out all the creative ways implementers might have of conveying "no useful text" with Unicode isn't terribly practical. I believe the Java code just does a trim() and compares the result to empty string.

In a trie for inserting and searching normal paths, are ascii 1-31 worth considering?

I am working on a trie data structure which inserts and searches for normal paths.
A path can contain any character from unicode, so in order to represent it completely in utf-8, the array in trie needs to contain next nodes for all 256 ascii.
But I am also concerned about the space and insertion time taken by trie.
The conditions under which my trie is setup rarely would insert a character of unicode(I mean 128-255 ascii). So I just put an if condition to reject paths which contain above ascii 127. I don’t think the ascii 1-31 are relevant either, although I am unsure about this. As 1-31 chars are like carriage return, esc etc, can I simply continue the loop without inserting them? Like is it possible to encounter paths that are actually differentiable because of ascii 1-31 in a real scenario?
Answering this old question, on macOS ascii 13 is used to represent custom icons which may appear in many paths. Thanks to #EricPostpischil who told that in comments.
All other characters ranging between 1-31 appear pretty less in paths.
Also, macOS users mostly have a case-insensitive path, so generally considering both lowercase and uppercase is also useless.
PS:
Although this question seems to be opinion based, but it actually isn't because it can be answered quite concisely. It attempts to ask for frequency of appearance of characters in paths on macOS. (sorry for the confusing title, I was a noob that time, changing it now will make all comments on it absurd)

Why are text editors slow when editing very long lines?

Most text editors are slow when lines are very long. The suggested structure for data storage for text editor seems to be rope, which should be immune to long lines modification. By the way editors are even slow when simply navigating within long lines.
Example :
A single character like 0 repeated 100000 times in PSPad or 1000000 times in Vim on a single line slow the cursor moves when you are at the end of the line. If there is as much bytes in the file but dispatched on multiple lines the cursor is not slowed down at all so I suppose it's not a memory issue.
What's the origin of that issue that is so common ?
I'm mostly using Windows, so may be this is something related to Windows font handling ?
You're probably using a variable-length encoding like utf8. The editor wants to keep track of what column you're in with every cursor movement, and with a variable-length encoding there is no shortcut to scanning every byte to see how many characters there are; with a long line that's a lot of scanning.
I suspect that you will not see such a slowdown with long lines using a single-byte encoding like iso8859-1 (latin1). If you use a single-byte encoding then character length = byte length and the column can be calculated quickly with simple pointer arithmetic. A fixed-length multibyte encoding like ucs-2 should be able to use the same shortcut (just dividing by the constant character size) but the editors might not be smart enough to take advantage of that.
As evil otto suggested, line encoding can force the line to be re-parse and for long lines this causes all sorts of performance issues.
But it is not only encoding that causes the line to be re-parsed.
Tab characters also require a full line scan, since you need to parse the whole line in order to calculate the true cursor location.
Certain syntax highlighting definitions (i.e. block comments, quoted strings etc) also require a full line parse.
You mentioned vim, so I'll assume that's the editor you use. Vim does not use a rope, as described here and here. It uses an array of lines, so your assumption that ropes should be immune to such long lines does not matter because ropes are not used.

Is there a good two way hash to convert an email address to a predictable, readable, unix username?

We are working with a number of unix based filesystems, all of which share a similar set of restrictions on that certain characters can't be used in the username fields. One of those restrictions is no "#" , "_", or "." in the names. Being unix there are a number of other restrictions.
So the question is if there is a good known algorithm that can take an email address and turn that into a predictable unix filename. We would need to reverse this at some point to get the email.
I've considered doing thing like "."->"DOT", "#"->"AT", etc. But there are size limitations and other things that are generally problematic. I could also optimize by being able to map the #xyz.com part of the email to a special char or something. Each implementation would only have at most 3 domains it would need to support. I'm hoping someone has found a solution without a huge number of tradeoffs.
UPDATE:
-The two target filesystems are AFS and NFS.
-Base64 doesn't work as it has not compatible characters. "/"
-Readable is preferable.
Seems like the best answer would be to replace the #xyz.com domain to a single non-standard character, and then have a function that could shrink the first part of a name to something that fits in the username length restrictions of the various filesystems. But what is a good function for that?
You could try a modified version of the URL percent (%) encoding scheme used on for URIs.
If the percent symbol isn't allowed on your particular filesystem(s), simply replace it with a different, allowed character (and remember to encode any occurrences of that character properly).
Using this method:
mail.address#server.com
Would become:
mail%2Eaddress%40server%2Ecom
Or, if you had to substitute (for example), the letter a instead of the % symbol:
ma61ila2Ea61ddressa40servera2Ecom
Not exactly humanly-readable perhaps, but easily enough processed through an encoding algorithm. For the best space efficiency, your escape character should be a character allowed by the filesystem, yet one that is not likely to appear frequently in an address.
This encoding scheme has the advantage that there is no size increase for most normal characters. The string length will ONLY go up for characters not supported by the filesystem.
Check out base64. Encoding and decoding is well defined.
I'd prefer this over rolling my own format any day.
Hmm, from your question I'm not totally clear on this point, but since you wanted some conversion I'm assuming that you want something that is at least human readable?
Each OS may have different restrictions, but are you close enough to the platforms that you would be able to find out/test what is acceptable in a username? If you could find three 'special' characters that you could use just to do a replace on '#', '.', '_' you would be good to go. (Is that comprehensive? if not you would need to make sure you know all of them otherwise you could clash.) I searched a bit trying to find whether there was a POSIX standard, but wasn't able to find anything, so that's why I think if you can just test what's valid that would be the most direct route.
With even one special character, you could do URL encoding, either with '%' if it's available, or whatever you choose if not, say '!", then { '#'->'!40", '_'->'!5F', '.'-> '!2E' }. (The spec [RFC1738] http://www.rfc-editor.org/rfc/rfc1738.txt) defines the characters as US-ASCII so you can just find a table, e.g. in wikipedia's ASCII article and look up the correct hex digits there.) Or, you could just do your own simple mapping since you don't need the whole ASCII set, you could just do a map with two characters per escaped character and have, say, '!a','!u','!p' for at, underscore, period.
If you have two special characters, say, '%', and '!', you could delimit text that represents the character, say, %at!, &us!, and '&pd!'. (This is pretty much html-style encoding, but instead of '&' and ';' you are using the available ones, and you're making up your own mnemonics.) Another idea is that you could use runs of a symbol to determine the translated character, where each new character flops which symbol is being used. (This conveniently stops the run if we need to put two of the disallowed characters next to each other.) So assume '%' and '!', with period being 1, underscore 2, and at-sign being three, 'mickey._sample_#fake.out' would become 'mickey%!!sample%%!!!fake%out'. There are other variations but this one is easy to code.
If none of this is an option (e.g. no symbols at all, just [a-zA-Z0-9]), then really I think the Base64 answer sounds about right. Really once we're getting to anything other than a simple replacement (and even that) it's already getting hard to type if that's the goal. But if you really need to try to keep the email mostly readable, what you do is implement some sort of escaping. I'm thinking use '0' as your escape character, so now '0' becomes '00', '#' becomes '01', '.' becomes '02', and '_' becomes '03'. So now, 'mickey01._sample_#fake.out'would become 'mickey0010203sample0301fake02out'. Not beautiful but it should work; since we escaped any raw 0's, just always make sure you define a mapping for whatever you choose as your escape char and you should be fine..
That's all I can think of atm. :) Definitely if there's no need for these usernames to be readable in the raw it seems like apparently Base64 won't work, since it can produce slashes. Heck, ok, just the 2-digit US-ASCII hex value for each character and you're done...] is a good way to go; there's lots of nice debugged, heavily field-tested code out there for it and it solves your problem quite handily. :)
Given...
- the limited set of characters allowed in various file systems
- the desire to keep the encoded email address short (both for human readability and for possible concerns with file system limitations)
...a possible approach may be a two steps encoding logic whereby the email is
first compressed using a lossless compression algorithm such as Lempel-Ziv, effectively turning it into a "binary" form, stored in a shorter array of bytes
then this array of bytes is encoded using a Base64-like algorithm
The idea is to minimize the size of the binary representation, so that the expansion associated with the storage inefficiency of the encoding -which can only store roughly 6 bits (and probably a bit less) per character-, doesn't cause the encoded string to be too long.
Without getting overly sophisticated for the compression nor the encoding, such a system would likely produce encoded strings that are maybe 4/5 of the input string size (the email address): the compression should easily half the size, but the encoding, say Base32, would grow the binary form size by 8/5.
Efforts in improving the compression ratio may allow the selection of more "wasteful" encoding schemes (with smaller character sets) and this may help making the output more human-readable and also more broadly safe on various flavors of file systems. For example whereby a Base64 seems optimal. space-wise, using only uppercase letter (base 26) may ensure portability of the underlying scheme to file systems where the file names are not case sensitive.
Another benefit of the initial generic compression is that few, if any, assumptions need to be made about the syntax of valid input key (email addresses here).
Ideas for compression:
LZ seems like a good choice, 'though one may consider primin its initial buffer with common patterns found in email addresses (example ".com" or even "a.com", "b.com" etc.). This initial buffer would ensure several instances of "citations" per compressed email address, hence a better compression ratio overall). To further squeeze a few bytes, maybe LZH or other LZ-variations could be used.
Aside from the priming of the buffer mentioned above, another customization may be to use a shorter buffer than typical LZ algorithms, since the string we have to compress (email address instances) are themselves very short and would not benefit from say a 512 bytes buffer. (Shorter buffer sizes allow shorter codes for the citations)
Ideas for encoding:
Base64 is not suitable as-is because of the slash (/), plus (+) and equal (=) characters. Alternate characters could be used to replace these; dash (-) comes to mind, but finding three charcters, allowed by all "flavors" of the targeted file systems may be a stretch.
Never the less, Base64 and its 4 output characters per 3 payload bytes ratio provide what is probably the barely achievable upper limit of storage efficiency [for an acceptable character set].
At the lower end of this efficiency, is maybe an ASCII representation of the Hexadeciamal values of the bytes in the array. This format with a doubling of the payload bytes may be acceptable, length-wise, and is interesting because of its simplicity (there is a direct and simple relation between each nibble (4 bits) in the input and characters in the encoded string.
Base32 whereby A thru Z encode 0 thru 25 and 0 thru 5 encode 26 thru 31, respectively, essentially variation of Base64 with an 8 output characters per 5 payload bytes ratio may be a very viable compromise.

Why is the default terminal width 80 characters?

80 seems to be the default in many different environments and I'm looking for a technical or historical reason. It is common knowledge that lines of code shouldn't exceed 80 characters, but I'm hard pressed to find a reason why outside of "some people might get annoyed."
As per Wikipedia:
80 chars per line is historically descended from punched
cards
and later broadly used in monitor text mode
source: http://en.wikipedia.org/wiki/Characters_per_line
Shall I still use 80 CPL?
Many developers argue to use 80 CPL even if you could use more. Quoting from: http://richarddingwall.name/2008/05/31/is-the-80-character-line-limit-still-relevant/
Long lines that span too far across the monitor are hard to read. This
is typography 101. The shorter your line lengths, the less your eye
has to travel to see it.
If your code is narrow enough, you can fit two files on screen, side
by side, at the same time. This can be very useful if you’re comparing
files, or watching your application run side-by-side with a debugger
in real time.
Plus, if you write code 80 columns wide, you can relax knowing that
your code will be readable and maintainable on more-or-less any
computer in the world.
Another nice side effect is that snippets of narrow code are much
easier to embed into documents or blog posts.
As a Vim user, I keep ColorColumn=80 in my ~/.vimrc. If I remember correctly, Eclipse autoformat CtrlShiftF, breaks lines at 80 chars by default.
It is because IBM punch cards were 80 characters wide.
Your computer probably doesn't have a punch card reader, but it probably does have lpr(1) which follows the convention set by IBM for punch cards. The lpr(1) command defaults to Courier font with margins set for 80-columns and 8-space tabs for plain text files on 8.5"x11" paper. Try cat foo.c | lpr and if the author of foo.c used conventional line width and source code formatting rules, then the printed page will look mostly readable. Otherwise, best not to kill the trees.
One of the characteristics of good typography is properly set measure - length of the line of characters.
There is an optimum width for a Measure and that is defined by the amount of characters are in the line. A general good rule of thumb is 2-3 alphabets in length, or 52-78 characters (including spaces).
It simply makes sense to make your text readable.
See Five simple steps to better typography for more info.
If I remember correctly the old dot matrix printers were only able to print 80 characters across. I am pretty sure my old commodore 64 and 128 had the same 80 characters, now that I think about it, I don't think the monitor could display more than 80 characters either
The LA30 was a 30 character/second dot
matrix printer introduced in 1970 by
Digital Equipment Corporation of
Maynard, Massachusetts. It printed 80
columns of uppercase-only 5x7 dot
matrix characters across a
unique-sized paper.
http://en.wikipedia.org/wiki/Dot_matrix_printer
possibly due to screen resolution of 640 pixels..
each character is or was 8 pixels wide giving you 640 (80x8)

Resources