Related
Something that troubles me when I think of is how incoming data is interpreted in computers. I searched a lot but could not find an answer so as a last resort I am asking in here. What I am saying is that you plug in a USB to your computer and data stream starts. Your computer receives ones and zeros from the USB and interpret them correctly like for example inside of the USB there are pictures with different names and different formats and resolutions. What I do not understand is how computer correctly puts them together and the big picture emerges. This could be seen as a stupid question but had me thinking for a while. How does this system work?
I am not a computer scientist but I am studying Electrical and electronics engineering and know somethings.
It is all just streams of ones and zeros, which get counted up into bytes. As you probably know one can multiplex them, but with modern hardware that isn't very necessary (the 's' in USB standing for 'serial)
A pure black and white image of an "A" would be a 2d array:
111
101
111
101
101
3x5 font
I would guess that "A" is stored in a font file as 111101111101101, with a known length of 3*5=15 bits.
When displayed in a window, that A would be broken down into lines, and inserted on the respective line of the window, becoming a stream which contains 320x256 pixels perhaps.
When the length of data is not constant, it can:
If there is a max size, could be the size of the max size (integers and other primitive data types do this, a 0 takes 32/64 bits, as does 400123)
A length is included somewhere, often a sort of "header"
It gets chunked up into either constant or variable sized chunks, and has a continue bit (UTF-8 is a good simple example of constant chunks, some networking protocols (maybe TCP/IP) are a good example of variable chunks)
Both sides need to know how to decode the data, in your example of a USB stick with an image on it. The operating system has a driver which understands the UUID is a storage device, and attempts to read special sectors from it. If it detects a partition type it recognizes (for windows that would be NTFS or FAT32), it will then load the file tables, using drivers that understand how to decode those. It finds a filename allows access via the filename. Then an image reading program is able to load the bytestream of that file and decode it using its headers and installed codecs into a raster image array. If any of those pieces are not available in your system, you cannot view the image, and it will be just any random binary to you (if you format the usb stick with Linux, or use a uncommon/old image format)
So its all various level of explicit or implicit handshakes to agree on what the data is when you get to the higher levels (higher level being at least once you agree on endianness and baudrate of data transmission)
Can anyone, in simple terms, explain what does "Syntax Directed Translation" mean? I started to read the topic from Dragon Book but couldn't understand. The Wiki article didn't help either.
In simplest terms, 'Syntax Directed Translation' means driving the entire compilation (translation) process with the syntax recognizer (the parser).
Conceptually, the process of compiling a program (translating it from source code to machine code) starts with a parser that produces a parse tree, and then transforms that parse tree through a sequence of tree or graph transformations, each of which is largely independent, resulting in a final simplified tree or graph that is traversed to produce machine code.
This view, while nice in theory, has a drawback that if you try to implement it directly, enough memory to hold at least two copies of the entire tree or graph is needed. Back when the Dragon Book was written (and when a lot of this theory was hashed out), computer memories were measured in kilobytes, and 64K was a lot. So compiling large programs could be tricky.
With Syntax Directed Translation, you organize all of the graph transformations around the order in which the parser recognizes the parse tree. Instead of producing a complete parse tree, your parser builds little bits of it, and then feeds those bits to the subsequent passes of the compiler, ultimately producing a small piece of machine code, before continuing the parsing process to build the next piece of parse tree. Since only small amounts of the parse tree (or the subsequent graphs) exist at any time, much less memory is required. Since the syntax recognizer is the master sequencer controlling all of this (deciding the order in which things happen), this is called Syntax Directed Translation.
Since this is such an effective way of keeping down memory use, people even redesigned languages to make it easier to do -- the ideal being to have a "Single Pass" compiler that could in fact do the entire process from parsing to machine code generation in a single pass.
Nowadays, memory is not at such a premium, so there's less pressure to force everything into a single pass. Instead you generally use Syntax Direct Translation just for the front end, parsing the syntax, doing typechecking and other semantic checks, and a few simple transformations all from the parser and producing some internal form (three address code, trees, or dags of some kind) and then having separate optimization and back end passes that are independent (and so not syntax directed). Even in this case you might claim that these later passes are at least partly syntax directed, as the compiler may be organized to operate on large pieces of the input (such as entire functions or modules), pushing through all the passes before continuing with the next piece of input.
Tools like yacc are designed around the idea of Syntax Directed Translation -- the tool produces a syntax recognizer that directly runs fragments of code ('actions' in the tool parlance) as productions (fragments of the parse tree) are recognized, without ever creating an actual 'tree'. These actions can directly invoke what are logically later passes in the compiler, and then return to continue parsing. The imperative main loop that drives all of this is the parser's token reading state machine.
Actually No. Historically before the Dragon Book there were syntax directed compilers. Attending ACM SEGPlan meeting in the late 1960's I learned of several types of directed translation. Tree directed and graph directed translation were also discussed. I think these got muddled together in the Dragon Book though I have never owned the Dragon Book. My favorite book was Programming Systems and Languages by Saul Rosen. It is a collection of papers on compilers, operating systems and computer systems. I'll try to explain the early syntax directed compiler parser programming languages. The later ones producing trees were combined with tree directed code generating languages.
Early syntax directed compilers, translated source directly to stack machine code. The Borrows B5000 ALGOL compiler is an example.
A*(B+C) -> A,B,C,ADD,MPY
Schorre's META II domain specific parser programming language, compiler compiler, developed in the 1960s is an example of a syntax directed compiler. You can find the original META II paper in the ACM archive. META II avoids left recursion using $ postfix zero or more sequence operator and ( ) grouping.
EXPR = TERM $('+' TERM .OUT 'ADD'|'-' TERM .OUT 'SUB');
Later Schorre based metalanguage compilers translated to trees using stack based tree transformation operators :<node name> and !<number>.
EXPR = TERM $(('+':ADD|'-':SUB) TERM!2);
Except for TREEMETA that used [<number>] instead of !<number>. The above EXPR formula is basically the same as the META II EXPR except we have factored operators + and - recognition creating corresponding nodes and pushing the node onto the node stack. Then on recognizing the right TERM the tree constructor !2 creates a tree popping the top 2 parse stack <TERM>s and top node from the node stack to form a tree:
ADD or SUB
/ \ / \
TERM TERM TERM TERM
Tokens were recognized by supplied recognizers .ID .NUMBER and .STRING. Later replaced by token ".." and character class ":" formula in CWIC:
id .. let $(leter|dgt|+'_');
Tree directed compiler languages were combined with the syntax directed compilers to generate code. The CWIC compiler compiler developed at Systems Development Corporation included a LISP 2 based tree directed generator language. A short paper in CWIC can be found in the ACM archives.
In the parser programming languages you are programming a type of recursive decent parser. When you get to CWIC all the problems that today are attributed to recursive decent parsers were eliminated. There is no left recursion problem as the $ zero or more construct and programed tree construction eliminated the need of left recursion. You control the tree construction. A loop construct is used to produces a left handed tree and tail recursion a right handed tree. Though parsing formulas may generate no tree at all:
program = $declarations;
In the above the $ zero or more loop operator preceding declarations specifies that declarations is to be repeatably called as long as it returns success. The input source code being compiled is made up of any positive number of declarations. The declarations formula would then define the types of declarations. You might need external linkages declarations, data declarations, function or procedure code declarations.
declarations = linkage_decl | data_decl | code_decl;
The types of declarations each being a separate formula. The syntax language controls when semantic processing and code generation occurs. The program and declarations formulas above do not produce trees. They are simply controlling when and what language structure are parsed. These are neither LL oe LR parser sears. The provide unlimited (limited only by available memory) programed backtracking. They provide programed look ahead and peak ahead tests.
As a last example the following example including token and character class formula illustrates producing both left and right handed trees. Specifically exponentiation using tail recursion.
assign = id '=' expr ';' :ASSIGN!2 arith_gen[*1];
expr = term $(('+':ADD | '-':SUB) term !2);
term = factor $(('*':MPY | '//' :REM | '/':DIV) factor!2);
factor = ( id ('(' +[ arg $(',' arg ]+ ')' :CALL!2 | .EMPTY)
| number
| '(' expr ')'
) ('^' factor:EXP!2 | .EMPTY);
bin: '0'|'1';
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';
dgt: oct|'8'|'9';
hex: dgt|'A'|'B'|'C'|'D'|'E'|'F'|'a'|'b'|'c'|'d'|'e'|'f';
upr: 'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'|
'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z';
lwr: 'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|
'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z';
alpha: upr|lwr;
alphanum: alpha|dgt;
number .. dgt $dgt MAKENUM[];
id .. alpha $(alphanum|+'_');
The problem is the implementing a prefix tree (Trie) in functional language without using any storage and iterative method.
I am trying to solve this problem. How should I approach this problem ? Can you give me exact algorithm or link which shows already implemented one in any functional language?
Why I am trying to do => creating a simple search engine with an feature of
adding word to tree
searching a word in tree
deleting a word in tree
Why I want to use functional language => I want improve my problem-solving ability a bit further.
NOTE : Since it is my hobby project, I will first implement basic features.
EDIT:
i.) What I mean about "without using storage" => I don't want use variable storage ( ex int a ), reference to a variable, array . I want calculate the result by recursively then showing result to the screen.
ii.) I have wrote some line but then I have erased because what I wrote is made me angry. Sorry for not showing my effort.
Take a look at haskell's Data.IntMap. It is purely functional implementation of
Patricia trie and it's source is quite readable.
bytestring-trie package extends this approach to ByteStrings
There is accompanying paper Fast Mergeable Integer Maps which is also readable and through. It describes implementation step-by-step: from binary tries to big-endian patricia trees.
Here is little extract from the paper.
At its simplest, a binary trie is a complete binary tree of depth
equal to the number of bits in the keys, where each leaf is either
empty, indicating that the corresponding key is unbound, or full, in
which case it contains the data to which the corresponding key is
bound. This style of trie might be represented in Standard ML as
datatype 'a Dict =
Empty
| Lf of 'a
| Br of 'a Dict * 'a Dict
To lookup a value in a binary trie, we simply read the bits of the
key, going left or right as directed, until we reach a leaf.
fun lookup (k, Empty) = NONE
| lookup (k, Lf x) = SOME x
| lookup (k, Br (t0,t1)) =
if even k then lookup (k div 2, t0)
else lookup (k div 2, t1)
The key point in immutable data structure implementations is sharing of both data and structure. To update an object you should create new version of it with the most possible number of shared nodes. Concretely for tries following approach may be used.
Consider such a trie (from Wikipedia):
Imagine that you haven't added word "inn" yet, but you already have word "in". To add "inn" you have to create new instance of the whole trie with "inn" added. However, you are not forced to copy the whole thing - you can create only new instance of the root node (this without label) and the right banch. New root node will point to new right banch, but to old other branches, so with each update most of the structure is shared with the previous state.
However, your keys may be quite long, so recreating the whole branch each time is still both time and space consuming. To lessen this effect, you may share structure inside one node too. Normally each node is a vector or map of all possible outcomes (e.g. in a picture node with label "te" has 3 outcomes - "a", "d" and "n"). There are plenty of implementations for immutable maps (Scala, Clojure, see their repositories for more examples) and Clojure also has excellent implementation of an immutable vector (which is actually a tree).
All operations on creating, updating and searching resulting tries may be implemented recursively without any mutable state.
I was wondering, what is the longest possible name length allowed by the Windows kernel?
E.g.: I know the kernel uses UNICODE_STRING structures to hold all object paths, and since the byte length of a wide-character string is stored inside a USHORT, that allows for a maximum path length of 2^15 - 1 characters. Is there a similar, hard restriction on a file name (rather than path)? (I don't care if NTFS or FAT32 imposes a particular restriction; I'm looking for the longest possible theoretically allowed name in the kernel, assuming no additional file system or shell restrictions.)
(Edit: For those wondering why this even matters, consider that normally, traversing a directory is achieved by FindFirstFile/FindNextFile calls, one call per file. Given the function named NtQueryDirectoryFile, which is the underlying system call and which returns multiple file names per call, it's actually possible to take advantage of this maximum-length restriction on the path to make an extremely-fast directory traverser that uses solely the stack as a buffer. Now I'm trying to extend that concept, and I need to know the maximum size of a file name.)
The maximum length of a path is 32,767 characters whereby each path component (directory or file) can have a maximum length of 255 characters (to be more exact, the value returned in the lpMaximumComponentLength parameter of the GetVolumeInformation function).
This is documented on MSDN.
Ah, I found this page myself that guarantees that file names can't be longer than 255 characters:
A pathname MUST be no more than 32,760 characters in length.
...
Each pathname component MUST be no more than 255 characters in length.
Which makes me wonder:
Why does Windows use ULONGs for file name lengths, when it uses USHORTs for path lengths?!
If anyone knows why this is, please post/comment! I'm rather curious. :)
I've implemented file compression using huffman's algorithm, but the problem I have is that to enable decompression of the compressed file, the coding tree used, or the codes itself should be written to the file too. The question is: how do i do that? What is the best way to write the coding tree at the beggining of the compressed file?
There's a pretty standard implementation of Huffman Coding in the Basic Compression Library (BCL), including a recursive function that writes the tree out to a file. Look at huffman.C. It just writes out the leaves in order so the decoder can reconstruct the same tree.
BCL is also nice because there are some other pretty straightforward compression algorithm pieces in there, too. It's quite handy if you need to roll your own algorithm.
First off, have you considered using a standard compression Stream (like GZipStream in .net) ?
About how/where to write your data, you can manipulate a Streams position with Seek (even reserve space that way). If you know the size of the Tree ahead of time you can start writing after that position. But you may want to position the coding tree after the actual data, and just make sure you know where that starts. Ie, reserve a little space in front, write the compressed data, record the position, write the tree, go to the front and write out the position.
Assuming you compress on 8-bit symbols (i.e. bytes) and the algorithm is non-adaptive, the simplest way would be to store not the tree but the distribution of the values. For example by storing how often you found byte 0, how often byte 1, ..., how often byte 255. Then when reading back the file you can re-assemble the tree. This is the simplest solution, but requires the most storage space (e.g. to cover large files, you would need 4 bytes per value, i.e. 1kb).
You could optimize this by not storing exactly how often each byte was found in the file, but instead normalizing the values to 0..255 (0 = found least, ...), in which case you would only need to save 256 bytes. Re-assembling of the tree based on these values would result in the same tree. (This is not going to work as pointed out by Edmund and in question 759707 - see there for further links and answers to your question)
P.S.: And as Henk said, using seek() allows you to keep space at the beginning of the file to store the values in later.
Most implementations are using canonical huffman encoding. You have only to store the symbol lengths in a compact way. Hier an implementation: shcodec.
Another way is using a semi-static huffman encoding (periodic rescale), then you have not to store any tree.
Instead of writing the code tree to the file, write how often each character was found, so the decompression program can generate the same tree.
The most naive solution would be to parse the compression tree in pre-order and write the 256 values in the header of your file.
Since every node in a huffman tree is either a branch with two children, or a leaf, you can use a single bit to represent each node unambiguously. For a leaf, follow immediately with the 8 bits for that node.
e.g. for this tree:
/\
/\ A
B /\
C D
You could store 001[B]01[C]1[D]1[A]
(Turns out this is exactly what happens in the huffman.c example posted earlier, but not how it was described above).
it is better to send the frequencies of characters and build the tree at the receiving end. This data will be of constant size for a fixed alphabet. I guess this must be serializable and put in the file. Sending the tree depends on its implementation, for what I have tried, an array based approach leads to more memory left unused for the tree since, the tree may not be a balanced tree most of the time. If the tree was balanced then array representation would have generated the best option.
Harisankar Krishna swamy
Did you try adaptive Huffman coding? From first look it seems the tree need not be sent at all, but more work to optimize and synchronize the tress.