How data is interpreted in computers

Something that troubles me is how incoming data is interpreted in computers. I searched a lot but could not find an answer, so as a last resort I am asking here. Say you plug a USB stick into your computer and a data stream starts. Your computer receives ones and zeros from the USB device and interprets them correctly: for example, the stick contains pictures with different names, different formats and different resolutions. What I do not understand is how the computer correctly puts them together so that the big picture emerges. This could be seen as a stupid question, but it has had me thinking for a while. How does this system work?
I am not a computer scientist, but I am studying electrical and electronics engineering and know some things.

It is all just streams of ones and zeros, which get grouped into bytes. As you probably know, one can multiplex them, but with modern hardware that isn't really necessary (the "S" in USB stands for "serial").
A pure black-and-white image of an "A" would be a 2D array:
111
101
111
101
101
3x5 font
I would guess that "A" is stored in a font file as 111101111101101, with a known length of 3*5=15 bits.
When displayed in a window, that "A" would be broken down into lines and inserted on the respective lines of the window, becoming part of a stream which contains 320x256 pixels, perhaps.
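To make that concrete, here is a minimal sketch in Go; the 15-bit packing and the 3x5 layout are assumptions for illustration, not how any real font format works. It unpacks the bit string above back into rows of pixels:

    package main

    import "fmt"

    func main() {
        // Hypothetical 3x5 glyph, packed row-major into 15 bits:
        // rows 111, 101, 111, 101, 101 as in the example above.
        const glyph = 0b111101111101101
        const w, h = 3, 5
        for row := 0; row < h; row++ {
            for col := 0; col < w; col++ {
                // Bit 14 is the top-left pixel, bit 0 the bottom-right.
                if glyph>>(14-(row*w+col))&1 == 1 {
                    fmt.Print("#")
                } else {
                    fmt.Print(".")
                }
            }
            fmt.Println()
        }
    }

Run it and the "A" shape reappears. Nothing in those 15 bits says "image": it's the agreed width, height and bit order that let the receiver reconstruct the picture.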
When the length of the data is not constant, there are a few common options (the second one is sketched after this list):
If there is a maximum size, the field can simply always occupy that maximum size (integers and other primitive data types do this: a 0 takes 32/64 bits, just as 400123 does).
A length is included somewhere, often in a sort of "header".
The data gets chunked up into constant- or variable-sized chunks carrying a continuation bit (UTF-8 is a good simple example of constant chunks; some networking protocols, TCP/IP perhaps, are a good example of variable chunks).
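Here's a toy sketch of the header option in Go; the 4-byte big-endian length field is an arbitrary choice for illustration:

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "io"
    )

    // writeFrame prefixes the payload with its length as a 4-byte
    // big-endian integer so a reader knows where the frame ends.
    func writeFrame(w *bytes.Buffer, payload []byte) {
        binary.Write(w, binary.BigEndian, uint32(len(payload)))
        w.Write(payload)
    }

    // readFrame reads one length header, then exactly that many bytes.
    func readFrame(r io.Reader) ([]byte, error) {
        var n uint32
        if err := binary.Read(r, binary.BigEndian, &n); err != nil {
            return nil, err
        }
        payload := make([]byte, n)
        _, err := io.ReadFull(r, payload)
        return payload, err
    }

    func main() {
        var stream bytes.Buffer
        writeFrame(&stream, []byte("short"))
        writeFrame(&stream, []byte("a much longer, variable-length payload"))
        for {
            p, err := readFrame(&stream)
            if err != nil {
                break // io.EOF: no more frames
            }
            fmt.Printf("%q\n", p)
        }
    }

The reader never guesses where a payload ends; the header told it, which is exactly the agreement the two sides need.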
Both sides need to know how to decode the data. Take your example of a USB stick with an image on it: the operating system has a driver which recognises the device as a storage device and attempts to read special sectors from it. If it detects a partition type it recognises (for Windows that would be NTFS or FAT32), it loads the file tables, using drivers that understand how to decode those. It finds the filenames and allows access via them. Then an image-viewing program is able to load the byte stream of that file and decode it, using its headers and installed codecs, into a raster image array. If any of those pieces are not available on your system, you cannot view the image, and it will be just random binary data to you (say, if you format the USB stick with a Linux filesystem, or use an uncommon/old image format).
So it's all various levels of explicit or implicit handshakes to agree on what the data is once you get to the higher levels ("higher level" being anything after you've at least agreed on the endianness and baud rate of the data transmission).

Related

Hex - Search by bytes to get Offset & Search by Offset to get Bytes

I'm currently prototyping a little piece of software and I'm stuck. I'm trying to create a little program that'll edit a .bin file, and for this I will need to do the following:
Get Bytes by Searching for Offset
Get Offset by searching for Bytes
Write/Update .bin file
I usually use the program HxD to do this manually, but want to get a small automated process in place.
Using hex.EncodeToString returns what I want as the output (like HxD), however I can't find a way to search for values by bytes and offsets.
Could anyone help or have suggestions?
OK, "searching of an offset" is a misnomer because if you have an offset and a medium which supports random access, you just "seek" the known offset there; for files, see os.File.Seek.
Searching is more complex: it consists of converting the user input into something searchable and, well, the searching itself.
Conversion is the process of translating the human operator's input into a slice of bytes: for instance, you'd need to convert the string "00 87" to the byte slice []byte{0x00, 0x87}.
Such conversion can be done using, say, encoding/hex.Decode after removing any whitespace, which can be done in a multitude of ways.
Searching the file given a slice of bytes can be either simple or complex.
If the file is small (a couple of megabytes, on today's hardware), you can just slurp it into memory (for instance, using io.ReadAll) and do a simple search using bytes.Index.
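Putting the conversion and the small-file search together (a sketch; the file name and the pattern are placeholders):

    package main

    import (
        "bytes"
        "encoding/hex"
        "fmt"
        "io"
        "os"
        "strings"
    )

    func main() {
        // Convert HxD-style input such as "00 87" into a byte slice.
        needle, err := hex.DecodeString(strings.ReplaceAll("00 87", " ", ""))
        if err != nil {
            panic(err)
        }

        f, err := os.Open("data.bin") // hypothetical file
        if err != nil {
            panic(err)
        }
        defer f.Close()

        data, err := io.ReadAll(f) // slurp the whole file
        if err != nil {
            panic(err)
        }

        // "Get offset by searching for bytes".
        if off := bytes.Index(data, needle); off >= 0 {
            fmt.Printf("found at offset 0x%X\n", off)
        } else {
            fmt.Println("not found")
        }
    }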
If a file is big, the complexity of the task quickly escalates.
For instance, you could read the file from its beginning to its end using chunks of some sensible size and search for your byte slice in each of them.
But you'd need to watch out for two issues: the slice to search for should be smaller than each such chunk, and two adjacent chunks might contain the sought sequence positioned right across their boundary, so that the Nth chunk contains the first part of the pattern at its end and the (N+1)th chunk contains the rest of it at its beginning.
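Here's a sketch of that chunked search in Go, keeping len(pattern)-1 bytes of overlap between chunks so a boundary-straddling match is still found (it assumes a non-empty pattern shorter than the chunk size):

    package main

    import (
        "bytes"
        "fmt"
        "io"
        "strings"
    )

    // findInStream returns the offset of the first occurrence of
    // pattern in r, or -1, reading chunkSize bytes at a time.
    func findInStream(r io.Reader, pattern []byte, chunkSize int) (int64, error) {
        buf := make([]byte, 0, chunkSize+len(pattern))
        chunk := make([]byte, chunkSize)
        var base int64 // file offset of buf[0]
        for {
            n, err := r.Read(chunk)
            if n > 0 {
                buf = append(buf, chunk[:n]...)
                if i := bytes.Index(buf, pattern); i >= 0 {
                    return base + int64(i), nil
                }
                // Keep only the tail that could still start a match.
                if keep := len(pattern) - 1; len(buf) > keep {
                    base += int64(len(buf) - keep)
                    copy(buf, buf[len(buf)-keep:])
                    buf = buf[:keep]
                }
            }
            if err == io.EOF {
                return -1, nil
            }
            if err != nil {
                return -1, err
            }
        }
    }

    func main() {
        // The pattern sits right across the first 8-byte chunk boundary.
        r := strings.NewReader("....needle....")
        off, _ := findInStream(r, []byte("needle"), 8)
        fmt.Println(off) // 4
    }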
There exist more advanced approaches to such searching, for instance using so-called "memory-mapped files", but I'd speculate it's a bit too early to tread those lands, given your question.

Why do people sometimes convert numbers or strings to bytes?

Sometimes I encounter questions about converting something to bytes. Are there cases where it is vitally important to convert to bytes, and what might I convert something to bytes for?
In most languages the most common string functions come as part of the language, or in a pre-made library/include/import, often employing object code to take advantage of processor-level string instructions. However, sometimes you need to do something with a string that isn't natively supported by the language. So, since the 8-bit days, people have viewed strings as arrays of 7- or 8-bit characters, each fitting within a byte, and used conventions like ASCII to decide which byte value represents which character.
While standard languages often have functions like string.replaceChar(OFFSET, 'a'), this methodology can be painstakingly slow, because each call to the replaceChar method incurs processing overhead that may be greater than the processing that actually needs to be done.
There is also the simplicity factor when designing your own string algorithms, but like I said, most of the common algorithms come pre-built in modern languages (stringCompare, trimString, reverseString, etc.).
Suppose you want to perform an operation on a string which doesn't come as standard.
Suppose you want to add two numbers which are represented as decimal digits in strings, and these numbers are larger than the 64-bit word size of the processor. The RSA encryption/decryption behind the SSL browser padlock employs numbers which don't fit into the word size of a desktop computer, yet the programs on a desktop which deal with RSA certificates and keys must nonetheless be able to process that data, which actually arrives as strings.
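In Go, for example, math/big does exactly this kind of work: the decimal strings are converted into byte-backed big integers, added, and rendered back as a string (the numbers here are arbitrary):

    package main

    import (
        "fmt"
        "math/big"
    )

    func main() {
        // Two decimal numbers wider than any 64-bit register, as strings.
        a, _ := new(big.Int).SetString("123456789012345678901234567890", 10)
        b, _ := new(big.Int).SetString("987654321098765432109876543210", 10)
        sum := new(big.Int).Add(a, b)
        fmt.Println(sum) // 111111111011111111011111111100
    }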
There are many and varied reasons you might want to deal with a string as an array of bytes, but each of these reasons is fairly specialised.

Most efficient barcode to store a GUID

I have a system that I'm working on at the moment that requires users to log in, and the client wants to use a barcode scanner and cards to keep prices down. (Yes, a username and password would be cheaper, but she wants a card-type solution, so she gets one.)
All my data uses GUIDs as key fields, so I'd like to store the GUID directly on the card in the barcode. While it's simple enough to encode it as Code 39 ("3 of 9"), that's not going to be the most efficient use of space.
Is there a best practice or most efficient method for storing GUIDs in a barcode? I'd have assumed that, since the data has a consistent length and depth, there would be a standard, but I can't find one. It would be easy enough to roll my own (a control char at either end and binary data in between), but I'd like something that standard readers will know how to interpret.
Any help gratefully received.
There are no open standards for special-purpose data compaction with generic linear barcodes such as Code 39 and Code 128. Most ISO/IEC-standardised 2D barcodes do support a special-purpose data encoding mechanism called Extended Channel Interpretation (ECI), which allows you to specify that data conforms to a certain application standard or encoding regime, for example ECI 298765 for IPv4 address compaction [*]. Unfortunately, GUID compaction isn't amongst those that have been registered, and even if it were, you would nevertheless need to handle this within your application, as reader support would be lacking.
That leaves you with having to pre-encode (and subsequently decode) the GUID into a format that can be handled efficiently by some ubiquitous barcode symbology.
An efficient way to store a GUID would be to convert it to a 40-digit[†] decimal representation and store the result in a Code 128 barcode using double-density numeric compression ("Mode C").
For example, consider the GUID:
cd171f7c-560d-4a62-8d65-16b87419a58c
Expressed as a hexadecimal number:
0xCD171F7C560D4A628D6516B87419A58C
Converted to 40 decimal digits:
0272611800569275698104677545117639878028
Encoded within a Code 128 barcode (Mode C): [barcode image]
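A sketch of the conversion in Go (math/big does the radix work; the round trip at the end is what your application would do at scan time):

    package main

    import (
        "fmt"
        "math/big"
        "strings"
    )

    func main() {
        guid := "cd171f7c-560d-4a62-8d65-16b87419a58c"

        // Strip the hyphens and parse the 32 hex digits as one
        // 128-bit integer.
        n, ok := new(big.Int).SetString(strings.ReplaceAll(guid, "-", ""), 16)
        if !ok {
            panic("not a valid GUID")
        }

        // Zero-pad to 40 digits: 2^128-1 needs at most 39, and an even
        // digit count suits Code 128's double-density Mode C.
        dec := fmt.Sprintf("%040d", n)
        fmt.Println(dec) // 0272611800569275698104677545117639878028

        // Reverse the process after scanning.
        m, _ := new(big.Int).SetString(dec, 10)
        fmt.Printf("%032x\n", m) // cd171f7c560d4a628d6516b87419a58c
    }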
Your application would of course need to recognise this input as a decimal-encoded GUID and reverse the above process, but I doubt that a significantly more efficient approach exists that doesn't require you to transform the data into an unusual radix and then deal with the complexities of handling ASCII control characters at scan time.
[*] The register of assigned ECI codes is available from the AIM store as "ECI Part 3: Register".
[†] Whilst it is possible to store the entire GUID range within 39 digits, a 39-digit Mode C Code 128 symbol is in fact longer than a 40-digit symbol.

Bash string compression

I'd like to know how I can compress a string into fewer characters using a shell script. The goal is to take a Mac's serial number and MAC address, then compress those values into a 14-character string. I'm not sure if this is possible, but I'd like to hear any suggestions.
Thank you
Your question is way too vague to result in a detailed answer.
Given your restriction of a 14 character string output, you won't be able to use "real" compression (like zip), due to the overhead. This leaves you with simple algorithms, like RLE or bit concatenation.
If by "string" you mean "printable string", i.e. only about 62 or so values are usable in a character (depending on the exact printable set you choose), then you have an additional space constraint.
A handy trick you can use with the MAC address part: since it belongs to an Apple device, you already know that the first three octets (AA:BB:CC) are one of only 297 combinations, so you can pack 6 characters' worth of information (plus 2 for the colons) into 2+ characters (depending on your output character set; see above).
The remaining three MAC address octets are base-16 (0-9, A-F), so you can "compress" that information slightly as well.
A similar analysis can be done for the Mac serial number (which values can it take? how much space can be saved?).
The effort to do this in bash would be disproportionate, though. I'd highly recommend a C (or other programming language) approach.
Cheating answer
Get someone at Apple to give you access to the database I'm assuming they have which matches devices' serial numbers to MAC addresses. Then you can just store the MAC address and look up the serial number in the database whenever you need it. The MAC address can easily be stored in 12 characters with standard base64 encoding.
Frustrating answer
You have to make some unreliable assumptions just to make this approachable. You can fix the assumptions later, but I don't know if it would still fit in 14 characters. Personally, I have no idea why you want to save space by reprocessing the serial and MAC numbers, but here's how I'd start.
Simplifying assumptions
Apple will never use MAC address prefixes beyond the 297 combinations mentioned in Sir Athos' answer.
The "new" Mac serial number format in this article from
2010 is the only format Apple has used or ever will use.
Core concepts of encoding
You're taking something which could have n possible values and you're converting it into something else with n possible values.
There may be gaps in the original's possible values, such as if Apple cancels building a manufacturing plant after already assigning it a location code.
There may be gaps in your encoded form's possible values, perhaps in anticipation of Apple doing things that would fill the gaps.
Abstract integer encoding
Break apart the serial number into groups as "PPP Y W SSS CCCC" (like the article describes)
Make groups for the first 3 bytes and last 5 bytes of the MAC address.
Translate each group into a number from 0 to n-1, where n is the number of possible values for something in the group. As far as I can tell from the article, the values are n_P=36^3, n_Y=20, n_W=27, n_S=3^3, and n_C=36^4. The first 3 MAC bytes have 297 possible values and the last 5 have 2^(8*5) = 2^40 values.
Set a variable, i, to the value of the first group's number.
For each remaining group's number, multiply i by the number of possible values for that group, and then add the group's number to i (sketched below).
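A sketch of those last two steps in Go; the group values and radices below are toy stand-ins, not real serial-number fields:

    package main

    import (
        "fmt"
        "math/big"
    )

    // pack folds per-group values into one integer, mixed-radix style:
    // start with the first group's value, then for each further group
    // multiply by that group's radix and add its value.
    // (radices[0] is unused; it is kept only for symmetry.)
    func pack(values, radices []int64) *big.Int {
        i := big.NewInt(values[0])
        for k := 1; k < len(values); k++ {
            i.Mul(i, big.NewInt(radices[k]))
            i.Add(i, big.NewInt(values[k]))
        }
        return i
    }

    func main() {
        // Toy groups with 36^3, 20 and 27 possible values each.
        i := pack([]int64{1234, 7, 19}, []int64{46656, 20, 27})
        fmt.Println(i) // 666568
    }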
Base n encoding
Make a list of n characters that you want to use in your final output. Then:
1. Print the character in your list at index i % n.
2. Subtract the modulus from the integer and divide it by n.
3. Repeat steps 1 and 2 until the integer becomes 0 (sketched below).
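And the base-n step, Go again; the 64-character alphabet is an arbitrary choice:

    package main

    import (
        "fmt"
        "math/big"
    )

    // toBaseN emits i's digits in base len(alphabet), least significant
    // character first, exactly as in the steps above.
    func toBaseN(i *big.Int, alphabet string) string {
        n := big.NewInt(int64(len(alphabet)))
        v := new(big.Int).Set(i) // work on a copy
        mod := new(big.Int)
        out := []byte{}
        for {
            v.DivMod(v, n, mod) // mod = v % n, v = v / n
            out = append(out, alphabet[mod.Int64()])
            if v.Sign() == 0 {
                break
            }
        }
        return string(out)
    }

    func main() {
        const alphabet = "0123456789" +
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-_" // 64 chars
        i, _ := new(big.Int).SetString("2000000000000000000000000", 10) // ~2*10^24
        s := toBaseN(i, alphabet)
        fmt.Println(s, len(s)) // 14 characters, matching the estimate below
    }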
Result
This results in a total of 36^3 * 20 * 27 * 36 * 7 * 297 * 2^40 ~= 2 * 10^24 combinations. If you let n = 64 for a custom base64 encoding (without any padding characters), then you can barely fit that into ceiling(log(2 * 10^24) / log(64)) = 14 characters. If you use all 95 printable ASCII characters, then you can fit it into ceiling(log(2 * 10^24) / log(95)) = 13 characters.
Fixing the assumptions
If you're trying to build something that uses this and are determined to make it work, here's what you need to do to make it solid, along with some tips.
Do the same analysis on every other serial number format you may care about. You might want to see if there's any redundant information between the serial and MAC numbers.
Figure out a way to distinguish between serial number formats. Adding an extra field at the end of the abstract integer encoding can let you track which version a code uses.
Think long and carefully about the format you're making. It's a lot easier to make changes before you're stuck with backwards compatibility.
If you can, use a language that's well suited for mapping between values, doing a lot of arithmetic, and handling big numbers. You may be able to do it in Bash, but it'd probably be easier in, say, Python.

Viewing the contents of a file as decimal numbers using vi

I am saving audio input to a file called sound.raw using the ALSA API. I think the sound amplitudes are being saved (it is a guess; I am not sure). The format I use is signed 16-bit little-endian (S16_LE). Now, if the amplitudes are indeed being saved, how do I see them in decimal number format? As of now I only see a collection of #s and ^s and various other symbols which don't make sense when I open sound.raw with vi.
What you are seeing is the binary representation of the sound data as interpreted by vi (probably as ASCII). It is not meant to be human-readable, or a lot of storage would be wasted.
See Using vi as a hex editor for a way to show the data in hexadecimal format, which is the closest you're going to get to an answer to your question without (writing your own) specific software for displaying ALSA-formatted sound data in a human-readable fashion.
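If you do want the actual decimal amplitudes, a small program is easy. Here's a sketch in Go that walks sound.raw as signed 16-bit little-endian samples:

    package main

    import (
        "encoding/binary"
        "fmt"
        "io"
        "os"
    )

    func main() {
        f, err := os.Open("sound.raw")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // S16_LE: every sample is a signed 16-bit little-endian integer.
        var sample int16
        for {
            err := binary.Read(f, binary.LittleEndian, &sample)
            if err == io.EOF {
                break
            }
            if err != nil {
                panic(err)
            }
            fmt.Println(sample) // one amplitude per line, in decimal
        }
    }

On a little-endian machine, od -A d -t d2 sound.raw prints much the same thing from the shell without writing any code.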
