Little-Endian Signed Integer - endianness

I know the WAV file format uses signed integers for 16-bit samples. It also stores them in little-endian order, meaning the lowest 8 bits come first, then the next 8, etc. Is the sign bit in the first byte, or is it always the most significant bit (highest value)?
Meaning:
Which one is the sign bit in the WAV format?
++---+---+---+---+---+---+---+---++---+---+---+---+---+---+---+---++
|| a | b | c | d | e | f | g | h || i | j | k | l | m | n | o | p ||
++---+---+---+---+---+---+---+---++---+---+---+---+---+---+---+---++
--------------------------- here -> ^ ------------- or here? -> ^
i or p?

signed int, little endian:
 byte 1(lsb)      byte 2(msb)
---------------------------------
7|6|5|4|3|2|1|0 | 7|6|5|4|3|2|1|0|
----------------------------------
                  ^
                  |
                  Sign bit
You only need to concern yourself with that when reading/writing a short int to some external media. Within your program, the sign bit is the most significant bit in the short, no matter if you're on a big or little endian platform.

The sign bit is the most significant bit on any two's-complement machine (like the x86), and thus will be in the last byte in a little-endian format.
Just 'cause I didn't want to be the one not including ASCII art... :)
+---------------------------------------+---------------------------------------+
|               first byte              |              second byte              |
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|  0 |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | 11 | 12 | 13 | 14 | 15 |
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
  ^--- lsb                                             msb / sign bit -----^
Bits are basically represented "backwards" from how most people think about them, which is why the high byte is last. But it's all consistent; "bit 15" comes after "bit 0" just as addresses ought to work, and is still the most significant bit of the most significant byte of the word. You don't have to do any bit twiddling, because the hardware talks in terms of bytes at all but the lowest levels -- so when you read a byte, it looks exactly like you'd expect. Just look at the most significant bit of your word (or the last byte of it, if you're reading a byte at a time), and there's your sign bit.
Note, though, that two's complement doesn't exactly designate a particular bit as the "sign bit". That's just a very convenient side effect of how the numbers are represented. For 16-bit numbers, -x is equal to 65536-x rather than 32768+x (which would be the case if the upper bit were strictly the sign).
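To make this concrete, here is a minimal sketch in Python (the byte values are my own made-up example, not from any particular WAV file) showing that the sign ends up in bit 7 of the second byte of a little-endian 16-bit sample:

import struct

# Two raw bytes as they appear in a WAV data chunk: low byte first.
raw = bytes([0x34, 0xF2])          # example bytes, chosen arbitrarily

# "<h" means little-endian signed 16-bit integer.
(sample,) = struct.unpack("<h", raw)
print(sample)                      # -3532

# The same by hand: the sign bit is bit 7 of the SECOND byte.
value = raw[0] | (raw[1] << 8)     # 0xF234 = 62004 as an unsigned value
if value & 0x8000:                 # top bit set -> negative in two's complement
    value -= 0x10000               # 62004 - 65536 = -3532
print(value)                       # -3532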

Related

awk - match positive whole numbers and floating point numbers only

I have an input file in .csv format which contains entries of tax invoices separated by pipe.
for example:
Header--TIN | NAME | INV NO | DATE | NET | TAX | OTHERS | TOTAL
Record1-29001234768 | A S Spares | AB012 | 23/07/2016 | 5600.25 | 200.70 | 10.05 | 5811.00
Record2-29450956221 | HONDA Spare Parts | HOSS0987 |29/09/2016 | 70000 | 2200 | 0 | 72200
The NET value, TAX value, OTHER charges and TOTAL value columns may contain positive whole numbers or positive floating-point numbers with 2-4 places after the decimal point.
Now my requirement is to check whether these columns meet the specified constraints, using an appropriate regular expression in awk.
I need to match these 4 columns with a regular expression such that if I encounter any numeric value other than a positive whole number or a positive floating-point number, I print an error message to the user.
I've tried the following, but it doesn't seem to work:
if (!($5 ~ /[0-9]+/) || !($5 ~ /[0-9]+[.][0-9]+/) || ($5 <= 0))
  { printf("NET VALUE (Violates constraints)\n") }
Can anyone give a proper working regular expression, or an implementation using built-in functions, that meets my requirements?
Sounds like your validation should be:
$5 ~ /^[0-9]+(\.[0-9]{2,4})?$/
If it matches that, then it's valid (either a positive whole number, or a number followed by . and 2 to 4 more digits).
The anchors to the start and end of the field are important!
As rightly pointed out in the comments, if you want to accept numbers with no digits before the decimal point, then you will have to go for a more complex regular expression.

What algorithm to use to format an EDIFACT file?

I work with EDIFACT messages and have developed lots of tools to help me parse and extract the relevant information out of the raw file format.
Something I have always struggled with is presenting the raw EDIFACT. I typically just copy the message into Microsoft Word, do a find and replace for the segment separator and view the contents line by line.
I have always wanted to display the EDIFACT file in its hierarchy format but can not for the life of me work out a method to do this.
Below is a small extract of a raw EDIFACT message.
The left side shows how I get the data (not including line numbers), the right side shows how I want it to be displayed based on a customer's specification.
01. UNA -UNA
02. UNB -UNB
03. UNH -UNH
04. BGM -BGM
05. DTM - | DTM
06. DTM - | DTM
07. DTM - | DTM
08. NAD - | NAD
09. NAD - | NAD
10. NAD - | NAD
11. GIS - | GIS
12. LIN - | | LIN
13. LOC - | | | LOC
14. LOC - | | | LOC
15. LOC - | | | LOC
16. RFF - | | | RFF
17. QTY - | | | QTY
18. QTY - | | | QTY
19. RFF - | | | | RFF
20. DTM - | | | | | DTM
21. SCC - | | | SCC
22. QTY - | | | | QTY
23. DTM - | | | | | DTM
24. DTM - | | | | | DTM
25. SCC - | | | SCC
26. QTY - | | | | QTY
27. DTM - | | | | | DTM
28. DTM - | | | | | DTM
29. SCC - | | | SCC
30. QTY - | | | | QTY
31. DTM - | | | | | DTM
32. QTY - | | | | QTY
33. DTM - | | | | | DTM
34. SCC - | | | SCC
35. QTY - | | | | QTY
36. DTM - | | | | | DTM
37. NAD - | | | NAD
38. CTA - | | | | CTA
39. COM - | | | | | COM
40. SCC - | | | | SCC
41. QTY - | | | | | QTY
42. UNT -UNT
43. UNZ -UNZ
You can see that the data is tree based, and it is described by a specification that is sent to me. One specification for the above EDIFACT message is as follows:
Tag St Max Lvl
0000 1 UNA C 1 0 SERVICE STRING ADVICE
0000 2 UNB M 1 0 INTERCHANGE HEADER
0010 3 UNH M 1 0 MESSAGE HEADER
0020 4 BGM M 1 0 BEGINNING OF MESSAGE
0030 5 DTM M 10 1 DATE/TIME/PERIOD
0040 6 FTX C 5 1 FREE TEXT
0080 SG2 C 99 1 NAD
0090 7 NAD M 1 1 NAME AND ADDRESS
0190 SG6 C 9999 1 GIS-SG7-SG12
0200 8 GIS M 1 1 GENERAL INDICATOR
0210 SG7 C 1 2 NAD
0220 9 NAD M 1 2 NAME AND ADDRESS
0370 SG12 C 9999 2 LIN-LOC-FTX-SG13-SG15-SG17-SG22
0380 10 LIN M 1 2 LINE ITEM
0450 11 LOC C 999 3 PLACE/LOCATION IDENTIFICATION
0470 12 FTX C 5 3 FREE TEXT
0480 SG13 C 10 3 RFF
0490 13 RFF M 1 3 REFERENCE
0540 SG15 C 10 3 QTY-SG16
0550 14 QTY M 1 3 QUANTITY
0570 SG16 C 10 4 RFF-DTM
0580 15 RFF M 1 4 REFERENCE
0590 16 DTM C 1 5 DATE/TIME/PERIOD
0600 SG17 C 999 3 SCC-SG18
0610 17 SCC M 1 3 SCHEDULING CONDITIONS
0620 SG18 C 999 4 QTY-DTM
0630 18 QTY M 1 4 QUANTITY
0640 19 DTM C 2 5 DATE/TIME/PERIOD
0760 SG22 C 999 3 NAD-SG24-SG27
0770 20 NAD M 1 3 NAME AND ADDRESS
0830 SG24 C 5 4 CTA-COM
0840 21 CTA M 1 4 CONTACT INFORMATION
0850 22 COM C 5 5 COMMUNICATION CONTACT
0920 SG27 M 999 4 SCC-SG28
0940 SG28 M 999 5 QTY
0950 24 QTY M 1 5 QUANTITY
1030 25 UNT M 1 0 MESSAGE TRAILER
0000 26 UNZ M 1 0 INTERCHANGE TRAILER
The important columns are Tag, St (M=Mandatory, C=Conditional), Max (Maximum times it can repeat), lvl (How deep in the tree it is).
The tags that start with SG denote a loop.
The problem I face is that the format is very flexible: it can have conditional segments, conditional loops, and repeated segments. Trying to think of a method that can handle all of this has been my issue.
Starting from the top of the above specification, you can immediately see that when you come to the DTM tag, it can be repeated up to a maximum of 10 times. In the sample EDIFACT message it only appears 3 times, on lines 5, 6 and 7. Following the specification, FTX may appear but does not in my sample message; then there is an SG2 tag, which means the following NAD tag may repeat up to 99 times.
Moving slightly ahead to the LIN tag (which is under the SG12 group, which can repeat up to 9999 times and in many cases does), we come to the first QTY tag.
According to the specification, this segment (group SG15) can have a conditional group (SG16) with an RFF and a DTM under it. Using my sample, you can see that lines 17 and 18 both have a QTY segment, but line 18's QTY also has this conditional group under it (the RFF and DTM on lines 19 and 20).
Similar things start happening when you look into the SCC segments too.
What I have in mind is to enter that specification into some sort of file format, then run the raw EDIFACT message against the rules of this specification, so that the output is hierarchy-based (making it easy to see at a glance which data relates to which segment) and so that there is a way to check whether the EDIFACT message is valid.
What I have trouble with, is the actual algorithm or process to do that conversion.
I have tried naive approaches, like going line by line but then it gets messy when I am trying to work out if the current line is in a group, or a repeat or something else.
I have tried a recursive approach, by splitting the entire EDIFACT by the largest group (the SG12-LIN group), then recursively processing each of those splits and building an output. This has been my best approach yet, but it's still far from working, with many false readings due to my logic not being right.
I basically need to be able to pick a segment of the message, and determine where in the hierarchy it should be and display it.
I am at a loss on how I can solve this. I am sure there is a nice simple method at doing this but I just cannot work it out.
Any assistance would be greatly appreciated.
Slight update.
I have converted the specification into an XML file following the hierarchy of said specification. This XML file now contains all the groups and the various attributes related to each tag. Now I have a starting point for what the EDIFACT needs to conform to.
If I go through it on paper (and in my head) I can build the output I am after with a bit of forward thinking, so my new idea is to "scan ahead" in the EDIFACT file and build a probability-based result.
Bit like how a chess AI looks ahead a few moves.
Most of the things you want, I can help you with (and have done them). But this is not easily done on a small piece of paper with no interaction.
So if you want more information, just contact me. (No, this is not a commercial thing.)
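For what it's worth, here is a rough Python sketch of the recursive-descent idea described above (the spec structure and tags are my own simplified stand-ins, not the poster's XML format or the full EDIFACT standard): walk the spec tree, greedily consume segments while they match the current node's trigger tag within its repeat limit, recurse into groups, and emit one indented line per consumed segment.

from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Seg:
    tag: str
    mandatory: bool
    max_rep: int

@dataclass
class Group:
    name: str
    mandatory: bool
    max_rep: int
    children: List[Union["Seg", "Group"]] = field(default_factory=list)

def trigger(node):
    # Tag that signals the start of this node in the message.
    return node.tag if isinstance(node, Seg) else trigger(node.children[0])

def match(children, segments, pos, depth, out):
    # Consume segments according to the spec children; return the new position.
    for node in children:
        reps = 0
        while reps < node.max_rep and pos < len(segments) and segments[pos] == trigger(node):
            if isinstance(node, Seg):
                out.append("| " * depth + segments[pos])
                pos += 1
            else:  # a group: its first child starts it, the rest follow recursively
                pos = match(node.children, segments, pos, depth + 1, out)
            reps += 1
        if node.mandatory and reps == 0:
            raise ValueError(f"missing mandatory {trigger(node)} at segment {pos}")
    return pos

# Tiny made-up spec covering only a fragment of the real one above.
spec = [
    Seg("UNH", True, 1),
    Seg("BGM", True, 1),
    Seg("DTM", True, 10),
    Group("SG2", False, 99, [Seg("NAD", True, 1)]),
    Seg("UNT", True, 1),
]

msg = ["UNH", "BGM", "DTM", "DTM", "NAD", "NAD", "UNT"]
out = []
match(spec, msg, 0, 0, out)
print("\n".join(out))

On this fragment it prints UNH, BGM and the two DTMs at depth 0 and the two NADs one pipe deeper, which is essentially the indented layout shown above; validity checking falls out of the mandatory and max-repeat checks.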

Segment multilanguage parallel text

I have multi-language text that contains a message translated to several languages.
For example:
English message
Russian message
Ukrainian message
The order is not exact.
I would like to devise some kind of supervised/unsupervised learning algorithm to do the segmentation automatically, and extract each translation in order to create a parallel corpus of data.
Could you suggest any papers/approaches?
I am not able to get the proper keywords for googling.
The most basic approach to your problem would be to generate a bag of words from your document. To sum up, a bag of words is a matrix where each row is a line in your document and each column is a distinct term.
For instance, if your document is like this:
hello world
привет мир
привіт світ
You will have this matrix:
   | hello | world | привет | мир | привіт | світ
l1 |   1   |   1   |   0    |  0  |   0    |  0
l2 |   0   |   0   |   1    |  1  |   0    |  0
l3 |   0   |   0   |   0    |  0  |   1    |  1
You can then apply classification algorithms (such as k-means or SVMs) according to your needs.
For more details, I would suggest reading this paper, which provides a great summary of techniques.
Regarding keywords for googling, I would say text analysis, text mining or information retrieval are a good start.
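As a concrete starting point, here is a minimal sketch using scikit-learn (assumed to be available; using character n-grams instead of whole words is my own tweak, since lines in different languages rarely share vocabulary but do differ in character statistics):

# Bag of words over character n-grams, then k-means into one cluster per language.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

lines = [
    "hello world",
    "привет мир",
    "привіт світ",
    "good morning everyone",
    "доброе утро всем",
    "доброго ранку всім",
]

# Rows = lines of the document, columns = distinct character n-grams.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(1, 3))
X = vectorizer.fit_transform(lines)

# One cluster per expected language (3 here).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

for line, label in zip(lines, labels):
    print(label, line)

Closely related languages (Russian and Ukrainian here) may still fall into one cluster on very short inputs, which is where supervised language-identification tools can help.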
Why don't you try some language identification software? These tools report > 90% accuracy:
langid.py https://github.com/saffsd/langid.py
TextCat http://odur.let.rug.nl/~vannoord/TextCat/
Linguine http://www.jmis-web.org/articles/v16_n3_p71/index.html
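For example, with langid.py (assuming pip install langid), labelling each line takes one call per line:

# Sketch: tag each line of the document with a detected language code.
import langid  # https://github.com/saffsd/langid.py

for line in ["hello world", "привет мир", "привіт світ"]:
    lang, score = langid.classify(line)   # returns (language code, confidence score)
    print(lang, line)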

turned on bits counter

Suppose I have a black box with 3 inputs (each input is 1 bit) and a 2-bit output.
The black box counts the number of its input bits that are turned on.
Using only such black boxes, one needs to implement a counter of the turned-on bits in a 7-bit input. The implementation should use the minimum possible number of black boxes.
// This is a job interview question
You're making a binary adder. Try this...
Two black boxes for input with one input remaining:
 7 6 5     4 3 2     1
 | | |     | | |     |
-------   -------    |
|     |   |     |    |
| H L |   | H L |    |
-------   -------    |
  | |       | |      |
Take the two low outputs and the remaining input (1) and feed them to another black box:
 L L 1
 | | |
-------
|     |
| C L |
-------
  | |
The low output from this black box will be the low bit of the result. The high output is the carry bit. Feed this carry bit along with the high bits from the first two black boxes into the fourth black box:
 H H C    L
 | | |    |
-------   |
|     |   |
| H M |   |
-------   |
  | |     |
The result should be the number of "on" bits in the input expressed in binary by the High, Middle and Low bits.
Suppose that each BB outputs a 2-bit binary count 00, 01, 10, or 11, when 0, 1, 2, or 3 of its inputs are on. Also suppose that the desired ultimate output O₄O₂O₁ is a 3-bit binary count 000 ... 111, when 0, 1, ... 7 of the 7 input bits i₁...i₇ are on. For problems like this in general, you can write a boolean expression for what the BB does and a boolean expression for the desired output and then synthesize the output. In this particular case, however, try the obvious approach of putting i₁, i₂, i₃ into a first box B₁, and i₄, i₅, i₆ into a second box B₂, and i₇ into one input of a third box B₃. Looking at this it's clear that if you run the units outputs from B₁ and B₂ into the other two inputs of B₃ then the units output from B₃ is equal to the desired value O₁. You can get the sum of the twos outputs from B₁, B₂, B₃ via a box B₄, and this sum is equal to the desired values O₄O₂.
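Here is a quick Python simulation (my own sketch, not part of either answer) that checks this wiring over all 128 possible 7-bit inputs:

from itertools import product

def black_box(a, b, c):
    # Return the (high, low) bits of the number of inputs that are on.
    s = a + b + c
    return s >> 1, s & 1

for bits in product((0, 1), repeat=7):
    i1, i2, i3, i4, i5, i6, i7 = bits
    h1, l1 = black_box(i7, i6, i5)      # first box:  inputs 7, 6, 5
    h2, l2 = black_box(i4, i3, i2)      # second box: inputs 4, 3, 2
    c,  o1 = black_box(l1, l2, i1)      # third box:  the two low outputs + input 1
    o4, o2 = black_box(h1, h2, c)       # fourth box: the two high outputs + carry
    assert 4 * o4 + 2 * o2 + o1 == sum(bits)

print("all 128 input patterns verified")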

Algorithm to find optimal groups

A device contains an array of locations, some of which contain values that we want to read periodically.
Our list of locations that we want to read periodically also specifies how often we want to read them. It is permitted to read a value more frequently than specified, but not less frequently.
A single read operation can read a contiguous sequence of locations from the array, so it is possible to return a group of multiple values from one read operation. The maximum number of contiguous locations that can be read in a single operation is M.
The goal is to group locations so as to minimize the time-averaged number of read operations. In the event that there is more than one way to do this, the tie-breaker is to minimize the time-averaged number of locations read.
(Bonus points are awarded if the algorithm to do this allows incremental changes to the list of locations - i.e. adding or removing one location to/from the list doesn't require the groupings to be recalculated from scratch!)
I'll try to clarify this with some examples where M=6.
The following diagram shows the array of locations. The numbers represent the desired read period for that location.
| 1 | 1 |   |   | 1 |   |   |   |   |   | 5 |   | 2 |
\-------------------/                   \-----------/
       group A                             group B
In this first example group A is read every second and group B every 2 seconds. Note that the location that should be read every 5 seconds is actually read every 2 seconds - which is fine.
| 1 |   |   |   |   | 1 | 1 |   | 1 |
\-----------------------/\----------/
         group A         group B (non-optimal!)
This example shows a failure of my initial simple-minded algorithm, which was to fill up the first group until full and then start another. The following grouping is more optimal because although the number of group reads per second is the same, the number of locations read in those groups is smaller:
| 1 |   |   |   |   | 1 | 1 |   | 1 |
\---/               \---------------/
group A              group B (optimal)
Finally, an example where three groups is better than two:
| 5 |   |   |   |   | 1 | 1 |   |   |   |   | 5 |
\-----------------------/\----------------------/
         group A         group B (non-optimal)
This solution requires two group reads per second. A better solution is as follows:
| 5 |   |   |   |   | 1 | 1 |   |   |   |   | 5 |
\---/               \-------/               \---/
group A               group B               group C
This requires two reads every 5 seconds (groups A and C) plus one every second (group B): 1.4 group reads per second.
Edit: (There is an even better solution to this example if you allow reads to be non-periodic. On the 1st second, read both groups of the first solution. On seconds 2, 3, 4 and 5 read group B of the second solution. Repeat. This results in 1.2 group reads per second. But I'm going to disallow this because it would make the code responsible for scheduling the reads much more complicated.)
I looked up clustering algorithms but this isn't a clustering problem. I also found Algorithm to allocate a list of numbers to N groups under certain condition, which pointed to the 'Bin packing' problem, but I don't think this is it either.
By the way, sorry for the vague title. I can't think of a concise description, or even relevant search keywords!
New examples added 28 September 2010:
This is like the previous example, but all items updating at the same rate. Now two groups is better than three:
| 1 |   |   |   |   | 1 | 1 |   |   |   |   | 1 |
\-----------------------/\----------------------/
         group A         group B (optimal)
I've started trying to see how iterative improvements might be implemented. Suppose a grouping algorithm came up with:
| 1 |   |   |   |   | 1 | 1 |   |   |   |   | 1 | 1 |   |   |   |   | 1 |
\---/               \-------/               \-------/               \---/
group A               group B                 group C            group D (non-optimal)
\-----------------------/\----------------------/\----------------------/
         group A                 group B             group C (optimal)
This can be improved to three adjacent groups each of 6. Rex suggested (comment below) that I could try combining triplets into pairs. But in this case I would have to combine a quartet into a triplet, because there is no legal intermediate arrangement in which A+B+C (or B+C+D) can be rearranged into a pair leaving D as it is.
I originally thought that this was an indication that, in the general case, there is no guarantee that a new valid solution can be created from an existing valid solution by making a local modification. This would have meant that algorithms such as simulated annealing, genetic algorithms, etc., could not be used to refine a suboptimal solution.
But Rex pointed out (comment below) that you can always split an existing group into two. Despite the fact that this always increases the cost function, all that means is that the solution needs to get out of its local minimum in order to reach the global minimum.
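For concreteness, here is a rough Python sketch (the names and representation are my own) of the cost function implied above: a candidate grouping is a list of contiguous spans, each read at the period of its most demanding location. This is exactly what a local-search method such as simulated annealing would need to evaluate.

def grouping_cost(periods, groups, M=6):
    # periods: dict {location index: required read period, in seconds}
    # groups:  list of (start, length) spans, each read as one operation
    covered = set()
    reads_per_sec = 0.0       # primary objective
    locs_per_sec = 0.0        # tie-breaker
    for start, length in groups:
        assert 1 <= length <= M, "a read covers at most M contiguous locations"
        span = range(start, start + length)
        wanted = [periods[i] for i in span if i in periods]
        if not wanted:
            continue                      # a group with nothing to read is pointless
        period = min(wanted)              # read as often as the fastest location needs
        reads_per_sec += 1.0 / period
        locs_per_sec += length / period
        covered.update(i for i in span if i in periods)
    assert covered == set(periods), "every required location must be covered"
    return reads_per_sec, locs_per_sec

# Third example above: a 5 at index 0, 1s at indices 5 and 6, a 5 at index 11.
periods = {0: 5, 5: 1, 6: 1, 11: 5}
print(grouping_cost(periods, [(0, 6), (6, 6)]))           # (2.0, 12.0): two reads/sec
print(grouping_cost(periods, [(0, 1), (5, 2), (11, 1)]))  # roughly (1.4, 2.4)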
This problem has the same property of instability on addition of new items that similar NP-complete problems do, so I assume it is one also. Since I suspect that you want something that works reasonably well instead of a proof of why it's hard, I'll focus on an algorithm to give an approximate solution.
I would solve this problem by converting it into a graph where each location is valued at 1/N if it has to be read every N seconds, and then blurring the graph with a width of M (e.g. 6), peaked at the original location. (For M=6, I might use the weighting (1/6 1/5 1/4 1/3 1/2 1 1/2 1/3 1/4 1/5 1/6).) Then throw reads at all the local maxima (sort pairs of maxima by distance apart and cover close pairs first if you can). Now you'll have most of your most important values covered. Then catch any missing locations by extending the existing reads, or by adding new reads if necessary. Depending on the structure, you may want to add some refinement by shifting locations between reads, but if you're lucky that won't even be necessary.
Since this is essentially a local algorithm, if you keep track of the blurred graph, you can fairly easily add new items and re-do the peak-covering locally (and the refinement locally).
Just to see how this would work on your data, the two-group case would look like (multiplying by 60 so I don't have to keep track of fractional weights)
60 30 20 15 12 10 00 00 00 <- contribution from left-most location
10 12 15 20 30 60 30 20 15 <- second
00 10 12 15 20 30 60 30 20 <- third
00 00 00 10 12 15 20 30 60 <- rightmost
--------------------------
70 52 47 60 74 B5 B0 80 95 (using "B" to represent 11)
^^             ^^       ^^   Local maxima
----------------  ---------
     dist=6        dist=4
               |=========|   <- Hit closely-spaced peaks first
|==|                         <- Then remaining
So we're done, and the solution is optimal.
For the three group example, weighting "5" as "1/5" and multiplying everything by 300 so again there are no fractions,
060 030 020 015 012 010 000 000 000 000 000 000 <- from 5 on left
050 060 075 100 150 300 150 100 075 060 050 000 <- 1 on left
000 050 060 075 100 150 300 150 100 075 060 050 <- 1 on right
000 000 000 000 000 000 010 012 015 020 030 060 <- 5 on right
-----------------------------------------------
110 140 155 190 262 460 460 262 190 155 140 110
                   |=======|   <- only one peak, grab it
===                                         ===   <- missed some, so pick them back up
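Here is a minimal Python sketch of the blur-and-peaks idea (my own simplification; the peak-pairing and gap-filling steps are left out), reproducing the first worked table:

def blurred(periods, n, M=6):
    # Spread each location's 1/period weight over neighbours up to M-1 cells away.
    graph = [0.0] * n
    for i, period in periods.items():
        for d in range(-(M - 1), M):
            j = i + d
            if 0 <= j < n:
                graph[j] += (1.0 / period) / (abs(d) + 1)   # 1, 1/2, 1/3, ... falloff
    return graph

def local_maxima(graph):
    return [i for i, v in enumerate(graph)
            if (i == 0 or v >= graph[i - 1]) and (i == len(graph) - 1 or v >= graph[i + 1])]

# Two-group example above: four 1-second locations at indices 0, 5, 6 and 8.
periods = {0: 1, 5: 1, 6: 1, 8: 1}
g = blurred(periods, 9)
print([round(60 * x) for x in g])   # [70, 52, 47, 60, 74, 115, 110, 80, 95] -- the table, x60
print(local_maxima(g))              # [0, 5, 8] -- centre the reads around those peaks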
