What algorithm to use to format an EDIFACT file?

I work with EDIFACT messages and have developed lots of tools to help me parse and extract the relevant information out of the raw file format.
Something I have always struggled with is presenting the raw EDIFACT. I typically just copy the message into Microsoft Word, do a find and replace for the segment separator and view the contents line by line.
I have always wanted to display the EDIFACT file in its hierarchical format but cannot for the life of me work out a method to do this.
Below is a small extract of a raw EDIFACT message.
The left side shows how I get the data (the line numbers are not part of it); the right side shows how I want it to be displayed, based on a customer's specification.
01. UNA -UNA
02. UNB -UNB
03. UNH -UNH
04. BGM -BGM
05. DTM - | DTM
06. DTM - | DTM
07. DTM - | DTM
08. NAD - | NAD
09. NAD - | NAD
10. NAD - | NAD
11. GIS - | GIS
12. LIN - | | LIN
13. LOC - | | | LOC
14. LOC - | | | LOC
15. LOC - | | | LOC
16. RFF - | | | RFF
17. QTY - | | | QTY
18. QTY - | | | QTY
19. RFF - | | | | RFF
20. DTM - | | | | | DTM
21. SCC - | | | SCC
22. QTY - | | | | QTY
23. DTM - | | | | | DTM
24. DTM - | | | | | DTM
25. SCC - | | | SCC
26. QTY - | | | | QTY
27. DTM - | | | | | DTM
28. DTM - | | | | | DTM
29. SCC - | | | SCC
30. QTY - | | | | QTY
31. DTM - | | | | | DTM
32. QTY - | | | | QTY
33. DTM - | | | | | DTM
34. SCC - | | | SCC
35. QTY - | | | | QTY
36. DTM - | | | | | DTM
37. NAD - | | | NAD
38. CTA - | | | | CTA
39. COM - | | | | | COM
40. SCC - | | | | SCC
41. QTY - | | | | | QTY
42. UNT -UNT
43. UNZ -UNZ
You can see that the data is tree-based, and it is described by a specification that is sent to me. The specification for the above EDIFACT message is as follows:
Pos No Tag St Max Lvl Name
0000 1 UNA C 1 0 SERVICE STRING ADVICE
0000 2 UNB M 1 0 INTERCHANGE HEADER
0010 3 UNH M 1 0 MESSAGE HEADER
0020 4 BGM M 1 0 BEGINNING OF MESSAGE
0030 5 DTM M 10 1 DATE/TIME/PERIOD
0040 6 FTX C 5 1 FREE TEXT
0080 SG2 C 99 1 NAD
0090 7 NAD M 1 1 NAME AND ADDRESS
0190 SG6 C 9999 1 GIS-SG7-SG12
0200 8 GIS M 1 1 GENERAL INDICATOR
0210 SG7 C 1 2 NAD
0220 9 NAD M 1 2 NAME AND ADDRESS
0370 SG12 C 9999 2 LIN-LOC-FTX-SG13-SG15-SG17-SG22
0380 10 LIN M 1 2 LINE ITEM
0450 11 LOC C 999 3 PLACE/LOCATION IDENTIFICATION
0470 12 FTX C 5 3 FREE TEXT
0480 SG13 C 10 3 RFF
0490 13 RFF M 1 3 REFERENCE
0540 SG15 C 10 3 QTY-SG16
0550 14 QTY M 1 3 QUANTITY
0570 SG16 C 10 4 RFF-DTM
0580 15 RFF M 1 4 REFERENCE
0590 16 DTM C 1 5 DATE/TIME/PERIOD
0600 SG17 C 999 3 SCC-SG18
0610 17 SCC M 1 3 SCHEDULING CONDITIONS
0620 SG18 C 999 4 QTY-DTM
0630 18 QTY M 1 4 QUANTITY
0640 19 DTM C 2 5 DATE/TIME/PERIOD
0760 SG22 C 999 3 NAD-SG24-SG27
0770 20 NAD M 1 3 NAME AND ADDRESS
0830 SG24 C 5 4 CTA-COM
0840 21 CTA M 1 4 CONTACT INFORMATION
0850 22 COM C 5 5 COMMUNICATION CONTACT
0920 SG27 M 999 4 SCC-SG28
0940 SG28 M 999 5 QTY
0950 24 QTY M 1 5 QUANTITY
1030 25 UNT M 1 0 MESSAGE TRAILER
0000 26 UNZ M 1 0 INTERCHANGE TRAILER
The important columns are Tag, St (M = Mandatory, C = Conditional), Max (the maximum number of times it can repeat) and Lvl (how deep in the tree it is).
Tags that start with SG denote a loop (a repeating group of segments).
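One way to hold a specification like this in memory (after parsing it from whatever file format you choose) is as a tree of nodes in which the SG groups own their children. The names below (SpecNode, max_repeat and so on) are purely illustrative, not part of any existing tool; this is just a sketch of the shape of the data:

from dataclasses import dataclass, field
from typing import List

@dataclass
class SpecNode:
    """One specification row: a segment (UNH, DTM, ...) or a group (SG2, SG12, ...)."""
    tag: str                    # e.g. "DTM" or "SG12"
    mandatory: bool             # St column: M = True, C = False
    max_repeat: int             # Max column
    level: int                  # Lvl column
    children: List["SpecNode"] = field(default_factory=list)   # only groups have children

    @property
    def is_group(self) -> bool:
        return self.tag.startswith("SG")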
The problem I face is that the format is very flexible: it can have conditional segments, conditional loops and repeated segments. Coming up with a method that can handle all of this has been my problem.
Starting from the top of the above specification, you can immediately see that when you come to the DTM tag, it can be repeated up to a maximum of 10 times. In the sample EDIFACT message it only appears 3 times, on lines 5, 6 and 7. Following the specification, FTX may appear but does not in my sample message; then there is an SG2 group, which means the NAD tag that follows may repeat up to 99 times.
Moving slightly ahead, inside the LIN tag (which is under the SG12 group, which can repeat up to 9999 times and in many cases does), we come to the first QTY tag.
According to the specification, this segment (group SG15) can have a conditional group (SG16) with an RFF and a DTM under it. Using my sample, you can see the QTY segment on lines 17 and 18, but the QTY on line 18 has this conditional group under it too (lines 19 and 20).
Similar things start happening when you look into the SCC segments too.
What I have in mind is to be able to enter that specification into some sort of file format, then run the raw EDIFACT message against the rules of that specification so that the output is hierarchical. That way it is easy to see at a glance what data relates to which segment, and it also gives me a way to check whether the EDIFACT message is valid.
What I have trouble with is the actual algorithm or process to do that conversion.
I have tried naive approaches, like going line by line, but it gets messy when I try to work out whether the current line is in a group, in a repeat, or something else.
I have tried a recursive approach: splitting the entire EDIFACT message by the largest group (the SG12/LIN group), then recursively processing each of those splits and building an output. This has been my best approach yet, but it is still far from working, with many false readings because my logic is not right.
I basically need to be able to pick a segment of the message, determine where in the hierarchy it belongs, and display it.
I am at a loss as to how to solve this. I am sure there is a nice, simple method for doing it, but I just cannot work it out.
Any assistance would be most appreciated.
Slight update.
I have converted the specification into an XML file following the hierarchy of said specification. This XML file now contains all the groups and the various attributes related to each tag, so I have a start on what the EDIFACT needs to conform to.
If I go through it on paper (and in my head), I can build the output I am after with a bit of forward thinking, so my new idea is to "scan ahead" in the EDIFACT file and build a probability-based result.
A bit like how a chess AI looks ahead a few moves.
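Building on the hypothetical SpecNode sketch above, one way to formalise that idea is a greedy recursive-descent match: walk the specification in order, treat each SG group as triggered by its first segment, consume repeats up to Max, and indent every consumed segment by its Lvl column. This is only a sketch under those assumptions, not a complete EDIFACT validator; in particular, a purely greedy match can be fooled when the same tag can open more than one group at the same level, which is exactly where the scan-ahead idea would come in:

def first_tag(node):
    # the segment tag that opens an instance of this node (its first leaf)
    return first_tag(node.children[0]) if node.is_group else node.tag

def match_node(node, segments, pos, out):
    # Match one spec node (a segment or an SG group) against segments[pos:].
    # Appends one indented display line per consumed segment to out and returns
    # the new position; raises ValueError if a mandatory node never matches.
    count = 0
    while count < node.max_repeat and pos < len(segments):
        if node.is_group:
            if segments[pos] != first_tag(node):       # group is triggered by its first segment
                break
            for child in node.children:                # match the group body in spec order
                pos = match_node(child, segments, pos, out)
        else:
            if segments[pos] != node.tag:
                break
            out.append("| " * node.level + node.tag)   # indent by the Lvl column
            pos += 1
        count += 1
    if count == 0 and node.mandatory:
        raise ValueError(f"mandatory {node.tag} missing near segment {pos}")
    return pos

def render(spec_nodes, segments):
    # spec_nodes: the top-level specification rows in order (UNA, UNB, ..., UNT, UNZ)
    # segments: the segment tags from the raw message, in order
    out, pos = [], 0
    for node in spec_nodes:
        pos = match_node(node, segments, pos, out)
    if pos != len(segments):
        raise ValueError(f"unexpected segment {segments[pos]} at position {pos}")
    return "\n".join(out)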

Most of the things you want I can help you with (and have done them), but this is not easily done on a small piece of paper with no interaction.
So if you want more information, just contact me. (No, this is not a commercial offer.)

Related

Randomness Comparison Experiment

I have a drug analysis experiment that needs to generate a value based on a given drug database and a set of 1000 random experiments.
The original database looks like this, where the numbers in the columns represent the rank for each drug. This is a simplified version of the actual database; the actual database will have more drugs and more genes.
+-------+-------+-------+
| Genes | DrugA | DrugB |
+-------+-------+-------+
| A | 1 | 3 |
| B | 2 | 1 |
| C | 4 | 5 |
| D | 5 | 4 |
| E | 3 | 2 |
+-------+-------+-------+
A score is calculated based on the user's input, A and C, using the following function:
# Compute function
# takes ['A','C'] as array input
def computeFunction(array):
    # do some stuff with the array ...
    pass
The formula used will be the same for any provided values.
For the randomness test, each experiment set requires the algorithm to provide randomized values of A and C, so both A and C can take any number from 1 to 5.
Now I have two methods of selecting values to generate the 1000 sets for the p-value calculation, but I need someone to point out whether one method is better than the other, or whether there is any way to compare the two.
Method 1
Generate 1000 randomized databases based on the given database shown above, meaning each table should contain a different set of value pairs.
Example of one database out of the 1000 randomized databases:
+-------+-------+-------+
| Genes | DrugA | DrugB |
+-------+-------+-------+
| A | 2 | 3 |
| B | 4 | 4 |
| C | 3 | 2 |
| D | 1 | 5 |
| E | 5 | 1 |
+-------+-------+-------+
Next we run computeFunction() with the new A and C values.
Method 2
Pick random genes from the original database and use their values as the newly randomized gene values.
For example, we pick the values from E and B as the new values for A and C.
From the original database, E is 3 and B is 2.
So now A is 3 and C is 2. Next we run computeFunction() with the new A and C values.
Summary
Since both methods produce completely randomized input, it seems to me that they will produce similar 1000-value outcomes. Is there any way I could show that they are similar?
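One way to check this empirically (it is not a proof) is to generate the 1000 outcomes with each method and compare the two empirical distributions, for example with a two-sample Kolmogorov-Smirnov test. Below is a minimal sketch assuming SciPy is available and using a placeholder computeFunction (here just the sum of the two ranks); swap in your real formula:

import random
from scipy.stats import ks_2samp   # two-sample Kolmogorov-Smirnov test

drug_a = {"A": 1, "B": 2, "C": 4, "D": 5, "E": 3}   # gene -> rank for DrugA
genes = list(drug_a)

def compute_function(values):
    return sum(values)              # placeholder for the real scoring formula

def method1():
    # shuffle the whole rank column, then read off the new ranks of A and C
    shuffled = dict(zip(genes, random.sample(list(drug_a.values()), k=len(genes))))
    return compute_function([shuffled["A"], shuffled["C"]])

def method2():
    # pick two random genes and use their original ranks in place of A and C
    picked = random.sample(genes, k=2)
    return compute_function([drug_a[g] for g in picked])

scores1 = [method1() for _ in range(1000)]
scores2 = [method2() for _ in range(1000)]
stat, p = ks_2samp(scores1, scores2)
print(f"KS statistic = {stat:.3f}, p-value = {p:.3f}")

If the two methods really do produce the same distribution of scores, the p-value will usually be large; a consistently small p-value over repeated runs would indicate the methods are not interchangeable.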

How to implement Oracle's "func(...) keep (dense_rank ...)" In Hive

I have a table abcd in an Oracle DB
+-------------+----------+
| abcd.speed | abcd.ab |
+-------------+----------+
| 4.0 | 2 |
| 4.0 | 2 |
| 7.0 | 2 |
| 7.0 | 2 |
| 8.0 | 1 |
+-------------+----------+
And I'm using a query like this:
select min(speed) keep (dense_rank last order by abcd.ab NULLS FIRST) MOD from abcd;
I'm trying to convert the code to Hive, but it looks like keep is not available in Hive.
Could you suggest an equivalent statement?
select -max(struct(ab,-speed)).col2 as mod
from abcd
;
+------+
| mod |
+------+
| 4.0 |
+------+
Let's start by explaining min(speed) keep (dense_rank last order by abcd.ab NULLS FIRST):
Find the row(s) with the max value of ab.
For this/those row(s), find the min value of speed.
We are using 2 tricks here.
The first is based on the ability to take the max of a struct.
max(struct(c1,c2,c3,...)) returns the same result as if you had sorted the structs by c1, then by c2, then by c3, etc., and then chosen the last element.
The second trick is to use -speed (which is the same as -1*speed).
Finding the max of -speed and then negating that value (which gives us speed back) is the same as finding the min of speed.
If we had ordered the structs, it would have looked like this (since 2 is bigger than 1 and -4 is bigger than -7):
+----+-------+
| ab | speed |
+----+-------+
| 1 | -8.0 |
| 2 | -7.0 |
| 2 | -7.0 |
| 2 | -4.0 |
| 2 | -4.0 |
+----+-------+
The last struct in this case is struct(2,-4.0), so that is the result of the max function.
The field names for a struct are col1, col2, col3, etc., so
struct(2,-4.0).col2 is -4.0, and preceding it with a minus (which is the same as multiplying it by -1), as in -struct(2,-4.0).col2, gives 4.0.
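As a quick sanity check of the logic (ordinary Python, not Hive), the same two tricks can be mimicked with tuples, which also compare element by element:

rows = [(4.0, 2), (4.0, 2), (7.0, 2), (7.0, 2), (8.0, 1)]   # (speed, ab) pairs from abcd

# trick 1: the max of (ab, -speed) pairs compares ab first, then -speed
# trick 2: speed is negated so that taking the max picks the minimum speed
best = max((ab, -speed) for speed, ab in rows)
print(-best[1])   # 4.0, matching the query result above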

expanding dates outside of current range

I am trying to expand a data set to include dates outside of the current range.
The data I have ranges from 1992q1 to 2017q1. Each observation exists within a portion of that larger window, for example from 1993q2 to 1997q1.
I need to create quarterly observations for each range to fill the missing time. I have already expanded the existing data into quarters.
What I cannot figure out how to do is add in those missing quarters. For example, country1 may have the dates 1993q2 to 1997q1. I need to add in the missing dates from 1992q1 to 1993q1 and 1997q2 to 2017q1.
A very simple analogue of what I think is your question is shown by this sandbox dataset.
clear
set obs 10
gen id = cond(_n < 7, 1, 2)
gen qdate = yq(1992, 1) in 1
replace qdate = yq(1992, 3) in 7
bysort id (qdate) : replace qdate = qdate[_n-1] + 1 if missing(qdate)
format qdate %tq
list, sepby(id)
+-------------+
| id qdate |
|-------------|
1. | 1 1992q1 |
2. | 1 1992q2 |
3. | 1 1992q3 |
4. | 1 1992q4 |
5. | 1 1993q1 |
6. | 1 1993q2 |
|-------------|
7. | 2 1992q3 |
8. | 2 1992q4 |
9. | 2 1993q1 |
10. | 2 1993q2 |
+-------------+
fillin id qdate
list, sepby(id)
+-----------------------+
| id qdate _fillin |
|-----------------------|
1. | 1 1992q1 0 |
2. | 1 1992q2 0 |
3. | 1 1992q3 0 |
4. | 1 1992q4 0 |
5. | 1 1993q1 0 |
6. | 1 1993q2 0 |
|-----------------------|
7. | 2 1992q1 1 |
8. | 2 1992q2 1 |
9. | 2 1992q3 0 |
10. | 2 1992q4 0 |
11. | 2 1993q1 0 |
12. | 2 1993q2 0 |
+-----------------------+
So. fillin is a simple way of ensuring that all cross-combinations of identifier and time are present. However, to what benefit? Although not shown in this example, values of other variables spring into existence only as missing values. In some situations, proceeding with interpolation is justified, but usually, you just live with incomplete panels.
How to find solutions like these? One good strategy is to skim through the [D] manual to see what basic data management commands exist.

Oracle Recursive Subquery Factoring convert

I'm trying to use this recursive SQL feature but can't get it to do what I want, not even close. I've coded up the logic as an unrolled loop and am asking whether it can be converted into a single recursive SQL query, rather than the table-update style I've used.
http://sqlfiddle.com/#!4/b7217/1
There are six players to be ranked. They have id, group id, score and rank.
Initial state
+----+--------+-------+--------+
| id | grp_id | score | rank |
+----+--------+-------+--------+
| 1 | 1 | 100 | (null) |
| 2 | 1 | 90 | (null) |
| 3 | 1 | 70 | (null) |
| 4 | 2 | 95 | (null) |
| 5 | 2 | 70 | (null) |
| 6 | 2 | 60 | (null) |
+----+--------+-------+--------+
I want to take the person with the highest initial score and give them rank 1. Then I apply 10 bonus points to the score of everyone who has the same group id. Take the next highest, assign rank 2, distribute bonus points and so on until there are no players left.
User id breaks ties.
The bonus points change the ranking: id=4 initially appears to be in second place with 95, behind the leader with 100, but with the 10-point bonus, id=2 moves up and takes that spot.
Final state
+-----+---------+--------+------+
| ID | GRP_ID | SCORE | RANK |
+-----+---------+--------+------+
| 1 | 1 | 100 | 1 |
| 2 | 1 | 100 | 2 |
| 4 | 2 | 95 | 3 |
| 3 | 1 | 90 | 4 |
| 5 | 2 | 80 | 5 |
| 6 | 2 | 80 | 6 |
+-----+---------+--------+------+
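For reference, the unrolled-loop logic described above can be written out procedurally. This is only an illustration of the ranking rules (sketched in Python), not the recursive SQL being asked for; it reproduces the final state shown above:

players = [   # (id, grp_id, score)
    (1, 1, 100), (2, 1, 90), (3, 1, 70),
    (4, 2, 95), (5, 2, 70), (6, 2, 60),
]
score = {pid: s for pid, _, s in players}
grp = {pid: g for pid, g, _ in players}
rank = {}

for next_rank in range(1, len(players) + 1):
    # highest current score among unranked players, ties broken by lower id
    pid = min((p for p in score if p not in rank), key=lambda p: (-score[p], p))
    rank[pid] = next_rank
    for other in score:                       # 10 bonus points to unranked group mates
        if other not in rank and grp[other] == grp[pid]:
            score[other] += 10

for pid in sorted(rank, key=rank.get):
    print(pid, grp[pid], score[pid], rank[pid])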
This is quite a bit late, but I'm not sure this can be done using a recursive CTE. I did, however, come up with a solution using the MODEL clause:
WITH SAMPLE (ID, GRP_ID, SCORE, RANK) AS (
  SELECT 1, 1, 100, NULL FROM DUAL UNION
  SELECT 2, 1,  90, NULL FROM DUAL UNION
  SELECT 3, 1,  70, NULL FROM DUAL UNION
  SELECT 4, 2,  95, NULL FROM DUAL UNION
  SELECT 5, 2,  70, NULL FROM DUAL UNION
  SELECT 6, 2,  60, NULL FROM DUAL)
SELECT ID, GRP_ID, SCORE, RANK FROM SAMPLE
MODEL
  DIMENSION BY (ID, GRP_ID)
  MEASURES (SCORE, 0 RANK, 0 LAST_RANKED_GRP, 0 ITEM_COUNT, 0 HAS_RANK)
  RULES
  ITERATE (1000) UNTIL (ITERATION_NUMBER = ITEM_COUNT[1,1]) -- ITERATE ONCE FOR EACH ITEM TO BE RANKED
  (
    -- IF THE CURRENT ITEM SCORE IS EQUAL TO THE MAX SCORE OF UNRANKED, ASSIGN A RANK
    RANK[ANY,ANY] = CASE WHEN SCORE[CV(),CV()] = MAX(SCORE) OVER (PARTITION BY HAS_RANK)
                         THEN RANK() OVER (ORDER BY SCORE DESC, ID)
                         ELSE RANK[CV(),CV()] END,
    LAST_RANKED_GRP[ANY,ANY] = FIRST_VALUE(GRP_ID) OVER (ORDER BY RANK DESC),
    SCORE[ANY,ANY] = CASE WHEN RANK[CV(),CV()] = 0 AND CV(GRP_ID) = LAST_RANKED_GRP[CV(),CV()]
                          THEN SCORE[CV(),CV()] + 10
                          ELSE SCORE[CV(),CV()] END,
    ITEM_COUNT[ANY,ANY] = COUNT(*) OVER (),
    -- TO SEPARATE RANKED/UNRANKED ITEMS
    HAS_RANK[ANY,ANY] = CASE WHEN RANK[CV(),CV()] <> 0 THEN 1 ELSE 0 END
  )
ORDER BY RANK;
It's not very pretty, and I suspect there is a better way to go about this, but it does give the expected output.
Caveats:
You'd have to increase the iteration count if you have more than that number of rows.
This does a full re-ranking based on the score after each iteration. So if we took your sample data but changed the initial score of item 2 to 95 rather than 90: after ranking item 1 and giving the 10-point bonus to item 2, it now has a score of 105, so we rank it as 1st and move item 1 down to 2nd. You'd have to make a few modifications if this is not the desired behavior.

turned on bits counter

Suppose I have a black box with 3 inputs (each input is 1 bit) and a 2-bit output.
The black box counts the number of turned-on input bits.
Using only such black boxes, one needs to implement a counter of the turned-on bits in a 7-bit input. The implementation should use the minimum possible number of black boxes.
// This is a job interview question
You're making a binary adder. Try this...
Feed six of the inputs into two black boxes, with one input remaining:
7 6 5 4 3 2 1
| | | | | | |
------- ------- |
| | | | |
| H L | | H L | |
------- ------- |
| | | | |
Take the two low outputs and the remaining input (1) and feed them to another black box:
L L 1
| | |
-------
| |
| C L |
-------
| |
The low output from this black box will be the low bit of the result. The high output is the carry bit. Feed this carry bit along with the high bits from the first two black boxes into the fourth black box:
H H C L
| | | |
------- |
| | |
| H M | |
------- |
| | |
The result should be the number of "on" bits in the input expressed in binary by the High, Middle and Low bits.
Suppose that each BB outputs a 2-bit binary count 00, 01, 10, or 11 when 0, 1, 2, or 3 of its inputs are on. Also suppose that the desired ultimate output O₄O₂O₁ is a 3-bit binary count 000 ... 111 when 0, 1, ... 7 of the 7 input bits i₁...i₇ are on.
For problems like this in general, you can write a boolean expression for what the BB does and a boolean expression for the desired output, and then synthesize the output. In this particular case, however, try the obvious approach of putting i₁, i₂, i₃ into a first box B₁, i₄, i₅, i₆ into a second box B₂, and i₇ into one input of a third box B₃.
Looking at this, it is clear that if you run the units outputs from B₁ and B₂ into the other two inputs of B₃, then the units output from B₃ is equal to the desired value O₁. You can get the sum of the twos outputs from B₁, B₂, B₃ via a box B₄, and this sum is equal to the desired values O₄O₂.
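Both answers can be checked mechanically. The short sketch below (Python, with the black box modelled as a 3-input population count returning a high and a low bit) simulates the four-box construction and verifies it against a plain popcount for all 128 possible 7-bit inputs:

from itertools import product

def black_box(a, b, c):
    # the black box: counts how many of its 3 one-bit inputs are on,
    # returned as (high_bit, low_bit) of a 2-bit binary count
    total = a + b + c
    return total >> 1, total & 1

def count_ones_7bit(bits):
    i1, i2, i3, i4, i5, i6, i7 = bits
    h1, l1 = black_box(i1, i2, i3)       # B1: first three inputs
    h2, l2 = black_box(i4, i5, i6)       # B2: next three inputs
    c, o1 = black_box(l1, l2, i7)        # B3: units outputs plus the remaining input
    o4, o2 = black_box(h1, h2, c)        # B4: twos outputs plus the carry
    return (o4 << 2) | (o2 << 1) | o1    # O4 O2 O1 as an integer

assert all(count_ones_7bit(bits) == sum(bits) for bits in product((0, 1), repeat=7))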
