How to compute record count in Visual FoxPro Table using Header Bytes? - visual-foxpro

We have a table that gets corrupted from time to time with a "Not a table" error. Running the demo version of DBF Doctor, it reports the header has an incorrect record count.
Looking at the one example I could find, I'm not sure how to compute the record count using base 256.
Bytes 4-7 have the number of records in the DBF.
(https://www.dbf2002.com/dbf-file-format.html)
The first 11 bytes of the DBF header, which I read using a small Java program, are:
0 -> 0
1 -> 20
2 -> 3
3 -> 31
4 -> 81
5 -> 113
6 -> 0
7 -> 0
8 -> 0
9 -> 0
10 -> 0
Can anyone help me figure out how to compute the number of records in the DBF?
Want to see if I can write a quick utility to fix this problem.
Thanks in advance,

You can calculate the record count using base 256: for each byte (from least to most significant), multiply the value of that byte by 256^n and sum the results.
For your example:
81 = 81
113 * 256 = 28928
0 * 256 * 256 = 0
0 * 256 * 256 * 256 = 0
81 + 28928 + 0 + 0 = 29009
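The same computation in Python, for example (a minimal sketch; int.from_bytes does the base-256 arithmetic for you, and the filename is a placeholder):

with open("table.dbf", "rb") as f:
    header = f.read(8)
# Bytes 4-7 are the record count, least significant byte first
record_count = int.from_bytes(header[4:8], "little")
print(record_count)  # 29009 for the bytes above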
The calculation is also discussed in this thread, but the code there is FoxPro.
You can get a better understanding of this by looking at your DBF with a hex editor.

If you do have VFP available, there is VFP code to fix file headers, done directly in VFP. It is a relatively simple procedure that recomputes the header's record count and patches it via low-level file open/write operations. I have used it more than once in the past.
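If you'd rather do the fix outside VFP, here is a minimal Python sketch of the same idea, assuming the standard DBF layout (bytes 8-9 hold the header length and bytes 10-11 the record length, both little-endian). Back up the file before patching:

import os

def fix_record_count(path):
    with open(path, "r+b") as f:
        header = f.read(12)
        header_len = int.from_bytes(header[8:10], "little")
        record_len = int.from_bytes(header[10:12], "little")
        # Recompute the count from the file size, then patch bytes 4-7
        true_count = (os.path.getsize(path) - header_len) // record_len
        f.seek(4)
        f.write(true_count.to_bytes(4, "little"))
    return true_count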

Related

hpack encoding integer significance

After reading this, https://httpwg.org/specs/rfc7541.html#integer.representation
I am confused about quite a few things, although I seem to have the overall gist of the idea.
For one, what are the 'prefixes' exactly, and what is their purpose?
For two:
C.1.1. Example 1: Encoding 10 Using a 5-Bit Prefix
The value 10 is to be encoded with a 5-bit prefix.
10 is less than 31 (2^5 - 1) and is represented using the 5-bit prefix.
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| X | X | X | 0 | 1 | 0 | 1 | 0 | 10 stored on 5 bits
+---+---+---+---+---+---+---+---+
What are the leading Xs? What is the starting 0 for?
>>> bin(10)
'0b1010'
>>>
Typing this in the Python IDE, you see almost the same output... Why does it differ?
This is when the number fits within the number of prefix bits though, making it seemingly simple.
C.1.2. Example 2: Encoding 1337 Using a 5-Bit Prefix
The value I=1337 is to be encoded with a 5-bit prefix.
1337 is greater than 31 (2^5 - 1).
The 5-bit prefix is filled with its max value (31).
I = 1337 - (2^5 - 1) = 1306.
I (1306) is greater than or equal to 128, so the while loop body executes:
I % 128 == 26
26 + 128 == 154
154 is encoded in 8 bits as: 10011010
I is set to 10 (1306 / 128 == 10)
I is no longer greater than or equal to 128, so the while loop terminates.
I, now 10, is encoded in 8 bits as: 00001010.
The process ends.
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| X | X | X | 1 | 1 | 1 | 1 | 1 | Prefix = 31, I = 1306
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1306>=128, encode(154), I=1306/128
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 10<128, encode(10), done
+---+---+---+---+---+---+---+---+
The octet-like diagram shows three different numbers being produced... Since the numbers are produced throughout the loop, how do you replicate this octet-like diagram within an integer? What is the actual final result: the diagram, or I being 10, i.e. 00001010?
def f(a, b):
    if a < 2**b - 1:
        print(a)
    else:
        c = 2**b - 1
        remain = a - c
        print(c)
        if remain >= 128:
            while 1:
                e = remain % 128
                g = e + 128
                remain = remain / 128
                if remain >= 128:
                    continue
                else:
                    print(remain)
                    c += int(remain)
                    print(c)
                    break
As I'm trying to figure this out, I wrote a quick Python implementation of it. It seems that I am left with a few useless variables, one being g, which in the documentation is the 26 + 128 == 154.
Lastly, where does 128 come from? I can't find any relation between the numbers besides the fact that 2 raised to the 7th power is 128, but why is that significant? Is it because the first bit is reserved as a continuation flag, and an octet contains 8 bits, so 8 - 1 = 7?
For one, What are the 'prefixes' exactly/what is their purpose?
Integers are used in a few places in HPACK messages, and often they have leading bits that cannot be used for the actual integer. Those leading bits, unavailable to the integer itself, are what the Xs represent. For the purposes of this calculation it doesn't matter what those Xs are: they could be 000, or 111, or 010, etc. Also, there will not always be 3 Xs - that is just an example. There could be one leading X, or two, or four, etc.
For example, to look up a previously HPACK-decoded header, we use 6.1. Indexed Header Field Representation, which starts with a leading 1, followed by the table index value. That 1 is therefore the X in the previous example, and we have 7 bits (instead of only the 5 bits in the original example in your question). If the table index value is less than 127 we can represent it directly using those 7 bits. If it's 127 or more then we need to do some extra work (we'll come back to this).
If it's a new value we want to add to the table (to reuse in future requests), but we already have that header name in the table (so it's just a new value for that name we want as a new entry) then we use 6.2.1. Literal Header Field with Incremental Indexing. This has 2 bits at the beginning (01 - which are the Xs), and we only have 6-bits this time to represent the index of the name we want to reuse. So in this case there are two Xs.
So don't worry about there being 3 Xs - that's just an example. In the above examples there was one X (as first bit had to be 1), and two Xs (as first two bits had to be 01) respectively. The Integer Representation section is telling you how to handle any prefixed integer, whether prefixed by 1, 2, 3... etc unusable "X" bits.
What are the leading Xs? What is the starting 0 for?
The leading Xs are discussed above. The starting 0 is just because, in this example, we have 5 bits to represent the integer and only need 4, so we pad it with a 0. If the value to encode was 20 it would be 10100. If the value was 40, we couldn't fit it in 5 bits, so we would need to do something else.
Typing this in the Python IDE, you see almost the same output... Why does it differ?
Python uses 0b to show it's a binary number. It doesn't bother showing any leading zeros. So 0b1010 is the same as 0b01010 and also the same as 0b00001010.
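If you want the leading zeros, you can ask Python for a fixed width, for example:
>>> format(10, '08b')
'00001010'
>>> format(10, '05b')
'01010'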
This is when the number fits within the number of prefix bits though, making it seemingly simple.
Exactly. If you need more than the number of bits you have, you don't have space for it. You can't just use more bits as HPACK will not know whether you are intending to use more bits (so should look at next byte) or if it's just a straight number (so only look at this one byte). It needs a signal to know that. That signal is using all 1s.
So to encode 40 in 5 bits, we use 11111 to say "it doesn't fit - overflow to the next byte". 11111 in binary is 31, so we know the value is at least that; rather than waste it, we use it and subtract it from the 40, leaving 9 to encode in the next byte. A new additional byte gives us 8 new bits to play with (well, actually only 7, as we'll soon discover, since the first bit is used to signal a further overflow). This is enough, so we can use 00001001 to encode our 9. So our complex number is represented in two bytes: XXX11111 and 00001001.
If we want to encode a value bigger than can fit in the first prefixed byte, AND the leftover is bigger than the 127 that would fit into the available 7 bits of the second byte, then we can't use this overflow mechanism with just two bytes. Instead we use another "overflow, overflow" mechanism using three bytes:
For this "overflow, overflow" mechanism, we set the first byte bits to 1s as usual for an overflow (XXX11111) and then set the first bit of the second byte to 1. This leaves 7 bits available to encode the value, plus the next 8 bits in the third byte we're going to have to use (actually only 7 bits of the third byte, because again it uses the first bit to indicate another overflow).
There are various ways they could have gone about using the second and third bytes. What they decided to do was encode this as two numbers: the 128 mod, and the 128 multiplier.
1337 = 31 + (128 * 10) + 26
So that means the first byte is set to 31 as per the previous example, the second byte is set to 26 (which is 0011010 in 7 bits) plus a leading 1 to show we're using the overflow-overflow method (so 10011010), and the third byte is set to 10 (or 00001010).
So 1337 is encoded in three bytes: XXX11111 10011010 00001010 (with X set to whatever those values were).
Using the 128 mod and multiplier is quite efficient and means this large number (and in fact any number up to 16,383) can be represented in three bytes, which is, not coincidentally, also the largest integer that can be represented in 7 + 7 = 14 bits. But it does take a bit of getting your head around!
If it's bigger than 16,383 then we need to do another round of overflow in a similar manner.
All this seems horrendously complex but is actually relatively simple, and efficient, to code up. Computers can do this pretty easily and quickly.
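For instance, here is a minimal Python sketch of the encoder described above (the function name and the leading parameter for the X bits are my own):

def encode_int(value, prefix_bits, leading=0):
    # Encode an integer with an N-bit prefix (RFC 7541, section 5.1).
    # leading carries whatever the X bits should be, already shifted into place.
    max_prefix = (1 << prefix_bits) - 1   # e.g. 31 for a 5-bit prefix
    if value < max_prefix:
        return bytes([leading | value])   # fits in the prefix: one byte, done
    out = [leading | max_prefix]          # fill the prefix with all 1s
    value -= max_prefix
    while value >= 128:
        out.append((value % 128) + 128)   # low 7 bits plus the continuation bit
        value //= 128
    out.append(value)                     # final byte, continuation bit clear
    return bytes(out)

print([format(b, '08b') for b in encode_int(10, 5)])    # ['00001010']
print([format(b, '08b') for b in encode_int(1337, 5)])  # ['00011111', '10011010', '00001010']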
It seems that I am left with a few useless variables, one being g
You are not printing this value (g) in the if branch of the loop; you only print the leftover value in the else. You need to print both.
which in the documentation is the 26 + 128 == 154.
Lastly, where does 128 come from? I can't find any relation between the numbers besides the fact that 2 raised to the 7th power is 128, but why is that significant? Is it because the first bit is reserved as a continuation flag, and an octet contains 8 bits, so 8 - 1 = 7?
Exactly: it's because the first bit (value 128) needs to be set, as per the explanation above, to show we are continuing/overflowing into needing a third byte.
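To close the loop, a matching decoder sketch in the same spirit (again, the names are my own): it reads the prefix, then keeps consuming bytes while that first bit (value 128) is set.

def decode_int(data, prefix_bits):
    # Decode a prefixed integer; returns (value, number of bytes consumed).
    max_prefix = (1 << prefix_bits) - 1
    value = data[0] & max_prefix          # mask off the X bits
    if value < max_prefix:
        return value, 1
    shift, i = 0, 1
    while True:
        b = data[i]
        value += (b & 0x7F) << shift      # low 7 bits are payload
        i += 1
        if not b & 0x80:                  # high bit clear: last byte
            return value, i
        shift += 7

print(decode_int(bytes([0b00011111, 0b10011010, 0b00001010]), 5))  # (1337, 3)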

Subsetting Data with GREP

I have a very large text file (16GB) that I want to subset as fast as possible.
Here is a sample of the data involved
0 M 4 0
0 0 Q 0 10047345 3080290,4098689 50504886,4217515 9848058,1084315 50534229,4217515 50591618,4217515 26242582,2597528 34623075,3279130 68893581,5149883 50628761,4217517 32262001,3142702 35443881,3339757
0 108 C 0 50628761
0 1080 C 0 50628761
1 M 7 0
1 0 Q 0 17143989
2 M 15 1
2 0 Q 0 17143989 4219157,1841361,853923,1720163,1912374,1755325,4454730 65548702,4975721 197782,39086 54375043,4396765 31589696,3091097 6876504,851594 3374640,455375 13274885,1354902 31585771,3091016 61234218,4723345 31583582,3091014
2 27 C 0 31589696
The first number on every line is a sessionID and any line with an 'M' denotes the start of a session (data is grouped by session). The number following an M is a Day and the second number is a userID, a user can have multiple sessions.
I want to extract all lines related to a specific user which for each session include all of the lines up until the next 'M' line is encountered (can be any number of lines). As a second task I also want to extract all session lines related to a specific day.
For example with the above data, to extract the records for userid '0' the output would be:
0 M 4 0
0 0 Q 0 10047345 3080290,4098689 50504886,4217515 9848058,1084315 50534229,4217515 50591618,4217515 26242582,2597528 34623075,3279130 68893581,5149883 50628761,4217517 32262001,3142702 35443881,3339757
0 108 C 0 50628761
0 1080 C 0 50628761
1 M 7 0
1 0 Q 0 17143989
To extract the records for day 7 the output would be:
1 M 7 0
1 0 Q 0 17143989
I believe there is a much more elegant and simple solution to what I have achieved so far and it would be great to get some feedback and suggestions. Thank you.
What I have tried
I tried to use pcregrep -M to apply this pattern directly (matching data between two M's) but struggled to get it working across the line breaks. I still suspect this may be the fastest option, so any guidance on whether it may be possible would be great.
The next part is quite scattered and it is not necessary to read on if you already have an idea for a better solution!
Failing the above, I split the problem into two parts:
Part 1: Isolating all 'M' lines to obtain a list of sessions which belong to that user/day
grep method is fast (then need to figure out how to use this data)
time grep -c "M\t.*\t$user_id" trainSample.txt >> sessions.txt
awk method to create an array is slow
time myarr=( $(awk "/M\t.*\t$user_id/ {print \$1}" trainSample.txt) )
Part 2: Extracting all lines belonging to a session on the list created in part 1
Continuing from the awk method, I ran grep for each but this is WAY too slow (days to complete 16GB)
for i in "${!myarr[@]}";
do
grep "^${myarr[$i]}\t" trainSample.txt >> sessions.txt
echo -ne "Session $i\r"
done
Instead of running grep once per session ID as above, putting them all in one grep command is MUCH faster (I ran it with 8 session IDs in a [1|2|3|..|8] format and it took the same time as each did separately, i.e. 8X faster). However, I then need to figure out how to build that pattern dynamically.
Update
I have actually established a working solution which only takes seconds to complete, but it is somewhat messy and inflexible bash code which I have yet to extend to the second case (isolating by days).
I want to extract all lines related to a specific user which for each session include all of the lines up until the next 'M' line is encountered (can be any number of lines).
$ awk '$2=="M"{p=$4==0}p' file
0 M 4 0
0 0 Q 0 10047345 3080290,4098689 50504886,4217515 9848058,1084315 50534229,4217515 50591618,4217515 26242582,2597528 34623075,3279130 68893581,5149883 50628761,4217517 32262001,3142702 35443881,3339757
0 108 C 0 50628761
0 1080 C 0 50628761
1 M 7 0
1 0 Q 0 17143989
As a second task I also want to extract all session lines related to a specific day.
$ awk '$2=="M"{p=$3==7}p' file
1 M 7 0
1 0 Q 0 17143989
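If you outgrow the one-liners, the same state-machine logic is easy to express in Python and still makes a single pass over the 16GB file (a sketch; the fields are assumed to be tab-separated, as in your grep attempts):

import sys

def extract(path, user=None, day=None):
    # Print every session whose M line matches the given user or day.
    emit = False
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 3 and fields[1] == "M":
                # An M line starts a session: decide whether to emit it
                emit = ((user is not None and fields[3] == user) or
                        (day is not None and fields[2] == day))
            if emit:
                sys.stdout.write(line)

extract("trainSample.txt", user="0")  # or extract("trainSample.txt", day="7")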

Increasing performance in file manipulation

I have a file which contains a matrix of numbers as following:
0 10 24 10 13 4 101 ...
6 0 52 10 4 5 0 4 ...
3 4 0 86 29 20 77 294 ...
4 1 1 0 78 100 83 199 ...
5 4 9 10 0 58 8 19 ...
6 58 60 13 68 0 148 41 ...
. .
. .
. .
What I am trying to do is sum each row and output the sum of each row to a new file (with the sum of each row on a new line).
I have tried doing it in Haskell using ByteStrings, but the performance is 3 times as slow as the Python implementation. Here is the Haskell implementation:
import qualified Data.ByteString.Char8 as B

-- This function is for summing a row
sumrows r = foldr (\x y -> (maybe 0 (*1) $ fst <$> (B.readInt x)) + y) 0 (B.split ' ' r)

-- This function is for mapping the sumrows function to each line
sumfile f = map (\x -> (show x) ++ "\n") (map sumrows (B.split '\n' f))

main = do
  contents <- B.readFile "telematrix"
  -- I get the sum of each line, and then pack up all the results so that it can be written
  B.writeFile "teleDensity" $ (B.pack . unwords) (sumfile contents)
  print "complete"
This takes about 14 seconds for a 25 MB file.
Here is the Python implementation:
fd = open("telematrix", "r")
nfd = open("teleDensity", "w")
for line in fd:
    nfd.write(str(sum(map(int, line.split(" ")))) + "\n")
fd.close()
nfd.close()
This takes about 5 seconds for the same 25 MB file.
Any suggestions on how to increase the Haskell implementation?
It seems that the problem was that I was running the program with runhaskell instead of compiling it with ghc and then running the binary. By compiling first and then running, I brought the Haskell version down to about 1 second.
At a glance, I would bet your first bottleneck is the ++ on strings in sumfile, which copies and rebuilds the left operand each time. Instead of appending "\n" to the end, you could replace the unwords call with unlines, which does exactly what you want here. That should get you a nice little speed boost.
A more minor nitpick is that the (*1) in the maybe function is unneeded. Using id there would be more efficient, since (*1) wastes a multiplication operation, but that's no more than a few processor cycles.
Then finally, I have to ask why you're using ByteStrings here. ByteStrings store string data efficiently as an array, like traditional strings in a more imperative language. However, what you're doing here involves splitting the string and iterating over the elements, which are operations that linked lists are well suited for. I would honestly recommend using the traditional [Char] type in this case. That B.split call may be what's hurting you, since it has to copy the entire line into separate arrays of the split form, whereas the words function on linked lists of characters simply splits the linked structure at a few points.
The main reason for the poor performance was that I was using runhaskell instead of first compiling and then running the program. So I switched from:
runhaskell program.hs
to
ghc program.hs
./program

Consolidate 10 bit Value into a Unique Byte

As part of an algorithm I'm writing, I need to find a way to convert a 10-bit word into a unique 8-bit word. The 10-bit word is made up of 5 pairs, where each pair can only ever equal 0, 1 or 2 (never 3). For example:
|00|10|00|01|10|
This value needs to somehow be consolidated into a single, unique byte.
As each pair can never equal 3, there are a wide range of values that this 10-bit word will never represent, which makes me think that it is possible to create an algorithm to perform this conversion. The simplest way to do this would be to use a lookup table, but it seems like a waste of resources to store ~680 values which will only be used once in my program. I've already tried to incorporate one of the pairs into the others somehow, but every attempt I've made has resulted in a non-unique value, and I'm now very quickly running out of ideas!
Any help?
The number you have is essentially base 3. You just need to convert this to base 2.
There are 5 pairs, so 3^5 = 243 numbers. And 8 bits is 2^8 = 256 numbers, so it's possible.
The simplest way to convert between bases is to go to base 10 first.
So, for your example:
00|10|00|01|10
Base 3: 02012
Base 10: 0*3^4 + 2*3^3 + 0*3^2 + 1*3^1 + 2*3^0
= 54 + 3 + 2
= 59
Base 2:
59 % 2 = 1
/2 29 % 2 = 1
/2 14 % 2 = 0
/2 7 % 2 = 1
/2 3 % 2 = 1
/2 1 % 2 = 1
So 111011 is your number in binary
This explains the above process in a bit more detail.
Note that once you have 59 above stored in a 1-byte integer, you'll probably already have what you want, thus explicitly converting to base 2 might not be necessary.
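You can sanity-check the arithmetic in Python, since int accepts a base argument:
>>> int("02012", 3)
59
>>> bin(59)
'0b111011'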
What you basically have is a base-3 number, and you want to convert it to a single number 0 - 255. Luckily, 5 digits in ternary (base 3) give 243 combinations.
What you'll need to do is:
Digit Action
( 1st x 3^4)
+ (2nd x 3^3)
+ (3rd x 3^2)
+ (4th x 3)
+ (5th)
This will give you a number 0 to 242.
You are considering storing some information in a byte. A byte can represent at most 2^8 = 256 states. Your word has a total of 3^5 = 243 < 256 states, which makes the conversion possible.
Consider your pairs are ABCDE (each character can be 0, 1 or 2)
You can just calculate A*3^4 + B*3^3 + C*3^2 + D*3 + E as your result. I guarantee the result will be in the range 0 - 242, so it fits in a byte.
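Putting the answers together, a minimal Python sketch that pulls the five pairs out of the 10-bit word and accumulates them as base-3 digits (the function name is mine):

def pack_pairs(word10):
    # Pack five 2-bit pairs (each 0, 1 or 2) into one byte, high pair first.
    result = 0
    for shift in (8, 6, 4, 2, 0):         # extract the pairs from high to low
        pair = (word10 >> shift) & 0b11
        assert pair != 3, "pairs may only be 0, 1 or 2"
        result = result * 3 + pair        # accumulate as a base-3 number
    return result                         # always in 0..242, so it fits a byte

print(pack_pairs(0b0010000110))  # the |00|10|00|01|10| example: prints 59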

ELKI - input distance matrix

I'm trying to use ELKI for outlier detection; I have my custom distance matrix and I'm trying to input it to ELKI to perform LOF (for example, as a first step).
I tried to follow http://elki.dbs.ifi.lmu.de/wiki/HowTo/PrecomputedDistances but it is not very clear to me. Here is what I do:
I don't want to load data from database so I use:
-dbc DBIDRangeDatabaseConnection -idgen.count 100
(where 100 is the number of objects I'll be analyzing)
I use the LOF algorithm and point it at the external distance file:
-algorithm outlier.LOF
-algorithm.distancefunction external.FileBasedDoubleDistanceFunction
-distance.matrix testData.ascii -lof.k 3
My distance file is as follows (very simple for testing purposes)
0 0 0
0 1 1
0 2 0.2
0 3 0.1
1 1 0
1 2 0.9
1 3 0.9
2 2 0
2 3 0.2
3 3 0
4 0 0.23
4 1 0.97
4 2 0.15
4 3 0.07
4 4 0
5 0 0.1
5 1 0.85
5 2 0.02
5 3 0.15
5 4 0.1
5 5 0
6 0 1
6 1 1
6 2 1
6 3 1
etc
The results say: "all in one trivial clustering", but this is not a clustering, and there definitely are outliers in my data.
Am I doing this right, or what am I missing?
When using DBIDRangeDatabaseConnection, and not giving ELKI any actual data, the visualization cannot produce a particularly useful result (because it doesn't have the actual data, after all). Nor can the data be evaluated automatically.
The "all in one trivial clustering" is an artifact from the automatic attempts to visualize the data, but for the reasons discussed above this cannot work. This clustering is automatically added for unlabeled data, to allow some visualizations to work.
There are two things for you to do:
set an output handler. For example -resulthandler ResultWriter, which will produce an output similar to this:
ID=0 lof-outlier=1.0
Where ID= is the object number, and lof-outlier= is the LOF outlier score.
Alternatively, you can implement your own output handler. An example is found here:
http://elki.dbs.ifi.lmu.de/browser/elki/trunk/src/tutorial/outlier/SimpleScoreDumper.java
Fix DBIDRangeDatabaseConnection. You are, however, bitten by a bug in ELKI 0.6.0~beta1: DBIDRangeDatabaseConnection doesn't initialize its parameters correctly.
The trivial bug fix (parameters not initialized correctly in the constructor) is here:
http://elki.dbs.ifi.lmu.de/changeset/11027/elki
Alternatively, you can create a dummy input file and use the regular text input. A file containing
0
1
2
...
should do the trick. Use -dbc.in numbers100.txt -dbc.filter FixedDBIDsFilter -dbc.startid 0. The latter arguments make your IDs start at 0, not 1 (the default).
This workaround will produce a slightly different output format:
ID=0 0.0 lof-outlier=1.0
where the additional column is from the dummy file. The dummy values will not affect the algorithm result of LOF, when an external distance function is used; but this approach will use some additional memory.
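For reference, one way to generate the dummy ID file used in this workaround (a trivial Python sketch; adjust the count and filename to your data):

with open("numbers100.txt", "w") as f:
    for i in range(100):  # one dummy object per line, IDs 0..99
        f.write("%d\n" % i)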
