File seek with two-byte characters - go

I'm writing a small log parser that should find certain tags in files.
The files are large (512 MB) and have the following structure:
[2018.07.10 00:30:03:125] VersionInfo\886
...some data...
[2018.07.10 00:30:03:109][TraceID: 8HRWSI105YVO91]->IncomingTime\16
...some data...
[2018.07.10 00:30:03:109][TraceID: 8HRWSI105YVO91]->IncomingData\397
...some data...
[2018.07.10 00:30:03:749][TraceID: 8HRWSI105YVO91]->OutgoingData\26651
...somedata...
Each block (IncomingTime, IncomingData, OutgoingData, etc.) has its size at the end of the header line, as a character count rather than a byte count: 886, 16, 397, 26651. Some blocks are very large and can't be read without a large buffer (if I use bufio). I want to skip unnecessary blocks using file.Seek.
The problem is that file.Seek needs a length in bytes and I only have a character count (a block may contain Unicode data with two-byte characters). Is there any way to skip blocks using the character count?

That's actually impossible. As you've described the file format, both of the following are possible:
...VersionInfo\1
[ 20 ]
...VersionInfo\1
[ C2 A0 ]
If you've just read the newline and you know you need to read one character, you know it's somewhere between 1 and 2 bytes (UTF-8 characters can go up to 4 bytes even) but not which, and blindly launching forward some number of bytes without inspecting the intermediate data won't work. The pathological case is a larger block, where the first half has many multi-byte characters and the last half has text that happens to look like one of your entry headers.
With this file format you're forced to read it a character at a time.
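Reading a character at a time does not have to mean decoding everything. The following is a minimal sketch, in Python rather than Go, under two assumptions the question does not confirm: the file is valid UTF-8, and the block size counts Unicode code points. In UTF-8, every byte that is not a continuation byte (bit pattern 10xxxxxx) starts a new character, so counting those bytes locates the end of the block. `skip_chars` and `chunk_size` are hypothetical names.

```python
import io


def skip_chars(stream, n, chunk_size=64 * 1024):
    """Advance a binary stream past n UTF-8 characters.

    Assumes valid UTF-8 and that a "character" is one code point.
    """
    seen = 0  # characters whose first byte has been passed so far
    while True:
        chunk_start = stream.tell()
        chunk = stream.read(chunk_size)
        if not chunk:
            if seen == n:
                return  # the n-th character ended exactly at end of file
            raise EOFError("stream ended before %d characters were read" % n)
        for i, byte in enumerate(chunk):
            # Continuation bytes look like 10xxxxxx; anything else starts a character.
            if (byte & 0xC0) != 0x80:
                if seen == n:
                    # This byte begins character n+1, so the skip ends here.
                    stream.seek(chunk_start + i)
                    return
                seen += 1


# "a" is 1 byte, "é" is 2 bytes, "€" is 3 bytes: three characters, six bytes.
sample = io.BytesIO("aé€b".encode("utf-8"))
skip_chars(sample, 3)
print(sample.tell())  # byte offset just past the three characters
```

This still touches every byte of the skipped block, which matches the answer's point, but it only needs a small fixed buffer regardless of block size.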

Related

How do I interpret a python byte string coming from F1 2020 game UDP packet?

Title may be wildly incorrect for what I'm trying to work out.
I'm trying to interpret packets I am receiving from a racing game in a way that I understand, but I honestly don't really know what I'm looking at, or what to search for to understand it.
Information on the packets I am receiving is here:
https://forums.codemasters.com/topic/54423-f1%C2%AE-2020-udp-specification/?tab=comments#comment-532560
I'm using python to print the packets, here's a snippet of the output, which I don't understand how to interpret.
received message: b'\xe4\x07\x01\x03\x01\x07O\x90.\xea\xc2!7\x16\xa5\xbb\x02C\xda\n\x00\x00\x00\xff\x01\x00\x03:\x00\x00\x00 A\x00\x00\xdcB\xb5+\xc1#\xc82\xcc\x10\t\x00\xd9\x00\x00\x00\x00\x00\x12\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00$tJ\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01
I'm very new to coding, and not sure what my next step is, so a nudge in the right direction will help loads, thanks.
This is the python code:
import socket
UDP_IP = "127.0.0.1"
UDP_PORT = 20777
sock = socket.socket(socket.AF_INET,     # Internet
                     socket.SOCK_DGRAM)  # UDP
sock.bind((UDP_IP, UDP_PORT))
while True:
    data, addr = sock.recvfrom(4096)
    print("received message:", data)
The website you link to describes the data format. All data is represented as a series of 1's and 0's. A byte is a series of 8 1's and 0's. However, just because you have a series of bytes doesn't mean you know how to interpret them. Do they represent a character? An integer? Can that integer be negative? All of that is defined by whoever crafted the data in the first place.
The type descriptions you see at the top tell you how to actually interpret that series of 1's and 0's. When you see "uint8", that is an "unsigned integer that is 8 bits (1 byte) long". In other words, a non-negative number between 0 and 255. An "int8", on the other hand, is a signed 8-bit integer, a number that can be positive or negative (so the range is -128 to 127). The same basic idea applies to the *16 and *64 variants, just with 16 bits or 64 bits. A float represents a floating point number (a number with a fractional part, such as 1.2345), generally 4 bytes long. Additionally, you need to know the order in which to interpret the bytes within a word (left-to-right or right-to-left). This is referred to as the endianness, and every computer architecture has a native endianness (big-endian or little-endian).
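To make the signed/unsigned and endianness points concrete, here is a small sketch using Python's struct module. The two-byte value is taken from the start of the packet in the question; the variable names are only illustrative.

```python
import struct

raw = b"\xff"
as_uint8 = struct.unpack("B", raw)[0]  # same bits read as unsigned 8-bit
as_int8 = struct.unpack("b", raw)[0]   # same bits read as signed 8-bit

two_bytes = b"\xe4\x07"  # first two bytes of the packet in the question
little = struct.unpack("<H", two_bytes)[0]  # little-endian: 0x07e4
big = struct.unpack(">H", two_bytes)[0]     # big-endian: 0xe407

print(as_uint8, as_int8)  # 255 -1
print(little, big)        # 2020 58375
```

The little-endian reading gives 2020, which is a plausible packet format value for this game, while the big-endian reading gives a number that means nothing here.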
Given all of that, you can interpret the PacketHeader. The easiest way is probably to use the struct package in Python. Details can be found here:
https://docs.python.org/3/library/struct.html
As a proof of concept, the following will interpret the first 24 bytes:
import struct
data = b'\xe4\x07\x01\x03\x01\x07O\x90.\xea\xc2!7\x16\xa5\xbb\x02C\xda\n\x00\x00\x00\xff\x01\x00\x03:\x00\x00\x00 A\x00\x00\xdcB\xb5+\xc1#\xc82\xcc\x10\t\x00\xd9\x00\x00\x00\x00\x00\x12\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00$tJ\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01'
#Note that I am only taking the first 24 bytes. You must pass data that is
#the appropriate length to the unpack function. We don't know what everything
#else is until after we parse out the header
header = struct.unpack('<HBBBBQfIBB', data[:24])
print(header)
You basically want to read the first 24 bytes to get the header of the message. From there, you need to use the m_packetId field to determine what the rest of the message is. As an example, this particular packet has a packetId of 7, which is a "Car Status" packet. So you would look at the packing format for the struct CarStatus further down on that page to figure out how to interpret the rest of the message. Rinse and repeat as data arrives.
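The "read the header, then dispatch on the packet id" step could be sketched as follows. This is only a sketch: apart from m_packetId, the field names are assumptions and should be checked against the linked specification, and the id-to-name table contains only the one mapping mentioned above. `parse_header` and `PACKET_NAMES` are hypothetical names.

```python
import struct
from collections import namedtuple

HEADER_FORMAT = "<HBBBBQfIBB"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes

# Field names other than m_packetId are assumptions; verify them in the spec.
PacketHeader = namedtuple("PacketHeader", [
    "m_packetFormat", "m_gameMajorVersion", "m_gameMinorVersion",
    "m_packetVersion", "m_packetId", "m_sessionUID", "m_sessionTime",
    "m_frameIdentifier", "m_playerCarIndex", "m_secondaryPlayerCarIndex",
])

# Only the entry for id 7 comes from the answer; the rest are in the spec.
PACKET_NAMES = {7: "Car Status"}


def parse_header(data):
    """Unpack the fixed-size header from the front of a UDP payload."""
    if len(data) < HEADER_SIZE:
        raise ValueError("packet shorter than header")
    return PacketHeader._make(struct.unpack(HEADER_FORMAT, data[:HEADER_SIZE]))


# First 24 bytes of the packet shown in the question.
sample = (b"\xe4\x07\x01\x03\x01\x07O\x90.\xea\xc2!7\x16"
          b"\xa5\xbb\x02C\xda\n\x00\x00\x00\xff")
header = parse_header(sample)
print(header.m_packetId, PACKET_NAMES.get(header.m_packetId, "unknown"))
```

In the receive loop, the bytes after the first HEADER_SIZE would then be unpacked with whatever format the specification gives for that packet id.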
Update: In the format string, the < tells struct to interpret the bytes as little-endian with no alignment (based on the fact that the documentation says the data is little-endian and packed). I would recommend reading through the entire section on Format Characters in the documentation above to fully understand what is happening regarding alignment, but in a nutshell, without a prefix struct will try to align the fields the way they would be laid out in memory, which may not match exactly the format you specify. In this case, HBBBBQ takes up 2 bytes more than you'd expect. This is because your computer will try to lay out structs in memory so that they are word-aligned. Your computer architecture determines the word alignment (on a 64-bit computer, words are 64 bits, or 8 bytes, long). A Q takes a full word, so the packer will try to align everything before the Q to a word. However, HBBBB only requires 6 bytes, so Python will, by default, pad an extra 2 bytes to make sure everything lines up. Using < at the front ensures both that the bytes will be interpreted in the correct order and that struct won't try to align them.
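The padding can be observed directly with struct.calcsize. A short sketch; the native size is platform-dependent, so it is printed here rather than assumed.

```python
import struct

# "<" means little-endian, standard sizes, no padding: 2 + 4*1 + 8 bytes.
packed_size = struct.calcsize("<HBBBBQ")

# "@" (the default) means native byte order, sizes and alignment.
# On common 64-bit platforms this inserts 2 padding bytes before the Q.
native_size = struct.calcsize("@HBBBBQ")

# The full header format used above, with no padding.
header_size = struct.calcsize("<HBBBBQfIBB")

print(packed_size, native_size, header_size)
```

If native_size is larger than packed_size on your machine, unpacking the header without the < prefix would need more bytes than the packed header provides and would also read the fields from the wrong positions.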
Just for information, in case someone else is looking for this: there is a Python library, f1-2020-telemetry. Its documentation is missing a "how to use" part, so here is a snippet:
import socket
from f1_2020_telemetry.packets import *
...
udp_socket = socket.socket(family=socket.AF_INET, type=socket.SOCK_DGRAM)
udp_socket.bind((host, port))
while True:
    udp_packet = udp_socket.recv(2048)
    packet = unpack_udp_packet(udp_packet)
    if isinstance(packet, PacketSessionData_V1):  # refer to doc for classes / attribute
        print(packet.trackTemperature)  # for example
    if isinstance(packet, PacketParticipantsData_V1):
        for i, participant in enumerate(packet.participants):
            print(DriverIDs[participant.driverId])  # the library has some mapping for pilot name / track name / ...
Regards,
Nicolas

Print lines around position in the file

I'm importing a big CSV file (5 GB) into BigQuery and got an error report that gives the position of the error as a byte offset from the start of the file (for example, 134683757). I'd like to look at the lines around this error position.
Some example lines of the file:
field1, field2, field3
abc, bcd, efg
...
dge, hfr, kdf,
dgj, "a""a", fbd # in this line is an invalid csv element and I get error, let's say on the position 134683757
skd, frd, lqw
...
asd, fij, fle
I need some command to show lines around error like
dge, hfr, kdf,
dgj, "a""a", fbd
skd, frd, lqw
I tried sed and awk but I didn't find any simple solution.
It was definitely not clear from the original version of the question that you only got a byte offset from the start of the file.
You need to get a better position from the software generating the error; the developer was lazy in reporting an unusable number. It is reasonable to request a line number (and preferably offset within the line), rather than (or as well as) the byte offset from the start.
Assuming that the number is a byte position in the file, that gets tricky. Most Unix utilities work with lines (of variable length). I'd be tempted to write some C code to do the job, but that might be beyond you (and no shame in that).
Failing that, your best bet is likely the dd command. If the number reported is 134683757, then I'd guess that your lines are probably not more than 1 KiB each (adjust the numbers if they're bigger, or smaller), and then use:
dd if=big.csv of=extract.csv bs=1 skip=$((134683757 - 3 * 1024)) count=6144
echo >> extract.csv
You'd then look at extract.csv. The raw dd output probably won't have a newline at the end of the last line (the echo >>extract.csv fixes that). The output will probably start part way through a record and end part way through another record. However, you're likely to have the relevant information, as well as some irrelevant information. As I said, adjust the numbers to suit your exact situation.
The trickiest part is identifying exactly where the byte offset is in the file you get. With custom C code, that can be provided easily (more easily). With the output from dd, you have to do the calculation yourself.
awk -v offset=$((134683757 - 3 * 1024)) '
{ printf "%9d: %s\n", offset, $0; offset += length($0) + 1 }
' extract.csv
That takes the starting offset from the dd command, and prefixes the (remnants of) the first line with that number and the data; it then adds the length to the offset plus one for the newline that wasn't counted, and continues to the end of the file. That gives you the start offset for each line in the extracted data. You can see where your actual start was by looking at the offsets — you should be able to identify which record that was.
You could use a variant of this Awk script that reads the whole file line by line, and tracks the offset (as well as the line numbers) and prints the data when it gets to the vicinity of where you have the problem.
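The same windowing idea as the dd and awk steps can also be sketched in Python, which avoids the manual offset arithmetic. This is a sketch under the same guess as above (lines of roughly 1 KiB or less, so a 3 KiB window on each side is enough); `lines_around` and `window` are hypothetical names, and the first and last lines returned may be partial.

```python
def lines_around(path, offset, window=3 * 1024):
    """Return (start_offset, line_bytes, contains_offset) for lines near offset.

    Reads only about 2 * window bytes, so it is cheap even on a multi-GB file.
    """
    start = max(0, offset - window)
    with open(path, "rb") as f:
        f.seek(start)
        chunk = f.read(2 * window)

    result = []
    pos = start
    for line in chunk.splitlines(keepends=True):
        end = pos + len(line)
        # The flagged line is the one whose byte range covers the offset.
        result.append((pos, line, pos <= offset < end))
        pos = end
    return result
```

A possible use is to loop over `lines_around("big.csv", 134683757)` and print each start offset and line, marking the entry whose third element is true. Working in bytes rather than decoded text keeps the offsets exact even if the file contains multi-byte characters.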
In times long past, I had to deal with data from 1/2 inch mag tapes (those big circular tapes you see in old movies) where the files generated on a mainframe seemed sanely formatted for the first few tens of megabytes, but then the format changed to some alternative format for a few megabytes, and then reverted to the original format once more. I never did find out why; I just learned how to deal with it. Trial and error!

AVAssetWriter - How do I get a byte count of what has been written?

I am writing a MOV file, in which I am supplying a bunch of CMSampleBuffers to pass along to an AVAssetWriterInput object.
While this is going on, I am tracking the byte size of the compressed data inside the CMSampleBuffers to write to a log file on the system.
The only thing that I am missing, is the MOV header size.
The difference between my count and the saved file size, is typically about 2000 bytes or so of data. I can't figure out how to get the exact size written to the file system from AVAssetWriter.
Now, I could just find the file size after the MOV file is closed, but for some reason the NSFileSize value from NSFileManager's attributesOfItemAtPath never matches the byte count I see when I look at the file in the bash shell.
Suggestions are welcome!
bob.

Should I expect JPEG SOI marker at very beginning of the data stream?

... or should I go deeper into the data stream looking for 0xFF 0xD8 sequence?
From this Q, I've learned that APPn does not have to follow SOI immediately. Are there specification-compliant JPEG cases where the SOI position != beginning of the stream?
A quote from the specification (Annex B, § 1.1.2):
Markers serve to identify the various structural parts of the
compressed data formats. Most markers start marker segments containing
a related group of parameters; some markers stand alone. All markers
are assigned two-byte codes: an X’FF’ byte followed by a byte which is
not equal to 0 or X’FF’ (see Table B.1). Any marker may optionally be
preceded by any number of fill bytes, which are bytes assigned code
X’FF’.
libjpeg does not allow garbage before the SOI:
/* Like next_marker, but used to obtain the initial SOI marker. */
/* For this marker, we do not allow preceding garbage or fill; otherwise,
* we might well scan an entire input file before realizing it ain't JPEG.
* If an application wants to process non-JFIF files, it must seek to the
* SOI before calling the JPEG library.
*/
From: Random libjpeg mirror.
The Go implementation, for example, also does not allow preceding garbage.
However, if in doubt, stick to Postel's Law:
Be liberal in what you accept, and conservative in what you send
Although, you don't want to be too liberal, or you might end up extracting not the actual JPEG from the stream but the embedded EXIF thumbnail or something like that.
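The two policies can be contrasted in a short Python sketch: a strict check in the spirit of the libjpeg comment, and a lenient but bounded search in the spirit of Postel's Law. The helper names and the `max_scan` limit are hypothetical choices for illustration, not taken from any library.

```python
SOI = b"\xff\xd8"  # JPEG Start Of Image marker


def starts_with_soi(data):
    """Strict check: the stream must begin with the SOI marker."""
    return data[:2] == SOI


def find_soi(data, max_scan=1024):
    """Lenient check: offset of the first SOI within the first max_scan bytes.

    Returns -1 if no SOI lies entirely inside that range.
    """
    return data.find(SOI, 0, max_scan)


print(starts_with_soi(b"\xff\xd8\xff\xe0"))   # True
print(find_soi(b"junk\xff\xd8\xff\xe0"))      # 4
```

Bounding the scan keeps the lenient path from walking an entire non-JPEG input, which is the concern the libjpeg comment raises. It does not tell you whether the SOI found belongs to the main image or to an embedded thumbnail.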

Bug with find.exe?

In C++ we have a method to search for text in a file. It works by reading the file to a variable, and using strstr. But we got into trouble when the file got very large.
I thought I could solve this by calling find.exe using _popen. It works fine, except when these conditions are all true:
The file is of type unicode (BOM=FFFE)
The file is EXACTLY 4096 bytes
The text you are searching for is the last text in the file
To recreate, you can do this:
Open notepad
Insert 2046 X's then an A at the end
Save as test.txt, encoding = "unicode"
Verify that file is exactly 4096 bytes
Open a command prompt and type: find "A" /c test.txt -> No hits
I also tried this:
Add or remove an X, and you will get a hit (file is not 4096 bytes anymore)
Save as UTF-8 (and add enough X's so that the file is 4096 bytes again), and you get a hit
Search for something in the middle of the file (file still unicode and 4096 bytes), and you get a hit.
Is this a bug, or is there something I'm missing?
Very interesting bug.
This question caused me to do some experiments on XP and Win 7 - the behaviors are different.
XP
ANSI - FIND cannot read past 1023 characters (1023 bytes) on a single line. FIND can match a line that exceeds 1023 characters as long as the search string matches before the 1024th. The matching line printout is truncated after 1023 characters.
Unicode - FIND cannot read past 1024 characters (2048 bytes) on a single line. FIND can match a line that exceeds 1024 characters as long as the search string matches before the 1025th. The matching line printout is truncated after 1024 characters.
I find it very odd that the line limits for Unicode and ANSI on XP are not the same number of bytes, nor is one a simple multiple of the other. The Unicode limit expressed in bytes (2048) is 2 times the quantity "ANSI limit plus 1", that is, 2 × (1023 + 1).
Note: truncation of matching long lines also truncates the new-line character, so the next matching line will appear to be appended to the previous line. You can tell it is a new line if you use the /N option.
Windows 7
ANSI - I have not found a limit to the max line length that can be searched, (though I did not try very hard). Any matching line that exceeds 4095 characters (4095 bytes) is truncated after 4095 characters. FIND can successfully search past 4095 characters on a line, it just can't display all of them.
Unicode - I have not found a limit to the max line length that can be searched, (though I did not try very hard). Any matching line that exceeds 2047 characters (4094 bytes) is truncated after 2047 characters. FIND can successfully search past 2047 characters on a line, it just can't display all of them.
Since Unicode byte lengths are always a multiple of 2, and the max ANSI displayable length is an odd number, it makes sense that the max displayable line length in bytes is one less for Unicode than for ANSI.
But then there is also the weird Unicode bug. If the Unicode file length is an exact multiple of 4096 bytes, then the last character cannot be searched or printed. It does not matter if the file contains a single line or multiple lines. It only depends on the total file length.
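A test file that meets the file-length condition can be generated without Notepad. The sketch below assumes that Notepad's "unicode" encoding means UTF-16 little-endian with a BOM, which matches the BOM=FFFE mentioned in the question; `build_utf16_sample` is a hypothetical helper name.

```python
import codecs


def build_utf16_sample(total_bytes=4096, last_char="A"):
    """Return UTF-16LE bytes (with BOM) of exactly total_bytes, ending in last_char."""
    if total_bytes < 4 or total_bytes % 2:
        raise ValueError("total_bytes must be even and at least 4")
    n_chars = (total_bytes - len(codecs.BOM_UTF16_LE)) // 2
    text = "X" * (n_chars - 1) + last_char
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


data = build_utf16_sample()
print(len(data))  # 2-byte BOM + 2047 characters * 2 bytes each = 4096
```

Writing `data` to a file in binary mode and then running find "A" /c on it should reproduce the question's setup. Passing 8192 or another multiple of 4096 gives files for checking the "exact multiple" observation.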
I find it interesting that the multiple of 4096 bug is within one of the max printable line length (in bytes). But I don't know if there is a relationship between those behaviors or if it is simply coincidence.
Note: truncation of matching long lines also truncates any new-line character, so the next matching line will appear to be appended to the previous line. You can tell it is a new line if you use the /N option.

Resources