I'm importing a big csv (5gb) file to the BiqQuery and I had information about an error in the file and its position — specified as a byte offset from the start of the file (for example, 134683757). I'd like to look at lines around this error position.
Some example lines of the file:
field1, field2, field3
abc, bcd, efg
...
dge, hfr, kdf,
dgj, "a""a", fbd # in this line is an invalid csv element and I get error, let's say on the position 134683757
skd, frd, lqw
...
asd, fij, fle
I need some command to show lines around error like
dge, hfr, kdf,
dgj, "a""a", fbd
skd, frd, lqw
I tried sed and awk but I didn't find any simple solution.
It was definitely not clear from the original version of the question that you only got a byte offset from the start of the file.
You need to get a better position from the software generating the error; the developer was lazy in reporting an unusable number. It is reasonable to request a line number (and preferably offset within the line), rather than (or as well as) the byte offset from the start.
Assuming that the number is a byte position in the file, that gets tricky. Most Unix utilities work with lines (of variable length). I'd be tempted to write some C code to do the job, but that might be beyond you (and no shame in that).
Failing that, your best is likely the dd command. If the number reported is 134683757, then I'd guess that your lines are probably not more than 1 KiB each (adjust numbers if they're bigger, or smaller), and then use:
dd if=big.csv of=extract.csv bs=1 skip=$((134683757 - 3 * 1024)) count=6144
echo >> extract.csv
You'd then look at extract.csv. The raw dd output probably won't have a newline at the end of the last line (the echo >>extract.csv fixes that). The output will probably start part way through a record and end part way through another record. However, you're likely to have the relevant information, as well as some irrelevant information. As I said, adjust the numbers to suit your exact situation.
The trickiest part is identifying exactly where the byte offset is in the file you get. With custom C code, that can be provided easily (more easily). With the output from dd, you have to do the calculation yourself.
awk -v offset=$((134683757 - 3 * 1024)) '
{ printf "%9d: %s\n", offset, $0; offset += length($0) + 1 }
' extract.cvs
That takes the starting offset from the dd command, and prefixes the (remnants of) the first line with that number and the data; it then adds the length to the offset plus one for the newline that wasn't counted, and continues to the end of the file. That gives you the start offset for each line in the extracted data. You can see where your actual start was by looking at the offsets — you should be able to identify which record that was.
You could use a variant of this Awk script that reads the whole file line by line, and tracks the offset (as well as the line numbers) and prints the data when it gets to the vicinity of where you have the problem.
In times long past, I had to deal with data from 1/2 inch mag tapes (those big circular tapes you see in old movies) where the files generated on a mainframe seemed sanely formatted for the first few tens of megabytes, but then the format changed to some alternative format for a few megabytes, and then reverted to the original format once more. I never did find out why; I just learned how to deal with it. Trial and error!
Related
Title may be wildly incorrect for what I'm trying to work out.
I'm trying to interpret packets I am recieving from a racing game in a way that I understand, but I honestly don't really know what I'm looking at, or what to search to understand it.
Information on the packets I am recieving here:
https://forums.codemasters.com/topic/54423-f1%C2%AE-2020-udp-specification/?tab=comments#comment-532560
I'm using python to print the packets, here's a snippet of the output, which I don't understand how to interpret.
received message: b'\xe4\x07\x01\x03\x01\x07O\x90.\xea\xc2!7\x16\xa5\xbb\x02C\xda\n\x00\x00\x00\xff\x01\x00\x03:\x00\x00\x00 A\x00\x00\xdcB\xb5+\xc1#\xc82\xcc\x10\t\x00\xd9\x00\x00\x00\x00\x00\x12\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00$tJ\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01
I'm very new to coding, and not sure what my next step is, so a nudge in the right direction will help loads, thanks.
This is the python code:
import socket
UDP_IP = "127.0.0.1"
UDP_PORT = 20777
sock = socket.socket(socket.AF_INET, # Internet
socket.SOCK_DGRAM) # UDP
sock.bind((UDP_IP, UDP_PORT))
while True:
data, addr = sock.recvfrom(4096)
print ("received message:", data)
The website you link to is describing the data format. All data represented as a series of 1's and 0's. A byte is a series of 8 1's and 0's. However, just because you have a series of bytes doesn't mean you know how to interpret them. Do they represent a character? An integer? Can that integer be negative? All of that is defined by whoever crafted the data in the first place.
The type descriptions you see at the top are telling you how to actually interpret that series of 1's and 0's. When you see "unit8", that is an "unsigned integer that is 8 bits (1 byte) long". In other words, a positive number between 0 and 255. An "int8" on the other hand is an "8-bit integer", or a number that can be positive or negative (so the range is -128 to 127). The same basic idea applies to the *16 and *64 variants, just with 16 bits or 64 bits. A float represent a floating point number (a number with a fractional part, such as 1.2345), generally 4 bytes long. Additionally, you need to know the order to interpret the bytes within a word (left-to-right or right-to-left). This is referred to as the endianness, and every computer architecture has a native endianness (big-endian or little-endian).
Given all of that, you can interpret the PacketHeader. The easiest way is probably to use the struct package in Python. Details can be found here:
https://docs.python.org/3/library/struct.html
As a proof of concept, the following will interpret the first 24 bytes:
import struct
data = b'\xe4\x07\x01\x03\x01\x07O\x90.\xea\xc2!7\x16\xa5\xbb\x02C\xda\n\x00\x00\x00\xff\x01\x00\x03:\x00\x00\x00 A\x00\x00\xdcB\xb5+\xc1#\xc82\xcc\x10\t\x00\xd9\x00\x00\x00\x00\x00\x12\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00$tJ\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01'
#Note that I am only taking the first 24 bytes. You must pass data that is
#the appropriate length to the unpack function. We don't know what everything
#else is until after we parse out the header
header = struct.unpack('<HBBBBQfIBB', data[:24])
print(header)
You basically want to read the first 24 bytes to get the header of the message. From there, you need to use the m_packetId field to determine what the rest of the message is. As an example, this particular packet has a packetId of 7, which is a "Car Status" packet. So you would look at the packing format for the struct CarStatus further down on that page to figure out how to interpret the rest of the message. Rinse and repeat as data arrives.
Update: In the format string, the < tells you to interpret the bytes as little-endian with no alignment (based on the fact that the documentation says it is little-endian and packed). I would recommend reading through the entire section on Format Characters in the documentation above to fully understand what all is happening regarding alignment, but in a nutshell it will try to align those bytes with their representation in memory, which may not match exactly the format you specify. In this case, HBBBBQ takes up 2 bytes more than you'd expect. This is because your computer will try to pack structs in memory so that they are word-aligned. Your computer architecture determines the word alignment (on a 64-bit computer, words are 64-bits, or 8 bytes, long). A Q takes a full word, so the packer will try to align everything before the Q to a word. However, HBBBB only requires 6 bytes; so, Python will, by default, pad an extra 2 bytes to make sure everything lines up. Using < at the front both ensures that the bytes will be interpreted in the correct order, and that it won't try to align the bytes.
Just for information if someone else is looking for this. In python there is the library f1-2019-telemetry existing. On the documentation, there is a missing part about the "how to use" so here is a snippet:
from f1_2020_telemetry.packets import *
...
udp_socket = socket.socket(family=socket.AF_INET, type=socket.SOCK_DGRAM)
udp_socket.bind((host, port))
while True:
udp_packet = udp_socket.recv(2048)
packet = unpack_udp_packet(udp_packet)
if isinstance(packet, PacketSessionData_V1): # refer to doc for classes / attribute
print(packet.trackTemperature) # for example
if isinstance(packet, PacketParticipantsData_V1):
for i, participant in enumerate(packet.participants):
print(DriverIDs[participant.driverId]) # the library has some mapping for pilot name / track name / ...
Regards,
Nicolas
I'm writing small log parser, which should find some tags in files.
Files are large (512mb) and have the following structure:
[2018.07.10 00:30:03:125] VersionInfo\886
...some data...
[2018.07.10 00:30:03:109][TraceID: 8HRWSI105YVO91]->IncomingTime\16
...some data...
[2018.07.10 00:30:03:109][TraceID: 8HRWSI105YVO91]->IncomingData\397
...some data...
[2018.07.10 00:30:03:749][TraceID: 8HRWSI105YVO91]->OutgoingData\26651
...somedata...
Each block IncomingTime, IncomingData, OutgoingData, etc. has block size (characters count, not bytes) at the end 886, 16, 397, 26651. Some blocks are very large and can't be read without large buffer (if i use bufio). I want to skip unnecessary blocks using file.Seek.
The problem is that file.Seek needs bytes length and i've only characters count (block may have unicode data with two-byte charcters). Is there any chance to skip blocks using characters count?
The problem is that file.Seek needs bytes length and i've only characters count (block may have unicode data with two-byte charcters). Is there any chance to skip blocks using characters count?
That's actually impossible. As you've described the file format, both of the following are possible:
...VersionInfo\1
[ 20 ]
...VersionInfo\1
[ C2 A0 ]
If you've just read the newline and you know you need to read one character, you know it's somewhere between 1 and 2 bytes (UTF-8 characters can go up to 4 bytes even) but not which, and blindly launching forward some number of bytes without inspecting the intermediate data won't work. The pathological case is a larger block, where the first half has many multi-byte characters and the last half has text that happens to look like one of your entry headers.
With this file format you're forced to read it a character at a time.
Short version:
how to read from STDIN (or a file) char by char while maintaining high performance using Ruby? (though the problem is probably not Ruby specific)
Long version:
While learning Ruby I'm designing a little utility that has to read from a piped text data, find and collect numbers in it and do some processing.
cat huge_text_file.txt | program.rb
input > 123123sdas234sdsd5a ...
output > 123123, 234, 5, ...
The text input might be huge (gigabytes) and it might not contain newlines or whitespace (any non-digit char is a separator) so I did a char by char reading (though I had my concerns about the performance) and it turns out doing it this way is incredibly slow.
Simply reading char by char with no processing on a 900Kb input file takes around 7 seconds!
while c = STDIN.read(1)
end
If I input data with newlines and read line by line, same file is read 100x times faster.
while s = STDIN.gets
end
It seems like reading from a pipe with STDIN.read(1) doesn't involve any buffering and every time read happens, hard drive is hit - but shouldn't it be cached by OS?
Doesn't STDIN.gets read char by char internally until it encounters '\n'?
Using C, I would probably read data in chunks though I would I have to deal with numbers being split by buffer window but that doesn't look like an elegant solution for Ruby. So what is the proper way of doing this?
P.S Timing reading the same file in Python:
for line in f:
line
f.close()
Running time is 0.01 sec.
c = f.read(1)
while c:
c = f.read(1)
f.close()
Running time is 0.17 sec.
Thanks!
This script reads the IO object word by word, and executes the block every time 1000 words have been found or the end of the file has been reached.
No more than 1000 words will be kept in memory at the same time. Note that using " " as separator means that "words" might contain newlines.
This scripts uses IO#each to specify a separator (a whitespace in this case, to get an Enumerator of words), lazy to avoid doing any operation on the whole file content and each_slice to get an array of batch_size words.
batch_size = 1000
STDIN.each(" ").lazy.each_slice(batch_size) do |batch|
# batch is an Array of batch_size words
end
Instead of using cat and |, you could also read the file directly :
batch_size = 1000
File.open('huge_text_file.txt').each(" ").lazy.each_slice(batch_size) do |batch|
# batch is an Array of batch_size words
end
With this code, no number will be split, no logic is needed, it should be much faster than reading the file char by char and it will use much less memory than reading the whole file into a String.
How to load de-normalized file to Normalized table. I'm new to cobol, any suggestion on the below requirement. Thanks.
Inbound file: FileA.DAT
ABC01
ABC2014/01/01
FDE987
FDE2012/01/06
DEE6759
DEE2014/12/12
QQQ444
QQQ2004/10/12
RRR678
RRR2001/09/01
Table : TypeDB
TY_CD Varchar(03)
SEQ_NUM CHAR(10)
END_DT DATE
I have to write a COBOL program to load the table : TypeDB
Output of the result should be,
TY_CD SEQ_NUM END_DT
ABC 01 2014/01/01
FDE 987 2012/01/06
DEE 6759 2014/12/12
QQQ 444 2004/10/12
RRR 678 2001/09/01
Below is the pseudo-codeish
Perform Until F1 IS EOF
Read F1
MOVE F1-REC to WH1-REC
Read F1
MOVE F1-REC to WH2-REC
IF WH1-TY-CD = WH2-TY-CD
move WH1-TY-CD to TY-CD
move WH1-CD to SEQ_NUM
move WH2-DT to END-DT
END-IF
END-PERFORM
This is not working.. any thing better? instead read 2 inside the perform?
I'd definitely go with reading in pairs, like you have. It is clearer, to me, than having "flags" to say what is going on.
I suspect you've overwritten your first record with the second without realising it.
A simple way around that, for a beginner, is to use READ ... INTO ... to get your two different layouts. As you become more experienced, you'll perhaps save the data you need from the first record, and just use the second record from the FD area.
Here's some pseudo-code. It is the same as yours, but by using a "Priming Read". This time the Priming Read is reading two records. No problem.
By testing the FILE STATUS field as indicated, the paired structure of the file is verified. Checking the key ensures that the pairs are always for the same "key" as well. All built-in and hidden away from your actual logic (which in this case is not much anyway).
PrimingRead
FileLoop Until EOF
ProcessPair
ReadData
EndFileLoop
ProcessPair
Do the processing from Layout1 and Layout2
PrimingRead
ReadData
Crash with non-zero file-status
ReadData
ReadRec1
ReadRec2
If Rec2-key not equal to Rec1-key, crash
ReadRec1
Read Into Layout1
Crash with non-zero file-status
ReadRec2
Read Into Layout2
Crash with file-status other than zero or 10
While we are at it, we can apply this solution from Valdis Grinbergs as well (see https://stackoverflow.com/a/28744236/1927206).
PrimingRead
FileLoop Until EOF
ProcessPair
ReadPairStructure
EndFileLoop
ProcessPair
Do the processing from Layout1 and Layout2
PrimingRead
ReadPairStructure
Crash with non-zero file-status
ReadPairStructure
ReadRec1
ReadSecondOfPair
ReadSecondOfPair
ReadRec2
If Rec2-key not equal to Rec1-key, crash
ReadRec1
Read Into Layout1
Crash with non-zero file-status
ReadRec2
Read Into Layout2
Crash with file-status other than zero or 10
Because the structure of the file is very simple either can do. With fixed-number-groups of records, I'd go for the read-a-group-at-a-time. With a more complex structure, the second, "sideways".
Either method clearly reflects the structure of the file, and when you do that in your program, you aid the understanding of the program for human readers (which may be you some time in the future).
The Fortran program I am working is encountering a runtime error when processing an input file.
At line 182 of file ../SOURCE_FILE.f90 (unit = 1, file = 'INPUT_FILE.1')
Fortran runtime error: Bad value during integer read
Looking to line 182 I see a READ statement with an implicit/implied DO loop:
182: READ(IT4, 310 )((IPPRM2(IP,I),IP=1,NP),I=1,16) ! read 6 integers
183: READ(IT4, 320 )((PPARM2(IP,I),IP=1,NP),I=1,14) ! read 5 reals
Format statement:
310 FORMAT(1X,6I12)
When I reach this code in the debugger NP has a value of 2. I has a value of 6, and IP has a value of 67. I think I and IP should be reinitialized in the loop.
My problem is that when I try to step through in the debugger once I get to the READ statement it seems to execute and then throw the error. I'm not sure how to follow it as it reads. I tried stepping into the function, but it seems like that may be a difficult route to take since I am unfamiliar with the gfortran library. The input file looks OK, I think it should be read just fine. This makes me think this READ statement isn't looping as intended.
I am completely new to Fortran and implicit DO loops like this, but from what I can gather line 182 should read in 6 integers according to the format string #310. However, when I arrive NP has a value of 2 which makes me think it will only try to read 2 integers 16 times.
How can I debug this read statement to examine the values read into IPPARM as they are read from the file? Will I have to step through the Fortran library?
Any tips that can clear up my confusion regarding these implicit loops would be appreciated!
Thanks!
NOTE: I'm using gfortran/gcc and gdb on Linux.
Is there any reason you need specific formatting on the read? I would use READ(IT4, *) where feasible...
Later versions of gfortran support unlimited format reads (see link http://fortranwiki.org/fortran/show/Fortran+2008+status)
Then it may be helpful to specify
310 FORMAT("*(1X,6I12)")
Or for older compilers
310 FORMAT(1000(1X,6I12))
The variables IP and I are loop indices and so they are reinitialized by the loop. With NP=2 the first statement is going to read a total of 32 integers -- it is contributing to the determination the list of items to read. The format determines how they are read. With "1X,6I12" they will be read as 6 integers per line of the input file. When the first 6 of the requested 32 integers is read fron a line/record, Fortran will consider that line/record completed and advance to the next record.
With a format of "1X,6I12" the integers must be precisely arranged in the file. There should be a single blank, then the integers should each be right-justified in fields of 12 columns. If they get out of alignment you could get the wrong value read or a runtime error.