How to count/detect frames (pictures) in a raw H.264 bitstream? I know there are 5 VCL NALU types, but I don't know how to recognize a sequence of them as an access unit. I suppose detecting a frame means detecting an access unit, since an access unit is
A set of NAL units that are consecutive in decoding order and contain
exactly one primary coded picture. In addition to the primary coded
picture, an access unit may also contain one or more redundant coded
pictures, one auxiliary coded picture, or other NAL units not
containing slices or slice data partitions of a coded picture. The
decoding of an access unit always results in a decoded picture.
I want to know the FPS of a live stream being sent out to a server.
Your interpretation is right, and if you want to parse the stream yourself, take a look here.
But to quickly extract stream info in a format that is easy to read and parse (with any text parser), you can use ffprobe:
ffprobe -show_streams -count_frames -pretty filename
You will find in the output:
nb_read_frames=....
As for the fps, since ffprobe has been reported to get it wrong in some cases, try a simple ffmpeg -i command:
ffmpeg -i filename 2>&1 | sed -n "s/.*, \(.*\) fps.*/\1/p"
From ITU-T H.264 (03/2009):
7.4.1.2.3 Order of NAL units and coded pictures and association to access units
This subclause specifies the order of NAL units and coded pictures and association to access unit for coded video sequences that conform to one or more of the profiles specified in Annex A that are decoded using the decoding process specified in clauses 2-9.
An access unit consists of one primary coded picture, zero or more corresponding redundant coded pictures, and zero or more non-VCL NAL units. The association of VCL NAL units to primary or redundant coded pictures is described in subclause 7.4.1.2.5.
The first access unit in the bitstream starts with the first NAL unit of the bitstream.
The first of any of the following NAL units after the last VCL NAL unit of a primary coded picture specifies the start of a new access unit:
access unit delimiter NAL unit (when present),
sequence parameter set NAL unit (when present),
picture parameter set NAL unit (when present),
SEI NAL unit (when present),
NAL units with nal_unit_type in the range of 14 to 18, inclusive (when present),
first VCL NAL unit of a primary coded picture (always present).
The constraints for the detection of the first VCL NAL unit of a primary coded picture are specified in subclause 7.4.1.2.4.
7.4.1.2.4 Detection of the first VCL NAL unit of a primary coded picture
This subclause specifies constraints on VCL NAL unit syntax that are sufficient to enable the detection of the first VCL NAL unit of each primary coded picture for coded video sequences that conform to one or more of the profiles specified in Annex A that are decoded using the decoding process specified in clauses 2-9.
Any coded slice NAL unit or coded slice data partition A NAL unit of the primary coded picture of the current access unit shall be different from any coded slice NAL unit or coded slice data partition A NAL unit of the primary coded picture of the previous access unit in one or more of the following ways:
frame_num differs in value. The value of frame_num used to test this condition is the value of frame_num that appears in the syntax of the slice header, regardless of whether that value is inferred to have been equal to 0 for subsequent use in the decoding process due to the presence of memory_management_control_operation equal to 5. (NOTE 1 – A consequence of the above statement is that a primary coded picture having frame_num equal to 1 cannot contain a memory_management_control_operation equal to 5 unless some other condition listed below is fulfilled for the next primary coded picture that follows after it (if any).)
pic_parameter_set_id differs in value.
field_pic_flag differs in value.
bottom_field_flag is present in both and differs in value.
nal_ref_idc differs in value with one of the nal_ref_idc values being equal to 0.
pic_order_cnt_type is equal to 0 for both and either pic_order_cnt_lsb differs in value, or delta_pic_order_cnt_bottom differs in value.
pic_order_cnt_type is equal to 1 for both and either delta_pic_order_cnt[ 0 ] differs in value, or delta_pic_order_cnt[ 1 ] differs in value.
IdrPicFlag differs in value.
IdrPicFlag is equal to 1 for both and idr_pic_id differs in value.
(NOTE 2 – Some of the VCL NAL units in redundant coded pictures or some non-VCL NAL units (e.g., an access unit delimiter NAL unit) may also be used for the detection of the boundary between access units, and may therefore aid in the detection of the start of a new primary coded picture.)
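In practice, if you only need to count pictures (say, to estimate FPS), there is a shortcut that avoids implementing the full rule set above: the first slice of each coded picture has first_mb_in_slice == 0, and ue(v) == 0 is encoded as the single bit '1'. A minimal Python sketch under those assumptions (Annex B input, no redundant coded pictures, no FMO/ASO; coded fields count as separate pictures):

    import re

    VCL_NAL_TYPES = {1, 2, 5}  # non-IDR slice, slice data partition A, IDR slice

    def count_pictures(annexb: bytes) -> int:
        # first_mb_in_slice == 0 means ue(v) == 0, i.e. the single bit '1',
        # so the top bit of the byte right after the one-byte NAL header is set.
        count = 0
        for m in re.finditer(b'\x00\x00\x01', annexb):
            i = m.end()
            if i + 1 >= len(annexb):
                break
            if (annexb[i] & 0x1F) in VCL_NAL_TYPES and (annexb[i + 1] & 0x80):
                count += 1
        return count

Dividing that count by the elapsed capture time then gives an FPS estimate.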
NAL units do not necessarily have a 1:1 relationship to frames; frames can be split into multiple NAL units. If you want to parse the stream manually, you'll need to handle each type, which is defined quite well in the blog article below. If the stream has an SPS NAL unit, it should contain the frame rate, but that's not necessarily the actual frame rate, just what the container believes it has.
As you are also asking how to find the actual start of an AU: if it's an "Annex B" bitstream, each NALU is preceded by a start code, 0x000001 or 0x00000001. AVCC instead uses a small length header to delimit each NALU.
Check out the following great blog post for more details: szatmary.org
Hope that helps!
I'm parsing an HEVC [H.265] header and I noticed that many values are in Exp-Golomb code notation. One of them, for example, is the width.
Let's suppose a width value of 1600; in Exp-Golomb code it is written as:
g=000000000011001000001
call "leadingZeros" (lz) the run of zeros at the start of the string (from left to right).
LeadingZeros is composed of 10 zeros. After them comes a '1' marker bit, followed by lz more bits; let's call those bits b.
To decode the Exp-Golomb code, where b=1001000001 (or decimal 577), you do:
a=2^lz-1;
n=a+to_decimal(b)
where to_decimal converts from binary to decimal value.
So you have 1023 + 577 = 1600.
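For reference, a minimal sketch of that decode (unsigned Exp-Golomb, ue(v)), operating on a bit string purely for illustration:

    def decode_ue(bits: str) -> int:
        # e.g. decode_ue("000000000011001000001") == 1600
        lz = 0
        while bits[lz] == '0':              # count the leading zeros
            lz += 1
        info = bits[lz + 1 : lz + 1 + lz]   # lz info bits after the '1' marker
        return (1 << lz) - 1 + (int(info, 2) if info else 0)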
Question:
With Golomb you're using 21 bits to represent 1600.
But 1600 in binary takes 11 bits (110 0100 0000).
Also the Golomb method does not allow for a custom number of bits to represent values.
So... why is Golomb code used in compression algorithms like H.265?
Well, compression of the High Level Syntax (HLS) is usually not a critical priority in video compression. If you do the math for a typical resolution (e.g. 1080p) at a typical bandwidth (e.g. 7 Mbps), you will see that saving a few bits when signalling frame-level and sequence-level information is really negligible.
However, since Exp-Golomb coding is also used to signal large DCT coefficients, one might ask the same question in that context. And there it would be a valid compression concern, as efficiency in residual coding is everything! To answer that question, there is a lot of well-established literature, dating back to the AVC era.
After careful reading of FFmpeg Bitstream Filters Documentation, I still do not understand what they are really for.
The document states that the filter:
performs bitstream level modifications without performing decoding
Could anyone further explain that to me? A use case would greatly clarify things. Also, there are clearly different filters. How do they differ?
Let me explain by example. FFmpeg video decoders typically work by converting one video frame per call to avcodec_decode_video2. So the input is expected to be "one image" worth of bitstream data. Let's consider the issue of going from a file (an array of bytes on disk) to images for a second.
For "raw" (annexb) H264 (.h264/.bin/.264 files), the individual nal unit data (sps/pps header bitstreams or cabac-encoded frame data) is concatenated in a sequence of nal units, with a start code (00 00 01 XX) in between, where XX is the nal unit type. (In order to prevent the nal data itself to have 00 00 01 data, it is RBSP escaped.) So a h264 frame parser can simply cut the file at start code markers. They search for successive packets that start with and including 00 00 01, until and excluding the next occurence of 00 00 01. Then they parse the nal unit type and slice header to find which frame each packet belongs to, and return a set of nal units making up one frame as input to the h264 decoder.
H264 data in .mp4 files is different, though. You can imagine that the 00 00 01 start code is redundant if the muxing format already has length markers in it, as is the case for mp4. So, to save 3 bytes per frame, they remove the 00 00 01 prefix. They also put the PPS/SPS in the file header instead of prepending it before the first frame, and these also lack their 00 00 01 prefixes. So, if I were to feed this into the h264 decoder, which expects the prefixes for all NAL units, it wouldn't work. The h264_mp4toannexb bitstream filter fixes this by identifying the PPS/SPS in the extracted parts of the file header (ffmpeg calls this "extradata"), prepending it and each NAL from individual frame packets with the start code, and concatenating them back together before feeding them to the h264 decoder.
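A rough sketch of the rewrapping step for a single mp4 sample (a hypothetical helper; the real filter also extracts SPS/PPS from the avcC extradata and reads the actual length-field size from it):

    def avcc_to_annexb(sample: bytes, length_size: int = 4) -> bytes:
        # Replace each big-endian length prefix with a 4-byte start code.
        out, i = bytearray(), 0
        while i + length_size <= len(sample):
            nal_len = int.from_bytes(sample[i:i + length_size], 'big')
            i += length_size
            out += b'\x00\x00\x00\x01' + sample[i:i + nal_len]
            i += nal_len
        return bytes(out)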
You might now feel that there's a very fine distinction between a "parser" and a "bitstream filter". This is true. I think the official definition is that a parser takes a sequence of input data and splits it into frames without discarding or adding any data. The only thing a parser does is change packet boundaries. A bitstream filter, on the other hand, is allowed to actually modify the data. I'm not sure this definition is entirely true (see e.g. vp9 below), but it's the conceptual reason mp4toannexb is a BSF, not a parser (because it adds 00 00 01 prefixes).
Other cases where such "bitstream tweaks" help keep decoders simple and uniform, while allowing us to support all the file variants that happen to exist in the wild:
mpeg4 (divx) B-frame unpacking (to get B-frame sequences like IBP, which are coded as IPB, into AVI with correct timestamps, people came up with the concept of B-frame packing, where I-P-B is packed into frames as I-(PB)-(), i.e. the third packet is empty and the second holds two frames. This means the timestamps associated with the P and B frames at the decoding phase are correct. It also means one packet holds two frames' worth of input data, which violates ffmpeg's one-frame-in-one-frame-out concept, so we wrote a BSF to split the packet back in two - along with deleting the marker that says the packet contains two frames, hence a BSF and not a parser - before feeding it into the decoder. In practice, this solves otherwise hard problems with frame multithreading. VP9 does the same thing (called superframes), but splits frames in the parser, so the parser/BSF split isn't always theoretically perfect; maybe VP9's should be called a BSF.)
hevc mp4 to annexb conversion (same story as above, but for hevc)
aac adts to asc conversion (this is basically the same as h264/hevc annexb vs. mp4, but for aac audio)
I've been trying to figure out why the GIF89a spec requires the initial LZW code size to be at least 2 bits, even when encoding 1-bit images (B&W). In Appendix F of the spec, it says the following:
ESTABLISH CODE SIZE
The first byte of the Compressed Data stream is a value indicating the minimum number of bits required to represent the set of actual pixel values. Normally this will be the same as the number of color bits. Because of some algorithmic constraints however, black & white images which have one color bit must be indicated as having a code size of 2.
I'm curious as to what these algorithmic constraints are. What would possibly prevent the variant of LZW used in GIF from using a code size of 1? Was this just a limitation of early encoders or decoders? Or is there some weird edge case that can manifest itself with just the right combination of bits? Or is there something completely different going on here?
In addition to the codes for 0 and 1, you also have a clear code and an end of information code.
Quoting from the spec:
The output codes are of variable length, starting at <code size>+1 bits per
code, up to 12 bits per code. This defines a maximum code value of 4095
(0xFFF). Whenever the LZW code value would exceed the current code length, the
code length is increased by one.
If you start with a code size of 1, the code size needs to be increased immediately by this rule.
This limitation removes one special case from implementations: with codesize==1, the first vocabulary phrase code would have width==codesize+2, while in all other cases width==codesize+1.
The drawback is a very small decrease in compression ratio for 2-color pictures.
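To make the arithmetic concrete, here is a small illustration (a hypothetical helper, just showing the code allocation) of why a minimum code size of 1 is awkward:

    def gif_lzw_initial_codes(min_code_size: int):
        clear_code = 1 << min_code_size      # first code after the pixel values
        eoi_code = clear_code + 1
        first_dict_code = eoi_code + 1
        initial_width = min_code_size + 1
        fits = first_dict_code < (1 << initial_width)
        return clear_code, eoi_code, first_dict_code, initial_width, fits

    print(gif_lzw_initial_codes(1))  # (2, 3, 4, 2, False): code 4 needs 3 bits at once
    print(gif_lzw_initial_codes(2))  # (4, 5, 6, 3, True):  code 6 fits in 3 bits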
Is it possible to create a 28-bit bar code of quality A with a width of 3 1/2 inches?
Details: Code 128 bar code
Bits: 28 bit code
Can you please share the dimensions of the bar code, including quiet zones, that will make the bar code fall within class A or class B quality?
Perhaps you mean "28 characters" not "28 bits".
Essentially, the barcode width depends on the dots-per-inch (dpi) pitch of the printer you are using and the internal structure of the data.
For all-numeric data (which has the best compression in code128) there should be no problem fitting 28 numeric characters into 3.5", even with a 200dpi model like a 2844. A 3844 would give you 300dpi for better quality at a higher price.
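As a rough sanity check of that claim, here is the module arithmetic for an all-numeric Code 128 (subcode C packs two digits per 11-module symbol; the stop pattern is 13 modules and the nominal quiet zone is 10 modules per side; the 2-dots-per-module default is an assumption):

    def code128c_width_inches(num_digits, dpi, dots_per_module=2):
        data_symbols = (num_digits + 1) // 2        # subcode C: 2 digits per symbol
        modules = (1 + data_symbols + 1) * 11 + 13  # start + data + check + stop
        modules += 2 * 10                           # quiet zone on each side
        return modules * dots_per_module / dpi

    print(round(code128c_width_inches(28, 203), 2))  # ~2.06", comfortably under 3.5"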
Zebra firmware automatically compresses the barcode into the best internal sequence to produce the shortest physical code - which depends on the precise data to be encoded. If your code is all-numeric, that will be the shortest possible (and a constant width). If your data always follows the same pattern (e.g. 10 numerics, 6 alphas, 12 numerics) then that will also produce a constant width (but longer than all-numeric). All-alphas will yield the longest code. Other combinations may generate a shorter code depending on the length of any "run" of sequential numerics.
The Protocol Buffers documentation states:
"For historical reasons, repeated fields of basic numeric types aren't encoded as
efficiently as they could be. New code should use the special option [packed=true] to get
a more efficient encoding. For example:
repeated int32 samples = 4 [packed=true];"
Can someone clearly explain how the option [packed=true] improves the efficiency of encoding basic numeric data types?
Basically, under the original encoding the field header (which is composed of the wire type combined with the field-number, bit-shifted and or'd) occurs for every element. Because the header is varint encoded, it is at least one byte per element, but possibly more. So 10 4-byte floats would be at least 50 bytes and quite possibly 90 bytes if the header takes 5 bytes (large field numbers take more space than small field numbers).
With the packed encoding, the field header occurs only once, followed by a varint that indicates the number of bytes to follow. So for 10 floats, the payload length is 40, which is varint-encoded in a single byte as the length prefix. At deserialization time it simply consumes that many bytes, reading elements as it does so. Therefore for the same data (50 to 90 bytes previously) we are now using 42 to 46 bytes (again, for the range of field numbers that take 1 to 5 bytes each).
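A small sketch of the two layouts for the repeated int32 samples = 4 example above (hand-rolled varints, just to show the bytes on the wire; non-negative values assumed):

    def varint(n: int) -> bytes:
        out = bytearray()
        while True:
            if n > 0x7F:
                out.append((n & 0x7F) | 0x80)  # set continuation bit
                n >>= 7
            else:
                out.append(n)
                return bytes(out)

    def unpacked(field: int, values) -> bytes:
        tag = varint(field << 3 | 0)   # wire type 0: tag repeated per element
        return b''.join(tag + varint(v) for v in values)

    def packed(field: int, values) -> bytes:
        payload = b''.join(varint(v) for v in values)
        return varint(field << 3 | 2) + varint(len(payload)) + payload

    vals = [3, 270, 86942]
    print(len(unpacked(4, vals)), len(packed(4, vals)))  # 9 vs 8; the gap grows with count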
These two layouts are very different on the wire, and code expecting one cannot usually decode the other. As such, packed encoding needs to be explicitly enabled to prevent breaking existing messages.