Description of the 4XM Video Codec
by Michael Niedermayer

Table of Contents

1 Introduction
2 Terms and Definitions
3 High-level Description
    3.1 I-Frame
        3.1.1 Macroblock
        3.1.2 DC Prediction
        3.1.3 Dequantization and IDCT
        3.1.4 YCbCr 4:2:0 -> RGB565 colorspace transform
    3.2 P-Frame
        3.2.1 Motion Vector table
    3.3 C-Frame
4 Bitstream
    4.1 I-Frame
        4.1.1 Prefix stream
        4.1.2 Macroblock
        4.1.3 Block
    4.2 P-Frame
        4.2.1 Block
    4.3 C-Frame
5 VLC Codes
    5.1 prefix_vlc in I frames
    5.2 level vlc in I Frames
    5.3 Block Mode Codes in P frames
6 Applications and Platforms
7 Changelog
8 Copyright

1 Introduction

The 4XM video codec is a mixture of a very simplified JPEG scheme and rectangular block based fullpel motion compensation with DC difference coding. The codec uses the 4:2:0 YCbCr colorspace for the JPEG part but converts the result to RGB16 before using it for motion compensation.

The latest version of this document is available at http://www.mplayerhq.hu/~michael/4xm.{lyx,txt,html,ps}

4XM video is normally encapsulated in the proprietary [http://www.pcisys.net/~melanson/codecs/4xm-format.txt||4XM format].

This document assumes familiarity with mathematical and coding concepts such as the discrete cosine transform, quantization, YCbCr colorspaces, macroblocks, and variable length codes (VLCs). Familiarity with the standard JPEG coding method is also helpful.

2 Terms and Definitions

AC        Any DCT coefficient for which the frequency in one or both dimensions is non-zero.
DC        The DCT coefficient for which the frequency is zero in both dimensions.
(I)DCT    (Inverse) Discrete Cosine Transform
VLC       Variable Length Code
AAN IDCT  IDCT algorithm by Arai, Agui, and Nakajima
JPEG      Joint Photographic Experts Group

3 High-level Description

The 4XM video coding method uses 3 types of frames: I-frames, P-frames, and C-frames. I-frames are intraframes and stand on their own. P-frames and C-frames are interframes.

3.1 I-Frame

I-frames are practically the same as JPEG images.
The differences are a single Huffman table, different headers, and a bitstream split into 2 partitions, one of which is written in 32-bit byteswapped order. There are also no parameters for rate or quality control. The picture is split into macroblocks which are coded left->right, top->bottom.

3.1.1 Macroblock

A macroblock covers 16x16 luma samples and 8x8 chroma samples, coded as 4 8x8 luma blocks and 2 8x8 chroma blocks:

Y:
+---+---+
| 0 | 1 |
+---+---+
| 2 | 3 |
+---+---+

Cb:
+---+
| 4 |
+---+

Cr:
+---+
| 5 |
+---+

3.1.2 DC Prediction

DC values are predicted from the last coded block. The initial prediction value, used for the first (top-left) luma block, is 0. No special handling is done between luma and chroma blocks or at the right border, so the DC value of the rightmost 8x8 Cr block of the first row is used as the predictor for the first (top-left) 8x8 luma block of the second macroblock row.

3.1.3 Dequantization and IDCT

4XM uses an AAN IDCT with the premultiply table merged with the quantization table. The quantization table is the default luma table used in JPEG.
default luma quantization table used in JPEG

    16, 11, 10, 16,  24,  40,  51,  61,
    12, 12, 14, 19,  26,  58,  60,  55,
    14, 13, 16, 24,  40,  57,  69,  56,
    14, 17, 22, 29,  51,  87,  80,  62,
    18, 22, 37, 56,  68, 109, 103,  77,
    24, 35, 55, 64,  81, 104, 113,  92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103,  99

AAN premultiply table

    16384, 22725, 21407, 19266, 16384, 12873,  8867, 4520,
    22725, 31521, 29692, 26722, 22725, 17855, 12299, 6270,
    21407, 29692, 27969, 25172, 21407, 16819, 11585, 5906,
    19266, 26722, 25172, 22654, 19266, 15137, 10426, 5315,
    16384, 22725, 21407, 19266, 16384, 12873,  8867, 4520,
    12873, 17855, 16819, 15137, 12873, 10114,  6967, 3552,
     8867, 12299, 11585, 10426,  8867,  6967,  4799, 2446,
     4520,  6270,  5906,  5315,  4520,  3552,  2446, 1247

merged table used in 4XM

    16, 15, 13, 19, 24, 31, 28, 17,
    17, 23, 25, 31, 36, 63, 45, 21,
    18, 24, 27, 37, 52, 59, 49, 20,
    16, 28, 34, 40, 60, 80, 51, 20,
    18, 31, 48, 66, 68, 86, 56, 21,
    19, 38, 56, 59, 64, 64, 48, 20,
    27, 48, 55, 55, 56, 51, 35, 15,
    20, 35, 34, 32, 31, 22, 15,  8,

This is simply the element-wise product of the quantization table and the AAN table, divided by 2^{14} (the AAN table stores 1.0 as 16384 = 2^{14}) and rounded to the nearest integer. For example, 11*22725/2^{14} = 15.26, which gives the second entry, 15.

4XM's AAN IDCT uses (a*const)>>16 to approximate multiplications and simply shifts the transformed result 16 bits to the right. The scaled constants are:

+-----------------+-----------------+
| exact           | scaled constant |
+-----------------+-----------------+
| 1.082392200...  | 70936           |
+-----------------+-----------------+
| 1.414213562...  | 92682           |
+-----------------+-----------------+
| 1.847759065...  | 121095          |
+-----------------+-----------------+
| 2.613125930...  | 171254          |
+-----------------+-----------------+

which are simply the exact constants multiplied by 2^{16} and rounded to the nearest integer.
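The relationship between the three tables and the scaled constants can be checked mechanically. The following is a minimal sketch in C, not code taken from any decoder: the names jpeg_luma_quant, aan_scale, merged_4xm, merge_entry, count_merge_mismatches, scale16 and mul_fix16 are hypothetical. It assumes the divisor that the printed tables are consistent with, 2^{14} with rounding to nearest, and it assumes an arithmetic right shift for negative values in mul_fix16.

```c
#include <assert.h>
#include <stdint.h>

/* Default JPEG luma quantization table, as listed in this section. */
static const uint8_t jpeg_luma_quant[64] = {
    16, 11, 10, 16,  24,  40,  51,  61,
    12, 12, 14, 19,  26,  58,  60,  55,
    14, 13, 16, 24,  40,  57,  69,  56,
    14, 17, 22, 29,  51,  87,  80,  62,
    18, 22, 37, 56,  68, 109, 103,  77,
    24, 35, 55, 64,  81, 104, 113,  92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103,  99
};

/* AAN premultiply table, as listed in this section (1.0 is stored as 16384). */
static const uint16_t aan_scale[64] = {
    16384, 22725, 21407, 19266, 16384, 12873,  8867, 4520,
    22725, 31521, 29692, 26722, 22725, 17855, 12299, 6270,
    21407, 29692, 27969, 25172, 21407, 16819, 11585, 5906,
    19266, 26722, 25172, 22654, 19266, 15137, 10426, 5315,
    16384, 22725, 21407, 19266, 16384, 12873,  8867, 4520,
    12873, 17855, 16819, 15137, 12873, 10114,  6967, 3552,
     8867, 12299, 11585, 10426,  8867,  6967,  4799, 2446,
     4520,  6270,  5906,  5315,  4520,  3552,  2446, 1247
};

/* Merged table used in 4XM, as listed in this section. */
static const uint8_t merged_4xm[64] = {
    16, 15, 13, 19, 24, 31, 28, 17,
    17, 23, 25, 31, 36, 63, 45, 21,
    18, 24, 27, 37, 52, 59, 49, 20,
    16, 28, 34, 40, 60, 80, 51, 20,
    18, 31, 48, 66, 68, 86, 56, 21,
    19, 38, 56, 59, 64, 64, 48, 20,
    27, 48, 55, 55, 56, 51, 35, 15,
    20, 35, 34, 32, 31, 22, 15,  8
};

/* Element-wise product with the 2^14 scale removed, rounded to nearest. */
static int merge_entry(int i)
{
    assert(i >= 0 && i < 64);
    return (jpeg_luma_quant[i] * aan_scale[i] + (1 << 13)) >> 14;
}

/* Number of entries where the recomputed value differs from the listed one. */
static int count_merge_mismatches(void)
{
    int mismatches = 0;
    for (int i = 0; i < 64; i++)
        if (merge_entry(i) != merged_4xm[i])
            mismatches++;
    return mismatches;
}

/* Exact constant multiplied by 2^16 and rounded to nearest (positive x only). */
static int32_t scale16(double x)
{
    return (int32_t)(x * 65536.0 + 0.5);
}

/* The (a*const)>>16 multiply; assumes >> on negative values is arithmetic. */
static int32_t mul_fix16(int32_t a, int32_t c)
{
    return (int32_t)(((int64_t)a * c) >> 16);
}
```

A caller would expect count_merge_mismatches() to return 0 and scale16(1.414213562) to return 92682. The 64-bit intermediate in mul_fix16 is a choice made for this sketch to avoid overflow; the source does not say how wide the original intermediate is.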
See [http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/ffmpeg/ffmpeg/libavcodec/4xm.c?rev=HEAD&content-type=text/vnd.viewcvs-markup||4xm.c].

3.1.4 YCbCr 4:2:0 -> RGB565 colorspace transform

Chroma is first upsampled by sample replication (nearest neighbor scaling), so that the same Cb and Cr samples are used for each group of 2x2 Y samples.

    R= (Y + Cr + 128)>>3
    G= (Y - ((Cb+Cr)>>1) + 128)>>2
    B= (Y + 2*Cb + 128)>>3

There is no check or protection against overflow, so values will wrap around if they are too large or too small.

3.2 P-Frame

A P-frame picture is split into blocks which are coded left->right, top->bottom. Each block contains 8x8 samples in RGB565 format (5 bits for red, 6 bits for green, 5 bits for blue). Each block can be recursively split into 2, down to 2x1/1x2 sized blocks.

A P-frame block can be coded using 1 of 7 methods:

1. motion compensated with 1 vector
2. horizontally split in the middle
3. vertically split in the middle
4. skipped (block data copied from the frame before the last)
   Example: Intraframe, Interframe1, Interframe2, Interframe3
   A skipped block in Interframe3 will use the data from Interframe1; skipped blocks in Interframe1 are disallowed as there is no source frame.
5. motion vector + DC difference (the 16-bit words of the DC and the source block are simply added, there is no special handling of overflows)
6. DC only, the whole block is filled with the DC color
7. hardcoded pixel values (left->right, top->bottom)

Block splitting is only available if the resulting blocks are larger than 1x2/2x1. Hardcoded pixel values are only available for 1x2/2x1 sized blocks.

Motion compensation assumes that the number of words (16-bit RGB565 pixels) per line is equal to the width, so that motion vectors which point right or left outside of the picture use the pixels from the other side. There is no subpel motion compensation or filtering, which means that motion compensation can simply be done by copying the pixels from the motion block.
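The copy-based motion compensation and the DC difference mode described above can be sketched as follows. This is a hypothetical illustration, not the reference decoder: the function name mc_block and all of its parameters are made up for this example, the motion vector is assumed to have already been looked up as a signed (mvx, mvy) pair, and the bounds check is an addition of this sketch (the text does not say what happens when a vector points above or below the picture). Frames are treated as flat arrays with exactly width words per line, which is what makes a vector pointing past the left or right edge pick up pixels from the other side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy a w x h block of RGB565 words from the reference frame `ref` into
 * the current frame `dst` at position (x, y), displaced by (mvx, mvy).
 * `dc` is added to every word modulo 2^16; pass 0 for plain motion
 * compensation. Returns 0 on success, or -1 if the source block would lie
 * outside the reference buffer (a check added by this sketch).
 */
static int mc_block(uint16_t *dst, const uint16_t *ref, int width, int height,
                    int x, int y, int w, int h, int mvx, int mvy, uint16_t dc)
{
    long total, first, last;

    assert(dst != NULL && ref != NULL);
    assert(width > 0 && height > 0 && w > 0 && h > 0);
    assert(x >= 0 && y >= 0 && x + w <= width && y + h <= height);

    total = (long)width * height;
    /* Linear addressing: no per-line clipping, so x + mvx may leave the row. */
    first = (long)(y + mvy) * width + (x + mvx);
    last  = first + (long)(h - 1) * width + w;   /* one past the final word */
    if (first < 0 || last > total)
        return -1;

    for (int row = 0; row < h; row++) {
        long dst_off = (long)(y + row) * width + x;
        long src_off = first + (long)row * width;
        for (int col = 0; col < w; col++)
            dst[dst_off + col] = (uint16_t)(ref[src_off + col] + dc);
    }
    return 0;
}
```

For a 4-pixel-wide frame, a block at x=0, y=1 with vector (-1, 0) reads its first pixel from word 3, the last pixel of line 0, which illustrates the wrap to the other side.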
3.2.1 Motion Vector table

See mv[256][2] at [http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/ffmpeg/ffmpeg/libavcodec/4xm.c?rev=HEAD&content-type=text/vnd.viewcvs-markup||4xm.c].

3.3 C-Frame

A C-frame is essentially a partial P frame. It has all the same coding options but a different header.

4 Bitstream

All 32-bit values are in little endian byte order.

4.1 I-Frame

32bit   'ifrm'
32bit   chunk length
32bit   0 (unknown)
32bit   bitstream size
n byte  bitstream
32bit   prefixstream size / 4
32bit   token_count
n byte  prefixstream

4.1.1 Prefix stream

start                           8bit
end                             8bit
do{
    for(i=start; i<=end; i++)
        frequency[i]            8bit
    start                       8bit
    if(start==0) break;
    end                         8bit
}
while(not 32bit aligned)
    0                           8bit
for(i=0; i
>4;
        level_prefix= ac_prefix&0xF;
        level_suffix            level_prefix bits   bitstream
        block[ zigzag[i] ]= level;
        i++;
    }
}

4.2 P-Frame

32bit    'pfrm'
32bit    chunk size
32bit    0 (unknown)
32bit    unknown, perhaps a checksum
32bit    unknown
32bit    bitstream size
32bit    wordstream size
32bit    bytestream size
n bytes  bitstream, stored in byteswapped 32-bit words
n bytes  RGB16 wordstream, stored in little endian order
n bytes  bytestream

4.2.1 Block

block(){
    mode                        vlc     bitstream
    if(mode==h_split || mode==v_split){
        block()
        block()
    }
    if(mode==mc || mode==mcdc)
        mv                      8bit    bytestream
    if(mode==dc || mode==mcdc)
        dc                      16bit   wordstream
    if(mode==esc){
        col1                    16bit   wordstream
        col2                    16bit   wordstream
    }
}

4.3 C-Frame

32bit   'cfrm'
32bit   chunk size
32bit   0 (unknown)
32bit   frame number / frame id, this is the frame number where the frame will be shown, it is also the frame number at which the last cframe part of this frame will be; note, all parts of the same cframe contain the same id here
32bit   whole frame size
* p frame, this is (unk, unk, bitstream size, wordstream size, ...) for the first c frame chunk of a c frame

5 VLC Codes

5.1 prefix_vlc in I frames

The prefix_vlc table is generated from the frequencies stored in the prefix stream. Additionally, the element 256 is added with an implicit frequency of 1.
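Gathering the frequencies that feed the table construction can be sketched as below. This is only an illustration under stated assumptions, not the reference code: the function name read_frequencies and its parameters are hypothetical, the loop is read as "repeat until a start byte of 0 is seen, then skip padding bytes up to a 32-bit boundary", alignment is assumed to be relative to the first byte of the prefix stream, and the truncation checks are additions of this sketch. Building the actual code lengths from these frequencies is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Parse the frequency ranges at the start of a prefix stream into
 * frequency[0..255] and set the implicit frequency[256] = 1.
 * Returns the number of bytes consumed (a multiple of 4), or 0 if the
 * buffer ends before the table is complete.
 */
static size_t read_frequencies(const uint8_t *buf, size_t size,
                               uint8_t frequency[257])
{
    size_t pos = 0;
    unsigned start, end;

    assert(buf != NULL && frequency != NULL);
    memset(frequency, 0, 257);

    if (size < 2)
        return 0;
    start = buf[pos++];
    end   = buf[pos++];

    for (;;) {
        /* One run of consecutive symbols start..end, one byte each. */
        for (unsigned i = start; i <= end; i++) {
            if (pos >= size)
                return 0;
            frequency[i] = buf[pos++];
        }
        if (pos >= size)
            return 0;
        start = buf[pos++];
        if (start == 0)          /* a start byte of 0 terminates the list */
            break;
        if (pos >= size)
            return 0;
        end = buf[pos++];
    }

    /* Skip padding bytes up to the next 32-bit boundary. */
    while (pos & 3) {
        if (pos >= size)
            return 0;
        pos++;
    }

    frequency[256] = 1;          /* implicit extra element */
    return pos;
}
```

With the input bytes 1, 2, 10, 20, 0, 0, 0, 0 the sketch stores frequency[1] = 10 and frequency[2] = 20, consumes all 8 bytes, and leaves every other entry at 0 except frequency[256].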
For the exact algorithm see libavcodec/4xm.c read_huffman_tables().

5.2 level vlc in I Frames

Identical to JPEG

+---------+---------+--------------+
| prefix  | vlc     | level        |
+---------+---------+--------------+
| 0       |         | 0            |
+---------+---------+--------------+
| 1       | 0/1     | -1/1         |
+---------+---------+--------------+
| 2       | 0X/1X   | -3..-2/2..3  |
+---------+---------+--------------+
| 3       | 0XX/1XX | -7..-4/4..7  |
+---------+---------+--------------+
| ...     | ...     | ...          |
+---------+---------+--------------+

One way to decode this is:

if(prefix){
    v= get_bits(prefix);
    if((v & (1<<(prefix-1))) == 0)
        v= (-1<<prefix) | (v+1);
}else
    v= 0;

8 Copyright

This text can be used under the GNU Free Documentation License or GNU General Public License. See [http://www.gnu.org/licenses/fdl.txt].