Team Fortress 2 Source Code as on 22/4/2020
LZMA specification (DRAFT version)
----------------------------------

Author: Igor Pavlov
Date: 2013-07-28

This specification defines the format of LZMA compressed data and the lzma file format.

Notation
--------

We use the syntax of the C++ programming language.
We use the following types in C++ code:

  unsigned - unsigned integer, at least 16 bits in size
  int      - signed integer, at least 16 bits in size
  UInt64   - 64-bit unsigned integer
  UInt32   - 32-bit unsigned integer
  UInt16   - 16-bit unsigned integer
  Byte     - 8-bit unsigned integer
  bool     - boolean type with two possible values: false, true


lzma file format
================

The lzma file contains the raw LZMA stream and the header with related properties.

The files in that format use the ".lzma" extension.

The lzma file format layout:

Offset Size Description

  0     1   LZMA model properties (lc, lp, pb) in encoded form
  1     4   Dictionary size (32-bit unsigned integer, little-endian)
  5     8   Uncompressed size (64-bit unsigned integer, little-endian)
 13         Compressed data (LZMA stream)

LZMA properties:

    name  Range          Description

      lc  [0, 8]         the number of "literal context" bits
      lp  [0, 4]         the number of "literal pos" bits
      pb  [0, 4]         the number of "pos" bits
dictSize  [0, 2^32 - 1]  the dictionary size

The following code encodes LZMA properties:

void EncodeProperties(Byte *properties)
{
  properties[0] = (Byte)((pb * 5 + lp) * 9 + lc);
  Set_UInt32_LittleEndian(properties + 1, dictSize);
}

If the value of dictionary size in properties is smaller than (1 << 12),
the LZMA decoder must set the dictionary size variable to (1 << 12).

#define LZMA_DIC_MIN (1 << 12)

unsigned lc, pb, lp;
UInt32 dictSize;
UInt32 dictSizeInProperties;

void DecodeProperties(const Byte *properties)
{
  unsigned d = properties[0];
  if (d >= (9 * 5 * 5))
    throw "Incorrect LZMA properties";
  lc = d % 9;
  d /= 9;
  pb = d / 5;
  lp = d % 5;
  // the dictionary size is stored as a 32-bit little-endian value
  dictSizeInProperties = 0;
  for (int i = 0; i < 4; i++)
    dictSizeInProperties |= (UInt32)properties[i + 1] << (8 * i);
  dictSize = dictSizeInProperties;
  if (dictSize < LZMA_DIC_MIN)
    dictSize = LZMA_DIC_MIN;
}
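
As a worked example of the property byte, the sketch below packs and unpacks
(lc, lp, pb) with the same arithmetic as EncodeProperties() and
DecodeProperties(). The names LzmaProps, PackProps and UnpackProps are
hypothetical helpers introduced only for this illustration; they are not part
of the reference code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the three model properties.
struct LzmaProps { unsigned lc, lp, pb; };

// Packs (lc, lp, pb) into one byte, same formula as EncodeProperties().
inline uint8_t PackProps(const LzmaProps &p)
{
  assert(p.lc <= 8 && p.lp <= 4 && p.pb <= 4);
  return (uint8_t)((p.pb * 5 + p.lp) * 9 + p.lc);
}

// Inverse operation, same arithmetic as DecodeProperties().
// Returns false instead of throwing when the byte is out of range.
inline bool UnpackProps(uint8_t encoded, LzmaProps &p)
{
  unsigned d = encoded;
  if (d >= 9 * 5 * 5)
    return false;
  p.lc = d % 9;
  d /= 9;
  p.pb = d / 5;
  p.lp = d % 5;
  return true;
}
```

With lc = 3, lp = 0 and pb = 2 the packed byte is (2 * 5 + 0) * 9 + 3 = 93 = 0x5D.
The largest valid byte is (4 * 5 + 4) * 9 + 8 = 224, which is why values
of 225 and above are rejected.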

If the "Uncompressed size" field contains ones in all 64 bits, the
uncompressed size is unknown and the stream contains the "end marker"
that indicates the point where decoding ends.
Otherwise, if the value of the "Uncompressed size" field is not
equal to ((2^64) - 1), decoding of the LZMA stream must finish after
the specified number of bytes (Uncompressed size) is decoded. And if there
is the "end marker", the LZMA decoder must read that marker also.


The new scheme to encode LZMA properties
----------------------------------------

If LZMA compression is used for another format, it's recommended to
use a new improved scheme to encode LZMA properties. That new scheme is
used in the xz format, which uses the LZMA2 compression algorithm.
LZMA2 is a new compression algorithm that is based on the LZMA algorithm.

The dictionary size in LZMA2 is encoded with just one byte and LZMA2 supports
only a reduced set of dictionary sizes:
  (2 << 11), (3 << 11),
  (2 << 12), (3 << 12),
  ...
  (2 << 30), (3 << 30),
  (2 << 31) - 1

The dictionary size can be extracted from the encoded value with the following code:

dictSize = (p == 40) ? 0xFFFFFFFF : (((UInt32)2 | ((p) & 1)) << ((p) / 2 + 11));
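
The one-line expression above can be wrapped in a small function to check a
few values by hand. This is a minimal sketch; the function name
Lzma2DictSizeFromByte is an assumption made for this example, and the range
check on "p" reflects the list of sizes above rather than any reference code.

```cpp
#include <cassert>
#include <cstdint>

// Decodes the one-byte LZMA2 dictionary size value "p" (0..40).
inline uint32_t Lzma2DictSizeFromByte(unsigned p)
{
  assert(p <= 40);
  if (p == 40)
    return 0xFFFFFFFFu;
  // even p gives 2 << n, odd p gives 3 << n, where n = p / 2 + 11
  return ((uint32_t)2 | (p & 1)) << (p / 2 + 11);
}
```

For example, p = 0 gives 4 KiB (2 << 11), p = 1 gives 6 KiB (3 << 11),
p = 39 gives 3 GiB (3 << 30), and p = 40 selects the special value 0xFFFFFFFF.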

Also there is an additional limitation (lc + lp <= 4) in LZMA2 for the values of
the "lc" and "lp" properties:

if (lc + lp > 4)
  throw "Unsupported properties: (lc + lp) > 4";

This (lc + lp) limitation has some advantages for the LZMA decoder.
It reduces the maximum size of the tables allocated by the decoder.
And it reduces the complexity of the initialization procedure, which can be
important to keep high decoding speed for a big number of small LZMA streams.

It's recommended to use that limitation (lc + lp <= 4) for any new format
that uses LZMA compression. Note that the combinations of "lc" and "lp"
parameters, where (lc + lp > 4), can provide significant improvement in
compression ratio only in some rare cases.

The LZMA properties can be encoded into two bytes in the new scheme:

Offset Size Description

  0     1   The dictionary size encoded with LZMA2 scheme
  1     1   LZMA model properties (lc, lp, pb) in encoded form


The RAM usage
=============

The RAM usage for the LZMA decoder is determined by the following parts:

1) The Sliding Window (from 4 KiB to 4 GiB).
2) The probability model counter arrays (arrays of 16-bit variables).
3) Some additional state variables (about 10 variables of 32-bit integers).


The RAM usage for Sliding Window
--------------------------------

There are two main scenarios of decoding:

1) The decoding of full stream to one RAM buffer.

  If we decode the full LZMA stream to one output buffer in RAM, the decoder
  can use that output buffer as the sliding window. So the decoder doesn't
  need an additional buffer allocated for the sliding window.

2) The decoding to some external storage.

  If we decode the LZMA stream to external storage, the decoder must allocate
  the buffer for the sliding window. The size of that buffer must be equal to
  or larger than the value of dictionary size from properties of the LZMA stream.

In this specification we describe the code for decoding to some external
storage. The optimized version of code for decoding of full stream to one
output RAM buffer can require some minor changes in code.


The RAM usage for the probability model counters
------------------------------------------------

The size of the probability model counter arrays is calculated with the
following formula:

size_of_prob_arrays = 1846 + 768 * (1 << (lp + lc))

Each probability model counter is an 11-bit unsigned integer.
If we use 16-bit integer variables (2-byte integers) for these probability
model counters, the RAM usage required by probability model counter arrays
can be estimated with the following formula:

  RAM = 4 KiB + 1.5 KiB * (1 << (lp + lc))

For example, for default LZMA parameters (lp = 0 and lc = 3), the RAM usage is

  RAM_lc3_lp0 = 4 KiB + 1.5 KiB * 8 = 16 KiB

The maximum RAM state usage is required for decoding the stream with lp = 4
and lc = 8:

  RAM_lc8_lp4 = 4 KiB + 1.5 KiB * 4096 = 6148 KiB

If the decoder uses LZMA2's limited property condition
(lc + lp <= 4), the RAM usage will not be larger than

  RAM_lc_lp_4 = 4 KiB + 1.5 KiB * 16 = 28 KiB
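
The exact counter count and the rounded KiB estimates can be compared with a
short calculation. The sketch below assumes 2-byte counters, as suggested
above; the function names NumProbCounters and ProbArrayBytes are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Number of probability counters: 1846 + 768 * (1 << (lp + lc)).
inline uint32_t NumProbCounters(unsigned lc, unsigned lp)
{
  assert(lc <= 8 && lp <= 4);
  return 1846 + ((uint32_t)768 << (lc + lp));
}

// Bytes needed when each counter is stored in a 16-bit variable.
inline uint32_t ProbArrayBytes(unsigned lc, unsigned lp)
{
  return NumProbCounters(lc, lp) * 2;
}
```

For lc = 3, lp = 0 this gives 7990 counters (15980 bytes), slightly below the
16 KiB estimate. For lc = 8, lp = 4 it gives 3147574 counters (6295148 bytes),
slightly below 6148 KiB. The "4 KiB" term in the estimate is 1846 * 2 = 3692
bytes rounded up.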


The RAM usage for encoder
-------------------------

There are many variants of LZMA encoding code.
These variants have different values for memory consumption.
Note that memory consumption for the LZMA Encoder can not be
smaller than memory consumption of the LZMA Decoder for the same stream.

The RAM usage required by a modern effective implementation of
the LZMA Encoder can be estimated with the following formula:

  Encoder_RAM_Usage = 4 MiB + 11 * dictionarySize.

But there are some modes of the encoder that require less memory.


LZMA Decoding
=============

The LZMA compression algorithm uses LZ-based compression with Sliding Window
and Range Encoding as entropy coding method.


Sliding Window
--------------

LZMA uses Sliding Window compression similar to the LZ77 algorithm.

The LZMA stream must be decoded to the sequence that consists
of MATCHES and LITERALS:

  - a LITERAL is an 8-bit character (one byte).
    The decoder just puts that LITERAL to the uncompressed stream.

  - a MATCH is a pair of two numbers (DISTANCE-LENGTH pair).
    The decoder takes one byte exactly "DISTANCE" characters behind
    the current position in the uncompressed stream and puts it to
    the uncompressed stream. The decoder must repeat it "LENGTH" times.

The "DISTANCE" can not be larger than the dictionary size.
And the "DISTANCE" can not be larger than the number of bytes in
the uncompressed stream that were decoded before that match.

In this specification we use a cyclic buffer to implement the Sliding Window
for the LZMA decoder:

class COutWindow
{
  Byte *Buf;
  UInt32 Pos;
  UInt32 Size;
  bool IsFull;

public:
  unsigned TotalPos;
  COutStream OutStream;

  COutWindow(): Buf(NULL) {}
  ~COutWindow() { delete []Buf; }

  void Create(UInt32 dictSize)
  {
    Buf = new Byte[dictSize];
    Pos = 0;
    Size = dictSize;
    IsFull = false;
    TotalPos = 0;
  }

  void PutByte(Byte b)
  {
    TotalPos++;
    Buf[Pos++] = b;
    if (Pos == Size)
    {
      // wrap around: the buffer is cyclic
      Pos = 0;
      IsFull = true;
    }
    OutStream.WriteByte(b);
  }

  Byte GetByte(UInt32 dist) const
  {
    return Buf[dist <= Pos ? Pos - dist : Size - dist + Pos];
  }

  void CopyMatch(UInt32 dist, unsigned len)
  {
    // byte-by-byte copy, so the source and destination may overlap
    for (; len > 0; len--)
      PutByte(GetByte(dist));
  }

  bool CheckDistance(UInt32 dist) const
  {
    return dist <= Pos || IsFull;
  }

  bool IsEmpty() const
  {
    return Pos == 0 && !IsFull;
  }
};

In another implementation it's possible to use one buffer that contains
the Sliding Window and the whole data stream after uncompressing.
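
The single-buffer variant mentioned above can be sketched as follows. This is
only an illustration under stated assumptions: the struct name FlatWindow is
hypothetical, a std::vector stands in for the preallocated output buffer, and
"dist" here is the real (1-based) match distance.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef unsigned char Byte;

// Hypothetical single-buffer window: the output itself is the dictionary.
struct FlatWindow
{
  std::vector<Byte> Out;

  void PutByte(Byte b) { Out.push_back(b); }

  // dist is the real distance: 1 means "the previous byte".
  void CopyMatch(size_t dist, size_t len)
  {
    assert(dist >= 1 && dist <= Out.size());
    for (; len > 0; len--)
    {
      Byte b = Out[Out.size() - dist];  // read first, then append
      Out.push_back(b);
    }
  }
};
```

Because the copy goes one byte at a time, a match may overlap the bytes it
produces: after the literals 'a' and 'b', a match with distance 2 and length 4
yields "ababab", and a following match with distance 1 repeats the last byte.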


Range Decoder
-------------

The LZMA algorithm uses Range Encoding (1) as entropy coding method.

The LZMA stream contains just one very big number in big-endian encoding.
The LZMA decoder uses the Range Decoder to extract a sequence of binary
symbols from that big number.

The state of the Range Decoder:

struct CRangeDecoder
{
  UInt32 Range;
  UInt32 Code;
  InputStream *InStream;

  bool Corrupted;
};

The notes about the UInt32 type for the "Range" and "Code" variables:

  It's possible to use a 64-bit (unsigned or signed) integer type
  for the "Range" and the "Code" variables instead of 32-bit unsigned,
  but some additional code must be used to truncate the values to
  the low 32 bits after some operations.

  If the programming language does not support a 32-bit unsigned integer type
  (as is the case for the Java language), it's possible to use a 32-bit signed
  integer, but some code must be changed. For example, it's required to change
  the code that uses comparison operations for UInt32 variables in this
  specification.
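
One common way to adapt the comparisons is to flip the sign bit of both
operands before a signed comparison. The sketch below shows the idea in C++
with int32_t standing in for a language that only has signed integers; the
function name LessUnsigned is an assumption for this example, and two's
complement representation is assumed.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if "a" is smaller than "b" when both 32-bit patterns are
// interpreted as unsigned values, using only a signed comparison.
inline bool LessUnsigned(int32_t a, int32_t b)
{
  // Flipping the top bit maps the unsigned order onto the signed order.
  return (int32_t)(a ^ INT32_MIN) < (int32_t)(b ^ INT32_MIN);
}
```

With this helper, a test such as (Code < bound) would become
LessUnsigned(Code, bound) when "Code" and "bound" are kept in signed variables.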

The Range Decoder can be in some states that can be treated as
"Corruption" in the LZMA stream. The Range Decoder uses the variable "Corrupted":

  (Corrupted == false), if the Range Decoder has not detected any corruption.
  (Corrupted == true), if the Range Decoder has detected some corruption.

The reference LZMA Decoder ignores the value of the "Corrupted" variable.
So it continues to decode the stream, even if the corruption can be detected
in the Range Decoder. To provide full compatibility with the output of the
reference LZMA Decoder, other LZMA Decoder implementations must also
ignore the value of the "Corrupted" variable.

The LZMA Encoder is required to create only such LZMA streams that will not
lead the Range Decoder to states where the "Corrupted" variable is set to true.

The Range Decoder reads the first 5 bytes from the input stream to initialize
the state:

void CRangeDecoder::Init()
{
  Corrupted = false;

  // the first byte of the stream must be zero
  if (InStream->ReadByte() != 0)
    Corrupted = true;

  Range = 0xFFFFFFFF;
  Code = 0;
  for (int i = 0; i < 4; i++)
    Code = (Code << 8) | InStream->ReadByte();

  if (Code == Range)
    Corrupted = true;
}

The LZMA Encoder always writes ZERO in the initial byte of the compressed stream.
That scheme simplifies the code of the Range Encoder in the
LZMA Encoder.

After the last bit of data was decoded by the Range Decoder, the value of the
"Code" variable must be equal to 0. The LZMA Decoder must check it by
calling the IsFinishedOK() function:

  bool IsFinishedOK() const { return Code == 0; }

If there is corruption in the data stream, there is a big probability that
the "Code" value will not be equal to 0 at the end of decoding. So the
check in the IsFinishedOK() function provides a very good feature for
corruption detection.

The value of the "Range" variable before each bit decoding can not be smaller
than ((UInt32)1 << 24). The Normalize() function keeps the "Range" value in
the described range.

#define kTopValue ((UInt32)1 << 24)

void CRangeDecoder::Normalize()
{
  if (Range < kTopValue)
  {
    Range <<= 8;
    Code = (Code << 8) | InStream->ReadByte();
  }
}

Notes: if the size of the "Code" variable is larger than 32 bits, it's
required to keep only the low 32 bits of the "Code" variable after the change
in the Normalize() function.

If the LZMA Stream is not corrupted, the value of the "Code" variable is
always smaller than the value of the "Range" variable.
But the Range Decoder ignores some types of corruption, so the value of
the "Code" variable can be equal to or larger than the value of the "Range"
variable for some "Corrupted" archives.


LZMA uses Range Encoding only with binary symbols of two types:
  1) binary symbols with fixed and equal probabilities (direct bits)
  2) binary symbols with predicted probabilities

The DecodeDirectBits() function decodes the sequence of direct bits:

UInt32 CRangeDecoder::DecodeDirectBits(unsigned numBits)
{
  UInt32 res = 0;
  do
  {
    Range >>= 1;
    Code -= Range;
    // t is 0xFFFFFFFF if the subtraction went below zero (bit 0), else 0 (bit 1)
    UInt32 t = 0 - ((UInt32)Code >> 31);
    Code += Range & t;

    if (Code == Range)
      Corrupted = true;

    Normalize();
    res <<= 1;
    res += t + 1;
  }
  while (--numBits);
  return res;
}
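
The branchless arithmetic in DecodeDirectBits() can be harder to follow than
an explicit comparison. The sketch below isolates one step (without the
Normalize() call and the "Corrupted" check) and places it next to an
equivalent branching form. The struct and function names are hypothetical,
and the equivalence assumes the normal precondition Code < Range.

```cpp
#include <cassert>
#include <cstdint>

// Result of one direct-bit step (hypothetical helper type).
struct DirectBitStep
{
  uint32_t Range;
  uint32_t Code;
  unsigned Bit;
};

// Same arithmetic as the loop body of DecodeDirectBits().
inline DirectBitStep StepBranchless(uint32_t range, uint32_t code)
{
  range >>= 1;
  code -= range;
  uint32_t t = 0 - (code >> 31);  // all ones if code wrapped below zero
  code += range & t;              // undo the subtraction in that case
  DirectBitStep r = { range, code, (unsigned)(t + 1) };
  return r;
}

// Equivalent step written with an explicit comparison.
inline DirectBitStep StepBranching(uint32_t range, uint32_t code)
{
  range >>= 1;
  DirectBitStep r = { range, code, 0 };
  if (code >= range)
  {
    r.Code = code - range;
    r.Bit = 1;
  }
  return r;
}
```

Each direct bit halves the range: the bit is 1 when "Code" falls into the
upper half, and in that case the half-range is subtracted from "Code".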


The Bit Decoding with Probability Model
---------------------------------------

The task of the Bit Probability Model is to estimate probabilities of binary
symbols. It then provides the Range Decoder with that information.
Better prediction provides a better compression ratio.
The Bit Probability Model uses statistical data of previously decoded
symbols.

The estimated probability is presented as an 11-bit unsigned integer value
that represents the probability of symbol "0".

#define kNumBitModelTotalBits 11

Mathematical probabilities can be presented with the following formulas:
     probability(symbol_0) = prob / 2048.
     probability(symbol_1) = 1 - probability(symbol_0) =
                           = 1 - prob / 2048 =
                           = (2048 - prob) / 2048
where the "prob" variable contains the 11-bit integer probability counter.

It's recommended to use a 16-bit unsigned integer type to store these 11-bit
probability values:

typedef UInt16 CProb;

Each probability value must be initialized with the value ((1 << 11) / 2),
which represents the state where the probabilities of symbols 0 and 1
are both equal to 0.5:

#define PROB_INIT_VAL ((1 << kNumBitModelTotalBits) / 2)

The INIT_PROBS macro is used to initialize the array of CProb variables:

#define INIT_PROBS(p) \
 { for (unsigned i = 0; i < sizeof(p) / sizeof(p[0]); i++) p[i] = PROB_INIT_VAL; }


The DecodeBit() function decodes one bit.
The LZMA decoder provides the pointer to the CProb variable that contains
information about the estimated probability for symbol 0, and the Range Decoder
updates that CProb variable after decoding. The Range Decoder increases
the estimated probability of the symbol that was decoded:

#define kNumMoveBits 5

unsigned CRangeDecoder::DecodeBit(CProb *prob)
{
  unsigned v = *prob;
  // "bound" splits the current range in proportion to probability(symbol_0)
  UInt32 bound = (Range >> kNumBitModelTotalBits) * v;
  unsigned symbol;
  if (Code < bound)
  {
    v += ((1 << kNumBitModelTotalBits) - v) >> kNumMoveBits;
    Range = bound;
    symbol = 0;
  }
  else
  {
    v -= v >> kNumMoveBits;
    Code -= bound;
    Range -= bound;
    symbol = 1;
  }
  *prob = (CProb)v;
  Normalize();
  return symbol;
}
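
The counter update in DecodeBit() can be studied on its own. The following
sketch repeats only the two update formulas; the function name UpdateProb is
hypothetical and the constants 11 and 5 are the values of
kNumBitModelTotalBits and kNumMoveBits given above.

```cpp
#include <cassert>
#include <cstdint>

// Applies the DecodeBit() counter update for one decoded symbol (0 or 1).
inline uint16_t UpdateProb(uint16_t prob, unsigned symbol)
{
  assert(symbol <= 1);
  unsigned v = prob;
  if (symbol == 0)
    v += ((1u << 11) - v) >> 5;  // move towards 2048: symbol 0 more likely
  else
    v -= v >> 5;                 // move towards 0: symbol 1 more likely
  return (uint16_t)v;
}
```

Starting from the initial value 1024, one decoded 0 gives 1056 and one decoded
1 gives 992. Because of the shift by 5, the counter stops changing once it
reaches 2017 (for a long run of zeros) or 31 (for a long run of ones), so it
never reaches 0 or 2048.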


The Binary Tree of bit model counters
-------------------------------------

LZMA uses a tree of Bit model variables to decode a symbol that needs
several bits for storing. There are two versions of such trees in LZMA:
  1) the tree that decodes bits from high bit to low bit (the normal scheme).
  2) the tree that decodes bits from low bit to high bit (the reverse scheme).

Each binary tree structure supports a different size of decoded symbol
(the size of the binary sequence that contains the value of the symbol).
If the size of the decoded symbol is "NumBits" bits, the tree structure
uses the array of (1 << NumBits) counters of CProb type.
But only ((1 << NumBits) - 1) items are used by encoder and decoder.
The first item (the item with index equal to 0) in the array is unused.
That scheme with an unused array item simplifies the code.

unsigned BitTreeReverseDecode(CProb *probs, unsigned numBits, CRangeDecoder *rc)
{
  unsigned m = 1;
  unsigned symbol = 0;
  for (unsigned i = 0; i < numBits; i++)
  {
    unsigned bit = rc->DecodeBit(&probs[m]);
    m <<= 1;
    m += bit;
    symbol |= (bit << i);
  }
  return symbol;
}

template <unsigned NumBits>
class CBitTreeDecoder
{
  CProb Probs[(unsigned)1 << NumBits];

public:

  void Init()
  {
    INIT_PROBS(Probs);
  }

  unsigned Decode(CRangeDecoder *rc)
  {
    unsigned m = 1;
    for (unsigned i = 0; i < NumBits; i++)
      m = (m << 1) + rc->DecodeBit(&Probs[m]);
    return m - ((unsigned)1 << NumBits);
  }

  unsigned ReverseDecode(CRangeDecoder *rc)
  {
    return BitTreeReverseDecode(Probs, NumBits, rc);
  }
};
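
To see which counters a tree walk touches, the sketch below replaces the
Range Decoder with a fixed list of bits and records every visited index.
BitList, TreeDecode and TreeReverseDecode are hypothetical stand-ins written
for this illustration; the index arithmetic follows the Decode() and
BitTreeReverseDecode() code above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the Range Decoder: hands out bits from a fixed list.
struct BitList
{
  std::vector<unsigned> Bits;
  size_t Next;
  unsigned Get() { assert(Next < Bits.size()); return Bits[Next++]; }
};

// Normal scheme: the first decoded bit becomes the highest bit of the symbol.
inline unsigned TreeDecode(BitList &src, unsigned numBits,
                           std::vector<unsigned> &visited)
{
  unsigned m = 1;
  for (unsigned i = 0; i < numBits; i++)
  {
    visited.push_back(m);  // index of the counter that would be used
    m = (m << 1) + src.Get();
  }
  return m - (1u << numBits);
}

// Reverse scheme: the first decoded bit becomes the lowest bit of the symbol.
inline unsigned TreeReverseDecode(BitList &src, unsigned numBits,
                                  std::vector<unsigned> &visited)
{
  unsigned m = 1;
  unsigned symbol = 0;
  for (unsigned i = 0; i < numBits; i++)
  {
    visited.push_back(m);
    unsigned bit = src.Get();
    m = (m << 1) + bit;
    symbol |= bit << i;
  }
  return symbol;
}
```

For the bit sequence 1, 1, 0 and numBits = 3, both walks visit the counters
with indexes 1, 3 and 7. The normal scheme returns 6 (binary 110) and the
reverse scheme returns 3 (binary 011). The counter with index 0 is never
visited, and the largest visited index is below (1 << numBits).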


LZ part of LZMA
---------------

The LZ part of LZMA describes details about the decoding of MATCHES and LITERALS.


The Literal Decoding
--------------------

The LZMA Decoder uses (1 << (lc + lp)) tables with CProb values, where
each table contains 0x300 CProb values:

CProb *LitProbs;

void CreateLiterals()
{
  LitProbs = new CProb[(UInt32)0x300 << (lc + lp)];
}

void InitLiterals()
{
  UInt32 num = (UInt32)0x300 << (lc + lp);
  for (UInt32 i = 0; i < num; i++)
    LitProbs[i] = PROB_INIT_VAL;
}

To select the table for decoding it uses the context that consists of
(lc) high bits from the previous literal and (lp) low bits from the value that
represents the current position in outputStream.

If (state >= 7), the Literal Decoder also uses "matchByte" that represents
the byte in OutputStream at the position that is DISTANCE bytes before
the current position, where DISTANCE is the distance in the DISTANCE-LENGTH pair
of the latest decoded match.

The following code decodes one literal and puts it to the Sliding Window buffer:

void DecodeLiteral(unsigned state, UInt32 rep0)
{
  unsigned prevByte = 0;
  if (!OutWindow.IsEmpty())
    prevByte = OutWindow.GetByte(1);

  unsigned symbol = 1;
  unsigned litState = ((OutWindow.TotalPos & ((1 << lp) - 1)) << lc) + (prevByte >> (8 - lc));
  CProb *probs = &LitProbs[(UInt32)0x300 * litState];

  if (state >= 7)
  {
    // the previous packet was a match: use the byte at distance rep0 as context
    unsigned matchByte = OutWindow.GetByte(rep0 + 1);
    do
    {
      unsigned matchBit = (matchByte >> 7) & 1;
      matchByte <<= 1;
      unsigned bit = RangeDec.DecodeBit(&probs[((1 + matchBit) << 8) + symbol]);
      symbol = (symbol << 1) | bit;
      if (matchBit != bit)
        break;
    }
    while (symbol < 0x100);
  }
  // decode the remaining bits without the matchByte context
  while (symbol < 0x100)
    symbol = (symbol << 1) | RangeDec.DecodeBit(&probs[symbol]);
  OutWindow.PutByte((Byte)(symbol - 0x100));
}
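
The table selection can be checked with a few concrete numbers. The sketch
below isolates the "litState" expression from DecodeLiteral(); the function
name LitState and its parameter list are assumptions made for this example.

```cpp
#include <cassert>

// Index of the literal probability table, as computed in DecodeLiteral().
// totalPos is the number of bytes decoded so far, prevByte the last one.
inline unsigned LitState(unsigned lc, unsigned lp,
                         unsigned totalPos, unsigned prevByte)
{
  assert(lc <= 8 && lp <= 4 && prevByte <= 0xFF);
  unsigned posPart = totalPos & ((1u << lp) - 1);  // lp low bits of position
  unsigned ctxPart = prevByte >> (8 - lc);         // lc high bits of prevByte
  return (posPart << lc) + ctxPart;
}
```

With the default lc = 3, lp = 0 and a previous byte 0xE5 (binary 11100101),
the three high bits are 111, so table 7 is selected and the counters start at
offset 0x300 * 7 in LitProbs. With lc = 3, lp = 1, position 5 and previous
byte 0x20, the result is (1 << 3) + 1 = 9.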


The match length decoding
-------------------------

The match length decoder returns the normalized (zero-based)
length of the match. That value can be converted to the real length of the match
with the following code:

#define kMatchMinLen 2

    matchLen = len + kMatchMinLen;

The match length decoder can return values from 0 to 271.
And the corresponding real match length values can be in the range
from 2 to 273.

The following scheme is used for the match length encoding:

  Binary encoding    Binary Tree structure    Zero-based match length
  sequence                                    (binary + decimal):

  0 xxx              LowCoder[posState]       xxx
  1 0 yyy            MidCoder[posState]       yyy + 8
  1 1 zzzzzzzz       HighCoder                zzzzzzzz + 16

LZMA uses the bit model variable "Choice" to decode the first selection bit.

If the first selection bit is equal to 0, the decoder uses the binary tree
  LowCoder[posState] to decode the 3-bit zero-based match length (xxx).

If the first selection bit is equal to 1, the decoder uses the bit model
  variable "Choice2" to decode the second selection bit.

  If the second selection bit is equal to 0, the decoder uses the binary tree
    MidCoder[posState] to decode the 3-bit "yyy" value, and the zero-based match
    length is equal to (yyy + 8).

  If the second selection bit is equal to 1, the decoder uses the binary tree
    HighCoder to decode the 8-bit "zzzzzzzz" value, and the zero-based
    match length is equal to (zzzzzzzz + 16).

LZMA uses the "posState" value as context to select the binary tree
from the LowCoder and MidCoder binary tree arrays:

    unsigned posState = OutWindow.TotalPos & ((1 << pb) - 1);

The full code of the length decoder:

class CLenDecoder
{
  CProb Choice;
  CProb Choice2;
  CBitTreeDecoder<3> LowCoder[1 << kNumPosBitsMax];
  CBitTreeDecoder<3> MidCoder[1 << kNumPosBitsMax];
  CBitTreeDecoder<8> HighCoder;

public:

  void Init()
  {
    Choice = PROB_INIT_VAL;
    Choice2 = PROB_INIT_VAL;
    HighCoder.Init();
    for (unsigned i = 0; i < (1 << kNumPosBitsMax); i++)
    {
      LowCoder[i].Init();
      MidCoder[i].Init();
    }
  }

  unsigned Decode(CRangeDecoder *rc, unsigned posState)
  {
    if (rc->DecodeBit(&Choice) == 0)
      return LowCoder[posState].Decode(rc);
    if (rc->DecodeBit(&Choice2) == 0)
      return 8 + MidCoder[posState].Decode(rc);
    return 16 + HighCoder.Decode(rc);
  }
};
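
Read in the other direction, the scheme tells which of the three trees carries
a given match length. The sketch below maps a real match length to the coder
and the value stored in its tree; LenCoder, LenSplit and SplitLen are
hypothetical names, and the function only illustrates the ranges, it is not
part of any encoder described here.

```cpp
#include <cassert>

enum LenCoder { kLowCoder, kMidCoder, kHighCoder };

// Which tree is used and which value that tree stores (hypothetical type).
struct LenSplit
{
  LenCoder Coder;
  unsigned TreeValue;
};

// matchLen is the real match length, from 2 to 273.
inline LenSplit SplitLen(unsigned matchLen)
{
  assert(matchLen >= 2 && matchLen <= 273);
  unsigned len = matchLen - 2;  // zero-based length, kMatchMinLen = 2
  LenSplit r;
  if (len < 8)
  {
    r.Coder = kLowCoder;        // selection bit: 0
    r.TreeValue = len;
  }
  else if (len < 16)
  {
    r.Coder = kMidCoder;        // selection bits: 1 0
    r.TreeValue = len - 8;
  }
  else
  {
    r.Coder = kHighCoder;       // selection bits: 1 1
    r.TreeValue = len - 16;
  }
  return r;
}
```

Real lengths 2..9 go to LowCoder, 10..17 to MidCoder and 18..273 to HighCoder,
whose 8-bit value covers 0..255.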

The LZMA decoder uses two instances of the CLenDecoder class.
The first instance is for the matches of "Simple Match" type,
and the second instance is for the matches of "Rep Match" type:

  CLenDecoder LenDecoder;
  CLenDecoder RepLenDecoder;


The match distance decoding
---------------------------

LZMA supports dictionary sizes up to 4 GiB minus 1.
The value of match distance (decoded by the distance decoder) can be
from 1 to 2^32. But the distance value that is equal to 2^32 is used to
indicate the "End of stream" marker. So the real largest match distance
that is used for an LZ-window match is (2^32 - 1).

LZMA uses the normalized match length (zero-based length)
to calculate the context state "lenState" to decode the distance value:

#define kNumLenToPosStates 4

  unsigned lenState = len;
  if (lenState > kNumLenToPosStates - 1)
    lenState = kNumLenToPosStates - 1;

The distance decoder returns the "dist" value that is the zero-based value
of match distance. The real match distance can be calculated with the
following code:

  matchDistance = dist + 1;

The state of the distance decoder and the initialization code:

  #define kEndPosModelIndex 14
  #define kNumFullDistances (1 << (kEndPosModelIndex >> 1))
  #define kNumAlignBits 4

  CBitTreeDecoder<6> PosSlotDecoder[kNumLenToPosStates];
  CProb PosDecoders[1 + kNumFullDistances - kEndPosModelIndex];
  CBitTreeDecoder<kNumAlignBits> AlignDecoder;

  void InitDist()
  {
    for (unsigned i = 0; i < kNumLenToPosStates; i++)
      PosSlotDecoder[i].Init();
    AlignDecoder.Init();
    INIT_PROBS(PosDecoders);
  }

At the first stage the distance decoder decodes the 6-bit "posSlot" value with
the bit tree decoder from the PosSlotDecoder array. It's possible to get
2^6 = 64 different "posSlot" values.

  unsigned posSlot = PosSlotDecoder[lenState].Decode(&RangeDec);

The encoding scheme for the distance value is shown in the following table:

posSlot (decimal) /
      zero-based distance (binary)
 0    0
 1    1
 2    10
 3    11

 4    10 x
 5    11 x
 6    10 xx
 7    11 xx
 8    10 xxx
 9    11 xxx
10    10 xxxx
11    11 xxxx
12    10 xxxxx
13    11 xxxxx

14    10 yy zzzz
15    11 yy zzzz
16    10 yyy zzzz
17    11 yyy zzzz
...
62    10 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz
63    11 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz

where
  "x ... x" means the sequence of binary symbols encoded with a binary tree and
      the "Reverse" scheme. It uses a separate binary tree for each posSlot
      from 4 to 13.
  "y" means a direct bit encoded with the range coder.
  "zzzz" means the sequence of four binary symbols encoded with a binary
      tree with the "Reverse" scheme, where one common binary tree "AlignDecoder"
      is used for all posSlot values.

If (posSlot < 4), the "dist" value is equal to the posSlot value.

If (posSlot >= 4), the decoder uses the "posSlot" value to calculate the value of
  the high bits of the "dist" value and the number of the low bits.

  If (4 <= posSlot < kEndPosModelIndex), the decoder uses bit tree decoders
    (one separate bit tree decoder per posSlot value) and the "Reverse" scheme.
    In this implementation we use one CProb array "PosDecoders" that contains
    all CProb variables for all these bit decoders.

  If (posSlot >= kEndPosModelIndex), the middle bits are decoded as direct
    bits from RangeDecoder and the low 4 bits are decoded with the bit tree
    decoder "AlignDecoder" with the "Reverse" scheme.

The code to decode the zero-based match distance:

unsigned DecodeDistance(unsigned len)
{
  unsigned lenState = len;
  if (lenState > kNumLenToPosStates - 1)
    lenState = kNumLenToPosStates - 1;

  unsigned posSlot = PosSlotDecoder[lenState].Decode(&RangeDec);
  if (posSlot < 4)
    return posSlot;

  unsigned numDirectBits = (unsigned)((posSlot >> 1) - 1);
  // high bits of the distance: "10" or "11" followed by numDirectBits zeros
  UInt32 dist = ((2 | (posSlot & 1)) << numDirectBits);
  if (posSlot < kEndPosModelIndex)
    dist += BitTreeReverseDecode(PosDecoders + dist - posSlot, numDirectBits, &RangeDec);
  else
  {
    dist += RangeDec.DecodeDirectBits(numDirectBits - kNumAlignBits) << kNumAlignBits;
    dist += AlignDecoder.ReverseDecode(&RangeDec);
  }
  return dist;
}
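
The relation between "posSlot", the number of extra bits and the smallest
distance in each slot can be tabulated with a few lines of code. This sketch
repeats only the arithmetic from DecodeDistance(); SlotInfo and
DescribePosSlot are hypothetical names introduced for the illustration.

```cpp
#include <cassert>
#include <cstdint>

// Extra-bit count and smallest zero-based distance of one posSlot.
struct SlotInfo
{
  unsigned NumDirectBits;
  uint32_t Base;
};

inline SlotInfo DescribePosSlot(unsigned posSlot)
{
  assert(posSlot < 64);
  SlotInfo s;
  s.NumDirectBits = 0;
  s.Base = posSlot;  // slots 0..3 encode the distance directly
  if (posSlot >= 4)
  {
    s.NumDirectBits = (posSlot >> 1) - 1;
    s.Base = (uint32_t)(2 | (posSlot & 1)) << s.NumDirectBits;
  }
  return s;
}
```

For example, posSlot 4 has 1 extra bit and covers the distances 4..5,
posSlot 14 has 6 extra bits (2 direct bits plus 4 align bits) and starts at
128, and posSlot 63 has 30 extra bits and starts at 0xC0000000, so its largest
value is 0xFFFFFFFF.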


LZMA Decoding modes
-------------------

There are 2 types of LZMA streams:

1) The stream with "End of stream" marker.
2) The stream without "End of stream" marker.

And the LZMA Decoder supports 3 modes of decoding:

1) The unpack size is undefined. The LZMA decoder stops decoding after
   getting the "End of stream" marker.
   The input variables for that case:

      markerIsMandatory = true
      unpackSizeDefined = false
      unpackSize contains any value

2) The unpack size is defined and the LZMA decoder supports both variants,
   where the stream can contain the "End of stream" marker or the stream is
   finished without the "End of stream" marker. The LZMA decoder must detect
   any of these situations.
   The input variables for that case:

      markerIsMandatory = false
      unpackSizeDefined = true
      unpackSize contains unpack size

3) The unpack size is defined and the LZMA stream must contain
   the "End of stream" marker.
   The input variables for that case:

      markerIsMandatory = true
      unpackSizeDefined = true
      unpackSize contains unpack size
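
For a .lzma file, the three input variables can be derived from the 8-byte
"Uncompressed size" field of the header. The sketch below is one possible
mapping, under the assumption that a file with a known size is decoded in
mode 2 (marker optional) and a file with an unknown size in mode 1. The names
DecodeMode and ModeFromLzmaHeader are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// The three input variables of the decoder (hypothetical grouping).
struct DecodeMode
{
  bool unpackSizeDefined;
  bool markerIsMandatory;
  uint64_t unpackSize;
};

// header points to the 13-byte lzma file header described earlier.
inline DecodeMode ModeFromLzmaHeader(const uint8_t *header)
{
  assert(header != 0);
  uint64_t size = 0;
  bool allOnes = true;
  for (int i = 0; i < 8; i++)
  {
    uint8_t b = header[5 + i];  // "Uncompressed size" starts at offset 5
    if (b != 0xFF)
      allOnes = false;
    size |= (uint64_t)b << (8 * i);  // little-endian
  }
  DecodeMode m;
  m.unpackSizeDefined = !allOnes;
  m.markerIsMandatory = allOnes;  // assumption: marker optional if size known
  m.unpackSize = size;
  return m;
}
```

Mode 3 (size defined and marker mandatory) cannot be signalled by this header
alone, so a container format that needs it has to request it by other means.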
  657. The main loop of decoder
  658. ------------------------
  659. The main loop of LZMA decoder:
  660. Initialize the LZMA state.
  661. loop
  662. {
  663. // begin of loop
  664. Check "end of stream" conditions.
  665. Decode Type of MATCH / LITERAL.
  666. If it's LITERAL, decode LITERAL value and put the LITERAL to Window.
  667. If it's MATCH, decode the length of match and the match distance.
  668. Check error conditions, check end of stream conditions and copy
  669. the sequence of match bytes from sliding window to current position
  670. in window.
  671. Go to begin of loop
  672. }
  673. The reference implementation of LZMA decoder uses "unpackSize" variable
  674. to keep the number of remaining bytes in output stream. So it reduces
  675. "unpackSize" value after each decoded LITERAL or MATCH.
  676. The following code contains the "end of stream" condition check at the start
  677. of the loop:
  678. if (unpackSizeDefined && unpackSize == 0 && !markerIsMandatory)
  679. if (RangeDec.IsFinishedOK())
  680. return LZMA_RES_FINISHED_WITHOUT_MARKER;
  681. LZMA uses three types of matches:
  682. 1) "Simple Match" - the match with distance value encoded with bit models.
  683. 2) "Rep Match" - the match that uses the distance from distance
  684. history table.
  685. 3) "Short Rep Match" - the match of single byte length, that uses the latest
  686. distance from distance history table.
  687. The LZMA decoder keeps the history of latest 4 match distances that were used
  688. by decoder. That set of 4 variables contains zero-based match distances and
  689. these variables are initialized with zero values:
  690. UInt32 rep0 = 0, rep1 = 0, rep2 = 0, rep3 = 0;
  691. The LZMA decoder uses binary model variables to select type of MATCH or LITERAL:
  692. #define kNumStates 12
  693. #define kNumPosBitsMax 4
  694. CProb IsMatch[kNumStates << kNumPosBitsMax];
  695. CProb IsRep[kNumStates];
  696. CProb IsRepG0[kNumStates];
  697. CProb IsRepG1[kNumStates];
  698. CProb IsRepG2[kNumStates];
  699. CProb IsRep0Long[kNumStates << kNumPosBitsMax];
  700. The decoder uses "state" variable value to select exact variable
  701. from "IsRep", "IsRepG0", "IsRepG1" and "IsRepG2" arrays.
  702. The "state" variable can get the value from 0 to 11.
  703. Initial value for "state" variable is zero:
  704. unsigned state = 0;
  705. The "state" variable is updated after each LITERAL or MATCH with one of the
  706. following functions:
  707. unsigned UpdateState_Literal(unsigned state)
  708. {
  709.   if (state < 4) return 0;
  710.   else if (state < 10) return state - 3;
  711.   else return state - 6;
  712. }
  713. unsigned UpdateState_Match (unsigned state) { return state < 7 ? 7 : 10; }
  714. unsigned UpdateState_Rep (unsigned state) { return state < 7 ? 8 : 11; }
  715. unsigned UpdateState_ShortRep(unsigned state) { return state < 7 ? 9 : 11; }
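As an illustration of how these functions interact, the following sketch replays a short sequence of symbols and records the state after each one. The four update functions are copied from the text above; the "SymbolKind" enum and the "TraceStates" helper are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const unsigned kNumStates = 12;

unsigned UpdateState_Literal(unsigned state)
{
  if (state < 4) return 0;
  else if (state < 10) return state - 3;
  else return state - 6;
}
unsigned UpdateState_Match   (unsigned state) { return state < 7 ? 7 : 10; }
unsigned UpdateState_Rep     (unsigned state) { return state < 7 ? 8 : 11; }
unsigned UpdateState_ShortRep(unsigned state) { return state < 7 ? 9 : 11; }

// Hypothetical symbol tags, used only by this example.
enum SymbolKind { kLiteral, kMatch, kRep, kShortRep };

// Replays the symbols starting from the initial state 0 and returns
// the value of "state" after each symbol.
std::vector<unsigned> TraceStates(const std::vector<SymbolKind> &symbols)
{
  std::vector<unsigned> trace;
  unsigned state = 0;
  for (std::size_t i = 0; i < symbols.size(); i++)
  {
    switch (symbols[i])
    {
      case kLiteral:  state = UpdateState_Literal(state);  break;
      case kMatch:    state = UpdateState_Match(state);    break;
      case kRep:      state = UpdateState_Rep(state);      break;
      case kShortRep: state = UpdateState_ShortRep(state); break;
    }
    assert(state < kNumStates);  // every update keeps state in [0, 11]
    trace.push_back(state);
  }
  return trace;
}
```

For the sequence MATCH, LITERAL, LITERAL, LITERAL the recorded states are 7, 4, 1, 0. A state value below 7 always means that the previous symbol was a LITERAL, which is why the three match-type functions only test "state < 7".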
  716. The decoder calculates "state2" variable value to select exact variable from
  717. "IsMatch" and "IsRep0Long" arrays:
  718. unsigned posState = OutWindow.TotalPos & ((1 << pb) - 1);
  719. unsigned state2 = (state << kNumPosBitsMax) + posState;
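The index calculation can be sketched as a standalone function. The name "GetState2" and its parameter list are assumptions made for this example; the arithmetic is the same as in the two lines above.

```cpp
#include <cassert>
#include <stdint.h>

const unsigned kNumStates = 12;
const unsigned kNumPosBitsMax = 4;

// Returns the index into the "IsMatch" / "IsRep0Long" arrays for the given
// state, total output position and "pb" property value.
unsigned GetState2(unsigned state, uint32_t totalPos, unsigned pb)
{
  assert(state < kNumStates);
  assert(pb <= kNumPosBitsMax);
  // Only the low "pb" bits of the output position are used.
  unsigned posState = (unsigned)(totalPos & ((1u << pb) - 1));
  // Each state owns a row of (1 << kNumPosBitsMax) entries.
  return (state << kNumPosBitsMax) + posState;
}
```

The row size is always (1 << kNumPosBitsMax), regardless of "pb", so for pb < 4 only the first (1 << pb) entries of each row are ever selected.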
  720. The decoder uses the following code flow scheme to select exact
  721. type of LITERAL or MATCH:
  722. IsMatch[state2] decode
  723.   0 - the Literal
  724.   1 - the Match
  725.     IsRep[state] decode
  726.       0 - Simple Match
  727.       1 - Rep Match
  728.         IsRepG0[state] decode
  729.           0 - the distance is rep0
  730.             IsRep0Long[state2] decode
  731.               0 - Short Rep Match
  732.               1 - Rep Match 0
  733.           1 -
  734.             IsRepG1[state] decode
  735.               0 - Rep Match 1
  736.               1 -
  737.                 IsRepG2[state] decode
  738.                   0 - Rep Match 2
  739.                   1 - Rep Match 3
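The scheme above can be read as a small decision function. The sketch below is only an illustration: "CBitSource" is a hypothetical stand-in that returns already decoded bits in order, whereas a real decoder would decode each bit with the range decoder and the bit model named in the comment. The "SymbolType" enum is also introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum SymbolType
{
  kTypeLiteral, kTypeSimpleMatch, kTypeShortRep,
  kTypeRep0, kTypeRep1, kTypeRep2, kTypeRep3
};

// Hypothetical replacement for the range decoder: hands out prepared bits.
struct CBitSource
{
  std::vector<unsigned> Bits;
  std::size_t Pos;

  CBitSource(const std::vector<unsigned> &bits): Bits(bits), Pos(0) {}
  unsigned Next() { assert(Pos < Bits.size()); return Bits[Pos++]; }
};

// Each Next() call corresponds to one "decode" step of the scheme.
SymbolType SelectSymbolType(CBitSource &src)
{
  if (src.Next() == 0) return kTypeLiteral;              // IsMatch[state2]
  if (src.Next() == 0) return kTypeSimpleMatch;          // IsRep[state]
  if (src.Next() == 0)                                   // IsRepG0[state]
    return src.Next() == 0 ? kTypeShortRep : kTypeRep0;  // IsRep0Long[state2]
  if (src.Next() == 0) return kTypeRep1;                 // IsRepG1[state]
  return src.Next() == 0 ? kTypeRep2 : kTypeRep3;        // IsRepG2[state]
}
```

A LITERAL costs one binary decision, a Simple Match two, and "Rep Match 2" or "Rep Match 3" five.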
  740. LITERAL symbol
  741. --------------
  742. If the value "0" was decoded with IsMatch[state2] decoding, we have "LITERAL" type.
  743. At first the LZMA decoder must check that it doesn't exceed
  744. specified uncompressed size:
  745. if (unpackSizeDefined && unpackSize == 0)
  746.   return LZMA_RES_ERROR;
  747. Then it decodes literal value and puts it to sliding window:
  748. DecodeLiteral(state, rep0);
  749. Then the decoder must update the "state" value and the "unpackSize" value:
  750. state = UpdateState_Literal(state);
  751. unpackSize--;
  752. Then the decoder must go to the beginning of the main loop to decode the next MATCH or LITERAL.
  753. Simple Match
  754. ------------
  755. If the value "1" was decoded with IsMatch[state2] decoding and the value "0"
  756. was then decoded with IsRep[state] decoding, we have the "Simple Match" type.
  757. The distance history table is updated with the following scheme:
  758. rep3 = rep2;
  759. rep2 = rep1;
  760. rep1 = rep0;
  761. The zero-based length is decoded with "LenDecoder":
  762. len = LenDecoder.Decode(&RangeDec, posState);
  763. The state is updated with the UpdateState_Match function:
  764. state = UpdateState_Match(state);
  765. and the new "rep0" value is decoded with DecodeDistance:
  766. rep0 = DecodeDistance(len);
  767. That "rep0" will be used as zero-based distance for current match.
  768. If the value of "rep0" is equal to 0xFFFFFFFF, it means that we have
  769. "End of stream" marker, so we can stop decoding and check finishing
  770. condition in Range Decoder:
  771. if (rep0 == 0xFFFFFFFF)
  772.   return RangeDec.IsFinishedOK() ?
  773.       LZMA_RES_FINISHED_WITH_MARKER :
  774.       LZMA_RES_ERROR;
  775. If uncompressed size is defined, LZMA decoder must check that it doesn't
  776. exceed that specified uncompressed size:
  777. if (unpackSizeDefined && unpackSize == 0)
  778.   return LZMA_RES_ERROR;
  779. Also the decoder must check that the "rep0" value is smaller than the dictionary
  780. size and is not larger than the number of already decoded bytes:
  781. if (rep0 >= dictSize || !OutWindow.CheckDistance(rep0))
  782.   return LZMA_RES_ERROR;
  783. Then the decoder must copy match bytes as described in
  784. "The match symbols copying" section.
  785. Rep Match
  786. ---------
  787. If the LZMA decoder has decoded the value "1" with IsRep[state] variable,
  788. we have "Rep Match" type.
  789. At first the LZMA decoder must check that it doesn't exceed
  790. specified uncompressed size:
  791. if (unpackSizeDefined && unpackSize == 0)
  792.   return LZMA_RES_ERROR;
  793. Also the decoder must return an error if the LZ window is empty:
  794. if (OutWindow.IsEmpty())
  795.   return LZMA_RES_ERROR;
  796. If the match type is "Rep Match", the decoder uses one of the 4 variables of
  797. distance history table to get the value of distance for current match.
  798. And there are 4 corresponding ways of decoding flow.
  799. The decoder updates the distance history with the following scheme,
  800. depending on the type of match:
  801. - "Rep Match 0" or "Short Rep Match":
  802.     ; LZMA doesn't update the distance history
  803. - "Rep Match 1":
  804.     UInt32 dist = rep1;
  805.     rep1 = rep0;
  806.     rep0 = dist;
  807. - "Rep Match 2":
  808.     UInt32 dist = rep2;
  809.     rep2 = rep1;
  810.     rep1 = rep0;
  811.     rep0 = dist;
  812. - "Rep Match 3":
  813.     UInt32 dist = rep3;
  814.     rep3 = rep2;
  815.     rep2 = rep1;
  816.     rep1 = rep0;
  817.     rep0 = dist;
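The four cases above follow one pattern: the selected distance moves to the front of the history and the more recent distances shift down by one slot. A minimal sketch of that pattern, assuming the four variables are kept in an array (the "CRepHistory" type and its method names are hypothetical and not part of the reference decoder):

```cpp
#include <cassert>
#include <stdint.h>

struct CRepHistory
{
  uint32_t Reps[4];  // Reps[0] is rep0, ..., Reps[3] is rep3

  // "Rep Match <index>": moves Reps[index] to the front.
  // For index 0 the history is left unchanged.
  void MoveToFront(unsigned index)
  {
    assert(index < 4);
    uint32_t dist = Reps[index];
    for (unsigned i = index; i > 0; i--)
      Reps[i] = Reps[i - 1];
    Reps[0] = dist;
  }

  // "Simple Match": all slots shift down, the oldest distance is dropped
  // and the newly decoded distance becomes rep0.
  void PushNew(uint32_t dist)
  {
    Reps[3] = Reps[2];
    Reps[2] = Reps[1];
    Reps[1] = Reps[0];
    Reps[0] = dist;
  }
};
```

With the history (10, 20, 30, 40), "Rep Match 2" produces (30, 10, 20, 40), which matches the explicit assignments listed above.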
  818. Then the decoder decodes the exact subtype of "Rep Match" using "IsRepG0", "IsRep0Long",
  819. "IsRepG1", "IsRepG2".
  820. If the subtype is "Short Rep Match", the decoder updates the state, copies
  821. one byte from the window to the current position in the window and goes to the
  822. next MATCH/LITERAL symbol (the beginning of the main loop):
  823. state = UpdateState_ShortRep(state);
  824. OutWindow.PutByte(OutWindow.GetByte(rep0 + 1));
  825. unpackSize--;
  826. continue;
  827. In other cases (Rep Match 0/1/2/3), it decodes the zero-based
  828. length of match with "RepLenDecoder" decoder:
  829. len = RepLenDecoder.Decode(&RangeDec, posState);
  830. Then it updates the state:
  831. state = UpdateState_Rep(state);
  832. Then the decoder must copy match bytes as described in
  833. "The match symbols copying" section.
  834. The match symbols copying
  835. -------------------------
  836. If we have the match (Simple Match or Rep Match 0/1/2/3), the decoder must
  837. copy the sequence of bytes with calculated match distance and match length.
  838. If uncompressed size is defined, the LZMA decoder must check that the match doesn't
  839. exceed it. If it does, the decoder copies only the remaining bytes and returns an error:
  840. len += kMatchMinLen;
  841. bool isError = false;
  842. if (unpackSizeDefined && unpackSize < len)
  843. {
  844.   len = (unsigned)unpackSize;
  845.   isError = true;
  846. }
  847. OutWindow.CopyMatch(rep0 + 1, len);
  848. unpackSize -= len;
  849. if (isError)
  850.   return LZMA_RES_ERROR;
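The copy step can be tried in isolation with the following sketch. It makes two simplifying assumptions that are not part of the reference decoder: the output is a growing std::vector instead of the circular "OutWindow", and the error is reported as a bool instead of LZMA_RES_ERROR. The function names are chosen for this example only. "kMatchMinLen" is 2, as in the length decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <stdint.h>
#include <vector>

const unsigned kMatchMinLen = 2;

// Copies "len" bytes located "dist" bytes back from the end of "out".
// "dist" is the one-based distance (rep0 + 1). The copy is done byte by
// byte, so source and destination may overlap (dist < len).
void CopyMatch(std::vector<uint8_t> &out, uint32_t dist, unsigned len)
{
  assert(dist >= 1 && dist <= out.size());
  for (; len > 0; len--)
  {
    uint8_t b = out[out.size() - dist];
    out.push_back(b);
  }
}

// Mirrors the check in the text: a match that is too long is truncated
// to the remaining size, copied, and then reported as an error.
// Returns true on success, false where the text returns LZMA_RES_ERROR.
bool CopyMatchChecked(std::vector<uint8_t> &out, uint32_t rep0, unsigned len,
    bool unpackSizeDefined, uint64_t &unpackSize)
{
  len += kMatchMinLen;
  bool isError = false;
  if (unpackSizeDefined && unpackSize < len)
  {
    len = (unsigned)unpackSize;
    isError = true;
  }
  CopyMatch(out, rep0 + 1, len);
  unpackSize -= len;
  return !isError;
}
```

With the output "ab", rep0 = 1 and a zero-based length of 2, the match copies 4 bytes from distance 2 and the output becomes "ababab". The distance is smaller than the length, so the copied bytes repeat the pattern.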
  851. Then the decoder must go to the beginning of the main loop to decode the next MATCH or LITERAL.
  852. NOTES
  853. -----
  854. This specification doesn't describe the variant of decoder implementation
  855. that supports partial decoding. Partial decoding can require some
  856. changes in the "end of stream" condition checks. Such code
  857. can also use additional status codes returned by the decoder.
  858. This specification uses C++ code with templates to simplify the description.
  859. An optimized version of the LZMA decoder doesn't need templates.
  860. Such an optimized version can use just two arrays of CProb variables:
  861. 1) The dynamic array of CProb variables allocated for the Literal Decoder.
  862. 2) The one common array that contains all other CProb variables.
  863. References:
  864. 1. G. N. N. Martin, Range encoding: an algorithm for removing redundancy
  865. from a digitized message, Video & Data Recording Conference,
  866. Southampton, UK, July 24-27, 1979.