from (https://wiki.mobileread.com/wiki/MOBI)

About
-----

MOBI is the format used by the MobiPocket Reader. It may have a .mobi
extension or it may have a .prc extension. The extension can be changed by the
user to either of the accepted forms. In either case it may be DRM protected or
non-DRM. The .prc extension is used because PalmOS doesn't support any file
extensions except .prc or .pdb. Note that Mobipocket prohibits the use of its
DRM format on dedicated eBook readers that support other DRM formats.


Description
-----------

The MOBI format was originally an extension of the PalmDOC format, adding
certain HTML-like tags to the data. Many MOBI formatted documents still use
this form. However, there is also a high compression version of this file format
that compresses data to a larger degree in a proprietary manner. There are some
third party programs that can read the eBooks in the original MOBI format but
there are only a few third party programs that can read the eBooks in the new
compressed form. The higher compression mode uses a Huffman coding scheme
that has been called the Huff/cdic algorithm.

From time to time features have been added to the format, so new files may have
problems if you try to read them with an older reader. Currently the
source files follow the guidelines in the Open eBook format.

Note that AZW for the Amazon Kindle is the same format as MOBI except that it
uses a slightly different DRM scheme.


Format
------

Like PalmDOC, the Mobipocket file format is that of a standard Palm Database
Format file. The header of that format includes the name of the database
(usually the book title and sometimes a portion of the author's name) which is
up to 31 bytes of data. The files are identified by a Creator ID of MOBI and a
Type of BOOK.
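
As an illustration, the identification check can be sketched in Python. The
byte offsets used here (type at 60, creator at 64, record count at 76, and
8-byte record list entries starting at 78) come from the general Palm Database
layout and are assumptions as far as this page is concerned; the function
names are hypothetical.

```python
import struct

def parse_pdb_header(data: bytes) -> dict:
    """Read the name, type, creator and record offsets of a Palm Database."""
    if len(data) < 78:
        raise ValueError("too short for a Palm Database header")
    # Bytes 0-31 hold the NUL-terminated database name (up to 31 bytes).
    name = data[:32].split(b"\x00", 1)[0]
    # Assumed layout: type at offset 60, creator at 64, record count at 76.
    db_type, creator = struct.unpack_from(">4s4s", data, 60)
    (num_records,) = struct.unpack_from(">H", data, 76)
    if len(data) < 78 + 8 * num_records:
        raise ValueError("record list is truncated")
    # Each record list entry is 8 bytes; its first 4 bytes are the offset
    # of the record from the start of the file.
    offsets = [struct.unpack_from(">I", data, 78 + 8 * i)[0]
               for i in range(num_records)]
    return {"name": name, "type": db_type, "creator": creator,
            "record_offsets": offsets}

def is_mobipocket(header: dict) -> bool:
    """True if the header carries the Type BOOK and Creator ID MOBI."""
    return header["type"] == b"BOOK" and header["creator"] == b"MOBI"
```

Record 0, which the following sections describe, would then start at the first
offset in "record_offsets".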


PalmDOC Header
--------------

The first record in the Palm Database Format gives more information about the
Mobipocket file. The first 16 bytes are almost identical to the first sixteen
bytes of a PalmDOC format file.

bytes   content             comments
2       Compression         1 = no compression, 2 = PalmDOC compression,
                            17480 = HUFF/CDIC compression.
2       Unused              Always zero
4       text length         Uncompressed length of the entire text of the book
2       record count        Number of PDB records used for the text of the book.
2       record size         Maximum size of each record containing text, always
                            4096.
4       Current Position    Current reading position, as an offset into the
                            uncompressed text

There are two differences from a Palm DOC file. There's an additional
compression type (17480), and the Current Position bytes are used for a
different purpose:

bytes   content             comments
2       Encryption Type     0 = no encryption, 1 = Old Mobipocket Encryption,
                            2 = Mobipocket Encryption.
2       Unknown             Usually zero

The old Mobipocket Encryption scheme only allows the file to be registered
with one PID, unlike the current encryption scheme that allows multiple PIDs to
be used in a single file. Unless specifically mentioned, all the encryption
information on this page refers to the current scheme.
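
The 16 bytes described above can be unpacked with a single big-endian struct
format. This is a minimal sketch; the class and function names are
hypothetical, and it assumes record 0 has already been extracted from the PDB
file.

```python
import struct
from typing import NamedTuple

class PalmDocHeader(NamedTuple):
    compression: int      # 1, 2 or 17480
    text_length: int      # uncompressed length of the whole text
    record_count: int     # number of text records
    record_size: int      # maximum text record size (4096)
    encryption_type: int  # 0, 1 or 2

def parse_palmdoc_header(record0: bytes) -> PalmDocHeader:
    """Unpack the first 16 bytes of record 0 of a Mobipocket file."""
    if len(record0) < 16:
        raise ValueError("record 0 is shorter than 16 bytes")
    (compression, _unused, text_length, record_count,
     record_size, encryption_type, _unknown) = struct.unpack_from(
        ">HHIHHHH", record0, 0)
    return PalmDocHeader(compression, text_length, record_count,
                         record_size, encryption_type)
```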


MOBI Header
-----------

Most Mobipocket files also have a MOBI header in record 0 that follows these
16 bytes, and newer formats also have an EXTH header following the MOBI header,
again all in record 0 of the PDB file format.

The MOBI header is of variable length and is not documented. Some fields have
been tentatively identified as follows:

offset  bytes   content                 comments
16      4       identifier              The characters M O B I
20      4       header length           The length of the MOBI header, including
                                        the previous 4 bytes
24      4       Mobi type               The kind of Mobipocket file this is
                                            2 Mobipocket Book
                                            3 PalmDoc Book
                                            4 Audio
                                            257 News
                                            258 News_Feed
                                            259 News_Magazine
                                            513 PICS
                                            514 WORD
                                            515 XLS
                                            516 PPT
                                            517 TEXT
                                            518 HTML
28      4       text Encoding           1252 = CP1252 (WinLatin1); 65001 = UTF-8
32      4       Unique-ID               Some kind of unique ID number (random?)
36      4       Generator version       Potentially the version of the
                                        Mobipocket-generation tool. Always >=
                                        the value of the "format version" field
                                        and <= the version of mobigen used to
                                        produce the file.
40      40      Reserved                All 0xFF. In case of a dictionary, or
                                        some newer file formats, a few bytes are
                                        used from this range of 40 0xFFs
80      4       First Non-book index?   First record number (starting with 0)
                                        that's not the book's text
84      4       Full Name Offset        Offset in record 0 (not from start of
                                        file) of the full name of the book
88      4       Full Name Length        Length in bytes of the full name of the
                                        book
92      4       Language                Book language code. Low byte is main
                                        language (09 = English), next byte is
                                        dialect (08 = British, 04 = US)
96      4       Input Language          Input language for a dictionary
100     4       Output Language         Output language for a dictionary
104     4       Format version          Potentially the version of the
                                        Mobipocket format used in this file.
                                        Always >= 1 and <= the value of the
                                        "generator version" field.
108     4       First Image record      First record number (starting with 0)
                                        that contains an image. Image records
                                        should be sequential. If there are
                                        no images this will be 0xffffffff.
112     4       HUFF record             Record containing Huff information
                                        used in HUFF/CDIC decompression.
116     4       HUFF count              Number of Huff records.
120     4       DATP record             Unknown: Record starts with DATP.
124     4       DATP count              Number of DATP records.
128     4       EXTH flags              Bitfield. if bit 6, 0x40 is set, then
                                        there's an EXTH record
The following fields are only present if the MOBI header is long enough.
132     36      ?                       36 unknown bytes, if MOBI is long enough
168     4       DRM Offset              Offset to DRM key info in DRMed files.
                                        0xFFFFFFFF if no DRM
172     4       DRM Count               Number of entries in DRM info.
176     4       DRM Size                Number of bytes in DRM info.
180     4       DRM Flags               Some flags concerning the DRM info.
184     2       ?
186     2       Last Image record       Possibly the number of the last image
                                        record. If there are no images in the
                                        book this will be 0xffff.
188     4       ?
192     4       FCIS record             Unknown. Record starts with FCIS.
196     4       ?
200     4       FLIS record             Unknown. Record starts with FLIS.
204     ?       ?                       Bytes to the end of the MOBI header,
                                        including the following if the header
                                        length >= 228 (244 from start of
                                        record).
242     2       Extra Data Flags        A set of binary flags, some of which
                                        indicate extra data at the end of each
                                        text block. This only seems to be valid
                                        for Mobipocket format version 5 and 6
                                        (and higher?), when the header length
                                        is 228 (0xE4) or 232 (0xE8).
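
A few of the fields in the table above can be read as follows. This is only a
sketch under the assumptions of this page: the offsets are the tentative ones
listed above, the function name and the returned dictionary keys are
hypothetical, and falling back to CP1252 for an unrecognised encoding value is
an arbitrary choice.

```python
import struct

def parse_mobi_header(record0: bytes) -> dict:
    """Read selected MOBI header fields from record 0."""
    if len(record0) < 132 or record0[16:20] != b"MOBI":
        raise ValueError("no usable MOBI header in record 0")
    header_length, mobi_type, encoding = struct.unpack_from(">III", record0, 20)
    name_offset, name_length = struct.unpack_from(">II", record0, 84)
    (exth_flags,) = struct.unpack_from(">I", record0, 128)
    # Assumption: anything other than 65001 is decoded as CP1252.
    codec = "utf-8" if encoding == 65001 else "cp1252"
    raw_name = record0[name_offset:name_offset + name_length]
    extra_data_flags = 0
    # The flags are only read when the header is long enough to contain them.
    if header_length >= 228 and len(record0) >= 244:
        (extra_data_flags,) = struct.unpack_from(">H", record0, 242)
    return {
        "mobi_type": mobi_type,
        "encoding": encoding,
        "full_name": raw_name.decode(codec, errors="replace"),
        "has_exth": bool(exth_flags & 0x40),
        "extra_data_flags": extra_data_flags,
        # The header length is counted from the identifier at offset 16.
        "header_end": 16 + header_length,
    }
```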


EXTH Header
-----------

If the MOBI header indicates that there's an EXTH header, it follows immediately
after the MOBI header. Since the MOBI header is of variable length, this isn't
at any fixed offset in record 0. Note that some readers will ignore any EXTH
header info if the Mobipocket version number specified in the MOBI header is 2
or less (perhaps 3 or less).

The EXTH header is also undocumented, so some of this is guesswork.

bytes   content             comments
4       identifier          the characters E X T H
4       header length       the length of the EXTH header, including the previous 4 bytes
4       record count        The number of records in the EXTH header. The rest of the EXTH header consists of repeated EXTH records to the end of the EXTH length.
        EXTH record start   Repeat until done.
4       record type         EXTH record type. Just a number identifying what's stored in the record
4       record length       length of EXTH record = L, including the 8 bytes in the type and length fields
L-8     record data         Data.
        EXTH record end     Repeat until done.
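
The record loop in the table above can be sketched like this. The function
name is hypothetical, the caller is assumed to know where the EXTH header
starts (immediately after the MOBI header), and record data is returned as raw
bytes because the encoding of each record type is not specified here.

```python
import struct

def parse_exth(record0: bytes, offset: int) -> list:
    """Return a list of (record type, record data) pairs from an EXTH header."""
    if record0[offset:offset + 4] != b"EXTH":
        raise ValueError("no EXTH identifier at the given offset")
    _header_length, record_count = struct.unpack_from(">II", record0, offset + 4)
    pos = offset + 12  # first EXTH record follows the 12 fixed bytes
    records = []
    for _ in range(record_count):
        if pos + 8 > len(record0):
            break  # truncated header: stop rather than read past the end
        rec_type, rec_length = struct.unpack_from(">II", record0, pos)
        if rec_length < 8:
            raise ValueError("EXTH record length smaller than its own header")
        # The length L includes the 8 bytes of type and length fields.
        records.append((rec_type, record0[pos + 8:pos + rec_length]))
        pos += rec_length
    return records
```

For example, the author would be the data of the first pair whose type is 100.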

There are lots of different EXTH record types. Ones found so far in Mobipocket
files are listed here, with possible meanings. Hopefully the table will be
filled in as more information comes to light.

record type    usual length     name             comments
1                               drm_server_id
2                               drm_commerce_id
3                               drm_ebookbase_book_id
100                             author
101                             publisher
102                             imprint
103                             description
104                             isbn
105                             subject
106                             publishingdate
107                             review
108                             contributor
109                             rights
110                             subjectcode
111                             type
112                             source
113                             asin
114                             versionnumber
115                             sample
116                             startreading
117            3                adult            Mobipocket Creator adds this if Adult only is checked; contents: "yes" 
118                             retail price     As text, e.g. "4.99" 
119                             retail price currency     As text, e.g. "USD" 
201            4                coveroffset      Add to first image field in Mobi Header to find PDB record containing the cover image 
202            4                thumboffset      Add to first image field in Mobi Header to find PDB record containing the thumbnail cover image 
203                             hasfakecover
204            4                Creator Software Records 204-207 are usually the same for all books from a certain source, e.g. 1-6-2-41 for Baen, 201-1-0-85 for Project Gutenberg, and 200-1-0-85 for Amazon, when converted to 32 bit integers.
205            4                Creator Major Version
206            4                Creator Minor Version
207            4                Creator Build Number
208                             watermark
209                             tamper proof keys Used by the Kindle (and Android app) for generating book-specific PIDs. 
300                             fontsignature
401            1                clippinglimit
402                             publisherlimit
403                             unknown
404            1                ttsflag          1 - Text to Speech disabled; 0 - Text to Speech enabled
501            4                cdetype          PDOC - Personal Doc;
                                                 EBOK - ebook;
502                             lastupdatetime
503                             updatedtitle

And now, at the end of Record 0 of the PDB file format, we usually get the full
name of the book, the offset of which is given in the MOBI header.


Variable-width integers
-----------------------

Some parts of the Mobipocket format encode data as variable-width integers.
These integers are represented big-endian with 7 bits per byte in bits 1-7. They
may be either forward-encoded, in which case only the least significant (last)
byte has bit 8 set, or backward-encoded, in which case only the most significant
(first) byte has bit 8 set. For example, the number 0x11111 would be
represented forward-encoded as:

    0x04 0x22 0x91

And backward-encoded as: 

    0x84 0x22 0x11
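
Both encodings can be sketched in a few lines of Python. The function names
are hypothetical. The backward decoder reads from the end of a buffer towards
its start, which is how such an integer would be found at the end of a record.

```python
def encode_vwi(value: int, forward: bool = True) -> bytes:
    """Encode a non-negative integer as a Mobipocket variable-width integer."""
    if value < 0:
        raise ValueError("value must be non-negative")
    groups = []
    while True:
        groups.append(value & 0x7F)  # 7 payload bits per byte
        value >>= 7
        if value == 0:
            break
    groups.reverse()  # big-endian: most significant group first
    # Bit 8 marks the last byte (forward) or the first byte (backward).
    groups[-1 if forward else 0] |= 0x80
    return bytes(groups)

def decode_forward_vwi(data: bytes, pos: int = 0) -> tuple:
    """Decode a forward-encoded integer; return (value, next position)."""
    value = 0
    while pos < len(data):
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:
            return value, pos
    raise ValueError("no terminating byte found")

def decode_backward_vwi(data: bytes) -> tuple:
    """Decode a backward-encoded integer that ends at the end of data.

    Returns (value, position of its first byte).
    """
    value, shift, pos = 0, 0, len(data)
    while pos > 0:
        pos -= 1
        byte = data[pos]
        value |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80:
            return value, pos
    raise ValueError("no starting byte found")
```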


Trailing entries
----------------

The Extra Data Flags field of the MOBI header indicates which, if any, trailing
entries are appended to the end of each text record. Each set bit in the field
indicates a trailing entry. The entries appear to occur in bit-order; e.g.,
trailing entry 1 immediately follows the text content and entry 16 occurs at
the very end of the record. The effect and exact details of most of these
entries are unknown. The trailing entries indicated by bits 2-16 appear to
follow a common format. That format is:

    <data><size>

Where <size> is the size of the entire trailing entry (including the size of
<size>) as a backward-encoded Mobipocket variable-width integer.

Only a few bits have been identified:

bit     Data at end of records
0x0001  Multi-byte character overlaps
0x0002  Some data to help with indexing
0x0004  Some data about uncrossable breaks
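
Because every entry for bits 2-16 ends with its own size, the entries can be
measured by walking backwards from the end of the record. The following is a
sketch under the assumptions stated above, with hypothetical function names.
It deliberately ignores bit 1, whose entry uses a different format.

```python
def _backward_size(record: bytes, end: int) -> int:
    """Read a backward-encoded integer whose last byte is record[end - 1]."""
    value, shift, pos = 0, 0, end
    while pos > 0:
        pos -= 1
        byte = record[pos]
        value |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80:  # bit 8 marks the most significant byte
            return value
    raise ValueError("size field runs past the start of the record")

def common_trailing_size(record: bytes, extra_data_flags: int) -> int:
    """Total size of the trailing entries indicated by bits 2-16."""
    end = len(record)
    total = 0
    flags = (extra_data_flags & 0xFFFF) >> 1  # drop bit 1
    while flags:
        if flags & 1:
            size = _backward_size(record, end)
            if size < 1 or size > end:
                raise ValueError("implausible trailing entry size")
            total += size
            end -= size  # the next entry ends where this one starts
        flags >>= 1
    return total
```

Usage would be along the lines of record[:len(record) - common_trailing_size(record, flags)],
after which any multibyte overlap entry (bit 1) is still attached to the text.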


Multibyte character overlap
---------------------------

When bit 1 of the Extra Data Flags field is set, each record is followed by a
trailing entry containing any extra bytes necessary to complete a multibyte
character which crosses the record boundary. The bytes do not participate in
compression regardless of which compression scheme is used for the file. However,
unlike the trailing data bytes, the multibytes (including the count byte) do
get included in any encryption. The overlapping bytes then re-appear as normal
content at the beginning of the following record. The trailing entry ends with
a byte containing a count of the overlapping bytes plus additional flags.

offset  bytes   content             comments
0       N       overlap bytes       N (0-3) terminal bytes of a multibyte character
N       1       Size & flags        bits 1-2 encode N, use of bits 3-8 is unknown
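
Removing this entry could look like the sketch below. The function name is
hypothetical, and the code assumes that any trailing entries for bits 2-16
have already been removed, since the multibyte entry sits directly after the
text content.

```python
def strip_multibyte_overlap(record: bytes, extra_data_flags: int) -> bytes:
    """Drop the multibyte overlap entry from the end of a text record."""
    if not (extra_data_flags & 0x0001) or not record:
        return record
    # Bits 1-2 of the final byte give the number of overlapping bytes N.
    n = record[-1] & 0x03
    if n + 1 > len(record):
        raise ValueError("overlap entry is larger than the record")
    # Remove the N overlapping bytes plus the size & flags byte itself.
    return record[:len(record) - (n + 1)]
```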


PalmDOC Compression
-------------------

PalmDOC uses LZ77 compression techniques. DOC files can contain only compressed
text. The format does not allow for any text formatting. This keeps files small,
in keeping with the Palm philosophy. However, extensions to the format can use
tags, such as HTML or PML, to include formatting within text. These extensions
to PalmDoc are not interchangeable and are the basis for most eBook Reader
formats on Palm devices.

LZ77 algorithms achieve compression by replacing portions of the data with
references to matching data that has already passed through both encoder and
decoder. A match is encoded by a pair of numbers called a length-distance pair,
which is equivalent to the statement "each of the next length characters is
equal to the character exactly distance characters behind it in the uncompressed
stream." (The "distance" is sometimes called the "offset" instead.)

In the PalmDoc format, a length-distance pair is always encoded by a two-byte
sequence. Of the 16 bits that make up these two bytes, 11 bits go to encoding
the distance, 3 go to encoding the length, and the remaining two are used to
make sure the decoder can identify the first byte as the beginning of such a
two-byte sequence. The exact algorithm needed to decode the compressed text can
be found on the PalmDOC page.

PalmDOC data is always divided into 4096 byte blocks and the blocks are acted
upon independently.
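
For orientation, a decoder for one block might be sketched as follows. The
byte ranges used here (literal runs for 0x01-0x08, space-plus-character for
0xC0-0xFF, and a stored length of actual length minus 3) are not spelled out
on this page; they follow the commonly described PalmDOC scheme and should be
treated as assumptions. The function name is hypothetical.

```python
def palmdoc_decompress(data: bytes) -> bytes:
    """Decompress a single PalmDOC-compressed text record."""
    out = bytearray()
    i = 0
    while i < len(data):
        c = data[i]
        i += 1
        if 1 <= c <= 8:
            # The next c bytes are copied through unchanged.
            out += data[i:i + c]
            i += c
        elif c < 0x80:
            # 0x00 and 0x09-0x7F stand for themselves.
            out.append(c)
        elif c >= 0xC0:
            # A space followed by the character c with its top bit cleared.
            out.append(0x20)
            out.append(c ^ 0x80)
        else:
            # 0x80-0xBF: first byte of a two-byte length-distance pair.
            if i >= len(data):
                raise ValueError("truncated length-distance pair")
            pair = ((c << 8) | data[i]) & 0x3FFF  # drop the two marker bits
            i += 1
            distance = pair >> 3        # 11 bits
            length = (pair & 0x07) + 3  # 3 bits, stored as length - 3
            if distance == 0 or distance > len(out):
                raise ValueError("distance points outside the output")
            for _ in range(length):
                # Byte-by-byte copy so that overlapping matches work.
                out.append(out[-distance])
    return bytes(out)
```

Since blocks are independent, each text record would be passed to the function
separately and the results concatenated.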

PalmDOC does have support for bookmarks. These pointers are named and refer to
an offset location in a file. If the file is edited these locations may no
longer refer to the correct locations. Some reading programs allow the user to
enter or edit these bookmarks while others treat them as a TOC. Some reading
programs may ignore them entirely. They are stored at the end of the file itself
so the full file needs to be scanned when loaded to find them. 


Image Records
-------------

If the file contains images, they follow the text blocks, with each image using a
single block. The 4096-byte record size in the PalmDoc header applies only to
text records; image records may be larger. 


Magic Records
-------------

In some cases, MobiPocket Creator adds a 2-zero-byte record after the text
records in a file. This record is not included in the "record count" of text
records in the PalmDoc header, and is also not used as the "first non-book
index" in the MOBI header. (If the 2-zero-byte record is present, the index of
the following block is used as the "first non-book index".)

MobiPocket Creator also ends files with three records: 'FLIS', 'FCIS', and
'end-of-file', in that order. The 'FLIS' and 'FCIS' records do not seem to be
necessary for MobiPocket Reader or the Amazon Kindle 2 to read the file. The
'end-of-file' record might be necessary. 


FLIS Record
-----------

The FLIS record appears to have a fixed value. The meaning of the values is not known.

offset    bytes    content      comments
0         4        identifier   the characters F L I S (0x46 0x4c 0x49 0x53)
4         4        ?            fixed value: 8
8         2        ?            fixed value: 65
10        2        ?            fixed value: 0
12        4        ?            fixed value: 0
16        4        ?            fixed value: -1
20        2        ?            fixed value: 1
22        2        ?            fixed value: 3
24        4        ?            fixed value: 3
28        4        ?            fixed value: 1
32        4        ?            fixed value: -1 


FCIS Record
-----------

The FCIS record appears to have mostly fixed values.

offset    bytes    content      comments
0         4        identifier   the characters F C I S (0x46 0x43 0x49 0x53)
4         4        ?            fixed value: 20
8         4        ?            fixed value: 16
12        4        ?            fixed value: 1
16        4        ?            fixed value: 0
20        4        ?            text length (the same value as "text length" in the PalmDoc header)
24        4        ?            fixed value: 0
28        4        ?            fixed value: 32
32        4        ?            fixed value: 8
36        2        ?            fixed value: 1
38        2        ?            fixed value: 1
40        4        ?            fixed value: 0


End-of-file Record
------------------

The end-of-file record is a fixed 4-byte record. While the last two bytes
appear to be a CRLF marker, the meaning of the first two bytes is unknown.

offset    bytes    content    comments
0         1         ?         fixed value: 233 (0xe9)
1         1         ?         fixed value: 142 (0x8e)
2         1         ?         fixed value: 13 (0x0d)
3         1         ?         fixed value: 10 (0x0a)
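
The FLIS, FCIS and end-of-file records described in the last three sections
can be reproduced from their tables with struct. This is a sketch only: the
function names are hypothetical, all values are assumed to be big-endian, and
the fields listed as -1 are packed as signed 32 bit integers.

```python
import struct

def build_flis() -> bytes:
    """Build the 36-byte FLIS record from the fixed values in its table."""
    return struct.pack(">4sIHHIiHHIIi",
                       b"FLIS", 8, 65, 0, 0, -1, 1, 3, 3, 1, -1)

def build_fcis(text_length: int) -> bytes:
    """Build the 44-byte FCIS record; only the text length varies."""
    return struct.pack(">4sIIIIIIIIHHI",
                       b"FCIS", 20, 16, 1, 0, text_length, 0, 32, 8, 1, 1, 0)

def build_eof() -> bytes:
    """Build the fixed 4-byte end-of-file record."""
    return bytes([0xE9, 0x8E, 0x0D, 0x0A])
```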


SRCS Record
-----------

kindlegen creates a record whose content is a zip archive of all source files
(i.e., .opf, .ncx, .htm, .jpg, ...) given to the command and puts it in the
generated MOBI file. The record begins with the "SRCS" signature and is
located just before the End-of-file record.

MOBI files created with Mobipocket Creator, Amazon's Personal Document Service,
or Kindle Direct Publishing (formerly Amazon DTP) don't include an SRCS record.
In the past, kindlegen had an undocumented option to suppress this record, but
the option was removed in 2010.

offset    bytes    content      comments
0         4        identifier   "SRCS" (0x53 0x52 0x43 0x53)
4         4        ?            fixed value(?): 0x00000010
8         4        ?            fixed value(?): 0x0000002f
12        4        ?            fixed value(?): 0x00000001
16        zip                   The zip archive continues to the end of this record 
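
Given the layout above, the embedded sources could be read with Python's
zipfile module. This is a minimal sketch with a hypothetical function name; it
assumes the record has already been extracted from the PDB file and that the
archive really does start at offset 16.

```python
import io
import zipfile

def extract_srcs(record: bytes) -> dict:
    """Return {file name: content} for the zip archive in an SRCS record."""
    if record[:4] != b"SRCS":
        raise ValueError("record does not start with SRCS")
    # The zip archive runs from offset 16 to the end of the record.
    with zipfile.ZipFile(io.BytesIO(record[16:])) as archive:
        return {name: archive.read(name) for name in archive.namelist()}
```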


MBP
---

This is the extension used on a side (auxiliary) file for MOBI formatted eBooks.
It is used to store metadata used by the library software and also to store
user-entered data such as bookmarks, annotations, and the last read position.
This file is created automatically by the reader program when the eBook is
first opened and has a .mbp extension. The library management software in
MobiPocket uses this file to get information displayed in the library window,
such as title and author, so that it won't have to open the larger eBook file.

