notes.txt
changeset 0 9117c6561b0b
child 47 45cc818f2445
equal deleted inserted replaced
-1:000000000000 0:9117c6561b0b
       
     1 Some notes about Mercurial's design
       
     2 
       
     3 Revlogs:
       
     4 
       
     5 The fundamental storage type in Mercurial is a "revlog". A revlog is
       
     6 the set of all revisions to a file. Each revision is either stored
       
     7 compressed in its entirety or as a compressed binary delta against the
       
     8 previous version. The decision of when to store a full version is made
       
     9 based on how much data would be needed to reconstruct the file. This
       
    10 lets us ensure that we never need to read huge amounts of data to
       
    11 reconstruct a file, regardless of how many revisions of it we store.
       
    12 
       
    13 In fact, we should always be able to do it with a single read,
       
    14 provided we know when and where to read. This is where the index comes
       
    15 in. Each revlog has an index containing a special hash (nodeid) of the
       
    16 text, hashes for its parents, and where and how much of the revlog
       
    17 data we need to read to reconstruct it. Thus, with one read of the
       
    18 index and one read of the data, we can reconstruct any version in time
       
    19 proportional to the file size.
       
    20 
       
    21 Similarly, revlogs and their indices are append-only. This means that
       
    22 adding a new version is also O(1) seeks.
       
    23 
       
    24 Generally revlogs are used to represent revisions of files, but they
       
    25 also are used to represent manifests and changesets.
       
    26 
       
    27 Manifests:
       
    28 
       
    29 A manifest is simply a list of all files in a given revision of a
       
    30 project along with the nodeids of the corresponding file revisions. So
       
    31 grabbing a given version of the project means simply looking up its
       
    32 manifest and reconstruction all the file revisions pointed to by it.
       
    33 
       
    34 Changesets:
       
    35 
       
    36 A changeset is a list of all files changed in a check-in along with a
       
    37 change description and some metadata like user and date. It also
       
    38 contains a nodeid to the relevent revision of the manifest. Changesets
       
    39 and manifests are one-to-one, but contain different data for
       
    40 convenience.
       
    41 
       
    42 Nodeids:
       
    43 
       
    44 Nodeids are unique ids that are used to represent the contents of a
       
    45 file AND its position in the project history. That is, if you change a
       
    46 file and then change it back, the result will have a different nodeid
       
    47 because it has different history. This is accomplished by including
       
    48 the parents of a given revision's nodeids with the revision's text
       
    49 when calculating the hash.
       
    50 
       
    51 Graph merging:
       
    52 
       
    53 Nodeids are implemented as they are to simplify merging. Merging a
       
    54 pair of directed acyclic graphs (aka "the family tree" of the file
       
    55 history) requires some method of determining if nodes in different
       
    56 graphs correspond. Simply comparing the contents of the node (by
       
    57 comparing text of given revisions or their hashes) can get confused by
       
    58 identical revisions in the tree.
       
    59 
       
    60 The nodeid approach makes it trivial - the hash uniquely describes a
       
    61 revision's contents and its graph position relative to the root, so
       
    62 merge is simply checking whether each nodeid in graph A is in the hash
       
    63 table of graph B. If not, we pull them in, adding them sequentially to
       
    64 the revlog.
       
    65 
       
    66 Graph resolving:
       
    67 
       
    68 Mercurial does branching by copying (or COWing) a repository and thus
       
    69 keeps everything nice and linear within a repository. However, when a
       
    70 merge of repositories (a "pull") is done, we may often have two head
       
    71 revisions in a given graph. To keep things simple, Mercurial forces
       
    72 the head revisions to be merged.
       
    73 
       
    74 It first finds the closest common ancestor of the two heads. If one is
       
    75 a child of the other, it becomes the new head. Otherwise, we call out
       
    76 to a user-specified 3-way merge tool.
       
    77 
       
    78 Merging files, manifests, and changesets:
       
    79 
       
    80 We begin by comparing changeset DAGs, pulling all nodes we don't have
       
    81 in our DAG from the other repository. As we do so, we collect a list
       
    82 of changed files to merge.
       
    83 
       
    84 Then for each file, we perform a graph merge and resolve as above.
       
    85 It's important to merge files using per-file DAGs rather than just
       
    86 changeset level DAGs as this diagram illustrates: 
       
    87 
       
    88 M   M1   M2
       
    89 
       
    90 AB
       
    91  |`-------v     M2 clones M
       
    92 aB       AB     file A is change in mainline
       
    93  |`---v  AB'    file B is changed in M2
       
    94  |   aB / |     M1 clones M
       
    95  |   ab/  |     M1 changes B
       
    96  |   ab'  |     M1 merges from M2, changes to B conflict
       
    97  |    |  A'B'   M2 changes A
       
    98   `---+--.|
       
    99       |  a'B'   M2 merges from mainline, changes to A conflict
       
   100       `--.|
       
   101          ???    depending on which ancestor we choose, we will have
       
   102 	        to redo A hand-merge, B hand-merge, or both
       
   103                 but if we look at the files independently, everything
       
   104 		is fine
       
   105 
       
   106 After we've merged files, we merge the manifest log DAG and resolve
       
   107 additions and deletions. Then we are ready to resolve the changeset
       
   108 DAG - if our merge required any changes (the new head is not a
       
   109 decendent of our tip), we must create a new changeset describing all
       
   110 of the changes needed to merge it into the tip.
       
   111 
       
   112 Merge performance:
       
   113 
       
   114 The I/O operations for performing a merge are O(changed files), not
       
   115 O(total changes) and in many cases, we needn't even unpack the deltas
       
   116 to add them to our repository (though this optimization isn't
       
   117 necessary).
       
   118 
       
   119 Rollback:
       
   120 
       
   121 Rollback is not yet implemented, but will be easy to add. When
       
   122 performing a commit or a merge, we order things so that the changeset
       
   123 entry gets added last. We keep a transaction log of the name of each
       
   124 file and its length prior to the transaction. On abort, we simply
       
   125 truncate each file to its prior length. This is one of the nice
       
   126 properties of the append-only structure of the revlogs.
       
   127 
       
   128 Remote access:
       
   129 
       
   130 Mercurial currently supports pulling from "serverless" repositories.
       
   131 Simply making the repo directory accessibly via the web and pointing
       
   132 hg at it can accomplish a pull. This is relatively bandwidth efficient
       
   133 but no effort has been spent on pipelining, so it won't work
       
   134 especially well over LAN yet.
       
   135 
       
   136 It's also quite amenable to rsync, if you don't mind keeping an intact
       
   137 copy of the master around locally.
       
   138 
       
   139 Also note the append-only and ordering properties of the commit
       
   140 guarantee that readers will always see a repository in a consistent
       
   141 state and no special locking is necessary. As there is generally only
       
   142 one writer to an hg repository, there is in fact no exclusion
       
   143 implemented yet. 
       
   144 
       
   145 
       
   146 Some comparisons to git:
       
   147 
       
   148 Most notably, Mercurial uses delta compression and repositories
       
   149 created with it will grow much more slowly over time. This also allows
       
   150 it to be much more bandwidth efficient. I expect repos sizes and sync
       
   151 speeds to be similar to or better than BK, given the use of binary diffs.
       
   152 
       
   153 Mercurial is roughly the same performance as git and is faster in
       
   154 others as it keeps around more metadata. One example is listing and
       
   155 retrieving past versions of a file, which it can do without reading
       
   156 all the changesets. This metadata will also allow it to perform better
       
   157 merges as described above.
       
   158 
       
   159