<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Mnesia Rocksdb - Rocksdb backend plugin for Mnesia
</title>
<link rel="stylesheet" type="text/css" href="stylesheet.css" title="EDoc">
</head>
<body bgcolor="white">
<div class="navbar"><a name="#navbar_top"></a><table width="100%" border="0" cellspacing="0" cellpadding="2" summary="navigation bar"><tr><td><a href="overview-summary.html" target="overviewFrame">Overview</a></td><td><a href="http://www.erlang.org/"><img src="erlang.png" align="right" border="0" alt="erlang logo"></a></td></tr></table></div>
<h1>Mnesia Rocksdb - Rocksdb backend plugin for Mnesia
</h1>
<p>Copyright © 2013-21 Klarna AB</p>
<p><b>Authors:</b> Ulf Wiger (<a href="mailto:ulf@wiger.net"><tt>ulf@wiger.net</tt></a>).</p>

<p>The Mnesia DBMS, part of Erlang/OTP, supports 'backend plugins', making
it possible to utilize more capable key-value stores than the <code>dets</code>
module (limited to 2 GB per table). Unfortunately, this support is
undocumented. Some informal documentation for the plugin system
is provided below.</p>

<h3><a name="Table_of_Contents">Table of Contents</a></h3>
<ol>
<li><a href="#Usage">Usage</a></li>
<ol>
<li><a href="#Prerequisites">Prerequisites</a></li>
<li><a href="#Getting_started">Getting started</a></li>
<li><a href="#Special_features">Special features</a></li>
<li><a href="#Customization">Customization</a></li>
<li><a href="#Handling_of_errors_in_write_operations">Handling of errors in write operations</a></li>
<li><a href="#Caveats">Caveats</a></li>
</ol>
<li><a href="#Mnesia_backend_plugins">Mnesia backend plugins</a></li>
<ol>
<li><a href="#Background">Background</a></li>
<li><a href="#Design">Design</a></li>
</ol>
<li><a href="#Mnesia_index_plugins">Mnesia index plugins</a></li>
<li><a href="#Rocksdb">Rocksdb</a></li>
</ol>

<h3><a name="Usage">Usage</a></h3>

<h4><a name="Prerequisites">Prerequisites</a></h4>

<ul>
<li>rocksdb (included as dependency)</li>
<li>sext (included as dependency)</li>
<li>Erlang/OTP 21.0 or newer (https://github.com/erlang/otp)</li>
</ul>

<h4><a name="Getting_started">Getting started</a></h4>

<p>Call <code>mnesia_rocksdb:register()</code> immediately after
starting mnesia.</p>

<p>Put <code>{rocksdb_copies, [node()]}</code> into the table definitions of
the tables you want to store in RocksDB.</p>
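
<p>The two steps above might be combined as in the following minimal sketch.
The module name <code>rdb_setup</code>, the table name <code>person</code> and its
attributes are made up for illustration. The return value of
<code>mnesia_rocksdb:register()</code> is not described here, so the sketch only
binds it instead of matching on it. It also assumes that a disc-based schema
is needed, as for other disc-resident Mnesia tables.</p>

<pre>-module(rdb_setup).
-export([init/0]).

%% Hypothetical helper: creates a schema, starts mnesia, registers the
%% backend and creates one table stored in RocksDB.
init() ->
    %% Tolerate a schema that already exists on this node.
    case mnesia:create_schema([node()]) of
        ok -> ok;
        {error, {_Node, {already_exists, _}}} -> ok
    end,
    ok = mnesia:start(),
    %% Register the backend immediately after starting mnesia.
    Registered = mnesia_rocksdb:register(),
    Created = mnesia:create_table(person,
                                  [{rocksdb_copies, [node()]},
                                   {attributes, [id, name]}]),
    {Registered, Created}.</pre>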

<h4><a name="Special_features">Special features</a></h4>

<p>RocksDB tables support efficient selects on <em>prefix keys</em>.</p>

<p>The backend uses the <code>sext</code> module (see
<a href="https://github.com/uwiger/sext" target="_top"><tt>https://github.com/uwiger/sext</tt></a>) for mapping between Erlang terms and the
binary data stored in the tables. This provides two useful properties:</p>

<ul>
<li>The records are stored in the Erlang term order of their keys.</li>
<li>A prefix of a composite key is ordered just before any key for which
it is a prefix. For example, <code>{x, '_'}</code> is a prefix for keys <code>{x, a}</code>,
<code>{x, b}</code> and so on.</li>
</ul>

<p>This means that a prefix key identifies the start of the sequence of
entries whose keys match the prefix. The backend uses this to optimize
selects on prefix keys.</p>
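
<p>As an illustration, a select on a prefix key could look like the sketch
below. The module and function names are hypothetical, and the table is
assumed to hold records of the form <code>{Tab, Key, Value}</code> with 2-tuple keys.
The match specification itself is ordinary Mnesia and works with any table
type. What the text above adds is that a RocksDB table can position itself at
the first matching key instead of scanning the whole table.</p>

<pre>-module(prefix_select).
-export([by_prefix/2]).

%% Returns all records in Tab whose key is a 2-tuple starting with Prefix.
%% The key pattern {Prefix, '_'} is a prefix key in the sense described above.
by_prefix(Tab, Prefix) ->
    MatchHead = {Tab, {Prefix, '_'}, '_'},
    mnesia:dirty_select(Tab, [{MatchHead, [], ['$_']}]).</pre>

<p>For example, <code>prefix_select:by_prefix(foo, x)</code> would return the records
with keys <code>{x, a}</code>, <code>{x, b}</code> and so on, but not those with key
<code>{y, a}</code>.</p>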

<h4><a name="Customization">Customization</a></h4>

<p>RocksDB supports a number of customization options. These can be specified
by providing a <code>{Key, Value}</code> list named <code>rocksdb_opts</code> under <code>user_properties</code>,
for example:</p>

<pre>mnesia:create_table(foo, [{rocksdb_copies, [node()]},
                          ...
                          {user_properties,
                              [{rocksdb_opts, [{max_open_files, 1024}]}]
                          }])</pre>

<p>Consult the <a href="https://github.com/facebook/rocksdb/wiki/Setup-Options-and-Basic-Tuning">RocksDB documentation</a>
for information on configuration parameters. Also see the section below on handling write errors.</p>

<p>The default configuration for tables in <code>mnesia_rocksdb</code> is:</p>
<pre>default_open_opts() ->
    [ {create_if_missing, true}
      , {cache_size,
         list_to_integer(get_env_default("ROCKSDB_CACHE_SIZE", "32212254"))}
      , {block_size, 1024}
      , {max_open_files, 100}
      , {write_buffer_size,
         list_to_integer(get_env_default(
                           "ROCKSDB_WRITE_BUFFER_SIZE", "4194304"))}
      , {compression,
         list_to_atom(get_env_default("ROCKSDB_COMPRESSION", "true"))}
      , {use_bloomfilter, true}
    ].</pre>
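
<p>The helper <code>get_env_default/2</code> is not shown above. Judging from how it is
called, it most likely reads an OS environment variable and falls back to the
given default string. The following is a sketch under that assumption, not the
actual implementation, and the module name <code>env_default</code> is invented:</p>

<pre>-module(env_default).
-export([get_env_default/2, cache_size/0]).

%% Assumed behavior: return the value of the OS environment variable Key,
%% or Default if the variable is not set.
get_env_default(Key, Default) ->
    case os:getenv(Key) of
        false -> Default;
        Value -> Value
    end.

%% Mirrors the cache_size entry in the default options above.
cache_size() ->
    list_to_integer(get_env_default("ROCKSDB_CACHE_SIZE", "32212254")).</pre>

<p>If this reading is right, these defaults can be changed without touching any
table definition by setting, for instance, <code>ROCKSDB_CACHE_SIZE</code> in the
environment before starting the Erlang node.</p>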

<p>It is also possible, for larger databases, to produce a tuning parameter file.
This is experimental, and mostly copied from <code>mnesia_leveldb</code>. Consult the
source code in <code>mnesia_rocksdb_tuning.erl</code> and <code>mnesia_rocksdb_params.erl</code>.
Contributions are welcome.</p>

<h4><a name="Caveats">Caveats</a></h4>

<p>Avoid placing <code>bag</code> tables in RocksDB. Although they work, each write
requires additional reads, causing substantial runtime overhead. There
are better ways to represent and process bag data (see above about
<em>prefix keys</em>).</p>

<p>The <code>mnesia:table_info(T, size)</code> call always returns zero for RocksDB
tables. RocksDB itself does not track the number of elements in a table, and
although it is possible to make the <code>mnesia_rocksdb</code> backend maintain a size
counter, doing so incurs a high runtime overhead for writes and deletes, since
each of them must first do a read to check whether the key exists. If you
depend on having an up-to-date size count at all times, you need to maintain
it yourself. If you only need the size occasionally, you may traverse the
table to count the elements.</p>
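
<p>One way to do such an occasional count is to walk the keys with dirty
operations, as in this minimal sketch. The module name <code>tab_count</code> is
hypothetical. The sketch counts keys rather than objects, which is the same
thing for <code>set</code> and <code>ordered_set</code> tables, and it assumes that the result
may be approximate if other processes modify the table during the traversal.</p>

<pre>-module(tab_count).
-export([count_keys/1]).

%% Counts the keys in Tab by stepping through it with dirty_first/dirty_next.
count_keys(Tab) ->
    count_keys(Tab, mnesia:dirty_first(Tab), 0).

count_keys(_Tab, '$end_of_table', N) ->
    N;
count_keys(Tab, Key, N) ->
    count_keys(Tab, mnesia:dirty_next(Tab, Key), N + 1).</pre>

<p>The traversal visits every key, so its cost grows linearly with the table
size.</p>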

<h3><a name="Mnesia_backend_plugins">Mnesia backend plugins</a></h3>

<h4><a name="Background">Background</a></h4>

<p>Mnesia was initially designed to be a RAM-only DBMS, and Erlang's
<code>ets</code> tables were developed for this purpose. In order to support
persistence, e.g. for configuration data, a disk-based version of <code>ets</code>
(called <code>dets</code>) was created. The <code>dets</code> API mimics the <code>ets</code> API,
and <code>dets</code> is quite convenient and fast for (nowadays) small datasets.
However, using a 32-bit bucket system, it is limited to 2 GB of data.
It also doesn't support ordered sets. When used in Mnesia, dets-based
tables are called <code>disc_only_copies</code>.</p>

<p>To circumvent these limitations, another table type, called <code>disc_copies</code>,
was added. This is a combination of <code>ets</code> and <code>disk_log</code>, where Mnesia
periodically snapshots the <code>ets</code> data to a log file on disk, and meanwhile
maintains a log of updates, which can be applied at startup. These tables
are quite performant (especially on read access), but all data is kept in
RAM, which can become a serious limitation.</p>

<p>A backend plugin system was proposed by Ulf Wiger in 2016, and further
developed with Klarna's support, to finally be included in OTP 19.
Klarna uses a LevelDB backend, but Aeternity, in 2017, instead chose
to implement a RocksDB backend plugin.</p>

<h4><a name="Design">Design</a></h4>

<p>As backend plugins were added to a long-stable, legacy Mnesia code base,
they had to conform to the existing code structure. For this reason,
the plugin callbacks hook into the already present low-level access
API in the <code>mnesia_lib</code> module. As a consequence, backend plugins have
the same access semantics and granularity as <code>ets</code> and <code>dets</code>. This
isn't much of a disadvantage for key-value stores like LevelDB and RocksDB,
but a more serious issue is that the update part of this API is called
<em>after</em> the point of no return. That is, Mnesia does not expect
these updates to fail, and has no recourse if they do. As an aside,
this could also happen if a <code>disc_only_copies</code> table exceeds the 2 GB
limit (Mnesia will not check it, and <code>dets</code> will not complain, but simply
drop the update).</p>

<h3><a name="Mnesia_index_plugins">Mnesia index plugins</a></h3>

<p>Index plugins were added together with the support for backend plugins. Unfortunately, they remain undocumented.</p>

<p>An index plugin can be added in one of two ways:</p>

<ol>
<li>When creating a schema, provide the option <code>{index_plugins, [{Name, Module, Function}]}</code>.</li>
<li>Call the function <code>mnesia_schema:add_index_plugin(Name, Module, Function)</code>.</li>
</ol>

<p><code>Name</code> must be an atom wrapped as a 1-tuple, e.g. <code>{words}</code>.</p>

<p>The plugin callback is called as <code>Module:Function(Table, Pos, Obj)</code>, where <code>Pos=={words}</code> in
our example. It returns a list of index terms.</p>

<p><strong>Example</strong></p>

<p>Given the following index plugin implementation:</p>

<pre>-module(words).
-export([words_f/3]).

words_f(_,_,Obj) when is_tuple(Obj) ->
    words_(tuple_to_list(Obj)).

words_(Str) when is_binary(Str) ->
    string:lexemes(Str, [$\s, $\n, [$\r,$\n]]);
words_(L) when is_list(L) ->
    lists:flatmap(fun words_/1, L);
words_(_) ->
    [].</pre>

<p>We can register the plugin and use it in table definitions:</p>

<pre>Eshell V12.1.3 (abort with ^G)
1> mnesia:start().
ok
2> mnesia_schema:add_index_plugin({words}, words, words_f).
{atomic,ok}
3> mnesia:create_table(i, [{index, [{words}]}]).
{atomic,ok}</pre>

<p>Note that in this case, we had neither a backend plugin, nor even a persistent schema.
Index plugins can be used with all table types. The registered indexing function (arity 3) must exist
as an exported function along the node's code path.</p>

<p>To see what happens when we insert an object, we can turn on call tracing.</p>

<pre>4> dbg:tracer().
{ok,<0.108.0>}
5> dbg:tp(words, x).
{ok,[{matched,nonode@nohost,3},{saved,x}]}
6> dbg:p(all,[c]).
{ok,[{matched,nonode@nohost,60}]}
7> mnesia:dirty_write({i,<<"one two">>, [<<"three">>, <<"four">>]}).
(<0.84.0>) call words:words_f(i,{words},{i,<<"one two">>,[<<"three">>,<<"four">>]})
(<0.84.0>) returned from words:words_f/3 -> [<<"one">>,<<"two">>,<<"three">>,
                                             <<"four">>]
(<0.84.0>) call words:words_f(i,{words},{i,<<"one two">>,[<<"three">>,<<"four">>]})
(<0.84.0>) returned from words:words_f/3 -> [<<"one">>,<<"two">>,<<"three">>,
                                             <<"four">>]
ok
8> dbg:ctp('_'), dbg:stop().
ok
9> mnesia:dirty_index_read(i, <<"one">>, {words}).
[{i,<<"one two">>,[<<"three">>,<<"four">>]}]</pre>

<p>(The fact that the indexing function is called twice looks like a performance bug.)</p>

<p>We can observe that the indexing callback is able to operate on the whole object.
It needs to be side-effect free and efficient, since it will be called at least once for each update
(if an old object exists in the table, the indexing function will be called on it too, before it is
replaced by the new object).</p>

<h3><a name="Rocksdb">Rocksdb</a></h3>
<hr>
<div class="navbar"><a name="#navbar_bottom"></a><table width="100%" border="0" cellspacing="0" cellpadding="2" summary="navigation bar"><tr><td><a href="overview-summary.html" target="overviewFrame">Overview</a></td><td><a href="http://www.erlang.org/"><img src="erlang.png" align="right" border="0" alt="erlang logo"></a></td></tr></table></div>
<p><i>Generated by EDoc</i></p>
</body>
</html>