Survive the Fire

Survive the Fire

The Situation

We have a GenServer that, on initialization, pulls frequently accessed records from a database and serves that data to users. Everything is going well, but then Murphy’s law strikes, and the server crashes, destroying all of the innocent data along with it. Fortunately, we’ve leveraged the power of Elixir and built a resilient system, so our supervisor restarts the server, which then makes an expensive call to the database to reload all that same information. In about half a minute everything is available again. Not bad, but can we do better?

The Options

There are a few ways to preserve state when a process crashes. For the purposes of this article, we’ll focus on solving our problem with one of the more common patterns: using another server as a backup server. In the wonderful world of OTP, that usually is accomplished by creating two supervised GenServers: one that has the data and one that would temporarily own the data in the event the first one crashes.

The most common method is to use the GenServer’s terminate/2 function from the primary server to send all or some portion of the state to the backup server. Think of it as a dying breath or a last will and testament. Pairing it with the init/1 function produces a simple way to preserve state.

def init(_) do
  if preserved_state = OtherServer.has_my_beer?() do
    {:ok, preserved_state}
    {:ok, create_new_state()}
def terminate(_reason, state) do

In most situations, that works just fine. But there are a few drawbacks. First, not every exit will invoke terminate/2. Second, if we have a particularly large data set, moving it twice (once to OtherServer, and then once back to our server on restart) can potentially be an expensive operation. Third, we are adding additional complexity to our server by requiring it to deal with issues besides processing data - a clear departure from the “let it fail” philosophy.

As long as the state is small and the exceptions to terminate/2 are acceptable, then this solution works fine. But what if the data you’re working with is significantly larger?

Erlang Term Storage

Enter :ets, or “ehts”, or “ee tee es”. However you pronounce it, ETS is an effective in-memory term storage provided by erlang that offers constant time data access. You create ETS tables within a process and it will live as long as that process does…and, despite some common misconceptions, even well beyond if you configure it correctly.

Erlang’s :ets module provides us a couple ways to efficiently solve our problem with how to gracefully handle preservation of large sets of data stored in an ETS table. First, there’s the :ets.give_away/3 function that allows a process to transfer ownership of an ETS table to another process. That low cost ownership transfer allows us to avoid transferring huge swaths of data between processes. As a result, we can solve our data transfer issues from the terminate/2 solution outlined above. Unfortunately, we’re still left with the other issues.

To simplify the rest, the :ets module provides us another way: we can set an heir when creating the ETS table. The identified heir (just a pid of another process) will inherit ownership of the table if it’s current owner terminates. No need for the dying process to explicitly tell an ETS table where to go when it dies.

An Example

Source Code

For the purpose of this exercise, we’re going to create a simple OTP app: DataBackup diagram

Our Supervisor will start two workers: DataMaster, which creates, distributes, and inherits the ETS table, and Server, which will assume ownership of the ETS table during initialization and handle data access and manipulation. We’ll dig into each of these over the next few sections.


DataMaster is the ultimate guardian of our ETS table in three ways.

First, it creates the ETS table:

@table_name :data
def init(_) do
  # 1, [:set, :protected, :named_table, {:heir, self(), nil}])

  {:ok, %{ets: @table_name}}

BackupData.DataMaster.init/1 Building the ETS table

During the ETS creation (#1), DataMaster sets itself as the heir. If any other process assumes ownership of this ETS table and happens to crash, then by default DataMaster will regain ownership and the ETS table will be preserved. No transfer of data, no issues where terminate/2 doesn’t fire, and we can just let our Server fail.

def get_ets do, :get_ets)
def handle_call(:get_ets, _from, nil), do: {:reply, nil, nil}
def handle_call(:get_ets, {pid, _ref}, %{ets: ets}) do
  :ets.give_away(ets, pid, nil)
  {:reply, ets, nil}

BackupData.DataMaster.get_ets/0 Distributing the ETS table

The get_ets/0 function allows other processes to request and, if available, assume ownership of the ETS table.

def handle_info({:"ETS-TRANSFER", @table_name, _origin, _gift_data}, _state) do
  {:noreply, %{ets: @table_name}}

BackupData.DataMaster.handle_info/2:”ETS-TRANSFER” Re-owning

Whenever a process is identified as the owner of an ETS table, that process will receive the message {:"ETS-TRANSFER", table_name, origin_pid, misc_data}. In this case, when DataMaster receives ownership of the same table again, it will adjust it’s state accordingly and be ready to distribute it again.


The Server handles everything related to storing and retrieving data via :ets.insert/2 and :ets.lookup/2 functions. Unfortunately, Server initially can’t access the ETS table due to the protected option we passed, which affords read only privileges to every other process. The remedy is simple enough: get ownership.

def init(_) do
  {:ok, %{ets: DataBackup.DataMaster.get_ets()}}

BackupData.Server.init/1 Takes ownership on init

Upon initialization, Server will request ownership via the previously discussed DataMaster.get_ets/0 function and store the name of that table in state. Whether that’s the first time the app is started or if coming back from the dead, Server.

def insert_data(id, data) do, {:insert_data, id, data})

def get_data(id) do, {:get_data, id})
def handle_call({:insert_data, id, data}, _from, state) do
  :ets.insert(state.ets, {id, data})
  {:reply, :ok, state}

def handle_call({:get_data, id}, _from, state) do
  record =
    |> :ets.lookup(id)
    |> List.first
  {:reply, record, state}

BackupData.Server Data manipulation functions.

And some basic functions for creating rows and finding them.


When it’s all said in done, here’s how everything comes together.

Sequence diagram

Our Supervisor starts DataMaster first, which in turn creates our ETS table within it’s initialization function, setting itself as the heir. The Supervisor then starts our Server, which initializes and requests the ETS table from DataMaster via the latter’s get_ets/0 function. DataMaster utilizes the :ets.give_away/3 function to pass ownership of the ETS table to Server. Server serves data, adds data to the ets table, etc. Then calamity strikes and Server crashes. Since DataMaster is the heir, ownership of the ETS table reverts to DataMaster. That allows the Server to die quickly. Supervisor then restarts our Server, which quickly resumes ownership of the ETS table and resumes normal function.


Using an ETS table allows us to separate in-memory data functionality from the implementation of our application. Server is no longer responsible for maintaining and preserving the data and can instead focus on serving and manipulating that data. We use DataMaster to provide a greater degree of reliability in maintaining the life of in-memory data. All of that happens with only a few simple lines of code thanks to erlang’s built in ETS functionality.

If you’re interested in digging deeper, check out the repo on GitHub. I’d recommend paying special attention to the tests, as they’ll give you a good idea of how this simple OTP app works.

Bonus points if you figured out the photo is actually of our ETS and Server: one can’t be killed by fire, the other can come back from the dead. This story is about how to get them to sing together.

Author face

Dave Lively

Dave is a Support Engineer at SalesLoft, husband to Sarah, graduate of West Point, and avid Atlanta United supporter.

Recent post