Runtime swappable and upgradeable modules in ANSI C

09 Jul 2020 - tsp
Last update 14 Aug 2020

What is this blog entry about?

As everyone who has worked with servlet containers like Apache Tomcat knows, these containers are capable of deploying new servlets by simply copying a new version of the web application archive into a folder. The container then terminates the currently running version of the servlet, redeploys the application from the newly copied archive and instantiates the new servlet. This allows for easy upgrading of web applications with a minimal amount of downtime. The upgrade process is also pretty simple because it just requires access via a file transfer method like rsync or scp, as long as one doesn’t want to upgrade the servlet container itself (note that as of today the servlet container is often deployed together with the web application using a mechanism like Docker - the approach described in this blog post isn’t suited for that kind of deployment).

On the other hand the approach taken by most servlet containers is still somewhat problematic since only a single version of the servlet can be deployed at any time. This means that first all requests to the old version have to be completed, then the old version has to be shut down, and only then the new version gets deployed and becomes reachable for new connections. This leads to a (hopefully short) downtime during which connections are either dropped or delayed. Also most servlet containers return errors while the old servlet has already been stopped and the new version hasn’t been deployed yet.

To solve that problem other systems like the legendary Erlang support keeping two versions of a module loaded and active at the same time. Control flow stays inside the current version of the module and jumps into the new version whenever the programmer decides to do so (normally this is done during some tail recursive call) or after all lightweight threads that used the old version have terminated. This is a feature that allows for example runtime upgrading of telecommunication routers without interrupting any network connections. There have also been experiments with using Erlang for robotic control systems - for example there has been a demonstration of replacing the control algorithms of a quadcopter in flight. New calls to module functions from the outside are directed to the new version of the module; calls from the inside go either to the version they originate from or to the new one.

In Erlang there is a limit of two module versions - if one tries to load a third one, the oldest version gets purged and the VM terminates the processes that are still executing it. The feature is heavily supported by the BEAM virtual machine and the Erlang language itself. The fact that it’s a (non pure) functional language is of course also helpful since global state is minimized with this programming style - and it’s heavily encouraged to handle for example different network connections using different lightweight threads.

The approach described in this blog post tries to provide the foundation for similar behavior in applications written in ANSI C. Note that in this case the application modules have to support runtime upgrading explicitly.

Basics: Loading modules

First one has to know how one can load modules into the current process. The basic idea is to use dynamic link libraries (DLL) or shared objects (SO) depending on the operating system. On all major operating systems they can be opened at runtime (dlopen on POSIX systems, LoadLibrary on Windows). After a module has been opened one can query pointers (to functions or data) for symbols exported by the module - relative to the module handle returned by the previous functions. This is normally done using dlsym (or dlfunc, a FreeBSD extension for function pointers) on POSIX systems or GetProcAddress on Windows. After an application has finished using the DLL/SO it gets closed with dlclose on POSIX and FreeLibrary on Windows.
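The open, lookup and close cycle on POSIX systems can be sketched as follows. This is only a minimal illustration: the struct module type, the module_init_fn signature and the function names are assumptions made up for this example, they are not part of any existing API.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical entry point type; the real signature is up to the application. */
typedef int (*module_init_fn)(void);

struct module {
    void *handle;          /* handle returned by dlopen */
    module_init_fn init;   /* resolved entry point */
};

/* Opens the shared object at `path` and resolves `symbol`.
   Returns 0 on success, -1 on failure (the module is left closed). */
int module_open(struct module *m, const char *path, const char *symbol)
{
    m->init = NULL;
    m->handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (m->handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return -1;
    }
    /* POSIX requires that the void * returned by dlsym can be
       converted to a function pointer. */
    m->init = (module_init_fn)dlsym(m->handle, symbol);
    if (m->init == NULL) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(m->handle);
        m->handle = NULL;
        return -1;
    }
    return 0;
}

/* Closes the module again; safe to call on an already closed module. */
void module_close(struct module *m)
{
    if (m->handle != NULL) {
        dlclose(m->handle);
        m->handle = NULL;
        m->init = NULL;
    }
}
```

On Windows the same structure would be built from LoadLibrary, GetProcAddress and FreeLibrary. On older glibc versions one has to link with -ldl.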

There is one drawback if one simply runs file alteration monitoring on a module directory and opens changed DLLs/SOs in place. During the first deployment this would work - and one is even able to unlink the DLLs/SOs so that their inodes get released after the modules have been closed using dlclose or FreeLibrary. Unfortunately copying a new version on top of an existing one overwrites the file in place and doesn’t create a new inode. The code of the loaded module (which is usually only mmap’ed) gets replaced underneath the running application, which might then crash - or the write access is simply denied.

To solve that problem a rather simple approach can be used: Whenever the file alteration monitor or a periodic scanner detects a new module version, this module will first be copied to a temporary location using a unique filename with suitable permissions. Then the module gets opened using dlopen or LoadLibrary. In case signature verification is required it should be done on the new temporary copy of the file, which is inaccessible to any entity except the application itself. This avoids an often encountered bug that allows code injection despite a correctly signed binary: after the loader has calculated the hash of the module an external application overwrites the plugin. The signature check has then been done against the original, correctly signed binary - but the loader opens the injected code.

Then the file gets removed immediately using unlink or DeleteFile so there is no chance of files staying inside the temporary location without being needed any more or of being overwritten. After that access to the symbols is done as usual. As soon as the library has been closed the inode is released.

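So the basic flow is: detect the new module version, copy it to a private temporary file, verify the signature of that copy if required, open the copy using dlopen or LoadLibrary, unlink or delete the temporary file, use the exported symbols as usual and finally close the library, which releases the copy.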
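A minimal POSIX sketch of this flow could look like the following. The function names and the /tmp/module-XXXXXX template are assumptions chosen for illustration, and the signature check is only marked by a comment. mkstemp is used because it creates the file with a unique name and with permissions that only allow access by the owning user.

```c
#define _POSIX_C_SOURCE 200809L
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Copies `src` into a freshly created private file. `tmpl` must end in
   "XXXXXX" and is overwritten with the chosen file name.
   Returns 0 on success, -1 on failure. */
int copy_to_private_temp(const char *src, char *tmpl)
{
    char buf[4096];
    size_t n;
    int ok = 1;
    int fd;
    FILE *in, *out;

    in = fopen(src, "rb");
    if (in == NULL) return -1;
    fd = mkstemp(tmpl);                 /* unique name, mode 0600 */
    if (fd < 0) { fclose(in); return -1; }
    out = fdopen(fd, "wb");
    if (out == NULL) { close(fd); unlink(tmpl); fclose(in); return -1; }

    while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) { ok = 0; break; }
    }
    if (ferror(in)) ok = 0;
    fclose(in);
    if (fclose(out) != 0) ok = 0;
    if (!ok) { unlink(tmpl); return -1; }
    return 0;
}

/* Copy, (verify), open, unlink. Returns the dlopen handle or NULL. */
void *load_private_copy(const char *src)
{
    char tmpl[] = "/tmp/module-XXXXXX";   /* assumed temporary location */
    void *handle;

    if (copy_to_private_temp(src, tmpl) != 0) return NULL;
    /* A signature check would go here, against tmpl and not against src. */
    handle = dlopen(tmpl, RTLD_NOW | RTLD_LOCAL);
    unlink(tmpl);   /* the open library keeps the inode alive until dlclose */
    return handle;
}
```

The caller later passes the returned handle to dlsym and dlclose as usual.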

Basic switching

The simplest method of upgrading short lived network services like webservers or similar systems is straightforward - just keep a reference to the newest version of the loaded library inside the core application and hand each and every new connection to the newest module. Old modules still handle old connections. Access to shared data stores or global state just has to be synchronized. If all modules are reference counted they’ll be closed and released automatically after the old connections have been dropped.

Pro:

Cons:
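The reference counting behind basic switching can be sketched like this. All names are hypothetical; close_fn stands in for dlclose or FreeLibrary, and counting_close only exists so the behavior can be observed without loading a real library. The sketch is single threaded, a real container would protect these functions with a mutex.

```c
#include <stddef.h>

/* One loaded module version. */
struct module_version {
    void *handle;                    /* library handle */
    unsigned long refs;              /* connections still served by this version */
    int is_current;                  /* non-zero while this is the newest version */
    void (*close_fn)(void *handle);  /* stand-in for dlclose / FreeLibrary */
};

static struct module_version *current_version = NULL;

/* Closes a version once it is neither current nor in use. */
static void maybe_close(struct module_version *v)
{
    if (v->refs == 0 && !v->is_current && v->handle != NULL) {
        v->close_fn(v->handle);
        v->handle = NULL;
    }
}

/* Makes `v` the version that receives all new connections. */
void version_publish(struct module_version *v)
{
    struct module_version *old = current_version;
    v->is_current = 1;
    current_version = v;
    if (old != NULL && old != v) {
        old->is_current = 0;
        maybe_close(old);            /* closes immediately if it was idle */
    }
}

/* Called for every accepted connection. */
struct module_version *version_acquire(void)
{
    if (current_version != NULL) current_version->refs++;
    return current_version;
}

/* Called when a connection ends. */
void version_release(struct module_version *v)
{
    v->refs--;
    maybe_close(v);
}

/* Demo stand-in for dlclose: treats the handle as an int counter. */
void counting_close(void *handle) { ++*(int *)handle; }
```

Each connection stores the pointer returned by version_acquire and dispatches through it until it ends, so an old version is only closed after its last connection has called version_release.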

Runtime upgrading

This is a more advanced idea. It works by using an event callback based approach. When a module gets loaded for the first time it registers event callback handlers inside the main container or some event handling framework. For example it registers a function to be called whenever a new incoming connection has been accepted - or it registers a callback that will be called whenever data is received from a client.

The new module will now register filtering event callbacks at the same points at which the old module has registered its own. At first these filters simply forward every event to the previously registered functions. This puts the new module transparently in place. Then the new module starts to transfer state into its own instance via a module implementation specific method and applies the messages passing through the filter to its internal state. This allows runtime state transfer from the old module into the new module - it’s easiest when using a pattern like event sourcing and, for example, caching incoming messages during partial state transfers. As soon as the new module is capable of taking over connections or processes from the old module, the filter functions don’t call the old handlers any more. This allows running connections to be transferred into a new version.

Pros:

Cons:
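The filter chaining used for runtime upgrading can be illustrated with a single registration point. This is only a sketch under assumptions: the handler signature, the data_slot registration point and all function names are invented for this example, and the actual state transfer is reduced to two counters.

```c
#include <stddef.h>

/* Hypothetical event callback: receives module state and a message. */
typedef void (*data_handler)(void *state, const char *msg);

struct handler_slot {
    data_handler fn;
    void *state;
};

/* The container's single registration point for "data received". */
static struct handler_slot data_slot = { NULL, NULL };

/* State of the new module while it sits in front of the old one. */
struct filter_state {
    struct handler_slot previous;  /* handler that was registered before */
    int taken_over;                /* set once the state transfer is complete */
    int seen;                      /* messages observed by the new module */
    int handled;                   /* messages handled by the new module alone */
};

static void filter_on_data(void *state, const char *msg)
{
    struct filter_state *f = state;
    f->seen++;                     /* apply the message to the new state */
    if (!f->taken_over) {
        if (f->previous.fn != NULL)
            f->previous.fn(f->previous.state, msg);  /* old module in charge */
    } else {
        f->handled++;              /* new module processes it on its own */
    }
}

void register_data_handler(data_handler fn, void *state)
{
    data_slot.fn = fn;
    data_slot.state = state;
}

/* Puts the new module's filter in front of the current handler. */
void filter_install(struct filter_state *f)
{
    f->previous = data_slot;
    f->taken_over = 0;
    f->seen = 0;
    f->handled = 0;
    register_data_handler(filter_on_data, f);
}

void filter_take_over(struct filter_state *f) { f->taken_over = 1; }

void dispatch_data(const char *msg)
{
    if (data_slot.fn != NULL) data_slot.fn(data_slot.state, msg);
}

/* Demo stand-in for the old module: counts messages in *state. */
void old_on_data(void *state, const char *msg) { (void)msg; ++*(int *)state; }
```

After filter_take_over the old handler is no longer reachable through the slot, so the old module can be released once its remaining work has finished.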

Details: Directory change notification

So after one knows how to load modules one has to know how directory change notifications work. This is an operating system dependent part - there currently is no portable way of performing such detection.

Note that there is another caveat - file system change notifications are not reliable on many systems (like Windows for example) and do not work on all types of filesystems, for example network filesystems (NFS). To circumvent this situation one should only use change notifications as an immediate indication of change and then perform a scan based on well known metadata of files like last modification time, creation time and/or file size - my implementation uses all of them and detects a change whenever any of the attributes changed. Depending on the OS attributes like owner and group are used as well.

Since it’s possible on most systems that event notifications are missed all implementations that I’ve written also run a periodic scan over the specific watched directories and check modification of attributes independent of any notifications. This of course induces some overhead - especially in case directories get large - but it’s inevitable. One should only use this kind of watching for rather small directories not containing tens of thousands or even millions of files - one might use hashed directory storage and large timeouts for that.

FreeBSD (kqueue)

On FreeBSD the most efficient way to monitor a directory for changes is to simply open the directory using open:

    #include <fcntl.h>

    int hDirectory;

    /* lpDirectory is the path of the watched module directory */
    hDirectory = open(lpDirectory, O_RDONLY|O_SHLOCK|O_DIRECTORY|O_CLOEXEC);
    if(hDirectory < 0) {
        /* Error handling */
    }

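In this case the flags have the following meaning: O_RDONLY opens the directory read only, O_SHLOCK atomically acquires a shared lock, O_DIRECTORY makes the call fail if the path is not a directory and O_CLOEXEC closes the descriptor automatically on exec.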

Then one can simply subscribe to the EVFILT_VNODE watching filter. This filter triggers on different supported conditions on all supported filesystems:

Note that immediately after enabling the filter a directory scan operation should be started. This should also happen after each re-arming of the notification. This is required to not miss any modifications but might trigger scanning twice - which is most of the time acceptable.

Note: Keep in mind that this only watches for directory modifications - the filter EVFILT_VNODE does not trigger if a member of the directory simply gets written to, but it does trigger if the member gets atomically replaced by another file.
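Opening the directory and arming the filter can be sketched as follows. This is only an illustration: the function names are hypothetical, the selection of NOTE_* conditions and the use of EV_ONESHOT (which matches the re-arming described above) are assumptions, and the non-FreeBSD branch is merely a stub so the sketch compiles on other systems.

```c
#include <errno.h>
#include <stddef.h>
#if defined(__FreeBSD__)
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <fcntl.h>
#include <unistd.h>
#endif

/* (Re-)arms a one-shot EVFILT_VNODE filter for the open directory `fd`.
   Returns 0 on success, -1 on failure. */
int arm_vnode_filter(int kq, int fd)
{
#if defined(__FreeBSD__)
    struct kevent change;
    EV_SET(&change, fd, EVFILT_VNODE, EV_ADD | EV_ENABLE | EV_ONESHOT,
           NOTE_WRITE | NOTE_EXTEND | NOTE_DELETE | NOTE_RENAME |
           NOTE_ATTRIB | NOTE_LINK, 0, NULL);
    return kevent(kq, &change, 1, NULL, 0, NULL) < 0 ? -1 : 0;
#else
    (void)kq; (void)fd;
    errno = ENOSYS;   /* stub: kqueue is not used on this platform */
    return -1;
#endif
}

/* Opens `dir` and arms the filter on the kqueue `kq`.
   Returns the directory descriptor, or -1 on error. */
int watch_directory(int kq, const char *dir)
{
#if defined(__FreeBSD__)
    int fd = open(dir, O_RDONLY | O_SHLOCK | O_DIRECTORY | O_CLOEXEC);
    if (fd < 0) return -1;
    if (arm_vnode_filter(kq, fd) != 0) { close(fd); return -1; }
    return fd;
#else
    (void)kq; (void)dir;
    errno = ENOSYS;   /* stub: kqueue is not used on this platform */
    return -1;
#endif
}
```

The kq descriptor comes from kqueue(). The watching thread then blocks in kevent to receive events, calls arm_vnode_filter again after each one and starts a directory scan after every arming.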

Windows (IO completion ports and ReadDirectoryChangesW)

Windows works - as usual - a little bit differently. There are multiple ways to subscribe to directory change notifications. The most flexible and powerful one is to use ReadDirectoryChangesW in conjunction with the excellent IO completion ports (IOCP). IO completion ports are the way to go for asynchronous overlapped operations on Windows. One assigns a file handle to an IO completion port, executes an overlapped I/O operation (i.e. an operation with all required buffers already attached so data can be read or written directly by the specific driver) and gets a notification enqueued in a scalable task queue. The task queue itself can be served by an arbitrary number of threads but is capable of enforcing a concurrency limit - i.e. it can control how many threads process events in parallel.

The flow to use IOCP is somewhat different from kqueue on FreeBSD:

First one has to open the directory. This is done using CreateFile as usual - one should at least specify the GENERIC_READ access permission as well as OPEN_EXISTING to prevent creating a new file. The flags FILE_FLAG_OVERLAPPED and FILE_FLAG_BACKUP_SEMANTICS have to be specified. Overlapped I/O is required for use with IOCP; the backup semantics flag is required to obtain a handle to a directory at all, which ReadDirectoryChangesW needs.

After that the directory handle gets assigned to the IOCP that’s going to be used for directory watching using CreateIoCompletionPort as usual. Since I normally use a single set of threads for all directory watching operations I designate a single IO completion port to directory watching - all watching threads are attached to the same IOCP. Usually I’m also using just one watching thread since change notifications from directories are normally not the highest priority in the applications I’m developing.

Then one has to start a read operation using the ReadDirectoryChangesW operation. This operation already requires a target buffer to write into which has to be pre-allocated. This is usually done on a per directory basis and stored together with the directory handle.

One can specify which type of events one wants to receive:

After starting the operation a scan should also be triggered immediately in order not to miss any modification. The same should be done on every re-arming of the operation. Always enqueue the ReadDirectoryChangesW operation first and the scan operation second - this might trigger two consecutive scans but doesn’t miss any change events.
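The Windows steps (open the directory, attach it to the completion port, start the read) can be sketched like this. The struct dir_watch type, the function names and the chosen FILE_NOTIFY_CHANGE_* filter set are assumptions for illustration only; the non-Windows branch is just a stub so the sketch compiles elsewhere.

```c
#include <stddef.h>
#if defined(_WIN32)
#include <windows.h>

struct dir_watch {
    HANDLE dir;
    OVERLAPPED ov;
    DWORD buffer[4096];   /* DWORD aligned, as ReadDirectoryChangesW requires */
};

/* (Re-)arms the asynchronous read; completion is reported through the IOCP. */
int dir_watch_arm(struct dir_watch *w)
{
    ZeroMemory(&w->ov, sizeof(w->ov));
    return ReadDirectoryChangesW(w->dir, w->buffer, sizeof(w->buffer), FALSE,
        FILE_NOTIFY_CHANGE_FILE_NAME | FILE_NOTIFY_CHANGE_SIZE |
        FILE_NOTIFY_CHANGE_LAST_WRITE | FILE_NOTIFY_CHANGE_ATTRIBUTES,
        NULL, &w->ov, NULL) ? 0 : -1;
}

/* Opens `path`, attaches it to the completion port `port` and arms the read. */
int dir_watch_open(struct dir_watch *w, void *port, const char *path)
{
    w->dir = CreateFileA(path, GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, NULL,
        OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS | FILE_FLAG_OVERLAPPED, NULL);
    if (w->dir == INVALID_HANDLE_VALUE) return -1;
    /* The watch itself is used as completion key to find it again later. */
    if (CreateIoCompletionPort(w->dir, (HANDLE)port, (ULONG_PTR)w, 0) == NULL
            || dir_watch_arm(w) != 0) {
        CloseHandle(w->dir);
        return -1;
    }
    return 0;
}
#else
/* Stub so that the sketch still compiles on other systems. */
struct dir_watch { int unused; };
int dir_watch_open(struct dir_watch *w, void *port, const char *path)
{
    (void)w; (void)port; (void)path;
    return -1;
}
#endif
```

The port would be created beforehand with CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0). A watching thread then waits in GetQueuedCompletionStatus, walks the FILE_NOTIFY_INFORMATION records in the buffer, calls dir_watch_arm again and enqueues a scan.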

Detecting change

Now that one gets change notifications one could be tempted to fully trust them. This would be a major mistake, especially on Windows, since there are many conditions under which notifications get missed, and file alteration monitoring isn’t supported at all on some filesystems like NFS on most major operating systems. Because of this change notifications should only be seen as a hint that something has (very likely) happened - but they should not be relied upon.

To detect changes one might then walk the watched directory or the whole directory hierarchy and keep a record of all known files. As usual this approach is not suited for every application - in case one has thousands or millions of files periodic scanning would not be a good idea (for example when monitoring image or media galleries) - but in case of runtime loadable modules this is totally feasible. If it isn’t feasible because there are too many modules one should think about a different approach of injecting new components.
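Such a walk over a single watched directory can be sketched with opendir and readdir on POSIX systems. The callback type and the function name are assumptions for this example; a real scanner would compare the reported metadata against its record of known files inside the callback.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* Callback invoked for every regular file found during a scan. */
typedef void (*scan_cb)(const char *path, const struct stat *st, void *ctx);

/* Walks `dir` (non recursively). Returns the number of regular files
   seen, or -1 if the directory could not be opened. */
int scan_directory(const char *dir, scan_cb cb, void *ctx)
{
    char path[4096];
    struct dirent *ent;
    struct stat st;
    int count = 0;
    DIR *d = opendir(dir);

    if (d == NULL) return -1;
    while ((ent = readdir(d)) != NULL) {
        int n = snprintf(path, sizeof(path), "%s/%s", dir, ent->d_name);
        if (n < 0 || (size_t)n >= sizeof(path)) continue;   /* path too long */
        if (stat(path, &st) != 0 || !S_ISREG(st.st_mode)) continue;
        if (cb != NULL) cb(path, &st, ctx);
        count++;
    }
    closedir(d);
    return count;
}
```

Files that disappeared since the last scan are not reported by the callback; they have to be found by checking which known records were not visited.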

What can be used to detect change?

In my current implementations I use a tuple of last modified time, access permissions and file size. I only use the hash and signature after the file has been copied to a different location, when its integrity has to be verified.

This information can be gathered using:
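On POSIX systems one way to gather this tuple is stat. The following is a minimal sketch; the struct and function names are made up for this example and it is not the implementation referred to above.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>

/* The tuple used for change detection. */
struct file_fingerprint {
    time_t mtime;   /* last modification time */
    mode_t mode;    /* file type and access permissions */
    off_t size;     /* file size in bytes */
};

/* Fills `fp` from the file at `path`. Returns 0 on success, -1 on error. */
int fingerprint_read(const char *path, struct file_fingerprint *fp)
{
    struct stat st;
    if (stat(path, &st) != 0) return -1;
    fp->mtime = st.st_mtime;
    fp->mode = st.st_mode;
    fp->size = st.st_size;
    return 0;
}

/* Non-zero if any attribute of the tuple differs. */
int fingerprint_changed(const struct file_fingerprint *a,
                        const struct file_fingerprint *b)
{
    return a->mtime != b->mtime || a->mode != b->mode || a->size != b->size;
}
```

Note that st_mtime only has a resolution of one second, so two writes of equal size within the same second cannot be told apart by this tuple alone. On Windows similar information is available through GetFileAttributesEx.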

Note: To be continued

Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)
