Zero-downtime service migration

While running my website, now.in, I ran into a problem: whenever there is a bug on the production server, I need to restart it.  Sounds fine, right? It is just a server restart.  For most HTTP servers that is true: they are stateless, so you can restart them whenever you want.  That does not hold for now.in, because its servers stream audio, and restarting one interrupts every stream it is serving.  Here is a server stats diagram that shows the problem:

You can see some gaps in the plot; they are caused by server restarts.  For users, that is definitely a bad experience, so I have recently been thinking about how to solve this problem.  Before we go into the design, let’s look at the reasons for restarting a server first.

  • To deploy a new version of the program
  • To fix bugs
  • The process is using too much memory
  • To reload the environment, for example ulimit -n (the limit on the number of open file descriptors on Unix-like systems)
  • To migrate from host A to host B

For simply deploying a new version of the program, we can use Python’s reload function to reload modules.  But there are some problems: reload only reruns the module, and instances created earlier are still there with their old code, so it works only if the change is minor.  Reloading also can’t solve the memory usage problem or the process environment problem.  That leaves the final reason, migrating the service from host A to host B.  It is difficult to avoid downtime in such a migration, and it is rarely needed, so we’ll focus only on migration within the same host.
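To make the reload limitation concrete, here is a minimal sketch.  It assumes Python 3, where reload lives in importlib (in Python 2 it is a builtin), and the module name handler_demo and class Handler are hypothetical names invented for the illustration.  An instance created before the reload keeps running the old code, while new instances pick up the new code.

```python
import importlib
import os
import sys
import tempfile

# Skip .pyc caching so the rewritten source is always recompiled.
sys.dont_write_bytecode = True

# Two versions of a hypothetical module, handler_demo.py.
V1 = "class Handler:\n    def greet(self):\n        return 'v1'\n"
V2 = "class Handler:\n    def greet(self):\n        return 'version 2'\n"


def write_module(directory, source):
    with open(os.path.join(directory, "handler_demo.py"), "w") as f:
        f.write(source)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        write_module(tmp, V1)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("handler_demo")
            old = module.Handler()        # created before the reload

            write_module(tmp, V2)         # "deploy" the new version
            importlib.invalidate_caches()
            importlib.reload(module)
            new = module.Handler()        # created after the reload

            # The old instance is still bound to the old class object.
            return old.greet(), new.greet()
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("handler_demo", None)


if __name__ == "__main__":
    print(demo())
```

A long-running server holds many such old instances (one per connection, for example), which is why reload only helps for minor changes.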

The idea

The biggest challenge is how to migrate the connections that are in use.  My idea is simple: create a new process, transfer the connections to it, and shut the old one down.  The following diagrams show my method.

The master is a process in charge of managing the migration and receiving commands from other processes.  Process A is the one running the service.

Before we perform the migration, the master starts process B and waits until it says it’s ready.

When process B says “Hey! I’m ready”, the master tells process A to send the connection state and descriptors to process B.  Process B receives the state and takes over the service.

Finally, once process B has taken over the service, the master tells process A “You are fired.” and process A shuts itself down.

That’s it: the service has been migrated, and there is no downtime.

The problem – socket transfer

The idea sounds good, right? But we still have a technical problem to solve: how do we transfer a socket (a file descriptor) from one process to another?  I did some research and found two methods that achieve this.

Child process

On most Unix-like operating systems, child processes inherit file descriptors from their parent.  We could use this feature to migrate our service, but it has a limitation: you can only hand file descriptors to a child process.
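As a rough illustration of inheritance, the sketch below assumes Python 3 on a Unix-like system and uses os.fork.  A socketpair stands in for a real client connection, and the function name serve_from_child is made up for the example.  The child never receives the socket explicitly; it simply still has it after the fork.

```python
import os
import socket


def serve_from_child():
    # A socketpair stands in for a client connection: "client" is the
    # remote end, "conn" is the server-side socket we want to hand over.
    client, conn = socket.socketpair()

    pid = os.fork()
    if pid == 0:
        # Child: conn was inherited from the parent, nothing was sent.
        try:
            client.close()
            conn.sendall(("child %d is serving" % os.getpid()).encode())
            conn.close()
        finally:
            os._exit(0)

    # Parent: drop our copy; the child keeps the connection open.
    conn.close()
    data = client.recv(1024)
    os.waitpid(pid, 0)
    client.close()
    return pid, data


if __name__ == "__main__":
    print(serve_from_child())
```

This only works in one direction: the old process has to be the parent of the new one, and the descriptor has to be open at the moment the child is created.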

Sendmsg

Another way to achieve the same goal is to use sendmsg over a Unix domain socket to send the file descriptors.  With sendmsg, you can transfer file descriptors to almost any process on the same host, which is much more flexible.
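Here is a minimal sketch of the mechanism itself.  It assumes Python 3.3 or later, where the standard socket module has sendmsg and recvmsg methods, so it does not use the third-party library discussed in this post.  The helper names send_fd and recv_fd are my own.  To keep it short, both ends live in one process and the descriptor being passed is the read end of a pipe; the receiver gets a new descriptor number that refers to the same pipe.

```python
import array
import os
import socket


def send_fd(sock, fd):
    # The fd travels as SCM_RIGHTS ancillary data next to one dummy byte.
    sock.sendmsg(
        [b"F"],
        [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))])


def recv_fd(sock):
    fds = array.array("i")
    msg, ancdata, flags, addr = sock.recvmsg(
        1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any truncated trailing bytes.
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
    return fds[0]


def demo():
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()

    send_fd(left, read_end)
    received = recv_fd(right)      # a new fd number for the same pipe

    os.write(write_end, b"ping")
    data = os.read(received, 4)

    for fd in (read_end, write_end, received):
        os.close(fd)
    left.close()
    right.close()
    return read_end, received, data


if __name__ == "__main__":
    print(demo())
```

The kernel installs a fresh descriptor in the receiving process, so the number on the receiving side generally differs from the one that was sent.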

A simple implementation

To keep the example simple, we implement only process A and process B here; two processes are enough to complete the migration.  Before we go into the details, there is another problem to solve: sendmsg is not available in the Python 2 standard library.  Fortunately, a third-party library named sendmsg provides this function.  Installing it is easy, just type

easy_install sendmsg

And there you are.  Here are the two programs.

a.py

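import os
import socket
import sendmsg
import struct

# Accept one TCP client; this is the connection we will hand over.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('', 5566))
s.listen(1)
conn, addr = s.accept()
conn.send('Hello, process %d is serving\n' % os.getpid())
print 'Accept inet connection', conn

# Remove a stale socket file left by a previous run, or bind() fails.
if os.path.exists('mig.sock'):
    os.unlink('mig.sock')

# Wait on a Unix domain socket for process B to connect.
us = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
us.bind('mig.sock')
us.listen(1)
uconn, addr = us.accept()
print 'Accept unix connection', uconn

# Pass the client socket's fd to B as SCM_RIGHTS ancillary data.
payload = struct.pack('i', conn.fileno())
sendmsg.sendmsg(
    uconn.fileno(), '', 0, (socket.SOL_SOCKET, sendmsg.SCM_RIGHTS, payload))
print 'Sent socket', conn.fileno()
print 'Done.'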

b.py

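import os
import socket
import sendmsg
import struct

# Connect to process A over the Unix domain socket.
us = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
us.connect('mig.sock')
print 'Make unix connection', us

# Receive the ancillary message that carries the passed fd.
result = sendmsg.recvmsg(us.fileno())
identifier, flags, [(level, type, data)] = result
print identifier, flags, [(level, type, data)]
fd = struct.unpack('i', data)[0]
print 'Get fd', fd

# fromfd() duplicates the fd, so close the original after wrapping it.
conn = socket.fromfd(fd, socket.AF_INET, socket.SOCK_STREAM)
os.close(fd)
conn.send('Hello, process %d is serving\n' % os.getpid())
# Keep the process (and the connection) alive until Enter is pressed.
raw_input()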

The flow is simple: a.py accepts an inet connection, opens a Unix domain socket, and waits for b.py to take over the service.  Then we run b.py, which connects to a.py, receives the fd of the connection, and takes over the service.
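For readers on a newer Python, the same flow can be sketched in a single script.  This is not the code above: it assumes Python 3.9 or later, where the standard library offers socket.send_fds and socket.recv_fds, and it starts process B as a separate interpreter with subprocess.  The names migrate and B_SOURCE, the temporary socket path, and the use of an ephemeral port on 127.0.0.1 are all choices made for the illustration.  The script also plays the client, so it can show both greetings arriving on one connection.

```python
import os
import socket
import subprocess
import sys
import tempfile

# Source of "process B": connect, receive the fd, take over the client.
B_SOURCE = r'''
import os, socket, sys
chan = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
chan.connect(sys.argv[1])            # connecting tells A that B is ready
_, fds, _, _ = socket.recv_fds(chan, 16, 1)
conn = socket.socket(fileno=fds[0])  # adopt the received descriptor
conn.sendall(b"B pid %d\n" % os.getpid())
conn.close()
'''


def migrate():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mig.sock")

        # Process A accepts a TCP client and starts serving it.
        listener = socket.create_server(("127.0.0.1", 0))
        client = socket.create_connection(listener.getsockname())
        conn, _ = listener.accept()
        conn.sendall(b"A pid %d\n" % os.getpid())

        # A waits on a Unix domain socket for B to report in.
        unix_listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        unix_listener.bind(path)
        unix_listener.listen(1)
        proc = subprocess.Popen([sys.executable, "-c", B_SOURCE, path])
        chan, _ = unix_listener.accept()

        # Hand the client connection to B, then drop A's own copy.
        socket.send_fds(chan, [b"take"], [conn.fileno()])
        conn.close()
        proc.wait()

        # The client sees one unbroken connection served by A, then B.
        received = b""
        while chunk := client.recv(1024):
            received += chunk
        for s in (client, chan, unix_listener, listener):
            s.close()
        return os.getpid(), received.decode().splitlines()


if __name__ == "__main__":
    print(migrate())
```

Python sockets are not inherited by subprocesses by default, so in this sketch B can only reach the client through the descriptor it receives over the Unix domain socket.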

The result

As the result shows, there is no downtime during the service migration.

This technique is very useful in server programs.  You can even migrate a service from a Python server to a C/C++ server, or vice versa.  To keep memory usage low, you can also periodically migrate the service to a fresh process of the same program.  I will try to use this technique in my servers to achieve zero-downtime migration.  If you are interested in this technique, give it a try; it is fun and useful :)
