    Writing Scripts That Only Allow One Instance to Run

    As a UNIX admin it's often required that certain scripts allow only a single simultaneous instance to run. For me this most often happens with backup scripts. The duration of a single run of a script is tied to how much data you want to copy and how fast it can move. This varies from day to day and in many scenarios could take longer than the time window provided by the CRON schedule. In other words: a daily backup could take longer than a day. And starting a second backup while the first is still running only compounds the problem.

    There may be any number of other reasons a script needs to be limited to a single running instance. So it is handy to have a mechanism, within the limited facilities of BASH or other shells, to protect our script from having additional instances start while the first is still running. Over the years I've perfected a relatively simple technique that works well in BASH on Linux. It should work just as well in other languages, if they don't provide better facilities, and on the other UNIXes (BSDs, OS X, ...).

    To accomplish this we need a few things:

    1. A flag to show that our process is running.
    2. An exit if the flag already exists.
    3. And a way to make sure the flag only exists during the run of the task (i.e. delete it when done).

    My solution uses the following ingredients:

    1. A subdirectory as a flag, since directory creation is an atomic operation.
    2. And the "trap" statement to clean it up.

    The basic script recipe looks like this:

    LOCK="/tmp/unique-job-name"
    # Make sure only one of us is running:
    if ! mkdir "$LOCK" 2> /dev/null; then
      echo "Still running. If it's not please remove \"$LOCK\" and restart." >&2
      exit 2
    fi
    # If we get signaled to end remove lock
    trap 'rmdir "$LOCK"' EXIT


    ... do something ...

    To use this just put the code at the top of your script. The rest of the script can be written as usual. This header is all that's needed to prevent multiple instances.
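
    For example, the header dropped into a skeleton of a CRON-launched backup script might look like this (the rsync line and the paths are only placeholders):

    #!/bin/bash
    # Hypothetical nightly backup that refuses to run twice at once.

    LOCK="/tmp/nightly-backup"

    # Make sure only one of us is running:
    if ! mkdir "$LOCK" 2> /dev/null; then
      echo "Still running. If it's not please remove \"$LOCK\" and restart." >&2
      exit 2
    fi
    # If we get signaled to end remove lock
    trap 'rmdir "$LOCK"' EXIT

    # ... the actual backup work; source and destination are made up ...
    rsync -a --delete /home/ /mnt/backup/home/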

    Briefly this scriptlet works like this:

    1. We determine the directory we want to use for locking and assign it to $LOCK.
    2. We attempt to create the $LOCK directory.
    3. If it fails, an instance must already be running, so we exit.
    4. If it didn't fail, we use "trap" to remove the directory when the script EXITs for any reason.
    5. Then the rest of the script is free to proceed as normal.

    The details:

    The shell variable $LOCK is set to a directory name that is unique to this particular task. I place this in "/tmp" since it's universally writable. But if permissions allow, it could be placed wherever is convenient. From a systems standpoint it is good practice to use the same folder for all locks and to give every single-instance script/program you make its own unique name. Other good options might be "/var/run/lock", or simply "/run/lock", or the more traditional "/var/lock". On modern distros "/var/run" is a symlink to the "/run" ramdrive and "/var/lock" is a symlink to "/run/lock".
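
    If you want the script to adapt to whichever of those locations the distro provides, a small sketch along these lines should do (the candidate list and the fallback to "/tmp" are just my suggestions):

    # Pick the first lock parent that exists and is writable; fall back to /tmp.
    for DIR in /run/lock /var/lock /tmp; do
      if [ -d "$DIR" ] && [ -w "$DIR" ]; then
        LOCK="$DIR/unique-job-name"
        break
      fi
    done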

    Using a "RAM disk" (tmpfs) for this is a particularly good way to go since it automatically self-destructs at reboot and its fast and we are using a teensy amount of space.

    "mkdir" will fail if the directory already exists. This gives us our flag creation and check in one command. Its tempting to do something like this:

    if [ ! -e "$LOCK" ]; then
      > "$LOCK" # Create's file
    else
      echo "Still going..."
      exit 2
    fi
    trap 'rm -f "$LOCK"' EXIT

    This can work, but here's the problem with it: the check for existence and the creation are two separate operations, which leaves room for interruption between them. And it's not an error for the same file to be (re)created multiple times. So it's possible for multiple instances to start, each fail to see the "lock file", each create it, and both run. This is called a "race condition". It is especially likely to happen on a multi-core machine, since more than one program can literally be running simultaneously, whereas a single-core machine can only run one at a time. But even in the single-core case it's still possible for a time slice to end after the existence check but before the file is created. In my CRON-launched scenario this is not likely to happen, so with the addition of the "trap" command this could work for the purpose. But it's not bulletproof.
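
    If you want to see the race for yourself, a rough demo (not part of the recipe, and the results vary with machine and load) is to fire off a pile of subshells that all do the check-then-create dance against the same flag file. More than one of them may claim to be alone:

    FLAG="/tmp/race-demo-flag"
    rm -f "$FLAG"
    for i in $(seq 1 50); do
      ( if [ ! -e "$FLAG" ]; then > "$FLAG"; echo "instance $i thinks it is alone"; fi ) &
    done
    wait
    rm -f "$FLAG"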

    I believe good code is a "habit". So I learn the better/best ways to do things and repeat them everywhere. As I mentioned, "mkdir" is what's called an "atomic" operation. What this means is that the kernel will make sure that only one process can do the check for existence and the creation at any one time, and will not allow the same directory to be recreated while it already exists. If "mkdir" returns successfully I know that the directory didn't exist prior to my creating it, and by extension that this is the only instance running... of course someone with enough permissions could "rmdir" the lock folder while the process is still running and cause multiple instances to run. So watch your permissions!
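
    Running the same experiment with "mkdir" in place of the check-then-create shows exactly one winner, no matter how many subshells you launch:

    LOCK="/tmp/race-demo"
    rm -rf "$LOCK"
    for i in $(seq 1 50); do
      ( mkdir "$LOCK" 2> /dev/null && echo "instance $i got the lock" ) &
    done
    wait
    rmdir "$LOCK"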

    Another possible issue that would cause "mkdir" to fail is a write problem with the parent directory, like lack of write permission or running out of space. That would not allow any instances to run. So be careful where you place your $LOCK directory. If it does fail, claiming to be running, check that you still have rights. For my purposes this is of no consequence, but if one really wanted to, you could write an additional test to verify the write permission and alter the error message, as sketched below.
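
    Such a test is not part of my recipe, but a sketch of it could look like this (the exit code 3 and the wording are just one way to do it):

    # Distinguish "can't create the lock at all" from "already running".
    PARENT=$(dirname "$LOCK")
    if [ ! -w "$PARENT" ]; then
      echo "Cannot write to \"$PARENT\" -- fix permissions or pick another spot." >&2
      exit 3
    fi
    if ! mkdir "$LOCK" 2> /dev/null; then
      echo "Still running. If it's not please remove \"$LOCK\" and restart." >&2
      exit 2
    fi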

    If the "mkdir" failed we go on to report the error to stderr and then exit with a code of 2. The exit code is simply my preference. Specifying a number other than 0 will signal to a parent process that the script failed. This is typical UNIX operation. Since many command line tools take arguments I personally reserve the first error code, 1, for "bad user"... uh... I mean invalid arguments. Then all script / program exit codes above 1 are used for application specifice situations. In this case I used 2 to signal "already running". If you script is part of some larger program it can be useful to signal different failure reasons so that the calling process can respond differently to them. In this case some kind of "task manager" might see the 2 and say, "no need to worry, its still running."

    Writing the error to stderr is also just normal UNIX practice. If the script is part of a pipeline then the error message won't pollute the output data. Or some controlling process might ignore stdout and log stderr. Or in a situation like "inetd" you could have both scenarios happening: stdout goes out to the client as if it were a pipeline, while stderr gets recorded in the logs.
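
    For example (the script name and paths are placeholders), a caller can keep the two streams apart like this:

    # stdout stays in the pipeline; stderr from the script goes to a log file.
    ./report.sh 2>> /var/log/report.err | gzip > report.gz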

    The "trap" statement is the special magic sauce. "trap" catches signals and allows our script to either ignore them or perform some action on them. The "trap" command also allows us to revert them to normal system defined behavior. See "help trap" and/or "man bash" for more info. The word "EXIT", in this case, is a POSIX shell specific pseudo signal. It happens when the shell or script exits for any reason, even errors. Well... any reason short of power failure, hardware failure, kernel panic, OOM Killer, SIGKILL (kill -9). :-D I might have missed something. ;-) So we use "trap" to make sure the directory is removed when the script exits.

    A parting thought:

    Since the $LOCK directory has been created, one could also use it to store work files. You could change the "rmdir" in the "trap" command to "rm -rf" for the clean up. You'd also want to be more wary of permissions. Since I'm only using the directory name I don't need to worry about the perms. But if you used it as temp space for sensitive information you'd at least want to consider "chmod 700 ..." and "umask 077", or the other way around. ;-) You'd also want to consider burying the $LOCK directory in a protected parent directory.
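
    A minimal sketch of that variation, with made-up work files and the permissions suggested above, might look like:

    umask 077                      # any work files are created private by default
    LOCK="/tmp/unique-job-name"
    if ! mkdir "$LOCK" 2> /dev/null; then
      echo "Still running. If it's not please remove \"$LOCK\" and restart." >&2
      exit 2
    fi
    chmod 700 "$LOCK"              # keep the directory itself private too
    trap 'rm -rf "$LOCK"' EXIT     # now cleans up the work files as well

    sort ./data.txt > "$LOCK/sorted.tmp"   # hypothetical use of the scratch space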

    Have fun scripting!