Occasionally, even in a well-maintained system, qmail has issues.

One semi-common issue I see is a remote server that we are sending mail to never timing out. This ties up an outgoing mail slot. Over time, the whole outgoing or incoming queue can end up sitting idle, because every slot is held by a ‘tarpitted’ connection.

Ideally qmail should be able to cope with these. It has settings that control how long it waits to establish a connection and how long it waits for a response. These settings live in the following control files (usually in /var/qmail/control):

timeoutconnect – how long qmail-remote waits for a remote server to accept the initial outgoing connection before trying another mail server.
timeoutremote – how long qmail-remote waits for each response from a connected remote server.
timeoutsmtpd – how long qmail-smtpd waits for each new chunk of data on an incoming connection before dropping it.

In our system, we set these values to:
30 seconds for timeoutconnect
600 seconds for timeoutremote
360 seconds for timeoutsmtpd
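
Each of these settings is a plain text file containing a single number of seconds. As a minimal sketch, the values above could be written like this. The function name set_qmail_timeouts is made up for illustration, and it takes the control directory as an argument so that it can be tried against a scratch directory first.

```bash
#!/bin/bash
# Hypothetical helper (not part of qmail): write the three timeout
# control files into the directory given as the first argument.
set_qmail_timeouts() {
    local dir="$1"
    echo 30  > "$dir/timeoutconnect"   # seconds to wait for the connection to be accepted
    echo 600 > "$dir/timeoutremote"    # seconds to wait for each response from the remote server
    echo 360 > "$dir/timeoutsmtpd"     # seconds to wait for data from an incoming client
}
```

On a live system this would be run as root with the real directory, for example: set_qmail_timeouts /var/qmail/control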

In theory, timeoutremote should make qmail drop a connection after 10 minutes (600 seconds).
In practice, it doesn’t.

Why?

timeoutremote only applies when the connection has received no data for the whole timeout period.
It doesn’t limit the total duration of the connection.
Whenever the remote end sends some data, the timer is reset and qmail waits for another full timeoutremote period. If the remote server dribbles back a little data once every few minutes, it can keep a connection alive for as long as it wants.
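
The difference between an idle timeout and a hard limit on total connection time can be shown with a toy model. This is only a sketch of the behaviour described above, not qmail's actual code; the function drop_time and its arguments are invented for the illustration. It takes an idle timeout, an optional hard deadline (0 means none), and the times at which the remote end sends data, and prints the second at which the connection would be dropped.

```bash
#!/bin/bash
# Toy model, not qmail's code.
# $1 = idle timeout in seconds
# $2 = hard deadline in seconds since connect (0 = no hard deadline)
# remaining arguments = seconds since connect at which data arrives (ascending)
drop_time() {
    local idle="$1" deadline="$2" last=0 t drop
    shift 2
    for t in "$@"; do
        # Data arriving after the idle timer has already expired is too late.
        if [ $(( t - last )) -ge "$idle" ]; then
            break
        fi
        last=$t    # any data restarts the idle timer
    done
    drop=$(( last + idle ))
    # A hard deadline, if set, caps the connection regardless of activity.
    if [ "$deadline" -gt 0 ] && [ "$drop" -gt "$deadline" ]; then
        drop=$deadline
    fi
    echo "$drop"
}

# Idle timeout only: data every 500 seconds keeps the connection open.
drop_time 600 0 500 1000 1500 2000                          # prints 2600
# Same traffic pattern with a one-hour hard limit.
drop_time 600 3600 500 1000 1500 2000 2500 3000 3500 4000   # prints 3600
```

With only the idle timeout, the connection lasts until 600 seconds after the last dribble of data, however long that takes. With a hard limit, it ends at the limit no matter what the remote end does.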

This may not happen very often, but it can happen often enough to tie up our connection slots over time. In practice I’ve seen connections last for days or even weeks.

Ideally one should be able to set a hard limit in qmail, so that any connection older than a certain time gets killed, or at least set something up in ucspi-tcp. However, that’s something for another time.

Here is a real-world example.

I’ve run my zombie-killing script in test mode (see the bottom of the page for the script):

/var/qmail/bin/kill-qmail-smtpd-zombies --test
**Running in TEST mode**
Running: ps ax -o etime,pid,comm --no-heading | grep qmail-remote | grep ':[0-9][0-9]:' | awk '{print $2}'
-=-=-=-=-=-=-=-=-=-=-
Found zombies, setting up shotgun.
Killing qmail-remote zombies
kill -9 26707
-=-=-=-=-=-=-=-=-=-=-

It has come up with a connection that’s been running for longer than an hour: PID 26707.

I’ll double-check that it’s correct:

ps ax -o etime,pid,comm | grep 26707
01:39:07 26707 qmail-remote

Yup, qmail-remote has been running for 1 hour 39 minutes on that connection.

Let’s check what the connection is:

ps -ef | grep 26707
root 2964 17112 0 13:01 pts/2 00:00:00 grep 26707
qmailr 26707 21959 0 11:23 ? 00:00:00 qmail-remote bamboo.sz.js.cn zhangbin@bamboo.sz.js.cn

Hmm, it’s a known troublesome server, bamboo.sz.js.cn.
In fact, it’s the one that caused me to write this article!

Let’s watch what’s actually happening in real time.

strace -p 26707
Process 26707 attached - interrupt to quit
read(3,

[wait for a minute or two…]

Still nothing.

Hmm, it’s sitting there waiting for a read to return. Guess what happens before the timeout period is up?
Yup, we receive a few more characters just in time to keep the connection up and running…

We could set timeoutremote to a lower number, but we do have cases where servers are genuinely slow to respond because of various spam tests (although they usually pick up speed once we pass those tests), so I prefer another method.

What’s my current solution (lazy, in lieu of patching qmail or ucspi-tcp)?

A zombie-culling script!

To install in your qmail/bin folder, do the following:


cd /var/qmail/bin
wget http://www.computersolutions.cn/blog/wp-content/uploads/2010/02/kill-qmail-zombies.txt
mv kill-qmail-zombies.txt kill-qmail-zombies.sh
chmod 0700 kill-qmail-zombies.sh

The script has built-in help. The parameters are:
./kill-qmail-zombies.sh
--test - Run in test mode (zombie friendly)
--help - Show the help
--force - Kill some zombies!

For example:

./kill-qmail-zombies.sh --test

You could set this to run every few hours from cron, but I strongly suggest you test first to see if it works correctly. See the built-in help for more info on that.

The script is below for those who want to take a look. It’s one of my first shell scripts, so feel free to laugh, and comment accordingly!
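
A crontab entry along these lines would do it. The three-hour interval and the log file path are assumptions for illustration only; adjust both to taste, and only switch to --force once the --test output looks right on your system.

```
# m  h    dom mon dow  command
0    */3  *   *   *    /var/qmail/bin/kill-qmail-zombies.sh --force >> /var/log/kill-qmail-zombies.log 2>&1
```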

#!/bin/bash

# ===========================
# qmail zombie killer script
# Version: 1.0
# Author: L. Sheed
# Company: Computer Solutions
# URL: http://www.computersolutions.cn
# ===========================

PATH=/usr/bin:/bin

function short_usage
{
cat <<- _EOF_
$0: missing parameter
Try '$0 --help' for more information.

_EOF_
}

function usage
{
cat <<- _EOF_
Parameters:
--force  kill qmail-remote and qmail-smtpd processes (aka zombies) older than 1 hour
--test 	 do a test run (no zombie processes will be harmed)
--help   show this help page

Notes:
Strongly suggest testing first to check that the ps line works correctly on your system before killing any processes!
e.g. run the ps below on your system, and see if the output looks similar

ps ax -o etime,pid,comm --no-heading | grep qmail-smtp
      04:40  6468 qmail-smtpd
      01:47  7473 qmail-smtpd
      01:00  8142 qmail-smtpd
      01:00  8143 qmail-smtpd
      00:46  8235 qmail-smtpd
      00:36  8283 qmail-smtpd
      00:19  8391 qmail-smtpd
      00:11  8445 qmail-smtpd
      00:07  8494 qmail-smtpd

_EOF_
}

function zap_the_bastards
{
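# Pick PIDs whose elapsed time has an hours field (HH:MM:SS or DD-HH:MM:SS)
PLIST=$(ps ax -o etime,pid,comm --no-heading | grep "$WHAT" | grep ':[0-9][0-9]:' | awk '{print $2}')

#In test mode, show what would be called also
if [ "$test" = "1" ]; then
	# \$2 is escaped so the shell prints it literally instead of expanding it
	echo "Running:  ps ax -o etime,pid,comm --no-heading | grep $WHAT | grep ':[0-9][0-9]:' | awk '{print \$2}'"
fi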

if [ -n "${PLIST:-}" ]
then
	echo "-=-=-=-=-=-=-=-=-=-=-"
	echo "Found zombies, setting up shotgun."
	echo "Killing $WHAT zombies"
	for p in $PLIST
	do
		if [ "$force" = "1" ]; then
			echo "Kabooom:"
			kill -9 $p
		fi
		echo "kill -9 $p"
	done
	echo "-=-=-=-=-=-=-=-=-=-=-"
else
	echo "Good news everybody.  No $WHAT zombies found."
fi
}

## Main

#parse our parameters
if [ $# -ne 1 ]; then
	short_usage
	exit 1
fi

while [ "$1" != "" ]; do
 case $1 in
        --force )
        echo "**Running in FORCE mode**"
        force=1
        ;;
        --help )
        usage
        exit
        ;;
	--test )
	echo "**Running in TEST mode**"
	test=1
	;;
 esac
shift
done

#do the deed
targets=( "qmail-remote" "qmail-smtpd" )

for target in "${targets[@]}"
do
	WHAT=$target
	zap_the_bastards
done
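
The script finds old processes by grepping the etime column for ':[0-9][0-9]:'. ps prints etime as [[DD-]HH:]MM:SS, so that pattern only matches once an hours field is present, which fixes the threshold at one hour. The grep on the process name is also a substring match. For anyone wanting a different threshold, here is a rough sketch of an alternative filter. The function name old_pids is hypothetical and this is not part of the script above; it reads lines in the same "etime pid comm" layout on standard input and prints the PIDs of processes with exactly the given name that are at least the given number of seconds old.

```bash
#!/bin/bash
# Hypothetical alternative to the grep-based selection.
# $1 = exact command name, $2 = minimum age in seconds
# stdin = lines of "etime pid comm", as printed by: ps ax -o etime,pid,comm --no-heading
old_pids() {
    local name="$1" min_age="$2"
    awk -v name="$name" -v min_age="$min_age" '
        $3 == name {
            days = 0
            t = $1
            # Split off an optional leading "DD-" days field.
            if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
            # What remains is either HH:MM:SS or MM:SS.
            n = split(t, f, ":")
            secs = (n == 3) ? f[1] * 3600 + f[2] * 60 + f[3] : f[1] * 60 + f[2]
            if (days * 86400 + secs >= min_age) print $2
        }'
}
```

For example, to list qmail-remote processes older than two hours: ps ax -o etime,pid,comm --no-heading | old_pids qmail-remote 7200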
