Support

Blog

Browsing all articles tagged with debugging

While I usually am good at finding issues relatively quickly, I spent roughly 5 hours troubleshooting an issue today with incoming mail scanning.

What was the issue we were seeing?

Mail would randomly not get scanned by our mail scanner process (simscan), and simscan would exit with errors in various places.

eg

@4000000050d6d5601caac0b4 simscan: in run_ripmime
@4000000050d6d5601caac49c simscan: ripmime error
@4000000050d6d5601cab12bc simscan: exit error code: 71
@4000000050d6d5610478e3cc tcpserver: end 26607 status 0
@4000000050d6d5610478eb9c tcpserver: status: 5/150

What went wrong?

Initially I thought a recent update of our internal antivirus scanner software was to blame, as that was the only change.
I quickly eliminated that as an issue, by disabling the av test.
It worked for a few minutes, then started working incorrectly again.

My next thought was permissions, so I checked those against other servers, checked file permissions, checked ownership etc – all looked good.

Still no progress.

Eventually I recompiled most of the mail subsystem in case something funky was going on. Still no progress.

As it seemed to literally only happen to the simscan process, I decided to look into that code.

I compiled without rip mime initially, as I thought that was the issue, but again, it would work for one or two mails, then start breaking.

I decided to add in some additional debugging code inside simscan.c to see where things were breaking.

@4000000050d6d54c0d613584 simscan: in run_ripmime
@4000000050d6d54c0d613584 simscan: ripmime error
@4000000050d6d54c0d61a6cc simscan: exit error code: 71

I could see that it was calling the correct code segment, but still failing.
If I compiled without ripmime, it would work for a few minutes, then also fail on clamdscan.

I fiddled about with that for a good hour or two, until I decided to add more debugging, and recompile with ripmime again.

I added a few debug statements into simscan to let me know what was happening inside the ripmime function:

int run_ripmime()
{

int pid;
int rmstat;

if ( DebugFlag > 0 ) {
fprintf(stderr, "simscan: in run_ipmime\n");
}

/* fork ripmime */
switch(pid = vfork()) {
case -1:
if ( DebugFlag > 0 ) {
fprintf(stderr, "simscan:vfork ripmime error.\n");
}
return(-1);
...

I could see that simscan couldn’t fork ripmime.
What was weird though, was that if I changed to the simscan process, and ran the test manually, it would work.

Just not though qmail

After another hour or two of looking at incorrect things, I decided to go back and take a better look at the vfork issue.

Googling vfork fail linux eventually found my reason.

It ended up not being permissions related – vfork was actually failing, due to hitting its process cap.

qmail had reached its max limit of child processes, so simscan was getting called, then simscan would try to execute another process, and bam, max processes reached.

This was why it didn’t happen on the command line, but only in production.

The server is actually set to unlimited processes (see below), so this must probably have hit a linux kernel limit (unlimited doesn’t always mean unlimited!)

ulimit for server below:
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 16382
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

ps -ef | grep qmail showed that we had a few hundred defunct qmail processes running, so did a

qmailctl stop

killall qmail-smtpd # which killed all the defunct processes)

qmailctl start

simscan started working once again. Next time it happens, I’ll be notified in the error log about vfork issues, and hopefully can spend some time to see why qmailctl restart doesn’t kill off the defunct qmail-smtpd processes…

This was quite a hard issue to debug, as all the issues and solutions online pointed to other common issues like permissions!

Eventually I’ll probably redo simscan to use fork() rather than vfork() as its not recommended.

Still, I learnt more useful things in the journey, so it wasn’t completely wasted time, although I wish it didn’t take me 5+ hours to debug!

Refs:
https://www.securecoding.cert.org/confluence/pages/viewpage.action?pageId=1703954

Archives

Categories

Tags

PHOTOSTREAM