23
simscan and vfork issues
While I am usually good at finding issues relatively quickly, I spent roughly 5 hours today troubleshooting a problem with incoming mail scanning.
What was the issue we were seeing?
Mail would randomly not get scanned by our mail scanner process (simscan), and simscan would exit with errors in various places.
For example:
@4000000050d6d5601caac0b4 simscan: in run_ripmime
@4000000050d6d5601caac49c simscan: ripmime error
@4000000050d6d5601cab12bc simscan: exit error code: 71
@4000000050d6d5610478e3cc tcpserver: end 26607 status 0
@4000000050d6d5610478eb9c tcpserver: status: 5/150
What went wrong?
Initially I thought a recent update of our internal antivirus scanner software was to blame, as that was the only change.
I quickly ruled that out by disabling the AV test.
Scanning worked for a few minutes, then started failing again.
My next thought was permissions, so I checked those against other servers, checked file permissions, checked ownership etc – all looked good.
Still no progress.
Eventually I recompiled most of the mail subsystem in case something funky was going on. Still no progress.
As it seemed to literally only happen to the simscan process, I decided to look into that code.
I compiled without ripmime initially, as I thought that was the issue, but again, it would work for one or two mails, then start breaking.
I decided to add in some additional debugging code inside simscan.c to see where things were breaking.
@4000000050d6d54c0d613584 simscan: in run_ripmime
@4000000050d6d54c0d613584 simscan: ripmime error
@4000000050d6d54c0d61a6cc simscan: exit error code: 71
I could see that it was calling the correct code segment, but still failing.
If I compiled without ripmime, it would work for a few minutes, then also fail on clamdscan.
I fiddled about with that for a good hour or two, until I decided to add more debugging, and recompile with ripmime again.
I added a few debug statements into simscan to let me know what was happening inside the ripmime function:
int run_ripmime()
{
    int pid;
    int rmstat;

    if ( DebugFlag > 0 ) {
        fprintf(stderr, "simscan: in run_ripmime\n");
    }

    /* fork ripmime */
    switch(pid = vfork()) {
    case -1:
        /* vfork itself failed: no child process was created */
        if ( DebugFlag > 0 ) {
            fprintf(stderr, "simscan:vfork ripmime error.\n");
        }
        return(-1);
    ...
I could see that simscan couldn’t fork ripmime.
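In hindsight, logging errno at this point would have shortened the hunt considerably. Here is a minimal sketch of what that could look like. The helper names fork_failure_hint and log_fork_failure are my own and are not part of simscan; the errno meanings are the ones documented for fork() on Linux.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of simscan): map the errno left by a
 * failed fork()/vfork() to a short hint about the likely cause. */
const char *fork_failure_hint(int err)
{
    switch (err) {
    case EAGAIN:
        return "process limit reached (RLIMIT_NPROC or a system-wide cap)";
    case ENOMEM:
        return "not enough memory to create the process";
    default:
        return "unexpected error";
    }
}

/* Hypothetical helper: report why the fork failed. Pass in errno as saved
 * immediately after fork()/vfork() returned -1, before any other call
 * can overwrite it. */
void log_fork_failure(const char *what, int err)
{
    fprintf(stderr, "simscan: vfork %s error: %s (%s)\n",
            what, strerror(err), fork_failure_hint(err));
}
```

In the case -1 branch this would be called as log_fork_failure("ripmime", errno). An EAGAIN there points at a process limit rather than at permissions.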
What was weird though, was that if I changed to the simscan process, and ran the test manually, it would work.
Just not through qmail.
After another hour or two of looking at incorrect things, I decided to go back and take a better look at the vfork issue.
Googling "vfork fail linux" eventually turned up the reason.
It ended up not being permissions related – vfork was actually failing, due to hitting its process cap.
qmail had reached its max limit of child processes, so simscan was getting called, then simscan would try to execute another process, and bam, max processes reached.
This was why it didn’t happen on the command line, but only in production.
The server is actually set to unlimited processes (see below), so we presumably hit a Linux kernel limit (unlimited doesn’t always mean unlimited!)
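One caveat: ulimit -a only reports the limits of the shell it is run from, and I am assuming here that the limits qmail's processes run under could differ from those. A rough way to check is to ask from inside the process itself with getrlimit(). The helper names below are made up for illustration; RLIMIT_NPROC is the per-user process limit on Linux.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Format one limit value, printing "unlimited" for RLIM_INFINITY. */
static void format_limit(char *buf, size_t len, rlim_t value)
{
    if (value == RLIM_INFINITY)
        snprintf(buf, len, "unlimited");
    else
        snprintf(buf, len, "%llu", (unsigned long long)value);
}

/* Hypothetical helper: describe the calling process's own soft and hard
 * RLIMIT_NPROC values. Returns 0 on success, -1 if getrlimit() fails. */
int describe_nproc_limit(char *buf, size_t len)
{
    struct rlimit rl;
    char soft[32], hard[32];

    if (getrlimit(RLIMIT_NPROC, &rl) != 0)
        return -1;

    format_limit(soft, sizeof soft, rl.rlim_cur);
    format_limit(hard, sizeof hard, rl.rlim_max);
    snprintf(buf, len, "nproc soft=%s hard=%s", soft, hard);
    return 0;
}
```

Calling this from the debug section of simscan and printing the result to stderr would show the limit the scanner actually runs with, rather than the one my login shell has.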
ulimit for server below:
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 16382
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
ps -ef | grep qmail showed that we had a few hundred defunct qmail processes hanging around, so I ran:
qmailctl stop
killall qmail-smtpd # which killed all the defunct processes
qmailctl start
simscan started working once again. Next time it happens, I’ll be notified in the error log about vfork issues, and hopefully can spend some time to see why qmailctl restart doesn’t kill off the defunct qmail-smtpd processes…
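To catch the build-up earlier, the defunct-process check could be automated. Below is a Linux-only sketch that counts zombie processes with a given name by reading /proc/<pid>/stat, where the third field is the process state and Z means defunct. The function names are my own, and the threshold you would alert on is left to the caller.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Parse one /proc/<pid>/stat line of the form "pid (comm) state ...".
 * Copies comm into comm_out and returns the state character,
 * or 0 if the line is malformed. comm_len must be at least 1. */
char parse_stat_line(const char *line, char *comm_out, size_t comm_len)
{
    const char *open = strchr(line, '(');
    const char *close = strrchr(line, ')');  /* comm may itself contain ')' */
    size_t n;

    if (!open || !close || close < open || close[1] != ' ' || close[2] == '\0')
        return 0;

    n = (size_t)(close - open - 1);
    if (n >= comm_len)
        n = comm_len - 1;
    memcpy(comm_out, open + 1, n);
    comm_out[n] = '\0';
    return close[2];
}

/* Count processes named `name` that are in state Z (defunct).
 * Returns -1 if /proc cannot be opened. */
int count_defunct(const char *name)
{
    DIR *proc = opendir("/proc");
    struct dirent *de;
    int count = 0;

    if (!proc)
        return -1;

    while ((de = readdir(proc)) != NULL) {
        char path[300], line[512], comm[64];
        FILE *f;

        if (!isdigit((unsigned char)de->d_name[0]))
            continue;                        /* not a PID directory */
        snprintf(path, sizeof path, "/proc/%s/stat", de->d_name);
        f = fopen(path, "r");
        if (!f)
            continue;                        /* process already gone */
        if (fgets(line, sizeof line, f) &&
            parse_stat_line(line, comm, sizeof comm) == 'Z' &&
            strcmp(comm, name) == 0)
            count++;
        fclose(f);
    }
    closedir(proc);
    return count;
}
```

A cron job or monitoring check could call count_defunct("qmail-smtpd") and complain once the number gets into the dozens, well before it reaches the few hundred I found.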
This was quite a hard issue to debug, as all the issues and solutions online pointed to other common issues like permissions!
Eventually I’ll probably redo simscan to use fork() rather than vfork(), as vfork() is not recommended.
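A fork()-based version might look roughly like the following. This is a sketch of the general fork/exec/wait pattern, not simscan's actual code: run_child is a hypothetical helper, and I use execvp() here for brevity, so the real thing may want a full path to the binary instead.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] via fork() + execvp() and wait for it.
 * Returns the child's exit status (0-255), or -1 if the fork or wait
 * failed or the child was killed by a signal. */
int run_child(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid == -1) {
        /* Report the real reason, e.g. EAGAIN when a process limit is hit */
        fprintf(stderr, "simscan: fork %s error: %s\n",
                argv[0], strerror(errno));
        return -1;
    }

    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);              /* only reached if the exec failed */
    }

    /* Parent: wait for the child, retrying if interrupted by a signal */
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Unlike vfork(), the child here gets its own copy of the address space, so nothing it does between fork() and the exec can affect the parent. The child still calls _exit() rather than exit() so that it does not flush the parent's stdio buffers a second time.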
Still, I learnt more useful things in the journey, so it wasn’t completely wasted time, although I wish it didn’t take me 5+ hours to debug!
Refs:
https://www.securecoding.cert.org/confluence/pages/viewpage.action?pageId=1703954
Occasionally even in a well maintained system, qmail has issues.
One semi-common issue I get to see is when a server we send mail to doesn’t time out. This ties up an outgoing mail slot. Over a period of time, this can lead to the whole outgoing or incoming queue sitting idle, as every connection is tied up by ‘tarpitted’ connections.
Ideally qmail should be able to cope with these. There are settings in qmail to control how long a connection may take, and how long qmail should wait on it. These settings are covered in the following files (usually set in /var/qmail/control)