In a multitasking environment, there is often a requirement to synchronize the execution of various tasks or ensure one process has been completed before another begins.
This requirement is facilitated by the use of a software switch known as a Semaphore or a Flag.
The function of this is to work in much the same way a railway signal would; only allowing one train on the track at a time. A semaphore timeout is where the railway signal has been set in one state too long, maybe because the train has broken down.
Example of a Semaphore lock
An example of this in Notes/Domino is when the indexer needs to completely rebuild an index, it locks a semaphore so that other tasks cannot use the index until it is rebuilt. If a user task now tries to open that index while it is being rebuilt, it will have to wait for the indexer to finish the rebuild and then unlock the semaphore.
As a result, the user task is stuck until that semaphore is unlocked. While it is stuck waiting for the semaphore, it keeps track of how long it has been waiting. If it is stuck for more than 30 seconds, this is considered a semaphore timeout and in debug mode a message will be logged to the console. The task will continue to wait for the semaphore, timing out every 30 seconds, until the semaphore is unlocked or the task is ended. For most operations, a task might only wait for a few microseconds and hence not time-out. With a complicated view on a large database, the task may have to wait several minutes for the index semaphore.
If an important semaphore is locked by a task and is never unlocked, all tasks can be stopped waiting for that semaphore. This can happen in several different ways: the most common are where a task locks the semaphore and then crashes. This can also happen if a task locks the semaphore and then goes into an endless loop or it gets an error and forgets to unlock it.
Semaphore deadlock can occur when two tasks try to lock two different semaphores in a different order.
For example:
The Task A locks Semaphore 1 and then tries to lock Semaphore 2. In the meantime, Task B has already locked Semaphore 2 and is now trying to lock Semaphore 1. Task A is stuck waiting for Semaphore 2 and Task B is waiting for Semaphore 1 -- deadlock.
Reasons for semaphore timeouts
When you receive semaphore timeout messages, the messages are usually the result of one of the following:
- A heavy load on the server is causing processes to be delayed from releasing semaphores.
- A process has crashed while holding a semaphore, causing other processes to block when trying to acquire the semaphore.
- Deadly embrace, semaphore contention where two tasks are waiting on each other and neither task is able to break the loop. In the simplest case, thread A is trying to get a semaphore which is owned by B, while B is trying to get a different semaphore which is owned by A. More complex combinations are also possible: A wants a semaphore owned by B, who wants a semaphore owned by C, who wants a semaphore owned by A, etc.
- If a process was to fail to set a semaphore during execution, another process dependent on the semaphore will be blocked awaiting the semaphore.
Determining if a semaphore timeout has occurred
If an issue is a semaphore issue, it will be reported in two ways:
1. Sem.Timeouts
At the server console, you will see the Sem.Timeouts statistic. To view this statistic, type the following:
sh stat sem.timeouts
If the problem has occurred, you will see something similar to the following, depending upon the nature of the semaphore timeout
Sem.Timeouts = 430D:58 0A13:42 030B:28 0116:26 0A12:21
The statistic Sem.Timeouts will not appear in the Statrep if the customer is not experiencing semaphore timeouts.
The first number is the semaphore ID. The ID tells us what the semaphore is used for. For example, 0x030B is the collection semaphore used by NIF. The second number is a decimal number which shows the number of times the semaphore timeout occurred. It is not unusual on a heavily used server to get semaphore timeouts.
The following table lists some of the semaphores you may encounter:
Semaphore | Function |
0244 | NSF per-database semaphore |
410F | OS File system semaphore |
0266 | NSF per-database full-text semaphore |
4245 | NSF database opening semaphore |
030B | NIF collection semaphore |
430D | NAMELookup semaphore |
0A13 | Log commit semaphore |
0116 | console semaphore |
0A12 | Buffered log package semaphore |
030B | Collection semaphore |
4117 | Handle table free chain consistency semaphore |
0A0B | Session table semaphore |
4113 | Handle table movement semaphore |
33d5 | New to Domino 4.5, AdminP's semaphore for ACL modification |
4253 | internal control semaphore |
5708 | BSAFE semaphore (RSA encryption stuff) |
0255 | NSF B-Tree semaphore |
0294 | Directory Manager Queue semaphore |
1120 | Transfer queue lock semaphore |
0931 | Task sync semaphore |
0256 | NSF buffer pool container |
03A8 | DBCONT page zero semaphore |
2. Error: "Session semaphore held for [n] seconds"
The second way of determining if there is a semaphore issue, is when you see (at the server console) the following message:
- "Session semaphore held for [n] seconds."
This error does not print to the log file.
Troubleshooting Semaphore timeouts
Before enabling any debug parameters, discuss them with IBM Support beforehand. There may be issues surrounding their use, or special precautions that must be considered. These debug parameters might require a large amount of disk space, dependent upon when the server encounters problems; the longer the server stays up, the larger the debug files will be. These files can grow large enough to cause disk space shortages.
The following parameters can be added to the NOTES.INI file:
- Debug_Capture_Timeout=1
- Debug_Show_Timeout=1
See the following technote for additional information: "How to Turn on Semaphore Debugging Parameters in the NOTES.INI for Domino"
After adding these parameters, if semaphores are generated, they will be captured in a file called SEMDEBUG.TXT in the Domino program directory:
Output of SEMDEBUG.TXT:
- THREAD [02AB:0123] WAITING FOR FRWSEM 0x030B Collection semaphore (@0BE65A20)
- (R=0,W=1,WRITER=0080:015E,1STREADER=0000:0000) FOR 30000 ms
- THREAD [01FE:0220] WAITING FOR FRWSEM 0x0244 open database semaphore (@00ECBDD2)
- (R=0,W=1,WRITER=0080:015E,1STREADER=0000:0000) FOR 30000 ms
What does the output mean?
0x030B Indicates the type of semaphore.
[01FE:0220] The first number (01FE) indicates the process ID. This number is in hex and must be converted to decimal. This number will be completely different when the server is rebooted.
The second number (0220) is the thread ID.
Output of SEMDEBUG.TXT for Unix:
- THREAD [01676:00001] WAITING FOR RWSEM 0x412C (@EE100210) (R=0,W=1,WRITER=05067:00001,1STREADER=05067:00001) FOR 30000 ms
- THREAD [01684:00001] WAITING FOR RWSEM 0x412C (@EE100210) (R=0,W=1,WRITER=05067:00001,1STREADER=05067:00001) FOR 30000 ms
What does the output mean?
0x412C indicates the type of semaphore.
[01684:00001] The first number (01684) indicates the process ID. This number is the PID as indicated in the NSD output and will be completely different when the server is rebooted.
The second number (00001) is the thread ID.
NSD output:
- [1] 1684: /opt/lotus/notes/latest/sunspa/tmmscan <--- fatal thread
- ###### thread 1/4 :: tmmscan, pid=1684, lwp=1 ######
- [1] eed396a0 lwp_sema_p (228a30)
- [2] eed396a0 __lwp_sema_wait (228a30, 1d670, 0, 0, 0, 0) + 8
- [3] eefc779c _park (228990, 228a30, 0, 1, eefe6240, 0) + a0
- [4] eefc7554 _swtch (2289a0, 228b90, 228a10, 228a0c, 228a08, 228a04) + 2cc
- [5] eefc5f8c _cond_timedwait_cancel (efffdd90, efffdd78, efffdd70, 228990, eefe52b0, 0) + 1e4
- [6] eefd3420 _ti_sleep (1e, eefe52b0, eefe52b0, effff2e3, effff154, f8000600) + 100
- [7] ef0776e4 fatal_error (b, ef663dd4, ef657598, efffde50, 228a04, 2289e4) + 2c0
- [8] eefd2f20 __libthread_segvhdlr (b, efffe4b0, efffe1f8, efffe138, eefe52b0, 2289e4) + e0
- [9] eefd2334 sigacthandler (b, efffe4b0, 228990, eefe52b0, efffe1f8, eefd2e40) + 6e0
- [10] 000201d0 KC_ScanStart (efffeec0, 3ae18, efffeec0, efffeec4, efffeec8, efffeecc) + 13a0
- [11] 000157b0 AddInMain (22, 1, effff144, ef7ed2b8, ef7ec9c0, 0) + d00
- [12] 0002fb40 NotesMain (1, effff144, effff144, 0, 0, ef7c16e1) + 40
- [13] 0002fa60 notes_main (0, 0, 0, 1, effff144, 0) + a8
- [14] 000147ec _start (0, 0, 0, 0, 0, 0) + dc
From the Process Tree section:
- username status pid program
- notes06 R 1682 /opt/lotus/notes/latest/sunspa/http
- notes06 R 1684 /opt/lotus/notes/latest/sunspa/tmmscan
What other information should be collected?
For Windows NT:
1. A screenshot of the Windows NTs Task Manager, or the output of the PSTAT utility in the NT resource kit, redirected to a file. (PSTAT > somefile.txt)
It is important to collect a screen capture of Task Manager or the PSTAT output immediately when the semaphores begin to scroll on the Domino server console. When matching a process ID from SEMDEBUG.TXT to Task Manager, it is important to take the ID from the first semaphore error that is generated in the file. This is usually the offending process.
2. Issue a "Show Task Debug" command as soon as the semaphores begin to scroll on the Domino server console.
Example: sh task debug>debug.txt
For Unix:
1. Run NSD under root. Make sure "KillProcess=1" is removed from the NOTES.INI to prevent the server from killing the process when the server crashes, thus preventing mapping of PID and semdebug.txt.
It is important to collect an NSD immediately when the semaphores begin to scroll on the Domino server console. When matching a process ID from SEMDEBUG.TXT to NSD - Process Tree, it is important to take the ID from the first semaphore error that is generated in the file. This is usually the offending process.
2. Issue a "Show Task Debug" command as soon as the semaphores begin to scroll on the Domino server console.
Example: sh task debug>debug.txt
Please contact Notes Technical Support if you need further assistance.
How to utilize SEMDEBUG.TXT and Task Manager
Once the SEMDEBUG.TXT file is collected, it is the FIRST line that is the most important in determining the problem. This line usually contains the offending process. Specifically, we want to look at the OWNER process.
There is no time/date stamp in the file (unless the Debug_Capture_Timeout=10 parameter is used for a Domino 5.0.9 or later server).
The first number in brackets is the Process ID (PID). The second number (after the colon) is the thread ID.
The Process ID must be converted from hex to decimal.
In the excerpt below the OWNER is PID 009A. PID 009A converts to a decimal number of 154 (Process 0053 is waiting for process 009A to release the semaphore). When matched up to Figures 1 and 2 below (screenshots of NT Task Manager), it is apparent that Agent Manager is the offending process.
Sample SEMDEBUG.TXT
THREAD [0053:002F] WAITING FOR SEM 0x042D LSIWRAP shared semaphore (@01342C68) (OWNER=009A:0099) FOR 120 ms
THREAD [0045:0058] WAITING FOR SEM 0x042D LSIWRAP shared semaphore (@01342C68) (OWNER=009A:0099) FOR 120 ms
THREAD [00A4:00A3] WAITING FOR SEM 0x042D LSIWRAP shared semaphore (@01342C68) (OWNER=009A:0099) FOR 120 ms
THREAD [009E:009D] WAITING FOR SEM 0x042D LSIWRAP shared semaphore (@01342C68) (OWNER=0000:0000) FOR 120 ms
THREAD [0045:0058] WAITING FOR SEM 0x042D LSIWRAP shared semaphore (@01342C68) (OWNER=009E:009D) FOR 120 ms
THREAD [00A0:009F] WAITING FOR SEM 0x042D LSIWRAP shared semaphore (@01342C68) (OWNER=009E:009D) FOR 120 ms
THREAD [00A2:00A1] WAITING FOR SEM 0x042D LSIWRAP shared semaphore (@01342C68)