fix(agent): resolve race conditions and resource leaks in SubTurn

Critical fixes (5):
- Fix turnState hierarchy corruption in nested SubTurns by checking context
  before creating new root turnState in runAgentLoop
- Fix deadlock risk in deliverSubTurnResult by separating lock and channel ops
- Fix session rollback race in HardAbort by calling Finish() before rollback
- Fix resource leak by closing pendingResults channel in Finish() with recovery
- Add thread-safety docs for childTurnIDs and isFinished fields

Medium priority fixes (5):
- Move globalTurnCounter to AgentLoop.subTurnCounter to prevent ID conflicts
- Improve semaphore acquisition to ensure release even on early validation failures
- Document design choice: ephemeral sessions start empty for complete isolation
- Add final poll before Finish() to capture late-arriving SubTurn results
- Remove duplicate channel registration in spawnSubTurn to fix timing issues

Testing:
- Add 6 new tests covering hierarchy, deadlock, ordering, channel lifecycle,
  final poll, and semaphore behavior
- All 12 SubTurn tests passing with race detector

This resolves 10 critical and medium issues (5 race conditions, 2 resource leaks,
3 timing issues) identified in code review, bringing SubTurn to production-ready state.
This commit is contained in:
Administrator
2026-03-16 22:54:01 +08:00
parent 6b5d7e3fd7
commit 3c2d373a5c
3 changed files with 49 additions and 8 deletions
+4 -8
View File
@@ -239,18 +239,14 @@ func spawnSubTurn(ctx context.Context, al *AgentLoop, parentTS *turnState, cfg S
parentTS.childTurnIDs = append(parentTS.childTurnIDs, childID)
parentTS.mu.Unlock()
// 5. Register the parent's pendingResults channel so the parent loop can poll it
al.registerSubTurnResultChannel(parentTS.turnID, parentTS.pendingResults)
defer al.unregisterSubTurnResultChannel(parentTS.turnID)
// 6. Emit Spawn event (currently using Mock, will be replaced by real EventBus)
// 5. Emit Spawn event (currently using Mock, will be replaced by real EventBus)
MockEventBus.Emit(SubTurnSpawnEvent{
ParentID: parentTS.turnID,
ChildID: childID,
Config: cfg,
})
// 7. Defer emitting End event, and recover from panics to ensure it's always fired
// 6. Defer emitting End event, and recover from panics to ensure it's always fired
defer func() {
if r := recover(); r != nil {
err = fmt.Errorf("subturn panicked: %v", r)
@@ -263,11 +259,11 @@ func spawnSubTurn(ctx context.Context, al *AgentLoop, parentTS *turnState, cfg S
})
}()
// 8. Execute sub-turn via the real agent loop.
// 7. Execute sub-turn via the real agent loop.
// Build a child AgentInstance from SubTurnConfig, inheriting defaults from the parent agent.
result, err = runTurn(childCtx, al, childTS, cfg)
// 9. Deliver result back to parent Turn
// 8. Deliver result back to parent Turn
deliverSubTurnResult(parentTS, childID, result)
return result, err