Reinforcement Learning with a Bilevel World-Model Architecture for Scan-Order Optimisation in Laser Directed Energy Deposition

Scan-order design in laser directed energy deposition (LDED) is a delayed, path-dependent thermo-mechanical decision problem, because the quality of a sequence becomes observable only after the complete deposition and cooling cycle. This work formulates LDED scan-order optimisation as a finite-horizon, permutation-constrained reinforcement-learning problem and develops a bilevel AI workflow labelled by a finite-element teacher. A surrogate-assisted, teacher-guided optimisation loop learns the Abaqus-labelled response landscape and provides a tractable terminal-reward environment for policy training. A frozen Maskable Proximal Policy Optimization (MaskablePPO) policy is then used to generate legal scan-order candidates, which are independently validated through Abaqus thermo-mechanical simulations. The results show that policy generation offers bounded value that depends on the track count N, rather than record-level dominance over the mature surrogate-assisted optimiser. The strongest scan orders are obtained by the teacher-guided surrogate loop, whereas the MaskablePPO policy autonomously reaches competitive regions of the native response landscape, with stronger rank concentration at smaller track counts and a clear reliability boundary at longer horizons. The teacher-labelled landscape further supports a physically gated lexicographic reward hierarchy in which warpage admissibility is the primary constraint, plastic strain acts as a safety filter, and residual-stress-related improvement is pursued conditionally within the admissible region. Validated sequences also reveal an interpretable, scale-separated ordering tendency that combines global spatial dispersion with local structured grouping. This workflow provides a route from fixed scan-rule selection toward policy generation validated by a finite-element teacher, while preserving independent finite-element validation as the final physical gate.
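
The permutation constraint and the gated reward hierarchy described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names, the tier values (-2, -1, and the interval [0, 1]), and the threshold arguments are all hypothetical, and the Abaqus-labelled surrogate that would supply the response values is not reproduced. The sketch only shows how masking already-deposited tracks keeps every episode a legal scan order, and how a terminal reward can be ordered so that warpage admissibility dominates plastic strain, which in turn dominates residual-stress improvement.

```python
import random
from typing import List, Sequence


def legal_action_mask(n_tracks: int, deposited: Sequence[int]) -> List[bool]:
    """Mark tracks that have not been deposited yet as legal actions."""
    done = set(deposited)
    return [track not in done for track in range(n_tracks)]


def rollout_random_legal(n_tracks: int, rng: random.Random) -> List[int]:
    """Build one scan order by sampling only from the legal (unmasked) tracks.

    A trained masked policy would replace the uniform random choice here.
    """
    order: List[int] = []
    for _ in range(n_tracks):
        mask = legal_action_mask(n_tracks, order)
        legal = [track for track, ok in enumerate(mask) if ok]
        order.append(rng.choice(legal))
    return order


def gated_terminal_reward(
    warpage: float,
    plastic_strain: float,
    stress_improvement: float,
    warpage_limit: float,   # hypothetical admissibility threshold
    strain_limit: float,    # hypothetical safety threshold
) -> float:
    """Terminal reward with strictly ordered tiers (all values are assumptions).

    Tier 1: inadmissible warpage always scores lowest.
    Tier 2: admissible warpage but excessive plastic strain scores next lowest.
    Tier 3: otherwise the reward is the stress improvement clipped to [0, 1].
    """
    if warpage > warpage_limit:
        return -2.0
    if plastic_strain > strain_limit:
        return -1.0
    return min(1.0, max(0.0, stress_improvement))


if __name__ == "__main__":
    scan_order = rollout_random_legal(6, random.Random(0))
    print("legal scan order:", scan_order)
    print("reward:", gated_terminal_reward(0.5, 0.005, 0.3, 1.0, 0.01))
```

Because the tiers do not overlap, no amount of residual-stress improvement can compensate for a warpage or plastic-strain violation, which is the sense in which such a reward is lexicographic.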