In this paper, we present TriDet, a one-stage framework for temporal action detection. Existing methods often suffer from imprecise boundary predictions because action boundaries in videos are ambiguous. To alleviate this problem, we propose a novel Trident-head that models the action boundary via an estimated relative probability distribution around the boundary. In the feature pyramid of TriDet, we propose an efficient Scalable-Granularity Perception (SGP) layer to mitigate the rank-loss problem that self-attention exhibits on video features and to aggregate information across different temporal granularities. Benefiting from the Trident-head and the SGP-based feature pyramid, TriDet achieves state-of-the-art performance on three challenging benchmarks (THUMOS14, HACS and EPIC-KITCHENS-100) at lower computational cost than previous methods. For example, TriDet reaches an average mAP of $69.3\%$ on THUMOS14, outperforming the previous best by $2.5\%$ with only $74.6\%$ of its latency. The code is available at https://github.com/sssste/TriDet.
TriDet: Temporal Action Detection with Relative Boundary Modeling
TriDet, a one-stage framework for temporal action detection, uses a Trident-head and Scalable-Granularity Perception layer to improve boundary prediction accuracy and reduce computational costs.
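The abstract describes the Trident-head as modeling a boundary through an estimated relative probability distribution around it, without giving further detail. The following is a minimal sketch of one way such an idea can be read, not the authors' implementation: it assumes the distribution is a softmax over a small number of discrete offset bins and that the boundary offset is taken as the expectation of that distribution. The function names (`softmax`, `expected_boundary_offset`, `estimate_start`) and the bin layout are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_boundary_offset(bin_logits):
    """Expected offset, in bins, under a relative probability distribution.

    Assumption: bin i means "the boundary lies i steps away from the
    current instant"; the logits are only compared with each other,
    so the resulting distribution is relative to this neighbourhood.
    """
    probs = softmax(bin_logits)
    return sum(i * p for i, p in enumerate(probs))


def estimate_start(t, bin_logits):
    """Hypothetical helper: start boundary = instant minus expected offset."""
    return t - expected_boundary_offset(bin_logits)


if __name__ == "__main__":
    # A distribution peaked at bin 2 places the start about 2 steps before t.
    print(estimate_start(10, [0.0, 1.0, 4.0, 1.0, 0.0]))
```

Taking an expectation over bins rather than regressing a single scalar lets an ambiguous boundary be represented as a spread-out distribution, which is one plausible motivation for the design the abstract outlines.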
- Year: 2023
- Venue: CVPR 2023
- Authors: 6
- Hosting: Abstract only
Attribution
- Abstract & full text: arxiv.org/abs/2303.07347v2
- TL;DR: Semantic Scholar