0

Patent Representation Learning via Self-supervision

We study self-supervised patent representation learning with contrastive objectives. A standard baseline constructs positives by encoding the same text twice under independent dropout masks, but applying this recipe to long, structured patent documents requires careful…

Preview
Year
2025
Hosting
Abstract onlyARXIV-DEFAULT

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2511.10657ARXIV-DEFAULT
TL;DR
Semantic Scholar
Attribution policy →

Abstract

We study self-supervised patent representation learning with contrastive objectives. A standard baseline constructs positives by encoding the same text twice under independent dropout masks, but applying this recipe to long, structured patent documents requires careful calibration. We show that dropout-only training can be substantially strengthened by tuning temperature and dropout rate, yet its best configuration is evaluation-dependent and does not transfer uniformly from title--abstract retrieval to claim-to-disclosure retrieval. We propose mixed dropout--section positives, a patent-specific view construction strategy in which the anchor is the title--abstract view and the positive is sampled either from a dropout re-encoding of the same view or from another section of the same patent, such as claims, summary, background, drawings, or description. This uses patent-internal structure as a training-time signal without IPC labels, citations, or relevance annotations. We evaluate on graded EPO search-report retrieval, DAPFAM, a recently proposed family-level patent retrieval benchmark, and IPC subclass classification. Section-based positives improve over calibrated dropout-only and generic title--abstract augmentation baselines, are competitive with citation-informed patent encoders and a general-purpose embedding model, and perform strongly on the out-of-domain split of DAPFAM. Additional cross-section alignment diagnostics show that section-pair training improves compatibility among abstracts, claims, and descriptions of the same invention. These results indicate that patent sections provide effective self-supervised positive views for learning dense patent representations.