I've been fiddling with content extraction using @mozilla/readability. As a data source, what better candidate than this very website? So I made some revisions.
The basic question is "what components must my webpage have in order to trigger reader mode?" Surely, this is standardized.
Well... extremely no, it seems. Having done some reading, the best I've come up
with is that adding schema attributes won't hurt anything, but the ways in
which browser reader modes parse content are highly eccentric. For instance, the
following qualifies for reader mode in Firefox, but @mozilla/readability
fails to extract a publishedTime:
<article itemscope itemtype="https://schema.org/Article">
  <h1 class="post-title" itemprop="headline">Reader modes are insane</h1>
  <time datetime="2024-10-25" itemprop="datePublished">
    2024-10-25
  </time>
  <section class="post-content" itemprop="articleBody">
    <p>...</p>
  </section>
</article>
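The date is plainly present in the markup above, so one workaround is to read the microdata directly whenever the extractor returns no publishedTime. The following is a minimal, dependency-free sketch of that idea, not anything Readability itself does. The names `parseAttributes`, `extractDatePublished`, and `sample` are made up for illustration, and the regex scan assumes simple, quoted attributes with no `>` inside their values.

```typescript
// Hypothetical fallback: pull a schema.org `datePublished` value out of raw HTML.
// A regex scan is enough for a sketch; a real pipeline would walk the DOM instead.

/** Collects the quoted attributes of one opening tag, keyed by lowercase name. */
function parseAttributes(tag: string): Record<string, string> {
  const attrs: Record<string, string> = {};
  const attrPattern = /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
  let match: RegExpExecArray | null;
  while ((match = attrPattern.exec(tag)) !== null) {
    // Group 2 holds a double-quoted value, group 3 a single-quoted one.
    attrs[match[1].toLowerCase()] = match[2] ?? match[3] ?? "";
  }
  return attrs;
}

/**
 * Returns the `datetime` of the first <time> element whose `itemprop`
 * includes "datePublished", or null when no such element exists.
 */
function extractDatePublished(html: string): string | null {
  const timeTagPattern = /<time\b[^>]*>/gi;
  let tag: RegExpExecArray | null;
  while ((tag = timeTagPattern.exec(html)) !== null) {
    const attrs = parseAttributes(tag[0]);
    // `itemprop` may carry several space-separated property names.
    const itemprops = (attrs["itemprop"] ?? "").split(/\s+/);
    if (itemprops.indexOf("datePublished") !== -1 && attrs["datetime"]) {
      return attrs["datetime"];
    }
  }
  return null;
}

// The snippet from this post, trimmed to the relevant elements.
const sample = `
<article itemscope itemtype="https://schema.org/Article">
  <h1 class="post-title" itemprop="headline">Reader modes are insane</h1>
  <time datetime="2024-10-25" itemprop="datePublished">
    2024-10-25
  </time>
</article>`;

console.log(extractDatePublished(sample)); // prints 2024-10-25
```

Reading the `datetime` attribute instead of the element's text avoids having to trim the whitespace around the visible date.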