I've been fiddling with content extraction using @mozilla/readability. What better data source to test it on than this very website? So I made some revisions to the markup here.

The basic question is "what components must my webpage have in order to trigger reader mode?" Surely this is standardized.

Well... extremely no, it seems. Having done some reading, the best I've come up with is that adding schema attributes won't hurt anything, but the way each browser's reader mode parses content is highly eccentric. For instance, the following markup qualifies for reader mode in Firefox, yet @mozilla/readability fails to extract a publishedTime from it:

<article itemscope itemtype="https://schema.org/Article">
  <h1 class="post-title" itemprop="headline">Reader modes are insane</h1>

  <time datetime="2024-10-25" itemprop="datePublished">
    2024-10-25
  </time>

  <section class="post-content" itemprop="articleBody">
    <p>...</p>
  </section>
</article>
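
One plausible explanation, stated as an assumption and not something the markup above proves: Readability seems to take publishedTime from page-level metadata (a JSON-LD `datePublished` field or a `<meta property="article:published_time">` tag) and not from microdata attributes on a `<time>` element. The sketch below does not call Readability at all. It is a small standard-library Python audit that reports which of those three date signals a page exposes, so you can see at a glance what an extractor has to work with. The names `DateSignalAudit`, `audit_date_signals`, and `POST` are hypothetical, made up for this example.

```python
import json
from html.parser import HTMLParser


class DateSignalAudit(HTMLParser):
    """Collects the places a page can declare its publication date."""

    def __init__(self):
        super().__init__()
        self.signals = {}          # signal name -> date string
        self._in_jsonld = False    # True while inside a JSON-LD <script>
        self._jsonld_chunks = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        itemprops = (a.get("itemprop") or "").split()
        if tag == "time" and "datePublished" in itemprops:
            # Microdata, as used in the post's markup.
            self.signals["microdata"] = a.get("datetime")
        elif tag == "meta" and a.get("property") == "article:published_time":
            # Open Graph style meta tag.
            self.signals["meta"] = a.get("content")
        elif tag == "script" and a.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._jsonld_chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._jsonld_chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                doc = json.loads("".join(self._jsonld_chunks))
            except json.JSONDecodeError:
                return  # malformed JSON-LD: report nothing for it
            # Only a top-level object is inspected; arrays and @graph are ignored.
            if isinstance(doc, dict) and isinstance(doc.get("datePublished"), str):
                self.signals["json-ld"] = doc["datePublished"]


def audit_date_signals(html):
    """Return a dict of the date signals found in an HTML string."""
    parser = DateSignalAudit()
    parser.feed(html)
    parser.close()
    return parser.signals


POST = """<article itemscope itemtype="https://schema.org/Article">
  <h1 class="post-title" itemprop="headline">Reader modes are insane</h1>
  <time datetime="2024-10-25" itemprop="datePublished">2024-10-25</time>
  <section class="post-content" itemprop="articleBody"><p>...</p></section>
</article>"""

print(audit_date_signals(POST))  # {'microdata': '2024-10-25'}
```

For the markup in this post, only the microdata signal shows up. If the assumption above holds, that would be consistent with Firefox accepting the page while @mozilla/readability returns no publishedTime, and adding one of the other two signals would be the thing to try next.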