feat(seo): static sitemap.xml with git-based lastmod #222

Open
dcrawbuck wants to merge 2 commits into main from dcrawbuck/raleigh-v2

@dcrawbuck dcrawbuck commented Jun 25, 2026

Collaborator

What & why

/docs/sitemap.xml was generated at request time on the Cloudflare Worker and stamped
every URL with new Date().toISOString(). That tells Google "every page changed just now"
on every crawl, so Google learns to ignore our <lastmod> entirely (it only trusts the field
when it's consistently accurate).

This replaces that with accurate, build-time <lastmod> derived from git history, served
as a static asset. Verified end-to-end on a Cloudflare preview deploy: 592 URLs, all dated.

Approach

Cloudflare Workers have no filesystem / git at request time, so dates are resolved during
bun run build (Node, full repo) — same pattern as the existing generate-static-cache /
generate-search-index post-build scripts. The site has no runtime content source (MDX is
compiled into the bundle), so the page set is fixed at build time and a live route buys
nothing. The generated dist/client/docs/sitemap.xml is served at /docs/sitemap.xml — the
same delivery path proven by search-index.json (confirmed in preview: HTTP 200,
application/xml).

How dates are computed

  • One git log --no-merges --name-only --pretty=format:…%cs pass builds a
    file → latest-commit-date (YYYY-MM-DD) map (1 subprocess, not ~590).
  • URL → source file mapping:
    • Content pages → content/docs/<page.path>.
    • <include> dependencies are resolved transitively — 107 pages render shared bodies
      from content/shared/**, so an edit to a shared file bumps every page that includes it.
    • Component data dependencies: /docs/changelog renders <ChangelogTimeline/>, which
      imports the committed src/lib/changelog-entries.json; that file is added as a
      supplemental source so changelog regenerations bump the page's date.
    • /docs → src/routes/index.tsx; /home (301→dashboard) → dashboard content.
    • SDK landing pages (/ios, /android, …) inherit their content source and only get a
      priority bump (single source of truth).
  • Each entry's date = most recent commit among its source files. Unknown → <lastmod> omitted
    (never falls back to new Date()).
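
The date computation above can be sketched roughly as follows. This is a minimal illustration, not the PR's generator: the PR elides its actual --pretty format string, so the `@@` commit marker here is an assumption, and `parseGitLog` and `resolveLastModified` are hypothetical names.

```typescript
// Hypothetical marker prefixing each commit's date line. The real
// --pretty format string is elided in the PR, so this is an assumption.
const COMMIT_MARKER = "@@";

// Turn `git log --name-only` output into a file -> latest date map.
// Dates are YYYY-MM-DD, so plain string comparison orders them correctly.
function parseGitLog(output: string): Map<string, string> {
  const latest = new Map<string, string>();
  let currentDate: string | undefined;
  for (const rawLine of output.split("\n")) {
    const line = rawLine.trim();
    if (line === "") continue;
    if (line.startsWith(COMMIT_MARKER)) {
      currentDate = line.slice(COMMIT_MARKER.length);
      continue;
    }
    if (currentDate === undefined) continue; // file line before any commit header
    const previous = latest.get(line);
    if (previous === undefined || currentDate > previous) {
      latest.set(line, currentDate);
    }
  }
  return latest;
}

// Most recent date among a page's source files, or undefined when none
// of them has a known date (the caller then omits <lastmod>).
function resolveLastModified(
  sourcePaths: string[],
  dates: Map<string, string>,
): string | undefined {
  let best: string | undefined;
  for (const sourcePath of sourcePaths) {
    const date = dates.get(sourcePath);
    if (date !== undefined && (best === undefined || date > best)) best = date;
  }
  return best;
}
```

Returning undefined for unknown files, rather than a default date, is what lets the renderer drop <lastmod> instead of inventing one.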

Robustness — shallow clones self-heal

Deploy environments (Cloudflare Workers Builds) shallow-clone with no fetch-depth setting,
which would otherwise leave every page date-less. The generator detects a shallow clone and
deepens it with git fetch --unshallow (anonymous — the repo is public). Verified in the
Cloudflare build log: ✓ Fetched full git history → 592 urls (592 with <lastmod>). If history
still can't be obtained, it omits <lastmod> rather than publish a wrong date, and never
fails the build (git errors degrade gracefully).
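
A minimal sketch of the self-heal flow described above, assuming git is reached through an injected synchronous runner that throws on failure (for example a thin wrapper around Node's `execFileSync`). `GitRunner` and this `ensureFullHistory` signature are illustrative assumptions, not the PR's code.

```typescript
// Hypothetical abstraction: runs `git <args>` and returns stdout,
// throwing if git is missing or exits non-zero.
type GitRunner = (args: string[]) => string;

// Returns true only when full history is confirmed to be available.
// Any failure yields false so the caller can omit <lastmod>.
function ensureFullHistory(git: GitRunner): boolean {
  try {
    const isShallow = git(["rev-parse", "--is-shallow-repository"]).trim();
    if (isShallow === "false") return true; // already a full clone
    if (isShallow !== "true") return false; // unexpected output: do not trust
    git(["fetch", "--unshallow"]);
    // Re-check rather than assume the fetch actually deepened the clone.
    return git(["rev-parse", "--is-shallow-repository"]).trim() === "false";
  } catch {
    return false; // degrade gracefully; never fail the build
  }
}
```

Checking shallowness first matters because `git fetch --unshallow` reports an error when run on a repository that is already complete.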

Changes

  • src/lib/sitemap.ts — pure, worker-safe: getSitemapSourceEntries (dedupe + priority
    merge), attachLastModified (date resolution injected by caller), optional <lastmod>.
  • scripts/generate-sitemap.ts — new build-time generator (git dates, include + component
    data resolution, shallow self-heal, graceful degradation), wired into build.
  • Deleted runtime route src/routes/sitemap[.]xml.ts (+ regenerated routeTree.gen.ts).
  • src/lib/seo-routes.test.ts — updated for the new API.
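
To illustrate the "dedupe + priority merge" and "optional <lastmod>" behaviour listed above, here is a small sketch. The `SitemapEntry` shape and the `mergeEntries` and `renderSitemapXml` helpers are hypothetical stand-ins; the real signatures in src/lib/sitemap.ts are not shown in this PR page.

```typescript
interface SitemapEntry {
  url: string;
  priority?: number;
  lastModified?: string; // YYYY-MM-DD, omitted when unknown
}

// Collapse duplicate URLs, keeping the highest priority and newest date.
function mergeEntries(entries: SitemapEntry[]): SitemapEntry[] {
  const byUrl = new Map<string, SitemapEntry>();
  for (const entry of entries) {
    const existing = byUrl.get(entry.url);
    if (existing === undefined) {
      byUrl.set(entry.url, { ...entry });
      continue;
    }
    if (entry.priority !== undefined &&
        (existing.priority === undefined || entry.priority > existing.priority)) {
      existing.priority = entry.priority;
    }
    if (entry.lastModified !== undefined &&
        (existing.lastModified === undefined || entry.lastModified > existing.lastModified)) {
      existing.lastModified = entry.lastModified;
    }
  }
  return Array.from(byUrl.values());
}

function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Emit <lastmod> only when a date is known; never substitute "now".
function renderSitemapXml(entries: SitemapEntry[]): string {
  const urls = entries.map((entry) => {
    const parts = [`    <loc>${escapeXml(entry.url)}</loc>`];
    if (entry.lastModified !== undefined) {
      parts.push(`    <lastmod>${entry.lastModified}</lastmod>`);
    }
    if (entry.priority !== undefined) {
      parts.push(`    <priority>${entry.priority.toFixed(1)}</priority>`);
    }
    return `  <url>\n${parts.join("\n")}\n  </url>`;
  });
  return (
    `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
    `${urls.join("\n")}\n</urlset>\n`
  );
}
```

Both helpers are pure (no filesystem, git, or clock access), which is the property that keeps this kind of module usable in a Worker and easy to unit test.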

Testing

  • bun test — 69 pass.
  • Cloudflare preview build: green; build log shows shallow→unshallow→592 dated URLs→deployed.
  • Local regen: 592 URLs, 592 <lastmod>, valid XML (xmllint); /docs/changelog correctly
    reflects max(wrapper, changelog JSON); include resolution verified.

Note on current dates

~586 of ~590 pages currently share 2026-06-23 because of recent bulk commits (#218/#219).
That's accurate git history; dates diverge naturally as pages are edited individually.

Notes

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a400d681f1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

const pages = source.getPages() as Array<{ url: string; path: string }>;
const contentPages = pages.map((page) => ({
  url: page.url,
  sourcePaths: [`content/docs/${page.path}`],

P2: Track changelog data as a sitemap source

When /docs/changelog changes because src/lib/changelog-entries.json is regenerated or committed, this mapping still dates the page only from content/docs/changelog/index.mdx. That MDX renders <ChangelogTimeline /> (content/docs/changelog/index.mdx:92), and the component imports the JSON data (src/components/ChangelogTimeline.tsx:2), so changelog updates can ship with a stale <lastmod> until the wrapper MDX file is touched. Add the generated changelog JSON (or other supplemental component data) to the source paths for that page before resolving lastmod.
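
One way to read this suggestion is as a transitive expansion of each page's source list. The sketch below is an illustration under stated assumptions: `expandSources` and `ReadFile` are hypothetical names, and include targets are assumed to already be repo-relative paths, which may not match how the site actually resolves <include>.

```typescript
// Hypothetical reader: returns file text, or undefined if unreadable.
type ReadFile = (path: string) => string | undefined;

// Collect a page's own file plus everything it depends on:
// supplemental data files and transitively included shared bodies.
function expandSources(
  entryPath: string,
  readFile: ReadFile,
  supplemental: Record<string, string[]>,
): string[] {
  const seen = new Set<string>();
  const pending = [entryPath];
  while (pending.length > 0) {
    const current = pending.pop() as string;
    if (seen.has(current)) continue; // also guards against include cycles
    seen.add(current);
    for (const extra of supplemental[current] ?? []) pending.push(extra);
    const text = readFile(current);
    if (text === undefined) continue;
    // Assumption: targets inside <include> are repo-relative paths.
    const includePattern = /<include[^>]*>([^<]+)<\/include>/g;
    let match: RegExpExecArray | null;
    while ((match = includePattern.exec(text)) !== null) {
      const target = (match[1] ?? "").trim();
      if (target !== "") pending.push(target);
    }
  }
  return Array.from(seen);
}
```

With a supplemental map like `{ "content/docs/changelog/index.mdx": ["src/lib/changelog-entries.json"] }`, the changelog page's date would then be taken from whichever of the two files changed most recently.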

cloudflare-workers-and-pages Bot commented Jun 25, 2026

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status: ✅ Deployment successful!
Name: superwall-docs
Latest Commit: 622c80a
Updated (UTC): Jun 26 2026, 12:21 AM

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d83f1de65f

Comment on lines +48 to +52
const SUPPLEMENTAL_SOURCES: Record<string, string[]> = {
  // /docs/changelog renders <ChangelogTimeline/>, which imports this generated,
  // committed JSON — changelog regenerations don't touch the wrapper MDX.
  "content/docs/changelog/index.mdx": ["src/lib/changelog-entries.json"],
};

P2: Track page-tree list dependencies in lastmod sources

When a support article is added, removed, or renamed under a folder rendered by <SupportFolderList> (for example content/docs/support/faq/index.mdx:93), that folder index page changes because the component renders links from the page tree (src/components/SupportFolderList.tsx:23-35). The sitemap generator only expands raw <include>s plus this one supplemental JSON mapping, so those child pages/meta files never contribute to /docs/support/faq (and the other support folder indexes) and their <lastmod> can remain stale even though the rendered page changed. Add page-tree/child dependencies for these list pages before resolving lastmod.
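
A rough sketch of what such child dependencies could look like. Everything here is an assumption for illustration: `folderListSources` is a hypothetical helper, only direct children are counted, and the folder's meta file is assumed to be named meta.json.

```typescript
// Given a folder index page (path relative to content/docs) and the
// list of all page paths, return the files whose changes should bump
// the index page's date.
function folderListSources(indexPath: string, allPagePaths: string[]): string[] {
  // "support/faq/index.mdx" -> "support/faq/"
  const folder = indexPath.slice(0, indexPath.lastIndexOf("/") + 1);
  const children = allPagePaths.filter(
    (pagePath) =>
      pagePath !== indexPath &&
      pagePath.startsWith(folder) &&
      // Assumption: only direct children are listed, not nested folders.
      pagePath.slice(folder.length).indexOf("/") === -1,
  );
  // Assumption: the folder's ordering/title metadata lives in meta.json.
  return [indexPath, `${folder}meta.json`, ...children].map(
    (pagePath) => `content/docs/${pagePath}`,
  );
}
```

Note that a page removed from the folder no longer appears in the page list, so with this approach a deletion would only register if the folder's meta file changed in the same commit.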

const pages = source.getPages() as Array<{ url: string; path: string }>;
const contentPages = pages.map((page) => ({
  url: page.url,
  sourcePaths: [`content/docs/${page.path}`],

P2: Include referenced images in lastmod sources

When a screenshot or other docs image changes without touching the MDX, the rendered page changes but the sitemap date does not: for example content/docs/dashboard/paywalls.mdx:93 renders /images/docs-paywalls-overview.png, which is copied from content/docs/images, yet each page starts with only its MDX path and the expander only follows <include>/supplemental files. Image-only documentation updates will now publish stale <lastmod> values; add referenced image files to each page's source paths before resolving the git date.
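
A possible shape for this, sketched under assumptions: `imageSources` is a hypothetical helper, only Markdown image syntax and src attributes are scanned, and every /images/... URL is assumed to map to the same file name under content/docs/images, as in the example above.

```typescript
// Extract image files referenced by an MDX document and map them to
// their repo source paths, e.g.
// /images/docs-paywalls-overview.png -> content/docs/images/docs-paywalls-overview.png
function imageSources(mdx: string): string[] {
  const found = new Set<string>();
  const patterns = [
    /!\[[^\]]*\]\((\/images\/[^)\s]+)/g, // ![alt](/images/x.png "title")
    /src=["'](\/images\/[^"']+)["']/g,   // <img src="/images/x.png" />
  ];
  for (const pattern of patterns) {
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(mdx)) !== null) {
      const url = match[1];
      if (url) found.add(`content/docs${url}`);
    }
  }
  return Array.from(found).sort();
}
```

The resulting paths could be appended to a page's source paths before the git date lookup, so an image-only commit would move that page's <lastmod>.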

Replace the request-time sitemap route (which stamped every URL with new
Date() on each crawl, training Google to ignore <lastmod>) with a
build-time static sitemap whose <lastmod> comes from real git history.

- New scripts/generate-sitemap.ts (runs in the build chain): one git log
  pass for per-file dates, resolves <include> deps into content/shared so
  shared edits bump the right pages, and serves dist/client/docs/sitemap.xml.
- Shallow clones omit <lastmod> rather than publish one wrong date; git
  failures degrade gracefully instead of breaking the build.
- src/lib/sitemap.ts refactored to pure, testable, worker-safe helpers.
- Remove runtime route src/routes/sitemap[.]xml.ts (regenerates routeTree).

Two follow-ups from PR review on the sitemap generator:

- Deploy environments (Cloudflare Workers Builds) shallow-clone with no
  fetch-depth setting, which left the deployed sitemap with no <lastmod>.
  Detect a shallow clone and deepen it with 'git fetch --unshallow'
  (anonymous; the repo is public). Falls back to omitting <lastmod> if
  history still can't be obtained — never fails the build.
- /docs/changelog renders <ChangelogTimeline/>, which imports the committed
  src/lib/changelog-entries.json. Add that data file as a supplemental source
  so changelog regenerations bump the page's date.
@dcrawbuck dcrawbuck force-pushed the dcrawbuck/raleigh-v2 branch from d83f1de to 622c80a Compare June 26, 2026 00:12

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 622c80a. Configure here.

    return stdout.trim() === "true";
  } catch {
    return false;
  }

Shallow check errors publish dates

Medium Severity

When git rev-parse --is-shallow-repository fails, isShallowRepository treats the repo as non-shallow, so ensureFullHistory skips deepening and still runs buildGitDateMap. On a shallow or truncated history, that can emit clustered, misleading <lastmod> values instead of omitting them as intended.
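
One way to address this is to make the shallow check three-valued, so that a failed check is never mistaken for "not shallow". The sketch below is illustrative only: `ShallowState`, `parseShallowState`, `readShallowState`, and `canTrustDates` are hypothetical names, and the git call is injected as a function that returns stdout or throws.

```typescript
type ShallowState = "shallow" | "full" | "unknown";

// Interpret the stdout of `git rev-parse --is-shallow-repository`.
function parseShallowState(stdout: string): ShallowState {
  const value = stdout.trim();
  if (value === "true") return "shallow";
  if (value === "false") return "full";
  return "unknown";
}

// A failed check is "unknown", not "full".
function readShallowState(runCheck: () => string): ShallowState {
  try {
    return parseShallowState(runCheck());
  } catch {
    return "unknown";
  }
}

// Dates are published only when history is known to be complete:
// either it was full from the start, or a shallow clone was
// successfully deepened. "unknown" always omits <lastmod>.
function canTrustDates(initial: ShallowState, deepened: boolean): boolean {
  return initial === "full" || (initial === "shallow" && deepened);
}
```

With this split, an error from the check leads to omitted <lastmod> values rather than dates computed from possibly truncated history.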
