Extract Knowledge Graph carousel (Van Gogh paintings) — Jason Dinsmore#393

Open: dinjas wants to merge 17 commits into serpapi:master from dinjas:van-gogh-paintings
@dinjas dinjas commented Jun 23, 2026

Summary

Extracts the Knowledge Graph carousel from the saved results page into an array of { name, extensions, link, image } objects, reproducing files/expected-array.json exactly. The same code path handles other entity carousels (albums, buildings, cast) — only the locator changes per type.

Full write-up (approach, decisions, tradeoffs) is in the README.
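As a rough illustration of the object shape named above, here is a minimal sketch of how one tile could be turned into a result hash. The method name `build_item`, the `GOOGLE_ORIGIN` constant, the choice to omit `extensions` when a tile has no second line, and the relative-to-absolute link handling are all assumptions made for this example; the PR's actual code may differ.

```ruby
# Hypothetical constant: assumes tile hrefs are relative to the Google origin.
GOOGLE_ORIGIN = 'https://www.google.com'

# Builds one result hash from already-extracted tile fields.
# Assumption: extensions is left out entirely when there is no second line.
def build_item(name:, href:, image:, second_line: nil)
  item = { 'name' => name }
  item['extensions'] = [second_line] if second_line && !second_line.empty?
  item['link'] = href.start_with?('/') ? GOOGLE_ORIGIN + href : href
  item['image'] = image
  item
end
```

Passing the second line through untouched keeps the builder independent of what that line means for a given carousel type.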

Checks the boxes

  • ✅ name, extensions, and Google link extracted - output matches expected-array.json
  • ✅ Thumbnails pulled from the page itself: inline base64 from _setImagesSrc scripts for the first tiles, in-page data-src for the rest - no extra HTTP requests
  • ✅ Tested against other layouts (Grateful Dead albums, Frank Lloyd Wright buildings, Breaking Bad cast) plus negative pages that must return []
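The thumbnail lookup described in the second item can be sketched roughly as follows. This assumes each inline script has the shape `var s='<data URI>';var ii=['<img id>'];_setImagesSrc(ii,s);` and that the data URI may contain JavaScript `\xNN` escapes; both are assumptions about the saved page, and the regex, `inline_images`, and `thumbnail` are hypothetical names rather than the PR's code. It uses plain string scanning to stay dependency-free, where the real extractor presumably works on a parsed document.

```ruby
# Assumed script shape: var s='<data URI>'; var ii=['<img id>']; _setImagesSrc(ii,s);
INLINE_IMAGE = /var s='([^']+)';\s*var ii=\[([^\]]*)\];\s*_setImagesSrc\(ii,\s*s\)/

# Returns a hash of image id => data URI for every inline script that matches.
def inline_images(html)
  html.scan(INLINE_IMAGE).each_with_object({}) do |(src, ids), map|
    # Undo JavaScript \xNN escapes (for example \x3d for "=").
    decoded = src.gsub(/\\x(\h{2})/) { Regexp.last_match(1).hex.chr }
    ids.scan(/'([^']+)'/).flatten.each { |id| map[id] = decoded }
  end
end

# Prefers the inline base64 image; falls back to the tile's in-page data-src.
def thumbnail(img_id, data_src, inline)
  inline[img_id] || data_src
end
```

Both sources are already in the saved HTML, so the lookup never touches the network.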

Running it

bundle install
bundle exec rspec
bundle exec rubocop

Run against any saved page:

ruby -Ilib -rcarousel_extractor -rjson -e 'puts JSON.pretty_generate(CarouselExtractor.call(File.read(ARGV[0])))' files/van-gogh-paintings.html

A few decisions (more in README)

  • Locate by stable data-attrid rather than minified classes or per-request ids, which rotate. The locator matches only allowlisted Knowledge Graph types instead of an "any carousel-shaped container" heuristic, which can produce false positives on non-entity modules such as Unilever's social-media panel.
  • extensions kept generic: only the date's meaning is known (from the paintings fixture), so each tile's second line passes through as-is.
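To make the allowlist decision concrete, here is a minimal sketch of locating by `data-attrid`. The attrid values in `ALLOWED_ATTRIDS` are placeholders invented for this example (the real values live in the PR's code), and `carousel_attrids` is a hypothetical helper that scans raw HTML with a regex instead of a parsed document.

```ruby
# Placeholder allowlist: these attrid strings are illustrative, not real values.
ALLOWED_ATTRIDS = [
  'kc:/example/artist:works',
  'kc:/example/band:albums'
].freeze

# Returns the allowlisted data-attrid values found in the HTML, in page order.
# Anything not on the list is ignored, however carousel-like it looks.
def carousel_attrids(html)
  html.scan(/data-attrid="([^"]+)"/).flatten.select { |id| ALLOWED_ATTRIDS.include?(id) }
end
```

Under this scheme a page with no allowlisted attrid yields an empty list, which fits the negative fixtures that must return [], and supporting a new carousel type means adding one string to the list.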

Scope

Fixtures are en/us desktop captures. Adding a new carousel type is a one-line locator change (adding a data-attrid to the list), since locating the carousel and extracting a tile are separate concerns. I didn't pursue other locales/mobile layouts, tile dedupe, or "view more" pagination. The last would require making extra HTTP requests, which the requirements rule out.
