Extract Knowledge Graph carousel (Van Gogh paintings) — Jason Dinsmore by dinjas · Pull Request #393 · serpapi/code-challenge

dinjas · 2026-06-23T21:46:12Z

Summary

Extracts the Knowledge Graph carousel from the saved results page into an array of { name, extensions, link, image } objects, reproducing files/expected-array.json exactly. The same code path handles other entity carousels (albums, buildings, cast) — only the locator changes per type.

Full write-up (approach, decisions, tradeoffs) is in the README.

Checks the boxes

✅ name, extensions, and Google link extracted - output matches expected-array.json
✅ Thumbnails pulled from the page itself: inline base64 from _setImagesSrc scripts for the first tiles, in-page data-src for the rest - no extra HTTP requests
✅ Tested against other layouts (Grateful Dead albums, Frank Lloyd Wright buildings, Breaking Bad cast) plus negative pages that must return []
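
The thumbnail lookup described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the script shape matched by INLINE_IMAGE_PATTERN, the \xNN escaping, and the helper names inline_images and thumbnail_for are all assumptions.

```ruby
# Hypothetical sketch: resolve a tile's thumbnail without any HTTP request.
# Assumed script shape (not confirmed by this PR):
#   var s='data:image/...';var ii=['some_id'];_setImagesSrc(ii,s);
INLINE_IMAGE_PATTERN =
  /var s='(data:image[^']+)';\s*var ii=\[([^\]]*)\];\s*_setImagesSrc\(ii,\s*s\)/

# Build a { image id => data URI } map from the raw HTML.
def inline_images(html)
  html.scan(INLINE_IMAGE_PATTERN).each_with_object({}) do |(data_uri, ids), map|
    # Assumption: inline scripts escape some characters as \xNN ("=" as \x3d).
    decoded = data_uri.gsub(/\\x(\h\h)/) { Regexp.last_match(1).hex.chr }
    ids.scan(/'([^']+)'/).flatten.each { |id| map[id] = decoded }
  end
end

# Prefer the inline base64 image for this tile's id, else fall back to data-src.
def thumbnail_for(img_id, data_src, images)
  images[img_id] || data_src
end
```

With a map built once per page, each tile needs only a hash lookup, and tiles without an inline script fall through to whatever data-src the markup carries.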

Running it

bundle install
bundle exec rspec
bundle exec rubocop

Run against any saved page:

ruby -Ilib -rcarousel_extractor -rjson -e 'puts JSON.pretty_generate(CarouselExtractor.call(File.read(ARGV[0])))' files/van-gogh-paintings.html

A few decisions (more in README)

Locate the carousel by its stable data-attrid rather than by minified class names or per-request ids, both of which rotate. The locator matches an allowlist of Knowledge Graph types instead of an "any carousel-shaped container" heuristic, which can false-positive on non-entity modules such as Unilever's social-media panel.
extensions kept generic: only the date's meaning is known (from the paintings fixture), so each tile's second line passes through as-is.
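
As a rough illustration of the allowlist decision, the locator can be modeled on plain data. This sketch assumes containers have already been parsed into structs; Container, locate, and the 'hypothetical:social_panel' value are invented for the example, and the list holds only the three data-attrid values named in this PR's commits, so the real constant presumably has more entries.

```ruby
# Hypothetical sketch: allowlist-based locating, kept separate from tile extraction.
Container = Struct.new(:attrid, :tile_count)

# Only the data-attrid values named in this PR's commit messages appear here.
CAROUSEL_ATTRIDS = [
  'kc:/music/artist:albums',
  'kc:/architecture/architect:designed',
  'kc:/tv/tv_program:cast'
].freeze

# Keep only containers whose data-attrid is allowlisted. A carousel-shaped
# module with an unknown attrid is ignored, however many tiles it has.
def locate(containers)
  containers.select { |c| CAROUSEL_ATTRIDS.include?(c.attrid) }
end
```

Under this model, supporting a new carousel type means adding one string to the list, and a page with no allowlisted container yields an empty result.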

Scope

Fixtures are en/us desktop captures. Adding a new carousel type is a one-line locator change (adding a data-attrid to the list), since locating the carousel and extracting a tile are separate concerns. I didn't pursue other locales/mobile layouts, tile dedupe, or "view more" pagination. The last would require making extra HTTP requests, which the requirements rule out.

A page can match several supported data-attrids at once (e.g. da Vinci is both architect and visual artist). Picking the first by CAROUSEL_ATTRIDS order grabbed the empty architecture block and dropped all 47 paintings. Choose the matching container that actually holds image tiles instead.
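
The selection rule described here might look something like the sketch below. It assumes candidates are already reduced to an attrid and a tile count; Candidate, pick_carousel, and the 'kc:/hypothetical/paintings' placeholder are invented for the example, since the paintings data-attrid is not stated in this description.

```ruby
# Hypothetical sketch: among allowlisted matches, skip containers with no tiles.
Candidate = Struct.new(:attrid, :tile_count)

# Returns the earliest candidate (by allowlist order) that holds at least one
# image tile, or nil when no allowlisted container has any tiles.
def pick_carousel(candidates, attrid_order)
  candidates
    .select { |c| attrid_order.include?(c.attrid) && c.tile_count.positive? }
    .min_by { |c| attrid_order.index(c.attrid) }
end
```

Allowlist order still breaks ties between populated containers, but an empty block can no longer shadow a populated one.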

dinjas added 17 commits June 23, 2026 11:06

Add rspec; establish extractor; prevent fixture auto-change

3618d07

Add rubocop + minimal config; address lint

2291caa

Locate the carousel

89b47c5

Extract name

e059856

Extract link

37ac559

Extract extensions

ca9a01c

Extract image

5385e38

Add support for "kc:/music/artist:albums"

5a32025

Add negative spec examples

bb4d595

Add support for "kc:/architecture/architect:designed"

1166dc4

Add support for "kc:/tv/tv_program:cast"

3dcba92

Fix typo; switch to symbol-keyed hashes

097f5a8

Update README with info about approach and how to run

81c8445

Improve accuracy of README

b5e81c0

Add to README

7010876

Return multiple carousel content as concatenated array

d8a9e28

dinjas wants to merge 17 commits into serpapi:master from dinjas:van-gogh-paintings
