
SerpApi code challenge#394

Open
rob-mccormick wants to merge 15 commits into serpapi:master from rob-mccormick:master


@rob-mccormick


Extract artwork from Google SERP page

Given the path to a saved Google SERP HTML file, FileScraper returns a JSON list of artworks. Each artwork includes its:

  • name
  • extensions
  • link
  • image

On the search results page, some artworks are shown with thumbnail images. For those artworks, the thumbnail is returned as the artwork's image.

Comment thread lib/file_scraper.rb
@@ -0,0 +1,53 @@
# frozen_string_literal: true

require "ferrum"

Author

I used Ferrum as the web driver because it is headless by default and simple to use (for this challenge at least).

It has yet to reach a 1.0 release, though, so Selenium WebDriver could be used as an alternative.

Comment thread lib/file_scraper.rb

document = Nokogiri::HTML(html)

artworks = document.css("g-loading-icon + div").children

Author

This selector could be written purely in CSS as g-loading-icon + div > *.

I used the children method instead because it makes what is being selected clearer than the > * selector does.

Comment thread lib/file_scraper.rb
artworks = document.css("g-loading-icon + div").children

result = artworks.map do |artwork|
  extensions = artwork.css("img + div").children.map do |extension|

Author

As with artworks above, this selector could be written purely in CSS as img + div > *.

I used the children method because it is clearer and consistent with the approach above.

Comment thread lib/file_scraper.rb

private

def self.extract_html(file_path)

Author

Initially I parsed the HTML files with Nokogiri alone, but I found that the thumbnail images are only populated once the page's JavaScript has run. So I switched to using a web driver to render the page before parsing it.

I thought a separate method for extracting the HTML was useful because:

  • It could be extended to scrape the page in different ways, such as with or without a web driver.
  • Having the web driver code contained within this method makes it simpler to change the web driver (e.g. switch Ferrum for Selenium).
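As a rough illustration of the two points above, the extraction step could be made swappable along these lines. This is a sketch, not the code in this PR: PlainFileExtractor, FerrumExtractor, and FileScraperSketch are hypothetical names, and the Ferrum calls (Ferrum::Browser.new, go_to, body, quit) are assumed to behave as described in Ferrum's README.

```ruby
# frozen_string_literal: true

# Hypothetical extractor: returns the file's HTML as-is, without running JavaScript.
module PlainFileExtractor
  def self.call(file_path)
    File.read(file_path)
  end
end

# Hypothetical extractor: renders the file in headless Chrome via Ferrum,
# so that script-populated content (such as thumbnails) is present.
module FerrumExtractor
  def self.call(file_path)
    require "ferrum" # loaded lazily so the plain extractor works without the gem
    browser = Ferrum::Browser.new
    begin
      browser.go_to("file://#{File.expand_path(file_path)}")
      browser.body
    ensure
      browser.quit
    end
  end
end

# Hypothetical scraper shell: anything that responds to call(file_path)
# and returns an HTML string can be plugged in as the extractor.
class FileScraperSketch
  def initialize(extractor: FerrumExtractor)
    @extractor = extractor
  end

  def html_for(file_path)
    @extractor.call(file_path)
  end
end
```

Swapping Ferrum for Selenium would then mean adding one more object with a call method, and the parsing code would stay untouched. A lambda also works as an extractor, which can be handy in specs.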

Comment thread spec/file_scraper_spec.rb

if file_path == "./files/van-gogh-paintings.html"
  it "produces the expected response" do
    @response["artworks"].each.with_index do |artwork, index|

Author

This test could be simplified to:

expect(@response).to eq(expected_response)

I kept the per-artwork assertions because they are easier to debug: asserting on the whole response produces a single wall of text when it fails.
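The debugging benefit of comparing one artwork at a time can be shown outside RSpec with a small helper. This is only an illustrative sketch: first_mismatch is a hypothetical function, not part of the spec in this PR, and it assumes each artwork is a Hash.

```ruby
# Hypothetical helper: walks the expected artworks in order and reports the
# first field that differs, instead of dumping both full responses.
# It does not report extra items at the end of `actual`.
def first_mismatch(actual, expected)
  expected.each_with_index do |expected_item, index|
    actual_item = actual[index] || {}
    expected_item.each do |key, value|
      next if actual_item[key] == value

      return { index: index, key: key, expected: value, actual: actual_item[key] }
    end
  end
  nil # every expected field matched
end

# Made-up sample data, for illustration only.
expected = [{ "name" => "A", "link" => "/a" }, { "name" => "B", "link" => "/b" }]
actual   = [{ "name" => "A", "link" => "/a" }, { "name" => "B", "link" => "/wrong" }]

p first_mismatch(actual, expected)
```

A failure reported this way points at one index and one key, which is the same information the per-artwork expectations surface in the spec output.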
