SerpApi code challenge #394
| @@ -0,0 +1,53 @@ | |||
| # frozen_string_literal: true | |||
|
|
|||
| require "ferrum" | |||
I used Ferrum as the web driver because it is headless by default and simple to use (for this challenge at least).
Ferrum has yet to reach a 1.0 release, however, so Selenium WebDriver could be used as an alternative.
|
|
||
| document = Nokogiri::HTML(html) | ||
|
|
||
| artworks = document.css("g-loading-icon + div").children |
This selection could be done in CSS alone with `g-loading-icon + div > *`.
I used the `children` method instead because it makes what is being selected clearer than the `> *` selector does.
|
|
||
| result = artworks.map do |artwork| | ||
| extensions = artwork.css("img + div").children.map do |extension| |
As with `artworks` above, this selection could be done in CSS alone with `img + div > *`.
I used the `children` method because it is clearer and consistent with the approach above.
|
|
||
| private | ||
|
|
||
| def self.extract_html(file_path) |
To begin with, I parsed the HTML files with Nokogiri alone, but found that I needed a browser to execute the JavaScript for the thumbnail images, so I switched to using a web driver to render the page.
I thought a separate method for extracting the HTML was useful because:
- It could be extended to scrape the page in different ways, such as with or without a web driver.
- Having the web driver code contained within this method makes it simpler to change the web driver (e.g. switch Ferrum for Selenium).
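The two points above can be sketched as a small seam in plain Ruby. This is only an illustration under assumptions, not the code in this PR: `HtmlExtractors`, `PlainFile`, `RenderedWithFerrum` and `ScraperSketch` are hypothetical names, and the Ferrum variant assumes the `ferrum` gem and a Chrome/Chromium binary are installed (it is only loaded if that extractor is actually called).

```ruby
# frozen_string_literal: true

# Hypothetical extractors: each one is a callable that takes a file path
# and returns an HTML string. The names are illustrative, not from the PR.
module HtmlExtractors
  # Reads the file as-is; no JavaScript is executed.
  PlainFile = ->(file_path) { File.read(file_path) }

  # Renders the file in a headless browser so that scripts run first.
  # Assumes the ferrum gem and a Chrome/Chromium binary are available.
  RenderedWithFerrum = lambda do |file_path|
    require "ferrum"
    browser = Ferrum::Browser.new
    begin
      browser.go_to("file://#{File.expand_path(file_path)}")
      browser.body
    ensure
      browser.quit
    end
  end
end

# Hypothetical stand-in for the scraper, showing only the extraction seam.
class ScraperSketch
  def initialize(extractor: HtmlExtractors::PlainFile)
    @extractor = extractor
  end

  # The rest of the scraper only ever sees an HTML string.
  def extract_html(file_path)
    @extractor.call(file_path)
  end
end
```

With this shape, switching drivers means passing a different callable, for example `ScraperSketch.new(extractor: HtmlExtractors::RenderedWithFerrum)`. A Selenium-backed extractor would be one more callable with the same signature.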
|
|
||
| if file_path == "./files/van-gogh-paintings.html" | ||
| it "produces the expected response" do | ||
| @response["artworks"].each.with_index do |artwork, index| |
This test could be simplified to `expect(@response).to eq(expected_response)`.
I used the per-artwork approach instead because it is better for debugging: asserting at the response level gives you a single wall of text when the test fails.
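The debugging benefit of comparing artwork by artwork can be sketched in plain Ruby, without RSpec. The helper below is hypothetical (`artwork_mismatches` is not part of this PR), and the `"name"` key and sample titles are made-up illustration data. It reports each differing item with its index instead of one large diff.

```ruby
# frozen_string_literal: true

# Hypothetical helper: compare two artwork lists item by item and
# describe every mismatch together with its position in the list.
def artwork_mismatches(actual, expected)
  mismatches = []
  if actual.length != expected.length
    mismatches << "length: expected #{expected.length}, got #{actual.length}"
  end
  actual.zip(expected).each_with_index do |(got, want), index|
    next if got == want

    mismatches << "artworks[#{index}]: expected #{want.inspect}, got #{got.inspect}"
  end
  mismatches
end

# Made-up sample data, for illustration only.
expected = [{ "name" => "The Starry Night" }, { "name" => "Irises" }]
actual   = [{ "name" => "The Starry Night" }, { "name" => "Iris" }]

puts artwork_mismatches(actual, expected)
```

In an RSpec suite, the same effect comes from one expectation per artwork inside the loop, so a failure points at a single index rather than at the whole response.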
Extract artwork from Google SERP page
Given a file path to a Google SERP page `html` file, the `FileScraper` will return a JSON list of artworks. Each artwork includes its:
In the search result page some artwork is shown with thumbnail images. For these pieces of artwork the thumbnail image is included as the artwork `image`.