
Scrape similar images from the website

· 5 min read
Robin

This article describes how to extract a set of related images from a website. Downloading all images from a web page and storing them in a local folder is a common need. It is also useful to preserve the titles of the images so that they can be managed or processed more easily later.

Although several download tools can store all images from a web page in a folder, most of them save the images under obscure or arbitrary names. I therefore use the Python module Clicknium for this task, because it is simple to get started with and handles capturing lists of related elements well.

Let's have a look at the web page containing the image list. Each item has a title, a price, and an image. The expected outcome is a folder containing all of the images, with each title used as the file name.

We will cover the scraping in 3 parts as below:

  • Development tool preparation
  • Capture locator for the image
  • Write automation code

Development tool preparation

  • Install Visual Studio Code and the Clicknium extension.
  • Follow the instructions in the quick start document of the Clicknium extension to complete the setup.

Capture locator for the image

After setting up the development environment, open an empty folder in VSCode and create a new .py file.

  • Start capturing the locator by clicking the capture button or pressing "Ctrl+F10".

  • Once the "Clicknium Recorder" is invoked, click the "Similar elements" button in the recorder to capture a locator for the image list.

  • After clicking the button, a wizard pops up guiding you to generate a locator which can match all expected images.

Hover the mouse cursor over an element and add the first target element by pressing "Ctrl+Click". Any image from the image list can serve as this element.
Once the element is added, the locator is auto-generated, and the wizard displays how many similar elements the locator matches.
Since only one element has been added so far, the locator matches only that element. We can capture another image from the list to match more.
After adding 3 images to the wizard, we can see that the auto-generated locator now matches 21 elements.
As there are 22 images in total on the web page, we continue adding image elements to the wizard until the auto-generated locator matches all 22. (If the matched number is not as expected, we can always add more elements.)
Click the "Save" button to complete the wizard.

After capturing the locator, we can open it in Visual Studio Code to see its details. The properties can be edited manually if the locator can be optimized further.

From the locator editor panel, we can also click the "Validate" button to check that all 22 matched elements are the expected ones. After clicking "Validate", a wizard lets us step through the target elements one by one. If any of them is incorrect, we may

  • recapture the locator by going through the wizard again
  • or manually modify the locator in the locator editor panel.

Capture the image titles in the same way as above to obtain a second locator for the title elements.
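Because the images and the titles come from two separate locators, the automation code pairs them by position. A minimal sketch of a guard for that pairing is shown here; it works on plain lists standing in for the values read from the matched elements, and the function name `pair_by_position` is hypothetical, not part of Clicknium.

```python
def pair_by_position(image_srcs, titles):
    """Pair each title with the image source at the same index.

    Raises ValueError when the two locators matched a different number
    of elements, since index-based pairing would then mislabel images.
    """
    if len(image_srcs) != len(titles):
        raise ValueError(
            f"locators matched {len(image_srcs)} images "
            f"but {len(titles)} titles"
        )
    return list(zip(titles, image_srcs))


# Hypothetical sample values for illustration only
pairs = pair_by_position(["//cdn/a.png", "//cdn/b.png"], ["Jacket", "Coat"])
print(pairs)
```

If the counts differ, it is usually worth going back to the wizard and adding more elements to whichever locator matched fewer items.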

Write Automation Code

With the locators in place, we can now write code that does the following:

  • Get the images and titles
  • Download each image and save it with its title as the file name

import os
import requests
import shutil
from clicknium import clicknium as cc, locator, ui

# attach to the opened browser, the url is a fake site
tab = cc.edge.attach_by_title_url(url = "https://gallerydemo.com/pages/outerwear")

# get images and titles
imgs = tab.find_elements(locator.msedge.gallerydept.img_out)
titles = tab.find_elements(locator.msedge.gallerydept.span_out)

# iterate every image element
for x in range(len(imgs)):
    src = imgs[x].get_property("src")
    tstr = titles[x].get_text()

    # download image with url and save to folder with title as name
    res = requests.get("https:"+src, stream = True)
    if res.status_code == 200:
        file = "c:\\test\\gallery\\" + tstr + ".png"
        # use different name if the title is duplicated
        if(os.path.exists(file)):
            file = "c:\\test\\gallery\\" + tstr + str(x) + ".png"
        with open(file,'wb') as f:
            shutil.copyfileobj(res.raw, f)
        print('Image successfully downloaded: ',tstr)
    else:
        print('Image couldn\'t be retrieved')

  • The complete code can be found on GitHub.
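The code above prepends "https:" to every src value, which only works when the page uses protocol-relative URLs (starting with "//"). A slightly more defensive sketch, using only the standard library, resolves each src against the page URL instead. The helper name `absolute_image_url` and the sample src values are hypothetical; the page URL is the same fake site used above.

```python
from urllib.parse import urljoin

# Same fake demo site as in the article's code
PAGE_URL = "https://gallerydemo.com/pages/outerwear"


def absolute_image_url(src, page_url=PAGE_URL):
    """Resolve an image src (absolute, protocol-relative, or relative)
    against the URL of the page it was found on."""
    return urljoin(page_url, src.strip())


# Hypothetical src values for illustration only
print(absolute_image_url("//cdn.example.com/img/a.png"))
print(absolute_image_url("/img/b.png"))
print(absolute_image_url("https://cdn.example.com/img/c.png"))
```

With this helper, the download call could become `requests.get(absolute_image_url(src), stream=True)`, assuming the src values on the target page follow one of the three forms shown.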

The execution result is as follows: the images are saved in the folder c:\test\gallery, each named with the same title as shown on the web page.
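Titles taken from a web page can contain characters that Windows does not allow in file names (such as ":" or "/"), and several items can share one title. The following is a minimal sketch of one way to handle both cases before saving; the helper names `safe_file_name` and `unique_path` are hypothetical and not part of Clicknium or the original script.

```python
import os
import re


def safe_file_name(title):
    """Replace characters that Windows forbids in file names."""
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", title).strip(" .")
    return cleaned or "untitled"


def unique_path(folder, title, taken, ext=".png"):
    """Build a path in `folder` whose name is not yet in `taken`.

    `taken` is a set of lower-cased names already used in this run;
    duplicates receive a numeric suffix such as "_2".
    """
    base = safe_file_name(title)
    candidate = base
    n = 1
    while candidate.lower() in taken:
        n += 1
        candidate = f"{base}_{n}"
    taken.add(candidate.lower())
    return os.path.join(folder, candidate + ext)


# Hypothetical titles for illustration only
used = set()
for t in ["Wool Coat", "Wool Coat", "Jacket 3/4"]:
    print(unique_path("gallery", t, used))
```

Tracking used names in a set avoids depending on what already exists on disk, whereas the original script checks `os.path.exists` instead; either approach could be reasonable depending on whether the folder is reused between runs.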

Conclusion

In this article, we demonstrated how to scrape images from a web page. With Clicknium's "Similar elements" function, it is easy to locate the images with a few mouse clicks and to write simple code against the generated locator.
The important part is capturing the similar elements: the more elements you add, the more accurate the auto-generated locator becomes. A good practice is to add elements from different locations, such as different columns and rows, so that the wizard has broader coverage from which to generate a correct locator.