Extracting Page Content/HTML

Screaming Frog Documentation: https://www.screamingfrog.co.uk/web-scraping/

To extract content with the Screaming Frog SEO Spider, follow these steps:

Go to Configuration → Custom → Custom Extraction

Then click “Add” to add an extraction

The easiest way I’ve found to extract page data in an organized fashion is to target specific divs on a set of pages. This can be done with an XPath expression.

XPath Documentation: https://www.w3schools.com/xml/xpath_syntax.asp

XPath syntax example - Blog author bios

In this example, we want to pull HTML from the Blog Author bio pages into an Excel spreadsheet:

Page example in question: https://www.nurse.com/blog/author/ckonuch/

  • First, you’ll want to configure your crawler to crawl this list of pages. For Nurse.com we have a sitemap containing all of these listings, so we will want to include this sitemap in our Spider configuration settings.

Configuration → Spider → XML Sitemaps → Crawl Linked XML Sitemaps → Crawl these sitemaps

Include your sitemap in this text box, e.g. https://super-craft.nurse.com/sitemaps-1-section-blogAuthors-1-sitemap.xml

Alternatively, if you already have a list of the pages you need to crawl, you can paste that list into Screaming Frog’s List Mode crawl.
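If you only have the sitemap and want a plain URL list for List Mode, a short script can pull the URLs out of it. This is a minimal sketch, not part of Screaming Frog itself: the function name urls_from_sitemap and the sample XML are made up for illustration (the second URL is a placeholder), and it assumes you have saved or pasted the sitemap contents as text.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample; in practice, read the real sitemap file instead.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.nurse.com/blog/author/ckonuch/</loc></url>
  <url><loc>https://www.nurse.com/blog/author/example-author/</loc></url>
</urlset>"""

author_urls = urls_from_sitemap(sample_sitemap)

# One URL per line, ready to paste into List Mode.
print("\n".join(author_urls))
```

The printed list can be pasted straight into the List Mode upload dialog.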

  • Next, you’ll want to set up your Extraction settings and find the class name of the div that needs to be extracted from the site. Open the Custom Extraction menu described above.

    • Add an extraction

    • Name your extractor; in this case we will call it “AuthorBio”

    • Select “XPath”

    • Select “Page Text”

For these pages, we are extracting HTML from a specific div. This is how you can use XPath to pull this data:

  • //div[@class='fusion-author-info']

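    • //div - selects div elements anywhere on the page; replace div with the HTML element type you want to extract

    • @class - the class attribute of the div, span, p, etc.

    • ' ' - put the name of the class here; it must match the element's full class attribute exactly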

    • We found the name of this class through inspecting the page
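Before running a full crawl, it can help to sanity-check the expression against a small piece of markup. The sketch below uses Python's built-in ElementTree, which supports only a subset of XPath and requires well-formed markup, so treat it as a rough check rather than a replica of how Screaming Frog evaluates the expression. The sample markup is hypothetical and much simpler than the real author page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for an author page.
sample_page = """<html><body>
  <div class="header">Site navigation</div>
  <div class="fusion-author-info"><h3>About the author</h3><p>Sample bio text.</p></div>
</body></html>"""

root = ET.fromstring(sample_page)

# ElementTree needs a leading "." to search the whole tree;
# in Screaming Frog the expression is //div[@class='fusion-author-info'].
matches = root.findall(".//div[@class='fusion-author-info']")

for div in matches:
    # Text only, with the tags stripped out.
    bio_text = " ".join(t.strip() for t in div.itertext() if t.strip())
    # The matched element serialized back to markup.
    bio_html = ET.tostring(div, encoding="unicode").strip()
    print(bio_text)
    print(bio_html)
```

Only the second div is matched, because the predicate compares the whole class attribute against the quoted string.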

Extraction settings:

Output:

  • After the crawl is complete, go to the Custom Extraction tab to see your results

Then you can export your results to a CSV or XLSX file
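Once you have the CSV, a quick script can flag pages where the extractor came back empty, which usually means the XPath did not match on that page. This is a sketch under assumptions: the column names "Address" and "AuthorBio 1" are what I would expect for an extractor named AuthorBio, so check them against the header row of your own export, and the sample rows are invented.

```python
import csv
import io

# Hypothetical export contents; column names are assumed, so verify
# them against the header row of your actual export file.
sample_export = io.StringIO(
    "Address,AuthorBio 1\n"
    "https://www.nurse.com/blog/author/ckonuch/,Sample bio text.\n"
    "https://www.nurse.com/blog/author/example-author/,\n"
)

rows = list(csv.DictReader(sample_export))

# Pages where the extractor returned nothing.
missing = [row["Address"] for row in rows if not row["AuthorBio 1"].strip()]

print(f"{len(rows)} pages crawled, {len(missing)} without an extracted bio")
for url in missing:
    print(url)
```

To run this on a real export, replace the StringIO object with open("your_export.csv", newline="", encoding="utf-8"), where the file name is whatever you chose when exporting.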