Extracting Page Content/HTML

Screaming Frog Documentation: https://www.screamingfrog.co.uk/web-scraping/

To extract content with the Screaming Frog SEO Spider, follow these steps:

Go to Configuration → Custom → Custom Extraction

Then click “Add” to add an extraction

The easiest way I’ve found to extract page data in an organized fashion is to pull it from a specific div that appears on every page in the set. An XPath expression can target that div.

XPath Documentation: https://www.w3schools.com/xml/xpath_syntax.asp

XPath syntax example - Blog author bios

In this example, we want to pull HTML from the blog author bio pages into an Excel document:

Page example in question: https://www.nurse.com/blog/author/ckonuch/

First, you’ll want to configure your crawler to crawl this list of pages. For Nurse.com we have a sitemap containing all of these listings, so we will want to include this sitemap in our Spider configuration settings.

Configuration → Spider → XML Sitemaps → Crawl Linked XML Sitemaps → Crawl these sitemaps

Include your sitemap in this text box, e.g. https://super-craft.nurse.com/sitemaps-1-section-blogAuthors-1-sitemap.xml

Alternatively, if you already have a list of the pages you need to crawl, you can paste that list into Screaming Frog’s List Mode crawl.
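If you take the List Mode route and only have the sitemap, the URL list can be pulled out of it with a few lines of Python. The sketch below is one possible approach, not part of Screaming Frog: it parses an inline sample sitemap (the second URL is a made-up placeholder) and assumes the sitemap follows the standard sitemaps.org format. The names `urls_from_sitemap` and `SAMPLE_SITEMAP` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Illustrative sitemap content; the second URL is a made-up placeholder.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.nurse.com/blog/author/ckonuch/</loc></url>
  <url><loc> https://www.nurse.com/blog/author/example-author/ </loc></url>
</urlset>
"""


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> value in a sitemap, with surrounding whitespace trimmed."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


urls = urls_from_sitemap(SAMPLE_SITEMAP)
# One URL per line, ready to paste into List Mode.
print("\n".join(urls))
```

For a real sitemap, you would download the XML first and pass its text to the same function.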

Next, set up your extraction settings and find the class name of the div you want to extract. Open the Custom Extraction menu described above.
- Add an extraction
- Name your extractor; in this case we will call it “AuthorBio”
- Select “XPath”
- Select the extraction type: “Extract Text” for the text only, or “Extract Inner HTML” to keep the HTML markup

For these pages, we are extracting HTML from a specific div. This XPath expression selects it:

//div[@class='fusion-author-info']
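- //div - matches every div element at any depth in the page; swap div for span, p, etc. to target another element type
- @class - the attribute to match on, here the element’s class
- ' ' - the class value goes between the quotes; it must match the whole class attribute exactly
- We found this class name by inspecting the page in the browser’s developer tools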
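To check an expression like this before starting a crawl, you can try it against a small piece of markup. The sketch below uses Python's built-in ElementTree, which supports only a subset of XPath and needs well-formed markup, so the sample page is a simplified, hypothetical stand-in for the real one. ElementTree also requires the relative form `.//div`, whereas Screaming Frog takes `//div` as written above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page markup; the real page is far larger.
SAMPLE_PAGE = """
<html><body>
  <div class="fusion-author-info"><h2>About the author</h2><p>Bio text here.</p></div>
  <div class="fusion-author-info extra"><p>Not matched: class value differs.</p></div>
  <div class="sidebar"><p>Unrelated content.</p></div>
</body></html>
"""

root = ET.fromstring(SAMPLE_PAGE)

# Same predicate as the Screaming Frog expression, in ElementTree's relative form.
matches = root.findall(".//div[@class='fusion-author-info']")

for div in matches:
    # The element with its markup, comparable to extracting the HTML element.
    element_html = ET.tostring(div, encoding="unicode").strip()
    # The text alone, with each text node trimmed and joined by single spaces.
    text_only = " ".join(t.strip() for t in div.itertext() if t.strip())
    print(element_html)
    print(text_only)
```

Only the first div is selected. The second one carries an extra class, so its class attribute is no longer an exact match for the quoted value.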

Extraction settings:

Output:

After the crawl is complete, go to the Custom Extraction tab to see your results.

Then you can export your results to a CSV or XLSX file.
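Once exported, the CSV can be post-processed outside of Excel if needed. The sketch below is a minimal example of reading such a file with Python's csv module. The column names are assumptions: it expects an “Address” column for the URL and an “AuthorBio 1” column for the extractor output, so check the header row of your own export and adjust. The sample rows are illustrative, not real crawl output, and `load_bios` is a hypothetical helper name.

```python
import csv
import io

# Illustrative export; column names and values are assumptions, not real crawl output.
SAMPLE_EXPORT = '''Address,Status Code,AuthorBio 1
https://www.nurse.com/blog/author/ckonuch/,200,"Example bio text, with a comma."
https://www.nurse.com/blog/author/example-author/,200,
'''


def load_bios(csv_file, column="AuthorBio 1"):
    """Map each crawled URL to its extracted value, skipping rows with no extraction."""
    reader = csv.DictReader(csv_file)
    return {
        row["Address"]: row[column].strip()
        for row in reader
        if (row.get(column) or "").strip()
    }


bios = load_bios(io.StringIO(SAMPLE_EXPORT))
for url, bio in bios.items():
    print(url, "->", bio)
```

For a file on disk, pass `open(path, newline="", encoding="utf-8-sig")` instead of the `io.StringIO` wrapper; the `utf-8-sig` codec tolerates a byte order mark if the export has one.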