Extracting Page Content/HTML
Screaming Frog Documentation: https://www.screamingfrog.co.uk/web-scraping/
To extract content with the SEO Spider, follow these steps:
Go to Configuration → Custom → Custom Extraction
Then click “Add” to add an extraction
The easiest way I’ve found to extract page data in an organized fashion is to pull it from specific divs on a set of pages. This can be done with an XPath expression.
XPath Documentation: https://www.w3schools.com/xml/xpath_syntax.asp
XPath syntax example - Blog author bios
In this example, we want to pull HTML from the blog author bio pages into an Excel document:
Page example in question: https://www.nurse.com/blog/author/ckonuch/
First, you’ll want to configure your crawler to crawl this list of pages. For Nurse.com we have a sitemap containing all of these listings, so we will want to include this sitemap in our Spider configuration settings.
Configuration → Spider → XML Sitemaps → Crawl Linked XML Sitemaps → Crawl these sitemaps
Include your sitemap in this text box, e.g. https://super-craft.nurse.com/sitemaps-1-section-blogAuthors-1-sitemap.xml
Alternatively, if you already have a list of the pages you need to crawl, you can paste that list into Screaming Frog’s List Mode crawl.
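If you only have the sitemap but would rather use List Mode, the URL list can be pulled out of the sitemap XML first. This is a minimal sketch using Python’s built-in ElementTree; SAMPLE_SITEMAP and its example.com URLs are made-up placeholders, and sitemap_urls is a hypothetical helper name, not part of Screaming Frog.

```python
import xml.etree.ElementTree as ET

# Standard namespace used by sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sitemap content for illustration; in practice you would load
# the contents of your own sitemap file here.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/author/author-one/</loc></url>
  <url><loc> https://example.com/blog/author/author-two/ </loc></url>
</urlset>
"""


def sitemap_urls(xml_text):
    """Return every <loc> URL in a sitemap, in document order."""
    root = ET.fromstring(xml_text.strip())
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    # One URL per line, ready to paste into a List Mode crawl.
    print("\n".join(sitemap_urls(SAMPLE_SITEMAP)))
```

The printed output is one URL per line, which is the shape List Mode expects when you paste a list.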
Next, you’ll want to set up your Extraction settings and find the class name of the div that needs to be extracted from the site. Open the Custom Extraction menu described above.
Add an extraction
Name your extractor. In this case we will call it “AuthorBio”
Select “XPath”
Select “Page Text”
For these pages, we are extracting HTML from a specific div. This is how you can use XPath to pull this data:
//div[@class='fusion-author-info']
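//div - the HTML element type you want to extract; the leading // matches that element anywhere on the page
@class - the attribute to match on, here the class attribute of the div, span, p, etc.
' ' - put the name of the class inside the quotes. This is an exact match against the whole class attribute, so if the element has several classes you need to enter the full attribute value.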
We found the name of this class by inspecting the page in the browser
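To sanity-check an expression like this before running a crawl, it can help to try it on a small piece of markup. The sketch below uses Python’s built-in ElementTree, which supports only a subset of XPath and needs well-formed markup, so treat it as an illustration of how the expression selects a div rather than as what Screaming Frog does internally. SAMPLE_PAGE is a made-up, simplified stand-in for an author page (not the real Nurse.com markup), and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified stand-in for an author bio page.
SAMPLE_PAGE = """
<html>
  <body>
    <div class="site-header">Header</div>
    <div class="fusion-author-info">
      <h2>About the author</h2>
      <p>Sample bio text.</p>
    </div>
  </body>
</html>
"""


def extract_by_class(page, tag, class_name):
    """Return the markup of every <tag> whose class attribute equals class_name."""
    root = ET.fromstring(page.strip())
    # ElementTree needs a relative path, hence the leading "." before "//".
    matches = root.findall(f".//{tag}[@class='{class_name}']")
    return [ET.tostring(m, encoding="unicode").strip() for m in matches]


def extract_text_by_class(page, tag, class_name):
    """Same selection, but return only the text with whitespace collapsed."""
    root = ET.fromstring(page.strip())
    matches = root.findall(f".//{tag}[@class='{class_name}']")
    return [" ".join("".join(m.itertext()).split()) for m in matches]


if __name__ == "__main__":
    for html in extract_by_class(SAMPLE_PAGE, "div", "fusion-author-info"):
        print(html)
```

Only the div whose class is exactly fusion-author-info is returned; the site-header div is skipped, which mirrors how the class filter narrows the extraction to the author bio block.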
Extraction settings:
Output:
After the crawl is complete, go to the Custom Extraction tab to see your results
Then you can export your results into a CSV or XLSX document
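Once you have the CSV, a quick script can flag pages where the extractor came back empty, which usually means the div was missing or the class name did not match. This is a rough sketch: the column names “Address” and “AuthorBio 1” are assumptions about how the export is labeled, so check them against your own file, and SAMPLE_EXPORT is made-up data standing in for the real export.

```python
import csv
import io

# Made-up export content; the column names are assumptions and should be
# checked against the header row of your actual export.
SAMPLE_EXPORT = (
    "Address,Status Code,AuthorBio 1\n"
    "https://example.com/blog/author/author-one/,200,Sample bio text.\n"
    "https://example.com/blog/author/author-two/,200,\n"
)


def split_by_extraction(csv_text, column, url_column="Address"):
    """Split rows into {url: extracted value} and a list of URLs with no value."""
    found = {}
    missing = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = (row.get(column) or "").strip()
        if value:
            found[row[url_column]] = value
        else:
            missing.append(row[url_column])
    return found, missing


if __name__ == "__main__":
    found, missing = split_by_extraction(SAMPLE_EXPORT, "AuthorBio 1")
    print(f"{len(found)} pages with a bio, {len(missing)} without")
    for url in missing:
        print("No extraction:", url)
```

For a real export, read the file’s contents (for example with open(..., newline="", encoding="utf-8")) and pass the text to the same function.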