Crawling a staging/testing site

Screaming Frog Documentation

https://www.screamingfrog.co.uk/how-to-crawl-a-staging-website/

 

For our example, we’ll be attempting to crawl https://super-craft.nurse.com/, Nurse.com’s staging site (as of 12/8/23). This site’s robots.txt file disallows all crawlers by default, as you can see here: https://super-craft.nurse.com/robots.txt

User-agent: *
Disallow: /
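
If you want to double-check what that rule means before opening Screaming Frog, a quick sketch with Python’s built-in urllib.robotparser (purely illustrative, and not part of the Screaming Frog workflow) confirms that every path is blocked for every user agent:

from urllib.robotparser import RobotFileParser

# Fetch and parse the staging site's robots.txt.
rp = RobotFileParser("https://super-craft.nurse.com/robots.txt")
rp.read()

# "User-agent: *" plus "Disallow: /" blocks every path for every crawler,
# so this prints False.
print(rp.can_fetch("*", "https://super-craft.nurse.com/"))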

 

This tells crawlers that they are not permitted to crawl any part of the site, but Screaming Frog can bypass the rule through its Configuration settings.

Configuration >>> robots.txt >>> Settings

Here, you will select either “Ignore robots.txt” or “Ignore robots.txt but report status”.

Either option allows you to bypass the rules of the robots.txt file and crawl the site.

 

If you need to retain some of the robots.txt restrictions, you can instead use the Custom robots.txt configuration to edit the rules for each subdomain you wish to crawl.

 

For this domain, running through this process still did not surface the full set of pages on the site, so we’ll need to dive back into the robots.txt file for more details. Oftentimes, Screaming Frog will require you to specifically input the sitemaps you wish to crawl in order to kick off the crawling process.

 

# robots.txt for https://super-craft.nurse.com/
sitemap: https://super-craft.nurse.com/jobs/sitemaps-1-sitemap.xml
# local - disallow all
User-agent: *
Disallow: /
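
As a side note, the sitemap directive can be pulled out of robots.txt programmatically as well. A minimal sketch, assuming Python 3.8+ (where RobotFileParser.site_maps() was added):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://super-craft.nurse.com/robots.txt")
rp.read()

# site_maps() returns the URLs listed in any "sitemap:" directives,
# e.g. ['https://super-craft.nurse.com/jobs/sitemaps-1-sitemap.xml'],
# or None if the file has none.
print(rp.site_maps())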

 

Here we can see the primary sitemap for the site we want to crawl, so let’s open it in a browser to see Nurse.com’s child sitemaps (a script for pulling this list automatically follows the list below).

https://super-craft.nurse.com/jobs/sitemaps-1-sitemap.xml (primary sitemap)
https://super-craft.nurse.com/jobs/sitemaps-1-section-blogAuthors-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-categorygroup-blogCategories-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-blogEntries-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-blogIndex-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-fullHomepage-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-home-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-marketingPages-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-nurseStoriesIndex-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-categorygroup-podcastCategories-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-podcastEntries-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-podcastEpisodesListing-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-podcastLanding-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-pages-1-sitemap.xml
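
If you’d rather collect this list with a script than copy it out of the browser, a short standard-library Python sketch like the one below (assuming the index follows the usual sitemaps.org schema) prints each child sitemap URL:

import urllib.request
import xml.etree.ElementTree as ET

INDEX_URL = "https://super-craft.nurse.com/jobs/sitemaps-1-sitemap.xml"
# Standard sitemap namespace; an assumption about this particular index.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(INDEX_URL) as resp:
    root = ET.fromstring(resp.read())

# A sitemap index wraps each child sitemap URL in <sitemap><loc> elements.
for loc in root.findall("sm:sitemap/sm:loc", NS):
    print(loc.text.strip())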

 

You’ll want to grab this list of sitemaps and add them to your Spider configuration. Here is what you’ll need in order to force Screaming Frog to crawl all of the available pages on the domain.

Configuration >>> Spider >>> Crawl

Leave the defaults for Page Links

Under Crawl Behavior, include:

  • Check Links Outside of Start Folder
  • Crawl Outside of Start Folder
  • Crawl All Subdomains

 

Under XML Sitemaps, include:

  • Crawl Linked XML Sitemaps
  • Auto Discover XML Sitemaps via robots.txt
  • Crawl These Sitemaps:
    • Add your list of sitemaps (like the one we pulled above)

 

Now you’re ready to start your crawl! Good luck!
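
If you ever need to run this crawl on a schedule rather than by hand, Screaming Frog also has a command-line mode that can reuse a configuration file saved from the GUI. The sketch below is only an outline: the executable name and available flags vary by platform and version (check the CLI documentation for your install), and the config file name and output folder shown here are hypothetical.

import subprocess

subprocess.run(
    [
        "screamingfrogseospider",          # Linux executable name; differs on Windows/macOS
        "--headless",                      # run without opening the UI
        "--crawl", "https://super-craft.nurse.com/",
        # A .seospiderconfig saved from the GUI with the robots.txt and
        # Spider settings described above (hypothetical file name).
        "--config", "nurse-staging.seospiderconfig",
        "--save-crawl",                    # persist the crawl file when finished
        "--output-folder", "crawl-output", # hypothetical output directory
    ],
    check=True,
)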