Crawling a staging/testing site
How To Crawl A Staging Website
For our example, we’ll crawl https://super-craft.nurse.com/, Nurse.com’s staging site (as of 12/8/23). This site’s robots.txt file disallows all crawlers by default, as you can see at https://super-craft.nurse.com/robots.txt:
User-agent: *
Disallow: /
This tells crawlers that they are not allowed to crawl any part of the site, but Screaming Frog can bypass it through its Configuration settings.
Configuration >>> robots.txt >>> Settings
Here, select “Ignore robots.txt” or “Ignore robots.txt but report status.” Either option lets Screaming Frog bypass the rules of the robots.txt file and crawl the site.
If you need to retain some of the robots.txt restrictions, you can use a custom robots.txt configuration instead to define the rules for each subdomain you wish to crawl.
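Before changing any settings, it can help to confirm what the staging site’s robots.txt actually blocks. Here is a minimal sketch using Python’s standard-library robotparser (just a sanity check, not part of Screaming Frog) that shows the blanket disallow in action:

from urllib.robotparser import RobotFileParser

# Point the parser at the staging site's robots.txt and fetch it.
rp = RobotFileParser("https://super-craft.nurse.com/robots.txt")
rp.read()

# With "User-agent: *" / "Disallow: /", no well-behaved crawler may fetch anything.
print(rp.can_fetch("*", "https://super-craft.nurse.com/"))   # False
print(rp.can_fetch("Screaming Frog SEO Spider", "https://super-craft.nurse.com/jobs/"))   # False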
For this domain, even after ignoring robots.txt the crawl still did not surface the full set of pages on the site, so we’ll need to dive back into the robots.txt file for more detail. Oftentimes, Screaming Frog requires you to specifically input the sitemaps you wish to crawl in order to pick up every available page.
# robots.txt for https://super-craft.nurse.com/
sitemap: https://super-craft.nurse.com/jobs/sitemaps-1-sitemap.xml
# local - disallow all
User-agent: *
Disallow: /
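If you’d rather pull the sitemap directives out of robots.txt programmatically than eyeball the file, a short sketch in plain Python (an illustration, not a Screaming Frog feature) does the job:

import urllib.request

ROBOTS_URL = "https://super-craft.nurse.com/robots.txt"

# Download robots.txt and collect every "sitemap:" directive it declares.
with urllib.request.urlopen(ROBOTS_URL) as resp:
    robots_txt = resp.read().decode("utf-8", errors="replace")

sitemaps = [
    line.split(":", 1)[1].strip()
    for line in robots_txt.splitlines()
    if line.lower().startswith("sitemap:")
]
print(sitemaps)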
Either way, we can see the primary sitemap for the site we want to crawl, so let’s open it in a browser and look at the child sitemaps it references.
https://super-craft.nurse.com/jobs/sitemaps-1-sitemap.xml (primary sitemap)
https://super-craft.nurse.com/jobs/sitemaps-1-section-blogAuthors-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-categorygroup-blogCategories-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-blogEntries-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-blogIndex-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-fullHomepage-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-home-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-marketingPages-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-nurseStoriesIndex-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-categorygroup-podcastCategories-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-podcastEntries-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-podcastEpisodesListing-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-podcastLanding-1-sitemap.xml
https://super-craft.nurse.com/jobs/sitemaps-1-section-pages-1-sitemap.xml
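The child sitemaps above were collected by opening the primary sitemap in a browser, but the same list can be pulled with a short script. This is a minimal sketch in standard-library Python (using the sitemaps.org XML namespace) that reads the sitemap index and prints each child sitemap it references:

import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_INDEX = "https://super-craft.nurse.com/jobs/sitemaps-1-sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# A sitemap index lists its child sitemaps as <sitemap><loc> entries.
with urllib.request.urlopen(SITEMAP_INDEX) as resp:
    index = ET.fromstring(resp.read())

child_sitemaps = [loc.text.strip() for loc in index.findall(".//sm:sitemap/sm:loc", NS)]
for url in child_sitemaps:
    print(url)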
You’ll want to grab this list of sitemaps and add them to your Spider configuration. Here is what you’ll need to set in order to force Screaming Frog to crawl all of the available pages on the domain.
Configuration >>> Spider >>> Crawl
Leave the defaults for Page Links
Under Crawl Behavior, include:
- Check links outside of start folder
- Crawl outside of start folder
- Crawl all Subdomains
Under XML Sitemaps, include:
- Crawl linked XML sitemaps
- Auto discover XML sitemaps via robots.txt
- Crawl these sitemaps: include your list of sitemaps (like the one we pulled above)
Now you’re ready to start your crawl! Good luck!
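Once the crawl finishes, it can be worth sanity-checking that nothing was missed. One rough approach (a sketch that assumes the child_sitemaps list gathered above) is to count the URLs declared across the child sitemaps and compare that total against the URL counts Screaming Frog reports:

import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Assumed: child_sitemaps holds the sitemap URLs listed earlier in this guide.
child_sitemaps = [
    "https://super-craft.nurse.com/jobs/sitemaps-1-section-blogAuthors-1-sitemap.xml",
    "https://super-craft.nurse.com/jobs/sitemaps-1-section-blogEntries-1-sitemap.xml",
    # ...and the rest of the list pulled from the sitemap index
]

total = 0
for sitemap_url in child_sitemaps:
    # Regular (non-index) sitemaps list individual pages as <url><loc> entries.
    with urllib.request.urlopen(sitemap_url) as resp:
        page_urls = ET.fromstring(resp.read()).findall(".//sm:url/sm:loc", NS)
    print(f"{sitemap_url}: {len(page_urls)} URLs")
    total += len(page_urls)

print(f"URLs declared across child sitemaps: {total}")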