Crawling A Site

Crawling a Staging Environment

Extracting Page Content

 

Screaming Frog - https://www.screamingfrog.co.uk/seo-spider/user-guide/configuration/

  • Configuration > User Agent > Googlebot Smartphone (since Google now uses mobile-first indexing)

    • The user-agent configuration allows you to switch the user-agent of the HTTP requests made by the SEO Spider. By default the SEO Spider makes requests using its own ‘Screaming Frog SEO Spider’ user-agent string.

    • However, it has inbuilt preset user agents for Googlebot, Bingbot, various browsers and more. This allows you to switch between them quickly when required. This feature also has a custom user-agent setting which allows you to specify your own user agent.

    • Details on how the SEO Spider handles robots.txt can be found in the Screaming Frog user guide linked above.
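The effect of switching user-agents can be sketched with Python's stdlib urllib. The Googlebot Smartphone string follows Google's published format (the embedded Chrome version drifts over time), and the Screaming Frog version string is an assumption for illustration:

```python
import urllib.request

# Preset user-agent strings a crawler can emulate. The Googlebot Smartphone
# string is Google's published format; the Screaming Frog string is an
# illustrative assumption, not the tool's exact current value.
USER_AGENTS = {
    "googlebot-smartphone": (
        "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile "
        "Safari/537.36 (compatible; Googlebot/2.1; "
        "+http://www.google.com/bot.html)"
    ),
    "screaming-frog": "Screaming Frog SEO Spider/19.0",
}

def build_request(url: str, agent: str) -> urllib.request.Request:
    """Build an HTTP request that presents the chosen crawler user-agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENTS[agent]})
```

Pointing `build_request` at a staging URL and opening it with `urllib.request.urlopen` would then fetch the page exactly as that user-agent.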

  • Configuration > Speed

    • The speed configuration allows you to control the speed of the SEO Spider, either by number of concurrent threads, or by URLs requested per second.

  • Configuration > Speed > Max Threads

    • The ‘Max Threads’ option can simply be left alone when you throttle speed via URLs per second.

    • Increasing the number of threads allows you to significantly increase the speed of the SEO Spider. By default the SEO Spider crawls with 5 threads so as not to overload servers.

    • Please use the threads configuration responsibly, as setting the number of threads high to increase the speed of the crawl will increase the number of HTTP requests made to the server and can impact a site’s response times. In very extreme cases, you could overload a server and crash it.

    • We recommend approving a crawl rate and time with the webmaster first, monitoring response times and adjusting the default speed if there are any issues.

  • Configuration > Speed > Max URL/s

    • When reducing speed, it’s always easier to control via the ‘Max URL/s’ option, which is the maximum number of URL requests per second. For example, setting ‘Max URL/s’ to 1 would mean crawling at 1 URL per second.
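A minimal sketch of what a URL/s cap does, assuming a simple sleep-based limiter (the names are illustrative, not Screaming Frog's internals):

```python
import time

class RateLimiter:
    """Cap requests to max_per_second, like the SEO Spider's 'Max URL/s'."""

    def __init__(self, max_per_second: float):
        self.min_interval = 1.0 / max_per_second  # seconds between requests
        self.last = 0.0

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = time.monotonic()
        delay = max(0.0, self.last + self.min_interval - now)
        if delay:
            time.sleep(delay)
        self.last = time.monotonic()
        return delay
```

Calling `wait()` before each fetch enforces the cap, so `RateLimiter(0.2)` would space requests at least 5 seconds apart, matching the RA settings below.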

  

ON SEMRUSH

For RA, set Site Audit Settings to "respect robots.txt" when doing a crawl

Give Henry Fu a heads-up before doing crawls

https://www.semrush.com/kb/539-configuring-site-audit
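What "respect robots.txt" means in practice can be sketched with Python's stdlib robotparser; the rules and URLs below are illustrative, not the site's actual robots.txt:

```python
from urllib import robotparser

# Before fetching a URL, a respectful crawler checks it against the
# site's robots.txt rules. These example rules are assumptions.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /checkout/",
])

rp.can_fetch("*", "https://example.com/checkout/cart")  # False - disallowed
rp.can_fetch("*", "https://example.com/courses")        # True - allowed
```

In a real crawl, `rp.set_url(".../robots.txt")` plus `rp.read()` would load the live file instead of the hard-coded lines.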

 

SCREAMING FROG Settings - RA

Max threads: 5

Check the Limit URL/s box

Max URL/s 0.2

Max URL/s 0.7 (updated 3.13.2023)

 

Exclusion List - Relias Academy

Exclusion How-To Article

  • Configuration → Exclude

  • .*emailAFriend.*
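Screaming Frog's Exclude list treats each line as a regex matched against the full URL, which is why the pattern needs a leading and trailing `.*`. A quick check (the URLs are illustrative, not real RA paths):

```python
import re

# Each exclude pattern must match the ENTIRE URL, as in Screaming Frog's
# Exclude configuration, hence the leading/trailing .*
EXCLUDE_PATTERNS = [r".*emailAFriend.*"]

def is_excluded(url: str) -> bool:
    """Return True if any exclude pattern matches the whole URL."""
    return any(re.fullmatch(p, url) for p in EXCLUDE_PATTERNS)
```

Note that the original `.*emailAFriend*` (no final dot) would only match URLs ending in "emailAFrien" plus optional "d"s, silently missing the email-a-friend URLs it was meant to exclude.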