Dumb SEO Questions

(Entry was posted by Maria Patterson on this post in the Dumb SEO Questions community on Facebook, Thursday, June 21, 2018).

Handling parameters via Google Search Console vs robots.txt?

This month's Dumb SEO question from me

In a Magento site's robots.txt

This Allow is higher up in the robots.txt file
Allow: /*?p=

than this Disallow
Disallow: /*?p=*&

Can someone educate me on the syntax and hierarchy?

Which takes precedence?

Also would it be better to handle parameters in Google Search Console?


This question begins at 00:46:42 into the clip.

YOUR ANSWERS

Selected answers from the Dumb SEO Questions G+ community.

  • Travis Bailey: I don't have any Magento experience but it looks like they're trying to block weird URLs from indexation (fragmented, but not really... but fragmented). You can handle parameters in GSC, but that wouldn't help with other search engines.
  • Dave Elliott: It cascades down (it doesn't, oops, but more specific rules take precedence), so anything with multiple parameters (p=*&) won't be crawled but anything with just p= will be.
  • Maria Patterson: Ok, so this ?color=white&general_size=small wouldn't be crawled but this ?color=white on its own would be?
  • Dave Elliott: Presuming there is a p= somewhere. Then yes.
  • Dave Elliott: Just double checked. Order doesn't matter but more specific rules will take precedence, so it's still correct.
  • Maria Patterson: Dave Elliott, feeling dumb as there was no p in the example I gave, but I do get the hierarchy now, and the p may be pagination combined with other filtered parameters. Off to have a play with a combination of pages and filtering. I actually think I should have the Allow removed and have Disallow: /*?p=* but will check it out thoroughly first. Thanks again
  • Adam John Humphreys: Dave Elliott correct, a robots.txt rule recently blocked a site's business page name which included "Company X Premium Service" and confused it with a directory folder called Premium. Be extremely careful when working with robots.txt. Fortunately, Google Search Console is addressing a lot of things now that one might not see otherwise.
  • Stockbridge Truslow: Basically what this is saying is that it's fine to index pages that have a single "?p=xxx" querystring, but to disallow querystrings that start with ?p=XXX and then have a second variable after it. I'm not sure how Magento works and whether it actually has some code to sort the strings to ensure that p= is always the first variable. If so, then this should be fine. If not, then it should work most of the time, but it's not really an all-inclusive catch-all. If you have something like /page.php?x=12&p=30 <--- this would be allowed. But something like /page.php?p=30&x=12 would be disallowed, even though both of those URLs are going to end up showing the exact same page. (Again, Magento may have built-in ways to ensure that this never happens, but I don't know.) That said, the "Allow" directive in a robots.txt file is only useful when allowing something that has already been disallowed. I'd need to see the entire file to be sure, but if nothing above your Allow: /*?p= explicitly prohibits anything that might match that, then the Allow statement is unnecessary. If you can put in the entire robots.txt file, we may be able to break it down specifically.
  • Maria Patterson: Thanks Stockbridge Truslow. So far I have asked the dev company for the crawl delays to be removed, as this site has Cloudflare and "nasty" bots can be blocked via it.
  • Stockbridge Truslow: The crawl delay can be handy if you have high server loads. Google is pretty good about adapting to it automatically. Bing is fairly good about it. Not sure about the international ones though. In general I agree though: unless there are issues at hand, never tell a search engine to slow down. Let it do what it needs to do to get you represented. In theory the "Allow: /" is redundant and unnecessary, too, but shouldn't cause any issues.
  • Maria Patterson:
    User-agent: *
    Crawl-delay: 10
    Sitemap: https://www.domainname.com/sitemaps/sitemap.xml
    Allow: /
    User-agent: aranhabot
    Crawl-delay: 10
    Disallow: /
    # Allowable Index
    Allow: /*?p=
    Allow: /index.php/blog/
    Allow: /catalog/seo_sitemap/category/
    Allow: /catalogsearch/result/
    # Directories
    Disallow: /404/
    Disallow: /app/
    Disallow: /cgi-bin/
    Disallow: /downloader/
    Disallow: /includes/
    Disallow: /lib/
    Disallow: /magento/
    Disallow: /pkginfo/
    Disallow: /report/
    Disallow: /stats/
    # Paths (clean URLs)
    Disallow: /catalog/product_compare/
    Disallow: /catalog/category/view/
    Disallow: /catalogsearch/
    Disallow: /checkout/
    Disallow: /control/
    Disallow: /contacts/
    Disallow: /customer/
    Disallow: /customize/
    Disallow: /newsletter/
    Disallow: /poll/
    Disallow: /review/
    Disallow: /sendfriend/
    Disallow: /tag/
    Disallow: /wishlist/
    # Files
    Disallow: /cron.php
    Disallow: /cron.sh
    Disallow: /error_log
    Disallow: /install.php
    Disallow: /LICENSE.html
    Disallow: /LICENSE.txt
    Disallow: /LICENSE_AFL.txt
    Disallow: /STATUS.txt
    # Paths (no clean URLs)
    Disallow: /*?p=*&
    Disallow: /*?SID=
    Disallow: /*?ajax=1
  • Stockbridge Truslow: Okay... so, this is technically correct. For everything other than Google, the proper standard is "first rule wins", so the Allows should come first. For Google, though, they do it differently (of course). Their rule is "most specific wins", regardless of the order. So, for the Disallow: /catalogsearch/ you're blocking that folder and everything in it, but you've got Allow: /catalogsearch/result/ which allows the spider to crawl the search results page. This is valid and will work for all search engines. Your ?p= bit is saying the following: disallow any URL that has a ?p= along with ANY other variable, but allow crawling of URLs that just have a ?p= as the string. I can't be sure if that actually covers every possible scenario, but it makes sense and seems to cover the most probable scenarios. If it were me and I REALLY wanted to make sure it would be tamped down properly, I might add Disallow: /*?*&p= to the file. This would do the same thing as the first Disallow: /*?p=*& but cover you in the event that the p variable isn't first on the list. I'd say you're fine, though.
  • Maria Patterson: Stockbridge Truslow Thanks, I'm understanding it so much better and can confirm that the p is for pagination. Your advice to have Disallow: /*?*&p= is one I will follow through on. Thanks again
  • Michael Stricker: Nice thread! Just don’t want to miss GSC’s robots.txt Tester tool. Shows you whether your robots.txt file blocks Google web crawlers from specific URLs on your site. For example, you can use this tool to test whether the Googlebot-Image crawler can crawl the URL of an image you wish to block from Google Image Search.
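
For anyone who wants to poke at the precedence logic discussed in this thread without waiting on a crawl, here is a minimal Python sketch of the "most specific wins" behaviour described above for Google: the longest matching pattern decides, and an Allow wins a tie. It is only an illustration, not Google's parser. The helper names (pattern_to_regex, is_allowed) and the example URLs are made up for this sketch, it ignores user-agent groups and URL encoding, and other crawlers may resolve conflicts differently, so check real URLs in the robots.txt Tester before changing a live file.

```python
import re

def pattern_to_regex(pattern):
    """Turn a robots.txt path pattern into a regex (hypothetical helper).

    '*' matches any run of characters, a trailing '$' anchors the end,
    and everything else is treated as a literal prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(rules, url_path):
    """Return True if url_path may be crawled under longest-match precedence.

    rules is a list of (directive, pattern) pairs in any order.
    Assumption: the longest matching pattern wins, and Allow beats
    Disallow when two matching patterns have the same length.
    """
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing in this sketch
        if pattern_to_regex(pattern).match(url_path):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    # No matching rule at all means the URL is crawlable by default.
    return True if best is None else best[1]

# A few of the rules from the robots.txt posted in the thread.
RULES = [
    ("Allow", "/*?p="),
    ("Allow", "/catalogsearch/result/"),
    ("Disallow", "/catalogsearch/"),
    ("Disallow", "/*?p=*&"),
]

# The same rules plus the extra Disallow suggested in the thread.
RULES_WITH_EXTRA = RULES + [("Disallow", "/*?*&p=")]

if __name__ == "__main__":
    for path in [
        "/page.php?p=30",
        "/page.php?p=30&x=12",
        "/page.php?x=12&p=30",
        "/shirts?color=white&general_size=small",
        "/catalogsearch/result/?q=shoes",
    ]:
        print(path, is_allowed(RULES, path), is_allowed(RULES_WITH_EXTRA, path))
```

Under these assumptions, /page.php?x=12&p=30 slips through the original rules because neither ?p= pattern matches it, and it is only blocked once the extra Disallow: /*?*&p= line is added. The color/size URL is untouched either way, since it has no p parameter.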

View original question in the Dumb SEO Questions community on G+ (Thursday, June 21, 2018).
