Technical: Search Console ignoring "Do not index" directives?

j_holtslander

Member
Joined
Feb 5, 2019
Messages
68
Background
I have a server that hosts staging sites for many client websites as individual subdomains, e.g. client1.stagingserver.com, client2.stagingserver.com, client3.stagingserver.com, etc.

Each site contains both the production server's robots.txt file and a special robots-staging.txt file whose purpose is to prevent the staging sites from being indexed.

The way it works: if the domain the website is running on is a subdomain of [stagingserver.com], an Apache mod_rewrite rule in the site's .htaccess file substitutes robots-staging.txt for robots.txt, along the lines of the sketch below.
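The rule is roughly this (a simplified sketch, not our verbatim .htaccess; [stagingserver.com] is the placeholder from above):
Code:
# On any *.stagingserver.com host, serve robots-staging.txt
# whenever robots.txt is requested
RewriteEngine On
RewriteCond %{HTTP_HOST} \.stagingserver\.com$ [NC]
RewriteRule ^robots\.txt$ robots-staging.txt [L]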

So (in theory) if a crawler finds [client1.stagingserver.com] and tries to read robots.txt, it is served the robots-staging.txt file instead, which reads:
Code:
User-agent: *
Disallow: /
Noindex: /
But if a crawler finds the exact same website running on [client1.com], it gets the robots.txt file as normal, which reads:
Code:
User-agent: *
Disallow: /.git
Disallow: /cgi-bin
Disallow: /config

Sitemap: https://[client1.com]/sitemap.xml
The Problem
Google seems to somehow be ignoring the rewrite rule.

Humans are definitely getting the robots-staging.txt file's contents when requesting robots.txt from staging, but several clients' staging sites have been indexed by Google anyway and are now duplicate content. (Canonicals for the URLs are defined by the WordPress site's WP_SITEURL.)
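For reference, that constant is set in wp-config.php, something like this (placeholder domain again; the staging clones keep the production value):
Code:
// Staging clones keep the production URL here, so canonical
// tags still reference the production domain
define( 'WP_SITEURL', 'https://client1.com' );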

Anyone have any ideas why Google would be indexing these sites? Or a solution?

Yes, we could instead enable "Discourage search engines from indexing this site" in the staging site's WordPress Reading settings. But that setting could easily be cloned from staging to production by accident, which would be pretty disastrous.
We need a "set it and forget it" solution that's immune to memory lapses, which is why we've gone with this auto-substitution route.
 

JoshuaMackens

Local Search Expert
Joined
Sep 12, 2012
Messages
1,833
1) Google seems to deindex slowly. Did you accidentally allow them to be indexed, and are you now waiting for them to drop from the index?

2) Have you tried using a noindex header?
 

j_holtslander

Member
Joined
Feb 5, 2019
Messages
68
does it take into account any www or non-www versions?
Yeah, no issues there.

Nope. Afraid not.

Did you accidentally allow them to be indexed, and are you now waiting for them to drop from the index?
Unclear TBH, but yes, we're currently waiting for them to be dropped after requesting removals.

2) Have you tried using a noindex header?
Unsure how to do that so it'd apply only to staging and not to production. Ideas?
 

JoshuaMackens

Local Search Expert
Joined
Sep 12, 2012
Messages
1,833
That may be the problem. Maybe Google is honoring the noindex and it's only now taking effect, so they're dropping the pages from the index. I can tell you from experience that sometimes takes a long time. Months.

Yeah, I'm not sure how to do it either. If the site is in a subdirectory you can do it. I did it a while back but I've forgotten how.
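For your staging-vs-production split, maybe something along these lines would work (an untested sketch, assuming Apache with mod_setenvif and mod_headers usable from .htaccess; swap in your real staging domain):
Code:
# Flag requests that arrive on a staging host, then send a
# noindex header only on those responses
SetEnvIf Host "\.stagingserver\.com$" STAGING_HOST
Header set X-Robots-Tag "noindex, nofollow" env=STAGING_HOST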
 
