Forum Moderators: Robert Charlton & goodroi


Can You Get Panda-fied If Googlebot Is Blocked?


Pjman

1:26 pm on May 22, 2013 (gmt 0)

10+ Year Member Top Contributors Of The Month



I'm working with a site that shows classic Panda symptoms (it lost 5-8 places on its main keywords after Panda).

The only problem is that the site is rich with content, so much so that it's hard to consider Panda a concern at all.

As I got further into the site, I realized there is a folder on it filled with data files containing survey results. The data is stored in tens of thousands of HTML files that are ridiculously repetitive. If the contents of that folder were taken into account, it would be 80% of the site. Needless to say, I locked the folder down; after the next big Panda refresh, the site should be back.

The problem is that the site has had a robots.txt directive blocking that folder from day 1. If Gbot followed the rules, the folder should never have been crawled.

Has anyone seen this before?

Are our robots.txt files just guidelines (and not rules) for Gbot now?
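For reference, the kind of robots.txt rule being described would look something like this (the folder name is a placeholder, not the actual directory from the site):

```
User-agent: *
Disallow: /survey-data/
```

A rule like this tells compliant crawlers not to fetch anything under that path; as discussed further down the thread, it does not by itself prevent the URLs from being indexed.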

goodroi

3:01 pm on May 22, 2013 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Are you sure the robots.txt was set up correctly? A simple typo can let Googlebot in.

Did the pages use Google+? That can allow Google to get past robots.txt.
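One way to sanity-check a robots.txt file for the kind of typo goodroi mentions is to feed it to Python's standard `urllib.robotparser` and ask whether a blocked URL is actually blocked. This is a minimal sketch; the folder name `/surveys/` and the example domain are hypothetical stand-ins.

```python
# Sketch: verify that a robots.txt rule really blocks Googlebot,
# using Python's standard-library robots.txt parser.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, mirroring the setup in the thread.
rules = """User-agent: *
Disallow: /surveys/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A URL inside the blocked folder should not be fetchable...
print(parser.can_fetch("Googlebot", "https://example.com/surveys/data1.html"))  # False
# ...while the rest of the site should be.
print(parser.can_fetch("Googlebot", "https://example.com/index.html"))          # True
```

If a typo crept in (say, `Disalow:` or a missing leading slash), the first call comes back `True`, which would explain Googlebot crawling the folder.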

Pjman

3:17 pm on May 22, 2013 (gmt 0)

10+ Year Member Top Contributors Of The Month



@goodroi

The robots.txt was correct and parsed properly; I verified it in the site's Webmaster Tools account.

No pages within that folder have any G+ or other Google product presence. They are just straight HTML and a few PDFs.

rish3

3:29 pm on May 22, 2013 (gmt 0)

10+ Year Member Top Contributors Of The Month



Are our robots.txt files just guidelines (and not rules) for Gbot now?

The new rule seems to be that they index it, but present the "A description for this result is not available because of this site's robots.txt – learn more." blurb if you run a query that would show the URL.

I agree with you...how did they index it in the first place if the robots.txt was there all along?

You can try a query like "site:yourdomain.com *** -abc123qaz", then go to the end of the results and click the link in the paragraph that appears there:

In order to show you the most relevant results, we have omitted some entries very similar to the ones already displayed.
If you like, you can repeat the search with the omitted results included.


Then, go to the end of the results again, and see if you have a boatload of the "A description for this result..." entries.

If so, I would:

1) Add <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> tags to the <head> section of the content pages.

2) Remove the robots.txt restriction, letting Google crawl the content so it sees the META robots tag.

3) Potentially use the GWT URL removal tool to get them out of the index faster. If they aren't in a directory that the tool can remove all at once, however, you're in for some pain. I ended up using the iMacros Firefox plugin to write a script that batch-removed URLs.
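Step 1 above would look roughly like this in each affected page (a sketch; the title and filename are placeholders):

```html
<!-- Placed in the <head> of every page in the survey-data folder.
     "noindex, nofollow" tells crawlers to drop the page from the
     index and not follow its links. -->
<head>
  <title>Survey results (placeholder)</title>
  <meta name="robots" content="noindex, nofollow">
</head>
```

Note the ordering matters: the meta tag only works once robots.txt stops blocking the folder, because a crawler that cannot fetch the page never sees the tag.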

tedster

5:33 pm on May 22, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



how did they index it in the first place if the robots.txt was there all along?

That description only means googlebot found a link to the URL somewhere, and then indexed that URL only. It doesn't mean that googlebot crawled that URL. Rather it means that Google decided that this URL might fit some query or other. That decision is based only on backlink factors like anchor text, title of the linking page, or other text near the backlink. Indexing and crawling are not the same thing.

Pjman

6:11 pm on May 22, 2013 (gmt 0)

10+ Year Member Top Contributors Of The Month



@rish3

Good idea on the GWT removal tool. I did it.

@tedster

Great point on the indexing/crawling distinction. My question then would be: can a crawl be used to determine quality and possibly attach a Panda penalty?

tedster

6:18 pm on May 22, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sure - in fact, I think you probably do need a crawl. Using only secondary clues seems really odd to me.

lucy24

8:52 pm on May 22, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you were a search engine, wouldn't it drive you bonkers to know a site has thousands of pages you can't see? Remove the robots.txt line and replace it with a meta "noindex" in the pages themselves. That's assuming they are all created dynamically, in spite of the .html extension, so you only need to add the line once. If they really are static html pages that already exist, do it with a Header directive instead.
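The "Header directive" lucy24 mentions for static files is typically done with an X-Robots-Tag response header. A minimal sketch, assuming an Apache server with mod_headers enabled; the directory path is a placeholder:

```apache
# Send "X-Robots-Tag: noindex, nofollow" with every static file in
# the survey-data folder. Equivalent to the meta robots tag, but
# works for files (like PDFs) where you can't edit a <head> section.
<Directory "/var/www/site/survey-data">
    Header set X-Robots-Tag "noindex, nofollow"
</Directory>
```

As with the meta tag, the robots.txt block on the folder has to be lifted first so the crawler can actually request the files and see the header.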