Forum Moderators: Robert Charlton & goodroi

Google is guessing, I guess

     
6:49 pm on Sep 16, 2014 (gmt 0)

Junior Member

5+ Year Member

joined:July 29, 2014
posts:47
votes: 0


Hi there,

You've probably read about "Google's 302 Redirect Problem" before:
[webmasterworld.com ]

Just wanted to share some strange behavior that I think may be new to you.

I have a Rhymer search tool: when a user searches for a word, he gets a list of words that rhyme with it. Each of these words USED to have a link to mysite/dic.php?word=SOMEWORD. These pages would do some tracking and then redirect the user to a dictionary site (third party).

When I noticed that Google was indexing these dic.php pages with the titles and descriptions of the dictionary site, I decided to remove the links to dic.php and block the dic.php file in my robots.txt. At this point my search results were no longer linking to the dictionary site.

It didn't work, so I thought that Google was ignoring robots.txt and accessing dic.php for some strange reason (I had no links to this file).
So I edited dic.php and changed the 302 redirect to point to my rhymes search page. Surely there was no way for Google to index my dic.php with third party titles and descriptions... WRONG!

So, when I do a site: search (showing results listed by date) I'm getting these dic.php pages with third party titles. When I click on these results, I get redirected to my rhymes search (as expected). When I check the Google cache for these pages, I THEN get the third party site!

Now, how could I be getting new dic.php pages indexed day after day?
I did a site: search (showing results listed by date) on the dictionary site and found out that the new pages I was getting were actually new pages of the third party site! Ahhhh!

So
Google finds and indexes
thirdpartysite.tld/search.php?word=NEWWORD

And then GUESSES I have a page
mysite.tld/dic.php?word=SOMEWORD

...I guess

Note: I don't rank (and don't want to) for these dic.php pages; they only show on site: searches.
I only have about 60 dic.php pages indexed at any given time; it seems they get dropped pretty fast.
7:23 pm on Sept 16, 2014 (gmt 0)

Junior Member from IT 

5+ Year Member

joined:Oct 29, 2013
posts: 145
votes: 0


Mmh... might be some naughty black hat redirect. :-/ Can't tell for certain, but it looks like it to me.
3:03 am on Sept 17, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 11, 2007
posts:774
votes: 3


You are aware that when URLA 302 redirects to URLB, that is a temporary redirect... so the engines leave URLA (your dic.php URL) indexed but associate the content at URLB (the dictionary site in your case) with URLA (your page). This would explain why the titles from the other site are being associated with your URLs.

It doesn't however explain how new URLs from your site that are blocked by robots.txt continue to get indexed. How are you blocking them?

Disallow: /dic.php

?
3:57 am on Sept 17, 2014 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4568
votes: 367


Blocking in robots.txt is a request, not a guarantee and it is honored while crawling your site. But if those links are found on other sites and not robots.txt blocked or nofollowed, Google can and does follow their link, and lacking a noindex tag they can be indexed.

If you don't want them indexed, use noindex, either as a meta tag on the generated pages or via X-Robots-Tag headers. Don't block the pages in robots.txt, or Google will take a long time (crawling from other sites' links) to see your new noindex tags.
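As a rough illustration of the noindex advice above, here is a minimal sketch in Python. The original dic.php is PHP, so this is only a stand-in. It shows a handler that answers with a small page carrying both forms of the directive: the `X-Robots-Tag: noindex` response header and the robots meta tag. The names `dic_app` and `call_app` are hypothetical, invented for this example.

```python
from wsgiref.util import setup_testing_defaults


def dic_app(environ, start_response):
    """Hypothetical stand-in for dic.php: serve a tiny page flagged noindex."""
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        # Header form of the directive. A crawler can only see it if the
        # URL is NOT disallowed in robots.txt.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    # Meta tag form of the same directive, for crawlers that read the HTML.
    body = (
        "<html><head>"
        '<meta name="robots" content="noindex">'
        "</head><body>Nothing to see here.</body></html>"
    )
    return [body.encode("utf-8")]


def call_app(app, path, query=""):
    """Call a WSGI app in-process and return (status, headers, body)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    environ["QUERY_STRING"] = query
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body


if __name__ == "__main__":
    status, headers, body = call_app(dic_app, "/dic.php", "word=SOMEWORD")
    print(status, headers.get("X-Robots-Tag"))
```

The sketch answers with a 200 page rather than a redirect. That is a choice made for this example, so that the crawler fetches a response that carries the directive itself.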
6:41 am on Sept 17, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


I agree with what ZydoSEO says. I have also seen cases where the target page content is indexed under the redirecting page's URL; in the case I saw, it was a redirect script page that used meta refresh as the redirect method.

..blocked the dic.php file in my robots.txt

Have you tested robots.txt via WMT to make sure the URL is blocked? You can test in two ways: one is to do "Fetch as Googlebot" and the other is to add the URL in the box in the robots section of WMT.

It is also worth noting that even when you add a URL to robots.txt, it may take Googlebot up to 24 hours to fetch robots.txt again. Hence, if you added the URL to robots.txt and then straight away changed your redirect to another page that you decided to monitor, it can happen that Google fetched that page before fetching the new version of robots.txt.

So what I would do is verify via WMT that robots.txt is blocking dic.php (including dic.php with any kind of parameters appended, i.e. there is no $ at the end of the directive) and, if this is confirmed, do another test where I would redirect dic.php somewhere else and monitor what happens there.
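For a quick local sanity check alongside WMT, the prefix matching described above can be sketched with Python's standard `urllib.robotparser`. The robots.txt content, the `mysite.tld` URLs and the helper name `is_blocked` are assumptions made for this example. The standard-library parser also does plain prefix matching and does not implement Google's `$` and `*` extensions, so it is no substitute for Google's own tester.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: a plain prefix rule with no trailing "$".
ROBOTS_TXT = """\
User-agent: *
Disallow: /dic.php
"""


def is_blocked(robots_txt, url, agent="Googlebot"):
    """Return True if the given robots.txt text disallows url for agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://mysite.tld/dic.php",
        "http://mysite.tld/dic.php?word=SOMEWORD",
        "http://mysite.tld/rhymes.php",
    ):
        print(url, "blocked" if is_blocked(ROBOTS_TXT, url) else "allowed")
```

Because the rule is a prefix, it covers dic.php with or without a query string, while other paths on the site stay crawlable.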

With regards to Google guessing your URLs based on another site's URLs: for some time now Googlebot has been filling in forms to see whether this will discover new content, so this may be what is happening, rather than Google guessing that the link exists on your site.

Crawling through HTML forms
April 11, 2008
http://googlewebmastercentral.blogspot.com/2008/04/crawling-through-html-forms.html [googlewebmastercentral.blogspot.com]
In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn't find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form.
(...)
If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.


We had previous discussion on Google filling forms and discovering URLs in this way here:

Googlebot Now Crawls via HTML Forms
Apr 12, 2008
http://www.webmasterworld.com/google/3625084.htm [webmasterworld.com]





@not2easy, I think your post above refers to robots meta tag "nofollow" and/or rel="nofollow" attribute rather than robots.txt.

Just to clarify, whilst blocking in robots.txt is a request, Googlebot does honour it so it does not matter where it finds links (on the site or externally), it will not fetch it. But it may still index it.
11:16 am on Sept 17, 2014 (gmt 0)

Junior Member

5+ Year Member

joined:July 29, 2014
posts:47
votes: 0


Sorry, forgot to mention: when I edited dic.php I also unblocked the file. I removed the line from robots.txt.

Current situation:

dic.php is crawlable, not blocked

dic.php redirects to my rhymer search tool

dic.php is not linked internally or externally; only Google knows about it from past crawls

I don't link to the dictionary site anymore, directly or with redirects

Hope you understand now: there should be no reason to have new dic pages because there are no new links to the dic file. It seems like Google is not fetching these pages, because they now point to my site and should have a different title, description and cache.
2:10 pm on Sept 17, 2014 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4568
votes: 367


I think ZydoSEO may be right too. To clarify, my post was regarding this point in the OP:
Each of these words USED to have a link to mysite/dic.php?word=SOMEWORD. These pages would do some tracking and then redirect the user to a dictionary site
so what I posted was from Google and it referred to noindex, not nofollow:
However, robots.txt Disallow does not guarantee that a page will not appear in results: Google may still decide, based on external information such as incoming links, that it is relevant. If you wish to explicitly block a page from being indexed, you should instead use the noindex robots meta tag or X-Robots-Tag HTTP header. In this case, you should not disallow the page in robots.txt, because the page must be crawled in order for the tag to be seen and obeyed.

[developers.google.com...]

The problem with getting a URL that Google has found out of their index is that they never seem to forget what once was found. Their suggestion is to let it be found with noindex attached.