Forum Moderators: Robert Charlton & goodroi

Google Removal tool

Removing duplicate pages I did not know existed

zeus

12:26 pm on Aug 22, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is it safe to remove the page
www.mydomain.com/g/pi/pi2?full=0

because the same page already exists? The real page is www.mydomain.com/g/pi/pi2, without ?full=0.

I'm not sure how that happened, but Google somehow spidered such a URL.

zeus

3:00 pm on Aug 22, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



User-agent: Googlebot
Disallow: /*?full=0

Would that be OK, and then submit it to the removal tool? I have tried this before, but it was a long time ago; that was my question.

tedster

4:14 pm on Aug 22, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That looks right to me, zeus. If in doubt, open up a Webmaster Tools (Sitemaps) account with Google and validate your robots.txt there before pushing it live. Note that this usage of wildcard characters is not part of the robots.txt standard, but Google has extended the standard for their own bot.

then upload to the removal tools

If you are going to use the removal tool and not just wait for Google to sort things out, then it looks like you need to be sure that the url resolves to a 404.

...use our automatic URL removal system. We'll accept your removal request only if the page returns a true 404 error via the http headers. Please ensure that you return a true 404 error even if you choose to display a more user-friendly body of the HTML page for your visitors. It won't help to return a page that says "File Not Found" if the http headers still return a status code of 200, or normal.

[google.com...]
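Before submitting a removal request, it's worth confirming what the server actually sends for the URL. A minimal Python sketch (the example URL is the thread's hypothetical one; substitute your own):

```python
# Check the HTTP status code a URL really returns, since Google's
# removal system only accepts a "true 404" in the headers.
from urllib.parse import urlsplit
import http.client

def status_of(url):
    """Issue a HEAD request and return the raw HTTP status code."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    conn.request("HEAD", path)
    status = conn.getresponse().status
    conn.close()
    return status

def removal_eligible(status):
    """A "File Not Found" body served with status 200 is a soft 404
    and will be rejected; only a real 404 status qualifies."""
    return status == 404

# e.g. removal_eligible(status_of("http://www.mydomain.com/g/pi/pi2?full=0"))
```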

zeus

4:47 pm on Aug 22, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



hmm, when I try to submit my new robots.txt in the removal tool I get

URLs cannot have wild cards in them (e.g. "*"). The following line contains a wild card:
DISALLOW /*?full=0$

But in Google's guidelines it says exactly that: DISALLOW /*?full=0$

It's a real pain. It's not even a real page, and I have NO idea how they got that link with the ?full=0 at the end. Why are all those weird problems ALWAYS with Google?

Oliver Henniges

8:25 pm on Aug 22, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Did you check where that link came from, maybe by searching for the complete URL in quotes? (Try Yahoo if Google doesn't give an answer. ;)

I found the safest way to delete more than a dozen URLs was to generate and FTP dummy pages with nothing but a meta noindex tag in the head section. In some cases this is much easier than using Google's delete-URL form.
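A dummy page of the kind described needs nothing more than this (a minimal sketch; the noindex value is what tells Google to drop the URL):

```html
<html>
<head>
  <!-- Tell robots not to index this URL -->
  <meta name="robots" content="noindex">
</head>
<body></body>
</html>
```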

I am not really sure, but I suspect that if you forbid such a page via robots.txt, the spider simply doesn't visit it, while the page is still kept in the cache.

As I mentioned elsewhere, the same holds true for 404 pages: they are kept in the index and cache far too long.

Dead_Elvis

8:37 pm on Aug 22, 2006 (gmt 0)

10+ Year Member



Yep, unfortunately you can't use wildcards when blocking URLs for the removal tool. I learned this recently, and it made the whole process a lot harder.

I suppose they're trying to protect people from making really egregious errors with the wildcards ;)

I ended up writing out a whole lot of disallow directives in my robots.txt, and then after Google had successfully removed the pages I added the wildcards back in to keep the Big "G" out for good.
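Concretely, that stopgap robots.txt might look something like this (the paths are hypothetical examples, not taken from anyone's actual site):

```
User-agent: Googlebot
Disallow: /g/pi/pi2?full=0
Disallow: /g/pi/pi3?full=0
```

Once the removal tool has accepted those URLs, the explicit lines can be swapped back for the single wildcard form, Disallow: /*?full=0, which Googlebot itself understands.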

It works :)

zeus

9:45 pm on Aug 22, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



hmm, the problem is that there are no links to those weird pages, so I can't even place a meta tag; the meta tag would also end up on the correct page.

I think it's incredible: they have no space on the servers and a lot of sites have lost pages, yet they index pages that in a way don't exist. Well, that's another topic.

I really don't know what to do now; I see about 500 pages with this extra URL as supplemental.

Dead_Elvis

10:16 pm on Aug 22, 2006 (gmt 0)

10+ Year Member



Hehe... you can't tell me you really believe that multi-billion dollar Google doesn't have space on their servers...

You guys crack me up :)

As far as the links go, I'd spend some time figuring out how to stop your site from creating those other URLs in the first place, or how to use .htaccess to redirect them all to the correct URLs.

If you can do a redirect via .htaccess, you should then be able to do the removal.
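For what it's worth, a sketch of such an .htaccess rule (assuming Apache with mod_rewrite enabled, and that full=0 is the only stray parameter):

```apache
RewriteEngine On
# If the query string is exactly full=0, redirect permanently to
# the same path with the query string stripped.
RewriteCond %{QUERY_STRING} ^full=0$
RewriteRule ^(.*)$ /$1? [R=301,L]
```

The trailing "?" in the substitution is what drops the query string, and R=301 makes the redirect permanent.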

Good luck!

zeus

10:47 pm on Aug 22, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have fixed the weird URL, but those pages are still in the index, and if you type the URL they also appear.

About server space: that's no joke, they have even said so themselves. Also, when you think about their problems (omitted results, supplemental results, not being able to see all your own pages beyond 1000), that's all to save space.

Dead_Elvis

11:30 pm on Aug 22, 2006 (gmt 0)

10+ Year Member



Hmmm, if they still come up when you type the URL into a browser, then I'd say you haven't fixed the problem ;)

Sounds like a major pain in the butt...

I've heard the rumor that Google doesn't have enough server space, but I personally think it's the silliest thing I've ever heard.

Anyway, if your browser is still reaching those URLs, I think you're still in trouble.

Oliver Henniges

6:45 am on Aug 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> www.mydomain.com/g/pi/pi2?full=0

Yes, maybe it is difficult to design a dummy page for pages with such GET variables. However, what first comes to mind is, of course, that somewhere in your scripts you in fact DO generate links of that kind.

If you are 100% sure the links come from outside, you might add some PHP code at the top of your pages, provided they run through the parser anyway.

if (isset($_GET['full'])) {
    // Serve an empty page telling robots to drop this URL from the index
    echo '<html><head><meta name="robots" content="noindex,noarchive"></head><body></body></html>';
} else {
    // your normal page
}

But what strikes me is that in the given example the page itself has no extension like .html or .php.

zeus

10:04 am on Aug 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have fixed that now, so the site doesn't generate such a page any more, but the main problem is how I remove those pages from Google.

Oliver Henniges

7:09 pm on Aug 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The code snippet I gave might work, but only if the rest of the page is generated by PHP as well. As soon as the spider sends an HTTP GET request for the given URL with that 'full' variable, the script serves a page with the necessary meta tags. The else section covers the normal situation.

If you're writing pure HTML, you might even leave out the last two brackets and add your original code there, but then the page would literally be ill-formed according to W3C standards, because I assume you'd end up with a second head section. But since you're only aiming at getting it out of the index, that shouldn't really matter.

However, I would never experiment on my own website with other people's code that I did not fully understand myself. I guess the same holds true for you. If you know a trustworthy person with some knowledge of PHP, the whole thing shouldn't be too complicated.

Your first choice, of course, should be the Google URL removal tool. Only if you have hundreds of pages to remove from the index might it be convenient to do some scripting.