
Google News Archive Forum

    
Disallowed robots.txt URLs appearing in SERPs
URL-only results attract hackers to my sites
celenoid




msg:45861
 11:39 pm on May 23, 2004 (gmt 0)

I have been experiencing this for a while... Pages excluded by robots.txt have been constantly appearing in SERPs as url-only listings and can be found by searches matching parts of the URL string.

I've noticed that shortly after a number of my new sites were crawled by G, many of them were hit with an attempted PHP hack.

My logs tell me that the hacker entered the site via urls that I was able to find in the Google index with a search for part of the query string - for example: "/email.php?page" (undoubtedly used by the hackers to identify my sites as potential targets).

I understand that these pages have not themselves been crawled, but isn't it about time G got it right and stopped listing the URLs of excluded resources?

Is there anything I can do in future to stop these results from appearing in the SERPs?

[edited by: Marcia at 2:38 am (utc) on May 24, 2004]

 

celenoid




msg:45862
 11:25 pm on May 24, 2004 (gmt 0)

*cough* Has anyone else been experiencing this problem?

jdMorgan




msg:45863
 11:55 pm on May 24, 2004 (gmt 0)

Yes, many.

Have you tried a site search [google.com] on the subject?

Jim

celenoid




msg:45864
 1:05 am on May 25, 2004 (gmt 0)

Thanks Jim.

It's comforting to realise that I'm not alone in this... Not so comforting to realise that there's no evident solution to this problem..!

On some of my sites roughly half the indexed URLs are (and have always been) explicitly excluded by robots.txt. I notice they have even been given PageRank.

sigh.

jdMorgan




msg:45865
 1:09 am on May 25, 2004 (gmt 0)

Read a few threads from that search... There is a solution.

Jim

[edit] OK, found it here [webmasterworld.com]. [/edit]

TheDave




msg:45866
 1:14 am on May 25, 2004 (gmt 0)

The solution is to let the bot crawl the pages, but include a meta noindex tag in the head.

<meta name="ROBOTS" content="NOINDEX">
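A rough sketch of why the page has to stay crawlable for this to work: a bot that obeys robots.txt never fetches a disallowed page, so it never gets to read the noindex tag on it. The check below uses only the Python standard library. The function name `noindex_is_visible`, the `Googlebot` agent string and the example.com URL are illustrative assumptions, not anything Google publishes for this purpose.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        # Both the name and the content values are case-insensitive.
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def noindex_is_visible(robots_txt, page_html, url, agent="Googlebot"):
    """True only if the bot may fetch the URL AND the page says noindex.

    Hypothetical helper: robots_txt and page_html are passed in as
    strings, so nothing is fetched over the network.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    finder = RobotsMetaFinder()
    finder.feed(page_html)
    return rules.can_fetch(agent, url) and "noindex" in finder.directives


page = '<html><head><meta name="ROBOTS" content="NOINDEX"></head></html>'
url = "http://example.com/email.php?page=1"  # made-up example URL

# Still disallowed: the tag is on the page but the bot never sees it.
blocked = noindex_is_visible("User-agent: *\nDisallow: /email.php", page, url)
# Disallow rule removed: the bot can fetch the page and read the tag.
crawlable = noindex_is_visible("User-agent: *\nDisallow:", page, url)
```

With the Disallow line still in place the check comes back false even though the tag is on the page, which is the catch with this technique.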

celenoid




msg:45867
 1:19 am on May 25, 2004 (gmt 0)

Thank you both :)

This problem is doing my brain in, but the solution makes sense -- in a messy kind of way...

If the only way to get URLs out of the index is to ALLOW them to be crawled, does robots.txt have any real use other than to limit the bandwidth that crawlers consume?

Robert Thivierge




msg:45868
 2:03 am on May 25, 2004 (gmt 0)

I had a similar problem (url-only listings of disallowed, dynamically generated pages). In my case, using the "Google Automated Removal" feature at
[services.google.com:8882...]

worked fine, and removed the "url-only listings" within a day or two (based on robots.txt disallow).
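Since the removal is based on the robots.txt disallow rules, it may be worth confirming locally which URLs those rules actually cover before submitting. A minimal sketch with Python's standard `urllib.robotparser`; the helper name `disallowed_urls`, the agent string and the sample URLs are assumptions for illustration, and the parser's prefix matching may not mirror Google's handling in every case.

```python
from urllib.robotparser import RobotFileParser


def disallowed_urls(robots_txt, urls, agent="Googlebot"):
    """Return the URLs that the given robots.txt text blocks for agent.

    Hypothetical helper: robots_txt is the file content as a string,
    so nothing is fetched over the network.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [u for u in urls if not rules.can_fetch(agent, u)]


robots_txt = "User-agent: *\nDisallow: /email.php\n"
candidates = [
    "http://example.com/email.php?page=contact",  # made-up URLs
    "http://example.com/index.html",
]
to_remove = disallowed_urls(robots_txt, candidates)
```

Any url-only listing that does not show up in `to_remove` is not covered by the current rules, so a removal based on robots.txt would presumably leave it alone.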

celenoid




msg:45869
 3:20 am on May 25, 2004 (gmt 0)

Thanks for the tip Robert. I'm testing both techniques using different sites... will see what I come up with! :)

celenoid




msg:45870
 1:02 am on May 26, 2004 (gmt 0)

Just 1 day later...

Removal Technique 1 (site A)
[services.google.com:8882...]
Result: robots.txt still disallows the URLs, and ALL disallowed URLs are OUT of the index.

Removal Technique 2 (site B)
Result: the previously disallowed URLs are now crawlable by G (with noindex tags), and ALL of them are still IN the index.

I'm off to use the removal tool again.......!

jdMorgan




msg:45871
 1:15 am on May 26, 2004 (gmt 0)

Use both.

Jim

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved