Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

    
Googlebot Gone Crazy - error log overload!
Wally_Books




msg:4451937
 7:25 pm on May 10, 2012 (gmt 0)

I've never noticed this before but googlebot has been looking for the weirdest pages on our site

[Thu May 10 09:54:00 2012] [error] [client 66.249.68.152]File does not exist:
/public_html/Sleepless+Night+2011+BRRip+\xd9\x85\xd8\xaa\xd8\xb1\xd8\xac\xd9\x85-artid10511.html

\xd8\xa8\xd8\xb1\xd9\x86\xd8\xa7\xd9\x85\xd8\xac+\xd8\xa7\xd9\x84\xd8\xb7\xd8\xa8\xd8\xb9\xd8\xa9+\xd8\xa7\xd9\x84\xd8\xa3\xd9\x88\xd9\x84\xd9\x89+3.4.2012-artid9987.html

Susanna_Pymble.html

/newsletter3.htm

/castor2008

Piles and piles of weird pages that have never been on our site. No clue where they came from.
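
For reference, the \xNN sequences in an Apache error log are the raw bytes of the requested path written out as hex escapes. Below is a minimal Python sketch for turning them back into text. It assumes the bytes are UTF-8, and the helper name decode_log_escapes is made up for this illustration.

```python
import re

def decode_log_escapes(path: str) -> str:
    """Turn \\xNN escapes copied from an error log back into readable text."""
    raw = re.sub(
        rb"\\x([0-9a-fA-F]{2})",
        lambda m: bytes([int(m.group(1), 16)]),  # one escape -> one raw byte
        path.encode("utf-8"),
    )
    # Assumption: the escaped bytes are UTF-8; anything undecodable becomes U+FFFD.
    return raw.decode("utf-8", errors="replace")

logged = r"Sleepless+Night+2011+BRRip+\xd9\x85\xd8\xaa\xd8\xb1\xd8\xac\xd9\x85-artid10511.html"
print(decode_log_escapes(logged))
```

Run on the first path above, the escaped part comes out as a five-letter word in Arabic script, which makes it easier to search for the site the URL really belongs to.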

 

tedster




msg:4451979
 9:03 pm on May 10, 2012 (gmt 0)

This is the kind of thing that can drive us crazy. By any chance does your site have a site search function - or any other kind of form input for user navigation? If so, googlebot often starts trying some very obscure submissions in an attempt to discover what's going on.

Essentially in these cases, Google is trying to find "deep" content that normally doesn't see the light of day.

Wally_Books




msg:4451989
 9:32 pm on May 10, 2012 (gmt 0)

Our site has been up since 1999. There is nothing on our site to lead Google on some weird quest. It almost appears that Google has scrambled various sitemaps or something. Looking for pages in Spanish and German now.

arbeitszeugnis/arbeitszeugnis-formulierungen-14.html

Strange

I did find some of the pages on other sites.

Wally_Books




msg:4451998
 9:38 pm on May 10, 2012 (gmt 0)

/forum/index.php?/topic/212-%D1%81%D1%82%D1%80%D0%BE%D0%B8%D1%82%D0%B5%D0%BB%D1%8C%D0%BD%D1%8B%D0%B5-%D0%BC%D0%B0%D1%82%D0%B5%D1%80%D0%B8%D0%B0%D0%BB%D1%8B-%D0%BC%D0%B0%D1%82%D0%B5%D1%80%D0%B8%D0%B0%D0%BB%D0%BE%D0%B2%D0%B5%D0%B4%D0%B5%D0%BD%D0%B8%D0%B5-%D1%87%D0%B0%D1%81/

Our site is strictly html

lucy24




msg:4452030
 12:12 am on May 11, 2012 (gmt 0)

%D1%81%D1%82%D1%80%D0%BE%D0%B8%D1%82%D0%B5%D0%BB%D1%8C%D0%BD%D1%8B%D0%B5-%D0%BC%D0%B0%D1%82%D0%B5%D1%80%D0%B8%D0%B0%D0%BB%D1%8B-%D0%BC%D0%B0%D1%82%D0%B5%D1%80%D0%B8%D0%B0%D0%BB%D0%BE%D0%B2%D0%B5%D0%B4%D0%B5%D0%BD%D0%B8%D0%B5-%D1%87%D0%B0%D1%81

Are you serious? That's Cyrillic.* What would it be doing in the query string of an English-language site anyway?

g### translate sez it means "Building Materials Materials hour" --but possibly you chopped off the last word :)

Edit: I had to look up the OP. It's Arabic (script, not necessarily language). I don't have a function for that.

Further edit:
arbeitszeugnis/arbeitszeugnis-formulierungen-14.html

Holy ###. Those could both be (mis)translations of the same thing. Now I am going to investigate the Arabic.

Returning:
G.T. says rather hilariously that the first bit means "interpreter". For the longer piece they offer "Bernam first edition" and then helpfully ask if I meant {scribble, scribble}** which would mean "program of the first edition". This does not strike me as an improvement in sense ;)


* I know this because I occasionally get visitors from Yandex Image Search so I've got a little function to retro-convert their queries.
** I don't read Arabic. I can't even sound it out.

[edited by: lucy24 at 12:31 am (utc) on May 11, 2012]
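
A "retro-convert" helper of the kind mentioned above could look roughly like the following Python sketch. This is not lucy24's actual function; the name scripts_in is hypothetical, and it assumes the percent-escapes are UTF-8, which is what urllib.parse.unquote expects by default.

```python
import unicodedata
from urllib.parse import unquote

def scripts_in(encoded: str) -> set:
    """Percent-decode a URL fragment and name the scripts of its letters."""
    decoded = unquote(encoded)  # treats the escaped bytes as UTF-8
    found = set()
    for ch in decoded:
        if ch.isalpha():
            # The first word of the Unicode character name is a rough script
            # label, e.g. CYRILLIC, ARABIC, CJK, LATIN.
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found

print(scripts_in("%D1%81%D1%82%D1%80%D0%BE%D0%B8%D1%82%D0%B5%D0%BB%D1%8C%D0%BD%D1%8B%D0%B5"))
```

The first word of a character name is only a crude label, but it is enough to tell Cyrillic from Arabic from Chinese at a glance without reading any of them.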

Wally_Books




msg:4452035
 12:30 am on May 11, 2012 (gmt 0)

more examples, hot off the press. This would be funny if we were getting any traffic

/tag/%E9%98%BF%E6%A3%AE%E7%BA%B3%20%E9%AB%98%E6%B8%85%20%E5%A3%81%E7%BA%B8

/GPS/Spessart/

/n103-genprokuratura_poruchila_proverit_gosobvinitelya_po_delu_bychkova.html

/p50.html/%C2%8Ai%C2%96-rgb

/it-am-arbeitsplatz/index.html

at least bot is looking for html sometimes

lucy24




msg:4452043
 1:00 am on May 11, 2012 (gmt 0)

Ooh, it's like when GWT went haywire and kept putting that one section in different languages. Or, for variety's sake, keeping it in English no matter what language you chose for the rest of the page. (I was in this second group.)

Top bit is Chinese. GT says it means Arsenal Wallpaper, which suggests that Translate may be getting ready to join your googlebot in that quiet room with soft walls.

/p50.html/%C2%8Ai%C2%96-rgb

This worries me, because those aren't letters in UTF-8. They're control characters. Or-- interesting alternative-- your mystery posts use either Windows-Latin-1 or the Mac charset.
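
The point about control characters can be checked with a short Python sketch. It decodes the escapes as UTF-8 and then shows what the same two byte values would be in two single-byte charsets. The choice of cp1252 and mac_roman as stand-ins for "Windows-Latin-1 or the Mac charset" is an assumption.

```python
import unicodedata
from urllib.parse import unquote_to_bytes

raw = unquote_to_bytes("%C2%8Ai%C2%96-rgb")

# Read as UTF-8, C2 8A and C2 96 are U+008A and U+0096, both category Cc (control).
as_utf8 = raw.decode("utf-8")
control_chars = [ch for ch in as_utf8 if unicodedata.category(ch) == "Cc"]
print([f"U+{ord(ch):04X}" for ch in control_chars])

# If the source page used a single-byte charset, the meaningful bytes were
# 0x8A and 0x96; show what those bytes are in each candidate charset.
for codec in ("cp1252", "mac_roman"):
    print(codec, repr(bytes([0x8A, 0x96]).decode(codec)))
```

If one of the single-byte readings gives plausible letters or punctuation, that hints the URL was converted to UTF-8 by something that mistook the original charset.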

matrix_jan




msg:4452045
 1:04 am on May 11, 2012 (gmt 0)

Are you serious? That's Cyrillic


I was just checking my competitor with site: function. Poor guy does not have appropriate redirects set up. G has indexed the following with cPanel congratulations page:

ww.hispage.com
w.hispage.com
www2.hispage.com
(Chinese characters).hispage.com
(Russian word).hispage.com
(random letters).hispage.com

Be careful to redirect or show 404s, otherwise when it's too late you might end up like me, having thousands of 404s in WT.

Wally_Books




msg:4452053
 1:41 am on May 11, 2012 (gmt 0)

/n103-genprokuratura_poruchila_proverit_gosobvinitelya_po_delu_bychkova.html

is a page from a Russian classified ads site

Wally_Books




msg:4452056
 1:46 am on May 11, 2012 (gmt 0)

/Sexcats_-_Witten_Stockumerstr._215_58454-Witten_anzeige5735.html

is a page on a German site

levo




msg:4452057
 1:46 am on May 11, 2012 (gmt 0)

Be careful to redirect or show 404s


I've recently learned, thanks to new notifications in WMT, that redirecting non-existing pages is a big no-no. I thought that soft 404s were error pages with a 200 response, and I honestly believed that leaving as few 404s as possible and redirecting them to a related page (or the home page as a last resort) was a good thing.

As soon as I started removing 301s and replacing them with 404/410s, those random non-existing page requests from Google kicked in. If it's not a weird bug on their end, I think Google is actively auditing my website for redirections/soft 404s/auto generated content etc.

matrix_jan




msg:4452087
 4:29 am on May 11, 2012 (gmt 0)

I think Google is actively auditing my website for redirections

For some reason I had the same thought too. Although G stuff visited my website and checked 404 by visiting some page like this - "thispageshouldnotexist.fake"

Now I'm not sure whether it was good that my website returned a proper 404 or not :)

levo




msg:4452088
 5:01 am on May 11, 2012 (gmt 0)

I'm not sure whether it was good that my website returned a proper 404 or not


Well, I was nervous too, but Google states that "Generally, 404 errors don't impact your site's ranking in Google, and you can safely ignore them" and "... 404s are a perfectly normal (and in many ways desirable) part of the web. You will likely never be able to control every link to your site, or resolve every 404 error listed in Webmaster Tools..." on https://support.google.com/webmasters/bin/answer.py?hl=en&answer=2409439

Aussiefoto




msg:4452094
 5:51 am on May 11, 2012 (gmt 0)

I'm getting a bunch of similar things going on with my sites today .. it just started. here's what's going on with me: From my access logs today, google generated over 700 404s, requesting urls from other websites:

Googlebot is requesting urls that don't exist on my site, and have never existed, but do exist on other websites .. here's some examples:

/journal/2012/01/12/thats-what-its-all-about-right-tim-tebow/img_0609/

/journal/2011/05/18/marriage-what-i-didnt-know-when-i-got-married/

/journal/2011/07/28/the-people-pleaser-problem/


I have a wordpress directory on my site called /journal/, so the url paths are similar. particularly with categories and tags, etc .. /journal/tag/photos/ for example ... but if you google those urls, they're not from my website, but other websites. The same thing is happening with my other domain, and the wordpress directory is called /ramblings/

/ramblings/2010/11/27/eight-ways-to-have-a-miserable-vacation/

/ramblings/2011/08/21/reducing-baggage-charges-hotel-laundry/


Those are not from my website, but from other sites.

But then it gets weird.

Even urls like this are being requested:

/restaurants/794/Casparus/
/restaurants/231/JB%20Rivers/

/display_brand/molami_123


Those have nothing to do with my site or directory structure; they're just about randomly picked url paths from around the web and google is requesting them on my site. And I'm also getting these kinds of urls:

/osamu/2011/11/14/%EF%BC%91%EF%BC%91%E6%9C%88%E3%81%8A%E3%81%95%E3%82%80%E3%82%AB%E3%83%AC%E3%83%B3%E3%83%80%E3%83%BC/

/weekday/2012/01/25/1%EF%BC%8F28%E3%83%BB29%E3%81%AE%E8%BB%8A%E4%B8%A1%E9%81%8B%E7%94%A8%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6/


It only happened today, so I can't yet see if they show up in Google Webmasters Tools.

From the posts above, I'm wondering if I should just remove all 301s from my htaccess file. Thoughts?

Cheers

Carl

g1smd




msg:4452110
 6:56 am on May 11, 2012 (gmt 0)

(For the pages with similar URL structure: ) It's almost like Google thinks that your site and the other site(s) are really the same bunch of files on a single server being returned under multiple domain names. They then test this out by requesting those pages from each domain and see what they get back.

You should return 404 for those requests to show that the other sites are separate sites. If you were to somehow "adopt" the URLs I think some sort of mayhem would ensue. I wonder if pages from your site are being requested on those other sites?

lucy24




msg:4452114
 7:04 am on May 11, 2012 (gmt 0)

Redirect the files that need to be redirected. Serve 410s for the files that are gone and have no close replacement. Serve 404s for files that don't exist, never did exist, never will exist...
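
That rule of thumb can be written down as a small decision function. The Python sketch below is purely illustrative: the MOVED, GONE and LIVE tables and every path in them are hypothetical, and a real site would express the same logic in its server configuration or application code.

```python
# Hypothetical lookup tables; a real site would build these from its own records.
LIVE = {"/", "/index.html", "/new-page.html"}
MOVED = {"/old-page.html": "/new-page.html"}   # has a close replacement
GONE = {"/discontinued.html"}                  # existed once, no replacement

def choose_response(path):
    """Return (status, redirect_target) for a requested path."""
    if path in LIVE:
        return 200, None
    if path in MOVED:
        return 301, MOVED[path]
    if path in GONE:
        return 410, None
    # Never existed: a plain 404, not a redirect to the home page.
    return 404, None

print(choose_response("/castor2008"))
```

The last branch is the one that matters for this thread: a URL borrowed from someone else's site falls through to a real 404 rather than being redirected somewhere "helpful".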

Paradox. Even though 404s are listed as Errors, g### wants you to have them. That is, they want evidence that your site is able to generate a 404 response.

Unfortunately they've given the job to the same googlebot that will continue crawling pages for years-- literally-- after they've ceased to exist. Gee, maybe that 404 was a mistake. Better try another one.

And then there's the whole issue of g### trying to read URLs out of things that were never intended as links in the first place. Lots of recent threads about that.

The percent encodings really aren't anything special. They're simply non-ASCII characters. Correction: non-alphanumerics plus a very short list of others.

This one's fun:

%8F28%E3%83%BB29

Notice the stray 28 and 29? Something went severely goofy there, because %28 and %29 are (parentheses).

Now, let me know if you ever get a request containing percent-escaped characters in the exact form

%E1%9\d%[89AB]\h
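
A note on the %8F28%E3%83%BB29 fragment: taken together with the bytes just before it in Aussiefoto's URL (1%EF%BC%8F28%E3%83%BB29), it decodes cleanly as UTF-8. That suggests the 28 and 29 are ordinary digits rather than damaged %28/%29 escapes. A quick Python check, offered as a sketch:

```python
from urllib.parse import unquote

segment = "1%EF%BC%8F28%E3%83%BB29"   # copied from the /weekday/ URL above
decoded = unquote(segment)            # percent-decoding, bytes read as UTF-8

# List the code points so the result is readable even without a CJK font.
print([f"U+{ord(ch):04X}" for ch in decoded])
```

%EF%BC%8F is U+FF0F (fullwidth slash) and %E3%83%BB is U+30FB (katakana middle dot), so the segment appears to be a date range along the lines of "1/28 and 29".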

Aussiefoto




msg:4452116
 7:15 am on May 11, 2012 (gmt 0)

Hey g1smd

Thanks for your input. yes, I've put in a note to google, and also spoken with the Tier 2 systems admin on my site. They even checked some of those other urls to see if they had ever maybe been hosted there, or had a similar ip or something, but found nothing.

It only started today; it's really weird. Here's a sample from my access logs


66.249.66.236 - - [09/May/2012:21:11:32 -0700] "GET /ramblings/category/family-vacation/ HTTP/1.1" 404 73818 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.236 - - [09/May/2012:21:11:33 -0700] "GET /index.php/past-newsletter-articles.html HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.236 - - [09/May/2012:21:11:34 -0700] "GET /index.php/past-newsletter-articles/64-what-happens-on-a-fam-trip.html HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.236 - - [09/May/2012:21:11:34 -0700] "GET /ramblings/author/cathibanks/page/3/ HTTP/1.1" 404 73822 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.236 - - [09/May/2012:21:11:35 -0700] "GET /past-newsletter-articles.html HTTP/1.1" 404 73686 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
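
To get an overview of a burst like this rather than reading the log line by line, the entries can be tallied by status code. The Python sketch below assumes the log is in the Apache "combined" format shown above; the function name googlebot_statuses is hypothetical.

```python
import re
from collections import Counter

# Apache combined log format: ip, ident, user, [time], "request", status, size,
# "referer", "user agent".
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_statuses(lines):
    """Count response codes for requests whose user agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and "Googlebot" in m.group("agent"):
            counts[m.group("status")] += 1
    return counts

sample = [
    '66.249.66.236 - - [09/May/2012:21:11:32 -0700] "GET /ramblings/category/family-vacation/ HTTP/1.1" 404 73818 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.236 - - [09/May/2012:21:11:33 -0700] "GET /index.php/past-newsletter-articles.html HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]
print(googlebot_statuses(sample))
```

Matching on the user agent string alone does not prove a request came from Google, since anyone can send that header. A reverse DNS lookup on the IP address is the usual way to confirm it.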


now ALL of those urls return a 404 on my site. Yet 2 of them here show a 301. Which makes me think .. a few days ago I DID see pages on google webmasters tools showing errors for this

journal/index.php/category/photos/ or something .. when it should never show the /index.php file for those pages ... and now I see that kind of thing here in the ramblings/index.php/url-path (my 2nd website's wordpress folder) ... 301 redirecting ramblings/url-path .... so it could be something like a link to some similar url path and google tested it on my site, then started going haywire .... I have no idea.

I'm positive it must have something to do with 301s and so forth in my htaccess ... but it goes so far beyond that with these other crazy urls, too.

Anyway, I didn't mean to hijack the thread ... it just seems like it's the same kind of thing going on, no?

Thanks

Cheers

Carl

Aussiefoto




msg:4452118
 7:19 am on May 11, 2012 (gmt 0)

lucy .. i just saw your reply now... most of which goes way over my head. I searched for that string you mentioned, but nothing here.

All the urls that are not from my site are getting 404s ... I went ahead and installed a new plugin on my site that caches 404 pages to reduce server load, if this is going to be an ongoing problem. I'd really rather, of course, that google not try to imagine other urls on my website and search for them. From what you're suggesting, this might be a temporary "test" that goes away?

Thanks

cheers

Carl

freyia




msg:4452131
 8:25 am on May 11, 2012 (gmt 0)


We've been getting the same thing on our website since yesterday morning. Our site is primarily just html and has been on the web 12 years; we haven't done anything out of the ordinary with it lately, and it's just generating plain 404 errors. We've had a few hundred requests from googlebot trying to fetch weird page names. I don't know if it's relevant, but the IP address of the googlebot in question is 66.249.71.169

Some examples of pages it has tried to fetch are :

/store/index.cfm/product/43_4/consorzio-di-modena-tradizionale-balsamic-vinegar
/travel-blog/2012/03/dainess-deal-reel-march-30th-2012/
/recipesarchive/recipe.cfm
/south_africa_sports_volunteering.htm
/vintage_flower_earrings.htm
/2008/Guestbook/guestbook41.htm
/seven_moons_order_doppelganger.htm
/dev/c201essay4.pdf
/apps/blog/show/7321337
/wallpapers16.htm

Some of them are fairly unique and their 'real' locations on the web can be easily found. We also checked a couple of the sites we could identify and they don't even use the same hosting provider as ourselves.

I see Aussiefoto said they'd put a note into google. We've had a look around and can't actually see where to write to google regarding this type of problem. How would you contact them for this sort of matter?

thanks

Jo

Wally_Books




msg:4452316
 5:11 pm on May 11, 2012 (gmt 0)

googlebot is back to normal today after looking for hundreds of pages from other sites on our site. Must have been a glitch. Today just looking for our site. Nothing yet for crawl errors in webmaster tools. Curious if there will be.

g1smd




msg:4452351
 6:35 pm on May 11, 2012 (gmt 0)

It's usually about 48 hours before crawl errors show up in the reports. Check tomorrow.

jimbeetle




msg:4452352
 6:35 pm on May 11, 2012 (gmt 0)

It's almost like Google thinks that your site and the other site(s) are really the same bunch of files on a single server being returned under multiple domain names.

I've seen somewhat similar activity in the past when Google would return pages from a recently deactivated domain as a site: result for a recently activated domain on the same server.

atlrus




msg:4452585
 1:51 pm on May 12, 2012 (gmt 0)

It's probably a xrumer run on your forum. Google xrumer pyramid.

Aussiefoto




msg:4452741
 12:17 am on May 13, 2012 (gmt 0)

Well, I didn't hear anything back from google about it. Jo, here's how I sent them a note

http://support.google.com/webmasters/bin/request.py?&contactus=1

Not that it helps much, if they don't respond.

I still haven't seen anything in my google webmasters tools, and those requests seem to have gone away from my Access Logs. It appears to me that it's gone, and is no longer an issue. I'll check and see over the next few days if it comes back.

Cheers

Carl

Sgt_Kickaxe




msg:4452764
 2:31 am on May 13, 2012 (gmt 0)

Essentially in these cases, Google is trying to find "deep" content that normally doesn't see the light of day.

I beg to differ. I believe they are looking to see if the page returns dynamic content based on the parameters requested. Spammers often key on queries and keywords to create infinite numbers of pages dynamically, and Google doesn't like that. Even if you're not a spammer you need to be aware of how your pages respond to such requests: they essentially should not change regardless of parameters added (if they exist) or should return the proper error code if they do not exist.

e.g. webmasterworld.com/google/?apples=oranges should return the EXACT same content as webmasterworld.com/google/ since webmasterworld.com/google/ exists. To Google it would mean the parameters had no effect (a good thing).
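
One way a site might guarantee that junk parameters have no effect is to reduce every requested URL to a canonical form before deciding what to serve. The Python sketch below is an illustration only: KNOWN_PARAMS and canonical_url are hypothetical names, and the http:// scheme is assumed because the example above omits it.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

KNOWN_PARAMS = {"page"}   # hypothetical: the only parameter this site understands

def canonical_url(url):
    """Drop query parameters the site does not recognise."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in KNOWN_PARAMS
    ]
    # Rebuild the URL without the fragment and without the unknown parameters.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(canonical_url("http://webmasterworld.com/google/?apples=oranges"))
```

Whether the server then serves the canonical page directly or redirects to it is a separate design choice. Either way, the response does not vary with parameters it never asked for.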

tedster




msg:4452774
 3:01 am on May 13, 2012 (gmt 0)

You're right - that's also true, Sarge. They also test to see if the error handling is up to par or "soft". In short, googlebot does a lot of probing.

Errioxa




msg:4452825
 9:29 am on May 13, 2012 (gmt 0)

On my site googlebot always crawls around 200,000 - 250,000 urls with redirects per day. I have been tracking it for a long time in my access logs.


06/May/ - 252.963
07/May/ - 213.955
08/May/ - 231.638
09/May/ - 169.253
10/May/ - 24.891

Does Google have a bug with redirections?
Google takes a long time to update cached versions in many urls. Google is getting slower.

g1smd




msg:4452827
 9:49 am on May 13, 2012 (gmt 0)

They trust redirects less than they did. However, while they might index the new URL very quickly, they also hold on to the data about the old one for a very long time.

Aussiefoto




msg:4456613
 9:14 pm on May 22, 2012 (gmt 0)

Hey Folks - just a quick follow up; Those urls never showed up in my google webmasters tools at all. And they ceased requesting them the same day they started. Haven't seen the problem again.

Wally_Books




msg:4456619
 9:27 pm on May 22, 2012 (gmt 0)

same here, never showed up and they stopped requesting


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved