Redirects, tell the bot page doesn't exist

bingbot crawling non-existent pages

         

Aussiefoto

4:41 am on Apr 25, 2012 (gmt 0)

10+ Year Member



Hey Folks,

My first post here, so please be gentle. :)

I am having an ongoing issue with excessive CPU usage on my site/s (shared hosting). I have one primary domain and a subdomain. Both sites have WordPress blogs, and one site also has a coppermine-gallery photo section. Both sites also have a bunch of static html pages.

One of the things we found is bingbot and/or msnbot crawling a bunch of pages that don't exist. URLs like this:

GET /Bio/alaska/faq/stock/stock/alaska/journal/eagles/stock/thumbnails-79-Banff-National-Park-photos.html
GET /Bio/copyright/faq/stock/alaska/stock/index-17.html
GET /Bio/alaska/stock/stock/stock/alaska/portfolio/landscapes/stock/thumbnails-17-Small-Mammals-Photos.html


Literally, thousands of them. The directory Bio doesn't exist; it is now 'bio', and has just one URL, index.html. But somewhere along the line Bing is trying to crawl these crazy non-existent URLs. No other engine is crawling them, and they don't seem to exist. Bing's Webmaster Tools isn't showing a bunch of 404 errors, only a few, and none with this kind of URL.

So the problem is it generates a 404, which is called and created dynamically by wordpress. Here's what I did:

In the .htaccess file, added

RedirectMatch 301 ^/Bio/ http://www.skolaiimages.com/bio/index.html


So now every one of those bad URLs just goes to the correct, static bio page. Is there a "better" way to configure this, rather than now having thousands of redirects? Something like a script or code that just says "Bio/anything" doesn't exist?
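For reference, a minimal sketch of the "just say it doesn't exist" idea. It assumes mod_rewrite is enabled on the host and that these lines sit above the WordPress block in the root .htaccess; neither is confirmed in this thread. It would stand in for the RedirectMatch line rather than sit alongside it.

```apache
# Sketch only: answer every request under the old, capitalised /Bio/ directory
# with "410 Gone" straight from Apache, so WordPress never has to build a 404 page.
# Assumes mod_rewrite is available and these lines come before the WordPress rules.
RewriteEngine On

# Patterns are case-sensitive by default, so the live /bio/ page is not affected.
# "-" means "no substitution"; [G] sends 410 Gone and stops rule processing.
RewriteRule ^Bio/ - [G]
```

Unless the host has configured a custom error document for 410, Apache should answer with its own short built-in message, which is about as cheap a response as the server can give.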

Another option is to reconfigure wordpress so it isn't pointed to the root directory, I only made this change recently, so it wouldn't be too big a deal to switch it back and have a static html page as the home page again, and everything wordpress operate within its own directory (/journal/). That should mean any 404s from that above set of urls is not generated dynamically, but calls a static page, correct?

And/or drop wp-super cache, and switch to W3 Total Cache, which allows caching of 404 pages (Super Cache does not).

My access logs show as many as 5000 hits by bing/msn to these bad urls. Is it likely that this is causing the CPU problems?

I've slowed the crawl rate down, via Webmaster Tools, and also via this

User-Agent: *
Crawl-delay: 30


in the robots.txt file, but those didn't seem to change anything. I just made the 301 redirect for Bio today, so don't know yet whether the resource usage has slowed at all.

I apologize for the long and convoluted introductory post. I've been having such a hard time with this, and am in WAY over my head on this. Any and all help is much appreciated.

Thanks so much.

Cheers

Carl

lucy24

2:11 am on May 1, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Whoops! How did we get to a second page? Last post on end of page:
Out of curiosity, how do I rewrite something like this

website.com/folder/file-name.html%3Cbr%20/%3E50

or similar? I'm seeing something like that in Google's Webmaster Tools .. I deleted it, but I do know it drives some traffic. I tried this:

# RewriteRule ^folder\/file-name\.html(.*)$ http://www.example.com/folder/file-name.html [R,L]

Didn't work. I then tried leaving the $ off the first part, after the (.*), and it nearly worked, but still tagged some kind of code <br%20/>50 or something similar.

It's not a huge deal, but I am starting to see more of these links from truncated urls, etc and thought it might be possible to resolve them correctly?

:: mutter, grumble, you really gotta use example.com ::

Now homing in...

<br%20/>50 or something similar

! I know what that is and it isn't your fault and you don't need to do anything about it unless it's a really important link and someone at the other end goofed and now you can't get them to fix it.

:: pause for breath ::

%20 is a space. The other % pieces seem to have been de-escaped on their own: %3C and %3E to < and >. What you have there is yet another variation on g###'s fondness for taking everything and anything as an URL. In this case, it's a legitimate URL that isn't anchored, so instead it is immediately followed by the xhtml line break <br /> ... and then by who-knows-what on the next line!

Remember, you don't need to escape directory slashes in mod_rewrite. And here you must say .+ rather than .* because otherwise you will get into an infinite redirect when there is nothing after the html. And, ahem, you want [R=301,L]. After all, you're not planning to go back to the .html<br />50 form next week, are you?
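Putting those three points together, a sketch of the corrected rule for one exact URL might look like this. The folder/file-name.html path and www.example.com are the placeholders already used above, and the sketch assumes the rule lives in the root .htaccess with RewriteEngine On already set.

```apache
# Sketch: redirect "file-name.html" followed by any trailing junk to the clean URL.
# - slashes in the pattern are not escaped; the literal period is
# - .+ requires at least one extra character, so the clean URL itself never
#   matches and the redirect cannot loop
# - R=301 makes the redirect permanent; L stops further rule processing
RewriteRule ^folder/file-name\.html.+ http://www.example.com/folder/file-name.html [R=301,L]
```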

If there are actual humans at the end of those garbled link-oids, try to hunt down the source and get it fixed. Otherwise, ignore it. You can certainly make a RewriteRule that says

RewriteRule ^([^.]+\.html).+ http://www.example.com/$1 [R=301,L]

if it's happening a lot with legitimate URLs. Or spell out the exact URL(s) if there's only a handful of them. It looks as if they're all coming from the same place, though.

Aussiefoto

2:41 am on May 1, 2012 (gmt 0)

10+ Year Member



Hey Lucy

Ahhh .. thanks ..

I'm unsure about the first comment " mutter, grumble, you really gotta use example.com :: " ? I thought I DID use example.com or 'website.com' .. is there a difference? I thought it was just to (a) avoid using your own site for promo and (b) to not recreate the bad links.

Remember, you don't need to escape directory slashes in mod_rewrite


so you mean I could have just written it like this?

RewriteRule ^folder/file-name\.html(.*)$ .... 


without the backslash after "folder"?

Oh no. if that's so, I have to edit a bunch of them. I thought you had said earlier to do that, but I see you didn't ... I just don't know what all the jargon means, but I get it now ... just for stuff like \.html etc and not in the target.

And, ahem, you want [R=301,L]


Yeah, my bad .. I had been playing with [R,L] as I was working on it, rather than 301, but when I couldn't get it to work, I just left it as is.

Oh .. btw .. I DID drop the "index.html" from the urls in the navigation .. so bio/index.html now is just /bio/ .. and any folders where the url inside is not index.html I put a rewrite for the folder/ to point to the correct url

RewriteRule ^anwr/$ http://www.example.com/anwr/anwr-rafting-trip.html [R=301,L]


Correct?

Just so I know, what's the issue with having a link to bio/index.html etc?

And here you must say .+ rather than .* because otherwise you will get into an infinite redirect when there is nothing after the html.


Ah ha! I got that infinite redirect and that's one of the reasons I dropped the whole thing. I'll try the .+ like you suggested.

It's not happening too much, but it does seem to be increasing. A more common one is truncated urls ... www.example.com/anwr/anwr-ra.... etc .... I just ignore those usually.

Thanks again, both of you .. extremely helpful, and much appreciated.

Cheers

Carl

Aussiefoto

2:49 am on May 1, 2012 (gmt 0)

10+ Year Member



PS: the generic rule you provided wouldn't quite work for me, as I have some links to pages like this:

example.com/customtrips.html#ross-green

Correct?

thanks again.

g1smd

6:52 am on May 1, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You never need to worry about #something on the end of a URL. That's a named anchor that when clicked jumps to a particular position within a page. This is processed entirely within the browser and the #part isn't ever part of the URL request to the server.

Colons and slashes never need to be escaped. It's mostly literal periods and literal spaces that do.
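A small sketch of what that looks like in practice. The file names here are hypothetical, made up purely for illustration, and the rules assume RewriteEngine On is already set in the same .htaccess.

```apache
# Sketch with hypothetical file names.
# The period is escaped so it matches only a literal "."; left unescaped it
# would match any single character (for example "pricesXhtml").
# The slashes in the pattern and the colon in the target need no escaping.
RewriteRule ^trips/prices\.html$ http://www.example.com/trips/rates.html [R=301,L]

# A literal space in the pattern is escaped with a backslash so that it is not
# read as the separator between the pattern and the target.
RewriteRule ^trips/old\ prices\.html$ http://www.example.com/trips/rates.html [R=301,L]
```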

Lucy has covered the other points already.

lucy24

7:28 am on May 1, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I thought I DID use example.com or 'website.com' .. is there a difference?

Yes, there's an extra difference that only matters here in the Apache forum. Anything with a leading http:// gets turned into a clickable link, and then people can't see what you typed. It doesn't matter as much in examples without the http:// as long as you use something generic. But example.com --or, apparently, example dot anything else in the world-- is doubly safe.

Compare

[domain.no...]
[domain.se...]
[domain.dk...]
[domain.de...]
[domain.gl...]
[domain.tv...]
[domain.xyz...]
and
http://www.example.no
http://www.example.se
http://www.example.dk
http://www.example.de
http://www.example.gl
http://www.example.tv
http://www.example.xyz

I put in the last one because I don't believe there is a tld called .xyz. So if you need to, you can even obfuscate the country :)

Aussiefoto

9:00 am on May 1, 2012 (gmt 0)

10+ Year Member



Ah ha .. thanks again. You 2 are awesome.

Cheers

Carl