Forum Moderators: phranque

Redirects, tell the bot page doesn't exist

bingbot crawling non-existent pages

         

Aussiefoto

4:41 am on Apr 25, 2012 (gmt 0)

10+ Year Member



Hey Folks,

My first post here, so please be gentle. :)

I am having an ongoing issue with my site/s with excessive resource usage with the cpu (shared hosting). I have one primary domain and a subdomain. Both sites have wordpress blogs, and one site also has a coppermine-gallery photo section. And both sites have a bunch of static html pages.

One of the things we found is bingbot and/or msnbot crawling a bunch of pages that don't exist. Urls like this

GET /Bio/alaska/faq/stock/stock/alaska/journal/eagles/stock/thumbnails-79-Banff-National-Park-photos.html
GET /Bio/copyright/faq/stock/alaska/stock/index-17.html
GET /Bio/alaska/stock/stock/stock/alaska/portfolio/landscapes/stock/thumbnails-17-Small-Mammals-Photos.html


Literally, thousands of them. The directory Bio doesn't exist, is now 'bio', and has just one url, index.html. But somewhere along the line bing is trying to crawl these crazy non-existent urls. No other engine is crawling them, and they don't seem to exist. Bing's webmasters tools aren't showing a bunch of 404 errors, only a few, and none with this kind of url thing.

So the problem is it generates a 404, which is called and created dynamically by wordpress. Here's what I did:

In the .htaccess file, added

RedirectMatch 301 ^/Bio/ http://www.skolaiimages.com/bio/index.html


So now every one of those bad urls just goes to the correct, static bio page. Is there a "better" way to configure this: rather than now having thousands of redirects, just have a script or code that says 'Bio/anything' doesn't exist?

Another option is to reconfigure wordpress so it isn't pointed to the root directory. I only made this change recently, so it wouldn't be too big a deal to switch it back, have a static html page as the home page again, and have everything wordpress operate within its own directory (/journal/). That should mean any 404 from the above set of urls is not generated dynamically, but calls a static page, correct?

And/or drop wp-super cache, and switch to W3 Total Cache, which allows caching of 404 pages (Super Cache does not).

My access logs show as many as 5000 hits by bing/msn to these bad urls. Is it likely that this is causing the CPU problems?

I've slowed the crawl rate down, via webmasters tools, and also via this

User-Agent: *
Crawl-delay: 30


in the robots.txt file, but those didn't seem to change anything. I just made the 301 redirect for Bio today, so don't know yet whether the resource usage has slowed at all.

I apologize for the long and convoluted introductory post. I've been having such a hard time with this, and am in WAY over my head on this. Any and all help is much appreciated.

Thanks so much.

Cheers

Carl

lucy24

5:16 am on Apr 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



First things first: Is it the real bingbot? That is, never mind what it's wearing: does it come from a bing/msn IP? I tend to feel irrationally flattered when bing continues to crawl 410'd pages. They're not as ecumenical as google-- and also not as hipped on following imaginary links. I get a sense of "Ooh, this page is really important."

The directory Bio doesn't exist, is now 'bio'

Do you mean that it used to exist? Then a 410 would seem to be in order. Do not repeat not redirect a whole bunch of files to a single index file unless you've got a ### good reason. (Does your bio/index.html contain all the information that used to be in all those Bio/blahblah files? I kinda doubt it :)) The syntax in mod_alias is simply

RedirectMatch 410 {filename} ... and then nothing. You're redirecting into thin air.
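
For illustration, a minimal sketch of that mod_alias form. The directory name /old-section/ is hypothetical, and the advice further down about not mixing mod_alias with mod_rewrite still applies.

```apache
# mod_alias: answer 410 Gone for every URL path that starts with /old-section/.
# /old-section/ is a made-up name. With a 410 status, no target URL is given.
RedirectMatch 410 ^/old-section/

# A similar prefix form without a regular expression (left commented out so
# that only one of the two directives is active):
# Redirect gone /old-section
```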

RedirectMatch 301 ^/Bio/ http://www.example.com/bio/index.html

Awk! Never ever ever use the name "index.html" (or .php or whatever). The name of a directory is / as in

http://www.example.com/bio/

Which reminds me: You should detour at this point to read some of those fine-print links at the top of the page-- Forums Charter and so on-- which will, among other things, explain about example.com. You probably noticed one of those problems while composing your post.

NOW THEN...

Ahem.

Does your htaccess currently contain only mod_alias directives (Redirect or RedirectMatch by that name)? If so, you are OK. But if you've got anything using mod_rewrite-- whether it ends up a rewrite or redirect doesn't matter-- change everything to mod_rewrite. Otherwise you're just asking for trouble.

Aussiefoto

8:14 am on Apr 25, 2012 (gmt 0)

10+ Year Member



hey Lucy24

Thanks so much for your help. I'll try to go through each point.

"That is, never mind what it's wearing: does it come from a bing/msn IP?"


Seems to be so, yes. I checked a few of the IPs via ip-lookup.net and they all seem to be bing bots.

"Do you mean that it used to exist? Then a 410 would seem to be in order. Do not repeat not redirect a whole bunch of files to a single index file unless you've got a ### good reason.


Yes, I used to have it called Bio, but changed things to lower case urls when I learned that's a more sensible way to do things. Then I learned about 404 crawl errors, etc afterward.

The "### good reason" for the redirects was simply "it's all I know." :) I'm not sure how to do 410s, so I'll have to look into that. It definitely makes more sense to tell the crawler those pages don't exist than to redirect them to another (unrelated) page continuously.

"The syntax in mod_alias is simply

RedirectMatch 410 {filename} ... and then nothing. You're redirecting into thin air. "


So I don't actually redirect it anywhere? And I do need to do that for each of the urls bing is concocting? (NB - there was only ever ONE Bio file, all the others that are somehow being searched by the crawler are some kind of coding error.) Can I just do this

RedirectMatch 410 ^/Bio/

and have that resolve ALL the urls starting with /Bio/? I just saw nearly 8000 in the most recent access logs.

If I Redirect 410, isn't that going to continue to generate 404 errors?

Never ever ever use the name "index.html" (or .php or whatever). The name of a directory is / as in

"http://www.example.com/bio/ "


Do you mean not link to the http://www.example.com/bio/index.html in my navigation/redirect, but still actually keep the index.html file in the directory? Or change the name of the index.html file to something like bio.html, so the url would be http://www.example.com/bio/bio.html (or some such)? I apologize for such rudimentary stupid questions, I'm pretty ignorant of proper 'rules' for this stuff. Hence, my site/s a mess.

"Which reminds me: You should detour at this point to read some of those fine-print links at the top of the page-- Forums Charter and so on-- which will, among other things, explain about example.com. You probably noticed one of those problems while composing your post. "


Ahh.. I see .. sorry .. i just copied that from the file, and should've changed it before posting. it's too late to edit now .. I apologize.

"Does your htaccess currently contain only mod_alias directives (Redirect or RedirectMatch by that name)? If so, you are OK. But if you've got anything using mod_rewrite-- whether it ends up a rewrite or redirect doesn't matter-- change everything to mod_rewrite. Otherwise you're just asking for trouble. "


Oh. I'll have to look at that. I have redirect and redirectmatch, but I also have this kind of stuff:

RewriteCond %{HTTP .... 


and

# BEGIN WPSuperCache
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
#If you serve pages from behind a proxy you may want to change 'RewriteCond %{HTTPS} on' to something more sensible
RewriteCond %{REQUEST_URI} .......


So you're saying I should change ALL the Redirect 301 and RedirectMatch 301 to mod_rewrite codes?

Well, right now the redirect isn't working anyway .. i've no idea why it suddenly stopped redirecting correctly.

I'm also trying to configure my WP Super Cache plugin to cache the 404 pages, but so far, I can't get that to happen.

Would adding a simple

Disallow: /Bio/


work to block the crawler from all those urls starting with /Bio/?

This stuff is SO hard. I feel like such a clown.

Thanks so much for your help. I really appreciate your time and patience.

Cheers

Carl

g1smd

8:46 am on Apr 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you use RewriteRule for any of your rules, use it for all of your rules. Redirect and RedirectMatch come from mod_alias. RewriteRule comes from mod_rewrite. Don't mix directives from both modules in the same site.

Track down the source of this error. It is likely there's a relative link somewhere on the site. Use only links that begin with a leading slash and specify the full path to the file.

Aussiefoto

9:00 am on Apr 25, 2012 (gmt 0)

10+ Year Member



hello g1smd

Thanks. I'll try to figure out the correct RewriteRule setup.

I have no idea how to find this kind of error. Even when I go to Bing webmasters tools, it's not showing me ANY of these urls as not found. It's not showing me any links pointing towards this, and Bing is the ONLY crawler requesting them. I know I've had url errors in the past, but I think they are corrected. These are direct requests coming in, not from a link:

65.52.108.58 - - [24/Apr/2012:09:23:56 -0700] "GET /Bio/alaska/bio/stock/stock/stock/wildlife/stock/thumbnails-17-Small-Mammals-Photos.html HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
157.55.17.193 - - [24/Apr/2012:09:39:22 -0700] "GET /Bio/alaska/bio/stock/stock/contact/resources/stock/bio/index.html HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
157.55.17.193 - - [24/Apr/2012:09:45:34 -0700] "GET /Bio/alaska/bio/stock/stock/stock/eagles/stock/thumbnails-13-Grizzly-Bears-Photos.html HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"


this kind of thing.

I went to Bing webmasters tools and blocked the directory /Bio/ as well. But it seems that only lasts 90 days.

Do you have any tips or tools that might help to maybe find if there's still a bad link somewhere? I don't believe there is; the directory /Bio/ doesn't even exist on the site now.

I did just find the redirect is not working because the webhost tech support added this to the htaccess

RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*)\.(html|ico|jpg|png|gif|js|css)$ - [G,L]


That's the correct 410 rule, right? But now those urls still show up as 404s, which doesn't seem correct to me.

Thanks so much.

Cheers

Carl

lucy24

11:13 am on Apr 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Where are they coming through as 404? The RewriteRule-- which you should dump pronto, because it's sloppy and server-intensive-- serves a 410. That's the [G] flag. The log entries you quoted show 301.

... and that's why it's perilous to mix mod_alias and mod_rewrite. It's out of your control which one will execute first.

What you need is a single line.

RewriteRule ^Bio - [G]

That's all. ^Bio means "starts with..." with no ending anchor. Top-level directory, so no leading slash.
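
A hedged sketch of where that line might sit, assuming the root htaccess also carries CMS or caching rules. The trailing slash is a variation on the rule above, not part of it: it keeps the pattern from also matching an unrelated name such as /Biography.html.

```apache
RewriteEngine On

# In htaccess, patterns are matched against the path without its leading slash.
# Matching is case-sensitive, so the live /bio/ directory is not affected.
RewriteRule ^Bio/ - [G]

# ... any CMS or cache rewrite block would follow here ...
```

Placing the rule ahead of the CMS block means the bad requests are answered before the CMS's catch-all rewrite gets a chance to hand them to the CMS.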

Now, about index.html

The name of the file is index.html or index.php or index.asp or whatever. But in all links to any directory, you just say

www.example.com/directory/
www.example.com/directory/subdirectory/

and then let mod_dir do the rest. That's its job. Filenames are for everything other than each directory's Index file.

<IfModule mod_rewrite.c>

Oh, lord, you've got CMS boilerplate mixed into your htaccess. For starters, get rid of all those <IfModule... envelopes. Not their contents, just the envelopes themselves. You either have the module or you don't. In the case of mod_rewrite, you have it ;)

If I Redirect 410, isn't that going to continue to generate 404 errors?

No, you're replacing the default 404 response with an explicit 410 response. It isn't really a redirect; that's just the syntax. But now that we've established that you do also use mod_rewrite, you're going to dump all the Redirect-by-that-name rules anyway.

If you've got a lot of Redirects in place, you can even run up a sort of meta-RegEx to change everything in one fell swoop. Mine (which I cleverly saved in the htaccess itself so I can't possibly misplace them and have to make them up all over again) go

# change . to \.
# ^(Redirect \d\d\d \S+?[^\\])\. TO \1\\.
# now change Redirect to Rewrite
# ^Redirect(?:Match)? 301 /(.+) TO RewriteRule \1 [R=301,L]
# and
# ^Redirect(?:Match)? 410 /(.+) TO RewriteRule \1 - [G,L]

It's \1 because that's what my text editor uses. Most people probably use $1 instead.

Technically you don't need the [L] flag with [G], though you do with [R=301]. But as a matter of habit, use [L] in every single rule until you've completely internalized which ones don't need it. And when you get to the point where you're omitting [L] because something isn't the last Rule, well, then you'll be answering questions instead of asking them :)

g1smd

11:36 am on Apr 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



One point. [L] isn't "last rule" in the way many people think it is, i.e. add it only on the last rule of the lot.

[L] is last rule in "if this rule pattern matched the current request, do whatever action is specified in this rule and exit mod_rewrite for this request".

As such [L] should appear on the end of every RewriteRule (except those with [F] or [G] where [L] is implied).
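
To make that concrete, a small sketch. Every path and filename here is hypothetical and chosen only to show where the flag goes.

```apache
RewriteEngine On

# [G] (410 Gone) and [F] (403 Forbidden) stop processing on their own,
# so [L] is implied and can be left off.
RewriteRule ^Bio/ - [G]
RewriteRule ^private/ - [F]

# External redirects and internal rewrites each carry an explicit [L].
RewriteRule ^old-page\.html$ http://www.example.com/new-page.html [R=301,L]
RewriteRule ^gallery/([0-9]+)$ /gallery.php?id=$1 [L]
```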

lucy24

12:38 pm on Apr 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As such [L] should appear on the end of every RewriteRule (except those with [F] or [G] where [L] is implied).

... and those that expressly preclude [L] --and that nobody but jdMorgan is prepared to touch-- like [N] or [C] or [S] ;)

Aussiefoto

7:23 pm on Apr 25, 2012 (gmt 0)

10+ Year Member



Hey Lucy24

Here's an example where they're coming through as 404s

207.46.192.48 - - [24/Apr/2012:02:46:15 -0700] "GET /Bio/alaska/copyright/stock/stock/copyright/resources/contact/stock/thumbnails-60-Insects.html HTTP/1.1" 404 89246 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

207.46.13.206 - - [24/Apr/2012:02:47:25 -0700] "GET /Bio/alaska/faq/appalachia/atlanta/stock/contact/contact/stock/thumbnails-4-Bighorn-Sheep-Photos.html HTTP/1.1" 404 89554 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

157.55.18.25 - - [24/Apr/2012:02:47:54 -0700] "GET /Bio/alaska/copyright/eagles/stock/contact/contact/bio/stock/thumbnails-43-Photos-of-shorebirds.html HTTP/1.1" 404 89509 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"


RewriteRule ^Bio - [G] 

Ahh, OK, thanks. Do I remove the
RewriteCond %{REQUEST_FILENAME} !-f
section as well, or just the line starting RewriteRule?

Top-level directory, so no leading slash.

So "top-level directory" means the first level inside the root, correct? Like www.example.com/Bio/ etc?

Now, about index.html ....


Thank you. I SO shoulda been on this forum 5 years ago. I've gone thru and corrected those links that I can find pointing to /directory/index.html.

For starters, get rid of all those <IfModule... envelopes. Not their contents, just the envelopes themselves.


oh .. so anything that says
<IfModule mod_rewrite.c>
I can/should remove?

If you've got a lot of Redirects in place, you can even run up a sort of meta-RegEx to change everything in one fell swoop. .... snip ….


OK .. now I'm lost. In English, what you're saying is this will effectively change all of the Redirect stuff I have to a proper ReWriteRule, correct? Do I just copy that snippet you typed out str8 into my htaccess as is? Forgive my ignorance, and pretend you're writing to a 5 year old. :) And then what do I do with all the Redirect 301 lines in the htaccess file? Does the code you typed have to go before those Redirect 301 lines (assuming I leave them in place)?

I'm so appreciative, thanks so much.

Cheers

Carl

g1smd

7:35 pm on Apr 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That final bit of code is what you feed into the find/replace function of an intelligent code-aware text editor (such as EditPlus3) to mass replace Redirect directives with RewriteRule directives.

Saves one massive heck of typing. Work smarter, not harder.

Aussiefoto

7:55 pm on Apr 25, 2012 (gmt 0)

10+ Year Member



Ahhhh ... duh ... OK, thanks .. do you have a recommendation for an "intelligent code-aware text editor" for mac? EditPlus3 looks to be windows only. .. ETA: I use Text Edit, but it doesn't seem to have the functionality needed for this kind of thing.

Do you have an example of what this line
Redirect 301 /stock/thumbnails-51-Bald http://www.skolaiimages.com/stock/thumbnails-51-Bald-Eagle-Photos.html


should look like using the RewriteRule correctly?
I put this (below) but it doesn't seem to work correctly:
RewriteRule ^stock/thumbnails-42-Songbirds-an$ http://www.skolaiimages.com/stock/thumbnails-42-Songbirds-and-Passerines-photos.html [R=permanent,L]


Thank you.

Cheers

Carl

lucy24

10:40 pm on Apr 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



do you have a recommendation for an "intelligent code-aware text editor" for mac?

I use SubEthaEdit, but there are others. A lot of people use BBEdit or TextWrangler. I adopted SEE years ago because I needed to be able to switch line-ending format on the fly (between unix/OSX \n or LF alone and DOS/Windows \r\n or CRLF) for e-books. It's also terrific at switching file encodings, either by changing or reinterpreting. Allows me to look at a string of gibberish and say "Oh, I get it, you've got UTF-16 text being interpreted as DOS Korean" ;)

RewriteRule ^Bio - [G]

Do I remove the RewriteCond %{REQUEST_FILENAME} !-f section as well, or just the line starting RewriteRule?

Since the entire directory is gone, you don't need the !-f or !-d conditions. You already know they don't exist.

Top-level directory, so no leading slash.

So "top-level directory" means the first level inside the root, correct? Like www.example.com/Bio/ etc?

Right.

so anything that says <IfModule mod_rewrite.c> I can/should remove?

Yup. But again: just those <envelopes> not their contents. Even Apache says the same thing. Just saw it the other day.

:: shuffling papers ::

The Forums will probably eat the fragment part of the link [httpd.apache.org] --"IfModule" about halfway down the page:
This section should only be used if you need to have one configuration file that works whether or not a specific module is available. In normal operation, directives need not be placed in <IfModule> sections.

And, at the top of the page in the first section:
This directive [<IfModule>] should only be used if you need your configuration file to work whether or not certain modules are installed. It should not be used to enclose directives that you want to work all the time, because it can suppress useful error messages about missing modules.

g1smd

11:38 pm on Apr 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Including those containers is something that many open source packages do by default in order to reduce support requests. Omit them.
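
A before-and-after sketch of what removing the envelope means. The two forms are alternatives, not lines to keep side by side, and the rule inside is only a placeholder. One caveat that is an assumption about the plugins in use rather than something from this thread: a CMS or cache plugin that manages its own marked block may write the envelope back the next time it regenerates that block.

```apache
# Form 1, with the envelope: if mod_rewrite were not loaded, these lines
# would be skipped without any error message.
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteRule ^Bio/ - [G]
</IfModule>

# Form 2, envelope removed, contents kept: identical behaviour when the
# module is loaded, and a visible server error if it is ever missing.
RewriteEngine On
RewriteRule ^Bio/ - [G]
```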

Aussiefoto

3:28 am on Apr 26, 2012 (gmt 0)

10+ Year Member



hey Lucy24 and g1smd

Thanks so much .. Text Wrangler and I are getting acquainted.

So this is the correct code I should use:

Redirect 301 /stock/thumbnails-42-Songbirds-an http://www.examplesite.com/stock/thumbnails-42-Songbirds-and-Passerines-photos.html


becomes this:

RewriteRule ^stock\/thumbnails\-42\-Songbirds\-an$ "http\:\/\/www\.examplesite\.com\/stock\/thumbnails\-42\-Songbirds\-and\-Passerines\-photos\.html" [R=301]


I'll remove those envelopes .. thanks for the heads up on those .. my site is being moved to a VPS server, so I'll have to give it a while before I can do that.

Once I get this stuff a little more under control, I'll try to address that 410 rewrite .... I've queried my webhost about it.

Thanks so much.

Cheers

Carl

lucy24

4:07 am on Apr 26, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You don't need to escape directory slashes in mod_rewrite. (There is another mod-- I forget which, but it has come up in this Forum-- that does, because it uses the javascript-style /{RegEx here}/ locution.) You only ever need to escape hyphens if they're in an iffy location inside of grouping brackets. Periods do need to be escaped in the pattern.

You never need to escape anything in the target.

I don't think colons : ever need to be escaped. There is probably some obscure RegEx dialect ("flavor", ugh) where they do, but I have yet to meet it.
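
Applied to the rule quoted above, a sketch with only the necessary escapes. The second rule is hypothetical and is there to show a literal period and a hyphen inside brackets.

```apache
RewriteEngine On

# Slashes and hyphens in the pattern are left alone; the target is plain text,
# not a regular expression, so nothing in it is escaped.
RewriteRule ^stock/thumbnails-42-Songbirds-an$ http://www.example.com/stock/thumbnails-42-Songbirds-and-Passerines-photos.html [R=301,L]

# Hypothetical rule: the literal period is escaped, and the hyphen sits last
# inside the brackets, where it cannot be read as a range.
RewriteRule ^old/([a-z0-9-]+)\.html$ http://www.example.com/new/$1.html [R=301,L]
```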

Aussiefoto

4:16 am on Apr 26, 2012 (gmt 0)

10+ Year Member



Hey lucy24

Thank you.

My webhost told me they'd rather all redirects be done via their cpanel, rather than manually - that would take forever, so I thought I'd try to simply copy the syntax they were using. It seems to be clunky and redundant, congruent with some of yours and g1smd's comments above. Here's a snip from the output from cpanel

RewriteCond %{HTTP_HOST} ^mysite$ [OR]
RewriteCond %{HTTP_HOST} ^www.mysite.com$
RewriteRule ^alaska\/denali\.html$ "http\:\/\/www\.mysite\.com\/alaska\/denali\-photos\.html" [R=301,L]

RewriteCond %{HTTP_HOST} ^skolaiimages.com$ [OR]
RewriteCond %{HTTP_HOST} ^www.skolaiimages.com$
RewriteRule ^alaska\/wrangells\.html$ "http\:\/\/www\.mysite\.com\/alaska\/wrangell\-st\-elias\-photos\.html" [R=301,L]



Following what you said, the correct code could simply be this, correct?


RewriteRule ^stock\/thumbnails\-42\-Songbirds\-an$ "http://www.skolaiimages.com/stock/thumbnails-42-Songbirds-and-Passerines-photos.html" [R=301, L]


using the "L" only for the last redirect in that rule?

Thanks so much

Cheers

Carl

g1smd

7:09 am on Apr 26, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have a bullet with "cPanel Developer" engraved on it just waiting for the day...

cPanel produces the absolute worst htaccess code ever. Escape the literal periods in the patterns and nothing else. CPanel omits that then escapes everything else that should not be escaped.

In the last two examples, you don't need any of the RewriteCond lines at all.
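
As a sketch, the two cPanel rules above reduced to what this advice leaves: conditions dropped, periods in the pattern escaped, nothing else escaped, target unquoted. The hostname www.example.com stands in for the real one.

```apache
RewriteEngine On

RewriteRule ^alaska/denali\.html$ http://www.example.com/alaska/denali-photos.html [R=301,L]
RewriteRule ^alaska/wrangells\.html$ http://www.example.com/alaska/wrangell-st-elias-photos.html [R=301,L]
```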

Aussiefoto

8:02 am on Apr 26, 2012 (gmt 0)

10+ Year Member



hey g1smd

Ahh .. thanks .. that's what I was kinda afraid of. Though I just MIGHT beat Cpanel out for "worst htaccess code ever". :)

So, would that RewriteRule I posted above be correct?

Thanks again.

Cheers

Carl

Aussiefoto

8:51 am on Apr 26, 2012 (gmt 0)

10+ Year Member



Oh .. and one more question .. I added the line lucy24 gave me above:

RewriteRule ^Bio - [G] 


Is there a way for me to tell if that is working correctly? What should happen when I try to go to one of the urls being 410 redirected (or however it's called) .... I still get a 404.

Thanks again.

Cheers

Carl

lucy24

9:07 am on Apr 26, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It will be correct once you've got the escapes sorted out :) And you don't need the quotation marks. (Can't remember if they will make the rule fail, make the server explode, or simply get ignored. But why take chances.)

Use the [L] flag on every single rule. It means "If you have executed this rule, you're done with mod_rewrite for now". Maybe instead of calling it "Last" you could think of it as "Let's get out of here". Flags are only applied if that specific rule has executed. This applies to all flags everywhere in mod_rewrite.

I know it seems counter-intuitive, but a Redirect still requires the L flag. Redirecting isn't a sudden-death action like [G] or [F].

Edit, as we've overlapped.

Is there a way for me to tell if that is working correctly? What should happen when I try to go to one of the urls being 410 redirected (or however it's called) .... I still get a 404.

Uh-oh. You mean that when you look up your visit in the logs, it says 404? And when you request the URL, you see your ordinary 404 page? ###. You ought to see "410" instead of 404 in the logs. And if you don't have a custom 410 page you should see the Apache default, which is very scary-looking.

Someone will think about this further. But not me-- at least not right away-- because it's 2AM ;)

g1smd

6:33 pm on Apr 26, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Clear your browser cache then try again.

In particular, use the Live HTTP Headers extension for Firefox to see what is going on.

Aussiefoto

7:39 pm on Apr 26, 2012 (gmt 0)

10+ Year Member



Hey Folks,

So now I found out that the 410 error document didn't actually exist, which is why the 404 was still being served. That's corrected, and the 'bad' urls are now producing 410s ... both when I visit AND (from what I can see in the logs) to the crawler.

It's now producing a correct 410. Thank you so much for your help.

I'll monitor it and see if this starts to ease the problem.

I do see (now) that it's still requesting a few bad urls, the same structure without /Bio/ at the start .... short of blocking them all individually, is there another way to maybe identify where they're coming from? The bing webmasters tools don't show any hint of this activity at all.

Lucy24 .. ahh, I had forgotten to cancel the escapes before the hyphens .. I can remove those. Out of curiosity, what's the problem caused by having both Redirect 301 lines and RewriteRules in the same htaccess file?

Thanks so much folks - you're awesome.

Cheers

Carl

g1smd

8:02 pm on Apr 26, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Imagine a site with two rules: one for redirecting non-www to www and another for rewriting friendly URL requests to a parameter-based internal filepath.

What should normally happen is that a request for example.com/this-thing should first be redirected to www.example.com/this-thing . The browser then makes a new request for www.example.com/this-thing and that request is internally rewritten to /index.php?page=this-thing to deliver the content. This functionality correctly happens when you use mod_rewrite for all of the rules and list redirects before rewrites.


Mixing directives from mod_alias and mod_rewrite means you can't be sure what order your rules will be processed in. Likewise if you use mod_rewrite for all of your rules but list rewrites before redirects.

Imagine a request for non-www URL example.com/this-thing where the site has an external redirect from non-www to www that happens after the internal rewrite from /this-thing to /index.php?page=this-thing has been processed.

Now a request for example.com/this-thing is internally rewritten to /index.php?page=this-thing and the redirect then kicks in using the current value of the pointer and simply redirects to www.example.com/index.php?page=this-thing exposing the internal filepath as a new URL in the process.

That's a fatal error for site indexing. Similar issues can arise when mod_alias and mod_rewrite directives are mixed in the same site configuration.
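
A minimal sketch of the safe ordering described above, using the same example.com/this-thing scenario. The exact patterns are assumptions made for illustration.

```apache
RewriteEngine On

# 1. External redirect first: non-www to www, keeping the requested path.
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

# 2. Internal rewrite second: friendly URL to the script that serves it.
#    The pattern has no period, so /index.php itself is never re-matched.
RewriteRule ^([a-z0-9-]+)$ /index.php?page=$1 [L]
```

With the two blocks swapped, a request for example.com/this-thing would be rewritten first and then redirected, exposing /index.php?page=this-thing as described.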

lucy24

10:48 pm on Apr 26, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I found out that the 410 error document didn't actually exist, which is why the 404 was still being served

That shouldn't happen, unless your host is being too smart for their own good. Another in the list of things that have been said at least 10^3 times: there's no absolute connection between the response returned by the server, and the physical page seen by the visitor.

If there is no custom page for such-and-such category of error, your logs should still show the appropriate error code-- but the visitor will see the Apache default page instead of your own nice page. Or possibly the host's default if they've got one.

A 410 is only returned if you explicitly ask for it. It isn't a built-in error like 404. So generally people don't bother about custom 410 pages until they need them. Sometimes you can even get away with using the same physical page as the custom 404, like this (example):

ErrorDocument 404 /boilerplate/missing.html
ErrorDocument 410 /boilerplate/missing.html

Aussiefoto

7:46 am on Apr 30, 2012 (gmt 0)

10+ Year Member



Hey Folks,

I just wanted to drop back in and say thanks so much. It looks like the issue with bingbot is resolving correctly, from what I can see now. Hopefully the 404 generation was the main problem with resource usage and i can get back on a shared hosting environment from my brand new and much more expensive VPS server soon.

I tried to do as you both suggested with the rewriterule instead of redirect .. it works well enough for most things ... where I'm having a real problem is with the wordpress and coppermine-gallery sections. I guess they both have their own rewrite thing going on for permalinks, etc, and so when I try to Rewrite one of those files, nothing happens. But when I use Redirect, it works correctly.

And .. to be sure.. I didn't start this ... when I first needed to do redirects, there was nothing in the htaccess file, so I got there first with my Redirect 301 lines .. then these caching plugins and whatnot came along with this RewriteRule business, and it's all a great big mess. :)

lucy - that was definitely the case with the 410 thing .. once they added that code in, it worked correctly .. and the logs showed the difference, now returning 410s instead of 404s.

I might try setting up a better 410 page in case I get some non-bot traffic there.

I wish I knew half of what you're both saying .. most of it goes right by me. I pretty much just stab at it til it bleeds red, click 'save' and I'm done. :)

Promise me neither of you will EVER look at the code on my site, ok. :)

Thanks again.

Cheers

Carl

g1smd

8:17 am on Apr 30, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The fact that Redirect "works" and RewriteRule doesn't tells me the rules are in the wrong order and/or the RegEx patterns are too general affecting more requests than they should.

lucy24

4:12 pm on Apr 30, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You said up front that you're on shared hosting. The host may have done something evil like put a generic Rewrite at the top level which grabs any requests for file names in certain formats. It's not meant to be evil; it's meant to help users who are afraid to make their own htaccess but need the rewriting features. My own host has a bit of blahblah somewhere warning that if you install certain features-- and I'm pretty sure Wordpress is one of them-- your own htaccess will no longer work.

Now, if you can manually rename your wordpress directories so they slip under the config file's radar...

Aussiefoto

5:09 pm on Apr 30, 2012 (gmt 0)

10+ Year Member



Hey lucy

Yes, I was on shared hosting, and they moved me to a VPS server on friday, until we get this resource usage down. I think it's resolved, but need to give it more time yet. I definitely had my own htaccess file before; I put it up a while ago to block some IPs and then to do some 301 redirects. Currently it's a short novel. ;)

I'm not sure what you mean "manually rename your wordpress directories so they slip under the config file's radar".

g1smd - trust me .. i'm sure there are a lot more problems with it all than those 2 you pointed to. :) Well, actually, I doubt many of the "regex" patterns are too general (at least what I've written) .. that stuff confuses me, so I tend to write ones specific to particular urls, not directories as a whole, etc.

I know, for example, the coppermine plugin I have running to rename dynamic urls to a more search engine friendly line is in a htaccess file inside the /stock/ directory ... so it makes sense that me writing RewriteRules in the root directory for those urls probably is not a good idea. Does that make sense?

I've never, ever, seen a forum on any of this stuff where people responded with their time and knowledge so generously. Thank you.

Cheers

Carl

g1smd

7:42 pm on Apr 30, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Crazy things can happen when you have multiple htaccess files in various folder levels. I always put all rules in the root file and ensure they are in the right order: certainly redirects before rewrites and each of those ordered from most specific to most general.
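
One detail that may explain the behaviour described above: by default, mod_rewrite rules in a subdirectory's htaccess replace those of the parent directories for requests inside that subdirectory, rather than adding to them. If the gallery's file has to stay where it is, a redirect for a /stock/ URL could go in that file instead. A sketch, assuming the gallery keeps its rules in /stock/.htaccess:

```apache
# Hypothetical contents of /stock/.htaccess
RewriteEngine On

# Patterns here are relative to /stock/, so the "stock/" prefix is dropped.
RewriteRule ^thumbnails-42-Songbirds-an$ http://www.example.com/stock/thumbnails-42-Songbirds-and-Passerines-photos.html [R=301,L]

# ... the gallery's existing rewrite rules would follow here ...
```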

Aussiefoto

12:20 am on May 1, 2012 (gmt 0)

10+ Year Member



I suspect that's exactly the problem, g1smd. Thanks.

Out of curiosity, how do I rewrite something like this

website.com/folder/file-name.html%3Cbr%20/%3E50


or similar? I'm seeing something like that on google's webmasters tools .. i deleted it, but I do know it drives some traffic. I tried this but it doesn't work.

# RewriteRule ^folder\/file-name\.html(.*)$ http://www.website.com/folder/file-name.html [R,L]


Didn't work. I then tried leaving the $ off the first part, after the (.*), and it nearly worked, but still tagged some kind of code <br%20/>50 or something similar.

It's not a huge deal, but I am starting to see more of these links from truncated urls, etc and thought it might be possible to resolve them correctly?

Thanks again.

Cheers

Carl