198.65.155.*** - - [25/Oct/2004:12:05:19 -0700] "GET /Blahblah.html HTTP/1.0" 403 480 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Tucows)"
198.65.155.*** - - [25/Oct/2004:12:20:37 -0700] "GET /Blahblah.html HTTP/1.0" 403 480 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; AMEX IE)"
198.65.155.*** - - [25/Oct/2004:17:04:39 -0700] "GET /Blahblah.html HTTP/1.0" 403 480 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; QXW0334h)"
198.65.155.*** - - [25/Oct/2004:17:48:04 -0700] "GET /Blahblah.html HTTP/1.0" 403 480 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows 98; SPINWAY.COM)"
198.65.155.*** - - [25/Oct/2004:17:49:21 -0700] "GET /Blahblah.html HTTP/1.0" 403 480 "-" "Mozilla/4.0 (compatible; MSIE 4.01; Windows NT; Con Edison)"
198.65.155.*** - - [25/Oct/2004:21:57:50 -0700] "GET /Blahblah.html HTTP/1.0" 403 480 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; COM+ 1.0.2204)"
198.65.155.*** - - [25/Oct/2004:22:07:38 -0700] "GET /Blahblah.html HTTP/1.0" 403 480 "-" "Mozilla/4.0 (compatible; MSIE 4.01; MSN 2.6; Windows 98)"
198.65.155.*** - - [25/Oct/2004:22:11:27 -0700] "GET /Blahblah.html HTTP/1.0" 403 480 "-" "Mozilla/4.02 [en] (Win95; I)"
198.65.155.*** - - [25/Oct/2004:22:11:37 -0700] "GET /Blahblah.html HTTP/1.0" 403 480 "-" "Mozilla/4.0 (compatible; MSIE 4.01; AOL 5.0; Windows 95)"
198.65.155.*** - - [25/Oct/2004:22:11:46 -0700] "GET /Blahblah.html HTTP/1.0" 403 480 "-" "Mozilla/4.73 [en] (X11; I; Linux 2.2.15-4mdk i586)"

I think this has to be a website record! :)
For instance: why come back repeatedly?
What is the benefit to them?
When you get a 403 - Leave and don't come back!
When you get a 404 - Delete the file pathway from your database!
When you get a 301 - Follow the new pathway and discard the old!
Please!
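Just to spell out what I'm begging for, here's a rough sketch of the logic I'd expect from a well-behaved crawler. This is my own illustration in Python, not any engine's actual code, and the url_db dict is a made-up stand-in for a real crawler database:

import requests

def recrawl(url, url_db):
    # url_db: a plain dict of {url: status}, standing in for the
    # crawler's real URL database (hypothetical, for illustration)
    r = requests.get(url, allow_redirects=False, timeout=10)
    if r.status_code in (403, 404, 410):
        url_db.pop(url, None)                       # leave and don't come back
    elif r.status_code == 301:
        url_db.pop(url, None)                       # discard the old pathway...
        url_db[r.headers["Location"]] = "pending"   # ...and follow the new one
    else:
        url_db[url] = "ok"                          # 200 and friends: keep the file

A handful of lines. That's all I'm asking for.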
Nothing pisses me off more than to see Google, Yahoo, Slurp and several others continually re-crawling the same files over and over again. I am tired of it!
I am tired of taking the brunt of these repeated crawls, which do NOTHING but suck down the bandwidth.
As a Webmaster, I am expected to read and re-read these mucked up access_log files repeatedly.
Yet, every time I ask the question:
"Why are bots/spiders so STUPID?"
The logic seems to be: "Buck up, little camper." "These things happen." "If you wanna be crawled, you gotta put up with them."
I say: Crap!
Let those who do the unleashing do the updating. Just think how much bandwidth and time would be saved all around if only one operator took it upon him/herself to make these changes...
Oh, the two most likely to meet my criteria?
MSN and Jeeves/Teoma. Every file that was ever moved or deleted has been purged from those crawlers' databases, and it is so beautiful to see every request they make come back as a 200.
Now, if they can do it - why can't the numbnutz behind this IP Block and every bot/spider out there?
Seems sooooo simple to me...
It is not a robot; it is a tool, driven by the user's requests.
It's "bad" because it cloaks its requests (albeit quite badly), and because (in many cases) the user driving it is your competition.
On the other hand, getting these visits can be a good heads-up that there is new competition in your keyword space or that existing competition is working on their site ranking. Knowing that ahead of time can be very "good" indeed...
One reason that some 'bots repeatedly ask for 404'ed and 410'ed pages is that they are trying to "forgive" temporary server errors and avoid dropping pages because of a simple Webmaster error. If you have ever misconfigured your server because of a coding error, you may appreciate that a 'bot might have visited during the time your server was effectively down, returning 500-Server Error, 403-Forbidden, or 404-Not Found for every request, and yet your site didn't get dropped from the index.
So, repeated attempts to fetch missing files can be a minor annoyance, but it can also be a lifesaver...
However, re-trying for more than 30 days is pretty silly, IMO.
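If I were writing one, the compromise might look something like this; the 30-day figure and the bookkeeping are purely my own invention, not any engine's actual policy:

import time

GRACE_PERIOD = 30 * 24 * 3600      # forgive errors for up to 30 days

first_failure = {}                 # url -> timestamp of the first error seen

def handle_error(url, url_db):
    # Called when a fetch returns 4xx/5xx; don't drop the page right away,
    # in case the Webmaster has simply broken the server for a few days.
    now = time.time()
    first_failure.setdefault(url, now)
    if now - first_failure[url] > GRACE_PERIOD:
        url_db.pop(url, None)      # still failing after 30 days: drop it for good
        first_failure.pop(url, None)

def handle_success(url):
    first_failure.pop(url, None)   # the server recovered: reset the error clock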
Other 'bots truly are "incompetent," because their software is written by humans, and those humans make mistakes or don't read the specifications. I think Inktomi set the record for me: It kept asking for several 410'ed pages for over a year. This despite the fact that my site returned 410-Gone for HTTP/1.1 requests and 404-Not Found for HTTP/1.0 requests, as it should have.
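Incidentally, the reason for answering differently by protocol version is that 410-Gone is only defined in HTTP/1.1, so HTTP/1.0 clients should get the older 404 instead. Here is a minimal sketch of the idea as a Python/WSGI app rather than my actual server configuration; the path is just an example:

from wsgiref.simple_server import make_server

GONE = {"/Blahblah.html"}          # pages that have been permanently removed

def app(environ, start_response):
    if environ["PATH_INFO"] in GONE:
        # 410 exists only in HTTP/1.1; older clients get a plain 404
        if environ.get("SERVER_PROTOCOL") == "HTTP/1.1":
            start_response("410 Gone", [("Content-Type", "text/plain")])
        else:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"This page is gone for good.\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"OK\n"]

make_server("", 8000, app).serve_forever()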
Jim
My keywords are the very same as the day my site went live in '98. Nothing has changed.
With respect to each academic discipline, I scoured the hardcover dictionaries for those words and, in many cases, used synonyms as well as homonyms of those words.
As I understand search engines, they utilize not only the keywords I use but also the verbiage of the sites I have added to my Directory. In other words, I annotate each site using the verbiage found in the About section of the website being added.
Those are the ONLY considerations I have given to keywords.
Jim, I do understand your point about 500 errors, etc. If one's site was down and the bot took the 500 code at face value, it might not come back...and that would be bad.
However, my point is not being addressed in a way that fits my knowledge level. I will admit to ignorance.
Simple question:
Why do bots re-crawl old URLs that have been redirected via 301? You yourself helped me with this better than 18 months ago, and it was successful...for a time!
What changed that success? I had nothing to do with it.
Did they pick up the urls from somewhere else? Another database perhaps?
Is that the problem?
If so, then all 301 re-directs should remain in the .htaccess file permanently (see the example at the end of this post).
If so, then there are tons of folks who have done re-directs and who will, undoubtedly, be experiencing the same frustrations as I.
Which is to say: 301 re-directs are worthless.
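To be clear about what I mean by keeping them in .htaccess permanently: it's one line per moved file, left in place forever. Something like this, with hypothetical paths:

Redirect permanent /old-page.html http://www.example.com/new-page.html

One line of config versus an endless stream of re-crawls seems like a fair trade to me. If the bots can't even honor that, then what's the point?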