|In my sitemap, of course, I didn't include sub-folders a and b, and there is no link whatsoever to sub-folders a and b. |
If that's the case, they won't be crawled.
Actually they might be crawled. Google uses multiple ways to find and crawl content.
If you want the content indexed, you should link to it to greatly increase your chances. If you don't want the content to be indexed, use robots.txt.
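For reference, blocking those two sub-folders would look something like this in robots.txt (the /a/ and /b/ paths are stand-ins for the actual folder names from the question):

```
User-agent: *
Disallow: /a/
Disallow: /b/
```

Note that robots.txt blocks crawling, not indexing as such: a blocked URL can still show up in the index if other sites link to it.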
@goodroi that's what I thought too; crawling the sitemap is one way, but crawlers are much cleverer than that.
Your suggestion is intriguing but hard to implement when I create dynamic first- and second-level sub-folders for other content.
Still, the question remains: if we don't use robots.txt to block the a and b sub-folders, will crawlers index everything and flag the empty sub-folders as 404 pages?
It depends on whether there are links to those URLs or not. Google Webmaster Tools, for instance, will only report 404 URLs that have links. Otherwise, every site has an infinity of 404 URLs!
@tedster, that sounds like a very logical process to me. If no links from inside my pages go to these folders, then no crawlers will go there.
Thanks, your answers are always a big help to me :)
Recently I moved a whole load of old pages to a new unlinked folder as a precursor to permanent deletion. Despite there being no links to that folder, somehow Google discovered it.
I can only think that I was logged into Google while I was sifting through the pages and checking on dead links within, and Google discovered it that way? Otherwise, I have Firefox + the Google Toolbar, so maybe that was it?
I must confess, I was shocked!
|Still, the question remains: if we don't use robots.txt to block the a and b sub-folders, will crawlers index everything and flag the empty sub-folders as 404 pages? |
A folder isn't a page. The server would only slap down a 404 if a crawler was led to believe there is a named index file there. So let's ask the obviously related question as long as we're here:
If a particular folder is auto-indexed-- let's say for the benefit of humans who want to paw through the photographs you've got in there-- and you've got a link to the folder, can search engines read that auto-generated index file? And, if so, will they proceed to crawl everything else in the directory?
I read the original question as: The sitemap has no links pointing to a or b, and the whole site has no links pointing to a or b. For that reason, I doubt Google would crawl those folders unless they are linked to externally.
Given a URL like example.com/a/b/c/d, Google and others may well attempt to access the higher-level folders to see what the response is.
It's one very good reason why, when given a site using URLs like
example.com/index.php?type=5&product=103&page=23 you do NOT then implement SEF URLs like
example.com/type/5/product/103/page/23 and feed those into a rewrite. Disaster strikes when
example.com/type/5/product is accessed.
No. Instead, you use
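A hedged sketch of the trap described above, assuming Apache mod_rewrite and the hypothetical URL pattern from the post:

```apache
# Hypothetical .htaccess sketch of the SEF setup described above.
# The full URL rewrites cleanly:
#   /type/5/product/103/page/23 -> index.php?type=5&product=103&page=23
RewriteEngine On
RewriteRule ^type/([0-9]+)/product/([0-9]+)/page/([0-9]+)$ index.php?type=$1&product=$2&page=$3 [L]
# But because the URL looks like nested folders, crawlers will also
# try the truncated "parent" paths, e.g. /type/5/product or /type/5,
# which match no rule and come back as errors.
```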
|can search engines read that auto-generated index file? And, if so, will they proceed to crawl everything else in the directory? |
I have Apache auto-generated indexing enabled for a few folders for which I didn't create any static index pages. Those folders do contain web pages that have inbound links, but there are no links anywhere to the folder indexes.
Nonetheless, Yahoo guessed that there might be an index page for the folder containing those pages and started requesting them. When I saw the 403 results being returned to Yahoo (when auto-indexing was disallowed for those folders), that's when I decided to enable auto-indexing for those folders.
/2010/somepage.html (and others) exist in the folder, with inbound links, but
/2010/index.html does not.
Yahoo started requesting the folder index anyway, even though there are no inlinks to it.
Now that auto-indexing is enabled, they get a web page with links to the folder contents.
The auto-generated index is just a web page like any other (but fairly plain), so, yes, search engines can definitely crawl and follow the links.
I don't recall Google making the same inference about /2010/ "probably" existing, but it could have.
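For anyone wanting to reproduce that setup, enabling Apache's auto-generated folder index is a single directive (a sketch, assuming Apache with mod_autoindex and per-directory .htaccess files allowed):

```apache
# .htaccess in the folder (e.g. /2010/) whose index you want auto-generated.
# With no index file present, mod_autoindex serves a plain listing page
# instead of a 403; search engines can crawl its links like any other page.
Options +Indexes
```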
You'd be surprised at what Google and others seek, links or no links (toolbar, Chrome, a temporary link, did you link to them 8 years ago? etc.). But why take the chance? Having the main folder show a 404 seems very risky to me, even if you can get away with it for now.
If there is a folder with pages in it, then Google et al. WILL request the canonical index URL for the folder on a regular basis - even if there are no links to that URL.
That canonical URL is the one ending with a trailing slash and which does not include the index filename.
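Concretely, it's Apache's DirectoryIndex directive that maps that trailing-slash URL onto an index file behind the scenes (a sketch; /articles/ is a hypothetical stand-in):

```apache
# Which file answers the canonical trailing-slash URL:
# a request for example.com/articles/ is served by
# /articles/index.html if it exists, else /articles/index.php.
DirectoryIndex index.html index.php
```

So the URL crawlers keep requesting is the one with the trailing slash, not the one with the index filename spelled out.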
There's got to be another variable. I've got a cluster of e-texts in the form
where each named book is in its own directory. Nobody ever went looking for an index page until :: cough, cough :: I goofed when adding some relative links, leading the search engines to look for the nonexistent
when they were supposed to be aiming for
Unhappy consequence: my .htaccess will now forever have to include a line pointing people-- or rather robots-- to the correct place via a 301.
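That forever-line would look something like this (a sketch with hypothetical paths, since the actual book directories aren't shown in the post):

```apache
# .htaccess: permanently redirect the mistaken URL the crawlers learned
# to the directory they were supposed to be aiming for.
# /books/wrongplace/ and /books/somebook/ are hypothetical stand-ins.
Redirect 301 /books/wrongplace/ /books/somebook/
```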