
Forum Moderators: Robert Charlton & aakk9999 & andy langton & goodroi


Does Google crawl every subfolder?

11:55 am on Jul 19, 2011 (gmt 0)

New User

5+ Year Member

joined:June 27, 2011
posts: 13
votes: 0


Hi all, I have a question about search engine behavior when crawlers go through our website's sub-folders.

I have URLs of the form www.domain.com/a/b/c/d, and I only have content in sub-folders c and d (trust me, I have a website like this; the first two levels mainly tell the system which content to pick up from the database, so they work like parameters).

My sitemap of course doesn't include sub-folders a and b, and there are no links whatsoever to sub-folders a and b.

When users try to access sub-folder a or b, they get a 404 page.

1. My question is: do crawlers index / check every level of the path? For example:
- www.domain.com/a
- www.domain.com/a/b
- www.domain.com/a/b/c
- www.domain.com/a/b/c/d

2. If they do check each sub-folder and find that a and b return 404, will this harm my website?

Thank you so much for any advice and help.
12:43 pm on July 19, 2011 (gmt 0)

Junior Member

5+ Year Member

joined:Jan 22, 2011
posts:96
votes: 0


My sitemap of course doesn't include sub-folders a and b, and there are no links whatsoever to sub-folders a and b.


If that's the case, they won't be crawled.
1:27 pm on July 19, 2011 (gmt 0)

Administrator from US 

WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:June 21, 2004
posts:3114
votes: 98


Actually they might be crawled. Google uses multiple ways to find and crawl content.

If you want the content indexed then you should link to it to greatly increase your chances. If you don't want the content to be indexed use robots.txt.
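For reference, a robots.txt along those lines might look like the sketch below (using the placeholder folder names from the original question). One caveat worth noting: Disallow is prefix-based, so blocking /a would also block the real content at /a/b/c/d unless a longer Allow rule re-opens it, and Allow is an extension that Googlebot honors but not every crawler does.

```
User-agent: *
# Disallow is prefix-based, so "Disallow: /a" also covers /a/b.
# The longer Allow rule (honored by Googlebot; longest match wins)
# re-opens the real content underneath.
Disallow: /a
Allow: /a/b/c/
```

With these rules, Googlebot would skip /a, /a/, and /a/b but still crawl /a/b/c/d.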
2:32 pm on July 19, 2011 (gmt 0)

New User

5+ Year Member

joined:June 27, 2011
posts: 13
votes: 0


@goodroi that's what I thought too; crawling the sitemap is one way, but crawlers are much cleverer than that.
Your suggestion is intriguing but hard to implement, since I create the first and second sub-folders dynamically for other content.

Still the question remains: if we don't use robots.txt to block sub-folders a and b, will crawlers index everything and flag the empty sub-folders as 404 pages?

Thank you
2:40 pm on July 19, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


It depends on whether there are links to those URLs or not. Google Webmaster Tools, for instance, will only report 404 URLs that have links. Otherwise, every site has an infinity of 404 URLs!
2:47 pm on July 19, 2011 (gmt 0)

New User

5+ Year Member

joined:June 27, 2011
posts: 13
votes: 0


@tedster, that sounds like a very logical process to me. If no links from inside my pages point to these folders, then no crawlers will go there.

Thanks, your answers are always a big help to me :)
7:03 pm on July 19, 2011 (gmt 0)

Preferred Member

10+ Year Member

joined:Apr 1, 2003
posts:438
votes: 0


Recently I moved a whole load of old pages to a new unlinked folder as a precursor to permanent deletion. Despite there being no links to that folder, somehow Google discovered it.

I can only think that I was logged into Google while I was sifting through the pages and checking the dead links within, and Google discovered it that way? Otherwise, I have Firefox + the Google Toolbar, so maybe that was it?

I must confess, I was shocked!
7:27 pm on July 19, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13210
votes: 347


Still the question remains: if we don't use robots.txt to block sub-folders a and b, will crawlers index everything and flag the empty sub-folders as 404 pages?

A folder isn't a page. It would only slap down a 404 if it were led to believe there was a named index file. So let's ask the obviously related question as long as we're here:

If a particular folder is auto-indexed-- let's say for the benefit of humans who want to paw through the photographs you've got in there-- and you've got a link to the folder, can search engines read that auto-generated index file? And, if so, will they proceed to crawl everything else in the directory?
7:54 pm on July 19, 2011 (gmt 0)

Junior Member

5+ Year Member

joined:Jan 22, 2011
posts:96
votes: 0


I read the original question as: The sitemap has no links pointing to a or b, and the whole site has no links pointing to a or b. For that reason, I doubt Google would crawl those folders unless they are linked to externally.
8:24 pm on July 19, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Given example.com/a/b/c/d, Google and others may well attempt to access the higher-level folders to see what the response is.

It's one very good reason why, when given a site using URLs like
example.com/index.php?type=5&product=103&page=23
you do NOT then implement SEF URLs like
example.com/type/5/product/103/page/23
and feed those into a rewrite. Disaster strikes when
example.com/type/5/product
is accessed.

No. Instead, you use
example.com/5-103-23
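A sketch of that kind of rewrite for Apache mod_rewrite follows. The parameter names (type, product, page) and the script name index.php come from the example above; the exact pattern is an assumption, not a drop-in rule. Because the public URL contains no intermediate "folders", there is no truncated path like /type/5/product for a crawler to stumble onto.

```apache
# .htaccess sketch: map example.com/5-103-23 back to the real script.
RewriteEngine On
RewriteRule ^([0-9]+)-([0-9]+)-([0-9]+)$ /index.php?type=$1&product=$2&page=$3 [L,QSA]
```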
8:57 pm on July 19, 2011 (gmt 0)

Preferred Member

10+ Year Member

joined:July 25, 2006
posts: 460
votes: 0


can search engines read that auto-generated index file? And, if so, will they proceed to crawl everything else in the directory?


I have Apache's auto-generated indexing enabled for a few folders for which I didn't create any static index pages. Those folders do contain web pages that have inbound links, but there are no links anywhere to the folder indexes.

Nonetheless, Yahoo guessed that there might be an index page for each folder and started requesting it. When I saw the 403 results being returned to Yahoo (back when auto-indexing was disallowed for those folders), that's when I decided to enable auto-indexing for them.

For example: the page
/2010/somepage.html (and others) exists in the folder, with inbound links, but
/2010/index.html does not.

Yahoo started requesting this, even though there are no inlinks to it:
/2010/

Now that auto-indexing is enabled, they get a web page with links to the folder contents.

The auto-generated index is just a web page like any other (but fairly plain), so, yes, search engines can definitely crawl and follow the links.

I don't recall Google making the same inference about /2010/ "probably" existing, but it could have.
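Enabling that auto-generated listing in Apache is a one-line directive per folder (a sketch, assuming mod_autoindex is loaded and the server permits Options overrides in .htaccess):

```apache
# .htaccess in /2010/ — serve mod_autoindex's generated listing
# when no DirectoryIndex file (e.g. index.html) exists,
# instead of returning 403 Forbidden.
Options +Indexes
```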
9:14 pm on July 19, 2011 (gmt 0)

Senior Member

joined:Dec 29, 2003
posts:5428
votes: 0


You'd be surprised at what Google and others seek out, links or no links (toolbar, Chrome, a temporary link, did you link to them 8 years ago? etc.). But why take the chance? Having the main folder return a 404 seems very risky to me, even if you can get away with it for now.
9:26 pm on July 19, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


If there is a folder with pages in it, then Google et al. WILL request the canonical index URL for the folder on a regular basis, even if there are no links to that URL.

That canonical URL is the one ending with a trailing slash, which does not include the index filename.
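One common way to enforce that canonical form is to 301 explicit index-file requests back to the bare trailing-slash URL. A sketch, assuming Apache and an index file named index.html (the THE_REQUEST condition is there so the rule fires only on external requests, not on the internal subrequest DirectoryIndex makes, which would otherwise loop):

```apache
RewriteEngine On
# 301 /anything/index.html to /anything/ so only the
# trailing-slash URL is ever seen and indexed.
RewriteCond %{THE_REQUEST} \s/([^?\s]*/)?index\.html[?\s]
RewriteRule ^(.*/)?index\.html$ /$1 [R=301,L]
```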
10:38 pm on July 19, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13210
votes: 347


There's got to be another variable. I've got a cluster of e-texts in the form
/title/FullTitle.html
where each named book is in its own directory. Nobody ever went looking for an index page until :: cough, cough :: I goofed when adding some relative links, leading the search engines to look for the nonexistent
ebooks/title/index.html
when they were supposed to be aiming for
/ebooks/index.html

Unhappy consequence: my htaccess will now forever have to include a line pointing people (or rather robots) to the correct place via a 301.
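That one-line patch might look something like the following (a sketch only; the directory names are taken from the example paths above and the real layout isn't given):

```apache
# 301 requests for the nonexistent per-title index pages
# back to the real e-books index.
RedirectMatch 301 ^/ebooks/[^/]+/index\.html$ /ebooks/index.html
```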