
Forum Moderators: Robert Charlton & aakk9999 & andy langton & goodroi


Does Google crawl every subfolder?

11:55 am on Jul 19, 2011 (gmt 0)

New User

5+ Year Member

joined:June 27, 2011
posts: 13
votes: 0


Hi all, I have a question about search engine behavior when crawlers go through our website's sub-folders.

I have URLs of the form www.domain.com/a/b/c/d, and I only have content in sub-folders c and d (trust me, I have a website like this; the first two levels mainly tell the system which content to pick up from the database, so they work like parameters).

My sitemap of course doesn't include sub-folders a and b, and there are no links whatsoever to sub-folders a and b.

When users try to access sub-folder a or b, they get a 404 page.

1. My question is: do crawlers index / check every level of the path? For example:
- www.domain.com/a
- www.domain.com/a/b
- www.domain.com/a/b/c
- www.domain.com/a/b/c/d

2. If they do check each sub-folder and find that a and b return 404, will this harm my website?

Thank you so much for any advice and help.
12:43 pm on July 19, 2011 (gmt 0)

Junior Member

5+ Year Member

joined:Jan 22, 2011
posts:96
votes: 0


My sitemap of course doesn't include sub-folders a and b, and there are no links whatsoever to sub-folders a and b.


If that's the case, they won't be crawled.
1:27 pm on July 19, 2011 (gmt 0)

Administrator from US 

WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:June 21, 2004
posts:3114
votes: 98


Actually they might be crawled. Google uses multiple ways to find and crawl content.

If you want the content indexed then you should link to it to greatly increase your chances. If you don't want the content to be indexed use robots.txt.
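For reference, a robots.txt along those lines might look like the sketch below (using the placeholder folder names from the original question). One caveat worth noting: Disallow is prefix-based, so blocking /a would also block the real content at /a/b/c/d unless a longer Allow rule re-opens it, and Allow is an extension that Googlebot honors but not every crawler does.

```
User-agent: *
# Disallow is prefix-based, so "Disallow: /a" also covers /a/b.
# The longer Allow rule (honored by Googlebot; longest match wins)
# re-opens the real content underneath.
Disallow: /a
Allow: /a/b/c/
```

With these rules, Googlebot would skip /a, /a/, and /a/b but still crawl /a/b/c/d.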
2:32 pm on July 19, 2011 (gmt 0)

New User

5+ Year Member

joined:June 27, 2011
posts: 13
votes: 0


@goodroi that's what I thought too; crawling the sitemap is one way, but crawlers are much cleverer than that.
Your suggestion is intriguing but hard to implement, since I create the first and second sub-folders dynamically for other content.

Still the question remains: if we don't use robots.txt to block sub-folders a and b, will crawlers index everything and flag the empty sub-folders as 404 pages?

Thank you
2:40 pm on July 19, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


It depends on whether there are links to those URLs or not. Google Webmaster Tools, for instance, will only report 404 URLs that have links. Otherwise, every site has an infinity of 404 URLs!
2:47 pm on July 19, 2011 (gmt 0)

New User

5+ Year Member

joined:June 27, 2011
posts: 13
votes: 0


@tedster, that sounds like a very logical process to me. If no links from inside my pages point to these folders, then no crawlers will go there.

Thanks, your answers are always a big help to me :)
7:03 pm on July 19, 2011 (gmt 0)

Preferred Member

10+ Year Member

joined:Apr 1, 2003
posts:438
votes: 0


Recently I moved a whole load of old pages to a new unlinked folder as a precursor to permanent deletion. Despite there being no links to that folder, somehow Google discovered it.

I can only think that I was logged into Google while I was sifting through the pages and checking the dead links within, and Google discovered it that way? Otherwise, I have Firefox + the Google Toolbar, so maybe that was it?

I must confess, I was shocked!
7:27 pm on July 19, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13210
votes: 347


Still the question remains: if we don't use robots.txt to block sub-folders a and b, will crawlers index everything and flag the empty sub-folders as 404 pages?

A folder isn't a page. It would only slap down a 404 if it were led to believe there was a named index file. So let's ask the obviously related question as long as we're here:

If a particular folder is auto-indexed-- let's say for the benefit of humans who want to paw through the photographs you've got in there-- and you've got a link to the folder, can search engines read that auto-generated index file? And, if so, will they proceed to crawl everything else in the directory?
7:54 pm on July 19, 2011 (gmt 0)

Junior Member

5+ Year Member

joined:Jan 22, 2011
posts:96
votes: 0


I read the original question as: The sitemap has no links pointing to a or b, and the whole site has no links pointing to a or b. For that reason, I doubt Google would crawl those folders unless they are linked to externally.
8:24 pm on July 19, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Given example.com/a/b/c/d, Google and others may well attempt to access the higher-level folders to see what the response is.

It's one very good reason why, when given a site using URLs like
example.com/index.php?type=5&product=103&page=23
you do NOT then implement SEF URLs like
example.com/type/5/product/103/page/23
and feed those into a rewrite. Disaster strikes when
example.com/type/5/product
is accessed.

No. Instead, you use
example.com/5-103-23
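A sketch of that kind of rewrite for Apache mod_rewrite follows. The parameter names (type, product, page) and the script name index.php come from the example above; the exact pattern is an assumption, not a drop-in rule. Because the public URL contains no intermediate "folders", there is no truncated path like /type/5/product for a crawler to stumble onto.

```apache
# .htaccess sketch: map example.com/5-103-23 back to the real script.
RewriteEngine On
RewriteRule ^([0-9]+)-([0-9]+)-([0-9]+)$ /index.php?type=$1&product=$2&page=$3 [L,QSA]
```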
8:57 pm on July 19, 2011 (gmt 0)

Preferred Member

10+ Year Member

joined:July 25, 2006
posts: 460
votes: 0


can search engines read that auto-generated index file? And, if so, will they proceed to crawl everything else in the directory?


I have Apache's auto-generated indexing enabled for a few folders for which I didn't create any static index pages. Those folders do contain web pages that have inbound links, but there are no links anywhere to the folder indexes.

Nonetheless, Yahoo guessed that there might be an index page for each folder and started requesting it. When I saw the 403 results being returned to Yahoo (back when auto-indexing was disallowed for those folders), that's when I decided to enable auto-indexing for them.

For example: the page
/2010/somepage.html (and others) exists in the folder, with inbound links, but
/2010/index.html does not.

Yahoo started requesting this, even though there are no inlinks to it:
/2010/

Now that auto-indexing is enabled, they get a web page with links to the folder contents.

The auto-generated index is just a web page like any other (but fairly plain), so, yes, search engines can definitely crawl and follow the links.

I don't recall Google making the same inference about /2010/ "probably" existing, but it could have.
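Enabling that auto-generated listing in Apache is a one-line directive per folder (a sketch, assuming mod_autoindex is loaded and the server permits Options overrides in .htaccess):

```apache
# .htaccess in /2010/ — serve mod_autoindex's generated listing
# when no DirectoryIndex file (e.g. index.html) exists,
# instead of returning 403 Forbidden.
Options +Indexes
```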
9:14 pm on July 19, 2011 (gmt 0)

Senior Member

joined:Dec 29, 2003
posts:5428
votes: 0


You'd be surprised at what Google and others seek out, links or no links (toolbar, Chrome, a temporary link, did you link to them 8 years ago? etc.). But why take the chance? Having the main folder return a 404 seems very risky to me, even if you can get away with it for now.
9:26 pm on July 19, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


If there is a folder with pages in it, then Google et al. WILL request the canonical index URL for the folder on a regular basis, even if there are no links to that URL.

That canonical URL is the one ending with a trailing slash, which does not include the index filename.
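One common way to enforce that canonical form is to 301 explicit index-file requests back to the bare trailing-slash URL. A sketch, assuming Apache and an index file named index.html (the THE_REQUEST condition is there so the rule fires only on external requests, not on the internal subrequest DirectoryIndex makes, which would otherwise loop):

```apache
RewriteEngine On
# 301 /anything/index.html to /anything/ so only the
# trailing-slash URL is ever seen and indexed.
RewriteCond %{THE_REQUEST} \s/([^?\s]*/)?index\.html[?\s]
RewriteRule ^(.*/)?index\.html$ /$1 [R=301,L]
```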
10:38 pm on July 19, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13210
votes: 347


There's got to be another variable. I've got a cluster of e-texts in the form
/title/FullTitle.html
where each named book is in its own directory. Nobody ever went looking for an index page until :: cough, cough :: I goofed when adding some relative links, leading the search engines to look for the nonexistent
ebooks/title/index.html
when they were supposed to be aiming for
/ebooks/index.html

Unhappy consequence: my htaccess will now forever have to include a line pointing people (or rather robots) to the correct place via a 301.
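That one-line patch might look something like the following (a sketch only; the directory names are taken from the example paths above and the real layout isn't given):

```apache
# 301 requests for the nonexistent per-title index pages
# back to the real e-books index.
RedirectMatch 301 ^/ebooks/[^/]+/index\.html$ /ebooks/index.html
```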