Forum Moderators: open
We also have 250 affiliate sites. The affiliate sites are websites for the local chapters of the organization.
All content (articles) is handled centrally in a Content Management System (CMS).
My problem is as follows: since these affiliate sites were launched 18 months ago, Google has indexed only the home page of each affiliate site, plus pages accessed via a virtual path, but no other pages. The most important missing page is the example.html file.
Now, the important point here is that local articles are accessed by a file called example.html. Other major content areas are blocked via robots.txt.
A careful analysis of our web logs revealed that example.html is not even touched by googlebot. This is not good, and my questions are: why not, and what can we do to make it available?
So you should know: all links in the site are to an "article.asp?(number)", and that file redirects to the proper template, either default.asp or example.html etc.
Also, I thought that maybe because the files were accessed via article.asp, a dynamic page type, Google was devaluing them. So two months ago I changed the links in the left nav to article.html. It still did not help.
I was also thinking it could be a session issue, because our site works with session variables. But I was informed that if the client cannot set the session, the server will. I also tried a text-based browser, lynx, with sessions disabled, and I was able to browse the site properly.
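Since you already verified this with lynx, the same sanity check can be scripted: fetch the page with no cookies and a crawler-like User-Agent, which approximates a spider's first visit. A minimal sketch, assuming Python is available; the URL is a placeholder and the User-Agent string only imitates Googlebot's:

```python
# Sketch: request a page the way a cookieless crawler would, to see
# whether the server still serves content when no session cookie is sent.
import urllib.request

def build_bot_request(url):
    """Build a request with a crawler-like User-Agent and no cookies attached."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
    )

# Placeholder URL; substitute one of your affiliate sites' pages.
req = build_bot_request("http://www.example.com/example.html")

# urllib sends no cookies unless you explicitly attach a cookie jar,
# so this approximates a first-time spider visit:
# html = urllib.request.urlopen(req).read()
```

If the page comes back fine with no cookies, sessions are probably not the blocker.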
Also: All articles get indexed by google in the headquarters site. example.html only applies to the local sites and is not used at all in the headquarters site.
Another organization runs a program like ours, and they succeed in getting all of their sites indexed. What are we doing differently that makes the difference?
Looking forward to all of your input.
Moshe
[edited by: Marcia at 3:12 am (utc) on Jan. 30, 2004]
[edit reason] Slight change to page name. [/edit]
Any chance that one of the intermediate pages in your redirect scheme is in one of those blocked areas? Or perhaps the page that links to the local articles is a blocked page?
I'd double and triple check the robots.txt in any case.
By the way, with your spider simulator, if you specify a URL in a path blocked by the robots.txt file, is it supposed to tell you that it is not accessible, like a real spider would, or is it not that exact?
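One way to take the guesswork out of the robots.txt question: Python's standard urllib.robotparser applies the same exclusion rules a well-behaved spider would. A sketch, with an inline sample robots.txt (the Disallow line and URLs are made up; paste in your real file to check your actual paths):

```python
# Sketch: ask the standard-library robots.txt parser whether specific
# URLs are fetchable for a given user agent.
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Normally you'd call rp.set_url("http://.../robots.txt") and rp.read();
# here we parse a sample file inline so the check is reproducible.
sample = """\
User-agent: *
Disallow: /magazine/
"""
rp.parse(sample.splitlines())

# A URL under the blocked path vs. one outside it:
print(rp.can_fetch("Googlebot", "http://www.example.com/magazine/example.html"))  # False
print(rp.can_fetch("Googlebot", "http://www.example.com/example.html"))           # True
```

Running your real robots.txt through a check like this would quickly confirm whether example.html, or any page in the redirect chain, is in a blocked area.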
BTW, if you use a tool like Xenu to spider your site, it won't check the robots.txt file - it will blast through the entire site. That will show that the pages are linked, but won't highlight any robots.txt errors or show linking issues involving off-limits pages. The fact that Alltheweb finds and indexes the other sites suggests that they are spiderable (unless you've put different instructions in the robots.txt for Googlebot).
One other thought occurs to me - are the affiliate sites sufficiently different that they aren't triggering some kind of duplicate content filter? What you describe sounds perfectly legit, but it could also look like a giant link ring or network of related sites to a filter looking for spammers.
<added>Looks like I was writing at the same time as George... duplicate ideas. ;)</added>
Our site works with redirects from a page off the root called article.asp (or .html) that redirects to a more explicit path. This is the architecture of our CMS.
The HTTP status returned for these article.asp links is a 302; only on the final page, after the redirect, do you get a 200. (I have a post going up soon [it's pending moderation] asking if a 301 would be better site-wide...)
Now, our parent site does not have any problems getting indexed. It's only the affiliate sites.
Would this affect anything on an already non-popular site?
One other thought occurs to me - are the affiliate sites sufficiently different that they aren't triggering some kind of duplicate content filter? What you describe sounds perfectly legit, but it could also look like a giant link ring or network of related sites to a filter looking for spammers.
That is not an issue because the robots.txt excludes the duplicate content sections. And the Home Pages are significantly different.
Any ideas, anyone?
We do this to ease server load, so we don't have to resolve the proper template (in this case, the magazine template) on every page load.
You can see my other post regarding doing redirects via 302 vs. 301.
My own inclination would be to not intentionally link to anything that didn't return a 200. Will the page load properly if you link to the second URL? If so, I'd try that, at least for some sample links.
I don't think your single-argument query strings are a big barrier to getting spidered, though Google does seem to make dynamic URLs a lower priority at times. In an ideal world, I'd use one of the IIS-compatible rewrite tools to make your URLs look like: www.example.com/magazine/12345.htm
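For what it's worth, here is a sketch of the kind of rule I mean, in the mod_rewrite-style syntax that several IIS-compatible rewrite tools borrow from Apache. The pattern and target are assumptions based on your article.asp?(number) scheme; check your own tool's documentation for the exact config file name and flags:

```
# Hypothetical rule: map the static-looking URL onto the real dynamic one,
# so links and spiders only ever see /magazine/12345.htm
RewriteRule ^/magazine/([0-9]+)\.htm$ /article.asp?$1 [L]
```

The spider then only ever requests the .htm form, and the dynamic URL stays internal.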
My first priority, though, would be to get away from the 302s, at least on a test basis for some links. Based on the data available, they are my prime suspects for your lack of spidering.