Forum Moderators: open
We also have 250 affiliate sites. The affiliate sites are websites for the local chapters of the organization.
All content (articles) is handled centrally in a Content Management System (CMS).
My problem is as follows: since these affiliate sites were launched 18 months ago, Google has indexed only the home page of each affiliate site, plus pages accessed via a virtual path, but no other pages. The most important missing page is the example.html file.
Now, the important point here is that local articles are accessed by a file called example.html. Other major content areas are blocked via robots.txt.
A careful analysis of our web logs revealed that example.html is not even touched by googlebot. This is not good, and my questions are: why not, and what can we do to make it available?
So you should know: all links in the site are to an "article.asp?(number)", and that file redirects to the proper template, either default.asp or example.html etc.
Also, I thought that maybe because the files were accessed via article.asp, a dynamic page type, Google was devaluing them. So two months ago I changed the links in the left nav to article.html. It still did not help.
I was also thinking it could be a session issue, because our site works with session variables. But I was informed that if the client cannot set the session, the server will. I also tried a text-based browser, lynx, with sessions disabled, and I was able to browse the site properly.
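Since you already verified this with lynx, the same sanity check can be scripted: fetch the page with no cookies and a crawler-like User-Agent, which approximates a spider's first visit. A minimal sketch, assuming Python is available; the URL is a placeholder and the User-Agent string only imitates Googlebot's:

```python
# Sketch: request a page the way a cookieless crawler would, to see
# whether the server still serves content when no session cookie is sent.
import urllib.request

def build_bot_request(url):
    """Build a request with a crawler-like User-Agent and no cookies attached."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
    )

# Placeholder URL; substitute one of your affiliate sites' pages.
req = build_bot_request("http://www.example.com/example.html")

# urllib sends no cookies unless you explicitly attach a cookie jar,
# so this approximates a first-time spider visit:
# html = urllib.request.urlopen(req).read()
```

If the page comes back fine with no cookies, sessions are probably not the blocker.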
Also: All articles get indexed by google in the headquarters site. example.html only applies to the local sites and is not used at all in the headquarters site.
Another organization runs a program like ours, and they succeed in getting all of their sites indexed. What are we doing differently that makes the difference?
Looking forward to all of your input.
Moshe
[edited by: Marcia at 3:12 am (utc) on Jan. 30, 2004]
[edit reason] Slight change to page name. [/edit]
Any chance that one of the intermediate pages in your redirect scheme is in one of those blocked areas? Or perhaps the page that links to the local articles is a blocked page?
I'd double and triple check the robots.txt in any case.
By the way, with your spider simulator, if you specify a URL in a path blocked by the robots.txt file, is it supposed to tell you that it is not accessible, like a real spider would, or is it not that exact?
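One way to take the guesswork out of the robots.txt question: Python's standard urllib.robotparser applies the same exclusion rules a well-behaved spider would. A sketch, with an inline sample robots.txt (the Disallow line and URLs are made up; paste in your real file to check your actual paths):

```python
# Sketch: ask the standard-library robots.txt parser whether specific
# URLs are fetchable for a given user agent.
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Normally you'd call rp.set_url("http://.../robots.txt") and rp.read();
# here we parse a sample file inline so the check is reproducible.
sample = """\
User-agent: *
Disallow: /magazine/
"""
rp.parse(sample.splitlines())

# A URL under the blocked path vs. one outside it:
print(rp.can_fetch("Googlebot", "http://www.example.com/magazine/example.html"))  # False
print(rp.can_fetch("Googlebot", "http://www.example.com/example.html"))           # True
```

Running your real robots.txt through a check like this would quickly confirm whether example.html, or any page in the redirect chain, is in a blocked area.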
BTW, if you use a tool like Xenu to spider your site, it won't check the robots.txt file - it will blast through the entire site. That will show that the pages are linked, but won't highlight any robots.txt errors or show linking issues involving off-limits pages. The fact that Alltheweb finds and indexes the other sites suggests that they are spiderable (unless you've put different instructions in the robots.txt for Googlebot).
One other thought occurs to me - are the affiliate sites sufficiently different that they aren't triggering some kind of duplicate content filter? What you describe sounds perfectly legit, but it could also look like a giant link ring or network of related sites to a filter looking for spammers.
<added>Looks like I was writing at the same time as George... duplicate ideas. ;)</added>
Our site works with redirects from a page off the root called article.asp (or .html) that redirects to a more explicit path. This is the architecture of our CMS.
The HTTP status returned for these article.asp links is a 302; only on the final page, after the redirect, do you get a 200. (I have a post going up soon [it's pending moderation] asking if a 301 would be better site-wide...)
Now, our parent site does not have any problems getting indexed. It's only the affiliate sites.
Would this affect anything on an already non-popular site?
One other thought occurs to me - are the affiliate sites sufficiently different that they aren't triggering some kind of duplicate content filter? What you describe sounds perfectly legit, but it could also look like a giant link ring or network of related sites to a filter looking for spammers.
That is not an issue because the robots.txt excludes the duplicate content sections. And the Home Pages are significantly different.
Any ideas, anyone?
We do this to ease server load, so we don't have to resolve the proper template (in this case, the magazine template) on every page load.
You can see my other post regarding doing redirects via 302 vs. 301.
My own inclination would be to not intentionally link to anything that didn't return a 200. Will the page load properly if you link to the second URL? If so, I'd try that, at least for some sample links.
I don't think your single-argument query strings are a big barrier to getting spidered, though Google does seem to make dynamic URLs a lower priority at times. In an ideal world, I'd use one of the IIS-compatible rewrite tools to make your URLs look like: www.example.com/magazine/12345.htm
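For what it's worth, here is a sketch of the kind of rule I mean, in the mod_rewrite-style syntax that several IIS-compatible rewrite tools borrow from Apache. The pattern and target are assumptions based on your article.asp?(number) scheme; check your own tool's documentation for the exact config file name and flags:

```
# Hypothetical rule: map the static-looking URL onto the real dynamic one,
# so links and spiders only ever see /magazine/12345.htm
RewriteRule ^/magazine/([0-9]+)\.htm$ /article.asp?$1 [L]
```

The spider then only ever requests the .htm form, and the dynamic URL stays internal.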
My first priority, though, would be to get away from the 302s, at least on a test basis for some links. Based on the data available, they are my prime suspects for your lack of spidering.