Forum Moderators: phranque
I don't have access to mod_rewrite, so the articles were all getting permalinked off index.php with a query parameter:
blogtopic.example.com/index.php?article=XX
I found some blogging software that created links of the form:
blogtopic.example.com/index.php/year/month/day/article-title/
It seemed to me that the second approach was more likely to get those old pages indexed and into the engines than the first. Is that true?
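For anyone comparing the two link styles, here is a rough sketch of how the path-style permalink can be picked apart without mod_rewrite, since everything after index.php arrives as extra path information. This is only an illustration in Python, not what any particular blogging package does; the pattern, the function name `parse_permalink`, and the sample date are assumptions.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern for /index.php/year/month/day/article-title/ links.
PERMALINK = re.compile(
    r"^/index\.php/(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<slug>[^/]+)/?$"
)


def parse_permalink(url):
    """Return the date parts and slug of a path-style permalink, or None."""
    path = urlsplit(url).path  # the query string, if any, is not part of the path
    match = PERMALINK.match(path)
    if match is None:
        return None
    return {
        "year": int(match["year"]),
        "month": int(match["month"]),
        "day": int(match["day"]),
        "slug": match["slug"],
    }
```

A query-style link such as index.php?article=XX has nothing after index.php in its path, so the function returns None for it, while the path-style link carries the date and title in the URL itself.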
Also, the blog has an RSS feed. Do any SEs use that, or do they all crawl the pages?
As for your RSS feed, if I remember correctly, Googlebot can scan XML as plain text, but it's not really worth letting the bot in - I would just exclude the RSS feed in robots.txt.
going right for the rss.xml feed file
If this is the case, then you should definitely block the RSS file in robots.txt. The change will take a little while to "register" with Googlebot, but once done, it will mean that only the true content files (those for the end user) are indexed.
User-agent: *
Disallow: /rss.xml
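If you want to check what those two lines actually block before uploading them, a small sketch with Python's standard urllib.robotparser module can help. The helper name `is_allowed` and the sample article path are assumptions for illustration; the rules are the ones quoted above.

```python
from urllib.robotparser import RobotFileParser

# The rules from the post above, held in a string instead of a live robots.txt.
RULES = """\
User-agent: *
Disallow: /rss.xml
"""


def is_allowed(path, agent="Googlebot"):
    """Report whether the given crawler may fetch the path under RULES."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(agent, "http://blogtopic.example.com" + path)
```

With these rules, `is_allowed("/rss.xml")` comes back False, while an article path such as `/index.php/2005/03/14/article-title/` stays crawlable.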
I decided not to block Google from getting the RSS file, mostly because I noticed that Google generally hits it a few minutes after I put up a new post. (I ping the usual blog update services, which must be notifying Google somewhere along the line.)
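For reference, many blog update services accept an XML-RPC call named weblogUpdates.ping that takes the blog's name and URL. The sketch below only builds that request body with Python's standard xmlrpc.client module and sends nothing; whether the services mentioned above use this exact call is an assumption, and the blog name and the helper `build_ping` are made up for illustration.

```python
import xmlrpc.client


def build_ping(blog_name, blog_url):
    """Build the XML-RPC request body for a weblogUpdates.ping call."""
    return xmlrpc.client.dumps(
        (blog_name, blog_url),
        methodname="weblogUpdates.ping",
    )


# Hypothetical blog name; the URL is the example host used in this thread.
payload = build_ping("Blog Topic", "http://blogtopic.example.com/")
```

Blogging software typically POSTs a body like this to each update service right after a new post goes up.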
It took a couple of weeks, but Google eventually started indexing my archived content as well.