Forum Moderators: Robert Charlton & goodroi
Using this site, I noticed that my site, which is mainly a forum, dropped from its original 70,000+ indexed results to around 300+, with about 3 DCs (datacenters) still showing 80,000+.
I did some snooping around on some other forums, and the issue seems to be the same.
<snip>
3 DCs at 100 000
14 DCs at 9000
<snip>
4 DCs at 250 000
13 DCs at 30 000
<snip>
10 DCs at 500 000
7 DCs at 1 000 000+
Most of the ones with high results are from
www-gv
www-lm
www-kr
[edited by: lawman at 8:07 pm (utc) on Mar. 19, 2006]
[edit reason] No Tools, No Links Please [/edit]
You have to differentiate.
Do you have any specific university in mind, or are you speaking of universities in general?
But maybe spammers are more the issue. Spammers find forums of a certain type and then auto-bot-post thousands of links to c*sino, p*lls, and p*rn sites from junk posts. Many forum owners don't take time to clean that junk up.
Google might want to limit the damage caused to such forums by not showing them in the index, thus making them more difficult for spammers to find. Or maybe they are dropping forums that already have multiple links to bad neighbourhoods from posting signatures or, more likely, from member profile pages.
Go look at almost any vbulletin or PHPbb or Invision forum. Find all the posters with zero to five posts. I'll pretty much guarantee that 90% or more of those are just utilising their forum profile page as a free link to some spammy website. Look deeper and you'll find that those same people have done the same trick on many thousands of forums.
So, it is possible that a great swathe of forum-land is now classed as a bad neighbourhood. If you are a forum Admin, go back and delete all members with zero to two posts who have been members for more than 3 months and who have not logged in for more than 3 months. That will probably fix some 75%+ of the spam that your site is linking to.
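The cleanup criteria above can be sketched as a simple filter. This is an illustration only, not real forum-software code: the record fields (`posts`, `joined`, `last_login`) and the 90-day cutoff are assumptions standing in for whatever your members table actually holds.

```python
from datetime import datetime, timedelta

# Hypothetical member records; a real forum would query its members table.
members = [
    {"name": "spammer1", "posts": 0, "joined": datetime(2005, 6, 1),
     "last_login": datetime(2005, 6, 1)},
    {"name": "regular", "posts": 240, "joined": datetime(2004, 1, 1),
     "last_login": datetime(2006, 3, 1)},
    {"name": "newbie", "posts": 1, "joined": datetime(2006, 3, 1),
     "last_login": datetime(2006, 3, 10)},
]

def prune_candidates(members, now=datetime(2006, 3, 19)):
    """Members with 0-2 posts, registered > ~3 months ago, and
    inactive for > ~3 months -- the profile-link squatters."""
    cutoff = now - timedelta(days=90)
    return [m["name"] for m in members
            if m["posts"] <= 2
            and m["joined"] < cutoff
            and m["last_login"] < cutoff]

print(prune_candidates(members))  # ['spammer1']
```

A genuine newcomer (recent join date) and an active regular both survive the filter; only the long-dormant zero-poster is flagged for deletion.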
Many don't cater well for the "unique title and description per page" requirement.
A lot have code that does not validate and may be causing bots to falter.
.
With that type of software, duplicate content arises when a page of information can be reached in multiple ways. There is the obvious www and non-www to think about, but also many other ways caused by the way that the software has been written.
For example, a post on a vbulletin forum could be expressed as:
/forum/showthread.php?t=54321
/forum/showthread.php?t=54321&p=22446688
/forum/showthread.php?t=54321&page=2
/forum/showthread.php?mode=hybrid&t=54321
/forum/showthread.php?p=22446688&mode=linear#post22446688
/forum/showthread.php?p=22446688&mode=threaded#post22446688
/forum/showthread.php?t=34567&goto=nextnewest
/forum/showthread.php?t=87654&goto=nextoldest
/forum/showthread.php?goto=lastpost&t=54321
/forum/showpost.php?p=22446688
/forum/showpost.php?p=22446688&postcount=45
/forum/printthread.php?t=54321
and that is without introducing URLs that include the page parameter, for threads that are more than one page long, and the pp parameter for changing the default number of posts per page; either or both of which can be added to most of the URLs above too.
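One way to collapse all the variants above onto a single URL per thread is to canonicalise on the `t` parameter. This is a hedged sketch, not vBulletin's actual code: note that the `goto=nextnewest`/`nextoldest` URLs carry the *wrong* thread number, and `showpost.php` URLs carry only a post ID, so a real implementation would need a database lookup for both cases.

```python
from urllib.parse import urlparse, parse_qs

def canonical_thread_url(url):
    """Reduce a showthread.php URL variant to its canonical form.

    Returns None where the true thread id cannot be recovered from the
    URL alone (post-only URLs, and goto links would need resolving
    against the database before trusting their t= value).
    """
    parts = urlparse(url)
    query = parse_qs(parts.query)
    if parts.path.endswith("showthread.php") and "t" in query:
        return "/forum/showthread.php?t=" + query["t"][0]
    return None  # needs a post-id -> thread-id lookup

print(canonical_thread_url("/forum/showthread.php?mode=hybrid&t=54321"))
# /forum/showthread.php?t=54321
```

The same normalised URL can then be emitted as a 301 target, or in a canonical link, so the dozen spellings stop competing with each other in the index.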
.
Oh, and did I mention about duplicate content caused by the usage of session IDs too?
Do not hand out a session ID until someone is actually logging in. Session IDs are one of the biggest causes of failure in indexing forums.
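The "no session ID until login" rule amounts to a one-line check when building links. A minimal sketch, with illustrative function and field names (`link_for`, `logged_in`, the `s` parameter mimics vBulletin's session parameter):

```python
def link_for(visitor, path):
    """Attach a session ID only for logged-in members; guests and
    crawlers get the clean, canonical URL with no session parameter."""
    if visitor.get("logged_in"):
        sep = "&" if "?" in path else "?"
        return path + sep + "s=" + visitor["session_id"]
    return path

# A crawler (never logged in) always sees the same stable URL:
print(link_for({"logged_in": False}, "/forum/showthread.php?t=54321"))
# /forum/showthread.php?t=54321
```

Because the bot never receives a session parameter, every crawl of a thread lands on an identical URL instead of a fresh duplicate per visit.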
.
Another big problem is the "next" and "previous" links, which cause massive duplicate content issues because they allow a thread like /forum/showthread.php?t=54321 to be indexed as /forum/showthread.php?t=34567&goto=nextnewest and as /forum/showthread.php?t=87654&goto=nextoldest too. Additionally, if any of the three threads is bumped, the "next" and "previous" links that are indexed no longer point to the same thread, because they contain the thread number of the thread that they were ON (along with the goto parameter), not the real thread number of the thread that they actually pointed to.
This is a major programming error by the people that designed the forum software. The link should either contain the true thread number of the thread that it points to, or else clicking the "next" and "previous" links should go via a 301 redirect to a URL that includes the real true canonical thread number of the target thread.
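The 301 fix described above can be sketched as follows. The thread ordering here is a hypothetical stand-in for a database query by bump date, and the function names are illustrative; the point is that the response is a redirect to the *target* thread's true canonical URL, never the goto URL itself.

```python
# Hypothetical bump order, newest first (a real forum would query by
# last-post date). Wrap-around/end-of-list handling is omitted.
threads_by_recency = [87654, 54321, 34567]

def resolve_goto(current_thread, goto):
    """Resolve a goto=nextnewest/nextoldest click to a 301 pointing at
    the canonical URL of the thread it actually leads to."""
    i = threads_by_recency.index(current_thread)
    if goto == "nextnewest":
        target = threads_by_recency[i - 1]
    else:  # "nextoldest"
        target = threads_by_recency[i + 1]
    return 301, "/forum/showthread.php?t=%d" % target

print(resolve_goto(34567, "nextnewest"))
# (301, '/forum/showthread.php?t=54321')
```

Even if thread order later changes, the indexed URL is always the stable `t=` URL of the destination, so bumping a thread can no longer silently repoint indexed links.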
They are presenting hundreds of millions of almost identical "you are not logged in" error messages, many thousands of each per site. Search engines do not ever need to see those pages.
Get a <meta name="robots" content="noindex"> tag on all those pages, use the rel="nofollow" attribute on all links that point to such pages, and/or disallow those pages in the robots.txt file. A combination of those things is usually the best way.
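As an illustration of the robots.txt part of that combination, a fragment along these lines would keep crawlers out of the worst offenders. The paths are examples in the vBulletin style used above, not a definitive list; check which scripts your own forum exposes before copying anything.

```
# robots.txt -- illustrative disallows for a vBulletin-style forum
User-agent: *
Disallow: /forum/printthread.php
Disallow: /forum/showpost.php
Disallow: /forum/login.php
Disallow: /forum/register.php
Disallow: /forum/member.php
```

Remember that robots.txt only stops crawling, not indexing of the bare URL; the `noindex` meta tag on the pages themselves is what keeps them out of the results entirely.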
You could use robots.txt to disallow the "parameter-based" URLs but that isn't always as clean. They might still appear as URL-only entries in the search results.
I have gone through and made them completely SE friendly. The only way you see variables or IDs is if you're a visitor. SEs see clean /forums/post2312.html files instead of /forums/post.php?id=2312&uid=n309jd03k...
This is the best way to create your own duplicate content problem. How do you think people will link to your forum posts? With the link they see in the address bar, so all incoming links will be of the form /forums/post.php?... whereas Googlebot on its normal crawls sees the same pages with a clean .html name.
Also, don't forget the toolbar and Google's Mediabot. I have seen more than once that Googlebot tried to index a page which was definitely not visible from the outside. The only reference Google could have had was because I viewed the page in my browser with the toolbar installed. Of course, the toolbar will see the user's URL, not the search engine URL.
I also don't trust the Mediabot. Although it doesn't crawl pages to store them in the index, it would certainly be possible that the URL list created by this bot is now and then compared with the URL list the regular bot finds. Mediabot (just as the toolbar) sees the user URL, not your manually created search engine friendly URL.
What would Google do when Mediabot and the toolbar consistently see different URLs from a site than Googlebot does? Probably drop that site from the index because of suspected cloaking or duplicate content.
"This is the best way to create your own duplicate content problem. How do you think people will link to your forum posts? With the link they see in the address bar, so all incoming links will be of the form /forums/post.php?... whereas Googlebot on its normal crawls sees the same pages with a clean .html name."
Look at the address bar on this page. Then check how many pages are indexed here.
EDIT: Sorry. I thought you were implying the rewrites wouldn't work. I didn't see his "visitor" part. Why he's doing that I have no idea.
Look at the address bar on this page. Then check how many pages are indexed here.
Totally different situation. WebmasterWorld serves SE friendly URLs to both users and search engines which is good and done by many sites. According to message #26, wfernley is feeding SE friendly URLs to the search engines, but SE unfriendly URLs to the visitor. Or maybe my interpretation of "The only way you see variables or id's is if your a visitor" is different than yours?
All incoming links which carry PR value will point to SE-unfriendly URLs. Yet the bot only sees the files with the .html extension. Where does this incoming PR go?
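One common resolution of the split described above is to serve the rewritten URL to *everyone* and 301 the old parameter URL onto it, so PR from inbound links consolidates onto one address. A minimal sketch, with a hypothetical URL scheme matching the examples in this thread:

```python
from urllib.parse import urlparse, parse_qs

def handle(url):
    """Serve clean URLs directly; 301 legacy parameter URLs onto them
    so inbound link value consolidates on one canonical address."""
    parts = urlparse(url)
    if parts.path == "/forums/post.php":
        post_id = parse_qs(parts.query)["id"][0]
        return 301, "/forums/post" + post_id + ".html"
    return 200, url

print(handle("/forums/post.php?id=2312&uid=n309jd03k"))
# (301, '/forums/post2312.html')
```

With this in place the toolbar, Mediabot, visitors, and Googlebot all converge on the same .html URL, and the cloaking/duplicate-content ambiguity disappears.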
[added]We posted at the same time :) I'll leave this post here for clarity to the other readers[/added]
Go look at almost any vbulletin or PHPbb or Invision forum. Find all the posters with zero to five posts. I'll pretty much guarantee that 90% or more of those are just utilising their forum profile page as a free link to some spammy website...
...If you are a forum Admin, go back and delete all members with zero to two posts that have been a member for more than 3 months and who have not logged in for more than 3 months. That will probably fix some 75% plus of the spam that your site is linking to.
That is not the solution. You can do so many other things to a sig or post link:
1) Make it only visible to logged in members.
2) Add the rel="nofollow" attribute to all outbound links.
3) Use a CGI outbound script (this is what I do) and disallow the cgi folder in robots.txt; works well.
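The outbound-script idea in (3) boils down to two pieces: rewriting external links to go through a script in a robots.txt-disallowed folder, and having that script redirect to the real destination. A sketch under those assumptions; the /cgi-bin/out.py path and function names are illustrative, not any particular forum plugin:

```python
from urllib.parse import quote, unquote

def outbound_link(url):
    """Rewrite an external link to pass through the blocked folder, so
    crawlers (which obey the Disallow on /cgi-bin/) never follow it."""
    return "/cgi-bin/out.py?url=" + quote(url, safe="")

def outbound_redirect(encoded):
    """What the script itself does: decode and redirect to the target."""
    return 302, unquote(encoded)

print(outbound_link("http://example.com/"))
# /cgi-bin/out.py?url=http%3A%2F%2Fexample.com%2F
```

Paired with `Disallow: /cgi-bin/` in robots.txt, no crawlable, PR-carrying link to the external site ever exists on the page.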
I wasn't talking about spammy posts. I was talking about the thousands of Zero Post lurkers who are only there for the Link back to their own site from the Profile page. Many forums have them, because many forum Admins are not very tech savvy, and have never bothered to check exactly who is registered and what they have on their Profile page.
But yes, the rel="nofollow" tag, and so on, can help. Some forums have a policy of no links out until you have a reasonable (say, 50 or 100) number of posts. Others allow no links at all. However there are thousands of forums that are abused by spammers in multiple ways. It is wise to check things out, and dejunk at the earliest possible opportunity.
Signatures, profiles, and outbound links are easily taken care of on the fly, along with every other part you don't want followed (register, usercp, etc...). The amount of programs and plugins to do this is numerous. Those owners that don't pay attention will suffer, but that argument works for every part of SEO. This really doesn't apply here unless you've ignored SEO to begin with.
I guess what I am trying to say is, this issue of forum spamming (posting, sigs, profiles...) was dealt with long ago by almost all forum software companies and forum admins with even the smallest clue of SEO, way before the bloggers ever found their solutions.
The funny part is, the supplemental results I see now are from when I did ignore SEO. Lots of php?='s. Once I wised up and applied some of the most basic principles, my index was about 10x what it was before, but now I get thrown back into supplemental hell for my previous mistakes. It's a bit frustrating after the amount of work I've put in over the last 6 months.