Forum Moderators: open
I have 3 websites in the top 5 for a pretty competitive keyword. I performed a search at google for "keyword site:domain" to see a list of all the URLs which google has indexed for my domain names. All I see is the homepage, which has been crawled due to inbound links. This has happened for all 3 of my sites (which maintain the #1, #3, and #4 positions).
I performed the same search for my competition at #2 and #5 and their sites are pretty thoroughly crawled. My sites are about 2 years old and have had good google placement for about 1 year. I compared code and the only striking difference I can see is that I am XHTML compliant whereas they are not. Could that be a problem?
Some of my sites do have javascript menus, but they all have site maps with hard links to every page so the spiders can crawl. Additionally, I do have a robots meta tag:
<meta name="robots" content="index,follow" />
and
<meta name="revisit-after" content="30 days" />
Any ideas? I have read other posts here mentioning XHTML problems with Google but haven't read anything conclusive....
[google.com...]
shows about 3,240 pages.
So xhtml in and of itself does not seem to prevent deep crawling.
I agree, it makes absolutely no sense. Every single valid page on the web has, by definition, a DOCTYPE.
Also, while xhtml is certainly not widespread, it is used enough that it is inconceivable that robots have problems dealing with it.
If you are bothered by your meta tags, remove both; they are useless. The default behavior of a robot is to index and follow, and no robot really cares when you want it to revisit: it revisits when it wants to.
Are your inner pages dynamic -- with more than one variable in the query string? That might do it, even with a plain link.
Do you watch your server logs for Googlebot visits? That might also give you a clue.
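If you want a quick way to do that, here is a rough Python sketch that just counts which URLs Googlebot has been requesting in an Apache-style access log. The log path and the combined log format are assumptions on my part; adjust for your own setup.

# Rough sketch: count which URLs Googlebot has been requesting.
# Assumes an Apache combined-format log where the quoted request line
# looks like "GET /page.html HTTP/1.1"; "access.log" is a placeholder path.
from collections import Counter

hits = Counter()
with open("access.log") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        try:
            request = line.split('"')[1]      # the quoted request line
            path = request.split()[1]         # the requested URL path
        except IndexError:
            continue
        hits[path] += 1

for path, count in hits.most_common(20):
    print(count, path)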
I think the problem may have been session IDs. My scripts detect a lack of cookie support and throw the session IDs into the query string. Since googlebot doesn't support cookies, it was probably getting links like
domain.com/index.html?sesid=lhkqgl4kj2h34kj2h34kjklhkhl345
I have removed URL-based session tracking and will see what happens.
I was going to install a stealth script: detect bots and, if present, bypass the entire SID system. However, I concluded that is kind of a kludge fix. So instead I applied the 80/20 rule: since most people have cookies enabled anyhow, just cater to them. If someone doesn't have cookies turned on, they can't use any of the login features on my sites, but if they don't have cookies turned on they are probably having a pretty poor experience on the net anyhow. Screw Jakob, I stopped caring about netscape a year ago, so now I will also stop caring about cookie-incapable browser settings.
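For what it's worth, the stealth script I decided against would have been something along these lines. This is only a rough Python sketch of the idea, not my actual code; the bot list and the function names are made up for illustration.

# Rough sketch of the bot-detection bypass I decided not to use.
# BOT_SIGNATURES and build_link are illustrative names, not real code from my site.
BOT_SIGNATURES = ("googlebot", "slurp", "msnbot")

def is_crawler(user_agent):
    ua = (user_agent or "").lower()
    return any(bot in ua for bot in BOT_SIGNATURES)

def build_link(path, session_id, user_agent, cookies_enabled):
    # Crawlers and cookie-capable visitors get clean URLs;
    # only cookieless human visitors would ever see the sesid in the query string.
    if cookies_enabled or is_crawler(user_agent):
        return path
    return path + "?sesid=" + session_id

With cookie-only sessions the whole question goes away, which is why I went that route instead.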
Step 1 - HTML/XHTML Validation
HTML/XHTML Validator [searchengineworld.com]
W3C CSS Validation Service [jigsaw.w3.org]
Step 2 - Robots Text Validation
Robots Text Validator [searchengineworld.com]
Step 3 - SIM Spider
Search Engine Spider Simulator [searchengineworld.com]
Step 4 - Double Check Server Headers
Server Header Checker [searchengineworld.com]
HTTP Status Codes [w3.org]: 200 for valid pages, 301 for pages that have been permanently redirected, and 404 for pages not found.
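If you would rather double-check the headers yourself instead of relying on the online tool, something like this Python snippet works; http.client does not follow redirects, so a 301 shows up as a 301. The host and path are placeholders.

# Quick header check: prints the raw status code and headers for one page.
# Replace the host and path with one of your own inner pages.
import http.client

conn = http.client.HTTPConnection("www.example.com")
conn.request("HEAD", "/some-inner-page.html")
resp = conn.getresponse()
print(resp.status, resp.reason)        # expect 200; a 301 or 404 shows up here as-is
for name, value in resp.getheaders():
    print(name + ": " + value)
conn.close()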
So far it really seems like it was the SIDs that killed it. I'm keeping an eye out for Deepbot IPs and will post updates.
Everything came through just fine except for various errors for meta tags with XHTML closers. However, I have read in other threads that it's a parser problem in the analyzer.
I wouldn't be too sure that there is a problem with SIM Spider. My understanding is that SIM Spider is telling you there may be a problem with XHTML and the closing meta.
After reading all the previous topics on this issue, I decided to do the right thing and added </meta> as opposed to the /> format. At the time I made this change, Googlebot happened to crawl the site in its entirety. It could have been a coincidence, but I like to play it safe, and </meta> validates just fine and SIM Spider sees the meta data. I'm validating against XHTML 1.1.
Googlebot has crawled and added about a dozen internal pages to its index. What's odd is that the IPs are all freshbot IPs, not deepbot IPs. Also, the pages are in the index now, and they were crawled for the first time less than 10 days ago.
Talk about fast. Wow. Now I just gotta wait for deepbot to go probing....