
Google not deepcrawling XHTML meta tags?

No deep crawl on multiple sites leads me to believe it's an XHTML problem

         

Thanasus

1:33 pm on Jul 5, 2003 (gmt 0)

10+ Year Member



Greets everyone,

I have 3 websites in the top 5 for a pretty competitive keyword. I have performed a search at google of "keyword site:domain" to see a list of all the URLs which google has indexed for my domain names. All I see is the homepage which has been crawled due to inbound links. This has happened for all 3 of my sites (which maintain #1, #3, and #4 positions).

I performed the same search for my competition at #2 and #5 and their sites are pretty thoroughly crawled. My sites are about 2 years old and have had good google placement for about 1 year. I compared code and the only striking difference I can see is that I am XHTML compliant whereas they are not. Could that be a problem?

Some of my sites do have JavaScript menus, but they all have site maps with hard links to every page so the spiders can crawl. Additionally, I do have a robots meta tag:

<meta name="robots" content="index,follow" />

and

<meta name="revisit-after" content="30 days" />

Any ideas? I have read other posts here mentioning XHTML problems with Google but haven't read anything conclusive....

Mohamed_E

1:48 pm on Jul 5, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



w3schools.com is coded in XHTML, and a search for

[google.com...]

shows about 3,240 pages.

So XHTML in and of itself does not seem to prevent deep crawling.

Thanasus

2:06 pm on Jul 5, 2003 (gmt 0)

10+ Year Member



The only other difference I see between my sites and the competition is that they do not use a document type declaration, whereas I do. It makes no sense to me why that would cause a problem (since it is the proper doctype), but I can only draw conclusions by observing what my competition is doing differently.

Mohamed_E

2:18 pm on Jul 5, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Now it makes no sense to me ...

I agree, it makes absolutely no sense. Every single valid page on the web has, by definition, a DOCTYPE.

Also, while xhtml is certainly not widespread, it is used enough that it is inconceivable that robots have problems dealing with it.

If your meta tags bother you, remove both; they are useless. The default behavior of a robot is to index and follow, and no robot really cares when you want it to revisit: it revisits when it wants to.

tedster

7:19 pm on Jul 5, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I agree with Mohamed_E, it's not the XHTML and it's not the meta tags. So there's something else going on here.

Are your inner pages dynamic -- with more than one variable in the query string? That might do it, even with a plain link.

Do you watch your server logs for Googlebot visits? That might also give you a clue.

Thanasus

3:08 am on Jul 6, 2003 (gmt 0)

10+ Year Member



I have about 135,000 unique inner pages, all of which are dynamic. However, I have something similar to a mod_rewrite going on, so the name/value pairs are not in a query string.
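A sketch of the kind of rewrite rule that keeps name/value pairs out of the query string (illustrative only; the URL pattern, script name, and parameters here are assumptions, not the actual setup):

```apache
RewriteEngine On
# Map a static-looking URL like /widgets/42.html onto the real
# dynamic script, so crawlers never see the query string:
RewriteRule ^([a-z]+)/([0-9]+)\.html$ /index.php?cat=$1&id=$2 [L]
```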

I think the problem may have been session IDs. My scripts detect lack of cookie support and throw the session ID into the query string. Since Googlebot doesn't support cookies, it was probably getting links like

domain.com/index.html?sesid=lhkqgl4kj2h34kj2h34kjklhkhl345

I have removed URL-based session tracking and will see what happens.
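As a rough sketch, dropping the SID from outgoing links amounts to something like this (Python used purely for illustration; the `sesid` parameter name is taken from the example URL above):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_session_id(url, param="sesid"):
    """Drop the session-ID query parameter so every visitor --
    crawler or not -- sees one canonical URL per page."""
    parts = urlsplit(url)
    # Keep every query pair except the session ID itself.
    pairs = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(pairs), parts.fragment))

print(strip_session_id(
    "http://domain.com/index.html?sesid=lhkqgl4kj2h34kj2h34kjklhkhl345"))
# -> http://domain.com/index.html
```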

tedster

5:03 am on Jul 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have a good feeling that you've nailed it there. Now if only Google would settle down and start being a bit more "normal" with new pages, right?

Let us know if this does it for you -- it sounds good.

Thanasus

11:18 am on Jul 6, 2003 (gmt 0)

10+ Year Member



I'm hoping the SID was the problem. Since I made the change, Slurpcat has somehow gotten screwy and has been hitting the same 40 pages on my site over and over for the past 16 hours. Apparently, it didn't like SIDs either.

I was going to install a stealth script: detect bots and, if present, bypass the entire SID system. However, I concluded that would be kind of a kludge. So instead I applied the 80/20 rule. Since most people have cookies enabled anyhow, just cater to them. If someone doesn't have cookies turned on, they can't use any of the login features on my sites. If they don't have cookies turned on, they are probably having a pretty poor experience on the net anyhow. Screw Jakob; I stopped caring about Netscape a year ago, so now I will also stop caring about cookie-incapable browser settings.
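For the record, the rejected stealth-script approach would have looked roughly like this (a sketch only; the User-Agent substrings are illustrative, not a complete bot list):

```python
BOT_SIGNATURES = ("googlebot", "slurp", "msnbot")  # illustrative, not exhaustive

def is_known_bot(user_agent):
    """Crude User-Agent sniff for known crawlers."""
    ua = (user_agent or "").lower()
    return any(sig in ua for sig in BOT_SIGNATURES)

def use_url_session(user_agent, cookies_enabled):
    # Fall back to a SID in the query string only for real users
    # whose browsers refused the cookie -- never for crawlers.
    return not cookies_enabled and not is_known_bot(user_agent)
```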

Thanasus

6:15 am on Jul 9, 2003 (gmt 0)

10+ Year Member



---UPDATE---

Throughout the day Googlebot (64.68.82.*) crawled about a dozen of the pages on one of the sites. This is a good sign.

pageoneresults

6:29 am on Jul 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanasus, have you run your pages through the following tests?

Step 1 - HTML/XHTML Validation
HTML/XHTML Validator [searchengineworld.com]
W3C CSS Validation Service [jigsaw.w3.org]

Step 2 - Robots Text Validation
Robots Text Validator [searchengineworld.com]

Step 3 - SIM Spider
Search Engine Spider Simulator [searchengineworld.com]

Step 4 - Double Check Server Headers
Server Header Checker [searchengineworld.com]

HTTP Status Codes [w3.org]: you want 200 for valid pages, 301 for permanently redirected pages, and 404 for pages not found.
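The header check in Step 4 boils down to issuing a HEAD request and reading the status line; a minimal sketch (Python for illustration, ignoring query strings and HTTPS):

```python
import http.client
from urllib.parse import urlsplit

def fetch_status(url):
    """HEAD-request a URL and return its HTTP status code."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    conn.request("HEAD", parts.path or "/")
    status = conn.getresponse().status
    conn.close()
    return status

def describe_status(code):
    """The three codes called out above, as a crawler reads them."""
    return {200: "OK - valid page",
            301: "Moved Permanently - follow and update the index",
            404: "Not Found - drop the page"}.get(code, "other")
```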

Thanasus

7:00 am on Jul 9, 2003 (gmt 0)

10+ Year Member



I was quite certain everything validated, but I ran one of the sites through to play it safe. Everything came through just fine except for various errors on meta tags with XHTML closers. However, I have read in other threads that it's a parser problem in the analyzer.

So far it really seems like it was the SIDs that killed it. I'm keeping an eye out for deepbot IPs and will post updates.

pageoneresults

7:38 am on Jul 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Everything came through just fine except for various errors on meta tags with XHTML closers. However, I have read in other threads that it's a parser problem in the analyzer.

I wouldn't be too sure that there is a problem with SIM Spider. My understanding is that SIM Spider is telling you there may be a problem with XHTML and the closing meta.

After reading all the previous topics on this issue, I decided to do the right thing and added </meta> as opposed to the /> format. At the time I made this change, Googlebot happened to crawl the site in its entirety. It could have been a coincidence, but I like to play it safe, and </meta> validates just fine while SIM Spider sees the meta data. I'm validating XHTML 1.1.

Thanasus

5:15 pm on Jul 10, 2003 (gmt 0)

10+ Year Member



-- UPDATE --

Googlebot has crawled and added about a dozen internal pages to its index. What's odd is that the IPs are all freshbot IPs, not deepbot IPs. Also, the pages are in the index now, and they were crawled for the first time less than 10 days ago.

Talk about fast. Wow. Now I just gotta wait for deepbot to go probing....

g1smd

10:12 pm on Jul 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Look through the last two months of postings here. The current idea is that deepbot has retired and that freshbot has been given longer arms and a bigger nose. His carrier pigeon has also been put on steroids.

Thanasus

5:00 pm on Jul 17, 2003 (gmt 0)

10+ Year Member



I kinda got slammed over the past two months and did some catching up. Man, it's amazing how quickly things change around here.

Well, I now see about 200 pages coming up in the index, and Google is crawling new ones with every passing day.

Thanasus

8:51 am on Aug 19, 2003 (gmt 0)

10+ Year Member



Just as an update: Google has now indexed about 12,000 pages! Woot!