Forum Moderators: Robert Charlton & goodroi
I have noticed in my logfiles for the past few months that every time Googlebot visits my site it only makes 2 hits. This happens almost twice a month. I have a few questions; if anyone has the time to provide me with some insight I would greatly appreciate it. Not sure that this matters, but MSN and Slurp (Yahoo) index my site like crazy. Why doesn't Google like me?
Is there a reason that it only makes 2 hits on my site per visit?
Is my site not 'special' enough yet for it to index the whole site?
What will it take to get google to go past my front page?
I wonder what would cause the cached link to go away? They must have some data cached because the results page has some old content from my site.
Any guesses as to what would make the cached link disappear? Just curious.
I got that stupid 500 error to go away. You have no idea what I went through, though. I will spare you the details, but it seems that it was server specific and not the Zope platform I was using. Even some of my straight static sites with no dynamic anything, no Zope, just straight web account folders with .html files in them, were returning that 500 error. It seemed to be server specific: I did a reverse DNS lookup and put them through Poodle, and blocks of them would return the error, while on some of my other server machines even the Zope ones were fine.
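Since all of this was diagnosed from the logfiles, here is a rough sketch of how you could tally Googlebot's hits and the status codes it got. This assumes an Apache-style combined log format; the regex and user-agent string are my assumptions, not anything from the thread:

```python
import re

# Assumed Apache "combined" log format; adjust the regex if your
# server logs differently.
LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(log_lines):
    """Return (path, status) for every request whose user-agent
    mentions Googlebot, in log order."""
    hits = []
    for line in log_lines:
        m = LOG_RE.match(line)
        if m and "Googlebot" in m.group("agent"):
            hits.append((m.group("path"), int(m.group("status"))))
    return hits
```

Running this over a day's log makes the "only 2 hits" pattern, and any 500s handed to the bot, immediately visible.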
Anyway I got that fixed then I held my breath and waited for googlebot to come along. So yesterday it came and guess what. I still only got 2 hits to the front page then nothing. ARG!
To further peeve me off, I used a new domain a client had registered with me as a test. I put him in a Zope and set him up about 3 months ago. I checked his logfiles and he had never been indexed by Google. Then about a week ago he got the 2 hits from Google and it went away, same as on my site, except Google came back a few days later and crawled his whole site, and it continues to index him daily now. This drives me nuts because his site is the same as mine for site layout, platform, and server specs, and it was returning those stupid 500 errors just like mine was, until a couple of days ago when I got it fixed.
What the what? Can it be I have been blacklisted? I still can't find another example of Google removing the 'Get Google's cached version' link, like it did to mine.
I still can't find another example of Google removing the 'Get Google's cached version' link, like it did to mine.
I don't understand that sentence.
The poodle predictor looks great now.
Just need to figure out what happened now.
I see your homepage cached in google
UNDER CONSTRUCTION 28 Mar 2004
With the new date it should pick up on it now. Sometimes the Googlebots work in harmony with each other (Googlebot and Freshbot): it may have seen the new date and gone to fetch the other bot.
It could still happen.
When you put in my URL directly into Google it gives the following styled results:
Google can show you the following information for this URL:
* Find web pages that are similar to www.exampleURL.com
* Find web pages that link to www.exampleURL.com
* Find web pages from the site www.exampleURL.com
* Find web pages that contain the term "www.exampleURL.com"
Normally there is also a link that says
* Show Google's cache of www.exampleURL.com
This link does not appear with my results, and I haven't seen it missing from any other URL I have tried.
2 hits to the front page about a week ago, then it went away. I am not sure how reliable this info is, but someone who uses my site often claims that on the day Googlebot paid me a visit last month, Google's cache was updated for a day. He told me that he put in our URL and it was showing him the new content. He even emailed me in excitement that Google had updated the cache. So I looked at my logfiles and saw Google did come and see me, even though it was just the 2 hits. Then I went to look at the Google cache and it was the same old content.
Now I am not sure that he really saw what he claims to have seen, but he was adamant about it, and he did email me the day after Googlebot came to see me (which is about every 40 days, so it seems like good timing on his part) to tell me it had been updated. He was excited, as he has been following the non-caching of our site with intrigue.
If what he says is true, then Google came to my site, made 2 hits, updated the cache, then the next day reverted back to the super old cache, which still appears without the 'Show Google's cache' link. Does this sound like something that could happen?
I still see 'under construction' with no cache.
I did notice on the Poodle tool - the header checker - that your character set is showing up twice.
Maybe the page doesn't need that META tag because that is the default for the server?
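To see whether the charset really is declared twice, you could compare the HTTP Content-Type header against any META tags in the page. This is only an illustrative sketch, not a real header checker; the function name and parsing are my own:

```python
import re

# Find every charset declaration for a page: one may come from the
# HTTP Content-Type header, others from http-equiv META tags in the HTML.
CHARSET_RE = re.compile(r"charset\s*=\s*([\w-]+)", re.IGNORECASE)

def charset_declarations(content_type_header, html):
    """Return all declared charsets, header first, then any in the HTML.
    More than one entry is the duplication the header checker flags."""
    charsets = []
    if content_type_header:
        charsets += CHARSET_RE.findall(content_type_header)
    charsets += CHARSET_RE.findall(html)
    return charsets
```

If this returns two entries, dropping the META tag (when the server header already declares the charset) removes the duplicate.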
If I were you I would try 2 things.
1. Submit a reinclusion request to Google, but they will probably tell you the site is already listed.
What, do you have an empty robots.txt file?
check it with this
[searchengineworld.com...]
If you just want to allow everything, here is the code:
User-agent: *
Disallow:
It is really important that you have a valid robots.txt file before I tell you what your other option is - just making your robots.txt validate may fix the problem.
I know an empty robots.txt is supposed to be OK, but this is non-standard, and just giving Googlebot something to go by (on a dynamic site) may be all it needs.
One more thing.
I don't know much about dynamic websites or Zope, but it makes sense to me that /index.html should not return a 404; instead I would make it a 301 redirect to /index_html/.
That may be a dumb statement, like I said I don't know, just thought I would point that out.
Or make sure googlebot is asking for / and not /index.html for some reason. (it usually does the former)
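The redirect idea above can be sketched as a tiny path-resolution rule. The paths and the alias table are illustrative, assuming "/" is the canonical homepage URL:

```python
# Answer aliases of the homepage with a 301 pointing at the canonical
# "/" instead of letting them 404. Everything else is served normally.
ALIASES = {"/index.html": "/", "/index.htm": "/"}

def resolve(path):
    """Return (status, location) for a requested path."""
    if path in ALIASES:
        return (301, ALIASES[path])   # permanent redirect to canonical URL
    return (200, None)                # no redirect; serve the page
```

A 301 tells the crawler the canonical location permanently, so it stops asking for the alias.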
I still can't find another example of Google removing the 'Get Google's cached version' link, like it did to mine.
I wouldn't be too concerned with it other than that it seems google has dropped the cache for this page.
Might be something you did in the dynamic content issue? or with the header dates?
We probably fixed the 'outdated cache' problem.
Anyway I would put a bet on the robots.txt validation issue.
Your robots.txt is invalid: it is a dynamically generated page.
Here is the source of your robots.txt:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1"></HEAD>
<BODY></BODY></HTML>
empty is as if there isn't one (all robots allowed)
but I would rather see it validate.
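A quick way to catch this kind of broken robots.txt is to check that the body looks like plain "Field: value" rules rather than an HTML page. This is only a rough sketch, not the full robots exclusion standard:

```python
def looks_like_valid_robots(body):
    """Crude plausibility check for a robots.txt body: reject anything
    that starts with markup, and require every non-blank, non-comment
    line to be a 'Field: value' pair. An empty body passes, since empty
    is treated as 'all robots allowed'."""
    if body.lstrip().startswith("<"):   # HTML/XML leaked into robots.txt
        return False
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                    # skip blanks and comments
        if ":" not in line:
            return False                # not a Field: value rule
    return True
```

The dynamically generated HTML page shown above fails this check immediately, while the two-line allow-everything file passes.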
Well another issue down.
What do you think about Google reverting back from an updated cache to the old one that is up? Does that sound possible?
What do you think about Google reverting back from an updated cache to the old one that is up? Does that sound possible?
Like I said googlebot is a greedy little bugger.
I think when the file date got fixed google figured out that it had an old cache and dropped it but was not able to crawl the site (robots.txt problem) so it could have gone back and grabbed the old cache again.
Once everything works it should straighten it all out in no time.
I really would like to be able to return the favor.
You already have. I'm learning about this stuff and you provided an excellent troubleshooting example. My site has never had a problem with Googlebot.
MySQL - I have looked at it but have not found a need for it on my site.
BTW I'm still seeing the same html code in your robots.txt
Well, I am glad to hear that you are not doing this all for naught.
So I got back from a trip out to the mountains and I peeked in on the site, and what do you know: Googlebot has come to see me and made 5 hits to the site.
I checked Google's cache and lo and behold, I am indexed! I checked my robots.txt and it is of type text/plain, so I think all is well.
Thanks again for your help, Reid. Please, please think of me if you ever need a few lines of script or custom SQL - I would be so happy to help you in these areas.
Take care
Demaestro
I've been having the same problem as Demaestro. Googlebot visits almost daily, requests the robots.txt and the homepage, then promptly leaves. This has been going on for a couple of weeks (patience has never been one of my virtues). I've run the validator on W3 and it returns 3 errors, one of which is the doctype; I cannot seem to add the doctype into my header. I've also run the Poodle Predictor, and it returned a warning that "No h1, h2 or h3 Headings were found" - however, it returns this same warning for Google itself.
The robots.txt allows all, and the If-Modified-Since handling is working. I also have a site map.
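By "since modified" I take it the poster means HTTP's If-Modified-Since conditional GET. A minimal sketch of that logic (the 200/304 status codes are standard HTTP; the function itself is hypothetical):

```python
from email.utils import parsedate_to_datetime

def conditional_status(last_modified, if_modified_since):
    """Decide a conditional GET: if the client's If-Modified-Since date
    is at or after the page's Last-Modified date, return 304 Not
    Modified; otherwise return 200 and resend the page. Dates are
    HTTP-date strings, e.g. 'Sun, 28 Mar 2004 10:00:00 GMT'."""
    if if_modified_since is None:
        return 200                    # unconditional request
    lm = parsedate_to_datetime(last_modified)
    ims = parsedate_to_datetime(if_modified_since)
    return 304 if lm <= ims else 200
```

Crawlers lean on this to skip re-fetching unchanged pages, which is why a working Last-Modified date matters for crawl behavior.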
I'm stumped. Any thoughts? Go easy on me, I'm new to all this.
You don't necessarily need the doctype or h1, h2 etc. Plenty of sites get crawled without these.
User-agent: *
Disallow:
That is the entire file. No headers or anything in the robots.txt. I do not see anything in the HTML that obstructs them. I have some META keywords, but that's about it.
Is this behavior usual for AskJeeves, MSN, Become, LookSmart, etc., in addition to Google? All of these have visited and seem to do the same thing.