Forum Moderators: open
There are numerous links to the site from other web sites that are already indexed by Google, and those links have existed for months. Still, the site fails to be indexed.
I have emailed Google and they responded by saying that the site is not penalized.
I am trying to generate a checklist to determine what might be the issue for the site not being indexed. I understand that there are workarounds for some of the causes for not being indexed, but I just want to make the checklist so that we can determine what exactly might be causing the issue. I am seeking all possible causes. Everyone is welcome to respond. Here is a start.
--- Checklist for Denial ---
Are the following true? These are reasons why your site MIGHT NOT get indexed.
1) Cookies required (disable cookies, then try to use your site)
2) Dynamic pages with "?" existing in the URL
3) Robots.txt file denying the bot
4) Previously banned or penalized site and/or domain
5) duplicate content/mirror site or *VERY* similar content
6) DNS problem at Google
7) HTTP Status Code other than 200 OK (like 301 redirect)
8) Special characters in the domain name (not sure about this one... Japanese, Chinese & Korean characters. See Takagi's post below.)
9) no default file in root or no read permission
10) server down while deepbot wanted to visit the site
11) server too slow (time out)
12) No incoming links (orphan page syndrome)
13) Not been around long enough (at least 3 months before panic button should be pushed)
14) Session IDs in the URL (new URL with "af654d6s1f6asd51s6f3f13dsf8se6312sd" which keeps changing)
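For items 2 and 14 above, a small script can help you eyeball your own URLs. This is only a hypothetical helper; the function name, the three-parameter threshold, and the "random-looking token" regex are my own guesses for illustration, not anything Google has published:

```python
# Hypothetical check for items 2 and 14: flag URLs whose query strings
# look risky (many GET variables, or a long random-looking value that
# may be a session ID). Thresholds are illustrative guesses.
import re
from urllib.parse import urlsplit, parse_qsl

def risky_query(url, max_params=3):
    params = parse_qsl(urlsplit(url).query)
    too_many = len(params) > max_params
    # 20+ characters of lowercase alphanumeric soup is a common
    # shape for a session ID baked into the URL
    session_like = any(re.fullmatch(r"[0-9a-z]{20,}", v) for _, v in params)
    return too_many or session_like

print(risky_query("http://example.com/page?cat=2&item=5"))
# → False
print(risky_query("http://example.com/page?sid=af654d6s1f6asd51s6f3f13dsf8se6312sd"))
# → True
```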
Good work guys... keep'em coming!
Thanks for your help.
Justin
[edited by: jtoddv at 3:14 pm (utc) on April 21, 2003]
1) No. You don't have to embed cookies in your pages to be listed. On the other hand, if you send cookies with your pages, Google won't mind. :) Either way, it doesn't matter at all.
2) Google crawls and adds to the index dynamic pages with "?" (but you should take care with the "id" variable - like in "link?id=" - sometimes a sign of session variables). You should limit your GET variables to three.
3) Take a look in your root directory and, if you have a robots.txt file, open it with Notepad and paste the content here. We'll tell you if Googlebot is blocked or not.
4) This happens indeed. If the domain was used to spam Google, you might have it blocked. One hint is to install the Google Toolbar and see if your PageRank is gray or zero.
There might be other reasons of course. For example, do you have a large number of domains parked on that particular site?
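If you want to test the file yourself before pasting it here, Python's standard library ships a robots.txt parser. A minimal sketch, where the Disallow rule and example.com URLs are made up for the demo:

```python
# A minimal sketch: paste your own robots.txt content into the string
# below and see what Googlebot is allowed to fetch. The rule shown
# here is a made-up example.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Googlebot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Googlebot may fetch the homepage, but not anything under /private/
print(parser.can_fetch("Googlebot", "http://example.com/"))
# → True
print(parser.can_fetch("Googlebot", "http://example.com/private/x"))
# → False
```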
Tenita
-> duplicate content/mirror site
-> DNS problem at Google
-> HTTP Status Code other than 200 OK (like 301 redirect)
-> Special character in domainname (not sure about this one)
-> no default file in root or no read permission
-> server down while deepbot wanted to visit the site
-> server too slow (time out)
add to your list:
5) No incoming links (orphan page syndrome)
6) Not been around long enough (update is once a month, so needs link to follow, then crawl, then update... easily could be 6 weeks)
7) PHP session IDs. This can really penalise your site, as every time Google comes [no cookie keeping] it gets given a new URL with af654d6s1f6asd51s6f3f13dsf8se6312sd in it (which keeps changing).
8) *Very* similar content to another domain - Google will give the oldest/most linked site the listing, and remove the other as duplication.
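Nobody outside Google knows exactly how they detect duplicates, but the idea behind point 8 can be illustrated with a toy word-overlap (Jaccard) score. The sample page texts are invented, and real duplicate detection is far more sophisticated:

```python
# Toy illustration of duplicate-content detection: the fraction of
# shared words between two pages. Google's real method is not public.
def jaccard(text_a, text_b):
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b)

page1 = "cheap widgets for sale buy widgets today"
page2 = "cheap widgets for sale order widgets now"

# 4 shared words out of 8 distinct words total
print(jaccard(page1, page2))
# → 0.5
```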
As you can read in the threads Japanese-language domain names [webmasterworld.com] & Domain names in Chinese [webmasterworld.com], it is possible to have a domain name with Japanese, Chinese & Korean characters. I have never seen such a domain in Google's SERPs. I know that in Japan it is extremely rare to have a domain with these characters. I can imagine that Google would have problems with these domains.
Last 2 for today (he, it's midnight here),
-> nofollow in META tag on all pages linking to you.
<META NAME="robots" CONTENT="INDEX,NOFOLLOW">
-> Problem processing the file of homepage (corrupted file/encrypted file/unknown file type)
It wouldn't surprise me if an HTML file with a typo in an important tag (e.g. "<BODDY>") could cause the page not to be indexed.
Google cannot read the contents of a password protected PDF
Only a few years ago Google started to index files like Word, Excel, PDF etc. I don't know what Google does with HDML (mobile phone) files or other unusual files.
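To see which robots directives a page like the one above actually sends, you can pull the META tag out with a parser. A rough sketch using Python's stdlib HTMLParser; the class name and sample HTML are mine, and real-world pages may need a more forgiving parser:

```python
# Rough sketch: extract robots directives from a page's META tags.
# HTMLParser lowercases tag and attribute names, so the uppercase
# META/NAME/CONTENT in the sample still matches.
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "meta" and d.get("name", "").lower() == "robots":
            self.directives += [t.strip().lower()
                                for t in d.get("content", "").split(",")]

html = '<html><head><META NAME="robots" CONTENT="INDEX,NOFOLLOW"></head></html>'
finder = RobotsMetaFinder()
finder.feed(html)
print(finder.directives)
# → ['index', 'nofollow']
```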
What about:
<meta http-equiv="pragma" content="no-cache" />
Would this cause an issue with Google?
We are experiencing strange things with the site. The site seems to be listed, but whenever you search for the site, i.e. "domain.com" (no quotes), in Google, nothing comes up. However, we have found domain.com and one other page for the domain listed in some SERPs. Why is this?
There is also no PR listed for the pages in the toolbar. Trying to access the cached page from the toolbar does not work either, but if you click the cache link next to the listing it works?
Justin