Forum Moderators: open
There are numerous links to the site from other web sites that are already indexed by Google, and those links have existed for months. Still, the site fails to be indexed.
I have emailed Google and they responded by saying that the site is not penalized.
I am trying to generate a checklist to determine what might be the issue for the site not being indexed. I understand that there are workarounds for some of the causes for not being indexed, but I just want to make the checklist so that we can determine what exactly might be causing the issue. I am seeking all possible causes. Everyone is welcome to respond. Here is a start.
--- Checklist for Denial ---
Are the following true? These are reasons why your site MIGHT NOT get indexed.
1) Cookies required (disable cookies, then try to use your site)
2) Dynamic pages with "?" existing in the URL
3) Robots.txt file denying the bot
4) Previously banned or penalized site and/or domain
5) duplicate content/mirror site or *VERY* similar content
6) DNS problem at Google
7) HTTP Status Code other than 200 OK (like 301 redirect)
8) Special characters in the domain name (not sure about this one... Japanese, Chinese & Korean characters. See Takagi's post below.)
9) no default file in root or no read permission
10) server down while deepbot wanted to visit the site
11) server too slow (time out)
12) No incoming links (orphan page syndrome)
13) Not been around long enough (at least 3 months before panic button should be pushed)
14) Session IDs in the URL (new URL with "af654d6s1f6asd51s6f3f13dsf8se6312sd" which keeps changing)
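For items 2 and 14 above, a small script can help you eyeball your own URLs. This is only a hypothetical helper; the function name, the three-parameter threshold, and the "random-looking token" regex are my own guesses for illustration, not anything Google has published:

```python
# Hypothetical check for items 2 and 14: flag URLs whose query strings
# look risky (many GET variables, or a long random-looking value that
# may be a session ID). Thresholds are illustrative guesses.
import re
from urllib.parse import urlsplit, parse_qsl

def risky_query(url, max_params=3):
    params = parse_qsl(urlsplit(url).query)
    too_many = len(params) > max_params
    # 20+ characters of lowercase alphanumeric soup is a common
    # shape for a session ID baked into the URL
    session_like = any(re.fullmatch(r"[0-9a-z]{20,}", v) for _, v in params)
    return too_many or session_like

print(risky_query("http://example.com/page?cat=2&item=5"))
# → False
print(risky_query("http://example.com/page?sid=af654d6s1f6asd51s6f3f13dsf8se6312sd"))
# → True
```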
Good work guys... keep'em coming!
Thanks for your help.
Justin
[edited by: jtoddv at 3:14 pm (utc) on April 21, 2003]
1) No. You don't have to embed cookies in your pages to be listed. On the other hand, if you send cookies with your pages, Google won't mind. :) Either way, it doesn't matter at all.
2) Google crawls and adds to the index dynamic pages with "?" (but you should take care with the "id" variable - like in "link?id=" - sometimes a sign of session variables). You should limit your GET variables to three.
3) Take a look in your root directory and, if you have a robots.txt file, open it with Notepad and paste the content here. We'll tell you if Googlebot is blocked or not.
4) This happens indeed. If the domain was used to spam Google, you might have it blocked. One hint is to install the Google Toolbar and see if your PageRank is gray or zero.
There might be other reasons of course. For example, do you have a large number of domains parked on that particular site?
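If you want to test the file yourself before pasting it here, Python's standard library ships a robots.txt parser. A minimal sketch, where the Disallow rule and example.com URLs are made up for the demo:

```python
# A minimal sketch: paste your own robots.txt content into the string
# below and see what Googlebot is allowed to fetch. The rule shown
# here is a made-up example.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Googlebot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Googlebot may fetch the homepage, but not anything under /private/
print(parser.can_fetch("Googlebot", "http://example.com/"))
# → True
print(parser.can_fetch("Googlebot", "http://example.com/private/x"))
# → False
```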
Tenita
-> duplicate content/mirror site
-> DNS problem at Google
-> HTTP Status Code other than 200 OK (like 301 redirect)
-> Special character in domainname (not sure about this one)
-> no default file in root or no read permission
-> server down while deepbot wanted to visit the site
-> server too slow (time out)
add to your list:
5) No incoming links (orphan page syndrome)
6) Not been around long enough (update is once a month, so needs link to follow, then crawl, then update... easily could be 6 weeks)
7) PHP session IDs. This can really penalise your site, as every time Google comes [no cookie keeping] it gets given a new URL with af654d6s1f6asd51s6f3f13dsf8se6312sd in it (which keeps changing).
8) *Very* similar content to another domain - Google will give the oldest/most linked site the listing, and remove the other as duplication.
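Nobody outside Google knows exactly how they detect duplicates, but the idea behind point 8 can be illustrated with a toy word-overlap (Jaccard) score. The sample page texts are invented, and real duplicate detection is far more sophisticated:

```python
# Toy illustration of duplicate-content detection: the fraction of
# shared words between two pages. Google's real method is not public.
def jaccard(text_a, text_b):
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b)

page1 = "cheap widgets for sale buy widgets today"
page2 = "cheap widgets for sale order widgets now"

# 4 shared words out of 8 distinct words total
print(jaccard(page1, page2))
# → 0.5
```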
As you can read in the threads Japanese-language domain names [webmasterworld.com] & Domain names in Chinese [webmasterworld.com], it is possible to have a domain name with Japanese, Chinese & Korean characters. I have never seen such a domain in Google's SERPs. I know that in Japan it is extremely rare to have a domain with these characters. I can imagine that Google would have problems with these domains.
Last 2 for today (he, it's midnight here),
-> nofollow in META tag on all pages linking to you.
<META NAME="robots" CONTENT="INDEX,NOFOLLOW">
-> Problem processing the file of homepage (corrupted file/encrypted file/unknown file type)
It wouldn't surprise me if an HTML file with a typo in an important tag (e.g. "<BODDY>") could cause the page not to be indexed.
Google cannot read the contents of a password protected PDF
Only a few years ago Google started to index files like Word, Excel, PDF etc. I don't know what Google does with HDML (mobile phone) files or other unusual files.
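To see which robots directives a page like the one above actually sends, you can pull the META tag out with a parser. A rough sketch using Python's stdlib HTMLParser; the class name and sample HTML are mine, and real-world pages may need a more forgiving parser:

```python
# Rough sketch: extract robots directives from a page's META tags.
# HTMLParser lowercases tag and attribute names, so the uppercase
# META/NAME/CONTENT in the sample still matches.
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "meta" and d.get("name", "").lower() == "robots":
            self.directives += [t.strip().lower()
                                for t in d.get("content", "").split(",")]

html = '<html><head><META NAME="robots" CONTENT="INDEX,NOFOLLOW"></head></html>'
finder = RobotsMetaFinder()
finder.feed(html)
print(finder.directives)
# → ['index', 'nofollow']
```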
What about:
<meta http-equiv="pragma" content="no-cache" />
Would this cause an issue with Google?
We are experiencing strange things with the site. The site seems to be listed, but whenever you search for the site, i.e. "domain.com" (no quotes), in Google, nothing comes up. However, we have found domain.com and one other page for the domain listed in some SERPs. Why is this?
There is also no PR listed for the pages in the toolbar. Trying to access the cached page from the toolbar does not work either, but if you click the cache link next to the listing it works?
Justin