

Google News Archive Forum

This 76 message thread spans 3 pages.
Google refuses to spider site. It has been more than a year!
Google hits the index page and goes no further.
Crow_Song




msg:97809
 5:34 pm on Sep 8, 2003 (gmt 0)

I am the web developer for the Faculty of Applied Science at a Canadian university. About a year ago, I redesigned one of our department's sites. It was not the smoothest transition: we moved it to a different server, I changed the structure of the directories and the names of pages, and we even changed the domain name and IP (it was using two domain names, and we dropped one). I expected Google to take a few months to re-index the site, but now a year later, I have a lot of angry profs blaming me for Google not listing their research.
I can't figure it out. I am at a loss. The code itself is not the problem...I am using a template that I also use on several other departments. There is some server-side scripting and includes, but nothing weird. Google has recently finally assigned a PR of 6 to the homepage, and a PR of 5 to several pages one level in. But it will not spider the site, nor will it assign a page rank of more than 0 to any other pages on the site (at least 0 is an improvement! Until a few weeks ago, there was no rank at all). I watch the logs every day, and Google hits the robots.txt page, and then the index page...and that's it. It's been doing that for months and months, never going to another page. The robots page excludes only a testing directory, and nothing more (I've even tried removing it altogether).
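The log check described above (which pages the crawler actually requests) could be sketched roughly as follows. This is only an illustration: it assumes the server writes W3C extended log files with a `#Fields:` directive and the `cs-uri-stem` and `cs(User-Agent)` columns, and the sample log lines and the function name `googlebot_paths` are made up for the example.

```python
from collections import Counter

def googlebot_paths(log_lines, agent_marker="googlebot"):
    """Count URL paths requested by a crawler in W3C extended log lines."""
    fields = []
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # The directive names the columns of the entries that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#") or not fields:
            continue  # other directives, or entries before any #Fields line
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed entries
        entry = dict(zip(fields, values))
        if agent_marker in entry.get("cs(User-Agent)", "").lower():
            counts[entry.get("cs-uri-stem", "?")] += 1
    return counts

# Made-up sample entries (spaces in the user agent are logged as '+').
SAMPLE_LOG = """\
#Fields: date time c-ip cs-method cs-uri-stem sc-status cs(User-Agent)
2003-09-01 04:10:02 192.0.2.10 GET /robots.txt 200 Googlebot/2.1+(+http://www.googlebot.com/bot.html)
2003-09-01 04:10:05 192.0.2.10 GET /index.asp 200 Googlebot/2.1+(+http://www.googlebot.com/bot.html)
2003-09-01 09:30:00 192.0.2.44 GET /research/index.asp 200 Mozilla/4.0+(compatible;+MSIE+6.0)
""".splitlines()

for path, hits in googlebot_paths(SAMPLE_LOG).most_common():
    print(hits, path)
```

If the counts only ever show the robots.txt file and the index page, that matches the behaviour described in this post.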

There are tons of links to the site, and I have painstakingly contacted webmasters from sites who listed the old url. They have updated their links, but it doesn't seem to matter.

Anyone have any ideas? I have tried submitting the site to DMOZ, but they have not listed it.

Thanks for any help. I am at my wit's end.

 

athinktank




msg:97810
 8:08 pm on Sep 8, 2003 (gmt 0)

I'm just guessing here, since I don't know your domain name.. but I did a search and found a 'Faculty of Applied Science at a Canadian university' that fits your description.. you in Toronto? Once again, just guessing, but it looks as if there are a couple of domains that all point to the same content. You may be suffering from a duplicate content penalty... I don't know much about those either.. but you can search the forums for more answers...

Good luck and welcome to Webmaster World.

bcolflesh




msg:97811
 8:16 pm on Sep 8, 2003 (gmt 0)

Can we assume it's the site in the email address of your profile? If so, I'm trying to find a section that isn't ranked - which pages are you talking about, specifically? - send a Sticky...

Crow_Song




msg:97812
 8:38 pm on Sep 8, 2003 (gmt 0)

Nope - not in Toronto, but rather Queen's in Kingston. Specifically, the Mechanical department. (Am I allowed to say this?) We have created a "common theme" for all of the departments, so I certainly hope that Google won't think that it looks too much like another site. With several sites having a very similar homepage, that would not be good.

The other departments have not had any trouble being indexed...just this one. Also, other search engine bots have spidered the site like crazy. I have tried to write to Google but have not had a response other than one pointing me to a page on what not to do.

rfgdxm1




msg:97813
 9:44 pm on Sep 8, 2003 (gmt 0)

You really need a SEO to look over this personally. I'm guessing you hit a filter, as opposed to a penalty. Forget about a personal reply from Google.

kaled




msg:97814
 12:16 am on Sep 9, 2003 (gmt 0)

I don't see how the problem could be a duplicate content penalty. If the robot is not fetching the pages, Google cannot know whether they are duplicates or not.

I'm a great believer in (re)checking the obvious so here are a few obvious suggestions.

1. No hidden text (or text that Google might construe as hidden).

2. Simple HTML links rather than javascript links (that Google probably does not follow).

3. Regularly make small changes (or at least do so once after the bot visits). This may encourage the bots to look further.

4. I use frames without major problems but there have been many comments that Google does not like frames.

5. Check the headers. Keep it simple. Try an html validation tool.

6. Experiment with a few links at the bottom of the index page to a few test pages. See if you can get these pages into the Google index. Adjust such factors as file names (extensions, underscores and anything else you can think of) file sizes (some large and some small) etc.
Obviously, test pages are not desirable on a live site, but I think you will have to try it unless someone else has a better idea.

Kaled.

PS
Until the problem is sorted, I would definitely delete (or rename) the robots.txt file unless there is a good reason not to. There is a site you can use to validate this file but I don't have the url to hand.
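As a rough local check of what a robots.txt file permits, Python's standard `urllib.robotparser` can be pointed at the file's text. This is only a sketch: the robots.txt content and the paths below are hypothetical, and the standard-library parser silently ignores lines it does not understand, so it is not a validator and says nothing certain about how any particular search engine reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that excludes only a testing directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /testing/
"""

def allowed_paths(robots_txt, agent, paths):
    """Map each path to True/False: may this user agent fetch it?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(agent, path) for path in paths}

result = allowed_paths(
    ROBOTS_TXT,
    "Googlebot",
    ["/", "/research/index.asp", "/testing/draft.asp"],
)
for path, ok in result.items():
    print("allowed " if ok else "blocked ", path)
```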

jdMorgan




msg:97815
 12:30 am on Sep 9, 2003 (gmt 0)

Crow_Song,

Your robots.txt file is invalid - I'd remove it immediately.

Robots.txt validator: [searchengineworld.com...]

A Standard for Robots Exclusion: [robotstxt.org...]

Server Headers checker: [webmasterworld.com...]

Jim

caustic




msg:97816
 2:44 am on Sep 9, 2003 (gmt 0)

could be sessionids

GoogleGuy




msg:97817
 5:06 am on Sep 9, 2003 (gmt 0)

If it's the site I believe you're talking about, we have 38 pages indexed, including the home page for that host.

Doing the search mechanical engineering queen's kingston put you on the first page, for example. It didn't look like many people at the university had linked to your site though. Also, I'd recommend more text links and static html and less mouseover-y javascripty stuff. Just my two cents from a quick look..

tqatsju




msg:97818
 5:24 am on Sep 9, 2003 (gmt 0)

i'm having the same problem, (sites are in my profile). When you do a site:mydomain.com pages come up for the other site, but those pages were never on the other site, they were always our content. I'd really love to hear what googleguy has to say about it

thanks, Tom

Powdork




msg:97819
 5:54 am on Sep 9, 2003 (gmt 0)

The articles you mentioned specifically in your first post are buried in JS. That was the first thing that came to my mind.

GoogleGuy




msg:97820
 6:36 am on Sep 9, 2003 (gmt 0)

tqatsju, the mods get a little antsy when I give specific site tips, because then everybody starts asking soon afterwards. Doing a
site:yoursite.org -asdfsadf
shows 561 pages where we found urls on your site, so it's not a problem with finding your pages. I can't really give any advice on your site--every url that I visited looked kinda like pre-built generic pages, with no real content other than off-site links to affiliate programs, etc.

The general advice I'd give is to read Brett's 26-step program and try to add more content to your site--but that's good advice for any webmaster, too.

Josefu




msg:97821
 6:56 am on Sep 9, 2003 (gmt 0)

A silly question: does the googlebot ever 'give up' on reading a page? I'm having the same problem. I may have hit a 'tripwire'?

It's been three months since I was indexed (just the "/" and "robots.txt" eight times in all) and happily enough I have a PR 2 because of a few incoming links - still building there - but never anything more than the abovementioned content indexed.

I have used, as the W3C standards say I should, the 'noembed' tag (yes, that again) for non-flash browsers, and the same text on another page for those who DO have it - after writing google several times I still have no answer to my "is this penalized" question. I could also be getting a double whammy for duplicate content. I think I should yank it but have yet to hear otherwise.

I can't really afford to hire an SEO but I can't seem to get out of this jam. Eight visits, PR 2 and no spidering past the "/" - normal? I think I'm doing something wrong but can't see what.

(added) but I will read Brett's text again ...

(added added) yes, my site does seem 'google laughable' after reading all that again... but you should see some of the mails I've received about it - we're getting interviewed for it! Talk about having my a** between two chairs...

Josefu




msg:97822
 9:59 am on Sep 9, 2003 (gmt 0)

I got kind of sidetracked there, sorry. I feel mister Crow's frustration.

Freshman




msg:97823
 1:11 pm on Sep 9, 2003 (gmt 0)

I have a lot of angry profs blaming me for Google not listing their research.

I'm having it in a worse way even :(
I'm about to lose my job because of Google not listing all the site pages. Web site is the only way of hitting orders and it doesn't seem to work...
I'm at my wits' end as well and don't know what to think...

1)My site (look it up in the profile) is bilingual i.e. almost for every english page there's a russian counterpart with the same information in the other language. Sounds silly I know, but can Google really construe this as duplicate content?

2)I also got js navigation on top with normal href links to the site pages down below. What's strange - my russian pages got indexed alright - english pages didn't and that pisses off my management really strong as US market is our major objective.

3)the first href link that Gbot sees on home page (the page that was fed to Gbot at addurl.html) is link to russian home page that was followed and all russian pages got indexed. My question is: if Google followed it and indexed russian part why didn't it come back to index the rest?

4) My idea now is to arrange a site map page with all hrefs and "feed" it to Gbot in order to get all pages indexed. Has anybody practiced that before? Will that be penalized?
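A site map page of the kind described in item 4 could be generated along these lines. This is a minimal sketch, not a statement about how any search engine treats such a page; the function name `build_site_map` and the example URLs are made up for illustration.

```python
from html import escape

def build_site_map(title, pages):
    """Return a plain HTML page with one static <a href> link per (url, label) pair."""
    items = "\n".join(
        f'  <li><a href="{escape(url, quote=True)}">{escape(label)}</a></li>'
        for url, label in pages
    )
    return (
        "<html>\n<head><title>" + escape(title) + "</title></head>\n"
        "<body>\n<h1>" + escape(title) + "</h1>\n"
        "<ul>\n" + items + "\n</ul>\n</body>\n</html>\n"
    )

# Hypothetical pages for the English half of a bilingual site.
page = build_site_map(
    "Site map",
    [
        ("/en/index.html", "Home"),
        ("/en/products.html?a=1&b=2", "Products"),
    ],
)
print(page)
```

Every entry is an ordinary href link with no scripting involved, which is the point of the exercise.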

Plus there's still loads of junk when I do search for "site:www.mysite.com +www.mysite.+com" or "site:mysite.com -asdfsadf"
- a lot of urls from the last site version still haven't been deleted. (the site was redesigned 3 months ago)

Sorry to be such a bloody nuisance, GoogleGuy, but it would be just great if you replied to this.

Crow_Song




msg:97824
 1:18 pm on Sep 9, 2003 (gmt 0)

Thanks for all of the tips and help, everyone. I really appreciate the help.

GoogleGuy - I jumped for joy when I saw that you were lending your expertise and knowledge. Thanks!
Indeed, the Mech home page has finally been indexed, but the spider won't seem to explore further. Looking in the logs I see one visit, several times a month to the index page and to the robots.txt, and that's it. The home page doesn't have a whole lot of content, and does have the javascript-y stuff...but the rest of the site is packed with content. Also, the scripting on the home page hasn't seemed to have had the same detrimental effect on the faculty site, or any other department. I changed the page to straight HTML, loaded with content and left it up for more than a month, with no effect.

There are 38 pages indexed? Please note that anything beginning with "conn" is the old domain name, and is not valid. I have painstakingly contacted every site I could find with those older links and asked the webmasters to update their links, but some remain (to old pages that do not have pages on the new site). Also, sites that begin with "ferrari." or "sellens.", etc. are not part of the site, and are on different servers (though they are affiliated). Could this cause any problems?

Also...does it matter if one uses site:www.domain.com vs. site: www.domain.com (note the space). It seems to yield very different results in our case.

As for the robots file - the validator says it's okay. Thanks for the link to the checker. I do not use any session IDs, nor do I use frames or hidden text. I try to keep lots of regular html links, and I have validated the headers. Whew!

Something I forgot to note that might be important...we are using IIS 5, and the pages are ASP. Does that make a difference? When I compare the site to several others I have here that use the same template, I can't imagine why they get spidered like crazy, and this one doesn't. I am confident that it is not a code problem...but the SysAdmin and I have checked the server config, too.

Finally...how does one go about hiring a SEO? I'd be willing to invest in the help. There is one prof who has accused me of ruining his career because his research pages (which used to be listed on Google) fell off a year ago when I changed the site.

Thanks so much,
Crow

rogerd




msg:97825
 1:41 pm on Sep 9, 2003 (gmt 0)

Crow_Song, the search with the space in it will treat the query as two separate words, not a search restricted to one domain.

As far as the prof whose career you ruined: why not throw up a quick domain, like professordemento.com, and stick his stuff up there. Get a few links, and he'll be back in business.

Crow_Song




msg:97826
 2:17 pm on Sep 9, 2003 (gmt 0)

Thanks Rogerd - what I wanted to tell him (but wouldn't dare) is that anyone whose entire career is dependent on search engine rankings is nuts. Still, I understand why he's a little ticked...even if he's a bit overdramatic.

bcolflesh




msg:97827
 2:41 pm on Sep 9, 2003 (gmt 0)

There is one prof who has accused me of ruining his career...

LOL - don't you love working with academics? The real world is a scary/confusing place for them...

hutcheson




msg:97828
 3:02 pm on Sep 9, 2003 (gmt 0)

Googleguy says "with all thy getting, get links!"

Might I point out that the ODP is a valuable resource: individual professors' sites are often listable separately, if they have significant online content. The science categories are not high spam targets (except for really really stupid spammers), so the unreviewed ratio is often less than 1% (where 10% is average, and 100% is not unusual in spam magnets). This means submittals are highly appreciated, and get listed "fairly quickly" -- generally within a semester (to translate time units into a form recognizable in the Ivory Towers.)

As for ruining careers, just add a bit of hidden text on each page. But you already knew that if you've been following the BOFH technical files...

rogerd




msg:97829
 3:09 pm on Sep 9, 2003 (gmt 0)

anyone whose entire career is dependent on search engine rankings is nuts

Careful, Crow, some people here may take offense... though most will cheerfully acknowledge their insanity! ;)

Freshman




msg:97830
 3:23 pm on Sep 9, 2003 (gmt 0)

Hey guys will anybody help this poor soul?
My post was last on page1 so it apparently got ignored :(

Crow_Song




msg:97831
 3:42 pm on Sep 9, 2003 (gmt 0)

You're right Rogerd. I don't mean to offend people whose jobs are actually in the search engine business!
Oops!
Sorry, guys.

GoogleGuy




msg:97832
 3:44 pm on Sep 9, 2003 (gmt 0)

Freshman, the more I talk about a specific domain (e.g. you don't have any penalties), the more the mods worry that lots of people will start mentioning specifics--and that's against the charter of the board. So re-read all our webmaster pages if you haven't, and read Brett's essay if you haven't, and then general comments may still apply to you.

Crow_Song, javascript-y stuff isn't too harmful, but I don't remember seeing a site map. Site maps are your friend. If the domain has truly moved forever, then you should have a 301 in place from the old host to your new host--verify that is so. Someone else mentioned that individual pages can be suitable for garnering links. I noticed a few pages indexed because it looked as though they have their own links. You may want to include a static link on every page back to the root of your site.
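One way to verify the 301 mentioned above is to request a page from the old host without following redirects and look at the status code and the Location header. The sketch below uses only Python's standard library; the function names are illustrative, and any hostnames passed in are up to the reader.

```python
import http.client
from urllib.parse import urlsplit

def is_permanent_redirect_to(status, location, new_host):
    """True if a response is a 301 whose Location header points at new_host."""
    if status != 301 or not location:
        return False
    return (urlsplit(location).hostname or "").lower() == new_host.lower()

def fetch_status_and_location(host, path="/", timeout=10):
    """Send one HEAD request (redirects are not followed) and return (status, Location)."""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

With placeholder hostnames, usage would look like `status, location = fetch_status_and_location("old.example.edu")` followed by `is_permanent_redirect_to(status, location, "new.example.edu")`. A 302, or a 301 pointing somewhere else, returns False.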

But my main advice is still to get a few more links (e.g. from within the university; campus directory, etc.), and instead of the javascript-y mouseovers, go with static links without fragments (the '#' stuff I saw a lot while mousing over the root page).
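To see how many links on a page are plain static links versus fragment-only ('#') or javascript: links, a quick audit with Python's standard `html.parser` might look like this. It is only a sketch: the sample markup is invented, and treating '#' and javascript: hrefs as links a crawler may not follow is an assumption drawn from the advice in this thread, not a documented rule.

```python
from html.parser import HTMLParser

class LinkAudit(HTMLParser):
    """Sort <a href> values into static links and fragment/javascript links."""

    def __init__(self):
        super().__init__()
        self.static_links = []
        self.weak_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        if "href" not in attr_map:
            return  # named anchors and similar, not links
        href = (attr_map["href"] or "").strip()
        if not href or href.startswith("#") or href.lower().startswith("javascript:"):
            self.weak_links.append(href)
        else:
            self.static_links.append(href)

def audit_links(html_text):
    audit = LinkAudit()
    audit.feed(html_text)
    audit.close()
    return audit.static_links, audit.weak_links

# Invented sample: two mouseover/javascript links and two static text links.
SAMPLE_PAGE = """
<a href="#" onmouseover="showMenu('research')">Research</a>
<a href="javascript:openPage('people')">People</a>
<a href="/research/index.asp">Research (text link)</a>
<a href="/sitemap.asp">Site map</a>
"""

static, weak = audit_links(SAMPLE_PAGE)
print("static:", static)
print("fragment/javascript:", weak)
```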

Josefu




msg:97833
 3:54 pm on Sep 9, 2003 (gmt 0)

I must REALLY be doing something wrong if I don't even get a mention : P Some of the advice added here IS relevant to my case, so thanks everyone for that.

Cheers!

tqatsju




msg:97834
 4:35 pm on Sep 9, 2003 (gmt 0)

google guy the site mydomain.org is not our site, that's the problem, our site is site mydomain.com ddd and it is showing up as mysite.org for no reason really....

Crow_Song




msg:97835
 4:40 pm on Sep 9, 2003 (gmt 0)

Thanks GoogleGuy - I really appreciate your advice.

There is a sitemap, however - in a straight text link at the bottom of the screen. And there are straight text links on all pages. The javascript-y stuff occurs on only a few pages: the home pages of each major section. But the thousands of content pages are all text (and some graphics of course) with plenty of straight links, etc. Ironically, it is only the javascript-using homepages that appear at all in the Google index. All of the "meat" of the site is ignored by the Googlebot, despite the links I have snuck into the frequently spidered sitemap of another (PR 7) site (the parent Faculty site).

GoogleGuy - is the old conn server the problem? If I talk the powers that be into giving us the domain so we can bring it back to life (it's been gone for about 8 months) and put redirects in place, will that help?

Cheers,
Crow

kaled




msg:97836
 4:40 pm on Sep 9, 2003 (gmt 0)

I know I'm not supposed to ask GG specific questions but this is a general interest one so here goes.

When designing a new version of an old website, would you recommend keeping the old version live while the new one gets Googled? This is not pretty, and it means that there will be some duplication of content, but it might save the odd career or two.

Perhaps Google could add some guidance regarding how best to handle transitions of this sort on the Webmaster Guidelines page.

Kaled.

Crow_Song




msg:97837
 4:47 pm on Sep 9, 2003 (gmt 0)

GoogleGuy - just to be clear, which I have probably not been (being the WebmasterWorld newbie that I am), the site I am having difficulty with is listed in my Interests (in my profile).

Cheers,
Crow

Powdork




msg:97838
 5:14 pm on Sep 9, 2003 (gmt 0)

Using the sim spider it is confusing. The spider is able to read the link to the research page, and the spider is able to read the link to the activities page, and then on to the prof's pages. I'm rather stumped. The only thing that jumps out is that there have been posts about Googlebot doing strange things with the trailing slash lately. That's really just grasping at straws though.
I too was looking at the appsci pages the first time around, thus the js comment.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved