| This 76 message thread spans 3 pages |
|Google refuses to spider site. It has been more than a year!|
Google hits the index page and goes no further.
I am the web developer for the Faculty of Applied Science at a Canadian university. About a year ago, I redesigned one of our department's sites. It was not the smoothest transition: we moved it to a different server, I changed the structure of the directories and the names of pages, and we even changed the domain name and IP (it was using two domain names, and we dropped one). I expected Google to take a few months to re-index the site, but now a year later, I have a lot of angry profs blaming me for Google not listing their research.
I can't figure it out. I am at a loss. The code itself is not the problem...I am using a template that I also use on several other departments. There is some server-side scripting and includes, but nothing weird. Google has recently finally assigned a PR of 6 to the homepage, and a PR of 5 to several pages one level in. But it will not spider the site, nor will it assign a page rank of more than 0 to any other pages on the site (at least 0 is an improvement! Until a few weeks ago, there was no rank at all). I watch the logs every day, and Google hits the robots.txt page, and then the index page...and that's it. It's been doing that for months and months, never going to another page. The robots page excludes only a testing directory, and nothing more (I've even tried removing it altogether).
There are tons of links to the site, and I have painstakingly contacted webmasters from sites who listed the old url. They have updated their links, but it doesn't seem to matter.
Anyone have any ideas? I have tried submitting the site to DMOZ, but they have not listed it.
Thanks for any help. I am at my wit's end.
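For anyone who wants to confirm the "robots.txt, then the index page, and that's it" pattern from their own logs, here is a minimal Python sketch. It assumes an Apache-style combined log format; IIS writes W3C extended logs by default, so the regular expression would need adjusting there. The sample lines, IP addresses, and paths are made up for illustration.

```python
import re
from collections import Counter

# Matches the request path and the final quoted field (the user agent)
# in a combined-format access log line. The format is an assumption.
LINE_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+)[^"]*".*"(?P<agent>[^"]*)"$'
)

def googlebot_paths(lines):
    """Count the paths requested by any user agent containing 'Googlebot'."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match and "googlebot" in match.group("agent").lower():
            counts[match.group("path")] += 1
    return counts

# Hypothetical log lines, not taken from any real server.
sample_log = [
    '192.0.2.10 - - [01/Jan/2004:10:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.10 - - [01/Jan/2004:10:00:02 +0000] "GET / HTTP/1.0" 200 5120 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.55 - - [01/Jan/2004:10:05:00 +0000] "GET /research/ HTTP/1.1" 200 9000 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
]

for path, hits in googlebot_paths(sample_log).items():
    print(path, hits)
```

If the only paths that ever show up are /robots.txt and /, the bot really is stopping at the front door, and the next thing to check is what links it can see on that page.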
I'm just guessing here, since I don't know your domain name.. but I did a search and found a 'Faculty of Applied Science at a Canadian university' that fits your description.. you in Toronto? Once again, just guessing, but it looks as if there are a couple of domains that all point to the same content. You may be suffering from a duplicate content penalty... I don't know much about those either.. but you can search the forums for more answers...
Good luck and welcome to Webmaster World.
Can we assume it's the site in the email address of your profile? If so, I'm trying to find a section that isn't ranked - which pages are you talking about, specifically? - send a Sticky...
Nope - not in Toronto, but rather Queen's in Kingston. Specifically, the Mechanical department. (Am I allowed to say this?) We have created a "common theme" for all of the departments, so I certainly hope that Google won't think that it looks too much like another site. With several sites having a very similar homepage, that would not be good.
The other departments have not had any trouble being indexed...just this one. Also, other search engine bots have spidered the site like crazy. I have tried to write to Google but have not had a response other than one pointing me to a page on what not to do.
You really need a SEO to look over this personally. I'm guessing you hit a filter, as opposed to a penalty. Forget about a personal reply from Google.
I don't see how the problem could be a duplicate content penalty. If the robot is not fetching the pages Google cannot know that they are duplicates or not.
I'm a great believer in (re)checking the obvious so here are a few obvious suggestions.
1. No hidden text (or text that Google might construe as hidden).
3. Regularly make small changes (or at least do so once after the bot visits). This may encourage the bots to look further.
4. I use frames without major problems but there have been many comments that Google does not like frames.
5. Check the headers. Keep it simple. Try an html validation tool.
6. Experiment with a few links at the bottom of the index page to a few test pages. See if you can get these pages into the Google index. Adjust such factors as file names (extensions, underscores and anything else you can think of) file sizes (some large and some small) etc.
Obviously, test pages are not desirable on a live site, but I think you will have to try it unless someone else has a better idea.
Until the problem is sorted, I would definitely delete (or rename) the robots.txt file unless there is a good reason not to. There is a site you can use to validate this file but I don't have the url to hand.
Your robots.txt file is invalid - I'd remove it immediately.
Robots.txt validator: [searchengineworld.com...]
A Standard for Robots Exclusion: [robotstxt.org...]
Server Headers checker: [webmasterworld.com...]
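Besides the online validators, a robots.txt file can be sanity-checked locally. This is a rough sketch using Python's standard `urllib.robotparser`; the file contents and the paths are assumptions modelled on the description earlier in the thread (only a testing directory excluded), not the poster's actual file.

```python
from urllib.robotparser import RobotFileParser

def allowed_paths(robots_txt, paths, agent="Googlebot"):
    """Return {path: True/False} saying whether `agent` may fetch each path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(agent, path) for path in paths}

# Hypothetical robots.txt that excludes only a testing directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /testing/
"""

result = allowed_paths(
    ROBOTS_TXT,
    ["/", "/research/index.asp", "/testing/draft.asp"],
)
for path, ok in result.items():
    print(path, "allowed" if ok else "blocked")
```

This only shows how Python's parser reads the file. A real crawler may treat malformed lines differently, so it complements the validator rather than replacing it.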
could be sessionids
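A quick way to rule session IDs in or out is to scan the URLs the site actually emits. A minimal sketch, assuming the parameter names below; that list is a guess at common names, not anything Google has published.

```python
from urllib.parse import urlparse, parse_qs

# Parameter names often used for session IDs. This list is an assumption.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid", "aspsessionid"}

def has_session_id(url):
    """Return True if the URL appears to carry a session ID."""
    parsed = urlparse(url)
    names = {name.lower() for name in parse_qs(parsed.query, keep_blank_values=True)}
    # Some servers append ";jsessionid=..." to the path instead of the query.
    in_path = any(
        part.lower().startswith("jsessionid=")
        for part in parsed.params.split(";")
    )
    return bool(names & SESSION_PARAMS) or in_path

print(has_session_id("http://www.example.com/page.asp?id=3&sessionid=ABC123"))
print(has_session_id("http://www.example.com/page.asp?id=3"))
```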
If it's the site I believe you're talking about, we have 38 pages indexed, including the home page for that host.
i'm having the same problem, (sites are in my profile). When you do a site:mydomain.com pages come up for the other site, but those pages were never on the other site, they were always our content. I'd really love to hear what googleguy has to say about it
The articles you mentioned specifically in your first post are buried in JS. That was the first thing that came to my mind.
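To see roughly what a spider that does not run JavaScript can follow, one can list only the plain `href` links in a page. The sketch below uses Python's standard `html.parser`; the sample markup is invented, and this is only an approximation of crawler behaviour, not a description of how Googlebot works.

```python
from html.parser import HTMLParser

class PlainLinkParser(HTMLParser):
    """Collect href values from <a> tags, skipping javascript: pseudo-links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and not href.lower().startswith("javascript:"):
            self.links.append(href)

def plain_links(html):
    parser = PlainLinkParser()
    parser.feed(html)
    return parser.links

# Hypothetical page: one ordinary link, one javascript: link, and one link
# that only exists after a script writes it into the page.
SAMPLE_HTML = """
<a href="/research/index.asp">Research</a>
<a href="javascript:openMenu('profs')">Professors</a>
<script>document.write('<a href="/articles/one.asp">Article</a>');</script>
"""

print(plain_links(SAMPLE_HTML))
```

Only the first link is reported, because the parser treats the contents of `<script>` as text rather than markup. Pages reachable only through the other two would be invisible to this kind of reader.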
tqatsju, the mods get a little antsy when I give specific site tip, because then everybody starts asking soon afterwards. Doing a
shows 561 pages where we found urls on your site, so it's not a problem with finding your pages. I can't really give any advice on your site--every url that I visited looked kinda like pre-built generic pages, with no real content other than off-site links to affiliate programs, etc.
The general advice I'd give is to read Brett's 26-step program and try to add more content to your site--but that's good advice for any webmaster, too.
A silly question: does the googlebot ever 'give up' on reading a page? I'm having the same problem. I may have hit a 'tripwire'?
It's been three months since I was indexed (just the "/" and "robots.txt" eight times in all) and happily enough I have a PR 2 because of a few incoming links - still building there - but never anything more than the abovementioned content indexed.
I have used, as the W3C standards say I should, the 'noembed' tag (yes, that again) for non-flash browsers, and the same text on another page for those who DO have it - after writing google several times I still have no answer to my "is this penalized" question. I could also be getting a double whammy for duplicate content. I think I should yank it but have yet to hear otherwise.
I can't really afford to hire an SEO but I can't seem to get out of this jam. Eight visits, PR 2 and no spidering past the "/" - normal? I think I'm doing something wrong but can't see what.
(added) but I will read Brett's text again ...
(added added) yes, my site does seem 'google laughable' after reading all that again... but you should see some of the mails I've received about it - we're getting interviewed for it! Talk about having my a** between two chairs...
I got kind of sidetracked there, sorry. I feel mister Crow's frustration.
|I have a lot of angry profs blaming me for Google not listing their research. |
I'm having it in a worse way even :(
I'm about to lose my job because of Google not listing all the site pages. Web site is the only way of hitting orders and it doesn't seem to work...
I'm at my wits' end as well and don't know what to think...
1)My site (look it up in the profile) is bilingual i.e. almost for every english page there's a russian counterpart with the same information in the other language. Sounds silly I know, but can Google really construe this as duplicate content?
2)I also got js navigation on top with normal href links to the site pages down below. What's strange - my russian pages got indexed alright - english pages didn't and that pisses off my management really strong as US market is our major objective.
3)the first href link that Gbot sees on home page (the page that was fed to Gbot at addurl.html) is link to russian home page that was followed and all russian pages got indexed. My question is: if Google followed it and indexed russian part why didn't it come back to index the rest?
4) My idea now is to arrange a site map page with all hrefs and "feed" it to Gbot in order to get all pages indexed. Has anybody practiced that before? Will that be penalized?
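For point 4, a site map page is just an ordinary HTML page of plain links. Here is a minimal Python sketch that builds one from a list of (URL, link text) pairs. The page list is hypothetical, and the sketch says nothing about how any search engine will treat the result.

```python
from html import escape

def site_map_html(pages, title="Site Map"):
    """Build a simple HTML page with one plain <a href> link per entry."""
    items = "\n".join(
        f'  <li><a href="{escape(url, quote=True)}">{escape(text)}</a></li>'
        for url, text in pages
    )
    return (
        f"<html>\n<head><title>{escape(title)}</title></head>\n"
        f"<body>\n<h1>{escape(title)}</h1>\n<ul>\n{items}\n</ul>\n</body>\n</html>\n"
    )

# Hypothetical bilingual page list; replace with the real site structure.
pages = [
    ("/en/index.html", "Home (English)"),
    ("/en/products.html", "Products (English)"),
    ("/ru/index.html", "Home (Russian)"),
]

page = site_map_html(pages)
print(page)
```

Linking this page from the home page with a normal href, above or alongside the JS navigation, gives a spider a script-free path to every page.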
Plus there's still loads of junk when I do search for "site:www.mysite.com +www.mysite.+com" or "site:mysite.com -asdfsadf"
- a lot of urls from last site version yet didn't get deleted. (the site was redesigned 3 months ago)
Sorry to be such a bloody nuisance, GoogleGuy, but it would be just great if you replied to this.
Thanks for all of the tips and help, everyone. I really appreciate the help.
GoogleGuy - I jumped for joy when I saw that you were lending your expertise and knowledge. Thanks!
There are 38 pages indexed? Please note that anything beginning with "conn" is the old domain name, and is not valid. I have painstakingly contacted every site I could find with those older links and asked the webmasters to update their links, but some remain (to old pages that do not have pages on the new site). Also, sites that begin with "ferrari." or "sellens.", etc. are not part of the site, and are on different servers (though they are affiliated). Could this cause any problems?
Also...does it matter if one uses site:www.domain.com vs. site: www.domain.com (note the space). It seems to yield very different results in our case.
As for the robots file - the validator says it's okay. Thanks for the link to the checker. I do not use any session IDs, nor do I use frames or hidden text. I try to keep lots of regular html links, and I have validated the headers. Whew!
Something I forgot to note that might be important...we are using IIS 5, and the pages are ASP. Does that make a difference? When I compare the site to several others I have here that use the same template, I can't imagine why they get spidered like crazy, and this one doesn't. I am confident that it is not a code problem...but the SysAdmin and I have checked the server config, too.
Finally...how does one go about hiring a SEO? I'd be willing to invest in the help. There is one prof who has accused me of ruining his career because his research pages (which used to be listed on Google) fell off a year ago when I changed the site.
Thanks so much,
Crow_Song, the search with the space in it will treat the query as two separate words, not a search restricted to one domain.
As far as the prof whose career you ruined: why not throw up a quick domain, like professordemento.com, and stick his stuff up there. Get a few links, and he'll be back in business.
Thanks Rogerd - what I wanted to tell him (but wouldn't dare) is that anyone whose entire career is dependent on search engine rankings is nuts. Still, I understand why he's a little ticked...even if he's a bit overdramatic.
|There is one prof who has accused me of ruining his career... |
LOL - don't you love working with academics? The real world is a scary/confusing place for them...
Googleguy says "with all thy getting, get links!"
Might I point out that the ODP is a valuable resource: individual professors' sites are often listable separately, if they have significant online content. The science categories are not high spam targets (except for really really stupid spammers), so the unreviewed ratio is often less than 1% (where 10% is average, and 100% is not unusual in spam magnets). This means submittals are highly appreciated, and get listed "fairly quickly" -- generally within a semester (to translate time units into a form recognizable in the Ivory Towers.)
As for ruining careers, just add a bit of hidden text on each page. But you already knew that if you've been following the BOFH technical files...
|anyone whose entire career is dependent on search engine rankings is nuts |
Careful, Crow, some people here may take offense... though most will cheerfully acknowledge their insanity! ;)
Hey guys will anybody help this poor soul?
My post was last on page1 so it apparently got ignored :(
You're right Rogerd. I don't mean to offend people whose jobs are actually in the search engine business!
Freshman, the more I talk about a specific domain (e.g. you don't have any penalties), the more the mods worry that lots of people will start mentioning specifics--and that's against the charter of the board. So re-read all our webmaster pages if you haven't, and read Brett's essay if you haven't, and then general comments may still apply to you.
I must REALLY be doing something wrong if I don't even get a mention : P Some of the advice added here IS relevant to my case, so thanks everyone for that.
google guy the site mydomain.org is not our site, that's the problem, our site is site mydomain.com ddd and it is showing up as mysite.org for no reason really....
Thanks GoogleGuy - I really appreciate your advice.
GoogleGuy - is the old conn server the problem? If I talk the powers that be into giving us the domain so we can bring it back to life (it's been gone for about 8 months) and put redirects in place, will that help?
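If the old domain does come back, the redirects amount to a lookup from old URLs to new ones, answered with a permanent (301) status. A rough Python sketch of that mapping logic follows; the host names and paths are placeholders, and on IIS the same idea would be implemented in the server configuration or in ASP rather than in Python. Whether this fixes the indexing problem is a separate question that the sketch cannot answer.

```python
from urllib.parse import urlsplit, urlunsplit

# Placeholder host names, not the real domains discussed in this thread.
OLD_HOST = "old.example.edu"
NEW_HOST = "new.example.edu"

# Old path -> new path, for pages that were renamed or moved (hypothetical).
PATH_MAP = {
    "/research.htm": "/research/index.asp",
    "/people/faculty.htm": "/people/index.asp",
}

def redirect_for(url):
    """Return (301, new_url) for a URL on the old host, or None otherwise."""
    parts = urlsplit(url)
    if parts.hostname != OLD_HOST:
        return None
    # Old pages with no counterpart fall back to the new home page.
    new_path = PATH_MAP.get(parts.path, "/")
    new_url = urlunsplit((parts.scheme or "http", NEW_HOST, new_path, "", ""))
    return 301, new_url

print(redirect_for("http://old.example.edu/research.htm"))
print(redirect_for("http://old.example.edu/gone.htm"))
```

Mapping each old page to its closest new equivalent, rather than sending everything to the home page, keeps existing inbound links pointing at the right content.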
I know I'm not supposed to ask GG specific questions but this is a general interest one so here goes.
When designing a new version of an old website, would you recommend keeping the old version live while the new one gets Googled? This is not pretty, and it means that there will be some duplication of content, but it might save the odd career or two.
Perhaps Google could add some guidance regarding how best to handle transitions of this sort on the Webmaster Guidelines page.
GoogleGuy - just to be clear, which I have probably not been (being the WebmasterWorld newbie that I am), the site I am having difficulty with is listed in my Interests (in my profile).
Using the sim spider it is confusing. The spider is able to read the link to the research page, and the spider is able to read the link to the activities page, and then on to the prof's pages. I'm rather stumped. The only thing that jumps out is that there have been posts about Googlebot doing strange things with the trailing slash lately. That's really just grasping at straws though.
I too was looking at the appsci pages the first time around, thus the js comment.