
Google News Archive Forum

    
Google ignores four month old site
No visits from the google spider on site whatsoever
marilyn




msg:111626
 9:28 am on Aug 13, 2004 (gmt 0)

Hello - I have looked around WW and don't yet have a clue to the failure of my site to be visited by the google spider, (nor any other spider apart from fleeting visits from MSN). Can you help? Here are the characteristics of the site.

1) It is an e-commerce site with about 450 products.

2) It uses PHP dynamically generated pages.

3) The index page www.mysite.com/index.html contains a javascript re-direct to the 'real' home page which lies in a sub-directory. (I did this because I wanted a number of sub-directories and did not want to maintain two sets of include files - no other reason)

4) The sub-directories are based around file extensions rather than products (I gather this could be a mistake). In other words the sub-directories were for ease of my coding rather than for grouping products by type.

5) The index.html contains a link to a page that contains a randomly generated list of links to the details pages of 75 of my products (which I hoped would get listed somehow by someone somewhere if the spider in question ignored my javascript redirect).

6) I have only ever been visited by the MSN spider, and when it comes over it only visits pages linked directly from my 'real' home page (i.e. not the one with the javascript redirect on it). And when MSN does visit, it leaves a trail like this in my log file, to the boring pages rather than to any of my product pages.

pages/main-faq.php?PHPSESSID=21c665914zzz05c
pages/main-news.php?PHPSESSID=2cd51deed0b15c
pages/general-privacy.php?electedpage=faq&PHPSESSID=2cd51deed0b08b15

7) I use PHP session ids everywhere - is this part of the problem? Or is it all down to the index.html having a re-direct, or what? Other posts in the forum seem to say that the redirect should be ignored.

Even if you can only point me to the appropriate forum topic, please help.

Marilyn

 

Marcia




msg:111627
 12:28 pm on Aug 13, 2004 (gmt 0)

I use PHP session ids everywhere - is this part of the problem?

Get rid of session ids - there are problems with those.

pages/main-faq.php?PHPSESSID=21c665914zzz05c
pages/main-news.php?PHPSESSID=2cd51deed0b15c
pages/general-privacy.php?electedpage=faq&PHPSESSID=2cd51deed0b08b15

You definitely don't want that.

3) The index page www.mysite.com/index.html contains a javascript re-direct to the 'real' home page which lies in a sub-directory. (I did this because I wanted a number of sub-directories and did not want to maintain two sets of include files - no other reason)

Right there - only Javascript enabled browsers follow JS redirects.

JasonHamilton




msg:111628
 12:42 pm on Aug 13, 2004 (gmt 0)

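Use a

<?php
// Send a permanent redirect; crawlers follow it without needing JavaScript.
header("HTTP/1.0 301 Moved Permanently");
header("Location: /pages/main.php");
exit; // stop here so nothing else is output after the redirect
?>

This is *much* better than using javascript.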

You can also use mod_rewrite to make the root index.php page load up the sub page, but I'm not sure if that would mess up anything due to the paths being different.

Also, the SIDs need to go. The crawlers avoid them like the plague because session ids can send a crawler into endless loops.
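To illustrate why crawlers treat these URLs as a trap, here is a minimal sketch. The helper name strip_session_id and the session id values are made up for illustration; only the URL shapes come from the log lines quoted above. It shows that URLs differing only in PHPSESSID collapse to the same page once the parameter is dropped, so every new session looks like a fresh set of URLs for identical content.

```php
<?php
// Hypothetical helper (not from the thread): drop the PHPSESSID
// parameter so that per-visit variants of a URL compare as equal.
// It keeps only path and query, which is enough for site-relative URLs.
function strip_session_id(string $url): string
{
    $parts = parse_url($url);
    $path = $parts['path'] ?? '';
    $params = [];
    if (isset($parts['query'])) {
        parse_str($parts['query'], $params);
    }
    unset($params['PHPSESSID']);
    $query = http_build_query($params);
    return $query === '' ? $path : $path . '?' . $query;
}

// Same page, two made-up session ids: both print pages/main-faq.php
echo strip_session_id('pages/main-faq.php?PHPSESSID=aaa111'), "\n";
echo strip_session_id('pages/main-faq.php?PHPSESSID=bbb222'), "\n";
// Other parameters survive: prints pages/general-privacy.php?electedpage=faq
echo strip_session_id('pages/general-privacy.php?electedpage=faq&PHPSESSID=ccc333'), "\n";
```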

Vicente Duque




msg:111629
 1:16 pm on Aug 13, 2004 (gmt 0)

Marilyn :

Sorry for your long period of non-existence in Google. Let's cheer up and be patient. Let's have Faith and Hope in Adversity.

Google erased like 60% or 70% of my pages, even Super Original Content Sites ( all of 2004 ) and is Registering hundreds of pages of mine that no longer exist!

How Foolish!

But there has to be a way. Google Dirty Monster is shooting on his own feet by erasing Original Content Sites and allowing lots of Spam, Garbage, Trash, Dirt and Filth that has no Information Value and uses fake keywords to deceive the visitor. ( My experience in Google Searches ).

This looks like a job for the Powerpuff Girls! ( that battle so many dirty monsters in the Cartoon Network )

Other Business People will fill the void that Google leaves with little Search Engines that take you to the Right Place.

Someone has to invent a way that Little Ants like us be noticed and visited for our Original Content Sites. Let's wait with patience.

Vicente

bsterz




msg:111630
 1:22 pm on Aug 13, 2004 (gmt 0)

A quick way to kill those sids is to turn em off in your php.ini file. This will mean that you will need cookies to maintain sessions, but it's a trade-off.

session.use_trans_sid = 0

A good way to test if they are truly off is to disable cookies in your browser and tool around your site. With cookies off, php will try to add the sid if it's enabled. Naturally, most spiders don't return cookies.

b

marilyn




msg:111631
 3:22 pm on Aug 13, 2004 (gmt 0)

The problem seems to be mostly with the session ids. I will also remove the re-direct for good measure and make some product-oriented sub-directories (this could be a major re-write).

I don't know how to achieve what I need without session ids, so I am going to research that now. Turning them off sounds like everything might stop working. I wrote the whole thing without really understanding how the session ids were implemented.

Google is a bit of a monster. You get your pages listed (unless you are me), then Google changes an algorithm, and you get Vicente's problem. How do you stop yourselves from going mad?

Marilyn

webdude




msg:111632
 3:30 pm on Aug 13, 2004 (gmt 0)

I've heard mention too, in this forum, that there might be a problem with passing 3 or more variables in the URL. Don't know, but might be worth checking out...

webdude




msg:111633
 3:30 pm on Aug 13, 2004 (gmt 0)

I think I meant post arguments.

marilyn




msg:111634
 5:10 pm on Aug 13, 2004 (gmt 0)

Yes - I am going to look at url re-writing as well to get rid of get parameters on the url. (Did you really mean post?)

I just think it is strange that google has ignored my whole site - has not been there once - not even come in for coffee. You would think it would have at least visited the standard pages - or maybe there are session ids on those too.

hugo_guzman




msg:111635
 6:01 pm on Aug 13, 2004 (gmt 0)

have you acquired any backlinks pointing to your site?

Even if you have problems with coding, Googlebot should still visit by jumping through a link from another site.

webdude




msg:111636
 7:52 pm on Aug 13, 2004 (gmt 0)

I would definitely check the SIDs. If possible, get rid of them.

Vicente Duque




msg:111637
 9:04 pm on Aug 13, 2004 (gmt 0)

Marilyn :

You said


Google is a bit of a monster. You get your pages listed (unless you are me), then Google changes an algorithm, and you get Vicente's problem. How do you stop yourselves from going mad?

My method to avoid madness is this :

I think long term. I think of developing Original Content Websites of such high quality that sooner or later many people will want to link them, even if I don't ask them for links.

And that the topic is so interesting that the domain name will be a living reality for many years.

I also realize that Google is not God, and I stop working all day thinking of pleasing Google ( Moloch, the God to which children were sacrificed in Carthage ). Life is too short to spend it pleasing Google 8 hours or more a day.

I try to get the best possible domains for the chosen topic of my Original Content Websites ( OCWs )

I think that one or two ORIGINAL CONTENT WEBSITES ( OCWs ) are not enough, because with the years we change our natural interests.

My Visitors have to choose which of my OCWs is the best and the preferred. Specialization may ensue.

The problem is when Search Engines refuse to link our OCWs. Then we are flying blindfolded.

I realize that I may run into economic and financial difficulties and may have to search for another method to earn money. .... Best idea is to be independent and not a salaried slave.

I am optimistic .... Sooner or later there will be something better than Google. .... A Search Engine like Google that links billions of tons of Garbage Spam and refuses OCWs will get plenty of competition ahead!

Keep the Faith!

Vicente

marilyn




msg:111638
 11:39 am on Aug 14, 2004 (gmt 0)

Thanks to all for your comments.

Marcia, bsterz and webdude: I have removed all session ids as suggested and will wait to see what happens. For completeness I will post back here in a few weeks' time to say whether it had any impact.

I take Vicente's point about the Evil Google Empire and will also concentrate on content, which of course is the right thing to do ...

Marilyn

Marcia




msg:111639
 3:11 pm on Aug 14, 2004 (gmt 0)

Did you fix that JS redirect?

marilyn




msg:111640
 3:57 pm on Aug 14, 2004 (gmt 0)

hugo_guzman : yes - I have some back-links, some with great page rank, but I have still never been spidered by Google. If it is the session ids, then maybe the spider arrived and was immediately upset by the sids so did nothing.

marcia : no - I haven't taken out the js redirect, as my index file (which contains it) is called index.html, and to make the page redirect using the header command I would have to call it index.php. I know .php files are OK, but I thought that the index file should be a .html even if all the other files were not. I am talking rubbish, aren't I?
Wait a minute.
.
.
.
OK - I have just remembered I can keep the filename index.html and use this method of redirection instead.

<META http-equiv="Refresh" CONTENT='0; url=pages/main-home.php'>

So, yes I have now got rid of the js redirect. Don't tell me a meta refresh will make google come over all queasy.

Marilyn

encyclo




msg:111641
 4:04 pm on Aug 14, 2004 (gmt 0)

marilyn, if you absolutely have to have session IDs, then you need to look at cloaking for Googlebot (and Slurp, and any others you want). You will need to deliver the same content with the same URLs, but no session ID just for the bots, whilst normal visitors get the sessions. Don't go mad and do keyword-stuffing or customized content for the bots, and you'll be just fine.

This is a clear case when cloaking is not "evil" or underhand, but simply necessary to give the bots proper access.

Don't tell me a meta refresh will make google come over all queasy.

It's as bad if not worse than the Javascript. Are you sure you can't use an index.php file? Check with your hosting company. If you can't do it, change companies. Seriously.

Have you tried placing the following in the root .htaccess file?:

DirectoryIndex index.php index.html
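A minimal sketch of the kind of user-agent check this approach relies on. The function name looks_like_bot and the marker list are assumptions made for illustration; real crawler user-agent strings vary and change over time, so treat the list as a starting point rather than a complete one.

```php
<?php
// Illustrative markers only (an assumption, not a complete list).
const BOT_MARKERS = ['googlebot', 'slurp', 'msnbot', 'ia_archiver'];

// Return true when the User-Agent string contains a known crawler marker.
function looks_like_bot(string $userAgent): bool
{
    $ua = strtolower($userAgent);
    foreach (BOT_MARKERS as $marker) {
        if (strpos($ua, $marker) !== false) {
            return true;
        }
    }
    return false;
}

var_dump(looks_like_bot('Googlebot/2.1 (+http://www.google.com/bot.html)')); // bool(true)
var_dump(looks_like_bot('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)')); // bool(false)
```

In a live page the argument would typically come from $_SERVER['HTTP_USER_AGENT'], and the result would decide only whether session ids are appended to URLs, with the page content itself left identical for bots and visitors.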

Marcia




msg:111642
 4:22 pm on Aug 14, 2004 (gmt 0)

Any idea how many duplicates there are out there, except for the sessionid being different? What would that do to PR and backlinks in a site?

[google.com...]

No you can't use meta-refresh, you don't want to do that. You can use a 301 permanent redirect - not a 302. See, you have links and PR - but only to the homepage - and what, by the way, is the content on the homepage? Is there anything indexed or cached for it? Google isn't going to index a blank page with description and all - is there actually a site they can crawl for them to index?

dirkz




msg:111643
 4:50 pm on Aug 14, 2004 (gmt 0)

The really strange thing with session IDs is that it depends.

I have a store that uses them only when cookies are rejected, which means that every bot gets URLs with session ids (easily provable by masking as Googlebot). The URLs are crawler friendly though (no GET parameters).

I never expected this thing to get crawled at all. But it's indexed and even ranking on its own.

Maybe it depends on the shop software you use. I think I'm using the same as the Google store has been using until recently (I swear they changed it in the last 2 weeks or so).

marilyn




msg:111644
 4:51 pm on Aug 14, 2004 (gmt 0)

encyclo : I have now removed the meta refresh and replaced it with the php header command, so have renamed the file from .html to .php. Also have implemented the .htaccess DirectoryIndex directive.

Anyway - now waiting to see what happens.

Marcia: I don't know how many duplicates are out there - not sure I understand you ... There are lots of pages that are different as there are lots of products. As they were never indexed, none of them are out there with or without session ids. Is that what you meant?

Marilyn

marilyn




msg:111645
 5:03 pm on Aug 14, 2004 (gmt 0)

dirkz: I have GET parameters - but have used url rewriting to make them look like they are not there on a spider-friendly site map which is linked to from my (new) index.html file. I randomly generate 75 products each time the site index is accessed, (except it has never been accessed which is why I started this topic).

The real product pages do not have url re-writing, so when you visit the pages you see in the address space something like : [mysite.com...]

But on the spider-friendly page for the same product you see : [mysite.com...]

I wrote the shop software myself, then used worldpay for the money part.

I have got rid of the session variables only by forcing the use of cookies (as suggested earlier). But if the spiders are going to reject cookies and have session ids used in their place, I have not really made any improvements.

I am losing the will to live.

Marilyn

[edited by: marilyn at 5:09 pm (utc) on Aug. 14, 2004]

dirkz




msg:111646
 5:08 pm on Aug 14, 2004 (gmt 0)

> But on the spider-friendly page for the same product you see : [mysite.com...]

If you mean you have now 2 different versions of every product page get rid of that. It's duplicate content.

Either rewrite them for all users (bots included) or for none. Or use cloaking.

marilyn




msg:111647
 7:15 pm on Aug 14, 2004 (gmt 0)

OK - Have taken your advice and done the following

1) Turned off session ids with nothing in their place. If the user has cookies disabled they can't go shopping. It is either that or my sanity.

2) Got rid of js redirect as previously described.

3) Taken out the duplicate product pages with full url rewriting across all affected pages.

4) Removed most other instances of get parameters using url re-writing on non products pages.

Thanks for all your help - especially pointing out my duplicate page problems.

Marilyn

dirkz




msg:111648
 9:23 am on Aug 15, 2004 (gmt 0)

> 1) Turned off session ids with nothing in their place.

Then you will lose some percentage of people who don't have cookies enabled. They can put things in their basket all day long, when they check out or even view the next page it's gone :(

What I would suggest is leaving sids in place and only kill them if it is a bot. This is fairly easy to achieve and not even cloaking.

This is the solution I'm now using in order to get crawled by yahoo also.

marilyn




msg:111649
 9:59 am on Aug 15, 2004 (gmt 0)

dirkz - you are quite right. All the logging code that I wrote is not doing too well without cookies either. I was visited by ia_archiver (whatever that is) just now, and instead of having '5 visitors currently online', I now have '146 visitors currently online'. Also - those without cookies are not going to do too well when it comes to shopping, as you point out.

I have looked around a bit - but don't know how to tell whether the visitor is a real person or a spider/bot. And even if I did know the difference, how do I switch from cookies to session ids at will? At the moment I have switched session ids off at the mains. Can you point me anywhere?

Marilyn

[edited by: marilyn at 10:16 am (utc) on Aug. 15, 2004]

Marcia




msg:111650
 10:04 am on Aug 15, 2004 (gmt 0)

I'd ask in the PHP forum - or Ecommerce.

marilyn




msg:111651
 10:28 am on Aug 15, 2004 (gmt 0)

I have found this information - in WW.
[webmasterworld.com...]

I will have to read it about 15 times.

Marilyn

marilyn




msg:111652
 12:57 pm on Aug 15, 2004 (gmt 0)

Have now tidied up the solution.

1) Have put sessions back on in the php.ini file
2) Still have URL rewriting to get rid of GET parameters.
3) Am detecting bots by using a list of BrowserMatchNoCase directives in the httpd.conf to identify spiders and robots. This sets an environment variable, which I chose to call visitor-is-robot, and which is accessible from PHP.

4) In PHP I ask if the visitor-is-robot env variable is set, and if it is I use PHP to turn off session ids in URLs.

if (getenv('visitor-is-robot')) {
    // Must run before session_start(): stops PHP appending the session id to links.
    ini_set('session.use_trans_sid', false);
}

Maybe this will work.
Marilyn
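For readers following along, step 3 might look something like the fragment below. This is only a sketch: the user-agent patterns are illustrative guesses rather than Marilyn's actual list, and the variable is spelled visitor_is_robot (with underscores) as an assumption, since Apache's documentation recommends sticking to letters, digits and underscores in environment variable names.

```apache
# httpd.conf: flag requests whose User-Agent matches a known crawler.
# Patterns are case-insensitive; the list is illustrative, not complete.
BrowserMatchNoCase googlebot   visitor_is_robot
BrowserMatchNoCase msnbot      visitor_is_robot
BrowserMatchNoCase slurp       visitor_is_robot
BrowserMatchNoCase ia_archiver visitor_is_robot
```

With that spelling, the PHP check would read getenv('visitor_is_robot') instead.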

dirkz




msg:111653
 6:40 pm on Aug 15, 2004 (gmt 0)

> Maybe this will work.

You should test the behavior with a browser that allows switching cookies on and off.

Also you should disguise as Googlebot (i.e. fake the user agent) and see which links you get.

You can do the latter with curl or wget.

If Googlebot then never sees a session id in a URL and you have eliminated all duplicates, your site should get crawled and ranked (provided it has some inbound links, etc.)
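For anyone without curl or wget to hand, the same check can be sketched in PHP itself. The function names below are hypothetical, and the Googlebot user-agent string is given as an example rather than as a definitive value. The idea is to fetch a page while presenting a crawler's user agent, then list any links that still carry a PHPSESSID.

```php
<?php
// Fetch a page while presenting a chosen User-Agent string.
// Returns the page body, or false on failure.
function fetch_as(string $url, string $userAgent)
{
    $context = stream_context_create(['http' => ['user_agent' => $userAgent]]);
    return file_get_contents($url, false, $context);
}

// Return every href value that still carries a PHPSESSID parameter.
function links_with_session_id(string $html): array
{
    preg_match_all('/href\s*=\s*["\']([^"\']*PHPSESSID=[^"\']*)["\']/i', $html, $m);
    return $m[1];
}

// Usage (www.mysite.com is the placeholder host used in this thread):
// $html = fetch_as('http://www.mysite.com/', 'Googlebot/2.1 (+http://www.google.com/bot.html)');
// if ($html !== false) { print_r(links_with_session_id($html)); }
```

An empty array for the faked crawler, alongside session ids still appearing for a cookie-less browser, would suggest the bot detection is doing its job.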

The only thing that can hurt you again is inbound links with (an old) session id (it's unlikely but sometimes it could happen). When seen by Googlebot this could be duplicate content again.

Also make sure that every page is different from the other, i.e. proper title tag. Not that "Welcome to the ... shop" on every page :)

marilyn




msg:111654
 2:14 pm on Aug 16, 2004 (gmt 0)

Hi dirkz

I am already being visited (bombarded) by Alexa (ia_archiver) and I note there are no session ids there. This has never happened before so something must have improved. However the visits from MSN are only one or two pages long which is odd, and are accompanied by session ids which is also odd seeing as I should be trapping the msnbot and dealing with it . . .

I'll have to wait until the weekend to get some local help with cURL, as I tried downloading it but could not work out what to do with it. I will need to check that I am losing the session ids for msnbot and googlebot as expected. Until then, the session ids in msnbot's trail are a mystery.

Marilyn


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved