Nuts and Bolts Tips

How many of these do you know?

     
6:07 am on Jul 21, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member googleguy is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Oct 8, 2001
posts:2882
votes: 0


Okay, I'm waiting for something to finish compiling, so I thought I'd write up a few webmaster tips. Most of these are on our help pages somewhere, but not many people know all of these tidbits. So here goes:

Tip #1: Use If-Modified-Since (IMS). IMS lets your webserver tell Googlebot whether a page has changed since the last time the page was fetched. If the page hasn't changed, we can re-use the content from the last time we fetched that page. That in turn lets the bot download more pages and save bandwidth. I highly recommend that you check to see if your server is configured to support If-Modified-Since. It's an easy win for static pages, and sometimes even pages with parameters can benefit from IMS.
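For anyone checking their own setup, here is a minimal sketch of the decision a server makes when a request carries If-Modified-Since. The function name `conditional_status` and the sample dates are made up for illustration; most web servers already handle this for static files, so this is mainly relevant to pages generated by a script.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_status(if_modified_since, last_modified):
    """Return 304 if the client's copy is still current, else 200.

    if_modified_since: raw header value from the request, or None.
    last_modified: timezone-aware datetime of the page's last change.
    """
    if not if_modified_since:
        return 200  # no conditional header: send the full page
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: ignore it and send the page
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= since:
        return 304
    return 200


# Hypothetical page that last changed on 20 July 2003 at noon UTC.
page_changed = datetime(2003, 7, 20, 12, 0, 0, tzinfo=timezone.utc)
print(conditional_status("Mon, 21 Jul 2003 06:07:00 GMT", page_changed))  # 304
print(conditional_status("Sat, 19 Jul 2003 06:07:00 GMT", page_changed))  # 200
```

A 304 response carries no body, which is where the bandwidth saving comes from.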

Tip #2: You can use wildcards in robots.txt, and patterns can end in '$' to indicate the end of a name. So if you don't want Googlebot to fetch any PDF files, for example, you could say
Disallow: /*.pdf$
Don't forget that in the robots.txt file, every URL pattern must start with a "/" to be valid. Leaving it off is a pretty common webmaster error (maybe the most common robots.txt mistake), so keep it in mind and save yourself some angst. :)
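To see how the '*' and '$' rules play out on actual file names, here is a rough sketch that turns a pattern into a regular expression. It only follows the behaviour described in the tip ('*' matches anything, a trailing '$' anchors the end); `pattern_to_regex` and `is_blocked` are illustrative names, not anything Googlebot exposes.

```python
import re


def pattern_to_regex(pattern):
    """Translate a wildcard robots.txt pattern into a compiled regex.

    '*' matches any run of characters and a trailing '$' anchors the
    end of the URL. Everything else is treated as literal text.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path, pattern):
    # re.match anchors at the start of the path, like a robots.txt prefix.
    return pattern_to_regex(pattern).match(path) is not None


print(is_blocked("/docs/file.pdf", "/*.pdf$"))   # True
print(is_blocked("/docs/file.pdfx", "/*.pdf$"))  # False
print(is_blocked("/docs/file.pdfx", "/*.pdf"))   # True
```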

Tip #3: Googlebot also permits an "Allow" directive in robots.txt. This lets you specifically flag areas that are okay to crawl. When there are two directives that could apply, we follow the longest (i.e., most specific) directive. See
[google.com...]
for an example.
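The "longest directive wins" rule can be sketched in a few lines. This is only an illustration of the rule as stated above, using plain prefix matching; the paths are hypothetical, and how ties between equally long rules are broken is an assumption here (the first one listed wins).

```python
def is_allowed(path, rules):
    """Decide whether path may be crawled.

    rules: list of (directive, prefix) pairs, where directive is
    'allow' or 'disallow'. The longest matching prefix wins; a path
    that matches no rule is allowed.
    """
    best_directive, best_len = "allow", -1
    for directive, prefix in rules:
        if path.startswith(prefix) and len(prefix) > best_len:
            best_directive, best_len = directive, len(prefix)
    return best_directive == "allow"


# Hypothetical rules: block /private/ but open up one subfolder.
rules = [("disallow", "/private/"), ("allow", "/private/press/")]
print(is_allowed("/private/press/release.html", rules))  # True
print(is_allowed("/private/notes.html", rules))          # False
print(is_allowed("/index.html", rules))                  # True
```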

Tip #4: Avoid session ID's. If you can, use fewer dynamic parameters and stay away from the parameter "id=" in urls--Googlebot tries to stay away from things that might be session ID's.

Tip #5: Make sure that you can reach every page on your site with a text browser like lynx. That's the best way to make sure that a spider can follow links to all of your pages. Site maps can be a really good way to help users and spiders get down into different parts of your site.

Some of these tips work mainly with Googlebot, but I hope that they help. Anybody else with nuts and bolts tips for site architecture, crawling, or robots.txt--throw 'em in! :)

3:07 pm on July 21, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 22, 2003
posts:1483
votes: 0


Tip #2: You can use wildcards in robots.txt, and patterns can end in '$' to indicate the end of a name. So if you don't want Googlebot to fetch any PDF files, for example, you could say
Disallow: /*.pdf$

Well whack me in the head...SO...
Disallow: /*.pdf
will disallow *.pdf*, that is, anything whose extension starts with pdf is disallowed, right? (filename.pdf and filename.pdfx would both be disallowed)

and

Disallow: /*.pdf$
will disallow *.pdf only (filename.pdf is disallowed and filename.pdfx would be allowed)

3:24 pm on July 21, 2003 (gmt 0)

Preferred Member

10+ Year Member

joined:May 23, 2002
posts:444
votes: 0


Thanks for the pointers GG.

Just to clarify: when you say stay away from "id=" in a URL, do you mean avoid using a variable like that even if you're not using any session variables?

3:29 pm on July 21, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 27, 2003
posts:166
votes: 0


Regarding #2 and #3: since wildcards and Allow: are extensions to the robots.txt standard, keep in mind that these tips won't work with most, if not all, search engines other than Google.
3:31 pm on July 21, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Jan 28, 2003
posts:1977
votes: 0


Thanks, GG, for telling us about these tips.

I would also like to add a link to a thread about if-modified-since: [webmasterworld.com...]

4:22 pm on July 21, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member googleguy is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Oct 8, 2001
posts:2882
votes: 0


I believe that's right, skipfactor. I'll try to ask around and post if it's not. :)
4:31 pm on July 21, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 15, 2003
posts:2395
votes: 0


oh yes, valid code ;) I've found that most spiders like well prepared food. Not only the html, but also scripts, especially link/page redirect scripts (serverside that is, not JS/VB/ECMA... just forget these navigation-wise).

And lynx, that's a great point, this way you even support some screen readers as well as the bots :)

/claus

4:40 pm on July 21, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 3, 2003
posts:1633
votes: 0


That in turn lets the bot download more pages and save bandwidth. I highly recommend that you check to see if your server is configured to support If-Modified-Since. It's an easy win for static pages, and sometimes even pages with parameters can benefit from IMS.

Would it be possible to elaborate on how Googlebot handles first-time queries with respect to If-Modified-Since and "Last-Modified"?

The RFCs imply that a client should only send this header if it holds an actual "Last-Modified" value received in response to a previous request; this avoids time-synchronisation problems between hosts.

I've heard that Googlebot makes up an If-Modified-Since value from ages ago and simply "hopes for the best", in which case I'm not sure I want to risk it.

I'm developing a forum system that has a "mod_rewritten" archive section that could certainly benefit both parties from using this header. The URI looks something like:

/archive/2003/07/21/1.html

and it would be a trivial matter to serve 304 ("Not Modified") if a request for this page offers an If-Modified-Since date later than the date to which the page refers. Not sure I want to risk it though...
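A rough sketch of that check, under the assumption that an archive page never changes once its day has ended in UTC (the function name and the regex are illustrative, not part of any forum package):

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Matches paths such as /archive/2003/07/21/1.html
ARCHIVE_PATH = re.compile(r"^/archive/(\d{4})/(\d{2})/(\d{2})/\d+\.html$")


def archive_status(path, if_modified_since):
    """Return 304 for a frozen archive page the client already has, else 200."""
    match = ARCHIVE_PATH.match(path)
    if not match or not if_modified_since:
        return 200
    year, month, day = map(int, match.groups())
    try:
        # Assumption: the page is final from midnight UTC after its date.
        day_closed = datetime(year, month, day, tzinfo=timezone.utc) + timedelta(days=1)
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # bad date in the path or the header: send the page
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    return 304 if since >= day_closed else 200


print(archive_status("/archive/2003/07/21/1.html", "Tue, 22 Jul 2003 08:00:00 GMT"))  # 304
print(archive_status("/archive/2003/07/21/1.html", "Thu, 01 Jan 1970 00:00:00 GMT"))  # 200
```

With this logic, an If-Modified-Since value from "ages ago" simply falls through to a normal 200 response, so a made-up old date costs bandwidth but never hides content.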

5:03 pm on July 21, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:May 18, 2002
posts:126
votes: 0


GG,
Thanks for reiterating the basics. I have a question regarding session ids in the query string. Will Googlebot handle a cookie based session id system? All of our applications set a session id when a user first enters the site. The user, if they allow cookies, will have the session id set in this cookie. If they do not allow cookies, or refuse the cookie, we will write the session id in the query string.

Will Googlebot take the cookie and use the session id throughout its session?

Thanks

5:24 pm on July 21, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 8, 2002
posts:2012
votes: 0


>Will Googlebot take the cookie and use the session id throughout its session?

Although I'm not GoogleGuy (honestly! ;) I can confirm that Google doesn't accept, read, or store cookies at all.

athinktank, most session-based software packages (forums, CMSs etc.) put the session id into the URL if cookies are disabled on the client side. That's why GoogleGuy said: turn off (did he say cloak off?) session ids if Googlebot visits your site. Often this is done with a single "exclude user agent" line within your session creation code.

The reason is pretty easy to understand: session ids are unique numbers, and therefore a page that uses session ids in the request URI will get duplicated again and again every time Googlebot initiates a new visit to a session-based site.
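As an illustration of such an "exclude user agent" check, here is a minimal sketch. The parameter name `sid`, the helper names, and the single-entry crawler list are assumptions for the example; real packages each have their own session code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: matching on this substring is enough to spot the crawler.
CRAWLER_TOKENS = ("googlebot",)


def is_crawler(user_agent):
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def build_link(url, session_id, user_agent, has_cookie):
    """Append the session id to url only when the visitor needs it."""
    if has_cookie or is_crawler(user_agent):
        return url  # cookie carries the session, or a bot gets clean URLs
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("sid", session_id))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(build_link("/product?item=7", "abc123", "Googlebot/2.1", False))
# /product?item=7
print(build_link("/product?item=7", "abc123", "Mozilla/4.0", False))
# /product?item=7&sid=abc123
```

The page content stays identical for everyone; only the session parameter is left off the links, so each page keeps a single URL from the crawler's point of view.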

5:27 pm on July 21, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 8, 2002
posts:2012
votes: 0


>Tip #5: Make sure that you can reach every page on your site with a text browser like lynx.

...or do the same with one click using the Search Engine Spider Simulator [searchengineworld.com] at sew. ;)
5:37 pm on July 21, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 25, 2003
posts:970
votes: 0


Regarding site architecture, just thought I'd add that if you have a new site and your emphasis is on getting more pages crawled, then go for breadth rather than depth.
5:53 pm on July 21, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:May 20, 2003
posts:76
votes: 0


I have 2 versions of a page:
[domain.com...]
and
[domain.com...]

how do I tell Google to read /blah, but to not crawl /blah?something=*

Thanks, and excellent post.

5:56 pm on July 21, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 3, 2003
posts:1633
votes: 0


how do I tell Google to read /blah, but to not crawl /blah?something=*

You could add the HTML:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

to your page where something <> "", although Googlebot might go and apply the META to the page that exists without the ?something, so I'd be a bit careful.

6:08 pm on July 21, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:May 20, 2003
posts:76
votes: 0



Googlebot might go and apply the META to the page that exists without the ?something, so I'd be a bit careful.

yeah, that's what i'm worried about.

6:39 pm on July 21, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 27, 2003
posts:166
votes: 0



I have 2 versions of a page:
[domain.com...]
and
[domain.com...]
how do I tell Google to read /blah, but to not crawl /blah?something=*

That should be covered with an entry

Disallow: /blah?something

in your robots.txt.
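The reason this works is that a Disallow value is matched as a prefix of the path plus query string. A minimal sketch of that check (the helper name `is_disallowed` is made up for illustration):

```python
def is_disallowed(path_and_query, disallow_prefixes):
    """Plain robots.txt matching: a URL is blocked if it starts with any
    non-empty Disallow value (an empty value blocks nothing)."""
    return any(path_and_query.startswith(p) for p in disallow_prefixes if p)


prefixes = ["/blah?something"]
print(is_disallowed("/blah", prefixes))              # False
print(is_disallowed("/blah?something=1", prefixes))  # True
print(is_disallowed("/blah?other=1", prefixes))      # False
```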

6:56 pm on July 21, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:May 20, 2003
posts:76
votes: 0


k thanks :-)
4:36 pm on July 25, 2003 (gmt 0)

Full Member

10+ Year Member

joined:May 12, 2002
posts:221
votes: 0


Bumping this useful thread.
6:30 pm on July 26, 2003 (gmt 0)

New User

10+ Year Member

joined:July 24, 2003
posts:3
votes: 0


I am in a bit of a pickle... My store uses SIDs, so in order to make sure Googlebot could get to the pages it needed to without getting lost in the SIDs, I served Googlebot a page that would not create SIDs. The page looked very similar to my homepage but simply had some description text and links to all my products. After talking with a good number of people, some said this practice was cloaking and some said it was OK.

Nevertheless, I just noticed that I fell off the 1st page for my most popular keyword. So I have been losing sleep at night thinking that I pissed off Googlebot when all I was trying to do was help it get through the site and not get tangled in a web of SIDs. As a result I removed the "cloaked" page and now I am left with what I am sure will be a SID mess. Any suggestions? I am very new to the world of SEO, so any advice will be greatly appreciated.

6:38 pm on July 26, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 25, 2002
posts:776
votes: 0


For tip number 4, does Google look explicitly for the "id=" parameter, or does it also use partial matching? For example, if I use a parameter like "jobid=", am I asking for trouble?
6:43 pm on July 26, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member rcjordan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Apr 22, 2000
posts:9138
votes: 0


>Tip #4: Avoid session ID's.

Let me bookmark that one, GG. We had a big go-round after you confirmed this in that CMS thread a while back.

6:54 pm on July 26, 2003 (gmt 0)

Senior Member from MY 

WebmasterWorld Senior Member vincevincevince is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Apr 1, 2003
posts:4847
votes: 0


For tip number 4, does Google look explicitly for the "id=" parameter, or does it also use partial matching? For example, if I use a parameter like "jobid=", am I asking for trouble?

ALARM BELLS Urgent mass variable changing...

7:01 pm on July 26, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member nick_w is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Feb 4, 2002
posts:5044
votes: 0


>>Avoid session ID's

So, we should show Googlebot pages that don't contain session id's and (assuming our sites require it) show our users pages that do...?

Isn't that cloaking? hehehehehe ;)

Nick

8:24 pm on July 26, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 26, 2000
posts:2176
votes: 0



Detecting Googlebot in order to serve a session/cookie free page is not considered cloaking.
8:25 pm on July 26, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 8, 2002
posts:2012
votes: 0


>Isn't that cloaking? hehehehehe ;)

huhuh, wake up! ;)

Here's an ancient statement from GG (Dec 4, 2002):
Google and Session Killing [webmasterworld.com]

Everybody knows I'm pretty anti-cloaking, but WebGuerilla has already made strong points why it's okay to drop a session id for Google.<snip snap>
This is just my personal take, but allowing Googlebot to crawl without requiring session id's should not run afoul of Google's policy against cloaking. I encourage webmasters to drop session id's when they can. I would consider it safe. Fair enough?

Hope that helps,
googleguy

8:26 pm on July 26, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member nick_w is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Feb 4, 2002
posts:5044
votes: 0


Says who? - Details please WG..

Added:
>>This is just my personal take (GG from that thread)

Nick

[edited by: Nick_W at 8:27 pm (utc) on July 26, 2003]

8:27 pm on July 26, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 8, 2002
posts:2012
votes: 0


see above, Nick. ;)
8:28 pm on July 26, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member nick_w is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Feb 4, 2002
posts:5044
votes: 0


...and I just edited heheh!

Nick

8:31 pm on July 26, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 8, 2002
posts:2012
votes: 0


Well, yes - his personal take ... hmm, we're not going to start the discussion about GoogleGuy's relation to Google Inc. again, are we!? ;)

btw: my personal experience is that google doesn't have any problem with my "cloaked" forums - it eats 'em.

<added>nevertheless, good edit, Nick</added>

8:35 pm on July 26, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member nick_w is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Feb 4, 2002
posts:5044
votes: 0


Yeah, my statement was clearly a little 'tongue in cheek' (which I thought I made clear with the 'hehehehehe'; apologies if it wasn't clear enough...) - I cloak for that all the time and have never had a problem...

Nick

