
No follow?

Google is not crawling my page

     
8:11 am on Jun 24, 2003 (gmt 0)

New User

10+ Year Member

joined:June 24, 2003
posts:15
votes: 0


Hello,
I have this silly problem:
Google has been indexing my page for 5 days or so... but it only reads "/" and does not crawl any further links.

Most of the links look like this:
<a class=mainlisttitle href="/index.php?action=read&sez=&id=309">text</a>

I'll put the complete URL here; if someone wants to give it a check, I would really appreciate it:
www ilbisturi it

I miswrote it on purpose, so I won't look like a spammer :)
TIA,
Manuele

10:39 pm on June 24, 2003 (gmt 0)

New User

10+ Year Member

joined:June 24, 2003
posts:15
votes: 0


This is getting very strange...
still no further crawling; only the home page is being fetched...
Can someone give me a hint?
Thanks
10:49 pm on June 24, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 5, 2001
posts:2466
votes: 0


Too many variables in the URL, for one.

dave

2:40 am on June 25, 2003 (gmt 0)

New User

10+ Year Member

joined:June 10, 2002
posts:34
votes: 0


Last I checked, Google completely ignored URLs which contained a query-string parameter of id=anything-at-all -- this was some months ago.
3:34 am on June 25, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 24, 2002
posts:1130
votes: 0


Google and other search engines don't like many parameters, but recently Google has been more likely to spider URLs with several parameters. Try to keep the number low (2 or 3 will do for Google). But the most important thing is to avoid a parameter with a name like 'id'. This suggests a session ID, and following those links would result in spidering many (almost) identical pages. So I agree with frogg: the main problem will be the 'id' parameter.

Hope this helps.

4:48 am on June 25, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member googleguy is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Oct 8, 2001
posts:2882
votes: 0


Yah, "id=" usually marks a session id, so if you can rename that I would. Also, fewer parameters are better. Good luck! :)
6:58 am on June 25, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 8, 2002
posts:2335
votes: 0


Is the problem with "anythingid=" or just "id="?
7:43 am on June 25, 2003 (gmt 0)

New User

10+ Year Member

joined:June 24, 2003
posts:15
votes: 0


Thank you all,
I think I'll mod_rewrite the whole thing ("id" is quite impossible to change...)
Eventually, I'll let you know :)
Manuele
1:37 pm on June 25, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Feb 13, 2003
posts:103
votes: 0


recently Google has been more likely to spider URLs with several parameters. Try to keep the number low (2 or 3 will do for Google).

3? Is this true? I have never seen a 3-parameter URL in the SERPs, but I guess I haven't been looking that hard. Can you confirm that Google will index 3-parameter URLs?

Thanks,

swerve

6:02 pm on June 25, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 8, 2002
posts:2335
votes: 0


I've seen them index more than 3, but I don't think you'll get as many URLs spidered; it's more of a rarity. But if there are a lot of sites on the web linking to a page with a lot of parameters, it will get indexed.
6:46 pm on June 25, 2003 (gmt 0)

New User

10+ Year Member

joined:Apr 16, 2003
posts:2
votes: 0


But the most important thing is to avoid a parameter with a name like 'id'. This indicates a session-id, and following those links would result in spidering many (almost) identical pages.

Hmm... this is interesting to know. I'm having trouble getting Google to spider my sub-pages, even without a query string. The URLs look something like this:

[......]

I wonder - could Google be smart enough to realize I'm using the path as a replacement for the parameters and notice the ciid?

7:51 pm on June 25, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 8, 2002
posts:2335
votes: 0


Nope... use the tool here at WW to see the headers being output. If you are not putting out a 200 header, Google may not want to spider it.
8:17 pm on June 25, 2003 (gmt 0)

New User

10+ Year Member

joined:Mar 3, 2003
posts:24
votes: 0


Clark,

Where is the tool you are referring to and what exactly do you mean by a '200 Header'?

Thanks

9:09 pm on June 25, 2003 (gmt 0)

New User

10+ Year Member

joined:Apr 16, 2003
posts:2
votes: 0


The tool is over on SearchEngineWorld.com.

He means the status code that your HTTP server responds with when something makes a request for a page. 200 means OK, 404 means not found, 500 means internal server error, etc.
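If you don't want to hunt for the tool, a quick script along these lines shows the status code too (just a rough sketch in modern Python; the URL is only a placeholder):

# Minimal status-code check; the URL below is just an example.
from urllib.request import Request, urlopen
from urllib.error import HTTPError

def status_of(url):
    try:
        # HEAD is enough here; we only care about the status line
        return urlopen(Request(url, method="HEAD")).status
    except HTTPError as err:
        return err.code    # 404, 500, ... are raised as HTTPError

print(status_of("http://www.example.com/"))   # 200 means OK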

My site sends back a 200 response code for both the page I'm trying to get it to spider and the page that links to it.

But the other engines spider it OK, so I'm thinking maybe it's just another part of the wackiness of this latest update...

9:51 pm on June 25, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 8, 2002
posts:2335
votes: 0


I always have trouble remembering how to find it or I would have provided the link.

BTW, when did your pages go up? In the last 2 months G hasn't crawled many new pages (especially on sites that are older than a few months).

10:55 pm on June 25, 2003 (gmt 0)

New User

10+ Year Member

joined:June 24, 2003
posts:15
votes: 0


More:
I've changed all the links, now they look like:
"/read/NNN/"
where NNN is a number...
Too bad, all I got is:
64.68.82.38 - - [25/Jun/2003:13:46:34 +0200] "GET /robots.txt HTTP/1.0" 404 283
64.68.82.38 - - [25/Jun/2003:13:46:35 +0200] "GET / HTTP/1.0" 200 21821

Still no spidering of the subpages...
Can someone give me another good hint?
www ilbisturi it is the page.
Thanks again
Manuele

3:56 am on June 26, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 24, 2002
posts:1130
votes: 0


I just looked at the links in the cache (indexed on June 24), and those were like:

<a href="/index.php?action=read&sez=100&id=274">

but the current page looks better

<a href="/read/274/">

A check at Server Header Check [webmasterworld.com] gave 'HTTP/1.1 200 OK', so that looks good. Maybe just wait a bit longer for Google to get the sub pages as well. By the way, you hardly have any links to your site (Google, AllTheWeb, AltaVista, Inktomi all say: 0 links). Having some more could also help to get the sub pages spidered.
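By the way, the access log you posted shows a 404 for /robots.txt. Googlebot treats a missing robots.txt as permission to crawl everything, so that is not your problem, but serving a minimal allow-all file keeps the 404s out of your logs. Something like this, as a plain text file in the site root:

User-agent: *
Disallow: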

8:47 am on June 26, 2003 (gmt 0)

New User

10+ Year Member

joined:June 24, 2003
posts:15
votes: 0


Mmmmm, the site is only a few days old :)

The point is that at the time of that spidering (in the last post) the links were already modified... but Google only asked for "/".

:/

10:21 am on June 26, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 24, 2002
posts:1130
votes: 0


You could try the 'submit URL' at Google for a few sub pages (like "archive/3/") and see what happens. And of course get more inbound links. From what I understand of your site (sorry, I'm not so good at reading Italian), there is quite a lot of content, so it shouldn't be that hard to get some links. BTW, I also noticed a few internal links from the home page with the 'id' parameter. They link to pages with 'scarica' (discharge, download?). There are enough other links that should do well now, so this cannot be the cause of your problem. But you can already work on that too.
12:45 pm on June 26, 2003 (gmt 0)

New User

10+ Year Member

joined:June 24, 2003
posts:15
votes: 0


Thank you for your interest...
so far I've changed the link layout again...
now they are "story_NNN.html" and "archive_S_O.html",
where NNN, S and O are all numbers...

I did this to make the pages all appear to be in the same directory (better for many things...)

Now I'm waiting for the next spidering to see what happens...

(Those other links - downloads and stuff - are fine not being spidered... so 'id' can stay there... I want to solve the main problem first and see what happens later...)

Thank you again... will let you know :)

1:07 pm on June 26, 2003 (gmt 0)

New User

10+ Year Member

joined:Mar 3, 2003
posts:24
votes: 0


Nostgard,

Thanks for explaining. I thought that was what he meant but the way he said it lost me.

Anyone have a good idea why Google crawled me back in April but the pages have no title or cache in their directory? I keep asking hoping someone has the magic solution. Think Google will get the content next time around?

Thanks

7:39 am on June 27, 2003 (gmt 0)

New User

10+ Year Member

joined:June 24, 2003
posts:15
votes: 0


64.68.82.28 - - [26/Jun/2003:15:29:45 +0200] "GET / HTTP/1.0" 200 24302 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
18.29.1.50 - - [26/Jun/2003:15:29:46 +0200] "GET / HTTP/1.1" 200 24315 "-" "W3C_Validator/1.305.2.12 libwww-perl/5.64"
18.29.1.50 - - [26/Jun/2003:15:30:20 +0200] "GET / HTTP/1.1" 200 24315 "-" "W3C_Validator/1.305.2.12 libwww-perl/5.64"

new spidering, still only "/"...
I'm getting very upset :///
Any help appreciated...

9:48 pm on June 27, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 31, 2002
posts:43
votes: 0


Just as a note, I have a site with tons of query strings in one section of the site, and Google has not only spidered each and every one of them -- it has added them all to its index. A bit ridiculous considering it's a photo gallery... but I suppose it's good that at least Google *is* spidering links with query strings in them.
12:30 am on June 28, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 8, 2002
posts:2335
votes: 0


I've decided to run a test. Will let you know the result if Google ever spiders again :)
11:00 am on June 29, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 24, 2002
posts:1130
votes: 0


Just checked the site, and it is now indexed with 33 pages, all marked with '27 Jun 2003' in the SERP. So removing the 'id' helped to get sub pages spidered.
1:36 pm on June 30, 2003 (gmt 0)

New User

10+ Year Member

joined:June 24, 2003
posts:15
votes: 0


Thank you all, and especially takagi...
Yes, indeed, Google is able to spider the site :)
So I'm kind of glad about that...
But I still have more concerns...
1) Is it normal to have a 24-30 hour delay between the spidering time and the time Google shows the results? That is what is happening...

2) The second question is a bit more complicated:
I have archive pages for news (archive_SECTION_OFFSET.html),
where SECTION and OFFSET are numbers...
The point is that with an OFFSET equal to 0 you get the newest archive page: this means that archive_3_0.html is always changing... and that is the page I reference from the home page.
Now, as far as I can see, Google is not following the link that is at the bottom of the archive page (it is an image-only anchor - left and right arrows), so it is losing all the older archives (i.e. archive_3_1.html, archive_3_2.html and so on...) and because of that it also loses a lot of content....

In order to avoid that I can see 2 possible ways:

- change the naming so that archive_SECTION_0.html is the OLDEST archive and archive_SECTION_9.html (for example) is the freshest page, linked from the home. In this scenario I would hope that Google spiders the site so often that it will get the archives at least before the OFFSET changes... (see the sketch after this list)

- or, alternatively, find out why Google is not following the image-only anchor and make it spider the archive sub pages...
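A minimal sketch of that first option, assuming stories are numbered oldest-first within a section and a made-up page size (the names here are placeholders, not the real site's code). Each archive file depends only on the stories it contains, so once a file is full its contents never change again:

# Rough sketch of the "oldest archive is number 0" naming scheme.
# PAGE_SIZE and the function name are made-up placeholders.
PAGE_SIZE = 10

def archive_page(section, story_index):
    # Stories 0-9 land in page 0, 10-19 in page 1, and so on;
    # only the highest-numbered page ever changes.
    page = story_index // PAGE_SIZE
    return "archive_%d_%d.html" % (section, page)

print(archive_page(3, 0))    # archive_3_0.html - the oldest page, never changes once full
print(archive_page(3, 27))   # archive_3_2.html - the current "freshest" page

The home page would then link to whichever file is currently the highest-numbered one for each section.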

I hope I explained myself....
Otherwise let me know; I'll try to improve my bad English as fast as possible... :)

TIA, again.

2:06 pm on June 30, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 24, 2002
posts:1130
votes: 0


1) Is it normal to have a 24-30 hour delay between the spidering time and the time Google shows the results?

The delay between freshbot spidering the pages and the results showing up was usually more like 48 hours, I think, so 24-30 hours seems fast. The newly spidered pages from many sites have to be processed into an index, and that has to be spread over the 9 data centers.

The point is that with an OFFSET equal to 0 you get the newest archive page: this means that archive_3_0.html is always changing

That means that by the time a page in the index matches a user's query, the contents of the page have most likely already changed, because the content the visitor was looking for has moved to a file with a different name. For getting pages indexed this is no big problem, but most visitors will immediately hit the back button when they find that the data they were looking for is gone (i.e. not on the page they see). So giving a more recent page a higher number is the more logical way, just like the thread numbers on this forum.

Google is not following the link that is at the bottom of the archive page

For a page with no inbound links (and therefore PR0), the number of links followed by freshbot will be limited, so links near the end of the source file are more likely to be skipped in the initial stage. If you make sure there are several links to your home page, then deepbot will spider more pages.
8:31 am on July 14, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Feb 8, 2002
posts:63
votes: 0


Can I ask where you got the Mod Rewrite code to convert the dynamic query string into a path?

I've seen a few threads dotted around, but if you could point me towards one that works well, I'd appreciate it. ;-)

10:17 am on July 16, 2003 (gmt 0)

New User

10+ Year Member

joined:June 24, 2003
posts:15
votes: 0


RewriteEngine on
# "read_NNN.html" and "story_NNN.html" both map to the article script
RewriteRule /read_([0-9]+)\.html /index.php?action=read&id=$1
RewriteRule /story_([0-9]+)\.html /index.php?action=read&id=$1
# "archive_SECTION_OFFSET.html" maps to the archive script (offset defaults to 0)
RewriteRule /archive_([0-9]+)_([0-9]+)\.html /index.php?action=archive&sez=$1&offset=$2
RewriteRule /archive_([0-9]+)\.html /index.php?action=archive&sez=$1&offset=0
# older "/read/NNN/" and "/archive/S/O/" style URLs
RewriteRule /read/([0-9]+) /index.php?action=read&id=$1
#RewriteRule /read/([0-9]+)/images/([a-zA-Z]+) /images/$2
RewriteRule /archive/([0-9]+)/([0-9]+) /index.php?action=archive&sez=$1&offset=$2
RewriteRule /archive/([0-9]+) /index.php?action=archive&sez=$1

That is my configuration for that site; it's inside the <VirtualHost> directive (in a .htaccess file the patterns would be written without the leading slash)....

Besides, I wanted to ask...
why did Google leave only one page all of a sudden?
I had 45 indexed pages yesterday, today only one...
Can someone help me?

11:13 am on July 16, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 24, 2002
posts:1130
votes: 0


why did Google leave only one page all of a sudden?
I had 45 indexed pages yesterday, today only one...
Can someone help me?

In the first message of this thread you wrote that Google started indexing your home page around June 19. That is a few days after the Esmeralda update began. That would mean that all 45 pages indexed until yesterday were not in the full index. IIRC, all your pages had a fresh tag (a date next to the URL), so most likely it was the freshbot that kept your pages in the SERPs. But the weird thing is that you only have a sub page left, not your home page.

Soon the freshbot will bring some more pages into the SERPs. Maybe you should just have a bit more patience until the next update, and be happy with the pages that were already shown in the SERPs for the last few weeks. After all, the site was found after the last update and has only a few inbound links.
