No follow?

Google is not crawling my page

     
8:11 am on Jun 24, 2003 (gmt 0)

10+ Year Member



Hello,
I have this silly problem.
Google has been indexing my page for 5 days or so... but it only reads "/" and does not crawl any further links.

Most of the links look like this:
<a class=mainlisttitle href="/index.php?action=read&sez=&id=309">text</a>

I'll put the complete URL here; if someone wants to check it, I would really appreciate it:
www ilbisturi it

I misspelled it on purpose, so I won't look like a spammer :)
TIA,
Manuele

10:39 pm on Jun 24, 2003 (gmt 0)

10+ Year Member



This is getting very strange...
still no further crawling, only the home page is being fetched...
can someone give me a hint?
Thanks
10:49 pm on Jun 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Too many variables in the URL, for one.

dave

2:40 am on Jun 25, 2003 (gmt 0)

10+ Year Member



last I checked, google completely ignored urls which contained a query-string param of id=anything-at-all -- this was some months ago.
3:34 am on Jun 25, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google and other search engines don't like many parameters, but recently Google was more likely to spider URLs with several parameters. Try to keep the number low (2 or 3 will do for Google). But the most important thing is to avoid a parameter with a name like 'id'. This indicates a session-id, and following those links would result in spidering many (almost) identical pages. So I agree with frogg, the main problem will be the 'id' parameter.

Hope this helps.
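
The two rules of thumb above (keep the parameter count low, avoid a parameter called 'id') can be checked mechanically. This is only a rough sketch in Python: the function name audit_url and the limit of 3 are assumptions for illustration, taken from the "2 or 3" mentioned in this thread, not anything Google has published. It matches the exact name 'id' only, since whether names that merely end in "id" are affected is left open here.

```python
from urllib.parse import urlsplit, parse_qsl

MAX_PARAMS = 3  # assumption: upper end of the "2 or 3" suggested in this thread


def audit_url(url, max_params=MAX_PARAMS):
    """Return a list of warnings for one URL, based on the thread's rules of thumb."""
    query = urlsplit(url).query
    # keep_blank_values=True so an empty parameter such as "sez=" is still counted
    params = parse_qsl(query, keep_blank_values=True)
    warnings = []
    if len(params) > max_params:
        warnings.append(f"{len(params)} parameters (more than {max_params})")
    for name, _value in params:
        if name.lower() == "id":
            warnings.append("parameter named 'id' may be mistaken for a session id")
    return warnings


print(audit_url("/index.php?action=read&sez=&id=309"))  # one warning, about 'id'
print(audit_url("/read/309/"))                          # no warnings: []
```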

4:48 am on Jun 25, 2003 (gmt 0)

WebmasterWorld Senior Member googleguy is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Yah, "id=" usually marks a session id, so if you can rename that I would. Also, fewer parameters are better. Good luck! :)
6:58 am on Jun 25, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is the problem with "anythingid=" or just "id="?
7:43 am on Jun 25, 2003 (gmt 0)

10+ Year Member



Thank you all,
I think I'll mod_rewrite the whole thing ("id" is quite impossible to change...)
Eventually, I'll let you know :)
Manuele
1:37 pm on Jun 25, 2003 (gmt 0)

10+ Year Member



recently Google was more likely to spider URLs with several parameters. Try to keep the number low (2 or 3 will do for Google).

3? Is this true? I have never seen a 3 parameter URL in the SERPs, but I guess I haven't been looking that hard. Can you confirm that Google will index 3-parameter URLs?

Thanks,

swerve

6:02 pm on Jun 25, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've seen them index more than 3, but I don't think you'll get as many urls spidered. It's more of a rarity. But if there are a lot of sites on the web linking to a page with a lot of parameters they will get indexed.
6:46 pm on Jun 25, 2003 (gmt 0)

10+ Year Member



But the most important thing is to avoid a parameter with a name like 'id'. This indicates a session-id, and following those links would result in spidering many (almost) identical pages.

Hmm... this is interesting to know. I'm having trouble getting Google to spider my sub-pages, even without a query string. The URLs look something like this:

[......]

I wonder - could Google be smart enough to realize I'm using the path as a replacement for the parameters and notice the ciid?

7:51 pm on Jun 25, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



nope...use the tool here at WW to see the headers being output. If you are not putting out a 200 header google may not want to spider it.
8:17 pm on Jun 25, 2003 (gmt 0)

10+ Year Member



Clark,

Where is the tool you are referring to and what exactly do you mean by a '200 Header'?

Thanks

9:09 pm on Jun 25, 2003 (gmt 0)

10+ Year Member



The tool is over on SearchEngineWorld.com.

He means the status code that your HTTP server responds with when something makes a request for a page. 200 means OK, 404 means not found, 500 means internal server error, etc.
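
If you would rather check the status code yourself than use an online tool, something like the following should do it. It is a minimal sketch using only the Python standard library; fetch_status and describe_status are names made up for this example, and the URL you pass in is up to you.

```python
from http import HTTPStatus
import urllib.error
import urllib.request


def describe_status(code):
    """Return the standard reason phrase for an HTTP status code."""
    return HTTPStatus(code).phrase


def fetch_status(url, timeout=10):
    """Send a HEAD request and return the numeric status code of the final answer.

    Note: urlopen follows redirects, so a 301/302 on the way is not reported.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx answers; the code is still available
        return error.code


for code in (200, 404, 500):
    print(code, describe_status(code))
# 200 OK
# 404 Not Found
# 500 Internal Server Error
```

Calling fetch_status with the address of a sub-page returns the number the server sends back for it.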

My site sends back a 200 response code for both the page I'm trying to get it to spider and the page that links to it.

But the other engines spider it OK, so I'm thinking maybe it's just another part of the wackiness of this latest update...

9:51 pm on Jun 25, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I always have trouble remembering how to find it or I would have provided the link.

BTW, when did your pages go up? In the last 2 months G hasn't crawled many new pages (especially on sites that are older than a few months).

10:55 pm on Jun 25, 2003 (gmt 0)

10+ Year Member



More:
I've changed all the links, now they look like:
"/read/NNN/"
where NNN is a number...
Too bad, all I got is:
64.68.82.38 - - [25/Jun/2003:13:46:34 +0200] "GET /robots.txt HTTP/1.0" 404 283
64.68.82.38 - - [25/Jun/2003:13:46:35 +0200] "GET / HTTP/1.0" 200 21821

still no spidering of the subpages...
can someone give me another good hint?
www ilbisturi it
is the page.
Thanks again
Manuele

3:56 am on Jun 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just looked at the links in the cache (indexed on June 24), and those were like:

<a href="/index.php?action=read&sez=100&id=274">

but the current page looks better

<a href="/read/274/">

A check at Server Header Check [webmasterworld.com] gave a 'HTTP/1.1 200 OK' so that looks good. Maybe just wait a bit longer for Google to get the sub pages as well. By the way, you have hardly any links to your site (Google, AllTheWeb, AltaVista, Inktomi all say: 0 links). Having some more could also help to get sub pages spidered.

8:47 am on Jun 26, 2003 (gmt 0)

10+ Year Member



mmmmmm
the site is only a few days old :)

The point is that at the time of that spidering (in my last post) the links had already been modified... but google only asked for "/"

:/

10:21 am on Jun 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You could try the 'submit URL' at Google for a few sub pages (like "archive/3/") and see what happens. And of course get more inbound links. From what I understand of your site (sorry, I'm not so good at reading Italian), there is quite some content so it shouldn't be that hard to get some links. BTW, I also noticed a few internal links from the home page with the 'id' parameter. They link to pages with 'scarica' (discharge, download?). There are enough other links that should do well now, so this cannot be the cause of your problem. But you can already work on that too.
12:45 pm on Jun 26, 2003 (gmt 0)

10+ Year Member



Thank you for your interest...
so far I've changed the links layout again...
now they are "story_NNN.html" and "archive_S_O.html"
where: NNN, S and O are all numbers...

I did this to make the pages look all in the same subdirectory (better for many things...)

Now I'm waiting for the next spidering to see what happens...

(Those other links - downloads and stuff - are fine not being spidered... so id can remain... I want to solve the main problem first and see what happens later...)

Thank you again... I will let you know :)

1:07 pm on Jun 26, 2003 (gmt 0)

10+ Year Member



Nostgard,

Thanks for explaining. I thought that was what he meant but the way he said it lost me.

Anyone have a good idea why Google crawled me back in April but the pages have no title or cache in their directory? I keep asking hoping someone has the magic solution. Think Google will get the content next time around?

Thanks

7:39 am on Jun 27, 2003 (gmt 0)

10+ Year Member



64.68.82.28 - - [26/Jun/2003:15:29:45 +0200] "GET / HTTP/1.0" 200 24302 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
18.29.1.50 - - [26/Jun/2003:15:29:46 +0200] "GET / HTTP/1.1" 200 24315 "-" "W3C_Validator/1.305.2.12 libwww-perl/5.64"
18.29.1.50 - - [26/Jun/2003:15:30:20 +0200] "GET / HTTP/1.1" 200 24315 "-" "W3C_Validator/1.305.2.12 libwww-perl/5.64"

new spidering, still only "/"...
I'm getting very upset :///
Any help appreciated...
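
For anyone watching their logs the same way, picking out what Googlebot actually requested can be scripted. The sketch below assumes Apache's combined log format as in the lines above; the function name googlebot_paths is made up for this example. Keep in mind that a user-agent string can be faked, so this only shows what claims to be Googlebot.

```python
import re

# Combined log format: host ident user [time] "request" status bytes "referer" "user-agent"
# The referer/user-agent part is optional so plain common-format lines still parse.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def googlebot_paths(lines):
    """Return the distinct paths requested by clients whose user agent mentions Googlebot."""
    seen = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue
        agent = match.group("agent") or ""
        if "Googlebot" in agent and match.group("path") not in seen:
            seen.append(match.group("path"))
    return seen


sample = [
    '64.68.82.28 - - [26/Jun/2003:15:29:45 +0200] "GET / HTTP/1.0" 200 24302 '
    '"-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '18.29.1.50 - - [26/Jun/2003:15:29:46 +0200] "GET / HTTP/1.1" 200 24315 '
    '"-" "W3C_Validator/1.305.2.12 libwww-perl/5.64"',
]
print(googlebot_paths(sample))  # ['/']
```

If the list never grows beyond '/', the bot is still only fetching the home page.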

9:48 pm on Jun 27, 2003 (gmt 0)

10+ Year Member



Just as a note, I have a site with tons of query strings in one section of the site, and Google has not only spidered each and every one of them -- it has added them all to its index. Bit ridiculous when thinking that it's a photo gallery...but I suppose it's good that at least Google *is* spidering links with query strings in them.
12:30 am on Jun 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've decided to run a test. Will let you know the result if google ever spiders again :)
11:00 am on Jun 29, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just checked the site, and it is now indexed with 33 pages, all marked with '27 Jun 2003' in the SERP. So removing the 'id' helped to get sub pages spidered.
1:36 pm on Jun 30, 2003 (gmt 0)

10+ Year Member



Thank you all, and especially takagi...
yes, indeed, google is able to spider the site :)
So I'm kinda glad with that...
But still I have more concerns...
1) Is it normal to have a 24/30 hours delay between the spidering time and the time google shows the results? That is what is happening...

2) Second question is a bit more complicated:
I have archive pages for news (archive_SECTION_OFFSET.html)
where SECTION and OFFSET are numbers...
The point is that with an OFFSET equal to 0 you get the newest archive page: this means that archive_3_0.html is always changing ... and that is the page I reference from the HP.
Now, as far as I can see, google is not following the link that is at the bottom of the archive page (it is an image-only anchor - left and right arrows), so it is losing all the older archives (i.e. archive_3_1.html, archive_3_2.html and so on...) and because of that it also loses a lot of content....

To avoid that I can see 2 possible ways:

- change the naming so that archive_SECTION_0.html is the OLDEST archive and archive_SECTION_9.html (for example) is the freshest page, linked from the home page. In this scenario I would hope that google spiders the site often enough to get the archives at least before the OFFSET changes...

- or, alternatively, find out why google is not following the image-only anchor and make it spider the archive sub pages...

I hope I explained myself....
Otherwise let me know, I'll try to improve my bad English as fast as possible... :)

TIA, again.

2:06 pm on Jun 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



1) Is it normal to have a 24/30 hours delay between the spidering time and the time google shows the results?

The delay between freshbot spidering the pages and the results showing up was usually more like 48 hours, I think. So 24/30 hours seems fast. The newly spidered pages from many sites have to be processed into an index, and that has to be spread over the 9 data centers.

The point is that with an OFFSET equal to 0 you get the newest archive page: this means that archive_3_0.html is always changing

That means that if a page is in the index and matches a search engine user's query, the contents of the page have most likely already changed, because the content the visitor was looking for has moved to a file with a different name. For getting pages indexed this is no big problem, but most visitors will immediately hit the back button when they find out that the data they were looking for is gone (i.e. not on the page they see). So giving a more recent page a higher number is the more logical way, just like the thread numbers on this forum.
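
To make the numbering idea concrete, here is a small sketch in Python. It assumes stories are counted from 0 (the oldest) and that each archive page holds a fixed number of them; the 10 per page and the function names are made up for the example and are not taken from the site.

```python
def archive_page(item_index, per_page):
    """Page number for a story, where story 0 is the oldest and page 0 the oldest page."""
    return item_index // per_page


def archive_filename(section, item_index, per_page):
    """File name of the archive page that holds the given story."""
    return f"archive_{section}_{archive_page(item_index, per_page)}.html"


def newest_archive(section, total_items, per_page):
    """File name the home page should link to: the page holding the latest story."""
    return archive_filename(section, total_items - 1, per_page)


PER_PAGE = 10  # hypothetical page size

# The fifth-oldest story stays on the same file however many stories are added later.
print(archive_filename(3, 4, PER_PAGE))   # archive_3_0.html
# Only the target of the home page link moves as the section grows.
print(newest_archive(3, 25, PER_PAGE))    # archive_3_2.html
print(newest_archive(3, 60, PER_PAGE))    # archive_3_5.html
```

With this scheme a page that is already in the index keeps the content it was indexed with.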

google is not following the link that is at the bottom of the archive page

For a page with no inbound links (and therefore PR0), the number of links followed by fresh bot will be limited. So links at the end of the source file are more likely to be skipped in the initial stage. If you make sure there are several links to your home page, then deepbot will spider more pages.
8:31 am on Jul 14, 2003 (gmt 0)

10+ Year Member



Can I ask where you got the Mod Rewrite code to convert the dynamic query string into a path?

I've seen a few threads dotted around, but if you could point me towards one that works well, I'd appreciate it. ;-)

10:17 am on Jul 16, 2003 (gmt 0)

10+ Year Member



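# Map the static-looking public URLs onto index.php (placed inside <VirtualHost>).
RewriteEngine on
# Flat file names: /read_NNN.html, /story_NNN.html, /archive_S_O.html, /archive_S.html
RewriteRule ^/read_([0-9]+)\.html$ /index.php?action=read&id=$1 [L]
RewriteRule ^/story_([0-9]+)\.html$ /index.php?action=read&id=$1 [L]
RewriteRule ^/archive_([0-9]+)_([0-9]+)\.html$ /index.php?action=archive&sez=$1&offset=$2 [L]
RewriteRule ^/archive_([0-9]+)\.html$ /index.php?action=archive&sez=$1&offset=0 [L]
# Older directory-style URLs: /read/NNN/, /archive/S/O/, /archive/S/
RewriteRule ^/read/([0-9]+) /index.php?action=read&id=$1 [L]
#RewriteRule ^/read/([0-9]+)/images/([a-zA-Z]+) /images/$2 [L]
RewriteRule ^/archive/([0-9]+)/([0-9]+) /index.php?action=archive&sez=$1&offset=$2 [L]
RewriteRule ^/archive/([0-9]+) /index.php?action=archive&sez=$1 [L]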

That is my configuration for that site; it's in the <virtualhost> directive....

Besides, I wanted to ask...
why did google leave only one page all of a sudden?
I had 45 indexed pages yesterday, today only one...
Can someone help me?
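
A quick way to sanity-check rules like these before touching the server is to replay them in a few lines of Python. This is only an approximation, not how mod_rewrite itself works: it stops at the first matching rule, and the patterns are anchored at the start with the dot escaped, which may be slightly stricter than the configuration as written. The function name rewrite is made up for the example.

```python
import re

# (pattern, replacement) pairs in the same order as the RewriteRule lines.
RULES = [
    (r"^/read_([0-9]+)\.html$", r"/index.php?action=read&id=\1"),
    (r"^/story_([0-9]+)\.html$", r"/index.php?action=read&id=\1"),
    (r"^/archive_([0-9]+)_([0-9]+)\.html$", r"/index.php?action=archive&sez=\1&offset=\2"),
    (r"^/archive_([0-9]+)\.html$", r"/index.php?action=archive&sez=\1&offset=0"),
    (r"^/read/([0-9]+)", r"/index.php?action=read&id=\1"),
    (r"^/archive/([0-9]+)/([0-9]+)", r"/index.php?action=archive&sez=\1&offset=\2"),
    (r"^/archive/([0-9]+)", r"/index.php?action=archive&sez=\1"),
]


def rewrite(path):
    """Return the internal URL for a public path, or the path unchanged if no rule matches."""
    for pattern, replacement in RULES:
        match = re.search(pattern, path)
        if match:
            # The whole URL is replaced, with $1/$2 filled in from the captured groups.
            return match.expand(replacement)
    return path


for public in ("/story_309.html", "/archive_3_1.html", "/read/274/", "/"):
    print(public, "->", rewrite(public))
# /story_309.html -> /index.php?action=read&id=309
# /archive_3_1.html -> /index.php?action=archive&sez=3&offset=1
# /read/274/ -> /index.php?action=read&id=274
# / -> /
```

Note that the two-number archive rule has to come before the one-number rule, otherwise /archive/3/1 would lose its offset.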

11:13 am on Jul 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



why did google leave only one page all of a sudden?
I had 45 indexed pages yesterday, today only one...
Can someone help me?

In the first message of this thread you wrote that Google started indexing your home page around June 19. That is a few days after the Esmeralda update began. That would mean that all the 45 pages indexed until yesterday were not in the full index. IIRC, all your pages had a fresh tag (date next to the URL). So most likely it was the 'fresh bot' that kept your pages in the SERP. But the weird thing is, you only have a sub page left over, not your home page.

Soon 'fresh bot' will bring some more pages into the SERP. Maybe you should just have some more patience until the next update, and be happy with the pages that were already shown in the SERPs for the last few weeks. After all, the site was found after the last update and has only a few inbound links.
