This is getting very strange...
still no further crawling, only the home page is being fetched...
can someone give me a hint?
Too many variables in the URL, for one.
last I checked, google completely ignored urls which contained a query-string param of id=anything-at-all -- this was some months ago.
Google and other search engines don't like many parameters, but recently Google was more likely to spider URLs with several parameters. Try to keep the number low (2 or 3 will do for Google). But the most important thing is to avoid a parameter with a name like 'id'. This indicates a session-id, and following those links would result in spidering many (almost) identical pages. So I agree with frogg, the main problem will be the 'id' parameter.
Hope this helps.
Yah, "id=" usually marks a session id, so if you can rename that I would. Also, fewer parameters are better. Good luck! :)
Is the problem with "anythingid=" or just "id="?
Thank you all,
I think I'll mod_rewrite the whole thing ("id" is quite impossible to change...)
Eventually, I'll let you know :)
|recently Google was more likely to spider URLs with several parameters. Try to keep the number low (2 or 3 will do for Google). |
3? Is this true? I have never seen a 3 parameter URL in the SERPs, but I guess I haven't been looking that hard. Can you confirm that Google will index 3-parameter URLs?
I've seen them index more than 3, but I don't think you'll get as many urls spidered. It's more of a rarity. But if there are a lot of sites on the web linking to a page with a lot of parameters they will get indexed.
|But the most important thing is to avoid a parameter with a name like 'id'. This indicates a session-id, and following those links would result in spidering many (almost) identical pages. |
Hmm... this is interesting to know. I'm having trouble getting Google to spider my sub-pages, even without a query string. The URLs look something like this:
I wonder - could Google be smart enough to realize I'm using the path as a replacement for the parameters and notice the ciid?
Nope... use the tool here at WW to see the headers being output. If you are not sending a 200 header, Google may not want to spider it.
Where is the tool you are referring to and what exactly do you mean by a '200 Header'?
The tool is over on SearchEngineWorld.com.
He means the status code that your HTTP server responds with when something makes a request for a page. 200 means OK, 404 means not found, 500 means internal server error, etc.
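If you want to see that status line yourself without an external tool, here is a small sketch (the helper names are made up for illustration; `check_status` does a live HEAD request, so any host you pass it is just a placeholder):

```python
# Sketch: inspect the HTTP status a server returns for a page.
# parse_status_line handles the first line of a raw response;
# check_status does a live HEAD request against a real server.
from http.client import HTTPConnection

def parse_status_line(line):
    # "HTTP/1.1 200 OK" -> (200, "OK")
    _proto, code, reason = line.split(" ", 2)
    return int(code), reason

def check_status(host, path="/"):
    conn = HTTPConnection(host, 80, timeout=10)
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.reason  # e.g. (200, "OK")
    finally:
        conn.close()
```

A page that is served normally should come back as a (200, "OK") pair; 404 and 500 pairs are the ones that tend to keep spiders away.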
My site sends back a 200 response code for both the page I'm trying to get it to spider and the page that links to it.
But the other engines spider it OK, so I'm thinking maybe it's just another part of the wackiness of this latest update...
I always have trouble remembering how to find it or I would have provided the link.
BTW, when did your pages go up? In the last 2 months G hasn't crawled many new pages (especially on sites that are older than a few months)
I've changed all the links, now they look like:
where NNN is a number...
Too bad, all I got is:
18.104.22.168 - - [25/Jun/2003:13:46:34 +0200] "GET /robots.txt HTTP/1.0" 404 283
22.214.171.124 - - [25/Jun/2003:13:46:35 +0200] "GET / HTTP/1.0" 200 21821
still no spidering of the subpages...
can someone give me another good hint?
www ilbisturi it
is the page.
Just looked at the links in the cache (indexed on June 24), and those were like:
but the current page looks better
A check at Server Header Check [webmasterworld.com] gave an 'HTTP/1.1 200 OK', so that looks good. Maybe just wait a bit longer for Google to get the sub pages as well. By the way, you hardly have any links to your site (Google, AllTheWeb, AltaVista, Inktomi all say: 0 links). Having some more could also help to get sub pages spidered.
the site is only a few days old :)
The point is that at the time of that spidering (in the last post) the links were already modified... but google only asked for "/"
You could try the 'submit URL' at Google for a few sub pages (like "archive/3/") and see what happens. And of course get more inbound links. From what I understand of your site (sorry, I'm not so good at reading Italian), there is quite some content, so it shouldn't be that hard to get some links. BTW, I also noticed a few internal links from the home page with the 'id' parameter. They link to pages with 'scarica' (download?). There are enough other links that should do well now, so this cannot be the cause of your problem. But you can already work on that too.
Thank you for your interest...
so far I've changed the links layout again...
now they are "story_NNN.html" and "archive_S_O.html"
where: NNN, S and O are all numbers...
I did this to make the pages all appear in the same directory (better for many things...)
Now I'm waiting for the next spidering to see what happens...
(Those other links - downloads and stuff - are OK not to be spidered... so 'id' can remain... I want to solve the main problem first and see what happens later...)
Thank you again... will let know :)
Thanks for explaining. I thought that was what he meant but the way he said it lost me.
Anyone have a good idea why Google crawled me back in April but the pages have no title or cache in their directory? I keep asking hoping someone has the magic solution. Think Google will get the content next time around?
126.96.36.199 - - [26/Jun/2003:15:29:45 +0200] "GET / HTTP/1.0" 200 24302 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
188.8.131.52 - - [26/Jun/2003:15:29:46 +0200] "GET / HTTP/1.1" 200 24315 "-" "W3C_Validator/1.305.2.12 libwww-perl/5.64"
184.108.40.206 - - [26/Jun/2003:15:30:20 +0200] "GET / HTTP/1.1" 200 24315 "-" "W3C_Validator/1.305.2.12 libwww-perl/5.64"
new spidering, still only "/"...
I'm getting very upset :///
Any help appreciated...
Just as a note, I have a site with tons of query strings in one section of the site, and Google has not only spidered each and every one of them -- it has added them all to its index. Bit ridiculous when thinking that it's a photo gallery...but I suppose it's good that at least Google *is* spidering links with query strings in them.
I've decided to run a test. Will let you know the result if google ever spiders again :)
Just checked the site, and it is now indexed with 33 pages, all marked with '27 Jun 2003' in the SERP. So removing the 'id' helped to get sub pages spidered.
Thank you all, and especially takagi...
yes, indeed, google is able to spider the site :)
So I'm kinda glad with that...
But still I have more concerns...
1) Is it normal to have a 24/30 hours delay between the spidering time and the time google shows the results? That is what is happening...
2) Second question is a bit more complicated:
I have archive pages for news (archive_SECTION_OFFSET.html)
where SECTION and OFFSET are numbers...
The point is that with an OFFSET equal to 0 you get the newest archive page: this means that archive_3_0.html is always changing ... and that is the page I reference from the HP.
Now, as far as I could see, google is not following the link that is at the bottom of the archive page (it is an image-only anchor - left and right arrows), so it is losing all the older archives (i.e.: archive_3_1.html, archive_3_2.html and so on...) and because of that it also loses a lot of content....
In order to avoid that, I can see 2 possible ways:
- change the naming so that archive_SECTION_0.html is the OLDEST archive and archive_SECTION_9.html (for example) is the freshest page, linked from the home. In this scenario I would hope that Google spiders the site often enough to get the archives at least before the OFFSET changes...
- or, alternatively, find out why google is not following the image-only anchor and make it spider the archive sub pages...
I hope I explained myself....
otherwise let me know, I'll try to improve my bad English as fast as possible... :)
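Just to pin down the numbering in the first option, here is a tiny sketch (hypothetical Python helpers, not the site's actual code; `page_count` stands for however many archive pages a section has):

```python
# Sketch of the "oldest page is number 0" scheme: a page's content then
# never moves to a different URL when a new archive page is added.
# archive_filename / newest_archive / page_count are made-up names.

def archive_filename(section, page_index):
    # page_index 0 = oldest archive page of this section
    return "archive_%d_%d.html" % (section, page_index)

def newest_archive(section, page_count):
    # the home page should always link to the newest (highest) page
    return archive_filename(section, page_count - 1)
```

With 10 archive pages in section 3, the home page would link to archive_3_9.html, while archive_3_0.html through archive_3_8.html keep their content forever.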
|1) Is it normal to have a 24/30 hours delay between the spidering time and the time google shows the results? |
The delay between freshbot spidering the pages and the results showing up was usually more like 48 hours, I think. So 24/30 hours seems fast. The newly spidered pages from many sites have to be processed into an index, and that has to be spread over the 9 data centers.
|The point is that with an OFFSET equal to 0 you get the newest archive page: this means that archive_3_0.html is always changing |
That means that if a page is in the index and matches a user's query, its contents have most likely already changed: what the visitor was looking for has moved to a file with a different name. For getting pages indexed this is no big problem, but most visitors will immediately hit the back button when they find that the data they were looking for is gone (i.e. not on the page they see). So giving a more recent page a higher number is the more logical way, just like the thread numbers on this forum.
|google is not following the link that is at the bottom of the archive page |
For a page with no inbound links (and therefore PR0), the number of links followed by freshbot will be limited. So links near the end of the source file are more likely to be skipped in the initial stage. If you make sure there are several links to your home page, then deepbot will spider more pages.
Can I ask where you got the Mod Rewrite code to convert the dynamic query string into a path?
I've seen a few threads dotted around, but if you could point me towards one that works well, I'd appreciate it. ;-)
# static-looking pages -> the real PHP script
RewriteRule /read_([0-9]+).html /index.php?action=read&id=$1
RewriteRule /story_([0-9]+).html /index.php?action=read&id=$1
RewriteRule /archive_([0-9]+)_([0-9]+).html /index.php?action=archive&sez=$1&offset=$2
RewriteRule /archive_([0-9]+).html /index.php?action=archive&sez=$1&offset=0
# older path-style URLs
RewriteRule /read/([0-9]+) /index.php?action=read&id=$1
#RewriteRule /read/([0-9]+)/images/([A-Za-z]+) /images/$2
RewriteRule /archive/([0-9]+)/([0-9]+) /index.php?action=archive&sez=$1&offset=$2
RewriteRule /archive/([0-9]+) /index.php?action=archive&sez=$1
That is my configuration for that site; it's in the <virtualhost> directive....
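For what it's worth, the underscore-style rules can be sanity-checked outside Apache with a rough re-implementation (a Python sketch, not equivalent to mod_rewrite: mod_rewrite patterns are unanchored, which re.search only approximates, and the path-style rules are left out):

```python
import re

# Rough Python equivalent of the underscore-style RewriteRules above,
# for checking which query string a pretty URL maps to.
# Order matters: the first matching rule wins, as in mod_rewrite.
RULES = [
    (r"/read_([0-9]+)\.html", "/index.php?action=read&id={0}"),
    (r"/story_([0-9]+)\.html", "/index.php?action=read&id={0}"),
    (r"/archive_([0-9]+)_([0-9]+)\.html", "/index.php?action=archive&sez={0}&offset={1}"),
    (r"/archive_([0-9]+)\.html", "/index.php?action=archive&sez={0}&offset=0"),
]

def rewrite(path):
    for pattern, target in RULES:
        m = re.search(pattern, path)
        if m:
            return target.format(*m.groups())
    return path  # no rule matched; serve the path as-is
```

So rewrite("/story_12.html") yields the same target as the RewriteRule does, which makes it easy to test new patterns before touching the live config.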
besides, i wanted to ask...
why did google leave only one page all of a sudden?
I had 45 indexed pages yesterday, today only one...
Can someone help me?
|why did google leave only one page all of a sudden? |
|I had 45 indexed pages yesterday, today only one... |
In the first message of this thread you wrote that Google started indexing your home page around June 19. That is a few days after the Esmeralda update began. That would mean that all the 45 pages indexed until yesterday were not in the full index. IIRC, all your pages had a fresh tag (date next to the URL). So most likely it was the 'fresh bot' that kept your pages in the SERP. But the weird thing is, you only have a sub page left over, not your home page.
Soon freshbot will bring some more pages into the SERP. Maybe you should just have some more patience until the next update, and be happy with the pages that were already shown in the SERPs for the last few weeks. After all, the site was found after the last update and has only a few inbound links.