
Google News Archive Forum

No follow?
Google is not crawling my page
corvonero




msg:176431
 8:11 am on Jun 24, 2003 (gmt 0)

Hello,
I have this silly problem:
Google has been indexing my site for 5 days or so... but it only reads "/" and does not crawl any further links.

Most of the links look like this:
<a class=mainlisttitle href="/index.php?action=read&sez=&id=309">text</a>

I'll put the complete URL; if someone wants to give it a check, I would really appreciate it:
www ilbisturi it

I misspelled it on purpose, so I won't look like a spammer :)
TIA,
Manuele

 

corvonero




msg:176432
 10:39 pm on Jun 24, 2003 (gmt 0)

This is getting very strange...
still no further crawling, only the home page is being fetched...
can someone give me a hint?
Thanks

DaveN




msg:176433
 10:49 pm on Jun 24, 2003 (gmt 0)

Too many variables in the URL, for one.

dave

frogg




msg:176434
 2:40 am on Jun 25, 2003 (gmt 0)

Last I checked, Google completely ignored URLs that contained a query-string parameter of id=anything-at-all -- this was some months ago.

takagi




msg:176435
 3:34 am on Jun 25, 2003 (gmt 0)

Google and other search engines don't like many parameters, but recently Google has been more likely to spider URLs with several parameters. Try to keep the number low (2 or 3 will do for Google). But the most important thing is to avoid a parameter with a name like 'id'. This looks like a session ID, and following those links would mean spidering many (almost) identical pages. So I agree with frogg: the main problem is probably the 'id' parameter.
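
For example (the renamed parameter here is just a made-up illustration, not a name taken from your site):

/index.php?action=read&sez=100&id=309 <- 'id' looks like a session ID
/index.php?action=read&sez=100&art=309 <- same link, parameter renamed, safer for spidering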

Hope this helps.

GoogleGuy




msg:176436
 4:48 am on Jun 25, 2003 (gmt 0)

Yah, "id=" usually marks a session id, so if you can rename that I would. Also, fewer parameters are better. Good luck! :)

Clark




msg:176437
 6:58 am on Jun 25, 2003 (gmt 0)

Is the problem with "anythingid=" or just "id="?

corvonero




msg:176438
 7:43 am on Jun 25, 2003 (gmt 0)

Thank you all,
I think I'll mod_rewrite the whole thing ("id" is nearly impossible to change...)
I'll let you know how it goes :)
Manuele

swerve




msg:176439
 1:37 pm on Jun 25, 2003 (gmt 0)

recently Google has been more likely to spider URLs with several parameters. Try to keep the number low (2 or 3 will do for Google).

3? Is this true? I have never seen a 3 parameter URL in the SERPs, but I guess I haven't been looking that hard. Can you confirm that Google will index 3-parameter URLs?

Thanks,

swerve

Clark




msg:176440
 6:02 pm on Jun 25, 2003 (gmt 0)

I've seen them index more than 3, but I don't think you'll get as many URLs spidered. It's more of a rarity. But if there are a lot of sites on the web linking to a page with a lot of parameters, it will get indexed.

nostgard




msg:176441
 6:46 pm on Jun 25, 2003 (gmt 0)

But the most important thing is to avoid a parameter with a name like 'id'. This looks like a session ID, and following those links would mean spidering many (almost) identical pages.

Hmm... this is interesting to know. I'm having trouble getting Google to spider my sub-pages, even without a query string. The URLs look something like this:

[......]

I wonder - could Google be smart enough to realize I'm using the path as a replacement for the parameters and notice the ciid?

Clark




msg:176442
 7:51 pm on Jun 25, 2003 (gmt 0)

Nope... use the tool here at WW to see the headers being output. If you are not sending out a 200 header, Google may not want to spider it.

swampy webber




msg:176443
 8:17 pm on Jun 25, 2003 (gmt 0)

Clark,

Where is the tool you are referring to and what exactly do you mean by a '200 Header'?

Thanks

nostgard




msg:176444
 9:09 pm on Jun 25, 2003 (gmt 0)

The tool is over on SearchEngineWorld.com.

He means the status code that your HTTP server responds with when something makes a request for a page. 200 means OK, 404 means not found, 500 means internal server error, etc.
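
If you want to check it yourself without the tool, a rough sketch in Python (the URL is just a placeholder, not anyone's real site) would be:

import urllib.error
import urllib.request

# Request the page and print the status code the server answers with.
req = urllib.request.Request("http://www.example.com/", method="HEAD")
try:
    with urllib.request.urlopen(req) as resp:
        print(resp.status)   # 200 = OK
except urllib.error.HTTPError as e:
    print(e.code)            # e.g. 404 = not found, 500 = server error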

My site sends back a 200 response code for both the page I'm trying to get it to spider and the page that links to it.

But the other engines spider it OK, so I'm thinking maybe it's just another part of the wackiness of this latest update...

Clark




msg:176445
 9:51 pm on Jun 25, 2003 (gmt 0)

I always have trouble remembering how to find it or I would have provided the link.

BTW, when did your pages go up? In the last 2 months G hasn't crawled many new pages (especially on sites that are older than a few months).

corvonero




msg:176446
 10:55 pm on Jun 25, 2003 (gmt 0)

More:
I've changed all the links; now they look like:
"/read/NNN/"
where NNN is a number...
Unfortunately, all I got was:
64.68.82.38 - - [25/Jun/2003:13:46:34 +0200] "GET /robots.txt HTTP/1.0" 404 283
64.68.82.38 - - [25/Jun/2003:13:46:35 +0200] "GET / HTTP/1.0" 200 21821

Still no spidering of the subpages...
Can someone give me another good hint?
www ilbisturi it
is the page.
Thanks again
Manuele

takagi




msg:176447
 3:56 am on Jun 26, 2003 (gmt 0)

Just looked at the links in the cache (indexed on June 24), and they were like:

<a href="/index.php?action=read&sez=100&id=274">

but the current page looks better:

<a href="/read/274/">

A check at Server Header Check [webmasterworld.com] gave 'HTTP/1.1 200 OK', so that looks good. Maybe just wait a bit longer for Google to get the sub pages as well. By the way, you hardly have any links to your site (Google, AllTheWeb, AltaVista, Inktomi all say: 0 links). Having some more could also help to get the sub pages spidered.

corvonero




msg:176448
 8:47 am on Jun 26, 2003 (gmt 0)

Hmmmm,
the site is only a few days old :)

The point is that at the time of that spidering (in the last post) the links had already been modified... but Google only asked for "/"

:/

takagi




msg:176449
 10:21 am on Jun 26, 2003 (gmt 0)

You could try the 'submit URL' at Google for a few sub pages (like "archive/3/") and see what happens. And of course get more inbound links. From what I understand of your site (sorry, I'm not so good at reading Italian), there is quite a lot of content, so it shouldn't be that hard to get some links. BTW, I also noticed a few internal links from the home page with the 'id' parameter. They link to pages with 'scarica' (download). There are enough other links that should do well now, so this can't be the cause of your problem. But you could already work on that too.

corvonero




msg:176450
 12:45 pm on Jun 26, 2003 (gmt 0)

Thank you for your interest...
So far I've changed the link layout again...
Now they are "story_NNN.html" and "archive_S_O.html",
where NNN, S and O are all numbers...

I did this to make the pages all look like they live in the same directory (better for many things...)

Now I'm waiting for the next spidering to see what happens...

(Those other links - downloads and stuff - are OK not to be spidered... so 'id' can stay there... I want to solve the main problem first and see what happens later...)

Thank you again... will let you know :)

swampy webber




msg:176451
 1:07 pm on Jun 26, 2003 (gmt 0)

Nostgard,

Thanks for explaining. I thought that was what he meant but the way he said it lost me.

Anyone have a good idea why Google crawled me back in April but the pages have no title or cache in their directory? I keep asking hoping someone has the magic solution. Think Google will get the content next time around?

Thanks

corvonero




msg:176452
 7:39 am on Jun 27, 2003 (gmt 0)

64.68.82.28 - - [26/Jun/2003:15:29:45 +0200] "GET / HTTP/1.0" 200 24302 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
18.29.1.50 - - [26/Jun/2003:15:29:46 +0200] "GET / HTTP/1.1" 200 24315 "-" "W3C_Validator/1.305.2.12 libwww-perl/5.64"
18.29.1.50 - - [26/Jun/2003:15:30:20 +0200] "GET / HTTP/1.1" 200 24315 "-" "W3C_Validator/1.305.2.12 libwww-perl/5.64"

new spidering, still only "/"...
I'm getting very upset :///
Any help appreciated...

snark




msg:176453
 9:48 pm on Jun 27, 2003 (gmt 0)

Just as a note, I have a site with tons of query strings in one section, and Google has not only spidered each and every one of them -- it has added them all to its index. A bit ridiculous considering it's a photo gallery... but I suppose it's good that at least Google *is* spidering links with query strings in them.

Clark




msg:176454
 12:30 am on Jun 28, 2003 (gmt 0)

I've decided to run a test. Will let you know the result if Google ever spiders again :)

takagi




msg:176455
 11:00 am on Jun 29, 2003 (gmt 0)

Just checked the site, and it is now indexed with 33 pages, all marked with '27 Jun 2003' in the SERP. So removing the 'id' helped to get sub pages spidered.

corvonero




msg:176456
 1:36 pm on Jun 30, 2003 (gmt 0)

Thank you all, and especially takagi...
Yes, indeed, Google is able to spider the site :)
So I'm quite glad about that...
But I still have more concerns...
1) Is it normal to have a 24/30-hour delay between the time of spidering and the time Google shows the results? That is what is happening...

2) The second question is a bit more complicated:
I have archive pages for news (archive_SECTION_OFFSET.html),
where SECTION and OFFSET are numbers...
The point is that with an OFFSET equal to 0 you get the newest archive page: this means that archive_3_0.html is always changing... and that is the page I reference from the home page.
Now, as far as I can see, Google is not following the link at the bottom of the archive page (it is an image-only anchor - left and right arrows), so it is losing all the older archives (i.e. archive_3_1.html, archive_3_2.html and so on...) and because of that it also loses a lot of content...

In order to avoid that, I can see two possible ways:

- change the naming so that archive_SECTION_0.html is the OLDEST archive and archive_SECTION_9.html (for example) is the freshest page, linked from the home page. In this scenario I would hope that Google spiders the site often enough to get the archives at least before the OFFSET changes... (a rough sketch of the idea follows this list)

- or, alternatively, find out why Google is not following the image-only anchor and make it spider the archive sub pages...
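
Here is the sketch of the first idea, just to show the arithmetic (the stories-per-page value and the function names are made up for illustration, they are not taken from the real site):

STORIES_PER_PAGE = 10

def archive_page(story_index):
    # story 0 is the oldest; page 0 holds stories 0..9, page 1 holds 10..19, ...
    return story_index // STORIES_PER_PAGE

def archive_filename(section, story_index):
    # e.g. section 3, story 42 -> "archive_3_4.html"
    return "archive_%d_%d.html" % (section, archive_page(story_index))

# Only the highest-numbered page ever changes; the older archive pages
# keep stable names that Google can index once and then leave alone.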

I hope I've explained myself...
Otherwise let me know, and I'll try to improve my bad English as fast as possible... :)

TIA, again.

takagi




msg:176457
 2:06 pm on Jun 30, 2003 (gmt 0)

1) Is it normal to have a 24/30-hour delay between the time of spidering and the time Google shows the results?

The delay between freshbot spidering the pages and the results showing up was usually more like 48 hours, I think. So 24/30 hours seems fast. The newly spidered pages from many sites have to be processed into an index, and that has to be spread over the 9 data centers.

The point is that with an OFFSET equal to 0 you get the newest archive page: this means that archive_3_0.html is always changing

That means that if a page is in the index and matches a user's query, its contents will most likely already have changed, because the content the visitor was looking for has moved to a file with a different name. For getting pages indexed this is no big problem, but most visitors will immediately hit the back button when they find that the data they were looking for is gone (i.e. not on the page they see). So giving a higher number to a more recent page is the more logical way - just like the thread numbers on this forum.

Google is not following the link at the bottom of the archive page

For a page with no inbound links (and therefore PR0), the number of links followed by freshbot will be limited, so links near the end of the source file are more likely to be skipped in the initial stage. If you make sure there are several links to your home page, then deepbot will spider more pages.

Red5




msg:176458
 8:31 am on Jul 14, 2003 (gmt 0)

Can I ask where you got the mod_rewrite code to convert the dynamic query string into a path?

I've seen a few threads dotted around, but if you could point me towards one that works well, I'd appreciate it. ;-)

corvonero




msg:176459
 10:17 am on Jul 16, 2003 (gmt 0)

RewriteEngine on
# static-looking story pages: /read_309.html or /story_309.html -> index.php?action=read&id=309
RewriteRule /read_([0-9]+).html /index.php?action=read&id=$1
RewriteRule /story_([0-9]+).html /index.php?action=read&id=$1
# archive pages: /archive_3_2.html -> index.php?action=archive&sez=3&offset=2
RewriteRule /archive_([0-9]+)_([0-9]+).html /index.php?action=archive&sez=$1&offset=$2
RewriteRule /archive_([0-9]+).html /index.php?action=archive&sez=$1&offset=0
# older path-style URLs: /read/309 and /archive/3/2
RewriteRule /read/([0-9]+) /index.php?action=read&id=$1
#RewriteRule /read/([0-9]+)/images/([a-zA-Z]+) /images/$2
RewriteRule /archive/([0-9]+)/([0-9]+) /index.php?action=archive&sez=$1&offset=$2
RewriteRule /archive/([0-9]+) /index.php?action=archive&sez=$1

That is my configuration for that site; it's inside the <VirtualHost> block....

Besides, I wanted to ask...
why did Google leave only one page all of a sudden?
I had 45 indexed pages yesterday, today only one...
Can someone help me?

takagi




msg:176460
 11:13 am on Jul 16, 2003 (gmt 0)

why did Google leave only one page all of a sudden?
I had 45 indexed pages yesterday, today only one...
Can someone help me?

In the first message of this thread you wrote that Google started indexing your home page around June 19. That is a few days after the Esmeralda update began. That would mean that all 45 pages indexed until yesterday were not in the full index. IIRC, all your pages had a fresh tag (a date next to the URL), so most likely it was the 'fresh bot' that kept your pages in the SERPs. But the weird thing is that you only have a sub page left, not your home page.

Soon 'fresh bot' will bring some more pages into the SERPs. Maybe you should just have some more patience until the next update, and be happy with the pages that were already shown in the SERPs for the last few weeks. After all, the site was only found after the last update and has only a few inbound links.
