
getting a product catalog indexed...

150,000+ products online and not 1 indexed!

         

pardo

2:46 pm on May 6, 2003 (gmt 0)

10+ Year Member



Maybe this is a good topic to take all eyes off the (possible) update. I think there are others who will have or recognize the following 'problem':

I have to get our 150,000+ (multilingual) product catalog indexed in Google (right now only about 100 pages are indexed). The site is built around a database; while surfing around, a session ID and more ends up in the URL. I have read before that you have to get rid of that sort of URL.

Second is the navigation through categories and products. It works sort of like <a href="javascript:showCategory('-here goes_the_number_of_the_category')">

The site is in English. The different countries have no way to get all products translated into their own language; only some non-product-related pages and page elements can be translated.

My plan is to get the database modified in a way that lets each country translate the most important information (product name and HTML title tag). Next to this, the URLs have to be rewritten into static-looking URLs for each language, so that it's fairly easy to index all URLs and each URL tells you what it's about (../red-widgets.html is more likely to be clicked on than ../categoryID=12345 when someone is searching for red widgets).
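
For the rewriting I'm thinking of something along these lines in Apache's .htaccess - just a sketch, and the script name and parameter names (viewproduct.php, lang, product) are invented for the example:

RewriteEngine On
# map e.g. /de/red-widgets.html onto the real catalog script
RewriteRule ^([a-z]{2})/([a-z0-9-]+)\.html$ /viewproduct.php?lang=$1&product=$2 [L]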

The last thing is the navigation. Either I'll move it away from the 'JavaScript method' or build some proper, dynamic sitemaps that point the spider into the product database down to the deepest level.
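
To illustrate the sitemap idea, a plain-HTML page generated straight from the database is what I have in mind - a minimal sketch, where the database, table and column names (catalog, categories, name) are made up for the example:

<?php
// plain-HTML sitemap: one link per category, so a spider
// can reach every product listing without JavaScript
$db = mysql_connect("localhost", "user", "password");
mysql_select_db("catalog", $db);
$result = mysql_query("SELECT name FROM categories ORDER BY name");
while ($row = mysql_fetch_assoc($result)) {
    // build a static-looking URL that the rewrite rule above understands
    $slug = strtolower(str_replace(" ", "-", $row["name"]));
    echo '<a href="/uk/' . $slug . '.html">' . htmlspecialchars($row["name"]) . "</a><br>\n";
}
?>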

Besides my request for feedback on this approach, some questions remain for me:

1. What harm do the session IDs do?
2. How about duplicate content:
.../de/blue-widgets.html
.../uk/blue-widgets.html
where several of the 'same' elements will be on both pages?

This will be my first major product catalogue, so any help would be welcome to give the 'techies' in our company some evidence to get things changed...

Many thanks in advance!

Darkness

3:31 pm on May 6, 2003 (gmt 0)

10+ Year Member



My suggestions:

1. If the content of
.../de/blue-widgets.html
.../uk/blue-widgets.html

is the same, then ban Google from all but one directory (using robots.txt), e.g. uk - see the example after this list.

2. Google doesn't like session IDs, so try to pass these in a cookie instead of the URL.

3. To guarantee Google will be able to index the site, I would rewrite it to use no URL parameters, e.g.

/area/category/productid

You *may* get away with having simple parameters, e.g.

/viewproduct.php?id=4532

4. Get rid of the JavaScript if possible; robots generally cannot understand it, and users may also have it disabled.
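
For suggestion 1, a minimal robots.txt might look like this (assuming /uk/ is the directory you keep crawlable; /de/ and /fr/ are example names for the duplicated language directories):

User-agent: *
# block the duplicated language copies, leave /uk/ crawlable
Disallow: /de/
Disallow: /fr/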

killroy

3:39 pm on May 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If the pages are equal in all but the content paragraph (which is translated), and they are not super long pages, I doubt it would trigger the dupe algo. Also remember that Google wants what is good for your visitors, and it would certainly make sense to your visitors to get served the page in the language that they searched for, not the "default for Google" uk page.

I've got a directory with around 70-90 thousand pages, where subject pages may appear in many topics with different paths but the same page, and Google has never penalised it for that yet.

Also, think carefully about your URLs up front; it'll make the design and layout of the site much easier. The URL structure will almost be your site structure map.

Good luck with the conversion.

SN

Martin Dunst

3:43 pm on May 6, 2003 (gmt 0)

10+ Year Member



hello pardo,

first of all, build an _accessible_ site.
i think a user should be able to use a site with a browser that can't handle anything but plain html.
no javascript, no cookies, no css, no java, no plugins.
if you can do this, then you don't have to worry much about search-bots.
there are quite a few resources on w3.org dealing with web content accessibility.

a site navigation based on javascript simply destroys the idea behind hypertext.
such a thing is bad for the _user_ in the first place - and it's also bad for search-bots.
try to get rid of it.

sessions:
make sure that an http-client can go through all relevant pages without triggering a session.
sessions could indicate temporary and/or personalized content.
moreover, googlebot would receive a new session-id (i.e. a new url) every time it requests a document.

regards
martin dunst

jever

3:44 pm on May 6, 2003 (gmt 0)

10+ Year Member



Depending on your language, putting SIDs into a cookie might be harmful too. Example: if a browser doesn't accept cookies, PHP automatically stores a session ID in the URL.
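
In PHP that fallback can be switched off so sessions are cookies-only - a sketch, which as far as I know needs PHP 4.3+ for these ini_set calls:

// never rewrite URLs/links to carry the session ID
ini_set("session.use_trans_sid", 0);
// accept the session ID from cookies only, never from the URL
ini_set("session.use_only_cookies", 1);
session_start();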

I put a little routine onto my site that checks for Googlebot and switches off session handling entirely.

Jever

trillianjedi

3:50 pm on May 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Jever,

Any chance of the snippet of code that does that?

Thanks,

TJ

jever

4:45 pm on May 6, 2003 (gmt 0)

10+ Year Member



sure:
---->8----->8-----

// get the user agent string
$agent = $_SERVER['HTTP_USER_AGENT'];

// check for some known spider agents
if (
stristr($agent, "Googlebot") ||
stristr($agent, "inktomi") ||
stristr($agent, "scooter") ||
stristr($agent, "webcrawler")
)
{
// no session ID for spiders
// any kind of special code
// or do nothing
}
else
{
// normal visitors get a session
// 15 minutes lifetime
ini_set("session.gc_maxlifetime", 900);

// 10 percent probability
// for session garbage collection
ini_set("session.gc_probability", 10);

session_start();
}

----8<-----8<----

I put this in my header file, which is always loaded with any of my dynamic pages.

Of course: no warranty ;-)

Jever

trillianjedi

4:51 pm on May 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thank you Jever!

No warranties expected, but I'll have a play with that.

TJ

jever

4:59 pm on May 6, 2003 (gmt 0)

10+ Year Member



It's also a nice way to keep leechers and unwanted bots out. Just check for their agent strings and do something with them ;-)
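
For example, something along these lines - the agent strings here are just examples of the kind of thing you might block:

// throw known bad agents out before doing anything else
$agent = $_SERVER['HTTP_USER_AGENT'];
if (stristr($agent, "EmailSiphon") || stristr($agent, "WebZIP"))
{
header("HTTP/1.0 403 Forbidden");
exit;
}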

Jever

pardo

6:01 pm on May 6, 2003 (gmt 0)

10+ Year Member



Many thanks, all of you, for having a thought on it and sharing helpful tips. This is why I like this place so much! I will certainly share my experiences once this is implemented.

Killroy, it makes perfect sense to me to serve a page in the language of the user, but I have to agree with Darkness where he points out his concern. The pages wouldn't be exactly the same: some database fields will be 'all English', while other elements will be translated by the countries. Together with the rewritten URLs, these will be separate pages, I think...