Robots txt to solve gallery duplicates

can someone confirm this is correct

         

caines

8:10 am on Jan 5, 2007 (gmt 0)

10+ Year Member



Sorry, I'm a pretty inexperienced webmaster. On my hobby site I use Gallery2. It's a great photo gallery, but it creates a lot of duplicate issues because, for example, www.mysite.com/photos/, www.mysite.com/photos/index.php and www.mysite.com/photos/index.php?g2=page1 are all exactly the same page. I don't know why it functions this way, but it does.

I also used (probably mistakenly) the mod-rewrite tool to create more attractive-looking links, so every photo appears as www.mysite.com/photos/pretty-picture.jpg.html. That should be good for the search engines, so I read lol, which is why I went that way. BUT I have over 3000 pretty-pictures and they don't all have unique descriptions, keywords, etc. In fact, considering all 3000 photos have the same subject, Place A, it's actually hard to come up with 3000 unique ways to say "a picture of place A", although obviously every picture is different!

Anyhoo, the traffic that comes via the image part of Google is of little value, and my main priority is to rank well for the www.mysite.com and www.mysite.com/photos pages so people can find the site. So, to avoid any further penalties (I think I have been knocked back 30 spots already), I decided to do the following:

User-agent: Googlebot
Disallow: /*?*
User-Agent: Googlebot
Disallow: /photos/*.html$

I have also done a re-write in the class file within gallery2 so that photos/index.php is now resolving to photos/

I've never needed to use robots.txt before. Have I done this correctly, and will it have the desired effect?

Any advice greatly appreciated

jdMorgan

8:48 am on Jan 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm not sure what specific URLs you do and don't want crawled, but here's a couple of partial answers:

You only need one "User-agent: Googlebot" line there.

If you have another user-agent you want to control, then there must be a blank line after the last Disallow in any User-agent's record, including the last one in the file:

User-agent: abc
Disallow: /xyz
Disallow: /123
Disallow: /doremi

User-agent: def
Disallow: /xyz
Disallow: /124


If you mean that your site has been knocked down exactly 30 places, then you should do a search on this site for the "minus 30 penalty [google.com]" -- This is a known phenomenon.

URLs like "photo.jpeg.html" are rather nonsensical. If these are JPEG files, they should be called "photo.jpeg" or "photo.jpg", and if they are HTML pages, then "photo.html".

Jim

caines

10:54 am on Jan 5, 2007 (gmt 0)

10+ Year Member



Thanks very much for answering! I had done it wrong, so I have fixed it now. I appreciate your setting me straight.

I agree that adding .html to the end of every photo page is a strange approach, but this plugin is used by Gallery2 because the alternative is a URL which was deemed to be "unfriendly" to search engines. I think making each photo an HTML page is equally problematic if, as in my case, I am working with a great many photos and don't want to create meta tags for every one.

What I'm trying to achieve with regard to indexing is this: I just want the first two levels of the gallery indexed, so mysite.com/photos/ and mysite.com/photos/pictures-of-pigs/, because the front page of the site and the first page of each album have all the meta tags and each is unique, etc. My hope is to stop Googlebot indexing any of the photo pages, which present as www.mysite.com/photos/picture.jpg.html

I also want to stop Google from indexing all duplicate versions of album pages. For example, I want mysite.com/photos/ to be indexed but mysite.com/photos/?g2_page1 not to be indexed, as both URLs resolve to the same page. The first page of every album can be reached via mysite.com/photos/albumname/ and mysite.com/photos/albumname/?g2_page1
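
To see which of these URLs the two Disallow patterns would actually catch, here is a rough Python sketch of Googlebot-style wildcard matching: `*` matches any run of characters, a trailing `$` anchors the end of the URL, and anything else is a plain prefix match. The helper names `rule_to_regex` and `is_blocked` are made up for illustration, and the sketch ignores Allow lines and rule precedence, so it is a quick check under those assumptions rather than a copy of what Googlebot does.

```python
import re


def rule_to_regex(rule):
    """Translate a Googlebot-style Disallow pattern into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path, rules):
    """True if any Disallow pattern matches the path from its start."""
    return any(rule_to_regex(rule).match(path) for rule in rules)


# The two patterns from the robots.txt in the first post.
RULES = ["/*?*", "/photos/*.html$"]

for path in [
    "/photos/",
    "/photos/pictures-of-pigs/",
    "/photos/picture.jpg.html",
    "/photos/?g2_page1",
    "/photos/index.php",
]:
    print(path, "blocked" if is_blocked(path, RULES) else "allowed")
```

Under these assumptions the two album URLs come out as allowed and the photo page and the `?g2_page1` URL as blocked. `/photos/index.php` is also allowed, so the robots.txt alone does not cover that duplicate.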

So I think this robots.txt should take care of all the photo pages that are missing tags and most of the duplicates. The only issue I have is mysite.com/photos/index.php. I have redirected it, but if you type www.mysite.com/photos/index.php into the browser, the page still comes up with that address showing, so clearly I still have to find a way to redirect that properly.

My site is quite new, about 6 months old, so I don't know if it's really been penalised or if I'm just bouncing around at the moment. I have seen the posts here about the minus 30 penalty, so whether that is what has happened or not, I thought I'd better sort my site out :)

One last idiot question, if you don't mind: is asking Google not to index the duplicates enough to make Google happy? Or do I need to 301 all of the URLs containing a ? to their mod-rewrite version? If so, how would you direct mysite.com/photos/example/?g2_page=1 to mysite.com/photos/example/ when there is no album called example sitting under the photos directory on my site? Sorry, probably a really dumb question.

Thanks again for your help
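
On the 301 question, a redirect is decided from the requested URL string, so no physical "example" directory needs to exist. Here is a small Python sketch of just the mapping from a duplicate URL to its clean form. It is not server config, and the function name `canonical_url` is made up. It also assumes that only a query of exactly `g2_page=1` and a trailing `index.php` mark a duplicate; other queries are left alone because they may point at different pages.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def canonical_url(url):
    """Return the clean URL a duplicate could 301 to, or None if already clean."""
    parts = urlsplit(url)
    path, query = parts.path, parts.query
    if path.endswith("/index.php"):
        path = path[: -len("index.php")]  # keep the trailing slash
    # Assumption: only a query of exactly g2_page=1 duplicates the clean URL.
    if parse_qs(query) == {"g2_page": ["1"]}:
        query = ""
    clean = urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))
    return clean if clean != url else None


print(canonical_url("http://mysite.com/photos/example/?g2_page=1"))
print(canonical_url("http://mysite.com/photos/index.php"))
print(canonical_url("http://mysite.com/photos/example/"))
```

The same mapping would then need to be written as a redirect in whatever layer handles the request, for example the server's rewrite rules or Gallery2 itself.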