
Sitemaps, Meta Data, and robots.txt Forum

    
Maximum size for robots.txt
Does anyone know what it is?
blend27
WebmasterWorld Senior Member 5+ Year Member
Msg#: 552 posted 7:01 pm on Feb 16, 2005 (gmt 0)

I believe that we got hit with a duplicate content penalty by you know who, after renaming the product template and then changing the URL variable name from ProdID to prod_id. I am trying to figure out a way to get out of this problem. I deleted the duplicate template that we had on the server; it has been 5 weeks since I did so. G re-crawls the site on a weekly basis, I know that for sure, and it also still requests the old files.

As soon as I found out about the problem I wrote a script that did a 301 redirect, but it does not seem to solve the dilemma. Rewriting the content is not an option either.
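For illustration only, something like this Apache mod_rewrite sketch is the general idea of such a 301 (a rough sketch, not the actual script; the URL forms are assumptions based on the patterns described later in the thread):

  # .htaccess sketch: send the old query-string URLs to the new path-style URLs with a 301
  RewriteEngine On
  # match e.g. /product_page.cfm?prodid=100 (case-insensitive)
  RewriteCond %{QUERY_STRING} ^prodid=([0-9]+)$ [NC]
  # redirect to /product_page.cfm/prodid/100.cfm; the trailing "?" drops the old query string
  RewriteRule ^product_page\.cfm$ /product_page.cfm/prodid/%1.cfm? [R=301,L,NC]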

So my strategy is to create a big robots.txt file to battle the problem.

Question: what is the maximum size for robots.txt, since it will have to be about 3000 lines long? Or maybe there is a better way to handle it.
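To make the question concrete, the brute-force file would be one Disallow line per stale URL, something like this (made-up product IDs, just to illustrate the scale):

  User-agent: *
  Disallow: /old_product_page.cfm?prodid=100
  Disallow: /old_product_page.cfm?prodid=101
  Disallow: /old_product_page.cfm?prodid=102
  # ...one line per stale URL, roughly 3000 lines in total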

Thanks for your input.

 

LowLevel
10+ Year Member
Msg#: 552 posted 1:00 am on Feb 21, 2005 (gmt 0)

Hmm... thousands of lines could be more than the spiders will handle correctly.

Are you sure you can't use the implicit wildcard at the end of any "Disallow:" directive to reduce the number of lines?

Can you post some examples of what URLs you need to disallow?

Lord Majestic
WebmasterWorld Senior Member 10+ Year Member
Msg#: 552 posted 1:10 am on Feb 21, 2005 (gmt 0)

> Are you sure you can't use the implicit wildcard at the
> end of any "Disallow:" directive to reduce the number of lines?

You can't use wildcards or pattern matching and stay compliant with the current de facto standard.

It might be better to split the site into two logical areas, one of which (to avoid suspected duplicate content) is specifically disallowed in robots.txt. Make sure, however, that it is not the area everyone is linking to or will be linking to!
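A minimal robots.txt sketch of that split, assuming purely as an example that the duplicate area lives under an /old/ path (the real layout depends on how the site is divided):

  User-agent: *
  # the legacy/duplicate area, kept away from the spiders
  Disallow: /old/
  # everything outside /old/ remains crawlable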

LowLevel
10+ Year Member
Msg#: 552 posted 4:07 am on Feb 21, 2005 (gmt 0)


> You can't use wildcards or pattern matching and stay compliant with the current de facto standard.

If you read my reply carefully, you'll notice that I was referring to the implicit wildcard "*" that the Robots Exclusion Standard "puts" at the end of any Disallow: path.

Sometimes webmasters can take advantage of this feature to reduce the number of Disallow lines in the file.

That's why I asked the original poster for an example: to understand whether the implicit wildcard can be used in this specific case to limit the number of Disallow lines.
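A small sketch of what that implicit trailing wildcard buys you, with a made-up path: every Disallow value is treated as a URL prefix, so one line covers the whole family of URLs that start with it.

  User-agent: *
  # blocks /old_catalog.cfm itself plus anything beginning with it,
  # e.g. /old_catalog.cfm?prodid=100, /old_catalog.cfm?prodid=101, ...
  Disallow: /old_catalog.cfm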

blend27
WebmasterWorld Senior Member 5+ Year Member
Msg#: 552 posted 7:41 am on Feb 21, 2005 (gmt 0)

The thing is that there seem to be 5 sets of URLs:

product_page.cfm/prodid/100.cfm <- this one I want to keep
then there is
old_product_page.cfm?prodid=100, which has a lot of backlinks from scraper sites - I have nothing to do with those

product_page.cfm?prodid=100 <- goes as a 301 to the one above, but it is in the index of Y and G, and for some reason I can't get them to disappear for 5 months already
product_page.cfm/prodid/100/item/widget-name.cfm
and
product_page.cfm/item/widget-name/prodid/100.cfm
where the widget name can be different

In MS-SE we are doing great: top 5 on about 100 keyword phrases, and it shows about 1700 pages indexed. Google shows 4800 pages indexed, of which about 3000 either no longer exist or are duplicate content with different URLs for the same products.

Duplicate content with different URLs for the same site - I did not mean for that to happen at all.
I am at the point of contacting Google and asking them to drop the site from the index completely and then re-index it. I don't know how far I will get with that request. G-Bot rolls in 2-3 times a week at least and does a pretty good job of caching pages.

When I do a site: command it even returns separate URLs for pages with
product_page.cfm……… and Product_Page.cfm - case-sensitive P and p.

The end result: for our main keyword phrase we are not in the first 1000 on Google, and we have more widgets than any retail competitor on the market - sad.

I am going to post the URL of the site in my profile. I am here to listen and learn.

Thanks

Lord Majestic
WebmasterWorld Senior Member 10+ Year Member
Msg#: 552 posted 9:40 pm on Feb 25, 2005 (gmt 0)

> If you read my reply carefully, you'll notice that I was referring to the implicit

My bad -- you're right, the implicit wildcard '*' is fine; I misread it as using those characters explicitly in the paths. :(
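Putting the two posts together, here is a rough robots.txt sketch of how the implicit prefix matching could cover the URL sets blend27 listed, without thousands of lines. It assumes the bots treat each Disallow value as a literal, case-sensitive prefix of the path plus query string (including the "?"), which is worth verifying for each crawler:

  User-agent: *
  # old template, any prodid
  Disallow: /old_product_page.cfm
  # query-string form of the current template; the path-style URLs
  # (product_page.cfm/prodid/...) are not matched because they continue
  # with "/" rather than "?"
  Disallow: /product_page.cfm?
  # the mixed-case variant that is being indexed separately
  Disallow: /Product_Page.cfm?

Whether the 301-redirected URLs should also be blocked is a separate question: once a URL is disallowed, a spider will not fetch it and so may never see the redirect.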
