Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
A huge number of URLs monitored in GWT
Need to remove all of them
morpheus83




msg:4554719
 10:55 am on Mar 14, 2013 (gmt 0)

I have a fairly old and popular blog (started in 2007) which ran on Movable Type. We migrated to WordPress during Christmas last year. Unfortunately I did not use Google Webmaster Tools actively until we witnessed a huge drop in traffic in Nov 2012, by almost 60%, which we have still not recovered from. One thing that surprises me in GWT is the number of URLs monitored:
Parameter    URLs monitored
page         184,095
p            15,358

To put that in perspective, my blog has 14,000 posts, 10 categories and close to 1,000 tags. The number of URLs monitored is far larger than that, and all of them are invalid links.

Movable Type paginates by adding a ?page= variable with the page number, so page 2 would be mywebsite.com/index.php?page=2. WordPress, however, paginates like this: mywebsite.com/page/2/

What is happening now is that Google is combining the two schemes and crawling thousands of irrelevant URLs like:
index.php?page=2
?page=2
watches/index.php?page=3
?page=3
index.php?page=4
?page=4
index.php?page=5
?page=5
index.php?page=6
?page=6

How can I stop Google from indexing any URL with the variable ?page= or ?p= using robots.txt? I have configured WMT not to crawl any of these URLs, but with no effect. I want to do it using robots.txt now.

Since Google is treating them as individual pages, the PR would be diluted, correct?
According to GWT, it has indexed 33,000 pages on my website.

 

lucy24




msg:4554934
 9:12 pm on Mar 14, 2013 (gmt 0)

Does one naming format redirect to the other? If so, has G already crawled everything with the /p/ name form? Again if so, all you have to do is remove the "page=" parameter in gwt.

Wait, not quite "all". Depending on your exact redirect configuration, you have to choose between telling it to ignore a particular parameter, or telling it to ignore pages whose URL contains the parameter. Make sure you are clear on the difference.

No matter what you do, URLs containing explicit "index.something" should never occur. Admittedly you don't gain much by redirecting if the whole query string is still there-- but at least you've eliminated one form of duplication.

I don't know whether someone has hard information. I have always assumed that if a search engine comes across /directory/pagename it will put /directory/ alone on its shopping list even if it has never met an explicit link in that form.

morpheus83




msg:4555112
 6:45 am on Mar 15, 2013 (gmt 0)

No, one naming format does not redirect to the other. I have removed that parameter, but even Google says that to remove the parameter, or anything else, for good, you need to mention it in robots.txt.

So I need to know how I can explicitly remove all URLs with the query string ?page= or ?p= using robots.txt.
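
A note on the robots.txt question above: Google documents support for the * wildcard in robots.txt paths, so rules along the lines of the ones below are one possible approach. Bear in mind that robots.txt blocks crawling rather than indexing as such. The Python below is only a rough sketch for checking which of the URLs from the first post such rules would catch. The rule set, ROBOTS_TXT and the helper names are illustrative assumptions, not anything taken from Google's tooling, and the matcher ignores User-agent grouping and Allow lines for brevity. Python's built-in urllib.robotparser does plain prefix matching and does not appear to handle these wildcards, hence the small hand-written matcher.

```python
import re

# Hypothetical rule set (an assumption, not from the thread):
# block any URL whose query string starts with page= or p=.
ROBOTS_TXT = """\
User-agent: *
Disallow: /*?page=
Disallow: /*?p=
"""

def disallow_patterns(robots_txt):
    """Collect non-empty Disallow values (User-agent grouping is ignored here)."""
    patterns = []
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            patterns.append(value.strip())
    return patterns

def pattern_to_regex(pattern):
    """'*' matches any run of characters; a trailing '$' anchors the end of the URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_blocked(url_path, robots_txt=ROBOTS_TXT):
    """True if any Disallow pattern matches the path plus query string."""
    return any(pattern_to_regex(p).match(url_path)
               for p in disallow_patterns(robots_txt))

if __name__ == "__main__":
    for url in ["/index.php?page=2", "/?page=3", "/watches/index.php?page=3", "/page/2/"]:
        print(url, "blocked" if is_blocked(url) else "allowed")
```

Note that these two patterns only catch the parameter when it comes first in the query string; a URL such as /?foo=1&page=2 would need an extra rule like /*&page=.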

khaty




msg:4556665
 2:06 am on Mar 20, 2013 (gmt 0)

I am not sure about robots.txt. I had the same crawl errors on my blog, and what I did was change the URL (permalink) structure, so the URLs that had ?page= now use the post name. The exact steps: go to Settings > Permalinks and choose Post name to remove the ?p= parameters.

lucy24




msg:4556683
 4:41 am on Mar 20, 2013 (gmt 0)

I have removed that parameter, but even Google says that to remove the parameter, or anything else, for good, you need to mention it in robots.txt.

I think you're conflating two different sets of instructions.

One is the Remove From Index area. There, they say that to keep things out permanently, the page either has to have a noindex tag or ::cough-cough, ahem, G said it, not me:: the page has to be blocked in robots.txt.

The other is the Parameter Handling area. Here you're giving g### permanent information, not about specific pages or directories but about naming conventions in general. This is where you list parameters that don't affect page content. It sounds as if you have already done this. And it shouldn't have anything to do with crawling, only indexing. The googlebot might happen to crawl filename.php?sillyparameter=17, but when it passes this information along to the indexing computer, the information from this crawl will be merged with information from filename.php?sillyparameter=9 and filename.php alone.
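
The merging described above can be pictured with a small sketch. This is only an illustration of the idea, not how Google's indexing actually works: URLs that differ only in an ignored parameter collapse to the same key. The example.com host, IGNORED_PARAMS and merge_key are made-up names for the illustration; sillyparameter is taken from the example above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters declared as "doesn't affect page content" (assumed for the example).
IGNORED_PARAMS = {"sillyparameter"}

def merge_key(url, ignored=IGNORED_PARAMS):
    """Drop ignored query parameters so equivalent URLs share one key."""
    parts = urlsplit(url)
    kept = [(k, v)
            for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in ignored]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

if __name__ == "__main__":
    urls = [
        "http://example.com/filename.php?sillyparameter=17",
        "http://example.com/filename.php?sillyparameter=9",
        "http://example.com/filename.php",
    ]
    # All three URLs reduce to the same key, so they count as one page.
    print({merge_key(u) for u in urls})
```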

But what about the other issue? In your first post all the examples come in pairs:
index.php?page=2
?page=2
watches/index.php?page=3
?page=3
index.php?page=4
?page=4

Always the identical page, with and without "index.php" in its name. It needs to be one or the other, not both. This part can be handled without wmt at all; you just need a global "index.php" redirect. Over in the apache forum this is in the Top Ten Recurring questions, so you should have no trouble putting together the code you need. It's a single line in htaccess if you use the [NS] flag, two lines if you use %{THE_REQUEST} instead.
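
To make the intended effect of that redirect concrete, here is a minimal Python sketch of the mapping only, i.e. which requested URLs should be answered with a redirect and to where. It is not the htaccess rule itself; redirect_target and INDEX_RE are hypothetical names, and the sketch assumes the rule is applied to the URL as the visitor originally requested it (which is what the [NS] flag or %{THE_REQUEST} is there to ensure on the Apache side).

```python
import re

# Matches a path that ends in an explicit "index.php", at any directory depth.
INDEX_RE = re.compile(r"^(.*/)index\.php$")

def redirect_target(request_uri):
    """Return the URL to redirect to, or None if no redirect is needed.

    request_uri is the path plus optional query string as originally requested.
    """
    path, sep, query = request_uri.partition("?")
    match = INDEX_RE.match(path)
    if not match:
        return None  # already in the preferred form
    # Keep the directory part and re-attach the query string unchanged.
    return match.group(1) + sep + query

if __name__ == "__main__":
    for uri in ["/index.php?page=2", "/watches/index.php?page=3", "/?page=2"]:
        print(uri, "->", redirect_target(uri))
```

With this in place each pair from the first post collapses to a single URL; what to do with the remaining ?page= query string is the separate parameter question discussed above.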

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved