
Sitemaps, Meta Data, and robots.txt Forum

    
How can I block strange URLs?
IndigoHollow



 
Msg#: 4518731 posted 2:21 pm on Nov 13, 2012 (gmt 0)

On my site, search bots often index URLs like:

/statji/raschet-dimoudalenija?PAGEN_1=3
/statji/raschet-dimoudalenija?PAGEN_1=2
/statji/raschet-dimoudalenija?undefigned

but there are no such URLs on my site. There is only:

/statji/raschet-dimoudalenija

It's a page containing the text of my article.

Question: how can I block those strange URLs? I don't want to list them all in robots.txt.

 

lucy24




 
Msg#: 4518731 posted 4:39 pm on Nov 13, 2012 (gmt 0)

Are you asking about robots visiting-- for whatever reason-- or about nonexistent pages getting indexed in search engines?

robots.txt would not do a speck of good. If the site uses no queries at all, it can be as simple as:

RewriteCond %{QUERY_STRING} .
RewriteRule .* - [F]

That is the minimalist version. More details may be necessary.

IndigoHollow



 
Msg#: 4518731 posted 6:27 am on Nov 14, 2012 (gmt 0)

I have written some code that tells me which pages are indexed by the Google robot. In that list of URLs I sometimes see the wrong URLs mentioned above. But in about 75% of cases, these wrong URLs do not appear in Google's search results.
My site is built on a CMS, so I think it uses queries.

Sorry for my bad English; I am from another country. This forum is the only one I could find on the net.

phranque




 
Msg#: 4518731 posted 7:24 am on Nov 14, 2012 (gmt 0)

welcome to WebmasterWorld, IndigoHollow!

do you normally have any URLs with query strings?
the query strings start with the question mark, for example:

?PAGEN_1=3


do you know where google discovered these URLs?
have you tried a site crawl with xenu link sleuth?
are there other sites linking to your site using these URLs?
have you checked your server access log files to see who is requesting these URLs besides googlebot?
are the non-googlebot requests showing a referrer?

lucy24




 
Msg#: 4518731 posted 7:59 pm on Nov 14, 2012 (gmt 0)

"My site is built on a CMS, so I think it uses queries."

It almost certainly uses queries behind the scenes, but are they part of the visible URL? Google should only index URLs that are seen by humans. So the second-simplest form is

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ [^\s?]+\?
RewriteRule .* - [F]

meaning: the visitor asked for something containing a query string. (The \s is because I have temporarily forgotten whether the referer counts as part of the complete request. Almost everything from a search engine will have a ? in the referer.)

The third-simplest form is the one you have to use if you've got any kind of on-site analytics, because then you yourself will be asking for things with a query string.

In the fourth-simplest form, any request containing a query string is forcibly redirected to the queryless form of the same URL.

Are there many different queries, or are they the same ones over and over? Your first post showed "PAGEN_1" twice. You can go into Webmaster Tools and tell google to ignore this parameter. But also make sure that the googlebot is not able to get pages in this form if they don't really exist.
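[Editor's sketch] The query-string test and the "fourth-simplest form" described in this post can be illustrated outside Apache. The Python below is not server configuration and not lucy24's code; it only mimics the logic, assuming the quantifier in the pattern is meant as {3,9}. The function names has_query and queryless_target are hypothetical. For what it is worth, Apache documents THE_REQUEST as the request line only (for example "GET /index.html HTTP/1.1"), without any headers, so the referer is not part of it.

```python
import re
from typing import Optional

# Same pattern as in the RewriteCond, with the quantifier written as {3,9}.
# It is tested against a request line such as "GET /page?x=1 HTTP/1.1".
REQUEST_WITH_QUERY = re.compile(r"^[A-Z]{3,9}\ [^\s?]+\?")


def has_query(request_line: str) -> bool:
    """Return True if the requested URL contains a '?' (a query string)."""
    return REQUEST_WITH_QUERY.match(request_line) is not None


def queryless_target(request_line: str) -> Optional[str]:
    """Return the queryless path to redirect to, or None if no redirect is needed."""
    if not has_query(request_line):
        return None
    # Request line layout: METHOD, space, request target, space, protocol.
    target = request_line.split(" ")[1]
    # Keep only the part before the first '?'.
    return target.split("?", 1)[0]


if __name__ == "__main__":
    print(queryless_target("GET /statji/raschet-dimoudalenija?PAGEN_1=3 HTTP/1.1"))
    print(queryless_target("GET /statji/raschet-dimoudalenija HTTP/1.1"))
```

A real redirect would be done by the server itself; this only shows which requests the pattern catches and what the clean URL would be.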

IndigoHollow



 
Msg#: 4518731 posted 1:25 pm on Nov 22, 2012 (gmt 0)

Friends! Thank you for all the answers! I'll answer your questions a little later.

I am now corresponding with Google support and have another QUESTION for you:
I wrote this line in robots.txt:
Disallow: /*PAGEN*
Does this mean that not only "../articles/first/?PAGEN_1_1" will be blocked, but also the URL "../articles/first?PAGEN_1_1" (without the "/")?

Thank you all for your answers!

IndigoHollow



 
Msg#: 4518731 posted 3:30 pm on Nov 22, 2012 (gmt 0)

I'll answer my own question. Yes, all links that include "PAGEN" in their URL will be blocked.

phranque




 
Msg#: 4518731 posted 5:30 pm on Nov 22, 2012 (gmt 0)

the Disallow in robots.txt matches urls left-to-right.
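[Editor's sketch] Left-to-right matching with wildcards can be modelled in a few lines of Python. This assumes the Google-style extensions, where "*" stands for any run of characters and a trailing "$" anchors the end of the URL; other crawlers may treat these characters differently. The helper names rule_to_regex and is_disallowed are hypothetical, and this is not how any particular crawler is implemented.

```python
import re


def rule_to_regex(rule: str):
    """Translate a robots.txt path rule into a compiled regular expression."""
    anchored_end = rule.endswith("$")
    if anchored_end:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored_end else ""))


def is_disallowed(rule: str, url_path: str) -> bool:
    """Check one Disallow rule against a path (including any query string)."""
    # re.match anchors at the start of the path: matching runs left to right.
    return rule_to_regex(rule).match(url_path) is not None


if __name__ == "__main__":
    for path in ("/articles/first/?PAGEN_1_1",
                 "/articles/first?PAGEN_1_1",
                 "/articles/first"):
        print(path, is_disallowed("/*PAGEN*", path))
```

Under these assumptions, "/*PAGEN*" catches both the slash and no-slash forms, while a rule written as "/PAGEN" would catch neither, because without the "*" the rule has to match from the very first character of the path.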
