
Sitemaps, Meta Data, and robots.txt Forum

    
How can I block strange URLs?
IndigoHollow




msg:4518733
 2:21 pm on Nov 13, 2012 (gmt 0)

On my site, search bots often index URLs like:

/statji/raschet-dimoudalenija?PAGEN_1=3
/statji/raschet-dimoudalenija?PAGEN_1=2
/statji/raschet-dimoudalenija?undefigned

but there are no such URLs on my site. There is only:

/statji/raschet-dimoudalenija

It is a page with the text of my article.

Question: how can I block those strange URLs? I don't want to list them all in robots.txt.

 

lucy24




msg:4518762
 4:39 pm on Nov 13, 2012 (gmt 0)

Are you asking about robots visiting-- for whatever reason-- or about nonexistent pages getting indexed in search engines?

robots.txt would not do a speck of good. If the site uses no queries at all, it can be as simple as:

RewriteCond %{QUERY_STRING} .
RewriteRule .* - [F]

That is the minimalist version. More details may be necessary.

IndigoHollow




msg:4518966
 6:27 am on Nov 14, 2012 (gmt 0)

I have written a piece of code that tells me which pages are requested by the Google robot. In this list of URLs I sometimes see the wrong URLs shown above. But in Google search results, in 75% of cases there are no such wrong URLs.
My site is built on a CMS, so I think it uses queries.

Sorry for my bad English, I am from another country. This forum is the only one that I could find on the net.

phranque




msg:4518974
 7:24 am on Nov 14, 2012 (gmt 0)

welcome to WebmasterWorld, IndigoHollow!

do you normally have any URLs with query strings?
the query strings start with the question mark, for example:

?PAGEN_1=3


do you know where google discovered these URLs?
have you tried a site crawl with xenu link sleuth?
are there other sites linking to your site using these URLs?
have you checked your server access log files to see who is requesting these URLs besides googlebot?
are the non-googlebot requests showing a referrer?
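
One way to answer the last two questions is to scan the access log for requests that carry a query string. A minimal sketch in Python, assuming the server writes the Apache "combined" log format; the function name query_requests and the two sample lines are made up for illustration and are not taken from a real log.

```python
import re

# Matches one line of the Apache "combined" log format (an assumption;
# adjust the pattern if your server logs in a different layout).
LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>[A-Z]+) (?P<target>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def query_requests(lines):
    """Yield (target, referer, agent) for every request with a query string."""
    for line in lines:
        m = LINE.match(line)
        if m and "?" in m.group("target"):
            yield m.group("target"), m.group("referer"), m.group("agent")

# Hypothetical sample lines in combined format.
sample = [
    '192.0.2.1 - - [14/Nov/2012:07:00:00 +0000] '
    '"GET /statji/raschet-dimoudalenija?PAGEN_1=3 HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.7 - - [14/Nov/2012:07:01:00 +0000] '
    '"GET /statji/raschet-dimoudalenija HTTP/1.1" 200 5120 '
    '"http://example.com/" "Mozilla/5.0"',
]

hits = list(query_requests(sample))
for target, referer, agent in hits:
    print(target, "| referer:", referer, "| agent:", agent)
```

A referer of "-" means the client sent none, which is typical for crawlers. A real referer on a non-googlebot request would point to the page that carries the bad link.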

lucy24




msg:4519134
 7:59 pm on Nov 14, 2012 (gmt 0)

My site is built on a CMS, so I think it uses queries.

It almost certainly uses queries behind the scenes, but are they part of the visible URL? Google should only index URLs that are seen by humans. So the second-simplest form is

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ [^\s?]+\?
RewriteRule .* - [F]

meaning: the visitor asked for something containing a query string. (The \s is because I have temporarily forgotten whether the referer counts as part of the complete request. Almost everything from a search engine will have a ? in the referer.)
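
To see what this pattern accepts, it can be tried on a few request lines outside Apache. A rough sketch in Python, whose regex syntax is close to, but not identical to, the PCRE that Apache uses. The quantifier is written {3,9}, the standard regex form for "three to nine", and the sample request lines are invented. As far as the mod_rewrite documentation describes it, THE_REQUEST holds only the request line (method, target, protocol) and none of the other headers, so the referer should not be part of it.

```python
import re

# Same pattern as the RewriteCond above, with the quantifier written {3,9}.
HAS_QUERY = re.compile(r"^[A-Z]{3,9} [^\s?]+\?")

# Invented request lines in the shape "METHOD target PROTOCOL".
samples = [
    "GET /statji/raschet-dimoudalenija?PAGEN_1=3 HTTP/1.1",
    "GET /statji/raschet-dimoudalenija HTTP/1.1",
    "POST /search?q=test HTTP/1.1",
]

# True means the RewriteRule would fire and answer 403 Forbidden.
results = [bool(HAS_QUERY.match(s)) for s in samples]
for s, r in zip(samples, results):
    print("forbidden" if r else "allowed  ", s)
```

Only the first and third lines match, because only they have a ? directly after the path.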

The third-simplest form is the one you have to use if you've got any kind of on-site analytics, because then you yourself will be asking for things with a query string.

In the fourth-simplest form, any request containing a query string is forcibly redirected to the queryless form of the same URL.

Are there many different queries, or are they the same ones over and over? Your first post showed "PAGEN_1" twice. You can go into Webmaster Tools and tell google to ignore this parameter. But also make sure that the googlebot is not able to get pages in this form if they don't really exist.

IndigoHollow




msg:4521571
 1:25 pm on Nov 22, 2012 (gmt 0)

Friends! Thank you for all the answers! I'll answer your questions a little later.

I am now corresponding with Google support and have another QUESTION for you:
I wrote this line in robots.txt:
Disallow: /*PAGEN*
Does it mean that not only "../articles/first/?PAGEN_1_1" will be blocked, but the URL "../articles/first?PAGEN_1_1" (without "/") will be blocked too?

Thank you all for your answers!

IndigoHollow




msg:4521612
 3:30 pm on Nov 22, 2012 (gmt 0)

I'll answer my own question. Yes, all links that include "PAGEN" in their URL will be blocked.

phranque




msg:4521638
 5:30 pm on Nov 22, 2012 (gmt 0)

the Disallow in robots.txt matches urls left-to-right.
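
A small sketch of that matching in Python, assuming the wildcard rules Google documents for its own crawler: the rule is compared with the URL path plus query string starting from the left, * stands for any run of characters, and a trailing $ anchors the end. The function name rule_matches is hypothetical, the paths are adapted from the question above, and other crawlers may not support wildcards at all.

```python
import re

def rule_matches(rule, path):
    """Return True if a robots.txt Disallow rule covers the given path.

    Assumes Google-style matching: prefix match from the left,
    '*' as a wildcard, and an optional trailing '$' as an end anchor.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape everything, then turn each escaped '*' back into '.*'.
    pattern = re.escape(rule).replace(r"\*", ".*")
    if anchored:
        pattern += "$"
    # re.match anchors at the start only, which gives the left-to-right prefix match.
    return re.match(pattern, path) is not None

rule = "/*PAGEN*"
paths = [
    "/articles/first/?PAGEN_1_1",
    "/articles/first?PAGEN_1_1",
    "/articles/first",
]
covered = [rule_matches(rule, p) for p in paths]
for p, c in zip(paths, covered):
    print("blocked" if c else "allowed", p)
```

Under these assumptions both query-string forms are covered, with or without the slash, and the plain article URL is not.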

