
Google SEO News and Discussion Forum

Test pages added in Google as duplicates
fsmobilez
5+ Year Member
Msg#: 3753930 posted 8:45 pm on Sep 28, 2008 (gmt 0)

Hi,

I was doing a test on my site: I created another folder with the same content and unfortunately submitted a sitemap for it.

I removed that sitemap after a few hours, but now when I search in Google I can see that Google has crawled those pages as well. Does Google give any penalty for this, or does it ignore it?

Let's say I have two folders. The real one is

example.com/articles

and the test was

example.com/posts

I have changed the posts folder and all files under it to return 404. Will Google penalize me for this?

And what can I do to remove it quickly from Google? Do I have to worry about it? Will it affect my site's ranking?

[edited by: Robert_Charlton at 9:03 pm (utc) on Sep. 28, 2008]
[edit reason] changed to example.com; it can never be owned [/edit]

 

tedster
WebmasterWorld Senior Member / Top Contributor of All Time / 10+ Year Member
Msg#: 3753930 posted 11:48 pm on Sep 28, 2008 (gmt 0)

There is no "penalty" for duplication - but there can be serious confusion and filtered out urls. At this point Google might even send traffic to the folder that is now 404.

I'd suggest a robots.txt disallow rule for that accidental folder now, so that Google stops spending crawl budget on urls you don't want in the index - and in fact no longer have on the domain.
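
A minimal sketch of that rule, assuming the accidental folder is /posts as in your example:

# keep all crawlers out of the accidental test folder
User-agent: *
Disallow: /posts/

Since robots.txt matches by prefix, that one line covers every url inside the folder.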

fsmobilez
5+ Year Member
Msg#: 3753930 posted 12:03 am on Sep 29, 2008 (gmt 0)

Well, as far as I know Google doesn't remove the URLs that a robots.txt disallow rule is applied to, and I want these URLs to be removed.

If I use robots.txt, Google will definitely remove its cache - but what about the URLs themselves? The main issue is removing the URLs.

tedster
WebmasterWorld Senior Member / Top Contributor of All Time / 10+ Year Member
Msg#: 3753930 posted 2:38 am on Sep 29, 2008 (gmt 0)

First set up the robots.txt, and then send a url removal request based on your robots.txt. They'll be removed within a few days.

Marcia
WebmasterWorld Senior Member / Top Contributor of All Time / 10+ Year Member
Msg#: 3753930 posted 4:56 am on Sep 29, 2008 (gmt 0)

I've seen a page on a site completely filtered out of a search for a sentence taken directly from the page, in quotes. I believe the filtered-out site had been identified as a mirror site - and rightly so, in that case.

fsmobilez
5+ Year Member
Msg#: 3753930 posted 12:49 pm on Sep 29, 2008 (gmt 0)

>>>First set up the robots.txt and then send a url removal request based on your robots.txt.

Well, I can set up the robots.txt, but how do I send a URL removal request?

tedster
WebmasterWorld Senior Member / Top Contributor of All Time / 10+ Year Member
Msg#: 3753930 posted 6:11 pm on Sep 29, 2008 (gmt 0)

Through your Webmaster Tools account - if you don't have one, you can set it up in a minute or two.

fsmobilez
5+ Year Member
Msg#: 3753930 posted 6:34 pm on Sep 29, 2008 (gmt 0)

Well, I already have a Webmaster Tools account, but the problem is that there are many URLs and I can't submit them all manually.

If I try to remove the whole directory, it gives a "denied" error.

tedster
WebmasterWorld Senior Member / Top Contributor of All Time / 10+ Year Member
Msg#: 3753930 posted 6:37 pm on Sep 29, 2008 (gmt 0)

Don't submit them individually - use the second option, "Remove all files and subdirectories in a specific directory on your site from appearing in Google search results."

[edited by: tedster at 7:37 pm (utc) on Sep. 29, 2008]

fsmobilez
5+ Year Member
Msg#: 3753930 posted 7:05 pm on Sep 29, 2008 (gmt 0)

Is it necessary to add that directory to robots.txt? I just changed it to 404 and tried this option:

"Remove all files and subdirectories in a specific directory on your site from appearing in Google search results."

But it gives a "denied" error. Are you sure that adding this directory to robots.txt will let me remove the URLs using Webmaster Tools?

fsmobilez
5+ Year Member
Msg#: 3753930 posted 7:11 pm on Sep 29, 2008 (gmt 0)

Let's say I want to remove this folder:

www.example.com/messages

So I added this to my robots.txt file:

User-agent: Googlebot
Disallow: /messages

Is this OK?

And as the second step, I should make the request in Google Webmaster Tools with "Remove all files and subdirectories in a specific directory on your site from appearing in Google search results." and:

Directory URL: http://www.example.com/messages

All the pages are returning 404 as well - so have I done everything OK?

tedster
WebmasterWorld Senior Member / Top Contributor of All Time / 10+ Year Member
Msg#: 3753930 posted 7:12 pm on Sep 29, 2008 (gmt 0)

Yes, I'm sure - I've done it successfully and recently. I've been recommending the robots.txt approach throughout this thread. The 404 responses you introduced have complicated things, but the robots.txt should sort it out.

[edited by: tedster at 7:37 pm (utc) on Sep. 29, 2008]

fsmobilez
5+ Year Member
Msg#: 3753930 posted 7:33 pm on Sep 29, 2008 (gmt 0)

OK, thanks - I will report back on the results.

One more thing: I have some pages on my site which I can change to 404. If I add them to robots.txt and don't do anything else, will those pages be removed from Google as well?

Thanks for your help.

tedster
WebmasterWorld Senior Member / Top Contributor of All Time / 10+ Year Member
Msg#: 3753930 posted 7:37 pm on Sep 29, 2008 (gmt 0)

Eventually, yes. The removal request speeds things up. It also ensures that backlinks to those urls will not generate a "url-only" listing.

fsmobilez
5+ Year Member
Msg#: 3753930 posted 6:21 pm on Sep 30, 2008 (gmt 0)

Thanks a lot, tedster - Google removed those accidental URLs. Your forum is really great!

My best regards.

I have another question, if you can please help with that as well.

My site has been crawled both with www and without www. I have added a 301 permanent redirect and Google is removing the duplicate URLs, but it removes only 100 to 300 URLs each week and there are 3,000 still left. Is there any way to remove them faster?

And a second question: my site has been crawled with duplicate links. Let's say one is

www.example.com/demo/articles&jtype=1

and the second is

www.example.com/demo/articles

I have added a redirect to remove jtype, but it is too slow and taking too long to remove the jtype URLs.

Is there any quick way to handle both problems?

g1smd
WebmasterWorld Senior Member / Top Contributor of All Time / 10+ Year Member
Msg#: 3753930 posted 6:48 pm on Sep 30, 2008 (gmt 0)

Don't worry about the "jtype" URLs. Once the 301 redirect is in place Google knows that they will need to be dropped eventually. In the meantime, if those URLs do appear in the SERPs, they will still be sending traffic to your site, which your redirect will capture.

Likewise the non-www URLs will still send traffic where they appear in the SERPs and the redirect forces the correct URL. I would let Google remove them in their own time. It would be silly to remove the non-www when there isn't a full complement of www URLs in the SERPs to send that traffic through.
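
For the record, those two redirects look roughly like this in Apache .htaccess (a sketch only: the host name comes from the example above, and the second rule assumes the duplicates carry jtype as the final query-string parameter):

RewriteEngine On

# send non-www requests to the www host with a 301
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# 301 away a trailing jtype parameter, keeping the rest of the query string
RewriteCond %{QUERY_STRING} ^(.+)&jtype=[^&]+$
RewriteRule ^(.*)$ http://www.example.com/$1?%1 [R=301,L]

A URL that is wrong on both counts is corrected in two hops, one 301 per rule.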

fsmobilez
5+ Year Member
Msg#: 3753930 posted 11:14 am on Oct 1, 2008 (gmt 0)

The URLs which were created accidentally have been totally removed. Can I take that disallow entry out of the robots.txt file now?

g1smd
WebmasterWorld Senior Member / Top Contributor of All Time / 10+ Year Member
Msg#: 3753930 posted 12:44 pm on Oct 1, 2008 (gmt 0)

Note for next time: Put "test" content on a server that is password protected so that nothing can access it without your permission.

Setting up a password is very easy using Apache. You need just a couple of lines of code in the .htaccess and .htpasswd files.
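
A minimal sketch of that, assuming Apache with the standard auth modules (the paths and the user name here are only placeholders):

# in the .htaccess of the protected test folder
AuthType Basic
AuthName "Private test area"
AuthUserFile /full/path/to/.htpasswd
Require valid-user

Then create the password file and a user from the shell:

htpasswd -c /full/path/to/.htpasswd testuser

Anything a crawler requests in that folder now gets a 401 challenge instead of your content.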

fsmobilez
5+ Year Member
Msg#: 3753930 posted 2:22 pm on Oct 1, 2008 (gmt 0)

I think you didn't get my question.

I was asking whether it is OK to now remove from robots.txt the entries for the files I have already removed through Google Webmaster Tools - that is, whether I can take the rule out of the robots.txt file.

tedster
WebmasterWorld Senior Member / Top Contributor of All Time / 10+ Year Member
Msg#: 3753930 posted 3:10 pm on Oct 1, 2008 (gmt 0)

Technically, yes, you can remove the robots.txt and just let the urls return a 404. But as I said above, I'd leave the rules in robots.txt so that Google doesn't spend any of the crawl budget for your domain on requesting urls that aren't there and won't be there.

g1smd
WebmasterWorld Senior Member / Top Contributor of All Time / 10+ Year Member
Msg#: 3753930 posted 5:54 pm on Oct 1, 2008 (gmt 0)

"Crawl budget" seems to becoming more of an issue as time goes on.

I have seen a few instances of this recently, even on fairly small sites where Google doesn't want to eat it all.

fsmobilez
5+ Year Member
Msg#: 3753930 posted 6:47 am on Oct 2, 2008 (gmt 0)

Hi tedster,

The real problem has occurred now.

I have some files which are not in directory form, as I'm using a dynamic script. Example URLs look like this:

www.example.com/demo/top_emailed_jokes.php?cat_id=110&jtype=emailed

www.example.com/demo/top_ten_jokes.php?cat_id=46&jtype=ten

I have added URLs in Google Webmaster Tools like this:

www.example.com/demo/top_emailed_jokes.php/

www.example.com/demo/top_ten_jokes.php/

but it removed only the one URL which I submitted and left the other URLs as they are. So what should I do to remove all the URLs?

[edited by: Receptional_Andy at 8:39 am (utc) on Oct. 3, 2008]
[edit reason] Please use example.com - it can never be owned [/edit]

tez899
5+ Year Member
Msg#: 3753930 posted 12:39 am on Oct 3, 2008 (gmt 0)

I'm generalising, but a 404 header tells Google that the page no longer exists. Your new sitemap, or the crawl pattern, will reflect the 404 errors - e.g. WMT will display errors - but to be honest, WMT even tells you that if you intended the errors to happen, you don't need to change anything. (You intended the errors to happen.)

What more is there to say? Just wait for the re-crawl and it will fix itself.

g1smd
WebmasterWorld Senior Member / Top Contributor of All Time / 10+ Year Member
Msg#: 3753930 posted 12:49 am on Oct 3, 2008 (gmt 0)

If you have a lot of 404 URLs, Google might spend all its time pulling those and have no budget left to actually look at your real content pages.

That is an issue, and it is why the robots.txt solution was suggested.
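
For script URLs like the ones in question, the rules would be something along these lines - a sketch using the paths from the examples above; robots.txt matches by prefix, so each line also covers every ?cat_id=...&jtype=... variant of that script:

User-agent: Googlebot
# prefix match: blocks these scripts and all their query-string variants
Disallow: /demo/top_emailed_jokes.php
Disallow: /demo/top_ten_jokes.php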

tez899
5+ Year Member
Msg#: 3753930 posted 4:08 am on Oct 3, 2008 (gmt 0)

Google with a budget? I'm confused...

fsmobilez - return a 404 error for the PHP parameters you don't want indexed.

A re-crawl and fix isn't going to happen overnight, but for any test websites you work on in the future, an important tag is:

<meta name="robots" content="noindex">

If the 404 doesn't work (though it should), place that tag in the head section of the pages in question. That should work if all else fails.
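
A sketch of that 404 suggestion in PHP, assuming the unwanted variants are the ones carrying a jtype parameter (the parameter name is taken from the example urls earlier in the thread):

<?php
// send a real 404 for the duplicate jtype variants so they drop out of the index
if (isset($_GET['jtype'])) {
    header('HTTP/1.1 404 Not Found');
    exit; // stop before any page content is sent
}
?>

This has to run before the script produces any output, or the header() call will fail.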

tedster
WebmasterWorld Senior Member / Top Contributor of All Time / 10+ Year Member
Msg#: 3753930 posted 4:59 am on Oct 3, 2008 (gmt 0)

The Google crawl team has an algorithm all their own. It assigns the urls to be crawled, and it also assigns a "crawl budget" that googlebot can spend on a given site.

If that crawl budget gets used up spidering essentially unimportant urls (and a pile of 404s for accidental urls is pretty unimportant), then less budget remains to crawl the rest of the site more frequently and deeply.

This is also a reason why server response time, though probably not directly part of the ranking algorithm, can affect a website's performance in Google Search. It's also a reason why responding to the If-Modified-Since header with a 304 when appropriate matters, as does using file compression such as mod_gzip on Apache.

So in this particular case, I recommended the robots.txt disallow rule. Once spidered, that will stop googlebot from spending any more cycles on trying to get those accidentally exposed urls.
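
To illustrate the If-Modified-Since point, a minimal PHP sketch - using the script's own file timestamp as the last-modified date is just an assumption for the example; a real page would use the date its content last changed:

<?php
// answer conditional requests with 304 Not Modified so googlebot
// doesn't spend crawl budget re-downloading unchanged pages
$lastModified = filemtime(__FILE__); // placeholder modification date
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');

if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) &&
    strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $lastModified) {
    header('HTTP/1.1 304 Not Modified');
    exit; // headers only, no body
}
?>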

tez899
5+ Year Member
Msg#: 3753930 posted 6:14 am on Oct 3, 2008 (gmt 0)

Thanks Ted, but what tells Google that one site is worth the "crawl budget" and another isn't? A manual review? How can a bot allocate time and space to a site that isn't worth it when it could be spending them on one that is?

tedster
WebmasterWorld Senior Member / Top Contributor of All Time / 10+ Year Member
Msg#: 3753930 posted 6:53 am on Oct 3, 2008 (gmt 0)

It's algorithmic, and of course Google won't share the exact details publicly. But if you think about what the spidering job is all about (and it's a huge job), some obvious factors jump out: trust (naturally), the number of impressions in recent search results, and the average churn of the site, to name just a few.

My point, especially for this thread, is that each site does have a crawl budget. It is being adjusted continually for all kinds of reasons (including Google's internal needs) - but whatever your site's allotted crawl budget is in any given cycle, you don't want to squander it.

g1smd
WebmasterWorld Senior Member / Top Contributor of All Time / 10+ Year Member
Msg#: 3753930 posted 8:11 am on Oct 3, 2008 (gmt 0)

Google has "discovered" hundreds of millions of URLs.

Very many of those lead to error pages or are duplicates or are junk.

They cannot crawl all of them so they prioritise them. If you have stuff that they don't need to spend time on, block them from spending time on them.

fsmobilez
5+ Year Member
Msg#: 3753930 posted 8:54 am on Oct 3, 2008 (gmt 0)

I appreciate your replies, but you didn't get the question I actually wanted answered.

This time the question was not about the accidental pages - those have been removed by the method you told me.

It was for another site, so instead of starting a new thread I carried on with this one.

First of all, regarding robots.txt: I had tried it already, but after one month the URLs were still not removed. Their cached pages were removed, but the URLs themselves were still in Google, and on another forum someone told me that Google will keep these URLs forever until I remove the disallow from robots.txt and change them to 404 errors.

So I changed the pages to 404.

You told me to use robots.txt plus the quick removal tool in WMT, which was very effective. I wonder whether I can do the same for this dynamic site to remove the pages quickly.

I have some files which are not in directory form, as I'm using a dynamic script. Example URLs look like this:

www.example.com/demo/top_emailed_jokes.php?cat_id=110&jtype=emailed

www.example.com/demo/top_ten_jokes.php?cat_id=46&jtype=ten

I have added URLs in Google Webmaster Tools like this:

http://www.example.com/sms/top_emailed_jokes.php/

www.example.com/demo/top_ten_jokes.php/

but it removed only the one URL which I submitted and left the other URLs as they are. So what should I do to remove all the URLs?

[edited by: tedster at 5:03 pm (utc) on Oct. 3, 2008]
[edit reason] switched to example.com [/edit]
