

Test pages added in Google as duplicates

   
8:45 pm on Sep 28, 2008 (gmt 0)

5+ Year Member



Hi

I was running a test on my site: I created another folder with the same content and, unfortunately, submitted a sitemap for it.

I removed that sitemap after a few hours, but now when I search in Google I can see that Google has crawled those pages as well. I don't know whether Google applies a penalty for this or just ignores it.

Let's say I have two folders. The live one is

example.com/articles

and the test one was

example.com/posts

I have changed the posts folder and all the files under it to return 404. Will Google penalize me for this?

And what can I do to remove those pages from Google quickly?

Do I have to worry about this? Will it affect my site's ranking?

[edited by: Robert_Charlton at 9:03 pm (utc) on Sep. 28, 2008]
[edit reason] changed to example.com; it can never be owned [/edit]

11:48 pm on Sep 28, 2008 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



There is no "penalty" for duplication - but there can be serious confusion and filtered out urls. At this point Google might even send traffic to the folder that is now 404.

I'd suggest a robots.txt disallow rule for that accidental folder now, so that Google stops spending crawl budget on urls you don't want in the index - and in fact no longer have on the domain.
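Something like this in robots.txt would do it - a minimal sketch, assuming the test folder really is /posts as in your example:

User-agent: Googlebot
Disallow: /posts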

12:03 am on Sep 29, 2008 (gmt 0)

5+ Year Member



Well, as far as I know Google doesn't remove URLs that are merely disallowed in robots.txt, and I want these URLs removed.

If I use robots.txt, Google will certainly remove its cache, but what about the URLs themselves? The main issue is removing the URLs.

2:38 am on Sep 29, 2008 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



First set up the robots.txt and then send a url removal request based on your robots.txt. They'll be removed within a few days.
4:56 am on Sep 29, 2008 (gmt 0)

WebmasterWorld Senior Member marcia is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I've seen a page on a site completely filtered out of a search for a sentence taken directly from the page, in quotes. I believe the filtered out site has been identified as a mirror site - and rightly so, in that case.
12:49 pm on Sep 29, 2008 (gmt 0)

5+ Year Member



>>>First set up the robots.txt and then send a url removal request based on your robots.txt.

Well, I can set up the robots.txt, but how can I send a url removal request?

6:11 pm on Sep 29, 2008 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Through your Webmaster Tools account - if you don't have one, you can set it up in a minute or two.
6:34 pm on Sep 29, 2008 (gmt 0)

5+ Year Member



Well, I already have a Webmaster Tools account, but the problem is that there are many URLs and I can't submit them all manually.

When I try to remove the whole directory, it gives a "denied" error.

6:37 pm on Sep 29, 2008 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Don't submit them individually, use the second option "Remove all files and subdirectories in a specific directory on your site from appearing in Google search results."

[edited by: tedster at 7:37 pm (utc) on Sep. 29, 2008]

7:05 pm on Sep 29, 2008 (gmt 0)

5+ Year Member



Is it necessary to add that directory to robots.txt? I just changed it to 404 and tried this option:

"Remove all files and subdirectories in a specific directory on your site from appearing in Google search results."

but it gives a "denied" error. Are you sure adding this directory to robots.txt will let me remove the URLs using Webmaster Tools?

7:11 pm on Sep 29, 2008 (gmt 0)

5+ Year Member



Let's say I want to remove this folder:

www.example.com/messages

So I added this to my robots.txt file:

User-Agent: Googlebot
Disallow: /messages

Is this OK?

And as a second step, I should make a request in Google's tools using this option:

"Remove all files and subdirectories in a specific directory on your site from appearing in Google search results."

Directory URL: http://www.example.com/messages

All the pages return 404 as well. Have I done everything correctly?

7:12 pm on Sep 29, 2008 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Yes, I'm sure - I've done it successfully and recently. I've been recommending the robots.txt approach throughout this thread. The 404 responses you introduced have complicated things, but the robots.txt should sort it out.

[edited by: tedster at 7:37 pm (utc) on Sep. 29, 2008]

7:33 pm on Sep 29, 2008 (gmt 0)

5+ Year Member



OK, thanks. I will report back with the results.

One more thing:

I have some pages on my site which I can change to 404.

If I add them to robots.txt and don't do anything else, will those pages be removed from Google as well?

Thanks for your help.

7:37 pm on Sep 29, 2008 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Eventually, yes. The removal request speeds things up. It also ensures that backlinks to those urls will not generate a "url-only" listing.
6:21 pm on Sep 30, 2008 (gmt 0)

5+ Year Member



Thanks a lot, tedster. Google removed those accidental URLs. Your forum is really great!

My best regards.

I have another question, if you can please help with that as well.

My site has been crawled both with www and without www.

I have added a 301 permanent redirect, and Google is removing the non-www URLs, but it is only removing 100 to 300 URLs a week and there are 3,000 still left. Is there any way to remove them faster?

And a second question: my site has been crawled with duplicate links.

Let's say one is

www.example.com/demo/articles&jtype=1

and the second is

www.example.com/demo/articles

I have added a redirect to remove jtype, but it is too slow and the jtype URLs are taking too long to drop out.

Is there a quick way to handle both problems?

6:48 pm on Sep 30, 2008 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Don't worry about the "jtype" URLs. Once the 301 redirect is in place Google knows that they will need to be dropped eventually. In the meantime, if those URLs do appear in the SERPs, they will still be sending traffic to your site, which your redirect will capture.

Likewise the non-www URLs will still send traffic where they appear in the SERPs and the redirect forces the correct URL. I would let Google remove them in their own time. It would be silly to remove the non-www when there isn't a full complement of www URLs in the SERPs to send that traffic through.
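For reference, the non-www redirect in Apache .htaccess usually looks something like this - a sketch only, assuming mod_rewrite is available and that www is your canonical hostname:

RewriteEngine On
# 301-redirect any non-www request to the www hostname
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

And for the jtype duplicates, assuming the parameter arrives as an ordinary query string on /demo/articles, the idea is:

# 301-redirect /demo/articles?jtype=... to the clean URL (the trailing ? drops the query string)
RewriteCond %{QUERY_STRING} (^|&)jtype=
RewriteRule ^demo/articles$ /demo/articles? [R=301,L]

Dropping only jtype while keeping other parameters such as cat_id would need a more careful rule.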

11:14 am on Oct 1, 2008 (gmt 0)

5+ Year Member



The URLs which were created accidentally are totally removed. I want to take that rule out of robots.txt now - can I remove it from the robots.txt file?
12:44 pm on Oct 1, 2008 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Note for next time: Put "test" content on a server that is password protected so that nothing can access it without your permission.

Setting up a password is very easy using Apache. You need just a couple of lines of code in the .htaccess and .htpasswd files.
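A minimal sketch, assuming Apache and that the password file sits outside the web root (the paths and username here are just placeholders):

.htaccess in the test folder:

AuthType Basic
AuthName "Private test area"
AuthUserFile /home/user/.htpasswd
Require valid-user

Then generate the password file once with Apache's htpasswd utility:

htpasswd -c /home/user/.htpasswd testuser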

2:22 pm on Oct 1, 2008 (gmt 0)

5+ Year Member



I think you didn't get my question.

I asked whether it's OK to remove from robots.txt the entries for the files I have already removed via Google Webmaster Tools.

That is, can I take those lines out of the robots file?

3:10 pm on Oct 1, 2008 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Technically, yes, you can remove the robots.txt and just let the urls return a 404. But as I said above, I'd leave the rules in robots.txt so that Google doesn't spend any of the crawl budget for your domain on requesting urls that aren't there and won't be there.
5:54 pm on Oct 1, 2008 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



"Crawl budget" seems to be becoming more of an issue as time goes on.

I have seen a few instances of this recently, even on fairly small sites where Google doesn't want to crawl everything.

6:47 am on Oct 2, 2008 (gmt 0)

5+ Year Member



Hi tedster

The real problem has appeared now.

I have some files which are not in directory form, as I'm using a dynamic script.

Example URLs look like this:

www.example.com/demo/top_emailed_jokes.php?cat_id=110&jtype=emailed

www.example.com/demo/top_ten_jokes.php?cat_id=46&jtype=ten

I added URLs in Google Webmaster Tools like this:

www.example.com/demo/top_emailed_jokes.php/

www.example.com/demo/top_ten_jokes.php/

but it removed only the one URL I submitted and left the other URLs (with parameters) as they were. So what should I do to remove all the URLs?

[edited by: Receptional_Andy at 8:39 am (utc) on Oct. 3, 2008]
[edit reason] Please use example.com - it can never be owned [/edit]

12:39 am on Oct 3, 2008 (gmt 0)

5+ Year Member



I'm generalising, but a 404 header tells Google that the page no longer exists. Your new sitemap and crawl pattern will reflect the 404 errors - e.g. WMT will display errors - but to be honest, it even tells you that if you intended the errors to happen, you don't need to change anything. (You intended the errors to happen.)

What more is there to say? Just wait for the re-crawl and it'll fix itself.

12:49 am on Oct 3, 2008 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



If you have a lot of 404 URLs, Google might spend all their time pulling those, and leave no budget to actually look at your real content pages.

That is an issue, and it's why the robots.txt solution was suggested.

4:08 am on Oct 3, 2008 (gmt 0)

5+ Year Member



Google with a budget? I'm confused...

Fsmobilez - return a 404 error for the PHP parameters you don't want indexed.

A re-crawl and fix isn't going to happen overnight, but in the future, for any test websites you're working on, an important tag is:

<meta name="robots" content="noindex">

If the 404 doesn't work (though it will), place that code in the header tags of the pages you want removed. That will work if all else fails.
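A minimal PHP sketch of the 404 idea - hypothetical, assuming the unwanted jtype parameter from earlier in this thread:

<?php
// Hypothetical sketch: answer requests carrying the unwanted
// "jtype" parameter with a 404 instead of serving duplicate content
if (isset($_GET['jtype'])) {
    header('HTTP/1.0 404 Not Found');
    exit;
}
// ...normal page output continues here...
?>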

4:59 am on Oct 3, 2008 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



The Google crawl team has an algorithm all their own. It assigns the urls to be crawled, and it also assigns a "crawl budget" that googlebot can spend on a given site.

If that crawl budget gets used up spidering essentially unimportant urls (and a pile of 404s for accidental urls is pretty unimportant), then there's less budget left to crawl the rest of the site frequently and deeply.

This is also a reason why server response time, though probably not directly part of the ranking algorithm, can affect a website's performance in Google Search. It's also a reason why responding to the If-Modified-Since header with 304's when appropriate matters, as well as using file compression, such as mod_gzip on Apache.
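For instance, a conditional-GET sketch in PHP - hypothetical, assuming the script file's own mtime is a reasonable change marker:

<?php
// Hypothetical sketch: honour If-Modified-Since with a 304 when nothing changed
$lastModified = filemtime(__FILE__);
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');

if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) &&
    strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $lastModified) {
    header('HTTP/1.0 304 Not Modified');
    exit;
}
?>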

So in this particular case, I recommended the robots.txt disallow rule. Once spidered, that will stop googlebot from spending any more cycles on trying to get those accidentally exposed urls.

6:14 am on Oct 3, 2008 (gmt 0)

5+ Year Member



Thanks Ted, but what tells Google that one site is worth the "crawl budget" and another isn't? A manual review? How can a bot allocate time and space to a site that isn't worth it, when that budget could be doing a ton of work elsewhere?
6:53 am on Oct 3, 2008 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



It's algorithmic, and of course Google won't share the exact details publicly. But if you think about what the spidering job is all about (and it's a huge job), some obvious factors jump out - trust (naturally), the number of impressions in recent search results, and the average churn of the site, to name just a few.

My point, especially for this thread, is that each site does have a crawl budget. It is being adjusted continually for all kinds of reasons (including Google's internal needs) - but whatever your site's allotted crawl budget is in any cycle, you don't want to squander it.

8:11 am on Oct 3, 2008 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Google has "discovered" hundreds of millions of URLs.

Very many of those lead to error pages or are duplicates or are junk.

They cannot crawl all of them, so they prioritise. If you have stuff that they don't need to spend time on, block them from wasting time on it.

8:54 am on Oct 3, 2008 (gmt 0)

5+ Year Member



I appreciate your replies, but you didn't get the question I was actually asking.

This time the question was not about the accidental pages - those have been removed by the method you told me.

It was for another site, so instead of starting a new thread I carried on with this one.

First of all, about your robots.txt answer: I had already tried it, but after one month the URLs were still not removed. Their cached pages were removed, but the URLs were still in Google, and on another forum someone told me that Google will keep those URLs forever until I remove the disallow from robots.txt and change them to return 404.

So I changed the pages to 404.

You told me to use robots.txt plus the quick removal tool in WMT, which was very effective.

I wonder if I can do the same for this dynamic site, to remove the pages quickly.

I have some files which are not in directory form, as I'm using a dynamic script.

Example URLs look like this:

www.example.com/demo/top_emailed_jokes.php?cat_id=110&jtype=emailed

www.example.com/demo/top_ten_jokes.php?cat_id=46&jtype=ten

I added URLs in Google Webmaster Tools like this:

http://www.example.com/sms/top_emailed_jokes.php/

www.example.com/demo/top_ten_jokes.php/

but it removed only the one URL I submitted and left the other URLs (with parameters) as they were. So what should I do to remove all the URLs?

[edited by: tedster at 5:03 pm (utc) on Oct. 3, 2008]
[edit reason] switched to example.com [/edit]