Test pages added in Google as duplicates

8:45 pm on Sep 28, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:Apr 22, 2008
posts: 151
votes: 0


Hi

I was running a test on my site: I created another folder with the same content and unfortunately ran a sitemap for it.

I removed that sitemap after a few hours, but now when I search in Google I can see that Google has crawled those pages as well. I don't know whether Google gives any penalty for this or just ignores it.

Let's say I have two folders. The real one is

example.com/articles

and the test one was

example.com/posts

I have changed the posts folder and all the files under it to return 404. Will Google penalize me for this?

And what can I do to remove those pages from Google quickly? Do I need to worry about this affecting my site's rankings?

[edited by: Robert_Charlton at 9:03 pm (utc) on Sep. 28, 2008]
[edit reason] changed to example.com; it can never be owned [/edit]

11:48 pm on Sept 28, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


There is no "penalty" for duplication - but there can be serious confusion and filtered out urls. At this point Google might even send traffic to the folder that is now 404.

I'd suggest a robots.txt disallow rule for that accidental folder now, so that Google stops spending crawl budget on urls you don't want in the index - and in fact no longer have on the domain.
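For example, if the accidental folder were the /posts directory mentioned above, a rule like this would do it (a minimal sketch - the folder name is just the example from this thread):

User-agent: *
Disallow: /posts/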

12:03 am on Sept 29, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:Apr 22, 2008
posts: 151
votes: 0


Well, as far as I know Google doesn't remove the URLs that a robots.txt disallow rule applies to, and I want these URLs removed.

If I use robots.txt, Google will definitely remove its cache of them, but what about the URLs themselves? The main issue is removing the URLs.

2:38 am on Sept 29, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


First set up the robots.txt, then send a URL removal request based on it. The URLs will be removed within a few days.

4:56 am on Sept 29, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member marcia is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Sept 29, 2000
posts:12095
votes: 0


I've seen a page on a site completely filtered out of a search for a sentence taken directly from the page, in quotes. I believe the filtered-out site had been identified as a mirror site - and rightly so, in that case.

12:49 pm on Sept 29, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:Apr 22, 2008
posts: 151
votes: 0


>>>First set up the robots.txt and then send a url removal request based on your robots.txt.

Well, I can set up the robots.txt, but how do I send a URL removal request?

6:11 pm on Sept 29, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Through your Webmaster Tools account - if you don't have one, you can set it up in a minute or two.

6:34 pm on Sept 29, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:Apr 22, 2008
posts: 151
votes: 0


Well, I already have a Webmaster Tools account, but the problem is that there are many URLs and I can't submit them all manually.

If I try to remove the whole directory, it gives a "denied" error.

6:37 pm on Sept 29, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Don't submit them individually, use the second option "Remove all files and subdirectories in a specific directory on your site from appearing in Google search results."

[edited by: tedster at 7:37 pm (utc) on Sep. 29, 2008]

7:05 pm on Sept 29, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:Apr 22, 2008
posts: 151
votes: 0


Is it necessary to add that directory to robots.txt? I just changed it to 404 and tried this option:

"Remove all files and subdirectories in a specific directory on your site from appearing in Google search results."

but it gives a "denied" error. Are you sure that adding this directory to robots.txt will let me remove the URLs using Webmaster Tools?

7:11 pm on Sept 29, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:Apr 22, 2008
posts: 151
votes: 0


Let's say I want to remove this folder:

www.example.com/messages

so I added this to the robots.txt file:

User-agent: Googlebot
Disallow: /messages

Is this OK?

And as the second step, I should make the request in Webmaster Tools using this option:

"Remove all files and subdirectories in a specific directory on your site from appearing in Google search results."

Directory URL: http://www.example.com/messages

All the pages return 404 as well, so have I done everything correctly?

7:12 pm on Sept 29, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Yes, I'm sure - I've done it successfully and recently. I've been recommending the robots.txt approach throughout this thread. The 404 responses you introduced have complicated things, but the robots.txt should sort it out.

[edited by: tedster at 7:37 pm (utc) on Sep. 29, 2008]

7:33 pm on Sept 29, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:Apr 22, 2008
posts: 151
votes: 0


OK, thanks. I will report back with the results.

One more thing: I have some other pages on my site which I can change to 404.

If I add them to robots.txt and don't do anything else, will those pages be removed from Google as well?

Thanks for your help.

7:37 pm on Sept 29, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Eventually, yes. The removal request speeds things up. It also ensures that backlinks to those urls will not generate a "url-only" listing.

6:21 pm on Sept 30, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:Apr 22, 2008
posts: 151
votes: 0


Thanks a lot, tedster - Google removed those accidental URLs. Your forum is really great!

My best regards.

I have another question, if you can please help with that as well.

My site is crawled both with www and without www in the domain.

I have added a 301 permanent redirect and Google is removing the non-www URLs, but it is only removing 100 to 300 URLs a week and there are 3,000 still left. Is there any way to remove them faster?

And a second question: my site is crawled with duplicate links.

Let's say one is

www.example.com/demo/articles&jtype=1

and the second is

www.example.com/demo/articles

I have added a redirect to remove jtype, but it is slow and the jtype URLs are taking too long to drop out.

Is there a quick way to handle both problems?

6:48 pm on Sept 30, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Don't worry about the "jtype" URLs. Once the 301 redirect is in place Google knows that they will need to be dropped eventually. In the meantime, if those URLs do appear in the SERPs, they will still be sending traffic to your site, which your redirect will capture.

Likewise the non-www URLs will still send traffic where they appear in the SERPs and the redirect forces the correct URL. I would let Google remove them in their own time. It would be silly to remove the non-www when there isn't a full complement of www URLs in the SERPs to send that traffic through.
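For reference, a non-www to www 301 redirect of the kind described here takes only a few lines of Apache mod_rewrite in .htaccess (a minimal sketch, assuming Apache with mod_rewrite enabled; example.com stands in for the real domain):

RewriteEngine On
# Permanently redirect any request for the bare domain to the www hostname
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]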

11:14 am on Oct 1, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:Apr 22, 2008
posts: 151
votes: 0


The URLs which were created accidentally are now totally removed. I want to take that rule out of robots.txt - can I remove it from the robots.txt file now?

12:44 pm on Oct 1, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Note for next time: Put "test" content on a server that is password protected so that nothing can access it without your permission.

Setting up a password is very easy using Apache. You need just a couple of lines of code in the .htaccess and .htpasswd files.
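For example (a minimal sketch - the .htpasswd path is hypothetical, and the password file itself is created with Apache's htpasswd utility):

# .htaccess in the test directory
AuthType Basic
AuthName "Test area"
AuthUserFile /path/to/.htpasswd
Require valid-user

The matching password entry is generated once on the server with: htpasswd -c /path/to/.htpasswd username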

2:22 pm on Oct 1, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:Apr 22, 2008
posts: 151
votes: 0


I think you didn't get my question.

I asked whether it is OK to remove those entries from robots.txt now that the URLs have been removed through Google Webmaster Tools.

That is, can I remove them from the robots file?

3:10 pm on Oct 1, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Technically, yes, you can remove the robots.txt and just let the urls return a 404. But as I said above, I'd leave the rules in robots.txt so that Google doesn't spend any of the crawl budget for your domain on requesting urls that aren't there and won't be there.

5:54 pm on Oct 1, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


"Crawl budget" seems to becoming more of an issue as time goes on.

I have seen a few instances of this recently, even on fairly small sites where Google doesn't want to eat it all.

6:47 am on Oct 2, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:Apr 22, 2008
posts: 151
votes: 0


Hi tedster

Now the real problem has occurred.

I have some files which are not in directory form, as I'm using a dynamic script.

Example URLs look like this:

www.example.com/demo/top_emailed_jokes.php?cat_id=110&jtype=emailed

www.example.com/demo/top_ten_jokes.php?cat_id=46&jtype=ten

I have added URLs in Google Webmaster Tools like this:

www.example.com/demo/top_emailed_jokes.php/

www.example.com/demo/top_ten_jokes.php/

but it removed only the one URL I submitted and left the other URLs as they are, so what should I do to remove all the URLs?

[edited by: Receptional_Andy at 8:39 am (utc) on Oct. 3, 2008]
[edit reason] Please use example.com - it can never be owned [/edit]

12:39 am on Oct 3, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:June 28, 2008
posts:51
votes: 0


I'm generalising, but a 404 header tells Google that the page no longer exists. Your new sitemap, or crawl pattern, will reflect the 404 errors - e.g. WMT will display the errors - but to be honest, it even tells you that if you intended the errors to happen, you don't need to change anything. (You did intend the errors to happen.)

What more is there to say? Just wait for the re-crawl and it will fix itself.

12:49 am on Oct 3, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


If you have a lot of 404 URLs, Google might spend all their time pulling those, and leave no budget to actually look at your real content pages.

That is an issue, and it is why the robots.txt solution was suggested.

4:08 am on Oct 3, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:June 28, 2008
posts:51
votes: 0


Google with a budget? I'm confused.

Fsmobilez - return a 404 error for the PHP parameters you don't want indexed.
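A minimal sketch of that idea in PHP - the jtype parameter name is taken from the URLs above, and the check is hypothetical, so adapt it to whichever parameters you want gone:

<?php
// Return a real 404 for requests carrying the unwanted parameter,
// so Google drops these duplicate URLs from the index.
// This must run before the script sends any output.
if (isset($_GET['jtype'])) {
    header('HTTP/1.0 404 Not Found');
    exit;
}
?>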

A re-crawl and fix isn't going to happen overnight, but in the future, for any test websites you're working on, an important tag is:

<meta name="robots" content="noindex">

If the 404 doesn't work (though it will), place that tag in the header of your checkout.php. That would work 100% of the time if all else fails.

4:59 am on Oct 3, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


The Google crawl team has an algorithm all their own. It assigns the urls to be crawled, and it also assigns a "crawl budget" that googlebot can spend on a given site.

If that crawl budget gets used up spidering essentially unimportant URLs (and a pile of 404s for accidental URLs is pretty unimportant), then there is less budget left to crawl the rest of the site frequently and deeply.

This is also a reason why server response time, though probably not directly part of the ranking algorithm, can affect a website's performance in Google Search. It's also a reason why it matters to respond to the If-Modified-Since header with a 304 when appropriate, and to use file compression, such as mod_gzip on Apache.
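Compression, for instance, takes only a couple of lines of Apache configuration. A minimal sketch, assuming Apache 2.x, where mod_deflate has replaced mod_gzip:

<IfModule mod_deflate.c>
    # Compress common text responses before they leave the server
    AddOutputFilterByType DEFLATE text/html text/plain text/css application/javascript
</IfModule>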

So in this particular case, I recommended the robots.txt disallow rule. Once spidered, that will stop googlebot from spending any more cycles on trying to get those accidentally exposed urls.

6:14 am on Oct 3, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:June 28, 2008
posts:51
votes: 0


Thanks Ted, but what tells Google that one site is worth the crawl budget and another isn't? A manual review? How can a bot allocate time and space to a site that isn't worth it, when it could be spending them where they would pay off?

6:53 am on Oct 3, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


It's algorithmic, and of course Google won't share the exact details publicly. But if you think about what the spidering job is all about (and it's a huge job), some obvious factors jump out - trust (naturally), the number of impressions in the recent search results, and the average churn of the site, to name just a few.

My point, especially for this thread, is that each site does have a crawl budget. It is being adjusted continually for all kinds of reasons (including Google's internal needs) - but whatever your site's allotted crawl budget is in any cycle, you don't want to squander it.

8:11 am on Oct 3, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Google has "discovered" hundreds of millions of URLs.

Very many of those lead to error pages or are duplicates or are junk.

They cannot crawl all of them, so they prioritise. If you have stuff that they don't need to spend time on, block them from wasting time on it.

8:54 am on Oct 3, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:Apr 22, 2008
posts: 151
votes: 0


I appreciate your replies, but you're not getting the question I actually want answered.

This time the question was not about the accidental pages - those have been removed by the method you told me.

It is for another site, so instead of starting a new thread I carried on with this one.

First, about your robots.txt answer: I had already tried it, but after one month the URLs were still not removed. Their cached pages were gone, but the URLs were still in Google, and on another forum someone told me that Google will keep these URLs forever until I remove the disallow from robots.txt and change the pages to return 404.

So I changed the pages to 404.

You told me to use robots.txt plus the quick removal tool in WMT, which was very effective.

I wonder if I can do the same for this dynamic site to remove the pages quickly.

I have some files which are not in directory form, as I'm using a dynamic script.

Example URLs look like this:

www.example.com/demo/top_emailed_jokes.php?cat_id=110&jtype=emailed

www.example.com/demo/top_ten_jokes.php?cat_id=46&jtype=ten

I have added URLs in Google Webmaster Tools like this:

http://www.example.com/sms/top_emailed_jokes.php/

www.example.com/demo/top_ten_jokes.php/

but it removed only the one URL I submitted and left the other URLs as they are, so what should I do to remove all the URLs?

[edited by: tedster at 5:03 pm (utc) on Oct. 3, 2008]
[edit reason] switched to example.com [/edit]

 
