Web Spam? Where Is It Coming From? HELP! - Google Search and SEO forum at WebmasterWorld - WebmasterWorld

Forum Moderators: Robert Charlton & goodroi

Web Spam? Where Is It Coming From? HELP!

Frost_Angel

5:54 pm on Feb 13, 2012 (gmt 0)

I made the HUGE mistake of installing a search engine script on my site. It was supposed to help visitors find jobs. However, I started realizing that all the searches that people were doing using this script were being cached by Google and put into the SERPS. There were thousands of these pages in the SERPS. All thin content, aggregated content.

I was web spamming before I realized I was doing it.

I immediately removed the script. I noindexed the folder and blocked it in robots.txt to keep search engines out. I started this process 2 months ago, and STILL every day new searches are showing up in GWT.

Where are they coming from? The script is GONE. The Database is GONE. Where are these new searches coming from? Tons of them.

I'm at a loss and thought, as embarrassing as this is and as inept as I sound, I'll ask here at WW. I do *not* profess to be some kinda SEO/Website genius. That's WHY I come here.

If you can offer any insight, suggestions or help, you'd be saving this chick a lot of future grief - and maybe I could keep my blonde hair and not pull it all out! LOL

Thanks in advance.

seoskunk

8:08 pm on Feb 13, 2012 (gmt 0)

Are you serving a 404 page where the script used to be?

Also, I had the same thing with a search function: after I blocked it in robots.txt, searches kept showing up for ages.

netmeg

9:04 pm on Feb 13, 2012 (gmt 0)

Serve up a 410?

Frost_Angel

9:05 pm on Feb 13, 2012 (gmt 0)

@seoskunk

We've had it blocked for months. Just two days ago we decided to unblock and 404 it. We were just trying to see where it's coming from.

Are you saying it's best to just continue blocking it and wait it out? How long did it take for you guys to see the stuff go away?

Thanks.

Frost_Angel

9:20 pm on Feb 13, 2012 (gmt 0)

@netmeg

Won't a 410 give us pretty much a 404? Or does the search engine spider/algo really treat this differently? I ask because I don't really know.

Thanks.

seoskunk

9:31 pm on Feb 13, 2012 (gmt 0)

OK, now don't laugh, but I still have them in WMT after six months of blocking via robots.txt. About a week ago I too removed the script (in this case a Magento search) and 404'd it. Now I have 10,000 404 result pages, but at least I feel they're on the way out of the index. So I think 404 or 410 is the way to go.
Funny thing is, none of these queries appeared in the SERPs, and it was only when I blocked them in robots.txt that they started coming up. Maybe they are supplemental pages, but six months on, robots.txt has failed to remove them. Hope that helps.

HuskyPup

9:40 pm on Feb 13, 2012 (gmt 0)

It took me 3+ months with robots.txt to stop Google serving up ONE Coppermine error page in the SERPs...sincerely good luck with your challenge.

anteck

9:49 pm on Feb 13, 2012 (gmt 0)

Google pretty much ignore robots.txt - and they take AGES to drop a noindexed page.

Then there is Google Images. There is seemingly no way to stop Google indexing and displaying pictures from your own site to image-hungry visitors who only want to click 'save as'. Out of all the solutions 'given' by Google and others to stop Google image traffic, none have worked. I ended up coming up with a .htaccess solution to detect Google image search and deny visitors coming from there.

Google give instructions to stop indexing - but quite often they don't work. Matt Cutts says one thing, another Googler says something different. The Google forums are rife with people who have trouble getting stuff out of Google's index...

In short, Google wants your content, no matter what you tell them. Noindex will work a lot better than robots.txt, which is near useless these days. As for 404s in Webmaster Tools - we've all got them. Google want your deleted pages back! :)

Frost_Angel

10:08 pm on Feb 13, 2012 (gmt 0)

@anteck

Well that is discouraging I must say. Because for people like me - "trying to do the right thing" seems hopeless.

I understand the image frustration too. I pay for all the images on my site - yet people just come by and steal them. The plugins for WordPress that are supposed to copyright the images or help prevent this are heavy and slow your site down. So useless.

tedster

10:18 pm on Feb 13, 2012 (gmt 0)

"trying to do the right thing" seems hopeless.

The right thing really is not to allow any search engine ever to index URLs for site search results - but once you've let that particular genie out of the bottle, you can have a challenge making things better again.

It helps to keep this difference in mind: robots.txt means do not CRAWL - but the URL may still show up in the index with content that Google puts together from other sources. A noindex means "don't INDEX" -- but googlebot needs to crawl the page to even see the meta tag.

I know some people report that pages blocked with robots.txt are crawled anyway, but I find that situation to be very rare.
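
To make the crawl-versus-index distinction above concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `/search/` folder and the example.com URLs are assumptions for illustration, not the original poster's setup. It only shows what a well-behaved crawler is allowed to fetch; it says nothing about what a search engine keeps in its index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks a site-search folder (path is illustrative).
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/search/?q=jobs",
                "https://example.com/about.html"):
        print(url, "crawlable:", may_crawl(ROBOTS_TXT, "Googlebot", url))
```

If `may_crawl` returns False for a URL, a compliant crawler never requests that page, so it cannot read a noindex meta tag placed on it. That is why combining a robots.txt block with a noindex tag on the same URLs tends to work against itself.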

Frost_Angel

10:35 pm on Feb 13, 2012 (gmt 0)

@tedster

but once you've let that particular genie out of the bottle, you can have a challenge making things better again.

I agree. And I feel stupid for ever adding this search engine. Hindsight.

What do you think about netmeg's suggestion of a 410?

seoskunk

10:44 pm on Feb 13, 2012 (gmt 0)

I always thought Google liked 404s

[mattcutts.com...]

410 (Gone): The server returns this response when the requested resource has been permanently removed. It is similar to a 404 (Not found) code, but is sometimes used in the place of a 404 for resources that used to exist but no longer do. If the resource has permanently moved, you should use a 301 to specify the resource's new location.

Also found this [webmasterworld.com...]

I'd be very interested in tedster's opinion, as robots.txt doesn't work and 404 takes ages.

seoskunk

12:25 am on Feb 14, 2012 (gmt 0)

Frost_Angel, if you do decide to 410 (I'm trying it), this may be useful. I made the entire search directory return 410 with the following rule in .htaccess:

RedirectMatch 410 ^/catalogsearch/
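
Before deploying a rule like this, it can help to preview which URL paths the pattern would catch. The sketch below reuses the same regular expression in Python; the sample paths are made up for illustration. It assumes the rule is tested against the URL path only (no query string), which is how Apache's mod_alias documents `RedirectMatch`.

```python
import re

# Same pattern as the RedirectMatch rule; it is matched against the URL path.
GONE_PATTERN = re.compile(r"^/catalogsearch/")


def would_be_gone(url_path: str) -> bool:
    """Return True if a path falls under the pattern that is answered with 410."""
    return GONE_PATTERN.search(url_path) is not None


if __name__ == "__main__":
    # Hypothetical paths, not taken from a real site.
    for path in ("/catalogsearch/result/", "/catalog/shoes.html", "/catalogsearch"):
        print(path, "->", "410" if would_be_gone(path) else "unaffected")
```

Note that `/catalogsearch` without the trailing slash is not matched, because the pattern requires the slash.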

lucy24

3:15 am on Feb 14, 2012 (gmt 0)

Can we assume that the very first thing you did was go into the Parameters section of GWT and tell them to ignore any and all parameters that apply specifically to your Search page?

leadegroot

11:57 am on Feb 15, 2012 (gmt 0)

Have you used the WMT removal tool? You have it blocked in robots.txt, so it should work.

I have my development box here blocked in robots.txt - it's live on the web because sometimes I work away from home. A couple of times a year, results will turn up in the Google SERPs for a site:example.com check. Grrr... A WMT removal request always pulls them, and quite quickly, too.

aakk9999

1:13 pm on Feb 15, 2012 (gmt 0)

If you use 410 Gone instead of 404, Google will handle it a bit faster. In other words, if you return 410, Google will stop re-requesting the URL sooner, whereas with 404 it may keep requesting the URL a bit longer to see whether the 404 is there to stay or was a temporary "glitch" and the page returns.
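
Whichever status code is chosen, it is worth confirming what the old search URLs really return, since a removed script can still answer with a normal 200 page. The following is a rough sketch using only the Python standard library; `fetch_status` and `describe` are hypothetical helper names, and the wording of the labels is an assumption rather than anything Google publishes.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Send a HEAD request and return the final HTTP status code."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises HTTPError for 4xx/5xx answers; the code is what we want.
        return error.code


def describe(status: int) -> str:
    """Give a short, informal label for the status codes discussed in this thread."""
    if status == 410:
        return "410 Gone: removal is signalled as permanent"
    if status == 404:
        return "404 Not Found: page is missing"
    if 200 <= status < 300:
        return "2xx: content is still being served at this URL"
    return f"{status}: other response, check the server configuration"
```

A call such as `describe(fetch_status("https://example.com/search/?q=jobs"))` (an illustrative URL) prints the label for one old search page. Redirects are followed, so the result reflects the final response.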