A recent Google experience

     
8:18 pm on Apr 8, 2015 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 31, 2004
posts: 135
votes: 2


I thought people might be interested in something I've recently been experiencing with Google. I just want to put this out there in case it helps someone; it's another data point people can use when trying to figure out their own problems. This is not a general issue with Google - it's something I screwed up myself.

I have a system that prevents 'bots from crawling my site. It has a whitelist, to which I add Google IPs. I had always added them manually because new IPs didn't come up too often, and I wanted to make sure that no one was spoofing Google. About 10 days ago, Google apparently switched to crawling from about a dozen new IPs. I was not paying close attention to my system and those IPs got blocked. They were blocked for about 3 or 4 days.

On March 31, I noticed a slight downturn in traffic. I also noticed that my "traffic sources" was a little off - my Analytics has pretty consistently pegged my sources as 82-84% Organic, 11-13% Direct, and 4-5% Referral within a percentage point or two. But on March 31, I was down to 81% Organic. Just a tiny drop, but I hadn't been below 82% for months. I also graph some real-time metrics that show the amount of site usage, and that was down a bit too.

On a hunch, I started digging into things and noticed that I had blocked a bunch of Google crawlers. I quickly unblocked them. I didn't think much of it.

By about 1pm, I saw some Twitter chatter. People were wondering why, when they searched for a hockey player, my site wasn't returned when it had been in the past. I checked, and sure enough, they were right. Not for all players, but for a lot of them, especially the more popular players.

I have made boneheaded mistakes in the past, accidentally noindexing some pages, and when I did a "Fetch as Google" and "Submit to index", Google recrawled and the page was added back within minutes. So that's what I did with some of my more popular pages. Unfortunately, Webmaster Tools reported that about 55,000 pages had been blocked. Considering that you can only "clear" 1,000 pages per day, I knew I had some work to do over the next two months, dutifully going into WMT and clearing that list - and since there is a 500-page-per-month limit on "Fetch as Google", I would never be able to add all those pages back that way. I knew I would just have to take my medicine.

By the evening, a Reddit thread had been started asking why my site wasn't coming up in searches, and the Twitter chatter continued. See, the thing is, even though I have a pretty prominent search box all over my site, people simply prefer to put a hockey player's name into Google along with my site's name and then click on the link in Google. And that was annoying users who either thought my site was gone, or who just didn't want to change their routines.

The next day, my Analytics stats showed more of a drop - I was down to 79% Organic, 15% Direct, and 7% Referral. That persisted for 2 days, and my overall traffic was off by about 30%.

The traffic picked up a little bit, but slowly. Google wasn't adding the pages back even though they had recrawled them. Some pages came back, but some of my top pages (for example, Connor McDavid) were nowhere to be found in Google - even when I searched with my site's name (as many users do). I have asked Google to recrawl multiple times, but after a week they still aren't adding back the pages for which I request a recrawl.

I also noticed that although I had 5 straight days of recovery, on the 6th day I got knocked down a peg again. It could have been seasonality - my traffic goes up and down based on hockey game schedules - but it seems more like a Google-induced issue to me. As of the 7th day, though, I'm climbing again, gradually.

Here are some odd things I noticed. First, Google tells me that over 80% of my traffic is "Organic", and that figure dropped by only about 4 percentage points, yet my overall traffic is down by 20-30% - mathematically, that doesn't add up to me.

Next, my advertising revenue is way off. This is harder to pin down because the issue took place over a month boundary, and ad rates and fill percentages often change at monthly boundaries, but my effective CPM dropped by almost 2/3 even though my traffic was initially down by less than 1/3 and is currently down by about 20%. My effective CPM is not rising even as my traffic rises. This could be due to a few things:
  • Advertising is just lousy in April.
  • Advertising systems use the first couple of days of the month to "prime" their systems in some way, so if your traffic on those days is not representative, it skews the rest of your month.
  • Google is somehow sending a different "blend" of visitors, and this group is not behaving the same way with respect to ads. In other words, they are sending visitors less likely to click on an ad.
  • The visitors that Google is still not sending are very valuable to the advertisers, perhaps because most only view one or two pages (when people go to my site directly, they are usually there for a while - they are the more loyal and in-depth visitors).

I don't think there is anything I can do here except wait this out and patiently explain to people on Twitter that yes, I screwed up; yes, I fixed my end; and that we just have to wait for Google to run its course. However, I just find the whole thing interesting, academically, because I don't think anyone would ever deliberately do this as a case study.
    11:34 pm on Apr 8, 2015 (gmt 0)

    Senior Member

    WebmasterWorld Senior Member editorialguy is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

    joined:June 28, 2013
    posts:3493
    votes: 788


    I feel your pain. About a dozen years ago, Google was indexing both www and non-www versions of my site because I'd failed to include a redirect from one to the other in my .htaccess file. It must have appeared to Google that I had duplicate content, because I lost 90 percent of my Google referrals overnight.

    Once I got the problem fixed (which didn't take long, thanks to advice that I got here), it took a while for things to return to normal. Still, all was well after about 60 days.
    11:43 pm on Apr 8, 2015 (gmt 0)

    Senior Member

    WebmasterWorld Senior Member 10+ Year Member

    joined:Apr 15, 2003
    posts:960
    votes: 34


    Not long ago, Google announced that it would soon start crawling from new IPs, which were to be matched with the geo-location of the domain being crawled to better accommodate sites that do dynamic content serving based on the user's IP address. I suspect that's what you've run into.
    4:54 am on Apr 9, 2015 (gmt 0)

    Senior Member from US 

    WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

    joined:Apr 9, 2011
    posts:15956
    votes: 898


    I really, really hope Google understands that the Googlebot UA is so widely spoofed that many sites have an unconditional block on anything calling itself a Googlebot that doesn't come from 66.249.whatever-it-is. Will WMT at least give you a hint about which IPs are really them and which are still Ukrainian spoofers, or do you have to figure it out for yourself?
    5:22 am on Apr 9, 2015 (gmt 0)

    Senior Member

    WebmasterWorld Senior Member 10+ Year Member

    joined:Apr 15, 2003
    posts:960
    votes: 34


    Unfortunately, I don't recall any details of this announcement. I did find an article in Webmaster Tools Help that uses the term "Geo-distributed crawling", which seems to cover the issue.

    Google has encouraged webmasters to verify Googlebot with a reverse DNS lookup for many years now, so the upshot is that webmasters who have been automatically blocking all unrecognized IPs will likely have to reconstruct their whitelists/blacklists or make other adjustments - especially if their site has a geo-location outside the US.
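
    For anyone automating this, the check Google describes is a reverse lookup followed by a forward lookup. A rough sketch in Python (untested; the function name is just mine, nothing official):

    import socket

    def is_real_googlebot(ip):
        # Reverse lookup: the hostname must belong to Google's crawler domains
        try:
            host = socket.gethostbyaddr(ip)[0]
        except OSError:
            return False
        if not host.endswith(('.googlebot.com', '.google.com')):
            return False
        # Forward lookup: the hostname must resolve back to the same IP
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except OSError:
            return False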
    7:58 am on Apr 9, 2015 (gmt 0)

    Full Member

    10+ Year Member Top Contributors Of The Month

    joined:June 3, 2005
    posts: 298
    votes: 12


    I agree with rainborick based on my experience. My website went down for a week during the Panda update of 2011. I had a 60% drop in traffic, which I thought was down to my site being down and Google penalising me for it. I have never really recovered. Please do run the free Barracuda Digital Panguin Tool online, investigate further, and report back. It shows the geo-targeting update that rainborick mentioned as well as all the other updates.

    May I post the tool like this: barracuda-digital.co.uk/panguin-tool
    9:57 am on Apr 9, 2015 (gmt 0)

    Junior Member

    Top Contributors Of The Month

    joined:Jan 19, 2015
    posts: 170
    votes: 28


    @Lucy24 since I always see you at the Apache forum section, would you happen to know any .htaccess rule we could use to reverse lookup the IP of anyone claiming to be Googlebot and if they're out of the 66. range they're banned automatically?

    I've seen Googlebot coming from IPs in Brazil, France, India, Australia and UK. When you look at who owns the IP it's (interestingly) a local ISP and NOT a server farm (although I guess it could be feasible to set up a server at home in some countries where the ISP does little to stop servers running on household lines). This is why I've been somewhat wary of blocking IPs claiming to be Googlebot, as I recall Google saying somewhere that their bots need not come from the Mountain View range.

    Any rules at the server firewall level would also be welcome if an .htaccess rule is way too CPU-intensive. Thanks.
    10:24 am on Apr 9, 2015 (gmt 0)

    Senior Member from NL 

    WebmasterWorld Senior Member lammert is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

    joined:Jan 10, 2005
    posts: 2959
    votes: 38


    I have seen this behavior of Google recently in a slightly different environment.

    I recently started a new project with 100k+ pages. After a week Google discovered the site and started crawling it at 1 fetch per second, which is amazing given the age of the site and the fact that it had only two incoming links. But when I made a few changes to block some private content which I had accidentally left unsecured before releasing the project, Googlebot hit a 403 page about once every 50 crawls. With this small number of 403 hits, the crawl rate dropped almost overnight to 4 fetches per hour.

    I have now changed the 403 response to a 404 (although technically it should be a 403, because the content is there but not accessible), and Googlebot gradually increased its crawl speed again; after a few weeks it is back at the original 1 fetch per second.

    I watched my log files carefully, and Googlebot doesn't seem to care much about 301, 302 or 404 responses. Hours of nothing but 404 responses don't slow Googlebot down a bit. But once you feed it small numbers of 403 responses, some allergic reaction takes place and Googlebot almost completely disappears.
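
    For what it's worth, the change on my side was just the status code for those blocked paths. A minimal sketch of the idea (Flask-style Python, untested; the '/private/' path is only an example, not my real setup):

    from flask import Flask, abort

    app = Flask(__name__)

    @app.route('/private/<path:page>')
    def private_content(page):
        # The content exists but isn't public. A 403 is technically correct,
        # but serving 404 avoided the crawl-rate drop I saw with 403s.
        abort(404)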
    11:51 am on Apr 9, 2015 (gmt 0)

    Preferred Member from GB 

    10+ Year Member Top Contributors Of The Month

    joined:July 25, 2005
    posts:406
    votes: 17


    @Ralph_Slate, thank you for sharing this and I hope you recover soon.

    Hindsight is a good thing, and you may hate me for saying this, but I've never understood webmasters' obsession with blocking things. Surely, unless you're hosting your site with i*********** or h********, it shouldn't matter how many fake bots visit your site; you should have enough bandwidth to cope with it. Scrapers will always find a way to scrape your site.

    That's why it's always best to keep a blacklist instead of a whitelist. In other words, don't collect good IPs, because you will inevitably miss some. It's not only Google: there are plenty of good bots out there, and it's impossible to follow them all and keep their new IPs up to date. When you collect bad IPs after they have offended, you know for sure that you're only blocking the ones that really deserve it.
    12:38 pm on Apr 9, 2015 (gmt 0)

    Senior Member

    WebmasterWorld Senior Member topr8 is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

    joined:Apr 19, 2002
    posts:3522
    votes: 89


    would you happen to know any .htaccess rule we could use to reverse lookup the IP of anyone claiming to be Googlebot and if they're out of the 66. range they're banned automatically?


    .htaccess can't be used for this. You'd need to run a script to do the reverse lookup and then feed the result into however you maintain your whitelist/blacklist - e.g. a database, the filesystem, or a list in a file.
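
    Something along these lines is what I mean (Python sketch, untested; the file name is just an example) - the lookup runs in the script, and the result goes wherever your list lives:

    import socket

    WHITELIST_FILE = 'verified_googlebot_ips.txt'   # example location

    def verify_and_record(ip):
        # Reverse lookup, then forward lookup, then persist the verified IP
        try:
            host = socket.gethostbyaddr(ip)[0]
            if host.endswith(('.googlebot.com', '.google.com')) \
                    and ip in socket.gethostbyname_ex(host)[2]:
                with open(WHITELIST_FILE, 'a') as f:
                    f.write(ip + '\n')
                return True
        except OSError:
            pass
        return False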
    3:27 pm on Apr 9, 2015 (gmt 0)

    Junior Member

    Top Contributors Of The Month

    joined:Jan 19, 2015
    posts: 170
    votes: 28


    @topr Is there anything like that for csf/firewall level?
    5:43 pm on Apr 9, 2015 (gmt 0)

    Junior Member

    10+ Year Member

    joined:Jan 31, 2004
    posts: 135
    votes: 2


    @adder, it depends on your needs. I used to keep only a blacklist, but people figured out that if they started at 12:01am and hit the site with 10 different 'bots, by the end of the day they could have the entire site downloaded locally, and I wouldn't catch on until the next day, when it was too late. I moved to a more real-time system, but that involves whitelists. Fortunately, with such a system I can do a DNS lookup on suspicious IPs and whitelist them on the fly if they are Google. I just hadn't implemented that yet when Google switched up their IPs.
    5:48 pm on Apr 9, 2015 (gmt 0)

    Senior Member from US 

    WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

    joined:Apr 9, 2011
    posts:15956
    votes: 898


    would you happen to know any .htaccess rule we could use to reverse lookup the IP of anyone claiming to be Googlebot and if they're out of the 66. range they're banned automatically?

    If you've already decided to admit only the 66.whatever range, there is no need for a reverse lookup; REMOTE_ADDR is a value in its own right. The complication is when you have to do a hostname lookup (REMOTE_HOST) on every request, because this slows the server w-a-y down (and, not incidentally, plays havoc with your access logs).
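
    If you do go the IP-only route, the test is trivial - a rough Python sketch (untested; the 66.249.64.0/19 block is just the range most of us see in logs, not an official published list):

    from ipaddress import ip_address, ip_network

    TRUSTED_RANGES = [ip_network('66.249.64.0/19')]   # example range only

    def ip_looks_like_googlebot(remote_addr):
        # No DNS lookup at all - just REMOTE_ADDR against the trusted range
        addr = ip_address(remote_addr)
        return any(addr in net for net in TRUSTED_RANGES)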

    I've seen Googlebot coming from IPs in Brazil, France, India, Australia and UK. When you look at who owns the IP it's (interestingly) a local ISP and NOT a server farm

    If you look up the specific IP, does it come through as Google or only as the ISP's name? A lot of ISPs do have server/hosting ranges. Dunno about anyone else, but I'd be exceedingly suspicious of anything from Brazil claiming to be the googlebot. Maybe if it was looking only at Portuguese-language pages, since your ordinary Brazilian robot (generally an infected human browser) doesn't seem to care what it hits.
    6:15 pm on Apr 9, 2015 (gmt 0)

    Junior Member

    Top Contributors Of The Month

    joined:Jan 19, 2015
    posts: 170
    votes: 28


    @lucy24 Many thanks. You're great with your replies as usual.

    Would it then be possible to set up something at the firewall level that does the lookups? I admittedly know nothing about using the firewall other than the very basics that can be learned from any guide, but we have our VPS/servers fully managed and the techs are always keen to set up rules. Ideally it would do a reverse lookup on any Googlebot user agent. Would that be possible? Or is the firewall mostly used to block anything that doesn't come from the Mountain View range? I'm also asking because I have seen Sucuri block Googlebot user agents coming from outside the MV range based on their IP alone.

    Yes, they were ISP IPs, and we are almost sure they were simply household IPs. It was my partner who went through those IPs, as he is knowledgeable about the countries they were coming from, and he confirmed they were ISP IPs.

    To give you an idea, from what I remember: we'd get a visitor to a UK site (.co.uk) claiming to be Googlebot with an IP from Orange (a UK ISP). I'm highly positive that Orange doesn't have servers per se, so this would very likely be a household IP, unless Google themselves have a deal with Orange to use some of their IPs for their (Google's) UK datacenter - that is, instead of using their MV IP range, they use local IPs from the UK via a local ISP. This is one example I remember off the top of my head; it was my partner who went through the suspect Australian/Brazilian/Singaporean/Indian IPs and confirmed many were from ISPs that do not have any publicly known server centers.

    Then I've also seen a fake Googlebot from an OVH IP in France, lol, oh la la Google!
    9:08 pm on Apr 10, 2015 (gmt 0)

    Preferred Member

    10+ Year Member

    joined:Feb 3, 2001
    posts:578
    votes: 1


    This is very interesting. After not touching my crawler whitelist for a long time, around the same time as this post suggests, I suddenly noticed Google Webmaster Tools telling me it could not crawl a section of my site. It could not fetch those pages, but it could fetch others on the same site. I double-checked my setup and it looked the way it should. Why it could crawl much of the site just fine but not these few directories puzzled me. So as a test I removed my list to make the site wide open, and then Google could crawl everything. But that is a bit too wide open, so I put the list back - set up in the same way for Google's crawlers as it has been for years, using the full list of crawlers they publish - and parts of the site were reported as blocked again. So something isn't adding up; they appear to have changed crawlers without adding the new ones to their published list.
    10:44 pm on Apr 10, 2015 (gmt 0)

    Full Member

    10+ Year Member

    joined:Feb 13, 2008
    posts:250
    votes: 5


    I use the reverse DNS technique and it's still working fine for me. First I check the user agent; if it claims to be Googlebot, then I do the reverse DNS check. If the check succeeds, I put the IP in memcache for 24 hours; otherwise I straightaway ban the IP using the Cloudflare API.
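
    The flow is roughly this (Python sketch, untested; pymemcache is assumed, and the Cloudflare call is left as a placeholder for whatever ban mechanism you use):

    import socket
    from pymemcache.client.base import Client

    cache = Client(('127.0.0.1', 11211))   # local memcached

    def ban_ip(ip):
        # Placeholder: call the Cloudflare API (or your firewall) here
        pass

    def handle_claimed_googlebot(ip):
        if cache.get(ip):                   # verified within the last 24 hours
            return True
        try:
            host = socket.gethostbyaddr(ip)[0]
            ok = host.endswith(('.googlebot.com', '.google.com')) \
                and ip in socket.gethostbyname_ex(host)[2]
        except OSError:
            ok = False
        if ok:
            cache.set(ip, b'1', expire=86400)   # cache for 24 hours
        else:
            ban_ip(ip)
        return ok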
    11:46 pm on Apr 10, 2015 (gmt 0)

    Senior Member

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

    joined:Dec 27, 2004
    posts:1999
    votes: 75


    To OP,

    I am going to ask a very, very "stupidio" question: what is the IP range/IPs you are talking about?

    Google is somehow sending a different "blend" of visitors, and this group is not behaving the same way with respect to ads. In other words, they are sending visitors less likely to click on an ad.

    Were they the /Blend28 kind?
    1:16 am on Apr 11, 2015 (gmt 0)

    Senior Member from US 

    WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

    joined:Sept 26, 2001
    posts:12913
    votes: 893


    ...Google announced that it would soon start crawling from new IPs...

    @rainborick - Where did Google announce this? Please supply source & link.

    If this is true, it is of extreme importance and needs to be vetted. Otherwise it is hearsay and should be labeled as such.
    3:25 am on Apr 11, 2015 (gmt 0)

    Senior Member

    WebmasterWorld Senior Member 10+ Year Member

    joined:Apr 15, 2003
    posts:960
    votes: 34


    It was announced on January 28 of this year in the Webmaster Central Blog. You can find several additional references by searching on "Geo-distributed crawling" and/or "Locale-aware crawling by Googlebot" in Webmaster Tools Help. The articles I found say that Googlebot will use these other IPs on URLs they determine might be "locale adaptive".
    3:32 am on Apr 11, 2015 (gmt 0)

    Senior Member from US 

    WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

    joined:Apr 9, 2011
    posts:15956
    votes: 898


    WebmasterWorld post from Engine:
    [webmasterworld.com...]
    with link to:
    [support.google.com...]

    Note in particular
    Note: This list is not complete and likely to change over time.

    The list is, in fact, entirely empty, putting us in the "You're in an airplane" category of accuracy and truthfulness.
    7:19 am on Apr 11, 2015 (gmt 0)

    Senior Member from US 

    WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

    joined:Sept 26, 2001
    posts:12913
    votes: 893


    Thanks for the references.

    Locale-aware crawling by Googlebot occurs only if alternate content is served depending on visitor geo origins, according to those articles, so unless that is being done, I see no issue with continuing to block Googlebot spoofers by IP range filtering.
    11:04 am on Apr 11, 2015 (gmt 0)

    Senior Member from US 

    WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

    joined:Feb 3, 2014
    posts:1561
    votes: 672


    @Ralph_Slate - Wow! This sounds remarkably like what happened to my site. I use Wordfence and had it set to block fake Googlebots. These new bots must have been getting blocked. I cleared it and will await my fate.
    3:41 pm on Apr 11, 2015 (gmt 0)

    Senior Member from US 

    WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

    joined:Feb 3, 2014
    posts:1561
    votes: 672


    Google's help forum claims that their IPs are never published because they change so often. I guess this throws a bucket of water on that theory.
    9:45 am on Apr 13, 2015 (gmt 0)

    Full Member from ES 

    10+ Year Member Top Contributors Of The Month

    joined:Jan 20, 2004
    posts: 347
    votes: 25


    Does Fantomaster still participate on this forum? I suspect he might have some interesting insight.
    9:47 am on Apr 13, 2015 (gmt 0)

    Full Member from ES 

    10+ Year Member Top Contributors Of The Month

    joined:Jan 20, 2004
    posts: 347
    votes: 25


    To answer my own question, not since 2002 :(

    Maybe he uses another name.
    4:43 pm on Apr 13, 2015 (gmt 0)

    System Operator from US 

    incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

    joined:Jan 25, 2005
    posts: 14664
    votes: 99


    Nothing wrong with whitelisting as long as your scripts note new IPs so you can update.

    Plus, you're not supposed to use an IP list anyway; you're supposed to use full-trip DNS verification, and those that don't verify this way will get burned eventually.

    I use a mix: full Google IP ranges, not just a list of individual IPs, and I also verify using full-trip DNS. Never got burned.
     
