We just discovered two sets of two hidden ampersands (white on white) that a former web editor used to space out some text. They were used strictly for formatting purposes, and we have removed them from our home page.
My webmaster doesn't think 4 white-on-white ampersands could possibly be the cause for banning, but we are at a loss as to what else it could be.
It does seem improbable that 4 simple hidden characters (not keywords) could cause banning. What do you guys think?
One other possibility is this code:
<meta http-equiv="refresh" content="1800; url=/Default.aspx">
But I thought you only get "banned" if the refresh interval is less than 30 seconds? If using a meta refresh AT ALL can get us banned from search engines, please let me know and we'll take it off.
But then this morning, it's back to only one result (the home page) again.
I'm wondering if maybe when I request a site review, they're adding it in, but then an automatic algorithm is blacklisting it again when it hits a certain page that isn't following the guidelines.
I wish there were a utility to check our >1000 pages for HTML code that Yahoo doesn't like. Maybe there is? Recommendations? I tried WebPosition Platinum - it says our HTML is fine.
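I don't know of a ready-made tool either, but for what it's worth, a rough script along these lines could at least flag the most common hidden-text patterns across a local copy of a site. This is only a sketch - the site path is a placeholder and the patterns will throw false positives (white text on a dark background is legitimate), so review every hit by hand:

import os
import re

# Patterns that often indicate hidden text; tune these to your own templates.
SUSPICIOUS = [
    re.compile(r'color\s*[:=]\s*["\']?#?fff(fff)?\b', re.I),  # white text
    re.compile(r'display\s*:\s*none', re.I),                  # hidden blocks
    re.compile(r'visibility\s*:\s*hidden', re.I),
    re.compile(r'font-size\s*:\s*0', re.I),                   # zero-size text
]

SITE_ROOT = "/path/to/local/copy"  # placeholder: a local mirror of the site

for dirpath, _dirs, files in os.walk(SITE_ROOT):
    for name in files:
        if not name.lower().endswith((".html", ".htm", ".aspx")):
            continue
        path = os.path.join(dirpath, name)
        with open(path, encoding="utf-8", errors="replace") as fh:
            for lineno, line in enumerate(fh, 1):
                for pat in SUSPICIOUS:
                    if pat.search(line):
                        print("%s:%d: %s" % (path, lineno, line.strip()[:80]))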
One final theory was that we were detecting whether a visitor was a person or a web crawler and then displaying slightly different results. We did this because our original plan years ago was to have people “login” to our website to view the content BUT we wanted the search engines to still be able to view the content without logging in, so we created a dummy account for the spiders called guest@<ourdomain>.com.
I guess this might be considered a form of cloaking. We got rid of that as well. But don't major news sites do the same thing, e.g. nytimes, latimes?
The difference between a user browsing our site and a spider browsing our site is so minimal, I find it hard to believe we would be banned for that.
No guessing about it. It is a form of cloaking. Cloaking is where you feed the spider one page and a visitor another - the number of differences doesn't make it "less cloaked". Feed one page to the spider, another to humans = cloaking. I'd get that taken care of, then submit for reinclusion and hope they take the time to do yet another review and re-include you in the index.
Geez, we weren’t providing bogus keywords to the SEs or spamming. We just wanted users to log on to view our content.
>>that is cloaking.
ok, then answer me this - how do all the news pubs, e.g. NYTimes, LATimes, and all the other major newspapers, get indexed by Google and other search engines even though they require logons? How do they still let spiders crawl their content?
We too are a news organization - though much, much smaller. Back in the .com days we planned on requiring logons to view our content, but we decided against it.
Maybe the fact that we are much, much smaller has something to do with it? In any event, the cloak is gone.
I didn't write the HTML, but from what I understand, the webmaster set crawlers up with a "temp" account called guest@<ourdomain>.com. All other visitors have to sign in with their ID to view the content. The ID is displayed on every web page when signed in, so it is unique to each user.
But the HTML would only differ by about 5-10 characters (the userID). I guess if Yahoo and other SEs do a simple compare of the HTML fetched with a spider agent vs. a browser agent, it will differ by only ~5-10 characters. Maybe their algorithm is just a simple file compare. If so, then I guess that might explain the blockage.
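Just to illustrate that theory with a toy example (the page HTML below is made up): even a 5-10 character userID is enough for a naive byte-for-byte compare to flag the spider copy and the visitor copy as different:

import difflib

# Toy versions of the same page: one served to the spider's guest account,
# one to a logged-in user. Only the displayed ID differs.
spider_html = "<html><body><p>Logged in as: guest@example.com</p> ...content... </body></html>"
user_html   = "<html><body><p>Logged in as: jsmith42</p> ...content... </body></html>"

ratio = difflib.SequenceMatcher(None, spider_html, user_html).ratio()
print("similarity: %.3f" % ratio)  # very high, but not 1.0
print("identical" if spider_html == user_html else "different -> would be flagged")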
Here's a new theory:
We pay a monthly subscription to BusinessWire to provide us with news releases related to our industry and these news feeds get automatically posted to our website. I'm sure other sites post the same news releases, but I find it hard to believe Yahoo would ban us for news releases that go out on news wires.
Maybe Yahoo uses BusinessWire too and thinks we're stealing their content?
If so, that's just ridiculous, because I've seen some of the same news releases posted on many other sites, including our competitors', and they aren't banned.
What do you guys think?
>>I didn't write the HTML, but from what I understand, the webmaster set crawlers up with a "temp" account called guest@<ourdomain>.com. All other visitors have to sign in with their ID to view the content. The ID is displayed on every web page when signed in, so it is unique to each user.
It's not the difference between someone who's logged in and what the spider sees; it's the difference between someone who's NOT logged in and what the spider sees.
You were cloaking big time by the sounds of it. Now you say you've got rid of it, so it may just be a waiting game? If I were you, I would surf around your site as a not-logged-in user and make sure you now see what the spiders see. If it's something different, then you're still cloaking.
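A quick way to run that check, as a sketch: fetch the same URL once with a browser User-Agent and once pretending to be Yahoo's Slurp spider, then diff the responses. The URL and UA strings below are placeholders, and note this only catches user-agent cloaking - it won't catch cloaking keyed off a spider's IP address:

import difflib
import urllib.request

URL = "http://www.example.com/"  # placeholder: use a page from your own site

def fetch(user_agent):
    req = urllib.request.Request(URL, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace").splitlines()

browser = fetch("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
spider = fetch("Mozilla/5.0 (compatible; Yahoo! Slurp)")

# Any diff output means spiders still get different content than a
# not-logged-in visitor does.
for line in difflib.unified_diff(browser, spider, "browser", "spider", lineterm=""):
    print(line)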
>>This still sounds like an enormous cloak! If I, as an unregistered (NOT logged in user) see something wildly different (i.e. NO content) compared to a logged in user or an SE spider, then this is a MAJOR difference.
nope, the difference in HTML is minimal. We added in userids years ago (.com days) with the intention of forcing logons, but we never flipped that switch. We just left the HTML code in there in case we ever decided to force logons.
Just legacy HTML code, really.
Wow, my last theory about newswires causing banning is out the window, since someone replied that they use newswires/press releases too.
Maybe it's a subpage and not our home page causing the issue. Are there any good tools that scan not just one page, but every subdirectory, looking for hidden text, cloaking, etc.? Surely one cannot be expected to check thousands of HTML pages by hand to be "Yahoo compliant".
What if a vindictive employee places "hidden text" in a buried HTML page to cause banning? How would you ever find it?
Somebody has to. It's the price you pay to "deserve" to have thousands of pages listed instead of dozens or hundreds. Practically speaking, most large sites are dynamically generated, and the templates are tested for "compliance", thus making the generated pages "correct by construction".
> what if a vindictive employee places "hidden text" in a buried HTML page to cause banning? How would you ever find it?
Technical answer: Pre-emptive access control. Don't let just any employee modify or put pages on your server. Management answer: Make sure you have no vindictive employees. If you do, either address that employee's problems or fire him. If the employee had legitimate complaints, then fire his supervisor for allowing an employee to be so badly treated that he became vindictive.
This is an interesting case. It leads to the question, "Do search engines want to index content, or act as marketing channels for member-only content?" As a user, I don't want to see search listings that lead to pages I can't view without a subscription -- they are a waste of my time. As a marketer, I can see the other side, too.
You might want to test for some server-level problems. Make sure your pages appear under only a very small number of domains and/or subdomains, make sure that any redirects on your site return response codes consistent with the HTTP specification, and make sure that you don't redirect all missing pages to a single page without returning a 404 Not Found response code.
Jim
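To sanity-check those response codes, something like this rough sketch would do. The URLs and expected codes below are placeholders - adjust them to your own pages and to how your server should actually respond:

import http.client
from urllib.parse import urlparse

# Placeholder URLs: a normal page, a redirecting page, and a page that
# should not exist. Adjust both the URLs and the expected codes.
CHECKS = [
    ("http://www.example.com/", 200),
    ("http://www.example.com/old-page", 301),
    ("http://www.example.com/no-such-page-xyz", 404),
]

for url, expected in CHECKS:
    parts = urlparse(url)
    conn = http.client.HTTPConnection(parts.netloc)
    conn.request("HEAD", parts.path or "/")
    status = conn.getresponse().status  # raw status; redirects are NOT followed
    conn.close()
    print("%s: got %d, expected %d%s" % (url, status, expected,
          "" if status == expected else "  <-- CHECK THIS"))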
You have attempted to get users to your site by having information displayed to the crawlers but not to visitors, and they have taken action - quite rightly so, as the searcher would be better off presented with a page in the SERPs that gave them the info they searched for.
Maybe a better way forward for you would be to build pages with content that entices searchers, and then offer more of the same if they subscribe.
>>nope, the difference in HTML is minimal.
I think you need to re-read what you've been writing and what's been written in reply if you really want to move forward. Don't worry about news feeds. If you were cloaking then that could account for everything.
Don't take it as a criticism, take it as friendly advice. Think carefully about just what your site is delivering to whom.
We designed the website 7 years ago with the "intention" of forced logons, but we never flipped the switch to turn them on. Thus 100% of our content has ALWAYS been available to both SEs and users.
We only kept the userids in the HTML code because we decided to use them for maintaining newsletter subscriptions. A user who is logged in can subscribe to or unsubscribe from various newsletters.