| 2:31 am on Aug 23, 2003 (gmt 0)|
> Can the Google spider detect this?
Very likely it cannot.
Send in a spam report [google.com], just don't expect anything to come of it.
| 7:00 am on Aug 23, 2003 (gmt 0)|
If the js is called externally from a robots.txt protected file then it definitely won't pick it up. But it also won't help at all in the serps. Now you have text that is invisible to Google and surfers so there really is no problem. I think Google would want to move towards indexing what is on the screen rather than in the code. Maybe like using OCR technology on a screenshot. Does anyone know if anything like this is in the works?
| 7:13 am on Aug 23, 2003 (gmt 0)|
> Now you have text that is invisible to Google and surfers so there really is no problem.
From what I gather, the text is in the HTML and the JS just hides it.
Maybe I shouldn't spell out how to do that. ;) So there are just two of many ways to accomplish the same thing, and how is google ever going to be able to automatically detect all of them?
They aren't. They need spam reports. They need to listen to and act on spam reports.
| 7:38 am on Aug 23, 2003 (gmt 0)|
They could theoretically get the computed style of every element on the page and check for display: none or visibility: hidden...but that would take about a gazillion years to do for every page in their index...
I agree with Dolemite, they need human evaluation and reporting. And it only takes a minute to fill out a spam report. :)
| 9:03 am on Aug 23, 2003 (gmt 0)|
|From what I gather, the text is in the HTML and the JS just hides it. |
That would certainly make more sense. (doh!)
I wouldn't try it given their new js abilities. I would expect them to be getting much better about that as well (unless you call it from a protected file).
| 9:32 am on Aug 23, 2003 (gmt 0)|
| 11:38 am on Aug 23, 2003 (gmt 0)|
Thanks all for your comments. Seems like if one is willing to take the risk, he will probably get by. This is a national design firm that has a separate page for each city it targets, so when one does a keyword search for a website for that city, this firm will usually come up first. Each page has a separate title and file name that has keywords for that city. The only things that differ on each page are the HTML text and links, which are hidden by the .js hidden-text style and tag ID. Therefore each city page looks like the home page.
| 11:43 am on Aug 23, 2003 (gmt 0)|
You should definitely report them.
| 11:50 am on Aug 23, 2003 (gmt 0)|
Is the site relevant to what searchers are looking for?
| 11:55 am on Aug 23, 2003 (gmt 0)|
>Is the site relevant to what searchers are looking for?
If you were looking for a design firm in a particular city, they would appear first in the results.
| 12:43 pm on Aug 23, 2003 (gmt 0)|
|They could theoretically get the computed style of every element on the page and check for display: none or visibility: hidden...but that would take about a gazillion years to do for every page in their index... |
Simply checking for those attributes shouldn't be too hard, inline anyway...I seem to remember a thread about links in a DHTML/CSS menu not being spidered (Marcia, was that yours?). The menu used inline CSS to initially hide divs, then revealed them on hover as I recall. External CSS complicates things a bit, and yes, determining the resulting style of every element would be just about impossible. Adjacent selectors, height:0, etc., good luck. ;)
The even bigger problem is when JS is used to modify/add CSS or to modify/add HTML, since you can do basically anything, and do it lots of different ways. Right now, as far as we know, google isn't even parsing basic JS, let alone advanced stuff coded with intentional obscurity.
So google already weeded out the basic same-color hidden text to some extent, they might be checking for basic inline display:none, but they'll never get much further beyond that algorithmically.
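To make the "inline anyway" case concrete, here is a minimal sketch of that kind of check. It assumes nothing about how Google actually works; `findInlineHidden` is a made-up name, and it only looks at style="..." attributes in the raw HTML, so external CSS and anything set by script slips straight past it.

```javascript
// Hypothetical helper: flags inline styles that hide an element outright.
// It only sees style="..." attributes in the raw HTML; external CSS and
// styles applied by script are invisible to it.
function findInlineHidden(html) {
  const hits = [];
  // Match style attributes written with either double or single quotes.
  const styleAttr = /style\s*=\s*("([^"]*)"|'([^']*)')/gi;
  let match;
  while ((match = styleAttr.exec(html)) !== null) {
    const css = (match[2] !== undefined ? match[2] : match[3]).toLowerCase();
    // Split into declarations and strip whitespace inside each one.
    const declarations = css.split(";").map((d) => d.replace(/\s+/g, ""));
    if (
      declarations.includes("display:none") ||
      declarations.includes("visibility:hidden")
    ) {
      hits.push(match[0]);
    }
  }
  return hits;
}

// Example: two of the three elements are hidden inline.
const sample =
  '<div style="display: none">a</div>' +
  '<p style="color:red">b</p>' +
  "<span style='visibility:hidden'>c</span>";
console.log(findInlineHidden(sample).length);
```

Note how little this catches: height:0, off-screen positioning, same-color text and every external stylesheet rule are all missed, which is the point being made above.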
| 1:51 pm on Aug 23, 2003 (gmt 0)|
So, it appears that a web design firm can use the js file to document.write inline hidden text style with little fear from google, unless someone from this forum or a competitor busts them. You all are saying that they don't respond well to spam reports, though...
| 2:14 pm on Aug 23, 2003 (gmt 0)|
|"...they'll never get much further beyond that algorithmically..." |
Gotta disagree. It's not that hard a problem. There are many ways to go about solving it, so I think it is just a matter of time.
Just to illustrate the point, a very clumsy example of how it could be solved: Screen-scraping techniques have been deployed to integrate large legacy corporate systems for years, where those systems were initially designed for human interfacing only. And that is just a very clumsy way to do it. There are far more elegant ways.
| 9:50 pm on Aug 23, 2003 (gmt 0)|
I don't dispute that Google will eventually parse JS to some extent, but it's a whole other thing to turn code into meaning and determine what visual effects the code will result in. Browsers don't even do that. People do that.
Screen scraping does you no good when you don't know what's actually going to end up on the screen.
One method that might be used to detect text-hiding JS is to generate some sort of hash of the initial visual output...then execute JS, re-render & generate a new hash. Then check if you lost visual information in between. Even that would be a serious increase in processing and would probably result in a lot of false positives. There are plenty of legitimate uses of document.write and the various ways of modifying element attributes. This system does nothing for CSS, though. Often times the visual appearance of a page changes so drastically when CSS is switched on/off that there might be no meaningful way to compare the two, except perhaps OCR, as Powdork suggested. I imagine that would be incredibly CPU intensive, though.
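A rough sketch of the before/after comparison, under a big assumption: that some renderer (not shown, and the hard part) can hand back the list of text fragments actually visible on screen. The function names are invented for illustration, and a real system would presumably hash the fingerprint rather than compare strings.

```javascript
// Hypothetical sketch: each "render" is just the list of text fragments a
// renderer reports as visible. Producing those lists is not shown here.
function normalise(fragments) {
  return fragments
    .map((t) => t.trim().toLowerCase())
    .filter((t) => t.length > 0);
}

// Order-insensitive fingerprint of what is visible; cheap equality check.
function fingerprint(fragments) {
  return normalise(fragments).sort().join("\n");
}

// Fragments that were visible before the script ran but not afterwards.
function lostAfterScript(beforeFragments, afterFragments) {
  const after = new Set(normalise(afterFragments));
  return normalise(beforeFragments).filter((t) => !after.has(t));
}

// Example with made-up page text.
const before = ["Welcome", "Web design in Springfield", "Contact us"];
const after = ["Welcome", "Contact us"];
if (fingerprint(before) !== fingerprint(after)) {
  console.log(lostAfterScript(before, after));
}
```

The false-positive worry shows up immediately: a collapsed menu or a tabbed panel "loses" text in exactly the same way, so the list of lost fragments is only a hint for further review.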
| 11:10 pm on Aug 23, 2003 (gmt 0)|
|"...One method that might be used to detect text-hiding JS is to generate some sort of hash of the initial visual output...then execute JS, re-render & generate a new hash. Then check if you lost visual information in between..." |
Having said that, you are right that it is not a trivial problem, nor one which has already been solved. But I still maintain it is not that hard a problem, and it is only a matter of time before it is solved. It is not like the raw technology is not available; it is just a question of development.
| 11:13 pm on Aug 23, 2003 (gmt 0)|
"If the js is called externally from a robots.txt protected file then it definitely won't pick it up."
I don't think this is necessarily the case. robots.txt says "do not index this file/directory"; to me there's a difference between indexing a file and simply reading it to get a real representation of how the page looks. I think, from a perspective of quality serps, that the next big step is to be able to read and parse js to a certain extent, including detecting js-controlled changes to visibility etc. We know that Google has recently started requesting external js files, and I imagine there is a team of boffins at Google already developing ways to automatically detect some of the less-than-ethical uses of them.
I think by the end of the year we will see Googlebot requesting external css files also.
| 5:35 am on Aug 24, 2003 (gmt 0)|
|I don't think this is necessarily the case. robots.txt says "do not index this file/directory"; to me there's a difference between indexing a file and simply reading it to get a real representation of how the page looks |
Disallowing through robots.txt will keep GoogleBot from ever crawling the page. Google can and does index urls that are disallowed through robots.txt. It will list only the url in the serps and the only factor applied when ranking these pages is the anchor text pointing at them.
Simply put, if Google GETs a page that is disallowed through robots.txt you should contact them (after checking your syntax). :)
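For anyone checking their syntax, a simplified sketch of how a crawler might decide whether a path is off limits. This is an assumption-heavy toy, not Googlebot's logic: `isDisallowed` is a made-up function that handles only User-agent and Disallow lines with plain path prefixes, ignores wildcards and Allow, and applies every matching group rather than picking the most specific one.

```javascript
// Hypothetical, simplified robots.txt check: prefix matching only.
function isDisallowed(robotsTxt, userAgent, path) {
  const agent = userAgent.toLowerCase();
  let groupApplies = false;
  let previousWasAgentLine = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const colon = line.indexOf(":");
    if (colon === -1) continue; // blank or malformed line
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (field === "user-agent") {
      const matches = value === "*" || agent.includes(value.toLowerCase());
      // Consecutive User-agent lines share one group of rules.
      groupApplies = previousWasAgentLine ? groupApplies || matches : matches;
      previousWasAgentLine = true;
      continue;
    }
    previousWasAgentLine = false;
    if (
      field === "disallow" &&
      groupApplies &&
      value !== "" && // an empty Disallow blocks nothing
      path.startsWith(value)
    ) {
      return true;
    }
  }
  return false;
}

// Example file protecting a scripts directory (paths are made up).
const robots = "User-agent: *\nDisallow: /scripts/\n";
console.log(isDisallowed(robots, "Googlebot", "/scripts/hide.js"));
console.log(isDisallowed(robots, "Googlebot", "/index.html"));
```

A compliant crawler that gets a true result here never requests the file at all, which is why the disallowed URL can still be listed but nothing inside it is read.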
| 8:20 am on Aug 24, 2003 (gmt 0)|
I don't know if this is really on topic, but I've seen a porn site ranking pretty well on Google for totally unrelated searches using a rather primitive form of JS redirection. Basically, the page has this piece of code in its head:
which is obviously ignored by google but still redirects the browsers. The rest of the page, which will be never seen by the users, looks like a database-generated text with high density for specific keywords. Again, I am very surprised this poor trick may work well on Google today!
| 8:48 am on Aug 24, 2003 (gmt 0)|
>Again, I am very surprised this poor trick may work well on Google today!
That's not a trick in the first place (although it's used as a trick in your example). It's part of a legit script that is used by many frameset pages to jump back into the frameset if a single frame source is requested.
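// If this page was loaded on its own rather than inside the frameset, jump back to the frameset.
if (parent === self) top.location.href = 'http://example.com/frameset1.htm';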
| 9:41 am on Aug 24, 2003 (gmt 0)|
|A bot can't penalize the cheater |
So, would you say that it's a safe to use trick (even without frames on your pages)?
I wonder why we don't see tons of pages like that, then.
| 9:59 am on Aug 24, 2003 (gmt 0)|
>So, would you say that it's a safe to use trick
There's no safe to use trick, imho. Just more or less risky tricks.
Google Guidelines say: Don't employ cloaking or sneaky redirects. [google.com]
So i expect this is a higher risk trick.
>I wonder why we don't see tons of pages like that, then.
You don't see them ...?!
| 10:49 am on Aug 24, 2003 (gmt 0)|
I see them, but not so many as I would expect for something so easy to do...
Back to topic, maybe this is a sneaky redirect and not hidden text, but the point is: many of us do believe it is easy to detect (by human review) while difficult to automatically penalize: this should make it a relatively 'safe' trick, unless you are in a very competitive field.
My other point would be: if somebody from a porn site makes this kind of sneaky redirect to my site, whose site is going to be penalized, after human review? Knowing the answer, I could probably either trash my competitors' sites or use the trick safely for my own good...
| 11:11 am on Aug 24, 2003 (gmt 0)|
>if somebody from a porn site makes this kind of sneaky redirect to my site,
>whose site is going to be penalized, after human review?
After human review? Who knows ... it depends on the human. If there's no visible or whois relation between the sites, most likely the pron page gets penalized though.
>Knowing the answer, I could probably either trash my competitors' sites
|Fiction: A competitor can ruin a site's ranking somehow or have another site removed from Google's index. |
Fact: There is almost nothing a competitor can do to harm your ranking or have your site removed from our index. Your rank and your inclusion are dependent on factors under your control as a webmaster, including content choices and site design.
| 12:45 pm on Aug 24, 2003 (gmt 0)|
Yes, I know that nothing that's not under your direct control as a webmaster can harm your site.
On the other hand, if I take a 'disposable' domain, fill it with hidden text and sneaky redirects to my own site, the only thing they can possibly do is remove the disposable site from the index. Pretty boring game, I guess...
(and, most important, given this possibility, why should you worry about creating .js files on your own domain, using robots.txt to hide them and taking risks anyway? It seems much better to do this on a different domain redirecting to the first one, if you are really such a 'bad' guy)
| 8:03 am on Aug 25, 2003 (gmt 0)|
if (i == 2000)
[document.write, with the first character being the ASCII character for i / 100 + 20 or something]
And on the 2000th operation the magic will happen, and there's nothing Google can do except increase the maximum loops parsed. Maybe this will devolve into a war, with them increasing maximum loop parsing until finally their parser takes up too much CPU. This is a problem they will never solve, thankfully.
| 10:29 am on Aug 25, 2003 (gmt 0)|
|Guess this is where my computer science degree from CMU comes in. |
Exactly...not to sound all high-and-mighty, but sometimes people who haven't been up to their elbows in stuff like this only see input and output; they can't imagine what needs to happen in between. As worthless and outdated as some of my CS seemed at the time, I realize part of it was just intended to teach us how to think. At least I hope that's what they were going for...it definitely had no immediate practical application. ;)
That's a very naive assumption. IMO, googlebot will be parsing & running basic JS long before it can spot even the simplest implementations of JS hidden text. CSS doesn't seem especially meaningful to Google except in spotting hidden text, but that's going to be a very difficult task as well. Combine JS+CSS in any reasonably clever way and things are going to get extremely dicey, nearly impossible to detect.
Think about everything a compiler has to do, and that's just what's needed to transform language code into machine code. Compilers can't tell you the first thing about what a program will do once it runs, and this particular problem is more on that level. In order to detect this stuff, google has to determine the effect of code, which is a very difficult task for a machine...and a relatively simple one for a human.
|Having said that, you are right that it is not a trivial problem, nor one which has already been solved. But I still maintain it is not that hard a problem, and it is only a matter of time before it is solved. It is not like the raw technology is not available; it is just a question of development. |
I remain convinced that it is very difficult. It goes far beyond a question of development.
First of all, any kind of simulated scripting is going to increase processing time by orders of magnitude. They'll need every second they can shave off PR calculations and maybe another 50,000 boxes. Sure, they can afford it, but I don't think it's going to look like a good idea to the board of directors come IPO time, let alone the current leadership.
Secondly, assuming a detection system like this would dole out automatic penalties, false positives are going to be a major issue. No system like this is foolproof in either direction, and getting both the right detection code to begin with and then the right threshold for a penalty is going to require lots of analysis.
So this is indeed a non-trivial problem, perhaps one not suited for algorithmic detection. That's not to say that they shouldn't try, but in the meantime, it's as compelling an argument for paying attention & responding to spam reports as I've got.
| 12:35 pm on Aug 25, 2003 (gmt 0)|
Whoah, so I get to do the honours... Welcome to WebmasterWorld, displacedprogrammer!
|"... Think about everything a compiler has to do ... |
I think this is the wrong metaphor, but let's explore it and see what it reveals.
OK, now let me remember... Well, there is lexical analysis, then parsing ... that won't help... but wait, there is more... bottom-up evaluation of inherited attributes... well, perhaps some of those techniques could come in useful in this sort of problem... then there is type checking, including overloading functions and operators... no, probably not... but wait ... how about code optimization ... Yes, a compiler can eliminate dead code. There are a whole bunch of standard algorithms for determining how long code might be in a loop, and algorithms to optimise loops (such as loop flow graphs). A compiler can determine loop-invariant computations and perform code motion, it can spot and eliminate induction variables, it can deal with aliases and pointers, it can estimate types, and it can perform a whole bunch of different sorts of dataflow analysis techniques.
For another example of how software can "determine the effect of code", consider some of the recent developments in anti-virus software. The program is executed on a virtual machine, and elements in the virtual machine are examined/monitored by the software.
I still maintain it is not that hard a problem. Not trivial. Not yet solved. But compared to many other problems already solved, not that hard.
| 9:32 pm on Aug 25, 2003 (gmt 0)|
thx for the welcome.
There is no solution to the halting problem, here's the proof:
so Google will never be able to solve this. They can get loop invariants usually used when rolling out a loop, maybe use other tricks, but that won't help when the code looks like this:
x = (x*2) % 40;
if(i == 2000)
x depends on its previous value, so they must calculate all 2000 values; if their limit is 1000, then the code goes unparsed.
It's not only hard, it's unsolvable, which is good for some of us ;)
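A small sketch of the step-limit argument, for illustration only. The analyser here is entirely hypothetical: `reachesTrigger` stands in for a script evaluator that gives up after `stepBudget` iterations, and the starting value of x is an assumption since the original snippet doesn't give one.

```javascript
// Hypothetical analyser: runs the loop body for at most `stepBudget`
// iterations and reports whether the guarded branch was ever reached.
function reachesTrigger(triggerAt, stepBudget) {
  let x = 1; // assumed starting value, not given in the snippet above
  for (let i = 1; i <= stepBudget; i++) {
    x = (x * 2) % 40; // each value depends on the previous one
    if (i === triggerAt) {
      return { reached: true, x };
    }
  }
  // Budget exhausted before the branch was seen.
  return { reached: false, x };
}

console.log(reachesTrigger(2000, 1000).reached); // budget too small
console.log(reachesTrigger(2000, 2000).reached); // budget large enough
```

With a budget of 1000 the evaluator never sees what happens at i == 2000; raising the budget catches this case but only moves the threshold, which is the arms race described above.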
| 10:04 am on Aug 26, 2003 (gmt 0)|
Not denying the halting problem is unsolvable, but that is purely from an academic perspective. There are many viable ways to go which are 'close enough', even though they may not be an absolutely and purely correct solution. (And that is after you have done the compiler optimisation to eliminate redundant loops, etc., as I hinted in my previous post.)
There is a difference between science and technology. There have been philosophers in academia who argue the following (and I'm simplifying here, so any philosophers out there, please forgive this gross oversimplification). They argue that the physical world is not what is real. The scientific world is about making observations which lead to a model through induction, then using that model to determine/predict using deduction. The problem is that that notion of science is flawed for a bunch of reasons, such as: you can't trust your senses, hence you can't make observations, so there is no such thing as facts/observations. So, these particular philosophers argue, science can't be trusted. The funny thing is that these philosophers happily use aeroplanes to travel to conferences to lecture their theory that you can't trust science. Why? Because there is a difference between pure science and technology.
Just to bring the discussion back to earth, if you go down the 'it's not possible' approach then you'd give up on automation of all logistics functions (the bin-packing problem is NP-hard, so no efficient exact algorithm is known). And yet there are very many good software products out there in use by shipping companies which pack bins. They may not actually solve the bin-packing problem optimally, but they come close enough to make practical sense.
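As an illustration of 'close enough', here is a sketch of first-fit decreasing, one well-known bin-packing heuristic. It is not claimed to be what any shipping product uses; the function name and the integer sizes are just for the example.

```javascript
// First-fit decreasing: sort items largest first, then drop each one into
// the first bin that still has room, opening a new bin when none does.
function firstFitDecreasing(sizes, capacity) {
  const bins = []; // each bin tracks remaining space and its items
  const sorted = [...sizes].sort((a, b) => b - a);
  for (const size of sorted) {
    if (size > capacity) {
      throw new RangeError(`item ${size} exceeds capacity ${capacity}`);
    }
    let bin = bins.find((b) => b.free >= size);
    if (!bin) {
      bin = { free: capacity, items: [] };
      bins.push(bin);
    }
    bin.items.push(size);
    bin.free -= size;
  }
  return bins.map((b) => b.items);
}

// Example: six items into bins of capacity 10.
console.log(firstFitDecreasing([4, 8, 1, 4, 2, 1], 10));
```

The heuristic runs quickly and makes no promise of the minimum number of bins, yet in practice it usually lands at or near it. That is the sense in which an answer can be useful without the underlying problem being solved exactly.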