Forum Moderators: open
The 100k limit, AFAIK, is only a limit on the page size stored by the cache. I believe a number of sources have confirmed that the content of pages over 100k that isn't cached is still indexed.
Although it's obviously a huge usability blunder to have a page that size, and very bad practice.
Cheers,
Nigel
You mentioned that the best way to hide links from Google would be to block cgi-bin through robots.txt and then use a script which serves URLs of the form
http://www.domain.com/cgi-bin/redir.pl?id=00012345
where the id represents a URL
I have two questions:
1) How do I implement this? Could you give me links to such scripts, as I am unable to find any?
2) What happens, in general, when Google sees a link but cannot follow it? Do such links contribute to PR leak?
Suppose I have 10 links on my page, of which 4 are hidden via such a script. Would Google distribute the PR across all 10 links or only across the 6 visible ones?
Thanks,
Internet Master
I did not pick this one, but it seems to do the job just as well as others I've seen. You just have to look at "-Shortened URLs" in the "readme.txt" file for guidance on using IDs.
2) PR leak.
First, understand that a "PR leak" is not really a leak. Some PR is passed on from your page to the pages you link to, but your own page does not lose any PR in doing this. I've seen this statement in a few threads; here are two examples:
1) [webmasterworld.com...]
2) [webmasterworld.com...]
A link like that would be a link to an internal, access-restricted part of your own site. I would assume this makes it quite impossible to decide where those 4/10 of the PR should go. In order to "leak", it must leak to "somewhere"; at least that is the way I understand it.
So, your 4 links will not leak. That is: they will not pass any PR on, as there is nothing to pass it on to. Your six remaining links will each pass on 1/6 of the total amount passed on. The total amount "passed on" stays the same; it just benefits 6 pages instead of 10. As 1/6 is greater than 1/10, this is good for the pages that do get some.
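A quick sketch of that arithmetic (illustrative only; it assumes, as above, that the PR a page passes on is split evenly among its followable links):

```javascript
// Each followable link receives an equal share of the PR the page passes on.
function prShare(totalPassedOn, followableLinks) {
  return totalPassedOn / followableLinks;
}

var allVisible = prShare(1, 10); // 10 followable links: 1/10 each
var fourHidden = prShare(1, 6);  // 4 links hidden from the bot: 1/6 each
// fourHidden > allVisible: the same total, spread over fewer pages
```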
/claus
<edit>typos, clarified text</edit>
Everybody: let me remind you all that in post 23 I explained what Java and JS are. Please call things by their proper names before I join Made_In_Sheffield's self-destructive behaviour (and my walls won't last long, so I'll be breaking my neighbours'!)
claus: You said that a link can be hidden in a Java applet, but I'm not sure: is linking outside the applet's home domain even allowed under Java's security policies?
Imaster: I'll give you a quick definition of PR leak:
A page gives PR to the pages it links to without losing its own. If the other page links back, it will give more PR back to the first page. If your first page has more links, each one will carry less PR, so the pages that link back will return less. The page never loses PR by linking out, but it might end up gaining less than before: this is what is called PR leak.
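For reference, the formula from the original PageRank paper makes this behaviour concrete. Here T1 … Tn are the pages linking to A, C(T) is the number of outbound links on page T, and d is the damping factor (usually quoted as 0.85):

```latex
PR(A) = (1 - d) + d\left(\frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)}\right)
```

Adding outbound links to your page increases its C, so each individual link carries a smaller share PR/C; a page that links back therefore returns less to you, which is the effect described above.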
Regards,
Herenvardö
Google definitely has followed the Java links on my site. I have text links at the bottoms of the pages now, but Google indexed the pages before the text links were there.
The weird thing is, I have a client's site that uses the same type of Java scripting, and the Goo didn't make it past the first page the first time it got crawled. Baffled,
Chris
GoogleBot might have visited the pages in question if you visited them yourself with the Google Toolbar turned on to send data to Google.
That's how Google finds new sites/pages...
If a page is not linked to by any other page, then it is not part of the web as Google sees it and, as I understand it, will therefore not be added to the index. I think the only way Google finds pages is by following links from other sites. I'm pretty sure that submitting your site to Google does no good unless you have a link from another site (which kind of makes it a pointless exercise, apart from making you feel better). Anyone want to correct me on any of that?
I do, however, think Alexa finds sites that way.
Cheers,
Nigel
Nobody's saying Google can't find a page without links. What I am saying is that they won't add it to the index until they find some links.
This is because of the way the PageRank algorithm works.
Why do you think pages get dropped from the index once you take them out of your site navigation but leave the pages there? Because they are no longer linked.
It takes a while, but it does happen in time.
Visit_Thailand
Without looking in your log file and seeing what the user agent was, you can't blame that on Google. Do you have the Alexa toolbar installed?
Thanks
Nigel
This question about hiding links from Google was discussed some months ago, and Googleguy made his position on the topic clear:
Googleguy comments about hidden links [webmasterworld.com]
His comments are on page 1, fifth comment.
Personally, I use a PHP redirect script hidden from the bot using robots.txt. Google is gracious enough not to visit pages disallowed by robots.txt.
It's great for hiding repeated links to terms and conditions, privacy statements or adverts.
Read what Googleguy said, hope this helps.
"4. Use a redirect script and use robots.txt to disallow your script. (RECOMMENDED!) googlebot won't even touch your script."
then GoogleGuy replied:
"Yah, robots.txt or meta tags are your friends.. Those should both work fine."
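Putting those two quotes together: a minimal robots.txt along these lines keeps compliant bots away from the script (assuming the redirect script lives under /cgi-bin/, as in the redir.pl example earlier in the thread):

```
User-agent: *
Disallow: /cgi-bin/
```

Any link routed through /cgi-bin/redir.pl?id=... is then off-limits to googlebot, while browsers follow the redirect normally.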
Here's what I'm thinking. You can perhaps prevent the link from being spidered and indexed, but does it matter PR-wise? Even if the link isn't followed, it's still there and Googlebot still saw it. For those who need to hide the link to reduce PR leakage, that method won't work. Since there has been talk about Google spidering JavaScript, that method is out too. So that leaves us with <form> redirection, but that shouldn't be very hard for Google to implement, and since GoogleGuy follows this forum it's probably already done. If someone thinks of a cool method to make Googlebot ignore a link completely (as if it's not part of the HTML) and posts it here, that will be the end of that method :) I'd still love to have it though.
regards,
Darko
If you just want your users to find the links, you could put them on another page and set up a prominent link to that page.
To make a link totally unspiderable, simply create it using the document.write method in JavaScript. If the required link is a named image rather than text, I think you can assign its HREF field. I did something similar in an HTML help file a while back, but my recollection of it is rather hazy.
Kaled.
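A sketch of the document.write approach described above (www.site.com is just the placeholder used elsewhere in the thread). The anchor markup is assembled by a function, so the page's raw HTML source contains only script, not a ready-made link:

```javascript
// Build the anchor markup at runtime; a spider reading the raw HTML
// sees a <script> block rather than an <a> tag.
function writeLink(url, text) {
  return '<a href="' + url + '">' + text + "</a>";
}

// In the page itself:
//   <script>document.write(writeLink("http://www.site.com/", "Link"));</script>
```

Note that the URL itself still appears verbatim in the script source, so a plain text search would find it; splitting the URL into pieces closes that gap.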
The code is:
<a href='javascript:void window.open("http://www.site.com", "_self");'>Link</a>
I've used JS links also, and when I check them in a search engine spider simulator they show as [javascript:void...]
The code is: <a href='javascript:void window.open("http://www.site.com", "_self");'>Link</a>
I think that as long as Google recognizes it as a link of any kind (even "http://javascript:void/"), it is considered in the PR calculation algo. Inbound links increase PR; outbound links decrease it.
Think of it this way: forget HTML and consider the page to be simply text. Finding absolute URLs on the page is easy, and if the search engine spiders wish to do so they can follow those URLs whether or not they are links.
So, the only way to ensure that such URLs are not followed is to use obfuscation, and this is most easily done by creating the links with JavaScript document.write statements that emit plain old HTML links.
Of course, technically, spiders could run the JavaScript, but this is unlikely. However, you could go one step further by using the OnClick event to launch a JavaScript function that creates a URL on the fly (from, say, a domain name and a page address) and launches it. There is absolutely no way that a spider will cope with this. BUT the key is still to consider the page as plain text: if a simple text search yields URLs, then those URLs could, theoretically, be followed by spiders.
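A minimal sketch of that OnClick idea (the host and page names are invented): the URL is assembled only when the user clicks, so no complete URL ever appears in the page source for a text search to find:

```javascript
// Glue the URL together from fragments at click time.
function buildUrl(host, page) {
  return "http://" + host + "/" + page;
}

// Wired up in the page as, e.g.:
//   <a href="#" onclick="window.location.href = buildUrl('www.example.com', 'links.html'); return false;">Links</a>
```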
Hope this helps,
Kaled.
Someone recently posted that Googlebot took his external .js file along with the other pages. This wasn't common practice in the past, as I understand it. So it might indicate that Google now fetches and examines JavaScript too. It should be a piece of cake for Google to handle JavaScript.
I think Patrick_Taylor's Flash example would do the trick, though. I wonder how long that will last (since Google and Macromedia already have some kind of Flash spidering arrangement going on).
Darko
"Should be a piece of cake for google to handle javascript."
That rather depends on what is meant by "handle JavaScript". Interpreting a scripted language requires a lot of CPU power. I very much doubt that any spiders currently operating have enough CPU cycles to spare to do much with JavaScript. The most rudimentary obfuscation of the URLs should be more than enough to defeat any spider. With a little imagination, I think you could come up with a system that would defeat spider technology for the next ten years or so (unless the spider specifically targeted your algos).
I've used javascript to create HTML code. It really is not that difficult.
Kaled.
PS
There are a lot of CGI links pointing to my site. These are not read by spiders (drat, drat and double drat).
If links can be hidden from the search engines, I'm going to start exchanging links with any site, including link farms. I don't recall anyone who wanted to trade links ever asking if the link would be picked up by Google and the other search engines.
That won't help you with link farms, because they have bots that check whether you're linking back to them, and those bots won't see the link. They usually specifically require an <a href="URL">DESC</a> link.
kaled,
I guess it's all guesswork until we test it. But since people are reporting that Googlebot is taking their .js files, Google must have some agenda for them.
Darko
I.e., the URL http://www.xyz.com/dir/s.asp?l=111 actually leads to abc.com, and if we check the backlinks for abc.com, the link http://www.xyz.com/dir/s.asp?l=111 does appear.
Examples used above are fictitious ;)