Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
unusual
wilderness

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month



 
Msg#: 4429425 posted 9:52 am on Mar 15, 2012 (gmt 0)

Anybody have a clue?
The first one I simply accepted as trash.
The second one makes me wonder, although CC has many open proxies.

I've broken the link in the UA.
This page is slightly related (at least in name) to an active news topic.

112.119.106.zz - - [15/Mar/2012:00:52:38 +0000] "HEAD /MyFolder/MySub/MyPage.html HTTP/1.1" 403 - "http:// googlenewssubmit. com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; Media Center PC 6.0; InfoPath.2; MS-RTC LM 8)"

173.12.249.zzz - - [15/Mar/2012:06:28:45 +0000] "HEAD /SameFolder/SameSub/SamePage.html HTTP/1.0" 200 - "http:// googlenewssubmit. com/how-can-it-help-me/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; OfficeLiveConnector.1.4; OfficeLivePatch.1.3; yie8)"

 

DeeCee



 
Msg#: 4429425 posted 4:03 pm on Mar 15, 2012 (gmt 0)

Googlenewssubmit is a commercial press release company, promising they can get you to the top of Google News using various (invalid?) methods.

They are essentially advertising themselves by sending referrer spam into your logs, hoping that you will check out their prices and buy their "services". In itself an "invalid method". :-)

In my categorizations they fall under "link_spammer".
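
For anyone who wants to tag these automatically, here is a minimal sketch rather than a finished tool. It assumes the logs are in the combined format shown in the first post. The `SPAM_REFERRERS` table and the `classify` function are hypothetical names made up for illustration; only the "link_spammer" label comes from the categorization above. Whitespace is stripped from the referrer so that a deliberately broken link still resolves to a domain.

```python
import re
from urllib.parse import urlparse

# Combined log format: host ident user [time] "request" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Hypothetical blocklist; extend it with whatever domains show up in your own logs.
SPAM_REFERRERS = {"googlenewssubmit.com": "link_spammer"}


def classify(line):
    """Return (host, referrer domain, category), or None if the line does not parse."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    # Drop whitespace so a broken link such as "http:// example. com/" still parses.
    referer = re.sub(r"\s+", "", m.group("referer"))
    domain = (urlparse(referer).hostname or "").lower()
    if domain.startswith("www."):
        domain = domain[4:]
    return m.group("host"), domain, SPAM_REFERRERS.get(domain, "unclassified")


if __name__ == "__main__":
    sample = (
        '112.119.106.zz - - [15/Mar/2012:00:52:38 +0000] '
        '"HEAD /MyFolder/MySub/MyPage.html HTTP/1.1" 403 - '
        '"http:// googlenewssubmit. com/" "Mozilla/4.0 (compatible; MSIE 8.0)"'
    )
    print(classify(sample))
```

An exact-match lookup on the domain is used here instead of a substring test, so that an unrelated referrer that merely contains the same text is not mislabeled.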

enigma1

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4429425 posted 3:08 pm on Mar 23, 2012 (gmt 0)

The problem is that SEs nowadays pull URLs out of the page content (not just HTML anchors), and unfortunately this thread, with its posted log records, is no exception. Perhaps that's one reason they spam this way.

wilderness

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month



 
Msg#: 4429425 posted 3:19 pm on Mar 23, 2012 (gmt 0)

enigma1,
Could you expand?

Are the SEs pulling out URLs that are NOT embedded links?

enigma1

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4429425 posted 4:55 pm on Mar 23, 2012 (gmt 0)

There are rumors they do. There are several threads implying this and at least from my logs I see weird accesses.

This post talks about Googlebot trying to interpret JS, but what's important to note is that it parses content and "interprets it".
[webmasterworld.com...]
[webmasterworld.com...]
and various other threads I cannot recall right now.

lucy24

WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributor of the Month



 
Msg#: 4429425 posted 9:36 pm on Mar 23, 2012 (gmt 0)

Are the SEs pulling out URLs that are NOT embedded links?

Yes, there are mountains of them in GWT error pages-- sometimes even when there is a link wrapped around the damaged text.

Concrete example under "not found":

hovercraft/h..

That's quoted verbatim, dots and all. If you follow the "linked from" links you arrive eventually at a page with the world's spammiest meta tags and a list of urls, including--

Wait, I've got to do some more verbatim quoting under the vague head of "With friends like these..."

<td width="580">
<div class="msnresult">
<div style="margin-bottom:5px; padding-left: 8px;">
<a href="http://www.example.com/{directory}/{filename}.html" target="_blank" class="msneresult" rel="nofollow">{page title} - {my domain name}</a>
<div class="msnresultcnt">
{text of my meta description}</div><span class="msnresulturl">http://www.example.com/hovercraft/h...</span></div></div></td>


Notice (a) the teeny-weeny detail that the "not found" version snips off one more dot-- truncated urls on the page always have three-- and (b) they seem to have decided that "nofollow" doesn't count on this page. The dot-snipping doesn't kick in after a fixed number of characters, though it may be some physical length in pixels. I ran out of interest at this point ;)

What's notable is that the "real" link, unsnipped, is only two lines away. But that one doesn't count as a link. (I checked a different GWT page.)

Someone, somewhere, programmed a computer to make these decisions.
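
Purely as speculation about where the missing dot goes, here is a small sketch of a text-based URL extractor. Everything in it is an assumption: the regex, the `extract_text_urls` name, and above all the rule that a single trailing period is treated as sentence punctuation. Nothing here is known to be what any search engine actually does. It only shows that such a rule, applied to visible text while ignoring `href` attributes, would turn the three-dot snippet into the two-dot "not found" entry.

```python
import re

# Assumed pattern for URL-like text; a real crawler is certainly more elaborate.
URL_TEXT_RE = re.compile(r'https?://[^\s<>"]+')


def extract_text_urls(html):
    """Find URL-like strings in visible text, ignoring href attributes entirely."""
    # Crude tag stripping: this also discards the href values inside the tags.
    text = re.sub(r"<[^>]+>", " ", html)
    urls = []
    for candidate in URL_TEXT_RE.findall(text):
        # Assumption: exactly one trailing period is dropped as punctuation.
        if candidate.endswith("."):
            candidate = candidate[:-1]
        urls.append(candidate)
    return urls


snippet = (
    '<a href="http://www.example.com/{directory}/{filename}.html" '
    'rel="nofollow">{page title}</a> '
    '<span class="msnresulturl">http://www.example.com/hovercraft/h...</span>'
)
print(extract_text_urls(snippet))
# prints ['http://www.example.com/hovercraft/h..']
```

The real link in the `href` never appears in the output, because the sketch only looks at text between tags, which matches the observation that the unsnipped link is not the one being reported.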

