
Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
unusual
wilderness




msg:4429427
 9:52 am on Mar 15, 2012 (gmt 0)

Anybody have a clue?
The first one I simply accepted as trash.
The second one makes me wonder, although CC has many open proxies.

I've broken the link in the UA.
This page is loosely related (at least in name) to an active news topic.

112.119.106.zz - - [15/Mar/2012:00:52:38 +0000] "HEAD /MyFolder/MySub/MyPage.html HTTP/1.1" 403 - "http:// googlenewssubmit. com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; Media Center PC 6.0; InfoPath.2; MS-RTC LM 8)"

173.12.249.zzz - - [15/Mar/2012:06:28:45 +0000] "HEAD /SameFolder/SameSub/SamePage.html HTTP/1.0" 200 - "http:// googlenewssubmit. com/how-can-it-help-me/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; OfficeLiveConnector.1.4; OfficeLivePatch.1.3; yie8)"

 

DeeCee




msg:4429591
 4:03 pm on Mar 15, 2012 (gmt 0)

Googlenewssubmit is a commercial press release company, promising they can get you to the top of Google News using various (invalid?) methods.

They are essentially advertising themselves by sending referrer spam into your logs, hoping that you will check out their prices and buy their "services". In itself an "invalid method". :-)

In my categorizations they fall under "link_spammer".
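For anyone who wants to tag these hits automatically, here is a minimal sketch of that kind of categorization. It assumes the logs are in the combined format shown in the first post; the `SPAM_REFERERS` table and the `classify` function are hypothetical names made up for illustration, not part of any existing tool. Only the "link_spammer" label comes from the post above.

```python
import re
from urllib.parse import urlparse

# Combined log format: ip ident user [time] "request" status size "referer" "agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Hypothetical lookup table: referrer host -> category label.
SPAM_REFERERS = {"googlenewssubmit.com": "link_spammer"}


def classify(line):
    """Return a category label for a log line, or None if nothing matches."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    # The log lines quoted in this thread have the referrer deliberately
    # broken with spaces, so remove them before parsing.
    referer = m.group("referer").replace(" ", "")
    host = (urlparse(referer).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return SPAM_REFERERS.get(host)
```

Feeding it the first log line from this thread returns "link_spammer"; a line with an empty or "-" referrer returns None. A real setup would need a longer table and probably subdomain matching.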

enigma1




msg:4432597
 3:08 pm on Mar 23, 2012 (gmt 0)

The problem is that SEs nowadays pull URLs out of the page content (not just HTML anchors), and unfortunately this thread, by posting log records, is no exception. Perhaps that's one reason they spam this way.

wilderness




msg:4432600
 3:19 pm on Mar 23, 2012 (gmt 0)

enigma1,
Could you expand?

Are the SEs pulling out URLs that are NOT embedded links?

enigma1




msg:4432642
 4:55 pm on Mar 23, 2012 (gmt 0)

There are rumors they do. There are several threads implying this and at least from my logs I see weird accesses.

This post talks about Googlebot trying to interpret JS, but what's important to note is that it parses content and "interprets it".
[webmasterworld.com...]
[webmasterworld.com...]
and various other threads I cannot recall right now.

lucy24




msg:4432741
 9:36 pm on Mar 23, 2012 (gmt 0)

Are the SEs pulling out URLs that are NOT embedded links?

Yes, there are mountains of them in gwt error pages-- sometimes even when there is a link wrapped around the damaged text.

Concrete example under "not found":

hovercraft/h..

That's quoted verbatim, dots and all. If you follow the "linked from" links you arrive eventually at a page with the world's spammiest meta tags and a list of urls, including--

Wait, I've got to do some more verbatim quoting under the vague head of "With friends like these..."

<td width="580">
<div class="msnresult">
<div style="margin-bottom:5px; padding-left: 8px;">
<a href="http://www.example.com/{directory}/{filename}.html" target="_blank" class="msneresult" rel="nofollow">{page title} - {my domain name}</a>
<div class="msnresultcnt">
{text of my meta description}</div><span class="msnresulturl">http://www.example.com/hovercraft/h...</span></div></div></td>


Notice (a) the teeny-weeny detail that the "not found" version snips off one more dot-- truncated urls on the page always have three-- and (b) they seem to have decided that "nofollow" doesn't count on this page. The dot-snipping doesn't kick in after a fixed number of characters, though it may be some physical length in pixels. I ran out of interest at this point ;)

What's notable is that the "real" link, unsnipped, is only two lines away. But that one doesn't count as a link. (I checked a different gwt page.)

Someone, somewhere, programmed a computer to make these decisions.
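
Purely as a guess at how the missing dot could come about (this is not a description of what any search engine actually does): a naive extractor that ignores the markup, scans only the visible text for things shaped like URLs, and treats one trailing dot as sentence punctuation would behave this way. The function name and both regexes below are made up for illustration.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")      # crude tag stripper, fine for a sketch
URL_RE = re.compile(r"https?://\S+")  # anything URL-shaped in plain text


def urls_in_visible_text(html):
    """Collect URL-like strings from visible text only, ignoring href values."""
    text = TAG_RE.sub(" ", html)
    found = []
    for m in URL_RE.finditer(text):
        url = m.group(0)
        # Assumption: one final dot is taken to be end-of-sentence punctuation.
        if url.endswith("."):
            url = url[:-1]
        found.append(url)
    return found


snippet = (
    '<a href="http://www.example.com/dir/file.html" rel="nofollow">Title</a>'
    '<span class="msnresulturl">http://www.example.com/hovercraft/h...</span>'
)
print(urls_in_visible_text(snippet))
# ['http://www.example.com/hovercraft/h..']
```

Under those assumptions the href is never seen at all, and the three-dot display text comes out with two dots, which matches the "not found" entry.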

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved