


ASP magically makes content disappear?

     
9:18 pm on Sept 23, 2009 (gmt 0)

New User

10+ Year Member

joined:Jan 20, 2008
posts:29
votes: 0


I'm not at all familiar with ASP, but I've come across this problem more than once and it's really bugging me now. The problem is that I wanted to automate something in PHP, using cURL to grab info from the web.

What's happening is that the content I get back is missing data compared with what the same request returns in a browser. I think the first few sites where I encountered this were using ASP, and I know the current site I'm working on is (it's ASPX).

How is this done?

If anyone is curious about what my project is: I'm an affiliate for a company and I really don't like the way they display their product information, so I wanted to redisplay it in the style of my website. They have 1,000+ products I need to port over, and I can't submit the POST request properly because some form data is missing.

9:40 pm on Sept 23, 2009 (gmt 0)

Administrator

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month

joined:Jan 14, 2004
posts:864
votes: 3


It sounds like you are trying to copy data off another website that you do not operate and use it on your own site.

Most webmasters take a rather dim view of this and consider it theft of content they created. I am assuming the website you are trying to copy from with your PHP script has some anti-scraping code in place to stop you from copying the pages in an automated fashion.

Also, a great number of websites check which browser is making the request and format the resulting HTML output specifically for that browser. They may omit or add certain items in the HTML based on it.
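For what it's worth, the first thing to rule out on the scraping side is the request headers. A minimal sketch of sending browser-like headers with PHP/cURL (the URL and user agent string here are only placeholders, not the actual site):

<?php
// Minimal sketch: send a browser-like User-Agent and Accept headers so the
// server formats the page the way it would for a real browser.
// The URL and UA string below are placeholders only.
$ch = curl_init('http://www.example.com/product.aspx?id=123');

curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    // identify as a mainstream browser instead of cURL's default UA string
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3',
    CURLOPT_HTTPHEADER     => array(
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: en-US,en;q=0.5',
    ),
));

$html = curl_exec($ch);
curl_close($ch);

If the response still differs from what the browser receives, the difference is more likely in the cookies or the form data than in the headers.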

3:28 am on Sept 24, 2009 (gmt 0)

New User

10+ Year Member

joined:Jan 20, 2008
posts:29
votes: 0


I wrote some PHP code that essentially does the job cURL does, back before I knew cURL existed. It's morphed over the years and has for quite some time sent requests that are pretty similar to what a browser would send; the only thing I don't do is use persistent connections. The script doesn't bombard the server with requests either: it'll take almost two days to grab ~1,700 product sheets.

My intentions are pretty benign. I mostly want users to stay on my site instead of leaving and searching for other affiliates (we can't change the price of the products, just how we present them and the copy). I also think the time I spent on my design has paid off and it looks more professional than the company's site.

Another thing is that some people are spooked by buying from an MLM company. In this case I'm only trying to retail the products; I have no interest in having a gazillion people underneath me, and it's a lot fewer headaches. So I'm trying to distance myself by giving my product website its own unique style and copy.

I guess no matter how benign, it's still the same "evil" deed.

What I know is missing from the HTML form is two hidden variables. But these aren't used in the request I'm making, and I've added them in anyway with blank values (... "&name1=&name2=&" ...), which is what my browser sent. I suspect there's some data missing from another hidden field that holds between 6 and 10 kB of data.
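Since the site is ASPX, I'm guessing that big hidden field is the usual ASP.NET __VIEWSTATE blob (often with a smaller __EVENTVALIDATION beside it), and the POST probably needs to echo those back. Something like this is what I have in mind; the URL and the visible field names are made up, only the hidden-field handling is the point:

<?php
// Rough sketch of the ASP.NET WebForms round trip: GET the form page,
// lift out every hidden input (__VIEWSTATE, __EVENTVALIDATION, etc.),
// then send them back in the POST along with the visible fields.
// The URL and visible field names below are placeholders, not the real site.
$url = 'http://www.example.com/products.aspx';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);

// crude but workable: capture the name/value of every hidden input on the page
preg_match_all(
    '/<input[^>]+type="hidden"[^>]+name="([^"]+)"[^>]*value="([^"]*)"/i',
    $html,
    $m
);
$fields = array_combine($m[1], $m[2]);   // e.g. __VIEWSTATE, __EVENTVALIDATION

// add the visible fields the form actually submits (names are guesses)
$fields['txtSearch'] = 'widget';
$fields['btnSearch'] = 'Search';

// POST everything back to the same page, as the browser would
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
$result = curl_exec($ch);
curl_close($ch);

The important part is that the hidden values come from a fresh GET of the same page the POST goes back to, since __VIEWSTATE isn't necessarily stable between requests.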

I'll plug away at this for a bit. Thanks. You got my brain churning :)

8:33 am on Sept 24, 2009 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11771
votes: 224


the servers might be doing some user agent cloaking or using some other method of detecting a scraper vs a live visitor.
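if it is cloaking, a browser-like user agent alone may not be enough. holding onto the session cookie and sending a sensible referer sometimes makes the difference; a rough sketch, with placeholder urls:

<?php
// Sketch: keep the session cookie and send a Referer across two requests,
// so the second request looks like part of a live browsing session.
// The URLs below are placeholders.
$jar = tempnam(sys_get_temp_dir(), 'cookies');

$ch = curl_init('http://www.example.com/products.aspx');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEJAR      => $jar,   // save any session cookie the site sets
    CURLOPT_COOKIEFILE     => $jar,   // ...and send it back on later requests
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3',
));
$listing = curl_exec($ch);

// follow-up request in the same "session", with a plausible Referer
curl_setopt($ch, CURLOPT_URL, 'http://www.example.com/detail.aspx?id=42');
curl_setopt($ch, CURLOPT_REFERER, 'http://www.example.com/products.aspx');
$detail = curl_exec($ch);
curl_close($ch);

comparing what the browser sends against what the script sends usually narrows down which piece the server is keying on.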