Scraping away profits

01.10.2015
At technology company Graphiq, web-scraping bots were becoming more than just a nuisance. They were impacting its bottom line.

The company collects and interprets billions of data points from thousands of online data sources and turns them into visual graphs that website visitors can use for free. Scrapers were extracting data from hundreds of millions of these pages and building duplicate sites. 

“We don’t want people to reuse [our data] commercially for free because there is a cost associated with creating that content,” says Ivan Bercovich, vice president of engineering. “It also undermines the value of our content” and steals traffic away from its site. Then there are the operational costs associated with blocking those web-scraping attempts. “We may have months where we block 5% to 6% of all requests,” Bercovich says. “For a site of our volume, about 30 million visitors a month, that’s a lot of wasted requests.”

Web scraping is on the rise, especially attacks on businesses to steal intellectual property or competitive intelligence. Scraping attacks increased 17 percent in 2014, the fifth straight year of growth, according to a report from ScrapeSentry, an anti-scraping service. Some 22 percent of all site visitors are considered to be scrapers, the report found.

“One of the big things driving this up is that it’s just getting easier,” says Gus Cunningham, CEO at ScrapeSentry. Would-be scrapers can pay for commercial scraping services, write the code themselves using step-by-step online tutorials or even get free automated tools that do all the work.

Web-scraping tools are out in the open because web scraping is legal in some cases, such as gathering data for personal use. But that openness also creates a loophole for nefarious scrapers, and a security hole for companies that don’t update their legal terms or their IT security processes.

“A lot of people are under the misconception that this kind of thing is considered ‘fair use,’ which is absolutely incorrect,” says Michael R. Overly, a partner and intellectual property lawyer focusing on technology at Foley & Lardner LLP in Los Angeles. “They think that because [the data provider’s] website doesn’t require any payment of fees there’s this exception. In general, if you (the scraper) are selling ads on your site, even if your end users don’t pay you any money, you’re getting revenues from ad displays. It’s a commercial purpose, so it’s highly unlikely it’s going to be fair use.”

Ticketmaster and Massachusetts Institute of Technology have successfully gone after scrapers of their data who claimed that their actions were fair use or didn’t violate copyright laws.

Today’s botnets take web scraping to a whole new and elusive level. “The reward-to-risk ratio in cybercrime is actually pretty good, and that’s why we’re seeing an uptick in volume in web scraping,” says Morgan Gerhart, vice president of product marketing at cybersecurity company Imperva. “Generally 60 percent of traffic that hits a website is bots. Not all bot traffic is bad, but we know half of that is from somebody that is up to no good.” Random spikes in bot traffic reduce website performance and increase infrastructure costs, which in turn hurts the user experience.

Web scraping is not going away, Bercovich says, but companies can take several steps to fight back. Botnets come fast and furious, in large volumes, and usually slow down systems. “If they’re filtering at superhuman speeds, or paginating quickly or never scroll down the page,” then it’s probably a bot, he says. Even after bots are detected and blocked, the fight is rarely over. “They’ll try a few variations to see if they can escape detection, but by then we’re totally on top of it,” Bercovich adds.
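
As a rough illustration of the kind of behavioral check Bercovich describes, the sketch below flags sessions that page through results at superhuman speed. The Request shape and both thresholds are hypothetical, not Graphiq’s actual rules:

```python
from dataclasses import dataclass

@dataclass
class Request:
    timestamp: float  # seconds since the session started
    page: int         # pagination index requested

def looks_like_bot(requests, max_rate=3.0, max_pages_per_min=30.0):
    """Flag a session whose request rate or pagination speed is superhuman."""
    if len(requests) < 2:
        return False
    duration = requests[-1].timestamp - requests[0].timestamp
    if duration <= 0:
        return True  # many requests in the same instant
    rate = len(requests) / duration  # requests per second
    pages_per_min = len({r.page for r in requests}) / (duration / 60.0)
    return rate > max_rate or pages_per_min > max_pages_per_min

# Example: 50 pages in under 10 seconds is not human browsing.
session = [Request(timestamp=i * 0.2, page=i) for i in range(50)]
print(looks_like_bot(session))  # True
```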

A multi-layer defense is the best offense to combat web-scraping bots, Gerhart says.

Application-level intelligence

A product with application-level intelligence can look at traffic to determine if it’s a browser on the other end or a bot. “People who are good at [scraping] can make it look like a browser,” Gerhart says. “That’s when you get into more sophisticated behavior.”
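
A minimal sketch of one such application-level check: real browsers send a predictable set of headers, and many off-the-shelf tools announce themselves in the User-Agent string. The header list and tokens here are illustrative, and, as Gerhart notes, a good scraper can spoof all of this:

```python
# Headers a real browser reliably sends; naive scrapers often omit some.
BROWSER_HEADERS = {"user-agent", "accept", "accept-language", "accept-encoding"}
# Tokens that common off-the-shelf tools put in their User-Agent string.
SCRAPER_TOKENS = ("curl", "wget", "python-requests", "scrapy")

def headers_look_like_browser(headers: dict) -> bool:
    lowered = {name.lower(): value for name, value in headers.items()}
    if not BROWSER_HEADERS.issubset(lowered):
        return False
    user_agent = lowered["user-agent"].lower()
    return not any(token in user_agent for token in SCRAPER_TOKENS)

print(headers_look_like_browser({"User-Agent": "python-requests/2.31"}))  # False
```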

Rate limiting

Not all web scraping is inherently bad, Gerhart says, “but you don’t want web scraping traffic to interfere with other users.” On a per-connection basis, limit a user’s actions to no more than X actions in a given window of time, he says. “Even if you’re OK with scraping of your site, you may not want it at a rapid pace so it overwhelms your CPUs.”
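
A toy version of such a per-client limit, using a sliding window keyed by IP address; the 60-requests-per-minute figure is illustrative, not a recommendation:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60  # illustrative: at most 60 requests per rolling minute
MAX_REQUESTS = 60
_recent = defaultdict(deque)  # client key (e.g. IP) -> recent request times

def allow_request(client_ip: str) -> bool:
    now = time.monotonic()
    history = _recent[client_ip]
    # Evict timestamps that have aged out of the window.
    while history and now - history[0] > WINDOW_SECONDS:
        history.popleft()
    if len(history) >= MAX_REQUESTS:
        return False  # over the limit; the server would answer 429 Too Many Requests
    history.append(now)
    return True
```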

Obfuscate data

Render the data meaningless to the person who is scraping it. Displaying web content as images or Flash files can deter site scrapers, “although more sophisticated scrapers can get around it,” Gerhart says. Another option: applications can assemble text using scripts or style sheets, since most scraping tools cannot interpret JavaScript or CSS.
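
For the image approach, here is a minimal sketch using Python’s Pillow library; the function name and sizing are assumptions, and, per Gerhart’s caveat, a scraper with OCR can still defeat it:

```python
from io import BytesIO
from PIL import Image, ImageDraw  # Pillow

def text_as_png(value: str) -> bytes:
    """Render a string as a PNG so HTML-only scrapers see an <img> tag
    instead of machine-readable text. Sizing is rough and illustrative."""
    image = Image.new("RGB", (10 * len(value) + 20, 30), "white")
    ImageDraw.Draw(image).text((10, 8), value, fill="black")
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()
```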

Other deterrents include constantly changing HTML tags to deter repeated scraping attacks, and using fake web content, images, or links to catch site scrapers who republish the content, Gerhart says.
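
A small sketch of those two tricks combined, with hypothetical names throughout: class names that change on every deploy, which breaks scrapers keyed to fixed HTML selectors, plus an invisible honeypot link that only automated crawlers will follow:

```python
import secrets

# Regenerated on each deploy, so selectors hard-coded in a scraper go stale.
CLASS_ALIASES = {
    "title": f"c-{secrets.token_hex(4)}",
    "price": f"c-{secrets.token_hex(4)}",
}
# Hidden link no human ever sees or clicks; only bots follow it.
HONEYPOT = '<a href="/trap/do-not-follow" style="display:none">.</a>'

def render_row(title: str, price: str) -> str:
    return (
        f'<div class="{CLASS_ALIASES["title"]}">{title}</div>'
        f'<div class="{CLASS_ALIASES["price"]}">{price}</div>'
        f"{HONEYPOT}"
    )

# Server side: any client that requests /trap/do-not-follow gets flagged,
# since no browser-driven human ever renders or follows that link.
```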

Safety in numbers

Some companies combat distributed botnets by partnering with large service providers that see a big portion of all requests on the internet. Those providers can spot attack patterns, collect the offending IP addresses and block them for all of their clients. Graphiq chose to outsource its bot protection to a provider with broader knowledge of scraping attacks.
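
In code terms, the shared-intelligence model boils down to checking each inbound address against a provider-maintained denylist. A toy version using Python’s ipaddress module, with example ranges drawn from the RFC 5737 documentation blocks:

```python
import ipaddress

def load_blocklist(lines):
    """Parse a provider feed of one CIDR block per line."""
    return [ipaddress.ip_network(line.strip()) for line in lines if line.strip()]

def is_blocked(ip: str, blocklist) -> bool:
    address = ipaddress.ip_address(ip)
    return any(address in network for network in blocklist)

# A stand-in for the feed a protection service would distribute to clients.
feed = ["203.0.113.0/24", "198.51.100.7/32"]
blocklist = load_blocklist(feed)
print(is_blocked("203.0.113.9", blocklist))  # True
```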

Legal protection

Scrapers and botnet users are extremely hard to find and prosecute, security experts say. Still, companies have to lay the groundwork for legal action by clearly stating in their website’s terms and conditions of use that web scraping or automated cataloging is prohibited, Overly says.

The second line of legal defense is copyright law. When scrapers make off with material on a site, they are infringing on the site owner’s copyright. Website owners don’t even have to prove that scraping led to any real harm, Overly says. “They can simply show that it was intentional, and they get mandated damages from the copyright act, which can be very substantial.”

Today, Graphiq “rarely if ever” has its data stolen by web scrapers, Bercovich says, but the company will never be able to eliminate botnet attempts entirely. “You can only detect and block them so they don’t get your content,” he says. “The more effectively you can do that, the more understanding and good reporting that you have, the more quickly you can act.”

(www.csoonline.com)

Stacy Collett