Scrape password-protected sites with custom browser extensions

Fully-automated web scraping is very convenient: once your scraping or automation robot is programmed and launched, there is nothing left for you to do. Unfortunately, it comes with a few downsides that can affect projects targeting password-protected websites or applications.

The first downside is password handling. If your scraping project targets a website containing high-value data, you may want to avoid storing your password on a remote server whose security is outside your control.

Detectability is also a concern. Automating access to a website is sometimes frowned upon, even when it is legitimate. We are sometimes called in to scrape our clients’ own data from an application that doesn’t provide a comprehensive backup function. When the missing export function isn’t enough to lock customers into a product that turned out to be inferior, the application provider may even attempt to prevent automatic data extraction by recognizing (and banning) activity patterns that are not “human-like”.

A third issue arises when your scraping or web automation process is not fully programmable. Say there is an old and unwieldy web application that requires a lot of copy-pasting and other tedious activities to perform an essential function. While it would be possible to automate 90% of the work and save as much manpower, some steps may still require human judgment, so an autonomous program cannot take over entirely.

Enter custom browser extensions, or add-ons. Browser extensions are small programs that attach to your web browser to perform a wide variety of functions. Third-party toolbars, once very popular, are one example; the famous AdBlock Plus ad blocker is another. Being tied into your web browser gives add-ons particular powers:

  • they can access any website,
  • they can use your authenticated sessions without having to know your password and
  • they can be either autonomous (in the background) or interactive.

Interactive extensions can present you, the user, with graphical user interface elements, modify web pages, help you point and click on the bits of data to be scraped, and fill in form fields automatically. Autonomous extensions can run unsupervised on websites that require your credentials without you having to worry about the safety of your passwords: all you need to do is log in to the target websites and let the program do its job.
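
To make this concrete, here is a minimal sketch of what an interactive scraping helper can look like as a WebExtension content script. Everything in it is illustrative rather than taken from any particular project: the file name, the Alt+click capture convention and the fetchProtectedPage helper are all assumptions. The key point is that the script runs inside the authenticated page, so it never needs to see your password.

```ts
// content-script.ts — illustrative sketch of an interactive scraping helper.
// Declared under "content_scripts" in the extension's manifest.json, it is
// injected into the target site's pages and runs with the user's live,
// authenticated session.

const collected: string[] = [];

// Highlight the element under the cursor so the user can see what an
// Alt+click would capture.
document.addEventListener("mouseover", (e) => {
  (e.target as HTMLElement).style.outline = "2px solid orange";
});
document.addEventListener("mouseout", (e) => {
  (e.target as HTMLElement).style.outline = "";
});

// Alt+click captures the element's visible text instead of activating it.
document.addEventListener(
  "click",
  (e) => {
    if (!e.altKey) return;
    e.preventDefault();
    e.stopPropagation();
    collected.push((e.target as HTMLElement).innerText.trim());
    console.log(`captured ${collected.length} item(s)`);
  },
  true,
);

// An autonomous extension can go further: because requests made from the
// page context carry the site's session cookies automatically, protected
// pages can be fetched directly once the user has logged in. This helper
// is hypothetical, not part of any extension API.
async function fetchProtectedPage(url: string): Promise<string> {
  const resp = await fetch(url, { credentials: "include" });
  if (!resp.ok) throw new Error(`HTTP ${resp.status} for ${url}`);
  return resp.text();
}
```

From there, a toolbar popup or keyboard shortcut could export the collected items; the sketch stays inside the page to keep it self-contained.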

Because they are driven by user actions, interactive add-ons sidestep the detectability problem. Instead of replacing human users with robots, they free those users from their most repetitive and tedious tasks, letting them focus on the decisions they need to make.

Every browser add-on we’ve built and deployed has been a great success, either because it enabled a job that couldn’t be done otherwise or because it saved hundreds or thousands of expensive man-hours. Custom browser extensions come with their own downsides, however.

All other things being equal, building a browser add-on is more expensive than a regular web scraping robot; the additional expense is usually between 50% and 200%. Add-ons also require more effort on your part, since you at least have to install them and log in to your target sites. Once you’ve built an add-on, you are also constrained to the browser it was made for, although cross-browser extensions are a possibility, as sketched below.
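
Cross-browser support is usually a matter of coding against the standard WebExtensions API rather than a vendor-specific one. One common approach, sketched below under the assumption that the extension only needs the "tabs" permission, is Mozilla’s webextension-polyfill package, which lets the same promise-based code run on both Firefox and Chrome:

```ts
// background.ts — a minimal sketch of cross-browser extension code.
// Firefox implements the standard, promise-based `browser` namespace;
// Chrome exposes the callback-based `chrome` namespace. The polyfill
// bridges the gap so one codebase covers both.
import browser from "webextension-polyfill";

async function logActiveTab(): Promise<void> {
  // Requires the "tabs" permission in manifest.json to read the URL.
  const tabs = await browser.tabs.query({ active: true, currentWindow: true });
  console.log("active tab URL:", tabs[0]?.url);
}

logActiveTab();
```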

Apart from those caveats, custom browser add-ons are a very handy tool in the web scraping toolbox, to be considered whenever the job involves passwords or requires human interaction.

Article by Julien Demoor

A software developer who specializes in web applications and web scraping, Julien founded Stratalis in 2010.