Web scraping with your web browser: Why not?

8chananon.github.io

150 points by 8chanAnon 9 months ago

Includes working code. First article in a planned series.

> can you write a web scraper in your browser? The answer is: YES, you can! So why is nobody doing it?

Completely agree with this sentiment.

I just spent the last couple of months developing a Chrome extension, but recently also did an unrelated web scraping project where I looked into all the common tools like Beautiful Soup, Selenium, Playwright, Puppeteer, etc.

All of these tools were needlessly complicated and I was having a ton of trouble with sites that required authentication. I then realized it would be way easier to write some javascript and paste it in my browser to do the scraping. Worked like a charm!
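
For anyone wondering what "write some javascript and paste it in my browser" can look like, here is a minimal sketch. The selector, the `text`/`href` fields, and the helper names `extractLinks` and `toCsv` are illustrative assumptions, not code from the article or the comment above. `root` is normally `document`, so the script sees the page exactly as your logged-in session does.

```javascript
// Collect the text and URL of every link under `root`.
// `root` is anything with querySelectorAll(): normally `document`.
function extractLinks(root, selector = "a[href]") {
  return Array.from(root.querySelectorAll(selector), (a) => ({
    text: (a.textContent || "").trim(),
    href: a.href,
  }));
}

// Turn the rows into CSV text, quoting every field.
function toCsv(rows) {
  const quote = (value) => `"${String(value).replace(/"/g, '""')}"`;
  const lines = rows.map((row) => [row.text, row.href].map(quote).join(","));
  return ["text,href", ...lines].join("\n");
}

// In a browser console, on a page you are already logged in to:
if (typeof document !== "undefined") {
  console.log(toCsv(extractLinks(document)));
}
```

Because the code runs inside the page, there is no separate login flow to automate; the trade-off is that each run is started by hand.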

moritzwarhier 9 months ago

Is Playwright really that complicated?
I feel that when it has been set up, it's very straightforward to use.
Maybe in contrast to other solutions you posted? Not sure about that though; having only brief experiences with both, Playwright seems like an improved Cypress to me.

metadat 9 months ago

You might like Tampermonkey. You can add a button to kick it off or whatever your heart desires.
Tampermonkey also works around CORS issues with relative ease.
- crtasm 9 months ago
  
  Userscripts for the win. Consider Violentmonkey over Tampermonkey though.
  
  metadat 9 months ago
  
  > Consider Violentmonkey over Tampermonkey though.
  Why?
  
  MC995 9 months ago
  
  Probably because it looks to be closed source these days.
  https://github.com/Tampermonkey/tampermonkey
  > This repository contains the source of the Tampermonkey extension up to version 2.9. All newer versions are distributed under a proprietary license.
  
  panphora 9 months ago
  
  Open source
- joshdavham 9 months ago
  
  I love Tampermonkey! I've never made my own scripts though - just used other people's.

smallerfish 9 months ago

I wrote a prototype of a browser extension that scraped your bookmarks + 1 degree, and indexed everything into an in-memory search index (which gets persisted in localStorage). I took over the new tab page with a simple search UI, with instant type-ahead search.

Rough aspects:

a) It requires a _lot_ of browser permissions to install the extension, and I figured the audience who might be interested in their own search index would likely be put off by intrusive perms.

b) Loading the search index from localStorage on browser startup took 10-15s with a moderate number of sites; not great. Maybe it would be a fit for PouchDB or something else that makes IndexedDB tolerable. (Or wasm SQLite, if it's mature enough.)

c) A lot of sites didn't like being scraped (even with rate limiting and back-off), and I ended up being served an annoying number of captchas in my regular everyday browsing.

d) Some walled garden sites seem completely unscrapable (even in the browser) - e.g. LinkedIn.
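
For reference, the "rate limiting and back-off" mentioned in (c) usually amounts to something like the sketch below. The status codes treated as throttling (429 and 503), the default delays, and the names `backoffDelay` and `politeFetch` are assumptions for illustration. As the comment notes, this alone does not guarantee a site won't start serving captchas.

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelay(attempt, baseMs = 1000, maxMs = 60000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch a URL, waiting longer after each throttled response.
// `fetchFn` is injectable so the function can be exercised without a network.
async function politeFetch(url, { retries = 4, baseMs = 1000, fetchFn = fetch } = {}) {
  for (let attempt = 0; ; attempt++) {
    const response = await fetchFn(url);
    const throttled = response.status === 429 || response.status === 503;
    // Give up retrying once the response is fine or the budget is spent.
    if (!throttled || attempt >= retries) return response;
    await sleep(backoffDelay(attempt, baseMs));
  }
}
```

With the defaults, the waits are 1s, 2s, 4s and 8s before the last response is returned as-is.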

changing1999 9 months ago

In my experience building a browser-based scraper I preferred scraping pages by a direct in-browser visit rather than a fetch request. A direct visit from a real browser is basically undetectable by anti-bot software (unless you try to do something funny like automated deep crawling and scraping). So applied to your use case it would have to go through every bookmark + 1 degree to index it. Maybe even in an offscreen canvas (haven't tried that though, could be detectable).

8chanAnon 9 months ago

>Some walled garden sites seem completely unscrapable
Any examples besides LinkedIn? Tell me what sites you're trying to target and I'll have a look to see what can be done with them. It takes some pretty evil Javascript obfuscation to block me and only one site has been able to do that. I doubt that the sites you're hitting are anywhere near that evil, lol. I would appreciate it if you have a good example that I could use in a future article.
- smallerfish 9 months ago
  
  It's been ~18 months so I'm fuzzy on details. I remember gmail being tricky also.
  IIRC I ended up building an iframe based scraper for sites that didn't yield any content with just a fetch - and I think built a fallback mechanism so that if fetch didn't work, I'd queue it up in the iframe scraper. The problem with that is that there are various heavily used security headers that prohibit loading in an iframe. (And the reason for iframe vs just loading in a tab and injecting my extension's script is that I wanted it to be able to run "in the background" without being super distracting for the user - the tab changing favicon every second or two was pretty annoying.)

paulryanrogers 9 months ago

How often did it crawl? Once per day shouldn't trigger any blockers.

gmac 9 months ago

Yes: I find it surprising that this isn't a more widespread approach. It's how I've taught web scraping to my PhD students for some years.

https://github.com/jawj/web-scraping-for-researchers

hombre_fatal 9 months ago

It’s not widespread because it’s much more complicated than making an http request and reading the results from the body. You don’t spin up a browser, much less the full GUI, unless it’s a last resort.
- gmac 9 months ago
  
  Well, very much yes and no to that claim. Sure, for someone who’s comfortable in the shell, the first step is lighter with curl or wget. But the next step, parsing, is a lot less obvious. And it’s all much more likely to fail on websites that assume a (logged-in?) user in a browser, accepting cookies, executing JS, and so on.

hildenae 9 months ago

I understand that "with/in your web browser" implies an extension or similar, but I have good experience using Selenium and Python to scrape websites. Some sites are trickier than others, and when you are instrumenting a browser it easily triggers bot prevention, but you are also able to easily scrape pages that build the DOM using JS and similar. I have considered, but not looked into, compiling my own Firefox to disable e.g. navigator.webdriver, but it feels like a bit too much work.

This is my project for extracting my (your) webshop order & item data https://gitlab.com/Kagee/webshop-order-scraper

simlan 9 months ago

I also did something similar for my spring project. The idea was to buy a used car and I was frustrated with the BS the listing sites claimed as fair price etc.

I went the browser extension route and used Greasemonkey to inject custom JavaScript. I patched window.fetch, and because it was a React page it did most of the work for me, providing me with a slightly convoluted JSON doc every time I scrolled. Getting the data extracted was only a question of getting a Flask API with correct CORS settings running.

Thanks for posting. Using a local proxy for even more control could be helpful in the future.
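
A rough sketch of the window.fetch patch described above, assuming the page loads its data as JSON through `fetch`. The helper name `wrapFetch`, the `captured` array, and the content-type check are illustrative assumptions; the comment does not show its actual code. The script has to run in the page's own context, and how you arrange that depends on the userscript manager.

```javascript
// Wrap a fetch implementation so every JSON response is also handed to
// `onJson`. The caller still receives the original, unread response.
function wrapFetch(originalFetch, onJson) {
  return async function patchedFetch(...args) {
    const response = await originalFetch(...args);
    const type = response.headers.get("content-type") || "";
    if (type.includes("application/json")) {
      // clone() lets us read the body without consuming the page's copy.
      response
        .clone()
        .json()
        .then((data) => onJson(response.url, data))
        .catch(() => {}); // ignore bodies that fail to parse
    }
    return response;
  };
}

// In a userscript running in the page context:
if (typeof window !== "undefined") {
  const captured = []; // hypothetical in-memory store
  window.fetch = wrapFetch(window.fetch, (url, data) => captured.push({ url, data }));
}
```

From there, each captured document could be forwarded to a local server, which is where the CORS settings mentioned above come in.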

throwaway48476 9 months ago

Apparently there is no web extension API to inspect the body of a fetch response and you have to override window.fetch.
Seems like an omission in the spec.
- banditelol 9 months ago
  
  If we're talking about Chrome extensions, yes.
  But Firefox extensions expose an API to inspect the response stream.
  
  throwaway48476 9 months ago
  
  Can you link the MDN page for this?
  
  banditelol 9 months ago
  
  [dead]

linsomniac 9 months ago

There is an extension called "Amazon Order History Reporter" that will scrape Amazon to download your order history. I've used it a couple times and it works brilliantly.

seanwilson 9 months ago

> So the question is: can you write a web scraper in your browser? The answer is: YES, you can! So why is nobody doing it?

> One of the issues is what is called CORS (Cross-Origin Resource Sharing) which is a set of protocols which may forbid or allow access to a web resource by Javascript. There are two possible workarounds: a browser extension or a proxy server. The first choice is fairly limited since some security restrictions still apply.

I'm doing this for a browser extension that crawls a website from page to page checking for SEO/speed/security problems (https://www.checkbot.io/). It's been flexible enough, and it's nice not to have to maintain and scale servers for the web crawling. https://browserflow.app/ is another extension I know of that does scraping within the browser I think, and other automation.

your_friend 9 months ago

Interesting, I’ve tried Checkbot recently and it failed to crawl any Cloudflare-gated website, even 1 page. But maybe I’m on an old version.
- seanwilson 9 months ago
  
  Well, there's not a lot any crawler will be able to do if a website is gated with aggressive bot detection, e.g. Puppeteer via a proxy will have similar problems. Even if a bypass is found, it could break tomorrow. I've rarely had support messages about this, but most of them were resolved by adding IP addresses or user-agent/header strings to an allow list, or turning down how aggressive the bot detection is. Checkbot is more for crawling sites you have control over, so there are more options here.
  It is worrying what this means for the future for web crawlers in general though if most sites end up being gated to all bots that aren't from major search engines.

ggorlen 9 months ago

I wrote a similar post on in-browser scraping: https://serpapi.com/blog/dynamic-scraping-without-libraries/

My approach is a step or two more automated (optionally using a userscript and a backend) and runs in the console on the site under automation rather than cross-origin, as shown in OP.

In addition to being simple for one-off scripts and avoiding the learning curve of Selenium, Playwright or Puppeteer, scraping in-browser avoids a good deal of potential bot detection issues, and is useful for constantly polling a site to wait for something to happen (for example, a specific message or article to appear).

You can still use a backend and write to file, trigger an email or SMS, etc. Just have your userscript make requests to a server you're running.

gabrielsroka 9 months ago

Why do you need a proxy or to worry about CORS? Why not just point your browser to rumble.com and start from there?

I've posted here about scraping for example HN with JavaScript. It's certainly not a new idea.

2020: https://news.ycombinator.com/item?id=22788236

CharlieDigital 9 months ago

> Why do you need a proxy or to worry about CORS?

Not sure about OP, but you might want to point to a proxy depending on the site/content you are scraping and your location. For example, if you are in Canada but you want to scrape in USD, you might need to use a proxy located in the US to get US prices.

> Why not just point your browser to rumble.com and start from there?

Some endpoints use simple web application firewall rules that will block IPs. In this case, a rotating proxy can help evade the blocks (and prevent your legitimate traffic from being blocked). Some domains use more sophisticated WAFs like Imperva and will do browser fingerprinting so you'll need even more advanced techniques to scrape successfully.
Source: work at a startup that does a lot of scraping and these are issues we've run into. Our entire office network is blocked from some sites due to some early testing without a proxy.

ljw1004 9 months ago

In my web-scraping I've gravitated towards the "cheerio" library for javascript.

I kind of don't want to use DOMParser because it's browser-only... my web-scrapers have to evolve every few years as the underlying web pages change, so I really want CI tests, so it's easiest to have something that works in node.

datadrivenangel 9 months ago

I've been playing around with this idea lately as well! There are a lot of web interfaces that are hostile to scraping, and I see no reason why we shouldn't be able to use the data we have access to for our own purposes. CUSTOMIZE YOUR INTERFACES

flashgordon 9 months ago

Ah I remember doing this almost 20 years ago and even rotating through 1500 proxies to not get tripped up by DDoS detectors :). A plugin is one of the ways to scrape as it also looks like a human (i.e. more JS run, more divs loaded and so on).

turingfeel 9 months ago

If you want to get your personal IP and fingerprint blacklisted across major providers and large ranges, unfortunately this is how you do it. Just keep the rates low.

zarzavat 9 months ago

Obviously anyone scraping on their home IP is being foolish. Getting blacklisted is the least bad thing that can happen.
As for fingerprinting, you can just use a different computer. Most people probably have a bunch of old computers lying around, right? If not, computers are cheap.

acheong08 9 months ago

I actually did that with a firefox extension + containers to scrape ChatGPT a long while back (before the APIs)

https://github.com/acheong08/ChatGPT-API-agent

Worked pretty well, but browsers took up too much memory per tab, so automating thousands of accounts (what I wanted) was infeasible.

pimlottc 9 months ago

When I have to do some really quick ad-hoc webscraping, I often just select all text on the page, copy it, and then switch to a terminal window where I build a pipeline that extracts the part I need (using pbpaste to access the clipboard). Very quick and dirty for when you just need to hit a few pages.

ricardo81 9 months ago

I've found using a local proxy helps when using Puppeteer and a proxy. The way chrome authenticates to a proxy keeps the connection open which can sometimes mess up rotating proxy endpoints, and having to close/re-open browsers per page is just too inefficient.

changing1999 9 months ago

> can you write a web scraper in your browser? The answer is: YES, you can! So why is nobody doing it?

My guess would be that some companies are doing it (I worked at a major tech company that is/was), just not publicizing this fact as crawling/scraping is such a gray legal area.

chaosharmonic 9 months ago

> You can find plenty of tutorials on the Internet about the art of web scraping... and the first things you will learn about are Python and Beautiful Soup. There is no tutorial on web scraping with Javascript in a web browser...

Um... [0]

[0] https://bhmt.dev/blog/scraping

8chanAnon 9 months ago

Rather long so I'll read it later. Thanks for the tip. Got more or is that it?
- chaosharmonic 9 months ago
  
  Not that I know of, I'm just quipping a bit about my own work lol

dewey 9 months ago

I've read through that (hard to read, because of the bad formatting) but I still don't understand why you would do that instead of Playwright, Puppeteer etc. - The only reason seems to be "This technique certainly has its limits.".

8chanAnon 9 months ago

>bad formatting
If you can elaborate, I would very much appreciate it. I'm always interested in doing better.
Why use Puppeteer etc. when you don't have to? What is the argument for using these additional tools versus not using them?
- dewey 9 months ago
  
  You don't have max-width set on the text, so unless you have your browser window resized to a very small size the paragraphs will span your whole screen.
  
  pwg 9 months ago
  
  Which places you, the reader, fully in control of the width of lines you prefer to see. Adjust your browser window width, or apply a user style sheet to tell your browser to format the text the way you want to see it formatted.
  
  dewey 9 months ago
  
  I knew this reply would come... How many people do you think use custom browser stylesheets? It's probably smaller than 0.1% of the internet population and everyone moved on to formatting text so everyone can enjoy good readability. Also not all devices have the luxury of supporting custom stylesheets.
  Of course it's always up to the site owner, but most people want people to read what they share.
  
  nsonha 9 months ago
  
  > Adjust your browser window width, or apply a user style sheet
  very funny, both jokes
  
  8chanAnon 9 months ago
  
  I see. I don't have a wide-screen monitor (still using an old tube type until it finally expires but it's taking a few decades, lol). I've wondered whether people actually like reading websites on wide-screen. Some do and some don't. What would you suggest for a max width?
  You could also try zooming in. My apps don't expand to full width because of the video box but you can zoom.
  
  dewey 9 months ago
  
  There's a lot of info about this but usually 500-700px or ~80 characters will be much easier to read: https://ux.stackexchange.com/questions/108801/what-is-the-be...
  
  bityard 9 months ago
  
  I have the opposite problem. I typically have many browser windows open at the same time but only two screens, and many sites that I use are designed to assume that everyone has full-screen browsers.
  
  micahdeath 9 months ago
  
  Similar problem here. My resolution is 1440x900 (27in monitor) paired w/ 1280x720 (32in TV) and I keep 3 browsers open (Edge, Opera, Firefox - each has its own purpose). Each is at 3/4 width and 1/2 height and offset so I can see each partially.
  With this setup, many sites work, but a few... a few have a top ad banner, a side banner and a footer of 'cookie acceptance'... then add in a 'subscribe to our email' and a Google login prompt... (Game wikis. I game in smaller windows too. What good is a multitasking computer if you don't use it?)
- gabrielsroka 9 months ago
  
  The red text on a yellow background is not great. Neither is a serif font. Also it should be JavaScript with a capital S.
  
  8chanAnon 9 months ago
  
  Red text on yellow - you mean the website? Would you like the text to be darker?
  And Beautiful Soup should be BeautifulSoup. Who makes the rules?
  
  gabrielsroka 9 months ago
  
  Yes, the website. Better contrast is good. Black on white works great.
  Margins would also be nice on the left and right.
  Beautiful Soup is two words. Just look at their website.
  
  bityard 9 months ago
  
  Too much contrast makes it hard to read too. (With bonus eye strain.)
  
  8chanAnon 9 months ago
  
  Is there a happy in-between? Maybe not. What looks perfect to one user might appear atrocious to another. What is a poor website operator to do???
  I dislike black-on-white and don't understand gray-on-black which seems to be popular now due to gamma settings being cranked up to 11 or something. I try to use some color as an in-between but that may take some time to "perfect".
  
  gabrielsroka 9 months ago
  
  Even Google Chrome's Lighthouse said that your background and foreground colors do not have a sufficient contrast ratio.
  
  8chanAnon 9 months ago
  
  Every dark mode site in existence should fail that test.
  
  dewey 9 months ago
  
  No, because the contrast can still be good as the colors are just reversed?
  
  pwg 9 months ago
  
  > Is there a happy in-between?
  Since browsers allow users to configure a default font and background color then one possible "happy in-between" would be to set no background color, and set no font color, thereby allowing each user agent (i.e., browser) to display the site with that user's default background and font colors.
  In that case, each viewer should get their preferred colors, all without you doing anything.
- ricardo81 9 months ago
  
  Cloudflare et al do a lot of fingerprinting of the user agent. Any website that has 'high' anti-bot settings will return a 403 with anything but a browser. source: I've scraped lots of things.
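
On the contrast-ratio point raised earlier in this subthread: the number such checks report is, as far as I know, the WCAG contrast ratio, which can be computed by hand. The sketch below uses the WCAG 2.x formula; pure red and pure yellow are assumed stand-ins, since the site's actual color values are not given in the thread.

```javascript
// WCAG 2.x relative luminance of an sRGB color given as [r, g, b] in 0..255.
function relativeLuminance([r, g, b]) {
  const linear = (channel) => {
    const c = channel / 255;
    return c <= 0.03928 ? c / 12.92 : ((c + 0.055) / 1.055) ** 2.4;
  };
  return 0.2126 * linear(r) + 0.7152 * linear(g) + 0.0722 * linear(b);
}

// Contrast ratio between two colors: ranges from 1 (identical) to 21.
function contrastRatio(colorA, colorB) {
  const [lighter, darker] = [relativeLuminance(colorA), relativeLuminance(colorB)]
    .sort((x, y) => y - x);
  return (lighter + 0.05) / (darker + 0.05);
}

const black = [0, 0, 0];
const white = [255, 255, 255];
const red = [255, 0, 0];     // assumed stand-in for the site's text color
const yellow = [255, 255, 0]; // assumed stand-in for the site's background

console.log(contrastRatio(black, white).toFixed(2)); // 21.00
console.log(contrastRatio(white, black).toFixed(2)); // 21.00 (order does not matter)
console.log(contrastRatio(red, yellow).toFixed(2));  // 3.72
```

The ratio is symmetric, so swapping foreground and background for a dark mode leaves it unchanged. WCAG AA asks for at least 4.5:1 for normal-size text, which pure red on pure yellow would miss.
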

bdcravens 9 months ago

Solutions that want to automate in the context of their customers' browser. For example, ListPerfectly, a solution for cross-listing to eBay, Poshmark, etc, does this in their browser extension.
- dewey 9 months ago
  
  This is just regular browser extension behavior though, a bit different than the use case talked about in this post.

spullara 9 months ago

I love it when something like this reminds me of a project from forever ago...

https://github.com/spullara/browsercrawler

nsonha 9 months ago

Sorry, the format of this site is just too annoying for me to bother to read it. If this is about the shocking revelation that you can paste some code into the browser console, aka manually extracting information, then manually put that into whatever workflow you need that information for, then I don't think that is called web scraping, it's just browsing the web with code.

micahdeath 9 months ago

Excel/Word Macro using a WebBrowser object in a Form (old IE did this nicely; Haven't done that since Edge came out.)

deisteve 9 months ago

Is there anything that runs on WASM for scraping? The issue is that you need to enable flags and turn off other security features to scrape in your web browser, which is why it's not popular, but with WASM that might change.

8chanAnon 9 months ago

WASM runs in a sandbox. It can only talk to the outside world via Javascript so you can forget the idea that it might be a way to crack through browser security.
Maybe somebody will make a web browser with all of the security locks disabled. Sort of like the Russian commander in "The Hunt for Red October" who disabled his torpedoes' safety features in order to more effectively target the American sub but then got blown up by his own torpedo.

ttshaw1 9 months ago

How is this different from scraping in, say, Selenium in non-headless mode?

twelve40 9 months ago

I think Selenium's killer use case is (aside from legacy/inertia) cross-browser and cross-language. In exchange, it comes with a ton of its own baggage, since it's an additional layer in between you and your task, with its own Selenium-specific bugs, behavior limitations and edge cases.
If you don't need cross-browser and Chrome is all you need, then something like a simple Chrome extension and/or Chrome DevTools Protocol cuts out a lot of middle-man baggage and at least you will be wrangling the browser behavior directly, without any extra idiosyncrasies of middle layers.
- ttshaw1 9 months ago
  
  Sorry, I dropped the context when I decided to make this a top-level comment. Does scraping in a browser circumvent scraping protections that Selenium non-headless would get caught by?
  
  8chanAnon 9 months ago
  
  My next article will be on the topic of bypassing the Cloudflare bot protection. You can then compare with how Selenium handles this problem (if at all).

welder 9 months ago

Neo already did that in the Matrix:

https://www.youtube.com/watch?v=sjoad6gcRzs

squigz 9 months ago

This horrendous color scheme makes it impossible for me to read this.