How we built ngrok's data platform

164 points by samber 9 months ago

I found the technical details really interesting, but I think this gem applies more broadly:

> I find this is often an artifact of the DE roles not being equipped with the necessary knowledge of more generic SWE tools, and general SWEs not being equipped with knowledge of data-specific tools and workflows.

> Speaking of, especially in smaller companies, equipping all engineers with the technical tooling and knowledge to work on all parts of the platform (including data) is a big advantage, since it allows people not usually on your team to help on projects as needed. Standardized tooling is a part of that equation.

I have found this to be so true. SWE vs DE is one division where this applies, and I think it also applies for SWE vs SRE (if you have those in your company), data scientists, "analysts", basically anyone who is in a technical role should ideally know what kinds of problems other teams work on and what kinds of tooling they use to address those problems so that you can cross-pollinate.

anonzzzies 9 months ago

I too see this; I have a big hole in my DE knowledge, even though I manage a lot of data for our clients. I just work from experience and have been using more or less the same tech for decades (with upgrades and one major 'newer' addition; Clickhouse). I try to learn DE stuff, but I do find it particularly hard because i'm NOT a DE but an SWE, so I really quickly fall back on the tooling I already know and love and see very little reasons for anything else.
So is there a something like 'DE for SWE's' someone would recommend here?

1a527dd5 9 months ago

Blimey, that is a lot of moving parts.

Our data team currently has something similar and its costs are astronomical.

On the other hand our internal platform metrics are fired at BigQuery [1] and then we used scheduled queries that run daily (looking at the -24 hours) that aggregate/export to parquet. And it's cheap as chips. From there it's just a flat file that is stored on GCS that can be pulled for analysis.

Do you have more thoughts on Preset/Superset? We looked at both (slightly leaning towards cloud hosted as we want to move away from on-prem) - but ended up going with Metabase.

[1] https://cloud.google.com/bigquery/docs/write-api

otter-in-a-suit 9 months ago

I’m the author, but posting as a private individual here, these being just my options and all that… but I can shed some more light on why I did move us to Superset.
Preset is great, as are most of these tools’ hosted versions! Lots of great folks working on these.
But, tbh, as an infrastructure company this is somewhat the core business of ngrok - hosting another DB + K8s service is something that we have great tooling for and lots of expertise in the infra space. And using ngrok makes it even easier.
The whole dogfooding aspect is important too - if I don’t run an app in production with ngrok I have a hard time empathizing with customers who want to do the same. My previous job encouraged that too and I’ve always liked that.
Also, yes, lots of moving parts - but most of them are very reusable and they share a lot of code, infra, and logic/operations playbooks etc. Costs are manageable - Athena charges $5/TB scanned iirc, which tends to be the biggest factor.
- 1a527dd5 9 months ago
  
  Appreciate you taking the time to reply :)
  I guess the underlying tone of cynicism in my tone speaks to the question that I didn't directly ask - how often do each of the components/moving parts fail and require manual intervention/fixing?
  I often get pulled into complex distributed systems and the team responsible for that flow (data or not) often have no idea where to begin.
  Edit* On the point of Athena, I desperate wanted to use it but provide BigQuery to be much better in every way you could think of. It's the black sheep in the company, as every other cloud thing we have is AWS. But honestly, nothing I've found in AWS circle comes close to BigQuery.
  
  datadrivenangel 9 months ago
  
  BigQuery + Metabase is such a powerful combination. Easy, affordable, effective.
- spmurrayzzz 9 months ago
  
  I appreciate the time you took to write this all out (both the article and your response here). In particular, this line from the article resonated with my own experience over the last couple of decades:
  > This particular setup—viewing DE as a very technical, general-purpose distributed system SWE discipline and making the people who know best what real-world scenarios they want to model—makes our setup work in practice.
  The common analyst-to-DE path has some benefits for sure with respect to business-centric data modeling, but without the deep technical infrastructure investments and related support, the stack becomes a beast to deal with at scale (or just ends up being a massive cost on the balance sheet from outside vendor sourcing). You really need both verticals in order to be optimal IMO.
  Of course if internally an org doesn't already have the platform/infrastructure to dogfood in the first place, this admittedly makes the proposition a bit more of a gamble.
mritchie712 9 months ago

blimey indeed. This is a lot of work to set up.
I think you'll see more platforms that offer this set up as a service:
* cheap storage / datalake
* pipelines to get data to storage
* BI / dashboards on top of the storage
We're doing this at Definite (https://www.definite.app/) with Iceberg (same as in this post) + DuckDB as a query engine.
- 1a527dd5 9 months ago
  
  We are waiting for Metabase to support DuckDB on their cloud version. It's pretty neat.

zurfer 9 months ago

Kudos to the author who is responsible for the whole stack. A lot of effort goes into ingesting data into Iceberg tables to be queried via AWS Athena.

But I think it's great that analytics and data transformation is distributed, so developers also are somewhat responsible for correct analytical numbers.

In most companies there is strong split between building product and maintaining analytics for the product, which leads to all sort of inefficiencies and errors.

tonymet 9 months ago

15k/s event rate and 650GB volume / day is massive. Of course that's confidential, but I'd guess they are below 10k concurrent connections. So they are recording 1.5 event's / second / user. Does every packet need discrete & real-time telemetry? I've seen games with millions of active users only hit 30k concurrents and this is a developer tool.

Most events can be aggregated over time with a statistic (count, avg, max, etc). Even discrete events can be aggregated with a 5 min latency. That should reduce their event volume by 90% . Every layer in that diagram is CPU wasted on encode-decode that costs money.

The paragraph on integrity violation queries was helpful -- it would be good to understand more of the query and latency requirements.

The article is a great technical overview, but it's also helpful to discuss whether this system is a viable business investment. Sure they are making high margins, but why burn good cash on something like this?

nemothekid 9 months ago

>Of course that's confidential, but I'd guess they are below 10k concurrent connections
I think 10k concurrent connections might be low? I've seen ngrok used at places where you need a reverse proxy to some device - each of those types of customers may have thousands of agents alone.
- tonymet 9 months ago
  
  it's anyone's guess. i'm factoring in the fact that it's a niche dev tool with lots of competition . remember that concurrent figures are 100x -1000x smaller than monthly active users

jmuguy 9 months ago

I wonder if this data collection is why Ngrok's tunnels are now painfully slow to use. I've just gone back to localhost unless I specifically need to test omniauth or something similar.

valzam 9 months ago

i pity the developer who has to maintain tagless final plumbing code after the “functional programming enthusiast” moves on… in a Go first org no less.

otter-in-a-suit 9 months ago

Author here. This decision went through all proper architecture channels, including talks with our engineers, proof of concepts and the like.
I’ve been doing this too long to shoehorn in my pet languages if I didn’t think they’re a good fit. And I think that scala/FP + Flink _is_ a good fit for this use case.
We did also explore the go ecosystem fwiw - the options there are limited (especially around the data tooling like iceberg) and go is simply not a language that’s popular enough in the data world.
Python’s typing system (or lack thereof) is a huge hinderance in this space in general (imo), and Java didn’t cause many happy faces on the Eng team either, but it’s certainly an option. I just find FP semantics a better fit for data / streaming work (lots of map and flat map anyways), and Scala makes that easy.
Also no cats/zio - just some tangles final _inspired_ composition and type classes. Not too difficult to reason about, not using any obscure patterns. I even mutate references sometimes. :-)
- atomicnumber3 9 months ago
  
  I'm assuming the parent commenter hasn't worked in data/spark before either. The functional rabbit hole goes WAY deeper than even just cats et al, and Scala and spark themselves both encourage a fair amount of functional-style code on their own.
- bfors 9 months ago
  
  Could you speak to how you're interfacing scala with flink? I looked into using scala with flink a while back, and stopped when I found out that the scala API was deprecated.
- boltzmann-brain 9 months ago
  
  scala? why not haskell instead?
  
  atomicnumber3 9 months ago
  
  Spark is written in scala and Scala is its first-class language - other languages suffer from either second-class APIs (Java) or suffer from codec/serde overhead (pyspark) (though pyspark actually also is missing a few APIs that scala has, as well).
  
  otter-in-a-suit 9 months ago
  
  Not assuming you’re serious, but in any case: the reason is the JVM (+ Scala) ecosystem in the data space.
  
  epgui 9 months ago
  
  FWIW, I do believe there is a serious case to be made for Haskell… But it’s probably beyond the scope of this context / would require changing many other decisions.
  If integrating with java tools was important then personally I’d ask “why not Clojure”.
  :)
moandcompany 9 months ago

There was a prior effort to create a Golang SDK for Apache Beam https://beam.apache.org/documentation/sdks/go/
The BEAM Golang SDK work came from Googlers working on Beam that were Golang fans, and internally there were Golang-oriented tools for batch data processing that needed a migration path forward.
Historical Note: Apache Beam also originated from Google as "Dataflow"
epgui 9 months ago

I would much rather inherit an FP data pipeline than anything else. You do realize data pipelines (and distributed computing) are an ideal use case for FP?
- pjmlp 9 months ago
  
  I guess the issue being point out is the choice in a Go culture shop, and we all know their common point of view regarding "fancy" languages.
  
  epgui 9 months ago
  
  It’s not clear to me why having two different sets of tooling for solving two different kinds of problems, is an issue.
  In most well-resourced companies, you’re probably not going to have to ask your Go engineers to fix data pipelines in Scala.
  
  pjmlp 9 months ago
  
  That is why they pointed out the leaving the company, as possible scenario.
  As for well resourced I guess it depends, that variable usually doesn't mean much, as we can see by companies firing whole departments, while swimming in profits.

moandcompany 9 months ago

At the end of the day, we're all pushing protobufs from place to place

tonymet 9 months ago

why aws, azure & gcp are printing money

fkinclwns 9 months ago

[dead]

LoganDark 9 months ago

> Note that we do not store any data about the traffic content flowing through your tunnels—we only ever look at metadata. While you have the ability to enable full capture mode of all your traffic and can opt in to this service, we never store or analyze this data in our data platform. Instead, we use Clickhouse with a short data retention period in a completely separate platform and strong access controls to store this information and make it available to customers.

Don't worry, your sensitive data isn't handled by our platform, we ship it to a third-party instead. This is for your protection!

(I have no idea if Clickhouse is actually a third party, it sounds like one though?)

sippeangelo 9 months ago

Clickhouse must be the worst named product in popular use. I know it's a DB, but every time I read it, it sounds like a marketing/Ads company for privacy invasive tracking.
- bjconlan 9 months ago
  
  Well, that's what it was originally designed to store/solve; that being Yandex's analytics platform.
  https://clickhouse.com/docs/en/about-us/history
  
  sippeangelo 9 months ago
  
  That makes a lot of sense.
leosanchez 9 months ago

Clickhouse is a database. It has cloud offering.
- faangguyindia 9 months ago
  
  What's the point of clickhouse cloud?when you can just use bigquery and run queries on billions of row.
  I am genuinely curious what case does clickhouse serve over bigquery.
  
  FridgeSeal 9 months ago
  
  It’s actually open source, you can self-host it easily enough, you can push a single instance pretty far too.
  It’ll also happily read from disaggregated storage and is compatible with parquet and friends and a stack of other formats. I’ve not really used BigQuery in anger, but the ClickHouse performance is really, really good.
  I guess ultimately, all the same benefits, and a lot fewer downsides.
  
  tnolet 9 months ago
  
  - non proprietary
  - open source
  - run it locally
  - SQL like syntax
  - tons of plugins
  - not by Google
  
  fkinclwns 9 months ago
  
  [dead]
IanCal 9 months ago

A different platform doesn't mean third party. It can just mean you have completely separated things so that none of the data tooling discussed here has any ability to access it.
- LoganDark 9 months ago
  
  Not sure what you mean... Do you mean they run software called Clickhouse on their own infra, just separated from the other parts of their backend? To me it reads like they were shipping the data off to a third-party named Clickhouse, especially with "we never store or analyze this data in our data platform" (does data platform refer to ngrok itself or what?).
  
  IanCal 9 months ago
  
  There is a database called clickhouse, while the company offers services and hosting many run clickhouse on their own infra.
fkinclwns 9 months ago

[dead]