In the land of LLMs, can we do better mock data generation?

neurelo.substack.com

140 points by pncnmnp 9 months ago

Big fan of this write up as it presents a really easy to understand and at the same time brutally honest example of a domain in which a) you would expect LLMs to perform very well, b) they don't and c) the solution is to make the use of ML more targeted, a complement to human reasoning rather than a replacement for it.

Over and over again we see businesses sinking money into "AI" where they are effectively doing a) and then calling it a day, blithely expecting profit to roll in. The day cannot come too soon when these businesses all lose their money and the hype finally dies - and we can go back to using ML the way this write up does (ie the way it is meant to be used). Let's hope no critical systems (eg healthcare or law enforcement) make the same mistake businesses are before that time.

infecto 9 months ago

On the flip side I thought the write up was weak on details and while "brutally honest" it did not touch on how they even tried to implement an LLM in the workflow and for all we know they were using an outdated model or a bad implementation. Your bias seems to follow it though, you have jumped so quickly into a camp that its easy to enjoy an article that supports your worldview.
- jerf 9 months ago
  
  To be honest, I exited the article thinking the answer is "no", or at least, perilously close to "no". The same amount of work put into a conventional solution probably would have been better. That cross-product "solution" is a generalized fix for data generation from a weak data source and as near as I can tell is what is actually doing most of the lifting, not the LLM.
  That said, I'm not convinced there isn't something to the idea, I just don't know that that is the correct use of LLMs. I find myself wondering if from-scratch training, of a much, much smaller model trained on the original data, using LLM technology but not using one of the current monsters, might not work better. I also wonder if this might be a case where prompt engineering isn't the way to go but directly sampling the resulting model might be a better way to go. Or maybe start with GPT-2 and ask it for lists of things; in a weird sort of way, GPT-2's "spaciness" and inaccuracy is sort of advantageous for this. Asking "give me a list of names" and getting "Johongle X. Boodlesmith" would be disastrous from a modern model, but for this task is actually a win. (And I wouldn't ask GPT-2 to try to format the data, I'd probably go for just getting a list of nicely randomized-but-plausible data, and solve all the issues like "tying the references together" conventionally.)
- krainboltgreene 9 months ago
  
  Is this the new normal for comments? Incredibly bad faith.
  
  infecto 9 months ago
  
  How so? Their implementation was interesting but I think it missed the whole setup on what did and did not work on the LLM side. Have just a few of those details would have made it very interesting. As it stands its really hard to decide if LLM is or is not the way.
  If you have such an opinion why not share how I could communicate it better?
  
  gopher_space 9 months ago
  
  Your line about the parent commenter's bias was weird and rude. You've never met the person and are accusing them of something you're in the process of doing yourself.
  https://www.youtube.com/watch?v=_cJO7pkx2jQ
  
  infecto 9 months ago
  
  Darn I hate being weird. Thanks!
  
  hluska 9 months ago
  
  It was very rude.
  
  emptiestplace 9 months ago
  
  Wow!
  
  emptiestplace 9 months ago
  
  Your comment is perfectly fine - obviously. You even used 'seems'. You even even saved me from wasting my time reading another bullshit article about LLMs from someone who can't be bothered to learn anything about them. Thanks!
  
  bcoates 9 months ago
  
  "If it didn't work you didn't believe hard enough" also known as "Real Communism has never been tried" or "Conservatism never fails, it can only be failed" is a sort of... information-free stock position.
  Basically, if thing is good it needs to still be good when tried in the real world by flawed humans, so if someone says "I tried thing and it didn't work" replying with "well maybe thing is good but you suck" isn't productive.
  
  infecto 9 months ago
  
  Sorry I think it’s totally justified to question an article when they provided nothing more beyond we tried and it did not work. The whole premise was can it be done but it was missing basic information to draw a conclusion.
  Now maybe I was too weird in my response to the OP but it really went into a LLMs are bad narrative.

jumploops 9 months ago

The title and the contents don’t match.

The author expected to use LLMs to just solve the mock data problem, including traversing the schema and generating the correct Rust code for DB insertions.

This demonstrates little about using LLMs for _mock data_ and more about using LLMs for understanding existing system architecture.

The latter is a hard problem, as humans are known to create messy and complex systems (see: any engineer joining a new company).

For mock data generation, we’ve[0] actually found LLMs to be fantastic, however there are a few tricks.

1. Few shot prompting: use a couple of example “records” by inserting user/assistant messages to “prime” the context 2. Keep the records you’ve generated in context, as in, treat every record generated as a historical chat message. This helps avoid duplicates/repeats of common tropes (e.g. John Smith) 3. Split your tables into multiple generations steps — e.g. start with “users” and then for each user generate an “address” (with history!), and so on. Model your mock data creation after your schema and its constraints, don’t rely on the LLM for this step. 4. Separate out mock data generation and DB updates into disparate steps. First generate CSVs (or JSON/YAML) of your data, and then use a separate script(s) to insert that data. This helps avoid issues at insertion as you can easily tweak, retry, or pass on malformed data.

LLMs are fantastic tools for mock data creation, but don’t expect them to also solve the problem of understanding your legacy DB schemas and application code all at once (yet?).

[0]https://www.youtube.com/watch?v=BJ1wtjdHn-E

edrenova 9 months ago

Nice write up, mock data generation with LLMs is pretty tough. We spent time trying to do it across multiple tables and it always had issues. Whether you look at classical ML models like GANs or even LLMs, they struggle with producing a lot of data and respecting FKs, Constraints and other relationships.

Maybe some day, it gets better but for now, we've found that using a more traditional algorithmic approach is more consistent.

Transparency: founder of Neosync - open source data anonymization - github.com/nucleuscloud/neosync

its_down_again 9 months ago

I’ve spent some time in enterprise TFO/demo engineering, and this kind of generative tool would’ve been a game changer. When it comes to synthetic data, the challenge lies at the sweet spot of being both "super tough" and in high business need. When you're working with customer data, it’s pretty risky—just anonymizing PII doesn’t cut it. You’ve got to create data that’s far enough removed from the original to really stay in the clear. But even if you can do it once, AI tools often need thousands of data rows to make the demo worthwhile. Without that volume, the visualizations fall flat, and the demo doesn’t have any impact.
I found challenge with LLMs isn’t generating a "real enough" data point—that’s doable. It’s about, "How do I load this in?", then, "How do I generate hundreds of these?" And even beyond that, "How do I make these pseudo-random in a way that tells a coherent story with the graphs?" It always feels like you’re right on the edge, but getting it to work reliably in the way you need is harder than it looks.
- edrenova 9 months ago
  
  Yup agreed. We built an orchestration engine into Neosync for that reason. Can handles all of the reading/writing from DBs for you. Also can generate data from scratch (using LLMs or not).
juthen 9 months ago

GANs are barely ten years old and already they have reached the classical ML algorithm status.

danielbln 9 months ago

Did I miss it or did the article not mention which LLM they tried, what prompts they've used and then they also mention zero-shot only, meaning no in-context learning? And they didn't think to tweak the instructions after it failed the first time? I don't know, doesn't seem like they really tried all that hard and basically just quickly checked the "yep, LLMs don't work here" box.

dogma1138 9 months ago

Most LLMs I’ve played with are terrible at generating mock data that is in any way useful because they are strongly reinforced against anything that could be used for “recall”.

At least for playing around with llama2 for this you need to abliterate it the point of lobotomy to do anything and then the usefulness drops for other reasons.

pitah1 9 months ago

The world of mock data generation is now flooded with ML/AI solutions generating data but this is a solution that understands it is better to generate metadata to help guide the data generation. I found this was the case given the former solutions rely on production data, retraining, slow speed, huge resources, no guarantee about leaking sensitive data and its inability to retain referential integrity.

As mentioned in the article, I think there is a lot of potential in this area for improvement. I've been working on a tool called Data Caterer (https://github.com/data-catering/data-caterer) which is a metadata-driven data generator that also can validate based on the generated data. Then you have full end-to-end testing using a single tool. There are also other metadata sources that can help drive these kinds of tools outside of using LLMs (i.e. data catalogs, data quality).

SkyVoyager99 9 months ago

I think this article does a good job in capturing the complexities of generating test data for real world databases. Generating mock data using LLMs for individual tables based on the naming of the fields is one thing, but doing it across multiple tables, while honoring complex relationships across them (primary-foreign keys across 1:1, 1:N, and M:N with intermediate tables) is a whole another level of a challenge. And it's even harder for databases such as MongoDB, where the relationships across collections are often implicit and can best be inferred based on the names of the fields.

gopher_space 9 months ago

> Generating mock data using LLMs for individual tables based on the naming of the fields is one thing, but doing it across multiple tables, while honoring complex relationships across them (primary-foreign keys across 1:1, 1:N, and M:N with intermediate tables) is a whole another level of a challenge.
So much so that I'm wondering about the context and how useful the results would be if the idea was self-applied. The article talks about mocking data for a number of clients, and I appreciate that viewpoint, but I'm struggling to picture a scenario where I wouldn't have the time or desire to hand-craft my own test data.
- SkyVoyager99 9 months ago
  
  Well a few scenarios come to mind - 1) keeping the test data up-to-date as the schema changes takes a fair amount of work, especially if it's a schema that's actively changing and being worked on in a team by more than one developer. 2) Not everyone wants to necessarily craft their own test data even if they can, because well they would rather spend their time doing something else. 3) test data generation at even modest scale can be quite painful to hand-craft (and keep up-to-date). 4) capturing all the variances across the data e.g. combinations of nulls across fields, lengths of data across the fields, etc.

nonameiguess 9 months ago

We faced probably about the worst form of this problem you can face when working for the NRO on ground processing of satellite data. When new orbital sensor platforms are developed, new processing software has to be developed in tandem, but the software has to be developed and tested before the platforms are actually launched, so real data is impossible and you have to generate and process synthetic data instead.

Even then, it's an entirely tractable problem. If you understand the physical characteristics and capabilities of the sensors and the basic physics of satellite imaging in general, you simply use that knowledge. You can't possibly know what you're really going to see when you get into space and look, but you at least know the mathematical characteristics the data will have.

The entire problem here is you need a lot of expertise to do this. It's not even expertise I have or any other software developer had or has. We needed PhDs in orbital mechanics, atmospheric studies, and image science to do it. There isn't and probably never will be a "one-click" button to just make it happen, but this kind of thing might honestly be a great test for anyone that truly believes LLMs can reason at a level equal to human experts. Generate a form of data that has never existed, thus cannot have been in your training set, from first principles of basic physics.

sgarland 9 months ago

IMO, nothing beats a carefully curated selection of data, randomly selected (with correlations as needed). The problem is you rapidly start getting into absurd levels of detail for things like postal addresses, at least, if you want them to be accurate.

zebomon 9 months ago

Good read. I wonder to what degree this kind of step-making which I suppose is what is often happening under the hood of OpenAI's o1 "reasoning" model, is set up manually (manually as in a case-by-case basis) as you've done here.

I'm reminded of an evening that I spent playing Overcooked 2 with my partner recently. We made it through to the 4-star rounds, which are very challenging, and we realized that for one of the later 4-star rounds, one could reach the goal rather easily -- by taking advantage of a glitch in the way that items are stored on the map. This realization brought up an interesting conversation, as to whether or not we should then beat the round twice, once using the glitch and once not.

With LLMs right now, I think there's still a widespread hope (wish?) that the emergent capabilities seen in scaled-up data and training epochs will yield ALL capabilities hereon. Fortunately for the users of this site, hacking together solutions seems like it's going to remain necessary for many goals.

yawnxyz 9 months ago

ok so a long time ago I used "real-looking examples" in a bunch of client prototypes (for a big widely known company's web store) and the account managers couldn't tell whether these were items new that had been released or not... so somehow the mock data ended up in production (before it got caught and snipped)

ever since then I use "real-but-dumb examples" so people know in a glance that it can't possibly be real

the reason I don't like latin placeholder text is b/c the word lengths are different than english so sentence widths end up very different

globalise83 9 months ago

Yes, this should be a lesson in all software engineering courses: never use real or realistic data in examples or documentation. Once made the mistake of using a realistic but totally fake configuration id and had people use it in their production setup. Far better to use configId=justanexampleid or whatever.
sgarland 9 months ago

That sounds like a problem with the account managers, not you.
Accurate and realistic data is important for doing proper load tests.
- yawnxyz 9 months ago
  
  oh yeah definitely a problem with the account managers! I just try to make it easier on them...

benxh 9 months ago

I'm pretty sure that Neosync[0] does this to a pretty good degree, it is open source and YC funded too.

[0] https://www.neosync.dev/

WhiteOwlEd 9 months ago

Building on this, Human preference optimization (such as Direct Preference Optimization or Kahneman Tversky Optimization) could be used to help in refining models to create better data.

I wrote about this more recently in the context of using LLMs to improve data pipelines. That blog post is at: https://www.linkedin.com/posts/ralphbrooks_bigdata-dataengin...

larodi 9 months ago

The thing is that this test data generation does not work if you don't account for the schema. Author did so, well done. Been following the same algo for an year, and it works as long, as context big enough to keep ids generated. or otherwise you feed ids for the FKs missing.

But this is really not a breakthrough, anyone with fair knowledge of LLMs and E/R should be able to devise it. the fact not many people have interdisciplinary knowledge is very much evident from all text2sql papers for example which is a similar domain.

Version467 9 months ago

> anyone with fair knowledge of LLMs and E/R should be able to devise it.
While this may be true, I think it overlooks a really important aspect. Current LLMs could be very useful in many workflows if someone does the grunt work of properly integrating it. That’s not necessarily complicated, but it is quite a bit of work.
I don’t think we’ll hit a capabilities wall anytime soon, but if we do, we’ll still have years of work to do, to properly make use of everything llms have to offer today.
- larodi 9 months ago
  
  This grunt wall is these 10-20% that the model gets wrong, or just misbehaves. And it can be lotta struggle. I'm not talking about easy stuff like writting a letter to customer, but classification and text2code, which is hard
  Stating this after just finished my long overdue masters and the topic was text2sql with a pinch of my own thing. Hundreds of papers are written on this topic, and only when complex agents, multi-prompting + actual discreete systems play together things start ot work. So just tossing all in the context is not a solution.
  In practice, I agree, the llms have their role in software, as classifier, graphics segmentation, code assist, etc. But it is very wrong to put all eggs in the same basket, and this basket is very very very shade one.

eesmith 9 months ago

A European friend of mine told me about some of the problems of mock data generation.

A hard one, at least for the legal requirements in her field, is that it must not include a real person's information.

Like, if it says "John Smith, 123 Oak St." and someone actually lives there with that name, then it's a privacy violation.

You end up having to use addresses that specifically do not exist, and driver's license numbers which are invalid, etc.

mungoman2 9 months ago

Surely that's only their interpretation of privacy laws, and not something tested in courts.
It seems unlikely to actually break regulations if it's clear that the data has been fished out of the entropy well.
- aithrowawaycomm 9 months ago
  
  But if "fished out of the entropy well" includes "a direct copy of something which should not have been in the training data in the first place, like a corporate HR document," then that's a big problem.
  I don't think AI providers get to hide behind an "entropy well" defense when that entropy is a direct consequence of AI professionals' greed and laziness around data governance.
- eesmith 9 months ago
  
  The conversation was about 15 years ago so my memory might be wrong. But if you happen to have an SSN correctly matching someone's name, can you say it's been fished out of the entropy well? As aithrowawaycomm commented, how can you know it didn't regurgitate part of the training set, which happened to contain real data?

chromanoid 9 months ago

The article reads like it was a bullet point list inflated by AI. But maybe I am just allergic to long texts nowadays.

I wonder if we will use AI users to generate mock data and e2e test our applications in the near future. This would probably generate even more realistic data.

lysecret 9 months ago

This is a very good point, that's probably my number one use-case of things like copilot chat, just to fill in some of my types and generate some test cases.

roywiggins 9 months ago

a digression but

> this text has been the industry's standard dummy text ever since some printed in the 1500s

doesn't seem to be true:

https://slate.com/news-and-politics/2023/01/lorem-ipsum-hist...

hluska 9 months ago

From the article:

“It should generate realistic data based solely on the schema, without requiring any external user input—a “one-click” solution with minimal friction.“

This is extremely ambitious and ambition will always be very cool.

dartos 9 months ago

Maybe I’m confused, but why would an llm be better at mapping tuples to functions as opposed to a kind of switch statement?

Especially since it doesn’t seem to totally understand the breadth of possible kinds of faked data?

erehweb 9 months ago

See also the Charlie Javice case, where she allegedly defrauded JP Morgan into buying her student financial aid company using mock data https://www.nbcnews.com/news/us-news/startup-founder-charlie...

ShanAIDev 9 months ago

[flagged]

thelostdragon 9 months ago

This looks quite interesting and promising.