Cheap and DIY solution for log analysis

5 points by unrequited 19 hours ago

I have ~5TB of logs. I'm open to ideas for doing log analysis with any of the available AI models, purely for learning purposes. I'm on a budget for this but have time to work on something DIY. Kindly suggest any ideas for anomaly detection or similar experiments to play around with on these logs. Thanks.

speedgoose 5 hours ago

A few random thoughts since no one replied:

(rip)grep. Use AI to suggest what to look for, if you want to use AI. Maybe do it in reverse, so you filter out the logs you aren't interested in.
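The reverse-filtering idea is a few lines of Python. The "boring" patterns below are made-up examples; in practice you'd build your own list as you triage, perhaps with an LLM suggesting patterns:

```python
import re

# Hypothetical patterns for lines you already know are uninteresting.
BORING = re.compile(r"health[- ]?check|heartbeat|GET /ping|connection closed")

def unusual_lines(lines):
    """Yield only the lines that do NOT match a known-boring pattern."""
    for line in lines:
        if not BORING.search(line):
            yield line

logs = [
    "GET /ping 200",
    "worker heartbeat ok",
    "ERROR: disk quota exceeded on /var/data",
]
print(list(unusual_lines(logs)))  # only the ERROR line survives
```

At 5TB you'd stream this over stdin rather than load anything into memory, but the filtering logic is the same.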

Look at the similarity of each line. Working at the raw byte level (UTF-8 or ASCII) may not be good enough, though it can quickly highlight some interesting lines. Perhaps a nice tokenizer can help, or even language-model embeddings. You can play with classic text-similarity algorithms, or cosine similarity over the embeddings.
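A minimal, stdlib-only sketch of the cosine-similarity idea, using whitespace tokens as a crude stand-in for a real tokenizer or embedding model:

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between two lines, using token counts as vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

a = "ERROR timeout connecting to db-1"
b = "ERROR timeout connecting to db-2"
c = "user alice logged in"
print(cosine(a, b) > cosine(a, c))  # near-duplicate log lines score higher
```

With real embeddings you'd compute the same dot-product-over-norms on the embedding vectors instead of token counts.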

Play with dimensionality-reduction and clustering algorithms like UMAP and HDBSCAN (or whatever the state of the art is, I haven't looked at the field recently).

Feeding a chat/instruct LLM 5TB of logs is technically possible, but that would be a huge waste of resources IMHO. Is it worth it? You could feed it only the unusual lines that survive your filtering.

Let's say you have hardware and an LLM that can process 100 tokens/s: 5TB is about 400 years of compute.
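The back-of-the-envelope arithmetic checks out, assuming a rough rule of thumb of ~4 bytes of log text per token:

```python
bytes_total = 5 * 10**12            # 5 TB of logs
tokens = bytes_total / 4            # assume ~4 bytes per token
seconds = tokens / 100              # at 100 tokens/s
years = seconds / (365 * 24 * 3600)
print(round(years))                 # roughly 400 years
```

Which is why aggressive pre-filtering matters: shrink 5TB to a few hundred MB of unusual lines first, and the LLM pass becomes hours instead of centuries.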