Harnessing Gen AI for Data Privacy

21 Jun 2024

Gen AI for data privacy

I set out several months ago to deeply understand and engage with the modern AI tooling that is in the process of revolutionizing (or at least sensationalizing!) the world of Web Development as we know it. I had a single purpose: to build a theoretically scalable system that could leverage this plethora of new technologies. And one that wouldn’t bankrupt me in the process.

I picked a use case that was an area of interest to me and one that I felt was ripe for Generative AI: the world of data privacy. I built a tool that can scan any public-facing web URL, navigate it, and register every network request. We then analyze these requests and process the information using Generative AI.

Gen AI is a valid use case here because data privacy is complicated, in the sense that it is difficult to understand the meaning and consequences of compliance. I believe that Generative AI is most useful as a data synthesizer and reducer. Much of the current frustration comes from the erroneous use of LLMs: inputting a nugget of data and expecting Gen AI to mass-produce gold. It inevitably churns out rubbish. But if you input a high concentration of quality data and ask Gen AI to condense it into something useful, that’s when you get valuable output.

So, what do I want Gen AI to do? Simple, really. As someone who has been doing Web Development for nearly 20 years, I still find myself unable to answer seemingly trivial questions:

  1. Do I need permission to track user data? Which types? What is “user” data anyway? An IP address?
  2. I’m not saving that PII, so it’s legal, right? Right?
  3. When can I set cookies?
  4. But I use a Cookie to track my Cookie consent. That’s allowed, I guess? But what is a “functional” Cookie anyway?
  5. What kind of user tracking is permitted with and without consent?
  6. How do I need to ask for consent?
  7. Can I save consent? How do I even persist negative consent?
  8. What is the difference between consent for tracking and cookies? Do I need both? Is that one button or two?
  9. What are the consequences of not respecting consent?
  10. How does this vary by geography?
  11. Is Google Analytics legal for use in Europe?

Quite honestly, a lot of the confusion above arises because data privacy consent is a grey area with room for interpretation. This isn’t helped by differences in legislation between geographies. But here’s the key: models like GPT-4o and Llama 3 excel at interpreting vast amounts of data and explaining it in simple language. Perfect, thank you.

So, I set out to gather as much hard evidence as I could about what actually happens during a simple navigation of a public-facing document (i.e., a website!). We map this evidence onto our understanding of the legislation, and we arrive at a system capable of testing the data processing flow of any public website.

Woohoo. But you aren’t here for the cookies, are you? You’re here for the AI…

One little system, one bucket load of AI.

  • OpenAI (GPT-4o / DALL-E 3): used to analyze the COMPANIES that are the ultimate processors of the ingested data.
  • Groq (Llama3-70b-8192): used to analyze the REQUESTS where the data is transmitted from the public document.
  • Grok (https://developers.x.ai/): used to analyze sentiment and trends to inform our content generation strategy.
  • Brave API (https://brave.com/search/api/): used to research public information on actors identified within our system.
  • Algolia (https://www.algolia.com/): used to intelligently map unstructured data into a lovely SQL database.

Did you say 5 APIs?

Five different AI products?

How was my experience with this mesh of AI? Very, very hard…

It nearly broke me. So, how did we end up with a platform that has five different AI integrations anyway? Experimentation, repetition, and a fair bit of lunacy. It’s a pattern. But when we untangle the system we have created, each component part makes sense.

The first trade-off is Groq vs ChatGPT. ChatGPT is, of course, the flagship product of OpenAI, the first worm out of the proverbial can. And their first-mover advantage shows: their API and models are more refined, and this is clear from the quality of the output. So I use ChatGPT for the long-form content, and the quality of the results is indisputable.


It’s expensive. I woke up in sweats several times a week worrying what would happen if somebody, anybody, actually used this platform I’d built. A great experiment, but one I’m willing to bankrupt myself for? Not likely.

Groq changed everything. Their API costs 100x less. It’s fair to say that had I not discovered Groq, I would probably never have released this blog, simply out of fear of the cost. The quality of GPT-4o over Llama 3 is noticeable. But the price of Llama 3 on Groq is quite literally 1% of the price of GPT-4o.

I use OpenAI when we need content of the absolute highest quality. I use Groq when I need to process lots of information.

I have built a killswitch to turn everything to Groq at a second's notice. This switch is the difference between being able to launch or not.
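The killswitch itself is trivial; what matters is that every call site goes through it. Here is a minimal sketch of the idea (the `AI_KILLSWITCH` variable, model names, and task labels are my illustrative assumptions, not a real API):

```python
import os

def pick_model(task: str) -> str:
    """Route a task to a model. Setting AI_KILLSWITCH=1 forces
    everything onto the cheap Groq-hosted model at a second's notice."""
    if os.environ.get("AI_KILLSWITCH") == "1":
        return "groq/llama3-8b"
    # Only the highest-quality long-form content justifies GPT-4o pricing.
    return "openai/gpt-4o" if task == "long_form" else "groq/llama3-8b"
```

With a switch like this, a cost spike is one environment variable away from being contained, rather than a redeploy.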

gpt-4o: $5.00 / $15.00 per 1M tokens (input/output)
Llama3-8b: $0.05 / $0.08 per 1M tokens (input/output)
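To make the 100x claim concrete, here is the back-of-the-envelope arithmetic for a single hypothetical call (the token counts are made up; the per-million prices are the ones quoted above):

```python
def request_cost(in_tokens, out_tokens, in_price, out_price):
    """Cost in USD for one call, with prices quoted per 1M tokens."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

# A research-heavy prompt: 8k tokens in, 1k tokens out.
gpt4o  = request_cost(8_000, 1_000, 5.00, 15.00)  # about $0.055 per call
llama3 = request_cost(8_000, 1_000, 0.05, 0.08)   # about $0.0005 per call
```

At a few thousand calls a day, that is the difference between pocket change and a real bill.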

So our AI count is at two out of the door…

Where does Brave AI Search come in?

You could easily interchange this with Perplexity AI or something similar. I was very impressed with the API offering. Brave found its way into the stack while I was building my own SERP crawler and researcher. Mine was rubbish and consumed a lot of time; Brave’s was excellent. Mine worked 50% of the time; Brave’s, 95%. To generate high-quality content for people, we need to solve several puzzles: we need thorough research, and we also need to know what is interesting to the user.

Brave’s search API is excellent for doing research for AI content. It provides links and references and shows high-traffic suggestions for users to follow the content rabbit trail. Without the research from Brave, the results from ChatGPT and Groq would be spam. It is a wonderful AI that feeds research and data into our AI. That’s a 2024 phrase if I’ve ever heard one.

Three down…

Onto the most controversial selection. Grok, by xAI (the company behind X, formerly Twitter), is an LLM with a difference. It has built-in social media retrieval (I imagine some kind of proprietary RAG). How does this help?

This helps us understand content and topics that are trending and new. So before we research and generate content, we need to understand the hot topics.

I’m not yet convinced by the viability of Grok, but the potential to plug into real-time sentiment and use this as a search and content generation strategy is an exciting one for me. Put this one down as experimental. I’ll keep you posted.

So we end up with Algolia. Why do we need an AI-powered search on top of our AI-powered research and generative AI?

This comes down to how I’ve structured my platform. We’ll go deeper into the how and the why later in this article, but to build a world-class platform, we need to fill in some of the basics. In my old-school paradigm, you can’t have world-class content without a world-class CMS. World-class CMS requires clean, structured data. SQL.

We use Algolia to weave together and map the content from our different systems. It’s hard to strictly constrain the output of text generation models (the same company might come back as Shopify, Shopify’s App, or Shop App). Getting JSON output is more or less stable these days. But converting JSON output to SQL with references between content types is tricky due to the unstructured nature of generated text. Algolia bridges this gap by condensing ‘similar’ content into unique SQL records that can be consumed by a website.
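Algolia’s actual API isn’t shown here; as a stand-in for the condensing step, here is the kind of fuzzy clustering involved, using only the standard library. Near-duplicate entity names get folded into one canonical record before they ever reach SQL (the threshold and example names are my own illustrative choices):

```python
from difflib import SequenceMatcher

def condense(names, threshold=0.65):
    """Greedy clustering: fold each name into the first existing
    canonical entry it closely resembles, else start a new cluster."""
    canonical = []  # one representative name per cluster
    mapping = {}    # raw name -> canonical name
    for name in names:
        for rep in canonical:
            if SequenceMatcher(None, name.lower(), rep.lower()).ratio() >= threshold:
                mapping[name] = rep
                break
        else:
            canonical.append(name)
            mapping[name] = name
    return mapping

rows = ["Shopify", "Shopify's App", "HotJar", "Hotjar", "Google Analytics"]
# "Shopify's App" folds into "Shopify"; "Hotjar" folds into "HotJar".
```

A real system would tune the similarity measure per field, but the shape of the problem is the same: many noisy strings in, one clean row out.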

It’s not perfect. But it works (95% of the time).

So here we are, 5 AIs in the hype boom forged, with one simple platform to rule them.

It was hard.

It nearly broke me.

We move from theoretical to engineering concerns: chaining AI API calls to create a tolerable product. So why is using AI so hard?

Fundamentally, there is one simple reason. The Internet is now fast, and we expect things to be fast. Even AWS API Gateway HTTP requests time out after a maximum of 30 seconds.

But generative AI?

Just crafting one piece of researched content with an accompanying image can require up to 5 chained calls to various APIs:

  1. Identify the content (sentiment analysis w/Grok)
  2. Research the content (AI Search w/Brave)
  3. Generate the content (Gen AI w/Llama3 / GPT-4o)
  4. Generate the image (Gen AI w/DALL-E 3)
  5. Save the content and image into SQL (Algolia)

Example content: HotJar data privacy analysis on privacytrek.com (AI-generated image).
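The five-step chain can be sketched as a single pipeline. Every function here is a stub standing in for the real API call noted in the comment; the point is the shape of the chain, not the implementation:

```python
# Stubbed pipeline: each step stands in for a real API call
# (Grok, Brave, Groq/OpenAI, DALL-E 3, and Algolia respectively).
def identify_topic(seed):        return f"trending: {seed}"
def research(topic):             return [f"source about {topic}"]
def generate_text(topic, refs):  return f"article on {topic} citing {len(refs)} refs"
def generate_image(article):     return f"image for: {article[:20]}"
def persist(article, image):     return {"article": article, "image": image}

def content_pipeline(seed):
    topic = identify_topic(seed)          # 1. sentiment/trends
    refs = research(topic)                # 2. AI search
    article = generate_text(topic, refs)  # 3. long-form text
    image = generate_image(article)       # 4. illustration
    return persist(article, image)        # 5. structured storage
```

Each arrow in that chain is a network round trip measured in seconds, which is exactly why the latency compounds so badly.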

It’s very hard to build something fast when the underlying APIs are so slow. You won’t get quality output reliably generated in under a minute, especially as you need to knit together disparate APIs to build anything resembling quality content.

The perfectionist in me refuses to wait so long to deliver results on a website. We’ve come too far.

What’s the solution? Streaming? Websockets? Background processes? It’s complicated…

I tried every single one of the above. I hated every single one for different reasons; we could write a blog about each…

I spent almost a week building and tweaking a RabbitMQ broker so that my platform could subscribe to content from the backend responsible for negotiating with this mesh of AI APIs. I was so proud of myself; it was wonderful. It was also absolute rubbish. I deleted it. You know the saying: ‘Every person has a book in them. Most should keep it there.’ The same applies to Software Engineers and their AI ideas.

It’s so easy to go off on a tangent and build around the problems inherent in artificial intelligence tools. I did it many times before I reluctantly accepted that you can’t make an elephant run, and that we needed a different approach. You should accept it, too. Like horse-and-carriage congestion in the early 20th century, eventually it won’t be a problem. But until it isn’t, it is.

Users expect fast web experiences; a sprinkle of AI will only buy patience for so long before those experiences become onerous and frustrating. So, to use AI at scale, we need to fetch our data before the user has even arrived. The number one rule for leveraging AI is to derive the value from our business logic long before the user arrives.

The key to the kingdom is to use every word and every image. Every scrap of expensive generated content should be treated like proverbial gold. This means vigilant control of both inputs and outputs.

And so I save every API request, and I thoroughly research every API call I make. I test, and I tweak, I iterate, and I learn until I can bend the tool to my will.

AI Costs $$$

Treat AI API calls with the respect they deserve.

AI is prohibitively expensive. Imagine paying 5c for every API call you make to your CMS. I challenge you to do some matchstick math in your observability platform: look at the logs of any modern software system, and you’ll find request counts typically measured in the hundreds of thousands, or millions…

To make AI valuable at scale, we can’t treat its output as transient or ephemeral. The first thing I learned about working with AI APIs is to save every response, output, and image; it can all be used later. And one of the ironic properties of AI output is that the more you refine and reuse it (condensing), the more valuable and realistic it becomes. Just make sure you conserve the building blocks, or you will literally be paying the price.
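A minimal sketch of the “save every response” rule: key each call by a hash of model plus prompt, so an identical question never hits the paid API twice (`call_api` here is a placeholder for whatever client you actually use):

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("llm_cache")

def cached_completion(model, prompt, call_api):
    """Return a cached response if we've already paid for this exact
    call; otherwise call the API once and persist the result to disk."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["response"]
    response = call_api(model, prompt)  # the only line that costs money
    path.write_text(json.dumps(
        {"model": model, "prompt": prompt, "response": response}))
    return response
```

Files on disk are crude, but even this beats paying twice for the same answer; a real system would swap in a database or object store.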

It’s funny how paying for, and being on the hook for, your own system really takes you back to the basics as an engineer. Nothing strikes fear into a developer more than a faulty API call that could accidentally cost thousands of dollars. Nothing will make me optimize my API fallback strategy like the fear that an accidental loop could bankrupt me. Frankly, we should treat ordinary APIs with the same respect, but caching, cheap processing, and laziness have made that discipline feel unnecessary.
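And one simple guard against the accidental-loop nightmare: a hard spend cap that refuses further calls once a running total is exceeded (the class and figures are illustrative, not from any library):

```python
class BudgetGuard:
    """Refuse any further AI calls once cumulative estimated
    spend would exceed a hard cap."""

    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent = 0.0

    def charge(self, estimated_cost_usd):
        if self.spent + estimated_cost_usd > self.cap_usd:
            raise RuntimeError(
                f"AI budget exhausted: ${self.spent:.2f} spent, "
                f"cap ${self.cap_usd:.2f}")
        self.spent += estimated_cost_usd
```

Charging the guard before every call turns a potential four-figure surprise into a loud, immediate failure.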

The founding principle of working with these APIs is to treat every output from an LLM with respect. Spend time considering the inputs and the outputs. Prompt engineering, RAG, and Vector DBs are the buzzwords; the principles are far simpler. Every question or input to a Gen AI system costs you real dollars. Have you optimized that input to ensure that what comes out is actually valuable? Or are you simply pounding away at a broken slot machine, throwing your money down the drain?

I spent a long time crafting every user and system prompt, optimizing the inputs and the outputs to ensure that what comes out of the LLM is of value. I failed more than I succeeded. It took me a long time to get beautifully crafted artisan image representations of my companies. I spent days trying to use ChatGPT to create an icon library (bad idea). The more you work with these APIs, the easier it is to see the cracks. It’s so easy to get rubbish output; if you haven’t carefully automated and scaled the input, it is the most likely outcome.

But when the robot gets it right, it becomes something very special indeed.

Ask ChatGPT

Just ask ChatGPT?

In my experience, the inverse of the common assumption is true: these LLMs aren’t generalists at all but specialists. I don’t know why this is surprising; machine learning models have always been thus. We have object detection models to detect objects in images. We have structured data extraction models to extract data from text. We wouldn’t expect our object detection model to extract structured data from text, right? But that is exactly what we expect from our LLMs: one superhuman AGI robot to rule them all. Absolute popsicle…

Nothing screams amateur to me more loudly than companies building a wrapper around ChatGPT and assuming that “AI” will solve their problem. These LLMs have no AGI at all, nor even intelligence; they are capable of processing extremely large datasets. The mere thought that they are a silver bullet for every problem shows me that not much thought has been given at all.

What are they specialists at? Condensing large amounts of information into valuable, smaller, intelligible versions of the same. Ironically, this is the exact opposite of the majority of use cases. Welcome to the trough of disillusionment.

LLMs are a tool in the armory that can solve problems in new and inventive ways. They have opened doors that we didn’t even know existed. So what next for this great experiment?

We are at the beginning.

I guess I’ve got proof of concept, and my goal is to convert this into a functional, modern platform. There are challenges to overcome.

The Field of Dreams conundrum: I’ve built it. Will they come? Experience tells me probably not.

I need to turn this platform into a self-aware, SEO-optimized monster. I’m going to use AI and the tools I’ve woven together to craft and scale human-consumable content and bring Data Privacy analysis to the world…

I’m not sure how far I’ll get, but it’s turning out to be a wonderful adventure…

Do come along for the ride.