
2024

Handing Over My Wallet to AI: Which Model Gave the Best Financial Advice?

AI Financial Advisors

Ever looked at your bank account and thought "I should probably talk to a personal financial advisor" — but then remembered that good advisors charge anywhere from $150 to $300 per hour? For most of us, professional financial advice feels like a luxury we can't justify. But we shouldn't have to wait until we're rich to get good financial advice!

That's where AI might change everything. Instead of paying hundreds per hour for financial advice, what if you could get personalized insights for the cost of a ChatGPT Plus subscription? To test this possibility, I connected RocketMoney to all my accounts—checking, credit cards, investments, the works—and exported 90 days of transaction data. Then I fed this financial snapshot to three AI heavyweights: ChatGPT, Claude, and Gemini.

But this isn't just about which AI is "smarter." Each platform brings different tools and features that the model can use to analyze the data. I asked each to analyze my spending and create a comprehensive financial plan and report, just like a human advisor would.

I kept it simple. Each AI received the same prompt:

You are an expert personal finance manager and wealth advisor. I have included my last 90 days of transactions. I need you to do an analysis of my current financial situation and give me a report and wealth plan. Keep in mind this csv is a consolidation for all of my accounts and includes transfers and credit card payments provided by RocketMoney.
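The prompt warns the model that the CSV mixes real spending with transfers and credit card payments, and for good reason: internal movements between your own accounts would otherwise be double-counted as expenses. As an illustrative sketch of that preprocessing step (the column names and category labels here are assumptions, not the actual RocketMoney export schema), you could filter them out before asking the model anything:

```python
import pandas as pd

# Hypothetical rows standing in for a RocketMoney CSV export.
# Real exports will have different column names -- adjust accordingly.
rows = [
    {"Date": "2024-05-01", "Name": "Grocery Store",       "Category": "Groceries",           "Amount": -82.40},
    {"Date": "2024-05-02", "Name": "Payment Thank You",   "Category": "Credit Card Payment", "Amount": -500.00},
    {"Date": "2024-05-03", "Name": "Transfer to Savings", "Category": "Transfer",            "Amount": -200.00},
    {"Date": "2024-05-05", "Name": "Paycheck",            "Category": "Income",              "Amount": 2100.00},
]
df = pd.DataFrame(rows)

# Transfers and card payments just move money between your own accounts;
# counting them as spending would double-count every dollar.
internal = df["Category"].isin(["Transfer", "Credit Card Payment"])
spending = df[~internal & (df["Amount"] < 0)]

# Per-category totals are a compact summary to hand to an LLM.
by_category = spending.groupby("Category")["Amount"].sum().sort_values()
print(by_category)
```

In this toy data, only the grocery charge survives the filter; the $500 card payment and $200 transfer are excluded, which is exactly the distinction the prompt asks the models to keep in mind.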

The results? Let's just say one AI saved me more money in potential insights than a year's worth of its subscription costs—while another couldn't even handle the basics. Here's what happened when I turned my finances over to the machines.

Should You Even Trust Gemini’s Million-Token Context Window?

Haystack Made with GPT-4o


Imagine you’re tasked with analyzing your company’s entire database — millions of customer interactions, years of financial data, and countless product reviews — to extract meaningful insights. You turn to AI for help. You load all of it into Google Gemini 1.5, with its new 1-million-token context window, and start making requests, which it seems to handle. But a nagging question persists: Can you trust the AI to accurately process and understand all of this information? How confident can you be in its analysis when it’s dealing with such a vast amount of data? Are you going to have to dig through a million tokens’ worth of data to validate each answer?

Traditional AI tests, like the well-known “needle-in-a-haystack” tests, fall short in truly assessing an AI’s ability to reason across large, cohesive bodies of information. These tests typically involve hiding unrelated information (needles) in an otherwise homogeneous context (haystack). The problem is that this focuses the test on information retrieval and anomaly detection rather than comprehensive understanding and synthesis. Our goal wasn’t just to see if it could find a needle in a haystack, but to evaluate if it could understand the entire haystack itself.
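To make the critique concrete, here is a minimal sketch of what a classic needle-in-a-haystack probe looks like (the filler sentence and passphrase are invented for illustration):

```python
import random

random.seed(0)

# One out-of-place fact (the needle) buried in homogeneous filler (the haystack).
filler = ["The sky was gray and the commute was long."] * 500
needle = "The secret passphrase is 'blue-walrus-42'."

position = random.randrange(len(filler) + 1)
haystack = " ".join(filler[:position] + [needle] + filler[position:])

# A retrieval-only probe: the model just has to locate the needle.
# Answering it says nothing about whether the other 500 sentences
# were understood, which is exactly the limitation described above.
question = "What is the secret passphrase?"
```

Because the filler is interchangeable, the model can ignore nearly the entire context and still score perfectly, which is why a good long-context benchmark needs questions about the haystack itself.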

Using a real-world dataset of App Store information, we systematically tested Gemini 1.5 Flash across increasing context lengths. We asked it to compare app prices, recall specific privacy policy details, and evaluate app ratings — tasks that required both information retrieval and reasoning capabilities. For our evaluation platform, we used LangSmith by LangChain, which proved to be an invaluable tool in this experiment.
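A framework-agnostic sketch of that setup (the app records and the helper functions here are invented stand-ins, not the actual dataset or the LangSmith API) might look like this: hold the task fixed, grow only the context, and keep a known ground truth for scoring.

```python
import json

# Hypothetical app records standing in for the real App Store dataset.
apps = [
    {"name": f"App {i}", "price": round(0.99 * (i % 7), 2),
     "rating": 3.0 + (i % 3), "privacy": f"Policy text for App {i}."}
    for i in range(1000)
]

def build_context(n_apps: int) -> str:
    """Serialize the first n_apps records into a single prompt context."""
    return "\n".join(json.dumps(a) for a in apps[:n_apps])

def make_question(n_apps: int) -> tuple[str, str]:
    """A retrieval-plus-reasoning question with a known ground truth."""
    target = apps[n_apps // 2]  # bury the answer mid-context
    question = f"Is {target['name']} priced above $2.00? Answer yes or no."
    expected = "yes" if target["price"] > 2.00 else "no"
    return question, expected

# Scale the context while keeping the task the same, so any drop in score
# isolates the effect of context length rather than question difficulty.
for n in (10, 100, 1000):
    context = build_context(n)
    question, expected = make_question(n)
    prompt = f"{context}\n\nQuestion: {question}"
    # answer = call_gemini(prompt)                     # model call, not shown
    # correct = answer.strip().lower() == expected     # score against ground truth
```

In the real experiment the scoring and run tracking were handled by LangSmith; the point of the sketch is only the experimental design, where each answer can be checked mechanically no matter how large the context grows.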

The results were nothing short of amazing! Let's dive in.