DeFi Daily News
Tuesday, March 3, 2026

Analysis: What AI Gets Right (and Very Wrong) About Taxes – NerdWallet

by Kurt Woock
March 3, 2026
in Personal Finance

As surely as water rolls downhill, people are going to ask AI chatbots questions about their taxes in 2026.

To see if these tools are up to the task, the NerdWallet Data Studies team tested three different services. We combed through more than 50,000 words of chat transcripts — longer than The Great Gatsby — and learned the following about how AI chatbots handled tax questions:

They got a lot right, especially when the answer was black and white. 

When we asked the same question multiple times, AI-generated answers were inconsistent. 

They made assumptions about who we are and what we wanted. That may be OK in some situations, but personalized answers about your taxes require a higher level of clarity.

How we tested

Three team members asked the same seven questions to three different AI chatbots: ChatGPT, Gemini and Perplexity.

Three of these questions had a single correct answer. Those questions came from a practice quiz published by the Internal Revenue Service (IRS) for enrolled agents — think of it as a bar exam for tax preparers.

Four of the questions were open-ended. We were looking for advice, recommendations and estimates. To test each chatbot’s ability to provide customized answers, we created a fictional tax filer profile for each question. If chatbots asked follow-up questions, our stories would remain the same across testers.

We reviewed and graded all 63 transcripts in order to spot trends and reach our conclusions.

For additional analysis specifics, see the methodology.

Chatbots did a lot right…

All three AI chatbots nearly aced the IRS quiz questions we threw at them.

When the tools gave general advice — like how to approach taxes if you’re filing for the first time — some of the guidance was in line with what we might suggest as NerdWallet writers.

As questions became more nuanced, some faulty and outdated information was presented, but there weren’t long sections of bizarre hallucinations like you might have expected a few years ago.

If you’re the type who wants to learn the reason behind an answer (or if you’re a skeptic and need to double-check every AI-powered response), you’re in luck: chatbots usually explain how they arrive at an answer.

…But showed significant room for improvement

Issue #1: Errors hide in plain sight

Confidence is a hallmark of chatbot responses. Whereas humans can acknowledge uncertainty and still give an answer, AI chatbots seem stuck on authoritative cruise control. As a result, wrong answers can sound right.

In some instances, the chatbots coolly rattled off an incorrect standard deduction amount (off by nearly $3,000), stated that an electric vehicle credit was still on the table for our fictional new Tesla (it isn’t), and claimed that our $10,000 in overtime pay wasn’t deductible (it is). In multiple instances, the tools told us to file taxes in the wrong state. The AI chatbots seemed overly focused on federal taxes and less so on state-specific information.

“A couple of years ago, even the cutting-edge AI models couldn’t reliably do basic arithmetic,” says Sam Taube, lead writer for NerdWallet’s investing and taxes team. “Recent updates have improved their math skills a lot, but there’s still a risk of errors. And other accuracy problems — like the tendency to cite nonexistent, ‘hallucinated’ cases in response to legal questions — still come up in 2026.”

“Taxes involve both of those subjects — math and law,” Taube says. “Even though AI has gotten better at the former, and may get better at the latter in the future, it’s not a reliable source of truth yet.”

The fictional filers behind our questions would likely have been just fine when it was time to actually file. That’s largely because using tax software or working with a CPA — as millions of people do — requires extensive information gathering and review processes that would likely catch or filter out errors the chatbots introduced.

Issue #2: Personalized feedback was inconsistent

Answers became less consistent when questions were open-ended and came with personal details.

For example, when asked to estimate a tax bill, one AI service said we would likely owe “~$32,267.” Another said we’d owe “tens of thousands.” One recommended saving an additional 20% as a “cushion” — a casual aside that would require coming up with an extra $6,000 in a short period. Even if none of these answers were technically incorrect, the variance made us question their reliability.
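Assuming the 20% cushion applies to the ~$32,267 estimate quoted above, the arithmetic behind that aside is easy to check:

```python
# Checking the "cushion" arithmetic from the transcripts quoted above.
estimated_bill = 32_267   # one chatbot's point estimate of the tax owed
cushion_rate = 0.20       # the suggested 20% savings "cushion"

cushion = estimated_bill * cushion_rate
print(f"Extra savings implied by the cushion: ${cushion:,.0f}")
```

On that assumption, the cushion comes to roughly $6,450, in the ballpark of the extra $6,000 the paragraph cites.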

In another instance, we asked for a tax software recommendation. The products and how they were ranked varied, even though each tester used the exact same prompt and filing profile. For example, one chatbot identified FreeTaxUSA as the top option for one tester but ranked TurboTax highest for another tester.

Sometimes, it was helpful to see the chatbot’s reasoning for a given answer, but we found that it didn’t always square with its recommendation: One chatbot recommended a more expensive service despite later saying another option had the same features at a lower price. That was a fumbled opportunity to earn trust.

At their core, a restaurant reviewer, art critic and, yes, even someone who reviews tax software all have preferences, informed by deep expertise and experience. Logic allows those people to apply their preferences consistently when reviewing something new. There’s simply no parallel mechanism for AI. Sure, chatbots can replicate the structure of that reasoning, but the informed, subjective core is missing. In its place may be something as arbitrary as a digital coin flip.

Issue #3: Your digital picture could be blurry

AI tools seem to think your chat history paints a reliable portrait of your current tax situation. It doesn’t.

We began each chat by instructing the chatbot to forget all previous chats and to assume nothing about our background. It made assumptions anyway, and that affected its advice.

Some assumptions were factually incorrect. Here’s a short list:

It used biographical information from previous questions.

It apparently referenced our VPN location when telling us where to file taxes.

It mixed up the state where we said we went to college and the state where we said we worked a summer job.

Other assumptions weren’t factually wrong, but they were still problematic.

When we said we’d never filed taxes before and wanted assistance, one chatbot didn’t mention important basics, such as the April 15 filing deadline or penalties for not filing. That info might be too basic for someone who’s paid taxes for decades, but we specifically said we’d never filed before. Basics are exactly the thing a first-time filer should have.

On multiple occasions, chatbots assumed we were filing a single return and made no mention that there were other options.

Admittedly, we were pretending to be someone new with each question — not typical behavior. So why would this AI quirk ever affect you?

Tax advice depends on a long set of facts about you: your job, your marital status, your address, the number of dependents you have and much more. It’s like a thumbprint, but it changes over time.

If you switch jobs, move, update your retirement contributions or make any other changes that affect your taxes, will your chatbot know? And if someone else uses your account, or if you chat from a different point of view, can the chatbot distinguish who’s who?

The portrait of you an AI chatbot creates using past chats might be scary good, but scary good might not be good enough.

How to be smart with AI and taxes

AI tools are very good, but they’re not perfect. For a lot of tasks, the gap between very good and perfect may not matter. It depends on the stakes. Questions about your money are higher stakes than most.

At the same time, let’s not overstate the problem. You aren’t going to ask a chatbot to actually file taxes on your behalf, at least not this year. If you live in Minnesota and the chatbot mistakenly says you should file in Mississippi, common sense is going to kick in.

That leaves us in the murky middle: AI is good enough to experiment with, but it’s not equally good at all tasks. Describing limits or risks for every use case would be a futile exercise. Instead, remember the following five guidelines before you start chatting about your taxes:

1. Use AI as a starting point, not the final word. That’s just table stakes, but it’s worth repeating. Verify information with the IRS or a tax professional before you act on it.

“AI can be useful as a brainstorming partner,” says Taube, who recently asked one of the major chatbots for things you can do to reduce your taxable income between the new year and the tax deadline. “It had some good suggestions, like making a prior-year IRA contribution, but it also gave some faulty suggestions, like harvesting investment losses.” (For the record, you can only deduct losses from investments you sold before the end of the calendar year.)

“The point is that AI might be useful for coming up with ideas about how to improve your tax situation, but you shouldn’t accept those ideas at face value,” Taube says. “If it suggests that you do something, you should probably look up that thing yourself — and ideally find a trustworthy source about it, like an up-to-date IRS webpage — to double-check that it’s allowed.”

2. Ask yourself whether it’s the best tool for the job. It might be! But the more personalized your question is, the more you may want to look for a different tool.

Many tax return software companies have embedded chatbots. Some are built using the same general purpose chatbots you’re already familiar with. This is not a blanket endorsement of tax-software AI tools, but for personalized tax questions, it might be worth trying them before using a general purpose bot.

If you’re after feedback that’s based on an expert’s informed opinion, such as a product recommendation or review, look to human-based sources you trust.

3. Consider what your safety nets are. Go ahead and ask AI your questions about this year’s taxes. If it feeds you an error, it will likely get sussed out later by tax software or tax professionals when you’re actually preparing and filing.  

In some cases, you can also check AI’s work yourself. “Another fairly safe use of AI is to summarize long or complicated documents — for example, to give it a PDF of a tax form and ask it to pull out the numbers you’ll actually need to put in your tax return,” says Taube. “Then you can open up the file yourself and use your computer’s ‘find’ function, which is usually control-F on a Windows computer or command-F on a Mac, to look up that number and verify that it’s actually in the document.”
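That find-function check can also be sketched programmatically. A minimal illustration (the form text and figures below are invented, not from the study):

```python
# Programmatic version of the Ctrl-F / Cmd-F check described above: after a
# chatbot extracts a figure from a document, confirm the figure actually
# appears verbatim in the document's text before trusting it.
def figure_in_text(figure: str, document_text: str) -> bool:
    """Return True if the quoted figure appears verbatim in the text."""
    return figure in document_text

# Invented example text standing in for extracted PDF content.
doc = "Box 1 Wages, tips, other compensation: 54,321.00"

print(figure_in_text("54,321.00", doc))  # True — the figure is really there
print(figure_in_text("54,231.00", doc))  # False — a transposed, hallucinated figure
```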

If you rely on a chatbot for strategic tax planning decisions, however, faulty advice may take years to surface. By that point, consequences are likely irreversible. In some cases you can amend a tax return within a few years after filing, but that introduces its own complications, and amending a return doesn’t unwind every action you take — it just lets you square up with Uncle Sam.

If you don’t have a safety net in place to catch AI mistakes in time, think twice before relying too much on AI-powered advice.

4. Check what version you’re running. We asked questions to the default version of each chatbot. Higher powered versions may have handled our questions better, but we thought our approach best captured the experience of a typical user.

It’s hard to say exactly how much better a more advanced version would be without testing, but common sense says you should opt for the burliest version available to you. To change this, check out your chatbot’s settings.

5. All of this advice will soon be obsolete. Remember that this is a rapidly changing technology. Some of the issues above could get better, and new issues could emerge. How will the introduction of advertising affect the AI experience? Will AI tools be susceptible to bad actors? If you use a chatbot, periodically check in with product developments to make sure the way you’re using it is keeping up with recent changes.

Methodology

Three members of the NerdWallet Data Studies team asked a set of seven questions to three different chatbots: ChatGPT 5.2, Gemini 3 Flash and Perplexity.

Each question was preceded with a prompt designed to isolate the chatbot’s reasoning from previous chats.

Three questions were adapted from a Special Enrollment Examination practice quiz published by the Internal Revenue Service for enrolled agent candidates. See questions 7, 16 and 9.

Four of the questions were open-ended. One example: “I’ve never filed for taxes before. What do I need to do to get ready? How does it work? What does it cost?”

To test each chatbot’s ability to go beyond generic advice and customize its answers, we created a fictional dossier for each question. In an attempt to replicate a person who may not know what information to share, we didn’t offer all this information upfront. If chatbots asked follow-up questions, we drew from these dossiers. This method ensured the chat experience would remain similar across testers. An example cover story looked like this:

You are in the middle of your freshman year of college. You live on campus in Massachusetts.

You earned about $12,000 last year (summer job), but you’re not sure exactly. 

Your family, and that summer job, was in New York.

You save most of your money in a high-yield savings account, and you bought and sold some crypto last year.

You won $700 betting on DraftKings.

After each person asked the three chatbots all seven questions, the resulting 63 chat transcripts were copied into a spreadsheet.

An initial round of analysis examined the transcripts of each question-chatbot combination (e.g., the three ChatGPT transcripts for question #1, then the three Gemini transcripts for question #1, etc.). To the extent possible, the transcripts were first split by theme (e.g., initial response, follow-up questions, chatbot-provided information about cost, recommendations, next steps, etc.). Then, the three sets of chat transcripts were assessed for inconsistencies and inaccuracies, with notes made in the margin for notable observations. Following that initial analysis, all observations were gathered together, and the three chat transcripts were graded according to the following rubric:

Did the AI provide factually correct information?

4 = Yes, but in one instance the answer was notably incomplete.

3 = Yes, but in multiple instances the answer was notably incomplete.

2 = No. The implications would likely be minimal/limited.

1 = No. There could be material implications as a result.

Did the AI chatbot ask follow-up questions needed to capture our basic financial profile?

4 = It missed something minor, but the advice it offered wasn’t affected.

3 = It missed something minor, and it affected the advice it gave.

2 = It missed something major in our financial profile.

1 = It missed something major in our financial profile and offered faulty advice as a result.

Did the advice offered resemble NerdWallet house views? (Note: this did not apply to all questions)

4 = Mostly. The AI chatbot answers were not as thorough, but no major oversights.

3 = No. We would take issue with at least one point.

2 = No. We would take issue with two or more points.

1 = No. We would take issue with three or more points.

Did the AI ever present information or advice unclearly or indirectly?

4 = Yes, but this was rare and the implications would likely not be serious.

3 = Yes, but this was rare. At least one such response could have material ramifications.

2 = Yes, on three or more occasions. At least one such response could have material ramifications.

1 = Yes, on three or more occasions. On multiple occasions, the response could have material ramifications.

Was the AI able to address every question I asked?

4 = It wasn’t able to address one of my questions, but it was able to comprehend what I was asking and offered a useful next step.

3 = It wasn’t able to address two of my questions, but it was able to comprehend what I was asking and offered a useful next step.

2 = It wasn’t able to address at least one of my questions, and it didn’t offer any useful next steps.

1 = It wasn’t able to answer three or more questions or it didn’t provide useful next steps.

For the IRS quiz questions

The above process was repeated for 21 sets of question-chatbot transcripts.

When that initial assessment was complete, observations for each question were compiled, the scores were averaged and themes began to emerge. Finally, the observations, themes and average grades were assessed for the group of three IRS questions and for the group of four open-ended questions.
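As a rough illustration of that roll-up, the averaging step might look like the sketch below. The scores, question labels and tester labels are invented for illustration; they are not the study's actual data.

```python
from collections import defaultdict
from statistics import mean

# Invented rubric scores: (question, chatbot, tester) -> score on the 1-4 scale.
scores = {
    ("Q1", "ChatGPT", "A"): 4, ("Q1", "ChatGPT", "B"): 3, ("Q1", "ChatGPT", "C"): 4,
    ("Q1", "Gemini", "A"): 3,  ("Q1", "Gemini", "B"): 3,  ("Q1", "Gemini", "C"): 2,
}

# Group the three testers' scores for each question-chatbot combination...
by_combo = defaultdict(list)
for (question, chatbot, _tester), score in scores.items():
    by_combo[(question, chatbot)].append(score)

# ...then average them, as described in the methodology.
averages = {combo: round(mean(vals), 2) for combo, vals in by_combo.items()}
print(averages)  # {('Q1', 'ChatGPT'): 3.67, ('Q1', 'Gemini'): 2.67}
```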





Copyright © 2024 Defi Daily.
Defi Daily is not responsible for the content of external sites.
