Legion Health addressed YOUR questions. What do the answers (and the silences) tell us?
A thoughtful response is not the same as a complete public record. In a pilot with nationwide ambitions, that distinction matters.
Disclaimers: This article was written by a human pharmacist living in Pennsylvania. My opinions do not necessarily reflect those of my employer. This article is for educational purposes only; this article and my comment replies are not intended as medical or legal advice.
This is the second article of the series on Legion Health.
If you missed the first one, read it here.
Quick recap for new readers
Earlier this month, Utah approved a 12-month regulatory mitigation agreement with Legion Health, a San Francisco AI startup, to allow AI-facilitated renewals of a narrow set of psychiatric prescriptions that a human clinician had already prescribed. The pilot is now live, and leadership has publicly described nationwide ambitions.
The decision landed in a state where 99% of the geography is a mental health professional shortage area and roughly half a million Utahns lack adequate behavioral healthcare access.
My first article asked readers to move past their initial reaction to think more deeply about their questions and structural concerns. Our community did just that, and I brought the questions directly to the company.
Arthur MacWaters (Legion Health’s Co-Founder and President) took the time to respond in writing, with full knowledge that the answers would be published.
What the company chose to answer, what they declined to answer, and what they left unsaid are three different things. This article goes through all three.
Here is what Arthur said, what your questions were, and what the shape of the response actually tells us.
Credit where credit is earned
Arthur did not have to respond. He did, and he engaged substantively on two of the hardest questions on the list.
The first is the >98% concordance threshold during Phase 1, where every AI-facilitated renewal must be reviewed by a licensed physician before it goes to the pharmacy. A reasonable concern about that threshold is that it creates a structural pull toward confirming the AI’s recommendation rather than exercising independent clinical judgment; this is a well-documented pattern in automation-assisted decision-making. Arthur engaged it directly: Legion, he wrote, treats the concordance numbers as safety and audit thresholds, not as instructions for clinicians to confirm the model, and the workflow is designed to tolerate conservative escalation much more than unsafe autonomous approval.
The second is data handling. Readers were concerned about whether patient data processed through third-party LLM APIs might end up in foundation-model training data, or in interaction logs used for fine-tuning. Arthur addressed that without being asked in those exact terms: “PHI is not used to train foundation models.” That line directly resolves the specific concern readers raised.
The rest of the public record will be held to the same standard.
Arthur’s response, in full
What follows is Arthur’s written response to my questions, reproduced in its entirety. (Again, I told him beforehand that I would publish any response he gave me.) Nothing in the following quote block has been edited, summarized, or reordered.
Hi Ryan,
Thanks again for the thoughtful questions, and for approaching this in good faith. I’ll answer where I can.
Model architecture / boundary between AI and deterministic logic
The autonomous refill workflow is not a general-purpose “AI prescriber.” It is a tightly scoped renewal workflow that combines deterministic verification checks and safety rules with a model-driven conversation layer. In practice, the AI is used to gather and structure the information needed for a renewal review, while deterministic rails handle identity verification, prescription-history verification, formulary/scope checks, hard-stop safety conditions, and escalation logic. If a case is ambiguous, inconsistent, risky, or otherwise out of scope, it routes to a Utah-licensed clinician rather than being auto-renewed. The pilot does not allow new prescriptions, dose changes, medication switches, or cross-tapers to be done autonomously, but we can do those things with our providers in the loop.
Data privacy and security
We operate the pilot under a HIPAA compliance posture and treat patient-identifiable pilot data as PHI. The materials provided to Utah describe encryption in transit and at rest, role-based access controls, audit logging, separation of production and test environments, and BAAs where a vendor touches PHI.
PHI is not used to train foundation models. Sharing is limited to care delivery operations, required subprocessors, and Utah/OAIP reporting, with de-identified or aggregated reporting by default and redacted, minimum-necessary excerpts when needed for audit. I’m not going to get into a vendor-by-vendor infrastructure map in a public reply, but the operating model is designed around HIPAA controls and constrained data use.
Safety and adversarial testing
Two layers of testing for hard-stop safety conditions: deterministic rails unit tests and sandbox red-team scripts using synthetic cases that must fail closed. The agreement also requires ongoing monthly reporting to Utah, including approvals/denials, concordance, incidents, complaints, and audit materials. I don’t have anything additional to announce right now about a separate public transparency report or an independent red-team program beyond the testing and reporting structure already built into the pilot.
Clinical screening process
The AI is focused on gathering the information needed for safe renewal: medication verification, indication review, efficacy/stability, side effects/adverse effects, allergies, recent major clinical changes, and indication-relevant symptom assessment. It also includes psychiatry-specific hard stops such as suicidality/self-harm signals, mania/hypomania red flags, pregnancy-related changes, severe adverse effects, contraindications, identity mismatch, or prescription mismatch.
More broadly, the workflow is not designed to rely on a single answer from a patient to “unlock” treatment. If responses are inconsistent, unclear, or risk-signaling, the case escalates to human review. Patients can also request clinician review at any point, and pharmacists retain authority to escalate as well.
Eligibility and patient safeguards
Standard refill durations such as 30-, 60-, or 90-day renewals.
We have not tried to turn every internal threshold into a public one-line rule because some of those thresholds are medication-specific and part of the safety implementation rather than the public-facing summary.
Concordance and independent clinician judgment
We think of the concordance thresholds as safety and audit thresholds, not as instructions for clinicians to “confirm” the model. In fact, the proposal is intentionally risk-averse: it treats escalation to human review as an acceptable outcome. In other words, the workflow is designed to tolerate conservative escalation much more than unsafe autonomous approval.
Strategic / future plans
We believe that autonomy is a key to the future of healthcare and we will endeavor to expand this into other states and parts of the care journey. Literally millions of Americans are priced out of receiving care, and frankly AI is the best chance we have of collapsing the cost of care by 10x.
Similar to self-driving, this will save lives and be far higher quality than human drivers, but it requires thoughtful and rigorous execution to bring into reality.
Any future expansion will need to be earned separately, based on safety data, operational performance, and the relevant legal/regulatory process.
Lmk if you want to chat more,
Arthur
Where the questions came from
Seventeen questions went to Legion. Some came from reader concerns raised in the comments on Article 1, particularly around model architecture, data handling, and adversarial testing. Others were my own, focused on clinical operations and accountability.
They grouped into seven themes: model architecture and technical design, data privacy and security, safety and adversarial testing, clinical screening, eligibility and patient safeguards, oversight and accountability, and strategic direction.
The full 17 questions, with reader attributions, are in Appendix A.
The scorecard
A brief note on the assessments below. They reflect what the public record contains after Arthur’s response.
Note: Several “not answered” items may reflect policy areas Legion has not yet articulated publicly rather than refusals to engage. Where Arthur declined specific questions with reasons, those reasons are noted.
The five buckets:
Substantially answered: Q2, Q4, Q7, Q14
Partially answered: Q1, Q6, Q9, Q10, Q17
Current practice answered, rights silent: Q5
Deliberately declined: Q3, Q11
Not answered: Q8, Q12, Q13, Q15, Q16
Model architecture and technical design
Q1: Why an LLM-based conversation rather than a structured form for prescription renewals? Partially answered.
Arthur describes the hybrid architecture--deterministic rails plus a model-driven conversation layer--but does not directly address why natural-language conversation is needed for renewal intake rather than a structured form or expert system. The closest he comes is implying the LLM handles conversational information-gathering while the rules engine evaluates the outputs. The “why not a form?” question remains open, and it matters, because the choice to introduce an LLM at the intake layer is what creates the prompt-robustness and adversarial-testing questions that follow.
Q2: Where does the boundary sit between LLM inference and deterministic logic? Substantially answered.
The most informative part of the response. The LLM gathers and structures information. Deterministic rails handle identity verification, prescription history, formulary checks, hard-stop safety conditions, and escalation logic. Ambiguous or inconsistent cases route to a human clinician. The LLM is not making the approve-or-deny decision; the rules engine evaluates the structured outputs.
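To make that division of labor concrete, here is a minimal sketch of the general "deterministic rails around an LLM intake" pattern Arthur describes. To be clear: every name and threshold below is my own hypothetical illustration of the architecture class, not Legion's actual implementation. The point it demonstrates is that the model only produces a structured intake; a plain rules engine makes the decision, and every ambiguous or risk-flagged path fails closed to a human clinician.

```python
# Hypothetical sketch of "deterministic rails around an LLM intake."
# All names (Intake, decide, HARD_STOPS) are invented for illustration;
# this is not Legion Health's code.

from dataclasses import dataclass

# Psychiatry-specific hard-stop conditions, per the categories Arthur lists.
HARD_STOPS = {"suicidality", "mania_flags", "pregnancy_change", "severe_adverse_effect"}

@dataclass
class Intake:
    """Structured output the conversation layer would hand to the rules engine."""
    identity_verified: bool
    rx_history_matches: bool
    in_formulary_scope: bool
    flags: set        # risk signals extracted from the conversation
    consistent: bool  # did the patient's answers agree across the interview?

def decide(intake: Intake) -> str:
    """Deterministic rails: the LLM never decides; it only fills in the Intake."""
    # Verification checks: any failure escalates rather than auto-renews.
    if not (intake.identity_verified
            and intake.rx_history_matches
            and intake.in_formulary_scope):
        return "escalate_to_clinician"
    # Hard-stop safety conditions fail closed.
    if intake.flags & HARD_STOPS:
        return "escalate_to_clinician"
    # Ambiguity or inconsistency also fails closed.
    if not intake.consistent:
        return "escalate_to_clinician"
    return "approve_renewal"
```

The design property worth noticing is that approval is the only branch requiring every check to pass; every other branch ends at a human. That is what "tolerates conservative escalation much more than unsafe autonomous approval" looks like structurally.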
Data privacy and security
Q3: Where is patient clinical data stored, and on which cloud infrastructure? Deliberately declined.
“I’m not going to get into a vendor-by-vendor infrastructure map in a public reply.”
Arthur confirms HIPAA compliance posture, encryption in transit and at rest, role-based access controls, audit logging, environment separation, and BAAs with vendors. This is a defensible boundary for a public reply, though it means the specific infrastructure question stays open to readers who want to evaluate it independently.
Q4: Does patient data transit through third-party LLM provider servers? Substantially answered.
The pairing of “PHI is not used to train foundation models” with “BAAs where a vendor touches PHI” addresses the training-data and interaction-log concerns readers raised on Article 1. (Business associate agreements are the HIPAA contracts that bind vendors to the same privacy obligations as the covered entity.) The strong inference is that PHI transits through LLM APIs under BAA protection and is not retained for foundation-model fine-tuning.
Q5: Does Legion retain rights to commercialize de-identified or aggregated patient data? Current practice answered, rights silent.
This is a subtler read than it looks. Arthur describes current sharing practice when he says that “Sharing is limited to care delivery operations, required subprocessors, and Utah/OAIP reporting”; however, this does not address retained commercialization rights.
A company can constrain current sharing while preserving future rights to monetize de-identified psychiatric adherence data, which is exactly the category most commercially valuable. The distinction between present practice and retained rights is something I remain curious about.
Safety and adversarial testing
Q6: Has the production renewal system undergone independent adversarial testing? Partially answered.
Two internal testing layers are described: deterministic rails unit tests and sandbox red-team scripts with synthetic cases that must fail closed. But the word “independent” is not engaged. This is internal testing, not third-party adversarial evaluation. The idea of a Doctronic-like jailbreak (which, I want to make clear, happened to a different company and different type of service) still remains a potentially relevant concern.
Arthur is honest here rather than overclaiming, which deserves credit. That does not change the fact that the gap is real, and so is the threat.
Q7: Will Legion publish public transparency or safety reports beyond what is submitted to Utah? Substantially answered. The answer is no, for now.
No plans for a separate public transparency report or independent red-team program beyond the reporting structure built into the pilot. Reports go to Utah’s Office of Artificial Intelligence Policy, not to the public. Straightforward.
Clinical screening
Q8: Does the workflow use validated clinical screening instruments such as the PHQ-9 or GAD-7, or a proprietary question set? Not answered.
Arthur describes categories of information gathered--medication verification, side effects, symptom assessment--but does not specifically name any validated instruments.
The phrase “indication-relevant symptom assessment” is doing a lot of work without being concrete. Given how cheap and universally accepted validated instruments are in psychiatric care, the silence is itself informative.
Q9: How does the workflow handle subtle or indirect patient responses, such as a patient minimizing side effects because they fear losing access to their medication? Partially answered.
“The workflow is not designed to rely on a single answer from a patient to ‘unlock’ treatment. If responses are inconsistent, unclear, or risk-signaling, the case escalates to human review.”
That addresses the structural safeguard. It does not address the core clinical concern: what about a patient who gives consistent, clear, non-flagging answers that happen to be inaccurate because they are afraid? The system catches ambiguity. The concern was convincing denial, which would not trigger inconsistency flags. Still open.
Eligibility and patient safeguards
Q10: What is the standard refill duration? Partially answered.
“Standard refill durations such as 30-, 60-, or 90-day renewals.” This answers the literal question but leaves the operational implication open.
How the duration is selected--per patient, per medication, per risk profile--is not disclosed. The clinical implications across that range are not small, but it is fair that this cannot be generalized.
Q11: How is “no recent medication changes” quantified in the stability eligibility criteria? Deliberately declined.
“We have not tried to turn every internal threshold into a public one-line rule because some of those thresholds are medication-specific and part of the safety implementation.”
This is defensible. Publishing exact thresholds could create incentives to game them. The specifics remain undisclosed.
Q12: Is eligibility graded by patient risk profile, or binary? Not answered.
No engagement with graded eligibility, risk tiers, or individualized assessment criteria. The response pivots to why thresholds are not public rather than describing whether they are tiered.
Q13: Does the system assess a patient’s capacity for independent help-seeking or safety plan execution between renewal interactions? Not answered.
No engagement.
Oversight and accountability
Q14: How does Legion think about the tension between the >98% concordance threshold and independent physician judgment? Substantially answered.
The best answer of the bunch, in my opinion. The concordance thresholds are framed as safety and audit tools rather than instructions to confirm the model, and the workflow is designed to tolerate conservative escalation much more than unsafe autonomous approval; in other words, it is biased toward human escalation rather than automatic approval.
One caveat: this is stated design intent, not demonstrated outcome. Pilots often intend conservative behavior and drift as volume scales. The monthly OAIP reports are where intent gets tested against reality.
Q15: Are prescriptions flagged as AI-facilitated when transmitted to pharmacies? Not answered.
See the dedicated section below.
Q16: Who holds clinical liability if an AI-facilitated renewal leads to patient harm? Not answered.
See the dedicated section below.
Strategic
Q17: How does Legion’s nationwide ambition align with Utah’s framing of the sandbox as temporary? Partially answered.
Arthur is transparent about the ambition: “we will endeavor to expand this into other states and parts of the care journey.” He frames it through the access and cost argument and offers a self-driving analogy. The key on-the-record commitment: “Any future expansion will need to be earned separately, based on safety data, operational performance, and the relevant legal/regulatory process.”
That is a reasonable framing. It does not, however, square the two framings against each other. Utah describes this as temporary mitigation. Legion’s leadership describes it as the beginning of something much bigger. Both can be true at the same time. The tension between them does not resolve itself.
Why Q15 deserves its own section
Pharmacists in American healthcare hold corresponding responsibility with prescribers for the safety of any prescription dispensed. In plain language: a pharmacist shares legal and professional accountability with the prescriber for making sure a prescription is appropriate before it goes to the patient. It is the operational foundation of how dispensing works.
A pharmacist who fills a prescription that turns out to be inappropriate--wrong dose, dangerous interaction, a contraindication the prescriber missed--is legally and professionally accountable for that dispensing decision, independently of the prescriber’s responsibility. It is shared liability, not transferred liability.
That principle has a specific implication in a world with AI-facilitated prescriptions. A pharmacist who dispenses an AI-facilitated renewal carries professional and potentially legal exposure for that dispensing decision too. And that exposure is harder to discharge responsibly if the pharmacist does not know the prescription was AI-facilitated in the first place. That is why Q15 is more than academic.
Three specific things are unresolved in the public record:
Whether these prescriptions carry an AI-facilitated indicator when they transmit to the pharmacy;
What information accompanies them if they do; and
Whether dispensing pharmacists are notified in a way that supports their professional obligations.
Consider what a pharmacist actually does at the point of dispensing. A drug utilization review runs against the patient’s history and checks for interactions, duplications, and contraindications. Counseling obligations kick in, particularly for psychiatric medications, where having a conversation with the patient can surface adherence issues, side effects, or deterioration that the AI intake might have missed. Professional judgment runs on whether to dispense at all. Documentation lands in the dispensing record if questions arise later.
Each of those steps is shaped by what the pharmacist knows about the prescription’s origin. A pharmacy-facing flag is how corresponding responsibility gets operationalized in a world where some prescriptions are AI-facilitated. Its absence does not remove the pharmacist’s accountability. It just makes that accountability harder to discharge. That is the patient-safety concern underneath the question.
I will not give up on seeking clarification for this particular issue. The implications for pharmacy practice are far too important not to address immediately.
Q16: Liability will have to play out in practice, in legislation, and in court
The question of where clinical liability sits when an AI-facilitated renewal leads to patient harm is not one the public record resolves today, and it may not be fully resolvable until three things happen.
Operational practice across dispensing workflows, concordance reviews, and escalation events will show how responsibility actually gets assigned when something goes wrong.
Legislation (state-level AI-in-healthcare statutes, federal guidance, medical and pharmacy board positions) will either codify pieces of it or leave them unsettled.
Courts will eventually need to rule when harm occurs and someone sues, and case law will do what case law always does in new healthcare domains.
Until those three venues have weighed in, naming specific loci where liability sits is speculation. What is not speculation is that the gap is real, it affects every party in the dispensing chain, and the gap grows with every state expansion.
The yardstick
Arthur made one on-the-record commitment worth preserving.
“Any future expansion will need to be earned separately, based on safety data, operational performance, and the relevant legal/regulatory process.”
That’s not me trapping Legion Health, because I did not ask Arthur to say that. It is a standard they have chosen to publicly accept. This newsletter will hold them to it.
Because safety data, operational performance, and legal and regulatory process are not vague phrases. They have specific referents.
Safety data means the monthly OAIP reports, which include approvals, denials, concordance, incidents, and complaints. Operational performance includes what happens at the dispensing window, not just at the intake screen. Legal and regulatory process includes state board positions, liability frameworks, and the scrutiny that comes when a pilot tries to become a product.
If Legion expands to a second state--and their leadership has said they will try--these are the things everyone who is affected by the U.S. healthcare system should be asking about right now.
Closing
Article 1 of this series asked readers to move from knee-jerk reactions to thoughtful questions. Article 2 asks readers to move from the questions to a framework.
Two things are true, and they’re true at the same time: Legion engaged in good faith. The gaps in their answers are real. Holding both together is the real work.
Keep one more thing in mind: this article’s method of assessment (questions answered, questions declined, questions left silent) is not specific to Legion Health, to the state of Utah, or to psychiatric renewals.
It is a reading posture for every healthcare AI story that is going to land in your feed over the next year. Companies will engage. Some will engage well. Few will answer everything. The readers who can tell the difference between engagement and resolution are the ones who will hold the industry to the right standard.
And I will be here every step of the way to help it all make sense.
Appendix A: The 17 questions I sent to Legion
Questions below are reproduced as sent. Reader-attributed questions reflect concerns raised in the comments on Article 1 and elsewhere on Substack; otherwise, the questions are my own.
MODEL ARCHITECTURE & TECHNICAL DESIGN
Legion’s CTO has stated publicly that the system uses frontier LLM APIs rather than proprietary models. Can you share the clinical rationale for choosing an LLM-based conversational interface over a rules-based expert system or structured form for prescription renewals specifically?
h/t AD from AI Governance Lead ⚡ and David - Tech Translator
Where does the boundary sit between LLM inference and deterministic rule-based logic in the renewal workflow? Does the LLM exercise any independent clinical judgment (for example, interpreting an ambiguous patient response about side effects)?
h/t David
DATA PRIVACY & SECURITY
Can you specify where patient clinical data is stored and which cloud infrastructure hosts it?
h/t Richard Ferraro
When patient interactions are processed through third-party LLM APIs, does any patient data transit through or persist on LLM provider servers?
h/t Richard
Does Legion retain rights to commercialize de-identified or aggregated patient data (adherence patterns, symptom trends, escalation data), or share it with third parties?
h/t Richard
SAFETY & ADVERSARIAL TESTING
Has the production renewal system undergone independent adversarial (red-team) testing? This question comes up frequently given the Doctronic precedent in the same Utah sandbox.
h/t Richard
Does Legion plan to publish any transparency or safety reports on testing methodology and results, beyond what is submitted to Utah’s OAIP?
h/t AD
CLINICAL SCREENING PROCESS
What specific screening instruments does the AI renewal workflow use? Are they validated clinical tools (e.g., PHQ-9, GAD-7), or a proprietary question set?
h/t Cant_Frame_Me
How does the system handle subtle or indirect patient responses? For instance, a patient minimizing side effects like increased suicidal ideation because they fear losing access to their medication.
h/t Jennifer Hotes
ELIGIBILITY & PATIENT SAFEGUARDS
What is the standard refill duration (30, 60, or 90 days)?
h/t E L Frederick
How is “no recent medication changes” quantified in the stability eligibility criteria, i.e., what is the specific time period?
Is eligibility graded by patient risk profile (symptom severity, functioning level, self-reporting accuracy), or is it binary?
h/t Cant_Frame_Me
Does the system assess a patient’s capacity for independent help-seeking or safety plan execution between renewal interactions?
h/t Cant_Frame_Me
OVERSIGHT & ACCOUNTABILITY
The Phase 1 concordance threshold (>98%) is designed as a quality benchmark, but some observers have raised the concern that it may also create a structural incentive toward confirmation rather than independent physician judgment. How does Legion think about this tension?
Are prescriptions flagged as AI-facilitated when transmitted to pharmacies? If so, what information accompanies them?
Who holds clinical liability if an AI-facilitated renewal leads to patient harm?
STRATEGIC
Legion’s leadership has described ambitions to go nationwide and called the pilot “the beginning of something much bigger than refills.” How does that trajectory align with Utah’s framing of the sandbox as a temporary, 12-month regulatory mitigation?
Read how I use AI in my writing here: AI Use Policy
Read how I use analytics to improve my newsletter here: Privacy & Analytics


