
Failure to constantly evaluate generative AI output is like driving a car, in a thunderstorm, at midnight, with no headlights. Good luck getting to your destination!
But there is good news: a newly released executive order on AI from the White House calls for an evaluation ecosystem for AI. This can help set the baseline for scaling generative AI across the military, including the Space Force.
The implementation of this ecosystem may not happen fast enough to outpace rivals like China, which has researched and proposed evaluation benchmarks of its own. As a technology startup founder and leader of multiple companies operating at the intersection of national security and AI for most of the past 25 years, including in China, I want to deploy the open-source benchmarking and evaluation that works well in the commercial sector and that can be freely transferred to the Department of Defense. And that can be done now.
The Department of Defense should constantly evaluate the intelligence derived from generative AI. When I buy a car, I expect the factory assembly line to have quality control baked into the automotive manufacturing process, so my final product is drivable and safe. But even if the car passes all its safety inspections, it is only as safe as the driver controlling it. A car cruising down the road has two main assurance processes: one for the vehicle, and one to ensure the driver is properly licensed.
Warfighters should expect to control generative AI the way drivers control automobiles. Large language model providers in the commercial arena are now working hard to craft safety and quality controls, as the Department of Defense has been expecting: over a year ago, it announced a deal with Scale AI to bring AI benchmarking to the military.
But operators require access now at the tactical level. They cannot wait. And while they hang fire, in the hands of an “unlicensed” team of Space Force guardians who are not maintaining control of the quality of the large language model’s results, generative AI will drift and crash. Just like a poorly maintained car.
Avoiding “Oops” in Space Data Operations
Unreliable AI outputs from models lacking quality control at the tactical level could produce flawed intelligence assessments, leading to strategic miscalculations and unintended escalation in a contested environment. Imagine going to war on flawed intelligence!
I’ve worked in AI for over 20 years, beginning in the late 1990s as head of internet operations in Beijing for a non-Chinese firm using early natural language processing to deliver intelligence to the United States, its allies, and multinational corporations. I later founded and sold several software companies in Asia, then returned to the United States as chief executive of a real-time people disambiguation startup and later as engineering chief at an AI firm.
I wanted to serve my country in uniform and was honored to accept a direct commission into the Space Force last year. I want to be sure that the Space Force continues to rely on new tools to generate critical and lethal capabilities of the type the commercial sector has been using for decades. Warfighters can avoid future mistakes by adopting some of the practices that AI teams in both commercial and academic arenas already use to safeguard their work. And while the military's gears churn through requirements and authorization to operate, there are practical ways to operationalize benchmarking and evaluation right now.
Why are evaluation and benchmarking essential when using generative AI? Because unreliable AI jeopardizes mission success and endangers personnel. Evaluation and benchmarking are not just crucial; they are non-negotiable for the operational integrity and reliability of generative AI systems. Without ongoing testing and rigorous comparison against established standards, the outputs of these models are untrustworthy, until verified, for missions where precision is paramount. Failure to continually assess their performance will inevitably lead to degradation, transforming powerful tools into liabilities that can introduce catastrophic errors into critical military operations. This comprehensive, unyielding scrutiny is not merely best practice. It is a fundamental requirement for seizing and maintaining the decisive advantage that prevents the kind of strategic vulnerabilities that can cost us everything.
To be clear, I am not suggesting that the United States should use tax dollars to evaluate and benchmark, at the strategic level, the overall safety and bias of new large language models. The White House has already ordered this. Instead, because no holistic evaluation system can ever capture every evolving tactical mission use case, quality assurance needs to be baked into the processes of operators planning and executing close-in maneuvers. Prompt engineering is currently the best tool to do this. When used wisely at the small-unit level by subject-matter experts, it can catch and correct large language model failures within the evaluation ecosystem.
The Department of Defense is already relying on generative AI and large language models. But what happens when the large language models are updated? How does an operator evaluate whether a new model will still provide the same precision and accuracy as the older version? How does an operator know whether they are maintaining quality responses over time?
A robust evaluation ecosystem solves these challenges. But benchmarks of natural language processing and AI work are largely missing at the tactical level where guardians operationalize their generative AI work. And when I talk to guardians about this, they are either unaware of the need for benchmarks and continuous evaluation at the tactical level, or they are uncertain how to begin operating within the machine learning operations lifecycle.
There is a way to create workable, efficient, and inexpensive solutions to evaluate and maintain consistency of results when utilizing large language models. Sure, eventually the military can implement expensive, bloated solutions that mimic the inexpensive ones that get the same job done. That is the way of the Department of Defense of yore. But until then, this is something nearly any small team in the Department of Defense can manage. In fact, because of the variety and nature of military vertical data, it is really something that is challenging to outsource to third parties: It works best when cradled close to the team of subject-matter experts gathering the intelligence. These are skills the military can develop and use in-house across the Department of Defense as it is both tightening its belt and maximizing lethality.
Following Industry Best Practice
For the past two decades, as I have grown and run natural language processing teams around the world, I have relied on daily and weekly cadences to get the latest benchmarks from my teams on the quality of our enrichment services. Natural language processing is the way machines understand, interpret, and generate human language, a field whose origins trace back to the 1950s with pioneering efforts in machine translation. Natural language processing provides the foundational intelligence layer, allowing computer systems to dissect and comprehend raw linguistic data from the field, irrespective of its origin or complexity. The latest generative AI rides on the back of graphics processing units and is a force multiplier: it leverages decades of natural language processing advancements to create new, human-like content, enabling everything from rapid intelligence summaries to dynamic operational planning based on interpreted commands.
When evaluating natural language processing results, the startup teams I led would use globally recognized benchmarks like Bilingual Evaluation Understudy (BLEU) or Metric for Evaluation of Translation with Explicit Ordering (METEOR) to evaluate whether the model's Chinese-to-English or Russian-to-English translations were continuing to deliver great results or whether their quality was drifting. Think of these as a quality control checklist for translations, ensuring that the machine's translation is not only understandable but also grammatically and contextually correct when compared to a professional human's version. Each week I would review my teams' Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores to make sure we were correctly summarizing large groups of documents into readable, actionable intelligence. This is like a student being graded on how well they summarized a long book chapter, checking to see whether all the important points made it into their succinct summary.
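For teams that want to see what this looks like in practice, here is a minimal sketch in Python, assuming the open-source sacrebleu and rouge-score packages are available. The translations and summaries below are hypothetical stand-ins for real reference data:

```python
# Minimal sketch: periodic BLEU and ROUGE checks against a held-out reference set.
# Assumes the open-source sacrebleu and rouge-score packages; example strings
# are hypothetical stand-ins for real human-verified references.
import sacrebleu
from rouge_score import rouge_scorer

# Machine outputs and human-verified references for the same source documents.
machine_translations = ["The satellite entered a transfer orbit on Tuesday."]
human_references = ["The satellite moved into a transfer orbit on Tuesday."]

# Corpus-level BLEU: higher scores mean closer agreement with the human version.
bleu = sacrebleu.corpus_bleu(machine_translations, [human_references])
print(f"BLEU: {bleu.score:.1f}")

# ROUGE-1 / ROUGE-L for summaries: did the machine keep the important points?
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(
    "Launch delayed two days due to upper-stage anomaly.",      # reference summary
    "The launch slipped two days after an upper-stage anomaly.", # model summary
)
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.2f}")
```

Tracking these numbers week over week is what reveals quality drift before it reaches a finished product.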
Or I would track the sentiment analysis quality of our Open Source Intelligence news and financial data using a combination of precision, recall, and accuracy measured against our financial phrasebank. This is like teaching a computer to be a stock market analyst, training it to read a news article and correctly identify whether the tone about a company is positive, negative, or neutral, for instance by checking that it classifies an article about "record profits" as positive and one about "unexpected losses" as negative. I could deliver actionable risk and threat intelligence about what was happening on social media using a variety of datasets for comparative evaluation. And there are many other benchmarks I could use to monitor ethics, toxicity, and Retrieval-Augmented Generation quality.
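A similar check for sentiment scoring can be a few lines of scikit-learn, sketched below with hypothetical labels standing in for a financial phrasebank-style test set:

```python
# Minimal sketch: scoring a sentiment classifier against a labeled test set.
# Labels and predictions are hypothetical; in practice they come from human
# annotators and the model under evaluation.
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Gold labels from analysts vs. what the model predicted for the same articles.
gold = ["positive", "negative", "neutral", "positive", "negative"]
predicted = ["positive", "negative", "neutral", "negative", "negative"]

print(f"Accuracy:  {accuracy_score(gold, predicted):.2f}")
print(f"Precision: {precision_score(gold, predicted, average='macro'):.2f}")
print(f"Recall:    {recall_score(gold, predicted, average='macro'):.2f}")
```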
If our benchmark scores kept getting better, we would document what we were doing so we could double-down on the good changes we were making to our algorithms and prompts. If we instead saw our scores declining, we would dive into why they might have regressed and look for corrections to set us up for success.
Guardians want the same type of results, but without the financial and manpower overhead of operating the kind of AI-focused commercial organizations I ran. So here is a basic plan for executing, within a small team, the kind of oversight that gives it many of the controls much larger organizations rely on. It is the method by which commercial AI companies bootstrap, and it reduces the need to plug BLEU or ROUGE into a complicated test harness. Instead, it focuses on doing more with less through good prompt engineering and logging.
Why is this important? If a guardian's job is to synthesize open-source orbital data and then formulate actionable space intelligence, they may want to build a generative AI pipeline that uses a few engineered prompts. As a start, they may write a prompt that generates a known result. By doing so, they can be sure, in that instance, that the prompt is well written. It is like testing whether a calculator correctly adds two numbers together. Then they can re-run that prompt on other data and verify whether they are still getting accurate results. Does the calculator still give the right answer? Once they are satisfied with the ongoing results, they can start scaling up that pipeline and prompt with larger amounts of data.
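A minimal sketch of that calculator-style check might look like the following, where call_model() is a hypothetical placeholder for whatever approved model endpoint the team actually uses and the orbital reports are invented examples:

```python
# Minimal sketch of the "calculator check": run an engineered prompt against
# inputs whose correct answers are already known, and only scale up if the
# model keeps returning them. call_model() is a hypothetical stand-in for the
# team's actual large language model endpoint.
PROMPT = "Extract the orbital regime (LEO, MEO, GEO, or HEO) from this report:\n{report}"

# Reports paired with answers the team has already verified by hand.
known_cases = [
    ("Object tracked at roughly 550 km altitude, 53 degree inclination.", "LEO"),
    ("Spacecraft stationed at 35,786 km over the equator.", "GEO"),
]

def call_model(prompt: str) -> str:
    raise NotImplementedError("Replace with the team's approved model endpoint.")

def sanity_check() -> bool:
    """Return True only if every known case still comes back correct."""
    for report, expected in known_cases:
        answer = call_model(PROMPT.format(report=report))
        if expected not in answer:
            print(f"FAIL: expected {expected!r}, got {answer!r}")
            return False
    return True
```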
The Basic Plan
To maintain operational integrity and ensure actionable insights from large-scale large language model-driven data ingestion, a small team should assign a single operator as the quality assurance sentinel. You can name the role anything you’d like, and I think the three words “quality assurance sentinel” correctly identify the role of being vigilant about quality and assuring it is maintained. At companies I’ve run in the past, the quality assurance sentinel would be called a search analyst or an insights analyst. Regardless of the name, this individual acts as the central authority on prompt performance, model reliability, and output fidelity at the tactical level. The quality assurance sentinel owns the end-to-end oversight of generative AI outputs and ensures drift, degradation, or hallucination do not compromise mission-critical intelligence products.
The quality assurance sentinel does not need to be savvy about algorithms, but they should have an excellent grasp of their own job domain. If they are focused on spectrometry data, they should be able to tell whether the data in front of them is ionospheric or gravimetric. If they are responsible for navigational work, then let's hope they can spot a two-line element set. Their new task as quality assurance sentinel is to second-guess and oversee the model, so they need to be an expert in their domain. This is a big reason why outsourcing this niche but mission-critical task to third parties may not be a good idea. Don't trust commercial products that promise to deliver this subject-matter expertise to the warfighter!
The quality assurance sentinel’s first responsibility is to establish a baseline operational framework for generative AI use cases. Whether document summarization, signal extraction, intelligence fusion, or sentiment triage, all missions ought to be clearly scoped with defined success criteria. This process may take a few weeks as the quality assurance sentinel talks with supervisors and members of the team about defining the successful mission. This includes hard metrics such as factual accuracy, latency, and hallucination rate, as well as soft metrics such as relevance, clarity, and tone.
The quality assurance sentinel maintains a master Evaluation Control Sheet, version-controlled and accessible to the entire team, that tracks all model interactions, inputs, outputs, and scores. The Evaluation Control Sheet can be a simple spreadsheet, and the scoring can be created, for the time being, by the team. There is no need to spend big bucks if the aim is to do more with less.
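In practice, the control sheet can be as simple as a comma-separated file that a short script appends to after every model interaction. The column layout below is one hypothetical arrangement, not a mandated format:

```python
# Minimal sketch: an Evaluation Control Sheet kept as a plain CSV file that is
# checked into version control. Column names are one hypothetical layout.
import csv
from datetime import datetime, timezone
from pathlib import Path

SHEET = Path("evaluation_control_sheet.csv")
COLUMNS = ["timestamp", "use_case", "model", "prompt_id",
           "input_ref", "output_ref", "score", "notes"]

def log_interaction(use_case, model, prompt_id, input_ref, output_ref, score, notes=""):
    """Append one model interaction to the shared control sheet."""
    new_file = not SHEET.exists()
    with SHEET.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(COLUMNS)
        writer.writerow([datetime.now(timezone.utc).isoformat(), use_case, model,
                         prompt_id, input_ref, output_ref, score, notes])
```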
The quality assurance sentinel then builds a static test set representing key mission scenarios (approximately 20 to 50 samples per use case). This test set is run periodically or on any update to models or prompts. The quality assurance sentinel executes A/B testing between model variants (e.g., GPT-4o vs. Claude 3) and scores responses against the predefined metrics. All changes in model behavior, prompt structure, or performance degradation should be logged and acted on. The quality assurance sentinel should constantly be asking, "What is different from before? Is the quality of the output better or worse?"
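The A/B comparison itself can be a small harness that pushes the same static test set through both model variants and averages the scores. In the sketch below, call_model() and score_response() are hypothetical placeholders for the team's actual endpoint and scoring rubric:

```python
# Minimal sketch of the A/B check: run the same static test set through two
# model variants and compare average scores. call_model() and score_response()
# are hypothetical placeholders.
from statistics import mean

def call_model(model_name: str, prompt: str) -> str:
    raise NotImplementedError("Replace with the approved model endpoint.")

def score_response(response: str, expected: str) -> float:
    """Return a 0-1 score against the team's rubric (placeholder)."""
    raise NotImplementedError

def ab_test(test_set, model_a: str, model_b: str):
    """test_set is a list of (prompt, expected_answer) pairs, roughly 20-50 items."""
    results = {}
    for model in (model_a, model_b):
        scores = [score_response(call_model(model, prompt), expected)
                  for prompt, expected in test_set]
        results[model] = mean(scores)
    return results  # e.g. {"model_a": 0.91, "model_b": 0.87}
```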
To prevent prompt drift and maintain configuration control, the quality assurance sentinel maintains a centralized prompt repository under version control (Git or equivalent). Every prompt edit, model parameter change, and output deviation should be documented. The quality assurance sentinel flags anomalies and enforces rollback if output quality degrades. In the commercial world, this repository is valuable intellectual property that quickly becomes the secret formula for making sense of seemingly disparate data.
Drift and anomalies are tracked with simple red/amber/green status indicators per use case. The quality assurance sentinel chairs weekly Quality Assurance Standups, delivering situation reports on large language model performance. These briefings to both the team and leadership ensure everyone stays aligned on what is operationally viable and what needs recalibration.
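Turning scores into red/amber/green flags for the standup can be equally simple; the thresholds in this sketch are hypothetical and should be tuned to each use case:

```python
# Minimal sketch: turning week-over-week score changes into the red/amber/green
# indicators briefed at the standup. Thresholds are hypothetical.
def status(previous_score: float, current_score: float) -> str:
    drop = previous_score - current_score
    if drop <= 0.02:   # holding steady or improving
        return "GREEN"
    if drop <= 0.10:   # noticeable drift, watch closely
        return "AMBER"
    return "RED"       # significant degradation, roll back and investigate

print(status(0.92, 0.93))  # GREEN
print(status(0.92, 0.85))  # AMBER
print(status(0.92, 0.70))  # RED
```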
The quality assurance sentinel also builds and maintains a lessons learned repository to capture model behavior quirks, effective prompt strategies, and prior failures. This can be a simple spreadsheet or ongoing written document. Importantly, this becomes institutional knowledge and ensures long-term survivability and repeatability, even under personnel turnover or high operational tempo. SharePoint or Confluence are good locations for this repository if a spreadsheet is too lowbrow and if subscriptions for these higher-end platforms are readily available.
And the quality assurance sentinel should also try pushing the boundaries of each model, especially if it’s an off-the-shelf commercial model. As Benjamin Jensen, Yasir Atalan, and Ian Reynolds wrote in these pages, “commercial guardrails are unwarranted or even dangerous” in some military scenarios, so the quality assurance sentinel should understand how to blow the metaphorical tachometer off the large language model. Imagine a scenario where the team engages in information warfare and needs the model to output data that would otherwise be considered unsafe, or where a cyber squadron wants to input malware code to help find patterns or artifacts quickly. The quality assurance sentinel should red-team the limits to understand how to exceed them.
The rest of the team then focuses on ingestion, annotation, and exploratory analysis, while the quality assurance sentinel acts as the final gatekeeper before intelligence is disseminated or used in decision loops. All outputs used in briefings, products, or dissemination will need to pass quality assurance sentinel validation. Guardians can think of it as a pre-mission loadout check before go-time.
Bottom line: In a small generative AI cell, the quality assurance sentinel becomes the standard-bearer for model performance, prompt discipline, and quality control. This decentralized yet controlled structure empowers the team to operate at speed without sacrificing trust in the output. The team moves fast, but the quality assurance sentinel ensures they don’t move blindly.
This person fills the critical role of making sure the team is driving a well-maintained car, staying on the road, and avoiding obstacles. The use of a quality assurance sentinel within any team helps maintain quality results. As funding and organizational demands grow in the Space Force, third-party platforms that do some of the quality tracking can be introduced. But in the absence of those tools, a quality assurance sentinel can be a great, inexpensive addition to the team.
The Future
In summer 2019, my business was monitoring Chinese social media chatter. We started collecting conversations about a “new SARS” — the respiratory disease that gave the 21st century its first pandemic.
As autumn arrived and Chinese netizens started absorbing the enormity of what was happening, colloquialisms and slang references to the new illness appeared. This new lingo was meant to outflank the Chinese Communist Party's online censorship. Importantly, our company's version of a quality assurance sentinel was able to capture and align output from the conversations that the models missed or misinterpreted. Without our own quality assurance sentinel, our business would not have been able to pivot swiftly based on new data demands. My company was then able to deliver information to our customers, who gained better situational awareness. This was critical for maintaining an advantage on the ground during the start of COVID-19, and it is going to be much more important for dynamic and data-rich operational environments like space.
Importantly, generative AI has more immediate use cases for the military within the enterprise environment. Mission environments rely on other types of AI, such as computer vision, sensor fusion, robotics, and unmanned vehicles. But generative AI is quickly becoming the user interface into these other areas of AI, and so, by drawing on the prompt engineering lessons and processes of quality assurance sentinels, the evaluation ecosystem can grow into these other areas as well.
Robust benchmarking from the quality assurance sentinel hones the operational tempo by providing operators with high-confidence outputs, enabling faster decision-making and more decisive action. It is an important part of the commercial AI toolkit, and it should be the same for the military's materiel. The quality assurance sentinel role will eventually be deprecated and, without irony, AI will replace it. At that point, AI will be monitoring its own progress with hardly any human intervention or oversight. It will be another cog in the algorithmic warfare envisioned for the future.
But until that happens, humans ought to be in the loop for smaller teams working on critical mission systems influenced by generative AI output.
Daniel Levinson is a graduate of the U.S. Air Force Air War College and is currently pursuing multiple graduate certifications at National Defense University. He has a master's in cybersecurity from Georgia Tech and a master's in computer science from the University of Hong Kong. He earned his bachelor's in English from William and Mary. For more than 25 years he has been a tech entrepreneur, company chief executive, and investor around the globe, and has had multiple million-dollar exits from his businesses. He received a direct commission as an active-duty lieutenant colonel in the Space Force in 2024. Before commissioning, he divested all commercial holdings with any defense business interests.
The arguments in this article are those of the author and do not reflect the official positions of the U.S. Space Force or Department of Defense. Appearance of, or reference to, any content, companies, outlets, views, or opinions does not constitute Department of Defense endorsement.
Image: Midjourney