
AI systems are only as fair and safe as the data they’re built on. While conversations about AI ethics often focus on model architecture, algorithmic transparency or deployment oversight, fairness and trust are established much earlier—during data collection, labeling and preprocessing.
When AI teams deliberately and carefully design their training datasets, they can actively reduce bias, mitigate risk and improve model reliability. Below, members of Forbes Technology Council share proven data design strategies and explain how these practices can help create safer, more equitable AI systems that businesses and users can trust.
1. Use ‘What If’ Versions Of Training Data
A simple way to improve AI fairness is to use “what if” versions of training data, where we change identity-related details (like name or gender) while keeping all factual information the same. If the AI makes different decisions due to these changes, we penalize it during training. This ensures it focuses on relevant information and ignores unfair biases, making it safer and more equitable. – Nishit Mehta, JP Morgan Chase
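As a rough sketch of this idea (not the contributor's actual system), assume tabular applicant records with hypothetical name, gender and factual columns. Counterfactual copies flip only the identity fields, and a consistency term penalizes any change in the model's score:

    import numpy as np
    import pandas as pd

    # Hypothetical applicant records: identity fields plus factual features.
    records = pd.DataFrame({
        "name":   ["James", "Maria"],
        "gender": ["male", "female"],
        "income": [54000, 54000],
        "tenure_years": [6, 6],
    })

    def make_counterfactuals(df):
        """'What if' copies: flip identity-related details, keep all facts the same."""
        flipped = df.copy()
        flipped["gender"] = df["gender"].map({"male": "female", "female": "male"})
        flipped["name"] = df["name"].map({"James": "Maria", "Maria": "James"})
        return flipped

    def counterfactual_penalty(score_fn, df):
        """Mean shift in the model's score when only identity details change.
        Added to the training loss so the model learns to ignore those details."""
        original = np.asarray(score_fn(df))
        counterfactual = np.asarray(score_fn(make_counterfactuals(df)))
        return float(np.abs(original - counterfactual).mean())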
2. Be Transparent About The Who, What And How Of Training
AI practitioners should be transparent about who is involved, what data is used for training and how it impacts the algorithm’s recommendations. Teams should use tools like AI Fairness 360 (AIF360) to detect and mitigate bias in models. For example, in healthcare utilization scoring, fairness metrics measured with AIF360 help ensure care management is prioritized equitably and recommendations remain transparent. – Jayashree Arunkumar, Wipro
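For teams starting with AIF360, a check-and-mitigate pass might look like the sketch below. The file and column names ("high_need" label, "sex" protected attribute) are hypothetical placeholders; the calls are AIF360's standard dataset, metric and preprocessing APIs:

    import pandas as pd
    from aif360.datasets import BinaryLabelDataset
    from aif360.metrics import BinaryLabelDatasetMetric
    from aif360.algorithms.preprocessing import Reweighing

    # Hypothetical utilization-scoring table, already numeric.
    df = pd.read_csv("utilization_scores.csv")

    dataset = BinaryLabelDataset(
        df=df,
        label_names=["high_need"],
        protected_attribute_names=["sex"],
    )
    privileged = [{"sex": 1}]
    unprivileged = [{"sex": 0}]

    # Measure bias in the training data before any model is fit.
    metric = BinaryLabelDatasetMetric(
        dataset, unprivileged_groups=unprivileged, privileged_groups=privileged)
    print("Disparate impact:", metric.disparate_impact())
    print("Statistical parity difference:", metric.statistical_parity_difference())

    # Mitigate by reweighting examples so both groups carry comparable influence.
    reweigher = Reweighing(unprivileged_groups=unprivileged, privileged_groups=privileged)
    dataset_reweighted = reweigher.fit_transform(dataset)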
3. Collect Diverse And Balanced Data
Ensuring diverse and balanced data collection is one concrete way. If an AI fraud-detection model only ingests transaction data from one region or demographic, it may overflag or underdetect activity elsewhere. By deliberately collecting representative samples across populations, the model reduces bias and improves both fairness (equal treatment) and safety (fewer harmful errors). – Michael Roytman, Empirical Security
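A minimal sketch of that idea, assuming transaction records carry a hypothetical “region” column and the target population shares are known: measure each region’s representation gap, then resample toward the target mix.

    import pandas as pd

    # Hypothetical transaction sample and the population shares it should reflect.
    sample = pd.read_csv("transactions_sample.csv")   # includes a 'region' column
    population_share = {"north": 0.25, "south": 0.25, "east": 0.25, "west": 0.25}

    def representation_gap(df, target):
        """How far each region's share in the sample sits from its population share."""
        observed = df["region"].value_counts(normalize=True)
        return {region: observed.get(region, 0.0) - share for region, share in target.items()}

    def rebalance(df, target, n, seed=0):
        """Draw a sample whose regional mix matches the target shares."""
        parts = [
            df[df["region"] == region].sample(int(n * share), replace=True, random_state=seed)
            for region, share in target.items()
        ]
        return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle rows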
4. Ensure Underrepresented Groups Are Included
Diverse and intentional data collection—capturing varied demographics, contexts and edge cases—helps AI avoid biased assumptions and perform more equitably. Thoughtful data design ensures underrepresented groups are included, improving fairness. In safety-critical domains, it also reduces harmful errors and strengthens trust in AI decisions. – Mohit Gupta, Damco Solutions
5. Filter Out Sensitive Or Harmful Training Data
Thoughtful data design, like filtering out sensitive or harmful training content, can make AI systems safer without undermining their intelligence. For example, removing bio risk-related data from model training improved safety in open-source Pythia models while preserving general performance. This guards against misuse and ensures more responsible AI behavior. – Murtaza Amirali, DataPillar
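A bare-bones illustration of such filtering is sketched below with a hypothetical keyword blocklist; it is far simpler than the research filtering referenced above, and production pipelines typically layer trained classifiers on top, but the shape is the same: score each document and drop or quarantine it before it ever reaches pretraining.

    import re

    # Hypothetical blocklist of sensitive topics to keep out of the pretraining corpus.
    BLOCKED_PATTERNS = [
        re.compile(r"\bsynthesi[sz]e\b.+\bpathogen\b", re.IGNORECASE),
        re.compile(r"\bcredit card number\b", re.IGNORECASE),
    ]

    def is_safe(document: str) -> bool:
        """Return False if the document matches any blocked pattern."""
        return not any(pattern.search(document) for pattern in BLOCKED_PATTERNS)

    def filter_corpus(documents):
        """Split the corpus into a clean set and a quarantined set for human review."""
        clean, quarantined = [], []
        for doc in documents:
            (clean if is_safe(doc) else quarantined).append(doc)
        return clean, quarantined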
6. Apply Bias-Aware Preprocessing
Fairness and safety in AI aren’t achieved at deployment; they’re engineered upstream. The most powerful lever is bias-aware preprocessing. Stress-test data for equity, simulate how sensitive features shift outcomes, and reweight or enrich until error rates converge. This elevates preprocessing from housekeeping to a fairness checkpoint, preventing harm before models are trained and ensuring trust. – Harshida Acharya, Fulfillment IQ
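One simple form of bias-aware preprocessing is reweighting, sketched below with scikit-learn and hypothetical file and column names: each (group, label) cell gets equal total weight, and per-group error rates are printed as the convergence check.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Hypothetical training frame with a sensitive 'group' column and a binary 'label'.
    df = pd.read_csv("train.csv")
    X, y, group = df[["feature_1", "feature_2"]], df["label"], df["group"]

    # Reweight so each (group, label) cell contributes equal total weight.
    cell_counts = df.groupby(["group", "label"])["label"].transform("size")
    weights = len(df) / (cell_counts * df.groupby(["group", "label"]).ngroups)

    model = LogisticRegression().fit(X, y, sample_weight=weights)

    # Fairness checkpoint: per-group error rates should converge after reweighting.
    errors = pd.Series(model.predict(X) != y.to_numpy(), index=df.index)
    print(errors.groupby(group).mean())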
7. Normalize And Label Data To Set Guardrails
It is critical to ensure that data is discovered, normalized and labeled, as this allows it to be classified appropriately. This establishes guardrails and access controls for AI systems that protect against unauthorized access to sensitive data, both by humans and the AI itself. This is best achieved by first leveraging AI to organize and prepare the organization’s data before deploying an AI system. – Jesse Todd, EncompaaS
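As one hypothetical illustration (the tiers and roles below are invented, not a standard), classified data can be mapped to explicit rules about what people and AI systems may access or index:

    # Hypothetical sensitivity tiers mapped to who (and which AI components) may read them.
    CLASSIFICATION_POLICY = {
        "public":       {"allowed_roles": {"everyone"},         "ai_retrievable": True},
        "internal":     {"allowed_roles": {"employee"},          "ai_retrievable": True},
        "confidential": {"allowed_roles": {"finance", "legal"},  "ai_retrievable": False},
        "restricted":   {"allowed_roles": {"security_officer"},  "ai_retrievable": False},
    }

    def may_index(document_classification: str) -> bool:
        """Guardrail: only documents with an approved classification reach the AI's index."""
        policy = CLASSIFICATION_POLICY.get(document_classification)
        return bool(policy and policy["ai_retrievable"])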
8. Build ‘Fairness Tripwire’ Datasets
We build a “fairness tripwire” dataset: the same profile with only gender or race flipped must get the same answer. Label these pairs and gate releases until the parity gap meets a target. This data design strategy forces safer, fairer behavior before customers see harm. – Margarita Simonova, ILoveMyQA
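A minimal sketch of that gate, assuming paired tripwire records that differ only in the flipped attribute and a hypothetical parity target:

    import numpy as np

    PARITY_GAP_TARGET = 0.01  # hypothetical release threshold

    def parity_gap(score_fn, originals, flipped):
        """Largest score difference across pairs that differ only in gender or race."""
        diffs = np.abs(np.asarray(score_fn(originals)) - np.asarray(score_fn(flipped)))
        return float(np.max(diffs))

    def release_gate(score_fn, originals, flipped):
        """Block the release until the tripwire dataset meets the parity target."""
        gap = parity_gap(score_fn, originals, flipped)
        if gap > PARITY_GAP_TARGET:
            raise RuntimeError(f"Parity gap {gap:.4f} exceeds target {PARITY_GAP_TARGET}")
        return gap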
9. Use Ontologies And Knowledge Graphs For Traceability
Implementing an ontology makes AI more efficient and accurate, reduces hallucinations, and provides traceability. Without ontologies and knowledge graphs, an AI system has no grounded basis for being “fair,” and without traceability there is no way to establish “safety.” – Rob Buller, Cyberhill Partners, LLC
10. Oversample Edge Cases
Stratify sampling across edge cases. Instead of collecting massive datasets randomly, deliberately oversample underrepresented scenarios—rare medical conditions, non-standard user behaviors or minority demographics. This prevents AI systems from defaulting to “majority rule” decisions that can harm vulnerable groups. Small, intentional datasets often outperform huge, biased ones. – Jitender Jain
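A small sketch of stratified oversampling, assuming a hypothetical “scenario” column that tags rare conditions and behaviors: every stratum is raised to a floor count instead of being left to random collection.

    import pandas as pd

    MIN_PER_SCENARIO = 500  # hypothetical floor for every stratum

    def oversample_edge_cases(df, column="scenario", floor=MIN_PER_SCENARIO, seed=0):
        """Upsample rare strata so no scenario falls below the floor count."""
        parts = []
        for _, stratum in df.groupby(column):
            if len(stratum) < floor:
                stratum = stratum.sample(floor, replace=True, random_state=seed)
            parts.append(stratum)
        return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle rows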
11. Ensure Real-Time Inputs Are Accurate And Representative
When AI consumes real-time data, flawed inputs can spread bias and errors instantly, halting progress like grit clogging a gear. Thoughtful collection, labeling and processing keep the inputs an AI system is fed accurate and truly representative. Filtering noise, flagging anomalies and fostering governance ensure outputs are fair, safe and aligned with business goals. – Bill Rokos, Parsec Automation
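A toy version of such an anomaly flag is sketched below: readings that drift far from the recent window are routed for review instead of flowing straight into the system (the window size and threshold are arbitrary placeholders).

    from collections import deque
    import statistics

    class AnomalyFlagger:
        """Flag real-time readings that drift far from the recent window."""

        def __init__(self, window=100, z_threshold=4.0):
            self.history = deque(maxlen=window)
            self.z_threshold = z_threshold

        def check(self, value):
            """Return True if the value should be routed for review, not ingestion."""
            if len(self.history) >= 10:
                mean = statistics.fmean(self.history)
                stdev = statistics.pstdev(self.history) or 1e-9
                if abs(value - mean) / stdev > self.z_threshold:
                    return True
            self.history.append(value)
            return False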
12. Flag Ambiguity And Gaps During Preprocessing
Thoughtful preprocessing means being clear on what data truly reflects reality and where uncertainty remains. By flagging gaps or ambiguous cases rather than forcing assumptions, teams reduce hidden bias and make AI systems safer and more transparent. – Benjamin Forlani, Dedale Intelligence
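In practice this can be as simple as marking incomplete records for review rather than silently imputing them, as in this small sketch (column names are hypothetical):

    import pandas as pd

    def flag_gaps(df, required_columns=("age", "outcome")):
        """Mark rows with missing required fields instead of forcing assumptions."""
        flagged = df.copy()
        flagged["needs_review"] = flagged[list(required_columns)].isna().any(axis=1)
        return flagged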
13. Anonymize Sensitive Fields
Thoughtful preprocessing—like anonymizing sensitive fields while preserving utility—boosts AI fairness and safety. It safeguards privacy, reduces compliance risks and prevents hidden bias from leaking into models. The result is responsible, trustworthy systems that make fairer decisions without sacrificing performance. – Aravind Nuthalapati, Microsoft
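A minimal pseudonymization sketch is shown below; real deployments would add key management and stronger de-identification (generalization, k-anonymity), but the idea is to keep fields joinable without exposing identities.

    import hashlib
    import pandas as pd

    def pseudonymize(df, sensitive_columns, salt="rotate-me"):
        """Replace direct identifiers with salted hashes; joins still work, names don't leak."""
        out = df.copy()
        for col in sensitive_columns:
            out[col] = out[col].astype(str).map(
                lambda v: hashlib.sha256((salt + v).encode()).hexdigest()[:16])
        return out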
14. Analyze Outliers And Skewness
The key is to analyze the data effectively for outliers and skewness. These are critical factors that need to be flagged as exceptions for an AI system. Skewed distributions tend to cause model bias and serious risks in outputs. Effective data governance needs to be in place to profile, catalog and report all kinds of bias factors. – Sanath Chilakala, NTT Data
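A small profiling sketch of this check, using pandas and SciPy on a hypothetical numeric column: report skewness plus IQR-based outliers so they can be flagged as exceptions in the data catalog.

    import pandas as pd
    from scipy.stats import skew

    def profile_column(series: pd.Series) -> dict:
        """Report skewness and IQR-based outliers so they can be flagged as exceptions."""
        q1, q3 = series.quantile([0.25, 0.75])
        iqr = q3 - q1
        outliers = series[(series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)]
        return {
            "skewness": float(skew(series.dropna())),
            "outlier_count": int(outliers.size),
            "outlier_share": float(outliers.size / max(series.size, 1)),
        }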
15. Track And Rate Annotators’ Work
One powerful solution is to introduce a layer of annotator rating and measurement. During labeling, record who labeled each item and in what context. Use calibration methods to assess each annotator’s consistency and bias. This allows you to adjust the influence of individual labels, strengthening the dataset and improving the model’s fairness. It also helps flag edge cases for expert review. – Angshuman Rudra, TapClicks
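One lightweight way to implement this is to score each annotator’s agreement with the per-item majority vote, as in the sketch below (Cohen’s kappa via scikit-learn); low-scoring annotators’ labels can then be down-weighted or sent for expert review.

    from collections import Counter, defaultdict
    from sklearn.metrics import cohen_kappa_score

    def rate_annotators(labels):
        """labels: iterable of (annotator_id, item_id, label) tuples.
        Returns each annotator's agreement (Cohen's kappa) with the majority vote."""
        votes = defaultdict(list)
        for annotator, item, label in labels:
            votes[item].append(label)
        majority = {item: Counter(v).most_common(1)[0][0] for item, v in votes.items()}

        own, reference = defaultdict(list), defaultdict(list)
        for annotator, item, label in labels:
            own[annotator].append(label)
            reference[annotator].append(majority[item])

        return {a: cohen_kappa_score(own[a], reference[a]) for a in own}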
16. Establish Well-Defined Data Policies
Fairness and safety exist only if you have transparency and explainability. Ethical data collection begins with a well-defined data policy. From there, you can determine tactical steps, such as data collection, normalization and bias detection. You can also educate your workers about the data lifecycle. Your policy allows you to justify smart, aggressive steps while remaining ethical. – James Stanger, CompTIA
17. Know What’s Not In Your Dataset
Thoughtful data design starts with knowing what’s not in your dataset. In AI-driven workforce analytics, stripping out personally identifiable information and normalizing activity data helps avoid bias. This ensures AI insights focus on work patterns and trends, not demographics, safeguarding fairness and preventing decisions that could inadvertently disadvantage certain groups. – Heidi Farris, ActivTrak
18. Preserve Annotator Disagreements
Preserve annotator disagreements as valuable business intelligence. When human labelers disagree on edge cases, it pinpoints exactly where your AI will likely fail users unfairly. Instead of forcing consensus, build this uncertainty into your training data as an early warning system. This gives your AI the ability to recognize when it’s uncertain and avoid making confident, but biased, decisions. – Joseph Ours, Centric Consulting
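A tiny sketch of how disagreement can be preserved as soft labels rather than collapsed into a single answer (the item IDs and labels are hypothetical):

    from collections import Counter

    def soft_labels(annotations):
        """annotations: dict mapping item_id -> list of labels from different annotators.
        Keeps disagreement as a probability distribution instead of forcing consensus,
        so the model can learn where humans themselves are uncertain."""
        distributions = {}
        for item, labels in annotations.items():
            counts = Counter(labels)
            total = sum(counts.values())
            distributions[item] = {label: n / total for label, n in counts.items()}
        return distributions

    # Example: three annotators split 2-1 on an edge case.
    print(soft_labels({"ticket_417": ["toxic", "toxic", "not_toxic"]}))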