Failed Machine Learning (FML)

June 14, 2024

High-profile real-world examples of failed machine learning projects


“Success is not final, failure is not fatal. It is the courage to continue that counts.” - Winston Churchill


If you are looking for examples of how ML can fail despite all its incredible potential, you have come to the right place. Beyond the wonderful success stories of applied machine learning, here is a list of failed projects from which we can learn a lot.

Contributions Welcome!


Contents

  1. Classic Machine Learning
  2. Computer Vision
  3. Forecasting
  4. Image Generation
  5. Natural Language Processing
  6. Recommendation Systems

Classic Machine Learning

| Title | Description |
| --- | --- |
| Amazon AI Recruitment System | AI-powered automated recruitment system canceled after evidence of discrimination against female candidates |
| Genderify - Gender identification tool | AI-powered tool designed to identify gender based on fields like name and email address was shut down due to built-in biases and inaccuracies |
| Leakage and the Reproducibility Crisis in ML-based Science | A team at Princeton University found 20 reviews across 17 scientific fields that discovered significant errors (e.g., data leakage, no train-test split) in 329 papers in ML-based science |
| COVID-19 Diagnosis and Triage Models | Hundreds of predictive models were developed to diagnose or triage COVID-19 patients faster, but ultimately none of them were fit for clinical use, and some were potentially harmful |
| COMPAS Recidivism Algorithm | Florida’s recidivism risk system was found to show evidence of racial bias |
| Pennsylvania Child Welfare Screening Tool | The predictive algorithm (which helps identify which families are to be investigated by social workers for child abuse and neglect) flagged a disproportionate number of Black children for 'mandatory' neglect investigations |
| Oregon Child Welfare Screening Tool | Similar to the predictive tool in Pennsylvania, the AI algorithm for child welfare in Oregon was also stopped, a month after the Pennsylvania report |
| U.S. Healthcare System Health Risk Prediction | A widely used algorithm to predict healthcare needs exhibited racial bias: at a given risk score, Black patients were considerably sicker than white patients |
| Apple Card Credit Card | Apple’s new credit card (created in partnership with Goldman Sachs) came under investigation by financial regulators after customers complained that the card’s lending algorithms discriminated against women, with one male customer's credit line reported to be 20 times higher than that offered to his spouse |

Computer Vision

| Title | Description |
| --- | --- |
| Inverness Automated Football Camera System | AI camera football-tracking technology for live streaming repeatedly confused a linesman’s bald head for the ball itself |
| Amazon Rekognition for US Congressmen | Amazon's facial recognition technology (Rekognition) falsely matched 28 members of Congress with mugshots of criminals, while also revealing racial bias in the algorithm |
| Amazon Rekognition for law enforcement | Amazon's facial recognition technology (Rekognition) misidentified women as men, particularly those with darker skin |
| Zhejiang traffic facial recognition system | Traffic camera system (designed to capture traffic offenses) mistook a face on the side of a bus for a jaywalker |
| Kneron tricking facial recognition terminals | The team at Kneron used high-quality 3-D masks to deceive AliPay and WeChat payment systems to make purchases |
| Twitter smart cropping tool | Twitter's auto-crop tool for photo previews displayed evident signs of racial bias |
| Depixelator tool | Algorithm (based on StyleGAN) designed to generate depixelated faces showed signs of racial bias, with image output skewed towards the white demographic |
| Google Photos tagging | The automatic photo tagging capability in Google Photos mistakenly labeled Black people as gorillas |
| GenderShades evaluation of gender classification products | GenderShades' research revealed that Microsoft and IBM’s face-analysis services for identifying the gender of people in photos frequently erred when analyzing images of women with dark skin |
| New Jersey Police Facial Recognition | A false facial recognition match by New Jersey police landed an innocent Black man (Nijeer Parks) in jail even though he was 30 miles away from the crime |
| Tesla's dilemma between a horse cart and a truck | Tesla's visualization system mistook a horse-drawn carriage for a truck with a man walking behind it |
| Google's AI for Diabetic Retinopathy Detection | The retina scanning tool fared much worse in real-life settings than in controlled experiments, with issues such as rejected scans (from poor scan image quality) and delays from intermittent internet connectivity when uploading images to the cloud for processing |

Forecasting

| Title | Description |
| --- | --- |
| Google Flu Trends | Flu prevalence prediction model based on Google searches produced inaccurate over-estimates |
| Zillow iBuying algorithms | Significant losses in Zillow's home-flipping business due to inaccurate (overestimated) prices from property valuation models |
| Tyndaris Robot Hedge Fund | AI-powered automated trading system controlled by a supercomputer named K1 resulted in big investment losses, culminating in a lawsuit |
| Sentient Investment AI Hedge Fund | The once high-flying AI-powered fund at Sentient Investment Management failed to make money and was liquidated in less than 2 years |
| JP Morgan's Deep Learning Model for FX Algos | JP Morgan has phased out a deep neural network for foreign exchange algorithmic execution, citing issues with data interpretation and the complexity involved |

Image Generation

| Title | Description |
| --- | --- |
| Playground AI facial generation | When asked to turn a headshot of an Asian person into a professional LinkedIn profile photo, the AI image editor generated an output with features that made the subject look Caucasian instead |
| Stable Diffusion Text-to-Image Model | In an experiment run by Bloomberg, Stable Diffusion (a text-to-image model) exhibited racial and gender bias across thousands of generated images related to job titles and crime |
| Historical Inaccuracies in Gemini Image Generation | Google's Gemini image generation feature was found to generate historically inaccurate depictions in its attempt to subvert gender and racial stereotypes, such as returning non-white AI-generated people when prompted to generate the Founding Fathers of the United States |

Natural Language Processing

| Title | Description |
| --- | --- |
| Microsoft Tay Chatbot | Chatbot that posted inflammatory and offensive tweets through its Twitter account |
| Nabla Chatbot | Experimental chatbot (for medical advice) using a cloud-hosted instance of GPT-3 advised a mock patient to commit suicide |
| Facebook Negotiation Chatbots | The AI system was shut down after the chatbots stopped using English in their negotiations and started using a language that they had created themselves |
| OpenAI GPT-3 Chatbot Samantha | A GPT-3 chatbot fine-tuned by indie game developer Jason Rohrer to emulate his late fiancée was shut down by OpenAI after Rohrer refused their request to insert an automated monitoring tool amidst concerns of the chatbot being racist or overtly sexual |
| Amazon Alexa plays porn | Amazon's voice-activated digital assistant unleashed a torrent of raunchy language after a toddler asked it to play a children’s song |
| Galactica - Meta's Large Language Model | A problem with Galactica was that it could not distinguish truth from falsehood, a basic requirement for a language model designed to generate scientific text. It was found to make up fake papers (sometimes attributing them to real authors), and generated articles about the history of bears in space as readily as ones about protein complexes. |
| Energy Firm in Voice Mimicry Fraud | Cybercriminals used AI-based software to impersonate the voice of a CEO and demand a fraudulent money transfer in a voice-spoofing attack |
| MOH chatbot dispenses safe sex advice when asked Covid-19 questions | The 'Ask Jamie' chatbot by the Singapore Ministry of Health (MOH) was temporarily disabled after it provided mismatched replies about safe sex when asked about managing positive COVID-19 results |
| Google's BARD Chatbot Demo | In its first public demo advertisement, BARD made a factual error regarding which satellite first took pictures of a planet outside the Earth's solar system |
| ChatGPT Categories of Failures | An analysis of the ten categories of failures seen in ChatGPT so far, including reasoning, factual errors, math, coding, and bias |
| TikTokers roasting McDonald's hilarious drive-thru AI order fails | Examples of a deployed voice assistant failing to get orders right, causing brand and reputational damage for McDonald's |
| Bing Chatbot's Unhinged Emotional Behavior | In certain conversations, Bing's chatbot was found to reply with argumentative and emotional responses |
| Bing's AI quotes COVID disinformation sourced from ChatGPT | Bing's response to a query on COVID-19 anti-vaccine advocacy was inaccurate and based on false information from unreliable sources |
| AI-generated 'Seinfeld' suspended on Twitch for transphobic jokes | A mistake with the AI’s content filter resulted in the character 'Larry' delivering a transphobic standup routine |
| ChatGPT cites bogus legal cases | A lawyer used OpenAI's popular chatbot ChatGPT to "supplement" his own findings but was provided with entirely fabricated prior cases |
| Air Canada chatbot gives erroneous information | Air Canada's AI-powered chatbot hallucinated an answer inconsistent with airline policy with regard to bereavement fares |
| AI bot performed illegal insider trading and lied about its actions | An AI investment management system chatbot called Alpha (built on OpenAI's GPT-4, developed by Apollo Research) demonstrated that it was capable of making illegal financial trades and lying about its actions |

Recommendation Systems

| Title | Description |
| --- | --- |
| IBM's Watson Health | IBM’s Watson allegedly provided numerous unsafe and incorrect recommendations for treating cancer patients |
| Netflix - $1 Million Challenge | The recommender system that won the $1 Million challenge improved on the proposed baseline by 10.06%. However, this performance gain did not seem to justify the engineering effort needed to bring it into a production environment. |