@inferrlm/react-native-mlx

March 21, 2026

Run LLMs, Text-to-Speech, and Speech-to-Text on-device in React Native using MLX Swift.

Built with Nitro Modules for zero-overhead native bridging.

Requirements

  • iOS 16.0+
  • React Native 0.76+
  • react-native-nitro-modules

Installation

npm install @inferrlm/react-native-mlx react-native-nitro-modules
cd ios && pod install

Usage

Download a Model

import { ModelManager, MLXModel } from '@inferrlm/react-native-mlx'

await ModelManager.download(MLXModel.Qwen3_1_7B_4bit, (progress) => {
  console.log(`${(progress * 100).toFixed(1)}%`)
})

Load and Generate

import { LLM } from '@inferrlm/react-native-mlx'

await LLM.load('mlx-community/Qwen3-1.7B-4bit', {
  onProgress: (p) => console.log(`Loading: ${(p * 100).toFixed(0)}%`),
  manageHistory: true,
})

const response = await LLM.generate('What is the capital of France?')

Streaming

let response = ''
await LLM.stream('Tell me a story', (token) => {
  response += token
})
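If `onToken` calls a React state setter directly, the component can re-render once per token. One option is to batch tokens before updating the UI. Below is a minimal sketch. `TokenBatcher` and its `flushEvery` threshold are hypothetical app-side names, not part of this library.

```typescript
// Hypothetical helper (not part of @inferrlm/react-native-mlx): collects streamed
// tokens and forwards the accumulated text once per batch instead of once per token.
class TokenBatcher {
  private pending: string[] = []
  private text = ''
  private readonly onFlush: (fullText: string) => void
  private readonly flushEvery: number

  // flushEvery is an assumed default; tune it for your UI.
  constructor(onFlush: (fullText: string) => void, flushEvery = 8) {
    this.onFlush = onFlush
    this.flushEvery = flushEvery
  }

  // Pass this as the onToken callback.
  push = (token: string): void => {
    this.pending.push(token)
    if (this.pending.length >= this.flushEvery) this.flush()
  }

  // Emits any buffered tokens; call once more after the stream resolves.
  flush = (): void => {
    if (this.pending.length === 0) return
    this.text += this.pending.join('')
    this.pending = []
    this.onFlush(this.text)
  }
}
```

For example, create `const batcher = new TokenBatcher(setResponse)` and call `LLM.stream('Tell me a story', batcher.push)`. Call `batcher.flush()` after the promise resolves so the final partial batch is shown.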

Streaming with Events

For thinking blocks (chain-of-thought) and tool calls, use the event-based API:

// showThinkingIndicator, appendToThinking, etc. are placeholders for your app's UI code
await LLM.streamWithEvents('Solve 2+3 step by step', (event) => {
  switch (event.type) {
    case 'thinking_start':
      showThinkingIndicator()
      break
    case 'thinking_chunk':
      appendToThinking(event.chunk)
      break
    case 'thinking_end':
      hideThinkingIndicator()
      break
    case 'token':
      appendToContent(event.token)
      break
    case 'generation_end':
      console.log(`${event.stats.tokensPerSecond} tok/s`)
      break
  }
})
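In a React component, the same events can drive a single reducer instead of separate UI callbacks. Below is a sketch. The event shapes are transcribed from the StreamEvent table in the API section. The field types (string, number, unknown) are assumptions, not the library's own declarations. `streamReducer` and `ChatState` are hypothetical names.

```typescript
// Assumed event shapes, based on the StreamEvent table; field types are guesses.
type StreamEvent =
  | { type: 'generation_start'; timestamp: number }
  | { type: 'token'; token: string }
  | { type: 'thinking_start'; timestamp: number }
  | { type: 'thinking_chunk'; chunk: string }
  | { type: 'thinking_end'; content: string; timestamp: number }
  | { type: 'tool_call_start'; id: string; name: string; arguments: unknown }
  | { type: 'tool_call_executing'; id: string }
  | { type: 'tool_call_completed'; id: string; result: unknown }
  | { type: 'tool_call_failed'; id: string; error: string }
  | { type: 'generation_end'; content: string; stats: unknown }

type ToolStatus = 'started' | 'executing' | 'completed' | 'failed'

// Hypothetical UI state built up from the event stream.
interface ChatState {
  thinking: string
  isThinking: boolean
  content: string
  tools: Record<string, { name: string; status: ToolStatus }>
  done: boolean
}

const initialState: ChatState = { thinking: '', isThinking: false, content: '', tools: {}, done: false }

// Pure reducer: returns a new state for each event and never mutates the old one.
function streamReducer(state: ChatState, event: StreamEvent): ChatState {
  switch (event.type) {
    case 'generation_start':
      return initialState
    case 'thinking_start':
      return { ...state, isThinking: true }
    case 'thinking_chunk':
      return { ...state, thinking: state.thinking + event.chunk }
    case 'thinking_end':
      return { ...state, isThinking: false }
    case 'token':
      return { ...state, content: state.content + event.token }
    case 'tool_call_start':
      return { ...state, tools: { ...state.tools, [event.id]: { name: event.name, status: 'started' } } }
    case 'tool_call_executing':
    case 'tool_call_completed':
    case 'tool_call_failed': {
      const prev = state.tools[event.id]
      if (!prev) return state // ignore updates for unknown tool calls
      const status: ToolStatus =
        event.type === 'tool_call_executing' ? 'executing' : event.type === 'tool_call_completed' ? 'completed' : 'failed'
      return { ...state, tools: { ...state.tools, [event.id]: { ...prev, status } } }
    }
    case 'generation_end':
      return { ...state, done: true }
  }
}
```

With React's `useReducer(streamReducer, initialState)`, you can pass the returned `dispatch` as the `onEvent` callback. This assumes the callback receives one event object per call, as in the snippet above.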

Configuring Generation

LLM.systemPrompt = 'You are a helpful coding assistant.'
LLM.maxTokens = 2048
LLM.temperature = 0.7
LLM.enableThinking = true

Tool Calling

import { LLM, createTool } from '@inferrlm/react-native-mlx'
import { z } from 'zod'

const weatherTool = createTool({
  name: 'get_weather',
  description: 'Get weather for a city',
  arguments: z.object({
    city: z.string().describe('City name'),
  }),
  handler: async ({ city }) => {
    // Placeholder result; a real handler would look up the weather for `city`
    return { temperature: 22, condition: 'sunny' }
  },
})

await LLM.load('mlx-community/Qwen3-1.7B-4bit', {
  onProgress: (p) => console.log(`${(p * 100).toFixed(0)}%`),
  tools: [weatherTool],
})

let response = ''
await LLM.stream('What is the weather in Tokyo?', (token) => {
  // Append each token to the response text
  response += token
}, (update) => {
  console.log(`Tool called: ${update.toolCall.name}`)
})
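When the model emits a tool call, something has to do three things: find the matching tool by name, decode its arguments, and run the handler. The sketch below shows that step in isolation. It is not this library's implementation. The `RawToolCall` shape (a name plus a JSON-encoded argument string) and all function names are hypothetical. `createTool` presumably also validates arguments against the zod schema; this sketch only checks that they form a JSON object.

```typescript
// Hypothetical minimal tool shape: a name plus an async handler.
interface SimpleTool {
  name: string
  handler: (args: Record<string, unknown>) => Promise<unknown>
}

// Assumed wire format for a model-emitted tool call.
interface RawToolCall {
  name: string
  arguments: string // JSON-encoded object
}

type ToolOutcome = { ok: true; result: unknown } | { ok: false; error: string }

// Synchronous step: look up the tool by name and decode its arguments.
function resolveToolCall(
  tools: SimpleTool[],
  call: RawToolCall,
): { tool: SimpleTool; args: Record<string, unknown> } {
  const tool = tools.find((t) => t.name === call.name)
  if (!tool) throw new Error(`Unknown tool: ${call.name}`)
  const parsed: unknown = JSON.parse(call.arguments)
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    throw new Error(`Arguments for ${call.name} must be a JSON object`)
  }
  return { tool, args: parsed as Record<string, unknown> }
}

// Async step: run the handler and report success or failure, loosely
// mirroring the tool_call_completed / tool_call_failed events.
async function runToolCall(tools: SimpleTool[], call: RawToolCall): Promise<ToolOutcome> {
  try {
    const { tool, args } = resolveToolCall(tools, call)
    return { ok: true, result: await tool.handler(args) }
  } catch (e) {
    return { ok: false, error: e instanceof Error ? e.message : String(e) }
  }
}
```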

Conversation History

When manageHistory: true is set during load, conversation turns are automatically tracked:

await LLM.load('mlx-community/Qwen3-1.7B-4bit', {
  manageHistory: true,
})

const onToken = (token) => console.log(token) // or append to your UI
await LLM.stream('My name is Alice', onToken)
await LLM.stream('What is my name?', onToken) // Remembers "Alice"

const history = LLM.getHistory()
LLM.clearHistory()
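If you leave `manageHistory` off, one option is to keep the transcript yourself and pass it back through the `additionalContext` load option. Below is a minimal sketch. The message shape (`role` plus `content`) is an assumption; check it against the library's exported `LLMMessage` type. `ConversationLog` is a hypothetical helper.

```typescript
// Assumed message shape; verify against the library's LLMMessage type.
interface ChatMessage {
  role: 'system' | 'user' | 'assistant'
  content: string
}

// Hypothetical helper: stores user/assistant turns and returns a bounded window,
// so the context handed back to the model does not grow without limit.
class ConversationLog {
  private messages: ChatMessage[] = []

  addTurn(user: string, assistant: string): void {
    this.messages.push({ role: 'user', content: user }, { role: 'assistant', content: assistant })
  }

  // The most recent `maxTurns` user/assistant pairs, oldest first.
  // Turns are stored in pairs, so the window always starts with a user message.
  recent(maxTurns: number): ChatMessage[] {
    if (maxTurns <= 0) return []
    return this.messages.slice(-maxTurns * 2)
  }

  clear(): void {
    this.messages = []
  }
}
```

After each `LLM.generate(prompt)` call, record the exchange with `log.addTurn(prompt, response)`. When you reload the model, pass something like `additionalContext: log.recent(10)` to `LLM.load`.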

Stop Generation

LLM.stop()

Unload

LLM.unload()

Text-to-Speech

import { TTS, MLXModel } from '@inferrlm/react-native-mlx'

await TTS.load(MLXModel.PocketTTS, {
  onProgress: (p) => console.log(`${(p * 100).toFixed(0)}%`),
})

const audioBuffer = await TTS.generate('Hello world!', {
  voice: 'alba',
  speed: 1.0,
})

await TTS.stream('Hello world!', (chunk) => {
  // Process each audio chunk
}, { voice: 'alba' })

Available voices: alba, azelma, cosette, eponine, fantine, javert, jean, marius
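What you can do with `audioBuffer` depends on its format, which is not specified here. Suppose it holds mono 32-bit float PCM samples in [-1, 1] at `TTS.sampleRate`; that is an assumption to verify. The sketch below wraps such samples in a standard 16-bit PCM WAV container for saving or playback. `encodeWav` is a hypothetical helper.

```typescript
// Hypothetical helper: encodes mono float samples in [-1, 1] as a 16-bit PCM WAV file.
// Assumes mono Float32 input; check what TTS.generate actually returns.
function encodeWav(samples: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2
  const dataSize = samples.length * bytesPerSample
  const buffer = new ArrayBuffer(44 + dataSize)
  const view = new DataView(buffer)

  const writeAscii = (offset: number, text: string) => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i))
  }

  // RIFF header
  writeAscii(0, 'RIFF')
  view.setUint32(4, 36 + dataSize, true) // file size minus the first 8 bytes
  writeAscii(8, 'WAVE')

  // fmt chunk: PCM, mono, 16-bit
  writeAscii(12, 'fmt ')
  view.setUint32(16, 16, true) // fmt chunk size
  view.setUint16(20, 1, true) // audio format 1 = PCM
  view.setUint16(22, 1, true) // channel count
  view.setUint32(24, sampleRate, true)
  view.setUint32(28, sampleRate * bytesPerSample, true) // byte rate
  view.setUint16(32, bytesPerSample, true) // block align
  view.setUint16(34, 16, true) // bits per sample

  // data chunk: clamp each sample, then scale to the signed 16-bit range
  writeAscii(36, 'data')
  view.setUint32(40, dataSize, true)
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]))
    view.setInt16(44 + i * bytesPerSample, s < 0 ? s * 0x8000 : s * 0x7fff, true)
  }
  return new Uint8Array(buffer)
}
```

Under the same assumption, the clip's duration in seconds is `samples.length / TTS.sampleRate`.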

Speech-to-Text

import { STT, MLXModel } from '@inferrlm/react-native-mlx'

await STT.load(MLXModel.GLM_ASR_Nano_4bit, {
  onProgress: (p) => console.log(`${(p * 100).toFixed(0)}%`),
})

const text = await STT.transcribe(audioBuffer)

// Live microphone transcription
await STT.startListening()
const partial = await STT.transcribeBuffer()
const final = await STT.stopListening()
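`transcribeBuffer()` transcribes the current buffer while listening, so live captions come from calling it repeatedly. Below is a minimal polling sketch. It assumes the method returns a promise of a string, as the snippet above suggests. `pollPartials`, its default interval, and `stablePrefix` are hypothetical helpers. `stablePrefix` keeps only the words two consecutive partials agree on, which can reduce caption flicker.

```typescript
// Structural type for the one STT method used here (return type assumed).
interface LiveSTT {
  transcribeBuffer(): Promise<string>
}

// Longest common prefix of two partial transcripts, cut back to the last space
// inside that prefix so a word still being recognized is not shown as settled.
// Identical strings are returned whole.
function stablePrefix(prev: string, next: string): string {
  if (prev === next) return next
  let i = 0
  while (i < prev.length && i < next.length && prev[i] === next[i]) i++
  if (i === 0) return ''
  const lastSpace = next.lastIndexOf(' ', i - 1)
  return lastSpace < 0 ? '' : next.slice(0, lastSpace + 1)
}

// Calls transcribeBuffer() repeatedly, waiting for each call to finish before
// scheduling the next one, until the returned stop function is called.
function pollPartials(stt: LiveSTT, onPartial: (text: string) => void, intervalMs = 500): () => void {
  let active = true
  let timer: ReturnType<typeof setTimeout> | undefined
  const tick = async () => {
    if (!active) return
    try {
      const text = await stt.transcribeBuffer()
      if (active) onPartial(text)
    } catch {
      // A failed poll is skipped; the next tick tries again.
    }
    if (active) timer = setTimeout(tick, intervalMs)
  }
  timer = setTimeout(tick, intervalMs)
  return () => {
    active = false
    if (timer !== undefined) clearTimeout(timer)
  }
}
```

A typical sequence is: call `STT.startListening()`, start `const stop = pollPartials(STT, showCaption)`, then call `stop()` before `STT.stopListening()`. `showCaption` stands in for your own UI code.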

API

LLM

Methods

| Method | Description |
|---|---|
| load(modelId, options?) | Load a model into memory, downloading it from Hugging Face if it is not cached |
| generate(prompt) | Generate a complete response (non-streaming); resolves once generation finishes |
| stream(prompt, onToken, onToolCall?) | Stream tokens, with optional tool call updates |
| streamWithEvents(prompt, onEvent) | Stream typed events, including thinking and tool call events |
| stop() | Stop the current generation |
| unload() | Unload the model and free memory |
| getLastGenerationStats() | Get token count, speed, and timing for the last generation |
| getHistory() | Get the conversation history (when manageHistory is enabled) |
| clearHistory() | Clear the conversation history |

Properties

| Property | Type | Default | Description |
|---|---|---|---|
| isLoaded | boolean | - | Whether a model is loaded (read-only) |
| isGenerating | boolean | - | Whether generation is in progress (read-only) |
| modelId | string | '' | Currently loaded model ID (read-only) |
| debug | boolean | false | Enable debug logging |
| systemPrompt | string | 'You are a helpful assistant.' | System prompt for the model |
| maxTokens | number | 2048 | Maximum tokens to generate |
| temperature | number | 0.7 | Sampling temperature (0 = deterministic) |
| enableThinking | boolean | true | Enable thinking mode for supported models |
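The `temperature` property follows the usual meaning of sampling temperature. As general background, not a description of this library's exact sampler: logits are divided by T before the softmax. T < 1 sharpens the distribution, T > 1 flattens it, and T = 0 reduces to always picking the most likely token. Below is a short sketch of that formula. `temperatureProbs` is a hypothetical name.

```typescript
// Hypothetical illustration of standard temperature scaling:
// p_i = exp(z_i / T) / sum_j exp(z_j / T). T <= 0 is treated as greedy decoding.
function temperatureProbs(logits: number[], temperature: number): number[] {
  if (logits.length === 0) return []
  if (temperature <= 0) {
    // Greedy: all probability mass on the highest logit (first one on ties).
    const best = logits.indexOf(Math.max(...logits))
    return logits.map((_, i) => (i === best ? 1 : 0))
  }
  const scaled = logits.map((z) => z / temperature)
  const max = Math.max(...scaled) // subtracting the max avoids overflow in exp
  const exps = scaled.map((z) => Math.exp(z - max))
  const sum = exps.reduce((a, b) => a + b, 0)
  return exps.map((e) => e / sum)
}
```

With logits [2, 1, 0], the top token's probability rises as T drops: roughly 0.51 at T = 2, 0.67 at T = 1, and 0.87 at T = 0.5.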

LLMLoadOptions

| Property | Type | Description |
|---|---|---|
| onProgress | (progress: number) => void | Loading progress callback (0 to 1) |
| additionalContext | LLMMessage[] | Conversation history or few-shot examples |
| manageHistory | boolean | Automatically manage conversation history |
| tools | ToolDefinition[] | Tools available for the model to call |

GenerationStats

| Property | Type | Description |
|---|---|---|
| tokenCount | number | Total tokens generated |
| tokensPerSecond | number | Generation speed (tokens per second) |
| timeToFirstToken | number | Time to first token (ms) |
| totalTime | number | Total generation time (ms) |
| toolExecutionTime | number | Time spent executing tools (ms) |
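When comparing models on a device, it can help to aggregate `getLastGenerationStats()` over several runs rather than relying on a single one. The sketch below assumes the fields in the table above are plain numbers. `summarizeRuns` is a hypothetical helper. It uses the median for time to first token because the median is less sensitive to an occasional outlier run.

```typescript
// Field names copied from the GenerationStats table; all assumed to be numbers.
interface GenerationStats {
  tokenCount: number
  tokensPerSecond: number
  timeToFirstToken: number // ms
  totalTime: number // ms
  toolExecutionTime: number // ms
}

interface RunSummary {
  runs: number
  meanTokensPerSecond: number
  medianTimeToFirstToken: number // ms
  totalTokens: number
}

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b)
  const mid = Math.floor(sorted.length / 2)
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2
}

// Hypothetical helper: summarizes stats collected from several generations.
function summarizeRuns(stats: GenerationStats[]): RunSummary {
  if (stats.length === 0) throw new Error('No runs to summarize')
  return {
    runs: stats.length,
    meanTokensPerSecond: stats.reduce((sum, s) => sum + s.tokensPerSecond, 0) / stats.length,
    medianTimeToFirstToken: median(stats.map((s) => s.timeToFirstToken)),
    totalTokens: stats.reduce((sum, s) => sum + s.tokenCount, 0),
  }
}
```

To collect the input, push `LLM.getLastGenerationStats()` into an array after each `generate` call.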

StreamEvent Types

| Event | Fields | Description |
|---|---|---|
| generation_start | timestamp | Generation began |
| token | token | Response token |
| thinking_start | timestamp | Model began thinking |
| thinking_chunk | chunk | Thinking content chunk |
| thinking_end | content, timestamp | Thinking complete |
| tool_call_start | id, name, arguments | Tool call initiated |
| tool_call_executing | id | Tool handler running |
| tool_call_completed | id, result | Tool returned a result |
| tool_call_failed | id, error | Tool execution failed |
| generation_end | content, stats | Generation complete |

TTS

Methods

| Method | Description |
|---|---|
| load(modelId, options?) | Load a TTS model |
| generate(text, options?) | Generate an audio buffer from text |
| stream(text, onAudioChunk, options?) | Stream audio chunks |
| stop() | Stop generation |
| unload() | Unload the model |

Properties

| Property | Type | Description |
|---|---|---|
| isLoaded | boolean | Whether a TTS model is loaded |
| isGenerating | boolean | Whether audio is being generated |
| modelId | string | Currently loaded model ID |
| sampleRate | number | Audio sample rate in Hz (e.g. 24000) |

TTSGenerateOptions

| Property | Type | Description |
|---|---|---|
| voice | string | Voice name |
| speed | number | Speech speed multiplier |
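Because `voice` is typed as a plain string, the type checker will not catch a misspelled voice name. One option is a local union type built from the voice list in the Text-to-Speech section. `VOICES`, `Voice`, `isVoice`, `voiceOrDefault`, and `TypedTTSOptions` are hypothetical app-side names. The list has to be kept in sync with the model by hand.

```typescript
// Voice names copied from the "Available voices" list for PocketTTS.
const VOICES = ['alba', 'azelma', 'cosette', 'eponine', 'fantine', 'javert', 'jean', 'marius'] as const

type Voice = (typeof VOICES)[number]

// Narrows an arbitrary string (e.g. from settings or user input) to a known voice.
function isVoice(value: string): value is Voice {
  return (VOICES as readonly string[]).includes(value)
}

// Falls back to a default voice when the stored value is not recognized.
function voiceOrDefault(value: string, fallback: Voice = 'alba'): Voice {
  return isVoice(value) ? value : fallback
}

// Hypothetical options type whose voice field only accepts the names above.
interface TypedTTSOptions {
  voice: Voice
  speed?: number // multiplier; the README example uses 1.0
}
```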

STT

Methods

| Method | Description |
|---|---|
| load(modelId, options?) | Load an STT model |
| transcribe(audio) | Transcribe an audio buffer |
| transcribeStream(audio, onToken) | Stream transcription tokens |
| startListening() | Start microphone capture |
| transcribeBuffer() | Transcribe the current buffer while listening |
| stopListening() | Stop listening and return the final transcript |
| stop() | Stop transcription |
| unload() | Unload the model |

Properties

| Property | Type | Description |
|---|---|---|
| isLoaded | boolean | Whether an STT model is loaded |
| isTranscribing | boolean | Whether transcription is in progress |
| isListening | boolean | Whether the microphone is active |
| modelId | string | Currently loaded model ID |

ModelManager

Methods

| Method | Description |
|---|---|
| download(modelId, onProgress) | Download a model from Hugging Face |
| isDownloaded(modelId) | Check whether a model is downloaded |
| getDownloadedModels() | Get the list of downloaded model IDs |
| deleteModel(modelId) | Delete a downloaded model |
| getModelPath(modelId) | Get the model's local filesystem path |
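A common pattern is to compare the models an app needs against `getDownloadedModels()` before a session starts. Below is a minimal sketch of that comparison. `planModelSync` is a hypothetical helper. This README does not say whether `getDownloadedModels()` is synchronous or returns a promise, so `await` its result either way before passing it in.

```typescript
// Hypothetical result of comparing required models with what is on disk.
interface SyncPlan {
  toDownload: string[] // required but not yet downloaded
  toDelete: string[] // downloaded but no longer required
}

// Compares required model IDs with the IDs already downloaded (duplicates ignored).
function planModelSync(required: string[], downloaded: string[]): SyncPlan {
  const have = new Set(downloaded)
  const need = new Set(required)
  return {
    toDownload: Array.from(need).filter((id) => !have.has(id)),
    toDelete: Array.from(have).filter((id) => !need.has(id)),
  }
}
```

Each ID in `toDownload` can go to `ModelManager.download(id, onProgress)`. Each ID in `toDelete` can go to `ModelManager.deleteModel(id)`, but only if no other part of the app still uses that model.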

Properties

| Property | Type | Description |
|---|---|---|
| debug | boolean | Enable debug logging |

Supported Models

Any MLX-compatible model from mlx-community should work. The MLXModel enum provides pre-tested models:

import { MLXModel } from '@inferrlm/react-native-mlx'

LLM Models

| Model | Enum Key | Hugging Face ID |
|---|---|---|
| **Llama 3.2 (Meta)** | | |
| Llama 3.2 1B 4-bit | Llama_3_2_1B_Instruct_4bit | mlx-community/Llama-3.2-1B-Instruct-4bit |
| Llama 3.2 1B 8-bit | Llama_3_2_1B_Instruct_8bit | mlx-community/Llama-3.2-1B-Instruct-8bit |
| Llama 3.2 3B 4-bit | Llama_3_2_3B_Instruct_4bit | mlx-community/Llama-3.2-3B-Instruct-4bit |
| Llama 3.2 3B 8-bit | Llama_3_2_3B_Instruct_8bit | mlx-community/Llama-3.2-3B-Instruct-8bit |
| **Qwen 2.5 (Alibaba)** | | |
| Qwen 2.5 0.5B 4-bit | Qwen2_5_0_5B_Instruct_4bit | mlx-community/Qwen2.5-0.5B-Instruct-4bit |
| Qwen 2.5 0.5B 8-bit | Qwen2_5_0_5B_Instruct_8bit | mlx-community/Qwen2.5-0.5B-Instruct-8bit |
| Qwen 2.5 1.5B 4-bit | Qwen2_5_1_5B_Instruct_4bit | mlx-community/Qwen2.5-1.5B-Instruct-4bit |
| Qwen 2.5 1.5B 8-bit | Qwen2_5_1_5B_Instruct_8bit | mlx-community/Qwen2.5-1.5B-Instruct-8bit |
| Qwen 2.5 3B 4-bit | Qwen2_5_3B_Instruct_4bit | mlx-community/Qwen2.5-3B-Instruct-4bit |
| Qwen 2.5 3B 8-bit | Qwen2_5_3B_Instruct_8bit | mlx-community/Qwen2.5-3B-Instruct-8bit |
| **Qwen 3 (Alibaba)** | | |
| Qwen 3 1.7B 4-bit | Qwen3_1_7B_4bit | mlx-community/Qwen3-1.7B-4bit |
| Qwen 3 1.7B 8-bit | Qwen3_1_7B_8bit | mlx-community/Qwen3-1.7B-8bit |
| **Qwen 3.5 (Alibaba)** | | |
| Qwen 3.5 0.8B 4-bit | Qwen3_5_0_8B_MLX_4bit | mlx-community/Qwen3.5-0.8B-MLX-4bit |
| Qwen 3.5 0.8B 8-bit | Qwen3_5_0_8B_MLX_8bit | mlx-community/Qwen3.5-0.8B-MLX-8bit |
| **Gemma 3 (Google)** | | |
| Gemma 3 1B 4-bit | Gemma_3_1B_IT_4bit | mlx-community/gemma-3-1b-it-4bit |
| Gemma 3 1B 8-bit | Gemma_3_1B_IT_8bit | mlx-community/gemma-3-1b-it-8bit |
| **Phi 3.5 Mini (Microsoft)** | | |
| Phi 3.5 Mini 4-bit | Phi_3_5_Mini_Instruct_4bit | mlx-community/Phi-3.5-mini-instruct-4bit |
| Phi 3.5 Mini 8-bit | Phi_3_5_Mini_Instruct_8bit | mlx-community/Phi-3.5-mini-instruct-8bit |
| **Phi 4 Mini (Microsoft)** | | |
| Phi 4 Mini 4-bit | Phi_4_Mini_Instruct_4bit | mlx-community/Phi-4-mini-instruct-4bit |
| Phi 4 Mini 8-bit | Phi_4_Mini_Instruct_8bit | mlx-community/Phi-4-mini-instruct-8bit |
| **SmolLM (Hugging Face)** | | |
| SmolLM 1.7B 4-bit | SmolLM_1_7B_Instruct_4bit | mlx-community/SmolLM-1.7B-Instruct-4bit |
| SmolLM 1.7B 8-bit | SmolLM_1_7B_Instruct_8bit | mlx-community/SmolLM-1.7B-Instruct-8bit |
| **SmolLM2 (Hugging Face)** | | |
| SmolLM2 1.7B 4-bit | SmolLM2_1_7B_Instruct_4bit | mlx-community/SmolLM2-1.7B-Instruct-4bit |
| SmolLM2 1.7B 8-bit | SmolLM2_1_7B_Instruct_8bit | mlx-community/SmolLM2-1.7B-Instruct-8bit |
| **OpenELM (Apple)** | | |
| OpenELM 1.1B 4-bit | OpenELM_1_1B_4bit | mlx-community/OpenELM-1_1B-4bit |
| OpenELM 1.1B 8-bit | OpenELM_1_1B_8bit | mlx-community/OpenELM-1_1B-8bit |
| OpenELM 3B 4-bit | OpenELM_3B_4bit | mlx-community/OpenELM-3B-4bit |
| OpenELM 3B 8-bit | OpenELM_3B_8bit | mlx-community/OpenELM-3B-8bit |

TTS Models

| Model | Enum Key | Hugging Face ID |
|---|---|---|
| **PocketTTS (Kyutai), 44.6M parameters** | | |
| PocketTTS bf16 | PocketTTS | mlx-community/pocket-tts |
| PocketTTS 8-bit | PocketTTS_8bit | mlx-community/pocket-tts-8bit |
| PocketTTS 4-bit | PocketTTS_4bit | mlx-community/pocket-tts-4bit |

STT Models

| Model | Enum Key | Hugging Face ID |
|---|---|---|
| **GLM-ASR (Zhipu AI)** | | |
| GLM-ASR Nano 4-bit | GLM_ASR_Nano_4bit | mlx-community/GLM-ASR-Nano-2512-4bit |

Browse more models at huggingface.co/mlx-community.

License

MIT