
Building a Real-Time Speech-to-Text Mobile App with React Native and Node.js

This article will guide you through building a mobile app using React Native for the frontend and Node.js for the backend: users record their voice, the backend converts the audio to text using Google’s Speech-to-Text API, and the app displays the transcription. We’ll walk through each step needed to get the project working.

Prerequisites

  1. React Native Setup: Ensure React Native is installed and working on your system. Follow this guide to set up React Native for iOS and Android on macOS.
  2. Node.js Setup: Install Node.js 20.x by following this guide.
  3. Google Speech-to-Text API Key: Obtain an API key from the Google Cloud Console and ensure the Speech-to-Text API is enabled.
  4. FFmpeg: Install FFmpeg for audio conversion:
brew install ffmpeg

Step 1: Create a Project Structure

1. Create a folder named VoiceNote and navigate to it:

mkdir VoiceNote && cd VoiceNote

2. Initialize two subprojects:

a) React Native App:

npx @react-native-community/cli init VoiceNoteApp

b) Node.js Backend:

mkdir backend && cd backend && npm init -y

Step 2: Set Up the Node.js Backend

1. Install required dependencies in the backend folder:

npm install express body-parser multer dotenv axios

2. Install FFmpeg for audio conversion, if you skipped it in the prerequisites:

brew install ffmpeg

3. Create a .env file in the backend folder and add your Google Speech-to-Text API key:

GOOGLE_API_KEY=YOUR_GOOGLE_API_KEY

4. Create a file named server.js in the backend folder and add the following code:

require("dotenv").config();
const express = require("express");
const bodyParser = require("body-parser");
const multer = require("multer");
const fs = require("fs");
const axios = require("axios");
const { exec } = require("child_process");

const app = express();
const port = 5019;

// Middleware
app.use(bodyParser.json());

// Multer configuration for file uploads
const upload = multer({ dest: "uploads/" });

// Endpoint to handle audio upload
app.post("/upload", upload.single("audio"), async (req, res) => {
    try {
      const inputPath = req.file.path;
      const outputPath = `${req.file.path}.wav`;
  
      // Convert the audio file to LINEAR16 with 16000 Hz sample rate
      const ffmpegCommand = `ffmpeg -i "${inputPath}" -ar 16000 -ac 1 -f wav "${outputPath}"`;
      exec(ffmpegCommand, async (error, stdout, stderr) => {
        if (error) {
          console.error("Error during audio conversion:", stderr);
          fs.unlinkSync(inputPath); // Clean up the uploaded file
          return res.status(500).json({ error: "Error converting audio file" });
        }
  
        // Read the converted audio file
        const audioFile = fs.readFileSync(outputPath);
        const audioBytes = audioFile.toString("base64");
  
        // Google Speech-to-Text API request payload
        const requestPayload = {
          config: {
            encoding: "LINEAR16",
            sampleRateHertz: 16000,
            languageCode: "en-US",
          },
          audio: {
            content: audioBytes,
          },
        };
  
        // Send the audio to Google Speech-to-Text API
        try {
          const response = await axios.post(
            `https://speech.googleapis.com/v1/speech:recognize?key=${process.env.GOOGLE_API_KEY}`,
            requestPayload,
            { headers: { "Content-Type": "application/json" } }
          );
  
          // Extract transcription from the API response; results may be
          // empty if no speech was recognized
          const transcription = (response.data.results || [])
            .map((result) => result.alternatives[0].transcript)
            .join("\n");

          console.log("transcription:", transcription);
  
          // Cleanup temporary files
          fs.unlinkSync(inputPath);
          fs.unlinkSync(outputPath);
  
          // Respond with transcription
          res.json({ transcription });
        } catch (apiError) {
          console.error("Error during transcription:", apiError.response?.data || apiError.message);
          res.status(500).json({ error: "Error during transcription" });
        }
      });
    } catch (error) {
      console.error("Error processing request:", error);
      res.status(500).json({ error: "Error processing audio file" });
    }
  });

app.listen(port, () => {
  console.log(`Server running at http://localhost:${port}`);
});
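
The key step in server.js is building the recognize request: the converted WAV bytes are base64-encoded and wrapped in a config/audio payload. Here is a minimal standalone sketch of that construction, using a few hypothetical sample bytes in place of real audio:

```javascript
// Sketch of the Speech-to-Text payload construction. The sample bytes
// below are placeholders, not real LINEAR16 audio.
const sampleAudio = Buffer.from([0, 1, 2, 3]);

const requestPayload = {
  config: {
    encoding: "LINEAR16", // raw 16-bit PCM, as produced by the ffmpeg step
    sampleRateHertz: 16000, // must match the -ar value passed to ffmpeg
    languageCode: "en-US",
  },
  audio: {
    content: sampleAudio.toString("base64"), // the API expects base64 content
  },
};

console.log(requestPayload.audio.content);
```

If the sample rate in the config doesn't match the actual audio, the API rejects the request, which is why the ffmpeg step pins the rate to 16000 Hz.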

5. Start the backend server:

node server.js

Step 3: Configure React Native

1. Install dependencies for audio recording in the VoiceNoteApp folder:

npm install react-native-audio-recorder-player react-native-permissions

2. Update the VoiceNoteApp/ios/VoiceNoteApp/Info.plist file for iOS permissions. Add the following:

<key>NSMicrophoneUsageDescription</key> 
<string>We need microphone access to record audio for transcription.</string>
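
For Android, the microphone permission must be declared as well. Add the following inside the <manifest> element of VoiceNoteApp/android/app/src/main/AndroidManifest.xml:

```xml
<uses-permission android:name="android.permission.RECORD_AUDIO" />
```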

3. Modify the App.tsx file to include the following code:

import React, { useState } from "react";
import { View, Text, Button, StyleSheet, ActivityIndicator } from "react-native";
import AudioRecorderPlayer from "react-native-audio-recorder-player";

const audioRecorderPlayer = new AudioRecorderPlayer();

const App = () => {
  const [recording, setRecording] = useState(false);
  const [transcription, setTranscription] = useState("");
  const [loading, setLoading] = useState(false); // State for the spinner loader

  const startRecording = async () => {
    try {
      setRecording(true);
      const path = "recording.m4a"; // Recorded file path
      await audioRecorderPlayer.startRecorder(path);
      console.log("Recording started");
    } catch (error) {
      console.error("Error starting recording:", error);
    }
  };

  const stopRecording = async () => {
    try {
      const result = await audioRecorderPlayer.stopRecorder();
      setRecording(false);
      console.log("Recording stopped:", result);

      // Show the loader while sending the audio file
      setLoading(true);

      // Send audio file to backend
      const formData = new FormData();
      formData.append("audio", {
        uri: `file://${result}`,
        type: "audio/mp4", // .m4a is an MPEG-4 audio container
        name: "recording.m4a",
      });

      // "localhost" reaches the host machine from the iOS simulator; on an
      // Android emulator, use http://10.0.2.2:5019 instead.
      const response = await fetch("http://localhost:5019/upload", {
        method: "POST",
        body: formData,
        headers: {
          "Content-Type": "multipart/form-data",
        },
      });

      const data = await response.json();
      console.log('data.transcription: ', data.transcription);
      setTranscription(data.transcription);

      // Hide the loader after API call is complete
      setLoading(false);

    } catch (error) {
      console.error("Error stopping recording:", error);
      setLoading(false); // Ensure loader is hidden even if there's an error
    }
  };

  return (
    <View style={styles.container}>
      <Button
        title={recording ? "Stop Recording" : "Start Recording"}
        onPress={recording ? stopRecording : startRecording}
      />
      <Text style={styles.text}>
        {transcription ? `Transcription: ${transcription}` : ""}
      </Text>
      {loading && (
        <View style={styles.loaderContainer}>
          <ActivityIndicator size="large" color="#0000ff" />
        </View>
      )}
    </View>
  );
};

const styles = StyleSheet.create({
  container: {
    flex: 1,
    justifyContent: "center",
    alignItems: "center",
  },
  text: {
    marginTop: 20,
    fontSize: 16,
  },
  loaderContainer: {
    position: "absolute",
    top: 0,
    left: 0,
    right: 0,
    bottom: 0,
    justifyContent: "center",
    alignItems: "center",
    backgroundColor: "rgba(255, 255, 255, 0.6)", // Optional: dim the background
    zIndex: 10, // Ensure it stays above other components
  },
});

export default App;

4. Run the React Native app:

npx react-native run-ios

Step 4: Test the App

1. Start the Node.js backend:

cd backend && node server.js

2. Run the React Native app and test by recording your voice.

3. The transcription should appear in the app after you stop the recording.
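
If the transcription stays empty, check the backend logs first. The server pulls text out of the Google API response by taking the top alternative of each result and joining them; here is that extraction as a standalone sketch, run against a hypothetical response object shaped like the v1 recognize output:

```javascript
// Hypothetical response body shaped like the Speech-to-Text v1 output;
// the transcripts and confidence values are made up for illustration.
const response = {
  data: {
    results: [
      { alternatives: [{ transcript: "hello world", confidence: 0.92 }] },
      { alternatives: [{ transcript: "testing one two", confidence: 0.88 }] },
    ],
  },
};

// Same extraction the server performs: best alternative per result,
// joined with newlines.
const transcription = (response.data.results || [])
  .map((result) => result.alternatives[0].transcript)
  .join("\n");

console.log(transcription); // prints both transcripts on separate lines
```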

Summary

By following these steps, you’ve successfully created a React Native app with a Node.js backend that records audio, converts it to text using Google’s Speech-to-Text API, and displays the transcription. This process involved configuring React Native, setting up a Node.js backend with FFmpeg for audio conversion, and leveraging Google’s powerful speech recognition capabilities.
