AI Voice of Phoebe

For various reasons, I have several ideas for using AI to generate things based on the 90s sitcom Friends. This post is about my attempt to make a text-to-speech generator with the voice of Phoebe.

I’m using this TensorFlow-based text-to-speech model implementation: https://github.com/Kyubyong/dc_tts

I won’t cover all the details of setting up a TensorFlow environment here. I use a PC running Ubuntu Bionic, which has an Nvidia GeForce GTX 1060 card with 6 GB of RAM. I have a Python 3.6 virtual environment set up for TensorFlow, with the CUDA stuff installed. So I started by using pip to install the listed requirements and cloning the repository. Everything below takes place at the top level of the cloned repository.
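For reference, that boils down to something like this (check the repository’s README for the exact requirements list; the packages named here are my recollection of it):

git clone https://github.com/Kyubyong/dc_tts.git
cd dc_tts
pip install librosa tqdm matplotlib scipy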

I then downloaded the LJ Speech Dataset pre-trained model linked near the bottom of the page. For some reason it has a .tar extension, but it is in fact a zip file. It should be unpacked into a directory called logdir; inside there will be two directories called LJ01-1 and LJ01-2. This implementation uses two separate models: Text2Mel, referred to as 1, and SSRN, referred to as 2.
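After unpacking, the layout should look like this:

logdir/
    LJ01-1/    (Text2Mel checkpoint, model 1)
    LJ01-2/    (SSRN checkpoint, model 2)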

To see if it works I ran python synthesize.py, which makes WAVs of the sentences in the file harvard_sentences.txt and puts them in a directory called samples. It worked.

Next, the hard bit. To make a set of training data for resuming training of the LJ pre-trained models (also known as transfer learning), I needed a lot of transcribed clips of Phoebe talking. I downloaded the audio of this YouTube video https://www.youtube.com/watch?v=FhS-3WJ7K8Y using youtube-dl (the version in Ubuntu was too old to work) like so:

youtube-dl --extract-audio --audio-format wav 'https://www.youtube.com/watch?v=FhS-3WJ7K8Y'

I used a program called Transcriber (available in Ubuntu) to mark and transcribe 100 bits of Phoebe from the video’s audio track, leaving the text blank for the bits in between. I exported the result as an STM file. Then I wrote the following script to split the original WAV into separate files containing the Phoebe sections and write out a transcript.csv (not actually a CSV file) with one line per file in the form <filename>|<transcribed text>. The script uses ffmpeg to do the splitting; it takes two arguments: the STM file and the WAV.

#!/usr/bin/ruby

require 'fileutils'

# Clips and transcript go under train/phoebe
DIR = 'train/phoebe'
FileUtils.mkdir_p(DIR + '/wavs')

out = {}
c = 0

# ARGV[0] is the STM file exported from Transcriber
File.open(ARGV[0]).each_line do |line|
    next if line[0] == ';'  # skip STM comment lines
    m = line.match(/.*inter_segment_gap ([\d\.]+) ([\d\.]+) <.+> (.+)$/)
    if m then
        puts "#{m[1]} #{m[2]} #{m[3]}"
        fname = "%04d.wav" % c
        out[fname] = m[3]
        # ARGV[1] is the original WAV; cut out the segment between the
        # start and end times captured from the STM line
        system("ffmpeg -i \"#{ARGV[1]}\" -acodec copy -ss #{m[1]} -to #{m[2]} -ac 1 -y #{DIR}/wavs/#{fname}")
        c += 1
    end
end

# Write one "<filename>|<transcribed text>" line per clip
File.open("#{DIR}/transcript.csv", 'w') do |o|
    out.keys.sort.each do |k|
        o.puts "#{k}|#{out[k]}"
    end
end
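
I saved that as split_stm.rb (the name doesn’t matter) and ran it along these lines, with illustrative filenames standing in for my real STM and WAV files:

ruby split_stm.rb phoebe.stm phoebe.wav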

In data_load.py, in the load_data function, I replaced everything between if mode=="train": and else: # synthesize on unseen test text. with the following:

        # Parse
        fpaths, text_lengths, texts = [], [], []
        transcript = os.path.join(hp.data, 'transcript.csv')
        lines = codecs.open(transcript, 'r', 'utf-8').readlines()
        for line in lines:
            fname, text = line.strip().split("|")

            fpath = os.path.join(hp.data, "wavs", fname)
            fpaths.append(fpath)

            text = text_normalize(text) + "E"  # E: EOS
            text = [char2idx[char] for char in text]
            text_lengths.append(len(text))
            texts.append(np.array(text, np.int32).tostring())

        return fpaths, text_lengths, texts

This makes it read transcript.csv in the format written by the script above.
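A line in that file looks like this (a made-up example, just to show the format):

0000.wav|Oh, I wish I could, but I don't want to.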

In hyperparams.py I commented out data = "/data/private/voice/LJSpeech-1.0" and added data = "train/phoebe".
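So the relevant bit of hyperparams.py ends up looking like this (the surrounding lines stay as they are):

# data = "/data/private/voice/LJSpeech-1.0"
data = "train/phoebe"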

I had to fix a bug as explained here: https://github.com/Kyubyong/dc_tts/issues/11#issuecomment-394698487

Then I was ready to run the preprocessing step: python prepo.py. This results in two new directories, mags and mels, each containing one file per WAV in the training set.

Finally, time for training. There are two parts, one per model. They can apparently be run independently and possibly simultaneously, but when I tried that I ran out of RAM. Part 1 is kicked off with python train.py 1. For me this complained a bit about lack of GPU RAM but continued anyway. I let it run for a couple of hours, then hit Ctrl-C to stop it.

When I tried to run part 2 with python train.py 2 it ran out of GPU RAM pretty quickly and gave up. To make it fall back to the CPU I added the following two lines near the start of train.py, just before the rest of the imports:

import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"

The CPU is much slower, so I let it run overnight, then stopped it again with Ctrl-C. Note that this also stops train.py 1 and synthesize.py from using the GPU, so comment out the second line to speed those up again.

I was then able to produce WAVs by running synthesize.py. I slightly modified data_load.py to load its sentences one per line from a simple text file.
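That change was roughly along these lines, in the else: # synthesize on unseen test text. branch of load_data (the filename sentences.txt is just what I called my input file, and padding out to hp.max_N, the maximum text length from hyperparams.py, follows what the original branch does):

    else: # synthesize on unseen test text.
        # Read one sentence per line from a plain text file
        lines = codecs.open("sentences.txt", 'r', 'utf-8').readlines()
        sents = [text_normalize(line).strip() + "E" for line in lines if line.strip()]  # E: EOS
        texts = np.zeros((len(sents), hp.max_N), np.int32)
        for i, sent in enumerate(sents):
            texts[i, :len(sent)] = [char2idx[char] for char in sent]
        return texts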

The results clearly sound like her, but are quite difficult to understand. Possibly I should have selected clearer-sounding samples with less variation in intonation for the training set.

I will put some example outputs here when I work out how.