Word Generation - Markov Chains III

December 15, 2020

In the previous post I showed a Markov model that generates company names; trained on any other data set, it generates words similar to that data. The algorithm is a modification of the one from an earlier post, which generated text. To understand how a Markov model works, check out this blog, which covers the theory behind it. We will code this in Python.

Modelling

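To create words, we pick an order, which is the number of letters in each segment the model looks at, and break every word into segments of that length. For example, with order 3 the word 'Python' gives the segments 'Pyt', 'yth' and 'tho', each paired with the letter that follows it ('h', 'o' and 'n'), while 'Java' gives just 'Jav' followed by 'a'.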
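
As a quick sketch of that splitting step: the code later in this post moves through each word one letter at a time, so every segment of `order` letters is paired with the single letter that follows it. The helper name `split_into_transitions` is made up for this illustration and is not part of the post's code.

```python
def split_into_transitions(word, order):
    """Return (segment, next_letter) pairs for every position in `word`."""
    pairs = []
    for index in range(order, len(word)):
        segment = word[index - order:index]   # the `order` letters before position `index`
        pairs.append((segment, word[index]))  # the letter that follows them
    return pairs

print(split_into_transitions("Python", 3))  # [('Pyt', 'h'), ('yth', 'o'), ('tho', 'n')]
print(split_into_transitions("Java", 3))    # [('Jav', 'a')]
```

A word shorter than or equal to the order produces no pairs at all, so it adds nothing to the model.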

First we need a dictionary that maps each segment to the letters that follow it in the data. We start with an empty dictionary and split our data into individual words.

def generate(data, order):
    Dict = {}
    words = data.split(' ')
    index = order

Now we loop through 'words' to fill our dictionary. For each word we start at index 'order', so the first segment is the first 'order' letters of the word. If that segment already exists in the dictionary, we append the letter that comes next to its list. If it does not exist, we create a new entry holding that letter. Then we move one letter forward and repeat.

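    for word in words:
        index = order  # position of the letter that follows the current segment
        for letter in word[index:]:
            key = word[index - order: index]  # the 'order' letters before 'letter'
            if key in Dict:
                Dict[key].append(letter)  # segment seen before: record another follower
            else:
                Dict[key] = [letter]  # first time we see this segment
            index += 1
    return Dict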
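
To see what the finished dictionary looks like, here is a self-contained sketch run on a tiny made-up input. It is a variant rather than the function above: the name `build_table` is hypothetical, it uses `collections.defaultdict` in place of the if/else, and it calls `split()` with no argument so that newlines and repeated spaces also separate words.

```python
from collections import defaultdict

def build_table(data, order):
    """Map each `order`-letter segment to the list of letters seen after it."""
    table = defaultdict(list)
    for word in data.split():  # split on any whitespace (an assumption, see above)
        for index in range(order, len(word)):
            table[word[index - order:index]].append(word[index])
    return dict(table)

table = build_table("banana bandana", 2)
print(table)
# {'ba': ['n', 'n'], 'an': ['a', 'a', 'd', 'a'], 'na': ['n'], 'nd': ['a'], 'da': ['n']}
```

Letters are stored with repeats on purpose: 'an' was followed by 'a' three times and by 'd' once, so a random pick from that list returns 'a' three times as often.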

Generation

Now we use that dictionary to create a new word. First we import the 'random' module.

import random as r

Then we create two strings: one that will hold the generated word, and one that is a randomly chosen segment from our dictionary. Then we create a loop. Inside it we place a 'try-except' statement, since we will run into KeyErrors; handling them separately would complicate things, and this gets the job done faster.

Inside the 'try', we look up the letters that come after our current segment in the dictionary and randomly choose one of them. We add that letter to the string, then set the old letters to the last 'order' letters of the string (three in our case). Since this is looped, new characters keep being added. If we hit a KeyError, we simply return the string, since that tells us we can't generate more characters based on our data.

def make_word(Dict, length, order):
    oldLetters = r.choice(list(Dict.keys()))  # random starting segment
    string = oldLetters
    for i in range(length - order):
        try:
            newLetters = r.choice(Dict[oldLetters])  # random letter seen after this segment
            string += newLetters
            oldLetters = string[-order:]  # slide the window to the last 'order' letters
        except KeyError:
            return string  # no data for this segment, so stop early
    return string
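
If you would rather not lean on 'try-except', one possible alternative is to look the segment up with `dict.get` and stop when nothing comes back. This is only a sketch under a few assumptions: the name `make_word_checked` and the optional `rng` parameter are invented here, and the small table is made-up sample data.

```python
import random

def make_word_checked(table, length, order, rng=random):
    """Grow a word letter by letter until `length` is reached or no continuation exists."""
    word = rng.choice(list(table))            # random starting segment
    while len(word) < length:
        followers = table.get(word[-order:])  # letters seen after the last `order` letters
        if not followers:                     # dead end: this segment never had a follower
            break
        word += rng.choice(followers)
    return word

# Made-up table: every segment here leads to another known segment.
table = {"ba": ["n", "n"], "an": ["a", "a", "d", "a"], "na": ["n"], "nd": ["a"], "da": ["n"]}
word = make_word_checked(table, 8, 2, random.Random(0))  # seeded so runs are repeatable
print(word)

# A table with a dead end stops early: 'bc' is not a key, so we get 'abc'.
print(make_word_checked({"ab": ["c"]}, 10, 2))  # abc
```

Passing a seeded `random.Random` instance makes the output repeatable, which helps when you are comparing different orders on the same data.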

Final Lines

Now, if you want to read these names out loud, you can do that with any TTS library. To learn how, check out this blog, where at the end I also show how to use the gTTS library. You can do other things as well, like displaying the names in a GUI. Right now we will just do the basics.

Dict = generate(data, 3)
string = make_word(Dict, 10, 3)
print(string)
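
The snippet above expects `data` to already exist as one string with the words separated by single spaces, because the words are split on ' '. If your words come one per line or with stray whitespace, a small helper along these lines can tidy them up first. The name `words_to_data` and the sample names are made up for illustration and are not the data set from this post.

```python
def words_to_data(names):
    """Join an iterable of names into one space-separated string, dropping blanks."""
    cleaned = (name.strip() for name in names)
    return ' '.join(name for name in cleaned if name)

sample_names = ["Google\n", "  Amazon", "", "Netflix "]  # made-up sample input
data = words_to_data(sample_names)
print(data)  # Google Amazon Netflix

# With a file of one name per line (hypothetical filename), the same helper works:
# with open("names.txt") as f:
#     data = words_to_data(f)
```

Names that contain a space themselves will still be treated as separate words by the model.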

Also, if you want to train the model on the data I used, it will be up on my GitHub page along with all the code. Till then,

Happy Coding!!
