Word Generation - Markov Chains III
In the previous post I showed a Markov model that generates company
names. Trained on any other data set, it will generate words similar
to that data. The algorithm is a modification of one I used in an earlier
blog post to generate text. To understand how a Markov model works, check
out this blog, which covers the theory behind it. We will code this in
Python.
Modelling
To create words, the order is a number of letters, and each word is
broken into segments of that length. For example, 'Python' with order 3 starts with the segment 'Pyt', followed by the letters 'h', 'o' and 'n', while 'Java' is simply 'Jav' followed by 'a'.
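The generate function in this post slides the window forward one letter at a time, so the segments it records overlap. As a rough sketch of which segment and next-letter pairs a single word contributes (the helper name transitions is my own, made up for illustration):

```python
def transitions(word, order):
    """Return (segment, next_letter) pairs for one word (hypothetical helper)."""
    pairs = []
    # Start at `order` so there is always a full segment before the letter.
    for index in range(order, len(word)):
        pairs.append((word[index - order:index], word[index]))
    return pairs

print(transitions('Python', 3))  # [('Pyt', 'h'), ('yth', 'o'), ('tho', 'n')]
print(transitions('Java', 3))    # [('Jav', 'a')]
```

A word that is not longer than the order contributes nothing, since there is no letter left to follow the segment.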
First we build a dictionary that maps each segment to the letters that
follow it in the data. We start with a blank dictionary and split our
data into individual words.
def generate(data, order):
    Dict = {}
    words = data.split(' ')
    index = order
Now we loop through 'words' to fill our dictionary. For each word we start at
the order and pick up the segment just before that position. If that segment
already exists in the dictionary, we append the letter that comes next to its
list. If it does not exist, we create a new entry holding that letter.
    for word in words:
        index = order
        for letter in word[index:]:
            key = ''.join(word[index - order: index])
            if key in Dict:
                Dict[key].append(letter)
            else:
                Dict[key] = [letter]
            index += 1
    return Dict
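To see what the dictionary looks like, here is a minimal sketch of the same idea written with collections.defaultdict. The name build_model and the sample data are assumptions for illustration, and it uses split() rather than split(' ') so that newlines also separate words.

```python
from collections import defaultdict

def build_model(data, order):
    """Map each `order`-letter segment to the letters that follow it."""
    model = defaultdict(list)
    for word in data.split():
        # Slide over the word; words no longer than `order` add nothing.
        for index in range(order, len(word)):
            model[word[index - order:index]].append(word[index])
    return dict(model)

print(build_model('banana bandana', 2))
# {'ba': ['n', 'n'], 'an': ['a', 'a', 'd', 'a'], 'na': ['n'], 'nd': ['a'], 'da': ['n']}
```

Letters are stored with repeats on purpose: 'an' is followed by 'a' three times and 'd' once, so a random pick from that list favours 'a'.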
Generation
Now we use that dictionary to create a new word. First we import the random module.
import random as r
Then we create two strings: one that will contain our generated word, and another that holds a randomly chosen segment from our dictionary. Then we create a loop. Inside it we place a 'try-except' statement, since we will run into KeyErrors. Handling them separately would complicate things, and this gets the job done faster.
Inside the 'try', we look up the letters that come after our current
segment in the dictionary and randomly choose one of them. Then we add
it to the string and update the old letters to the last 'order'
letters of the created string. Since this is looped, new characters keep
being added. If we run into a KeyError, we simply return the string as it is,
since that tells us we can't generate more characters based on our data.
def make_word(Dict, length, order):
    # Start from a randomly chosen segment of the dictionary.
    oldLetters = r.choice(list(Dict.keys()))
    string = oldLetters
    for i in range(length - order):
        try:
            newLetters = r.choice(Dict[oldLetters])
            string += newLetters
            # Slide the window: keep only the last `order` letters.
            oldLetters = string[-order:]
        except KeyError:
            # No known continuation for this segment, so stop early.
            return string
    return string
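If you would rather not rely on the exception, one possible variant checks for a missing segment with dict.get instead. This is only a sketch, not the version used in this post: the name make_word_get, the rng parameter and the small hand-written model are all assumptions for illustration.

```python
import random

def make_word_get(model, length, order, rng=random):
    """Variant of make_word that uses dict.get instead of try/except."""
    word = rng.choice(list(model.keys()))
    while len(word) < length:
        # Look up the letters that may follow the last `order` letters.
        choices = model.get(word[-order:])
        if not choices:
            # Dead end: the data never continues this segment.
            break
        word += rng.choice(choices)
    return word

# Hypothetical model, as built from the words 'banana' and 'bandana'.
model = {'ba': ['n', 'n'], 'an': ['a', 'a', 'd', 'a'],
         'na': ['n'], 'nd': ['a'], 'da': ['n']}
print(make_word_get(model, 8, 2, random.Random(0)))
```

Passing a seeded random.Random makes a run repeatable, which helps when you want to compare outputs while tweaking the order.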
Final Lines
Now, if you want to read these names out loud, you can use any TTS library. To learn how, check out this blog, where at the end I also show how to use the gTTS library. You can do other things as well, like displaying the word in a GUI. Right now we will just do the basics.
Dict = generate(data, 3)
string = make_word(Dict, 10, 3)
print(string)
Also, if you want to train the model on the data I used, it will be up on my GitHub page along with all the code. Till then,
Happy Coding!!