Saturday, 6 October 2018

Resume Information Extraction with spaCy NLP: Adding a new entity type to a pre-trained NER - Sandeep Kanao

# NLP - SpaCy - Adding a new entity type to an existing pre-trained NER - Sandeep Kanao

GIT


"""Example of training an additional entity type

This script shows how to add a new entity type to an existing pre-trained NER model.

The actual training is performed by looping over the examples, and calling `nlp.entity.update()`. The `update()` method steps through the words of the input.
At each word, it makes a prediction. It then consults the annotations provided on the GoldParse instance, to see whether it was right. If it was wrong, it adjusts its weights so that the correct action will score higher next time.

After training your model, you can save it to a directory. We recommend
wrapping models as Python packages, for ease of deployment.

For more details, see the documentation:
* Training: https://spacy.io/usage/training
* NER: https://spacy.io/usage/linguistic-features#named-entities

"""

# Deep Learning TensorFlow Text Classification for Resume/Curriculum Vitae - Sandeep Kanao

GIT - Sandeep Kanao


Text classification is the task of assigning the right label to a given piece of text. This text can be a phrase, a sentence, or even a paragraph. Our aim is to take in some text as input and attach or assign a label to it. Since we will be using the TensorFlow deep learning library, we can call this a TensorFlow text classification system, and it can be extended to extract information from a resume or curriculum vitae.
We will go through how you can build your own text-based classifier with a large number of classes or labels.
The article is divided into multiple sections. First come the text pre-processing steps and the creation and use of the bag-of-words technique.
Second is the training of the text classifier, and finally the testing and use of the classifier.
Bag of Words – the bag-of-words model in text processing builds a unique list of the words that appear in the training text. This model is used as a tool for feature generation.
E.g., consider two sentences: "Star Wars is better than Star Trek." and "Star Trek isn't as good as Star Wars."
For the above two sentences, the bag of words (the word "than" is omitted here) will be: ["Star", "Wars", "Trek", "better", "good", "isn't", "is", "as"]. The position of each word in the list is therefore fixed.
Now, to construct a feature for classification from a sentence, we use a binary array (an array where each element is either 1 or 0).
For example, a new sentence, "Wars is good", will be represented as [0,1,0,0,1,0,1,0]. As you can see in the array, position 2 is set to 1 because the second word in the bag of words, "Wars", is present in our example sentence. The same holds for the other words "is" and "good".
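As a quick sanity check of the encoding described above, the following sketch (plain whitespace tokenisation, no stemming) reproduces the [0,1,0,0,1,0,1,0] vector for "Wars is good":

# The word list mirrors the example above (the word "than" is left out).
bag = ["Star", "Wars", "Trek", "better", "good", "isn't", "is", "as"]

def to_binary_vector(sentence, bag):
    # 1 if the bag word occurs in the sentence, 0 otherwise
    tokens = set(sentence.split())
    return [1 if word in tokens else 0 for word in bag]

print(to_binary_vector("Wars is good", bag))   # [0, 1, 0, 0, 1, 0, 1, 0]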
Initialization
In [7]:
import nltk
from nltk.stem.lancaster import LancasterStemmer
import numpy as np
import tflearn
import tensorflow as tf
import random
import json
import string
import unicodedata
import sys
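One environment note (not part of the original notebook): nltk.word_tokenize used below relies on NLTK's "punkt" tokenizer models, so a one-time download may be needed:

# one-time download of the tokenizer models used by nltk.word_tokenize
nltk.download('punkt')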
Step 1: Data Preparation - Sandeep Kanao
Before we train a model that can classify a given text into a particular category, we first have to prepare the data. We can create a simple JSON file that will hold the required data for training.
Following is a sample file that I have created; it contains 9 categories. You can create as many categories as you want.
{ "Education": ["Bachelor", "Master", "PhD", "High School", "College", "BSc", "B.Sc.", "MSc", "M.Sc", "BA"], "Email": ["sandeepkanao@gmail.com", "sandeep.kanao@abc.com", "abc123@mnc.net"], "Phone": ['1-416-3040208', "416-355 0208", "416 1220206" "41689700206", "304 123 4455", "1 123 456 5567"], "Skill": ["Python", "c++", "Java", "Angular", ".net"], "Name": ["Sandeep.Kanao", "George M Very", "Jenefer Atkinson", "Kevin Spacy"], "Address": ["17 street name", "#1204-191 College St.", "Apt 1290", "123 Main Parkway Vancouver BC"], "StudiedAt": ["University of British Columbia", "University of", "Seneca College", "Famous College"], "WorkedAt": ["123 TECH", "BC Development", "Tech Works", "TD Canada", "Bank Of Montreal"], "Title" : ["Software Developer", "Programmer", "Architect", "Analyst", "Scientist", "Manager"] }
Step 2: Data Load and Pre-processing
In [52]:
# a table structure to hold the different punctuation used
tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                    if unicodedata.category(chr(i)).startswith('P'))


# method to remove punctuation from sentences
# (the translate step is commented out here; with it enabled, tokens such as
#  email addresses and phone numbers would lose their punctuation)
def remove_punctuation(text):
    return text
    # return text.translate(tbl)

# initialize the stemmer
stemmer = LancasterStemmer()
# variable to hold the JSON data read from the file
data = None

# read the json file and load the training data
with open('traindata.json') as json_data:
    data = json.load(json_data)
    print(data)

# get a list of all categories to train for
categories = list(data.keys())
words = []
# a list of tuples with words in the sentence and category name
docs = []

for each_category in data.keys():
    for each_sentence in data[each_category]:
        # remove any punctuation from the sentence
        each_sentence = remove_punctuation(each_sentence)
        print(each_sentence)
        # extract words from each sentence and append to the word list
        w = nltk.word_tokenize(each_sentence)
        print("tokenized words: ", w)
        words.extend(w)
        docs.append((w, each_category))

# stem and lower each word and remove duplicates
words = [stemmer.stem(w.lower()) for w in words]
words = sorted(list(set(words)))

print(words)
print(docs)
{'Education': ['Bachelor', 'Master', 'PhD', 'High School', 'College', 'BSc', 'B.Sc.', 'MSc', 'M.Sc.', 'BA', 'Associate Degree', 'Associate of Arts (A.A.)', 'Associate of Science (A.S.)', "Bachelor's (or Baccalaureate) Degree", 'college program', 'Graduate Degree', 'Professional Degree', 'Joint Degrees', 'Liberal Arts and Career Combination', 'Teacher Certification', 'BA Hons', 'BFA', 'BJourn', 'BSA', 'BA', 'BEng', 'BHA', 'BTech', 'BArchSc', 'BComm'], 'Email': ['sandeepkanao@gmail.com', 'sandeep.kanao@abc.com', 'abc123@mnc.net', 'nick@gmail.com', 'vic@gmail.com', 'prince@yahoo.in', 'BillGates@gmail.com', 'SteveJobs@yahoo.com', 'BillJobs@aol.com', 'first.lastnumber@school.edu', 'john.smith01@harvard.edu', 'firstinitiallastnamerandomnumber@school.edu', 'jsmith1234@harvard.edu', 'noreply@labs.princeton.edu', 'whoever@harvard.edu', 'texasstars@yahoo.com', 'Fredblogs@thisdomain.com', 'Fredblogs@thisdomain.com'], 'Phone': ['1-416-3040208', '416-355 0208', '416 1220206', '41689700206', '304 123 4455', '1 123 456 5567', '919371424876', '+1 678 3390204', '15674456783', '567-980-3445', '890 342 3342', '011 1 989 678 4567', '900 789 8906', '18009009809'], 'Skill': ['Python', 'c++', 'Java', 'Angular', '.net'], 'Name': ['Sandeep.Kanao', 'George M Very', 'Jenefer Atkinson', 'Kevin Spacy', 'George Ing', 'Tim Jonson', 'Mery Wong', 'Kim Lee', 'Vikas Vats'], 'Address': ['17 street name', '#1204-191 College St.', 'Apt 1290', '123 Main Parkway Vancouver BC', 'Apartment 123', 'Suit 456', '#456 890 King Street', '89 Queen St.', '8900 Cres', 'Road', 'Street', 'St.', 'Apt', 'Apartment', 'Suit'], 'StudiedAt': ['University of British Columbia', 'University of', 'Seneca College', 'Famous College', 'University', 'College', 'High School', 'School'], 'WorkedAt': ['123 TECH', 'BC Development', 'Tech Works', 'TD Canada', 'Bank Of Montreal'], 'Title': ['Software Developer', 'Programmer', 'Architect', 'Analyst', 'Scientist', 'Manager']}
Bachelor
tokenized words:  ['Bachelor']
Master
tokenized words:  ['Master']
PhD
tokenized words:  ['PhD']
High School
tokenized words:  ['High', 'School']
College
tokenized words:  ['College']
BSc
tokenized words:  ['BSc']
B.Sc.
tokenized words:  ['B.Sc', '.']
MSc
tokenized words:  ['MSc']
M.Sc.
tokenized words:  ['M.Sc', '.']
BA
tokenized words:  ['BA']
Associate Degree
tokenized words:  ['Associate', 'Degree']
Associate of Arts (A.A.)
tokenized words:  ['Associate', 'of', 'Arts', '(', 'A.A', '.', ')']
Associate of Science (A.S.)
tokenized words:  ['Associate', 'of', 'Science', '(', 'A.S', '.', ')']
Bachelor's (or Baccalaureate) Degree
tokenized words:  ['Bachelor', "'s", '(', 'or', 'Baccalaureate', ')', 'Degree']
college program
tokenized words:  ['college', 'program']
Graduate Degree
tokenized words:  ['Graduate', 'Degree']
Professional Degree
tokenized words:  ['Professional', 'Degree']
Joint Degrees
tokenized words:  ['Joint', 'Degrees']
Liberal Arts and Career Combination
tokenized words:  ['Liberal', 'Arts', 'and', 'Career', 'Combination']
Teacher Certification
tokenized words:  ['Teacher', 'Certification']
BA Hons
tokenized words:  ['BA', 'Hons']
BFA
tokenized words:  ['BFA']
BJourn
tokenized words:  ['BJourn']
BSA
tokenized words:  ['BSA']
BA
tokenized words:  ['BA']
BEng
tokenized words:  ['BEng']
BHA
tokenized words:  ['BHA']
BTech
tokenized words:  ['BTech']
BArchSc
tokenized words:  ['BArchSc']
BComm
tokenized words:  ['BComm']
sandeepkanao@gmail.com
tokenized words:  ['sandeepkanao', '@', 'gmail.com']
sandeep.kanao@abc.com
tokenized words:  ['sandeep.kanao', '@', 'abc.com']
abc123@mnc.net
tokenized words:  ['abc123', '@', 'mnc.net']
nick@gmail.com
tokenized words:  ['nick', '@', 'gmail.com']
vic@gmail.com
tokenized words:  ['vic', '@', 'gmail.com']
prince@yahoo.in
tokenized words:  ['prince', '@', 'yahoo.in']
BillGates@gmail.com
tokenized words:  ['BillGates', '@', 'gmail.com']
SteveJobs@yahoo.com
tokenized words:  ['SteveJobs', '@', 'yahoo.com']
BillJobs@aol.com
tokenized words:  ['BillJobs', '@', 'aol.com']
first.lastnumber@school.edu
tokenized words:  ['first.lastnumber', '@', 'school.edu']
john.smith01@harvard.edu
tokenized words:  ['john.smith01', '@', 'harvard.edu']
firstinitiallastnamerandomnumber@school.edu
tokenized words:  ['firstinitiallastnamerandomnumber', '@', 'school.edu']
jsmith1234@harvard.edu
tokenized words:  ['jsmith1234', '@', 'harvard.edu']
noreply@labs.princeton.edu
tokenized words:  ['noreply', '@', 'labs.princeton.edu']
whoever@harvard.edu
tokenized words:  ['whoever', '@', 'harvard.edu']
texasstars@yahoo.com
tokenized words:  ['texasstars', '@', 'yahoo.com']
Fredblogs@thisdomain.com
tokenized words:  ['Fredblogs', '@', 'thisdomain.com']
Fredblogs@thisdomain.com
tokenized words:  ['Fredblogs', '@', 'thisdomain.com']
1-416-3040208
tokenized words:  ['1-416-3040208']
416-355 0208
tokenized words:  ['416-355', '0208']
416 1220206
tokenized words:  ['416', '1220206']
41689700206
tokenized words:  ['41689700206']
304 123 4455
tokenized words:  ['304', '123', '4455']
1 123 456 5567
tokenized words:  ['1', '123', '456', '5567']
919371424876
tokenized words:  ['919371424876']
+1 678 3390204
tokenized words:  ['+1', '678', '3390204']
15674456783
tokenized words:  ['15674456783']
567-980-3445
tokenized words:  ['567-980-3445']
890 342 3342
tokenized words:  ['890', '342', '3342']
011 1 989 678 4567
tokenized words:  ['011', '1', '989', '678', '4567']
900 789 8906
tokenized words:  ['900', '789', '8906']
18009009809
tokenized words:  ['18009009809']
Python
tokenized words:  ['Python']
c++
tokenized words:  ['c++']
Java
tokenized words:  ['Java']
Angular
tokenized words:  ['Angular']
.net
tokenized words:  ['.net']
Sandeep.Kanao
tokenized words:  ['Sandeep.Kanao']
George M Very
tokenized words:  ['George', 'M', 'Very']
Jenefer Atkinson
tokenized words:  ['Jenefer', 'Atkinson']
Kevin Spacy
tokenized words:  ['Kevin', 'Spacy']
George Ing
tokenized words:  ['George', 'Ing']
Tim Jonson
tokenized words:  ['Tim', 'Jonson']
Mery Wong
tokenized words:  ['Mery', 'Wong']
Kim Lee
tokenized words:  ['Kim', 'Lee']
Vikas Vats
tokenized words:  ['Vikas', 'Vats']
17 street name
tokenized words:  ['17', 'street', 'name']
#1204-191 College St.
tokenized words:  ['#', '1204-191', 'College', 'St', '.']
Apt 1290
tokenized words:  ['Apt', '1290']
123 Main Parkway Vancouver BC
tokenized words:  ['123', 'Main', 'Parkway', 'Vancouver', 'BC']
Apartment 123
tokenized words:  ['Apartment', '123']
Suit 456
tokenized words:  ['Suit', '456']
#456 890 King Street
tokenized words:  ['#', '456', '890', 'King', 'Street']
89 Queen St.
tokenized words:  ['89', 'Queen', 'St', '.']
8900 Cres
tokenized words:  ['8900', 'Cres']
Road
tokenized words:  ['Road']
Street
tokenized words:  ['Street']
St.
tokenized words:  ['St', '.']
Apt
tokenized words:  ['Apt']
Apartment
tokenized words:  ['Apartment']
Suit
tokenized words:  ['Suit']
University of British Columbia
tokenized words:  ['University', 'of', 'British', 'Columbia']
University of
tokenized words:  ['University', 'of']
Seneca College
tokenized words:  ['Seneca', 'College']
Famous College
tokenized words:  ['Famous', 'College']
University
tokenized words:  ['University']
College
tokenized words:  ['College']
High School
tokenized words:  ['High', 'School']
School
tokenized words:  ['School']
123 TECH
tokenized words:  ['123', 'TECH']
BC Development
tokenized words:  ['BC', 'Development']
Tech Works
tokenized words:  ['Tech', 'Works']
TD Canada
tokenized words:  ['TD', 'Canada']
Bank Of Montreal
tokenized words:  ['Bank', 'Of', 'Montreal']
Software Developer
tokenized words:  ['Software', 'Developer']
Programmer
tokenized words:  ['Programmer']
Architect
tokenized words:  ['Architect']
Analyst
tokenized words:  ['Analyst']
Scientist
tokenized words:  ['Scientist']
Manager
tokenized words:  ['Manager']
['#', "'s", '(', ')', '+1', '.', '.net', '011', '0208', '1', '1-416-3040208', '1204-191', '1220206', '123', '1290', '15674456783', '17', '18009009809', '304', '3342', '3390204', '342', '416', '416-355', '41689700206', '4455', '456', '4567', '5567', '567-980-3445', '678', '789', '89', '890', '8900', '8906', '900', '919371424876', '989', '@', 'a.', 'a.s', 'abc.com', 'abc123', 'analyst', 'and', 'angul', 'aol.com', 'apart', 'apt', 'architect', 'art', 'assocy', 'atkinson', 'b.sc', 'ba', 'baccala', 'bachel', 'bank', 'barchsc', 'bc', 'bcom', 'beng', 'bfa', 'bha', 'billg', 'billjob', 'bjourn', 'brit', 'bsa', 'bsc', 'btech', 'c++', 'canad', 'car', 'cert', 'colleg', 'columb', 'combin', 'cre', 'degr', 'develop', 'fam', 'first.lastnumber', 'firstinitiallastnamerandomnumb', 'fredblog', 'georg', 'gmail.com', 'gradu', 'harvard.edu', 'high', 'hon', 'ing', 'jav', 'jenef', 'john.smith01', 'joint', 'jonson', 'jsmith1234', 'kevin', 'kim', 'king', 'labs.princeton.edu', 'lee', 'lib', 'm', 'm.sc', 'main', 'man', 'mast', 'mery', 'mnc.net', 'mont', 'msc', 'nam', 'nick', 'noreply', 'of', 'or', 'parkway', 'phd', 'print', 'profess', 'program', 'python', 'queen', 'road', 'sandeep.kanao', 'sandeepkanao', 'school', 'school.edu', 'sci', 'senec', 'softw', 'spacy', 'st', 'stevejob', 'street', 'suit', 'td', 'teach', 'tech', 'texasst', 'thisdomain.com', 'tim', 'univers', 'vancouv', 'vat', 'very', 'vic', 'vika', 'whoev', 'wong', 'work', 'yahoo.com', 'yahoo.in']
[(['Bachelor'], 'Education'), (['Master'], 'Education'), (['PhD'], 'Education'), (['High', 'School'], 'Education'), (['College'], 'Education'), (['BSc'], 'Education'), (['B.Sc', '.'], 'Education'), (['MSc'], 'Education'), (['M.Sc', '.'], 'Education'), (['BA'], 'Education'), (['Associate', 'Degree'], 'Education'), (['Associate', 'of', 'Arts', '(', 'A.A', '.', ')'], 'Education'), (['Associate', 'of', 'Science', '(', 'A.S', '.', ')'], 'Education'), (['Bachelor', "'s", '(', 'or', 'Baccalaureate', ')', 'Degree'], 'Education'), (['college', 'program'], 'Education'), (['Graduate', 'Degree'], 'Education'), (['Professional', 'Degree'], 'Education'), (['Joint', 'Degrees'], 'Education'), (['Liberal', 'Arts', 'and', 'Career', 'Combination'], 'Education'), (['Teacher', 'Certification'], 'Education'), (['BA', 'Hons'], 'Education'), (['BFA'], 'Education'), (['BJourn'], 'Education'), (['BSA'], 'Education'), (['BA'], 'Education'), (['BEng'], 'Education'), (['BHA'], 'Education'), (['BTech'], 'Education'), (['BArchSc'], 'Education'), (['BComm'], 'Education'), (['sandeepkanao', '@', 'gmail.com'], 'Email'), (['sandeep.kanao', '@', 'abc.com'], 'Email'), (['abc123', '@', 'mnc.net'], 'Email'), (['nick', '@', 'gmail.com'], 'Email'), (['vic', '@', 'gmail.com'], 'Email'), (['prince', '@', 'yahoo.in'], 'Email'), (['BillGates', '@', 'gmail.com'], 'Email'), (['SteveJobs', '@', 'yahoo.com'], 'Email'), (['BillJobs', '@', 'aol.com'], 'Email'), (['first.lastnumber', '@', 'school.edu'], 'Email'), (['john.smith01', '@', 'harvard.edu'], 'Email'), (['firstinitiallastnamerandomnumber', '@', 'school.edu'], 'Email'), (['jsmith1234', '@', 'harvard.edu'], 'Email'), (['noreply', '@', 'labs.princeton.edu'], 'Email'), (['whoever', '@', 'harvard.edu'], 'Email'), (['texasstars', '@', 'yahoo.com'], 'Email'), (['Fredblogs', '@', 'thisdomain.com'], 'Email'), (['Fredblogs', '@', 'thisdomain.com'], 'Email'), (['1-416-3040208'], 'Phone'), (['416-355', '0208'], 'Phone'), (['416', '1220206'], 'Phone'), (['41689700206'], 'Phone'), (['304', '123', '4455'], 'Phone'), (['1', '123', '456', '5567'], 'Phone'), (['919371424876'], 'Phone'), (['+1', '678', '3390204'], 'Phone'), (['15674456783'], 'Phone'), (['567-980-3445'], 'Phone'), (['890', '342', '3342'], 'Phone'), (['011', '1', '989', '678', '4567'], 'Phone'), (['900', '789', '8906'], 'Phone'), (['18009009809'], 'Phone'), (['Python'], 'Skill'), (['c++'], 'Skill'), (['Java'], 'Skill'), (['Angular'], 'Skill'), (['.net'], 'Skill'), (['Sandeep.Kanao'], 'Name'), (['George', 'M', 'Very'], 'Name'), (['Jenefer', 'Atkinson'], 'Name'), (['Kevin', 'Spacy'], 'Name'), (['George', 'Ing'], 'Name'), (['Tim', 'Jonson'], 'Name'), (['Mery', 'Wong'], 'Name'), (['Kim', 'Lee'], 'Name'), (['Vikas', 'Vats'], 'Name'), (['17', 'street', 'name'], 'Address'), (['#', '1204-191', 'College', 'St', '.'], 'Address'), (['Apt', '1290'], 'Address'), (['123', 'Main', 'Parkway', 'Vancouver', 'BC'], 'Address'), (['Apartment', '123'], 'Address'), (['Suit', '456'], 'Address'), (['#', '456', '890', 'King', 'Street'], 'Address'), (['89', 'Queen', 'St', '.'], 'Address'), (['8900', 'Cres'], 'Address'), (['Road'], 'Address'), (['Street'], 'Address'), (['St', '.'], 'Address'), (['Apt'], 'Address'), (['Apartment'], 'Address'), (['Suit'], 'Address'), (['University', 'of', 'British', 'Columbia'], 'StudiedAt'), (['University', 'of'], 'StudiedAt'), (['Seneca', 'College'], 'StudiedAt'), (['Famous', 'College'], 'StudiedAt'), (['University'], 'StudiedAt'), (['College'], 'StudiedAt'), (['High', 'School'], 'StudiedAt'), (['School'], 'StudiedAt'), (['123', 
'TECH'], 'WorkedAt'), (['BC', 'Development'], 'WorkedAt'), (['Tech', 'Works'], 'WorkedAt'), (['TD', 'Canada'], 'WorkedAt'), (['Bank', 'Of', 'Montreal'], 'WorkedAt'), (['Software', 'Developer'], 'Title'), (['Programmer'], 'Title'), (['Architect'], 'Title'), (['Analyst'], 'Title'), (['Scientist'], 'Title'), (['Manager'], 'Title')]
Step 3: Convert the Data to TensorFlow Specification - Sandeep Kanao
In [57]:
# create our training data
training = []
output = []
# create an empty array for our output
output_empty = [0] * len(categories)


for doc in docs:
    # initialize our bag of words(bow) for each document in the list
    bow = []
    # list of tokenized words for the pattern
    token_words = doc[0]
    # stem each word
    token_words = [stemmer.stem(word.lower()) for word in token_words]
    # create our bag of words array
    for w in words:
        bow.append(1 if w in token_words else 0)

    output_row = list(output_empty)
    output_row[categories.index(doc[1])] = 1

    # our training set will contain the bag-of-words vector and the output row that tells
    # which category that bag of words belongs to.
    training.append([bow, output_row])

# shuffle our features and turn them into an np.array, as TensorFlow takes in NumPy arrays
random.shuffle(training)
training = np.array(training)

# train_x contains the bag of words and train_y contains the label/category
train_x = list(training[:, 0])
train_y = list(training[:, 1])
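To see what one training row holds, here is a quick inspection you can run after the cell above (a sketch, not part of the original notebook): the first element is the binary bag-of-words vector, the second is the one-hot category label.

# inspect one training example: a bag-of-words vector and its one-hot label
sample_bow, sample_label = training[0]
print(len(sample_bow), "bag-of-words features")          # same length as `words`
print(sample_label)                                      # one-hot over `categories`
print("category:", categories[np.argmax(sample_label)])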
Step 4: Initiate TensorFlow Text Classification - Sandeep Kanao
The code below trains for 25,000 epochs (n_epoch=25000); with this small dataset and a batch size of 4, that works out to roughly 700,000 training steps, as the output below shows. You can reduce n_epoch for a quicker run.
In [54]:
## reset underlying graph data
tf.reset_default_graph()
# Build neural network
net = tflearn.input_data(shape=[None, len(train_x[0])])
net = tflearn.fully_connected(net, 24)
net = tflearn.fully_connected(net, 24)
net = tflearn.fully_connected(net, len(train_y[0]), activation='softmax')
net = tflearn.regression(net)

# Define model and setup tensorboard
model = tflearn.DNN(net, tensorboard_dir='tflearn_logs')
# Start training (apply gradient descent algorithm)
model.fit(train_x, train_y, n_epoch=25000, batch_size=4, show_metric=True)
model.save('model.tflearn')
Training Step: 699899  | total loss: 0.02096 | time: 0.060s
| Adam | epoch: 24997 | loss: 0.02096 - acc: 0.9905 -- iter: 044/110
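Because model.save('model.tflearn') writes TensorFlow checkpoint files, a later session can rebuild the same network and restore the weights before predicting. A sketch, assuming the pre-processing in Steps 1–3 has been re-run so that words, categories, train_x and train_y exist with the same shapes:

# rebuild the same architecture, then restore the trained weights
tf.reset_default_graph()
net = tflearn.input_data(shape=[None, len(train_x[0])])
net = tflearn.fully_connected(net, 24)
net = tflearn.fully_connected(net, 24)
net = tflearn.fully_connected(net, len(train_y[0]), activation='softmax')
net = tflearn.regression(net)

model = tflearn.DNN(net, tensorboard_dir='tflearn_logs')
model.load('model.tflearn')   # restore the weights saved by model.save(...)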
Step 5: Testing the TensorFlow Text Classification Model - Sandeep Kanao
In [55]:
# let's test the model on a few sentences:
# the first two sentences appear in the training data; the remaining sentences do not.
sent_1 = "sandeepkanao@gmail.com"
sent_2 = "123 Main Parkway Vancouver BC"
sent_3 = "B.Sc. Computer Science"
sent_4 = "University of New York"
sent_5 = "jonsmith@mail.com"
sent_6 = "780678709"
sent_7 = "ASP.NET"
sent_8 = "Jon Smith"
sent_9 = "17 MNC Cres Toronto ON"
sent_10 = "2013 - Present Royal Bank Of Canada"
sent_11 = "2012 - 2013 Vancouver, BC Software Developer BC Development"

# a method that takes in a sentence and the list of all words
# and returns the data in a form that can be fed to tensorflow


def get_tf_record(sentence):
    global words
    # tokenize the pattern
    sentence_words = nltk.word_tokenize(sentence)
    # stem each word
    sentence_words = [stemmer.stem(word.lower()) for word in sentence_words]
    # bag of words
    bow = [0]*len(words)
    for s in sentence_words:
        for i, w in enumerate(words):
            if w == s:
                bow[i] = 1

    return(np.array(bow))


# we can start to predict the results for each of the sentences
print(categories[np.argmax(model.predict([get_tf_record(sent_1)]))])
print(categories[np.argmax(model.predict([get_tf_record(sent_2)]))])
print(categories[np.argmax(model.predict([get_tf_record(sent_3)]))])
print(categories[np.argmax(model.predict([get_tf_record(sent_4)]))])
print(categories[np.argmax(model.predict([get_tf_record(sent_5)]))])
print(categories[np.argmax(model.predict([get_tf_record(sent_6)]))])
print(categories[np.argmax(model.predict([get_tf_record(sent_7)]))])
print(categories[np.argmax(model.predict([get_tf_record(sent_8)]))])
print(categories[np.argmax(model.predict([get_tf_record(sent_9)]))])
print(categories[np.argmax(model.predict([get_tf_record(sent_10)]))])
print(categories[np.argmax(model.predict([get_tf_record(sent_11)]))])
Email
Address
Title
StudiedAt
StudiedAt
StudiedAt
StudiedAt
StudiedAt
Address
WorkedAt
Title
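To tie this back to resume information extraction, one simple way to use the classifier (a sketch, not from the original post) is to split a resume into lines and label each line with its predicted category. The resume snippet below is hypothetical.

# hypothetical resume snippet; each non-empty line is classified independently
resume_text = """Sandeep Kanao
sandeepkanao@gmail.com
416-355 0208
123 Main Parkway Vancouver BC
Software Developer
University of British Columbia"""

for line in resume_text.splitlines():
    line = line.strip()
    if not line:
        continue
    prediction = model.predict([get_tf_record(line)])
    print(categories[np.argmax(prediction)], "->", line)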