CourseNana | CISC3025 Natural Language Processing Project 3: Maximum entropy model

CISC3025 - Natural Language Processing Project#3, 2023/2024

(Due date: 18th April)

Person Name ('Named Entity') Recognition

This is a group project with at most two students per group. You need to enroll in a group here. In this project, you will build a maximum entropy model (MEM) for identifying person names in newswire texts (Label=PERSON or Label=O). We have provided all of the machinery for training and testing your MEM, but we have left the feature set woefully inadequate. Your job is to modify the code for generating features so that it produces a much more sensible, complete, and higher-performing set of features.

NOTE: In this project, we expect you to design a web application for demonstrating your final model. The web page must provide at least this simple function: 1) the user inputs a sentence; 2) the page outputs the named entity recognition results. More functionality in your web application is highly encouraged. For example, you can integrate the previous project’s work, i.e., text classification, into your project (it would be very cool!).

You NEED to submit:

• Runnable program
  o You need to implement a Named Entity Recognition model based on the given starter code.
• Model file
  o Once you have finished designing your features and they function well, the program will dump a model file (‘model.pkl’) automatically. We will use it to evaluate your model.
• Web application
  o You also need to develop a web application (free style, no restriction on programming languages) to demonstrate your NER model or even more NLP functions.
  o Obviously, you need to learn how to call your Python project when building the web application.
• Report
  o You should write a report to introduce your work on this project. Your report should contain the following content:
    § Introduction;
    § Description of the methods, implementation, and additional consideration to optimize your model;
    § Evaluations and discussions about your findings;
    § Conclusion and future work suggestions.
• Presentation
  o You need to give an 8-minute presentation in class to introduce your work, followed by a 3-minute Q&A session. The content of the presentation may refer to the report.
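As a rough illustration of how a web page can call Python code, here is a minimal sketch that uses only the standard library's wsgiref. Everything in it is an assumption made for illustration: tag_sentence is a placeholder that merely marks capitalised words and should be replaced with a call into your own model, and the names app and serve are hypothetical. Any other framework or language is equally acceptable.

```python
import html
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server


def tag_sentence(sentence):
    """Placeholder tagger (assumption): swap in a call to your trained MEM."""
    return [(w, "PERSON" if w[:1].isupper() else "O") for w in sentence.split()]


def app(environ, start_response):
    """WSGI app: read ?sentence=... and return the tagged words as HTML."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    sentence = query.get("sentence", [""])[0]
    # One table row per word; escape user input before echoing it back.
    rows = "".join(
        "<tr><td>%s</td><td>%s</td></tr>" % (html.escape(word), label)
        for word, label in tag_sentence(sentence)
    )
    page = (
        "<form><input name='sentence' value='%s'><button>Tag</button></form>"
        "<table>%s</table>" % (html.escape(sentence, quote=True), rows)
    )
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [page.encode("utf-8")]


def serve(port=8000):
    """Run the demo locally until interrupted."""
    with make_server("", port, app) as server:
        server.serve_forever()
```

Calling serve() starts a local server; opening http://localhost:8000/ in a browser then shows a form, and submitting a sentence reloads the page with one tagged row per word.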

Starter Code

In the starter code, we have provided you with three simple starter features, but you should be able to improve substantially on them. We recommend experimenting with orthographic information, gazetteers, and the surrounding words, and we also encourage you to think beyond these suggestions.

The file you will be modifying is MEM.py.
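As one illustration of the gazetteer idea mentioned above, here is a minimal sketch of a name-list lookup. The tiny FIRST_NAMES set and the gazetteer_features helper are hypothetical and not part of the starter code; in practice you would load a much larger list from a source of your choosing and merge the returned entries into the dictionary built by features().

```python
# Hypothetical gazetteer of first names; a real one would be loaded from a
# name list of your choosing rather than hard-coded.
FIRST_NAMES = {"peter", "jenny", "mary", "john"}


def gazetteer_features(words, position, names=FIRST_NAMES):
    """Return gazetteer-based features for the word at `position`."""
    features = {}
    if words[position].lower() in names:
        features["in_first_name_gazetteer"] = 1
    # A known first name immediately before the current word hints at a surname.
    if position > 0 and words[position - 1].lower() in names:
        features["prev_in_first_name_gazetteer"] = 1
    return features


words = ["Peter", "Blackburn", "said"]
print(gazetteer_features(words, 0))  # {'in_first_name_gazetteer': 1}
print(gazetteer_features(words, 1))  # {'prev_in_first_name_gazetteer': 1}
print(gazetteer_features(words, 2))  # {}
```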

Adding Features to the Code

You will create the features for the word at the given position, with the given previous label. You may condition on any word in the sequence (and its relative position), not just the current word, because they are all observed. You may not condition on any labels other than the previous one.
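Because every word in the sequence is observed, neighbouring words are fair game. The sketch below shows one possible way to turn the previous and next word into features while guarding against the sequence boundaries. The helper name context_features and the sentinel tokens are assumptions for illustration; note that it uses no labels at all, so it stays within the rule about the previous label.

```python
def context_features(words, position):
    """Features built from the words around `position` (all are observed)."""
    features = {}
    # Sentinel tokens stand in for neighbours that fall outside the sequence.
    prev_word = words[position - 1].lower() if position > 0 else "<START>"
    next_word = words[position + 1].lower() if position + 1 < len(words) else "<END>"
    features["prev_word=%s" % prev_word] = 1
    features["next_word=%s" % next_word] = 1
    return features


words = ["Peter", "Blackburn", "said"]
print(context_features(words, 1))  # {'prev_word=peter': 1, 'next_word=said': 1}
print(context_features(words, 0))  # {'prev_word=<START>': 1, 'next_word=blackburn': 1}
```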

You need to give a unique name for each feature. The system will use this unique name in training to set the weight for that feature. At test time, the system will use the name of this feature and its weight to make a classification decision.
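To see why names and weights are all the system needs at test time, the following sketch shows how a maximum entropy model typically turns weighted features into label probabilities: each label's score is the sum of the weights of its active features, and the scores are normalised with a softmax. The weights are made-up numbers and label_probabilities is a hypothetical helper, not the starter code's internals; it also assumes binary-valued features.

```python
import math

# Made-up weights keyed by (label, feature name); training would learn these.
WEIGHTS = {
    ("PERSON", "Titlecase"): 1.5,
    ("O", "Titlecase"): -0.5,
}
LABELS = ("PERSON", "O")


def label_probabilities(features, weights=WEIGHTS, labels=LABELS):
    """Softmax over per-label sums of weight * feature value."""
    scores = {
        label: sum(weights.get((label, name), 0.0) * value
                   for name, value in features.items())
        for label in labels
    }
    z = sum(math.exp(score) for score in scores.values())
    return {label: math.exp(score) / z for label, score in scores.items()}


# A feature with no learned weight (here the word feature) contributes 0.
probs = label_probabilities({"Titlecase": 1, "has_(Jenny)": 1})
print(probs)
```

A feature name that never received a weight simply contributes nothing to either score, which is why an unseen word can still be classified from its other features.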

Types of features to include

Your features should not just be the words themselves. The features can represent any property of the word, context, or additional knowledge.

For example, the case of a word is a good predictor for a person's name, so you might want to add a feature to capture whether a given word was lowercase, Titlecase, CamelCase, ALLCAP, etc.

    def features(self, words, previous_label, position):
        features = {}
        """ Baseline Features """
        current_word = words[position]
        features['has_(%s)' % current_word] = 1
        features['prev_label'] = previous_label
        if current_word[0].isupper():
            features['Titlecase'] = 1
        #===== TODO: Add your features here =======#
        #...
        #=============== TODO: Done ================#
        return features


Imagine you saw the word “Jenny”. In addition to the feature for the word itself (as above), you could add a feature to indicate it was in Title case, like:

    if current_word[0].isupper():
        features['Titlecase'] = 1

You might encounter an unknown word in the test set, but if you know it begins with a capital letter then this might be evidence that helps with the correct prediction.

Choosing the correct features is an important part of natural language processing. It is as much art as science: some trial and error is inevitable, but you should see your accuracy increasing as you add new types of features.
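To keep that trial and error honest, it helps to measure each change rather than eyeball it. Here is a minimal sketch that computes precision, recall, and their harmonic mean (F1) for the PERSON label from two parallel lists of labels. precision_recall_f1 is a hypothetical helper, not the evaluation code in run.py, whose numbers may be computed differently.

```python
def precision_recall_f1(gold, predicted, target="PERSON"):
    """Precision, recall, and F1 for one label over parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = ["PERSON", "PERSON", "O", "O", "O"]
pred = ["PERSON", "O", "O", "PERSON", "O"]
print(precision_recall_f1(gold, pred))  # (0.5, 0.5, 0.5)
```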

The name of a feature works just like an ID number. You can assign any name to a feature as long as it is unique. For example, you can use “case=Title” instead of “Titlecase”.
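Following the “case=Title” naming idea, here is one possible sketch that assigns each word exactly one casing category, each with its own unique feature name. The helper case_feature and the exact category boundaries are assumptions for illustration; you may well prefer different categories.

```python
def case_feature(word):
    """Return a single uniquely named feature describing the word's casing."""
    if word.isupper():
        name = "case=ALLCAP"
    elif word.islower():
        name = "case=lower"
    elif word[:1].isupper() and word[1:].islower():
        name = "case=Title"
    elif any(ch.isupper() for ch in word[1:]):
        # An uppercase letter after the first character, e.g. "McDonald".
        name = "case=Camel"
    else:
        # No cased pattern matched, e.g. digits or punctuation.
        name = "case=other"
    return {name: 1}


print(case_feature("Jenny"))       # {'case=Title': 1}
print(case_feature("BRUSSELS"))    # {'case=ALLCAP': 1}
print(case_feature("lamb"))        # {'case=lower': 1}
print(case_feature("McDonald"))    # {'case=Camel': 1}
print(case_feature("1996-08-22"))  # {'case=other': 1}
```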

Running the Program

We have provided you with a training set and a development set. We will be running your programs on an unseen test set, so you should try to make your features as general as possible. Your goal should be to increase F1 on the dev set, which is the harmonic mean of the precision and the recall. You can use three different command flags (‘-t’, ‘-d’, ‘-s’) to train, test, and show respectively. These flags can be used independently or jointly. If you run the program as it is, you should see the following training process:

    $ cd NER
    $ python run.py -t
    Training classifier...
      ==> Training (5 iterations)

      Iteration    Log-Likelihood    Accuracy
      ---------------------------------------
          1          -0.69315
          2          -0.09383
          3          -0.08134
          4          -0.07136
        Final        -0.06330

Afterward, it can print out your score on the dev set.

    $ python run.py -d
    Testing classifier...
    f_score   = 0.8715
    accuracy  = 0.9641
    recall    = 0.7143
    precision = 0.9642

You can also give it an additional flag, -s, and have it show verbose sample results. The first column is the word, the last two columns are your program's prediction of the word’s probability to be PERSON or O. The star ‘*’ indicates the gold result. This should help you do error analysis and properly target your features.

    $ python run.py -s
    Words            P(PERSON)  P(O)
    ----------------------------------------
    EU               0.0544    *0.9456
    rejects          0.0286    *0.9714
    German           0.0544    *0.9456
    call             0.0286    *0.9714
    to               0.0284    *0.9716
    boycott          0.0286    *0.9714
    British          0.0544    *0.9456
    lamb             0.0286    *0.9714
    .                0.0281    *0.9719
    Peter           *0.4059     0.5941
    Blackburn       *0.5057     0.4943
    BRUSSELS         0.4977    *0.5023
    1996-08-22       0.0286    *0.9714
    The              0.0544    *0.9456
    European         0.0544    *0.9456
    Commission       0.0544    *0.9456
    said             0.0258    *0.9742
    on               0.0283    *0.9717
    Thursday         0.0544    *0.9456
    it               0.0286    *0.9714

Where to make your changes?

• Function ‘features()’ in MEM.py
• You can modify the “Customization” part in run.py in order to debug more efficiently and properly. Note that your final submitted model should be trained for at least 20 iterations.

    #====== Customization ======
    BETA = 0.5
    MAX_ITER = 5 # max training iteration
    BOUND = (0, 20) # the desired position bound of samples
    #==========================

• You may need to add a function “predict_sentence( )” in class MEM( ) to output predictions and integrate with your web applications.

Changes beyond these, if you choose to make any, should be done with caution.

Grading

The assignment will be graded based on your code, reports, and, most importantly, final presentation.

Tips

• Start early! This project may take longer than the previous assignments if you are aiming for the perfect score.
• Generalize your features. For example, if you're adding the above "case=Title" feature, think about whether there is any pattern that is not captured by the feature. Would the "case=Title" feature capture "O'Gorman"?
• When you add a new feature, think about whether it would have a positive or negative weight for PERSON and O tags (these are the only tags for this assignment).
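One way to explore the generalization question in the tips above is a coarse word-shape feature, which abstracts away the actual letters and keeps only the pattern of uppercase, lowercase, digits, and punctuation. The sketch below is an assumption for illustration: word_shape is a hypothetical helper, and it only handles ASCII letters and digits.

```python
import re


def word_shape(word):
    """Collapse a word to a coarse shape, e.g. "O'Gorman" -> "X'Xx"."""
    shape = re.sub(r"[A-Z]", "X", word)   # uppercase letters
    shape = re.sub(r"[a-z]", "x", shape)  # lowercase letters ("X" is untouched)
    shape = re.sub(r"[0-9]", "d", shape)  # digits, done last so "d" survives
    # Collapse runs of the same symbol so word length does not matter.
    return re.sub(r"(.)\1+", r"\1", shape)


def shape_feature(word):
    """Wrap the shape in a uniquely named feature."""
    return {"shape=%s" % word_shape(word): 1}


print(shape_feature("O'Gorman"))    # {"shape=X'Xx": 1}
print(shape_feature("Jenny"))       # {'shape=Xx': 1}
print(shape_feature("1996-08-22"))  # {'shape=d-d-d': 1}
```

When inspecting the trained model, you might expect shapes such as Xx or X'Xx to lean towards PERSON and shapes such as d-d-d to lean towards O; checking whether the learned weights agree is a quick sanity test of a new feature.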
