CS4442b/9542b: Artificial Intelligence II, Winter 2017
Assignment 3: Language Models
Due: April 11
1 Submission instructions
- Submit through OWL by 11:55pm on the due date.
- The extra credit in the assignment does not transfer to other assignments.
In a few places, I ask you to discuss results. As long as you say something reasonable (i.e. related to the question), you will get the credit.
- Your report should be in PDF format.
- Make sure your zip file for submission is no more than 10MB.
- VS creates a lot of auxiliary files that may result in a super-big zip. To ensure appropriate size, run the "clean solution" option before submitting your solution folder. If that doesn't help, use a plugin that can create "clean" archives: https://marketplace.visualstudio.com/items?itemName=RonJacobs.CleanProject-CleansVisualStudioSolutionsForUploadi
- We will not charge a late penalty, but you will get extra credit of 5% if you hand in your assignment by April 11.
- Submit your Microsoft Visual Studio workspace in a single zipped file with name A3_<first name>_<last name>.zip. Your workspace should have separate projects corresponding to problems: i.e. P1, P2, P3, P4, P5, P6, P7, plus, if you choose, optional P8. Your workspace can include any additional projects, if needed. The structure of your submission should be as below.
| — <my *.cpp, *.h, *.vcxproj>
| — <your *.cpp, *.h, *.vcxproj for problem #1>
| — <your *.cpp, *.h, *.vcxproj for problem #2>
| — ...
| — <your *.cpp, *.h, *.vcxproj for problem #N>
2 Code Provided
In this assignment, you will work with character and word (string) based language models. To efficiently store and count nGrams, you should use either hash tables or balanced search trees. The C++ standard library has a hash table named unordered_map. I provide you with code that illustrates efficient nGram processing using unordered_map, both in the case of character (char) and word (string) models. If you wish, you can store nGrams using another data structure, but make sure it is an efficient implementation (such as a hash table or a balanced tree).
I also provide code to read from files. You must use this code, in order to make sure parsing is done in a consistent way. Lastly, I provide you with code to sample from a probability distribution (for random sentence generation, problem 3) and code to compute edit distance between strings (for the spell checking application, problem 6).
All code was compiled and tested under MS Visual Studio 2010.
2.1 File Reading Code
Contains code for reading from files. You will need the following functions.
void read_tokens(const std::string & filename, std::vector<std::string> & tokens, bool eos): reads file from filename into a vector of string. If eos = true, then reads the end of sentence marker as a special string EOS.
void read_tokens(const std::string & filename, std::vector<char> & tokens, bool latin_only): reads file from filename into a vector of char. If latin_only = false, it reads all characters from the file. If latin_only = true, it reads only Latin characters and converts upper case to lower case.
- EOS = "<EOS>": special marker for end of sentence.
Illustrates how to build a character model and a string model based on the C++ unordered_map, which is a hash table. I will use the terms hash table and unordered_map interchangeably.
For a word (string) language model, you should have typedef string T. For a character (char) language model, you should have typedef char T.
In both cases, an nGram is represented as a vector<T>. This vector serves as a key into the hash table. I use int as the value associated with the key to give the count of how many times the nGram (vector<T>) occurred in the text data. When I insert a new nGram, I set its count to 1. If I read that nGram again from the text, I update the count by 1.
One should be careful when using the unordered_map built-in function count(key). Despite being called count, it has only two return values: 0, in case key is not in the hash table, and 1 if key is in the hash table. To see how often a key occurs in the unordered_map, that is, the value associated with the key, use the square bracket operator. But also be aware that the square bracket operator will insert a new entry in the hash table if an entry with the given key is not already in the unordered_map.
To illustrate, suppose h is an unordered_map and currently does not have key "abc". If you use h["abc"], it will cause an entry with key "abc" to be inserted into h with some default value in the value field (which is of type int in our case). Thus to check if there is an entry with key "abc", use h.count("abc"). However, remember that if h.count("abc") returns 1, all this means is that an entry with key "abc" is in your hash table h. The actual count of how many times nGram "abc" was read from the file is in the value field associated with key "abc". You access it with h["abc"]. At this point, accessing h["abc"] is fine, since you already know that key "abc" is in your hash table.
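The count() versus operator[] behaviour above can be sketched as follows; addOccurrence is a hypothetical helper name (not part of the provided code), written in modern C++ for brevity:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical helper illustrating the counting pattern described above:
// increment the occurrence count for key, inserting it with count 1 if new.
void addOccurrence(std::unordered_map<std::string, int>& h,
                   const std::string& key) {
    if (h.count(key))   // count() only reports presence: 0 or 1
        h[key] += 1;    // the value field holds the real occurrence count
    else
        h[key] = 1;     // first occurrence of this key
}
```

Note that `h[key] += 1` alone would also work, because operator[] default-constructs the int value to 0 on first access; the explicit branch just mirrors the check-then-update pattern described above.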
class VectorHash: class needed by unordered_map to construct a hash table for vector keys. You just need to include this into your project.
int drawIndex(vector<double> &probs): samples from the probabilities given in the input vector. Checks that the input probabilities approximately add to 1. Use this for the problem of random sentence generation.
size_t uiLevenshteinDistance(const std::string &s1, const std::string &s2): Computes the edit distance between two strings. Use it for the spell checker problem.
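To make the sampling step concrete, here is a sketch in the spirit of drawIndex(); this is not the provided implementation, and drawIndexSketch is a hypothetical name:

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <vector>

// Sample an index i with probability probs[i] by walking the cumulative
// distribution with a uniform random number (inverse-CDF sampling).
int drawIndexSketch(const std::vector<double>& probs) {
    double sum = 0.0;
    for (double p : probs) sum += p;
    assert(std::fabs(sum - 1.0) < 1e-6);  // probabilities should add to ~1

    double r = static_cast<double>(std::rand()) / RAND_MAX;  // uniform in [0,1]
    double cumulative = 0.0;
    for (std::size_t i = 0; i < probs.size(); ++i) {
        cumulative += probs[i];
        if (probs[i] > 0.0 && r <= cumulative)
            return static_cast<int>(i);
    }
    return static_cast<int>(probs.size()) - 1;  // guard against rounding error
}
```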
2.3 Samples of Input and Output
Folder InputOutput has sample input and output for this assignment.
Folder Texts contains text files and language files for this assignment.
Problem 1 (10 %)
This problem investigates word distribution in the English language and whether it is true that it is enough to know 100 words of English to read about 50% of a text. Read string tokens without EOS markers.
(a) Write a program P1 which takes as command line arguments the name of the text file and a value of k. Your program should compute counts of words in the input file and output the k most frequent words together with their counts, one per line, word and its count separated by a space. Your program should also compute what percentage of words in the input file are among the k most frequent words. For example, to output the 4 most frequent words from file "text.txt", your program is invoked as
P1 text.txt 4
The output format should be:
(b) How many words do you need to know to read about 50% of "DostoevskyKaramazov.txt" and "DrSeuss.txt"? In other words, what should k be set to for the most frequent words parameter to get the output from the above program around 50% for these two texts?
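The core of Problem 1 can be sketched as below; topKWords and coverage are hypothetical helper names, and an in-memory token vector stands in for the tokens that read_tokens() would load from file:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Return the k most frequent words with their counts, most frequent first.
std::vector<std::pair<std::string, int>>
topKWords(const std::vector<std::string>& tokens, std::size_t k) {
    std::unordered_map<std::string, int> counts;
    for (const auto& w : tokens) ++counts[w];   // operator[] defaults to 0

    std::vector<std::pair<std::string, int>> sorted(counts.begin(),
                                                    counts.end());
    std::sort(sorted.begin(), sorted.end(),
              [](const std::pair<std::string, int>& a,
                 const std::pair<std::string, int>& b) {
                  return a.second > b.second;   // descending by count
              });
    if (sorted.size() > k) sorted.resize(k);
    return sorted;
}

// Percentage of all tokens covered by the k most frequent words.
double coverage(const std::vector<std::string>& tokens, std::size_t k) {
    int covered = 0;
    for (const auto& p : topKWords(tokens, k)) covered += p.second;
    return 100.0 * covered / tokens.size();
}
```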
Problem 2 (10 %)
This problem investigates language sparseness. Use the word language model. Do not read EOS for this problem.
(a) Write a program P2 that takes as command line arguments the names of two text files, the size n for the nGram, and a last parameter that indicates whether to print out common nGrams or not. The program should count and print to the standard output the percentage of nGrams in the second text file that do not occur in the first text file. If the last input parameter is 1, the program should print out the common nGrams between the two files, one per line. If the last parameter is 0, your program should not print out common nGrams. For example, if we want to count 5-Grams without printing, the program should be executed with:
P2 text1.txt text2.txt 5 0
The output format should be:
If we want to count 6-Grams with the printing option, the program should be executed with:
P2 text1.txt text2.txt 6 1
The output format should be:
he thought much better of it
I can play piano in the
75.001
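The zero-count computation for part (a) can be sketched as follows; zeroCountPercent is a hypothetical helper, and VecHash is a simple stand-in for the provided VectorHash class:

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

struct VecHash {  // stand-in for the provided VectorHash
    std::size_t operator()(const std::vector<std::string>& v) const {
        std::size_t h = 0;
        for (const auto& s : v)
            h = h * 31 + std::hash<std::string>()(s);
        return h;
    }
};

// Percentage of nGrams in text2 that never occur in text1.
double zeroCountPercent(const std::vector<std::string>& text1,
                        const std::vector<std::string>& text2,
                        std::size_t n) {
    std::unordered_set<std::vector<std::string>, VecHash> seen;
    for (std::size_t i = 0; i + n <= text1.size(); ++i)
        seen.insert(std::vector<std::string>(text1.begin() + i,
                                             text1.begin() + i + n));

    std::size_t total = 0, missing = 0;
    for (std::size_t i = 0; i + n <= text2.size(); ++i) {
        ++total;
        std::vector<std::string> g(text2.begin() + i, text2.begin() + i + n);
        if (!seen.count(g)) ++missing;   // zero-count nGram
    }
    return total ? 100.0 * missing / total : 0.0;
}
```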
(b) Take two parts of the same novel, "DostoevskyPart1.txt" (as the first input file) and "DostoevskyPart2.txt" (as the second input file). Use your program to compute and write down the percentages of zero count nGrams for n = 1, 2, 3, …. What is the value of n that gives no common nGrams between the two files? What is (are) the largest common nGram(s) for these files?
(c) Repeat part (b) for different writers, "Dickens.txt" (as first) and "KafkaTrial.txt" (as second).
(d) Repeat part (b) for the "opposite" writers, "MarxEngelsManifest.txt" (as first) and "SmithWealthNations.txt" (as second).
(e) Discuss the difference in results between parts (b), (c), (d).
Problem 3 (15 %)
This problem is about random text generation according to the ML language model. Use the string model and read EOS markers.
Random Sentence Generation
Let vocabulary V be all unique words in the input text file. Let m denote the size of V. Let vocabulary words be indexed with integers in the range from 0 to m − 1. Thus, our vocabulary V = {v0, v1, …, vm−1}. You will generate a sequence of words w1, w2, …, stopping sentence generation when the EOS marker is generated.
Case 1: If n = 1, then each word in a sentence is generated independently (no context). Compute P(v) for all words v ∈ V, according to the ML unigram model. Store the computed probabilities in a vector<double> probs, where probs[i] = P(vi), for i ∈ {0, 1, …, m − 1}. To generate the first word, use the provided function int drawIndex(vector<double> &probs). The output of drawIndex() is the index of the word chosen. For example, if the output is 10, this means that v10 is chosen, and you set the first word in the sentence w1 to v10. Generate all other words in the same way, stopping when EOS is generated.
Case 2: If n > 1, then there is context. For the first word w1, its context (previous word) is the EOS marker. Use ML to estimate P(v|EOS) for all v ∈ V. Store these probabilities in the vector probs and generate the first word w1 using drawIndex(probs). To generate the second word, use ML to estimate P(v|w1) for all v ∈ V, store these probabilities in vector probs and generate the second word w2 using drawIndex(probs). Continue this procedure until you generate the EOS marker.
Note that if n > 2, then as more context becomes available, you should use it. For example, if n = 3, then to generate the third (and fourth, and so on) word, you should use the two previously generated words for context. To be precise, to generate the kth word in the sentence wk, sample from P(v | wk−n+1, …, wk−1), where wk−n+1, …, wk−1 are the n − 1 previously generated words. Make sure to initialize the random seed generator with something like srand(time(NULL)).
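For the n = 1 case, the distribution that drawIndex() samples from is just the ML unigram model; a minimal sketch, with unigramProbs as a hypothetical helper name:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Build the ML unigram distribution: probs[i] = count(vocab[i]) / total,
// so that drawIndex(probs) can sample word indices.
void unigramProbs(const std::vector<std::string>& tokens,
                  std::vector<std::string>& vocab,   // out: v_0 ... v_{m-1}
                  std::vector<double>& probs) {      // out: P(v_i)
    std::unordered_map<std::string, int> counts;
    for (const auto& w : tokens) ++counts[w];

    vocab.clear();
    probs.clear();
    for (const auto& kv : counts) {
        vocab.push_back(kv.first);
        probs.push_back(static_cast<double>(kv.second) / tokens.size());
    }
}
```

Generation then repeatedly calls drawIndex(probs) and maps the returned index back through vocab, stopping once EOS comes out.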
(a) Write a program P3 which takes as command line arguments the name of a text file and the size n of the nGram model. Your program should construct an ML (maximum likelihood) language model from the text file, and generate and print out a random sentence according to the procedure described in class and summarized above. For example, to generate a sentence using a trigram model learned from file text.txt, the program should be invoked with
P3 text.txt 3
(b) Run your program on "KafkaTrial.txt" with n = 1, 2, 3, 4, 5, 6. Discuss your results, paying particular attention to the cases of n = 1 and n = 6. You might want to look inside "KafkaTrial.txt" to interpret your results for n = 6.
(c) Set n = 3 and generate a random sentence from "MarxEngelsManifest.txt". Compare it with the results generated from "KafkaTrial.txt".
(d) Just for fun: Submit the funniest sentence from your program generated with either n = 2 or 3 from any text file you like. I will run a poll for selecting the funniest sentence.
Problem 4 (10 %)
This problem is about the Add-Delta language model. Use the word model and do not read EOS markers.
(a) Write a program P4 that is invoked with:
P4 textModel.txt sentence.txt n delta
that builds an Add-Delta model from the text in file textModel.txt for n-grams with parameter delta. The program should read the sentence in the second file sentence.txt and output its log probability to the standard output.
For example, to build a 5-Gram Add-Delta model from file textModel.txt with delta = 0.1 and estimate the log probability of the sentence in file sentence.txt, use:
P4 textModel.txt sentence.txt 5 0.1
The output should be formatted as
Set the vocabulary size to the number of unique words in file textModel.txt plus one. We add one to account for all the words that occur in file sentence.txt but do not occur in file textModel.txt. Intuitively, this maps all words that are not in our model file to the same word "UNKNOWN".
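The Add-Delta estimate for a single nGram, P = (C(nGram) + delta) / (C(context) + delta · V) with V the vocabulary size defined above, can be sketched as follows (addDeltaLogProb is a hypothetical helper name):

```cpp
#include <cassert>
#include <cmath>

// Add-Delta smoothed log probability of one nGram:
// log of (count(ngram) + delta) / (count(context) + delta * V).
double addDeltaLogProb(int ngramCount, int contextCount,
                       double delta, double vocabSize) {
    double p = (ngramCount + delta) / (contextCount + delta * vocabSize);
    return std::log(p);  // log to avoid underflow over long sentences
}
```

The sentence log probability is then the sum of such terms over all its nGrams; with delta = 0 and an unseen nGram this would be log(0), which is where the -DBL_MAX special case from the implementation notes applies.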
Implementation notes:
- Use the double data type for probabilities to avoid underflow.
- To avoid integer overflows, use double instead of int.
- Be careful if your count variables are integers. With integer division, a count of 2 divided by a count of 5 is 0, but the correct answer is 0.4. Cast integers to double before division.
- The output of your program is log probabilities (to avoid underflow), therefore your output should be a negative number, since log(x) ≤ 0 for x ≤ 1.
- Make sure that your program works for the special case of unigrams (n =1).
- If delta = 0, then the Add-Delta model is equivalent to the ML model and some sentences will have probability 0. In this case, your program should not crash but output the maximum negative number in double precision, which is pre-defined in the compiler as -DBL_MAX.
(b) Run your program and report the output for the following cases:
- P4 txt testFile.txt 1 1
- P4 txt testFile.txt 2 1
- P4 txt testFile.txt 2 0.001
- P4 txt testFile.txt 3 0.001
Problem 5 (15%)
This problem is about the Good-Turing language model. Use the word model and do not read EOS markers.
(a) Write a program P5 that is invoked with:
P5 textModel.txt sentence.txt n threshold
that builds a Good-Turing model from the text in file textModel.txt for n-grams with parameter threshold. The program should read the sentence in the second file sentence.txt and output its log probability to the standard output. Recall that the threshold for Good-Turing is used as follows. If an nGram has rate r < threshold, use the Good-Turing estimate of probability. If r ≥ threshold, use the ML estimate of probabilities.
For example, to build a 4-Gram Good-Turing model from file textModel.txt with threshold = 6 and estimate the log probability of the sentence in file sentence.txt, use:
P5 textModel.txt sentence.txt 4 6
The output should be formatted as
Just as in Problem 4, set vocabulary size to the number of unique words in file textModel.txt plus 1.
- Do not forget to renormalize so that probabilities add up to 1.
- If the user sets the threshold so high that Nr = 0 for some r < threshold, then some estimated GT probabilities will be 0. Before computing probabilities, loop over Nr to check that they are not zero for all r ≤ threshold. You have to do this separately for 1-grams, 2-grams, …, n-grams. If needed, reset the threshold to a lower value so that Nr > 0 for all r < threshold.
- Use GT probabilities for all n-Grams. Namely, if the input n = 3, you will need tri-grams, bi-grams, and unigrams to compute the probability of a sentence. Use GT estimates for all of them.
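The adjusted count at the heart of Good-Turing, r* = (r + 1) · N(r+1) / N(r) for r < threshold, can be sketched as follows (adjustedCount is a hypothetical helper; N[r] plays the role of Nr, the number of distinct nGrams seen exactly r times):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Good-Turing adjusted count: smooth low rates, keep ML for high rates.
double adjustedCount(int r, const std::vector<double>& N, int threshold) {
    if (r >= threshold) return r;        // rate is reliable: use ML estimate
    return (r + 1) * N[r + 1] / N[r];    // Good-Turing: r* = (r+1) N_{r+1}/N_r
}
```

This is exactly why the notes above require Nr > 0 below the threshold (a zero denominator would break the estimate), and why the resulting probabilities must be renormalized to sum to 1.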
(b) Run your program and report the output for the following cases:
- P5 txt testFile.txt 1 1
- P5 txt testFile.txt 2 5
- P5 txt testFile.txt 3 5
Problem 6 (20 %)
In this problem we will use the Add-Delta language model to classify which human language (i.e. English, French, etc.) a given sentence is written in. Use the character based language model that reads all the characters, i.e. latin_only = false. Set the vocabulary size to 256.
Folder Languages contains training and test files for six languages. Each language file has the name corresponding to the language. Training files have endings 1.txt (english1.txt, danish1.txt, etc.), and test files have endings 2.txt (english2.txt, danish2.txt, etc.). Assume that all the text files are stored in the same directory where the program is run, so you do not have to specify their location.
Train the language models separately for each language on the training files, i.e. those ending in 1. Language classification is performed as follows. Given a sentence, compute its probability under the French, English, etc. language models and classify the sentence with the language giving the maximum probability. For this problem, a sentence is a sequence of characters of fixed length senLen, given as an input parameter. You need to parse an input file into consecutive chunks of characters of length senLen, and classify each chunk. If the last chunk is of length less than senLen, it should be omitted from classification.
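The classification decision itself is an argmax over the per-language scores; a minimal sketch (classify is a hypothetical helper taking one log probability per language for a given chunk):

```cpp
#include <cassert>
#include <vector>

// Return the index of the language model assigning the highest log
// probability to the chunk (log is monotone, so the argmax is unchanged).
int classify(const std::vector<double>& logProbs) {
    int best = 0;
    for (std::size_t i = 1; i < logProbs.size(); ++i)
        if (logProbs[i] > logProbs[best])
            best = static_cast<int>(i);
    return best;
}
```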
Your program should output the total error rate (in percent), and the confusion matrix. The total percentage error rate for all languages is the overall number of misclassified sentences in all language files divided by the overall number of sentences in all language files, multiplied by 100 to express it as a percentage.
The confusion matrix is a 6 by 6 array, where C(i, j) is the number of sentences of language i that were classified as language j. That is, the diagonal holds the correct classifications, and the off-diagonal entries the wrong classifications.
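Given the confusion matrix, the total error rate follows directly; errorRate is a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Percentage of misclassified sentences: off-diagonal mass of the
// confusion matrix divided by the total number of sentences, times 100.
double errorRate(const std::vector<std::vector<int>>& C) {
    int total = 0, wrong = 0;
    for (std::size_t i = 0; i < C.size(); ++i)
        for (std::size_t j = 0; j < C[i].size(); ++j) {
            total += C[i][j];
            if (i != j) wrong += C[i][j];  // off-diagonal = misclassified
        }
    return 100.0 * wrong / total;
}
```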
(a) Write a program P6 for language identification. Your program is invoked with
P6 n delta senLength
where n is the size of the nGram, delta is the parameter for Add-Delta, and senLength is the sentence length.
The output of your program should be formatted as:
134 3 0 0 0 1
24 341 1 0 0 0
38 2 85 0 0 0
23 1 0 213 0 0
26 9 1 3 221 0
77 2 0 0 0 50
where the first line is the percentage error rate, and the next six lines are the confusion matrix.
(b) Run your program and report the error rate for the following cases:
- P6 1 0 50
- P6 2 0 50
- P6 3 0 50
(c) Run your program and report the error rate for the following cases:
- P6 1 0.05 50
- P6 2 0.05 50
- P6 3 0.05 50
(d) Run your program and report the error rate for the following cases:
- P6 3 0.05 50
- P6 3 0.005 50
- P6 3 0.0005 50
(e) Compare and discuss the performance between parts (b), (c), (d).
(f) Explore and discuss how classification performance changes with the sentence length by running your program on the following cases:
- P6 2 0.05 10
- P6 2 0.05 50
- P6 2 0.05 100
Problem 7 (20 %)
In this problem you will develop a simple spell-checker.
(a) Write a spelling program P7 that is invoked with:
P7 textTrain.txt textCheck.txt dictionary.txt n t delta model
The input parameters are as follows: the name of the text file for model training, the name of the text file to check spelling in, the name of the text file with the dictionary, n for the nGram model, threshold t for Good-Turing, delta for Add-Delta smoothing, and the model to use. As before, if model = 0 use Good-Turing, and if model = 1 use Add-Delta.
Use word language model without reading EOS markers to read textTrain.txt and dictionary.txt.
File textCheck.txt will contain several sentences (one per line) to check the spelling of. Check each sentence separately. It is convenient to read textCheck.txt with EOS markers to separate it into sentences. Output the result of checking each sentence on a separate line. For example, if textCheck.txt has sentences:
I lke to eat cereal in the morning.
Pla nicely!
The output should be formatted as:
i like to eat cereal in the morning
play nicely
We will assume that there is only one misspelled word per sentence. For each word w in the sentence, find all candidate words in the dictionary with edit distance of ≤ 1 from w, using the function uiLevenshteinDistance(). Let C(w) be the set of such candidate words for w. We also include w in C(w).
As discussed in class, we will consider all possible sentences where one word w is replaced with a word from C(w). This means you should iterate over all words in the sentence, and for each word w iterate over all words in C(w), replacing w with a word from C(w) to generate the next possible sentence.
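Building the candidate set C(w) can be sketched as below; candidates and editDistance are hypothetical names, with editDistance re-implementing the classic dynamic programming that uiLevenshteinDistance() provides:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Classic dynamic-programming edit distance (insert, delete, substitute).
std::size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<std::vector<std::size_t>> d(
        a.size() + 1, std::vector<std::size_t>(b.size() + 1));
    for (std::size_t i = 0; i <= a.size(); ++i) d[i][0] = i;  // delete all of a
    for (std::size_t j = 0; j <= b.size(); ++j) d[0][j] = j;  // insert all of b
    for (std::size_t i = 1; i <= a.size(); ++i)
        for (std::size_t j = 1; j <= b.size(); ++j)
            d[i][j] = std::min({d[i - 1][j] + 1,                       // delete
                                d[i][j - 1] + 1,                       // insert
                                d[i - 1][j - 1] +
                                    (a[i - 1] != b[j - 1] ? 1u : 0u)}); // substitute
    return d[a.size()][b.size()];
}

// C(w): dictionary words within edit distance 1 of w, plus w itself.
std::vector<std::string> candidates(const std::string& w,
                                    const std::vector<std::string>& dict) {
    std::vector<std::string> c = {w};   // w is always its own candidate
    for (const auto& entry : dict)
        if (entry != w && editDistance(w, entry) <= 1)
            c.push_back(entry);
    return c;
}
```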
You will implement a simpler version than what was discussed in class. Instead of implementing the noisy channel model, you just select the sentence with the highest probability according to the language model. Print the best sentence to the standard output. Note that the highest probability sentence can be the unchanged input sentence.
(b) Report and discuss the results of your program for the following cases:
- P7 trainHuge.txt textCheck.txt dictionary.txt 2 3 1 1
- P7 trainHuge.txt textCheck.txt dictionary.txt 2 3 0.1 1
- P7 trainHuge.txt textCheck.txt dictionary.txt 2 3 0.01 1
(c) Report and discuss the results of your program for the following cases:
- P7 trainHuge.txt textCheck.txt dictionary.txt 1 3 0.01 1
- P7 trainHuge.txt textCheck.txt dictionary.txt 2 3 0.01 1
- P7 trainHuge.txt textCheck.txt dictionary.txt 3 3 0.01 1
(d) Report and discuss the results of your program for the following cases:
- P7 trainHuge.txt textCheck.txt dictionary.txt 1 3 1 0
- P7 trainHuge.txt textCheck.txt dictionary.txt 2 3 1 0
- P7 trainHuge.txt textCheck.txt dictionary.txt 3 3 1 0
Extra Credit Problem 8 (20 %)
Implement a program P8 that improves your spelling correction program from Problem 7 in any way. You can build a better language model, implement the noisy channel model discussed in class, or implement a better edit distance between strings. Hand in your improved program, and report/discuss the cases where your new program does something better compared to the old program.