
Use Python to solve this problem.

Several text processing applications require an analysis of the word frequency distribution in a text, i.e., the number of occurrences of each word. For example, various algorithms that compare two documents for similarity require word frequency information about the text they contain. (This is how websites recommend news articles similar to ones you have read previously, for instance.) In this problem, you will first find the frequency distribution of words in a given text file, i.e., the number of times each unique word occurs in the file. You are asked to use a dictionary to store this frequency distribution. You will then compare two documents for similarity using their word frequency distributions. Additional details are provided below:

• For our purposes, a word is simply a consecutive sequence of alphabetic characters. We do not distinguish between upper and lower case (e.g., “The” and “the” are the same word). For example, the text “How old are you and how old is your half-brother?” consists of nine words: “how”, “old”, “are”, “you”, “and”, “is”, “your”, “half”, and “brother”. The words “how” and “old” have a frequency of 2 each, and the remaining words have a frequency of 1 each. Note that we do not include punctuation characters in a word. This is why, in the example above, “half-brother” is treated as two words, “half” and “brother”. It also means that words such as “Sally’s” or “They’re” get treated as two words each (“Sally” and “s”, “They” and “re”), but we’ll live with that. There are various ways to split a word at its punctuation marks. For example, you can use the replace() method for strings to replace each punctuation mark with a space (make sure you do not replace it with an empty string, because that would join hyphenated words into one word).
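As an illustration, here is a minimal tokenizer sketch using the replace() approach just described (the helper name `words_in` is made up for this example):

```python
import string

def words_in(text):
    # Replace every punctuation character with a space, so that
    # hyphenated words and contractions split into separate words.
    for ch in string.punctuation:
        text = text.replace(ch, " ")
    # Lowercasing makes "The" and "the" the same word; split() on
    # whitespace discards the extra spaces we just introduced.
    return text.lower().split()

print(words_in("How old are you and how old is your half-brother?"))
# ['how', 'old', 'are', 'you', 'and', 'how', 'old', 'is', 'your', 'half', 'brother']
```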
In the string module, you can get the string of all punctuation characters via string.punctuation. You may use this if necessary.

• To compare two documents for similarity, we will use a commonly used method called cosine similarity, based on their word frequency distributions. It works as follows. Let w1, w2, w3, ..., wn be the set of n unique words from both documents; some of these words occur in only one document and some occur in both. Let (a1, a2, a3, ..., an) be the frequencies of these words in the first document (so if a word wi does not occur in that document, ai is 0). Let (b1, b2, b3, ..., bn) be their frequencies in the second document (similarly, some of the bi may be 0). Then the cosine similarity between the two documents is defined as:

    (a1·b1 + a2·b2 + a3·b3 + ... + an·bn) / ( sqrt(a1² + a2² + a3² + ... + an²) · sqrt(b1² + b2² + b3² + ... + bn²) )

(For those of you who know about vectors, note that the above expression is simply the cosine of the angle between two n-dimensional vectors.) The expression gives a value between 0 and 1. The closer the value is to 1, the more similar the two documents, and the closer it is to 0, the more dissimilar. Here is an example:

– Suppose the first document contains: starry starry night
– Suppose the second document contains: a clear night is starry
– Then the set of unique words in the documents is starry, night, a, clear, is. The frequencies of these words are (2, 1, 0, 0, 0) in the first document and (1, 1, 1, 1, 1) in the second. The cosine similarity between them is 3/(√5 · √5) = 3/5 = 0.6. We can conclude that the two documents are somewhat similar.

As it turns out, this method has the weakness that some commonly occurring words (such as a, an, the, and, of, to, etc.) can have an outsize influence on determining similarity.
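The formula can be computed directly from two word frequency dictionaries. Here is a sketch that reproduces the "starry" example (the helper name `cosine` and the dictionaries `d1` and `d2` are made up for illustration):

```python
import math

def cosine(freq1, freq2):
    # The union of the keys is the set w1, ..., wn of unique words
    # from both documents.
    words = set(freq1) | set(freq2)
    # get(w, 0) supplies frequency 0 for words missing from a document.
    dot = sum(freq1.get(w, 0) * freq2.get(w, 0) for w in words)
    norm1 = math.sqrt(sum(v * v for v in freq1.values()))
    norm2 = math.sqrt(sum(v * v for v in freq2.values()))
    return dot / (norm1 * norm2)

# First document: "starry starry night"
d1 = {"starry": 2, "night": 1}
# Second document: "a clear night is starry"
d2 = {"a": 1, "clear": 1, "night": 1, "is": 1, "starry": 1}
print(cosine(d1, d2))  # 3 / (sqrt(5) * sqrt(5)), i.e. approximately 0.6
```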
Most documents have a lot of these words, so they should not play much of a role in determining whether two documents are similar or not. There are many algorithms to deal with this issue, but we will ignore it in our program here. Implement the following functions for this problem:

1. A function called freq_dictionary with a single parameter infile, the name of the text file for which we want to compute the word frequency distribution. The function should first open infile for reading. It should then create a dictionary of key:value pairs, where each key is a word occurring in infile and its associated value is the frequency of that word in the file, i.e., the number of times it occurs. Refer to the earlier discussion for our definition of a word. Note that every word occurring in the file should be included in the dictionary (even nonsense words); words that do not occur in the file should not be included. The function must return the dictionary. Remember to close the file prior to the return statement.
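One possible sketch of freq_dictionary, assuming the punctuation-to-space approach described earlier (the demo file name `sample.txt` is invented for illustration):

```python
import string

def freq_dictionary(infile):
    f = open(infile, "r")
    text = f.read()
    f.close()  # close the file before returning, per the problem statement
    # Replace punctuation with spaces so hyphenated words split apart.
    for ch in string.punctuation:
        text = text.replace(ch, " ")
    freq = {}
    for word in text.lower().split():
        # get(word, 0) returns the current count, or 0 the first time
        # we see this word.
        freq[word] = freq.get(word, 0) + 1
    return freq

# Demo: write a small sample file, then count its words.
with open("sample.txt", "w") as f:
    f.write("How old are you and how old is your half-brother?")
print(freq_dictionary("sample.txt"))
```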

2. A function called cosine_similarity with two parameters, docfile1 and docfile2, the names of two text files that we want to compare for similarity. The function should open both files for reading, create their word frequency dictionaries, and return their cosine similarity (a real number). Remember to close both files before returning. Hint: make good use of the get() method for dictionaries to implement this function in a clean and simple way.

On the Canvas page for this assignment, you are provided five sample documents to test your functions: doc1.txt, doc2.txt, doc3.txt, doc4.txt, and doc5.txt. Run your cosine_similarity function on these documents to verify the following: doc3 and doc4 are the most similar (0.876), followed by doc4 and doc5 (0.667), then doc3 and doc5 (0.603), then doc1 and doc2 (0.6).
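A sketch of cosine_similarity following the hint, reusing a freq_dictionary like the one described above (the demo file names `d1.txt` and `d2.txt` are invented for illustration; the Canvas sample documents are not reproduced here):

```python
import math
import string

def freq_dictionary(infile):
    f = open(infile, "r")
    text = f.read()
    f.close()
    for ch in string.punctuation:
        text = text.replace(ch, " ")
    freq = {}
    for word in text.lower().split():
        freq[word] = freq.get(word, 0) + 1
    return freq

def cosine_similarity(docfile1, docfile2):
    freq1 = freq_dictionary(docfile1)
    freq2 = freq_dictionary(docfile2)
    # get(w, 0) supplies 0 for words absent from the other document,
    # which is exactly the ai = 0 / bi = 0 case in the formula.
    words = set(freq1) | set(freq2)
    dot = sum(freq1.get(w, 0) * freq2.get(w, 0) for w in words)
    norm1 = math.sqrt(sum(v * v for v in freq1.values()))
    norm2 = math.sqrt(sum(v * v for v in freq2.values()))
    return dot / (norm1 * norm2)

# Demo with the "starry" example from the problem statement.
with open("d1.txt", "w") as f:
    f.write("starry starry night")
with open("d2.txt", "w") as f:
    f.write("a clear night is starry")
print(cosine_similarity("d1.txt", "d2.txt"))  # approximately 0.6
```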
