Synonym Measurement Through Semantic Similarity Using the SOC-PMI Method

History of the article: Received December 16, 2019; Revised January 16, 2020; Accepted February 24, 2020; Online February 28, 2020. DOI: http://dx.doi.org/10.35671/telematika.v13i1.941

Measuring synonyms is an essential task in measuring word similarity. This task cannot be done syntactically; it requires digging deeper into semantics. Semantic relations can take many forms, such as synonymy, antonymy, hyponymy, homonymy, and polysemy. This research measures synonym values using the Second-Order Co-occurrence Pointwise Mutual Information (SOC-PMI) method. The data used are 30 questions from the TOEFL exam; each question consists of one problem word and four alternative answers. The results show very low accuracy (30%), since only 9 out of 30 predicted answers match the actual synonym. In addition, the LCS method was tested to obtain a character-based similarity score and achieved a higher accuracy of 43.33%. Finally, a hybrid method combining character-based and semantic-based methods can be considered for longer inputs to produce a fairer similarity score.

This study aims to measure the similarity of synonyms by knowing the value of semantic similarity.
Since semantic similarity can be of various types, this research is limited to synonym similarity. In research conducted by (Ullmann, 1964), cited in (Djajasudarma, 1993), synonyms are divided into nine types: 1) synonyms in which one member has a more general meaning; 2) synonyms in which one member has a more intensive element of meaning; 3) synonyms in which one member emphasizes emotive meaning; 4) synonyms in which one member is reproachful or disapproving; 5) synonyms in which one member becomes a specific field term; 6) synonyms in which one member is more widely used in written language; 7) synonyms in which one member is more commonly used in conversational language; 8) synonyms in which one member is used in children's language; 9) synonyms in which one member is usually used in certain regions.
The method used in this study considers semantic similarity in measuring synonym similarity. The method, Second-Order Co-occurrence Pointwise Mutual Information (SOC-PMI), was conceived by (Islam & Inkpen, 2006) and has been used in a variety of natural language processing applications. Even though the method targets semantic similarity in general, this research focuses on synonymy, which is one type of semantic relation.

Semantic Similarity
Humans, with their common sense, can recognize the interrelation of a pair of words in various ways. For humans, it is not difficult to judge that "apple" and "orange" are more related than "apple" and "toothbrush" (Islam & Inkpen, 2006). Semantics can be used in two mechanisms, namely the detection of similarities and the detection of differences (Frawley, 2013). Applications in natural language processing have long used semantic similarity measurements, such as in automatic thesaurus construction (Grefenstette, 1993) (D. Lin, 1998) (Li, Abe, World, & Partnership, 1998), automatic indexing, text annotation and document summarization (C. Lin, Hovy, & Rey, 2003), text classification, word sense disambiguation (Li et al., 1998) (Lesk, 1986) (Yarowsky, 1992), and information extraction and information retrieval (Buckley, Salton, Allan, & Singhal, 1995) (Vechtomova & Robertson, 2014) (Xu & Croft, 2000).

SOC-PMI
The Second Order Co-occurrence Pointwise Mutual Information method, hereafter referred to as the SOC-PMI method, was developed from a predecessor algorithm called PMI-IR. PMI-IR was proposed by (Turney, 2001) and uses the AltaVista Advanced Search query syntax to calculate probabilities. PMI-IR is a simple method intended to recognize synonyms, using Pointwise Mutual Information as written in Equation (1) below:

$$\text{score}(\text{choice}_i) = \frac{p(\text{problem} \,\&\, \text{choice}_i)}{p(\text{choice}_i)} \quad (1)$$

where $\{\text{choice}_1, \text{choice}_2, \dots, \text{choice}_n\}$ represents the alternatives for the problem word, and the probability that the problem word and $\text{choice}_i$ co-occur is denoted by $p(\text{problem} \,\&\, \text{choice}_i)$. Other variations of this equation take into account the closeness of the pair in the document, antonyms, and context.
Through the probability principle of PMI-IR, (Islam & Inkpen, 2006) formulate the Pointwise Mutual Information as shown in Equation (2):

$$f^{pmi}(t_i, w) = \log_2 \frac{f^{b}(t_i, w) \times m}{f^{t}(t_i)\, f^{t}(w)} \quad (2)$$

where $w$ is the targeted word, $f^{t}$ is a type-frequency function, and $f^{b}$ is a bigram frequency function. The total number of tokens in the corpus is represented by $m$. Furthermore, the $f^{\beta}$ functions for words $w_1$ and $w_2$ are defined in Equations (3) and (4):

$$f^{\beta}(w_1, w_2, \beta_1) = \sum_{i=1}^{\beta_1} \left( f^{pmi}(X_i^{w_1}, w_2) \right)^{\gamma} \quad (3)$$

$$f^{\beta}(w_2, w_1, \beta_2) = \sum_{i=1}^{\beta_2} \left( f^{pmi}(X_i^{w_2}, w_1) \right)^{\gamma} \quad (4)$$

where $X^{w_1}$ denotes the words with positive PMI with respect to $w_1$, sorted in descending order of PMI value. Finally, the PMI semantic similarity function between the two words $w_1$ and $w_2$ is shown in Equation (5):

$$Sim(w_1, w_2) = \frac{f^{\beta}(w_1, w_2, \beta_1)}{\beta_1} + \frac{f^{\beta}(w_2, w_1, \beta_2)}{\beta_2} \quad (5)$$

The value $\beta_i$ is related to the number of times the word $w_i$ appears in the corpus and is defined in Equation (6):

$$\beta_i = \frac{\left(\log\left(f^{t}(w_i)\right)\right)^2 \log_2(m)}{\delta}, \quad i = 1, 2 \quad (6)$$

where $\delta$ is a constant; in the research conducted by (Islam & Inkpen, 2006), the value $\delta = 6.5$ is used. The value of $\beta$ depends on the size of the corpus: the smaller the corpus, the smaller the value of $\beta$.
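As a concrete illustration, Equations (2)-(6) can be sketched in Python on a toy corpus. The corpus, the whole-sentence co-occurrence window, and the value $\gamma = 3$ are assumptions made for this demonstration only ($\delta = 6.5$ follows (Islam & Inkpen, 2006)); this is a minimal sketch, not the implementation used in the experiments.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus (an assumption for this demo, not the paper's data).
corpus = [
    "the computer is a machine".split(),
    "a computer is an electronic machine".split(),
    "the machine runs a program".split(),
    "the keyboard connects to the computer".split(),
]

m = sum(len(s) for s in corpus)                  # total tokens in the corpus
f_t = Counter(tok for s in corpus for tok in s)  # type frequency f^t
f_b = Counter()                                  # co-occurrence frequency f^b
for s in corpus:
    for a, b in combinations(set(s), 2):         # window = whole sentence
        f_b[frozenset((a, b))] += 1

def f_pmi(t, w):
    """Equation (2): log2(f^b(t, w) * m / (f^t(t) * f^t(w)))."""
    co = f_b[frozenset((t, w))]
    return math.log2(co * m / (f_t[t] * f_t[w])) if co else 0.0

def soc_pmi(w1, w2, gamma=3.0, delta=6.5):
    """Equations (3)-(6): second-order similarity, symmetric in w1 and w2."""
    sim = 0.0
    for wa, wb in ((w1, w2), (w2, w1)):
        # Equation (6): beta_i = (log f^t(w_i))^2 * log2(m) / delta
        beta = max(1, int((math.log(f_t[wa]) ** 2) * math.log2(m) / delta))
        # top-beta neighbours of wa by first-order PMI (the set X^{wa})
        top = sorted((x for x in f_t if x not in (wa, wb)),
                     key=lambda x: f_pmi(x, wa), reverse=True)[:beta]
        # Equations (3)/(4): sum of the neighbours' PMI with wb, raised to gamma
        f_beta = sum(max(f_pmi(x, wb), 0.0) ** gamma for x in top)
        sim += f_beta / beta                     # Equation (5)
    return sim

print(soc_pmi("computer", "machine"))
```

On this corpus, "computer"-"machine" scores higher than "computer"-"program", since the former pair shares strong second-order neighbours.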

Data
In this research, synonym similarity is obtained from pairs of words using a semantic meaning approach. The data used are the parts of the TOEFL test that deal with synonyms. Data were collected from lessons 1-3 in the TOEFL exercise book written by (Matthiesen, 2017). A total of 30 multiple-choice questions were used, and each question has four alternative answers. The following is an example of the data used: choose the synonym of "appealing." The answer key provided by the book states that the synonym of the word "appealing" is "alluring," where the two words share the meaning of the word "attractive." Besides, these words also have other synonyms, such as "interesting," "enticing," "catchy," and "catching."

Research flow
This research goes through the steps shown in Figure 1. The first step is to collect the data, as described in the previous section. No pre-processing techniques are applied to the data; the similarity is measured directly using the SOC-PMI method. The library obtained from https://github.com/pritishyuvraj/SOC-PMI-Short-Text-Similarity is used to measure the semantic similarity. The library includes at least three algorithms, which together form a hybrid method named Semantic Text Similarity (STS) (Islam & Inkpen, 2008). However, in this study, only the SOC-PMI algorithm was taken and used. The implementation includes the NLTK library and uses WordNet as the dictionary. WordNet is an extensive semantic network in which words and groups of words are connected lexically and conceptually, represented by labeled arcs (Fellbaum, 2006).
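The evaluation flow above can be sketched as follows: for each TOEFL item, score every alternative against the problem word and predict the highest-scoring one. The function names and the alternatives "dull" and "plain" are hypothetical stand-ins (the source lists only "alluring" for this item); the 0.84211 score is the value reported in Table 1, and real runs would call the SOC-PMI scorer from the library instead of the stub.

```python
def predict(question, alternatives, similarity):
    """Pick the alternative with the highest similarity to the question word."""
    return max(alternatives, key=lambda alt: similarity(question, alt))

def similarity(question, alternative):
    # Stub scorer: returns the Table 1 value for the known pair, 0.0 otherwise.
    scores = {("appealing", "alluring"): 0.84211}
    return scores.get((question, alternative), 0.0)

# "dull" and "plain" are invented distractors for this illustration.
print(predict("appealing", ["alluring", "dull", "plain"], similarity))  # alluring
```

The predicted answer is then compared with the answer key to count correct items.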
Furthermore, after the SOC-PMI value of each possible answer is obtained, the method's performance is evaluated by matching the predicted answer against the actual answer. We also apply another method for comparison: a character-based method called Longest Common Subsequence (LCS). We use this method to examine how a string-based approach performs on synonym questions, even though the words compared are clearly different at the character level.
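The LCS comparison can be sketched with the standard dynamic-programming recurrence; normalising the LCS length by the two word lengths, to map it into [0, 1], is an assumption made here for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)        # prev[j] = LCS of a[:i-1] and b[:j]
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, 1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)   # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcs_similarity(a, b):
    # Normalised score (an assumed convention): 2 * LCS / (|a| + |b|)
    return 2 * lcs_length(a, b) / (len(a) + len(b))

print(lcs_length("appealing", "alluring"))  # 5, via the subsequence "aling"
```

For "appealing" and "alluring" the common subsequence is a-l-i-n-g, so lexically different synonyms can still receive a non-trivial character-based score.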

RESULTS AND DISCUSSION
The results and discussion of this paper are presented as follows:

Results
The synonym question data in lessons 1, 2, and 3 were collected, and their semantic similarity was measured. Table 1 illustrates an example of the semantic similarity measurement results using the SOC-PMI method for question Number 5, Lesson 2. Based on the results in Table 1, the word "alluring" has the closest semantic relation to the word "appealing," with a similarity value of 0.84211. The word "alluring" in bold indicates that it is the actual answer. Finally, the whole data set was measured, and the results are shown in Table 2, Table 3, and Table 4. Based on these results, several things must be considered. First, some vocabulary items do not show any semantic relation. Figure 2 illustrates the distribution of word pairs with and without semantic relations.

Figure 2. Total number of word pairs with semantic relations
In the end, the accuracy of the answers predicted by the SOC-PMI method against the actual answers is also measured. Reviewing the SOC-PMI values in Table 2, Table 3, and Table 4, there are 9 correct answers and 21 missed answers, giving an accuracy of 9/30 = 30%. Meanwhile, using the LCS method, we obtain an accuracy of 43.33%. Unfortunately, these results are not satisfying. The discussion section will explain this phenomenon, analyze what factors influence the results, and consider how this method can be used in the future.
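The accuracy figures restate as a simple ratio; the 13-of-30 count for LCS is derived arithmetically from the reported 43.33%, since 13/30 = 43.33%.

```python
def accuracy(correct, total):
    """Fraction of correctly predicted answers, as a percentage."""
    return correct / total * 100

print(round(accuracy(9, 30), 2))   # SOC-PMI: 30.0
print(round(accuracy(13, 30), 2))  # LCS: 43.33
```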

Discussion
In this section, the results obtained are analyzed. First, it should be noted that the SOC-PMI method does not evaluate semantic similarity based on synonymy rules. The SOC-PMI method considers the semantic relation between a pair of words, where that relation can be of any kind: synonymous, antonymous, hyponymic, hypernymic, polysemic, or simply connected within a certain hierarchy. Furthermore, this method takes into account how often a pair of words meet in the same context.
At this point, the frequency with which each word appears in the same context window greatly influences the semantic similarity results. For example, the word pair "computer" and "machine" has a stronger semantic relation (0.94118) than "computer" and "keyboard" (0.82353), "computer" and "portable" (0.76190), and "computer" and "RAM" (0.70000).
A high or low semantic similarity value is determined by the frequency with which the two words occur together in the context window. The completeness of the dictionary also affects the results. In the previous section, there were eleven questions whose alternative answers did not have any semantic relation. This can happen for two reasons. First, the two words do not appear in the corpus at all, or only one of them appears. Second, the two words do exist in the corpus but do not appear in the same context. In either case, the SOC-PMI similarity value cannot be obtained. For pairs such as Netherlands - Holland, computer - keyboard, computer - machine, or mommy - daddy, the SOC-PMI method may provide competitive results, depending on the size of the corpus used. However, if the corpus cannot represent words that are not commonly used, that becomes another problem.
Here, the LCS method may perform better. However, the LCS method ignores the semantic meaning of words because it only considers the presence or absence of characters in the two words being compared. The LCS method can sometimes give a high score even though the characters are reversed. In this case, the idea of combining character-based and semantic-based methods can be considered for longer inputs, i.e., phrases or sentences. In the end, a hybrid method can be considered to produce a fairer similarity score. For example, we can weight the SOC-PMI and LCS scores for each word pair. The word "restore" has the synonym "revitalize," where the SOC-PMI method gives a score of 0.22 and the LCS method gives a score of 0.47. If we give each method a weight of 0.5, we obtain a final similarity score of 0.35.
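The weighted combination above can be written as a simple convex combination of the two scores; the equal weight of 0.5 is the example weighting from the text, not a tuned parameter.

```python
def hybrid_score(soc_pmi, lcs, w=0.5):
    """Convex combination of the semantic and character-based scores."""
    return w * soc_pmi + (1 - w) * lcs

# restore/revitalize: SOC-PMI = 0.22, LCS = 0.47 -> 0.345, i.e. roughly 0.35
print(hybrid_score(0.22, 0.47))
```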
For future NLP work involving word similarity, it can be concluded that the SOC-PMI method is not recommended on its own for specific tasks such as determining word synonymy, as in this case. It would be wise to use the SOC-PMI method together with other methods (forming a new hybrid method). This idea starts from the observation that semantic relations can take any form; thus, also considering syntactic similarity will be wiser and more objective for tasks involving more general similarity.

CONCLUSIONS AND FUTURE WORKS
Finally, conclusions can be drawn from the results and discussion in the previous sections. First, the SOC-PMI method is not suitable for determining specific semantic relations, such as synonymy between words, because semantic relations can be of any type, depending on the frequency with which the two words occur together in the context window of the corpus. Second, the SOC-PMI method may perform well when measuring the semantic similarity of commonly used words, but this does not necessarily apply to TOEFL synonym questions, whose vocabulary lists are sometimes not common. In the end, the SOC-PMI method may work better when combined with other methods.
However, the coverage of WordNet will be a new challenge for other languages.

ACKNOWLEDGEMENT
The authors would like to thank Universitas Amikom Purwokerto for funding this research.