Synonym Measurement Through Semantic Similarity Using the SOC-PMI Method

Uswatun Hasanah; Bambang Pilu Hartato; Mitra Yulianti; Saeful Haq Faruqi

doi:10.35671/telematika.v13i1.941

Abstract

Abstract: Measuring synonymy is an important task in measuring word similarity. It cannot be done on syntax alone; it requires looking at the semantics of the words. Semantic relations between words include synonymy, antonymy, hyponymy, homonymy, and polysemy. This research scores synonym candidates using the Second Order Co-occurrence Pointwise Mutual Information (SOC-PMI) method. The data are 30 questions from the TOEFL exam, each consisting of one question word and four alternative answers. Accuracy is very low (30%): SOC-PMI selects the actual synonym in only 9 of the 30 questions. The LCS method was also tested to obtain a character-based similarity score, and it achieves a higher score of 43.33%. Finally, a hybrid method that combines character-based and semantic-based measures could be considered for longer words to produce a fairer similarity score.
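The evaluation described in the abstract (one question word, four candidate answers, pick the most similar candidate, count correct picks) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes that "LCS" means longest common subsequence and that the score is normalized as the squared LCS length divided by the product of the two word lengths; the abstract specifies neither. The names `lcs_length`, `lcs_similarity`, `pick_answer`, and `accuracy` are hypothetical, and any scorer, such as an SOC-PMI implementation, could be passed in place of `lcs_similarity`.

```python
from typing import Callable, Sequence, Tuple

# One test item: (question word, candidate answers, correct answer).
Item = Tuple[str, Sequence[str], str]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Character-based score in [0, 1].

    Assumption: normalized as len(LCS)^2 / (len(a) * len(b)).
    """
    if not a or not b:
        return 0.0
    return lcs_length(a, b) ** 2 / (len(a) * len(b))


def pick_answer(question: str, options: Sequence[str],
                score: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest similarity to the question word.

    On ties, the first of the tied candidates is returned.
    """
    return max(options, key=lambda option: score(question, option))


def accuracy(items: Sequence[Item],
             score: Callable[[str, str], float]) -> float:
    """Fraction of items for which the top-scoring candidate is the correct one."""
    correct = sum(pick_answer(q, opts, score) == gold for q, opts, gold in items)
    return correct / len(items)
```

Under this scheme, 9 correct picks out of 30 items gives 9 / 30 = 0.30, which matches the 30% reported for SOC-PMI. If the 43.33% reported for LCS is read as an accuracy over the same 30 items, it would correspond to 13 correct picks.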

Keywords

SOC-PMI; Semantic Similarity; Synonym; Corpus-based method

Telematika
ISSN: 2442-4528 (online) | ISSN: 1979-925X (print)
Published by: Universitas Amikom Purwokerto
Jl. Let. Jend. POL SUMARTO Watumas, Purwonegoro - Purwokerto, Indonesia

This work is licensed under a Creative Commons Attribution 4.0 International License.
