Hyperspace Analogue to Language (HAL): An example
Building on the definition of the Hyperspace Analogue to Language (HAL) given earlier, we now work through a simple example of the HAL method.
Let's say we have the sentence “The basic concept of the word association” and use a 5-word moving window. Within the window, a co-occurring word at distance d contributes a weight of 5 − d + 1, so an adjacent word scores 5 and a word four positions away scores 1.
The word co-occurrence frequency matrix is then:
|             | the | basic | concept | of | word | association |
|-------------|-----|-------|---------|----|------|-------------|
| the         | 2   | 3     | 4       | 5  | 0    | 0           |
| basic       | 5   | 0     | 0       | 0  | 0    | 0           |
| concept     | 4   | 5     | 0       | 0  | 0    | 0           |
| of          | 3   | 4     | 5       | 0  | 0    | 0           |
| word        | 5+1 | 2     | 3       | 4  | 0    | 0           |
| association | 4   | 1     | 2       | 3  | 5    | 0           |

Each row records the words that precede the row word. The 5+1 entry in the “word” row under “the” arises because “the” precedes “word” twice: once at distance 1 (weight 5) and once at distance 5 (weight 1).
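As a sketch, the matrix above can be reproduced with a short script (the function name is illustrative; the weighting follows the example, where a neighbour at distance d inside an N-word window contributes N - d + 1):

```python
def hal_matrix(tokens, window=5):
    """Build a HAL co-occurrence matrix: M[row][col] records how strongly
    the column word precedes the row word within the moving window."""
    vocab = sorted(set(tokens), key=tokens.index)  # first-occurrence order
    idx = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    M = [[0] * n for _ in range(n)]
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):  # look back up to `window` words
            j = i - d
            if j < 0:
                break
            M[idx[w]][idx[tokens[j]]] += window - d + 1  # closer = heavier
    return vocab, M

tokens = "the basic concept of the word association".lower().split()
vocab, M = hal_matrix(tokens)
for w, row in zip(vocab, M):
    print(f"{w:12s} {row}")
```

Note that the two occurrences of “the” share a single row and column, which is why the “word” row accumulates 5 + 1 = 6 under “the”.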
Next, a “meaning” vector is created for each word by concatenating its row vector and its column vector, thus capturing both the preceding and the following context of the word.
For example, in the example above, the meaning vector for basic becomes [5, 0, 0, 0, 0, 0, 3, 0, 5, 4, 2, 1].
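The row-plus-column construction can be sketched directly from the table above (the matrix is hard-coded here in the order the, basic, concept, of, word, association; the helper name is illustrative):

```python
# HAL matrix from the table: rows give preceding context, columns following.
M = [
    [2, 3, 4, 5, 0, 0],  # the
    [5, 0, 0, 0, 0, 0],  # basic
    [4, 5, 0, 0, 0, 0],  # concept
    [3, 4, 5, 0, 0, 0],  # of
    [6, 2, 3, 4, 0, 0],  # word (5 + 1 for the two occurrences of "the")
    [4, 1, 2, 3, 5, 0],  # association
]

def meaning_vector(M, i):
    """Concatenate row i (preceding context) with column i (following context)."""
    row = M[i]
    col = [M[r][i] for r in range(len(M))]
    return row + col

print(meaning_vector(M, 1))  # basic -> [5, 0, 0, 0, 0, 0, 3, 0, 5, 4, 2, 1]
```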
The meaning vectors can now be compared. For this we need a measure that is not skewed by differences in vector magnitude; cosine similarity, which depends only on the angle between two vectors, is such a measure.
For example, the similarity between basic and association, 0.302, is smaller than the similarity between basic and concept, 0.688.
Cosine similarity for basic and association:

Let d1 = [5, 0, 0, 0, 0, 0, 3, 0, 5, 4, 2, 1] (basic)
Let d2 = [4, 1, 2, 3, 5, 0, 0, 0, 0, 0, 0, 0] (association)

cosine(d1, d2) = dot(d1, d2) / (||d1|| * ||d2||)
dot(d1, d2) = 5*4 = 20 (every other term pair contains a zero)
||d1|| = sqrt(5^2 + 3^2 + 5^2 + 4^2 + 2^2 + 1^2) = sqrt(80) ≈ 8.9443
||d2|| = sqrt(4^2 + 1^2 + 2^2 + 3^2 + 5^2) = sqrt(55) ≈ 7.4162
cosine(d1, d2) = 20 / (8.9443 * 7.4162) = 20 / 66.3325 ≈ 0.3015
Cosine similarity for basic and concept:

Let d1 = [5, 0, 0, 0, 0, 0, 3, 0, 5, 4, 2, 1] (basic)
Let d2 = [4, 5, 0, 0, 0, 0, 4, 0, 0, 5, 3, 2] (concept)

cosine(d1, d2) = dot(d1, d2) / (||d1|| * ||d2||)
dot(d1, d2) = 5*4 + 3*4 + 4*5 + 2*3 + 1*2 = 20 + 12 + 20 + 6 + 2 = 60
||d1|| = sqrt(5^2 + 3^2 + 5^2 + 4^2 + 2^2 + 1^2) = sqrt(80) ≈ 8.9443
||d2|| = sqrt(4^2 + 5^2 + 4^2 + 5^2 + 3^2 + 2^2) = sqrt(95) ≈ 9.7468
cosine(d1, d2) = 60 / (8.9443 * 9.7468) = 60 / 87.1780 ≈ 0.6882
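The two hand computations above can be checked with a few lines of Python (the meaning vectors are taken verbatim from the worked examples):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

basic       = [5, 0, 0, 0, 0, 0, 3, 0, 5, 4, 2, 1]
association = [4, 1, 2, 3, 5, 0, 0, 0, 0, 0, 0, 0]
concept     = [4, 5, 0, 0, 0, 0, 4, 0, 0, 5, 3, 2]

print(round(cosine(basic, association), 3))  # 0.302
print(round(cosine(basic, concept), 3))      # 0.688
```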
Things to ponder:
- The HAL matrix is direction-sensitive: co-occurrence information for words preceding and words following a given word is recorded separately, in the row and column vectors respectively.
- The quality of HAL vectors is influenced by the window size: the larger the window, the higher the chance of capturing spurious associations between terms. Window sizes of eight and ten have been used in various studies.
References:
- Lund, K. and Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), pp. 203-208.
- Song, D., Bruza, P. and Cole, R. (2004). Concept learning and information inferencing on a high dimensional semantic space, Proceedings ACM SIGIR 2004 Workshop on Mathematical/Formal Methods in Information Retrieval (MF/IR 2004).
- Applied Software Design web: Cosine Similarity Calculator. http://www.appliedsoftwaredesign.com/archives/cosine-similarity-calculator.