Hyperspace Analogue to Language (HAL): An example

Following the definition of the Hyperspace Analogue to Language (HAL) given earlier, here we work through a simple example of how the HAL method is applied.

Let's say we have the sentence “The basic concept of the word association” and a 5-word moving window.

The word co-occurrence frequency matrix will be as follows. Each row records the weighted co-occurrences of that word with the words preceding it: within the 5-word window, a preceding word at distance 1 contributes weight 5, at distance 2 weight 4, and so on down to weight 1.

                the   basic   concept   of   word   association
  the            2      3        4       5     0         0
  basic          5      0        0       0     0         0
  concept        4      5        0       0     0         0
  of             3      4        5       0     0         0
  word          5+1     2        3       4     0         0
  association    4      1        2       3     5         0

The 5+1 entry in the word row arises because “the” precedes “word” twice: once at distance 1 (weight 5) and once at distance 5 (weight 1).
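As an illustration of how such a matrix could be built, here is a minimal Python sketch. The function name hal_matrix and the whitespace tokenization are illustrative choices of mine; the 5-to-1 distance weighting is inferred from the matrix above and from the HAL scheme of Lund and Burgess (1996).

    from collections import defaultdict

    def hal_matrix(tokens, window=5):
        """Build a HAL co-occurrence matrix: matrix[target][context] holds the
        weighted count of times `context` precedes `target` within the window."""
        matrix = defaultdict(lambda: defaultdict(int))
        for i, target in enumerate(tokens):
            for dist in range(1, window + 1):  # look back up to `window` words
                j = i - dist
                if j < 0:
                    break
                # Closer words weigh more: weights 5, 4, 3, 2, 1 for window=5.
                matrix[target][tokens[j]] += window - dist + 1
        return matrix

    tokens = "The basic concept of the word association".lower().split()
    m = hal_matrix(tokens)
    print(m["word"]["the"])  # 6, i.e. the 5+1 entry in the matrix above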

Next, a “meaning” vector is created for each word by concatenating its row vector and its column vector, thus including both the preceding and following contexts of the word.

For example, the meaning vector for basic becomes [5, 0, 0, 0, 0, 0, 3, 0, 5, 4, 2, 1].
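Continuing the sketch above, the concatenation step might look like this (the vocabulary ordering follows the matrix columns; meaning_vector is again an illustrative helper name):

    def meaning_vector(matrix, word, vocab):
        """Row vector (preceding contexts) followed by column vector (following contexts)."""
        row = [matrix[word][w] for w in vocab]  # how often each w precedes `word`
        col = [matrix[w][word] for w in vocab]  # how often `word` precedes each w
        return row + col

    vocab = ["the", "basic", "concept", "of", "word", "association"]
    print(meaning_vector(m, "basic", vocab))
    # [5, 0, 0, 0, 0, 0, 3, 0, 5, 4, 2, 1]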

The meaning vectors can now be compared. For this we need a measure that is insensitive to differences in vector magnitude, so that frequent and infrequent words can be compared fairly. Cosine similarity is such a measure.

For example, the similarity between basic and association, 0.302, is smaller than the similarity between basic and concept, 0.688.

Cosine similarity for basic and association:

Let d1 = [5, 0, 0, 0, 0, 0, 3, 0, 5, 4, 2, 1]
Let d2 = [4, 1, 2, 3, 5, 0, 0, 0, 0, 0, 0, 0]
Cosine Similarity (d1, d2) = dot(d1, d2) / (||d1|| ||d2||)
dot(d1, d2) = (5)*(4) + (0)*(1) + (0)*(2) + (0)*(3) + (0)*(5) + (0)*(0) + (3)*(0) + (0)*(0) + (5)*(0) + (4)*(0) + (2)*(0) + (1)*(0) = 20
||d1|| = sqrt(5^2 + 0^2 + 0^2 + 0^2 + 0^2 + 0^2 + 3^2 + 0^2 + 5^2 + 4^2 + 2^2 + 1^2) = sqrt(80) = 8.94427191
||d2|| = sqrt(4^2 + 1^2 + 2^2 + 3^2 + 5^2 + 0^2 + 0^2 + 0^2 + 0^2 + 0^2 + 0^2 + 0^2) = sqrt(55) = 7.4161984871
Cosine Similarity (d1, d2) = 20 / (8.94427191 * 7.4161984871)
                           = 20 / 66.3324958071
                           = 0.301511344578

Cosine similarity for basic and concept:

Let d1 = [5, 0, 0, 0, 0, 0, 3, 0, 5, 4, 2, 1]
Let d2 = [4, 5, 0, 0, 0, 0, 4, 0, 0, 5, 3, 2]
Cosine Similarity (d1, d2) = dot(d1, d2) / (||d1|| ||d2||)
dot(d1, d2) = (5)*(4) + (0)*(5) + (0)*(0) + (0)*(0) + (0)*(0) + (0)*(0) + (3)*(4) + (0)*(0) + (5)*(0) + (4)*(5) + (2)*(3) + (1)*(2) = 60
||d1|| = sqrt(5^2 + 0^2 + 0^2 + 0^2 + 0^2 + 0^2 + 3^2 + 0^2 + 5^2 + 4^2 + 2^2 + 1^2) = sqrt(80) = 8.94427191
||d2|| = sqrt(4^2 + 5^2 + 0^2 + 0^2 + 0^2 + 0^2 + 4^2 + 0^2 + 0^2 + 5^2 + 3^2 + 2^2) = sqrt(95) = 9.74679434481
Cosine Similarity (d1, d2) = 60 / (8.94427191 * 9.74679434481)
                           = 60 / 87.1779788708
                           = 0.688247201612
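These calculations can be reproduced with a small cosine similarity helper, continuing the Python sketch above:

    import math

    def cosine_similarity(a, b):
        """dot(a, b) / (||a|| * ||b||) for two equal-length vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    d_basic = meaning_vector(m, "basic", vocab)
    print(round(cosine_similarity(d_basic, meaning_vector(m, "association", vocab)), 3))  # 0.302
    print(round(cosine_similarity(d_basic, meaning_vector(m, "concept", vocab)), 3))      # 0.688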

Things to ponder:

  • The HAL matrix is direction-sensitive: the co-occurrence information for the words preceding and following a word is recorded separately, in the row and column vectors respectively.
  • The quality of HAL vectors is influenced by the window size; the larger the window, the higher the chance of representing spurious associations between terms. Window sizes of eight and ten have been used in various studies (the snippet below illustrates the effect).
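In the sketch above, the window size is just a parameter, so the effect of widening it can be inspected directly on this example sentence:

    m8 = hal_matrix(tokens, window=8)
    print(m8["association"]["the"])  # 10 (weights 7 + 3), versus 4 with the 5-word window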

References:

  1. Lund, K. and Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), pp. 203-208.
  2. Song, D., Bruza, P. and Cole, R. (2004). Concept learning and information inferencing on a high dimensional semantic space. Proceedings of the ACM SIGIR 2004 Workshop on Mathematical/Formal Methods in Information Retrieval (MF/IR 2004).
  3. Applied Software Design: Cosine Similarity Calculator. http://www.appliedsoftwaredesign.com/archives/cosine-similarity-calculator.

Written by Amy Nasha

09/10/2012 at 7:56 PM

Posted in Literature, Research
