2017-06-16

## Background

Both FM and FFM are extensions of linear regression (LR). The basic formula for LR is:

$y(w, x) = w_0 + \sum_{i=1}^{n}w_i x_i$

where $n$ is the number of features. As its name implies, it is a simple linear model; conjunctions between features are not considered.
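As a minimal sketch (the weights and feature values below are made up, only to show the shape of the computation), the LR prediction is just a weighted sum plus a bias:

```python
import numpy as np

# Minimal sketch of the LR formula above: y = w0 + sum_i w_i * x_i.
w0 = 0.1
w = np.array([0.3, -0.2, 0.5])   # one weight per feature
x = np.array([1.0, 0.0, 1.0])    # a feature vector with n = 3 features

y = w0 + w @ x
print(y)  # 0.1 + 0.3*1 + (-0.2)*0 + 0.5*1 = 0.9
```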

## FM

FM has additional terms to represent the conjunction between features. For example, the features “country: China” and “holiday: Spring Festival” are likely related: a user from China may be prone to click items related to Spring Festival, while an American user may not. Another such feature pair is “country: US” and “holiday: Thanksgiving”.
With linear regression, if the model is trained on clicked examples such as “country: China; holiday: Spring Festival” and “country: US; holiday: Thanksgiving” (plus other, unrelated examples), it may learn similar weights for “China” and “US”, and similar weights for “Spring Festival” and “Thanksgiving”.
For a new example “country: China; holiday: Thanksgiving”, it would then predict roughly the same click-through rate as for “country: China; holiday: Spring Festival”. Obviously, this is not what we want, so modeling feature conjunctions is necessary.

The formula for FM, again with $n$ features, is:

$y(w, x) = w_0 + \sum_{i=1}^{n}w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n}\langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$

Here we introduce a new parameter matrix $\mathbf{V}$ of size $n \times k$, where $k$ is a small constant, such as 2. $\mathbf{V}$ is formulated as

$\mathbf{V} = \left[ \begin{array}{c} \mathbf{v}_1 \\ \mathbf{v}_2 \\ \vdots \\ \mathbf{v}_i \\ \vdots \\ \mathbf{v}_n \end{array} \right]$

$\mathbf{v}_i$ is the latent vector for feature $i$. We are going to train these latent vectors to learn the latent effects between feature pairs.

The total number of parameters is $1 + n + kn$.

One last thing to note is that the FM interaction term can be rewritten so that training and prediction run in $O(kn)$ time, which makes it quite an efficient algorithm.
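The simplification comes from a standard rewrite of the pairwise term, sketched here for completeness:

$\sum_{i=1}^{n}\sum_{j=i+1}^{n}\langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{n}v_{i,f} x_i\right)^2 - \sum_{i=1}^{n}v_{i,f}^2 x_i^2\right]$

The right-hand side needs only one pass over the $n$ features for each of the $k$ factors, hence $O(kn)$.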

### Example

Now consider the example below, which represents a male user who clicked a Nike ad on Amazon.

| Clicked | gender | ad   | platform |
|---------|--------|------|----------|
| 1       | male   | Nike | Amazon   |

We need to perform one-hot encoding to expand the features: the feature gender is expanded into “gender_male” and “gender_female”, so that “gender: male” becomes “gender_male = 1; gender_female = 0”. This makes the numerical calculations straightforward.
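A minimal sketch of this expansion (the per-field vocabularies below are hypothetical, just enough to encode the example row):

```python
# The raw example row: a male user clicked a Nike ad on Amazon.
raw = {"gender": "male", "ad": "Nike", "platform": "Amazon"}

# Hypothetical vocabulary for each field.
vocab = {
    "gender": ["male", "female"],
    "ad": ["Nike"],
    "platform": ["Amazon"],
}

# One-hot expansion: every (field, value) pair becomes its own binary feature.
x = {f"{field}_{value}": 1.0 if raw[field] == value else 0.0
     for field, values in vocab.items()
     for value in values}

print(x)  # {'gender_male': 1.0, 'gender_female': 0.0, 'ad_Nike': 1.0, 'platform_Amazon': 1.0}
```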

Ignoring the linear (LR) part, the interaction output for this example can be calculated as:

$\langle \mathbf{v}_{male}, \mathbf{v}_{nike} \rangle x_{male} x_{nike} + \langle \mathbf{v}_{male}, \mathbf{v}_{amazon} \rangle x_{male} x_{amazon} + \langle \mathbf{v}_{nike}, \mathbf{v}_{amazon} \rangle x_{nike} x_{amazon}$

where each active $x$ equals 1 after one-hot encoding.

Recall that we learn latent effects between features. In this case, the same vector $\mathbf{v}_{male}$ is used to learn two latent effects, $\langle male, nike \rangle$ and $\langle male, amazon \rangle$. However, these two feature pairs may have quite different latent effects, so sharing one vector may be inappropriate. Therefore, FFM is proposed.
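A small sketch of the FM interaction term for this example (the latent vectors are random stand-ins with $k = 2$, and the feature names are just labels), computed both the naive way and with the $O(kn)$ rewrite:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 2
# Active one-hot features for the example (all x_i = 1).
features = ["gender_male", "ad_nike", "platform_amazon"]
# One latent vector per feature; random values stand in for trained parameters.
V = {f: rng.normal(size=k) for f in features}

# Naive O(n^2) sum over pairs: <v_male, v_nike> + <v_male, v_amazon> + <v_nike, v_amazon>.
naive = sum(V[a] @ V[b]
            for i, a in enumerate(features)
            for b in features[i + 1:])

# Equivalent O(kn) form: 0.5 * sum_f [(sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2].
Vmat = np.stack([V[f] for f in features])        # shape (n, k)
fast = 0.5 * np.sum(np.sum(Vmat, axis=0) ** 2 - np.sum(Vmat ** 2, axis=0))

print(naive, fast)  # the two values agree
```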

## FFM

For FFM, we divide features into $m$ fields, such as country, gender, brand, and so on. Every feature maintains $m$ different latent vectors. For a feature pair, each feature uses the latent vector corresponding to the field of the other feature. Therefore, if we have features $a_1 \in A$, $b_1, b_2 \in B$ and $c_1 \in C$, where $A, B, C$ are fields, training with the pair $(a_1, b_1)$ won’t affect training with the pair $(a_1, c_1)$: the latent effect between a feature pair in fields $(A, B)$ is unrelated to that in fields $(A, C)$.
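To make the lookup concrete, the interaction terms for these two pairs use different latent vectors of $a_1$ (writing $\mathbf{v}_{feature, field}$):

$\phi(a_1, b_1) = \langle \mathbf{v}_{a_1, B}, \mathbf{v}_{b_1, A} \rangle \qquad \phi(a_1, c_1) = \langle \mathbf{v}_{a_1, C}, \mathbf{v}_{c_1, A} \rangle$

Updating $\mathbf{v}_{a_1, B}$ during training leaves $\mathbf{v}_{a_1, C}$ untouched.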

The total number of parameters becomes $1 + n + mkn$. Meanwhile, the interaction term can no longer be simplified as in FM, so training takes $O(kn^2)$ time. Still, the extra cost may be worth it.

Now, for the example mentioned in the FM section, the FFM interaction output can be calculated as

$\langle \mathbf{v}_{male, f_{ad}}, \mathbf{v}_{nike, f_{gender}} \rangle + \langle \mathbf{v}_{male, f_{plat}}, \mathbf{v}_{amazon, f_{gender}} \rangle + \langle \mathbf{v}_{nike, f_{plat}}, \mathbf{v}_{amazon, f_{ad}} \rangle$

where $f_{ad}$ denotes the field of the ad feature. The latent vectors $\mathbf{v}_{male,f_{ad}}$ and $\mathbf{v}_{male,f_{plat}}$ of feature gender_male are kept separate so that the model can learn different latent effects between gender_male and ad_nike versus gender_male and platform_amazon.

The formula for FFM is given as

$y(w, x) = w_0 + \sum_{i=1}^{n}w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n}\langle \mathbf{v}_{i, f_j}, \mathbf{v}_{j, f_i} \rangle x_i x_j$

where $f_i$ and $f_j$ are the fields of features $i$ and $j$, respectively.
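A minimal sketch of this FFM interaction term for the earlier example (random latent vectors as stand-ins; field and feature names are just labels):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 2
fields = ["gender", "ad", "platform"]
# Active one-hot features for the example, paired with their own field (all x_i = 1).
active = [("gender_male", "gender"), ("ad_nike", "ad"), ("platform_amazon", "platform")]

# One latent vector per (feature, field) combination: m vectors per feature.
V = {(feat, fld): rng.normal(size=k) for feat, _ in active for fld in fields}

# Sum of <v_{i, f_j}, v_{j, f_i}> over all active feature pairs.
phi = sum(V[(fi, fj_field)] @ V[(fj, fi_field)]
          for idx, (fi, fi_field) in enumerate(active)
          for (fj, fj_field) in active[idx + 1:])

print(phi)  # <v_{male, f_ad}, v_{nike, f_gender}> + ... as in the example above
```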