「トピックモデルを用いた併売の分析」ではトピックモデルによる併売の分析を試しましたが、今回は gensim の Word2Vec で試してみました。

ソースは http://github.com/fits/try_samples/tree/master/blog/20180617/

はじめに

データセット

これまでは適当に作ったデータセットを使っていましたが、今回は R の Groceries データセット ※ をスペース区切りのテキストファイル（groceries.txt）にして使います。（商品名にスペースを含む場合は代わりに _ を使っています）

 ※ ある食料雑貨店における 30日間の POS データ

groceries.txt

citrus_fruit semi-finished_bread margarine ready_soups
tropical_fruit yogurt coffee
whole_milk
pip_fruit yogurt cream_cheese_ meat_spreads
other_vegetables whole_milk condensed_milk long_life_bakery_product
whole_milk butter yogurt rice abrasive_cleaner
rolls/buns
other_vegetables UHT-milk rolls/buns bottled_beer liquor_(appetizer)
pot_plants
whole_milk cereals
・・・
cooking_chocolate
chicken citrus_fruit other_vegetables butter yogurt frozen_dessert domestic_eggs rolls/buns rum cling_film/bags
semi-finished_bread bottled_water soda bottled_beer
chicken tropical_fruit other_vegetables vinegar shopping_bags

R によるアソシエーションルールの抽出結果

参考のため、まずは R を使って Groceries データセットを apriori で処理しました。

リフト値を優先するため、支持度（supp）と確信度（conf）を低めの値にしています。

groceries_apriori.R

library(arules)
data(Groceries)

params <- list(supp = 0.001, conf = 0.1)

rules <- apriori(Groceries, parameter = params)

inspect(head(sort(rules, by = "lift"), 10))

実行結果

> Rscript groceries_apriori.R

・・・
     lhs                        rhs                         support confidence     lift count
[1]  {bottled beer,                                                                          
      red/blush wine}        => {liquor}                0.001931876  0.3958333 35.71579    19
[2]  {hamburger meat,                                                                        
      soda}                  => {Instant food products} 0.001220132  0.2105263 26.20919    12
[3]  {ham,                                                                                   
      white bread}           => {processed cheese}      0.001931876  0.3800000 22.92822    19
[4]  {root vegetables,                                                                       
      other vegetables,                                                                      
      whole milk,                                                                            
      yogurt}                => {rice}                  0.001321810  0.1688312 22.13939    13
[5]  {bottled beer,                                                                          
      liquor}                => {red/blush wine}        0.001931876  0.4130435 21.49356    19
[6]  {Instant food products,                                                                 
      soda}                  => {hamburger meat}        0.001220132  0.6315789 18.99565    12
[7]  {curd,                                                                                  
      sugar}                 => {flour}                 0.001118454  0.3235294 18.60767    11
[8]  {soda,                                                                                  
      salty snack}           => {popcorn}               0.001220132  0.1304348 18.06797    12
[9]  {sugar,                                                                                 
      baking powder}         => {flour}                 0.001016777  0.3125000 17.97332    10
[10] {processed cheese,                                                                      
      white bread}           => {ham}                   0.001931876  0.4634146 17.80345    19

bottled beer と red/blush wine で liquor が同時に買われやすい、hamburger meat と soda で Instant food products が同時に買われやすいという結果（アソシエーションルール）が出ています。

Word2Vec の適用

それでは groceries.txt を gensim の Word2Vec で処理してみます。とりあえず iter を 500 に min_count を 1 にしてみました。

なお、購入品目の多い POS データを処理する場合は window パラメータを大きめにすべきかもしれません。（今回はデフォルト値の 5）

今回は Jupyter Notebook で実行しています。

Word2Vec モデルの構築

from gensim.models import word2vec

sentences = word2vec.LineSentence('groceries.txt')

model = word2vec.Word2Vec(sentences, iter = 500, min_count = 1)

類似品の算出

まずは、wv.most_similar で類似単語（商品）を抽出してみます。

pork の類似単語

model.wv.most_similar('pork')

[('turkey', 0.5547687411308289),
 ('ham', 0.49448296427726746),
 ('pip_fruit', 0.46879759430885315),
 ('tropical_fruit', 0.4383287727832794),
 ('butter', 0.43373265862464905),
 ('frankfurter', 0.4334157109260559),
 ('root_vegetables', 0.4249211549758911),
 ('citrus_fruit', 0.4246293306350708),
 ('chicken', 0.42378148436546326),
 ('sausage', 0.41153857111930847)]

微妙なものも含んでいますが、それなりの結果になっているような気もします。

most_similar はベクトル的に類似している単語を抽出するため、POS データを処理する場合は競合や代用品の抽出に使えるのではないかと思います。

併売の分析

併売の商品はお互いに類似していないと思うので most_similar は役立ちそうにありませんが、それでも何らかの関係性はありそうな気がします。

そこで、指定した単語群の中心となる単語を抽出する predict_output_word を使えないかと思い、R で抽出したアソシエーションルールの組み合わせで試してみました。

predict_output_word の検証

bottled_beer と red/blush_wine

model.predict_output_word(['bottled_beer', 'red/blush_wine'])

[('liquor', 0.22384332),
 ('prosecco', 0.04933687),
 ('sparkling_wine', 0.0345262),
 ・・・]

R の結果に出てた liquor が先頭（確率が最大）に来ています。

bottled_beer と red/blush_wine

model.predict_output_word(['hamburger_meat', 'soda'])

[('Instant_food_products', 0.054281656),
 ('canned_vegetables', 0.029985178),
 ('pasta', 0.025487985),
 ・・・]

ここでも R の結果に出てた Instant_food_products が先頭に来ています。

ham と white_bread

model.predict_output_word(['ham', 'white_bread'])

[('processed_cheese', 0.20990367),
 ('sweet_spreads', 0.024131883),
 ('spread_cheese', 0.023222428),
 ・・・]

こちらも同様です。

root_vegetables と other_vegetables と whole_milk と yogurt

model.predict_output_word(['root_vegetables', 'other_vegetables', 'whole_milk', 'yogurt'])

[('herbs', 0.024541182),
 ('liver_loaf', 0.019327056),
 ('turkey', 0.01775743),
 ('onions', 0.01760579),
 ('specialty_cheese', 0.014991459),
 ('packaged_fruit/vegetables', 0.014529809),
 ('spread_cheese', 0.012931713),
 ('meat', 0.012434797),
 ('beef', 0.011924307),
 ('butter_milk', 0.011828974)]

R の結果にあった rice はこの中には含まれていません。

curd と sugar

model.predict_output_word(['curd', 'sugar'])

[('flour', 0.076272935),
 ('pudding_powder', 0.055790607),
 ('baking_powder', 0.026003197),
 ・・・]

R の結果に出てた flour （小麦粉）が先頭に来ています。

soda と salty_snack

model.predict_output_word(['soda', 'salty_snack'])

[('popcorn', 0.05830234),
 ('nut_snack', 0.046429735),
 ('chewing_gum', 0.0213278),
 ・・・]

こちらも同様です。

sugar と baking_powder

model.predict_output_word(['sugar', 'baking_powder'])

[('flour', 0.11954326),
 ('cooking_chocolate', 0.046284538),
 ('pudding_powder', 0.03714784),
 ・・・]

こちらも同様です。

以上のように、少なくとも 2品を指定した場合の predict_output_word の結果は R で抽出したアソシエーションルールに合致しているようです。

Word2Vec のパラメータに左右されるのかもしれませんが、この結果を見る限りでは predict_output_word を 3品の併売の組み合わせ抽出に使えるかもしれない事が分かりました。

3品の併売

次に predict_output_word で 2品に対する 1品を確率の高い順に抽出してみました。

なお、ここでは 3品の組み合わせの購入数が 10 未満のものは除外するようにしています。

from collections import Counter
import itertools

# 3品の組み合わせのカウント
tri_counter = Counter([c for ws in sentences for c in itertools.combinations(sorted(ws), 3)])

# 2品の組み合わせを作成
pairs = itertools.combinations(model.wv.vocab.keys(), 2)

sorted([
    (p, item, prob) for p in pairs for item, prob in model.predict_output_word(p)
    if prob >= 0.05 and tri_counter[tuple(sorted([p[0], p[1], item]))] >= 10
], key = lambda x: -x[2])

[(('bottled_beer', 'red/blush_wine'), 'liquor', 0.22384332),
 (('white_bread', 'ham'), 'processed_cheese', 0.20990367),
 (('bottled_beer', 'liquor'), 'red/blush_wine', 0.16274776),
 (('sugar', 'baking_powder'), 'flour', 0.11954326),
 (('curd', 'sugar'), 'flour', 0.076272935),
 (('margarine', 'sugar'), 'flour', 0.07422828),
 (('flour', 'sugar'), 'baking_powder', 0.07345509),
 (('sugar', 'whipped/sour_cream'), 'flour', 0.072731614),
 (('rolls/buns', 'hamburger_meat'), 'Instant_food_products', 0.06818052),
 (('sugar', 'root_vegetables'), 'flour', 0.0641469),
 (('tropical_fruit', 'white_bread'), 'processed_cheese', 0.061861355),
 (('soda', 'ham'), 'processed_cheese', 0.06138085),
 (('white_bread', 'processed_cheese'), 'ham', 0.061199907),
 (('whole_milk', 'ham'), 'processed_cheese', 0.059773713),
 (('beef', 'root_vegetables'), 'herbs', 0.059243686),
 (('sugar', 'whipped/sour_cream'), 'baking_powder', 0.05871357),
 (('soda', 'salty_snack'), 'popcorn', 0.05830234),
 (('soda', 'popcorn'), 'salty_snack', 0.05819882),
 (('red/blush_wine', 'liquor'), 'bottled_beer', 0.057226427),
 (('flour', 'baking_powder'), 'sugar', 0.05517209),
 (('soda', 'hamburger_meat'), 'Instant_food_products', 0.054281656),
 (('processed_cheese', 'ham'), 'white_bread', 0.053193364),
 (('other_vegetables', 'ham'), 'processed_cheese', 0.052585844)]

R で抽出したアソシエーションルールと同じ様な結果が出ており、それなりの結果が出ているように思います。

skip-gram の場合

gensim の Word2Vec はデフォルトで CBoW を使うようですので、skip-gram の場合にどうなるかも簡単に確認してみました。

skip-gram の使用

model = word2vec.Word2Vec(sentences, iter = 500, min_count = 1, sg = 1)

まずは predict_output_word の結果をいくつか見てみます。

先頭（確率が最大のもの）は変わらないようですが、CBoW よりも確率の値が全体的に低くなっているようです。

bottled_beer と red/blush_wine

model.predict_output_word(['bottled_beer', 'red/blush_wine'])

[('liquor', 0.076620705),
 ('prosecco', 0.030791236),
 ('liquor_(appetizer)', 0.027123762),
 ・・・]

hamburger_meat と soda

model.predict_output_word(['hamburger_meat', 'soda'])

[('Instant_food_products', 0.022627866),
 ('pasta', 0.018009944),
 ('canned_vegetables', 0.01685342),
 ・・・]

root_vegetables と other_vegetables と whole_milk と yogurt

model.predict_output_word(['root_vegetables', 'other_vegetables', 'whole_milk', 'yogurt'])

[('herbs', 0.015105391),
 ('turkey', 0.014365919),
 ('rice', 0.01316431),
 ・・・]

ここでは、CBoW で 10番以内に入っていなかった rice が入っています。

次に、先程と同様に predict_output_word で 3品の組み合わせを確率順に抽出してみます。

確率の値が全体的に下がっているため、最小値の条件を 0.02 へ変えています。

predict_output_word を使った 3品の組み合わせ抽出

・・・
sorted([
    (p, item, prob) for p in pairs for item, prob in model.predict_output_word(p)
    if prob >= 0.02 and tri_counter[tuple(sorted([p[0], p[1], item]))] >= 10
], key = lambda x: -x[2])

[(('bottled_beer', 'red/blush_wine'), 'liquor', 0.076620705),
 (('bottled_beer', 'liquor'), 'red/blush_wine', 0.0712179),
 (('white_bread', 'ham'), 'processed_cheese', 0.039820198),
 (('red/blush_wine', 'liquor'), 'bottled_beer', 0.031292748),
 (('sugar', 'baking_powder'), 'flour', 0.030803043),
 (('sugar', 'whipped/sour_cream'), 'flour', 0.029322423),
 (('margarine', 'sugar'), 'flour', 0.027827),
 (('beef', 'root_vegetables'), 'herbs', 0.02740662),
 (('curd', 'sugar'), 'flour', 0.025570681),
 (('flour', 'sugar'), 'baking_powder', 0.025403246),
 (('tropical_fruit', 'root_vegetables'), 'turkey', 0.025329975),
 (('whole_milk', 'ham'), 'processed_cheese', 0.024535457),
 (('rolls/buns', 'hamburger_meat'), 'Instant_food_products', 0.02427808),
 (('flour', 'baking_powder'), 'sugar', 0.023779714),
 (('tropical_fruit', 'white_bread'), 'processed_cheese', 0.023528077),
 (('sugar', 'root_vegetables'), 'flour', 0.023394365),
 (('soda', 'salty_snack'), 'popcorn', 0.02322538),
 (('whole_milk', 'sugar'), 'flour', 0.023202542),
 (('fruit/vegetable_juice', 'ham'), 'processed_cheese', 0.023127634),
 (('butter', 'root_vegetables'), 'herbs', 0.02304014),
 (('soda', 'ham'), 'processed_cheese', 0.022633638),
 (('soda', 'hamburger_meat'), 'Instant_food_products', 0.022627866),
 (('citrus_fruit', 'sugar'), 'flour', 0.022040429),
 (('bottled_beer', 'soda'), 'liquor', 0.02189085),
 (('processed_cheese', 'ham'), 'white_bread', 0.021692872),
 (('yogurt', 'sugar'), 'flour', 0.021522585),
 (('tropical_fruit', 'other_vegetables'), 'turkey', 0.021456005),
 (('other_vegetables', 'beef'), 'herbs', 0.021407435),
 (('white_bread', 'processed_cheese'), 'ham', 0.021362728),
 (('curd', 'root_vegetables'), 'herbs', 0.021005861),
 (('other_vegetables', 'ham'), 'processed_cheese', 0.020965746),
 (('root_vegetables', 'whipped/sour_cream'), 'herbs', 0.020788824),
 (('other_vegetables', 'root_vegetables'), 'herbs', 0.020782541),
 (('sugar', 'whipped/sour_cream'), 'baking_powder', 0.02058014),
 (('whole_milk', 'sugar'), 'rice', 0.020371588),
 (('root_vegetables', 'frozen_vegetables'), 'herbs', 0.02027719),
 (('whole_milk', 'Instant_food_products'), 'hamburger_meat', 0.020258738),
 (('citrus_fruit', 'root_vegetables'), 'herbs', 0.020241175)]

最小値の条件を下げたために、より多くの組み合わせを抽出していますが、CBoW の結果と大きな違いは無さそうです。

なんとなくな Developer のメモ

Word2Vec を用いた併売の分析 - gensim

はじめに

データセット

groceries.txt

R によるアソシエーションルールの抽出結果

groceries_apriori.R

実行結果

Word2Vec の適用

Word2Vec モデルの構築

類似品の算出

pork の類似単語

併売の分析

predict_output_word の検証

bottled_beer と red/blush_wine

bottled_beer と red/blush_wine

ham と white_bread

root_vegetables と other_vegetables と whole_milk と yogurt

curd と sugar

soda と salty_snack

sugar と baking_powder

3品の併売

skip-gram の場合

skip-gram の使用

bottled_beer と red/blush_wine

hamburger_meat と soda

root_vegetables と other_vegetables と whole_milk と yogurt

predict_output_word を使った 3品の組み合わせ抽出