2022-02-20

Installing Kuromoji at docker run time and starting Elasticsearch

Docker Elasticsearch

The official Docker image below ships without the plugin, so Kuromoji cannot be used out of the box.

https://hub.docker.com/_/elasticsearch

So I tried installing Kuromoji at docker run time and then starting Elasticsearch.

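Introduction

Elasticsearch 8.0 has been released, but it requires HTTPS connections and authentication by default, which looked like a bit of a hassle, so I use 7.17.0 here.

Normally you would just docker build a Dockerfile like the one below to create an image with Kuromoji installed, but here I deliberately skip that approach and try installing at docker run time.

Dockerfile example (building an Elasticsearch image with Kuromoji installed)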

FROM elasticsearch:7.17.0

RUN bin/elasticsearch-plugin install analysis-kuromoji

Installing Kuromoji at docker run time

Installing Kuromoji after Elasticsearch has started would require restarting Elasticsearch, so Elasticsearch has to be started after Kuromoji is installed.

For a start, I got this working by running the command bash -c "bin/elasticsearch-plugin install analysis-kuromoji; docker-entrypoint.sh", as shown below.

docker run example 1

$ docker run -d --name sample1 -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.17.0 bash -c "bin/elasticsearch-plugin install analysis-kuromoji; docker-entrypoint.sh"

Verification example

$ curl -s http://localhost:9200/_cat/plugins
・・・ analysis-kuromoji 7.17.0
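
The same check can be scripted. The following is a minimal Python sketch that parses the text returned by _cat/plugins, assuming the default column order (node name, plugin name, version). The installed_plugins helper and the node name node-1 are hypothetical and only serve the illustration.

```python
def installed_plugins(cat_output):
    """Map plugin name to version from the text output of GET /_cat/plugins."""
    plugins = {}
    for line in cat_output.splitlines():
        cols = line.split()
        if len(cols) >= 3:
            # Assumed default columns: node name, plugin name, version
            plugins[cols[1]] = cols[2]
    return plugins


# "node-1" is a made-up node name standing in for the elided first column.
sample = "node-1 analysis-kuromoji 7.17.0\n"
print(installed_plugins(sample))  # {'analysis-kuromoji': '7.17.0'}
```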

However, with this command, every docker start tries to install Kuromoji again and writes an error to the log, which bothers me.

Log output on docker start

$ docker stop sample1
$ docker start sample1
$ docker logs sample1
・・・
-> Failed installing analysis-kuromoji
-> Rolling back analysis-kuromoji
-> Rolled back analysis-kuromoji

ERROR: plugin directory [/usr/share/elasticsearch/plugins/analysis-kuromoji] already exists; if you need to update the plugin, uninstall it first using command 'remove analysis-kuromoji'
・・・

So I worked around the problem by adding a conditional that installs Kuromoji only when no plugins are installed.

docker run example 2 (with the conditional added)

$ docker run -d --name sample2 -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.17.0 bash -c "if [[ -z \$(bin/elasticsearch-plugin list) ]]; then bin/elasticsearch-plugin install analysis-kuromoji; fi; docker-entrypoint.sh"
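
As a side note, the same command can be assembled as an argument list, which is handed to docker without passing through a shell, so the $(...) part needs no shell-specific escaping. The sketch below only builds and prints the command. The build_docker_command helper is hypothetical, and the container name, image, and plugin are simply the values used in this post.

```python
import shlex


def build_docker_command(name, image="elasticsearch:7.17.0", plugin="analysis-kuromoji"):
    """Build the docker run argument list for example 2 (hypothetical helper)."""
    # Install the plugin only when the plugin list is empty,
    # then hand over to the image's entrypoint script.
    script = (
        "if [[ -z $(bin/elasticsearch-plugin list) ]]; "
        f"then bin/elasticsearch-plugin install {plugin}; fi; "
        "docker-entrypoint.sh"
    )
    return [
        "docker", "run", "-d", "--name", name,
        "-p", "9200:9200",
        "-e", "discovery.type=single-node",
        image,
        "bash", "-c", script,
    ]


cmd = build_docker_command("sample2")
# Print a POSIX-shell-quoted form for inspection; to actually run it,
# the list could be passed to subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```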

Note that when running this in Windows PowerShell, I had to use a backtick to escape the $, as shown below.

Running in PowerShell

> docker run -d --name sample2 -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.17.0 bash -c "if [[ -z `$(bin/elasticsearch-plugin list) ]]; then bin/elasticsearch-plugin install analysis-kuromoji; fi; docker-entrypoint.sh"

2022-01-06

Dictionary-based Japanese tokenizers: Kuromoji, Sudachi, Fugashi, Kagome, Lindera

java go python rust javascript kuromoji sudachi fugashi kagome lindera

I wrote and ran code for several Japanese tokenizers that work from a dictionary.

(a) Lucene Kuromoji
(b) atilika Kuromoji
(c) Sudachi
(d) Kuromoji.js
(e) Fugashi
(f) Kagome
(g) Lindera

Each one processes the sentence below and prints the resulting words and their parts of speech.

Input sentence

 WebAssemblyがサーバーレス分野へ大きな影響を与えるだろうと答えた回答者は全体の５６％だった。

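I use only the system dictionary, and where a split mode can be specified I choose the mode that keeps proper nouns and the like intact (no fine-grained splitting).

The source code is at https://github.com/fits/try_samples/tree/master/blog/20220106/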

(a) Lucene Kuromoji

This is the Kuromoji built into Lucene, used by Elasticsearch and Solr.

Judging from kuromoji.gradle, the system dictionary is one of the following two.

IPADIC（mecab-ipadic-2.7.0-20070801）
NAIST jdic（mecab-naist-jdic-0.6.3b-20111013）

a1

This uses JapaneseTokenizer from lucene-analyzers-kuromoji. The dictionary appears to be IPADIC.

lucene/a1.groovy

@Grab('org.apache.lucene:lucene-analyzers-kuromoji:8.11.1')
import org.apache.lucene.analysis.ja.JapaneseTokenizer
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute

def text = args[0]

new JapaneseTokenizer(null, false, Mode.NORMAL).withCloseable { tokenizer ->
    def term = tokenizer.addAttribute(CharTermAttribute)
    def pos = tokenizer.addAttribute(PartOfSpeechAttribute)

    tokenizer.reader = new StringReader(text)
    tokenizer.reset()

    while(tokenizer.incrementToken()) {
        println "term=${term}, partOfSpeech=${pos.partOfSpeech}"
    }
}

a1 result

> groovy a1.groovy "WebAssemblyがサーバーレス分野へ大きな影響を与えるだろうと答えた回答者は全体の５６％だった。"

term=WebAssembly, partOfSpeech=名詞-固有名詞-組織
term=が, partOfSpeech=助詞-格助詞-一般
term=サーバー, partOfSpeech=名詞-一般
term=レス, partOfSpeech=名詞-サ変接続
term=分野, partOfSpeech=名詞-一般
term=へ, partOfSpeech=助詞-格助詞-一般
term=大きな, partOfSpeech=連体詞
term=影響, partOfSpeech=名詞-サ変接続
term=を, partOfSpeech=助詞-格助詞-一般
term=与える, partOfSpeech=動詞-自立
term=だろ, partOfSpeech=助動詞
term=う, partOfSpeech=助動詞
term=と, partOfSpeech=助詞-格助詞-引用
term=答え, partOfSpeech=動詞-自立
term=た, partOfSpeech=助動詞
term=回答, partOfSpeech=名詞-サ変接続
term=者, partOfSpeech=名詞-接尾-一般
term=は, partOfSpeech=助詞-係助詞
term=全体, partOfSpeech=名詞-副詞可能
term=の, partOfSpeech=助詞-連体化
term=５, partOfSpeech=名詞-数
term=６, partOfSpeech=名詞-数
term=％, partOfSpeech=名詞-接尾-助数詞
term=だっ, partOfSpeech=助動詞
term=た, partOfSpeech=助動詞
term=。, partOfSpeech=記号-句点

a2

org.codelibs provides an ipadic-neologd build of the above, so I tried it as well.

The processing is unchanged; only the module and package names differ.

lucene/a2.groovy

@GrabResolver('https://maven.codelibs.org/')
@Grab('org.codelibs:lucene-analyzers-kuromoji-ipadic-neologd:8.2.0-20200120')
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.codelibs.neologd.ipadic.lucene.analysis.ja.JapaneseTokenizer
import org.codelibs.neologd.ipadic.lucene.analysis.ja.JapaneseTokenizer.Mode
import org.codelibs.neologd.ipadic.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute

def text = args[0]

new JapaneseTokenizer(null, false, Mode.NORMAL).withCloseable { tokenizer ->
    ・・・
}

a2 result

> groovy a2.groovy "WebAssemblyがサーバーレス分野へ大きな影響を与えるだろうと答えた回答者は全体の５６％だった。"

term=WebAssembly, partOfSpeech=名詞-固有名詞-組織
term=が, partOfSpeech=助詞-格助詞-一般
term=サーバーレス, partOfSpeech=名詞-固有名詞-一般
term=分野, partOfSpeech=名詞-一般
term=へ, partOfSpeech=助詞-格助詞-一般
term=大きな, partOfSpeech=連体詞
term=影響, partOfSpeech=名詞-サ変接続
term=を, partOfSpeech=助詞-格助詞-一般
term=与える, partOfSpeech=動詞-自立
term=だろ, partOfSpeech=助動詞
term=う, partOfSpeech=助動詞
term=と, partOfSpeech=助詞-格助詞-引用
term=答え, partOfSpeech=動詞-自立
term=た, partOfSpeech=助動詞
term=回答者, partOfSpeech=名詞-固有名詞-一般
term=は, partOfSpeech=助詞-係助詞
term=全体, partOfSpeech=名詞-副詞可能
term=の, partOfSpeech=助詞-連体化
term=５, partOfSpeech=名詞-数
term=６, partOfSpeech=名詞-数
term=％, partOfSpeech=名詞-接尾-助数詞
term=だっ, partOfSpeech=助動詞
term=た, partOfSpeech=助動詞
term=。, partOfSpeech=記号-句点

Unlike a1, this produced "サーバーレス" and "回答者" as single tokens.
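
To make such differences easier to spot, the two segmentations can be compared by character offsets. The following is a small Python sketch, not part of the original sample code: the token lists are copied from the a1 and a2 results above, and the spans and segmentation_diff helpers are hypothetical names.

```python
def spans(tokens):
    """Return the (start, end) character offsets of each token in the original text."""
    offsets, pos = [], 0
    for token in tokens:
        offsets.append((pos, pos + len(token)))
        pos += len(token)
    return offsets


def segmentation_diff(a, b):
    """Return the tokens whose character spans appear in only one segmentation."""
    spans_a, spans_b = spans(a), spans(b)
    set_a, set_b = set(spans_a), set(spans_b)
    only_a = [t for t, s in zip(a, spans_a) if s not in set_b]
    only_b = [t for t, s in zip(b, spans_b) if s not in set_a]
    return only_a, only_b


# Surface forms taken from the a1 and a2 results.
a1 = "WebAssembly が サーバー レス 分野 へ 大きな 影響 を 与える だろ う と 答え た 回答 者 は 全体 の ５ ６ ％ だっ た 。".split()
a2 = "WebAssembly が サーバーレス 分野 へ 大きな 影響 を 与える だろ う と 答え た 回答者 は 全体 の ５ ６ ％ だっ た 。".split()

only_a1, only_a2 = segmentation_diff(a1, a2)
print(only_a1)  # ['サーバー', 'レス', '回答', '者']
print(only_a2)  # ['サーバーレス', '回答者']
```

Comparing spans rather than surface strings keeps repeated tokens such as "た" from being confused with each other.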

(b) atilika Kuromoji

https://github.com/atilika/kuromoji

The Kuromoji that Lucene Kuromoji is based on. Updates seem to have stopped, but it supports a variety of dictionaries.

Here I tried the following two dictionaries.

UniDic（2.1.2）
JUMAN（7.0-20130310）

b1

First, the UniDic version.

kuromoji/b1.groovy

@Grab('com.atilika.kuromoji:kuromoji-unidic:0.9.0')
import com.atilika.kuromoji.unidic.Tokenizer

def text = args[0]
def tokenizer = new Tokenizer()

tokenizer.tokenize(text).each {
    def pos = [
        it.partOfSpeechLevel1,
        it.partOfSpeechLevel2,
        it.partOfSpeechLevel3,
        it.partOfSpeechLevel4
    ]

    println "term=${it.surface}, partOfSpeech=${pos}"
}

b1 result

> groovy b1.groovy "WebAssemblyがサーバーレス分野へ大きな影響を与えるだろうと答えた回答者は全体の５６％だった。"

term=WebAssembly, partOfSpeech=[名詞, 普通名詞, 一般, *]
term=が, partOfSpeech=[助詞, 格助詞, *, *]
term=サーバー, partOfSpeech=[名詞, 普通名詞, 一般, *]
term=レス, partOfSpeech=[名詞, 普通名詞, 一般, *]
term=分野, partOfSpeech=[名詞, 普通名詞, 一般, *]
term=へ, partOfSpeech=[助詞, 格助詞, *, *]
term=大きな, partOfSpeech=[連体詞, *, *, *]
term=影響, partOfSpeech=[名詞, 普通名詞, サ変可能, *]
term=を, partOfSpeech=[助詞, 格助詞, *, *]
term=与える, partOfSpeech=[動詞, 一般, *, *]
term=だろう, partOfSpeech=[助動詞, *, *, *]
term=と, partOfSpeech=[助詞, 格助詞, *, *]
term=答え, partOfSpeech=[動詞, 一般, *, *]
term=た, partOfSpeech=[助動詞, *, *, *]
term=回答, partOfSpeech=[名詞, 普通名詞, サ変可能, *]
term=者, partOfSpeech=[接尾辞, 名詞的, 一般, *]
term=は, partOfSpeech=[助詞, 係助詞, *, *]
term=全体, partOfSpeech=[名詞, 普通名詞, 一般, *]
term=の, partOfSpeech=[助詞, 格助詞, *, *]
term=５, partOfSpeech=[名詞, 数詞, *, *]
term=６, partOfSpeech=[名詞, 数詞, *, *]
term=％, partOfSpeech=[名詞, 普通名詞, 助数詞可能, *]
term=だっ, partOfSpeech=[助動詞, *, *, *]
term=た, partOfSpeech=[助動詞, *, *, *]
term=。, partOfSpeech=[補助記号, 句点, *, *]

"だろう" is notably left unsplit.

b2

The JUMAN dictionary version.

kuromoji/b2.groovy

@Grab('com.atilika.kuromoji:kuromoji-jumandic:0.9.0')
import com.atilika.kuromoji.jumandic.Tokenizer

def text = args[0]
def tokenizer = new Tokenizer()

・・・

b2 result

> groovy b2.groovy "WebAssemblyがサーバーレス分野へ大きな影響を与えるだろうと答えた回答者は全体の５６％だった。"

term=WebAssembly, partOfSpeech=[名詞, 組織名, *, *]
term=が, partOfSpeech=[助詞, 格助詞, *, *]
term=サーバーレス, partOfSpeech=[名詞, 人名, *, *]
term=分野, partOfSpeech=[名詞, 普通名詞, *, *]
term=へ, partOfSpeech=[助詞, 格助詞, *, *]
term=大きな, partOfSpeech=[連体詞, *, *, *]
term=影響, partOfSpeech=[名詞, サ変名詞, *, *]
term=を, partOfSpeech=[助詞, 格助詞, *, *]
term=与える, partOfSpeech=[動詞, *, 母音動詞, 基本形]
term=だろう, partOfSpeech=[助動詞, *, 助動詞だろう型, 基本形]
term=と, partOfSpeech=[助詞, 格助詞, *, *]
term=答えた, partOfSpeech=[動詞, *, 母音動詞, タ形]
term=回答, partOfSpeech=[名詞, サ変名詞, *, *]
term=者, partOfSpeech=[接尾辞, 名詞性名詞接尾辞, *, *]
term=は, partOfSpeech=[助詞, 副助詞, *, *]
term=全体, partOfSpeech=[名詞, 普通名詞, *, *]
term=の, partOfSpeech=[助詞, 接続助詞, *, *]
term=５６, partOfSpeech=[名詞, 数詞, *, *]
term=％, partOfSpeech=[接尾辞, 名詞性名詞助数辞, *, *]
term=だった, partOfSpeech=[判定詞, *, 判定詞, ダ列タ形]
term=。, partOfSpeech=[特殊, 句点, *, *]

Notably, "サーバーレス", "だろう", "答えた", "５６", and "だった" are left unsplit. Oddly, "サーバーレス" is tagged as a personal name (人名).

(c) Sudachi

https://github.com/WorksApplications/Sudachi

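There is also a Rust version, but I used the Java version.

The dictionary is apparently built on UniDic and NEologd with adjustments, and comes in three variants (Small, Core, Full).

I think its appeal is that the dictionary is continuously maintained, so you can use an up-to-date one.

Here I used the default Core dictionary (with the system_core.dic file placed in the current directory).

Core dictionary (20211220 release)

An Elasticsearch plugin, analysis-sudachi, is also available.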

sudachi/c1.groovy

@Grab('com.worksap.nlp:sudachi:0.5.3')
import com.worksap.nlp.sudachi.DictionaryFactory
import com.worksap.nlp.sudachi.Tokenizer

def text = args[0]

new DictionaryFactory().create().withCloseable { dic ->
    def tokenizer = dic.create()
    def ts = tokenizer.tokenize(Tokenizer.SplitMode.C, text)

    ts.each { t ->
        def pos = dic.getPartOfSpeechString(t.partOfSpeechId())

        println "term=${t.surface()}, partOfSpeech=${pos}"
    }
}

c1 result

> groovy c1.groovy "WebAssemblyがサーバーレス分野へ大きな影響を与えるだろうと答えた回答者は全体の５６％だった。"

term=WebAssembly, partOfSpeech=[名詞, 普通名詞, 一般, *, *, *]
term=が, partOfSpeech=[助詞, 格助詞, *, *, *, *]
term=サーバーレス, partOfSpeech=[名詞, 普通名詞, 一般, *, *, *]
term=分野, partOfSpeech=[名詞, 普通名詞, 一般, *, *, *]
term=へ, partOfSpeech=[助詞, 格助詞, *, *, *, *]
term=大きな, partOfSpeech=[連体詞, *, *, *, *, *]
term=影響, partOfSpeech=[名詞, 普通名詞, サ変可能, *, *, *]
term=を, partOfSpeech=[助詞, 格助詞, *, *, *, *]
term=与える, partOfSpeech=[動詞, 一般, *, *, 下一段-ア行, 終止形-一般]
term=だろう, partOfSpeech=[助動詞, *, *, *, 助動詞-ダ, 意志推量形]
term=と, partOfSpeech=[助詞, 格助詞, *, *, *, *]
term=答え, partOfSpeech=[動詞, 一般, *, *, 下一段-ア行, 連用形-一般]
term=た, partOfSpeech=[助動詞, *, *, *, 助動詞-タ, 連体形-一般]
term=回答者, partOfSpeech=[名詞, 普通名詞, 一般, *, *, *]
term=は, partOfSpeech=[助詞, 係助詞, *, *, *, *]
term=全体, partOfSpeech=[名詞, 普通名詞, 一般, *, *, *]
term=の, partOfSpeech=[助詞, 格助詞, *, *, *, *]
term=５６, partOfSpeech=[名詞, 数詞, *, *, *, *]
term=％, partOfSpeech=[名詞, 普通名詞, 助数詞可能, *, *, *]
term=だっ, partOfSpeech=[助動詞, *, *, *, 助動詞-ダ, 連用形-促音便]
term=た, partOfSpeech=[助動詞, *, *, *, 助動詞-タ, 終止形-一般]
term=。, partOfSpeech=[補助記号, 句点, *, *, *, *]

Notably, the output has "サーバーレス", "だろう", "回答者", and "５６" as single tokens.

(d) Kuromoji.js

https://github.com/takuyaa/kuromoji.js/

A JavaScript implementation of Kuromoji.

kuromoji.js/d1.mjs

import kuromoji from 'kuromoji'

const dicPath = 'node_modules/kuromoji/dict'
const text = process.argv[2]

kuromoji.builder({ dicPath }).build((err, tokenizer) => {
    if (err) {
        console.error(err)
        return
    }

    const ts = tokenizer.tokenize(text)

    for (const t of ts) {
        const pos = [t.pos, t.pos_detail_1, t.pos_detail_2, t.pos_detail_3]

        console.log(`term=${t.surface_form}, partOfSpeech=${pos}`)
    }
})

d1 result

> node d1.mjs "WebAssemblyがサーバーレス分野へ大きな影響を与えるだろうと答えた回答者は全体の５６％だった。"

term=WebAssembly, partOfSpeech=名詞,固有名詞,組織,*
term=が, partOfSpeech=助詞,格助詞,一般,*
term=サーバー, partOfSpeech=名詞,一般,*,*
term=レス, partOfSpeech=名詞,サ変接続,*,*
term=分野, partOfSpeech=名詞,一般,*,*
term=へ, partOfSpeech=助詞,格助詞,一般,*
term=大きな, partOfSpeech=連体詞,*,*,*
term=影響, partOfSpeech=名詞,サ変接続,*,*
term=を, partOfSpeech=助詞,格助詞,一般,*
term=与える, partOfSpeech=動詞,自立,*,*
term=だろ, partOfSpeech=助動詞,*,*,*
term=う, partOfSpeech=助動詞,*,*,*
term=と, partOfSpeech=助詞,格助詞,引用,*
term=答え, partOfSpeech=動詞,自立,*,*
term=た, partOfSpeech=助動詞,*,*,*
term=回答, partOfSpeech=名詞,サ変接続,*,*
term=者, partOfSpeech=名詞,接尾,一般,*
term=は, partOfSpeech=助詞,係助詞,*,*
term=全体, partOfSpeech=名詞,副詞可能,*,*
term=の, partOfSpeech=助詞,連体化,*,*
term=５, partOfSpeech=名詞,数,*,*
term=６, partOfSpeech=名詞,数,*,*
term=％, partOfSpeech=名詞,接尾,助数詞,*
term=だっ, partOfSpeech=助動詞,*,*,*
term=た, partOfSpeech=助動詞,*,*,*
term=。, partOfSpeech=記号,句点,*,*

The result is the same as a1.

(e) Fugashi

https://github.com/polm/fugashi

A Python wrapper for MeCab.

The unidic-lite and unidic packages are available as dictionaries, but here I used the JUMAN dictionary.

JUMAN

fugashi/e1.py

from fugashi import GenericTagger
import sys

text = sys.argv[1]

tagger = GenericTagger()

for t in tagger(text):
    pos = t.feature[0:4]
    print(f"term={t.surface}, partOfSpeech={pos}")

e1 result

> python e1.py "WebAssemblyがサーバーレス分野へ大きな影響を与えるだろうと答えた回答者は全体の５６％だった。"

term=WebAssembly, partOfSpeech=('名詞', '組織名', '*', '*')
term=が, partOfSpeech=('助詞', '格助詞', '*', '*')
term=サーバーレス, partOfSpeech=('名詞', '人名', '*', '*')
term=分野, partOfSpeech=('名詞', '普通名詞', '*', '*')
term=へ, partOfSpeech=('助詞', '格助詞', '*', '*')
term=大きな, partOfSpeech=('連体詞', '*', '*', '*')
term=影響, partOfSpeech=('名詞', 'サ変名詞', '*', '*')
term=を, partOfSpeech=('助詞', '格助詞', '*', '*')
term=与える, partOfSpeech=('動詞', '*', '母音動詞', '基本形')
term=だろう, partOfSpeech=('助動詞', '*', '助動詞だろう型', '基本形')
term=と, partOfSpeech=('助詞', '格助詞', '*', '*')
term=答えた, partOfSpeech=('動詞', '*', '母音動詞', 'タ形')
term=回答, partOfSpeech=('名詞', 'サ変名詞', '*', '*')
term=者, partOfSpeech=('接尾辞', '名詞性名詞接尾辞', '*', '*')
term=は, partOfSpeech=('助詞', '副助詞', '*', '*')
term=全体, partOfSpeech=('名詞', '普通名詞', '*', '*')
term=の, partOfSpeech=('助詞', '接続助詞', '*', '*')
term=５６, partOfSpeech=('名詞', '数詞', '*', '*')
term=％, partOfSpeech=('接尾辞', '名詞性名詞助数辞', '*', '*')
term=だった, partOfSpeech=('判定詞', '*', '判定詞', 'ダ列タ形')
term=。, partOfSpeech=('特殊', '句点', '*', '*')

The result is the same as b2, which uses the same dictionary ("サーバーレス" is tagged as a personal name here too).

(f) Kagome

https://github.com/ikawaha/kagome

Here I use the following dictionaries.

IPADIC（mecab-ipadic-2.7.0-20070801）
UniDic（2.1.2）

Note that calling Tokenize appears to apply the Normal split mode.

f1

IPADIC version

kagome/f1.go

package main

import (
    "fmt"
    "os"

    "github.com/ikawaha/kagome-dict/ipa"
    "github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
    text := os.Args[1]

    t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())

    if err != nil {
        panic(err)
    }
    // split mode: Normal
    ts := t.Tokenize(text)

    for _, t := range ts {
        fmt.Printf("term=%s, partOfSpeech=%v\n", t.Surface, t.POS())
    }
}

f1 result

> go run f1.go "WebAssemblyがサーバーレス分野へ大きな影響を与えるだろうと答えた回答者は全体の５６％だった。"

term=WebAssembly, partOfSpeech=[名詞 固有名詞 組織 *]
term=が, partOfSpeech=[助詞 格助詞 一般 *]
term=サーバー, partOfSpeech=[名詞 一般 * *]
term=レス, partOfSpeech=[名詞 サ変接続 * *]
term=分野, partOfSpeech=[名詞 一般 * *]
term=へ, partOfSpeech=[助詞 格助詞 一般 *]
term=大きな, partOfSpeech=[連体詞 * * *]
term=影響, partOfSpeech=[名詞 サ変接続 * *]
term=を, partOfSpeech=[助詞 格助詞 一般 *]
term=与える, partOfSpeech=[動詞 自立 * *]
term=だろ, partOfSpeech=[助動詞 * * *]
term=う, partOfSpeech=[助動詞 * * *]
term=と, partOfSpeech=[助詞 格助詞 引用 *]
term=答え, partOfSpeech=[動詞 自立 * *]
term=た, partOfSpeech=[助動詞 * * *]
term=回答, partOfSpeech=[名詞 サ変接続 * *]
term=者, partOfSpeech=[名詞 接尾 一般 *]
term=は, partOfSpeech=[助詞 係助詞 * *]
term=全体, partOfSpeech=[名詞 副詞可能 * *]
term=の, partOfSpeech=[助詞 連体化 * *]
term=５, partOfSpeech=[名詞 数 * *]
term=６, partOfSpeech=[名詞 数 * *]
term=％, partOfSpeech=[名詞 接尾 助数詞 *]
term=だっ, partOfSpeech=[助動詞 * * *]
term=た, partOfSpeech=[助動詞 * * *]
term=。, partOfSpeech=[記号 句点 * *]

The result is the same as a1, which uses the same dictionary.

f2

UniDic version

kagome/f2.go

package main

import (
    "fmt"
    "os"

    "github.com/ikawaha/kagome-dict/uni"
    "github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
    text := os.Args[1]

    t, err := tokenizer.New(uni.Dict(), tokenizer.OmitBosEos())

    ・・・
}

f2 result

> go run f2.go "WebAssemblyがサーバーレス分野へ大きな影響を与えるだろうと答えた回答者は全体の５６％だった。"

term=WebAssembly, partOfSpeech=[名詞 普通名詞 一般 *]
term=が, partOfSpeech=[助詞 格助詞 * *]
term=サーバー, partOfSpeech=[名詞 普通名詞 一般 *]
term=レス, partOfSpeech=[名詞 普通名詞 一般 *]
term=分野, partOfSpeech=[名詞 普通名詞 一般 *]
term=へ, partOfSpeech=[助詞 格助詞 * *]
term=大きな, partOfSpeech=[連体詞 * * *]
term=影響, partOfSpeech=[名詞 普通名詞 サ変可能 *]
term=を, partOfSpeech=[助詞 格助詞 * *]
term=与える, partOfSpeech=[動詞 一般 * *]
term=だろう, partOfSpeech=[助動詞 * * *]
term=と, partOfSpeech=[助詞 格助詞 * *]
term=答え, partOfSpeech=[動詞 一般 * *]
term=た, partOfSpeech=[助動詞 * * *]
term=回答, partOfSpeech=[名詞 普通名詞 サ変可能 *]
term=者, partOfSpeech=[接尾辞 名詞的 一般 *]
term=は, partOfSpeech=[助詞 係助詞 * *]
term=全体, partOfSpeech=[名詞 普通名詞 一般 *]
term=の, partOfSpeech=[助詞 格助詞 * *]
term=５, partOfSpeech=[名詞 数詞 * *]
term=６, partOfSpeech=[名詞 数詞 * *]
term=％, partOfSpeech=[名詞 普通名詞 助数詞可能 *]
term=だっ, partOfSpeech=[助動詞 * * *]
term=た, partOfSpeech=[助動詞 * * *]
term=。, partOfSpeech=[補助記号 句点 * *]

The result is the same as b1, which uses the same dictionary.

(g) Lindera

https://github.com/lindera-morphology/lindera

It appears to be a fork of kuromoji-rs, and the dictionary is IPADIC.

IPADIC（mecab-ipadic-2.7.0-20070801）

lindera/src/main.rs

use lindera::tokenizer::Tokenizer;
use lindera_core::LinderaResult;

use std::env;

fn main() -> LinderaResult<()> {
    let text = env::args().nth(1).unwrap_or("".to_string());

    let mut tokenizer = Tokenizer::new()?;
    let ts = tokenizer.tokenize(&text)?;

    for t in ts {
        let pos = t.detail.get(0..4).unwrap_or(&t.detail);

        println!("text={}, partOfSpeech={:?}", t.text, pos);
    }

    Ok(())
}

Result

> cargo run "WebAssemblyがサーバーレス分野へ大きな影響を与えるだろうと答えた回答者は全体の５６％だった。"

・・・
text=WebAssembly, partOfSpeech=["UNK"]
text=が, partOfSpeech=["助詞", "格助詞", "一般", "*"]
text=サーバー, partOfSpeech=["名詞", "一般", "*", "*"]
text=レス, partOfSpeech=["名詞", "サ変接続", "*", "*"]
text=分野, partOfSpeech=["名詞", "一般", "*", "*"]
text=へ, partOfSpeech=["助詞", "格助詞", "一般", "*"]
text=大きな, partOfSpeech=["連体詞", "*", "*", "*"]
text=影響, partOfSpeech=["名詞", "サ変接続", "*", "*"]
text=を, partOfSpeech=["助詞", "格助詞", "一般", "*"]
text=与える, partOfSpeech=["動詞", "自立", "*", "*"]
text=だろ, partOfSpeech=["助動詞", "*", "*", "*"]
text=う, partOfSpeech=["助動詞", "*", "*", "*"]
text=と, partOfSpeech=["助詞", "格助詞", "引用", "*"]
text=答え, partOfSpeech=["動詞", "自立", "*", "*"]
text=た, partOfSpeech=["助動詞", "*", "*", "*"]
text=回答, partOfSpeech=["名詞", "サ変接続", "*", "*"]
text=者, partOfSpeech=["名詞", "接尾", "一般", "*"]
text=は, partOfSpeech=["助詞", "係助詞", "*", "*"]
text=全体, partOfSpeech=["名詞", "副詞可能", "*", "*"]
text=の, partOfSpeech=["助詞", "連体化", "*", "*"]
text=５, partOfSpeech=["名詞", "数", "*", "*"]
text=６, partOfSpeech=["名詞", "数", "*", "*"]
text=％, partOfSpeech=["名詞", "接尾", "助数詞", "*"]
text=だっ, partOfSpeech=["助動詞", "*", "*", "*"]
text=た, partOfSpeech=["助動詞", "*", "*", "*"]
text=。, partOfSpeech=["記号", "句点", "*", "*"]

The result is largely the same as a1, except that "WebAssembly" is not tagged as a noun (it comes out as UNK).
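
Since the samples print slightly different line formats (term= in most, text= in the Lindera one), a small parser helps when comparing the outputs side by side. This is a minimal Python sketch under that assumption; the parse_tokens helper is hypothetical and the part-of-speech text is kept as a raw string because its layout differs per tool.

```python
import re

# Matches lines such as "term=が, partOfSpeech=助詞-格助詞-一般"
# and "text=WebAssembly, partOfSpeech=["UNK"]".
LINE = re.compile(r"^(?:term|text)=(.*?), partOfSpeech=(.*)$")


def parse_tokens(output):
    """Return (surface, raw part-of-speech text) pairs, skipping unrelated lines."""
    tokens = []
    for line in output.splitlines():
        m = LINE.match(line.strip())
        if m:
            tokens.append((m.group(1), m.group(2)))
    return tokens


sample = """\
・・・
text=WebAssembly, partOfSpeech=["UNK"]
term=サーバー, partOfSpeech=名詞-一般
"""
print(parse_tokens(sample))
```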

2021-12-25

Identifying non-native partitions in MySQL 5.7

MySQL

When upgrading to MySQL 5.7, if the partitions are not updated, a warning like the following is emitted whenever the affected table is queried.

Deprecated partition warning

The partition engine, used by table 'table_name', is deprecated and
will be removed in a future release. Please use native partitioning instead.

This by itself is resolved by running the following to update the table to InnoDB native partitioning.

Updating to native partitioning

ALTER TABLE table_name ENGINE = INNODB;

I wondered how to tell native and non-native partitions apart, so I looked through the information_schema tables.

The PARTITIONS table and the like showed no particular difference, but the FILE_FORMAT and ROW_FORMAT values in the INNODB_SYS_TABLES table did.

INNODB_SYS_TABLES contents

Partitioning	FILE_FORMAT	ROW_FORMAT
Non-native	Antelope	Compact
Native	Barracuda	Dynamic

Note that in the INNODB_SYS_TABLES table, rows for partitions had NAME values of the form schema_name/table_name#P#partition_name[#SP#subpartition_name].

Updating the partitions with ALTER TABLE ・・・ changed the values from Antelope and Compact to Barracuda and Dynamic, so for now this looks like a workable way to tell them apart.
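
As an illustration of how rows fetched from INNODB_SYS_TABLES could be interpreted, here is a minimal Python sketch. It assumes the NAME layout described above and applies only the FILE_FORMAT/ROW_FORMAT observation from this post, so it is a heuristic rather than an official check (a table explicitly created with a Compact row format would look the same). The function names and the sample names such as sampledb/orders are made up, and the case-insensitive match is my assumption to tolerate lowercase separators.

```python
import re

# NAME layout from this post: schema/table#P#partition[#SP#subpartition]
NAME_PATTERN = re.compile(
    r"^(?P<schema>[^/]+)/(?P<table>.+?)#P#(?P<partition>.+?)"
    r"(?:#SP#(?P<subpartition>.+))?$",
    re.IGNORECASE,
)


def parse_partition_name(name):
    """Split a NAME value into its parts, or return None for non-partition rows."""
    m = NAME_PATTERN.match(name)
    return m.groupdict() if m else None


def looks_native(file_format, row_format):
    """Heuristic from this post: Barracuda/Dynamic was seen after the ALTER TABLE."""
    return (file_format, row_format) == ("Barracuda", "Dynamic")


# Hypothetical rows: (NAME, FILE_FORMAT, ROW_FORMAT)
rows = [
    ("sampledb/orders#P#p2020", "Antelope", "Compact"),
    ("sampledb/orders#P#p2021#SP#p2021sp0", "Barracuda", "Dynamic"),
]

for name, file_format, row_format in rows:
    print(parse_partition_name(name), looks_native(file_format, row_format))
```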

2021-11-27

Aggregating nested fields in Elasticsearch

Elasticsearch TypeScript Deno

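I tried aggregating over two levels of nested fields in Elasticsearch.

The sample source code is at http://github.com/fits/try_samples/tree/master/blog/20211127/

Introduction

I aggregate by category over documents that have two levels of categories (categories and children are nested fields), like the one below.

Example document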

{
  "name": "item-1",
  "categories": [
    {
      "code": "B",
      "name": "categoryB",
      "children": [
        {
          "code": "U",
          "name": "subcategoryBU"
        }
      ]
    }
  ]
}

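First, I registered the mapping definition and the documents with the script below.

By default, Elasticsearch maps a string that is not a date to a text field with a keyword sub-field named keyword, but here I configured dynamic_templates so that strings are mapped to the keyword type.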

init.ts

import { send } from './es_util.ts'

const index = 'sample'
const baseUrl = 'http://localhost:9200'
const indexUrl = `${baseUrl}/${index}`

const N = 20
const cts1 = ['A', 'B', 'C']
const cts2 = ['S', 'T', 'U']
// Check whether the index exists
const { status } = await fetch(indexUrl, { method: 'HEAD' })

if (status == 200) {
  // Delete the index
  await fetch(indexUrl, { method: 'DELETE' })
}
// Mapping definition
const { body } = await send(indexUrl, 'PUT', {
  mappings: {
    dynamic_templates: [
      {
        // Dynamically map strings to the keyword type
        string_keyword: {
          match_mapping_type: 'string',
          mapping: {
            type: 'keyword'
          }
        }
      }
    ],
    properties: {
      categories: {
        type: 'nested',
        properties: {
          children: {
            type: 'nested'
          }
        }
      }
    }
  }
})

console.log(body)

const selectCategory = (cts: Array<string>) => 
  cts[Math.floor(Math.random() * cts.length)]

// Register documents
for (let i = 0; i < N; i++) {
  const ct1 = selectCategory(cts1)
  const ct2 = selectCategory(cts2)

  const { body: r } = await send(`${indexUrl}/_doc`, 'POST', {
    name: `item-${i + 1}`,
    categories: [
      {
        code: ct1,
        name: `category${ct1}`,
        children: [
          {
            code: ct2,
            name: `subcategory${ct1}${ct2}`
          }
        ]
      }
    ]
  })

  console.log(`result: ${r.result}, _id: ${r._id}`)
}

es_util.ts

export const send = async (url: string, method: string, body: any) => {
  const res = await fetch(url, {
    method,
    headers: {
        "Content-Type": "application/json"
    },
    body: JSON.stringify(body)
  })

  const resBody = await res.json()

  return {
    ok: res.ok,
    body: resBody
  }
}

Run it with Deno.

Execution example

> deno run --allow-net init.ts

{ acknowledged: true, shards_acknowledged: true, index: "sample" }
result: created, _id: 632STX0BXMPSr8jZ-Q7n
・・・

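(a) Aggregation with terms

I group by the categories.code value and the children.code values belonging to it, and count the matching documents.

To group by the values of a particular field, use terms. Incidentally, a text field apparently cannot be specified for terms.field as-is. (*1)

Also, when aggregating on a nested field, nested.path had to be set. (*2)

Furthermore, to count documents while preserving the parent-child relationship between categories and children, the aggs have to be nested. (*3)

(*1) Because dynamic_templates is configured here,
     the code and name fields are keyword fields, but
     with Elasticsearch's default dynamic mapping
     the keyword sub-field, such as 'categories.code.keyword',
     would be set in terms.field instead

(*2) Without nested.path,
     setting terms.field to categories.code produced an aggregation result of 0

(*3) Without nesting,
     the categories.code and children.code values are counted
     independently, ignoring the parent-child relationship (see sample0.ts in the source)

By default the buckets appear to be sorted by document count in descending order, so I specify order to sort them by the value of the field given in terms.field.

The names categories, count_items, and children directly under aggs can be anything you like.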
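
The nesting rule described above (a nested aggregation wrapping a terms aggregation, which in turn wraps the next level) can be expressed as a small recursive builder. This is only a Python sketch of the request body shape, not the code used in this post: the nested_terms_aggs helper is hypothetical, and it sorts with _key, the documented name for ordering buckets by their key.

```python
import json


def nested_terms_aggs(levels, count_name="count_items"):
    """Build nested/terms aggregations from (agg_name, nested_path, field) tuples.

    levels is ordered from the outermost nested field to the innermost one.
    """
    if not levels:
        return {}
    name, path, field = levels[0]
    terms_agg = {"terms": {"field": field, "order": {"_key": "asc"}}}
    # Deeper levels go inside the terms aggregation so that the
    # parent-child relationship is kept in the buckets.
    child = nested_terms_aggs(levels[1:], count_name)
    if child:
        terms_agg["aggs"] = child
    return {name: {"nested": {"path": path}, "aggs": {count_name: terms_agg}}}


body = {
    "size": 0,
    "aggs": nested_terms_aggs([
        ("categories", "categories", "categories.code"),
        ("children", "categories.children", "categories.children.code"),
    ]),
}
print(json.dumps(body, indent=2))
```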

sample1.ts (first half)

・・・

const { body } = await send(`${indexUrl}/_search`, 'POST', {
  size: 0,
  aggs: {
    categories: {
      nested: { path: 'categories' },
      aggs: {
        count_items: {
          terms: {
            field: 'categories.code',
            order: { _key: 'asc' }
          },
          aggs: {
            children: {
              nested: { path: 'categories.children' },
              aggs: {
                count_items: {
                  terms: {
                    field: 'categories.children.code',
                    order: { _key: 'asc' }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
})

console.log(JSON.stringify(body, null, 2))

console.log('----------')
・・・

The output of this part is as follows.

(a) Output (first half)

> deno run --allow-net sample1.ts

{
  ・・・
  "hits": {
    "total": {
      "value": 20,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "categories": {
      "doc_count": 20,
      "count_items": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "A",
            "doc_count": 4,
            "children": {
              "doc_count": 4,
              "count_items": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 0,
                "buckets": [
                  {
                    "key": "T",
                    "doc_count": 3
                  },
                  {
                    "key": "U",
                    "doc_count": 1
                  }
                ]
              }
            }
          },
          ・・・
        ]
      }
    }
  }
}

Next, I trim this down to just the content that is needed.

sample1.ts (second half)

・・・
console.log('----------')

const fieldName = (rs: any) => 
  Object.keys(rs).find(k => !['key', 'doc_count'].includes(k))

type Bucket = { key: string, doc_count: number }
// Reshape the aggregation result
const toDoc = (rs: any) => {
  const k1 = fieldName(rs)

  if (!k1) {
    return {}
  }

  const k2 = fieldName(rs[k1])!

  const bs = rs[k1][k2].buckets.map((b: Bucket) => 
    Object.assign(
      {
        code: b.key,
        [k2]: b.doc_count
      },
      toDoc(b)
    )
  )

  return {
    [k1]: bs
  }
}

const res = toDoc(body.aggregations)

console.log(JSON.stringify(res, null, 2))

The output is as follows.

(a) Output (second half)

> deno run --allow-net sample1.ts

・・・
----------
{
  "categories": [
    {
      "code": "A",
      "count_items": 4,
      "children": [
        {
          "code": "T",
          "count_items": 3
        },
        {
          "code": "U",
          "count_items": 1
        }
      ]
    },
    {
      "code": "B",
      "count_items": 10,
      "children": [
        {
          "code": "S",
          "count_items": 2
        },
        {
          "code": "T",
          "count_items": 4
        },
        {
          "code": "U",
          "count_items": 4
        }
      ]
    },
    {
      "code": "C",
      "count_items": 6,
      "children": [
        {
          "code": "S",
          "count_items": 2
        },
        {
          "code": "T",
          "count_items": 3
        },
        {
          "code": "U",
          "count_items": 1
        }
      ]
    }
  ]
}
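
As a quick sanity check on this result, the child counts can be compared with the parent counts. In this sample data every document has exactly one category with exactly one child, so the child counts under a category should add up to that category's count. The Python sketch below uses the values shown above; the flatten and child_counts_add_up helpers are hypothetical.

```python
def flatten(result):
    """Turn the reshaped result into (category code, child code, count) rows."""
    rows = []
    for cat in result["categories"]:
        for child in cat.get("children", []):
            rows.append((cat["code"], child["code"], child["count_items"]))
    return rows


def child_counts_add_up(result):
    """True when each category count equals the sum of its child counts."""
    return all(
        cat["count_items"] == sum(c["count_items"] for c in cat.get("children", []))
        for cat in result["categories"]
    )


# Values copied from the output above.
result = {
    "categories": [
        {"code": "A", "count_items": 4, "children": [
            {"code": "T", "count_items": 3},
            {"code": "U", "count_items": 1}]},
        {"code": "B", "count_items": 10, "children": [
            {"code": "S", "count_items": 2},
            {"code": "T", "count_items": 4},
            {"code": "U", "count_items": 4}]},
        {"code": "C", "count_items": 6, "children": [
            {"code": "S", "count_items": 2},
            {"code": "T", "count_items": 3},
            {"code": "U", "count_items": 1}]},
    ]
}

for row in flatten(result):
    print(row)
print(child_counts_add_up(result))  # True
```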

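(b) Aggregation with multi_terms

Aggregating with terms only gave me the code values, so here I use multi_terms to get the categories.name and children.name values at the same time.

When multiple fields are specified in multi_terms.terms, buckets appear to be grouped by the combination of the field values (the key_as_string value).

The key in the aggregation result is an array, so the code and name values can be taken from it.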

sample2.ts

・・・

const { body } = await send(`${indexUrl}/_search`, 'POST', {
  size: 0,
  aggs: {
    categories: {
      nested: { path: 'categories' },
      aggs: {
        count_items: {
          multi_terms: {
            terms: [
              { field: 'categories.code' },
              { field: 'categories.name' }
            ],
            order: { _key: 'asc' }
          },
          aggs: {
            children: {
              nested: { path: 'categories.children' },
              aggs: {
                count_items: {
                  multi_terms: {
                    terms: [
                      { field: 'categories.children.code' },
                      { field: 'categories.children.name' }
                    ],
                    order: { _key: 'asc' }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
})

console.log(JSON.stringify(body, null, 2))

console.log('----------')

const fieldName = (rs: any) => 
  Object.keys(rs).find(k => !['key', 'doc_count', 'key_as_string'].includes(k))

type Bucket = { key: [string, string], doc_count: number }

const toDoc = (rs: any) => {
  const k1 = fieldName(rs)

  if (!k1) {
    return {}
  }

  const k2 = fieldName(rs[k1])!

  const bs = rs[k1][k2].buckets.map((b: Bucket) => 
    Object.assign(
      {
        code: b.key[0],
        name: b.key[1],
        [k2]: b.doc_count
      },
      toDoc(b)
    )
  )

  return {
    [k1]: bs
  }
}

const res = toDoc(body.aggregations)

console.log(JSON.stringify(res, null, 2))

The output is as follows.

(b) Output

> deno run --allow-net sample2.ts 
{
  ・・・
  "aggregations": {
    "categories": {
      "doc_count": 20,
      "count_items": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": [
              "A",
              "categoryA"
            ],
            "key_as_string": "A|categoryA",
            "doc_count": 4,
            "children": {
              "doc_count": 4,
              "count_items": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 0,
                "buckets": [
                  {
                    "key": [
                      "T",
                      "subcategoryAT"
                    ],
                    "key_as_string": "T|subcategoryAT",
                    "doc_count": 3
                  },
                  {
                    "key": [
                      "U",
                      "subcategoryAU"
                    ],
                    "key_as_string": "U|subcategoryAU",
                    "doc_count": 1
                  }
                ]
              }
            }
          },
          ・・・
        ]
      }
    }
  }
}
----------
{
  "categories": [
    {
      "code": "A",
      "name": "categoryA",
      "count_items": 4,
      "children": [
        {
          "code": "T",
          "name": "subcategoryAT",
          "count_items": 3
        },
        {
          "code": "U",
          "name": "subcategoryAU",
          "count_items": 1
        }
      ]
    },
    {
      "code": "B",
      "name": "categoryB",
      "count_items": 10,
      "children": [
        {
          "code": "S",
          "name": "subcategoryBS",
          "count_items": 2
        },
        {
          "code": "T",
          "name": "subcategoryBT",
          "count_items": 4
        },
        {
          "code": "U",
          "name": "subcategoryBU",
          "count_items": 4
        }
      ]
    },
    {
      "code": "C",
      "name": "categoryC",
      "count_items": 6,
      "children": [
        {
          "code": "S",
          "name": "subcategoryCS",
          "count_items": 2
        },
        {
          "code": "T",
          "name": "subcategoryCT",
          "count_items": 3
        },
        {
          "code": "U",
          "name": "subcategoryCU",
          "count_items": 1
        }
      ]
    }
  ]
}

2021-10-11

Extracting only the nested elements that match the search conditions in Elasticsearch

Elasticsearch TypeScript Deno

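For Elasticsearch searches on nested fields, I used inner_hits to extract only the nested elements that match the search conditions.

The source for this post is at http://github.com/fits/try_samples/tree/master/blog/20211011/

Introduction

In Elasticsearch, the nested type makes it possible to represent nested structures.

Here I prepared sample data with two levels of nested fields, where each product has multiple editions and each edition keeps its sale history.

The Elasticsearch mapping settings are as follows.

Mapping settings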

$ curl -s http://localhost:9200/items/_mappings | jq
{
  "items": {
    "mappings": {
      "properties": {
        "editions": {
          "type": "nested",
          "properties": {
            "edition": {
              "type": "keyword"
            },
            "price": {
              "type": "scaled_float",
              "scaling_factor": 100
            },
            "sales": {
              "type": "nested",
              "properties": {
                "date_from": {
                  "type": "date"
                },
                "date_to": {
                  "type": "date"
                },
                "sale_price": {
                  "type": "scaled_float",
                  "scaling_factor": 100
                }
              }
            }
          }
        },
        "id": {
          "type": "keyword"
        },
        "name": {
          "type": "text"
        }
      }
    }
  }
}
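For reference, a mapping like the one above could be created with a request along the following lines. This is only a minimal sketch, not the script used for this post: itemsMapping and createIndexRequest are names made up for illustration, and an unsecured Elasticsearch 7.x instance on localhost:9200 is assumed.

```typescript
// Hypothetical reconstruction of the mapping shown above (not the author's setup script).
const scaledPrice = { type: 'scaled_float', scaling_factor: 100 }

const itemsMapping = {
  mappings: {
    properties: {
      id: { type: 'keyword' },
      name: { type: 'text' },
      editions: {
        type: 'nested',
        properties: {
          edition: { type: 'keyword' },
          price: scaledPrice,
          sales: {
            type: 'nested',
            properties: {
              date_from: { type: 'date' },
              date_to: { type: 'date' },
              sale_price: scaledPrice
            }
          }
        }
      }
    }
  }
}

// Describes the create-index call: PUT <baseUrl>/<index> with the mapping as the body.
const createIndexRequest = (baseUrl: string, index: string) => ({
  url: `${baseUrl}/${index}`,
  method: 'PUT',
  body: itemsMapping
})

const req = createIndexRequest('http://localhost:9200', 'items')
console.log(req.method, req.url)
```

The url, method and body values can then be handed to any HTTP helper that sends a JSON body, for example a fetch-based one.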

The documents are as follows.

All documents

$ curl -s http://localhost:9200/items/_search | jq
{
  ・・・
  "hits": {
    ・・・
    "hits": [
      {
        "_index": "items",
        "_type": "_doc",
        "_id": "id-001",
        "_score": 1,
        "_source": {
          "id": "id-001",
          "name": "item-A",
          "editions": [
            {
              "edition": "Standard",
              "price": 1000,
              "sales": [
                {
                  "sale_price": 900,
                  "date_from": "2021-01-01T00:00:00Z",
                  "date_to": "2021-01-03T00:00:00Z"
                },
                {
                  "sale_price": 800,
                  "date_from": "2021-07-01T00:00:00Z",
                  "date_to": "2021-07-05T00:00:00Z"
                }
              ]
            },
            {
              "edition": "Extra",
              "price": 2000,
              "sales": [
                {
                  "sale_price": 1800,
                  "date_from": "2021-01-01T00:00:00Z",
                  "date_to": "2021-01-03T00:00:00Z"
                },
                {
                  "sale_price": 1700,
                  "date_from": "2021-07-01T00:00:00Z",
                  "date_to": "2021-07-05T00:00:00Z"
                },
                {
                  "sale_price": 1500,
                  "date_from": "2021-09-01T00:00:00Z",
                  "date_to": "2021-09-02T00:00:00Z"
                }
              ]
            }
          ]
        }
      },
      {
        "_index": "items",
        "_type": "_doc",
        "_id": "id-002",
        "_score": 1,
        "_source": {
          "id": "id-002",
          "name": "item-B",
          "editions": [
            {
              "edition": "Standard",
              "price": 1500,
              "sales": [
                {
                  "sale_price": 1400,
                  "date_from": "2021-09-01T00:00:00Z",
                  "date_to": "2021-09-05T00:00:00Z"
                }
              ]
            },
            {
              "edition": "Extra",
              "price": 5000,
              "sales": []
            }
          ]
        }
      },
      {
        "_index": "items",
        "_type": "_doc",
        "_id": "id-003",
        "_score": 1,
        "_source": {
          "id": "id-003",
          "name": "item-C",
          "editions": [
            {
              "edition": "Standard",
              "price": 7000,
              "sales": [
                {
                  "sale_price": 6800,
                  "date_from": "2021-01-01T00:00:00Z",
                  "date_to": "2021-01-03T00:00:00Z"
                },
                {
                  "sale_price": 6700,
                  "date_from": "2021-02-01T00:00:00Z",
                  "date_to": "2021-02-05T00:00:00Z"
                },
                {
                  "sale_price": 6600,
                  "date_from": "2021-04-01T00:00:00Z",
                  "date_to": "2021-04-02T00:00:00Z"
                },
                {
                  "sale_price": 6500,
                  "date_from": "2021-07-01T00:00:00Z",
                  "date_to": "2021-07-15T00:00:00Z"
                }
              ]
            }
          ]
        }
      },
      {
        "_index": "items",
        "_type": "_doc",
        "_id": "id-004",
        "_score": 1,
        "_source": {
          "id": "id-004",
          "name": "item-D",
          "editions": [
            {
              "edition": "Standard",
              "price": 4000,
              "sales": []
            },
            {
              "edition": "Extra",
              "price": 6000,
              "sales": []
            },
            {
              "edition": "Premium",
              "price": 9000,
              "sales": []
            }
          ]
        }
      }
    ]
  }
}

Search

The search is implemented in TypeScript for Deno.

There does not seem to be an official Elasticsearch client library for Deno at the moment, so I wrote a function that calls the REST API with fetch.

es_util.ts

// Call the Elasticsearch REST API
export const send = async (url: string, method: string, body: any) => {
    const res = await fetch(url, {
        method,
        headers: {
          "Content-Type": "application/json"
        },
        body: JSON.stringify(body)
    })

    const resBody = await res.json()

    return {
        ok: res.ok,
        body: resBody
    }
}

Here, the search uses the following condition.

Items whose sale price was 1500 or less in a sale starting on or after 2021/7/1
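To make the condition concrete, here is a rough in-memory sketch of the result it should produce, applied to two of the sample documents above. The type names and the matchesCondition and filterItems functions are made up for illustration, and "on or after 2021/7/1" is read as a condition on date_from. Elasticsearch does not evaluate the query this way; the sketch only shows the expected outcome.

```typescript
// Types mirroring the sample documents (illustrative names).
interface SaleEntry { sale_price: number; date_from: string; date_to: string }
interface EditionEntry { edition: string; price: number; sales: SaleEntry[] }
interface ItemDoc { id: string; name: string; editions: EditionEntry[] }

// The search condition as a plain predicate on one sale entry.
const matchesCondition = (s: SaleEntry): boolean =>
  Date.parse(s.date_from) >= Date.parse('2021-07-01T00:00:00Z') && s.sale_price <= 1500

// Keeps only matching sales, then drops editions and items left with no match.
const filterItems = (items: ItemDoc[]): ItemDoc[] =>
  items
    .map(item => ({
      ...item,
      editions: item.editions
        .map(e => ({ ...e, sales: e.sales.filter(matchesCondition) }))
        .filter(e => e.sales.length > 0)
    }))
    .filter(item => item.editions.length > 0)

// id-002 and id-004 from the sample data.
const sampleItems: ItemDoc[] = [
  {
    id: 'id-002',
    name: 'item-B',
    editions: [
      {
        edition: 'Standard',
        price: 1500,
        sales: [
          { sale_price: 1400, date_from: '2021-09-01T00:00:00Z', date_to: '2021-09-05T00:00:00Z' }
        ]
      },
      { edition: 'Extra', price: 5000, sales: [] }
    ]
  },
  {
    id: 'id-004',
    name: 'item-D',
    editions: [
      { edition: 'Standard', price: 4000, sales: [] },
      { edition: 'Extra', price: 6000, sales: [] },
      { edition: 'Premium', price: 9000, sales: [] }
    ]
  }
]

const filteredItems = filterItems(sampleItems)
console.log(JSON.stringify(filteredItems, null, 2))
```

Only item-B remains, reduced to its Standard edition and the single sale entry that satisfies the condition.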

1. Search without inner_hits

First, a plain search without inner_hits.

To use a nested field in a search condition, use a Nested query ({ nested: { path: ・・・, query: ・・・ } }).

To use sales inside editions in a search condition, the Nested query is itself nested.

search1.ts

import { send } from './es_util.ts'

const index = 'items'
const baseUrl = 'http://localhost:9200'

const indexUrl = `${baseUrl}/${index}`

const { body } = await send(`${indexUrl}/_search`, 'POST', {
  query: {
    nested: {
      path: 'editions',
      query: {
        nested: {
          path: 'editions.sales',
          query: {
            bool: {
              must: [
                {
                  range: {
                    'editions.sales.date_from': { gte: '2021-07-01T00:00:00Z' }
                  }
                },
                {
                  range: {
                    'editions.sales.sale_price': { lte: 1500 }
                  }
                }
              ]
            }
          }
        }
      }
    }
  }
})

console.log(JSON.stringify(body, null, 2))

As the result below shows, _source is the original document as-is, so there is no way to tell which edition and which sale entry matched the search condition.

Result

> deno run --allow-net search1.ts

{
  ・・・
  "hits": {
    ・・・
    "hits": [
      {
        "_index": "items",
        "_type": "_doc",
        "_id": "id-001",
        "_score": 2,
        "_source": {
          "id": "id-001",
          "name": "item-A",
          "editions": [
            {
              "edition": "Standard",
              "price": 1000,
              "sales": [
                {
                  "sale_price": 900,
                  "date_from": "2021-01-01T00:00:00Z",
                  "date_to": "2021-01-03T00:00:00Z"
                },
                {
                  "sale_price": 800,
                  "date_from": "2021-07-01T00:00:00Z",
                  "date_to": "2021-07-05T00:00:00Z"
                }
              ]
            },
            {
              "edition": "Extra",
              "price": 2000,
              "sales": [
                {
                  "sale_price": 1800,
                  "date_from": "2021-01-01T00:00:00Z",
                  "date_to": "2021-01-03T00:00:00Z"
                },
                {
                  "sale_price": 1700,
                  "date_from": "2021-07-01T00:00:00Z",
                  "date_to": "2021-07-05T00:00:00Z"
                },
                {
                  "sale_price": 1500,
                  "date_from": "2021-09-01T00:00:00Z",
                  "date_to": "2021-09-02T00:00:00Z"
                }
              ]
            }
          ]
        }
      },
      {
        "_index": "items",
        "_type": "_doc",
        "_id": "id-002",
        "_score": 2,
        "_source": {
          "id": "id-002",
          "name": "item-B",
          "editions": [
            {
              "edition": "Standard",
              "price": 1500,
              "sales": [
                {
                  "sale_price": 1400,
                  "date_from": "2021-09-01T00:00:00Z",
                  "date_to": "2021-09-05T00:00:00Z"
                }
              ]
            },
            {
              "edition": "Extra",
              "price": 5000,
              "sales": []
            }
          ]
        }
      }
    ]
  }
}
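As an aside, the doubly nested query in search1.ts could also be generated from a list of paths. The following is a minimal sketch under that assumption; nestedQuery is a hypothetical helper, not part of any Elasticsearch client.

```typescript
// Any Elasticsearch query clause, kept loosely typed for this sketch.
type QueryClause = Record<string, unknown>

// Wraps `inner` in one Nested query per path, outermost path first.
// `nestedQuery` is a hypothetical helper name.
const nestedQuery = (paths: string[], inner: QueryClause): QueryClause =>
  paths.reduceRight<QueryClause>(
    (query, path) => ({ nested: { path, query } }),
    inner
  )

// The same bool condition as in search1.ts.
const saleCondition: QueryClause = {
  bool: {
    must: [
      { range: { 'editions.sales.date_from': { gte: '2021-07-01T00:00:00Z' } } },
      { range: { 'editions.sales.sale_price': { lte: 1500 } } }
    ]
  }
}

const builtQuery = nestedQuery(['editions', 'editions.sales'], saleCondition)
console.log(JSON.stringify({ query: builtQuery }, null, 2))
```

The printed object has the same structure as the request body in search1.ts. For only two levels the hand-written form is arguably just as readable, so this is mostly useful when the nesting gets deeper.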

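2. Search with inner_hits

Now for the main topic: a search that uses inner_hits.

inner_hits can be specified in a Nested query. When it is specified, the matching elements are returned in a separate inner_hits field alongside _source.

Here, Source filtering is used to remove from the original _source the content that is retrieved through inner_hits. (Source filtering can also be set inside inner_hits.)

I also wrote a function (toDoc) that merges _source with the inner_hits results to build documents consisting only of the elements that matched the search condition.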

search2.ts

・・・
const { body } = await send(`${indexUrl}/_search`, 'POST', {
  _source: {
    // Remove editions from _source
    excludes: ['editions']
  },
  query: {
    nested: {
      path: 'editions',
      query: {
        nested: {
          path: 'editions.sales',
          query: {
            bool: {
              must: [
                {
                  range: {
                    'editions.sales.date_from': { gte: '2021-07-01T00:00:00Z' }
                  }
                },
                {
                  range: {
                    'editions.sales.sale_price': { lte: 1500 }
                  }
                }
              ]
            }
          },
          // inner_hits settings for editions.sales
          inner_hits: {}
        }
      },
      // inner_hits settings for editions
      inner_hits: {
        _source: {
          // Remove the sales part from the _source of inner_hits
          excludes: ['editions.sales']
        }
      }
    }
  }
})

console.log(JSON.stringify(body, null, 2))

console.log('-----')

// Recursively merge the inner_hits content into the _source content
const toDoc = (res: any) => res.hits.hits.map((r: any) => {
  const doc = { ...r._source }

  for (const [k, v] of Object.entries(r.inner_hits ?? {})) {
    const key = k.split('.').slice(-1)[0]
    doc[key] = toDoc(v)
  }

  return doc
})

const res = toDoc(body)

console.log(JSON.stringify(res, null, 2))

The result was as follows.

Result

> deno run --allow-net search2.ts

{
  ・・・
  "hits": {
    ・・・
    "hits": [
      {
        "_index": "items",
        "_type": "_doc",
        "_id": "id-001",
        "_score": 2,
        "_source": {
          "name": "item-A",
          "id": "id-001"
        },
        "inner_hits": {
          "editions": {
            "hits": {
              ・・・
              "hits": [
                {
                  "_index": "items",
                  "_type": "_doc",
                  "_id": "id-001",
                  "_nested": {
                    "field": "editions",
                    "offset": 0
                  },
                  "_score": 2,
                  "_source": {
                    "price": 1000,
                    "edition": "Standard"
                  },
                  "inner_hits": {
                    "editions.sales": {
                      "hits": {
                        ・・・
                        "hits": [
                          {
                            "_index": "items",
                            "_type": "_doc",
                            "_id": "id-001",
                            "_nested": {
                              "field": "editions",
                              "offset": 0,
                              "_nested": {
                                "field": "sales",
                                "offset": 1
                              }
                            },
                            "_score": 2,
                            "_source": {
                              "date_to": "2021-07-05T00:00:00Z",
                              "sale_price": 800,
                              "date_from": "2021-07-01T00:00:00Z"
                            }
                          }
                        ]
                      }
                    }
                  }
                },
                {
                  "_index": "items",
                  "_type": "_doc",
                  "_id": "id-001",
                  "_nested": {
                    "field": "editions",
                    "offset": 1
                  },
                  "_score": 2,
                  "_source": {
                    "price": 2000,
                    "edition": "Extra"
                  },
                  "inner_hits": {
                    "editions.sales": {
                      "hits": {
                        ・・・
                        "hits": [
                          {
                            "_index": "items",
                            "_type": "_doc",
                            "_id": "id-001",
                            "_nested": {
                              "field": "editions",
                              "offset": 1,
                              "_nested": {
                                "field": "sales",
                                "offset": 2
                              }
                            },
                            "_score": 2,
                            "_source": {
                              "date_to": "2021-09-02T00:00:00Z",
                              "sale_price": 1500,
                              "date_from": "2021-09-01T00:00:00Z"
                            }
                          }
                        ]
                      }
                    }
                  }
                }
              ]
            }
          }
        }
      },
      ・・・
    ]
  }
}
-----
[
  {
    "name": "item-A",
    "id": "id-001",
    "editions": [
      {
        "price": 1000,
        "edition": "Standard",
        "sales": [
          {
            "date_to": "2021-07-05T00:00:00Z",
            "sale_price": 800,
            "date_from": "2021-07-01T00:00:00Z"
          }
        ]
      },
      {
        "price": 2000,
        "edition": "Extra",
        "sales": [
          {
            "date_to": "2021-09-02T00:00:00Z",
            "sale_price": 1500,
            "date_from": "2021-09-01T00:00:00Z"
          }
        ]
      }
    ]
  },
  {
    "name": "item-B",
    "id": "id-002",
    "editions": [
      {
        "price": 1500,
        "edition": "Standard",
        "sales": [
          {
            "date_to": "2021-09-05T00:00:00Z",
            "sale_price": 1400,
            "date_from": "2021-09-01T00:00:00Z"
          }
        ]
      }
    ]
  }
]
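To see the recursive merge in isolation, here is a typed variant of the same idea applied to a hand-written response fragment modelled on the item-B result. It is a sketch only: the SearchHit and SearchResult types and the mergeInnerHits name are illustrative, and the types cover just the fields the merge touches.

```typescript
// Minimal response shape used by this sketch (illustrative types).
interface SearchHit {
  _source: Record<string, unknown>
  inner_hits?: Record<string, SearchResult>
}
interface SearchResult { hits: { hits: SearchHit[] } }

// Same idea as toDoc in search2.ts: put each inner_hits entry back under the
// last segment of its key ("editions.sales" -> "sales"), recursively.
const mergeInnerHits = (res: SearchResult): Record<string, unknown>[] =>
  res.hits.hits.map(hit => {
    const doc: Record<string, unknown> = { ...hit._source }
    const innerHits: Record<string, SearchResult> = hit.inner_hits ?? {}

    for (const key of Object.keys(innerHits)) {
      const field = key.split('.').slice(-1)[0]
      doc[field] = mergeInnerHits(innerHits[key])
    }

    return doc
  })

// Hand-written fragment, not an actual Elasticsearch response.
const mockResponse: SearchResult = {
  hits: { hits: [{
    _source: { name: 'item-B', id: 'id-002' },
    inner_hits: {
      editions: { hits: { hits: [{
        _source: { price: 1500, edition: 'Standard' },
        inner_hits: {
          'editions.sales': { hits: { hits: [{
            _source: {
              date_to: '2021-09-05T00:00:00Z',
              sale_price: 1400,
              date_from: '2021-09-01T00:00:00Z'
            }
          }] } }
        }
      }] } }
    }
  }] }
}

const mergedDocs = mergeInnerHits(mockResponse)
console.log(JSON.stringify(mergedDocs, null, 2))
```

One thing to keep in mind: inner_hits returns at most three hits per level by default, so a document with more matching nested elements needs the size option set inside inner_hits to get all of them back.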

2021-09-26

Use StringParameter.valueFromLookup with Vpc.fromLookup in CDK

AWS CDK VPC SSM

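When setting up things like peering with an existing VPC in CDK, you will typically reference the VPC with Vpc.fromLookup.

One way to avoid hard-coding the VPC ID is to read it from the SSM (Systems Manager) Parameter Store.

CDK offers the following ways to reference Parameter Store values, but only (b) could be passed as the vpcId of Vpc.fromLookup; (a) did not work.

(a) StringParameter.valueForStringParameter ※
(b) StringParameter.valueFromLookup

 ※ The same applies to valueForTypedStringParameter and
    new StringParameter(・・・).stringValue

Setting the return value of (a) as the vpcId of Vpc.fromLookup resulted in the error All arguments to Vpc.fromLookup() must be concrete (no Tokens).

This appears to be because (a) returns not the actual value but a token that refers to the actual value.

In contrast, (b) fetches the value from the Parameter Store at cdk synth time and caches it in the cdk.context.json file, so it returns the actual value taken from the Parameter Store.

However, once the value is cached in cdk.context.json it does not seem to be fetched again, and updates to the value on the Parameter Store side were not picked up. ※

 ※ Clearing cdk.context.json with cdk context --clear did make the new value take effect.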
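The caching behaviour described above can be pictured with a small toy model. To be clear, this is not how CDK is implemented and it uses no CDK APIs: lookupWithCache, the cache object and the vpc-aaaa / vpc-bbbb values are all made up for illustration, and the real keys in cdk.context.json have a different format.

```typescript
// Toy model only: NOT CDK's implementation, just the observed caching behaviour.
type ContextCache = Record<string, string>

// Returns the cached value when present; otherwise fetches once and stores it.
const lookupWithCache = (
  cache: ContextCache,
  paramName: string,
  fetchValue: () => string
): string => {
  if (!(paramName in cache)) {
    cache[paramName] = fetchValue()
  }
  return cache[paramName]
}

// Stand-ins for the Parameter Store value and cdk.context.json (hypothetical values).
let storedValue = 'vpc-aaaa'
const contextCache: ContextCache = {}
const vpcIdParam = '/sample/vpcid'

// First lookup: nothing cached yet, so the value is fetched and cached.
const firstValue = lookupWithCache(contextCache, vpcIdParam, () => storedValue)

// The parameter is updated, but the cached value is still returned.
storedValue = 'vpc-bbbb'
const secondValue = lookupWithCache(contextCache, vpcIdParam, () => storedValue)

// Clearing the cache (comparable to cdk context --clear) makes the update visible.
delete contextCache[vpcIdParam]
const thirdValue = lookupWithCache(contextCache, vpcIdParam, () => storedValue)

console.log(firstValue, secondValue, thirdValue)
// prints "vpc-aaaa vpc-aaaa vpc-bbbb"
```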

Stack definition example

import { App, Stack } from '@aws-cdk/core'
import { StringParameter } from '@aws-cdk/aws-ssm'
import * as ec2 from '@aws-cdk/aws-ec2'

const vpcIdParamName = '/sample/vpcid'

const app = new App()

const stack = new Stack(app, 'VPCPeeringSample', {
    env: {
        account: process.env.CDK_DEFAULT_ACCOUNT,
        region: process.env.CDK_DEFAULT_REGION
    }
})

const vpc = new ec2.Vpc(stack, 'Vpc', {
    cidr: '192.168.0.0/16',
    subnetConfiguration: [
        { name: 'sample', subnetType: ec2.SubnetType.PUBLIC }
    ]
})

const peerVpc = ec2.Vpc.fromLookup(stack, 'PeerVpc', {
    vpcId: StringParameter.valueFromLookup(stack, vpcIdParamName)
    // The following results in an error
    // vpcId: StringParameter.valueForStringParameter(stack, vpcIdParamName)
})

// VPC peering
const vpcPeering = new ec2.CfnVPCPeeringConnection(stack, 'VpcPeering', {
    vpcId: vpc.vpcId,
    peerVpcId: peerVpc.vpcId
})

vpc.publicSubnets.forEach((s, i) => {
    new ec2.CfnRoute(stack, `src-peering-${i}`, {
        routeTableId: s.routeTable.routeTableId,
        destinationCidrBlock: peerVpc.vpcCidrBlock,
        vpcPeeringConnectionId: vpcPeering.ref
    })
})

・・・

The source for this post is at http://github.com/fits/try_samples/tree/master/blog/20210926/

2021-06-21

Testing Vue components with Jest and Vue Test Utils

Vue.js TypeScript Jest

I added Jest and Vue Test Utils (@vue/test-utils) to TypeScript Vue projects created with Vue CLI, and set them up to test Vue components with tests written in TypeScript.

For this post, I chose Manually select features in vue create, checked only Choose Vue version and TypeScript, and created one project each for Vue.js 2.x and 3.x.

The source for this post is at http://github.com/fits/try_samples/tree/master/blog/20210621/

(a) Vue.js 2.x

First, the Vue.js 2.x project.

Vue.js 2.6.14

a-1. Installing the test modules

Install and configure the modules needed for Jest and Vue Test Utils.

Installing and configuring jest

Install jest, ts-jest (to write the test code in TypeScript), and vue-jest (to handle Vue components in the test code).

> npm i -D jest ts-jest vue-jest

Configure Jest to apply ts-jest and vue-jest, and set testEnvironment to jsdom so that the tests run under jsdom.

jest.config.js example

module.exports = {
    testEnvironment: 'jsdom',
    preset: 'ts-jest',
    transform: {
        '.*\\.(vue)$': 'vue-jest'
    }
}

Installing Vue Test Utils

Install @vue/test-utils.

> npm i -D @vue/test-utils

Installing and configuring @types/jest

While at it, take steps so that VS Code does not flag the jest functions as errors.

First, install the jest type definitions.

> npm i -D @types/jest

Add jest to types in tsconfig.json.

tsconfig.json example

{
・・・
    "types": [
      ・・・
      "jest"
    ],
・・・
}

a-2. Implementing the component

Implement the component to be tested.

src/components/Counter.vue

<template>
  <div>
    <p>
      <button @click="countUp">count up</button>
    </p>
    <p>
      counter: {{ count }}
    </p>
  </div>
</template>

<script lang="ts">
import Vue from 'vue'

export default Vue.extend({
  data() {
    return {
      count: 0
    }
  },
  methods: {
    countUp() {
      this.count++
    }
  }
})
</script>

a-3. Implementing and running the test

Write the test code and run it.

When written in TypeScript, counter.vm.count was not possible because of the types, so for now I use counter.vm.$data.count.

(counter.vm as any).count is also possible, but honestly neither feels quite right.

tests/Counter.test.ts

import { mount } from '@vue/test-utils'
import Counter from '../src/components/Counter.vue'

test('count up', async () => {
    const counter = mount(Counter)

    expect(counter.vm.$data.count).toBe(0)
    // The following also works
    // expect((counter.vm as any).count).toBe(0)

    await counter.get('button').trigger('click')

    expect(counter.vm.$data.count).toBe(1)
})
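As a narrower alternative to the as any cast, a small helper could read count from $data and check its type at runtime. This is only a sketch: CounterState, HasData and counterState are hypothetical names, and it assumes that counter.vm is accepted wherever an object with a $data record is expected.

```typescript
// Shape of Counter's reactive state (illustrative name).
interface CounterState { count: number }

// Narrow structural type: anything exposing $data, such as a Vue 2 instance.
interface HasData { $data: Record<string, unknown> }

// Reads `count` from $data and checks its type at runtime instead of casting to any.
const counterState = (vm: HasData): CounterState => {
  const count = vm.$data.count

  if (typeof count !== 'number') {
    throw new TypeError('count is missing or not a number')
  }

  return { count }
}

// Stand-in object for illustration; in the test above this would be counter.vm.
const fakeVm: HasData = { $data: { count: 0 } }

console.log(counterState(fakeVm).count)
```

In the test, the assertion would then read expect(counterState(counter.vm).count).toBe(0). Whether this is worth the extra code over $data.count is a matter of taste.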

Run the tests with the jest command.

Test run

> npx jest

 PASS  tests/Counter.test.ts (5.907 s)
  ・・・

Test Suites: 1 passed, 1 total
Tests:       1 passed, 1 total
Snapshots:   0 total
Time:        7.434 s
Ran all test suites.

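(b) Vue.js 3.x

Next, the Vue.js 3.x project.

Vue.js 3.1.1

b-1. Installing the test modules

This is basically the same as for 2.x, but the versions of the modules to install differ slightly.

Installing and configuring jest

At the time of writing, installing a vue-jest that supports Vue.js 3.x required specifying next as the version.

Also, the vue-jest 5.0.0-alpha.10 that gets installed (at the time of writing) does not support jest version 27, so jest and ts-jest had to be installed at version 26.

> npm i -D jest@26 ts-jest@26 vue-jest@next

The Jest configuration is the same as for 2.x, although unlike 2.x, omitting testEnvironment did not seem to cause any problems here.

jest.config.js example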

module.exports = {
    testEnvironment: 'jsdom',
    preset: 'ts-jest',
    transform: {
        '.*\\.(vue)$': 'vue-jest'
    }
}

Installing Vue Test Utils

The @vue/test-utils version also had to be next.

> npm i -D @vue/test-utils@next

As with 2.x, install and configure @types/jest as needed.

b-2. Implementing the component

Implement the component to be tested.

src/components/Counter.vue

<template>
  <div>
    <p>
      <button @click="countUp">count up</button>
    </p>
    <p>
      counter: {{ count }}
    </p>
  </div>
</template>

<script lang="ts">
import { defineComponent } from 'vue'

export default defineComponent({
  data() {
    return {
      count: 0
    }
  },
  methods: {
    countUp() {
      this.count++
    }
  }
})
</script>

Implementing the above with the Composition API, as follows, could be tested with the same test code.

src/components/Counter.vue (Composition API version)

<template>
  <div>
    <p>
      <button @click="countUp">count up</button>
    </p>
    <p>
      counter: {{ count }}
    </p>
  </div>
</template>

<script lang="ts">
import { ref } from 'vue'

export default {
  setup() {
    const count = ref(0)
    const countUp = () => count.value++

    return {
      count,
      countUp
    }
  }
}
</script>

b-3. Implementing and running the test

Unlike 2.x, counter.vm.count could be used here.

tests/Counter.test.ts

import { mount } from '@vue/test-utils'
import Counter from '../src/components/Counter.vue'

test('count up', async () => {
    const counter = mount(Counter)

    expect(counter.vm.count).toBe(0)

    await counter.get('button').trigger('click')

    expect(counter.vm.count).toBe(1)
})

Run the tests with the jest command.

Test run

> npx jest

 PASS  tests/Counter.test.ts
 ・・・

Test Suites: 1 passed, 1 total
Tests:       1 passed, 1 total
Snapshots:   0 total
Time:        3.78 s, estimated 4 s
Ran all test suites.