Obtaining and Processing IDWR Data

While looking for open data with seasonality (periodicity) suitable for experimenting with time series analysis, I came across the IDWR data described below.

Weekly report counts for infectious diseases such as influenza are compiled by prefecture, and the data can be obtained as CSV files.

This time, I will do the preprocessing (obtaining and transforming the data) needed to analyze it.

The source code is at http://github.com/fits/try_samples/tree/master/blog/20180114/

(1) Obtaining the data

The CSV files can be obtained from the IDWR速報データ (rapid report data) page.

For 「定点把握疾患(週報告)、報告数、定点当たり報告数、都道府県別」 (sentinel-reported diseases (weekly): report counts and reports per sentinel site, by prefecture), the CSV file name follows the pattern <4-digit year>-<2-digit week>-teiten.csv.

Week numbers run from 1 to 52 or 53; depending on the year there are either 52 or 53 weeks.

Since the URLs follow a fixed pattern, past data was also easy to download, but when downloading with curl or similar tools I had to add a User-Agent header to the HTTP request.
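
For reference, here is a minimal download sketch in Python. Only the <4-digit year>-<2-digit week>-teiten.csv naming rule and the User-Agent requirement come from the above; the base URL is a placeholder, not the real location of the files.

import urllib.request

# placeholder base URL (replace with the actual location of the teiten.csv files)
BASE_URL = 'https://example.org/idwr/'

def download_teiten(year, week):
    # file names follow the <4-digit year>-<2-digit week>-teiten.csv pattern
    name = f'{year}-{week:02d}-teiten.csv'
    # a User-Agent request header is required
    req = urllib.request.Request(BASE_URL + name, headers = {'User-Agent': 'Mozilla/5.0'})

    with urllib.request.urlopen(req) as res, open(name, 'wb') as f:
        f.write(res.read())

download_teiten(2017, 44)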

Contents of teiten.csv

The contents of teiten.csv look like the following; the character encoding is Shift_JIS.

Using this as-is for data analysis raises the following issues:

  • The header spans multiple rows
  • Cells with no reports contain -
  • The last row is empty
2017-44-teiten.csv
"報告数・定点当り報告数、疾病・都道府県別","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","",""
"2017年44週(10月30日~11月05日)","2017年11月08日作成","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","",""
"","インフルエンザ","","RSウイルス感染症","","咽頭結膜熱","","A群溶血性レンサ球菌咽頭炎","","感染性胃腸炎","","水痘","","手足口病","","伝染性紅斑","","突発性発しん","","百日咳","","ヘルパンギーナ","","流行性耳下腺炎","","急性出血性結膜炎","","流行性角結膜炎","","細菌性髄膜炎","","無菌性髄膜炎","","マイコプラズマ肺炎","","クラミジア肺炎","","感染性胃腸炎(ロタウイルス)",""
"","報告","定当","報告","定当","報告","定当","報告","定当","報告","定当","報告","定当","報告","定当","報告","定当","報告","定当","報告","定当","報告","定当","報告","定当","報告","定当","報告","定当","報告","定当","報告","定当","報告","定当","報告","定当","報告","定当"
"総数","2407","0.49","3033","","1621","0.51","5940","1.88","10937","3.47","1469","0.47","5126","1.62","173","0.05","1259","0.40","50","0.02","967","0.31","899","0.28","8","0.01","484","0.70","9","0.02","11","0.02","185","0.39","5","0.01","4","0.01"
"北海道","137","0.62","120","","369","2.65","384","2.76","217","1.56","71","0.51","103","0.74","2","0.01","33","0.24","5","0.04","9","0.06","11","0.08","-","-","9","0.31","-","-","1","0.04","9","0.39","-","-","1","0.04"
"青森県","10","0.15","35","","12","0.29","44","1.05","74","1.76","21","0.50","121","2.88","-","-","11","0.26","1","0.02","11","0.26","28","0.67","-","-","4","0.36","-","-","-","-","5","0.83","-","-","-","-"
・・・
"鹿児島県","79","0.87","37","","46","0.87","76","1.43","281","5.30","20","0.38","134","2.53","-","-","20","0.38","-","-","20","0.38","81","1.53","-","-","6","0.86","-","-","-","-","2","0.17","-","-","-","-"
"沖縄県","230","3.97","13","","16","0.47","25","0.74","71","2.09","7","0.21","42","1.24","-","-","10","0.29","3","0.09","7","0.21","3","0.09","-","-","13","1.44","-","-","-","-","-","-","1","0.14","-","-"
"","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","",""

f:id:fits:20180114215531p:plain

(2) Processing

I implemented a Python script that extracts only the necessary information from multiple teiten.csv files and converts them into a single CSV.

The basic processing is done with pandas.read_csv.

Arguments passed to pandas.read_csv:

  • skiprows: exclude the unneeded header rows and the 総数 (total) row (the total can be recomputed)
  • skipfooter: exclude the empty last row
  • usecols: exclude the "定当" (reports per sentinel site) columns
  • converters: convert '-' to 0

For time series analysis, the year, the week, and the last day of the week are added as columns.

idwr_convert.py
# coding: utf-8
import codecs
import functools
import glob
import os
import re
import sys
import pandas as pd

data_files = f"{sys.argv[1]}/*.csv"
dest_file = sys.argv[2]

r = re.compile(r'"([0-9]+)年([0-9]+)週\(.*[^0-9]([0-9]+)月([0-9]+)日\)')

# columns to extract (drop the 定当 columns)
cols = [i for i in range(38) if i == 0 or i % 2 == 1]

# conversion: turn '-' into 0
conv = lambda x: 0 if x == '-' else int(x)

# converters applied per column
conv_map = {i:conv for i in range(len(cols)) if i > 0}

# extract the year, the week, and the last day of that week
def read_info(file):
    f = codecs.open(file, 'r', 'Shift_JIS')
    f.readline()

    m = r.match(f.readline())

    f.close()

    year = int(m.group(1))
    week = int(m.group(2))
    # if the week number is above 50 and the month is January, the week ends in the next year
    last_year = year + 1 if week > 50 and m.group(3) == '01' else year

    return (year, week, f"{last_year}-{m.group(3)}-{m.group(4)}")

def read_csv(file):
    d = pd.read_csv(file, encoding = 'Shift_JIS', skiprows = [0, 1, 3, 4],
                     usecols = cols, skipfooter = 1, converters = conv_map, 
                     engine = 'python')

    info = read_info(file)

    d['year'] = info[0] # year
    d['week'] = info[1] # week
    d['lastdate'] = info[2] # last day of the week

    return d.rename(columns = {'Unnamed: 0': 'pref'})


dfs = [read_csv(f) for f in glob.glob(data_files)]

# concatenate the DataFrames
df = functools.reduce(lambda a, b: a.append(b), dfs)

# write the result out as a CSV file
df.to_csv(dest_file, encoding = 'UTF-8')
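
As a quick check of the year-rollover logic in read_info: week 53 of 2015 ends in early January 2016, so its lastdate should fall in 2016 (this matches the idwr.csv output further below). A hypothetical snippet, appended temporarily to idwr_convert.py and assuming data/2015-53-teiten.csv exists:

# year-rollover check (assumes data/2015-53-teiten.csv is present)
print(read_info('data/2015-53-teiten.csv'))
# => (2015, 53, '2016-01-03')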

Execution

I placed the teiten.csv files from week 1 of 2014 through week 49 of 2017 (2015 runs to week 53) in a data directory and ran the script above.

Example run
> python idwr_convert.py data idwr.csv
idwr.csv
,pref,インフルエンザ,RSウイルス感染症,咽頭結膜熱,A群溶血性レンサ球菌咽頭炎,感染性胃腸炎,水痘,手足口病,伝染性紅斑,突発性発しん,百日咳,ヘルパンギーナ,流行性耳下腺炎,急性出血性結膜炎,流行性角結膜炎,細菌性髄膜炎,無菌性髄膜炎,マイコプラズマ肺炎,クラミジア肺炎,感染性胃腸炎(ロタウイルス),year,week,lastdate
0,北海道,333,125,40,118,203,236,2,2,12,0,5,4,0,1,0,0,0,0,1,2014,1,2014-01-05
1,青森県,77,22,18,9,134,71,0,10,4,0,0,7,0,0,0,0,0,0,2,2014,1,2014-01-05
2,岩手県,98,17,11,25,224,82,2,3,7,0,0,15,0,7,0,1,10,0,1,2014,1,2014-01-05
・・・
44,宮崎県,20,47,13,58,191,23,9,18,15,0,1,55,0,11,0,0,0,0,0,2015,53,2016-01-03
45,鹿児島県,40,27,50,109,303,40,4,32,21,0,0,63,0,7,0,0,0,0,0,2015,53,2016-01-03
46,沖縄県,353,3,5,30,209,30,1,2,7,3,1,57,0,2,1,1,9,0,0,2015,53,2016-01-03
0,北海道,1093,147,116,491,433,101,17,273,39,2,1,306,0,6,1,0,11,0,11,2016,1,2016-01-10
1,青森県,142,31,23,52,201,14,0,24,14,0,0,38,0,7,2,1,3,0,0,2016,1,2016-01-10
2,岩手県,156,33,4,125,255,14,2,15,18,1,2,29,0,13,1,0,14,0,0,2016,1,2016-01-10
・・・
44,宮崎県,347,46,81,98,358,14,40,0,29,2,4,24,0,15,0,1,0,0,0,2017,49,2017-12-10
45,鹿児島県,252,24,83,134,426,38,71,1,18,0,11,86,0,8,0,0,0,0,0,2017,49,2017-12-10
46,沖縄県,410,5,8,44,70,24,72,2,9,0,6,4,0,20,0,1,0,0,0,2017,49,2017-12-10

(3) Plotting

Let's aggregate idwr.csv by disease and draw line charts.

(a) Using matplotlib - all diseases

First, let's use matplotlib to plot all diseases on a single chart.

idwr_plot_matplotlib.py
import sys
import pandas as pd
import matplotlib.pyplot as plt

data_file = sys.argv[1]
img_file = sys.argv[2]

df = pd.read_csv(data_file, parse_dates = ['lastdate'])

df.groupby('lastdate').sum().iloc[:, 1:20].plot(legend = False)

plt.savefig(img_file)

Here, df.groupby('lastdate').sum() groups the rows by lastdate and sums each group.

f:id:fits:20180114215613p:plain

Since that still includes unneeded columns, df.groupby('lastdate').sum().iloc[:, 1:20] extracts only the columns we need.

f:id:fits:20180114215636p:plain
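
As an aside, the same selection can be written by column name instead of position. A minimal sketch, assuming the idwr.csv layout shown above (the unnamed first column plus year and week are the extra numeric columns):

import pandas as pd

df = pd.read_csv('idwr.csv', parse_dates = ['lastdate'])

# group by the week-ending date and sum the numeric columns
dg = df.groupby('lastdate').sum(numeric_only = True)

# same columns as .iloc[:, 1:20]: drop the unneeded columns by name
diseases = dg.drop(columns = ['Unnamed: 0', 'year', 'week'])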

The result of running it is as follows.

Example run
> python idwr_plot_matplotlib.py idwr.csv idwr_matplotlib.png
idwr_matplotlib.png

f:id:fits:20180114220012p:plain

Influenza stands out with a noticeably higher number of reports than the others.

(b) Using matplotlib - individual diseases

Next, let's plot only a single, specified disease.

idwr_plot_matplotlib2.py
import sys
import pandas as pd
import matplotlib.pyplot as plt

data_file = sys.argv[1]
item_name = sys.argv[2]
img_file = sys.argv[3]

df = pd.read_csv(data_file, parse_dates = ['lastdate'])

df.groupby('lastdate').sum()[item_name].plot(legend = False)

plt.savefig(img_file)
Example run
> python idwr_plot_matplotlib2.py idwr.csv インフルエンザ インフルエンザ.png
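
If you want one PNG per disease instead of running the script once per name, a small loop over the columns also works. A sketch, assuming the idwr.csv layout above:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('idwr.csv', parse_dates = ['lastdate'])

# same selection of disease columns as in idwr_plot_matplotlib.py
dg = df.groupby('lastdate').sum().iloc[:, 1:20]

for c in dg.columns:
    dg[c].plot(legend = False)
    plt.savefig(f'{c}.png')
    # clear the figure before plotting the next disease
    plt.clf()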

Example charts

I picked out a few diseases whose curves have distinctive shapes.

1. インフルエンザ (influenza)

It breaks out heavily within a limited period each year, on a one-year cycle.

f:id:fits:20180114220104p:plain

2. 感染性胃腸炎 (infectious gastroenteritis)

It follows a one-year cycle, but around the end of 2016 there was an (unusually) large outbreak.

f:id:fits:20180114220118p:plain

3. 手足口病 (hand, foot and mouth disease)

It flares up on a two-year cycle.

f:id:fits:20180114220133p:plain

4. 水痘 (chickenpox)

There seems to be a rough one-year cycle, but the overall trend appears to be declining.

f:id:fits:20180114220149p:plain

5. 流行性耳下腺炎 (mumps)

There does not seem to be any noticeable periodicity.

f:id:fits:20180114220202p:plain

All in all, this looks like quite an interesting dataset.

(c) Using HoloViews + bokeh

Finally, let's create a chart that can be manipulated interactively with HoloViews + bokeh.

For now, I created a line chart (Curve) for each disease (column) and stacked them with an Overlay.

The chart size and font size are changed via opts.

Note that setting label = c on the Curve (the label value is used for the legend) did not work correctly, so as a workaround I used label = f"'{c}'" instead.

idwr_plot_holoviews.py
import sys
import pandas as pd
import holoviews as hv

hv.extension('bokeh')

data_file = sys.argv[1]
dest_file = sys.argv[2]

df = pd.read_csv(data_file, parse_dates = ['lastdate'])

dg = df.groupby('lastdate').sum().iloc[:, 1:20]

# create a Curve for each disease
plist = [hv.Curve(dg[c].reset_index().values, label = f"'{c}'") for c in dg]

# overlay the Curves
p = hv.Overlay(plist)

# change the chart size
p = p.opts(plot = dict(width = 800, height = 600, fontsize = 8))
# change the X and Y axis labels
p = p.redim.label(x = 'lastdate', y = 'num')

# save the chart
hv.renderer('bokeh').save(p, dest_file)

The result is shown below. The .html extension seems to be appended automatically.

Example run
> python idwr_plot_holoviews.py idwr.csv idwr_holoviews
Example display of idwr_holoviews.html

f:id:fits:20180114220302p:plain