TSVを加工し必要な列の値を作ってから.import
する。
成果物
元TSV
- https://github.com/ytyaru/Shell.StudentNameListGenerator.20200919160912/src/lib/Shell.JapaneseLastNameGenerator.20200912074709/src/dic/surnames/one_to_one_yk.tsv
- https://github.com/ytyaru/Shell.StudentNameListGenerator.20200919160912/src/lib/Shell.JapaneseFirstNameGenerator.20200919111358/src/dic/yk/oo/
- mc.tsv
- m.tsv
- fc.tsv
- f.tsv
- cm.tsv
- cf.tsv
- c.tsv
これをテーブルの列に合わせる。
LastNames
create table if not exists LastNames( Id int not null, Yomi text not null, Kaki text not null, primary key(Id), unique(Yomi,Kaki) )
元TSVにId
列を追加すればいい。primary key
だから自動追加してくれると思いきや、autoincrement
がないからか、やってくれない。
sqlite3 "$DB_PATH" \ '.mode tabs' \ '.import ./tsv/last/one_to_one_yk.tsv '"$TABLE_NAME"
id
列がない一時テーブルを作成し、そこからid
列がある本番用テーブルへ挿入する方法があるらしい。だが、やはりautoincrement
がないからか、やってくれない。
Id用の連番を作成し、結合したTSVを作り、それを.import
することで解決した。
TSV="$(cat ./tsv/last/one_to_one_yk.tsv)" MAX_ID=$(($(echo -e "$TSV" | wc -l) - 1)) paste <(eval echo {0..$MAX_ID} | tr ' ' '\n') <(echo -e "$TSV") | \ sqlite3 "$DB_PATH" '.mode tabs' '.import /dev/stdin LastNames'
以下エラーが出た。原因はTSVの末尾に不要なタブがあったため。削除して解決した。
/dev/stdin:322: expected 3 columns but found 4 - extras ignored /dev/stdin:723: expected 3 columns but found 4 - extras ignored /dev/stdin:1007: expected 3 columns but found 4 - extras ignored /dev/stdin:1054: expected 3 columns but found 4 - extras ignored /dev/stdin:1102: expected 3 columns but found 4 - extras ignored /dev/stdin:1215: expected 3 columns but found 4 - extras ignored /dev/stdin:1290: expected 3 columns but found 4 - extras ignored /dev/stdin:1667: expected 3 columns but found 4 - extras ignored /dev/stdin:2082: expected 3 columns but found 4 - extras ignored /dev/stdin:3023: expected 3 columns but found 4 - extras ignored
FirstNames
create table if not exists FirstNames( Id int not null, Yomi text not null, Kaki text not null, Sex text not null, primary key(Id), unique(Yomi,Kaki), check(Sex='m' or Sex='f' or Sex='c' or Sex='mc' or Sex='fc' or Sex='cm' or Sex='cf') )
ID=0; BEGIN_ID=0; END_ID=-1; for SEX in m f c mc fc cm cf; do echo "$SEX" TSV="$(cat './tsv/first/'$SEX'.tsv')" BEGIN_ID=$((END_ID + 1)) END_ID=$((BEGIN_ID + $(echo -e "$TSV" | wc -l) - 1)) NUM=$((END_ID - BEGIN_ID)) paste <(eval echo {$BEGIN_ID..$END_ID} | tr ' ' '\n') \ <(echo -e "$TSV") \ <(eval "printf ${SEX}\"%.s\n\" {0..$NUM}") | \ sqlite3 "$DB_PATH" '.mode tabs' '.import /dev/stdin FirstNames' done
cm.tsv
に重複行があったので削除して修正した。cat cm.tsv | sort | uniq > cm_new.tsv
/dev/stdin:32: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:33: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:34: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:35: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:36: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:37: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:38: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:39: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:40: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:41: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:42: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:43: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:44: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:45: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:46: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:47: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:48: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:49: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:50: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:51: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:52: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:53: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:54: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:55: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:56: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:57: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:58: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:59: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:60: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:61: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki /dev/stdin:62: INSERT failed: UNIQUE constraint failed: FirstNames.Yomi, FirstNames.Kaki
ファイルサイズ
ファイル | サイズ(Last+First) |
---|---|
入力元TSV合計 | 4536183(918483+3617700) |
列追加TSV合計 | 6528081(1192937+5335144) |
SQLite3 DB | 17022976 |
DBサイズは元ファイルの3.75倍……。
$ find /tmp/work/Sqlite3.JapaneseNames.Import.20200924095020/src/tsv/first -iname *.tsv | xargs -I@ bash -c 'cat @ | wc -c' | awk '{sum+=$1} END {print sum}' 3617700 $ find /tmp/work/Sqlite3.JapaneseNames.Import.20200924095020/src -iname add_*.tsv | xargs -I@ bash -c 'cat @ | wc -c' | awk '{sum+=$1} END {print sum}' 5335144
入力元7ファイルのTSVサイズ合計は3617700バイト。約3.6MB。
列を追加したTSVのサイズは7ファイル合計5335144バイト。約5.3MB。
file | size(byte) |
---|---|
add_cf.tsv | 430461 |
add_cm.tsv | 377154 |
add_fc.tsv | 62238 |
add_mc.tsv | 54882 |
add_c.tsv | 112164 |
add_f.tsv | 2042378 |
add_m.tsv | 2255867 |
DBファイルのサイズは約17MB。
file | size(byte) |
---|---|
Names.db | 17022976 |
サイズが肥大化しすぎ。ちなみにvacuum
した結果なので、これ以上小さくならない。
例
sqlite3 Names.db
sqlite> select count(*) from LastNames; 47594 sqlite> select count(*) from FirstNames; 199269
sqlite> select count(*) from FirstNames where sex='m'; 82728
sqlite> select * from LastNames where Id in (select Id from LastNames order by random() limit 5); 5351|いのくち|井ノ口 30172|としま|十島 32395|なだや|灘谷 33670|ぬき|貫 42250|みなみつじ|南辻
sqlite> select * from FirstNames where Id in (select Id from FirstNames where sex in ('m','mc','cm','c') order by random() limit 5); 22049|しょうま|梢茉|m 31287|たかと|崇登|m 41220|としき|年輝|m 77256|りゅうが|流翔|m 171754|いおり|依織吏|cm
所感
ファイルサイズが肥大化してしまった。これならTSVのほうがいいのでは?
対象環境
- Raspbierry pi 4 Model B
- Raspbian buster 10.0 2019-09-26 ※
- bash 5.0.3(1)-release 学習まとめ
- SQLite 3.33.0 ※ 学習まとめ
$ uname -a Linux raspberrypi 4.19.97-v7l+ #1294 SMP Thu Jan 30 13:21:14 GMT 2020 armv7l GNU/Linux