SQLite3 for data processing
I was curious whether SQLite3 could be useful for quick-and-dirty data processing. After all, SQL queries are quite handy for this kind of work — for simple processing I would usually rather write SQL queries than a Python/Pandas script. SQLite3 is ubiquitous and can execute SQL queries, so it could be an ideal way to carry out simple, small-scale data processing where a distributed system is unnecessary.
Since I had no prior experience with SQLite3 at all, this would also help me familiarize myself with this world-famous tool. Below is a simple script that reads in a tab-separated database file in plain text and performs a simple SQL query:
```sql
-- script.sql
.mode ascii
.separator "\t" "\n"
.import imdb.tsv db
SELECT * FROM db WHERE titleType = 'tvSeries';
```
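For comparison, the Python alternative I had in mind could be sketched with the standard-library sqlite3 module (a minimal sketch: the table name `db` and column names mirror the script above, and a tiny inline dataset stands in for the real TSV file):

```python
import csv
import io
import sqlite3

# Tiny stand-in for imdb.tsv (the real file has more columns and millions of rows).
TSV = (
    "tconst\ttitleType\tprimaryTitle\n"
    "tt001\tmovie\tSome Movie\n"
    "tt002\ttvSeries\tSome Show\n"
)

con = sqlite3.connect(":memory:")
reader = csv.reader(io.StringIO(TSV), delimiter="\t")
header = next(reader)

# Create the table from the header row, much like .import does with its first row.
con.execute(f"CREATE TABLE db ({', '.join(header)})")
con.executemany("INSERT INTO db VALUES (?, ?, ?)", reader)

rows = con.execute("SELECT * FROM db WHERE titleType = 'tvSeries'").fetchall()
print(rows)
```

Even for a filter this simple, the SQL version stays a one-liner while the Python version needs a fair amount of plumbing around it.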
As always, I was curious about its performance/efficiency, so I downloaded a large IMDb database file and tested it out:
```shell
$ curl https://datasets.imdbws.com/title.basics.tsv.gz | gunzip > imdb.tsv
$ sqlite3 < script.sql > output1.tsv
```
To compare the performance, I used gawk as a reference:
```shell
$ gawk -F$'\t' 'BEGIN{OFS=FS} {if($2=="tvSeries"){print $0}}' imdb.tsv > output2.tsv
```
In terms of performance, SQLite3 was about 5x slower than gawk. This is perhaps expected: SQLite3 is not so much a data-processing tool as a database engine, whereas gawk is purpose-built for text processing. In fact, importing the file took the bulk of the time.
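Since the import dominates, one workaround would be to pay that cost once by persisting the database to a file and then reopening it for later queries. A minimal sketch with Python's sqlite3 module (the file name `imdb.db` and the tiny dataset are my own choices for illustration):

```python
import os
import sqlite3
import tempfile

# Hypothetical persistent database file; a temp directory keeps the sketch tidy.
tmpdir = tempfile.mkdtemp()
db_path = os.path.join(tmpdir, "imdb.db")

# One-time import: build the table and write it to disk.
con = sqlite3.connect(db_path)
con.execute("CREATE TABLE db (tconst, titleType)")
con.executemany(
    "INSERT INTO db VALUES (?, ?)",
    [("tt001", "movie"), ("tt002", "tvSeries")],
)
con.commit()
con.close()

# Later runs just open the existing file; no re-import needed.
con2 = sqlite3.connect(db_path)
rows = con2.execute("SELECT * FROM db WHERE titleType = 'tvSeries'").fetchall()
con2.close()
```

The same idea works with the CLI alone: pass a database file name to sqlite3 when importing, and subsequent invocations against that file skip the import entirely.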
In any case, I anticipate it will be useful whenever I want to quickly write processing scripts in SQL, since SQL queries are much easier to write/read than Python/Pandas when it comes to simple data processing.