Planning to stop doing threads on Twitter and jot down experiments here going forward.
So, the fun, stupid exercise for the day is an extension of a recent one. The data pipeline I first wrote in 2014 for Tenreads (back then) is slimming down like my body, brain, and wallet. After the recent C++ experiments, an 88 KB (-O3) binary now does all the scouring a full-blown Python stack used to do.
While I wanted to replace the DB interface too, Mongo-cxx ate enough of my brain. For my sanity's sake, I am not going there again for some time. I also wanted to check whether Postgres could take its place as a second alternative; since I had restarted work on my JS typing/schema library, it felt like an easy fit, but I didn't proceed further. A bit burnt out, sure, so a hiatus there.
Getting back to it: I stripped things down to depend on a single XML parser (pugixml; might replace this with my own). And now the rest of the Python stack is gone too, thanks to the Mongo CLI. Using --eval and a bulk insert with ordered: false, limb 1 of the pipeline is the smallest it has ever been.
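A minimal sketch of what that CLI step could look like, assuming mongosh as the shell; the db name, collection name, and inline documents are all made up for illustration (the real pipeline feeds scraped docs in here):

```shell
#!/usr/bin/env bash
# Sketch only: "newsdb", "articles", and the inline docs are placeholders.
# ordered: false keeps the bulk insert going past duplicate-key errors
# instead of aborting the whole batch on the first one.
EVAL_JS='
  const docs = [{ _id: "a1" }, { _id: "a2" }];               // placeholder docs
  const res = db.articles.insertMany(docs, { ordered: false });
  print(JSON.stringify(res));                                // machine-readable result
'
# Dry run: print the command rather than executing it, since this is a sketch.
echo mongosh --quiet newsdb --eval "$EVAL_JS"
```

The point of ordered: false is that re-running the same batch is cheap: already-inserted documents fail on the unique _id index while the rest still land.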
The insert command's output is well-formatted JSON from Mongo. jq can extract metrics for the insert operation from there, and a small stack in the bash script can skip duplicate/erroneous entries when running it in a while loop.
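Roughly, that post-processing could look like the following; RESULT is a hand-written stand-in for the insert output (real field names and ids come from Mongo), and the "stack" of seen ids is just a bash associative array:

```shell
#!/usr/bin/env bash
# RESULT is a canned stand-in for what the shell prints after insertMany;
# the real one comes from the pipeline's insert step.
RESULT='{"acknowledged":true,"insertedIds":{"0":"a1","1":"a2","2":"a3"}}'

# Metric extraction with jq (guarded, in case jq is not installed here).
if command -v jq >/dev/null; then
  count=$(printf '%s' "$RESULT" | jq '.insertedIds | length')
  echo "documents inserted: $count"
fi

# The structure that lets a loop skip ids it has already handled:
# an associative array keyed by document id.
declare -A seen
inserted=()
for id in a1 a2 a1 a3 a2; do               # a1 and a2 repeat on purpose
  [[ -n "${seen[$id]:-}" ]] && continue    # duplicate -> skip it
  seen[$id]=1
  inserted+=("$id")
done
echo "unique ids processed: ${#inserted[@]}"
```

Running this processes a1, a2, a3 once each and silently skips the repeats, which is exactly the behavior wanted inside a retrying while loop.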