init.sql
If you are not writing portable sql schemas, then it is not necessary to write "integer not null primary key autoincrement." autoincrement incurs a performance penalty and should be omitted for integer primary keys1.
I had read about autoincrement being slow but I thought alternative to it would be me generating unique ids myself. I will fix code to use rowids.
Additionally, you may safely use datatype TEXT to store the names of your tags, as under the hood, sqlite will treat these the same.
I was going for VARCHAR(255) (max file name length in ext4) but then friend pointed that database should not be doing those checks, my reasoning was if I set limit on text it will help db have standard row size. I also later came across that sqlite does not actually apply this constraint, and it is my error that I did not change it.
The first, biggest thing to consider is the logic related...without using program logic.
This did come up but I took a break and later I did not have any motivation. I forced myself for past few days but this is becoming fun again so I will try to put much more effort. And as you said it is non standard so it felt weird, I was originally going about try except and logging duplicates.
Finally, in the sqlite implementation, cursor.executemany() is functionally equivalent to manually iterating over a loop yourself, and in this case would not require building a list of dictionaries manually to iterate over, if you are concerned with resource usage/speed.
I thought it would be optimized. I will refactor around this. I was going for dicts since qmark style is going to be deprecated (and as furthur along you have said, I agree dicts are cleaner)
init_db...
I had stong feeling that this could be done in better way, I was split between having whole schema in init function(but that did not sit right either). I did come across executescript in docs but in excitement of working of tagging logic did not gave it much attention. Will fix this.
add_images:
It may be interesting to take an alternative approach to loading all images known in the directory, and instead repeatedly ask the database if a given file is already present. e.g.
select 1 from images where dir_id = :dir_id and image = :image
This will either return 1 or you will get an empty result set from your cursor. This will probably be faster and will definitely use less memory.
As you have pointed out, Im thinking of upsert syntax to just ignore duplicate inserts. Originally I was trying to figure out a way to have "SELECT id FROM images WHERE image IN ?" and I tried every combination of qmark and dicts but it was no dice but could be done with string formatting, but I rather not use it since Im doing parametrized queries everywhere else, that would be bad.
create_insert_tuple:
This whole method can probably be gotten rid of by switching your logic to using the upsert syntax to apply tags.
It would be extremly cool and I will try hard for this, this one function was what kept me debugging all of today, and problem was I was passing "directory" insted of "dir_id" in tagg.py. There was also problem with if statement where I errantly assumed empty list == False but it was not so. I also dont like setting image_id for every image and allocating it is inefficient.
General Notes
Where are you actually committing to the database? AFAICT you're never writing anything to disk unless somehow executemany is autocommitting???
Ah yes I have autocommit turned on as long as there are no exceptions, I will add explicit commit. I was not even closing db mistaking that context manager will handle closing it for me. Just learned today that I have to close cursor too.
In general, shift all logic related to processing data into the database. It is very good at doing those tasks very fast. Brushing up on sql will go a long way here towards simplifying your python code as much as possible. The sqlite docs are wonderful.
Try to avoid doing data processing in pure python, it is very not fast at that.
Is it correct to use "sqlite3" directly and using functions and passing cursor around them, writing sql statements inside function. Im not sure if there is standard way to do this but personally it seems off. Since python is OOP shouldnt I be creating classes and adding methods? i.e (cursor.select_image(image_id) == cursor.execute("SELECT image FROM images WHERE id = ?", (image_id,))
My absolute first draft of it compared to this is such a mess, I was not using argparse and schema was HORROR. I was certain I would have to make a seprate table for every tag and it would hold image ids. Friend pointed me to correct many to many relation table and that was such a bombshell moment. I learned sql end of last year and at that time I did make correct many to many schemas but since then I have not kept up with it.I LOVE sql(ite) and will read all the Docs you have linked with fervour. Thank you VERY much for giving such detailed review. I will follow you up when I implement queries.
(apolocheese for typos and broken sentences, I deleted post because it had formatting issues, hope they are fixed this time)
GNU social JP is a social network, courtesy of GNU social JP管理人. It runs on GNU social, version 2.0.2-dev, available under the GNU Affero General Public License.
All GNU social JP content and data are available under the Creative Commons Attribution 3.0 license.