So I have been making an image tagging program that will allow keys to be bound to tags, so an image can be tagged with multiple tags at once. This is my first serious project and I have been having a lot of fun working on it.
This is a prototype script for it. It requires (-f) a file with the names of images in (-d) a directory to be tagged with (-t) a tag. It uses sqlite3 as the db (and so will the GUI version). I still have to write the query script; this one can only tag. All of the files have to be in the same directory, and it creates the db file $HOME/.local/share/tagg/tags.db (--init)
I was first going for a separate db for every directory, but then a friend suggested I "not do that". Retroactively adding this was not fun, but it is done. -u is needed the first time for every directory.
I use sxiv -o to get image names.
Now I'm certainly doing a lot of things wrong, like having raw sql statements in functions and passing the cursor around. So I need criticism, and hard. Be harsh, make me cry.
PS @p How do you manage your images, since you always seem to have relevant images ready for the occasion?
> How do you manage your images since you always seem to have relevant images ready for occasion.
Have you ever used hydrus? https://hydrusnetwork.github.io/hydrus/index.html I've been using it for years now.
> @p How do you manage your images since you always seem to have relevant images ready for occasion.
I have a good memory (see attached) and also `find|grep` and also somewhat more recently I installed Tesseract and that lets me grep images.
As far as the code, just from reading the SQL it looks like this is how you'd want to structure it. sqlite3 is good for this kind of thing; it's probably the easiest option. (Flat files and grep aren't to be underestimated, but you don't learn a lot from doing things that way.) incelligence.jpeg
@SuperDicq@p@laurel@Pi_rat that is extremely complicated, and i can tell this because i myself tried to implement that. You keep a filename: namespace if you need filenames and not the structure (you can also import structure as tags) man held up by ps3 games.jpg
I'm most curious about what the UI for this will look like because that will be what makes or breaks it for me. Tagging an image with multiple tags and then going to the next should require the least amount of clicks and button presses.
> Just leave my fucking files where they are and don't make a copy.
ABSOLUTELY YES THAT
When x11amp^Wxmms finally stopped building, the main thing I wanted from a music player was "I give it a directory and it plays the music in that directory."
What happens if you've got a couple of million images and you start moving things around? Does it go through all moved directories and recompute hashes? Things get even worse with architectures that support publishing/sharing your images and tags with others.
@SuperDicq@p I am thinking more like sxiv with a sidebar (and sxiv key binds, so n saves whatever tags were added to this image and moves to the next) for manipulating tags and the keys they are bound to. I'm also thinking about "views" for advanced querying, which I will add, like (meme or pepe) and (computer or wizard)
I have not personally used Hydrus or any other application; I tried DigiKam but gave up on it quite early. This seemed like an easy first project, so I thought why not. Mainly I needed an app I can tag with quickly, but thinking and writing, I have so many ideas for the GUI version, like multiple OR and AND clauses...
@p@SuperDicq I can relate. I can generally guess the time frame where an image will be depending on the surrounding images, but I would still like to be able to quickly find them......
@SuperDicq@hj@laurel@Pi_rat Yeah, I use cmus. `:add sound/` and there it all is. I don't know how easy it is to remap keys in VLC, but I actually built cmus on my n900 so that I could push keyboard buttons without looking at the device and crashing my car. (I still use the uConsole like this.)
How is it going to know it's the same files? If there are any discrepancies between its internal representation and the files on disc, it will have to recompute image hashes.
If you are not writing portable sql schemas, then it is not necessary to write "integer not null primary key autoincrement". autoincrement incurs a performance penalty and should be omitted for integer primary keys.[1]
Additionally, you may safely use datatype TEXT to store the names of your tags, as under the hood, sqlite will treat these the same. In fact, if you wish, you do not have to specify a datatype at all.[2]
The first, biggest thing to consider is the logic related to applying tags to images. Take a look at the non-standard (but very useful) upsert clause.[3] Present in both sqlite and postgresql, it allows you to handle a uniqueness constraint violation inside the query, without using program logic.
For instance this would let you write something like:
insert into image_tags values (:id, :tag) on conflict do nothing
This turns duplicate tag application into a no-op, saving you some trouble and making your code nicer.
Finally, in the sqlite implementation, cursor.executemany() is functionally equivalent to manually iterating over a loop yourself, and in this case would not require building a list of dictionaries manually to iterate over, if you are concerned with resource usage/speed.
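To make that concrete, here's a minimal runnable sketch combining the upsert with executemany. The table and column names here are simplified stand-ins, not your real schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical minimal schema; the real one has more columns.
db.executescript("""
    create table image_tags (
        image_id integer not null,
        tag_id   integer not null,
        primary key (image_id, tag_id)
    );
""")

# One dict per row; note the intentional duplicate.
pairs = [{"id": 1, "tag": 10}, {"id": 1, "tag": 10}, {"id": 2, "tag": 10}]

# Duplicate tag applications become no-ops instead of raising IntegrityError.
db.executemany(
    "insert into image_tags values (:id, :tag) on conflict do nothing",
    pairs,
)
db.commit()
print(db.execute("select count(*) from image_tags").fetchone()[0])  # → 2
```

(Upsert needs SQLite 3.24 or newer, which any remotely recent system will have.)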
It is not necessary to manually parse the database initialization script and execute the commands one at a time. Instead use cursor.executescript() with the loaded contents of the file as a parameter.
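e.g. something like this; the schema string here is just a stand-in for the contents of your actual init file (which you'd read with open("schema.sql").read() or similar):

```python
import sqlite3

# Stand-in for the real initialization script loaded from disk.
schema = """
create table images (id integer primary key, image text);
create table tags   (id integer primary key, tag text);
"""

db = sqlite3.connect(":memory:")
db.executescript(schema)  # runs every statement in one call

# Verify both tables were created.
tables = [r[0] for r in db.execute(
    "select name from sqlite_master where type = 'table' order by name")]
print(tables)  # → ['images', 'tags']
```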
add_images:
It may be interesting to take an alternative approach to loading all images known in the directory, and instead repeatedly ask the database if a given file is already present. e.g.
select 1 from images where dir_id = :dir_id and image = :image
This will either return 1 or you will get an empty result set from your cursor. This will probably be faster and will definitely use less memory.
Also, in general, prefer named parameters over positional ones; (:dir_id, :image) is better than (?, ?). This is stylistic and a personal choice, so buyer beware.
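Putting both points together, a sketch of the existence check using named parameters (the table layout is a guess at the real one, and image_known is a hypothetical helper name):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table images (dir_id integer, image text)")
db.execute("insert into images values (1, 'cat.jpg')")

def image_known(cur, dir_id, image):
    # Named parameters are supplied as a dict; fetchone() returns None
    # when the result set is empty.
    row = cur.execute(
        "select 1 from images where dir_id = :dir_id and image = :image",
        {"dir_id": dir_id, "image": image},
    ).fetchone()
    return row is not None

cur = db.cursor()
print(image_known(cur, 1, "cat.jpg"))  # → True
print(image_known(cur, 1, "dog.jpg"))  # → False
```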
create_insert_tuple:
This whole method can probably be gotten rid of by switching your logic to using the upsert syntax to apply tags.
General Notes
Where are you actually committing to the database? AFAICT you're never writing anything to disk unless somehow executemany is autocommitting???
In general, shift all logic related to processing data into the database. It is very good at doing those tasks very fast. Brushing up on sql will go a long way here towards simplifying your python code as much as possible. The sqlite docs are wonderful.
Try to avoid doing data processing in pure python, it is very not fast at that.
> which desktop OS filesystem does not support hardlinks?
When you buy a USB stick, what is the filesystem format?
Hard links also can't cross FS boundaries. People can run into this without knowing: Armbian, for example, likes to mount a tmpfs over some directories and then sync so as not to wear down your SD card.
If people remove the images, what are you going to do? Are you going to track it to make sure it's the only link? What about writes in place, like tweaking the EXIF data?
It's better to just deal with files where they are.
> When you buy a USB stick, what is the filesystem format?
FAT/exFAT is not a desktop OS filesystem, it's a removable storage filesystem. Neither is NFS or anything else you can think of that doesn't support hardlinks.
> Hard links also can't cross FS boundaries.
okay, but that's why you keep your Hydrus storage next to where you normally store these files... on the same filesystem...
> If people remove the images, what are you going to do? Are you going to track it to make sure it's the only link?
yeah it's actually quite easy I do it all the time
> What about writes in place, like tweaking the EXIF data?
I do this all the time with my music library and it doesn't break the hardlink and leave me with two different files?
like I literally suggested that this be an OPTION and you came in here firing in all directions like
"ACKCHYUALLY YOU CAN'T DO THAT BCAUSE I CAN INVENT AN INCOMPATIBLE USE CASE"
get a hobby and stop always trying to be the smartest person in the room, it's not a good look
Either we're talking about people that don't know whether their USB stick is FAT or not, or we're talking about people that might not be running a "desktop OS filesystem". Samba isn't a super rare thing, either.
> like I literally suggested that this be an OPTION and you came in here firing in all directions like
I just said it was not always going to work, you argued, I replied. I don't know what you want from me. If you don't want to argue about it, don't argue about it.
> get a hobby and stop always trying to be the smartest person in the room, it's not a good look
You jumped in the thread to suggest hard links; I just said it's not always going to work. I didn't post caricatures at you, speculate that you had some motivation like "trying to be the smartest person in the room": you jumped in, suggested that someone else patch this software to include your idea, I have never even heard of this program before this thread, I just think hard links are a bad idea, and now you're acting like I cut you off in traffic. Chill, dude. firing-in-all-directions.png
> If you are not writing portable sql schemas, then it is not necessary to write "integer not null primary key autoincrement". autoincrement incurs a performance penalty and should be omitted for integer primary keys.
I had read about autoincrement being slow, but I thought the alternative would be generating unique ids myself. I will fix the code to use rowids.
> Additionally, you may safely use datatype TEXT to store the names of your tags, as under the hood, sqlite will treat these the same.
I was going for VARCHAR(255) (the max file name length in ext4), but then a friend pointed out that the database should not be doing those checks. My reasoning was that setting a limit on the text would help the db have a standard row size. I also later came across the fact that sqlite does not actually apply this constraint, and it is my error that I did not change it.
> The first, biggest thing to consider is the logic related...without using program logic.
This did come up, but I took a break and later did not have any motivation. I have forced myself for the past few days, but this is becoming fun again, so I will try to put in much more effort. And as you said, it is non-standard, so it felt weird; I was originally going to try/except and log duplicates.
> Finally, in the sqlite implementation, cursor.executemany() is functionally equivalent to manually iterating over a loop yourself, and in this case would not require building a list of dictionaries manually to iterate over, if you are concerned with resource usage/speed.
I thought it would be optimized. I will refactor around this. I was going for dicts since qmark style is going to be deprecated (and as you said further along, I agree dicts are cleaner)
I had a strong feeling that this could be done in a better way; I was split between that and having the whole schema in the init function (but that did not sit right either). I did come across executescript in the docs, but in the excitement of working on the tagging logic I did not give it much attention. Will fix this.
> add_images:
> It may be interesting to take an alternative approach to loading all images known in the directory, and instead repeatedly ask the database if a given file is already present. e.g.
> select 1 from images where dir_id = :dir_id and image = :image
> This will either return 1 or you will get an empty result set from your cursor. This will probably be faster and will definitely use less memory.
As you have pointed out, I'm thinking of upsert syntax to just ignore duplicate inserts. Originally I was trying to figure out a way to do "SELECT id FROM images WHERE image IN ?", and I tried every combination of qmark and dicts, but it was no dice. It could be done with string formatting, but I'd rather not use it since I'm doing parametrized queries everywhere else; that would be bad.
> create_insert_tuple:
> This whole method can probably be gotten rid of by switching your logic to using the upsert syntax to apply tags.
It would be extremely cool and I will try hard for this. This one function was what kept me debugging all of today, and the problem was that I was passing "directory" instead of "dir_id" in tagg.py. There was also a problem with an if statement where I errantly assumed empty list == False, but it was not so. I also don't like setting image_id for every image; allocating it is inefficient.
> General Notes
> Where are you actually committing to the database? AFAICT you're never writing anything to disk unless somehow executemany is autocommitting???
Ah yes, I have autocommit turned on as long as there are no exceptions; I will add an explicit commit. I was not even closing the db, mistakenly thinking that the context manager would handle closing it for me. I just learned today that I have to close the cursor too.
> In general, shift all logic related to processing data into the database. It is very good at doing those tasks very fast. Brushing up on sql will go a long way here towards simplifying your python code as much as possible. The sqlite docs are wonderful.
> Try to avoid doing data processing in pure python, it is very not fast at that.
Is it correct to use "sqlite3" directly, using functions, passing the cursor around them, and writing sql statements inside functions? I'm not sure if there is a standard way to do this, but personally it seems off. Since python is OOP, shouldn't I be creating classes and adding methods? i.e. cursor.select_image(image_id) == cursor.execute("SELECT image FROM images WHERE id = ?", (image_id,))
My absolute first draft of it, compared to this, was such a mess: I was not using argparse and the schema was a HORROR. I was certain I would have to make a separate table for every tag, which would hold image ids. A friend pointed me to the correct many-to-many relation table and that was such a bombshell moment. I learned sql at the end of last year, and at that time I did make correct many-to-many schemas, but since then I have not kept up with it. I LOVE sql(ite) and will read all the docs you have linked with fervour. Thank you VERY much for giving such a detailed review. I will follow up with you when I implement queries.
(apolocheese for typos and broken sentences, I deleted post because it had formatting issues, hope they are fixed this time)
> but then friend pointed that database should not be doing those checks
You definitely want the DB to enforce as much as you can, but filename length, maybe not. sqlite3 data types aren't enforced, that is correct, but they do usually act as hints to your ORM or library.
> Is it correct to use "sqlite3" directly and using functions and passing cursor around them, writing sql statements inside function.
For a small enough program, that's fine. But you should probably wrap it in an object so you don't have to deal with details of the query or the DB in other functions, keep the logic clean so functions are topical and legible. I don't know how Python deals with iterators, maybe better to model a cursor as a thing that yields rows to whoever asks.
> I had read about autoincrement being slow but I thought alternative to it would be me generating unique ids myself. I will fix code to use rowids.
I may not have made myself clear here: you need not fuss with rowids yourself, you may simply use integer primary key.
> I thought it would be optimized. I will refactor around this. I was going for dicts since qmark style is going to be deprecated (and as furthur along you have said, I agree dicts are cleaner)
This is a good intuition! In other database drivers (psycopg2 for example), executemany is optimized.
> As you have pointed out, I'm thinking of upsert syntax to just ignore duplicate inserts. Originally I was trying to figure out a way to do "SELECT id FROM images WHERE image IN ?", and I tried every combination of qmark and dicts, but it was no dice. It could be done with string formatting, but I'd rather not use it since I'm doing parametrized queries everywhere else; that would be bad.
Good catch here, I don't know how I missed using upsert here but you're absolutely right, this is a good use case for it.
> Ah yes, I have autocommit turned on as long as there are no exceptions; I will add an explicit commit. I was not even closing the db, mistakenly thinking that the context manager would handle closing it for me. I just learned today that I have to close the cursor too.
When you use the actual python context manager syntax. (One caveat: in the stdlib sqlite3 module, "with connection:" only manages the transaction, committing or rolling back; it does not close anything, and cursors don't support "with" at all, so wrap both in contextlib.closing.)
from contextlib import closing
with closing(sqlite3.connect("yourdb.sqlite")) as db:
    with closing(db.cursor()) as cursor:
        # do your query etc etc
        ...
    # here the cursor is now closed
    db.commit()
# here the database connection is now closed
In this case, the python context manager will automatically call the appropriate close function as you fall out of the indented block. If you do it in the imperative style you have been using, you must call close() on each cursor and database connection yourself. In a purely technical sense, all of that is cleaned up when the program exits, and this is not a long-running server process, but it's good practice to use one of these two methods to keep things tidy yourself. Many other objects use this protocol; iirc you used it when opening and reading a file.
> Is it correct to use "sqlite3" directly, using functions, passing the cursor around them, and writing sql statements inside functions? I'm not sure if there is a standard way to do this, but personally it seems off.
You may pass the connection or cursor object in a function call. However, things get hairier if you start to multithread or multiprocess. sqlite3 module objects are not threadsafe and you will get programming errors if you try to use e.g. a cursor outside of the thread it was created in. This is not really a relevant concern here, though.
As for code style, it might be slightly "cleaner" to pass the database object to functions that use the database, create a cursor for the scope of the function, and close the cursor before you return. This will still allow the callee function control over, e.g., committing to the database in case of a DML statement.
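A sketch of that shape, with a hypothetical add_tag function (the schema is made up for illustration):

```python
import sqlite3
from contextlib import closing

def add_tag(db, name):
    # The cursor lives only for the scope of this function; the caller
    # keeps the connection, and this function decides when to commit.
    with closing(db.cursor()) as cur:
        cur.execute("insert into tags (tag) values (:tag)", {"tag": name})
    db.commit()

db = sqlite3.connect(":memory:")
db.execute("create table tags (id integer primary key, tag text)")
add_tag(db, "wizard")
print(db.execute("select tag from tags").fetchone()[0])  # → wizard
```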
> Since python is OOP, shouldn't I be creating classes and adding methods? i.e. cursor.select_image(image_id) == cursor.execute("SELECT image FROM images WHERE id = ?", (image_id,))
It is bad practice to attach methods to objects after the object is instantiated, which is what you would be doing were you to do as you stated in this example.
I wouldn't get too caught up in trying to follow any particular paradigm here; your codebase is small and it's a for-fun project. Short of using an ORM to abstract away handling the database connection, carefully passing a connection object around is a-ok, and is what the ORM is doing under the hood, anyway.
> I LOVE sql(ite) and will read all the docs you have linked with fervour. Thank you VERY much for giving such a detailed review. I will follow up with you when I implement queries.
You're welcome! SQLite is one of my favorite pieces of software, too, if you couldn't tell! I'm a sql and python guy so I'm happy to help out and had fun reading through this and providing commentary today.
> You definitely want the DB to enforce as much as you can, but filename length, maybe not. sqlite3 data types aren't enforced, that is correct, but they do usually act as hints to your ORM or library.
:yousoro:
> For a small enough program, that's fine. But you should probably wrap it in an object so you don't have to deal with details of the query or the DB in other functions, keep the logic clean so functions are topical and legible.
I will look into this; I'm not sure how I would go about it at all, since I have a very surface-level understanding of OOP.
> I don't know how Python deals with iterators, maybe better to model a cursor as a thing that yields rows to whoever asks.
I'm passing it around to execute relevant queries on it.
@laurel@Pi_rat@p@SuperDicq Abstracting your database connection into a class is usually not strictly necessary unless you need the abstraction. For instance, if you were trying to support many database connectors, not just one. You would wind up needlessly proxying function calls and little else. Unless you are going to be implementing novel or QoL features for yourself it's not really necessary in at least this specific case.
There's a separate approach where Path would be a model that itself interfaces with the db, e.g. Path.add(path) or Path(path).delete(), which is a pattern you might see if you were using an ORM like sqlalchemy, or the django ORM.
There's more than one way to skin a cat, but his approach and his scope are well matched; there is neither the requirement nor the need to inject OOPisms when the imperative code is clean and succinct.
> I will look into this, Im not sure how I would go about it at all since I have very surface level understanding of OOP.
I think he meant to make a class specifically for database operations. The database connection/initialization code will go into the class init, and the cursor will be a class variable only accessible to the member functions. Then in your main routine you instantiate the class, say an object called db, which you then use to call the methods, providing only the information necessary. For instance db.addtag(path, newTagString), db.deletePath(path), etc.
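Something like this minimal sketch; the class and method names are just the hypothetical ones from my description, not from the real script:

```python
import sqlite3

class TagDB:
    # Hypothetical wrapper class for illustration only.
    def __init__(self, path=":memory:"):
        # Connection/initialization lives in init; the connection is
        # only touched through the methods below.
        self._db = sqlite3.connect(path)
        self._db.execute(
            "create table if not exists tags (path text, tag text)")

    def addtag(self, path, tag):
        self._db.execute("insert into tags values (?, ?)", (path, tag))
        self._db.commit()

    def tags_for(self, path):
        rows = self._db.execute(
            "select tag from tags where path = ?", (path,))
        return [r[0] for r in rows]

db = TagDB()
db.addtag("cat.jpg", "animal")
print(db.tags_for("cat.jpg"))  # → ['animal']
```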
It's been a while since I wrote any Python btw, I only use jupyter-notebook with some forecasting and data analysis libs nowadays.
> I will look into this, Im not sure how I would go about it at all since I have very surface level understanding of OOP.
OOP is a convenient way to model this, but essentially it's just keeping things organized by topic. Separation of concerns makes things much more legible is all.
> I'm passing it around to execute relevant queries on it.
Ah, okay. So what I was talking about was you run a query and then encapsulate the results in an iterator or something so the function that uses the results just gets fed this iterator that sends back rows and then it closes its own cursor when it gets to the last one and starts returning nils, whatever those are in Python. (I think it's `null`? I've written basically no Python from scratch, just editing Python. It's basically the same language as all the 90s scripting languages.)
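In Python that idea maps onto a generator; a rough sketch (schema invented for illustration), where the function owns the cursor and it gets closed once the rows run out:

```python
import sqlite3
from contextlib import closing

def iter_images(db, dir_id):
    # Generator that yields rows one at a time; the closing() block
    # shuts the cursor when the rows are exhausted (or the generator
    # is discarded), so callers never touch a cursor directly.
    with closing(db.cursor()) as cur:
        cur.execute(
            "select image from images where dir_id = ?", (dir_id,))
        yield from cur

db = sqlite3.connect(":memory:")
db.execute("create table images (dir_id integer, image text)")
db.executemany("insert into images values (?, ?)",
               [(1, "a.jpg"), (1, "b.jpg"), (2, "c.jpg")])
print([row[0] for row in iter_images(db, 1)])  # → ['a.jpg', 'b.jpg']
```

(In Python the end-of-iteration signal is StopIteration rather than a nil sentinel, and a for loop handles it for you.)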
If it's a singleton, it's not so bad to just make it globally accessible instead of passing it around. (For a good time, put it in a memoized function that opens the DB on first call; this is a little nicer than making it a global variable. Generally in a bigger project you'll end up including or writing something to manage a connection pool, but the code is still small and it's sqlite3 so no pool anyway. But don't get carried away writing code to manage things that you don't need managed yet, you know what I mean.)
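The memoized-accessor idea is a one-liner with functools.lru_cache; ":memory:" here stands in for the real db path:

```python
import sqlite3
from functools import lru_cache

@lru_cache(maxsize=None)
def get_db():
    # The body runs once, on first call; every later call returns the
    # exact same connection object from the cache.
    return sqlite3.connect(":memory:")

print(get_db() is get_db())  # → True
```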
@munir@RedTechEngineer@Pi_rat@feld@laurel@p@SuperDicq@hj it has a journal and everything, so it shouldn't be, but the only time I had a partition corrupt was in fact f2fs, after some scuffed kernel update. That was a system partition, though.
> I may not have made myself clear here: you need not fuss with rowids yourself, you may simply use integer primary key.
You were clear, I just phrased my sentence weird.
> In this case, the python context manager will automatically call the appropriate close function as you fall out of the indented block. ... This will still allow the callee function control over, e.g. committing to the database in case of a DML statement.
Noted, will pass the connection. (Personally, this will also look better than 2 levels of indentation.)
> Short of using an ORM to abstract away handling the database connection, carefully passing a connection object around is a-ok, and what the ORM is doing under the hood, anyway.
I was looking into sqlalchemy but, combined with everything else, my attention was spread too thin, so I left it. Nice to know it's not wrong to pass the connection.
> You're welcome! SQLite is one of my favorite pieces of software, too, if you couldn't tell! I'm a sql and python guy so I'm happy to help out and had fun reading through this and providing commentary today.
Eyy!! :backtogab:
I forgot to ask yesterday: how should I be testing this? For now I manually open up the interpreter and db and test, but this gets very tedious. My guess is unittests + a test.db which would return expected row(s). I wanted to write them but procrastinated on this, and now I shudder at the magnitude of what I would have to cover; it probably won't be as bad once I get started and write a few.
Hope this werks, I was using new line and single space
> OOP is a convenient way to model this, but essentially it's just keeping things organized by topic... I think it's null?... small and it's sqlite3 so no pool anyway.
Noted. It's called "None" around these parts. And yes, I don't think this will be multi-threaded, not even the GUI version.
> But don't get carried away writing code to manage things that you don't need managed yet, you know what I mean.
> I forgot to ask yesterday: how should I be testing this? For now I manually open up the interpreter and db and test, but this gets very tedious. My guess is unittests + a test.db
For your needs, the built-in unittest module will probably suffice. You'll subclass a test case, write some setup code that is shared between tests, and then write your individual tests. The documentation will give you some good examples. I wouldn't worry about using 3rd-party testing frameworks for now (there are plenty).
> now I shudder at the magnitude of what I would have to cover
You've got ~100 loc, that's nothing. Your only wrinkle will be the test database, but just use an in-memory one (sqlite3.connect(":memory:")) and it will not be persisted to disk for you to clean up later. Or, if you must, you can write some teardown code in your test case that deletes the on-disk database after all the tests run. You've got options.
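A minimal unittest sketch along those lines (the schema and test are invented for illustration):

```python
import sqlite3
import unittest

class TagTests(unittest.TestCase):
    def setUp(self):
        # Fresh throwaway database for every test; nothing touches disk.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "create table tags (id integer primary key, tag text)")

    def tearDown(self):
        self.db.close()

    def test_insert_tag(self):
        self.db.execute("insert into tags (tag) values ('meme')")
        row = self.db.execute("select tag from tags").fetchone()
        self.assertEqual(row[0], "meme")
```

Drop that in a file like (hypothetically) test_tagg.py and run python -m unittest from the project directory; setUp/tearDown run around every test method.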