So you want to build a piece of high-performance, deeply pipelined hardware, and write a device driver for it?
The hardest problem is usually: "How do I bring it to a clean stop?"
Get this right. Early.
I hear SW folks proposing "optimisations" that stop the pipeline 10 ms faster at the expense of corrupting kernel memory when unexpected errors occur. And I hear HW folks asking "Why do you need to do that? Just reset it!"
Don't be like these people.