So when I dived down the rabbit-hole of the Nested Set Model, of course I created a sample database to write & test the code against.
It’s that time again! This month, Todd Kleinhans (b/t) asks us how we use databases in our day to day life, i.e. personal use or “outside of our work / day-job”. Actually, the question is kinda vague — because if you think about it, we all use TONS of databases in our daily lives. Your phone’s contact list, your calendar, online shopping, banking.. the list goes on. As I’ve said before, Data is everything.
But what I think he meant, and the way most of the community has interpreted it, is “How do you manage/administrate/build/work-with/develop databases in your day-to-day life outside of work?”. So we’ll go with that.
Now this may out me as “not a real DBA” or some such nonsense, but honestly.. I don’t spend much of my time creating silly playground databases. Not that anybody else’s are ‘silly’ — just look at some of the fantastic posts for this month! Such neat ideas brought to life.
Special shout-out to Kenneth Fisher, who, if you look closely at his screenshot (and it’s not even related to this post), committed the abhorrent sin of creating a database name of pure emojis — FOR SHAME sir! But also you’re awesome. ❤
Me, I’m more of a quick-n-dirty spreadsheet guy. If I need, say, an inventory of my computer parts & gadgets so I know what I can & can’t repair, what materials I have to work with as I tinker, etc.. well, I create a Google Sheet. And it serves my needs well enough. (Spoiler alert: yes, you can view that one; I shared it. But it’s fairly outdated since I moved in March and haven’t had time to re-do inventory since.. last autumn.)
But for blogging in the tech field, you gotta get your hands dirty. So when I dived down the rabbit-hole of the Nested Set Model, of course I created a sample database to write & test the code against. And there have been some additional bits & pieces for blog demos and GitHub samples.
Most of the time, I’m creating databases / entities on SQL 2016 Developer Edition. Of course by now, that’s 2 major versions ‘behind’, but since I don’t run Linux personally (yet?), and I’m not a conference speaker (yet??), I don’t feel a burning need to upgrade. It’s FAR superior to Express Edition, though, so please for the love of all that is holy, if you find yourself using Express for personal/playground use, save yourself the headache and go grab Developer.
Containers/Docker? Meh. If you want to start playing with those, definitely look at 2017 or higher. It sounds appealing in theory — “just spin it up when you need it, spin it down when you don’t!” — and I’m sure that’s great if you’re starved for resources on whatever laptop you’re working with, but if you’ve done your due diligence and set your local SQL instance to appropriate resource limitations (hello, ‘max server memory’ and file-growths!), I’ve found that its impact is quite tolerable.
But come now. Surely this isn’t just a “shameless self-promotion” post or a grumpy-old-DBA “get off my lawn” post. Right?? Right!
To you folks out there creating your own nifty little databases for personal projects, learning/development, or even hopes & dreams of building a killer app on top of it one day — you’re amazing! Keep doing what you do, and write about it, because I love reading about it. Heck, go try to create the same model in PostgreSQL or MariaDB and see how it goes. We could all use a little cross-stack exposure once in a while.
That’s all I have for this month; short & sweet. I need to finalize plans for virtualizing our main SQL instances (which is really just a migration off bare-metal & onto VMs) within the coming weeks. Yes, we’re that far behind the curve. Now get off my lawn!
Welcome to the first post of the new year. I’ll be keeping things a little on the lighter side for now. I’m still very into my work and learning lots of share-worthy things in the data world. But for now, movies!
I also want to take a moment to appreciate those who reached out to us after our devastating loss. Thank you for your thoughts and prayers. Please continue to remember our family as we struggle to find a sense of normalcy.
So, some of my elder moviegoers asked me the question that many people have been asking over the last year or two: “What Marvel movies do I really need to watch before Infinity War?”, or more recently, “before End Game?”. More generally, which ones are worthwhile viewing to a casual non-geek, to someone who doesn’t need to obsess over every little minutiae, someone who is not by nature a “comic book movie lover”. It’s a completely fair question, and honestly it needs more.. less nerdy answers.
Data isn’t literally everything. I mean it is, technically, but it’s not all super happy fun times, so we need to take a break once in a while and do something less neuron-intensive. Thus, my new segment: movie reviews! Because, despite what you may have read, all work and no play make Nate a dull boy. And yes, I promised you this blog would be professional. Mostly. I remember specifically using that word. So don’t wag your naggy finger at me. If you don’t like it, you can simply avoid the tags like #offtopic and #movies.
Yeah so I missed the boat by a few days week. That’s pretty much my M.O. This month’s T-SQL Tuesday #100 is hosted by the author of sp_WhoIsActive and the creator of T-SQL Tuesday himself, the legendary, the incomparable, AdamMachanic.
The Year is 2026
Despite IT’s best efforts to kill the relational database, it’s still alive and kicking. Sure, it’s mostly in the cloud, and we’ve largely solved the problems of scalability, availability, and “traditional” maintenance, but the DBA still plays a critical role in the IT organization. He/she is more of an architect and an automator, someone who understands the business and development needs as they relate to data — its storage, availability, security, and performance — and can leverage cohesive data platform technologies to provide those services and meet those needs. But the fundamental issue of data quality still haunts even the best environments, because at the end of the day, when you rely on a human to enter text into a field, you’re gonna get garbage inconsistency. Thus, we’re still fighting that fight, if only to appease our “data scientists” and machine-learning models so that they stop whining about it.
SQL Server itself has evolved. After realizing that it was pretty silly to bolt-on a hacky “graph db” component to what is, at its core, a relational engine, MS broke that off into its own product, “Microsoft GraphDB Server”. But the good news is, SQL & GraphDB talk to each other seamlessly; in fact all of the data-platform products integrate and inter-operate much more smoothly than 10 years ago.
We finally have a single unified CE (Cardinality Estimator), which is intelligent enough to know which paths/plans to use for a given query, so we don’t need to mess with those awful trace-flags anymore. Indexes and Statistics are all but self-maintaining; the DBA rarely has to step in and mess with them. Part of the reason for this is that SQL Server yells at you if you try to make a GUID the clustering-key, or other such nonsense. =D
Columnstore is everywhere; traditional row-store (b-tree) indexes barely exist. JSON storage & indexing inside SQL Server is much better, but it’s still preferable to use a document-store DB if you can. Hierarchical structures (not to be confused with graphs) are easily implemented and supported, without having to resort to old hacky models. And user-defined functions (all types) perform nearly on-par with stored procedures.
They’ve replaced sp_who and sp_who2 with the code from sp_WhoIsActive, and made SSMS Activity Monitor suck less & actually be semi-useful as a basic first-response monitor. Profiler was officially killed off, and XEvents has come into general widespread usage — largely because MS finally dedicated some hard-core dev time to improving its GUI & making it much easier to use. Native Intellisense finally works, and works well, for all but the most obscure/weird things, and is much less chatty in terms of network traffic to/from the server.
And finally. FINALLY. Each database has its own TempDB.
This is an oldie but goody. A) Developers want their apps to manage the record identifiers, but DBAs want the database to do it. B) Developers prefer abstracting the identity values out of sight/mind, DBAs know that occasionally (despite your best efforts to avoid it) your eyeballs will have to look at those values and visually connect them with their foreign key relationships while troubleshooting some obscure bug.
But there’s more to it than that. See, none of those arguments really matter, because there are easy answers to those problems. The real core issue lies with the lazy acceptance of GUI/designer defaults, instead of using a bit of brainpower to make a purposeful decision about your Primary Key and your Clustered Index.
Now wait a minute Mr. DBA, aren’t those the same thing?
NO! That’s where this problem comes from!
A good Clustered Index is: narrow (fewer bytes), unique (or at least, highly selective), static (not subject to updates), and ever-increasing (or decreasing, if you really want). NUSE, as some writers have acronym’d it. A GUID fails criteria ‘N’ and ‘E’. However, that’s not to say a GUID isn’t a fine Primary Key! See, your PK really only needs to be ‘U’; and to a lesser extent, ‘S’. See how those don’t overlap each other? So sure, use those GUIDs, make them your PK. Just don’t let your tool automagically also make that your CX (Clustered indeX). Spend a few minutes making a conscious effort to pick a different column (or couple columns) that meet more of these requirements.
For example, a datetime column that indicates the age of each record. Chances are, you’re using this column in most of your queries on this table anyway, so clustering on it will speed those up.
Most of the time, though, if your data model is reasonably normalized and you’re indexing your foreign keys (because you should!), your PKs & CX’s will be the same. There’s nothing wrong with that. Just be mindful of the trade-offs.
Battle 5: CSV vs TAB
Often, we have to deal with data from outside sources that gets exchanged via “flat files”, i.e. text files that represent a single monolithic table of data. Each line is a row, and within each line, each string between each delimiting character is a column value. So the question is, which is easier to deal with as that delimiter: comma, or tab?
String data values often have commas in them, so usually,the file also needs a “quoting character”, i.e. something that surrounds the string values so that the reader/interpreter of the file knows that anything found inside those quotes is all one value, regardless of any commas found within it.
But tabs are bigger.. aren’t they? No, they’re still just 1 byte (or 2, in Unicode). So that’s a non-argument. Compatibility? Every program that can read and automatically parse a .csv can just as easily do so with a .tab, even if Windows Explorer’s file icon & default-program handler would lead you to believe otherwise.
I recently encountered an issue with BCP (a SQL command-line utility for bulk copying data into / out of SQL server), where the csv was just being a pain in the arse. I tried a tab and all was well! I’m sure it was partially my fault but regardless, it was the path of least resistance.
Battle 6: designers vs scripting
This should be a no-brainer. There is absolutely no excuse for using the table designer or any other wizardy GUIs for database design and maintenance, unless you’re just learning the ropes. And even then, instead of pressing ‘OK’, use the ‘Script’ option to let SSMS generate a `tsql` script to perform whatever actions you just clicked-thru. Now yes, admittedly those generated scripts are rarely a shining example of clean code, but they get the job done, even with some unnecessary filler and fluff. Learn the critical bits and try to write the script yourself next time– and sure, use the GUI-to-script to double check your work, if you still need to.
Confession: I still use the GUI to create new SQL Agent Jobs. It’s not that I don’t know how to script it, it’s just that there are so many non-intuitive parameters to those msdb system-sp’s that I usually have to look them up, thereby spending the time I would have otherwise saved.
Bonus round: the pronunciation of “Data”
Dah-tuh, or Day-tuh? Or, for the 3 people in the world who can actually read those ridiculous pronunciation glyphs, /ˈdeɪtə/ or /ˈdætə/ ? It’s a question as old as the industry itself… or maybe not. Anecdotally, it seems like most data professionals, and people in related industries, tend to say “day-tuh”; while those in the media and generally less technical communities tend to say “dah-tuh”. (Where the first syllable is the same vowel-sound as in “dad” or “cat”.) This likely means that the latter is more popular, but the former is more industrially accepted.
In either case, it doesn’t really matter, because at the end of the day, we’re talking about the same thing. So if some dogmatic DBA or pedantic PHB tries to correct your pronunciation, tell ’em to stop being so persnickety and get on with the task at hand!
On a lighter note than usual, I thought it was time I weighed in on some of the long standing “programmer holy wars”, but with a little DBA-twist (like a twist of lime, only less delicious). Like any good holy war, this will be full of posturing, pontificating, and political correctness. And I probably won’t even commit to a particular side on some issues. But hey, isn’t that the point?
Battle 1: Tabs vs. Spaces
Text editors and IDEs have long been mature enough to handle “smart tabs” and preference-based tab size. However, you will occasionally have to copy-paste code into a non-code-oriented environment, such as an email or a document, where of course the tab size is based on inches rather than spaces in a monospace font. I will admit in those rare instances, tabs are annoying. But what is more annoying is the inconsistency you can get when spaces are used incorrectly, especially in the midst of lines in a sad attempt to do some kind of vertical alignment. Plus, if you happen to have a different spacing-size preference than the original code author, you’re now battling that visual discrepancy as you read & maintain said code.
So I prefer tabs. But I won’t fight my team on it if everybody else prefers spaces — that’s what those settings in the editor/IDE are there for! I will happily conform with the best of them. A quick Google says I’m in the minority anyway — which I’m OK with.
Battle 2: The Case for Casing
But that’s not really what this battle is usually about. Most often, it’s about your names, i.e. the identifiers for objects/methods/variables/procedures/APIs/etc. that your team and your developers have to come up with on a constant basis. And usually it comes down to camelCase, TitleCase (which are often incorrectly used interchangeably! and is apparently better known as PascalCase, which I just learned today, or possibly re-learned after several years), or lower_case_with_underscores (which, in another learning moment, I discovered is named snake_case! How cool is that?). Rarely, if ever, do people argue for ALLCAPS in these areas — it just feels.. obnoxious.
As with any programmer-y topic, you can dive down the rabbit-hole and dissect layer upon layer of nuance in this battle until you’ve lost all semblance of productivity. Because casing is, in some languages, important; while in others it’s simply convention-based, dependent on the abstraction level or family of things you’re talking about. For example, C# Class names are TitleCase, and so typically are Methods, while objectinstances are usually camelCase; public members can be TitleCase or camelCase, and private members can be _underscore_led, or whatever flavors for each that your boiler-plate/template system prefers. Scoped variableNames are most often camel’d as well, while global constants are typically CAPS_WITH_UNDERSCORES. And god help you if you ask a team of more than 3 people what their dependency packages’ names should look like.
So in this battle, I have to play Switzerland. I’m not vehemently opposed to any particular flavor of casing, finding it best to work within the conventions of the language and tool-set at hand.
Side-battle: Spacing in Names
That said, I can’t stand names/identifiers with actual white space in them, but that’s a somewhat different battle. Most languages don’t even allow that, but most RDBMSs will happily accept your ridiculous My Cool Database and its resident Silly Tables and Happy Column 1/2/etc. as long as you properly “quote” them (surround them with [square-brackets] or `backticks`, depending on the SQL flavor). If you submit that kind of nonsense to me, I will find you, and I will slap you with a large trout.
Battle 3: ORM vs Stored-Procs (vs Linq?)
This is that little twist-of-DBA as promised. I recently read an interesting post related to this topic, and essentially the point was this: Developers have “won” (won what? I thought were all on the same side!), the ORM is here to stay, and as DBAs/DBDevs, we (you/I) need to build up our understanding of them so that we A) know them even better than our devs, and B) can troubleshoot performance issues with them.
I think there’s some truth to that, and some necessary context as well. Ideally, yes, I would be an ORM expert on whatever 1 or 2 specific frameworks my colleagues are using (Entity Framework, most likely), and any time there was a potential performance challenge with a app-to-database call, I’d be able to parachute-in and sprinkle some magic dust and make it all better. But I’m also the one DBA (out of approx. 1.3 total), serving 4 teams of 3-6 devs each, so in the immortal words of meme-dom:
Ain’t nobody got time for that!
Now I’m not making excuses. All I’m saying is, the burden of understanding is on more than just one team member or job-role. If your dev team is adapting an ORM, said devs need to learn how it works too — at least enough to help with basic performance troubleshooting. Even if it’s just the ability to extract, from a debug session, the actual T-SQL code that’s being sent to the server, and give me a sample query to analyze for performance bottlenecks.
Let’s step back a bit. It’s all about using the right tool for the job, yes? ORMs are meant for basic CRuD operations and simple data access patterns, right? So why try to build complex business logic into them? Because, like it not, teams do build complex business logic into the data layer — despite our protests and soapbox sermons to not do it. And because the vast majority of applications we’re dealing with are not greenfield. Furthermore, ORMs tend to work best when the data model is well-defined, or the database is modeled well (well-modeled?). And again, we don’t all get to work with unicorns in utopia.
Put it this way: If you want an efficient, performant module of data-layer business-logic against your SQL database, it’s likely going to be a stored procedure carefully crafted by a DBA/DBDev. Could you achieve the same results from the app layer, using Linq and/or some mix of ORM and code? Probably. Do you have the time and patience to do so? Maybe not.
So once again, I’m Switzerland. Well, preferably a more pragmatic version — what country would that be? Norway? Anyway. Use the methodology that’s the best compromise between “right tool for the job”, “optimized developer productivity”, and “easiest to troubleshoot”. It’s a tough call, but that’s why we get paid.