Just Another Brick (in the) Wall

Over-engineering may be foolish, but under-engineering is just as perilous.

This month’s T-SQL Tuesday topic, thanks to Wayne Sheffield, is a doozy.  It’s difficult to admit failures, even if we eventually overcame them.  Mostly, it’s tough to fess up to those hours upon hours of what feels like wasted time preceding the hopefully inevitable breakthrough.

Let me tell you a tale of square pegs and round holes.

Get your mind out of the gutter.

And strap in.  This puppy went over 2k words.  I didn’t intend it that way, but, c’est la vie.

A Short History Lesson

I used to work for a software company that made K-12 educational assessment & reporting products.  A large part of K12ED is dealing with learning standards, these historically fraught named concepts that students are supposed to have “mastered” and learned during their time in the US public school system.  You ever hear a reference in pop-culture or media referring to “teaching to the test” or “state standardized testing” or similar?  That’s what they’re talking about.

In the late 90’s, a bunch of states, including California in 1997, finalized and centralized a “list” of sorts, almost a database (but not formally, because let’s face it, this was still the early days of what we know as modern IT), of said Standards, which came to be unofficially known as the “CA ’97 Standards”.  For over a decade, these learning goals dictated what was taught when in our schools.  Government funding incentives, based on state standardized test scores, drove instruction to the point that teachers had little if any time to dynamically mold their students’ minds into critically thinking, multi-faceted, creative individuals.

But this article isn’t about my bias.  This is about the broad, sweeping shift in the K12ED landscape called “Common Core”.  As the technology sector grew more mature and started permeating more “traditional” industries, a vocal minority of thought-leaders had what they deemed an epiphany.

Hey, what if we took the old-guard educational bureaucracy, and all those disparate state standards, and turned it into a unified, technology-driven learning ecosystem?

Sounds great in theory, right?  I mean, surely their intentions were good?  Well, you know what they say about the road to Hell and how it’s paved…

Gosh, there goes my bias again.  Sorry, let me just tuck that back in its pocket.

Anyway.  These new Core Standards introduced some new ways of thinking.  For example, that some learning concepts are inter-related and “cross-cutting” (a fancy way of saying that sometimes a Math concept requires a fundamental Reading-Literacy knowledge-point to fully grasp).  This had some very interesting impacts on the underlying technology systems which relied on, and housed, said Standards.  System which, I might add, had existed for over a decade at this point, in many cases.

Bringing it Back to Data

Our company’s system was one such.  Our key partner’s system, from which we quite literally inherited the traditional, relational database structure to house the CA ’97 Standards, was another.  You see, back before RESTful APIs ran the world, software systems relied heavily on what we call “local data stores”.  In order to show the teachers and administrators, who primarily used our system, the Standards, in which their students were performing well (or poorly), we had to relate those Standards to the test questions that said students were tested on month after month, year after year.  And, like so many other small businesses of the late 90’s / early 00’s, we had a trusty ol’ SQL Server lying around, just waiting to be loaded with all our precious data.

This was fine, for the most part.  The legacy Standards conformed reasonably well to a relational data model, even though we had to throw in a bit of hierarchy (using the good ol’ adjacency list scheme).  There wasn’t a complicated set of relationships from Standards in different subjects (Math, Science, History) to each other, and people didn’t care to do much in-depth analysis beyond “are my students getting well-prepared for the state tests?”.

Enter Common Core

You parents thought these things were complicated and convoluted — just Google “common core math problem” and you’ll find no shortage of critical satire.  Well, the underlying data structures required to store these new Standards were going to be significantly more complex as well.

One of my main jobs, for about a year or two, was to oversee the importation of said Core Standards into our SQL database system.  On the surface, it seemed reasonable — we had a hierarchy concept already, and we had a roll-up & drill-down mechanism for the handful of different “levels” of said hierarchy.  But it was all very static.  What that means, for us tech-heads, is that it was not easy to change or extend; not adaptable to new and expanded requirements.  The older Standards adhered to a fairly strict hierarchy, and each level of roll-up had a distinct name.  With Common Core, they broke out of that old mold, while simultaneously keeping some of the names only to change their meaning depending on context.

Think of it this way.  A group of cattle is called a herd.  A group of sheep is also called a herd.  And a group of seagulls is called a flock.

And I ra-a-an.. I ran so far a-wa-a-ay…

Sorry, where was I?  Right, group names.  So what if the government suddenly decided for us that a group of sheep will from now on be called a ‘gaggle’.  But only if they’re all female.  If the group contains any male sheep, it’s called a ‘herd’ still.  And groups of cattle will be still be called herds, unless it’s purely a group of bulls being raised for beef, in which case we call it a ‘meatsock’.

Have I lost you yet?  Of course I have!  This is pure nonsense, right?  Language does not work this way.  Moreover, hierarchies of groups of things do not work this way.

But I digress.  There was, despite my jest, a method to the madness of the new Common Core hierarchy groupings.  And I did learn it and understand it, for the most part.  The problem was that it threw our existing legacy data model to the wind.

Enter Academic Benchmarks

As usual with a legacy software system in a small company, the resources and buy-in for a data-layer overhaul were nil.  So it fell to us intrepid DB-Devs to shove that snowflake-shaped peg into the very square hole of the old relational model.  We sketched out plenty of ERDs, brainstormed how to tack-on to our existing structures, but nothing short of a miracle would make this new data conform to these old static norms.

Thankfully, the “geniuses” (and yes, that is used sarcastically) over at Academic Benchmarks, or AB for short (at least for the purposes of this post), had already done this.  And we paid them, thousands of dollars per year, for the convenience of using their GUIDs to identify Standards across different systems and vendors.  Never mind that they were just perpetuating the bad model of yesteryear; never mind that they provided zero support for data quality feedback loops.  We could happily take their Excel files or CSVs and load them nearly straight into our database.

Enter, and Exit, ASN

While I was searching for the words to express how insufficient our data model was, I came across this little gem from the Gates Foundation: Achievement Standards Network, or ASN.  (ASN stands for many other things, so as with all acronyms, it’s all about context; just fair warning for the Google-happy.)  The architects here had understood that learning standards needed a better and more flexible model, not only in terms of storage, but also in terms of data interchange format.  This new kid on the block called JSON had been making waves for a while, and was subsequently widely adopted by the tech industry in general, so it stood to reason that this would be the preferred format for publishing and serving the Standards from ASN.

Bonus: it was FREE.  Yes, really.  What a wonderful thought, I said to my team, to imagine never having to pay those crooks at AB another red cent!  Eventually.  After years of transition.  But alas, it was not to be.  See, AB had been around the industry for a long time.  They had their hooks in almost every learning content publisher and assessment vendor.  So as tempting as this shiny new source of academic truth may have been, sadly, it was not practical.

Enter the Contractor

Somewhere around the same time, we took on a promising new developer who, not only had a very strong background in Math and CS fundamentals, but who had also proven his worth with real world applications with actual “in the wild” deployments and users.  He was a bit arrogant, actually.  One could argue that he’d earned it, perhaps, but we didn’t appreciate always being told everything we were doing wrong, constantly, to the point that it was hard to hear the more important bits of wisdom and insight into how we could improve.

Despite that ego, one of his biggest contributions to the team was a fresh impetus to learn new things and branch out to new technologies.  To start looking at new Javascript frameworks.  To revisit OO fundamentals like Dependency Injection, and stop writing such procedural code.  To consider NoSQL data stores.  In particular, graph data stores.

Sadly, that last part came in too little, too late.

Side-note, on the crest of the micro-services wave, he diagrammed and proposed an entire system re-write for our core software product, using micro-services architecture and the concept of small purpose-dedicated data stores.  It looked real purty, but was ultimately and completely impractical for a company of our size.  If we were a “true startup” with millions in VC funding coming out the ears, and could afford to throw a brand new, young & hungry “2 pizza team” at the product, then sure.  But that was not reality.

The Brick Wall

No two bones about it: we had to support these new Standards.  So we soldiered on with our relational database tables.  And we tacked-on additional entities and relationships, logic and code, to make it “almost” do what the Common Core Standards wanted to do.  Effectively, what we were trying to do, was to shove that pretty round sparkly peg of graph data, into the dingy old square hole of a 15-year-old SQL table structure.

Somewhere along the way, AB caught up with the times and started offering other, less “flat” formats, for their Standards data.  So even though ASN was off the table, not all of our work toward JSON ingestion/conversion went to waste.  Consuming the data exports from the vendors wasn’t a problem — we’d already been doing this.  That was not the issue.

The issue, the brick wall against which we continually banged our heads, was the fact that we just plain couldn’t support the advanced & complex relationships and groupings (categorizations) of the new Standards.  Which turned out, in retrospect, not to be the end of the world, because honestly it would take years, if not decades, for the educational system’s old-guard mentality to even comprehend such relationships and categorizations, let alone how they could help shape classroom instruction toward better outcomes for their students.

Good lord, that sounded like a bunch of jargon.

What I mean, in plainer English, is that we spent a lot of time worrying and arguing about something that did not matter as much as we thought it did.  The consumers of our system, the teachers and principals and such, honestly didn’t care about all that.  They just wanted to know if their kids were on-track to be “Proficient” on the state tests so their funding would remain steady.  (I don’t give them enough credit, I know; teachers themselves also needed to know, on a regular basis, which kids needed special attention in what areas of learning, but that’s beyond the scope of most generalized reporting tools.)

Hindsight is Always 20/20

So what was the lesson here?  I don’t want to overlook the fact that we were still using the wrong data storage technology for the problem, fundamentally.  Or at least, the wrong data model.  But, we live in a real world where software needs to be delivered on-time, in-budget, with less than perfect teams who have less experience and expertise than they feel they should.  So instead of butting our heads against that brick wall, let’s try to remember to be pragmatic.  To be realistic about what’s feasible for who & when; and to adapt the parts and segments of our application infrastructure, appropriately and efficiently, to the business problems and use-cases at-hand.  Over-engineering may be foolish, but under-engineering is just as perilous.

T-SQL Tuesday #104: Code You’d Hate to Live Without

And that’s where we could use a little encouragement, I think — another DBA saying “Yay, it’s not just me!” makes it all worthwhile.

It’s that time of the month again!  Bert‘s fantastic invitation complete with YouTube vid is tempting indeed.  There are so many wonderful community scripts and tools for SQL Server DBAs.  But, in an interesting twist, he specifically asks us about something self-written or self-created.  And I’ll admit, I sometimes have trouble being exceptionally ‘proud’ of my home-brewed stuff.  Sure, I’ve blogged and GitHub‘d, and hopefully those have helped someone along the way.  Most of the time, though, those are specifically suited to a given use-case.

The more difficult code to share, I think, is the stuff that we keep in our “daily-grind” “Get-Stuff-Done” folders.  Scripts that we’ve tweaked and commented sporadically throughout the years, finding this edge-case or that tweak for a given scenario, most of which ends up being pretty environment-specific.  And that’s where we could use a little encouragement, I think — another DBA saying “Hey, it’s not just me!”, or “I knew there was another guy/gal out there who always has to do this kind of thing!”.  Thus, here I will attempt to show off something along those lines and see what you kind folks think.

Where Does it Hurt?

No, I don’t mean the famous wait-stats queries by Paul Randal (those are awesome).  I mean, what are the top few areas of your SQL environment where you’re always questioning something, or being asked to find some answer that’s not immediately obvious?  For me, there are 3.

Replication

Transactional replication is, in a word, ‘brittle’.  It works well when it’s working, and usually you don’t have to think about it.  But when you do need to check on it, or change anything about it (god forbid), or troubleshoot it (more like shoot it, amirite?), it feels a bit like trying to mess with a half-played Jenga stack.  Meaning, you might be just fine, but you might send the whole thing crashing down and have to rebuild it all.

peanut-brittle-from-the-joy-of-baking
As brittle as this stuff. But not nearly as delicious.

I won’t go into the whole troubleshooting aspect here, that’s too much scope.  But there’s a question that can often come up, especially from Devs or Biz-Analysts, and that is: “Hey, is TableX being replicated?”  And rather than subject my poor eyeballs to the replication properties GUI, I just run a little query, which reads from the distribution database and the actual database that’s being published (the ‘publisher’ db), and tells me a list of tables & columns that are being replicated.

Here it is.  Be sure to replace [dbName] with your actual published DB name.  Like StackOverflow or WideWorldImporters, or AdventureWorks <shudder>.

Report Subscriptions (SSRS)

Another question I’m often asked is, “Hey, did ReportX run?”  What they really mean is, “Did ReportX‘s email subscription get sent to BigWigUserY?”  We have an unholy amount of SSRS reports with email subscriptions.  And because I don’t care to bloat my inbox by BCC’ing myself with every single one, I rely on the users to speak up when they don’t receive something they’re expecting.

“This is a terrible idea.”, you say.  “Never trust the users!”

Yes, well, such is life.  If you have a better lazier solution I’m all ears.

So here’s a little script I wrote to query the ReportServer database (which MS will tell you is “officially unsupported”, snore).  If I know some keyword from the report title, or the supposed recipient’s email address, this will tell me if it ran successfully or not — LastRun, LastStatus.  A couple other useful bits: where it lives, ReportPath, and what its SQL Agent Job’s Name is, JobName.

That last bit is peculiar.  The JobName looks like a GUID (because it is, because SSRS just loves GUIDs), but it’s also the actual name of the agent job, which you can use to re-run said job — via exec msdb.dbo.sp_start_job — if the failure wasn’t systemic.  As I often do.

Disk Space

Last but not least, everybody’s favorite topic to try and forget about or pawn-off to the SysAdmins.  “How much space are those databases / data files / log files eating up?”  Well, mister suddenly-cares-about-disk-space-but-is-OK-with-storing-all-domain-users’-iTunes-music-libraries-on-the-central-fileshare-along-with-their-documents-because-that’s-what-we’ve-always-done.  {True story.}  Let me tell you.

keep calm and release the bitter
It’s healthy, I swear!

This script has a lot of comments and commented-out lines because I will tweak it depending on what I need to see.  Sometimes it’s “show me the DBs where the logical filename doesn’t follow my preferred pattern” (the DB name with ‘_data’ or ‘_log’ after it); or perhaps “Only show me files over 5GB with a lot of free space” when I’m itching to shrink something.

“Never shrink your files!  (In production!)”

Yes, thank you, knee-jerk reactionary.  I’m dealing with servers in lower environments, usually, with this one.  😉

What do you think?

I’d love to hear your feedback in the comments!  Happy Tuesday!!  =)

Credit where credit is due.

I did not magically come up with these all by myself.  They’re pieced together from many a StackOverflow answer and/or other community blog post or contribution that I’ve since forgotten.  I usually store a link to the source in these kind of things, when it’s true copy-pasta, but in these cases, I added enough of my own tweaks & style that I no longer tracked the original linkage.  If you see anything that looks familiar, please do tell me where to give kudos and shout-outs!  😀

T-SQL Tuesday #102: Giving Back

This month hosted by Mr. Major (scribnasium / @RileyMajor), whose past #tsql2sday post I may have missed but I love it all the same (BeyondCompare FTW!).  It’s about giving back to your community — mostly technical, i.e. the #SQLFamily / MS Data Platform community, but it can be about any community you’re a part of.

For starters, that’s what we blog for, at least partly.  I mostly blog about things I’ve learned or found interesting in my work, so that I don’t forget them later!  Of course, I’m always happy to learn that it helps someone else out there with a similar problem.  But there’s so much more to a community ecosystem than that.

SQL Saturday

SQL Saturdays are the perfect example of this.  I’ve been to 3 so far – 2 at Orange County, and 1 at San Diego.  I have to call out OC’s organizers, Ted Stathakis & .. the other guy.  Crap, I forgot his name.  He’s awesome though.  Srsly.  I talked with Ted a bit after a session at the last one and he truly is as nice as he is busy – 3 boys, all kinds of volunteer work, AND a full time job as the BI Director at Del Taco!  Wow!

I mean, it’s no Taco Bell, but still, Del Taco!  (Sorry, I had to..)  =P

I really want to volunteer to help at one of these.  I’m not sure what I’d do, but I know when the time comes, it’ll be worth it.

Lunch & Learns

To get my feet wet, I hosted a couple “lunch & learns” at my company with my Business Analyst and Development teams.  We talked about some basics, ranted about formatting a bit, tried to say goodbye to old habits from the SQL 2005 days (hello, date type!), and even dived into a few juicy things like performance-testing with IO stats and why you should never use scalar or multi-statement functions.  We also had a couple heart-to-hearts about DevOps and what it means in our environment.  (Hint: we’ve got a LOOONG way to go.)

At some point I’m going to scrub my slides and post them on SlideShare or something.  I just have to remove any company-specific info first.  ;o)

As with most teaching efforts, it helped me learn (and re-learn) some things myself!  I had to read some documentation to get wording exactly right, and I built a few playground DBs on my local machine to try out different concepts & examples.  Plus, it forced me to justify WHY I held certain opinions or notions on things, and in doing so, realize my own mistakes of the past.  Thus, I became a better DBA just by reinforcing some good practices and updating my own assumptions.

the-more-you-know
Yay learning!

Generosity

If there’s one thing I’ve learned from the many tributes to our late #SQLFamily member SQLSoldier, it’s the importance of being generous with your time.  Whether that means answering to the #sqlhelp tag on Twitter, participating in the SQL Community Slack, answering StackOverflow / DBA.StackExchange questions, or just taking a few moments to help someone at your workplace with a tech problem — it all makes a difference.  I need to be better about this, as far as my “online presence” goes.  In-person, I’m always eager to help, but because I work remotely most of the time (yay!), it’s a bit of a moving-target — the in-office days get packed with meetings and critical face-time (not FaceTime), so the peer-to-peer stuff is often postponed or relegated to Slack.

However, there’s a flip-side to this.  In being generous, we can’t forget to prioritize ourselves and our families.  I’m always being asked “Why are you working so much!?” — well, I could try to explain the difference between ‘work work’ and ‘tech community involvement’ and ‘self-betterment / career planning’ but… yeah.

Sorry, honey, be done in a sec!  =)

Anyway..

I encourage you, dear reader, to give back to your own communities, in whatever form that takes — tech/online, real-life, etc.  I promise that it will benefit you in some new & unexpected way.  Plus, whenever you make a real solid connection with someone, that’s worth its weight in gold.

it's not what you know, it's who you know
ORLY? Yes, RLY!

Favorite TSQL Tuesday #101 Posts

 

Since I didn’t even come close to making it in time for this month’s T-SQL Tuesday, I figured I’d highlight my 5 favorite posts from the community, and then share a few of my own tips/tools.

I use a Central Management Server too, and although I don’t often use it to run a query against multiple instances, it’s definitely a handy built-in feature to take advantage of.  A minor downside is that it only supports Windows Authentication (not SQL logins), so I can’t use it for my AWS RDS instances, nor for the CMS server itself — those I have to keep stored in my local “Registered Servers” section.  Another tool for running queries against multiple instances, with a good deal more flexibility, is Red Gate’s MultiScript, though it’s not free.  ;o)

Ethervane Echo, a clipboard manager and history-remember-er, is similar to something I use called Clipboard Fusion — in fact, it might even be better.  And who doesn’t love dbatools and dbachecks ?  If you’re not using them yet, don’t wait; start getting into PowerShell today by at least trying out some of the ‘get’ cmdlets from dbatools.

Telegraf looks absolutely stunning, as a monitoring system.  It does take some setup work and some maintenance, but it’d be a great branch-out learning opportunity to touch a few different technologies that a traditional SQL DBA might not normally think of.  Hats off to the people behind that, and may it continue to grow.

Leave it to Bert to go “outside the box” a bit, with these tools that help you be a better presenter and collaborator.  I use BeyondCompare, which is similar to WinMerge (tho, again, not free); I’ve fallen in love with its features that go beyond file diff/merge, but it’s nice to have a free option whenever I’m not on my main machine.

This is a broad sweeping post but it captures a LOT of what the community is and should be.  We’re inclusive, we want people to participate, grow & learn from each other, and ultimately advance their careers.  Tons of useful gems in here, from the Slack workspace to the event links to the networking advice.  Excellent stuff; go read it.

Honorable mention:

The SQL DB Modeler beta looks really interesting as an alternative to traditional big-$$$ tools like ER/Studio & Erwin.  If only I wasn’t stuck in brown-field legacy data models 95% of the time… =D

And finally, although they’ve probably been mentioned a few times already, pastetheplan and statisticsparser are two amazingly simple tools from the Brent Ozar folks that make sharing and comparing query performance so much easier.  My M.O. is to use PasteThePlan links in a dba.stackexchange post so that others can easily see the graphical execution-plan to offer feedback; while I use StatisticsParser to compare between A/B-testing runs of a stored-proc I’m trying to refactor & improve.

TSQL Tuesday #100 – Predictions for 2026

Yeah so I missed the boat by a few days week.  That’s pretty much my M.O.  This month’s T-SQL Tuesday #100 is hosted by the author of sp_WhoIsActive and the creator of T-SQL Tuesday himself, the legendary, the incomparable, Adam Machanic.

applause-please
You ain’t never had a friend like the SQL blogger community ;D

The Year is 2026

Despite IT’s best efforts to kill the relational database, it’s still alive and kicking.  Sure, it’s mostly in the cloud, and we’ve largely solved the problems of scalability, availability, and “traditional” maintenance, but the DBA still plays a critical role in the IT organization.  He/she is more of an architect and an automator, someone who understands the business and development needs as they relate to data — its storage, availability, security, and performance — and can leverage cohesive data platform technologies to provide those services and meet those needs.  But the fundamental issue of data quality still haunts even the best environments, because at the end of the day, when you rely on a human to enter text into a field, you’re gonna get garbage inconsistency.  Thus, we’re still fighting that fight, if only to appease our “data scientists” and machine-learning models so that they stop whining about it.

SQL Server itself has evolved.  After realizing that it was pretty silly to bolt-on a hacky “graph db” component to what is, at its core, a relational engine, MS broke that off into its own product, “Microsoft GraphDB Server”.  But the good news is, SQL & GraphDB talk to each other seamlessly; in fact all of the data-platform products integrate and inter-operate much more smoothly than 10 years ago.

We finally have a single unified CE (Cardinality Estimator), which is intelligent enough to know which paths/plans to use for a given query, so we don’t need to mess with those awful trace-flags anymore.  Indexes and Statistics are all but self-maintaining; the DBA rarely has to step in and mess with them.  Part of the reason for this is that SQL Server yells at you if you try to make a GUID the clustering-key, or other such nonsense.  =D

Columnstore is everywhere; traditional row-store (b-tree) indexes barely exist.  JSON storage & indexing inside SQL Server is much better, but it’s still preferable to use a document-store DB if you can.  Hierarchical structures (not to be confused with graphs) are easily implemented and supported, without having to resort to old hacky models.  And user-defined functions (all types) perform nearly on-par with stored procedures.

They’ve replaced sp_who and sp_who2 with the code from sp_WhoIsActive, and made SSMS Activity Monitor suck less & actually be semi-useful as a basic first-response monitor.  Profiler was officially killed off, and XEvents has come into general widespread usage — largely because MS finally dedicated some hard-core dev time to improving its GUI & making it much easier to use.  Native Intellisense finally works, and works well, for all but the most obscure/weird things, and is much less chatty in terms of network traffic to/from the server.

And finally.  FINALLY.  Each database has its own TempDB.

and there was much rejoicing.. yay
We’d only been asking for it for.. 10 years?

T-SQL Tuesday #99: Counting Rows the Less-Hard-Way

We can get our row count, and min & max date values, without ever touching the actual source table!

This month’s invite courtesy of Aaron Bertrand (B | T), whose “bad habits” blog series still inspires many an impassioned debate or engaging argument discussion on a regular basis among DBAs & Developers alike.

And yes, I’m taking the easier of the two “dealer’s choice” choices — the SQL focused one.  (I’m not big on sharing/blogging personal stuff, at least not here; I may one day start another blog for that, or perhaps just occasionally post more #off-topic stuff , but for now you’ll have to be content with my stories of vehicle troubles and the occasional movie-geekery).

So, without further ado…

By the way, what is ‘ado’ and why should there be no further of it?

art-vandelay-importer-exporter
Accidentally apropos on many levels…

Counting Rows in Really Big Tables

Previously touched on here, tables of unusual size (TOUSes) can be tricky.  You don’t want to lock them up for a long period of time, but you often need to gather information about them (such as row count, size, range of values) to perform some kind of operational maintenance with/on them.  In particular, Aaron’s post on “counting rows the hard way” inspired me to look into this a bit more and try to come up with a clever-ish way of finding out some basic “shape of data” info without actually querying (scanning) the table itself.

To start with, it’s actually really simple to get the total row-count from a few system catalog views — Aaron’s already shown you that, so I won’t repeat.  My interest is more in questions like “How many rows match a where-clause?” or “What are the min & max values for thatColumn?”

For this post, I’ll be focusing on a particular kind of table — the “history” or “transaction” table.  The idea here is that you have a record of “every time some event happens in/to some entity”.  A very common example is audit-trail tables, which I’ve been dealing a lot with lately.  Another common example is a “transaction history” table, such as, in our new favorite MSSQL demo database WideWorldImporters, the table Warehouse.StockItemTransaction​.  It’s the 2nd largest table in the db at 260-some-thousand rows.  {The largest is a multi-million-row beast that is actually the system-versioned aka temporal table behind a “normal” table; I might build a phase-2 example around this, but not today.}  So, while our queries won’t be super slow, we’ll get enough of an idea of what’s bad & good from measuring our IO stats (with SET STATISTICS IO ON).

TL;DR: The demo script is available here; the headers below correspond to the comment-lines of the same name, but I’ve left enough commentary in the SQL itself to keep the average reader on-track, so feel free to check it out ahead of time.  But do keep reading at some point!  :o)

A. Gathering Some Intel

First up, we have good ol’ sys.sp_spaceused.  This gives you some sizing info about the table, including its row count.  There’s a “disclaimer” circulating around out there that it’s not “up to the millisecond accurate” , i.e. it might not have the most current row count if someone else is in the middle of an insert operation or whatnot.  But for all intents & purposes, you can consider it truth.

Then you have the “hard ways” that people typically use — and that Aaron, again, covered just fine in his post on the subject, so I won’t spend any more time there.

But what if I want to count rows matching a where clause (a predicate)?  And in dealing with a typical history/transaction table, the predicate is almost always “between such and such dates”.  (Of course we won’t literally use the between operator, as we have been chastised severely; we know best to use >= and < !)  Also, I want to know the MIN and MAX of said dates in the table.  Lord knows we’re gonna be doing some table-scanning.

B. Ok, Let’s Try an Index

In their benevolent wisdom, the SQL deities decided not to give us an index on WideWorldImporters.Warehouse.StockItemTransactions.TransactionOccurredWhen.

BTW, how’s that for a verbose column name?  What, TransactionDate not good enough?  I suppose it isdatetime2 after all, but still…

So we create an index on it, to see if that helps our poor “count where dates” query.  And behold, it does!  We’ve cut our # of logical reads down by about 90% (from 1900 to 200, if you’re following along in the script).  That’s fantastic, but… we can do better.  Because if the table is, say, 500 million rows instead of 260k, that’s about 400,000 logical reads, which.. could definitely suck.

C. The Better Way

Again, the script has an ode to Aaron’s query on sys.partitions/tables to get the row-count from the meta-data.  Then the real fun begins.

There’s a system DMV (or probably ‘DMF‘ – dynamic management function) called sys.dm_db_stats_histogram, which takes the table’s object_id and the index’s index_id as arguments.  It gives you, obviously enough, the statistics histogram of the statistics object corresponding to that index.  We want to store its output in a temp-table (or even a real table — go nuts!) so we can query it some more.

--For example, if our new index is index_id 7:
sys.dm_db_stats_histogram(OBJECT_ID('Warehouse.StockItemTransactions'), 7)

So we create our #StatsHist table (“hist” being an abbreviation for “histogram”, not “history”, though in retrospect that’s probably not worth the possible confusion), and we populate it with the meta-data from Warehouse.StockItemTransactions and its new index that we just created (on TransactionOccurredWhen).  Poof!  We have an easy way of showing min/max values in that column!  Well… almost.  We have to convert the variant datatype to an understandable & aggregate-able (probably a made-up word.. aggregable? aggregatable?) type.  So we add a column range_hk_proper of type datetime2 and populate it with the converted values of range_high_key from the stats-output.

There!  Now we’re cookin’ with gas.  Our min/max/count query, and our “count where date-range” query, run in mere milliseconds, without ever touching the actual source table.  So we don’t lock it up or block anybody else from writing to it, even in the most pessimistic isolation levels.

Except when you created that index we needed on the date column.

Yes, I know.  What we’re hoping is that the tables we deal with in the “real world” already have such an index that we can take advantage of.  If not, well, that’s what maintenance windows are for.  And you better believe you’re gonna need that index sooner or later.

cooking-with-mustard-gas
Been a while since I used a Family Guy meme….

Where To Next?

Ostensibly, this whole thing could probably be turned into a stored-proc so you could run it “on demand” for any table that had a date or datetime column which you wanted to get such information about.  It’d have to do a lot of error-checking, of course — it wouldn’t work if you don’t have such a column, and if there’s no index on it, and probably a myriad of other ‘gotchas’ that I’m not thinking of at the moment.  But I did try to lay the groundwork for improvement. #StatsHist stores schema & table name too, so if you felt like turning it into a mini-data-warehouse holding a BUNCH of stat-histograms for a whole mess of tables, you could definitely do that.  And then you could run some basic analytics on it — min/max/avg, counts by year/month/day, etc.

Sounds like fun, no?  ;o)

T-SQL Tuesday #98: Orphaned Users Redux

It’s that time again!  The first #Tsql2sday of 2018.  Thanks to the Blobeater for this month’s invite: “your technical challenges conquered”.

Because I’m already ridiculously late, I have a short one.  This is about orphaned users — you know, when you restore a database and its users aren’t mapped to the server logins that they should be or used to be.

orphan-movie-poster
Not that kind of orphan… pretty decent movie tho!

The typical solution is sp_change_users_login with the auto_fix or update_one option.  But guess what?  Yep, that’s deprecated.  By the way, did you know that it also has a report option?  Apparently that’s got some bugs…ish?  Try it sometime and see — compare the output of sys.sp_helpuser where the ‘LoginName’ column is null, with sp_change_users_login 'report'.  Preferably on a DB you’ve restored from another server.  😉

So what’s the correct solution?  ALTER USER [theUser] WITH LOGIN = [theLogin].  Simple, no?  Let’s get more general.  Could we come up with a half-decent way do apply this kind of fix dynamically?  Well sure, the nice folks at DBATools have already solved that problem.  And that’s great, really.  But just in case that doesn’t work… ^_^

One of the many things I love about SQL Prompt is the right-click option to “Script as INSERT” (from the results grid).  This is a quick & easy way to built a temp-table for the results of an exec statement so you can do the ol’ insert #tmp exec sys.sp_blah !  Then we can query the list of DB users for the null LoginNames and write a little set of queries to fix them!  Sound good?

UPDATE: Behold the code!

Happy Tuesday!

PS: Coincidentally, today’s (Thursday) SQL Server Central newsletter featured a very similar post by a gentleman over at Madeira Data.  Go check it out, it’s another great solution to this problem!  And while you’re at it, get the SQL Server Radio podcast (created by a couple guys from the same company) – it’s a terrific addition to your iTunes library.

TSQL Tuesday #96: Good Influences

This month’s invitation is brought to you by Ewald Cress (blog, twitter), who I already like based on his tagline —

finds joy in minutiae..

Yes, my friend, don’t we all.

elaine-curious-why-didnt-use-exclamation-point
I can’t spend the rest of my life coming into this stinking apartment every 10 minutes to pore of the excruciating minutiae of every single daily event!

The topic at hand is fairly non-technical, but still important: folks who have made a positive contribution to your career or professional development.  So it’s time for a shout-out!  About a year ago, I wrote about my first major career move.  There were several great influences in my first job, from the developers that taught me how to code, to the DBA who taught me how to keep cool & calm in the face of outages, to the boss who taught me the importance of time management and breadth of knowledge.

Post-mortem

Since I am way too late in posting this, and I don’t feel like waxing poetic, I’ll just say a general “thank you” to all those who’ve helped me along in my career so far, with special acknowledgement to my former boss, my current boss, the SQL family, and my own family.  Happy belated Thanksgiving and have a safe & pleasant holiday season!  I’ll have a real post again quite soon, diving back into the tech stuff.

food-coma-happy-belated-thanksgiving
hope they don’t mind me borrowing their image… =D

TSQL Tuesday 95: Big Data

This month’s party brought to you by Mr. Hammer (b|t).

mc-hammer
No, not THAT one…

I apologize in advance for all the hammertime memes.  It was just too good to pass up.  Surely he must be used to this.  Or at least not surprised by it.  =D

So, Big Data.  What is it?  Well, in simple terms, it’s the realization and acceptance of the fact that data is multi-model, multi-faceted, multi-sourced, and constantly growing.  It’s the fact that the traditional RDBMS is no longer the be-all end-all source of truth and valuable information.  It’s part of a larger ecosystem involving JSON document stores, CSV files, streaming volatile bits of data coming from random devices and user activity that loses its meaning and potential impact almost as quickly as it can be gathered and sifted and stored.

But what do we actually get out of it?  As a small-medium enterprise NOT in the software business, I have to say, not as much as the hype would have us believe.  And look, I’m not so jaded and crusty that I refuse to adapt new tech.  I Just haven’t seen a meaningful transformative business use-case for it.  Sure, we have Google Analytics telling us how our websites are doing, and someone in marketing knows something about trending our social media traffic.  Does it really help us make more money?  Heck if I know.

cease thy actions, my timepiece has indicated the necessity of mallets
Old-timey colonials can even dig it…

Here’s what I’d like to see from the thought leaders.  Give me something I can chew on — a real-world, non-hypothetical, non-frivolous, impactful use-case for adopting and implementing something like Hadoop/Spark or Azure Data Lake.  Show me how my business can realistically journey down the path of predictive analytics and what it’s going to take from our Devs, IT staff, and management to actually get there.

Because they don’t get it yet.  I have managers still worrying about how much we’re spending on a dinky little flash storage array to support the growing needs of our on-prem converged infrastructure stack.  Meanwhile the AWS bill continues to baffle, and Devs want to play with Docker and Lambda.  But we can’t seem to convince the higher-ups that they’re short-staffed on the internal-apps team, even after a minor version upgrade takes 4 hours of Ops time and half a dozen end-users doing post-mortem testing just to be sure we didn’t break anything unexpected.

I’m not here to complain.  Really.  I do want to see something amazing, something inspiring, something that shows me what Big Data truly brings to the table.  And sure, I’ve see the vendor demos; they’re all just a bit outlandish, no?  I mean, they look really cool, sure — who doesn’t want to see a chord diagram of who’s killed who is GoT? — but does that really help my business improve sales and productivity?

My point is, there’s a gap.  A chasm of misunderstanding and mis-matched expectations between what management thinks Big Data is/means, and what it takes to actually implement.  They see the pretty pictures and the fancy demos, but they don’t see the toil and sweat (or at least, in the cloud, gobs of cash) that go into building & operating the underpinnings and pipelines that drive those nice graphics.  Not to mention the fundamental issues of data quality and governance.

continue not, time for hammer it is
OK OK, last one, I swear…

So do us a favor, Big Data pundits.  Show us something real, something that “the little guy” can use to up his/her game in the market.  Something that makes a positive impact on small non-startup non-software businesses with understaffed IT & Dev teams.  But more importantly, stop glossing over the effort and resources that it takes to “do Big Data right“.  Managers and executives need to understand that it’s not magic.  And IT practitioners need to understand that it’s actually worth-while.  Because I believe you — really — that the payoff in the end is there, and is good.  But you need to convince the whole stack.


PS: I know this is a fully day late for T-SQL Tuesday, and as such, I wasn’t going to post a ping-back in the comments of the invite, but then I saw there were only 8 others, so I felt it would benefit the event if I did add my late contribution.  I’ll tweet with a modified hash-tag instead of the standard #tsql2sday, to reflect my late-ness.  Hopefully that’s a fair compromise to the community & the event’s intentions.  =)

PowerShell and BITS

this is about using PowerShell and the BITS framework to copy very large files across servers…

Welcome back!  This month’s topic is PowerShell — thanks to one of our prominently bearded community members.  PowerShell is a fantastic tool in the IT professional’s toolbelt.  I was first introduced to it somewhere in 2014 or 2015 by a colleague, and started making much heavier use of it when my career took me to a new & bigger environment.

Actually, funny side-story.  I remember seeing one of the very early incarnations of PowerShell, or what would eventually evolve into it, in college.  A graphics programming course, of all things, had a MS partner come in to show us this “cool new” Windows command-shell thing (different and separate from the DOS-style CMD, obviously), where he demonstrated fetching some data from the filesystem, feeding it into a CSV, and doing some kind of super-basic analysis on it to show in a “report” (which was really just another text file).  This was 2005-2006, so I couldn’t say what it was specifically, though I seem to remember something about the word “Longhorn”.  Although, reading up on some of the Wiki-history, it seems more likely that it was a Monad beta.

that is so four score and seven years ago
Don’t ask me why hes wearing horn-rims. I don’t know.

Preamble

Anyway, back on topic.  Today’s post is pretty simplistic in comparison to what most people may be writing about.  But I’ve already blogged about doing hands-off SQL installation with PowerShell & CLI, and this was another thing kicking-around the back of my mind.  So this is about using PowerShell and the BITS framework (*-BitsTransfer cmdlets) to copy very large files across servers.  Specifically, database backups.  Because let’s face it, they can be really large.  And if you’re faced with fetching them off a PROD box, you want to minimize the impact on that box’s resources.

Now sure, there are other ways – xcopy or robocopy with the /J flag (un-buffered IO), or fancy GUI tools.  And in an ideal world your backups would be written to a network share that’s not a local drive on the PROD SQL server, right?  Right…

Oh, and one more thing.  You need to enable BITS via the Windows Features console — search “features” in your Start menu and it should come up as Turn Windows features on or off (Control Panel) .  On a server, it’s under the Server Role “Web Server (IIS)”, feature “Background Intelligent Transfer Service (BITS)”.  Underneath there are 2 sub-feature choices, “IIS Server Extension” and “Compact Server”.  Honestly I don’t know which is preferable, but I left it with the default selection, the first (former).  It should go without saying, but don’t do this in production (unless you have the blessing of your SysAdmins).

But Why?

Why BITS?  Well, as per the Docs, it has the following 3 key features (emphasis mine):

  • Asynchronously transfer files in the foreground or background.
  • Preserve the responsiveness of other network applications.
  • Automatically resume file transfers after network disconnects and computer restarts.

Wow, nifty!  So it doesn’t hog the network, and it’s resumable (resume-able?) in case of connectivity hiccups.  Pretty sweet, no?  Also, it can run asynchronously in the background, which means it won’t hog your storage bandwidth or compute resources.

async all the things!
Because we can.

Let’s See an Example

Most of the guts and inspiration for this came from this article over on “Windows OS Hub” (woshub, a somewhat unfortunate sounding acronym, but certainly not as bad as some!).  The datePattern nonsense is just to make it “dynamic” in the sense that, if you have a backup scheme like me, with Sunday FULLs, daily DIFFs, and obviously TLogs in some every-X-minutes fashion, you’ll usually want the latest set of FULLs and DIFFs.  But you could easily tweak this, make it more point-in-time aware or whatever, as needed.

So, here’s a bit of a talk-thru outline, and then I’ll just link the Gist.

  1. Get the list of files we want to copy, from “source”
  2. For each file:
    1. Make sure it doesn’t exist in the “destination”
    2. If not, start a BITS transfer job (saving said job to a variable for checking/finishing later)
    3. While said BITS job is pending, print a progress message and sleep for some seconds
    4. Finish (“complete”) said job and move on to the next file
  3. Conclude with a message about how much work we just did!
  4. Repeat steps 1-3 for another “set of files” (list) if desired

And without further ado, the code.

# AUTHOR: Nate Johnson, @njohnson9402/@natethedba, natethedba.wordpress.com
# LICENSE: https://choosealicense.com/licenses/unlicense/
# TYPE: PowerShell script
# DESCRIPTION/USAGE: example of 2 file copying loops using BITS Transfer cmdlets, targeted at SQL backups
# See corresponding blog post at https://natethedba.wordpress.com/powershell-and-bits/
# ASSUMPTION: your SQL backup files are named with a date component in the form "yyyyMMdd", and use extension ".bak"
$source = "\\BigSQL\Backup\BigSQLInstance\" #network share or local directory
$destiny = "X:\Restore\BigSQLInstance\" #LOCAL drive destination
$testonly = 1 #set to 0 to actually do the copying!
$sleepSeconds = 3 #how many seconds to sleep between each progress-check during each file transfer
# use some date-math to get last Sunday's FULL backups
$datePattern = $(Get-Date).AddDays($((Get-Date).DayOfWeek.value__)).ToString("yyyyMMdd") #e.g. "20170910"
$files = Get-ChildItem Path $source Recurse Filter ("*_FULL_" + $datePattern + "*.bak")
Write-Host ("About to copy " + $files.Count + " files to " + $destiny + "")
for ($i = 0; $i -lt $files.Count; $i++)
{
$srcfile = $files[$i].FullName
$dstfile = $destiny + "\" + $files[$i].Name
Write-Host (" Copying " + $srcfile)
if (Test-Path $dstfile)
{
Write-Host (" File already exists: " + $dstfile)
}
elseif ($testonly -eq 0)
{
$bitsJob = Start-BitsTransfer Source $srcFile Destination $destiny Asynchronous Priority Low
while (($bitsJob.JobState.ToString() -eq "Transferring") -or ($bitsJob.JobState.ToString() -eq "Connecting"))
{
#only display progress when transferring
if ($bitsJob.JobState.ToString() -eq "Transferring")
{
$pctCmplt = ($bitsJob.BytesTransferred / $bitsJob.BytesTotal)
#see https://technet.microsoft.com/en-us/library/ee692795.aspx for number formatting trick
Write-host ($bitsJob.JobState.ToString() + "" + $("{0:P2}" -f $pctCmplt))
}
Sleep $sleepSeconds
}
Complete-BitsTransfer BitsJob $bitsJob
}
Write-Host (" Finished transferring " + $srcfile + " to " + $destiny)
}
# today's DIFF backups
$datePattern = $(Get-Date).ToString("yyyyMMdd") #e.g. "20170912"
$files = Get-ChildItem Path $source Recurse Filter ("*_DIFF_" + $datePattern + "*.bak")
Write-Host ("About to copy " + $files.Count + " files to " + $destiny + "")
for ($i = 0; $i -lt $files.Count; $i++)
{
$srcfile = $files[$i].FullName
$dstfile = $destiny + "\" + $files[$i].Name
Write-Host (" Copying " + $srcfile)
if (Test-Path $dstfile)
{
Write-Host (" File already exists: " + $dstfile)
}
elseif ($testonly -eq 0)
{
$bitsJob = Start-BitsTransfer Source $srcFile Destination $destiny Asynchronous Priority Low
while (($bitsJob.JobState.ToString() -eq "Transferring") -or ($bitsJob.JobState.ToString() -eq "Connecting"))
{
#only display progress when transferring
if ($bitsJob.JobState.ToString() -eq "Transferring")
{
$pctCmplt = ($bitsJob.BytesTransferred / $bitsJob.BytesTotal)
#see https://technet.microsoft.com/en-us/library/ee692795.aspx for number formatting trick
Write-host ($bitsJob.JobState.ToString() + "" + $("{0:P2}" -f $pctCmplt))
}
Sleep $sleepSeconds
}
Complete-BitsTransfer BitsJob $bitsJob
}
Write-Host (" Finished transferring " + $srcfile + " to " + $destiny)
}
Write-Host ("All done!")

The Catch

There are some downsides here.  First, you cannot use BITS in a non-interactive mode, i.e. inside a Scheduled Task as a User that’s not logged-in.  This is because it’s a “desktop”-oriented feature, not a “server” one.  Second, I’ve never been able to get multiple transfers going at once — or at least, multiple PoSh scripts which use BITS transfers.  This could very well be my fault, but it does seem like the BITS jobs are “serial” in nature, i.e. one must finish before the next one can start.  Again, not the expert, just observing what I found during my experiments.

parallel vs serial ports on old computer
Obviously, serial won out in the end (specifically, his superstar protege, USB), but you gotta hand it to parallel, he had a good run.

Conclusion

BITS transfer is an interesting method for copying extra-large files around your environment with low overhead.  PowerShell makes it easily accessible and lets you wrap it up in loops and checks so you can effectively build a progress-indicative, predictable and reproducible method for copying a whole SQL server’s set of backups from one place to another.

What cool little things have you discovered using PowerShell?  Let me know!  Thanks for reading.