We’ve finally launched our new platform, and, while things are not perfect, we’re in a place that gives me a little time off to do something other than work work.

I’ve been using a lot more SQL lately to work with our Postgres instance, and it’s made me realize how much better I could be - more efficient queries (read: understanding how the queries work), shorter queries, a better understanding of how to pull out all of the data I want in the fewest number of queries.. the list goes on.

In order to address this, I’m continuing a book I took a pause on. It’s called Curious Moon, a fun take on learning SQL, where you’re placed into the role of a intern at a space consulting agency whose task is to wade through some data on Enceladus after the former DBA dropped a bunch of data…

If you’re skeptical that such a book could be useful, know that the author advocates against ORMs and for using psql and make, also starting with importing and cleaning up data from a CSV rather than downloading a pre-formed DB. Point is: this is not an “easy mode” book, rather it tries to instill what I would consider best practices based on my own experience.

stepping off the soapbox

With that said, I’m cataloging some of the tidbits I’ve learned from the book here:

(P)SQL

ilike: like like but like, case insensitive

WHERE title ~* '^A\d.*?': how did I not know that I could use regex like this before?

to_tsvector(col): create a fully text-searchable index on a column. Use where col @@ to_tsquery('txt'), for example, to search the field. More operators and functions available; of note websearch_to_tsquery function could be interesting if building a “web searchable” app that accepted more natural web queries. The internals are also pretty interesting:

The to_tsvector function internally calls a parser which breaks the document text into tokens and assigns a type to each token. For each token, a list of dictionaries (Section 12.6) is consulted, where the list can vary depending on the token type. The first dictionary that recognizes the token emits one or more normalized lexemes to represent the token. For example, rats became rat because one of the dictionaries recognized that the word rats is a plural form of rat. Some words are recognized as stop words (Section 12.6.1), which causes them to be ignored since they occur too frequently to be useful in searching …

Note: After playing around with this a bit, behavior is not exactly as expected. Example:

drop view customers; create view customers as select to_tsvector(concat(u.first_name, ' ', u.middle_name, ' ', u.last_name)) as name from users u;

select * from customers where name @@ to_tsquery('joe');
        name
---------------------
 'camel':2 'joe':1
 'boateng':2 'joe':1
 'joe':1 'ledford':2
(3 rows)

After using it a bit more, I’ve realized that this field is not necessarily meant to be used to read, but to simply query on. For example, query on name but only as a way to get to other columns in the rows, like id or email.

count(1) + group by: I’ve used count and group by before, but this solved a small problem I’ve been mulling over in the back of my head: count the number of occurrences of a common attribute among rows. For example, if you had services that could be part of a package offering, and those services referenced a package, you can find the number services sharing the same package with:

select count(1) as services, package_id from services group by package_id;

Note: You can also use having to narrow results on the count, e.g.

select count(1) ... having count(1) > 1;

search++: use concat to search even more! to_tsvector( concat(events.description,’ ‘,events.title) ) as search

psql

\H: output as HTML!

Other stuff

Sargeable: “Search ARGumentABLE” queries can be optimized. Basically, can the query planner optimize search if you had an index on a column. For example, if the column description had an index on it, final% could be optimized, but %final would result in a sequential scan; the former is sargeable, the latter is non-sargeable.

Apparently I can use aliases in order by but not where clauses because the where clause is evaluated before the select statement, but order by is evaluated after.. which makes sense. Filter then order.

Random Quotes

Data science and analytics is about finding patterns and exploiting relationships that, at first, don’t seem apparent. Before you do any of that, however, you have to have good data. The whole process is like digging for gold: the first problem is knowing where to look.