Meep's Data Labs
Susan Rahman
Professional Data Nerd
A Little About Me
Hi, I’m Susan (a.k.a. Meep), a professional data nerd with 14 years of experience across data analysis, data engineering, analytics engineering, data science, and Python-based software development. I’m comfortable working at any scale, from MBs to PBs, and always up for a fun data modeling challenge. I’m a growth-oriented person but equally focused on people: I enjoy mentoring and empowering others, data professionals or otherwise. Outside of the bits and bytes, I’m a hobby mead maker, creative writer, and hobby botanist. Don’t be shy; come say hi on LinkedIn!
Leadership Manifesto
My Purpose
To support and inspire novice and experienced data professionals to reach their greatest potential through thoughtful, compassionate, technical and people-oriented mentorship and servant leadership.
My Vision
Within the next year, I strive to step into a lead position in data engineering, analytics engineering, or data analytics. As a continuation of this path, within the next one to two years I strive to move into a management role on a data engineering, analytics engineering, or data analytics team. Through this role I will work to inspire a love of data while instilling best practices and a conscious, ethical approach to using data.
My Values
Authenticity
I strive to be my whole self with those around me and find ways to build safe environments where others feel empowered to do the same.
Compassion
I am a ‘people-first’ leader. I strive to think and act with compassion in all situations and interactions. Whether that means listening without judgement or offering emotional or practical support during a difficult time, my goal is to empower and support those around me in a way that brings out their best through compassion-led actions.
Flexibility
We are all adults with busy lives and multitudes of commitments. Believe me, I understand that as the mother of a small child. I strive to support others in ways that enable them to meet their commitments and needs while finding creative solutions so our agreed-upon goals are still met. Whether that’s helping with stakeholder management, advocating for and building more flexible project plans, or helping plan a birthday party, my goal is to be as supportive as possible while still enabling us to deliver value together.
Integrity
I strive to always be honest and truthful in conversations, endeavours, and any other activity. Equally, I strive never to deceive or mislead those around me, and to accept accountability for, and the consequences of, the mistakes I make.
Playfulness
I believe being playful is critical to building trust and stronger relationships. This can be as simple as a quick game of ping pong or perhaps an online game of charades. I strive to encourage others to bond through play by leading by example, using the approach to break down barriers that keep us from communicating fully and honestly with each other.
Recent Articles and Posts
Flashlights & Boogeymen
Published August 2, 2023
A Bit of History
Tags: Team Building, Hopes, Fears, Planning, Risk Management, Leadership, Emotional Intelligence, Empathy, Lean Coffee
While working at Riot Games, I had the privilege of working with an exceptionally talented individual named Neeraj Mathrani. He was, and is, truly gifted at facilitation and at pushing us as a data team to reflect on our successes and opportunities. One of the facilitation exercises he employed to help our teams put words to thoughts was the Hopes and Fears exercise. It is intended to create a collection of ideas, events, or outcomes members of the team are either hopeful for or fearful of, usually focused on a specific event or topic. It is an excellent tool for creating empathy within and across teams by forming a safe space where team members can be vulnerable and share their collective thoughts. While working at Spotify, my team and I faced challenges similar to those my team at Riot had faced, most of which created uncertainty about the future, leading to apprehension and a dip in trust. To help us work through this, I turned to the Hopes and Fears exercise again, but this time I decided to add some extra layers to it. I called it “Flashlights and Boogeymen.”
What is “Flashlights and Boogeymen?”
The premise of the exercise starts much the same as a Hopes and Fears activity. Members of the team post the ideas, concepts, or events they are either hopeful for or fearful of in a designated physical or virtual space. Then, the team clusters common themes together within each classification (that is, hope or fear) and uses a Lean Coffee style to decide which fears to focus on. However, this is where “Flashlights and Boogeymen” goes one step further. The topics the team identifies and discusses from the Fears section become the Boogeymen. Each Boogeyman is given a clear description that the team agrees upon, with some part of that description dedicated to outlining the impact the fear will have if it manifests. Once this is done, the team should put a timeframe on when they anticipate the Boogeyman will show up, that is, when the fear becomes a reality. This will be important later on. Lastly, the team brainstorms and documents the signs or events they consider to be indications the fear is becoming a reality. These are your Flashlights. Think of them as markers along a dark path; they will help the team see whether there really is something lurking in the shadows.
With both the Boogeyman’s description and the Flashlights documented and agreed upon by the group, it’s time to put them to work. Check in with the team weekly, or at whatever cadence is appropriate, to determine whether any of the identified events have come into play. For any Boogeymen that were expected to materialize but haven’t, take some time to reflect on that fear and compare how the team feels now versus when the fear was initially surfaced. Is it still a fear held by the team, or has it passed? Were the Flashlights helpful? Did they help the team feel more secure or give more clarity? For any future events where this fear might resurface, what would the team do the same or differently to ensure the fear remains a figment rather than a reality?
Back Up – What is Lean Coffee?
For those unfamiliar with the Lean Coffee style of generating topics for discussion, the steps are as follows:
1. Each participant writes down the topics they would like to discuss, one per card or sticky note.
2. The group briefly introduces the topics and clusters any duplicates together.
3. Everyone votes on the topics (dot voting works well) to build a prioritized list.
4. The group discusses the highest-voted topic for a fixed timebox, for example five to eight minutes.
5. When the timebox ends, a quick thumbs up, down, or sideways vote decides whether to keep discussing or move to the next topic.
6. Repeat until you run out of topics or time.
Want to Try It?
I’ve created a Miro version of the “Flashlights and Boogeymen” exercise; if it sounds interesting, you’re welcome to make a copy and try it with your team. If you discover any additions or changes you’d make to the process for your team, please share them with me!
Avoiding the Chaos When Adopting Data Democratization: An Opinion
Published July 29, 2023
Tags: Analytics Engineering, Data Cleanliness, Data Democratization, Data Engineering, Data Management, Data Modeling, Data Stewardship, Data Warehouse
If you’ve read through my other post, An Argument Against Looker, you’re familiar with my stance on the risks of Data Democratization done incorrectly or without planning and forethought. This topic is one of those that gets me up on my soapbox. It is a data management ideology that I believe has potential, but its execution is not for the faint of heart. If you’re thinking about adopting Data Democratization within your organization, consider the following experiences and some ways to avoid the fallout of a poorly executed adoption.
What is Data Democratization?
Data Democratization is both an ideology and a process. It represents a decision within an organization to open access to data for all members. However, getting to the point where the gates to the data warehouse open and everyone enters to write epic queries and make awesomeness happen isn’t as simple as just making the data available. The ramp-up leading to that point is long, challenging, and full of places to slip.
The Common Fail Points in Adopting Data Democratization and How to Avoid Them
I have been at three companies that were in various stages of adopting or implementing Data Democratization and have seen some things go well and several common things go poorly. Let’s dive into what these common failure points were and reflect on how to get ahead of them if not prevent them entirely.
Not All Data Can or Should be Democratized
This was probably one of the most common failure points I witnessed: teams or organizations taking an ‘all or nothing’ stance on Data Democratization, as in, either everyone has access to all the data to do what they need, or we haven’t done this right. Every business has core metrics and KPIs that act as the heartbeat of the company. These metrics need to be accurate. They are used in daily conversations, dashboards, and even life-altering conversations at leadership levels about promotions or layoffs. These are not the kind of data elements whose creation, alteration, and usage should be left to whim. They need to be taken under stewardship by data professionals. This is where one of the key nuances of Data Democratization comes into play: access to data can be democratized, but the creation and alteration of data should be held to much higher scrutiny.
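As a rough sketch of that split, here is what it could look like in Snowflake-style SQL: read access on the vetted data is opened up broadly, while the ability to create or alter objects stays with the data team. The role and schema names (analyst_role, data_eng_role, analytics.marts) are hypothetical.

```sql
-- Hypothetical roles and schema; everyone can read the vetted, modeled data...
GRANT USAGE ON SCHEMA analytics.marts TO ROLE analyst_role;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.marts TO ROLE analyst_role;

-- ...but only the data team can create or change what lives there.
GRANT CREATE TABLE ON SCHEMA analytics.marts TO ROLE data_eng_role;
GRANT INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA analytics.marts TO ROLE data_eng_role;
```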
Data Democratization – Availability or Creation?
Be Careful What You Wish For
One of the most painful Data Democratization exercises I went through while on a data team was a shift in the organization that allowed anyone to create data and funnel it into the team’s data lake and, in turn, the data warehouse. It very quickly became a case of ‘too many cooks in the kitchen’: data standards were not adhered to, and data was used for purposes it was never intended for. The overall quality of the data dropped significantly, and the data team struggled to keep up with the influx while trying to keep some semblance of order in the warehouse.
I argue this is the incorrect definition of Data Democratization. Sure, a team can create their own data through an API or other service, but that should not mean the data will be ingested and accepted into the vetted data warehouse the company uses for business reporting. There really should be a stewardship layer wrapped around the data warehouse and the data models within it. Data that is not at the expected quality or within the defined standards should be rejected and brought up to the quality bar (within reason) before it is allowed in. If this slows down a business initiative or a team’s momentum, rather than pushing back against the pause, consider that it may be in the project’s best interest. What if decisions were made off of data that the organization’s data professionals have flagged for quality issues? Sometimes slowing down is the best thing to do; otherwise we trip and get a mouthful of dirt.
This leads us back to the question of what it means, then, to democratize data. I argue it means making the data accessible to everyone within a reasonable and limited set of constraints. It does not mean anyone can create and push data alongside vetted models, nor change a definition (code or business) without careful review and approval.
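To make that stewardship layer a little more concrete, here is a minimal sketch (BigQuery-style SQL) of the kind of check a data team might run before promoting newly landed data from staging into the warehouse. The table and column names (staging.orders, order_id, order_status) are hypothetical.

```sql
-- Count violations of the agreed standards before the data is allowed in.
SELECT
  COUNT(*)                                                      AS total_rows,
  COUNTIF(order_id IS NULL)                                     AS missing_keys,
  COUNT(*) - COUNT(DISTINCT order_id)                           AS duplicate_keys,
  COUNTIF(order_status NOT IN ('open', 'closed', 'cancelled'))  AS unknown_statuses
FROM staging.orders;
-- Promote the load only when missing_keys, duplicate_keys, and unknown_statuses
-- are all zero; otherwise send it back to be brought up to the quality bar.
```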
Not Preparing for the Avalanche of ‘Where is this data?’ or ‘How do I write this query?’ Questions
Just because everyone can get to the data they need does not mean they know how to use the data they’ve found. One of the most crippling situations a data team can face in the rollout of a Data Democratization endeavour is not having the support to pre-plan, pre-train, and pre-document answers to the questions that will inevitably follow. In many cases, the new users of the data will be peers who may never have written SQL before. Often they’re just using the query another peer gave them and trying to put the pieces together. Or perhaps they’ve written a bit of SQL, but the data they need is more complicated than anything they’ve worked with. In an ideal situation, the documentation, example queries, and training materials to answer these questions would be created prior to the launch of the Democratization.
Failure to put some work into these areas will result in a double-sided crunch on the data team. On one side there will be requests to add new data into the ecosystem and ongoing maintenance needs; on the other, an influx of questions from new users trying to figure out how to do their work. Something will eventually have to give, and either way it will come at the cost of data quality across the organization.
Throwing Data Novices or Non-Data Users at Raw Data
Oof, this one is painful. One of the biggest blunders and misinterpretations of Data Democratization is the assumption that it means anyone should be able to go straight to the source data to run business-level analytics. This becomes even worse when the person expected to do this is non-technical and now faces a nearly insurmountable learning curve filled with pitfalls they likely aren’t even aware of. For readers who might not be familiar with what happens between the moment data is generated at its source and the moment it lands in a pretty Tableau report: there are often several layers in between.
These layers are referred to as ETLs or ELTs, or simply pipelines, in the data engineering world. These pipelines often apply logic to the source data: rows might be excluded, the formatting of data elements changed, or certain fields translated into something else (what is a ‘false’ in the source could be ‘Not Active’ in the warehouse). Additionally, data sources are often joined together to create an output data set, so looking at a single source might not give the full picture of what is reflected at the end of a pipeline. Using Data Democratization as an attempted catalyst to cut through the data engineering process is, frankly, an uninformed and reckless approach to data within an organization. It will lead to mismatched figures and KPIs, a wild goose chase of ‘he said, she said’, and an increased potential for business decisions to be based on incorrect data.
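To illustrate (with entirely made-up table and column names), here is the kind of logic a pipeline might apply between source and warehouse, and why querying a single raw source will not match the numbers at the end of the pipeline:

```sql
-- Two source systems that know nothing about each other are joined,
-- flags are translated into warehouse-friendly labels, and some rows never make it through.
SELECT
  u.user_id,
  CASE WHEN u.is_active = 'false' THEN 'Not Active' ELSE 'Active' END AS account_status,
  s.plan_name,
  s.monthly_price
FROM raw_auth_service.users AS u
LEFT JOIN raw_billing_service.subscriptions AS s
  ON u.user_id = s.user_id
WHERE u.is_deleted = 'false';   -- excluded rows are simply not in the warehouse
```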
Rather than viewing Data Democratization as a mechanism to ‘reduce time from source to dashboard,’ see it as a means of enabling non-technical users to access safe, structured data in the data warehouse, after the data engineering process, rather than relying solely on a dashboard or a data analytics expert.
Reducing Focus on Creating Verified and Modeled Data Sets
As hinted at in the section above, seeing Data Democratization as a way to skip or circumvent the data engineering process is a recipe for pain and confusion. If the goal is to give non-data users greater access to data, then the data should be stable, of sufficient quality, and suitably easy to use and understand. This means the data should be properly sanitized and modeled by a data professional. Pulling the data team’s focus away from this type of work risks a couple of things: consumers falling back to raw or unvetted data to get their answers, and the quality and consistency of the warehouse eroding just as demand for it grows.
In short, Data Democratization will increase the need for more modeled, quality data sets, not reduce it. As such, ensure the team building the models within the data warehouse is empowered to plan for this type of work and does not neglect it in backlog grooming or sprint allocation.
Avoid the ‘We’re All Technical Now’ Approach
Some people really don’t like writing SQL; the thought of even looking at a database schema makes bile rise in their throat. Others just aren’t comfortable writing their own code to pull data that has otherwise been managed by teams upstream. One of the common ‘fixes’ I encountered was sending GitHub repo links to non-technical peers and expecting them to dive right into the SQL to figure out all the filters and conditions they would need to apply to their own queries to get the same results. Using Data Democratization as a false catalyst to force non-technical peers into technical roles can have lasting consequences, most notably on turnover and employee morale. If non-technical employees feel they are being forced to take on technical work that was previously not required to do their jobs, it is going to be frustrating and stressful. Not everyone wants to be an engineer, no matter how cool the data is.
That’s A Lot… What’s The Takeaway?
You’re right, it is a lot, and I suppose that is part of the takeaway. Data Democratization often introduces far more complexity than it removes, which is often missed by leaders who may have been misled about, or misunderstood, what exactly it is. If your organization is considering adopting a Data Democratization approach, please try to remember these key points:
- Not all data can or should be democratized; core metrics and KPIs need stewardship.
- Democratize access to data, not the unchecked creation and alteration of it.
- Prepare documentation, example queries, and training before you open the gates.
- Don’t point data novices or non-data users at raw data.
- Keep investing in sanitized, modeled, and vetted data sets; demand for them will grow, not shrink.
- Don’t use democratization to force non-technical peers into technical roles.
A Study in Pokemon Sizes – Gen 1 to Gen 9
Published July 28, 2023
Tags: Analysis, API, MatPlotLib, p-value, Pandas, Pokemon, Python, Python3, SQL, Statistics, T-Test
I wanted to do something fun while my family and I wrestled with the logistics of moving from Stockholm to London and decided to dabble in a bit of data around Pokemon. In this Jupyter-notebook-turned-PDF, I’ve conducted a study to determine how the weights and heights of Pokemon have changed over the generations.
You can find the code base used to capture the data from the API, as well as the SQL to build and query the BigQuery datasets, in this repository. It was a nice, fun project to have on the side while trying to figure out how to fit a life into 6 suitcases. Be safe and be good!
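For a taste of the kind of question the notebook asks, a query along these lines captures the gist; the dataset and column names here (pokemon.stats, generation, height_m, weight_kg) are hypothetical stand-ins for what actually lives in the repository.

```sql
-- Average height and weight of Pokemon by generation (hypothetical schema).
SELECT
  generation,
  COUNT(*)                  AS pokemon_count,
  ROUND(AVG(height_m), 2)   AS avg_height_m,
  ROUND(AVG(weight_kg), 2)  AS avg_weight_kg
FROM pokemon.stats
GROUP BY generation
ORDER BY generation;
```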
An Argument Against Looker
Published July 21, 2023
Tags: SQL, Looker, Data Engineering, Analytics Engineering, Data Modeling, Data Cleanliness, Dashboard, Explore
They say getting old means you become stuck in your ways. Maybe it’s the need to always have your tea in the same cup every morning, or perhaps it’s the way you fold your shirts. Oddly enough, I’m finding the conversation around data visualization tools to be the thing making me dig my heels in. I’ll just say it: I don’t care for Looker.
I know: it’s the new hotness, not to mention a tool that can seamlessly integrate with your massive stores of data on Google’s platform. It has a cool premise, but from my experienced perspective it is an underdeveloped tool that misses the mark. More concerning, it has the potential to instill a plethora of bad habits in inexperienced data users and to set business leaders up with false promises. Hear me out on this one.
Argument One: Frankenstein’s Monster’s Cousin… Kind Of?
Okay, fair, maybe it’s a bit harsh to compare Looker to anything to do with Frankenstein’s monster, but truly, I feel the scope and purpose of the tool is an undefined hodgepodge of asks, none of which it clearly delivers on. Let’s start with the basic premise of what Looker intends itself to be: Your unified business intelligence platform. Self-service. Governed. Embedded. (src)
For anyone who’s worked in data long enough, a few questions should come up from this sentence. Let’s start with ‘unified business intelligence platform‘ and ‘Embedded‘: okay, fine, I’ll give Looker this one. It’s a tool that’s embedded into GCP and enables users to access data stored within the various projects associated with an organization (provided the correct permissions are granted). This can solve a problem some dashboard builders and consumers have, where vast sums of data need to be moved from a storage location to an external dashboarding tool’s host platform. However, I’d argue that if that is a problem, it is a data management and data modeling problem, not a visualization tool problem. I’ll come back to this later.
Next, let’s take in those last two loaded terms: Self-service. Governed. The first question that comes to mind focuses on the second term: Governed. By whom, exactly, and what does that even mean? Most of the Looker instances I’ve seen have been nothing short of messy. Like, a teenage boy’s room level of messy in some cases. Unkempt Explores and dashboards running wild, due to an imbalance between the number of data-thirsty users and data-capable administrators, is a recipe for pain that isn’t new to this space. Looker doesn’t help in this regard, and I argue it only exacerbates a pre-existing problem. For the tool to claim governance, it must impose some type of restriction, which, for many users I can think of, would defeat the purpose of the final term: self-service. I think that’s an overused term now. Don’t get me wrong, I love being able to order my lunch without muttering a word to another human being some days, but just because something is self-service doesn’t mean I can help myself to whatever I want. That logic doesn’t seem to hold in the world of data, and it is beyond frustrating. Looker’s promise of putting data at the fingertips of users carries a large unspoken assumption most business users will likely miss or balk at: it is only the data the administrators and data team deem you can have access to. I can’t go into a McDonald’s and order a Whopper, so if the data team says there’s no revenue data in Looker, then it’s not on the menu. It sounds easy enough to establish that ‘menu’ in Looker, but I’ve seen data teams crumble under intense pressure from business leaders to ‘share the data,’ which results in a free-for-all. Turns out the idea of sharing all the data with everyone comes with some potent consequences when done incorrectly.
Argument Two: The Dark Side of Data Democratization
One of the arguments I hear in favor of Looker is that it democratizes data and enables users without any coding experience (i.e., SQL) to dive into their organization’s vast sums of data and start putting together magic. Remember my menu analogy earlier? The idea I’ve heard over and over again is that anyone should be able to access any data, and Looker makes that easier for everyone. It promises shorter turnaround times from ingestion to dashboard, faster decisions, reduced costs, improved efficiency, synergy, and all those other now cringe-worthy phrases that get tossed into the mix preceding a layoff. Sure, Looker can do those things if, and it’s a big if, the data it is pulling from is sanitized, modeled, and vetted.
Sanitized… Really?
Yes, really. Data doesn’t tumble out of back-end services or web scrapers clean and free of debris. Never mind the data that comes out of CSVs or Excel files. This data is generated by back-end systems in formats that work for the service, not for a data model, and it often carries a lot of extra baggage. Specifically, I mean broken escape characters (that unescaped comma in a CSV value that turns your expected 5-column output into 20 columns on random rows), leading and trailing spaces, inconsistent casing, and even the structure of the data itself. Sanitizing the data accounts for these anomalies and little nuances to create a clean(er) set of data that can be loaded into a data lake or, if you must, a data warehouse.
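As a small, hypothetical example (BigQuery-style SQL, made-up table and column names), this is the sort of cleanup that ‘sanitized’ usually implies before any modeling happens:

```sql
-- Strip stray whitespace, normalize casing, and coerce types defensively.
SELECT
  TRIM(customer_name)                      AS customer_name,
  LOWER(TRIM(email_address))               AS email_address,
  UPPER(TRIM(country_code))                AS country_code,
  SAFE_CAST(TRIM(order_total) AS NUMERIC)  AS order_total   -- bad values become NULL instead of breaking the load
FROM raw_uploads.orders_csv;
```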
Modeling Data is Essential for Any Visualization Tool
Well… data can be a lot like clay, actually. More often than not, the final data sets that data or dashboard consumers see are combinations of different data sources, built through ETL or ELT pipelines (ETL meaning Extract, Transform, Load and ELT meaning Extract, Load, Transform; I’ll write a post about them soon). In these pipelines, data from various sources that are oblivious to each other in the back-end world can be joined together to create a larger, more wide-reaching dataset. Usually, the more data you can connect, the more useful a dataset becomes. ETL/ELT pipelines require resources to run and are a craft unto themselves. Attempting to do this in Looker is asking for Finance to come banging on your office door at the end of the month with the GCP bill in hand. This is why Data Engineers and Analytics Engineers are here: let us help you. Besides, we’ve seen some stuff.
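A rough sketch of why that join work belongs in a pipeline rather than in the visualization layer: materialize the combined dataset once on a schedule, so dashboards hit a small, pre-built table instead of re-running the join on every refresh. The dataset and column names below are hypothetical.

```sql
-- Run by the pipeline on a schedule (e.g., once per day), not by the dashboard.
CREATE OR REPLACE TABLE marts.orders_enriched AS
SELECT
  o.order_id,
  o.order_date,
  o.order_total,
  c.customer_segment,
  c.signup_channel
FROM raw_orders.orders AS o
LEFT JOIN raw_crm.customers AS c
  ON o.customer_id = c.customer_id;
```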
Vetted and Approved Data
If you’ve used data often enough or been a dashboard user enough times, chances are you’ve been in a meeting or discussion where two dashboards present the same metric with completely different numbers. If you haven’t been in the conversations that follow the awkward shuffle out of the meeting, it’s usually a mad scramble to figure out how each dashboard calculates the metric. In the worst-case scenario, it involves breaking open the dashboard to pull out locally applied filters sitting on top of a vague SQL statement that hopefully someone has access to. That is data that has not been vetted. If a metric is important enough that a business decision will be based off of it, the data and the means of calculating it should be vetted. Heck, if it’s important enough, put the logic in a pipeline and make it a field that everyone can access. Abstracting business logic into a visualization tool is a dangerous habit to form, and its impacts can be very, very real. Decisions made off of unvetted data can be the cause of budget adjustments or even headcount reductions. This is my soapbox, and I will jump up and down on it.
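As a sketch of what ‘put the logic in a pipeline’ can look like: define the metric once in a vetted view owned and reviewed by the data team, so every dashboard reads the same definition instead of hiding its own filters. The names here (marts.monthly_active_users, marts.fct_user_events, the 'session_start' event) are hypothetical.

```sql
-- One agreed-upon definition of the metric, reviewed and owned by the data team.
CREATE OR REPLACE VIEW marts.monthly_active_users AS
SELECT
  DATE_TRUNC(event_date, MONTH)  AS activity_month,
  COUNT(DISTINCT user_id)        AS monthly_active_users
FROM marts.fct_user_events
WHERE event_type = 'session_start'   -- the vetted definition of "active"
GROUP BY activity_month;
```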
For all the content above, the core message I want to deliver is that data democratization comes with a massive asterisk. It assumes the data is safe to share with users without being misused or misunderstood, and that the data is in a state where data users can actually use it. If the goal of a data consumption team in an organization is to connect Looker to raw data to skip these steps, that is an incorrect use of the tool. Simply put, Looker is not the right tool for data users, novice or otherwise, to dive into the jungle that is raw, unmodeled, or unvetted data in search of answers and insights. Before you go looking for an answer to the implied question: there isn’t a tool that is best for data users to dive into this type of data to make business decisions. What can be done is engaging your data and analytics engineering team to build a proper data ecosystem.
Argument Three: Looker Can’t Replace a Data and Analytics Engineering Team
Probably saw this one coming, huh? I’ve seen it done, and it’s a horrific mess of an outcome. As I’ve argued above, Looker is meant to be a tool for diving into sanitized, modeled, and vetted data, removing the need for users to write any SQL to get the data and answers they need. Those three steps, the sanitizing, modeling, and vetting of data, should be done by a team of professionals, not by data consumers who cannot write SQL. Sorry if that seems a bit harsh, but it’s the truth. You wouldn’t want someone who couldn’t change the socket on a socket wrench to work on the engine in your car; it’s really no different. Apart from the material impact that loading data engineering work into Looker would have on the quality of the resulting data, let alone the dashboards, it would be prohibitively expensive as the data ecosystem grew. I can imagine someone asking whether data engineers could use Looker to do this instead of ETLs or ELTs, and sure, I guess. But that’s like using a Ferrari to carry gravel around instead of a cargo truck.
Argument Four: The Dashboard Capabilities are Limited
This one is more me grumbling as a user of Looker than a critique of its premise. In my time working in Looker, I found myself repeating the same things:
‘Why can’t I just build the dashboard this way? Tableau can do that, PowerBI can do that.’
‘Picky, picky, picky.‘
‘Yes, yes, your GitHub integration is bugged… again.’
I was not impressed, and I am not a fan as a user, either as an Explore builder or as a dashboard maker. I think Looker might be good for initial data discovery in a clean ecosystem, but I do not see it as a valid contender against visualization tools such as Tableau and PowerBI. It is too restrictive in its design capabilities right now.
Argument Five: It’s Too Expensive for What it Lacks
Looker comes with a pretty penny of a price tag.
Keep in mind this does not include the extra cost that will be incurred from any jobs that execute SQL against the organization’s BigQuery instance. For what the tool offers, or rather does not offer compared to others, the price tag simply doesn’t make sense. Combine that price tag with it being used on unclean and poorly structured data (especially if it’s raw data) and you’ve got some financial pain likely on the horizon.
Why I care so much...
Data is something I am deeply passionate about. Not just the engineering side or the data science side, but the human side too. Data can have far-reaching impacts on an organization and its members, especially when it is used poorly. I have worked at organizations that used data beautifully and at others that were brought great pain by their improper use of it (or lack thereof). Seeing how Looker was used in my experiences with it, and what its misuse represents, is concerning. Novice data users, or those inexperienced in working with data, do not have the background to understand many of the nuances of data management. Compound that with pressure from ill-informed leadership and you have a recipe for complications, and perhaps even true human impact in the form of reductions. If nothing else sticks from this post, consider that any tool, despite its flashy marketing or its claims to be the best ever at what it does, can only honor those claims in the right circumstances. Seek out those who know what those circumstances are and find ways to bring about that environment as best you can. Do not allow a team or leadership to be blinded by the promise of speed and faster answers if how those answers are derived is unknown. And… please… can we stop using CSVs?