Phil Marius

Data Scientist, Data Engineer, Linux, and OSS Fan

28 Sep 2021

Big Data LDN 2021

It’s been several months since I last posted so here’s a quick update:

  • I moved city
  • I got a new data engineering job

Cool, now that’s done, on to what I did last week.

Big Data LDN is one of London’s hottest, most exciting data conferences (yay nerdy) and I had the fortune to go. It being my first professional conference ever, I had no idea what to expect. I was once president of the water polo club at university and I was half expecting a conference to be much the same: people in branded t-shirts trying to make eye contact with you for 6 hours whilst plying you with flyers. And it was pretty much just that, with some added talks of course.

I’m not going to cover much of the software and tech that was on show here as you can pretty much google “hot data startups” and get a list of them that way. However, what I found most interesting were some of the talks I attended and what they taught me about my current role as a data engineer.

Sidenote: I’m not going to talk about everything that was in the talks as I don’t want to steal the speakers’ thunder.

To summarise, data is inherently about people.

That is stolen from Kat Holmes’ “Why Your Data Strategy Isn’t Working”, a brilliant talk with an admittedly clickbait-y title that Holmes most certainly lived up to. She dived into 12 common reasons why a company’s data strategy may not deliver the outcomes it set out to achieve. I’m hoping it gets uploaded someday as I’ll definitely be tuning into it again. One of my favourite points was about how companies can be either “too tech centric” or “not tech centric enough”, and both can be valid reasons for an unsuccessful data team. Holmes’ point about it all being about people really came through here: focussing too much on data technology is not a silver bullet for data problems, and sometimes people just want something simple that does a “good enough” job. The key is a well defined balance: the tech should automate the time consuming stuff (like governance) without becoming so all encompassing that weeks are spent developing a solution that may not even be useful. And it’s very easy to get caught up in the latter.

This also ties in neatly with another point she made about how some companies are “too absolutist” in their data quality checks. Completely accurate and clean data is always the goal and data work should strive to reach it, but data doesn’t have to be completely accurate to derive insights. Getting caught up in making sure that every single row in the database is correct, or that every metric is accurate to 0.001%, can eat away at time better spent doing more productive things. However, this isn’t to be taken lightly: battles should be picked carefully, and this shouldn’t be applied to every metric or every piece of data.
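
To make that concrete, here’s a minimal sketch of what a tolerance-based, “good enough” metric check might look like, as opposed to insisting on exact reconciliation. The function and thresholds are entirely my own illustration, not something from Holmes’ talk:

    def metric_is_good_enough(measured: float, expected: float, tolerance: float = 0.01) -> bool:
        """Return True if `measured` is within `tolerance` (relative) of `expected`."""
        if expected == 0:
            return abs(measured) <= tolerance
        return abs(measured - expected) / abs(expected) <= tolerance

    # e.g. a daily revenue figure within 1% of the finance team's number is
    # probably fine for a dashboard, even if it isn't reconciled to the penny
    assert metric_is_good_enough(measured=10_050, expected=10_000, tolerance=0.01)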

On the topic of data teams, Starburst ran two talks about Data Mesh, hosted by Andy Mott. This is a more recent paradigm for organising data teams and Mott succinctly explained the main principles behind it. In essence, it’s all about domains and making them responsible for the data they ingest, process, and serve (i.e. the microservices of data). What is a domain, you ask? Well, that’s where the definitions get a little less concrete. A domain can be aligned to the company in several ways: to how data is produced and where it originates, to how the data is aggregated and organised (e.g. a customer 360 view), or to how the data is consumed. One thing remains clear though: domains should mirror the functions of the business itself. For example, if a business has a marketing department, there could be a marketing data team that sits alongside it, ingesting the department’s data, processing it, and then serving it for other data stakeholders to use.
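
To illustrate the idea, here’s a rough sketch of what a domain-owned “data product” might look like in code. It’s purely my own illustration with hypothetical names, not something from Mott’s talks:

    # A hypothetical sketch of a domain-owned "data product" in a data mesh setup.
    # The marketing domain owns the whole lifecycle: it ingests its own raw data,
    # processes it, and serves the result to other data stakeholders.

    from dataclasses import dataclass
    from typing import Callable, Dict, List

    Rows = List[Dict]

    @dataclass
    class DataProduct:
        domain: str                         # the business domain that owns this product
        name: str                           # what downstream consumers know it as
        ingest: Callable[[], Rows]          # pull raw data from the domain's own sources
        transform: Callable[[Rows], Rows]   # apply the domain's own business logic
        serve: Callable[[Rows], None]       # publish for other domains to consume

        def run(self) -> None:
            self.serve(self.transform(self.ingest()))

    def ingest_campaign_events() -> Rows:
        # e.g. read raw click events from the marketing team's own systems
        return [{"campaign": "autumn_sale", "clicks": 120}]

    def aggregate_by_campaign(events: Rows) -> Rows:
        # the marketing domain decides how its data is aggregated and organised
        return events

    def publish_to_warehouse(rows: Rows) -> None:
        # expose the result wherever other stakeholders consume data
        print(f"published {len(rows)} rows for downstream teams")

    marketing_clicks = DataProduct(
        domain="marketing",
        name="campaign_click_summary",
        ingest=ingest_campaign_events,
        transform=aggregate_by_campaign,
        serve=publish_to_warehouse,
    )

    marketing_clicks.run()

The point of the sketch is the ownership boundary: the marketing team owns all three steps itself, rather than handing raw data off to a central team to process.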

Following on from this, Lydia Collett’s talk on The Power of Community was all about how to foster a community of “data lovers” in a company: data enthusiasts and experts who aren’t necessarily part of the data team. She described her own journey of doing this at John Lewis and what she learnt along the way. One key thing she mentioned is how this can further the data team’s data literacy agenda and enable teams to derive their own insights with data.

Coming on to one of my favourite talks of all, and it’s (almost) not about data. The English Institute of Sport ran a talk, hosted by Joshua Wass, on using data effectively to improve the performance of their Olympic and Paralympic athletes. He did a deep dive into how the data team and the different departments at the EIS brainstormed how to use the data being collected in a more meaningful and impactful way. Wass went on to describe their process of “creative brainstorming”:

  • “braindumping”: where any and all ideas are written down on paper
  • stipulations: where actions like reverse and substitute are applied to these ideas to transform them into potential new ones
  • force fitting: where these new stipulated ideas are fitted to themes, both silly ones and ones relevant to the project
  • impact and time assessments: where the real world kicks in and these ideas are then categorised by their impact and timelines
  • prioritisation: where the choosing is finally done and the final ideas are then prioritised by importance

Being an engineer at heart, I’m all about tech and using tech to do things. However, Wass really opened my eyes to how creatively data can be used and how getting people more involved with data can be done easily.

And saving the best until last, Tim Harford’s How We Fool Ourselves was the final keynote of the two days, in which Harford told the story of Han van Meegeren’s forgery of the Supper at Emmaus. The main takeaway was that “people forecast what they want to see”: wishful thinking, or “motivated reasoning”, is a primary factor in how people deal with and respond to data. As much as we in the data industry like to think we’re “above data biases”, that’s simply not true; both data teams and those who use data can fall into the same trap. Some set out aiming for a particular conclusion and will cherry pick or reinterpret the data presented to them to reach it.

To conclude, I’m glad I went. My biggest takeaway from the talks I saw was that people are what matter most with data: not technology, not cool software, just people. There seems to be an overall attitude of “moving the data closer to the people that need it”, and this can be seen in data mesh’s decentralised approach of keeping data within the domains that produce it. Data literacy is another key point: both Collett and Holmes addressed it in their talks, and creative brainstorming can’t be done without some form of data understanding. There is definitely a lot to unpack here and I’m going to be using some of these learnings going forward in my career.