Sarah Oh: Hello. Welcome back to the Technology Policy Institute’s podcast, Two Think Minimum. It’s Thursday, July 30th, 2020. I’m Sarah Oh, senior fellow at TPI, and I’m joined by Scott Wallsten and Nathaniel Lovin. Today, we’re delighted to talk to Felipe Hoffa, developer advocate and software engineer at Google. Felipe is originally from Chile and is now based in San Francisco and around the world. If you’re involved in big data and data science, you may recognize him as a familiar name and face answering thousands of developer questions on Stack Overflow and Reddit, which are read by millions of programmers. For Google, he also records tutorial videos on YouTube, gives conference talks on big data, and writes blog posts on the latest developments in cloud tools. Felipe is a leading voice on Google’s cloud computing products. Thanks, Felipe, for joining us on Two Think Minimum today.
Felipe Hoffa: Oh, it’s an honor to be here. Thank you for inviting me.
Sarah: Here at TPI, we are big fans of your work. It’s a treat for us to host you on the program. We’ve used Google Cloud Tools for several research papers and for our listeners, many of whom may not be familiar with cloud computing, could you explain a little bit about the trend in the last 10 years towards cloud? What BigQuery is and how people use it?
Felipe: Cool. There’s so much in that question. I love that a big part of our conversation today is also how you saw this trend. Like, so you’ve been doing econometrics for this many years and you are getting increasingly larger and larger problems. Please, you can tell me, like, there is a moment where whatever supercomputers you have in your university research center are not enough. And that’s when the cloud comes. It’s not your computer. It’s just a place where you can get the results that you want and you don’t need to invest in any, it’s not your computers. You can just get them even by the second. If you need 3,000 or 5,000 computers, they are working within 30 seconds on the problem that you have to solve. Is that how you see it?
Sarah: Yes, definitely. That’s one of the reasons why I personally am such a big fan. We don’t have to pay for a server room or a technician, and we don’t even need a lot of bandwidth on our own computers.
Scott Wallsten: It’s truly the democratization of powerful computing.
Sarah: You just send one terminal line to the computers out there and they’ve crunched the numbers, they store the numbers. It’s amazing.
Felipe: Exactly. So what size of data you’re working on? Like I’ve been also following your work and I’ve been impressed by what you have published. I’ll let you say the number.
Sarah: Yeah. Okay. Well, I don’t know if it’s that big in your eyes, but for us 40 terabytes is pretty big. It’s bigger than any computer we can buy.
Felipe: Yeah, that’s good. It’s like you have 40 terabytes of data and you need to get results, you need to derive policy out of these numbers and analyze them. That’s a pretty big deal that you have a place where you can store it, analyze it. And that’s what we have been doing. And I’m lucky to have first hand
Scott: I think it would also help people if you tell them what BigQuery is. I mean, that’s sort of our, that’s our go to tool. And Nathaniel, who you’ll be hearing from soon, you know, has written lots of new code for it. But I think a lot of people know about the cloud, but don’t really know how it works. And BigQuery is really important.
Felipe: Oh yes. So yes, as context, I work for Google. I started as a software engineer, as Sarah was saying, and then two years after I started, I joined Cloud and I became a developer advocate for BigQuery. And BigQuery is a product, a cloud product. Like it’s a service where you can store as much data as you want. Like you have data, you have 40 terabytes of data. Perfect. BigQuery will take it. And then you can analyze that data just by knowing SQL or connecting it to other, to your favorite statistics tools. Like, are you a fan of R? R would work perfectly with BigQuery and leverage its capacity to handle 40 terabytes and to go over 40 terabytes of data, doing joins and analysis as fast as possible, which I know sometimes it’s things that used to take you hours. Now it takes you seconds without much thought. Yes. I liked how you are smiling right now. Like, yes, that’s where you’ve been.
Sarah: Well, it takes a little bit longer than a few seconds. When we ran our queries, sometimes they would run, like we would have 200 tables and the query would run overnight on all the tables. Nathaniel, do you want to explain a little bit?
Nathaniel Lovin: I think that’s possibly because we were running a bunch of updates. We were like adding variables as the update thing. So like we were marking rows by if they were a website that was a pirate website or a legitimate streaming website. So as we were updating it, it took a while to update, but for each variable, like it’ll only take like an hour for each variable. Whereas if we were doing this in Stata, it would literally be not possible because it would just take ages and ages and ages to run.
Felipe: Oh, for sure. For sure. Like when I say seconds for me, it’s when it’s something that took a night before now, it’s really in 40 seconds. But if something that was just not possible for you before that you went overnight over 200 tables, ran the ratios, rank, whatever you had to and suddenly it’s
Nathaniel: I was working on combining FCC 477 data with census population data earlier and I did it manually, in Python on my computer, and it took like overnight to run, and I did it in BigQuery and it took like 45 seconds. It’s just so much better once you get the data into BigQuery, which is
Scott: And another amazing piece of it, I don’t know if it would have applied in this particular situation though, is that you can, you can pay for the processing power that you want. And so if you need something done really, really quickly, you can have it done really, really quickly. If it’s something that doesn’t have to be done as quickly, you can do it much less expensively and on a smaller number of computers. There’s everything is, you can customize everything to the type of computing that is most suitable for what you need.
Felipe: The ramping up is crazy. Like first I can tell people, you don’t even need a credit card and you can use BigQuery. You just create an account. Everyone gets a free terabyte of analysis every month, and that’s a great start. Then you start paying per query. So you have one question, you pay cents for that question. And then when you’re going to this massive scale, you have these enterprise programs where you just pay, okay, how much more power do you want? I know you’re doing a crazy thing that’s going to take all night. We can also make it fast, put way more power on it if you’re running at that great level. But the product itself scales from zero, free, to as much as you would like.
Sarah: I’m curious to hear a little bit more about what you’re seeing out there among researchers. So are we, I mean, I feel like we’re a small fish, 40 terabytes, but what are people doing out there on BigQuery?
Felipe: I mean, I’ve seen so many cool things. 40 terabytes is totally respectable. There are whole companies that have less than that and still they get a lot of value out of this just because it scales that way. People are even analyzing genomes inside BigQuery. Like how do I transform the whole genome into SQL relational tables that other people can analyze? And that has been like scientists at Stanford. Just being able to go through not only the genome of one person, but hundreds of people, a thousand people at the same time at this scale is crazy. Really shows you how much you can extend the power required. I know you do also a lot of, for your research, you’re analyzing a lot of internet data, how it behaves. You have this great paper on whether municipal broadband helps or doesn’t help. And then you get to prove econometrically if it does or not. I love that kind of research. I don’t know if you’re familiar with the M-Lab project.
Sarah: Yes. That’s a good segue for us to pitch our new project where, well, we’re working on a beta project of our own looking at broadband data, what Nathaniel was talking about. And we’ve been exploring BigQuery’s GIS tools as well. So we’ve been loading GeoJSON files. We use some code, Python code to transform them so we could load them. And then now we’re using a Jupyter notebook to run intersects and queries on the GeoJSON data. Yeah, I think, I mean, that’s very powerful. Do you see other groups using that?
Felipe: Oh, for sure. I just need to take a mental note for my team. Like this is our whole work. Why do you have to run Python code to load this GeoJSON? We should be able to just load them into BigQuery. That’s our, Google’s job. But if it only took a little bit of processing and then you had the data in BigQuery, it was worth it for you.
Nathaniel: It’s mostly because the FCC’s shapefiles are not very good shapefiles. So like there are errors in the raw data that then we have to do a little processing on in Python to produce valid GeoJSONs so that they can be put into BigQuery.
Sarah: [Inaudible] Yeah. We’re trying to clean up the duplicate for Texas edges.
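[Editor’s sketch] The conversation does not say exactly which errors the FCC shapefiles contain, so the following is only an illustration of the kind of Python pre-processing Nathaniel describes. It assumes two common problems, repeated consecutive vertices and rings that are not closed, and repairs them in a GeoJSON Polygon. The function names `clean_ring` and `clean_polygon` are hypothetical and are not TPI’s actual code.

```python
def clean_ring(ring):
    """Return a repaired copy of one polygon ring, or None if it is degenerate.

    Assumed repairs (illustrative only):
      1. drop consecutive duplicate vertices
      2. close the ring so the first and last points match
    """
    cleaned = []
    for point in ring:
        point = tuple(point)
        # Skip a vertex that repeats the one immediately before it.
        if not cleaned or cleaned[-1] != point:
            cleaned.append(point)
    # GeoJSON requires a linear ring to end where it starts.
    if cleaned and cleaned[0] != cleaned[-1]:
        cleaned.append(cleaned[0])
    # A valid linear ring has at least four positions.
    if len(cleaned) < 4:
        return None
    return [list(p) for p in cleaned]


def clean_polygon(geometry):
    """Apply clean_ring to every ring of a GeoJSON Polygon geometry dict."""
    if geometry.get("type") != "Polygon":
        raise ValueError("this sketch only handles Polygon geometries")
    rings = [clean_ring(r) for r in geometry["coordinates"]]
    # Drop rings that collapsed to fewer than four positions.
    rings = [r for r in rings if r is not None]
    return {"type": "Polygon", "coordinates": rings}


if __name__ == "__main__":
    broken = {
        "type": "Polygon",
        "coordinates": [[[0, 0], [0, 0], [1, 0], [1, 1], [0, 1]]],
    }
    print(clean_polygon(broken))
```

Real shapefile problems such as self-intersections need a geometry library; this sketch covers only the two simple cases named above.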
Felipe: When I feel very lazy, what I do is I load my JSONs as strings to BigQuery. And then I do all of the processing, fixing, cleaning inside BigQuery. Like even if it’s an invalid JSON, you can upload it as a string and then play inside, like someone that has one hammer that can do a lot of stuff. Something that’s phenomenal about BigQuery is the internal GIS capabilities that allow you to even run joins, like find things that are close to each other, and optimize those queries. And soon you’ll be discovering some of that magic.
Sarah: Oh, yeah. So Nathaniel, you’ve done a little bit of digging around the machine learning capabilities, right? Well actually let me plug Nathaniel’s series where he kind of reconfigured a lot of standard econometrics tools for the BigQuery system. Do you want to talk about that Nathaniel?
Nathaniel: Okay. So there’s this, so BigQuery machine learning allows you to run a lot of different machine learning models, TensorFlow stuff, and like K-means clusters and stuff and the thing, but for our purposes, the biggest thing is that it allows you to run ordinary least squares regressions. But as like a machine learning tool it’s focused on predicting more than analyzing. So like it doesn’t have a lot of things that econometricians use, like standard errors and F tests, because that’s not as relevant for machine learning statistics. But given the data that it produces, you can calculate those things. And so I’ve made a few programs that calculate ordinary least squares standard errors and ordinary least squares robust standard errors, and then combining stuff together to do two-stage least squares regressions.
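[Editor’s sketch] To make the idea concrete, here is a minimal pure-Python illustration of how standard errors can be derived from a fitted line and its residuals. It is not the BigQuery code Nathaniel wrote. It assumes a single regressor, uses the classical formula for the ordinary standard errors, and uses the HC0 (White) formula for the robust slope standard error. The function name `simple_ols_with_errors` is hypothetical.

```python
import math


def simple_ols_with_errors(x, y):
    """Fit y = intercept + slope * x by OLS and report standard errors."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residuals are the quantities a prediction-focused tool leaves you with.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    # Classical errors: residual variance with n - 2 degrees of freedom.
    sigma2 = sum(e ** 2 for e in residuals) / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1 / n + x_bar ** 2 / sxx))

    # HC0 robust error for the slope: weight each squared residual by
    # that observation's squared deviation of x from its mean.
    meat = sum(((xi - x_bar) ** 2) * (e ** 2) for xi, e in zip(x, residuals))
    robust_se_slope = math.sqrt(meat) / sxx

    return {
        "slope": slope,
        "intercept": intercept,
        "se_slope": se_slope,
        "se_intercept": se_intercept,
        "robust_se_slope": robust_se_slope,
    }


if __name__ == "__main__":
    fit = simple_ols_with_errors([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
    for name, value in fit.items():
        print(name, round(value, 4))
```

With many regressors the same quantities come from matrix formulas, which is where running the sums inside a warehouse over terabytes of rows becomes attractive.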
Felipe: Yeah, for me, that was crazy. It’s like the first time I became aware of TPI was when you published this, that you had taken the linear regressions that BigQuery does. You had transformed them into an econometrics tool. And then I started researching all of you and what TPI does. I subscribed to the podcast. I heard so many smart people in this podcast that being here is a big honor. But part of my job as developer advocate is to be out there promoting the product, talking about, being an expert, like to be on Stack Overflow and answering questions. That is required to be an expert. That also allows you to know who I am. Because even if you don’t read my Twitter, if you don’t go to my LinkedIn, if you research BigQuery you will end up on Stack Overflow and there I am, ready to help. But the other part of my job is taking what I see out there, taking these conversations and bringing them to the team like, look, team, someone is taking BigQuery linear regression and they are not at all interested in the model and the result of the model. All they want is to get the residuals so they can use them in their formulas, and look at all the steps they had to do. Not to mention that you did a great job documenting and sharing code to make this helpful. But I just found it phenomenal.
Scott: Let me ask you a question about the whole area. You said that you’re also supposed to promote the product. And I guess, you know, cloud is generic, BigQuery is the product although I guess some of us who work in it think of it more as a way of life, I guess at this point, not to sound too much like an advertisement, but it fits into, there are lots of companies that do cloud. That’s a pretty competitive market. I mean, there’s, Google’s Cloud and there’s Amazon and there’s Azure and then Oracle and lots of others have them. Does BigQuery compete directly with services at those others? Or is there something, I guess, how is it different from other services offered at other cloud platforms?
Felipe: First, I need to be classy, I like to be classy, but yes, there are more clouds. BigQuery is a pretty unique product in how magic it makes that analysis. It also comes from a place where so much of the technology that you see current data center world comes out of Google just because Google had to solve the problem to grow at this scale first. Then it invented the technology, then it produced the papers. Then others took those papers and implemented their own because they could not use Google. And then comes a day where Google says, you know what, we’re going to open up our products. And we are going to share how we do our, not only the papers, but the products and commercialize them. So yes, there are alternative clouds, products like BigQuery, maybe there’s one, but I’m going to stay classy. I’m not going into the
Sarah: I can list them. I mean, you might not want to because you’re Google, but we looked around. I mean, we use other tools as well. So Amazon’s Aurora, I think, is another tool, and Apache Parquet. And I looked at RAPIDS AI on CUDA. They’re doing some SQL on like big GPUs. There are other big data SQL applications, but I guess for a group like ours, like it’s just easier to have a UI, like an easy browser window. And it’s, I guess it’s easier for beginners like us to just get started right away. But I do think, I mean, I have been looking at other solutions, but we still keep coming back ’cause it’s easy.
Felipe: Yeah. I mean, before being a Googler, I’m an engineer that loves data, and knowing that there are several products competing for my love, even if there is only one that I use, just the fact that there is more competition, they all push each other’s boundaries. And BigQuery, I’m glad to hear that it stays a good place for you. Even if you like, yes, you should always make informed decisions. It’s great that you are able to have a lot of solutions and it’s great to know that you found your favorite. And if you ever find another favorite, let us know. That will all ignite the engineers more.
Sarah: So I guess, yeah. To wrap up, maybe we can talk a little bit more about where you see the future of like this kind of tool set for economists like ourselves. It sounds like there are biologists using big data. I mean, it’s only the beginning of having like access to big super computers and then combining analysis on top. Where do you see like BigQuery going, you know, in five, 10 years?
Felipe: That’s a good question. Well, BigQuery will go where you want it to go. So I think people like expressing themselves is great. Something that’s really crazy is how much better we get the whole time at collaboration and just influencing each other, and like the fact that you have published what you are doing. The fact that you are taking one tool and making it do things that it wasn’t designed for, but you’re sharing your code. I only wonder how we can make collaboration even more fluid. How can I let more people know that you discovered how to do econometrics in BigQuery, share this code, bring it to where people share the data? Sharing data is so great. I don’t know if you have loaded most of it, or if you have found the FCC data in BigQuery already.
Nathaniel: I’ve been loading it mainly. I’m not sure if it’s in BigQuery already or not, but I’ve been loading it just from CSV files.
Felipe: Yeah. Because, for example [inaudible]. So we have made, we have public data sets that have the census data ready for you to use. The question is were we able to make that connection and save everyone time, like power, the data, especially for the things that you are doing should be readily available and save you this time and let you focus on econometrics.
Sarah: Yeah. I think at one point I asked you, like, if you guys could set up the FRED data, the federal reserve data, and then I think you bumped it back like, Oh, I don’t know. But yeah, there’s
Felipe: People calling me out.
Sarah: No, it was me asking for free stuff. I can’t do that.
Felipe: So the thing we need to scale and everyone in the clouds needs to scale is how do we make public data available even more easy to work. And it would be great if the FCC just made it so easy. Just they made all of the data available in a shape that you want it, if not, it’s up to us like a company to have a program that is ready to take your request. I think we’re going there.
Scott: That’s, I mean, that’s one of our objectives too. Obviously I know we’re much, much smaller scale, but with the, you know, the FCC data and the broadband data and the M-Labs, including the M-Labs data and lots of other sources, we’re trying to make it all easily available to everybody and, you know, and connect it. That’s one of our objectives and all that stuff. That’s using BigQuery.
Felipe: Yes, thank you for taking the ball. I should have said that. If you have loaded all of these public data in BigQuery, with a couple of clicks you could make it public. And then the question is, how do we let more people know that it’s there? How do we make it easier to find?
Scott: Right. I’m curious, how did you end up doing what you’re doing now? I mean, it’s always easy to look back on a career and say, Oh, I went from here to here to here, but is this kind of what you had predicted that you would want to do? I mean, I know BigQuery didn’t exist of course, when, when you started your career, but are you sort of surprised where you ended up or is this, you know, the fulfillment of a lifelong dream?
Felipe: That’s a great question. And on one hand I’ve been very lucky. Like I found the perfect job for, I love analyzing data and I love telling stories. And I went to film school many years ago just because I love telling stories, ideally for theater, because I love being on a stage, but all those were hobbies. Before joining Google, I was in Chile, jobless because I wanted to be jobless. Like I gave myself a year to discover myself.
Nathaniel: And you had just graduated from film school.
Felipe: Oh no, no. Film school I did in 1999, but in 2000, I told to myself, no, in December, 2010, that I was going to give myself a year out of the market just to discover myself and make money not even as an entrepreneur, but as an independent person. And the first thing that I started doing at that time was okay, how do I make money? Now? I will analyze data and I will write blog posts about it. And just start, it’s like the one place where I feel I should go promote myself by writing blog posts about it. Analyzing data
Nathaniel: Did that kind of come to you while you were during your year of self discovery that you just felt like you wanted to start writing about this? Cause that’s, I mean, taking a year off of doing a, you know, some kind of gap year is a great idea. I’m trying to encourage my daughter to do it, but usually you don’t hear people say, and then I discovered I wanted to blog about data.
Felipe: Oh, the thing is, I had always done that. And the discovery was like looking at myself and seeing that every time that I’m free to do anything, I analyze data. Like even at any job as a software engineer, there always came the time where I just started doing these visualizations of it and showing my managers, look, look. So, yeah, that’s my happy place. But my year off was starting to take off when Google called. So instead of taking a year off, I moved from Chile to San Francisco. I became a software engineer. And two years later someone offered me that I should be a developer advocate for BigQuery. Not because I wanted, not because I planned, but because they saw me, they saw it and they brought me to my happy place.
Nathaniel: So how do you manage developers? I mean, do you, I mean, I assume they’re all probably somewhat like you in the sense that they like data, and like me too, but then that means they’re also going to have all their own ideas about where they think the product should go, which is good, and also could make your life difficult. How does it work? How do you keep everyone kind of together and working in a particular direction while also allowing them to remain creative?
Felipe: Yeah, that, so I put myself on the creative side actually for that, I just try to convince people by ranting. Like look this is important, look this is good. Fortunately we have great teams and great organizations that are way more organized than me making, transforming rants into PRDs and OKRs and everything that needs to happen. But for me, I’m just, I love focusing on the communicating part. I’m listening, I’m collecting all of the information like I’m out on Twitter, finding your rock first. That is like, Whoa, this is really cool. I need to read it now. I need to make it a rant, and then let people figure it out. It’s a lot about knowing yourself, what you like doing and what others can do better.
Scott: How big is your team?
Felipe: Depends on what my team means, because we have the developer as big as Google, but then we have the BigQuery team. And then as Google does everything, like even the file system is combed. So it’s a big, big team collaborating. You have products adjacent to BigQuery. We are connecting. So it’s a big, big, big thing with a lot of different heads, priorities, but it happens.
Scott: So here’s the question you probably don’t want to answer. Did you watch your boss yesterday at the antitrust hearings?
Felipe: Oh, I did not. What, what did you think about it?
Nathaniel: Well, the hearings themselves, you know, these hearings are mostly an opportunity for the Congress people to give their own speeches rather than listen to anyone’s answer. But, you know, I think slowly they’re learning something, it’s, you know, they ask better questions than they did the very first time they interviewed Mark Zuckerberg, for example. So, and also I think your CEO was sort of overall rated as having the best background.
Felipe: Cool. Oh, that’s nice to hear, but it’s such a balance of keeping, these businesses are doing a lot of good for the world, but it’s also good. I don’t know, you know, the markets better than me, how to keep them healthy for everyone. I love what Google has done for me, like personally, like I learned so much from Google, the webpage. I’ve had also the job place and enjoy having a phone that gives me maps and they won’t have access to maps. Yeah, it’s crazy.
Scott: So there’s, I mean, there’s no way you can know everything that everyone does on BigQuery, but are there any fields or areas that you think have been notably absent? Let’s put it two ways. One, whether there’s something that’s notably absent, or something totally surprising. So like for example, the National Endowment for the Humanities funds some digital humanities, which does some really interesting work that nobody, you know, outside of that smaller area would have thought of. I don’t know to what extent they use cloud, but there are all kinds of surprising uses of data. And you know, do you see some of those, or do you wish someone else would come on to BigQuery because they have tools that might be useful to them? That’s kind of an impossible question, I suppose.
Felipe: Helpful in a different way, like BigQuery and its uses surprise me every day. Most of my internal rants are: why wasn’t it easier for you to load these JSON files? All of us, we focus on doing this crazy stuff. This crazy stuff, like look, doing the impossible is what we strive for every day. The problem is how do we make the ramp up very, very easy for everyone. And then you allow other people to do the impossible. And yes, if I have to call out something there, it’s what you are doing. Like when people ask me about crazy stuff, I quote your papers. Let me just take the residuals and condense the 40 terabytes in one formula that I guess you’ve used for years.
Scott: Yeah. I like that you call our papers crazy. That’s an excellent compliment.
Felipe: I mean, I come from outside of the pole. I know my father is an economist. I feel very close to all the crazy stuff he pulled off back in his day.
Scott: Right. Well, Chile actually is known for having lots of economists in the government.
Felipe: Oh yes. Oh. And we have our head economist too at Google but I’m blanking out of his name.
Scott: You don’t mean Hal though, do you?
Felipe: Yes. Another crazy thing at Google is that I can write to one of our internal forums. Hal Varian will call out how wrong I am and it feels so good.
Nathaniel: That’s funny, Hal it comes to a lot of our, attends lots of our events, participates in them. And yes, that sounds like him. I mean, it’s always in a good way of course.
Felipe: Of course yeah. But the fact that I’m doing my BigQuery stance, like finding correlation of, Oh, look, I can do the correlations with now with BigQuery. And now that I can do correlations, I can get all of these interesting results. And then Hal will read my email because he’s paying attention and Hal will call me out. That’s amazing. I feel very closely. I’m glad that you get to spend a lot of time with him too.
Scott: It’s pretty amazing. And it’s also interesting.
Felipe: It’s like we surf. We surf even of the internet and also he’s on my mailing list asking how to make the M-Labs data more powerful with BigQuery. Side note: if you go back to BigQuery’s first day, like 10 years ago, BigQuery was revealed to the world at Google I/O. The first demo that was given with BigQuery to the world on that stage was using M-Labs data, and it’s all recorded. It all comes together.
Scott: What was the data at the time? Do you remember?
Felipe: I mean, M-Labs has been collecting all of these measurements
Scott: So it’s the same measurements that they have now just earlier reincarnations.
Felipe: Exactly. M-Labs at that time needed a place to store it and make it useful, and it was the same time that BigQuery was getting started. So it was a perfect demo of how to connect the bandwidth measurements and how the internet works.
Sarah: That’s a great segue for us, I guess, to wrap up. We have some more like BigQuery use cases that we’re working on. We’re doing broadband studies and an internet home usage study. We’ll be using these tools in the cloud happily and also with a little bit of frustration because it’s hard to load the JSON files that are broken.
Felipe: Anyone from the team, please pay attention. Okay. I’m sending them this for sure.
Sarah: Thank you so much for your time today and for all your work on Stack Overflow and Reddit and answering questions for us.
Felipe: Have a great day. Let’s stay connected.
Scott: Absolutely.