Sarah Oh:
Hi, welcome back to Two Think Minimum. Today is Wednesday, February 23rd, 2022. I’m Sarah Oh Lam, a Senior Fellow at the Technology Policy Institute. I’m here with TPI President and Senior Fellow Scott Wallsten and TPI Research Associate and Software Engineer Nathaniel Lovin. Today, we’re delighted to talk about the technology needed to build and use a broadband map. States are soon going to have to make broadband plans, and funds are going to be spent from the IIJA, or the Bipartisan Infrastructure Law, soon, and each of the states is going to be using broadband maps and data. Well, luckily, here at the Technology Policy Institute, we’ve been building a broadband map that has functionality to serve those purposes. And we have our team here to talk about it. Scott, can you tell us a little bit about why we’re doing this project?
Scott Wallsten:
Sure, Sarah. So, well, as you know, we’ve done lots and lots of empirical work on broadband, and over the last decade plus, every time we wanted to do something, we would download one dataset from the FCC or the census, and then work with it to merge it with another dataset because real insights come from combining datasets, usually not just a single dataset by itself, and the downloading of the raw data and merging it to whatever specific other dataset we needed, it would take a long time. It was complicated and it seemed like we were doing it over and over and over again, which seemed kind of wasteful. At the same time, there was this general talk about there not being a broadband map, and that sort of seemed silly to us because there’s lots of broadband data. It was possible to make a map.
And of course, some maps did exist, but we decided, you know, this problem had a solution, and we would provide it. And so, we built this platform that allows us to take really any dataset that exists and merge it with any other dataset and do lots of things to it. You’ll talk later about how we built it, the front end we built for it, and how much more we can do on the backend. It’s allowed us to answer policy questions and just be incredibly, incredibly useful. And I think it also highlights that it’s not just the map, it’s how you use the map. And, like you said earlier, when we were just talking, this is really just a tool in our toolkit. Having a map isn’t enough; you have to be able to work with it and use it for analysis and to answer questions.
Sarah Oh:
Nathaniel, could you tell us a little bit about why our map and how our map is different from other maps? How many datasets have you personally loaded into that map?
Nathaniel Lovin:
Let me actually count this so that we have an actual number. 1, 2, 3, 4, 5, 6, 7, 8, 9. There are at least nine different data sources. It depends on how you divide them up, obviously. The main difference between our map and preexisting maps is the depth of it. We have substantially more datasets aggregated together than most of the other maps out there. We also have a real commitment to keeping it updated and not letting it stagnate, and it is set up so that it can be updated near instantly when new datasets come out. We have put a lot of work into not just displaying the data, but also allowing analysis of the data via graphs and plots and regressions, which I think nobody else has really been doing to the extent that we’ve achieved it. So, the first and main part of it is I’ve built several tools that make it very easy for us to combine the datasets at a specific level into bigger geographic areas so that we can have analysis, not just at the raw level of the data, but at aggregated levels. I’m using GIS and stuff to analyze what census blocks are in each zip code so that I can assess what level of service is theoretically available, according to the FCC Form 477, in a given zip code. So I do that for, you know, 23, 24 geographic areas in total. And so…
Scott Wallsten:
You mean geographic types, right?
Nathaniel Lovin:
Yeah. Geographic types. That’s the biggest thing. The second part is just we’re using cloud technology like Google BigQuery to allow that computation of those combinations to go extremely rapidly. If I was just running this strictly on my laptop, it would take days to do all of the calculations, but I can put it into BigQuery and it will do the calculations in five, ten minutes max, usually less than that, but, you know, I can run a decade’s worth of combinations in, you know, an hour for all the geographic areas, if I need to.
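The block-to-zip-code roll-up described above can be sketched roughly as follows. This is a minimal illustration, not TPI's actual pipeline: the block IDs, speeds, crosswalk, the 25 Mbps cutoff, and the function name `aggregate_to_geography` are all made up for the example, and it assumes each census block maps to exactly one larger area, which real GIS overlays do not guarantee.

```python
from collections import defaultdict

# Made-up block-level records: census block ID -> maximum advertised
# download speed in Mbps. Real Form 477 files carry many more fields.
block_speeds = {
    "110010001001000": 100.0,
    "110010001001001": 25.0,
    "110010002001000": 940.0,
    "110010002001001": 0.0,
}

# Hypothetical crosswalk from census block to ZIP code. In practice this
# would come from a GIS overlay; here every block maps to exactly one ZIP.
block_to_zip = {
    "110010001001000": "20001",
    "110010001001001": "20001",
    "110010002001000": "20002",
    "110010002001001": "20002",
}

SERVED_THRESHOLD_MBPS = 25.0  # assumed cutoff, chosen only for illustration


def aggregate_to_geography(block_values, crosswalk):
    """Roll block-level speeds up to whatever geography the crosswalk names."""
    grouped = defaultdict(list)
    for block_id, speed in block_values.items():
        area = crosswalk.get(block_id)
        if area is None:
            continue  # block is not covered by this crosswalk
        grouped[area].append(speed)

    summary = {}
    for area, speeds in grouped.items():
        served = sum(1 for s in speeds if s >= SERVED_THRESHOLD_MBPS)
        summary[area] = {
            "blocks": len(speeds),
            "max_mbps": max(speeds),
            "share_served": served / len(speeds),
        }
    return summary


zip_summary = aggregate_to_geography(block_speeds, block_to_zip)
for area, stats in sorted(zip_summary.items()):
    print(area, stats)
```

Swapping in a different crosswalk (block to county, block to school district, and so on) reuses the same function, which is one way to think about supporting many geographic types from a single block-level source.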
Scott Wallsten:
And something like that, of course, is very useful potentially to states who have to come up with their own broadband plans and may each have different preferences on how they want to look at broadband within the state. Some states might be interested in a high level at the counties. Others might want to know about congressional districts or even state legislative districts or school districts, any of that’s possible. And I hesitate to use the word simple, because it’s simple for me and Nathaniel knows all of the complexities of the backend, but we can do it all without taking too much time.
Sarah Oh:
Yeah. And that brings us to the point of like, you know, what we do is we look at all the broadband maps that are out there. And, we also are looking at all the state broadband maps. Every state seems to have a broadband office now, and they’re staffing up, they’re getting ready to spend billions of dollars of infrastructure funds. And we want to be like a resource for people who are building maps and using the maps. And I just want to bring it back to Nathaniel a little bit. Like what have you seen in the maps that are out there? Like, you know, we’re one of like 50 maps, broadband maps that are out there. What do you like about other maps? What do you like about our map? What could be more helpful for policymakers?
Nathaniel Lovin:
The thing that I really like about our map is that it puts the data into context better than a lot of the other maps out there because of the regressions and the time series data. There are features on other maps that are really nice for mapping. You know, Georgia lets you slide between 477 data and their own data to show the difference. Broadband Now has a pretty good collection of pricing data that they’ve assembled. You know, there’s a lot of interesting datasets out there, but you know, just showing a map of data is interesting, but it doesn’t help you make decisions, really. It just shows you what’s happening, and you need additional features to be able to actually get useful action from the map.
Scott Wallsten:
Nathaniel, what do you think has been the biggest challenge in making this work?
Nathaniel Lovin:
So, the biggest challenge was just dealing with all the weird little inconsistencies of the data. Like, the Census Bureau has seven different ways that they have null values in the data or, you know, no value. Or there’s… I spent a couple of months trying to figure out how to fix the geographies so that I could analyze them for the USAC data. And a lot of it is having to deal with the fact that the datasets are just a lot bigger than the tools are built for. So, you know, you could fix the study areas if they were smaller with arc.js, but you couldn’t… it was too big of a file to load fully on my computer to fix. So, I needed to do it in BigQuery, but BigQuery wasn’t supporting this at the time.
So, until they fixed it, I wasn’t able to get it to work. Similarly, and I’ve since changed course on this, for a while we were trying to figure out how to get the mobile Form 477 data into the system using the shapefiles. And they were too big to fit into the Google big data platform. Like, you know, the row size limit in BigQuery could not fit the size of the row needed for that dataset. And so, we had to do a bunch of stuff to break them up, which took like two weeks last summer, I think, to run on an iMac that one of our interns was using. And then once we got it into BigQuery, there were a bunch of issues with actually processing the data and figuring out an algorithm that would be cost effective for doing that. So, we eventually decided to just use the precalculated actual area stuff from the FCC.
Scott Wallsten:
Right. So, you know, for people who work with a lot of data, a lot of that might sound kind of familiar, but, you know, for people who deal more with just the broadband policy part of it and not the detailed data, these small issues can make a big difference. I mean, the way you deal with zeros and missing values can hugely affect your results if you get it wrong. If you put zeros in where a value is actually missing, you mess up all your results. And if you don’t know that “999” means no data, and you interpret it as the number 999, again, everything’s messed up. And like you said, they’re all different. The census had seven, you said, different ways of showing no data or something like that. Yeah, I mean, this is all just the nature of working with data, but for each one of these datasets, we have to make sure we know what they’re saying and that, you know, when we have them in there, they’re handled in a consistent fashion.
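To make the zeros-versus-missing point concrete, here is a small sketch of normalizing "no data" codes before computing anything. The sentinel list is hypothetical and chosen only for illustration; each real dataset documents its own codes, and the helper names `parse_value` and `mean_ignoring_missing` are invented for this example.

```python
# Hypothetical "no data" codes. Every real dataset documents its own set,
# so this list is an assumption for illustration only.
MISSING_SENTINELS = {"", "NA", "N/A", "null", "-", "999", "-999"}


def parse_value(raw):
    """Return a float, or None when the raw field encodes missing data."""
    text = raw.strip()
    if text in MISSING_SENTINELS:
        return None
    return float(text)


def mean_ignoring_missing(raw_values):
    """Average only the values that are actually present."""
    present = [v for v in map(parse_value, raw_values) if v is not None]
    if not present:
        return None  # nothing to average; do not pretend the answer is 0
    return sum(present) / len(present)


# One column with a true zero, a sentinel code, and a blank cell.
raw_column = ["50", "999", "0", "", "100"]

# Naive approach: treat "999" as a number and the blank as zero.
naive_mean = sum(float(r or 0) for r in raw_column) / len(raw_column)

# Careful approach: drop the two missing cells but keep the genuine zero.
clean_mean = mean_ignoring_missing(raw_column)

print(naive_mean, clean_mean)
```

The genuine zero stays in the average while the sentinel and the blank drop out, so the two results differ by a wide margin on the same five cells.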
Sarah Oh:
So, could you tell us a little bit more, like, why would we need to run regressions on the map? What makes the computation part important for a broadband map? Why can’t it just display a map?
Scott Wallsten:
I mean, the reason we need to be able to run regressions, of course, is that you want to look for relationships between variables, and regressions, in our opinion, are probably the best way to do that. And it is another way of making use of multiple datasets coming together. And also, I think, to go back to Nathaniel, making that possible is a really big computational challenge. We’ll have some idea, and we’ll need a dataset at some random geographic level for some random time period, and we can do it.
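As a rough illustration of what looking for a relationship between two variables means in practice, the sketch below fits a one-variable ordinary least squares line in plain Python. The county figures are invented, the function name `simple_ols` is hypothetical, and nothing here is meant to represent how the TPI tool computes its regressions.

```python
def simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation, so the slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Share of the variation in y explained by the fitted line.
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Invented county-level data: median household income (thousands of
# dollars) and broadband adoption rate (percent).
income = [40.0, 50.0, 60.0, 70.0, 80.0]
adoption = [60.0, 66.0, 70.0, 76.0, 78.0]

slope, intercept, r_squared = simple_ols(income, adoption)
print(f"slope={slope:.2f} intercept={intercept:.1f} r2={r_squared:.3f}")
```

The two input lists would normally come from different datasets merged on a shared geography, which is where the aggregation work discussed earlier comes in.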
Sarah Oh:
So, going into the weeds a little bit, why did you create a broadband connectivity index, Scott? You know, you said there are error rates in datasets. How do you remedy that? And I think the answer was you created an index.
Scott Wallsten:
Yeah. Indices are always popular. People love to rank things. When done badly, they can of course be kind of useless because, you know, often people just take a bunch of criteria and add them up and say, that’s the ranking. You know, that gives you all kinds of implicit weights in it, or you’re implicitly deciding that everything has the same weight. But the good part about an index is that it combines lots of information, and that’s what we want to do. So, the idea behind this connectivity index was to find many different measures of connectivity and combine them, using what’s called a principal components analysis, to get some kind of overall picture of connectivity at whatever geographic level you’re talking about. And so, what that really does is allow you to find areas that require more focus, and then you can break the index apart to see, you know, what the issues are in different areas, because sometimes it’s an adoption issue. Sometimes it’s an availability issue. So, that makes this possible. The other thing which I think is important is that sometimes what you see is that you’re missing data, and that’s important too, because some states are engaged in their own mapping efforts, which can be useful. But if you can identify smaller areas where you need more intensive data collection, that can save a lot of money. So there are all kinds of uses for it. And this is just one approach to getting new information by combining existing datasets.
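One way to picture an index built with principal components analysis rather than equal weights is the sketch below. It standardizes a few measures, finds the first principal component of their correlation matrix by power iteration, and scores each area along it. The measures and values are invented, the function names are hypothetical, and it assumes the measures are positively correlated so that power iteration from a uniform starting vector converges; it is not the method behind TPI's actual index.

```python
import math


def standardize(columns):
    """Convert each measure (one list per measure) to z-scores."""
    result = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        if sd == 0:
            raise ValueError("a constant measure carries no information")
        result.append([(v - mean) / sd for v in col])
    return result


def correlation_matrix(z):
    """Correlation matrix of already standardized measures."""
    n, k = len(z[0]), len(z)
    return [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1)
             for j in range(k)] for i in range(k)]


def first_component(matrix, iterations=500):
    """Leading eigenvector via power iteration (unit length)."""
    k = len(matrix)
    vec = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec if sum(vec) >= 0 else [-v for v in vec]


def connectivity_index(columns):
    """Return the data-driven weights and one index score per area."""
    z = standardize(columns)
    weights = first_component(correlation_matrix(z))
    scores = [sum(w * z[m][i] for m, w in enumerate(weights))
              for i in range(len(z[0]))]
    return weights, scores


# Invented measures for five areas: availability share, adoption share,
# and median download speed in Mbps.
availability = [0.95, 0.80, 0.60, 0.99, 0.70]
adoption = [0.85, 0.70, 0.50, 0.90, 0.55]
speed = [200.0, 100.0, 25.0, 500.0, 50.0]

weights, scores = connectivity_index([availability, adoption, speed])
print("weights:", [round(w, 3) for w in weights])
print("scores:", [round(s, 2) for s in scores])
```

Because the weights come from the data instead of being set equal by hand, the per-measure z-scores can still be inspected afterward to see whether a low score reflects availability, adoption, or speed.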
Sarah Oh:
Why doesn’t the data come in the form of counties and states, the way it would need to be read? Like, are we ahead of the curve here?
Nathaniel Lovin:
So, I think there’s a couple reasons for that. I think the main one is simply that you have to make choices about how to aggregate the data. And I think it’s better that the FCC isn’t deciding, “Okay, here’s how we’re going to aggregate the data and distribute it for higher levels,” and is instead saying, “You know, we’re just going to distribute the base level data.”
Scott Wallsten:
I think the FCC actually gets kind of a bad rap for the 477 data. When you’re talking about identifying specific unserved areas, it’s not quite the right thing, although it could be a starting point. But they show data at the census block level, and there are 11 million census blocks in the US. That’s pretty granular, and those can aggregate up to almost any other geographic type that you’re interested in.
Sarah Oh:
Do you think the new data, the broadband fabric with higher resolution will be better to use? Easier? Harder? Like I know Nathaniel has some speculation. How are we going to use that data in our map? The new data that’s coming.
Nathaniel Lovin:
So, I am extremely excited to look at that data whenever it comes out, because I think it’s just going to be interesting to see how they release it and deal with it. The worry I have is that the 477 data is already, you know, what, two gigabytes in size, just for the current data. When you are covering smaller and smaller geographic areas, like each individual parcel of land or whatever the thing is going to be, you’re going to have much more discrete data, which is going to massively inflate the size of the dataset, which then makes it harder to aggregate into larger areas to do useful analysis with. Now, I think once we have the data, you know, it’s going to take a little bit of finagling to figure out exactly how to use it. I do think we’ll be able to get good information out of it, but I’m nervous about the complexity of doing GIS with the data, analyzing it, and combining it into higher level usable information.
Scott Wallsten:
I’m with Nathaniel on this. On the one hand, I’m super excited for this data to come out because, well, we want to use it. We want to incorporate it into our map. And, you know, every new dataset we add to it makes it that much more useful. But it is a little bit worrisome because, like Nathaniel said, it’s going to be enormous and everybody may have different uses for it, which means you’re going to need to work with some version of the raw data, which is going to, you know, take some skill. But also, I am a little concerned that people have an idea that this data will finally be the broadband map. It’ll tell us exactly where broadband is and where it isn’t, and that’s the be-all and end-all. Of course, that’s not true for a lot of reasons. One is that every dataset has error; it’s intrinsic to datasets and maps.
So, there will still be errors in the data. And that will not mean that the FCC did a bad job, but we need to be ready for that and think about how to deal with it. And the other is, of course, that it will need to be updated frequently, and that will presumably also be costly. The other thing is that that kind of data will be useful, hopefully, for identifying the unserved areas. But I have a feeling that in most analyses people will end up aggregating it up to higher levels anyway. And so, we’re going to find out that, you know, we’ve done all this work, and we end up still doing analyses at the county level or, you know, census tract level. And of course, there’s nothing wrong with that, but I’m not sure that people will use the data… that the data will be useful in as many ways as people expect. So, I’m of two minds. I’m really excited and a little skeptical.
But Sarah, you know, one thing we haven’t mentioned, you know, one of the origins for this is something that you did as a graduate student. So, with Fiscly, why don’t you talk about that? Because that was sort of a primordial version of this, right?
Sarah Oh:
Oh, yeah. Right. Well, I did want to talk about the Universal Service Fund spending, and just the need to track where the money has gone. And so, my project as a grad student was to create a website that made it easy to see where E-Rate money was going. So, USAC puts out these open datasets that are historical, that are really granular. You can see how much E-Rate money is going to every school and what equipment they’re buying, what speed internet they are getting. And so, I wanted to provide some transparency to that great data, and that raises other points too, that, well, you know, we should be learning from all the billions of dollars that are being spent already and improving on the Universal Service Fund. So, that’s part of why data and data practices and, like, web development are so important, because if you match up where the subsidies are going by school with the speeds and where areas are unserved, then we can actually figure out where money should be going to fix the problem faster, to narrow the digital divide faster.
So, the motivation behind all this is to get the money to where it needs to be. And the missing link is really having better data matching and mapping and the analysis behind it. And that brings us to other applications, to why a broadband map that includes analysis is so important, because you can answer policy questions like, you know, how effective are the affordability programs? Where’s the EBB money going? Is it helping connect people? There are other policy questions that we’ve been investigating with our mapping tool, you know, state-specific questions. Where is funding going in Florida? So, I’ll just have a little soapbox here too. I think, you know, we can tweak the way USAC distributes its open data. It could be a little bit easier for researchers to use. The USAC datasets could have FRN numbers in them. They could have the NCES school codes in them, but they don’t currently include those.
Scott Wallsten:
And, if you know what that means, you know way too much about E-Rate.
Sarah Oh:
Yeah. So, for somebody who’s very interested in the digital divide and like seeing broadband money be spent well, there are a lot of improvements that can be made with the government data. So, we’re just part of that effort, and that’s why we’re doing all this coding and research on broadband data.
Scott Wallsten:
But, Sarah, I think you’re also implicitly making two important points. One is that this kind of tool is an input. We don’t have a national strategy of building a broadband map and set of tools and then we’re done. We don’t, you know, we don’t actually care about that. We care about it to the extent that it’s useful. And the second is that it helps us answer important questions. And, you know, when you’re talking about the USAC data, with the data that USAC provides, you can combine it with other data to see how effective it’s been. And that’s what we really want to do. And those suggestions for USAC, you know, are really important, though we do want to make sure USAC knows that we think their datasets are great. And so, keep them coming.
Sarah Oh:
So, maybe to wrap up, what do you think we’ll be doing next with the map? What’s coming down the pike?
Nathaniel Lovin:
So, I have a bunch of features that are just improvements to the actual user experience. I want to make it so that people can save the state of the map, so that they can share their exact setup, you know, what variables they have set up on the map and the scatter plot and everything. I would like to, at some point, put more time into making it so that you can run multivariate regressions with multiple variables. We want to, at some point, make it easier for people to download the data that we’ve assembled so that they can use it more directly if they want to. Obviously, we’re waiting for the new FCC data to come out. We’re excited to work with that.
Scott Wallsten:
Well, we have, I mean, there’s so much we can do with it behind the scenes that’s not available on the front end yet. And so many features that are in progress, but we don’t quite have the financial resources to fully build them out yet. So, it’s kind of the opposite of vaporware. We’ve got tons of great analytical tools that we can’t show anybody. So, yeah, there’s a whole lot more back there that we hope to bring out.
Sarah Oh:
Thank you, Nathaniel and Scott, for talking about our broadband map project at the Technology Policy Institute, and we’ll continue to be working on these broadband tools to help all the folks who are going to be working on state broadband plans and interpreting the broadband fabric datasets that are coming.
Scott Wallsten:
And if you want to hear more, reach out to us. Send an email to [email protected],
Nathaniel Lovin:
And look at the map at tpibroadband.com
Scott Wallsten:
Yes. Or your own state at tpibroadband.com/state and pick your state.
Sarah Oh:
You can also visit broadband.tools for our white paper.
Scott Wallsten:
Yeah. We have a lot of stuff.
Sarah Oh:
And TPIreports.com