Looking back on my first data science project, I can pinpoint one part of the process that stood out to me. While the realization that I actually knew what I was doing and how to approach this project is up there, the exploratory data analysis (EDA) is what stands out the most. It was the part of the process where I can say, with absolute certainty, that I nerded out!
Up until the EDA stage of the project I was feeling a bit overwhelmed and frustrated. This was due to the nature of cleaning data and setting up the data frame so that the model you create gives you reliable results. The perfect example was figuring out how to one hot encode a column with a period (“.”) in its values, which was my first major hiccup and then my first victory. I replaced each period in the column's values with an underscore, so that when the column was one hot encoded, each new column name was in a format that could be run through the OLS model. The tediousness! Although it had me questioning a lot about my project, I know now that each frustrating moment made the joy of exploring the data, and seeing clearly what it was telling me, that much more special.
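A minimal sketch of that kind of fix in Python with pandas might look like the following. The column name `grade`, its values, and the prices are hypothetical, and this is only one way the period-to-underscore step could be written, not the project's actual code.

```python
import pandas as pd

# Hypothetical data: "grade" stands in for any column whose values contain a period.
df = pd.DataFrame({
    "grade": ["7.5", "8.0", "7.5"],
    "price": [300000, 450000, 310000],
})

# Replace the literal period with an underscore (regex=False treats "." as plain text).
df["grade"] = df["grade"].astype(str).str.replace(".", "_", regex=False)

# One hot encode the cleaned column; the new column names now contain no periods.
dummies = pd.get_dummies(df["grade"], prefix="grade")
encoded = pd.concat([df.drop(columns="grade"), dummies], axis=1)

print(list(encoded.columns))
```

Column names such as `grade_7.5` tend to trip up formula-style model interfaces, which is presumably why the underscore version could be passed to the OLS model.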
The paragraph above might lead you to believe that the EDA stage of my project involved no manipulation of the data frame and none of the frustrations that come with it. That is not the truth. As you know, the whole data analysis process is very iterative, and that was true for the EDA stage of my project as well. So please do not mistake my favorite part of the project for the easiest part. It was simply the part where everything came together for me.
When I was exploring the data I had been given and running it through the models, I began to see the correlations, and also some anomalies that stood out because of my interest in the industry. The most memorable was noticing that the zip code column initially had one of the lowest r-values. When I saw that, I knew that something was off. Like many large metro areas, the one I live in is made up of many different cities. When looking at home values in my day to day life, it is very apparent that a zip code has a significant impact on the value of a home. I have even seen homes on opposite sides of the same street that happen to be in different zip codes and therefore have very different values. Knowing this is what reminded me to one hot encode the zip codes on top of the columns that I had already one hot encoded. Without one hot encoding the zip code column, the model treats zip code as a continuous numerical value, as opposed to treating each zip code as a category. As you can imagine, this was a huge problem. Not only would I have been overlooking pertinent data from a data scientist’s perspective, but from the business’s perspective, the company would not know where to go to implement the strategy that I had provided.
I know that as a data scientist you are not expected, or even able, to provide all the answers, since we are limited to the data we have been given. So being able to provide this information when the data allows it is critical, and it only increases the value that we bring to the companies that hire us. It was in that moment that I realized how important all the tedious aspects of the OSEMN process are to me as a data scientist, and it was also the moment I got excited. Having to go back and restructure the data frame, with all the frustrations that come with that, turned into a minimal concern.
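To make the zip code point concrete, here is a small sketch in Python with pandas. The zip codes and prices are invented for illustration, and the variable names are assumptions rather than anything from the project itself.

```python
import pandas as pd

# Hypothetical listings: the middle zip code is far more expensive than its neighbors.
homes = pd.DataFrame({
    "zipcode": [98001, 98002, 98003, 98001, 98002, 98003],
    "price":   [300000, 900000, 350000, 320000, 880000, 340000],
})

# Treated as a number, zip code shows almost no linear relationship with price
# in this made-up data, because the numeric order of zip codes means nothing.
r_numeric = homes["zipcode"].corr(homes["price"])

# Treated as a category, the price differences between zip codes are obvious.
mean_price_by_zip = homes.groupby("zipcode")["price"].mean()

# One hot encoding gives each zip code its own 0/1 column for the model.
zip_dummies = pd.get_dummies(homes["zipcode"].astype(str), prefix="zip", dtype=int)
homes_encoded = pd.concat([homes.drop(columns="zipcode"), zip_dummies], axis=1)

print(r_numeric)
print(mean_price_by_zip)
print(list(homes_encoded.columns))
```

When the encoded columns go into a regression that includes an intercept, it is common to drop one of them (for example with `drop_first=True` in `pd.get_dummies`) so the dummy columns are not perfectly collinear.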
While not everyone is able to get excited about real estate, it is helpful to know that applying what you are currently learning to something you are passionate about makes it easier. The fear of not knowing whether I was capable of handling this whole process, even on a simple project, was definitely real, and I know that I am not going to be the only one experiencing it on this journey. Because this blog is geared towards aspiring data scientists like myself, I wanted to provide you with a light at the end of the tunnel, paired with a few of the obstacles that I had to overcome. Hopefully I was able to give you a blend of both: a technical view of what I went through on my first data science project and a more human side of the experience.
Happy Coding!