<h1 id="predicting-procrastination">Predicting Procrastination</h1>
<h1 id="introduction">Introduction</h1>
<p>The motivation for this project came from my own struggles with procrastination. There are many productivity apps out there to help people like me stay focused despite the seemingly infinite number of distractions the Internet has to offer (cough Reddit cough). The problem is that most of these productivity apps block websites purely based on their URL. In many cases this is good enough, but oftentimes blocking an entire URL creates more problems than it solves. Youtube, for example, is one of the largest sources of distraction the Internet has to offer. Naturally, it is a website one would block when trying to be productive. However, Youtube also provides a vast wealth of incredibly informative videos. There have been numerous times when watching a five minute video on matrix multiplication to refresh my memory led to hours of increased productivity down the line. But alas, the site is blocked by my supposedly helpful productivity app. So I find myself at a crossroads: I can disable the app to watch the video I want to see, but run the risk of spiraling down a never ending path of cat videos; or I can leave the app running and search for a less effective refresher. I should not have to choose. So I set out to find a better way.</p>
<p>Let’s start by importing our data</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="kn">as</span> <span class="nn">sns</span>
<span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">model_selection</span>
<span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">linear_model</span>
<span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">ensemble</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">accuracy_score</span><span class="p">,</span> <span class="n">precision_score</span><span class="p">,</span> <span class="n">recall_score</span><span class="p">,</span> <span class="n">confusion_matrix</span><span class="p">,</span> <span class="n">classification_report</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
<span class="kn">from</span> <span class="nn">sklearn.pipeline</span> <span class="kn">import</span> <span class="n">Pipeline</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'parsed_html.csv'</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">data</span><span class="p">[[</span><span class="s">'text'</span><span class="p">,</span><span class="s">'url'</span><span class="p">]]</span>
<span class="c"># Splitting our data into a training and test set before we begin our data building</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">model_selection</span><span class="o">.</span><span class="n">train_test_split</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">activity</span><span class="p">,</span><span class="n">test_size</span><span class="o">=</span><span class="mf">0.33</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">43</span><span class="p">)</span>
</code></pre>
</div>
<h2 id="model-building">Model building</h2>
<h4 id="building-a-grid-searched-pipeline">Building a grid searched pipeline</h4>
<p>I am using a grid search over a pipeline to find the best model. After much trial and error I settled on a logistic regression model. First, it simply performed the best. It is also by far the most interpretable model. Additionally, Lasso regularization is a great way to reduce the number of features generated by the count vectorization process and the use of ngrams. At one point I had a dataframe with over 2,000,000 features; Lasso regularization reduced that to a few hundred.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">nltk.corpus</span> <span class="kn">import</span> <span class="n">stopwords</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">GridSearchCV</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">CountVectorizer</span><span class="p">,</span> <span class="n">HashingVectorizer</span><span class="p">,</span> <span class="n">TfidfVectorizer</span><span class="p">,</span> <span class="n">TfidfTransformer</span>
<span class="c"># Let's create some stop words. I chose these values after doing a little bit of EDA.</span>
<span class="n">stop</span> <span class="o">=</span> <span class="n">stopwords</span><span class="o">.</span><span class="n">words</span><span class="p">(</span><span class="s">'english'</span><span class="p">)</span>
<span class="n">stop</span> <span class="o">=</span> <span class="n">stop</span> <span class="o">+</span> <span class="p">[</span><span class="s">'https'</span><span class="p">,</span> <span class="s">'www'</span><span class="p">,</span> <span class="s">'com'</span><span class="p">,</span> <span class="s">'http'</span><span class="p">]</span>
<span class="n">cvt</span> <span class="o">=</span> <span class="n">CountVectorizer</span><span class="p">(</span><span class="n">stop_words</span><span class="o">=</span><span class="n">stop</span><span class="p">,</span> <span class="n">ngram_range</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">4</span><span class="p">])</span>
<span class="c"># Here we are initializing the values we want to grid search over.</span>
<span class="n">param_grid</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">vect</span> <span class="o">=</span> <span class="p">[</span><span class="n">CountVectorizer</span><span class="p">()],</span>
<span class="n">vect__ngram_range</span><span class="o">=</span><span class="p">[[</span><span class="mi">1</span><span class="p">,</span><span class="mi">3</span><span class="p">],[</span><span class="mi">1</span><span class="p">,</span><span class="mi">4</span><span class="p">]],</span> <span class="c"># Trying different ngram ranges</span>
<span class="n">vect__stop_words</span> <span class="o">=</span> <span class="p">[</span><span class="n">stop</span><span class="p">],</span>
<span class="n">tfidf</span> <span class="o">=</span> <span class="p">[</span><span class="n">TfidfTransformer</span><span class="p">()],</span>
<span class="n">tfidf__norm</span> <span class="o">=</span> <span class="p">[</span><span class="bp">None</span><span class="p">],</span>
<span class="n">clf</span><span class="o">=</span><span class="p">[</span><span class="n">LogisticRegression</span><span class="p">()],</span>
<span class="n">clf__C</span><span class="o">=</span><span class="p">[</span><span class="o">.</span><span class="mo">04</span><span class="p">,</span><span class="o">.</span><span class="mi">1</span><span class="p">,</span><span class="o">.</span><span class="mo">06</span><span class="p">,</span> <span class="o">.</span><span class="mo">07</span><span class="p">,</span> <span class="o">.</span><span class="mo">05</span><span class="p">],</span> <span class="c"># Trying different coefficients for alpha</span>
<span class="n">clf__penalty</span><span class="o">=</span><span class="p">[</span><span class="s">'l1'</span><span class="p">])</span>
<span class="n">pipeline</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([</span>
<span class="p">(</span><span class="s">'vect'</span><span class="p">,</span> <span class="n">cvt</span><span class="p">),</span>
<span class="p">(</span><span class="s">'tfidf'</span><span class="p">,</span> <span class="n">TfidfTransformer</span><span class="p">(</span><span class="n">norm</span><span class="o">=</span><span class="bp">None</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'clf'</span><span class="p">,</span> <span class="n">LogisticRegression</span><span class="p">(</span><span class="n">penalty</span><span class="o">=</span><span class="s">'l1'</span><span class="p">))</span>
<span class="p">])</span>
<span class="n">grid_search</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">pipeline</span><span class="p">,</span> <span class="n">param_grid</span><span class="o">=</span><span class="n">param_grid</span><span class="p">)</span>
<span class="n">grid_search</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="o">.</span><span class="n">text</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre>
</div>
<h2 id="analyzing-our-results">Analyzing our results</h2>
<h4 id="calculating-some-metrics">Calculating some metrics</h4>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># Finding our best pipeline and pulling out the useful components</span>
<span class="n">pipeline</span> <span class="o">=</span> <span class="n">grid_search</span><span class="o">.</span><span class="n">best_estimator_</span>
<span class="n">lm</span> <span class="o">=</span> <span class="n">pipeline</span><span class="o">.</span><span class="n">named_steps</span><span class="p">[</span><span class="s">'clf'</span><span class="p">]</span>
<span class="n">vect</span> <span class="o">=</span> <span class="n">pipeline</span><span class="o">.</span><span class="n">named_steps</span><span class="p">[</span><span class="s">'vect'</span><span class="p">]</span>
<span class="c"># Let's see what our accuracy looks like</span>
<span class="n">grid_search</span><span class="o">.</span><span class="n">best_estimator_</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="o">.</span><span class="n">text</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
</code></pre>
</div>
<div class="highlighter-rouge"><pre class="highlight"><code>Out: 0.97214484679665736
</code></pre>
</div>
<p>Our accuracy score is looking great. 97.2% is impressive, but we should compare it to our baseline distribution before we get too excited.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># Caluclating our baseline</span>
<span class="p">(</span><span class="n">y_train</span> <span class="o">==</span> <span class="s">'work'</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">()</span><span class="o">/</span><span class="nb">float</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">y_train</span><span class="p">))</span>
</code></pre>
</div>
<div class="highlighter-rouge"><pre class="highlight"><code>out: 0.41483516483516486
</code></pre>
</div>
<p>So we have massively improved over the baseline. This is a good start. We should look at some additional metrics as well to see whether there is anything to be concerned about.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">result_x</span> <span class="o">=</span> <span class="n">vect</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
<span class="n">pred</span> <span class="o">=</span> <span class="n">lm</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">result_x</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">pred</span><span class="p">))</span>
</code></pre>
</div>
<div class="highlighter-rouge"><pre class="highlight"><code>precision recall f1-score support
procr 0.96 0.98 0.97 214
work 0.97 0.94 0.95 145
avg / total 0.96 0.96 0.96 359
</code></pre>
</div>
<p>So far so good. I don’t see anything particularly concerning in these results. Our recall being a little lower for work than for procrastination is something to keep in mind going forward. We may want to examine our predicted probabilities to see what kinds of pages are getting misclassified.</p>
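<p>As a rough sketch of what that inspection might look like (this code is only illustrative, reusing the lm, result_x, and pred variables defined above; the ‘url’ column is the one in this dataset), we could pull out the misclassified test pages along with the probability the model assigned to the ‘work’ class:</p>
<div class="highlighter-rouge"><pre class="highlight"><code># Sketch: inspect misclassified test pages and the model's confidence in each
proba = lm.predict_proba(result_x)
wrong = pred != y_test
misclassified = X_test[wrong].copy()
misclassified['predicted'] = pred[wrong]
# Probability assigned to the 'work' class for each misclassified page
misclassified['p_work'] = proba[wrong.values, list(lm.classes_).index('work')]
misclassified[['url', 'predicted', 'p_work']].head(10)
</code></pre>
</div>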
<p>Next we should take a look at our ROC and the area under the curve.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">roc_curve</span><span class="p">,</span> <span class="n">auc</span>
<span class="c"># Here is some helpful code found on stack overflow</span>
<span class="n">pred_proba</span> <span class="o">=</span> <span class="n">lm</span><span class="o">.</span><span class="n">predict_proba</span><span class="p">(</span><span class="n">result_x</span><span class="p">)</span>
<span class="n">fpr</span><span class="p">,</span> <span class="n">tpr</span><span class="p">,</span> <span class="n">threshold</span> <span class="o">=</span> <span class="n">roc_curve</span><span class="p">(</span><span class="n">y_test</span> <span class="o">==</span> <span class="s">'work'</span><span class="p">,</span> <span class="n">pred_proba</span><span class="p">[:,</span><span class="mi">1</span><span class="p">])</span>
<span class="n">roc_auc</span> <span class="o">=</span> <span class="n">auc</span><span class="p">(</span><span class="n">fpr</span><span class="p">,</span> <span class="n">tpr</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span><span class="mi">10</span><span class="p">),</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">80</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Receiver Operating Characteristic'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">22</span><span class="p">);</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">fpr</span><span class="p">,</span> <span class="n">tpr</span><span class="p">,</span> <span class="s">'b'</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'AUC = </span><span class="si">%0.3</span><span class="s">f'</span> <span class="o">%</span> <span class="n">roc_auc</span><span class="p">);</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span> <span class="o">=</span> <span class="s">'lower right'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">20</span><span class="p">);</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span><span class="s">'r--'</span><span class="p">);</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlim</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">]);</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylim</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">]);</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'True Positive Rate'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">18</span><span class="p">);</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'False Positive Rate'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">18</span><span class="p">);</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">14</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">yticks</span><span class="p">(</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">14</span><span class="p">)</span>
</code></pre>
</div>
<p><img src="/img/plots/procr_roc_auc.png" alt="AUC_ROC" /></p>
<p>Wow! A ROC AUC of 0.992. That is amazing. I think we can safely say we have created an extremely effective model for predicting procrastination.</p>
<h4 id="taking-a-deeper-look-at-our-features">Taking a deeper look at our features</h4>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># This bit of code is pulling out my features that have coefficients greater than zero</span>
<span class="c"># Lasso regularization reduces the coef to 0 of the features (in our case unique ngrams)</span>
<span class="kn">import</span> <span class="nn">math</span>
<span class="n">features</span> <span class="o">=</span><span class="p">(</span><span class="n">vect</span><span class="o">.</span><span class="n">get_feature_names</span><span class="p">())</span>
<span class="n">feature_dict</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">features</span><span class="p">):</span>
<span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">lm</span><span class="o">.</span><span class="n">coef_</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">i</span><span class="p">])</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
<span class="n">feature_dict</span><span class="p">[</span><span class="n">f</span><span class="p">]</span> <span class="o">=</span> <span class="n">lm</span><span class="o">.</span><span class="n">coef_</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">i</span><span class="p">]</span>
<span class="c"># For convienence sake I'll put the features into a data frame for easier exploration</span>
<span class="n">feature_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="o">.</span><span class="n">from_dict</span><span class="p">(</span><span class="n">feature_dict</span><span class="p">,</span> <span class="n">orient</span><span class="o">=</span><span class="s">'index'</span><span class="p">)</span>
<span class="n">feature_df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'coef'</span><span class="p">]</span>
</code></pre>
</div>
<p>Let’s take a quick look at the number of features and our total documents. We really do not want a model utilizing more features than we have documents. Our regularization should have accounted for this, but it’s not a bad idea to double check.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># How many documents do I have in my training set</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Number of docs: "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">X_train</span><span class="p">)))</span>
<span class="c"># How many features do I have after reguralization</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Number of features: "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">feature_df</span><span class="p">)))</span>
</code></pre>
</div>
<div class="highlighter-rouge"><pre class="highlight"><code>Number of docs: 728
Number of features: 211
</code></pre>
</div>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># We can raise our logisitic regression coef to the e to calculate the odds ratio</span>
<span class="n">feature_df</span><span class="p">[</span><span class="s">'odds_ratio'</span><span class="p">]</span> <span class="o">=</span> <span class="n">feature_df</span><span class="p">[</span><span class="s">'coef'</span><span class="p">]</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">)</span>
</code></pre>
</div>
<p>Now let’s look at which words are most associated with procrastination and which with productivity. We can sort our dataframe by odds ratio: a smaller odds ratio means a word is less associated with productivity, and a higher ratio means it is more associated with it.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>feature_df.sort_values('odds_ratio').head(10)
</code></pre>
</div>
<table border="1" class="dataframe"><thead> <tr style="text-align: right;"> <th></th> <th>coef</th> <th>odds_ratio</th> </tr> </thead> <tbody> <tr> <th>game</th> <td>-0.130174</td> <td>0.877943</td> </tr> <tr> <th>likes</th> <td>-0.093046</td> <td>0.911152</td> </tr> <tr> <th>thwas</th> <td>-0.087130</td> <td>0.916558</td> </tr> <tr> <th>reddit</th> <td>-0.074749</td> <td>0.927976</td> </tr> <tr> <th>photo</th> <td>-0.072818</td> <td>0.929770</td> </tr> <tr> <th>ignore_index</th> <td>-0.056659</td> <td>0.944916</td> </tr> <tr> <th>attack</th> <td>-0.054527</td> <td>0.946933</td> </tr> <tr> <th>video</th> <td>-0.043512</td> <td>0.957421</td> </tr> <tr> <th>src</th> <td>-0.042894</td> <td>0.958013</td> </tr> <tr> <th>us</th> <td>-0.040551</td> <td>0.960260</td> </tr> </tbody></table>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">feature_df</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s">'odds_ratio'</span><span class="p">,</span><span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</code></pre>
</div>
<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>coef</th> <th>odds_ratio</th> </tr> </thead> <tbody> <tr> <th>github</th> <td>0.203852</td> <td>1.226116</td> </tr> <tr> <th>using</th> <td>0.162867</td> <td>1.176880</td> </tr> <tr> <th>data</th> <td>0.089620</td> <td>1.093759</td> </tr> <tr> <th>code</th> <td>0.080694</td> <td>1.084039</td> </tr> <tr> <th>file</th> <td>0.079490</td> <td>1.082735</td> </tr> <tr> <th>import</th> <td>0.069770</td> <td>1.072262</td> </tr> <tr> <th>instagram</th> <td>0.065247</td> <td>1.067422</td> </tr> <tr> <th>friction</th> <td>0.061906</td> <td>1.063863</td> </tr> <tr> <th>project euler</th> <td>0.058094</td> <td>1.059814</td> </tr> <tr> <th>stack</th> <td>0.049860</td> <td>1.051124</td> </tr> </tbody></table>
<p>Looking at the top 10 words related to procrastinating, I can see a lot of things that make sense. Words like ‘game’, ‘reddit’, and ‘photo’ need little explanation. ‘5e’ and ‘attack’ look related to Dungeons and Dragons (a thing I spend a lot of time reading about). ‘src’, ‘us’, and ‘thwas’ I don’t understand as much.</p>
<p>The top 10 words for being productive are almost all really clear to me. ‘github’ being the strongest indicator comes as no surprise, with ‘data’, ‘file’, ‘code’, and ‘import’ all following closely behind. The word ‘using’ is interesting. I do find myself googling phrases like ‘classifying data using logistic regression’ quite frequently, so perhaps that verb is especially prevalent on the pages I read when I am being productive. ‘instagram’ is another interesting word. I do not use Instagram. I don’t even have an account. But I did spend a long afternoon one day trying to figure out how to get their API to work for a project I was working on. ‘Kurzgesagt’ is the name of a Youtube channel for educational videos. I am extremely pleased to see it show up as an indicator of productivity. That is a textbook example of the kind of keyword I was hoping to find to distinguish mindless Youtube videos from educational ones.</p>
<h1 id="exploring-the-billboard-top-100-with-pandas-part2">Exploring the Billboard Top 100 with Pandas (Part 2)</h1>
<h1 id="working-with-time-series">Working with Time Series</h1>
<p>Last week we were working with billboard data that we wanted to convert to time series. This week we are going to work through how to make that happen.</p>
<p>Working with time series means we are going to need to import datetime. We also need to work out the range of time this data set covers: we find the date of the earliest data point and the date of the latest data point, and then create a series of weeks spanning the period between those two dates.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">datetime</span> <span class="kn">as</span> <span class="nn">datetime</span>
<span class="c"># Converting the date_entered of our data set to datetime data.</span>
<span class="n">start_dates</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'date_entered'</span><span class="p">])</span>
<span class="c"># Now we are finding the earliest date, or the minimum start date. This is the earliest date that we have data for</span>
<span class="n">first_date</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">start_dates</span><span class="p">)</span>
<span class="n">last_date</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">start_dates</span><span class="p">)</span>
<span class="c"># Creating a list of dates split into weeks from the earliest a song entered the charts until the last time</span>
<span class="n">date_list</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">date_range</span><span class="p">(</span><span class="n">start</span><span class="o">=</span><span class="n">first_date</span><span class="p">,</span><span class="n">end</span><span class="o">=</span><span class="n">last_date</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s">'W'</span><span class="p">)</span>
</code></pre>
</div>
<p>Now that we have a list of dates split into weeks over the appropriate range, we can create a dataframe with the time series as the index and the songs as the columns. We will put the placement of a given song during each week in the respective column. If the track was not on the chart for a given week we will put a ‘None’ value instead.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># Initializing new data frame</span>
<span class="n">df_time</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">columns</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">track</span><span class="p">,</span> <span class="n">index</span> <span class="o">=</span> <span class="n">date_list</span><span class="p">)</span>
<span class="n">df_time</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">'Date'</span>
<span class="c"># Transposing data so we can iterate over songs</span>
<span class="n">df_t</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">transpose</span><span class="p">()</span>
<span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">df_t</span><span class="p">:</span>
<span class="n">curr_song</span> <span class="o">=</span> <span class="n">df_t</span><span class="p">[</span><span class="n">t</span><span class="p">]</span>
<span class="n">track</span> <span class="o">=</span> <span class="n">df_weeks</span><span class="p">[</span><span class="n">curr_song</span><span class="o">.</span><span class="n">track</span><span class="p">]</span>
<span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">track</span><span class="p">,</span> <span class="nb">type</span><span class="p">(</span><span class="n">df</span><span class="p">)):</span>
<span class="n">track</span> <span class="o">=</span> <span class="n">track</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]</span>
<span class="c"># Creating a temp list</span>
<span class="n">temp</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">i</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">df_time</span><span class="o">.</span><span class="n">index</span><span class="p">:</span>
<span class="c"># If track has not entered the charts yet, put None</span>
<span class="k">if</span> <span class="n">d</span> <span class="o"><</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">curr_song</span><span class="o">.</span><span class="n">date_entered</span><span class="p">):</span>
<span class="n">temp</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">None</span><span class="p">)</span>
<span class="c"># Else if find the value for this songs position on the chart for this week in it's run</span>
<span class="c"># We are counting with i to avoid indexing issues</span>
<span class="k">elif</span> <span class="n">i</span> <span class="o"><</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_weeks</span><span class="p">):</span>
<span class="n">temp</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">track</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
<span class="n">i</span><span class="o">+=</span><span class="mi">1</span>
<span class="c"># Else append None when the songs run is over anyway. This will only hit for songs that entered the chart early</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">temp</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">None</span><span class="p">)</span>
<span class="c"># Update our new data frame with our time_series organized data</span>
<span class="n">df_time</span><span class="p">[</span><span class="n">df_t</span><span class="p">[</span><span class="n">t</span><span class="p">]</span><span class="o">.</span><span class="n">track</span><span class="p">]</span> <span class="o">=</span> <span class="n">temp</span>
</code></pre>
</div>
<p>With our data frame now indexed by a time series, plotting with time as our x axis should be super easy!</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">df_time</span><span class="p">[</span><span class="n">top_5</span><span class="o">.</span><span class="n">track</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">style</span><span class="o">=</span><span class="s">'-'</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">gca</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylim</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">100</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="n">ax</span><span class="o">.</span><span class="n">get_ylim</span><span class="p">()[::</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_yticks</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span><span class="mi">20</span><span class="p">,</span><span class="mi">40</span><span class="p">,</span><span class="mi">60</span><span class="p">,</span><span class="mi">80</span><span class="p">,</span><span class="mi">100</span><span class="p">])</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">"Position on Billboard Hot 100"</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">gcf</span><span class="p">()</span>
<span class="n">fig</span><span class="o">.</span><span class="n">set_size_inches</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">'run_by_time.png'</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre>
</div>
<p><img src="/img/plots/billboardtop5_weekly_timeseries.png" alt="Top 5 Billboard Runs with Timeseries" /></p>
<p>There we have it. The same plot as before, but now with a time series axis. I was going to try to impute the missing data for this dataset, but I will leave that as a project for another time.</p>
<h1 id="exploring-the-billboard-top-100-with-pandas">Exploring the Billboard Top 100 with Pandas</h1>
<h1 id="understanding-our-data">Understanding our Data</h1>
<p>The data set below is a selection from this larger dataset (link <a href="https://github.com/rtjeannier/jekyll-blog/blob/master/data/billboard.csv">here</a>) of songs from the year 2000 on Billboard’s top 100.</p>
<table>
<thead>
<tr>
<th>year</th>
<th>artist_inverted</th>
<th>track</th>
<th>time</th>
<th>genre</th>
<th>date_entered</th>
<th>date_peaked</th>
<th>x1st_week</th>
<th>…</th>
<th>x53rd_week</th>
<th>x54th_week</th>
<th>x55th_week</th>
<th>x56th_week</th>
<th>x57th_week</th>
<th>x58th_week</th>
<th>x59th_week</th>
<th>x60th_week</th>
</tr>
</thead>
<tbody>
<tr>
<td>2000</td>
<td>Lonestar</td>
<td>Amazed</td>
<td>4:25</td>
<td>Country</td>
<td>1999-06-05</td>
<td>2000-03-04</td>
<td>81</td>
<td>…</td>
<td>20.0</td>
<td>22.0</td>
<td>22.0</td>
<td>25.0</td>
<td>26.0</td>
<td>31.0</td>
<td>32.0</td>
<td>37.0</td>
</tr>
<tr>
<td>2000</td>
<td>Creed</td>
<td>Higher</td>
<td>5:16</td>
<td>Rock</td>
<td>1999-09-11</td>
<td>2000-07-22</td>
<td>81</td>
<td>…</td>
<td>17.0</td>
<td>17.0</td>
<td>21.0</td>
<td>26.0</td>
<td>29.0</td>
<td>32.0</td>
<td>39.0</td>
<td>39.0</td>
</tr>
<tr>
<td>2000</td>
<td>3 Doors Down</td>
<td>Kryptonite</td>
<td>3:53</td>
<td>Rock</td>
<td>2000-04-08</td>
<td>2000-11-11</td>
<td>81</td>
<td>…</td>
<td>49.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<td>2000</td>
<td>Creed</td>
<td>With Arms Wide Open</td>
<td>3:52</td>
<td>Rock</td>
<td>2000-05-13</td>
<td>2000-11-11</td>
<td>84</td>
<td>…</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<td>2000</td>
<td>Hill, Faith</td>
<td>Breathe</td>
<td>4:04</td>
<td>Rap</td>
<td>1999-11-06</td>
<td>2000-04-22</td>
<td>81</td>
<td>…</td>
<td>47.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>The data tracks the date a song entered the Billboard Top 100 and its weekly position thereafter. NaN is used to represent the weeks when a song either has not reached the chart yet or has fallen off the chart.</p>
<p>I would like to see which songs had the longest runs on the chart; in other words, which songs have the most numerical values in the weekly (‘x..._week’) columns. The easiest way to do this in Pandas is to add a new column to our data frame, which we will call ‘weeks_on_chart’.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># First we can grab a data frame of just the weeks to make things a little easier to work with.</span>
<span class="n">df_weeks</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="nb">list</span><span class="p">(</span><span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'x'</span><span class="p">),</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">))]</span><span class="o">.</span><span class="n">transpose</span><span class="p">()</span>
<span class="c"># We can now transpose the dataframe and count. The index will still match, so we can tack it on with a new name.</span>
<span class="n">df</span><span class="p">[</span><span class="s">'weeks_on_chart'</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">df_weeks</span><span class="o">.</span><span class="n">count</span><span class="p">()]</span>
<span class="c"># I also need to convert the counts to ints</span>
<span class="c">#Let's sort our dataframe and look at what the longest running song was</span>
<span class="n">top_5</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s">'weeks_on_chart'</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre>
</div>
<p>What does our output look like?</p>
<table>
<thead>
<tr>
<th>artist_inverted</th>
<th>track</th>
<th>genre</th>
<th>weeks_on_chart</th>
</tr>
</thead>
<tbody>
<tr>
<td>Creed</td>
<td>Higher</td>
<td>Rock</td>
<td>57</td>
</tr>
<tr>
<td>Lonestar</td>
<td>Amazed</td>
<td>Country</td>
<td>55</td>
</tr>
<tr>
<td>3 Doors Down</td>
<td>Kryptonite</td>
<td>Rock</td>
<td>53</td>
</tr>
<tr>
<td>Hill, Faith</td>
<td>Breathe</td>
<td>Rap</td>
<td>53</td>
</tr>
<tr>
<td>Creed</td>
<td>With Arms Wide Open</td>
<td>Rock</td>
<td>47</td>
</tr>
</tbody>
</table>
<p>Oh man, 2000 wasn’t exactly the greatest year for music. But nonetheless, there are still plenty of things for us to explore. What did these long runs on the chart look like? Did any of these songs reach number one? If so, for how long? There are a lot of questions I’d like answered, but I don’t feel like typing all of them out one at a time.</p>
<p>This calls for a visualization!</p>
<h1 id="visualizing-our-data">Visualizing our Data</h1>
<p>Plotting with pandas can be incredibly simple, especially when we have a well organized dataframe. To plot the runs of the top 5 songs we only need a few lines of code.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># First we need to import some plotting libraries</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="kn">as</span> <span class="nn">sns</span>
<span class="c"># And then we make our plot!</span>
<span class="n">df_weeks</span><span class="p">[</span><span class="n">top_5</span><span class="o">.</span><span class="n">track</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">style</span><span class="o">=</span><span class="s">'-'</span><span class="p">)</span>
</code></pre>
</div>
<p><img src="/img/plots/billboardtop5_weekly.png" alt="Top 5 Billboard Runs" /></p>
<p>This answers many of my earlier questions, but this plot is still a bit deceptive. The x axis of this plot represents the week a song entered the top 100, meaning every song’s run starts at week one. However, each song on the chart has a different <em>literal</em> week of the year 2000 for its week one, meaning the start date for ‘Higher’ is a different week one than the start date for ‘Amazed’. Our current plot does not show this. Additionally, there is a visible gap between weeks 20 and 40. After examining this phenomenon I learned that the dataset being used for these plots is apparently corrupted: a large portion of the values between the columns for week 20 and week 40 are missing entirely.</p>
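<p>A quick check, reusing the df_weeks frame from above, hints at how widespread that gap is (this is just a sketch, not part of the original analysis):</p>
<div class="highlighter-rouge"><pre class="highlight"><code># How many songs have a recorded position in each weekly column?
# After the transpose above, df_weeks has the week columns as its index.
non_null_per_week = df_weeks.notnull().sum(axis=1)
non_null_per_week.plot(kind='bar', figsize=(12, 4))
plt.ylabel('Songs with a recorded position')
plt.show()
</code></pre>
</div>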
<p>If we were to somehow translate our x axis to actual dates, we could make the plot I really want to see and, at the same time, examine which rankings are missing from the chart for any given week. We could infer where our data is missing by iterating over each week in the year 2000 and finding which rankings from the top 100 are absent.</p>
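<p>A rough sketch of that idea might look something like this (assuming df and the weekly columns from above; the variable names here are just placeholders):</p>
<div class="highlighter-rouge"><pre class="highlight"><code># Sketch: map each calendar week to the set of chart positions we actually have,
# then take the difference against the full set of ranks 1-100.
week_cols = [c for c in df.columns if c.startswith('x')]
positions_by_week = {}
for _, row in df.iterrows():
    start = pd.to_datetime(row['date_entered'])
    for offset, col in enumerate(week_cols):
        if pd.notnull(row[col]):
            week = start + pd.Timedelta(weeks=offset)
            positions_by_week.setdefault(week, set()).add(int(row[col]))
# Ranks missing from the chart in each week
missing_ranks = {week: set(range(1, 101)) - ranks
                 for week, ranks in positions_by_week.items()}
</code></pre>
</div>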
<p>In order to do this we are going to have to work with Timeseries. But I will save that analysis for next time…</p>