The Wandering Mad Movie Mind

Last week in my post I spent some time leading you through my thought process in developing a Watch List. There were some loose threads in that article that I’ve been tugging at over the last week.

The first thread was the high “really like” probability that my algorithm assigned to two movies, Fight Club and Amelie, that I “really” didn’t like the first time I saw them. It bothered me enough that I took another look at my algorithm. Without boring you with the details, I had an “aha” moment and was able to reengineer the algorithm so that it now develops a unique probability for each movie. Prior to this I was assigning the same probability to groups of movies with similar ratings. The result is a tighter range of probabilities clustered around the base probability, which is defined as the probability that I would “really like” a movie randomly selected from the database. If you look at this week’s Watch List, you’ll notice that my top movie, The Untouchables, has a “really like” probability of 72.2%. In my revised algorithm that is a high-probability movie. As my database gets larger, the extremes of the assigned probabilities will widen.

One of the by-products of this change is that the rating assigned by Netflix is now the dominant driver of the final probability. This is as it should be. Netflix has by far the largest database of any site I use, and because of that it produces the most credible and reliable ratings of any of the rating websites. Which brings me back to Fight Club and Amelie. The probability for Fight Club went from 84.8% under the old formula to 50.8% under the new formula. Amelie went from 72.0% to 54.3%. On the other hand, Hacksaw Ridge, a movie I’m pretty confident I will like, changed only slightly, from 71.5% to 69.6%.
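
To make the mechanics a little more concrete, here is a minimal sketch of the kind of credibility weighting described above. It is not my actual formula; the function name, the weights, and the sample numbers are all illustrative assumptions. The idea is simply that each site’s estimate is weighted by how many ratings stand behind it, so the largest database dominates, and the blended estimate is shrunk toward the base probability when the total evidence is thin.

```python
# A hypothetical sketch of a credibility-weighted "really like" probability.
# The names and constants are illustrative, not the actual algorithm.

def really_like_probability(site_estimates, base_probability=0.55, credibility_k=5_000):
    """Blend per-site estimates, weighting each by its rating count, then
    shrink the blend toward the base probability when evidence is thin.

    site_estimates: list of (probability_estimate, number_of_ratings) pairs.
    credibility_k:  rating count at which a movie's own data earns 50% weight.
    """
    total_ratings = sum(n for _, n in site_estimates)
    if total_ratings == 0:
        return base_probability

    # Weight each site's estimate by its share of the ratings; a site with a
    # huge database (think Netflix) dominates the blend.
    blended = sum(p * n for p, n in site_estimates) / total_ratings

    # Credibility factor: approaches 1 as the movie accumulates ratings.
    z = total_ratings / (total_ratings + credibility_k)

    return z * blended + (1 - z) * base_probability


# Example with made-up numbers: three sites with very different database sizes.
print(f"{really_like_probability([(0.80, 120_000), (0.60, 8_000), (0.50, 1_500)]):.1%}")
```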

Another thread I tugged at this week was in response to a question from one of the readers of this blog. The question was why Beauty and the Beast was earning a low “really like” probability of 36.6% when I felt there was a high likelihood that I was going to “really like” it. In fact, I saw the movie this past week and it turned out to be a “really like” instant classic. I rated it a 93 out of 100, which is a very high rating from me for a new movie. In my algorithm, new movies are underrated for two reasons. First, because they generate so few ratings in their early months (Netflix has only 2,460 ratings for Beauty and the Beast so far), the credibility of the movie’s own data is so small that the “really like” probability is driven by the Oscar performance part of the algorithm. Second, new movies haven’t been through the Oscar cycle yet, so their Oscar performance probability is that of a movie that didn’t earn an Oscar nomination, or 35.8%. This is why Beauty and the Beast sat at only a 36.6% “really like” probability on my Watch List last week.
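
Here is a compact illustration of why a brand-new release lands near the Oscar performance number. The weighting scheme and the constants are my own assumptions for illustration; the only figures taken from the post are the 2,460 Netflix ratings and the 35.8% no-nomination probability.

```python
# Illustrative only: with very few ratings, the credibility weight on the
# movie's own data is tiny, so the estimate sits near the Oscar prior.

oscar_prior = 0.358        # no-nomination probability cited in the post
netflix_ratings = 2_460    # Beauty and the Beast's rating count cited in the post
movie_estimate = 0.90      # hypothetical "really like" signal from the early ratings
credibility_k = 50_000     # hypothetical count needed for the movie's data to get 50% weight

z = netflix_ratings / (netflix_ratings + credibility_k)
probability = z * movie_estimate + (1 - z) * oscar_prior

print(f"credibility weight: {z:.3f}, blended probability: {probability:.1%}")
```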

I’ll leave you this week with a concern. As I mentioned above, Netflix is the cornerstone of my whole “really like” system. You can appreciate, then, my heart palpitations when it was announced a couple of weeks ago that Netflix is abandoning its five-star rating system in April. It is replacing it with a thumbs up or thumbs down rating with a percentage next to it, perhaps a little like Rotten Tomatoes. While I am keeping an open mind about the change, it has the potential to destroy the best movie recommender system in the business. If it does, I will be one “mad” movie man, and that’s not “crazy” mad.

How Do You Know a Tarnished Penny Isn’t a Tarnished Quarter?

One of my first posts on this site was The Shiny Penny, in which I espoused the virtues of older movies. I still believe that, and yet here I am, almost eleven months later, wondering if my movie selection algorithm does a good enough job surfacing those “tarnished quarters”. A more accurate statement of the problem is that older movies generate less data for the movie websites I use in my algorithm, which in turn produces fewer recommendations for older movies.

Let me explain the issue by comparing IMDB voting with my own ratings for each movie decade. Since I began developing my algorithm around 2010, I’m going to use 2010 as the year I began disciplining my movie choices with an algorithm. You might recall from previous posts that my database consists of movies I’ve watched in the last fifteen years. Each month I remove movies from the database that fall outside the fifteen years, which makes them available for me to watch again. One other clarification: I use the IMDB ratings for ages 45+ to better match my demographic.
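
For anyone curious how a comparison like the tables below can be pulled together, here is a rough sketch. The field names and the sample records are invented; the point is just the grouping: split each decade’s movies by whether they were watched before or after 2010 and average the voter counts and ratings.

```python
# Hypothetical sketch of building the per-decade comparison tables.
# Field names and sample data are invented for illustration.
from statistics import mean

movies = [
    # (release_year, year_watched, imdb_votes, imdb_rating, my_rating)
    (1994, 2014, 21_500, 7.6, 8.0),
    (1987, 2006, 12_300, 7.4, 7.0),
    (2015, 2016,  9_800, 7.2, 7.5),
]

def decade(year):
    return f"{(year // 10) * 10}'s"

groups = {}
for released, watched, votes, imdb, mine in movies:
    key = (decade(released), "After" if watched >= 2010 else "Before")
    groups.setdefault(key, []).append((votes, imdb, mine))

for (dec, era), rows in sorted(groups.items()):
    votes, imdb, mine = zip(*rows)
    print(f"{dec} | Viewed {era} Algorithm | {len(rows)} movies | "
          f"avg voters {mean(votes):,.0f} | avg IMDB {mean(imdb):.1f} | my avg {mean(mine):.1f}")
```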

To familiarize you with the format I’ll display for each decade, here’s a look at the 2010’s:

Database Movies Released in the 2010’s | # of Movies | % of Movies | Avg. # of Voters | Avg. IMDB Rating | My Avg. Rating
Viewed After Algorithm | 340 | 100.0% | 10,369 | 7.3 | 7.3
Viewed Before Algorithm | 0 | 0.0% | – | – | –

The 340 movies I’ve seen from the 2010’s are 17.2% of all of the movies I’ve seen in the last 15 years, and there are three more years to go in the decade. If recommended movies were distributed evenly across all nine decades, this percentage would be closer to 11%. Because the “shiny pennies” are the most available to watch, there is a tendency to watch more of the newer movies. I also believe that many newer movies pass the selection screen while their data is still immature but might not pass it once the data matures. The Avg. # of Voters column is an indicator of how mature the data is. Keep this in mind as we look at the subsequent decades.

The 2000’s represent my least disciplined movie watching: 38.4% of all of the movies in the database come from this decade. The decision to watch a specific movie was driven primarily by what was available rather than by what was recommended.

Database Movies Released in the 2000’s | # of Movies | % of Movies | Avg. # of Voters | Avg. IMDB Rating | My Avg. Rating
Viewed After Algorithm | 81 | 10.6% | 10,763 | 7.2 | 6.8
Viewed Before Algorithm | 680 | 89.4% | 10,405 | 7.1 | 6.4

One thing to remember about movies in this decade is that only movies watched in 2000 and 2001 have dropped out of the database. As a result, only 10.6% of the movies were selected with some version of the selection algorithm.

The next three decades represent the reliability peak in terms of the algorithm.

Database Movies Released in the 1990’s | # of Movies | % of Movies | Avg. # of Voters | Avg. IMDB Rating | My Avg. Rating
Viewed After Algorithm | 115 | 46.7% | 18,179 | 7.4 | 8.1
Viewed Before Algorithm | 131 | 53.3% | 11,557 | 7.2 | 7.0

Database Movies Released in the 1980’s | # of Movies | % of Movies | Avg. # of Voters | Avg. IMDB Rating | My Avg. Rating
Viewed After Algorithm | 68 | 44.4% | 14,025 | 7.5 | 7.6
Viewed Before Algorithm | 85 | 55.6% | 12,505 | 7.4 | 7.0

Database Movies Released in the 1970’s | # of Movies | % of Movies | Avg. # of Voters | Avg. IMDB Rating | My Avg. Rating
Viewed After Algorithm | 38 | 38.0% | 18,365 | 7.8 | 7.6
Viewed Before Algorithm | 62 | 62.0% | 9,846 | 7.5 | 6.5

Note that the average number of voters per movie is higher for these three decades than for movies released after 2000. With each decade, there is a growing gap in voters per movie between the movies recommended by the algorithm and those seen before I used the algorithm. This may be indicative of the amount of data needed to produce a recommendation. You also see larger gaps between my enjoyment of the movies chosen through the disciplined selection process and the movies seen prior to the algorithm. My theory is that younger movie viewers will only watch the classics from these decades, and as a result those are the movies that generate sufficient data for the algorithm to be effective.

When we get to the four oldest decades in the database, it becomes clear that the number of movies with enough data to fit the algorithm is minimal.

Database Movies Released in the 1960’s | # of Movies | % of Movies | Avg. # of Voters | Avg. IMDB Rating | My Avg. Rating
Viewed After Algorithm | 23 | 20.0% | 14,597 | 8.0 | 8.3
Viewed Before Algorithm | 92 | 80.0% | 6,652 | 7.7 | 6.6

Database Movies Released in the 1950’s | # of Movies | % of Movies | Avg. # of Voters | Avg. IMDB Rating | My Avg. Rating
Viewed After Algorithm | 22 | 18.0% | 11,981 | 8.0 | 8.4
Viewed Before Algorithm | 100 | 82.0% | 5,995 | 7.7 | 5.9

Database Movies Released in the 1940’s | # of Movies | % of Movies | Avg. # of Voters | Avg. IMDB Rating | My Avg. Rating
Viewed After Algorithm | 21 | 22.1% | 8,021 | 8.0 | 7.9
Viewed Before Algorithm | 74 | 77.9% | 4,843 | 7.8 | 6.5

Database Movies Released in the Pre-1940’s | # of Movies | % of Movies | Avg. # of Voters | Avg. IMDB Rating | My Avg. Rating
Viewed After Algorithm | 7 | 14.0% | 12,169 | 8.0 | 7.5
Viewed Before Algorithm | 43 | 86.0% | 4,784 | 7.9 | 6.2

The results here are even more stark. For these oldest decades, today’s movie viewers and critics are drawn to the classics but probably not much else. It is clear that the selection algorithm is effective for movies with enough data. The problem is that the “really like” movies from these decades that don’t generate data don’t get recommended.

Finding tarnished quarters with a tool that requires data is a problem when the data diminishes as movies age. Another observation is that the algorithm works best for the movies released from the 1970’s to the 1990’s, probably because the data there is both mature and plentiful. Is there value in letting the shiny pennies that look like quarters get a little tarnished before watching them?

Merry Christmas to all and may all of your movies seen this season be “really like” movies.

When the Facts Get in the Way of a Good Story

Friday night is movie night for my wife, Pam, and me. One of the neat features Netflix provides is the capability to add separate profiles to your account for up to five members of the family. This has allowed Pam to input her own ratings of movies, which produce Netflix recommendations based on her taste in movies. So, on Friday nights we seek out movies that are recommended for both of us and settle in for an enjoyable movie night.

On a recent Friday night we watched McFarland USA from Disney Studios. It is the kind of movie we both enjoy. Netflix would probably group it in the “inspirational coach of underdog kids sports movies based on a true story” group. We both loved the movie. Pam gave it five stars and I gave it a nine out of ten, which, if you read my last post, converts to a five-star rating on Netflix.

As a general practice, I don’t read critics’ reviews of a movie until after I see the movie. For McFarland USA, the review at the top of IMDB’s list of external reviews referenced a website I had never visited before, historyvshollywood.com. It’s a niche movie website that specializes in fact-checking movies based on a true story. When I read the History vs. Hollywood fact check of McFarland USA, I discovered that a critical chunk of the story was a fabrication. Frankly, I felt cheated. Normally I’m not bothered by moviemakers taking some storyteller’s license when making a movie based on a true story, some adjusting of the timeline, or adding a fictitious character to better tell the essential story. In those instances, though, the essence of the story isn’t compromised. In the case of McFarland USA, you end up with a 50% untrue story based on a true story. The story of the team is true, but the story of the coach is 90% false.

One of the self-imposed posting rules that I intend to keep is that I won’t discuss details of a recent movie; no spoilers (classic movies like Saturday Night Fever, which have been around for years, are fair game for discussion, however). If you have already watched McFarland USA, or you don’t mind spoilers, you can link to the History vs. Hollywood fact check of the movie here.

Rather than getting into the details of the movie, I’d like to address whether discovering that Disney engaged in blatant manipulation should be cause to go back and rerate the movie. After all, if the rating was influenced by the inspiration provided by a true story, shouldn’t the rating reflect the different view of the movie that exists once you discover the story is full of holes? The answer is an emphatic no. Predictive modeling is a science. Check your emotions at the door. The reality is that, despite the fabrications implanted in this particular movie, Pam and I still like to watch well-made “inspirational coach of underdog kids sports movies based on a true story” and we’d like to see more of them. Even if I was a little less inspired after learning the facts behind McFarland USA, it is a well-made and entertaining story, and we certainly don’t want facts to get in the way of a good story.

Rating Movies: If You Put Garbage In, You’ll Get Garbage Out

In my prior life, I would on occasion find myself leading a training session on the predictive model we were using in our business. Since the purpose of the model was to help our Account Executives make more effective business decisions, one point of emphasis was the instances when the model would present them with misleading information that could result in ineffective business decisions. One of the most basic of these predictive-model traps is that the model relies on input that accurately reflects the conditions being tested. If you put garbage into the model, you will get garbage out of the model.

Netflix, MovieLens, and Criticker are predictive models. They predict movies that you might like based on your rating of the movies you have seen. Just like the predictive model discussed above, if the ratings that you input into these movie models are inconsistent from movie to movie, you increase the chances that the movie website will recommend to you movies that you won’t like. Having a consistent standard for rating movies is a must.

The best approach to rating movies is a simple one. I start with the Netflix guidelines for rating a movie:

  • 5 Stars = I loved this movie.
  • 4 Stars = I really liked this movie.
  • 3 Stars = I liked this movie.
  • 2 Stars = I didn’t like this movie.
  • 1 Star = I hated this movie.

When I’ve used this standard to guide others in rating movies, the feedback has been that it is an easily understood standard. The primary complaint has been that sometimes the rater can’t decide between the higher and lower rating; the movie fits somewhere in between. For example: “I can’t decide whether I ‘really liked’ this movie or just ‘liked’ it.” This happens often enough that I’ve concluded that a 10 point scale is best:

  • 10 = I loved this movie.
  • 9 = I can’t decide between “really liked” and “loved”.
  • 8 = I really liked this movie.
  • 7 = I can’t decide between “liked” and “really liked”.
  • 6 = I liked this movie.
  • 5 = I can’t decide between “didn’t like” and “liked”.
  • 4 = I didn’t like this movie.
  • 3 = I can’t decide between “hated” and “didn’t like”.
  • 2 = I hated this movie.
  • 1 = My feeling for this movie is beyond hate.

The nice thing about a 10 point scale is that it is easy to convert to other standards. Using the scales that exist for each of the websites, an example of the conversion would look like this:

  • IMDB = 7 (IMDB already uses a 10 point scale)
  • Netflix = 7 / 2 = 3.5 = 4 rounded up (Netflix uses a 5 star scale with no 1/2 stars)
  • Criticker = 7 x 10 = 70 (Criticker uses a 100 point scale)
  • MovieLens = 7 / 2 = 3.5 (MovieLens has a 5 star scale but allows input of 1/2 stars)
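
If you keep your ratings in a spreadsheet or a small script, this conversion is easy to automate. Below is a short sketch of the mapping in the list above; the function name is mine, and the rounding choices simply follow the list (Netflix rounded up to a whole star, MovieLens kept at the half star).

```python
import math

def convert_rating(ten_point):
    """Convert a 1-10 rating to the other sites' scales, per the list above."""
    return {
        "IMDB": ten_point,                     # already on a 10 point scale
        "Netflix": math.ceil(ten_point / 2),   # 5 stars, no half stars, round up
        "Criticker": ten_point * 10,           # 100 point scale
        "MovieLens": ten_point / 2,            # 5 stars, half stars allowed
    }

print(convert_rating(7))
# {'IMDB': 7, 'Netflix': 4, 'Criticker': 70, 'MovieLens': 3.5}
```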

Criticker, being on a 100 point scale, gives you the ability to fine-tune your ratings even more. I think it is difficult, though, to subjectively differentiate between, for example, an 82 and an 83. In a future post we can explore this issue further.

So from one simple evaluation of a movie you can generate a consistent rating across all of the websites that you might use. This consistency allows for a more “apples to apples” comparison.

So throw out the garbage. Good data in will produce good data out, and a more reliable list of movies that you will “really like”.

Is There Something Rotten (Tomatoes) in Denmark?

With apologies to William Shakespeare and Hamlet, do corporate profit incentives have a corrupting influence on movie recommender websites? Movie ratings have become big business. Amazon bought IMDB in 1998 to promote Amazon products. There appears to be a synergy between the two that doesn’t seem to impact IMDB’s rating system. On the other hand, the Netflix business model, which began as a DVD mail-order business, is today a very different business. Netflix has become heavily invested in original entertainment content for its online streaming business and is using a recommender algorithm for that business that is different from the gold-standard algorithm used for its DVD business. Does the Netflix algorithm for its online streaming business better serve the interests of Netflix subscribers or Netflix profits? I’m sure Netflix would say that it serves both. I’m not so sure. This will be a topic of interest for me in future posts. The more immediate concern is Rotten Tomatoes.

It was announced on Feb. 17, 2016 that Rotten Tomatoes, along with the movie discovery site Flixster, was sold to Fandango. For those of you who are not familiar with Fandango, it is one of the two major online advance movie ticket sales sites; MovieTickets.com is the other. For a premium added to your ticket price, Fandango lets you print movie tickets at home so you can avoid the big lines at the theater.

So, why should we be concerned? Let’s start with the perception that Rotten Tomatoes has become so influential that it makes or breaks movies before they are even released. Here are a couple of articles that express the growing concern film-makers have with Rotten Tomatoes scores: Rotten Tomatoes: One Filmmaker’s Critical Conundrum and Summer Box Office: How Movie Tracking Went Off the Rails. Whether it is true or not, the movie industry believes that the box office success or failure of a film is in the hands of 200 or so critics and the website that aggregates the results, Rotten Tomatoes.

The impact that Rotten Tomatoes has on the box office each week may be a driving force behind Fandango’s acquisition. In CNN Money’s article announcing the purchase, Fandango President Paul Yanover states, “Flixster and Rotten Tomatoes are invaluable resources for movie fans, and we look forward to growing these successful properties, driving more theatrical ticketing and super-serving consumers with all their movie needs.” Fandango makes money when more people go to the movies, particularly on opening weekends for well-reviewed movies, when lines are expected to be long. Rotten Tomatoes’ Certified Fresh designations drive those opening weekend lines. Logically, Fandango’s business interests would be better served by even more movies earning the Certified Fresh rating.

Am I being too cynical? Well, according to a study by Nate Silver’s FiveThirtyEight site, Fandango has done this before. According to FiveThirtyEight, Fandango used some creative rounding to inflate its movie ratings in the past. Has Fandango learned its lesson? It claims that Rotten Tomatoes will maintain its independence within the corporate structure. Maybe, but in my experience corporate acquisitions are made to create profitable synergies – more Certified Fresh ratings, more moviegoers, more long lines for tickets, more “theatrical ticketing” in advance, more profits.
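
To show what “creative rounding” can do, here is a toy comparison. I am assuming, per my reading of the FiveThirtyEight piece, that the issue was rounding scores up to the next half star instead of to the nearest half star; the sample scores below are made up.

```python
import math

def round_to_nearest_half(score):
    """Conventional rounding to the nearest half star."""
    return round(score * 2) / 2

def round_up_to_next_half(score):
    """The 'creative' version: always round up to the next half star."""
    return math.ceil(score * 2) / 2

for score in (4.0, 4.1, 4.3, 4.6):   # made-up average user scores
    print(f"{score} -> nearest: {round_to_nearest_half(score)}, rounded up: {round_up_to_next_half(score)}")
```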

If you begin to “really like” fewer movies that are Certified Fresh on Rotten Tomatoes you might conclude that there may be something Rotten (Tomatoes) in Fandango…if not in Denmark.