Entry Name: "TUW‐ Omenitsch‐MC1" VAST 2013 Challenge Mini‐Challenge 1: Box Office VAST Team Members: Primary: Philipp Omenitsch Bachelor Student at Vienna University of Technology, [email protected] Advisors: Bilal Alsallakh, [email protected]; Markus Bögl [email protected] PhD Students at Vienna University of Technology. Student Team: YES Analytic Tools Used: Tableau MATLAB MySQL queries. May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2013 is complete? YES Video: http://youtu.be/OjsJaqUjM1c (will be uploaded on Aug. the 7th) Description: When I first tried to tackle the prediction problem, I tried to familiarize myself with the topic by making rough predictions with as few factors as possible to have a solid basis. Afterwards I tried new ideas and added bit by bit new features and methods for the predictive analysis. To acquire the IMDb data, I used an LINQ to parse the data via the API provided by IMDb. In order to extract certain data from the acquired IMDb collection, I extended the software written by Kosara et al. for the InfoVis 2007 contest (available at http://eagereyes.org/blog/2007/infovis‐contest‐ 2007‐data). I had to fix several bugs in the software to get it to work correctly for my purposes. Then I imported all the data into a MySQL database in order to make it easier to query and manage. In MySQL I wrote some scripts to calculate more features from the existing data (e.g. the mean actor ratings). To perform the analysis, I used both automated methods, and visualization methods. The automated methods are based on a 1‐layer neural network. I used MATLAB to train the network. The visualization is performed almost entirely using Tableau, which worked very well for analyzing large volumes of data quickly and in a flexible way. With the help of Tableau I could identify many trends and correlations between data features, which were vital for the features I used for the automated analysis. Predicting the Average Viewer Rating To predict the average viewer rating on a movie, I used a linear combination of the following data features of the movies: mean viewer rating of actors: For each of the first 5 actors in the billboard of one movie, I calculated the mean rating of the movies this actor participated in. Only the movies in which the actor’s billboard position is among the top 10 are included, as proposed in the literature on movie success prediction provided by the VAST challenge organizers. The mean rating is computed as a weighted average of the movie ratings, with lower weights given to older movies. This aims to emphasize recent movies, as I hypothesize that they have influence on the current popularity of each actor. We adjusted these weights iteratively base on the error plots of our neural network (we choose the values that achieve better converges). As a result, we compute five data features related to the top‐5 actors. mean viewer rating of director: I computed the mean rating for a director in the same way as for the actors. The main difference is that, in case of multiple directors, their mean values are aggregated by computing a simple average, neglecting the billboard order. Number of ratings: I discovered a weak correlation between the number of ratings of a movie and the viewer rating, as illustrated in the figure 1. Fig. 1: the relation between the number of ratings and the actual ratings on a movie. Therefore, I decided to add the log10 of the number of ratings as a feature. 
To obtain the weights for the linear combination of the above seven features, I employed a simple 1-layer neural network. As training data for this network, I used the IMDb data set filtered to US-only movies released since the year 2000. The target vector is the average viewer rating of each of these movies. This simple model exhibited small error rates, as depicted in Figure 2.

Fig. 2: The error rate of the linear model for predicting the movie rating.
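Because a single linear layer fitted under a squared-error loss has a closed-form solution, the same fit can be sketched as an ordinary least-squares solve; this stands in for the network training and the variable names are illustrative.

```matlab
% Least-squares sketch of the 1-layer linear model (illustrative).
% X: n-by-7 matrix (five actor means, director mean, log10 rating count)
% for US movies released since 2000; y: their average viewer ratings.
Xb   = [X, ones(size(X,1), 1)];  % append a bias column
w    = Xb \ y;                   % least-squares weights for 7 features + bias
yHat = Xb * w;                   % predicted average viewer ratings
err  = mean(abs(yHat - y));      % mean absolute error, cf. Fig. 2
```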
The average rating of a movie varies slightly by genre, as illustrated in Figure 3. In this figure, all movies are taken into account, regardless of their number of ratings. However, I discovered that the values change significantly depending on the number of ratings (some genres move to the left, some to the right). Therefore, I adjusted the value predicted by the linear model according to the genre(s) of the movie, taking the number of ratings into account when computing the adjustment.

Fig. 3: The average viewer rating of a movie by genre.

Predicting the Box Office Number

Clearly, the box office income is strongly linked to the number of theaters a movie is shown in during its opening weekend. However, this data was not available directly from the IMDb API, so I could not use it. Nevertheless, I hypothesize that the box office number of a movie correlates strongly with the number of ratings it receives. In contrast to the actual rating, this number varies strongly by genre (Figure 4), because each genre addresses a different target audience. Therefore, the initial guess of the number of ratings a movie will receive is based on its genres.

Fig. 4: The average number of ratings of a movie by genre.

In addition to the genres, I noticed that the number of ratings a movie receives, and hence its box office number, varies strongly with the week-of-year and the day-of-week the movie is released in. Therefore, I adjusted the expected number of ratings computed from the genre(s) according to the release dates of the upcoming movies, using the curves in Figure 5 as adjustment weights. The expected number of ratings according to the genres and the release date is computed from historical IMDb data only. A sketch of this adjustment is given below.

Fig. 5: Influence of week-of-year and day-of-week on the number of ratings of a movie.
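A minimal sketch of this adjustment, assuming precomputed lookup tables derived from the historical IMDb data (the names, and the multiplicative form of the combination, are illustrative):

```matlab
% Adjusting the genre-based expectation of the rating count by the
% release-date trend curves (cf. Fig. 5). 'genreMean', 'weekCurve', and
% 'dayCurve' are assumed lookup tables built from historical IMDb data;
% the trend curves are taken to be normalized to a mean of 1.
baseCount       = mean(genreMean(genreIds));                   % expectation over the movie's genre(s)
adj             = weekCurve(weekOfYear) * dayCurve(dayOfWeek); % combined trend weight
expectedRatings = baseCount * adj;                             % adjusted expectation
```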
In order to improve these estimates with actual data about the movie at hand, I tried to incorporate real-time data extracted from Twitter and Bitly. I hypothesized that Twitter represents viewer interest most accurately of all available features, because engagement on social media shows the interest in a movie, and therefore the motivation to go and see it, regardless of whether the sentiment is good or bad. Also, the more users were talking about a movie, the more it may have been advertised, which suggests it will probably be shown in more cinemas. I also analyzed the trend by sampling 180 tweets and plotting them by the day they were posted, to obtain a time series. All of this social media interaction formed the foundation for my predictions. I also manually refined these predictions by trying to correct them for seasonal effects and by analyzing the viewer-rating prediction from the other algorithm; this is also why the two predictions always influenced each other in my model.

Fig. 6: Twitter data for two movies, one week ahead of their release dates.

1. What data factors, alone or in combination, were most useful for predicting possible outcomes?

For the box office, I found a close correlation in the IMDb data set between box office income and the number of ratings of a movie. Therefore, based on the limited number of box office figures available in the data set, I extrapolated the relation between the number of ratings and the box office numbers in order to generate a training set of adequate size. For determining the actual box office numbers, I used a mix of features: the movie genres, the time of the year, and also the predicted viewer rating. Another very important feature was the time series and the volume of tweets and Bitly link hits leading up to the movie's release. For viewer ratings, I used a weighted linear combination of the following features:
- the actors' average ratings
- the director's average rating
- the log10 of the anticipated number of ratings

2. How did you combine factors from the structured data with factors in unstructured data and what was the impact on the results? Did you see correlations? How can a user of your system explore this combination?

I used the factors from the structured data to obtain a first rough estimate, and later refined and tuned the parameters with the help of my own anticipation of the movie, based on reviews and visual analysis. A very important interaction with the data set for obtaining a plausible viewer rating was setting a filter on the number-of-ratings parameter, because there is a slightly positive, non-linear correlation between the number of ratings and the viewer rating of a movie. Interacting with this parameter and setting it to the right range helped a lot in estimating the viewer rating from the training data.

3. Do the important factors vary by class, such as movie genre?

Yes, they do. One of the big challenges was to find these variations and interpret them in the right way, because they introduce "hidden dimensions" and cause non-linear dependencies, which lead to a very sparse population of the feature space. I tackled this issue by excluding the genres from the initial guess and using them to adjust this guess later.

4. Did you use data on previous movies to help analyze/predict outcomes for later movies? If so, how?

Yes, I built a simple neural network to find the right weights for combining the data features used for the prediction (listed in Q1). The analysis of genres and release-date effects also made up a great deal of the work and helped improve the predictions a lot.

5. For any prediction that you had a significant margin of error (for our challenge, this would be a high mean relative absolute error), explain possible sources of error.

My predictions for "The Heat", submitted on 28.06.2013, were far off. Analyzing the Twitter data, I saw very low Twitter engagement, which came as quite a surprise considering the actors and director of the movie and other factors such as media coverage. The difference between the Twitter and Bitly engagement was also very large. Unfortunately, there was no time to check the sanity of the Twitter data boxofficevast provided, so I decided to make the prediction with possibly corrupt data. This shows the model's weak resilience against a single corrupt feature, and the need for a human in the loop to sanity-check data before it is used in the analysis.

6. What data trends if any were you able to identify? How did the identification of trends affect / shape predictions?

I discovered a strong correlation between the number of IMDb ratings of a movie and the actual box office numbers. A weaker correlation exists between the number of ratings and the actual viewer rating, but it becomes stronger the more ratings a movie has or the higher its rating is (see Figure 1). These trends were essential for choosing the right features. Other trends exist as well, such as the varying number of ratings by day-of-week, week-of-year, and genre. I used these trends to adjust the predictions accordingly, by weighting the results with the trend curves.

Did you see instances where early data about a movie was contradicted by later data/factors?

Yes, in some submissions I noticed that the Twitter data was not in line with the number-of-ratings predictions based on historical data.

Lessons learned: Throughout the challenge, I learned a lot about data analysis and the importance of data visualization, which can help greatly with improving predictions. More complex trends can be discovered by a human, who can combine a priori domain knowledge with strong pattern-recognition abilities for complex trends and correlations between features. Furthermore, errors and corrupt data are far more likely to be detected along the way, as I experienced myself with the Twitter data.
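As an illustration of such a human-in-the-loop check, the fragment below flags a movie whose Twitter and Bitly engagement diverge strongly, as happened for "The Heat". The daily-count variables and the thresholds are assumed placeholders, not part of the original pipeline.

```matlab
% Sanity check (illustrative): flag strongly diverging social media signals.
% 'tweetsPerDay' and 'bitlyPerDay' are assumed daily counts for the week
% before release; the thresholds are arbitrary review triggers.
tw    = sum(tweetsPerDay);
bl    = sum(bitlyPerDay);
ratio = (tw + 1) / (bl + 1);   % smoothed engagement ratio
if ratio < 0.2 || ratio > 5
    warning('Twitter/Bitly engagement diverges; inspect the raw data before predicting.');
end
```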