Data and e-books - appealing, but tough


(Originally posted on LinkedIn, on April 5th, 2016)
I believe data can play a critical role in the present and future of books. As a software engineer, as a data engineer and in business roles at data companies, I have worked with data for most of my professional life (now I mainly work with publisher agreements, some of which are quite data-intensive themselves :D). My academic career has also been deeply rooted in data.
With this in mind, it made a lot of sense that I wanted a deep understanding of how data could play a role at 24symbols, the company I founded, along with three other partners, as a subscription service for ebooks.

Julieta Lionetti, an expert in publishing and content who worked with us for a couple of years, wrote a post a few days ago describing some ideas we once had at 24symbols to engage publishers more and let them know about some of the cool stuff we were thinking of doing. These ideas were related to the information about reader behavior we can process at 24symbols. The post is in Spanish but you can translate it quite easily.
Julieta's article is correct, but I wanted to point out a couple of things.

The validity of data depends a lot on what you want to do with it. For example, there is a technique used in website design and other areas called A/B testing. To improve conversion rates on a website or app, be it the number of registrations, the number of purchases, or even the time spent on the site, companies usually make small but meaningful changes to the service's aesthetics or messaging. A/B testing tools randomly assign visits so that one subset sees, say, "Page A", while another subset sees "Page B", the alternative. The tools then measure the success of each option against a pre-defined goal (e.g. the number of registered users that come from that page) and, after some time, report the results and the winner. The thing is that, though the test itself uses a statistical method to make sure the results are significant, most of the time you don't need to wait the week or two that it may take to know which option to choose. As long as you see that a few hundred people from a random sample behave in a clear way, you can be quite sure of which page is better.
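To make the "few hundred people" point concrete, here is a minimal sketch of the kind of check an A/B testing tool might run: a two-proportion z-test. The function name and the visitor counts are made up for illustration, and real tools may well use a different statistical method.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare the conversion rates of Page A and Page B.

    Returns the z statistic and the two-sided p-value under the
    hypothesis that both pages convert at the same rate.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate: the best single estimate if the pages do not differ.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in the data, the test is undefined")
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical numbers: 300 visitors per page, 30 vs 60 registrations.
z, p = two_proportion_z_test(conv_a=30, n_a=300, conv_b=60, n_b=300)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

With a gap as wide as 10% versus 20%, a few hundred visitors per page already give a very small p-value. With a narrow gap (say 33 versus 36 registrations) the same sample size tells you very little, which is when waiting for the full test makes sense.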

The same happens when analyzing how people read books. If a publisher creates a focus group, either physically or online, and the first hundred readers stop reading at page 50, it is clear that this is not an outlier, and the publisher could probably stop right there and look into WTF happens on that page.
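Spotting that kind of drop-off needs nothing fancier than counting. A minimal sketch, assuming all we have is the last page each focus-group reader reached (the data below is invented):

```python
from collections import Counter


def most_common_stop(last_pages):
    """Return the page where most readers stopped and the share who stopped there."""
    page, count = Counter(last_pages).most_common(1)[0]
    return page, count / len(last_pages)


# Hypothetical focus group of 100 readers: 70 give up at page 50,
# the rest stop at scattered pages.
last_pages = [50] * 70 + [12, 180, 230] * 10
page, share = most_common_stop(last_pages)
print(f"Most readers stopped at page {page} ({share:.0%} of the group)")
```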

But do we need complex Data Science or Big Data for that? I don't think so. No serious statistics are involved, no massive amounts of data, no rapidly changing pieces of information, no complex metadata, no real-time decisions. In most cases, a focus group would do.

We can go further. Julieta mentioned that we had 20,000 readers (I don't remember if that was the exact number, but let's take it as it is) of one book in 24symbols, and that Juliett, our data scientist, and I said that we didn't have enough data points to work with. This requires an explanation, which also serves as another reason why data is so complex in constrained scenarios. Obviously, 20,000 full readers would provide great insights about a book. Many authors would LOVE to have those readers on a regular basis. If we just wanted to talk about THAT book, it'd be more than enough. Also, we are talking here about real readers, who choose a book on their own, not a predefined group that is meant to review it. This provides a much better population sample, coming from all around the world, one that can be properly segmented to make sure the experiment is valid. But we wanted to engage publishers and authors by finding out what data could offer them beyond a single book. So we needed to find not only specific behaviors but also generic ones. And that made things more difficult.
At the time, most of those readers were free users, meaning they could not get beyond the first 10% of the book. We had just started to work on improving our conversion rate to premium, once we had reached a good free user base. That created a chasm for data analysis, as we needed to wait to see how many of those users would convert to premium, and whether we could relate this action to the book itself (as an impulse purchase) or not. In addition, there were many outliers in how people read that prevented us from making generic assumptions. That makes a lot of sense: people read however they want, and do not follow any specific pattern.
Statisticians have long known that sampling precision improves with randomness more than with sample size. But obtaining a pure random sample is difficult. How do you avoid systematic bias when all your users read through a single platform? Or when the initial sample you send invites to is so small and you still want to segment by gender, location or how fast people read?
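A toy simulation illustrates why randomness matters more than size. Everything here is invented for the sake of the example: a population of 100,000 readers, of whom 20% are "heavy" readers who finish the book 60% of the time, while the rest finish it 20% of the time. It is a sketch under those assumptions, not a model of our actual users.

```python
import random


def finish_rate(readers):
    """Share of the given readers who finished the book."""
    return sum(r["finished"] for r in readers) / len(readers)


def build_population():
    # Hypothetical population: 20,000 heavy readers (60% finish)
    # and 80,000 casual readers (20% finish).
    heavy = [{"segment": "heavy", "finished": i < 12_000} for i in range(20_000)]
    casual = [{"segment": "casual", "finished": i < 16_000} for i in range(80_000)]
    return heavy + casual


population = build_population()
true_rate = finish_rate(population)

# A small sample drawn at random from the whole population...
rng = random.Random(42)  # fixed seed so the run is repeatable
small_random = rng.sample(population, 500)

# ...versus a sample 40 times larger, but made only of heavy readers
# (for instance, the people who happen to use one platform).
large_biased = [r for r in population if r["segment"] == "heavy"]

random_error = abs(finish_rate(small_random) - true_rate)
biased_error = abs(finish_rate(large_biased) - true_rate)
print(f"true finish rate:          {true_rate:.3f}")
print(f"error, 500 random readers: {random_error:.3f}")
print(f"error, 20,000 biased ones: {biased_error:.3f}")
```

The biased sample is off by a fixed amount no matter how many more heavy readers you add, while the small random sample lands close to the true rate.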
In our case, at the end of the day, our sample at that time was not large or representative enough to give statistically significant results of the kind we believe the industry needs.
Of course, we could have published those results, and they would have been of some interest to our bookish friends; certainly, some insights could have been obtained from them. But as we are trying to fully understand how people actually read, how they truly experience a passion, an obligation or a penance when opening a book, we decided to continue our research until we find the best way to serve those interests.


Now that we have many more paying subscribers (tens of thousands of them) with no 10% limitation, and 3x more books than when we had this conversation with Julieta, it may be time to review our hypotheses. Our internal prototypes, albeit still simple from a Data Science standpoint, provide us with some really interesting info about reading behaviour that helps us understand certain patterns, from genre behaviour categorization to fraud detection.

If anyone is interested, please let me know.

Photo credit: https://www.flickr.com/photos/113026679@N03/16343064811/in

Comments

Anonymous said…
Hey Justo!
New entries in your blog! Cool that you came back again!
Kind regards from an old pupil from yours :D
Ernesto.
Unknown said…
Will do my best, Ernesto! :D :D :D

Thanks for reading!