A successful partnership between a startup and academia - a case on data science
(Originally posted on LinkedIn, on October 9th, 2017)
Being part of a startup is wonderful in so many ways. But not on the resource side. We always find ourselves swamped in ideas, needs and requirements that never seem to get addressed. That's why having partners is so critical, as they help you advance some of those areas in faster and more efficient ways.
The relationship between academia and business is always a tough call. While there are some counterexamples, it is typically frustrating to reach deals even if they are both academically promising and beneficial to companies. Sometimes it is a matter of high-level strategy. Other times, as I have experienced in the past, it is just because of unhealthy behaviors on one or both sides.
24symbols and data
One of the many things I love about 24symbols is how it satisfies my data cravings. As a CEO I do not have the luxury of spending that much time on data munging, as much as I think it will become a critical part of our future. Still, I do everything I can to continue advancing the use of data in order to improve our relationship with our customers, our understanding of the product, or the details of the books we provide.
However, most of the services and functionalities we create are for internal use. The reason is simple: time, priorities and resources. Crafting a production-ready, data-enabled feature for our readers is not something to take lightly.
We had already tried working with students a few times in the past. My academic background pushes me to believe that this should always be a win-win strategy, and my experience at Denodo with the University of A Coruña, in Northern Spain, reinforces this feeling. For us, having somebody help us validate ideas is golden. And for students, having access to real datasets and real challenges should always be a plus for their projects and research. But after two initial projects that went relatively well, the following experiences were awful. That includes the time we partnered with a small, two-statistician company that had just started and needed some initial references and testimonials: incredibly, they thought that the effort required to... clean the dataset... was not worth it (and no, our dataset is not particularly messy).
But sometimes serendipity works. Last April and May, I taught a couple of courses on Data Analytics and Data Science in both Madrid and Barcelona. They were hosted by adigital, the Spanish eCommerce Association. In one of the courses in Madrid, one of the attendees came to me during the break and told me he was studying for a Master's Degree in Data Science and that his team was looking for a company that would help them with a good Final Project.
I was initially taken aback by this proposal, from a person I had just met, and coming also from a new school I knew nothing about. But, six months later, things are quite different. After I agreed to have an initial meeting, the four-person team has delivered. Though their first proposal was crazy (they basically wanted to implement a two-year, full-time project in four part-time months), they have been able to deliver a really interesting prototype that has brought up quite a few high-quality ideas for us, and that may become a real product in the near future.
The project: users, editions, visualization and recommendations
The collaboration between 24symbols and the MBIT School, with students of the Master's Degree in Big Data and Data Science, is focused on bridging the gap between our intellectual needs and the actual work required to transform an idea into a running prototype.
There were four main areas of work, in addition to setting up the infrastructure (not a simple task in itself). The first two comprised the platform from which to build the third and fourth ones:
- Obtain relevant metrics for reading behavior based on a dataset provided by us. These range from basics like the most-read genres per reader to more advanced metrics like the point of no return (the estimated moment after which a book is not put down anymore and will eventually be finished). The point of no return is computed both per reader (the typical behavior of a user with respect to the books he/she reads) and per edition (how that book has been perceived by its actual readers).
- Perform some basic content analysis of a subset of the books in 24symbols, both from a stylometric standpoint and from a character analysis one. Metrics involved were lexical richness, number of adjectives, and character density.
- Test how all this information could be shown to our readers in a meaningful and yet attractive way. This part was basically focused on data visualization options.
- Use this information to craft a set of recommender systems that potentially inform the reader in a richer way about the books he/she may want to read.
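As an illustration only, here is one way a per-edition point of no return could be estimated. This is a minimal sketch and not the team's actual implementation: the function name `point_of_no_return`, the input format (the furthest progress each reader reached, as a percentage) and the thresholds (90% of readers finishing, 95% progress counted as "finished") are all assumptions made for the example.

```python
from typing import Optional, Sequence


def point_of_no_return(
    max_progress_pct: Sequence[int],
    completion_rate: float = 0.9,   # assumed: share of readers who must finish
    finished_pct: int = 95,         # assumed: progress that counts as "finished"
    step_pct: int = 5,              # assumed: granularity of the search
) -> Optional[int]:
    """Return the smallest progress threshold (in %) such that at least
    `completion_rate` of the readers who reached it went on to finish.

    Returns None when no threshold satisfies the condition.
    """
    for threshold in range(step_pct, 101, step_pct):
        # Readers who got at least this far into the book.
        reached = [p for p in max_progress_pct if p >= threshold]
        if not reached:
            return None
        finished = sum(1 for p in reached if p >= finished_pct)
        if finished / len(reached) >= completion_rate:
            return threshold
    return None


# Hypothetical data: furthest progress reached by ten readers of one edition.
readers = [10, 20, 30, 50, 100, 100, 100, 100, 100, 100]
print(point_of_no_return(readers))
```

The same function could be applied per reader by feeding it the progress that reader reached on each of his/her books instead of the progress of each reader on one book.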
Now, the second step was new to us. We had applied some topic modelling by using techniques such as LDA, but had not gone through any lexicographic analysis, and had not used the concepts to find relationships among them. We had tried to do something like that in a previous project, but never got it to work. The results are really promising as well, even though I believe some additional metrics and further text analysis will be required in order to handle the complexity of having more than 750,000 books in tens of different languages.
The third step, though less technically complex than the rest, was really helpful. We believe in providing the reader with additional information about his/her actions on the service, but want to be very cautious, as we do not want to create an extremely controlled, even gamified, experience where most readers may feel alienated from the actual reading they want to do. While there is a lot of work to do there, the team came up with some good ideas of value to us.
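To make a metric like lexical richness concrete, a common and very simple proxy is the type-token ratio: the number of distinct words divided by the total number of words. The sketch below assumes that definition; the source does not state which formula the team used, and the function name `lexical_richness` and the tokenization rule are illustrative choices.

```python
import re


def lexical_richness(text: str) -> float:
    """Type-token ratio: distinct words divided by total words.

    Tokens are runs of Unicode letters, lowercased, so the function is
    not tied to English. Returns 0.0 for text without any words.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


print(lexical_richness("the cat and the dog"))  # 4 distinct words out of 5
```

The type-token ratio tends to drop as texts get longer, so comparing books of very different lengths would need a length-corrected variant or a fixed-size sample of each book.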
The first and second steps were then used to craft different recommender systems, each of which made use of one or more new metrics: speed of reading, time of reading, point of no return, lexicographic information, character density... The goal of the work was not to pick one over the others, but to understand how each recommender could, in certain cases, be more accurate than the others. Finding the specific cases is something we will look into in future projects. But the implementation of the recommenders was just what we needed now: we can compare our current system with these new ones and see, with the help of our editorial team, what works and what doesn't.
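The idea of building several recommenders on different metrics and then comparing them can be sketched as follows. This is a toy example under stated assumptions, not the prototype's design: the catalog, the metric values (assumed to be normalized to the 0..1 range), and the `recommend` and `overlap` helpers are all hypothetical, and nearest-neighbor search on Euclidean distance is just one simple choice of technique.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical catalog: each book is described by normalized metrics.
CATALOG: Dict[str, Dict[str, float]] = {
    "A": {"lexical_richness": 0.50, "character_density": 0.20},
    "B": {"lexical_richness": 0.52, "character_density": 0.90},
    "C": {"lexical_richness": 0.90, "character_density": 0.25},
    "D": {"lexical_richness": 0.55, "character_density": 0.30},
}


def recommend(catalog: Dict[str, Dict[str, float]], liked: str,
              metrics: Sequence[str], k: int = 2) -> List[str]:
    """Return the k books closest to `liked` on the chosen metrics."""
    target = catalog[liked]

    def distance(title: str) -> float:
        book = catalog[title]
        return math.sqrt(sum((book[m] - target[m]) ** 2 for m in metrics))

    candidates = [title for title in catalog if title != liked]
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(candidates, key=lambda t: (distance(t), t))[:k]


def overlap(first: Sequence[str], second: Sequence[str]) -> float:
    """Jaccard overlap between two recommendation lists (1.0 = identical)."""
    a, b = set(first), set(second)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


by_style = recommend(CATALOG, "A", ["lexical_richness"])
by_characters = recommend(CATALOG, "A", ["character_density"])
print(by_style, by_characters, overlap(by_style, by_characters))
```

A low overlap means the two recommenders surface different books for the same reader, which is exactly the kind of case an editorial team would then need to judge by hand.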
Data Science is a very complex set of techniques, methodologies and best practices, and even a full year of specialization is just not enough. Now these guys will have to sharpen their skills in industry or academia. But I truly believe that having worked with us during these six months has been useful in going beyond theory.
On our side, now that the first project is done, we need to refine what we ask for next. As usual, a first project is always too broad in scope. But we now have a better idea of what we need, so we can focus on more specific features. As a Product Manager, it is like the feeling after having built an MVP (minimum viable product). Now I better understand what works well, what doesn't, and what clearly needs improvement.
This type of relationship between a company and academia is just one of the many existing ones. Here we do not even try to obtain production code, but important lessons. During my time at Denodo, I saw how this can be a fruitful partnership, but it takes time and interest from both sides. I hope to continue deepening our partnerships with schools in Data Science, in Publishing or in anything that may result in successful projects for both sides. And to the team, the best of luck in the future: they demonstrated they can face a data project and be successful at it.
MBIT team members, from left to right: Israel, Rafael, myself, Jesús and Noel.
Image credits: https://www.flickr.com/photos/jannekestaaks/14390184414/in/