Here I describe the projects I have worked on and my career progress, from my start 25 years ago as a PhD student in statistics until today, including the slow transformation from statistician to data scientist that began more than 20 years ago. This also illustrates many applications of data science, most of which are still active.

Early years

My interest in mathematics started when I was 7 or 8. I remember being fascinated by the powers of 2 in primary school, and later purchasing cheap Russian math books (Mir publisher) translated into French, for my entertainment. In high school, I participated in the mathematical olympiads, and did my own math research during math classes rather than listening to the very boring lessons. When I attended college, I stopped showing up in the classroom altogether - after all, you could just read the syllabus, memorize the material before the exam and regurgitate it at the exam. Fast forward: I ended up with a PhD summa cum laude in (computational) statistics, followed by a joint postdoc in Cambridge (UK) and the National Institute of Statistical Sciences (North Carolina). Just after completing my PhD, I had to do my military service, where I learned old database programming (SQL on DB2) - this helped me get my first job in the corporate world in 1997 (in the US), where SQL was a requirement - and still is today for most data science positions.

My academia years (1988 - 1996)

My major was in Math/Stats at Namur University, and I was exposed between 1988 and 1997 to a number of interesting projects, most being precursors to data science:

At Cambridge University in 1995 (click here to see the names of all these statisticians)

When I moved to the Cambridge University stats lab and then NISS to complete my postdoc (under the supervision of Professor Richard Smith), I worked on:

Note: AnalyticBridge's logo represents the mathematical bridge in Cambridge.

My first years in the corporate world (1996 - 2002)

I was first offered a job at MapQuest, to refine a system that helps car drivers with automated navigation. At that time, the location of the vehicle was not determined by GPS, but by tracking the speed and changes in direction (measured in degrees, as the driver makes a turn). This technique was prone to errors, and that's why they wanted to hire a statistician. But eventually, I decided to work for CNET instead, as they offered a full-time position rather than a consulting role.
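
Estimating position from speed and heading changes is usually called dead reckoning. Below is a minimal Python sketch of the idea, not MapQuest's actual system: the function name `dead_reckon`, the flat x/y coordinates in meters, and the (speed, turn, duration) input format are all assumptions made for illustration.

```python
import math

def dead_reckon(start_x, start_y, start_heading_deg, steps):
    """Estimate a path from (speed_m_per_s, turn_deg, duration_s) readings.

    For each step the vehicle first turns by turn_deg (positive means
    counterclockwise), then travels at constant speed for duration_s.
    Heading 0 points along the positive x axis.
    """
    x, y, heading = start_x, start_y, start_heading_deg
    path = [(x, y)]
    for speed, turn_deg, duration in steps:
        heading = (heading + turn_deg) % 360.0
        distance = speed * duration
        x += distance * math.cos(math.radians(heading))
        y += distance * math.sin(math.radians(heading))
        path.append((x, y))
    return path

# Drive 10 s at 10 m/s along x, turn left 90 degrees, drive 5 s at 10 m/s.
path = dead_reckon(0.0, 0.0, 0.0, [(10.0, 0.0, 10.0), (10.0, 90.0, 5.0)])
final_x, final_y = path[-1]
```

Because each new position is computed from the previous estimate, any measurement error in speed or turn angle carries over into every later position, which is why this kind of system calls for statistical correction.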

I started in 1997 working for CNET, at that time a relatively small digital publisher (they eventually acquired ZDNet). My first project involved designing an alarm system to send automated email to channel managers whenever traffic numbers were too low or too high: a red flag indicated significant under-performance, a bold red flag indicated extreme under-performance. Managers could then trace the dips and spikes to events taking place on the platform, such as a double load of traffic numbers (making the numbers 2x as big as they should be), the web site being down for a couple of hours, a promotion, etc. The alarm system used SAS to predict traffic (time series modeling, with seasonality, and confidence intervals for daily estimates); Perl/CGI to develop it as an API, access databases, and send automated email; Sybase (star schema) to access the traffic database and create a small database of predicted/estimated traffic (to match with real, observed traffic); and of course, cron jobs to run everything automatically, in batch mode, according to a pre-specified schedule - and resume automatically in case of a crash or other failure (e.g. when production of traffic statistics was delayed or needed to be fixed first, due to some glitch). This might be the first time that I created automated data science.
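
The flagging logic of such an alarm can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the original SAS/Perl system: a simple day-of-week mean and standard deviation stands in for the time series model, and the names `weekday_baseline` and `classify` and the thresholds of 2 and 3 standard deviations are hypothetical.

```python
from statistics import mean, stdev

def weekday_baseline(history):
    """history: list of (weekday, page_views), weekday in 0..6.

    Returns {weekday: (mean, standard deviation)} for weekdays that
    have at least two observations.
    """
    by_day = {}
    for weekday, views in history:
        by_day.setdefault(weekday, []).append(views)
    return {d: (mean(v), stdev(v)) for d, v in by_day.items() if len(v) >= 2}

def classify(observed, expected, sd, warn_z=2.0, extreme_z=3.0):
    """Return (level, direction) for one daily traffic count.

    level is 'normal', 'significant' or 'extreme'; direction is 'low'
    or 'high'. The z thresholds are assumed values, not the originals.
    """
    direction = "low" if observed < expected else "high"
    if sd == 0:
        # No variation in the history: any deviation is treated as extreme.
        return ("normal" if observed == expected else "extreme", direction)
    z = abs(observed - expected) / sd
    if z >= extreme_z:
        return ("extreme", direction)
    if z >= warn_z:
        return ("significant", direction)
    return ("normal", direction)

# Made-up Monday traffic for four past weeks.
history = [(0, 1000), (0, 1040), (0, 960), (0, 1000)]
expected, sd = weekday_baseline(history)[0]
double_load = classify(2000, expected, sd)   # numbers loaded twice
small_dip = classify(920, expected, sd)
```

In a batch setup, a scheduled job would run `classify` on each channel's daily count and email the manager whenever the level is not 'normal'.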

Later in 2000, I was involved with market research, business and competitive intelligence. My title was Sr. Statistician. Besides identifying, defining, and designing tracking (measurement) methods for KPIs, here are some of the interesting projects I worked on:

Consulting years (2002 - today)

I worked for various companies - Visa, Wells Fargo, InfoSpace, Looksmart, Microsoft, eBay - sometimes even as a regular employee, but mostly in a consulting capacity. It started with Visa in 2002, after a small stint with a statistical litigation company (William Wecker Associates), where I improved time-to-crime models that were biased because of right censoring in the data (future crimes attached to a gun are not seen yet - this was an analysis in connection with the gun manufacturers lawsuit).

At Visa, I developed multivariate features for credit card fraud detection in real time, especially single-ping fraud, working on data sets with 50 million transactions - too big for SAS to handle at that time (a SAS sort would crash), and that's when I first developed Hadoop-like systems (nowadays, a SAS sort can very easily handle 50 million rows without visible Map-Reduce technology). Most importantly, I used Perl, associative arrays and hash tables to process hundreds of feature combinations (to detect the best one based on some lift metric), while SAS would - at that time - process one feature combination over the whole weekend. Hash tables were used to store millions of bins, so an important part of the project was data binning and doing it right: too many bins result in a need for intensive Hadoop-like programming, too few result in a lack of accuracy or predictability. That's when I came up with the concepts of hidden decision trees, predictive power of a feature, and testing a large number of feature combinations simultaneously. This is much better explained in my book, pages 225-228 and pages 153-158.
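
The hash-table approach can be illustrated with a short Python sketch, where a dictionary plays the role of the Perl associative array. Everything specific here is an assumption made for illustration: the binning rules, the feature names, the sample data, and the definition of lift as the fraud rate within a bin combination divided by the overall fraud rate (the text above only says "some lift metric").

```python
from collections import defaultdict
from itertools import combinations

def bin_amount(amount):
    """Hypothetical binning rule: coarse dollar ranges."""
    if amount < 10:
        return "lt10"
    if amount < 100:
        return "10to100"
    return "ge100"

def bin_hour(hour):
    """Hypothetical binning rule: night is before 6 am."""
    return "night" if hour < 6 else "day"

def lift_by_combination(rows, features, size=2, min_count=1):
    """Score every combination of `size` features in one pass over rows.

    rows: list of dicts holding binned feature values and a boolean 'fraud'.
    Returns {(feature_names, bin_values): lift}, where lift is the fraud
    rate for that key divided by the overall fraud rate (assumed metric).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [transactions, frauds]
    combos = list(combinations(features, size))
    total = frauds = 0
    for row in rows:
        total += 1
        frauds += row["fraud"]
        for combo in combos:
            key = (combo, tuple(row[f] for f in combo))
            counts[key][0] += 1
            counts[key][1] += row["fraud"]
    if frauds == 0:
        return {}
    base_rate = frauds / total
    return {key: (f / n) / base_rate
            for key, (n, f) in counts.items() if n >= min_count}

# Made-up transactions: (amount, hour, country, fraud).
raw = [(5.0, 2, "US", True), (7.5, 3, "US", True),
       (4.0, 14, "US", False), (250.0, 15, "FR", False)]
rows = [{"amount": bin_amount(a), "hour": bin_hour(h),
         "country": c, "fraud": f} for a, h, c, f in raw]
lifts = lift_by_combination(rows, ["amount", "hour", "country"])
best_key = max(lifts, key=lifts.get)
```

All feature combinations are counted during a single scan of the data, which is the point of using a hash table. The number of keys grows with the number of bins per feature, hence the trade-off between too many and too few bins.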

After Visa, I worked at Wells Fargo, and my main contribution was to find that all our analyses were based on wrong data. It had been wrong for a long time without anyone noticing, well before I joined this project: Tealeaf sessions spanning across multiple servers were broken into small sessions (we discovered it by simulating our own sessions and looking at what showed up in the log files the next day), making it impossible to really track user activity. Of course we fixed the problem. The purpose here was to make user navigation easier, and to identify when a user is ready for cross-selling, and which products should be presented to him/her based on history.

So I moved away from the Internet, to finance and fraud detection. But I came back to the Internet around 2005, this time to focus on traffic quality, click fraud, taxonomy creation, and optimizing bids on Google keywords - projects that require text mining and NLP (natural language processing) expertise. My most recent consulting years involved the following projects:

During these years, I also created my first start-up to score Internet traffic (raising $6 million in funding) and produced a few patents.

Today

As the co-founder of DataScienceCentral, I am also the data scientist on board, optimizing our email blasts and traffic growth with a mix of paid and organic traffic as well as various data science hacks. I also optimize client campaigns and manage a system of automated feeds for automated content production (see my book page 234). But the most visible part of my activity consists of 

I am also involved in designing APIs and AaaS (Analytics as a Service). I actually wrote my first API in 2002 to sell stock trading signals: read my book, pages 195-208, for details. I was even offered a job at Edison (utility company in Los Angeles) to trade electricity on their behalf. And I also worked on other arbitrage systems, in particular click arbitrage.

Accomplishments

Grew revenue and profits from 5 to 7 digits in less than two years, while maintaining profit margins above 65%. Grew traffic and membership by 300% in two years. Introduced highly innovative, efficient, and scalable advertising products for our clients. DataScienceCentral is an entirely self-funded, lean startup with no debt, and no payroll (our biggest expense on gross revenue is taxes). I used state-of-the-art growth science techniques to outperform competition.

Publications, Conferences

Selected Data Science Articles: Click here to access the list. 

Refereed Publications

You can follow me on ResearchGate to check out my research activities.

Other Selected Publications

Conference and Seminars
