Yes. That is right. I can predict the future and I can read your mind.

Actually, I can’t… but Markov can.

To show it, we put together a data analysis exercise and a prototype.

To try it, just type a sentence in English into the form at the top. The prototype suggests a list of words you might type next. Select one of the words, or just keep typing.

On the right you will see a distribution of the alternatives we found so far. The (exponential) distribution is a visual description of how certain we are about each of your possible future choices.

This is a demonstration of the use of Markov chains for the prediction of sequenced events. This type of prediction assumes memorylessness; in other words, the probability of an event occurring at time t is affected only by the event occurring at time t-1.

This technique is used in several domains, such as genetic sequencing, speech recognition (this notable example here) and my favorite: predicting the next price tick in financial time series and random walks.

If you are looking for details, the data analysis behind this prototype used text sampled from blogs, news and Twitter to calculate frequency distributions for bi-grams, tri-grams and so forth.
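To make the idea concrete, here is a minimal sketch of predicting the next word from bigram counts. This is Python and it is not the prototype’s actual code (which lives in the repository linked further down); the toy corpus and function names are made up for illustration:

from collections import Counter, defaultdict

def build_bigram_model(tokens):
    # Count how often each word follows each other word.
    model = defaultdict(Counter)
    for prev_word, next_word in zip(tokens, tokens[1:]):
        model[prev_word][next_word] += 1
    return model

def predict_next(model, word, k=3):
    # Return the k most likely next words with their relative frequencies.
    counts = model[word]
    total = sum(counts.values()) or 1
    return [(w, c / total) for w, c in counts.most_common(k)]

# Toy usage; the real analysis counts n-grams over millions of sampled words.
tokens = "the cat sat on the mat and the cat slept".split()
model = build_bigram_model(tokens)
print(predict_next(model, "the"))   # roughly [('cat', 0.67), ('mat', 0.33)]

The same counting generalizes to tri-grams and higher orders by keying the counter on the previous two or more words.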


Playing with the data in the report (millions of words sampled from blogs, news feeds and Twitter): attenuation of densities as the order of the n-grams increases. It sounds fancy but, simply put, it shows that combinations of words become less and less frequent as the size of n in the n-gram increases. Very intuitive.

You can also refer to a set of slides describing the background and the overall structure of this exercise.

If you really, really, really want details, you can fork the github repository and experiment with the source code of the prototype.

Have fun!

This is a question that has been around since the early 20th century, since the first notes of Dow Theory: does technical analysis work? And by “work” I mean: can you get consistent profits using technical analysis?

In this latest publication we answer this question using data, simulations… and science.

If you wonder what technical analysis, or TA, is: you might not know the name, but you have almost certainly seen its proponents on daytime television, staring at the camera, barking buy and sell commands at you and yelling terms like resistance, support, momentum and trend, while pointing at charts of an asset’s price over time with lines drawn all over them.

At first sight it makes about as much sense as black magic or reading tea leaves.

Another common trait of technical analysts is that they usually (if not always) try to convince investors to take the same long or short positions they recommend, which should damage the profitability of their own strategy. There are also TA blogs on the web advertising consistent triple-digit gains in a year. Yes, that is right… if you doubt it, google it. And use your common sense to judge it.

Technical analysis is all about the formalization of visual patterns. Indeed, TAs refer to charts so often that their detractors call them “chartists”.


This is an example of some signals one can derive from charts, taken from an online source that gives you instructions on how to read chart patterns.

Since I mentioned their detractors, let me list them, the way I see it. People trying to figure out how to make money in financial markets using computers are, in one way or another, divided into three sects: technical analysts, fundamentalists and quantitative analysts.

Fundamentalists believe that value is given by intrinsic economic features of the environment associated with the underlying. If an underlying is issued by a corporation, some of those features could be product and market placement, quality of management, competition and culture. If the underlying is issued by a sovereign government: interest rates, geopolitics and public policy. Names like Warren Buffett and Philip Fisher follow fundamentalist strategies.

Quantitative analysts look at statistical properties of price movements, individually or in correlated pairs or groups, trying to predict future price movements.

Lastly, the sect we care about in this study: if we navigate past all the terms they use and get to the basics, technical analysts believe they can predict the future value of a price path by looking at past features of a chart. They believe, for example, that if a price hits something like a resistance line it has a tendency to fall, or, on the other hand, that if a price hits a support line it can go nowhere but up.

So, which sect’s teachings are more profitable? Who should you listen to?

If I put my personal biases aside, I have to say that the answer so far has depended on who you ask: on the personal beliefs and biases of that individual.

Like I said before, with this publication we push aside beliefs and biases and look solely at method and data. We use quantitative techniques to dissect and back-test technical analysis, answering one simple question using science and data: does technical analysis really work?


Scatter plot showing how dampening, load (cost) and sigma affect the profitability (fitness) of a momentum strategy used in technical analysis, one of several insights in the research paper. Refer to the link for all details.

The original bootstrap of this work is a FRACTI notebook dissecting the crown strategy of technical analysts: the momentum strategy. You can follow the analysis step by step there, or you can download and execute it yourself.
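To give a rough feel for what a momentum rule looks like in code, here is a minimal sketch in Python of a simple moving-average crossover signal and a toy profit calculation. This is not the FRACTI notebook itself; the window sizes and the random-walk prices are illustrative assumptions:

import numpy as np

def momentum_signal(prices, fast=10, slow=50):
    # Long (+1) when the fast moving average is above the slow one, short (-1) otherwise.
    prices = np.asarray(prices, dtype=float)
    fast_ma = np.convolve(prices, np.ones(fast) / fast, mode="valid")
    slow_ma = np.convolve(prices, np.ones(slow) / slow, mode="valid")
    n = min(len(fast_ma), len(slow_ma))          # align both series on the most recent part
    return np.where(fast_ma[-n:] > slow_ma[-n:], 1, -1)

# Toy back-test on a random walk; a real study also sweeps parameters and subtracts costs (the "load").
rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(0, 1, 500))
signal = momentum_signal(prices)
returns = np.diff(prices)[-len(signal):]
print("toy P&L:", np.sum(signal[:-1] * returns[1:]))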

From November of 2015 until now we have been having internal discussions at the CCFEA about structure and content, which generated 22 different revisions of this publication.

And now, after a year, the first version is out for public review. That means it’s now your turn. We should be submitting it for external peer review and publication over the next few weeks, and we welcome your insights and ideas. This is the essence of our research, FRACTI: transparency, collaboration, data and method.

I don’t want to give any spoilers; you can skip all the formalization and the math and go straight to the conclusions. But the ending might surprise you…

Enjoy!

Yes. Throw away your dumbbells, fire your overpriced personal trainer and get a data-powered dumbbell instead.

Have you ever imagined a scenario in which your training equipment would play the role of your personal trainer?

People regularly quantify how much of a particular activity they do, but they rarely quantify how well that same activity is performed. More often than not, discerning the quality of a workout requires the specialized supervision of a personal trainer.

This is actually what this whole analysis is all about. We predict how well people exercise based on data produced by accelerometers attached to their belt, forearm, arm, and dumbbell.

The overall quality with which people exercise is given by the “classe” variable in the training set. Classe ‘A’ indicates an exercise performed correctly (all kudos to you, athlete). The other classes indicate common exercising mistakes.

All credit for data collection and the original analysis goes to the Human Activity Recognition (HAR) laboratory, previously detailed in this paper. Credit for the educational notes goes to the Johns Hopkins School of Biostatistics and our valued co-workers.

With a simple and quick fit we were able to get very close to the baseline’s weighted average accuracy of 99.4%. Despite the numerical proximity of the results, we can see that the baseline sits at the upper boundary of this study’s confidence interval.

We were limited in computing resources and time (this analysis was performed from beginning to end in about 3 hours). With more time we could try other ensemble methods for classification, specifically AdaBoost, but that would be beyond the intent and time allocated to this exercise.

If you want a more elaborate analysis, you can either check the original paper or, better yet, refer to a longer version of this study for details on each step of the analysis:

  • First things first, obtaining the data, pre-processing and clean-up: downloading and caching the raw data, and cleaning up the data generated by the electronic devices.
  • “Raw” feature selection: selecting, among the dozens of features, which ones are relevant for classifying an exercise as well or poorly executed. Since we will fit a model that performs feature selection implicitly, this is only a “raw” feature selection.
  • Data partitioning: a 75:25 training:testing split of the data set for training, cross-validation and in/out-of-sample testing.
  • Data imputation: electronic devices generate lots of gaps in the data; NA values are imputed with a k-nearest-neighbors imputation algorithm.
  • Model fitting:
    • Feature selection: we use a random forest for fitting, in which feature selection is performed implicitly (see below).
    • Training: training is performed with a random forest model over 6-fold cross-validation (see the sketch after this list).
    • Feature importance: despite the absence of explicit feature selection, the random forest classification algorithm does keep track of a ranking of how much each feature contributes to the outcome for each class. We call this ranking ‘importance’.
  • Prediction and in/out-of-sample measurements: we measure and track in-sample and out-of-sample error by comparing confusion matrices against the training partition and against a testing partition centered and scaled using the training partition’s metrics.
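For the curious, here is a rough sketch of the pipeline above. The actual analysis is written in R, so this Python/scikit-learn version is only an illustration; the file name and the column handling are assumed rather than taken from the report:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the HAR training data (file name assumed) and keep the numeric sensor columns.
df = pd.read_csv("pml-training.csv")
X = df.select_dtypes("number")
y = df["classe"]

# 75:25 training:testing split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# KNN imputation of NA values, centering/scaling on training metrics, then a random forest.
model = make_pipeline(KNNImputer(n_neighbors=5),
                      StandardScaler(),
                      RandomForestClassifier(n_estimators=200, random_state=42))

print("6-fold CV accuracy:", cross_val_score(model, X_train, y_train, cv=6).mean())
model.fit(X_train, y_train)
print("out-of-sample accuracy:", model.score(X_test, y_test))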

The final results: for this model the accuracy is 99.27% and the out-of-sample error is 0.73%, with a confidence interval of 98.99% to 99.49%. Considering the little time and few computational resources, really good.

This, like most R analyses, relies on the ‘literate programming’ paradigm, which basically means the output of the model is a final paper, or in this case a web page. You can find all the details of each step, including real executable R code, there.

If you want to dig even deeper, understand the details of the model, run it yourself, improve it and compare results, you can check out the GitHub repository where it is graciously made available. You will need R and RStudio for that. The report with the executable analysis is also available.

Have fun!

Yes, Oracle. Sorry. The make-believe situation: we have a huuuge Oracle table, and Oracle says we need to partition it so we can query it. We need to allocate data to partitions in intervals of 7.

We decide to use range-interval partitioning.

create table A (
   id number primary key,
   partition_id number not null
)
partition by range(partition_id) interval (7)
(partition FIRST values less than (0))
enable row movement
parallel;

The catch, though, is that we want the second partition to start at 5, not at 0.

How do we do that?

Change the values less than (0) clause to the first value of the second partition, 5:

create table A (
   id number primary key,
   partition_id number not null
)
partition by range(partition_id) interval (7)
(partition FIRST values less than (5))
enable row movement
parallel;

How many partitions do we have?

select partition_name, high_value
from user_tab_partitions
where table_name = 'A' order by 1

Which yields…

PARTITION_NAME                  HIGH_VALUE
--------------------------------------------
FIRST                           5
SYS_P586996                     12

Let’s test it. First, we add a few rows:

insert into A (id, partition_id) values (2,2);
insert into A (id, partition_id) values (3,3);
insert into A (id, partition_id) values (5,5);
insert into A (id, partition_id) values (10,10);

Where did each row land?

First things first, the partition named FIRST…

select * from A partition (FIRST);
ID                PARTITION_ID
------------------------------
2                 2
3                 3

And what does the second partition hold?

select * from A partition (SYS_P586996);
ID                PARTITION_ID
------------------------------
5                 5
10                10

Exactly what we wanted. You can go back to your painful Oracle life.

The Bank of Finland has this thing they call “the simulator”: it is basically a platform for stress-testing large financial institutions at a systemic level.

Basically the same thing we have here at the Federal Reserve; the difference is that in Finland it is not as opaque. The “simulator” is a computational platform that allows scenario testing through plug-and-play quantitative models leveraging payment and liquidity data from large financial institutions. It is used by several nations outside Finland as well.

They invited me to give a talk about the paper introducing FRACTI that we published about a year ago as part of our research at the CCFEA.

If by any remote chance you find yourself around Helsinki on 8/25 and 26, and want to engage in interesting conversations about modelling of crowd behavior, data-driven predictive models, large-scale simulations and systemic risk, come join us… the first and second rounds of Karhu are on me…

UPDATE — August 30th 2016

I am now back in NYC. An excellent event, very well organized and a great opportunity to delve into several interesting topics. The atmosphere was very friendly, and the parallel events were excellent opportunities for casual and interesting conversations.

Detection of systemic failures by looking into large-scale payment interaction data is a new field, bringing with it several opportunities for interesting research, especially in terms of graph-oriented computation to identify nodes (and relationships) associated with risk-prone entities. Several sophisticated ideas and algorithms were discussed in detail.

Regarding my talk specifically, the most interesting thing to note is actually a common and recurring theme: a number of questions and inquiries related to our “hard science”, quantifiable approach to economics. Many participants and members of the audience view human crowds as a subject that is “impossible to model”.

The same was probably said in the past about other seemingly intractable subjects like quantum physics, space travel, astronomy and speech recognition. Time might give the final answer.

If you are one of the souls out there dealing with real-world cases of finance and data technology, you have been faced with statements from your clients that go somewhat along these lines:

I have this great idea: I think I can collect this data, predict a few things about the future, sell those predictions, make tons of money, get rich, retire.

Nothing wrong with this. It seems like a fair (under)statement (wish?), but let’s entertain its nuances.

The first nuance is that everything is getting more and more complex. Complexity in technology and methods is forcing teams to pack oversimplified ideas into bite-size buzzwords in order to monetize them, unfortunately hiding the appropriate methods and concepts that should be used instead. If you have been a practitioner in finance and technology for a while you might have noticed that this is nothing new, but it is getting worse, to the point of compromising the benefit of high-end technology itself.

The second nuance is that chances are your client’s statement comes sprinkled with the words “data”, “machine learning” and “prediction” here and there. In fact, those words are so common that they are losing meaning.

This is what we will write about: the real meaning behind those “buzzwords”. Let’s approach it one word at a time, so we can get some context.

First, data.

Data in itself, without an application domain, is utterly meaningless.

If you are a practitioner, chances are you have been in professional engagements in which you had to burn several precious hours explaining to middle managers (not their fault) that you should first understand the domain of expertise, come up with assumptions and then, only then, look at the data. Never (very important: never) should you try to peek at the data in advance, or, even worse, come up with “synthetic” data in any form: from other domains, downloaded from the web, randomized, or any other seemingly innovative idea. Unless, of course, you want to fool yourself, your client and your investors, which is usually not a good idea. Don’t do it.

The correct approach to “data” is not just “data”, but “data” AND “science”, in which, paradoxically, the keyword is science, not data.

Despite the Google-centered “end of science” movement, in which smart people say that as long as you have data you can figure out pretty much anything, by now (late 2015) there is enough evidence out there to say that the data-centered movement, if anything, over-emphasizes our biases and, with that, kills any innovation or predictive insight.

Forget data. Concentrate on science. Keep doing it the same way our grandparents did: observation, ingenuity, hypothesis, test… and repeat. Of course, unlike our grandparents, now with powerful computers and terabytes of data; that is the beauty of it.

Second, machine learning.

Data science is the confluence of hacking, statistics and domain expertise. Here, one image is worth a thousand words:


Data Science = Hacking + Statistical Inference + Domain Expertise

When you lack any insight into the latter, you are left with “machine learning”, the light green chunk in the center of the diagram.

In other words, the ability to hammer things out of computers, applying some statistical insight and no domain expertise: this is “machine learning”. Nothing less, nothing more. Cool, but limited.

Last but not least: Futurology.

Why all of this? The urge to control everything. It is cast into our DNA. Uncertainty makes us feel we are not in control. We do not fare well without control.

The lack of control makes humans scared. We do not like “scared”. The urge to be in control and know what is coming has been with us since the time of druids and witches.

In bygone times we used to call it magic. I like that connotation. I still like to call it “black magic” for the sake of it, and also because it sounds and fits well:

Any sufficiently advanced technology is indistinguishable from magic (a.k.a. Clarke’s Third Law) — Arthur C. Clarke.

And data science is indeed sufficiently advanced. We have thousands of predictive models and computer applications, millions of domains of expertise. This yields literally billions of combinations of possible techniques and approaches for every single problem.

The main skill of a data scientist is to be able to understand a domain, to select, among the billions of options out there, what fits, to define a pattern, and then to use that pattern to win over our biases (yes, we all have them). And, lastly, to seek the truth.

Biases hide the truth. Your bias is your worst enemy. And forget “optimal”: measure often, adjust often, and always remember that “perfect” is the enemy of “good enough”. “Good enough” is good.

Conclusion.

Modern problems are complex and require a different way of thinking, and organizations should be re-tooled to that end. You cannot approach complex subjects by “problem hammering” your way through, and unfortunately that is the only approach the majority of organizations are wired to handle.

Most organizations deal with problems by assuming all of them can be “hammered” to resolution: dividing problems into tasks, assigning those tasks to “JIRAs” or anything similar, tracking linear time, and throwing layers of management and bodies at them.

Modern problems are so complex that their components are utterly unknown, in the sense that formalization, cause, effect and resolution are subject to a process of iterative search.

The resolution of this distinct class of problems requires a different class of thinking, in which the problem itself is shaped as you move through iterations and real, open collaboration is required.

Most companies (and, yes, leaders) nowadays are not ready for “problem search”, the scientific approach to resolving problems: using ingenuity and openly “searching” for the exact definition of the problem itself.

Science is not dead; very much the contrary. Computers and data are nothing without scientific curiosity and human ingenuity. Forget buzzwords and look at problems and solutions the way polymaths approached them in the past, using the same ingenuity, adding a good dose of collaboration and controlled computing power as a handicap for our comparative lack of individual brilliance.

Languages are more than communication. They are often one’s window to reality.

Your language shapes how you think, what you can achieve, and how you achieve it. Some languages facilitate the concepts of a domain of knowledge; others make them more obscure. You might use French for philosophy, or German for poetry. Using them the other way around might force you to be more verbose. Using the wrong language can even impede the expression of your ideas.

Language defines reality. This is the case not only with natural languages, but also with computer languages.

Like natural languages, computer languages often grow from the needs of specialized domains, and are therefore better suited to the use cases relevant to that specific domain. In the past, computer languages were born and bred in a specific domain, frozen to the requirements of that domain at that specific point in time. When the requirements of that domain evolved to follow the increasing complexity of the problems at hand, the language would no longer fit.

In modern times, computer languages must be dynamic, quasi-living things, able to evolve and adapt to solve new classes of problems and to fit new computing environments. Modern problems are different from what we had to deal with a few years back; you must have adequate tools and methods in order to approach them properly. In the same way, computational environments change in the face of new demands and new hardware technologies: single to multiple cores, cloud, cluster and grid computing.

The way in which you describe to a binary being how to resolve a problem plays a very special role. This role is tied to the concept of representability. The effectiveness of your representation is limited by the features of your language, your familiarity with the specific domain knowledge and your experience, i.e. the thinking patterns you have used when approaching previous problems in that domain.

“A good notation has a subtlety and suggestiveness which at times makes it almost seem like a live teacher.” Bertrand Russell, The World of Mathematics (1956).

If you zoom in on the specialized domain of our interest, computational finance, and look at the problems we had to approach in the past and the patterns we used to resolve them, we can list a number of important features our language (and environment) will have to support:

  • Responsiveness: deterministic response time is critical in common computational finance use cases. As harsh as this may sound, the fact that you can keep your response time under a few dozen microseconds 99.99% of the time is irrelevant if you once took a few seconds to decide what to do while waiting on a garbage collection. Even if it happened just that once, you wiped out all your hard-earned profits of the day.
  • Adequate representation of data structures: plain old data structures have to be represented properly. It is hard to believe that several widespread programming platforms still have problems properly representing data structures introduced in CS 101 curricula, such as contiguous arrays and sparse vectors. In computational finance we also care about very specific abstractions, like proper representations of time series and currencies.
  • Functional-vectorization friendly: the representation of data structures must be able to leverage the vectorial nature of modern computer architectures through lambda functors. Functional support is crucial.
  • Simplified concurrency through continuations: continuations, or co-routines, are probably the simplest and most abstract way to leverage concurrency. You can leverage streams, vectors and parallelism using simple patterns, with no shared-state synchronization required (see the sketch after this list).
  • Interactive: support for an interactive command line for preliminary brainstorming, prototyping and testing. Being able to record, share and story-tell the resolution of a problem is very important. The record must support rich representations (plots, tables, structured formatting, etc.); the more, the better. Communication and collaboration are critical, and your representation cannot ignore that. The hardcore problems of our times cannot be solved without proper and organic collaboration. Your representation must be collaboration-friendly.
  • Mini-representations: notation matters in any representation. Domains have specific ways to represent concepts, and your representation has to be flexible enough to adhere to the use cases of that domain. Mini-representations are used here in the same sense as mini-languages, also called little languages or domain-specific languages (DSLs): ways to leverage a host language for meta-representation. In other words, you could “override” a language’s tokens to represent a language appropriate for, say, streaming or behavior.
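As a rough illustration of the continuation-style streaming pattern the list points to, here is a minimal sketch in Python using generators as coroutine-like stream stages. It is a generic example, not tied to any particular platform; the tick values and the crossing level are made up:

from collections import deque

def ticks(prices):
    # Source stage: emit raw price ticks one at a time.
    yield from prices

def moving_average(stream, window=3):
    # Transform stage: emit the rolling mean over the last `window` ticks.
    buf = deque(maxlen=window)
    for price in stream:
        buf.append(price)
        yield sum(buf) / len(buf)

def upward_crossings(stream, level):
    # Final stage: emit True whenever the averaged series crosses `level` upward.
    previous = None
    for value in stream:
        yield previous is not None and previous < level <= value
        previous = value

# Usage: compose the stages like a small pipeline "language"; no shared state, no locks.
prices = [10.0, 10.2, 10.1, 10.6, 10.9, 10.8]
pipeline = upward_crossings(moving_average(ticks(prices), window=3), level=10.4)
print(list(pipeline))

Each stage only pulls from the previous one, so the composition itself reads like a mini-representation of the data flow.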

These preferences are personal (and of course biased), limited by my own experience of the patterns that seem to work best when solving practical problems in computational finance.

As our research goes on, it seems the major missing piece is a proper representation of financial models: which “language” properly represents them across all use cases, such as risk, trading, simulation and back-testing? The search continues.