Matteo Marsili

Complex systems such as a cell, a financial market or a society exhibit non-trivial behavior. Often, empirical data obey a scale-free distribution in the frequency of observations or, more specifically, Zipf’s law, which states that the frequency of the k-th most frequent observation is proportional to 1/k. Mora and Bialek translated this observation into the language of statistical mechanics, noting that systems exhibiting this behavior are akin to critical systems, i.e. systems close to a phase transition in physics. Since a critical point is a very special point, this raises the question of what universal mechanism may be responsible for self-organization to it.

On the theoretical side, complex systems can be regarded as systems with many degrees of freedom that perform a function (i.e. optimize a goal function). However, models can take into account only a few variables and the interactions among them, so they necessarily neglect unknown unknowns. This raises a number of issues:

i) How can one choose the relevant variables, and how many should there be?

ii) Under what conditions can the predictions of models match the system’s behavior?

On the empirical side, one typically faces two problems:

i) data are noisy and ii) data most often under-sample the space of possible states.

A convenient strategy for addressing both problems is dimensional reduction (e.g. data clustering). Different methods, however, provide different results. Can one measure the information content of different methods and compare them? What is the optimal level of detail (i.e. the number of clusters)? After a brief (and biased) review of the problem, we discuss these issues in a simple framework inspired by maximum entropy considerations. Our arguments suggest that the under-sampling regime can be distinguished from the regime where the sample becomes informative about the system’s behavior. In the under-sampling regime, the most informative frequency distributions have power-law behavior, and Zipf’s law emerges at the crossover between the under-sampling regime and the regime where the sample contains enough statistics to make inferences about the behavior of the system. These ideas are illustrated in some applications, showing that they can be used to identify relevant variables or to select the most informative representations of data, e.g. in data clustering.

Preprint available at http://arxiv.org/abs/1301.3622
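As a concrete reading of Zipf’s law as stated in the abstract, the following minimal Python sketch builds the rank-frequency counts of a sample and estimates the exponent by a least-squares fit in log-log scale. The function names `rank_frequency` and `zipf_exponent` and the synthetic sample are illustrative assumptions, not part of the talk or the preprint, and a log-log fit is only a crude check rather than the inference framework discussed by the speaker.

```python
from collections import Counter
import math


def rank_frequency(samples):
    """Return the observation counts sorted from most to least frequent."""
    return sorted(Counter(samples).values(), reverse=True)


def zipf_exponent(counts):
    """Estimate alpha in count(k) ~ k**(-alpha) by least squares in log-log scale.

    Hypothetical helper for illustration; needs at least two distinct states.
    A value close to 1 is consistent with Zipf's law.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Synthetic sample (assumption): state k is observed about 1200 / k times.
samples = [k for k in range(1, 51) for _ in range(round(1200 / k))]
counts = rank_frequency(samples)
alpha = zipf_exponent(counts)
print(f"estimated Zipf exponent: {alpha:.3f}")
```

On real data the low-frequency tail is dominated by sampling noise, so the fitted exponent should be read as indicative only.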

This talk will take place on Wednesday, March 6 at 2 pm in the Federman lecture room, 1st floor, Pab. 1, Departamento de Física, Ciudad Universitaria.