Sunday, April 04, 2010

MultiDimensional Scaling (MDS) with Solver

This file creates the Solver model for multidimensional scaling (see a previous post on MDS in Excel/VBA). The application calculates MDS for the matrix from McCormick’s BEA article (see the previous post on BEA in Excel/VBA).

PROXIMUS with Solver

This file creates an approximation of a matrix (using PROXIMUS – see a previous post on Koyutürk’s PROXIMUS) using Solver.

McCormick’s Bond Energy Algorithm (BEA) in Excel/VBA

This file contains an implementation of a classic seriation algorithm by McCormick, Schweitzer and White (Problem Decomposition and Data Reorganization by a Clustering Technique) – bond energy algorithm – in Excel/VBA. This implementation is very similar to the fortran code on the S-Plus/R page of Fionn Murtagh’s. Additionally, an article by Arabie, Schleutermann, Daws and Hubert (Marketing Applications of Sequencing and Partitioning of Nonsymmetric and/or Two-Mode Matrices), that uses multiple iterations, is examined. Some modifications to this algorithm are considered.

Approximating matrices with PROXIMUS in Java

This file is an application of Mehmet Koyutürk’s algorithm PROXIMUS – an efficient framework for error-bounded compression of high-dimensional discrete attributed datasets. As sources for information on PROXIMUS, on the internet, seem to be limited, the best is perhaps the C source code of the R PROXIMUS package.

Sometimes, it would just be better to switch to VBA :)

This file contains vehicle routing paths (a solution to the two truck problem a few posts back), as a Trendline’s Microsoft Excel Solver optimization result. The problem (in the two left red columns) is that the two paths have become mixed. One path would have to take us: Tallinn-Märjamaa-Riidaku-Tallinn and the other: Tallinn-Pärnu-Valgerand-Tallinn, as determined by the ones and zeros in the solution. But the six ones determining the paths reside in the same column and have to be separated.
The set of Excel relationships is visualized by the formula auditing tool. The final result is in the right side of the spreadsheet, where the two red columns have three ones each.

A Solver model for a transportation problem that finds a location for an additional factory – using 625 changing cells :(

In this file a location for an additional factory has to be determined. It cannot be built on an existing location, goods can only be moved in vertical/horizontal directions, the distance between adjacent cells being 1. Total distance has to be minimized.

A transportation model with two trucks in Solver

This model uses flow balance constraints as well as Boolean variables.

Tuesday, March 30, 2010

Estonian peasants forming Bucket Brigades

Java applet demo

This Scratch file demonstrates the use of bucket brigades – a way of organizing workers (e.g. on an assembly line) so that the line balances itself. This algorithm from Bartholdi’s materials shows that Scratch can be used for algorithmic purposes. We start our Informatics II – Visual Basic for Applications (VBA) in Excel – class by introducing programming concepts in Scratch, this way mitigating the hardships of learning IDE, syntax and programming concepts simultaneously. 

Spacefilling curves approximating TSP in Java

This file contains an algorithm that calculates a spacefilling curve – a means for mapping a high-dimensional space in a low-dimensional space. Bartholdi’s material uses Sierpinski’s curve and the mapping is a fast approximation that can be used in solving traveling salesman problem, in seriation, etc. Next to the Java code, a test set from Traveling Salesman Problem website is imported to Excel, and a spacefilling curve solution is found. As expected this fast approximation is ca. 40% inferior to the optimal solution.

The file, thus, contains a presentation, Java code and Excel/VBA program. The way of creating spacefilling curve itself is suboptimal, as I decided to read only the first paragraph of the algorithm and to try to devise the solution (on the picture above) myself.

A heuristic solution for a knapsack problem: using Excel’s VBA to optimize a sequence of Solver problems

Suppose we have to bill our business partner in the total amount of $500 000, but it’s our policy, that one invoice contains no more than $50 000 worth of records. This could be stated to be a kind of a curious knapsack problem. At first, the model is run and an initial invoice is created – worth of up to $50 000 –, followed by the removal of those records from the dataset. Then the model is run again, the chosen records are once again removed and so on – until we have created a minimal number of invoices, all filled to the maximum possible capacity. This is, though, stated with notion, that we are actually only using a heuristic method: each time the $50 000 solution is found to the remaining problem, and this is all that is taken into account. Ideally, we should look at the entire picture at once and find the global optimum of record allocation. This model was created for my optimization modeling class.

Multinational Corporations practice transfer pricing: lets try to fit that into a Solver/Excel network optimization model

The foundation of this Solver model is a classical operations research problem on network optimization. There is a certain capacity in the three “producing” countries, which can’t be exceeded; and the demand sets the upper boundary for all product flows into the four “consuming” countries. We know the production costs as well as the selling prices across the countries. All that in a classical model would help us establish the maximal profit. Regarding the MNC’s, though, we also take into account the corporate income taxes and the fact that an MNC has, either sales or production, subsidiaries in all countries.

If an MNC plans to produce in country 1 and sell in country 4, it would check which one has a lower corporate tax. If the consuming country is more competitive, the subsidiary of country 1 will sell the product to the subsidiary of country 4 at zero profit margin. The subsidiary of country 4 will make the sale to the customer at market prices, thus obtaining all profits. The profits are thus paid in the country with lower taxes.

If, on the other hand, the producing country has lower corporate income taxes, the producing subsidiary sells the product to the sales subsidiary – at full market value. The sales subsidiary sells the product to the final customer at market price as well, making no profit. All the profit is accrued in the producing country – with the MNC once again being able to choose the lower tax rate. This transfer pricing model was created for my optimization modeling class.

For business students: Monte Carlo method demystified with Excel/VBA

As I did my bachelor’s degree in the faculty of Economics and Business Administration of the University of Tartu, the topic of Monte Carlo simulations surfaced a number of times. Looking back, the simplicity of this method, applied in many classes, seems to have been rather unrevealed. As in our university VBA is taught in all faculties, we have taken a look at this business problem, the solution of which at its most involved part comprises of an If embedded inside two For-s. The key to the Excel part of the application logic is calculating the cumulative distribution function from probability density function.

Another tentative candidate for the Excel Gurus Gone Wild: Do the Impossible with Microsoft Excel book: Convert a two-dimensional table/matrix into a column or row vector and vice versa :)

In spreadsheet modeling one often has to convert a two-dimensional table into a row or column vector and vice versa. This Excel file does just that by using the following formulas: =INDEX($C$17:$N$28;IF(MOD(P3;12)=0;12;MOD(P3;12));IF(P3/12=INT(P3/12);P3/12;INT(P3/12)+1)) and =INDEX($R$3:$R$146;MATCH($B31&"-"&C$30;$Q$3:$Q$146;0)).

The former formula is actually an involved version of =INDEX($C$17:$N$28;MOD(P3;12);INT(P3/12)+1) that fixes the last row problem.

This is, for instance, a problem that is tackled in Solver network optimization models, when a two dimensional cost/distance matrix is converted into a flow balance constraint form.

Demos of Excel/VBA graphical capabilities based on NetLogo – a programmable modeling environment for simulating natural and social phenomena –: the world of rock-paper-scissors and the dynamics of bird flocks

This program is a demo of graphical capabilities that Excel has. A famous game of rock-paper-scissors forms the basis of our simulated world. Each species consumes one neighbor and is in turn consumed by another. A new entity of one’s own type is spawned, when food is consumed.

There is a need for maintaining a dynamic balance – if one was to eat all of its food, only two species would be present, and the focal species itself would be driven out as it’s the weaker of the two.

There are a few specific procedures created for this program: touching(object 1, object 2) checks whether the two objects touch (this sub program can also be used to check whether one block of cells is within another block of cells). Pause (s) pauses the program for a period of time, also updating the graphical display. Arrays are used for storing graphical objects.

Another example from Netlogo models shows the dynamics of bird flocks. Bird flocks is the one program here that has been created by my TTU students, not me. These students come from the same high school I once attended – Tallinna Reaalkool.

Kohonen’s Self Organizing Maps in Excel/VBA, applied for reducing dimensions of colors and of financial ratios from Google Finance

Kohonen’s Self Organizing Maps (SOM) is a type of artificial neural network that is trained using unsupervised learning. The number of dimensions is effectively reduced as a two-dimensional, discrete representation of high-dimensional data is produced. A neighborhood function is used as the topological properties of the input space are preserved.

In this file, the output space is visualized in a 40x40 block of Excel cells that are colored appropriately. If you would like to take a look at a good description of the algorithm, go to AIJunkie’s website.

The second file, chooses three financial indicators – ROE, Debt-to-Equity and Market Capitalization – and once again depicts this three-dimensional data on a two-dimensional SOM. The ratios are from telecom sector of Google Finance and are (0,1) normalized beforehand. As the initial plot is “dominated” by the three biggest companies – for which a substantial portion of the entire space is reserved (the market capitalization ratio is extremely right skewed) – a logarithm transformation is later undertaken. The colors of the graph are unappealing, because the ratio vectors are used as RGB components of Excel cell colors :).

Principal Component Analysis of a U.S. city distance matrix in Excel

Principal Component Analysis (PCA) transforms a number of correlated variables into a smaller number of uncorrelated variables – the principal components. This way, the high-dimensional data set can once again be depicted on a low(two)-dimensional graph.

PCA is performed by calculating the covariance matrix for the initial data (each dimension of initial data must have zero means – if needed the average of the respective dimension is first subtracted from each data item). Next, eigenvectors of covariance matrix are calculated. The proportion of eigenvalues shows the importance of each new principal component dimension. A suitable number of eigenvectors is chosen, usually two. The initial data matrix is multiplied with the eigenvectors, thus obtaining a two- dimensional representation of data. If one was to calculate the covariance matrix for the new representation of data, this would be diagonalized – which means that each new axis would have maximal covariance with itself (in other words variance) and no covariance with the other dimensions.

A document from Salk Institute presents a very thorough description of PCA, together with the full derivation of the algorithm.

In our file, a symmetric 11x11 matrix of distances between the cities of U.S. is taken. The first two principal components are calculated from this 11 dimensional data set – depicting 80% of the variance in the initial data. The picture above shows the two-dimensional map that was constructed. Eigenvalues and eigenvectors were calculated with add-in MATRIX 2.3 - Matrix and Linear Algebra functions for EXCEL that is freely downloadable.

A primitive try at testing graph isomorphism in Excel/VBA

When we consider an adjacency matrix of a graph, each entry shows the presence of an arc between the nodes. Taking the matrix to the power of two, an entry shows the number of length two connections between the nodes considered. Similarly, the adjacency matrix to the power of three shows the number of length three connections, etc.

In this program a three-dimensional array is constructed, with the initial matrix at the first position of the third dimension. Next the matrix is taken to subsequent powers, each being stored at respective positions of the third dimension. Thus, if we take position 2-3 of the adjacency matrix and look at its powers on the third dimension, we will see how this node is connected to the rest.

The program takes the graph up to the power of the number of its elements. For each adjacency matrix element the sequence of its successive powers is then written out as a text string and the strings are sorted in an ascending order. The idea is that for two isomorphic graphs, the results would be the same.

This, of course, is a very primitive try at testing the isomorphism of graphs, as in the entire computational complexity theory this is one of only two problems in existence, the complexity of which is neither polynomial time nor NP-complete. 

A set of pictures that I use for explaining Singular Value Decomposition (SVD)

According to finite-dimensional spectral theorem, a symmetric matrix has orthonormal eigenvectors, thus Q*QT=I and QT=Q-1. Let us take Q1 as eigenvectors of a symmetric matrix formed by A*AT; and Q2 as eigenvectors of a symmetric matrix formed by AT*A. According to the (not too involved) proof, SVD decomposes matrix A in a following way: A = Q1*S*Q2T, with eigenvalues being stored in S. As eigenvalues are stored in descending order, the matrix is now effectively multilayered (see pictures above), with a possibility of “peeling” off the more important vectors at first and discarding the less influential information from the latter part of decomposition.

Using Singular Value Decomposition (SVD): Excel/VBA application that reads a BMP file and compresses it in a lossy way

(See the post on SVD for keywords.) This application begins by reading a BMP file – BMP format is chosen as pictures therein are stored in a raw format, pixel by pixel, with no transformation undertaken –, and then displaying the picture in Excel with each cell being shaded respectively.

Next, SVD is undertaken (once again using Excel add-in MATRIX 2.3 - Matrix and Linear Algebra functions for EXCEL that is freely downloadable). Analysis of resultant eigenvalues shows that over 80% of the information is contained in the first 41 eigenvectors. Whereas the initial picture had information worth of 154*181=27874 bytes, the first 41 vectors of Q1 and Q2 and the first 41 eigenvalues are contained in 154*41+181*41+41*41=15416 bytes, resulting in 44,6% compression. The quality you can judge for yourselves :). The picture has thus been compressed in a lossy way; additionally the method could be used for transferring the picture over a network, peel by peel, until the final fine-grained view is obtained.

Correspondence Analysis in Excel: the semantic differential of words according to Enron email dataset.

(See the post on Singular Value Decomposition for keywords.) After the bankruptcy of Enron, a $100 billion sales, 22 000 employee company, the Federal Energy Regulatory Commission made public and posted on the web, the data set containing 500 000 messages between the 150 top executives of the company – a real treat for the people doing data mining and visualization.

Here, I have created a VBA program that traverses the 150 folders containing the emails and constructs a contingency table. This two mode table shows how frequently the 150 executives used the 1000 most common words in English language.

Next the process of correspondence analysis is undertaken in Excel. Correspondence analysis displays both the rows and the columns of a two-way contingency table in one low-dimensional space. It calculates the Chi-square statistic divided by n (called Total Inertia) for the table and performs its Singular Value Decomposition. In order to the calculate coordinates for rows and columns, Q1 and Q2 as well as row and column sums are used. The theory is documented in the book Correspondence Analysis in Practice by Michael Greenacre. In our case the first two dimensions will contain about 50% of the total variance.

More interestingly, a similar process is undertaken in the second file. Here, from amongst the 1000 most common words in English, twelve are chosen – which according to author’s perception have either a strong positive or negative connotation. The results of correspondence analysis display the semantic differential that these words really do have – as words are ranging from those that have a strong negative meaning (on the left side of the primary axis) to those with a strong positive one (on the right side).

MultiDimensional Scaling (MDS) in Excel/VBA

Information visualization technique multidimensional scaling (MDS) has its roots in the field of psychometrics and is used for exploring similarity/dissimilarity data. The case in point is once again the city distance matrix for the U.S. (see the post on PCA).

As a result a two-dimensional representation of data is constructed, where each Euclidean distance approximates the corresponding dissimilarity/distance in the original matrix. There is an objective function (called stress/Kruskal’s stress/energy) that is minimized.

This file is based on methods put forth in the dissertation of Wojciech Basalaj and uses Newton-Raphson method for optimization. There is a problem with convergence, though, as solution seems to get stuck in the local minima. 

Using Singular Value Decomposition (SVD) to factorize data on car properties

This file has data on five cars: Chevrolet Matiz, Chevrolet Evanda, Chevrolet Tacuma, Chevrolet Nubira, Chevrolet Kalos. The properties considered are engine displacement, length, width, cargo volume, number of cylinders, power, fuel consumption and CO2 amount.
SVD reveals that 93,4% and 5,4% of information is contained in the first two singular vectors, respectively – leaving only 1,2% for the rest of the decomposition. The first singular vector can be called general assessment – it shows a basic relationship between all the cars across the entire set of variables. Our main focus is to try to understand the contribution that the second singular vector has. As shown above, the most influential contributions of the second singular vector are in engine displacement, cargo volume and power (kW) – this is the fine-tuning, which shows that the cars can’t simply be described with a rank one matrix.
This is the second file on this blog, that was not originally created by me (the first was the model of bird flocks), but rather replicated and expanded.

Haar wavelet transform in VBA and in Excel

This file does Haar transform in VBA and in Excel.

Inverse of a matrix in Excel/VBA

This file finds the inverse matrix in VBA, using Gauss-Jordan elimination.

PA=LU factorization of a matrix in Excel/VBA

This file performs the PA=LU factorization (LU decomposition with partial pivoting).

Ulam spiral and the prime numbers in Excel/VBA

This file draws the Ulam spiral in Excel and colors the prime numbers.

Stephen Wolfram’s A New Kind of Science: a set of cellular automatas in Excel/VBA

This file(and this) uses VBA or Excel’s formulas to operate Stephen Wolfram’s cellular automatas from his book A New Kind of Science, including Reflector and Alice in the Wonderland’s Cat.

Wednesday, January 10, 2007

Normal distribution of stock's rate of return resulting in a log-normal distribution of stock prices

When we aggregate the returns of the stocks of a market, we can compute its mean return and the corresp7onding standard deviation. Our goal, in this post, is to model the behavior of a set of stocks on a theoretical stock market that has the same statistical indicators. We will use continuous compounding, while calculating stock market returns. As it means that P1 = P0*e^r, where P0 and P1 are stock prices in periods 0 and 1 and r is the rate of return, it follows that r = ln(P1/P0). In the case of stock markets we can assume, that r is normally distributed(see the excel file). This in turn means that stock prices, P1/P0, have a log-normal distribution(see the excel file). If the mean growth of our market is 15% with the standard deviation of 30% - over the simulation's next 250 business days i.e. a year; one day being deltaT = 1/250 = 0.004 - we are going to calculate the stock price for the next day using the following formula. P1 = P0*e^(15%*deltaT+30%*deltaT^0.5*Z), where Z is a variable simulated by us in Excel, with a normal distribution, a mean of 0 and a standard deviation of 1; and deltaT^0.5 is a square root of deltaT, used because price is assumed to be linearly dependent upon the mean and the variance (stdev=var^0.5).

Thus, in this file, we are at first going to simulate 32000 normal distribution numbers with Random Number generator, and calculate the corresponding log-normal distribution values from it. We will analyze the distributions and graph out the corresponding results, affirming the presence of functions expected. Next we are going to simulate 40 theoretical stocks on a market, all having the initial price of 30, the mean growth at 20%, and 30% as the standard deviation - over a two year period. We will graph out the dynamics of stocks, which will result in a diffusion of stock prices.

The idea of this Excel file is from  Simon Benninga's book Financial Modeling.

One further note on this topic would be, that it has been proven that as time of such simulation approaches infinity, the average stock price also approaches infinity, but the probability of default approaches 1. :-)

Saturday, October 28, 2006

C Add-Ins in Excel: Performance Comparison with VBA

As Excel is also applied in order to solve industry level business problems, and as therein the speed will be of the essence, the choice of programming in VBA could become unfeasible. In such cases critical parts can be developed in C, which is the topic of this post.

It is curious to note that while other language choices in Microsoft Visual Studio are well documented, writing an Xll in C – that is, registering it, exchanging data via XLOPER structure and running C API – is much more of an uncharted water.

There still exists some vital material: for instance a thorough book “Excel add-in development in C/C++, Applications in Finance” by Steve Dalton. On the net an example of a guide would be interpolation add-in by JChampion, which also goes over the basic concepts, and might be a better read, than Microsoft’s own material on the net.

The example is as follows. The idea was to test C at its best: to have it make 137 000 000 comparisons of text based on Excel data. (Text comparisons should have the biggest superiority of performance in C, because of how it handles the data type).
The data (which is interesting on its own!) comes from perhaps the biggest publicly available database of owners in Estonia –
the daily list of the owners of the stock of company Tallink, consisting of records on about 16000 entities. The add-in works as follows – it is an array formula that compares two snapshots of data and writes out stock owners, who are present in both periods. No optimization of the algorithm is done – just a plain one by one comparison.

The Microsoft Visual Studio 2005 project with C files and the compiled Xll as well as the ready made files on Tallink are downloadable. I would like to express my sincere gratitude to Jüri Vilipõld, who benchmarked the programs. The results showed, that depending on the computer, the program took about an hour to run if VBA operations accessed Excel cells throughout the operation, programs using VBA arrays trimmed that time down to about 40 seconds, depending on the implementation – one using the direct copying of the entire array and the other not. And the C add-in in turn was about 5.5 times faster, completing in less than 10 seconds. The results, thus, give us a picture of the speed of accessing Excel objects excessively, of using VBA or optimized array copying, and of using C.

S.M. Johnson: production schedules - in VBA

Amongst mathematical algorithms taught to business students one was devised by S.M. Johnson back in 1954. The issue is having two different operations undertaken on products, in succession. A range of products at hand has different lapses of time, for both operations. The goal is to arrange such a schedule, which will minimize the cumulative production time. This program was needed by the Chair of Business Mathematics.

ANN solving XOR in VBA

Here is a feedforward, multiconnectivity, Sigmoid transfer function, backpropagation, multi layer perceptron Artificial Neural Network source code in VBA; that I needed to make for my studies. It will solve the XOR nonlinear separation problem and graph out the learning path.

Traveling Salesman Problem (TSP) in Excel/VBA. The use of graphics in Informatics II class

The most advanced topic that has been addressed in our Informatics II for non-IT students: VBA, class is the Traveling Salesman Problem (in this file). As TSP is NP-complete, as far as its complexity is concerned, problems such as traversing all the 24978 towns of Sweden are solved. Our problem is a very simple one – of passing through 16 cities.

The Waites family project VisualBots has created an Excel Add-In by that name. My thanks go to their project regarding the TSP – from which I borrowed the algorithm that is used. The algorithm used is simulated annealing, which is a heuristic that, with a decreasing probability, chooses a longer total distance – in order to avoid getting stuck in a local minima. The VisualBots is an add-in that gives Excel additional graphical capabilities.

With a bit different accentuation it has also been a goal in our university to rely on Excel’s graphical objects. For years the Informatics II for non-IT students has used those to convey the essentials of programming with VBA. Lately, with the advent of educational programming languages, such as Scratch, our approach has been validated, as courses like Harvard’s CS50 also start with graphical objects now. In the present example graphical objects are complemented with Excel’s charts displaying the dynamics of the solution process.

Gauss-Jordan elimination in Excel/VBA. The use of matrix problems in Informatics II class

In our undergraduate Informatics II: VBA for non-IT students, class we do many computations on matrices – in order to let students reason algorithmically. As we start out, it’s good to show that a simple 12 row algorithm can solve linear equation systems by doing Gauss-Jordan elimination. Examples of subsequent test problems include: finding the number of elements between -3 and +3 amongst those above the main diagonal; finding the average of all positive elements of a matrix.

As a huge variety of such problems can be created, this is a topic that guarantees a thorough and genuine understanding of the code written – compared to other topics of the class, where a bigger proportion of material seems to be memorized, and not so thoroughly understood. This topic also gives a good incentive for streamlining testing – as literally hundreds problems can be handed out as WebCT tests; with students entering the multiple choice (again from amongst hundreds) answers after doing programming in Excel/VBA. In order to be sure of students’ comprehension of the topic, multiple tests have to be completed.

Friday, October 27, 2006

NPV and IRR in Execl/VBA

In our undergraduate Informatics II: VBA for non-IT students, class the financial indicator Net-Present-Value is programmed. This program also includes the counterpart, IRR, the difficulty of which also remains within the scope of this class. By the fourth semester, on which Informatics II is held, students have already taken the basic classes such as finance and statistics, so that no time needs to be used for conveying background information.

As IRR can have multiple values, one would have to start out with an initial guess. We take a value that would ensue from a too good to be true project: a version of our current project in which all the future cash flows occur at once (there is no discounting from the future) – thus purportedly overestimating IRR. We then start decreasing our estimation of IRR and at every step check whether we have arrived at a correct answer.