SPEL Instruction Set Architecture

Points of Compatibility

CPU's inherenatly are very capable, so don't offer many limitations to what can be done, so being compatible with them is easy, though certainly still an important consideration.

Because GPU's currently have the best price to giga-flops ratio, (around $0.08 per Gigaflop as of Jan 2015), they are the most important type of device to be compatible with.

Wheras GPU's are very fast, FPGA's are very maleable, in fact the most maleable, so it good to be compatible with them. To that end, we'll seek SystemC compatibility.

However considering that other important technologies are likely going to eventually replace GPU's as price performance leaders, it is good to keep those possiblities in mind.

Quantum Computers is where we are headed, and they offer many new and wonderful opportunities. They also have a unique way of doing computations, which is most easily emulated in current hardware with reversible circuits.

Reversible circuits may be a step that we may experience once three dimensional processors are commonplace, and before Quantum Computer are as reversible cirucits can lower the amount of heat that is generated.

Mutation compatability is also important for generating algorithms automatically. There is a sweet spot between the simplicity of the language, and the functionality that it gives, where it is both easy to generate correct programs, and they have sufficient ability to accomplish a variety of tasks.

Javascript compatability is important because Javascript is the most portable programming language in the world, as it works on all devices enabled with a modern web browser.

CPU Compatability

Modern CPU's have a wide slew of abilities. One underutilized by a majority of programming languages is many SIMD or vector operations which are provided. So certainly have to have support for vector operations.

GPU compatability

Notably OpenCl does not support goto or jump instructions, thus we are limited to loop instructions for any repetitive activities. OpenCl also tends to inline all functions as far as I understand, so there is no call and return which are jump based. In fact there is no traditional "return" value, as used in much of functional-paradigm programming, instead some of the input parameters are written to in order to make output.

64KB consonant memory
32KB working memory per device
32B of registers per compute unit
32B or 256bit bus width

for SPEL

64KB fits data and complete code size of loaded functions
32KB working memory available for loaded functions
32B is sentence size

memory partition:

32KB of constant memory for code segment 32KB of shared memory for data and bss

FPGA and SystemC compatibility

One notable limitation of SystemC is a lack of floating-point-numbers. there are only fixed point numbers enabled by default, since floating-point-numbers require a large number of circuits to implement.

This somewhat opens the door to exploration of various kinds of partial number representation schemes. though it is safe to say that we should stick to unsigned ints as much as possible, with signed ones possible if necessary.

Another limitation is that FPGA's don't have true loops, as all loops are unrolled and inlined. So it is necessary to have a limited number of iterations for any loop in order to fit on the hardware.

Thus while and until loops aren't viable, only for loops, with well defined iteration and boundaries, so they can be unrolled.

Somewhat of a feature of FPGA programming is that input and output are expressely defined. This is similar to human languages where the dative is the output.

Quantum Computer and Reversible Circuit Compatibility

Fredkin and Toffoli gates are universal gates with which all other gates can be constructed.

Toffoli Gate is a controlled flip of bits, and Fredkin Gate is a control switch of bits. both of them can be supported.

It may also be possible to in source mark something as reversible, in which case could arrive at the input based on the output.

This may save some hassle as if one writes a reversible encoder, then it can likely also decode simply by pushing the bits backwards.

For implementing reversible computing on a functional level it is somewhat similar to dataflow programming, in terms that each function "consumes" the input, however to make it reversible the output has to have enough information to reverse the operations. For-example an addition would modify one of the input values, but leave the other one unmodified, thus making it possible to "undo" the the addition using subtraction.

Mutation Compatability

In order to be useable for AI's to easily evolve algorithms as accomplished by Kory Becker the code at some level has to support mutation. For instance in Kory's Brainplus has few opportunities for parsing errors, so generating proper code is fairly easy. The easiest way of demonstrating it, is by showing that brainplus can be mapped to SPEL. overwrite byte in storage with byte at pointeroverwrite byte at pointer with byte in storage

Brainplus command	meaning	C command
program start	establishes a modifiable array for processing	char array[0x8000] = {0}; char *ptr=array;
>	increment data pointer	++ptr;
<	decrement data pointer	--ptr;
+	increment byte at data pointer	++*ptr;
-	decrement byte at data pointer	--*ptr;
[	if byte at data pointer is zero then jump to matching closing square bracket.	while (*ptr) {
]	if byte at data pointer is non-zero, jump to preceding matching opening square bracket.	}
.	output byte at data pointer	putchar(*ptr);
,	input byte to data pointer	*ptr=getchar();
$
!
0-F	sets value of pointer to multiple of 16
a-z	call function a-z based on location in code
@	exit program, or if in function, return to last position in main program and restore state

Commands

verbs

flip: like NOT, and CNOT this is a Toffoli Gate
switch: like exchange, this is a Fredkin Gate
Add
Subtract
Loop
Label
Multiply
Divide
logical not
logical and
logical and-or
logical exclusive-or
logical shift up
logical shift down

Cases

Conditional: there can be a condition that must be satisfied before the sentence is performed
Accusative: The main object or input the sentence
Dative: The destination or output of the sentence
Instrumental case: an additional input for the sentence

Conjunctions

unconditional then
conditional and
conditional and-or

Data Types

unsigned integer: 8bit, 16bit, 32bit.
Also used for text, such as UTF-8, UTF-16 and UTF-32
signed integer: 8bit, 16bit, 32bit

Extended Types

large unsigned integer

64bit, 128bit, 256bit*

large signed integer

64bit, 128bit, 256bit*

floating point

32bit, 64bit, 128bit, 256bit*

unsigned fixed point

32bit, 64bit, 128bit, 256bit*

signed fixed point

32bit, 64bit, 128bit, 256bit*

*256 bit maybe added later if needed. Those basics will be enough for turing completeness, and maximum compatability. For ease of writing source code can have a call and return, knowing that it will be inlined for both GPU's and FPGA's.

Basics

using SSA static-single-assignment

each warp gets a sentence, or several sentences, those parts of it which are numbers it can solve immediately, those of it which are variables it has to wait on. Once a sentence has been solved, and it's SSA output variable has been filled by a number, other

for purposes of source-code can use compound words for variable names, but for bytecode purposes those compounds will be converted to click-words, so there will only be a single word per variable. There are at least 100's, possibly thousands of SPEL-legal click-words so there should be quite a sufficient amount. though would have to have enough space in order for the full variable to be expressed in the code.

tould also have pointers to the data in the code, though may need something like pointer inactive and pointer active. for instance it could be a variable name when inactive, and a pointer when active.

Chapters

code segment is a chapter. The idea is that any output to global memory would occur at the end of the chapter, and any input from global memory would occur at the begining of a chapter.

For instance a Reduce instruction, at the begining each warp may load an appropriate variable from global memory. But then some of them will have to wait while others compute earlier results. Then the final sentence would write the result to global memory.

However in a map operation, each warp would write to their corresponding global memory. So begining and end refering to time rather than place.

In any case the idea is that there should be no cross-chapter references, except via global memory input at start, or global memory output at end.

Loops

traditionally implemented with jumps for control flow. but can do it by using different variables, ones starting with a lateral click "8", are a loop, and once loop is finished they can be replaced with a dental click "!".

To show that are within a loop, the initial bit of the index can be a 0. With the final line of the loop having an initial bit of 1. Also for blocks which have no dependancies also have an initial index bit of 1.

Pointers to Shared Memory:

a 16bit pointer could address the whole shared memory space, and half the constant space. considering some GPU's don't have constant space, or split constant and shared space, it should be sufficient. CPU's have L1 cache, which is usually around 32KB nowadays, so having 16bit addressing should be more than enough.

It may actually be best to limit ourselved to only the first 32KB to fit in the L1 cache, so that would be 16KB of code or around 512 blocks, with up to 16 instructions per block, though more realistically 8. so approximately 4096 instructions or phrases per chapter, average sentence consists of 3 phrases, so around a thousand sentences per chapter.

For human reading, a chapter is best kept between the 5 minute attention span of a toddler and the 20 minute attention span of a focused adult. Often a rule of thumb used is 10 minutes.

Reading speed for most languages is between 155-213 words per minute, and there are up to 15 words per block, so a human could read about 12 blocks a minute, or a 120 blocks in 10 minutes, or 240 blocks in 20 minutes. So would only need 8KB even for a 20 minute chapter.

currently the encoder and it's tests combined come to around 6 thousand words or 31 minute read. though a bunch of those are compound names of variables and functions, and there are several stages in the program. a 16KB code area would allow for 40 minute chapters if necessary.

can have the code held in constant space, allowing for probably even 32KB of code, and simply have a listing of the variables in the shared memory space.

for instance could give a number for the variable, and a length, could also have a bit index at the start of the code in order to more quickly find where the variable is. that way wouldn't have to do search and replace in multiple locations, once the variable is initialized, it is done. there would also have to be a flag to indicate that a variable has been initialized.

Admitedly this kind of system would be harder to debug than the afformentioned one -- and would be less elegant. However the aforementioned one would require that the sentence have enough space for the whole variable.

Also the variable lookup table version would mean the code doesn't get modified, as modifying the code can be problematic, though sometimes it is good, such as for reflective metaprogramming.

High Level Overview

There can be different sections or parts to the code:

Main Thread: the entry and exit point of the program. Also where all input and output occurs
Worker Thread: These is for functions optimized to run on other CPU cores.
GPU Thread: This is for functions that are optimized for GPU.
FPGA Thread: This optional at this point, but would be for functions that would be implemented on the FPGA

Each level has more restrictions than the previous level, so one of the most restricted level will work on the previous levels.

The main chapter is the only one that will have input and output. The compiler will complain if the main chapter is too large, and suggest making other chapters.

Also each function has limits also. Ideally no function should comprise no more than 60 rows and 60 columns, with the ideal being closer to 56 on both counts, as that is the most stable isotope of iron, and also how many lines fit on a page, and how many characters per line is conducive to quickest reading. So that a maximum of about 52 blocks per function, though SPEL is highly compressed language, so can round down to 48 blocks, -- which is an even number in dozenal and hexadecimal -- and probably would still have slightly more words than would fit on a piece of paper when translated to just about any language other than SPEL's interlingua.

considering that can fit at least 10 functions or pages in a chapter.

Error Reporting

Since except for the main chapter, there is no input or output, there has to be a different way of reporting errors and making assertions. One way, so to have an output variable which will take any error messages, similar to an stderr. These could be channeled up like exceptions until they reach the main chapter.

Exceptions are thrown using apprehensive-mood. try is deliberative-mood. and catch is hypothetical-mood.

Error reporting in this way does have some overhead, so it may be possible to turn off, like assertions, at least for production code.

Also unlike exceptions, it will probably take a bit longer to propogate to the main thread, as there would be a check installed at each output of a function, to see if it has produced an error, if so, then the function itself will exit, forwarding the error, until it is caught somewhere up the stack.

Quotes

The quote is probably the most complicated part of the encoding. It is best for it to come at the begining of the phrase, that way can know how long the rest of it is, and load it all up, before assigning it to a register designated by the case.

The quote starts with the Quote Flag, which is 30. which takes up 5 bits. The following 3 bits gives the length.

unsigned 8 bit quote
16 bit quote
32 bit quote
64 bit quote
128 bit quote
256 bit quote* (maybe later)
signed 8 bit quote
pointer

The first 8 bits are either the 8 bit quote, or they provide more type information. The first bit shows weather it is literal data, or is an address to the data.

literal
address

The next bit indicates if it is a signed or unsigned value

unsigned
signed

The next two bits show width of vector if it is one. otherwise if scalar is point of redundancy and is same width as the width previously pointed to.

1 byte, 8bit (char)
2 byte, 16bit (short)
4 byte, 32bit (int or float)
8 byte, 64bit (long long or double)

the last 4 bits give the named type information

text (ASCII or UTF)
integer
rational number
floating point
fixed point
complex number

An example is some ASCII text "hello world\n", it's width is 12bytes, closest container is 16 bytes, it is a constant, signed value, vector of 8bits, that is text.

If it was in UTF-8, then it would be an unsigned value. UTF-16 wouldn't fit the whole statement in 16bytes, though would be a text vector made of 16bit width scalars.

Internationalization

Probably one of the main points of this programming language is that can write a program in one language, and it becomes available in all other languages.

Taking the example of gettext, simply by having a different kind of quote, such as one that quotes legal SPEL utterances can be translated automatically based on locality.

Syntax

infinintive ni: for verbs
accusative-case ka: for input
dative-case pa: for output, only holds pointers. though it can point to a register, which is assumed to be the first several byte locations of memory. generally for genetic programming it would point to either storage or accusative, so that next operation can function without needing to exchange registers
instrumental-case se: for extra (constant) input
quote: for various input types

Linear Genetic Programming

Core operations

load accusative
load instrumental
load dative
add
subtract
conditional
label or start-codon starts a function
jump to forward label
function end or stop-codon stops a function and returns to main thread

extended operations depending on domain. also could use other library functions. here are some examples of builtins that are provided my assembly languages:

loop to previous label
no operation
multiply
divide
modulo
and
and-or
shift up
shift down
exclusive-or
absolute value
negate
exponent
logarithm
square root
cube root
power
hypoteneuse
sine
arc sine
cosine
arc cos
tangent
arc tangent
round
max
min
truncate
gamma function
random number

and others which may be listed on GCC builtins

The idea is to have only around 20 functions.

Evolutionary Computer-programming

steady constraints

input and conclusion elements
words provided for genetics
information kinds provided
working memory provided
resources use weight of information kinds
resources use weight of words

work constraints

check set of inputs and conclusions
economy coefficient recipe

crowd constraints

208 member can be maximum for crowd
56 member can be begining for crowd
migration luck can be near 2%
fertility rate near 6

evolutionary processing

use steady constraints for create first crowd
different crowd island(s) for each compute unit. each crowd can have different subset of instructions. the number of functions can be between golden-ratio and 1 over golden-ratio, the functions would then be added to the roster at random, so some functions may be there more than once, making them more likely to occur in the crowd population.
per crowd
1. choose two set of random persons.
  enough so those that fail are replaced by children of the heros. 4 per set.
2. account economy coefficient for set
3. account growth luck for crowd. 50-atan(crowd length - 56)*31
4. per person on GPU
  1. evaluate economy of full person
  2. detect of waste and effective expressions
  3. evaluation of fit for effective expressions
  4. return fit and effective expressions
5. championship
  1. detect person from each set that maximum fit
6. produce children
  1. do crossover of random or effective expressions between parents to make children
  2. do mutations, can replace functions with those functions which are available for a crowd. new migrants can add new functions.
  3. replace those that fail with children
  4. can add additional children if growth luck succeeds
  5. if some of the removed persons had unique functions, they can be removed from mutation possiblity for that crowd.
  6. can remove some children if growth luck fails
7. children and-or parents may migrate to other crowd. no copy remains in old crowd of those that migrate.