65GS10

ruud.baltissen_at_abp.nl
Date: 2002-04-17 13:37:09

    Hello everyone,
    
    I am simply forwarding some other emails regarding Gideon's FPGA
    implementation of the 6510. At the end you will find a short history and
    some more ins and outs.
    
    ==============================================================================
    
    |How about the illegal instructions? As I recall previous postings at least
    |that time there were no plans to implement them.
    
    In this version, almost all 6502 illegal opcodes have been implemented, too.
    Most of them result from 'poorly' decoding the instruction, such as LAX
    and SAX. The opcodes ending in $3, $7 and $F (and some ending in $B) also
    act the same as on the 6502, because they just imply a 'wrong' order of the
    internal states to be taken. Some other "illegal" opcodes, like the places
    where STX $nnnn,Y and STY $nnnn,X should have been, were called illegal
    because they didn't work in the original 6502. In this FPGA implementation
    they do work, so in those opcode slots you'll find STX $nnnn,Y and
    STY $nnnn,X, and of course the load variants as well. IMHO it doesn't matter
    that these opcodes act a bit differently from the original 6502, since
    these were not stable.
    
    Anyone who is interested in testing all opcodes: you're welcome. I just
    don't know how to get an FPGA board to you to test them. Maybe I will have a
    few of them made.
    
    
    |Does this implementation also enable weird stuff like putting bytes on the
    |bus at certain times to write to areas normally inaccessible? I mean writes
    |to RAM $00/$01 which is more stable on the later C64s.
    
    I don't have any demos, so if you'd like to have the result of some tests,
    then please send me the 5.25" disk with the demo :)
    
    As far as locations $0 and $1 are concerned: in my implementation the reads
    always come from the local PIO registers, and the writes go to the PIO
    registers but also to the bus, so the rest of the system *does* write those
    bytes into RAM. I am not sure if this is the case with the original 6510.
    Anyway, reading the RAM locations $0 and $1 by using sprite collisions etc.
    doesn't have anything to do with the CPU, since you are reading them through
    the VIC, so that should work.
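
    The read/write split for $0 and $1 described above can be pictured with a
    small model. This is only a sketch: the class and attribute names are
    hypothetical, the power-on values of the PIO registers are assumed to be
    zero, and the real design is FPGA logic, not Python.

```python
class Port6510Model:
    """Toy model of the $00/$01 handling (all names are hypothetical)."""

    def __init__(self):
        self.pio = [0x00, 0x00]        # local PIO registers (assumed reset value 0)
        self.ram = bytearray(65536)    # RAM as seen by the rest of the system

    def write(self, addr, value):
        value &= 0xFF
        if addr in (0x0000, 0x0001):
            self.pio[addr] = value     # the write updates the PIO register...
        self.ram[addr] = value         # ...and also goes out on the bus to RAM

    def read(self, addr):
        if addr in (0x0000, 0x0001):
            return self.pio[addr]      # CPU reads of $00/$01 never see the RAM
        return self.ram[addr]


cpu = Port6510Model()
cpu.write(0x0001, 0x37)    # lands in the PIO register and in RAM
cpu.ram[0x0001] = 0xAA     # something else changes the RAM cell
print(hex(cpu.read(0x0001)), hex(cpu.ram[0x0001]))
```

    In this model the CPU keeps reading its PIO value even after the RAM cell
    underneath has changed, which is why the RAM copy can only be observed
    through another bus master such as the VIC.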
    
    As far as the illegal opcodes are concerned: in my last post I wrote that
    the 'unstable' opcodes of the original 6502 do not work the same on my 6510
    implementation. This is true, but since Nathan pointed out that only 2 of
    them were actually unstable, I have to broaden this definition a bit. In
    this 6510, the ones that had a very unusual and hard-to-comprehend meaning
    (like the high address byte + 1 ANDed with some other value, blah blah),
    *those* will all work differently. Opcodes $x3, $x7 and $xF will do the same
    as on the original chip; guaranteed! So will the opcodes that select A and X
    together: LAX and SAX.
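
    For readers who have not met LAX and SAX: as commonly documented for the
    NMOS 6502, LAX loads the same byte into A and X (setting N and Z), and SAX
    stores A AND X without touching the flags. A minimal sketch of those
    semantics follows; the dictionary-based CPU state is purely illustrative
    and says nothing about how the FPGA core is built.

```python
def lax(cpu, mem, addr):
    """LAX: load one memory byte into both A and X, update N and Z."""
    value = mem[addr] & 0xFF
    cpu["A"] = value
    cpu["X"] = value
    cpu["Z"] = int(value == 0)
    cpu["N"] = value >> 7          # bit 7 of the loaded value


def sax(cpu, mem, addr):
    """SAX: store (A AND X) to memory; the flags are left unchanged."""
    mem[addr] = cpu["A"] & cpu["X"]


cpu = {"A": 0, "X": 0, "Z": 0, "N": 0}
mem = bytearray(65536)
mem[0x1000] = 0x80
lax(cpu, mem, 0x1000)      # A = X = $80, N set, Z clear
cpu["X"] = 0x0F
cpu["A"] = 0x3C
sax(cpu, mem, 0x1001)      # stores $3C AND $0F = $0C
```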
    
    Some other opcodes that did nothing but a "read from the bus" in the
    original 6502 now do something. Examples:
    5C: JMP $nnnn,X
    34: BIT $nn,X
    3C: BIT $nnnn,X
    04, 14, 0C, 1C: Similar to BIT, but with OR instead of AND
    
    These came for free by 'loosening' the decoding a little.
    
    As far as the timing is concerned, there are some differences. Off the top
    of my head:
    * Branches take 2 cycles untaken and 4 taken, no matter whether a page
      boundary is crossed or not.
    * Implied instructions (TAX, CLI, etc.) always take 1 cycle instead of 2.
    * RTS and RTI take one cycle more.
    * Additions/subtractions in decimal mode are less buggy and take one clock
      cycle more.
    * In read/modify/write instructions, the wrong (unmodified) value is not
      written first, as was the case on the 6502.
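
    To get a feel for what these differences mean for cycle-counted code, the
    list can be put in a small table. The FPGA figures are taken from the list
    above; the original figures (branch 2/3/4, TAX/CLI/DEX 2, RTS and RTI 6)
    are the usual NMOS 6502 numbers and are not part of the message. The table
    keys and the helper function are made up for this sketch.

```python
# name: (cycles on the original 6502, cycles in the FPGA core)
CYCLES = {
    "branch not taken":           (2, 2),
    "branch taken, same page":    (3, 4),
    "branch taken, page crossed": (4, 4),
    "TAX/CLI/DEX":                (2, 1),
    "RTS":                        (6, 7),
    "RTI":                        (6, 7),
}


def cycle_delta(trace):
    """Return FPGA cycles minus original cycles for a list of table keys."""
    original = sum(CYCLES[name][0] for name in trace)
    fpga = sum(CYCLES[name][1] for name in trace)
    return fpga - original


# A ten-pass "DEX / BNE" delay loop: nine taken branches, one untaken.
delay_loop = ["TAX/CLI/DEX", "branch taken, same page"] * 9 \
           + ["TAX/CLI/DEX", "branch not taken"]
print(cycle_delta(delay_loop))
```

    In this particular loop the faster DEX and the slower taken branch almost
    cancel out, so the difference is a single cycle over the whole loop; other
    instruction mixes will drift further.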
    
    I hope that this gives some more clarity about what the implementation looks
    like.
    
    ==============================================================================
    
    History:
    Gideon contacted me in private because, being a C64 fan and working with
    FPGAs, he had the idea of building a C64 in an FPGA. He searched the net,
    hit my site so often :) and, being Dutch as well, decided to contact me.
    I told him that Jeri was working on the C=1, so in fact he would be
    reinventing the wheel. On the other hand, Jeri had to use the 65816 as there
    was no free (good) core for the 6502 and 65816.
    So Gideon decided to shift his attention to the processor by producing a
    better CPU than the 65816. One that can actually replace the original 65816
    on the C=1 but also the 6510 or 6502 on other C= computers (just a matter of
    another interface). We are aiming at a 32-bit CPU with some extras, running
    at 32 MHz. Maybe one that has a 65816 mode. But one that can run 6502 code
    at any time !!!
     
    About the illegal opcodes:
    Gideon and I are still discussing what to do with them. Let's have a look at
    the 65816. It has only ONE opcode (WDM / $42) left that can be used for
    extending the instruction set. This would mean we would end up with 3-byte
    instructions. Using the illegal opcodes means we will still have some 2-byte
    instructions rather than only 3-byte ones.
    Then the facts:
    1) Who is using illegal opcodes? AFAIK mostly demos, with no other reason
    than to gain some extra microseconds.
    2) Users having a SCPU cannot play these demos anyway, as the 65816 won't
    interpret the instructions as meant (and can therefore crash).
    
    About the differences in timing:
    Gideon could change the design so the 65GS10 would work exactly like the
    original 6510. Adding an extra cycle is no problem, but removing the extra
    ones is. All operations inside the FPGA are done on the rising edge of the
    clock. Doing some operations on the falling edge as well would do the trick.
    But then you end up with operations that sometimes have to be activated on
    the rising edge and other times on the falling edge. And the combination is
    the problem, as (for the moment) the solution costs too many gates compared
    to the gain.
    
    What programs really depend on these timings? Mostly demos and games. As I
    said before, we are aiming at a 32 MHz CPU. Running the CPU at any speed
    other than the original frequency would screw up such a game/demo anyway.
    IMHO those few clock cycles won't make the difference then.
    
    What about the extra speed for games? SCPU users have run into this problem
    already, I think. (I don't have one, so I cannot tell.) In fact I think we
    will run into the same problem with a lot of games as we had with the PCs
    at the end of the '80s: many games only ran fine on PCs equipped with an
    8088 running at 4.77 MHz. (This is IMHO the only reason why PCs were
    equipped with a "Turbo button".)
    I don't see any reason why the 65GS32 could not run at 1 MHz. I wonder what
    game would drop dead on the fact that some instructions aren't cycle exact.
    (Hmmm, a "single stepper" inside a monitor could.)
    
    About the extras:
    - The 65GSxx is capable of addressing SDRAMs directly. This feature is
    needed so the 65GSxx can run at those high speeds.
    OK, this isn't a feature you would expect of a CPU, but it is built inside
    the same FPGA and therefore considered part of the CPU. The same goes for
    the other extras.
    
    - A Memory Management Unit. People familiar with the SCPU immediately
    know why we need this device. The VIC cannot "see" the SDRAM in any way. So
    the 65GSxx MUST write video data to the original RAM of the C64. The MMU
    enables us to tell the CPU whether to use the original RAM or the SDRAM.
    A special instruction will replace the zero page with a set of registers. A
    simple loop like:
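         ldx #0          ; start at offset 0
    L1   lda ROM,X       ; fetch a byte from the source table
         sta $00,X       ; store it in the matching zero page register
         dex             ; X wraps from $00 to $FF after the first pass
         bne L1          ; loop until X is back at 0: 256 bytes in total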
    could fill these registers from ROM or whatever other source.
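
    Because the loop starts with X = 0 and counts down, it is easy to misread
    how many bytes it moves. The following sketch simply replays the X register
    of the loop above (the function name is made up) and records which offsets
    are touched, relying only on the 8-bit wrap-around of DEX.

```python
def copy_order(start_x=0):
    """Return the offsets visited by the ldx/lda/sta/dex/bne loop, in order."""
    x = start_x & 0xFF
    visited = []
    while True:
        visited.append(x)      # lda ROM,X / sta $00,X
        x = (x - 1) & 0xFF     # dex: 8-bit decrement, $00 wraps to $FF
        if x == 0:             # bne L1 falls through once X reaches zero
            break
    return visited


order = copy_order()
print(len(order), order[:3], order[-1])
```

    Starting from 0, the loop visits offset 0 first, then $FF down to 1, so all
    256 zero page locations are filled exactly once.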
    
    - The CPU is going to be equipped with 32 (?) 32-bit general purpose
    registers. This means we could perform instructions like "LOAD R1, ($12),R3"
    but also "LOAD R3, ($12),R1". The idea is to dedicate (part of) the
    registers to the well-known standard registers of the 6502. So "LDA ($12),Y"
    will in fact do the 8-bit version of "LOAD R1, ($12),R2".
    We also need more instructions. LDA (or LDAB) loads a byte. LDAW will load a
    word, LDAD a double word. LDAx (or LDAx16) uses a 16-bit address, LDAx24 a
    24-bit address, LDAx32 all 32 bits.
    In this way it is easy to extend the existing instructions. A problem will
    be the sheer mass of possible combinations. What about all possibilities
    with the instruction "LDA ($xx),Y"? This command alone has 36 possible
    combinations !!!
    The 16- and 24-bit address instructions are another problem: what about the
    unused higher address bits? One idea is to make them zero. Another idea is
    to dedicate a register to these instructions to fill in the remaining bits.
    In this way we can run several virtual 6502 processes in parallel.
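
    The two ideas for the unused address bits can be compared in a few lines.
    This is a sketch under assumptions: the function name and the `fill_reg`
    parameter are invented here, and nothing has been decided about how such a
    register would actually be laid out.

```python
def effective_address(addr, width_bits, fill_reg=0):
    """Widen a 16-, 24- or 32-bit operand address to a full 32-bit address.

    fill_reg = 0 models "make the unused bits zero"; a non-zero fill_reg
    models a dedicated register that supplies the missing upper bits.
    """
    if width_bits not in (16, 24, 32):
        raise ValueError("address width must be 16, 24 or 32 bits")
    low_mask = (1 << width_bits) - 1
    high = fill_reg & ~low_mask & 0xFFFFFFFF   # only the bits above the operand
    return high | (addr & low_mask)


# Two "virtual 6502s" using the same 16-bit address in different 64 KB windows.
print(hex(effective_address(0xC000, 16, fill_reg=0x00010000)))
print(hex(effective_address(0xC000, 16, fill_reg=0x00020000)))
```

    With the fill register approach, unmodified 16-bit code sees its own 64 KB
    window simply because the register differs per process.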
    
    - New instructions:
    This is a matter of gains and costs. Using the ADC instruction for the first
    addition means we need to reset (clear) the Carry flag first. The 80x86 has
    the ADD instruction, which adds while disregarding the Carry. In our opinion
    we can do without this instruction as the gain is marginal.
    The 6502 has no block instruction. The 65816 has: MVN and MVP. With X
    varying from 8 to 32 bits, this loop:
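         ldx VALUE       ; X = count loaded from VALUE
    L1   lda HERE,X      ; read the source at offset X
         sta THERE,X     ; write it to the destination
         dex             ; next lower offset
         bne L1          ; until X = 0 (offset 0 itself is not moved)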
    can replace such a block function. But I figured out that a block function
    could move a double word every 2 cycles against 6 for the above loop. This
    is a gain of 200%, but then: what is the overall gain? I could be wrong, but
    a compiler does not benefit from this gain; a text editor could.
    
    I can hear some of you think: 6 cycles ????   Yep :)
    - cache with onboard alignment and pipelining
    "LDA ..." and "STA ..." are 6 bytes each: 2-byte instruction, 4
    address bytes. "DEX" is one byte, "BNE L1" two bytes. Total: 15 bytes = 4
    cycles. Add two cycles for the actual read and write and you have 6.
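
    The arithmetic behind the 6 cycles and the 200% can be written out as
    follows. The one assumption that is not stated literally above is that the
    instruction stream is fetched 4 bytes (32 bits) per cycle, which is what
    "15 bytes = 4 cycles" implies; the names in the sketch are illustrative.

```python
import math

BUS_BYTES_PER_CYCLE = 4   # assumption: one 32-bit instruction fetch per cycle

# Instruction sizes as given in the text (2-byte opcode + 4 address bytes).
LOOP_BYTES = {"LDA": 6, "STA": 6, "DEX": 1, "BNE": 2}


def loop_cycles(data_accesses=2):
    """Fetch cycles for one loop pass plus the data read and the data write."""
    fetch = math.ceil(sum(LOOP_BYTES.values()) / BUS_BYTES_PER_CYCLE)
    return fetch + data_accesses


def gain_percent(loop, block):
    """Relative gain of the block instruction over the loop, in percent."""
    return (loop - block) / block * 100


print(loop_cycles())                     # 15 bytes -> 4 fetch cycles, + 2 = 6
print(gain_percent(loop_cycles(), 2))    # 6 cycles against 2
```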
    
    
    Future:
    Gideon's idea is to start with the SDRAM interface and MMU first. Without
    them there is no good way of testing any 32-bit extensions.
    
        ___
       / __|__
      / /  |_/     Greetings, Ruud
      \ \__|_\
       \___|       http://Ruud.C64.org
    
     
    
           Message was sent through the cbm-hackers mailing list
    
