SRC Insider

SRC Computers, Inc.                                                                                                                       March 2007


In This Issue

Letter from the CEO

SRC-7 Update

Software Update – Carte 2.2 Features

Carte 2.2 Tips & Tricks

 

Upcoming Events

SRC Award for Excellence in Reconfigurable Computing

  

Letter from the CEO

Welcome to the inaugural edition of the SRC Insider, a newsletter designed to provide regular insight into SRC’s product and software development. In future issues we also expect to include reader contributed articles with helpful tips and application case studies. If you have an article idea, or have feedback on the Insider in general, please e-mail marketing@srccomputers.com.  

The Insider is being sent to everyone who has expressed an interest in or is currently using an SRC system, but it is also available by subscribing on our web site at www.srccomputers.com/News_SRC_Insider.htm. Please also feel free to forward this newsletter to your colleagues. Should you wish to unsubscribe, instructions are located in the left hand column of this newsletter.

I hope you find this publication informative, and we look forward to your feedback.

Jon Huppenthal
President & CEO
 


SRC-7 Update

Where is the SRC-7?

Jon Huppenthal, President & CEO 

For the last year we have been focused on completion of the SRC-7. As with any new product launch and associated R&D, there have been successes in some areas and challenges in others.  

Those who attended the 2006 SRC User Meeting know that the first unexpected challenge arose with a change in contract manufacturers, which delayed the project by about six months. In July 2006 we finally had all the modules needed for the SRC-7 MAPstation™ in house and successfully tested the new EM64T microprocessor, the new Series D SNAP, the new Series D six-port Hi-Bar® switch, the new Series C Common Memory, and the new high bandwidth interconnect. We have since also successfully tested the Series H MAP controller and its new on-board Common Memory.

However, an issue with the communication between the User Logic FPGAs and the SRAM OBM banks was uncovered, and after months of extensive investigation by SRC, the SRAM vendor and the FPGA vendor, the root cause was finally discovered. In December 2006 the FPGA vendor acknowledged that the User Logic FPGA was unable to meet its specified performance due to vendor quality issues and that these issues were not expected to be resolved. This was obviously a very undesirable situation, but it presented us with a great opportunity to improve the SRC-7’s performance even further.

Since we were now in a position to select a new User Logic FPGA, we decided to use the Stratix II EP2S180, which was not available when the SRC-7 was initially designed. This choice will more than double the single and double precision floating point capability and the logic capacity of the Series H MAP. It also removes the Control Chip to User Logic bandwidth bottleneck by increasing it from 12 GB/s to 14.4 GB/s. Despite the shortcomings of the first units built, we have now been able to verify that the Series H MAP's 3D FFT performance exceeds that of a 64-node Blue Gene/L System while consuming less than 300 watts for the whole system.

Such a major design change of course meant the MAP had to go through another layout cycle, which was completed in the last few weeks. We now expect to have new boards in test in early April. In parallel we have been updating Carte and its libraries to ensure that we maintain the same software compatibility and ease of use that our customers have become accustomed to.

The end result is that the SRC-7 will have even higher performance than originally anticipated at the same price point. And of course, all SRC systems are still programmable using ANSI C and Fortran.

For more detailed information on the updates to the SRC-7, please visit www.srccomputers.com/SRC7_Performance.htm.  




Software Update – Carte 2.2 Features
Dan Poznanovic, VP of Software

The Carte 2.2 release was distributed in September 2006 and has a number of new features and support functions. Applications written under Carte 2.1 generally required only a recompile to allow them to function under Carte 2.2. 

Multiple MAP application support. One of the themes of the release is support for multiple MAP applications. Several features were introduced specifically to enable application development across multiple MAPs.

  • Concurrent multi-MAP execution. When multiple MAPs are allocated, concurrent execution is provided through the use of multiple threads, just as multiple microprocessor CPUs are executed simultaneously. The technique for utilizing multiple threads is the use of pthreads. In Carte 2.2, microprocessor code written in either C or Fortran can utilize the pthread_create and pthread_join procedures to manage threads.
  • Heterogeneous MAP compile support. Systems that contain multiple MAPs often contain more than one MAP type. To take advantage of more than one MAP type within an application, both the compile time and run time support have been enhanced to target multiple MAP types. At compile time, the MAPTARGET associated with specific files is specified in the Makefile. Instead of MAPFILES= <file names>, a new set of environment variables is used: MAP_B_FILES, MAP_C_FILES, or MAP_E_FILES, for example, target the associated list of files for the specific MAP type.
  • Heterogeneous MAP run time support. When an application is compiled for more than one MAP type, a new allocation procedure is called. The routines map_init_spec, set_map_type, and map_allocate_spec allow multiple MAPs to be allocated and specific MAP types to be associated with map numbers. See Chapter 5 of the C and Fortran Programmers Guide for more information.
  • Synchronization. With multiple MAP concurrent execution, a method for synchronizing activity between MAPs and the CPUs is required. The Carte 2.2 release supports barriers through the use of the new procedures: barrier_allocate, barrier_initialize, and Barrier_Wait.

See Chapter 5 of the C and Fortran Programmers Guide for more information on Multi-MAP support. 
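The thread-per-MAP pattern described above can be sketched on the microprocessor side roughly as follows. This is a minimal illustration, not Carte code: pthread_create and pthread_join are the standard POSIX calls, but fake_map_routine, map_job_t, run_on_all_maps, and NUM_MAPS are hypothetical names invented here, and the "MAP routine" is a CPU stand-in so the sketch runs without an SRC system.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NUM_MAPS 2   /* assumed number of allocated MAPs (hypothetical) */

/* Per-thread work descriptor (hypothetical, for illustration only). */
typedef struct {
    int mapnum;          /* which MAP this thread would drive */
    const int64_t *in;   /* input slice for that MAP */
    int64_t *out;        /* output slice for that MAP */
    int n;               /* number of elements in the slice */
} map_job_t;

/* Stand-in for a real MAP routine call: doubles each value on the CPU. */
static void fake_map_routine(const int64_t *in, int64_t *out, int n, int mapnum) {
    (void)mapnum;
    for (int i = 0; i < n; i++)
        out[i] = 2 * in[i];
}

/* Thread body: each thread issues the call for exactly one MAP. */
static void *map_thread(void *arg) {
    map_job_t *job = (map_job_t *)arg;
    fake_map_routine(job->in, job->out, job->n, job->mapnum);
    return NULL;
}

/* Start one thread per MAP, then wait for all of them. Returns 0 on success. */
int run_on_all_maps(const int64_t *in, int64_t *out, int n_per_map) {
    pthread_t tid[NUM_MAPS];
    map_job_t job[NUM_MAPS];
    int started = 0, rc = 0;

    assert(n_per_map > 0);
    for (int m = 0; m < NUM_MAPS; m++) {
        job[m].mapnum = m;
        job[m].in = in + m * n_per_map;
        job[m].out = out + m * n_per_map;
        job[m].n = n_per_map;
        if (pthread_create(&tid[m], NULL, map_thread, &job[m]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    /* Join only the threads that were actually started. */
    for (int m = 0; m < started; m++)
        if (pthread_join(tid[m], NULL) != 0)
            rc = -1;
    return rc;
}
```

In a real application the body of map_thread would call the MAP routine with the map number assigned to that thread; the data partitioning shown here is just one possible choice.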

Data movement using the GPIO port to connect to an external memory. A new macro, stream_memory, allows a MAP routine to send data to or receive data from a memory that is physically connected to the MAP through the GPIO port.

Math intrinsic functions. This release provides support for many common mathematical intrinsic functions that expect 32-bit floating-point values as input. These functions are the arc sine, arc cosine, hyperbolic sine, hyperbolic cosine, and hyperbolic tangent intrinsics. Also supported are the logarithmic functions for base 2 and base 10. In C, exponentiation is supported using the pow function; in Fortran, the ‘**’ operator is supported for specific data types.

Small locally declared arrays. Previously, small locally declared arrays were instantiated in BRAM hardware. With this release, arrays that are smaller than 256 elements will be instantiated in LUT-ROM hardware instead of BRAM, potentially allowing more local array data to exist in a MAP routine.

Selector macros. There are two new families of macros, select_xbit_nval and select_pri_xbit_nval, where x is the bit width of the values being compared (and of the result), and n is the number of values in the comparison list.

Sequential decrement macro. There is a new macro, cg_decr_seq_32, that allows the user to supply a 32-bit integer number as a fixed upper limit that will then be sequentially decremented on each call to the macro.

User macro libraries. New environment variables and rules have been provided that allow for the building and inclusion of libraries containing user’s macros.

 

Features Planned for Future Releases

The following features are in development for the next release:

Multi-MAP streaming. Passing data through chain port streams between MAPs will be supported. This feature allows code to be split between chips and MAPs, allowing large algorithms to be implemented. The Global Resource Manager (GRM) will also be enhanced to allow allocation of MAPs that are chain port connected.

Streams Enhancements. To make streams easier to use, two new features will be available: wide streams and stream termination. The wide stream feature allows multiple 64-bit words to be passed into a stream within a loop iteration, and stream termination allows a stream to be stopped prior to exhausting the specified word count.

Complex DMA. Common Memory modules now support several forms of complex DMA and the compiler supports the associated DMA calls. The set of Complex DMAs that are initially supported are: Strided read, Transpose read, Subarray read. These types are related and parameterized.
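To make the three access patterns concrete, here is a small CPU-only model in plain C of what a strided, transpose, and subarray read each gather. This is only a sketch of the addressing patterns; the function names and parameters are hypothetical and are not the Carte DMA calls, whose signatures are not given in this article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Strided read: take `count` elements, every `stride`-th one from `start`. */
void strided_read(const int64_t *src, size_t start, size_t stride,
                  size_t count, int64_t *dst) {
    assert(stride > 0);
    for (size_t i = 0; i < count; i++)
        dst[i] = src[start + i * stride];
}

/* Transpose read: read a rows x cols row-major matrix column by column,
   so dst holds the cols x rows transpose in row-major order. */
void transpose_read(const int64_t *src, size_t rows, size_t cols,
                    int64_t *dst) {
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            dst[c * rows + r] = src[r * cols + c];
}

/* Subarray read: copy a sub_rows x sub_cols window whose top-left corner
   is at (r0, c0) of a row-major matrix with `cols` columns. */
void subarray_read(const int64_t *src, size_t cols, size_t r0, size_t c0,
                   size_t sub_rows, size_t sub_cols, int64_t *dst) {
    assert(c0 + sub_cols <= cols);
    for (size_t r = 0; r < sub_rows; r++)
        for (size_t c = 0; c < sub_cols; c++)
            dst[r * sub_cols + c] = src[(r0 + r) * cols + (c0 + c)];
}
```

The model also shows why the types are related: reading one column of a matrix is a strided read with stride equal to the row length, and a transpose read is a sequence of such column reads.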

Questions or comments about this article may be directed to poz@srccomputers.com.  




Carte 2.2 Tips & Tricks
 

David Caliga, Application Technology Manager

This is the first in a series of helpful code examples. This section will include examples related to new features in Carte releases, as well as examples of "tricks" that we have found to give performance improvements.


Heterogeneous MAPs

Applications can now be built targeting a heterogeneous set of MAP board types. Execution of these applications on SRC systems with a MAP configuration that matches the targeted MAP board types is supported in this release.  

MAP Files
MAPFILES = <filenames compiled with current MAPTARGET>
MAP_B_MAPFILES = <filenames compiled with MAPTARGET=map_b>
MAP_C_MAPFILES = <filenames compiled with MAPTARGET=map_c>
MAP_D_MAPFILES = <filenames compiled with MAPTARGET=map_d>
MAP_E_MAPFILES = <filenames compiled with MAPTARGET=map_e>
MAP_F_MAPFILES = <filenames compiled with MAPTARGET=map_f>
MAP_G_MAPFILES = <filenames compiled with MAPTARGET=map_g>

If specific compilation flags are to be associated with a particular MAP target, MCCFLAGS and MFTNFLAGS can also be prefixed with the appropriate target. For example, if you are compiling for a MAP E, you would set the following:

MAP_E_MCCFLAGS – sets the mcc compiler flags
MAP_E_MFTNFLAGS – sets the mftn compiler flags

The main.c needs to identify which MAP types, and how many of each, the job needs. The following example uses one MAP C and two MAP Es.

  map_init_spec (3);
  set_map_type (0, MAP_C);
  set_map_type (1, MAP_E);
  set_map_type (2, MAP_E);

  if (map_allocate_spec ()) {
    fprintf (stderr, "MAP allocation failed\n");
    exit (1);
  }

User Macros
The following environment variable settings are an example for multiple MAP types:

Macros that use the default MAPTARGET

MACROS = my_source_macros/mac1.v
MY_NGO_DIR = my_macros
MY_INFO = my_macros/info

If you have macros that are for specific MAP types, then you will have the following lines in the Makefile.

MAP_C_MACROS = my_source_macros/mac1.v/my_map_c_macros/mac2.vhdl
MAP_C_MY_NGO_DIR = my_map_c_macros
MAP_C_MY_INFO = my_map_c_macros/info

MAP_E_MACROS = my_source_macros/mac1.v/my_map_c_macros/mac3.v
MAP_E_MY_NGO_DIR = my_map_e_macros
MAP_E_MY_INFO = my_map_e_macros/info

 
Streaming Data Into a CM Module

Data movement using inbound and outbound streaming memory calls to external memory connected via GPIO ports is available. Some applications can take advantage of a large CM directly attached to the GPIO port of a MAP. The user can stream into or out of the GPIO port at a rate of two 64-bit values every clock. This feature is not supported on MAP B type MAPs. Tests in simulation mode are also not supported.

One example is using the memory as a very large work space. Another is keeping a "database" of information in CM for use by MAP jobs.

void subr (int64_t Src_array[], int64_t Key_array[], int64_t Res_array[],
           int sz, int ksz, int mapnum) {
  OBM_BANK_A (AL, int64_t, MAX_OBM_SIZE)
  OBM_BANK_B (BL, int64_t, MAX_OBM_SIZE)
  Stream_64 S0, S1, S2, S3;
  int k;

  // stream from CPU directly into GPIO memory
  #pragma src parallel sections
  {
    #pragma src section
    {
      stream_dma_cpu_dual (&S0, &S1, PORT_TO_STREAM, AL, DMA_A_B, Src_array,
                           1, sz*8);
    }
    #pragma src section
    {
      stream_memory (&S0, &S1, STREAM_TO_PORT, GPIO_PORT_0, 0, sz*8);
    }
  }

  // get an array of keys into OBM A
  DMA_CPU (CM2OBM, AL, MAP_OBM_stripe (1, "A"), Key_array, 1, ksz*8, 0);
  wait_DMA (0);

  // for each key, stream data in from GPIO
  // mem and count the number of matches
  for (k=0; k<ksz; k++) {
    printf ("processing key %d...\n", k);
    #pragma src parallel sections
    {
      #pragma src section
      {
        stream_memory (&S2, &S3, PORT_TO_STREAM, GPIO_PORT_0, 0, sz*8);
      }
      #pragma src section
      {
        int i, cnt0, cnt1;
        int64_t v0, v1;
        // two 64-bit values arrive per iteration, one on each stream
        for (i=0; i<sz/2; i++) {
          get_stream (&S2, &v0);
          get_stream (&S3, &v1);
          cg_accum_add_32 (1, v0==AL[k], 0, i==0, &cnt0);
          cg_accum_add_32 (1, v1==AL[k], 0, i==0, &cnt1);
        }
        BL[k] = cnt0 + cnt1;
      }
    }
  }

  // return the per-key match counts to the CPU
  DMA_CPU (OBM2CM, BL, MAP_OBM_stripe (1, "B"), Res_array, 1, ksz*8, 0);
  wait_DMA (0);
}
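When checking results from a routine like the one above, a plain C reference on the microprocessor can be handy. The sketch below is an assumption about what the MAP routine computes, namely that Res_array[k] ends up holding the number of elements of Src_array equal to Key_array[k], and that the two streams together deliver every source element exactly once. The name count_matches_ref is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* CPU-only reference: res[k] = number of elements of src equal to key[k].
   The total does not depend on how elements are split across two streams. */
void count_matches_ref(const int64_t *src, int sz,
                       const int64_t *key, int ksz, int64_t *res) {
    assert(sz >= 0 && ksz >= 0);
    for (int k = 0; k < ksz; k++) {
        int64_t cnt = 0;
        for (int i = 0; i < sz; i++)
            cnt += (src[i] == key[k]);   /* adds 1 on a match, 0 otherwise */
        res[k] = cnt;
    }
}
```

Comparing this output against Res_array after the MAP call is one simple way to validate a run.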

The main.c needs to have a GRM call that identifies which MAP is connected to a CM module. See the following example: 

map_init_spec (2);
set_map_type (0, MAP_B);
set_map_type (1, MAP_D);
set_map_gpio_mem (1, GPIO_PORT_1, 4);

if (map_allocate_spec ()) {
  fprintf (stderr, "MAP allocation failed\n");
  exit (1);
}

where the prototype for set_map_gpio_mem is:

 

   int set_map_gpio_mem (int map_id, int port, int mem_size);

This call specifies a streaming memory module of mem_size GBytes connected to GPIO port port on MAP map_id. A return of zero indicates success.

Synchronization of MAPs. Applications that run multiple MAPs concurrently may now synchronize either between MAPs or between a MAP and a microprocessor.
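The idea of a barrier between a microprocessor and a MAP can be illustrated with a standard POSIX barrier. This is only an analogy: pthread_barrier_init, pthread_barrier_wait, and pthread_barrier_destroy are real POSIX calls, but they stand in for the Carte barrier procedures, whose signatures are not shown in this article. The names map_side and cpu_side are hypothetical, and the "MAP" here is simply a second CPU thread.

```c
#define _POSIX_C_SOURCE 200112L   /* expose pthread_barrier_* declarations */
#include <assert.h>
#include <pthread.h>

static pthread_barrier_t sync_point;
static int map_result;   /* written by the worker that plays the MAP */

/* Worker side: produce a value, then arrive at the barrier. */
static void *map_side(void *arg) {
    (void)arg;
    map_result = 42;                      /* phase 1: produce data */
    pthread_barrier_wait(&sync_point);    /* announce that the data is ready */
    return NULL;
}

/* CPU side: returns the value produced by the worker, or -1 on setup error. */
int cpu_side(void) {
    pthread_t tid;
    int seen, rc;

    if (pthread_barrier_init(&sync_point, NULL, 2) != 0)
        return -1;
    if (pthread_create(&tid, NULL, map_side, NULL) != 0) {
        pthread_barrier_destroy(&sync_point);
        return -1;
    }
    pthread_barrier_wait(&sync_point);    /* block until the worker has arrived */
    seen = map_result;                    /* safe: the barrier orders write before read */
    rc = pthread_join(tid, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_barrier_destroy(&sync_point);
    return seen;
}
```

Neither side proceeds past the barrier until both have reached it, which is the property needed when one side consumes what the other produces.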

 
Statically Declared Arrays

The MAP routine may statically initialize small local arrays with data prior to execution. These arrays may be implemented in LUT-ROM hardware or BRAM. This capability is supported in both C and Fortran using a subset of the language syntax. These statically initialized arrays may be in the MAP routine or within routines inlined into the MAP routine. LUT arrays are declared in the same way as BRAM constant arrays. The array's size determines whether it will be implemented in LUTs or BRAM. Implementing small constant lookup tables in BRAM can use up valuable BRAM space, so the MAP Compiler instead implements arrays of up to 256 elements in the LUTs that make up the random logic of the FPGA. Arrays in BRAM can service two reads of a statically initialized array in the same clock cycle. LUT arrays are single ported, so only one access per clock can take place.

C code example:
const int8_t T[16] = {
  34, 55, 73, 98, 23, 11, 26, 90,
  15, 72, 91, 25, 10, 31, 64, 88
};

Fortran code example:
integer(kind=1), dimension(16), parameter :: t = (/  &
  34, 55, 73, 98, 23, 11, 26, 90,  &
  15, 72, 91, 25, 10, 31, 64, 88 /)

See the Programming Guide for more details on the LUT storage given the magnitude of the data values.
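As a rough illustration of how such a table might be used, the following plain C sketch reads the 16-entry table once per loop iteration, which matches the one-access-per-clock limit described for single-ported LUT arrays. It is ordinary C that runs on a CPU; the function name sum_lookups is hypothetical, and nothing here is a claim about how the MAP Compiler schedules the loop.

```c
#include <assert.h>
#include <stdint.h>

/* Same 16-entry constant table as in the C example above. */
static const int8_t T[16] = {
    34, 55, 73, 98, 23, 11, 26, 90,
    15, 72, 91, 25, 10, 31, 64, 88
};

/* Sum the table entries selected by the low 4 bits of each index.
   Each iteration performs exactly one read of T. */
int32_t sum_lookups(const uint8_t *idx, int n) {
    int32_t sum = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        sum += T[idx[i] & 0x0F];   /* mask keeps the index inside the table */
    return sum;
}
```

A loop body that needed two different entries of T in the same iteration would be the case where the single-ported limit matters.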

 
Code Preprocessors

Several groups have taken advantage of running a preprocessor on their algorithm code before it is compiled with Carte. One example is generating code with a Perl script.

The Carte standard Makefile and underlying rules that control the Carte compilation process have been changed for this release. Support has been added for identifying preprocessors to use in compilation. Environment variables in the Makefile define the preprocessor command, any options to the command, a file suffix that determines which files are to be processed, and another file suffix defining the file type being produced by the preprocessor. By default the preprocessors are applied to files in the current working directory. Optionally, directories can be specified as locations for files to be preprocessed.

Up to four distinct preprocessors can be specified in the Makefile. It is also possible for preprocessors to be applied in sequence to files. This is accomplished by properly specifying the source and target file suffixes. Any file can be specified for preprocessing, including code files for the microprocessor, MAP routines, and macros.  

Preprocessing takes place prior to any compilation of source code, or synthesis of macros. Preprocessing of files is controlled by Makefile rules and dependencies. Therefore modification of files requiring preprocessing will trigger invoking the appropriate preprocessor, just as modifying source files triggers compilation. 

The environment variables that control preprocessing are: 

PRE1 = <command> <source suffix> <target suffix> [<dir1> .. <dirN>]
PRE2 = <command> <source suffix> <target suffix> [<dir1> .. <dirN>]
PRE3 = <command> <source suffix> <target suffix> [<dir1> .. <dirN>]
PRE4 = <command> <source suffix> <target suffix> [<dir1> .. <dirN>]
PRE1_FLAGS = <any set of options for preprocessor 1>
PRE2_FLAGS = <any set of options for preprocessor 2>
PRE3_FLAGS = <any set of options for preprocessor 3>
PRE4_FLAGS = <any set of options for preprocessor 4>

The following example shows specifying preprocessing on a group of files:

PRE1 = pre_cmd pmc mc
PRE2 = pre_cpu pc c
PRE3 = pre_mac pv v my_macro
MAPFILES = file1.mc file2.mc file3.mc
MACROS = my_macro/mac.v my_macro/mac2.vhdl
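To picture how sequencing by suffix works, the toy model below follows source-to-target suffix rules until no rule applies. It is a sketch of the idea only, not of the actual Makefile rules: pre_rule_t and final_suffix are hypothetical names, and the "tmpl" suffix in the usage note is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One preprocessor rule: files ending in `src` become files ending in `dst`. */
typedef struct {
    const char *src;
    const char *dst;
} pre_rule_t;

/* Follow suffix rules until none applies and return the final suffix.
   At most nrules hops are taken, which guards against cyclic rules. */
const char *final_suffix(const char *suffix, const pre_rule_t *rules,
                         size_t nrules) {
    assert(suffix != NULL);
    for (size_t step = 0; step < nrules; step++) {
        const char *next = NULL;
        for (size_t r = 0; r < nrules; r++) {
            if (strcmp(suffix, rules[r].src) == 0) {
                next = rules[r].dst;
                break;
            }
        }
        if (next == NULL)
            break;              /* no rule consumes this suffix: done */
        suffix = next;
    }
    return suffix;
}
```

With rules {"tmpl", "pmc"} and {"pmc", "mc"}, a file with the suffix tmpl would pass through both stages and end up as an mc file, while a file whose suffix matches no rule is left alone.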

Questions, comments or tips & tricks you would like to share may be directed to david.caliga@srccomputers.com.

