Channel: Intel® Math Kernel Library

matrix multiplication speedup


Hi,

I'm using cblas_dgemm to calculate a matrix multiplication. For a randomly generated matrix X of size N x N (N could be 100), I calculate Y = X^T * X (X^T is the transpose of X). I can do it in two ways: (1) use cblas_dgemm to calculate Y directly, or (2) use a for loop: for i = 1:N, Y += X[i] * X[i]^T, where X[i] is the i-th column of X.

Comparing the speed: theoretically, both should have the same O(N^3) complexity, but in practice (2) can take about 4 times longer than (1). Could you help me understand why?
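For reference, here is a minimal sketch of the two variants (my own illustration, assuming column-major storage and double precision). The single cblas_dgemm call (or cblas_dsyrk, since the result is symmetric) lets MKL block the whole computation for cache and threads, whereas the loop issues N small rank-1 updates that are memory-bound and cannot be optimized as aggressively, which is typically where the 4x gap comes from.

#include "mkl.h"

/* (1) one call: Y = X^T * X, X is n x n, column-major */
void xtx_dgemm(int n, const double *x, double *y)
{
    cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                n, n, n, 1.0, x, n, x, n, 0.0, y, n);
}

/* (2) a loop of rank-1 updates, one per column of X, as described above:
       each cblas_dger call adds X[i] * X[i]^T to Y */
void xtx_rank1_loop(int n, const double *x, double *y)
{
    for (int i = 0; i < n; ++i) {
        const double *xi = x + (size_t)i * n;   /* i-th column of X */
        cblas_dger(CblasColMajor, n, n, 1.0, xi, 1, xi, 1, y, n);
    }
}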

Thanks 


example code of Direct Sparse Solver (DSS) Interface gives wrong result


 

I am trying to use the DSS routines to solve a linear system with a sparse matrix. I found the example code named dss_sym_f90.f90 under the Intel compiler directory and compiled it as ifort dss_sym_f90.f90 -o test-dss -mkl. The code solves a 5x5 linear system and produces a wrong solution:

"Solution Array:   -326.333   983.000   163.417   398.000    61.500"

while in fact it should be (-1.35972222222222, 4.00000000000000, 0.250000000000000, 6.40000000000000, 0.312500000000000).

The source file is pasted below. Could someone tell me what might be the reason for the wrong result?

One more thing: in my application, the matrix of the linear system is constant while the right-hand-side vector changes. Is there any way to store the LU factorization so I don't have to recompute it every time? Thanks a lot!

 

INCLUDE 'mkl_dss.f90' ! Include the standard DSS "header file."
PROGRAM solver_f90_test
use mkl_dss
IMPLICIT NONE
INTEGER, PARAMETER :: dp = KIND(1.0D0)
INTEGER :: error
INTEGER :: i
INTEGER, PARAMETER :: bufLen = 20
! Define the data arrays and the solution and rhs vectors.
INTEGER, ALLOCATABLE :: columns( : )
INTEGER :: nCols
INTEGER :: nNonZeros
INTEGER :: nRhs
INTEGER :: nRows
REAL(KIND=DP), ALLOCATABLE :: rhs( : )
INTEGER, ALLOCATABLE :: rowIndex( : )
REAL(KIND=DP), ALLOCATABLE :: solution( : )
REAL(KIND=DP), ALLOCATABLE :: values( : )
TYPE(MKL_DSS_HANDLE) :: handle ! Allocate storage for the solver handle.
REAL(KIND=DP),ALLOCATABLE::statOUt( : )
CHARACTER*15 statIn
INTEGER perm(1)
INTEGER buff(bufLen)
! Set the problem to be solved.
nRows = 5
nCols = 5
nNonZeros = 9
nRhs = 1
perm(1) = 0
ALLOCATE( rowIndex( nRows + 1 ) )
rowIndex = (/ 1, 6, 7, 8, 9, 10 /)
ALLOCATE( columns( nNonZeros ) )
columns = (/ 1, 2, 3, 4, 5, 2, 3, 4, 5 /)
ALLOCATE( values( nNonZeros ) )
values = (/ 9.0_DP, 1.5_DP, 6.0_DP, 0.75_DP, 3.0_DP, 0.5_DP, 12.0_DP, 0.625_DP, 16.0_DP /)
ALLOCATE( rhs( nRows ) )
rhs = (/ 1.0_DP, 2.0_DP, 3.0_DP, 4.0_DP, 5.0_DP /)
! Initialize the solver.
error = DSS_CREATE( handle, MKL_DSS_DEFAULTS )
IF (error /= MKL_DSS_SUCCESS) GOTO 999
! Define the non-zero structure of the matrix.
error = DSS_DEFINE_STRUCTURE( handle, MKL_DSS_SYMMETRIC, rowIndex, nRows, nCols, columns, nNonZeros )
IF (error /= MKL_DSS_SUCCESS) GOTO 999
! Reorder the matrix.
error = DSS_REORDER( handle, MKL_DSS_DEFAULTS, perm )
IF (error /= MKL_DSS_SUCCESS) GOTO 999
! Factor the matrix.
error = DSS_FACTOR_REAL( handle, MKL_DSS_DEFAULTS, values )
IF (error /= MKL_DSS_SUCCESS) GOTO 999
! Allocate the solution vector and solve the problem.
ALLOCATE( solution( nRows ) )
error = DSS_SOLVE_REAL(handle, MKL_DSS_DEFAULTS, rhs, nRhs, solution )
IF (error /= MKL_DSS_SUCCESS) GOTO 999
! Print Out the determinant of the matrix (no statistics for a diagonal matrix)
IF( nRows .LT. nNonZeros ) THEN
ALLOCATE(statOut( 5 ) )
statIn = 'determinant'
call mkl_cvt_to_null_terminated_str(buff,bufLen,statIn)
error = DSS_STATISTICS(handle, MKL_DSS_DEFAULTS, buff, statOut )
IF (error /= MKL_DSS_SUCCESS) GOTO 999
WRITE(*,"('pow of determinant is '(5F10.3))") ( statOut(1) )
WRITE(*,"('base of determinant is '(5F10.3))") ( statOut(2) )
WRITE(*,"('Determinant is '(5F10.3))") ( (10**statOut(1))*statOut(2) )
END IF
! Deallocate solver storage and various local arrays.
error = DSS_DELETE( handle, MKL_DSS_DEFAULTS )
IF (error /= MKL_DSS_SUCCESS ) GOTO 999
IF ( ALLOCATED( rowIndex) ) DEALLOCATE( rowIndex )
IF ( ALLOCATED( columns ) ) DEALLOCATE( columns )
IF ( ALLOCATED( values ) ) DEALLOCATE( values )
IF ( ALLOCATED( rhs ) ) DEALLOCATE( rhs )
IF ( ALLOCATED( statOut ) ) DEALLOCATE( statOut )
! Print the solution vector, deallocate it and exit
WRITE(*,"('Solution Array: '(5F10.3))") ( solution(i), i = 1, nCols )
IF ( ALLOCATED( solution ) ) DEALLOCATE( solution )
GOTO 1000
! Print an error message and exit
999 WRITE(*,*) "Solver returned error code ", error
1000 CONTINUE
END PROGRAM solver_f90_test
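Regarding the follow-up question about reusing the factorization: the DSS handle keeps the factor after DSS_FACTOR_REAL, so only the solve call needs to be repeated when the right-hand side changes. A minimal sketch of that calling pattern, using the C interface from mkl_dss.h (the Fortran calls follow the same phases):

#include "mkl_dss.h"

/* the factorization lives inside 'handle' after dss_factor_real, so when
   only the right-hand side changes, just repeat the solve phase */
void solve_many_rhs(_MKL_DSS_HANDLE_t handle, double *rhs,
                    double *solution, int nSteps)
{
    MKL_INT opt = MKL_DSS_DEFAULTS, nRhs = 1, error;
    for (int step = 0; step < nSteps; ++step) {
        /* ... update rhs for this step ... */
        error = dss_solve_real(handle, opt, rhs, nRhs, solution);
        if (error != MKL_DSS_SUCCESS) break;
    }
    /* call dss_delete only after the last solve */
}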

Linking error


Greetings,

I'm trying to compile and link a user subroutine for a commercial FE code. In my code, I have the following LAPACK95 calls:

use lapack95

call potrf(cmMinusCfInv)
call potri(cmMinusCfInv)

To compile, I use /Qmkl among a bunch of other compiler options, and it compiles without complaining about use lapack95.

To link it as a DLL, I used the Intel® Math Kernel Library Link Line Advisor and, according to it, I pass:

 mkl_blas95_lp64.lib mkl_lapack95_lp64.lib mkl_intel_lp64_dll.lib mkl_core_dll.lib mkl_intel_thread_dll.lib 

in the link line. The linker then gives the following error:

error LNK2019: unresolved external symbol dpotrf_f95 referenced in function ...

The folder that contains the MKL libraries is added to the LIB path:

export LIB="C:\\Program Files (x86)\\Intel\\Composer XE 2013 SP1\\compiler\\lib\\intel64;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\Common7\\IDE;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\ATLMFC\\LIB;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\LIB\\amd64;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\bin\\amd64;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\PlatformSDK\\Lib;C:\\Program Files (x86)\\Microsoft SDKs\\Windows\\v7.1A\\Lib\\x64;C:\\Program Files (x86)\\Intel\\Composer XE 2013 SP1\\mkl\\lib\\intel64;

 

I would appreciate it if you could help me figure out the source of the problem.

Thanks,

Alireza

 

Uninitialized value issue in cblas_dsyrk


I have created a very small test program that illustrates an issue in syrk. When I run valgrind on my program, I get:

==44033==

==44033== Conditional jump or move depends on uninitialised value(s)
==44033==    at 0x401CFD: main (testsyrk.c:40)
==44033==
Check 990
==44033==
==44033== HEAP SUMMARY:
==44033==     in use at exit: 39,184 bytes in 4 blocks
==44033==   total heap usage: 4 allocs, 0 frees, 39,184 bytes allocated

My program is very simple, so I claim the uninitialised value comes from cblas_dsyrk.

The following instructions reproduce the issue:

icc -g -o testsyrk testsyrk.c -I$MKLROOT/include  -Wl,--start-group $MKLROOT/lib/intel64/libmkl_intel_lp64.a $MKLROOT/lib/intel64/libmkl_core.a $MKLROOT/lib/intel64/libmkl_sequential.a -Wl,--end-group -lpthread -lm

 

valgrind ./testsyrk 

I use the latest Intel 15.0.1.

Here is my program

#include <stdio.h>
#include <stdlib.h>

#include "mkl_cblas.h"
#include "mkl_lapack.h"

int main()
{
  double *source,*target;
  int    i,j,
         z=0,
         d=44,w=14;

  source = calloc(d*w,sizeof(double));
  target = calloc(d*d,sizeof(double));

  for(j=0; j<w; ++j)
    for(i=0; i<d; ++i)
      source[d*j+i] = 1.0;

  for(j=0; j<d; ++j)
    for(i=0; i<=j; ++i)
      if ( target[j*d+i]>0.0 )
        z += 1;

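  /* target := 1.0 * source * source^T + 0.0 * target (upper triangle only);
     source is d x w, so this is a d x d rank-w update */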
  cblas_dsyrk(CblasColMajor,
              CblasUpper,
              CblasNoTrans,
              d,w,
              1.0,source,d,
              0.0,target,d);

  for(j=0; j<d; ++j)
    for(i=0; i<=j; ++i)
    {
      #if 0
      fprintf(stderr,"%d %d\n",i,j);
      #endif

      if ( target[j*d+i]>0.0 )
        z += 1;
    }

  printf("Check %d\n",z);

  return ( 0 );
}

 

 

 

Quick Linking Intel® MKL BLAS, LAPACK to R


Overview

R is a popular programming language for statistical computing and machine learning. We have already published an article, Using Intel® Math Kernel Library (Intel MKL) with R, showing how to integrate the Intel MKL BLAS and LAPACK libraries into R to improve the math computing performance of R. However, many R developers still run into trouble linking the Intel MKL library to R. This article provides a simple way to link Intel MKL BLAS and LAPACK into an R environment.

Reference: http://cran.r-project.org/doc/manuals/r-release/R-admin.html#Shared-BLAS

Prerequisites:

  • Intel® MKL

It contains highly optimized BLAS and LAPACK as well as statistical functionality directly applicable to R. More information on Intel MKL can be found here: Intel® Math Kernel Library

This article is based on Intel MKL 11.2.0 from Intel Parallel Studio XE 2015 Composer Edition for Linux* (and later versions) and R-3.1.2.tar.gz.

System Platform: Red Hat Enterprise Linux Server release 6.3 on Intel® Xeon® CPU E5-2680  @ 2.70GHz, 8 Cores, AVX support.

Linking Intel MKL to R

The BLAS library is used by many of the add-on packages as well as by R itself. R offers the option of compiling the BLAS into a dynamic library libRblas stored in R_HOME/lib and linking both R itself and all the add-on packages against that library. This is the default on all platforms except IBM AIX*, so most developers can change the BLAS without re-installing R and all the add-on packages: all references to the BLAS go through libRblas, and that library can simply be replaced. The R project documents a simple way to change the BLAS by symlinking a dynamic BLAS library (such as ACML or Goto's) to R_HOME/lib/libRblas.so: http://cran.r-project.org/doc/manuals/r-release/R-admin.html#Shared-BLAS

In this article, we use the same approach to link the Intel MKL BLAS library to R. Please follow the instructions below to build R with the default BLAS and LAPACK using the GNU compiler chain.

$ tar -xzvf R-3.1.2.tar.gz

$ cd R-3.1.2

$ ./configure

(or $ ./configure --with-readline=no --with-x=no if the readline package and X11 are not installed)

$ make

(not $ make install, so we do not pollute the system directories)

$ ldd bin/exec/R

(to make sure it will link against libRblas.so, although it may show libRblas.so => not found)

$ cd lib

$ mv libRblas.so libRblas.so.keep

$ ln -s /opt/intel/composer_xe_2015.0.090/mkl/lib/intel64/libmkl_rt.so libRblas.so

In the same way, you can replace the LAPACK library libRlapack.so too:

($ mv libRlapack.so libRlapack.so.keep

$ ln -s /opt/intel/composer_xe_2015.0.090/mkl/lib/intel64/libmkl_rt.so libRlapack.so)

Performance Results

To give some indication of the performance improvement this replacement can provide, I ran R-benchmark-25.R, found on the R benchmarks site, on the system mentioned above.

$ cd ..

Set the Intel MKL environment by sourcing mklvars.sh for 64-bit platforms:

$source /opt/intel/composer_xe_2015.0.090/mkl/bin/mklvars.sh intel64

Because R uses the GNU OpenMP multithreading library libgomp.so while Intel MKL uses the Intel OpenMP multithreading library, Intel MKL 11.1.3 and later provide the flexibility of supporting the GNU threading layer through environment variables, as explained in the MKL reference manual: https://software.intel.com/en-us/node/528522

Please set the MKL interface and threading layers to GNU and LP64:

$ export MKL_INTERFACE_LAYER=GNU,LP64

$ export MKL_THREADING_LAYER=GNU

$ ./bin/Rscript ../R-benchmark-25.R

With the Intel MKL BLAS, I was able to get:

R Benchmark 2.5

I. Matrix calculation

2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.109999999999999

Total time for all 15 tests_________________________ (sec):  8.89966666666666

Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.494403941035161

And with the default build, or after changing back to the default BLAS and LAPACK libraries:

$ mv libRblas.so libRblas.so.mkl

$ mv libRlapack.so libRlapack.so.mkl

$ mv libRblas.so.keep libRblas.so

$ mv libRlapack.so.keep libRlapack.so

I get:

R Benchmark 2.5

2800x2800 cross-product matrix (b = a' * a)_________ (sec):  14.0946666666667

Total time for all 15 tests_________________________ (sec):  42.2893333333333

Overall mean (sum of I, II and III trimmed means/3)_ (sec):  1.42207437362512

As you can see, the overall performance speedup is about 4.75x on this standard R benchmark. Just by replacing the default BLAS and LAPACK libraries with Intel MKL, using the simple steps explained above, you can get a significant performance boost for your R applications.

Other Reference:

  1. Build R-3.0.1 with Intel® C++ Compiler and Intel® MKL on Linux* 
  2. Extending R with Intel MKL.
  3. http://www.r-project.org/
  4. http://cran.r-project.org/doc/manuals/r-release/R-admin.html#Shared-BLAS

    After making static linking using \MT my Project still have dependency on "libiomp5md.dll"


    I am using Intel MKL in my project; I use the PARDISO API from Intel MKL. For parallel processing, I have made the following changes in the project settings:
        "Configuration Properties => Intel Performance Libraries": 1) Use Intel MKL => Parallel and 2) Use ILP64 Interfaces => Yes

    It is desirable that my project DLL not have any external DLL dependencies, for internal reasons we cannot avoid (i.e. we want completely static linking). So, for static linking, I switched the compiler to /MT to use the static multi-threaded runtime libraries. Still, my project DLL shows a dependency on the OpenMP DLL libiomp5md.dll.

    Please let me know which functions from libiomp5md.dll are referenced in my project even though we only call the PARDISO function. Please guide me on how to link statically against libiomp5md.lib / libiomp5mt.lib.

    Thank you for all your help.

    DGESVD/DGESDD computation complexity


    I would like to know the real computational complexity of the DGESVD and DGESDD functions in MKL for an N-by-N matrix. If the complexity can be written as T(N) = C1*N^3 + C2*N^2, I would like to know the values of C1 and C2.
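    The exact constants are implementation- and hardware-dependent, but you can estimate them empirically. A rough sketch (my own illustration, not official operation counts): time the routine for two sizes and solve for C1 and C2 on your machine.

    #include <stdio.h>
    #include <stdlib.h>
    #include "mkl.h"
    #include "mkl_lapacke.h"

    /* time one dgesvd call (singular values only) for an n x n random matrix */
    static double time_gesvd(int n)
    {
        double *a = (double *)malloc((size_t)n * n * sizeof(double));
        double *s = (double *)malloc((size_t)n * sizeof(double));
        double *superb = (double *)malloc((size_t)n * sizeof(double));
        for (int i = 0; i < n * n; ++i) a[i] = rand() / (double)RAND_MAX;
        double t0 = dsecnd();
        LAPACKE_dgesvd(LAPACK_COL_MAJOR, 'N', 'N', n, n, a, n, s,
                       NULL, 1, NULL, 1, superb);
        double t = dsecnd() - t0;
        free(a); free(s); free(superb);
        return t;
    }

    int main(void)
    {
        /* two sizes give two equations for T(N) = C1*N^3 + C2*N^2;
           a least-squares fit over more sizes is more robust */
        int n1 = 1000, n2 = 2000;
        printf("T(%d) = %g s, T(%d) = %g s\n", n1, time_gesvd(n1), n2, time_gesvd(n2));
        return 0;
    }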

     

    Thanks

    MKL PARDISO pivot function fault


    HI all,

    I tried to implement mkl_pardiso_pivot() in my program.

    Sometimes it causes a segmentation fault in a function in mkl_intel_thread.so.

    Has anyone experienced the same situation?

     


    Issue during replacing ipp DCT function with MKL DCT function


    Hi,

    I want to replace my IPP-based DCT function with an MKL-based DCT function.

    I am getting different output data when I cross-check the IPP DCT output against the MKL DCT output.

    I used the following IPP.lib function calls to compute the DCT:

    ippsDCTFwdInitAlloc_32f
    ippsDCTFwd_32f
    ippsDCTFwdFree_32f

    Below is my code :

    // Please find attached fileinput.txt
    // Headers for file I/O, malloc, cout and the MKL trigonometric transform routines:
    #include <cstdio>
    #include <cstdlib>
    #include <iostream>
    #include "mkl_dfti.h"
    #include "mkl_trig_transforms.h"
    using std::cout; using std::endl;

    int main(int argc, char* argv[]){

        float *dpar;
        float *out;
        MKL_INT    *ipar;
        MKL_INT tt_type,stat,n_1,nn;
        FILE *fp,*fw,*fonce;
        fp = fopen( "D:\\dump\\fileinput.txt","r" );
        if(fp == NULL){
            cout<<"file not created properly"<<endl;
        }
        DFTI_DESCRIPTOR_HANDLE handle = 0;
        int n = 65; //Hardcoded to run for my code TODO:going to change after integrating into my main codebase
        nn = (MKL_INT)n;
        tt_type = MKL_STAGGERED_COSINE_TRANSFORM;

        n_1 = nn + 1 ;
        out = (float*)malloc((n+1)*sizeof(float));
        dpar= (float*)malloc((5*n_1/2+2)*sizeof(float));
        ipar= (MKL_INT*)malloc((128)*sizeof(int));
        s_init_trig_transform(&n_1,&tt_type,ipar,dpar,&stat);
        for (int srcSize =0 ;srcSize< n ; srcSize++)
        {
            fscanf(fp,"%f\n",&out[srcSize]);
        }
        fclose(fp);
        if (stat != 0)
        {
            printf("\n============================================================================\n");
            printf("FFTW2MKL FATAL ERROR: MKL TT initialization has failed with status=%d\n",(MKL_INT)stat);
            printf("Please refer to the Trigonometric Transform Routines Section of MKL Manual\n");
            printf("to find what went wrong...\n");
            printf("============================================================================\n");
            return -1;
        }
        ipar[10] = 1;    //nx, that is, the number of intervals along the x-axis, in the Cartesian case.
        ipar[14] = n_1;  //specifies the internal partitioning of the dpar array.
        ipar[15] = 1;    //value of ipar[14]+1,Specifies the internal partitioning of the dpar array.
        s_commit_trig_transform(out,&handle,ipar,dpar,&stat);
        if (stat != 0)
        {
            printf("\n============================================================================\n");
            printf("FFTW2MKL FATAL ERROR: MKL TT commit step has failed with status=%d\n",(MKL_INT)stat);
            printf("Please refer to the Trigonometric Transform Routines Section of MKL Manual\n");
            printf("to find what went wrong...\n");
            printf("============================================================================\n");
            return -1;
        }
        s_forward_trig_transform(out,&handle,ipar,dpar,&stat);
        if (stat != 0)
        {
            printf("\n============================================================================\n");
            printf("FFTW2MKL FATAL ERROR: MKL TT commit step has failed with status=%d\n",(MKL_INT)stat);
            printf("Please refer to the Trigonometric Transform Routines Section of MKL Manual\n");
            printf("to find what went wrong...\n");
            printf("============================================================================\n");
            return -1;
        }
        free_trig_transform(&handle,ipar,&stat);
        printf("\n===== DCT GOT OVER ======== \n");

        return 0;

    }

     

    Attachment: fileinput.txt (283.46 KB)
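    One thing worth checking before digging deeper: compare the two outputs element by element and see whether the mismatch is a constant scale factor or something structural, since the IPP DCT and the MKL staggered cosine transform may use different normalization conventions. A small sketch (a hypothetical helper, not part of either library):

    #include <math.h>
    #include <stdio.h>

    /* report the largest relative difference between the IPP and MKL results */
    void report_max_rel_diff(const float *ippOut, const float *mklOut, int n)
    {
        double maxRel = 0.0;
        int    where  = 0;
        for (int i = 0; i < n; ++i) {
            double ref = fabs((double)ippOut[i]) > 1e-12 ? fabs((double)ippOut[i]) : 1.0;
            double rel = fabs((double)ippOut[i] - (double)mklOut[i]) / ref;
            if (rel > maxRel) { maxRel = rel; where = i; }
        }
        printf("max relative difference %g at index %d (ipp = %g, mkl = %g)\n",
               maxRel, where, (double)ippOut[where], (double)mklOut[where]);
    }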

    C++ MKL BLAS wrappers vs expression templates


    This is a conceptual question:

    Expression templates are a popular technique in C++ for implementing matrix and array operations while avoiding unnecessary temporaries and extra loops. In other words, with expression templates an expression such as D = A+B+C, where D, A, B and C are matrices, will not incur the temporaries usually produced by a naive C++ implementation. How does this compare, in performance terms, with using C++ wrappers around the MKL BLAS routines? In other words, will a naive implementation of a Matrix/Array class wrapping the optimized BLAS routines perform at least as well as an implementation using expression templates?

    I realise this question is quite general in essence, but would be quite grateful if someone could provide me some hints on this.
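    As a concrete illustration of the temporaries question on plain arrays (element-wise addition, so BLAS level 1 / VML territory rather than GEMM; this is my own sketch, not from any library): a wrapper built on library calls typically materializes an intermediate and makes an extra pass over memory, while the fused loop that expression templates generate touches each element once. For O(N^3) kernels such as GEMM the balance usually tips the other way, because the single optimized BLAS call dominates and the cost of temporaries is comparatively small.

    #include "mkl.h"   /* vdAdd from the MKL vector math functions */

    /* wrapper style: D = A + B + C via two library calls and a temporary */
    void add3_wrappers(int n, const double *a, const double *b,
                       const double *c, double *d, double *tmp)
    {
        vdAdd(n, a, b, tmp);   /* tmp = a + b  (extra pass, extra buffer) */
        vdAdd(n, tmp, c, d);   /* d   = tmp + c */
    }

    /* fused style (what expression templates generate): one pass, no temporary */
    void add3_fused(int n, const double *a, const double *b,
                    const double *c, double *d)
    {
        for (int i = 0; i < n; ++i)
            d[i] = a[i] + b[i] + c[i];
    }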

    Thanks!

     

     

     

    How can I reuse sparse factorizations in Pardiso


     

    Hello, 

    I have a series of structurally identical matrices, {A1, A2, A3, ...}, and I need to solve A*X = Y for A1, A2, A3, .... Note that the right-hand-side vector Y changes as time goes on while all the matrices are kept constant, so I need to solve all these equations at each time step. Is there any way I can do the factorization only once at the start and store all the computed factors in a memory-efficient way, so that I can solve the linear systems whenever the Y vectors are updated?

    Thank you! 

    PS 1: I know I can store an array of PARDISO handles like pt(:,N_matrices), but I am afraid that the internal memory cost would be too high, even though all these matrices are structurally identical.

    PS 2: I don't understand why most sparse LU factorization packages do not give users the actual L and U matrices, which are exactly what they are expected to produce; instead, they prefer to use internal memory structures that nobody but their authors understands.
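    For a single matrix with many right-hand sides, the usual PARDISO calling pattern is to run the analysis and factorization phases once and then repeat only the solve phase; the factors stay inside the pt handle between calls. A minimal sketch of that pattern (my own illustration; setup of pt, iparm and the CSR arrays is as in the MKL pardiso examples):

    #include "mkl_pardiso.h"
    #include "mkl_types.h"

    void solve_time_steps(void *pt[64], MKL_INT iparm[64], MKL_INT mtype,
                          MKL_INT n, double *a, MKL_INT *ia, MKL_INT *ja,
                          double *y, double *x, int nSteps)
    {
        MKL_INT maxfct = 1, mnum = 1, nrhs = 1, msglvl = 0, error = 0;
        MKL_INT idum = 0;      /* dummy permutation argument */
        double  ddum = 0.0;    /* dummy rhs/solution for non-solve phases */
        MKL_INT phase;

        phase = 12;            /* analysis + numerical factorization, done once */
        PARDISO(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &error);

        for (int step = 0; step < nSteps && error == 0; ++step) {
            /* ... update y for this time step ... */
            phase = 33;        /* solve only; the factors in pt are reused */
            PARDISO(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                    &idum, &nrhs, iparm, &msglvl, y, x, &error);
        }

        phase = -1;            /* release PARDISO internal memory at the end */
        PARDISO(pt, &maxfct, &mnum, &mtype, &phase, &n, &ddum, ia, ja,
                &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &error);
    }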

     

    Using FEAST for large matrix


    Hello,

    I am presently working with FEAST to find eigenvalues and eigenvectors of a symmetric matrix. I need to solve an N x N problem with N ~ 10^6 - 10^8.

    Now I have a few queries:

    1. Since the size is large, it is not possible to allocate this storage on a desktop (it has 8 GB of RAM). Is there any way to handle a large matrix of this size?

    2. The matrix is also expected to be sparse, so I expect to store it in a compressed format, which can save some memory. But the eigenvector matrix is also of dimension N x N, and I have to pre-allocate it before calling FEAST, so the compressed storage will not be of much help. Is there any way to solve this problem?

    3. Since the FEAST fpm uses the 64 iparm entries of MKL PARDISO, I have checked that iparm(60) helps to use disk storage. Can I use that in FEAST to solve this large problem? However, in this case too, I guess I have to pass the eigenvectors (N x N) to FEAST, which I have to pre-allocate. Can I somehow use disk space for this?

    My program works for moderate-size matrices (10000 x 10000).

    I would appreciate any help in this regard.
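    On point 2: in the MKL FEAST (Extended Eigensolver) sparse interface the eigenvector array is n x m0, where m0 is the search-subspace size (an upper bound on the number of eigenvalues expected in [emin, emax]), not n x n, so the eigenvector storage scales with the number of wanted eigenpairs. A minimal sketch of the CSR driver call (sizes here are illustrative assumptions):

    #include <stdlib.h>
    #include "mkl.h"
    #include "mkl_solvers_ee.h"

    void feast_sketch(MKL_INT n, double *a, MKL_INT *ia, MKL_INT *ja,
                      double emin, double emax, MKL_INT m0)
    {
        MKL_INT fpm[128], loop = 0, m = 0, info = 0;
        double  epsout = 0.0;
        double *e   = (double *)malloc((size_t)m0 * sizeof(double));      /* eigenvalues  */
        double *x   = (double *)malloc((size_t)n * m0 * sizeof(double));  /* eigenvectors */
        double *res = (double *)malloc((size_t)m0 * sizeof(double));      /* residuals    */

        feastinit(fpm);                      /* default FEAST parameters */
        dfeast_scsrev("F", &n, a, ia, ja, fpm, &epsout, &loop,
                      &emin, &emax, &m0, e, x, &m, res, &info);

        free(e); free(x); free(res);
    }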

    Thanks,

    Dhiraj

     

     

     

    Intel ODE Library on Mac OS


    An application I would like to run on my macbook links to the Intel ODE library.

    Unfortunately, the library is available only for Windows and Linux: binaries but not source are available for download.

    How can I obtain a build for the Mac OS or a copy of the source so I can compile it myself?

    Thanks!

     

    SVD produces wrong results in mkl=parallel (2013 sp1)


    I have run into a strange bug in MKL: zgesvd produces different results (some wrong) depending on the number of threads MKL uses. Above 2 threads, the singular values all become NaN, even though the matrix is perfectly diagonalizable. I would appreciate some help, as this is critical for my simulations at work.

    I have placed a copy of a reproducible example here
    https://www.dropbox.com/sh/0fejoblyv7w6t30/AABcD9jW3KZRR0z5BLJXA0KLa?dl=0

    The code is intended to be run on a cluster with a varying number of threads, so the Makefile should serve only as a guide. The compiler version is the one from Composer XE 2013 SP1 update 2.144, and the Intel MKL library is the one that comes with this software.

    Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.2.144 Build 20140120
    Copyright (C) 1985-2014 Intel Corporation. All rights reserved.
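    Until the root cause is found, one possible workaround (a sketch, assuming the NaNs only appear in the threaded path) is to force the affected zgesvd call onto a single thread and restore the previous setting afterwards:

    #include "mkl.h"

    void call_zgesvd_single_threaded(void)
    {
        int saved = mkl_get_max_threads();   /* remember the current setting */
        mkl_set_num_threads(1);              /* run the suspect call sequentially */
        /* ... call zgesvd here ... */
        mkl_set_num_threads(saved);          /* restore threading afterwards */
    }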

    Intel(R) System Studio Developer Story : How to configure, build and profile the Linux Kernel of Android by using the VTune


    Intel(R) System Studio Developer Story: How to configure, build, debug and optimize key parts of your mobile software stack for Android*

    1. Set-up and configure a development environment.

    (1) The target HW environment

    In this article, a MinnowBoard MAX is used as the HW platform. The MinnowBoard is an Intel® Atom™ processor-based board that brings Intel® Architecture to the small, low-cost embedded market for the developer and maker community. It offers exceptional performance, flexibility, openness and standards.

    MinnowBoard MAX specification:
    • CPU: 64-bit Intel® Atom™ E3815 (single core, 1.46 GHz) or E3825 (dual core, 1.33 GHz)
    • Graphics: Integrated Intel® HD Graphics
    • Memory: 1~2 GB DDR3 RAM
    • I/O: micro HDMI video / micro SD / SATA2 / USB 3.0 (host) / USB 2.0 (host) / serial / Ethernet
    • OS: Linux / Yocto Linux / Windows 8.1 / Android 4.4

    * Please find more details on the official MinnowBoard homepage: http://www.minnowboard.org/

    (2) Software environment

     Host OS: Ubuntu 14.04 LTS / Windows 7 64-bit

     IDE: Android Developer Tools / Eclipse KEPLER

     Tools: Intel® System Studio 2015

     Target OS: Android 4.2.2

    (3) Set up the Android SW development environment

    • Configure a workstation (Ubuntu Linux)    
    sudo dpkg --assert-multi-arch
    sudo add-apt-repository ppa:webupd8team/java
    sudo apt-get update
    sudo apt-get install oracle-java6-installer
    
    sudo apt-get install git git-core gnupg flex bison gperf build-essential ccache squashfs-tools zip curl libc6-dev libncurses5-dev x11proto-core-dev g++-multilib mingw32 tofrodos  python-markdown libxml2-utils zlib1g-dev:i386 libx11-dev libreadline6-dev xsltproc
    
    echo 'export USE_CCACHE=1'>> ~/.bashrc
    ccache -M 16
    
    mkdir ~/bin
    PATH=~/bin:$PATH
    curl http://commondatastorage.googleapis.com/git-repo-downloads/repo > ~/bin/repo
    chmod a+x ~/bin/repo
    • Download the code
    repo init -u https://github.com/android-ia/platform_manifest.git
    repo sync -j4 -q -c --no-clone-bundle
    • Build 
    source build/envsetup.sh
    lunch            <select minnow-eng>
    make -j4

     

    • Make an installer USB memory stick and install it to a micro SD card in the Minnow board

    Copy android-ia/out/target/product/minnowboard_max/live.img and write it as an installer image to a USB memory stick. On Windows you can write the Android installer image to a USB memory stick with Win32DiskImager. Then insert the USB memory stick containing the Android installer image into the MinnowBoard, which has a micro SD card. After boot-up you can select installing Android.

    • Connect the Android on the Minnow board to a host machine

    After a successful boot-up, connect the target via an Ethernet cable and configure the connection.

    adb connect 192.168.42.1
    adb shell

     

    2. Using VTune

    VTune is a performance analysis tool for finding hotspots where CPUs are used inefficiently, both in applications and system-wide.

    The following links are useful documents on how to find a hotspot on Android.

    In this article, we focus on profiling an Android system. To profile a system, including the Linux kernel, with VTune's system-wide profiling feature, we will add functions that can stop VTune profiling to the Linux kernel layer of Android, and we will also introduce VTune's user-defined custom analysis feature. As prerequisites for this kind of analysis, compiling Android, updating the boot image with fastboot, and Android logging and ADB commands are also explained.

    (1) The example function of stopping VTune profiling

    While porting or working on the Linux kernel layer of Android, we sometimes need to stop profiling at the moment of a specific event, signal, or exceptional case such as a kernel panic, and then start the analysis from the point just after the specific debug point we want to check. You can accomplish this by sending the QUIT signal to the VTune process, which is started when you start analysis in the VTune GUI or on the command line. The example function below finds the VTune process and sends the QUIT signal to it to stop profiling.

    void stop_vtune_process (void)
    {
    	struct task_struct *p;
    	int j;
    	int flasg_to_skip_sh = 0;
    
    	for_each_process(p) 	{
    		for (j=0;j< (TASK_COMM_LEN-4);j++) {
    			if (p->comm[j] == 'a' && p->comm[j+1] == 'm' && p->comm[j+2] == 'p' && p->comm[j+3] == 'l' && p->comm[j+4] == 'x') {
                                    /* found the amplx ... in the process name. */
    				printk ("[vtune] %d %s \n",task_pid_nr(p),p->comm);
    				if (flasg_to_skip_sh) {
    					task_lock(p);
    					printk("[vtune]Kill %d(%s)\n",task_pid_nr(p), p->comm);
    					task_unlock(p);
    					do_send_sig_info(SIGQUIT, SEND_SIG_PRIV, p, false);
    					break;
    				}
    				else	{
    					printk("[vtune] skip sh for amplxe\n");
    					flasg_to_skip_sh++;
    				}
    			}
    		}
    	}
    }

    You can add a call to this kind of function in your Linux kernel source code (for example in the kernel panic handler, a USB event function, a key event function, etc.), wherever you want to stop profiling and start the analysis with VTune.

    (2) Some useful things - fastboot, ADB, kernel log (Minnow board MAX)

    <build source codes>
    make -j4

    <download kernel image by fastboot>
    adb reboot bootloader
    fastboot -t 192.168.42.1 flash boot boot.img
    fastboot -t 192.168.42.1 continue

    <logging kernel via adb>
    adb shell cat /proc/kmsg | grep vtune

    (3) Setting and using the VTune for the system profiling

    To get detailed system data during profiling, it is better to use Advanced Hotspots with system-wide profiling; even though it provides little call-stack information, we can see the system-wide processes and functions that are called during profiling.

    • Project properties - Target type : Profile System 
    • New Analysis - Choose analysis type - advanced hotspot  , collection level : hotspot

    <The screen shot of the result - example : Advanced hot spot>

    <advanced hotspot - system-wide profile> This Advanced Hotspots analysis uses 3 high-frequency basic HW events: CPU_CLK_UNHALTED.CORE, CPU_CLK_UNHALTED.REF_TSC and INST_RETIRED.ANY. If the system code you want to profile is more delay-critical, or you want to use a specific HW PMU event, use a custom analysis. The next example is the custom analysis.

    • Project properties - Attach process -select process
    • New Analysis - Custom Analysis - New Hardware Event-based sampling analysis
    • New Hardware Event-based sampling analysis- Edit - add events you want
    • New Hardware Event-based sampling analysis- Edit - Check the collect stacks or Analyze system-wide context switches

    <The screen shot of result - example : custom analysis>

    Hardware Event-based sampling analysis

     

    You can analyze the processes working in the timeline as in the picture above, and if you find any suspicious process that needs more investigation, change VTune - Project Properties - Target Type - Attach to Process and repeat the testing above to narrow down the issue.

     



    Running multiple Pardiso solves concurrently


    Hi,

    We're using MKL PARDISO inside an optimisation web service on Windows and Linux. Clients can spin up multiple optimisations in one call to the service, so we have multiple runs occurring concurrently in the same memory space, with multiple calls to PARDISO; one might be analysing, another factorising, and another solving, and so on, all at the same time. Under heavy load we get crashes from heap corruption, and PARDISO is often in the call stack.

    I'm trying to eliminate the obvious causes of these crashes before diving into Inspector runs. Does PARDISO actually support multiple independent runs, with separate memory for each initialisation and solve, or should we be putting each solve in its own process to protect memory?
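    For what it's worth, a sketch of the kind of per-run state we are talking about (my own illustration, assuming each optimisation run owns its own solver state): PARDISO keeps all of its internal memory behind the pt handle, so concurrent runs must not share pt, iparm, or the matrix/rhs buffers.

    #include "mkl_pardiso.h"
    #include "mkl_types.h"

    typedef struct {
        void    *pt[64];      /* PARDISO internal handle, zero-initialised once */
        MKL_INT  iparm[64];   /* per-run parameter array                        */
        MKL_INT  mtype;       /* matrix type                                    */
        /* per-run matrix, rhs and solution buffers belong here as well */
    } SolverState;

    void solver_state_init(SolverState *s, MKL_INT mtype)
    {
        for (int i = 0; i < 64; ++i) { s->pt[i] = 0; s->iparm[i] = 0; }
        s->mtype = mtype;
        pardisoinit(s->pt, &s->mtype, s->iparm);   /* fill iparm with defaults */
    }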

    Thanks,

    Damien 

     


    cluster_sparse_solver computes wrong solution


    Hello,

    I'm trying to use cluster_sparse_solver to solve a system in-place (iparm(6) = 1) with the distributed format (iparm(40) = 1). I adapted the example cl_solver_unsym_distr_c.c, as you can see in the attachment, and at runtime, on two MPI processes, I get the following output:

    $ icpc -V
    Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.1.133 Build 20141023

    $ mpicc -cc=icc cl_solver_unsym_distr_c.c -lmkl_intel_thread -lmkl_core -lmkl_intel_lp64 -liomp5

    $ mpirun -np 2 ./a.out

    The solution out-of-place of the system is: 
     on zero process x [0] =  0.149579       rhs [0] =  1.000000
     on zero process x [1] =  0.259831       rhs [1] =  1.000000
     on zero process x [2] = -0.370084       rhs [2] =  0.250000
     on zero process x [3] =  0.011236       rhs [3] =  1.000000
     on zero process x [4] =  0.415730       rhs [4] =  1.000000

    Solving system in-place...
    The solution in-place of the system is: 
     on zero process x [0] =  0.149579
     on zero process x [1] =  0.259831
     on zero process x [2] = -0.370084
     on zero process x [3] =  1.000000
     on zero process x [4] =  1.000000

    Can you reproduce this behavior? The in-place solution is obviously wrong. Do you see how to fix it? Thank you in advance.

    Attachment: cl_solver_unsym_distr_c.c (12.09 KB)

    Build Scipy With MKL failed


    I used this command to build numpy first:

    python %MYPWD%/%NUMPY_VER%/setup.py config  --compiler=msvc build_clib --compiler=msvc  build_ext

    the site.cfg content is:

    [mkl]
    library_dirs = C:\Program Files (x86)\Intel\Composer XE 2015\mkl\lib\intel64
    include_dirs = C:\Program Files (x86)\Intel\Composer XE 2015\mkl\include
    mkl_libs = mkl_rt
    lapack_libs = 

    Then I build scipy with the command:

    python %MYPWD%/%SCIPY_VER%/setup.py config  --compiler=msvc build_clib --compiler=msvc  build_ext

     

    It fails with the following messages:

    _fftpackmodule.obj : warning LNK4197: export 'init_fftpack' specified multiple times; using first specificatio
    n
       Creating library build\temp.win-amd64-2.7\Release\build\src.win-amd64-2.7\scipy\fftpack\_fftpack.lib and ob
    ject build\temp.win-amd64-2.7\Release\build\src.win-amd64-2.7\scipy\fftpack\_fftpack.exp
    zfft.obj : error LNK2019: unresolved external symbol zfftf_ referenced in function zfft
    zfft.obj : error LNK2019: unresolved external symbol zfftb_ referenced in function zfft
    zfft.obj : error LNK2019: unresolved external symbol zffti_ referenced in function get_cache_id_zfft
    zfft.obj : error LNK2019: unresolved external symbol cfftf_ referenced in function cfft
    zfft.obj : error LNK2019: unresolved external symbol cfftb_ referenced in function cfft
    zfft.obj : error LNK2019: unresolved external symbol cffti_ referenced in function get_cache_id_cfft
    drfft.obj : error LNK2019: unresolved external symbol dfftf_ referenced in function drfft
    drfft.obj : error LNK2019: unresolved external symbol dfftb_ referenced in function drfft
    drfft.obj : error LNK2019: unresolved external symbol dffti_ referenced in function get_cache_id_drfft
    drfft.obj : error LNK2019: unresolved external symbol rfftf_ referenced in function rfft
    drfft.obj : error LNK2019: unresolved external symbol rfftb_ referenced in function rfft
    drfft.obj : error LNK2019: unresolved external symbol rffti_ referenced in function get_cache_id_rfft
    dct.obj : error LNK2019: unresolved external symbol costi_ referenced in function get_cache_id_dct1
    dct.obj : error LNK2019: unresolved external symbol cost_ referenced in function dct1
    dct.obj : error LNK2019: unresolved external symbol cosqi_ referenced in function get_cache_id_dct2
    dct.obj : error LNK2019: unresolved external symbol cosqb_ referenced in function dct2
    dct.obj : error LNK2019: unresolved external symbol cosqf_ referenced in function dct3
    dct.obj : error LNK2019: unresolved external symbol dcosti_ referenced in function get_cache_id_ddct1
    dct.obj : error LNK2019: unresolved external symbol dcost_ referenced in function ddct1
    dct.obj : error LNK2019: unresolved external symbol dcosqi_ referenced in function get_cache_id_ddct2
    dct.obj : error LNK2019: unresolved external symbol dcosqb_ referenced in function ddct2
    dct.obj : error LNK2019: unresolved external symbol dcosqf_ referenced in function ddct3
    dst.obj : error LNK2019: unresolved external symbol sinti_ referenced in function get_cache_id_dst1
    dst.obj : error LNK2019: unresolved external symbol sint_ referenced in function dst1
    dst.obj : error LNK2019: unresolved external symbol sinqi_ referenced in function get_cache_id_dst2
    dst.obj : error LNK2019: unresolved external symbol sinqb_ referenced in function dst2
    dst.obj : error LNK2019: unresolved external symbol sinqf_ referenced in function dst3
    dst.obj : error LNK2019: unresolved external symbol dsinti_ referenced in function get_cache_id_ddst1
    dst.obj : error LNK2019: unresolved external symbol dsint_ referenced in function ddst1
    dst.obj : error LNK2019: unresolved external symbol dsinqi_ referenced in function get_cache_id_ddst2
    dst.obj : error LNK2019: unresolved external symbol dsinqb_ referenced in function ddst2
    dst.obj : error LNK2019: unresolved external symbol dsinqf_ referenced in function ddst3
    build\lib.win-amd64-2.7\scipy\fftpack\_fftpack.pyd : fatal error LNK1120: 32 unresolved externals

     

    Please give me your suggestions if you have experience with this. Thanks.

     

    Help needed with bdsqr


    Hello,

    I'm trying to compute a partial SVD of a rectangular matrix A. I tried to adapt an MKL example that uses gesvd. While I get the same singular values, I'm not able to compute the correct left singular vectors. Any help would be greatly appreciated.

    Thank you.

    Attachment: lapack.cpp (3.34 KB)

    Parameters for ?stemr


    I have a problem where I need to calculate a number of eigenvectors and a different number of eigenvalues. Instead of calling dsyevr twice, I plan on calling dsytrd -> dstemr (twice) -> dormtr (or, alternatively, stebz / stein).

    However, I have noticed unexpected behavior with the eigenvector parameter (using 11.0 update 5, from C).

    If I use jobz = 'N', then the call to dstemr sets the first element of the eigenvector array to 0.0 even when only calculating eigenvalues, and it crashes if no array is provided. The documentation states that this argument is not used in this case. Is it sufficient to pass a single double as a dummy argument, or does the function also set more values in this array? Also, while the documentation states that ldz should be >= 1 in this case, the function fails unless it is >= N.

     

    Secondly, If I use jobz = 'V', the documentation states

    "Array z(ldz, *), the second dimension of z must be at least max(1, m).

    If jobz = 'V', and info = 0, then the first m columns of z contain the orthonormal eigenvectors of the matrix T corresponding to the selected eigenvalues, with the i-th column of z holding the eigenvector associated with w(i). ".

    However, when calling from C, ldz contains the number of columns in the array and should be set to the number of eigenvalues, but the parameter validation requires ldz >= N. This means that if I want to calculate the first 10 eigenvalues of a 1000 x 1000 matrix, I still need to allocate the full-size matrix. Am I missing something? Is this just due to LAPACK_ROW_MAJOR?
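    For comparison, a minimal sketch of an ldz that does not require the full matrix (my own illustration, assuming the standard LAPACKE interface): with LAPACK_COL_MAJOR, z needs only n rows and m0 columns, where m0 is the number of requested eigenvalues, and ldz = n, so 10 eigenpairs of a 1000 x 1000 tridiagonal T need a 1000 x 10 array.

    #include <stdlib.h>
    #include "mkl_lapacke.h"

    /* compute eigenpairs il..iu of the tridiagonal matrix (d, e) */
    void dstemr_sketch(lapack_int n, double *d, double *e,
                       lapack_int il, lapack_int iu)
    {
        lapack_int m0 = iu - il + 1, m = 0;
        double *w = (double *)malloc((size_t)n * sizeof(double));
        double *z = (double *)malloc((size_t)n * m0 * sizeof(double));  /* n x m0, not n x n */
        lapack_int *isuppz = (lapack_int *)malloc(2 * (size_t)m0 * sizeof(lapack_int));
        lapack_logical tryrac = 1;

        lapack_int info = LAPACKE_dstemr(LAPACK_COL_MAJOR, 'V', 'I', n, d, e,
                                         0.0, 0.0, il, iu, &m, w, z,
                                         n,      /* ldz */
                                         m0,     /* nzc */
                                         isuppz, &tryrac);
        (void)info;
        free(w); free(z); free(isuppz);
    }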

     
