Channel: Intel® Math Kernel Library

matrix multiplication speedup


Hi,

I'm using cblas_dgemm to calculate a matrix multiplication. For a randomly generated matrix X of size N x N (N could be 100), I calculate Y = X^T * X (X^T is the transpose of X). I can do it in two ways: (1) use cblas_dgemm to calculate Y directly, or (2) use a for loop: for i = 1:N, Y += X[i] * X[i]^T, where X[i] is the i-th column of X.

Comparing the speed: theoretically, both should have the same O(N^3) complexity, but in practice (2) can take about 4 times longer than (1). Could you help me understand why?
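For reference, here is a minimal sketch of the two variants (my own illustration, assuming column-major storage and double precision). The single cblas_dgemm call (or cblas_dsyrk, since the result is symmetric) lets MKL block the whole computation for cache and threads, whereas the loop issues N small rank-1 updates that are memory-bound and cannot be optimized as aggressively, which is typically where the 4x gap comes from.

#include "mkl.h"

/* (1) one call: Y = X^T * X, X is n x n, column-major */
void xtx_dgemm(int n, const double *x, double *y)
{
    cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                n, n, n, 1.0, x, n, x, n, 0.0, y, n);
}

/* (2) a loop of rank-1 updates, one per column of X, as described above:
       each cblas_dger call adds X[i] * X[i]^T to Y */
void xtx_rank1_loop(int n, const double *x, double *y)
{
    for (int i = 0; i < n; ++i) {
        const double *xi = x + (size_t)i * n;   /* i-th column of X */
        cblas_dger(CblasColMajor, n, n, 1.0, xi, 1, xi, 1, y, n);
    }
}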

Thanks 


example code of Direct Sparse Solver (DSS) Interface gives wrong result


 

I am trying to use the DSS routines to solve a linear system with a sparse matrix. I found the example code named dss_sym_f90.f90 under the Intel compiler directory and compiled it as ifort dss_sym_f90.f90 -o test-dss -mkl. The code solves a 5x5 linear system and produces a wrong solution:

"Solution Array:   -326.333   983.000   163.417   398.000    61.500"

while in fact it should be (-1.35972222222222, 4.00000000000000, 0.250000000000000, 6.40000000000000, 0.312500000000000).

The source file is pasted below. Could someone tell me what might be the reason for the wrong result?

One more thing: in my application, the matrix of the linear system is constant while the right-hand-side vector changes. Is there any way to store the LU factorization so I don't have to recompute it every time? Thanks a lot!

 

INCLUDE 'mkl_dss.f90' ! Include the standard DSS "header file."
PROGRAM solver_f90_test
use mkl_dss
IMPLICIT NONE
INTEGER, PARAMETER :: dp = KIND(1.0D0)
INTEGER :: error
INTEGER :: i
INTEGER, PARAMETER :: bufLen = 20
! Define the data arrays and the solution and rhs vectors.
INTEGER, ALLOCATABLE :: columns( : )
INTEGER :: nCols
INTEGER :: nNonZeros
INTEGER :: nRhs
INTEGER :: nRows
REAL(KIND=DP), ALLOCATABLE :: rhs( : )
INTEGER, ALLOCATABLE :: rowIndex( : )
REAL(KIND=DP), ALLOCATABLE :: solution( : )
REAL(KIND=DP), ALLOCATABLE :: values( : )
TYPE(MKL_DSS_HANDLE) :: handle ! Allocate storage for the solver handle.
REAL(KIND=DP),ALLOCATABLE::statOUt( : )
CHARACTER*15 statIn
INTEGER perm(1)
INTEGER buff(bufLen)
! Set the problem to be solved.
nRows = 5
nCols = 5
nNonZeros = 9
nRhs = 1
perm(1) = 0
ALLOCATE( rowIndex( nRows + 1 ) )
rowIndex = (/ 1, 6, 7, 8, 9, 10 /)
ALLOCATE( columns( nNonZeros ) )
columns = (/ 1, 2, 3, 4, 5, 2, 3, 4, 5 /)
ALLOCATE( values( nNonZeros ) )
values = (/ 9.0_DP, 1.5_DP, 6.0_DP, 0.75_DP, 3.0_DP, 0.5_DP, 12.0_DP, 0.625_DP, 16.0_DP /)
ALLOCATE( rhs( nRows ) )
rhs = (/ 1.0_DP, 2.0_DP, 3.0_DP, 4.0_DP, 5.0_DP /)
! Initialize the solver.
error = DSS_CREATE( handle, MKL_DSS_DEFAULTS )
IF (error /= MKL_DSS_SUCCESS) GOTO 999
! Define the non-zero structure of the matrix.
error = DSS_DEFINE_STRUCTURE( handle, MKL_DSS_SYMMETRIC, rowIndex, nRows, nCols, columns, nNonZeros )
IF (error /= MKL_DSS_SUCCESS) GOTO 999
! Reorder the matrix.
error = DSS_REORDER( handle, MKL_DSS_DEFAULTS, perm )
IF (error /= MKL_DSS_SUCCESS) GOTO 999
! Factor the matrix.
error = DSS_FACTOR_REAL( handle, MKL_DSS_DEFAULTS, values )
IF (error /= MKL_DSS_SUCCESS) GOTO 999
! Allocate the solution vector and solve the problem.
ALLOCATE( solution( nRows ) )
error = DSS_SOLVE_REAL(handle, MKL_DSS_DEFAULTS, rhs, nRhs, solution )
IF (error /= MKL_DSS_SUCCESS) GOTO 999
! Print Out the determinant of the matrix (no statistics for a diagonal matrix)
IF( nRows .LT. nNonZeros ) THEN
ALLOCATE(statOut( 5 ) )
statIn = 'determinant'
call mkl_cvt_to_null_terminated_str(buff,bufLen,statIn)
error = DSS_STATISTICS(handle, MKL_DSS_DEFAULTS, buff, statOut )
IF (error /= MKL_DSS_SUCCESS) GOTO 999
WRITE(*,"('pow of determinant is '(5F10.3))") ( statOut(1) )
WRITE(*,"('base of determinant is '(5F10.3))") ( statOut(2) )
WRITE(*,"('Determinant is '(5F10.3))") ( (10**statOut(1))*statOut(2) )
END IF
! Deallocate solver storage and various local arrays.
error = DSS_DELETE( handle, MKL_DSS_DEFAULTS )
IF (error /= MKL_DSS_SUCCESS ) GOTO 999
IF ( ALLOCATED( rowIndex) ) DEALLOCATE( rowIndex )
IF ( ALLOCATED( columns ) ) DEALLOCATE( columns )
IF ( ALLOCATED( values ) ) DEALLOCATE( values )
IF ( ALLOCATED( rhs ) ) DEALLOCATE( rhs )
IF ( ALLOCATED( statOut ) ) DEALLOCATE( statOut )
! Print the solution vector, deallocate it and exit
WRITE(*,"('Solution Array: '(5F10.3))") ( solution(i), i = 1, nCols )
IF ( ALLOCATED( solution ) ) DEALLOCATE( solution )
GOTO 1000
! Print an error message and exit
999 WRITE(*,*) "Solver returned error code ", error
1000 CONTINUE
END PROGRAM solver_f90_test
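Regarding the follow-up question about reusing the factorization: the DSS handle keeps the factor after DSS_FACTOR_REAL, so only the solve call needs to be repeated when the right-hand side changes. A minimal sketch of that calling pattern, using the C interface from mkl_dss.h (the Fortran calls follow the same phases):

#include "mkl_dss.h"

/* the factorization lives inside 'handle' after dss_factor_real, so when
   only the right-hand side changes, just repeat the solve phase */
void solve_many_rhs(_MKL_DSS_HANDLE_t handle, double *rhs,
                    double *solution, int nSteps)
{
    MKL_INT opt = MKL_DSS_DEFAULTS, nRhs = 1, error;
    for (int step = 0; step < nSteps; ++step) {
        /* ... update rhs for this step ... */
        error = dss_solve_real(handle, opt, rhs, nRhs, solution);
        if (error != MKL_DSS_SUCCESS) break;
    }
    /* call dss_delete only after the last solve */
}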

Linking error


Greetings,

I'm trying to compile and link a user subroutine for a commercial FE code. In my code, I have the following LAPACK95 calls:

use lapack95

call potrf(cmMinusCfInv)
call potri(cmMinusCfInv)

To compile, I use /Qmkl among a bunch of other compiler options, and it compiles without complaining about use lapack95.

To link it as a DLL, I used the Intel® Math Kernel Library Link Line Advisor and, according to it, I pass:

 mkl_blas95_lp64.lib mkl_lapack95_lp64.lib mkl_intel_lp64_dll.lib mkl_core_dll.lib mkl_intel_thread_dll.lib 

in the link line. The linker then gives the following error:

error LNK2019: unresolved external symbol dpotrf_f95 referenced in function ...

The folder that contains the MKL libraries is added to the LIB path:

export LIB="C:\\Program Files (x86)\\Intel\\Composer XE 2013 SP1\\compiler\\lib\\intel64;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\Common7\\IDE;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\ATLMFC\\LIB;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\LIB\\amd64;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\bin\\amd64;C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\PlatformSDK\\Lib;C:\\Program Files (x86)\\Microsoft SDKs\\Windows\\v7.1A\\Lib\\x64;C:\\Program Files (x86)\\Intel\\Composer XE 2013 SP1\\mkl\\lib\\intel64;

 

I would appreciate it if you could help me figure out the source of the problem.

Thanks,

Alireza

 

Uninitialized value issue in cblas_dsyrk


I have created a very small test program that illustrates an issue in syrk. When I run valgrind on my program, I get:

==44033==

==44033== Conditional jump or move depends on uninitialised value(s)
==44033==    at 0x401CFD: main (testsyrk.c:40)
==44033==
Check 990
==44033==
==44033== HEAP SUMMARY:
==44033==     in use at exit: 39,184 bytes in 4 blocks
==44033==   total heap usage: 4 allocs, 0 frees, 39,184 bytes allocated

My program is very simple, so I claim the uninitialised value comes from cblas_dsyrk.

The following instructions reproduce the issue:

icc -g -o testsyrk testsyrk.c -I$MKLROOT/include  -Wl,--start-group $MKLROOT/lib/intel64/libmkl_intel_lp64.a $MKLROOT/lib/intel64/libmkl_core.a $MKLROOT/lib/intel64/libmkl_sequential.a -Wl,--end-group -lpthread -lm

 

valgrind ./testsyrk 

I use the latest Intel 15.0.1.

Here is my program

#include <stdio.h>
#include <stdlib.h>

#include "mkl_cblas.h"
#include "mkl_lapack.h"

int main()
{
  double *source,*target;
  int    i,j,
         z=0,
         d=44,w=14;

  source = calloc(d*w,sizeof(double));
  target = calloc(d*d,sizeof(double));

  for(j=0; j<w; ++j)
    for(i=0; i<d; ++i)
      source[d*j+i] = 1.0;

  for(j=0; j<d; ++j)
    for(i=0; i<=j; ++i)
      if ( target[j*d+i]>0.0 )
        z += 1;

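  /* target := 1.0 * source * source^T + 0.0 * target (upper triangle only);
     source is d x w, so this is a d x d rank-w update */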
  cblas_dsyrk(CblasColMajor,
              CblasUpper,
              CblasNoTrans,
              d,w,
              1.0,source,d,
              0.0,target,d);

  for(j=0; j<d; ++j)
    for(i=0; i<=j; ++i)
    {
      #if 0
      fprintf(stderr,"%d %d\n",i,j);
      #endif

      if ( target[j*d+i]>0.0 )
        z += 1;
    }

  printf("Check %d\n",z);

  return ( 0 );
}

 

 

 

Quick Linking Intel® MKL BLAS, LAPACK to R


Overview

R is a popular programming language for statistical computing and machine learning. We have already published an article, Using Intel® Math Kernel Library (Intel MKL) with R, showing how to integrate the Intel MKL BLAS and LAPACK libraries into R to improve the math computing performance of R. However, many R developers still run into trouble linking the Intel MKL library to R. This article provides a simple way to link Intel MKL BLAS and LAPACK into an R environment.

Reference: http://cran.r-project.org/doc/manuals/r-release/R-admin.html#Shared-BLAS

Prerequisites:

  • Intel® MKL

It contains highly optimized BLAS and LAPACK as well as statistical functionality directly applicable to R. More information on Intel MKL can be found here: Intel® Math Kernel Library

This article is based on Intel MKL 11.2.0 from Intel Parallel Studio XE 2015 Composer Edition for Linux* (and later versions) and R-3.1.2.tar.gz.

System Platform: Red Hat Enterprise Linux Server release 6.3 on Intel® Xeon® CPU E5-2680  @ 2.70GHz, 8 Cores, AVX support.

Linking Intel MKL to R

The BLAS library is used by many of the add-on packages as well as by R itself. R offers the option of compiling the BLAS into a dynamic library libRblas stored in R_HOME/lib and linking both R itself and all the add-on packages against that library. This is the default on all platforms except IBM AIX*, so most developers can change the BLAS without re-installing R and all the add-on packages: all references to the BLAS go through libRblas, and that library can simply be replaced. The R project documents a simple way to change the BLAS by symlinking a dynamic BLAS library (such as ACML or Goto's) to R_HOME/lib/libRblas.so: http://cran.r-project.org/doc/manuals/r-release/R-admin.html#Shared-BLAS

In this article, we use the same approach to link the Intel MKL BLAS library to R. Please follow the instructions below to build R with the default BLAS and LAPACK using the GNU compiler chain.

$ tar -xzvf R-3.1.2.tar.gz

$ cd R-3.1.2

$ ./configure

(or $ ./configure --with-readline=no --with-x=no if the readline package and X11 are not installed)

$ make

(not $ make install, so we do not pollute the system directories)

$ ldd bin/exec/R

(to make sure it will link against libRblas.so, although it may show libRblas.so => not found)

$ cd lib

$ mv libRblas.so libRblas.so.keep

$ ln -s /opt/intel/composer_xe_2015.0.090/mkl/lib/intel64/libmkl_rt.so libRblas.so

In the same way, you can replace the LAPACK library libRlapack.so too:

($ mv libRlapack.so libRlapack.so.keep

$ ln -s /opt/intel/composer_xe_2015.0.090/mkl/lib/intel64/libmkl_rt.so libRlapack.so)

Performance Results

To give some indication of the performance improvement this replacement can provide, I ran R-benchmark-25.R, found on the R benchmarks site, on the system mentioned above.

$ cd ..

Set the Intel MKL environment by sourcing mklvars.sh for 64-bit platforms:

$source /opt/intel/composer_xe_2015.0.090/mkl/bin/mklvars.sh intel64

Because R uses the GNU OpenMP multithreading library libgomp.so while Intel MKL uses the Intel OpenMP multithreading library, Intel MKL 11.1.3 and later provide the flexibility of supporting the GNU threading layer through environment variables, as explained in the MKL reference manual: https://software.intel.com/en-us/node/528522

Please set the MKL interface and threading layers to GNU and LP64:

$ export MKL_INTERFACE_LAYER=GNU,LP64

$ export MKL_THREADING_LAYER=GNU

$ ./bin/Rscript ../R-benchmark-25.R

With the Intel MKL BLAS, I was able to get:

R Benchmark 2.5

I. Matrix calculation

2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.109999999999999

Total time for all 15 tests_________________________ (sec):  8.89966666666666

Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.494403941035161

And with the default build, or after changing back to the default BLAS and LAPACK libraries:

$ mv libRblas.so libRblas.so.mkl

$ mv libRlapack.so libRlapack.so.mkl

$ mv libRblas.so.keep libRblas.so

$ mv libRlapack.so.keep libRlapack.so

I get:

R Benchmark 2.5

2800x2800 cross-product matrix (b = a' * a)_________ (sec):  14.0946666666667

Total time for all 15 tests_________________________ (sec):  42.2893333333333

Overall mean (sum of I, II and III trimmed means/3)_ (sec):  1.42207437362512

As you can see, the overall performance speedup is about 4.75x on this standard R benchmark. Just by replacing the default BLAS and LAPACK libraries with Intel MKL, using the simple steps explained above, you can get a significant performance boost for your R applications.

Other Reference:

  1. Build R-3.0.1 with Intel® C++ Compiler and Intel® MKL on Linux* 
  2. Extending R with Intel MKL.
  3. http://www.r-project.org/
  4. http://cran.r-project.org/doc/manuals/r-release/R-admin.html#Shared-BLAS

    After making static linking using \MT my Project still have dependency on "libiomp5md.dll"


    I am using Intel MKL in my project; I use the PARDISO API from Intel MKL. For parallel processing, I have made the following changes in the project settings:
        "Configuration Properties => Intel Performance Libraries": 1) Use Intel MKL => Parallel and 2) Use ILP64 Interfaces => Yes

    It is desirable that my project DLL not have any external DLL dependencies, for internal reasons we cannot avoid (i.e. we want completely static linking). So, for static linking, I switched the compiler to /MT to use the static multi-threaded runtime libraries. Still, my project DLL shows a dependency on the OpenMP DLL libiomp5md.dll.

    Please let me know which functions from libiomp5md.dll are referenced in my project even though we only call the PARDISO function. Please guide me on how to link statically against libiomp5md.lib / libiomp5mt.lib.

    Thank you for all your help.

    DGESVD/DGESDD computation complexity


    I would like to know the real computational complexity of the DGESVD and DGESDD functions in MKL for an N-by-N matrix. If the complexity can be written as T(N) = C1*N^3 + C2*N^2, I would like to know the values of C1 and C2.
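    The exact constants are implementation- and hardware-dependent, but you can estimate them empirically. A rough sketch (my own illustration, not official operation counts): time the routine for two sizes and solve for C1 and C2 on your machine.

    #include <stdio.h>
    #include <stdlib.h>
    #include "mkl.h"
    #include "mkl_lapacke.h"

    /* time one dgesvd call (singular values only) for an n x n random matrix */
    static double time_gesvd(int n)
    {
        double *a = (double *)malloc((size_t)n * n * sizeof(double));
        double *s = (double *)malloc((size_t)n * sizeof(double));
        double *superb = (double *)malloc((size_t)n * sizeof(double));
        for (int i = 0; i < n * n; ++i) a[i] = rand() / (double)RAND_MAX;
        double t0 = dsecnd();
        LAPACKE_dgesvd(LAPACK_COL_MAJOR, 'N', 'N', n, n, a, n, s,
                       NULL, 1, NULL, 1, superb);
        double t = dsecnd() - t0;
        free(a); free(s); free(superb);
        return t;
    }

    int main(void)
    {
        /* two sizes give two equations for T(N) = C1*N^3 + C2*N^2;
           a least-squares fit over more sizes is more robust */
        int n1 = 1000, n2 = 2000;
        printf("T(%d) = %g s, T(%d) = %g s\n", n1, time_gesvd(n1), n2, time_gesvd(n2));
        return 0;
    }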

     

    Thanks

    MKL PARDISO pivot function fault


    HI all,

    I tried to implement mkl_pardiso_pivot() in my program.

    Sometimes it causes a segmentation fault in a function in mkl_intel_thread.so.

    Has anyone experienced the same situation?

     


    Issue during replacing ipp DCT function with MKL DCT function


    Hi,

    I want to replace my IPP-based DCT function with an MKL-based DCT function.

    I am getting different output data when I cross-check the IPP DCT output against the MKL DCT output.

    I used the following IPP.lib function calls to compute the DCT:

    ippsDCTFwdInitAlloc_32f
    ippsDCTFwd_32f
    ippsDCTFwdFree_32f

    Below is my code :

    // Please find attached fileinput.txt
    // Headers for file I/O, malloc, cout and the MKL trigonometric transform routines:
    #include <cstdio>
    #include <cstdlib>
    #include <iostream>
    #include "mkl_dfti.h"
    #include "mkl_trig_transforms.h"
    using std::cout; using std::endl;

    int main(int argc, char* argv[]){

        float *dpar;
        float *out;
        MKL_INT    *ipar;
        MKL_INT tt_type,stat,n_1,nn;
        FILE *fp,*fw,*fonce;
        fp = fopen( "D:\\dump\\fileinput.txt","r" );
        if(fp == NULL){
            cout<<"file not created properly"<<endl;
        }
        DFTI_DESCRIPTOR_HANDLE handle = 0;
        int n = 65; //Hardcoded to run for my code TODO:going to change after integrating into my main codebase
        nn = (MKL_INT)n;
        tt_type = MKL_STAGGERED_COSINE_TRANSFORM;

        n_1 = nn + 1 ;
        out = (float*)malloc((n+1)*sizeof(float));
        dpar= (float*)malloc((5*n_1/2+2)*sizeof(float));
        ipar= (MKL_INT*)malloc((128)*sizeof(int));
        s_init_trig_transform(&n_1,&tt_type,ipar,dpar,&stat);
        for (int srcSize =0 ;srcSize< n ; srcSize++)
        {
            fscanf(fp,"%f\n",&out[srcSize]);
        }
        fclose(fp);
        if (stat != 0)
        {
            printf("\n============================================================================\n");
            printf("FFTW2MKL FATAL ERROR: MKL TT initialization has failed with status=%d\n",(MKL_INT)stat);
            printf("Please refer to the Trigonometric Transform Routines Section of MKL Manual\n");
            printf("to find what went wrong...\n");
            printf("============================================================================\n");
            return -1;
        }
        ipar[10] = 1;    //nx, that is, the number of intervals along the x-axis, in the Cartesian case.
        ipar[14] = n_1;  //specifies the internal partitioning of the dpar array.
        ipar[15] = 1;    //value of ipar[14]+1,Specifies the internal partitioning of the dpar array.
        s_commit_trig_transform(out,&handle,ipar,dpar,&stat);
        if (stat != 0)
        {
            printf("\n============================================================================\n");
            printf("FFTW2MKL FATAL ERROR: MKL TT commit step has failed with status=%d\n",(MKL_INT)stat);
            printf("Please refer to the Trigonometric Transform Routines Section of MKL Manual\n");
            printf("to find what went wrong...\n");
            printf("============================================================================\n");
            return -1;
        }
        s_forward_trig_transform(out,&handle,ipar,dpar,&stat);
        if (stat != 0)
        {
            printf("\n============================================================================\n");
            printf("FFTW2MKL FATAL ERROR: MKL TT commit step has failed with status=%d\n",(MKL_INT)stat);
            printf("Please refer to the Trigonometric Transform Routines Section of MKL Manual\n");
            printf("to find what went wrong...\n");
            printf("============================================================================\n");
            return -1;
        }
        free_trig_transform(&handle,ipar,&stat);
        printf("\n===== DCT GOT OVER ======== \n");

        return 0;

    }

     

    Attachment: fileinput.txt (283.46 KB)
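    One thing worth checking before digging deeper: compare the two outputs element by element and see whether the mismatch is a constant scale factor or something structural, since the IPP DCT and the MKL staggered cosine transform may use different normalization conventions. A small sketch (a hypothetical helper, not part of either library):

    #include <math.h>
    #include <stdio.h>

    /* report the largest relative difference between the IPP and MKL results */
    void report_max_rel_diff(const float *ippOut, const float *mklOut, int n)
    {
        double maxRel = 0.0;
        int    where  = 0;
        for (int i = 0; i < n; ++i) {
            double ref = fabs((double)ippOut[i]) > 1e-12 ? fabs((double)ippOut[i]) : 1.0;
            double rel = fabs((double)ippOut[i] - (double)mklOut[i]) / ref;
            if (rel > maxRel) { maxRel = rel; where = i; }
        }
        printf("max relative difference %g at index %d (ipp = %g, mkl = %g)\n",
               maxRel, where, (double)ippOut[where], (double)mklOut[where]);
    }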

    C++ MKL BLAS wrappers vs expression templates


    This is a conceptual question:

    Expression templates are a popular technique in C++ for implementing matrix and array operations while avoiding unnecessary temporaries and extra loops. In other words, with expression templates an expression such as D = A+B+C, where D, A, B and C are matrices, will not incur the temporaries usually produced by a naive C++ implementation. How does this compare, in performance terms, with using C++ wrappers around the MKL BLAS routines? In other words, will a naive implementation of a Matrix/Array class wrapping the optimized BLAS routines perform at least as well as an implementation using expression templates?

    I realise this question is quite general in essence, but would be quite grateful if someone could provide me some hints on this.
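    As a concrete illustration of the temporaries question on plain arrays (element-wise addition, so BLAS level 1 / VML territory rather than GEMM; this is my own sketch, not from any library): a wrapper built on library calls typically materializes an intermediate and makes an extra pass over memory, while the fused loop that expression templates generate touches each element once. For O(N^3) kernels such as GEMM the balance usually tips the other way, because the single optimized BLAS call dominates and the cost of temporaries is comparatively small.

    #include "mkl.h"   /* vdAdd from the MKL vector math functions */

    /* wrapper style: D = A + B + C via two library calls and a temporary */
    void add3_wrappers(int n, const double *a, const double *b,
                       const double *c, double *d, double *tmp)
    {
        vdAdd(n, a, b, tmp);   /* tmp = a + b  (extra pass, extra buffer) */
        vdAdd(n, tmp, c, d);   /* d   = tmp + c */
    }

    /* fused style (what expression templates generate): one pass, no temporary */
    void add3_fused(int n, const double *a, const double *b,
                    const double *c, double *d)
    {
        for (int i = 0; i < n; ++i)
            d[i] = a[i] + b[i] + c[i];
    }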

    Thanks!

     

     

     

    How can I reuse sparse factorizations in Pardiso


     

    Hello, 

    I have a series of structurally identical matrices, {A1, A2, A3, ...}, and I need to solve A*X = Y for A1, A2, A3, .... Note that the right-hand-side vector Y changes as time goes on while all the matrices are kept constant, so I need to solve all these equations at each time step. Is there any way I can do the factorization only once at the start and store all the computed factors in a memory-efficient way, so that I can solve the linear systems whenever the Y vectors are updated?

    Thank you! 

    PS 1: I know I can store an array of PARDISO handles like pt(:,N_matrices), but I am afraid that the internal memory cost would be too high, even though all these matrices are structurally identical.

    PS 2: I don't understand why most sparse LU factorization packages do not give users the actual L and U matrices, which are exactly what they are expected to produce; instead, they prefer to use internal memory structures that nobody but their authors understands.
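    For a single matrix with many right-hand sides, the usual PARDISO calling pattern is to run the analysis and factorization phases once and then repeat only the solve phase; the factors stay inside the pt handle between calls. A minimal sketch of that pattern (my own illustration; setup of pt, iparm and the CSR arrays is as in the MKL pardiso examples):

    #include "mkl_pardiso.h"
    #include "mkl_types.h"

    void solve_time_steps(void *pt[64], MKL_INT iparm[64], MKL_INT mtype,
                          MKL_INT n, double *a, MKL_INT *ia, MKL_INT *ja,
                          double *y, double *x, int nSteps)
    {
        MKL_INT maxfct = 1, mnum = 1, nrhs = 1, msglvl = 0, error = 0;
        MKL_INT idum = 0;      /* dummy permutation argument */
        double  ddum = 0.0;    /* dummy rhs/solution for non-solve phases */
        MKL_INT phase;

        phase = 12;            /* analysis + numerical factorization, done once */
        PARDISO(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &error);

        for (int step = 0; step < nSteps && error == 0; ++step) {
            /* ... update y for this time step ... */
            phase = 33;        /* solve only; the factors in pt are reused */
            PARDISO(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                    &idum, &nrhs, iparm, &msglvl, y, x, &error);
        }

        phase = -1;            /* release PARDISO internal memory at the end */
        PARDISO(pt, &maxfct, &mnum, &mtype, &phase, &n, &ddum, ia, ja,
                &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &error);
    }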

     

    Using FEAST for large matrix


    Hello,

    I am presently working with FEAST to find eigenvalues and eigenvectors of a symmetric matrix. I need to solve an N x N problem with N ~ 10^6 - 10^8.

    Now I have a few queries:

    1. Since the size is large, it is not possible to allocate this storage on a desktop (it has 8 GB of RAM). Is there any way to handle a large matrix of this size?

    2. The matrix is also expected to be sparse, so I expect to store it in a compressed format, which can save some memory. But the eigenvector matrix is also of dimension N x N, and I have to pre-allocate it before calling FEAST, so the compressed storage will not be of much help. Is there any way to solve this problem?

    3. Since the FEAST fpm uses the 64 iparm entries of MKL PARDISO, I have checked that iparm(60) helps to use disk storage. Can I use that in FEAST to solve this large problem? However, in this case too, I guess I have to pass the eigenvectors (N x N) to FEAST, which I have to pre-allocate. Can I somehow use disk space for this?

    My program works for moderate-size matrices (10000 x 10000).

    I would appreciate any help in this regard.
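    On point 2: in the MKL FEAST (Extended Eigensolver) sparse interface the eigenvector array is n x m0, where m0 is the search-subspace size (an upper bound on the number of eigenvalues expected in [emin, emax]), not n x n, so the eigenvector storage scales with the number of wanted eigenpairs. A minimal sketch of the CSR driver call (sizes here are illustrative assumptions):

    #include <stdlib.h>
    #include "mkl.h"
    #include "mkl_solvers_ee.h"

    void feast_sketch(MKL_INT n, double *a, MKL_INT *ia, MKL_INT *ja,
                      double emin, double emax, MKL_INT m0)
    {
        MKL_INT fpm[128], loop = 0, m = 0, info = 0;
        double  epsout = 0.0;
        double *e   = (double *)malloc((size_t)m0 * sizeof(double));      /* eigenvalues  */
        double *x   = (double *)malloc((size_t)n * m0 * sizeof(double));  /* eigenvectors */
        double *res = (double *)malloc((size_t)m0 * sizeof(double));      /* residuals    */

        feastinit(fpm);                      /* default FEAST parameters */
        dfeast_scsrev("F", &n, a, ia, ja, fpm, &epsout, &loop,
                      &emin, &emax, &m0, e, x, &m, res, &info);

        free(e); free(x); free(res);
    }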

    Thanks,

    Dhiraj

     

     

     

    Intel ODE Library on Mac OS


    An application I would like to run on my macbook links to the Intel ODE library.

    Unfortunately, the library is available only for Windows and Linux: binaries but not source are available for download.

    How can I obtain a build for the Mac OS or a copy of the source so I can compile it myself?

    Thanks!

     

    SVD produces wrong results in mkl=parallel (2013 sp1)


    I have run into a strange bug in MKL: zgesvd produces different results (some wrong) depending on the number of threads MKL uses. Above 2 threads, the singular values all become NaN, even though the matrix is perfectly diagonalizable. I would appreciate some help, as this is critical for my simulations at work.

    I have placed a copy of a reproducible example here
    https://www.dropbox.com/sh/0fejoblyv7w6t30/AABcD9jW3KZRR0z5BLJXA0KLa?dl=0

    The code is intended to be run on a cluster with a varying number of threads, so the Makefile should serve only as a guide. The compiler version is the one from Composer XE 2013 SP1 update 2.144, and the Intel MKL library is the one that comes with this software.

    Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.2.144 Build 20140120
    Copyright (C) 1985-2014 Intel Corporation. All rights reserved.
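    Until the root cause is found, one possible workaround (a sketch, assuming the NaNs only appear in the threaded path) is to force the affected zgesvd call onto a single thread and restore the previous setting afterwards:

    #include "mkl.h"

    void call_zgesvd_single_threaded(void)
    {
        int saved = mkl_get_max_threads();   /* remember the current setting */
        mkl_set_num_threads(1);              /* run the suspect call sequentially */
        /* ... call zgesvd here ... */
        mkl_set_num_threads(saved);          /* restore threading afterwards */
    }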

    Intel(R) System Studio Developer Story : How to configure, build and profile the Linux Kernel of Android by using the VTune


    Intel(R) System Studio Developer Story: How to configure, build, debug and optimize key parts of your mobile software stack for Android*

    1. Set-up and configure a development environment.

    (1) The target HW environment

    In this article, a MinnowBoard MAX is used as the HW platform. The MinnowBoard is an Intel® Atom™ processor-based board that brings Intel® Architecture to the small, low-cost embedded market for the developer and maker community. It offers exceptional performance, flexibility, openness and standards.

    MinnowBoard MAX specification:
    • CPU: 64-bit Intel® Atom™ E3815 (single core, 1.46 GHz) or E3825 (dual core, 1.33 GHz)
    • Graphics: Integrated Intel® HD Graphics
    • Memory: 1~2 GB DDR3 RAM
    • I/O: micro HDMI video / micro SD / SATA2 / USB 3.0 (host) / USB 2.0 (host) / serial / Ethernet
    • OS: Linux / Yocto Linux / Windows 8.1 / Android 4.4

    * Please find more details on the official MinnowBoard homepage: http://www.minnowboard.org/

    (2) Software environment

     Host OS: Ubuntu 14.04 LTS / Windows 7 64-bit

     IDE: Android Developer Tools / Eclipse KEPLER

     Tools: Intel® System Studio 2015

     Target OS: Android 4.2.2

    (3) Set up the Android SW development environment

    • Configure a workstation (Ubuntu Linux)    
    sudo dpkg --assert-multi-arch
    sudo add-apt-repository ppa:webupd8team/java
    sudo apt-get update
    sudo apt-get install oracle-java6-installer
    
    sudo apt-get install git git-core gnupg flex bison gperf build-essential ccache squashfs-tools zip curl libc6-dev libncurses5-dev x11proto-core-dev g++-multilib mingw32 tofrodos  python-markdown libxml2-utils zlib1g-dev:i386 libx11-dev libreadline6-dev xsltproc
    
    echo 'export USE_CCACHE=1'>> ~/.bashrc
    ccache -M 16
    
    mkdir ~/bin
    PATH=~/bin:$PATH
    curl http://commondatastorage.googleapis.com/git-repo-downloads/repo > ~/bin/repo
    chmod a+x ~/bin/repo
    • Download the code
    repo init -u https://github.com/android-ia/platform_manifest.git
    repo sync -j4 -q -c --no-clone-bundle
    • Build 
    source build/envsetup.sh
    lunch            <select minnow-eng>
    make -j4

     

    • Make an installer USB memory stick and install it to a micro SD card in the Minnow board

    Copy android-ia/out/target/product/minnowboard_max/live.img and write it as an installer image to a USB memory stick. On Windows you can write the Android installer image to a USB memory stick with Win32DiskImager. Then insert the USB memory stick containing the Android installer image into the MinnowBoard, which has a micro SD card. After boot-up you can select installing Android.

    • Connect the Android on the Minnow board to a host machine

    After a successful boot-up, connect the target via an Ethernet cable and configure the connection.

    adb connect 192.168.42.1
    adb shell

     

    2. Using VTune

    VTune is a performance analysis tool for finding hotspots where CPUs are used inefficiently, both in applications and system-wide.

    The following links are useful documents on how to find a hotspot on Android.

    In this article, we focus on profiling an Android system. To profile a system, including the Linux kernel, with VTune's system-wide profiling feature, we will add functions that can stop VTune profiling to the Linux kernel layer of Android, and we will also introduce VTune's user-defined custom analysis feature. As prerequisites for this kind of analysis, compiling Android, updating the boot image with fastboot, and Android logging and ADB commands are also explained.

    (1) The example function of stopping VTune profiling

    While porting or working on the Linux kernel layer of Android, we sometimes need to stop profiling at the moment of a specific event, signal, or exceptional case such as a kernel panic, and then start the analysis from the point just after the specific debug point we want to check. You can accomplish this by sending the QUIT signal to the VTune process, which is started when you start analysis in the VTune GUI or on the command line. The example function below finds the VTune process and sends the QUIT signal to it to stop profiling.

    void stop_vtune_process (void)
    {
    	struct task_struct *p;
    	int j;
    	int flasg_to_skip_sh = 0;
    
    	for_each_process(p) 	{
    		for (j=0;j< (TASK_COMM_LEN-4);j++) {
    			if (p->comm[j] == 'a' && p->comm[j+1] == 'm' && p->comm[j+2] == 'p' && p->comm[j+3] == 'l' && p->comm[j+4] == 'x') {
                                    /* found the amplx ... in the process name. */
    				printk ("[vtune] %d %s \n",task_pid_nr(p),p->comm);
    				if (flasg_to_skip_sh) {
    					task_lock(p);
    					printk("[vtune]Kill %d(%s)\n",task_pid_nr(p), p->comm);
    					task_unlock(p);
    					do_send_sig_info(SIGQUIT, SEND_SIG_PRIV, p, false);
    					break;
    				}
    				else	{
    					printk("[vtune] skip sh for amplxe\n");
    					flasg_to_skip_sh++;
    				}
    			}
    		}
    	}
    }

    You can add a call to this kind of function in your Linux kernel source code (for example in the kernel panic handler, a USB event function, a key event function, etc.), wherever you want to stop profiling and start the analysis with VTune.

    (2) Some useful things - fastboot, ADB, kernel log (Minnow board MAX)

    <build source codes>
    make -j4

    <download kernel image by fastboot>
    adb reboot bootloader
    fastboot -t 192.168.42.1 flash boot boot.img
    fastboot -t 192.168.42.1 continue

    <logging kernel via adb>
    adb shell cat /proc/kmsg | grep vtune

    (3) Setting and using the VTune for the system profiling

    To get detailed system data during profiling, it is better to use Advanced Hotspots with system-wide profiling; even though it provides little call-stack information, we can see the system-wide processes and functions that are called during profiling.

    • Project properties - Target type : Profile System 
    • New Analysis - Choose analysis type - advanced hotspot  , collection level : hotspot

    <The screen shot of the result - example : Advanced hot spot>

    <advanced hotspot - system-wide profile> This Advanced Hotspots analysis uses 3 high-frequency basic HW events: CPU_CLK_UNHALTED.CORE, CPU_CLK_UNHALTED.REF_TSC and INST_RETIRED.ANY. If the system code you want to profile is more delay-critical, or you want to use a specific HW PMU event, use a custom analysis. The next example is the custom analysis.

    • Project properties - Attach process -select process
    • New Analysis - Custom Analysis - New Hardware Event-based sampling analysis
    • New Hardware Event-based sampling analysis- Edit - add events you want
    • New Hardware Event-based sampling analysis- Edit - Check the collect stacks or Analyze system-wide context switches

    <The screen shot of result - example : custom analysis>

    Hardware Event-based sampling analysis

     

    You can analyze the processes working in the timeline as in the picture above, and if you find any suspicious process that needs more investigation, change VTune - Project Properties - Target Type - Attach to Process and repeat the testing above to narrow down the issue.

     



    Running multiple Pardiso solves concurrently


    Hi,

    We're using MKL PARDISO inside an optimisation web service on Windows and Linux. Clients can spin up multiple optimisations in one call to the service, so we have multiple runs occurring concurrently in the same memory space, with multiple calls to PARDISO; one might be analysing, another factorising, and another solving, and so on, all at the same time. Under heavy load we get crashes from heap corruption, and PARDISO is often in the call stack.

    I'm trying to eliminate the obvious causes of these crashes before diving into Inspector runs. Does PARDISO actually support multiple independent runs, with separate memory for each initialisation and solve, or should we be putting each solve in its own process to protect memory?
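    For what it's worth, a sketch of the kind of per-run state we are talking about (my own illustration, assuming each optimisation run owns its own solver state): PARDISO keeps all of its internal memory behind the pt handle, so concurrent runs must not share pt, iparm, or the matrix/rhs buffers.

    #include "mkl_pardiso.h"
    #include "mkl_types.h"

    typedef struct {
        void    *pt[64];      /* PARDISO internal handle, zero-initialised once */
        MKL_INT  iparm[64];   /* per-run parameter array                        */
        MKL_INT  mtype;       /* matrix type                                    */
        /* per-run matrix, rhs and solution buffers belong here as well */
    } SolverState;

    void solver_state_init(SolverState *s, MKL_INT mtype)
    {
        for (int i = 0; i < 64; ++i) { s->pt[i] = 0; s->iparm[i] = 0; }
        s->mtype = mtype;
        pardisoinit(s->pt, &s->mtype, s->iparm);   /* fill iparm with defaults */
    }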

    Thanks,

    Damien 

     


    cluster_sparse_solver computes wrong solution


    Hello,

    I'm trying to use cluster_sparse_solver to solve a system in-place (iparm(6) = 1) with the distributed format (iparm(40) = 1). I adapted the example cl_solver_unsym_distr_c.c, as you can see in the attachment, and at runtime, on two MPI processes, I get the following output:

    $ icpc -V
    Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.1.133 Build 20141023

    $ mpicc -cc=icc cl_solver_unsym_distr_c.c -lmkl_intel_thread -lmkl_core -lmkl_intel_lp64 -liomp5

    $ mpirun -np 2 ./a.out

    The solution out-of-place of the system is: 
     on zero process x [0] =  0.149579       rhs [0] =  1.000000
     on zero process x [1] =  0.259831       rhs [1] =  1.000000
     on zero process x [2] = -0.370084       rhs [2] =  0.250000
     on zero process x [3] =  0.011236       rhs [3] =  1.000000
     on zero process x [4] =  0.415730       rhs [4] =  1.000000

    Solving system in-place...
    The solution in-place of the system is: 
     on zero process x [0] =  0.149579
     on zero process x [1] =  0.259831
     on zero process x [2] = -0.370084
     on zero process x [3] =  1.000000
     on zero process x [4] =  1.000000

    Can you reproduce this behavior? The in-place solution is obviously wrong. Do you see how to fix it? Thank you in advance.

    Attachment: cl_solver_unsym_distr_c.c (12.09 KB)

    Build Scipy With MKL failed


    I used this command to build numpy first:

    python %MYPWD%/%NUMPY_VER%/setup.py config  --compiler=msvc build_clib --compiler=msvc  build_ext

    the site.cfg content is:

    [mkl]
    library_dirs = C:\Program Files (x86)\Intel\Composer XE 2015\mkl\lib\intel64
    include_dirs = C:\Program Files (x86)\Intel\Composer XE 2015\mkl\include
    mkl_libs = mkl_rt
    lapack_libs = 

    Then I build scipy with the command:

    python %MYPWD%/%SCIPY_VER%/setup.py config  --compiler=msvc build_clib --compiler=msvc  build_ext

     

    It fails with the following messages:

    _fftpackmodule.obj : warning LNK4197: export 'init_fftpack' specified multiple times; using first specificatio
    n
       Creating library build\temp.win-amd64-2.7\Release\build\src.win-amd64-2.7\scipy\fftpack\_fftpack.lib and ob
    ject build\temp.win-amd64-2.7\Release\build\src.win-amd64-2.7\scipy\fftpack\_fftpack.exp
    zfft.obj : error LNK2019: unresolved external symbol zfftf_ referenced in function zfft
    zfft.obj : error LNK2019: unresolved external symbol zfftb_ referenced in function zfft
    zfft.obj : error LNK2019: unresolved external symbol zffti_ referenced in function get_cache_id_zfft
    zfft.obj : error LNK2019: unresolved external symbol cfftf_ referenced in function cfft
    zfft.obj : error LNK2019: unresolved external symbol cfftb_ referenced in function cfft
    zfft.obj : error LNK2019: unresolved external symbol cffti_ referenced in function get_cache_id_cfft
    drfft.obj : error LNK2019: unresolved external symbol dfftf_ referenced in function drfft
    drfft.obj : error LNK2019: unresolved external symbol dfftb_ referenced in function drfft
    drfft.obj : error LNK2019: unresolved external symbol dffti_ referenced in function get_cache_id_drfft
    drfft.obj : error LNK2019: unresolved external symbol rfftf_ referenced in function rfft
    drfft.obj : error LNK2019: unresolved external symbol rfftb_ referenced in function rfft
    drfft.obj : error LNK2019: unresolved external symbol rffti_ referenced in function get_cache_id_rfft
    dct.obj : error LNK2019: unresolved external symbol costi_ referenced in function get_cache_id_dct1
    dct.obj : error LNK2019: unresolved external symbol cost_ referenced in function dct1
    dct.obj : error LNK2019: unresolved external symbol cosqi_ referenced in function get_cache_id_dct2
    dct.obj : error LNK2019: unresolved external symbol cosqb_ referenced in function dct2
    dct.obj : error LNK2019: unresolved external symbol cosqf_ referenced in function dct3
    dct.obj : error LNK2019: unresolved external symbol dcosti_ referenced in function get_cache_id_ddct1
    dct.obj : error LNK2019: unresolved external symbol dcost_ referenced in function ddct1
    dct.obj : error LNK2019: unresolved external symbol dcosqi_ referenced in function get_cache_id_ddct2
    dct.obj : error LNK2019: unresolved external symbol dcosqb_ referenced in function ddct2
    dct.obj : error LNK2019: unresolved external symbol dcosqf_ referenced in function ddct3
    dst.obj : error LNK2019: unresolved external symbol sinti_ referenced in function get_cache_id_dst1
    dst.obj : error LNK2019: unresolved external symbol sint_ referenced in function dst1
    dst.obj : error LNK2019: unresolved external symbol sinqi_ referenced in function get_cache_id_dst2
    dst.obj : error LNK2019: unresolved external symbol sinqb_ referenced in function dst2
    dst.obj : error LNK2019: unresolved external symbol sinqf_ referenced in function dst3
    dst.obj : error LNK2019: unresolved external symbol dsinti_ referenced in function get_cache_id_ddst1
    dst.obj : error LNK2019: unresolved external symbol dsint_ referenced in function ddst1
    dst.obj : error LNK2019: unresolved external symbol dsinqi_ referenced in function get_cache_id_ddst2
    dst.obj : error LNK2019: unresolved external symbol dsinqb_ referenced in function ddst2
    dst.obj : error LNK2019: unresolved external symbol dsinqf_ referenced in function ddst3
    build\lib.win-amd64-2.7\scipy\fftpack\_fftpack.pyd : fatal error LNK1120: 32 unresolved externals

     

    Please give me your suggestions if you have experience with this. Thanks.

     

    Help needed with bdsqr


    Hello,

    I'm trying to compute a partial SVD of a rectangular matrix A. I tried to adapt an MKL example that uses gesvd. While I get the same singular values, I'm not able to compute the correct left singular vectors. Any help would be greatly appreciated.

    Thank you.

    Attachment: lapack.cpp (3.34 KB)

    Parameters for ?stemr


    I have a problem where I need to calculate a number of eigenvectors and a different number of eigenvalues. Instead of calling dsyevr twice, I plan on calling dsytrd -> dstemr (twice) -> dormtr (or, alternatively, stebz / stein).

    However, I have noticed unexpected behavior with the eigenvector parameter (using 11.0 update 5, from C).

    If I use jobz = 'N', then the call to dstemr sets the first element of the eigenvector array to 0.0 even when only calculating eigenvalues, and it crashes if no array is provided. The documentation states that this argument is not used in this case. Is it sufficient to pass a single double as a dummy argument, or does the function also set more values in this array? Also, while the documentation states that ldz should be >= 1 in this case, the function fails unless it is >= N.

     

    Secondly, If I use jobz = 'V', the documentation states

    "Array z(ldz, *), the second dimension of z must be at least max(1, m).

    If jobz = 'V', and info = 0, then the first m columns of z contain the orthonormal eigenvectors of the matrix T corresponding to the selected eigenvalues, with the i-th column of z holding the eigenvector associated with w(i). ".

    However, when calling from C, ldz contains the number of columns in the array and should be set to the number of eigenvalues, but the parameter validation requires ldz >= N. This means that if I want to calculate the first 10 eigenvalues of a 1000 x 1000 matrix, I still need to allocate the full-size matrix. Am I missing something? Is this just due to LAPACK_ROW_MAJOR?
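    For comparison, a minimal sketch of an ldz that does not require the full matrix (my own illustration, assuming the standard LAPACKE interface): with LAPACK_COL_MAJOR, z needs only n rows and m0 columns, where m0 is the number of requested eigenvalues, and ldz = n, so 10 eigenpairs of a 1000 x 1000 tridiagonal T need a 1000 x 10 array.

    #include <stdlib.h>
    #include "mkl_lapacke.h"

    /* compute eigenpairs il..iu of the tridiagonal matrix (d, e) */
    void dstemr_sketch(lapack_int n, double *d, double *e,
                       lapack_int il, lapack_int iu)
    {
        lapack_int m0 = iu - il + 1, m = 0;
        double *w = (double *)malloc((size_t)n * sizeof(double));
        double *z = (double *)malloc((size_t)n * m0 * sizeof(double));  /* n x m0, not n x n */
        lapack_int *isuppz = (lapack_int *)malloc(2 * (size_t)m0 * sizeof(lapack_int));
        lapack_logical tryrac = 1;

        lapack_int info = LAPACKE_dstemr(LAPACK_COL_MAJOR, 'V', 'I', n, d, e,
                                         0.0, 0.0, il, iu, &m, w, z,
                                         n,      /* ldz */
                                         m0,     /* nzc */
                                         isuppz, &tryrac);
        (void)info;
        free(w); free(z); free(isuppz);
    }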

     
