
Implementation of SIMD-based Many-Core Processor for Efficient Image Data Processing

Authors:

Abstract and Figures

As mobile multimedia devices become more widely used, the demand for high-performance, low-energy multimedia processors is increasing. Application-specific integrated circuits (ASICs) can deliver the high performance that mobile multimedia requires, but they offer limited, if any, generality across varying application requirements. DSP-based systems can be used for many types of applications because of their generality, but they cost more, consume more energy, and perform worse than ASICs. To solve this problem, this paper proposes a single instruction multiple data (SIMD) based many-core processor that supports high-performance, low-power image data processing while retaining generality. The proposed SIMD-based many-core processor, composed of 16 processing elements (PEs), exploits the large data parallelism inherent in image data processing. Experimental results indicate that the proposed SIMD-based many-core processor achieves higher performance (22 times better), energy efficiency (7 times better), and area efficiency (3 times better) than conventional commercial high-performance processors.
16 1 , 2011. 1.
2011-16-1-1-1
Byong-Kook Choi *, Cheol-Hong Kim **, Jong-Myon Kim ***
* School of Electrical Engineering, University of Ulsan
** Chonnam National University
*** School of Electronics and Computer Engineering, University of Ulsan
2010. 08. 24, 2010. 09. 16, 2010. 10. 29.
2010 (No. 2010-0010863).
Keyword : Many-core processor, image/video processing, data level parallelism
.
[1].
ASIC(Application-Specific Integrated Circuit)
,
[2][3][4].
(General-Purpose Processor,
GPP)
DSP (Digital Signal Processor)
.
,
.
GPP
DSP
(massive parallelism)
.
SIMD
(Single Instruction Multiple Data)
[5][6].
(Instruction-level)
(thread-level)
(multiported register file),
(cache),
(deep pipelined)
,
SIMD
(processing element, PE)
[7].
, SIMD
(locality)
(regularity)
2
.
,
SIMD
.
SIMD
16
,
. (
,
256x256
16 PE
PE
64x64
PE
)
(C6416[8], ARM926EJ-S[21],
ARM1020E[22])
22
, 7
3
.
. 2
SIMD
, 3
. 4
, 5
.
6
.
.
(data-level parallelism, DLP)
: (1)
SIMD
[9],[10],[11]
(2) SIMD
[6],[12].
SIMD
. [9]
UltraSPARC
VIS
.
4-way out-of-order
single in-order
2.3
~4.2
VIS
1.1
~4.2
. [10]
DSP
MMX
. MMX
81%
5.5
.
SIMD
.
.
SIMD
(spatial parallelism)
(processing
unit)
.
.
.
(massively data parallel
array)
30
,
SIMD
(TMC Connection Machine
1[13])
I/O
.
SIMD
TMC CM-2[14]
MasPar MP-2[15]
. Fine-grained
MGAP[16]
ABACUS[17]
,
I/O bandwidth
latency
.
,
SIMD
I/O
,
.
.
SIMD
.
Translation Transform, Subtraction, Mask Scaling, Histogram Segmentation, Edge Detection
.
Translation Transform
.
(x, y)
a
,
b
(1)
.
······························ (1)
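The translation transform moves the pixel at (x, y) by a in the x direction and by b in the y direction. A minimal sketch of that operation on a single image follows; the function name `translate`, the zero fill for vacated pixels, and the list-of-rows image layout are assumptions for illustration and are not taken from the paper.

```python
def translate(image, a, b, fill=0):
    """Shift a 2-D image (list of rows) by a pixels in x and b pixels in y.

    Pixels shifted outside the frame are dropped; vacated positions get
    the assumed `fill` value.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            new_x, new_y = x + a, y + b
            if 0 <= new_x < width and 0 <= new_y < height:
                out[new_y][new_x] = image[y][x]
    return out


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
shifted = translate(img, a=1, b=1)
```

On a SIMD array each PE would apply the same shift to its own block of pixels, exchanging boundary pixels with neighboring PEs; the sketch deliberately ignores that communication step.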
Subtraction
,
(2)
.
h(x,y) = f(x,y) - g(x,y) ················ (2)
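Equation (2) is a per-pixel difference, which maps directly onto data-parallel execution because every output pixel depends only on the two input pixels at the same position. A small sketch under stated assumptions: the optional clamp to the 8-bit range [0, 255] is an illustrative choice and is not specified in the paper.

```python
def subtract(f, g, clamp=True):
    """Compute h(x, y) = f(x, y) - g(x, y) for two equally sized images.

    With clamp=True (an assumption for 8-bit display), results are
    limited to the range [0, 255].
    """
    h = []
    for row_f, row_g in zip(f, g):
        row_h = []
        for pf, pg in zip(row_f, row_g):
            diff = pf - pg
            if clamp:
                diff = max(0, min(255, diff))
            row_h.append(diff)
        h.append(row_h)
    return h


clamped = subtract([[10, 200]], [[20, 50]])
raw = subtract([[10, 200]], [[20, 50]], clamp=False)
```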
Mask Scaling
Mask
. Mask
,
.
,
(3)
.
····························· (3)
Histogram Segmentation
Histogram
Otsu
.
Histogram
Otsu
. Otsu
Classification
.
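The histogram segmentation step names Otsu's method. The following is a sketch of the standard Otsu threshold search (maximizing between-class variance over the histogram), not the authors' SIMD implementation; the function names and the 256-level assumption are illustrative.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level that maximizes between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight0 = 0   # number of pixels with value <= t
    sum0 = 0      # sum of gray levels of those pixels
    for t in range(levels):
        weight0 += hist[t]
        sum0 += t * hist[t]
        weight1 = total - weight0
        if weight0 == 0:
            continue
        if weight1 == 0:
            break
        mean0 = sum0 / weight0
        mean1 = (sum_all - sum0) / weight1
        variance = weight0 * weight1 * (mean0 - mean1) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, t
    return best_threshold


def segment(pixels, threshold):
    """Label each pixel 1 if it is above the threshold, else 0."""
    return [1 if p > threshold else 0 for p in pixels]


pixels = [10] * 50 + [200] * 50
t = otsu_threshold(pixels)
labels = segment(pixels, t)
```

In a parallel setting the histogram would typically be built per PE and then merged, which is one reason this algorithm carries more scalar work than the purely per-pixel ones.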
Edge Detection
.
.
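Section 5.1 states that the Sobel operator is used for edge detection. A minimal sketch of a 3x3 Sobel filter is given below; approximating the gradient magnitude as |gx| + |gy|, clamping to 255, and leaving border pixels at zero are assumptions made for illustration.

```python
def sobel(image):
    """Apply 3x3 Sobel kernels and return an edge-strength image."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Horizontal gradient: right column minus left column.
            gx = (image[y - 1][x + 1] + 2 * image[y][x + 1] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y][x - 1] - image[y + 1][x - 1])
            # Vertical gradient: bottom row minus top row.
            gy = (image[y + 1][x - 1] + 2 * image[y + 1][x] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y - 1][x] - image[y - 1][x + 1])
            out[y][x] = min(255, abs(gx) + abs(gy))
    return out


flat = [[7] * 4 for _ in range(4)]
step = [[0, 0, 255, 255] for _ in range(4)]
flat_edges = sobel(flat)
step_edges = sobel(step)
```

Each output pixel needs its eight neighbors, so on a PE array the pixels along block boundaries would have to be fetched from adjacent PEs.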
IV.
4.1 SIMD
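Fig. 1 shows the block diagram of the proposed SIMD-based many-core processor. It consists of 16 PEs and an Array Control Unit (ACU). Each PE contains the following components:
- a local memory of 4096 32-bit words
- 16 32-bit three-port general-purpose registers
- an ALU for basic arithmetic/logic operations
- a 64-bit multiply accumulator
- a barrel shifter for multi-bit arithmetic/logic shifts
- a Sleep unit that activates or deactivates the PE
- a NEWS (north-east-west-south) network and serial I/O for communication with neighboring PEs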
Fig. 1. A block diagram of the SIMD-based many-core processor (processor architecture and single PE)
4.2
2
SIMD
(Fetch),
(Decode),
(Execution)
3
. 1
ACU
(instruction)
. 2
ACU
ACU
(Scalar)
PE
(vector)
BusA,
BusB, BusC
immediate
.
3
.
Fig. 2. Pipeline stages of the ACU and PE
4.3
SIMD
9
,
,
(shift),
,
,
PE
sleep
,
PE
I/O
NEWS (North, East, West, South)
,
, ACU
.
3
SIMD
PE
.
branch
macc(multiply accumulator)
.
Branch
,
2
.
Fig. 3. Activation of PEs using a Sleep instruction
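Fig. 3 concerns activating PEs with the Sleep instruction. One way to picture this is a per-PE activity flag that gates a broadcast vector instruction, sketched below. The flag representation, the function name `simd_step`, and the example condition are hypothetical and only illustrate the general masking idea, not the processor's actual instruction semantics.

```python
def simd_step(values, active, op):
    """Broadcast one operation to all PEs; sleeping PEs keep their value."""
    return [op(v) if awake else v for v, awake in zip(values, active)]


# One value per PE for a 16-PE array.
values = list(range(16))
# Hypothetical local condition: PEs holding an even value stay awake.
active = [v % 2 == 0 for v in values]
result = simd_step(values, active, lambda v: v + 100)
```

Masking of this kind lets a single instruction stream express data-dependent behavior without per-PE branching.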
4.4
4
(
,
,
)
SIMD
.
SIMD
,
,
,
(utilization)
.
Chai
SIMD
[18].
(latency,
power, clock frequency)
Generic System
Simulator (GENESYS)
[19].
,
,
.
Fig. 4. Experiment methodology for many-core processor simulation
V.
5.1
5
MRI
,
6
SIMD
.
Translation
x
, y
20
, Subtraction
. Mask Scaling
2
, Histogram Segmentation
2
, Edge Detection
(Sobel)
.
Fig. 5. Input MRI image and mask image
Fig. 6. Output images (1. Translation, 2. Subtraction, 3. Mask Scaling, 4. Histogram 1, 5. Histogram 2, 6. Edge Detection)
5.2
1
,
SIMD
(cycle-accurate)
.
16
,
.
32
4096
, 130nm
720MHz
.
Parameter | Value
Number of PEs | 16
Pixels/PE | 4096
Memory/PE [Word] | 4096 [32-bit word]
VLSI Technology | 130 nm
Clock Frequency | 720 MHz
Interconnection Network | Mesh
IntALU/intMUL/Barrel Shift/intMACC/Comm | 1/1/1/1/1
Table 1. Parameters for the implemented many-core processor
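Table 1 lists 16 PEs with 4096 pixels each on a mesh network, which matches the paper's example of a 256x256 image split into 64x64 blocks. The sketch below shows one possible mapping of pixels to PEs and of PEs to their NEWS neighbors. The 4x4 arrangement, the row-major PE numbering, and the non-wrapping mesh edges are assumptions; the paper's text here does not state them.

```python
IMAGE_SIZE = 256                 # image is IMAGE_SIZE x IMAGE_SIZE pixels
GRID = 4                         # assumed 4 x 4 arrangement of the 16 PEs
TILE = IMAGE_SIZE // GRID        # 64 pixels per tile side


def pe_for_pixel(x, y):
    """Return the index of the PE that owns pixel (x, y), row-major."""
    return (y // TILE) * GRID + (x // TILE)


def news_neighbors(pe):
    """Return the north/east/west/south neighbors of a PE (no wrap-around)."""
    row, col = divmod(pe, GRID)
    neighbors = {}
    if row > 0:
        neighbors["north"] = pe - GRID
    if col < GRID - 1:
        neighbors["east"] = pe + 1
    if col > 0:
        neighbors["west"] = pe - 1
    if row < GRID - 1:
        neighbors["south"] = pe + GRID
    return neighbors
```

For example, `pe_for_pixel(70, 10)` falls in the second tile of the first row, and a corner PE has only two neighbors under the non-wrapping assumption.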
2
SIMD
4
.
(execution time)
,
(sustained throughput)
(Giga-operations/second)
,
(energy efficiency)
(Giga-operations/Joule)
,
(area
efficiency)
[20].
Table 2. Summary of performance evaluation methods
[Table 2 defines execution time (sec), sustained throughput (Gops/sec), energy efficiency (Gops/Joule), and area efficiency (Gops/(s·mm²)); the formulas did not survive extraction.]
5.3
TI C6416,
ARM926EJ-S[21], ARM1020E[22]
.
130nm
.
16
(PE)
(data-level parallelism)
, TI C6416
8-way VLIW
8
(instruction-level parallelism)
.
3
,
4
TI C6416,
ARM926EJ-S
ARM1020E
.
7, 8, 9
TI C6416, ARM926EJ-S, ARM1020E
,
.
,
Edge Detection
4~39
,
5.5~8.5
,
1.9~3.8
.
TI DSP
ARM
.
(>30ms)
,
.
Fig. 7. Execution time comparison
Fig. 8. Energy efficiency comparison
Fig. 9. Area efficiency comparison
Table 3. Performance of each image processing algorithm using many-core processor
Algorithm | Total Cycle [cycles] | Vector Instruction [cycles] | Scalar Instruction [cycles] | System utilization [%] | Sustained throughput [Gops/sec] | Execution time [ms]
Translation | 776,365 | 515,504 | 260,861 | 98.75 | 7.65 | 1.08
Subtraction | 61,460 | 45,064 | 16,396 | 92.80 | 7.84 | 0.09
Mask Scaling | 132,454 | 98,879 | 33,575 | 74.81 | 8.43 | 0.18
Histogram Segmentation | 277,570 | 127,356 | 150,214 | 90.90 | 8.68 | 0.39
Edge Detection | 397,616 | 250,767 | 146,849 | 95.94 | 6.97 | 0.55
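The execution times in Table 3 are consistent with dividing the total cycle count by the 720 MHz clock from Table 1, and the many-core energy values in Table 4 are consistent with average power multiplied by that time. The sketch below checks this arithmetic. The function names are illustrative, and both relations are inferred from the tabulated numbers rather than quoted from Table 2.

```python
CLOCK_HZ = 720e6  # clock frequency from Table 1

# Total cycles (Table 3) and many-core average power in mW (Table 4).
total_cycles = {
    "Translation": 776_365,
    "Subtraction": 61_460,
    "Mask Scaling": 132_454,
    "Histogram Segmentation": 277_570,
    "Edge Detection": 397_616,
}
average_power_mw = {
    "Translation": 1841.28,
    "Subtraction": 1226.83,
    "Mask Scaling": 1469.35,
    "Histogram Segmentation": 1152.24,
    "Edge Detection": 787.79,
}


def execution_time_ms(cycles, clock_hz=CLOCK_HZ):
    """Execution time in milliseconds: cycles divided by clock frequency."""
    return cycles / clock_hz * 1e3


def energy_uj(cycles, power_mw, clock_hz=CLOCK_HZ):
    """Energy in microjoules: mW multiplied by ms gives uJ."""
    return power_mw * execution_time_ms(cycles, clock_hz)


for name, cycles in total_cycles.items():
    time_ms = execution_time_ms(cycles)
    energy = energy_uj(cycles, average_power_mw[name])
    print(f"{name}: {time_ms:.2f} ms, {energy:.2f} uJ")
```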
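Table 4. Performance comparison of many-core processor, TI DSP C6416, ARM926EJ-S, and ARM1020E

Translation
Parameter [unit] | Many-core | TI C6416 | ARM926EJ-S | ARM1020E
Technology [nm] | 130 | 130 | 130 | 130
Clock Frequency [MHz] | 720 | 720 | 250 | 400
Average Power [mW] | 1,841.28 | 950 | 120 | 200
Average Throughput [MIPS] | 7,649.24 | 1,595.85 | 275 | 520
Execution Time [ms] | 1.08 | 1.19 | 2.18 | 0.63
Energy [μJoule] | 1,985.42 | 1,127.14 | 261.47 | 126.43
Energy Efficiency [Gops/Joule] | 8.69 | 1.68 | 2.29 | 2.60
Area Efficiency [Gops/(s·mm²)] | 0.21 | 0.03 | 0.10 | 0.05

Subtraction
Parameter [unit] | Many-core | TI C6416 | ARM926EJ-S | ARM1020E
Technology [nm] | 130 | 130 | 130 | 130
Clock Frequency [MHz] | 720 | 720 | 250 | 400
Average Power [mW] | 1,226.83 | 950 | 120 | 200
Average Throughput [MIPS] | 7,838.98 | 2,296.79 | 275 | 520
Execution Time [ms] | 0.09 | 1.18 | 5.20 | 2.68
Energy [μJoule] | 104.72 | 1,121.81 | 623.71 | 535.46
Energy Efficiency [Gops/Joule] | 12.10 | 2.42 | 2.29 | 2.60
Area Efficiency [Gops/(s·mm²)] | 0.21 | 0.04 | 0.10 | 0.05

Mask Scaling
Parameter [unit] | Many-core | TI C6416 | ARM926EJ-S | ARM1020E
Technology [nm] | 130 | 130 | 130 | 130
Clock Frequency [MHz] | 720 | 720 | 250 | 400
Average Power [mW] | 1,469.35 | 950 | 120 | 200
Average Throughput [MIPS] | 6,433.69 | 1,773.85 | 275 | 520
Execution Time [ms] | 0.18 | 1.16 | 6.63 | 2.93
Energy [μJoule] | 270.31 | 1,105.07 | 795.74 | 585.90
Energy Efficiency [Gops/Joule] | 8.15 | 1.87 | 2.29 | 2.60
Area Efficiency [Gops/(s·mm²)] | 0.18 | 0.03 | 0.10 | 0.05

Histogram Segmentation
Parameter [unit] | Many-core | TI C6416 | ARM926EJ-S | ARM1020E
Technology [nm] | 130 | 130 | 130 | 130
Clock Frequency [MHz] | 720 | 720 | 250 | 400
Average Power [mW] | 1,152.24 | 950 | 120 | 200
Average Throughput [MIPS] | 8,577.49 | 806.97 | 275 | 520
Execution Time [ms] | 0.39 | 3.59 | 10.34 | 5.51
Energy [μJoule] | 444.21 | 3,413.90 | 1,241.05 | 1,101.77
Energy Efficiency [Gops/Joule] | 12.62 | 0.85 | 2.29 | 2.60
Area Efficiency [Gops/(s·mm²)] | 0.24 | 0.02 | 0.10 | 0.05

Edge Detection
Parameter [unit] | Many-core | TI C6416 | ARM926EJ-S | ARM1020E
Technology [nm] | 130 | 130 | 130 | 130
Clock Frequency [MHz] | 720 | 720 | 250 | 400
Average Power [mW] | 787.79 | 950 | 120 | 200
Average Throughput [MIPS] | 6,970.46 | 3,358.40 | 275 | 520
Execution Time [ms] | 0.55 | 12.05 | 21.90 | 12.41
Energy [μJoule] | 435.05 | 2,136.39 | 2,627.78 | 2,482.50
Energy Efficiency [Gops/Joule] | 19.49 | 3.54 | 2.29 | 2.60
Area Efficiency [Gops/(s·mm²)] | 0.19 | 0.06 | 0.10 | 0.05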
5.4
SIMD
RTL
, Xilinx
Virtex-4 XC4VLX60
FPGA[23]
.
10
16
PE
,
5
.
PE
1095
LUT
195
register
, ACU
1147
LUT
124
register
. 16 PE
18,667
LUT
3,244
4,202,496bit
.
Fig. 10. Hardware schematic for the many-core processor
Table 5. Synthesis result of the many-core processor
Block | Resource | Count
Array Control Unit | LUTs | 1,147
Array Control Unit | Register | 124
Processing Element | LUTs | 1,095
Processing Element | Register | 195
Total Block Memory bits | | 4,202,496
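The whole-array totals quoted in the text of Section 5.4 (18,667 LUTs and 3,244 registers) follow from the per-block figures in Table 5 if the design is taken as 16 identical PEs plus one ACU. The short check below makes that arithmetic explicit; the variable names are illustrative.

```python
NUM_PES = 16

# Per-block synthesis results from Table 5.
pe = {"luts": 1095, "registers": 195}
acu = {"luts": 1147, "registers": 124}

# Totals for 16 identical PEs plus one ACU.
total_luts = NUM_PES * pe["luts"] + acu["luts"]
total_registers = NUM_PES * pe["registers"] + acu["registers"]

print(total_luts, total_registers)
```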
VI.
,
SIMD
.
16
,
.
(130 nm Technology)
(720MHz)
TI C6416 DSP
,
22
,
7
,
3
.
,
.
[1] S.-H. Kim, S.-Y. Nam, and H.-J. Lim, "An improved area edge detection for real-time image processing," Journal of the Korea Society of Computer and Information, vol. 14, no. 1, pp. 99-106, Jan. 2009.
[2] X.-G. Jiang, J.-Y. Zhou, J.-H. Shi, and H.-H. Chen, "FPGA Implementation of Image Rotation Using Modified Compensated CORDIC," in Proc. of 6th Intl. Conf. on ASIC, vol. 2, pp. 752-756, 2005.
[3] E. B. Bourennane, S. Bouchoux, J. Miteran, M. Paindavoine, and S. Bouillant, "Cost comparison of image rotation implementations on static and dynamic reconfigurable FPGAs," in Proc. of IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP '02), vol. 3, pp. III-3176-3179, 2002.
[4] S.-H. Lee, "The design and implementation of parallel processing system using the Nios(R) II embedded processor," Journal of the Korea Society of Computer and Information, vol. 14, no. 11, pp. 97-103, Nov. 2009.
[5] A. D. Blas et al., "The UCSC Kestrel Parallel Processor," IEEE Trans. on Parallel and Distributed Systems, vol. 16, no. 1, pp. 80-92, Jan. 2005.
[6] A. Gentile and D. S. Wills, "Portable Video Supercomputing," IEEE Trans. on Computers, vol. 53, no. 8, pp. 960-973, Aug. 2004.
[7] L. V. Huynh, C.-H. Kim, and J.-M. Kim, "A massively parallel algorithm for fuzzy vector quantization," The KIPS Transactions: Part A, vol. 16-A, no. 6, pp. 411-418, Dec. 2009.
[8] TMS320C64x families, http://www.bdti.com/procsum/tic64xx.htm.
[9] P. Ranganathan, S. Adve, and N. P. Jouppi, "Performance of image and video processing with general-purpose processors and media ISA extensions," in Proc. of the 26th Intl. Sym. on Computer Architecture, pp. 124-135, May 1999.
[10] R. Bhargava, L. John, B. Evans, and R. Radhakrishnan, "Evaluating MMX technology using DSP and multimedia applications," in Proc. of IEEE/ACM Sym. on Microarchitecture, pp. 37-46, 1998.
[11] N. Slingerland and A. J. Smith, "Measuring the performance of multimedia instruction sets," IEEE Trans. on Computers, vol. 51, no. 11, pp. 1317-1332, Nov. 2002.
[12] A. Krikelis, I. P. Jalowiecki, D. Bean, R. Bishop, M. Facey, D. Boughton, S. Murphy, and M. Whitaker, "A programmable processor with 4096 processing units for media applications," in Proc. of the IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, vol. 2, pp. 937-940, May 2001.
[13] L. W. Tucker and G. G. Robertson, "Architecture and applications of the connection machine," IEEE Computer, vol. 21, no. 8, pp. 26-38, 1988.
[14] Connection machine model CM-2 technical summary, Thinking Machines Corp., version 51, May 1989.
[15] MasPar (MP-2) System Data Sheet, MasPar Corporation, 1993.
[16] M. J. Irwin and R. M. Owens, "A Two-Dimensional, Distributed Logic Processor," IEEE Trans. on Computers, vol. 40, no. 10, pp. 1094-1101, 1991.
[17] M. Bolotski, R. Armithrajah, and W. Chen, "ABACUS: A High Performance Architecture for Vision," in Proceedings of the International Conference on Pattern Recognition, 1994.
[18] S. M. Chai, T. Taha, D. S. Wills, and J. D. Meindl, "Heterogeneous Architecture Models for Interconnect-Motivated System Design," IEEE Trans. on VLSI Systems, vol. 8, no. 6, pp. 660-670, 2000.
[19] V. Tiwari, S. Malik, and A. Wolfe, "Compilation Techniques for Low Energy: An Overview," in Proc. of the IEEE Intl. Symp. on Low Power Electron., pp. 38-39, 1994.
[20] V. Tiwari, S. Malik, and A. Wolfe, "Compilation Techniques for Low Energy: An Overview," in Proc. of the IEEE Intl. Symp. on Low Power Electron., pp. 38-39, Oct. 1994.
[21] ARM 926EJ-S data sheet, http://www.arm.com/products/processors/classic/arm9/arm926.php.
[22] ARM 1020E data sheet, http://www.hotchips.org/archives/hc13/2_Mon/02arm.pdf
[23] Xilinx Virtex-4 FPGA XC4VLX60 data sheet, http://www.alldatasheet.net/datasheet-pdf/pdf/152986/XILINX/XC4VLX60.html
2009 :
.
2009 :
.
:
SoC,
,
,
Email: dowonbest@naver.com
1998 :
.
2000 :
.
2006 :
2005 - 2007
:
2007 -
:
:
,
,
SoC
,
Email: cheolhong@gmail.com
1995 :
2000 : University of Florida ECE
2005 : Georgia Institute of Technology
ECE
2005 - 2007 :
2007 -
:
:
,
SoC,
,
Email: jongmyon.kim@gmail.com