Public speaking course notes Read "Dynamo, Amazon’s Highly Available Key-value Store" Read "Bigtable, A Distributed Storage System for Structured Data" Read "Streaming Systems" 3, Watermarks Read "Streaming Systems" 1&2, Streaming 101 Read "F1, a distributed SQL database that scales" Read "Zanzibar, Google’s Consistent, Global Authorization System" Read "Spanner, Google's Globally-Distributed Database" Read "Designing Data-intensive applications" 12, The Future of Data Systems IOS development with Swift Read "Designing Data-intensive applications" 10&11, Batch and Stream Processing Read "Designing Data-intensive applications" 9, Consistency and Consensus Read "Designing Data-intensive applications" 8, Distributed System Troubles Read "Designing Data-intensive applications" 7, Transactions Read "Designing Data-intensive applications" 6, Partitioning Read "Designing Data-intensive applications" 5, Replication Read "Designing Data-intensive applications" 3&4, Storage, Retrieval, Encoding Read "Designing Data-intensive applications" 1&2, Foundation of Data Systems Three cases of binary search TAMU Operating System 2 Memory Management TAMU Operating System 1 Introduction Overview in cloud computing 2 TAMU Operating System 7 Virtualization TAMU Operating System 6 File System TAMU Operating System 5 I/O and Disk Management TAMU Operating System 4 Synchronization TAMU Operating System 3 Concurrency and Threading TAMU Computer Networks 5 Data Link Layer TAMU Computer Networks 4 Network Layer TAMU Computer Networks 3 Transport Layer TAMU Computer Networks 2 Application Layer TAMU Computer Networks 1 Introduction Overview in distributed systems and cloud computing 1 A well-optimized Union-Find implementation, in Java A heap implementation supporting deletion TAMU Advanced Algorithms 3, Maximum Bandwidth Path (Dijkstra, MST, Linear) TAMU Advanced Algorithms 2, B+ tree and Segment Intersection TAMU Advanced Algorithms 1, BST, 2-3 Tree and Heap TAMU AI, Searching problems Factorization Machine and Field-aware Factorization Machine for CTR prediction TAMU Neural Network 10 Information-Theoretic Models TAMU Neural Network 9 Principal Component Analysis TAMU Neural Network 8 Neurodynamics TAMU Neural Network 7 Self-Organizing Maps TAMU Neural Network 6 Deep Learning Overview TAMU Neural Network 5 Radial-Basis Function Networks TAMU Neural Network 4 Multi-Layer Perceptrons TAMU Neural Network 3 Single-Layer Perceptrons Princeton Algorithms P1W6 Hash Tables & Symbol Table Applications Stanford ML 11 Application Example Photo OCR Stanford ML 10 Large Scale Machine Learning Stanford ML 9 Anomaly Detection and Recommender Systems Stanford ML 8 Clustering & Principal Component Analysis Princeton Algorithms P1W5 Balanced Search Trees TAMU Neural Network 2 Learning Processes TAMU Neural Network 1 Introduction Stanford ML 7 Support Vector Machine Stanford ML 6 Evaluate Algorithms Princeton Algorithms P1W4 Priority Queues and Symbol Tables Stanford ML 5 Neural Networks Learning Princeton Algorithms P1W3 Mergesort and Quicksort Stanford ML 4 Neural Networks Basics Princeton Algorithms P1W2 Stack and Queue, Basic Sorts Stanford ML 3 Classification Problems Stanford ML 2 Multivariate Regression and Normal Equation Princeton Algorithms P1W1 Union and Find Stanford ML 1 Introduction and Parameter Learning

Use Itertools for Better Iteration

2016-08-20

itertools

This module contains lots of function on sequence-like objects. It has been never easier for iteration.

Iterator is lazy, which will not be generated until called. This is more memory efficient.

import itertools as itls

Combine: chain()izip()

chain() takes n iterable objects and combine them together. Let me show you an example.

for i in itls.chain([1,2,3],['a','b','c']):
    print i,
1 2 3 a b c

izip() takes n iterable objects and combine them as tuples. Just like zip() but return iterator instead of list.

for i, j in itls.izip([1,2,3],['a','b','c']):
    print i, j
1 a
2 b
3 c

Slice: islice()

islice() takes a iterator and return its slice. Just like the slice operation on list. There are three params start, stop and step.

print "Stop at 5:"
for i in itls.islice(itls.count(),5):
    print i,
Stop at 5:
0 1 2 3 4
print "Start at 5, Stop at 10:"
for i in itls.islice(itls.count(),5,10):
    print i,
Start at 5, Stop at 10:
5 6 7 8 9
print "By tens to 100:"
for i in itls.islice(itls.count(),0,100,10):
    print i,
By tens to 100:
0 10 20 30 40 50 60 70 80 90

Duplicate: tee()

Just like the tee in Unix. tee() takes a iterator and return n same iterators.

r = itls.islice(itls.count(),4)
i1, i2, i3 = itls.tee(r,3) # i1 and i2, like a copy

for i, j, k in itls.izip(i1,i2,i3):
    print i, j, k
0 0 0
1 1 1
2 2 2
3 3 3

One thing to notice is you should be careful when using the original iterator. Check this example to see why.

r = itls.islice(itls.count(),4)
i1, i2 = itls.tee(r)
for i in r:
    print 'r:', i
    if i > 0:
        break
for i in i1:
    print 'i1:', i

for i in i2:
    print 'i2:', i
r: 0
r: 1
i1: 2
i1: 3
i2: 2
i2: 3

The original iterator consumed 0,1 and will not be generated in i1 and i2.

Map

imap() function transform iterator. Just like the built in map(). Let’s multiply xrange(5) by 2.

print "Doubles:"
for i in itls.imap(lambda x: 2*x, xrange(5)):
    print i,
Doubles:
0 2 4 6 8

imap() can take more than one iterator and map it.

print "Multiples:"
for i in itls.imap(lambda x,y:(x, y, x*y), xrange(5),xrange(5,10)):
    print '%d * %d = %d' % i
Multiples:
0 * 5 = 0
1 * 6 = 6
2 * 7 = 14
3 * 8 = 24
4 * 9 = 36

starmap() is kind of similar to imap(), but with small difference. starmap() can parse several params from a tuple, while imap() get several params from multiple iterators.

values = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9)]
for i in itls.starmap(lambda x,y:(x,y,x*y), values):
    print '%d * %d = %d' % i
0 * 5 = 0
1 * 6 = 6
2 * 7 = 14
3 * 8 = 24
4 * 9 = 36

Create new iterator

count()cycle() and repeat() for iterator generation.

count()

Continous integers, with lower bound 0 and no upper bound(upper bound with xrange()).

for i in itls.izip(itls.count(1),['a','b','c']):
    print i
(1, 'a')
(2, 'b')
(3, 'c')

cycle()

Cycle iterable unlimitted times.

i = 0
for item in itls.cycle(['a','b','c']):
    i += 1
    if i == 7:
        break
    print (i, item)
(1, 'a')
(2, 'b')
(3, 'c')
(4, 'a')
(5, 'b')
(6, 'c')

repeat()

Repeat n times.

for i in itls.repeat('over-and-over',3):
    print i
over-and-over
over-and-over
over-and-over

When we want to add a constant to a sequence, a repeat() and imap() combo is very powerful.

for i,s in itls.izip(itls.count(), itls.repeat('over-and-over',3)):
    print i, s
0 over-and-over
1 over-and-over
2 over-and-over
for i in itls.imap(lambda x,y:(x,y,x*y),itls.repeat(2),xrange(5)):
    print '%d * %d = %d' % i
2 * 0 = 0
2 * 1 = 2
2 * 2 = 4
2 * 3 = 6
2 * 4 = 8

Filter

Just like the built in filter() function.

dropwhile()

Test the item, if True, drop it and continue; if False, stop dropping and take this element and all the rest.

def should_drop(x):
    print 'Testing:', x
    return x < 1
for i in itls.dropwhile(should_drop,[ -1, 0, 1, 2, 3, 1, -2 ]):
    print 'Yielding:', i
Testing: -1
Testing: 0
Testing: 1
Yielding: 1
Yielding: 2
Yielding: 3
Yielding: 1
Yielding: -2

takewhile()

Different from dropwhile(). Test the item, if True, take it and continue; if False, stop and do not take this one and the rest.

def should_take(x):
    print 'Testing:', x
    return x < 2
for i in itls.takewhile(should_take,[ -1, 0, 1, 2, 3, 4, 1, -2 ]):
    print 'Yielding:', i
Testing: -1
Yielding: -1
Testing: 0
Yielding: 0
Testing: 1
Yielding: 1
Testing: 2

ifilter()

dropwhile() and takewhile() apply only on part of all elements. But ifilter() applies to all elements. ifilterfalse is the same but only take those with False returned.

def check_item(x):
    print 'Testing:', x
    return x < 1
for i in itls.ifilter(check_item, [ -1, 0, 1, 2, 3, -2 ]):
    print 'Yielding:', i
Testing: -1
Yielding: -1
Testing: 0
Yielding: 0
Testing: 1
Testing: 2
Testing: 3
Testing: -2
Yielding: -2

Group iterators

groupby(iterable[, keyfunc]) Create an iterator which returns(key, sub-iterator) grouped by each value of key(value)

Group base on key to sub-iterator.

things = [("animal", "bear"), ("animal", "duck"),
    ("plant", "cactus"), ("vehicle", "speed boat"), ("vehicle", "school bus")]

groupby() takes two params, one is the data to group and another is the function to group it with.

for key, group in itls.groupby(things, lambda x: x[0]):
    print key, group
animal <itertools._grouper object at 0x10bce2150>
plant <itertools._grouper object at 0x10bce2190>
vehicle <itertools._grouper object at 0x10bce2150>

As can seen from the result, three sub-iterator are returned and we can use another level loop on sub-iterator.

for key, group in itls.groupby(things, lambda x:x[0]):
    for thing in group:
        print "A %s is a %s." % (thing[1], key)
    print ""
A bear is a animal.
A duck is a animal.

A cactus is a plant.

A speed boat is a vehicle.
A school bus is a vehicle.

But one thing to notice is that before group, make sure iterable is sorted base on key. Because new group will be created if different key encountered.

things = [("animal", "bear"), ("plant", "cactus"), ("animal", "duck")]
for key, group in itls.groupby(things, lambda x: x[0]):
    print key, group
animal <itertools._grouper object at 0x10bce2410>
plant <itertools._grouper object at 0x10bce2490>
animal <itertools._grouper object at 0x10bce2410>

We expect two groups but got three because it is unsorted.

new_things = sorted(things,key=lambda x: x[0])
for key, group in itls.groupby(new_things, lambda x:x[0]):
    print key, group
animal <itertools._grouper object at 0x10bce2350>
plant <itertools._grouper object at 0x10bce2410>

Now it seems like right.


Creative Commons License
Melon blog is created by melonskin. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© 2016-2024. All rights reserved by melonskin. Powered by Jekyll.