Homework #3

#Example 1
#Input:
First String    Second      1.22      3.4
Second          More Text   1.555555  2.2220
Third           x           3         124
#Find:\s{2,}
#Replace:,
#Reason: to find consecutive single spaces that are 2 or more characters and 
replace with , to match the .csv file. Needed this way so the single space
between "First string is not detected
#Result:
First String,Second,1.22,3.4
Second,More Text,1.555555,2.2220
Third,x,3,124

#Example 2
#Input:
Ballif, Bryan, University of Vermont
Ellison, Aaron, Harvard Forest
Record, Sydne, Bryn Mawr
#Find: (\w*),\s(\w*),\s(.*)
#Replace:\2 \1 \(\3\)
#Reason: (\w*) capture the first word, and followed by \s(\w*) finds the 
first word comma and second word. \s(.*) will indicate all the rest 
(university name). On replace the order matches the desired organization 
bringing the second capture to be first and the first capture to second 
position while the third is added parentheses.
#Result:
Bryan Ballif (University of Vermont)
Aaron Ellison (Harvard Forest)
Sydne Record (Bryn Mawr)

#Example 3
#Input:
0001 Georgia Horseshoe.mp3 0002 Billy In The Lowground.mp3 0003 Winder Slide.mp3 0004 Walking Cane.mp3
#Find:.mp3\s
#Replace:.mp3\n
#Reason:Indicated to find mp3 and space (.mp3\s) since that is the end of 
each line. In turn indicated to the place a line break at this point so each 
new line start after .mp3
#Result:
0001 Georgia Horseshoe.mp3
0002 Billy In The Lowground.mp3
0003 Winder Slide.mp3
0004 Walking Cane.mp3

#Example 4
#Input: 
0001 Georgia Horseshoe.mp3
0002 Billy In The Lowground.mp3
0003 Winder Slide.mp3
0004 Walking Cane.mp3
#Find:(\d{4}) (\w*) (\w+)
#Replace: (\2) (\3) (_\1)(\4)
#Reason: with (\d{4}) I`m marking the 4 digit sequence as the first item of capture. Followed byt the first word (\w*) and following characters. Changed the order and added _. 
#Result: 
Georgia Horseshoe _0001.mp3
Billy In _The Lowground _0002.mp3
Winder Slide _0003.mp3
Walking Cane _0004.mp3


#Example 5
#Input:
Camponotus,pennsylvanicus,10.2,44
Camponotus,herculeanus,10.5,3
Myrmica,punctiventris,12.2,4
Lasius,neoniger,3.3,55
#Find: (\w)(\w+),(\w+),(\d{1,}).(\d),(\d{1,})
#Replace:\1_\3,6
#Reason:Using the first part (\w)(\w+), captures genus with (\w) marking the first letter that will be position 1 followed by "_" and the rest of the genus which is position 2 will be cut out during replace. Between commas (\w+) in position 3 will be maintained. After comma, capture (\d{1,}) which indicate a numeric value with 1 or more number character in position 4. This value in position 4 will also be removed during replace. Followed by a dot, a single number character (\d) which represents position 5 will be cut.Followed by comma. After comma, capture (\d{1,}) which indicate a numeric value with 1 or more number character in position 6. Position 6 is maintained in the replace action with a comma before. 
#Result:
C_pennsylvanicus,44
C_herculeanus,3
M_punctiventris,4
L_neoniger,55

#Example 6
#Input:
Camponotus,pennsylvanicus,10.2,44
Camponotus,herculeanus,10.5,3
Myrmica,punctiventris,12.2,4
Lasius,neoniger,3.3,55
#Find:(\w)(\w+),(\w{4})(\w+),(\d{1,}).(\d),(\d{1,})
#Replace:\1_\3,\7
#Reason:Using the first part (\w)(\w+), captures genus with (\w) marking the first letter that will be position 1 followed by "_" and the rest of the genus which is position 2 will be cut out during replace. Between commas (\w{4}) and (\w+) will indicate respectivelly position 3  and 4. (\w{4}) in position 3 captures only the first 4 characters and by maintaining only this in the replace funtion and removing position 4. After comma, capture (\d{1,}) which indicate a numeric value with 1 or more number character in position 5. This value in position 5 will also be removed during replace action. Followed by a dot, a single number character (\d) which represents position 6 will be cut. Followed by comma. After comma, capture (\d{1,}) which indicate a numeric value with 1 or more number character in position 7. Position 7 is maintained in the replace action with a comma before.
#Result:
C_penn,44
C_herc,3
M_punc,4
L_neon,55

#Example 7
#Input:
Camponotus,pennsylvanicus,10.2,44
Camponotus,herculeanus,10.5,3
Myrmica,punctiventris,12.2,4
Lasius,neoniger,3.3,55
#Find:(\w{3})(\w+),(\w{3})(\w+),(\d{1,}).(\d),(\d{1,})
#Replace:\1\3, \7, \5.\6
#Reason:To capture only the first 3 letters of the genus used (\w{3}) which will be position 1 and (w+) the rest of the word until the first comma. Same done for species which first 3 letters will be identified as 3 for the replace action. Similarly to what was done in the other problems captured the numbers and used the comma and dot to organize. On replace selected only the capture position needed adding comma, space or dot when relevant. 
#Result:
Campen, 44, 10.2
Camher, 3, 10.5
Myrpun, 4, 12.2
Lasneo, 55, 3.3

#Example 8
1. Start fiding: [^a-zA-Z0-9/,\.\s] 
Replace: leave the space empty to cut all the characters that don`t match.

In this find ^ is used as a negation meaning that anything that is not lower-case letters (a-z), uppercase letters(A-Z), number from 0 to 9, comma (\,), dot (\.) and a single space, tab, or line break (\s) will be found. Which means that basically only the undesirable characters will be found ($,*,!, etc...) and replaced

2.In pathogen binary since thre format required is binary it should be represented as 0 or 1. Since 1 was represented in the data will presume that NA is wrongly in place of 0. 
Find: NA
Replace:0
Final result:
id,targetname,pathogenbinary,samplingdate,sitecode,fieldid,bombusspp,hostplant,beecaste,samplingevent,pathogenload
1,BQCV,1,9/9/2020,BOST,6,impatiens,solidago,worker ,4,2414168.805
3,BQCV,0,9/10/2020,MUDGE,18,impatiens,crown vetch,worker ,4,760793.2324
4,BQCV,0,9/10/2020,MUDGE,11,impatiens,crown vetch,worker,4,0
5,BQCV,0,9/9/2020,BOST,11,impatiens,solidago,male ,4,0
6,BQCV,0,9/10/2020,CIND,14,impatiens,knot weed,worker ,4,124236.7921
8,BQCV,0,9/10/2020,MUDGE,4,impatiens,crown vetch,worker,4,413814.5638
10,BQCV,1,9/10/2020,CIND,13,impatiens,red clover,worker,4,1028831.605
11,BQCV,0,7/7/2020,BOST,38,impatiens,red clover,worker,2,307696378.8
12,BQCV,0,9/10/2020,MUDGE,5,impatiens,crown vetch,worker,4,0
13,BQCV,1,9/9/2020,BOST,12,impatiens,solidago,male,4,311873.0526
15,BQCV,0,9/9/2020,BOST,18,impatiens,solidago,worker ,4,0
16,BQCV,0,9/9/2020,BOST,23,impatiens,solidago,male,4,1674951.391
17,BQCV,0,9/10/2020,CIND,20,impatiens,red clover,worker,4,3214026.976
18,BQCV,1,9/10/2020,CIND,11,impatiens,birdsfoot,worker ,4,995592.63
19,BQCV,0,9/10/2020,CIND,17,impatiens,red clover,worker,4,0
20,BQCV,1,9/10/2020,CIND,16,impatiens,red clover,worker,4,200804.8542
21,BQCV,1,9/9/2020,BOST,17,impatiens,solidago,worker ,4,228085.8104
22,BQCV,1,9/9/2020,BOST,7,impatiens,solidago,worker,4,7261151.315
23,BQCV,0,7/7/2020,COL,22,impatiens,red clover,worker ,2,817906.8276
24,BQCV,1,7/7/2020,BOST,45,impatiens,red clover,worker,2,858658.6884

Homework #3

Juliana Dau

2025-02-26